dhlab.api.dhlab_api¶
Module Contents¶
Functions¶
- images – Retrieve images from bokhylla.
- ner_from_urn – Get NER annotations for a text (urn) using a spaCy model.
- pos_from_urn – Get part of speech tags and dependency parse annotations for a text (urn) with a spaCy model.
- show_spacy_models – Show available spaCy model names.
- get_places – Look up placenames in a specific URN.
- geo_lookup – From a list of places, return their geolocations.
- get_dispersion – Count occurrences of words in the given URN object.
- get_metadata – Get metadata for a list of URNs.
- get_identifiers – Convert a list of identifiers (OAI-ids, sesam-ids, URNs or ISBN10s) to dhlabids.
- get_chunks – Get the text in a document as frequencies of chunks of a given size.
- get_chunks_para – Fetch chunks and their frequencies from paragraphs in a document (urn).
- evaluate_documents – Count and aggregate occurrences of topic wordbags for each document in a list of urns.
- get_reference – Reference frequency list of the n most frequent words from a given corpus in a given period.
- find_urns – Return a list of URNs from a collection of docids.
- _ngram_doc – Count occurrences of one or more words over a time period.
- reference_words – Collect reference data for a list of words over a time period.
- ngram_book – Count occurrences of one or more words in books over a given time period.
- ngram_periodicals – Get a time series of frequency counts for a word in periodicals.
- ngram_news – Get a time series of frequency counts for a word in newspapers.
- create_sparse_matrix – Create a sparse matrix from an API counts object.
- get_document_frequencies – Fetch frequency counts of words in documents (urns).
- get_word_frequencies – Fetch frequency numbers for words in documents (urns).
- get_urn_frequencies – Fetch frequency counts of documents as URNs or DH-lab ids.
- document_corpus – Fetch a corpus based on metadata.
- urn_collocation – Create a collocation from a list of URNs.
- totals – Get aggregated raw frequencies of all words in the National Library’s database.
- concordance – Get a list of concordances from the National Library’s database.
- concordance_counts – Count concordances (keyword in context) for a corpus query (used for collocation analysis).
- konkordans – Wrapper for concordance.
- word_concordance – Get a list of concordances from the National Library’s database.
- collocation – Make a collocation from a corpus query.
- word_variant – Find alternative forms for a given word form.
- word_paradigm – Find paradigms for a given word form.
- word_paradigm_many – Find alternative forms for a list of words.
- word_form – Look up the morphological feature specification of a word form.
- word_form_many – Look up the morphological feature specifications for word forms in a wordlist.
- word_lemma – Find the list of possible lemmas for a given word form.
- word_lemma_many – Find lemmas for a list of given word forms.
- query_imagination_corpus – Fetch data from the imagination corpus.
API¶
- dhlab.api.dhlab_api.wildcard_search(word, factor=2, freq_limit=10, limit=50)¶
- dhlab.api.dhlab_api.images(text=None, part=True)¶
Retrieve images from bokhylla.
- Parameters:
text – fulltext query expression for sqlite
part – if set to a number, the whole page is shown (a known bug prevents these from going through)
delta – if part=True, show additional pixels around the image
hits – number of images
- dhlab.api.dhlab_api.ner_from_urn(urn: str = None, model: str = None, start_page=0, to_page=0) pandas.DataFrame ¶
Get NER annotations for a text (urn) using a spaCy model.
- Parameters:
urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001
model (str) – name of a spaCy model. Check which models are available with show_spacy_models.
- Returns:
Dataframe with annotations and their frequencies
- dhlab.api.dhlab_api.pos_from_urn(urn: str = None, model: str = None, start_page=0, to_page=0) pandas.DataFrame ¶
Get part of speech tags and dependency parse annotations for a text (urn) with a spaCy model.
- Parameters:
urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001
model (str) – name of a spaCy model. Check which models are available with show_spacy_models.
- Returns:
Dataframe with annotations and their frequencies
- dhlab.api.dhlab_api.show_spacy_models() List ¶
Show available SpaCy model names.
- dhlab.api.dhlab_api.get_places(urn: str) pandas.DataFrame ¶
Look up placenames in a specific URN.
Call the API endpoint /places (https://api.nb.no/dhlab/#/default/post_places) under ~dhlab.constants.BASE_URL.
- Parameters:
urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001
- dhlab.api.dhlab_api.geo_lookup(places: List, feature_class: str = None, feature_code: str = None, field: str = 'alternatename') pandas.DataFrame ¶
From a list of places, return their geolocations
- Parameters:
places (list) – a list of place names - max 1000
feature_class (str) – which GeoNames feature class to return. Example:
P
feature_code (str) – which GeoNames feature code to return. Example:
PPL
field (str) – which name field to match - default “alternatename”.
- dhlab.api.dhlab_api.get_dispersion(urn: str = None, words: List = None, window: int = 300, pr: int = 100) pandas.Series ¶
Count occurrences of words in the given URN object.
Call the API endpoint /dispersion under ~dhlab.constants.BASE_URL.
- Parameters:
urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001
words (list) – list of words. Defaults to a list of punctuation marks.
window (int) – The number of tokens to search through per row. Defaults to 300.
pr (int) – defaults to 100.
- Returns:
a pandas.Series with frequency counts of the words in the URN object.
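A dispersion series can be read as counts of the words per window through the text; smoothing it makes the distribution easier to see. A minimal pandas sketch (the counts below are invented for illustration):

```python
import pandas as pd

# Invented counts of a word per text window, shaped like a dispersion result.
dispersion = pd.Series([0, 2, 5, 1, 0, 0, 3, 4], name="fjord")

# Smooth with a rolling mean to see where in the text the word clusters.
smoothed = dispersion.rolling(window=3, min_periods=1).mean()
print(smoothed.round(2).tolist())
```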
- dhlab.api.dhlab_api.get_metadata(urns: List[str] = None) pandas.DataFrame ¶
Get metadata for a list of URNs.
Calls the API endpoint /get_metadata (https://api.nb.no/dhlab/#/default/post_get_metadata) under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
- dhlab.api.dhlab_api.get_identifiers(identifiers: list = None) list ¶
Convert a list of identifiers (OAI-ids, sesam-ids, URNs or ISBN10s) to dhlabids.
- dhlab.api.dhlab_api.get_chunks(urn: str = None, chunk_size: int = 300) Union[Dict, List] ¶
Get the text in the document urn as frequencies of chunks of the given chunk_size.
Calls the API endpoint /chunks under ~dhlab.constants.BASE_URL.
- Parameters:
urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001
chunk_size (int) – Number of tokens to include in each chunk.
- Returns:
list of dicts with the resulting chunk frequencies, or an empty dict
- dhlab.api.dhlab_api.get_chunks_para(urn: str = None) Union[Dict, List] ¶
Fetch chunks and their frequencies from paragraphs in a document (urn).
Calls the API endpoint /chunks_para under ~dhlab.constants.BASE_URL.
- Parameters:
urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001
- Returns:
list of dicts with the resulting chunk frequencies, or an empty dict
- dhlab.api.dhlab_api.evaluate_documents(wordbags: Dict = None, urns: List[str] = None) pandas.DataFrame ¶
Count and aggregate occurrences of topic wordbags for each document in a list of urns.
- Parameters:
wordbags (dict) – a dictionary of topic keywords and lists of associated words. Example: {"natur": ["planter", "skog", "fjell", "fjord"], ... }
urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
- Returns:
a pandas.DataFrame with the topics as columns, indexed by the dhlabids of the documents.
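The aggregation performed server-side can be sketched locally: for each document, sum the frequencies of every word in each topic's wordbag. A minimal illustration with pandas (the dhlabids and counts below are invented, not real data):

```python
import pandas as pd

# Hypothetical per-document word counts (rows: words, columns: dhlabids),
# mimicking the shape of a document-frequency table.
counts = pd.DataFrame(
    {100441949: {"skog": 4, "fjell": 2, "båt": 1},
     100441950: {"skog": 1, "fjord": 3, "båt": 5}}
).fillna(0)

wordbags = {"natur": ["planter", "skog", "fjell", "fjord"],
            "sjø": ["båt", "fjord"]}

# Sum the counts of each topic's words per document; words missing
# from the counts table simply contribute nothing.
topics = pd.DataFrame(
    {topic: counts.reindex(words).sum() for topic, words in wordbags.items()}
)
print(topics)
```

The result has one row per document and one column per topic, matching the documented return shape.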
- dhlab.api.dhlab_api.get_reference(corpus: str = 'digavis', from_year: int = 1950, to_year: int = 1955, lang: str = 'nob', limit: int = 100000) pandas.DataFrame ¶
Reference frequency list of the n most frequent words from a given corpus in a given period.
Call the API endpoint /reference_corpus (https://api.nb.no/dhlab/#/default/get_reference_corpus) under ~dhlab.constants.BASE_URL.
- Parameters:
corpus (str) – Document type to include in the corpus, can be either 'digibok' or 'digavis'.
from_year (int) – Starting point for time period of the corpus.
to_year (int) – Last year of the time period of the corpus.
lang (str) – Language of the corpus, can be one of 'nob', 'nno', 'sme', 'sma', 'smj', 'fkv'.
limit (int) – Maximum number of most frequent words.
- Returns:
A pandas.DataFrame with the results.
- dhlab.api.dhlab_api.find_urns(docids: Union[Dict, pandas.DataFrame] = None, mode: str = 'json') pandas.DataFrame ¶
Return a list of URNs from a collection of docids.
Call the API endpoint /find_urn under ~dhlab.constants.BASE_URL.
- Parameters:
docids – dictionary of document IDs ({docid: URN}) or a pandas.DataFrame.
mode (str) – Default 'json'.
- Returns:
the URNs that were found, in a pandas.DataFrame.
- dhlab.api.dhlab_api._ngram_doc(doctype: str = None, word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None) pandas.DataFrame ¶
Count occurrences of one or more words over a time period.
The type of document to search through is decided by the doctype. Filter the selection of documents with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.
- Parameters:
doctype – API endpoint for the document type to get ngrams for. Can be 'book', 'periodicals', or 'newspapers'.
word – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title – Title of a specific document to search through.
period – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
publisher – Name of a publisher.
lang – Language as a 3-letter ISO code (e.g. "nob" or "nno").
city – City of publication.
ddk – Dewey Decimal Classification identifier.
topic – Topic of the documents.
- Returns:
a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
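When word contains several comma-separated forms, the result has one column per form; summing across columns gives a combined series for all the forms. A small pandas sketch with invented counts:

```python
import pandas as pd

# Invented yearly counts shaped like an ngram result:
# one year per row, one column per word form.
ngram = pd.DataFrame(
    {"ord": [10, 12], "ordene": [3, 1], "orda": [2, 4]},
    index=[1950, 1951],
)

# Combine the inflected forms into a single frequency series.
combined = ngram.sum(axis=1)
print(combined)  # 1950 -> 15, 1951 -> 17
```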
- dhlab.api.dhlab_api.reference_words(words: List = None, doctype: str = 'digibok', from_year: Union[str, int] = 1800, to_year: Union[str, int] = 2000) pandas.DataFrame ¶
Collect reference data for a list of words over a time period.
Reference data are the absolute and relative frequencies of the words across all documents of the given doctype in the given time period (from_year - to_year).
- Parameters:
words (list) – list of word strings.
doctype (str) – type of reference document. Can be "digibok" or "digavis". Defaults to "digibok". Note: if any other string is given as the doctype, the resulting data is equivalent to what you get with doctype="digavis".
from_year (int) – first year of publication
to_year (int) – last year of publication
- Returns:
a DataFrame with the words’ frequency data
- dhlab.api.dhlab_api.ngram_book(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None) pandas.DataFrame ¶
Count occurrences of one or more words in books over a given time period.
Call the API endpoint /ngram_book under ~dhlab.constants.BASE_URL.
Filter the selection of books with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.
- Parameters:
word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title (str) – Title of a specific document to search through.
period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
publisher (str) – Name of a publisher.
lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno").
city (str) – City of publication.
ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.
topic (str) – Topic of the documents.
- Returns:
a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
- dhlab.api.dhlab_api.ngram_periodicals(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None, **kwargs) pandas.DataFrame ¶
Get a time series of frequency counts for word in periodicals.
Call the API endpoint /ngram_periodicals under ~dhlab.constants.BASE_URL.
- Parameters:
word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title (str) – Title of a specific document to search through.
period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
publisher (str) – Name of a publisher.
lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno").
city (str) – City of publication.
ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.
topic (str) – Topic of the documents.
- Returns:
a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
- dhlab.api.dhlab_api.ngram_news(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None) pandas.DataFrame ¶
Get a time series of frequency counts for word in newspapers.
Call the API endpoint /ngram_newspapers under ~dhlab.constants.BASE_URL.
- Parameters:
word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title (str) – Title of a specific newspaper to search through.
period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
- Returns:
a pandas.DataFrame with the resulting frequency counts of the word(s), spread across the dates given in the time period. Either one year or one day per row.
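Raw ngram counts are usually normalized against yearly corpus totals before comparing across years. A minimal pandas sketch with invented numbers (in practice the totals could come from a reference list such as the one get_reference returns):

```python
import pandas as pd

# Invented absolute counts for one word, one year per row.
counts = pd.Series({1950: 120, 1951: 90, 1952: 150})

# Invented total number of tokens in the corpus per year.
totals = pd.Series({1950: 1_200_000, 1951: 900_000, 1952: 1_000_000})

# Relative frequency per year, expressed as counts per million tokens.
rel = counts / totals * 1_000_000
print(rel)  # 1950 -> 100.0, 1951 -> 100.0, 1952 -> 150.0
```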
- dhlab.api.dhlab_api.create_sparse_matrix(structure)¶
Create a sparse matrix from an API counts object
- dhlab.api.dhlab_api.get_document_frequencies(urns: List[str] = None, cutoff: int = 0, words: List[str] = None, sparse: bool = False) pandas.DataFrame ¶
Fetch frequency counts of words in documents (urns).
Call the API endpoint /frequencies under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
cutoff (int) – minimum frequency of a word to be counted
words (list) – a list of words to be counted. If None, the whole document is returned; if not None, both the counts and their relative frequencies are returned.
sparse (bool) – create a sparse matrix for memory efficiency
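The sparse option matters when the term-document matrix is large and mostly zeros. The effect can be illustrated with plain pandas (the words and counts below are invented):

```python
import pandas as pd

# Invented dense term-document counts: rows are words, columns are documents.
dense = pd.DataFrame(
    [[4, 0, 0], [0, 0, 1], [0, 2, 0]],
    index=["og", "fjord", "båt"],
    columns=["doc1", "doc2", "doc3"],
)

# Store only the non-zero entries to save memory.
sparse = dense.astype(pd.SparseDtype("int64", 0))
print(sparse.sparse.density)  # 3 non-zero cells out of 9
```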
- dhlab.api.dhlab_api.get_word_frequencies(urns: List[str] = None, cutoff: int = 0, words: List[str] = None) pandas.DataFrame ¶
Fetch frequency numbers for words in documents (urns).
Call the API endpoint /frequencies under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
cutoff (int) – minimum frequency of a word to be counted
words (list) – a list of words to be counted - must not be None.
- dhlab.api.dhlab_api.get_urn_frequencies(urns: List[str] = None, dhlabid: List = None) pandas.DataFrame ¶
Fetch frequency counts of documents as URNs or DH-lab ids.
Call the API endpoint /frequencies under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – list of uniform resource name strings, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
dhlabid (list) – list of numbers for dhlabid:
[1000001, 2000003]
- dhlab.api.dhlab_api.get_document_corpus(**kwargs)¶
- dhlab.api.dhlab_api.document_corpus(doctype: str = None, author: str = None, freetext: str = None, fulltext: str = None, from_year: int = None, to_year: int = None, from_timestamp: int = None, to_timestamp: int = None, title: str = None, ddk: str = None, subject: str = None, publisher: str = None, literaryform: str = None, genres: str = None, city: str = None, lang: str = None, limit: int = None, order_by: str = None) pandas.DataFrame ¶
Fetch a corpus based on metadata.
Call the API endpoint /build_corpus (https://api.nb.no/dhlab/#/default/post_build_corpus) under ~dhlab.constants.BASE_URL.
- Parameters:
doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"
author (str) – Name of an author.
freetext (str) – any of the parameters, for example: "digibok AND Ibsen".
fulltext (str) – words within the publication.
from_year (int) – Start year for time period of interest.
to_year (int) – End year for time period of interest.
from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD; books have YYYY0101.
to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD; books have YYYY0101.
title (str) – Name or title of a document.
ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.
subject (str) – subject (keywords) of the publication.
publisher (str) – Name of publisher.
literaryform (str) – literary form of the publication (books).
genres (str) – genre of the publication.
city (str) – place of publication.
lang (str) – Language of the publication, as a 3-letter ISO code, e.g. "nob" or "nno".
limit (int) – number of items to sample.
order_by (str) – order of elements in the corpus object. Typically used in combination with a limit. Example: "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest method).
- Returns:
a
pandas.DataFrame
with the corpus information.
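All parameters default to None, so a query effectively consists of only the filters that are set. That filtering pattern can be sketched in plain Python (the parameter dict below is illustrative and not the function's actual internals):

```python
# Illustrative metadata filters; unset parameters stay None.
params = {
    "doctype": "digibok",
    "author": "Ibsen%",   # % acts as a wildcard in metadata fields
    "from_year": 1850,
    "to_year": 1900,
    "lang": None,
    "limit": 10,
}

# Keep only the filters that were actually set.
query = {key: value for key, value in params.items() if value is not None}
print(sorted(query))  # ['author', 'doctype', 'from_year', 'limit', 'to_year']
```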
- dhlab.api.dhlab_api.urn_collocation(urns: List = None, word: str = 'arbeid', before: int = 5, after: int = 0, samplesize: int = 200000) pandas.DataFrame ¶
Create a collocation from a list of URNs.
Call the API endpoint /urncolldist_urn under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
word (str) – word to construct collocation with.
before (int) – number of words preceding the given word.
after (int) – number of words following the given word.
samplesize (int) – total number of urns to search through.
- Returns:
a pandas.DataFrame with distance (sum of distances and bayesian distance) and frequency for words collocated with word.
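Collocation counts are typically compared against a reference frequency list to find words over-represented near the target word. A minimal sketch of that comparison with pandas (all numbers are invented; in practice the reference could come from get_reference or totals):

```python
import pandas as pd

# Invented collocation counts near a target word.
coll = pd.Series({"lønn": 30, "og": 500, "fagforening": 12})

# Invented reference frequencies for the whole corpus.
ref = pd.Series({"lønn": 2_000, "og": 3_000_000, "fagforening": 300})

# Relative frequency in the collocation window vs. in the reference:
# a ratio well above 1 marks a word as over-represented near the target.
relevance = (coll / coll.sum()) / (ref / ref.sum())
print(relevance.sort_values(ascending=False))
```

High-frequency function words like "og" score near or below 1 and drop out, while topical words rise to the top.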
- dhlab.api.dhlab_api.totals(top_words: int = 50000) pandas.DataFrame ¶
Get aggregated raw frequencies of all words in the National Library’s database.
Call the API endpoint /totals/{top_words} (https://api.nb.no/dhlab/#/default/get_totals__top_words_) under ~dhlab.constants.BASE_URL.
- Parameters:
top_words (int) – The number of words to get total frequencies for.
- Returns:
a pandas.DataFrame with the most frequent words.
- dhlab.api.dhlab_api.concordance(urns: list = None, words: str = None, window: int = 25, limit: int = 100) pandas.DataFrame ¶
Get a list of concordances from the National Library’s database.
Call the API endpoint /conc (https://api.nb.no/dhlab/#/default/post_conc) under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
words (str) – Word(s) to search for. Can be an SQLite fulltext query (an FTS5 search expression).
window (int) – number of tokens on either side to show in the concordances, between 1-25.
limit (int) – max. number of concordances per document. Maximum value is 1000.
- Returns:
a table of concordances
- dhlab.api.dhlab_api.concordance_counts(urns: list = None, words: str = None, window: int = 25, limit: int = 100) pandas.DataFrame ¶
Count concordances (keyword in context) for a corpus query (used for collocation analysis).
Call the API endpoint /conccount (https://api.nb.no/dhlab/#/default/post_conccount) under ~dhlab.constants.BASE_URL.
- Parameters:
urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
words (str) – Word(s) to search for. Can be an SQLite fulltext query (an FTS5 search expression).
window (int) – number of tokens on either side to show in the concordances, between 1-25.
limit (int) – max. number of concordances per document. Maximum value is 1000.
- Returns:
a table of counts
- dhlab.api.dhlab_api.konkordans(urns: list = None, words: str = None, window: int = 25, limit: int = 100)¶
Wrapper for concordance.
- dhlab.api.dhlab_api.word_concordance(urn: list = None, dhlabid: list = None, words: list = None, before: int = 12, after: int = 12, limit: int = 100, samplesize: int = 50000) pandas.DataFrame ¶
Get a list of concordances from the National Library’s database.
Call the API endpoint /conc (https://api.nb.no/dhlab/#/default/conc_word_urn) under ~dhlab.constants.BASE_URL.
- Parameters:
urn (list) – dhlab serial ids (the server accepts both URNs and dhlabids, so this parameter may be reworked).
words (list) – Word(s) to search for – must be a list.
before (int) – between 0-24.
after (int) – between 0-24 (before + after <= 24).
limit (int) – max. number of concordances per server process.
samplesize (int) – samples from urns.
- Returns:
a table of concordances
- dhlab.api.dhlab_api.collocation(corpusquery: str = 'norge', word: str = 'arbeid', before: int = 5, after: int = 0) pandas.DataFrame ¶
Make a collocation from a corpus query.
- Parameters:
corpusquery (str) – query string
word (str) – target word for the collocations.
before (int) – number of words prior to word.
after (int) – number of words following word.
- Returns:
a dataframe with the resulting collocations
- dhlab.api.dhlab_api.word_variant(word: str, form: str, lang: str = 'nob') list ¶
Find an alternative form for a given word form.
Call the API endpoint /variant_form under ~dhlab.constants.BASE_URL.
Example: word_variant('spiste', 'pres-part')
- Parameters:
word (str) – any word string
form (str) – a morphological feature tag from the Norwegian wordbank "Ordbanken" (https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-5/).
lang (str) – either "nob" or "nno"
- dhlab.api.dhlab_api.word_paradigm(word: str, lang: str = 'nob') list ¶
Find paradigms for a given word form.
Call the API endpoint /paradigm under ~dhlab.constants.BASE_URL.
Example:
word_paradigm('spiste')
# [['adj', ['spisende', 'spist', 'spiste']],
#  ['verb', ['spis', 'spise', 'spiser', 'spises', 'spist', 'spiste']]]
- Parameters:
word (str) – any word string
lang (str) – either “nob” or “nno”
- dhlab.api.dhlab_api.word_paradigm_many(wordlist: list, lang: str = 'nob') list ¶
Find alternative forms for a list of words.
- dhlab.api.dhlab_api.word_form(word: str, lang: str = 'nob') list ¶
Look up the morphological feature specification of a word form.
- dhlab.api.dhlab_api.word_form_many(wordlist: list, lang: str = 'nob') list ¶
Look up the morphological feature specifications for word forms in a wordlist.
- dhlab.api.dhlab_api.word_lemma(word: str, lang: str = 'nob') list ¶
Find the list of possible lemmas for a given word form.
- dhlab.api.dhlab_api.word_lemma_many(wordlist, lang='nob')¶
Find lemmas for a list of given word forms.
- dhlab.api.dhlab_api.query_imagination_corpus(category=None, author=None, title=None, year=None, publisher=None, place=None, oversatt=None)¶
Fetch data from the imagination corpus.