dhlab.api.dhlab_api

Module Contents

Functions

wildcard_search

images

Retrieve images from Bokhylla.

ner_from_urn

Get NER annotations for a text (urn) using a spacy model.

pos_from_urn

Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.

show_spacy_models

Show available SpaCy model names.

get_places

Look up placenames in a specific URN.

geo_lookup

From a list of places, return their geolocations.

get_dispersion

Count occurrences of words in the given URN object.

get_metadata

Get metadata for a list of URNs.

get_identifiers

Convert a list of identifiers (oaiid, sesamid, urn, or isbn10) to dhlabids.

get_chunks

Get the text in the document urn as frequencies of chunks of the given chunk_size.

get_chunks_para

Fetch chunks and their frequencies from paragraphs in a document (urn).

evaluate_documents

Count and aggregate occurrences of topic wordbags for each document in a list of urns.

get_reference

Reference frequency list of the n most frequent words from a given corpus in a given period.

find_urns

Return a list of URNs from a collection of docids.

_ngram_doc

Count occurrences of one or more words over a time period.

reference_words

Collect reference data for a list of words over a time period.

ngram_book

Count occurrences of one or more words in books over a given time period.

ngram_periodicals

Get a time series of frequency counts for word in periodicals.

ngram_news

Get a time series of frequency counts for word in newspapers.

create_sparse_matrix

Create a sparse matrix from an API counts object.

get_document_frequencies

Fetch frequency counts of words in documents (urns).

get_word_frequencies

Fetch frequency numbers for words in documents (urns).

get_urn_frequencies

Fetch frequency counts of documents as URNs or DH-lab ids.

get_document_corpus

document_corpus

Fetch a corpus based on metadata.

urn_collocation

Create a collocation from a list of URNs.

totals

Get aggregated raw frequencies of all words in the National Library’s database.

concordance

Get a list of concordances from the National Library’s database.

concordance_counts

Count concordances (keyword in context) for a corpus query (used for collocation analysis).

konkordans

Wrapper for :func:concordance.

word_concordance

Get a list of concordances from the National Library’s database.

collocation

Make a collocation from a corpus query.

word_variant

Find alternative form for a given word form.

word_paradigm

Find paradigms for a given word form.

word_paradigm_many

Find alternative forms for a list of words.

word_form

Look up the morphological feature specification of a word form.

word_form_many

Look up the morphological feature specifications for word forms in a wordlist.

word_lemma

Find the list of possible lemmas for a given word form.

word_lemma_many

Find lemmas for a list of given word forms.

query_imagination_corpus

Fetch data from the imagination corpus.

API

dhlab.api.dhlab_api.images(text=None, part=True)

Retrieve images from Bokhylla.

Parameters:
  • text – fulltext query expression for sqlite

  • part – if a number, the whole page is shown … a bug prevents these from going through

  • delta – if part=True, show additional pixels around the image

  • hits – number of images

dhlab.api.dhlab_api.ner_from_urn(urn: str = None, model: str = None, start_page=0, to_page=0) pandas.DataFrame

Get NER annotations for a text (urn) using a spacy model.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • model (str) – name of a spacy model. Check which models are available with :func:show_spacy_models

Returns:

Dataframe with annotations and their frequencies

dhlab.api.dhlab_api.pos_from_urn(urn: str = None, model: str = None, start_page=0, to_page=0) pandas.DataFrame

Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • model (str) – name of a spacy model. Check which models are available with :func:show_spacy_models

Returns:

Dataframe with annotations and their frequencies

dhlab.api.dhlab_api.show_spacy_models() List

Show available SpaCy model names.
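
Example (a minimal sketch of typical usage; the model names are whatever the service returns, and picking models[0] assumes the list is non-empty):

from dhlab.api.dhlab_api import show_spacy_models, ner_from_urn

models = show_spacy_models()  # names of the models the service currently exposes
entities = ner_from_urn(
    urn="URN:NBN:no-nb_digibok_2011051112001",
    model=models[0],          # assumption: at least one model is available
    start_page=0,
    to_page=25,
)
print(entities.head())        # annotations with their frequencies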

dhlab.api.dhlab_api.get_places(urn: str) pandas.DataFrame

Look up placenames in a specific URN.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /places <https://api.nb.no/dhlab/#/default/post_places>_.

Parameters:

urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

dhlab.api.dhlab_api.geo_lookup(places: List, feature_class: str = None, feature_code: str = None, field: str = 'alternatename') pandas.DataFrame

From a list of places, return their geolocations

Parameters:
  • places (list) – a list of place names - max 1000

  • feature_class (str) – which GeoNames feature class to return. Example: P

  • feature_code (str) – which GeoNames feature code to return. Example: PPL

  • field (str) – which name field to match - default “alternatename”.
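
Example (a sketch that chains get_places with geo_lookup; taking the first column of the returned DataFrame as the placename column is an assumption about its layout):

from dhlab.api.dhlab_api import get_places, geo_lookup

places = get_places("URN:NBN:no-nb_digibok_2011051112001")
# Restrict the GeoNames match to populated places: feature class P, feature code PPL
locations = geo_lookup(
    places.iloc[:, 0].tolist()[:1000],  # assumption: first column holds the names; max 1000
    feature_class="P",
    feature_code="PPL",
)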

dhlab.api.dhlab_api.get_dispersion(urn: str = None, words: List = None, window: int = 300, pr: int = 100) pandas.Series

Count occurrences of words in the given URN object.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /dispersion.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • words (list) – list of words. Defaults to a list of punctuation marks.

  • window (int) – The number of tokens to search through per row. Defaults to 300.

  • pr (int) – defaults to 100.

Returns:

a pandas.Series with frequency counts of the words in the URN object.
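
Example (a minimal sketch; plotting assumes matplotlib is installed, since pandas delegates .plot() to it):

from dhlab.api.dhlab_api import get_dispersion

disp = get_dispersion(
    urn="URN:NBN:no-nb_digibok_2011051112001",
    words=["han", "hun"],
    window=1000,
    pr=100,
)
disp.plot()  # frequency of the words through the text, window by window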

dhlab.api.dhlab_api.get_metadata(urns: List[str] = None) pandas.DataFrame

Get metadata for a list of URNs.

Calls the API :py:obj:~dhlab.constants.BASE_URL endpoint /get_metadata <https://api.nb.no/dhlab/#/default/post_get_metadata>_.

Parameters:

urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

dhlab.api.dhlab_api.get_identifiers(identifiers: list = None) list

Convert a list of identifiers (oaiid, sesamid, urn, or isbn10) to dhlabids.

dhlab.api.dhlab_api.get_chunks(urn: str = None, chunk_size: int = 300) Union[Dict, List]

Get the text in the document urn as frequencies of chunks of the given chunk_size.

Calls the API :py:obj:~dhlab.constants.BASE_URL endpoint /chunks.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • chunk_size (int) – Number of tokens to include in each chunk.

Returns:

list of dicts with the resulting chunk frequencies, or an empty dict
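
Example (a minimal sketch; the isinstance check reflects the Union[Dict, List] return type, where an empty dict signals no result):

from dhlab.api.dhlab_api import get_chunks

chunks = get_chunks("URN:NBN:no-nb_digibok_2011051112001", chunk_size=1000)
if isinstance(chunks, list) and chunks:
    # each element is a token -> frequency mapping for one 1000-token chunk
    print(list(chunks[0].items())[:10])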

dhlab.api.dhlab_api.get_chunks_para(urn: str = None) Union[Dict, List]

Fetch chunks and their frequencies from paragraphs in a document (urn).

Calls the API :py:obj:~dhlab.constants.BASE_URL endpoint /chunks_para.

Parameters:

urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

Returns:

list of dicts with the resulting chunk frequencies, or an empty dict

dhlab.api.dhlab_api.evaluate_documents(wordbags: Dict = None, urns: List[str] = None) pandas.DataFrame

Count and aggregate occurrences of topic wordbags for each document in a list of urns.

Parameters:
  • wordbags (dict) – a dictionary of topic keywords and lists of associated words. Example: {"natur": ["planter", "skog", "fjell", "fjord"], ... }

  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

Returns:

a pandas.DataFrame with the topics as columns, indexed by the dhlabids of the documents.
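
Example (a sketch; the "natur" wordbag comes from the parameter description above, while the "by" wordbag is purely illustrative):

from dhlab.api.dhlab_api import evaluate_documents

scores = evaluate_documents(
    wordbags={
        "natur": ["planter", "skog", "fjell", "fjord"],
        "by": ["gate", "hus", "trafikk"],  # illustrative second topic
    },
    urns=["URN:NBN:no-nb_digibok_2008051404065",
          "URN:NBN:no-nb_digibok_2010092120011"],
)
print(scores)  # one column per topic, indexed by dhlabid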

dhlab.api.dhlab_api.get_reference(corpus: str = 'digavis', from_year: int = 1950, to_year: int = 1955, lang: str = 'nob', limit: int = 100000) pandas.DataFrame

Reference frequency list of the n most frequent words from a given corpus in a given period.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /reference_corpus <https://api.nb.no/dhlab/#/default/get_reference_corpus>_.

Parameters:
  • corpus (str) – Document type to include in the corpus, can be either 'digibok' or 'digavis'.

  • from_year (int) – Starting point for time period of the corpus.

  • to_year (int) – Last year of the time period of the corpus.

  • lang (str) – Language of the corpus, can be one of 'nob', 'nno', 'sme', 'sma', 'smj', 'fkv'

  • limit (int) – Maximum number of most frequent words.

Returns:

A pandas.DataFrame with the results.
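
Example (a minimal sketch using the documented parameter values):

from dhlab.api.dhlab_api import get_reference

ref = get_reference(
    corpus="digibok",
    from_year=1900,
    to_year=1950,
    lang="nob",
    limit=50000,
)
# ref can serve as a baseline when normalising frequencies from a smaller corpus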

dhlab.api.dhlab_api.find_urns(docids: Union[Dict, pandas.DataFrame] = None, mode: str = 'json') pandas.DataFrame

Return a list of URNs from a collection of docids.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /find_urn.

Parameters:
  • docids – dictionary of document IDs ({docid: URN}) or a pandas.DataFrame.

  • mode (str) – Default ‘json’.

Returns:

the URNs that were found, in a pandas.DataFrame.

dhlab.api.dhlab_api._ngram_doc(doctype: str = None, word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None) pandas.DataFrame

Count occurrences of one or more words over a time period.

The type of document to search through is decided by the doctype. Filter the selection of documents with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.

Parameters:
  • doctype (str) – API endpoint for the document type to get ngrams for. Can be 'book', 'periodicals', or 'newspapers'.

  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification identifier.

  • topic (str) – Topic of the documents.

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.

dhlab.api.dhlab_api.reference_words(words: List = None, doctype: str = 'digibok', from_year: Union[str, int] = 1800, to_year: Union[str, int] = 2000) pandas.DataFrame

Collect reference data for a list of words over a time period.

Reference data are the absolute and relative frequencies of the words across all documents of the given doctype in the given time period (from_year - to_year).

Parameters:
  • words (list) – list of word strings.

  • doctype (str) –

    type of reference document. Can be "digibok" or "digavis". Defaults to "digibok".

    Note: If any other string is given as the doctype, the resulting data is equivalent to what you get with doctype="digavis".

  • from_year (int) – first year of publication

  • to_year (int) – last year of publication

Returns:

a DataFrame with the words’ frequency data

dhlab.api.dhlab_api.ngram_book(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None) pandas.DataFrame

Count occurrences of one or more words in books over a given time period.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /ngram_book.

Filter the selection of books with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.

Parameters:
  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification <https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon>_ identifier.

  • topic (str) – Topic of the documents.

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
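
Example (a sketch; the ddk value uses the % wildcard mentioned above, and plotting assumes matplotlib is installed):

from dhlab.api.dhlab_api import ngram_book

freq = ngram_book(
    word="ord,ordene,orda",  # several words in one comma-separated string
    period=(1950, 2000),
    lang="nob",
    ddk="8%",                # any Dewey class starting with 8 (literature)
)
freq.plot()                  # one year per row, so years end up on the x-axis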

dhlab.api.dhlab_api.ngram_periodicals(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None, **kwargs) pandas.DataFrame

Get a time series of frequency counts for word in periodicals.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /ngram_periodicals.

Parameters:
  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification <https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon>_ identifier.

  • topic (str) – Topic of the documents.

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.

dhlab.api.dhlab_api.ngram_news(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None) pandas.DataFrame

Get a time series of frequency counts for word in newspapers.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /ngram_newspapers.

Parameters:
  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific newspaper to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across the dates given in the time period. Either one year or one day per row.

dhlab.api.dhlab_api.create_sparse_matrix(structure)

Create a sparse matrix from an API counts object.

dhlab.api.dhlab_api.get_document_frequencies(urns: List[str] = None, cutoff: int = 0, words: List[str] = None, sparse: bool = False) pandas.DataFrame

Fetch frequency counts of words in documents (urns).

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /frequencies.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • cutoff (int) – minimum frequency of a word to be counted

  • words (list) – a list of words to be counted - if left None, the whole document is returned. If not None, both the counts and their relative frequencies are returned.

  • sparse (bool) – create a sparse matrix for memory efficiency
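
Example (a minimal sketch using the sample URNs from the parameter description):

from dhlab.api.dhlab_api import get_document_frequencies

counts = get_document_frequencies(
    urns=["URN:NBN:no-nb_digibok_2008051404065",
          "URN:NBN:no-nb_digibok_2010092120011"],
    words=["og", "eller"],  # with a word list, relative frequencies are returned too
    cutoff=0,
)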

dhlab.api.dhlab_api.get_word_frequencies(urns: List[str] = None, cutoff: int = 0, words: List[str] = None) pandas.DataFrame

Fetch frequency numbers for words in documents (urns).

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /frequencies.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • cutoff (int) – minimum frequency of a word to be counted

  • words (list) – a list of words to be counted - should not be left None.

dhlab.api.dhlab_api.get_urn_frequencies(urns: List[str] = None, dhlabid: List = None) pandas.DataFrame

Fetch frequency counts of documents as URNs or DH-lab ids.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /frequencies.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • dhlabid (list) – list of numbers for dhlabid: [1000001, 2000003]

dhlab.api.dhlab_api.get_document_corpus(**kwargs)

dhlab.api.dhlab_api.document_corpus(doctype: str = None, author: str = None, freetext: str = None, fulltext: str = None, from_year: int = None, to_year: int = None, from_timestamp: int = None, to_timestamp: int = None, title: str = None, ddk: str = None, subject: str = None, publisher: str = None, literaryform: str = None, genres: str = None, city: str = None, lang: str = None, limit: int = None, order_by: str = None) pandas.DataFrame

Fetch a corpus based on metadata.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /build_corpus <https://api.nb.no/dhlab/#/default/post_build_corpus>_.

Parameters:
  • doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"

  • author (str) – Name of an author.

  • freetext (str) – a free-text query that may combine any of the parameters, for example: "digibok AND Ibsen".

  • fulltext (str) – words within the publication.

  • from_year (int) – Start year for time period of interest.

  • to_year (int) – End year for time period of interest.

  • from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • title (str) – Name or title of a document.

  • ddk (str) – Dewey Decimal Classification <https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon>_ identifier.

  • subject (str) – subject (keywords) of the publication.

  • publisher (str) – Name of publisher.

  • literaryform (str) – literary form of the publication (books)

  • genres (str) – genre of the publication.

  • city (str) – place of publication

  • lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"

  • limit (int) – number of items to sample.

  • order_by (str) – order of elements in the corpus object. Typically used in combination with a limit. Example "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest method)

Returns:

a pandas.DataFrame with the corpus information.
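
Example (a sketch of a corpus query; the author pattern uses the % metadata wildcard and all values are illustrative):

from dhlab.api.dhlab_api import document_corpus

corpus = document_corpus(
    doctype="digibok",
    author="Ibsen%",      # % acts as a wildcard in metadata fields
    from_year=1850,
    to_year=1900,
    lang="nob",
    limit=100,
    order_by="random",    # random sample; slower than "rank" or "first"
)
print(corpus.head())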

dhlab.api.dhlab_api.urn_collocation(urns: List = None, word: str = 'arbeid', before: int = 5, after: int = 0, samplesize: int = 200000) pandas.DataFrame

Create a collocation from a list of URNs.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /urncolldist_urn.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • word (str) – word to construct collocation with.

  • before (int) – number of words preceding the given word.

  • after (int) – number of words following the given word.

  • samplesize (int) – total number of urns to search through.

Returns:

a pandas.DataFrame with distance (sum of distances and Bayesian distance) and frequency for words collocated with word.
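
Example (a minimal sketch; the symmetric window of five words on each side is an arbitrary illustrative choice):

from dhlab.api.dhlab_api import urn_collocation

coll = urn_collocation(
    urns=["URN:NBN:no-nb_digibok_2008051404065"],
    word="arbeid",
    before=5,
    after=5,
)
print(coll.head())  # distance and frequency per collocate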

dhlab.api.dhlab_api.totals(top_words: int = 50000) pandas.DataFrame

Get aggregated raw frequencies of all words in the National Library’s database.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /totals/{top_words} <https://api.nb.no/dhlab/#/default/get_totals__top_words_>_.

Parameters:

top_words (int) – The number of words to get total frequencies for.

Returns:

a pandas.DataFrame with the most frequent words.
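
Example (a minimal sketch; dividing document counts by these totals is one common way to obtain relative frequencies):

from dhlab.api.dhlab_api import totals

tot = totals(top_words=1000)  # the 1000 most frequent words overall
print(tot.head())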

dhlab.api.dhlab_api.concordance(urns: list = None, words: str = None, window: int = 25, limit: int = 100) pandas.DataFrame

Get a list of concordances from the National Library’s database.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /conc <https://api.nb.no/dhlab/#/default/post_conc>_.

Parameters:
  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • words (str) – Word(s) to search for. Can be an SQLite fulltext query, i.e. an fts5 string search expression.

  • window (int) – number of tokens of context on either side of the keyword, between 1 and 25.

  • limit (int) – max. number of concordances per document. Maximum value is 1000.

Returns:

a table of concordances
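
Example (a sketch; the query string is an illustrative fts5 expression, per the words parameter above):

from dhlab.api.dhlab_api import concordance

conc = concordance(
    urns=["URN:NBN:no-nb_digibok_2008051404065"],
    words='"sett fra" OR skumring',  # illustrative fts5 query: a phrase OR a single word
    window=20,
    limit=50,
)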

dhlab.api.dhlab_api.concordance_counts(urns: list = None, words: str = None, window: int = 25, limit: int = 100) pandas.DataFrame

Count concordances (keyword in context) for a corpus query (used for collocation analysis).

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /conccount <https://api.nb.no/dhlab/#/default/post_conccount>_.

Parameters:
  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • words (str) – Word(s) to search for. Can be an SQLite fulltext query, i.e. an fts5 string search expression.

  • window (int) – number of tokens of context on either side of the keyword, between 1 and 25.

  • limit (int) – max. number of concordances per document. Maximum value is 1000.

Returns:

a table of counts

dhlab.api.dhlab_api.konkordans(urns: list = None, words: str = None, window: int = 25, limit: int = 100)

Wrapper for :func:concordance.

dhlab.api.dhlab_api.word_concordance(urn: list = None, dhlabid: list = None, words: list = None, before: int = 12, after: int = 12, limit: int = 100, samplesize: int = 50000) pandas.DataFrame

Get a list of concordances from the National Library’s database.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /conc <https://api.nb.no/dhlab/#/default/conc_word_urn>_.

Parameters:
  • urn (list) – list of URNs.

  • dhlabid (list) – list of dhlab serial ids (the server accepts both URNs and dhlabids).

  • words (list) – Word(s) to search for – must be a list.

  • before (int) – number of context words before the keyword, between 0 and 24.

  • after (int) – number of context words after the keyword, between 0 and 24 (before + after <= 24).

  • limit (int) – max. number of concordances per server process.

  • samplesize (int) – number of documents to sample from the urns.

Returns:

a table of concordances

dhlab.api.dhlab_api.collocation(corpusquery: str = 'norge', word: str = 'arbeid', before: int = 5, after: int = 0) pandas.DataFrame

Make a collocation from a corpus query.

Parameters:
  • corpusquery (str) – query string

  • word (str) – target word for the collocations.

  • before (int) – number of words prior to word

  • after (int) – number of words following word

Returns:

a dataframe with the resulting collocations

dhlab.api.dhlab_api.word_variant(word: str, form: str, lang: str = 'nob') list

Find alternative form for a given word form.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /variant_form

Example: word_variant('spiste', 'pres-part')

Parameters:
  • word (str) – any word string

  • form (str) – a morphological feature tag from the Norwegian word bank "Ordbanken" <https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-5/>_.

  • lang (str) – either “nob” or “nno”

dhlab.api.dhlab_api.word_paradigm(word: str, lang: str = 'nob') list

Find paradigms for a given word form.

Call the API :py:obj:~dhlab.constants.BASE_URL endpoint /paradigm

Example:

word_paradigm('spiste')
# [['adj', ['spisende', 'spist', 'spiste']],
#  ['verb', ['spis', 'spise', 'spiser', 'spises', 'spist', 'spiste']]]

Parameters:
  • word (str) – any word string

  • lang (str) – either “nob” or “nno”

dhlab.api.dhlab_api.word_paradigm_many(wordlist: list, lang: str = 'nob') list

Find alternative forms for a list of words.

dhlab.api.dhlab_api.word_form(word: str, lang: str = 'nob') list

Look up the morphological feature specification of a word form.

dhlab.api.dhlab_api.word_form_many(wordlist: list, lang: str = 'nob') list

Look up the morphological feature specifications for word forms in a wordlist.

dhlab.api.dhlab_api.word_lemma(word: str, lang: str = 'nob') list

Find the list of possible lemmas for a given word form.
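
Example (a minimal sketch; the exact shape of the returned lists is not specified here):

from dhlab.api.dhlab_api import word_lemma, word_lemma_many

word_lemma("spiste")                   # lemma candidates for one word form
word_lemma_many(["spiste", "ordene"])  # the same lookup for a wordlist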

dhlab.api.dhlab_api.word_lemma_many(wordlist, lang='nob')

Find lemmas for a list of given word forms.

dhlab.api.dhlab_api.query_imagination_corpus(category=None, author=None, title=None, year=None, publisher=None, place=None, oversatt=None)

Fetch data from the imagination corpus.