
Module Contents



Get words, with frequencies, using ‘*’ as a wildcard.


Retrieve images from bokhylla


Get NER annotations for a text (urn) using a spacy model.


Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.


Show available SpaCy model names.


Look up placenames in a specific URN.


From a list of places, return their geolocations


Count occurrences of words in the given URN object.


Get metadata for a list of URNs.


Convert a list of identifiers, oaiid, sesamid, urns or isbn10 to dhlabids


Get the text in the document urn as frequencies of chunks of the given chunk_size.


Fetch chunks and their frequencies from paragraphs in a document (urn).


Count and aggregate occurrences of topic wordbags for each document in a list of urns.


Reference frequency list of the n most frequent words from a given corpus in a given period.


Return a list of URNs from a collection of docids.


Count occurrences of one or more words over a time period.


Collect reference data for a list of words over a time period.


Count occurrences of one or more words in books over a given time period.


Get a time series of frequency counts for word in periodicals.


Get a time series of frequency counts for word in newspapers.


Create a sparse matrix from an API counts object


Fetch frequency counts of words in documents (urns).


Fetch frequency numbers for words in documents (urns).


Fetch frequency counts of documents as URNs or DH-lab ids.



Fetch a corpus based on metadata.


Create a collocation from a list of URNs.


Get aggregated raw frequencies of all words in the National Library’s database.


Get a list of concordances from the National Library’s database.


Count concordances (keyword in context) for a corpus query (used for collocation analysis).


Get a list of concordances from the National Library’s database.


Make a collocation from a corpus query.


Find alternative form for a given word form.


Find paradigms for a given word form.


Find alternative forms for a list of words.


Look up the morphological feature specification of a word form.


Look up the morphological feature specifications for word forms in a wordlist.


Find the list of possible lemmas for a given word form.


Find lemmas for a list of given word forms.


Fetch data from imagination corpus



Get words, with frequencies, using ‘*’ as a wildcard.

For example, searching “orden” might return:

                 freq
    ordbogen      874
    ordboken    10604
    ...
    ordningen  368131
    ordnmgen      722
    ...

  • word – Word to search, allowing (potentially multiple) ‘*’ as a wildcard

  • factor – Max length of matched words, as a factor of word

  • freq_limit – Lower frequency limit of returned matched words

  • limit – Max number of returned results, prioritized by frequency
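The way these parameters interact can be illustrated locally. This is a minimal sketch and not the server's implementation: the helper name `local_wildcard_search`, the pattern "ord*en", the entry "ordet" and the reading of factor as a multiplier on the length of the search string are all assumptions. The other frequencies are taken from the example output above.

```python
from fnmatch import fnmatchcase


def local_wildcard_search(freqs, word, factor, freq_limit, limit):
    """Filter a word -> frequency mapping the way the parameters are described."""
    max_len = factor * len(word)  # assumed reading of `factor`
    hits = [
        (w, f)
        for w, f in freqs.items()
        if fnmatchcase(w, word) and len(w) <= max_len and f >= freq_limit
    ]
    # Most frequent first, then keep at most `limit` results.
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:limit]


SAMPLE = {
    "ordbogen": 874,
    "ordboken": 10604,
    "ordningen": 368131,
    "ordnmgen": 722,
    "ordet": 5,  # made-up entry that does not match the pattern
}
matches = local_wildcard_search(SAMPLE, "ord*en", factor=2, freq_limit=800, limit=3)
```

With freq_limit=800 the entry "ordnmgen" (722) is dropped, and the remaining matches come back ordered by frequency.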

dhlab.api.dhlab_api.images(text: str | None = None, part: int | None = True, hits: int | None = 500, delta: int | None = 0)

Retrieve images from bokhylla

  • text – Fulltext query expression for sqlite.

  • part – If a number, the whole page is shown. If True, get auto-scaled image.

  • delta – If part==True, show delta additional pixels on each side of image

  • hits – Number of images


Requests res.json() return value

dhlab.api.dhlab_api.ner_from_urn(urn: str | None = None, model: str | None = None, start_page: int = 0, to_page: int = 0) pandas.DataFrame

Get NER annotations for a text (urn) using a spacy model.

  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • model (str) – name of a spacy model. Check which models are available with show_spacy_models()


Dataframe with annotations and their frequencies

dhlab.api.dhlab_api.pos_from_urn(urn: str | None = None, model: str | None = None, start_page: int = 0, to_page: int = 0) pandas.DataFrame

Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.

  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • model (str) – name of a spacy model. Check which models are available with show_spacy_models()

  • start_page (int)

  • to_page (int)


Dataframe with annotations and their frequencies

dhlab.api.dhlab_api.show_spacy_models() List

Show available SpaCy model names.

dhlab.api.dhlab_api.get_places(urn: str) pandas.DataFrame

Look up placenames in a specific URN.

Call the API endpoint /places under dhlab.constants.BASE_URL.


urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

dhlab.api.dhlab_api.geo_lookup(places: List, feature_class: str | None = None, feature_code: str | None = None, field: str = 'alternatename') pandas.DataFrame

From a list of places, return their geolocations

  • places (list) – a list of place names - max 1000

  • feature_class (str) – which GeoNames feature class to return. Example: P

  • feature_code (str) – which GeoNames feature code to return. Example: PPL

  • field (str) – which name field to match - default “alternatename”.
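Because places is limited to 1000 names, a longer list has to be split before the lookup. The following is one possible sketch of that: the helper names `batches` and `lookup_in_batches` are hypothetical, and `len` stands in for the real lookup function so that the example runs offline.

```python
def batches(items, size=1000):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_in_batches(places, lookup, size=1000):
    """Call `lookup` once per batch and collect the results in a list."""
    return [lookup(batch) for batch in batches(places, size)]


# Stand-in for geo_lookup: `len` only reports how many names each call received.
sizes = lookup_in_batches([f"place{i}" for i in range(2500)], lookup=len)
```

In real use, `lookup` would be geo_lookup (wrapped with functools.partial if feature_class or feature_code should be set), and the returned DataFrames could be combined with pandas.concat.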

dhlab.api.dhlab_api.get_dispersion(urn: str | None = None, words: List | None = None, window: int = 300, pr: int = 100) pandas.Series

Count occurrences of words in the given URN object.

Call the API endpoint /dispersion under dhlab.constants.BASE_URL.

  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • words (list) – list of words. Defaults to a list of punctuation marks.

  • window (int) – The number of tokens to search through per row. Defaults to 300.

  • pr (int) – defaults to 100.


a pandas.Series with frequency counts of the words in the URN object.

dhlab.api.dhlab_api.get_metadata(urns: List[str] | None = None) pandas.DataFrame

Get metadata for a list of URNs.

Calls the API endpoint /get_metadata under dhlab.constants.BASE_URL.


urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

dhlab.api.dhlab_api.get_identifiers(identifiers: list | None = None) list

Convert a list of identifiers, oaiid, sesamid, urns or isbn10 to dhlabids

dhlab.api.dhlab_api.get_chunks(urn: str | None = None, chunk_size: int = 300) Union[Dict, List]

Get the text in the document urn as frequencies of chunks of the given chunk_size.

Calls the API endpoint /chunks under dhlab.constants.BASE_URL.

  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • chunk_size (int) – Number of tokens to include in each chunk.


list of dicts with the resulting chunk frequencies, or an empty dict
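To make the idea of chunk frequencies concrete, here is a small local sketch. It assumes that each chunk is represented as a dict mapping tokens to counts, which is inferred from the return description rather than confirmed, and it tokenizes on whitespace for simplicity. The name `chunk_frequencies` is hypothetical.

```python
from collections import Counter


def chunk_frequencies(tokens, chunk_size):
    """Split `tokens` into consecutive chunks and count the tokens in each."""
    return [
        dict(Counter(tokens[start:start + chunk_size]))
        for start in range(0, len(tokens), chunk_size)
    ]


tokens = "det var en gang en konge og en dronning".split()
chunks = chunk_frequencies(tokens, chunk_size=4)
```

The nine tokens give three chunks; the last one is shorter than chunk_size.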

dhlab.api.dhlab_api.get_chunks_para(urn: str | None = None) Union[Dict, List]

Fetch chunks and their frequencies from paragraphs in a document (urn).

Calls the API endpoint /chunks_para under dhlab.constants.BASE_URL.


urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001


list of dicts with the resulting chunk frequencies, or an empty dict

dhlab.api.dhlab_api.evaluate_documents(wordbags: Dict | None = None, urns: List[str] | None = None) pandas.DataFrame

Count and aggregate occurrences of topic wordbags for each document in a list of urns.

  • wordbags (dict) – a dictionary of topic keywords and lists of associated words. Example: {"natur": ["planter", "skog", "fjell", "fjord"], ... }

  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]


a pandas.DataFrame with the topics as columns, indexed by the dhlabids of the documents.
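The aggregation can be pictured with a local sketch. This is an illustration under assumptions, not the API's implementation: the function name `score_wordbags`, the document ids and the sample texts are made up, and each topic score is taken to be the summed count of the topic's words.

```python
from collections import Counter


def score_wordbags(documents, wordbags):
    """Return {doc_id: {topic: summed count of the topic's words}}."""
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)  # missing words count as 0
        scores[doc_id] = {
            topic: sum(counts[word] for word in words)
            for topic, words in wordbags.items()
        }
    return scores


wordbags = {"natur": ["planter", "skog", "fjell", "fjord"]}
documents = {
    100000001: "skog og fjell og skog".split(),  # made-up id and text
    100000002: "by og gate".split(),             # made-up id and text
}
scores = score_wordbags(documents, wordbags)
```

The nested dict has the same shape as the described result: one row per document id and one column per topic.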

dhlab.api.dhlab_api.get_reference(corpus: str = 'digavis', from_year: int = 1950, to_year: int = 1955, lang: str = 'nob', limit: int = 100000) pandas.DataFrame

Reference frequency list of the n most frequent words from a given corpus in a given period.

Call the API endpoint /reference_corpus under dhlab.constants.BASE_URL.

  • corpus (str) – Document type to include in the corpus, can be either 'digibok' or 'digavis'.

  • from_year (int) – Starting point for time period of the corpus.

  • to_year (int) – Last year of the time period of the corpus.

  • lang (str) – Language of the corpus, can be one of 'nob', 'nno', 'sme', 'sma', 'smj', 'fkv'

  • limit (int) – Maximum number of most frequent words.


A pandas.DataFrame with the results.

dhlab.api.dhlab_api.find_urns(docids: Union[Dict, pandas.DataFrame] | None = None, mode: str = 'json') pandas.DataFrame

Return a list of URNs from a collection of docids.

Call the API endpoint /find_urn under dhlab.constants.BASE_URL.

  • docids – dictionary of document IDs ({docid: URN}) or a pandas.DataFrame.

  • mode (str) – Default ‘json’.


the URNs that were found, in a pandas.DataFrame.

dhlab.api.dhlab_api._ngram_doc(doctype: str | None = None, word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None, publisher: str | None = None, lang: str | None = None, city: str | None = None, ddk: str | None = None, topic: str | None = None) pandas.DataFrame

Count occurrences of one or more words over a time period.

The type of document to search through is decided by the doctype. Filter the selection of documents with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.

  • doctype – API endpoint for the document type to get ngrams for. Can be 'book', 'periodicals', or 'newspapers'.

  • word – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title – Title of a specific document to search through.

  • period – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher – Name of a publisher.

  • lang – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city – City of publication.

  • ddk – Dewey Decimal Classification identifier.

  • topic – Topic of the documents.


a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.

dhlab.api.dhlab_api.reference_words(words: List | None = None, doctype: str = 'digibok', from_year: Union[str, int] = 1800, to_year: Union[str, int] = 2000) pandas.DataFrame

Collect reference data for a list of words over a time period.

Reference data are the absolute and relative frequencies of the words across all documents of the given doctype in the given time period (from_year - to_year).

  • words (list) – list of word strings.

  • doctype (str) –

    type of reference document. Can be "digibok" or "digavis". Defaults to "digibok".

    Note: If any other string is given as the doctype, the resulting data is equivalent to what you get with doctype="digavis".

  • from_year (int) – first year of publication

  • to_year (int) – last year of publication


a DataFrame with the words’ frequency data

dhlab.api.dhlab_api.ngram_book(word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None, publisher: str | None = None, lang: str | None = None, city: str | None = None, ddk: str | None = None, topic: str | None = None) pandas.DataFrame

Count occurrences of one or more words in books over a given time period.

Call the API endpoint /ngram_book under dhlab.constants.BASE_URL.

Filter the selection of books with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.

  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification identifier.

  • topic (str) – Topic of the documents.


a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
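The two accepted shapes of word and period can be checked before a request is made. The sketch below is only an illustration: `normalise_words` and `check_period` are hypothetical helpers, and whether the library performs similar checks itself is not stated here.

```python
def normalise_words(word):
    """Return the comma-separated string form of one or more words."""
    if isinstance(word, str):
        return word
    return ",".join(word)


def check_period(period):
    """Validate a (start, end) tuple given as YYYY or YYYYMMDD integers."""
    start, end = period
    lengths = {len(str(start)), len(str(end))}
    # Both values must use the same format: 4 digits or 8 digits.
    if lengths not in ({4}, {8}):
        raise ValueError("period must be (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD)")
    if start > end:
        raise ValueError("period start must not be after period end")
    return start, end


words = normalise_words(["ord", "ordene", "orda"])
period = check_period((1950, 1960))
```

The resulting values could then be passed as the word and period arguments of ngram_book.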

dhlab.api.dhlab_api.ngram_periodicals(word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None, publisher: str | None = None, lang: str | None = None, city: str | None = None, ddk: str | None = None, topic: str | None = None, **kwargs) pandas.DataFrame

Get a time series of frequency counts for word in periodicals.

Call the API endpoint /ngram_periodicals under dhlab.constants.BASE_URL.

  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification identifier.

  • topic (str) – Topic of the documents.


a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.

dhlab.api.dhlab_api.ngram_news(word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None) pandas.DataFrame

Get a time series of frequency counts for word in newspapers.

Call the API endpoint /ngram_newspapers under dhlab.constants.BASE_URL.

  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific newspaper to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).


a pandas.DataFrame with the resulting frequency counts of the word(s), spread across the dates given in the time period. Either one year or one day per row.


Create a sparse matrix from an API counts object

dhlab.api.dhlab_api.get_document_frequencies(urns: List[str] | None = None, cutoff: int = 0, words: List[str] | None = None, sparse: bool = False) pandas.DataFrame

Fetch frequency counts of words in documents (urns).

Call the API endpoint /frequencies under dhlab.constants.BASE_URL.

  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • cutoff (int) – minimum frequency of a word to be counted

  • words (list) – a list of words to be counted. If left as None, the whole document is returned. If not None, both the counts and their relative frequencies are returned.

  • sparse (bool) – create a sparse matrix for memory efficiency

dhlab.api.dhlab_api.get_word_frequencies(urns: List[str] | None = None, cutoff: int = 0, words: List[str] | None = None) pandas.DataFrame

Fetch frequency numbers for words in documents (urns).

Call the API endpoint /frequencies under dhlab.constants.BASE_URL.

  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • cutoff (int) – minimum frequency of a word to be counted

  • words (list) – a list of words to be counted - should not be left None.

dhlab.api.dhlab_api.get_urn_frequencies(urns: List[str] | None = None, dhlabid: List[int] | None = None) pandas.DataFrame

Fetch frequency counts of documents as URNs or DH-lab ids.

Call the API endpoint /frequencies under dhlab.constants.BASE_URL.

  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • dhlabid (list) – list of numbers for dhlabid: [1000001, 2000003]

dhlab.api.dhlab_api.document_corpus(doctype: str | None = None, author: str | None = None, freetext: str | None = None, fulltext: str | None = None, from_year: int | None = None, to_year: int | None = None, from_timestamp: int | None = None, to_timestamp: int | None = None, title: str | None = None, ddk: str | None = None, subject: str | None = None, publisher: str | None = None, literaryform: str | None = None, genres: str | None = None, city: str | None = None, lang: str | None = None, limit: int | None = None, order_by: str | None = None) pandas.DataFrame

Fetch a corpus based on metadata.

Call the API endpoint /build_corpus under dhlab.constants.BASE_URL.

  • doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"

  • author (str) – Name of an author.

  • freetext (str) – any of the parameters, for example: "digibok AND Ibsen".

  • fulltext (str) – words within the publication.

  • from_year (int) – Start year for time period of interest.

  • to_year (int) – End year for time period of interest.

  • from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • title (str) – Name or title of a document.

  • ddk (str) – Dewey Decimal Classification identifier.

  • subject (str) – subject (keywords) of the publication.

  • publisher (str) – Name of publisher.

  • literaryform (str) – literary form of the publication (books)

  • genres (str) – genre of the publication.

  • city (str) – place of publication

  • lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"

  • limit (int) – number of items to sample.

  • order_by (str) – order of elements in the corpus object. Typically used in combination with a limit. Example "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest method)


a pandas.DataFrame with the corpus information.
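Since every parameter defaults to None, a query is effectively the set of criteria that are actually filled in. A minimal sketch of assembling such a set follows; `corpus_query` and `year_to_timestamp` are hypothetical helpers, and the chosen values are only examples.

```python
def corpus_query(**criteria):
    """Keep only the criteria that were actually set (not None)."""
    return {key: value for key, value in criteria.items() if value is not None}


def year_to_timestamp(year):
    """Format a year as the YYYY0101 timestamp that books are said to carry."""
    return int(f"{year:04d}0101")


query = corpus_query(
    doctype="digibok",
    author="Ibsen",  # example name borrowed from the freetext example
    from_timestamp=year_to_timestamp(1880),
    to_timestamp=year_to_timestamp(1900),
    title=None,      # unset criteria are dropped
    limit=10,
    order_by="random",
)
```

The resulting dict could be passed on as document_corpus(**query).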

dhlab.api.dhlab_api.urn_collocation(urns: List[str] | None = None, word: str = 'arbeid', before: int = 5, after: int = 0, samplesize: int = 200000) pandas.DataFrame

Create a collocation from a list of URNs.

Call the API endpoint /urncolldist_urn under dhlab.constants.BASE_URL.

  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • word (str) – word to construct collocation with.

  • before (int) – number of words preceding the given word.

  • after (int) – number of words following the given word.

  • samplesize (int) – total number of urns to search through.


a pandas.DataFrame with distance (sum of distances and bayesian distance) and frequency for words collocated with word.

dhlab.api.dhlab_api.totals(top_words: int = 50000) pandas.DataFrame

Get aggregated raw frequencies of all words in the National Library’s database.

Call the API endpoint /totals/{top_words} under dhlab.constants.BASE_URL.


top_words (int) – The number of words to get total frequencies for.


a pandas.DataFrame with the most frequent words.

dhlab.api.dhlab_api.concordance(urns: list | None = None, words: str | None = None, window: int = 25, limit: int = 100) pandas.DataFrame

Get a list of concordances from the National Library’s database.

Call the API endpoint /conc under dhlab.constants.BASE_URL.

  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • words (str) – Word(s) to search for. Can be an SQLite fulltext query, an fts5 string search expression.

  • window (int) – number of tokens on either side to show in the collocations, between 1-25.

  • limit (int) – max. number of concordances per document. Maximum value is 1000.


a table of concordances
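What a concordance (keyword in context) row contains can be shown with a local sketch. It matches exact tokens in a whitespace-tokenized text, whereas the real endpoint accepts fts5 search expressions; the function name `local_concordance`, the tuple layout and the sample sentence are assumptions.

```python
def local_concordance(tokens, word, window=25, limit=100):
    """Return up to `limit` (left, keyword, right) contexts for `word`."""
    if not 1 <= window <= 25:
        raise ValueError("window must be between 1 and 25")
    rows = []
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
            if len(rows) == limit:
                break
    return rows


tokens = "han fikk arbeid i byen og hun fikk arbeid på gården".split()
rows = local_concordance(tokens, "arbeid", window=2)
```

Each row holds the tokens before the keyword, the keyword itself, and the tokens after it.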



dhlab.api.dhlab_api.concordance_counts(urns: list | None = None, words: str | None = None, window: int = 25, limit: int = 100) pandas.DataFrame

Count concordances (keyword in context) for a corpus query (used for collocation analysis).

Call the API endpoint /conccount under dhlab.constants.BASE_URL.

  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • words (str) – Word(s) to search for. Can be an SQLite fulltext query, an fts5 string search expression.

  • window (int) – number of tokens on either side to show in the collocations, between 1-25.

  • limit (int) – max. number of concordances per document. Maximum value is 1000.


a table of counts

dhlab.api.dhlab_api.word_concordance(urn: list | None = None, dhlabid: list | None = None, words: list | None = None, before: int = 12, after: int = 12, limit: int = 100, samplesize: int = 50000) pandas.DataFrame

Get a list of concordances from the National Library’s database.

Call the API endpoint /conc under dhlab.constants.BASE_URL.

  • urn (list) – dhlab serial ids. (The server can take both URNs and dhlabids, so this may be rewritten.)

  • words (list) – Word(s) to search for; must be a list.

  • before (int) – between 0-24.

  • after (int) – between 0-24 (before + after <= 24)

  • limit (int) – max. number of concordances per server process.

  • samplesize (int) – samples from urns.


a table of concordances
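The ranges for before and after can be checked on the client side before a request is sent. This sketch assumes that the constraint in parentheses means before + after <= 24, which the defaults (12 and 12) satisfy; `check_context` is a hypothetical helper and not part of the library.

```python
def check_context(before, after):
    """Validate the before/after sizes against the documented ranges."""
    for name, value in (("before", before), ("after", after)):
        if not 0 <= value <= 24:
            raise ValueError(f"{name} must be between 0 and 24, got {value}")
    if before + after > 24:  # assumed meaning of the documented constraint
        raise ValueError("before + after must not exceed 24")
    return before, after


defaults_ok = check_context(12, 12)  # the defaults from the signature
```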

dhlab.api.dhlab_api.collocation(corpusquery: str = 'norge', word: str = 'arbeid', before: int = 5, after: int = 0) pandas.DataFrame

Make a collocation from a corpus query.

  • corpusquery (str) – query string

  • word (str) – target word for the collocations.

  • before (int) – number of words prior to word

  • after (int) – number of words following word


a dataframe with the resulting collocations
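The roles of before and after can be illustrated by counting collocates locally. This is a simplified sketch: it works on a whitespace-tokenized sample sentence, the name `local_collocation` is hypothetical, and it returns raw counts only, without any of the statistics the API may compute.

```python
from collections import Counter


def local_collocation(tokens, word, before=5, after=0):
    """Count tokens found within `before`/`after` positions of each `word`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        counts.update(tokens[max(0, i - before):i])   # left context
        counts.update(tokens[i + 1:i + 1 + after])    # right context
    return counts


tokens = "hardt arbeid gir resultater og godt arbeid gir glede".split()
counts = local_collocation(tokens, "arbeid", before=1, after=1)
```

With the default after=0, only the words preceding the target word are counted.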

dhlab.api.dhlab_api.word_variant(word: str, form: str, lang: str = 'nob') list

Find alternative form for a given word form.

Call the API endpoint /variant_form under dhlab.constants.BASE_URL.

Example: word_variant('spiste', 'pres-part')

  • word (str) – any word string

  • form (str) – a morphological feature tag from the Norwegian wordbank "Ordbanken".

  • lang (str) – either “nob” or “nno”

dhlab.api.dhlab_api.word_paradigm(word: str, lang: str = 'nob') list

Find paradigms for a given word form.

Call the API endpoint /paradigm under dhlab.constants.BASE_URL.


Example of a returned value:

    [['adj', ['spisende', 'spist', 'spiste']],
     ['verb', ['spis', 'spise', 'spiser', 'spises', 'spist', 'spiste']]]

  • word (str) – any word string

  • lang (str) – either “nob” or “nno”

dhlab.api.dhlab_api.word_paradigm_many(wordlist: list, lang: str = 'nob') list

Find alternative forms for a list of words.

  • wordlist – List of words

  • lang – Language

dhlab.api.dhlab_api.word_form(word: str, lang: str = 'nob') list

Look up the morphological feature specification of a word form.

  • word – Word

  • lang – Language

dhlab.api.dhlab_api.word_form_many(wordlist: list, lang: str = 'nob') list

Look up the morphological feature specifications for word forms in a wordlist.

  • wordlist – List of words

  • lang – Language

dhlab.api.dhlab_api.word_lemma(word: str, lang: str = 'nob') list

Find the list of possible lemmas for a given word form.

  • word – Word to find lemmas for

  • lang – Language

dhlab.api.dhlab_api.word_lemma_many(wordlist, lang='nob')

Find lemmas for a list of given word forms.

dhlab.api.dhlab_api.query_imagination_corpus(category=None, author=None, title=None, year=None, publisher=None, place=None, oversatt=None)

Fetch data from imagination corpus