Module Contents
wildcard_search | Get words, with frequencies, using ‘*’ as a wildcard. |
images | Retrieve images from bokhylla. |
ner_from_urn | Get NER annotations for a text (urn) using a spacy model. |
pos_from_urn | Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model. |
show_spacy_models | Show available SpaCy model names. |
get_places | Look up placenames in a specific URN. |
geo_lookup | From a list of places, return their geolocations. |
get_dispersion | Count occurrences of words in the given URN object. |
get_metadata | Get metadata for a list of URNs. |
get_identifiers | Convert a list of identifiers (oaiid, sesamid, urns or isbn10) to dhlabids. |
get_chunks | Get the text in the document urn as frequencies of chunks of the given chunk_size. |
get_chunks_para | Fetch chunks and their frequencies from paragraphs in a document (urn). |
evaluate_documents | Count and aggregate occurrences of topic wordbags for each document in a list of urns. |
get_reference | Reference frequency list of the n most frequent words from a given corpus in a given period. |
find_urns | Return a list of URNs from a collection of docids. |
_ngram_doc | Count occurrences of one or more words over a time period. |
reference_words | Collect reference data for a list of words over a time period. |
ngram_book | Count occurrences of one or more words in books over a given time period. |
ngram_periodicals | Get a time series of frequency counts for word in periodicals. |
ngram_news | Get a time series of frequency counts for word in newspapers. |
create_sparse_matrix | Create a sparse matrix from an API counts object. |
get_document_frequencies | Fetch frequency counts of words in documents (urns). |
get_word_frequencies | Fetch frequency numbers for words in documents (urns). |
get_urn_frequencies | Fetch frequency counts of documents as URNs or DH-lab ids. |
document_corpus | Fetch a corpus based on metadata. |
urn_collocation | Create a collocation from a list of URNs. |
totals | Get aggregated raw frequencies of all words in the National Library’s database. |
concordance | Get a list of concordances from the National Library’s database. |
concordance_counts | Count concordances (keyword in context) for a corpus query (used for collocation analysis). |
word_concordance | Get a list of concordances from the National Library’s database. |
collocation | Make a collocation from a corpus query. |
word_variant | Find alternative form for a given word form. |
word_paradigm | Find paradigms for a given word form. |
word_paradigm_many | Find alternative forms for a list of words. |
word_form | Look up the morphological feature specification of a word form. |
word_form_many | Look up the morphological feature specifications for word forms in a wordlist. |
word_lemma | Find the list of possible lemmas for a given word form. |
word_lemma_many | Find lemmas for a list of given word forms. |
query_imagination_corpus | Fetch data from imagination corpus. |
- dhlab.api.dhlab_api.wildcard_search(word: str, factor: int | None = 2, freq_limit: int | None = 10, limit: int | None = 50) -> pandas.DataFrame
Get words, with frequencies, using ‘*’ as a wildcard.
For example, searching “ord*en” might return:
             freq
ordbogen      874
ordboken    10604
...
ordningen  368131
ordnmgen      722
...
- Parameters:
word – Word to search, allowing (potentially multiple) ‘*’ as a wildcard
factor – Max length of matched words, as a factor of the length of word
freq_limit – Lower frequency limit of returned matched words
limit – Max number of returned results, prioritized by frequency
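To make the interplay of factor, freq_limit and limit concrete, here is a minimal local sketch. It does not call the API: the helper name local_wildcard_search, the toy frequency table and the length rule (factor times the length of the query) are assumptions made for illustration, not the server's confirmed behaviour.

```python
import re


def local_wildcard_search(word, freqs, factor=2, freq_limit=10, limit=50):
    """Filter a {word: frequency} dict with '*' as a wildcard (illustrative only)."""
    # Turn e.g. 'ord*en' into the regex 'ord.*en', escaping everything else.
    pattern = re.compile(".*".join(re.escape(part) for part in word.split("*")))
    max_len = factor * len(word)  # assumed reading of `factor`
    hits = [
        (w, f)
        for w, f in freqs.items()
        if pattern.fullmatch(w) and len(w) <= max_len and f >= freq_limit
    ]
    # Most frequent first, then cut to `limit` rows.
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:limit]


# Toy data: four counts from the example above plus two made-up entries.
toy = {
    "ordbogen": 874,
    "ordboken": 10604,
    "ordningen": 368131,
    "ordnmgen": 722,
    "orden": 5,
    "bordene": 900,
}
result = local_wildcard_search("ord*en", toy, limit=3)
```

In this sketch "orden" is dropped by freq_limit and "bordene" never matches, because the pattern must cover the whole word.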
- dhlab.api.dhlab_api.images(text: str | None = None, part: int | None = True, hits: int | None = 500, delta: int | None = 0)
Retrieve images from bokhylla.
- Parameters:
text – Fulltext query expression for sqlite.
part – If a number, the whole page is shown. If True, get auto-scaled image.
delta – If part == True, show delta additional pixels on each side of the image.
hits – Number of images.
- Returns:
return value
- dhlab.api.dhlab_api.ner_from_urn(urn: str | None = None, model: str | None = None, start_page: int = 0, to_page: int = 0) -> pandas.DataFrame
Get NER annotations for a text (urn) using a spacy model.
- Parameters:
urn (str) – uniform resource name.
model (str) – name of a spacy model. Check which models are available with show_spacy_models().
- Returns:
Dataframe with annotations and their frequencies
- dhlab.api.dhlab_api.pos_from_urn(urn: str | None = None, model: str | None = None, start_page: int = 0, to_page: int = 0) -> pandas.DataFrame
Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.
- Parameters:
urn (str) – uniform resource name.
model (str) – name of a spacy model. Check which models are available with show_spacy_models().
start_page (int)
to_page (int)
- Returns:
Dataframe with annotations and their frequencies
- dhlab.api.dhlab_api.show_spacy_models() -> List
Show available SpaCy model names.
- dhlab.api.dhlab_api.get_places(urn: str) -> pandas.DataFrame
Look up placenames in a specific URN.
Calls the API endpoint /places.
- Parameters:
urn (str) – uniform resource name.
- dhlab.api.dhlab_api.geo_lookup(places: List, feature_class: str | None = None, feature_code: str | None = None, field: str = 'alternatename') -> pandas.DataFrame
From a list of places, return their geolocations.
- Parameters:
places (list) – a list of place names - max 1000
feature_class (str) – which GeoNames feature class to return.
feature_code (str) – which GeoNames feature code to return.
field (str) – which name field to match - default “alternatename”.
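Because places is capped at 1000 names, longer lists have to be split before they are looked up. The sketch below shows one way to do the splitting; the helper name batched and the toy place names are hypothetical, and the actual lookup call is deliberately left out.

```python
def batched(places, size=1000):
    """Split a list of place names into consecutive batches of at most `size`."""
    return [places[i:i + size] for i in range(0, len(places), size)]


# Toy input: 2500 made-up place names.
places = [f"sted{i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in batched(places)]
```

Each batch could then be passed to geo_lookup in turn and the resulting frames concatenated.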
- dhlab.api.dhlab_api.get_dispersion(urn: str | None = None, words: List | None = None, window: int = 300, pr: int = 100) -> pandas.Series
Count occurrences of words in the given URN object.
Calls the DH-lab API.
- Parameters:
urn (str) – uniform resource name.
words (list) – list of words. Defaults to a list of punctuation marks.
window (int) – The number of tokens to search through per row. Defaults to 300.
pr (int) – defaults to 100.
- Returns:
A pandas.Series with frequency counts of the words in the URN object.
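The idea of counting words per window of tokens can be sketched locally. This is not the API's implementation: the helper name local_dispersion is hypothetical, and treating pr as the step between window starts is an assumption, since the parameter is not described above.

```python
from collections import Counter


def local_dispersion(tokens, words, window=300, pr=100):
    """Count `words` in successive token windows (illustrative, not the API)."""
    rows = []
    # Assumption: `pr` is the step between the starts of consecutive windows.
    for start in range(0, max(len(tokens) - window, 0) + 1, pr):
        counts = Counter(tokens[start:start + window])
        rows.append({w: counts[w] for w in words})
    return rows


# Toy text of 12 tokens, counted in non-overlapping windows of 4 tokens.
tokens = "a . b , c . d . e , f .".split()
rows = local_dispersion(tokens, [".", ","], window=4, pr=4)
```

Each dict in rows corresponds to one window; with pr smaller than window the windows would overlap.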
- dhlab.api.dhlab_api.get_metadata(urns: List[str] | None = None) -> pandas.DataFrame
Get metadata for a list of URNs.
Calls the API endpoint /get_metadata.
- Parameters:
urns (list) – list of uniform resource name strings, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
- dhlab.api.dhlab_api.get_identifiers(identifiers: list | None = None) -> list
Convert a list of identifiers (oaiid, sesamid, urns or isbn10) to dhlabids.
- dhlab.api.dhlab_api.get_chunks(urn: str | None = None, chunk_size: int = 300) -> Union[Dict, List]
Get the text in the document urn as frequencies of chunks of the given chunk_size.
Calls the DH-lab API.
- Parameters:
urn (str) – uniform resource name.
chunk_size (int) – Number of tokens to include in each chunk.
- Returns:
list of dicts with the resulting chunk frequencies, or an empty dict
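What "frequencies of chunks" means can be shown with a small local sketch. It assumes a chunk is simply a run of chunk_size consecutive tokens; the helper name chunk_frequencies and the toy text are hypothetical, and the server may tokenize and split differently.

```python
from collections import Counter


def chunk_frequencies(tokens, chunk_size=300):
    """Split tokens into consecutive chunks and count the words in each chunk."""
    return [
        dict(Counter(tokens[i:i + chunk_size]))
        for i in range(0, len(tokens), chunk_size)
    ]


# Toy text of six tokens, split into two chunks of three tokens each.
chunks = chunk_frequencies("en to en tre to en".split(), chunk_size=3)
```

The result has the same shape as the documented return value: one dict of word counts per chunk.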
- dhlab.api.dhlab_api.get_chunks_para(urn: str | None = None) -> Union[Dict, List]
Fetch chunks and their frequencies from paragraphs in a document (urn).
Calls the DH-lab API.
- Parameters:
urn (str) – uniform resource name.
- Returns:
list of dicts with the resulting chunk frequencies, or an empty dict
- dhlab.api.dhlab_api.evaluate_documents(wordbags: Dict | None = None, urns: List[str] | None = None) -> pandas.DataFrame
Count and aggregate occurrences of topic wordbags for each document in a list of urns.
- Parameters:
wordbags (dict) – a dictionary of topic keywords and lists of associated words. Example:
{"natur": ["planter", "skog", "fjell", "fjord"], ... }
urns (list) – uniform resource names, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
- Returns:
A pandas.DataFrame with the topics as columns, indexed by the dhlabids of the documents.
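The aggregation can be pictured as summing, per document, the counts of every word in a topic's word list. The sketch below works on local token lists rather than URNs; the helper name score_topics, the document ids and the toy texts are all made up for illustration.

```python
from collections import Counter


def score_topics(wordbags, documents):
    """Sum the counts of each topic's words per document (illustrative only)."""
    table = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        table[doc_id] = {
            topic: sum(counts[w] for w in words)
            for topic, words in wordbags.items()
        }
    return table


wordbags = {"natur": ["planter", "skog", "fjell", "fjord"]}
# Hypothetical document ids mapped to toy token lists.
documents = {
    100000001: "skog og fjell og skog".split(),
    100000002: "by og gate".split(),
}
scores = score_topics(wordbags, documents)
```

The nested dict mirrors the documented result: one row per document id and one column per topic.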
- dhlab.api.dhlab_api.get_reference(corpus: str = 'digavis', from_year: int = 1950, to_year: int = 1955, lang: str = 'nob', limit: int = 100000) -> pandas.DataFrame
Reference frequency list of the n most frequent words from a given corpus in a given period.
Calls the API endpoint /reference_corpus.
- Parameters:
corpus (str) – Document type to include in the corpus, for example 'digavis' (the default).
from_year (int) – Starting point for time period of the corpus.
to_year (int) – Last year of the time period of the corpus.
lang (str) – Language of the corpus, can be one of
'nob', 'nno', 'sme', 'sma', 'smj', 'fkv'
limit (int) – Maximum number of most frequent words.
- Returns:
A pandas.DataFrame with the results.
- dhlab.api.dhlab_api.find_urns(docids: Union[Dict, pandas.DataFrame] | None = None, mode: str = 'json') -> pandas.DataFrame
Return a list of URNs from a collection of docids.
Calls the DH-lab API.
- Parameters:
docids – dictionary of document IDs ({docid: URN}) or a pandas.DataFrame.
mode (str) – Default ‘json’.
- Returns:
the URNs that were found, in a pandas.DataFrame
- dhlab.api.dhlab_api._ngram_doc(doctype: str | None = None, word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None, publisher: str | None = None, lang: str | None = None, city: str | None = None, ddk: str | None = None, topic: str | None = None) -> pandas.DataFrame
Count occurrences of one or more words over a time period.
The type of document to search through is decided by the doctype. Filter the selection of documents with metadata. Use % as wildcard where appropriate - no wildcards in word.
- Parameters:
doctype – API endpoint for the document type to get ngrams for. Can be one of several endpoints, including 'newspapers'.
word – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title – Title of a specific document to search through.
period – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
publisher – Name of a publisher.
lang – Language as a 3-letter ISO code (e.g. "nno").
city – City of publication.
ddk – Dewey Decimal Classification identifier.
topic – Topic of the documents.
- Returns:
A pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
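The two accepted shapes of word (a list or a comma-separated string) and of period (years or full dates) can be checked before a request is made. The helpers below, normalize_words and period_kind, are hypothetical names for a client-side sketch and are not part of the module.

```python
def normalize_words(word):
    """Accept a list of words or a comma-separated string; return a clean list."""
    parts = word.split(",") if isinstance(word, str) else list(word)
    return [p.strip() for p in parts if p.strip()]


def period_kind(period):
    """Classify a (start, end) tuple as 'year' (YYYY) or 'date' (YYYYMMDD)."""
    lengths = {len(str(value)) for value in period}
    if lengths == {4}:
        return "year"
    if lengths == {8}:
        return "date"
    raise ValueError("period must be (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD)")


words = normalize_words("ord, ordene,orda")
kind = period_kind((1950, 1955))
```

A mixed tuple such as (1950, 19551231) raises ValueError in this sketch, which catches a common mistake early.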
- dhlab.api.dhlab_api.reference_words(words: List | None = None, doctype: str = 'digibok', from_year: Union[str, int] = 1800, to_year: Union[str, int] = 2000) -> pandas.DataFrame
Collect reference data for a list of words over a time period.
Reference data are the absolute and relative frequencies of the words across all documents of the given doctype in the given time period (from_year to to_year).
- Parameters:
words (list) – list of word strings.
doctype (str) – type of reference document. Can be "digibok" or "digavis". Defaults to "digibok".
Note: If any other string is given as the doctype, the resulting data is equivalent to what you get with doctype="digavis".
from_year (int) – first year of publication
to_year (int) – last year of publication
- Returns:
a DataFrame with the words’ frequency data
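As a reminder of how absolute and relative frequencies relate, here is a small local sketch. It assumes the relative frequency is the count divided by the total number of tokens in the reference material; the API may scale its figures differently, and the helper name reference_frequencies and the toy counts are hypothetical.

```python
def reference_frequencies(words, counts, total_tokens):
    """Return {word: (absolute, relative)} for the given words."""
    return {
        w: (counts.get(w, 0), counts.get(w, 0) / total_tokens)
        for w in words
    }


# Toy counts for a reference collection of 200 tokens.
ref = reference_frequencies(["og", "fjord"], {"og": 50, "i": 30}, total_tokens=200)
```

Words that never occur in the reference material get a count and a relative frequency of zero.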
- dhlab.api.dhlab_api.ngram_book(word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None, publisher: str | None = None, lang: str | None = None, city: str | None = None, ddk: str | None = None, topic: str | None = None) -> pandas.DataFrame
Count occurrences of one or more words in books over a given time period.
Calls the DH-lab API.
Filter the selection of books with metadata. Use % as wildcard where appropriate - no wildcards in word.
- Parameters:
word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title (str) – Title of a specific document to search through.
period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
publisher (str) – Name of a publisher.
lang (str) – Language as a 3-letter ISO code (e.g. "nno").
city (str) – City of publication.
ddk (str) – Dewey Decimal Classification identifier.
topic (str) – Topic of the documents.
- Returns:
A pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
- dhlab.api.dhlab_api.ngram_periodicals(word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None, publisher: str | None = None, lang: str | None = None, city: str | None = None, ddk: str | None = None, topic: str | None = None, **kwargs) -> pandas.DataFrame
Get a time series of frequency counts for word in periodicals.
Calls the DH-lab API.
- Parameters:
word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title (str) – Title of a specific document to search through.
period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
publisher (str) – Name of a publisher.
lang (str) – Language as a 3-letter ISO code (e.g. "nno").
city (str) – City of publication.
ddk (str) – Dewey Decimal Classification identifier.
topic (str) – Topic of the documents.
- Returns:
A pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
- dhlab.api.dhlab_api.ngram_news(word: Union[List, str] = ['.'], title: str | None = None, period: Tuple[int, int] | None = None) -> pandas.DataFrame
Get a time series of frequency counts for word in newspapers.
Calls the DH-lab API.
- Parameters:
word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".
title (str) – Title of a specific newspaper to search through.
period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).
- Returns:
A pandas.DataFrame with the resulting frequency counts of the word(s), spread across the dates given in the time period. Either one year or one day per row.
- dhlab.api.dhlab_api.create_sparse_matrix(structure)
Create a sparse matrix from an API counts object.
- dhlab.api.dhlab_api.get_document_frequencies(urns: List[str] | None = None, cutoff: int = 0, words: List[str] | None = None, sparse: bool = False) -> pandas.DataFrame
Fetch frequency counts of words in documents (urns).
Calls the DH-lab API.
- Parameters:
urns (list) – list of uniform resource name strings, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
cutoff (int) – minimum frequency of a word to be counted
words (list) – a list of words to be counted. If left as None, the whole document is returned. If not None, both the counts and their relative frequencies are returned.
sparse (bool) – create a sparse matrix for memory efficiency
- dhlab.api.dhlab_api.get_word_frequencies(urns: List[str] | None = None, cutoff: int = 0, words: List[str] | None = None) -> pandas.DataFrame
Fetch frequency numbers for words in documents (urns).
Calls the DH-lab API.
- Parameters:
urns (list) – list of uniform resource name strings, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
cutoff (int) – minimum frequency of a word to be counted
words (list) – a list of words to be counted. Should not be left as None.
- dhlab.api.dhlab_api.get_urn_frequencies(urns: List[str] | None = None, dhlabid: List[int] | None = None) -> pandas.DataFrame
Fetch frequency counts of documents as URNs or DH-lab ids.
Calls the DH-lab API.
- Parameters:
urns (list) – list of uniform resource name strings, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
dhlabid (list) – list of numbers for dhlabid:
[1000001, 2000003]
- dhlab.api.dhlab_api.get_document_corpus(**kwargs)
- dhlab.api.dhlab_api.document_corpus(doctype: str | None = None, author: str | None = None, freetext: str | None = None, fulltext: str | None = None, from_year: int | None = None, to_year: int | None = None, from_timestamp: int | None = None, to_timestamp: int | None = None, title: str | None = None, ddk: str | None = None, subject: str | None = None, publisher: str | None = None, literaryform: str | None = None, genres: str | None = None, city: str | None = None, lang: str | None = None, limit: int | None = None, order_by: str | None = None) -> pandas.DataFrame
Fetch a corpus based on metadata.
Calls the API endpoint /build_corpus.
- Parameters:
doctype (str)
author (str) – Name of an author.
freetext (str) – any of the parameters, for example: "digibok AND Ibsen".
fulltext (str) – words within the publication.
from_year (int) – Start year for time period of interest.
to_year (int) – End year for time period of interest.
from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101.
to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101.
title (str) – Name or title of a document.
ddk (str) – Dewey Decimal Classification identifier.
subject (str) – subject (keywords) of the publication.
publisher (str) – Name of publisher.
literaryform (str) – literary form of the publication (books)
genres (str) – genre of the publication.
city (str) – place of publication
lang (str) – Language of the publication, as a 3-letter ISO code (e.g. "nno").
limit (int) – number of items to sample.
order_by (str) – order of elements in the corpus object. Typically used in combination with a limit. Example: "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest method).
- Returns:
A pandas.DataFrame with the corpus information.
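The timestamp parameters are plain integers. Assuming they are eight-digit dates of the form YYYYMMDD (with books stored as YYYY0101, as noted above), they can be built from ordinary dates as in this sketch. The helper names to_timestamp and book_timestamp are hypothetical and not part of the module.

```python
from datetime import date


def to_timestamp(d):
    """Encode a date as the integer YYYYMMDD (assumed timestamp format)."""
    return d.year * 10000 + d.month * 100 + d.day


def book_timestamp(year):
    """Books carry only a year, which is stored as YYYY0101."""
    return year * 10000 + 101


start = to_timestamp(date(1950, 5, 17))
book = book_timestamp(1950)
```

Values like these could then be passed as from_timestamp and to_timestamp when a date range finer than whole years is needed.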
- dhlab.api.dhlab_api.urn_collocation(urns: List[str] | None = None, word: str = 'arbeid', before: int = 5, after: int = 0, samplesize: int = 200000) -> pandas.DataFrame
Create a collocation from a list of URNs.
Calls the DH-lab API.
- Parameters:
urns (list) – list of uniform resource name strings, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
word (str) – word to construct collocation with.
before (int) – number of words preceding the given word.
after (int) – number of words following the given word.
samplesize (int) – total number of urns to search through.
- Returns:
A pandas.DataFrame with distance (sum of distances and bayesian distance) and frequency for words collocated with word.
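The roles of before and after can be illustrated with a local frequency count over a token list. This sketch covers only the frequency part of a collocation, not the distance measures mentioned in the return value; the helper name local_collocation and the toy sentence are hypothetical.

```python
from collections import Counter


def local_collocation(tokens, word, before=5, after=0):
    """Count words within `before`/`after` positions of each hit of `word`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        left = tokens[max(i - before, 0):i]  # up to `before` preceding words
        right = tokens[i + 1:i + 1 + after]  # up to `after` following words
        counts.update(left + right)
    return counts


tokens = "godt arbeid er godt betalt arbeid".split()
colls = local_collocation(tokens, "arbeid", before=2, after=0)
```

With before=2 and after=0, only the two words to the left of each occurrence of the target word are counted.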
- dhlab.api.dhlab_api.totals(top_words: int = 50000) -> pandas.DataFrame
Get aggregated raw frequencies of all words in the National Library’s database.
Calls the API endpoint /totals/{top_words}.
- Parameters:
top_words (int) – The number of words to get total frequencies for.
- Returns:
A pandas.DataFrame with the most frequent words.
- dhlab.api.dhlab_api.concordance(urns: list | None = None, words: str | None = None, window: int = 25, limit: int = 100) -> pandas.DataFrame
Get a list of concordances from the National Library’s database.
Calls the API endpoint /conc.
- Parameters:
urns (list) – uniform resource names, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
words (str) – Word(s) to search for. Can be an SQLite fulltext query, an fts5 string search expression.
window (int) – number of tokens on either side to show in the concordances, between 1 and 25.
limit (int) – max. number of concordances per document. Maximum value is 1000.
- Returns:
a table of concordances
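A concordance is a keyword-in-context listing: each hit is shown with a few tokens on either side. The local sketch below mimics the window and limit parameters on a single token list. The helper name local_concordance and the toy sentence are hypothetical, and exact token matching stands in for the full-text query syntax the API accepts.

```python
def local_concordance(tokens, word, window=25, limit=100):
    """Return up to `limit` keyword-in-context rows for `word`."""
    window = min(max(window, 1), 25)  # keep within the documented 1-25 range
    rows = []
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(i - window, 0):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
            if len(rows) >= limit:
                break
    return rows


tokens = "det var en gang en konge som hadde en datter".split()
rows = local_concordance(tokens, "en", window=2, limit=2)
```

The third occurrence of the keyword is skipped here because limit stops the search after two rows.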
- dhlab.api.dhlab_api.konkordans
- dhlab.api.dhlab_api.concordance_counts(urns: list | None = None, words: str | None = None, window: int = 25, limit: int = 100) -> pandas.DataFrame
Count concordances (keyword in context) for a corpus query (used for collocation analysis).
Calls the API endpoint /conccount.
- Parameters:
urns (list) – uniform resource names, for example:
["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]
words (str) – Word(s) to search for. Can be an SQLite fulltext query, an fts5 string search expression.
window (int) – number of tokens on either side to show in the concordances, between 1 and 25.
limit (int) – max. number of concordances per document. Maximum value is 1000.
- Returns:
a table of counts
- dhlab.api.dhlab_api.word_concordance(urn: list | None = None, dhlabid: list | None = None, words: list | None = None, before: int = 12, after: int = 12, limit: int = 100, samplesize: int = 50000) -> pandas.DataFrame
Get a list of concordances from the National Library’s database.
Calls the API endpoint /conc.
- Parameters:
urn (list) – dhlab serial ids (the server accepts both urns and dhlabids, so this may be rewritten).
words (list) – Word(s) to search for; must be a list.
before (int) – between 0 and 24.
after (int) – between 0 and 24 (before + after <= 24).
limit (int) – max. number of concordances per server process.
samplesize (int) – samples from urns.
- Returns:
a table of concordances
- dhlab.api.dhlab_api.collocation(corpusquery: str = 'norge', word: str = 'arbeid', before: int = 5, after: int = 0) -> pandas.DataFrame
Make a collocation from a corpus query.
- Parameters:
corpusquery (str) – query string
word (str) – target word for the collocations.
before (int) – number of words prior to word
after (int) – number of words following word
- Returns:
a dataframe with the resulting collocations
- dhlab.api.dhlab_api.word_variant(word: str, form: str, lang: str = 'nob') -> list
Find alternative form for a given word form.
Calls the DH-lab API.
Example: word_variant('spiste', 'pres-part')
- Parameters:
word (str) – any word string
form (str) – a morphological feature tag from the Norwegian wordbank "Ordbanken".
lang (str) – either “nob” or “nno”
- dhlab.api.dhlab_api.word_paradigm(word: str, lang: str = 'nob') -> list
Find paradigms for a given word form.
Calls the DH-lab API.
Example:
    word_paradigm('spiste')
    # [['adj', ['spisende', 'spist', 'spiste']],
    #  ['verb', ['spis', 'spise', 'spiser', 'spises', 'spist', 'spiste']]]
- Parameters:
word (str) – any word string
lang (str) – either “nob” or “nno”
- dhlab.api.dhlab_api.word_paradigm_many(wordlist: list, lang: str = 'nob') -> list
Find alternative forms for a list of words.
- Parameters:
wordlist – List of words
lang – Language
- dhlab.api.dhlab_api.word_form(word: str, lang: str = 'nob') -> list
Look up the morphological feature specification of a word form.
- Parameters:
word – Word
lang – Language
- dhlab.api.dhlab_api.word_form_many(wordlist: list, lang: str = 'nob') -> list
Look up the morphological feature specifications for word forms in a wordlist.
- Parameters:
wordlist – List of words
lang – Language
- dhlab.api.dhlab_api.word_lemma(word: str, lang: str = 'nob') -> list
Find the list of possible lemmas for a given word form.
- Parameters:
word – Word to find lemmas for
lang – Language
- dhlab.api.dhlab_api.word_lemma_many(wordlist, lang='nob')
Find lemmas for a list of given word forms.
- dhlab.api.dhlab_api.query_imagination_corpus(category=None, author=None, title=None, year=None, publisher=None, place=None, oversatt=None)
Fetch data from imagination corpus.