dhlab.text.corpus

Module Contents

Classes

Corpus

Class representing a DHLAB Corpus

API

class dhlab.text.corpus.Corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, publisher=None, literaryform=None, genres=None, city=None, lang=None, limit=10, limit_by_year=False, order_by='random', allow_duplicates=False)

Bases: dhlab.text.dhlab_object.DhlabObj

Class representing a DHLAB Corpus

Primary object for working with dhlab data. Contains references to texts in the National Library’s collections and metadata about them. Use .coll, .conc or .freq to analyse the corpus with dhlab tools.

Initialization

Create Corpus

Parameters:
  • doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"

  • author (str) – Name of an author.

  • freetext (str) – any of the parameters, for example: "digibok AND Ibsen".

  • fulltext (str) – words within the publication.

  • from_year (int) – Start year for time period of interest.

  • to_year (int) – End year for time period of interest.

  • from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD; books use YYYY0101.

  • to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD; books use YYYY0101.

  • title (str) – Name or title of a document.

  • ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.

  • subject (str) – subject (keywords) of the publication.

  • publisher (str) – Name of publisher.

  • literaryform (str) – literary form of the publication (books)

  • genres (str) – genre of the publication.

  • city (str) – place of publication.

  • lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"

  • limit (int) – number of items to sample.

  • limit_by_year (bool) – sample from each year in the query year range.
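As a rough sketch of how these parameters might be combined, the snippet below collects keyword arguments for a newspaper query and converts dates to the YYYYMMDD integers described for from_timestamp and to_timestamp. The helper names to_timestamp and build_corpus, and all query values, are illustrative assumptions and not part of dhlab. Actually building the corpus requires the dhlab package and, presumably, network access, so that call is left commented out.

```python
from datetime import date


def to_timestamp(d: date) -> int:
    """Encode a date as a YYYYMMDD integer (hypothetical helper, not part of dhlab)."""
    return d.year * 10000 + d.month * 100 + d.day


# Illustrative values only; the keys are parameters from the list above.
corpus_kwargs = {
    "doctype": "digavis",
    "fulltext": "Ibsen",
    "from_timestamp": to_timestamp(date(1905, 1, 1)),
    "to_timestamp": to_timestamp(date(1905, 12, 31)),
    "lang": "nob",
    "limit": 20,
}


def build_corpus(**kwargs):
    """Create a Corpus; dhlab is imported lazily so the helper above works without it."""
    from dhlab.text.corpus import Corpus

    return Corpus(**kwargs)


# Uncomment when dhlab is installed:
# corpus = build_corpus(**corpus_kwargs)
```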

doctypes

['digibok', 'digavis', 'digitidsskrift', 'digistorting', 'digimanus', 'kudos']

classmethod from_identifiers(identifiers: List[Union[str, int]])

Construct Corpus from list of identifiers

classmethod from_df(df: pandas.DataFrame, check_for_urn: bool = False)

Typecast Pandas DataFrame to Corpus class

The DataFrame must contain a URN column.
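The URN requirement can be checked before calling from_df. The sketch below is a minimal illustration using pandas only. The column name "urn", the helper require_urn_column and the identifier values are assumptions made for the example; this page does not state the exact column name.

```python
import pandas as pd

URN_COLUMN = "urn"  # assumption: the exact column name is not given on this page


def require_urn_column(df: pd.DataFrame, column: str = URN_COLUMN) -> pd.DataFrame:
    """Return df unchanged, or raise if the URN column is missing (hypothetical helper)."""
    if column not in df.columns:
        raise ValueError(f"DataFrame has no {column!r} column")
    return df


# Made-up placeholder identifiers, not real URNs.
df = pd.DataFrame({"urn": ["URN:example_A", "URN:example_B"], "title": ["A", "B"]})
checked = require_urn_column(df)

# With dhlab installed, the checked frame could then be passed on:
# from dhlab.text.corpus import Corpus
# corpus = Corpus.from_df(checked, check_for_urn=True)
```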

classmethod from_csv(path: str)

Import corpus from csv

static _urn_id_in_dataframe_cols(dataframe: Union[pandas.DataFrame, type('Corpus')]) -> pandas.DataFrame

Checks if dataframe contains URN column

extend_from_identifiers(identifiers: list = None)
evaluate_words(wordbags=None)
add(new_corpus: Union[pandas.DataFrame, type('Corpus')])

Utility for appending Corpus or DataFrame to self

sample(n: int = 5)

Create random subcorpus with n entries

only_one_author()

Only select items with one author

only_one_language()

Only select items with one language

conc(words, window: int = 20, limit: int = 500) -> dhlab.text.conc_coll.Concordance

Get concordances of words in corpus

coll(words=None, before=10, after=10, reference=None, samplesize=20000, alpha=False, ignore_caps=False) -> dhlab.text.conc_coll.Collocations

Get collocations of words in corpus

count(words=None, cutoff=0, sparse=True)

Get word frequencies for corpus

freq(words=None, cutoff=0, sparse=True)

Get word frequencies for corpus
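The analysis methods above can be called together on one corpus. The following is a minimal sketch, assuming corpus is an existing Corpus instance. The helper name analyse is hypothetical, the keyword values simply repeat the defaults from the signatures, and the accepted form of words (single string or list) is not stated on this page and should be checked before use.

```python
def analyse(corpus, words):
    """Run concordance, collocation and frequency analysis on a corpus (hypothetical helper)."""
    return {
        # Keyword values repeat the documented defaults.
        "concordance": corpus.conc(words, window=20, limit=500),
        "collocations": corpus.coll(words, before=10, after=10),
        # Called without arguments, relying on the documented defaults.
        "frequencies": corpus.freq(),
    }


# With dhlab installed and a corpus built:
# results = analyse(corpus, "demokrati")
```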

static _is_Corpus(corpus: dhlab.text.corpus.Corpus) -> bool

Check if input is Corpus or DataFrame

__add__(other)

Add two Corpus objects

_make_subcorpus(**kwargs) -> dhlab.text.corpus.Corpus
make_subcorpus(authors: str = None, title: str = None) -> dhlab.text.corpus.Corpus

Make subcorpus based on author and title

Parameters:
  • authors (str, optional) – search the author field. Defaults to None.

  • title (str, optional) – search the title field. Defaults to None.

Returns: Corpus – a subset of the original corpus.

check_integrity()

Check the integrity of the corpus data.

_check_for_urn_duplicates()

Check for duplicate URNs in corpus

_drop_urn_duplicates(reset_index=True)

Drop duplicate URNs in corpus

dhlab sometimes contains multiple versions of the text for a single text object, usually from different OCR runs. This method drops all but the last version, as the last is usually the best. The dhlabid is always unique.
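The "keep the last version" behaviour can be illustrated with plain pandas. This is a sketch of the idea, not the method's actual implementation. The column names "urn" and "dhlabid" and the row values are assumptions made for the example.

```python
import pandas as pd

# Made-up rows: two OCR versions of the same text share a URN but have distinct dhlabids.
df = pd.DataFrame(
    {
        "dhlabid": [1, 2, 3],
        "urn": ["URN:example_A", "URN:example_A", "URN:example_B"],
    }
)

# Keep only the last row for each URN, then renumber the index from zero.
deduplicated = df.drop_duplicates(subset="urn", keep="last").reset_index(drop=True)
```

Here the row with dhlabid 1 is dropped, because a later row carries the same URN.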