dhlab.text.corpus¶
Module Contents¶
Classes¶
| Corpus | Class representing a DHLAB Corpus |
API¶
- class dhlab.text.corpus.Corpus(doctype: str | None = None, author: str | None = None, freetext: str | None = None, fulltext: str | None = None, from_year: int | None = None, to_year: int | None = None, from_timestamp: int | None = None, to_timestamp: int | None = None, title: str | None = None, ddk: str | None = None, subject: str | None = None, publisher: str | None = None, literaryform: str | None = None, genres: str | None = None, city: str | None = None, lang: str | None = None, limit: int | None = 10, limit_by_year: bool = False, order_by: str | None = 'random', allow_duplicates: bool = False)¶
- Bases: dhlab.text.dhlab_object.DhlabObj
- Class representing a DHLAB Corpus. Primary object for working with dhlab data. Contains references to texts in the National Library’s collections and metadata about them. Use with .coll, .conc or .freq to analyse using dhlab tools.
- Initialization: Create Corpus.
- Parameters:
- doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"
- author (str) – Name of an author. 
- freetext (str) – any of the parameters, for example: "digibok AND Ibsen".
- fulltext (str) – words within the publication. 
- from_year (int) – Start year for time period of interest. 
- to_year (int) – End year for time period of interest. 
- from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD; books have YYYY0101.
- to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD; books have YYYY0101.
- title (str) – Name or title of a document. 
- ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.
- subject (str) – subject (keywords) of the publication. 
- publisher (str) – Name of publisher. 
- literaryform (str) – literary form of the publication (books) 
- genres (str) – genre of the publication. 
- city (str) – place of publication. 
- lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"
- limit (int) – number of items to sample. 
- limit_by_year (bool) – sample from each year in the query year range. 
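The parameters above can be combined in one query. The following is a minimal sketch, not taken from the dhlab documentation: the helper names `to_timestamp` and `build_corpus` and the search values are made up for illustration, and only keyword names from the signature above are used. The `Corpus` import sits inside the function so the sketch loads without dhlab installed; calling it is assumed to need the dhlab package and network access.

```python
from datetime import date


def to_timestamp(day: date) -> int:
    """Encode a date as a YYYYMMDD integer (hypothetical helper, not part of dhlab)."""
    return day.year * 10000 + day.month * 100 + day.day


# Query parameters; the keys come from the Corpus signature, the values are illustrative.
query = {
    "doctype": "digavis",
    "fulltext": "jernbane",
    "from_timestamp": to_timestamp(date(1950, 1, 1)),
    "to_timestamp": to_timestamp(date(1950, 12, 31)),
    "lang": "nob",
    "limit": 20,
}


def build_corpus(params: dict):
    """Create a Corpus from keyword parameters (hypothetical wrapper)."""
    # Imported lazily: assumed to require the dhlab package and network access.
    from dhlab.text.corpus import Corpus

    return Corpus(**params)
```

Calling `build_corpus(query)` would then request up to 20 matching newspaper items from 1950, assuming the service is reachable.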
 
 - doctypes¶
- ['digibok', 'digavis', 'digitidsskrift', 'digistorting', 'digimanus', 'kudos']
 - property corpus¶
 - classmethod from_identifiers(identifiers: List[Union[str, int]])¶
- Construct Corpus from list of identifiers 
 - classmethod from_df(df: pandas.DataFrame, check_for_urn: bool = False) dhlab.text.corpus.Corpus | pandas.Series¶
- Typecast Pandas DataFrame to Corpus class. The DataFrame must contain a URN column.
 - classmethod from_csv(path: str)¶
- Import corpus from csv 
 - static _urn_id_in_dataframe_cols(dataframe: Union[pandas.DataFrame, dhlab.text.corpus.Corpus]) pandas.DataFrame¶
- Checks if dataframe contains URN column 
 - extend_from_identifiers(identifiers: list | None = None)¶
 - evaluate_words(wordbags: dict | None = None)¶
 - add(new_corpus: Union[pandas.DataFrame, dhlab.text.corpus.Corpus])¶
- Utility for appending Corpus or DataFrame to self 
 - sample(n: int = 5)¶
- Create random subcorpus with n entries
 - only_one_author()¶
- Only select items with one author 
 - only_one_language()¶
- Only select items with one language 
 - conc(words: str | None, window: int = 20, limit: int = 500) dhlab.text.conc_coll.Concordance¶
- Get concordances of words in corpus
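To illustrate what a concordance is, here is a toy keyword-in-context sketch in plain Python. It is not dhlab's implementation: `toy_concordance` is a hypothetical function, and it assumes that `window` counts whitespace-separated tokens on each side of a hit, which may differ from how the service interprets `window`.

```python
def toy_concordance(text: str, word: str, window: int = 20, limit: int = 500):
    """Return (left context, hit, right context) tuples for exact matches of `word`."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - window):i]      # up to `window` tokens before the hit
            right = tokens[i + 1:i + 1 + window]     # up to `window` tokens after the hit
            hits.append((" ".join(left), token, " ".join(right)))
            if len(hits) >= limit:                   # stop once `limit` hits are collected
                break
    return hits
```

On an actual corpus object the call would follow the signature above, for example `corpus.conc("jernbane", window=10)`.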
 - coll(words: str | list[str] | None = None, before: int = 10, after: int = 10, reference: pandas.DataFrame | None = None, samplesize: int = 20000, alpha: bool = False, ignore_caps: bool = False) dhlab.text.conc_coll.Collocations¶
- Get collocations of words in corpus
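The `before` and `after` arguments can be pictured as the span of neighbouring words counted around each hit. The sketch below is a plain-Python illustration under that assumption; `toy_collocations` is hypothetical, and it only counts raw co-occurrences, ignoring `reference`, `samplesize`, `alpha` and `ignore_caps`.

```python
from collections import Counter


def toy_collocations(tokens: list, word: str, before: int = 10, after: int = 10) -> Counter:
    """Count tokens appearing within `before`/`after` positions of each hit of `word`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == word:
            counts.update(tokens[max(0, i - before):i])   # left neighbours
            counts.update(tokens[i + 1:i + 1 + after])    # right neighbours
    return counts
```

For example, `toy_collocations("the old man and the sea".split(), "old", before=1, after=1)` counts "the" and "man" once each.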
 - count(words: list[str] | None = None, cutoff: int = 0, sparse: bool = True)¶
- Get word frequencies for corpus 
 - freq(words: list[str] | None = None, cutoff: int = 0, sparse: bool = True)¶
- Get word frequencies for corpus 
 - static _is_Corpus(corpus: dhlab.text.corpus.Corpus) bool¶
- Check if input is Corpus or DataFrame
 - __add__(other: dhlab.text.corpus.Corpus)¶
- Add two Corpus objects 
 - _make_subcorpus(**kwargs) dhlab.text.corpus.Corpus | pandas.Series | None¶
 - make_subcorpus(authors: str | None = None, title: str | None = None) dhlab.text.corpus.Corpus | pandas.Series | None¶
- Make subcorpus based on author and title
- Args:
- authors (str, optional): search the author field. Defaults to None.
- title (str, optional): search the title field. Defaults to None.
- Returns: Corpus: A subset of the original corpus
 - check_integrity()¶
- Check the integrity of the corpus data. 
 - _check_for_urn_duplicates()¶
- Check for duplicate URNs in corpus 
 - _drop_urn_duplicates(reset_index=True)¶
- Drop duplicate URNs in corpus. dhlab sometimes contains multiple versions of the text for a text object. Usually these are different OCR results. This method drops all but the last, as this is usually the best. Dhlabid is always unique.
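The keep-the-last rule described here can be sketched in plain Python. This is an illustration only, not dhlab's code: the function name, the row layout, and the keys `"urn"` and `"dhlabid"` are assumptions, and the sample values are invented.

```python
def keep_last_per_urn(rows: list) -> list:
    """Keep only the last row for each URN, preserving the order of the kept rows."""
    # Later rows overwrite earlier ones, so each URN maps to its final position.
    last_index = {row["urn"]: i for i, row in enumerate(rows)}
    return [row for i, row in enumerate(rows) if last_index[row["urn"]] == i]


# Invented sample data: two versions of the same text share a URN but not a dhlabid.
rows = [
    {"dhlabid": 1, "urn": "urn-A"},
    {"dhlabid": 2, "urn": "urn-B"},
    {"dhlabid": 3, "urn": "urn-A"},
]
deduplicated = keep_last_per_urn(rows)
```

On a pandas DataFrame the same rule corresponds to `df.drop_duplicates(subset="urn", keep="last")`, optionally followed by `reset_index(drop=True)`; whether the method is implemented this way is an assumption.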