dhlab.text.corpus
Module Contents
Classes
Corpus | Class representing a DHLAB Corpus
API
- class dhlab.text.corpus.Corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, publisher=None, literaryform=None, genres=None, city=None, lang=None, limit=10, limit_by_year=False, order_by='random', allow_duplicates=False)
Bases: dhlab.text.dhlab_object.DhlabObj
Class representing a DHLAB Corpus
Primary object for working with dhlab data. Contains references to texts in the National Library’s collections and metadata about them. Use with .coll, .conc or .freq to analyse using dhlab tools.
Initialization
Create Corpus
- Parameters:
doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"
author (str) – Name of an author.
freetext (str) – any of the parameters, for example: "digibok AND Ibsen".
fulltext (str) – words within the publication.
from_year (int) – Start year for time period of interest.
to_year (int) – End year for time period of interest.
from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101
to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101
title (str) – Name or title of a document.
ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.
subject (str) – subject (keywords) of the publication.
publisher (str) – Name of publisher.
literaryform (str) – literary form of the publication (books)
genres (str) – genre of the publication.
city (str) – place of publication.
lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"
limit (int) – number of items to sample.
limit_by_year (bool) – sample from each year in the query year range.
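A minimal sketch of how the parameters above might be combined. The helper names `to_timestamp_int`, `corpus_params` and `build_corpus` are hypothetical and not part of dhlab; only the keyword names come from the signature above, and the values are purely illustrative. Building the corpus assumes that dhlab is installed and that its API is reachable.

```python
from datetime import date


def to_timestamp_int(day: date) -> int:
    """Encode a date as a YYYYMMDD integer, the format listed for
    from_timestamp and to_timestamp."""
    return day.year * 10000 + day.month * 100 + day.day


def corpus_params(**overrides) -> dict:
    """Collect keyword arguments for Corpus (illustrative values only)."""
    params = {
        "doctype": "digavis",
        "fulltext": "Ibsen",
        "from_timestamp": to_timestamp_int(date(1900, 1, 1)),
        "to_timestamp": to_timestamp_int(date(1900, 12, 31)),
        "lang": "nob",
        "limit": 10,
    }
    params.update(overrides)
    return params


def build_corpus(params: dict):
    """Create the corpus. The import is deferred because it requires
    dhlab to be installed (assumption) and network access."""
    from dhlab.text.corpus import Corpus

    return Corpus(**params)
```

Calling `build_corpus(corpus_params(limit=5))` would then request a sample of five items. Keeping the parameters in a plain dictionary makes a query easy to inspect or adjust before any request is sent.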
- doctypes
['digibok', 'digavis', 'digitidsskrift', 'digistorting', 'digimanus', 'kudos']
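As a small illustration, a doctype could be checked against this list before a query is built. This is a client-side sketch only: `check_doctype` is a hypothetical helper, and nothing here describes how Corpus itself validates its input.

```python
# Values copied from the doctypes attribute listed above.
DOCTYPES = ["digibok", "digavis", "digitidsskrift", "digistorting", "digimanus", "kudos"]


def check_doctype(doctype):
    """Return doctype unchanged if it is None or a listed value, else raise."""
    if doctype is not None and doctype not in DOCTYPES:
        raise ValueError(f"Unknown doctype {doctype!r}; expected one of {DOCTYPES}")
    return doctype
```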
- classmethod from_identifiers(identifiers: List[Union[str, int]])
Construct Corpus from list of identifiers
- classmethod from_df(df: pandas.DataFrame, check_for_urn: bool = False)
Typecast Pandas DataFrame to Corpus class
DataFrame must contain a URN column
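A hedged sketch of checking the URN requirement before converting a DataFrame. The column name `"urn"` is an assumption (the reference only says "URN column"), and `has_urn_column` and `corpus_from_frame` are hypothetical helpers, not dhlab functions.

```python
def has_urn_column(columns) -> bool:
    """Check whether an iterable of column names includes 'urn'
    (assumed name of the URN column)."""
    return "urn" in list(columns)


def corpus_from_frame(df):
    """Convert a DataFrame to a Corpus after a local URN check."""
    if not has_urn_column(df.columns):
        raise ValueError("DataFrame has no 'urn' column")
    # Deferred import: requires dhlab to be installed (assumption).
    from dhlab.text.corpus import Corpus

    return Corpus.from_df(df, check_for_urn=True)
```

The local check only gives an early, readable error; the `check_for_urn` flag from the signature above is passed through as well.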
- classmethod from_csv(path: str)
Import corpus from a CSV file
- static _urn_id_in_dataframe_cols(dataframe: Union[pandas.DataFrame, type('Corpus')]) -> pandas.DataFrame
Checks if dataframe contains URN column
- extend_from_identifiers(identifiers: list = None)
- evaluate_words(wordbags=None)
- add(new_corpus: Union[pandas.DataFrame, type('Corpus')])
Utility for appending Corpus or DataFrame to self
- sample(n: int = 5)
Create a random subcorpus with n entries
- only_one_author()
Only select items with one author
- only_one_language()
Only select items with one language
- conc(words, window: int = 20, limit: int = 500) -> dhlab.text.conc_coll.Concordance
Get concordances of words in corpus
- coll(words=None, before=10, after=10, reference=None, samplesize=20000, alpha=False, ignore_caps=False) -> dhlab.text.conc_coll.Collocations
Get collocations of words in corpus
- count(words=None, cutoff=0, sparse=True)
Get word frequencies for corpus
- freq(words=None, cutoff=0, sparse=True)
Get word frequencies for corpus
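The analysis methods above could be combined roughly as follows. This is a sketch under stated assumptions: `analyse` is a hypothetical helper, the keyword names and defaults are taken from the signatures above, and the expected type of `words` (string or list) is not stated in this reference, so it is passed through unchanged.

```python
def analyse(corpus, words) -> dict:
    """Run concordance, collocation and frequency analysis on one corpus."""
    return {
        # Up to 500 concordance hits with a window of 20.
        "concordance": corpus.conc(words=words, window=20, limit=500),
        # Collocates within 10 positions before and after the target.
        "collocations": corpus.coll(words=words, before=10, after=10),
        # Word frequencies for the whole corpus (default arguments).
        "frequencies": corpus.freq(),
    }
```

Any object exposing `conc`, `coll` and `freq` with these keyword names works here, which makes the helper easy to try out against a stand-in object without network access.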
- static _is_Corpus(corpus: dhlab.text.corpus.Corpus) -> bool
Check if input is Corpus or DataFrame
- __add__(other)
Add two Corpus objects
- _make_subcorpus(**kwargs) -> dhlab.text.corpus.Corpus
- make_subcorpus(authors: str = None, title: str = None) -> dhlab.text.corpus.Corpus
Make subcorpus based on author and title
Args:
authors (str, optional): search the author field. Defaults to None.
title (str, optional): search the title field. Defaults to None.
Returns:
Corpus: A subset of the original corpus
- check_integrity()
Check the integrity of the corpus data.
- _check_for_urn_duplicates()
Check for duplicate URNs in corpus
- _drop_urn_duplicates(reset_index=True)
Drop duplicate URNs in corpus
dhlab sometimes contains multiple versions of the text for a single text object, usually from different OCR runs. This method drops all but the last version, as this is usually the best. The dhlabid is always unique.
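The keep-last rule described above can be illustrated in plain Python. This sketch shows the idea only and is not the library's implementation; the row layout with `"dhlabid"` and `"urn"` keys is an assumption. On a pandas DataFrame, a comparable result would come from `df.drop_duplicates(subset="urn", keep="last")`.

```python
def drop_urn_duplicates(rows):
    """Keep only the last row for each URN, preserving the order of kept rows."""
    # Position of the final occurrence of every URN.
    last_index = {row["urn"]: i for i, row in enumerate(rows)}
    return [row for i, row in enumerate(rows) if last_index[row["urn"]] == i]


rows = [
    {"dhlabid": 1, "urn": "URN_A"},  # earlier version of the same text
    {"dhlabid": 2, "urn": "URN_B"},
    {"dhlabid": 3, "urn": "URN_A"},  # later version, kept
]
deduplicated = drop_urn_duplicates(rows)
```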