![resources/images/dhlab-logo-nb.png](../resources/images/dhlab-logo-nb.png)

# Frequency lists
Frequency lists built from texts can serve as a starting point for learning a great deal about a text.

Here we use the class `Counts` to fetch frequency lists (absolute frequencies) from a book.

`urn` is the identifier for the book in the National Library of Norway's (Nasjonalbiblioteket) collection. The example file on corpus building has more information on how to build a corpus. URNs are found in a book's metadata and have this form: https://urn.nb.no/URN:NBN:no-nb_digibok_2012051608012. The 13 digits at the end identify the book; the examples below pass the full `URN:NBN:...` string to `Counts`.
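To make the URN format concrete, here is a minimal sketch that pulls the URN and its 13-digit suffix out of a link of the form shown above. The helper names `urn_from_url` and `serial_from_urn` are hypothetical and not part of `dhlab`.

```python
import re

def urn_from_url(url: str) -> str:
    """Return the 'URN:NBN:...' part of a urn.nb.no link (hypothetical helper)."""
    match = re.search(r"URN:NBN:no-nb_digibok_\d{13}", url)
    if match is None:
        raise ValueError(f"no digibok URN found in {url!r}")
    return match.group(0)

def serial_from_urn(urn: str) -> str:
    """Return the digits after the last underscore (hypothetical helper)."""
    return urn.rsplit("_", 1)[-1]

url = "https://urn.nb.no/URN:NBN:no-nb_digibok_2012051608012"
urn = urn_from_url(url)
print(urn)                   # the full URN string
print(serial_from_urn(urn))  # the 13-digit suffix
```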

In [1]:
from dhlab import Counts, totals, Corpus
import dhlab.graph_networkx_louvain as gnl
import pandas as pd
import dhlab.nbtext as nb

### Frequency list for individual books

#### A short non-fiction book

In [2]:
urn = "URN:NBN:no-nb_digibok_2021011848587"
bok = Counts([urn])

In [3]:
bok.frame.iloc[:, 0]

,                   184
.                   162
og                   94
i                    93
det                  51
                   ... 
ekstra                1
eget                  1
egenkjærligheden      1
egenflige             1
eneste                1
Name: 100431574, Length: 1597, dtype: int64
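For intuition about what such a list contains, the sketch below builds an absolute-frequency list from a short string using only the standard library. The tokenizer (word characters or single punctuation marks) and the sample sentence are assumptions made for illustration; they are not how the library's counts are produced.

```python
from collections import Counter
import re

def frequency_list(text: str) -> list[tuple[str, int]]:
    """Count tokens in `text`, most frequent first (illustrative helper)."""
    # Assumed tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return Counter(tokens).most_common()

sample = "Det var en gang en bok, og boka var kort."
freqs = frequency_list(sample)
for token, count in freqs[:3]:
    print(token, count)
```

As in the output above, punctuation marks are counted as tokens alongside words.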

We can compare the book's frequency list with that of a larger corpus, which we store here in the variable `tot`.

In [4]:
tot = totals()

We compute the ratio between each word's frequency in the book and its frequency among the 50,000 most frequent words in all of nb.no.

In [5]:
(bok.frame.iloc[:, 0] / tot.iloc[:, 0]).sort_values(ascending=False).head(20)

Hedda          0.000050
def            0.000030
Ivo            0.000026
Skjønberg      0.000023
Claes          0.000019
Gill           0.000017
oppførelse     0.000016
teatrene       0.000016
VAN            0.000015
væsen          0.000013
Alten          0.000013
Rosenkrantz    0.000011
Kaas           0.000011
poetiske       0.000010
ensemblet      0.000010
Høst           0.000010
Jæger          0.000010
kontra         0.000009
Peggy          0.000009
Vauxhall       0.000009
dtype: float64
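The division above relies on pandas aligning the two series on their word index. A minimal sketch with toy data shows the effect: words missing from either side become `NaN` and sort last. The counts for "og" are taken from the outputs in this notebook; the other numbers are invented for illustration.

```python
import pandas as pd

# Toy book counts; "Hedda" and "egenflige" values are invented.
book = pd.Series({"og": 94, "Hedda": 12, "egenflige": 1})
# Toy reference counts; "Hedda" is invented, and "egenflige" is deliberately absent.
reference = pd.Series({"og": 2_520_268_056, "Hedda": 240_000, "i": 2_531_262_027})

# Division aligns on the index; unmatched words give NaN.
ratio = (book / reference).sort_values(ascending=False)
print(ratio)

# Only words present in both series get a numeric ratio.
print(ratio.dropna().index.tolist())
```

Because the numerator is a raw count rather than a relative frequency, the size of the ratio depends on the length of the book, so it is best read as a ranking within one book.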

In [6]:
tot

| word | freq |
|---|---|
| . | 7655423257 |
| , | 5052171514 |
| i | 2531262027 |
| og | 2520268056 |
| - | 1314451583 |
| ... | ... |
| tidspunkter | 110667 |
| dirigenter | 110660 |
| ondartet | 110652 |
| kulturtilbud | 110652 |


#### Knausgård's *Min kamp / Første bok*

In [7]:
min_kamp_1 = Counts(["URN:NBN:no-nb_digibok_2014032405041"])
min_kamp_1.frame.head(20)

| word | 100197175 |
|---|---|
| , | 14332 |
| . | 9480 |
| og | 5383 |
| det | 4430 |
| jeg | 4175 |
| i | 3657 |
| var | 3238 |
| på | 3152 |
| - | 2500 |
| som | 2463 |


In [8]:
(min_kamp_1.frame.iloc[:, 0] / tot.iloc[:,0]).sort_values(ascending=False).head(20).to_frame("ratio")

| word | ratio |
|---|---|
| Yngve | 0.000572 |
| Farmor | 0.000376 |
| pappas | 0.000231 |
| farmor | 0.000203 |
| Ylva | 0.00019 |
| skrudde | 0.000154 |
| Jada | 0.00013 |
| Fløgstad | 0.000119 |
| helte | 0.000114 |
| koppen | 0.000108 |


### Comparing frequencies across different books

In [9]:
# Frequency lists for 20 books by Anne-Cath. Vestly
vestly = Counts(
    Corpus(
        author="Anne*Vestly*",
        limit=20
    )
)
vestly.frame.head()

Unnamed: 0,100503685,100503477,100503478,100516696,100445919,100440922,100558943,100401171,100566074,100463823,100031786,100035823,100039821,100124378,100164506,100495412,100235984,100325187,100351724,100387104
",",3361.0,1877.0,1933.0,46.0,2741.0,2279.0,2088.0,1356.0,2316.0,2726.0,2087.0,365.0,426.0,2745.0,1330.0,578.0,2904.0,2082.0,39.0,2039.0
og,2794.0,1114.0,1290.0,32.0,1631.0,1588.0,1389.0,915.0,1479.0,1475.0,1387.0,229.0,237.0,1616.0,1146.0,466.0,1640.0,1384.0,14.0,1336.0
.,2528.0,1664.0,1840.0,42.0,2365.0,2079.0,1979.0,1276.0,1734.0,2378.0,1983.0,330.0,435.0,1969.0,1932.0,688.0,2665.0,1992.0,45.0,2070.0
var,1531.0,458.0,587.0,9.0,691.0,660.0,622.0,293.0,704.0,700.0,621.0,75.0,91.0,617.0,459.0,81.0,649.0,623.0,12.0,656.0
det,1320.0,699.0,841.0,18.0,949.0,863.0,775.0,504.0,831.0,999.0,776.0,135.0,132.0,867.0,642.0,196.0,925.0,775.0,13.0,834.0


In [10]:
# We can apply a heatmap to get an overview more easily

nb.heatmap(vestly.frame.head(20))

Unnamed: 0,100503685,100503477,100503478,100516696,100445919,100440922,100558943,100401171,100566074,100463823,100031786,100035823,100039821,100124378,100164506,100495412,100235984,100325187,100351724,100387104
",",3361.0,1877.0,1933.0,46.0,2741.0,2279.0,2088.0,1356.0,2316.0,2726.0,2087.0,365.0,426.0,2745.0,1330.0,578.0,2904.0,2082.0,39.0,2039.0
og,2794.0,1114.0,1290.0,32.0,1631.0,1588.0,1389.0,915.0,1479.0,1475.0,1387.0,229.0,237.0,1616.0,1146.0,466.0,1640.0,1384.0,14.0,1336.0
.,2528.0,1664.0,1840.0,42.0,2365.0,2079.0,1979.0,1276.0,1734.0,2378.0,1983.0,330.0,435.0,1969.0,1932.0,688.0,2665.0,1992.0,45.0,2070.0
var,1531.0,458.0,587.0,9.0,691.0,660.0,622.0,293.0,704.0,700.0,621.0,75.0,91.0,617.0,459.0,81.0,649.0,623.0,12.0,656.0
det,1320.0,699.0,841.0,18.0,949.0,863.0,775.0,504.0,831.0,999.0,776.0,135.0,132.0,867.0,642.0,196.0,925.0,775.0,13.0,834.0
jeg,1303.0,163.0,154.0,3.0,325.0,99.0,120.0,193.0,375.0,262.0,120.0,64.0,48.0,517.0,91.0,86.0,272.0,119.0,2.0,162.0
på,961.0,489.0,518.0,19.0,651.0,641.0,629.0,366.0,500.0,666.0,630.0,89.0,111.0,584.0,437.0,146.0,619.0,632.0,7.0,634.0
i,879.0,413.0,385.0,20.0,544.0,461.0,466.0,334.0,429.0,559.0,466.0,63.0,65.0,488.0,546.0,178.0,528.0,457.0,17.0,473.0
til,721.0,330.0,373.0,8.0,447.0,408.0,376.0,244.0,422.0,495.0,374.0,77.0,84.0,410.0,348.0,89.0,479.0,384.0,2.0,420.0
en,697.0,203.0,306.0,19.0,365.0,378.0,286.0,250.0,309.0,375.0,286.0,68.0,53.0,378.0,178.0,152.0,354.0,294.0,15.0,290.0
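The books differ a lot in length (the column for 100516696 has 46 commas, the one for 100503685 has 3361), so raw counts let long books dominate the colouring. One option, sketched below under that assumption, is to convert counts to relative frequencies before comparing. The toy frame reuses three rows from those two columns; with the real data one would divide by the column sums of the full frame rather than of a three-word excerpt.

```python
import pandas as pd

# Three rows copied from two columns of the table above (toy excerpt).
counts = pd.DataFrame(
    {"100503685": [3361, 2794, 1303], "100516696": [46, 32, 3]},
    index=[",", "og", "jeg"],
)

# Divide each column by its own total so every book sums to 1.
relative = counts / counts.sum()
print(relative.round(3))
```

In this excerpt, "jeg" makes up a much larger share of the first book than of the second, a difference the raw counts alone exaggerate because of the size gap.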


### Frequency of specific terms

In [11]:
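def heatmap_loc(dr, ordliste, sorter=0):
    """Show a heatmap of the rows in `dr` for the words in `ordliste`, sorted by column number `sorter`."""
    df = dr.loc[ordliste]  # pick out the rows for the chosen words
    # Sort by the chosen column, highest counts first
    return nb.heatmap(df.sort_values(by=df.columns[sorter], ascending=False))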

In [12]:
# We use the function heatmap_loc, defined above, to pick out individual words from the corpus that we want to compare.


heatmap_loc(vestly.frame, "Mormor Morten Marte Ole Aleksander Guro Lillebror Knerten".split())


Unnamed: 0,100503685,100503477,100503478,100516696,100445919,100440922,100558943,100401171,100566074,100463823,100031786,100035823,100039821,100124378,100164506,100495412,100235984,100325187,100351724,100387104
Ole,13.0,20.0,46.0,0.0,13.0,7.0,46.0,309.0,2.0,93.0,46.0,84.0,104.0,0.0,1.0,0.0,1.0,46.0,0.0,2.0
Lillebror,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mormor,2.0,45.0,28.0,0.0,4.0,64.0,58.0,0.0,2.0,59.0,57.0,0.0,0.0,2.0,0.0,0.0,31.0,52.0,0.0,69.0
Marte,1.0,44.0,67.0,0.0,2.0,72.0,23.0,0.0,0.0,14.0,23.0,0.0,0.0,0.0,0.0,1.0,43.0,22.0,0.0,148.0
Knerten,1.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
Aleksander,1.0,20.0,43.0,0.0,12.0,7.0,40.0,305.0,1.0,89.0,40.0,84.0,103.0,0.0,1.0,0.0,1.0,40.0,0.0,2.0
Morten,0.0,78.0,323.0,0.0,7.0,147.0,136.0,0.0,17.0,517.0,136.0,0.0,0.0,0.0,0.0,0.0,144.0,134.0,0.0,180.0
Guro,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,440.0,0.0,0.0,0.0,0.0,506.0,1.0,0.0,0.0,0.0,0.0,0.0
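Selecting rows with `.loc` and a list of words raises a `KeyError` in current pandas versions if any word is absent from the index. A minimal sketch of a more forgiving selection uses `reindex` and fills missing words with zero. The helper name `select_words` and the toy frame are hypothetical; the counts for "Ole" and "Mormor" are copied from the first two columns above.

```python
import pandas as pd

def select_words(frame: pd.DataFrame, words: list[str]) -> pd.DataFrame:
    """Return the rows for `words`; words not in the index get zero counts."""
    return frame.reindex(words).fillna(0)

# Toy frame that, unlike the real one, has no row for "Knerten".
toy = pd.DataFrame(
    {"100503685": [13, 2], "100503477": [20, 45]},
    index=["Ole", "Mormor"],
)

picked = select_words(toy, ["Ole", "Mormor", "Knerten"])
print(picked)
```

Treating a missing word as zero is a design choice: it keeps the table rectangular, but it hides the difference between "not in this corpus" and "misspelled in the query".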


In [13]:
# Is the author consistent when it comes to word forms?

heatmap_loc(vestly.frame, "boken boka sola solen".split())

Unnamed: 0,100503685,100503477,100503478,100516696,100445919,100440922,100558943,100401171,100566074,100463823,100031786,100035823,100039821,100124378,100164506,100495412,100235984,100325187,100351724,100387104
boken,9.0,0.0,0.0,0.0,17.0,1.0,1.0,1.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
solen,2.0,0.0,3.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,1.0,1.0,3.0,0.0,0.0
boka,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,0.0,0.0,0.0,0.0
sola,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,7.0,2.0,0.0,1.0,0.0,2.0
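One way to put a number on this consistency is the share of one form among both variants in each book. The sketch below does this for "boka" versus "boken", using counts copied from three columns of the table above; the choice of measure is an assumption for illustration, not part of the original analysis.

```python
import pandas as pd

# Counts for "boken" and "boka" copied from three columns of the table above.
forms = pd.DataFrame(
    {"100445919": [17, 1], "100164506": [0, 3], "100495412": [0, 4]},
    index=["boken", "boka"],
)

# Share of the a-form among both variants, per book.
share_a = forms.loc["boka"] / forms.sum()
print(share_a.round(3))
```

A share near 0 or 1 means a book sticks to one form, and values in between indicate mixing. Books where neither form occurs give `NaN` (0 divided by 0) and are best left out of such a comparison.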
