
1.6. Frequency lists
Frequency lists extracted from texts can serve as a starting point for learning a great deal about a text.
Here we use the class Counts to fetch frequency lists (absolute frequencies) for a book.
urn is the identifier for the book in the collection of Nasjonalbiblioteket (the National Library of Norway). The example file on corpus building has more information about how to build a corpus. A URN is found in the book's metadata and has this form: https://urn.nb.no/URN:NBN:no-nb_digibok_2012051608012. The 13 digits at the end are what identify the book; the examples below pass the full URN string to Counts.
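As a small illustration of the URN format described above, the following sketch pulls the 13-digit identifier out of a URN or URN URL. The helper name `urn_digits` is hypothetical and not part of dhlab; it only assumes that a book URN ends in `digibok_` followed by 13 digits, as in the example address.

```python
import re


def urn_digits(urn: str) -> str:
    """Return the trailing 13-digit identifier of a digibok URN (hypothetical helper)."""
    # Assumption: book URNs end in "digibok_" followed by exactly 13 digits.
    match = re.search(r"digibok_(\d{13})$", urn.strip())
    if match is None:
        raise ValueError(f"no 13-digit book identifier found in {urn!r}")
    return match.group(1)


example = "https://urn.nb.no/URN:NBN:no-nb_digibok_2012051608012"
digits = urn_digits(example)
print(digits)
```

The same function accepts the bare form `URN:NBN:no-nb_digibok_...`, since only the end of the string is inspected.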
from dhlab import Counts, totals, Corpus
import pandas as pd
import dhlab.nbtext as nb
1.6.1. Frequency list for individual books
1.6.1.1. Short non-fiction book
urn = "URN:NBN:no-nb_digibok_2021011848587"
bok = Counts([urn])
bok.frame.iloc[:, 0]
, 184
. 162
og 94
i 93
det 51
...
ekstra 1
eget 1
egenkjærligheden 1
egenflige 1
eneste 1
Name: 100431574, Length: 1597, dtype: int64
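To make clear what an absolute frequency list is, here is a minimal plain-Python sketch that counts tokens in a toy sentence. It is not how Counts works internally (the counts above are fetched from the library's service); it only mirrors two things visible in the output: punctuation marks are counted as tokens, and the list is sorted from most to least frequent.

```python
from collections import Counter

# Toy text standing in for a book; the sentence is made up for illustration.
text = "det var en gang , og det var en bok ."

# Whitespace tokenisation: punctuation is kept as separate tokens.
tokens = text.split()

# Absolute frequencies: how many times each token occurs.
freq = Counter(tokens)

# The most frequent tokens first, like the head of the Series above.
for word, count in freq.most_common(5):
    print(word, count)
```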
We can compare the book's frequency list with a larger corpus, which we store here in the variable tot.
tot = totals()
We compute the ratio between each word's frequency in the book and its frequency among the 50,000 most frequent words in all of nb.no, and show the 20 words with the highest ratio.
(bok.frame.iloc[:, 0] / tot.iloc[:, 0]).sort_values(ascending=False).head(20)
Hedda 0.000050
def 0.000030
Ivo 0.000026
Skjønberg 0.000023
Claes 0.000019
Gill 0.000017
oppførelse 0.000016
teatrene 0.000016
VAN 0.000015
væsen 0.000013
Alten 0.000013
Rosenkrantz 0.000011
Kaas 0.000011
poetiske 0.000010
ensemblet 0.000010
Høst 0.000010
Jæger 0.000010
kontra 0.000009
Peggy 0.000009
Vauxhall 0.000009
dtype: float64
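The division above can be sketched in plain Python as follows. In pandas, dividing two Series aligns them on their index, so words that are missing from the reference list come out as NaN and sort last; the sketch simply skips them. The function name `frequency_ratio` and the reference count for "Hedda" are made up for illustration; the other numbers are taken from the outputs shown in this section.

```python
# Counts in the book (a few rows only).
book_counts = {"og": 94, "i": 93, "Hedda": 5, "egenflige": 1}

# Counts in the reference corpus; the value for "Hedda" is an invented toy number.
reference_counts = {"og": 2_520_268_056, "i": 2_531_262_027, "Hedda": 100_000}


def frequency_ratio(book, reference):
    """Rank words by book count divided by reference count (hypothetical helper)."""
    # Words absent from the reference list cannot be compared and are skipped.
    ratios = {w: c / reference[w] for w, c in book.items() if w in reference}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranked = frequency_ratio(book_counts, reference_counts)
for word, ratio in ranked:
    print(f"{word}\t{ratio:.9f}")
```

Words that are common everywhere, such as "og" and "i", get a tiny ratio, while names and topic words that are rare in the reference corpus rise to the top. This is why the list above is dominated by proper names.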
tot
| | freq |
|---|---|
| . | 7655423257 |
| , | 5052171514 |
| i | 2531262027 |
| og | 2520268056 |
| - | 1314451583 |
| ... | ... |
| tidspunkter | 110667 |
| dirigenter | 110660 |
| ondartet | 110652 |
| kulturtilbud | 110652 |
| trassig | 110651 |
50000 rows × 1 columns
1.6.1.2. Knausgård's Min kamp / Første bok
min_kamp_1 = Counts(["URN:NBN:no-nb_digibok_2014032405041"])
min_kamp_1.frame.head(20)
| | 100197175 |
|---|---|
| , | 14332 |
| . | 9480 |
| og | 5383 |
| det | 4430 |
| jeg | 4175 |
| i | 3657 |
| var | 3238 |
| på | 3152 |
| - | 2500 |
| som | 2463 |
| en | 2175 |
| hadde | 2010 |
| ikke | 1947 |
| av | 1900 |
| den | 1895 |
| til | 1815 |
| med | 1787 |
| sa | 1705 |
| meg | 1581 |
| å | 1575 |
(min_kamp_1.frame.iloc[:, 0] / tot.iloc[:,0]).sort_values(ascending=False).head(20).to_frame("ratio")
| | ratio |
|---|---|
| Yngve | 0.000572 |
| Farmor | 0.000376 |
| pappas | 0.000231 |
| farmor | 0.000203 |
| Ylva | 0.000190 |
| skrudde | 0.000154 |
| Jada | 0.000130 |
| Fløgstad | 0.000119 |
| helte | 0.000114 |
| koppen | 0.000108 |
| kroppene | 0.000108 |
| gitaren | 0.000105 |
| jævla | 0.000105 |
| rekkverket | 0.000104 |
| cola | 0.000103 |
| røykte | 0.000101 |
| sigaretten | 0.000101 |
| kløv | 0.000097 |
| faen | 0.000094 |
| stuen | 0.000090 |
1.6.2. Comparing frequencies across different books
# Frequency lists for 20 books by Anne-Cath. Vestly
vestly = Counts(
    Corpus(
        author="Anne*Vestly*",
        limit=20
    )
)
vestly.frame.head()
| | 100503685 | 100503477 | 100503478 | 100516696 | 100445919 | 100440922 | 100558943 | 100401171 | 100566074 | 100463823 | 100031786 | 100035823 | 100039821 | 100124378 | 100164506 | 100495412 | 100235984 | 100325187 | 100351724 | 100387104 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| , | 3361.0 | 1877.0 | 1933.0 | 46.0 | 2741.0 | 2279.0 | 2088.0 | 1356.0 | 2316.0 | 2726.0 | 2087.0 | 365.0 | 426.0 | 2745.0 | 1330.0 | 578.0 | 2904.0 | 2082.0 | 39.0 | 2039.0 |
| og | 2794.0 | 1114.0 | 1290.0 | 32.0 | 1631.0 | 1588.0 | 1389.0 | 915.0 | 1479.0 | 1475.0 | 1387.0 | 229.0 | 237.0 | 1616.0 | 1146.0 | 466.0 | 1640.0 | 1384.0 | 14.0 | 1336.0 |
| . | 2528.0 | 1664.0 | 1840.0 | 42.0 | 2365.0 | 2079.0 | 1979.0 | 1276.0 | 1734.0 | 2378.0 | 1983.0 | 330.0 | 435.0 | 1969.0 | 1932.0 | 688.0 | 2665.0 | 1992.0 | 45.0 | 2070.0 |
| var | 1531.0 | 458.0 | 587.0 | 9.0 | 691.0 | 660.0 | 622.0 | 293.0 | 704.0 | 700.0 | 621.0 | 75.0 | 91.0 | 617.0 | 459.0 | 81.0 | 649.0 | 623.0 | 12.0 | 656.0 |
| det | 1320.0 | 699.0 | 841.0 | 18.0 | 949.0 | 863.0 | 775.0 | 504.0 | 831.0 | 999.0 | 776.0 | 135.0 | 132.0 | 867.0 | 642.0 | 196.0 | 925.0 | 775.0 | 13.0 | 834.0 |
# We can apply a heatmap to make the table easier to survey
nb.heatmap(vestly.frame.head(20))
| | 100503685 | 100503477 | 100503478 | 100516696 | 100445919 | 100440922 | 100558943 | 100401171 | 100566074 | 100463823 | 100031786 | 100035823 | 100039821 | 100124378 | 100164506 | 100495412 | 100235984 | 100325187 | 100351724 | 100387104 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| , | 3361.000000 | 1877.000000 | 1933.000000 | 46.000000 | 2741.000000 | 2279.000000 | 2088.000000 | 1356.000000 | 2316.000000 | 2726.000000 | 2087.000000 | 365.000000 | 426.000000 | 2745.000000 | 1330.000000 | 578.000000 | 2904.000000 | 2082.000000 | 39.000000 | 2039.000000 |
| og | 2794.000000 | 1114.000000 | 1290.000000 | 32.000000 | 1631.000000 | 1588.000000 | 1389.000000 | 915.000000 | 1479.000000 | 1475.000000 | 1387.000000 | 229.000000 | 237.000000 | 1616.000000 | 1146.000000 | 466.000000 | 1640.000000 | 1384.000000 | 14.000000 | 1336.000000 |
| . | 2528.000000 | 1664.000000 | 1840.000000 | 42.000000 | 2365.000000 | 2079.000000 | 1979.000000 | 1276.000000 | 1734.000000 | 2378.000000 | 1983.000000 | 330.000000 | 435.000000 | 1969.000000 | 1932.000000 | 688.000000 | 2665.000000 | 1992.000000 | 45.000000 | 2070.000000 |
| var | 1531.000000 | 458.000000 | 587.000000 | 9.000000 | 691.000000 | 660.000000 | 622.000000 | 293.000000 | 704.000000 | 700.000000 | 621.000000 | 75.000000 | 91.000000 | 617.000000 | 459.000000 | 81.000000 | 649.000000 | 623.000000 | 12.000000 | 656.000000 |
| det | 1320.000000 | 699.000000 | 841.000000 | 18.000000 | 949.000000 | 863.000000 | 775.000000 | 504.000000 | 831.000000 | 999.000000 | 776.000000 | 135.000000 | 132.000000 | 867.000000 | 642.000000 | 196.000000 | 925.000000 | 775.000000 | 13.000000 | 834.000000 |
| jeg | 1303.000000 | 163.000000 | 154.000000 | 3.000000 | 325.000000 | 99.000000 | 120.000000 | 193.000000 | 375.000000 | 262.000000 | 120.000000 | 64.000000 | 48.000000 | 517.000000 | 91.000000 | 86.000000 | 272.000000 | 119.000000 | 2.000000 | 162.000000 |
| på | 961.000000 | 489.000000 | 518.000000 | 19.000000 | 651.000000 | 641.000000 | 629.000000 | 366.000000 | 500.000000 | 666.000000 | 630.000000 | 89.000000 | 111.000000 | 584.000000 | 437.000000 | 146.000000 | 619.000000 | 632.000000 | 7.000000 | 634.000000 |
| i | 879.000000 | 413.000000 | 385.000000 | 20.000000 | 544.000000 | 461.000000 | 466.000000 | 334.000000 | 429.000000 | 559.000000 | 466.000000 | 63.000000 | 65.000000 | 488.000000 | 546.000000 | 178.000000 | 528.000000 | 457.000000 | 17.000000 | 473.000000 |
| til | 721.000000 | 330.000000 | 373.000000 | 8.000000 | 447.000000 | 408.000000 | 376.000000 | 244.000000 | 422.000000 | 495.000000 | 374.000000 | 77.000000 | 84.000000 | 410.000000 | 348.000000 | 89.000000 | 479.000000 | 384.000000 | 2.000000 | 420.000000 |
| en | 697.000000 | 203.000000 | 306.000000 | 19.000000 | 365.000000 | 378.000000 | 286.000000 | 250.000000 | 309.000000 | 375.000000 | 286.000000 | 68.000000 | 53.000000 | 378.000000 | 178.000000 | 152.000000 | 354.000000 | 294.000000 | 15.000000 | 290.000000 |
| ikke | 676.000000 | 343.000000 | 371.000000 | 5.000000 | 502.000000 | 419.000000 | 372.000000 | 194.000000 | 417.000000 | 472.000000 | 372.000000 | 55.000000 | 77.000000 | 478.000000 | 234.000000 | 94.000000 | 493.000000 | 371.000000 | 8.000000 | 420.000000 |
| som | 667.000000 | 244.000000 | 289.000000 | 8.000000 | 305.000000 | 328.000000 | 278.000000 | 166.000000 | 323.000000 | 382.000000 | 278.000000 | 47.000000 | 41.000000 | 355.000000 | 345.000000 | 112.000000 | 314.000000 | 277.000000 | 2.000000 | 264.000000 |
| med | 611.000000 | 207.000000 | 301.000000 | 1.000000 | 350.000000 | 286.000000 | 256.000000 | 122.000000 | 280.000000 | 321.000000 | 256.000000 | 31.000000 | 44.000000 | 316.000000 | 225.000000 | 97.000000 | 307.000000 | 256.000000 | 5.000000 | 263.000000 |
| så | 601.000000 | 493.000000 | 552.000000 | 10.000000 | 729.000000 | 633.000000 | 534.000000 | 377.000000 | 625.000000 | 734.000000 | 535.000000 | 86.000000 | 119.000000 | 776.000000 | 325.000000 | 117.000000 | 726.000000 | 536.000000 | 6.000000 | 543.000000 |
| hadde | 595.000000 | 189.000000 | 261.000000 | 0.000000 | 336.000000 | 252.000000 | 232.000000 | 100.000000 | 329.000000 | 309.000000 | 232.000000 | 22.000000 | 46.000000 | 262.000000 | 195.000000 | 36.000000 | 300.000000 | 232.000000 | 0.000000 | 294.000000 |
| vi | 545.000000 | 146.000000 | 159.000000 | 3.000000 | 187.000000 | 256.000000 | 179.000000 | 134.000000 | 185.000000 | 285.000000 | 179.000000 | 36.000000 | 59.000000 | 274.000000 | 82.000000 | 45.000000 | 306.000000 | 178.000000 | 0.000000 | 149.000000 |
| å | 541.000000 | 293.000000 | 309.000000 | 5.000000 | 356.000000 | 374.000000 | 377.000000 | 199.000000 | 290.000000 | 417.000000 | 376.000000 | 47.000000 | 64.000000 | 306.000000 | 257.000000 | 124.000000 | 387.000000 | 373.000000 | 3.000000 | 385.000000 |
| for | 531.000000 | 262.000000 | 291.000000 | 5.000000 | 413.000000 | 331.000000 | 341.000000 | 173.000000 | 392.000000 | 404.000000 | 341.000000 | 47.000000 | 46.000000 | 374.000000 | 223.000000 | 75.000000 | 330.000000 | 338.000000 | 1.000000 | 274.000000 |
| Det | 478.000000 | 229.000000 | 271.000000 | 5.000000 | 296.000000 | 299.000000 | 252.000000 | 114.000000 | 252.000000 | 310.000000 | 252.000000 | 29.000000 | 55.000000 | 303.000000 | 143.000000 | 47.000000 | 325.000000 | 250.000000 | 2.000000 | 282.000000 |
| at | 448.000000 | 193.000000 | 259.000000 | 11.000000 | 291.000000 | 259.000000 | 263.000000 | 148.000000 | 289.000000 | 313.000000 | 263.000000 | 40.000000 | 40.000000 | 277.000000 | 214.000000 | 69.000000 | 238.000000 | 256.000000 | 2.000000 | 259.000000 |
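The columns in the table differ greatly in size (one book has 46 commas, another 3361), so raw counts largely reflect how long each book is. One common option, sketched here in plain Python and not taken from the original notebook, is to divide each count by the book's total before comparing. The counts below are copied from two columns of the table, but the totals are computed over these three tokens only, so the resulting shares are illustrative.

```python
# Counts for three tokens in two books of very different length.
books = {
    "100503685": {",": 3361, "og": 2794, "jeg": 1303},
    "100516696": {",": 46, "og": 32, "jeg": 3},
}


def relative(counts):
    """Turn absolute counts into shares of the book's total (hypothetical helper)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


rel = {book_id: relative(counts) for book_id, counts in books.items()}

for book_id, shares in rel.items():
    print(book_id, {word: round(share, 3) for word, share in shares.items()})
```

With relative frequencies the short book no longer looks uniformly pale next to the long one, and differences such as the much lower share of "jeg" become visible.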
1.6.3. Frequency of specific terms
def heatmap_loc(dr, ordliste, sorter=0):
    # Select the rows for the words in ordliste, sort them by column number `sorter`, and show a heatmap
    df = dr.loc[ordliste]
    return nb.heatmap(df.sort_values(by=df.columns[sorter], ascending=False))
# The function heatmap_loc defined above picks out the individual words from the corpus that we want to compare.
heatmap_loc(vestly.frame, "Mormor Morten Marte Ole Aleksander Guro Lillebror Knerten".split())
| | 100503685 | 100503477 | 100503478 | 100516696 | 100445919 | 100440922 | 100558943 | 100401171 | 100566074 | 100463823 | 100031786 | 100035823 | 100039821 | 100124378 | 100164506 | 100495412 | 100235984 | 100325187 | 100351724 | 100387104 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ole | 13.000000 | 20.000000 | 46.000000 | 0.000000 | 13.000000 | 7.000000 | 46.000000 | 309.000000 | 2.000000 | 93.000000 | 46.000000 | 84.000000 | 104.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 46.000000 | 0.000000 | 2.000000 |
| Lillebror | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Mormor | 2.000000 | 45.000000 | 28.000000 | 0.000000 | 4.000000 | 64.000000 | 58.000000 | 0.000000 | 2.000000 | 59.000000 | 57.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 31.000000 | 52.000000 | 0.000000 | 69.000000 |
| Marte | 1.000000 | 44.000000 | 67.000000 | 0.000000 | 2.000000 | 72.000000 | 23.000000 | 0.000000 | 0.000000 | 14.000000 | 23.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 43.000000 | 22.000000 | 0.000000 | 148.000000 |
| Knerten | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Aleksander | 1.000000 | 20.000000 | 43.000000 | 0.000000 | 12.000000 | 7.000000 | 40.000000 | 305.000000 | 1.000000 | 89.000000 | 40.000000 | 84.000000 | 103.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 40.000000 | 0.000000 | 2.000000 |
| Morten | 0.000000 | 78.000000 | 323.000000 | 0.000000 | 7.000000 | 147.000000 | 136.000000 | 0.000000 | 17.000000 | 517.000000 | 136.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 144.000000 | 134.000000 | 0.000000 | 180.000000 |
| Guro | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 440.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 506.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
# Is the author consistent when it comes to word forms?
heatmap_loc(vestly.frame, "sola solen boka boken".split())
| | 100503685 | 100503477 | 100503478 | 100516696 | 100445919 | 100440922 | 100558943 | 100401171 | 100566074 | 100463823 | 100031786 | 100035823 | 100039821 | 100124378 | 100164506 | 100495412 | 100235984 | 100325187 | 100351724 | 100387104 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| boken | 9.000000 | 0.000000 | 0.000000 | 0.000000 | 17.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| solen | 2.000000 | 0.000000 | 3.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 3.000000 | 0.000000 | 0.000000 |
| boka | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| sola | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 |
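One way to summarise the table above is to compute, for each book, the share of one variant out of both variants. The sketch below does this in plain Python for "sola" versus "solen", using counts copied from three columns of the table; the helper name `variant_share` is hypothetical and not part of dhlab.

```python
# Counts for the two forms in three of the books (values copied from the table).
counts_by_book = {
    "100503685": {"sola": 0, "solen": 2},
    "100164506": {"sola": 7, "solen": 0},
    "100516696": {"sola": 0, "solen": 0},
}


def variant_share(counts, form_a, form_b):
    """Share of form_a among occurrences of form_a and form_b, or None if neither occurs."""
    a = counts.get(form_a, 0)
    b = counts.get(form_b, 0)
    total = a + b
    return None if total == 0 else a / total


shares = {
    book_id: variant_share(counts, "sola", "solen")
    for book_id, counts in counts_by_book.items()
}
print(shares)
```

A share of 1.0 or 0.0 means the book uses only one of the forms, values in between indicate mixed usage, and None marks books where neither form occurs, which says nothing about consistency.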