Word Frequency Analysis – The Most Common Words

There are any number of reasons why you might need a list of the most common words in the language. In my case, I was working on a piece of software to speed the process of building indexes for my print books. My program reads the book and suggests a list of words that the author might want to include in the index. It needed a list of the most common words so it would know not to bother suggesting them. I’ll post that script in a couple of days. For now, though, I thought I would give you a very simple piece of Python code that reads a directory full of text files, counts how many times each word occurs, and prints a list of those which show up most often. I set it to give me the most common 1000 words. You could generate a list of any length, though, just by changing one number in the code.

If you don’t care to look behind the curtain and just want to cut and paste my word list, feel free to scroll down to the bottom of the post.

For raw data, I used a sample of 37,358 Project Gutenberg texts. PG is kind enough to offer an interface for researchers like me to harvest books. Note that this would work nearly as well with a much smaller sample, but I had already downloaded the books for another project, so I figured I might as well use them. If you use a PG harvest for your data set, make sure to remove the Human Genome Project gene sequence files (a full dump contains at least three copies of the full human genome). Otherwise, this script will come to grief trying to count each gene as a word. One quick way to weed those files out is sketched below.
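Here is one way to do that weeding; a minimal sketch, assuming (check this against your own dump) that the offending files can be spotted by the phrase "Human Genome Project" somewhere in their first few kilobytes:

import os, shutil

#move suspected gene-sequence files into a separate directory
#the marker phrase below is an assumption; verify it before relying on it
if not os.path.isdir('excluded'):
    os.mkdir('excluded')

for f in os.listdir('.'):
    if os.path.splitext(f)[1] == '.txt':
        with open(f, 'rb') as infile:
            head = infile.read(8192)
        if 'Human Genome Project' in head:
            shutil.move(f, os.path.join('excluded', f))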

Note that, as currently written, this script requires GNU Aspell and a system that works correctly with pipes. This means it should run fine on nearly any Unix-like system, but you Windoze people are on your own.

The first part of the script loads a few standard modules. Then it gets a listing of the current directory and starts looping through each text file in it. With each iteration it prints a status message showing the file name and percent completion. With scripts like this, which take a day or two to run, I like to be able to see at a glance how far along I am. As an aside, if you access your computer through a terminal, as I do, you will probably want to use GNU Screen or a similar utility to protect yourself from accidental disconnects while the script is still running.

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, string, subprocess

wordcounts={}

filelist = os.listdir('.')

counter = 0
for f in filelist:
    
    counter += 1
    
    if os.path.splitext(f)[1] == '.txt':
        print f+'\t', str(counter)+' of '+str(len(filelist))+'\t',
        print str((float(counter)/float(len(filelist)))*100)[:6]+'%'

The next portion opens each book file and reads it in. Next, because I’m using PG books as a data set, I need to strip off all of the boilerplate license text that occurs at the beginning and end of the files. Otherwise, because similar text appears in every file, it would skew the word distributions. Luckily, PG marks the actual text of the book by bracketing it in the phrases “START OF THIS PROJECT GUTENBERG EBOOK” and “END OF THIS PROJECT GUTENBERG EBOOK”. The front part is easy: we just do a string find to get the location of the first line-feed character after the start text appears. The end part is a little trickier; the easiest way to get it is to reverse the whole book. This means, however, that we also need to flip the search text. Pretty neato, huh?

        with open(f, "rb") as infile:  book = infile.read()

        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate
        if "PROJECT GUTENBERG EBOOK" in book:

            b, e = 0, 0    #fall back to the whole book if a marker is missing
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = string.join(c, '')

            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = string.join(c, '')
            book = book[b:len(book)-e]
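If the reversal trick seems like more cleverness than the job needs, str.rfind, which searches from the right, gets essentially the same slice without flipping anything. A minimal equivalent sketch, not what the script above actually does:

        #alternative: search from the right with rfind instead of reversing
        a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
        d = book.rfind("END OF THIS PROJECT GUTENBERG EBOOK")
        if a != -1 and d != -1:
            book = book[book.find('\n', a):d]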

The next step is to check the book text for words that aren’t in the dictionary, since there is no reason to count words that aren’t part of standard English. The easiest way to do this on a Linux system like mine is to run the system spellchecker, Aspell, over the file. We also want to eliminate duplicate words from this list, since that will save iterations later.

        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True).split()

        #find unique words
        u_oddwords = []
        for w in oddwords:
            if w not in u_oddwords: u_oddwords.append(w)
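Incidentally, since u_oddwords is only ever used for membership tests, a Python set does the same deduplication with much faster lookups:

        #drop-in replacement for the loop above, with O(1) membership tests
        u_oddwords = set(oddwords)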

Next, we go through the book text and strip out most of the punctuation. The string of punctuation to be removed looks a lot like what you get by calling string.punctuation. Note, though, that I left in the “‘” and “-” characters because they are actually part of contractions and compound words, respectively. I also split the book text, which at this point is one big string, into a list of words and fold them to lower case so that capitalized and uncapitalized forms are counted together.

        #strip out most of the punctuation, keeping ' and - because
        #they occur inside contractions and compound words
        book2 = ''
        for ch in book:
            if ch not in '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~':
                book2 = book2 + ch

        book = book2.lower().split()
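On files this size the character-by-character loop is slow; Python 2’s str.translate deletes the same characters in a single pass, so the block above can be collapsed to one line:

        #one-pass equivalent: translate(None, chars) deletes those characters
        book = book.translate(None, '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~').lower().split()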

In the final segment of the script we count how many times the words occur and update the counters, which are kept as a dictionary object. Then we convert the dictionary to a list, sort it, and print the 1000 most common words to a CSV data file. If you need a different number of words, just change the 1000 to another value.

        for w in book:
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1


final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])

final_list.sort()
final_list.reverse()


with open('wordcounts_pg', 'w') as wc_output:

    for i in range(min(1000, len(final_list))):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
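As an aside, if you are on Python 2.7 or later, collections.Counter wraps up both the counting and the sort-by-frequency step; a minimal sketch of the same logic:

from collections import Counter

wordcounts = Counter()
#...inside the per-file loop, in place of the counting code above:
#    wordcounts.update(w for w in book if w not in u_oddwords)
#...after the loop, the 1000 most frequent (word, count) pairs:
top_words = wordcounts.most_common(1000)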
        

That’s all there is to it. Pretty easy, huh? Now set it to run, detach the terminal, and ignore it until this time tomorrow. My machine can count words in about 1500 books per hour, so it takes about 25 hours to make it through the full sample.

And now, finally, here is the list of words. Feel free to cut and paste it to use for your own projects:

Word Occurrences
the 149164503
of 81154540
and 73797877
to 60771291
a 47925287
in 41773446
that 26590286
was 24584688
he 24462836
i 24025629
it 22795878
his 20173668
is 18378165
with 18081192
as 17645451
for 17473870
had 14408612
you 13939609
be 13252982
on 13207285
not 13181744
at 13015022
but 12718486
by 12438046
her 11878371
which 10826405
this 10263128
have 10196168
from 10088968
she 9778689
they 9715080
all 8819085
him 8771048
were 8314601
or 8143254
are 7787136
my 7572900
we 7412199
one 7373621
so 7203582
their 7018823
an 6518028
me 6419080
there 6267776
no 6185033
said 5938853
when 5899530
who 5878132
them 5808758
been 5787319
would 5689624
if 5655080
will 5166315
what 4895509
out 4556168
more 4440752
up 4416055
then 4222409
into 4129481
has 4000893
some 3929663
do 3914008
could 3749041
now 3747314
very 3630489
time 3571298
man 3559452
its 3544086
your 3522411
our 3517346
than 3494543
about 3349698
upon 3337366
other 3316391
only 3285019
any 3236410
little 3183383
like 2993385
these 2979508
two 2943507
may 2934056
did 2915540
after 2853393
see 2852408
made 2842273
great 2839852
before 2774768
can 2746279
such 2734113
should 2708032
over 2672597
us 2651042
first 2553483
well 2517899
must 2484839
mr 2465607
down 2433044
much 2428947
good 2376889
know 2372135
where 2353232
old 2291164
men 2286995
how 2261780
come 2217201
most 2188746
never 2160804
those 2135489
here 2122731
day 2071427
came 2061124
way 2042813
own 2037103
go 2009804
life 2007769
long 1992150
through 1989883
many 1982797
being 1976737
himself 1941387
even 1915129
shall 1890432
back 1865988
make 1852069
again 1848115
every 1845835
say 1817170
too 1810172
might 1807261
without 1781441
while 1759890
same 1701541
am 1696903
new 1687809
think 1665563
just 1660367
under 1649489
still 1643537
last 1616539
take 1614771
went 1595714
people 1593685
away 1582685
found 1574065
yet 1563963
thought 1556184
place 1543300
hand 1500131
though 1481938
small 1478723
eyes 1469270
also 1467931
house 1438223
years 1435529
1433313
another 1415606
don’t 1381480
young 1379348
three 1378462
once 1377940
off 1376942
work 1375035
right 1360201
get 1345597
nothing 1344419
against 1325938
left 1289397
ever 1269433
part 1261573
let 1260289
each 1258840
give 1258179
head 1254870
face 1253762
god 1249406
0 1239969
between 1225531
world 1219519
few 1213621
put 1200519
saw 1190392
things 1188437
took 1172602
letter 1167755
tell 1160034
because 1155609
far 1154860
always 1152942
night 1152416
mrs 1137055
love 1121812
both 1111644
sir 1100855
why 1097538
look 1095059
having 1069812
mind 1067461
father 1062643
called 1062190
side 1053255
looked 1051044
home 1036554
find 1036485
going 1034663
whole 1033731
seemed 1031466
however 1027701
country 1026854
got 1024945
thing 1022424
name 1020634
among 1019175
seen 1012779
heart 1011155
told 1004061
done 1000189
king 995498
water 994392
asked 993082
heard 983747
soon 982546
whom 979785
better 978434
something 957812
knew 956448
lord 956398
course 953585
end 947889
days 929530
moment 926478
enough 925144
almost 916006
general 903316
quite 902582
until 902333
thus 900738
hands 899106
nor 876106
light 869941
room 869532
since 864596
woman 864072
words 858824
gave 857475
b 853639
mother 852308
set 851757
white 850183
taken 848343
given 838078
large 835292
best 833941
brought 833270
does 826725
next 823345
whose 821731
state 820812
yes 817047
oh 815302
door 804702
turned 804433
others 800845
poor 800544
power 797133
present 792424
want 791194
perhaps 789201
death 788617
morning 786748
la 783512
rather 775384
word 774340
miss 771733
less 770410
during 763957
began 762442
themselves 762418
felt 757580
half 752587
lady 742708
full 742062
voice 740567
cannot 738450
feet 737299
order 736997
near 736832
true 735006
1 730887
it’s 727886
matter 726818
stood 725802
together 725703
year 723517
used 723293
war 720950
till 720824
use 719314
thou 714663
son 714275
high 713720
round 710093
above 709745
certain 703716
often 698006
kind 696975
indeed 696469
i’m 690646
along 688169
case 688098
fact 687334
myself 684387
children 683334
anything 682888
four 677704
dear 676320
keep 675722
nature 674055
known 671288
point 668710
p 668356
friend 666493
says 666011
passed 665792
within 665633
land 663605
sent 662540
church 659035
believe 656459
girl 652783
city 650397
times 649022
form 647388
herself 646989
therefore 644835
hundred 640059
john 639007
wife 636379
fire 632762
several 632704
body 630129
sure 629252
money 629251
means 627640
air 626921
open 626306
held 625660
second 622526
gone 614808
already 613870
least 609236
alone 606078
hope 602206
thy 599253
chapter 597339
whether 596307
boy 596048
english 594784
itself 591838
2 591413
women 589579
hear 587189
cried 586705
leave 586112
either 581618
number 576685
rest 575648
child 574531
behind 572007
read 571445
lay 571286
black 569530
government 567320
friends 567282
became 564384
around 559161
river 556286
sea 552753
ground 550622
help 549284
c 548349
i’ll 546929
short 546465
question 545629
reason 545464
become 544896
call 544576
replied 544286
town 543694
family 542309
england 542109
lost 537241
speak 537188
answered 536154
five 535088
coming 534713
possible 534639
making 530530
hour 530471
dead 529575
really 528631
looking 528622
law 528248
captain 525928
different 522269
manner 519256
business 516115
states 511757
earth 511042
st 510820
human 510666
early 508769
sometimes 507383
spirit 506297
care 505984
sat 505109
public 504862
close 503948
towards 503262
kept 502051
french 501813
party 500749
truth 500365
line 498822
strong 498492
book 496520
able 494330
later 494101
return 492237
hard 490701
mean 489853
feel 487798
story 486538
m 485841
received 483744
following 481558
fell 480591
wish 480562
person 480508
beautiful 479656
seems 477423
dark 476293
history 475744
followed 474307
subject 473058
thousand 470929
ten 469675
returned 469387
thee 467513
age 466838
turn 466674
fine 466630
across 466545
show 465685
arms 465504
character 464946
live 464642
soul 463939
met 463300
evening 463176
die 462851
common 459553
ready 457764
suddenly 456627
doubt 455415
bring 453346
ii 453190
red 450793
free 447675
that’s 445572
account 444530
cause 444403
necessary 444147
can’t 443812
need 443326
answer 442440
miles 441924
carried 438793
although 438423
fear 437796
hold 437493
interest 437382
force 436993
illustration 436577
sight 435854
act 435269
master 433105
ask 432510
idea 432424
ye 432036
sense 430693
an’ 430321
art 430226
position 429722
rose 428624
3 427441
company 427142
road 425669
further 425131
nearly 424118
table 424064
everything 423740
brother 423088
sort 422809
south 421800
reached 420190
london 418755
six 418131
didn’t 416216
cut 412716
taking 412571
continued 411607
understand 411326
appeared 409564
sun 407584
none 407168
else 406851
big 406799
o 406388
longer 406382
deep 406170
army 405897
beyond 405580
view 404378
strange 400814
natural 400483
talk 399814
north 398556
suppose 396693
court 396267
service 393925
bed 393878
past 393609
ought 393331
street 392970
cold 391836
hours 391460
toward 390231
added 389818
spoke 389420
seem 388757
neither 388355
late 388105
probably 387568
real 386926
clear 385649
chief 385350
run 385269
certainly 385179
est 384982
united 384930
stand 384385
forward 384028
front 383866
purpose 382457
sound 382443
feeling 382032
eye 380164
happy 378251
i’ve 377633
except 374853
knowledge 374155
blood 373563
low 373268
remember 373173
pretty 372548
change 372221
living 371264
american 369773
bad 369425
horse 369396
peace 369168
meet 366864
effect 365907
boys 364460
en 364172
school 362681
comes 362575
france 360771
fair 359826
forth 359249
died 359161
fall 358176
placed 357047
note 354944
led 354740
saying 354703
length 354502
pass 353234
gold 350268
entered 349397
doing 348304
latter 347844
written 347699
laid 346808
4 344382
according 343990
daughter 343682
opened 343526
dr 340867
trees 339826
distance 339817
office 339771
attention 339722
hair 337835
n 337111
prince 335635
wild 335514
wanted 335167
society 335139
husband 332251
play 331807
wind 330079
green 329633
greater 329453
tried 328784
west 328702
important 327851
ago 327793
bear 325469
various 325246
especially 324511
mine 321967
paper 320046
island 320002
glad 319989
makes 319717
instead 319188
faith 318882
lived 318731
pay 318090
heaven 316878
ran 315958
s 315761
blue 315697
minutes 315172
duty 315065
foot 314708
ship 314700
fellow 314523
letters 313624
persons 311105
action 310840
below 309831
heavy 309808
york 309749
strength 308836
pleasure 307965
immediately 307823
remained 307750
save 306991
standing 306911
whatever 306070
won’t 305381
trouble 305338
e 305293
window 305257
object 305202
try 304928
parts 304007
period 303992
desire 303985
beauty 303513
opinion 303459
arm 303347
system 302641
third 302389
chance 301890
books 301331
george 300975
doctor 300779
british 300353
silence 300238
he’s 300053
enemy 298899
hardly 298533
5 296045
greek 295622
exclaimed 294602
send 293592
food 293239
happened 293092
lips 292334
sleep 291632
influence 290698
slowly 290590
works 289252
months 288930
generally 288629
gentleman 287966
beginning 287473
tree 287341
boat 286781
mouth 285685
there’s 285569
sweet 285425
drew 284944
deal 284389
v 284339
future 284186
queen 284002
yourself 283364
condition 283335
figure 283153
single 283016
smile 282793
places 282793
besides 281838
girls 281703
rich 281130
afterwards 281017
battle 280676
thinking 280651
footnote 280245
presence 279893
stone 279829
appearance 279691
follow 279498
iii 279239
started 278072
caught 277993
ancient 277595
filled 277238
walked 276882
impossible 276720
broken 276365
former 276016
century 275990
march 275880
274800
field 274479
horses 274255
stay 274139
twenty 273187
sister 272290
getting 271641
william 270478
knows 269506
afraid 269150
result 268749
seeing 268724
you’re 268500
hall 267020
carry 266780
arrived 266706
easy 266309
lines 265956
wrote 265929
east 265852
top 265242
wall 264942
merely 264898
giving 264484
raised 264154
appear 264015
simple 263923
thoughts 263760
struck 263694
moved 263492
mary 263463
direction 263444
christ 263262
wood 263260
born 263084
quickly 262966
paris 262393
man’s 262105
visit 261882
outside 260418
holy 260348
entirely 259045
somewhat 259020
week 258960
laughed 258562
secret 258198
village 257758
henry 257557
christian 257504
danger 257486
wait 257012
wonder 256770
learned 256420
stopped 256191
tom 256117
covered 256117
6 255876
bright 255349
walk 255090
leaving 254851
experience 254763
unto 254610
particular 254564
loved 254479
usual 254307
plain 253867
to-day 253804
seven 253567
wrong 253172
easily 252954
occasion 252780
formed 252707
ah 252144
uncle 252120
quiet 252035
write 251743
scene 251380
evil 250993
married 250965
please 250781
fresh 250507
camp 249947
german 248539
beside 248522
mere 248276
fight 247957
showed 247904
grew 247866
expression 247804
scarcely 247641
board 247578
command 247398
language 247302
considered 247260
regard 247101
hill 246854
finally 246533
national 246452
paid 246364
joy 246060
worth 245352
piece 244733
religion 244677
perfect 244671
royal 244615
tears 244448
president 244135
value 244084
dinner 243572
spring 242721
produced 242576
middle 242282
charles 242134
brown 241885
expected 241668
lower 241299
circumstances 241150
remain 241102
wide 240773
political 240686
charge 240464
success 240254
per 240083
officers 239806
hath 239618
indian 239572
observed 239548
lives 239448
respect 238787
greatest 238784
w 238776
cases 238527
tone 238005
america 237215
youth 236992
summer 236698
garden 236552
music 236354
waiting 236223
due 236178
modern 235763
jack 235557
unless 235428
study 235093
allowed 234852
leaves 234652
bit 233774
race 233156
military 232907
news 232435
meant 232274
afternoon 232063
winter 231867
picture 231735
houses 231575
goes 231281
sudden 230675
proper 230476
justice 230410
difficult 229784
changed 229658
grace 229281
chair 228931
10 228875
private 228392
eight 228222
hot 227873
reach 226608
silent 226552
‘i 226540
flowers 226379
laws 226197
noble 225931
watch 225328
floor 225326
killed 225020
built 224484
declared 224477
judge 224393
colonel 224303
members 224213
broke 224166
fast 223897
duke 223481
o’ 223293
shot 223105
sit 222222
usually 222162
step 222119
speaking 222101
attempt 221687
marriage 221054
walls 220575
stop 220466
special 220316
religious 220300
discovered 220260
beneath 219894
supposed 219260
james 219013
gives 218988
forms 218743
turning 218692
authority 218686
original 218519
straight 218414
property 218393
page 218233
plan 218185
drawn 217873
personal 217458
l 217130
cry 217022
passing 216926
class 216527
likely 216216
sitting 215841
cross 215821
spot 215719
soldiers 215683
escape 215311
complete 215288
eat 215120
bound 214985
conversation 214895
trying 214332
meeting 213898
determined 213756
simply 213506
shown 213457
bank 213261
shore 212917
running 212509
corner 212507
soft 212163
journey 212007
isn’t 211316
i’d 211132
reply 210852
author 210827
believed 210653
rate 210607
prepared 210558
lead 210548
existence 210220
enter 209851
indians 209589
troops 209398
wished 209068
glass 208986
notice 208859
higher 208770
social 208685
iron 208019
rule 207943
orders 207856
building 207813
madame 207780
mountains 207700
minute 207575
receive 207440
offered 207306
h 206821
names 206725
learn 206618
similar 206437
closed 206419
considerable 206102
lake 206017
wouldn’t 206012
8 205864
pleasant 205487
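If you want to use the list programmatically rather than eyeballing it, a short loader will do; a minimal sketch, assuming one “word, count” pair per line in a file named wordcounts_pg (the format the script writes):

def load_common_words(path='wordcounts_pg'):
    '''Return the common words from a "word, count" file, most frequent first.'''
    words = []
    with open(path) as infile:
        for line in infile:
            parts = line.rsplit(',', 1)
            if len(parts) == 2:
                words.append(parts[0].strip())
    return words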

And here is the complete script:

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, string, subprocess

wordcounts = {}

filelist = os.listdir('.')

counter = 0
for f in filelist:

    counter += 1

    if os.path.splitext(f)[1] == '.txt':
        print f+'\t', str(counter)+' of '+str(len(filelist))+'\t',
        print str((float(counter)/float(len(filelist)))*100)[:6]+'%'

        with open(f, "rb") as infile:  book = infile.read()

        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate
        if "PROJECT GUTENBERG EBOOK" in book:

            b, e = 0, 0    #fall back to the whole book if a marker is missing
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = string.join(c, '')

            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = string.join(c, '')
            book = book[b:len(book)-e]

        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True).split()

        #find unique words
        u_oddwords = []
        for w in oddwords:
            if w not in u_oddwords: u_oddwords.append(w)

        #strip out most of the punctuation, keeping ' and - because
        #they occur inside contractions and compound words
        book2 = ''
        for ch in book:
            if ch not in '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~':
                book2 = book2 + ch

        book = book2.lower().split()

        for w in book:
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1


final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])

final_list.sort()
final_list.reverse()


with open('wordcounts_pg', 'w') as wc_output:

    for i in range(min(1000, len(final_list))):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
        
 
