Automatic Indexing of LaTeX Documents

A couple weeks ago I mentioned in a post that I was working on a Python script to automatically generate indexes of books written in the LaTeX typesetting system.  At the time I promised to post the script in “a couple of days”.  Predictably, weeks have passed, my little script has ballooned into a full-on open-source software project, and the code is now too long to post (or explain) in a single blog article.  If you’re interested, however, you can now download my alpha release from SourceForge.

The package includes two Python programs.  Indexmeister is a console utility which reads a file (in several formats, not just LaTeX) and suggests terms for indexing.  It uses three different methods to figure out which terms are important.  Imbrowse is a curses program which helps you interactively browse multi-file LaTeX books and quickly insert the right tags to generate an index.

I made this video tutorial to show how the system works:

In the future I am thinking of adding a plug-in for LibreOffice, and possibly a graphical interface (probably using GTK bindings). Porting it to Windoze is not a priority, however.

Word Frequency Analysis – The Most Common Words

There are any number of reasons why you might need a list of the most common words in the language. In my case, I was working on a piece of software to speed the process of building indexes for my print books. My program reads the book and suggests a list of words that the author might want to include in the index. It needed a list of the most common words so it would know not to bother suggesting them. I’ll post that script in a couple of days. For now, though, I thought I would give you a very simple piece of Python code that reads a directory full of text files, counts how many times each word occurs, and prints a list of those which show up most often. I set it to give me the most common 1000 words. You could generate a list of any length, though, just by changing one number in the code.

If you don’t care to look behind the curtain and just want to cut and paste my word list, feel free to scroll down to the bottom of the post.

For raw data, I used a sample of 37,358 Project Gutenberg texts. PG is kind enough to offer an interface for researchers like me to harvest books. Note that this would work nearly as well with a much smaller sample, but I had already downloaded the books for another project, so I figured I might as well use them. If you use a PG harvest for your data set, make sure to remove the Human Genome Project gene sequence files (a full dump contains at least three copies of the full human genome). Otherwise, this script will come to serious grief when it tries to count each gene as a word.
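
If you want to weed those out automatically, a crude filter like the sketch below should work. I'm assuming here that the gene files mention "Human Genome Project" somewhere near the top; check a few of the offending files in your own dump and adjust the marker string to match.

#rough filter: skip any .txt file that looks like a gene sequence dump
#(the "Human Genome Project" marker is an assumption; verify it against your files)
def is_genome_file(fname):
    with open(fname, "rb") as infile:
        head = infile.read(4096)     #the first few KB is enough
    return "Human Genome Project" in head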

Note that, as currently written, this script requires GNU Aspell and a system that works correctly with pipes. This means it should run fine on nearly any Unix-like system, but you Windoze people are on your own.

The first part of the script loads a few standard modules. Then it gets a listing of the current directory and starts looping through each text file in it. With each iteration it prints a status message with the file name and percent completion. With scripts like this, which take a day or two to run, I like to be able to see at a glance how far along I am. As an aside, if you access your computer through a terminal like I do, you will probably want to use GNU Screen or a similar utility to protect yourself from accidental disconnects while the script is still running.

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, stat, string, subprocess

wordcounts={}

filelist = os.listdir('.')

counter = 0
for f in filelist:
    
    counter += 1
    
    if os.path.splitext(f)[1] == '.txt':
        print f+'\t', str(counter)+' of '+str(len(filelist))+'\t', 
        print str((float(counter)/float(len(filelist)))*100)[:6]+'%'

The next portion opens each book file and reads it in. Next, because I’m using PG books as a data set, I need to strip off all of the boilerplate license text which occurs at the beginning and end of the files. Otherwise, because similar text appears in every file, it will skew the word distributions. Luckily, PG marks the actual text of the book by bracketing it in the words “START OF THIS PROJECT GUTENBERG EBOOK” and “END OF THIS PROJECT GUTENBERG EBOOK”. The front part is easy: we just do a string find to get the location of the first line-feed character after the start text appears. The end part is a little trickier; the easiest way to get it is to reverse the whole book. This means, however, that we also need to flip the search text. Pretty neato, huh?

        with open(f, "rb") as infile:  book=infile.read()

        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate 
        if "PROJECT GUTENBERG EBOOK" in book:

            b = 0; e = 0    #defaults, in case a marker isn't found
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = string.join(c, '')

            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = string.join(c, '')
            book = book[b:len(book)-e]
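
Incidentally, Python’s str.rfind searches from the right, so you could get the same trimming without the double reversal. A minimal sketch of the equivalent logic:

            b, d = 0, len(book)
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)        #first line break after the start marker
            e = book.rfind("END OF THIS PROJECT GUTENBERG EBOOK")
            if e != -1:
                d = book.rfind('\n', 0, e)    #last line break before the end marker
            book = book[b:d]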

The next step is to check the book text for words that aren’t in the dictionary, simply because there is no reason to count words that aren’t part of Standard English. The easiest way to do this on a Linux system like mine is to run the system’s spellcheck, Aspell, on the file. We also want to eliminate duplicate words from this list, since it will save iterations later.

        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True).split()

        #find unique words
        u_oddwords = []
        for w in oddwords:
            if w not in u_oddwords: u_oddwords.append(w)
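
As an aside, a set would do the same deduplication in one line, and it makes the later "w not in u_oddwords" tests effectively instant:

        #same result, but with O(1) membership tests
        u_oddwords = set(oddwords)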

Next, we go through the book text and strip out most of the punctuation. The string containing the punctuation to be removed looks a lot like the string you get by calling string.punctuation. Note, though, that I left in the “‘” and “-” characters because they are actually a part of contractions and compound words, respectively. I also split the book text, which at this point is one big string, into a list of words and normalize their case (capitalize() lower-cases everything after the first character, which is all we need for counting).

        #strip out most of the punctuation
        book2=''
        for i in range(len(book)):
            if book[i] not in '!"#$%&()*+,./:;<=>?@[\]^_`{|}~':  
                book2=book2+str(book[i])
                
        book=str(book2).capitalize().split()
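
If that character-by-character loop gets slow on big files, Python 2’s two-argument str.translate deletes the same characters in a single native-code pass. A roughly equivalent sketch (it lower-cases everything rather than calling capitalize(), which for counting purposes amounts to the same thing):

        #delete punctuation in one pass, then case-fold and split
        book = book.translate(None, '!"#$%&()*+,./:;<=>?@[\]^_`{|}~')
        book = book.lower().split()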

In the final segment of the script we count how many times the words occur and update the counters, which are kept as a dictionary object. Then we convert the dictionary to a list, sort it, and print the 1000 most common words to a CSV data file. If you need a different number of words, just change the 1000 to another value.

        for w in book:  
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1
                    

final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])

final_list.sort()
final_list.reverse()

                    
with open('wordcounts_pg', 'w') as wc_output:
    
    for i in range(min(1000, len(final_list)-1)):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
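
As another aside, if your machines have Python 2.7 or newer, collections.Counter collapses the counting and ranking into a few lines. A sketch of the same logic:

from collections import Counter

wordcounts = Counter()

#inside the per-file loop, replacing the counting block:
wordcounts.update(w for w in book if w not in u_oddwords)

#after the loop, the top 1000 come straight from most_common():
with open('wordcounts_pg', 'w') as wc_output:
    for word, count in wordcounts.most_common(1000):
        wc_output.write(word+', '+str(count)+'\n')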
        

That’s all there is to it. Pretty easy, huh? Now set it to run, detach the terminal, and ignore it until this time tomorrow. My machine can count words in about 1500 books per hour, so it takes about 25 hours to make it through the full sample.

And now, finally, here is the list of words. Feel free to cut and paste it to use for your own projects:

Word Occurrences
the 149164503
of 81154540
and 73797877
to 60771291
a 47925287
in 41773446
that 26590286
was 24584688
he 24462836
i 24025629
it 22795878
his 20173668
is 18378165
with 18081192
as 17645451
for 17473870
had 14408612
you 13939609
be 13252982
on 13207285
not 13181744
at 13015022
but 12718486
by 12438046
her 11878371
which 10826405
this 10263128
have 10196168
from 10088968
she 9778689
they 9715080
all 8819085
him 8771048
were 8314601
or 8143254
are 7787136
my 7572900
we 7412199
one 7373621
so 7203582
their 7018823
an 6518028
me 6419080
there 6267776
no 6185033
said 5938853
when 5899530
who 5878132
them 5808758
been 5787319
would 5689624
if 5655080
will 5166315
what 4895509
out 4556168
more 4440752
up 4416055
then 4222409
into 4129481
has 4000893
some 3929663
do 3914008
could 3749041
now 3747314
very 3630489
time 3571298
man 3559452
its 3544086
your 3522411
our 3517346
than 3494543
about 3349698
upon 3337366
other 3316391
only 3285019
any 3236410
little 3183383
like 2993385
these 2979508
two 2943507
may 2934056
did 2915540
after 2853393
see 2852408
made 2842273
great 2839852
before 2774768
can 2746279
such 2734113
should 2708032
over 2672597
us 2651042
first 2553483
well 2517899
must 2484839
mr 2465607
down 2433044
much 2428947
good 2376889
know 2372135
where 2353232
old 2291164
men 2286995
how 2261780
come 2217201
most 2188746
never 2160804
those 2135489
here 2122731
day 2071427
came 2061124
way 2042813
own 2037103
go 2009804
life 2007769
long 1992150
through 1989883
many 1982797
being 1976737
himself 1941387
even 1915129
shall 1890432
back 1865988
make 1852069
again 1848115
every 1845835
say 1817170
too 1810172
might 1807261
without 1781441
while 1759890
same 1701541
am 1696903
new 1687809
think 1665563
just 1660367
under 1649489
still 1643537
last 1616539
take 1614771
went 1595714
people 1593685
away 1582685
found 1574065
yet 1563963
thought 1556184
place 1543300
hand 1500131
though 1481938
small 1478723
eyes 1469270
also 1467931
house 1438223
years 1435529
1433313
another 1415606
don’t 1381480
young 1379348
three 1378462
once 1377940
off 1376942
work 1375035
right 1360201
get 1345597
nothing 1344419
against 1325938
left 1289397
ever 1269433
part 1261573
let 1260289
each 1258840
give 1258179
head 1254870
face 1253762
god 1249406
0 1239969
between 1225531
world 1219519
few 1213621
put 1200519
saw 1190392
things 1188437
took 1172602
letter 1167755
tell 1160034
because 1155609
far 1154860
always 1152942
night 1152416
mrs 1137055
love 1121812
both 1111644
sir 1100855
why 1097538
look 1095059
having 1069812
mind 1067461
father 1062643
called 1062190
side 1053255
looked 1051044
home 1036554
find 1036485
going 1034663
whole 1033731
seemed 1031466
however 1027701
country 1026854
got 1024945
thing 1022424
name 1020634
among 1019175
seen 1012779
heart 1011155
told 1004061
done 1000189
king 995498
water 994392
asked 993082
heard 983747
soon 982546
whom 979785
better 978434
something 957812
knew 956448
lord 956398
course 953585
end 947889
days 929530
moment 926478
enough 925144
almost 916006
general 903316
quite 902582
until 902333
thus 900738
hands 899106
nor 876106
light 869941
room 869532
since 864596
woman 864072
words 858824
gave 857475
b 853639
mother 852308
set 851757
white 850183
taken 848343
given 838078
large 835292
best 833941
brought 833270
does 826725
next 823345
whose 821731
state 820812
yes 817047
oh 815302
door 804702
turned 804433
others 800845
poor 800544
power 797133
present 792424
want 791194
perhaps 789201
death 788617
morning 786748
la 783512
rather 775384
word 774340
miss 771733
less 770410
during 763957
began 762442
themselves 762418
felt 757580
half 752587
lady 742708
full 742062
voice 740567
cannot 738450
feet 737299
order 736997
near 736832
true 735006
1 730887
it’s 727886
matter 726818
stood 725802
together 725703
year 723517
used 723293
war 720950
till 720824
use 719314
thou 714663
son 714275
high 713720
round 710093
above 709745
certain 703716
often 698006
kind 696975
indeed 696469
i’m 690646
along 688169
case 688098
fact 687334
myself 684387
children 683334
anything 682888
four 677704
dear 676320
keep 675722
nature 674055
known 671288
point 668710
p 668356
friend 666493
says 666011
passed 665792
within 665633
land 663605
sent 662540
church 659035
believe 656459
girl 652783
city 650397
times 649022
form 647388
herself 646989
therefore 644835
hundred 640059
john 639007
wife 636379
fire 632762
several 632704
body 630129
sure 629252
money 629251
means 627640
air 626921
open 626306
held 625660
second 622526
gone 614808
already 613870
least 609236
alone 606078
hope 602206
thy 599253
chapter 597339
whether 596307
boy 596048
english 594784
itself 591838
2 591413
women 589579
hear 587189
cried 586705
leave 586112
either 581618
number 576685
rest 575648
child 574531
behind 572007
read 571445
lay 571286
black 569530
government 567320
friends 567282
became 564384
around 559161
river 556286
sea 552753
ground 550622
help 549284
c 548349
i’ll 546929
short 546465
question 545629
reason 545464
become 544896
call 544576
replied 544286
town 543694
family 542309
england 542109
lost 537241
speak 537188
answered 536154
five 535088
coming 534713
possible 534639
making 530530
hour 530471
dead 529575
really 528631
looking 528622
law 528248
captain 525928
different 522269
manner 519256
business 516115
states 511757
earth 511042
st 510820
human 510666
early 508769
sometimes 507383
spirit 506297
care 505984
sat 505109
public 504862
close 503948
towards 503262
kept 502051
french 501813
party 500749
truth 500365
line 498822
strong 498492
book 496520
able 494330
later 494101
return 492237
hard 490701
mean 489853
feel 487798
story 486538
m 485841
received 483744
following 481558
fell 480591
wish 480562
person 480508
beautiful 479656
seems 477423
dark 476293
history 475744
followed 474307
subject 473058
thousand 470929
ten 469675
returned 469387
thee 467513
age 466838
turn 466674
fine 466630
across 466545
show 465685
arms 465504
character 464946
live 464642
soul 463939
met 463300
evening 463176
die 462851
common 459553
ready 457764
suddenly 456627
doubt 455415
bring 453346
ii 453190
red 450793
free 447675
that’s 445572
account 444530
cause 444403
necessary 444147
can’t 443812
need 443326
answer 442440
miles 441924
carried 438793
although 438423
fear 437796
hold 437493
interest 437382
force 436993
illustration 436577
sight 435854
act 435269
master 433105
ask 432510
idea 432424
ye 432036
sense 430693
an’ 430321
art 430226
position 429722
rose 428624
3 427441
company 427142
road 425669
further 425131
nearly 424118
table 424064
everything 423740
brother 423088
sort 422809
south 421800
reached 420190
london 418755
six 418131
didn’t 416216
cut 412716
taking 412571
continued 411607
understand 411326
appeared 409564
sun 407584
none 407168
else 406851
big 406799
o 406388
longer 406382
deep 406170
army 405897
beyond 405580
view 404378
strange 400814
natural 400483
talk 399814
north 398556
suppose 396693
court 396267
service 393925
bed 393878
past 393609
ought 393331
street 392970
cold 391836
hours 391460
toward 390231
added 389818
spoke 389420
seem 388757
neither 388355
late 388105
probably 387568
real 386926
clear 385649
chief 385350
run 385269
certainly 385179
est 384982
united 384930
stand 384385
forward 384028
front 383866
purpose 382457
sound 382443
feeling 382032
eye 380164
happy 378251
i’ve 377633
except 374853
knowledge 374155
blood 373563
low 373268
remember 373173
pretty 372548
change 372221
living 371264
american 369773
bad 369425
horse 369396
peace 369168
meet 366864
effect 365907
boys 364460
en 364172
school 362681
comes 362575
france 360771
fair 359826
forth 359249
died 359161
fall 358176
placed 357047
note 354944
led 354740
saying 354703
length 354502
pass 353234
gold 350268
entered 349397
doing 348304
latter 347844
written 347699
laid 346808
4 344382
according 343990
daughter 343682
opened 343526
dr 340867
trees 339826
distance 339817
office 339771
attention 339722
hair 337835
n 337111
prince 335635
wild 335514
wanted 335167
society 335139
husband 332251
play 331807
wind 330079
green 329633
greater 329453
tried 328784
west 328702
important 327851
ago 327793
bear 325469
various 325246
especially 324511
mine 321967
paper 320046
island 320002
glad 319989
makes 319717
instead 319188
faith 318882
lived 318731
pay 318090
heaven 316878
ran 315958
s 315761
blue 315697
minutes 315172
duty 315065
foot 314708
ship 314700
fellow 314523
letters 313624
persons 311105
action 310840
below 309831
heavy 309808
york 309749
strength 308836
pleasure 307965
immediately 307823
remained 307750
save 306991
standing 306911
whatever 306070
won’t 305381
trouble 305338
e 305293
window 305257
object 305202
try 304928
parts 304007
period 303992
desire 303985
beauty 303513
opinion 303459
arm 303347
system 302641
third 302389
chance 301890
books 301331
george 300975
doctor 300779
british 300353
silence 300238
he’s 300053
enemy 298899
hardly 298533
5 296045
greek 295622
exclaimed 294602
send 293592
food 293239
happened 293092
lips 292334
sleep 291632
influence 290698
slowly 290590
works 289252
months 288930
generally 288629
gentleman 287966
beginning 287473
tree 287341
boat 286781
mouth 285685
there’s 285569
sweet 285425
drew 284944
deal 284389
v 284339
future 284186
queen 284002
yourself 283364
condition 283335
figure 283153
single 283016
smile 282793
places 282793
besides 281838
girls 281703
rich 281130
afterwards 281017
battle 280676
thinking 280651
footnote 280245
presence 279893
stone 279829
appearance 279691
follow 279498
iii 279239
started 278072
caught 277993
ancient 277595
filled 277238
walked 276882
impossible 276720
broken 276365
former 276016
century 275990
march 275880
274800
field 274479
horses 274255
stay 274139
twenty 273187
sister 272290
getting 271641
william 270478
knows 269506
afraid 269150
result 268749
seeing 268724
you’re 268500
hall 267020
carry 266780
arrived 266706
easy 266309
lines 265956
wrote 265929
east 265852
top 265242
wall 264942
merely 264898
giving 264484
raised 264154
appear 264015
simple 263923
thoughts 263760
struck 263694
moved 263492
mary 263463
direction 263444
christ 263262
wood 263260
born 263084
quickly 262966
paris 262393
man’s 262105
visit 261882
outside 260418
holy 260348
entirely 259045
somewhat 259020
week 258960
laughed 258562
secret 258198
village 257758
henry 257557
christian 257504
danger 257486
wait 257012
wonder 256770
learned 256420
stopped 256191
tom 256117
covered 256117
6 255876
bright 255349
walk 255090
leaving 254851
experience 254763
unto 254610
particular 254564
loved 254479
usual 254307
plain 253867
to-day 253804
seven 253567
wrong 253172
easily 252954
occasion 252780
formed 252707
ah 252144
uncle 252120
quiet 252035
write 251743
scene 251380
evil 250993
married 250965
please 250781
fresh 250507
camp 249947
german 248539
beside 248522
mere 248276
fight 247957
showed 247904
grew 247866
expression 247804
scarcely 247641
board 247578
command 247398
language 247302
considered 247260
regard 247101
hill 246854
finally 246533
national 246452
paid 246364
joy 246060
worth 245352
piece 244733
religion 244677
perfect 244671
royal 244615
tears 244448
president 244135
value 244084
dinner 243572
spring 242721
produced 242576
middle 242282
charles 242134
brown 241885
expected 241668
lower 241299
circumstances 241150
remain 241102
wide 240773
political 240686
charge 240464
success 240254
per 240083
officers 239806
hath 239618
indian 239572
observed 239548
lives 239448
respect 238787
greatest 238784
w 238776
cases 238527
tone 238005
america 237215
youth 236992
summer 236698
garden 236552
music 236354
waiting 236223
due 236178
modern 235763
jack 235557
unless 235428
study 235093
allowed 234852
leaves 234652
bit 233774
race 233156
military 232907
news 232435
meant 232274
afternoon 232063
winter 231867
picture 231735
houses 231575
goes 231281
sudden 230675
proper 230476
justice 230410
difficult 229784
changed 229658
grace 229281
chair 228931
10 228875
private 228392
eight 228222
hot 227873
reach 226608
silent 226552
‘i 226540
flowers 226379
laws 226197
noble 225931
watch 225328
floor 225326
killed 225020
built 224484
declared 224477
judge 224393
colonel 224303
members 224213
broke 224166
fast 223897
duke 223481
o’ 223293
shot 223105
sit 222222
usually 222162
step 222119
speaking 222101
attempt 221687
marriage 221054
walls 220575
stop 220466
special 220316
religious 220300
discovered 220260
beneath 219894
supposed 219260
james 219013
gives 218988
forms 218743
turning 218692
authority 218686
original 218519
straight 218414
property 218393
page 218233
plan 218185
drawn 217873
personal 217458
l 217130
cry 217022
passing 216926
class 216527
likely 216216
sitting 215841
cross 215821
spot 215719
soldiers 215683
escape 215311
complete 215288
eat 215120
bound 214985
conversation 214895
trying 214332
meeting 213898
determined 213756
simply 213506
shown 213457
bank 213261
shore 212917
running 212509
corner 212507
soft 212163
journey 212007
isn’t 211316
i’d 211132
reply 210852
author 210827
believed 210653
rate 210607
prepared 210558
lead 210548
existence 210220
enter 209851
indians 209589
troops 209398
wished 209068
glass 208986
notice 208859
higher 208770
social 208685
iron 208019
rule 207943
orders 207856
building 207813
madame 207780
mountains 207700
minute 207575
receive 207440
offered 207306
h 206821
names 206725
learn 206618
similar 206437
closed 206419
considerable 206102
lake 206017
wouldn’t 206012
8 205864
pleasant 205487

And here is the complete script:

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, stat, string, subprocess

wordcounts={}

filelist = os.listdir('.')

counter = 0
for f in filelist:
    
    counter += 1
    
    if os.path.splitext(f)[1] == '.txt':
        print f+'\t', str(counter)+' of '+str(len(filelist))+'\t', 
        print str((float(counter)/float(len(filelist)))*100)[:6]+'%' 
    
        with open(f, "rb") as infile:  book=infile.read()
    
        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate 
        if "PROJECT GUTENBERG EBOOK" in book:
    
            b = 0; e = 0    #defaults, in case a marker isn't found
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = string.join(c, '')
            
            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = string.join(c, '')
            book = book[b:len(book)-e]
                
        
        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True).split()

        #find unique words
        u_oddwords = []
        for w in oddwords:
            if w not in u_oddwords: u_oddwords.append(w)
            
        
        #strip out most of the punctuation
        book2=''
        for i in range(len(book)):
            if book[i] not in '!"#$%&()*+,./:;<=>?@[\]^_`{|}~':  
                book2=book2+str(book[i])
                
        book=str(book2).capitalize().split()
        
        for w in book:  
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1
                    

final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])

final_list.sort()
final_list.reverse()

                    
with open('wordcounts_pg', 'w') as wc_output:
    
    for i in range(min(1000, len(final_list)-1)):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
        
 

Dynamic HTML in Python – A Simple E-Book Server

I thought it was high time I wrote another Python how-to article, since I haven’t for months, and they tend to be the highest-traffic posts on this site. This one should give you plenty of “bang for your buck”, though, since it includes examples of web development, file I/O, and working with PDF files.

I should start with a little background to the problem. I have an older Kindle device, to which I am more or less addicted. I download hundreds of classic books, technical manuals, and journal articles, put them on the Kindle, and read them at my leisure. Unfortunately, the device only has 2 gigabytes of storage. This seems like a lot, but by the time I downloaded all the books on my Great Books Reading List on it, it was already 3/4 full. Lately I’ve been having to delete files to make room for new ones–which is a problem, because I don’t always know where to find them again if I need them. A few weeks ago I complained about the problem on Facebook, and one of my buddies suggested (jokingly?) that I build a “Kindle Server”. So I did.

It only took a couple of hours to dump all of my books on one of the servers that live in my garage, set up a simple Python-based web server, and write a Python script to dynamically serve up a listing of titles. Now I just point my Kindle’s browser at the server and download whatever I want on the fly.

This works with non-DRM .MOBI files, like the ones on Project Gutenberg, The Online Library of Liberty, and the University of Adelaide. It also works on .PDF files. It ignores DRM-protected Kindle books that you bought from Amazon, because those stay in the “Archived Items” folder on your Kindle and you can re-download them directly from Amazon.

Step 1 – Set Up a Web Server

Python itself comes with all the libraries you need to function as a simple CGI web server. A simple script like this, slightly adapted from an example on the Pointless Programming blog, should be all you need. Note that my web server root directory is “/var/www”, my Kindle books are in “/var/www/kindle”, and my CGI scripts are in “/var/www/cgi-bin”. I don’t have another server on that IP, so I could use port 80, which is the “main” web server port.

#!/usr/bin/env python

import BaseHTTPServer
import CGIHTTPServer
import cgitb; cgitb.enable()  

server = BaseHTTPServer.HTTPServer
handler = CGIHTTPServer.CGIHTTPRequestHandler
server_address = ("", 80)
handler.cgi_directories = ["/cgi-bin"]

httpd = server(server_address, handler)
httpd.serve_forever()
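
One caveat: on Unix-like systems, binding to a port below 1024 normally requires root privileges. If you would rather not run the server as root, pick a high port like 8000 and point the Kindle at http://yourserver:8000/ instead.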

Once you make sure the web server is working correctly, you will probably want to add a few lines to your rc.local file to start it in the background on system startup:

cd /var/www
./webserver.py &

Step 2 – Write the CGI Script

This is a mildly long script, so I will break it down and explain it by sections. The entire source file is at the bottom of this post.

The first lines in the script tell the server to use Python to run it, then import the libraries that you will need. “os” is used to read the directory and is included with Python. “pyPdf” is used to get titles from PDF files and is widely available in repositories. On Debian-based systems the package is called “python-pypdf”.

#! /usr/bin/env python

import os
from pyPdf import PdfFileReader

Here we print some text to let the browser know that it is receiving HTML output. We also set the title of the page and print a header at the top.

print "Content-Type: text/html"
print
print """
<html>
<head><title>Kindle Library</title></head>
<body>
<h1>Kindle Library</h1><hr>
<ul>
"""

Here we initialize blank lists: one for MOBI format books, one for PDF format books, and one for subdirectories.

mobifiles = []
pdffiles = []
dirlist = []

Now we get a listing of the directory where the books are stored and sort the entries based on the file extension. Anything without an extension is assumed to be a subdirectory and anything with an unrecognized extension is simply ignored.

filelist = os.listdir("/var/www/kindle")
for f in filelist:
    if os.path.splitext(f)[1] == ".mobi":
        mobifiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == ".pdf":
        pdffiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == "":
        dirlist.append("/var/www/kindle/"+f)

And now we do the same thing for any files in subdirectories.

for d in dirlist:
    filelist = os.listdir(d)
    for f in filelist:
        if os.path.splitext(f)[1] == ".mobi":
            mobifiles.append(d+'/'+f)
        elif os.path.splitext(f)[1] == ".pdf":
            pdffiles.append(d+'/'+f)   #use the subdirectory path, as with .mobi

This code looks in each of the MOBI format files and extracts a title. The MOBI format is a fairly complex binary file–usually compressed–and I couldn’t easily find a Python library to read the metadata. Let me know if you know of one. A little tinkering with a hex dumper revealed that the first 32 bytes of each file contain an abbreviated title, which works fine for this application.

Screen Shot of Hex Dump

print "<h2>Kindle MOBI Books</h2>"
booklinks = []
for f in mobifiles:
    with open(f, "rb") as book:
        t = book.read(32)
        title = t.strip()
        title = title.replace("_", " ")
    booklinks.append(['<li><a href="'+f.replace("/var/www/","/")+'">',
          title,"</a></li>"])

And then this part sorts the list by title and prints the hyperlinked titles.

booklinks.sort(key=lambda x: x[1])      #sort by title

for b in booklinks:
    print b[0]+b[1]+b[2]

print "</ul>"

This part does the same thing for .PDF books. The pyPdf library makes it silly easy to retrieve PDF metadata. The only thing to worry about is that not all PDF creators bother to put in a title. When the title comes back as “None” we use the file name as the title instead.

print "<h2>PDF Books</h2>"
print "<ul>" 
booklinks = [] 
for f in pdffiles: 
    pdfinput = PdfFileReader(file(f, "rb")) 
    title = str(pdfinput.getDocumentInfo().title) 
    if title == "None": 
        title = f.replace("/var/www/kindle/", "") 
    booklinks.append(['<li><a href="'+f.replace("/var/www/",
                "/")+'">',title,"</a></li>"]) 

booklinks.sort(key=lambda x: x[1]) #sort by title 

for b in booklinks: 
    print b[0]+b[1]+b[2] 

print "</ul>"

And, finally, we print a count of the number of books and subdirectories and close the <body> and <html> tags.

print str(len(mobifiles)+len(pdffiles)), "books in", 
print str(len(dirlist)+1), "directories.<br>"

print """
</body>
</html>
"""

Remember to move the finished script to your /cgi-bin directory and change the permissions to make it executable for all users.

The final result runs fast and looks pretty slick:

Screen Shot from Browser

It would be easy to add some CSS to make it even prettier, but I didn’t bother since I’ll mostly be looking at it through a Kindle screen:

Screen Shot from Kindle

I hope this is helpful to you. If nothing else, it shows good simple examples of how to create dynamic HTML with Python and how to get the titles from MOBI and PDF files.

Full Source Listing

#! /usr/bin/env python

import os
from pyPdf import PdfFileReader

print "Content-Type: text/html"
print
print """
<html>
<head><title>Kindle Library</title></head>
<body>
<h1>Kindle Library</h1><hr>
<ul>
"""

mobifiles = []
pdffiles = []
dirlist = []

filelist = os.listdir("/var/www/kindle")
for f in filelist:
    if os.path.splitext(f)[1] == ".mobi":
        mobifiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == ".pdf":
        pdffiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == "":
        dirlist.append("/var/www/kindle/"+f)

for d in dirlist:
    filelist = os.listdir(d)
    for f in filelist:
        if os.path.splitext(f)[1] == ".mobi":
            mobifiles.append(d+'/'+f)   
        elif os.path.splitext(f)[1] == ".pdf":
            pdffiles.append(d+'/'+f)   #use the subdirectory path, as with .mobi

print "<h2>Kindle MOBI Books</h2>"
booklinks = []
for f in mobifiles:
    with open(f, "rb") as book:
        t = book.read(32)
        title = t.strip()
        title = title.replace("_", " ")
    booklinks.append(['<li><a href="'+f.replace("/var/www/",
            "/")+'">',title,"</a></li>"])
    
booklinks.sort(key=lambda x: x[1])      #sort by title

for b in booklinks:
    print b[0]+b[1]+b[2]
        
print "</ul>"

print "<h2>PDF Books</h2>"
print "<ul>"

booklinks = []
for f in pdffiles:
    pdfinput = PdfFileReader(file(f, "rb"))
    title = str(pdfinput.getDocumentInfo().title)
    if title == "None":
        title = f.replace("/var/www/kindle/", "")
    booklinks.append(['<li><a href="'+f.replace("/var/www/",
            "/")+'">',title,"</a></li>"])
    
booklinks.sort(key=lambda x: x[1])      #sort by title   

for b in booklinks:
    print b[0]+b[1]+b[2]

print "</ul>"
    
print str(len(mobifiles)+len(pdffiles)), "books in", 
print str(len(dirlist)+1), "directories.<br>"



print """
</body>
</html>
"""

First Preview of my Upcoming Book

Last week I placed a new academic working paper on Academia.edu that roughly parallels Chapter 11 of my upcoming book.  The version in the book will be written at a different reading level and without the math equations, but this is still a pretty good taste of what is coming.

Screenshot of paper from academia.edu

Scholars like to post these preliminary drafts for several reasons.  The most important one for an independent researcher like myself is to receive feedback and suggestions prior to submission.  Another reason is to make findings available to the community sooner.  The average turn-around time to publish a journal article is two or three years and the field may have moved on by the time the paper hits the presses.

I probably don’t need to worry about obsolescence with this particular article, since the events it deals with happened back in the 1950s and 1960s.  My book will be a study of the role of outside scholars in our society and, in particular, their ability to shape public policy.  Outside Scholars, in my usage, are people who engage in research and knowledge creation without being formally affiliated with the dominant academic community.  This particular article/chapter deals with an outside scholar named Victor Sharrow, who devoted his life to arguing for what he saw as the “correct” interpretation of the Fourteenth Amendment.  He was ultimately unsuccessful, but I feel his career provides several intriguing insights as a characteristic outside-scholar narrative.

Sharrow saw the Fourteenth Amendment as the key to dismantling the Jim Crow system in the South.  In the months prior to the 1958 election he mounted an intense one-man lobbying campaign to sway Dwight Eisenhower and other politicians to his views.  In my article I examine several of his arguments from a standpoint of modern data science.

Those of you who read my posts on data science and Python programming might be interested in the simulation models I describe in the paper.  I would be happy to send my spreadsheet and code to anyone who is interested.  Just e-mail me or message my Facebook page.

If all goes well, the book should be released in late 2016 or early 2017.

 

Degree of Voting Restriction by State in 1956, as Calculated by the Model Described in my Paper

Easy Double Exponential Smoothing in Python

I realized this morning that it has been a while since I posted any Python code. I’ve been a bit busy with Handyman Kevin and haven’t been doing much data science. Still, I decided it was time to carve out a couple of hours this morning to practice my skills. The result is these functions, which perform basic double exponential smoothing using the Holt-Winters method. I deliberately avoided using NumPy, SciPy, or any other libraries. It isn’t that I dislike NumPy/SciPy (far from it), but you can’t always get sysadmins to install extra libraries on the machines you’re using, especially if you are a guerrilla data scientist like me.

There are a lot of different time series methods out there, and they all have their points. Holt-Winters is the one that I keep coming back to, though. One of the reasons is simplicity–I can always remember it and bang it into a spreadsheet without needing to Google anything or download libraries. About the 40th time I typed it into a spreadsheet, though, it occurred to me that it would be smart to implement it in Python so I could save some typing.

The first function, MAPE, simply calculates the mean absolute percentage error (MAPE) of a list of estimated values, as compared to a list of actual values.
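
In other words, it computes mean(|estimate - actual| / actual) over the whole list, expressed here as a fraction rather than a percentage.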

The next function, holtwinters, uses Holt-Winters to predict the next three values in a time series. You need to supply two smoothing coefficients, alpha and beta, for the level and trend, respectively. Typically, you would have a pretty good idea what these were from doing similar forecasts in the past.

If you don’t know the coefficients then use the third function, holtwinters_auto, to automatically determine them. This function uses a grid search. Those of you who have read my monograph probably remember that I’m not usually wild about grid searches. In this case it makes sense, though, since you don’t usually need more than a few digits of precision on the coefficients.


def MAPE(actual, estimate):
    '''Given two lists, one of actual values and one of estimated values, 
        computes the Mean Absolute Percentage Error'''
        
    if len(actual) != len(estimate):
        print "ERROR: Lists not the same length."
        return []
        
    pcterrors = []
    
    for i in range(len(estimate)):
        pcterrors.append(abs(estimate[i]-actual[i])/actual[i])
    
    return sum(pcterrors)/len(pcterrors)

def holtwinters(ts, *args):
    '''Uses the Holt-Winters exp. smoothing method to forecast the next
       three points in a time series.  The second two arguments are 
       smoothing coefficients, alpha and beta.  If no coefficients are given,
       both are assumed to be 0.5.
       '''
       
    if len(args) >= 1:
        alpha = args[0]
    else:
        alpha = .5
    
    if len(args) >= 2:
        beta = args[1]
    else:
        beta = .5
            
    if len(ts) < 3:
        print "ERROR: At least three points are required for TS forecast."
        return 0
    
    est = []    #estimated value (level)
    trend = []  #estimated trend
    
    '''For first value, assume trend and level are both 0.'''
    est.append(0)
    trend.append(0)
    
    '''For second value, assume trend still 0 and level same as first          
        actual value'''
    est.append(ts[0])
    trend.append(0)
    
    '''Now roll on for the rest of the values'''
    for i in range(len(ts)-2):
        trend.append(beta*(ts[i+1]-ts[i])+(1-beta)*trend[i+1])
        est.append(alpha*ts[i+1]+(1-alpha)*est[i+1]+trend[i+2])
        
    
    '''now back-cast for the first three values that we fudged'''
    est.reverse()
    trend.reverse()
    ts.reverse()
    
    for i in range(len(ts)-3, len(ts)):
        trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
        est[i] = alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
    
       
    est.reverse()
    trend.reverse()
    ts.reverse()
    
    '''and do one last forward pass to smooth everything out'''
    for i in range(2, len(ts)):
        trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
        est[i]= alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
        
    
    '''Holt-Winters method is only good for about 3 periods out'''
    next3 = [alpha*ts[-1]+(1-alpha)*(est[-1])+
             beta*(ts[-1]-ts[-2])+(1-beta)*trend[-1]]
    next3.append(next3[0]+trend[-1])
    next3.append(next3[1]+trend[-1])
    
    return next3, MAPE(ts,est)


def holtwinters_auto(ts, *args):
    '''Calls the holtwinters function, but automatically determines the
    alpha and beta coefficients which minimize the error.
    
    The optional argument is the number of digits of precision you need
    for the coefficients.  The default is 4, which is plenty for most real
    life forecasting applications.
    '''
    
    if len(args) > 0:
        digits = args[0]
    else:
        digits = 4
    
    '''Perform an iterative grid search to find minimum MAPE'''
    
    alpha = .5
    beta = .5
    
    for d in range(1,digits):
        grid = []
        for b in [x * .1**d+beta for x in range(-5,6)]:
            for a in [x * .1**d+alpha for x in range(-5,6)]:
                grid.append(holtwinters(ts, a, b)[-1])
                if grid[-1]==min(grid):
                    alpha = a
                    beta = b
            
    next3, mape = holtwinters(ts, alpha, beta)
        
    return(next3, mape, alpha, beta)
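
Here is what a typical call looks like, with made-up numbers standing in for a real time series:

#made-up demand figures for eight periods
ts = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0]

next3, mape, alpha, beta = holtwinters_auto(ts)

print "next three periods:", next3
print "MAPE: %.4f  alpha: %.4f  beta: %.4f" % (mape, alpha, beta)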

Commercial vs GPL: Data Analysis Software Showdown

As graduation nears, I look at my computer desktop and realize that most of the academic software licenses will expire before I start my next graduate program. For a data scientist, this is serious. How am I going to cross-tab survey results without SPSS? Am I going to have to do my stepwise regressions manually now? How am I going to create presentation quality geographic displays without Tableau? What package am I going to use for linear algebra? I freak out pretty badly, until I remember that just about every application you could want for data analysis is available for free on a GPL license that never expires.

Some of the software I use is already open source: data scientists’ two favorite programming languages, Python and R, for a start, along with Linux, the operating system that runs on five of my seven computers. Many of the applications I use, though, are commercial, and either the company or my university has been giving me a free license. What follows is a quick survey of open source alternatives to the most commonly used applications.

Spreadsheet

Commercial: Microsoft Excel
Open Source: Gnumeric

Spreadsheet applications are to the data analyst what a table saw is to a woodworker: the big tool in the middle of the shop that gets used somehow in nearly every project. For many of us, Excel is the first spreadsheet we learn, probably because it is standard equipment on most office and university computers.

Excel is the Chrysler New Yorker of spreadsheet applications: huge and comfortable but not too nimble, loaded with lots of features that are nice to have but that you don’t really need. Then again, some features, like the way Excel handles data tables, advanced filtering, and pivot tables, can save a lot of time. Even the conditional formatting is nice to have. Plus, if you ever need to interface with business-major types, Excel will be the only spreadsheet they’ve ever heard of.

Excel has plenty of drawbacks too, though. It is a huge program. It only runs on Windows (or OS X, if you don’t need to run any add-ins). The only natively supported scripting language is VBA. Perhaps worst of all, and unforgivably, it’s slow. If you haven’t noticed this for yourself, go try some sensitivity analysis on a simulation with 10,000 or more trials. Expect to have time for two or three cups of coffee every time you press F9.

Gnumeric is a totally different take on the concept of a spreadsheet. When I first experimented with it about 15 years ago, I concluded that it was too limited to be useful. Since then, however, the project has reinvented itself as the lightweight, stripped-down spreadsheet for data analysis. If Excel is a Chrysler New Yorker, then Gnumeric is a Dodge Dart.

In recent years, an SPSS style “statistics” menu has appeared in the Gnumeric interface. Now the most-used features of a spreadsheet and a statistics package are within easy reach, which will appeal to anyone who ever spent a morning clicking back and forth between Excel and SPSS while they analyzed a data set.

By far the most appealing feature of Gnumeric is that it uses Python as one of its scripting languages. This means that not only is it painless to create user functions but, given the multitude of libraries available for Python, you probably won’t need to very often. Excel’s non-linear solver seems pretty rudimentary when you have access to scipy.optimize. Also, since Gnumeric allows Python and C plug-ins, it is useful as a graphic front end to more complicated programs written in these languages. Pretty cool, especially considering the whole application is still lightweight enough to run on a $100 garage sale computer.

Other possibilities: OpenOffice (aka LibreOffice) is also free and seems to be designed as a more direct replacement for Excel. If you need a more general purpose spreadsheet it might be a good choice.

Statistics Package

Commercial: SPSS, SAS, Minitab
Open Source: PSPP

When all you need to do is calculate a few confidence intervals or run a T-test, a spreadsheet application will probably be adequate. If you are creating complex statistical models, you are probably going to write them in R. Between these extremes, about 90% of statistics work gets done in a statistics package. Which one you prefer probably depends on which one your college statistics professor used. The thing they all have in common, however, is that a full license costs a small fortune. Usually, the coolest add-on packages (for simulation, predictive analytics, etc) are even more of a buy-up. Luckily, for those of us of modest means, there is PSPP.

PSPP is intended as a direct clone of SPSS, but is GPL licensed, so it is completely free to use. One important difference is that PSPP is written mostly in Python (notice a theme here?). This means that if you are a Python hacker, you should have an easy time creating add-ins. It also means that PSPP runs on just about any platform that runs Python, which is nearly all of them.

Linear Algebra System

Commercial: Matlab
Open Source: Octave

When you start building seriously matrix-heavy models, such as anything involving Markov chains or finite element analysis, you are going to want to think hard about using a language that is built for linear algebra. Sure, Python has good linear algebra support through NumPy (http://numpy.org) and other libraries. But Python is basically a general-purpose, list-based language. Linear algebra looks ugly in Python, and ugly code takes longer to write and is harder to debug. It’s better to use the right tool for the job.
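
To see what I mean, here is a toy three-state Markov chain in NumPy (the transition matrix is hypothetical). It works, but compare x = x.dot(P) to the x = x * P you would write in Octave:

import numpy as np

#hypothetical transition matrix for a three-state Markov chain
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

x = np.array([1.0, 0.0, 0.0])    #start in state 0

for i in range(100):             #iterate toward the steady-state distribution
    x = x.dot(P)

print x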

The first good language for linear algebra was Matlab. It is still incredibly popular, especially among the PhD engineering crowd. For decades now, however, there has been a free open source alternative. Octave started out as a clone of Matlab, and the language syntax is still very similar. However, development on Octave often moves a little faster, and Octave sometimes gets features and functions before Matlab does. This has caused Octave to become somewhat less compatible with Matlab as time goes on. Still, anyone who can code in Matlab should have no trouble picking up Octave, and Octave is free.

Image Editing and Drawing Software

Commercial: Adobe PhotoShop, Adobe Illustrator
Open Source: GIMP, Inkscape

While few of us are visual artists, either by training or inclination, visual display of data is an important part of data analysis. It is important to be able to illustrate our findings in professional-quality maps, charts, and diagrams. To do this, you need to be able to work with both bitmap images and vector drawings. The Adobe products have been the commercial state of the art for some time now, but GIMP and Inkscape have most of the same functionality. Like the commercial programs, you tend to use them together as a team. One word of warning: if you are migrating from Adobe to the open source programs, you will find that the interfaces are very different, at least in the stock configuration. These programs are replacements, not clones.

One interesting feature of both GIMP and Inkscape is that both allow you to create scripts and plug-ins. One of the languages that is supported is (you guessed it) Python. While I haven’t experimented with it much, it seems like there is the potential here for some pretty serious data display. For instance, it seems like you could build a pretty killer mapping plug-in for Inkscape to display geographic data on different layers of a drawing.

Python Histograms from the Console

Lately, while working on my MBA thesis, there have been many times when I’ve been working in Python and wanted to plot a quick histogram of a distribution. The whole process has been far too time-consuming. Either I need to remember how to write my list out as a CSV file so I can plot it in Excel/Tableau/SPSS (none of which, IMHO, has a particularly intuitive mechanism for drawing histograms), or I need to be able to plot the histogram directly from Python. The latter would be fine, except that I’m usually in an SSH shell and I always have trouble with X servers on my Windows 8 laptop.

So, anyway, I wrote this little function, and it works quite nicely for plotting histograms in pure text mode. It should be fine with any terminal client of the last 50 years. It isn’t particularly sophisticated, but it works for me and may work for you too.

def crappyhist(a, bins):
    '''Draws a crappy text-mode histogram of an array'''
    import numpy as np
    import string
    from math import log10

    h, b = np.histogram(a, bins)

    #field width for the printed bin edges, based on the largest edge
    w = int(log10(np.amax(b)))+5

    for i in range(bins):
        print string.rjust(repr(b[i]), 7)[:w], '| ', '#'*int(70*h[i]/np.amax(h))

    print string.rjust(repr(b[bins]), 7)[:w]
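
For example, to eyeball 1,000 draws from a normal distribution:

import numpy as np

data = np.random.normal(100, 15, 1000)    #mean 100, sd 15
crappyhist(data, 20)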

Screen shot of the crappyhist() function in action.