Dynamic HTML in Python – A Simple E-Book Server

I thought it was high time I wrote another Python how-to article, since I haven’t written one for months and they tend to be the highest-traffic posts on this site. This one should give you plenty of “bang for your buck”, since it includes examples of web development, file I/O, and working with PDF files.

I should start with a little background to the problem. I have an older Kindle device, to which I am more or less addicted. I download hundreds of classic books, technical manuals, and journal articles, put them on the Kindle, and read them at my leisure. Unfortunately, the device only has 2 gigabytes of storage. This seems like a lot, but by the time I had loaded all the books from my Great Books Reading List onto it, it was already 3/4 full. Lately I’ve been having to delete files to make room for new ones–which is a problem, because I don’t always know where to find them again if I need them. A few weeks ago I complained about the problem on Facebook, and one of my buddies suggested (jokingly?) that I build a “Kindle Server”. So I did.

It only took a couple of hours to dump all of my books on one of the servers that live in my garage, set up a simple Python-based web server, and write a Python script to dynamically serve up a listing of titles. Now I just point my Kindle’s browser at the server and download whatever I want on the fly.

This works with non-DRM .MOBI files, like the ones on Project Gutenberg, The Online Library of Liberty, and the University of Adelaide. It also works on .PDF files. It ignores DRM-protected Kindle books that you bought from Amazon, because they stay in the “Archived Items” folder on your Kindle and you can re-download them directly from Amazon.

Step 1 – Set Up a Web Server

Python itself comes with all the libraries you need to function as a simple CGI web server. A simple script like this, slightly adapted from an example on the Pointless Programming blog, should be all you need. Note that my web server root directory is “/var/www”, my Kindle books are in “/var/www/kindle” and my CGI scripts are in “/var/www/cgi-bin”. I don’t have another server on that IP, so I could use port 80, which is the “main” web server port.

#!/usr/bin/env python

import BaseHTTPServer
import CGIHTTPServer
import cgitb; cgitb.enable()  

server = BaseHTTPServer.HTTPServer
handler = CGIHTTPServer.CGIHTTPRequestHandler
server_address = ("", 80)
handler.cgi_directories = ["/cgi-bin"]

httpd = server(server_address, handler)
httpd.serve_forever()
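
The script above is written for Python 2. In Python 3 the two server modules were merged into http.server, so a rough equivalent might look like the sketch below. The function name make_cgi_server, its parameters, and the default port 8000 are my own choices for illustration, not part of the original script. Keep in mind that CGIHTTPRequestHandler is deprecated as of Python 3.13, so this is a sketch for a home network rather than a recommendation.

```python
#!/usr/bin/env python3
import http.server


def make_cgi_server(port=8000, host="", cgi_dirs=("/cgi-bin",)):
    """Build (but do not start) a CGI-capable server for the current directory."""

    # Subclass so the directory list is not changed on the shared base class.
    class Handler(http.server.CGIHTTPRequestHandler):
        cgi_directories = list(cgi_dirs)

    return http.server.HTTPServer((host, port), Handler)
```

To use it, change to the web root (for example /var/www) and call make_cgi_server(80).serve_forever(). Binding to port 80 normally requires root privileges.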

Once you make sure the web server is working correctly, you will probably want to add a few lines to your rc.local file to start it in the background on system startup:

cd /var/www
./webserver.py &

Step 2 – Write the CGI Script

This is a mildly long script, so I will break it down and explain it by sections. The entire source file is at the bottom of this post.

The first lines in the script tell the server to use Python to run it and import the libraries that you will need. “os” is used to read the directory and is included with Python. “pyPdf” is used to get titles from PDF files and is widely available in repositories. On Debian-based systems the package is called “python-pypdf”.

#! /usr/bin/env python

import os
from pyPdf import PdfFileReader

Here we print some text to let the browser know that it is receiving HTML output. We also set the title of the page and print a header at the top.

print "Content-Type: text/html"
print
print """
<html>
<head><title>Kindle Library</title></head>
<body>
<h1>Kindle Library</h1><hr>
<ul>
"""

Here we initialize blank lists: one for MOBI format books, one for PDF format books, and one for subdirectories.

mobifiles = []
pdffiles = []
dirlist = []

Now we get a listing of the directory where the books are stored and sort the entries based on the file extension. Anything without an extension is assumed to be a subdirectory and anything with an unrecognized extension is simply ignored.

filelist = os.listdir("/var/www/kindle")
for f in filelist:
    if os.path.splitext(f)[1] == ".mobi":
        mobifiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == ".pdf":
        pdffiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == "":
        dirlist.append("/var/www/kindle/"+f)

And now we do the same thing for any files in subdirectories.

for d in dirlist:
    filelist = os.listdir(d)
    for f in filelist:
        if os.path.splitext(f)[1] == ".mobi":
            mobifiles.append(d+'/'+f)
        elif os.path.splitext(f)[1] == ".pdf":
            pdffiles.append(d+'/'+f)
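
The two loops above only go one level deep and only match lower-case extensions. If your library is nested more deeply, os.walk can do the whole scan in one pass. This is a minimal Python 3 sketch, not the script I actually run; the function name collect_books is made up for the example.

```python
import os


def collect_books(root):
    """Return (mobi_paths, pdf_paths, directories) found anywhere under root."""
    mobifiles, pdffiles, dirlist = [], [], []
    for dirpath, dirnames, filenames in os.walk(root):
        dirlist.append(dirpath)  # includes root itself
        for name in filenames:
            # Compare extensions case-insensitively so "BOOK.PDF" is found too.
            ext = os.path.splitext(name)[1].lower()
            full = os.path.join(dirpath, name)
            if ext == ".mobi":
                mobifiles.append(full)
            elif ext == ".pdf":
                pdffiles.append(full)
    return sorted(mobifiles), sorted(pdffiles), dirlist
```

Note that the directory list returned here already contains the root directory, so a count based on it would not need the extra “+1”.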

This code looks in each of the MOBI format files and extracts a title. The MOBI format is a fairly complex binary file–usually compressed–and I couldn’t easily find a Python library to read the metadata. Let me know if you know of one. A little tinkering with a hex dumper revealed that the first 32 bytes of each file contain an abbreviated title, which works fine for this application.

Screen Shot of Hex Dump

print "<h2>Kindle MOBI Books</h2>"
booklinks = []
for f in mobifiles:
    with open(f, "rb") as book:
        t = book.read(32)
        title = t.strip()
        title = title.replace("_", " ")
    booklinks.append(['<li><a href="'+f.replace("/var/www/","/")+'">',
          title,"</a></li>"])
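
One wrinkle with the loop above: strip() only removes whitespace, so any padding bytes after a short title stay in the string. Here is a small Python 3 sketch of the same idea as a function. It assumes the 32-byte field is padded with NUL bytes, which is how the underlying Palm database header is usually described; read_mobi_title is a name I made up.

```python
def read_mobi_title(path):
    """Return the short title stored in the first 32 bytes of a MOBI file."""
    with open(path, "rb") as book:
        raw = book.read(32)
    # Assumption: the field is NUL-padded, so keep only what precedes the first NUL.
    raw = raw.split(b"\x00", 1)[0]
    # Latin-1 maps every byte to a character, so this decode cannot fail.
    return raw.decode("latin-1").replace("_", " ").strip()
```

Calling read_mobi_title on one of the paths in mobifiles would give a string that is safe to sort on and print.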

And then this part sorts the list by title and prints the hyperlinked titles.

booklinks.sort(key=lambda x: x[1])      #sort by title

for b in booklinks:
    print b[0]+b[1]+b[2]

print "</ul>"
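
The code above pastes the path and title straight into the HTML, which is fine until a file name contains a space or a title contains an ampersand. A hedged Python 3 sketch of the same link-building and sorting step, with escaping added, follows. The names make_link and render_list are hypothetical, and the sort ignores case, which is my own tweak.

```python
import html
from urllib.parse import quote


def make_link(path, title, web_root="/var/www"):
    """Return one <li> element linking to path, relative to the web root."""
    rel = path[len(web_root):] if path.startswith(web_root) else path
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return '<li><a href="{}">{}</a></li>'.format(quote(rel), html.escape(title))


def render_list(books):
    """books is a list of (path, title) pairs; returns <ul> markup sorted by title."""
    ordered = sorted(books, key=lambda b: b[1].lower())
    items = [make_link(path, title) for path, title in ordered]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

In a CGI script you would simply print the string that render_list returns.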

This part does the same thing for .PDF books. The pyPdf library makes it silly easy to retrieve PDF metadata. The only thing to worry about is that not all PDF creators bother to put a title in. When the title comes back as “None” we use the file name for a title.

print "<h2>PDF Books</h2>"
print "<ul>" 
booklinks = [] 
for f in pdffiles: 
    pdfinput = PdfFileReader(file(f, "rb")) 
    title = str(pdfinput.getDocumentInfo().title) 
    if title == "None": 
        title = f.replace("/var/www/kindle/", "") 
    booklinks.append(['<li><a href="'+f.replace("/var/www/",
                "/")+'">',title,"</a></li>"]) 

booklinks.sort(key=lambda x: x[1]) #sort by title 

for b in booklinks:
    print b[0]+b[1]+b[2]

print "</ul>"
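
If you are on Python 3, pyPdf itself is no longer maintained; its successor is the third-party pypdf package. The fallback logic can be sketched as below. This assumes pypdf is installed, and the helper names choose_title and pdf_title are mine. Unlike the code above, it falls back to the bare file name without any subdirectory prefix, and it also treats a blank title as missing.

```python
import os


def choose_title(metadata_title, path):
    """Use the PDF's own title if it has one, otherwise the bare file name."""
    if metadata_title is None or not str(metadata_title).strip():
        return os.path.basename(path)
    return str(metadata_title).strip()


def pdf_title(path):
    """Look up a PDF's title; needs the third-party pypdf package (an assumption)."""
    from pypdf import PdfReader  # imported lazily so choose_title works without it
    info = PdfReader(path).metadata  # can be None when the file has no info dictionary
    return choose_title(info.title if info is not None else None, path)
```

Keeping the fallback in its own function makes it easy to check without having a PDF on hand.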

And, finally, we print a count of the number of books and subdirectories and close the <body> and <html> tags.

print str(len(mobifiles)+len(pdffiles)), "books in",
print str(len(dirlist)+1), "directories.<br>"

print """
</body>
</html>
"""

Remember to move the finished script to your /cgi-bin directory and change the permissions to make it executable for all users.
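
Most people will just use chmod from the shell for this, but if you would rather do it from Python, a minimal Python 3 sketch looks like this. The function name make_executable is made up for the example.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for owner, group, and others; leave other bits alone."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

For example, make_executable("/var/www/cgi-bin/kindle.py"), where kindle.py is a hypothetical name for the finished script.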

The final result runs fast and looks pretty slick:

Screen Shot from Browser

It would be easy to add some CSS to make it even prettier, but I didn’t bother since I’ll mostly be looking at it through a Kindle screen:

Screen Shot from Kindle

I hope this is helpful to you. If nothing else, it shows good simple examples of how to create dynamic HTML with Python and how to get the titles from MOBI and PDF files.

Full Source Listing

#! /usr/bin/env python

import os
from pyPdf import PdfFileReader

print "Content-Type: text/html"
print
print """
<html>
<head><title>Kindle Library</title></head>
<body>
<h1>Kindle Library</h1><hr>
<ul>
"""

mobifiles = []
pdffiles = []
dirlist = []

filelist = os.listdir("/var/www/kindle")
for f in filelist:
    if os.path.splitext(f)[1] == ".mobi":
        mobifiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == ".pdf":
        pdffiles.append("/var/www/kindle/"+f)
    elif os.path.splitext(f)[1] == "":
        dirlist.append("/var/www/kindle/"+f)

for d in dirlist:
    filelist = os.listdir(d)
    for f in filelist:
        if os.path.splitext(f)[1] == ".mobi":
            mobifiles.append(d+'/'+f)   
        elif os.path.splitext(f)[1] == ".pdf":
            pdffiles.append(d+'/'+f)

print "<h2>Kindle MOBI Books</h2>"
booklinks = []
for f in mobifiles:
    with open(f, "rb") as book:
        t = book.read(32)
        title = t.strip()
        title = title.replace("_", " ")
    booklinks.append(['<li><a href="'+f.replace("/var/www/",
            "/")+'">',title,"</a></li>"])
    
booklinks.sort(key=lambda x: x[1])      #sort by title

for b in booklinks:
    print b[0]+b[1]+b[2]
        
print "</ul>"

print "<h2>PDF Books</h2>"
print "<ul>"

booklinks = []
for f in pdffiles:
    pdfinput = PdfFileReader(file(f, "rb"))
    title = str(pdfinput.getDocumentInfo().title)
    if title == "None":
        title = f.replace("/var/www/kindle/", "")
    booklinks.append(['<li><a href="'+f.replace("/var/www/",
            "/")+'">',title,"</a></li>"])
    
booklinks.sort(key=lambda x: x[1])      #sort by title   

for b in booklinks:
    print b[0]+b[1]+b[2]

print "</ul>"
    
print str(len(mobifiles)+len(pdffiles)), "books in", 
print str(len(dirlist)+1), "directories.<br>"



print """
</body>
</html>
"""

One comment

  • Kevin A. Straight

    When I wrote this program I determined the titles of the MOBI e-books by reading the first 32 bytes of the files–which worked fine for what I was doing. I have since learned a much more powerful way to read e-book meta-data, though. Just install Calibre and call the included “ebook-meta” command line utility from your own program. It is very easy to trap the output, which is in plain text and looks like,

    Title               : With the Lightnings
    Author(s)           : David Drake
    Publisher           : Baen Publishing Enterprises
    Book Producer       : calibre (0.8.12) [http://calibre-ebook.com]
    Tags                : Science Fiction
    Languages           : en
    Published           : 2000-07-07T04:00:00+00:00
    Identifiers         : isbn:0671578189
    

    Interestingly, though not surprisingly, it looks like Baen uses Calibre to make their e-books too.
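
Trapping that output from Python could look something like the following sketch, written for Python 3. It assumes Calibre is installed with ebook-meta on the PATH, and that every line has the “Key : value” shape shown above. The function names are my own.

```python
import subprocess


def parse_ebook_meta(text):
    """Turn 'Key : value' lines into a dict, splitting on the first colon only."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return meta


def ebook_meta(path):
    """Run Calibre's ebook-meta on one file and return its metadata as a dict."""
    result = subprocess.run(["ebook-meta", path],
                            capture_output=True, text=True, check=True)
    return parse_ebook_meta(result.stdout)
```

Splitting on the first colon only matters because values such as the publication timestamp and the ISBN identifier contain colons themselves.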
