Contrary to what I had in mind, I didn’t go all the way to the roots of EPUB after all. I had heard about Pandoc, but this was the first time I tried it myself. What an impressive tool! Respect. EPUB is one of the output formats Pandoc supports. The only sizable job left for me was to transform the exported XML file into one of the input formats Pandoc accepts. I chose HTML.
Because the blog posts are encapsulated in CDATA (character data) sections, I decided to do the XSLT transformation in two passes: first with wp2html.xsl, and then with cleanhtml.xsl, which rewrites the original image src attributes to point to a local directory. Here, I’m using the Saxon XSLT processor.
java -jar saxon9.jar wordpress.2013-02-17.xml wp2html.xsl >wp.html
java -jar saxon9.jar wp.html cleanhtml.xsl >cleaned.html
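To illustrate what the second pass does to the image references, here is a minimal Python sketch of the same idea. It is not the cleanhtml.xsl stylesheet; the function name, the regular expression, and the `images` directory name are assumptions made for the example.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches the src attribute of an <img> tag (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def localize_image_sources(html, local_dir="images"):
    """Point every <img src="..."> at a file of the same name in local_dir."""

    def to_local(match):
        # Keep only the file name; drop host, path, and query string.
        filename = posixpath.basename(urlparse(match.group(2)).path)
        return f"{match.group(1)}{local_dir}/{filename}{match.group(3)}"

    return IMG_SRC.sub(to_local, html)


sample = '<p><img class="thumb" src="http://example.com/2013/02/photo.jpg?w=300" alt=""/></p>'
print(localize_image_sources(sample))
# prints: <p><img class="thumb" src="images/photo.jpg" alt=""/></p>
```

A regular expression is enough for a sketch like this, but it would miss single-quoted attributes, which is one reason a real XML-aware transformation such as XSLT is the sturdier choice.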
Next, the thumbnail pictures. They are easy to fetch with e.g. wget.
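If wget is not at hand, the fetching step could be sketched in Python along these lines. This is an assumption-laden example and not what I actually ran: the class and function names are made up, and it presumes the image URLs are read from wp.html, that is, before the src attributes are rewritten.

```python
import os
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlretrieve


class ImageCollector(HTMLParser):
    """Collects the absolute http(s) URLs found in <img src="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src.startswith(("http://", "https://")):
                self.urls.append(src)


def collect_image_urls(html):
    parser = ImageCollector()
    parser.feed(html)
    return parser.urls


def fetch_images(urls, target_dir="images"):
    """Download each URL into target_dir under its original file name."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = posixpath.basename(urlparse(url).path)
        urlretrieve(url, os.path.join(target_dir, filename))


# Hypothetical usage, with wp.html in the current directory:
#   with open("wp.html", encoding="utf-8") as f:
#       fetch_images(collect_image_urls(f.read()))
```

Images that share a file name would overwrite each other in this sketch, so a real run might need to check for collisions first.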
When all bits and pieces are at hand, this Pandoc command stitches them together:
pandoc -f html -t epub --toc -o wpblog.epub cleaned.html
The default CSS stylesheet does a very decent job, but I found links a little hard to see, so I made a tiny epub.css to make them stand out.
And while at it, I also added a cover photo and some minimal metadata to show in the page header, in metadata.xml:
<dc:title>Suoritin II blog posts</dc:title>
With these enhancements, the command is a bit longer (note that Pandoc 2.0 and later use --css in place of --epub-stylesheet):
pandoc -f html -t epub --toc --epub-metadata=metadata.xml --epub-cover-image=images/cover.jpg --epub-stylesheet=epub.css -o wpblog.epub cleaned.html
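For repeated runs, the three commands can be chained in a small script. The following Python sketch is only one way to do it, under the assumption that saxon9.jar, both stylesheets, metadata.xml, epub.css and the images directory all sit in the current directory; the function names are invented for the example, and the image fetching step is left out.

```python
import subprocess


def saxon_step(source, stylesheet):
    """Command line for one Saxon pass, as used earlier in this post."""
    return ["java", "-jar", "saxon9.jar", source, stylesheet]


def pandoc_step(source="cleaned.html", output="wpblog.epub"):
    """Command line for the final Pandoc call, flags copied from this post."""
    return [
        "pandoc", "-f", "html", "-t", "epub", "--toc",
        "--epub-metadata=metadata.xml",
        "--epub-cover-image=images/cover.jpg",
        "--epub-stylesheet=epub.css",
        "-o", output, source,
    ]


def run_pipeline(export="wordpress.2013-02-17.xml"):
    """Run both XSLT passes and then Pandoc, stopping at the first failure."""
    passes = [
        (export, "wp2html.xsl", "wp.html"),
        ("wp.html", "cleanhtml.xsl", "cleaned.html"),
    ]
    for source, stylesheet, target in passes:
        # Saxon writes to stdout here, so redirect it into the target file.
        with open(target, "wb") as out:
            subprocess.run(saxon_step(source, stylesheet), stdout=out, check=True)
    subprocess.run(pandoc_step(), check=True)


# Hypothetical usage:
#   run_pipeline()
```

Passing the commands as lists avoids any shell quoting issues, and check=True makes the script stop as soon as one step fails instead of feeding a broken file to the next one.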
Here is the “book”. The iBooks reader by Apple renders the text just fine, and so does Aldiko on Android, although with a slightly different look and feel. The ebrary reader understands EPUB too, but it doesn’t divide the text into chapters. Whether links feel annoying is a matter of taste; they open separately in a browser, so the reading flow inevitably breaks.
Graphic designers would want to change the default style. Fonts can be embedded too.