Creating an E-book: Tips on Document Formatting

After years of marginal acceptance, e-books have finally started to eclipse their printed-and-bound ancestors. Casual and sophisticated readers alike are growing much more accustomed to reading from a device -- a Kindle, a smartphone, an iPad or a laptop. E-books are also catching on with business and technical audiences -- for example, HR departments can distribute employee manuals digitally, while IT staffers can carry around 800-page references for their favorite programming languages or operating systems without having to dislocate a shoulder.

One of the most attractive things about e-books is that you don't have to be a professional publisher to produce a useful and well-formatted one. Almost anyone can take an existing manuscript -- a technical manual, a corporate white paper or even a personal biography -- and turn it into an e-book.

But you need more than just your document. You also need the right software and the know-how -- because producing an e-book is a little more complicated than it ought to be. The breadth of e-book formats out there, and the quirks of converting your source document into one of those target formats, can make the conversion process anything but straightforward.

In the following article, I've attempted to unravel that particular knot by looking at the e-book creation process from beginning to end -- from formatting the source document to reading the finished product. I'll discuss what formats you need to start with and convert to, detail some of the issues you may encounter along the way, and suggest some software applications that can help.

E-book creation tips

Creating an e-book can be a rocky process, often with no pre-set path from the original document to the finished product. It's difficult to tell in advance what you might need to do, or not do, to make sure a given project renders correctly. However, before you begin the conversion process, there are ways to make things go more smoothly.

Start with the cleanest possible input document. There should be no stylization, formatting or elements present that you don't want in the final product. If something can't be supported in the destination format, it may well get stripped out automatically, but sometimes it might just be translated into something you don't want. You might have no choice but to clean up the original by hand, but it may well be possible to script the cleanup process depending on what you're using to compose your originals.
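As a rough illustration of what a scripted cleanup might look like, here is a minimal Python sketch. It assumes the original has already been exported to HTML; the function name strip_inline_styling and the patterns it uses are my own illustrative choices, not part of any tool discussed in this article. Regular expressions like these are only suitable for simple, well-behaved markup.

```python
import re

# Matches any single tag, so attribute cleanup never touches body text.
TAG = re.compile(r'<[^<>]+>')
# Matches an inline style attribute inside a tag.
STYLE_ATTR = re.compile(r'\s+style\s*=\s*("[^"]*"|\'[^\']*\')', re.IGNORECASE)
# Matches opening and closing <font> tags.
FONT_TAG = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)


def strip_inline_styling(html_text):
    """Hypothetical helper: drop inline style attributes and <font> tags."""
    without_styles = TAG.sub(lambda m: STYLE_ATTR.sub('', m.group(0)), html_text)
    return FONT_TAG.sub('', without_styles)


sample = '<p style="margin:0"><font face="Arial">Hi</font> <b>there</b></p>'
print(strip_inline_styling(sample))
```

Structural tags such as paragraphs, headings and bold text are left alone; only presentational leftovers are removed. Anything more tangled than this is probably better handled with a real HTML parser.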

Consider using HTML as an intermediate target format in all cases. Since the majority of e-book formats revolve around some variant of HTML, it might be a good idea to standardize on HTML as the format to export to first from whatever program you used to edit the document. This minimizes the amount of processing that has to be done by the e-book converter itself. What's more, if you need to perform any manual editing on the file to get it to process correctly, HTML is a convenient format in which to do so: You have direct access to the source code via nothing more than a plain-text editor.

Test the results on multiple devices. Get your hands on as many reading devices as possible -- or, failing that, get in touch with people who have a number of different reading devices and get feedback from them. The desktop Kindle application, for instance, has quirks that the actual device does not (e.g., how each handles non-Western characters), so it helps to know when problems like this are relevant.

Be prepared to repeat as necessary. You will almost certainly have to make multiple passes across an e-book to make sure everything translated correctly. Odds are it won't -- at least not the first time -- and you'll have to go back and tweak many different things by hand. In a way this is another argument for using HTML as an intermediate format, because many of the tweaks that might need to be carried out could be partly automated. Keep notes of what breaks each time so you don't have to repeat your mistakes.

Source formats

The creation of any e-book starts with a source document: a manuscript that you have written or which someone else has provided for you. Right there, the problems begin, since even a "clean" document can pose conversion difficulties. Your goal is to ensure that the document's formatting will be preserved intact.

Odds are most documents used as a source for an e-book will have to go through at least two conversions: First, into a format that the conversion software can use, and then into the actual e-book format -- or formats. Sometimes this can be cut down to one stage, but it's best for the time being to assume you'll need two steps to do the job completely.

Here's a rundown of the most likely formats you'll start with:

HTML

I already mentioned this in the previous section, but it bears repeating: If you're looking for a standard, HTML is more or less it. For one, it's ubiquitous; almost every text-processing program can generate or read HTML. It also supports many features e-books will use: hyperlinks, font control, section headings, images and so on.

The tricky part comes if you weren't working with HTML in the first place. If you're collating posts from a blog or a wiki and assembling them into an e-book, you won't have to put up with quite as much drudgery. But if you're starting with a Microsoft Word (DOC or DOCX) or OpenDocument (ODF) document, your best bet is to export it directly from the source application into HTML. (Word users should do a "Save as..." using the "Web Page, Filtered (HTML)" option, which strips out most of Word's generated cruft.)

Exporting to HTML from your source program helps preserve the most crucial formatting and typically also preserves sections and chapters: outline headers are turned into H1 / H2 / H3 tags, which most conversion programs correctly recognize. Some are even able to auto-generate tables of contents from those tags. That said, I've had good results using Word to generate TOCs before I send the document to the e-book program, since Word typically gives you a broader range of formatting options.
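To check which headings a conversion program is likely to see, you can pull the H1 / H2 / H3 tags out of the exported HTML yourself. The following is a minimal sketch using only Python's standard library; the class and function names are hypothetical, and it makes no claim about how any particular converter builds its table of contents.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1, h2 and h3 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace left over from the export.
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None


def outline(html_text):
    """Return the document's headings in order as (level, text) pairs."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return collector.headings


def format_outline(headings):
    """Render the headings as an indented plain-text outline."""
    return "\n".join("  " * (level - 1) + text for level, text in headings)
```

If the outline this prints doesn't match the chapters you expect, the problem is in the export, and it is easier to fix there than after conversion.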

Microsoft Word (DOC or DOCX)

If you're dealing with an original manuscript, odds are it's going to be in Microsoft Word format. Proprietary as Word may be, almost every word processor and office suite can read or write Word documents. And the format has native support for most everything you could think of: formulas, chaptering, footnotes, indices -- in other words, anything that might show up in an e-book.

That said, Word documents are best seen as a starting point for an intermediate conversion format, most likely HTML, rather than a format that can be converted directly into an e-book. In fact, most e-book conversion programs don't accept Word natively as a source document type. They may accept Word's sibling format, RTF, but that is already at least one stage of conversion away from the original and increases the chance that certain features might not make it through the conversion process. For example, RTF does support features like sections and footnotes, but the Calibre e-book creation suite, for one, didn't process them correctly when I tested it for this article.

OpenDocument (ODF)

OpenDocument, or ODF, is the format used by OpenOffice.org. (Microsoft Word also supports ODF, although it isn't the default format for Word -- it's just one of the formats it reads and writes.) Third parties offer OpenOffice.org extensions that let you export directly to Epub; there are also a number of standalone applications, such as ODFToEPub, which will do the same. If you're already in the habit of creating your documents in ODF, your path to creating a finished e-book may be slightly shortened because of this.

PDF

Adobe's PDF format is all but impossible not to encounter and is used consistently enough as an e-book format that it would be foolish not to mention it. Many programs (such as Word and OpenOffice.org) export directly to PDF, and the files can be opened and read in many applications. In fact, before dedicated e-reader devices made significant inroads into the market, most e-books were just PDF distillations of their print counterparts.

However, it's generally not a good idea to try to use PDF as a source format. Because it's designed to precisely reproduce printed pages, a PDF document needs to be taken apart and put back together if it's being used as a source format for a non-PDF e-book. As a result, PDF should only be used as a source for other e-book formats if you have no choice.

Destination formats

Odds are you won't have just one destination format for your e-book, but several. If your target readers are using a variety of devices -- a Nook, a Kindle, an iPad -- it helps to support as many of those devices as possible. The Kindle, for instance, is notorious for not supporting Epub format files.

These are the most common e-book destination formats and their quirks.

Epub

An open, non-proprietary format that uses XHTML as the basis for its document format, Epub is widely supported as an output format by various e-book production applications -- iTunes, for instance, only accepts Epub as a source format. In fact, it couldn't hurt to render a copy of your product as Epub no matter what other formats you're also planning to output to.

Epub has a few downsides. Its formatting methodology assumes the text will be reflowed to fit the target device, so books that require PDF-style page fidelity won't work well in Epub. Also, there's no support for equations apart from inserting them as images -- TeX or MathML, two commonly used languages for representing math, aren't supported. And Epub doesn't have a standard way to interpret or share annotations, which might be another drawback for people publishing e-textbooks.

To that end, it's best for "straight" text, or documents where reflowed formatting won't be an issue.

MOBI and Kindle

Based on Open eBook, the predecessor to Epub, MOBI -- or Mobipocket -- was developed by the company of the same name as a format to be used with its e-book reader software, designed originally for PDAs and later smartphones. After Amazon bought the company, it made MOBI into the basis for the Kindle reader's own e-book format. MOBI supports Digital Rights Management (DRM), but unencrypted MOBI documents can be read on the Kindle without issues.

PDF

PDFs can be read as-is in the majority of e-book readers, including the Kindle. Exporting to PDF is best when you want to maintain absolute fidelity to page layout -- images, typefaces, and so on.

Ironically, this is the very feature that can make PDFs a problem in some scenarios, which I hinted at before. Other e-book formats are designed to work independent of any particular device resolution, so pages reflow automatically for each device. This is one of the reasons the Kindle didn't make use of page numbers at first, since the page numbering for a particular book could vary depending on what device or screen size you were reading it with.

PDFs, on the other hand, reproduce as closely as possible the formatting of the original page, no matter what the size of the destination device. A PDF formatted for an 8.5 × 11-in. page may be quite readable on a large display, but looks cramped on a Kindle or Nook. Some PDF readers, such as Adobe's own Acrobat Reader application, are able to re-flow a PDF to fit an arbitrary screen size -- but this isn't a universally available function, and you shouldn't count on it being present.

If you're committed to using PDFs, you may want to consider exporting your document in several page sizes as a courtesy to people using e-readers with small screens. This may require some research to figure out what page sizes render best with popular e-book readers.
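One way to get a starting point for that research is simple arithmetic: divide a screen's pixel dimensions by its pixel density to get a physical page size, then multiply by 72 to express it in points, the unit PDF uses. The sketch below assumes you already know the screen's resolution and density; the 600 x 800 pixel, 150 ppi figures are placeholders, not the specifications of any real device.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are measured in points


def page_size_for_screen(width_px, height_px, ppi):
    """Convert a screen's pixel dimensions into a page size in inches and points."""
    width_in = width_px / ppi
    height_in = height_px / ppi
    return {
        "inches": (round(width_in, 2), round(height_in, 2)),
        "points": (round(width_in * POINTS_PER_INCH),
                   round(height_in * POINTS_PER_INCH)),
    }


# Placeholder values for a hypothetical small e-reader screen.
print(page_size_for_screen(600, 800, 150))
```

For those placeholder values the page comes out to roughly 4 by 5.33 inches. Treat a number like this as a first guess only, since readers add their own margins and toolbars.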

Elements to include

When you're building a book, elements that you've included in the original document may need a little extra work to translate properly into the finished product. In addition, some elements that didn't seem important for a print publication may be more useful in an e-book.

Tables of contents

An e-book that isn't properly chaptered is difficult to navigate -- doubly so with devices where going to an arbitrary point in a book is not as easy as it should be. The Kindle, for instance, has no touch screen, so jumping around in a book without a table of contents is a chore.

Font variation

This is most important if you want to set certain elements apart from the rest of the text -- for example, examples of code in a monospaced font. This isn't so much a formatting issue as it is a conversion issue, since font choices can sometimes get stripped out entirely during the conversion process, or not be supported at all on some target devices.

Be sure to try out at least two different font types in your documents -- a standard body-text font and a monospaced font -- to see how they render on different devices and in different book formats. Sometimes font declarations don't work at all: with the Kindle, for instance, you need to use the HTML <pre> tag in e-books to reliably show text in a monospaced font.
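If you're generating or patching the intermediate HTML yourself, wrapping code samples in <pre> is easy to script. This is a minimal sketch with a hypothetical function name; the one detail worth getting right is escaping characters such as < and &, which would otherwise be read as markup.

```python
import html


def code_to_pre(code):
    """Wrap a code sample in <pre>, escaping characters HTML treats as markup."""
    return "<pre>" + html.escape(code, quote=False) + "</pre>"


print(code_to_pre("if a < b && c:\n    go()"))
```

Line breaks and indentation inside the sample are passed through untouched, since <pre> preserves them.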

Illustrations

This can be a crucial issue for some books. You need to make sure any illustrations convert correctly depending on the system you're using. Exporting to HTML as an intermediate step helps here, since image references in HTML are honored pretty consistently throughout the conversion process.
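Before handing the HTML to a converter, it can help to confirm that every image it references actually exists next to the file. Here is a minimal sketch, assuming the images are stored as local files relative to the HTML document; the names are hypothetical, and it does not decode URL-escaped file names.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(html_text, base_dir):
    """Return local image references that don't resolve to a file under base_dir."""
    collector = ImageCollector()
    collector.feed(html_text)
    collector.close()
    base = Path(base_dir)
    return [
        src for src in collector.sources
        # Skip remote images and inline data URIs; only local files are checked.
        if "://" not in src
        and not src.startswith("data:")
        and not (base / src).is_file()
    ]
```

An empty list means every local image reference points at a real file; anything else is a picture that would likely go missing in the finished book.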

Footnotes

Footnotes are typically translated into hyperlinks in e-books, but also run the risk of disappearing if the conversion process doesn't know how to honor them correctly. This is another reason why exporting to HTML as a first step is a good idea: if footnotes and endnotes render as properly hyperlinked elements in that step, they should remain accessible in the finished product, too.
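A quick way to test that in the intermediate HTML is to look for internal links that point nowhere. The sketch below, with hypothetical names, assumes footnotes are exported as links of the form href="#something" whose targets carry a matching id or name attribute; exporters vary, so check what yours produces.

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Records internal link targets and the anchors available to satisfy them."""

    def __init__(self):
        super().__init__()
        self.targets = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.targets.add(attrs["id"])
        if tag == "a":
            if attrs.get("name"):
                self.targets.add(attrs["name"])
            href = attrs.get("href") or ""
            if href.startswith("#") and len(href) > 1:
                self.links.append(href[1:])


def dangling_links(html_text):
    """Return internal link targets that have no matching id or name."""
    audit = AnchorAudit()
    audit.feed(html_text)
    audit.close()
    return [name for name in audit.links if name not in audit.targets]
```

Each name it returns is a footnote reference, or a "back to text" link, that a reader would tap without getting anywhere.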

Pronunciation marks

Some languages, Japanese for instance, use annotations that appear next to the text to indicate how certain characters are pronounced. HTML supports these through what is called "ruby markup," but that doesn't mean it'll always render correctly in the converted e-book.
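One hedge against readers that ignore ruby markup is to include HTML's <rp> fallback elements, which supply parentheses around the reading when the annotation can't be drawn above or beside the text. A minimal sketch, with a hypothetical helper name:

```python
import html


def ruby(base, reading):
    """Build ruby markup with <rp> fallback parentheses for readers without ruby support."""
    base, reading = html.escape(base), html.escape(reading)
    return f"<ruby>{base}<rp>(</rp><rt>{reading}</rt><rp>)</rp></ruby>"


print(ruby("漢字", "かんじ"))
```

A reader that understands ruby hides the parentheses and positions the reading as an annotation; one that doesn't should show the reading inline in parentheses, which is at least legible. Whether a given converter preserves these tags is something you still have to test.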

There are a number of other curious issues that can arise. For instance, if you have a document where outline headings (which typically indicate chapters) are auto-numbered, the numbering doesn't always survive the conversion process. One document I had automatically added "Chapter __:" to the beginning of each chapter, but once converted into an e-book the auto-numbering vanished.
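If auto-numbering goes missing like this, one workaround is to write the numbers into the heading text of the intermediate HTML so there is nothing left for the converter to drop. The sketch below assumes each chapter starts with an H1 tag and that the exported HTML doesn't already contain the numbers; the function name and the default label are illustrative.

```python
import re

# Matches an opening <h1> tag, with or without attributes.
H1_OPEN = re.compile(r'(<h1\b[^>]*>)', re.IGNORECASE)


def number_chapters(html_text, label="Chapter"):
    """Prefix the text of each <h1> with 'Chapter N: ' as literal text."""
    count = 0

    def add_number(match):
        nonlocal count
        count += 1
        return f"{match.group(1)}{label} {count}: "

    return H1_OPEN.sub(add_number, html_text)


print(number_chapters("<h1>Start</h1><p>...</p><h1>End</h1>"))
```

Because the numbers become ordinary text, they survive any conversion that keeps the heading itself.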

Conversion applications

Content-creation programs, such as word processors or publishing suites, are only starting to add e-book formats to their lists of possible exports. Most of the time, you'll need to use some kind of standalone application to perform the final conversion.

Some of the tools you might encounter are designed for extremely specific jobs and are not general conversion utilities. Those producing e-books for the Kindle, for instance, need to use Amazon's own e-book tool called KindleGen to produce a Kindle-compatible file from HTML or Epub input.

These are only four of the better-known conversion applications; there are a lot more out there. In contrasting their behaviors and capabilities, it's clear we're still a ways from having a single end-to-end suite that fits the majority of users' needs.

Adobe InDesign CS5.5

InDesign is normally thought of as a full-blown desktop publishing suite, but in its last couple of incarnations -- especially in the upcoming 5.5 release -- it's been positioned more as a platform for generating output to many different destinations.

The program now includes export options for the Epub format. InDesign accepts a broad range of document formats for import, and can even map style information from the source document to whatever style definitions you have set up in InDesign. A plug-in from Amazon also lets you export directly from InDesign to the Kindle format.

InDesign has two big downsides. The first is the scope and scale of the program. Because it's a full-blown publishing solution, it requires a lot more work to generate a finished product than a simple conversion utility. Second is the price tag: it starts at $699. That puts it out of reach for users not prepared to invest that much money, although the 30-day trial version should give you an idea of whether it's worth the money or is overkill for your needs.

Adobe InDesign CS5.5 from Adobe Systems Inc.

Price: $699

Platforms: Windows XP/Vista/7, OS X 10.5.8/10.6

Calibre

Calibre, a free and open-source application, is marketed more as a personal e-book management solution than a production suite. That said, it can be used as an e-book conversion utility, and a remarkably powerful one -- provided you understand the full range of options. For that reason it may well be the best place to start, especially if you're distilling output for multiple e-book formats.

The best thing about Calibre is its support for a broad range of input document types: the program can accept ODF, RTF, Epub, MOBI, PDF and HTML. Calibre can also reformat documents according to various heuristic rules (unwrapping plain text that has too many line breaks, for instance), or insert chapter breaks by looking for certain text structures (such as a line break, the word "Chapter" and then a number).
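To give a feel for what heuristics of this kind involve, here is a small Python sketch of the two examples just mentioned. It is my own illustration of the general idea and not Calibre's implementation; the patterns and function names are assumptions.

```python
import re

# A line that starts with the word "Chapter" followed by a number.
CHAPTER_LINE = re.compile(
    r'^[ \t]*(Chapter[ \t]+(\d+)\b[^\n]*)$',
    re.IGNORECASE | re.MULTILINE,
)


def find_chapters(text):
    """Return (number, heading line) for each line that looks like a chapter heading."""
    return [(int(m.group(2)), m.group(1).strip()) for m in CHAPTER_LINE.finditer(text)]


def unwrap_paragraphs(text):
    """Join hard-wrapped lines; blank lines are kept as paragraph breaks."""
    paragraphs = re.split(r'\n\s*\n', text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Even this toy version shows why such rules are called heuristics: a heading that isn't followed by a blank line gets folded into the next paragraph by the unwrapping step, so the order in which rules run, and the shape of the input, both matter.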

However, Calibre doesn't support DOC or DOCX documents, so anything coming from Word will have to be saved in another format first. Saving in either ODF or HTML from Word seemed to do the best job of preserving formatting and features, including things like monospaced formatting for code examples. The program can also convert books in bulk as well as individually.
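Bulk conversion can also be scripted from outside the Calibre window using ebook-convert, the command-line tool that ships with Calibre. The sketch below assumes Calibre is installed with ebook-convert on the system path, and that the source documents are HTML files in a single folder; the function names and the choice of output formats are illustrative.

```python
import subprocess
from pathlib import Path


def build_commands(source_dir, formats=("epub", "mobi")):
    """Build one ebook-convert command per HTML file and output format."""
    commands = []
    for source in sorted(Path(source_dir).glob("*.html")):
        for fmt in formats:
            # ebook-convert picks the output format from the file extension.
            target = source.with_suffix("." + fmt)
            commands.append(["ebook-convert", str(source), str(target)])
    return commands


def convert_all(source_dir, formats=("epub", "mobi")):
    """Run each conversion in turn, stopping at the first failure."""
    for command in build_commands(source_dir, formats):
        subprocess.run(command, check=True)
```

Keeping the command list separate from the step that runs it makes it easy to print and inspect what would be converted before committing to a long batch.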

Calibre from Kovid Goyal

Price: Free

Platforms: Windows, OS X, Linux

OpenOffice.org

OpenOffice.org is itself not an e-book system, of course: It's a free open-source productivity suite. That said, a number of people have authored add-ons for OpenOffice.org for exporting to e-book formats from within the program.

Writer2ePub, for instance, exports directly from within OpenOffice into Epub format; ODFToEPub can perform standalone conversion of ODF files or work as an OpenOffice add-in.

OpenOffice.org also has a powerful native PDF export function, one with a greater range of options than the native exporter in Microsoft Word. That's useful as long as you don't mind using PDF as a target document type.

OpenOffice.org from Oracle

Price: Free

Platforms: Windows, OS X, Linux, Solaris

Sigil

A more modest example of an e-book production application, Sigil is both free and open source. It's a lot closer to an editor that exports to e-books (it sports a built-in document editor) than a conversion suite for existing documents, but it also includes various tools for collating and assembling a finished e-book (such as a table of contents editor).

Sigil's main drawback is how it handles importing. It only accepts HTML, plain text or existing Epub files as input documents, so it will most likely work best if you are able to export your original document to HTML in a way that preserves all the most important formatting. A similar program, Jutoh, accepts ODT files and has slightly more robust editing options; it costs $39.

Sigil from Strahinja Markovic

Price: Free

Platforms: Windows XP/Vista/7, OS X 10.5/10.6, Linux

Conclusions

The recent massive surge in demand for e-books hasn't yet triggered a concomitant surge in development of polished products for e-book production. The one thing that's most conspicuously lacking is a single gold-standard product that guides users through the whole workflow and helps them check their results. With all the different book formats that are floating around, putting together such a product might well be an order of magnitude tougher than anyone expects.

The good news is that the e-book boom has helped consolidate formats a bit. The Kindle, the Nook and the iTunes Bookstore (which services both the iPhone and iPad) now stand out as the most common targets for e-books.

The time's right for a product that can walk you through the whole process. For now, though, we'll have to settle for using the tools that do exist, and using them with care and attention.

Serdar Yegulalp has been writing about computers and information technology for over 15 years for a variety of publications.
