Word2CleanHTML to convert the Word Document to HTML. I’ve also used Word to HTML in the past.
Both of these tools do their job well, but I’ve since found that using the method described below is faster and better for very large documents.
If you don’t want to install an additional tool, though, you can still use these websites to get a clean HTML file.
Note, however, that the websites do not generate a full HTML document; they only give you the content that goes within the <body> tag.
To use pandoc, we’ll open a terminal within the folder where our Word Document is and run the following command:
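If you’d rather script this step, here is a minimal Python sketch of such a pandoc call. The input name book.docx is a placeholder, and the helper names are made up for illustration; --standalone and --output are regular pandoc options, and --standalone is what makes pandoc write a complete HTML document rather than only the body content.

```python
import shutil
import subprocess


def build_pandoc_command(docx_path, html_path):
    """Return the argument list for a standalone DOCX -> HTML conversion."""
    # --standalone makes pandoc emit a full document (<html>, <head>, <body>)
    # instead of only the body fragment.
    return ["pandoc", "--standalone", docx_path, "--output", html_path]


def convert(docx_path, html_path):
    """Run pandoc if it is installed; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        return False  # pandoc is not on PATH, nothing to run
    subprocess.run(build_pandoc_command(docx_path, html_path), check=True)
    return True


# Same as typing: pandoc --standalone book.docx --output output.html
print(" ".join(build_pandoc_command("book.docx", "output.html")))
```

Calling convert("book.docx", "output.html") from the folder that holds the Word document should then produce the HTML file.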
You may get a warning about the content of the HTML <title> tag. You can ignore it: this is the title that will show up in Calibre, but since we’re going to edit the metadata and the HTML anyway, it doesn’t really matter.
Now if you open the HTML file in your code editor, you’ll see that pandoc has generated a clean HTML file with all of your content.
Let’s focus on the code of the content first.
If you have a messily formatted Word document, like the one I showed earlier, your output might look like this:
With a cleanly formatted document that uses Word styles, the generated code is properly structured.
If you don’t intend to add custom fonts, we can just work with the HTML file that pandoc created for us.
First off, let’s remove some unnecessary things. We’ll write our own CSS later on, so get rid of all the CSS pandoc wrote.
We also don’t want it to be an XHTML document, so we’ll remove the XML namespace and the XML language attributes. We don’t need the viewport or generator info either.
Instead, we can now edit the book title and add information about the author (you can do this within Calibre too, if you want), and make it a regular HTML document in UTF-8 encoding.
If you want drop caps or have additional elements you want styled, you’ll have to add the markup for these manually.
One of the things I tend to do is add *** as a scene break within a chapter. I’d like those three asterisks to be centered, with a bit more space above and below them. All I do is search for my *** and add a class attribute to the surrounding element, which lets me apply some CSS to it.
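For a long book, the search-and-edit can be scripted. The sketch below assumes the scene breaks came out of pandoc as bare <p>***</p> paragraphs; the class name scene-break is only an example, so use whatever name your CSS targets.

```python
import re

# Matches a paragraph whose only content is three asterisks, e.g. <p>***</p>.
SCENE_BREAK = re.compile(r"<p>\s*\*\*\*\s*</p>")


def mark_scene_breaks(html, class_name="scene-break"):
    """Add a class attribute to every <p>***</p> paragraph."""
    return SCENE_BREAK.sub(f'<p class="{class_name}">***</p>', html)


sample = "<p>She left.</p>\n<p>***</p>\n<p>Morning came.</p>"
print(mark_scene_breaks(sample))
```

Paragraphs that merely contain asterisks alongside other text are left alone, since the pattern only matches a paragraph consisting of the three asterisks.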
If you want consistent drop caps in your ebook, you’ll have to wrap the first letter of each chapter. Search for the <h1> tags and, in the first paragraph below each one, wrap the first letter in an inline element (such as a <span>) with a class.
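The same wrapping can be sketched with a regular expression. This assumes each chapter’s first paragraph is a plain <p> that follows the closing </h1> directly and starts with a letter; the <span> element and the class name dropcap are illustrative choices, not something pandoc or Calibre requires.

```python
import re

# A closing </h1>, optional whitespace, a plain <p>, then its first character.
FIRST_LETTER = re.compile(r"(</h1>\s*<p>)(\w)")


def add_drop_caps(html, class_name="dropcap"):
    """Wrap the first letter after each <h1> in a <span> with a class."""
    return FIRST_LETTER.sub(rf'\1<span class="{class_name}">\2</span>', html)


sample = "<h1>Chapter 1</h1>\n<p>Once upon a time.</p>\n<p>Second paragraph.</p>"
print(add_drop_caps(sample))
```

Paragraphs that open with a quotation mark or an inline tag will not match and still need a manual edit.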
Once that’s done, we’ll go back to our output.html and add CSS within the <style> tags. A lot of this is personal preference, so feel free to do as you please if you know how to write CSS.
In the following, you’ll see commented versions of the different types of CSS.
The first is the basic CSS that will work on any Kindle. The second is an example of how to use a custom font and how to enable hyphenation and ligatures; this CSS is a bit more advanced and will only work on Kindles that support enhanced typesetting.
Here, everything related to the custom font is marked. If you don’t want to use the custom font, just remove those lines.
If you are importing custom fonts, you’ll have to package up your folder as a zip archive. If you are just using your one HTML file, you can simply drag it into Calibre.
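Any zip tool will do for the packaging. As a sketch, Python’s standard library can build the archive too; the function name and the folder and archive names in the usage line are placeholders.

```python
import shutil
from pathlib import Path


def zip_book_folder(folder, archive_base):
    """Zip the contents of `folder` and return the path of the archive.

    `archive_base` is the archive path without the ".zip" suffix; keep it
    outside `folder` so the archive does not end up inside itself.
    """
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(Path(folder)))
```

For example, zip_book_folder("mybook", "mybook_upload") writes mybook_upload.zip next to the mybook folder, with the HTML file and the fonts at the top level of the archive.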
Once you’ve imported your zip or HTML file into Calibre via drag and drop, we can get started on the conversion.
You’ll want to edit your metadata before starting the conversion, so that it’s already stored.
If you’re just using an HTML file, the author and title will have been taken from it. If you imported a zip archive, the folder name will have been used. Add the title and author information, and then you’ll be able to import a cover image or generate one from the templates.
Also make sure to set the language of your book in the language tab if you want hyphenation to actually work.
.azw3
If your Word document was properly formatted, this is a very simple step. You’ll just have to adjust a few settings. These are my default settings for AZW3 conversions. You can set the default output format in Calibre’s preferences under Behavior, and I recommend storing these settings in the common options there to save time.
Start the conversion. If you forgot any metadata, you’ll be able to adjust it here too. I disable font size rescaling, and since we have clean HTML code, we don’t need heuristic processing (it is highly recommended, though, if you are converting a Word document saved directly as HTML).
In Page Setup, choose your own device model.
In Structure Detection and Table of Contents, you’ll use an XPath expression to find the <h1> tags, which is super simple: //h:h1
I also like to have a table of contents at the start, so I add that; you don’t have to if you don’t want to.
Once everything is set, hit OK and you’re done.
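Calibre also ships a command-line converter, ebook-convert, which can apply roughly the same settings without the dialog. The sketch below only builds the argument list; it is an approximation of the dialog settings above, not an exact copy. The file names in the usage lines are placeholders, and the device profile and the table of contents at the start are left out.

```python
def build_azw3_command(html_path, azw3_path, heading_xpath="//h:h1"):
    """Return an ebook-convert argument list mirroring the dialog settings."""
    return [
        "ebook-convert", html_path, azw3_path,  # output format comes from ".azw3"
        "--disable-font-rescaling",             # keep font sizes as written in the CSS
        "--chapter", heading_xpath,             # structure detection: chapter starts
        "--level1-toc", heading_xpath,          # table of contents: first-level entries
    ]


print(" ".join(build_azw3_command("output.html", "book.azw3")))
```

Heuristic processing stays off unless it is explicitly enabled, so it needs no flag here.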
If you did not take my advice and are dealing with messy code and an endless number of chapters, you can still generate a table of contents and apply chapter breaks, as long as the chapter titles all follow a simple pattern. If, unlike in my sample, they also all look identical, styling them with CSS will be quick. In my example there are a few different variants, so it will still be a bit of work.
First, in your conversion, you’ll replace every //h:h1 with a different expression. In my sample, the chapter titles all ended up in <p> tags, and since the book is German, they contain the word “Kapitel” (German for “Chapter”). Since it is fiction, “Prolog” and “Epilog” may also be present, so I wanted to include those in the expression too.
This expression will find all <p> tags whose content contains “Kapitel”, “Prolog” or “Epilog”. I’ll put it in Structure Detection and Table of Contents, wherever the pictures above show the h1 expression.
This then properly generates a TOC and sets page breaks accordingly.
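To check which paragraphs such an expression would catch before converting, the matching logic can be sketched in Python. This is a stand-in for the expression, not the expression itself: in Calibre, something along the lines of //h:p[re:test(., "Kapitel|Prolog|Epilog")] should behave similarly, but treat that form as an assumption. The function name and the sample markup are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
CHAPTER_WORDS = re.compile(r"Kapitel|Prolog|Epilog")


def find_chapter_paragraphs(xhtml):
    """Return the text of every <p> that mentions Kapitel, Prolog or Epilog."""
    root = ET.fromstring(xhtml)
    titles = []
    for p in root.iter(XHTML + "p"):
        text = "".join(p.itertext()).strip()  # include text inside <strong> etc.
        if CHAPTER_WORDS.search(text):
            titles.append(text)
    return titles


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>Prolog</p><p>Es war dunkel.</p>"
    "<p><strong>Kapitel 1</strong></p><p>Der Morgen kam.</p>"
    "<p>Epilog</p></body></html>"
)
print(find_chapter_paragraphs(sample))
```

Note that a body paragraph that happens to contain one of the three words would match as well, so it is worth skimming the result for false positives.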
But if we look through the book, we see all of the ugly chapter headers.
If they bother you (you might not have that many different ones), there’s a way to fix this.
You can actually edit the code of your book.
Right-click the book and choose to edit it, and you’ll see the editor with all the processed files.
Clicking through the different HTML files, you’ll see that the titles have different classes applied to them.
In the case of our bold text, I’d recommend removing the <strong> tag wrapped around the title, unless it is used that way throughout the book. In my sample it only happened once, so I’ll remove it. (It is also the only place the class .calibre5 is used, so we’ll delete that from the CSS too.)
While looking through the files, I saw that all paragraphs use the class .calibre4 and the titles have the classes .calibre3, .calibre6 and .calibre7.
So what I’d do to make them all look the same is open the CSS file with all the styling rules and combine these classes.
This is a messy job, but I assume the goal here is just to keep the headers from getting on your nerves, and the quickest way is to combine the CSS instead of adjusting all the HTML. So I’d delete the CSS for .calibre3, .calibre5, .calibre6 and .calibre7, and then write one combined selector for .calibre3, .calibre6 and .calibre7 with the same styling I apply to my h1.
Amazon does not like combined selectors, so if you were trying to publish something like this, I’m certain the file would fail. But it will work on your personal Kindle.
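To make “combined selector” concrete: it is a single CSS rule whose selector lists several classes separated by commas. The small sketch below renders such a rule as text; the function name is made up, and the two declarations are placeholders that stand in for whatever styling your h1 uses.

```python
def grouped_rule(selectors, declarations):
    """Render one CSS rule shared by several selectors."""
    body = "\n".join(f"  {prop}: {value};" for prop, value in declarations.items())
    return f"{', '.join(selectors)} {{\n{body}\n}}"


# Placeholder declarations; copy the ones from your own h1 rule instead.
print(grouped_rule(
    [".calibre3", ".calibre6", ".calibre7"],
    {"text-align": "center", "font-weight": "bold"},
))
```

The printed rule can be pasted into the stylesheet in the editor in place of the deleted .calibre3, .calibre6 and .calibre7 rules.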
Before working on the enhanced settings, send the file to your Kindle: connect it via USB and send the book over. Look at it to make sure everything looks how you want it to. You should have proper-looking titles, a working Go To option, a table of contents with functional links, and actual page numbers.
Hyphenation will not work yet, embedded fonts (if you have any) won’t work yet either, and the alignment option will not be available.
Once your ebook works almost as you want it to, it’s time to make it a proper .kfx file. Install the KFX Output plugin, then use the .azw3 file as input and run the conversion.
Keep all the settings and just change the output format to KFX.
Once the conversion is done, use the option to send a specific format, choose the KFX file to send it to your Kindle, and enjoy a good-looking ebook with all the features.