Word to Thunderbird formatting woes

Tags: writing, writing software

This one has plagued me for a while. I wonder whether any agents have thrown out my queries just because of it.

Thunderbird, it turns out, has a horrible formatting engine. And Microsoft Word pastes with a boatload of extra junk that Thunderbird doesn't do so well with.

At PPWC, one of the faculty actually recommended pasting first into Notepad, and then copy-pasting into the email client (she didn't specifically say Thunderbird) and manually restoring any styles needed. You can do that, but that's the maximum effort.

I've seen several posts about using HTML cleaners to clean up the Word output in one way or another, and they all still looked painful. One of the least bad suggested emailing the document to Gmail as an attachment, then using Gmail to display the attachment and copying from there into Thunderbird. I'm guessing you could just upload the Word doc into Google Drive, and copy and paste out of there instead.

Here's another option: Use Word to Save As .MHT (a single-file web page). Then open that--the default application for it seems to be Internet Explorer, even in Windows 10. From there, copy and paste out of the Internet Explorer web page into Thunderbird.

When I did this (emailing from my cool-man.org account into my Gmail one), Thunderbird came away without the specific font (Garamond in my case), but the Arial or whatever it used wasn't so bad. It's far better than the Courier style it seemed to use before. And other fonts I had in a couple spots were maintained. Plus all the italics remained. Win-win. I tried with both Word 2010 and 2016 and both did great.

However, while the line spacing was fine, the paragraph spacing got hosed between Word and Thunderbird. The reason is that Word marks up each paragraph as a <p> element, and Thunderbird accepts it as such. Thunderbird itself inserts <br> breaks by default when you hit ENTER, and those add no extra space. A <p> as Word offers it comes with default additional spacing.

So, what to do? In Thunderbird, before your pasted text, use Insert->HTML and insert this:

<style>
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin-bottom:.0001pt;
    font-family:"Times New Roman","serif";
    mso-fareast-font-family:"Times New Roman";}
</style>

What does this do? It's CSS that tells the HTML engine on the receiving end to format the paragraphs from Word (all of which use the "MsoNormal" class) with essentially no bottom margin. This works well, at least with anything that reads the CSS. Outlook likes it fine; Gmail for the web turns it into Arial; Gmail's Inbox on my Android phone seems to ignore it and still has bad paragraph spacing.

Maybe it's time to try out Outlook for composition.


Well, I did that. Outlook fares a little better, but not for mobile readers; it still uses <p> tags, and in my experience mobile readers just ignore the CSS.

So at that point I turned to hacking. I already post-process my book in C# in order to fix up smooth scene transitions. I added something to replace the paragraph markers ("\r" in Word) with simple line breaks (vertical tabs in Word, or "\v"). This didn't really work right, as it hosed up the formatting. I think it messed up page breaks, and somehow all my text became centered. So I tweaked it to replace "\r\t" with "\v\t", which means only paragraph breaks followed by a tab get converted. The result is decent, though not perfect. Considering the lack of useful guidance in the MS Word interop docs on MSDN, this isn't too bad. I'd like to make it better at some point, but for purposes of emailing, this is way better than before.

In case you are a C# coder and want to do this yourself with Word interop, here's an excerpt of code you can try:

        // Assumes "using System;" and "using Microsoft.Office.Interop.Word;" at the top of the file.
        public static string ChangeParagraphBreaksToLineBreaks(string filename)
        {
            try
            {
                Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
                Microsoft.Office.Interop.Word.Document doc = application.Documents.Open(filename);

                string results = ChangeParagraphBreaksToLineBreaks(doc);

                return results;
            }
            catch (Exception e)
            {
                return e.ToString();
            }
        }

        // http://stackoverflow.com/questions/14830309/word-interop-replacing-some-text-with-a-line-break
        // https://msdn.microsoft.com/en-us/library/aa691087%28v=vs.71%29.aspx
        // 2nd best:  only do if there are tabs involved
        private static string _lineEnd = "\v\t";
        private static string _paragraphEnd = "\r\t";

        private static string ChangeParagraphBreaksToLineBreaks(Document doc)
        {
            // https://msdn.microsoft.com/en-us/library/f1f367bx.aspx?cs-save-lang=1&cs-lang=csharp#code-snippet-11
            var range = doc.Range();

            Microsoft.Office.Interop.Word.Find findObject = range.Find;
            // this is the text Word shows back to me: \r (carriage return)
            findObject.Text = _paragraphEnd;
            findObject.Replacement.Text = _lineEnd;

            object replaceAll = Microsoft.Office.Interop.Word.WdReplace.wdReplaceAll;

            object findText = findObject.Text;
            object replaceWithText = _lineEnd;

            // Plain text search: no wildcards and no language-specific matching.
            object matchCase = false;
            object matchWholeWord = false;
            object matchWildCards = false;
            object matchSoundsLike = false;
            object matchAllWordForms = false;
            object forward = true;
            object format = true;
            object matchKashida = false;
            object matchDiacritics = false;
            object matchAlefHamza = false;
            object matchControl = false;
            // Stop at the end of the range instead of wrapping around.
            object wrap = WdFindWrap.wdFindStop;

            // From example:
            // http://stackoverflow.com/questions/19252252/c-sharp-word-interop-find-and-replace-everything
            bool findRet = findObject.Execute(ref findText, ref matchCase, ref matchWholeWord,
                ref matchWildCards, ref matchSoundsLike, ref matchAllWordForms, ref forward,
                ref wrap, ref format, ref replaceWithText, ref replaceAll,
                ref matchKashida, ref matchDiacritics, ref matchAlefHamza, ref matchControl);

            System.Diagnostics.Debug.WriteLine($"Find returned {findRet}");

            // The document is changed in memory only; this excerpt does not save or close it.
            string results = "Done updating paragraph breaks to line breaks";
            return results;
        }

Doing this gets around the CSS issue because there are no longer any <p> markers, just <br> in the HTML that Outlook generates. Thunderbird might even do an OK job with this as well.
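
If you'd rather not touch the Word document at all, the same end result (no <p>, just <br>) can be approximated on the HTML side. This is only a rough sketch and not something I run as part of my own post-processing: the class name ParagraphFlattener and the method ParagraphsToBreaks are made up for illustration, and it assumes you already have the HTML as a string (say, read from a file Word saved as a web page). It uses plain regular expressions, which is fine for the simple <p class=MsoNormal> markup Word emits but is not a general HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the interop code above.
public static class ParagraphFlattener
{
    // Removes every opening <p ...> tag and turns every closing </p>
    // into a single <br>, leaving the text and inline tags in between alone.
    public static string ParagraphsToBreaks(string html)
    {
        // Matches <p> with or without attributes (e.g. class=MsoNormal),
        // but not other tags that merely start with "p", such as <pre>.
        string withoutOpeningTags = Regex.Replace(
            html, @"<p(\s[^>]*)?>", "", RegexOptions.IgnoreCase);

        // Each closing paragraph tag becomes one line break.
        return Regex.Replace(
            withoutOpeningTags, @"</p\s*>", "<br>", RegexOptions.IgnoreCase);
    }
}
```

For example, ParagraphFlattener.ParagraphsToBreaks("<p class=MsoNormal>First.</p><p class=MsoNormal>Second.</p>") gives back "First.<br>Second.<br>". Italics and other inline tags pass through untouched, since only the paragraph tags are matched.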
