New features in TransTools 2.2

TransTools v.2.2 has just been released. Here is a summary of all the recent developments.

Document Cleaner – tag cleaning improvements

Several months ago, I added Document Cleaner, a collection of commands for quick formatting of Word documents produced primarily by OCR (optical character recognition) and PDF conversion software. If you work with a Word document created from a PDF or a scan image, Document Cleaner will help you make the document more appealing and easier to work with, eliminating some of the common problems associated with converted PDFs/scans.

One of the main problems of such documents is the profusion of tags/codes which appear once you import the document into your CAT tool. Below is an example of one segment in Word and memoQ 6:

[Screenshot: the segment as it appears in Microsoft Word]

[Screenshot: the same segment in memoQ 6]

Here, we can see that the original sentence has no bold formatting, but for some reason that formatting appears in memoQ.

The above document was obtained by running OCR on a PDF scan in ABBYY FineReader 12, one of the best OCR applications, and saving the result in DOCX format. Normally, a good OCR application combined with the DOCX format should be sufficient to avoid excessive tags, but this did not work in the example. The reason is that OCR applications often use dubious formatting techniques, such as character styles (as opposed to direct formatting) applied on top of paragraph styles. Microsoft Word is smart enough to resolve these on screen, but it still retains the formatting instructions in the document, and they propagate into CAT tools like memoQ.
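To make the mechanism concrete, here is a minimal Python sketch of how redundant formatting instructions can split a visually uniform sentence into several runs. It is not how Word or memoQ is implemented: the `Run` class, the field names, and the precedence rule (direct formatting over character style over paragraph style) are simplifying assumptions, and Word's real rules for toggle properties such as bold are more involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Run:
    """Hypothetical model of a text run; None means 'not set at this level'."""
    text: str
    char_style_bold: Optional[bool] = None  # bold requested by a character style
    direct_bold: Optional[bool] = None      # bold set by direct formatting


def effective_bold(run: Run, paragraph_bold: bool = False) -> bool:
    # Assumed precedence: direct formatting, then character style,
    # then the paragraph style's value.
    if run.direct_bold is not None:
        return run.direct_bold
    if run.char_style_bold is not None:
        return run.char_style_bold
    return paragraph_bold


def merge_runs(runs: List[Run], paragraph_bold: bool = False) -> List[Tuple[str, bool]]:
    """Join adjacent runs whose visible (effective) bold state is the same."""
    merged: List[Tuple[str, bool]] = []
    for run in runs:
        bold = effective_bold(run, paragraph_bold)
        if merged and merged[-1][1] == bold:
            merged[-1] = (merged[-1][0] + run.text, bold)
        else:
            merged.append((run.text, bold))
    return merged


# Three stored runs: the middle one carries a bold character style that is
# cancelled by direct formatting, so it looks identical to its neighbours.
runs = [
    Run("The quick "),
    Run("brown", char_style_bold=True, direct_bold=False),
    Run(" fox"),
]
merged = merge_runs(runs)
```

In this model the document stores three runs although the reader sees one uniformly formatted sentence. A tool that looks only at the stored runs would report formatting boundaries around "brown", whereas merging by effective formatting leaves a single run.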

The previous version of Document Cleaner's Reformat command handled the above document poorly. Similar deficiencies of the previous version were identified in a blog post by Tuomas Kostiainen.

To address these deficiencies, I have updated Document Cleaner:

  • Updated Reformat option: ‘Fix formatting problems’ – This option has been substantially modified to remove tags caused by incorrect use of character styles (as illustrated above), different formatting of spaces between words, etc. If this option is applied without any suboptions, the document will fully retain its visual appearance, while most tags will be removed.
  • New Reformat option: ‘Remove character styles, leave direct formatting only’ – Some OCR or PDF conversion tools, e.g. ABBYY FineReader, make frequent use of character styles on top of paragraph styles. If these character styles are left in the document, they will be imported into your CAT tool as tag pairs. At the same time, character styles are of value only if the document was authored by a human. This option strips character styles while leaving the visible formatting intact. If you process a document converted from a PDF or scanned image, I recommend using this option.
  • New Reformat option: ‘Set same font face to all ranges within a paragraph/sentence’ – If you process an image-based PDF in an OCR or PDF conversion tool, the tool may decide that there are several different fonts in a given paragraph when in fact there is only one. For example, the OCR tool may recognize one range of text as ‘Cambria’ and another as ‘Calibri’. When you import the document into the CAT tool, this will result in several tags. Because most documents use the same font throughout a given paragraph (unless a symbol font is used), you can safely format the rest of the paragraph using the font found at the beginning of the paragraph. This option does just that.

  • New Reformat options: ‘Set same font size to all ranges within a paragraph/sentence’, ‘Fix paragraph/sentence font size differences of 1/2/3 pt or smaller’ – These options allow you to level the font size in a paragraph or sentence. If you process a PDF, especially an image-based PDF, in an OCR or PDF conversion tool, the tool may decide that there are several different font sizes in a given paragraph when in fact the original document that was scanned used only one. As a result, the OCR tool may format one range of text as ‘9 pt’, the second as ‘9.5 pt’, and the third as ‘9 pt’ again. Because human-authored documents rarely have more than one font size in a given paragraph, you can use these options to level font sizes across the paragraph or sentence using the first font size of the paragraph/sentence. For a detailed description of these options, see the online reference page.
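The font size options above can be pictured with a short Python sketch. This is not TransTools code: the function name `level_font_sizes`, the `max_diff` parameter, and the rule of comparing every range against the first size are assumptions based on the description, and a paragraph is reduced to a plain list of sizes in points.

```python
from typing import List, Optional


def level_font_sizes(sizes: List[float], max_diff: Optional[float] = None) -> List[float]:
    """Set each range to the first range's font size.

    sizes    -- font sizes (pt) of consecutive ranges in one paragraph or sentence
    max_diff -- if given, only ranges whose size differs from the first size by
                max_diff pt or less are changed (assumed meaning of the
                'differences of 1/2/3 pt or smaller' options); if None, every
                range is changed
    """
    if not sizes:
        return []
    base = sizes[0]
    return [
        base if max_diff is None or abs(size - base) <= max_diff else size
        for size in sizes
    ]


# OCR output from the example: 9 pt, 9.5 pt, 9 pt.
ocr_sizes = [9, 9.5, 9]
leveled = level_font_sizes(ocr_sizes)

# With a 1 pt threshold, a deliberate 14 pt range is left alone.
mixed = level_font_sizes([9, 9.5, 14], max_diff=1)
```

The threshold variant is the more conservative choice: it removes the small size jitter typical of OCR while leaving intentional size changes, such as an inline heading, untouched.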

To clean formatting-related tags, use the first tab of the Document Cleaner dialogue, titled ‘Reformat’, and choose the reformatting options from the list. For documents obtained from a PDF or scanned image, I would recommend the following options: 1) Set default text spacing, 2) Remove text shading, 3) Remove hyphenation, 4) Change font color from ‘Black’ to ‘Automatic’, 5) Fix formatting problems, together with its suboptions 6) Remove character styles, leave direct formatting only, and 7) Fix paragraph/sentence font size differences of 2 pt or smaller. Other options will allow you to remove even more tags, but they may interfere with the original formatting too much and should be used only if you see left-over tags in your CAT tool. Keep in mind that tags can also be associated with inline images, equations, bookmarks (you can use the Bookmark Cleanup tab of Document Cleaner to clean up unnecessary bookmarks), etc.

To avoid changing the options every time, you can save all the options under a new configuration profile at the bottom of the dialogue, or save the options under the Default profile.

For more information on the updated functionality of Document Cleaner, refer to its online help page. To see the new tool in action, download and install the new version of TransTools.

Make a financial contribution for TransTools development

If you use TransTools on a regular basis, I would appreciate it if you made a small financial contribution via PayPal. While I currently only pay for web hosting and the domain name, I have several plans for TransTools that cost money, such as a digital certificate to enable the use of TransTools by more people, or special development tools to make the plug-ins better. Thank you!

New tool – Dual-Language Document Assistant

While we normally produce single-language translated documents, sometimes our clients may request a dual-language translation. Dual-Language Document Assistant (currently in beta status) is a new addition to TransTools for Word designed to facilitate the creation of dual-language documents.

Dual-Language Document Assistant has numerous options to help you format source documents before they are translated manually or imported into your CAT tool. It can convert selected paragraphs into a dual-language table or dual-language text. Additionally, the text on the left or right side can be highlighted, either to facilitate translation or so that it can later be hidden with the Hide/Unhide Text command.

[Screenshot: generated dual-language table]

[Screenshot: generated dual-language text (using slash as the separator)]
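As a rough illustration of the two output layouts, here is a small Python sketch. It is only a model, not the tool's implementation: the function names are hypothetical, and it assumes that both sides initially contain the source text, with one side to be replaced by the translation later.

```python
from typing import List, Tuple


def to_dual_language_rows(paragraphs: List[str]) -> List[Tuple[str, str]]:
    """One table row per paragraph: (left cell, right cell).

    Assumption: both cells start as copies of the source paragraph.
    """
    return [(p, p) for p in paragraphs]


def to_dual_language_text(paragraphs: List[str], separator: str = " / ") -> List[str]:
    """One line per paragraph, with the two copies joined by a separator."""
    return [f"{p}{separator}{p}" for p in paragraphs]


source = ["Article 1", "General provisions"]
rows = to_dual_language_rows(source)
lines = to_dual_language_text(source)
```

The table layout keeps the two languages in separate cells, which makes it easy to highlight or hide one side; the separator layout keeps everything in running text, which suits headings and short labels.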

Here is a screenshot of the Dual-Language Document Assistant dialogue:

For more information about this new tool, go to the command's online reference page. To see the new tool in action, download and install the new version of TransTools.

Other enhancements

Document Cleaner's Apply Variable Row Height command (TransTools for Word), designed to apply variable row height formatting to tables in converted PDFs, can now process tables inside other tables (so-called nested tables).
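Handling nested tables is essentially a recursive walk. The Python sketch below illustrates the idea on a toy data model; the `Table`, `Row`, and `Cell` classes and the `height_rule` values are hypothetical stand-ins, not Word's object model or the actual TransTools code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    tables: List["Table"] = field(default_factory=list)  # tables nested in this cell


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)
    height_rule: str = "exact"  # fixed row height, as often seen in converted PDFs


@dataclass
class Table:
    rows: List[Row] = field(default_factory=list)


def apply_variable_row_height(table: Table) -> int:
    """Switch every row to variable height, descending into nested tables.

    Returns the number of rows changed, including rows of nested tables.
    """
    changed = 0
    for row in table.rows:
        row.height_rule = "auto"
        changed += 1
        for cell in row.cells:
            for nested in cell.tables:
                changed += apply_variable_row_height(nested)
    return changed


# An outer table with two rows; the first row's cell contains a one-row table.
inner = Table(rows=[Row(cells=[Cell()])])
outer = Table(rows=[Row(cells=[Cell(tables=[inner])]), Row(cells=[Cell()])])
rows_changed = apply_variable_row_height(outer)
```

Without the recursive call, only the two outer rows would be processed and the inner table would keep its fixed row height.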

16th March 2013

