Supported File Formats
What file formats are supported?
KantanMT supports a wide range of training and client file formats. You can also develop your own parsers for proprietary file formats using GENTRY.
Files that are supported by KantanMT fall into two broad categories: Training Data Files and Translation Data Files.
Training Data Files
These files are used to train a KantanMT engine. You can upload bi-lingual or parallel texts as well as mono-lingual training data sets.
- TMX - Translation Memory Exchange format. These are bi-lingual texts containing both source and target segments and are used to train your KantanMT engine.
- Wordfast TM - Wordfast Translation Memory format. These are bi-lingual texts containing both source and target segments.
- TXT - Text-based training files are also supported. Make sure they are UTF-8 encoded, and name them source.utf8.src for source segments and source.utf8.trg for translated segments.
- DOCX/PDF - Microsoft Office DOCX and Adobe PDF files are treated as monolingual training data. If you want to upload a text file containing monolingual training data, please ensure it is called source.utf8.trg.mono.
- TBX - TermBase eXchange format. This file lists words/phrases that are to be translated in specific ways and/or lists of untranslatable words/phrases.
- XLSX - Microsoft Excel Spreadsheet format. Source terms should be stored in Column A, with the corresponding target terms in Column B. All other columns are ignored.
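As a rough illustration of the TXT naming convention above, the following Python sketch writes a list of segment pairs to source.utf8.src and source.utf8.trg in UTF-8. The helper name write_parallel_files is hypothetical, and the one-segment-per-line layout is an assumption carried over from the test and tune file description; it is not a KantanMT tool.

```python
from pathlib import Path


def write_parallel_files(pairs, out_dir):
    """Write (source, target) pairs to source.utf8.src / source.utf8.trg.

    Hypothetical helper. Assumes one segment per line, so that line i of
    the .src file corresponds to line i of the .trg file.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    src_path = out / "source.utf8.src"
    trg_path = out / "source.utf8.trg"
    with open(src_path, "w", encoding="utf-8", newline="\n") as src, \
         open(trg_path, "w", encoding="utf-8", newline="\n") as trg:
        for source, target in pairs:
            # Collapse internal whitespace and newlines so that every
            # segment occupies exactly one line in each file.
            src.write(" ".join(source.split()) + "\n")
            trg.write(" ".join(target.split()) + "\n")
    return src_path, trg_path
```

Writing both files in a single loop keeps the source and target lines in step, which is the property a bi-lingual text file pair depends on.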
Test & Tune Data
- Test Data - This data should be stored in aligned, UTF-8 encoded text files called source.test.src and source.test.trg. Each file should have one test segment per line.
- Tune Data - This data should be stored in aligned, UTF-8 encoded text files called source.tune.src and source.tune.trg. Each file should have one tune segment per line.
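Before uploading test or tune data, it can help to confirm that the two files really are aligned. The Python sketch below reads both files as UTF-8 and compares their line counts. The function names are hypothetical, and treating "aligned" as "same number of lines" is an assumption made for illustration; it cannot detect pairs that are merely out of order.

```python
def read_segments(path):
    """Return the lines of a UTF-8 text file, one segment per line."""
    # Opening with encoding="utf-8" raises UnicodeDecodeError if the file
    # contains bytes that are not valid UTF-8.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def check_aligned(src_path, trg_path):
    """Return the number of segment pairs, or raise ValueError on a mismatch."""
    src = read_segments(src_path)
    trg = read_segments(trg_path)
    if len(src) != len(trg):
        raise ValueError(
            f"{src_path} has {len(src)} lines but {trg_path} has {len(trg)}"
        )
    return len(src)
```

For example, check_aligned("source.test.src", "source.test.trg") returns the segment count when both files have the same number of lines.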
KantanMT supports two compression file formats for training data files.
- ZIP - ZIP is an archive file format that may contain one or more files or directories that may have been compressed.
- TAR.GZ/TGZ/GZ - Files that have been placed in a TAR archive and then compressed using Gzip. Compressed TAR files of this kind are called tarballs.
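Either archive type can be produced with Python's standard zipfile and tarfile modules. The sketch below bundles a list of training files into a ZIP or a gzip-compressed tarball, chosen by the archive's file extension. The helper name package_training_files is hypothetical, and storing the files at the root of the archive (with no sub-directories) is an assumption, not a documented KantanMT requirement.

```python
import tarfile
import zipfile
from pathlib import Path


def package_training_files(files, archive_path):
    """Bundle files into a .zip, .tar.gz or .tgz archive (hypothetical helper)."""
    archive = Path(archive_path)
    name = archive.name.lower()
    if name.endswith(".zip"):
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for f in files:
                # arcname drops the directory part so files sit at the archive root.
                zf.write(f, arcname=Path(f).name)
    elif name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(archive, "w:gz") as tf:
            for f in files:
                tf.add(f, arcname=Path(f).name)
    else:
        raise ValueError("expected an archive name ending in .zip, .tar.gz or .tgz")
    return archive
```

For example, package_training_files(["source.utf8.src", "source.utf8.trg"], "training.zip") creates a ZIP archive containing both files.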
Translation File Formats
KantanMT can translate a wide range of file formats. GENTRY can be used to build parsers for proprietary file formats.
CMS/GMS & CAT Tool File Formats
- XLIFF - Standard XLIFF document format
- TTX - SDL TRADOS Tag format (XML version)
- SDL-XLIFF - SDL XLIFF format (TRADOS Studio 2011)
- TXML - Wordfast Translation File Format
- TMX - Translation Memory Exchange Format
- EXP - Transplicity File Interchange Format
- XLZ - Idiom WorldServer Desktop Workbench Files
- MQXLZ - memoQ Bundle Files
- MQXLIFF - memoQ XLIFF File Format
- .sub.trg - Movie Subtitle File Format
- XLF - CAD file exports prepared in Muldrato
- DOCX - Microsoft Word Format
- PDF - Adobe PDF Format
- ODT - OpenOffice Document Format
- DITA - Standard DITA document format
- XML - Generic XML documents. Email firstname.lastname@example.org if you want a custom parser built for your own XML documents. We've developed GENTRY to do this quickly and easily!
Desktop Publishing Formats
- INX - Adobe InDesign File Format
- IDML - Adobe IDML File format
- XML - Adobe FrameMaker XML File Format
Web Based Formats
- HTML - Standard HTML documents used in the development of web content
- SVG - Scalable Vector Graphics files
Content Management Formats
- NovaDoc - Nova document format
- MonTag XML - Montag document format
- Arbortext XML - XML document with Arbortext markup.
- TXT - Standard TXT file format (make sure the files are UTF-8 encoded!)
- XLSX - Microsoft Excel Formats
Need support for more file formats?
Don't worry! We have that covered!
We've developed GENTRY to make it really quick and easy to build parsers for KantanMT.
Try it yourself or send an email (email@example.com) and we'll build one for you!