Digimind Encoding Identifier/Converter
Objective
The internationalization of the internet has come hand in hand with an explosion in the number of character encodings. This has made it possible to publish in languages written in non-Latin scripts: Mandarin, Arabic, Indian languages, etc. Such diversity makes text processing much more complex, and competitive intelligence (CI) solutions often find it difficult to handle special characters, particularly accented letters and punctuation. Digimind Encoding Identifier/Converter was designed to resolve this problem and to encompass the linguistic diversity of the internet for competitive intelligence activities.

How it operates
Digimind Encoding Identifier/Converter’s first task is to identify a document’s encoding. While some documents declare their encoding in their metadata, this remains the exception, and it is not uncommon for the declared encoding to be incorrect. An additional problem is that a single document may contain several different encodings. This is particularly true of RSS feeds from blogs, which aggregate content from various sites, each with a different encoding.
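To illustrate why a declared encoding cannot simply be trusted, here is a minimal Python sketch. It is not Digimind's implementation; the function name `declared_encoding_is_plausible` is hypothetical, and the check is only a necessary condition, since some encodings (Latin-1, for example) accept any byte sequence.

```python
def declared_encoding_is_plausible(data: bytes, declared: str) -> bool:
    """Return True if `data` decodes cleanly under the declared encoding.

    Hypothetical helper for illustration only. A True result does not
    prove the declaration is right; a False result proves it is wrong
    (or that the encoding name is unknown to Python).
    """
    try:
        data.decode(declared, errors="strict")
        return True
    except UnicodeDecodeError:
        # The bytes are not valid in the declared encoding.
        return False
    except LookupError:
        # The declared name is not a codec Python knows about.
        return False


# A page that claims UTF-8 but was actually saved as Latin-1.
page_bytes = "caf\u00e9".encode("latin-1")
print(declared_encoding_is_plausible(page_bytes, "utf-8"))
```

A check like this can only reject a declaration; choosing the correct encoding among several that all decode without error requires further analysis of the content.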
Digimind Encoding Identifier/Converter identifies encodings automatically, sentence by sentence. It then uses its translation dictionary to convert each encoding into Unicode. The final result is a set of documents that share a common, standard encoding.
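The general idea of per-segment identification followed by conversion to Unicode can be sketched as follows. This is an assumption-laden illustration, not the product's algorithm: the candidate list `CANDIDATE_ENCODINGS`, its ordering, and the helper names are hypothetical, the input is assumed to be already split into byte segments, and trial decoding stands in for whatever detection method the tool actually uses.

```python
from typing import Iterable, List, Tuple

# Hypothetical candidate list, tried in order. Strict encodings come first
# because permissive ones would otherwise accept almost anything.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_segment(
    segment: bytes, candidates: Iterable[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Decode one segment, returning (unicode_text, encoding_used)."""
    for name in candidates:
        try:
            return segment.decode(name), name
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so it never fails.
    return segment.decode("latin-1"), "latin-1"


def normalize_segments(segments: Iterable[bytes]) -> List[Tuple[str, str]]:
    """Convert segments with mixed encodings into Unicode strings."""
    return [decode_segment(s) for s in segments]


# An aggregated feed whose items come from sites with different encodings.
feed_items = ["na\u00efve".encode("utf-8"), "caf\u00e9".encode("cp1252")]
for text, encoding in normalize_segments(feed_items):
    print(encoding, text)
```

Trial decoding is the simplest possible strategy and will misidentify text when several candidates decode without error; it is shown here only to make the identify-then-convert flow concrete.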





