| Collation – defining, structuring and finding data for a particular language |
Collation is defined as the culturally expected ordering of linguistic characters in a particular language. This culturally expected ordering allows users to define, structure and find data in a way that is consistent for their particular language. For example, a filing system that uses an alphabetical ordering for English will start with A, and continue on to Z in a manner expected by English users. Users of the filing system will expect to find C before D, and R before S; if the filing system uses correct collation, users will be able to file and find data easily. Unfortunately, most languages use an ordering system that is significantly more complicated than this English example. Even within the Latin script, there are considerable differences in how A-Z (and additional characters) sort by language. Some examples include:
- In Danish and Norwegian, Ä and Ö follow the letter Z;
- In Finnish and Swedish, V = W;
- In Polish, Z < Ź < Ż;
- Turkish sorts Ş after S.
|
| In addition, many languages support multiple sorts. For example, German sorting can vary depending on how the umlauted characters are treated (either as variants of their non-umlauted forms, or as expansions); Hungarian supports multiple collations as well. |
| As this Latin script example shows, the variance within a single script, and even within a single language, prohibits defining an all-encompassing collation for a script that is acceptable for all languages; this applies to most scripts. |
| Collation as a rule is based on language-specific sorting elements. For the purposes of this paper, a sorting element is a discrete element of a language that carries a primary weight in sorting. Users consider a sorting element a 'character' in their language, and they expect strings to be grouped according to these primary sorting weights. From the sorting examples given above, sorting elements include Ä and Ö in Norwegian, and Z, Ź and Ż in Polish. What affects sorting in one language may not affect sorting in another language that uses the same script; in addition, identical sorting elements very often sort differently in different languages. For example, Ö is a sorting element in Turkish, Swedish, and Danish, among other languages, yet it sorts in very different positions depending on the language: |
Turkish: N < O < Ö < P < Q
Swedish: Y < Z < Å < Ä < Ö
Danish: X < Y < Z < Ä < Ö |
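The effect of language-specific orders can be illustrated with a minimal Python sketch. The alphabet strings and the `alphabet_sort_key` helper below are illustrative assumptions, not part of any real collation library; they cover only a single (primary) level and only uppercase letters.

```python
import unicodedata

# Hypothetical, simplified uppercase alphabets used only for illustration.
ALPHABETS = {
    "tr": "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ",
    "sv": "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ",
}

def alphabet_sort_key(word, lang):
    """Build a primary-level sort key from the position of each letter
    in the chosen language's alphabet (assumed helper, not a real API)."""
    alphabet = unicodedata.normalize("NFC", ALPHABETS[lang])
    rank = {ch: i for i, ch in enumerate(alphabet)}
    word = unicodedata.normalize("NFC", word)
    # Letters missing from the alphabet sort after all known letters.
    return tuple(rank.get(ch, len(rank) + ord(ch)) for ch in word)

words = ["ZAR", "\u00d6N", "ON", "PAS"]  # "\u00d6" is Ö

turkish = sorted(words, key=lambda w: alphabet_sort_key(w, "tr"))
swedish = sorted(words, key=lambda w: alphabet_sort_key(w, "sv"))

print(turkish)  # ['ON', 'ÖN', 'PAS', 'ZAR']
print(swedish)  # ['ON', 'PAS', 'ZAR', 'ÖN']
```

The same four strings come out in two different orders, because Ö sits between O and P in the Turkish table and at the end of the Swedish one.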
In addition, these 'characters' can be represented by multiple code points in Unicode. For example, Ö is encoded at U+00D6 as a pre-composed form, but can also be created through the composition of U+004F and U+0308; both forms need to sort in a correct manner for the appropriate languages. This is not unusual behavior; many primary (and even secondary) sorting elements of a language can consist of multiple code points. Because linguistic collation is primarily based on sorting elements in a language, and because a single code point cannot always represent these, it is not possible to create a culturally correct sort based on pure code points (that is, individual characters within an encoding) for many languages.
|
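The two encodings of Ö mentioned above can be compared with Python's standard `unicodedata` module. This is a small sketch: it shows only that pure code point order separates the two spellings and that normalization makes them compare as equal. It does not produce a language-specific order.

```python
import unicodedata

precomposed = "\u00d6"       # Ö as a single code point
decomposed = "\u004f\u0308"  # O followed by COMBINING DIAERESIS

# The two forms are different code point sequences...
print(precomposed == decomposed)  # False
# ...but are canonically equivalent after normalization.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

words = ["Zebra", decomposed + "l", "Oslo", precomposed + "l"]

# Pure code point order places the two spellings of "Öl" apart.
by_code_point = sorted(words)

# Normalizing inside the key gives both spellings the same key,
# so they end up next to each other.
by_normalized = sorted(words, key=lambda w: unicodedata.normalize("NFC", w))
```

With code point order, "Zebra" falls between the two spellings of "Öl"; with the normalized key, the two spellings are adjacent.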
| Collation's applicability to Indic scripts and languages |
| The first concept, namely that a single order is not sufficient for a single script, applies to at least some of the Indic scripts, notably Devanagari. While the research is not complete for Devanagari-script languages other than Hindi, it is apparent that at least Marathi can have a different order from Hindi (Sanskrit and Konkani could possibly differ from Hindi as well). This is seen in the sample below, which compares the latter section of the consonant ordering in Hindi and Marathi: Lla (U+0933) sorts between La (U+0932) and Llla (U+0934) in Hindi, but comes after Ha (U+0939) in Marathi. In addition, two combinations of code points (Ksha and Jnya) are considered conjuncts in Hindi, but are unique characters (graphemes) in Marathi. |
| Table : Some differences in sorting order between two Devanagari script languages: Hindi and Marathi |
 |
*considered a conjunct in Hindi, but the 35th consonant in Marathi **considered a conjunct in Hindi, but the 36th consonant in Marathi |
| This particular example highlights why a single collation will not work for the Devanagari script; different languages that use the Devanagari script have different expected collation results. |
| Developers for the Indic market (or any language market) should consider it best practice to leverage extant collation technology (or develop new technology), rather than depending on character encoding order to get correct sorting results for different languages. Software vendors developing linguistic collation functions conduct research to determine the correct 'character' order for each language within a script (where a 'character' does not necessarily correspond to a single code point) and write this into the collation function; these functions should be called for collation, rather than expecting the encoding itself to be in perfect sorting order. |
| In comparison to Devanagari, which clearly cannot use a single code point order due to different language collations within the script, there are other Indic scripts which support just a single language (e.g., Gurmukhi, used for the Punjabi language). Many implementers wonder: can these scripts support linguistic collation for their respective languages using only code point order (provided of course the code points are in the correct order)? |
| The answer to this question is no; the second concept of collation (primary sorting elements often require multiple code points) applies to monolingual scripts as well as to multiple-language scripts like Devanagari. In researching collation for Indic languages, it became apparent very quickly that properly sorting 'characters' in many of the Indic languages, including those with a single language per script, often requires treating multiple (two or three) code points as a single sorting element. |
| For example, in Hindi, consonants with modifier marks, that is, consonants modified by candrabindu (U+0901), anusvara (U+0902) or visarga (U+0903), sort as unique characters before the unmarked consonant. In other words, the sorting order for Hindi consonants follows this pattern (using Ka as an example): |
Ka + candrabindu, Ka + anusvara, Ka + visarga < Ka |
| The three variants of Ka with modifier marks are considered equal from a primary weight perspective (they differ at the secondary weight level); however, all three have a lighter primary weight than Ka without a modifier mark. The same holds for the other consonants: a consonant plus one of these modifier marks has a lighter primary sorting weight than the same consonant without a modifier mark. |
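A minimal Python sketch of this behavior follows. The three-consonant list, the numeric weights, the secondary order assumed among the three marks, and the helper names are all illustrative assumptions; a real collation table would cover the full script and more weight levels.

```python
# Illustrative subset of consonants in order: Ka, Kha, Ga.
CONSONANTS = ["\u0915", "\u0916", "\u0917"]
# Modifier marks with ASSUMED secondary weights (the source states only
# that the three variants differ at the secondary level).
MARKS = {"\u0901": 1, "\u0902": 2, "\u0903": 3}  # candrabindu, anusvara, visarga

def sorting_elements(text):
    """Group a consonant and a following modifier mark into one element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] in CONSONANTS and i + 1 < len(text) and text[i + 1] in MARKS:
            elements.append(text[i:i + 2])
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements

def hindi_sort_key(text):
    """Two-level key: all primary weights first, then secondary weights."""
    primary, secondary = [], []
    for element in sorting_elements(text):
        base = element[0]
        if base in CONSONANTS:
            weight = 2 * (CONSONANTS.index(base) + 1)
            if len(element) == 2:
                primary.append(weight - 1)  # marked form is lighter
                secondary.append(MARKS[element[1]])
            else:
                primary.append(weight)
                secondary.append(0)
        else:
            primary.append(1000 + ord(base))  # fallback for other characters
            secondary.append(0)
    return (tuple(primary), tuple(secondary))

KA, KHA = "\u0915", "\u0916"
items = [KA, KA + "\u0903", KHA, KA + "\u0901", KA + "\u0902"]

linguistic = sorted(items, key=hindi_sort_key)
code_point = sorted(items)
```

Here `linguistic` places the three marked forms of Ka before plain Ka, while `code_point` places plain Ka first, because a string always sorts before any longer string that begins with it.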
| In addition, the nukta in Hindi (U+093C) modifies a consonant in sorting such that this combination has a combined primary weight equivalent to an unmodified consonant, but with an additional tertiary weight. That is: |
Ka + nukta = Ka at the primary weight level; the two differ only by the additional tertiary weight |
| This phenomenon is not limited to Hindi. Tamil has an analogous structure in sorting with the virama (U+0BCD): a consonant + virama (halant) combination is a unique sorting element whose primary weight is lighter than that of the consonant by itself, so it sorts before the consonant without a virama: |
consonant + virama < consonant |
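One common way to treat several code points as a single sorting element is a weight table keyed by strings, matched longest-first. The Python sketch below applies that idea to the Tamil pattern just described; the two-consonant table, the numeric weights, and the `tamil_sort_key` name are illustrative assumptions only.

```python
KA, NGA, VIRAMA = "\u0b95", "\u0b99", "\u0bcd"

# Hypothetical primary weights: each consonant + virama sequence is its own
# sorting element and is lighter than the bare consonant.
WEIGHTS = {
    KA + VIRAMA: 1,
    KA: 2,
    NGA + VIRAMA: 3,
    NGA: 4,
}
MAX_LEN = max(len(sequence) for sequence in WEIGHTS)

def tamil_sort_key(text):
    """Map text to primary weights, preferring the longest table match."""
    key, i = [], 0
    while i < len(text):
        for length in range(MAX_LEN, 0, -1):
            chunk = text[i:i + length]
            if chunk in WEIGHTS:
                key.append(WEIGHTS[chunk])
                i += len(chunk)
                break
        else:
            # Characters outside the table sort after all known elements.
            key.append(1000 + ord(text[i]))
            i += 1
    return tuple(key)

words = [KA, KA + VIRAMA, NGA, NGA + VIRAMA]
ordered = sorted(words, key=tamil_sort_key)
# ordered == [KA + VIRAMA, KA, NGA + VIRAMA, NGA]
```

Because the table is consulted longest-first, the two code points of a consonant + virama sequence are never weighted separately.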
| As in Hindi, multiple code points in Tamil often combine to create a single sorting element. This is the situation in other Indic languages as well. Because of this, using code point order within an encoding as linguistic collation is in no way sufficient, even for scripts that represent only one language. Again, developers should consider using code point order for collation to be against best practice, and they should either use or develop functions that provide linguistic collation. It is important for the development community to treat character encoding order as just a characteristic of the encoding, and not to place the burden of linguistic collation on the encoding. |