Plain Text Encoding
- TeX fraction numerator is what follows a { up to the keyword \over
- Denominator is what follows the \over up to the matching }
- { } are not printed
- Simple rules give unambiguous "plain text", but results don't look like math
- How to make a plain text that looks like math?
|
| A TeX fraction numerator consists of the expression that follows a { up to the keyword \over and the denominator consists of what follows the \over up to the matching }. In both the fraction and subscript/superscript cases, the { } are not printed. These simple rules immediately give a "plain text" that is unambiguous, but looks quite different from the corresponding mathematical notation, thereby making it hard to read. It's more appropriate to refer to TeX as a markup language, rather than plain text. |
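The numerator/denominator rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not how TeX itself tokenizes input: the function name `split_over` is hypothetical, and the sketch assumes the input is a single braced group and treats a backslash as escaping the character after it.

```python
def split_over(group: str) -> tuple[str, str]:
    """Split a TeX group like '{a \\over b}' into (numerator, denominator)."""
    group = group.strip()
    if not (group.startswith("{") and group.endswith("}")):
        raise ValueError("expected a braced group like '{a \\over b}'")
    depth = 0
    over_at = None
    i = 0
    while i < len(group):
        ch = group[i]
        if ch == "\\":
            # \over must not be the prefix of a longer name such as \overline.
            is_over = (group.startswith("\\over", i)
                       and not group[i + 5:i + 6].isalpha())
            # Only an \over directly inside the outer braces splits this group.
            if is_over and depth == 1 and over_at is None:
                over_at = i
            i += 2  # skip the backslash and the next character (covers \{ and \})
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0 and i != len(group) - 1:
                raise ValueError("outer group closes before the end of the input")
        i += 1
    if depth != 0:
        raise ValueError("unbalanced braces")
    if over_at is None:
        raise ValueError("no \\over at the top level of the group")
    # The outer braces are dropped, matching the rule that { } are not printed.
    numerator = group[1:over_at].strip()
    denominator = group[over_at + 5:-1].strip()
    return numerator, denominator
```

For example, `split_over(r"{a+b \over c}")` yields the numerator `a+b` and the denominator `c`, while a nested group such as `{{a \over b} \over c}` is split only at the outer `\over`.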
Simple plain text encoding
- Simple operand is a span of alphanumeric characters. E.g., a simple numerator or denominator is terminated by any operator.
- Operators include arithmetic operators, most whitespace characters, all characters in the range U+22xx, an argument "break" operator (displayed as a small raised dot), and the sub/superscript operators.
- Fraction operator is given by the Unicode fraction slash operator U+2044
|
| It's possible to define a "plain text" encoding that looks much more like mathematics. Strictly speaking, some constructs require some simplified mark up, but many expressions are literally plain (Unicode) text. The notation is handy as a math input language for more elaborate markup languages like TeX and MathML and can be used in its own right. |
| We define a simple operand to consist of all consecutive alphanumeric characters. We call this sequence of one or more alphanumeric characters a span of alphanumerics. As such, a simple numerator or denominator is terminated by any operator, including, for example, arithmetic operators, the blank operator, all Unicode characters with codes U+22xx, and a special argument "break" operator consisting of a small raised dot. The fraction operator is given by the Unicode fraction slash operator U+2044. |
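As a rough sketch of the simple-operand rule, the following Python scans outward from each fraction slash (U+2044) and collects the span of alphanumerics on either side. The helper names are hypothetical, and Python's `str.isalnum()` is used as a stand-in for "alphanumeric"; the exact character classes of the encoding described here may differ.

```python
FRACTION_SLASH = "\u2044"

def read_span(text: str, start: int, step: int) -> int:
    """Walk from start in direction step (+1 or -1) while characters are alphanumeric.

    Returns the index of the first character that is not part of the span.
    """
    i = start
    while 0 <= i < len(text) and text[i].isalnum():
        i += step
    return i

def simple_fractions(text: str) -> list[tuple[str, str]]:
    """Return (numerator, denominator) for each fraction slash with simple operands."""
    found = []
    for pos, ch in enumerate(text):
        if ch != FRACTION_SLASH:
            continue
        left = read_span(text, pos - 1, -1) + 1  # first index of the numerator
        right = read_span(text, pos + 1, +1)     # one past the end of the denominator
        numerator, denominator = text[left:pos], text[pos + 1:right]
        if numerator and denominator:
            found.append((numerator, denominator))
    return found
```

In `x+abc⁄d2=1`, the `+` and `=` operators terminate the operands, so the only fraction found has numerator `abc` and denominator `d2`.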
Unicode Plain Text
- Can do a lot with plain text, e.g., BiDi
- Grey zone: use of embedded codes
- Unicode ascribes semantics to characters, e.g., paragraph mark, right-to-left mark
- Lots of interesting punctuation characters in range U+2000 to U+204F
- Extensive character semantics/properties tables, including mathematical, numerical
|
| In Unicode, many characters have default semantics that help in displaying them in mathematical formulae. For example, the characters 0-9 are digits by default. In principle, one could treat them as variables, and a general math markup language needs to be able to declare them as such. But such a need is rare, and it arises only for algebraic (symbolic) usage, not for display. One unfortunate thing about MathML is that these default properties of mathematical characters are ignored: you still have to declare a mathematical alphabetic character as a variable using an <mi> tag, a numerical digit as a digit using an <mn> tag, and an operator as an operator using an <mo> tag. In our plain-text encoding of Unicode, such tags aren't used. Accordingly, cases that depend on overriding the default properties, e.g., using a digit as a variable in a symbolic manipulation program, cannot be handled easily. |
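To illustrate how default character properties could stand in for explicit tags, here is a minimal Python sketch that guesses a MathML token element from a character's Unicode general category. The mapping from categories to `mi`/`mn`/`mo` is an assumption made for this example (the function name `default_token` is hypothetical); it is only a coarse approximation and is not a rule defined by Unicode or MathML.

```python
import unicodedata

def default_token(ch: str):
    """Guess a MathML token element name from a character's general category."""
    category = unicodedata.category(ch)
    if category == "Nd":              # decimal digits such as 0-9
        return "mn"
    if category.startswith("L"):      # letters, including Plane-1 math alphabets
        return "mi"
    if category[0] in ("S", "P"):     # symbols and punctuation, e.g. + ≠ ( )
        return "mo"
    return None                       # spaces, combining marks, and so on
```

With this mapping a digit such as `7` is classified as `mn`, a letter such as `x` or the mathematical italic x (U+1D465) as `mi`, and `≠` (U+2260) as `mo`, without any tag in the text itself.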
Multiple Character Encodings
- As with non-math characters, math symbols can often be encoded in multiple ways, composed and decomposed. E.g., ≠ can be U+003D, U+0338 or U+2260
- Recommendation: use the fully composed symbol, e.g., U+2260 for ≠
- For alphabetic characters, use combining-mark sequences to get consistent typography
- Some representations use markup for the alphabetic cases. This allows multi-character combining marks.
|
| There have been many discussions about the various normalization forms for Unicode characters, and Unicode Technical Report 15 discusses the subject in detail. Math characters are no exception: there are multiple ways of expressing various math characters. It would be nice to have a single way to represent any given character, since this would simplify recognizing the character in searches and other manipulations. Accordingly, it's worthwhile to give some guidelines. |
| The first idea is to use the shortest form of a math operator symbol wherever possible. So U+2260 should be used for the not equal sign instead of the combining sequence U+003D U+0338. |
| On the other hand, for alphabetic characters, combining mark sequences give the most consistent typography. Mathematics uses a multitude of combining marks that greatly exceeds the predefined composed characters in Unicode. It's better to have the math display facility handle all of these cases uniformly to give a consistent look between characters that happen to have a fully composed Unicode character and those that don't. |
| The combining character sequences also typically have semantics as a group, so it's handy to be able to manipulate and search for them individually without having to have special tables to decompose characters for this purpose. |
Using combining-mark sequences appears to conflict with Normalization Form C recommended for the web. But note that it works fine for the Plane-1 alphanumeric characters. If alphabetic composed characters are encountered in math expressions, they should be rendered as the corresponding combining-mark sequence. MathML uses markup for some alphabetic cases. This allows multi-character combining marks, such as a tilde over two or more characters.
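The two guidelines (composed operators, decomposed alphabetic characters) can be sketched with Python's standard `unicodedata` module. The function name `math_normalize` is hypothetical, and grouping a base character with the following characters of nonzero combining class is a simplification made for this example rather than a full implementation of Unicode text segmentation.

```python
import unicodedata

def math_normalize(text: str) -> str:
    """Compose operator symbols, but keep letters as combining-mark sequences."""
    decomposed = unicodedata.normalize("NFD", text)

    # Group each base character with the combining marks that follow it.
    clusters = []
    for ch in decomposed:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    out = []
    for cluster in clusters:
        if cluster[0].isalpha():
            out.append(cluster)  # letters stay decomposed
        else:
            out.append(unicodedata.normalize("NFC", cluster))  # operators compose
    return "".join(out)
```

Under this sketch, U+003D U+0338 becomes U+2260, while é (U+00E9) becomes e followed by U+0301. A Plane-1 letter followed by a combining mark, such as U+1D465 U+0302, is left unchanged, and plain NFC also leaves that sequence alone because no precomposed form exists for it.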
|