Gus Gollings: A Note on Asian Scripts

24 April 2002

The QWERTY keyboard has come into widespread use all over the world. It is based on the modern Latin alphabet, and it obviously does not directly support the input of the tens of thousands of ideographs used in the Chinese, Japanese and Korean languages. The question might then be asked: why are ideographic representations desired in a computer environment at all?

Ideographs pose a difficult input problem (given the limited number of keys on a standard keyboard), and there are mature romanisations of these ideographic languages that could make use of existing word-processing software and hardware. The problem with romanisation systems (and the same applies to speech) is that they cannot distinguish homonyms: the meaning is carried by the ideograph, not by the sound. Romanisation systems such as Pin-Yin (pin meaning 'spell' and yin meaning 'sound') therefore suffer from the same ambiguity that speech does, except that in speech such ambiguities are negotiated as humour, whereas incomprehensibility in written language is seen as useless.
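The homonym problem can be made concrete with a small sketch. The six characters and glosses below are a hand-picked sample chosen for illustration (they are not taken from the original chapter), and `strip_tones` and `group_by_reading` are hypothetical helper names. The sketch groups the characters by their Pin-Yin reading, first with tone marks and then without.

```python
import unicodedata

# Hypothetical, hand-picked sample: character -> (Pin-Yin with tone mark, gloss).
SAMPLE = {
    "是": ("shì", "to be"),
    "事": ("shì", "matter"),
    "市": ("shì", "city"),
    "十": ("shí", "ten"),
    "石": ("shí", "stone"),
    "时": ("shí", "time"),
}


def strip_tones(pinyin):
    """Drop combining accent marks, as signs and newspapers usually do.

    Caveat: this also strips the diaeresis of 'ü', which is not a tone mark.
    That does not matter for the sample above.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_by_reading(sample, keep_tones):
    """Map each romanised reading to the list of characters that share it."""
    groups = {}
    for char, (pinyin, _gloss) in sample.items():
        reading = unicodedata.normalize("NFC", pinyin)
        if not keep_tones:
            reading = strip_tones(reading)
        groups.setdefault(reading, []).append(char)
    return groups


if __name__ == "__main__":
    print(group_by_reading(SAMPLE, keep_tones=True))
    print(group_by_reading(SAMPLE, keep_tones=False))
```

Even with tone marks, each reading in this sample still covers three different characters; once the marks are dropped, all six collapse onto the single spelling "shi".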

Nonetheless, Pin-Yin spelling is used extensively on signs and posters throughout China, although the accent marks needed to reproduce the tones of the language are seldom used. Almost all Western newspapers, from the New York Times to The Australian, have adopted the Pin-Yin spelling to render Chinese names and terms.

It is remarkable that such enormous publishing regimes remain technically and editorially incompetent at rendering and proofing scripts that are dramatically different to those addressed in, for example, the parts of ISO 8859. It is further testimony to the substantial difficulties posed by the typesetting and handling of ideographic texts. However, the more specific problem we see at work in the newspaper scenario is the need for, but impossibility of, mixing different character codes within the one document. The answer is to mix characters from different languages in the one character code, rather than to mix different language character codes in the one document.
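A minimal sketch of the newspaper problem follows. It assumes Unicode, encoded as UTF-8, as the single multiscript character code; this extract does not name a specific code, so that choice is an assumption. The sample sentence and the helper `can_encode` are hypothetical. The codec names are those shipped with the Python standard library.

```python
# One line of mixed-script newspaper copy (hypothetical sample text).
SAMPLE = "Beijing 北京, Seoul 서울"


def can_encode(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # A Western code, two national East Asian codes, and a multiscript code.
    for codec in ("latin-1", "shift_jis", "gb2312", "utf-8"):
        print(codec, can_encode(SAMPLE, codec))
```

In this sketch only the multiscript code can hold the whole line. Each of the other codes lacks at least one of the scripts involved, so the document would have to switch codes part-way through.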

Like all character codes, East Asian language character codes have their origin in early telegraph technology. Although romanisation systems were well established at the time, it was more cost effective to develop character codes that directly mapped the main ideographs to Morse-like codes (and these often entailed encoding more than ten thousand individual characters). Modern Japanese character codes were developed by the Japanese Industrial Standards Committee (JISC) from the 1970s onward. The first Japanese language character code was an extension of ISO 646 that supported katakana, just one of the three scripts used to write Japanese (the two syllabaries, hiragana and katakana, and kanji, the Chinese characters used in Japanese).

This code was refined over several years to a stable release known as JIS X 0201-1976. The logical extension to this initial code was to include hiragana and some of the kanji characters. Over the next fourteen years the JIS X character codes grew to maturity, coming to include hiragana, katakana and several levels of kanji, as well as Greek, Cyrillic and some Eastern European characters beyond the scope of the foundational ISO 646 characters. Work on the early JIS X character sets was therefore the first work toward a multiscript character set of the kind that would solve the newspaper problem outlined above.
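The layering described above is still visible in later encodings. The sketch below uses Python's `shift_jis` codec as a stand-in; Shift_JIS is not mentioned in this extract, so treating it as a view onto the JIS X sets is an assumption, and `sjis_len` is a hypothetical helper. In that encoding the ISO 646 letters and the JIS X 0201 katakana occupy a single byte each, while hiragana and kanji, which arrived with the later sets, need two.

```python
# Sample characters, one from each layer of the JIS X codes.
LATIN_A = "A"            # ISO 646 range
HALFWIDTH_KA = "\uff76"  # katakana KA as carried by JIS X 0201
HIRAGANA_KA = "\u304b"   # hiragana KA, added by the later two-byte sets
KANJI_KAN = "\u6f22"     # the kanji for 'Han'


def sjis_len(ch):
    """Number of bytes Python's shift_jis codec uses for one character."""
    return len(ch.encode("shift_jis"))


if __name__ == "__main__":
    for ch in (LATIN_A, HALFWIDTH_KA, HIRAGANA_KA, KANJI_KAN):
        print(f"U+{ord(ch):04X}", sjis_len(ch))
```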

The levels of kanji mentioned above define Level 1 as a limited set of ideographic characters for 'everyday use', intended to simplify computer memory management of the character code. Further levels define more obscure ideographs, but they are implemented as separate character sets, so the one is not accessible from the other. As can be appreciated, this creates an interesting retardation of the Japanese kanji vocabulary for the benefit of computer performance. It is suggested that character encoding techniques and practice shape the possible use of a language (and, as such, society's image of itself) in quite profound ways, most obviously in the implementation of gargantuan character sets that need to be reduced to 'everyday use'.

The difficulties encountered in the early days of Japanese text processing highlighted a path of least resistance for the development of Chinese, Korean and Taiwanese character sets. The People's Republic of China uses in its national script the largest number of simplified Han ideographs of any of the East Asian languages. The simplified ideographs were born of the language reforms of the 1950s, which were aimed at increasing national literacy. As such, the character set for the People's Republic of China has come to be known as 'Simplified Chinese'. Taiwan, on the other hand, still uses the original form of the Chinese ideographs for its national script, and hence has its character code dubbed 'Traditional Chinese'.
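The split between the two national codes can be sketched with one character in its two forms. The example assumes Python's `gb2312` and `big5` codecs as representatives of the 'Simplified Chinese' and 'Traditional Chinese' character codes; neither codec is named in this extract, and `encodable` is a hypothetical helper.

```python
SIMPLIFIED_HAN = "\u6c49"   # 汉, the simplified form of 'Han'
TRADITIONAL_HAN = "\u6f22"  # 漢, the traditional form of the same character


def encodable(ch, codec):
    """Return True if `codec` has a code point for `ch`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for codec in ("gb2312", "big5"):
        print(codec,
              "simplified:", encodable(SIMPLIFIED_HAN, codec),
              "traditional:", encodable(TRADITIONAL_HAN, codec))
```

Under these assumptions each code accepts only its own form of the character, which is why a text written for one region cannot simply be re-saved in the other region's code.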

While the Simplified Chinese and Traditional Chinese character sets were able to learn from the development of the Japanese character sets, the main Korean language script, hangul, brought new challenges to the construction of character codes. Hangul is an alphabetic script, although out of its alphabet it builds not so much words as syllabic blocks of stacked letters that look similar to the fixed ideography of the Chinese scripts. There are therefore two possibilities. One is to create a system in which the syllabic blocks are hand crafted by typing each letter component. The other is to create a character set that has all the possible combinations of the letters prearranged, and to key the data through a traditional QWERTY keyboard using romanisation systems (in the same way that Chinese ideographs can be entered into a computer). The latter approach has prevailed, and the main Korean character set includes a massive collection of precompiled syllabic blocks.
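The two possibilities can be compared side by side. The sketch below assumes Unicode's hangul blocks rather than the national Korean code (an assumption, since this extract names neither), because Unicode carries both the individual letters and the precompiled syllabic blocks. The constants follow the Unicode hangul syllable layout; `compose` and `decompose` are hypothetical helper names.

```python
import unicodedata

# Constants from the Unicode hangul syllable layout.
S_BASE = 0xAC00   # first precomposed syllable block
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first vowel
T_BASE = 0x11A7   # one before the first trailing consonant
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def compose(lead, vowel, tail=None):
    """Build the precomposed block from its component letters by arithmetic."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(tail) - T_BASE if tail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose(syllable):
    """Split a precomposed block back into its component letters."""
    return list(unicodedata.normalize("NFD", syllable))


if __name__ == "__main__":
    letters = decompose("한")
    print([f"U+{ord(ch):04X}" for ch in letters])  # U+1112, U+1161, U+11AB
    print(compose(*letters) == "한")               # True
    print(L_COUNT * V_COUNT * T_COUNT)             # 11172 possible blocks
```

The final line shows the scale of a fully precompiled collection: 19 leading consonants, 21 vowels and 28 trailing options give 11,172 possible blocks.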

This is an edited extract from 'Multilingual Script Encoding', a chapter in Cope and Gollings (eds.) (2001) Multilingual Book Production (Common Ground, Melbourne). Reproduced here with kind permission of the author. You can find out more about this project at the Creator to Consumer in a Digital Age site.