Analysis of the system requirements for a Japanese language tool

Wolfgang Slany
Institut für Informationssysteme
Technische Universität Wien
http://www.dbai.tuwien.ac.at/staff/slany/

During the last thirty years, Emmerich Simoncsics has created, together with his collaborators, several computer programs and databases that give persons who do not understand Japanese access to Japanese texts. Users of his programs can look up the English meaning of Japanese words written with some of the several thousand Japanese characters by entering a simple numerical code. This Sim-code is easy to build once a few basic rules are remembered, and a beginner is usually able to look up words after a few minutes of training. Other programs include one that teaches the correct handwriting of the characters and of the simpler Japanese phonetic scripts, as well as a prototype of a Japanese sentence analyzer that tags phrases and words with their grammatical mode and function. In 1996, Simoncsics published a Japanese-English Code-Dictionary [4] containing more than 80,000 entries. These correspond to Kenkyusha's Japanese-English dictionary [2] with added grammatical information, some added vocabulary for special engineering fields, and the Sim-code for fast and convenient look-up.

It would be desirable to allow wide access to the wealth of Japanese language tools that Simoncsics and his team have built so far. However, several problems must be solved first. Because most of the programs were written when only Japanese computers were able to display Japanese characters, they will not run on state-of-the-art computer systems. Moreover, writing a program for one current operating system and hardware platform, besides limiting its use to that particular community, will most probably lead to the same type of problem a few years from now, and thus cannot be considered appropriate for an application that should remain usable for a much longer period of time.

Another concern relates to the copyright of the databases, some of it remaining with Kenkyusha and some of it being held by Simoncsics, and to the ease with which, for instance, CD-ROMs can be copied nowadays despite all available software protection. To solve these problems, we propose to build a system that

will be independent of current hardware and operating-systems, that
will allow controlled world-wide access with limitations where necessary, that
will allow us to integrate existing dictionaries and data as well as pieces of useful code available in the public domain, and that
will make it possible to seamlessly integrate future enhancements with regard to functionality and data content.

The following sections will elaborate on each point.

Java for independence from current hardware and operating-systems

The Java programming language, designed by the company Sun, is the standard hardware- and operating-system-independent programming language. A program written in Java will run as-is, that is, without any change, on any hardware and operating system for which a Java virtual machine is available. Currently, this includes all relevant computing platforms. Java is ideally suited for World Wide Web applications and has gained enough momentum to produce real-world applications. A local example is the SIDES database, which allows teachers at several universities in Vienna to edit course information and communicate with their students. The SIDES editor applet, written in Java, was used 3678 times by 450 persons to edit their data from June 18, 1998 to July 30, 1998. A key feature of Java is its wide availability through web browsers: potential users can run Java applets by simply clicking a hyperlink in their preferred web browser.

Even more important is that free software reimplementations of Java are available, making the language in fact independent of any individual company: any programmer can from now on, without concern for third-party royalties or license restrictions, reimplement the language with relatively little effort for any hardware and operating-system platform that exists or will become available in the future. For comparison, a similar free software project, the LaTeX text-typesetting system [1], has run since 1981 on almost any hardware and operating system (although without Java and thus with much more porting effort) and has been adapted to all major human writing systems. It was used, for instance, to typeset the printed version of Simoncsics's Japanese-English Code-Dictionary [4]. Quite apart from the unbroken superiority of LaTeX in terms of typesetting quality, there is probably no commercial program from 1981 that still runs on present-day computers, whereas such stability is typical for free software. Free software cannot become extinct as long as any capable programmer wants to use it.

Java is therefore the recommended implementation language for an application such as an intelligent tool for human languages that should remain functional over an extended period of time.

Client-server architecture for access control

In a client-server system, a human accesses some service, for instance searching for a word in a Japanese-English dictionary, by invoking a client program that runs locally on the human's computer and communicates with a server program on a central computer through some computer network, usually the ubiquitous Internet. One of the advantages of such a client-server architecture is that only the data really needed by the user must be present on the client-side computer. Besides providing central control over who accesses what, it allows us to collect statistics on user behavior and to find out what users liked or disliked in the system. In the Japanese language tool that we propose, we plan to allow unlimited access to parts that are free, for instance to Jim Breen's collection of royalty-free Japanese dictionaries, and limited access to parts that copyright holders require to be protected from unlimited access, such as Simoncsics's Sim-code or third-party dictionary CD-ROMs such as the many datadiscs available in Sony's electronic book format. In the case of Simoncsics's data, a user will be able to use the system to look up the first 500 Japanese characters or words, chosen freely from the whole database, the access limitation being based on the Internet address of that user. This will serve as a powerful incentive to buy the book version [4] or to acquire a license allowing access to the whole database. The client-server architecture is flexible enough to additionally ensure secure user authentication through passwords and at the same time to control illegal data downloads, for example by making it impossible to systematically download the whole dataset should this be forbidden by the copyright holder, even for users who have acquired licenses for limited use. In contrast, bundling the software and distributing it on a physical medium such as a CD-ROM would jeopardize all attempts to protect the data from being copied as a whole.
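
The per-address limit described above could be enforced on the server side roughly as in the following minimal Java sketch. It is an illustration under stated assumptions, not the design of the proposed system: the class name LookupQuota, its methods, and the decision that repeated look-ups of the same entry do not count twice are all hypothetical choices of ours, and the addresses in the demonstration are documentation-only example addresses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical server-side helper: remembers, per client Internet address,
 * which distinct entries have been looked up, and refuses new entries once
 * the limit is reached.
 */
class LookupQuota {
    private final int maxDistinctEntries;
    private final Map<String, Set<String>> seenByAddress = new HashMap<>();

    LookupQuota(int maxDistinctEntries) {
        this.maxDistinctEntries = maxDistinctEntries;
    }

    /**
     * Returns true if the client at clientAddress may look up entryKey.
     * Assumption: an entry that was already looked up never counts again.
     */
    synchronized boolean tryLookup(String clientAddress, String entryKey) {
        Set<String> seen =
            seenByAddress.computeIfAbsent(clientAddress, a -> new HashSet<>());
        if (seen.contains(entryKey)) {
            return true;               // repeat look-up, no quota consumed
        }
        if (seen.size() >= maxDistinctEntries) {
            return false;              // quota exhausted for this address
        }
        seen.add(entryKey);
        return true;
    }

    /** Number of new entries the given address may still look up. */
    synchronized int remaining(String clientAddress) {
        Set<String> seen = seenByAddress.get(clientAddress);
        return maxDistinctEntries - (seen == null ? 0 : seen.size());
    }

    public static void main(String[] args) {
        LookupQuota quota = new LookupQuota(500);   // limit taken from the text
        System.out.println(quota.tryLookup("192.0.2.1", "entry-1"));
        System.out.println(quota.remaining("192.0.2.1"));
    }
}
```

A real server would also have to persist these counts and decide how to treat users who share an address or change addresses; the sketch leaves both questions open.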

Thus, we recommend implementing the Japanese language tool in the proposed client-server style.

Open design for current data and tools

In order to make the system as widely accessible as possible to interested users, and to be able to incorporate existing data and pieces of useful programming code available as free software, it is necessary for licensing reasons to separate programs and data, in much the same way as a television set is bought separately from the media content that is later watched on its screen. Therefore, we recommend releasing the client-server program as free software under the GNU General Public License. This allows us to take advantage of previous work while still making it possible to abide by the different access control requirements of the copyrighted parts.
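
One way to picture the separation of program and data is a small interface behind which each dictionary is plugged in together with its own access rule. The following Java sketch is only an assumption about how this might look: the names DictionarySource, MapSource, and DictionaryRegistry are hypothetical, the in-memory maps stand in for real dictionary files, and the single "licensed" flag is a deliberate simplification of the access rules discussed in the previous section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical plug-in point: the program only sees this interface. */
interface DictionarySource {
    String name();
    boolean restricted();                 // true if the copyright holder limits access
    Optional<String> lookup(String key);
}

/** Simplest possible source, backed by an in-memory map (illustration only). */
class MapSource implements DictionarySource {
    private final String name;
    private final boolean restricted;
    private final Map<String, String> entries;

    MapSource(String name, boolean restricted, Map<String, String> entries) {
        this.name = name;
        this.restricted = restricted;
        this.entries = entries;
    }

    public String name() { return name; }
    public boolean restricted() { return restricted; }
    public Optional<String> lookup(String key) {
        return Optional.ofNullable(entries.get(key));
    }
}

/** Queries every registered source that the current user is allowed to see. */
class DictionaryRegistry {
    private final List<DictionarySource> sources = new ArrayList<>();

    void register(DictionarySource source) {
        sources.add(source);
    }

    List<String> lookup(String key, boolean licensed) {
        List<String> results = new ArrayList<>();
        for (DictionarySource source : sources) {
            if (source.restricted() && !licensed) {
                continue;                 // skip protected data for unlicensed users
            }
            source.lookup(key)
                  .ifPresent(gloss -> results.add(source.name() + ": " + gloss));
        }
        return results;
    }

    public static void main(String[] args) {
        DictionaryRegistry registry = new DictionaryRegistry();
        // Placeholder data; real sources would wrap the actual dictionaries.
        registry.register(new MapSource("free", false, Map.of("neko", "cat")));
        registry.register(new MapSource("protected", true, Map.of("neko", "cat")));
        System.out.println(registry.lookup("neko", false));
        System.out.println(registry.lookup("neko", true));
    }
}
```

Because the program refers only to the interface, it can be distributed under a free license while each data set keeps its own terms on the server.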

Open design for future enhancements

Besides the aspects already mentioned, the combination of the Java programming language and a client-server style implementation has additional advantages. Because the client program is written in Java, it is usually downloaded by the human user whenever the program is needed. While this means that a little overhead is required to download the program each time, the benefit is that any bug fixes and future enhancements, for instance an added grammatical sentence analyzer for web pages written in Japanese, are immediately available to all users without any effort needed to install the client program on the user's side. The same is true for changes to or enhancements of the data: added or edited dictionary entries, new entry types such as digital audio data for sampled spoken words, or support for dictionaries from Japanese into languages other than English reach all users as soon as they are released. Another advantage of the client-server framework that may become important in the future is that feedback in various forms can be collected from users and sent back to the server side for further processing. A simple application of this principle would be to quickly detect missing entries in the dictionaries. Many more advanced applications are of course imaginable.
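
The detection of missing dictionary entries mentioned above could work by having the server count queries that returned no result. The Java sketch below illustrates the idea under our own assumptions: the class name MissedLookupLog and its methods are hypothetical, and the counts are held in memory only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical server-side log of queries for which no entry was found. */
class MissedLookupLog {
    private final Map<String, Integer> missCounts = new HashMap<>();

    /** Called by the server whenever a look-up returns no result. */
    synchronized void recordMiss(String query) {
        missCounts.merge(query, 1, Integer::sum);
    }

    /**
     * Returns up to 'limit' queries, most frequently missed first;
     * ties are broken alphabetically so the order is deterministic.
     */
    synchronized List<String> mostRequested(int limit) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(missCounts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        MissedLookupLog log = new MissedLookupLog();
        log.recordMiss("query-a");   // placeholder queries, not real data
        log.recordMiss("query-b");
        log.recordMiss("query-b");
        System.out.println(log.mostRequested(10));
    }
}
```

Dictionary editors could review such a list periodically and add the entries that users ask for most often.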

Because the data will also be edited via a special editing client in the same client-server style, it will be possible to allow editing from distant sites, for instance cooperating institutions in Japan, in which case authentication and access control to ensure data quality are of course a must. This combination of ease of use for users of the Japanese language tool, ease of maintenance for us, and the possibility to correct bugs and to add features "on the fly" makes the proposed system design highly recommended.

Conclusion

The framework for a Japanese language tool that we propose is a compromise between our wish to allow as universal access as possible to a set of highly useful Japanese language tools and databases and the need to respect the various copyright and license requirements attached to some parts of the databases. We believe that this complex set of legal and technical requirements is best met by the system we have outlined in this short analysis.

References

1: Leslie Lamport. LaTeX: a document preparation system. Addison-Wesley, 2nd (1st: 1984) edition, 1994.
2: Koh Masuda, editor. New Japanese-English Dictionary. Kenkyusha Limited, Tokyo, 4th edition, 1974.
3: Marshall C. Ramsey, Thian-Huat Ong, and Hsinchun Chen. Multilingual input system for the Web--an open multimedia approach of keyboard and handwritten recognition for Chinese and Japanese. In Proceedings of IEEE Advances in Digital Libraries Conference (ADL'98), pages 188-195, Santa Barbara, April 1998. http://ai.bpa.arizona.edu/mramsey/papers/input/input3.pdf.gz.
4: Emmerich & Waltraude Simoncsics. Japanese-English Code-Dictionary. ÖBV Pädagogischer Verlag GmbH, Vienna, Austria, 1996.

The translation was initiated by Wolfgang Slany on Tue Dec 29 18:02:43 MET 1998

...Sun

http://java.sun.com/

...students

To access SIDES, a valid White-Pages X.500 entry at one of the participating universities and a corresponding password are required. Documentation in German is available at http://www.lzk.ac.at/sides/. A translation service is available at http://babelfish.altavista.com/.

...software

See Netscape corporation's white paper on open source code for details of the free software concept, available at http://sitesearch.netscape.com/browsers/future/whitepaper.html. For details on free software and the Free Software Foundation's GNU project, see http://www.gnu.org/.

...available

http://www.transvirtual.com/kaffe.html, a Java implementation that currently runs on 30 operating systems and 8 hardware platforms.

...dictionaries

http://www.dgs.monash.edu.au/~jwb/wwwjdic.html

...datadiscs

http://www.nichigai.co.jp/eblist/contents.html (in Japanese)

...format

http://www.sentius.com/Sentius/English/Other/EB/eb_fmt.html

...software

See for instance Jim Breen's Japanese dictionary collection, freely available at http://www.dgs.monash.edu.au/~jwb/wwwjdic.html; the impressive Japanese character handwriting recognition Java applet allowing mouse input, written by Todd Rudick and Marshall Ramsey [3], at http://www.cs.arizona.edu/japan/JavaDict/ and http://ai.bpa.arizona.edu/~mramsey/hwrkana/; the Kanji stroke order teaching Java applet written by Kazuya Hirobe at http://fl176.hyper.chubu.ac.jp/wwkanji/opening.html; or the Japanese-English dictionary web service written by Jeffrey Friedl, which requires no Japanese fonts on the client computer, at http://merlin.soc.staffs.ac.uk/cgi-bin/j-e/dict.

...License

http://www.gnu.org/copyleft/gpl.html
