REM State

12 Nov

Unigen

Overview

Due to all my work with Javascript, I’ve begun toying around with writing an ECMAScript parser/engine. Naturally, I’m not too far along yet, but I do have a sub-project that’s almost complete: unigen. Basically, ECMAScript has several portions of its lexical grammar defined in terms of Unicode character classes—for example:

UnicodeLetter
any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

While I could simply rely on the C library’s isalpha function, that isn’t guaranteed to give me a “UnicodeLetter” as defined by the ECMAScript specification. Thus, the problem becomes finding out which characters are in each of the classes, and making my own personal lookup routine to tell if a given character is a “UnicodeLetter”.

Generating Class Lists

It just so happens that Unicode, Inc., the publishers of the Unicode specification, provide a huge amount of information about the various Unicode codepoints. It’s stored in a structured text file, called the Unicode Character Database—the perfect format for doing intensive, automated processing. Amongst other information, the UCD contains the specific information that the ECMAScript standard mentions—the category that each Unicode codepoint falls into.

In order to tackle this task, I called upon my favorite Swiss Army sledgehammer: bash, with the standard compliment of core UNIX utilities (cat, sed, cut, etc.). While I could have undoubtedly written this entire portion of the unigen utility in C (or another scripting language, like Perl), I certainly wouldn’t have been able to write the utility anywhere near as rapidly as I wrote it in bash. It managed to get a lot uglier than I initially thought that it would, and the script is slow as hell, but it works.

However, the bash script could go only so far. Mostly, it was able to take care of the initial UCD breakdown, namely:

  1. Expanding ranges defined by their first and last codepoint.
  2. Making sure that all codepoints had the same number of digits.
  3. Splitting codepoints into files based on Unicode categories.
  4. Generating C files with arrays based on ECMAScript groupings.
  5. Ensuring that codepoints are sorted properly in the output.
  6. Generating a header file so that C code can use the arrays.

Optimizing the Information

After going this far, the basic objective is complete: it would be possible to write a function that used the generated C arrays to determine if a given Unicode character is, e.g., an isECMAletter. However, the generated arrays are needlessly large; there are vast ranges of adjacent UTF-16 codepoints, which could easily be collapsed down into a start–finish pair. Since this would be somewhat tricky (and slow) to do in bash, this is where I hopped over to C. The C code is fairly straightforward; it is responsible for:

  1. Reading the arrays generated by the bash script.
  2. Run-length encoding the arrays.
  3. Writing the run-length encoded arrays to new C files.

Conclusion

Some minor work remains to do be done—specifically, surrogate pairs are not optimized (or handled at all) by the optimization code yet. Aside from this, and some code cleanup, the generation script and optimizer are pretty much complete. Other specification writers, take note: this project is a perfect demonstration of why it is an excellent idea to make portions of your specifications machine-readable!

Source

You will need bash and the GNU coreutils in order to run the scripts. Windows users can get an appropriate environment by installing Cygwin. The source is available in several formats:

© 2008 REM State | Entries (RSS) and Comments (RSS)

Global Positioning System Gazettewordpress logo