Arch2Arch Tab BEA.com
Syndicate this blog (XML)

Rethinking Algorithms

Bookmark Blog Post

del.icio.us del.icio.us
Digg Digg
DZone DZone
Furl Furl
Reddit Reddit

Fred Mikkelsen's Blog | October 1, 2007   9:18 AM | Comments (0)


I have been working on my programming language, "Just", the Java Smalltalk compiler. I wrote it originally in 2002 to allow Smalltalk-like expressions of blocks and iterators. Over time, I relaxed the full Smalltalk-80-type specification and went to a "best-fit" Java Reflections API approach to execute these instructions.

For ad-hoc, and algorithmic-oriented work, I find the Smalltalk-like specifications to be more flexible primarily because all data items are objects and all collections inherit common collections frameworks. For example:

collection do: [ :x | System out println: x ]

prints out all of the contents of a collection separated with carriage returns. Which is increadibly more easy than constructing a for- or while-loop, on an enumeration (if it's an enumerated list) or an iterator (if it's iterator oriented), or "[x]" if it's an array, or ....

For the compiler, I was working with standard compiler options like the LR-1 type compiler and dissecting things into terms, expressions, factors, etc. as I had been taught in compiler class. I found I was creating dozens of classes with abandon for all the data forms that would be parsed. I started restructuring the class hierarchy to simplify -- merge things that were similar. I began re-thinking the reason for those algorithms.

In the 1950s, when Donald Knuth, Grace Hopper and that generation of computer legends was working, 8KB of memory was a large amount. CPU time was expensive. And disk I/O (or tape or punch-card I/O) was very slow. I was reasonably sure that it would be rare for me to have one source block to compile greater than 5 MB, and all that could fit into memory.

I went back to the previous generation of computer heroes, Alan Turing, and the Turing Machine for ideas of a compiler. Alan Turing had a notion of the computational tape and moving back-and-forth using a state machine and substituting tokens on the tape for other tokens. I could try that.

I then dissected the "tape" of code using simple operations applied in a logical order of succession.
  • Comments are removed first
  • Strings and Character Constants are removed. (things in Strings can look like code macroscopically but not really be code)
  • Using an iterative approach, nested structures are removed with the rule that the innermost pair will be processed first. a) Groups in Parens, b) Groups in Brackets, c) Groups in Braces. Each of these will be further pushed into for processing
  • as each item is identified, it is replaced with a token that is a sequence of characters that is out of the bandwidth of standard processing. In my case, I chose a digit, two letters, and digits format to identify the type of tag removed. E.g. 0pa123 for parenthesized term 123.

  • More rigor would be needed to make this production code certainly, but as a debugging tool or a hacker's desk-top assistant, it is quite helpful.

    The take-away that I am putting forward is that this alogrithm is not 'efficient' in the classical sense, but it is very compact and fast. A two GHz processor can traverse a 100KB string in memory perhaps 2000 times per second. A code generator that generates code to parse code will be more compact and faster parsing across a String and repacing tokens than generating multiple Java classes to do push-down parsing. In a jar file, there is a minimal overhead for each class, and there is less overhead for a method. The types of problems that the 1950's compilers were solving are not as prevelant.

    This is particularly true in the area of data transformation in which one business record may be ten or twenty thousand bytes, and deconstructed as objects may be 200 objects or more fully decomposed. Turing-type access into data records for read and write may, overall, be a better use of resources because the data is more compact, and the page-jumping in memory will be lower because objects will not be traversed as frequently.

    I have seen database internals take a approach. Within one disk page, several short database records may be stored. Because some data fields can vary in size, the records are truncated to required length, and written to the disk page. If any of the records on that page are accessed, then all of them on that page are read out, and the desired record is decompressed into the form expected by the schema. There was more to it than this to handle over-flows, but this was essentially the algorithm.

    CORBA took a similar approach for object references. Because references may be lengthy, if the same reference was passed multiple times, the second and third references would be a reference to the reference.

    I was working in a Java applet environment needed to access XML fields in JRE 1.1. Downloading packages of W3C XML parsers was not going to be acceptable, and really there were only a couple fields to check. Using similar techniques, I was able to traverse and extract the XML information, and construct and return XML information very compactly.

    Looking at the computing capacity of CPUs today, and the abundance of memory, I put forward the notion that constructing or using Java packages to handle simple point-purpose parsing may not be the best final-form in a generated system. As mash-ups become more complex and diverse environments become more widespread, the need for more packages, versioning those packages, and allocating runtime space for those packages may be heavier than the lexical parsing indicated. Lexical parsing may work better on strings than "traditional" object decomposition.


    Comments

    Comments are listed in date ascending order (oldest first) | Post Comment



    Only logged in users may post comments. Login Here.

    Powered by
    Movable Type 3.31