Amtal

Rapid Prototyping and Domain Specific Languages with OMeta/JS

Overengineering risks yak-shaving. Underengineering risks eldritch code horror. However, sometimes smart people do all the engineering for you – and you evade the risks just by learning a new tool.

OMeta is a tool that made waves some time ago by using the ASCII diagrams in an RFC to build a TCP stack in very few lines of code [1]. Its goal seems to be to make implementing DSLs cheap enough to use them extensively, leading to massive LOC savings.

Let’s evaluate it on a well-suited but real-world problem: some rough notes on the Diablo 2 game server protocol, collected by various hackers over the years.

Both the protocol and the DSL describing it reek of “real world”. Let’s test OMeta’s mettle against them.

Goal

Suppose we wanted to write a mid-sized C application using the protocol. We would need to define structures for the packets, fill them with default values for sending, and track their sizes for receiving. And we’d also want some reflection for writing legible packet dumps – and maybe wrapper functions for sending them directly, and legible documentation, and…

That’s a lot of code, tightly coupled across multiple files. In case it’s not obvious, the notes are incomplete and often wrong – the protocol implementation will change, and good luck keeping all that junk consistent.

So, based directly on the notes, let’s generate:

- C struct definitions for every packet,
- constant initializers holding default field values, for sending,
- a size lookup table indexed by packet id, for receiving,
- and a pretty-printed, sanitized copy of the notes themselves.

This is sufficient for sending and receiving packets, like so:

// Falls back from the generated size_lut to a user-defined handler for dynamic sizes.
int packet_size(char* buf)
{
    int id = (unsigned char)buf[0];     // packet id
    return size_lut[id]>0 ? size_lut[id] : advanced_size_lut(buf);
}

char* encode_example(char* send_buf, int* size) 
{
    struct bind_hotkey_to_skill p = BIND_HOTKEY_TO_SKILL;
    int psize = sizeof(p);
    if (psize>*size) { /* handle overflow */ }
    *size -= psize;

    p.skill = SKILL_FIREBALL;   // This code is interesting to write.
    p.side = SIDE_LEFT;         // Everything else is just boilerplate.
    p.hotkey = 'f';

    *(struct bind_hotkey_to_skill*)send_buf = p;
    return send_buf+psize;
}

int decode_example(char* recv_buf, int size) 
{
    int psize = packet_size(recv_buf);
    if (psize>size) return 0;
    switch((unsigned char)recv_buf[0]) {
    // ...
    case 0x4d: 
        play_sound(((struct play_npc_message*)recv_buf)->npc_sound);
        break;
    // ...
    };
    return psize;
}

There’s a lot that could be done with additional generation, macros, and inline functions to simplify the above code – but I think it’s a sufficient proof of concept. If you can generate structures and their initializers, you can generate anything.
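For instance, the send-side boilerplate could be folded into a couple of macros. The sketch below is hand-written and hypothetical – PKT_BEGIN and PKT_END are made-up names, not something the generators in this article emit:

// Hypothetical convenience macros: hide the declare/check/copy boilerplate
// from encode_example(), leaving only the interesting assignments visible.
#define PKT_BEGIN(type, defaults, size)                      \
    struct type p = defaults;                                \
    if ((int)sizeof(p) > *(size)) { /* handle overflow */ }  \
    *(size) -= (int)sizeof(p)

#define PKT_END(type, buf)                                   \
    (*(struct type*)(buf) = p, (buf) + (int)sizeof(p))

char* encode_example2(char* send_buf, int* size)
{
    PKT_BEGIN(bind_hotkey_to_skill, BIND_HOTKEY_TO_SKILL, size);
    p.skill  = SKILL_FIREBALL;
    p.side   = SIDE_LEFT;
    p.hotkey = 'f';
    return PKT_END(bind_hotkey_to_skill, send_buf);
}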


Design

[Figure: class/data-flow diagram of the project. The Parser, PPrint, Structs, Encode, and SizeLUT classes transform between DSL, AST, and C code.]

Here’s a data flow/class diagram of the system. Ovals are OMeta objects, boxes are data. Solid lines are data flow, dotted lines are inheritance/extension.

The Parser is defined in two parts: one for a “high quality” DSL only, and an extension for dealing with the existing, inconsistent syntax. Minor sanitizations that would otherwise require major parser changes will, of course, be done by hand.

Implementing the Printers

I implemented the parser first, but due to how messy the notes are it’s kind of gnarly. So, let’s start with the simpler printers.

First, a pretty-printer, to sanitize the mess of notes for good!

ometa PPrint {
  packets = packet*:ps -> ps.join('\n'),
  packet = ['p' id:id :size :name [field*:fs]] -> (id+' '+size+'\t'+name+' \t'+fs.join(' '))
         | ['p' id:id 'undefined' :desc]       -> (id+' ?\t'+desc),
  id = :h -> h.toString(16),
  field = ['const' id:n]         -> n
        | [['BYTES' :len] :name] -> ('BYTES['+len+'] '+name)
        | :s                     -> ('['+s.join(' ')+']')
}

Note the PEG-like structure: each rule has the form name = pattern -> output, with sub-matches bound via :label. Functional programmers should feel at home, especially with guarded rules like name = pattern ?(guard) -> output.

Pattern matching works on both strings and JSON arrays. You can parse text and traverse trees with the same syntax – the expressiveness makes generating output quick and easy:

43 ?  Unknown/Unused
44 17 Staff in orifice    44 [DWORD orifice entity kind] [DWORD orifice id] [DWORD staff id] [DWORD entity state]
45 9  Change TP Location[(logged as hack)]    45 [DWORD entity id] [WORD location] 0 0
46 13 Have Merc Interact  46 [DWORD merc id] [DWORD entity id] [DWORD entity type]
47 13 Move Merc   47 [DWORD Merc ID] [WORD X] 0 0 [WORD Y] 0 0

Next we’ll generate C code. In preparation, we can define a utility class for making valid identifiers.

I’ve read mentions of Alan Kay saying he didn’t envision Java when he coined the term OOP. I have no love for side-effect-filled Java/Ruby-style mutation fests, but I can definitely get behind the OMeta approach. The language encourages declarative, pure code; at the same time, it doesn’t restrict your side effects when you need ’em.

Inheritance is just a simple means of extending existing code. Neat.

ometa C {
  // lazy C identifier validity - doesn't do 1st letter or reserved words
  valid = :s -> s.replace(/[^A-Za-z_0-9]/g, '_').toLowerCase(),
  // chops a valid identifier off a string, and sticks any tail into a comment
  ident = [idLabel:l ' '* anything*:c] -> 
                [l.toLowerCase(), c.length==0?'':'\t// '+c.join('')],
  idLabel = idWord:x idWord:y   -> (x+'_'+y)
          | idWord:x            -> x,
  idWord = ' '* (letter | '_')+:w -> w.join('')
}

// print a struct field, appending numbers to repeated symbols
field = function(type,txt) {
  var name = txt[0]
  if (Structs.sym[name]===undefined) {
     Structs.sym[name]=0
  } else {
     Structs.sym[name]++
     name += Structs.sym[name]
  }
  return ('\t'+type+'\t'+name+';'+txt[1])
}

Since we’re dealing with simple data structures rather than ambiguous, almost-natural-language notes, the code stays straightforward. I use the opportunity to explore OMeta features.

The tree pattern match traversals are quite easy to read. JavaScript string generation could use some work, however: I long for CoffeeScript/Ruby string interpolation, or at least sprintf…

ometa Structs <: C {
  // structure
  packets = packet*:ps -> ps.join(''),
  packet = ['p' :id :size valid:name [field*:fs]]   {'/*0x'+id.toString(16)+'*/ '}:comment
                                                    {' {\n' + fs.join('\n') + '\n}\n'}:body
                                                 -> {Structs.sym={}; 'struct '+comment+name+body}
         | ['p' :id 'undefined' :desc]      -> ('// 0x'+id.toString(16)+' '+desc+'\n'),
  field = ['const' ksym:k]                -> field('u8',k)
        | ['BYTE'       remap('u8'):n]  -> n
        | ['WORD'       remap('u16'):n]  -> n
        | ['DWORD'      remap('u32'):n] -> n
        | ['BUFFER'     remap('u8*'):n] -> n
        | ['CSTRING'    remap('char*'):n] -> n
        | [['BYTES' :s] ident:n] -> {n[0]+='['+s+']'; field('u8',n)}
        | :n -> ('\t// UNHANDLED GENERATOR: '+n),
  // I used remap to play with arguments, but really it loses clarity by muddling 
  // the split between input and output. Doubt it's worth the reduction in repetition.
  remap :type = ident:txt -> field(type,txt),
  ksym = :msg -> ['k', '\t// '+msg]
};
Structs.sym={}; // duplicate symbol counter

defines = """
#define u8 unsigned char;
#define u16 unsigned short;
#define u32 unsigned int;
"""

ometa Encode <: C {
  packets = packet*:ps -> ps.join(''),
  packet = ['p' :id :size valid:name [field*:fs]] -> 
                ('const struct '+name+' '+name.toUpperCase()+' = {'+fs.join(',')+'};\n')
         | ['p' :id 'undefined' :desc] -> '',
  field = ['const' hex:n] -> n
        | [['BYTES' :len] :name] -> 
                {var s='{'; for (var i=0;i<len;i++) {s+=(i==len-1)?'0':'0,';}; s+='}'}
        | :s -> '0',
  hex = :h -> ('0x'+h.toString(16))
}

ometa SizeLUT {
  packets = packet* -> {
     var s='const int size_lut[256] = {';
     for(var i=0;i<256;i++){
         s+=SizeLUT.lut[i]===undefined ? '-1' : SizeLUT.lut[i];
         if(i!=255) s+=',';
     }; 
     s+='};\n'
  },
  packet = ['p' :id size:s :name :fields] -> {SizeLUT.lut[id]=s}
         | ['p' :id 'undefined' :desc],
  size = '?' -> '0' // unknown
       | '*' -> '-1' // dynamic
       | number
}
SizeLUT.lut = {}

defines += """
#define SIZE_UNKNOWN 0
#define SIZE_DYNAMIC -1
"""

From the way the structure definitions were generated, I hope you can see that generating fancier encoding/decoding functions would be easy.

For example, though poorly documented in the notes, some packets contain multiple null-terminated, variable-size arrays. Generating a dispatch table from packet id to generated parsing functions would handle those efficiently, while still letting programmers register custom handlers for the more complicated packets.
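Here’s a rough sketch of what that could look like on the C side. Everything in it is illustrative – the packet_handler type, the handler_lut table, and the handler names are assumptions, not output of the generators above:

// Hypothetical dispatch table keyed by packet id: simple handlers could be
// generated, hairy variable-length packets get hand-written ones.
typedef int (*packet_handler)(const char* buf, int size);

int handle_play_npc_message(const char* buf, int size);   // could be generated
int handle_variable_length(const char* buf, int size);    // custom, hand-written

static packet_handler handler_lut[256] = {
    [0x4d] = handle_play_npc_message,
    /* ... remaining entries filled in by the generator or by hand ... */
};

// Returns the number of bytes consumed, 0 for unknown packets.
int decode(const char* buf, int size)
{
    packet_handler h = handler_lut[(unsigned char)buf[0]];
    return h ? h(buf, size) : 0;
}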

Implementing the Parser

Since we’re parsing freeform notes bordering on natural language, cleanliness is hard to achieve. Rather than defining an ideal parser and then trying to extend it to deal with reality, I tackled all of reality’s edge cases immediately.

Thus, the following looks brutal, but actually isn’t too bad when you consider what a mess the notes are. It was grown in an organic fashion, through quick trial-and-error in a REPL.

ometa CleanParser {
  // literals
  number = digit+:ds -> parseInt(ds.join('')),
  hexDigit = char:x {'0123456789abcdef'.indexOf(x.toLowerCase())}:v ?(v >= 0) -> v,
  hexLit = hexLit:n hexDigit:d -> (n*16+d)
         | hexDigit,
  sp = (~'\n' space)*,
  // hacky top-level labels
  topLabelFst = letter | '-' | '(' | '0x',
  topLabelOk = topLabelFst | char:c ? ("?)/.#,\'|=".indexOf(c)!=-1) | digit,
  topLabelRst = topLabelOk // here there be hacks:
           | ' ' topLabelOk,  // allow spaces and numbers, but tabs end
  topLabel = <topLabelFst topLabelRst*>,
  // hacky field labels
  labelFst = topLabelFst | '<',
  labelOk = topLabelOk | '<' | '>',
  labelRst = labelOk 
           | ' ' labelOk 
           | '[' digit+ ']', // for 'char[123]' in label contents 
  label = <labelFst labelRst*>,
  // moving on to proper structure
  length = "BYTES[" number:d "]" -> ['BYTES',d] // less frequent variant
         | "WORD" | "DWORD" | "BUFFER" | "BYTE" | "CSTRING",
  // Ideally every packet field has a name - but sometimes they aren't defined :(
  field = sp '[' length:l sp label*:name ']' -> [l,name.concat()[0]]
        | sp hexLit:h -> ['const',h],
  size = number | "?",
  packet = hexLit:id spaces size:s ' '* topLabel:desc sp field*:fs sp label*:desc2 ->
                ['p',id,s,desc2.length==0?desc:desc+desc2,fs]
         | hexLit:id sp fromTo('-','\n'):msg -> 
                ['p',id,'undefined',msg.replace(/[-\n(  )]/g,'')],
  header = "Number" "Size" "Effect:" "Usage:" '\n' '-'+ '\n',
  packets = header* (spaces packet)*:ps -> ps
}

I factored out the obvious bits into a separate object. The result isn’t as ideal as the design diagram suggests, but cleans up a little:

ometa Parser <: CleanParser {
  length = "NULLSTRING[" number:d "]" -> ['BYTES',d]
         | ^length
         | "VOID"     -> 'BUFFER'
         | "VARIABLE" -> 'BUFFER'     // varlength warden packet
         | "*char"    -> 'CSTRING'    // varlength chat messages
         | "Char"     -> 'CSTRING'    // definition typo like [Char Name[15]]
         | "Object"   -> 'DWORD'      // more typo
         // SaveChar defines `[dwUnk] [dwUnk] [char[252] SaveFile]`
         | "dwUnk"    -> 'DWORD'
         | "char[" number:d "]" -> ['BYTES',d],
  field = ^field
        | "<Research Incomplete>" -> ['BUFFER','ResearchIncomplete']
        | sp '?' sp -> ['BUFFER','ResearchIncomplete']
        | sp '*' sp -> ['BUFFER','ResearchIncomplete']
        | sp '-'+ sp -> ['BUFFER','ResearchIncomplete'],
  size = ^size       
       | "[Varies]" -> '*' // Why is this different from `sp '[Varies]' sp`?
       | "*" -> '*'  // * means dynamic size, need to inspect packet to determine
       | "?" -> '?', // ? means unknown size
  packet = ^packet
         | hexLit:id sp (~'\n' anything)*:msg -> 
                ['p',id,'undefined',msg.join('').replace(/[-\n(  )]/g,'')]
}

Phew. That’s parsing done – running Parser.matchAll(text, "packets") reduces the nightmarish mess of notes into very simple JSON. I only made minor tweaks to the notes to avoid handling some one-time edge cases.

Results

The full inputs and generated outputs are available at https://github.com/amtal/dsl-rapid-proto.

Ignoring the already existing notes, that’s an order of magnitude improvement in code size. Nice.

More impressively, the gap should only get wider as extensions get added. I generated only the very basics – imagine the LOC jump that pretty-printing or variable-length decoding would add.

The most time-consuming part was writing the parser. However, considering that this was my first use of OMeta, and that I intentionally picked an annoying language, it was very easy. The fact that the primary source of documentation is a chapter in someone’s PhD thesis [2] wasn’t even an issue!

A large reason for that was interactive development: the online OMeta/JS 2.0 Workspace allowed a trial-and-error approach to getting things right.

In short, faced with a realistic problem and armed with minimal prior knowledge, I found OMeta to be an excellent tool.

References

[1] Viewpoints Research Institute, VPRI Technical Report TR-2007-008,
“STEPS Toward The Reinvention of Programming, First Year Progress Report”, Dec 2007.
Appendix E: Extended Example: A Tiny TCP/IP Done as a Parser (by Ian Piumarta).
http://www.vpri.org/pdf/tr2007008_steps.pdf

[2] Viewpoints Research Institute, VPRI Technical Report TR-2008-003,
Warth, Alessandro, “Experimenting with Programming Languages”.
http://www.vpri.org/pdf/tr2008003_experimenting.pdf