
MXL MicroXML Parser

MicroXML is a simplified subset of normal XML 1.0 (5th edition)
created by James Clark and John Cowan.  The specification is at:
  https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html
It is currently the subject of a W3C Community Group, but is not
a W3C Standard or on the Standards Track.  An informative pair of
articles about it by Uche Ogbuji is at:
  http://www.ibm.com/developerworks/library/x-microxml1/
  http://www.ibm.com/developerworks/library/x-microxml2/

The MXL Parser parses MicroXML in two ways.  It produces a
Data Model in strict accordance with the spec, and it also
provides SAX-type push parsing at the same time.  Either or
both can be selected.

MXL reports as errors everything in the document that does
not conform to the MicroXML spec.  In its FullXML mode it
does the same; when SAX mode is also enabled, it additionally
reports the content of four constructs excluded from the data
model: comments, DOCTYPEs, CDATA sections, and PIs.  Each has
its own callback function.  None of the four ever appears in
the Data Model itself.

MXL is written in C++ and is currently compiled with Visual
C++ 6.0.  It references windows.h along with stdio.h and 
stdlib.h, but does not use Microsoft-specific functions,
so it should be readily portable to other platforms.  The
Windows version consists of two parts: mxlparser.dll, the
parser itself, and mxl.exe, a simple console driver for it.


Operation

In a Windows command prompt window, the MXL parser is
invoked by:

 Usage: mxl [sourcefile  (default is stdin)] [options]
 Options:
  -o outputfile  Default is stdout
  -e errorfile   Default is stderr
  -n No content model, otherwise sent as JSON to outputfile
  -s Send SAX messages (as diagnostics) to errorfile
  -f FullXML report DOCTYPE, CDATA, and PIs as SAX messages
     to errorfile instead of reporting them as errors
  -x Expat callbacks for start and end tags, text, and PIs
  -a Provide brief help on the mxlparser.dll API
  -h or -?, Provide help (this message)

The API help mentioned there is:

 API for mxlparser.dll:
 First create an MxlParser with:
  MxlParser *Parser = new MxlParser();
 Optionally, set up options and SAX callbacks:
  Parser->SetOptions(UseSAX, UseModel, FullXML); (all bools)
  Parser->SetCallbacks(ErrFileName, ReportErrorFunc,
    StartTagFunc, EndTagFunc, TextContentFunc, ReportCDataFunc, 
    ReportPIFunc, ReportDoctypeFunc, ReportCommentFunc);
 For expat-compatible callbacks, use SetExpatCallbacks instead,
  which has a longer list of callbacks.
 Finally, parse the file:
  element *DataModel = Parser->ParseFile(SourceFileName);

 Error messages and comments are sent to ErrFileName (default
  stderr) unless the Report*Func says otherwise.
 If UseSAX, the Tag and Text callbacks are used; the stub functions
  for them report the UTF-32 strings in JSON to ErrFileName.
 If UseModel, the data model is returned at the end as a struct
  with all strings in UTF-32 encoding, zero terminated.
 If FullXML, the DOCTYPE, CDATA, and PIs are reported as SAX messages
  instead of errors; they are never in the data model.
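
Because the data model and the Tag and Text callbacks deliver
UTF-32, a calling program that wants to print names or text
usually has to convert them first.  The following is a minimal
sketch of such a conversion to UTF-8, not code from MXL.  The
function name Utf32ToUtf8 is hypothetical; only the unl typedef
comes from the Data Model section.  Note that unsigned long is
32 bits under Visual C++ but may be 64 bits on other platforms;
the sketch only assumes each array item holds one code point.

```cpp
#include <cassert>
#include <string>

typedef unsigned long unl;  // one UTF-32 code point per item

// Hypothetical helper: convert len UTF-32 code points to UTF-8.
// Surrogates and values above U+10FFFF are replaced by U+FFFD.
std::string Utf32ToUtf8(const unl *text, long len) {
  assert(text != 0 || len == 0);
  std::string out;
  for (long i = 0; i < len; ++i) {
    unl c = text[i];
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
      c = 0xFFFD;
    if (c < 0x80) {                       // 1 byte
      out += static_cast<char>(c);
    } else if (c < 0x800) {               // 2 bytes
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {             // 3 bytes
      out += static_cast<char>(0xE0 | (c >> 12));
      out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (c & 0x3F));
    } else {                              // 4 bytes
      out += static_cast<char>(0xF0 | (c >> 18));
      out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```

A caller could apply it to the name and namelen members of a
returned element before writing the name to a console.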


Data Model

The Data Model is returned as a structure by mxlparser.dll
upon completion of the parse.  All text items in it are in
UTF-32 strings, zero-terminated, for which length is also
given in the structure:

typedef unsigned char unc;
typedef unsigned long unl;

struct element {  // data model uses one top element per doc
 unl *name;       // array of UTF-32 chars
 long namelen;
 pair **attrs;    // array of attribute pairs
 long attrcnt;
 cont **content;  // arrays of element ptrs or UTF-32 chars
 long contcnt;
};

struct pair {
 unl *name;      // attribute name, UTF-32
 long namelen;
 unl *val;       // attribute value, UTF-32
 long vallen;
};

struct cont {
 void *it;       // ptr to array of UTF-32 chars or element
 long cnt;       // count if chars, 0 if element
};
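
As an illustration of how a calling program might traverse the
structure, here is a small sketch that counts elements,
attributes, and text characters.  The struct definitions repeat
the ones above, with forward declarations added so that they
compile in this order.  ModelStats and CollectStats are
hypothetical names, and the sketch assumes, per the comment on
cont, that a cnt of 0 always marks an element child.

```cpp
#include <cassert>

typedef unsigned long unl;

struct pair;   // forward declarations so element can refer to them
struct cont;

struct element {
  unl *name;      long namelen;
  pair **attrs;   long attrcnt;
  cont **content; long contcnt;
};
struct pair { unl *name; long namelen; unl *val; long vallen; };
struct cont { void *it; long cnt; };

// Hypothetical totals gathered while walking a data model.
struct ModelStats {
  long elements;
  long attributes;
  long textChars;
};

// Recursively visit el and everything below it.
void CollectStats(const element *el, ModelStats &stats) {
  assert(el != 0);
  ++stats.elements;
  stats.attributes += el->attrcnt;
  for (long i = 0; i < el->contcnt; ++i) {
    const cont *item = el->content[i];
    if (item->cnt == 0)   // assumption: 0 means "it" is an element
      CollectStats(static_cast<const element *>(item->it), stats);
    else                  // otherwise "it" is cnt UTF-32 chars
      stats.textChars += item->cnt;
  }
}
```

The caller zero-initializes a ModelStats and passes it along
with the element returned by ParseFile.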

For convenient study, the driver converts the structure 
to JSON format as used in the spec, and writes it to stdout
(or to a specified file) at completion.  Here is what it
produces for the sample in par. 3.1 of the spec:

[ "comment",
	{	"lang": "en",
		"date": "2012-09-11"
	}, 
	[	"\nI ",
		[ "em", {}, [ "love" ]
		],
		" \u00B5XML!",
		[ "br", {},  []
		],
		"\nIt's so clean & simple."
	]
]

This is slightly different formatting from the spec, as
we wanted to make all braces and brackets have matching
start and end columns when they held more than one item.
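
The conversion itself is a straightforward recursion over the
structure.  The sketch below is not the driver's code: it is a
minimal version that emits compact JSON with no line breaks or
indentation, and it escapes everything outside printable ASCII
as \uXXXX.  AppendJsonString and AppendJson are hypothetical
names; the structs repeat the ones above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

typedef unsigned long unl;
struct pair;
struct cont;
struct element {
  unl *name;      long namelen;
  pair **attrs;   long attrcnt;
  cont **content; long contcnt;
};
struct pair { unl *name; long namelen; unl *val; long vallen; };
struct cont { void *it; long cnt; };

// Append a UTF-32 string to out as a quoted JSON string.
void AppendJsonString(std::string &out, const unl *s, long len) {
  out += '"';
  for (long i = 0; i < len; ++i) {
    unl c = s[i];
    if (c > 0x10FFFF) c = 0xFFFD;        // not a valid code point
    if (c == '"' || c == '\\') {
      out += '\\';
      out += static_cast<char>(c);
    } else if (c == '\n') {
      out += "\\n";
    } else if (c == '\r') {
      out += "\\r";
    } else if (c == '\t') {
      out += "\\t";
    } else if (c >= 0x20 && c < 0x7F) {  // printable ASCII
      out += static_cast<char>(c);
    } else {
      char buf[16];
      if (c > 0xFFFF) {                  // needs a surrogate pair
        c -= 0x10000;
        std::snprintf(buf, sizeof buf, "\\u%04lX\\u%04lX",
                      0xD800 + (c >> 10), 0xDC00 + (c & 0x3FF));
      } else {
        std::snprintf(buf, sizeof buf, "\\u%04lX", c);
      }
      out += buf;
    }
  }
  out += '"';
}

// Append el as [ name, { attributes }, [ content ] ].
void AppendJson(std::string &out, const element *el) {
  assert(el != 0);
  out += "[";
  AppendJsonString(out, el->name, el->namelen);
  out += ",{";
  for (long i = 0; i < el->attrcnt; ++i) {
    if (i) out += ",";
    AppendJsonString(out, el->attrs[i]->name, el->attrs[i]->namelen);
    out += ":";
    AppendJsonString(out, el->attrs[i]->val, el->attrs[i]->vallen);
  }
  out += "},[";
  for (long i = 0; i < el->contcnt; ++i) {
    if (i) out += ",";
    const cont *item = el->content[i];
    if (item->cnt == 0)   // assumption: 0 marks an element child
      AppendJson(out, static_cast<const element *>(item->it));
    else
      AppendJsonString(out, static_cast<const unl *>(item->it),
                       item->cnt);
  }
  out += "]]";
}
```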


SAX Callbacks

When SAX mode is enabled, the parser calls back to these
functions, which are sent character data in UTF-32:
  void StartTag(unl *name, long namecnt, pair **attrs, long attrcnt);
  void EndTag(unl *name, long namecnt);
  void TextContent(unl *text, long textcnt);

In FullXML SAX mode, it also uses these, which are sent 
character data in UTF-8:
  void ReportComment(char *comment);
  void ReportPI(char *pi);
  void ReportDoctype(char *doctype);
  void ReportCData(char *cdata);

Whether in SAX mode or not, it always reports errors via 
this function, which is sent character data in UTF-8:
  void ReportError(long line, char *warning, char *cpt, bool fatal);
Hardly any errors are considered fatal; for most, some
form of recovery is at least attempted.  For example, when
an end tag does not match the current start tag, the parser
tries to match it against the ancestors of the current
element.  It reports any such issues and fixes as errors.
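
The kind of end-tag recovery described above can be sketched
with a stack of open element names.  This is an illustration of
the general technique only, not MXL's actual implementation: the
names HandleEndTag and EndTagResult are hypothetical, element
names are held as std::string rather than UTF-32 for brevity,
and the choice to ignore an end tag that matches no open element
is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical outcome of handling one end tag.
enum EndTagResult {
  Matched,                     // closed the current element
  RecoveredByClosingChildren,  // closed an ancestor and its children
  IgnoredStray                 // no open element has this name
};

// open holds the names of the open elements, innermost last.
EndTagResult HandleEndTag(std::vector<std::string> &open,
                          const std::string &name) {
  assert(!name.empty());
  if (!open.empty() && open.back() == name) {
    open.pop_back();
    return Matched;
  }
  // Search from the nearest open element outward to the root.
  for (std::size_t i = open.size(); i-- > 0; ) {
    if (open[i] == name) {
      open.resize(i);   // closes open[i] and everything inside it
      return RecoveredByClosingChildren;
    }
  }
  return IgnoredStray;  // leave the stack unchanged
}
```

A parser using this approach would report an error for either
of the last two results and then continue.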

The stub functions provided for the callbacks all report
the name of the callback and the text sent to it, in JSON
format, to stderr (or to the errorfile set by the user).
Hence, when SAX is specified but the calling program sets
no callbacks in mxlparser.dll, the callback reports and
errors precede the output of the Data Model.


expat-compatible Callbacks

When callbacks are set with SetExpatCallbacks, these are used:
  void StartTag(void *userdata, char *name, long namecnt, char **attrs);
  void EndTag(void *userdata, char *name, long namecnt);
  void TextContent(void *userdata, char *text, long textcnt);
  void StartCdataSection(void *userdata);
  void EndCdataSection(void *userdata);
  void ReportPI(void *userdata, char *target, char *data);
  void XMLDecl(void *userdata, char *version, char *encoding,
    int standalone);  [standalone = -1]
  void StartDoctypeDecl(void *userdata, char *name, char *sys,
    char *pub, int internalsubset);  [internalsubset = 0]
  void EndDoctypeDecl(void *userdata);
  void ReportComment(void *userdata, char *comment);

The stub functions add "Ex" to the start of the reports,
as in "ExPI:".  All names and content are in UTF-8.
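
As a sketch of what a user-supplied callback might look like,
here is a StartTag that records the tag name and attributes in
a caller-owned struct passed as userdata.  It rests on two
assumptions this document does not confirm: that attrs follows
expat's own layout (name, value, name, value, ..., ended by a
null pointer), and that namecnt is the length of name in bytes.
TagLog is a hypothetical type.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical user data handed to every callback.
struct TagLog {
  std::string lastName;
  std::map<std::string, std::string> lastAttrs;
};

// Signature as listed above.  Assumes attrs alternates names and
// values and ends with a null pointer, as in expat.
void StartTag(void *userdata, char *name, long namecnt,
              char **attrs) {
  assert(userdata != 0);
  TagLog *log = static_cast<TagLog *>(userdata);
  log->lastName.assign(name, static_cast<std::size_t>(namecnt));
  log->lastAttrs.clear();
  for (char **a = attrs; a && a[0] && a[1]; a += 2)
    log->lastAttrs[a[0]] = a[1];   // a[0] is name, a[1] is value
}
```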


Licensing

The MXL Parser is entirely written by Jeremy H. Griffith
of Omni Systems, <jeremy@omsys.com>.  Omni intends to use
it for an upcoming product, working name uDoc, which is
a MicroXML editor specifically configured for a document
format similar to a simplified DITA.

We intend to license at least the parser, and probably the
entire product, as FOSS.  We are currently considering 
the GPL, although the Apache license is also a possibility.
At that point, we will create a SourceForge project for it.

Omni currently has three products available. The first is
Mif2Go, <http://mif2go.com>, a commercial converter 
from FrameMaker source to a variety of output formats, 
including Word, DITA, HTML, and many forms of Help such 
as FOSS OmniHelp hosted on SourceForge.  Mif2Go is free
for a large number of its users: the unemployed, retired,
underemployed consultants, academics (staff, faculty, 
students), most nonprofits, and FOSS developers.  Quite
a few of its paying customers are Fortune 100 companies and
government agencies, who can afford to support the rest.

The second is DITA2Go, <http://dita2go.com>, a converter 
from DITA to the same outputs as Mif2Go, with which it 
shares a large part of its code.

The third is uDoc2Go, <http://udoc2go.com>, which
converts from uDoc to the same outputs as Mif2Go, 
with which it also shares a large part of its code.

Part of the impetus for the newest product is concern over
the deteriorating quality and increasing cost of Adobe's
FrameMaker.  The other part is concern over the difficulties
many users are experiencing with the increasing complication
of DITA.  MicroXML fits well with a product meant to improve
life for the Technical Writers using both Frame and DITA.

