M/Wire Tutorial

Example of how to make use of GT.M or Caché's Persistent Arrays

If you're new to GT.M and Caché and their sparse, hierarchical persistent arrays, it may not be clear to you how you can make use of them effectively. The fact is that they provide a very basic, low-level data structure, and you have pretty much free rein over what you do with them and how you use them. There are many different techniques and approaches that people have adopted over the years for using them, and there are no hard and fast rules on best practices etc. Developers have modelled all kinds of database and store types on top of GT.M and Caché, including relational and objects.

It's always helpful to see an example, and with that in mind, let's examine how these hierarchical arrays have been used within M/DB:X to model the XML DOM. We'll not look at the parsing process, simply some of the tricks and techniques that have been used to store the contents of an XML document as a DOM, and to allow that DOM to be manipulated, analysed and searched.

The XML DOM represents an XML document as a set of nodes of different types (eg Element, Attribute, Text, Comment etc), with each node linked by pointers. Elements represent XML tags and are connected by parent, first child, last child, next sibling and previous sibling pointers. The DOM API methods manipulate and modify these nodes and pointers, and the DOM can be turned back into an XML document by recursively "walking" the nodes and pointers.

Representing the DOM

There are numerous ways you could represent a single XML DOM in GT.M or Caché, but one model might be to allocate each node an ID or number:

DOM[nodeID,"NodeType"] = 1 (Element)
DOM[nodeID,"TagName"] = foo {represents the tag <foo>)
DOM[nodeID,"Parent"] = parentNodeID
DOM[nodeID,"FirstChild"] = firstChildNodeID
DOM[nodeID,"LastChild"] = lastChildNodeID
DOM[nodeID,"NextSibling"] = nextSiblingNodeID
DOM[nodeID,"PreviousSibling"] = previousSiblingNodeID

DOM[nodeID,"Attribs",attributeName] = attributeValue

So the minimal XML document:

 <foo>
    <bar hello="world" x="123" />
 </foo>

Could be represented as:

DOM[1,"NodeType"] = 1
DOM[1,"TagName"] = "foo"
DOM[1,"FirstChild"] = 2
DOM[1,"LastChild"] = 2

DOM[2,"NodeType"] = 1
DOM[2,"TagName"] = "bar"
DOM[2,"Parent"] = 1
DOM[2,"Attribs","hello"] = "world"
DOM[2,"Attribs","x"] = 123

Note the way that the sparse nature of GT.M and Caché arrays means you only create the pointer nodes that you need - this makes for a very efficient implementation of the DOM.

To create these nodes using M/Wire, use the SET command for each one:

SET DOM[1,"NodeType"] 1
1

SET DOM[1,"TagName"] 3
foo

SET DOM[1,"FirstChild"] 1
2

SET DOM[1,"LastChild"] 1
2

SET DOM[2,"NodeType"] 1
1

SET DOM[2,"TagName"] 3
bar

SET DOM[2,"Parent"] 1
1

SET DOM[2,"Attribs","hello"] 5
world

SET DOM[2,"Attribs","x"] 3
123

Walking the DOM Tree to output an XML Document

To output a DOM as an XML Document, we "walk" the DOM nodes via the child and sibling pointers, picking up any attributes along the way. This needs to be a recursive algorithm, depending on whether or not each node you find has a FirstChild pointer.

So to get the appropriate information via M/Wire, we'd start at nodeID 1: We'll get its tag name and see if it has any attributes:

GET DOM[1,"TagName"]
EXISTS DOM[1,"Attribs"]

The latter command will return 0 (=false), so we now have all the information to output the tag as <foo>.

Now check for child nodes:

EXISTS DOM[1,"FirstChild"]

This will return 1 meaning true, so we get the first child's NodeID:

GET DOM[1,"FirstChild"] => 2

We need to check if node 2 is an Element:

GET DOM[2,"NodeType"]

It is, so we get its tag name and see if it has any attributes:

GET DOM[2,"TagName"]
EXISTS DOM[2,"Attribs"]

A value of 1 will be returned (=true), so now we get the attributes:

ORDERALL DOM[2,"Attribs"]

..which will give us all the attribute name/value pairs. We now have all the information to output the tag as <bar hello="world x="123" >.

Next we see if it has any child nodes:

EXISTS DOM[2,"FirstChild"]

This will return 0 (=false), so now we check if it has any siblings:

EXISTS DOM[2,"NextSibling"]

..which returns 1 (=true) so we get its sibling's NodeID

GET DOM[2,"NextSibling"]

.. which points us to NodeID 3, and so the process continues until we have no more child nodes and siblings. Of course, each time we pop the recursive stack, we'd output the closing tag.

Manipulating the DOM

An XML DOM API Method such as createElement() would add a new node into the DOM array, initially without any child/parent/sibling pointers. Applying the appendChild() method would add the appropriate child/parent/sibling pointers to the created element, and would also adjust the pointers of the node to which it had been appended, along with any affected sibling nodes.

So let's create a new <bar> tag and add it. Here's what we'd now have:

DOM[1,"NodeType"] = 1
DOM[1,"TagName"] = "foo"
DOM[1,"FirstChild"] = 2
DOM[1,"LastChild"] = 2

DOM[2,"NodeType"] = 1
DOM[2,"TagName"] = "bar"
DOM[2,"Parent"] = 1
DOM[2,"Attribs","hello"] = "world"
DOM[2,"Attribs","x"] = 123

DOM[3,"NodeType"] = 1
DOM[3,"TagName"] = "bar"

So, createElement() can be implemented using M/Wire with just two commands:

SET DOM[3,"NodeType"] 1
1

SET DOM[1,"TagName"] 3
bar

One of the cool features of GT.M and Caché is that they support transactions if you wish, so you could optionally ensure the createElement() commands are handled together as a transaction simply by adding the TSTART and TCOMMIT commands:

TSTART

SET DOM[3,"NodeType"] 1
1

SET DOM[1,"TagName"] 3
bar

TCOMMIT

Now we'll use appendChild() to attach it as the last child of the <foo> tag so that we end up with the document:

 <foo>
    <bar hello="world" x="123" />
    <bar />
 </foo>

Our array would now look like this:

DOM[1,"NodeType"] = 1
DOM[1,"TagName"] = "foo"
DOM[1,"FirstChild"] = 2
DOM[1,"LastChild"] = 3

DOM[2,"NodeType"] = 1
DOM[2,"TagName"] = "bar"
DOM[2,"Parent"] = 1
DOM[2,"NextSibling"] = 3
DOM[2,"Attribs","hello"] = "world"
DOM[2,"Attribs","x"] = 123

DOM[3,"NodeType"] = 1
DOM[3,"TagName"] = "bar"
DOM[3,"Parent"] = 1
DOM[3,"PreviousSibling"] = 2

So, appendChild() can be implemented using M/Wire with the following commands that change the pointers and attach the new tag into the DOM tree:

TSTART

SET DOM[3,"Parent"] 1
1

SET DOM[3,"PreviousSibling"] 1
2

SET DOM[1,"LastChild"] 1
3

SET DOM[2,"NextSibling"] 1
3

TCOMMIT

Holding Multiple DOMs

So what if we want to store more than one DOM in the database? The previous model is limited to a single DOM. We can easily change that by adding a new first subscript that represents the Document ID, eg:

DOM[DocumentID] = Document Name
DOM[DocumentID,nodeID,"NodeType"] = NodeType
DOM[DocumentID,nodeID,"TagName"] = TagName
DOM[DocumentID,nodeID,"Parent"] = parentNodeID
DOM[DocumentID,nodeID,"FirstChild"] = firstChildNodeID
DOM[DocumentID,nodeID,"LastChild"] = lastChildNodeID
DOM[DocumentID,nodeID,"NextSibling"] = nextSiblingNodeID
DOM[DocumentID,nodeID,"PreviousSibling"] = previousSiblingNodeID
DOM[DocumentID,nodeID,"Attribs",attributeName] = attributeValue

So now we can store as many DOMs as we like: just allocate each one a unique Document ID value in the first subscript of each array node.

As you'll discover when you start to expand the model to provide other features such as indices, it's a good idea to add further literal subscripts that denote the purpose of the subscript value that follows, so our expanded example would now look as follows:

DOM["document",1] = "ExampleDOM"

DOM["document",1,"node",1,"NodeType"] = 1
DOM["document",1,"node",1,"TagName"] = "foo"
DOM["document",1,"node",1,"FirstChild"] = 2
DOM["document",1,"node",1,"LastChild"] = 2

DOM["document",1,"node",2,"NodeType"] = 1
DOM["document",1,"node",2,"TagName"] = "bar"
DOM["document",1,"node",2,"Parent"] = 1
DOM["document",1,"node",2,"Attribs","hello"] = "world"
DOM["document",1,"node",2,"Attribs","x"] = 123

For example, this expanded model now allows us to resolve an ommission that you may have spotted: how to determine the next NodeID or DocumentID when you add a new document or node. One approach would be to add a counter for each that we increment (using the INCR command) whenever we need a new one:

DOM["documentIDCounter"] = 1
DOM["document",1,"nodeCounter"] = 2

For example to get a new nodeID for Document 1:

INCR DOM["document",1,"nodeCounter"]

This will return you the value 3 and the nodeCounter node will now be:

DOM["document",1,"nodeCounter"] = 3

While we're at it, we should probably have a "DocumentNode" pointer that tells us what the top nodeID of a Document is, eg:

DOM["document",1,"documentNode"] = 1

Now, when we "walk" the DOM to output it, we know which node to start the recursive process from, in M/Wire we'd use:

GET DOM["document",1,"documentNode"]

Indexing

So what about indexing? The answer is that there is no built-in indexing. On the down-side, that means you have to design and maintain arrays that represent indices to the data you'll be searching and retrieving. On the up-side, you have complete control over what you index and how you index it.

So, for example, we will probably want an index of tagNames, for when we use the getElementsByTagName() method. We could do this as follows:

DOM["document",DocumentID,"tagName",tagName,nodeID] = ""

...resulting in our example having the following extra nodes:

DOM["document",1,"tagName","foo",1] = ""
DOM["document",1,"tagName","bar",2] = ""

Our createElement() method would need to be extended to add this index:

TSTART
SET DOM[3,"NodeType"] 1
1
SET DOM[1,"TagName"] 3
bar
SET DOM["document",1,"tagName","bar",3] 0

TCOMMIT

If the XML document had lots of <bar> tags, the index would fill up with lots of pointers to their respective nodeIDs, looking something like:

DOM["document",1,"tagName","foo",1] = ""
DOM["document",1,"tagName","bar",2] = ""
DOM["document",1,"tagName","bar",5] = ""
DOM["document",1,"tagName","bar",7] = ""
DOM["document",1,"tagName","bar",10] = ""
DOM["document",1,"tagName","bar",22] = ""
DOM["document",1,"tagName","bar",24] = ""

In order to locate all the tags named "bar", you would use the ORDERALL command, eg:

ORDERALL DOM["dom",1,"tagName","bar"]

which would return you 2, 5, 7, 10, 22 and 24 as the subscripts it finds, and we now have pointers to the NodeIDs of all the <bar> tags!

Note that we're just saving null values for these index nodes. We have to save something, but for this index we're interested in using the subscripts to represent the one-to-many relationship between tag name and nodeIDs, and data is irrelevant for our needs.

Our expanded model also now allows us to easily add a Document Name index. Assuming that the Document Name must be unique, we could index this in the following way eg:

DOM["index","ExampleDOM"] = 1

Since this time we need a one-to-one relationship, we set the value of the Document ID as the data value, rather than a subscript. So, if we know the name of the Document we want to access, we can get its Document ID by simply using the GET command:

GET DOM["index","ExampleDOM"]

Of course, if the name we've used doesn't exist in the index, the GET command will return a missing value (represented by $-1).

Iteration and Cursors

Let's take another look at that index of <bar> tags. We've already seen how the ORDERALL command can return all of them in one go. What if we wanted to iterate through them one at a time instead (for example, we might just want the first 20, and if there were 10,000 <bar> tags it would be slow and inefficient to return them all).

The answer is the ORDER (or NEXT) command which can be used as an iterator and which provides a cursor. To get the first <bar> tag, we'd use:

ORDER DOM["document",1,"tagName","bar",""]

The null subscript seeds the iteration and we'll get back 2 which is the first subscript in this index. So we now know that the first nodeID that represents a <bar> tag is 2 and we can then do whatever we need with that node.

Having finished with node 2 we now want the next one. We simply use the last value returned by the ORDER command as a cursor:

ORDER DOM["document",1,"tagName","bar",2]

This time we'll get back the value 5 because it's the next subscript value in the index in collating sequence.

We simply repeat this process until we want to stop or until we exhaust the list of nodeIDs in the index. How do we determine the latter? Simple: the ORDER iterator returns a null value (represented by $-1), ie in our example:

Client: ORDER DOM["document",1,"tagName","bar",24]
Server: $-1

Deleting Documents and Node Properties

So how do you delete stuff? The hierarchical nature of the persistent arrays means that you can delete an entire sub-tree of nodes in just one command - nice and simple but also potentially very dangerous!

So to delete the attribute hello="world" from our <bar> tag, we would use:

DEL DOM["document",1,"node",2,"Attribs","hello"]

To delete our entire Document from our model we can just delete at the Document ID subscript level, eg:

DEL DOM["document",1]

.. but of course we also have to delete the index, so we actually need to first do the following:

GET DOM["document",1]

..which returns us the Document Name, ie "ExampleDOM". Then we can delete everything including the index:

DEL DOM["index","ExampleDOM"]
DEL DOM["document",1]

In other words, there's no natural built-in referential integrity - you have to provide all that logic yourself. So, with that in mind, we should probably perform that deletion as a transaction, so it would be better to do the following:

TSTART
GET DOM["document",1]
DEL DOM["index","ExampleDOM"]
DEL DOM["document",1]
TCOMMIT

To get rid of everything, just delete the array name:

DEL DOM

..and the whole lot disappears. Like I said, nice and simple, but potentially very dangerous if you get it wrong!

One Array or Many?

In the examples above we've been holding everything that represents the DOM, including any indices, in one single array named DOM. If we wanted, we could use a different array name for the indices, eg:

DOMIndex["byDocumentName",DocumentName] = DocumentID
DOMIndex["byTagName",DocumentID,tagName,nodeID] = ""

..eg for our example:

DOMIndex["byDocumentName","ExampleDOM"] = 1
DOMIndex["byTagName",1,"foo",1] = ""
DOMIndex["byTagName",1,"bar",2] = ""

There are pros and cons to holding everything in one array or breaking it up into several for different purposes: these issues relate to practical considerations rather than logical design considerations, for example maximum array sizes (which on GT.M is determined by the configured block size), backup size and speed (indices should always be capable of being regenerated from the core data so in theory don't need to be backed up), logical array mapping across physical servers, etc.

Array and Subscript Names

Just a quick bit about the syntax for array names and subscript names.

Array names must start with an alphabetic character (A-Z or a-z) followed by any alphanumeric character (A-Z, a-z, 0-9). Punctuation characters such as ".", "-", "&" or "_" are not allowed in array names. On GT.M, array names can be up to 31 characters in length.

Subscript names can consist of any characters including punctuation characters. If a subscript is numeric (ie a valid real or integer value), then it is treated as that number and does not have to be wrapped in double quotes. Otherwise the subscript is assumed to be a string and must be wrapped in double quotes. On GT.M, the combined length of the subscripts in an array reference can be up to 247 characters, so the maximum allowed length of a subscript depends on how many you'll be using in your array references.

For example, the following are valid subscripts:

"hello"
"hello123"
"hello 123"
" hello 123"
"123 hello"
"0123"
123
1.23

GT.M and Caché automatically store subscripts in alphanumeric collating sequence. Numbers come first, then aplhabetic characters, but be careful because numbers are collated as strings, not numerically. For example, here's the order in which a variety of alphanumeric subscripts would be automatically stored:


test[0]
test[0.1]
test[1]
test[2]
test[12]
test[31]
test["12a"]
test["A"]
test["A123"]
test["Aaa"]
test["Z"]
test["a"]
test["abc"]
test["b"]

For string values, at least, you can avoid the need for sorting logic, simply by creating and maintaining an index. You'll then get the values back in the correct sorted order by using the ORDER or NEXT command.

Conclusion

That's a very quick overview of just one example of how you could apply the dynamic, sparse, persistent multi-dimensional arrays of GT.M and Caché. Think of them as a raw, low-level "pallette" on which you can let your imagine loose to model whatever projections and abstractions you like. Whatever you use them for, you'll find them to be lightning fast. Of course, you'll have to do lots of work that a relational database would otherwise do for you, but of course, as a NoSQL developer, that's the whole reason you're looking at GT.M and Caché in the first place!