StaxMate Iterators

The input side of StaxMate is built on the concept of multiple synchronized iterators. An iterator can be thought of as a forward-only cursor with a scope: the subset of the document it traverses. That is, an iterator can only traverse the sub-tree it was created for.

Types of Iterators

There are two main flavours of iterators:

  • nested iterators (or "child iterators", implemented by SMNestedIterator) traverse only the top-level events of the sub-tree they are scoped to, that is, the immediate children of the parent. For each immediate child element they expose only the START_ELEMENT event, never END_ELEMENT (the end of the scope is known implicitly, because getNext() returns SM_NODE_NONE). This is the usual iterator for hierarchic (recursive-descent) parsing or transformations.
  • flat iterators (or "descendant iterators", implemented by SMFlatIterator) traverse full sub-trees and present a "flattened" view. They expose both START_ELEMENT and END_ELEMENT events; the latter are needed because flattening would otherwise hide the nesting boundaries. These are useful when, for example, collecting all text in a sub-tree.
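
The difference between the two views can be illustrated without StaxMate at all. The following is a minimal sketch in plain StAX (javax.xml.stream), not the StaxMate API: the class ScopeSketch and its methods are hypothetical names introduced here, and a simple depth counter stands in for the scoping that the iterators perform internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of StaxMate.
class ScopeSketch {
    // "Nested" view: names of the immediate child elements of the root only.
    static List<String> childElementNames(String xml) {
        return elementNames(xml, true);
    }

    // "Flat" view: names of all descendant elements of the root, in document order.
    static List<String> descendantElementNames(String xml) {
        return elementNames(xml, false);
    }

    private static List<String> elementNames(String xml, boolean immediateOnly) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0; // number of elements currently open
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // depth == 1 here means "direct child of the root"
                    if (depth >= 1 && (!immediateOnly || depth == 1)) {
                        names.add(r.getLocalName());
                    }
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<root><a><b/></a><c/></root>";
        System.out.println(childElementNames(xml));      // prints [a, c]
        System.out.println(descendantElementNames(xml)); // prints [a, b, c]
    }
}
```

With StaxMate the depth bookkeeping disappears from application code, because each iterator already knows where its scope ends.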

Iterators are created either via SMIteratorFactory (for root iterators) or via other iterators (for all child and descendant iterators): whenever an iterator points to a START_ELEMENT event, a new child iterator (flat or nested) can be created for that element.

The main benefit of iterators is scoped access with serialization. What this means is that:

1. It is always safe to pass a child iterator to another processing component: that component can only access entries within the scope of the iterator, and it does not have to keep track of the nesting of elements.
2. Access via multiple dependent iterators (child, parent) is serialized, so that access to the underlying stream is always in document order.

Point 2 means that when a component is done with an iterator (at any point, including not using it at all), there is no need to manually "fast forward" through the events that iterator would have seen. When the parent iterator itself is advanced, it automatically skips any events that were not consumed. This advancement also invalidates all child iterators (since the cursor has moved past the point where they would be valid), ensuring that they cannot be advanced any more.
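
To make the skipping behaviour concrete, here is a minimal sketch in plain StAX (javax.xml.stream) of the bookkeeping a parent has to do when a consumer abandons a child element half-way. It is an illustration under stated assumptions, not StaxMate's implementation: SkipSketch, skipRestOf and visitChildren are hypothetical names, and unlike StaxMate the sketch does nothing to stop a consumer from reading past the end of its child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a StaxMate class: a reader wrapper that counts open elements.
class SkipSketch {
    private final XMLStreamReader in;
    private int depth = 0; // number of elements currently open

    SkipSketch(XMLStreamReader in) {
        this.in = in;
    }

    // Advances the shared reader by one event and keeps the depth count current.
    int next() throws XMLStreamException {
        int event = in.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
            depth++;
        } else if (event == XMLStreamConstants.END_ELEMENT) {
            depth--;
        }
        return event;
    }

    // Skips whatever remains of the element that was opened at depth openDepth.
    void skipRestOf(int openDepth) throws XMLStreamException {
        while (depth >= openDepth) {
            next();
        }
    }

    // Walks the root's children. The stand-in "consumer" reads a single event of
    // each child and then abandons it; the parent loop catches up by skipping.
    static List<String> visitChildren(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SkipSketch s = new SkipSketch(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml)));
            while (s.next() != XMLStreamConstants.START_ELEMENT) {
                // ignore anything that precedes the root element
            }
            while (true) {
                int event = s.next();
                if (event == XMLStreamConstants.END_ELEMENT) {
                    break; // only the root's own end tag can show up here
                }
                if (event == XMLStreamConstants.START_ELEMENT) {
                    names.add(s.in.getLocalName());
                    s.next();        // consumer looks at one event, then gives up
                    s.skipRestOf(2); // parent skips the rest of the child (depth 2)
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [a, b]
        System.out.println(visitChildren("<root><a><x>1</x><y/></a><b>text</b></root>"));
    }
}
```

The parent sees every child exactly once, regardless of how much of each child the consumer chose to read.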

Access to event information

All event information (name of the element, attributes, attribute values, textual content, processing instruction target and so on) is accessed via the iterator that currently points to an event. At any given point, there is just one such iterator. Although there are methods to check whether an iterator is at such a valid point, they are seldom needed: as long as access is done right after advancing an iterator, it works as expected.

Filtering iterators

In addition to the two main types of iterators, there is also support for simple configurable filtering of the events visible through an iterator. For example, it is trivially easy to construct an element-only iterator, one that ignores all event types other than START_ELEMENT (and, for flat iterators, END_ELEMENT):

SMIterator childElemIter = currentIterator.childElementIterator();
SMIterator descElemIter = currentIterator.descendantElementIterator();

(The first call creates a nested iterator, the second a flat iterator.)
For more configurable filtering:

SMIterator filteredChildIter = currentIterator.childIterator(new SimpleFilter(...));
SMIterator filteredDescIter = currentIterator.descendantIterator(new SimpleFilter(...));

Here you can specify your own filtering rules. SimpleFilter bases filtering strictly on event types, which is simple, fast and usually sufficient; it is also used to implement the text and element filters.
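
Filtering purely by event type can be pictured as a bit mask over the StAX event-type constants. The sketch below assumes nothing about how SimpleFilter is actually implemented or constructed: EventTypeFilterSketch and its methods are hypothetical names, and the code uses only plain StAX (javax.xml.stream).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not StaxMate's SimpleFilter.
class EventTypeFilterSketch {
    private final int acceptedMask; // bit n is set when event type n is accepted

    EventTypeFilterSketch(int... acceptedEventTypes) {
        int mask = 0;
        for (int type : acceptedEventTypes) {
            mask |= 1 << type; // StAX event types are small positive ints
        }
        acceptedMask = mask;
    }

    boolean accept(int eventType) {
        return (acceptedMask & (1 << eventType)) != 0;
    }

    // Lists the events of the document that pass the filter, in document order.
    static List<String> filteredEvents(String xml, EventTypeFilterSketch filter) {
        List<String> seen = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next();
                if (!filter.accept(event)) {
                    continue; // filtered out: the caller never sees this event
                }
                if (event == XMLStreamConstants.START_ELEMENT) {
                    seen.add("<" + r.getLocalName() + ">");
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    seen.add("</" + r.getLocalName() + ">");
                } else {
                    seen.add("#" + event);
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    // Element-only view of a flattened document: START_ELEMENT and END_ELEMENT.
    static List<String> elementEvents(String xml) {
        return filteredEvents(xml, new EventTypeFilterSketch(
                XMLStreamConstants.START_ELEMENT, XMLStreamConstants.END_ELEMENT));
    }

    public static void main(String[] args) {
        // prints [<a>, <b>, </b>, </a>]
        System.out.println(elementEvents("<a>hi<!--c--><b/></a>"));
    }
}
```

Because the decision depends only on an integer event type, such a filter costs a single bit test per event.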

Tracking

Tracking is a simple yet powerful mechanism for retaining a chosen subset of information about the currently active branch of the logical XML content tree. For example, you may want to know the element and attribute names, and all attribute values, of all the ancestors of the element an iterator currently points to, without keeping track of anything else. This can be a decent compromise between a full in-memory DOM and purely transient streaming processing.

Tracking can be enabled on a per-iterator basis, and it takes effect for all child (and descendant) iterators of that iterator. It cannot apply retroactively, because there is no way to retrieve information about events that have already been passed.
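
Since the StaxMate tracking API is not described here, the idea is sketched below in plain StAX (javax.xml.stream) only. TrackingSketch and its methods are hypothetical names: the code keeps the name and attributes of each open ancestor on a stack and discards them as soon as the element closes, so memory use is bounded by the depth of the active branch rather than by document size.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of StaxMate.
class TrackingSketch {
    // Describes the element the reader points to: its name plus all attributes.
    private static String describe(XMLStreamReader r) {
        StringBuilder sb = new StringBuilder(r.getLocalName());
        for (int i = 0; i < r.getAttributeCount(); i++) {
            sb.append('[').append(r.getAttributeLocalName(i))
              .append('=').append(r.getAttributeValue(i)).append(']');
        }
        return sb.toString();
    }

    // For every element named target, returns its tracked ancestor path (root first).
    static List<String> ancestorPaths(String xml, String target) {
        List<String> paths = new ArrayList<>();
        Deque<String> openAncestors = new ArrayDeque<>(); // active branch only
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (r.getLocalName().equals(target)) {
                        paths.add(String.join("/", openAncestors));
                    }
                    openAncestors.addLast(describe(r)); // track from here on
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    openAncestors.removeLast(); // branch closed: forget it
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return paths;
    }

    public static void main(String[] args) {
        String xml = "<lib><shelf id='s1'><book/></shelf><book/></lib>";
        // prints [lib/shelf[id=s1], lib]
        System.out.println(ancestorPaths(xml, "book"));
    }
}
```

The sketch also shows why tracking cannot be retroactive: an ancestor's attributes are only available while the reader is positioned on its START_ELEMENT, so they must be captured at that moment.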

(to be completed)

Convenience methods

There are some additional convenience methods for doing commonly needed things like:

  • Get me all the text contained within the element this iterator points to (ignoring all elements, comments, processing instructions, if any)
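
What such a text-collecting method has to do can be sketched in plain StAX (javax.xml.stream). This is not the StaxMate convenience method itself, whose name and signature are not given here; TextSketch, collectText and textOfRoot are hypothetical names. Note that the standard XMLStreamReader.getElementText() does not cover this case, since it fails when the element contains child elements.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of StaxMate.
class TextSketch {
    // Expects the reader to point at a START_ELEMENT. Returns all text found up to
    // the matching END_ELEMENT, descending into nested elements and ignoring
    // comments and processing instructions.
    static String collectText(XMLStreamReader r) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        int depth = 1; // the element we start on is already open
        while (depth > 0) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    text.append(r.getText());
                    break;
                default:
                    break; // comments, processing instructions, etc.
            }
        }
        return text.toString();
    }

    // Convenience wrapper for the demonstration: collects the text of the root element.
    static String textOfRoot(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.next() != XMLStreamConstants.START_ELEMENT) {
                // ignore anything that precedes the root element
            }
            return collectText(r);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints Hello big world!
        System.out.println(textOfRoot("<p>Hello <b>big</b><!-- note --> world<?pi data?>!</p>"));
    }
}
```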

(to be completed)
