Parse an XML in standard C/C++ without additional libraries

I have an XML (assuming it is valid) and I must parse it and store it in a tree.
What is the best approach to parse it, without using other libraries, just basic manipulation of strings?
Keep in mind that I don't have to validate it, just parse it and store it in a tree.

The basic structure of XML is quite simple:
<tagname [attribute[="value"] ...]>content</tagname>
where the content may contain both normal text and more XML structures, or the special form
<tagname [attribute[="value"] ...]/>
which is equivalent to
<tagname [attribute[="value"] ...]></tagname>
that is, empty content.
So if you don't need to interpret a DTD or do other fancy things, you can do the following (a rough code sketch follows the steps):
1. Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
2. Now follows the tag name, until the first whitespace, or the / or the > character. Store that.
3. If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML and can exit with an error.
4. If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
5. Otherwise what follows is an attribute. Parse that, store the result, and continue at step 3.
6. Read the content until you find a < character.
7. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.
8. If you get here, you've found the beginning of a nested XML element. Parse that with this algorithm, and then continue at step 6.
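To make those steps concrete, here is a minimal sketch in C++. It assumes well-formed input and ignores comments, CDATA, processing instructions, declarations and entity references; Node, Parser and the helper functions are invented for this example and are not part of any library.

// Minimal sketch of the steps above. Assumes well-formed input and ignores
// comments, CDATA, processing instructions, declarations and entity references.
#include <cctype>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::map<std::string, std::string> attributes;
    std::string text;                              // concatenated character data
    std::vector<std::unique_ptr<Node>> children;
};

class Parser {
public:
    explicit Parser(const std::string& s) : src(s), pos(0) {}

    std::unique_ptr<Node> parse() {
        skipWhitespace();
        if (peek() != '<') throw std::runtime_error("expected '<'");       // step 1
        return parseElement();
    }

private:
    std::unique_ptr<Node> parseElement() {
        expect('<');
        auto node = std::make_unique<Node>();
        node->name = parseName();                                          // step 2
        for (;;) {                                                         // steps 3-5
            skipWhitespace();
            if (peek() == '/') { expect('/'); expect('>'); return node; }  // <tag/>
            if (peek() == '>') { expect('>'); break; }                     // end of begin tag
            std::string attr = parseName();                                // attribute
            std::string value;
            skipWhitespace();
            if (peek() == '=') {
                expect('='); skipWhitespace(); expect('"');
                while (peek() != '"') value += get();
                expect('"');
            }
            node->attributes[attr] = value;
        }
        for (;;) {                                                         // steps 6-8
            if (peek() == '<') {
                if (src[pos + 1] == '/') {                                 // end tag
                    expect('<'); expect('/');
                    if (parseName() != node->name)
                        throw std::runtime_error("mismatched end tag");
                    skipWhitespace(); expect('>');
                    return node;
                }
                node->children.push_back(parseElement());                  // nested element
            } else {
                node->text += get();                                       // plain content
            }
        }
    }

    std::string parseName() {
        std::string name;
        while (std::isalnum(static_cast<unsigned char>(peek())) ||
               peek() == '_' || peek() == '-' || peek() == ':')
            name += get();
        return name;
    }

    void skipWhitespace() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    }
    char peek() const { return pos < src.size() ? src[pos] : '\0'; }
    char get() { return pos < src.size() ? src[pos++] : '\0'; }
    void expect(char c) {
        if (get() != c) throw std::runtime_error(std::string("expected '") + c + "'");
    }

    const std::string& src;
    std::size_t pos;
};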

Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:
whether you validate or not, you need to deal with entity references like &lt; and with the variety of character references like &#65;
the plain body of an XML document is relatively simple, but the header is a major pain to deal with, in particular the DTD: there are two versions thereof which are slightly different, and you probably need to process the inline DTD
even the body isn't entirely trivial because of these annoying character data segments
even without validation you may need to support external entity references
the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
note that XML is defined in terms of Unicode and proper handling of this isn't entirely trivial either: just using char or wchar_t doesn't cut it.
The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator user. Unfortunately, I didn't get it to fly when trying to cope with the various entity references. It would parse simple XML files nice and fast, but some quirks in the specification I just didn't get right.
What worked best for me was creating a simple recursive descent parser combined with a suitable stack of buffers to somewhat transparently deal with entity references. However, to finish this completely I still need to deal with some encoding issues, and in the end I just had higher-priority projects to work on (in my spare time, that is).
In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.

The best and only approach is to re-implement such a library from scratch without using any other libraries...
You're welcome to use existing libraries like pugixml, for example. Its installation is as simple as adding its files to your project and starting to use it. It's lightweight compared to validating parsers such as Xerces.
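For what it's worth, basic pugixml usage looks roughly like this (the file name and the node/attribute names are placeholders for the example):

// Rough pugixml usage sketch; "config.xml" and the node/attribute names
// are placeholders.
#include <iostream>
#include "pugixml.hpp"

int main() {
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file("config.xml");
    if (!result) {
        std::cerr << "parse error: " << result.description() << '\n';
        return 1;
    }
    for (pugi::xml_node item : doc.child("root").children("item")) {
        std::cout << item.attribute("name").value() << ": "
                  << item.text().get() << '\n';
    }
    return 0;
}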

How to unit test a generator/serialization method?

I want to write unit tests for a serialization method. By serialization method I mean a method that outputs a set of data into a special format.
For example, a method that outputs data in XML format. (I write in C++ but it is the same in every language.)
class Generator
{
public:
    std::string serialize();
};
// unit test (pseudo-code)
Generator gen;
// set some data in gen
std::string actual = gen.serialize();
std::string expected = "<xml>...</xml>";
ASSERT_EQUAL(expected, actual);
The problem with this is that the unit test is highly dependent on non-important things, like the formatting of the XML (line breaks) or the order of XML-attributes.
While with XML the previous method will work, it will not work with generators that output binary data.
So, what is a robust way to test serialization methods?
The ideas I have are the following, but all have serious drawbacks.
using external libraries to parse the data (for proprietary formats, these may not exist).
always write pairs of serialization/deserialization and test them in combination (bugs in both methods might remain undiscovered).
store the serialized data in external files and compare against them in the test (the unit test is difficult to read and maintain).
As you are asking about unit-testing, I assume that the intended behaviour of the serializer in all its details is known by you. That is, you know where you want line breaks and indentation etc. to be inserted.
The problem now is, that in every single test case only a subset of these details would be relevant. In other words, in some tests you want to test the proper indentation, and in some tests you just want to be sure that a number is inserted in the right way.
In addition to the options you have provided I recommend another approach: Use regular expression matching instead of string comparison. With the help of regular expressions you can reduce the serialized string to the essential parts which are of interest in the respective test. To check, for example, if the result string contains a certain number, say, 42, you could match it against ^[^0-9]*42[^0-9]*$. Then, the enclosing XML would be ignored in this particular test. This test would then be robust against a large number of changes in the serialization.
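In C++, for instance, that check could be written with std::regex along these lines (the function name and the test strings are invented for the example):

// Sketch of the regex-based check; the function and test data are
// invented for this example.
#include <cassert>
#include <regex>
#include <string>

bool containsOnlyTheNumber42(const std::string& serialized) {
    static const std::regex pattern("^[^0-9]*42[^0-9]*$");
    return std::regex_match(serialized, pattern);
}

int main() {
    assert(containsOnlyTheNumber42("<xml><answer>42</answer></xml>"));
    assert(!containsOnlyTheNumber42("<xml><answer>41</answer></xml>"));
    return 0;
}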
With this approach you avoid the dependency on external parsing libraries (well, you are depending on a regular expression library, but that is nowadays part of the standard library in many languages), you can also test for aspects which the serialize/deserialize approach cannot test (indentation), and your tests run fast and are not OS-specific (no dependency on the file system).
This is more like a long comment with my first thoughts on the topic.
I think you have to look at two different scenarios. Your data <-> serialized data relation could be either 1:1 or 1:n.
XML would be a 1:n relation, where your XML code would have quite a bit of freedom but would still be deserialized to the same data again. In this case it seems to me that developing and testing serialization/deserialization in combination is the way to go. If there are external libraries available as well, use them, of course. If there are no external libraries available, then, as long as serialization and deserialization yield the same result, you will probably not have "bugs", but "features"...
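A round-trip test for such a 1:n format could look like the following C++ sketch, where Data, serialize and deserialize are toy stand-ins for the real code under test:

// Round-trip sketch: Data, serialize and deserialize are toy stand-ins
// for the real code under test.
#include <cassert>
#include <sstream>
#include <string>

struct Data {
    int id;
    std::string name;
    bool operator==(const Data& o) const { return id == o.id && name == o.name; }
};

std::string serialize(const Data& d) {
    std::ostringstream out;
    out << "<data id=\"" << d.id << "\">" << d.name << "</data>";
    return out.str();
}

Data deserialize(const std::string& s) {
    Data d{};
    d.id = std::stoi(s.substr(s.find("id=\"") + 4));       // crude, toy format only
    std::size_t start = s.find('>') + 1;
    d.name = s.substr(start, s.rfind("</data>") - start);
    return d;
}

int main() {
    // The test never inspects the exact XML layout, only that the data
    // survives the round trip unchanged.
    Data original{42, "example"};
    assert(deserialize(serialize(original)) == original);
    return 0;
}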
Testing the deserialization against stored external data files also makes sense, but this does not apply to the serialization, imho.
Looking at a 1:1 relation, like maybe putting the data into a certain binary format, you should go for the stored data in external files. Always use external libraries, if they exist, as well, of course.
I would suggest to do all three of those approaches together - where applicable, of course. You should not rely on a single one of them.

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help in order to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea. )
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from a different change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function, to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for the function name (an identifier) at file scope, followed by ( parameter-list ) { ... code ... }
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for the parser generator.
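To show how dumb such a scanner can be, here is a rough C++ sketch that does nothing but count braces at file scope. It assumes comments, string literals and preprocessor lines have already been stripped, and all the names are invented for the example; a real tool would at least need to tell function bodies apart from struct/enum bodies and initializers.

// Rough sketch of a "dumb" function splitter: it does not understand C,
// it only tracks brace depth at file scope. Comments, strings and
// preprocessor lines are assumed to have been stripped beforehand.
#include <iostream>
#include <string>
#include <vector>

struct Function {
    std::string header;   // text before the opening '{' (name + parameters)
    std::string body;     // text between the outermost braces
};

std::vector<Function> splitFunctions(const std::string& source) {
    std::vector<Function> functions;
    std::string header, body;
    int depth = 0;                       // current '{' nesting depth
    for (char c : source) {
        if (c == '{') {
            if (depth++ == 0) continue;  // entering a top-level body
        } else if (c == '}') {
            if (--depth == 0) {          // leaving a top-level body
                functions.push_back({header, body});
                header.clear();
                body.clear();
                continue;
            }
        }
        (depth == 0 ? header : body) += c;
    }
    return functions;
}

int main() {
    auto fns = splitFunctions("int add(int a, int b) { return a + b; }\n"
                              "void noop(void) { }\n");
    for (const auto& f : fns)
        std::cout << f.header << " -> " << f.body << '\n';
    return 0;
}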
If the code is Java, then ANTLR is a possible, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.

Finite State Machine parser

I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a, euh... line. See here for an input example. All the comments and junk are filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
{ } are scopes, and capitalized words signify that a list of options/files is to follow.
\n are only important in a list of options/files, signifying the end of the list.
So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:
Should I use an enum or an abstract class + derivatives for my states? The first is probably better for small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity. enum example and class example. EDIT: what about this suggestion for goto, I thought they were evil in C++?
When reading a list, I need to NOT ignore \n. My preferred way of using the string, via a stringstream, will ignore \n by default. So I need a simple way of telling (the same!) stringstream not to ignore newlines when a certain state is enabled.
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
Here's the draft states I have in mind:
upper: reads global, exe, lib+ target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.
Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion everywhere (even in the list state, per item).
Thanks for any input.
If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
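To make the "FSM plus a stack" idea concrete, here is a small C++ sketch. The states mirror the upper/normal/list draft from the question, the keyword detection is a crude stand-in, and balanced braces are assumed; it is meant as an illustration, not a definitive design.

// Sketch of an FSM with an explicit scope stack (i.e. a small pushdown
// automaton). Tokens are assumed to be whitespace-separated already,
// and braces are assumed to be balanced.
#include <cctype>
#include <iostream>
#include <sstream>
#include <stack>
#include <string>

enum class State { Upper, Normal, List };

bool isListKeyword(const std::string& tok) {
    // crude stand-in: treat fully capitalized words as list keywords
    for (char c : tok)
        if (std::islower(static_cast<unsigned char>(c))) return false;
    return !tok.empty();
}

void parse(std::istream& in) {
    std::stack<State> scopes;          // what to return to when a '}' closes
    State state = State::Upper;
    std::string line, tok;
    while (std::getline(in, line)) {   // keep '\n' significant: go line by line
        std::istringstream tokens(line);
        while (tokens >> tok) {
            if (tok == "{") {
                scopes.push(state);
                state = State::Normal;
            } else if (tok == "}") {
                state = scopes.top();
                scopes.pop();
            } else if (state != State::List && isListKeyword(tok)) {
                std::cout << "list " << tok << ":";
                state = State::List;
            } else if (state == State::List) {
                std::cout << ' ' << tok;
            } else {
                std::cout << "name: " << tok << '\n';
            }
        }
        if (state == State::List) {    // end of line ends the list
            std::cout << '\n';
            state = scopes.empty() ? State::Upper : State::Normal;
        }
    }
}

int main() {
    std::istringstream input(
        "global\n{\nSOURCE_DIRS src\nSOURCES bitwise.c framing.c\n}\n");
    parse(input);
    return 0;
}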
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analogous to words in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decisions about the lexical structure. From your data above, individual lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokens (e.g. "{", "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.
Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state/transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, next_state and current_state are some integer type (possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of a token is identified when you are in a valid end state and there is no valid transition to another state for the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making things more difficult than you need.
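As a tiny illustration of that table-driven idea (the states and character classes here are invented for the example; a real tokenizer would have more of both):

// Tiny illustration of a table-driven tokenizer step:
// next_state = transitions[current_state][char_class(c)]
#include <cctype>

enum State { Start, InWord, InNumber, Error, NumStates };
enum CharClass { Letter, Digit, Space, Other, NumClasses };

CharClass classify(char c) {
    if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') return Letter;
    if (std::isdigit(static_cast<unsigned char>(c))) return Digit;
    if (std::isspace(static_cast<unsigned char>(c))) return Space;
    return Other;
}

const State transitions[NumStates][NumClasses] = {
    /* Start    */ { InWord, InNumber, Start, Error },
    /* InWord   */ { InWord, InWord,   Start, Error },
    /* InNumber */ { Error,  InNumber, Start, Error },
    /* Error    */ { Error,  Error,    Error, Error },
};

State step(State current, char c) {
    return transitions[current][classify(c)];
}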
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the difference between Unix and Windows line breaks, however; you could create a FSM that operates on either).
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my comments on lexical and grammatical analysis above.
For parsing I always try to use something already proven to work: ANTLR with ANTLRWorks, which is of great help for designing and testing a grammar. You can generate code for C/C++ (and other languages), but you need to build the ANTLR runtime for those languages.
Of course, if you find flex or bison easier to use, you can use them too (I know that they generate only C and C++, but I may be wrong since I haven't used them for some time).

hierarchical regex expression

Is it possible/practical to build a single regular expression that matches hierarchical data?
For example:
<h1>Action</h1>
<h2>Title1</h2><div>data1</div>
<h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
<h2>Title3</h2><div>data3</div>
I would like to end up with these matches:
"Action", "Title1", "data1"
"Action", "Title2", "data2"
"Adventure", "Title3", "data3"
As I see it, this would require knowing that there is a hierarchical structure at play here, and if I code the pattern to capture the H1, it only matches the first entry of that hierarchy. If I don't code for H1 then I can't capture it. I was wondering if there are any special tricks I can employ to solve this.
This is a .NET project.
The solution is to not use regular expressions. They're not powerful enough for this sort of thing.
What you want is a parser - since it looks like you're trying to match HTML, there are plenty to choose from.
It's generally considered bad practice to attempt to parse HTML/XML with RegEx, precisely because it's hierarchical. You COULD use a recursive function to do so, but a better solution in this case is to use a real XML parser. I couldn't give you better advice than that without knowing the platform you're using.
EDIT: Regex is also very slow, which is another reason it's bad for processing HTML; however, I don't know that an XML/DOM processor is likely to be faster since it's likely to use a lot more memory.
If you JUST want data from a simple document like you've demonstrated, and/or if you want to build a solution yourself, it's not that tough to do. Just build a simple, recursive state-based stream processor that looks for tags and passes the contents to the next recursive level (a rough sketch follows the steps below).
For example:
- In a recursive function, seek out a "<" character.
- Now find a ">" character.
- Preserve everything you find until the next "<" character.
- Find a ">" character.
- Pass whatever you found between those tags into the recursive function.
You'd have to work out error checking yourself, but the base case (when you return back up to the previous level) is just when there's nothing else to find.
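A skeleton of that recursive processor might look like the following C++ sketch. It handles only simple, well-nested tags without attributes, the error handling is left out as noted, and the names and the sample input are purely illustrative.

// Skeleton of the recursive stream processor described above. Only
// simple, well-nested tags without attributes or self-closing forms
// are handled; there is no error checking.
#include <iostream>
#include <string>

// Process the content of one element starting at 'pos'; 'depth' is only
// used for indenting the output. Returns when the current element's
// content (or the whole input) is exhausted.
void processLevel(const std::string& s, std::size_t& pos, int depth) {
    while (true) {
        std::size_t open = s.find('<', pos);
        if (open == std::string::npos) return;            // nothing left
        std::string text = s.substr(pos, open - pos);     // text before this tag
        if (text.find_first_not_of(" \t\r\n") != std::string::npos)
            std::cout << std::string(depth * 2, ' ') << "text: " << text << '\n';
        std::size_t close = s.find('>', open);
        std::string tag = s.substr(open + 1, close - open - 1);
        pos = close + 1;
        if (tag[0] == '/') return;                        // end tag: back up one level
        std::cout << std::string(depth * 2, ' ') << "tag: " << tag << '\n';
        processLevel(s, pos, depth + 1);                  // recurse into the element
    }
}

int main() {
    std::string doc = "<h1>Action<h2>Title1<div>data1</div></h2></h1>";
    std::size_t pos = 0;
    processLevel(doc, pos, 0);
    return 0;
}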
Maybe this helps, maybe not. Good luck to you.
Regex does not work for this type of data. It is not regular, per se.
You should use an XML parser for this.

Generalized stream parsing?

Are there any libraries or technologies (in any language) that provide a regular-expression-like tool for any sort of stream-like or list-like data (as opposed to only character strings)?
For example, suppose you were writing a parser for your pet programming language. You've already got it lexed into a list of Common Lisp objects representing the tokens.
You might use a pattern like this to parse function calls (using C-style syntax):
(pattern (:var (:class ident)) (:class left-paren)
(:optional (:var object)) (:star (:class comma) (:var :object)) (:class right-paren))
Which would bind variables for the function name and each of the function arguments (actually, it would probably be implemented so that this pattern binds a variable for the function name, one for the first argument, and a list of the rest, but that's not really an important detail).
Would something like this be useful at all?
I don't know how many replies you'll receive on a subject like this, as most languages lack the sort of robust stream APIs you seem to have in mind; thus, most of the people reading this probably don't know what you're talking about.
Smalltalk is a notable exception, shipping with a rich hierarchy of Stream classes that--coupled with its Collection classes--allow you to do some pretty impressive stuff. While most Smalltalks also ship with regex support (the pure ST implementation by Vassili Bykov is a popular choice), the regex classes unfortunately are not integrated with the Stream classes in the same way the Collection classes are. This means that using streams and regexes in Smalltalk usually involves reading character strings from a stream and then testing those strings separately with regex patterns--not the sort of "read the next n characters up until a pattern matches" or "read the next n characters matching this pattern" functionality you likely have in mind.
I think a powerful stream API coupled with powerful regex support would be great. However, I think you'd have trouble generalizing about different stream types. A read stream on a character string would pose few difficulties, but file and TCP streams would have their own exceptions and latencies that you would have to handle gracefully.
Try looking at scala.util.regexp, both the API documentation and the code example at http://scala.sygneca.com/code/automata. I think it would allow a computational linguist to match strings of words by looking for part-of-speech patterns, for example.
This is the principle behind most syntactic parsers, which operate in two phases. The first phase is the lexer, where identifiers, language keywords, and other special characters (arithmetic operators, braces, etc) are identified and split into Token objects that typically have a numeric field indicating the type of the lexeme, and optionally another field indicating the text of the lexeme.
In the second phase, a syntactic parser operates on the Token objects, matching them by magic number alone, to parse phrases. (Software for doing this includes ANTLR, yacc/bison, Scala's scala.util.parsing.combinator.syntactical library, and plenty of others.) The two phases don't entirely have to depend on each other -- you can get your Token objects from anywhere else that you like. The magic number aspect seems to be important, though, because the magic numbers are assigned to constants, and they're what make it easy to express your grammar in a readable language.
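In C++ terms, such tokens and the "match by magic number" step might look roughly like this sketch; the token kinds and the little grammar fragment are invented for the example:

// Rough rendering of the two-phase idea: Token objects carry a kind
// (the "magic number") plus the original text, and the parser matches
// on the kind alone.
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number, LeftParen, RightParen, Comma, Plus };

struct Token {
    TokenKind kind;      // the "magic number" the parser matches on
    std::string text;    // the lexeme, kept for later use (names, values)
};

// The parser never looks at characters again, only at token kinds.
// Matches: identifier '(' ... ')'  (arguments elided in this sketch)
bool matchFunctionCall(const std::vector<Token>& tokens, std::size_t& pos) {
    if (pos + 1 >= tokens.size() ||
        tokens[pos].kind != TokenKind::Identifier ||
        tokens[pos + 1].kind != TokenKind::LeftParen)
        return false;
    pos += 2;
    while (pos < tokens.size() && tokens[pos].kind != TokenKind::RightParen)
        ++pos;
    if (pos == tokens.size()) return false;     // unbalanced parentheses
    ++pos;                                      // consume ')'
    return true;
}

int main() {
    std::vector<Token> tokens = {
        {TokenKind::Identifier, "print"}, {TokenKind::LeftParen, "("},
        {TokenKind::Identifier, "x"},     {TokenKind::Comma, ","},
        {TokenKind::Number, "42"},        {TokenKind::RightParen, ")"},
    };
    std::size_t pos = 0;
    assert(matchFunctionCall(tokens, pos) && pos == tokens.size());
    return 0;
}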
And remember, that anything you can accomplish with a regular expression can also be accomplished with a context-free grammar (usually just as easily).