Real time parsing

Real time parsing - c++

I am quite new to parsing text files. While googling, I found that a parser usually builds a tree structure out of a text file. Most examples consist of parsing files, which in my view is quite static: you load the file into the parser and get the output.
My problem is different from parsing files. I have a stream of JSON data coming from a server socket on TCP port 6000, and I need to parse the incoming data. I have some questions in mind:
1) Do I need to save the incoming JSON data on the client side in some sort of buffer? Answer: I think yes, I need to save it, but are there any parsers that can take it directly, for example by passing the JSON object as an argument to the parse function?
2) What would the structure of a real-time parser look like? Answer: On Google I could only find static parse-tree structures. In my view, each object is parsed into some sort of parse tree, which is then deleted from memory. Otherwise memory usage would grow without bound, because the data is continuous.
There are some parser libraries available, like JSON-C and JSON lib. One more thing that comes to mind: can we save a JSON object in a C/C++ array? I thought of that but couldn't work out how to do it.
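
To make the buffering idea in questions 1) and 2) concrete, here is a minimal sketch using only the standard library. It assumes the server sends top-level JSON objects back to back (an assumption, the question does not say how objects are delimited). The class name JsonFramer and its feed() method are hypothetical. It does not parse JSON itself. It only finds where each complete object ends, so that the object can be handed to a real parser and then dropped from memory.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: accumulates raw bytes from the socket and hands back
// each complete top-level JSON object as soon as its closing brace arrives.
class JsonFramer {
public:
    // Append one chunk (for example whatever recv() returned) and return the
    // objects completed by it. Incomplete data stays buffered for next time.
    std::vector<std::string> feed(const std::string& chunk) {
        std::vector<std::string> complete;
        buffer_ += chunk;
        while (scan_ < buffer_.size()) {
            char c = buffer_[scan_++];
            if (in_string_) {
                // Braces inside string literals must not affect the depth.
                if (escaped_)       escaped_ = false;
                else if (c == '\\') escaped_ = true;
                else if (c == '"')  in_string_ = false;
            } else if (c == '"') {
                in_string_ = true;
            } else if (c == '{') {
                if (depth_++ == 0) start_ = scan_ - 1;  // a new object begins
            } else if (c == '}' && depth_ > 0) {
                if (--depth_ == 0) {
                    complete.push_back(buffer_.substr(start_, scan_ - start_));
                    buffer_.erase(0, scan_);  // release the consumed bytes
                    scan_ = 0;
                }
            }
        }
        return complete;
    }

private:
    std::string buffer_;
    std::size_t scan_ = 0;   // next unscanned position in buffer_
    std::size_t start_ = 0;  // where the current top-level object began
    int depth_ = 0;          // current brace nesting depth
    bool in_string_ = false;
    bool escaped_ = false;
};
```

Each string returned by feed() can be passed to whichever JSON library is in use, and the resulting tree can be freed before the next object is handled, so memory stays bounded even though the stream never ends.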

Related

python2.7: reading journald's binary log

I'm currently working on a parser for plaso. For this I need to read journald's binary log files and convert those to a plaso timeline object.
My question now is: how do I read a binary file in Python, keeping in mind that the file may contain both strings and integers? Is a byte array sufficient for this? If so, how can I find the correct delimiters for the message fields?
Since I'm new to Python I can't provide useful code just yet; I'm still trying to wrap my head around this.

You can deal with binary data using the struct module.
If I were you, I would look up the layout of the file written by journald (in the journald docs or its source code) and parse the binary data into fields accordingly.

How to create a dynamic message with Protocol Buffers?

Say we want to create our message without using any preexisting .proto files or the cpp/cxx/h files compiled from them. We want to use protobuf strictly as a library. For example, we have a message description (in some format known only to us): a message called MyMessage has to have MyIntFiels and an optional MyStringFiels. How do we create such a message? For example, how do we fill it with simple data, save it to a .bin file, and read its contents back from that binary?
I looked all over the dynamic_message.h documentation, DescriptorPool and so on, but I do not see how to add or remove fields in a message, nor any way to add a message described on the fly to a DescriptorPool.
Can anyone please explain?

Short answer: it can't be used that way.
The overview page of Protobuf says:
XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).
Meaning the whole point of Protobuf is to throw out self-description in favor of parsing speed, so creating self-describing messages is simply not its purpose.
Consider using XML, JSON, or another serialization format. If protection is needed, you can use symmetric encryption and/or lzip compression.

High performance XML parsing in C++

Well, a lot of questions have been asked about parsing XML in C++ and so on...
But, instead of a generic problem, mine is very specific.
I am asking for a very efficient XML parser for C++. In particular I have a VERY VERY BIG XML file to parse.
My application must open this file and retrieve data. It must also insert new nodes and save the final result in the file again.
To do this I initially used rapidxml, but it requires me to open the file, parse all of it (this library has no functions to access the file directly without loading the entire tree first), then modify the tree and store the final tree by overwriting the file. This consumes too many resources.
Is there an XML parser that does not require me to load the entire file, but that I can use to quickly insert new nodes and retrieve data? Can you please suggest solutions for this problem?

You want a streaming XML parser rather than what is called a DOM parser.
There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
Three excellent push parser libraries for C++ include Expat, Xerces-C++, and libxml2.
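
To show the shape of such stateful handlers without depending on a particular library, here is a toy sketch in standard C++. The XmlEvents struct, push_parse and insert_before_close are hypothetical names, not the API of Expat, Xerces-C++ or libxml2. The tokenizer is deliberately naive: it handles only plain start tags, end tags and text, with no attributes, comments, CDATA or entities. A real program would register equivalent callbacks with one of the libraries above.

```cpp
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical event interface, modelled on what push (SAX-style) parsers offer.
struct XmlEvents {
    std::function<void(const std::string&)> on_start;  // "start tag reached"
    std::function<void(const std::string&)> on_end;    // "end tag reached"
    std::function<void(const std::string&)> on_text;   // "text found"
};

// Toy push tokenizer: reads the stream once and fires events as it goes.
// It never holds more than one tag or one text run in memory.
void push_parse(std::istream& in, const XmlEvents& ev) {
    std::string text;
    char c;
    while (in.get(c)) {
        if (c != '<') { text += c; continue; }
        if (!text.empty()) { ev.on_text(text); text.clear(); }
        std::string tag;
        std::getline(in, tag, '>');  // everything up to the closing '>'
        if (!tag.empty() && tag[0] == '/') ev.on_end(tag.substr(1));
        else ev.on_start(tag);
    }
    if (!text.empty()) ev.on_text(text);
}

// Copies the document to the output while inserting new_node just before
// every closing tag named parent. The handlers write incrementally.
std::string insert_before_close(const std::string& xml,
                                const std::string& parent,
                                const std::string& new_node) {
    std::istringstream in(xml);
    std::ostringstream out;
    XmlEvents ev;
    ev.on_start = [&](const std::string& n) { out << '<' << n << '>'; };
    ev.on_text  = [&](const std::string& t) { out << t; };
    ev.on_end   = [&](const std::string& n) {
        if (n == parent) out << new_node;  // the transformation step
        out << "</" << n << '>';
    };
    push_parse(in, ev);
    return out.str();
}
```

In a real program the input would be a std::ifstream and the output a temporary file rather than in-memory strings. The handlers themselves would stay the same.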

Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.

SAX parsers are faster than DOM parsers because a DOM parser builds an in-memory representation of the entire XML document before you can use it, whereas a SAX parser behaves like an event listener and reports each piece of the document as it reads the file. Go here for an explanation.
As you mentioned, Xerces is a good C++ SAX parser.
I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.

Okay, here is one off the beaten track. I looked at it but haven't really used it myself: it's called asmxml. Its authors claim unmatched performance. The downside is that it requires x86 assembler.

If you really need a high-performance XML stream parser, then libhpxml is likely the right thing for you.

I’m convinced that no XML library exists that allows you to modify a file without loading it first. This simply isn’t possible because files don’t work that way: you cannot insert (or remove) data in the middle of a file. You can only overwrite a block with one of identical size, or append at the end. But your request would require inserting or removing data in the middle of the file.
Reading only parts of an XML file may be possible. But writing … no way.

Go for template libraries as much as possible, like Boost::property_tree, Boost::XMLParser or POCO::XML. Folly also has an XML parser in it.
Avoid old C libraries; they all use old code designs.

Some people say the QtXML module performs well on huge XML files.

XML Serialization/Deserialization in C++

I am using C++ from MinGW, which is the Windows version of GNU C++.
What I want to do is serialize a C++ object into an XML file and deserialize an object from an XML file on the fly. I checked TinyXML. It's pretty useful, and (please correct me if I misunderstand it) it basically adds all the nodes during processing and finally writes them to a file in one chunk using the TiXmlDocument::SaveFile(filename) function.
I am working on real-time processing. How can I write to a file on the fly and append each subsequent result to it?
Thanks.

Boost has a very nice serialization/deserialization library, Boost.Serialization.
If you stream your objects to a Boost XML archive, it will stream them in XML format.
If XML is too big or too slow, you only need to swap the archive for a text or binary archive to change the streaming format.

Here is a better example of C++ object serialization:
http://www.codeproject.com/KB/XML/XMLFoundation.aspx

I notice that each TiXmlBase Class has a Print method and also supports streaming to strings and streams.
You could walk the new parts of the document in sequence and output those parts as they are added, maybe?
Give it a try.....
Tony

I've been using gSOAP for this purpose. It is probably too powerful for just XML serialization, but knowing it can do much more means I do not have to consider other solutions for more advanced projects since it also supports WSDL, SOAP, XML-RPC, and JSON. Also suitable for embedded and small devices, since XML is simply a transient wire format and not kept in a DOM or something memory intensive.

XML Parsing Problem

I have an XML parser that crashes on incomplete XML data. So XML data fed to it could be one of the following:
<one><two>twocontent</two</one>
<a/><b/> ( the parser treats it as two root elements )
Element attributes are also handled ( though not shown above ).
Now, the problem is when I read data from socket I get data in fragments. For example:
<one>one
content</two>
</one>
Thus, before sending the XML to the parser I have to construct a valid XML and send it.
What programming construct (iteration, recursion, etc.) would be the best fit for this kind of scenario?
I am programming in C++.
Please help.

Short answer: You're doing it wrong.
Your question confuses two separate issues:
1) Parsing of data that is not well-formed XML at all, i.e. so-called tag soup.
Example: Files generated by programmers who do not understand XML or have lousy coding practices.
It is not unfair to say: a file that is not well-formed XML is not an XML document at all. Every correct XML parser will reject it. Ideally you would work to correct the source of this data and make sure that proper XML is generated instead.
Alternatively, use a tag soup parser, i.e. a parser that does error correction.
Useful tag soup parsers are often actually HTML parsers. tidy has already been pointed out in another answer.
Make certain that you understand what correction steps such a parser actually performs, since there is no universal approach that could fix XML. Tidy in particular is very aggressive at "repairing" the data, more aggressive than real browsers and the HTML 5 spec, for example.
2) XML parsing from a socket, where data arrives chunk by chunk in a stream. In this situation, the XML document might be viewed as "infinite", with chunks being processed as they appear, long before a final end tag for the root element has been seen.
Example: XMPP is a protocol that works like this.
The solution is to use a pull-based parser, for example the XMLTextReader API in libxml2.
If a tree-based data structure for the XML child elements being parsed is required, you can build a tree structure for each such element as it is read, just not for the entire document.
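
The per-element idea can be sketched without libxml2. The class below is a standard-library toy, not the XMLTextReader API. StanzaSplitter and feed() are hypothetical names. It assumes no comments, CDATA sections, or '>' characters inside attribute values. It accepts arbitrary chunks from the socket and returns each direct child of the root element once that child is complete, so a small tree can be built for it alone.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits an endless XML stream into the complete
// direct children of the root element, regardless of chunk boundaries.
class StanzaSplitter {
public:
    std::vector<std::string> feed(const std::string& chunk) {
        std::vector<std::string> done;
        buf_ += chunk;
        while (pos_ < buf_.size()) {
            char c = buf_[pos_++];
            if (c == '<') { tag_begin_ = pos_ - 1; in_tag_ = true; continue; }
            if (c != '>' || !in_tag_) continue;
            in_tag_ = false;

            char first = buf_[tag_begin_ + 1];
            bool self_closing = buf_[pos_ - 2] == '/';
            if (first == '?' || first == '!') continue;  // declaration etc.

            if (first == '/') {
                if (depth_ > 0) --depth_;                // closing tag
            } else {
                if (depth_ == 1) stanza_begin_ = tag_begin_;  // child starts
                if (!self_closing) { ++depth_; continue; }
            }
            if (depth_ == 1) {  // a direct child of the root just ended
                done.push_back(
                    buf_.substr(stanza_begin_, pos_ - stanza_begin_));
                buf_.erase(0, pos_);  // forget what has been delivered
                pos_ = 0;
            }
        }
        return done;
    }

private:
    std::string buf_;
    std::size_t pos_ = 0;           // next unscanned position
    std::size_t tag_begin_ = 0;     // index of the current tag's '<'
    std::size_t stanza_begin_ = 0;  // index where the current child began
    int depth_ = 0;                 // 1 means "directly inside the root"
    bool in_tag_ = false;
};
```

Each returned string is a small, well-formed element that can be given to a tree-building parser on its own, while the root element of the stream is never closed.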

What is feeding you the XML from the other end of the socket connection? It doesn't make sense that you should be missing stuff, as you illustrate, just because you receive it from a socket.
If the socket is using TCP (or a custom protocol with similar properties), you should not be missing parts of your XML. Thus, you should be able to just buffer it all until the other end signals "end of document", and then feed it to your picky XML parser.
If you are using UDP or some other "lossy" protocol, you need to reconsider, since it's obviously not possible to correctly transfer a large XML document over a channel that randomly drops pieces.

Because the XML structure is a hierarchic structure (a tree) a recursion would be the best way to approach this.
You can call the recursion on each child and fix the missing XML identifiers.
Basically, you'll be doing the same thing a DOM parser would do, only you'll parse the file in order to fix its structure.
One thing though: it seems to me that with this method you are going to rewrite the XML parser. Isn't that a waste of time?
Maybe it's better to find a way for the XML to arrive in the right structure rather than trying to fix it.

Are there multiple writers? Why isn't your parser validating the XML?
Use a tree where every node represents an element and carries a dirty bit. The first occurrence of a node marks it as dirty, i.e. you are expecting a closing tag, unless of course the node is of the form <a/>. Also, the first element you encounter is the root.
When you hit a dirty node, keep pushing nodes onto a stack until you hit the closing tag, at which point you pop the contents.
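
A minimal sketch of the stack part of this idea, in standard C++. The names ScanResult and scan_open_elements are hypothetical. An element on the stack plays the role of a dirty node: it has been opened and is still waiting for its closing tag. The scan ignores attribute contents, comments and CDATA, so it is an illustration rather than a validator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct ScanResult {
    bool mismatched;                // a closing tag did not match the top
    std::vector<std::string> open;  // elements still awaiting a closing tag
};

// Scan the XML received so far and report which elements are still open.
ScanResult scan_open_elements(const std::string& xml) {
    ScanResult r{false, {}};
    std::size_t i = 0;
    while ((i = xml.find('<', i)) != std::string::npos) {
        std::size_t end = xml.find('>', i);
        if (end == std::string::npos) break;  // the tag itself is cut off
        std::string tag = xml.substr(i + 1, end - i - 1);
        i = end + 1;

        if (tag.empty() || tag[0] == '?' || tag[0] == '!') continue;
        if (tag.back() == '/') continue;      // <a/> never becomes dirty
        if (tag[0] == '/') {
            // Closing tag: it must match the most recently opened element.
            if (r.open.empty() || r.open.back() != tag.substr(1)) {
                r.mismatched = true;
                return r;
            }
            r.open.pop_back();
        } else {
            // Opening tag: push its name (without attributes) as dirty.
            r.open.push_back(tag.substr(0, tag.find_first_of(" \t\r\n")));
        }
    }
    return r;
}
```

If open is non-empty and mismatched is false, more data is simply expected from the socket. If mismatched is true, as with the `</two>` in the question's fragment, the input is malformed and waiting for more data will not help.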

In your example, how are you going to figure out exactly where in the content to put the opening <two> tag once you have detected it is missing? This is, as they say, non-trivial.
