create and stream large XML document in C++

create and stream large XML document in C++ - c++

I have some code that creates a fairly large xml DOM and writes it off to a file (up to 50-100MB) . It basically creates the DOM and then calls a toString on it and writes it out with ofstream. Is there a way to get streaming output of the generated dom so that it doesn't create the whole structure in memory all at once and then copy it, etc? I will not modify any node after i create it so it can write it out and free up the memory right away. I could write my own xml class that does the xml construction but ... i don't think that's a good idea since i'll probably miss something when it comes down to escaping etc.

Ok, turns out libxml2 has a streaming API:
http://xmlsoft.org/examples/testWriter.c
It's a little old style (very C-ish) but you can write your wrapper around it.

I would recommend GenX as a streaming XML writer, I use this in Programmer's Notepad and it works a treat, you can see examples of use in the source code. Extremely fast, and it produces good UTF-8 XML. Memory usage while you use it should remain roughly constant.

Look under keyword "C++ XML writer;" XML writers generate XML to file without building the entire DOM in memory so they don't need to use very much memory at all. You didn't mention platform, but Microsoft XmlLite has IXmlWriter.

There really isn't much to generating valid XML; the escaping that you worry about is trivial.
There's a library for streaming writing of XML here: https://code.google.com/p/xml-writer-cpp/ — if nothing else, it's useful for education purposes.

Related

Writing Output File Cuda C++

I need to write simulation data computed on GPU into an output .csv file. Normally I would just use the fstream library but that's not possible on GPU.
Are there any built-in functions or other libraries that I could use to write data to .csv or .txt files directly from device code? Right now, performance is really not that important but rather an easy interim solution.

No, it's not possible to do direct file I/O in CUDA from device code, unless you are using something like GPU Direct Storage (GDS) (which most likely you are not, at the current time, and based on your question). If you don't already have it set up, GDS might not be an "easy interim solution".
Copy the data to the host, then use whatever file I/O routines you are comfortable with.
Note that requests for library recommendations are specifically off-topic for SO.

Use the printf statement to output the prints from Cuda kernel to a text file and then parse the text file to convert to CSV.

Serialize Lua Table from C++ (via JSON)

I'd like to communicate complex data from a C++ service to a Lua application. This communication takes place over the network. For simplicity and speed in the Lua application I would prefer to send literal Lua table literals (no need for a separate parser) instead of XML or JSON or YAML or such.
While there exist things like C++ libraries that write JSON, I cannot find an existing C++ library for creating serialized Lua. My idea, then, is to use an existing JSON library for C++ and then convert the string to Lua.
So, for example, I'd like to convert this string:
{
"hello":42,
"array":[1,2,{"more":false},null,true],
"worst":"still [null]: got it?"
}
into this string:
{
["hello"]=42,
["array"]={1,2,{["more"]=false},nil,true},
["worst"]="still [null]: got it?"
}
A naive replace_all converting to : to =, [] to {}, and null to nil will destroy content inside of strings. How can I perform this conversion?
To avoid the problems of an XY problem I have included my end motivation at the top and in the title, in case the JSON->Lua string conversion is the wrong choice.

I would code that Lua-format serializing library by myself. You could choose a free software Json C++ library (e.g. jsoncpp or libjson) and adapt its code (to your Lua-format) quite easily.
Of course you should obey the license of that library, and I strongly suggest you to make your Lua-format serialization library a free software itself, e.g. on github and/or freecode and/or sourceforge...
The point is that JSON (and hopefully your Lua format) is simple enough to make quite easy its parsing or printing... Adapting an existing library to your format is probably simpler and certainly faster than "post-processing" its output ...

Although I'm not locating it readily today, I recall that there was discussion a year or so ago on the Lua list of the merits of defining a limited subset of Lua table literals analogous to JSON, dubbed "LSON" for the sake of discussion. IIRC, the consensus developed that there wasn't enough benefit to be had over just using an established standard lightweight format like JSON, but I know some experiments were conducted.
This Github Gist for lson.lua demonstrates a simple LSON writer and reader in pure Lua. The writer could be transformed to C or C++ with only moderate effort based on that code. A key feature of that code is that it provides some protection against circular references, and against data types that can be stored in tables but which have no reasonable mechanism for writing as source code (userdata, and thread are both highly problematic to serialize in any form). Of course, for data originating as plain old data in C with only lightweight structure, you won't have any of the problematic data types anyway. It also protects against circular references. If serializing lists or trees from C, circular references may be impossible by construction. If not, you will need to deal with them yourself.
Note that using Lua's own parser does potentially introduce security issues. The most glaring issue is that just writing assert(loadstring('return '..Input))() allows the imported text access to your entire current environment. While there is some protection resulting from applying the return keyword outside of the input text, that still won't prevent clever use of any functions that can be called from an expression. For best safety, you will want to read about sandboxes and possibly even apply some clever tricks to restrict the compiled bytecode before blindly executing it.
The security issues may be a strong argument in favor of qualifying and using a JSON parser. Even javascipt applications often prefer to use JSON parsers rather than just letting the javascript engine execute untrusted content.

Can flex and bison accept input from other sources?

I am going to write a text editor in Qt which can provide highlighting/code completion/syntax analyzing for a programming language (toy language, for learning purpose).
At first, I thought of writing handcraft C++, which would be more comfortable for me since I'm more familiar. However, upon searching, I found out that flex/bison can simplify parser creation. Upon trying a few simple example, it seems the working examples accept input from standard input in terminal. So, I just want to know, can flex/bison accept input from the text editor widget in GUI framework (such as Qt, which I'm going to learn at the same time after I finish a few features in the parser engine), then later output the result back to the text editor?

If you do not want to use a FILE * pointer, you can also scan from in-memory buffers such as character arrays and nul-terminated C type strings by creating FLEX input buffers - yy_scan_string() creates a buffer from a null terminated string, yy_scan_bytes creates a buffer from a fixed length character array. See Multiple Input Buffers in the flex documentation for more information.
And if that does not meet your needs, you can also re-define the YY_INPUT macro for complete control - see Generated Scanner.

flex reads its input from yyin. If you point it to something that is not stdin... See here for example.
Edit: btw, yyin is a FILE *. You're using C++, which means you'd want to pass a stream instead. Please, read flex's documentation on C++ interfacing
Edit2: for the output... you're the one programming yacc/bison actions for the rules, and also the error handler. In that sense, you're given quite some freedom on what to do there. For example, you can "emit" highlighted code and also use the error handlers to point errors when analyzing the code. The completion would force you to implement at least part of the semantics (symbol table, etc), but that's a different story...

What library to use to write XML file in a C++ program?

What library to use to write XML file in a C++ program?
I've found two classes posted in CodeProject
http://www.codeproject.com/KB/stl/simple_xmlwriter.aspx
http://www.codeproject.com/KB/XML/XML_writer.aspx
but want to check if there is more standard option than these. I'm only concerned with writing, and not parsing XML.

I tried different libraries and finally decided for TinyXml. It's compact, fast, free (zlib license) and very easy to use.

Question: Are you ever going to update an XML file? Because while that sounds like it's just more writing, with XML it still requires a parser.
While xerces is large and bloated, it is fully standards compliant and it is DOM based. Should you ever have to cross platform or change language, there will always be a DOM based library for whatever language/platform you might move to so knowing how DOM based parsing/writing works is a benefit. If you are going to use XML, you may as well use it correctly.
Avoiding XML altogether is of course the best option. But short of that, I'd go with xerces.

You can use Xerces-C++, a library written by Apache foundation. This library permits read, write and manipulate XML files.
Link: http://xerces.apache.org/xerces-c/

For my purposes, PugiXML worked out really nicely
http://pugixml.org/
The reason why I thought it was so nice was because it was simply 3 files, a configuration header, a header, and the actual source.
But as you stated, you aren't interested in parsing, so why even bother using a special class to write out XML? While maybe your classes are too complex for this, I found the easy thing to do is use the std::ostream and just write out standard compliant XML this way. For example, say I have a class that represents a Company which is a collection of Employee objects, just make a method in each the Company and Employee classes that looks something like the following psuedocode
Company::writeXML(std::ostream& out){
out << "<company>" << std::endl;
BOOST_FOREACH(Employee e, employees){
e.writeXML(out);
}
out << "</company>" << std::endl;
}
and to make it even easier, your Employee's writeXML function can be declared virtual so that you can have a specific output for say a CEO, President, Janitor or whatever the subclasses should be.

I have been using the open-source libxml2 library for years, works great for me.

I ran into the same problem and wound up rolling my own solution. It's implemented as a single header file that you can drop into your project: xml_writer.h
And it comes with a set of unit tests which also serve as documentation.

Roll your own
I've been in a similar situation. I had a program that needed to generate JSON. We did it two ways. First we tried jsoncpp, but in the end I just generated the JSON directly via a std::ofstream.
Afterward we ran the generated JSON through a validator to catch any syntax errors. There were a few but they were really easy to find and correct.
If I were to do it again I would definitely roll my own again. Somewhat unexpectedly, there was less code when using std::ofstream. Plus we didn't have to use/learn a new API. It was easier to write and is easier to maintain.

How to create a virtual file?

I'd like to simulate a file without writing it on disk. I have a file at the end of my executable and I would like to give its path to a dll. Of course since it doesn't have a real path, I have to fake it.
I first tried using named pipes under Windows to do it. That would allow for a path like \\.\pipe\mymemoryfile but I can't make it works, and I'm not sure the dll would support a path like this.
Second, I found CreateFileMapping and GetMappedFileName. Can they be used to simulate a file in a fragment of another ? I'm not sure this is what this API does.
What I'm trying to do seems similar to boxedapp. Any ideas about how they do it ? I suppose it's something like API interception (Like Detour ), but that would be a lot of work. Is there another way to do it ?
Why ? I'm interested in this specific solution because I'd like to hide the data and for the benefit of distributing only one file but also for geeky reasons of making it works that way ;)
I agree that copying data to a temporary file would work and be a much easier solution.

Use BoxedApp and do not worry.

You can store the data in an NTFS stream. That way you can get a real path pointing to your data that you can give to your dll in the form of
x:\myfile.exe:mystreamname
This works precisely like a normal file, however it only works if the file system used is NTFS. This is standard under Windows nowadays, but is of course not an option if you want to support older systems or would like to be able to run this from a usb-stick or similar. Note that any streams present in a file will be lost if the file is sent as an attachment in mail or simply copied from a NTFS partition to a FAT32 partition.
I'd say that the most compatible way would be to write your data to an actual file, but you can of course do it one way on NTFS systems and another on FAT systems. I do recommend against it because of the added complexity. The appropriate way would be to distribute your files separately of course, but since you've indicated that you don't want this, you should in that case write it to a temporary file and give the dll the path to that file. Make sure you write the temporary file to the users' temp directory (you can find the path using GetTempPath in C/C++).
Your other option would be to write a filesystem filter driver, but that is a road that I strongly advise against. That sort of defeats the purpose of using a single file as well...
Also, in case you want only a single file for distribution, how about using a zip file or an installer?

Pipes are for communication between processes running concurrently. They don't store data for later access, and they don't have the same semantics as files (you can't seek or rewind a pipe, for instance).
If you're after file-like behaviour, your best bet will always be to use a file. Under Windows, you can pass FILE_ATTRIBUTE_TEMPORARY to CreateFile as a hint to the system to avoid flushing data to disk if there's sufficient memory.
If you're worried about the performance hit of writing to disk, the above should be sufficient to avoid the performance impact in most cases. (If the system is low enough on memory to force the file data out to disk, it's probably also swapping heavily anyway -- you've already got a performance problem.)
If you're trying to avoid writing to disk for some other reason, can you explain why? In general, it's quite hard to stop data from ever hitting the disk -- the user can always hibernate the machine, for instance.

Since you don't have control over the DLL you have to assume that the DLL expects an actual file. It probably at some point makes that assumption which is why named pipes are failing on you.
The simplest solution is to create a temporary file in the temp directory, write the data from your EXE to the temp file and then delete the temporary file.
Is there a reason you are embedding this "pseudo-file" at the end of your EXE instead of just distributing it with our application? You are obviously already distributing this third party DLL with your application so one more file doesn't seem like it is going to hurt you?
Another question, will this data be changing? That is are you expecting to write back data this "pseudo-file" in your EXE? I don't think that will work well. Standard users may not have write access to the EXE and that would probably drive anti-virus nuts.
And no CreateFileMapping and GetMappedFileName definitely won't work since they don't give you a file name that can be passed to CreateFile. If you could somehow get this DLL to accept a HANDLE then that would work.
And I wouldn't even bother with API interception. Just hand the DLL a path to an acutal file.

Reading your question made me think: if you can pretend an area of memory is a file and have kind of "virtual path" to it, then this would allow loading a DLL directly from memory which is what LoadLibrary forbids by design by asking for a path name. And this is why people write their own PE loader when they want to achieve that.
I would say you can't achieve what you want with file mapping: the purpose of file mapping is to treat a portion of a file as if it was physical memory, and you're wanting the reciprocal.
Using Detours implies that you would have to replicate everything the intercepted DLL function does except from obtaining data from a real file; hence it's not generic. Or, even more intricate, let's pretend the DLL uses fopen; then you provide your own fopen that detects a special pattern in the path and you mimmic the C runtime internals... Hmm is it really worth all the pain? :D

Please explain why you can't extract the data from your EXE and write it to a temporary file. Many applications do this -- it's the classic solution to this problem.
If you really must provide a "virtual file", the cleanest solution is probably a filesystem filter driver. "clean" doesn't mean "good" -- a filter is a fully documented and supported solution, so it's cleaner than API hooking, injection, etc. However, filesystem filters are not easy.
OSR Online is the best place to find Windows filesystem information. The NTFSD mailing list is where filesystem developers hang out.

How about using a some sort of RamDisk and writing the file to this disk? I have tried some ramdisks myself, though never found a good one, tell me if you are successful.

Well, if you need to have the virtual file allocated in your exe, you will need to create a vector, stream or char array big enough to hold all of the virtual data you want to write.
that is the only solution I can think of without doing any I/O to disk (even if you don't write to file).
If you need to keep a file like path syntax, just write a class that mimics that behaviour and instead of writing to a file write to your memory buffer. It's as simple as it gets. Remember KISS.
Cheers

Open the file called "NUL:" for writing. It's writable, but the data are silently discarded. Kinda like /dev/null of *nix fame.
You cannot memory-map it though. Memory-mapping implies read/write access, and NUL is write-only.

I'm guessing that this dll cant take a stream? Its almost to simple to ask BUT if it can you could just use that.

Have you tried using the \?\ prefix when using named pipes? Many APIs support using \?\ to pass the remainder of the path directly through without any parsing/modification.
http://msdn.microsoft.com/en-us/library/aa365247(VS.85,lightweight).aspx

Why not just add it as a resource - http://msdn.microsoft.com/en-us/library/7k989cfy(VS.80).aspx - the same way you would add an icon.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js