Reading a 200 mb json file takes 1.5 gb memory

Reading a 200 mb json file takes 1.5 gb memory - c++

I'm using the json_spirit library in C++ to parse a 200 mb json file. What surprises me is that when read into memory in my program, 1.5 gb of my RAM gets used. Is this something that is expected when deserializing json?
Here is how I'm loading in the json file:
std::ifstream istream(path.c_str());
json_spirit::mValue val;
json_spirit::read(istream, val);

You may try rapidjson.
It is optimized for both memory usage and performance.
By using insitu-parsing option (i.e. it changes the source string of parsing), it only incurs 16 bytes per JSON value to store the DOM in 32-bit architecture. The string values will use pointers pointed to the modified source string.
I expect the memory usage will be much less.
On the other hand, rapidjson also support SAX-style parsing. If the application just need to traverse the JSON file from begin to end (e.g. doing some statistics), then SAX-style API will be even faster and very few memory consumption (program stack + maximum length of string value).

I think, this is not JSON dependent. It's more a question of data structure overhead. If you have many small objects the administrative part becomes more and more relevant.
Although, more than seven times overhead, seems really excessive.

Related

Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available

I'm trying to read the data contained in a .dat file with size ~1.1GB.
Because I'm doing this on a 16GB RAM machine, I though it would have not be a problem to read the whole file into memory at once, to only after process it.
To do this, I employed the slurp function found in this SO answer.
The problem is that the code sometimes, but not always, throws a bad_alloc exception.
Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.
Here is the code that reproduces this error
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
using namespace std;
int main()
{
ifstream file;
file.open("big_file.dat");
if(!file.is_open())
cerr << "The file was not found\n";
stringstream sstr;
sstr << file.rdbuf();
string text = sstr.str();
cout << "Successfully read file!\n";
return 0;
}
What could be causing this problem?
And what are the best practices to avoid it?

The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enought swap is available, or it might fail on a HPC node with 128GB of RAM – it's totally up to your Operating System to decide how much memory is available to you, here.
I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the allocated internal buffer becomes too small to contain the incoming bytes. Also, libc++/libc will probably also have their own allocators that will have some allocation overhead, here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file, and might work with it as if it was a plain array in memory, letting your OS take care of handling the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....
I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering on this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
Depends; I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables; possibly taking advantage of your OS'es pre-fetching SMP capabilities in that you don't even get that much of a speed advantage if you'd spawn separate threads for file reading and float conversion. std::copy, used to read from std::ifstream to a std::vector<float> should work fine, here.
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't this mean to read the file line-by-line, resulting in a large number of file IO operations?
Yes, and no: First, of course, you will have more context switches then you'd have if you just ordered for the whole to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libc's know quite well how to optimize reads, and thus will fetch a whole lot of file at once if you don't use extremely randomized read lengths. Also, you don't infer the penalty of trying to allocate a block of RAM at least 1.1GB in size -- that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float will still mean less of a performance hit, because most of the time, your read will pretty much immediately return, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: Performance seems to be important to you (is it really that important, here? You're using a brain-dead file format for interchanging floats, which is both bloaty, loses information, and on top of that is slow to parse), but you'd rather first read the whole file in at once and then start converting it to numbers. Frankly, if performance was of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilo- to Megabytes to be read up to \n boundaries and exchanged with a thread that creates the in-memory table of floats sounds like it would basically reduce your read+parse time down to read+non-measurable without sacrificing read performance, and without the need for Gigabytes of RAM just to parse a sequential file.
By the way, to give you an impression of how bad storing floats in ASCII is:
The typical 32bit single-precision IEEE753 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one ., typically one exponential divider, e.g. E, and on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE754 32bit floats:
-1.23456E-10
That's an average of 11 characters.
Add one , or \n after every number.
Now, your character is 1B, meaning that you blow up your 4B of actual data by a factor of 3, still losing precision.
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.

In a 32 bit executable, total memory address space is 4gb. Of that, sometimes 1-2 gb is reserved for system use.
To allocate 1 GB, you need 1 GB of contiguous space. To copy it, you need 2 1 GB blocks. This can easily fail, unpredictably.
There are two approaches. First, switch to a 64 bit executable. This will not run on a 32 bit system.
Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting it and or streaming it starts making a lot of sense. Done right you'll also be able to start to process it prior to finishing reading it.
There are many file io datastructures, from stxxl to boost, or you can roll your own.

The size of the heap (a pool of memory used for dynamic allocations) is limited independently on the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations which will probably force you to change the way you read from the file.
If you are running on UNIX based system you can check the function vmalloc or the VirtualAlloc function if you are running on Windows platform.

Interpreting memory consumption of program using pugixml

I have a program which parses an XML file of ~50MB and extracts the data to an internal object structure with no links to the original XML file. When I try to roughly estimate how much memory I need, I reckon 40MB.
But my program needs something like 350MB, and I try to find out what happens. I use boost::shared_ptr, so I'm not dealing with raw pointers and hopefully I didn't produce memory leaks.
I try to write what I did, and I hope someone might point out problems in my process, wrong assumptions and so on.
First, how did I measure? I used htop to find out that my memory is full and processes using my piece of code are using the most of it. To sum up memory of different threads and to get a more pretty output, I used http://www.pixelbeat.org/scripts/ps_mem.py which confirmed my observation.
I roughly estimated the theoretical consumption to get an idea which factor lies between the consumption and what it should be at least. It's 10. So I used valgrind --tool=massif to analyze memory consumption. It shows, that at the peak of 350MB 250MB are used by something called xml_allocator which stems from the pugixml library. I went to the section of my code where I instantiate the pugi::xml_document and put an std::cout into the destructor of the object to confirm that it is released which happens pretty early in my program (at the end I sleep for 20s to have enough time to measure memory consumption, which stays 350MB even after the console output from the destructor appears).
Now I have no idea how to interpret that and hope that someone can help me where I make wrong assumptions or some such.
The outermost code snippet using pugixml is similar to:
void parse( std::string filename, my_data_structure& struc )
{
pugi::xml_document doc;
pugi::xml_parse_result result = doc.load_file(filename.c_str());
for (pugi::xml_node node = doc.child("foo").child("bar"); node; node = node.next_sibling("bar"))
{
struc.hams.push_back( node.attribute("ham").value() );
}
}
And since in my code I don't store pugixml elements somewhere (only actual values pulled out of it), I would doc expect to release all resources when the function parse is left, but looking on the graph, I cannot tell where (on the time axis) this happens.

Your assumptions are incorrect.
Here's how to estimate pugixml memory consumption:
When you load the document, the entire text of the document gets loaded to memory. So that's 50 Mb for your file. This comes as 1 allocation from xml_document::load_file -> load_file_impl
In addition to that, there's the DOM structure that contains links to other nodes, etc. The size of the node is 32 bytes, the size of the attribute is 20 bytes; that's for 32-bit processes, multiply by 2 for 64-bit processes. This comes as many allocations (each allocation is roughly 32kb) from xml_allocator.
Depending on the density of nodes/attributes in your document, memory consumption can range from, say, 110% of the document size (i.e. 50 Mb -> 55 Mb) to, say, 600% (i.e. 50 Mb -> 300 Mb).
When you destroy pugixml document (xml_document dtor gets called), the data is freed - however, depending on the way OS heap behaves, you may not see it returned to the system immediately - it may stay in process heap. To verify that you can try doing the parsing again and checking that peak memory is the same after the second parse.

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records. Total number of blocks in file n=x/m. If I know the size of one record, say b bytes (size of one block = b*m), I can read the complete block at once using system command read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector.
The reason why I want to do this in the first place is to reduce the disk i/o operations. As the disk i/o operations are much more expensive according to what I have learned.
Or will it take the same amount of time as when I read record by record from file and directly put it into vectors instead of reading block by block? On reading block by block, I will have only n disk I/O's whereas x I/O's if I read record by record.
Thanks.

You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that you can treat file contents as simply mapped into your process space as if you already had a pointer into the file contents. By simply inspecting memory contents and treating it as an array, or by copying data using memcpy() you will implicitly perform read operations, but only as necessary - operating system virtual memory subsystem is smart enough to do it very efficiently.
The only possible reason to avoid mmap maybe if you are running on 32-bit OS and file size exceeds 2 gigabytes (or slightly less than that). In this case OS may have trouble allocating address space to your mmap-ed memory. But on 64-bit OS using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data, and size of the data is not known upfront. Other than that, it is always better and faster to use it over the read.
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.

Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).

limit on string size in c++?

I have like a million records each of about 30 characters coming in over a socket. Can I read all of it into a single string? Is there a limit on the string size I can allocate?
If so, is there someway I can send data over the socket records by record and receive it record by record. I dont know the size of each record until runtime.

To answer your first question: The maximum size of a C++ string is given by string::max_size

std::string::max_size() will tell you the theoretical limit imposed by the architecture your program is running under. Other than that, as long as you have sufficient RAM and/or disk swap space, you can have std::strings of huge size.
The answer to your second question is yes, you can send record by record, moreover you might not be able to send big chunks of data over a socket at once - there are limits on the size of a single send operation. That the size of a single string is not known until runtime is not a problem, it doesn't need to be known at compile time for sending them over a socket. How to actually send those strings record by record depends on what socket/networking library you are using; consult the relevant documentation.

There is no official limit on the size of a string. The software will ask your system for memory and, as long as it gets it, it will be able to add characters to your string.
The rest of your question is not clear.

The only practical limit on string size in c++ is your available memory. That being said, it will be expensive to reallocate your string to the right size as you keep receiving data (assuming you do not know its total size in advance). Normally you would read chunks of the data into a fixed-size buffer and decode it into its naturally shape (your records) as you get it.

The size of a string is only limited by the amount of memory available to the program, it is more of a operating system limitation than a C++ limitation. C++/C strings are null terminated so the string routines will happily process extremely long strings until they find a null.
On Win32 the maximum amount of memory available for data is normally around 2 Gigs.
You can read arbitrarily large amounts of data from a socket, but you must have some way of delimiting the data that you're reading. There must be an end of record marker or length associated with the records that you are reading so use that to parse the records. Do you really want read the data into a string? What happens if your don't have enough free memory to hold the data in RAM? I suspect there is a more efficient way to handle this data, but I don't know enough about the problem.

In theory, no. But don't go allocating 100GB of memory, because the user will probably not have that much RAM. If you are using std::strings then the max size is std::string::npos.

If we are talking about char* You are limited with smth about 2^32 on 32-bit systems and with 2^64 on (surprise) 64-bit ones
Update: This is wrong. See comments

How about send them with different format?
in your server:
send(strlen(szRecordServer));
send(szRecordServer);
in you client:
recv(cbRecord);
alloc(szRecordClient);
recv(szRecordClient);
and repeat this million times.

Which is faster in memory, ints or chars? And file-mapping or chunk reading?

Okay, so I've written a (rather unoptimized) program before to encode images to JPEGs, however, now I am working with MPEG-2 transport streams and the H.264 encoded video within them. Before I dive into programming all of this, I am curious what the fastest way to deal with the actual file is.
Currently I am file-mapping the .mts file into memory to work on it, although I am not sure if it would be faster to (for example) read 100 MB of the file into memory in chunks and deal with it that way.
These files require a lot of bit-shifting and such to read flags, so I am wondering that when I reference some of the memory if it is faster to read 4 bytes at once as an integer or 1 byte as a character. I thought I read somewhere that x86 processors are optimized to a 4-byte granularity, but I'm not sure if this is true...
Thanks!

Memory mapped files are usually the fastest operations available if you require your file to be available synchronously. (There are some asynchronous APIs that allow the O/S to reorder things for a slight speed increase sometimes, but that sounds like it's not helpful in your application)
The main advantage you're getting with the mapped files is that you can work in memory on the file while it is still being read from disk by the O/S, and you don't have to manage your own locking/threaded file reading code.
Memory reference wise, on the x86 memory is going to be read an entire line at a time no matter what you're actually working with. The extra time associated with non byte granular operations refers to the fact that integers need not be byte aligned. For example, performing an ADD will take more time if things aren't aligned on a 4 byte boundary, but for something like a memory copy there will be little difference. If you are working with inherently character data then it's going to be faster to keep it that way than to read everything as integers and bit shift things around.
If you're doing h.264 or MPEG2 encoding the bottleneck is probably going to be CPU time rather than disk i/o in any case.

If you have to access the whole file, it is always faster to read it to memory and do the processing there. Of course, it's also wasting memory, and you have to lock the file somehow so you won't get concurrent access by some other application, but optimization is about compromises anyway. Memory mapping is faster if you're skipping (large) parts of the file, because you don't have to read them at all then.
Yes, accessing memory at 4-byte (or even 8-byte) granularity is faster than accessing it byte-wise. Again it's a compromise - depending on what you have to do with the data afterwards, and how skilled you are at fiddling with the bits in an int, it might not be faster overall.
As for everything regarding optimization:
measure
optimize
measure

These are sequential bit-streams - you basically consume them one bit at a time without random-access.
You don't need to put a lot of effort into explicitly buffering reads and such in this scenario: the operating system will be buffering them for you anyway. I've written H.264 parsers before, and the time is completely dominated by the decoding and manipulation, not the IO.
My recommendation is to use a standard library and for parsing these bit-streams.
Flavor is such a parser, and the website even includes examples of MPEG-2 (PS) and various H.264 parts like M-Coder. Flavor builds native parsing code from a c++-like language; here's an quote from the MPEG-2 PS spec:
class TargetBackgroundGridDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 7
{
unsigned int(14) horizontal_size;
unsigned int(14) vertical_size;
unsigned int(4) aspect_ratio_information;
}
class VideoWindowDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 8
{
unsigned int(14) horizontal_offset;
unsigned int(14) vertical_offset;
unsigned int(4) window_priority;
}

Regarding to the best size to read from memory, I'm sure you will enjoy reading this post about memory access performance and cache effects.

One thing to consider about memory-mapping files is that a file with a size greater than the available address range will only be able to be map a portion of the file. To access the remainder of the file requires the first part to be unmapped and the next part to mapped in its place.
Since you're decoding mpeg streams you may want to use a double buffered approach with asynchronous file reading. It works like this:
blocksize = 65536 bytes (or whatever)
currentblock = new byte [blocksize]
nextblock = new byte [blocksize]
read currentblock
while processing
asynchronously read nextblock
parse currentblock
wait for asynchronous read to complete
swap nextblock and currentblock
endwhile

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js