JSON parser that can handle large input (2 GB)? - c++

So far, I've tried (without success):
QJsonDocument – "document too large" (looks like the max size is artificially capped at 1 << 27 bytes)
Boost.PropertyTree – takes up 30 GB RAM and then segfaults
libjson – takes up a few gigs of RAM and then segfaults
I'm gonna try yajl next, but Json.NET handles this without any issues so I'm not sure why it should be such a big problem in C++.

Check out https://github.com/YasserAsmi/jvar. I have tested it with a large database (SF street data or something, which was around 2GB). It was quite fast.

Well, I'm not proud of my solution, but I ended up using some regex to split my data up into top-level key-value pairs (each one being only a few MB), then just parsed each one of those pairs with Qt's JSON parser and passed them into my original code.
Yajl would have been exactly what I needed for something like this, but I went with the ugly regex hack because:
Fitting my logic into Yajl's callback structure would have involved rewriting enough of my code to be a pain, and this is just for a one-off MapReduce job so the code itself doesn't matter long-term anyway.
The data set is controlled by me and guaranteed to always work with my regex.
For various reasons, adding dependencies to Elastic MapReduce deployments is a bigger hassle than it should be (and static Qt compilation is buggy), so for the sake of not doing more work than necessary I'm inclined to keep dependencies to a minimum.
This still works and performs well (both time-wise and memory-wise).
Note that the regex I used happens to work for my data specifically because the top-level keys (and only the top level keys) are integers; my code below is not a general solution, and I wouldn't ever advise a similar approach over a SAX-style parser where reasons #1 and #2 above don't apply.
Also note that this solution is extra gross (splitting and manipulating JSON strings before parsing + special cases for the start and end of the data) because my original expression that captured the entire key-value pairs broke down when one of the pairs happened to exceed PCRE's backtracking limit (it's incredibly annoying in this case that that's even a thing, especially since it's not configurable through either QRegularExpression or grep).
Anyway, here's the code; I am deeply ashamed:
QFile file( argv[1] );
file.open( QIODevice::ReadOnly );
QTextStream textStream( &file );
QString jsonKey;
QString jsonString;
QRegularExpression jsonRegex( "\"-?\\d+\":" );
// LEFT_BRACE and RIGHT_BRACE are simply the "{" and "}" strings used to re-wrap
// each top-level pair as a standalone JSON object.
bool atEnd = false;
while( atEnd == false )
{
    QString regexMatch = jsonRegex.match(
        jsonString.append( textStream.read(1000000) )
    ).captured();
    bool isRegexMatched = regexMatch.isEmpty() == false;
    if( isRegexMatched == false )
    {
        atEnd = textStream.atEnd();
    }
    if( atEnd || (jsonKey.isEmpty() == false && isRegexMatched) )
    {
        QString jsonObjectString;
        if( atEnd == false )
        {
            QStringList regexMatchSplit = jsonString.split( regexMatch );
            jsonObjectString = regexMatchSplit[0]
                .prepend( jsonKey )
                .prepend( LEFT_BRACE )
                ;
            jsonObjectString = jsonObjectString
                .left( jsonObjectString.size() - 1 )
                .append( RIGHT_BRACE )
                ;
            jsonKey = regexMatch;
            jsonString = regexMatchSplit[1];
        }
        else
        {
            jsonObjectString = jsonString
                .prepend( jsonKey )
                .prepend( LEFT_BRACE )
                ;
        }
        QJsonObject jsonObject = QJsonDocument::fromJson(
            jsonObjectString.toUtf8()
        ).object();
        QString key = jsonObject.keys()[0];
        // ... process data and store in boost::interprocess::map ...
    }
    else if( isRegexMatched )
    {
        jsonKey = regexMatch;
        jsonString = jsonString.split( regexMatch )[1];
    }
}

I've recently finished (probably still a bit beta) such a library:
https://github.com/matiu2/json--11
If you use the json_class, it'll load it all into memory, which is probably not what you want.
But you can parse it sequentially by writing your own 'mapper'.
The included mapper iterates through the JSON, mapping the input to JSON classes:
https://github.com/matiu2/json--11/blob/master/src/mapper.hpp
You could write your own that does whatever you want with the data, and feed a file stream into it, so as not to load the whole lot into memory.
So, as an example to get you started, this just outputs the JSON data in some arbitrary format, but doesn't fill up the memory (completely untested and never compiled):
#include "parser.hpp"
#include <fstream>
#include <iterator>
#include <string>
int main(int argc, char **) {
std::ifstream file("hugeJSONFile.hpp");
std::istream_iterator<char> input(file);
auto parser = json::Parser(input);
using Parser = decltype(parser);
using std::cout;
using std::endl;
switch (parser.getNextType()) {
case Parser::null:
parser.readNull();
cout << "NULL" << endl;
return;
case Parser::boolean:
bool val = parser.readBoolean();
cout << "Bool: " << val << endl;
case Parser::array:
parser.consumeOneValue();
cout << "Array: ..." << endl;
case Parser::object:
parser.consumeOneValue();
cout << "Map: ..." << endl;
case Parser::number: {
double val = parser.readNumber<double>();
cout << "number: " << val << endl;
}
case Parser::string: {
std::string val = parser.readString();
cout << "string: " << val << endl;
}
case Parser::HIT_END:
case Parser::ERROR:
default:
// Should never get here
throw std::logic_error("Unexpected error while parsing JSON");
}
return 0;
}
Addendum
Originally I had planned for this library to never copy any data; e.g. reading a string would just give you start and end iterators into the string data in the input, but because we actually need to decode the strings, I found that methodology too impractical.
This library automatically converts \uXXXX escape codes in JSON to UTF-8 encoding in standard strings.

When dealing with records, you can for example format your JSON so that a newline acts as the separator between objects, then parse each line separately (a sketch follows the examples below), e.g.:
"records": [
{ "someprop": "value", "someobj": { ..... } ... },
.
.
.
or:
"myobj": {
"someprop": { "someobj": {}, ... },
.
.
.

I just faced the same problem with Qt 5.12's JSON support. Fortunately, starting with Qt 5.15 (64-bit), reading large JSON files works flawlessly; I tested 1 GB files.
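For completeness, the straightforward whole-document load that this refers to looks roughly like the following (the file name is a placeholder; this is only viable once the old 1 << 27 byte cap mentioned in the question no longer applies):
#include <QFile>
#include <QJsonDocument>
int main()
{
    QFile file( "huge.json" );                       // placeholder path
    if( !file.open( QIODevice::ReadOnly ) )
        return 1;
    QJsonParseError error;
    const QJsonDocument doc = QJsonDocument::fromJson( file.readAll(), &error );
    if( error.error != QJsonParseError::NoError )
        return 1;
    // doc.object() / doc.array() now hold the entire document in memory
    return 0;
}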

Related

Fixing wrong JSON data

I'm working with JSON data that I download from the web. The problem with this JSON is that its contents are incorrect. To show the problem, here is a simplified preview:
[
{
"id": 0,
"name": "adsad"
},
{
"id": "123",
"name": "aawew"
}
]
So there is an array of these items, where in some the value for "id" is a string and in others it is an integer. This is the data that I get, and I can't make the source fix it.
The solution I came up with was to fix this data before deserializing it, and here is my naive algorithm, where Defaults::intTypes() is a vector of all keys that should be integers but are sometimes strings:
void fixJSONData(QString& data) {
    qDebug() << "Fixing JSON data ( thread: " << QThread::currentThreadId() << ")";
    QElapsedTimer timer;
    timer.start();
    for (int i = 0; i < data.size(); ++i) {
        for (const auto& key : Defaults::intTypes()) {
            if (data.mid(i, key.size() + 3) == "\"" + key + "\":") {
                int newLine = i + key.size() + 3;
                while (data[newLine] != ',' && data[newLine] != '}') {
                    if (data[newLine] == '"') {
                        data.remove(newLine, 1);
                    } else {
                        ++newLine;
                    }
                }
                i = newLine;
                break;
            }
        }
    }
    qDebug() << "Fixing done in " << timer.elapsed() << " ms.";
}
Well, it does fix the problem, but the algorithm is far too slow: it went through 4.5 million characters in 390 seconds. How could this be done faster?
P.S.: for JSON serialization I use nlohmann::json library.
Edit: After reading a bit deeper into the JSON rules, it looks like the example above is a perfectly valid JSON file. Could this be an issue related to C++ being strongly typed, so that it can't deserialize an array of heterogeneous elements into C++ classes?
Edit2: What I would like to create from that json string is QVector<Model> where:
class Model {
    unsigned id;
    QString name;
};
Although there must be several ways to improve this conversion, maybe there is a much more effective solution.
Most JSON libraries allow the end user to define a custom serializer/deserializer for an object. If you create a custom deserializer, it can parse the original data as-is, and you don't have to modify the stream or the files.
It's not only faster but also more elegant.
(If the given JSON library doesn't support custom deserialization, I would consider choosing another one.)
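As a rough illustration with nlohmann::json (which the question already uses), a custom from_json overload can accept the id field whether it arrives as a number or as a string. The Model type below mirrors the one in the question, with std::string standing in for QString only to keep the sketch self-contained:
#include <nlohmann/json.hpp>
#include <string>
#include <vector>
struct Model {
    unsigned id = 0;
    std::string name;
};
// Custom deserializer: nlohmann::json finds this via ADL.
void from_json(const nlohmann::json& j, Model& m) {
    const auto& id = j.at("id");
    // Accept both 123 and "123" without touching the input text.
    m.id = id.is_string() ? static_cast<unsigned>(std::stoul(id.get<std::string>()))
                          : id.get<unsigned>();
    m.name = j.at("name").get<std::string>();
}
int main() {
    const auto j = nlohmann::json::parse(R"([
        { "id": 0,     "name": "adsad" },
        { "id": "123", "name": "aawew" }
    ])");
    std::vector<Model> models = j.get<std::vector<Model>>();
    return models.size() == 2 ? 0 : 1;
}
In Qt code the same idea applies: convert the mixed-type field after parsing instead of rewriting the raw text beforehand.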

C++ boost/regex regex_search

Consider the following string content:
string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}";
I have never used regex_search and I have been searching around for ways to use it - I still do not quite get it. From that random string (it's from an API) how could I grab two things:
1) the price - in this example it is 47.1000
2) the name - in this example Fantastic gloves
From what I have read, regex_search would be the best approach here. I plan on using the price as an integer value, so I will use regex_replace to remove the "." from the string before converting it. I have only used regex_replace before, and I found it easy to work with; I don't know why I am struggling so much with regex_search.
Keynotes:
Content is contained inside ' '
Content id and value are separated by :
Content/value pairs are separated by ,
The values of the name and price ids will vary.
My first thought was to locate, for instance, price and then move 3 characters ahead (past ':') and gather everything until the next ' - however, I am not sure if I am completely off-track here or not.
Any help is appreciated.
boost::regex would not be needed. Regular expressions are used for more general pattern matching, whereas your example is very specific. One way to handle your problem is to break the string up into individual tokens. Here is an example using boost::tokenizer:
#include <iostream>
#include <string>
#include <map>
#include <boost/tokenizer.hpp>
int main()
{
    std::map<std::string, std::string> m;
    std::string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}";
    boost::char_separator<char> sep("{},':");
    boost::tokenizer<boost::char_separator<char>> tokenizer(content, sep);
    std::string id;
    for (auto tok = tokenizer.begin(); tok != tokenizer.end(); ++tok)
    {
        // Since "current" is a special case I added code to handle that
        if (*tok != "current")
        {
            id = *tok++;
            m[id] = *tok;
        }
        else
        {
            id = *++tok;
            m[id] = *++tok; // trend
            id = *++tok;
            m[id] = *++tok; // price
        }
    }
    std::cout << "Name: " << m["name"] << std::endl;
    std::cout << "Price: " << m["price"] << std::endl;
}
Link to live code.
As the string you are attempting to parse appears to be JSON (JavaScript Object Notation), consider using a specialized JSON parser.
You can find a comprehensive list of JSON parsers in many languages including C++ at http://json.org/. Also, I found a discussion on the merits of several JSON parsers for C++ in response to this SO question.
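As a rough sketch of that suggestion, here is one way to do it with nlohmann::json (one possible parser among many). The single quotes are naively turned into double quotes first, which is only safe because this particular string contains no apostrophes inside its values, and the closing brace missing from the sample string is added back:
#include <nlohmann/json.hpp>
#include <algorithm>
#include <iostream>
#include <string>
int main()
{
    // The API string from the question, with the missing closing brace appended.
    std::string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}}";
    // Naive fix-up: none of the values contain apostrophes, so a blanket replace is safe here.
    std::replace(content.begin(), content.end(), '\'', '"');
    const auto j = nlohmann::json::parse(content);
    const std::string name  = j["name"].get<std::string>();
    const std::string price = j["current"]["price"].get<std::string>();
    std::cout << "Name: "  << name  << '\n'
              << "Price: " << price << '\n';        // "47.1000"; convert with std::stod if needed
    return 0;
}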

How to getline() from specific line in a file? C++

I've looked around a bit and have found no definitive answer on how to read a specific line of text from a file in C++. I have a text file with over 100,000 English words, each on its own line. I can't use arrays because they obviously won't hold that much data, and vectors take too long to store every word. How can I achieve this?
P.S. I found no duplicates of this question regarding C++
while (getline(words_file, word))
{
    my_vect.push_back(word);
}
EDIT:
A commenter below has helped me realize that the only reason loading the file into a vector was taking so long was because I was debugging. Simply running the .exe loads the file nearly instantaneously. Thanks for everyone's help.
If your words have no white-space (I assume they don't), you can use a more tricky non-getline solution using a deque!
#include <algorithm>
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
using namespace std;
int main() {
    deque<string> dictionary;
    cout << "Loading file..." << endl;
    ifstream myfile("dict.txt");
    if (myfile.is_open()) {
        copy(istream_iterator<string>(myfile),
             istream_iterator<string>(),
             back_inserter(dictionary));
        myfile.close();
    } else {
        cout << "Unable to open file." << endl;
    }
    return 0;
}
The above streams the file word by word, splitting on whitespace (the std::istream_iterator default - a big assumption on my part), which makes it slightly faster. It gets done in about 2-3 seconds with 100,000 words. I'm also using a deque, which is the best data structure (imo) for this particular scenario. When I use vectors, it takes around 20 seconds (not even close to your minute mark - you must be doing something else that increases complexity).
To access the word at line 1:
cout << dictionary[0] << endl;
Hope this has been useful.
You have a few options, but none will automatically let you go to a specific line. File systems don't track line numbers within files.
One way is to have fixed-width lines in the file. Then read the appropriate amount of data based upon the line number you want and the number of bytes per line.
Another way is to loop, reading lines one a time until you get to the line that you want.
A third way would be to have a sort of index that you create at the beginning of the file to reference the location of each line. This, of course, would require that you control the file format.
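As a rough sketch of the first approach (fixed-width lines), assuming a hypothetical record width of 32 bytes including the newline:
#include <fstream>
#include <iostream>
#include <string>
int main()
{
    const std::size_t lineWidth  = 32;             // hypothetical fixed record width, '\n' included
    const std::size_t wantedLine = 99999;          // zero-based line number to fetch
    std::ifstream file("words_fixed_width.txt", std::ios::binary);   // placeholder path
    if (!file)
        return 1;
    // Jump straight to the start of the wanted line; no scanning required.
    file.seekg(static_cast<std::streamoff>(wantedLine) * lineWidth);
    std::string line;
    if (std::getline(file, line))
        std::cout << line << '\n';                 // padding, if any, is still attached
    return 0;
}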
I already mentioned this in a comment, but I wanted to give it a bit more visibility for anyone else who runs into this issue...
I think that the following code will take a long time to read from the file because std::vector probably has to re-allocate its internal memory several times to account for all of these elements that you are adding. This is an implementation detail, but if I understand correctly std::vector usually starts out small and increases its memory as necessary to accommodate new elements. This works fine when you're adding a handful of elements at a time, but is really inefficient when you're adding a thousand elements at once.
while (getline(words_file, word)) {
    my_vect.push_back(word);
}
So, before running the loop above, try calling my_vect.reserve(100000). This forces std::vector to allocate enough memory in advance so that it doesn't need to shuffle things around later. (Constructing the vector with a size, as in my_vect(100000), would instead create 100,000 empty strings, and the loop would append after them.)
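A minimal sketch of that change (the file name and the count of 100,000 are placeholders):
#include <fstream>
#include <string>
#include <vector>
int main()
{
    std::ifstream words_file("words.txt");         // placeholder path
    std::vector<std::string> my_vect;
    my_vect.reserve(100000);                       // one up-front allocation instead of many regrowths
    std::string word;
    while (std::getline(words_file, word))
        my_vect.push_back(word);
    return 0;
}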
The question is exceedingly unclear. How do you determine the specific line? If it is the nth line, the simplest solution is just to call getline n times, throwing out all but the last result; calling ignore n-1 times might be slightly faster, but I suspect that if you're always reading into the same string (rather than constructing a new one each time), the difference in time won't be enormous. If you have some other criterion, and the file is really big (which from your description it isn't) and sorted, you might try a binary search: seek to the middle of the file, read enough ahead to find the start of the next line, then decide the next step according to its value. (I've used this to find relevant entries in log files. But we're talking about files which are several gigabytes in size.)
If you're willing to use system-dependent code, it might be advantageous to memory-map the file, then search for the nth occurrence of '\n' (std::find n times).
ADDED: Just some quick benchmarks. On my Linux box, getting the 100,000th word from /usr/share/dict/words (479,623 words, one per line, on my machine) takes about:
272 milliseconds reading all words into an std::vector, then indexing
256 milliseconds doing the same, but with std::deque
30 milliseconds using getline, but just ignoring the results until the one I'm interested in
20 milliseconds using istream::ignore
6 milliseconds using mmap and looping on std::find
FWIW, the code in each case is:
For the std:: containers:
// Line and Gabi::ProgramManagement are helpers from my own test harness.
template<typename Container>
void Using<Container>::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    Container().swap( m_words );
    std::copy( std::istream_iterator<Line>( input ),
               std::istream_iterator<Line>(),
               std::back_inserter( m_words ) );
    if ( static_cast<int>( m_words.size() ) < m_target )
        Gabi::ProgramManagement::fatal()
            << "Not enough words, had " << m_words.size()
            << ", wanted at least " << m_target;
    m_result = m_words[ m_target ];
}
For getline without saving:
void UsingReadAndIgnore::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    std::string dummy;
    for ( int count = m_target; count > 0; -- count )
        std::getline( input, dummy );
    std::getline( input, m_result );
}
For ignore:
void UsingIgnore::operator()()
{
    std::ifstream input( m_filename.c_str() );
    if ( !input )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    for ( int count = m_target; count > 0; -- count )
        input.ignore( INT_MAX, '\n' );
    std::getline( input, m_result );
}
And for mmap:
void UsingMMap::operator()()
{
    int input = ::open( m_filename.c_str(), O_RDONLY );
    if ( input < 0 )
        Gabi::ProgramManagement::fatal() << "Could not open " << m_filename;
    struct ::stat infos;
    if ( ::fstat( input, &infos ) != 0 )
        Gabi::ProgramManagement::fatal() << "Could not stat " << m_filename;
    char* base = (char*)::mmap( NULL, infos.st_size, PROT_READ, MAP_PRIVATE, input, 0 );
    if ( base == MAP_FAILED )
        Gabi::ProgramManagement::fatal() << "Could not mmap " << m_filename;
    char const* end = base + infos.st_size;
    char const* curr = base;
    char const* next = std::find( curr, end, '\n' );
    for ( int count = m_target; count > 0 && curr != end; -- count ) {
        curr = next + 1;
        next = std::find( curr, end, '\n' );
    }
    m_result = std::string( curr, next );
    ::munmap( base, infos.st_size );
}
In each case, the code is run
You could seek to a specific position, but that requires that you know where the line starts. "A little less than a minute" for 100,000 words does sound slow to me.
Read some data, count the newlines, throw away that data and read some more, count the newlines again... and repeat until you've read enough newlines to hit your target.
Also, as others have suggested, this is not a particularly efficient way of accessing data. You'd be well-served by making an index.
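A rough sketch of that chunked newline count (buffer size and file name are arbitrary choices):
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
int main()
{
    const std::size_t targetLine = 99999;                  // zero-based line to fetch
    std::ifstream file("words.txt", std::ios::binary);     // placeholder path
    if (!file)
        return 1;
    std::vector<char> buf(1 << 16);
    std::size_t newlinesSeen = 0;
    std::streamoff lineStart = 0;                          // byte offset where the wanted line begins
    std::streamoff chunkStart = 0;
    bool found = (targetLine == 0);
    // Read a chunk at a time, counting newlines; stop once we have passed the one
    // that immediately precedes the wanted line.
    while (!found && file.read(buf.data(), buf.size()).gcount() > 0)
    {
        const std::streamsize got = file.gcount();
        for (std::streamsize i = 0; i < got; ++i)
        {
            if (buf[i] == '\n' && ++newlinesSeen == targetLine)
            {
                lineStart = chunkStart + i + 1;            // line starts right after this newline
                found = true;
                break;
            }
        }
        chunkStart += got;
    }
    if (!found)
        return 1;                                          // file has fewer lines than requested
    file.clear();                                          // clear EOF flag if the last read hit it
    file.seekg(lineStart);
    std::string line;
    if (std::getline(file, line))
        std::cout << line << '\n';
    return 0;
}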

Read file and extract certain part only

ifstream toOpen;
openFile.open("sample.html", ios::in);
if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){
            start_pos = line.find("href");
            tempString = line.substr(start_pos+1); // i dont want the quote
            stop_pos = tempString .find("\"");
            string testResult = tempString .substr(start_pos, stop_pos);
            cout << testResult << endl;
        }
    }
    toOpen.close();
}
What I am trying to do is to extract the "href" value, but I can't get it to work.
EDIT:
Thanks to Tony's hint, I used this:
if(line.find("href=") != std::string::npos ){
// Process
}
it works!!
I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.
Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
You might want to look at the answers to this previous question for suggestions about an HTML parser to use.
As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
int main ( int, char ** )
{
    std::ifstream file("sample.html");
    if ( !file.is_open() ) {
        std::cerr << "Failed to open file." << std::endl;
        return (EXIT_FAILURE);
    }
    for ( std::string line; (std::getline(file,line)); )
    {
        // process line.
    }
}
As for the inner part that processes the line, there are several problems.
It doesn't compile. I suppose this is what you meant with "I can't get it to work". When asking a question, this is the kind of information you might want to provide in order to get good help.
There is confusion between variable names temp and tempString etc.
string::find() returns a large positive integer to indicate invalid positions (the size_type is unsigned), so you will almost always enter the if-branch unless a match is found starting at character position 0, which is the one case where you actually do want to enter it.
Here is a simple test content for sample.html.
<html>
<a href="foo.pdf"/>
</html>
Sticking the following inside the loop:
if ((line.find("href=") != std::string::npos) &&
(line.find(".pdf" ) != std::string::npos))
{
const std::size_t start_pos = line.find("href");
std::string temp = line.substr(start_pos+6);
const std::size_t stop_pos = temp.find("\"");
std::string result = temp.substr(0, stop_pos);
std::cout << "'" << result << "'" << std::endl;
}
I actually get the output
'foo.pdf'
However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.

ifstream::unget() fails. Is MS' implementation buggy or is my code erroneous?

Yesterday I discovered an odd bug in rather simple code that basically gets text from an ifstream and tokenizes it. The code that actually fails does a number of get()/peek() calls looking for the token "/*". If the token is found in the stream, unget() is called so the next method sees the stream starting with the token.
Sometimes, seemingly depending only on the length of the file, the unget() call fails. Internally it calls pbackfail(), which then returns EOF. However, after clearing the stream state, I can happily read more characters, so it's not exactly EOF.
After digging in, here's the full code that easily reproduces the problem:
#include <iostream>
#include <fstream>
#include <string>

// generate simplest string possible that triggers problem
void GenerateTestString( std::string& s, const size_t nSpacesToInsert )
{
    s.clear();
    for( size_t i = 0 ; i < nSpacesToInsert ; ++i )
        s += " ";
    s += "/*";
}

// write string to file, then open same file again in ifs
bool WriteTestFileThenOpenIt( const char* sFile, const std::string& s, std::ifstream& ifs )
{
    {
        std::ofstream ofs( sFile );
        if( ( ofs << s ).fail() )
            return false;
    }
    ifs.open( sFile );
    return ifs.good();
}

// find token, unget if found, report error, show extra data can be read even after error
bool Run( std::istream& ifs )
{
    bool bSuccess = true;
    for( ; ; )
    {
        int x = ifs.get();
        if( ifs.fail() )
            break;
        if( x == '/' )
        {
            x = ifs.peek();
            if( x == '*' )
            {
                ifs.unget();
                if( ifs.fail() )
                {
                    std::cout << "oops.. unget() failed" << std::endl;
                    bSuccess = false;
                }
                else
                {
                    x = ifs.get();
                }
            }
        }
    }
    if( !bSuccess )
    {
        ifs.clear();
        std::string sNext;
        ifs >> sNext;
        if( !sNext.empty() )
            std::cout << "remaining data after unget: '" << sNext << "'" << std::endl;
    }
    return bSuccess;
}

int main()
{
    std::string s;
    const char* testFile = "tmp.txt";
    for( size_t i = 0 ; i < 12290 ; ++i )
    {
        GenerateTestString( s, i );
        std::ifstream ifs;
        if( !WriteTestFileThenOpenIt( testFile, s, ifs ) )
        {
            std::cout << "file I/O error, aborting..";
            break;
        }
        if( !Run( ifs ) )
            std::cout << "** failed for string length = " << s.length() << std::endl;
    }
    return 0;
}
The program fails when the string length gets near the typical buffer-size boundaries 4096, 8192, 12288; here's the output:
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 4097
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 8193
oops.. unget() failed
remaining data after unget: '*'
** failed for string length = 12289
This happens when tested on Windows XP and 7, both compiled in debug/release mode, both dynamic/static runtime, both 32-bit and 64-bit systems/compiles, all with VS2008 and default compiler/linker options.
No problem found when testing with gcc4.4.5 on a 64bit Debian system.
Questions:
can other people please test this? I would really appreciate some active collaboration from SO.
is there anything that is not correct in the code that could cause the problem (not talking about whether it makes sense)
or any compiler flags that might trigger this behaviour?
all parser code is rather critical for the application and is tested heavily, but of course this problem was not found in the test code. Should I come up with extreme test cases, and if so, how do I do that? How could I ever predict this could cause a problem?
if this really is a bug, where should do I best report it?
is there anything that is not correct in the code that could cause the problem (not talking about whether it makes sense)
Yes. Standard streams are required to have at least 1 unget() position. So you can safely do only one unget() after a call to get(). When you call peek() and the input buffer is empty, underflow() occurs and the implementation clears the buffer and loads a new portion of data. Note that peek() doesn't increase current input location, so it points to the beginning of the buffer. When you try to unget() the implementation tries to decrease current input position, but it's already at the beginning of the buffer so it fails.
Of course this depends on the implementation. If the stream buffer holds more than one character then it may sometimes fail and sometimes not. As far as I know, Microsoft's implementation stores only one character in basic_filebuf (unless you specify a bigger buffer explicitly) and relies on the <cstdio> internal buffering (by the way, that's one reason why MSVC iostreams are slow). A quality implementation may load the buffer again from the file when unget() fails, but it isn't required to do so.
Try to fix your code so you don't need more than one unget() position. If you really need it then wrap the stream with a stream that guarantees that unget() won't fail (look at Boost.Iostreams). Also the code you posted is nonsense. It tries to unget() and then get() again. Why?
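For illustration, here is one way to restructure such a scan so it never needs unget() at all: instead of pushing the '/' back, the scanner consumes the token and simply reports that it was found. This is only a sketch of the general idea, not the original parser:
#include <istream>
#include <string>
// Collects everything up to (but not including) the first "/*" into 'out', without
// ever calling unget()/putback(): the one-character lookahead stays in a local variable.
bool ReadUntilCommentStart( std::istream& is, std::string& out )
{
    int c;
    while( ( c = is.get() ) != std::istream::traits_type::eof() )
    {
        if( c == '/' && is.peek() == '*' )
        {
            is.get();                    // consume the '*' as well; "/*" has been found
            return true;
        }
        out += static_cast<char>( c );   // not a token start, keep the character
    }
    return false;                        // hit end of stream without finding "/*"
}
The caller then acts on the return value rather than expecting the token to still be in the stream, which sidesteps the single-putback-position limitation entirely.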