In my project I read from a json file with QJsonDocument::fromJson(). This works great, however when I try to write the QJsonDocument back to file with toJson() some of the doubles have messed up precision.
For example, calling toJson() on a document with a QJsonValue with a double value of 0.15 will save to file as 0.14999999999999999. I do not want this.
This is because the Qt source file qjsonwriter.cpp at line 126 (Qt 5.6.2) reads:
json += QByteArray::number(d, 'g', std::numeric_limits<double>::digits10 + 2); // ::digits10 is 15
That +2 at the end there is messing me up. If this same call to QByteArray::number() instead has a precision of 15 (instead of 17), the result is exactly as I need... 0.15.
I understand how the format of floating point precision causes the double to be limited in what it can represent. But if I limit the precision to 15 instead of 17, this has the effect of matching the input double precision, which I want.
How can I get around this?
Obviously... I could write my own Json parser, but that's last resort. And obviously I could edit the Qt source code, however my software is already deployed with the Qt5Core.dll included in everyone's install directory, and my updater is not designed to update any dll's. So I cannot edit the Qt source code.
Fingers crossed someone has a magic fix for this :)
this has the effect of matching the input double precision, which I want.
This request doesn't make much sense. A double doesn't carry any information about its precision - it only carries a value. 0.15, 0.1500 and 0.14999999999999999 are the exact same double value, and the JSON writer has no way to know how it was read from the file in the first place (if it was read from a file at all).
In general you cannot ask for a maximum of 15 digits of precision as you propose because, depending on the particular value, up to 17 digits are required for an exact double->text->double round trip, so you would write incorrectly rounded values. What some JSON writers do, however, is write each number with the minimum number of digits needed to read the same double back. Doing that correctly from the numerics alone is far from trivial, unless you do what many writers do: loop from 15 to 17, write the number with that precision, parse it back and check whether it comes back as the exact same double value. While this generates "nicer" (and smaller) output, it's more work and slows down the JSON write, which is probably why Qt doesn't do it.
Still, you can write your own JSON writing code and add this feature; a simple recursive implementation should take on the order of 15 lines of code.
That being said, again, if you want to precisely match your input this won't save you - as it's simply impossible.
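For illustration, a minimal sketch of that 15-to-17 loop using Qt's QByteArray API (formatShortest() is a hypothetical helper, not something Qt provides, and NaN/Inf handling is omitted):
#include <QByteArray>

QByteArray formatShortest( double d )
{
    for ( int precision = 15; precision <= 17; ++precision ) {
        const QByteArray candidate = QByteArray::number( d, 'g', precision );
        bool ok = false;
        const double roundTrip = candidate.toDouble( &ok );
        if ( ok && roundTrip == d )
            return candidate; // e.g. 0.15 comes back as "0.15", not "0.14999999999999999"
    }
    return QByteArray::number( d, 'g', 17 ); // 17 significant digits always round-trips a finite double
}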
I just encountered this as well. Rather than replace an entire Qt JSON implementation with a third party library (or roll my own!), however, I kludged a solution...
My full code base related to this is too extensive and elaborate to post and explain here. But the gist of the solution to this point is simple enough.
First, I use a QVariantMap (or QVariantHash) to collect my data, and then convert that to JSON via the built-in QJsonObject::fromVariantMap or QJsonDocument::fromVariant functions. To control the serialization, I define a class called DataFormatOptions which has a decimalPrecision member (and sets up easy expansion to other such formatting options), and then I call a function called toMagicVar to create "magic variants" for my data structure to be converted to JSON bytes. To control the number format / precision, toMagicVar converts doubles and floats to strings in the desired format and surrounds the string value with some "magic bytes". The way my actual code is written, one can easily do this on any "level" of the map/hash being built / formatted via recursive processing, but I've omitted those details...
const QString NO_QUOTE( "__NO_QUOT__" );
QVariant toMagicVar( const QVariant &var, const DataFormatOptions &opt )
{
    // ...
    const QVariant::Type type( var.type() );
    const QMetaType::Type metaType( (QMetaType::Type)type );
    // ...
    if( opt.decimalPrecision != DataFormatOptions::DEFAULT_PRECISION
        && (type == QVariant::Type::Double || metaType == QMetaType::Float) )
    {
        static const char FORMAT( 'f' );
        static const QRegExp trailingPointAndZeros( "\\.?0+$" );
        QString formatted( QString::number(
            var.toDouble(), FORMAT, opt.decimalPrecision ) );
        formatted.remove( trailingPointAndZeros );
        return QVariant( QString( NO_QUOTE + formatted + NO_QUOTE ) );
    }
    // ...
}
Note that I trim off any extraneous digits via formatted.remove. If you want the data to always include exactly X digits after the decimal point, you may opt to skip that step. (Or you might want to control that via DataFormatOptions?)
Once I have the json bytes I'm going to send across the network as a QByteArray, I remove the magic bytes so my numbers represented as quoted strings become numbers again in the json.
// This is where any "magic residue" is removed, or otherwise manipulated,
// to produce the desired final json bytes...
void scrubMagicBytes( QByteArray &bytes )
{
    static const QByteArray EMPTY, QUOTE( "\"" ),
        NO_QUOTE_PREFIX( QUOTE + NO_QUOTE.toLocal8Bit() ),
        NO_QUOTE_SUFFIX( NO_QUOTE.toLocal8Bit() + QUOTE );
    bytes.replace( NO_QUOTE_PREFIX, EMPTY );
    bytes.replace( NO_QUOTE_SUFFIX, EMPTY );
}
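To give a feel for how the pieces fit together, here is a hedged usage sketch with the names from above (my real code is structured differently and recursively processes nested maps, and the precision value here is just an example):
QVariantMap map;
DataFormatOptions opt;
opt.decimalPrecision = 2; // assumed member of the options class described above
map.insert( "ratio", toMagicVar( QVariant( 0.15 ), opt ) );
QByteArray json = QJsonDocument( QJsonObject::fromVariantMap( map ) ).toJson();
scrubMagicBytes( json ); // "ratio": "__NO_QUOT__0.15__NO_QUOT__" becomes "ratio": 0.15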
I need to format a FILETIME value into a wide string buffer, and the configuration provides the format string.
What I am actually doing:
Config provides the format string: L"{YYYY}-{MM}-{DD} {hh}:{mm}:{ss}.{mmm}"
Convert the FILETIME to System time:
SYSTEMTIME stUTC;
FileTimeToSystemTime(&fileTime, &stUTC);
Format the string with
fmt::format_to(std::back_inserter(buffer), strFormat,
               fmt::arg(L"YYYY", stUTC.wYear),
               fmt::arg(L"MM", stUTC.wMonth),
               fmt::arg(L"DD", stUTC.wDay),
               fmt::arg(L"hh", stUTC.wHour),
               fmt::arg(L"mm", stUTC.wMinute),
               fmt::arg(L"ss", stUTC.wSecond),
               fmt::arg(L"mmm", stUTC.wMilliseconds));
I perfectly understand that with a service comes a cost :) but my code is calling this statement millions of times and the performance penalty is clearly present (more than 6% of CPU usage).
"Anything" I could do to improve this code would be welcomed.
I saw that {fmt} has a time API support.
Unfortunately, it seems to be unable to format the millisecond part of the time/date, and it requires some conversion effort from FILETIME to std::time_t...
Should I forget about the "custom" format string and provide a custom formatter for the FILETIME (or SYSTEMTIME) types? Would that result in a significant performance boost?
I'd appreciate any guidance you can provide.
In the comments I suggested parsing your custom time format string into a simple state machine. It does not even have to be a state machine as such. It is simply a linear series of instructions.
Currently, the fmt class needs to do a bit of work to parse the format type and then convert an integer to a zero-padded string. It is possible, though unlikely, that it is as heavily optimized as I'm about to suggest.
The basic idea is to have a (large) lookup table, which of course can be generated at runtime, but for the purposes of quick illustration:
const wchar_t zeroPad4[10000][5] = { L"0000", L"0001", L"0002", ..., L"9999" };
You can have 1-, 2- and 3-digit lookup tables if you want, or alternatively recognize that these values are all contained in the 4-digit lookup table if you just add an offset.
So to output a number, you just need to know what the offset in SYSTEMTIME is, what type the value is, and what string offset to apply (0 for 4-digit, 1 for 3-digit, etc). It makes things simpler, given that all struct elements in SYSTEMTIME are the same type. And you should reasonably assume that no values require range checks, although you can add that if unsure.
And you can configure it like this:
struct Output {
    int dataOffset; // offset into SYSTEMTIME struct
    int count;      // extra adjustment after string lookup
};
What about literal strings? Well, you can either copy those or just repurpose Output to use a negative dataOffset representing where to start in the format string and count to hold how many characters to output in that mode. If you need extra output modes, extend this struct with a mode member.
Anyway, let's take your string L"{YYYY}-{MM}-{DD} {hh}:{mm}:{ss}.{mmm}" as an example. After you parse this, you would end up with:
Output outputs[] {
    { offsetof(SYSTEMTIME, wYear), 0 },         // "{YYYY}"
    { -6, 1 },                                  // "-"
    { offsetof(SYSTEMTIME, wMonth), 2 },        // "{MM}"
    { -11, 1 },                                 // "-"
    { offsetof(SYSTEMTIME, wDay), 2 },          // "{DD}"
    { -16, 1 },                                 // " "
    // etc... you get the idea
    { offsetof(SYSTEMTIME, wMilliseconds), 1 }, // "{mmm}"
    { -1, 0 },                                  // terminate
};
It shouldn't take much to see that, when you have a SYSTEMTIME as input, a pointer to the original format string, the lookup table, and this basic array of instructions you can go ahead and output the result into a pre-sized buffer very quickly.
I'm sure you can come up with the code to execute these instructions efficiently.
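For what it's worth, a hedged sketch of such an executor, assuming the zeroPad4 table and Output struct from above (FormatSystemTime and its arguments are illustrative names, not an existing API):
#include <windows.h>
#include <cstring>

void FormatSystemTime( const SYSTEMTIME &st, const wchar_t *fmtString,
                       const Output *outputs, wchar_t *dst )
{
    // Run the "instructions" until the { -1, 0 } terminator.
    for ( const Output *op = outputs; !( op->dataOffset < 0 && op->count == 0 ); ++op ) {
        if ( op->dataOffset >= 0 ) {
            // Numeric field: read the WORD at the given offset in SYSTEMTIME and
            // copy its zero-padded digits, skipping 'count' leading characters.
            WORD value;
            std::memcpy( &value, reinterpret_cast<const char *>( &st ) + op->dataOffset,
                         sizeof value );
            for ( const wchar_t *p = zeroPad4[value] + op->count; *p; ++p )
                *dst++ = *p;
        } else {
            // Literal run: 'dataOffset' is the negated position in the format string.
            const wchar_t *lit = fmtString + ( -op->dataOffset );
            for ( int i = 0; i < op->count; ++i )
                *dst++ = lit[i];
        }
    }
    *dst = L'\0';
}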
The main drawback of this approach is the size of the lookup table may lead to cache issues. However, most lookups will occur in the first 100 elements. You could also compress the table to ordinary char values and then inject the wchar_t zero bytes when copying.
As always: experiment, measure, and have fun!
I am trying to write a double variable into a binary file. I am using the code below:
#include <fstream>
using namespace std;

double x = 1.;
ofstream mfout;
mfout.open("junk.bin", ios::out | ios::binary );
mfout.write((char*) &x, sizeof(double));
mfout.close();
What it returns to me after converting output binary file to ASCII is this:
.......
The third party software which has to read the file also returns an error showing that there is a problem. I would be thankful if someone could guide me.
What it returns to me after converting output binary file to ASCII is this:
.......
No. That's what it returns to you if you interpret it as ASCII without converting it. Since it's not ASCII, interpreting it as ASCII will produce nonsense.
The third party software which has to read the file also returns error showing that there is problem.
Then it sounds like the third party software isn't expecting a binary file, since that's what you've written.
The file is binary, not ASCII. Only something expecting a single double in binary format (whatever binary format that your platform happens to use with your compiler options and so on) will be able to make sense out of it.
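For instance, only a reader that interprets those 8 bytes as a double gets the value back; a minimal sketch, assuming the same platform, compiler and byte order as the writer:
#include <fstream>
#include <iostream>
using namespace std;

int main()
{
    double x = 0.;
    ifstream mfin( "junk.bin", ios::in | ios::binary );
    mfin.read( (char*) &x, sizeof(double) );
    cout << x << endl; // prints 1 again
}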
I want to save my terrain data to a file and load only some parts of it, because it's just too big to store it in memory as a whole. Actually I don't even know whether the protobuf is good for this purposes.
For example I would have a structure like (might be invalid grammatically, I know only simple basics):
message Quad {
    required int32 x = 1;
    required int32 z = 2;
    repeated int32 y = 3;
}
The x and z values are available in my program and by using them I would like to find the correct Quad object with the same x and z (in the file) to obtain its y values. However, I can't just parse the file with ParseFromIstream(), because (I think) it loads the whole file into memory, and in my case the file is just too big.
So, is protobuf able to load one object, hand it to me for checking, and if the object is wrong, give me the next one?
Actually... I could just ask: does ParseFromIstream() load the whole file into memory?
While some libraries allow you to read files partially, the technique recommended by Google is to simply have the file consist of multiple messages:
https://developers.google.com/protocol-buffers/docs/techniques
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data.
So you could just write a long sequence of Quad messages to the file, delimited by the lengths of the messages. If you need to seek randomly to specific Quads, you may want to add some kind of an index.
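As a hedged sketch of that idea with the C++ API (a fixed 4-byte length prefix is my own choice here; the docs favour varint lengths or the delimited-message helpers, and quad.pb.h stands for the header generated from the Quad message above):
#include <cstdint>
#include <fstream>
#include <string>
#include "quad.pb.h" // assumed: header generated by protoc from the Quad message

// Append one Quad with a 4-byte length prefix.
void writeQuad( std::ofstream &out, const Quad &quad )
{
    std::string bytes;
    quad.SerializeToString( &bytes );
    const std::uint32_t len = static_cast<std::uint32_t>( bytes.size() );
    out.write( reinterpret_cast<const char *>( &len ), sizeof len );
    out.write( bytes.data(), bytes.size() );
}

// Scan the file one message at a time until the wanted (x, z) pair is found,
// so only a single Quad is ever held in memory.
bool findQuad( std::ifstream &in, int x, int z, Quad *result )
{
    std::uint32_t len = 0;
    std::string bytes;
    while ( in.read( reinterpret_cast<char *>( &len ), sizeof len ) ) {
        bytes.resize( len );
        if ( !in.read( &bytes[0], len ) )
            return false; // truncated file
        if ( result->ParseFromArray( bytes.data(), static_cast<int>( len ) )
             && result->x() == x && result->z() == z )
            return true;
    }
    return false;
}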
This depends on which implementation you are using. Some have "read as a sequence" APIs. For example, assuming you stored it as a "repeated Quad", then with protobuf-net that would be:
int x = ..., y = ...;
var found = Serializer.DeserializeItems<Quad>(source)
            .Where(q => q.x == x && q.y == y);
The point being: it yields a spooling (not loaded all at once) and short-circuiting sequence.
I don't know the c++ api specifically, but I would hope it has something similar - but worst case you could parse the varint headers and prepare a length-capped stream.
I am developing a program in C++ and using the string container, as in std::string, to store network data from the socket (this is peachy). I receive the data in a maximum possible 1452-byte frame at a time. The protocol uses a header that contains information about the length of the data portion of the packet, and the header has a fixed 20-byte length. My problem is that a string is giving me an unknown debug assertion; as in, it asserts, but I get NO message about the string. Now, considering I can receive more than a single packet in a frame at any time, I place all received data into the string, reinterpret_cast it to my data struct, calculate the total length of the packet, then copy the data portion of the packet into a string for regex processing. At this point I do a string.erase, as in mybuff.Erase(totalPackLen); <~ THIS is what's calling the assert, but totalPackLen is less than the string's size.
Is there some convention I am missing here? Or is it that the std::string really is an inappropriate choice here? Ty.
Fixed it on my own. Rolled my own VERY simple buffer with a few C calls :)
int ret = recv( socket, m_buff, sizeof(m_buff), 0 ); // recv() needs a length argument; m_buff is assumed to be a raw char array
if( ret > 0 )
{
    BigBuff.append( m_buff, ret );
    while( BigBuff.size() > 16 ){
        Header *hdr = reinterpret_cast<Header*>( &BigBuff[0] );
        if( ntohs(hdr->PackLen) <= BigBuff.size() - 20 ){
            hdr->PackLen = ntohs( hdr->PackLen );
            string lData;
            lData.append( BigBuff.begin() + 20, BigBuff.begin() + 20 + hdr->PackLen );
            Parse( lData ); // regex parsing helper function
            BigBuff.erase( hdr->PackLen + 20 ); // assert here when packlen is 235 and string len is 1458
        }
    }
}
From the code snippet you provided it appears that your packet comprises a fixed-length binary header followed by a variable length ASCII string as a payload. Your first mistake is here:
BigBuff.append(m_buff,ret);
There are at least two problems here:
1. Why the append? You presumably have already dealt with any previous messages. You should be starting with a clean slate.
2. Mixing binary and string data can work, but more often than not it doesn't. It is usually better to keep the binary and ASCII data separate. Don't use std::string for non-string data.
Append adds data to the end of the string. The very next statement after the append is a test for a length of 16, which says to me that you should have started fresh. In the same vein you do that reinterpret cast from BigBuff[0]:
Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
Because of your use of append, you are perpetually dealing with the header from the first packet received rather than the current packet. Finally, there's that erase:
BigBuff.erase(hdr->PackLen + 20);
Many problems here:
- If the packet length and the return value from recv are consistent the very first call will do nothing (the erase is at but not past the end of the string).
- There is something very wrong if the packet length and the return value from recv are not consistent. It might mean, for example, that multiple physical frames are needed to form a single logical frame, and that in turn means you need to go back to square one.
- Suppose the physical and logical frames are one and the same, you're still going about this all wrong. As noted, the first time around you are erasing exactly nothing. That append at the start of the loop is exactly what you don't want to do.
Serialization oftentimes is a low-level concept and is best treated as such.
Your comment doesn't make sense:
BigBuff.erase(hdr->PackLen + 20); //assert here when len is packlen is 235 and string len is 1458;
BigBuff.erase(hdr->PackLen + 20) will erase from index hdr->PackLen + 20 onwards to the end of the string. From the description of the code, it seems to me that you're erasing beyond the end of the content data. Here's the reference for std::string::erase() for you.
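A small self-contained illustration of the two overloads (dropping a parsed packet from the front of a buffer would need the two-argument form):
#include <cassert>
#include <string>

int main()
{
    std::string s( "0123456789ABCDEF" );
    s.erase( 10 );           // one-argument form: erases from index 10 to the end
    assert( s == "0123456789" );

    std::string t( "0123456789ABCDEF" );
    t.erase( 0, 10 );        // two-argument form: erases 10 characters starting at index 0
    assert( t == "ABCDEF" );
}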
Needless to say, std::string is entirely inappropriate here; it should be std::vector.
I need to read the number of lines in a file before doing some operations on that file. I tried reading the file and incrementing a line_count variable at each iteration until reaching EOF, but it was not that fast in my case. I used both ifstream and fgets; they were both slow. Is there a hacky way to do this, which is also used by, for instance, BSD, the Linux kernel or Berkeley DB (maybe by using bitwise operations)?
The number of lines is in the millions in that file and it keeps getting larger; each line is about 40 or 50 characters. I'm using Linux.
Note:
I'm sure there will be people who might say "use a DB, idiot". But briefly, in my case, I can't use a DB.
The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
As your current file size appears to be about 60Mb, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1Mb. You also say that a database is out of the question, but it really does look to be the best long-term solution.
Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:
#include <iostream>
#include <fstream>
#include <vector>
#include <ctime>
using namespace std;
unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1 ) {
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;
}
Don't use C++ STL strings and getline (or C's fgets), just C-style raw pointers, and either block-read in page-size chunks or mmap the file.
Then scan the block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register (SWAR)' algorithms for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)
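To make that concrete, here is a hedged sketch of the 64-bit SWAR idea for an in-memory buffer; the zero-byte test is the classic bit-twiddling trick, and __builtin_popcountll assumes GCC/Clang (std::popcount would do in C++20):
#include <cstddef>
#include <cstdint>
#include <cstring>

// Count '\n' bytes in a buffer eight at a time. For each 64-bit word,
// XOR with a word full of '\n' turns matching bytes into zero bytes, and the
// 0x7f..7f trick marks each zero byte with 0x80 so a popcount gives the total.
std::size_t countNewlines( const char *buf, std::size_t len )
{
    const std::uint64_t pattern = 0x0a0a0a0a0a0a0a0aULL; // '\n' in every byte
    const std::uint64_t low7    = 0x7f7f7f7f7f7f7f7fULL;
    std::size_t count = 0, i = 0;

    for ( ; i + 8 <= len; i += 8 ) {
        std::uint64_t word;
        std::memcpy( &word, buf + i, 8 );                 // avoids unaligned-load UB
        const std::uint64_t x    = word ^ pattern;        // '\n' bytes become 0x00
        const std::uint64_t hits = ~( ( ( x & low7 ) + low7 ) | x | low7 );
        count += static_cast<std::size_t>( __builtin_popcountll( hits ) );
    }
    for ( ; i < len; ++i )                                // leftover tail bytes
        if ( buf[i] == '\n' )
            ++count;
    return count;
}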
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (ie something keeps writing to it), then there's a good likelihood that linux has it cached in memory, so it won't be disk IO limited, but memory bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.
It has been pointed out that you could use mmap with C++ STL algorithms, and create a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed not style.
You wrote that it keeps getting larger.
This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach (a short sketch in code follows the steps):
Parse to the end of file.
Remember the line count and the offset of EOF.
When the file grows fseek to the offset, parse to EOF and update the line count and the offset.
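A minimal sketch of those steps (how the count and offset are persisted between runs is left out, and the names are illustrative):
#include <fstream>
#include <string>
#include <vector>

// Incrementally update a line count for an append-only file. 'count' and
// 'offset' come from the previous run (both 0 the first time); persisting
// them between runs is left out of this sketch.
void updateLineCount( const std::string &path, long long &count, long long &offset )
{
    std::ifstream in( path, std::ios::binary );
    in.seekg( offset );                                   // skip what was already counted
    std::vector<char> buf( 1024 * 1024 );
    while ( in.read( buf.data(), buf.size() ), in.gcount() > 0 ) {
        const std::streamsize got = in.gcount();
        for ( std::streamsize i = 0; i < got; ++i )
            if ( buf[i] == '\n' )
                ++count;
        offset += got;
    }
}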
There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.
Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.
Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).
You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?
Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)
If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
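To make the gotchas above concrete, here is a hedged sketch of a chunked counter with a small state machine that treats "\n", "\r" and "\r\n" each as one line break and counts an unterminated final line (it deliberately ignores the Unicode separators and multibyte encodings mentioned above):
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

std::size_t countLines( std::istream &in )
{
    std::vector<char> buf( 1024 * 1024 );
    std::size_t lines = 0;
    bool prevWasCR = false;   // state: the previous byte was '\r'
    bool lineHasData = false; // state: the current line has unterminated content

    while ( in.read( buf.data(), buf.size() ), in.gcount() > 0 ) {
        const std::streamsize got = in.gcount();
        for ( std::streamsize i = 0; i < got; ++i ) {
            const char c = buf[i];
            if ( c == '\n' ) {
                if ( !prevWasCR )  // the '\r' of a "\r\n" pair was already counted
                    ++lines;
                prevWasCR = false;
                lineHasData = false;
            } else if ( c == '\r' ) {
                ++lines;
                prevWasCR = true;
                lineHasData = false;
            } else {
                prevWasCR = false;
                lineHasData = true;
            }
        }
    }
    if ( lineHasData )  // unterminated last line
        ++lines;
    return lines;
}

int main()
{
    std::ifstream file( "lines.dat", std::ios::binary );
    std::cout << countLines( file ) << "\n";
}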
Remember that all fstreams are buffered, so they in effect already read in chunks and you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() for that, though, as it will force you to size (and copy into) a string. So I would just use the STL std::count_if and stream iterators.
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <functional>

struct TestEOL
{
    bool operator()(char c)
    {
        last = c;
        return last == '\n';
    }
    char last = '\n';  // start as if a line just ended, so an empty file counts as 0 lines
};

int main()
{
    std::fstream file("Plop.txt");
    TestEOL test;
    // std::ref() stops the algorithm from working on a copy of 'test',
    // so test.last really holds the final character examined.
    std::size_t count = std::count_if(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>(),
                                      std::ref(test));
    if (test.last != '\n') // If the last character checked is not '\n'
    {                      // then the last line in the file has not been
        ++count;           // counted. So increment the count so we count
    }                      // the last line even if it is not '\n' terminated.
    std::cout << count << "\n";
}
It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that simply goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.
However, while there is no faster algorithm, there is a faster mechanism called a "memory mapped file". There are some drawbacks to mapped files and it might not be appropriate for your case, so you'll have to read about it and figure that out by yourself.
Memory mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
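For reference, a minimal POSIX sketch of that mechanism (the question mentions Linux); error handling is reduced to returning 0, and the function name is illustrative:
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count '\n' bytes in a memory mapped file (POSIX). Errors simply return 0.
std::size_t countLinesMapped( const char *path )
{
    const int fd = open( path, O_RDONLY );
    if ( fd < 0 ) return 0;

    struct stat st;
    if ( fstat( fd, &st ) != 0 || st.st_size == 0 ) { close( fd ); return 0; }

    void *data = mmap( nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );                      // the mapping stays valid after close
    if ( data == MAP_FAILED ) return 0;

    const char *p = static_cast<const char *>( data );
    const std::size_t lines =
        static_cast<std::size_t>( std::count( p, p + st.st_size, '\n' ) );

    munmap( data, st.st_size );
    return lines;
}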
You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.
However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple memchr() loop should do it.
2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log), work out how many lines there are at that point (storing that in x_20090101_1022.count), then start a new x.log file. Characteristics of log files mean that this dated section will never change, so you will never have to recalculate the number of lines.
To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the harddrive is slow. There's no way around that.
If your file only grows, then Ludwig Weinzierl's approach is the best solution if you do not have control of the writers. Otherwise, you can make it even faster: increment the counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, then make sure to use a lock; locking your existing file is enough. The counter can be 4 or 8 bytes written in binary in a file under /run/<your-prog-name>/counter (which is in RAM, so dead fast).
Ludwig Algorithm
Initialize offset to 0
Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
Update offset with position at EOF
Save counter & offset to a file or in a variable if you only need it in your software
Repeat from "Read file ..." on a change
This is actually how various software processing log files function (i.e. fail2ban comes to mind).
The first time, it has to process a huge file. Afterward, it is very small and thus goes very fast.
Proactive Algorithm
When creating the files, reset counter to 0.
Then each time you receive a new line to add to the file (a sketch in code follows these steps):
Lock file
Write one line
Load counter
Add one to counter
Save counter
Unlock file
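A hedged POSIX sketch of those steps, using flock() on the log file and a small binary counter file such as the /run path mentioned above; all names are illustrative:
#include <cstdint>
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <unistd.h>

// Append one line to the log and bump a binary counter kept in a small file
// (e.g. /run/<your-prog-name>/counter). flock() on the log serialises writers.
bool appendLineAndCount( const char *logPath, const char *counterPath,
                         const std::string &line )
{
    const int logFd = open( logPath, O_WRONLY | O_APPEND | O_CREAT, 0644 );
    if ( logFd < 0 ) return false;
    if ( flock( logFd, LOCK_EX ) != 0 ) { close( logFd ); return false; }   // lock file

    const std::string withNewline = line + "\n";
    bool ok = write( logFd, withNewline.data(), withNewline.size() )
              == static_cast<ssize_t>( withNewline.size() );                // write one line

    const int cntFd = open( counterPath, O_RDWR | O_CREAT, 0644 );
    if ( ok && cntFd >= 0 ) {
        std::uint64_t counter = 0;
        if ( read( cntFd, &counter, sizeof counter )                        // load counter
             != static_cast<ssize_t>( sizeof counter ) )
            counter = 0;                                                    // new or empty counter file
        ++counter;                                                          // add one to counter
        lseek( cntFd, 0, SEEK_SET );
        ok = write( cntFd, &counter, sizeof counter )
             == static_cast<ssize_t>( sizeof counter );                     // save counter
    }
    if ( cntFd >= 0 ) close( cntFd );

    flock( logFd, LOCK_UN );                                                // unlock file
    close( logFd );
    return ok;
}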
This is very close to what database systems do, so a SELECT COUNT(*) FROM table on a table with millions of rows returns instantly. Databases also do that per index. So if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above.
Personal note: I see a huge number of Internet software which are backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens. Not use a backward concept of checking logs to detect that something bad just happened.
For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send an instant message to the admin to make sure there wasn't a 6th time which was successful and the hacker can now see all your users' data... If you use logs, the "instant message" is going to be late by seconds if not minutes.
Don't do processing backward.