QString::compare() vs converting QString to numbers and then comparing - c++

Is it faster to compare 2 QStrings containing numbers, or to convert those QStrings to numbers and then compare the numbers?
So which of these is faster?
QString str1,str2;
if(str1.compare(str2)==0)
OR
QString str1,str2;
if(str1.toInt()==str2.toInt())
The reason I'm asking is that I have to fill a QMap with error codes and the error messages corresponding to those codes. I'll be reading the error code / error message pairs from an ini file, so I'm wondering whether it's better to convert the error codes to integers and have QMap<int,QString>, or to just keep them as QStrings and have QMap<QString,QString>. Which approach will give me more efficient code?
Where the QMap contains <error code, error message>

String comparison is likely to end with trouble: "1.00" != "1.0" != "1" != "0001"
Always use numeric types for comparing numbers, and don't worry about the imagined performance cost of such a minuscule piece of the whole.
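The normalization problem is easy to demonstrate with plain standard-library types (std::stod here stands in for whatever numeric parse you would use; this is an illustration, not Qt code):

```cpp
#include <string>

// "1.0" and "1.00" are different strings but the same number, so a
// string compare and a numeric compare give different answers.
bool sameAsStrings(const std::string& a, const std::string& b) {
    return a == b;
}

bool sameAsNumbers(const std::string& a, const std::string& b) {
    return std::stod(a) == std::stod(b);
}
```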

For a one-time use, just comparing the strings will (probably) be faster than converting them to numbers and comparing the numbers.
If you need the result as a number for other steps, then convert them to numbers at the start and store numbers.
If your error codes are contiguous, then you would typically put them into a vector indexed by [error_code - first_error_code].
BUT before doing any optimisation: 1. measure, 2. decide if you care.

In the case of the code you've written, doing two conversions and comparing the results is going to be slower than comparing the strings directly.
The thing is, to do a string comparison, you must at worst visit each character of each string. In the != case, you may visit fewer characters before finding the difference and exiting the compare (assuming a compare routine that exits early on a mismatch). In the convert-and-compare case, you MUST visit all characters of both strings, every time. So the direct compare will be faster.
In the case of the maps, you'll want to use QString because you'll do the conversion once and do the compare many, many times. That means that the cost of the conversion will be swamped by the savings from the comparisons and you'll win in the end.

With QString keys, the map performs string comparisons on every insertion, deletion and lookup. Since those comparisons are done repeatedly, it's cheaper to convert the string to an integer before using it as a map key. The conversion is then done only once per item, and perhaps once per lookup if the key you are looking up also starts out in QString form.
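The convert-once, look-up-many pattern above can be sketched with standard containers (std::map and std::stoi stand in for QMap and QString::toInt; the ini-parsing step is omitted, and loadErrors is a hypothetical helper name):

```cpp
#include <map>
#include <string>

// Convert each error code to int once at load time, so every later
// map lookup is an integer comparison instead of a string comparison.
std::map<int, std::string> loadErrors(
        const std::map<std::string, std::string>& raw) {
    std::map<int, std::string> errors;
    for (const auto& [codeStr, message] : raw) {
        errors.emplace(std::stoi(codeStr), message);  // one conversion per item
    }
    return errors;
}
```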

Related

Is the md5 function safe to use for merging datasets?

We are about to promote a piece of code which uses the SAS md5() hash function to efficiently track changes in a large dataset.
format md5 $hex32.;
md5=md5(cats(of _all_));
As per the documentation:
The MD5 function converts a string, based on the MD5 algorithm, into a 128-bit hash value. This hash value is referred to as a message digest (digital signature), which is nearly unique for each string that is passed to the function.
At approximately what stage does 'nearly unique' begin to pose a data integrity risk (if at all)?
I have seen an example where the md5 comparison goes wrong.
If you have the values "AB" and "CD" in the (two columns of the) first row and "A" and "BCD" in the second row, they get the same md5 value. See this example:
data md5;
attrib a b length=$3 informat=$3.;
infile datalines;
input a b;
format md5 $hex32.;
md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;run;
This is, of course, because CATS(of _all_) will concatenate and strip the variables (converting numbers to strings using the "best" format), without a delimiter. If you use CAT instead, this will not happen, because the leading and trailing blanks are not removed. This error is not very far-fetched: if you have missing values, it could occur more often. If, for example, you have a lot of binary values in text variables, some of which are missing, it could occur very often.
One could do this manually, adding a delimiter between the values. Of course, you would still have a collision when you have ("AB!" and "CD") versus ("AB" and "!CD") and use "!" as the delimiter...
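The same collision can be reproduced outside SAS; here is a minimal C++ sketch where a plain concatenation stands in for CATS (not the SAS function itself):

```cpp
#include <string>

// Concatenating trimmed fields without a delimiter loses the column
// boundaries, so different rows can feed identical input to the hash.
std::string cats(const std::string& a, const std::string& b) {
    return a + b;  // no delimiter between the fields
}
```

Note that even with a delimiter, fields containing the delimiter character collide the same way.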
MD5 has 2^128 distinct values, and from what I've read, at around 2^64 different inputs (roughly 2×10^19, the birthday bound) you begin to have a high likelihood of finding a collision.
However, as a result of how MD5 is generated, you have some risk of collisions from very similar preimages which differ in as few as two bytes. As such, it's hard to say how risky this would be for your particular process. It's certainly possible for a collision to occur with as few as two messages; it's just not likely. Does saving [some] computing time benefit you enough to outweigh a small risk?

String storage optimization

I'm looking for a C++ library that would help optimize memory usage by storing similar (not identical) strings in memory only once. It is not Flyweight or string interning, which can store only exact objects/strings once. The library should be able to analyze and understand that, for example, two particular strings of different length have identical first 100 characters; this substring should be stored only once.
Example 1:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch=test2";
in this case it is obvious that the only difference in these two strings is the last character, so it would be a great saving in memory if we hold only one copy of "http://www.contoso.com/some/path/app.aspx?ch=test" and then two additional strings "1" and "2"
Example 2:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch1=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch2=test2";
this is a more complicated case where there are multiple identical substrings: one copy of "http://www.contoso.com/some/path/app.aspx?ch", then the two strings "1" and "2", one copy of "=test", and since we already have the strings "1" and "2" stored we don't need any additional strings.
So, is there such a library? Is there something that could help to develop such a library relatively fast? The strings are immutable, so there is no need to worry about updating indexes or locks for thread safety.
If the strings have a common prefix, one solution is to use a radix tree (a space-optimized trie, http://en.wikipedia.org/wiki/Radix_tree) for string representation. Then you only store a pointer to a tree leaf, and reconstruct the whole string by walking up to the root.
hello world   [0]
hello winter  [1]
hell          [2]

        [2]
       /
h-e-l-l-o-' '-w-o-r-l-d-[0]
               \
                i-n-t-e-r-[1]
Here is one more solution: http://en.wikipedia.org/wiki/Rope_(data_structure)
libstdc++ implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a00223.html
SGI documentation: http://www.sgi.com/tech/stl/Rope.html
But I think you need to construct your strings carefully for a rope to work well. Maybe find the longest common prefix and suffix of every new string against the previous string, and then express the new string as the concatenation of the previous string's prefix, the unique middle part, and the previous string's suffix.
For example 1, what I can come up with is a radix tree, a space-optimized version of a trie. A quick Google search turns up quite a few implementations in C++.
For example 2, I am also curious about the answer!
First of all, note that std::string is not immutable and you have to make sure that none of these strings are accidentally modified.
This depends on the pattern of the strings. I suggest using hash tables (std::unordered_map in C++11). The exact details depend on how you are going to access these strings.
The two strings you have provided differ only after the "?ch" part. If you expect that many strings will have long common prefixes of almost the same size, you can do the following:
Let's say the size of a prefix is 43 chars. Let s be a string. Then, we can consider s[0-42] a key into the hash table and the rest of the string as a value.
For example, given "http://www.contoso.com/some/path/app.aspx?ch=test1", the key would be "http://www.contoso.com/some/path/app.aspx?" and "ch=test1" would be the value. If the key already exists in the hash table, you can just add the value to the collection of values associated with that key. Otherwise, add the key/value pair.
This is just an example, what the key is and what the value is depend on how you are going to access these strings.
Also, if all strings have "=test" in them, then you don't have to store it with every value. You can store it once and insert it when retrieving a string. So given the value "ch1=test1", what is stored would be just "ch11". This depends on the pattern of the strings.
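A minimal sketch of the fixed-length-prefix idea with std::unordered_map (the prefix length is a parameter; 43 is just the value assumed in the answer, and PrefixStore is a hypothetical name):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Store the shared prefix once as the hash-table key; only the varying
// tail of each string is duplicated per entry.
struct PrefixStore {
    std::size_t prefixLen;
    std::unordered_map<std::string, std::vector<std::string>> buckets;

    void add(const std::string& s) {
        buckets[s.substr(0, prefixLen)].push_back(s.substr(prefixLen));
    }
};
```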

How to read a string character by character as a range in D?

How to read a line as a range in D?
I know there is ranges in D, but I just wondered how to simply iterate over each character of a string using this concept?
To show what I'm after, the similar code in Go is:
for _, someChar := range someString {
// Do something
}
That would depend on whether you want to iterate over code units or code points. The language itself iterates over arrays by array elements, and strings are arrays of code units, so if you simply use foreach with type inference, then with
foreach(c; "La Verité")
writeln(c);
the last two characters printed would be gibberish, because é is a code point made up of two UTF-8 code units, and you're printing out individual code units (since char is a UTF-8 code unit). Whereas, if you do
foreach(dchar c; "La Verité")
writeln(c);
then the runtime will decode the code units to code points, and é will be printed as the last character. But none of this is really operating on strings as ranges. foreach operates on arrays natively without having to use the input range API. However, for all string types, the range API looks like
@property bool empty();
@property dchar front();
void popFront();
It operates on strings as ranges of dchar - not their code unit type. This avoids issues with functions like std.algorithm.filter operating on individual code units, since that would make no sense. Operating on code points isn't 100% correct either, since Unicode gets very complicated with regards to combining code points and graphemes and whatnot, but operating on code points is far closer to being correct (and I believe there's work being done on adding range support for graphemes into the standard library for the cases where you need that and are willing to pay the performance hit). So, having the range API for strings operate on them as ranges of dchar is far more correct, and if you did something like
foreach(c; filter!"true"("La Verité"))
writeln(c);
you would be iterating over dchar, and é would print correctly. The downside to all of this of course is the fact that foreach on strings operates on the code unit level by default whereas the range API for strings operate on them as code points, so you have to be careful when mixing array operations and range-based operations on strings. That's also why string and wstring are not considered random-access ranges - just bidirectional ranges. You can't do random access in O(1) on code points when they're made up of varying numbers of code units (whereas dstring is a random-access range, because with UTF-32, every code unit is a code point).
foreach(ch; str)
do_something(ch);
A string is an InputRange. An InputRange implements three things:
empty; is it empty?
front; give me the next item.
popFront; advance the range, otherwise front will keep returning the same item.
foreach "understands" how to work with ranges, so it "just works".
But I don't speak Go, so I'm not entirely sure we're speaking the same language.

Efficiency of insert in string

I'm trying to write this custom addition class for very large integers, bigger than long long. One approach which I'm investigating is keeping the integer as a string and then converting the characters to their int components and then adding each "column". Another approach I'm considering is to split up the string into multiple strings each of which is the size of a long long and then casting it using a string stream into a long long adding and then recombining.
Regardless, I came across the fact that addition is most easily done in reverse, to allow for the carrying over of digits. This being the case, I was wondering about the efficiency of the string's insert method. It seems, since a string is an array of chars, that all the chars would have to be shifted over by one. So it would vary, but it would seem the efficiency is O(n), where n is the number of chars in the string.
Is this correct, or is this only with a naive interpretation?
Edit: I now have an answer to my question, but on a related topic I was wondering which is more efficient: inserting the string into a stream and then extracting into an int, or computing 10^n*char1 + 10^(n-1)*char2 ... etc.?
As far as I know, you are correct. Typical C++ implementations of std::string perform an insert in O(n) time, since the string is stored as a contiguous character array.
For your numeric implementation, why not store the numbers as arrays of integers and convert to a string only for output?
It's probably correct with all real implementations of std::string. Under the circumstances, you might want to either store the digits in reverse (though that could be clumsy in other ways) or else use something like an std::deque<char>. Better still, use an std::deque<unsigned long long>, which will reduce the number of operations involved.
Of course, for real use you usually want to use an existing library rather than rolling your own.
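A minimal sketch of the store-digits-in-reverse idea (least-significant digit first), which turns the O(n) front insert into an amortized O(1) push_back; addReversed is a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>

// Digits are stored least-significant first, so a carry is appended
// with push_back instead of inserted at the front.
std::string addReversed(const std::string& a, const std::string& b) {
    std::string result;
    int carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        int d = carry;
        if (i < a.size()) d += a[i] - '0';
        if (i < b.size()) d += b[i] - '0';
        result.push_back(static_cast<char>('0' + d % 10));
        carry = d / 10;
    }
    return result;
}
```

The same loop works unchanged over a std::deque<char>, as suggested above.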

Is this an acceptable use of "ASCII arithmetic"?

I've got a string value of the form 10123X123456 where 10 is the year, 123 is the day number within the year, and the rest is unique system-generated stuff. Under certain circumstances, I need to add 400 to the day number, so that the number above, for example, would become 10523X123456.
My first idea was to substring those three characters, convert them to an integer, add 400 to it, convert them back to a string and then call replace on the original string. That works.
But then it occurred to me that the only character I actually need to change is the third one, and that the original value would always be 0-3, so there would never be any "carrying" problems. It further occurred to me that the ASCII code points for the numbers are consecutive, so adding the number 4 to the character "0", for example, would result in "4", and so forth. So that's what I ended up doing.
My question is, is there any reason that won't always work? I generally avoid "ASCII arithmetic" on the grounds that it's not cross-platform or internationalization friendly. But it seems reasonable to assume that the code points for numbers will always be sequential, i.e., "4" will always be 1 more than "3". Anybody see any problem with this reasoning?
Here's the code.
std::string input = "10123X123456";
input[2] += 4;
// Output should be 10523X123456
From the C++ standard, section 2.2.3:
In both the source and execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the previous.
So yes, if you're guaranteed to never need a carry, you're good to go.
The C++ language definition requires that the code-point values of the numerals be consecutive. Therefore, ASCII arithmetic is perfectly acceptable.
Always keep in mind that if this value is generated by something you do not entirely control (such as users or a third-party system), something can and will go wrong with it. (Check out Murphy's laws.)
So I think you should at least put on some validations before doing so.
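A sketch of the digit bump with the suggested validation (addFourHundred is a hypothetical helper name, not from the question):

```cpp
#include <stdexcept>
#include <string>

// Add 400 to the day number by bumping the hundreds digit by 4.
// Safe only while that digit is '0'..'3', so no carry can occur.
std::string addFourHundred(std::string s) {
    if (s.size() < 3 || s[2] < '0' || s[2] > '3')
        throw std::invalid_argument("day hundreds digit out of range");
    s[2] += 4;  // digit code points are consecutive per the standard
    return s;
}
```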
It sounds like altering the string as you describe is easier than parsing the number out in the first place. So if your algorithm works (and it certainly does what you describe), I wouldn't consider it premature optimization.
Of course, after you add 400, it's no longer a day number, so you couldn't apply this process recursively.
And, <obligatory Year 2100 warning>.
A very long time ago I saw some x86 processor instructions for ASCII and BCD arithmetic.
Those are AAA (ASCII Adjust After Addition), AAS (subtraction), AAM (multiplication), AAD (division).
But even if you are not sure about the target platform, you can refer to the specification of the character set you are using; you will find that the first 128 ASCII characters have the same meaning in practically every character set (in Unicode they form the first code page).