Unexpected collision with std::hash - c++

I know that hashing an infinite number of strings into 32-bit ints must generate collisions, but I expect a reasonably even distribution from a hash function.
Isn't it weird that these 2 strings have the same hash?
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1
I know I can use boost::hash<std::string> or others, but I want to know what is wrong with std::hash. Am I using it wrong? Shouldn't I somehow "seed" it?

There's nothing wrong with your usage of std::hash. The problem is that the specialization std::hash<std::string> provided by the standard library implementation bundled with Visual Studio 2010 only takes a subset of the string's characters to determine the hash value (presumably for performance reasons). Coincidentally the last character of a string with 14 characters is not part of this set, which is why both strings yield the same hash value.
As far as I know, this behaviour conforms to the standard, which demands only that multiple calls to the hash function with the same argument always return the same value; it merely recommends that the probability of a collision be minimal. The VS2010 implementation fulfills the mandatory part, yet fails on the recommended one.
For details, see the implementation in the header file xfunctional (starting at line 869 in my copy) and §17.6.3.4 of the C++ standard (latest public draft).
If you absolutely need a better hash function for strings, you should implement it yourself. It's actually not that hard.

The exact hash algorithm isn't specified by the standard, so the results
will vary. The algorithm used by VC10 doesn't seem to take all of the
characters into account if the string is longer than 10 characters; it
advances with an increment of 1 + s.size() / 10. This is legal,
albeit from a QoI point of view, rather disappointing; such hash codes
are known to perform very poorly for some typical sets of data (like
URLs). I'd strongly suggest you replace it with either a FNV hash or
one based on a Mersenne prime:
FNV hash:
struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U;
        std::string::const_iterator end = s.end();
        for ( std::string::const_iterator iter = s.begin();
              iter != end;
              ++iter ) {
            result = (16777619 * result)
                     ^ static_cast<unsigned char>( *iter );
        }
        return result;
    }
};
Mersenne prime hash:
struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U;
        std::string::const_iterator end = s.end();
        for ( std::string::const_iterator iter = s.begin();
              iter != end;
              ++iter ) {
            result = 127 * result
                     + static_cast<unsigned char>( *iter );
        }
        return result;
    }
};
(The FNV hash is supposedly better, but the Mersenne prime hash will be
faster on a lot of machines, because multiplying by 127 is often
significantly faster than multiplying by 16777619.)

You should normally get different hash values; I do get different values (GCC 4.5):
hashtest.cpp
#include <string>
#include <iostream>
#include <functional>
int main(int argc, char** argv)
{
    size_t hash0 = std::hash<std::string>()("generated_id_0");
    size_t hash1 = std::hash<std::string>()("generated_id_1");
    std::cout << hash0 << (hash0 == hash1 ? " == " : " != ") << hash1 << "\n";
    return 0;
}
Output
# g++ hashtest.cpp -o hashtest -std=gnu++0x
# ./hashtest
16797002355621538189 != 16797001256109909978

You do not seed a hash function; at most you can salt it.
You are using the function in the right way, and this collision may simply be bad luck.
You cannot tell whether a hash function is poorly distributed unless you perform a massive test with random keys.

The TR1 hash function and the newest standard define proper overloads for things like strings. When I run this code using std::tr1::hash (g++ 4.1.2), I get different hash values for these two strings.

Related

How to cut off parts of a string, which every string in a collection has

My current problem is the following:
I have a std::vector of full path names to files.
Now I want to cut off the common prefix of all strings.
Example
If I have these 3 strings in the vector:
/home/user/foo.txt
/home/user/bar.txt
/home/baz.txt
I would like to cut off /home/ from every string in the vector.
Question
Is there any method to achieve this in general?
I want an algorithm that drops the common prefix of all strings.
I currently only have an idea which solves this problem in O(n*m), where n is the number of strings and m is the length of the longest string, by comparing every string with every other string character by character.
Is there a faster or more elegant way of solving this?
This can be done entirely with std:: algorithms.
Synopsis:
Sort the input range if not already sorted. The first and last paths in the sorted range will be the most dissimilar. Best case is O(N), worst case O(N + N.logN).
Use std::mismatch to determine the largest common sequence between the two most dissimilar paths [insignificant].
Run through each path, erasing the first COUNT characters, where COUNT is the number of characters in the longest common sequence. O(N).
Best case time complexity: O(2N), worst case O(2N + N.logN) (can someone check that?)
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
std::string common_substring(const std::string& l, const std::string& r)
{
    return std::string(l.begin(),
                       std::mismatch(l.begin(), l.end(),
                                     r.begin(), r.end()).first);
}

std::string mutating_common_substring(std::vector<std::string>& range)
{
    if (range.empty())
        return std::string();
    else
    {
        if (not std::is_sorted(range.begin(), range.end()))
            std::sort(range.begin(), range.end());
        return common_substring(range.front(), range.back());
    }
}

std::vector<std::string> chop(std::vector<std::string> samples)
{
    auto str = mutating_common_substring(samples);
    for (auto& s : samples)
    {
        s.erase(s.begin(), std::next(s.begin(), str.size()));
    }
    return samples;
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };
    samples = chop(std::move(samples));
    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }
}
expected:
baz.txt
user/bar.txt
user/foo.txt
Here's an alternate `common_substring` which does not require a sort. Its time complexity is in theory O(N), but whether it's faster in practice you'd have to check:
std::string common_substring(const std::vector<std::string>& range)
{
    if (range.empty())
    {
        return {};
    }
    // note: std::accumulate requires #include <numeric>
    return std::accumulate(std::next(range.begin(), 1), range.end(), range.front(),
                           [](auto const& best, const auto& sample)
                           {
                               return common_substring(best, sample);
                           });
}
update:
Elegance aside, this is probably the fastest way since it avoids any memory allocations, performing all transformations in-place. For most architectures and sample sizes, this will matter more than any other performance consideration.
#include <iostream>
#include <vector>
#include <string>
void reduce_to_common(std::string& best, const std::string& sample)
{
    best.erase(std::mismatch(best.begin(), best.end(),
                             sample.begin(), sample.end()).first,
               best.end());
}

void remove_common_prefix(std::vector<std::string>& range)
{
    if (range.size())
    {
        auto iter = range.begin();
        auto best = *iter;
        for ( ; ++iter != range.end() ; )
        {
            reduce_to_common(best, *iter);
        }
        auto prefix_length = best.size();
        for (auto& s : range)
        {
            s.erase(s.begin(), std::next(s.begin(), prefix_length));
        }
    }
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };
    remove_common_prefix(samples);
    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }
}
You have to search every string in the list. However you don't need to compare all the characters in every string. The common prefix can only get shorter, so you only need to compare with "the common prefix so far". I don't think this changes the big-O complexity - but it will make quite a difference to the actual speed.
Also, these look like file names. Are they sorted (bearing in mind that many filesystems tend to return things in sorted order)? If so, you only need to consider the first and last elements. If they are probably or mostly ordered, then take the common prefix of the first and last, and then iterate through all the other strings, shortening the prefix further as necessary.
You just have to iterate over every string. You can only avoid iterating over the full length of strings needlessly by exploiting the fact that the prefix can only shorten:
#include <iostream>
#include <string>
#include <vector>
std::string common_prefix(const std::vector<std::string> &ss) {
    if (ss.empty())
        // no prefix
        return "";
    std::string prefix = ss[0];
    for (size_t i = 1; i < ss.size(); i++) {
        size_t c = 0; // index after which the strings differ
        for (; c < prefix.length(); c++) {
            if (prefix[c] != ss[i][c]) {
                // strings differ from character c on
                break;
            }
        }
        if (c == 0)
            // no common prefix
            return "";
        // the prefix is only up to character c-1, so resize prefix
        prefix.resize(c);
    }
    return prefix;
}

void strip_common_prefix(std::vector<std::string> &ss) {
    std::string prefix = common_prefix(ss);
    if (prefix.empty())
        // no common prefix, nothing to do
        return;
    // drop the common part, which is always the first prefix.length() characters
    for (std::string &s: ss) {
        s = s.substr(prefix.length());
    }
}

int main()
{
    std::vector<std::string> ss { "/home/user/foo.txt", "/home/user/bar.txt", "/home/baz.txt"};
    strip_common_prefix(ss);
    for (std::string &s: ss)
        std::cout << s << "\n";
}
Drawing from the hints of Martin Bonner's answer, you may implement a more efficient algorithm if you have more prior knowledge of your input.
In particular, if you know your input is sorted, it suffices to compare the first and last strings (see Richard's answer).
i - Find the file which has the least folder depth (i.e. baz.txt) - its root path is /home.
ii - Then go through the other strings to see if they start with that root.
iii - If so, then remove the root from all the strings.
Start with std::size_t index=0;. Scan the list to see if characters at that index match (note: past the end does not match). If it does, advance index and repeat.
When done, index will have the value of the length of the prefix.
At this point, I'd advise you to write or find a string_view type. If you do, simply create a string_view for each of your strings str with start/end of index, str.size().
Overall cost: O(|prefix|*N+N), which is also the cost to confirm that your answer is correct.
If you don't want to write a string_view, simply call str.erase(str.begin(), str.begin()+index) on each str in your vector.
Overall cost is O(|total string length|+N). The prefix has to be visited in order to confirm it, then the tail of the string has to be rewritten.
Now the cost of the breadth-first approach is poor locality, as you are touching memory all over the place. It will probably be more efficient in practice to do it in chunks: scan the first K strings up to length Q and find the common prefix, then chain that common prefix plus the next block. This won't change the O-notation, but it will improve locality of memory reference.
for (vector<string>::iterator itr = V.begin(); itr != V.end(); ++itr)
    itr->erase(0, 6);

Case insensitive sorting of an array of strings

Basically, I have to use selection sort to sort a string[]. I have done that part, but here is what I am having difficulty with:
The sort should be case-insensitive, so that "antenna" would come before "Jupiter". In ASCII, uppercase letters sort before lowercase ones, so would there not be a way to just swap the order of the sorted strings? Or is there a simpler solution?
void stringSort(string array[], int size) {
    int startScan, minIndex;
    string minValue;
    for (startScan = 0; startScan < (size - 1); startScan++) {
        minIndex = startScan;
        minValue = array[startScan];
        for (int index = startScan + 1; index < size; index++) {
            if (array[index] < minValue) {
                minValue = array[index];
                minIndex = index;
            }
        }
        array[minIndex] = array[startScan];
        array[startScan] = minValue;
    }
}
C++ provides you with sort which takes a comparison function. In your case with a vector<string> you'll be comparing two strings. The comparison function should return true if the first argument is smaller.
For our comparison function we'll want to find the first mismatched character between the strings after tolower has been applied. To do this we can use mismatch which takes a comparator between two characters returning true as long as they are equal:
const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(), [](const unsigned char lhs, const unsigned char rhs){return tolower(lhs) == tolower(rhs);});
To decide if the lhs is smaller than the rhs fed to mismatch we need to test 3 things:
Were the strings of unequal length
Was string lhs shorter
Or was the first mismatched char from lhs smaller than the first mismatched char from rhs
This evaluation can be performed by:
result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));
Ultimately, we'll want to wrap this up in a lambda and plug it back into sort as our comparator:
sort(foo.begin(), foo.end(), [](const string& lhs, const string& rhs){
    const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(),
                                 [](const unsigned char lhs, const unsigned char rhs){ return tolower(lhs) == tolower(rhs); });
    return result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));
});
This will correctly sort vector<string> foo. You can see a live example here: http://ideone.com/BVgyD2
EDIT:
Just saw your question update. You can use sort with string array[] as well. You'll just need to call it like this: sort(array, std::next(array, size), ...)
#include <algorithm>
#include <vector>
#include <string>
using namespace std;
void CaseInsensitiveSort(vector<string>& strs)
{
    sort(
        begin(strs),
        end(strs),
        [](const string& str1, const string& str2){
            return lexicographical_compare(
                begin(str1), end(str1),
                begin(str2), end(str2),
                [](const char& char1, const char& char2) {
                    return tolower(char1) < tolower(char2);
                }
            );
        }
    );
}
I use this lambda function to sort a vector of strings:
std::sort(entries.begin(), entries.end(), [](const std::string& a, const std::string& b) -> bool {
    for (size_t c = 0; c < a.size() and c < b.size(); c++) {
        if (std::tolower(a[c]) != std::tolower(b[c]))
            return (std::tolower(a[c]) < std::tolower(b[c]));
    }
    return a.size() < b.size();
});
Instead of the < operator, use a case-insensitive string comparison function.
C89/C99 provide strcoll (string collate), which does a locale-aware string comparison. It's available in C++ as std::strcoll. In some (most?) locales, like en_CA.UTF-8, A and a (and all accented forms of either) are in the same equivalence class. I think strcoll only compares within an equivalence class as a tiebreak if the whole string is otherwise equal, which gives a very similar sort order to a case-insensitive compare. Collation (at least in English locales on GNU/Linux) ignores some characters (like [). So ls /usr/share | sort gives output like
acpi-support
adduser
ADM_scripts
aglfn
aisleriot
I pipe through sort because ls does its own sorting, which isn't quite the same as sort's locale-based sorting.
If you want to sort some user-input arbitrary strings into an order that the user will see directly, locale-aware string comparison is usually what you want. Strings that differ only in case or accents won't compare equal, so this won't work if you were using a stable sort and depending on case-differing strings to compare equal, but otherwise you get nice results. Depending on the use-case, nicer than plain case-insensitive comparison.
FreeBSD's strcoll was and maybe still is case sensitive for locales other than POSIX (ASCII). That forum post suggests that on most other systems it is not case sensitive.
MSVC provides a _stricoll for case-insensitive collation, implying that its normal strcoll is case sensitive. However, this might just mean that the fallback to comparing within an equivalence class doesn't happen. Maybe someone can test the following example with MSVC.
// strcoll.c: show that these strings sort in a different order, depending on locale
#include <stdio.h>
#include <locale.h>

int main()
{
    // TODO: try some strings containing characters like '[' that strcoll ignores completely.
    const char * s[] = { "FooBar - abc", "Foobar - bcd", "FooBar - cde" };
#ifdef USE_LOCALE
    setlocale(LC_ALL, ""); // empty string means look at env vars
#endif
    // Return values are deliberately discarded; we observe them via ltrace below.
    strcoll(s[0], s[1]);
    strcoll(s[0], s[2]);
    strcoll(s[1], s[2]);
    return 0;
}
output of gcc -DUSE_LOCALE -Og strcoll.c && ltrace ./a.out (or run LANG=C ltrace a.out):
__libc_start_main(0x400586, 1, ...
setlocale(LC_ALL, "") = "en_CA.UTF-8" # my env contains LANG=en_CA.UTF-8
strcoll("FooBar - abc", "Foobar - bcd") = -1
strcoll("FooBar - abc", "FooBar - cde") = -2
strcoll("Foobar - bcd", "FooBar - cde") = -1
# the three strings are in order
+++ exited (status 0) +++
with gcc -Og -UUSE_LOCALE strcoll.c && ltrace ./a.out:
__libc_start_main(0x400536, ...
# no setlocale, so current locale is C
strcoll("FooBar - abc", "Foobar - bcd") = -32
strcoll("FooBar - abc", "FooBar - cde") = -2
strcoll("Foobar - bcd", "FooBar - cde") = 32 # s[1] should sort after s[2], so it's out of order
+++ exited (status 0) +++
POSIX.1-2001 provides strcasecmp. The POSIX spec says the results are "unspecified" for locales other than plain-ASCII, though, so I'm not sure whether common implementations handle utf-8 correctly or not.
See this post for portability issues with strcasecmp, e.g. to Windows. See other answers on that question for other C++ ways of doing case-insensitive string compares.
Once you have a case-insensitive comparison function, you can use it with other sort algorithms, like C standard lib qsort, or c++ std::sort, instead of writing your own O(n^2) selection-sort.
As b.buchhold's answer points out, doing a case-insensitive comparison on the fly might be slower than converting everything to lowercase once and sorting an array of indices. The lowercase version of each string is needed many times. std::strxfrm will transform a string so that strcmp on the result will give the same result as strcoll on the original string.
You could call tolower on every character you compare. This is probably the easiest, yet not a great solution, because:
You look at every char multiple times, so you'd call the method more often than necessary.
You need extra care to handle wide characters with respect to their encoding (UTF-8 etc.).
You could also replace the comparator with your own function. I.e., there will be some place where you compare something like stringone[i] < stringtwo[j] or charA < charB. Change it to my_less_than(stringone[i], stringtwo[j]) and implement the exact ordering you want.
Another way would be to transform every string to lowercase once and create an array of pairs. Then you base your comparisons on the lowercase value only, but you swap whole pairs so that your final strings will be in the right order as well.
Finally, you can create an array with lowercase versions and sort this one. Whenever you swap two elements in this one, you also swap the corresponding elements in the original array.
Note that all of these proposals would still need proper handling of wide characters (if you need that at all).
This solution is much simpler to understand than Jonathan Mee's and pretty inefficient, but for educational purposes it could be fine:
std::string lowercase( std::string s )
{
    std::transform( s.begin(), s.end(), s.begin(), ::tolower );
    return s;
}

std::sort( array, array + length,
           []( const std::string &s1, const std::string &s2 ) {
               return lowercase( s1 ) < lowercase( s2 );
           } );
if you have to use your sort function, you can use the same approach:
....
    minValue = lowercase( array[startScan] );
    for (int index = startScan + 1; index < size; index++) {
        const std::string &tstr = lowercase( array[index] );
        if (tstr < minValue) {
            minValue = tstr;
            minIndex = index;
        }
    }
    // note: the final swap must use array[minIndex], since minValue
    // now holds a lowercased copy rather than the original string
...

Optimizing a comparison over array elements with two conditions; C++ abstraction mechanisms?

My question is a follow-up to How to make this code faster (learning best practices)?, which has been put on hold (bummer). The problem is to optimize a loop over an array with floats which are tested for whether they lie within a given interval. Indices of matching elements in the array are to be stored in a provided result array.
The test includes two conditions (smaller than the upper threshold and bigger than the lower one). The obvious code for the test is if( elem <= upper && elem >= lower ) .... I observed that branching (including the implicit branch involved in the short-circuiting operator&&) is much more expensive than the second comparison. What I came up with is below. It is about 20%-40% faster than a naive implementation, more than I expected. It uses the fact that bool is an integer type. The condition test result is used as an index into two result arrays. Only one of them will contain the desired data, the other one can be discarded. This replaces program structure with data structure and computation.
I am interested in more ideas for optimization. "Technical hacks" (of the kind provided here) are welcome. I'm also interested in whether modern C++ could provide means to be faster, e.g. by enabling the compiler to create parallel running code. Think visitor pattern/functor. Computations on the single srcArr elements are almost independent, except that the order of indices in the result array depends on the order of testing the source array elements. I would loosen the requirements a little so that the order of the matching indices reported in the result array is irrelevant. Can anybody come up with a fast way?
Here is the source code of the function. A supporting main is below. gcc needs -std=c++11 because of chrono. VS 2013 express was able to compile this too (and created 40% faster code than gcc -O3).
#include <cstdlib>
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
/// Check all elements in srcArr whether they lie in
/// the interval [lower, upper]. Store the indices of
/// such elements in the array pointed to by destArr[1]
/// and return the number of matching elements found.
/// This has been highly optimized, mainly to avoid branches.
int findElemsInInterval( const float srcArr[],   // contains candidates
                         int **const destArr,    // two arrays to be filled with indices
                         const int arrLen,       // length of each array
                         const float lower, const float upper // interval
                       )
{
    // Instead of branching, use the condition
    // as an index into two distinct arrays. We need to keep
    // separate indices for both those arrays.
    int destIndices[2];
    destIndices[0] = destIndices[1] = 0;
    for( int srcInd=0; srcInd<arrLen; ++srcInd )
    {
        // If the element is inside the interval, both conditions
        // are true and therefore equal. In all other cases
        // exactly one condition is true so that they are not equal.
        // Matching elements' indices are therefore stored in destArr[1].
        // destArr[0] is a kind of a dummy (it will incidentally contain
        // indices of non-matching elements).
        // This used to be (with a simple int *destArr)
        // if( srcArr[srcInd] <= upper && srcArr[srcInd] >= lower ) destArr[destIndex++] = srcInd;
        int isInInterval = (srcArr[srcInd] <= upper) == (srcArr[srcInd] >= lower);
        destArr[isInInterval][destIndices[isInInterval]++] = srcInd;
    }
    return destIndices[1]; // the number of elements in the results array
}
int main(int argc, char *argv[])
{
    int arrLen = 1000*1000*100;
    if( argc > 1 ) arrLen = atol(argv[1]);
    // destArr[1] will hold the indices of elements which
    // are within the interval.
    int *destArr[2];
    // we don't check destination boundaries, so make them
    // the same length as the source.
    destArr[0] = new int[arrLen];
    destArr[1] = new int[arrLen];
    float *srcArr = new float[arrLen];
    // Create always the same numbers for comparison (don't srand).
    for( int srcInd=0; srcInd<arrLen; ++srcInd ) srcArr[srcInd] = rand();
    // Create an interval in the middle of the rand() spectrum
    float lowerLimit = RAND_MAX/3;
    float upperLimit = lowerLimit*2;
    cout << "lower = " << lowerLimit << ", upper = " << upperLimit << endl;
    int numInterval;
    auto t1 = high_resolution_clock::now(); // measure clock time as an approximation
    // Call the function a few times to get a longer run time
    for( int srcInd=0; srcInd<10; ++srcInd )
        numInterval = findElemsInInterval( srcArr, destArr, arrLen, lowerLimit, upperLimit );
    auto t2 = high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();
    cout << numInterval << " elements found in " << duration << " milliseconds. " << endl;
    return 0;
}
Thinking of the integer range check optimization of turning a <= x && x < b into ((unsigned)(x-a)) < b-a, a floating point variant comes to mind:
You could try something like
const float radius = (b-a)/2;
if( fabs( x-(a+radius) ) < radius )
...
to reduce the check to one conditional.
I see about a 10% speedup from this:
int destIndex = 0; // replaces destIndices
int isInInterval = (srcArr[srcInd] <= upper) == (srcArr[srcInd] >= lower);
destArr[1][destIndex] = srcInd;
destIndex += isInInterval;
Eliminate the pair of output arrays. Instead, only advance the 'number written' index by 1 if you want to keep the result; otherwise just keep overwriting the 'one past the end' slot.
I.e., retval[destIndex] = curIndex; destIndex += isInArray; -- better coherency and less wasted memory.
Write two versions: one that supports a fixed array length (of say 1024 or whatever) and another that supports a runtime parameter. Use a template argument to remove code duplication. Assume the length is less than that constant.
Have the function return the size and an RVO'd std::array<unsigned, 1024>.
Write a wrapper function that merges results (create all results, then merge them). Then throw the parallel patterns library at the problem (so the results get computed in parallel).
If you allow yourself vectorization using the SSE (or better, AVX) instruction set, you can perform 4/8 comparisons in a go, do this twice, 'and' the results, and retrieve the 4 results (-1 or 0). At the same time, this unrolls the loop.
// Preload the bounds
__m128 lo = _mm_set1_ps(lower);   // _mm_set1_ps broadcasts one value to all 4 lanes
__m128 up = _mm_set1_ps(upper);
int srcInd, dstIndex = 0;
int mask[4];
for (srcInd = 0; srcInd + 3 < arrLen; )
{
    __m128 src = _mm_load_ps(&srcArr[srcInd]); // Load 4 values (srcArr must be 16-byte aligned)
    __m128 tst = _mm_and_ps(_mm_cmple_ps(src, up), _mm_cmpge_ps(src, lo)); // Test both bounds
    _mm_storeu_si128((__m128i*)mask, _mm_castps_si128(tst)); // -1 where inside, 0 where outside
    // Copy the 4 indexes with conditional incrementation
    dstArr[dstIndex] = srcInd++; dstIndex -= mask[0];
    dstArr[dstIndex] = srcInd++; dstIndex -= mask[1];
    dstArr[dstIndex] = srcInd++; dstIndex -= mask[2];
    dstArr[dstIndex] = srcInd++; dstIndex -= mask[3];
}
CAUTION: unchecked code.

Why is splitting a string slower in C++ than Python?

I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).
Python Code:
#!/usr/bin/env python
from __future__ import print_function
import time
import sys

count = 0
start_time = time.time()
dummy = None
for line in sys.stdin:
    dummy = line.split()
    count += 1
delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print(" Crunch Speed: {0}".format(lps))
else:
    print('')
C++ Code:
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>
using namespace std;
void split1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);
    cin.sync_with_stdio(false); //disable synchronous IO
    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse
        //I'm trying one of the two implementations, per compilation, obviously:
        // split1(spline, input_line);
        split2(spline, input_line);
        count++;
    };
    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << " Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}
//compiled with: g++ -Wall -O3 -o split1 split_1.cpp
Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge sequences of delimiters as well as handle multiple delimiter characters (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimiter character (that one was posted by several StackOverflow users in answers to string-splitting questions).
I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"
Results:
$ /usr/bin/time cat test_lines_double | ./split.py
15.61 real 0.01 user 0.38 sys
Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
23.50 real 0.01 user 0.46 sys
C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
44.69 real 0.02 user 0.62 sys
C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444
What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?
Edit 1 / Partial Solution?:
I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:
for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1
The performance of python is now about the same as the split1 C++ implementation.
/usr/bin/time cat test_lines_double | ./split5.py
22.61 real 0.01 user 0.40 sys
Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090
I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)
Thanks to all for your help.
Final Edit/Solution:
Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):
$ /usr/bin/time cat test_lines_double | ./split6
15.09 real 0.01 user 0.45 sys
C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333
My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.
One of the lessons from this issue and yesterday's stdin line reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.
Thanks again to all for your suggestions!
As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.
If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).
The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>
using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;

    time_t start = time(NULL);

    cin.sync_with_stdio(false); // disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); // empty the vector for the next line to parse
        //I'm trying one of the two implementations, per compilation, obviously:
        // split1(spline, input_line);
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );

        count++;
    };
    count--; // subtract for final over-read

    sec = (int) time(NULL) - start;

    cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << " Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}
//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x
Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.
I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.
Using strtok_r (reentrant variant of strtok):
void splitc1(vector<string> &tokens, const string &str,
             const string &delimiters = " ") {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(str.size() + 1);
    strcpy(cpy, str.c_str());

    for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}
Additionally using character strings for parameters, and fgets for input:
void splitc2(vector<string> &tokens, const char *str,
             const char *delimiters) {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(strlen(str) + 1);
    strcpy(cpy, str);

    for(token = strtok_r(cpy, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}
And, in some cases, where destroying the input string is acceptable:
void splitc3(vector<string> &tokens, char *str,
             const char *delimiters) {
    char *saveptr;
    char *token;

    for(token = strtok_r(str, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }
}
The timings for these are as follows (including my results for the other variants from the question and the accepted answer):
split1.cpp: C++ : Saw 20000000 lines in 31 seconds. Crunch speed: 645161
split2.cpp: C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444
split.py: Python: Saw 20000000 lines in 33 seconds. Crunch Speed: 606060
split5.py: Python: Saw 20000000 lines in 35 seconds. Crunch Speed: 571428
split6.cpp: C++ : Saw 20000000 lines in 18 seconds. Crunch speed: 1111111
splitc1.cpp: C++ : Saw 20000000 lines in 27 seconds. Crunch speed: 740740
splitc2.cpp: C++ : Saw 20000000 lines in 22 seconds. Crunch speed: 909090
splitc3.cpp: C++ : Saw 20000000 lines in 20 seconds. Crunch speed: 1000000
As we can see, the solution from the accepted answer is still fastest.
For anyone who would want to do further tests, I also put up a Github repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.
I suspect that this is because of the way std::vector gets resized during push_back() calls. If you use std::list, or std::vector::reserve() to reserve enough space for the tokens up front, you should get much better performance. Or you could use a combination of both, like below for split1():
void split1(vector<string> &tokens, const string &str,
            const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    list<string> token_list;
    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the list
        token_list.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
    tokens.assign(token_list.begin(), token_list.end());
}
EDIT: The other obvious thing I see is that the Python variable dummy gets reassigned each time rather than modified in place, so it's not a fair comparison against C++. You should try modifying your Python code to initialize it with dummy = [] and then do dummy += line.split(). Can you report the runtime after this change?
EDIT2: To make it even more fair can you modify the while loop in C++ code to be:
while(cin) {
    getline(cin, input_line);
    std::vector<string> spline; // create a new vector
    //I'm trying one of the two implementations, per compilation, obviously:
    // split1(spline, input_line);
    split2(spline, input_line);
    count++;
};
I think the following code is better, using some C++17 and C++14 features:
// This code was untested when I wrote this post, but I'll test it
// when I'm free, and I sincerely welcome others to test and modify
// this code.
// C++17
#include <istream> // For std::istream.
#include <string_view> // new feature in C++17, sizeof(std::string_view) == 16 in libc++ on my x86-64 debian 9.4 computer.
#include <string>
#include <utility> // C++14 feature std::move.
template <template <class...> class Container, class Allocator>
void split1(Container<std::string_view, Allocator> &tokens,
            std::string_view str,
            std::string_view delimiter = " ")
{
    /*
     * The model of the input string:
     *
     * (optional) delimiter | content | delimiter | content | delimiter |
     * ... | delimiter | content
     *
     * Using std::string::find_first_not_of or
     * std::string_view::find_first_not_of is a bad idea, because it
     * actually does the following thing:
     *
     *     Finds the first character not equal to any of the characters
     *     in the given character sequence.
     *
     * This means it does not treat your delimiter as a whole, but as
     * a group of characters.
     *
     * This has 2 effects:
     *
     * 1. When your delimiter is not a single character, this function
     *    won't behave as you predicted.
     *
     * 2. When your delimiter is just a single character, the function
     *    may have additional overhead, because it has to check every
     *    character against a range of characters; although there's
     *    only one, in order to assure correctness it still has an
     *    inner loop, which adds to the overhead.
     *
     * So, as a solution, I wrote the following code.
     *
     * The code below will skip the first delimiter prefix.
     * However, if there's nothing between 2 delimiters, this code will
     * still treat it as if there's something there.
     *
     * Note:
     * Here I use the C++ std version of the substring search algorithm,
     * but you can change it to Boyer-Moore, KMP (takes additional
     * memory), Rabin-Karp or another algorithm to speed up your code.
     */

    // Establish loop invariant 1.
    typename std::string_view::size_type
        next,
        delimiter_size = delimiter.size(),
        pos = str.find(delimiter) ? 0 : delimiter_size;

    // The loop invariants:
    // 1. At pos, the content that should be saved begins.
    // 2. The position of the next delimiter is stored in next, which
    //    may also be std::string_view::npos.
    do {
        // Find the next delimiter, maintaining loop invariant 2.
        next = str.find(delimiter, pos);

        // Found a token, add it to the container. Note that substr
        // takes (position, count), so the count is next - pos; when
        // next == npos the count is clamped to the rest of the string.
        tokens.push_back(str.substr(pos, next - pos));

        // Skip the delimiter, maintaining loop invariant 1.
        // When next == std::string_view::npos the loop terminates,
        // so it doesn't matter that next + delimiter_size wraps
        // around in that case.
        pos = next + delimiter_size;
    } while(next != std::string_view::npos);
}
template <template <class...> class Container, class traits,
          class Allocator2, class Allocator>
void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens,
            std::istream &stream,
            char delimiter = ' ')
{
    std::basic_string<char, traits, Allocator2> item;

    // Unfortunately, std::getline can only accept a single-character
    // delimiter.
    while(std::getline(stream, item, delimiter))
        // Move item into tokens. After the move, item is in a valid
        // but unspecified state, and getline will overwrite it.
        tokens.push_back(std::move(item));
}
The choice of container:
std::vector.
Assuming the initial size of the allocated internal array is 1 and the ultimate size is N, you will allocate and deallocate log2(N) times, and you will copy elements (2^(log2(N) + 1) - 1) = (2N - 1) times. As pointed out in Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?, this can give poor performance when the size of the vector is unpredictable and could be very large.
But if you can estimate its size, this will be less of a problem.
std::list.
For every push_back, the time consumed is constant, but it will probably take more time than std::vector does on an individual push_back. Using a per-thread memory pool and a custom allocator can ease this problem.
std::forward_list.
Same as std::list, but occupy less memory per element. Require a wrapper class to work due to the lack of API push_back.
std::array.
If you know the limit of growth, then you can use std::array. Of course, you can't use it directly, since it doesn't have a push_back API. But you can define a wrapper, and I think this is the fastest option here; it can also save some memory if your estimate is quite accurate.
std::deque.
This option allows you to trade memory for performance. There will be none of the (2N - 1) element copies, just N allocations and no deallocation. Also, you'll have constant random access time, and the ability to add new elements at both ends.
According to std::deque-cppreference
On the other hand, deques typically have large minimal memory cost; a
deque holding just one element has to allocate its full internal array
(e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size
or 4096 bytes, whichever is larger, on 64-bit libc++)
or you can use combo of these:
std::vector< std::array<T, 2 ^ M> >
This is similar to std::deque; the difference is just that this container doesn't support adding elements at the front. But it is still faster, because it won't copy the underlying std::array (2^(N + 1) - 1) times; it will just copy the pointer array (2^(N - M + 1) - 1) times, allocating a new array only when the current one is full and never needing to deallocate anything. By the way, you get constant random access time.
std::list< std::array<T, ...> >
Greatly eases the pressure of memory fragmentation. It will only allocate a new array when the current one is full, and does not need to copy anything. You will still have to pay the price of an additional pointer compared to combo 1.
std::forward_list< std::array<T, ...> >
Same as combo 2, but costs the same memory as combo 1.
You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?
If you take the split1 implementation and change the signature to more closely match that of split2, by changing this:
void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")
to this:
void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')
You get a more dramatic difference between split1 and split2, and a fairer comparison:
split1 C++ : Saw 10000000 lines in 41 seconds. Crunch speed: 243902
split2 C++ : Saw 10000000 lines in 144 seconds. Crunch speed: 69444
split1' C++ : Saw 10000000 lines in 33 seconds. Crunch speed: 303030
void split5(vector<string> &tokens, const string &str, char delim=' ') {
    enum { do_token, do_delim } state = do_delim;
    int idx = 0, tok_start = 0;
    for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
        switch (state) {
            case do_token:
                if (it == str.end()) {
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                    return;
                }
                else if (*it == delim) {
                    state = do_delim;
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                }
                break;

            case do_delim:
                if (it == str.end()) {
                    return;
                }
                if (*it != delim) {
                    state = do_token;
                    tok_start = idx;
                }
                break;
        }
    }
}
I suspect that this is related to buffering on sys.stdin in Python versus the lack of buffering in the C++ implementation.
See this post for details on how to change the buffer size, then try the comparison again:
Setting smaller buffer size for sys.stdin?

How to iterate over std::vector<char> and find null-terminated c-strings

I have three questions based on the following code fragments
I have a list of strings. It just happens to be a vector but could potentially be any source
vector<string> v1_names = boost::assign::list_of("Antigua and Barbuda")( "Brasil")( "Papua New Guinea")( "Togo");
The following is to store lengths of each name
vector<int> name_len;
the following is where I want to store the strings
std::vector<char> v2_names;
estimate memory required to copy names from v1_names
v2_names.reserve( v1_names.size()*20 + 4 );
Question: is this the best way to estimate storage? I fix the max length at 20, which is OK, then add space for the null terminators.
Now copy the names
for( std::vector<std::string>::size_type i = 0; i < v1_names.size(); ++i)
{
    std::string val( v1_names[i] );
    name_len.push_back(val.length());
    for(std::string::iterator it = val.begin(); it != val.end(); ++it)
    {
        v2_names.push_back( *it );
    }
    v2_names.push_back('\0');
}
Question: is this the most efficient way to copy the elements from v1_name to v2_names?
Main Question: How do I iterate over v2_names and print the country names contained in v2_names
Use simple join, profit!
#include <boost/algorithm/string/join.hpp>
#include <boost/assign/list_of.hpp>
#include <string>
#include <vector>

int main(int, char **)
{
    std::vector<std::string> v1_names =
        boost::assign::list_of("Antigua and Barbuda")("Brasil")("Papua New Guinea")("Togo");
    // Note: pass the separator as std::string("\0", 1); the literal "\0"
    // would be read as an empty C string, giving no separator at all.
    std::string joined = boost::algorithm::join(v1_names, std::string("\0", 1));
}
To estimate storage, you should probably measure the strings, rather than rely on a hard-coded constant 20. For example:
size_t total = 0;
for (std::vector<std::string>::iterator it = v1_names.begin(); it != v1_names.end(); ++it) {
    total += it->size() + 1;
}
The main inefficiency in your loop is probably that you take an extra copy of each string in turn: std::string val( v1_names[i] ); could instead be const std::string &val = v1_names[i];.
To append each string, you can use the insert function:
v2_names.insert(v2_names.end(), val.begin(), val.end());
v2_names.push_back(0);
This isn't necessarily the most efficient, since there's a certain amount of redundant checking of available space in the vector, but it shouldn't be too bad and it's simple. An alternative would be to size v2_names at the start rather than reserving space, and then copy data (with std::copy) rather than appending it. But either one of them might be faster, and it shouldn't make a lot of difference.
For the main question, if all you have is v2_names and you want to print the strings, you could do something like this:
const char *p = &v2_names.front();
while (p <= &v2_names.back()) {
    std::cout << p << "\n";
    p += strlen(p) + 1;
}
If you also have name_len:
size_t offset = 0;
for (std::vector<int>::iterator it = name_len.begin(); it != name_len.end(); ++it) {
    std::cout << &v2_names[offset] << "\n";
    offset += *it + 1;
}
Beware that the type of name_len is technically wrong - it's not guaranteed that you can store a string length in an int. That said, even if int is smaller than size_t in a particular implementation, strings that big will still be pretty rare.
The best way to compute the required storage is to sum up the length of each string in v1_names.
For your second question: instead of the inner for loop, you could just use vector's range insert overload, passing the string's begin and end iterators.
For your third question: Just don't do that. Iterate over v1_names's strings instead. The only reason to ever create such a thing as v2_names is to pass it into a legacy C API and then you don't have to worry about iterating over it.
If you want to concatenate all the strings, you could just use a single pass and rely on amortized O(1) insertions:
name_len.reserve(v1_names.size());
// v2_names.reserve( ??? ); // only if you have a good heuristic or
//                          // if you can determine this efficiently

for (auto it = v1_names.cbegin(); it != v1_names.cend(); ++it)
{
    name_len.push_back(it->size());
    v2_names.insert(v2_names.end(), it->c_str(), it->c_str() + it->size() + 1);
}
You could precompute the total length by another loop before this and call reserve if you think this will help. It depends on how well you know the strings. But perhaps there's no point worrying, since in the long run the insertions are O(1).