C++ Faster String Parsing? - c++

I am trying to write a parser to parse this dataset: http://research.microsoft.com/en-us/projects/mslr/
I wrote the code below, but it is too slow: it takes almost a full minute to parse a few hundred megabytes of data.
I ran a profiler and it said most of the time was spent in boost::split (30%) and boost::lexical_cast (40%). Any suggestions on how to speed up the code?
Thanks.
std::ifstream train("letor/Fold1/train.txt", std::ifstream::in | std::ifstream::binary);
m_train.clear();
for (std::size_t i = 0; i < 10000 && train.good(); i++) {
std::size_t query;
boost::numeric::ublas::vector<double> features;
double label;
features.resize(get_feature_size(), false);
// 2 qid:1 1:3 2:3 3:0 4:0 5:3 6:1 7:1 8:0 9:0 10:1 11:156 12:4 13:0 14:7 15:167 16:6.931275 17:22.076928 18:19.673353 19:22.255383 20:6.926551 21:3 22:3 23:0 24:0 25:6 26:1 27:1 28:0 29:0 30:2 31:1 32:1 33:0 34:0 35:2 36:1 37:1 38:0 39:0 40:2 41:0 42:0 43:0 44:0 45:0 46:0.019231 47:0.75000 48:0 49:0 50:0.035928 51:0.00641 52:0.25000 53:0 54:0 55:0.011976 56:0.00641 57:0.25000 58:0 59:0 60:0.011976 61:0.00641 62:0.25000 63:0 64:0 65:0.011976 66:0 67:0 68:0 69:0 70:0 71:6.931275 72:22.076928 73:0 74:0 75:13.853103 76:1.152128 77:5.99246 78:0 79:0 80:2.297197 81:3.078917 82:8.517343 83:0 84:0 85:6.156595 86:2.310425 87:7.358976 88:0 89:0 90:4.617701 91:0.694726 92:1.084169 93:0 94:0 95:2.78795 96:1 97:1 98:0 99:0 100:1 101:1 102:1 103:0 104:0 105:1 106:12.941469 107:20.59276 108:0 109:0 110:16.766961 111:-18.567793 112:-7.760072 113:-20.838749 114:-25.436074 115:-14.518523 116:-21.710022 117:-21.339609 118:-24.497864 119:-27.690319 120:-20.203779 121:-15.449379 122:-4.474452 123:-23.634899 124:-28.119826 125:-13.581932 126:3 127:62 128:11089534 129:2 130:116 131:64034 132:13 133:3 134:0 135:0 136:0
std::string line;
getline(train, line);
boost::algorithm::trim(line);
std::vector<std::string> tokens;
boost::split(tokens, line, boost::is_any_of(" "));
assert(tokens.size() == 138);
label = boost::lexical_cast<double>(tokens[0]);
query = boost::lexical_cast<std::size_t>(tokens[1].substr(tokens[1].find(":") + 1, tokens[1].size()));
for (std::size_t i = 2; i < tokens.size(); i++) {
features[i - 2] = boost::lexical_cast<double>(tokens[i].substr(tokens[i].find(":") + 1, tokens[i].size()));
}
m_train.push_back(query, features, label);
train.peek();
}

If I understand your format correctly, each line starts with a number followed by colon-separated pairs. The first pair on each line has a special meaning and consists of an std::string and a size_t, while all other pairs consist of an index (which is ignored) and a double. There is no reason to use Boost for this at all: use IOStreams directly:
std::streamsize max(std::numeric_limits<std::streamsize>::max());
std::string line;
std::istringstream in;
for (std::size_t i(0); i < 1000 && std::getline(train, line); ++i) {
double label;
size_t query;
in.clear();
in.str(line);
if ((in >> label).ignore(max, ':') >> query) {
// ublas::vector has no push_back, so size it up front and index into it
boost::numeric::ublas::vector<double> features(get_feature_size());
std::size_t count(0);
double feature;
while (count < features.size() && in.ignore(max, ':') >> feature) {
features[count++] = feature;
}
assert(count == 136);
m_train.push_back(query, features, label);
}
}
Note that this code is careful to check that reads are actually successful. Your code checked ahead of time whether the read would be successful, but this doesn't work reliably. For example, if your last line consisted of just a spurious space, your assert() would trigger, which is hardly what you want.

Chopping the string multiple times requires memory allocations and deallocations. You could go with the good old strtod and char pointers to avoid splitting the string. That will take care of much of the 30% spent in string tokenization. As for the 40% spent converting strings to doubles, that probably cannot be significantly improved.
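A minimal sketch of that strtod approach, assuming lines of the form shown above ("label qid:N 1:v 2:v ..."); the function name and signature are illustrative:
#include <cstddef>
#include <cstdlib>
#include <vector>

// Parses one line with strtod/strtoul only; no substrings are allocated.
bool parse_line(const char* p, double& label, std::size_t& query,
                std::vector<double>& features)
{
    char* end;
    label = std::strtod(p, &end);            // leading label
    if (end == p) return false;
    p = end;
    while (*p != '\0' && *p != ':') ++p;     // skip "qid" up to its colon
    if (*p != ':') return false;
    query = std::strtoul(p + 1, &end, 10);   // query id
    p = end;
    for (;;) {                               // the "index:value" pairs
        while (*p != '\0' && *p != ':') ++p; // skip the (ignored) index
        if (*p != ':') break;
        double value = std::strtod(p + 1, &end);
        if (end == p + 1) break;             // no number after the colon
        features.push_back(value);
        p = end;
    }
    return true;
}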
If you want to go for a quick, dirty, and amazingly ugly but probably the fastest C-only solution, try this; the test below completed in about 35 seconds on an E8300 2.83 GHz CPU. It assumes that all lines have exactly the same format.
#include "stdio.h"
void main ()
{
const char* test_str = "2 qid:1 1:3 2:3 3:0 4:0 5:3 6:1 7:1 8:0 9:0 10:1 11:156 12:4 13:0 14:7 15:167 16:6.931275 17:22.076928 18:19.673353 19:22.255383 20:6.926551 21:3 22:3 23:0 24:0 25:6 26:1 27:1 28:0 29:0 30:2 31:1 32:1 33:0 34:0 35:2 36:1 37:1 38:0 39:0 40:2 41:0 42:0 43:0 44:0 45:0 46:0.019231 47:0.75000 48:0 49:0 50:0.035928 51:0.00641 52:0.25000 53:0 54:0 55:0.011976 56:0.00641 57:0.25000 58:0 59:0 60:0.011976 61:0.00641 62:0.25000 63:0 64:0 65:0.011976 66:0 67:0 68:0 69:0 70:0 71:6.931275 72:22.076928 73:0 74:0 75:13.853103 76:1.152128 77:5.99246 78:0 79:0 80:2.297197 81:3.078917 82:8.517343 83:0 84:0 85:6.156595 86:2.310425 87:7.358976 88:0 89:0 90:4.617701 91:0.694726 92:1.084169 93:0 94:0 95:2.78795 96:1 97:1 98:0 99:0 100:1 101:1 102:1 103:0 104:0 105:1 106:12.941469 107:20.59276 108:0 109:0 110:16.766961 111:-18.567793 112:-7.760072 113:-20.838749 114:-25.436074 115:-14.518523 116:-21.710022 117:-21.339609 118:-24.497864 119:-27.690319 120:-20.203779 121:-15.449379 122:-4.474452 123:-23.634899 124:-28.119826 125:-13.581932 126:3 127:62 128:11089534 129:2 130:116 131:64034 132:13 133:3 134:0 135:0 136:0";
const char* format = "%lf qid:%lf 1:%lf 2:%lf 3:%lf 4:%lf 5:%lf 6:%lf 7:%lf 8:%lf 9:%lf 10:%lf 11:%lf 12:%lf 13:%lf 14:%lf 15:%lf 16:%lf 17:%lf 18:%lf 19:%lf 20:%lf 21:%lf 22:%lf 23:%lf 24:%lf 25:%lf 26:%lf 27:%lf 28:%lf 29:%lf 30:%lf 31:%lf 32:%lf 33:%lf 34:%lf 35:%lf 36:%lf 37:%lf 38:%lf 39:%lf 40:%lf 41:%lf 42:%lf 43:%lf 44:%lf 45:%lf 46:%lf 47:%lf 48:%lf 49:%lf 50:%lf 51:%lf 52:%lf 53:%lf 54:%lf 55:%lf 56:%lf 57:%lf 58:%lf 59:%lf 60:%lf 61:%lf 62:%lf 63:%lf 64:%lf 65:%lf 66:%lf 67:%lf 68:%lf 69:%lf 70:%lf 71:%lf 72:%lf 73:%lf 74:%lf 75:%lf 76:%lf 77:%lf 78:%lf 79:%lf 80:%lf 81:%lf 82:%lf 83:%lf 84:%lf 85:%lf 86:%lf 87:%lf 88:%lf 89:%lf 90:%lf 91:%lf 92:%lf 93:%lf 94:%lf 95:%lf 96:%lf 97:%lf 98:%lf 99:%lf 100:%lf 101:%lf 102:%lf 103:%lf 104:%lf 105:%lf 106:%lf 107:%lf 108:%lf 109:%lf 110:%lf 111:%lf 112:%lf 113:%lf 114:%lf 115:%lf 116:%lf 117:%lf 118:%lf 119:%lf 120:%lf 121:%lf 122:%lf 123:%lf 124:%lf 125:%lf 126:%lf 127:%lf 128:%lf 129:%lf 130:%lf 131:%lf 132:%lf 133:%lf 134:%lf 135:%lf 136:%lf";
double data[138];
for (int i = 0; i < 500000; i++)
{
sscanf(test_str, format,
data+0, data+1, data+2, data+3, data+4, data+5,
data+6, data+7, data+8, data+9, data+10, data+11,
data+12, data+13, data+14, data+15, data+16, data+17,
data+18, data+19, data+20, data+21, data+22, data+23,
data+24, data+25, data+26, data+27, data+28, data+29,
data+30, data+31, data+32, data+33, data+34, data+35,
data+36, data+37, data+38, data+39, data+40, data+41,
data+42, data+43, data+44, data+45, data+46, data+47,
data+48, data+49, data+50, data+51, data+52, data+53,
data+54, data+55, data+56, data+57, data+58, data+59,
data+60, data+61, data+62, data+63, data+64, data+65,
data+66, data+67, data+68, data+69, data+70, data+71,
data+72, data+73, data+74, data+75, data+76, data+77,
data+78, data+79, data+80, data+81, data+82, data+83,
data+84, data+85, data+86, data+87, data+88, data+89,
data+90, data+91, data+92, data+93, data+94, data+95,
data+96, data+97, data+98, data+99, data+100, data+101,
data+102, data+103, data+104, data+105, data+106, data+107,
data+108, data+109, data+110, data+111, data+112, data+113,
data+114, data+115, data+116, data+117, data+118, data+119,
data+120, data+121, data+122, data+123, data+124, data+125,
data+126, data+127, data+128, data+129, data+130, data+131,
data+132, data+133, data+134, data+135, data+136, data+137);
}
}
C99 has vsscanf, which would make it look better. The format string can then be pre-generated dynamically once before the loop, depending on the dataset format. Make sure to check that the return value of sscanf is exactly 138 in this sample.
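For instance, the format string for this sample could be pre-generated like this (a sketch; the feature count is a parameter rather than hard-coded):
#include <string>

// Builds "%lf qid:%lf 1:%lf 2:%lf ... N:%lf" for a given feature count.
std::string make_format(int n_features)
{
    std::string fmt = "%lf qid:%lf";
    for (int i = 1; i <= n_features; ++i)
        fmt += " " + std::to_string(i) + ":%lf";
    return fmt;
}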
EDIT: Dietmar Kühl's solution looks clean, and should not be significantly slower, if at all, than a single sscanf. It is best to use the code above only as a benchmark reference.

It's hard to say without some experimentation, but...
I'd start by dropping boost::split. It's making a
std::vector<std::string>, which in turn involves a lot of dynamic
allocation and copying. What you probably want to do is write some sort
of iterator over a string in which ++ advances to the next token, and
* returns a pair of iterators defining the current token. This avoids
the intermediate data structure.
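A sketch of that idea follows; the TextRange and TokenIterator names are mine, and this is only one possible shape for such a class:
#include <string>

// A pair of iterators into the original line; no characters are copied.
struct TextRange {
    std::string::const_iterator first, last;
    std::string::const_iterator begin() const { return first; }
    std::string::const_iterator end() const { return last; }
};

class TokenIterator {
    std::string::const_iterator cur_, end_;
    TextRange tok_;
    void advance() {
        while (cur_ != end_ && *cur_ == ' ') ++cur_;  // skip delimiters
        tok_.first = cur_;
        while (cur_ != end_ && *cur_ != ' ') ++cur_;  // scan the token
        tok_.last = cur_;
    }
public:
    explicit TokenIterator(std::string const& s)
        : cur_(s.begin()), end_(s.end()) { advance(); }
    bool atEnd() const { return tok_.first == end_; }  // no token left
    TokenIterator& operator++() { advance(); return *this; }
    TextRange const& operator*() const { return tok_; }
};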
You can then define a << operator on this pair, something like:
std::ostream&
operator<<( std::ostream& dest, TextRange const& token )
{
std::copy( token.begin(), token.end(),
std::ostream_iterator<char>( dest ) );
return dest;
}
Reducing the time used by boost::lexical_cast will be more difficult.
Basically, boost::lexical_cast is using << to insert the source into
a std::stringstream, and >> to extract it. You can write something
similar, but using your own streambuf, based on the pair of iterators
(very simple), so given the pair of iterators, 1) you don't have to use
<< at all, since the pair of iterators becomes the istream, and 2) you
completely avoid creating any intermediate std::string. (You do
not want to reimplement the conversion routines in std::istream.)
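Such a streambuf can be surprisingly small. A minimal sketch over a plain character range (which &* on the iterators above would give you); the names are illustrative:
#include <istream>
#include <streambuf>

// A read-only streambuf whose get area is the range itself; nothing is copied.
class RangeBuf : public std::streambuf {
public:
    RangeBuf(char const* begin, char const* end) {
        // setg() wants non-const pointers, but we never write through them.
        char* b = const_cast<char*>(begin);
        setg(b, b, const_cast<char*>(end));
    }
};

double rangeToDouble(char const* begin, char const* end)
{
    RangeBuf buf(begin, end);
    std::istream in(&buf);  // the pair of iterators effectively becomes the istream
    double value = 0.0;
    in >> value;
    return value;
}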

Related

C++ - checking a string for all values in an array

I have some parsed text from the Vision API, and I'm filtering it using keywords, like so:
if (finalTextRaw.find("File") != finalTextRaw.npos)
{
LogMsg("Found Menubar");
}
E.g., if the keyword "File" is found anywhere within the string finalTextRaw, then the function is interrupted and a log message is printed.
This method is very reliable. But I've inefficiently just made a bunch of if-else-if statements in this fashion, and as I'm finding more words that need filtering, I'd rather be a little more efficient. Instead, I'm now getting a string from a config file, and then parsing that string into an array:
string filterWords = GetApp()->GetFilter();
std::replace(filterWords.begin(), filterWords.end(), ',', ' '); ///replace ',' with ' '
vector<string> array;
stringstream ss(filterWords);
string temp;
while (ss >> temp)
array.push_back(temp); ///create an array of filtered words
And I'd like to have just one if statement for checking that string against the array, instead of many of them for checking the string against each keyword I'm having to manually specify in the code. Something like this:
if (finalTextRaw.find(array) != finalTextRaw.npos)
{
LogMsg("Found filtered word");
}
Of course, that syntax doesn't work, and it's surely more complicated than that, but hopefully you get the idea: if any words from my array appear anywhere in that string, that string should be ignored and a log message printed instead.
Any ideas how I might fashion such a function? I'm guessing it's going to necessitate some kind of loop.
Borrowing from Thomas's answer, a ranged for loop offers a neat solution:
for (const auto &word : words)
{
if (finalTextRaw.find(word) != std::string::npos)
{
// word is found.
// do stuff here or call a function.
break; // stop the loop.
}
}
As pointed out by Thomas, the most efficient way is to split both texts into a list of words. Then use std::set_intersection to find occurrences in both lists. You can use std::vector as long as it is sorted. You end up with O(n*log(n)) (with n = max words), rather than O(n*m).
Split sentences to words:
auto split(std::string_view sentence) {
auto result = std::vector<std::string>{};
auto stream = std::istringstream{std::string{sentence}}; // .data() need not be null-terminated
std::copy(std::istream_iterator<std::string>(stream),
std::istream_iterator<std::string>(), std::back_inserter(result));
return result;
}
Find words existing in both lists. This only works for sorted lists (like sets or manually sorted vectors).
auto intersect(std::vector<std::string> a, std::vector<std::string> b) {
std::sort(a.begin(), a.end());
std::sort(b.begin(), b.end());
auto result = std::vector<std::string>{};
std::set_intersection(std::move_iterator{a.begin()},
std::move_iterator{a.end()},
b.cbegin(), b.cend(),
std::back_inserter(result));
return result;
}
An example of how to use it:
int main() {
const auto result = intersect(split("hello my name is mister raw"),
split("this is the final raw text"));
for (const auto& word: result) {
// do something with word
}
}
Note that this makes sense when working with a large or unknown number of words. If you know the limits, you might want to use simpler solutions (provided by other answers).
You could use a fundamental, brute force, loop:
unsigned int quantity_words = array.size();
for (unsigned int i = 0; i < quantity_words; ++i)
{
std::string word = array[i];
if (finalTextRaw.find(word) != std::string::npos)
{
// word is found.
// do stuff here or call a function.
break; // stop the loop.
}
}
The above loop takes each word in the array and searches the finalTextRaw for the word.
There are better methods using some std algorithms. I'll leave that for other answers.
Edit 1: maps and association
The above code is bothering me because there are too many passes through the finalTextRaw string.
Here's another idea:
Create a std::set using the words in finalTextRaw.
For each word in your array, check for existence in the set.
This reduces the quantity of searches (it's like searching a tree).
You should also investigate creating a set of the words in array and finding the intersection between the two sets.
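A sketch of that set-based variant; note that, unlike find(), it matches whole words only, not substrings:
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

bool containsFilteredWord(const std::string& finalTextRaw,
                          const std::vector<std::string>& words)
{
    // Split finalTextRaw once into a set of its words.
    std::istringstream ss(finalTextRaw);
    std::set<std::string> textWords{std::istream_iterator<std::string>(ss),
                                    std::istream_iterator<std::string>()};
    // Each filter word is now a single O(log n) lookup.
    for (const auto& word : words)
        if (textWords.count(word) != 0)
            return true;
    return false;
}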

Why is splitting a string slower in C++ than Python?

I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).
Python Code:
#!/usr/bin/env python
from __future__ import print_function
import time
import sys
count = 0
start_time = time.time()
dummy = None
for line in sys.stdin:
dummy = line.split()
count += 1
delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
lps = int(count/delta_sec)
print(" Crunch Speed: {0}".format(lps))
else:
print('')
C++ Code:
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>
using namespace std;
void split1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
// Skip delimiters at beginning
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find first non-delimiter
string::size_type pos = str.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the vector
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter
pos = str.find_first_of(delimiters, lastPos);
}
}
void split2(vector<string> &tokens, const string &str, char delim=' ') {
stringstream ss(str); //convert string to stream
string item;
while(getline(ss, item, delim)) {
tokens.push_back(item); //add token to vector
}
}
int main() {
string input_line;
vector<string> spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);
cin.sync_with_stdio(false); //disable synchronous IO
while(cin) {
getline(cin, input_line);
spline.clear(); //empty the vector for the next line to parse
//I'm trying one of the two implementations, per compilation, obviously:
// split1(spline, input_line);
split2(spline, input_line);
count++;
};
count--; //subtract for final over-read
sec = (int) time(NULL) - start;
cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << " Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;
}
//compiled with: g++ -Wall -O3 -o split1 split_1.cpp
Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimiter character (that one was posted by several StackOverflow users in answers to string splitting questions).
I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"
Results:
$ /usr/bin/time cat test_lines_double | ./split.py
15.61 real 0.01 user 0.38 sys
Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
23.50 real 0.01 user 0.46 sys
C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
44.69 real 0.02 user 0.62 sys
C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444
What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?
Edit 1 / Partial Solution?:
I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:
for line in sys.stdin:
dummy = []
dummy += line.split()
count += 1
The performance of python is now about the same as the split1 C++ implementation.
/usr/bin/time cat test_lines_double | ./split5.py
22.61 real 0.01 user 0.40 sys
Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090
I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)
Thanks to all for your help.
Final Edit/Solution:
Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):
$ /usr/bin/time cat test_lines_double | ./split6
15.09 real 0.01 user 0.45 sys
C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333
My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.
One of the lessons from this issue and yesterday's stdin line reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.
Thanks again to all for your suggestions!
As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.
If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).
The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>
using namespace std;
class StringRef
{
private:
char const* begin_;
int size_;
public:
int size() const { return size_; }
char const* begin() const { return begin_; }
char const* end() const { return begin_ + size_; }
StringRef( char const* const begin, int const size )
: begin_( begin )
, size_( size )
{}
};
vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
vector<StringRef> result;
enum State { inSpace, inToken };
State state = inSpace;
char const* pTokenBegin = 0; // Init to satisfy compiler.
for( auto it = str.begin(); it != str.end(); ++it )
{
State const newState = (*it == delimiter? inSpace : inToken);
if( newState != state )
{
switch( newState )
{
case inSpace:
result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
break;
case inToken:
pTokenBegin = &*it;
}
}
state = newState;
}
if( state == inToken )
{
result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
}
return result;
}
int main() {
string input_line;
vector<string> spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);
cin.sync_with_stdio(false); //disable synchronous IO
while(cin) {
getline(cin, input_line);
//spline.clear(); //empty the vector for the next line to parse
//I'm trying one of the two implementations, per compilation, obviously:
// split1(spline, input_line);
//split2(spline, input_line);
vector<StringRef> const v = split3( input_line );
count++;
};
count--; //subtract for final over-read
sec = (int) time(NULL) - start;
cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << " Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;
}
//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x
Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.
I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.
Using strtok_r (reentrant variant of strtok):
void splitc1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
char *saveptr;
char *cpy, *token;
cpy = (char*)malloc(str.size() + 1);
strcpy(cpy, str.c_str());
for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
token != NULL;
token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
tokens.push_back(string(token));
}
free(cpy);
}
Additionally using character strings for parameters, and fgets for input:
void splitc2(vector<string> &tokens, const char *str,
const char *delimiters) {
char *saveptr;
char *cpy, *token;
cpy = (char*)malloc(strlen(str) + 1);
strcpy(cpy, str);
for(token = strtok_r(cpy, delimiters, &saveptr);
token != NULL;
token = strtok_r(NULL, delimiters, &saveptr)) {
tokens.push_back(string(token));
}
free(cpy);
}
And, in some cases, where destroying the input string is acceptable:
void splitc3(vector<string> &tokens, char *str,
const char *delimiters) {
char *saveptr;
char *token;
for(token = strtok_r(str, delimiters, &saveptr);
token != NULL;
token = strtok_r(NULL, delimiters, &saveptr)) {
tokens.push_back(string(token));
}
}
The timings for these are as follows (including my results for the other variants from the question and the accepted answer):
split1.cpp: C++ : Saw 20000000 lines in 31 seconds. Crunch speed: 645161
split2.cpp: C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444
split.py: Python: Saw 20000000 lines in 33 seconds. Crunch Speed: 606060
split5.py: Python: Saw 20000000 lines in 35 seconds. Crunch Speed: 571428
split6.cpp: C++ : Saw 20000000 lines in 18 seconds. Crunch speed: 1111111
splitc1.cpp: C++ : Saw 20000000 lines in 27 seconds. Crunch speed: 740740
splitc2.cpp: C++ : Saw 20000000 lines in 22 seconds. Crunch speed: 909090
splitc3.cpp: C++ : Saw 20000000 lines in 20 seconds. Crunch speed: 1000000
As we can see, the solution from the accepted answer is still fastest.
For anyone who would want to do further tests, I also put up a Github repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.
I suspect that this is because of the way std::vector gets resized during the process of a push_back() function call. If you try using std::list or std::vector::reserve() to reserve enough space for the sentences, you should get a much better performance. Or you could use a combination of both like below for split1():
void split1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
// Skip delimiters at beginning
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find first non-delimiter
string::size_type pos = str.find_first_of(delimiters, lastPos);
list<string> token_list;
while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the list
token_list.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter
pos = str.find_first_of(delimiters, lastPos);
}
tokens.assign(token_list.begin(), token_list.end());
}
EDIT: The other obvious thing I see is that Python variable dummy gets assigned each time but not modified. So it's not a fair comparison against C++. You should try modifying your Python code to be dummy = [] to initialize it and then do dummy += line.split(). Can you report the runtime after this?
EDIT2: To make it even more fair can you modify the while loop in C++ code to be:
while(cin) {
getline(cin, input_line);
std::vector<string> spline; // create a new vector
//I'm trying one of the two implementations, per compilation, obviously:
// split1(spline, input_line);
split2(spline, input_line);
count++;
};
I think the following code is better, using some C++17 and C++14 features:
// This code was untested when I wrote this post, but I'll test it
// when I'm free, and I sincerely welcome others to test and modify
// it.
// C++17
#include <istream> // For std::istream.
#include <string_view> // new feature in C++17, sizeof(std::string_view) == 16 in libc++ on my x86-64 debian 9.4 computer.
#include <string>
#include <utility> // C++14 feature std::move.
template <template <class...> class Container, class Allocator>
void split1(Container<std::string_view, Allocator> &tokens,
std::string_view str,
std::string_view delimiter = " ")
{
/*
* The model of the input string:
*
* (optional) delimiter | content | delimiter | content | delimiter|
* ... | delimiter | content
*
* Using std::string::find_first_not_of or
* std::string_view::find_first_not_of is a bad idea, because it
* actually does the following thing:
*
* Finds the first character not equal to any of the characters
* in the given character sequence.
*
* Which means it does not treat your delimiters as a whole, but as
* a group of characters.
*
* This has 2 effects:
*
* 1. When your delimiters is not a single character, this function
* won't behave as you predicted.
*
* 2. When your delimiters is just a single character, the function
* may have an additional overhead due to the fact that it has to
* check every character with a range of characters, although
* there's only one, but in order to assure the correctness, it still
* has an inner loop, which adds to the overhead.
*
* So, as a solution, I wrote the following code.
*
* The code below will skip the first delimiter prefix.
* However, if there's nothing between 2 delimiters, this code will
* still treat it as if there's something there.
*
* Note:
* Here I use the std version of the substring search algorithm, but
* you can change it to Boyer-Moore, KMP (takes additional memory),
* Rabin-Karp, or another algorithm to speed up your code.
*
*/
// Establish the loop invariant 1.
typename std::string_view::size_type
next,
delimiter_size = delimiter.size(),
pos = str.find(delimiter) ? 0 : delimiter_size;
// The loop invariant:
// 1. At pos, it is the content that should be saved.
// 2. The next pos of delimiter is stored in next, which could be 0
// or std::string_view::npos.
do {
// Find the next delimiter, maintain loop invariant 2.
next = str.find(delimiter, pos);
// Found a token, add it to the vector
tokens.push_back(str.substr(pos, next - pos));
// Skip the delimiter, maintaining loop invariant 1.
//
// next is the position of the next delimiter (or npos). Because the
// loop terminates when next == std::string_view::npos, it doesn't
// matter that the following expression wraps around (unsigned
// overflow is well-defined).
pos = next + delimiter_size;
} while(next != std::string_view::npos);
}
template <template <class...> class Container, class traits, class Allocator2, class Allocator>
void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens,
std::istream &stream,
char delimiter = ' ')
{
std::basic_string<char, traits, Allocator2> item;
// Unfortunately, std::getline can only accept a single-character
// delimiter.
while(std::getline(stream, item, delimiter))
// Move item into token. I haven't checked whether item can be
// reused after being moved.
tokens.push_back(std::move(item));
}
The choice of container:
std::vector.
Assuming the initial size of the allocated internal array is 1 and the ultimate size is N, you will allocate and deallocate log2(N) times, and you will copy elements (2^(log2(N)+1) - 1) ≈ 2N - 1 times. As pointed out in Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?, this can perform poorly when the size of the vector is unpredictable and could be very large.
But if you can estimate the size, this will be less of a problem.
std::list.
For every push_back, the time consumed is constant, but it will probably take more time than std::vector does for an individual push_back. Using a per-thread memory pool and a custom allocator can ease this problem.
std::forward_list.
Same as std::list, but occupies less memory per element. Requires a wrapper class to work, due to the lack of a push_back API.
std::array.
If you can know the limit of growth, then you can use std::array. Of course, you can't use it directly, since it doesn't have a push_back API, but you can define a wrapper, and I think it's the fastest option here and can save some memory if your estimation is quite accurate.
std::deque.
This option allows you to trade memory for performance. There will be no repeated copying of elements, just new allocations as the deque grows, and no deallocation. You also get constant random access time and the ability to add new elements at both ends.
According to std::deque-cppreference
On the other hand, deques typically have large minimal memory cost; a
deque holding just one element has to allocate its full internal array
(e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size
or 4096 bytes, whichever is larger, on 64-bit libc++)
or you can use combo of these:
std::vector< std::array<T, 2 ^ M> >
This is similar to std::deque; the difference is just that this container doesn't support adding elements at the front. But it is still faster, because it won't copy the underlying std::array objects when it grows; it just copies the (much smaller) pointer array, allocates a new array only when the current one is full, and never needs to deallocate anything. By the way, you still get constant random access time.
std::list< std::array<T, ...> >
Greatly eases the pressure of memory fragmentation. It will only allocate a new array when the current one is full, and does not need to copy anything. You will still have to pay the price of an additional pointer compared to combo 1.
std::forward_list< std::array<T, ...> >
Same as combo 2, but costs the same memory as combo 1.
You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?
If you take the split1 implementation and change the signature to more closely match that of split2, by changing this:
void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")
to this:
void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')
You get a more dramatic difference between split1 and split2, and a fairer comparison:
split1 C++ : Saw 10000000 lines in 41 seconds. Crunch speed: 243902
split2 C++ : Saw 10000000 lines in 144 seconds. Crunch speed: 69444
split1' C++ : Saw 10000000 lines in 33 seconds. Crunch speed: 303030
For reference, here is a state-machine variant (split5) in the same spirit:
void split5(vector<string> &tokens, const string &str, char delim=' ') {
enum { do_token, do_delim } state = do_delim;
int idx = 0, tok_start = 0;
for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
switch (state) {
case do_token:
if (it == str.end()) {
tokens.push_back (str.substr(tok_start, idx-tok_start));
return;
}
else if (*it == delim) {
state = do_delim;
tokens.push_back (str.substr(tok_start, idx-tok_start));
}
break;
case do_delim:
if (it == str.end()) {
return;
}
if (*it != delim) {
state = do_token;
tok_start = idx;
}
break;
}
}
}
I suspect that this is related to buffering on sys.stdin in Python, but no buffering in the C++ implementation.
See this post for details on how to change the buffer size, then try the comparison again:
Setting smaller buffer size for sys.stdin?

Removing specified characters from a string - Efficient methods (time and space complexity)

Here is the problem: Remove specified characters from a given string.
Input: The string is "Hello World!" and characters to be deleted are "lor"
Output: "He Wd!"
Solving this involves two sub-parts:
Determining if the given character is to be deleted
If so, then deleting the character
To solve the first part, I am reading the characters to be deleted into a std::unordered_map, i.e. I parse the string "lor" and insert each character into the hashmap. Later, when I am parsing the main string, I will look into this hashmap with each character as the key and if the returned value is non-zero, then I delete the character from the string.
Question 1: Is this the best approach?
Question 2: Which would be better for this problem? std::map or std::unordered_map? Since I am not interested in ordering, I used an unordered_map. But is there a higher overhead for creating the hash table? What to do in such situations? Use a map (balanced tree) or a unordered_map (hash table)?
Now coming to the next part, i.e. deleting the characters from the string. One approach is to delete the character and shift the data from that point on, back by one position. In the worst case, where we have to delete all the characters, this would take O(n^2).
The second approach would be to copy only the required characters to another buffer. This would involve allocating enough memory to hold the original string and copy over character by character leaving out the ones that are to be deleted. Although this requires additional memory, this would be a O(n) operation.
The third approach would be to start reading and writing from the 0th position, incrementing the source pointer every time I read and incrementing the destination pointer only when I write. Since the source pointer will always be at or ahead of the destination pointer, I can write over the same buffer. This saves memory and is also an O(n) operation. I am doing the same, and calling resize at the end to remove the additional unnecessary characters.
Here is the function I have written:
// str contains the string (Hello World!)
// chars contains the characters to be deleted (lor)
void remove_chars(string& str, const string& chars)
{
unordered_map<char, int> chars_map;
for(string::size_type i = 0; i < chars.size(); ++i)
chars_map[chars[i]] = 1;
string::size_type i = 0; // source
string::size_type j = 0; // destination
while(i < str.size())
{
if(chars_map[str[i]] != 0)
++i;
else
{
str[j] = str[i];
++i;
++j;
}
}
str.resize(j);
}
Question 3: What are the different ways in which I can improve this function? Or is this the best we can do?
Thanks!
Good job, now learn about the standard library algorithms and boost:
str.erase(std::remove_if(str.begin(), str.end(), boost::is_any_of("lor")), str.end());
Assuming that you're studying algorithms, and not interested in library solutions:
Hash tables are most valuable when the number of possible keys is large, but you only need to store a few of them. Your hash table would make sense if you were deleting specific 32-bit integers from digit sequences. But with ASCII characters, it's overkill.
Just make an array of 256 bools and set a flag for the characters you want to delete. It uses only one table lookup per input character, while a hash map involves at least a few more instructions to compute the hash function. Space-wise, the hash map is probably no more compact once you add up all the auxiliary data.
void remove_chars(string& str, const string& chars)
{
// set up the look-up table
std::vector<bool> discard(256, false);
for (std::size_t i = 0; i < chars.size(); ++i)
{
discard[static_cast<unsigned char>(chars[i])] = true; // cast: plain char may be signed
}
for (std::size_t j = 0; j < str.size(); ++j)
{
if (discard[static_cast<unsigned char>(str[j])])
{
// do something, depending on your storage choice
}
}
}
Regarding your storage choices: Choose between options 2 and 3 depending on whether you need to preserve the input data or not. 3 is obviously most efficient, but you don't always want an in-place procedure.
Here is a KISS solution with many advantages:
void remove_chars (char *dest, const char *src, const char *excludes)
{
do {
if (!strchr (excludes, *src))
*dest++ = *src;
} while (*src++);
*dest = '\000';
}
You can ping-pong between strcspn and strspn to avoid the need for a hash table:
void remove_chars(
const char *input,
char *output,
const char *characters)
{
const char *next_input= input;
char *next_output= output;
while (*next_input!='\0')
{
int copy_length= strcspn(next_input, characters); /* run of chars to keep */
memcpy(next_output, next_input, copy_length);
next_output+= copy_length;
next_input+= copy_length;
next_input+= strspn(next_input, characters); /* skip chars to delete */
}
*next_output= '\0';
}

Sorting a file with 55K rows and varying Columns

I want to find a programmatic solution using C++.
I have 900 files, each 27MB in size (just to convey the enormity).
Each file has 55K rows and a varying number of columns, but the header indicates the columns.
I want to sort the rows with respect to a column value.
I wrote a sorting algorithm for this (definitely a newbie attempt, you may say).
The algorithm works for small numbers of rows, but fails for larger numbers.
Here is the code for the same:
Basic functions I defined to use inside the main code:
int getNumberOfColumns(const string& aline)
{
int ncols=0;
istringstream ss(aline);
string s1;
while(ss>>s1) ncols++;
return ncols;
}
vector<string> getWordsFromSentence(const string& aline)
{
vector<string>words;
istringstream ss(aline);
string tstr;
while(ss>>tstr) words.push_back(tstr);
return words;
}
bool findColumnName(vector<string> vs, const string& colName)
{
vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
if ( it != vs.end())
return true;
else return false;
}
int getIndexForColumnName(vector<string> vs, const string& colName)
{
if ( !findColumnName(vs,colName) ) return -1;
else {
vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
return it - vs.begin();
}
}
////////// I like recursive functions - I tried to create a recursive function
/// here. This worked for small values, say 20 rows. But for 55K rows it core dumps.
void sort2D(vector<string>vn, vector<string> &srt, int columnIndex)
{
vector<double> pVals;
for ( int i = 0; i < vn.size(); i++) {
vector<string>meancols = getWordsFromSentence(vn[i]);
pVals.push_back(stringToDouble(meancols[columnIndex]));
}
srt.push_back(vn[max_element(pVals.begin(), pVals.end())-pVals.begin()]);
if (vn.size() > 1 ) {
vn.erase(vn.begin()+(max_element(pVals.begin(), pVals.end())-pVals.begin()) );
vector<string> vn2 = vn;
//cout<<srt[srt.size() -1 ]<<endl;
sort2D(vn2 , srt, columnIndex);
}
}
Now the main code:
for ( int i = 0; i < TissueNames.size() -1; i++)
{
for ( int j = i+1; j < TissueNames.size(); j++)
{
//string fname = path+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
//string fname2 = sortpath2+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+"Sorted.txt";
string fname = path+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
string fname2 = sortpath2+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+"4Columns.txt";
vector<string>AllLinesInFile;
BioInputStream fin(fname);
string aline;
getline(fin,aline);
replace (aline.begin(), aline.end(), '"',' ');
string headerline = aline;
vector<string> header = getWordsFromSentence(aline);
int pindex = getIndexForColumnName(header,"p-raw");
int xcindex = getIndexForColumnName(header,"xC");
int xeindex = getIndexForColumnName(header,"xE");
int prbindex = getIndexForColumnName(header,"X");
string newheaderline = "X\txC\txE\tp-raw";
BioOutputStream fsrt(fname2);
fsrt<<newheaderline<<endl;
int newpindex=3;
while ( getline(fin, aline) ){
replace (aline.begin(), aline.end(), '"',' ');
istringstream ss2(aline);
string tstr;
ss2>>tstr;
tstr = ss2.str().substr(tstr.length()+1);
vector<string> words = getWordsFromSentence(tstr);
string values = words[prbindex]+"\t"+words[xcindex]+"\t"+words[xeindex]+"\t"+words[pindex];
AllLinesInFile.push_back(values);
}
vector<string>SortedLines;
sort2D(AllLinesInFile, SortedLines,newpindex);
for ( int si = 0; si < SortedLines.size(); si++)
fsrt<<SortedLines[si]<<endl;
cout<<"["<<i<<","<<j<<"] = "<<SortedLines.size()<<endl;
}
}
Can someone suggest a better way of doing this?
Why is it failing for larger values?
The primary function of interest for this query is the sort2D function.
Thanks for the time and patience.
prasad.
I'm not sure why your code is crashing, but recursion in that case is only going to make the code less readable. I doubt it's a stack overflow, however, because you're not using much stack space in each call.
C++ already has std::sort, why not use that instead? You could do it like this:
// functor to compare 2 strings
class CompareStringByValue : public std::binary_function<string, string, bool>
{
public:
CompareStringByValue(int columnIndex) : idx_(columnIndex) {}
bool operator()(const string& s1, const string& s2) const
{
double val1 = stringToDouble(getWordsFromSentence(s1)[idx_]);
double val2 = stringToDouble(getWordsFromSentence(s2)[idx_]);
return val1 < val2;
}
private:
int idx_;
};
To then sort your lines you would call
std::sort(vn.begin(), vn.end(), CompareStringByValue(columnIndex));
Now, there is one problem. This will be slow, because stringToDouble and getWordsFromSentence are called multiple times on the same string. You would probably want to generate a separate vector which has precalculated the values of each string, and then have CompareStringByValue just use that vector as a lookup table.
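A sketch of that lookup-table idea (decorate, sort, undecorate), reusing stringToDouble and getWordsFromSentence from the question:
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

void sortByColumn(std::vector<std::string>& lines, int columnIndex)
{
    // Parse each line exactly once, pairing the sort key with its line.
    std::vector<std::pair<double, std::string> > keyed;
    keyed.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i)
        keyed.push_back(std::make_pair(
            stringToDouble(getWordsFromSentence(lines[i])[columnIndex]),
            lines[i]));
    std::sort(keyed.begin(), keyed.end());  // pairs compare by key first
    for (std::size_t i = 0; i < lines.size(); ++i)
        lines[i] = keyed[i].second;
}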
Another way you can do this is insert the strings into a std::multimap<double, std::string>. Just insert the entries as (value, str) and then read them out line-by-line. This is simpler but slower (though has the same big-O complexity).
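The multimap variant, sketched with the same helpers assumed:
#include <map>
#include <string>
#include <utility>
#include <vector>

std::vector<std::string> sortViaMultimap(const std::vector<std::string>& lines,
                                         int columnIndex)
{
    std::multimap<double, std::string> byValue;
    for (std::size_t i = 0; i < lines.size(); ++i)
        byValue.insert(std::make_pair(
            stringToDouble(getWordsFromSentence(lines[i])[columnIndex]),
            lines[i]));
    std::vector<std::string> sorted;
    sorted.reserve(lines.size());
    std::multimap<double, std::string>::const_iterator it;
    for (it = byValue.begin(); it != byValue.end(); ++it)  // ascending key order
        sorted.push_back(it->second);
    return sorted;
}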
EDIT: Cleaned up some incorrect code and derived from binary_function.
You could try a method that doesn't involve recursion. If your program crashes using the sort2D function with large values, then you're probably overflowing the stack (a danger of using recursion with a large number of function calls). Try another sorting method, maybe using a loop.
sort2D crashes because you keep allocating an array of strings to sort and then you pass it by value, in effect using O(2*N^2) memory. If you really want to keep your recursive function, simply pass vn by reference and don't bother with vn2. And if you don't want to modify the original vn, move the body of sort2D into another function (say, sort2Drecursive) and call that from sort2D.
You might want to take another look at sort2D in general, since you are doing O(N^2) work for something that should take O(N+N*log(N)).
The problem is less your code than the tool you chose for the job. This is purely a text processing problem, so choose a tool good at that. In this case on Unix the best tool for the job is Bash and the GNU coreutils. On Windows you can use PowerShell, Python or Ruby. Python and Ruby will work on any Unix-flavoured machine too, but roughly all Unix machines have Bash and the coreutils installed.
Let $FILES hold the list of files to process, delimited by whitespace. Here's the code for Bash:
for FILE in $FILES; do
echo "Processing file $FILE ..."
head --lines=1 $FILE >$FILE.tmp
tail --lines=+2 $FILE |sort >>$FILE.tmp
mv $FILE.tmp $FILE
done

Fastest way to determine whether a string contains a real or integer value

I'm trying to write a function that is able to determine whether a string contains a real or an integer value.
This is the simplest solution I could think of:
int containsStringAnInt(char* strg){
for (int i =0; i < strlen(strg); i++) {if (strg[i]=='.') return 0;}
return 1;
}
But this solution is really slow when the string is long... Any optimization suggestions?
Any help would really be appreciated!
What's the syntax of your real numbers?
1e-6 is valid C++ for a literal, but will be classified as an integer by your test.
Is your string hundreds of characters long? If not, don't worry about any possible performance issues.
The only inefficiency is that you are calling strlen() in the loop condition, which means the string is traversed many more times than necessary. For a simpler solution with the same time complexity (O(n)), but probably slightly faster, use strchr().
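For what it's worth, the strchr() variant is a one-liner with the same semantics as the original function (a sketch):
#include <string.h>

int containsStringAnInt(const char* strg)
{
    return strchr(strg, '.') == NULL;  /* 1 if no '.' anywhere, else 0 */
}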
You are using strlen, which means you are not worried about Unicode. In that case, why use strlen or strchr at all? Just check for '\0' (the null char):
int containsStringAnInt(char* strg){
for (int i = 0; strg[i] != '\0'; i++) {
if (strg[i] == '.') return 0;
}
return 1;
}
This makes only one pass through the string, rather than traversing the string in each iteration of the loop.
Your function does not take into account exponential notation of reals (1E7, 1E-7 are both doubles)
Use strtol() to try to convert the string to integer first; it will also return the first position in the string where the parsing failed (this will be '.' if the number is real). If the parsing stopped at '.', use strtod() to try to convert to double. Again, the function will return the position in the string where the parsing stopped.
Don't worry about performance until you have profiled the program. Otherwise, for the fastest possible code, construct a regular expression that describes the acceptable syntax of numbers, hand-convert it first into an FSM, and then into highly optimized code.
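A hand-rolled scanner in that spirit might look like the sketch below, for one plausible number grammar ([+-]?digits[.digits][(e|E)[+-]?digits]); adjust it to your actual syntax:
enum NumberClass { NEITHER, IS_INTEGER, IS_REAL };

NumberClass classify(const char* p)
{
    bool real = false;
    if (*p == '+' || *p == '-') ++p;          // optional sign
    const char* digits = p;
    while (*p >= '0' && *p <= '9') ++p;       // integer part
    if (p == digits) return NEITHER;          // at least one digit required
    if (*p == '.') {                          // optional fractional part
        real = true;
        ++p;
        while (*p >= '0' && *p <= '9') ++p;
    }
    if (*p == 'e' || *p == 'E') {             // optional exponent part
        real = true;
        ++p;
        if (*p == '+' || *p == '-') ++p;
        const char* expDigits = p;
        while (*p >= '0' && *p <= '9') ++p;
        if (p == expDigits) return NEITHER;   // "1e" alone is not a number
    }
    if (*p != '\0') return NEITHER;           // trailing garbage
    return real ? IS_REAL : IS_INTEGER;
}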
So the standard note first: please don't worry about performance too much if you haven't profiled yet :)
I'm not sure about the manual loop checking for a dot. There are two issues:
Depending on the locale, the dot can actually be a "," too (here in Germany that's the case :)
As others noted, there is the issue with numbers like 1e7
Previously I had a version using sscanf here. But measuring performance showed that sscanf is significantly slower for bigger data sets. So I'll show the faster solution first. (Well, it's also a whole lot simpler. I had several bugs in the sscanf version until I got it working, while the strto[ld] version worked on the first try):
enum {
REAL,
INTEGER,
NEITHER_NOR
};
int what(char const* strg){
char *endp;
strtol(strg, &endp, 10);
if(*strg && !*endp)
return INTEGER;
strtod(strg, &endp);
if(*strg && !*endp)
return REAL;
return NEITHER_NOR;
}
Just for fun, here is the version using sscanf:
int what(char const* strg) {
// test for int
{
int d; // converted value
int n = 0; // number of chars read
int rd = std::sscanf(strg, "%d %n", &d, &n);
if(!strg[n] && rd == 1) {
return INTEGER;
}
}
// test for double
{
double v; // converted value
int n = 0; // number of chars read
int rd = std::sscanf(strg, "%lf %n", &v, &n);
if(!strg[n] && rd == 1) {
return REAL;
}
}
return NEITHER_NOR;
}
I think that should work. Have fun.
Test was done by converting test strings (small ones) randomly 10000000 times in a loop:
6.6s for sscanf
1.7s for strto[dl]
0.5s for manual looping until "."
A clear win for strto[ld]; considering that it parses numbers correctly, I will declare it the winner over manual looping. Anyway, a difference of 1.2s/10000000 ≈ 0.00000012s per conversion isn't all that much in the end.
Strlen walks the string to find the length of the string.
You are calling strlen on every pass of the loop. Hence, you are walking the string many more times than necessary. This tiny change should give you a huge performance improvement:
int containsStringAnInt(char* strg){
int len = strlen(strg);
for (int i =0; i < len; i++) {if (strg[i]=='.') return 0;}
return 1;
}
Note that all I did was find the length of the string once, at the start of the function, and refer to that value repeatedly in the loop.
Please let us know what kind of performance improvement this gets you.
@Aaron, your way also traverses the string twice: once within strlen, and once again in the for loop.
The best way to traverse an ASCII string in a for loop is to check for the null char in the loop itself. Have a look at my answer, which parses the string only once within the for loop, and possibly only partially, if it finds a '.' before the end. That way, if a string is like 0.01xxx (another 100 chars), you need not go to the end to find the length.
#include <stdlib.h>
int containsStringAnInt(char* strg){
if (atof(strg) == atoi(strg))
return 1;
return 0;
}