Speed up integer reading from file in C++

I'm reading a file, line by line, and extracting integers from it. Some noteworthy points:
the input file is not in binary;
I cannot load up the whole file in memory;
file format (only integers, separated by some delimiter):
x1 x2 x3 x4 ...
y1 y2 y3 ...
z1 z2 z3 z4 z5 ...
...
Just to add context, I'm reading the integers and counting them using an std::unordered_map<unsigned int, unsigned int>.
Simply looping through lines and allocating useless stringstreams, like this:
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
}
gives me ~2.7s for a 700MB file.
Parsing each line:
unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
    while (ss >> item);
}
Gives me ~17.8s for the same file.
If I change the extraction operator to std::getline + atoi:
unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
    std::stringstream ss(line);
    while (std::getline(ss, token, ' ')) item = atoi(token.c_str());
}
It gives ~14.6s.
Is there anything faster than these approaches? I don't think the file reading itself needs speeding up, just the parsing -- though improving both wouldn't hurt (:

This program
#include <iostream>
int main()
{
    int num;
    while (std::cin >> num) ;
}
needs about 17 seconds to read a file. This code
#include <iostream>
int main()
{
    int lc = 0;
    int item = 0;
    char buf[2048];
    do
    {
        std::cin.read(buf, sizeof(buf));
        int k = std::cin.gcount();
        for (int i = 0; i < k; ++i)
        {
            switch (buf[i])
            {
            case '\r':
                break;
            case '\n':
                item = 0; lc++;
                break;
            case ' ':
                item = 0;
                break;
            case '0': case '1': case '2': case '3':
            case '4': case '5': case '6': case '7':
            case '8': case '9':
                item = 10*item + buf[i] - '0';
                break;
            default:
                std::cerr << "Bad format\n";
            }
        }
    } while (std::cin);
}
needs 1.25 seconds for the same file. Make what you want of it...

Streams are slow. If you really want to do stuff fast load the entire file into memory, and parse it in memory. If you really can't load it all into memory, load it in chunks, making those chunks as large as possible, and parse the chunks in memory.
When parsing in memory, replace the spaces and line endings with nulls so you can use atoi to convert to integer as you go.
Oh, and you'll get problems at the end of each chunk, because you don't know whether the chunk boundary cuts off a number or not. To solve this easily, stop a small distance (16 bytes should do) before the chunk end and copy that tail to the start before loading the next chunk after it.

Have you tried input iterators?
It skips the creation of the strings:
std::istream_iterator<int> begin(infile);
std::istream_iterator<int> end;
int item = 0;
while (begin != end)
    item = *begin++;

Why don't you skip the stream and the line buffers and read from the file stream directly?
template<class T, class CharT, class CharTraits>
std::vector<T> read(std::basic_istream<CharT, CharTraits> &in) {
    std::vector<T> ret;
    while (in.good()) {
        T x;
        in >> x;
        if (in.good()) ret.push_back(x);
    }
    return ret;
}
http://ideone.com/FNJKFa

Following up Jack Aidley's answer (can't put code in the comments), here's some pseudo-code:
vector<char> buff( chunk_size );
roffset = 0;
char* chunk = &buff[0];
while( not done with file )
{
    fread( chunk + roffset, ... );               // Read a sizable chunk into memory, filling in after roffset
    roffset = find_last_eol(chunk);              // find where the last full line ends
    parse_in_mem( chunk, chunk_size - roffset ); // process up to the last full line
    move_unprocessed_to_front( chunk, roffset ); // don't re-read what's already in mem
}

Related


How can I read data until the end of a line?

I have a text file "file.txt" with this:
1 5 9 2 59 4 6
2 1 2
3 2 30 1 55
I have this code:
ifstream file("file.txt",ios::in);
while(!file.eof())
{
    ....//my functions(1)
    while(?????) // Here I want to write: while (!end of file)
    {
        ...//my functions(2)
    }
}
In my functions(2) I use the data from the lines, and it needs to be int, not char.
Don't use while(!file.eof()), as eof() is only set after a read has hit the end of the file. It does not indicate that the next read will reach the end of the file. You can use while(getline(...)) instead, combined with an istringstream to read the numbers.
#include <fstream>
#include <sstream>
using namespace std;
// ... ...
ifstream file("file.txt",ios::in);
if (file.good())
{
    string str;
    while(getline(file, str))
    {
        istringstream ss(str);
        int num;
        while(ss >> num)
        {
            // ... you now get a number ...
        }
    }
}
You need to read Why is iostream::eof inside a loop condition considered wrong?.
As for reading until the end of the line, there's std::getline.
You have another problem though, and that is that you loop while (!file.eof()) which will most likely not work as you expect. The reason is that the eofbit flag is not set until after you try to read from beyond the end of the file. Instead you should do e.g. while (std::getline(...)).
char eoln(fstream &stream) // C++ code: return true at end of line
{
    if (stream.eof()) return 1;  // true: end of file
    long curpos; char ch;
    curpos = stream.tellg();     // Get current read position (tellg/seekg, not tellp/seekp, for input)
    stream.get(ch);              // Get next char
    stream.clear();              // Clear stream state (works around a VC 6.0 bug)
    stream.seekg(curpos);        // Return to previous position
    if ((int)ch != 10)           // is ch not equal to 10 ('\n')?
        return 0;                // False: not end of row (line)
    else                         // (if have spaces?)
        stream.get(ch);          // Go to next row
    return 1;                    // True: end of row (line)
}                                // End function
If you want to write it as a function you can call from elsewhere, you can use a vector. This is a function I use to read such a file and return the integers element-wise.
vector<unsigned long long> Hash_file_read(){
    int frames_sec = 25;
    vector<unsigned long long> numbers;
    ifstream my_file("E:\\Sanduni_projects\\testing\\Hash_file.txt", std::ifstream::binary);
    if (my_file) {
        //ifstream file;
        string line;
        for (int i = 0; i < frames_sec; i++){
            getline(my_file, line);
            numbers.push_back(stoull(line));
        }
    }
    else{
        cout << "File can not be opened" << endl;
    }
    return numbers;
}

How to convert char to int and find the sum of characters from .txt file

I have a text file:
Carly:Cat:ABCCCCE.
I need to convert the last set of characters: ABCCCCE into integers and display their sum. I need to set the values to:
A = 15, B = 5, C = 6, D = 8, E = 2
The issue I am having is that these characters are in a .txt file and I am quite unfamiliar with how to extract ONLY the last elements of the line.
I have attempted to create a function to set the values of each letter A through E, and have used this code to attempt to add the values of the characters ABCCCCE together:
int Return_Complexity(char character)
{
    switch (character)
    {
    case 'a':
        return 15;
        break;
    case 'b':
        return 5;
        break;
    case 'c':
        return 6;
        break;
    case 'd':
        return 8;
        break;
    case 'e':
        return 2;
        break;
    default:
        return 0;
    }
}
void printComplexity()
{
    ifstream file("Customers.txt");
    vector<int> values;
    char character;
    while (file >> character)
    {
        int result = Return_Complexity(tolower(character));
        if (result > 0) // if it's a-e
            values.push_back(result);
    }
    int Result = accumulate(values.begin(), values.end(), 1); // add together
    std::cout << Result << std::endl;
}
int main()
{
    printComplexity();
    return 0;
}
From the values assigned to each character, I would want the output to be:
46
However, the function returns:
1
Any help would be greatly appreciated!
Okay, here are two things you are doing wrong:
Indentation
Using all characters from the line to calculate the sum.
You have mentioned that you only want to use the last section of the line, but there is no code to distinguish the last section. Therefore it adds all the characters from "Carly" & "Cat". (Note also that your accumulate starts from 1 instead of 0, which adds 1 to the sum.)
As @Slava mentioned in a comment, you need to read each line from the file and get the position of the last :.
void printComplexity()
{
    ifstream file("Customers.txt");
    vector<int> values;
    string line;
    // Read file line-by-line until it reaches EOF
    while (getline(file, line))
    {
        // Find last occurrence of ':'
        size_t last_pos = line.find_last_of(':');
        // Create a new string that only contains the desired section
        // and iterate over it to get each value.
        string word(line, last_pos);
        for (auto& ch : word) {
            int result = Return_Complexity(tolower(ch));
            values.push_back(result);
        }
    }
    int Result = accumulate(values.begin(), values.end(), 0); // add together
    std::cout << Result << std::endl;
}
The code is reading the whole file, so the characters of "Carly:Cat:" are counted along with "ABCCCCE.". You must wait for the character to match the delimiter (: in this case) twice, and only then expect the coming "ABCCCCE.".
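In code, that idea might look like this (my sketch, not the answerer's; the value function is passed in so it can reuse the question's Return_Complexity):

```cpp
#include <cctype>
#include <string>

// Skip past the second ':' on a line, then sum the values of the
// remaining characters using a caller-supplied table.
int sum_after_second_colon(const std::string& line, int (*value_of)(char)) {
    std::size_t pos = 0;
    for (int colons = 0; colons < 2; ++colons) {
        pos = line.find(':', pos);
        if (pos == std::string::npos) return 0;  // malformed line
        ++pos;                                   // move past this ':'
    }
    int sum = 0;
    for (std::size_t i = pos; i < line.size(); ++i)
        sum += value_of(static_cast<char>(std::tolower(line[i])));
    return sum;
}
```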

Convert vector<string> to unsigned char array in C++

I have a string vector that holds some values. These values are supposed to be hex bytes but are being stored as strings inside this vector.
The bytes were read from inside a text file actually, something like this:
(contents of the text file)
<jpeg1>
0xFF,0xD8,0xFF,0xE0,0x00,0x10,0x4A,0x46,0x49,0x46,0x00,0x01,0x01,0x01,0x00,0x60
</jpeg1>
So far, what my code does is start reading the line after the <jpeg1> tag until the </jpeg1> tag, and then, using the comma ',' as a delimiter, it stores the bytes into the string vector.
After Splitting the string, the vector at the moment stores the values like this :
vector<string> myString = {"0xFF", "0xD8", "0xFF", "0xE0", "0x00", "0x10", "0x4A", "0x46", "0x49", "0x46", "0x00", "0x01", "0x01", "0x01", "0x00", "0x60"};
and if I print this I get the following:
0: 0xFF
1: 0xD8
2: 0xFF
3: 0xE0
4: 0x00
5: 0x10
6: 0x4A
7: 0x46
8: 0x49
9: 0x46
What I want is to store these bytes inside an unsigned char array, such that each element is treated as a hex byte and not a string value.
Preferably something like this :
unsigned char myHexArray[] = {0xFF,0xD8,0xFF,0xE0,0x00,0x10,0x4A,0x46,0x49,0x46,0x00,0x01,0x01,0x01,0x00,0x60};
If I print this I get:
0:  
1: ╪
2:  
3: α
4:
5:
6: J
7: F
8: I
9: F
Solved!
Thanks for your help guys; so far ranban282's solution has worked for me, and I'll try the solutions provided by other users as well.
I wouldn't even go through the std::vector<std::string> stage, you don't need it and it wastes a lot of allocations for no good reason; just parse the string to bytes "online".
If you already have an istream for your data, you can parse it straight from the stream, although I've had terrible experiences with its performance.
// is is some derived class of std::istream
std::vector<unsigned char> ret;
while(is) {
    int val = 0;
    is >> std::hex >> val;
    if(!is) {
        break; // failed conversion; remember to clean up the stream
               // if you need it later!
    }
    ret.push_back(val);
    if(is.get() != ',') break; // (get(), not getc(): istream has no getc member)
}
If instead you have it in a string - as often happens when extracting data from an XML file, you can parse it either using istringstream and the code above (one extra string copy + generally quite slow), or parse it straight from the string using e.g. sscanf with %i; say that your string is in a const char *sz:
std::vector<unsigned char> ret;
for(; *sz; ++sz) {
    int read = 0;
    int val = 0;
    if(sscanf(sz, " %i %n", &val, &read) == 0) break; // format error
    ret.push_back(val);
    sz += read;
    if(*sz != ',') break; // end of string or format error
}
// now ret contains the decoded string
If you are sure that the strings are always hexadecimal, regardless of the 0x prefix, and that whitespace is not present strtol is a bit more efficient and IMO nicer to use:
std::vector<unsigned char> ret;
for( ; *sz; ++sz) {
    char *endp;
    long val = strtol(sz, &endp, 16);
    if(endp == sz) break; // format error
    sz = endp;
    ret.push_back(val);
    if(*sz != ',') break; // end of string or format error
}
If C++17 is available, you can use std::from_chars instead of strtol to cut out the locale bullshit, which can break your parsing function (although that's more typical for floating point parsing) and slow it down for no good reason.
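A hedged sketch of the from_chars route (C++17; the function name is mine). One catch: std::from_chars with base 16 does not consume a "0x" prefix, so the prefix must be skipped manually:

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

std::vector<unsigned char> parse_hex_list(std::string_view sv) {
    std::vector<unsigned char> ret;
    const char* p = sv.data();
    const char* end = p + sv.size();
    while (p != end) {
        while (p != end && (*p == ' ' || *p == ',')) ++p;  // skip separators
        if (end - p >= 2 && p[0] == '0' && (p[1] == 'x' || p[1] == 'X'))
            p += 2;                                        // from_chars won't eat "0x"
        unsigned val = 0;
        auto [next, ec] = std::from_chars(p, end, val, 16);
        if (ec != std::errc() || next == p) break;         // format error or end of input
        ret.push_back(static_cast<unsigned char>(val));
        p = next;
    }
    return ret;
}
```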
OTOH, if the performance is critical but from_chars is not available (or if it's available but you measured that it's slow), it may be advantageous to hand roll the whole parser.
auto conv_digit = [](char c) -> int {
    if(c >= '0' && c <= '9') return c - '0';
    // notice: technically not guaranteed to work;
    // in practice it'll work on anything that doesn't use EBCDIC
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
};
std::vector<unsigned char> ret;
for(; *sz; ++sz) {
    while(*sz == ' ') ++sz;
    if(*sz != '0' || (sz[1] != 'x' && sz[1] != 'X')) break; // format error
    sz += 2;
    int val = 0;
    int digit = -1;
    const char *sz_before = sz;
    while((digit = conv_digit(*sz)) >= 0) {
        val = val*16 + digit; // or, if you prefer: val = val<<4 | digit;
        ++sz;
    }
    if(sz == sz_before) break; // format error
    ret.push_back(val);
    while(*sz == ' ') ++sz;
    if(*sz != ',') break; // end of string or format error
}
If you're using C++11, you can use the stoi function.
vector<string> myString = {"0xFF", "0xD8", "0xFF", "0xE0", "0x00", "0x10", "0x4A", "0x46", "0x49", "0x46", "0x00", "0x01", "0x01", "0x01", "0x00", "0x60"};
unsigned char* myHexArray = new unsigned char[myString.size()];
for (unsigned i = 0; i < myString.size(); i++)
{
    myHexArray[i] = stoi(myString[i], NULL, 0);
}
for (unsigned i = 0; i < myString.size(); i++)
{
    cout << myHexArray[i] << endl;
}
The function stoi() was introduced in C++11. In order to compile with gcc, you should compile with the flag -std=c++11.
In case you're using an older version of C++, you can use strtol instead of stoi. Note that you need to convert the string to a character array first.
myHexArray[i]=strtol(myString[i].c_str(),NULL,0);
You can use std::stoul on each of your values and build your array using another std::vector like this:
std::vector<std::string> vs {"0xFF", "0xD8", "0xFF" ...};
std::vector<unsigned char> vc;
vc.reserve(vs.size());
for(auto const& s : vs)
    vc.push_back((unsigned char) std::stoul(s, 0, 0));
Now you can access your array with:
vc.data(); // <-- pointer to unsigned char array
Here's a complete solution including a test and a rudimentary parser (for simplicity, it assumes that the xml tags are on their own lines).
#include <string>
#include <sstream>
#include <regex>
#include <iostream>
#include <iomanip>
#include <iterator>
#include <vector>    // needed for std::vector
#include <cstdint>   // needed for std::uint8_t
#include <stdexcept> // needed for std::runtime_error
const char test_data[] =
R"__(<jpeg1>
0xFF,0xD8,0xFF,0xE0,0x00,0x10,0x4A,0x46,0x49,0x46,0x00,0x01,0x01,0x01,0x00,0x60,
0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0
</jpeg1>)__";
struct Jpeg
{
    std::string name;
    std::vector<std::uint8_t> data;
};
std::ostream& operator<<(std::ostream& os, const Jpeg& j)
{
    os << j.name << " : ";
    const char* sep = " ";
    os << '[';
    for (auto b : j.data) {
        os << sep << std::hex << std::setfill('0') << std::setw(2) << std::uint32_t(b);
        sep = ", ";
    }
    return os << " ]";
}
template<class OutIter>
OutIter read_bytes(OutIter dest, std::istream& source)
{
    std::string buffer;
    while (std::getline(source, buffer, ','))
    {
        *dest++ = static_cast<std::uint8_t>(std::stoul(buffer, 0, 16));
    }
    return dest;
}
Jpeg read_jpeg(std::istream& is)
{
    auto result = Jpeg {};
    static const auto begin_tag = std::regex("<jpeg(.*)>");
    static const auto end_tag = std::regex("</jpeg(.*)>");
    std::string line, hex_buffer;
    if (not std::getline(is, line)) throw std::runtime_error("end of file");
    std::smatch match;
    if (not std::regex_match(line, match, begin_tag)) throw std::runtime_error("not a <jpeg_>");
    result.name = match[1];
    while (std::getline(is, line))
    {
        if (std::regex_match(line, match, end_tag)) { break; }
        std::istringstream hexes { line };
        read_bytes(std::back_inserter(result.data), hexes);
    }
    return result;
}
int main()
{
    std::istringstream input_stream(test_data);
    auto jpeg = read_jpeg(input_stream);
    std::cout << jpeg << std::endl;
}
expected output:
1 : [ ff, d8, ff, e0, 00, 10, 4a, 46, 49, 46, 00, 01, 01, 01, 00, 60, 12, 34, 56, 78, 9a, bc, de, f0 ]

Txt to 2 different arrays c++

I have a txt file with a lot of things in it.
The lines have this pattern: 6 spaces, then 1 int, 1 space, then a string.
Also, the 1st line has the number of lines the txt file has.
I want to put the integers in an array of ints and the strings in an array of strings.
I can read it and put it into an array, but only if I treat the ints as chars and put everything into one array of strings. When I try to separate things I have no idea how I'd do it. Any ideas?
The code I used for putting everything in an array was this:
int size()
{
    ifstream sizeX;
    int x;
    sizeX.open("cities.txt");
    sizeX >> x;
    return x;
}
int main(void)
{
    int size = size();
    string words[size];
    ifstream file("cities.txt");
    file.ignore(100000, '\n');
    if (file.is_open())
    {
        for (int i = 0; i < size; i++)
        {
            getline(file, words[i]);
        }
    }
}
Just to start I'm going to provide some tips about your code:
int size = size();
Why do you need to open the file, read the first line and then close it? That process can be done by opening the file just once.
The code string words[size]; is not legal C++. You cannot instantiate a variable-length array in C++; that C feature has not been included in the C++ standard (some ref). I suggest you replace it with std::vector, which is more idiomatic C++.
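A minimal direct fix of the question's snippet, folded into one function so the file is opened only once (the function name and path handling are mine):

```cpp
#include <fstream>
#include <string>
#include <vector>

// One open, and std::vector instead of the illegal `string words[size]`.
std::vector<std::string> read_lines(const char* path) {
    std::ifstream file(path);
    int size = 0;
    file >> size;                          // first line: number of entries
    file.ignore(100000, '\n');             // skip the rest of the first line
    std::vector<std::string> words(size);  // legal, unlike a VLA
    for (int i = 0; i < size; i++)
        std::getline(file, words[i]);
    return words;
}
```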
Here is a snippet of a function which performs what you need.
int parse_file(const std::string& filename,
               std::vector<std::string>* out_strings,
               std::vector<int>* out_integers) {
    assert(out_strings != nullptr);
    assert(out_integers != nullptr);
    std::ifstream file;
    file.open(filename, std::ios_base::in);
    if (file.fail()) {
        // handle the error
        return -1;
    }
    // Local variables
    int num_rows;
    std::string line;
    // parse the first line
    std::getline(file, line);
    if (line.size() == 0) {
        // file empty, handle the error
        return -1;
    }
    num_rows = std::stoi(line);
    // reserve memory
    out_strings->clear();
    out_strings->reserve(num_rows);
    out_integers->clear();
    out_integers->reserve(num_rows);
    for (int row = 0; row < num_rows; ++row) {
        // read the line
        std::getline(file, line);
        if (line.size() == 0) {
            // unexpected end of line, handle it
            return -1;
        }
        // get the integer
        out_integers->push_back(
            std::stoi(line.substr(6, line.find(' ', 6) - 6)));
        // get the string
        out_strings->push_back(
            line.substr(line.find(' ', 6) + 1, std::string::npos));
    }
    file.close();
    return 0;
}
You can definitely improve it, but I think it's a good starting point.
The last suggestion I can give you: in order to improve the robustness of your code, you can match each line against a regular expression. This way you can be sure each line is formatted exactly how you need.
For example:
std::regex line_pattern("\\s{6}[0-9]+\\s[^\\n]+");
if (std::regex_match(line, line_pattern) == false) {
    // oops... the line is not formatted how you need;
    // this is an error
}

Performance bottleneck with CSV parser

My current parser is given below - reading a ~10MB CSV into an STL vector takes ~30s, which is too slow for my liking, given I've got over 100MB which needs to be read in every time the program is run. Can anyone give some advice on how to improve performance? Indeed, would it be faster in plain C?
int main() {
    std::vector<double> data;
    std::ifstream infile( "data.csv" );
    infile >> data;
    std::cin.get();
    return 0;
}
std::istream& operator >> (std::istream& ins, std::vector<double>& data)
{
    data.clear();
    // Reserve data vector
    std::string line, field;
    std::getline(ins, line);
    std::stringstream ssl(line), ssf;
    std::size_t rows = 1, cols = 0;
    while (std::getline(ssl, field, ',')) cols++;
    while (std::getline(ins, line)) rows++;
    std::cout << rows << " x " << cols << "\n";
    ins.clear(); // clear bad state after eof
    ins.seekg(0);
    data.reserve(rows*cols);
    // Populate data
    double f = 0.0;
    while (std::getline(ins, line)) {
        ssl.str(line);
        ssl.clear();
        while (std::getline(ssl, field, ',')) {
            ssf.str(field);
            ssf.clear();
            ssf >> f;
            data.push_back(f);
        }
    }
    return ins;
}
NB: I have also have openMP at my disposal, and the contents will eventually be used for GPGPU computation with CUDA.
You could halve the time by reading the file once, not twice.
While presizing the vector is beneficial, it will never dominate runtime, because I/O will always be slower by orders of magnitude.
Another possible optimization could be reading without a string stream. Something like (untested)
int c = 0;
while (ins >> f) {
    data.push_back(f);
    if (++c < cols) {
        char comma;
        ins >> comma; // skip comma
    } else {
        c = 0; // end of line, start next line
    }
}
If you can omit the , and separate the values by white space only, it could be even
while (ins >> f)
data.push_back(f);
or
std::copy(std::istream_iterator<double>(ins), std::istream_iterator<double>(),
          std::back_inserter(data));
On my machine, your reserve code takes about 1.1 seconds and your populate code takes 8.5 seconds.
Adding std::ios::sync_with_stdio(false); made no difference to my compiler.
The below C code takes 2.3 seconds.
int i = 0;
int j = 0;
while( true ) {
    float x;
    j = fscanf( file, "%f", &x );
    if( j == EOF ) break;
    data[i++] = x;
    // skip ',' or '\n'
    int ch = getc(file);
}
Try calling
std::ios::sync_with_stdio(false);
at the start of your program. This disables the (allegedly quite slow) synchronization between cin/cout and scanf/printf (I have never tried this myself, but have often seen the recommendation, such as here). Note that if you do this, you cannot mix C++-style and C-style IO in your program.
(In addition, Olaf Dietsche is completely right about only reading the file once.)
Apparently, file I/O is a bad idea; just map the whole file into memory and access the CSV file as a contiguous VM block. This incurs only a few syscalls.
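A sketch of that approach with POSIX mmap (the function name and error handling are mine; this assumes a POSIX system):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Parse every number in a comma/newline-separated file mapped into memory.
 * Caveat: if the file size is an exact multiple of the page size there is
 * no zero byte after the mapping, so strtod could run past the end; a
 * robust version would handle that (e.g. by copying the last few bytes). */
size_t parse_mapped(const char* path, double* out, size_t max) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
    char* data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return 0; }
    const char* p = data;
    const char* end = data + st.st_size;
    size_t n = 0;
    while (p < end && n < max) {
        char* stop;
        double v = strtod(p, &stop);
        if (stop == p) { ++p; continue; }   /* not a number: skip the delimiter */
        out[n++] = v;
        p = stop;
    }
    munmap(data, (size_t)st.st_size);
    close(fd);
    return n;
}
```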