How to speed up loading text file to multi vector - c++

I have to load large files (several GB) of data into a two-dimensional vector. The code below does the job, but it is insanely slow. To be more specific, the goal is to get all lines where the value in the 2nd column is equal to index(_lh,_sh). Then I want to exclude the lines whose 4th column value is the same as in line+1 and line-1.
Now, I am new to C++ and I usually code in Python (I already have working code for this problem). But I need it to be as fast as possible, so I tried to rewrite my Python code in C++. It now runs slower than the Python version (and only getting the data into the vector is implemented), so before I proceed, I want to improve that.
From what I have found in similar questions, the problem would be dynamic vectors, .push_back() and getline().
I am rather confused about the mapping and chunk loading mentioned in similar questions, so I am not able to change the code according to these.
Could you help me to optimize this code?
Thank you.
#include <iostream>
#include <sstream>
#include <fstream>
#include <array>
#include <string>
#include <vector>
using namespace std;
int pixel(int radek, int sloupec, int rozmer = 256) {
int index = (radek - 1) * rozmer + sloupec;
int index_lh = (index - rozmer - 1);
int index_sh = (index - rozmer);
int index_ph = (index - rozmer + 1);
int index_l = (index - 1);
int index_p = (index + 1);
int index_ld = (index + rozmer - 1);
int index_sd = (index + rozmer);
int index_pd = (index + rozmer + 1);
array<int, 9> index_all = { {index, index_lh, index_sh, index_ph, index_l, index_p, index_ld, index_sd, index_pd } };
vector<vector<string>> Data;
vector<string> Line;
string line;
for (int m = 2; m < 3; m++) {
string url = ("e:/TPX3 - kalibrace - 170420/ToT_ToA_calib_Zn_" + to_string(m) + string(".t3pa"));
cout << url << endl;
ifstream infile(url);
if (!infile)
{
cout << "Error opening output file" << endl;
system("pause");
return -1;
}
while (getline(infile, line))
{
Line.push_back(line);
istringstream txtStream(line);
string txtElement;
vector<string> Element;
while (getline(txtStream, txtElement, '\t')){
Element.push_back(txtElement);
}
Data.push_back(Element);
}
}
cout << Data[1][0] << ' ' << Data[1][1] << ' ' << Data[1][2] << endl;
return 0;
}
int main()
{
int x = pixel(120, 120);
cout << x << endl;
system("pause");
return 0;
}

Vectors can get slow if their underlying buffer gets reallocated often. A vector is required to be implemented on a buffer of contiguous memory, and every time the buffer limit is exceeded, it has to allocate a new and larger buffer and then copy (or move) the content from the old buffer to the new one. If you have an idea of how big a buffer you require (you don't need to be exact), you can help the program allocate a buffer of appropriate size by using e.g. Data.reserve(n) (where n is approximately the number of elements you think you need). This does not change the "size" of the vector, just the size of the underlying buffer. As a concluding remark, I have to say I haven't really ever benchmarked this, so this may or may not improve the performance of your program.
EDIT: Though, I think it is more likely that the bottleneck is the line Data.push_back(Element); which makes a copy of the Element vector. If you're using C++11, I believe it's possible to work around this by doing something like Data.emplace_back(std::move(Element)); after which you shouldn't rely on the contents of Element (they have been moved out). You would also need to include <utility> for std::move.
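Here is a minimal sketch of the two suggestions combined (reserve plus moving instead of copying), assuming tab-separated input that can be read from any std::istream. The function name loadRows and the parameter expectedRows are made up for this illustration, and I haven't benchmarked it against the original code.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads tab-separated lines from `in` into a 2D vector.
// `expectedRows` is only a rough guess used to pre-size the outer buffer.
std::vector<std::vector<std::string>> loadRows(std::istream& in, std::size_t expectedRows)
{
    std::vector<std::vector<std::string>> data;
    data.reserve(expectedRows);              // changes the capacity, not the size

    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<std::string> row;
        std::string field;
        while (std::getline(fields, field, '\t')) {
            row.push_back(std::move(field)); // move the text instead of copying it
            field.clear();                   // put the moved-from string back into a known state
        }
        data.push_back(std::move(row));      // move the whole row into the outer vector
    }
    return data;
}
```

With a real file you would pass a std::ifstream as the first argument; a wrong guess for expectedRows only costs some extra reallocations or some unused capacity.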

In the while loop, you could try changing the lines from
while (getline(infile, line))
{
Line.push_back(line);
istringstream txtStream(line);
string txtElement;
vector<string> Element;
while (getline(txtStream, txtElement, '\t')){
Element.push_back(txtElement);
}
Data.push_back(Element);
}
to:
while (getline(infile, line))
{
Line.push_back(line);
istringstream txtStream(line);
string txtElement;
//vector<string> Element; [-]
Data.emplace_back(); // [+]
while (getline(txtStream, txtElement, '\t')) {
//Element.push_back(txtElement); [-]
Data.back().push_back(txtElement); // [+]
}
//Data.push_back(Element); [-]
}
That way, the vectors in Data don't need to get moved or copied there -- they are already constructed, albeit empty. The vectors in Data are default-constructed with .emplace_back(). We get the last element in Data with the .back() function, and push our values as usual with .push_back(). Hopefully this helps :)

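You can try using the old C file reading API (FILE*, fopen(), etc.) or setting a bigger buffer for std::istringstream as follows (note that the effect of pubsetbuf() on a string stream is implementation-defined, so measure it and check that the parsing still works)
constexpr std::size_t dimBuff { 10240 }; // 10K, for example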
char myBuff[dimBuff];
// ...
istringstream txtStream(line);
txtStream.rdbuf()->pubsetbuf(myBuff, dimBuff);
Another thing that you can try is using std::deques instead of std::vectors (but I've no idea if this is useful).
As suggested by muos, you can use move semantics; you can use emplace_back() also.
So I suggest to try with
Element.push_back(std::move(txtElement));
Data.push_back(std::move(Element));
or
Element.emplace_back(std::move(txtElement));
Data.emplace_back(std::move(Element));
You can also switch the following lines (there isn't a move constructor from a string for std::istringstream, if I'm not wrong)
Line.push_back(line);
istringstream txtStream(line);
adding move semantics (and emplace_back())
istringstream txtStream(line);
Line.emplace_back(std::move(line));
p.s.: obviously reserve() is useful

You can also use reserve(n) on the vectors so they are created closer to the target size.
That too can avoid a lot of vector hopping around the heap, as the vector will only be reallocated if it passes the target size.
You can call reserve again if the vector passes the size you previously reserved:
vector<int> vec;
vec.reserve(10);
for (int i=0;i < 1000; i++)
{
if ( vec.size() == vec.capacity() )
{
vec.reserve(vec.size()+10);
}
vec.push_back(i);
}
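To see what reserve() buys you, here is a small sketch that counts how often the buffer is reallocated while pushing elements. The helper name countReallocations is made up for this illustration. The standard guarantees that no reallocation happens while the size stays within the reserved capacity, so a single reserve() up front with a reasonable estimate is usually enough; growing in small fixed steps tends to cause more reallocations than the vector's own growth strategy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: pushes `count` ints and reports how often the
// vector's capacity changed, i.e. how often its buffer was reallocated.
// If `reserveFirst` is true, the full capacity is requested up front.
std::size_t countReallocations(std::size_t count, bool reserveFirst)
{
    std::vector<int> vec;
    if (reserveFirst) {
        vec.reserve(count);
    }
    std::size_t reallocations = 0;
    std::size_t lastCapacity = vec.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        vec.push_back(static_cast<int>(i));
        if (vec.capacity() != lastCapacity) {   // the buffer was replaced by a bigger one
            ++reallocations;
            lastCapacity = vec.capacity();
        }
    }
    return reallocations;
}
```

Calling countReallocations(1000, true) returns 0, while the count without the up-front reserve depends on the growth factor of your standard library.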

Related

Reading in a file that has strings and ints in c++

So I have a sample file that I would like to read in, looking something like:
data 1
5
data 2
0
9
6
6
1
data 3
7
3
2
I basically want to assign each of these to variables I have in a struct, eg. my struct looks like:
struct sample_struct
{ int data1;
double* data2;
double* data3;
};
How do I approach this question?
I think I would be able to do it if I had the sample number of integers following each of the string titles, but like this I have no idea. Please help.
You have to solve 2 problems.
Dynamic memory management.
Detecting which section we are in and where to store the data.
Number 1 is usually solved with a std::vector in C++. Raw pointers for owned memory or C-style arrays are usually avoided in modern C++.
If you do not want to, or are not allowed to, use a std::vector, you need to handcraft some dynamic array. I made an example for you.
For the 2nd part, we can simply treat a line that starts with an alphabetic character as a separator between the sections of the source file. So, if we see an alpha character, we go to a new section and then store the data in the appropriate struct members.
Input and output in C++ is usually done with the extractor >> and inserter << operators. This allows us to read from and write to any kind of stream. Also, in C++ we often use object-oriented programming. Here, data and methods are packed into one class/struct. Only the class/struct should know how to read and write its data.
Based on the above thoughts, we could come up with the solution below:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cctype>
// Ultra simple dynamic array
struct DynamicDoubleArray {
// Constructor and Destructor
DynamicDoubleArray() { values = new double[capacity]; }; // Allocate default memory
~DynamicDoubleArray() { delete[] values; }; // Release previously allocated memory
// Data
double* values{}; // Here we store the values. This is a dynamic array
int numberOfElements{}; // Number of elements currently existing in dynamic array
int capacity{ 2 }; // Number of elements that could be stored in the dynamic array
void push_back(double v) { // Add a new element to our dynamic array
if (numberOfElements >= capacity) { // Check, if we have enough capacity to store the new element
capacity *= 2; // No, we have not. We need more capacity
double* temp = new double[capacity]; // Get new, bigger memory
for (int k = 0; k < numberOfElements; ++k) // Copy old data to new bigger memory
temp[k] = values[k];
delete[] values; // Delete old data
values = temp; // And assign new temp data to our original pointer
}
values[numberOfElements++] = v; // Store new data and increment element counter
}
};
// Our sample struct
struct SampleStruct {
// Data part
int data1{};
DynamicDoubleArray data2{};
DynamicDoubleArray data3{};
// Help functions. We overwrite the inserter and extractor operator
// Extract elements from whatever stream
friend std::istream& operator >> (std::istream& is, SampleStruct& s) {
std::string line{}; // Temporary storage to hold a complete line
int section = 0; // Section. Where to store the data
while (std::getline(is, line)) {
if (std::isalpha(line[0])) { // If we see an alpha character then we are in the next section
++section; // Now, we will use the next section
continue;
}
switch (section) { // Depending on in which section we are
case 1:
s.data1 = std::stoi(line); // Section 1 --> Store int data
break;
case 2:
s.data2.push_back(std::stod(line)); // Section 2 --> Add/Store double data 2
break;
case 3:
s.data3.push_back(std::stod(line)); // Section 3 --> Add/Store double data 3
break;
default:
std::cerr << "\nError: internal mode error\n";
break;
}
}
return is;
}
// Simple output
friend std::ostream& operator << (std::ostream& os, const SampleStruct& s) {
os << "\nData 1: " << s.data1 << "\nData 2: ";
for (int k = 0; k < s.data2.numberOfElements; ++k)
os << s.data2.values[k] << ' ';
os << "\nData 3: ";
for (int k = 0; k < s.data3.numberOfElements; ++k)
os << s.data3.values[k] << ' ';
return os;
}
};
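If std::vector is allowed, the hand-made dynamic array is not needed at all. The following is a minimal sketch of the same section logic with std::vector members. The names SampleData and readSample are made up for this illustration, and it assumes the format shown in the question (a title line starting with a letter, followed by one number per line).

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical std::vector-based counterpart of SampleStruct.
struct SampleData {
    int data1{};
    std::vector<double> data2;
    std::vector<double> data3;
};

// Reads the sectioned format: a title line starting with a letter opens the
// next section, every other non-empty line is a number for that section.
SampleData readSample(std::istream& is)
{
    SampleData s;
    int section = 0;
    std::string line;
    while (std::getline(is, line)) {
        if (line.empty()) continue;                            // skip blank lines
        if (std::isalpha(static_cast<unsigned char>(line[0]))) {
            ++section;                                         // title line: next section
            continue;
        }
        switch (section) {
        case 1: s.data1 = std::stoi(line); break;
        case 2: s.data2.push_back(std::stod(line)); break;
        case 3: s.data3.push_back(std::stod(line)); break;
        default: break;   // numbers outside sections 1 to 3 are ignored in this sketch
        }
    }
    return s;
}
```

Because the vectors manage their own memory, SampleData can be copied and returned by value without writing a destructor or copy operations.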

Returning Array in C++ returns unaccessable elements

I am working on a project where I parse a string in to an array and then return it back to the main function. It parses fine but when I return it to the main function I can't get access to the array elements.
//This is from the Main function. It calls commaSeparatedToArray which returns the array.
for (int i = 0; i < numberOfStudents; i++) {
string * parsedToArray = mainRoster->commaSeparatedToArray(studentData[i]);
Degree degreeType = SOFTWARE;
for (int i = 0; i < 3; i++) {
if (degreeTypeStrings[i] == parsedToArray[8])
degreeType = static_cast<Degree>(i);
}
mainRoster->add(parsedToArray[0], parsedToArray[1], parsedToArray[2], parsedToArray[3], stoi(parsedToArray[4]), stoi(parsedToArray[5]), stoi(parsedToArray[6]), stoi(parsedToArray[7]), degreeType);
}
//Here is the commaSeparatedToArray function
string * roster::commaSeparatedToArray(string rowToParse) {
int currentArraySize = 0;
const int expectedArraySize = 9;
string valueArray[expectedArraySize];
int commaIndex = 0;
string remainingString = rowToParse;
while (remainingString.find(",") != string::npos) {
currentArraySize++;
if (currentArraySize <= expectedArraySize) {
commaIndex = static_cast<int>(remainingString.find(","));
valueArray[currentArraySize - 1] = remainingString.substr(0, commaIndex);
remainingString = remainingString.substr(commaIndex + 1, remainingString.length());
}
else {
cerr << "INVALID RECORD. Record has more values then is allowed.\n";
exit(-1);
}
}
if (currentArraySize <= expectedArraySize) {
currentArraySize++;
commaIndex = static_cast<int>(remainingString.find(","));
valueArray[currentArraySize - 1] = remainingString.substr(0, commaIndex);
remainingString = remainingString.substr(commaIndex + 1, remainingString.length());
}
if (currentArraySize < valueArray->size()) {
cerr << "INVALID RECORD. Record has fewer values then is allowed.\n";
exit(-1);
}
return valueArray;
}
1) You can't return arrays in C++. Your code (as I'm sure you know) returns a pointer to an array. That's an important difference.
2) The array is declared locally in the function and therefore no longer exists after the function has exited.
3) Therefore once you have returned from the function you have a pointer to something which no longer exists. Bad news.
4) You must always consider the lifetime of objects when you program C++. One solution to this problem is to dynamically allocate the array (using new[]). This means that the array will still exist when you exit the function. But it has the significant disadvantage that you must remember to delete[] the array at a suitable later time.
5) The best solution (in general) is to use a std::vector. Unlike an array a std::vector can be returned from a function. So this option leads to the simplest, most natural code.
vector<string> roster::commaSeparatedToArray(string rowToParse) {
...
vector<string> valueArray(expectedArraySize);
...
return valueArray;
}
Since your array/vector is constant size, you could also use a std::array
array<string, expectedArraySize> valueArray;
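To make the std::array variant concrete, here is a minimal sketch written as a free function. The name commaSeparatedToStdArray is made up for this illustration, and throwing an exception instead of calling exit() is my own choice, not something from the original code. It also assumes no record ends with an empty field, because std::getline does not report a trailing empty field.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

constexpr std::size_t expectedArraySize = 9;

// Hypothetical free-function version: splits `rowToParse` at commas and
// returns the fields by value, so nothing points at a dead local array.
std::array<std::string, expectedArraySize> commaSeparatedToStdArray(const std::string& rowToParse)
{
    std::array<std::string, expectedArraySize> values;
    std::istringstream row(rowToParse);
    std::string field;
    std::size_t count = 0;
    while (std::getline(row, field, ',')) {
        if (count == expectedArraySize) {
            throw std::invalid_argument("record has more values than allowed");
        }
        values[count++] = field;
    }
    if (count < expectedArraySize) {
        throw std::invalid_argument("record has fewer values than allowed");
    }
    return values;   // the whole array is returned by value
}
```

The caller can then index the result just like the old pointer, e.g. parsed[8], but the data stays alive for as long as the returned object does.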
To complete the answer that John has already given, I made some example code to show you what such a function could look like.
Parsing, or tokenizing, can easily be done with the std::sregex_token_iterator. That is one of the purposes of this iterator. You can see how simple it is to use below.
In the function we define a vector of strings and use its range constructor to do the whole tokenizing.
Then we make a sanity check and return the data.
Please see:
#include <string>
#include <regex>
#include <iterator>
#include <vector>
#include <algorithm>
#include <iostream>
const std::regex separator(",");
constexpr size_t ExpectedColumnSize = 9;
std::vector<std::string> commaSeparatedToArray(std::string rowToParse)
{
// Parse row into substrings
std::vector<std::string> columns{
std::sregex_token_iterator(rowToParse.begin(),rowToParse.end(),separator ,-1),
std::sregex_token_iterator() };
// Check number of columns
if (columns.size() != ExpectedColumnSize) {
std::cerr << "Error. Unexpected number of columns in record\n";
}
return columns;
}
// test code
int main()
{
// Define test data
std::string testInputData{ "1,2,3,4,5,6,7,8,9" };
// Get the result from the parser
std::vector<std::string> parsedElements{ commaSeparatedToArray(testInputData) };
// show the result on the console
std::copy(parsedElements.begin(), parsedElements.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}

Trying to add elements to a Vector classified with a Struct

I'm making a program to basically show the statistics about words the user enters. The rest of the program is fine so far, but I'm having a hard time adding words to a vector of type WordCount.
I have looked around and found several answers, which I would've thought could solve my issue, but I either get a very weird compiler error or it just does not work. I have tried using emplace_back and push_back with calls I thought was right. In essence, my problem code is as follows:
#include <iostream>
#include <string>
#include <vector>
using namespace std; //for simplicity here
struct WordCount {
string word;
int count;
//I have tried using this too:
WordCount(string _word, int _count) : word{_word}, count{_count} {}
};
//...//
void wordToVector(/**...**/,string addStr, vector<WordCount>& wordStats){
/**... code that I've tested to work; basically determined if the
word was already said as I need to have unique words only...**/
wordStats.push_back(WordCount(addStr, 1));
/** also tried: (some had "#include <istream>" when using emplace_back
but that didn't seem to make a difference for me in any case)
wordStats.emplace_back(WordCount(addStr, 1));
wordStats.emplace_back({addStr, 1});
wordStats.push_back(addStr, 1)
wordStats.push_back(addStr).word; (and wordStats.push_back(1).count;)
**/
}
int main() {
vector<WordCount> wordStats(1); //"1" to initialize the size
wordStats.at(0).word = "";
wordStats.at(0).count = 0;
/**There's already a part to change the first values to what they should
be, and it worked last I tested it. Below is a part was for my
personal use to see if anything came out... if it worked**/
for (int i = 0; i < 3; i++) {
cout << wordStats.at(i).word << endl;
cout << wordStats.at(i).count << endl;
}
return 0;
}
I must use a vector for this and cannot use pointers (as I've seen suggested) or #include <algorithm> per the instructions. If I typed in "Oh happy day!", it should be able to print (when fixed, with the current cout statements):
OH
1
HAPPY
1
DAY
1
(There's an earlier part that capitalizes every letter, which I tested to work).
This is my first post here because I'm lost. Please let me know if I provided too much or not enough.
#include <iostream>
#include <string>
#include <vector>
using namespace std;
struct WordCount {
string word;
int count;
};
void wordToVector(string addStr, vector<WordCount>& wordStats){
for (int i = 0; i < wordStats.size(); i++) {
if (wordStats[i].word == addStr) {
wordStats[i].count = wordStats[i].count + 1;
return;
}
}
struct WordCount wc;
wc.word = addStr;
wc.count = 1;
wordStats.push_back(wc);
}
int main() {
vector<WordCount> wordStats;
wordToVector("hehe", wordStats);
wordToVector("hehe", wordStats);
wordToVector("haha", wordStats);
for (int i = 0; i < wordStats.size(); i++) {
cout << wordStats.at(i).word << endl;
cout << wordStats.at(i).count << endl;
}
return 0;
}
Using this code I get output:
hehe
2
haha
1
Is there anything else that needs to be added?
If you want to split the input at the spaces and count the occurrences of every word, scanning the whole vector for each word can be quite inefficient for longer texts: each lookup is linear, so I think the total cost is on the order of M*N for M input words and N unique words. If you are allowed to, I suggest using a map with the word as key and the number of occurrences as value, or something in that fashion.
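A minimal sketch of the map idea, assuming std::map is permitted by the assignment (the question only rules out pointers and <algorithm>). The function name countWords is made up for this illustration.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: splits `text` at whitespace and counts each word.
// std::map::operator[] creates a missing entry with count 0 before the increment.
std::map<std::string, int> countWords(const std::string& text)
{
    std::map<std::string, int> counts;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        ++counts[word];
    }
    return counts;
}
```

One design consequence to be aware of: a std::map iterates in sorted key order, not in the order the words were entered, so if the output has to follow the input order the vector-based version is still the simpler fit.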

C++ Spell checking program with two classes; Dictionary and word

Here is the specification for the code:
You are to use the Word and Dictionary classes defined below and write all member functions and any necessary supporting functions to achieve the specified result.
The Word class should dynamically allocate memory for each word to be stored in the dictionary.
The Dictionary class should contain an array of pointers to Word. Memory for this array must be dynamically allocated. You will have to read the words in from the file. Since you do not know the "word" file size, you do not know how large to allocate the array of pointers. You are to let this grow dynamically as you read the file in. Start with an array size of 8, When that array is filled, double the array size, copy the original 8 words to the new array and continue.
You can assume the "word" file is sorted, so your Dictionary::find() function must contain a binary search algorithm. You might want to save this requirement for later - until you get the rest of your program running.
Make sure you store words in the dictionary as lower case and that you convert the input text to the same case - that way your Dictionary::find() function will successfully find "Four" even though it is stored as "four" in your Dictionary.
Here is my code so far.
#include <cstring>
#include <iostream>
#include <fstream>
using namespace std;
class Word
{
char* word_;
public:
Word(const char* text = 0);
~Word() { delete[] word_; word_ = nullptr; }
const char* word() const;
};
Word::Word(const char* arg)
: word_(new char[strlen(arg) + 1])
{
strcpy(word_, arg);
}
const char* Word::word() const
{
return word_;
}
class Dictionary
{
Word** words_;
unsigned int capacity_; // max number of words Dictionary can hold
unsigned int numberOfWordsInDictionary_;
void resize() {
capacity_ = capacity_ * 2;
cout << "Size = " << capacity_ << endl;
};
void addWordToDictionary(char* word) { words_ += *word; };
public:
Dictionary(const char* filename);
~Dictionary() {
delete[] words_; words_ = nullptr;
};
bool find(const char* word);
};
Dictionary::Dictionary(const char * filename)
: words_(new Word*[8]), capacity_(8), numberOfWordsInDictionary_(0)
{
ifstream fin(filename);
if (!filename) {
cout << "Failed to open file!" << endl;
}
char buffer[32];
while (fin.getline(buffer, sizeof(buffer)))
{
if (numberOfWordsInDictionary_ == capacity_)
{
resize();
}
addWordToDictionary(buffer);
}
}
bool Dictionary::find(const char * left)
{
int last = capacity_ - 1,
first = 0,
middle;
bool found = false;
while (!found && first <= last) {
middle = (first + last) / 2;
if (strcmp(left, reinterpret_cast<char*>(words_[middle])) == 0) {
found = true;
}
else if (left > reinterpret_cast<char*>(words_[middle]))
last = middle - 1;
else
first = middle + 1;
}
return found;
}
;
bool cleanupWord(char x[] ) {
bool lower = false;
int i = 0;
while (x[i]) {
char c = x[i];
putchar(tolower(c));
lower = true;
}
return lower;
}
int main()
{
char buffer[32];
Dictionary Websters("words.txt");
ifstream fin("gettysburg.txt");
cout << "\nSpell checking " << "gettysburg.text" << "\n\n";
while (fin >> buffer) {
if (cleanupWord(buffer) == true) {
if (!Websters.find(buffer)) {
cout << buffer << " not found in the Dictionary\n";
}
}
}
system("PAUSE");
}
When I run the program it stops after outputting "spellchecking Gettysburg.txt" and I don't know why. Thank you!
The most likely cause of this problem is the text files have not been opened. Add a check with is_open to make sure they have been opened.
When using relative paths (any path that does not go all the way back to the root of the file system, as an absolute path does), take care that the program is being run from the directory you believe it to be. It is not always the same directory as the executable. Search term to use to learn more about this: working directory.
Now on to other reasons this program will not work:
void addWordToDictionary(char* word) { words_ += *word; };
is not adding words to the dictionary. Instead it is advancing the address at which words_ points by the numeric value of the letter at *word. This is extremely destructive as it loses the pointer to the buffer allocated for words_ in the constructor making delete[] words_; in the Dictionary destructor ineffective and probably fatal.
Instead you want to do the following. (Note that I say "want to" with a bit of trepidation. What you really want to do is use std::vector and std::string, but I strongly suspect this would upset the assignment's marker.)
Dynamically allocate a new Word with new.
Place this word in a free spot in the words_ array. Something along the lines of words_[numberOfWordsInDictionary_] = myNewWord;
Increase numberOfWordsInDictionary_ by 1.
Note that the Words allocated with new must all be released in the Dictionary destructor. You will want a for loop to help with this.
In addition, I would move the
if (numberOfWordsInDictionary_ == capacity_)
{
resize();
}
from the Dictionary constructor into addWordToDictionary, so that any time addWordToDictionary is called the array is properly sized.
Hmmm. While we're at it, let's look at resize
void resize() {
capacity_ = capacity_ * 2;
cout << "Size = " << capacity_ << endl;
};
This increases the object's capacity_ but does nothing to allocate more storage for words_. This needs to be corrected. You must:
Double the value of capacity_. You already have this.
Allocate a larger buffer to hold the replacement of words_ with new.
Copy all of the Words in words_ to the larger buffer.
Free the buffer currently pointed to by words_
Point words_ at the new, larger buffer.
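Put together, the steps for adding a word and for resizing could look roughly like the sketch below. It is not the assignment's Dictionary: the class name WordList, its members and the simplified Word are made up for this illustration, and copying is simply disabled rather than implemented.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the assignment's Word class.
class Word {
    char* word_;
public:
    explicit Word(const char* text) : word_(new char[std::strlen(text) + 1]) {
        std::strcpy(word_, text);
    }
    ~Word() { delete[] word_; }
    Word(const Word&) = delete;             // copying is not needed in this sketch
    Word& operator=(const Word&) = delete;
    const char* word() const { return word_; }
};

// Hypothetical container showing only the storage handling.
class WordList {
    Word** words_;
    std::size_t capacity_;
    std::size_t count_;

    void resize() {
        std::size_t newCapacity = capacity_ * 2;    // 1. double the capacity
        Word** bigger = new Word*[newCapacity];     // 2. allocate a larger buffer
        for (std::size_t i = 0; i < count_; ++i)    // 3. copy the existing pointers
            bigger[i] = words_[i];
        delete[] words_;                            // 4. free the old buffer (not the Words)
        words_ = bigger;                            // 5. point at the new buffer
        capacity_ = newCapacity;
    }
public:
    WordList() : words_(new Word*[8]), capacity_(8), count_(0) {}
    ~WordList() {
        for (std::size_t i = 0; i < count_; ++i)    // release every Word first
            delete words_[i];
        delete[] words_;                            // then the pointer array itself
    }
    WordList(const WordList&) = delete;
    WordList& operator=(const WordList&) = delete;

    void add(const char* text) {
        if (count_ == capacity_) resize();          // grow before the array overflows
        words_[count_++] = new Word(text);          // store the new Word in the next free slot
    }
    std::size_t size() const { return count_; }
    std::size_t capacity() const { return capacity_; }
    const char* at(std::size_t i) const { return words_[i]->word(); }
};
```

Note that resize() copies only the pointers, so the Word objects themselves never move; that is why the destructor has to delete each Word separately before releasing the array.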
Addendum
I haven't looked closely at find because the carnage required to fix the reading and storage of the dictionary will most likely render find unusable even if it does currently work. The use of reinterpret_cast<char*> is an alarm bell, though. There should be no reason for a cast, let alone the most permissive of them all, in a find function. Rule of thumb: When you see a reinterpret_cast and you don't know what it's for, assume it's hiding a bug and approach it with caution and suspicion.
In addition to investigating the Rule of Three mentioned in the comments, look into the Rule of Five. This will allow you to make a much simpler, and probably more efficient, dictionary based around Word* words_, where words_ will point to an array of Word directly instead of pointers to Words.
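For the find function, once the words are stored correctly, a binary search that compares with strcmp needs no cast at all. This is a minimal sketch as a free function; the name containsWord and its parameters are made up for this illustration, and it assumes the words are sorted in ascending strcmp order and that only the filled part of the array (the number of words, not the capacity) is searched.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical free function: binary search over `count` C strings that are
// sorted in ascending strcmp order. Returns true if `target` is present.
bool containsWord(const char* const* sortedWords, std::size_t count, const char* target)
{
    std::size_t first = 0;
    std::size_t last = count;                 // half-open range [first, last)
    while (first < last) {
        std::size_t middle = first + (last - first) / 2;
        int cmp = std::strcmp(target, sortedWords[middle]);
        if (cmp == 0) return true;
        if (cmp < 0)
            last = middle;                    // target sorts before the middle word
        else
            first = middle + 1;               // target sorts after the middle word
    }
    return false;
}
```

In the Dictionary class the comparison would be made against words_[middle]->word() instead of sortedWords[middle]; comparing the pointers themselves with < or > only compares addresses, not the text.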

Segmentation fault during counting of elements in array of strings c++

I am trying to solve an old problem found on topcoder. I am immediately stuck in trying to find the number of elements in an array of strings. Here is my code
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <string>
using namespace std;
class MiniPaint {
private:
size_t numLines;
public:
int leastBad(string picture[], int maxStrokes) {
numLines = 0;
while (!picture[numLines].empty()) {
numLines++;
}
cout << numLines << '\n';
return 0;
}
};
int main() {
MiniPaint instance;
string picture[] = {"BBBBBBBBBBBBBBB", "WWWWWWWWWWWWWWW", "WWWWWWWWWWWWWWW", "WWWWWBBBBBWWWWW"};
instance.leastBad(picture, 10);
return 0;
}
This code gives me a segmentation fault. Something is going wrong, the code is a little bit excessive for just the functionality of counting the number of elements but of course I want to extend the class to include more functionality. If anyone can explain what is going wrong here I would be grateful! Thanks in advance.
EDIT: when I expand the code by
cout << picture[numLines] << '\n';
in the while loop, to show the actual elements in the array, first the four proper strings are shown and then somehow it endlessly prints spaces to the terminal. So the problem lies somewhere in the fact that
picture[4].empty()
does not return true, even though picture has only four elements.
Your while loop condition assumes that the last string in the array is empty:
int leastBad(string picture[], int maxStrokes) {
numLines = 0;
while (!picture[numLines].empty()) {
But your input string array defined in main() is not terminated with an empty "" string.
So you may want to add this empty string terminator:
// inside main()
string picture[] = {..., "" /* Empty string terminator */ };
In addition, in modern C++ I'd encourage you to use array container classes instead of raw C-style arrays, typically std::vector<std::string>.
In this case, you can use the size() method to get the array size (i.e. element count), or just a range-for loop for iterating through the whole array.
You access the array out of bounds.
When you access picture[4], you access a string object that is not there, and the call to empty() operates on uninitialized memory.
You either need to store how big the array is and iterate only while numLines <= 3, or you can use a vector
std::vector<std::string> picture = ...
for(std::string line : picture)
{
//do stuff
}
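You are out of array bounds at picture[numLines]. You should pass the array length to the function and check the index numLines against it. Note that the length has to be calculated where the array is declared (in main()): inside leastBad() the parameter picture is just a pointer, so sizeof(picture) would give the size of a pointer. The code will look like:
// in main(), where picture is still an array (for VS you can use the _countof macro)
size_t length = sizeof(picture) / sizeof(*picture);
// in leastBad(), with length passed in as an extra parameter
while (numLines < length && !picture[numLines].empty())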
{
++numLines;
}
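As a sketch of how the length can travel with the array, here are two overloads: one takes the length explicitly, the other takes the array by reference so the compiler deduces the element count at the call site. The function name countLines is made up for this illustration, and the second overload only works where the argument is still a real array, e.g. in main().

```cpp
#include <cstddef>
#include <string>

// Counts leading non-empty strings without reading past `length` elements.
std::size_t countLines(const std::string picture[], std::size_t length)
{
    std::size_t numLines = 0;
    while (numLines < length && !picture[numLines].empty()) {
        ++numLines;
    }
    return numLines;
}

// Overload taking the array by reference, so N is deduced by the compiler.
template <std::size_t N>
std::size_t countLines(const std::string (&picture)[N])
{
    return countLines(picture, N);
}
```

With the four-element picture array from the question, countLines(picture) called from main() returns 4 without ever touching picture[4].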