Handle very large data in C++

I have a .csv file which has ~3GB of data. I want to read all that data and process it. The following program reads the data from a file and stores it into a std::vector<std::vector<std::string>>. However, the program runs for too long and the application (vscode) freezes and needs to be restarted. What have I done wrong?
#include <algorithm>
#include <iostream>
#include <fstream>
#include "sheet.hpp"
extern std::vector<std::string> split(const std::string& str, const std::string& delim);
int main() {
Sheet sheet;
std::ifstream inputFile;
inputFile.open("C:/Users/1032359/cpp-projects/Straggler Job Analyzer/src/part-00001-of-00500.csv");
std::string line;
while(inputFile >> line) {
sheet.addRow(split(line, ","));
}
return 0;
}
// split and Sheet's member functions have been tested thoroughly and work fine. split has a complexity of N^2 though...
EDIT1: The file read has been fixed as per the suggestions in the comments.
The Split function:
std::vector<std::string> split(const std::string& str, const std::string& delim) {
std::vector<std::string> vec_of_tokens;
std::string token;
for (auto character : str) {
if (std::find(delim.begin(), delim.end(), character) != delim.end()) {
vec_of_tokens.push_back(token);
token = "";
continue;
}
token += character;
}
vec_of_tokens.push_back(token);
return vec_of_tokens;
}
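As an aside, a split in this style can avoid testing every character against every delimiter by letting std::string::find_first_of do the scanning, and avoid growing each token one character at a time. A minimal sketch (split_fast is an illustrative name, not the tested function above):
#include <string>
#include <vector>

std::vector<std::string> split_fast(const std::string& str, const std::string& delim) {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (true) {
        // find the next delimiter at or after `start`
        std::size_t pos = str.find_first_of(delim, start);
        if (pos == std::string::npos) {
            tokens.push_back(str.substr(start)); // final token
            return tokens;
        }
        tokens.push_back(str.substr(start, pos - start));
        start = pos + 1;
    }
}
It keeps the same behaviour as the version above, including empty tokens between consecutive delimiters.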
EDIT2:
dummy csv row:
5612000000,5700000000,4665712499,798,3349189123,0.02698,0.06714,0.07715,0.004219,0.004868,0.06726,7.915e-05,0.0003681,0.27,0.00293,3.285,0.008261,0,0,0.01608
limits:
field1: starting timestamp (nanosecs)
field2: ending timestamp (nanosecs)
field3: job id (<= 1,000,000)
field4: task id (<= 10,000)
field5: machine id (<= 30,000,000)
field6: CPU time (sorry, no clue)
field7-20: no idea, unused for the current stage, but needed for later stages.
EDIT3: Required Output
remember the "then by" option in Excel's custom sort?
the sorting order here is sort first on 5th column (1-based indexing), then on 3rd column and lastly on 4th column; all ascending.

I would start by defining a class to carry the information about one record and add overloads for operator>> and operator<< to help reading/writing records from/to streams. I'd probably add a helper to deal with the comma delimiter too.
First, the set of headers I've used:
#include <algorithm> // sort
#include <array> // array
#include <cstdint> // integer types
#include <filesystem> // filesystem
#include <fstream> // ifstream
#include <iostream> // cout
#include <iterator> // istream_iterator
#include <tuple> // tie
#include <vector> // vector
A simple delimiter helper could look like below. It discards (ignore()) the delimiter if it's in the stream or sets the failbit on the stream if the delimiter is not there.
template <char Char> struct delimiter {};
template <char Char> // read a delimiter
std::istream& operator>>(std::istream& is, const delimiter<Char>) {
if (is.peek() == Char) is.ignore();
else is.setstate(std::ios::failbit);
return is;
}
template <char Char> // write a delimiter
std::ostream& operator<<(std::ostream& os, const delimiter<Char>) {
return os.put(Char);
}
The actual record class can, with the information you've supplied, look like this:
struct record {
uint64_t start; // ns
uint64_t end; // ns
uint32_t job_id; // [0,1000000]
uint16_t task_id; // [0,10000]
uint32_t machine_id; // [0,30000000]
double cpu_time;
std::array<double, 20 - 6> unknown;
};
Reading such a record from a stream can then be done like this, using the delimiter class template (instantiated to use a comma and newline as delimiters):
std::istream& operator>>(std::istream& is, record& r) {
delimiter<','> del;
delimiter<'\n'> nl;
// first read the named fields
if (is >> r.start >> del >> r.end >> del >> r.job_id >> del >>
r.task_id >> del >> r.machine_id >> del >> r.cpu_time)
{
// then read the unnamed fields:
for (auto& unk : r.unknown) is >> del >> unk;
}
return is >> nl;
}
Writing a record is similarly done by:
std::ostream& operator<<(std::ostream& os, const record& r) {
delimiter<','> del;
delimiter<'\n'> nl;
os <<
r.start << del <<
r.end << del <<
r.job_id << del <<
r.task_id << del <<
r.machine_id << del <<
r.cpu_time;
for(auto&& unk : r.unknown) os << del << unk;
return os << nl;
}
Reading the whole file into memory, sorting it and then printing the result:
int main() {
std::filesystem::path filename = "C:/Users/1032359/cpp-projects/"
"Straggler Job Analyzer/src/part-00001-of-00500.csv";
std::vector<record> records;
// Reserve space for "3GB" / 158 (the length of a record + some extra bytes)
// records. Increase the 160 below if your records are actually longer on average:
records.reserve(std::filesystem::file_size(filename) / 160);
// open the file
std::ifstream inputFile(filename);
// copy everything from the file into `records`
std::copy(std::istream_iterator<record>(inputFile),
std::istream_iterator<record>{},
std::back_inserter(records));
// sort on columns 5-3-4 (ascending)
auto sorter = [](const record& lhs, const record& rhs) {
return std::tie(lhs.machine_id, lhs.job_id, lhs.task_id) <
std::tie(rhs.machine_id, rhs.job_id, rhs.task_id);
};
std::sort(records.begin(), records.end(), sorter);
// display the result
for(auto& r : records) std::cout << r;
}
The above process takes ~2 minutes on my old computer with spinning disks. If this is too slow, I'd measure the time of the long running parts:
reserve
copy
sort
Then, you can probably use that information to try to figure out where you need to improve it. For example, if sorting is a bit slow, it could help to use a std::vector<double> instead of a std::array<double, 20-6> to store the unnamed fields (swapping two records then moves heap pointers instead of copying 14 doubles):
struct record {
record() : unknown(20-6) {}
uint64_t start; // ns
uint64_t end; // ns
uint32_t job_id; // [0,1000000]
uint16_t task_id; // [0,10000]
uint32_t machine_id; // [0,30000000]
double cpu_time;
std::vector<double> unknown;
};

I would suggest a slightly different approach:
Do NOT parse the entire row, only extract fields that are used for sorting
Note that your stated ranges require only a small number of bits, which together fit in one 64-bit value:
30,000,000 - 25 bits
10,000 - 14 bits
1,000,000 - 20 bits
Save a "raw" source in your vector, so that you can write it out as needed.
Here is what I got:
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <chrono>
#include <algorithm>
struct Record {
uint64_t key;
std::string str;
Record(uint64_t key, std::string&& str)
: key(key)
, str(std::move(str))
{}
};
int main()
{
auto t1 = std::chrono::high_resolution_clock::now();
std::ifstream src("data.csv");
std::vector<Record> v;
std::string str;
uint64_t key(0);
while (src >> str)
{
size_t pos = str.find(',') + 1;
pos = str.find(',', pos) + 1;
char* p(nullptr);
uint64_t f3 = strtoull(&str[pos], &p, 10);
uint64_t f4 = strtoull(++p, &p, 10);
uint64_t f5 = strtoull(++p, &p, 10);
key = f5 << 34;
key |= f3 << 14;
key |= f4;
v.emplace_back(key, std::move(str));
}
std::sort(v.begin(), v.end(), [](const Record& a, const Record& b) {
return a.key < b.key;
});
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
std::ofstream out("out.csv");
for (const auto& r : v) {
out.write(r.str.c_str(), r.str.length());
out.write("\n", 1);
}
auto t3 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count() << std::endl;
}
Of course, you can reserve space in your vector upfront to avoid reallocation.
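For example, a sketch using the ~160 bytes-per-row estimate from the previous answer (adjust the divisor to your real average line length):
#include <filesystem> // C++17, assumed available

// Estimate the row count from the file size and reserve once up front.
v.reserve(std::filesystem::file_size("data.csv") / 160);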
I've generated a file with 18,000,000 records. My timing shows ~30 seconds for reading/sorting the file, and ~200 seconds to write the output.
UPDATE:
Replaced streaming with out.write(), reduced writing time from 200 seconds to 17!

As an alternate way to solve this problem, I would suggest not reading all the data into memory, but using the minimum amount of RAM needed to sort the huge CSV file: a std::vector of line offsets.
The important thing is to understand the concept, not the precise implementation.
As the implementation only needs 8 bytes per line (in 64-bit mode), sorting the 3 GB data file only takes roughly 150 MB of RAM. The drawback is that the parsing of numbers needs to be done several times for the same line, roughly log2(17e6) = 24 times. However, I think this overhead is partially compensated by the smaller memory footprint and by not having to parse every number of the row.
#include <Windows.h>
#include <cstdint>
#include <cstdlib> // atoll
#include <cstring> // strchr
#include <vector>
#include <algorithm>
#include <array>
#include <fstream>
std::array<uint64_t, 5> readFirst5Numbers(const char* line)
{
std::array<uint64_t, 5> nbr;
for (int i = 0; i < 5; i++)
{
nbr[i] = atoll(line);
line = strchr(line, ',') + 1;
}
return nbr;
}
int main()
{
// 1. Map the input file in memory
const char* inputPath = "C:/Users/1032359/cpp-projects/Straggler Job Analyzer/src/part-00001-of-00500.csv";
HANDLE fileHandle = CreateFileA(inputPath, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
DWORD highsize;
DWORD lowsize = GetFileSize(fileHandle, &highsize);
HANDLE mappingHandle = CreateFileMapping(fileHandle, NULL, PAGE_READONLY, highsize, lowsize, NULL);
size_t fileSize = (size_t)lowsize | (size_t)highsize << 32;
const char* memoryAddr = (const char*)MapViewOfFile(mappingHandle, FILE_MAP_READ, 0, 0, fileSize);
// 2. Find the offset of the start of lines
std::vector<size_t> linesOffset;
linesOffset.push_back(0);
for (size_t i = 0; i < fileSize; i++)
if (memoryAddr[i] == '\n')
linesOffset.push_back(i + 1);
linesOffset.pop_back();
// 3. sort the offset according to some logic
std::sort(linesOffset.begin(), linesOffset.end(), [memoryAddr](const size_t& offset1, const size_t& offset2) {
std::array<uint64_t, 5> nbr1 = readFirst5Numbers(memoryAddr + offset1);
std::array<uint64_t, 5> nbr2 = readFirst5Numbers(memoryAddr + offset2);
if (nbr1[4] != nbr2[4]) // 5th column first
return nbr1[4] < nbr2[4];
if (nbr1[2] != nbr2[2]) // then 3rd
return nbr1[2] < nbr2[2];
return nbr1[3] < nbr2[3]; // lastly 4th (task id)
});
// 4. output sorted array
const char* outputPath = "C:/Users/1032359/cpp-projects/Straggler Job Analyzer/output/part-00001-of-00500.csv";
std::ofstream outputFile;
outputFile.open(outputPath);
for (size_t offset : linesOffset)
{
const char* line = memoryAddr + offset;
size_t len = strchr(line, '\n') + 1 - line;
outputFile.write(line, len);
}
}

Related

Is there a better way than O(n³) to solve this text search?

I am trying to optimize a text search, where I am searching for multiple words. I want to know the frequency of all the words, per line.
I have tried to make it as fast as I can, as I want to run the search many times, with multiple keywords, on the same data.
I still think there should be a more efficient way to solve this, so does anybody have some good suggestions?
I have put up a simple demo to show the POC on gitlab:
https://gitlab.com/dkruithof/textfind
My current search time is 410ms on 6 keywords in a dataset of 408MB
Also, the source of the demo is this:
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
using namespace std;
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
{
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
}
else
{
id = it->second;
}
return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string& line)
{
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
// copy into a writable, null-terminated buffer for strtok
// (a VLA plus strncpy would leave the buffer unterminated)
std::vector<char> str(line.c_str(), line.c_str() + line.size() + 1);
// Getting the first token
char *token = strtok(str.data(), newsDelimiters);
while (token != NULL)
{
//finding a word:
unsigned int id = addWord(wordLookup, token);
wordList.push_back(id);
// Getting the next token
// If there are no tokens left, NULL is returned
token = strtok(NULL, newsDelimiters);
}
}
int main()
{
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = {"this", "blog", "political", "debate", "climate", "iphone"};
unsigned int searchLength = searchWords.size();
unsigned int searchWordIds[searchLength];
//convert searchWords
unsigned int i = 0;
for(const std::string& word : searchWords)
{
searchWordIds[i] = addWord(wordLookup, word);
++i;
}
//#### This part is not time critical ####
//reading file and convert words to numbers
fstream newsFile;
newsFile.open("news.txt",ios::in);
if (newsFile.is_open())
{
string line;
while(getline(newsFile, line))
{
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
}
newsFile.close();
}
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for(std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
for(unsigned int word : line)
{
for(unsigned int s = 0; s < searchLength; ++s)
{
unsigned int searchWord = searchWordIds[s];
if(word == searchWord)
{
++count;
}
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
cout << elapsed.count() << "ms" << endl;
//#### Print for checking result, time insensitive :)
int n = 0;
for(unsigned int count : counts)
{
cout << "Count[" << n << "]: " << count << endl;
++n;
if(n > 100)
{
break;
}
}
}
End results
I tried the multiple approaches, and the scores are as follows:
Approach                     | User                     | Time
Encoding words               | kcid42                   | 410 ms
Hash tables                  | Öö Tiib & Jérôme Richard | 135 ms
Ordered & encoded words      | A M                      | 13 ms
Hash tables & encoded words  | Everybody                | 72 ms
I also committed the results to my GitLab, if you want to check for yourself.
Analysis
Using hash tables to speed up the search is smart, and does indeed reduce the search time. Better than my blunt approach at least. But it still uses strings, and string comparison / construction / hashing is rather slow.
The approach of A M, searching encoded words, is I think faster because of that.
I have also tried to combine the approaches, to use the hash tables and encoded words together, but that was still slower than A M's custom search.
So I think we learned that A M is pretty good at searching stuff.
Thanks everybody for your input!
If you just want to speed up the part that you marked, then you can get a drastic improvement by sorting all vectors before you enter this loop.
The searching will then be really fast.
The runtime of the loop is reduced from 490 ms to 10 ms.
Please check and give feedback.
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
#include <algorithm>
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
{
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
}
else
{
id = it->second;
}
return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string line)
{
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
// Getting the first token
#pragma warning(suppress : 4996)
char* token = strtok(line.data(), newsDelimiters);
while (token != NULL)
{
//finding a word:
unsigned int id = addWord(wordLookup, token);
wordList.push_back(id);
// Getting the next token
// If there are no tokens left, NULL is returned
#pragma warning(suppress : 4996)
token = strtok(NULL, newsDelimiters);
}
}
int main()
{
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = { "this", "blog", "political", "debate", "climate", "iphone" };
unsigned int searchLength = searchWords.size();
std::vector<unsigned int> searchWordIds(searchLength);
//convert searchWords
unsigned int i = 0;
for (const std::string& word : searchWords)
{
searchWordIds[i] = addWord(wordLookup, word);
++i;
}
std::sort(searchWordIds.begin(), searchWordIds.end());
//#### This part is not time critical ####
//reading file and convert words to numbers
std::fstream newsFile;
newsFile.open("r:\\news.txt", std::ios::in);
if (newsFile.is_open())
{
std::string line;
while (std::getline(newsFile, line))
{
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
std::sort(textAsNumbers.back().begin(), textAsNumbers.back().end());
}
newsFile.close();
}
#if 1
std::vector<unsigned int>::iterator last2 = searchWordIds.end();
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for (std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
std::vector<unsigned int>::iterator first1 = line.begin();
std::vector<unsigned int>::iterator last1 = line.end();
std::vector<unsigned int>::iterator first2 = searchWordIds.begin();
while (first1 != last1 && first2 != last2) {
if (*first1 < *first2) {
++first1;
}
else {
if (!(*first2 < *first1)) {
++count;
++first1;
}
else
++first2;
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
#else
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for ( std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
for (unsigned int word : line)
{
for (unsigned int s = 0; s < searchLength; ++s)
{
unsigned int searchWord = searchWordIds[s];
if (word == searchWord)
{
++count;
}
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
#endif
//#### Print for checking result, time insensitive :)
int n = 0;
for (unsigned int count : counts)
{
std::cout << "Count[" << n << "]: " << count << '\n';
++n;
if (n > 100)
{
break;
}
}
}
Edit:
We can make the overall program much faster by optimizing the design:
increase the IO-Buffer size
read the whole file in one shot (not line by line)
use a special encoding for the characters. Convert all non-essential characters to SPACE. This will make comparison really fast
use a special identifier for End-Of-Line; count it and with that get the number of lines
store all words as std::string_view
also the key for the hash map for the dictionary will be a std::string_view
build the hash map in the same loop where words and End-Of-Lines are identified. This reduces duplication of work
Build rows with IDs for words, so that we can compare single integers instead of strings
Sort all those rows with all encoded words. This will make comparing very fast
Use optimized search and compare algorithm to count the matches per line
All this will reduce the runtime for the whole program from the original roughly 40s to ~4.5s. So, nearly ten times faster.
We can see some astonishing results here:
Reading 430MB in 189 ms
And converting all this amount of data in 90 ms
Counting the number of lines in 80ms
Building a hash map with a size of 284k entries in 3.6 s
Sorting 5000 lines, each with many entries, in an unbelievable 367 ms
And doing the matching and counting in 13 ms
Please see an example of the output. I use an 11-year-old Windows 7 machine.
And the code:
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
#include <filesystem>
#include <cstdint>
#include <array>
#include <execution>
#include <unordered_map>
#include <string_view>
// Basic definitions for data types
using MyChar = uint8_t;
using EncoderType = unsigned int;
// Dependent data types
using String = std::basic_string<MyChar, std::char_traits<MyChar>, std::allocator<MyChar>>;
using StringView = std::basic_string_view<MyChar, std::char_traits<MyChar>>;
using IFStream = std::basic_ifstream<MyChar, std::char_traits<MyChar>>;
using Dictionary = std::unordered_map<StringView, EncoderType>;
using DictionaryIter = Dictionary::iterator;
using EncodedLine = std::vector<EncoderType>;
using EncodedLineIter = EncodedLine::iterator;
using EncodedLines = std::vector<EncodedLine>;
using SearchWords = std::vector<StringView>;
using SearchWordsEncoded = EncodedLine;
using CounterForMatchesInOneLine = std::size_t;
using CounterForMatchesForEachLineLine = std::vector<CounterForMatchesInOneLine>;
StringView operator"" _msv(const char* str, std::size_t len) { return StringView{ reinterpret_cast<const MyChar*>(str), len }; };
// Special encoding of values in text
constexpr MyChar SPACE = 254;
constexpr MyChar EOL = 255;
constexpr std::array<MyChar, 256> Convert{ SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,EOL,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,48,49,50,51,52,53,54,55,56,57,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE };
// Speed up reading of file by using larger input buffer
constexpr std::size_t IOBufSize = 5'000'000u;
static MyChar ioBuf[IOBufSize];
// For measuring durations
struct Timer {
std::chrono::time_point<std::chrono::high_resolution_clock> startTime{};
long long elapsedTime{};
void start() { startTime = std::chrono::high_resolution_clock::now(); }
void stop() { elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); }
friend std::ostream& operator << (std::ostream& os, const Timer& t) { return os << t.elapsedTime << " ms "; }
};
// Main Programm
int main() {
Timer t{}, tAll{}; tAll.start(); // Define Timers
Dictionary dictionary(300000); // The dictionary for words and their encoded IDs
EncoderType encodedWordIdentifier{}; // This is for encoding strings. It will be simply incremented for each new word
// The words that we want to search. We use string_views for more efficient processing
SearchWords searchWords{ "this"_msv, "blog"_msv, "political"_msv, "debate"_msv, "climate"_msv, "iphone"_msv };
// And here we will store the encoded search words
SearchWordsEncoded searchWordsEncoded{};
// Add words to dictionary
for (const StringView& searchWord : searchWords) {
dictionary[searchWord] = encodedWordIdentifier;
searchWordsEncoded.push_back(encodedWordIdentifier++);
}
// Now read the complete text file and start all data processing
// Open file and check, if it could be opened
if (IFStream ifs{ "r:\\news.txt",std::ios::binary }; ifs) {
// To speed up reading of the file, we will set a bigger input buffer
ifs.rdbuf()->pubsetbuf(ioBuf, IOBufSize);
// Here we will store the complete file, all data
String text{};
// Get number of bytes in file
const std::uintmax_t size = std::filesystem::file_size("r:\\news.txt");
text.resize(size);
// Read the whole file with one statement. Will be ultrafast
t.start();
ifs.read(text.data(), size);
t.stop(); std::cout << "Duration for reading complete file:\t\t\t\t" << t << "\tData read: " << ifs.gcount() << " bytes\n";
// Now convert characters. Set all non-essential characters to SPACE. Build lowercase text. Special mark for end of line
t.start();
std::transform(std::execution::par, text.begin(), text.end(), text.begin(), [&](const MyChar c) {return Convert[c]; });
t.stop(); std::cout << "Duration for converting all text data:\t\t\t\t" << t << '\n';
// Count the number of lines. We need this to pre-allocate space for our vectors
t.start();
std::size_t numberOfLines = std::count(std::execution::par, text.begin(), text.end(), EOL);
if (text.back() == EOL) ++numberOfLines;
t.stop(); std::cout << "Duration for counting number of lines:\t\t\t\t" << t << "\tNumber of lines identified: " <<numberOfLines << '\n';
// Now we can define the vector for the encoded lines with the exact needed size
EncodedLines encodedLines(numberOfLines);
// Start building the hash map. We will store string_views to optimize space
std::size_t wordLength{}; // Length of word that will be added to the hash map
MyChar* startWord{}; // Startposition (in the overall text) of the word to be added
bool waitForWord{ true }; // Mini state machine. Either we wait for start of word or its end
std::size_t index{}; // This will be used for addressing the current line
t.start();
// Iterate over all characters from the text file
for (MyChar& c : text) {
if (waitForWord) { // If we are in state of waiting for the beginning of the next word
if (c & 0b1000'0000) { // if the character is either space or end of line, continue to wait
if (c == EOL) ++index; // If we found an end of line, then we will address the next line from now on
}
else { // Else, we found a character, so the beginning of a new word
startWord = &c; // Remember start position (in complete text file) of word
wordLength = 1; // The word length is now already 1, because we have found the first character
waitForWord = false; // From now on we are "in" a word and wait for the end of the word, the next SPACE or EOL
}
}
else { // If we are in state of waiting for the end of the word
if (c & 0b1000'0000) { // If we have found a SPACE or EOL, then we found the end of a word
const StringView wordAsStringView{ startWord, wordLength }; // Build a string_view of the word
EncoderType currentEncodedWordIdentifier{ encodedWordIdentifier }; // Temporary for the next encoding ID
// Either add to dictionary or use existing encoding ID
if (DictionaryIter entry = dictionary.find(wordAsStringView); entry != dictionary.end())
currentEncodedWordIdentifier = entry->second; // Already existing ID found. use it
else
dictionary[wordAsStringView] = encodedWordIdentifier++; // Create new entry in the hash map
encodedLines[index].push_back(currentEncodedWordIdentifier);
if (c == EOL) ++index; // If we have read an EOL, we will now address the next line
waitForWord = true; // We will change the state and from now on wait for the beginning of the next word again
}
else
++wordLength; // If we are in state of waiting for the end of the word and found a normal character, increment word length counter
}
}
t.stop(); std::cout << "Duration for building the dictionary and encode the lines:\t" << t << "Number of hashes : " << dictionary.size() << '\n';
// Sort all rows with encoded line IDs. Will be very fast
t.start();
std::for_each(std::execution::par, encodedLines.begin(), encodedLines.end(), [](std::vector<unsigned int>& encodedLine) { std::sort(encodedLine.begin(), encodedLine.end()); });
t.stop(); std::cout << "Duration for sorting all line id encodings:\t\t\t" << t << '\n';
// Now, we will count, how often a search word appears in a line
CounterForMatchesForEachLineLine counterForMatchesForEachLineLine{}; // Vector of match-counters for each lines
counterForMatchesForEachLineLine.reserve(numberOfLines); // Preallocate memory
const EncodedLineIter searchWordsEnd = searchWordsEncoded.end(); // Pointer to search word vector end
t.start();
for (EncodedLine& encodedLine : encodedLines) // For all lines
{
CounterForMatchesInOneLine counterForMatchesInOneLine{}; // Counter for matches in current line
EncodedLineIter encodedLineCurrent = encodedLine.begin(); // Pointer to encoded value for current line
const EncodedLineIter encodedLineEnd = encodedLine.end(); // Pointer to last encoded value for current line
EncodedLineIter searchWordCurrent = searchWordsEncoded.begin(); // Pointer to beginning of search word IDs
// Compare and search. Take advantage of sorted IDs
while (encodedLineCurrent != encodedLineEnd && searchWordCurrent != searchWordsEnd) {
if (*encodedLineCurrent < *searchWordCurrent) {
++encodedLineCurrent;
}
else {
if (!(*searchWordCurrent < *encodedLineCurrent)) {
++counterForMatchesInOneLine;
++encodedLineCurrent;
}
else
++searchWordCurrent;
}
}
// Number of matches in this line has been detected. Store count for this line and continue with next line
counterForMatchesForEachLineLine.push_back(counterForMatchesInOneLine);
}
t.stop(); std::cout << "Duration for searching, comparing and counting:\t\t\t" << t << '\n';
tAll.stop(); std::cout << "\n\nDuration Program processing overall: " << tAll << '\n';
// Debug output
std::cout << "\n\nDemo Result. First 100 counts of matches:\n";
int lineCounter{};
for (CounterForMatchesInOneLine counterForMatchesInOneLine : counterForMatchesForEachLineLine)
{
std::cout << "Count[" << lineCounter++ << "]: " << counterForMatchesInOneLine << '\n';
if (lineCounter > 100) break;
}
}
else
std::cerr << "\n***Error: Could not open file\n";
}
I'd try building a radix tree (https://en.wikipedia.org/wiki/Radix_tree) that contains all your search words. When processing each line of text you then only need to maintain one pointer into the radix tree for each character position, and to advance all of them with every additionally consumed character (or remove a pointer if its character sequence can no longer reach a valid word). Whenever an advanced pointer points to the end of a word, you increment your counter.
This shouldn't require any tokenization.
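A minimal sketch of that idea, using a plain (uncompressed) trie over a-z rather than a real radix tree, and counting occurrences exactly as described; all names are illustrative:
#include <array>
#include <cctype>
#include <memory>
#include <string>
#include <vector>

// A node of a plain trie over lowercase a-z.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next{};
    bool isWord = false;
};

// Insert one lowercase search word into the trie.
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& slot = node->next[c - 'a'];
        if (!slot) slot = std::make_unique<TrieNode>();
        node = slot.get();
    }
    node->isWord = true;
}

// Count matches in one line by keeping one "live" trie pointer per
// possible start position and advancing all of them per character.
unsigned countMatches(const TrieNode& root, const std::string& line) {
    unsigned count = 0;
    std::vector<const TrieNode*> live;
    for (char raw : line) {
        if (!std::isalpha(static_cast<unsigned char>(raw))) {
            live.clear(); // word boundary: drop all partial matches
            continue;
        }
        int c = std::tolower(static_cast<unsigned char>(raw)) - 'a';
        live.push_back(&root); // a match may start at this character
        std::vector<const TrieNode*> advanced;
        for (const TrieNode* node : live)
            if (const TrieNode* nxt = node->next[c].get()) {
                if (nxt->isWord) ++count; // a pointer reached a word end
                advanced.push_back(nxt);
            }
        live.swap(advanced);
    }
    return count;
}
Note that this counts words embedded in longer words too ("this" inside "thistle"); to count whole words only, push &root only when the previous character was a boundary.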
You do not need to iterate over all the searchWordIds items. Assuming this array does not contain any duplicates, you can use a hash table so that the algorithm runs in O(n²) time rather than O(n³) time (thanks to an O(1) search in searchWordIds). More specifically, an std::unordered_set<int> can be used to check whether word is in searchWordIds in constant time. You need to convert searchWordIds to a std::unordered_set<int> first. If the array has duplicates, then you can use a std::unordered_map<int, int> to store the number of duplicates associated with a given word; the two nested loops then reduce to count += searchWordIds[word] in this last case.
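A minimal sketch of that replacement for the timed section, reusing the names from the question:
#include <unordered_set>
#include <vector>

// Build once before the timed section; lookups are then O(1) on average.
std::unordered_set<unsigned int> searchSet(searchWordIds, searchWordIds + searchLength);

std::vector<unsigned int> counts;
counts.reserve(textAsNumbers.size());
for (const std::vector<unsigned int>& line : textAsNumbers) {
    unsigned int count = 0;
    for (unsigned int word : line)
        count += searchSet.count(word); // 0 or 1 per word
    counts.push_back(count);
}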
If this is not enough, you can use a Bloom filter to speed up the lookup in searchWordIds. Indeed, this probabilistic data structure can very quickly tell when word is not in searchWordIds (100% sure), and say when it probably is (with good accuracy, assuming the Bloom filter is sufficiently large). This should be at least twice as fast. Possibly even more (unordered_set and unordered_map are generally not very efficient, partially due to the use of linked-list-based buckets and slow hash management).
If this is still not enough, you can parallelize the outermost loop. The idea is to compute a local count value for each section of the textAsNumbers array and then perform a final reduction. This assumes the size of the sub-arrays is relatively uniform (it will not scale well if one line is much, much bigger than all others). You can flatten the vector<vector<int>> to better load-balance the work, and likely even improve sequential performance (due to fewer indirections and likely fewer cache misses).
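A sketch of that parallelization using C++17 execution policies, assuming the searchSet from the previous snippet and a standard library that implements std::execution::par:
#include <algorithm>
#include <execution>
#include <unordered_set>
#include <vector>

// One independent count per line; the library distributes lines over threads.
std::vector<unsigned int> counts(textAsNumbers.size());
std::transform(std::execution::par,
               textAsNumbers.begin(), textAsNumbers.end(), counts.begin(),
               [&searchSet](const std::vector<unsigned int>& line) {
                   unsigned int c = 0;
                   for (unsigned int word : line) c += searchSet.count(word);
                   return c;
               });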
In practice I would perhaps tokenize the whole text into a std::unordered_map<std::string, int>, where the string is a word and the int is the count of that word in the text. That operation is about O(X), where X is the count of all words in the text, assuming individual words are short enough that hashing them doesn't matter much. You said it is not time critical ... but just for the record.
After that, searching for a word in it is O(1), assuming again that a "word" is a relatively short string, and we already have the counts of those words. If you have a list of words to search, then it is O(N) where N is the length of the list.
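A small sketch of that map; allWords here is an assumed stand-in for the tokenized text:
#include <string>
#include <unordered_map>
#include <vector>

// Build the word -> count map once, roughly O(X) overall.
std::unordered_map<std::string, int> wordCount;
for (const std::string& w : allWords) // allWords: all tokenized words (assumed)
    ++wordCount[w];

// Each later query is O(1) on average.
int hits = 0;
if (auto it = wordCount.find("climate"); it != wordCount.end())
    hits = it->second;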

C++ Reading data from text file into array of structures

I am reasonably new to programming in C++ and I'm having some trouble reading data from a text file into an array of structures. I have looked around similar posts to try and find a solution; however, I have been unable to make any of it work for me and wanted to ask for some help. Below is an example of my data set (P.S. I will be using multiple data sets of varying sizes):
00010 0
00011 1
00100 0
00101 1
00110 1
00111 0
01000 0
01001 1
Below is my code:
int variables = 5;
typedef struct {
int variables[variables];
int classification;
} myData;
//Get the number of rows in the file
int readData(string dataset)
{
int numLines = 0;
string line;
ifstream dataFile(dataset);
while (getline(dataFile, line))
{
++numLines;
}
return numLines;
}
//Store data set into array of data structure
int storeData(string dataset)
{
int numLines = readData(dataset);
myData *dataArray = new myData[numLines];
...
return 0;
}
int main()
{
storeData("dataset.txt");
What I am trying to achieve is to store the first 5 integers of each row of the text file into the 'variables' array in the 'myData' structure, store the last integer (separated by white space) into the 'classification' variable, store that structure in 'dataArray', and then move on to the next row.
For example, the first structure in the array will have the variables [00010] and the classification will be 0. The second will have the variables [00011] and the classification will be 1, and so on.
I would really appreciate some help with this, cheers!
Provide stream extraction and stream insertion operators for your type:
#include <cstddef> // std::size_t
#include <cstdlib> // EXIT_FAILURE
#include <cctype> // std::isspace(), std::isdigit()
#include <vector> // std::vector<>
#include <iterator> // std::istream_iterator<>, std::ostream_iterator<>
#include <fstream> // std::ifstream
#include <iostream> // std::cout, std::cerr, std::cin
#include <algorithm> // std::copy()
constexpr std::size_t size{ 5 };
struct Data {
int variables[size];
int classification;
};
// stream extraction operator
std::istream& operator>>(std::istream &is, Data &data)
{
Data temp; // don't write directly to data since extraction might fail
// at any point which would leave data in an undefined state.
int ch; // signed integer because std::istream::peek() and ...get() return
// EOF when they encounter the end of the file which is usually -1.
// don't feed std::isspace
// signed values
while ((ch = is.peek()) != EOF && std::isspace(static_cast<unsigned>(ch)))
is.get(); // read and discard whitespace
// as long as
// +- we didn't read all variables
// | +-- the input stream is in good state
// | | +-- and the character read is not EOF
// | | |
for (std::size_t i{}; i < size && is && (ch = is.get()) != EOF; ++i)
if (std::isdigit(static_cast<unsigned>(ch)))
temp.variables[i] = ch - '0'; // if it is a digit, assign it to our temp
else is.setstate(std::ios_base::failbit); // else set the stream to a
// failed state which will
// cause the loop to end (is)
if (!(is >> temp.classification)) // if extraction of the integer following the
return is; // variables fails, exit.
data = temp; // everything fine, assign temp to data
return is;
}
// stream insertion operator
std::ostream& operator<<(std::ostream &os, Data const &data)
{
std::copy(std::begin(data.variables), std::end(data.variables),
std::ostream_iterator<int>{ os });
os << ' ' << data.classification;
return os;
}
int main()
{
char const *filename{ "test.txt" };
std::ifstream is{ filename };
if (!is.is_open()) {
std::cerr << "Failed to open \"" << filename << "\" for reading :(\n\n";
return EXIT_FAILURE;
}
// read from ifstream
std::vector<Data> my_data{ std::istream_iterator<Data>{ is },
std::istream_iterator<Data>{} };
// print to ostream
std::copy(my_data.begin(), my_data.end(),
std::ostream_iterator<Data>{ std::cout, "\n" });
}
Uncommented it looks less scary:
std::istream& operator>>(std::istream &is, Data &data)
{
Data temp;
int ch;
while ((ch = is.peek()) != EOF && std::isspace(static_cast<unsigned>(ch)))
is.get();
for (std::size_t i{}; i < size && is && (ch = is.get()) != EOF; ++i)
if (std::isdigit(static_cast<unsigned>(ch)))
temp.variables[i] = ch - '0';
else is.setstate(std::ios_base::failbit);
if (!(is >> temp.classification))
return is;
data = temp;
return is;
}
std::ostream& operator<<(std::ostream &os, Data const &data)
{
std::copy(std::begin(data.variables), std::end(data.variables),
std::ostream_iterator<int>{ os });
os << ' ' << data.classification;
return os;
}
It looks like you are trying to keep binary values as an integer index. If that is the case, they will be converted into integers internally. You may need int-to-binary conversion again.
If you want to preserve the data as it is in the text file, then you need to choose a char/string type for the index value. For classification, it seems the value will be either 0 or 1, so you can choose bool as the data type.
#include <iostream>
#include <map>
using namespace std;
std::map<string, bool> myData;
int main()
{
// THIS IS SAMPLE INSERT. INTRODUCE LOOP FOR INSERT.
/*00010 0
00011 1
00100 0
00101 1
00110 1*/
myData.insert(std::pair<string, bool>("00010", 0));
myData.insert(std::pair<string, bool>("00011", 1));
myData.insert(std::pair<string, bool>("00100", 0));
myData.insert(std::pair<string, bool>("00101", 1));
myData.insert(std::pair<string, bool>("00110", 1));
// Display contents
std::cout << "My Data:\n";
std::map<string, bool>::iterator it;
for (it=myData.begin(); it!=myData.end(); ++it)
std::cout << it->first << " => " << it->second << '\n';
return 0;
}

Trouble overloading extraction operator for custom PriorityQueue

I am trying to overload operator>> for a custom PriorityQueue class I've been writing, code is below:
/**
* @brief Overloaded stream extraction operator.
*
* Bitshift operator>>, i.e. extraction operator. Used to write data from an input stream
* into a targeted priority queue instance. The data is written into the queue in the format,
*
* \verbatim
[item1] + "\t" + [priority1] + "\n"
[item2] + "\t" + [priority2] + "\n"
...
* \endverbatim
*
* @todo Implement functionality for any generic Type and PriorityType.
* @warning Only works for primitives as template types currently!
* @param inStream Reference to input stream
* @param targetQueue Instance of priority queue to manipulate with extraction stream
* @return Reference to input stream containing target queue data
*/
template<typename Type, typename PriorityType> std::istream& operator>>(std::istream& inStream, PriorityQueue<Type, PriorityType>& targetQueue) {
// vector container for input storage
std::vector< std::pair<Type, PriorityType> > pairVec;
// cache to store line input from stream
std::string input;
std::getline(inStream, input);
if (typeid(inStream) == typeid(std::ifstream)) {
inStream.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}
// loop until empty line
while (!input.empty()) {
unsigned int first = 0;
// loop over input cache
for (unsigned int i = 0; i < input.size(); ++i) {
// if char at index i of cache is a tab, break from loop
if (input.at(i) == '\t')
break;
++first;
}
std::string data_str = input.substr(0, first);
// convert from std::string to reqd Type
Type data = atoi(data_str.c_str());
std::string priority_str = input.substr(first);
// convert from std::string to reqd PriorityType
PriorityType priority = atof(priority_str.c_str());
pairVec.push_back(std::make_pair(data, priority));
// get line from input stream and store in input string
std::getline(inStream, input);
}
// enqueue pairVec container into targetQueue
//targetQueue.enqueueWithPriority(pairVec);
return inStream;
}
This currently works for stdin or std::cin input; however, it doesn't work for fstream input - the very first getline always reads an empty line from the input, such that the while loop never gets triggered, and I can't seem to skip it (I tried with inStream.ignore() as you can see above, but this doesn't work).
Edit:
Currently I just want to get it working for file input ignoring the fact it only works for int data type and double priority type right now - these aren't relevant (and neither is the actual manipulation of the targetQueue object itself).
For the moment I'm just concerned with resolving the blank-line issue when trying to stream through file-input.
Example file to pass:
3 5.6
2 6.3
1 56.7
12 45.1
where the numbers on each line are \t separated.
Example testing:
#include "PriorityQueue.h"
#include <sstream>
#include <iostream>
#include <fstream>
int main(void) {
// create pq of MAX binary heap type
PriorityQueue<int, double> pq(MAX);
std::ifstream file("test.txt");
file >> pq;
std::cout << pq;
}
where "test.txt" is the in the format of the example file above.
Edit: Simpler Example
Code:
#include <iostream>
#include <fstream>
#include <vector>
class Example {
public:
Example() {}
size_t getSize() const { return vec.size(); }
friend std::istream& operator>>(std::istream& is, Example& example);
private:
std::vector< std::pair<int, double> > vec;
};
std::istream& operator>>(std::istream& is, Example& example) {
int x;
double y;
while (is >> x >> y) {
std::cout << "in-loop" << std::endl;
example.vec.push_back(std::make_pair(x, y));
}
return is;
}
int main(void) {
Example example;
std::ifstream file("test.txt");
file >> example;
file.close();
std::cout << example.getSize() << std::endl;
return 0;
}
The operator is already overloaded -- and shall be overloaded -- for many types. Let those functions do their work:
template<typename Type, typename PriorityType>
std::istream& operator>>(std::istream& inStream, PriorityQueue<Type, PriorityType>& targetQueue)
{
std::vector< std::pair<Type, PriorityType> > pairVec;
Type data;
PriorityType priority;
while(inStream >> data >> priority)
pairVec.push_back(std::make_pair(data, priority));
targetQueue.enqueueWithPriority(pairVec);
return inStream;
}

Efficient parsing of mmap file

Following is the code for creating a memory map file using boost.
boost::iostreams::mapped_file_source file;
boost::iostreams::mapped_file_params param;
param.path = "\\..\\points.pts"; //! Filepath
file.open(param, fileSize);
if(file.is_open())
{
//! Access the buffer and populate the ren point buffer
const char* pData = file.data();
char* pData1 = const_cast<char*>(pData); //! this gives me all the data from Mmap file
std::vector<RenPoint> readPoints;
ParseData( pData1, readPoints);
}
The implementation of ParseData is as follows
void ParseData ( char* pbuffer , std::vector<RenPoint>& readPoints)
{
if(!pbuffer)
throw std::logic_error("no Data in memory mapped file");
stringstream strBuffer;
strBuffer << pbuffer;
//! Get the max number of points in the pts file
std::string strMaxPts;
std::getline(strBuffer,strMaxPts,'\n');
auto nSize = strMaxPts.size();
unsigned nMaxNumPts = GetValue<unsigned>(strMaxPts);
readPoints.clear();
//! Offset buffer
pbuffer += nSize;
strBuffer << pbuffer;
std::string cur_line;
while(std::getline(strBuffer, cur_line,'\n'))
{
//! How do I read the data from mmap file directly and populate my renpoint structure
int yy = 0;
}
//! Working but very slow
/*while (std::getline(strBuffer,strMaxPts,'\n'))
{
std::vector<string> fragments;
istringstream iss(strMaxPts);
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
back_inserter<vector<string>>(fragments));
//! Logic to populate the structure after getting data back from fragments
readPoints.push_back(pt);
}*/
}
I have, say, a minimum of 1 million points in my data structure, and I want to optimize my parsing. Any ideas?
read in header information to get the number of points
reserve space in a std::vector for N*num_points (N=3 assuming only X,Y,Z, 6 with normals, 9 with normals and rgb)
load the remainder of the file into a string
boost::spirit::qi::phrase_parse into the vector.
The code here can parse a file with 40M points (> 1GB) in about 14s on my 2 year old MacBook:
#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <vector>
template <typename Iter>
bool parse_into_vec(Iter p_it, Iter p_end, std::vector<float>& vf) {
using boost::spirit::qi::phrase_parse;
using boost::spirit::qi::float_;
using boost::spirit::qi::ascii::space;
bool ret = phrase_parse(p_it, p_end, *float_, space, vf);
return p_it != p_end ? false : ret;
}
int main(int argc, char **args) {
if(argc < 2) {
std::cerr << "need a file" << std::endl;
return -1;
}
std::ifstream in(args[1]);
size_t numPoints;
in >> numPoints;
std::istreambuf_iterator<char> eos;
std::istreambuf_iterator<char> it(in);
std::string strver(it, eos);
std::vector<float> vf;
vf.reserve(3 * numPoints);
if(!parse_into_vec(strver.begin(), strver.end(), vf)) {
std::cerr << "failed during parsing" << std::endl;
return -1;
}
return 0;
}
AFAICT, you're currently copying the entire contents of the file into strBuffer.
What I think you want to do is use boost::iostreams::stream with your mapped_file_source instead.
Here's an untested example, based on the linked documentation:
// Create the stream
boost::iostreams::stream<boost::iostreams::mapped_file_source> str("some/path/file");
// Alternately, you can create the mapped_file_source separately and tell the stream to open it (using a copy of your mapped_file_source)
boost::iostreams::stream<boost::iostreams::mapped_file_source> str2;
str2.open(file);
// Now you can use std::getline as you normally would.
std::getline(str, strMaxPts);
As an aside, I'll note that by default mapped_file_source maps the entire file, so there's no need to pass the size explicitly.
You can go with something like this (just a fast concept, you'll need to add some additional error checking etc.):
#include "boost/iostreams/stream.hpp"
#include "boost/iostreams/device/mapped_file.hpp"
#include "boost/filesystem.hpp"
#include "boost/lexical_cast.hpp"
double parse_double(const std::string & str)
{
double value = 0;
bool decimal = false;
double divisor = 1.0;
for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
{
switch (*it)
{
case '.':
case ',':
decimal = true;
break;
default:
{
const int x = *it - '0';
value = value * 10 + x;
if (decimal)
divisor *= 10;
}
break;
}
}
return value / divisor;
}
void process_value(const bool initialized, const std::string & str, std::vector< double > & values)
{
if (!initialized)
{
// convert the value count and prepare the output vector
const size_t count = boost::lexical_cast< size_t >(str);
values.reserve(count);
}
else
{
// convert the value
//const double value = 0; // ~ 0:20 min
const double value = parse_double(str); // ~ 0:35 min
//const double value = atof(str.c_str()); // ~ 1:20 min
//const double value = boost::lexical_cast< double >(str); // ~ 8:00 min ?!?!?
values.push_back(value);
}
}
bool load_file(const std::string & name, std::vector< double > & values)
{
const int granularity = boost::iostreams::mapped_file_source::alignment();
const boost::uintmax_t chunk_size = ( (256 /* MB */ << 20 ) / granularity ) * granularity;
boost::iostreams::mapped_file_params in_params(name);
in_params.offset = 0;
boost::uintmax_t left = boost::filesystem::file_size(name);
std::string value;
bool whitespace = true;
bool initialized = false;
while (left > 0)
{
in_params.length = static_cast< size_t >(std::min(chunk_size, left));
boost::iostreams::mapped_file_source in(in_params);
if (!in.is_open())
return false;
const boost::iostreams::mapped_file_source::size_type size = in.size();
const char * data = in.data();
for (boost::iostreams::mapped_file_source::size_type i = 0; i < size; ++i, ++data)
{
const char c = *data;
if (strchr(" \t\n\r", c))
{
// c is whitespace
if (!whitespace)
{
whitespace = true;
// finished previous value
process_value(initialized, value, values);
initialized = true;
// start a new value
value.clear();
}
}
else
{
// c is not whitespace
whitespace = false;
// append the char to the value
value += c;
}
}
if (size < chunk_size)
break;
in_params.offset += chunk_size;
left -= chunk_size;
}
if (!whitespace)
{
// convert the last value
process_value(initialized, value, values);
}
return true;
}
Note that your main problem will be the conversion from string to float, which is very slow (insanely slow in the case of boost::lexical_cast). With my custom special parse_double func it is faster, however it only allows a special format (e.g. you'll need to add sign detection if negative values are allowed etc. - or you can just go with atof if all possible formats are needed).
If you want to parse the file faster, you'll probably need to go for multithreading - for example one thread only parsing the string values and one or more other threads converting the loaded string values to floats. In that case you probably won't even need the memory mapped file, as a regular buffered file read might suffice (the file will be read only once anyway).
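A rough sketch of that kind of split using std::async, assuming the chunks are cut on a whitespace boundary (so no number straddles two chunks) and that the buffer is null-terminated at its very end:
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <future>
#include <vector>

// Convert one whitespace-separated chunk of text to doubles.
std::vector<double> parse_chunk(const char* begin, const char* end) {
    std::vector<double> out;
    const char* p = begin;
    while (p < end) {
        while (p < end && std::isspace(static_cast<unsigned char>(*p)))
            ++p; // stop exactly at the chunk boundary
        if (p >= end) break;
        char* next = nullptr;
        out.push_back(std::strtod(p, &next));
        p = next;
    }
    return out;
}

// Split roughly in half on a whitespace boundary and convert concurrently.
std::vector<double> parse_parallel(const char* data, std::size_t size) {
    const char* mid = std::find(data + size / 2, data + size, ' ');
    auto upper = std::async(std::launch::async, parse_chunk, mid, data + size);
    std::vector<double> values = parse_chunk(data, mid);
    std::vector<double> rest = upper.get();
    values.insert(values.end(), rest.begin(), rest.end());
    return values;
}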
A few quick comments on your code:
1) you're not reserving space for your vector, so it's expanding every time you add a value. You have read the number of points from the file, so call reserve(N) after the clear().
2) you're forcing a map of the entire file in one hit which will work on 64 bits but is probably slow AND is forcing another allocation of the same amount of memory with strBuffer << pbuffer;
http://www.boost.org/doc/libs/1_53_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file.mapped_file_mapping_regions shows how to getRegion
Use a loop through getRegion to load an estimated chunk of data containing many lines. You are going to have to handle partial buffers - each getRegion will likely end with part of a line that you need to preserve and join to the start of the buffer from the next region, as sketched below.
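A sketch of that loop with boost::iostreams mapped windows; the 64 MB window size is illustrative (offsets stay multiples of it, which satisfies the mapping alignment requirement on common systems):
#include <algorithm>
#include <string>
#include "boost/filesystem.hpp"
#include "boost/iostreams/device/mapped_file.hpp"

// Walk the file in fixed-size mapped windows, carrying the trailing
// partial line of each window over into the next one.
void for_each_window(const std::string& path)
{
    const boost::uintmax_t fileSize = boost::filesystem::file_size(path);
    const std::size_t window = 64 << 20; // 64 MB, illustrative
    std::string carry; // unfinished line from the previous window
    boost::uintmax_t offset = 0;
    while (offset < fileSize) {
        boost::iostreams::mapped_file_params params(path);
        params.offset = offset;
        params.length = static_cast<std::size_t>(
            std::min<boost::uintmax_t>(window, fileSize - offset));
        boost::iostreams::mapped_file_source in(params);
        const char* begin = in.data();
        const char* end = begin + in.size();
        const char* lastNewline = end;
        while (lastNewline > begin && lastNewline[-1] != '\n')
            --lastNewline; // back up to the end of the last complete line
        std::string block = carry;
        block.append(begin, lastNewline); // parse `block` line by line here
        carry.assign(lastNewline, end); // tail belongs to the next window
        offset += in.size();
    }
    // `carry` now holds the final line if the file didn't end with '\n'.
}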

How to generate 'consecutive' c++ strings?

I would like to generate consecutive C++ strings like e.g. in cameras: IMG001, IMG002, etc., being able to indicate the prefix and the string length.
I have found a solution where I can generate random strings from a concrete character set: link
But I cannot find the thing I want to achieve.
A possible solution:
#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>
std::string make_string(const std::string& a_prefix,
size_t a_suffix,
size_t a_max_length)
{
std::ostringstream result;
result << a_prefix <<
std::setfill('0') <<
std::setw(a_max_length - a_prefix.length()) <<
a_suffix;
return result.str();
}
int main()
{
for (size_t i = 0; i < 100; i++)
{
std::cout << make_string("IMG", i, 6) << "\n";
}
return 0;
}
See online demo at http://ideone.com/HZWmtI.
Something like this would work
#include <string>
#include <iomanip>
#include <sstream>
std::string GetNextNumber( int &lastNum )
{
std::stringstream ss;
ss << "IMG";
ss << std::setfill('0') << std::setw(3) << lastNum++;
return ss.str();
}
int main()
{
int x = 1;
std::string s = GetNextNumber( x );
s = GetNextNumber( x );
return 0;
}
You can call GetNextNumber repeatedly with an int reference to generate new image numbers. You can always use sprintf but it won't be the c++ way :)
const int max_size = 7 + 1; // maximum size of the name plus one
char buf[max_size];
for (int i = 0 ; i < 1000; ++i) {
sprintf(buf, "IMG%.04d", i);
printf("The next name is %s\n", buf);
}
char * seq_gen(const char * prefix) {
    static int counter;
    static char result[64]; // static buffer: the original uninitialized pointer was undefined behaviour
    snprintf(result, sizeof result, "%s%03d", prefix, counter++);
    return result;
}
This builds your prefix plus a 3-digit zero-padded counter. If you want a longer string, provide as much prefix as needed and change the %03d in the above code to whatever amount of digit padding you want.
Well, the idea is rather simple. Just store the current number and increment it each time new string is generated. You can implement it to model an iterator to reduce the fluff in using it (you can then use standard algorithms with it). Using Boost.Iterator (it should work with any string type, too):
#include <boost/iterator/iterator_facade.hpp>
#include <sstream>
#include <iomanip>
// can't come up with a better name
template <typename StringT, typename OrdT>
struct ordinal_id_generator : boost::iterator_facade<
ordinal_id_generator<StringT, OrdT>, StringT,
boost::forward_traversal_tag, StringT
> {
ordinal_id_generator(
const StringT& prefix = StringT(),
typename StringT::size_type suffix_length = 5, OrdT initial = 0
) : prefix(prefix), suffix_length(suffix_length), ordinal(initial)
{}
private:
StringT prefix;
typename StringT::size_type suffix_length;
OrdT ordinal;
friend class boost::iterator_core_access;
void increment() {
++ordinal;
}
bool equal(const ordinal_id_generator& other) const {
return (
ordinal == other.ordinal
&& prefix == other.prefix
&& suffix_length == other.suffix_length
);
}
StringT dereference() const {
std::basic_ostringstream<typename StringT::value_type> ss;
ss << prefix << std::setfill('0')
<< std::setw(suffix_length) << ordinal;
return ss.str();
}
};
And example code:
#include <string>
#include <iostream>
#include <iterator>
#include <algorithm>
typedef ordinal_id_generator<std::string, unsigned> generator;
int main() {
std::ostream_iterator<std::string> out(std::cout, "\n");
std::copy_n(generator("IMG"), 5, out);
// can even behave as a range
std::copy(generator("foo", 1, 2), generator("foo", 1, 4), out);
return 0;
}
Take a look at the standard library's string streams. Have an integer that you increment, and insert into the string stream after every increment. To control the string length, there's the concept of fill characters, and the width() member function.
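For example, a minimal sketch of that:
#include <iomanip>
#include <sstream>
#include <string>

// Format an incrementing counter with a fixed width and '0' fill.
std::string next_name(int& counter, int width = 3)
{
    std::ostringstream out;
    out << "IMG" << std::setfill('0') << std::setw(width) << ++counter;
    return out.str(); // counter == 1 gives "IMG001"
}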
You have many ways of doing that.
The generic one would be to, like the link that you showed, have an array of possible characters. Then after each iteration, you start from the right-most character and increment it (that is, change it to the next one in the possible characters list); if it overflows, set it to the first one (index 0) and move to the character on its left. This is exactly like incrementing a number in base, say, 62 - see the sketch below.
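Here is a sketch of that generic increment; it assumes every character of the string is already a member of the character set:
#include <string>

// Advance `s` to the next string over `charset`, base-N style.
// Returns false when every position wrapped around (overflow).
bool increment(std::string& s, const std::string& charset) {
    for (auto it = s.rbegin(); it != s.rend(); ++it) {
        std::size_t pos = charset.find(*it);
        if (pos + 1 < charset.size()) { // no carry needed
            *it = charset[pos + 1];
            return true;
        }
        *it = charset[0]; // wrap this position and carry to the left
    }
    return false; // the whole string overflowed
}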
In your specific example, you are better off with creating the string from another string and a number.
If you like *printf, you can write a string with "IMG%04d" and have the parameter go from 0 to whatever.
If you like stringstream, you can similarly do so.
What exactly do you mean by consecutive strings?
Since you've mentioned that you're using C++ strings, try using the std::string::append method.
string str, str2;
str.append("A");
str.append(str2);
Lookup http://www.cplusplus.com/reference/string/string/append/ for more overloaded calls of the append function.
it's pseudo code. you'll understand what i mean :D
int counter = 0, retval;
do
{
char filename[MAX_PATH];
sprintf(filename, "IMG00%d", counter++);
if(retval = CreateFile(...))
//ok, return
}while(!retval);
You have to keep a counter that is increased every time you get a new name. This counter has to be saved when your application ends, and loaded when your application starts.
Could be something like this:
class NameGenerator
{
public:
NameGenerator()
: m_counter(0)
{
// Code to load the counter from a file
}
~NameGenerator()
{
// Code to save the counter to a file
}
std::string get_next_name()
{
// Combine your preferred prefix with your counter
// Increase the counter
// Return the string
}
private:
int m_counter;
}
NameGenerator my_name_generator;
Then use it like this:
std::string my_name = my_name_generator.get_next_name();