Fast read of large text file to 1D structure in C++ - c++

I need to read a batch of text files of up to 20 MB each, fast.
The files come in the format below. The numbers need to be stored as double because some other files may have 3-decimal-place precision:
0 0 29 175 175 175 175 174
0 1 29 175 175 175 175 174
0 2 29 28 175 175 175 174
0 3 29 28 175 175 175 174
0 4 29 29 175 175 175 174
.
.
.
I would like to skip the first two columns and store the last six numbers of each line in a single 1D structure. It basically transposes each column and horizontally concatenates the transposed columns, like this:
29 29 29 29 29 175 175 28 28 29 175 175 175 175 175...
Here is my class method attempting this; it is too slow for my purposes.
void MyClass::GetFromFile(std::string filename, int headerLinestoSkip, int ColumnstoSkip, int numberOfColumnsIneed)
{
std::ifstream file(filename);
std::string file_line;
double temp;
std::vector<std::vector<double>> temp_vector(numberOfColumnsIneed);
if(file.is_open())
{
SkipLines(file, headerLinestoSkip);
while(getline(file, file_line, '\n'))
{
std::istringstream ss(file_line);
for(int i=0; i<ColumnstoSkip; i++)
{
ss >> temp;
}
for(int i=0; i<numberOfColumnsIneed; i++)
{
ss >> temp;
temp_vector[i].push_back(temp);
}
}
for(int i=0; i<numberOfColumnsIneed; i++)
{
this->ClassMemberVector.insert(this->ClassMemberVector.end(), temp_vector[i].begin(), temp_vector[i].end());
}
}
}
I have read that memory mapping the file may be helpful, but my attempts at getting the data into the 1D structure I need have not been successful. An example from someone would be very much appreciated!

With 20 MB and lines as short as the ones you show, that's approximately 500 000 lines. Knowing this, several factors could slow down your code:
I/O: with current hardware and OS performance, I can't imagine that this plays a role here;
parsing/conversion: you read each line, build a string stream out of it, and then extract the numbers. This could be an overhead, especially on C++ implementations where stream extraction is slower than the old sscanf(). I may be wrong, but again I'm not sure that this overhead would be so huge;
the memory allocation for your vectors: this is definitely the first place to look. A vector has a size and a capacity. Each time you add an item beyond the capacity, the vector has to be reallocated, which can require moving all its content, again and again.
I'd strongly advise you to run your code with a profiler to identify the bottleneck. Manual timing will be difficult here because your loop contains all the potential problems, and each iteration is certainly too quick for std::chrono to measure the different loop parts with sufficient accuracy.
If you can't use a profiler, I'd suggest computing a rough estimate of the number of lines from the file size and taking half of it. Then pre-reserve the corresponding capacity in each temp_vector[i]. If you observe good progress, you'll be on the right track and can then fine-tune this approach. If not, edit your question with your new findings and post a comment to this answer.
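The pre-reservation idea could look roughly like the sketch below. It is not the asker's MyClass::GetFromFile: the free function ReadColumns and its parameter names are made up for illustration, and the expected line count is assumed to be supplied by the caller (for instance, estimated from the file size as described above). Lines that do not contain enough numbers are skipped, which is also an assumption.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: skips `columnsToSkip` leading columns on every line and
// returns the next `columnsNeeded` columns concatenated column by column.
// `expectedLines` is only a capacity hint; it does not have to be exact.
std::vector<double> ReadColumns(std::istream& in,
                                std::size_t columnsToSkip,
                                std::size_t columnsNeeded,
                                std::size_t expectedLines)
{
    std::vector<std::vector<double>> columns(columnsNeeded);
    for (auto& column : columns)
        column.reserve(expectedLines);   // avoid repeated reallocation

    std::vector<double> row(columnsNeeded);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        double skipped;
        for (std::size_t i = 0; i < columnsToSkip; ++i)
            ss >> skipped;
        for (std::size_t i = 0; i < columnsNeeded; ++i)
            ss >> row[i];
        if (!ss)
            continue;                    // short or malformed line: ignore it
        for (std::size_t i = 0; i < columnsNeeded; ++i)
            columns[i].push_back(row[i]);
    }

    // Concatenate the columns into one 1D vector, reserving once.
    std::size_t total = 0;
    for (const auto& column : columns)
        total += column.size();
    std::vector<double> result;
    result.reserve(total);
    for (const auto& column : columns)
        result.insert(result.end(), column.begin(), column.end());
    return result;
}
```

Whether this is actually faster than the original should still be checked with a profiler; the sketch only removes the reallocation cost, not the parsing cost.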

Related

how to fetch data from a txt file using fstream

I am building a small application. In it I have saved some data in a txt file. I need to edit a value in a particular row and column. I wrote code to go to a particular line and fetch the values, but I have tried almost everything to get to a particular column and edit that value.
1000 400 120 110 800 110 150 500 0 1000
1000 400 90 150 800 120 150 600 0 1000
1000 400 80 60 **800** 132 150 700 0 1000
1000 400 120 60 800 123 150 200 0 1000
1000 400 111 80 800 143 150 700 0 1000
1000 400 30 90 800 155 150 500 0 1000
For example, I have to edit the highlighted value. Which is the best way to do it? I cannot paste my whole code as it is very long.
This is the function I use to go to a particular line:
fstream& GotoLine(int num) {
infile.seekg(0, std::ios::beg);
for (int i = 0; i < num - 1; ++i) {
infile.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}
return infile;
}
I would appreciate any help with this.
Assuming that your file uses fixed-width columns, my advice would be:
prepare an array or a vector of struct { size_t pos; size_t width; } to define the fields
open the file in ios::in | ios::out mode
read the file one line at a time with your GotoLine function up to the line you want to process
note the index in the fstream with tellg
read the interesting line in a char array with a size greater than the line - here I would use 64 to be large enough - with istream::getline
as you have read the line as a plain array, you can rewrite the characters of one single field
go back to beginning of line with seekp
write the line back to the file
This is not a general method for editing a text file. It only works here because your fields have a fixed size, so the edited line has exactly the same size as the original one and can be rewritten in place. Never use it in the general case.
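The steps above can be sketched as follows. This is only an illustration under the same fixed-width assumption: OverwriteField and its parameters are hypothetical names, the line is read into a std::string rather than a 64-character array, and the new value is right-aligned and padded with spaces so that the line length never changes.

```cpp
#include <cstddef>
#include <fstream>
#include <limits>
#include <string>

// Hypothetical helper: replaces the `width` characters starting at 0-based
// column `pos` of 1-based line `lineNo` with `text`, padded on the left.
// Returns false if the field does not fit or the line does not exist.
bool OverwriteField(const std::string& path, int lineNo,
                    std::size_t pos, std::size_t width, const std::string& text)
{
    if (lineNo < 1 || text.size() > width)
        return false;

    // Binary mode so that stream positions are plain byte offsets.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file)
        return false;

    // Skip the lines before the one we want (same idea as GotoLine).
    for (int i = 1; i < lineNo; ++i)
        file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    if (!file.good())
        return false;

    const std::streampos lineStart = file.tellg();  // remember where the line begins
    std::string line;
    if (!std::getline(file, line) || pos + width > line.size())
        return false;

    // Rewrite one field inside the in-memory copy of the line.
    std::string field(width - text.size(), ' ');
    field += text;
    line.replace(pos, width, field);

    file.clear();            // getline may have set eofbit on the last line
    file.seekp(lineStart);   // go back to the beginning of the line
    file.write(line.data(), static_cast<std::streamsize>(line.size()));
    file.flush();
    return file.good();
}
```

A call such as OverwriteField("data.txt", 3, 5, 3, "75") would then replace a three-character field on the third line, where the file name and field position are placeholders.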

How to parse a Vector List from a .txt File

I have a project where I have to read a vector list from a .txt file and parse it. I have the vector list displaying on the screen but I do not know how to parse it.
#include <iostream>
#include <cstring>
#include <fstream>
#include <conio.h>
using namespace std;
int main() {
ifstream vl;
vl.open("PIXA.txt");
char output[100];
if (vl.is_open()) {
while (!vl.eof()) {
vl >> output;
cout << output;
}
vl >> output;
cout << output;
}
cin.ignore();
cin.get();
vl.close();
return 0;
}
Could someone help out with adding code so that I can parse this .txt file?
Here is what the .txt file looks like:
J
366 -1722 583
366 356 1783
866 789 1033
866 -1289 -167
366 -1722 583
J
-500 -1472 150
0 -1039 -600
0 1039 600
-500 606 1350
-500 -1472 150
J
366 356 1783
-500 606 1350
0 1039 600
866 789 1033
366 356 1783
J
366 -1722 583
866 -1289 -167
0 -1039 -600
-500 -1472 150
366 -1722 583
So, there are two things that can help you here.
Firstly, std::getline can take an optional delimiter parameter. This is handy, because your matrices are delimited by J.
std::string line;
std::getline(in, line, 'J');
This will either leave line empty (for the first entry in your file) or with a long string containing a bunch of space-delimited integers. Note that std::getline will pull the delimiter out of the stream, but not add it to the string argument, so you don't need to worry about the J when it comes to parsing the integer bits.
You can feed that into a std::stringstream, and yank out all the integers in a loop like this:
std::vector<int> matrix;
std::stringstream ss(line);
int i;
while (ss >> i)
matrix.push_back(i);
and you'll get a nice 1-dimensional vector full of all your numbers. Writing a simple indexing function that can convert between a row,column format that you probably want from a matrix, and the offset format you'll need with a vector is left as an exercise to the reader as it is pretty simple. Don't forget to handle the empty-line situation for the very first J!
You can wrap this stuff up in a loop,
while (std::getline(in, line, 'J')) { /* ... */ }
easily enough, and do something like generate a vector of vectors that'll be dead easy to work with later in your application.
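Put together, the approach might look like the sketch below. ReadBlocks and At are hypothetical names, and the indexing helper assumes three values per row, as in the sample file. The empty chunk produced before the very first J is simply dropped.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits the input on 'J' and returns one flat vector of
// integers per block. Chunks without any numbers (the one before the first J)
// are skipped.
std::vector<std::vector<int>> ReadBlocks(std::istream& in)
{
    std::vector<std::vector<int>> blocks;
    std::string chunk;
    while (std::getline(in, chunk, 'J')) {
        std::istringstream ss(chunk);
        std::vector<int> values;
        int value;
        while (ss >> value)
            values.push_back(value);
        if (!values.empty())
            blocks.push_back(std::move(values));
    }
    return blocks;
}

// Row/column access into a flat block stored row by row.
// `columns` defaults to 3 because each line of the sample holds three numbers.
inline int At(const std::vector<int>& block, std::size_t row, std::size_t col,
              std::size_t columns = 3)
{
    return block[row * columns + col];
}
```

With an open std::ifstream vl("PIXA.txt"), calling ReadBlocks(vl) would give one vector per J section, and At(blocks[0], 1, 2) would read the third number on the second line of the first section.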

c++ array sorting with some specifications

I'm using C++. Using sort from STL is allowed.
I have an array of int, like this :
1 4 1 5 145 345 14 4
The numbers are stored in a char* (I read them from a binary file, 4 bytes per number).
I want to do two things with this array :
swap each number with the one after that
4 1 5 1 345 145 4 14
sort it by group of 2
4 1 4 14 5 1 345 145
I could code it step by step, but it wouldn't be efficient. What I'm looking for is speed. O(n log n) would be great.
Also, this array can be bigger than 500MB, so memory usage is an issue.
My first idea was to sort the array starting from the end (to swap the numbers 2 by 2) and treating it as a long* (to force the sorting to take 2 int each time). But I couldn't manage to code it, and I'm not even sure it would work.
I hope I was clear enough, thanks for your help : )
This is the most memory efficient layout I could come up with. Obviously the vector I'm using would be replaced by the data blob you're using, assuming endian-ness is all handled well enough. The premise of the code below is simple.
Generate 1024 random values in pairs, each pair consisting of the first number between 1 and 500, the second number between 1 and 50.
Iterate the entire list, flipping all even-index values with their following odd-index brethren.
Send the entire thing to std::qsort with an item width of two (2) int32_t values and a count of half the original vector.
The comparator function simply sorts on the immediate value first, and on the second value if the first is equal.
The sample below does this for 1024 items. I've tested it without output for 134217728 items (exactly 536870912 bytes) and the results were pretty impressive for a measly macbook air laptop, about 15 seconds, only about 10 of that on the actual sort. What is ideally most important is no additional memory allocation is required beyond the data vector. Yes, to the purists, I do use call-stack space, but only because q-sort does.
I hope you get something out of it.
Note: I only show the first part of the output, but I hope it shows what you're looking for.
#include <iostream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <ctime>
// a most-wacked-out random generator. every other call will
// pull from a rand modulo either the first, or second template
// parameter, in alternation.
template<int N,int M>
struct randN
{
    int i = 0;
    int32_t operator ()()
    {
        i = (i+1)%2;
        return (i ? std::rand() % N : std::rand() % M) + 1;
    }
};
// compare two pairs of integer values by address: first value, then second.
// uses comparisons instead of subtraction so large values cannot overflow.
int pair_cmp(const void* arg1, const void* arg2)
{
    const int32_t *left = (const int32_t*)arg1;
    const int32_t *right = (const int32_t *)arg2;
    if (left[0] != right[0])
        return (left[0] < right[0]) ? -1 : 1;
    if (left[1] != right[1])
        return (left[1] < right[1]) ? -1 : 1;
    return 0;
}
int main(int argc, char *argv[])
{
    // a crapload of int values (must be even, since we work in pairs)
    static const std::size_t N = 1024;
    // seed rand()
    std::srand((unsigned)std::time(0));
    // get a huge array of random pairs: first value in 1..500, second in 1..50
    std::vector<int32_t> data;
    data.reserve(N);
    std::generate_n(std::back_inserter(data), N, randN<500,50>());
    // flip all the values
    for (std::size_t i=0;i+1<data.size();i+=2)
    {
        int32_t tmp = data[i];
        data[i] = data[i+1];
        data[i+1] = tmp;
    }
    // now sort in pairs. using qsort only because it lends itself
    // *very* nicely to performing block-based sorting.
    std::qsort(&data[0], data.size()/2, sizeof(data[0])*2, pair_cmp);
    std::cout << "After sorting..." << std::endl;
    std::copy(data.begin(), data.end(), std::ostream_iterator<int32_t>(std::cout,"\n"));
    std::cout << std::endl << std::endl;
    return EXIT_SUCCESS;
}
Output
After sorting...
1
69
1
83
1
198
1
343
1
367
2
12
2
30
2
135
2
169
2
185
2
284
2
323
2
325
2
347
2
367
2
373
2
382
2
422
2
492
3
286
3
321
3
364
3
377
3
400
3
418
3
441
4
24
4
97
4
153
4
210
4
224
4
250
4
354
4
356
4
386
4
430
5
14
5
26
5
95
5
145
5
302
5
379
5
435
5
436
5
499
6
67
6
104
6
135
6
164
6
179
6
310
6
321
6
399
6
409
6
425
6
467
6
496
7
18
7
65
7
71
7
84
7
116
7
201
7
242
7
251
7
256
7
324
7
325
7
485
8
52
8
93
8
156
8
193
8
285
8
307
8
410
8
456
8
471
9
27
9
116
9
137
9
143
9
190
9
190
9
293
9
419
9
453
With some additional constraints on both your input and your platform, you can probably use an approach like the one you are thinking of. These constraints would include
Your input contains only positive numbers (i.e. can be treated as unsigned)
Your platform provides uint8_t and uint64_t in <cstdint>
You address a single platform with known endianness.
In that case you can divide your input into groups of 8 bytes, do some byte shuffling to arrange each group as one uint64_t with the "first" number from the input in the lower-valued half, and run std::sort on the resulting array. Depending on endianness you may need to do more byte shuffling to rearrange each sorted 8-byte group as a pair of uint32_t in the expected order.
If you can't code this on your own, I'd strongly advise you not to take this approach.
A better and more portable approach (you have some inherent non-portability by starting from a not clearly specified binary file format), would be:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>
std::vector<int> swap_and_sort_int_pairs(const unsigned char buffer[], std::size_t buflen) {
    const std::size_t intsz = sizeof(int);
    // We have to assume that the binary format in buffer is compatible with our int representation
    // we also require an even number of integers
    assert(buflen % (2*intsz) == 0);
    // load pairwise
    std::vector< std::pair<int,int> > pairs;
    pairs.reserve(buflen/(2*intsz));
    for (const unsigned char* bufp=buffer; bufp<buffer+buflen; bufp+= 2*intsz) {
        // It would be better to have a more portable binary -> int conversion.
        // memcpy avoids the const, alignment and aliasing problems of casting the pointer.
        int first_value, second_value;
        std::memcpy(&first_value, bufp, intsz);
        std::memcpy(&second_value, bufp + intsz, intsz);
        // swap each pair here
        pairs.emplace_back( second_value, first_value );
    }
    // less<pair<..>> does lexicographical ordering, which is what you are looking for
    std::sort(pairs.begin(), pairs.end());
    // convert back to linear vector
    std::vector<int> result;
    result.reserve(2*pairs.size());
    for (auto& entry : pairs) {
        result.push_back(entry.first);
        result.push_back(entry.second);
    }
    return result;
}
Both the initial parse/swap pass (which you need anyway) and the final conversion are O(N), so the total complexity is still O(N log(N)).
If you can continue to work with pairs, you can save the final conversion. The other way to save that conversion would be to use a hand-coded sort with two-int strides and two-int swap: much more work - and possibly still hard to get as efficient as a well-tuned library sort.
Do one thing at a time. First, give your data some *struct*ure. It seems that each 8 bytes form a unit of the form
struct unit {
int key;
int value;
};
If the endianness is right, you can do this in O(1) with a reinterpret_cast. If it isn't, you'll have to live with an O(n) conversion effort. Both vanish compared to the O(n log n) sorting effort.
When you have an array of these units, you can use std::sort like:
bool compare_units(const unit& a, const unit& b) {
return a.key < b.key;
}
std::sort(array, array + length, compare_units);
The key to this solution is that you do the "swapping" and byte-interpretation first and then do the sorting.
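As a rough sketch of "interpret and swap first, sort second": the names Unit, LoadSwapped and SortUnits below are hypothetical, the bytes are assumed to be in the machine's native byte order, and the conversion is done with an O(n) std::memcpy pass rather than the O(1) cast mentioned above. The comparator also breaks ties on the second number, which is an assumption inferred from the expected output in the question (4 1 before 4 14).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One 8-byte group after swapping: `key` is the number that came second in
// the file, `value` the one that came first.
struct Unit {
    std::int32_t key;
    std::int32_t value;
};

// Hypothetical helper: interprets the buffer as pairs of 32-bit integers in
// native byte order and swaps each pair while loading. Trailing bytes that do
// not form a full pair are ignored.
std::vector<Unit> LoadSwapped(const unsigned char* buffer, std::size_t byteCount)
{
    const std::size_t intBytes = sizeof(std::int32_t);
    std::vector<Unit> units(byteCount / (2 * intBytes));
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::int32_t first, second;
        std::memcpy(&first, buffer + 2 * i * intBytes, intBytes);
        std::memcpy(&second, buffer + (2 * i + 1) * intBytes, intBytes);
        units[i] = Unit{second, first};   // the swap happens here
    }
    return units;
}

// Sorts by key, then by value when the keys are equal.
void SortUnits(std::vector<Unit>& units)
{
    std::sort(units.begin(), units.end(), [](const Unit& a, const Unit& b) {
        return a.key != b.key ? a.key < b.key : a.value < b.value;
    });
}
```

For the example in the question (1 4 1 5 145 345 14 4), this yields the units (4,1), (4,14), (5,1), (345,145).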

Direct-inclusion sorting

What is the other name for direct-inclusion sorting and what is the algorithm for the same sort?
I have been searching the Internet, but I cannot find a straight answer. I found this algorithm for straight insertion sort, and some books say it is the same as direct-inclusion sorting, but I doubt it because the book is in Russian, so I want to confirm. (Is it true, or might I have a translation error?)
Code in C++:
int main(int argc, char* argv[])
{
int arr[8] = {27, 412, 71, 81, 59, 14, 273, 87},i,j;
for (j=1; j<8; j++){
if (arr[j] < arr[j-1]) {
// so that we do not change j but work with i
i = j;
// swap until we find the right place
do{
swap(arr[i],arr[i-1]);
i--;
// guard against running past the start of the array
if (i == 0)
break;
}
while (arr[i] < arr[i-1]) ;
}
for (i=0;i<8;i++)
cout << arr[i]<< ' ';
cout << '\n';
}
getch();
return 0;
}
Result
27 412 71 81 59 14 273 87
27 71 412 81 59 14 273 87
27 71 81 412 59 14 273 87
27 59 71 81 412 14 273 87
14 27 59 71 81 412 273 87
14 27 59 71 81 273 412 87
14 27 59 71 81 87 273 412
The posted code is Insertion sort.
Most implementations will copy an out-of-order element to a temporary variable and then work backwards, moving elements up until the correct open spot is found to "insert" the current element. That's what the pseudocode in the Wikipedia article shows.
Some implementations just bubble the out-of-order element backwards while it's less than the element to its left. That's what the inner do...while loop in the posted code shows.
Both methods are valid ways to implement Insertion sort.
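To make the difference concrete, here is a minimal sketch of both variants side by side. The function names are made up for illustration; both produce the same sorted result and only differ in how the out-of-order element travels to its place.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Variant 1: copy the out-of-order element to a temporary, shift the larger
// elements one position up, then insert the element into the open spot.
void InsertionSortShift(std::vector<int>& a)
{
    for (std::size_t j = 1; j < a.size(); ++j) {
        const int current = a[j];
        std::size_t i = j;
        while (i > 0 && a[i - 1] > current) {
            a[i] = a[i - 1];
            --i;
        }
        a[i] = current;
    }
}

// Variant 2: swap the out-of-order element with its left neighbor while it is
// smaller than that neighbor (the approach used by the posted code).
void InsertionSortSwap(std::vector<int>& a)
{
    for (std::size_t j = 1; j < a.size(); ++j)
        for (std::size_t i = j; i > 0 && a[i] < a[i - 1]; --i)
            std::swap(a[i], a[i - 1]);
}
```

The shifting variant performs roughly one assignment per moved element, while the swapping variant performs a full swap each time, so the first is usually a little cheaper even though the number of comparisons is the same.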
The code you posted does not look like an algorithm for insertion sort, since you are doing a repeated swap of two neighboring elements.
Your code looks much more like some kind of bubble sort.
Here is a list of common sorting algorithms:
https://en.wikipedia.org/wiki/Sorting_algorithm
"Straight insertion" and "direct inclusion" sound like pretty much the same thing, so I guess they are probably different names for the same algorithm.
Edit:
Possibly the "straight" prefix indicates that only one container is used. However, if two neighboring elements are swapped, I would not call it insertion sort, since no "insert" is done at all.
Given the fact that the term "direct inclusion sort" yields no google hits at all, and "direct insertion sorting" only 27 hits, the first three of which are this post here and two identically phrased blog posts, I doubt that this term has any widely accepted meaning. So the part of your question about
some book its saying they are the same with direct direct-inclusion sorting
is hard to answer, unless we find a clear definition of what direct-inclusion sorting actually is.

How do I read one number at a time and store it in an array, skipping duplicates?

I'm trying to read numbers from a file into an array, discarding duplicates. For instance, say the following numbers are in a file:
41 254 14 145 244 220 254 34 135 14 34 25
Though the number 34 occurs twice in the file, I would only like to store it once in the array. How would I do this?
(fixed, but I guess a better term would be a 64 bit Unsigned int) (was using numbers above 255)
vector<int64_t> v;
copy(istream_iterator<int64_t>(cin), istream_iterator<int64_t>(), back_inserter(v));
set<int64_t> s;
vector<int64_t> ov; ov.reserve(v.size());
for( auto i = v.begin(); i != v.end(); ++i ) {
if ( s.insert(*i).second )
ov.push_back(*i);
}
// ov contains only unique numbers in the same order as the original input file.