Implementing External Merge Sort - C++

I am aware of external merge sort and how it works, but currently I'm stuck while implementing it. I've written code to sort and merge the arrays, but I'm facing a problem while reading and writing the data from/into the file. I want to implement the following methods in C++:
1. int * read(int s, int e) : This method should read from the file all the numbers from 's' to 'e' and return them as an array.
2. write(int a[], int s, int e) : This method should write the input array to the file, replacing the numbers from s to e.
For example, given a file with the following numbers:
1
2
3
4
5
6
read(0, 2) should return [1,2,3]
write([4,5,6], 0, 2) should update the file to :
4
5
6
4
5
6
How can I implement both these methods?

The first thing you should do is stop working with raw pointers.
std::vector<int> will be just as efficient, and far less bug prone.
Second, the file format matters. I will assume a binary file with packed 32 bit signed integers.
The signatures for read and write are now:
std::vector<int> read( std::ifstream& f, int offset );
void write( std::ofstream& f, int offset, std::vector<int> const& data );
ifstream and ofstream have seek methods -- in particular, ifstream has seekg and ofstream has seekp.
ifstream.read( char* , length ) reads length bytes from the file at the current get position (set by seekg, and advanced by read). If you aren't concerned with memory layout of your file, you can get the .data() from the std::vector<int>, reinterpret it to a char*, and proceed to read( reinterpret_cast<char*>(vec.data()), sizeof(int)*vec.size() ) to read in the buffer all at once.
ofstream has a similar write method which works much the same way.
While writing raw data to disk and reading it back is dangerous, in most (every?) implementation you'll be safe with data written and read in the same execution session (and probably even between sessions). Take more care if the data is meant to persist between sessions, or if it is produced or consumed by code other than your own.
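Here is a minimal sketch of both functions under those assumptions (packed 32-bit ints, offsets counted in elements rather than bytes). It adds a count parameter to read, since the function otherwise has no way of knowing how many elements to return; error handling is omitted, and the file name in the comment is just an example:

#include <cstdint>
#include <fstream>
#include <vector>

// Read `count` ints starting at element index `offset`.
std::vector<int> read( std::ifstream& f, int offset, int count )
{
    std::vector<int> data(count);
    f.seekg( static_cast<std::streamoff>(offset) * sizeof(int), std::ios::beg );
    f.read( reinterpret_cast<char*>(data.data()), count * sizeof(int) );
    data.resize( f.gcount() / sizeof(int) );  // shrink if the file ended early
    return data;
}

// Overwrite the ints starting at element index `offset` with `data`.
// Open the stream as e.g. std::ofstream f("nums.bin", std::ios::binary | std::ios::in);
// so the existing contents are not truncated.
void write( std::ofstream& f, int offset, std::vector<int> const& data )
{
    f.seekp( static_cast<std::streamoff>(offset) * sizeof(int), std::ios::beg );
    f.write( reinterpret_cast<const char*>(data.data()), data.size() * sizeof(int) );
}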

There are no C++ standard functions to jump to lines in files. So you have to read the file line by line (with getline, for example. http://www.cplusplus.com/reference/string/string/getline/).
As far as I remember, external merge sort (the old one, designed for a computer with a few tape drives), when used with separate files, doesn't need an interface like yours - you can work sequentially.
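For the text-file layout in the question (one number per line), a rough getline-based sketch of read(s, e) might look like the following; it has to scan from the start of the file each time, since text lines cannot be seeked to directly:

#include <fstream>
#include <string>
#include <vector>

// Return the numbers on lines s..e (0-based, inclusive) of a text file
// that holds one number per line. No error handling.
std::vector<int> read(const std::string& filename, int s, int e)
{
    std::ifstream f(filename);
    std::vector<int> out;
    std::string line;
    for (int i = 0; i <= e && std::getline(f, line); ++i)
        if (i >= s)
            out.push_back(std::stoi(line));
    return out;
}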

Related

How to save a vector to a file without "printing" it? [duplicate]

First, sorry for my bad English. I just joined this forum and searched for how to correctly write a vector to a binary file. I got an answer like this from this forum (I have modified it a little):
#include <iostream>
#include <string.h>
#include <vector>
#include <fstream>
using namespace std;

class Student
{
public:
    char m_name[30];
    int m_score;
public:
    Student()
    {
    }
    Student(const char name[], const int &score)
        :m_score(score)
    {
        strcpy(m_name, name);
    }
    void print() const
    {
        cout.setf(ios::left);
        cout.width(20);
        cout << m_name << " " << m_score << endl;
    }
};

int main()
{
    vector<Student> student;
    student.push_back(Student("Alex",19));
    student.push_back(Student("Maria",20));
    student.push_back(Student("muhamed",20));
    student.push_back(Student("Jeniffer",20));
    student.push_back(Student("Alex",20));
    student.push_back(Student("Maria",21));

    ofstream fout("data.dat", ios::out | ios::binary);
    fout.write((char*) &student, sizeof(student));
    fout.close();

    vector<Student> student2;
    ifstream fin("data.dat", ios::in | ios::binary);
    fin.seekg(0, ifstream::end);
    int size = fin.tellg() / sizeof (student2);
    student2.resize(size);
    fin.seekg(0, ifstream::beg);
    fin.read((char*)&student2, sizeof(student2));
    vector<Student>::const_iterator itr = student2.begin();
    while(itr != student2.end())
    {
        itr->print();
        ++itr;
    }
    fin.close();
    return 0;
}
But when I run it on my Linux Mint, I get this result:
Alex 19
Maria 20
muhamed 20
Jeniffer 20
Alex 20
Maria 21
*** glibc detected *** ./from vector to binary: corrupted double-linked list: 0x0000000000633030 ***
I am new to C++.
Someone please help me; I've been stuck on this problem for the last two weeks.
Thanks in advance for the answer.
You are writing the vector structure to the file, not its data buffer. Try changing the writing procedure to:
ofstream fout("data.dat", ios::out | ios::binary);
fout.write((char*)&student[0], student.size() * sizeof(Student));
fout.close();
And instead of calculating the size of the vector from the file size, it's better to write the vector size (number of objects) first. That way you can also write other data to the same file.
size_t size = student.size();
fout.write((char*)&size, sizeof(size));
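For completeness, a sketch of the matching read side, assuming the element count was written first as above (this relies on Student being trivially copyable, which it is here):
vector<Student> student2;
ifstream fin("data.dat", ios::in | ios::binary);
size_t size = 0;
fin.read((char*)&size, sizeof(size));                   // the element count written first
student2.resize(size);
fin.read((char*)&student2[0], size * sizeof(Student));  // then the raw Student records
fin.close();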
To store a vector<T> of PODs in a file, you have to write the contents of the vector, not the vector itself. You can access the raw data with &vector[0], the address of the first element (given it contains at least one element). To get the raw data length, multiply the number of elements in the vector by the size of one element:
strm.write(reinterpret_cast<const char*>(&vec[0]), vec.size()*sizeof(T));
The same applies when you read the vector from the file; the element count is the total file size divided by the size of one element (given that you only store one type of POD in the file):
const size_t count = filesize / sizeof(T);
std::vector<T> vec(count);
strm.read(reinterpret_cast<char*>(&vec[0]), count*sizeof(T));
This only works if you can calculate the number of elements based on the file size (if you only store one type of POD or if all vectors contain the same number of elements). If you have vectors with different PODs with different lengths, you have to write the number of elements in the vector to the file before writing the raw data.
Furthermore, when you transfer numeric types in binary form between different systems, be aware of endianness.
For the functions read() and write(), you need what is called "plain old data" or "POD". That basically means the class or structure must have no pointers inside it and no virtual functions. The implementation of vector certainly has pointers - I'm not sure about virtual functions.
You will have to write a function that stores one student at a time (or that translates a bunch of students to an array [not a vector] of bytes or some such - but that's more complex).
The reason you can't write non-POD data, in particular pointers, to a binary file is that when you read the data back again, you can almost certainly bet that the memory layout has changed from when you wrote it. It becomes a little bit like trying to park in the same parking space at the shops - someone else will have parked in the third spot from the entrance when you turn up next time, so you'll have to pick another spot. Think of the memory allocated by the compiler as parking spaces, and the student information as cars.
[Technically, in this case, it's even worse - your vector doesn't actually contain the students inside the class, which is what you are writing to the file, so you haven't even saved the information about the students, just the information about where they are located (the number of the parking spaces)]
You probably cannot write in binary (the way you are doing) any std::vector because that template contains internal pointers, and writing and re-reading them is meaningless.
Some general advice:
don't write any STL template containers (like std::vector or std::map) in binary; they surely contain internal pointers that you really don't want to write as-is. If you really need to write them, implement your own writing and reading routines (e.g. using STL iterators).
avoid using strcpy without care. Your code will crash if the name has more than 30 characters. At least use strncpy(m_name, name, sizeof(m_name)); (but even that would work badly for a 30-character name). Actually, m_name should be a std::string.
serialize your container classes explicitly (by handling each meaningful data member); a sketch is shown after this list. You could consider using JSON notation (or perhaps YAML, or maybe even XML - which I find too complex, so I don't recommend it) to serialize. It gives you a textual dump format, which you can easily inspect with a standard editor (e.g. emacs or gedit). You'll find a lot of free serialization libraries, e.g. jsoncpp and many others.
learn to compile with g++ -Wall -g and to use the gdb debugger and the valgrind memory leak detector; also learn to use make and to write your Makefiles.
take advantage of the fact that Linux is free software, so you can look into its source code (and you may want to study the libstdc++ implementation, even though the STL headers are complex).
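As a sketch of the explicit-serialization advice in the list above (assuming m_name has been turned into a std::string as suggested; the helper names are mine, not from the question):

#include <fstream>
#include <string>
#include <vector>

struct Student {
    std::string m_name;
    int m_score;
};

// Write the count, then each record field by field (length-prefixed name, then score).
void save(std::ofstream& out, const std::vector<Student>& v) {
    size_t n = v.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const Student& s : v) {
        size_t len = s.m_name.size();
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(s.m_name.data(), len);
        out.write(reinterpret_cast<const char*>(&s.m_score), sizeof s.m_score);
    }
}

// Read it back in the same order.
void load(std::ifstream& in, std::vector<Student>& v) {
    size_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    v.resize(n);
    for (Student& s : v) {
        size_t len = 0;
        in.read(reinterpret_cast<char*>(&len), sizeof len);
        s.m_name.resize(len);
        in.read(&s.m_name[0], len);
        in.read(reinterpret_cast<char*>(&s.m_score), sizeof s.m_score);
    }
}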

Some methods of reading from txt files

I am trying to learn some techniques of reading from files in C++ and I came up with this example.
Assume the following is the content of the txt file that I want to read from
A
1 2 3
4 5 6
7 8 9
B
1 2 3
4 5 6
7 8 9
So, what I want to do here is: if we read A, then we start reading matrix A from the lines below it and store it into a[i][j], and the same for B. Anything else we treat as an exception, which we don't care about here.
The problem for me is the mixed reading. I know how to read integers and how to read strings from the file separately, the naive way with while(fin>>it), but can anyone tell me a fast way of doing this kind of mixed reading so that I don't have to declare several reading variables of different types, such as string and int?
For example, I only know how to read integers in a whole line and don't know how to handle the newline, which means I don't know how to recognize when we have reached the end of a line. Something like this:
ifstream fin;
fin.open(infilename);
int it;
int arr[3][3];
int i=0, j=0;
while(fin>>it){
    arr[i][j]=it;
    // I am confused at this place and don't know how to write the condition
}
fin.close();
Moreover, since there are both char and int types, do I have to declare a char? And how does fin>>c for a char really work? Does it read char by char within a line, or something else?
I'll really appreciate if someone can guide me on this! Thanks!
You shouldn't have to worry about newlines as >> skips whitespace and newlines. However, putting fin>>it in the while condition complicates things if you want different data types. You could instead read in to a character to represent each matrix, then within the loop read from fin to the matrix:
char c;
int mat[3][3];
while (fin>>c){
    //save c somewhere
    fin>>mat[0][0]>>mat[0][1]>>mat[0][2];
    fin>>mat[1][0]>>mat[1][1]>>mat[1][2];
    fin>>mat[2][0]>>mat[2][1]>>mat[2][2];
    //store mat
}
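Put together, a minimal complete sketch might look like this (the file name and the map used for storage are assumptions, not taken from the question):

#include <array>
#include <fstream>
#include <iostream>
#include <map>

int main() {
    std::ifstream fin("matrices.txt");                          // assumed file name
    std::map<char, std::array<std::array<int, 3>, 3>> matrices;
    char label;
    while (fin >> label) {                                      // reads 'A', 'B', ...
        auto& m = matrices[label];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                fin >> m[i][j];                                 // then the 3x3 numbers
    }
    std::cout << "A[1][2] = " << matrices['A'][1][2] << '\n';   // 6 for the sample file
}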

Parsing binary data from file

Thank you in advance for your help!
I am in the process of learning C++. My first project is to write a parser for a binary-file format we use at my lab. I was able to get a parser working fairly easily in Matlab using "fread", and it looks like that may work for what I am trying to do in C++. But from what I've read, it seems that using an ifstream is the recommended way.
My question is two-fold. First, what, exactly, are the advantages of using ifstream over fread?
Second, how can I use ifstream to solve my problem? Here's what I'm trying to do. I have a binary file containing a structured set of ints, floats, and 64-bit ints. There are 8 data fields all told, and I'd like to read each into its own array.
The structure of the data is as follows, in repeated 288-byte blocks:
Bytes 0-3: int
Bytes 4-7: int
Bytes 8-11: float
Bytes 12-15: float
Bytes 16-19: float
Bytes 20-23: float
Bytes 24-31: int64
Bytes 32-287: 64x float
I am able to read the file into memory as a char * array, with the fstream read command:
char * buffer;
ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes
So, from what I understand, I now have a pointer to an array called "buffer". If I were to call buffer[0], I should get a 1-byte memory address, right? (Instead, I'm getting a seg fault.)
What I now need to do really ought to be very simple. After executing the above ifstream code, I should have a fairly long buffer populated with a number of 1's and 0's. I just want to be able to read this stuff from memory, 32-bits at a time, casting as integers or floats depending on which 4-byte block I'm currently working on.
For example, if the binary file contained N 288-byte blocks of data, each array I extract should have N members each. (With the exception of the last array, which will have 64N members.)
Since I have the binary data in memory, I basically just want to read from buffer, one 32-bit number at a time, and place the resulting value in the appropriate array.
Lastly - can I access multiple array positions at a time, a la Matlab? (e.g. array(3:5) -> [1,2,1] for array = [3,4,1,2,1])
Firstly, the advantage of using iostreams, and in particular file streams, relates to resource management. Automatic file stream variables will be closed and cleaned up when they go out of scope, rather than having to manually clean them up with fclose. This is important if other code in the same scope can throw exceptions.
Secondly, one possible way to address this type of problem is to simply define the stream insertion and extraction operators in an appropriate manner. In this case, because you have a composite type, you need to help the compiler by telling it not to add padding bytes inside the type. The following code should work on gcc and Microsoft compilers.
#include <cstdint>   // uint64_t
#include <iostream>

#pragma pack(push, 1)   // no padding: the struct must match the 288-byte file layout exactly
struct MyData
{
    int i0;
    int i1;
    float f0;
    float f1;
    float f2;
    float f3;
    uint64_t ui0;
    float f4[64];
};
#pragma pack(pop)

std::istream& operator>>( std::istream& is, MyData& data ) {
    is.read( reinterpret_cast<char*>(&data), sizeof(data) );
    return is;
}

std::ostream& operator<<( std::ostream& os, const MyData& data ) {
    os.write( reinterpret_cast<const char*>(&data), sizeof(data) );
    return os;
}
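With that in place, reading every 288-byte record into a vector could then be as simple as this (the file name is an assumption):

#include <fstream>
#include <vector>

int main() {
    std::ifstream datafile("data.bin", std::ios::in | std::ios::binary);  // assumed name
    std::vector<MyData> records;
    MyData d;
    while (datafile >> d)          // operator>> above reads one packed block at a time
        records.push_back(d);
    // records.size() is now the number of complete blocks in the file
}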
char * buffer;
ifstream datafile (filename,ios::in|ios::binary|ios::ate);
datafile.read (buffer, filesize); // Filesize in bytes
You need to allocate a buffer before you read into it (and, since the file was opened with ios::ate, use tellg() to get the size and seekg() back to the beginning before reading):
buffer = new char[filesize];
datafile.read (buffer, filesize);
As to the advantages of ifstream, well, it is a matter of abstraction. You can abstract the contents of your file in a more convenient way. You then do not have to work with buffers but can instead create the structure using classes and then hide the details about how it is stored in the file, for instance by overloading the << operator.
You might perhaps look for serialization libraries for C++. Perhaps s11n might be useful.
This question shows how you can convert data from a buffer to a certain type. In general, you should prefer using a std::vector<char> as your buffer. This would then look like this:
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <iterator>

int main() {
    std::ifstream input("your_file.dat", std::ios::binary);
    std::vector<char> buffer;
    std::copy(std::istreambuf_iterator<char>(input),
              std::istreambuf_iterator<char>(),
              std::back_inserter(buffer));
}
This code will read the entire file into your buffer. The next thing you'd want to do is to write your data into valarrays (for the selection you want). valarray is constant in size, so you have to be able to calculate the required size of your array up-front. This should do it for your format:
std::valarray<int> array1(buffer.size()/288); // each entry takes up 288 bytes
Then you'd use a normal for-loop to insert the elements into your arrays:
for (size_t i = 0; i < buffer.size()/288; i++) {
    array1[i] = *reinterpret_cast<int *>(&buffer[i*288]);     // first int of the block
    array2[i] = *reinterpret_cast<int *>(&buffer[i*288 + 4]); // second int (array2 declared like array1)
}
Note that this assumes an int takes up 4 bytes on your platform; that is true on most common systems (including typical 64-bit ones), but it is not guaranteed, and if it differs the offsets and sizes above change. This question explains a bit about C++ and sizes of types.
The selection you describe there can be achieved using valarray.

Writing to file using c and c++

When I write the file using C's fwrite, which accepts a void pointer as data, it cannot be interpreted by a text editor.
struct index
{
    index(int _x, int _y) : x(_x), y(_y) {}
    int x, y;
};

index i(4, 7);
FILE *stream;
fopen_s(&stream, "C:\\File.txt", "wb");
fwrite(&i, sizeof(index), 1, stream);
But when I try with C++'s ofstream write in binary mode, it is readable. Why doesn't it come out the same as when written using fwrite?
This is the way to write binary data using a stream in C++:
#include <fstream>

struct C {
    int a, b;
} c;

int main() {
    std::ofstream f("foo.txt", std::ios::binary);
    f.write((const char*)&c, sizeof c);
}
This shall save the object in the same way as fwrite would. If it doesn't for you, please post your code with streams - we'll see what's wrong.
C++'s ofstream stream insertion only does text. The difference between opening an iostream in binary vs text mode is whether or not end-of-line character conversion happens. If you want to write a binary format where a 32-bit int takes exactly 32 bits, use the C functions in C++.
Edit on why fwrite may be the better choice:
Ostream's write method is more or less a clone of fwrite (except it is a little less useful, since it only takes a byte array and length instead of fwrite's 4 params), but by sticking to fwrite there is no way to accidentally use stream insertion in one place and write in another. More or less, it is a safety mechanism. While you gain that margin of safety, you lose a little flexibility: you can no longer make an iostream derivative that, say, compresses output without changing any file-writing code.

How to speed-up loading of 15M integers from file stream?

I have an array of precomputed integers; it's a fixed size of 15M values. I need to load these values at program start. Currently it takes up to 2 minutes to load; the file size is ~130MB. Is there any way to speed up loading? I'm free to change the save process as well.
std::array<int, 15000000> keys;
std::string config = "config.dat";

// how the array is saved
std::ofstream out(config.c_str());
std::copy(keys.cbegin(), keys.cend(),
          std::ostream_iterator<int>(out, "\n"));

// load of the array
std::ifstream in(config.c_str());
std::copy(std::istream_iterator<int>(in),
          std::istream_iterator<int>(), keys.begin());
in.close();
Thanks in advance.
SOLVED. Used the approach proposed in accepted answer. Now it takes just a blink.
Thanks all for your insights.
You have two issues regarding the speed of your write and read operations.
First, std::copy cannot do a block-copy optimization when writing to an output iterator, because it doesn't have direct access to the underlying target.
Second, you're writing the integers out as ASCII and not binary, so on each iteration of your write the output iterator creates an ASCII representation of your int, and on read it has to parse the text back into integers. I believe this is the brunt of your performance issue.
The raw storage of your array (assuming a 4-byte int) should only be 60MB, but since each character of an integer in ASCII is 1 byte, any ints with more than 4 characters are going to be larger than the binary storage, hence your 130MB file.
There is not an easy way to solve your speed problem portably (so that the file can be read on machines with a different endianness or int size) while using std::copy. The easiest way is to just dump the whole of the array to disk and then read it all back using fstream's write and read; just remember that it's not strictly portable.
To write:
std::fstream out(config.c_str(), ios::out | ios::binary);
out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
And to read:
std::fstream in(config.c_str(), ios::in | ios::binary);
in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
----Update----
If you are really concerned about portability, you could use a portable format (like your initial ASCII version) in your distribution artifacts; then, when the program is first run, it could convert that portable format to a locally optimized version for use during subsequent executions.
Something like this perhaps:
std::array<int, 15000000> keys;

// data.txt holds the ASCII values and data.bin is the binary version
if(!file_exists("data.bin")) {
    std::ifstream in("data.txt");
    std::copy(std::istream_iterator<int>(in),
              std::istream_iterator<int>(), keys.begin());
    in.close();

    std::fstream out("data.bin", ios::out | ios::binary);
    out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
} else {
    std::fstream in("data.bin", ios::in | ios::binary);
    in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
}
If you have an install process this preprocessing could also be done at that time...
Attention. Reality check ahead:
Reading integers from a large text file is an IO-bound operation unless you're doing something completely wrong (like using C++ streams for this). Loading 15M integers from a text file takes less than 2 seconds on an AMD64 @ 3GHz when the file is already buffered (and only a bit longer if it has to be fetched from a sufficiently fast disk). Here's a quick & dirty routine to prove my point (that's why I don't check for all possible errors in the format of the integers, nor close my files at the end, because I exit() anyway).
$ wc nums.txt
15000000 15000000 156979060 nums.txt
$ head -n 5 nums.txt
730547560
-226810937
607950954
640895092
884005970
$ g++ -O2 read.cc
$ time ./a.out <nums.txt
=>1752547657
real 0m1.781s
user 0m1.651s
sys 0m0.114s
$ cat read.cc
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <vector>

int main()
{
    int c;              // int, not char, so the EOF comparison works on all platforms
    int num=0;
    int pos=1;
    int line=1;
    std::vector<int> res;
    while(c=getchar(),c!=EOF)
    {
        if (c>='0' && c<='9')
            num=num*10+c-'0';
        else if (c=='-')
            pos=0;
        else if (c=='\n')
        {
            res.push_back(pos?num:-num);
            num=0;
            pos=1;
            line++;
        }
        else
        {
            printf("I've got a problem with this file at line %d\n",line);
            exit(1);
        }
    }
    // make sure the optimizer does not throw the vector away; also a check.
    unsigned sum=0;
    for (size_t i=0;i<res.size();i++)
    {
        sum=sum+(unsigned)res[i];
    }
    printf("=>%d\n",sum);
}
UPDATE: and here's my result when reading the text file (not binary) using mmap:
$ g++ -O2 mread.cc
$ time ./a.out nums.txt
=>1752547657
real 0m0.559s
user 0m0.478s
sys 0m0.081s
code's on pastebin:
http://pastebin.com/NgqFa11k
What do I suggest:
1-2 seconds is a realistic lower bound for a typical desktop machine to load this data. 2 minutes sounds more like a 60 MHz microcontroller reading from a cheap SD card. So either you have an undetected/unmentioned hardware condition, or your implementation of C++ streams is somehow broken or unusable. I suggest establishing a lower bound for this task on your machine by running my sample code.
If the integers are saved in binary format and you're not concerned with endianness problems, try reading the entire file into memory at once (fread) and casting the pointer to int *.
You could precompile the array into a .o file, which wouldn't need to be recompiled unless the data changes.
thedata.hpp:
static const int NUM_ENTRIES = 5;
extern int thedata[NUM_ENTRIES];
thedata.cpp:
#include "thedata.hpp"
int thedata[NUM_ENTRIES] = {
10
,200
,3000
,40000
,500000
};
To compile this:
# make thedata.o
Then your main application would look something like:
#include "thedata.hpp"
using namespace std;
int main() {
for (int i=0; i<NUM_ENTRIES; i++) {
cout << thedata[i] << endl;
}
}
Assuming the data doesn't change often, and that you can process the data to create thedata.cpp, then this is effectively instant loadtime. I don't know if the compiler would choke on such a large literal array though!
Save the file in a binary format.
Write the file by taking a pointer to the start of your int array and convert it to a char pointer. Then write the 15000000*sizeof(int) chars to the file.
And when you read the file, do the same in reverse: read the file as a sequence of chars, take a pointer to the beginning of the sequence, and convert it to an int*.
Of course, this assumes that endianness isn't an issue.
For actually reading and writing the file, memory mapping is probably the most sensible approach.
If the numbers never change, preprocess the file into a C++ source and compile it into the application.
If the numbers can change and thus you have to keep them in a separate file that you load on startup, then avoid doing that number by number using C++ IO streams. C++ IO streams are a nice abstraction, but there is too much of it for such a simple task as loading a bunch of numbers fast. In my experience, a huge part of the run time is spent parsing the numbers and another part in accessing the file char by char.
(Assuming your file is more than a single long line.) Read the file line by line using std::getline(), and parse the numbers out of each line using not streams but std::strtol(); a sketch follows below. This avoids a huge part of the overhead. You can get more speed out of the streams by crafting your own variant of std::getline() that reads the input ahead (using istream::read()); the standard std::getline() also reads input char by char.
Use a buffer of 1000 (or even 15M; you can modify this size as you please) integers, not integer after integer. Not using a buffer is clearly the problem, in my opinion.
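As an illustration of the getline() + strtol() suggestion above (a rough sketch, not the poster's measured code; the reserve() size is just the known element count):

#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

std::vector<int> load_text(const char* filename) {
    std::ifstream in(filename);
    std::vector<int> keys;
    keys.reserve(15000000);                 // avoid repeated reallocation
    std::string line;
    while (std::getline(in, line))
        keys.push_back(static_cast<int>(std::strtol(line.c_str(), nullptr, 10)));
    return keys;
}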
If the data in the file is binary and you don't have to worry about endianness, and you're on a system that supports it, use the mmap system call (a short sketch follows after the links). See this article on IBM's website:
High-performance network programming, Part 2: Speed up processing at both the client and server
Also see this SO post:
When should I use mmap for file access?
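For reference, a minimal POSIX mmap sketch for the binary case (Linux/Unix only; the file name is an assumption and the file is taken to hold raw ints):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);                    // assumed binary file of raw ints
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const int* keys = static_cast<const int*>(p);           // view the mapping as ints
    size_t count = st.st_size / sizeof(int);
    long long sum = 0;
    for (size_t i = 0; i < count; ++i) sum += keys[i];      // touch every value
    printf("%zu ints, sum=%lld\n", count, sum);
    munmap(p, st.st_size);
    close(fd);
}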