Implementing external mergesort. How to get started?


For a project I have to implement an external mergesort algorithm. It will be used to sort a file with mostly numbers or strings and will be some GBs in size. This is the definition of the mergesort that I've been given
    void MergeSort(char *infile,
                   unsigned char field,
                   block_t *buffer,
                   unsigned int nmem_blocks,
                   char *outfile,
                   unsigned int *nsorted_segs,
                   unsigned int *npasses,
                   unsigned int *nios);
I'm not allowed to change that. The first argument is the file that I'm going to sort. The second is the field by which I want to sort the file (this doesn't interest me right now). The third argument is the buffer, which is a struct. Here is the definition of a block:
    typedef struct
    {
        unsigned int blockid;
        unsigned int nreserved; // how many reserved entries
        record_t entries[MAX_RECORDS_PER_BLOCK]; // array of records
        bool valid; // if set, then this block is valid
        unsigned char misc;
        unsigned int next_blockid;
        unsigned int dummy;
    } block_t;
The fourth argument is the number of blocks in memory. The last three arguments can be set by me.
My questions are:
Do I take the file and cut it into two files?
Is the buffer a file stored on the hard drive, or does it stay in memory? Do I have to create a new file? I'm a little confused about this part.
These are my thoughts so far. First I take the file and split it into two parts. I also create a buffer, though I don't know what size it should have. Then I read the first block of records from the first file and compare its numbers to the first block of records of the second file. Whenever a number is less than or equal to the other, I send it to the output file. Can you evaluate my train of thought? Or am I thinking about this the wrong way?

See my GitHub repo: https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md
Problem Statement
Most sorting algorithms assume the data fits in RAM. When the data to be sorted does not fit in RAM and instead resides in slower external memory (usually a hard drive), external merge sort is used. For example, if we have to sort 100 numbers of 1 KB each and our RAM size is 10 KB, external merge sort works like a charm.
How does it work?
Split Phase
Split the 100 KB file into 10 files of 10 KB each.
Sort each 10 KB file in memory using an efficient O(n log n) sorting algorithm.
Store each of the sorted smaller files on disk.
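The Split Phase could be sketched roughly as below. This is only an illustration: makeSortedRuns and runSize are hypothetical names, the input is assumed to be whitespace-separated integers, and the runs are returned in memory where a real external sort would write each one to its own temp file.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads integers from `in` and returns them as sorted
// runs of at most `runSize` values each (runSize is assumed to be > 0).
// A real external sort would write each run to a temporary file instead of
// keeping all runs in memory.
std::vector<std::vector<int>> makeSortedRuns(std::istream& in, std::size_t runSize) {
    std::vector<std::vector<int>> runs;
    std::vector<int> run;
    int value;
    while (in >> value) {
        run.push_back(value);
        if (run.size() == runSize) {            // the "RAM" is full: sort and emit
            std::sort(run.begin(), run.end());
            runs.push_back(run);
            run.clear();
        }
    }
    if (!run.empty()) {                         // last, possibly shorter, run
        std::sort(run.begin(), run.end());
        runs.push_back(run);
    }
    return runs;
}
```

With the input 5 8 6 3 7 1 4 9 10 2 and a run size of 2, this yields the runs (5, 8), (3, 6), (1, 7), (4, 9) and (2, 10).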
Merge Phase
Do a K-way merge of the smaller files. The details are as follows.
After the Split Phase, a list of file handlers for all the split files is stored in sortedTempFileHandlerList.
Now we create a list of heap nodes, heapnodes. Each heap node stores the actual entry read from a file and also the file that owns it. heapnodes is then heapified into a min-heap.
Assuming there are 10 files, heapnodes takes only 10 KB (assuming each number is 1 KB).
Loop while the least element (the top of the heap) is not INT_MAX:
    Pick the node with the least element from heapnodes (O(1), since heapnodes is a min-heap).
    Write the element to sortedLargeFile (it is the next number in sorted order).
    Find the file handler of the corresponding element by looking at heapnode.filehandler.
    Read the next item from that file. If it's EOF, mark the item as INT_MAX.
    Put the item at the top of the heap and heapify again to restore the min-heap property.
    Continue.
At the end of the Merge Phase, sortedLargeFile will have all the elements in sorted order.
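A minimal sketch of the Merge Phase, assuming each sorted run is available as a std::istream of whitespace-separated integers. mergeSortedRuns is a hypothetical name, and the sketch differs from the description above in one respect: instead of pushing an INT_MAX sentinel at EOF, an exhausted run simply stops contributing to the heap (which also lets INT_MAX appear as ordinary data).

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <queue>
#include <sstream>
#include <utility>
#include <vector>

// Merges already-sorted streams of integers into `out`, smallest first.
void mergeSortedRuns(const std::vector<std::istream*>& runs, std::ostream& out) {
    using Node = std::pair<int, std::size_t>;   // (value, index of the owning run)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> heap;

    int value;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (*runs[i] >> value) heap.push({value, i});    // first element of each run

    while (!heap.empty()) {
        Node least = heap.top();                // smallest value across all runs
        heap.pop();
        out << least.first << ' ';
        // Refill from the run that owned the value; at EOF nothing is pushed.
        if (*runs[least.second] >> value) heap.push({value, least.second});
    }
}
```

Only one value per run is held in memory at a time, which is the point of the K-way merge.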
Example
Say we have a file largefile with the following contents:
5 8 6 3 7 1 4 9 10 2
In the Split Phase, we split them into sorted chunks in 5 separate temp files:
temp1 - 5, 8
temp2 - 3, 6
temp3 - 1, 7
temp4 - 4, 9
temp5 - 2, 10
Next, construct a min-heap with the top element from each file:
      1
     / \
    2   5
   / \
  4   3
Now pick the least element from the min-heap and write it to sortedOutputFile - 1.
Find the next element of the file that owns the min element 1.
The number is 7 from temp3. Move it to the top of the heap.
      7                       2
     / \                     / \
    2   5    Heapify -->    3   5
   / \                     / \
  4   3                   4   7
Pick the least element 2 and move it to sortedOutputFile - 1 2.
Find the next element of the file that owns the min element 2.
The number is 10 from temp5. Move it to the top of the heap.
     10                       3
     / \                     / \
    3   5    Heapify -->    4   5
   / \                     / \
  4   7                  10   7
Pick the least element 3 and move it to sortedOutputFile - 1 2 3.
Find the next element of the file that owns the min element 3.
The number is 6 from temp2. Move it to the top of the heap.
      6                       4
     / \                     / \
    4   5    Heapify -->    6   5
   / \                     / \
 10   7                  10   7
Pick the least element 4 and move it to sortedOutputFile - 1 2 3 4.
Find the next element of the file that owns the min element 4.
The number is 9 from temp4. Move it to the top of the heap.
      9                       5
     / \                     / \
    6   5    Heapify -->    6   9
   / \                     / \
 10   7                  10   7
Pick the least element 5 and move it to sortedOutputFile - 1 2 3 4 5.
Find the next element of the file that owns the min element 5.
The number is 8 from temp1. Move it to the top of the heap.
      8                       6
     / \                     / \
    6   9    Heapify -->    7   9
   / \                     / \
 10   7                  10   8
Pick the least element 6 and move it to sortedOutputFile - 1 2 3 4 5 6.
Find the next element of the file that owns the min element 6.
We have hit EOF (temp2 is exhausted), so mark the number read as INT_MAX.
   INT_MAX                    7
     / \                     / \
    7   9    Heapify -->    8   9
   / \                     / \
 10   8                  10   INT_MAX
Pick the least element 7 and move it to sortedOutputFile - 1 2 3 4 5 6 7.
If we keep looping, we reach a point where the heap looks like the one below and
sortedOutputFile - 1 2 3 4 5 6 7 8 9 10.
We break out of the loop at this point, when the min element of the heap becomes INT_MAX.
          INT_MAX
          /     \
    INT_MAX     INT_MAX
    /     \
INT_MAX   INT_MAX

I think the solution depends on the size of the input file.
What I would do is check the size of the file first. If it's smaller than a certain size, for example 1 GB (provided 1 GB is a small amount of memory on your machine), then I'll read the whole file, store the content in memory, merge sort it, and write the result to the new file.
Otherwise, I'll have to divide the original file into K temp files of less than 1 GB each, merge sort each of them, and then do a K-way merge of the sorted files into the output file. Basically you need to divide and conquer at two levels: first the files on disk, then the contents in memory.

If you are using any reasonable OS (anything with an X in it or BSD) and have enough memory rely on 'sort'. If you hit the limit on file size, use 'split' and then the --merge option of sort to merge the already sorted files.
If you really need to write code for an external sort you can save yourself a lot of trouble by starting with a thorough reading of Knuth's TAOCP Vol III, Chp 5 on external sorting.

Related

How to construct a tree given its depth and postorder traversal, then print its preorder traversal

I need to construct a tree given its depth and postorder traversal, and then I need to generate the corresponding preorder traversal. Example:
Depth: 2 1 3 3 3 2 2 1 1 0
Postorder: 5 2 8 9 10 6 7 3 4 1
Preorder(output): 1 2 5 3 6 8 9 10 7 4
I've defined two arrays that contain the postorder sequence and depth. After that, I couldn't come up with an algorithm to solve it.
Here's my code:
    int postorder[1000];
    int depth[1000];
    string postorder_nums;
    getline(cin, postorder_nums);
    istringstream token1(postorder_nums);
    string tokenString1;
    int idx1 = 0;
    while (token1 >> tokenString1) {
        postorder[idx1] = stoi(tokenString1);
        idx1++;
    }
    string depth_nums;
    getline(cin, depth_nums);
    istringstream token2(depth_nums);
    string tokenString2;
    int idx2 = 0;
    while (token2 >> tokenString2) {
        depth[idx2] = stoi(tokenString2);
        idx2++;
    }
    Tree tree(1);

You can do this actually without constructing a tree.
First note that if you reverse the postorder sequence, you get a kind of preorder sequence, but with the children visited in opposite order. So we'll use this fact and iterate over the given arrays from back to front, and we will also store values in the output from back to front. This way at least the order of siblings will come out right.
The first value we get from the input will thus always be the root value. Obviously we cannot store this value at the end of the output array, as it really should come first. But we will put this value on a stack until all other values have been processed. The same will happen for any value that is followed by a "deeper" value (again: we are processing the input in reversed order). But as soon as we find a value that is not deeper, we flush a part of the stack into the output array (also filling it up from back to front).
When all values have been processed, we just need to flush the remaining values from the stack into the output array.
Now, we can optimise our space usage here: as we fill the output array from the back, we have free space at its front to use as the stack space for this algorithm. This has the nice consequence that when we arrive at the end we don't need to flush the stack anymore, because it is already there in the output, with every value where it should be.
Here is the code for this algorithm where I did not include the input collection, which apparently you already have working:
    #include <iostream>

    int main() {
        // Input example
        int depth[] = {2, 1, 3, 3, 3, 2, 2, 1, 1, 0};
        int postorder[] = {5, 2, 8, 9, 10, 6, 7, 3, 4, 1};
        // Number of values in the input (a compile-time constant, so preorder is not a VLA)
        const int n = sizeof(depth) / sizeof(depth[0]);
        int preorder[n]; // This will contain the output
        int j = n; // index where last value was stored in preorder
        int stackSize = 0; // how many entries are used as stack in preorder
        for (int i = n - 1; i >= 0; i--) {
            while (depth[i] < stackSize) {
                preorder[--j] = preorder[--stackSize]; // flush it
            }
            preorder[stackSize++] = postorder[i]; // stack it
        }
        // Output the result:
        for (int i = 0; i < n; i++) {
            std::cout << preorder[i] << " ";
        }
        std::cout << "\n";
    }
This algorithm has an auxiliary space complexity of O(1) -- so not counting the memory needed for the input and the output -- and has a time complexity of O(n).

I won't give you the code, but some hints how to solve the problem.
First, for postorder graph processing you first visit the children, then print (process) the value of the node. So, the tree or subtree parent is the last thing that is processed in its (sub)tree. I replace 10 with 0 for better indentation:
2 1 3 3 3 2 2 1 1 0
--------------------
5 2 8 9 0 6 7 3 4 1
As explained above, node of depth 0, or the root, is the last one. Let's lower all other nodes 1 level down:
2 1 3 3 3 2 2 1 1 0
-------------------
                  1
5 2 8 9 0 6 7 3 4
Now identify all nodes of depth 1, and lower all that is not of depth 0 or 1:
2 1 3 3 3 2 2 1 1 0
-------------------
                  1
  2           3 4
5   8 9 0 6 7
As you can see, (5,2) is in a subtree, (8,9,10,6,7,3) in another subtree, (4) is a single-node subtree. In other words, all that is to the left of 2 is its subtree, all to the right of 2 and to the left of 3 is in the subtree of 3, all between 3 and 4 is in the subtree of 4 (here: empty).
Now let's deal with depth 3 in a similar way:
2 1 3 3 3 2 2 1 1 0
-------------------
                  1
  2           3 4
5         6 7
    8 9 0
2 is the parent of 5;
6 is the parent of 8, 9, 10;
3 is the parent of 6, 7;
or very explicitly:
2 1 3 3 3 2 2 1 1 0
-------------------
                  1
  /           / /
  2           3 4
/         / /
5         6 7
    / / /
    8 9 0
This is how you can construct a tree from the data you have.
EDIT
Clearly, this problem can be solved easily by recursion. In each step you find the lowest depth, print the node, and call the same function recursively for each of its subtrees as its argument, where the subtree is defined by looking for current_depth + 1. If the depth is passed as another argument, it can save the necessity of computing the lowest depth.
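The recursion described in the EDIT could look roughly like the sketch below. preorderFromPostorder and its parameters are hypothetical names; it assumes the root of a range is its last element and that each child of depth d + 1 closes one subtree.

```cpp
#include <cstddef>
#include <vector>

// Appends to `out` the preorder of the subtree stored in postorder[lo, hi).
// The subtree root is the last element of the range and has depth `d`.
void preorderFromPostorder(const std::vector<int>& postorder,
                           const std::vector<int>& depth,
                           std::size_t lo, std::size_t hi, int d,
                           std::vector<int>& out) {
    if (lo >= hi) return;
    out.push_back(postorder[hi - 1]);           // visit the root first
    std::size_t start = lo;                     // start of the next child's subtree
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        if (depth[i] == d + 1) {                // a direct child closes its subtree
            preorderFromPostorder(postorder, depth, start, i + 1, d + 1, out);
            start = i + 1;
        }
    }
}
```

Calling it with lo = 0, hi = postorder.size() and d = 0 on the example input gives 1 2 5 3 6 8 9 10 7 4. Each level of the tree rescans its range, so the running time is proportional to n times the tree height rather than O(n).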

Can we really avoid extra space when all the values are non-negative?

This question is a follow-up of another one I had asked quite a while ago:
We have been given an array of integers and another number k, and we need to find the total number of contiguous subarrays whose sum equals k. For example, for the input [1,1,1] and k=2, the expected output is 2.
In the accepted answer, @talex says:
PS: BTW if all values are non-negative there is better algorithm. it doesn't require extra memory.
While I didn't think much about it then, I am curious about it now. IMHO, we will require extra memory. If all the input values are non-negative, our running (prefix) sum will keep increasing, so, sure, we don't need an unordered_map to store the frequency of a particular sum. But we will still need extra memory (perhaps an unordered_set) to store the running (prefix) sums that we get along the way. This obviously contradicts what @talex said.
Could someone please confirm if we absolutely do need extra memory or if it could be avoided?
Thanks!

Let's start with a slightly simpler problem: all values are positive (no zeros). In this case the sub arrays can overlap, but they cannot contain one another.
E.g.: arr = 2 1 5 1 1 5 1 2, sum = 8
2 1 5 1 1 5 1 2
|---|
  |-----|
      |-----|
          |---|
But this situation can never occur:
* * * * * * *
  |-------|
    |---|
With this in mind, there is an algorithm that doesn't require extra space (well, O(1) space) and has O(n) time complexity. The idea is to keep left and right indexes delimiting the current sequence, together with the sum of the current sequence:
    if the sum is k, increment the counter and advance both left and right
    if the sum is less than k, advance right
    else advance left
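For the strictly positive case, the two-index idea can be sketched as follows. countSubarraysPositive is a hypothetical name, and the code assumes every value is greater than zero and k > 0; it loops over right and shrinks from the left, which is a slight restructuring of the three rules above rather than a literal transcription.

```cpp
#include <cstddef>
#include <vector>

// Counts contiguous subarrays whose sum is exactly k.
// Assumes all values are strictly positive and k > 0.
long long countSubarraysPositive(const std::vector<int>& a, long long k) {
    long long count = 0;
    long long sum = 0;                  // sum of a[left..right]
    std::size_t left = 0;
    for (std::size_t right = 0; right < a.size(); ++right) {
        sum += a[right];                            // advance right
        while (sum > k) sum -= a[left++];           // advance left while too large
        if (sum == k) ++count;                      // at most one match per right
    }
    return count;
}
```

Only two indexes and two counters are kept, so the extra space is O(1), and each index moves forward at most n times.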
Now, if there are zeros, the intervals can contain one another, but only if the zeros are on the margins of the interval.
To adapt to non-negative numbers, do as above, except:
    skip zeros when advancing left
    if the sum is k:
        count the consecutive zeros to the right of right, let's say zeroes_right_count
        count the consecutive zeros to the left of left, let's say zeroes_left_count
        instead of incrementing the counter as before, increase it by (zeroes_left_count + 1) * (zeroes_right_count + 1)
Example:
... 7 0 0 5 1 2 0 0 0 9 ...
          ^   ^
       left   right
Here we have 2 zeroes to the left and 3 zeros to the right. This makes (2 + 1) * (3 + 1) = 12 sequences with sum 8 here:
5 1 2
5 1 2 0
5 1 2 0 0
5 1 2 0 0 0
0 5 1 2
0 5 1 2 0
0 5 1 2 0 0
0 5 1 2 0 0 0
0 0 5 1 2
0 0 5 1 2 0
0 0 5 1 2 0 0
0 0 5 1 2 0 0 0
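One way to check counts like this in O(1) extra space is sketched below. Note that it uses a different formulation from the zero-counting described above: it counts the subarrays with sum at most k and subtracts those with sum at most k - 1. countAtMost and countSubarraysNonNegative are hypothetical names, and all values are assumed to be non-negative.

```cpp
#include <vector>

// Number of contiguous subarrays with sum <= limit (all values must be >= 0).
long long countAtMost(const std::vector<int>& a, long long limit) {
    if (limit < 0) return 0;
    long long count = 0, sum = 0, left = 0;
    const long long n = static_cast<long long>(a.size());
    for (long long right = 0; right < n; ++right) {
        sum += a[right];
        while (sum > limit) sum -= a[left++];   // shrink until the window fits
        count += right - left + 1;              // every start in [left, right] works
    }
    return count;
}

// Subarrays with sum exactly k = (sum <= k) minus (sum <= k - 1).
long long countSubarraysNonNegative(const std::vector<int>& a, long long k) {
    return countAtMost(a, k) - countAtMost(a, k - 1);
}
```

For 7 0 0 5 1 2 0 0 0 9 and k = 8 this returns 12, matching the enumeration above, and it also handles k = 0.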

I think this algorithm would work, using O(1) space.
We maintain two pointers to the beginning and end of the current subsequence, as well as the sum of the current subsequence. Initially, both pointers point to array[0], and the sum is obviously set to array[0].
Advance the end pointer (thus extending the subsequence to the right), and increase the sum by the value it points to, until that sum exceeds k. Then advance the start pointer (thus shrinking the subsequence from the left), and decrease the sum, until that sum gets below k. Keep doing this until the end pointer reaches the end of the array. Keep track of the number of times the sum was exactly k.

What is the tree-structure of a heap?

I'm reading Nicolai M. Josuttis's "The C++ standard library, a tutorial and reference", ed2.
He explains the heap data structure and related STL functions in page 607:
The program has the following output:
on entry: 3 4 5 6 7 5 6 7 8 9 1 2 3 4
after make_heap(): 9 8 6 7 7 5 5 3 6 4 1 2 3 4
after pop_heap(): 8 7 6 7 4 5 5 3 6 4 1 2 3
after push_heap(): 17 7 8 7 4 5 6 3 6 4 1 2 3 5
after sort_heap(): 1 2 3 3 4 4 5 5 6 6 7 7 8 17
I'm wondering how this could be figured out. For example, why is the leaf "4" under the path 9-6-5-4 the left child of node "5", not the right one? And what is the tree structure after pop_heap? In the IDE's debugging mode I could only see the content of the vector; is there a way to figure out the tree structure?

why the leaf "4" under path 9-6-5-4 is the left side child of node "5", not the right side one?
Because if it was on the right side, that would mean there is a gap in the underlying vector. The tree structure is for illustrative purposes only. It is not a representation of how the heap is actually stored. The tree structure is mapped onto the underlying vector via a simple mathematical formula.
The root node of the tree is the first element of the vector (index 0). The index of the left child of a node is obtained from its parent's index by the simple formula: i * 2 + 1. And the index of the right child is obtained by i * 2 + 2.
And after pop_heap what's the tree structure then?
pop_heap can be described in two ways. In the first, the root node is swapped with the greater of its two children [1], and this is repeated until it is at the bottom of the tree. Then it is swapped with the last element, and that element is pushed up the tree, if necessary, by swapping with its parent while it is greater than the parent.
In the second, the root node is swapped with the last element of the heap. Then this element is pushed down the heap by swapping with the greater of its two children [1]. This is repeated until it is in the correct position (i.e. it is not less than either of its children).
So after pop_heap, your tree looks like this:
    ----- 8 -----
    |           |
 ---7---     ---6---
 |     |     |     |
-7-   -4-   -5-   x5
| |   | |   | |   x
3 6   4 1   2 3   9
The 9 is not actually part of the heap anymore, but it is still part of the vector until you erase it, via a call to pop_back or similar.
[1] If the children are equal, as in the case of the adjacent 7's in the tree in your example, it could go either way. I believe that std::pop_heap sends it to the right, though I'm not sure whether this is implementation-defined.

The first element in the vector is the root at index 0. Its left child is at index 1 and its right child at index 2. In general: left_child(i) = 2 * i + 1 and right_child(i) = 2 * i + 2 and parent(i) = floor((i - 1) / 2)
Another way to think about it is the heap fills each level from left to right in the tree. Following the elements in the vector the first level is 9 (1 value), second level 8 6 (2 values) and third level 7 7 5 5 (4 values), and so on. Both these ways will help you draw the heap in a tree structure when given a vector.
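Both views can be written down in a few lines. The sketch below is only an illustration with hypothetical helper names (leftChild, rightChild, parent, heapLevels); it applies the index formulas above and cuts a heap vector into levels of 1, 2, 4, ... values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

std::size_t leftChild(std::size_t i)  { return 2 * i + 1; }
std::size_t rightChild(std::size_t i) { return 2 * i + 2; }
std::size_t parent(std::size_t i)     { return (i - 1) / 2; }  // requires i > 0

// Splits a heap stored in a vector into its tree levels, top to bottom.
std::vector<std::vector<int>> heapLevels(const std::vector<int>& heap) {
    std::vector<std::vector<int>> levels;
    std::size_t width = 1;                      // number of nodes on this level
    for (std::size_t start = 0; start < heap.size(); start += width, width *= 2) {
        std::size_t end = std::min(start + width, heap.size());
        levels.emplace_back(heap.begin() + start, heap.begin() + end);
    }
    return levels;
}
```

For the vector after make_heap() in the question, the levels come out as 9, then 8 6, then 7 7 5 5, then 3 6 4 1 2 3 4. The last 4 sits at index 13, which is leftChild(6); index 14 = rightChild(6) is past the end of the vector, which is why that 4 has to be a left child.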

Maximum element after M operations

We are given 3 elements N1, N2, N3, and we can perform the following operation on them:
In a single operation, we choose one element and halve its value (i.e. if the initial value of the element is x, it becomes x/2 using integer division, e.g. 3/2=1 and 4/2=2). Meanwhile, the value of the other two elements increases by one.
We need to minimise the maximum element, given that we can perform this operation for at most M seconds, at most once per second.
Example: Let N1=1, N2=2, N3=3, M=1. Then the answer is 3.
Explanation: We can pick the 3rd element and halve it. Note that the first and second elements increase by 1, so the values become 2, 3, 1. The maximum of these values is 3, hence the answer is 3.
My approach: every time, pick the largest element, halve it, and increase the other two by 1.
Code :
    long long ans = max(N1, max(N2, N3));
    for (int i = 0; i < M; i++) {
        if (N1 >= N2 && N1 >= N3) {
            N1 /= 2;
            N2++;
            N3++;
        }
        else if (N2 >= N1 && N2 >= N3) {
            N2 /= 2;
            N1++;
            N3++;
        }
        else {
            N1++;
            N2++;
            N3 /= 2;
        }
        ans = min(ans, max(N1, max(N2, N3)));
    }
Failure:
But let N1=8, N2=8, N3=4, M=3. Then the answer is 5, and this approach goes wrong, because according to the algorithm above the steps would be:
8 8 4 -> 4 9 5 -> 5 4 6 -> 6 5 3
But the correct sequence is:
8 8 4 -> 9 9 2 -> 4 10 3 -> 5 5 4
Constraints: M is between 1 and 100. N1, N2 and N3 can go up to 10^9.
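Counterexamples like the one above can be double-checked with an exhaustive search. The sketch below (bestMax is a hypothetical name) tries every sequence of at most m operations; it explores 3^m branches, so it is only meant for verifying small cases, not as a solution for M up to 100.

```cpp
#include <algorithm>

// Smallest achievable maximum using at most `m` operations, by brute force.
long long bestMax(long long a, long long b, long long c, int m) {
    long long best = std::max({a, b, c});       // stopping here is always allowed
    if (m == 0) return best;
    best = std::min(best, bestMax(a / 2, b + 1, c + 1, m - 1));  // halve the first
    best = std::min(best, bestMax(a + 1, b / 2, c + 1, m - 1));  // halve the second
    best = std::min(best, bestMax(a + 1, b + 1, c / 2, m - 1));  // halve the third
    return best;
}
```

bestMax(8, 8, 4, 3) returns 5 and bestMax(1, 2, 3, 1) returns 3, which agrees with the two examples in the question.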

How to balance between two arrays such as the difference is minimized?

I have an array A[]={3,2,5,11,17} and B[]={2,3,6}; the size of B is always less than that of A. Now I have to map every element of B to a distinct element of A such that the total difference sum( abs(Bi-Aj) ) is minimal (where Bi has been mapped to Aj). What type of algorithm is this?
For the example input, I could select 2->2=0, 3->3=0 and then 6->5=1, so the total cost is 0+0+1 = 1. I have been thinking of sorting both arrays and then taking the first sizeof B elements from A. Will this work?

It can be thought of as an unbalanced Assignment Problem.
The cost matrix holds the absolute difference between the values of B[i] and A[j]. You can add dummy elements to B so that the problem becomes balanced, and give them very high costs.
Then the Hungarian algorithm can be applied to solve it.
For the example case A[]={3,2,5,11,17} and B[]={2,3,6} the cost matrix is:
.     3   2   5  11  17
2     1   0   3   9  15
3     0   1   2   8  14
6     3   4   1   5  11
d1   16  16  16  16  16
d2   16  16  16  16  16
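Building that matrix is mechanical; a small sketch is shown below. buildCostMatrix is a hypothetical name, and the dummy cost is assumed to be one more than the largest real cost, which reproduces the 16s in the table (any sufficiently large constant would do). The Hungarian algorithm itself is not included here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Returns a square cost matrix: one row per element of B (|B[i] - A[j]|),
// padded with dummy rows until there are as many rows as elements in A.
std::vector<std::vector<int>> buildCostMatrix(const std::vector<int>& A,
                                              const std::vector<int>& B) {
    std::vector<std::vector<int>> cost;
    int maxCost = 0;
    for (int b : B) {
        std::vector<int> row;
        for (int a : A) {
            row.push_back(std::abs(b - a));
            maxCost = std::max(maxCost, row.back());
        }
        cost.push_back(row);
    }
    // Assumption: dummy rows cost one more than the largest real cost.
    while (cost.size() < A.size())
        cost.emplace_back(A.size(), maxCost + 1);
    return cost;
}
```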
