There are two integer arrays, each in a very large file (the size of each is larger than RAM). How would you find the common elements in the arrays in linear time?
I can't find a decent solution to this problem. Any ideas?
In one pass over one file, build a bitmap (or a Bloom filter if the integer range is too large for a bitmap in memory).
In one pass over the other file, find the common elements (or candidates, if using a Bloom filter).
If you use a Bloom filter, the result is probabilistic. Additional passes can reduce the false positives (Bloom filters don't have false negatives).
Assuming integer size is 4 bytes.
There can be at most 2^32 distinct integers, so I can have a bit vector of 2^32 bits (512 MB) to represent all possible integers, where each bit represents one integer.
1. Initialize this vector with all zeroes.
2. Now go through one file and set the bit for each integer you find to 1.
3. Now go through the other file and check, for each of its integers, whether the corresponding bit is set in the bit vector.
Time complexity: O(n+m)
Space complexity: 512 MB
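A minimal sketch of this bit-vector approach in C++, assuming the integers are stored as raw 4-byte values in binary files (the file names are made up for illustration):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // One bit per possible 32-bit value: 2^32 bits = 512 MB.
    std::vector<bool> seen(1ULL << 32, false);

    // Pass 1: mark every integer that appears in the first file.
    std::ifstream a("file_a.bin", std::ios::binary);   // hypothetical file name
    uint32_t x;
    while (a.read(reinterpret_cast<char*>(&x), sizeof(x)))
        seen[x] = true;

    // Pass 2: any integer from the second file whose bit is set is common.
    std::ifstream b("file_b.bin", std::ios::binary);   // hypothetical file name
    while (b.read(reinterpret_cast<char*>(&x), sizeof(x)))
        if (seen[x]) {
            std::cout << x << '\n';
            seen[x] = false;   // clear the bit so each value is reported once
        }
}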
You can obviously use a hash table to find the common elements with O(n) time complexity.
First, build a hash table from the first array, then check each element of the second array against this hash table.
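A minimal sketch with std::unordered_set, assuming the distinct values of one array fit in memory (file names are made up):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int32_t> first;   // distinct values from the first file

    std::ifstream a("file_a.bin", std::ios::binary);   // hypothetical file name
    int32_t x;
    while (a.read(reinterpret_cast<char*>(&x), sizeof(x)))
        first.insert(x);

    std::ifstream b("file_b.bin", std::ios::binary);   // hypothetical file name
    while (b.read(reinterpret_cast<char*>(&x), sizeof(x)))
        if (first.erase(x))               // erase returns 1 if x was present
            std::cout << x << '\n';       // report each common value once
}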
Let's say enough RAM is available to hold a hash of 5% of either given file-array (FA).
So I can split the file arrays (FA1 and FA2) into 20 chunks each, say by taking the contents MOD 20. We get FA1(0)...FA1(19) and FA2(0)...FA2(19). This can be done in linear time.
Hash FA1(0) in memory and compare the contents of FA2(0) with this hash. Hashing and checking for existence are constant-time operations.
Destroy this hash and repeat for FA1(1)...FA1(19). This is also linear, so the whole operation is linear.
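A sketch of the partitioning pass, assuming raw 4-byte integers in binary files (the 20-way split follows the description above; the chunk file naming is made up). Equal values always land in chunks with the same index, so FA1(k) only ever needs to be compared against FA2(k):

#include <cstdint>
#include <fstream>
#include <string>

// Split one big binary file of int32_t values into 20 chunk files,
// routing each value to chunk (value mod 20).
void partition(const std::string& input, const std::string& prefix) {
    const int kChunks = 20;
    std::ofstream out[kChunks];
    for (int i = 0; i < kChunks; ++i)
        out[i].open(prefix + std::to_string(i) + ".bin", std::ios::binary);

    std::ifstream in(input, std::ios::binary);
    int32_t x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof(x))) {
        // Use an unsigned remainder so negative values get a valid index.
        int bucket = static_cast<uint32_t>(x) % kChunks;
        out[bucket].write(reinterpret_cast<const char*>(&x), sizeof(x));
    }
}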
Assuming the integers all have the same size and are written to the files in binary mode, first sort the two files (for example with a quicksort that reads and writes at file offsets instead of array indices).
Then move through the two files from the start and check for matches; if you have a match, write it to an output file (assuming you can't store the result in memory either) and keep moving through both files until EOF.
Sort the files. With fixed-length integers this can be done in O(n) time:
Read a part of the file, sort it with radix sort, and write it to a temporary file. Repeat until all the data has been processed. This part is O(n).
Merge the sorted parts. This is O(n) too. You can even skip repeated numbers.
On the sorted files, find the common subset of integers: compare the current numbers, write one down if they are equal, then step one number ahead in the file with the smaller number. This is O(n).
All operations are O(n), so the final algorithm is O(n) too.
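A minimal sketch of the final intersection step over the two sorted files, assuming raw 4-byte integers in binary files (the file names are made up):

#include <cstdint>
#include <fstream>
#include <iostream>

int main() {
    std::ifstream a("sorted_a.bin", std::ios::binary);  // hypothetical names
    std::ifstream b("sorted_b.bin", std::ios::binary);

    int32_t x, y;
    bool hasX = bool(a.read(reinterpret_cast<char*>(&x), sizeof(x)));
    bool hasY = bool(b.read(reinterpret_cast<char*>(&y), sizeof(y)));

    while (hasX && hasY) {
        if (x == y) {
            std::cout << x << '\n';          // common element
            int32_t v = x;
            // Skip repeats so each common value is reported once.
            while (hasX && x == v)
                hasX = bool(a.read(reinterpret_cast<char*>(&x), sizeof(x)));
            while (hasY && y == v)
                hasY = bool(b.read(reinterpret_cast<char*>(&y), sizeof(y)));
        } else if (x < y) {
            hasX = bool(a.read(reinterpret_cast<char*>(&x), sizeof(x)));
        } else {
            hasY = bool(b.read(reinterpret_cast<char*>(&y), sizeof(y)));
        }
    }
}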
EDIT: the bitmap method is much faster if you have enough memory for the bitmap. This method works for any fixed-size integers, 64-bit for example, but a bitmap covering all 64-bit integers (2^64 bits, i.e. 2^61 bytes) will not be practical for at least a few years :)
Related
Hello, I've been wondering what the most efficient way is to sort elements into "bins" in C++.
For example, we might have 3 bins:
Bin A accepts integers from 0-10
Bin B accepts integers from 5-20
Bin C accepts integers from 25-30
I want to be able to put an integer into the correct bin based on the range the bin accepts. If the int falls into the range of multiple bins, put it into the bin that is the least full. Finally, I want to be able to tell if the integer doesn't fit into any bin.
What is the most efficient way to do this?
It depends.
If the number of bins is small, a simple linear search is usually plenty efficient, especially if you take the programmer's time into consideration.
If the set of integers is small and the bins don't often overlap, it can be done in constant time: create a vector indexed by the integer, whose value is another vector of (pointers to) the applicable bins. You can then look up the list of applicable bins in constant time. You still need to loop through the applicable bins to find out which is the least full, so this is only constant time if the maximum number of overlapping bins is bounded by a constant.
If the set of integers is large but the bins themselves are relatively small, we can do something similar, but using a hash table instead of a vector.
If we can't make any assumptions, then an interval tree is probably the best solution. Its worst-case behaviour is still linear in the number of bins, though, because the bins might all overlap and would all need to be checked to find the one that is least full.
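A minimal sketch of the lookup-table idea for a small integer range, using the three example bins from the question (the bin bookkeeping is simplified to a count per bin; all names are made up):

#include <cstddef>
#include <iostream>
#include <vector>

struct Bin {
    int lo, hi;     // inclusive range the bin accepts
    int count = 0;  // how many integers have been placed in it
};

int main() {
    std::vector<Bin> bins = {{0, 10}, {5, 20}, {25, 30}};

    // table[v] holds the indices of all bins whose range contains v.
    const int kMaxValue = 30;
    std::vector<std::vector<std::size_t>> table(kMaxValue + 1);
    for (std::size_t i = 0; i < bins.size(); ++i)
        for (int v = bins[i].lo; v <= bins[i].hi; ++v)
            table[v].push_back(i);

    auto place = [&](int v) {
        if (v < 0 || v > kMaxValue || table[v].empty()) {
            std::cout << v << " fits no bin\n";
            return;
        }
        // Among the applicable bins, pick the least full one.
        std::size_t best = table[v][0];
        for (std::size_t i : table[v])
            if (bins[i].count < bins[best].count) best = i;
        ++bins[best].count;
        std::cout << v << " -> bin " << best << '\n';
    };

    for (int v : {7, 3, 8, 26, 42}) place(v);
}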
Given a C++ vector (let's say it's of doubles, and let's call it unsorted), what is the most performant way to create a new vector sorted which contains a sorted copy of unsorted?
Consider the following naïve solution:
std::vector<double> sorted = unsorted;
std::sort(sorted.begin(), sorted.end());
This solution has two steps:
Create an entire copy of unsorted.
Sort it.
However, there is potentially a lot of wasted effort in the initial copy of step 1, particularly for a large vector that is (for example) already mostly sorted.
Were I writing this code by hand, I could combine the first pass of my sorting algorithm with step 1, by having the first pass read the values from the unsorted vector while writing them, partially sorted as necessary, into sorted. Depending on the algorithm, subsequent steps might then work just with the data in sorted.
Is there a way to do such a thing with the C++ standard library, Boost, or a third-party cross-platform library?
One important point would be to ensure that the memory for the sorted C++ vector isn't needlessly initialized to zeroes before the sorting begins. Many sorting algorithms would require immediate random-write access to the sorted vector, so using reserve() and push_back() won't work for that first pass, yet resize() would waste time initializing the vector.
Edit: As the answers and comments don't necessarily see why the "naïve solution" is inefficient, consider the case where the unsorted array is in fact already in sorted order (or just needs a single swap to become sorted). In that case, regardless of the sort algorithm, with the naïve solution every value will need to be read at least twice--once while copying, and once while sorting. But with a copy-while-sorting solution, the number of reads could potentially be halved, and thus the performance approximately doubled. A similar situation arises, regardless of the data in unsorted, when using sorting algorithms that are more performant than std::sort (which may be O(n) rather than O(n log n)).
The standard library - on purpose - doesn't have a sort-while-copying function, because the copy is O(n) while std::sort is O(n log n).
So the sort will totally dominate the cost for any larger values of n. (And if n is small, it doesn't matter anyway).
Assuming the vector of doubles doesn't contain special values like NaN or infinity, the doubles can be treated as 64-bit sign + magnitude integers, which can be converted into a form usable by a radix sort, which is fastest. These "sign + magnitude integers" need to be converted into 64-bit unsigned integers. The following macros convert back and forth; SM stands for sign + magnitude, ULL for unsigned long long (uint64_t). It is assumed that the bit patterns of the doubles are reinterpreted as unsigned long long in order to use these macros:
#define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
#define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))
Note that using these macros will treat negative zero as less than positive zero, but this is normally not an issue.
Since radix sort needs an initial read pass to generate a matrix of counts (which are then converted into the starting or ending indices of the logical bucket boundaries), in this case the initial read pass can double as the copy pass that also generates the matrix of counts. A base-256 sort would use a matrix of size [8][256], and after the copy, 8 radix-sort passes would be performed. If the vector is much larger than the cache, the dominant time factor will be the random-access writes during each radix-sort pass.
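A minimal sketch of the key conversion, assuming each double's bit pattern is reinterpreted as a uint64_t (via memcpy here) before applying the macros above; the resulting keys sort in the same numeric order as the original doubles (NaN/infinity excluded):

#include <cstdint>
#include <cstring>
#include <iostream>

#define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
#define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))

// Map a double to a 64-bit unsigned key usable by a radix sort.
uint64_t key_of(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));   // reinterpret, no conversion
    return SM2ULL(bits);
}

// Inverse mapping, to recover the double after sorting the keys.
double double_of(uint64_t key) {
    uint64_t bits = ULL2SM(key);
    double d;
    std::memcpy(&d, &bits, sizeof(d));
    return d;
}

int main() {
    // The keys preserve ordering: -2.5 < -0.5 < 0.25 < 3.0.
    for (double d : {-2.5, -0.5, 0.25, 3.0})
        std::cout << d << " -> " << key_of(d) << '\n';
}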
In C++, is it possible to sort 1 million numbers, assuming we know the range of the numbers, with using only 100,000 memory cells?
Specifically, a .bin file contains a million numbers within a given range, and I need to sort these numbers in descending order into another .bin file, but I am only allowed to use an array of size 100,000 for sorting. Any ideas?
I think I read about this somewhere on SO or Quora, in the context of map-reduce:
Divide the 1 million numbers into 10 blocks. Read in the first block of 100k numbers, sort it using quicksort, then write it back to the original file. Do the same for the remaining 9 blocks. Then perform a 10-way merge of the 10 sorted blocks in the original file (you only need 10 cells for this) and write the merged output to another file. You can write to a ~100k buffer and flush it to the output file for faster writes.
Assuming that the range of numbers has 100,000 values or fewer, you can use Counting Sort.
The idea is to use memory cells as counts for the numbers in the range. For example, if the range is 0..99999, inclusive, make an array int count[100000], and run through the file incrementing the counts:
count[itemFromFile]++;
Once you have gone through the whole file, go through the range again. For each count[x] that is not zero, output x that many times (iterate over the range in reverse to get the descending order you need). The result will be the original data in sorted order.
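A minimal sketch of this counting sort over binary files, assuming the values are raw int32_t in the range 0..99999 (file names are made up); the range is iterated in reverse to produce the descending order asked for:

#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    const int32_t kRangeSize = 100000;          // values are in 0..99999
    std::vector<int32_t> count(kRangeSize, 0);  // the 100,000 allowed cells

    // Pass 1: count occurrences of each value.
    std::ifstream in("input.bin", std::ios::binary);   // hypothetical name
    int32_t x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof(x)))
        ++count[x];

    // Pass 2: emit each value count[v] times, largest value first.
    std::ofstream out("sorted.bin", std::ios::binary); // hypothetical name
    for (int32_t v = kRangeSize - 1; v >= 0; --v)
        for (int32_t i = 0; i < count[v]; ++i)
            out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}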
You can implement a version of the quicksort algorithm that works on files instead of on vectors.
So, recursively split the file into lower-than-pivot/higher-than-pivot parts, sort those files, and recombine them. When the size gets under the available memory, just start working in memory rather than with files.
I have two files:
One file, "mapping.txt" (10 GB), stores numbered strings:
1 "First string"
2 "Second string"
3 "Third string"
...
199000000 "199000000th string"
And the other file, "file.txt", stores integers from mapping.txt in some arbitrary order:
88 76 23 1 5 7 9 10 78 12 99 12 15 16 77 89 90 51
Now I want to sort "mapping.txt" in the order specified by the integers above like:
88 "88th string"
76 "76th string"
23 "23rd string"
1 "1st string"
5 "5th string"
7 "7th string"
How do I accomplish this using C++?
I know that for every integer in the file one can perform a binary search in "mapping.txt", but since its time complexity is O(n log n), it is not very efficient for large files.
I'd like a way to do this that is more performant than O(n log n).
Here's what I would do. This may not be the most efficient way, but I can't think of a better one.
First, you pass over the big file once to build an index of the offset at which each line starts. This index should fit into memory, if your lines are long enough.
Then you pass over the small file, read each index, jump to the corresponding location in the big file, and copy that line to the target file.
Because your index is continuous and indexed by integer, lookup is constant time. Any in-memory lookup time will be completely overshadowed by disk seek time anyway, though.
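A minimal sketch of that approach, assuming the line numbers in mapping.txt are 1-based and contiguous as in the example (input names follow the question; the output name is made up):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Pass 1: record the byte offset at which each line of mapping.txt starts.
    std::ifstream mapping("mapping.txt");
    std::vector<std::streampos> offset;
    std::string line;
    while (true) {
        std::streampos pos = mapping.tellg();
        if (!std::getline(mapping, line)) break;
        offset.push_back(pos);
    }
    mapping.clear();   // clear EOF so we can seek again

    // Pass 2: for each id in file.txt, seek to its line and copy it out.
    std::ifstream order("file.txt");
    std::ofstream out("reordered.txt");   // hypothetical output name
    long long id;
    while (order >> id) {
        if (id < 1 || id > static_cast<long long>(offset.size())) continue;
        mapping.seekg(offset[id - 1]);    // ids are 1-based line numbers
        std::getline(mapping, line);
        out << line << '\n';
    }
}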
I know that for every integer in file.txt one can perform a binary search in "mapping.txt"
As you said, binary search is not useful here; besides the reason you exposed, you also have the challenge that mapping.txt is not in a format friendly to searching or indexing.
If possible, I would recommend changing the format of the mapping file to one more suitable for direct seek calls. For instance, you could use a file containing fixed-length strings, so you can calculate the position of each entry (the number of fseek calls would be constant, but keep in mind that the call itself isn't constant-time).
[EDIT]:
Another thing you could do to minimize access to mapping.txt is the following:
Load the "order" file into an array in memory, but in a way where the position is the actual line in mapping.txt and the element is the desired position in the new file; for instance, the first element of that array would be 4, because 1 is in the 4th position (in your example).
For convenience, split the new array into N bucket files, so if an element would go to the 200th position, that would be the first position in the 4th bucket (for example).
Now you can access the mapping file sequentially: for each line, check your array for its actual position in the new file and put it in the corresponding bucket.
Once you have passed over the whole mapping file (you only have to scan it once), you only need to append the N buckets into your desired file.
As Sebastian suggested, try
creating an index over the mapping file ("mapping.txt") with the offset (and optionally length) of each string in the file.
Then access that index for each entry in the ordering file ("file.txt") and seek to the stored position in the text file.
This has linear time complexity, depending on the size of the two files, and linear space complexity with a small constant factor, depending on the line count of "mapping.txt".
For fast and memory-efficient sequential read access to large regular files use mmap(2) and madvise(2) or their corresponding constructs in the Windows API. If the file is larger than your address space, mmap it in chunks as large as possible. Don't forget to madvise the kernel on the different access pattern in step 2 (random vs. sequential).
Please don't copy that much data from a file onto the heap if you don't need it later and your system has memory maps!
Given that you have a list of exactly how you want the data output, I'd try an array.
You would be best served to split this problem up into smaller problems:
Split mapping.txt and file.txt into chunks of n and m entries respectively (n and m could be the same or different).
Take your normal map-sorting routine and modify it to take a chunk number (the chunk being which m-offset of file.txt you're operating on) and perform the map-sorting on those indices from the various mapping.txt chunks.
Once complete, you will have m output-X.txt files which you can merge into your actual output file.
Since your data is ASCII, it will be a pain to map fixed-size windows onto either file, so splitting both into smaller files will be helpful.
This is a pretty good candidate for mergesort.
This will be O(n log n) but most algorithms will not beat that.
You just need to use the index file to alter the key comparison.
You will find merge sort in any decent algorithms textbook, and it is well suited to doing an external sort to disk, for when the file to be sorted is bigger than memory.
If you really must beat O(n log n), pass over the file and build a hash table, indexed by the key, of where every line is. Then read the index file and use the hash table to locate each line.
In theory this would be O(n + big constant).
I see some problems with this, however: what is n? That will be a big hash table. The implementation may very well be slower than the O(n log n) solution because the "big constant" is really big. Even if you mmap the file for efficient access, you may get a lot of paging.
Say you have a List of 32-bit Integers, and the same collection of 32-bit Integers in a Multiset (a set that allows duplicate members).
Since Sets don't preserve order but Lists do, does this mean we can encode a Multiset in fewer bits than the List?
If so, how would you encode the Multiset?
If this is true, what other examples are there where not needing to preserve order saves bits?
Note, I just used 32-bit Integers as an example. Does the data type matter in the encoding? Does the data type need to be fixed length and comparable for you to get the savings?
EDIT
Any solution should work well for collections that have low duplication as well as high duplication. It's obvious that with high duplication, encoding a Multiset by simply counting duplicates is very easy, but this takes more space if there is no duplication in the collection.
In the multiset, each entry would be a pair of numbers: The integer value, and a count of how many times it is used in the set. This means additional repeats of each value in the multiset do not cost any more to store (you just increment the counter).
However (assuming both values are ints), this would only be more efficient storage than a simple list if each list item is repeated twice or more on average. There could be more efficient or higher-performance ways of implementing this, depending on the ranges, sparsity, and repetitiveness of the numbers being stored. (For example, if you know there won't be more than 255 repeats of any value, you could use a byte rather than an int to store the counter.)
This approach would work with any types of data, as you are just storing the count of how many repeats there are of each data item. Each data item needs to be comparable (but only to the point where you know that two items are the same or different). There is no need for the items to take the same amount of storage each.
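A minimal sketch of this (value, count) encoding, assuming 32-bit values and 32-bit counters; the byte counts show the break-even point against a plain list:

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<int32_t> list = {7, 7, 7, 42, 42, -3, 7, 42};

    // Multiset encoding: each distinct value stored once with a counter.
    std::map<int32_t, uint32_t> counts;
    for (int32_t v : list) ++counts[v];

    std::size_t listBytes = list.size() * sizeof(int32_t);
    std::size_t msetBytes = counts.size() * (sizeof(int32_t) + sizeof(uint32_t));

    std::cout << "list encoding:     " << listBytes << " bytes\n";   // 32
    std::cout << "multiset encoding: " << msetBytes << " bytes\n";   // 24
    // With no duplicates, the multiset encoding would be twice as large;
    // with an average of 2+ repeats per value it starts to win.
}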
If there are duplicates in the multiset, it could be compressed to a smaller size than a naive list. You may want to have a look at Run-length encoding, which can be used to efficiently store duplicates (a very simple algorithm).
Hope that is what you meant...
Data compression is a fairly complicated subject, and there are redundancies in data that are hard to use for compression.
It is fundamentally ad hoc, since a non-lossy scheme (one where you can recover the input data) that shrinks some data sets has to enlarge others. A collection of integers with lots of repeats will do very well in a multiset, but if there's no repetition you're using a lot of space on repeating counts of 1. You can test this by running compression utilities on different files. Text files have a lot of redundancy and can typically be compressed a lot. Files of random numbers will tend to grow when compressed.
I don't know that there really is an exploitable advantage in losing the order information. It depends on what the actual numbers are, primarily if there's a lot of duplication or not.
In principle, this is the equivalent of sorting the values and storing the first entry and the ordered differences between subsequent entries.
In other words, for a sparsely populated set only a little saving can be had, but for a denser set, or one with clustered entries, more significant compression is possible (i.e. fewer bits need to be stored per entry, possibly less than one in the case of many duplicates). That is, compression is possible, but the level depends on the actual data.
The operation sort followed by list delta will result in a serialized form that is easier to compress.
E.g. [ 2 12 3 9 4 4 0 11 ] -> [ 0 2 3 4 4 9 11 12 ] -> [ 0 2 1 1 0 5 2 1 ], which weighs about half as much.
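A minimal sketch of the sort-then-delta step from the example above (compressing the deltas themselves, e.g. with a variable-length code, is left out):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<int32_t> values = {2, 12, 3, 9, 4, 4, 0, 11};

    // Sort, then store the first entry and the differences between
    // consecutive entries. Small deltas compress much better than the
    // original values.
    std::sort(values.begin(), values.end());
    std::vector<int32_t> deltas;
    int32_t prev = 0;
    for (int32_t v : values) {
        deltas.push_back(v - prev);
        prev = v;
    }

    for (int32_t d : deltas) std::cout << d << ' ';   // 0 2 1 1 0 5 2 1
    std::cout << '\n';

    // Decoding is a running sum over the deltas, which recovers the
    // sorted multiset (the original order is intentionally discarded).
}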