If you have two mostly identical files with thousands of records, how would you write code to find the differences between them? Assume that Unix/Linux commands are not allowed.
My idea:
Because most of the entries are the same, we can sort the entries of the two files and then compare them one by one, i.e. entry i of file1 against entry i of file2. This is O(n lg n), where n is the size of a file.
Is there an O(n) solution?
Hash tables are your friends.
Get a record from file 1.
Hash it.
Get the corresponding slot in the table.
Set its value to 1.
Repeat for all records in file 1.
Repeat for all records in file 2, but add 2 instead of setting to 1.
Now you know which record exists in both files (value 3), which exists in the first file only (value 1), and which exists in the second file only (value 2). And in linear time.
Note: If you're implementing your own hash table, you have to handle growing the size of your table as needed, as well as collisions. I'm sure if you could do that then you wouldn't have a hard time with this question, so use a library.
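For instance, a minimal C++ sketch of the idea above, assuming each record is one line of text and using placeholder file names (bit flags are used instead of set-to-1/add-2 so duplicate records within one file don't skew the values):
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> seen;
    std::ifstream f1("file1.txt"), f2("file2.txt");   // placeholder names
    std::string rec;
    while (std::getline(f1, rec)) seen[rec] |= 1;     // mark "present in file 1"
    while (std::getline(f2, rec)) seen[rec] |= 2;     // mark "present in file 2"
    for (const auto& kv : seen) {
        if (kv.second == 1)      std::cout << "only in file1: " << kv.first << '\n';
        else if (kv.second == 2) std::cout << "only in file2: " << kv.first << '\n';
        // kv.second == 3: the record exists in both files
    }
}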
Related
Now I have a file recording the entries of a lookup table. If the number of entries is small, I can simply load this file into an STL map and perform lookups in my code. But what if there are a great many entries? If I do it in the way above, it may cause errors such as running out of memory. I'm here to listen to your advice...
P.S. I just want to perform lookups without loading all the entries into memory.
Can a key-value database solve this problem?
You'll have to load the data from the hard drive eventually, but if the table is huge it won't fit into memory for a linear search through it, so:
think about whether you can split the data into a set of files
make an index table of which file contains which entries (say the first 100 entries are in "file1_100", the second hundred in "file101_200", and so on)
using the index table from step 2, locate the file to load
load the file and do a linear search
That is a really simplified scheme for a typical database management system, so you may want to use an existing one such as MySQL, PostgreSQL, MS SQL Server, or Oracle.
If this is a study project, then after you're done with the search problem, consider optimizing the linear operations (by switching to something like binary search) and the tables (real databases use balanced tree structures, hash tables and the like).
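As a rough sketch of that lookup scheme (the key type, file layout, and names are made up): a small in-memory index maps the first key stored in each chunk file to that chunk's name; a lookup picks the right chunk via the index and then scans only that one file.
#include <fstream>
#include <map>
#include <optional>
#include <string>

// Index table: first key stored in a chunk file -> that chunk's file name.
// It is small (one entry per chunk), so it stays in memory.
using ChunkIndex = std::map<long, std::string>;

std::optional<std::string> lookup(const ChunkIndex& index, long key) {
    auto it = index.upper_bound(key);              // first chunk that starts after key
    if (it == index.begin()) return std::nullopt;  // key is before every chunk
    --it;                                          // chunk whose range could contain key
    std::ifstream chunk(it->second);
    long k;
    std::string value;
    while (chunk >> k >> value)                    // linear search within the single chunk
        if (k == key) return value;
    return std::nullopt;
}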
One method would be to reorganize the data in the file into groups.
For example, let's consider a full language dictionary. Usually, dictionaries are too huge to read completely into memory. So one idea is to group the words by first letter.
In this example, you would first read in the appropriate group based on the letter. So if the word you are searching for begins with "m", you would load the "m" group into memory.
There are other methods of grouping such as word (key) length. There can also be subgroups too. In this example, you could divide the "m" group by word lengths or by second letter.
After grouping, you may want to write the data back to another file so you don't have to modify the data anymore.
There are many ways to store groups on the file, such as using a "section" marker. These would be for another question though.
The ideas here, including from #047, are to structure the data for the most efficient search, given your memory constraints.
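A toy sketch of the first-letter grouping with hypothetical file names (streams are reopened per word for brevity; a real version would cache them):
#include <cctype>
#include <fstream>
#include <set>
#include <string>

static std::string group_file_for(const std::string& word) {
    char letter = static_cast<char>(std::tolower(static_cast<unsigned char>(word[0])));
    return std::string("group_") + letter + ".txt";
}

// One-time preprocessing: append each word to the file for its first letter.
void build_groups(const std::string& dictionary_path) {
    std::ifstream in(dictionary_path);
    std::string word;
    while (in >> word)
        std::ofstream(group_file_for(word), std::ios::app) << word << '\n';
}

// Lookup: load only the one group the word could belong to.
bool contains(const std::string& word) {
    std::ifstream in(group_file_for(word));
    std::set<std::string> group;
    std::string w;
    while (in >> w) group.insert(w);
    return group.count(word) > 0;
}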
I'm facing the following problem:
I have a huge file (let's say 30 GB), that is streamed in memory with a specific API.
This API only allows me to read going forward (not backward). But the files can be read as many times as I want.
The file contains data that is almost entirely sorted: 99% of the records are in order, but occasionally a record is out of place and should have been inserted much earlier if everything were sorted.
I'm trying to create a duplicate of this file, except it would need to be sorted.
Is there a graceful way to do this?
The only way I can think of is the most generic way:
read the file
create batches of a few GB in memory, sort them, and write them to temporary files on the HDD
use external merge to merge all these temporary files into the final output
However, this does not exploit the fact that the data is "almost" sorted. Is there a better way? For instance, one that avoids external files on the HDD?
You could do this (example in Python)
# `records`, `write`, and `max_memory` are placeholders from the sketch:
# `records` must be re-iterable (e.g. re-open the file for the second pass).
last = None
special = []                    # out-of-sequence records collected on the first pass
for r in records:
    if last is None or r >= last:   # ties stay in the main stream
        last = r
    else:
        special.append(r)
        if len(special) > max_memory:
            break

if len(special) > max_memory:
    # too many out-of-sequence records, fall back to a regular external sort
    ...
else:
    special.sort()
    # Second pass: merge the in-order stream with `special`, skipping the
    # records that were diverted into `special` on the first pass so they
    # are not written twice.
    i = 0
    last = None
    for r in records:
        if last is not None and r < last:
            continue            # already collected in `special`
        last = r
        while i < len(special) and special[i] <= r:
            write(special[i])
            i += 1
        write(r)
    while i < len(special):
        write(special[i])
        i += 1
Use a variation of bottom-up merge sort called natural merge sort. The idea is to find runs of already-ordered data, then repeatedly merge those runs back and forth between two files (all sequential I/O) until only a single run is left. If the sort doesn't have to be stable (preserve the order of equal elements), then you can consider a run boundary to occur whenever a pair of adjacent elements is out of order, which eliminates some housekeeping. If the sort needs to be stable, then you need to keep track of run boundaries on the initial pass that finds the runs; this could be an array of counts (the size of each run), which will hopefully fit in memory. After each merge pass the number of counts in the array is cut in half, and once there's only a single count the sort is done.
Wiki article (no sample code given, though): natural bottom-up merge sort.
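Since no sample code is given there, here is a hedged C++ sketch of just the initial pass: it scans a file of integer records once and records the length of each already-sorted run (the non-stable variant, where a boundary is simply an adjacent pair out of order). The merge passes themselves are not shown, and the file name is a placeholder.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream in("input.txt");                 // placeholder file name
    std::vector<std::size_t> run_lengths;          // the "array of counts" of run sizes
    long prev, cur;
    if (!(in >> prev)) return 0;                   // empty file: nothing to do
    std::size_t run = 1;
    while (in >> cur) {
        if (cur < prev) {                          // adjacent pair out of order: run boundary
            run_lengths.push_back(run);
            run = 0;
        }
        ++run;
        prev = cur;
    }
    run_lengths.push_back(run);
    std::cout << "runs found: " << run_lengths.size() << '\n';
}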
If all the out of order elements consist of somewhat isolated records, you could separate the out of order elements into a third file, only copying in order records from the first file to the second file. Then you sort the third file with any method you want (bottom up merge sort is probably still best if the third file is large), then merge the second and third files to create a sorted file.
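The final step of that variant is an ordinary two-way merge of two sorted files. A minimal sketch, assuming integer records, one per line, and made-up file names:
#include <fstream>

int main() {
    std::ifstream a("in_order.txt");               // second file: the in-order records
    std::ifstream b("exceptions_sorted.txt");      // third file, already sorted
    std::ofstream out("sorted.txt");
    long x, y;
    bool have_x = static_cast<bool>(a >> x);
    bool have_y = static_cast<bool>(b >> y);
    while (have_x && have_y) {                     // standard two-way merge
        if (x <= y) { out << x << '\n'; have_x = static_cast<bool>(a >> x); }
        else        { out << y << '\n'; have_y = static_cast<bool>(b >> y); }
    }
    while (have_x) { out << x << '\n'; have_x = static_cast<bool>(a >> x); }
    while (have_y) { out << y << '\n'; have_y = static_cast<bool>(b >> y); }
}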
If you have multiple hard drives, keep the files on separate drives. If doing this on a SSD drive, it won't matter. If using a single hard drive, reading or writing a large number of records at a time, like 10MB to 100MB per read or write, will greatly reduce the seek overhead during the sort process.
I have a database implementation with one file per record, and I have about 10000 records.
I'm trying to optimize the performance of file access, and I have a question.
Is splitting the files into folders better than keeping them all in a single folder, for quick access to the files? E.g. records 0 to 999 in folder 0, 1000 to 1999 in folder 1, and so on.
What is better for this, FAT16 or FAT32?
If you are accessing the files directly, then you won't see any performance drop. If you are searching for a particular file on the disk, it would be faster to store them in folders; this way the folders would emulate database indexes. But as #blow mentioned, why don't you use something like SQLite?
When you retrieve a file by filename, you most likely do a linear search in the directory containing that file: you skip directory entries until you find the one that matches the given filename.
This search operation may be slow if you do it every time for every file, if there are many files in the directory, and if reads are slow (and if your CPU is slow you lose even more).
You may want to build some sort of an index, a compact array of pairs filename+location sorted by filename, which you can keep in memory to quickly find files w/o rereading the directory entries.
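A hedged sketch of such an index - the location type (here a 32-bit cluster number or offset) and how the array gets filled are assumptions; the point is that the lookup becomes a binary search instead of a directory scan:
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// (file name, location on disk) pairs, kept sorted by file name.
using IndexEntry = std::pair<std::string, std::uint32_t>;

// `index` must already be sorted by name (one std::sort at startup suffices).
const IndexEntry* find_record(const std::vector<IndexEntry>& index,
                              const std::string& name) {
    auto it = std::lower_bound(index.begin(), index.end(), name,
        [](const IndexEntry& e, const std::string& n) { return e.first < n; });
    return (it != index.end() && it->first == name) ? &*it : nullptr;
}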
Things can be greatly simplified if there's a constant number of files and they have the same length or are padded to the same length. In that case you don't need any search as you can calculate the location of each file directly from the filename, provided, of course, that the order of the files is fixed.
The only practical difference between FAT1x and FAT32 in this context is the size of the file allocation table, the set of linked lists/chains that tells you which clusters are free or occupied by file/directory data and which cluster is the next in a file/directory after a given one. On FAT32 the cluster chain elements are 32-bit, twice as large as on FAT16, so if the number of used clusters is small (less than ~64K) you are going to read twice as much data from FAT32 while traversing the cluster chains compared with FAT16. Also, finding a free cluster on FAT32 (when you create a new file/dir or grow an existing one) can be slow if there are many clusters on the disk (and there can be up to 2^28 on FAT32, AFAIR, vs 2^16 on FAT16). You don't want to start searching for a free cluster from the beginning of the FAT every time; keep a pointer to the last place the search stopped, resume from there next time, and wrap around to the beginning of the FAT when you reach its end.
Split them across directories (the split count depending on your cluster size) and do not use LFN (long file names) if you can, because it will slow down your operations. I also work on embedded systems. I did not have to access thousands of files like you, but I avoided LFN (especially for royalty reasons).
I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.
Sample log file:
User ID Page ID
A 1
A 2
A 3
B 2
B 3
C 1
B 4
A 4
Corresponding results:
A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence
My idea is to use two hash tables. The first hashes on user ID and stores its sequence; the second hashes three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
However, since I have to use two hash tables, memory cannot hold everything at once, and I have to use disk. It is not efficient to access disk very often.
How can I do this better?
If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
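One way to realize that bounded counter - a minimal sketch, assuming sequences are encoded as strings such as "2-3-4" and the capacity is positive: a count map limited to a fixed number of entries, evicting the least recently updated one when full (evicted sequences lose their counts, which is what makes the result approximate).
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Counts keys (e.g. "2-3-4") but holds at most `capacity` of them; when full,
// the least recently updated key is dropped.
class LruCounter {
public:
    explicit LruCounter(std::size_t capacity) : capacity_(capacity) {}

    void increment(const std::string& key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second += 1;
            lru_.splice(lru_.begin(), lru_, it->second);  // move to "most recent"
            return;
        }
        if (lru_.size() == capacity_) {                   // evict least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(key, 1L);
        index_[key] = lru_.begin();
    }

private:
    using Entry = std::pair<std::string, long>;           // (key, count)
    std::size_t capacity_;
    std::list<Entry> lru_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};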
If you want an exact result, use an external sort procedure to sort the logs by user ID, then combine every 3 consecutive entries and sort again, this time by page IDs.
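A hedged sketch of the middle step of the exact method, assuming the log has already been externally sorted by user ID (keeping timestamp order within each user) and using made-up file names: every complete window of three pages is written out as one triple per line; that triples file is then sorted and counted externally, as described above (not shown here).
#include <fstream>
#include <string>

int main() {
    std::ifstream in("log_sorted_by_user.txt");    // already sorted by user ID
    std::ofstream out("triples.txt");              // one "p1-p2-p3" per line
    std::string user, prev_user;
    int page, p1 = 0, p2 = 0, seen = 0;            // p1, p2: last two pages of the current user
    while (in >> user >> page) {
        if (user != prev_user) { prev_user = user; seen = 0; }   // new user: restart window
        if (++seen >= 3)                           // emit once the window holds 3 pages
            out << p1 << '-' << p2 << '-' << page << '\n';
        p1 = p2;
        p2 = page;
    }
}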
Update (sort by timestamp)
Some preprocessing may be needed to properly use logfiles' timestamps:
If the logfiles are already sorted by timestamp, no preprocessing needed.
If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all these files and use merge sort to read them.
If the files are only almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data in the correct order.
If files are not sorted by timestamp (which is not likely in practice), use external sort by timestamp.
Update2 (Improving approximate method)
The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or differ on weekends, and the original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.
Partition the LRU queue into log(N) smaller queues with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one only elements seen at least 4 times, and so on. When an element is evicted from one sub-queue, it is added to the next one, so it passes through every sub-queue lower in the hierarchy before it is completely removed. Such a priority queue still has O(1) complexity, but allows a much better approximation of the most popular pages.
There are probably syntax errors galore here, but this should take a limited amount of RAM for a log file of virtually unlimited length.
#include <array>
#include <climits>
#include <fstream>
#include <iostream>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence;           // [0] = newest page, [2] = oldest
typedef int sequence_count;

// Simple hash so `sequence` can be used as an unordered_map key.
struct sequence_hash {
    size_t operator()(const sequence& s) const {
        return ((size_t)s[0] * 31 + s[1]) * 31 + s[2];
    }
};

const int num_pages = 1000;   // where 1-1000 inclusive are valid pageids
const int num_passes = 4;

int main() {
    std::unordered_map<userid, sequence> userhistory;
    std::unordered_map<sequence, sequence_count, sequence_hash> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) {   // have to go in four passes
        userhistory.clear();                          // don't carry history across passes
        visits.clear();                               // each pass only counts its own page range
        std::ifstream logfile("log.log");
        pageid minpage = num_pages / num_passes * pass;          // where first page is in a range
        pageid maxpage = num_pages / num_passes * (pass + 1) + 1;
        if (pass == num_passes - 1)                   // if it's the last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) {       // read in line
            sequence& curhistory = userhistory[curuser];   // find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage;                  // push back new page for that user
            // if they visited three pages in a row (oldest page in this pass's range)
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory];   // get times sequence was hit
                ++count;                              // and increase it
                if (count > max_count) {              // if that's a new max
                    max_count = count;                // update the max
                    max_sequence = curhistory;        // std::array, so this is a copy
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}
Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar) and rely on the OS to keep the most used entries cached in memory.
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
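For reference, a small sketch of the counter layout described above: with N pages, the 1-based triple (p1, p2, p3) maps to one slot in a flat N*N*N array of 32-bit counters, which is the 4GB figure when N = 1000, hence the swap/mmap discussion.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumPages = 1000;            // N distinct pages, IDs 1..N

// Map the 1-based triple (p1, p2, p3) to its slot in a flat N*N*N array.
inline std::size_t slot(int p1, int p2, int p3) {
    return (static_cast<std::size_t>(p1 - 1) * kNumPages + (p2 - 1)) * kNumPages + (p3 - 1);
}

void count_visit(std::vector<std::uint32_t>& counters, int p1, int p2, int p3) {
    ++counters[slot(p1, p2, p3)];                  // one 32-bit counter per possible triple
}

// Usage (note the size, ~4 GB for N = 1000):
// std::vector<std::uint32_t> counters(kNumPages * kNumPages * kNumPages);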
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having done that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if you rarely get there. Here are some ideas:
Evict the least commonly used sequences to a file, keeping the most common in memory. You'd need two passes over the table to determine what level is below average, and then to do the eviction. Somehow you'd need to know where you'd put each entry whenever you get a hash miss, which might prove tricky.
Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.
Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
I think you only have to store the most recently seen triple for each userid, right?
So you have two hash tables. The first, keyed by userid and holding the most recently seen triple as its value, has size equal to the number of userids.
EDIT: assumes file sorted by timestamp already.
The second hash table has a key of userid:page-triple, and a value of count of times seen.
I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):
# $1 is userid, $2 is pageid
{
    old = ids[$1];                       # map of userid -> most-recently-seen triple
    split(old, oldarr, "-");
    oldarr[1] = oldarr[2];
    oldarr[2] = oldarr[3];
    oldarr[3] = $2;
    ids[$1] = oldarr[1] "-" oldarr[2] "-" oldarr[3];   # save new most-recently-seen
    tripleid = $1 ":" ids[$1];           # build a triple-id of userid:triple
    if (oldarr[1] != "") {               # don't accumulate incomplete triples
        triples[tripleid]++;             # count this triple-id
    }
}
END {
    max_count = 0;
    for (tid in triples) {
        print tid " " triples[tid];
        if (triples[tid] > max_count) {  # track both the best triple-id and its count
            max_count = triples[tid];
            max_tid = tid;
        }
    }
    print "MAX is->" max_tid " seen " max_count " times";
}
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script. So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One of the ways to do this may be to open two file streams with fstream (one in and one out, or just one in/out stream), but then I run into the difficulty of finding and seeking to the right file position, because seeking depends on the absolute byte offset from the start of the file rather than on line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.
The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the lowest level it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
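For illustration, a minimal C++ sketch of that copy-insert-rename approach (the function name, 1-based line numbering, and append-on-overflow behaviour are assumptions):
#include <cstdio>
#include <fstream>
#include <string>

// Insert `new_line` so that it becomes line number `target_line` (1-based) of
// the file at `path`, by streaming into a temporary file and renaming it back.
bool insert_at_line(const std::string& path, long target_line, const std::string& new_line) {
    std::ifstream in(path);
    std::ofstream out(path + ".tmp");
    if (!in || !out) return false;
    std::string line;
    long lineno = 0;
    bool inserted = false;
    while (std::getline(in, line)) {
        if (++lineno == target_line) { out << new_line << '\n'; inserted = true; }
        out << line << '\n';
    }
    if (!inserted) out << new_line << '\n';        // target beyond end of file: append instead
    in.close();
    out.close();
    std::remove(path.c_str());                     // on Windows, rename won't overwrite
    return std::rename((path + ".tmp").c_str(), path.c_str()) == 0;
}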
A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
Might even be able to do:
cat <file> | sort -k 2,2 > <file>
I haven't tried that route, though (and note that redirecting output to the same file you are reading will normally truncate it before sort ever sees the data, so the two-step version above is safer).
* http://www.ss64.com/bash/sort.html - sort man page
One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB. Each record in the db has the sort keys and the offset into the main file. The advantage of this is that you can have multiple sort orders without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
Try a modified bucket sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of random file I/O you need. Or not.
Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert content into the middle of a file (i.e., without overwriting what was previously there); I'm not aware of any production-level filesystem that supports it.
I think the question is more about implementation rather than specific algorithms, specifically, handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?
Here's how I'd do it:
Parse the source file and extract the following information: the sort key, the offset of the line in the file, and the length of the line. Write this information to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file.
Use a modified merge sort. Recursively divide the index file until the number of elements to sort has reached some minimum amount - a true merge sort recurses down to 1 or 0 elements; I suggest stopping at 1024 or so, which will need fine tuning. Load each block of data from the index file into memory, quicksort it, and write it back to disk.
Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
Read the sorted index file and pull out from the source file the line of data and write to the result file.
It certainly won't be quick with all that file reading and writing, but it should be quite efficient - the real killer is the random seeking into the source file in the final step. Up to that point, the disk access is mostly sequential and should therefore be reasonably efficient.
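As an illustration of step 1 only, here is a hedged sketch that scans the source file once and writes one fixed-size index record (key, byte offset, line length) per line. The 16-byte key field, the choice of the second whitespace-separated column (the date in the earlier example) as the sort key, and the file names are all assumptions.
#include <cstdint>
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>

struct IndexRecord {                               // fixed size, so it is easy to index
    char key[16];                                  // sort key, truncated/padded
    std::uint64_t offset;                          // byte offset of the line in the source
    std::uint32_t length;                          // line length, excluding the newline
};

int main() {
    std::ifstream src("source.txt");
    std::ofstream idx("source.idx", std::ios::binary);
    std::string line;
    std::uint64_t offset = 0;
    while (std::getline(src, line)) {
        std::istringstream fields(line);
        std::string first, key;
        fields >> first >> key;                    // second column (the date) as the key
        IndexRecord rec{};
        std::strncpy(rec.key, key.c_str(), sizeof(rec.key) - 1);
        rec.offset = offset;
        rec.length = static_cast<std::uint32_t>(line.size());
        idx.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
        offset += line.size() + 1;                 // assumes '\n' line endings
    }
}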