I have to search for about 100 words in blocks of data (roughly 20,000 blocks), where each block consists of about 20 words. The blocks should be returned in decreasing order of the number of matches. The brute-force technique is very cumbersome because you have to search for each of the 100 words one by one and then combine the match counts in a complicated way. Is there another algorithm that allows searching for multiple words at the same time while keeping track of the number of matching words?
Thank you
You can use the Aho-Corasick algorithm to search for all 100 words at once. There are several implementations available here on SO and on GitHub.
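For reference, here is a compact sketch of such an automaton, assuming the patterns and the text are lowercase a-z (the struct and method names are just for illustration). countMatches returns the total number of pattern occurrences in a text, so you would run it once per block and rank the blocks by the returned count.

#include <array>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    struct Node {
        std::array<int, 26> next;
        int fail = 0;
        int out = 0;                               // patterns ending at this state (plus suffix links after build)
        Node() { next.fill(-1); }
    };
    std::vector<Node> t{Node{}};                   // node 0 is the root

    void add(const std::string& word) {            // insert one pattern into the trie
        int v = 0;
        for (char c : word) {
            int i = c - 'a';
            if (t[v].next[i] < 0) { t[v].next[i] = (int)t.size(); t.emplace_back(); }
            v = t[v].next[i];
        }
        t[v].out++;
    }

    void build() {                                 // BFS fills failure links and the goto function
        std::queue<int> q;
        for (int i = 0; i < 26; ++i) {
            int u = t[0].next[i];
            if (u < 0) t[0].next[i] = 0;
            else { t[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            t[v].out += t[t[v].fail].out;          // accumulate matches reachable via suffix links
            for (int i = 0; i < 26; ++i) {
                int u = t[v].next[i];
                if (u < 0) t[v].next[i] = t[t[v].fail].next[i];
                else { t[u].fail = t[t[v].fail].next[i]; q.push(u); }
            }
        }
    }

    // Total number of pattern occurrences in `text` (call build() once before this).
    long long countMatches(const std::string& text) const {
        long long hits = 0;
        int v = 0;
        for (char c : text) {
            v = t[v].next[c - 'a'];
            hits += t[v].out;
        }
        return hits;
    }
};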
Why not consider using multithreading to compute the results? Make an array whose size equals the number of blocks; each thread counts the matches for one block and writes the result to the corresponding entry in the array. Afterwards you sort the array in decreasing order and you have the result.
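A minimal sketch of that idea, assuming each block is already tokenized into words and the ~100 search words fit in a hash set. Instead of one thread per block it splits the blocks into a handful of async tasks; all names are just for illustration.

#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <string>
#include <unordered_set>
#include <vector>

using Block = std::vector<std::string>;            // one block = ~20 words

// Count how many of the query words occur in one block.
static int countBlockMatches(const Block& block, const std::unordered_set<std::string>& queries) {
    int hits = 0;
    for (const auto& w : block)
        if (queries.count(w)) ++hits;
    return hits;
}

// Returns block indices sorted by decreasing match count.
std::vector<int> rankBlocks(const std::vector<Block>& blocks,
                            const std::unordered_set<std::string>& queries) {
    std::vector<int> counts(blocks.size());
    const std::size_t chunks = 8;                  // assumed number of worker tasks
    const std::size_t step = (blocks.size() + chunks - 1) / chunks;
    std::vector<std::future<void>> tasks;
    for (std::size_t begin = 0; begin < blocks.size(); begin += step) {
        std::size_t end = std::min(begin + step, blocks.size());
        // Each task writes only its own slots of `counts`, so no locking is needed.
        tasks.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                counts[i] = countBlockMatches(blocks[i], queries);
        }));
    }
    for (auto& t : tasks) t.get();

    std::vector<int> order(blocks.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return counts[a] > counts[b]; });
    return order;
}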
I'd like to gather the duplicates in a given array. For example, I have an array like this:
{1,5,3,1,5,6,3}
and i want the result to be:
{3,3,1,1,5,5,6}
In my case, the number of clusters is unknown before the calculation, and the order does not matter.
I achieved this by using the built-in sort function in C++. However, the ordering is not actually necessary, so I suspect there are more efficient ways to accomplish this.
Thanks in advance.
First, construct a histogram noting frequencies of each number. You can use a dictionary to accomplish this in O(n) time and space.
Next, loop over the dictionary's keys (order is unimportant here) and for each one, write a number of instances of that key equal to the corresponding value.
Example:
{1,5,3,1,5,6,3} input
{1->2,5->2,3->2,6->1} histogram dictionary
{1,1,5,5,3,3,6} wrote two 1s, two 5s, two 3s, then one 6
This whole thing is O(n) time and space. Certainly you can't do better than O(n) time. Whether you can do better than O(n) space or not while maintaining O(n) time I cannot say.
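A minimal sketch of this in C++, using std::unordered_map as the dictionary (the function name is just for illustration):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Group equal values together in O(n) expected time using a hash-map histogram.
// The order of the groups is unspecified, which is fine since ordering is not required.
std::vector<int> groupDuplicates(const std::vector<int>& input) {
    std::unordered_map<int, int> freq;
    for (int x : input) ++freq[x];                 // build the histogram

    std::vector<int> out;
    out.reserve(input.size());
    for (const auto& [value, count] : freq)        // replay each value `count` times
        out.insert(out.end(), count, value);
    return out;
}

int main() {
    for (int x : groupDuplicates({1, 5, 3, 1, 5, 6, 3}))
        std::cout << x << ' ';                     // e.g. "1 1 5 5 3 3 6" (group order may vary)
}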
In C++, is it possible to sort 1 million numbers, assuming we know their range, using only 100,000 memory cells?
Specifically, a .bin file contains a million numbers within a given range, and I need to sort these numbers in descending order into another .bin file, but I am only allowed to use an array of size 100,000 for the sorting. Any ideas?
I think I read about this somewhere on SO or Quora, in a discussion of map-reduce:
Divide the 1 million numbers into 10 blocks. Read in the first block of 100k numbers, sort it using quicksort, then write it back to the original file. Do the same for the remaining 9 blocks. Then perform a 10-way merge of the 10 sorted blocks in the original file (you only need 10 cells for this) and write the merged output to another file. You can write into a ~100k buffer and flush it to the output file for faster writes.
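A sketch of the merge step, assuming for simplicity that the ten sorted runs were written to separate files run0.bin .. run9.bin rather than back into the original file (file names are placeholders and error handling is omitted). It merges in ascending order; for the descending order the question asks for, sort the runs descending and use a max-heap instead of the min-heap.

#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    const int kRuns = 10;
    const std::size_t kBufSize = 100000;           // ~100k output buffer for faster writes

    std::vector<std::FILE*> runs(kRuns);
    using Item = std::pair<int, int>;              // (value, run index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;  // min-heap

    // Prime the heap with the first value of each run.
    for (int i = 0; i < kRuns; ++i) {
        char name[32];
        std::snprintf(name, sizeof name, "run%d.bin", i);
        runs[i] = std::fopen(name, "rb");
        int v;
        if (std::fread(&v, sizeof v, 1, runs[i]) == 1)
            heap.push({v, i});
    }

    std::FILE* out = std::fopen("sorted.bin", "wb");
    std::vector<int> buffer;
    buffer.reserve(kBufSize);
    while (!heap.empty()) {
        auto [v, i] = heap.top();                  // smallest value among the run heads
        heap.pop();
        buffer.push_back(v);
        if (buffer.size() == kBufSize) {           // flush the output buffer
            std::fwrite(buffer.data(), sizeof(int), buffer.size(), out);
            buffer.clear();
        }
        int next;
        if (std::fread(&next, sizeof next, 1, runs[i]) == 1)
            heap.push({next, i});                  // refill from the run we just consumed
    }
    std::fwrite(buffer.data(), sizeof(int), buffer.size(), out);
    std::fclose(out);
    for (std::FILE* f : runs) std::fclose(f);
}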
Assuming that the range of numbers has 100,000 values or fewer, you can use Counting Sort.
The idea is to use memory cells as counts for the numbers in the range. For example, if the range is 0..99999, inclusive, make an array int count[100000], and run through the file incrementing the counts:
count[itemFromFile]++;
Once you have gone through the whole file, go through the range again. For each count[x] that is not zero, output x the corresponding number of times. The result will be the original data sorted in ascending order (go through the range in reverse for descending order).
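A minimal sketch of this, assuming the values fall in the range 0..99999 and the .bin files hold raw int values (file names are placeholders and error handling is omitted). Iterating the counts from high to low produces the descending order the question asks for.

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> count(100000, 0);             // the only large allocation: 100k cells

    std::FILE* in = std::fopen("input.bin", "rb");
    int x;
    while (std::fread(&x, sizeof x, 1, in) == 1)
        ++count[x];                                // tally each value
    std::fclose(in);

    std::FILE* out = std::fopen("sorted.bin", "wb");
    for (int v = 99999; v >= 0; --v)               // high to low: descending output
        for (int k = 0; k < count[v]; ++k)
            std::fwrite(&v, sizeof v, 1, out);
    std::fclose(out);
}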
You can implement a version of the quicksort algorithm that works on files instead of on vectors.
So, recursively split the file into a lower-than-pivot part and a higher-than-pivot part, sort those files, and recombine them. When the size gets below the available memory, just switch from files to working in memory.
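A rough sketch of that idea, assuming raw-int .bin files and at most 100,000 ints in memory at a time. A three-way split around the pivot (higher / equal / lower) is used so that many repeated values cannot cause infinite recursion; temp file naming is ad hoc and error handling is omitted.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

const std::size_t kMemLimit = 100000;              // at most this many ints in memory at once
static int tmpId = 0;                              // for generating unique temp file names

// Append the contents of `src` to the already-open file `dst`, then delete `src`.
static void appendAndRemove(std::FILE* dst, const std::string& src) {
    std::FILE* in = std::fopen(src.c_str(), "rb");
    int v;
    while (std::fread(&v, sizeof v, 1, in) == 1)
        std::fwrite(&v, sizeof v, 1, dst);
    std::fclose(in);
    std::remove(src.c_str());
}

// Sort the ints stored in `path` in descending order, writing the result back to `path`.
static void quicksortFile(const std::string& path) {
    std::FILE* in = std::fopen(path.c_str(), "rb");
    std::size_t n = 0;
    int v;
    while (std::fread(&v, sizeof v, 1, in) == 1) ++n;   // count the elements first

    if (n <= kMemLimit) {                          // small enough: sort in memory
        std::rewind(in);
        std::vector<int> buf(n);
        std::fread(buf.data(), sizeof(int), n, in);
        std::fclose(in);
        std::sort(buf.begin(), buf.end(), std::greater<int>());
        std::FILE* out = std::fopen(path.c_str(), "wb");
        std::fwrite(buf.data(), sizeof(int), buf.size(), out);
        std::fclose(out);
        return;
    }

    std::rewind(in);
    std::fread(&v, sizeof v, 1, in);               // first value as pivot
    const int pivot = v;

    std::string hi = "tmp_hi_" + std::to_string(tmpId);
    std::string eq = "tmp_eq_" + std::to_string(tmpId);
    std::string lo = "tmp_lo_" + std::to_string(tmpId);
    ++tmpId;
    std::FILE* fhi = std::fopen(hi.c_str(), "wb");
    std::FILE* feq = std::fopen(eq.c_str(), "wb");
    std::FILE* flo = std::fopen(lo.c_str(), "wb");

    std::rewind(in);                               // three-way partition pass
    while (std::fread(&v, sizeof v, 1, in) == 1) {
        std::FILE* dst = v > pivot ? fhi : (v < pivot ? flo : feq);
        std::fwrite(&v, sizeof v, 1, dst);
    }
    std::fclose(in);
    std::fclose(fhi);
    std::fclose(feq);
    std::fclose(flo);

    quicksortFile(hi);                             // recurse on the strict partitions
    quicksortFile(lo);

    std::FILE* out = std::fopen(path.c_str(), "wb");    // descending: higher, equal, lower
    appendAndRemove(out, hi);
    appendAndRemove(out, eq);
    appendAndRemove(out, lo);
    std::fclose(out);
}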
I have a problem that I am looking for some guidance to solve in the most efficient way. I have 200 million strings of data ranging in size from 3 characters to 70 characters. The strings consist of letters, numbers, and several special characters such as dashes and underscores. I need to be able to quickly search for an entire string or any substring of a string (minimum substring size is 3). Quickly is defined here as less than 1 second.
As my first cut at this I did the following:
Created 38 index files. An index contains all the substrings that start with a particular letter. The first 4mb contains 1 million hash buckets (start of the hash chains). The rest of the index contains the linked list chains from the hash buckets. My hashing is very evenly distributed. The 1 million hash buckets are kept in RAM and mirrored to disk.
When a string is added to the index it is broken down into its non-duplicate (within itself) 3-to-n character substrings (where n is the length of the string minus 1). So, for example, "apples" is stored in the "A" index as pples, pple, ppl, pp (substrings are also stored in the "L" and "P" indexes).
The search/add server runs as a daemon (in C++) and works like a champ. Typical search times are less than 1/2 second.
The problem is on the front end of the process. I typically add 30,000 keys at a time. This part of the process takes forever. By way of benchmark, the load time into an empty index for 180,000 variable-length keys is approximately 3 1/2 hours.
This scheme works except for the very long load times.
Before I go nuts optimizing (or trying to), I'm wondering whether or not there is a better way to solve this problem. Front and back wildcard searches (i.e., string LIKE '%ppl%' in a DBMS) are amazingly slow for datasets this large (on the order of hours in MySQL, for example), so it would seem that DBMS solutions are out of the question. I can't use full-text search because we are not dealing with normal words, but with strings that may or may not be composed of real words.
From your description, the loading of data takes all that time because you're dealing with I/O, mirroring the inflated strings to hard disk. This will definitely be a bottleneck, mainly depending on the way you read and write data to the disk.
A possible improvement in execution time may be achieved by using mmap with some LRU policy. I'm quite sure the idea of replicating data is to make the search faster, but since you're using -- as it seems -- only one machine, your bottleneck will shift from in-memory searching to I/O requests.
Another solution, which you may not be interested in -- it's sickly funny and disturbing as well (: -- is to split the data among multiple machines. Considering the way you've structured the data, the implementation itself may take a bit of time, but it would be very straightforward. You'd have:
each machine responsible for a set of buckets, chosen using something close to hash_id(bucket) % num_machines;
insertions performed locally on each machine;
searches either interfaced by some kind of query application, or simply clustered into sets of queries -- if the application is not interactive;
searches may even have a distributed interface, considering you may start a request from one node and forward requests to another node (again clustering requests, to avoid excessive I/O overhead).
Another good point is that, as you said, the data is evenly distributed -- ALREADY \o/; this is usually one of the pickiest parts of a distributed implementation. Besides, this would be highly scalable, as you may add another machine whenever the data grows in size.
Instead of doing everything in one pass, solve the problem in 38 passes.
Read each of the 180,000 strings. Find the "A"s in each string, and write out stuff only to the "A" hash table. After you are done, write the entire finished result of the "A" hash table out to disk. (This assumes you have enough RAM to store the entire "A" hash table in memory -- if you don't, make smaller hash tables. I.e., have 38^2 hash tables keyed on pairs of starting letters, giving 1444 different tables. You could even dynamically change how many letters the hash tables are keyed off of, based on how common a prefix is, so they are all of modest size. Keeping track of how long such prefixes are isn't expensive.)
Then read each of the 180,000 strings, looking for "B". Etc.
My theory is that you are going slower than you could because your massive tables are thrashing your cache.
The next thing that might help is to limit the length of the substrings you hash, in order to shrink the size of your tables.
Instead of doing all 2278 substrings of length 3 to 70 of a string of length 70, if you limited the hashed length to 10 characters there would be only 508 substrings of length 3 to 10. And there may not be that many collisions on strings longer than 10. You could, again, make the hash length dynamic -- the length-X hash might have a flag saying "if your string is longer than X, this is too common, try a length X+Y hash", and otherwise simply terminate the hashing. That could reduce the amount of data in your tables, at the cost of slower lookups in some cases.
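Roughly what one such pass might look like, as a heavily simplified sketch: keys stands for the strings being loaded, the alphabet and the substring limits are illustrative, and writeTableToDisk is a hypothetical stand-in for whatever on-disk index format the real system uses.

#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real on-disk index format.
static void writeTableToDisk(char letter,
                             const std::unordered_map<std::string, std::vector<std::size_t>>& table) {
    std::ofstream out(std::string("index_") + letter + ".txt");
    for (const auto& [sub, ids] : table) {
        out << sub;
        for (std::size_t id : ids) out << ' ' << id;
        out << '\n';
    }
}

// One pass per letter: only that letter's table is held in RAM at a time.
void buildIndexes(const std::vector<std::string>& keys, const std::string& alphabet) {
    for (char letter : alphabet) {                 // e.g. "abc...xyz0123456789-_"
        std::unordered_map<std::string, std::vector<std::size_t>> table;
        for (std::size_t id = 0; id < keys.size(); ++id) {
            const std::string& s = keys[id];
            for (std::size_t start = 0; start < s.size(); ++start) {
                if (s[start] != letter) continue;  // this pass only indexes one letter
                for (std::size_t len = 3; start + len <= s.size(); ++len)
                    table[s.substr(start, len)].push_back(id);
            }
        }
        writeTableToDisk(letter, table);           // flush, then move on to the next letter
    }
}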
Consider a .txt file in which I have a number of paragraphs separated by a newline character.
Now I need to count the number of words in each paragraph. Consider the counted words as a key in
the mapper and initially assign a value of 1 to all of them, and in the reducer give me a sorted output.
Please give me complete code for better understanding, because I am a fresher.
And please give me a better explanation of how it counts the number of words in each paragraph.
Having the mapper do the counting would not give you the performance that you are trying to achieve through the map-reduce technique.
To really get the benefit of map-reduce, you should consider treating the paragraph number
(1 for the 1st paragraph, 2 for the 2nd, and so on) as the key and sending the paragraphs for individual counting to different reducers running on different nodes (which harnesses the capability of parallel processing). To sort the output, you may then feed it into a simple program that does the sorting for you or, if the number of paragraphs is large, into another map-reduce job. In that case, you would need to consider ranges of counts as the keys for that job: say counts (the number of words in a paragraph, obtained from the previous map-reduce job) from 1 to 10 fall into one bucket and are mapped to one key. The individual reducers can then work on these individual buckets to sort them, and the results can be collated at the end to get the complete sorted output.
An example implementation of map-reduce can be found at : http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
I have an interval that is partitioned into a large number of smaller partitions.
There are no gaps and there are no overlapping intervals.
E.g: (0;600) is separated into:
(0;10>
(10;25>
(25;100>
(100;125>
(125;550>
(550;600)
Now I have a large number of values and I need to get the partition id for each of them.
I can store an array of the values that partition this interval into smaller intervals.
But if all the values belong to the last partition, every lookup will have to pass through the whole array.
So I'm searching for a better way to store these intervals. I want something simple -- about 150 lines of code at most -- and I don't want to use any library except std.
Since there are no "empty spaces" in your partitioning, the end of each partition is redundant (it's the same as the start of the next partition).
And since you have the partition list sorted, you can simply use binary search, with std::upper_bound.
See it in action.
Edit: Correction (upper_bound, not lower_bound).
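A minimal sketch of that lookup, assuming you keep only the partition start points (the ends are redundant) and treat each partition as half-open, [start, next_start); if your partitions are closed on the right instead, use std::lower_bound on the same data.

#include <algorithm>
#include <cstddef>
#include <vector>

// starts = {0, 10, 25, 100, 125, 550} for the example in the question.
// Returns the 0-based id of the partition containing `value` (value assumed to be in range).
std::size_t partitionId(const std::vector<int>& starts, int value) {
    auto it = std::upper_bound(starts.begin(), starts.end(), value);  // first start > value
    return static_cast<std::size_t>(it - starts.begin()) - 1;         // index of the last start <= value
}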
You could just improve your search algorithm.
Put all the ranges in an array, and then use a binary search to find the right range.
It will cost O(log n), and it's really easy to implement.