MapReduce program

Consider one .txt file in which I have a number of paragraphs separated by a newline character.
Now I need to count the number of words in each paragraph. Take the counted words as the key in
the mapper and assign each an initial value of 1, and in the reducer give me a sorted output.
Please give me complete code for better understanding, because I am a fresher,
and please give me a better explanation of how it counts the number of words in each paragraph.

Having the mapper do the counting alone would not yield the performance that you are trying to achieve through the MapReduce technique.
To really utilise the benefit of MapReduce, you should consider treating the paragraph number
(1 for the 1st paragraph, 2 for the 2nd, and so on) as the key and then sending the paragraphs for individual counting to different reducers running on different nodes (this harnesses the capability of parallel processing). To sort the output, you may feed it into a simple program that does the sorting for you or, if the number of paragraphs is large, into another MapReduce job. In that case, you would need to consider a range of numbers as the key: say, counts (the number of words in a paragraph, obtained from the previous MapReduce job) from 1 to 10 fall into one bucket and map to one key; the individual reducers can then sort these individual buckets, and the results can be collated at the end to get the complete sorted output.
An example implementation of map-reduce can be found at : http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
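Hadoop jobs are usually written in Java, but to stay in one language for the examples on this page, here is a rough C++ sketch of the word-count-as-key idea that could be run through Hadoop Streaming. It assumes each paragraph sits on its own input line (as in the question), and the zero-padding of the key is my own workaround so that the framework's default text sort matches numeric order.

// paragraph_mapper.cpp -- emits "<zero-padded word count>\t1" per paragraph
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string paragraph;
    while (std::getline(std::cin, paragraph)) {
        std::istringstream words(paragraph);
        std::string word;
        long count = 0;
        while (words >> word) ++count;   // whitespace-separated words in this paragraph
        std::cout << std::setw(10) << std::setfill('0') << count << "\t1\n";
    }
}

A matching reducer just sums the 1s for each word count; keys arrive already sorted, so the output is sorted too:

// paragraph_reducer.cpp -- prints "<word count>\t<number of paragraphs with that count>"
#include <iostream>
#include <string>

int main() {
    std::string line, current_key;
    long paragraphs = 0;
    while (std::getline(std::cin, line)) {
        std::string key = line.substr(0, line.find('\t'));   // the zero-padded word count
        if (key != current_key) {
            if (!current_key.empty())
                std::cout << current_key << "\t" << paragraphs << "\n";
            current_key = key;
            paragraphs = 0;
        }
        ++paragraphs;                                        // every value the mapper emits is 1
    }
    if (!current_key.empty())
        std::cout << current_key << "\t" << paragraphs << "\n";
}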

Related

How to efficiently gather the repeating elements in a given array?

I'd like to gather the duplicates in a given array. For example, I have an array like this:
{1,5,3,1,5,6,3}
and i want the result to be:
{3,3,1,1,5,5,6}
In my case, the number of clusters is unknown before the calculation, and the order does not matter.
I achieved this by using the built-in std::sort in C++. However, the ordering is actually not necessary, so I guess there are probably more efficient methods to accomplish it.
Thanks in advance.
First, construct a histogram noting frequencies of each number. You can use a dictionary to accomplish this in O(n) time and space.
Next, loop over the dictionary's keys (order is unimportant here) and for each one, write a number of instances of that key equal to the corresponding value.
Example:
{1,5,3,1,5,6,3} input
{1->2,5->2,3->2,6->1} histogram dictionary
{1,1,5,5,3,3,6} wrote two 1s, two 5s, two 3s, then one 6
This whole thing is O(n) time and space. Certainly you can't do better than O(n) time. Whether you can do better than O(n) space or not while maintaining O(n) time I cannot say.
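A minimal C++ sketch of that histogram approach (the function name and the choice of unordered_map are just for illustration):

#include <iostream>
#include <unordered_map>
#include <vector>

// Gathers equal elements next to each other without fully sorting: O(n) expected time.
std::vector<int> gather_duplicates(const std::vector<int>& input) {
    std::unordered_map<int, int> histogram;     // value -> frequency
    for (int x : input) ++histogram[x];
    std::vector<int> result;
    result.reserve(input.size());
    for (const auto& entry : histogram)         // key order is unspecified, which is fine here
        result.insert(result.end(), entry.second, entry.first);
    return result;
}

int main() {
    for (int x : gather_duplicates({1, 5, 3, 1, 5, 6, 3}))
        std::cout << x << ' ';
    std::cout << '\n';                          // e.g. 1 1 5 5 3 3 6 (group order may vary)
}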

Not sure which data structure to use

Assuming I have the following text:
today was a good day and today was a sunny day.
I break this text up into single words, separated by white space, which gives:
Today
was
a
good
etc.
Now I use the vector data structure to simply count the number of words in the text via .size(). That's done.
However, I also want to check if a word comes up more than once, and if so, how many times. In my example, "today" comes up 2 times.
I want to store that "today" and append a 2/x (depending on how often it comes up in a large text). And that's not just for "today" but for every word in the text. I want to look up how often a word appears, append a counter, and sort it (the word + counters) in descending order (that's another thing, but
not important right now).
I'm not sure which data structure to use here. A map, perhaps? But I can't add counters to a map.
Edit: This is what I've done so far: http://pastebin.com/JncR4kw9
You should use a map. In fact, you should use an unordered_map.
unordered_map<string,int> will give you a hash table which will use strings as keys, and you can augment the integer to keep count.
unordered_map has the advantage of O(1) lookup and insertion over the O(log n) lookup and insertion of a map. This is because the former is a hash table backed by an array of buckets, whereas the latter uses some implementation of a tree (red-black, I think).
The only disadvantage of an unordered_map is that as mentioned in its name, you can't iterate over all the elements in lexical order. This should be clear from the explanation of their structure above. However, you don't seem to need such a traversal, and hence it shouldn't be an issue.
unordered_map<string,int> mymap;
mymap[word]++; // increments the counter that stores how many times this word has been seen
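To tie this in with the sorting you mentioned, here is a small self-contained sketch (assuming C++17 and that the words are read from std::cin) that counts with an unordered_map and then copies the pairs into a vector so they can be sorted by descending count:

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (std::cin >> word) ++counts[word];      // O(1) expected per lookup/insertion

    // Copy into a vector, because the hash map itself cannot be sorted.
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    for (const auto& [w, c] : sorted)
        std::cout << w << " " << c << "\n";       // most frequent word first
}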
Why not use two data structures? The vector you have now, and a map, using the string as the key and an integer as data, which will then be the number of times the word was found in the text.
Sort the vector in alphabetical order.
Scan it and compare every word to those that follow until you find a different one, and so on.
a, a, and, day, day, good, sunny, today, today, was, was
a: 2, and: 1, day: 2, good: 1, sunny: 1, today: 2, was: 2
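A sketch of that sort-and-scan counting, assuming the words are already in a std::vector<std::string>:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words = {"today", "was", "a", "good", "day",
                                      "and", "today", "was", "a", "sunny", "day"};
    std::sort(words.begin(), words.end());                       // equal words are now adjacent

    for (std::size_t i = 0; i < words.size(); ) {
        std::size_t j = i;
        while (j < words.size() && words[j] == words[i]) ++j;    // advance past the run of equal words
        std::cout << words[i] << " " << (j - i) << "\n";         // word and its frequency
        i = j;                                                   // jump to the next distinct word
    }
}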
A better option to consider is a radix tree: https://en.wikipedia.org/wiki/Radix_tree
It is quite memory-efficient, because common prefixes are stored only once, and for large text input it can perform better than the alternative data structures.
One can store a word's frequency in the node at which that word ends. It also reaps the benefit of locality of reference for a typical text document.
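For illustration, here is a sketch of the same store-the-frequency-in-the-node idea using a plain (uncompressed) trie; a real radix tree would additionally merge chains of single-child nodes:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Each node stores how many words end there; children are keyed by the next character.
struct TrieNode {
    std::map<char, TrieNode> children;
    int count = 0;                        // > 0 only where a word ends
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word)
        node = &node->children[c];        // creates the child node on first use
    ++node->count;
}

void print(const TrieNode& node, std::string& prefix) {
    if (node.count > 0)
        std::cout << prefix << " " << node.count << "\n";
    for (const auto& child : node.children) {
        prefix.push_back(child.first);
        print(child.second, prefix);
        prefix.pop_back();
    }
}

int main() {
    TrieNode root;
    std::vector<std::string> words = {"today", "was", "a", "good", "day",
                                      "and", "today", "was", "a", "sunny", "day"};
    for (const auto& w : words)
        insert(root, w);
    std::string prefix;
    print(root, prefix);                  // prints each word with its frequency, in lexical order
}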

Algorithm to prevent duplication of field while copying data from text file to excel

I have over 100 text (.txt) files. The data in these files is comma-separated, for example:
good,bad,fine
The aforementioned items are just an example; the data can contain anything from individual words and email IDs to phone numbers. I have to copy all of these values into a single column of one or more Excel (.xlsx) spreadsheet files.
Now the problem is that I don't want my data to be redundant, and there will be over 10 million data items. I'm planning to implement this in C++, and I'd prefer an efficient algorithm so that I can complete the work quickly.
Separate your task into two steps:
a. getting the list of items into memory, removing duplicates;
b. putting them into Excel.
a: Use a sorted tree to gather all items; that is how you will find the duplicates fast.
b: Once done with the list, I would write everything to a simple file and import it into Excel, rather than try to drive the Excel API from C++.
After your comment: if a memory problem arises, you might want to create one tree per first letter and use files to store each list, so you will not run out of memory.
It is less efficient, but with today's computing power you won't feel it.
The main idea here is to find out quickly whether you already have this word or not and, if not, to add it to the list. Searching a sorted tree should do the trick. If you want to avoid the worst-case scenario, there is the AVL tree, if I recall correctly; that tree stays balanced no matter the order of the inserts, but it is harder to code.
The quickest way to implement would be to simply load all the values (as std::strings) into a std::set. The interface is dead simple. It won't be the fastest-running way, but you never know.
In particular unordered_set might be faster if you have it available, but will be harder to use because the large number of strings will cause rehashing if you don't prepare properly.
As you read a string, insert it into the set: if(my_set.insert("the string").second) .... That condition will be true for new values only.
10 million is a lot but if each string is around 20 bytes or so, you may get away with a simplistic approach like this.
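A sketch of that approach, assuming the .txt files are passed on the command line and the deduplicated column is written to a plain text file that Excel can then import (the file name is a placeholder):

#include <fstream>
#include <set>
#include <sstream>
#include <string>

int main(int argc, char* argv[]) {
    std::set<std::string> seen;
    std::ofstream out("column.csv");                  // one deduplicated value per line
    for (int i = 1; i < argc; ++i) {                  // each argument is one input .txt file
        std::ifstream in(argv[i]);
        std::string line, item;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            while (std::getline(fields, item, ',')) { // split the comma-separated values
                if (seen.insert(item).second)         // true only the first time a value is seen
                    out << item << "\n";
            }
        }
    }
}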

Searching multiple words in many blocks of data

I have to search for about 100 words in blocks of data (approximately 20,000 blocks), and each block consists of about 20 words. The blocks should be returned in decreasing order of the number of matches. The brute-force technique is very cumbersome, because you have to search for all 100 words one by one and then combine the numbers of matches in a complicated manner. Is there any other algorithm that allows searching for multiple words at the same time and storing the number of matching words?
Thank you
You can use the Aho-Corasick algorithm to search for all 100 words at a time. There are several implementations available here in SO and on github.
Why not consider using multithreading to compute the result? Make an array whose size equals the number of blocks; each thread counts the matches in one block and writes the result to the corresponding entry in the array. Afterwards, you sort the array in decreasing order to get the result.
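A sketch of that per-block counting with std::thread, assuming the blocks and query words are already in memory (the sample data is made up; compile with -pthread). Each thread takes every n-th block, so no two threads write to the same score slot:

#include <algorithm>
#include <iostream>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Counts how many words of one block appear in the query set.
static int count_matches(const std::vector<std::string>& block,
                         const std::unordered_set<std::string>& queries) {
    int matches = 0;
    for (const auto& w : block)
        if (queries.count(w)) ++matches;
    return matches;
}

int main() {
    std::vector<std::vector<std::string>> blocks =                 // ~20,000 blocks of ~20 words
        {{"alpha", "beta"}, {"beta", "gamma", "delta"}, {"epsilon"}};
    std::unordered_set<std::string> queries = {"beta", "delta"};   // the ~100 search words

    std::vector<int> score(blocks.size());
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t b = t; b < blocks.size(); b += n_threads)
                score[b] = count_matches(blocks[b], queries);      // each thread owns distinct slots
        });
    for (auto& th : pool) th.join();

    // Sort block indices by decreasing match count.
    std::vector<std::size_t> order(blocks.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    for (auto i : order)
        std::cout << "block " << i << ": " << score[i] << " matches\n";
}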

Fast in-string search

I have a problem that I am looking for some guidance to solve in the most efficient way. I have 200 million strings of data ranging in size from 3 to 70 characters. The strings consist of letters, numbers, and several special characters such as dashes and underscores. I need to be able to quickly search for an entire string or for any substring within a string (the minimum substring size is 3). Quickly is defined here as less than 1 second.
As my first cut at this I did the following:
Created 38 index files. An index contains all the substrings that start with a particular letter. The first 4mb contains 1 million hash buckets (start of the hash chains). The rest of the index contains the linked list chains from the hash buckets. My hashing is very evenly distributed. The 1 million hash buckets are kept in RAM and mirrored to disk.
When a string is added to the index, it is broken down into its non-duplicate (within itself) substrings of 3 to n characters (where n is the length of the string minus 1). So, for example, "apples" is stored in the "A" index as pples,pple,ppl,pp (substrings are also stored in the "L" and "P" indexes).
The search/add server runs as a daemon (in C++) and works like a champ. Typical search times are less than 1/2 second.
The problem is on the front end of the process. I typically add 30,000 keys at a time. This part of the process takes forever. By way of benchmark, the load time into an empty index of 180,000 variable length keys is approximately 3 1/2 hours.
This scheme works except for the very long load times.
Before I go nuts optimizing (or trying to), I'm wondering whether or not there is a better way to solve this problem. Front and back wildcard searches (i.e., string LIKE '%ppl%') in a DBMS are amazingly slow for datasets this large (on the order of hours in MySQL, for example), so it would seem that DBMS solutions are out of the question. I can't use full-text search because we are not dealing with normal words, but with strings that may or may not be composed of real words.
From your description, loading the data takes all that time because you're dealing with I/O, mirroring the inflated strings to the hard disk. This will definitely be a bottleneck, mainly depending on the way you read and write data to the disk.
A possible improvement in execution time may be achieved using mmap with some LRU policy. I'm quite sure the idea of replicating the data is to make the search faster, but since you're using -- as it seems -- only one machine, your bottleneck will shift from in-memory searching to I/O requests.
Another solution, which you may not be interested in -- it's sickly funny and disturbing as well (: -- is to split the data among multiple machines. Considering the way you've structured the data, the implementation itself may take a bit of time, but it would be very straightforward. You'd have:
each machine becomes responsible for a set of buckets, chosen using something close to hash_id(bucket) % num_machines;
insertions are performed locally on each machine;
searches may either be interfaced through some kind of query application, or simply batched into sets of queries -- if the application is not interactive;
searches may even have a distributed interface, considering you may start a request from one node and forward requests to another node (again batching requests, to avoid excessive I/O overhead).
Another good point is that, as you said, the data is evenly distributed -- ALREADY \o/; this is usually one of the trickiest parts of a distributed implementation. Besides, this would be highly scalable, as you may add another machine whenever the data grows in size.
Instead of doing everything in one pass, solve the problem in 38 passes.
Read each of the 180,000 strings. Find the "A"s in each string, and write out entries only to the "A" hash table. After you are done, write the entire finished "A" hash table out to disk. (Have enough RAM to store the entire "A" hash table in memory -- if you don't, make smaller hash tables. I.e., have 38^2 hash tables keyed on pairs of starting letters, for 1444 different tables. You could even dynamically change how many letters the hash tables are keyed on based on how common a prefix is, so that they are all of modest size. Keeping track of how long such prefixes are isn't expensive.)
Then read each of the 180,000 strings, looking for "B". Etc.
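To make the idea concrete, here is a rough sketch of one such pass-per-starting-character loop; the 38-character alphabet, file names, and on-disk format are placeholders rather than your real index layout:

#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // On each pass only the substrings that start with pass_char are collected,
    // so only one (smaller) table has to live in RAM at a time.
    const std::string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-_";
    for (char pass_char : alphabet) {
        std::unordered_map<std::string, std::vector<int>> table;  // substring -> ids of strings containing it
        std::ifstream keys("keys.txt");                           // one key per line
        std::string key;
        int id = 0;
        while (std::getline(keys, key)) {
            for (std::size_t start = 0; start < key.size(); ++start) {
                if (key[start] != pass_char) continue;            // wrong pass: skip this substring start
                for (std::size_t len = 3; start + len <= key.size(); ++len) {
                    auto& ids = table[key.substr(start, len)];
                    if (ids.empty() || ids.back() != id)          // skip duplicates within one key
                        ids.push_back(id);
                }
            }
            ++id;
        }
        std::ofstream out(std::string("index_") + pass_char + ".txt");
        for (const auto& entry : table) {
            out << entry.first;
            for (int i : entry.second) out << ' ' << i;
            out << '\n';
        }
    }   // the table is destroyed here, freeing the memory before the next pass
}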
My theory is that you are going slower than you could because your massive tables are thrashing your cache.
The next thing that might help is to limit how long the strings you hash are, in order to shrink the size of your tables.
Instead of generating all 2278 substrings of length 3 to 70 from a string of length 70, if you limited the hashed length to 10 characters there would be only 508 substrings of length 3 to 10, and there may not be that many collisions on strings longer than 10. You could, again, make the hash length dynamic -- the length-X hash might have a flag saying "try a length X+Y hash if your string is longer than X; this one is too common", and otherwise simply terminate the hashing. That could reduce the amount of data in your tables, at the cost of slower lookups in some cases.
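A sketch of that capped substring generation, assuming a length limit of 10 as suggested:

#include <iostream>
#include <string>
#include <unordered_set>

// Returns the distinct substrings of s with lengths from 3 up to max_len,
// the cap that keeps the index from exploding on long strings.
std::unordered_set<std::string> index_substrings(const std::string& s, std::size_t max_len = 10) {
    std::unordered_set<std::string> subs;
    for (std::size_t start = 0; start < s.size(); ++start)
        for (std::size_t len = 3; len <= max_len && start + len <= s.size(); ++len)
            subs.insert(s.substr(start, len));
    return subs;
}

int main() {
    for (const auto& sub : index_substrings("apples"))
        std::cout << sub << "\n";                 // e.g. app, appl, apple, apples, ppl, pple, ...
}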