MapReduce - Emitting the top 20% occurring words in the document - mapreduce

I was reading about MapReduce here, and the first example they give is counting the number of occurrences of each word in the document. I was wondering: suppose you wanted to get the top 20% most frequently occurring words in the document, how can you achieve that? It seems unnatural, since each node in the cluster cannot see all the files, just the list of all occurrences for a single word.
Is there a way to achieve that?

Yes, you certainly can achieve this: by forcing Hadoop to have just a single reducer (though with this approach you lose the advantage of distributed computing per se).
This can be done as follows:
// Configuring mapred to have just one reducer
conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1);
conf.setInt("mapred.reduce.tasks", 1);
Now, since you have just one reducer, you can keep track of the top 20% and emit them in run() or cleanup() of the reducer. See here for more.
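For illustration, a minimal sketch of what such a reducer could look like, assuming the standard word-count mapper feeds it Text/IntWritable pairs (the class name and the in-memory map are my own, not from the original answer):
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: buffer all word counts, then emit only the top 20% in cleanup().
public class TopWordsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        counts.put(key.toString(), sum); // buffer instead of writing immediately
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Sort words by descending count and emit only the top 20% of them.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        int limit = Math.max(1, (int) (sorted.size() * 0.2));
        for (int i = 0; i < limit; i++) {
            context.write(new Text(sorted.get(i).getKey()),
                          new IntWritable(sorted.get(i).getValue()));
        }
    }
}
Note that this buffers every distinct word in the reducer's memory, which is only reasonable while the vocabulary fits on one machine; that is the same constraint the single-reducer approach already implies.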

Related

Map-Reduce with a wait

The concept of map-reduce is very familiar. It seems like a great fit for a problem I'm trying to solve, but either it's missing something or I lack enough understanding of the concept.
I have a stream of items, structured as follows:
{
  "jobId": 777,
  "numberOfParts": 5,
  "data": "some data..."
}
I want to do a map-reduce on many such items.
My mapping operation is straightforward - take the jobId.
My reduce operation is irrelevant for this phase; all we need to know is that it takes multiple strings (the "some data..." parts) and somehow reduces them to a single object.
The only problem is - I need all five parts of this job to complete before I can reduce all the strings into a single object. Every item has a "numberOfParts" property which indicates the number of items I must have before I apply the reduce operation. The items are not ordered, therefore I don't have a "partId" field.
Long story short - I need to apply some kind of waiting mechanism that waits for all parts of the job to complete before initiating the reduce operation, and I need this waiting mechanism to rely on a value that exists within the payload (therefore solutions like Kafka wouldn't work).
Is there a way to do that, hopefully using a single tool/framework?
I only want to write the map/reduce part and the "waiting" logic, the rest I believe should come out of the box.
**** EDIT ****
I'm currently in the design phase of the project and therefore not using any framework (such as Spark, Hadoop, etc.).
I asked this because I wanted to find out the best way to tackle this problem.
"Waiting" is not the correct approach.
Assuming your jobId is the key and data contains some number of parts (zero or more), then you need multiple reducers: one that gathers all parts of the same job, and another that processes only those jobs whose collection of parts has reached numberOfParts, while ignoring the others.
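As a rough, framework-agnostic sketch of that idea (the Item record and method names below are hypothetical, just to make the two stages concrete), the first stage only groups parts by jobId, and the second stage reduces a job only once enough parts have arrived:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical item shape mirroring the JSON in the question.
record Item(long jobId, int numberOfParts, String data) {}

public class CompleteJobReducer {

    // Stage 1: gather the parts of each job under its jobId.
    static Map<Long, List<Item>> gatherByJob(List<Item> items) {
        Map<Long, List<Item>> byJob = new HashMap<>();
        for (Item item : items) {
            byJob.computeIfAbsent(item.jobId(), id -> new ArrayList<>()).add(item);
        }
        return byJob;
    }

    // Stage 2: reduce only the jobs whose part count has reached numberOfParts;
    // incomplete jobs are simply ignored until more parts arrive.
    static Map<Long, String> reduceCompleteJobs(Map<Long, List<Item>> byJob) {
        Map<Long, String> results = new HashMap<>();
        for (Map.Entry<Long, List<Item>> entry : byJob.entrySet()) {
            List<Item> parts = entry.getValue();
            if (parts.size() >= parts.get(0).numberOfParts()) {
                // Placeholder reduce: concatenate the data strings into one result.
                StringBuilder combined = new StringBuilder();
                for (Item part : parts) {
                    combined.append(part.data());
                }
                results.put(entry.getKey(), combined.toString());
            }
        }
        return results;
    }
}
In a real system the gathered state would live in whatever store or framework you choose, but the completeness check itself stays this simple and needs no waiting.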

Cloud Spanner: Implication of split being 'too big'

The documentation states that one split should not be bigger than 'a few GB'.
Is there a hard limit on that, where Cloud Spanner will stop storing more data in one split?
Nothing can be found in the limits-section here: https://cloud.google.com/spanner/quotas
What is the implication of splits growing to, e.g., 20-30 GB?
I can think of problems when those splits need to be moved around between instances while being read/written.
I know the second point sounds like we should split up our primary key / add a sharding key as the first part of the primary key.
But if you have hundreds of customers with really big product catalogs, and you need to interleave the brand and category tables so you can join on them, the alternative approaches of storing one product catalog in several splits become very slow on secondary-index queries (like: query all active products in a catalog).
Thanks a lot in advance; this would help us a lot in understanding Cloud Spanner better for our planned production use.
Christian Gintenreiter
A split can only be served by a single node, so very large splits may cause that single node to become a performance bottleneck. You may start to see performance degradation with a split size greater than 2 GB. The hard limit on split size is bound by the storage limit for a single node, which is 2 TB.
Can you please provide some more details about your schema and interleaving?

Multiple MAP functions in pyspark

Hi,
I'm new to pyspark and I'm going to implement DBSCAN using the MapReduce technique explained in https://github.com/mraad/dbscan-spark, but I don't understand something.
Obviously, if we have multiple computers, then we assign each cell to a MAP, and as explained in the link, after calling REDUCE we find out the contents of each cell's epsilon neighborhood. But on a single computer, how can we run and assign MAPs to cells?
How do we define multiple maps on a single computer (pyspark) and assign them to cells?
I wrote fishnet(cell, eps), which returns point locations according to a cell's epsilon neighborhood.
I want to pass it to each MAP, but I don't know how to do that in pyspark.
Something like (if we have 4 cells):
map1(fishnet) map2(fishnet) map3(fishnet) map4(fishnet)
I would appreciate any solution.
It's the job of Spark / MapReduce to distribute the mappers to different workers. Don't mess with that part, let Spark decide where to invoke the actual mappers.
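To make that concrete, here is a rough sketch of the pattern using Spark's Java API (in pyspark it is essentially one line, sc.parallelize(cells).map(lambda c: fishnet(c, eps))); the cell type, the epsilon value, and the fishnet body are placeholders for the asker's own code:
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FishnetExample {

    // Placeholder for the asker's fishnet(cell, eps) function.
    static String fishnet(String cell, double eps) {
        return cell + " -> neighbors within " + eps;
    }

    public static void main(String[] args) {
        // local[*] runs Spark on a single machine using all cores; Spark still splits
        // the RDD into partitions and schedules one task (one "map") per partition.
        SparkConf conf = new SparkConf().setAppName("dbscan-cells").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            double eps = 0.5;                                             // placeholder epsilon
            List<String> cells = Arrays.asList("c1", "c2", "c3", "c4");   // placeholder cell ids

            JavaRDD<String> cellRdd = sc.parallelize(cells, 4);           // 4 partitions ~ 4 maps
            // Spark decides which thread/worker applies fishnet to which cell.
            List<String> neighborhoods = cellRdd.map(cell -> fishnet(cell, eps)).collect();
            neighborhoods.forEach(System.out::println);
        }
    }
}
The point is that you never address map1, map2, ... yourself; you only provide the function, and Spark assigns partitions to the available cores or workers.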
Beware that Spark is not very well suited for clustering. Its clustering capabilities are very limited, and the performance is pretty bad. See e.g.:
Neukirchen, Helmut. "Performance of Big Data versus High-Performance Computing: Some Observations."
It needed 900 cores with Spark to outperform a good single-core application like ELKI! And other Spark DBSCAN implementations would either not work reliably (i.e. fail) or produce wrong results.

Knime: How to achieve multi-split by row

I would like to split a data set into multiple data sets of 1000 rows each; how is that possible?
The Row Splitter node has only two outputs. Let me know if there is any way to use a Java snippet for this requirement.
It is not entirely well specified how you want to split the table, but there are two loop types that might do what you are looking for: Chunk Loop (Start) or Group Loop (Start). Your workflow probably would look like this:
[(Chunk/Group) Loop Start] --> Your processing nodes of the selected rows --> [Loop End]
In the part Your processing nodes of the selected rows you will only see the split parts you need.
The difference between the two nodes is the following: the Chunk Loop Start node collects rows into a group by their position (consecutive rows are part of the same group until the requested number of rows is consumed), while the Group Loop Start collects rows with the same properties into the same collection for processing. (The Loop End node might not be the best fit depending on your processing requirements; in that case look for other Loop End nodes.)
In case these are not sufficient, you might try the parallel chunk loop nodes, or, as I remember, there are also bagging, ensemble, and cross-validation (X-Validation) nodes in some extensions. (For more complex workflows you can also use recursive loops.) For feature elimination, you can also find support.

Storm and stop words

I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html).
I tested my code locally, and I think that if I remove stop words it will perform well, but I searched online and I can't find any example of removing stop words in Storm.
If the size of the stop words list is small enough to fit in memory, the most straightforward approach would be to simply filter the tuples with an implementation of a Storm Filter that knows that list. This Filter could poll the DB every so often to get the latest list of stop words, if that list evolves over time.
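As a minimal sketch of that first approach, assuming a Trident topology where the word arrives in a field called "word" (the field name and the hard-coded list are placeholders; in a real topology you could load or refresh the list in prepare(), e.g. from the DB, as suggested above):
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.tuple.TridentTuple;

// Drops tuples whose "word" field is a stop word.
// Note: older Storm releases use the storm.trident.* package names instead.
public class StopWordFilter extends BaseFilter {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of"));

    @Override
    public boolean isKeep(TridentTuple tuple) {
        // Returning false removes the tuple from the stream.
        return !STOP_WORDS.contains(tuple.getStringByField("word").toLowerCase());
    }
}
You would then attach it to the stream with something like stream.each(new Fields("word"), new StopWordFilter()).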
If the size of the stop words list is bigger, then you can use a QueryFunction, called from your topology with the stateQuery function, which would:
receive a batch of tuples to check (say 10000 at a time)
build a single query from their content and look up corresponding stop words in persistence
attach a boolean to each tuple specifying what to do with each one
then add a Filter right after that to filter based on that boolean.
And if you feel adventurous:
Another, faster approach would be to use a Bloom filter approximation. I heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is, nor do I have any experience in practically plugging it into Storm (maybe Sunday if it's rainy...).
Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map-reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm, and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the stop-word list from persistence every time, so maybe that would work too? Maybe? This also assumes the size of the stop words list is relatively small, though.