Hadoop V2: Turn off shuffling/sorting? - mapreduce

I was wondering if there is any way to turn off shuffling/sorting in the Map phase of a job? My job doesn't require a Reduce phase, so I don't need the shuffle and sort.
I'm using Hadoop version 2.2.0.
Thanks

You can call setNumReduceTasks(0) on the job (or set mapreduce.job.reduces to 0), which makes it a map-only job: the map output is written out directly, without shuffling or sorting.
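For illustration only, here is a minimal map-only sketch in Python using the third-party mrjob library (mrjob is not mentioned in the question; in the plain Java API the equivalent is simply job.setNumReduceTasks(0)):

from mrjob.job import MRJob

class MapOnlyJob(MRJob):
    # No reducer is defined, and the jobconf below forces zero reduce tasks
    # when run with the Hadoop runner, so the map output becomes the job
    # output with no shuffle/sort phase.
    JOBCONF = {'mapreduce.job.reduces': '0'}

    def mapper(self, _, line):
        yield None, line.upper()  # placeholder transformation

if __name__ == '__main__':
    MapOnlyJob.run()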

Related

multiple MAP function in pyspark

Hi,
I'm new to PySpark and I'm trying to implement DBSCAN using the MapReduce technique explained in https://github.com/mraad/dbscan-spark, but I don't understand something.
Obviously, if we have multiple computers we assign each cell to a map task and, as explained in the link, after calling reduce we find the contents of each cell's epsilon neighborhood. But on a single computer, how can we run maps and assign them to cells?
How do we define multiple maps on a single computer (PySpark) and assign them to cells?
I wrote fishnet(cell, eps), which returns point locations according to the cell's epsilon neighborhood.
I want to pass it to each map, but I don't know how to do that in PySpark.
Something like (if we have 4 cells):
map1(fishnet) map2(fishnet) map3(fishnet) map4(fishnet)
I would appreciate any solution.
It's the job of Spark / MapReduce to distribute the mappers to different workers. Don't mess with that part; let Spark decide where to invoke the actual mappers.
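A minimal PySpark sketch of that idea, assuming cells is the list of grid cells and fishnet(cell, eps) is the function from the question (the toy fishnet below is only a hypothetical stand-in):

from pyspark import SparkContext

def fishnet(cell, eps):
    # Hypothetical stand-in for the questioner's fishnet(cell, eps).
    cx, cy = cell
    return [(cx + dx * eps, cy + dy * eps) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

if __name__ == '__main__':
    sc = SparkContext(appName='dbscan-cells')
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]  # example cell ids
    eps = 0.5

    # One element per map task; Spark decides which local core (or executor,
    # on a cluster) runs each task -- the driver code does not change.
    neighborhoods = sc.parallelize(cells).map(lambda c: (c, fishnet(c, eps)))
    print(neighborhoods.collect())
    sc.stop()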
Beware that Spark is not very well suited for clustering. Its clustering capabilities are very limited, and the performance is pretty bad. See, e.g.:
Neukirchen, Helmut. "Performance of Big Data versus High-Performance Computing: Some Observations."
Spark needed 900 cores to outperform a good single-core application like ELKI! And other Spark DBSCAN implementations would either not work reliably (i.e. fail) or produce wrong results.

Efficient way of shuffling data randomly and then splitting it into training and test set?

I'm working on some Python machine-learning code where I have to randomly shuffle 100,000 samples and then split the data into a training and a test set. I have stored the data in two NumPy arrays. The following command is too time consuming:
c = zip(a, b)
np.random.shuffle(c)
a, b = zip(*c)
where a and b are the two NumPy arrays. Is there an efficient way to shuffle the data randomly and then split it into a training and a test set? Can someone please suggest Python code for this?
You may want to use scikit-learn's train_test_split function (from sklearn.model_selection; sklearn.cross_validation in older releases) for this purpose; it shuffles and splits in a single call.
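A minimal sketch, assuming a holds the samples and b the labels as in the question (the array shapes below are made up):

import numpy as np
from sklearn.model_selection import train_test_split

a = np.random.rand(100000, 10)            # samples (made-up shape)
b = np.random.randint(0, 2, size=100000)  # labels (made-up values)

# shuffle=True is the default; random_state makes the split reproducible.
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=42)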

When does shuffling occur in Apache Spark?

I am optimizing parameters in Spark, and would like to know exactly how Spark is shuffling data.
Specifically, I have a simple word count program, and would like to know how spark.shuffle.file.buffer.kb affects the run time. Right now, I only see a slowdown when I make this parameter very high (I am guessing this prevents every task's buffer from fitting in memory simultaneously).
Could someone explain how Spark is performing reductions? For example, the data is read and partitioned in an RDD, and when an "action" function is called, Spark sends out tasks to the worker nodes. If the action is a reduction, how does Spark handle this, and how are shuffle files / buffers related to this process?
Question: when is shuffling triggered in Spark?
Answer: any join, cogroup, or *ByKey operation involves holding objects in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data structures in the tasks for the stages on the fetching side of the shuffles they trigger; reduceByKey and aggregateByKey use them in the tasks for the stages on both sides of the shuffles they trigger.
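Since the question mentions word count, here is a minimal PySpark sketch (with made-up input) marking where the shuffle boundary falls:

from pyspark import SparkContext

sc = SparkContext(appName='wordcount-shuffle')
lines = sc.parallelize(['to be or not to be', 'to do or not to do'])

counts = (lines.flatMap(lambda line: line.split())  # narrow: no shuffle
               .map(lambda word: (word, 1))         # narrow: no shuffle
               .reduceByKey(lambda x, y: x + y))    # wide: shuffle boundary

print(counts.collect())
sc.stop()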
Explanation: how does the shuffle operation work in Spark?
The shuffle operation is implemented differently in Spark compared to Hadoop. I don't know if you are familiar with how it works in Hadoop, but let's focus on Spark for now.
On the map side, each map task in Spark writes out a shuffle file (OS disk buffer) for every reducer, which corresponds to a logical block in Spark. These files are not intermediary in the sense that Spark does not merge them into larger partitioned ones. Since the scheduling overhead in Spark is smaller, the number of mappers (M) and reducers (R) is far higher than in Hadoop. Thus, shipping M*R files to the respective reducers could result in significant overhead.
Similar to Hadoop, Spark also provides a parameter, spark.shuffle.compress, to compress map outputs, using either Snappy (the default) or LZF. Snappy uses only 33 KB of buffer per opened file and significantly reduces the risk of encountering out-of-memory errors.
On the reduce side, Spark requires all shuffled data to fit into the memory of the corresponding reducer task, unlike Hadoop, which has the option of spilling it to disk. This only happens in cases where the reducer task demands all shuffled data, for a groupByKey or a reduceByKey operation, for instance. Spark throws an out-of-memory exception in this case, which has proved quite a challenge for developers so far.
Also, in Spark there is no overlapping copy phase, unlike Hadoop, where mappers push data to the reducers even before the map phase is complete. This means that the shuffle is a pull operation in Spark, compared to a push operation in Hadoop. Each reducer also maintains a network buffer to fetch map outputs; its size is specified through the parameter spark.reducer.maxMbInFlight (by default, 48 MB).
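A sketch of setting the shuffle-related properties discussed above, using the property names from the Spark versions this answer targets (newer releases renamed some of them, e.g. spark.reducer.maxMbInFlight became spark.reducer.maxSizeInFlight):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('shuffle-tuning')
        .set('spark.shuffle.compress', 'true')       # compress map outputs
        .set('spark.shuffle.file.buffer.kb', '100')  # map-side write buffer per file (old name)
        .set('spark.reducer.maxMbInFlight', '48'))   # reduce-side fetch buffer (old name)

sc = SparkContext(conf=conf)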
For more information about shuffling in Apache Spark, I suggest the following readings:
Optimizing Shuffle Performance in Spark by Aaron Davidson and Andrew Or.
SPARK-751 JIRA issue and Consolidating Shuffle files by Jason Dai.
It occurs whenever data needs to be moved between executors (worker nodes).

When to use use MapReduce in Hbase?

I want to understand MapReduce on HBase from an application point of view, and I need some real use cases to better understand when writing these jobs is worthwhile.
If there is any link to documentation or examples that explain real use cases, please share.
I can give some examples based on my use cases. If you already store your data in HBase, you can write a Java program that scans a table, does something with the rows, and writes the output to HBase or somewhere else; or you can use MapReduce to do the same (a sketch of the client-side scan approach follows after this answer). The difference is that MapReduce runs where the data is, and the network is used only for the result data. We have hourly jobs that calculate the sum and average of KPIs; the input data is huge but the output is tiny. If I did not use MapReduce, I would need to move one hour of data, 18 GB, over the network, whereas the MapReduce output is only 1 MB, which I can write to HBase, a file, or anywhere else.
MapReduce also gives you parallel task execution, which you could implement yourself in Java, but why would you? :)
Keep in mind that YARN creates map tasks according to your HBase table's split count, so if you need more map tasks, split your table.
If you already store your data in HDFS, you are lucky: a MapReduce job reading from HDFS is much faster than one reading from HBase. You can still write the MapReduce output to HBase if you want.
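For contrast, here is a minimal sketch of the client-side scan alternative mentioned above, using the third-party happybase library; the host, table, and column names are made up. Every row crosses the network to this single process, which is exactly the cost a MapReduce job running on the table's regions avoids:

import happybase

# Requires the HBase Thrift server; host and table names are hypothetical.
connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('kpi_events')

total = 0.0
count = 0
for row_key, data in table.scan(columns=[b'cf:value']):  # full scan: every row travels to the client
    total += float(data[b'cf:value'])
    count += 1

print('sum=%.2f avg=%.2f' % (total, total / count if count else 0.0))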
Please look into the use cases given:
1. here.
2. And a small reference here - 30. Joins.
3. Maybe an end-to-end example here.
In the end, it all depends on your understanding of each concept, MapReduce and HBase, and on using them as your project needs. The same task can be done with or without MapReduce. Happy coding!

How can you build a MapReduce pipeline in RavenDB?

I have a theoretical question about RavenDB - is it possible to build a MapReduce pipeline?
So for instance, let's say I have MapReduce output Y from Index 1 and I use that output for some queries, but I need to run another MapReduce operation on Y in order to get some lower-level statistics.
Is it possible for me to write Index 2 which can further MapReduce the output from Index 1?
Aaronontheweb,
Not at the moment, no.
It is a feature that you could handle using the Indexed Properties bundle in 1.2.