I am using a Dataproc cluster for Spark processing. I am new to the whole Google Cloud stack. In our application we have hundreds of jobs that use Dataproc. With every job we spawn a new cluster and terminate it once the job is over. I am using PySpark for the processing.
Is it safe to use a hybrid of stable nodes and preemptible nodes for cost reduction?
What is the best software configuration for improving the performance of a Dataproc cluster? I am aware of the infrastructure optimisations used for in-house Hadoop/Spark clusters. Are they applicable as-is to a Dataproc cluster, or is something else needed?
Which instance type is best suited for a Dataproc cluster when we are processing around 150 GB of Avro-formatted data?
I have tried Spark's DataFrame caching/persist for time optimization, but it was not that useful. Is there any way to tell Spark that the cluster's entire resources (memory, processing power) belong to this job so that it can process the data faster?
Does reading from and writing back to a GCS bucket have a performance hit? If yes, is there any way to optimize it?
Any help in time and price optimisation is appreciated. Thanks in advance.
Thanks
Manish
Is it safe to use a hybrid of stable nodes and preemptible nodes for cost reduction?
That's absolutely fine. We've used that on 300+ node clusters; the only issues were with long-running clusters when nodes were getting preempted and the jobs were not optimised to account for node reclamation (no RDD replication, huge long-running DAGs). Also, Tez does not like preemptible nodes getting reclaimed.
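If you cache intermediate data on a cluster with preemptible workers, one mitigation worth trying is a replicated storage level, so that losing a single worker does not force a full recompute of the lineage. A minimal PySpark sketch, assuming the spark-avro package is available on the cluster and with a placeholder input path:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("preemptible-friendly-job").getOrCreate()

# Placeholder input path; assumes the spark-avro package is on the cluster.
df = spark.read.format("avro").load("gs://my-bucket/input/")

# Keep two copies of each cached partition (memory, spilling to disk),
# so one preempted worker does not force a full recompute.
df.persist(StorageLevel.MEMORY_AND_DISK_2)

# Materialize the cache before the expensive downstream stages.
df.count()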
Are they applicable as-is to a Dataproc cluster, or is something else needed?
For the most part, yes. However, the Google Cloud Storage connector has different characteristics when it comes to operation latency (for example, FileOutputCommitter can take a huge amount of time when doing a recursive move or delete over heavily partitioned output) and memory usage (write buffers are 64 MB vs 4 KB on HDFS).
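As an illustration of working around those two characteristics, the sketch below enables version 2 of the Hadoop output committer (which avoids the expensive recursive rename of task output) and shrinks the GCS write buffer. The exact buffer property name varies between GCS connector versions, so treat it as an assumption to verify against your cluster:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-tuned-job")
    # Commit algorithm v2 avoids the recursive move of every task's output
    # into the final location, which is slow on GCS.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Smaller upload chunk size (default ~64 MB) reduces per-writer memory
    # pressure; the property name depends on the GCS connector version.
    .config("spark.hadoop.fs.gs.outputstream.upload.chunk.size", str(8 * 1024 * 1024))
    .getOrCreate()
)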
Which instance type is best suited for a Dataproc cluster when we are processing around 150 GB of Avro-formatted data?
Only performance tests can help with that.
I have tried Spark's DataFrame caching/persist for time optimization, but it was not that useful. Is there any way to tell Spark that the cluster's entire resources (memory, processing power) belong to this job so that it can process the data faster?
Make sure dynamic allocation is enabled and your cluster is sized to your workload. The Scheduler tab in the YARN UI should show utilisation close to 100% (if it doesn't, your cluster is oversized for the job, or you don't have enough partitions). In the Spark UI, the number of running tasks should be close to the number of cores (if it isn't, again it might be too few partitions, or the cluster is oversized).
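A hedged sketch of what that looks like in PySpark; the partition count and input path are placeholders, and Dataproc already enables dynamic allocation by default:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("right-sized-job")
    # Dataproc enables dynamic allocation by default; shown explicitly here.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    # Aim for roughly 2-3 partitions per available core as a starting point.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

df = spark.read.format("avro").load("gs://my-bucket/input/")

# If the input arrives in too few splits, repartition so every core has work.
df = df.repartition(400)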
Does reading from and writing back to a GCS bucket have a performance hit? If yes, is there any way to optimize it?
From a throughput perspective, GCS is not bad, but it is much worse in the case of many small files, both for reading (when computing splits) and for writing (when the FileOutputCommitter commits output). Also, many parallel writes can result in OOMs due to the bigger write buffer size.
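One common mitigation is to consolidate the output into fewer, larger files before writing to GCS. A minimal sketch with placeholder paths and an arbitrary partition count:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-write-job").getOrCreate()

df = spark.read.format("avro").load("gs://my-bucket/input/")

# Fewer, larger output files shorten the commit phase and limit how many
# large write buffers are open at once.
df.coalesce(32).write.mode("overwrite").format("avro").save("gs://my-bucket/output/")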
Related
A SideInput is sort of like a broadcast in Spark, meaning you are caching data on the local worker machines for fast lookup, to reduce network/shuffle overhead. It is logical to think that the limit on how much you can cache is whatever fits in the heap. The Dataflow documentation says the limit is 20K shards. What does this mean? How big is a shard?
To answer your original question, you can configure the amount of in-memory caching done by a Dataflow worker via the --workerCacheSizeMb option on the command line, which is setWorkerCacheSizeMb if you are invoking a pipeline programmatically. The default is 100 MB.
Can someone explain, using the word count example, why Spark would be faster than MapReduce?
bafna's answer provides the memory side of the story, but I want to add two other important factors: the DAG and the ecosystem.
Spark uses "lazy evaluation" to form a directed acyclic graph (DAG) of consecutive computation stages. In this way, the execution plan can be optimized, e.g. to minimize shuffling data around. In contrast, this has to be done manually in MapReduce by tuning each MR step. (It is easier to understand this point if you are familiar with execution plan optimization in an RDBMS or the DAG-style execution of Apache Tez.)
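To tie this back to the word count example: in the PySpark sketch below (with a placeholder input path), the transformations only build up the DAG, and nothing runs until the final action, which lets Spark plan the whole chain as one job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-word-count").getOrCreate()
sc = spark.sparkContext

# Each of these lines only extends the DAG; nothing executes yet.
lines = sc.textFile("gs://my-bucket/input.txt")
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Only this action triggers execution of the whole chain.
print(counts.take(10))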
The Spark ecosystem has established a versatile stack of components to handle SQL, ML, streaming, and graph-mining tasks. In the Hadoop ecosystem, you have to install other packages to do each of these things.
I also want to add that, even if your data is too big for main memory, you can still use Spark by choosing to persist your data on disk. By doing this you give up the advantages of in-memory processing, but you can still benefit from the DAG execution optimization.
I think there are three primary reasons.
The main two reasons stem from the fact that, usually, one does not run a single MapReduce job, but rather a set of jobs in sequence.
One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive, because it incurs disk I/O of three times the size of the dataset (for replication) and a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations. When the output of an operation needs to be fed into another operation, Spark passes the data directly without writing to persistent storage. This is an innovation over MapReduce that came from Microsoft's Dryad paper, and is not original to Spark.
The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don't need to be read from disk for each operation.
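A small PySpark illustration of that caching abstraction; the input path and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reuse-cached-input").getOrCreate()

events = spark.read.json("gs://my-bucket/events/")

# Cache once; both aggregations below reuse the in-memory copy instead of
# re-reading the input from storage.
events.cache()

events.groupBy("date").count().show()
events.groupBy("user_id").count().show()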
What about Spark jobs that would boil down to a single MapReduce job? In many cases also these run faster on Spark than on MapReduce. The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds with loading JARs, JITing, parsing configuration XML, etc. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes in the single digits of milliseconds.
Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark's shuffle implementation works very similarly to MapReduce's: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.
I have a MapReduce job.
I have used
job.setNumReduceTasks(0);
to control its speed.
Is there any other way to slow down the job?
We are afraid that the job running too fast may affect our database.
You can exploit the queues supported in YARN. For instance, you can create a queue with appropriate restrictions on memory and CPU cores, and then set your job configuration to use that queue when launching the MapReduce job.
I would suggest you go through the documentation on the Fair Scheduler.
As for your current solution of setting the number of reducers to 0, I don't think it is the best way to restrict the compute.
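As a rough sketch of the queue approach: with the Fair Scheduler enabled, an allocation file can cap a queue, and the job is then pointed at that queue via the mapreduce.job.queuename property. The queue name and resource limits below are placeholders, not recommendations:

<!-- Hypothetical fair-scheduler.xml fragment: a capped queue for slow jobs. -->
<allocations>
  <queue name="throttled">
    <maxResources>8192 mb,4 vcores</maxResources>
    <maxRunningApps>1</maxRunningApps>
  </queue>
</allocations>

The job would then be submitted with mapreduce.job.queuename set to throttled, for example via -Dmapreduce.job.queuename=throttled on the command line or job.getConfiguration().set("mapreduce.job.queuename", "throttled") in code.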
I wrote a script that analyzes a lot of files on an AWS cluster.
Running it in the cloud seems to be slower than I expected: the filesystem is shared via NFS, so the round trip through the network seems to be the limiting step here. Bottom line: the processing power of the cluster is limited by the speed of the internal network, which is considerably slower than the speed of the SSD the data is located on.
How would you optimize the cluster so that IO intensive jobs will run efficiently?
There isn't much you can do given the circumstances.
Obviously the speed of the NFS itself is the drawback.
Consider:
Chunking - grab only the pieces of each file that are actually required, so as little as possible goes over the network
Copying locally - create a locking mechanism, copy the file in full to local storage, process it, and push the result back (see the sketch after this answer). This can require a lot of work (what if the worker gives up and doesn't clear the lock?)
Optimize the NFS share - increase IO throughput by clustering the NFS, RAIDing it, etc.
With a remote FS you want to limit the amount of back and forth. You can get creative, but creativity can become a problem in itself.
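For the "copying locally" option above, here is a minimal Python sketch under stated assumptions: process_file and the two directory paths are placeholders for your own script's logic, and whether flock actually coordinates across NFS clients depends on the NFS version and mount options:

import fcntl
import os
import shutil

SHARED_DIR = "/mnt/nfs/data"   # NFS-mounted input (placeholder path)
LOCAL_DIR = "/scratch"         # fast local SSD scratch space (placeholder path)

def process_one(name):
    shared_path = os.path.join(SHARED_DIR, name)
    lock_path = shared_path + ".lock"
    # Advisory lock so two workers don't grab the same file. A crashed worker
    # releases the flock when its descriptor closes, but the stale .lock file
    # may still need cleanup - the caveat mentioned above.
    with open(lock_path, "w") as lock_file:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return  # another worker already owns this file
        local_path = os.path.join(LOCAL_DIR, name)
        shutil.copy(shared_path, local_path)             # pull once over NFS
        result_path = process_file(local_path)           # placeholder: your analysis step
        shutil.copy(result_path, shared_path + ".out")   # push the result back
        os.remove(local_path)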
I am running three MapReduce jobs in sequence (output of one is the input to another) on a Hadoop cluster with 3 nodes (1 master and 2 slaves).
Apparently, the total time taken by the individual jobs to finish on a single-node cluster is less than the above by quite a margin.
What could be the possible reasons? Is it the network latency? It's running on 100Mbps Ethernet network. Will it help if I increase the number of nodes?
I am using Hadoop Streaming and my code is in python2.7.
MapReduce isn't really meant to handle that small of an input dataset. The MapReduce framework has to determine which nodes will run tasks and then spin up a JVM to run each individual Map and Reduce task(s) (the number of tasks is dependent on the size of your data set). That usually has a latency on the order of tens of seconds. Shipping non local data between nodes is also expensive as it involves sending data over the wire. For such a small dataset, the overhead of setting up a MapReduce job in a distributed cluster is likely higher than the runtime of the job itself. On a single node you only see the overhead of starting up tasks on a local machine and don't have to do any data copying over the network, that's why the job finishes faster on a single machine. If you had multi gigabyte files, you would see better performance on several machines.