How to slow down the speed of a MapReduce job

I have a MapReduce job.
I have used
job.setNumReduceTasks(0);
to control its speed.
Is there any other way to slow the job down?
We are afraid that if the job runs too fast, it may put too much load on our database.

You can exploit the queues supported in YARN. For instance, you can create a queue with appropriate restrictions on memory and CPU cores, and then set your job configuration so that the MapReduce job is launched in that queue (a sketch follows below).
I would suggest you go through the documentation on the Fair Scheduler.
As for your current approach of setting the number of reducers to 0, I don't think that is the right way to restrict how much the job can consume.
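A minimal sketch of what pointing the job at a capped queue could look like in the driver. The queue name "limited" is a placeholder for a queue you have defined in your Fair Scheduler (or Capacity Scheduler) configuration; everything else is the standard MapReduce driver pattern:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ThrottledJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the job to a resource-capped queue; "limited" is a placeholder
        // for a queue defined in your scheduler configuration.
        conf.set("mapreduce.job.queuename", "limited");

        Job job = Job.getInstance(conf, "throttled-job");
        // ... set mapper, reducer, input/output paths as in your existing driver ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The same property can also be passed on the command line as -Dmapreduce.job.queuename=limited if your driver goes through ToolRunner.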

Related

How Could I Monitor Lambda Concurrent Executions on a Second-by-Second Basis (or Find a Better Solution to Limit Lambda ConcurrentExecutions)?

I am working on a massive distributed computing platform built within AWS Lambda. The platform is extremely spiky, so most of the time the number of ConcurrentExecutions is below 50, but we can hit the maximum (1000 currently) for up to an hour or more if a large batch job hits the system (it is an event-driven system). This is a problem, as we will have customer-facing APIs that will lag terribly. Finally, I am not an architect, so I have minimal control over how the system was designed, but I have been asked to devise a clever concurrent-execution limiting solution.
I'm not new to AWS, so I know about the standard ways to handle this problem. #1 is reserved concurrency on the user-facing Lambdas. I'm not allowed to do that for the sake of this exercise (though I'll go tell my boss that's what's necessary if it truly is). I'm thinking of a system where we designate high-priority (for UI) and low-priority (for batch processing) functions, and the low-priority functions will check a stored (DynamoDB) value, output from CloudWatch, with the current number of ConcurrentExecutions. If a low-priority function finds that we are in danger of using all the ConcurrentExecutions, it will post to a queue with exponential backoff in place. This should all work, save for the problem that ConcurrentExecutions is only monitored in one-minute increments, which is too slow, as many of our Lambdas run for around 500 ms.
So my questions are as follows:
Is there a way to set up a custom ConcurrentExecutions metric that has second-by-second data points, and if so, how would you do it?
Is there a better way to implement a counter than Cloudwatch?
Am I just missing something here, and does someone have a clever way to manage Lambda ConcurrentExecutions?
I don't think it's necessary to create a monitoring or throttling solution at all. You would need to build, test, and maintain something in addition to your core solution. Instead, two suggestions:
First, it sounds like the current design has one Lambda function doing too much. Decompose the Lambdas further, so you can split them into a UI/public-facing Lambda and one or more Lambdas dedicated to the batch processes. This way you can spread the concurrent execution limit across more Lambdas. The limit is per Lambda function.
Second, request a service quota / limit increase:
To raise the limit above 1,000 concurrent function executions, submit a request to the AWS Support Center by following the steps in our documentation. This feature is available in all regions where Lambda is available.
See AWS Lambda Raises Default Concurrent Execution Limits.
https://aws.amazon.com/about-aws/whats-new/2017/05/aws-lambda-raises-default-concurrent-execution-limit/
The limit management team is very flexible; when asking for a limit to be raised, they will generally raise it to any reasonable number that your solution requires.
To request a limit increase, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html
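If a throttling layer does turn out to be unavoidable, one note on question 1: CloudWatch custom metrics can be published with 1-second storage resolution, which gets around the one-minute granularity problem. A rough sketch using the AWS SDK for Java v2; the metric name, namespace, and the way the in-flight count is obtained are illustrative assumptions, not part of the original design:
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class ConcurrencyMetricPublisher {
    public static void main(String[] args) {
        // Value tracked by your own counter (e.g. the DynamoDB counter from the question).
        double inFlight = Double.parseDouble(args[0]);

        try (CloudWatchClient cw = CloudWatchClient.create()) {
            MetricDatum datum = MetricDatum.builder()
                    .metricName("ApproxConcurrentExecutions") // illustrative name
                    .timestamp(Instant.now())
                    .value(inFlight)
                    .unit(StandardUnit.COUNT)
                    .storageResolution(1) // high-resolution metric: 1-second granularity
                    .build();

            cw.putMetricData(PutMetricDataRequest.builder()
                    .namespace("Custom/LambdaThrottling") // illustrative namespace
                    .metricData(datum)
                    .build());
        }
    }
}
Keep in mind that 1-second data points are only retained for a few hours before CloudWatch aggregates them, which is fine for a backoff decision but not for long-term analysis.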

How to use two AWS EC2 instances (1 GPU and 1 CPU instance) with one storage to run code, store/share files, and reduce cost

My team is using a GPU instance to run TensorFlow-based machine learning, YOLO, and computer vision applications, and also uses it for training machine learning models. It costs $7 an hour and has 8 GPUs. I was trying to reduce its cost. We need 8 GPUs for faster training, and sometimes several people use different GPUs at the same time.
For our use case, we sometimes do not use the GPUs at all for at least 1-2 weeks of a month, although a need for them may or may not come up during that time. So I wanted to know: is there a way to arrange the code so that all CPU-intensive operations run on a low-cost CPU instance when the GPUs are not needed, and to turn on the GPU instance only when it is needed, use it, and then stop it when the work is done?
I thought of putting the code on an EFS shared file system and running it from there, but I read an article ( https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs ) which says I should never run code from network-based drives because they can become really slow. So I don't know whether it is a good idea to run a machine learning application from an EFS file system. I was also thinking of creating virtual environments in folders on EFS, but I don't think that is a good idea.
Could anyone suggest good ways of achieving this and reducing costs? If you are going to suggest an instance with a lower number of GPUs: I have considered that, but we sometimes need 8 GPUs for faster training, while at other times we don't use the GPUs at all for 1-2 weeks yet still incur the cost.
Please suggest a way to achieve a low cost for this use case without using spot or reserved instances.
Thanks in advance
A few thoughts:
GPU instances now allow hibernation, so when launching your GPU instance, select the new stop-instance behavior 'hibernate', which will let you turn it off for 2 weeks but spin it up quickly when necessary (see the sketch after this list).
If you only have one instance, look into using EBS for data storage with a high number of provisioned IOPS, so you can move data on/off your instance quickly.
Alternatively, move your model to SageMaker, so that you are only charged for GPU use while you are actively training your model.
If you are applying your model (inferencing), move that workload to a cheap instance. A trained YOLO model can run inference on very small CPU instances; there is no need for a GPU for that part of the workload at all.
To reduce inference costs, you can use Elastic Inference, which supports pay-per-use functionality:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-inference.html
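A minimal sketch of the stop-with-hibernate / start-on-demand idea from the first bullet, using the AWS SDK for Java v2. The instance ID is a placeholder, and hibernation must have been enabled when the instance was launched:
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.StartInstancesRequest;
import software.amazon.awssdk.services.ec2.model.StopInstancesRequest;

public class GpuInstanceToggle {
    private static final String INSTANCE_ID = "i-0123456789abcdef0"; // placeholder

    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            if (args.length > 0 && args[0].equals("start")) {
                // Spin the GPU box up only when a training run is about to begin.
                ec2.startInstances(StartInstancesRequest.builder()
                        .instanceIds(INSTANCE_ID)
                        .build());
            } else {
                // Hibernate instead of a plain stop so the instance resumes quickly
                // (requires hibernation to be enabled at launch).
                ec2.stopInstances(StopInstancesRequest.builder()
                        .instanceIds(INSTANCE_ID)
                        .hibernate(true)
                        .build());
            }
        }
    }
}
The same two calls could be wired to a small scheduler or a chat command so the team can bring the GPU instance up on demand without touching the console.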

Optimization of the Google Dataproc cluster

I am using a Dataproc cluster for Spark processing. I am new to the whole Google Cloud stack. In our application we have hundreds of jobs that use Dataproc. With every job we spawn a new cluster and terminate it once the job is over. I am using PySpark for processing.
Is it safe to use a hybrid of stable nodes and pre-emptible nodes for cost reduction?
What is the best software configuration for improving the performance of a Dataproc cluster? I am aware of the usual infrastructure optimisations for an in-house Hadoop/Spark cluster. Do they apply as-is to a Dataproc cluster, or is something else needed?
Which instance type is best suited for a Dataproc cluster when we are processing Avro-formatted data of around 150 GB in size?
I have tried Spark's dataframe caching/persist for time optimisation, but it was not that useful. Is there any way to tell Spark that the cluster's entire resources (memory, processing power) belong to this job, so that it can process it faster?
Does reading from and writing back to a GCS bucket have a performance hit? If yes, is there any way to optimize it?
Any help in time and price optimisation is appreciated. Thanks in advance.
Thanks
Manish
Is it safe to use a hybrid of stable nodes and pre-emptible nodes for cost reduction?
That's absolutely fine. We've used that on 300+ node clusters; the only issues were with long-running clusters when nodes were getting preempted and jobs were not optimised to account for node reclamation (no RDD replication, huge long-running DAGs). Also, Tez does not like preemptible nodes getting reclaimed.
Do they apply as-is to a Dataproc cluster, or is something else needed?
Largely, yes. However, the Google Cloud Storage connector has different characteristics when it comes to operation latency (for example, FileOutputCommitter can take a huge amount of time when trying to do a recursive move or delete over heavily partitioned output) and memory usage (writer buffers are 64 MB vs 4 KB on HDFS).
Which instance type is best suited for a Dataproc cluster when we are processing Avro-formatted data of around 150 GB in size?
Only performance tests can help with that.
I have tried Spark's dataframe caching/persist for time optimisation, but it was not that useful. Is there any way to tell Spark that the cluster's entire resources (memory, processing power) belong to this job, so that it can process it faster?
Make sure dynamic allocation is enabled and that your cluster is sized to your workload. The Scheduler tab in the YARN UI should show utilisation close to 100% (if not, your cluster is oversized for the job, or you do not have enough partitions). In the Spark UI, the number of running tasks should be close to the number of cores (if not, again, there may be too few partitions, or the cluster is oversized).
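As a sketch, the relevant settings are ordinary Spark properties, so the same keys can be passed from PySpark or on the gcloud/spark-submit command line; the partition count below is purely illustrative and should be tuned to your data:
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class TunedJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .set("spark.dynamicAllocation.enabled", "true")
                .set("spark.shuffle.service.enabled", "true")  // required for dynamic allocation on YARN
                .set("spark.sql.shuffle.partitions", "600");    // illustrative: aim for running tasks close to total cores

        SparkSession spark = SparkSession.builder()
                .appName("tuned-job")
                .config(conf)
                .getOrCreate();
        // ... job logic ...
        spark.stop();
    }
}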
Does reading from and writing back to a GCS bucket have a performance hit? If yes, is there any way to optimize it?
From a throughput perspective GCS is not bad, but it is much worse in the case of many small files, from both the reading side (when computing splits) and the writing side (when the FileOutputCommitter runs). Also, many parallel writes can result in OOMs due to the bigger write-buffer size.

Cloud Dataflow/Beam: Side Input Limit

A side input is sort of like a broadcast variable in Spark, meaning you are caching data on the local worker machines for fast lookup, to reduce network/shuffle overhead. It is logical to think the limit on how much you can cache should be whatever fits in the heap. The Dataflow documentation says the limit is 20K shards. What does this mean? How big is a shard?
To answer your original question, you can configure the amount of in-memory caching done by a Dataflow worker via the --workerCacheSizeMb option on the command line (setWorkerCacheSizeMb if you are setting the options programmatically). The default is 100 MB.
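A minimal sketch of passing that flag when constructing the pipeline options, assuming the Beam/Dataflow Java SDK; the runner and project values are placeholders, and only --workerCacheSizeMb is the setting named above:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CacheSizedPipeline {
    public static void main(String[] args) {
        String[] flags = {
                "--runner=DataflowRunner",   // placeholder runner/project values
                "--project=my-project",
                "--workerCacheSizeMb=400"    // raise the side-input cache from the 100 MB default
        };
        PipelineOptions options = PipelineOptionsFactory.fromArgs(flags).create();

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline, including the side inputs ...
        p.run();
    }
}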

Why is Spark faster than Hadoop MapReduce

Can someone explain, using the word count example, why Spark would be faster than MapReduce?
bafna's answer provides the memory side of the story, but I want to add two other important factors: the DAG and the ecosystem.
Spark uses "lazy evaluation" to form a directed acyclic graph (DAG) of consecutive computation stages. In this way, the execution plan can be optimized, e.g. to minimize shuffling data around. In contrast, this has to be done manually in MapReduce by tuning each MR step. (It is easier to understand this point if you are familiar with execution plan optimization in an RDBMS or with the DAG-style execution of Apache Tez.)
The Spark ecosystem has established a versatile stack of components to handle SQL, ML, streaming, and graph-mining tasks. In the Hadoop ecosystem, you have to install other packages to do these individual things.
And I want to add that even if your data is too big for main memory, you can still use Spark by choosing to persist your data on disk. Although you give up some of the advantages of in-memory processing by doing this, you still benefit from the DAG execution optimization.
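To tie this back to the word count example from the question, here is a rough Java sketch; the input and output paths are placeholders. The transformations only build up the DAG, nothing executes until the final action, and the persist call shows the on-disk option mentioned above:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("word-count"));

        // Transformations are lazy: Spark plans the whole split -> pair -> reduce DAG
        // before anything runs, so it can avoid unnecessary materialization.
        JavaRDD<String> lines = sc.textFile("hdfs:///input/books");   // placeholder path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // If the data is too big for RAM, keep intermediate results on disk
        // and still benefit from the DAG-level optimization.
        counts.persist(StorageLevel.DISK_ONLY());

        counts.saveAsTextFile("hdfs:///output/word-counts");          // action: triggers execution
        sc.stop();
    }
}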
Some informative answers on Quora:
here and here.
I think there are three primary reasons.
The main two reasons stem from the fact that, usually, one does not run a single MapReduce job, but rather a set of jobs in sequence.
One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive, because it incurs both three times (for replication) the size of the dataset in disk I/O and a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations. When the output of an operation needs to be fed into another operation, Spark passes the data directly without writing to persistent storage. This is an innovation over MapReduce that came from Microsoft's Dryad paper, and is not original to Spark.
The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don't need to be read from disk for each operation.
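A small sketch of that caching abstraction, with a placeholder input path: the dataset is read from storage once and then served from memory for every later operation that touches it.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachedReuse {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("cached-reuse"));

        // Cache the input once; both actions below reuse the in-memory copy instead of
        // re-reading it from disk, which is where a chain of MapReduce jobs would pay again.
        JavaRDD<String> lines = sc.textFile("hdfs:///input/logs").cache(); // placeholder path

        long total = lines.count();                                    // first action: reads storage, fills the cache
        long errors = lines.filter(l -> l.contains("ERROR")).count();  // second action: served from memory

        System.out.println(total + " lines, " + errors + " errors");
        sc.stop();
    }
}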
What about Spark jobs that would boil down to a single MapReduce job? In many cases also these run faster on Spark than on MapReduce. The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds with loading JARs, JITing, parsing configuration XML, etc. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes in the single digits of milliseconds.
Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark's shuffle implementation works very similarly to MapReduce's: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.