I have been able to run code on distributed Google Cloud ML, but when I run it the data gets replicated on each machine in the cluster, whereas I want the data distributed across the machines.
How can I distribute the data across the machines in the cluster on Cloud ML?
Please help!!!!
Typically, in distributed asynchronous training, instead of having each worker train on a non-overlapping partition of the data, you want each worker to work on all of the data.
In asynchronous training, the parameter server does not wait to receive updates from all workers -- it applies updates as they arrive. So if one worker is slower than the others, it will contribute fewer updates than the other workers. If you partition the data such that each worker has access only to its own partition, you are effectively down-weighting the examples that belong to the slower workers, because they cause fewer updates to the parameters. That would adversely affect the quality and generalizability of your model.
If you use synchronous training and force updates to wait for all workers, you can safely partition the data across workers; however, training will be as slow as the slowest worker, since each step has to wait for updates from all workers. If you don't force updates from all workers, the situation may actually be worse than asynchronous training, because examples from slow workers are likely to be ignored completely.
Because it is more robust, asynchronous training is more common.
Luckily, having all workers examine all of the data is generally a sensible thing to do. As long as you randomize the data (here and here), the examples being examined at any given time (across all workers) form a set of batch_size * num_workers examples sampled (almost) uniformly at random, with replacement, from the full dataset.
That canonical approach to reading data in asynchronous training often works sufficiently well in practice, especially in distributed training. However, if you have so much data that you can only perform a few epochs of training, your model may benefit from seeing each example the same number of times (sampling without replacement). That is more complicated and less robust, but it can be done; that's a separate post.
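For concreteness, here is a minimal sketch of that canonical approach using the tf.data API (assuming TFRecord input; the file pattern, buffer size and batch size are placeholders). Every worker reads the full dataset, but shuffles it in its own random order and repeats indefinitely:

```python
import tensorflow as tf

def make_input_fn(file_pattern, batch_size=128, shuffle_buffer=10000):
    """Input pipeline in which every worker sees all of the data in its own random order."""
    def input_fn():
        # List and shuffle the input files, then interleave reads across them.
        files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
        dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
        dataset = dataset.shuffle(shuffle_buffer)  # per-worker random order
        dataset = dataset.repeat()                 # keep sampling indefinitely
        dataset = dataset.batch(batch_size)
        return dataset
    return input_fn
```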
I have thousands of training jobs that I want to run on SageMaker. Basically I have a list of hyperparameter sets, and I want to train the model for all of those hyperparameter sets in parallel (not standard hyperparameter tuning, where we just want to optimize the hyperparameters; here we want to train for all of them). I have searched the docs quite extensively, but it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.
For example, let's say I have 10,000 training jobs and my quota is 20 instances; what is the best way to run these jobs utilizing all my available instances? In particular:
Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs)?
Is it best practice to run a single training job per instance? If that's the case, do I need to ask for a much higher quota on the number of instances?
If this functionality does not exist in SageMaker, is it worth using EC2 instead, since it's a bit cheaper?
Your question is very broad and the best way forward would depend on other details of your use-case, so we will have to make some assumptions.
[Queue manager]
SageMaker does not have a queue manager. If at the end you decide you need a queue manager, I would suggest looking towards AWS Batch.
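If you do end up needing one, a rough sketch of the submission side with boto3 could look like the following; the job queue and job definition names are placeholders you would have to create in AWS Batch beforehand, and passing the hyperparameters through an environment variable is just one possible convention:

```python
import json
import boto3

batch = boto3.client("batch")

# Hypothetical list of hyperparameter sets, one Batch job per set.
hyperparameter_sets = [{"lr": 0.01, "depth": 3}, {"lr": 0.1, "depth": 5}]

for i, hps in enumerate(hyperparameter_sets):
    batch.submit_job(
        jobName=f"train-{i}",
        jobQueue="my-training-queue",         # placeholder queue name
        jobDefinition="my-training-job-def",  # placeholder job definition
        containerOverrides={
            "environment": [{"name": "HPS_JSON", "value": json.dumps(hps)}]
        },
    )
```

AWS Batch then runs as many jobs concurrently as the compute environment allows, queues the rest, and tracks succeeded/failed jobs for you.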
[Single vs multiple training jobs]
Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save on time you would be better off reusing instances for multiple training jobs. (Otherwise, with a 20-instance limit, you need 500 rounds of training; at roughly 3 minutes of start-up time per round - depending on instance type - that is about 25 hours of wait time alone. Depending on the complexity of each individual model, those 25 hours might be significant or totally acceptable.)
[Instance limit increase]
You can always ask for a limit increase, but going from a limit of 20 to 10,000 at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case this might be fine.
[One possible option] (Assuming multiple lightweight models)
You could create a single training job with the instance count set to the number of instances available to you.
Inside the training job, your code can run a for loop and perform all the individual training jobs you need.
In this case, you will need to know which instance is which so you can split the hyperparameter sets between them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run a subset of the required trainings.
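As a rough sketch (the hyperparameter list and the training call are placeholders for your own code), each instance can read that file and pick its share of the work like this:

```python
import json

# SageMaker writes the cluster layout to this path inside every training container.
RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"

with open(RESOURCE_CONFIG) as f:
    config = json.load(f)

current_host = config["current_host"]   # e.g. "algo-1"
hosts = sorted(config["hosts"])         # all hosts in this training job
host_index = hosts.index(current_host)

# Hypothetical list of hyperparameter sets; each host takes every
# len(hosts)-th entry so the work is split evenly.
all_hps = [{"lr": 0.01}, {"lr": 0.05}, {"lr": 0.1}]
my_hps = all_hps[host_index::len(hosts)]

for hps in my_hps:
    print(f"{current_host} training with {hps}")  # replace with your training loop
```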
Another thing to think about is whether you need to save the generated models (which you probably do). You can save everything to the output model directory - the standard SageMaker approach - but this will zip all the models into a single model.tar.gz file.
If you don't want that and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which syncs anything written there to your S3 location.
My Apache Beam Scio Dataflow job is asking for more workers than my current quota allows. The job completes successfully, but is limited to 575 workers. What are the consequences of not giving it the RAM it is asking for? More disk IO for intermediate steps? Slower sink IO? Does it depend on what's going on in the job? In particular, my job is pretty simple; it really has 2 steps:
- aggregateByKey
- do IO per key
I can run my own experiments, but I'm also interested in the cost of the job, since it isn't an extremely time-sensitive operation (aka I'm okay letting it run longer if it is cheaper)...
In this case, your job will have a higher runtime than if your quota was higher, but the aggregate amount of time spent performing work by all workers should be about the same.
Dataflow bills you for the amount of time each CPU, memory and storage unit is allocated. If the total CPU-hours, RAM GB-hours and storage GB-hours are about the same, your job should cost about the same.
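As a toy illustration (the worker counts, runtimes, vCPU count and hourly rate below are made-up numbers, not actual Dataflow prices), halving the number of workers while doubling the runtime leaves the total vCPU-hours, and hence the compute cost, roughly unchanged:

```python
# Toy numbers only -- not real Dataflow prices.
vcpus_per_worker = 4
rate_per_vcpu_hour = 0.06  # hypothetical $/vCPU-hour

# Scenario A: quota-capped at 575 workers, job takes about 2 hours.
cost_a = 575 * vcpus_per_worker * 2 * rate_per_vcpu_hour

# Scenario B: 1150 workers, job takes about 1 hour.
cost_b = 1150 * vcpus_per_worker * 1 * rate_per_vcpu_hour

print(cost_a, cost_b)  # equal: the total vCPU-hours are the same either way
```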
Note: Dataflow also charges by the amount of bytes shuffled if you use the shuffle service. This should also not be affected by the number of workers.
Can someone explain, using the word count example, why Spark would be faster than MapReduce?
bafna's answer provides the memory side of the story, but I want to add two other important points: the DAG and the ecosystem.
Spark uses "lazy evaluation" to form a directed acyclic graph (DAG) of consecutive computation stages. In this way, the execution plan can be optimized, e.g. to minimize shuffling data around. In contrast, in MapReduce this has to be done manually by tuning each MR step. (This point is easier to understand if you are familiar with execution plan optimization in an RDBMS or the DAG-style execution of Apache Tez.)
The Spark ecosystem has established a versatile stack of components to handle SQL, ML, streaming and graph-mining tasks, whereas in the Hadoop ecosystem you have to install other packages to do each of these things.
And I want to add that, even if your data is too big for main memory, you can still use Spark by choosing to persist your data on disk. Although by doing this you give up the advantages of in-memory processing, you can still benefit from the DAG execution optimization.
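To make this concrete with the word count example from the question, here is a minimal PySpark sketch (the input path is a placeholder): the flatMap/map/reduceByKey calls only build up the DAG, and nothing is read or shuffled until the final action triggers the whole optimized plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Transformations only build up the DAG; nothing executes yet.
counts = (
    sc.textFile("hdfs:///path/to/input.txt")      # placeholder path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

# The action triggers execution of the whole plan, with a single shuffle for
# reduceByKey instead of writing intermediate results to HDFS.
for word, count in counts.take(10):
    print(word, count)
```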
Some informative answers on Quora:
here and here.
I think there are three primary reasons.
The main two reasons stem from the fact that, usually, one does not run a single MapReduce job, but rather a set of jobs in sequence.
One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive, because it incurs roughly three times the size of the dataset in disk I/O (due to replication) and a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations. When the output of an operation needs to be fed into another operation, Spark passes the data directly without writing to persistent storage. This is an innovation over MapReduce that came from Microsoft's Dryad paper, and is not original to Spark.
The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don't need to be read from disk for each operation.
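Here is a minimal sketch of that caching behaviour (the input path and log format are made up): the input is parsed once, cached, and then reused by two different actions without being re-read from disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Parse once, then keep the parsed records in executor memory.
parsed = sc.textFile("hdfs:///path/to/logs").map(lambda line: line.split(","))  # placeholder path
parsed.cache()

# The first action reads from disk and fills the cache; the second is served from memory.
errors = parsed.filter(lambda fields: fields[0] == "ERROR").count()
warnings = parsed.filter(lambda fields: fields[0] == "WARN").count()

print(errors, warnings)
```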
What about Spark jobs that would boil down to a single MapReduce job? In many cases also these run faster on Spark than on MapReduce. The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds with loading JARs, JITing, parsing configuration XML, etc. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes in the single digits of milliseconds.
Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark's shuffle implementation works very similarly to MapReduce's: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.
I am running three MapReduce jobs in sequence (output of one is the input to another) on a Hadoop cluster with 3 nodes (1 master and 2 slaves).
Apparently, the total time taken by the individual jobs to finish on a single-node cluster is less than the above by quite a margin.
What could be the possible reasons? Is it the network latency? It's running on 100Mbps Ethernet network. Will it help if I increase the number of nodes?
I am using Hadoop Streaming and my code is in python2.7.
MapReduce isn't really meant to handle input datasets that small. The MapReduce framework has to determine which nodes will run tasks and then spin up a JVM for each individual Map and Reduce task (the number of tasks depends on the size of your dataset). That usually has a latency on the order of tens of seconds. Shipping non-local data between nodes is also expensive, as it involves sending data over the wire.
For such a small dataset, the overhead of setting up a MapReduce job in a distributed cluster is likely higher than the runtime of the job itself. On a single node you only see the overhead of starting up tasks on a local machine and don't have to copy any data over the network, which is why the job finishes faster on a single machine. If you had multi-gigabyte files, you would see better performance across several machines.
Having a very specific access pattern for my data, I wonder about the expected mapreduce performance of Cassandra. These are my requirements:
- There will be 10 million documents (e.g. JSON, a couple of KB each) in my database.
- There will be occasional updates of the documents.
- Users want to create results from the whole dataset that require processing of each document.
- Users will want to do this in a semi-interactive fashion, trying out the effects of changes they make to the processing of each document. Waiting a couple of minutes for the result is OK.
- Users would like to be able to spend money (scaling up or out) to increase interactive speed if there is a desire to increase processing speed.
- There will not be large user numbers; processing needs to be done a couple of times per hour, maybe.
- Durability is not a primary concern, as the data is replicated from a source system anyway.
This sounds like a good job for Cassandra and MapReduce, but given that MapReduce is not intended to be used semi-interactively but rather as a background job, I wonder what performance I can expect using Cassandra.
My other options are plain MySQL with the documents stored as CLOBs, or partitioned Redis.
Can anyone provide clues on how to estimate the speed possibilities?