Strategies for optimizing the performance of IO intesive jobs on AWS - amazon-web-services

I wrote a script that analyzes a lot of files on an AWS cluster.
Running it on the cloud seems to be slower than I expected - the filesystem is shared via NFS, so the round-trip through the network seems to be the limiting step here. Bottom line - the processing power of the cluster is limited by the speed of the internal network which is considerably slower than the speed of the SSD the data is located in.
How would you optimize the cluster so that IO intensive jobs will run efficiently?

There isn't much you can do given the circumstances.
Obviously the speed of the NFS itself is the drawback.
Consider:
Chunking - grab only pieces of the files required to do as much as possible
Copying locally - create a locking mechanism and copy the file in full locally, process, push back. This can require a lot of work (what if the worker gives up and doesnt clear the lock)
Optimize the NFS share - increase IO throughput by clustering the NFS, raiding it, etc.
With a remote FS you want to limit the amount of back and forth. You can be creative, but creative can be a problem in itself.

Related

Does Dask communicate with HDFS to optimize for data locality?

In Dask distributed documentation, they have the following information:
For example Dask developers use this ability to build in data locality
when we communicate to data-local storage systems like the Hadoop File
System. When users use high-level functions like
dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to
the HDFS name node, finds the locations of all of the blocks of data,
and sends that information to the scheduler so that it can make
smarter decisions and improve load times for users.
However, it seems that the get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask regarding to HDFS ? Is it sending computation to nodes where data is local ? Is it optimizing the scheduler to take into account data locality on HDFS ?
Quite right, with the appearance of arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since arrow's implementation doesn't include the get_block_locations() method.
However, we already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was perfectly adequate that it made little practical difference in most workloads. The extra constrains on the size of the blocks versus the size of the partitions you would like in-memory created an additional layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure) where it didn't matter which worker accessed which part of the data.
In short, yes the docs should be updated.

Optimization of the google dataproc cluster

I am using the dataproc cluster for spark processing. I am new to whole google cloud stuff. In our application we have 100s of jobs which uses dataproc. With every job we spawn new cluster and terminate it once the job is over. I am using pyspark for processing purpose.
Is it safe to use hybrid of stable node and pre-emptible nodes for the cost reduction?
What is the best software configuration for improving the performance of the dataproc cluser. I am aware of the in-house infrastructure optimisation of hadoop/spark cluster. Is it applicable as it is for dataroc cluster or something else is needed?
Which instance type is best suit for dataproc cluster when we are processing avro formatted data around 150GB of size.
I have tried spark's dataframe caching / persist for time optimization. But it was not that useful. Is there any way to instruct spark that entire resources (memory, processing power) belong to this job so that it can process it faster?
Does reading and writing back to GCS bucket have a performance hit? If yes, is there any way to optimize it?
Any help in time and price optimisation is appreciated. Thanks in advance.
Thanks
Manish
Is it safe to use hybrid of stable node and pre-emptible nodes for the cost reduction?
That's absolutely fine. We've used that on 300+ node clusters, only issues were with long-running clusters when nodes were getting preempted, and jobs were not optimised to account for node reclamation (no RDD replication, huge long-running DAGs). Also Tez does not like preemptible nodes getting reclaimed.
Is it applicable as it is for dataroc cluster or something else is needed?
Correct. However Google Storage driver has different characteristics when it comes to operation latency (for example, FileOutputCommitter can take huge amounts of time when trying to do recursive move or remove with overpartitioned output), and memory usage (writer buffers are 64 Mb vs 4 Kb on HDFS).
Which instance type is best suit for dataproc cluster when we are processing avro formatted data around 150GB of size.
Only performance tests can help with that.
I have tried spark's dataframe caching / persist for time optimization. But it was not that useful. Is there any way to instruct spark that entire resources (memory, processing power) belong to this job so that it can process it faster?
Make sure to use dynamic allocation and your cluster is sized to your workload. Scheduling tab in YARN UI should show utilisation close to 100% (if not, your cluster is oversized to the job, or you have not enough partitions). In Spark UI, better to have number running tasks close to number of cores (if not, it again might be not enough partitions, or cluster is oversized).
Does reading and writing back to GCS bucket have a performance hit? If yes, is there any way to optimize it?
From throughput perspective, GCS is not bad, but it is much worse in case of many small files, both from reading (when computing splits) and writing (when FileOutputCommitter) perspective. Also many parallel writes can result in OOMs due to bigger write buffer size.

Image downloads using CURL and Linux, maximizing performance?

I'm downloading a large amount of images (.jpg) from various domains and I need to complete this in the fastest time possible. To perform this I'm using CURL along with concurreny threading with std::async, from my linux server (2 CPU, 40GB SSD)
However, I'm aware, that even with threading, my download speed and performance is limited by my server CPU processing and memory capacity.
My questions seem relatively simple:
Do linux servers automatically call-upon both CPU's for high demand processes? I would think, by creating multiple long running threads, both CPU's would be used? Or do I need to go down the route of parallel processing?
In terms of a) language features and b) server configurations, are there any other approaches (other than concurrency and parallelism) that can make these download operations perform faster?

AWS : Splitting software & data in different volumes

AWS recommends keeping data & OS on separate EBS volumes. I have a webserver running on EC2 with an EBS volume. On a bare VM, I install the following:
- webserver, wsgi, pip & related software/config (some in /etc some in /home/<user>)
- server code & static assets in /var/www/
- log files are written to /var/log/<respective-folder>
- maintenance scripts in /home/<user>/
Database server is separate. For a webserver, which of the above items would benefit from higher IOPS and for which ones it doesn't matter ? My understanding is that the server code & log files should be moved to a separate EBS volume with higher IOPS. Or should I just move all of my stuff (except the softwares I installed in /etc i.e. webserver) to a separate volume with better IOPS ?
I would recommend that you have a separate EBS volume for code, logs, and maintenance in the case that you need to move it to another server. That allows you a faster TTR (time to resolution), than having to build an entire server.
The code shouldn't be changing largely past deployment, so I would focus on a general purpose SSD here, and look towards a caching layer (Varnish (full page caching) & CDN (static assets) ) more than having disk I/O issues. A CDN is a quick win and mitigates most the I/O for reading static assets. At 50GB, you get 150 IOPS, and with the mitigate of the static assets; the I/O should be fine.
As for logs, if you are a high traffic site, then you should definitely focus on I/O here as you don't want have blocking I/O here. This is mainly focusing on access logs more than error logs, as those shouldn't be past ERROR level on a production systems. If you aren't high traffic, then you should be fine with general purpose SSD, at 10GB, you get 30 IOPS, and that is generally enough.
What are your maintenance scripts doing? If they are generating and outputting files, then you could use SSD, but if you need high I/O, you should revisit the code and optimize the code as these disks can get expensive, and that cost is usually wasteful for maintenance that runs intermittently.
As for your web-server, et cetera, that should be based on infrastructure as code, via OpsWorks or Puppet, and doesn't need much in the term of I/O as those are usually memory-based processes once built and deployed.

Nuodb and HDFS as storage

Using HDFS for Nuodb as storage. Would this have a performance impact?
If I understand correctly, HDFS is better suited for batch mode or write once and read many times, types of application. Would it not increase the latency for record to be fetch in case it needs to read from storage?
On top of that HDFS block size concept, keep the file size small that would increase the network traffic while data is being fetch. Am I missing something here? Please point out the same.
How would Nuodb manage these kind of latency gotchas?
Good afternoon,
My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.
First... a mini lesson on NuoDB architecture/layout:
The most basic NuoDB set-up includes:
Broker Agent
Transaction Engine (TE)
Storage Manager (SM) connected to an Archive Directory
Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.
Transaction Engines process incoming SQL requests and manage transactions.
Storage Managers read and write data to and from "disk" (Archive Directory)
All of these components can reside on a single machine, but an optimal set up would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TE's increase performance/speed and additional SM's ensure durability.
Ok, so now lets talk about Storage:
This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:
Local Files System
Amazon Web Services: Simple Storage volume (S3), Elastic Block Storage (EBS)
Hadoop Distributed Files System (HDFS)
So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
And now to finally address the question of latency:
Here's the thing, whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do it's magic divvying up, distributing, retrieving and reassembling data. Add to that discrepancy in network speed, etc.
I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.
Now, all of that said... I haven't mentioned that TE and SM's both have the ability to cache data to local memory. The size of this cache is something you can set, when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near constant stream of communication between all of the processes, to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage Collection also kicks in and clears out atoms in a Least Recently Used order, when the cache grows close to hitting its limit.
All of this helps reduce latency, because the TE's can hold onto the data they reference most often and grab copies of data they don't have from sibling TE's. When they do resort to asking the SM's for data, there's a chance that the SM (or one of its sibling SM's) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.
Whew.. that was a lot and I absolutely glossed over more than a few concepts. These topics are covered in greater depth via the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides, to help explain all of this.
If you'd like to know more about NuoDB or if I didn't quite answer your question.... please reach out to me directly via the NuoDB Community Forums (I respond to posts there, a bit faster).
Thank you,
Elisabete
Technical Support Engineer at NuoDB