Why Impala spend a lot of time Opening HDFS File (TotalRawHdfsOpenFileTime)? - hdfs

I find that my Impala swarm performs not stable, normally it takes only a few seconds (less than 10s) to finish a query, but occasionally it will take more than 40s (and this situation will last for a few minutes), and when that happens, accroding to the profile, TotalRawHdfsOpenFileTime is very high, which implies most of the time is spend on opening HDFS file.
So what is the possible reason and how can I solve it?

This is time spent opening files. If you're querying HDFS, this often means that it's spending time fetching data from the namenode.
We saw dramatic improvements in a lot of production deployments running into this bottleneck by enabling file handle caching - https://docs.cloudera.com/documentation/enterprise/5-15-x/topics/impala_scalability.html#scalability_file_handle_cache

Related

Spark History Server ListBucket costs

We are using Spark history 3.2.1 to monitor our Spark applications.
We have thousands of daily jobs (running on Kubernetes) that writes event logs to S3 bucket (in a dedicated folder).
We are using history-server to analyze and compare completed jobs (incomplete running jobs never appeared in the UI but it's not a requirement now).
Recently I've noticed increase in our ListBucket API Operation in AWS billing cost explorer. This cost is higher than the cost of the StandardStorage (the price we pay for storing the data itself). It's up to few hundreds per month!
Running history-server with DEBUG log level exposed the "problem": every 10s the the history-server list the bucket to get all logs and then it iterate over each folder to get it's content. So if I want to keep the last 10,000 jobs, I'll have to pay for 10,101 ListBucket requests every 10s!
Here is one example (out of the 10k) reproduced locally with minio as S3:
22/02/20 06:44:31 DEBUG wire: http-outgoing-57 << "<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>local-audience</Name><Prefix>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/</Prefix><KeyCount>2</KeyCount><MaxKeys>5000</MaxKeys><Delimiter>/</Delimiter><IsTruncated>false</IsTruncated><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/appstatus_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.304Z</LastModified><ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag><Size>0</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/events_1_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.136Z</LastModified><ETag>"f91cc774d92c6f6c2ca4d0e1a1e76e13"</ETag><Size>868837</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents></ListBucketResult>"
To ensure that the cost comes from history-server I turned it off for a day and there was no charge per ListBucket since then:
To mitigate the problem (because we still need the history-server), I can set the spark.history.fs.update.interval to higher number (such as 3600s or so). As we are checking the history-server once a day it is overkill and doesn't worth it (cost wise).
Why does it scan the completed jobs every time (over and over again) and not only new jobs? is there a way to configure such behavior to avoid those ListBucket operations?
If I care only for completed jobs, and assuming I can wait few minutes to see the list, is there a mode that can load the list only when I login to the UI? (rather than periodically doing it for nothing).
P.S - I'm using AWS lifecycle rules to clean this folder every few few days (and not the server cleaning feature), by expiration objects after few days.
treewalking in s3 is (a) expensive and (b) horribly slow, especially given that a deep tree scan exists. If you want to fix this and can write scala code, see if you can fix the server to switch to a deep listing by moving to FileSystem.listFiles(path, true). Yes that involves coding, but the OSS community depends on everyone fixing their own personal issues and sharing the outcome
After digging into this issue, I decided to stop using the "rolling" feature for now - as my application jobs are relatively small.
I removed the:
spark.eventLog.rolling.enabled: true
spark.eventLog.rolling.maxFileSize: 16m
from the spark-submit command and the cost is now back to normal...
I also wrote about it here.
#stevel thanks for your answer - I will try to contribute and fix that! :)

Athena Query Queuing Time: What affects it?

I've started using Athena recently and it appears useful. However, one thing that bugs me is that Query Queuing times can sometimes be very long (around a minute). At other times, queries are executed almost immediately.
I have not been able to identify reasons why queries sometimes queue for so long, and not at other times. The only thing I noticed is that Table Creation and other DDL statements don't queue for long.
What are the factors that affect queuing time? Server load? Query length? Query complexity?
How can I reduce queuing time? There's no information on this available in the documents as far as I'm aware.
Around a minute it's not that long. We had a few weeks during which we had a random queuing time up to 10m. After a lot of back-and-forth with the support they finally tweaked something and queuing time was reduced to up to 1m, with average 10s.
The queuing is unrelated to your specific query or even "max concurrent queries" settings on your account, it's related to global region Athena load and many other hidden settings that AWS engineers can tweak.

How to optimize AWS DMS MySql Aurora to Redshift replication?

I've been using AWS DMS to perform ongoing replication from MySql Aurora to Redshift. However, the ongoing replication is causing constant 25-30% CPU load on the target. This is because it produces many small files on S3 and loads/processes them non-stop. Redshift is not really designed for handling large number of small tasks.
In order to optimize, i've made it so that the process starts at the beginning of each hour, waits till the target is in-sync, and then stops. So, instead of working continually, it works for 5-8 minutes at the beginning of each hour. Even so, it is still very slow and unoptimized because it still has to process hundreds of small s3 files, only in shorter timespan.
Can this be optimized further? Is there a way to tell DMS to buffer these changes for larger period of time, and not produce fewer larger instead of many small s3 files? We really don't mind having higher target latency.
The amount of data transferred between Aurora and Redshift is rather small. There are around ~20K changes per hour, and we're using 4-node dc1.large redshift cluster. It should be able to handle those 20K changes in matter of seconds, not minutes
maybe, you can try BatchApplyTimeoutMin and BatchApplyTimeoutMax.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.ChangeProcessingTuning.html
BatchApplyTimeoutMin sets the minimum amount of time in seconds that AWS DMS waits between each application of batch changes. The default value is 1.
You can change the value to 1200, even 3600.
Bump up maxFileSize in the target settings - https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html

Cloud ML: Varying training time taken for the same data

I am using Google Cloud ML to for training jobs. I observe a peculiar behavior in which I observe varying time taken for the training job to complete for the same data. I analyzed the CPU and Memory utilization in the cloud ML console and see very similar utilization in both the cases(7min and 14mins).
Can anyone let me know what would be the reason for the service to take inconsistent time for the job to complete.
I have the same parameters and data in both the cases and also verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also would it matter that I schedule simultaneous multiple independent training job on the same project, if so then would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There's no VM sharing between multiple jobs, so it's unlikely caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters RUNNING state. Job startup latency varies depending on it's cold or warm start (i.e., we keep the VMs from previous job running for a while).

Strategies for optimizing the performance of IO intesive jobs on AWS

I wrote a script that analyzes a lot of files on an AWS cluster.
Running it on the cloud seems to be slower than I expected - the filesystem is shared via NFS, so the round-trip through the network seems to be the limiting step here. Bottom line - the processing power of the cluster is limited by the speed of the internal network which is considerably slower than the speed of the SSD the data is located in.
How would you optimize the cluster so that IO intensive jobs will run efficiently?
There isn't much you can do given the circumstances.
Obviously the speed of the NFS itself is the drawback.
Consider:
Chunking - grab only pieces of the files required to do as much as possible
Copying locally - create a locking mechanism and copy the file in full locally, process, push back. This can require a lot of work (what if the worker gives up and doesnt clear the lock)
Optimize the NFS share - increase IO throughput by clustering the NFS, raiding it, etc.
With a remote FS you want to limit the amount of back and forth. You can be creative, but creative can be a problem in itself.