Spark History Server ListBucket costs - amazon-web-services

We are using Spark History Server 3.2.1 to monitor our Spark applications.
We have thousands of daily jobs (running on Kubernetes) that write event logs to an S3 bucket (in a dedicated folder).
We use the history-server to analyze and compare completed jobs (incomplete, running jobs never appear in the UI, but that is not a requirement for now).
Recently I noticed an increase in our ListBucket API operations in the AWS billing Cost Explorer. This cost is higher than the StandardStorage cost (the price we pay for storing the data itself). It's up to a few hundred per month!
Running the history-server with DEBUG log level exposed the "problem": every 10s the history-server lists the bucket to get all logs, and then it iterates over each folder to get its contents. So if I want to keep the last 10,000 jobs, I'll have to pay for 10,101 ListBucket requests every 10s!
Here is one example (out of the 10k) reproduced locally with minio as S3:
22/02/20 06:44:31 DEBUG wire: http-outgoing-57 << "<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>local-audience</Name><Prefix>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/</Prefix><KeyCount>2</KeyCount><MaxKeys>5000</MaxKeys><Delimiter>/</Delimiter><IsTruncated>false</IsTruncated><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/appstatus_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.304Z</LastModified><ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag><Size>0</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/events_1_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.136Z</LastModified><ETag>"f91cc774d92c6f6c2ca4d0e1a1e76e13"</ETag><Size>868837</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents></ListBucketResult>"
To make sure that the cost comes from the history-server, I turned it off for a day, and there was no ListBucket charge for that period.
To mitigate the problem (because we still need the history-server), I can set spark.history.fs.update.interval to a higher value (such as 3600s or so). As we check the history-server only once a day, scanning every 10s is overkill and isn't worth it (cost-wise).
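To put rough numbers on the effect of the update interval, here is a small back-of-the-envelope sketch. The 10,101 requests per scan is the figure from above; the price of $0.005 per 1,000 LIST requests is an assumption (check the S3 pricing for your own region), and the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_list_cost(requests_per_scan, interval_s, price_per_1000=0.005, days=30):
    """Estimate the cost of periodic bucket scans over `days` days.

    price_per_1000 is an ASSUMED price per 1,000 LIST requests, not a quoted one.
    """
    scans = days * SECONDS_PER_DAY / interval_s
    return scans * requests_per_scan * price_per_1000 / 1000


# Compare the default 10s interval with a 3600s interval.
for interval in (10, 3600):
    cost = monthly_list_cost(10_101, interval)
    print(f"interval={interval}s -> about ${cost:,.2f} per 30 days")
```

The cost scales linearly with both the number of requests per scan and the number of scans, so raising the interval from 10s to 3600s divides the estimate by 360.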
Why does it scan the completed jobs every time (over and over again) and not only the new jobs? Is there a way to configure this behavior to avoid those ListBucket operations?
If I care only about completed jobs, and assuming I can wait a few minutes to see the list, is there a mode that loads the list only when I log in to the UI (rather than periodically doing it for nothing)?
P.S. - I'm using AWS lifecycle rules to clean this folder every few days (and not the server's cleaning feature), by expiring objects after a few days.

Treewalking in S3 is (a) expensive and (b) horribly slow, especially given that a deep tree scan exists. If you want to fix this and can write Scala code, see if you can fix the server to switch to a deep listing by moving to FileSystem.listFiles(path, true). Yes, that involves coding, but the OSS community depends on everyone fixing their own personal issues and sharing the outcome.
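To illustrate why a deep listing is cheaper, here is a simplified request-count model in Python. It is only a sketch under stated assumptions, not the history server's or the S3 connector's actual code: it assumes each list response returns at most 1,000 keys, and it models the treewalk as one paged listing of the root plus one paged listing per directory.

```python
import math


def treewalk_requests(files_per_dir, page_size=1000):
    """One paged listing of the root, then one paged listing per directory."""
    root_pages = max(1, math.ceil(len(files_per_dir) / page_size))
    dir_pages = sum(max(1, math.ceil(n / page_size)) for n in files_per_dir)
    return root_pages + dir_pages


def deep_listing_requests(files_per_dir, page_size=1000):
    """A single recursive (delimiter-free) listing, paged over all keys."""
    return max(1, math.ceil(sum(files_per_dir) / page_size))


# Hypothetical layout: 10,000 event-log folders with 2 files each.
layout = [2] * 10_000
print(treewalk_requests(layout))      # 10010
print(deep_listing_requests(layout))  # 20
```

In this model the treewalk cost grows with the number of folders, while the deep listing grows only with the total number of keys divided by the page size.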

After digging into this issue, I decided to stop using the "rolling" feature for now - as my application jobs are relatively small.
I removed the:
spark.eventLog.rolling.enabled: true
spark.eventLog.rolling.maxFileSize: 16m
from the spark-submit command and the cost is now back to normal...
I also wrote about it here.
@stevel thanks for your answer - I will try to contribute and fix that! :)

Related

Changing DynamoDB tables from Provisioned to On-Demand throughput at scale

I'm planning to convert 22 DynamoDB tables created through Terraform from Provisioned to On-Demand throughput.
Initial tests for changing this through Terraform showed that it takes about 30 minutes per table, which is too slow and makes using our current deployment pipeline in production a no-go.
I'm trying to speed things up and these are the options I thought of:
1) Create a job that converts a single table, and spawn 22 of those to run concurrently (the Jenkins setup might let 10 run at a time - no control over that). I see this as low risk, but possibly lengthy to run in production.
2) Script the whole thing: convert using the CLI, clean up resources not needed post-conversion, delete the existing Terraform state files, then do a Terraform import of the new DynamoDB resources. This seems much riskier than 1).
3) Something else I haven't thought of...
Outstanding question: Is the provisioned -> per-request conversion sensitive to the amount of stored data?
Looking for opinions, and/or information from folks who might have gone through a similar exercise.
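For reference, the concurrent conversion idea could be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not a tested migration script: the function name and table names are hypothetical, the client is expected to be a boto3 DynamoDB client, and the call only requests the billing mode change (it does not wait for the table to finish updating). The Terraform configuration would still need to be changed to match, or the next apply would try to revert it.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_to_on_demand(client, table_names, max_workers=10):
    """Request PAY_PER_REQUEST billing for each table.

    Returns a dict mapping table name -> None on success, or the exception raised.
    """
    def convert(name):
        try:
            client.update_table(TableName=name, BillingMode="PAY_PER_REQUEST")
            return name, None
        except Exception as exc:  # keep going and report failures per table
            return name, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(convert, table_names))


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   results = convert_to_on_demand(boto3.client("dynamodb"), ["table-a", "table-b"])
```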

Manually trigger an Amazon S3 Inventory Report

Is there any way to manually kick off an Amazon S3 Inventory report job?
I'm working on a project that creates daily inventory reports to another account, but I can't seem to find a way to manually kick off the run. We're in the design / development phase of a data telemetry project and are tweaking our inventory configurations, but having to wait for the daily job to run to see if the configuration satisfies our requirements is really inconvenient and is slowing us down.
Is there a way to manually kick off an inventory report run after making a configuration change? I've tried looking in the API documentation as well as the boto3 documentation, and all I have found is a call to create a bucket inventory configuration, but nothing to actually perform a run.
Thanks,
Bill
As far as I know, the inventory report does not run on demand. It's quite a heavy operation for AWS, since many buckets have billions of objects, so I can understand why they don't provide that service for free.
The AWS CLI can of course be used to get an inventory, but it's incredibly slow (it takes HOURS, if not days, just to list all objects in a bucket of a few million objects). Basically, the only real option for large buckets is custom scripting with parallel execution. There are quite a few open source projects out there that do this.
But since your original question is about the inventory report itself I'm afraid there is no real alternative.
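As an illustration of the "custom scripting with parallel execution" idea, here is a minimal Python sketch that lists several prefixes concurrently. It assumes a boto3 S3 client is passed in and that you already know a set of prefixes that partitions the bucket; the function names and the prefix split are hypothetical, and this is not a replacement for the inventory report.

```python
from concurrent.futures import ThreadPoolExecutor


def list_prefix(client, bucket, prefix):
    """Collect (key, size) for every object under one prefix, following pagination."""
    objects, token = [], None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        page = client.list_objects_v2(**kwargs)
        objects += [(o["Key"], o["Size"]) for o in page.get("Contents", [])]
        if not page.get("IsTruncated"):
            return objects
        token = page["NextContinuationToken"]


def parallel_inventory(client, bucket, prefixes, max_workers=16):
    """List each prefix in its own thread and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(lambda p: list_prefix(client, bucket, p), prefixes)
        return [obj for part in parts for obj in part]


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   rows = parallel_inventory(boto3.client("s3"), "my-bucket", ["logs/", "data/"])
```

The speed-up depends on how evenly the objects are spread across the chosen prefixes, since each prefix is still listed sequentially.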

How to optimize AWS DMS MySql Aurora to Redshift replication?

I've been using AWS DMS to perform ongoing replication from MySQL Aurora to Redshift. However, the ongoing replication is causing a constant 25-30% CPU load on the target. This is because it produces many small files on S3 and loads/processes them non-stop. Redshift is not really designed for handling a large number of small tasks.
In order to optimize, I've made it so that the process starts at the beginning of each hour, waits until the target is in sync, and then stops. So, instead of working continually, it works for 5-8 minutes at the beginning of each hour. Even so, it is still very slow and unoptimized, because it still has to process hundreds of small S3 files, only in a shorter timespan.
Can this be optimized further? Is there a way to tell DMS to buffer these changes for a longer period of time and produce fewer, larger S3 files instead of many small ones? We really don't mind having higher target latency.
The amount of data transferred between Aurora and Redshift is rather small. There are around 20K changes per hour, and we're using a 4-node dc1.large Redshift cluster. It should be able to handle those 20K changes in a matter of seconds, not minutes.
Maybe you can try BatchApplyTimeoutMin and BatchApplyTimeoutMax.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.ChangeProcessingTuning.html
BatchApplyTimeoutMin sets the minimum amount of time in seconds that AWS DMS waits between each application of batch changes. The default value is 1.
You can change the value to 1200, even 3600.
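As a rough sketch, the relevant part of the task settings could be generated like this in Python. The values 1200 and 3600 are the ones suggested above; enabling BatchApplyEnabled alongside them is an assumption on my part, and a real task settings document contains many more sections than shown here.

```python
import json

# Only the fragment relevant to batching; other task settings are omitted.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True,  # assumption: batch apply is what the timeouts control
    },
    "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1200,  # seconds to wait at least between batch applies
        "BatchApplyTimeoutMax": 3600,  # seconds to wait at most before applying a batch
    },
}

task_settings_json = json.dumps(task_settings, indent=2)
print(task_settings_json)
```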
Bump up maxFileSize in the target settings - https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html

AWS Bill Generation Time in GMT

I want to know at what time Amazon updates the reports that are created in S3 buckets. Are they updated at midnight? I want to know the exact time in GMT.
Looking back at 1+ year of billing reports, it doesn't look like you should expect that the billing reports will be generated at a specific time.
This makes perfect sense. Even if some background job was always triggered at some specific time (and this is probably an oversimplification for such a complex billing system), I can't assume even Amazon would be able to guarantee that ALL these background jobs (i.e. for ALL customers) would finish at the same time every time they run.
There is always a different data set + other ongoing workload to consider, which would certainly affect the completion time.
FWIW, I am including a screenshot from my S3 bucket:

Running application on a cluster

Abstract
My processing is done by two console applications (Stage-estimate and Stage-step). Each application processes files on disk, and the files are organized into folders. Each folder represents one step of processing, which is considered complete when all files have been estimated.
As an example, let's say that we are at Step 0 and folder 0 contains the following files:
Folder 0 contains:
000.data
001.data
002.data
...
999.data
We have the data files, and now we need to estimate them. We run the Stage-estimate application 1000 times, which results in the following directory structure:
Folder 0 contains:
000.data
000.estimate
001.data
001.estimate
002.data
002.estimate
...
999.data
999.estimate
Step 0 is now complete: we have all the data/estimate pairs. In order to switch to Step 1, we run the Stage-step application 1000 times, once for every data/estimate pair, and it produces a new set of 1000 *.data files in folder 1. After the Stage-step application has completed, we have a folder 1 with the same structure as we had at Step 0:
Folder 1 contains:
000.data
001.data
002.data
...
999.data
From now on the process repeats until it is canceled.
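The bookkeeping described above (a step is complete once every *.data file has a matching *.estimate file) can be sketched in Python as follows. This is only an illustration of the folder convention; the function names are hypothetical, and how Stage-estimate and Stage-step are actually invoked is not shown.

```python
from pathlib import Path


def pending_estimates(folder):
    """Return the .data files in `folder` that have no matching .estimate yet."""
    folder = Path(folder)
    return sorted(
        data for data in folder.glob("*.data")
        if not data.with_suffix(".estimate").exists()
    )


def step_complete(folder):
    """A step is complete when it has data files and none of them are pending."""
    folder = Path(folder)
    return any(folder.glob("*.data")) and not pending_estimates(folder)
```

Each path returned by pending_estimates is one independent Stage-estimate job, which is what makes the workload easy to spread over many machines.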
The Problem
The Stage-estimate application does some pretty heavy calculations; it consumes 99% of the overall processing power compared to the Stage-step application.
I was planning to use AWS in order to speed things up. I don't want to start inventing special batch files that would call my applications in the way described above; I know that there is special software that does the heavy lifting for scheduling processes and other cluster-related tasks.
Question
I have never dealt with cluster computing. Off the top of my head, I can see that the application parallelizes really nicely and fits into the AWS infrastructure. On the other hand, I'm a complete newbie in the world of cluster computing and I don't know where to start. I have worked with AWS, but with nothing related to cluster computing. I don't know how to organize the flow I've described or how to make it run efficiently, so I would appreciate it if you could point me in the right direction or provide some links to demos / best practices.
Thank you in advance!
__________Edit__________
Based on your comment, you can put all your jobs from stage 0 into a queue and start processing it. You can also add logic that checks whether only a few jobs are left and, if so, tries to add new jobs from stage 1. This would speed up your calculation a bit and give you better resource usage, but it's optional and makes your system more complex.
I suggest using SQS (or SWF) for storing the jobs, S3 for storing the files, and an autoscaling group of Spot instances for the worker nodes.
Unfortunately, Lambda doesn't support C++ at the moment. (Node.js and Java are supported.)
________Original________
AWS supports several concepts which you may consider:
Decoupling: You can use SQS (Simple Queue Service) for job queuing, which gives you a redundant and fault-tolerant job queue. You can have a fleet of worker instances that request jobs from the queue, run them and, when they are finished, delete the job from the queue. If an instance hangs or crashes during the execution of a job, the job goes back to the queue after the timeout period and another instance will execute it again.
Another service is SWF (Simple Workflow Service). This service internally uses SQS queues; with it, you may need less scripting to glue your entire workflow together.
Redundant storage: I would definitely use AWS S3 for storage, because it's cheap and redundant. After the first read, I don't think you need any advanced (file-system-like) features (for example, locking).
Spot instances: For the worker nodes, I would use Spot instances, which are much cheaper. They are only a problem if you always need a really fast answer for your task. (If you are generating daily reports, Spot instances are a perfect solution.)
+1: You may use AWS Lambda functions to run your jobs. You can trigger a Lambda function based on S3 events, for example when you upload a new *.data file. However, Lambda functions cannot run for too long. But if you are able to use Lambda, your entire environment will contain only S3 buckets and Lambda functions. Both are AWS managed services, so your system would be extremely flexible and fault tolerant. I can't give exact details about pricing, but I assume it would be cheaper than running EC2 instances.
Summary: If you can run your estimations in parallel, AWS gives you lots of power and speed (for good money), especially if your load changes during the day.
A good source: White Paper on ‘Cloud Architectures’ and Best Practices of Amazon S3, EC2, SimpleDB, SQS
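The worker pattern described under "Decoupling" can be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the function name and the handle callback are hypothetical, the client is expected to be a boto3 SQS client, and a real worker would loop indefinitely and handle shutdown.

```python
def drain_queue(sqs, queue_url, handle, max_polls=10):
    """Poll the queue, run `handle` on each message body, delete on success.

    Returns the number of messages processed successfully.
    """
    done = 0
    for _ in range(max_polls):
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            try:
                handle(msg["Body"])
            except Exception:
                # Not deleted: the message becomes visible again after the
                # visibility timeout, so another worker can retry it.
                continue
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
            done += 1
    return done


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   drain_queue(boto3.client("sqs"), queue_url, run_stage_estimate)
```

Deleting the message only after the job succeeds is what gives the retry behavior: a crashed worker never deletes its message, so the job reappears in the queue.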