Read data effectively in AWS EMR Spark cluster

I am trying to get acquainted with Amazon Big Data tools and I want to preprocess data from S3 for eventually using it for Machine Learning.
I am struggling to understand how to effectively read data into an AWS EMR Spark cluster.
I have a Scala script which takes a long time to run; most of that time is spent in Spark's explode + pivot on my data and then writing the result out with Spark-CSV.
But even just reading the raw data files takes more time than I would expect.
Then I created a script that only reads data with sqlContext.read.json() from 4 different folders (0.18 MB, 0.14 MB, 0.0003 MB and 399.9 MB of data respectively). I wrapped each read call with System.currentTimeMillis() to see how long it takes (a sketch of the measurement is shown after the table below), and with 4 different instance settings the results were the following:
Folder   | m1.medium (1) | m1.medium (4) | c4.xlarge (1) | c4.xlarge (4)
Folder 1 | 00:00:34.042  | 00:00:29.136  | 00:00:07.483  | 00:00:06.957
Folder 2 | 00:00:04.980  | 00:00:04.935  | 00:00:01.928  | 00:00:01.542
Folder 3 | 00:00:00.909  | 00:00:00.673  | 00:00:00.455  | 00:00:00.409
Folder 4 | 00:04:13.003  | 00:04:02.575  | 00:03:05.675  | 00:02:46.169
The number in parentheses after the instance type indicates how many nodes were used: 1 is master only, and 4 is one master plus 3 slaves of the same type.
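For reference, a minimal sketch of that measurement; the original script is Scala, but the same read call looks like this in PySpark (the bucket and folder paths below are placeholders):

import time
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="read-timing")
sqlContext = SQLContext(sc)

# Placeholder S3 prefixes standing in for the four folders
folders = [
    "s3://my-bucket/folder1/",
    "s3://my-bucket/folder2/",
    "s3://my-bucket/folder3/",
    "s3://my-bucket/folder4/",
]

for path in folders:
    start = time.time()
    df = sqlContext.read.json(path)  # triggers a scan of the data to infer the schema
    print("%s took %.3f s" % (path, time.time() - start))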
Firstly, it is odd that reading the first two, similarly sized folders takes noticeably different amounts of time.
But more importantly, how can it take several seconds to read less than 1 MB of data?
A few days ago I had 1800 MB of data, and my data-processing job on c4.xlarge (4 nodes) ran for 1.5 hours before it failed with this error:
controller log:
INFO waitProcessCompletion ended with exit code 137 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 4870 seconds
2016-07-01T11:50:38.920Z INFO Step created jobs:
2016-07-01T11:50:38.920Z WARN Step failed with exitCode 137 and took 4870 seconds
stderr log:
16/07/01 11:50:35 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[21] at json at DataPreProcessor.scala:435)
16/07/01 11:50:35 INFO TaskSchedulerImpl: Adding task set 4.0 with 24 tasks
16/07/01 11:50:36 WARN TaskSetManager: Stage 4 contains a task of very large size (64722 KB). The maximum recommended task size is 100 KB.
16/07/01 11:50:36 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5330, localhost, partition 0,PROCESS_LOCAL, 66276000 bytes)
16/07/01 11:50:36 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 5331, localhost, partition 1,PROCESS_LOCAL, 66441997 bytes)
16/07/01 11:50:36 INFO Executor: Running task 0.0 in stage 4.0 (TID 5330)
16/07/01 11:50:36 INFO Executor: Running task 1.0 in stage 4.0 (TID 5331)
Command exiting with ret '137'
This data doubled in size over the weekend. So if I now get ~1 GB of new data each day (and it will grow soon, and fast), then I will hit big-data sizes very soon, and I really need an efficient way to read and process the data quickly.
How can I do that? Is there anything I am missing? I can upgrade my instances, but it does not seem normal to me that reading 0.2 MB of data with 4x c4.xlarge (4 vCPU, 16 ECU, 7.5 GiB memory) instances takes 7 seconds, even with the data schema being inferred automatically for ~200 JSON attributes.
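One change that usually helps JSON reads is supplying the schema explicitly instead of letting Spark infer it, since inference forces an extra pass over the data. A minimal sketch in PySpark, with placeholder field names standing in for the real ~200 attributes:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder fields; in practice this would list the ~200 real JSON attributes
schema = StructType([
    StructField("id", LongType(), True),
    StructField("event_type", StringType(), True),
    StructField("payload", StringType(), True),
])

# No inference pass: Spark reads the files once with the given schema
df = sqlContext.read.schema(schema).json("s3://my-bucket/folder4/")

The schema does not have to be hand-written; it can be inferred once from a small sample and then reused for every subsequent read.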

Related

Why does my Python app always cold start twice on AWS lambda?

I have a Lambda, in Python, where I am loading a large machine learning model during the cold start. The code is something like this:

from uuid import uuid4

from fastapi import FastAPI
from mangum import Mangum

# app_logger and endpoints are defined elsewhere in the application (myapp package)

uuid = uuid4()
app_logger.info("Loading model... %s" % uuid)
endpoints.embedder.load()

def create_app() -> FastAPI:
    app = FastAPI()
    app.include_router(endpoints.router)
    return app

app_logger.info("Creating app... %s" % uuid)
app = create_app()
app_logger.info("Loaded app. %s" % uuid)

handler = Mangum(app)
The first time after deployment, AWS Lambda seems to start the Lambda twice as seen by the two different UUIDs. Here are the logs:
2023-01-05 21:44:40.083 | INFO | myapp.app:<module>:47 - Loading model... 76a5ac6f-a4fc-490e-b21c-83bb5ef458eb
2023-01-05 21:44:42.406 | INFO | myapp.embedder:load:31 - Loading embedding model
2023-01-05 21:44:50.626 | INFO | myapp.app:<module>:47 - Loading model... c633a9c6-bcfc-44d5-bacf-9834b39ee300
2023-01-05 21:44:51.878 | INFO | myapp.embedder:load:31 - Loading embedding model
2023-01-05 21:45:00.418 | INFO | myapp.app:<module>:59 - Creating app... c633a9c6-bcfc-44d5-bacf-9834b39ee300
2023-01-05 21:45:00.420 | INFO | myapp.app:<module>:61 - Loaded app. c633a9c6-bcfc-44d5-bacf-9834b39ee300
This happens consistently. It runs for about 10 seconds the first time, then seems to restart and do it all again. There are no errors in the logs that indicate why. I have my Lambda configured with 4 GB of memory, and it always loads with < 3 GB used.
Any ideas why this happens and how to avoid it?
To summarize all the learnings in the comments so far:
AWS limits the init phase to 10 seconds. This is explained here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html
If the init phase exceeds 10 seconds, the function gets initialized again, this time without the limit.
If you hit the 10-second limit, there are two ways to deal with it:
Initialize the model lazily inside the handler, during the first invocation (see the sketch after this answer). The downsides are that you don't get the CPU boost and the lower-cost initialization of the init phase.
Use provisioned concurrency. Init is not limited to 10 seconds, but this is more expensive and can still run into the same problems as not using it, e.g. if you get a burst in usage.
Moving my model to EFS does improve startup time compared to S3 and Docker layer caching, but it is not sufficient to make it init in < 10 seconds. It might work for other use cases with slightly smaller models though.
Perhaps someday SnapStart will address this problem for Python. Until then, I am going back to EC2.
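For the first workaround (lazy initialization), a minimal sketch of what it could look like with the question's FastAPI + Mangum setup; endpoints and app_logger are the application's own modules from the question, and the rest is an assumed structure, not the author's actual code:

from fastapi import FastAPI
from mangum import Mangum

# from myapp import endpoints, app_logger  # application-specific modules from the question (exact paths unknown)

_model_loaded = False

def _ensure_model_loaded():
    # Defer the slow model load from the init phase (capped at 10 s)
    # to the first invocation, which has no such cap.
    global _model_loaded
    if not _model_loaded:
        app_logger.info("Loading model lazily on first invocation")
        endpoints.embedder.load()
        _model_loaded = True

app = FastAPI()
app.include_router(endpoints.router)
_asgi_handler = Mangum(app)

def handler(event, context):
    _ensure_model_loaded()
    return _asgi_handler(event, context)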

Cross join ends up giving "No space left on device"

I am trying to cross join two data frames, apply a few transformations, and finally write the result to a temporary S3 location. But I always end up with the "No space left on device" error below. It looks like it is due to calling spill(). Could you please help me overcome this error with the correct configuration?
Configuration details:
Cluster: AWS EMR cluster
CORE nodes: 2 initially, scaling up to 15 nodes.
TASK nodes: 0 initially, scaling up to 15 on an ON-DEMAND basis.
Instance type: r4.2xlarge (8 cores, 61 GB RAM, 128 GB EBS)
Dataframe1 & Dataframe2 partitions: 26 partitions each.
Dataframe1 record count: 115580
Dataframe2 record count: 94191
Dataframe1 column count: 53 (1 column holding JSON data)
Dataframe2 column count: 36
spark.sql.shuffle.partitions: 500
"spark.executor.memoryOverhead": "4852"
"spark.driver.memoryOverhead": "4852"
Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 63 in stage 68.0 failed 4 times, most recent failure: Lost task 63.3 in stage 68.0 (TID 1640) (ip-10-66-199-71.ec2.internal executor 44):
org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter#7ea8a25 : No space left on device
Thanks in Advance..!!
Sekhar
It's a common issue, and AWS provides official documentation on how to solve it:
How do I resolve "no space left on device" stage failures in Spark on Amazon EMR?
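Beyond the AWS article, one thing worth trying for this specific query (my own suggestion, not taken from the documentation): with Dataframe2 at only ~94k rows, broadcasting it turns the cross join into a broadcast nested-loop join and avoids the shuffle whose spill files are filling the disks. A rough PySpark sketch with placeholder names:

from pyspark.sql.functions import broadcast

# df1 (~115k rows) and df2 (~94k rows) are placeholders for the two data frames.
# Broadcasting the smaller frame avoids the disk-backed shuffle, as long as it
# fits comfortably in executor memory.
result = df1.crossJoin(broadcast(df2))
result.write.mode("overwrite").parquet("s3://my-bucket/tmp/cross-join-output/")  # placeholder temp S3 path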

XGBoostError: rabit/internal/utils.h:90: Allreduce failed - Error while attempting XGboost on Dask Fargate Cluster in AWS

Overview: I'm trying to run an XGBoost model on a bunch of Parquet files sitting in S3 using Dask, by setting up a Fargate cluster and connecting it to a Dask cluster.
The total dataframe size comes to about 140 GB of data. I scaled up a Fargate cluster with the following properties:
Workers: 39
Total threads: 156
Total memory: 371.93 GiB
So there should be enough memory to hold the data tasks. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does push the task bytes per worker a little high, but never above the threshold where it would fail.
Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I attempted this with a single file of only 17 MB of data, I would still get this error, just with only a couple of workers dying. Does anyone know why this happens, given that I have more than double the memory of the dataframe?
import xgboost as xgb
# `client` is an existing dask.distributed Client connected to the Fargate-backed cluster

X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)

AWS GlueJob Error - Command failed with exit code 137

I am executing an AWS Glue job with Python shell. It fails inconsistently with the error "Command failed with exit code 137", and sometimes executes perfectly fine with no changes.
What does this error signify? Are there any changes we can do in the job configuration to handle the same?
Error Screenshot
Exit code 137 means the process received SIGKILL, which in Glue almost always indicates it ran out of memory. Adding the worker type to the Job Properties will resolve the issue. Based on the file size, select the worker type as below:
Standard – When you choose this type, you also provide a value for Maximum capacity. Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs and jobs that run ML transforms.

Spark 1.5 - unexpected number of executors

I am running a Spark job on AWS (cr1.8xlarge instances, 32 cores with 240 GB of memory per node) with the following configuration:
(The cluster has one master and 25 slaves, and I want each slave node to have 2 executors.)
However, the job tracker shows only 25 executors:
Why does it have only 25 executors when I explicitly asked it to create 50? Thanks!
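For context, a sketch of the kind of settings that would request 2 executors per slave (50 total) on Spark 1.5 with YARN; the specific numbers below are assumptions, since the original configuration is not shown, and have to be sized so that two executors plus overhead fit within one node's 32 cores and 240 GB of memory:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("two-executors-per-node")
        .set("spark.executor.instances", "50")             # equivalent to --num-executors 50
        .set("spark.executor.cores", "15")                 # 2 x 15 cores per 32-core node
        .set("spark.executor.memory", "90g")               # 2 x (90g + overhead) must fit under YARN's per-node limit
        .set("spark.dynamicAllocation.enabled", "false"))  # a fixed executor count needs this off

sc = SparkContext(conf=conf)

One common reason for seeing only 25 executors is that YARN's per-node memory limit (yarn.nodemanager.resource.memory-mb) is set too low to fit two executors plus their overhead, so YARN grants only one container per node.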