AWS Glue Pyspark Parquet write to S3 taking too long - amazon-web-services

I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake of 350 GB+, prepare it, and load it into an S3 bucket partitioned by two columns. I noticed that it takes a really long time (around a day) just to load and write one week of data. There are months of data that need to be written. I tried increasing the number of worker nodes, but that does not seem to fix the problem.
My Glue job currently has 60 G.1X worker nodes.
My SparkConf in the code looks like this:
import pyspark

conf = pyspark.SparkConf().setAll([
    ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"),
    ("spark.speculation", "false"),
    ("spark.sql.parquet.enableVectorizedReader", "false"),
    ("spark.sql.parquet.mergeSchema", "true"),
    ("spark.sql.crossJoin.enabled", "true"),
    ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
    ("spark.hadoop.fs.s3.maxRetries", "20"),
    ("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
])
I believe it does succeed in writing the files into partitions, however it is taking really long to delete all the temporary spark-staging files it created. When I checked the tasks, this seems to take most of the time.
2021-04-22 03:08:50,558 INFO [Thread-6] s3n.S3NativeFileSystem (S3NativeFileSystem.java:rename(1355)): rename s3://<bucket-name>/etl/sessions/.spark-staging-8df58afd-d6b2-4ca0-8611-429125abe2ae/p_date=2020-12-16/geo=u1hm s3://<bucket-name>/etl/sessions/p_date=2020-12-16/geo=u1hm
My write to S3 looks like this:
finalDF.coalesce(50).write.partitionBy('p_date', 'geohash').save(
    "s3://{bucket}/{basepath}/{table}/".format(
        bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet', mode="overwrite")
Any help would be appreciated.

Related

Fastest way to get exact count of rows for a 100GB CSV file stored on S3

What is the fastest way of getting an exact count of rows for a 100GB CSV file stored on Amazon S3 without using Athena nor any Fargate or EC2 VM? I can't use Athena, because the CSV file isn't clean-enough for it. I can't use Fargates or EC2 VMs, because I need a purely serverless solution. I can't use third-party services like Snowflake (native AWS services only).
Also, 100GB is too large to fit within a Lambda function's /tmp (limited to 10GB). I could try to run something like DuckDB (or any other streaming database engine) on a Lambda and scan the entire file with a SELECT COUNT(*) FROM "s3://myBucket/myFile.csv" query, but the Lambda is quite likely to time out: its read bandwidth from S3 is 100MB/s at best, so scanning 100GB would take at least 1,000s, and it cannot run for more than 15 minutes (900s).
I know the approximate size of the file.
Note: I have an inaccurate estimate of the number of rows provided by AWS Glue Data Catalog's crawler, with an error margin of -50%/+100%. This could be used for some kind of iterative or dichotomous process, but I could not figure any out. For example, I tried adding an OFFSET with a value lower than but close to the number of rows to the aforementioned query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000 worked well.
The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request on chunks of 50MB or so.
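A minimal sketch of that idea, assuming S3 Select is available on the account and the CSV has no quoted fields containing line breaks. The helper names (split_ranges, count_rows_in_range, count_rows), the 50 MB chunk size, and the worker count are illustrative choices, not part of the answer above. boto3 is imported lazily so that the range arithmetic can be checked without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 50 * 1024 * 1024  # roughly 50 MB per request, as suggested above


def split_ranges(size, chunk=CHUNK):
    """Split [0, size) into contiguous (start, end) byte ranges, end inclusive."""
    return [(start, min(start + chunk, size) - 1) for start in range(0, size, chunk)]


def count_rows_in_range(s3, bucket, key, start, end):
    """Count the CSV records that start inside one scan range."""
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM s3object",
        # FileHeaderInfo NONE: a header line, if present, is counted as a row.
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
        # A record is attributed to the range in which it starts, so
        # non-overlapping ranges should not double count (assumption based
        # on the documented ScanRange behaviour).
        ScanRange={"Start": start, "End": end},
    )
    payload = b"".join(
        event["Records"]["Payload"] for event in resp["Payload"] if "Records" in event
    )
    return int(payload.decode().strip())


def count_rows(bucket, key, workers=32):
    """Sum the per-range counts, issuing the requests in parallel threads."""
    import boto3  # lazy import: only needed when actually calling AWS

    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda r: count_rows_in_range(s3, bucket, key, *r), split_ranges(size)
        )
        return sum(counts)
```

For a 100 GiB object this issues 2,048 requests, each scanning about 50 MB, so no single request comes anywhere near the Lambda time limit.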
Have you tried AWS S3 Select (https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html)? It lets you run queries on S3 files. I use the service to get basic insight into any file on S3 (provided it can be queried).

Running multiple apache spark streaming jobs

I'm new to Spark streaming and as I can see there are different ways of doing the same thing which makes me a bit confused.
This is the scenario:
We have multiple events (over 50 different events) happening every minute, and I want to do some data transformation, change the format from JSON to Parquet, and store the data in an S3 bucket. I'm creating a pipeline where we get the data and store it in an S3 bucket, and then the transformation happens (Spark jobs). My questions are:
1- Is it a good idea to run a Lambda function which sorts each event type into a separate subdirectory and then read those folders in Spark Streaming? Or is it better to store all the events in the same directory and read that in my Spark Streaming job?
2- How can I run multiple Spark Streaming jobs at the same time? (I tried to loop through a list of schemas and folders, but apparently it doesn't work.)
3- Do I need an orchestration tool (airflow) for my purpose? I need to look for new events all the time with no pause in between.
I'm going to use: Kinesis Firehose -> S3 (data lake) -> EMR (Spark) -> S3 (data warehouse)
Thank you so much beforehand!

Difficulty loading large dataframes from S3 jsonl data into glue for conversion to parquet: Memory Constraints and failed worker spawning

I am attempting to load large datasets from S3 in JSONL format using AWS Glue. The S3 data is accessed through a Glue table projection. Once they are loaded, I save them back to a different S3 location in Parquet format. For the most part this strategy works, but for some of my datasets, the Glue job runs out of memory. On closer inspection, it seems to be trying to load the entire large dataset onto one executor before redistributing it.
I have tried upgrading the worker size to G.1X from standard so that it would have more memory and CPU, and this did not work for me: the script would still crash.
It was recommended I try the techniques for partitioning input data outlined on this page:
https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
I tried to use the groupSize and groupFiles parameters, but these did not work for me: the script would still crash.
Finally, I was recommended to set --config options like in a typical spark environment so that my program could use more memory, but as I am working in AWS I am not able to set the config.
The following code demonstrates what I'm attempting to do:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database=SOURCE_GLUE_DATABASE,
    table_name=SOURCE_TABLE
)
# yapf: disable
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={
        "path": OUTPUT_PATH,
    },
    format="parquet"
)
The expected result (and the one I see for most of my datasets) is that Parquet data is written out to the new S3 location. In the worst cases, looking through the logs reveals that only one worker is being used, despite the job's maximum capacity being set to 10 or more. It seems that it just doesn't want to create more workers, and I can't understand why.

Automate aws Athena partition loading [duplicate]

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. Also, I would like to visualize them in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')
    # Where Athena writes the query results
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }
    # Query execution parameters (named query_context so that the
    # handler's own context argument is not shadowed)
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    query_context = {'Database': 'some_database'}
    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
This adds the newly created partition from your S3 location.
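As a rough sketch, the statement for one run could be built like this, assuming the DATA/YEAR=?/MONTH=?/DATE=? layout from the question and integer partition columns. The function name add_partition_sql, the table name, and the bucket are hypothetical. IF NOT EXISTS is added so that re-running the same hour does not fail, and the column names are backticked because DATE is a reserved word in DDL.

```python
def add_partition_sql(table, base_location, year, month, date):
    """Build an ALTER TABLE ... ADD PARTITION statement for one partition."""
    location = "{}/YEAR={}/MONTH={}/DATE={}/".format(
        base_location.rstrip("/"), year, month, date
    )
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS "
        "PARTITION (`YEAR`={}, `MONTH`={}, `DATE`={}) "
        "LOCATION '{}'"
    ).format(table, year, month, date, location)


# Hypothetical table and bucket names, for illustration only.
sql = add_partition_sql("some_table", "s3://some_bucket/DATA", 2021, 4, 22)
print(sql)
```

The resulting string can be passed as the query string of start-query-execution, in the same way as the MSCK REPAIR TABLE statement earlier in this thread.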
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
There are multiple ways to solve the issue and get the table updated:
1. Call MSCK REPAIR TABLE. This will scan ALL data. It's costly, as every file is read in full (at least it's fully charged by AWS). It's also painfully slow. In short: don't do it!
2. Create partitions on your own by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, as yours is a nicely organised S3 key pattern). There are also downsides to this approach: A) it's hard to maintain; B) all partitions will be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they need to be read out and passed to Athena and EMR's Hadoop infrastructure.
3. Use partition projection. There are two different styles you might want to evaluate. Here's the variant which creates the partitions for Hadoop at query time. This means no Glue catalog entries are sent over the network, and thus large numbers of partitions can be handled more quickly. The downside is that you might 'hit' some partitions that do not exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, no matter whether they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
  ...
)
PARTITIONED BY (
  `YEAR` int,
  `MONTH` int,
  `DATE` int)
...
LOCATION
  's3://DATA/'
TBLPROPERTIES(
  "projection.enabled" = "true",
  "projection.YEAR.type" = "integer",
  "projection.YEAR.range" = "2020,2025",
  "projection.MONTH.type" = "integer",
  "projection.MONTH.range" = "1,12",
  "projection.DATE.type" = "integer",
  "projection.DATE.range" = "1,31",
  "storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
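To see why partition filters matter here, the following toy model expands a location template over integer ranges for YEAR, MONTH and DATE. It only illustrates the behaviour described above and is not Athena's actual implementation; the function name projected_locations and the filters argument are made up for this sketch.

```python
from itertools import product

TEMPLATE = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
RANGES = {"YEAR": (2020, 2025), "MONTH": (1, 12), "DATE": (1, 31)}


def projected_locations(template, ranges, filters=None):
    """Enumerate candidate locations, narrowed by optional equality filters."""
    filters = filters or {}
    axes = []
    for column, (low, high) in ranges.items():
        if column in filters:
            axes.append([filters[column]])  # filter pins this column to one value
        else:
            axes.append(range(low, high + 1))  # otherwise the whole range is projected
    for values in product(*axes):
        location = template
        for column, value in zip(ranges, values):
            location = location.replace("${" + column + "}", str(value))
        yield location


everything = list(projected_locations(TEMPLATE, RANGES))
one_day = list(
    projected_locations(TEMPLATE, RANGES, {"YEAR": 2021, "MONTH": 4, "DATE": 22})
)
print(len(everything), len(one_day))
```

Without any filter the model produces 6 x 12 x 31 = 2,232 candidate locations, including dates that can never exist; with a filter on all three columns it produces exactly one.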
4. Just to list all options: you can also use Glue crawlers. But this doesn't seem to be a favourable approach, as it's not as flexible as advertised.
5. You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts that do the preparation work to set up your table.
In short:
If your application is SQL-centric and you like the leanest approach with no scripts, use partition projection.
If you have many partitions, use partition projection.
If you have few partitions, or the partitions do not have a generic pattern, use approach #2.
If you're script-heavy, and scripts do most of the work anyway and are easier for you to handle, consider approach #5.
If you're confused and have no clue where to start, try partition projection first! It should fit 95% of the use cases.

AWS Elastic Mapreduce optimizing Pig job

I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am getting a feel for how to properly handle jobflows from this issue.
The logfiles in question are stored in s3 with keys that correspond to the dates they are emitted from the logging server, eg: /2013/03/01/access.log. These files are very, very large. My mapreduce job runs an Apache Pig script that simply examines some of the uri paths stored in the log files and outputs generalized counts that correspond to our business logic.
My client code in boto takes datetimes as input on the CLI and schedules a jobflow with a PigStep instance for every date needed. Thus, passing something like python script.py 2013-02-01 2013-03-01 would iterate over 29 days' worth of datetime objects and create PigSteps with the respective input keys for S3. This means that the resulting jobflow could have many, many steps, one for each day in the timedelta between the from_date and to_date.
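For illustration, the per-day iteration described above might look roughly like this, assuming both end dates are included (which is what gives 29 days for the example arguments). The function name daily_log_keys is made up, and the creation of the actual PigStep objects is left out because that part of the client code is not shown here.

```python
from datetime import date, timedelta


def daily_log_keys(from_date, to_date):
    """Yield one access-log key per day, both endpoints included."""
    day = from_date
    while day <= to_date:
        yield day.strftime("/%Y/%m/%d/access.log")
        day += timedelta(days=1)


keys = list(daily_log_keys(date(2013, 2, 1), date(2013, 3, 1)))
print(len(keys), keys[0], keys[-1])
```

Each key would then become the input of one step, so this example range results in a jobflow with 29 steps.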
My problem is that my EMR jobflow is exceedingly slow, almost absurdly so. It's been running for a night now and hasn't made it even halfway through that example set. Am I doing something wrong by creating many jobflow steps like this? Should I attempt to generalize the Pig script for the different keys instead, rather than preprocessing them in the client code and creating a step for each date? Is this a feasible place to look for an optimization on Elastic MapReduce? It's worth mentioning that a similar job for a month's worth of comparable data passed to the AWS elastic-mapreduce CLI Ruby client took about 15 minutes to execute (this job was fueled by the same Pig script).
EDIT
I neglected to mention that the job was scheduled on two instances of type m1.small, which admittedly may in itself be the problem.