How do I use EMR with spark DataFrames - amazon-web-services

I am running an AWS EMR job with spark. My input data is in my S3 bucket (csv files as .gz).
I am trying to filter multiple input files (one month's worth of data, 1 file = 1 day) by first reading them into a Spark DataFrame, doing some filtering, and writing the result to my S3 bucket.
My problem: I thought Spark DataFrames were already optimized to run on multiple nodes, but when I run my code it only uses one node, resulting in a long computing time.
My code
input_bucket = 's3://my-bucket' # placeholder bucket URI
input_path = '/2019/01/*/*.gz' #reading all january files
spark = SparkSession.builder.appName("Pythonexample").getOrCreate()
df = spark.read.csv(path=input_bucket+input_path, header=True, inferSchema=True)
df = df.drop("Time","Status") #keeping only relevant col
df = df.dropDuplicates()
df.show()
data = return_duplicates(df,'ID') # data = df without unique rows, only duplicates
data.write.format("com.databricks.spark.csv").option("header", "true").save(input_bucket+'/output')
my function
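from pyspark.sql import Window
import pyspark.sql.functions as f
def return_duplicates(df, column):
    # Count rows per value of `column`, then keep only values seen more than once
    w = Window.partitionBy(column)
    return df.select('*', f.count(column).over(w).alias('dupeCount')).where('dupeCount > 1').drop('dupeCount')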
Question: What should I change?
How can I use Map-Reduce or something similar (parallelize()?) with spark dataframes to use multiple nodes and reduce computing time?

Related

Why is AWS athena slower than reading parquet directly?

I have created a table on AWS Athena. It is partitioned both on S3, and Athena. I am now trying to load the table into a pandas dataframe using 2 methods from the awswrangler library: AWS Athena read_sql_query vs reading parquet directly as below:
start = time.time()
df = wr.athena.read_sql_query(sql="SELECT * FROM rumorz_db.ohlcv_1d_partition where coin='ADA'",
                              database="rumorz_db",
                              boto3_session=s3.session)
print(time.time() - start)
start = time.time()
current_data = parquet_util.readParquet(
    file_path=s3.getObjectURI(RUMORZ_DATA_S3_BUCKET, f"market_data/1d_ohlcv_raw.parquet/"),
    partition_filter=lambda x: x["coin"] == 'ADA'
)
print(time.time() - start)
The AWS Athena method takes 6 seconds, while read_parquet takes 2 seconds. I thought Athena was faster than reading parquet directly, not 3 times slower. Is this expected?

Error writing parquet file in s3 with Pyspark

I'm trying to read some tables (parquet files), do some joins, and write the result in parquet format to S3, but I either get an error or the write takes more than a couple of hours.
error:
An error was encountered:
Invalid status code '400' from https://.... with error payload: {"msg":"requirement failed: session isn't active."}
I am able to write other tables as a parquet except for that table.
This is my sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.config("spark.sql.catalogImplementation", "in-memory").getOrCreate()
table1 = spark.read.parquet("s3://.../table1")
table1.createOrReplaceTempView("table1")
table2 = spark.read.parquet("s3://.../table2")
table2.createOrReplaceTempView("table2")
table3 = spark.read.parquet("s3://.../table3")
table3.createOrReplaceTempView("table3")
table4 = spark.read.parquet("s3://.../table4")
table4.createOrReplaceTempView("table4")
Final_table = spark.sql("""
select
a.col1
a.col2
...
d.coln
from
table1 a
left outer join
table2 b
on
cond1
cond2
cond3
left outer join
table3 c
on
...
""")
Final_table.count()
# 3813731240
output_file="s3://.../final_table/"
Final_table.write.option("partitionOverwriteMode", "dynamic").mode('overwrite').partitionBy("col1").parquet(output_file)
To add more detail: I've tried repartition, but it didn't work. I've also tried different EMR clusters, such as:
Cluster1:
Master
m5.24xlarge
Cluster2:
Master
m5.24xlarge
1 core
m5.24xlarge
Cluster3:
Master
m5d.2xlarge
8 cores
m5d.2xlarge
EMR release version
5.29.0
Most spark jobs can be optimized by visualizing their DAG.
In this scenario, if you are able to run the SQL and get the count quickly and all of the time is spent on writing, here are some suggestions:
Since you already know the count of your dataframe, remove the count operation, as it is unnecessary overhead for your job.
You are partitioning the output by col1, so try repartitioning the data by that column first so that the least shuffle is performed at write time.
You can do something like
df.repartition(100, 'col1').write
You can also set the number based on the partition count, if you know it.

AWS Glue Dynamic_frame with pushdown predicate not filtering correctly

I am writing a script for AWS Glue that reads parquet files stored in S3, in which I am creating a DynamicFrame and attempting to use pushDownPredicate logic to restrict the data coming in.
The table partitions are (in order): account_id > region > vpc_id > dt
And the code for creating the dynamic_frame is the following:
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database = DATABASE_NAME,
    table_name = TABLE_NAME,
    push_down_predicate = "dt='" + DATE + "'")
where DATE = '2019-10-29'
However it seems that Glue still attempts to read data from other days. Maybe it's because I have to specify a push_down_predicate for the other criteria?
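If other partition columns should be restricted as well, the predicate is just a Spark SQL expression, so several conditions can be joined with "and". A minimal sketch of building such a string; the helper name build_push_down_predicate and the example values are hypothetical, and it assumes every partition value is compared as a string:

```python
def build_push_down_predicate(filters):
    """Join partition_column -> value pairs into a single predicate string."""
    parts = []
    for column, value in filters.items():
        text = str(value)
        if "'" in text:
            # Keep the sketch simple: refuse values that would need escaping
            raise ValueError(f"unsupported quote in value for {column}")
        parts.append(f"{column}='{text}'")
    return " and ".join(parts)


single = build_push_down_predicate({"dt": "2019-10-29"})
combined = build_push_down_predicate({"region": "eu-west-1", "dt": "2019-10-29"})
```

The resulting string can then be passed as push_down_predicate in the call above.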
As per the comments, the logs show that the date partition column is named "dt", whereas in your table it is referred to by the name "date".
Logs
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY/dt=2019-07-15
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-03
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-08-27
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-29 ...
Your Code
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database = DATABASE_NAME,
    table_name = TABLE_NAME,
    push_down_predicate = "date='" + DATE + "'")
Change the date partition column name to dt in your table, and use the same name in the push_down_predicate parameter in the code above.
I also see extra forward slashes in some of the paths in the logs above. Were these partitions added manually through Athena using the ALTER TABLE command? If so, I would recommend using the MSCK REPAIR TABLE command to load all partitions in the table to avoid such issues. Extra slashes in an S3 path sometimes lead to errors when doing ETL through Spark.
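A quick way to spot such paths is to scan the logged partition locations for empty segments. A minimal sketch in plain Python; the function name is made up for illustration, and the sample paths are copied from the logs in this answer:

```python
def find_double_slash_paths(paths):
    """Return the paths that contain '//' anywhere after the URI scheme."""
    flagged = []
    for path in paths:
        scheme, sep, rest = path.partition("://")
        body = rest if sep else path  # ignore the '//' that belongs to 's3://'
        if "//" in body:
            flagged.append(path)
    return flagged


logged_paths = [
    "s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY/dt=2019-07-15",
    "s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-03",
    "s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-08-27",
]
suspicious = find_double_slash_paths(logged_paths)
```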

How to partition data by datetime in AWS Glue?

The current set-up:
S3 location with json files. All files stored in the same location (no day/month/year structure).
Glue Crawler reads the data in a catalog table
Glue ETL job transforms and stores the data into parquet tables in s3
Glue Crawler reads from s3 parquet tables and stores into a new table that gets queried by Athena
What I want to achieve is the parquet tables to be partitioned by day (1) and the parquet tables for 1 day to be in the same file (2). Currently there is a parquet table for each json file.
How would I go about it?
One thing to mention, there is a datetime column in the data, but it's a unix epoch timestamp. I would probably need to convert that to a 'year/month/day' format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is python code:
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
from awsglue.dynamicframe import DynamicFrame
...
df = dynamicFrameSrc.toDF()
# Parentheses let the chained calls span several lines
repartitioned_with_new_columns_df = (
    df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)
)
dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")
datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "parquet",
    transformation_ctx = "datasink"
)
Note that the from pyspark.sql.functions import col line can give a reference error; this shouldn't be a problem, as explained here.
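To see which year/month/day values a given epoch timestamp ends up with, the same conversion can be checked locally without Spark. This is only a sketch under assumptions: the column holds seconds (not milliseconds), and UTC is used here, whereas Spark's from_unixtime follows the session time zone. The function name is hypothetical.

```python
from datetime import datetime, timezone


def partition_values(epoch_seconds):
    """Map a Unix epoch timestamp (seconds) to year/month/day partition values."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {"year": moment.year, "month": moment.month, "day": moment.day}


start_of_day = partition_values(1572307200)  # 2019-10-29 00:00:00 UTC
end_of_day = partition_values(1572393599)    # 2019-10-29 23:59:59 UTC
```

Both timestamps fall into the same day, which is why partitioning on the derived columns (rather than on the raw timestamp) avoids one partition per file.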
I cannot comment so I am going to write as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*") otherwise schema was empty
df.select("*")
.withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
To achieve this in AWS Glue Studio:
You will need to make a custom function to convert the datetime field to date. There is the extra step of converting it back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (only) DynamicFrame in the collection and convert it to a Spark DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.

Add a partition on glue table via API on AWS?

I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query that data. The problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create a new partition in its properties table. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use batch_create_partition() glue api to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case for which I wrote a python script which does the below -
Step 1 - Fetch the table information and parse the necessary information from it which is required to register the partitions.
# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)
# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information to create a single partition. All dictionaries have the same structure, but their partition-specific values change (year, month, day, hour).
def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # Getting end date till which partitions needs to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))
    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for 24 hours for a day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure that hour is in two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
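The date_range helper used above is not shown in the script. One possible implementation is sketched below; whether the end date should be included is an assumption on my part (here it is), so adjust it if the last day must be excluded.

```python
from datetime import date, timedelta


def date_range(start_date, end_date):
    """Yield every date from start_date through end_date (inclusive, by assumption)."""
    current = start_date
    while current <= end_date:
        yield current
        current += timedelta(days=1)


days = list(date_range(date(2019, 12, 30), date(2020, 1, 2)))
```

Because the dates are datetime.date objects, str(input_date) has the form YYYY-MM-DD, which is what the slicing in the loop above relies on.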
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over it.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
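The break_list_into_chunks helper called in Step 3 is not shown either. A minimal sketch of what it could look like, assuming it only needs to slice a list into pieces of at most chunk_size items:

```python
def break_list_into_chunks(items, chunk_size):
    """Yield consecutive slices of `items`, each holding at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# 250 partition inputs would be sent as three API calls
chunk_sizes = [len(chunk) for chunk in break_list_into_chunks(list(range(250)), 100)]
```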
You can configure your Glue crawler to be triggered every 5 minutes.
You can create a lambda function which will either run on schedule, or will be triggered by an event from your bucket (eg. putObject event) and that function could call athena to discover partitions:
import boto3
athena = boto3.client('athena')
def lambda_handler(event, context):
    # Ask Athena to scan the table location and register any missing partitions
    athena.start_query_execution(
        QueryString = "MSCK REPAIR TABLE mytable",
        ResultConfiguration = {
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
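If the statement is generated by code rather than typed by hand, it can be assembled from the partition values. A rough sketch; the function name and input layout are my own, values are assumed to be plain strings without quotes, and IF NOT EXISTS is added (it is not in the manual's example) so that re-running the statement for an existing partition does not fail:

```python
def build_add_partition_sql(table, partitions):
    """Build one ALTER TABLE ... ADD PARTITION statement.

    `partitions` is a list of (values, location) pairs, where `values`
    maps partition column names to their string values.
    """
    clauses = []
    for values, location in partitions:
        spec = ", ".join(f"{column} = '{value}'" for column, value in values.items())
        clauses.append(f"PARTITION ({spec}) LOCATION '{location}'")
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n" + "\n".join(clauses)


sql = build_add_partition_sql(
    "orders",
    [({"dt": "2016-05-14", "country": "IN"}, "s3://mystorage/path/to/INDIA_14_May_2016")],
)
```

The resulting string can be used as the QueryString in a start_query_execution call like the one in the Lambda example.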
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
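As a starting point for such a function, the partition values can be read from the object key in the notification. The sketch below assumes Hive-style keys (column=value folders), which may not match your layout; the function name and sample event are made up, and only the Records / s3 / bucket / object fields of the event are used.

```python
from urllib.parse import unquote_plus


def partitions_from_s3_event(event):
    """Return (bucket, partition values) for each object in an S3 notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        values = {}
        for segment in key.split("/")[:-1]:  # the last segment is the file name
            name, sep, value = segment.partition("=")
            if sep:
                values[name] = value
        results.append((bucket, values))
    return results


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "mystorage"},
                "object": {"key": "orders/dt%3D2016-05-14/country%3DIN/part-0000.json"}}}
    ]
}
found = partitions_from_s3_event(sample_event)
```

The extracted values could then be registered through the Glue API or an ALTER TABLE ADD PARTITION query, as described in the other answers.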
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you avoid crawling all the data again and again. From what I read, you can define incremental crawls while setting up your crawler or when editing an existing one. One thing to note, however, is that incremental crawls require the schema of new data to be more or less the same as the existing schema.