No space left on device error with pyspark aws glue - amazon-web-services

I am using AWS Glue to extract DynamoDB items into S3. I read all the items using PySpark and AWS Glue, apply a transformation to the items retrieved from DynamoDB, and write the result to S3. But I always run into the error "No space left on device."
The worker type I use is G.1X, where each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and the DynamoDB table is 6 GB.
According to the AWS documentation, "During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound to local disk capacity." How can I configure the shuffle programmatically? Please find my sample code below.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Map
from awsglue.transforms import Filter
from pyspark import SparkConf
conf = SparkConf()
glue_context = GlueContext(SparkContext.getOrCreate())
# mytable got id and uri
resources_table_dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "0.4",
        "dynamodb.splits": "8"
    }
)
# Filter out rows whose ids are same
def filter_new_id(dynamicRecord):
    uri = dynamicRecord['Uri']
    uri_split = uri.split(":")
    # Get the internal ID
    internal_id = uri_split[1]
    print(dynamicRecord)
    if internal_id == dynamicRecord['id']:
        return False
    return True
# Keep only the items whose IDs are different.
resource_with_old_id = Filter.apply(
    frame=resources_table_dynamic_frame,
    f=lambda x: filter_new_id(x),
    transformation_ctx='resource_with_old_id'
)
glue_context.write_dynamic_frame_from_options(
    frame=resource_with_old_id,
    connection_type="s3",
    connection_options={"path": "s3://path/"},
    format="json"
)

I addressed this issue with the following tweak in the code posted in OP.
resources_table_dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "0.5",
        "dynamodb.splits": "200"
    },
    additional_options={
        "boundedFiles": "30000"
    }
)
I added boundedFiles, as suggested in the AWS documentation, and increased dynamodb.splits, which made it work for me.
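The filter logic from the question can be checked locally on plain dictionaries before it runs on Glue. The following is only a minimal sketch, not the original poster's code: `has_old_id` is a hypothetical name, the records are assumed to expose `Uri` and `id` keys as in the question, and treating a `Uri` without a colon as "keep" (instead of raising an IndexError) is an assumption of this sketch.

```python
def has_old_id(record):
    """Return True when the ID embedded in record['Uri'] differs from record['id'].

    Assumes, as in the question, that the URI looks like '<prefix>:<internal id>...'.
    """
    parts = record["Uri"].split(":")
    if len(parts) < 2:
        # Assumption: a URI without ':' carries no embedded ID, so keep the record.
        return True
    return parts[1] != record["id"]


if __name__ == "__main__":
    sample = [
        {"id": "42", "Uri": "res:42"},  # IDs match, so this record is dropped
        {"id": "42", "Uri": "res:17"},  # IDs differ, so this record is kept
    ]
    print([r for r in sample if has_old_id(r)])
```

On Glue, the same predicate could presumably be passed as `f=has_old_id` to `Filter.apply`, since the question's code already reads a DynamicRecord with the same key lookups.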

Related

AWS Glue Job - create parquet from Data Catalog Table to S3

I'm struggling with what is supposed to be a simple task (maybe I do not understand what's under the hood in the Glue Data Catalog).
I have an S3 bucket with Parquet files.
I ran a Glue crawler over it to create a catalog table so I could query it in Athena, etc.
I'm trying to create a Glue job that retrieves the data from that catalog table, partitions it, and saves it in another S3 bucket. I receive this exception every time I use the catalog as the source:
Py4JJavaError: An error occurred while calling o497.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 29 in stage 39.0 failed 4 times, most recent failure: Lost task 29.3 in stage 39.0 (TID 437, 172.34.18.186, executor 1): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:57)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:460)
...blabla
However, if I use the initial S3 bucket as the source, it works. I pulled the data from both sources and compared the schemas (they are the same, at least on the surface).
Here is my test glue job code:
import sys
import pandas as pd
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pyspark.sql.functions as f
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import StructField, StructType, StringType,LongType
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
## Read the data
from_catalogue = glueContext.create_dynamic_frame.from_catalog(
    database="ss-demo-database",
    table_name="tlc_green_data",
    transformation_ctx="datasource0",
)
from_s3 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": ["s3://test/tlc_green_data/green_tripdata_2014-02.parquet"],
        "recurse": False,
    },
    transformation_ctx="from_s3",
)
from_s3.printSchema()
from_catalogue.printSchema()
S3_location = "s3://test/demo/"
# Results are stored back to the S3 bucket (this works if from S3, doesn't work if from catalogue)
datasink = glueContext.write_dynamic_frame_from_options(
    frame=from_catalogue,
    connection_type="s3",
    connection_options={
        "path": S3_location
    },
    format="parquet",
    format_options={"compression": "snappy"},
    transformation_ctx="datasink")
job.commit()
I tried converting to a Spark DataFrame and writing with the overwrite option: same thing.
I tried selecting only 10 rows: same thing.
Your help is greatly appreciated!

Advice/Guidance - composer/beam/dataflow on gcp

I am trying to learn/try out cloud composer/beam/dataflow on gcp.
I have written functions to do some basic cleaning of data in python, and used a DAG in cloud composer to run this function to download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke written functionality. I am now trying to figure out how I use beam pipeline and data flow instead and use cloud composer to kick off the dataflow job.
The cleaning I am trying to do, is take a csv input of col1,col2,col3,col4,col5 and combine the middle 3 columns to output a csv of col1,combinedcol234,col5.
Questions I have are...
How do I pull in my own functions within a beam pipeline to do this merge?
Should I be pulling in my own functions, or does Beam have built-in ways of doing this?
How do I then trigger a pipeline from a DAG?
Does anyone have any example code on GitHub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Dataflow job written in Python.
You have to create your Cloud Composer environment;
Add the Dataflow job file to a bucket;
Add the input file to a bucket;
Add the following DAG to the DAGs directory of the Composer environment:
composer_dataflow_dag.py:
import datetime
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago
bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')
default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}
with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file=f'gs://<bucket name>/dataflow-job.py',
        options={
            "runner": "DataflowRunner",
            "project": project_id,
            "region": "us-central1",
            "temp_location": "gs://<bucket name>/",
            "staging_location": "gs://<bucket name>/",
        },
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
Here is the Dataflow job file, modified to concatenate the columns you want:
dataflow-job.py:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import os
from datetime import datetime
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')
bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'
p = beam.Pipeline(options=PipelineOptions())
(
    p
    | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
    | beam.Map(lambda x: x.split(","))
    | beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
    | beam.io.WriteToText(output)
)
p.run().wait_until_finish()
After the job runs, the result is stored in the GCS bucket.
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or which you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
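To make the "pull in my own function" part concrete, here is a minimal sketch of the column merge written as an ordinary Python function that works on one CSV line at a time. The name `merge_middle_columns` is hypothetical, and the five-column layout is taken from the question; the function does not depend on Beam, so it can be tested on its own first.

```python
import csv
import io


def merge_middle_columns(line):
    """Turn 'col1,col2,col3,col4,col5' into 'col1,col2col3col4,col5'."""
    # csv.reader copes with quoted fields that themselves contain commas.
    fields = next(csv.reader([line]))
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    merged = [fields[0], "".join(fields[1:4]), fields[4]]
    out = io.StringIO()
    # lineterminator="" because WriteToText adds its own newline per element.
    csv.writer(out, lineterminator="").writerow(merged)
    return out.getvalue()


if __name__ == "__main__":
    print(merge_middle_columns("a,b,c,d,e"))
```

In a pipeline like the ones above, this would slot in as `| beam.Map(merge_middle_columns)` between `ReadFromText` and `WriteToText`, replacing the two lambda steps.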

loop through multiple tables from source to s3 using glue (Python/Pyspark) through configuration file?

I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are in a configuration file, which is a JSON file. It would be helpful to have code that loops through multiple table names and ingests these tables into S3. The Glue script is written in Python (PySpark).
This is a sample of how the configuration file looks:
{"main_key": {
    "source_type": "rdbms",
    "source_schema": "DATABASE",
    "source_table": "DATABASE.Table_1"
}}
Assuming your Glue job can connect to the database and a Glue Connection has been added to it, here's a sample extracted from my script that does something similar. You would need to update the JDBC URL format to one that works for your database (this one uses SQL Server) and fill in the implementation details for fetching the config file, looping through items, etc.
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from datetime import datetime
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
jdbc_url = f"jdbc:sqlserver://{hostname}:{port};databaseName={db_name}"
connection_details = {
"user": 'db_user',
"password": 'db_password',
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
tables_config = get_tables_config_from_s3_as_dict()
date_partition = datetime.today().strftime('%Y%m%d')
write_date_partition = f'year={date_partition[0:4]}/month={date_partition[4:6]}/day={date_partition[6:8]}'
for key, value in tables_config.items():
    table = value['source_table']
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_details)
    write_path = f's3a://bucket-name/{table}/{write_date_partition}'
    df.write.parquet(write_path)
Just write a normal for loop over your DB configuration, then follow the Spark JDBC documentation to connect to each table in sequence.
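The looping part can be tried without Spark or a database. The sketch below only parses a config shaped like the sample in the question and works out the date-partitioned output path for each table; `build_write_paths` is a hypothetical helper, and the bucket name and path layout are assumptions copied from the sample script above.

```python
import json
from datetime import date


def build_write_paths(config_text, bucket, run_date):
    """Map each source table in the JSON config to a date-partitioned S3 path."""
    config = json.loads(config_text)
    partition = f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}"
    paths = {}
    # Each top-level key holds the details of one table.
    for entry in config.values():
        table = entry["source_table"]
        paths[table] = f"s3a://{bucket}/{table}/{partition}"
    return paths


if __name__ == "__main__":
    sample = '{"main_key": {"source_type": "rdbms", "source_table": "DATABASE.Table_1"}}'
    for table, path in build_write_paths(sample, "bucket-name", date.today()).items():
        print(table, "->", path)
```

Inside the Glue job, each `(table, path)` pair would then feed the `spark.read.jdbc(...)` and `df.write.parquet(...)` calls shown in the script.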

Transferring files into S3 buckets based on Glue job status

I am new to AWS Glue, and my aim is to extract, transform, and load files uploaded to an S3 bucket into an RDS instance. I also need to transfer the files into separate S3 buckets based on the Glue job status (success/failure). More than one file will be uploaded to the initial S3 bucket. How can I get the names of the uploaded files so that I can transfer them to the appropriate buckets?
Step 1: Upload files to S3 bucket1.
Step 2: Trigger a Lambda function to call job1.
Step 3: On success of job1, transfer the file to S3 bucket2.
Step 4: On failure, transfer the file to another S3 bucket.
Have a Lambda event trigger listening to the S3 folder you are uploading the files to.
In the Lambda, use the AWS Glue API to run the Glue job (essentially a Python script in AWS Glue).
In the Glue Python script, use the appropriate library, such as pymysql, packaged as an external library with your Python script.
Perform the data load operations from S3 to your RDS tables. If you are using Aurora MySQL, AWS provides a nice "load from S3" feature, so you can load the file directly into the tables (you may need to do some configuration in the parameter group and IAM roles).
Lambda script to call glue job:
import boto3
s3 = boto3.client('s3')
glue = boto3.client('glue')
def lambda_handler(event, context):
    gluejobname = "<YOUR GLUE JOB NAME>"
    try:
        # Start the Glue job, then read back the state of the run just started
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        raise e
Glue Script:
import mysql.connector
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.context import DynamicFrame
from awsglue.transforms import *
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql import SQLContext
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
url="<RDS URL>"
uname="<USER NAME>"
pwd="<PASSWORD>"
dbase="DBNAME"
def connect():
    conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
    cur = conn.cursor()
    return cur, conn
def create_stg_table():
    cur, conn = connect()
    createStgTable1 = "<CREATE STAGING TABLE IF REQUIRED>"
    loadQry = "LOAD DATA FROM S3 PREFIX 'S3://PATH FOR YOUR CSV' REPLACE INTO TABLE <DB.TABLENAME> FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8) SET ......;"
    cur.execute(createStgTable1)
    cur.execute(loadQry)
    conn.commit()
    conn.close()
You can then create a CloudWatch alert that checks the Glue job status and, depending on the status, performs the file copy operations between the S3 buckets. We have a similar setup in our production.
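For the "how do I get the names of the uploaded files" part, the names arrive in the S3 event passed to the Lambda handler. The following is a rough sketch of pulling them out and choosing a destination from the Glue job run state; `uploaded_objects` and `destination_bucket` are hypothetical helpers, the bucket names are placeholders, and the actual copy between buckets (for example with boto3) is deliberately left out.

```python
from urllib.parse import unquote_plus

# Glue JobRunState values treated here as final outcomes.
SUCCESS_STATES = {"SUCCEEDED"}
FAILURE_STATES = {"FAILED", "TIMEOUT", "STOPPED", "ERROR"}


def uploaded_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((s3_info["bucket"]["name"], key))
    return pairs


def destination_bucket(job_run_state, success_bucket, failure_bucket):
    """Pick the target bucket, or None while the job has not finished."""
    if job_run_state in SUCCESS_STATES:
        return success_bucket
    if job_run_state in FAILURE_STATES:
        return failure_bucket
    return None


if __name__ == "__main__":
    event = {"Records": [{"s3": {"bucket": {"name": "bucket1"},
                                 "object": {"key": "in/report+2020.csv"}}}]}
    for bucket, key in uploaded_objects(event):
        print(bucket, key, "->", destination_bucket("SUCCEEDED", "bucket2", "bucket3"))
```

Because the job is usually still running right after `start_job_run`, the state check would have to happen later, for instance from whatever reacts to the job status change, which is why an unfinished state returns None here.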
Regards
Yuva

ETL from AWS DataLake to RDS

I'm relatively new to data lakes and I'm doing some research for a project on AWS.
I have created a DataLake and have tables generated from Glue Crawlers, I can see the data in S3 and query it using Athena. So far so good.
There is a requirement to transform parts of the data stored in the datalake to RDS for applications to read the data. What is the best solution for ETL from S3 DataLake to RDS?
Most posts I've come across talk about ETL from RDS to S3 and not the other way around.
By creating a Glue Job using the Spark job type I was able to use my S3 table as a data source and an Aurora/MariaDB as the destination.
Trying the same with a Python job type didn't let me see any S3 tables in the Glue job wizard screens.
Once the data is in a Glue DynamicFrame or Spark DataFrame, writing it out is pretty straightforward. Use the RDBMS as the data sink.
For example, to write to a Redshift DB,
// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "preactions" -> "<another SQL queries>",
    "postactions" -> "<some SQL queries>"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
As shown above, use the JDBC connection you've created as the target to write the data to.
You can accomplish that with a Glue Job. Sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
file_paths = ['path']
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': file_paths}, format="csv", format_options={"separator": ",", "quoteChar": '"', "withHeader": True})
df.printSchema()
df.show(10)
options = {
    'user': 'usr',
    'password': 'pwd',
    'url': 'url',
    'dbtable': 'tabl',
}
glueContext.write_from_options(frame_or_dfc=df, connection_type="mysql", connection_options=options)