I am trying to write a custom transform in AWS Glue, but it is outputting empty files to S3. I am new to this and do not understand what I am doing wrong here.
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) frame from the incoming collection
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # df_filtered = df.filter(df["type"] == "ACHDebit")
    dyf_filtered = DynamicFrame.fromDF(df, glueContext, "dyf_filtered")
    return DynamicFrameCollection({"CustomTransform0": dyf_filtered}, glueContext)
My end goal is to encrypt a few columns, but I am not sure how to do it. For now I am trying to add a custom transform that uses a map and then applies KMS encrypt to the DataFrame column. I am happy to take other suggestions as I am very new to this.
Thanks in advance for your help.
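For reference, a rough sketch of the map/UDF approach described above, calling KMS through boto3 (the key ARN and the account_number column are placeholders; creating the client and calling KMS once per value is slow at scale, so an envelope-encryption scheme with a data key is usually preferable in practice):

import base64
import boto3
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

KMS_KEY_ID = "arn:aws:kms:us-east-1:123456789012:key/your-key-id"  # placeholder

def kms_encrypt(value):
    # Encrypt one value with KMS and return it base64-encoded
    if value is None:
        return None
    kms = boto3.client("kms")
    resp = kms.encrypt(KeyId=KMS_KEY_ID, Plaintext=value.encode("utf-8"))
    return base64.b64encode(resp["CiphertextBlob"]).decode("utf-8")

kms_encrypt_udf = udf(kms_encrypt, StringType())

# Inside the custom transform, before converting back to a DynamicFrame:
df = df.withColumn("account_number", kms_encrypt_udf(col("account_number")))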
Related
I am new to AWS Glue and pyspark. I have a table in RDS which contains a varchar field id. I want to map id to a String field in the output json which is inside a json array field (let's say newId):
{
  "sources" : [
    { "newId" : "1234asdf" }
  ]
}
How can I achieve this using the transforms defined in the PySpark script of the AWS Glue job?
Use the AWS Glue Map transformation to map the string field into a field inside a JSON array in the target:

from awsglue.transforms import Map

NewFrame = Map.apply(frame=OldFrame, f=map_fields)

and define the map_fields function like this:
def map_fields(rec):
    # Wrap the original id value in a one-element JSON array of objects
    rec["sources"] = [{"newId": rec["id"]}]
    del rec["id"]
    return rec
Make sure to delete the original field, as done with del rec["id"], otherwise the logic doesn't work.
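As a quick sanity check (the values here are made up), applying the function to a single record gives:

sample = {"id": "1234asdf", "other_field": "x"}
print(map_fields(sample))
# {'other_field': 'x', 'sources': [{'newId': '1234asdf'}]}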
I am running an AWS EMR job with Spark. My input data is in my S3 bucket (gzip-compressed CSV files).
I am trying to filter multiple input files (one month's worth of data, 1 file = 1 day) by first reading them into a Spark DataFrame, doing some filtering, and writing the result back to my S3 bucket.
My problem: I thought Spark DataFrames were already optimized to run on multiple nodes, but when I run my code it only uses one node, resulting in a long computing time.
My code:

from pyspark.sql import SparkSession

input_bucket = 's3://my-bucket'
input_path = '/2019/01/*/*.gz'  # reading all January files

spark = SparkSession.builder.appName("Pythonexample").getOrCreate()

df = spark.read.csv(path=input_bucket + input_path, header=True, inferSchema=True)
df = df.drop("Time", "Status")  # keeping only relevant columns
df = df.dropDuplicates()
df.show()

data = return_duplicates(df, 'ID')  # data = df without unique rows, only duplicates
data.write.format("com.databricks.spark.csv").option("header", "true").save(input_bucket + '/output')
My function:

from pyspark.sql import Window
from pyspark.sql import functions as f

def return_duplicates(df, column):
    # Keep only the rows whose value in `column` occurs more than once
    w = Window.partitionBy(column)
    return df.select('*', f.count(column).over(w).alias('dupeCount')).where('dupeCount > 1').drop('dupeCount')
Question: What should I change?
How can I use MapReduce or something similar (parallelize()?) with Spark DataFrames so that the job uses multiple nodes and the computing time goes down?
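A quick sanity check (not part of the asker's code; the partition count below is only an example) is to look at how many partitions Spark actually created from the input. Gzip files are not splittable, so each .gz file can be read by at most one task, and repartitioning right after the read spreads the later work over more executors:

# How many partitions did the read produce?
print(df.rdd.getNumPartitions())

# If the count is low, spread the data before the expensive steps
df = df.repartition(64)  # target partition count is just an example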
I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this site, which shows data about air pollution: https://airtube.info/
What I think I need to do is the following:
Download data from the API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs. I have this link in mind in particular: "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour." (Technology: Python bot)
Set up a bot to transform the data a little bit to tailor it to our needs; a rough sketch of these first two steps is shown after this list. (Technology: Python)
Upload the data to a database. (Technology: Google Big-Query or AWS)
Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: Probably Dash in Python)
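For the first two items above, a minimal sketch of the download-and-transform bot could look like this. It is only an illustration: the field selection and the fetch_and_flatten name are my own, and the field names are taken from the structure of the hourly JSON as discussed in the answer below.

import requests

URL = "https://data.sensor.community/static/v2/data.1h.json"

def fetch_and_flatten():
    # Download the hourly averages and keep only the fields of interest
    records = requests.get(URL, timeout=60).json()
    rows = []
    for rec in records:
        for value in rec.get("sensordatavalues", []):
            rows.append({
                "sensor_id": rec["sensor"]["id"],
                "timestamp": rec["timestamp"],
                "latitude": rec["location"]["latitude"],
                "longitude": rec["location"]["longitude"],
                "value_type": value["value_type"],
                "value": value["value"],
            })
    return rows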
My questions are the following.
1. Do you agree with my thought process or you would change some element to make it more efficient?
2. What do you think about running a python script to transform the data? Is there any simpler idea?
3. Which technology would you suggest to set up the database?
Thank you for the comments!
Best regards,
Bartek
If you want to do some analysis on your data, I recommend uploading it to BigQuery; once that is done, you can create new queries there and get the results you want to analyze. I was checking the dataset "data.1h.json", and I would create a table in BigQuery using a schema like this one:
CREATE TABLE dataset.pollution
(
  id NUMERIC,
  sampling_rate STRING,
  timestamp TIMESTAMP,
  location STRUCT<
    id NUMERIC,
    latitude FLOAT64,
    longitude FLOAT64,
    altitude FLOAT64,
    country STRING,
    exact_location INT64,
    indoor INT64
  >,
  sensor STRUCT<
    id NUMERIC,
    pin STRING,
    sensor_type STRUCT<
      id INT64,
      name STRING,
      manufacturer STRING
    >
  >,
  sensordatavalues ARRAY<STRUCT<
    id NUMERIC,
    value FLOAT64,
    value_type STRING
  >>
)
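If you prefer to issue that DDL from Python as well, the statement can be run through the BigQuery client. This is only a sketch; the ddl string is meant to hold the full CREATE TABLE statement shown above.

from google.cloud import bigquery

client = bigquery.Client()

# Paste the full CREATE TABLE statement from above into this string
ddl = """
CREATE TABLE dataset.pollution ( ... )
"""

client.query(ddl).result()  # run the DDL and wait for the table to be created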
OK, we have already created our table, so now we need to insert all the data from the JSON file into it. To do that, and since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, transform it, and upload it to the BigQuery table.
The code would be something like this:
from google.cloud import storage
from google.cloud import bigquery
import json

client = bigquery.Client()
table_id = "project.dataset.pollution"

# Instantiate a Google Cloud Storage client and specify the required bucket and file
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')

table = client.get_table(table_id)

# Download the contents of the blob as a string and parse it with json.loads()
data = json.loads(blob.download_as_string(client=None))

# Partition the request in order to avoid reaching quotas
partition = len(data) // 4
cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1

# Insert whatever is left in the last batch
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors:
        print(errors)
As you can see above, I had to split the inserts into batches in order to avoid exceeding the quota on the size of a single request. You can see the relevant quotas here [3].
Also, some data in the location field has empty values, so it is necessary to handle them to avoid errors.
And since you already have your data stored in BigQuery, to create a new dashboard I would use the Data Studio tool [4] to visualize your BigQuery data and create queries over the columns you want to display.
[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library
[2] https://cloud.google.com/storage
[3] https://cloud.google.com/bigquery/quotas
[4] https://cloud.google.com/bigquery/docs/visualize-data-studio
The current set-up:
S3 location with json files. All files stored in the same location (no day/month/year structure).
Glue Crawler reads the data into a catalog table
Glue ETL job transforms and stores the data into parquet tables in s3
Glue Crawler reads from s3 parquet tables and stores into a new table that gets queried by Athena
What I want to achieve is (1) for the parquet data to be partitioned by day and (2) for all of one day's data to end up in the same parquet file. Currently a separate parquet file is written for each JSON file.
How would I go about it?
One thing to mention: there is a datetime column in the data, but it's a Unix epoch timestamp. I would probably need to convert that to a year/month/day format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
...
df = dynamicFrameSrc.toDF()

repartitioned_with_new_columns_df = (
    df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)
)

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "parquet",
    transformation_ctx = "datasink"
)
Note that the from pyspark.sql.functions import col line can give a reference error; this shouldn't be a problem, as explained here.
I cannot comment, so I am going to write this as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*"), otherwise the schema was empty:
df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
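Putting both adjustments together (same column names as in Yuriy's snippet), the block would look roughly like this:

repartitioned_with_new_columns_df = (
    df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)
)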
To achieve this in AWS Glue Studio:
You will need to make a custom function to convert the datetime field to date. There is the extra step of converting it back to a DynamicFrameCollection.
In Python:
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the single incoming frame and convert it to a Spark DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Cast the datetime field to a date column that can be used for partitioning
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
video step by step walkthrough
I am using AWS Glue and need to transform Boolean (True and False) columns within a Redshift data warehouse schema to "Yes"/"No" values in another Redshift schema. At present, there does not appear to be a simple way to do so in the AWS Glue GUI.
I have been following the guide here as: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-map.html
and created the function:
def ConvertBoolean(dataFrame, ColumnName):
    # Map the Boolean value to "Yes"/"No" in a temporary field
    if dataFrame[ColumnName] == True:
        dataFrame["booleanTransform"] = "Yes"
    else:
        dataFrame["booleanTransform"] = "No"
    # Replace the original column with the transformed value
    del dataFrame[ColumnName]
    dataFrame[ColumnName] = dataFrame["booleanTransform"]
    del dataFrame["booleanTransform"]
    return dataFrame
But I do not know where the function should be stored or how to pass the DynamicFrame, as that is not covered in the documentation example provided.
How would this best be accomplished in the PySpark code of AWS Glue?
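For reference, in the Map-transform pattern from the linked documentation the function operates on one record at a time and is passed via the f argument; a rough sketch (the frame and column names are placeholders, not the asker's actual names):

from awsglue.transforms import Map

def convert_boolean(rec):
    # rec is a single record (dict-like); map True/False to "Yes"/"No"
    rec["my_bool_column"] = "Yes" if rec["my_bool_column"] else "No"
    return rec

converted_dyf = Map.apply(frame=source_dyf, f=convert_boolean)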
Do you really have to use Glue for that? It sounds as if a simple CTAS would be more time- and cost-efficient:
CREATE TABLE newtable
-- you may also want to set DIST and SORTKEYs for the newtable here
AS
SELECT
    CASE my_bool_column
        WHEN TRUE THEN 'Yes'
        ELSE 'No'
    END::VARCHAR(3) AS my_bool_column,
    all_other_columns
FROM oldtable;
If you are using Redshift, why don't you write a SQL script that does that for you? I don't think you need to do anything with Glue.
Anyway, if you still need to do it using Glue, just use the Apache Spark DataFrame:

from pyspark.sql.functions import when, lit

df.withColumn("columnName", when(df["columnName"], lit('Yes')).otherwise(lit('No')))

Transforming back to a DynamicFrame can be done using the DynamicFrame.fromDF() function.
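Putting the two steps together, a minimal sketch could look like this (the frame and column names are placeholders):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import when, lit

# Convert the incoming DynamicFrame to a Spark DataFrame
df = source_dynamic_frame.toDF()

# Replace the Boolean values with "Yes"/"No" strings
df = df.withColumn("my_bool_column", when(df["my_bool_column"], lit("Yes")).otherwise(lit("No")))

# Convert back to a DynamicFrame so Glue can write it out
result_dyf = DynamicFrame.fromDF(df, glueContext, "result_dyf")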