I am new to AWS Glue and PySpark. I have a table in RDS that contains a varchar field id. I want to map id to a String field in the output JSON, inside a JSON array field (let's say newId):
{
  "sources" : [
    { "newId" : "1234asdf" }
  ]
}
How can I achieve this using the transforms defined in the PySpark script of the AWS Glue job?
Use the AWS Glue Map transformation to map the string field into a field inside a JSON array in the target:
NewFrame = Map.apply(frame=OldFrame, f=map_fields)
and define the function map_fields like this:
def map_fields(rec):
    rec["sources"] = [{"newId": rec["id"]}]
    del rec["id"]
    return rec
Make sure to delete the original field, as done with del rec["id"], otherwise the logic doesn't work.
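For context, here is a minimal sketch of how this transform could sit inside a complete Glue PySpark job; the catalog database/table names and the S3 output path are placeholders, not values from the question:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Map

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the RDS-backed table from the Glue Data Catalog (placeholder names)
old_frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="my_table")

def map_fields(rec):
    # Wrap the original varchar id in a one-element array of objects
    rec["sources"] = [{"newId": rec["id"]}]
    del rec["id"]
    return rec

new_frame = Map.apply(frame=old_frame, f=map_fields)

# Write each record out as JSON (placeholder S3 path)
glueContext.write_dynamic_frame.from_options(
    frame=new_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="json")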
I am trying to use the DynamoDB operation BatchWriteItem, in which I want to insert multiple records into one table.
This table has one partition key and one sort key.
I am using AWS Lambda and Go.
I get the elements to be inserted into a slice.
I am following this procedure:
1. Create a PutRequest structure and add the AttributeValues for the first record from the list.
2. Create a WriteRequest from this PutRequest.
3. Add this WriteRequest to an array of WriteRequests.
4. Create a BatchWriteItemInput, which consists of RequestItems, which is basically a map of the table name and the array of WriteRequests.
After that I call BatchWriteItem, which results in an error:
Provided list of item keys contains duplicates.
Any pointers on why this could be happening?
You've provided two or more items with identical primary keys (which in your case means identical partition and sort keys).
Per the BatchWriteItem docs, you cannot perform multiple operations on the same item in the same BatchWriteItem request.
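One way around this is to de-duplicate the slice on the (partition key, sort key) pair before building the WriteRequests. The question uses Go, but here is a minimal sketch of the idea in Python; the key names user_id and game_id are placeholders:
def dedupe_items(items, partition_key="user_id", sort_key="game_id"):
    # Keep only the last item seen for each (partition key, sort key) pair
    unique = {}
    for item in items:
        unique[(item[partition_key], item.get(sort_key))] = item
    return list(unique.values())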
Consideration: this answer works for Python.
As #Benoit has remarked, the boto3 documentation states that if you want to bypass the no-duplication limitation of a single batch write request (botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the BatchWriteItem operation: Provided list of item keys contains duplicates.),
you can specify overwrite_by_pkeys=['partition_key', 'sort_key'] on the batch writer to "de-duplicate request items in buffer if match new request item on specified primary keys", according to the documentation and the source code. That is, if the partition/sort key combination already exists in the buffer, it will drop that request and replace it with the new one.
Example
Suppose there is a pandas DataFrame that you want to write to a DynamoDB table; the following function could be helpful:
import json
import datetime as dt
import boto3
import pandas as pd
from typing import Optional
def write_dynamoDB(df: pd.DataFrame, tbl: str, partition_key: Optional[str] = None, sort_key: Optional[str] = None):
    '''
    Function to write a pandas DataFrame to a DynamoDB table through a
    batch write operation. In case there are any float values, it handles
    them by converting the data to a JSON format.
    Arguments:
    * df: pandas DataFrame to write to the DynamoDB table.
    * tbl: DynamoDB table name.
    * partition_key (Optional): DynamoDB table partition key.
    * sort_key (Optional): DynamoDB table sort key.
    '''
    # Initialize AWS resource
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(tbl)
    # Build the overwrite keys if they were provided (skip any that are missing)
    overwrite_keys = [k for k in (partition_key, sort_key) if k] or None
    # Check if there are float columns (convert them to Decimal instead)
    if any(v == 'float64' for v in df.dtypes.values):
        from decimal import Decimal
        # Round-trip through JSON so floats are parsed back as Decimal
        df_json = json.loads(
            json.dumps(df.to_dict(orient='records'),
                       default=date_converter,
                       allow_nan=True),
            parse_float=Decimal
        )
        # Batch write
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            for element in df_json:
                batch.put_item(Item=element)
    else:  # If there are no floats in the data
        # Batch write
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            columns = df.columns
            for row in df.itertuples():
                batch.put_item(
                    Item={col: row[idx + 1] for idx, col in enumerate(columns)}
                )

def date_converter(obj):
    # Serialize datetime/date objects so json.dumps can handle them
    if isinstance(obj, dt.datetime):
        return obj.__str__()
    elif isinstance(obj, dt.date):
        return obj.isoformat()
Then call it with write_dynamoDB(dataframe, 'my_table', 'the_partition_key', 'the_sort_key').
Use batch_writer instead of batch_write_item:
import boto3

dynamodb = boto3.resource("dynamodb", region_name='eu-west-1')
my_table = dynamodb.Table('mirrorfm_yt_tracks')

with my_table.batch_writer(overwrite_by_pkeys=["user_id", "game_id"]) as batch:
    for item in items:
        batch.put_item(
            Item={
                'user_id': item['user_id'],
                'game_id': item['game_id'],
                'score': item['score']
            }
        )
If you don't have a sort key, overwrite_by_pkeys can be None.
This is essentially the same answer as #MiguelTrejo's (thanks! +1), but simplified.
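For a table with only a partition key, a possible variant of the same idea is to de-duplicate on that single key (table and key names are placeholders):
with my_table.batch_writer(overwrite_by_pkeys=["user_id"]) as batch:
    for item in items:
        batch.put_item(Item=item)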
I have a database stored in Amazon Redshift, and an array is stored in a table column in JSON format.
How do I fetch a string from the array?
Using json_extract_path_text you can retrieve values from a JSON column.
In my Redshift database I have JSON in one column; this query performs a join and returns separate column results.
SELECT json_extract_path_text(O._doc,'domain') AS Domain,
json_extract_path_text(P._doc,'email') AS Email
FROM intelligense_mongo.organisations AS O
INNER JOIN intelligense_mongo.people AS P
ON json_extract_path_text(O._doc,'_id') =
json_extract_path_text(P._doc,'organisation_id')
WHERE
json_extract_path_text(O._doc,'tools_name') LIKE '%"WordPress"%'
Use the JSON_EXTRACT_PATH_TEXT Function:
select json_extract_path_text('{"f2":{"f3":1},"f4":{"f5":99,"f6":"star"}}','f4', 'f6');
json_extract_path_text
----------------------
star
I have a dynamic frame in AWS Glue which I created using the piece of code below.
val rawDynamicDataFrame = glueContext.getCatalogSource(
  database = rawDBName,
  tableName = rawTableName,
  redshiftTmpDir = "",
  transformationContext = "rawDynamicDataFrame"
).getDynamicFrame()
In order to get the schema of the above dynamic frame, I used the below piece of code:
val x = rawDynamicDataFrame.schema
Now x is of type com.amazonaws.services.glue.schema.Schema. How can I parse the schema object?
To check if a field exists in the schema, use containsField(fieldPath):
if (rawDynamicDataFrame.schema.containsField("app_name")) {
    // do something
}
Maybe you can use field_names = [field.name for field in rawDynamicDataFrame.schema().fields] (with the Python DynamicFrame API) to get a list of field names.
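For reference, a small sketch of the same checks from a Python Glue job, assuming a DynamicFrame named dyf (the field name app_name is just an example):
# Pretty-print the schema of the DynamicFrame
dyf.printSchema()

# Collect the field names, as suggested above
field_names = [field.name for field in dyf.schema().fields]

# Check whether a field exists before using it
if "app_name" in field_names:
    pass  # do something with the field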
I am using AWS Glue and need to transform Boolean (True and False) columns within a Redshift data warehouse schema to "Yes"/"No" in another Redshift schema. At present, there does not appear to be a simple way to do so in the AWS Glue GUI.
I have been following the guide here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-map.html
and created the function:
def ConvertBoolean(dataFrame, ColumnName):
    dataFrame["booleanTransform"] = {}
    if dataFrame[ColumnName] == True:
        dataFrame["booleanTransform"] = "Yes"
    else:
        dataFrame["booleanTransform"] = "No"
    del dataFrame[ColumnName]
    dataFrame[ColumnName] = {}
    dataFrame[ColumnName] = dataFrame["booleanTransform"]
    del dataFrame["booleanTransform"]
    return dataFrame
But I do not know where the function should be stored or how to pass the DynamicFrame, as that is not noted in the documentation example provided.
How would this best be accomplished in the PySpark code of AWS Glue?
Do you really have to use Glue for that? It sounds as if a simple CTAS would be more time- and cost-efficient:
CREATE TABLE newtable
-- you may also want to set DIST and SORTKEYs for the newtable here
AS
SELECT
CASE my_bool_column
WHEN TRUE THEN 'Yes'
ELSE 'No'
END::VARCHAR(3) as my_bool_column,
all_other_columns
FROM oldtable;
If you are using Redshift, why don't you write a SQL script that does that for you? I don't think you need to do anything with Glue.
Anyway, if you still need to do it using Glue, just use the Apache Spark DataFrame:
df.withColumn("columnName", when(df.columnName, lit('Yes')).otherwise(lit('No')))
Transforming back to a DynamicFrame can be done using the fromDF() function.
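Putting that together, a minimal sketch of the round trip, assuming a DynamicFrame named dyf, a GlueContext named glueContext, and a boolean column named is_active (all placeholder names):
from pyspark.sql.functions import when, lit
from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> Spark DataFrame
df = dyf.toDF()

# Map the boolean column to "Yes"/"No"
df = df.withColumn("is_active",
                   when(df["is_active"], lit("Yes")).otherwise(lit("No")))

# Back to a DynamicFrame for the rest of the Glue job
dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")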
I'm using Pig UDFs in Python to read data from an HBase table, then process and parse it, and finally insert it into another HBase table. But I'm facing some issues.
Pig's map is equivalent to Python's dictionary.
My Python script takes as input (rowkey, some_mapped_values), returns two strings "key" and "content", and declares #outputSchema('tuple:(rowkey:chararray, some_values:chararray)').
The core of my Python script takes a rowkey, parses it, transforms it into another rowkey, and transforms the mapped data into another string, returning the variables (key, content).
But when I try to insert those new values into another HBase table, I have faced two problems:
Processing is done correctly, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty;
Second problem: how do I specify the new column family to insert into?
Here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;
data = LOAD 'hbase://source' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true') AS
(rowkey:chararray, curves:map[]);
data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey,curves);
STORE data_1 INTO 'hbase://destination' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
and my Python script takes as input (rowkey, curves):
#outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey, curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key, content)