I'm trying to extract a dataset from DynamoDB to S3 using Glue. In the process I want to select a handful of columns, then set a default value for any rows/columns that have missing values.
My current attempt uses the "Map" transform, but it doesn't seem to be calling my method.
Here is what I have:
def SetDefaults(rec):
    print("checking record")
    for col in rec:
        if not rec[col]:
            rec[col] = "missing"
    return rec
## Read raw(source) data from target DynamoDB
raw_data_dyf = glueContext.create_dynamic_frame_from_options(
    "dynamodb",
    {"dynamodb.input.tableName": my_dynamodb_table,
     "dynamodb.throughput.read.percent": "0.50"}
)
## Get the necessary columns
selected_data_dyf = ApplyMapping.apply(frame = raw_data_dyf, mappings = mappingList)
## get rid of null values
mapped_dyF = Map.apply(frame=selected_data_dyf, f=SetDefaults)
## write it all out as a csv
datasink = glueContext.write_dynamic_frame.from_options(
    frame=mapped_dyF,
    connection_type="s3",
    connection_options={"path": my_train_data},
    format="csv",
    format_options={"writeHeader": False, "quoteChar": "-1"}
)
My ApplyMapping.apply call is doing the right thing, where mappingList is defined by a bunch of:
mappingList.append(('gsaid', 'bigint', 'gsaid', 'bigint'))
mappingList.append(('objectid', 'bigint', 'objectid', 'bigint'))
mappingList.append(('objecttype', 'bigint', 'objecttype', 'bigint'))
I have no errors and everything runs to completion. My data is all in S3, but there are still many empty values rather than the "missing" entries I would like.
The "checking record" print statement never prints. What am I missing here?
Alternative solution:
Convert DynamicFrame to Spark DataFrame
Use the DataFrame's fillna() method to fill the null values
Convert the DataFrame back to DynamicFrame
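A minimal sketch of that alternative, assuming the same Glue job as above where glueContext and selected_data_dyf already exist (the numeric default of 0 is just a placeholder):
from awsglue.dynamicframe import DynamicFrame

# Convert the DynamicFrame to a Spark DataFrame
selected_df = selected_data_dyf.toDF()

# fillna with a string only fills string columns; numeric columns
# (e.g. the bigint fields above) need their own default value
filled_df = selected_df.fillna("missing").fillna(0)

# Convert back to a DynamicFrame before writing to S3
mapped_dyF = DynamicFrame.fromDF(filled_df, glueContext, "mapped_dyF")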
I have a DataStream <pyflink.datastream.data_stream.DataStream> coming from a CoFlatMapFunction (simplified here):
%flink.pyflink
# join two streams and update the rule-set
class MyCoFlatMapFunction(CoFlatMapFunction):

    def open(self, runtime_context: RuntimeContext):
        state_desc = MapStateDescriptor('map', Types.STRING(), Types.BOOLEAN())
        self.state = runtime_context.get_map_state(state_desc)

    def bool_from_user_number(self, user_number: int):
        '''Returns True if user_number is greater than 0, False otherwise.'''
        return user_number > 0

    def flat_map1(self, value):
        '''This method is called for each element in the first of the connected streams.'''
        self.state.put(value[1], self.bool_from_user_number(value[2]))

    def flat_map2(self, value):
        '''This method is called for each element in the second of the connected streams (exchange_server_tickers_data_py).'''
        dt = datetime.now()
        x = value[1]
        y = value[2]
        yield Row(dt, x, y)
def generate__ds(st_env):
    # interpret the updating Tables as DataStreams
    type_info1 = Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.INT()])
    ds1 = st_env.to_append_stream(table_1, type_info=type_info1)

    type_info2 = Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()])
    ds2 = st_env.to_append_stream(table_2, type_info=type_info2)

    output_type_info = Types.ROW([Types.PICKLED_BYTE_ARRAY(), Types.STRING(), Types.STRING()])

    # Connect the two streams and apply the CoFlatMapFunction
    connected_ds = ds1.connect(ds2)
    ds = connected_ds.key_by(lambda a: a[0], lambda a: a[0]).flat_map(MyCoFlatMapFunction(), output_type_info)
    return ds

ds = generate__ds(st_env)
I am unable to view the output, however, either by registering it as a view/table, writing it to a sink table, or (ideally) using a Kinesis Streams sink to write the data from the Flink stream into a Kinesis stream. Firehose would also not fit my use case, as the 30-second latency would be too long. Any help would be appreciated, thanks!
What I have tried:
Registering it as a view / table like so:
# interpret the DataStream as a Table
input_table = st_env.from_data_stream(ds).alias("dt", "x", "y")
z.show(input_table, stream_type="update")
Which gives an error of:
Query schema: [dt: RAW('[B', '...'), x: STRING, y: STRING]
Sink schema: [dt: RAW('[B', ?), x: STRING, y: STRING]
I have also tried writing to a sink table, like so:
%flink.pyflink
# create a sink table to emit results
st_env.execute_sql("""DROP TABLE IF EXISTS table_sink""")
st_env.execute_sql("""
    CREATE TABLE table_sink (
        dt RAW('[B', '...'),
        x VARCHAR(32),
        y STRING
    ) WITH (
        'connector' = 'print'
    )
""")
# convert the Table API table to a SQL view
table = st_env.from_data_stream(ds).alias("dt", "spread", "spread_orderbook")
st_env.execute_sql("""DROP TEMPORARY VIEW IF EXISTS table_api_table""")
st_env.create_temporary_view('table_api_table', table)
# emit the Table API table
st_env.execute_sql("INSERT INTO table_sink SELECT * FROM table_api_table").wait()
I get the error:
org.apache.flink.table.api.ValidationException: Unable to restore the RAW type of class '[B' with serializer snapshot '...'.
I have also tried using a sink and add_sink to write the data to a sink, in this case an AWS Kinesis data stream as in these docs, like so:
%flink.pyflink
from pyflink.common.serialization import JsonRowSerializationSchema, SimpleStringSchema
from pyflink.datastream.connectors import KinesisStreamsSink

output_type_info = Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()])
serialization_schema = JsonRowSerializationSchema.Builder().with_type_info(output_type_info).build()

# Required
sink_properties = {
    'aws.region': 'eu-west-2'
}

kds_sink = KinesisStreamsSink.builder() \
    .set_kinesis_client_properties(sink_properties) \
    .set_serialization_schema(SimpleStringSchema()) \
    .set_partition_key_generator(PartitionKeyGenerator.fixed()) \
    .set_stream_name("test_stream") \
    .set_fail_on_error(False) \
    .set_max_batch_size(500) \
    .set_max_in_flight_requests(50) \
    .set_max_buffered_requests(10000) \
    .set_max_batch_size_in_bytes(5 * 1024 * 1024) \
    .set_max_time_in_buffer_ms(5000) \
    .set_max_record_size_in_bytes(1 * 1024 * 1024) \
    .build()

ds.sink_to(kds_sink)
Which I assume would work, but KinesisStreamsSink is not found in pyflink.datastream.connectors and I am unable to find any documentation on how to do this within AWS Kinesis Analytics Studio. Any help would be much appreciated, thank you! How would I go about writing the data to a Kinesis Streams sink / converting it to a table?
Okay, I have figured it out. There were a couple of issues with the particular PyFlink version available on AWS Kinesis Analytics Studio (1.13). The error messages themselves were not that useful, so for anyone having similar issues I would really recommend viewing the errors in the Flink Web UI. Firstly, the MapStateDescriptor datatypes must be specified using Types.PICKLED_BYTE_ARRAY(). Secondly, not shown in the question, each MapStateDescriptor must have a distinct name. I also found that using Row from pyflink.common threw errors for me; it worked better to switch to tuples by specifying Types.TUPLE(), as is done in this example. I also had to switch to specifying the output as a tuple.
Another thing I have not done is specify a watermark strategy for the DataStream. This could potentially be done by extracting the timestamp from the first field and assigning watermarks based on knowledge of the stream:
class MyTimestampAssigner(TimestampAssigner):

    def extract_timestamp(self, value, record_timestamp: int) -> int:
        return int(value[0])

watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)) \
    .with_timestamp_assigner(MyTimestampAssigner())
ds = ds.assign_timestamps_and_watermarks(watermark_strategy)
# the first field has been used for timestamp extraction, and is no longer necessary
# replace first field with a logical event time attribute
table = st_env.from_data_stream(ds, col("dt").rowtime, col('f0'), col('f1'))
But I have instead created a sink table for writing to a Kinesis Data Stream again as the output. In total, the corrected code would look something like this:
from pyflink.table.expressions import col
from pyflink.datastream.state import MapStateDescriptor, ValueStateDescriptor
from pyflink.datastream.functions import RuntimeContext, CoFlatMapFunction, KeyedProcessFunction
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import WatermarkStrategy, Duration
from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import TimestampAssigner
from datetime import datetime
# Register the tables in the env
table1 = st_env.from_path("sql_table_1")
table2 = st_env.from_path("sql_table_2")
# interpret the updating Tables as DataStreams
type_info1 = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING(), Types.INT()])
ds1 = st_env.to_append_stream(table2, type_info=type_info1)
type_info2 = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()])
ds2 = st_env.to_append_stream(table1, type_info=type_info2)
# join two streams and update the rule-set state
class MyCoFlatMapFunction(CoFlatMapFunction):

    def open(self, runtime_context: RuntimeContext):
        '''Called when the function is opened in the runtime; used for initialization.'''
        # Map state that we use to maintain the filtering and rules
        state_desc = MapStateDescriptor('map', Types.PICKLED_BYTE_ARRAY(), Types.PICKLED_BYTE_ARRAY())
        self.state = runtime_context.get_map_state(state_desc)
        # maintain state 2
        ob_state_desc = MapStateDescriptor('map_OB', Types.PICKLED_BYTE_ARRAY(), Types.PICKLED_BYTE_ARRAY())
        self.ob_state = runtime_context.get_map_state(ob_state_desc)

    # called on ds1
    def flat_map1(self, value):
        '''This method is called for each element in the first of the connected streams.'''
        list_res = value[1].split('|')
        for i in list_res:
            time = datetime.utcnow().replace(microsecond=0)
            yield (time, f"{i}_one")

    # called on ds2
    def flat_map2(self, value):
        '''This method is called for each element in the second of the connected streams.'''
        list_res = value[1].split('|')
        for i in list_res:
            time = datetime.utcnow().replace(microsecond=0)
            yield (time, f"{i}_two")

connectedStreams = ds1.connect(ds2)
output_type_info = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING()])
ds = connectedStreams.key_by(lambda value: value[1], lambda value: value[1]).flat_map(MyCoFlatMapFunction(), output_type=output_type_info)
name = 'output_table'
ds_table_name = 'temporary_table_dump'
st_env.execute_sql(f"""DROP TABLE IF EXISTS {name}""")
def create_table(table_name, stream_name, region, stream_initpos):
    return """ CREATE TABLE {0} (
        f0 TIMESTAMP(3),
        f1 STRING,
        WATERMARK FOR f0 AS f0 - INTERVAL '5' SECOND
    )
    WITH (
        'connector' = 'kinesis',
        'stream' = '{1}',
        'aws.region' = '{2}',
        'scan.stream.initpos' = '{3}',
        'sink.partitioner-field-delimiter' = ';',
        'sink.producer.collection-max-count' = '100',
        'format' = 'json',
        'json.timestamp-format.standard' = 'ISO-8601'
    ) """.format(
        table_name, stream_name, region, stream_initpos
    )
# Creates a sink table writing to a Kinesis Data Stream
st_env.execute_sql(create_table(name, 'output-test', 'eu-west-2', 'LATEST'))
table = st_env.from_data_stream(ds)
st_env.execute_sql(f"""DROP TEMPORARY VIEW IF EXISTS {ds_table_name}""")
st_env.create_temporary_view(ds_table_name, table)
# emit the Table API table
st_env.execute_sql(f"INSERT INTO {name} SELECT * FROM {ds_table_name}").wait()
I am getting an error. One DataFrame, df, is read from a JSON API and a second, df2, is read from a CSV file. I want to compare one column of the CSV against the API data and then save the matched values into a new CSV. Can anyone help me?
import numpy as np
import pandas as pd
import requests

df2 = pd.read_csv(file_path)

r = requests.get('https://data.ct.gov/resource/6tja-6vdt.json')
df = pd.DataFrame(r.json())

df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'], 'True', 'False')
print(df)
Probably just change
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'], 'True', 'False')
to this:
df['verified'] = df['salespersoncredential'] == df2['salespersoncredential']
assuming the dtypes and indexes are correct.
If the indexes are different on the two dataframes, you might need to .reset_index().
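For example, a minimal sketch of that alignment (the output filename is just a placeholder):
# Reset both indexes so the comparison lines up row by row;
# this assumes both frames have the same number of rows in the same order
df = df.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df['verified'] = df['salespersoncredential'] == df2['salespersoncredential']

# Write only the matched rows to a new CSV, as the question asks
df[df['verified']].to_csv('matched.csv', index=False)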
I am facing a similar problem: using boto3 the query does not work, while it works in the console.
First I tried this scan without success:
text = 'city:barcelona'
filter_expr = Attr('timestamp').between('2020-04-01', '2020-04-27')
filter_expr = filter_expr & Attr('text').eq(text)
table.scan(FilterExpression = filter_expr, Limit = 1000)
Then I noticed that for a text variable that does not contain ":", the scan works.
So I tried this second scan using ExpressionAttributeNames and ExpressionAttributeValues:
table.scan(
FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
ExpressionAttributeValues = {
':v0': '2020-04-01',
':v1': '2020-04-27',
':v2': {"S": text}},
Limit = 1000
)
It failed again.
In the end, if I change the first example to:
text = 'barcelona'
filter_expr = filter_expr & Attr('text').contains(text)
I can get the records. IMO, it is clear that the problem is the ":".
Is there another way to search for text containing the ":" character?
[writing an answer so that we can close out the question]
I ran both examples and they worked correctly for me. I configured text and timestamp as string fields. Check that you have an up-to-date boto3 library.
Note: I changed ':v2': {"S": text} to ':v2': text because you're using resource level scan and you don't need to supply the low-level attribute type (it's only required for client level scan).
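For reference, a minimal sketch of the corrected resource-level scan with that change applied (table, attribute names and values are taken from the question):
text = 'city:barcelona'

response = table.scan(
    FilterExpression="#n0 between :v0 AND :v1 AND #n1 = :v2",
    ExpressionAttributeNames={'#n0': 'timestamp', '#n1': 'text'},
    ExpressionAttributeValues={
        ':v0': '2020-04-01',
        ':v1': '2020-04-27',
        ':v2': text,  # plain Python string, no {"S": ...} wrapper at the resource level
    },
    Limit=1000,
)
items = response.get('Items', [])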
I have a dynamic frame which contains error records. Please find the code below.
val rawDataFrame = glueContext.getCatalogSource(database = rawDBName, tableName = rawTBLName).getDynamicFrame();
println(s"RAW_DF-----count: ${rawDataFrame.count} errors: ${rawDataFrame.errorsCount}")
The above print statement prints as below.
RAW_DF-----count: 168456 errors: 4
I need to create a dynamic frame which contains only the 168456 good records and eliminates the 4 error records. Kindly help.
Error records do not convert to Spark's DataFrame, so transform your DynamicFrame to a DataFrame and back:
val noErrorsDyf = DynamicFrame(rawDataFrame.toDF(), glueContext)
I'm having a bit of a frustrating issue with a Glue job.
I have a table which I have created from a crawler. It's gone through some CSV data and created a schema. Some elements of the schema need to be modified, e.g. converting numbers to strings and applying a header.
I seem to be running into some problems here: the schema for some fields appears to have been picked up as a double. When I try to convert this into a string, which is what I require, it includes some empty precision, e.g. 1234 --> 1234.0.
The mapping code I have is something like:
applymapping1 = ApplyMapping.apply(
frame = datasource0,
mappings = [
("col1","double","first_column_name","string"),
("col2","double","second_column_name","string")
],
transformation_ctx = "applymapping1"
)
And the resulting table I get after I've crawled the data is something like:
first_column_name second_column_name
1234.0 4321.0
5678.0 8765.0
as opposed to
first_column_name second_column_name
1234 4321
5678 8765
Is there a good way to work around this? I've tried changing the schema in the table that is initially created by the crawler to a bigint as opposed to a double, but when I update the mapping code to ("col1","bigint","first_column_name","string") the table just ends up being null.
Just a little correction to botchniaque's answer: you actually have to do BOTH ResolveChoice and then ApplyMapping to ensure the correct type conversion.
ResolveChoice will make sure you have just one type in your column. If you skip this step and the ambiguity is not resolved, the column will become a struct and Redshift will show it as null in the end.
So apply ResolveChoice to make sure all your data is of one type (int, for instance):
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
Finally, use ApplyMapping to change type for what you want
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1")
Hope this helps (:
Maybe your data really is of type double (some values may have fractions), and that's why changing the type results in the data being turned to null. It's also no wonder that when you change the type of a double field to string it gets serialized with a decimal component - it's still a double, just printed.
Have you tried explicitly casting the values to integer?
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
And then cast to string:
df3 = ResolveChoice.apply(df2, specs = [("col1", "cast:string"), ("col2", "cast:string")])
or use ApplyMapping to change type and rename as you did above.
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1"
)