AWS Glue ApplyMapping from double to string - amazon-web-services

I'm having a bit of a frustrating issues with a Glue Job.
I have a table which I have created from a crawler. It's gone through some CSV data and created a schema. Some elements of the schema need to be modified, e.g. numbers to strings and apply a header.
I seem to be running into some problems here - the schema for some fields appears to be have picked up as a double. When I try and convert this into a string which is what I require, it includes some empty precision e.g. 1234 --> 1234.0.
The mapping code I have is something like:
applymapping1 = ApplyMapping.apply(
frame = datasource0,
mappings = [
("col1","double","first_column_name","string"),
("col2","double","second_column_name","string")
],
transformation_ctx = "applymapping1"
)
And the resulting table I get after I've crawled the data is something like:
first_column_name second_column_name
1234.0 4321.0
5678.0 8765.0
as opposed to
first_column_name second_column_name
1234 4321
5678 8765
Is there a good way to work around this? I've tried changing the schema in the table that is initially created by the crawler to a bigint as opposed to a double, but when I update the mapping code to ("col1","bigint","first_column_name","string") the table just ends up being null.

Just a little correction from botchniaque answer, you actually have to do BOTH ResolveChoice and then ApplyMapping to ensure the correct type conversion.
ResolveChoice will make sure you just have one type in your column. If you do not make this step and the ambiguity is not resolved, the column will become a struct and Redshift will show this as null in the end.
So apply ResolveChoice to make sure all your data is one type (int, for ie)
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
Finally, use ApplyMapping to change type for what you want
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1")
Hope this helps (:

Maybe your data is really of type double (some values may have a fractions), and that's why changing type results in data being turned to null. Also it's no wonder that when you change type of a double field to string it gets serialized with a decimal component - it's still a double, just printed.
Have you tried explicitly casting the values to integer?
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
And then to case to string
df3 = ResolveChoice.apply(df2, specs = [("col1", "cast:string"), ("col2", "cast:string")])
or use ApplyMapping to change type and rename as you did above.
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1"
)

Related

How to pass Glue annotations for multiple inputs in AWS for multiple input source

Glue diagram is generated as per the annotations passed to it and edges are created as per #input frame value passed, I want to generate diagram where it should take multiple inputs as there should be multiple edges coming to vertex for each source but in all the example it's given for single input source only , I tried comma separated value but in that diagram it is not getting generated at all. If anyone can share a link to blog or video where annotation is explained in more detail that will also be very helpful,
## #type: DataSink
## #args: [connection_type = "s3", connection_options = {"path": "s3://example-data-destination/taxi-data"}, format = "json", transformation_ctx = "datasink2"]
## #return: datasink2
## #inputs: [frame = applymapping1]
I was able to find the answer , just for future reference if someone else get's stuck in this situation ## #inputs: [frame1 = applymapping1,frame2 = applymapping2,frame3 = applymapping3,frame4 = applymapping4 ]

How Scan Values with colon ":" from a Dynamodb Table?

I am facing a similar problem, using boto3 the query does not work, while it works on console.
First I tried this scan without success:
text = 'city:barcelona'
filter_expr = Attr('timestamp').between('2020-04-01', '2020-04-27')
filter_expr = filter_expr & Attr('text').eq(text)
table.scan(FilterExpression = filter_expr, Limit = 1000)
Then, I notice that for a text variable that does not contain ":", the scan works.
So, I tried this second scan using ExpressionAttributeNames and ExpressionAttributeValues
table.scan(
FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
ExpressionAttributeValues = {
':v0': '2020-04-01',
':v1': '2020-04-27',
':v2': {"S": text}},
Limit = 1000
)
Failed again.
By the end, if I change in the first example:
text = 'barcelona'
filter_expr = filter_expr & Attr('text').contains(text)
I can get the records. IMO, it is clear that the problem is the ":"
Is there another way to search by texts with ":" character?
[writing an answer so that we can close out the question]
I ran both examples and they worked correctly for me. I configured text and timestamp as string fields. Check you have an up to date boto3 library.
Note: I changed ':v2': {"S": text} to ':v2': text because you're using resource level scan and you don't need to supply the low-level attribute type (it's only required for client level scan).

Set missing column values to a default using AWS Glue Jobs

I'm trying to extract a dataset from dynamodb to s3 using Glue. In the process I want to select a handful of columns, then set a default value for any and all rows/columns that have missing values.
My attempt is currently to use the "Map" function, but it doesn't seem to be calling my method.
Here is what I have:
def SetDefaults(rec):
print("checking record")
for col in rec:
if not rec[col]:
rec[col] = "missing"
return rec
## Read raw(source) data from target DynamoDB
raw_data_dyf = glueContext.create_dynamic_frame_from_options("dynamodb", {"dynamodb.input.tableName" : my_dynamodb_table, "dynamodb.throughput.read.percent" : "0.50" } )
## Get the necessary columns
selected_data_dyf = ApplyMapping.apply(frame = raw_data_dyf, mappings = mappingList)
## get rid of null values
mapped_dyF = Map.apply(frame=selected_data_dyf, f=SetDefaults)
## write it all out as a csv
datasink = glueContext.write_dynamic_frame.from_options(frame=mapped_dyF , connection_type="s3", connection_options={ "path": my_train_data }, format="csv", format_options = {"writeHeader": False , "quoteChar": "-1" })
My ApplyMapping.apply call is doing the right thing, where mappingList is defined by a bunch of:
mappingList.append(('gsaid', 'bigint', 'gsaid', 'bigint'))
mappingList.append(('objectid', 'bigint', 'objectid', 'bigint'))
mappingList.append(('objecttype', 'bigint', 'objecttype', 'bigint'))
I have no errors, everything runs to completion. My data is all in s3, but there are many empty values still, rather than the "missing" entry I would like.
The "checking record" print statement never prints out. What am I missing here?
Alternative solution:
Convert DynamicFrame to Spark DataFrame
Use the DataFrame's fillna() method to fill the null values
Convert the DataFrame back to DynamicFrame

pandas 'outer' merge of multiple csvs using too much memory

I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 tsv files (each has two columns without a header, one is a contig _number the other is a count.
If a tsv did not have a count for a particular contig_number, it does not have that row - so I am attempting to use how = 'outer' and fill in the missing values with 0 afterwards.
I have been successful for the tsvs which I have subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns), more and more memory is used...
I got to 500Gb of RAM on the server and called it a day.
This is the code that is successful on the subsetted csvs:
files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
# read in the files and put them into dataframes
df = pd.read_csv(fp, sep = '\t', header = None, index_col = 0)
# rename the columns so we know what file they came from
df = df.rename(columns = {1:str(fp)}).reset_index()
df = df.rename(columns = {0:"contig"})
# append the dataframes to a list
data_frames.append(df)
logging.info("Merging the tables on contig, and fill in samples with no counts for contigs")
# merge the tables on gene_id and select how = 'outer' which will include all rows but will leave empty space where there is no data
df=reduce(lambda left,right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace = True)
logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!
I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
for fp in files:
# read in the files and put them into dataframes
df = pd.read_csv(fp, sep = '\t', header = None, index_col = 0)
# rename the columns so we know what file they came from
df = df.rename(columns = {1:str(fp)})
#just keep the contig as the index
data_frames.append(df)
df_full=pd.concat(data_frames,axis=1)
and then df_full=df_full.fillna(0) if you want to.
In fact since each of your files has only one column (+ an index) you may do better yet by treating them as Series instead of DataFrame.

How can I parse multiple date columns in Pandas?

I have a field/column in a .csv file that I am loading into Pandas that will not parse as a datetime data type in Pandas. I don't really understand why. I want both FirstTime and SecondTime to parse as datetime64 in Pandas DataFrame.
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime', 'SecondTime'])
The code above will only parse SecondTime as datetime64[ns]. FirstTime is left as a Object data type. If I do the following code instead:
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime'])
It still will not parse FirstTime as a datetime64[ns].
The format for both columns is the same:
# Example FirstTime
# (%f is always .000)
2015-11-05 16:52:37.000
# Example SecondTime
# (%f is always .000)
2015-11-04 15:33:15.000
What am I missing here? Is the first column not able to be datetime by default or something in Pandas?
did you try
df = pd.read_csv('MyData.csv', names=header, parse_dates=True)
I had a similar problem and it turned out in one of my date variables there is an integer cell. So, python recognize it as "object" and the other one is recognized as "int64". You need to make sure both variables are integer.
You can use df.dtypes to see the format of your vaiables.