How to Scan for Values Containing a Colon ":" in a DynamoDB Table? - amazon-web-services

I am facing a similar problem: using boto3 the query does not work, while it works on the console.
First I tried this scan, without success:
from boto3.dynamodb.conditions import Attr

text = 'city:barcelona'
filter_expr = Attr('timestamp').between('2020-04-01', '2020-04-27')
filter_expr = filter_expr & Attr('text').eq(text)
table.scan(FilterExpression=filter_expr, Limit=1000)
Then I noticed that, for a text value that does not contain ":", the scan works.
So I tried this second scan, using ExpressionAttributeNames and ExpressionAttributeValues:
table.scan(
    FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
    ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
    ExpressionAttributeValues = {
        ':v0': '2020-04-01',
        ':v1': '2020-04-27',
        ':v2': {"S": text}},
    Limit = 1000
)
Failed again.
In the end, if I change the first example to:
text = 'barcelona'
filter_expr = filter_expr & Attr('text').contains(text)
I can get the records. IMO, it is clear that the problem is the ":".
Is there another way to search for text containing the ":" character?

[writing an answer so that we can close out the question]
I ran both examples and they worked correctly for me. I configured text and timestamp as string fields. Check that you have an up-to-date boto3 library.
Note: I changed ':v2': {"S": text} to ':v2': text because you're using a resource-level scan and you don't need to supply the low-level attribute type (it's only required for a client-level scan).
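For reference, here is a minimal sketch of the corrected resource-level call. The table name and the resource setup are assumptions added to make it self-contained; the filter expression and attribute names come from the question.

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # hypothetical table name

text = 'city:barcelona'
response = table.scan(
    FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
    ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
    ExpressionAttributeValues = {
        ':v0': '2020-04-01',
        ':v1': '2020-04-27',
        ':v2': text,  # plain string; the ":" inside the value is harmless here
    },
    Limit = 1000,
)
items = response['Items']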

Related

AWS Glue RenameField and writing to bucket failing with Broken Pipe error

I have json0, which is a dynamic frame with columns and nested columns. I have used Unbox for outer flattening of nested columns and Relationalize for deeply nested columns.
I have renamed the fields using RenameField.apply to replace the period (".") with "_". Finally I am writing the renamed dynamic frame to the bucket, but writing it causes a Broken Pipe error.
Is anyone with AWS data expertise out here able to suggest a solution?
json0 = Unbox.apply(frame=cf, path="s3_path", format="json",
                    transformation_ctx="json0" + tn)
json0 = json0.relationalize("json0", "s3//path")
cf = json0.select(list(json0.keys())[0])
df_relationalize = cf.toDF()
for field in df_relationalize.schema.fields:
    cf = RenameField.apply(frame=cf, old_name="`" + field.name + "`",
                           new_name=field.name.replace(".", "_"),
                           transformation_ctx="renamefield_" + field.name)
jsonBreakoutCol1 = glueContext.write_dynamic_frame.from_options(frame=cf,
                                                                connection_type="s3",
                                                                connection_options={"path": json_path},
                                                                format="parquet")

PowerBI - how to get an output from one query to another? (For Sophos Central APIs)

I want to pass the output of the first query (SophosBearerToken) into the other one, but it gives me the following error:
Formula.Firewall: Query 'Query2' (step 'PartnerID') references other
queries or steps, so it may not directly access a data source. Please
rebuild this data combination.
First query: SophosBearerToken (works)
let
    SophosBearerToken = "Bearer " & (Json.Document(Web.Contents("https://id.sophos.com/api/v2/oauth2/token",
        [
            Headers = [#"Content-Type"="application/x-www-form-urlencoded"],
            Content = Text.ToBinary("grant_type=client_credentials&client_id=" & #"SophosClientID" & "&client_secret=" & #"SophosClientSecret" & "&scope=token")
        ]
    ))[access_token])
in
    SophosBearerToken
Second query: Query2 (fails)
let
    PartnerIDQuery = Json.Document(Web.Contents("https://api.central.sophos.com/whoami/v1", [Headers = [#"Authorization"=#"SophosBearerToken"]])),
    PartnerID = PartnerIDQuery[id]
in
    PartnerID
But when I paste the output of the first query manually into the second one, it works.
What could be the mistake I'm making here?

AWS Glue pushdown predicate not working properly

I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
    Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
    querydatetime BETWEEN '%s' AND '%s'
    AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior". Is something wrong?
I wonder if the problem is that BETWEEN or IN is not supported inside the predicate?
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issues I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. Though I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date.
In that case the range will be different and the partitions being read by Glue/Spark are correct.
In order to push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of the table definition.
A condition with an "in" predicate on the first partition column cannot be pushed down in the way you are expecting.
Let me know if it helps.
Push-down predicates on a Glue DynamicFrame work fine with BETWEEN as well as IN clauses, as long as you have the correct sequence of partition columns defined in the table definition and in the query.
I have a table with three levels of partitions:
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
    database = "abc", table_name = "pqr", transformation_ctx = "flights",
    push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is correctly specified. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database = "xxx", table_name = "flights", transformation_ctx = "flights",
    push_down_predicate =
        """querydestinationplace IN ('%s') AND
           querydatetime BETWEEN '%s' AND '%s'
        """ % (",".join(map(lambda s: str(s), arr)),
               start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
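On the quoting point: if arr holds more than one value, each element needs its own quotes, so substituting a plain comma-joined list into '%s' is not enough. A minimal sketch of building a correctly quoted IN list (the values of arr, start, and end are illustrative; the column names come from the table above):

from datetime import date

arr = [12191, 10043]  # example destination place IDs
start = date(2019, 2, 13)
end = date(2019, 2, 27)

# Quote each value individually: "'12191','10043'"
in_list = ",".join("'{}'".format(v) for v in arr)

push_down_predicate = (
    "querydestinationplace IN ({}) AND "
    "querydatetime BETWEEN '{}' AND '{}'"
).format(in_list, start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d"))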
Try building the predicate like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))

# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
# pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?

flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database = "xxx",
    table_name = "flights",
    transformation_ctx = "flights",
    push_down_predicate = pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
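For example, printing the variable shows the exact string Glue will receive (reusing the dates above, just as a sanity check):

from datetime import date

start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
print(pd_predicate)
# querydatetime >= '2019-02-13' and querydatetime < '2019-02-27'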
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Set missing column values to a default using AWS Glue Jobs

I'm trying to extract a dataset from DynamoDB to S3 using Glue. In the process I want to select a handful of columns, then set a default value for any rows/columns that have missing values.
My current attempt uses the "Map" function, but it doesn't seem to be calling my method.
Here is what I have:
def SetDefaults(rec):
    print("checking record")
    for col in rec:
        if not rec[col]:
            rec[col] = "missing"
    return rec

## Read raw (source) data from the target DynamoDB table
raw_data_dyf = glueContext.create_dynamic_frame_from_options("dynamodb", {"dynamodb.input.tableName": my_dynamodb_table, "dynamodb.throughput.read.percent": "0.50"})

## Get the necessary columns
selected_data_dyf = ApplyMapping.apply(frame=raw_data_dyf, mappings=mappingList)

## Get rid of null values
mapped_dyF = Map.apply(frame=selected_data_dyf, f=SetDefaults)

## Write it all out as a CSV
datasink = glueContext.write_dynamic_frame.from_options(frame=mapped_dyF, connection_type="s3", connection_options={"path": my_train_data}, format="csv", format_options={"writeHeader": False, "quoteChar": "-1"})
My ApplyMapping.apply call is doing the right thing, where mappingList is defined by a bunch of:
mappingList.append(('gsaid', 'bigint', 'gsaid', 'bigint'))
mappingList.append(('objectid', 'bigint', 'objectid', 'bigint'))
mappingList.append(('objecttype', 'bigint', 'objecttype', 'bigint'))
I have no errors and everything runs to completion. My data is all in S3, but there are still many empty values, rather than the "missing" entry I would like.
The "checking record" print statement never prints. What am I missing here?
Alternative solution:
Convert the DynamicFrame to a Spark DataFrame
Use the DataFrame's fillna() method to fill the null values
Convert the DataFrame back to a DynamicFrame
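A minimal sketch of that alternative, assuming the glueContext and the selected_data_dyf frame from the question are already in scope:

from awsglue.dynamicframe import DynamicFrame

# 1. Convert the DynamicFrame to a Spark DataFrame
df = selected_data_dyf.toDF()

# 2. Fill nulls with a default; fillna("missing") only touches string columns,
#    so numeric columns would need a numeric default or a per-column dict
df_filled = df.fillna("missing")

# 3. Convert back to a DynamicFrame for the rest of the Glue job
mapped_dyF = DynamicFrame.fromDF(df_filled, glueContext, "mapped_dyF")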

AWS Glue ApplyMapping from double to string

I'm having a bit of a frustrating issue with a Glue job.
I have a table which I have created from a crawler. It has gone through some CSV data and created a schema. Some elements of the schema need to be modified, e.g. converting numbers to strings and applying a header.
I seem to be running into some problems here: some fields appear to have been picked up as a double. When I try to convert these into a string, which is what I require, the result includes some empty precision, e.g. 1234 --> 1234.0.
The mapping code I have is something like:
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [
        ("col1", "double", "first_column_name", "string"),
        ("col2", "double", "second_column_name", "string")
    ],
    transformation_ctx = "applymapping1"
)
And the resulting table I get after I've crawled the data is something like:
first_column_name    second_column_name
1234.0               4321.0
5678.0               8765.0

as opposed to

first_column_name    second_column_name
1234                 4321
5678                 8765
Is there a good way to work around this? I've tried changing the schema in the table that is initially created by the crawler to a bigint as opposed to a double, but when I update the mapping code to ("col1","bigint","first_column_name","string") the table just ends up being null.
Just a little correction to botchniaque's answer: you actually have to do BOTH ResolveChoice and then ApplyMapping to ensure the correct type conversion.
ResolveChoice will make sure you have just one type in your column. If you skip this step and the ambiguity is not resolved, the column will become a struct and Redshift will show it as null in the end.
So apply ResolveChoice to make sure all your data is of one type (int, for instance):
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
Finally, use ApplyMapping to change the type to what you want:
df3 = ApplyMapping.apply(
    frame = df2,
    mappings = [
        ("col1", "int", "first_column_name", "string"),
        ("col2", "int", "second_column_name", "string")
    ],
    transformation_ctx = "applymapping1")
Hope this helps (:
Maybe your data really is of type double (some values may have fractions), and that's why changing the type results in the data being turned to null. Also, it's no wonder that when you change the type of a double field to string it gets serialized with a decimal component: it's still a double, just printed.
Have you tried explicitly casting the values to integer?
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
And then cast to string:
df3 = ResolveChoice.apply(df2, specs = [("col1", "cast:string"), ("col2", "cast:string")])
Or use ApplyMapping to change the type and rename, as you did above:
df3 = ApplyMapping.apply(
    frame = df2,
    mappings = [
        ("col1", "int", "first_column_name", "string"),
        ("col2", "int", "second_column_name", "string")
    ],
    transformation_ctx = "applymapping1"
)
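To illustrate the point that a double serialized to a string keeps its decimal component, here is a small standalone PySpark sketch (not part of the Glue job; just the same casts applied to a toy DataFrame):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1234.0,), (5678.0,)], ["col1"])

df.select(
    col("col1").cast("string").alias("double_to_string"),  # "1234.0" (decimal kept)
    col("col1").cast("int").cast("string").alias("via_int")  # "1234" (decimal dropped)
).show()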