I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However it appears, that Glue still attempts to read data outside the specified date range?
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice the querydatetime=2019-03-01 and querydatetime=2019-03-10 its outside the specified range of 2019-02-13 - 2019-02-27. Is that why there's the next line "aborting HTTP connection" tho? It goes on to say "This is likely an error and may result in sub-optimal behavior" is something wrong?
I wonder if the problem is because it does not support BETWEEN inside the predicate or IN?
The table create DDL
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issue I can see with the code is that you are using "today" instead of "end" in the between clause. Though I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date.
In that case the range will be different and the partitions being read by glue spark is correct.
In order to push down your condition, you need to change the order of columns in your partition by clause of table definition
A condition having "in" predicate on first partition column can not be push down as you are expecting.
Let me if it helps.
Pushdown predicates in Glue DynamicFrame works fine with between as well as IN clause.
As long as you have correct sequence of partition columns defined in table definition and in query.
I have table with three level of partitions.
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read data in dynamicFrame
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So, you are gonna get the correct results, if sequence of columns is correctly specified. Also note, you need to specify '(quote) IN('%s') in IN clause.
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( ",".join(map(lambda s: str(s), arr)),
start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
Try to do the end as this
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
Related
I am facing a similar problem, using boto3 the query does not work, while it works on console.
First I tried this scan without success:
text = 'city:barcelona'
filter_expr = Attr('timestamp').between('2020-04-01', '2020-04-27')
filter_expr = filter_expr & Attr('text').eq(text)
table.scan(FilterExpression = filter_expr, Limit = 1000)
Then, I notice that for a text variable that does not contain ":", the scan works.
So, I tried this second scan using ExpressionAttributeNames and ExpressionAttributeValues
table.scan(
FilterExpression = "#n0 between :v0 AND :v1 AND #n1 = :v2",
ExpressionAttributeNames = {'#n0': 'timestamp', '#n1': 'text'},
ExpressionAttributeValues = {
':v0': '2020-04-01',
':v1': '2020-04-27',
':v2': {"S": text}},
Limit = 1000
)
Failed again.
By the end, if I change in the first example:
text = 'barcelona'
filter_expr = filter_expr & Attr('text').contains(text)
I can get the records. IMO, it is clear that the problem is the ":"
Is there another way to search by texts with ":" character?
[writing an answer so that we can close out the question]
I ran both examples and they worked correctly for me. I configured text and timestamp as string fields. Check you have an up to date boto3 library.
Note: I changed ':v2': {"S": text} to ':v2': text because you're using resource level scan and you don't need to supply the low-level attribute type (it's only required for client level scan).
i'm trying to copy gz files from my S3 directory to Snowflake.
i created a table in snowflake (notice that the 'extra' field is defined as 'Variant')
CREATE TABLE accesslog
(
loghash VARCHAR(32) NOT NULL,
logdatetime TIMESTAMP,
ip VARCHAR(15),
country VARCHAR(2),
querystring VARCHAR(2000),
version VARCHAR(15),
partner INTEGER,
name VARCHAR(100),
countervalue DOUBLE PRECISION,
username VARCHAR(50),
gamesessionid VARCHAR(36),
gameid INTEGER,
ingameid INTEGER,
machineuid VARCHAR(36),
extra variant,
ingame_window_name VARCHAR(2000),
extension_id VARCHAR(50)
);
i used this copy command in snowflake:
copy INTO accesslog
FROM s3://XXX
pattern='.*cds_201911.*'
CREDENTIALS = (
aws_key_id='XXX',
aws_secret_key='XXX')
FILE_FORMAT=(
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
TYPE = CSV
COMPRESSION = GZIP
FIELD_DELIMITER = '\t'
)
ON_ERROR = CONTINUE
I run it, and got this result (i got many error lines, this is an example to one)
snowflake result
snowflake result -more
a17589e44ae66ffb0a12360beab5ac12 2019-11-01 00:08:39 155.4.208.0 SE 0.136.0 3337 game_process_detected 0 OW_287d4ea0-4892-4814-b2a8-3a5703ae68f3 e9464ba4c9374275991f15e5ed7add13 765 19f030d4-f85f-4b85-9f12-6db9360d7fcc [{"Name":"file","Value":"wowvoiceproxy.exe"},{"Name":"folder","Value":"C:\\Program Files (x86)\\World of Warcraft\\_retail_\\Utils\\WowVoiceProxy.exe"}]
can you please tell me what cause this error?
thanks!
I'm guessing;
The 'Error parsing JSON' is certainly related to the extra variant field.
The JSON looks fine, but there are potential problems with the backslashes \.
If you look at the successfully loaded lines, have the backslashes been removed?
This can (maybe) happen if you have STAGE settings involving escape characters.
The \\Utils substring in the Windows path value can then trigger a Unicode decode error, eg.
Error parsing JSON: hex digit is expected in \U???????? escape sequence, pos 123
UPDATE:
It turns out you have to turn off escape char processing by adding the following to the FILE_FORMAT:
ESCAPE_UNENCLOSED_FIELD = NONE
The alternative is to doublequote fields or to doubly escape backslash, eg. C:\\\\Program Files.
I'm trying to extract a dataset from dynamodb to s3 using Glue. In the process I want to select a handful of columns, then set a default value for any and all rows/columns that have missing values.
My attempt is currently to use the "Map" function, but it doesn't seem to be calling my method.
Here is what I have:
def SetDefaults(rec):
print("checking record")
for col in rec:
if not rec[col]:
rec[col] = "missing"
return rec
## Read raw(source) data from target DynamoDB
raw_data_dyf = glueContext.create_dynamic_frame_from_options("dynamodb", {"dynamodb.input.tableName" : my_dynamodb_table, "dynamodb.throughput.read.percent" : "0.50" } )
## Get the necessary columns
selected_data_dyf = ApplyMapping.apply(frame = raw_data_dyf, mappings = mappingList)
## get rid of null values
mapped_dyF = Map.apply(frame=selected_data_dyf, f=SetDefaults)
## write it all out as a csv
datasink = glueContext.write_dynamic_frame.from_options(frame=mapped_dyF , connection_type="s3", connection_options={ "path": my_train_data }, format="csv", format_options = {"writeHeader": False , "quoteChar": "-1" })
My ApplyMapping.apply call is doing the right thing, where mappingList is defined by a bunch of:
mappingList.append(('gsaid', 'bigint', 'gsaid', 'bigint'))
mappingList.append(('objectid', 'bigint', 'objectid', 'bigint'))
mappingList.append(('objecttype', 'bigint', 'objecttype', 'bigint'))
I have no errors, everything runs to completion. My data is all in s3, but there are many empty values still, rather than the "missing" entry I would like.
The "checking record" print statement never prints out. What am I missing here?
Alternative solution:
Convert DynamicFrame to Spark DataFrame
Use the DataFrame's fillna() method to fill the null values
Convert the DataFrame back to DynamicFrame
I need to select only one line, the last line from many multiple line csv files and add them to a table in aws athena, and then export them to a csv as a whole list.
I am trying to collect data from many sources and the csv files are updated weekly but I only need one line from each file. I have used the standard import to athena and it imports all lines from the selected csv's in the bucket but I need only the last line of each, so that i have the most resent data from that file.
CREATE EXTERNAL TABLE IF NOT EXISTS inventory.laptops (
`date` string,
`serialnum` string,
`biosver` string,
`machine` string,
`manufacturer` string,
`model` string,
`win` string,
`winver` string,
`driveletter` string,
`size` string,
`macaddr` string,
`domain` string,
`ram` string,
`processor` string,
`users` string,
`fullname` string,
`location` string,
`lastconnected` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'quoteChar' = '"',
'field.delim' = ','
) LOCATION 's3://my-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data'='false',"skip.header.line.count"="1");
I need the last line from each csv file in the s3 but I get every line using this creation query
Yes, CREATE TABLE defines how to read the file. You will need to craft a SELECT statement to retrieve the desired line. You will need to use some identifier in the file that can indicate the last line, such as having the latest date.
For example, if the last line always has the most recent date, you could use:
SELECT *
FROM inventory.laptops
ORDER BY date
LIMIT 1
If there is no field that can be used to identify the last line, you might need to cheat by finding out the number of lines in the file, then skipping over all but the last line using skip.header.line.count.
Normally, the order of rows in a file is unimportant.
So this is impossible but you can create a lambda function to concatenate the last line of multiple csv files in a bucket directory and print to a single csv and then import it to athena for querying. I used python to solve this.
import logging
import boto3 ,os
import json
logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')
def lambda_handler(event, context):
data = ''
# retrieve bucket name and file_key from the S3 event
bucket_name = os.environ['s3_bucket']
# get the object
obj_list = s3.list_objects_v2(Bucket = bucket_name, Prefix = 'bucket prefix')
x = 0
for object in obj_list['Contents']:
obj = s3.get_object(Bucket=bucket_name, Key=object['Key'])
# get lines inside the csv
lines = obj['Body'].read().split(b'\n')
f = 0
for r in lines:
f += 1
#Reads the number of lines in the file
b = 0
for r in lines:
if x < 1:
x +=1
if b == 0:
header = (r.decode())
data +=(header)
b += 1
if b == f-1:
data += (r.decode())
s3.put_object(Bucket=bucket_name, Key='Concat.csv', Body=data)
My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values by using all lines between each key.
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
content = inFile.read()
rawList = content.splitlines()
for line in rawList:
cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
itemID = pattern.search(cleanLine)
print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
for line in f:
# process a line
It sounds, however, that you have multi-line block oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key, value in your dict:
pat=re.compile(pattern, flags)
with open(file_name) as f:
di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
With a larger file, use a mmap:
import re, mmap
pat=re.compile(pattern, flags)
with open(file_name, 'r+') as f:
mm = mmap.mmap(f.fileno(), 0)
for i, m in enumerate(pat.finditer(mm)):
# process each block accordingly...
As far as the regex, I am a little unclear on what you are trying to capture or not. I think this regex is what I am understanding you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
Demo
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
... di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
content = inFile.read()
rawList = content.splitlines()
keyed_dict = {}
in_between_lines = ""
last_key = 0
for line in rawList:
cleanLine = re.sub('^SQL> ', '', line)
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
itemID = pattern.search(cleanLine)
if last_key: keyed_dict[last_key] = in_between_lines
last_key = itemID.group(0)
in_between_lines = ""
else:
in_between_lines += cleanLine