Is there a way to define AWS Glue input path with wildcard? - amazon-web-services

I have a Glue job that looks at the files for the current date (each date has a folder in S3) and processes the data in that folder (e.g. "s3://bucket_name/year/month/day"). Now I want to find a way to define the input S3 path so that Glue looks at both the previous day and the current day. Is there a way to do this?
current_glue_input_path = "s3://bucket_name/2021/08/12"
I want to find a regular expression (maybe a wildcard?) to tell Glue to look at both "s3://bucket_name/2021/08/11" and "s3://bucket_name/2021/08/12". Is there a way to do so?
From this documentation, under the 'Example of Excluding a Subset of Amazon S3 Partitions' section:
The second part, 2015/0[2-9]/**, excludes days in months 02 to 09, in year 2015.
Not sure if this makes sense; can someone help, please? Thanks.
(I just realized that this documentation describes exclude patterns for the Glue crawler, but I'm talking about the Glue job; am I looking in the wrong place...?)

Would calculating the current and previous dates programmatically work? Python sample below:
from datetime import datetime, timedelta
# Today's and yesterday's dates as YYYYMMDD strings
date_today = datetime.today().strftime('%Y%m%d')
date_yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y%m%d')
# Build the two S3 input paths (year/month/day)
current_glue_input_path = f's3://bucket_name/{date_today[0:4]}/{date_today[4:6]}/{date_today[6:8]}'
yesterday_glue_input_path = f's3://bucket_name/{date_yesterday[0:4]}/{date_yesterday[4:6]}/{date_yesterday[6:8]}'
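Both paths can then be passed to the job in a single read. A minimal sketch, assuming a PySpark Glue job that reads via DynamicFrames and JSON data (the format and the from_options read are assumptions; adjust to how your job actually loads data):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read yesterday's and today's folders into a single DynamicFrame
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': [yesterday_glue_input_path, current_glue_input_path]},
    format='json')  # assumption: swap for csv/parquet/etc. as appropriate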

Related

S3 select - How can I query by non-standard timestamp comparison

I'm using an S3 bucket where the data is organized into files by an ID & year/month - meaning one file per ID & month.
In each (csv.gz) file each record has a timestamp in the format: YYYY-MM-dd HH:mm:ss (note the missing T).
Now, when querying the data I want to support datetime granularity down to seconds, so naturally it is desirable to filter the data in S3 already, before handling it in Python.
However, I can't find any method to do this.
The function TO_TIMESTAMP doesn't support a user-provided format (it expects a T date/time separator), and combining SUBSTRING and CAST (CAST(SUBSTRING(my_timestamp_column, 1, 10) AS TIMESTAMP)) yields a "The query cannot be evaluated" error.
Is there any way around this?
The documentation states that the function TO_TIMESTAMP is "the inverse operation of TO_STRING" which is not quite true as the latter supports a time_format_pattern.
I think I had to solve the same or similar problem. As you said, the documentation (https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-date.html#s3-glacier-select-sql-reference-to-timestamp) states that the function TO_TIMESTAMP is the inverse operation of TO_STRING. But the documentation to me was misleading, because it does not make clear that the TO_TIMESTAMP function does support time_format_pattern as a second argument. The documentation shows that it only takes one argument, but it can in fact take two.
I was able to convert a non-standard timestamp 20190101T050000.000Z from type string to timestamp like so:
aws s3api select-object-content --bucket foo_bucket --key foo.json.gz --expression "SELECT * FROM s3object s WHERE TO_TIMESTAMP(s.\"timestamp\", 'yMMdd''T''Hmmss.SSS''Z''') < TO_TIMESTAMP('20190101T050000.000Z', 'yMMdd''T''Hmmss.SSS''Z''')" --expression-type 'SQL' --input-serialization '{ "CompressionType": "GZIP","JSON": {"Type": "DOCUMENT"}}' --output-serialization '{"JSON": {"RecordDelimiter": "\n"}}' /dev/shm/foo.json
Hope that helps somebody out.
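In case a Python version is handy, the same query can presumably be run with boto3's select_object_content; the bucket, key, expression and serialization settings below are copied from the CLI example above:
import boto3

s3 = boto3.client('s3')
resp = s3.select_object_content(
    Bucket='foo_bucket',
    Key='foo.json.gz',
    ExpressionType='SQL',
    Expression=("SELECT * FROM s3object s "
                "WHERE TO_TIMESTAMP(s.\"timestamp\", 'yMMdd''T''Hmmss.SSS''Z''') "
                "< TO_TIMESTAMP('20190101T050000.000Z', 'yMMdd''T''Hmmss.SSS''Z''')"),
    InputSerialization={'CompressionType': 'GZIP', 'JSON': {'Type': 'DOCUMENT'}},
    OutputSerialization={'JSON': {'RecordDelimiter': '\n'}})
# The response payload is an event stream; collect the Records events
for event in resp['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'), end='')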
I had the same issue over here. I went a step further and changed my CSV file so that the date field has the format required by the TIMESTAMP data type in S3 Select. The required format is described here: S3 data types.
So first, to answer the question: based on the S3 Select documentation, I think it is not possible to work with a date without the T at the end. Once you correct that, you will be able to work with the CAST function. This is what I do:
select * from s3object as s where CAST('2020-01-01T' AS TIMESTAMP) < CAST('2021-01-01T' AS TIMESTAMP)
That works just fine; however, as you can see, I'm not passing s."Date" (which is the field header in my CSV file) due to the following error:
Attempt to convert from one data type to another failed at line 1, column 39: cast from STRING to TIMESTAMP.
I hope this has helped a little bit, and I hope someone can help with this error.

AWS IoT Analytics Delta Window

I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that every day a query is run to get the last 1 hour of data only. According to the docs the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore, and if I don't include any delta window I get the records as expected. Unfortunately, when I include a delta expression I get nothing (ever). I have left the data accumulating for about 10 days now, so there are a couple of million records in the database. Since I was unsure what the optimal format would be, I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt, which uses the same format as my datetime except that it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00).
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics so it makes sense that they would use it for the delta but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with -3 minute (negative) offset and 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to an equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (with no delta time filter) with the correct values for offset and time expression and see what you get. The (__dt between ...) clause is just an optimization for limiting the scanned partitions; you can remove it for the purposes of troubleshooting.
Please try the following:
Set query to SELECT * FROM dev_iot_analytics_datastore
Data selection filter:
Data selection window: Delta time
Offset: -1 Hours
Timestamp expression: from_unixtime(timestamp_sec)
Wait for dataset content to run for a bit, say 15 minutes or more.
Check contents
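If the console settings above still return nothing, it may also be worth defining the dataset programmatically to rule out console quirks. A rough sketch using boto3's iotanalytics client (the dataset and action names are placeholders, and I'm assuming the console's -1 hour offset corresponds to offsetSeconds=-3600):
import boto3

iota = boto3.client('iotanalytics')
iota.create_dataset(
    datasetName='dev_iot_analytics_dataset',  # placeholder name
    actions=[{
        'actionName': 'hourly_query',
        'queryAction': {
            'sqlQuery': 'SELECT * FROM dev_iot_analytics_datastore',
            'filters': [{'deltaTime': {
                'offsetSeconds': -3600,  # -1 hour offset, as in the settings above
                'timeExpression': 'from_unixtime(timestamp_sec)'}}]}}],
    triggers=[{'schedule': {'expression': 'cron(0 * * * ? *)'}}])  # run hourly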
After several weeks of testing and trying all the suggestions in this post, along with many more, it appears that the extremely technical answer was to 'switch it off and back on'. I deleted the whole analytics stack, rebuilt everything with different names, and it now seems to be working!
It's important to note that even though I have flagged this as the correct answer because it was the actual resolution, both the answers provided by #Populus and #Roger would have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.

How can I delete old files by name in S3 bucket?

Much like in S3-Bucket/Management/Lifecycles using prefixes, I'd like to prune old files whose names contain certain words.
I'm looking to remove files that start with Screenshot or have screencast in the filename and are older than 365 days.
Examples
/Screenshot 2017-03-19 10.11.12.png
folder1/Screenshot 2019-03-01 14.31.55.png
folder2/sub_folder/project-screencast.mp4
I'm currently testing if lifecycle prefixes work on files too.
You can write a program to do it, such as this Python script:
from datetime import datetime, timezone
import boto3
s3 = boto3.client('s3', region_name='ap-southeast-2')
# Note: list_objects_v2 returns at most 1000 keys per call; use a paginator for larger buckets
response = s3.list_objects_v2(Bucket='my-bucket')
# LastModified is timezone-aware, so compare against a timezone-aware cutoff
keys_to_delete = [
    {'Key': obj['Key']}
    for obj in response['Contents']
    if obj['LastModified'] < datetime(2018, 3, 20, tzinfo=timezone.utc)
    and ('Screenshot' in obj['Key'] or 'screencast' in obj['Key'])
]
s3.delete_objects(Bucket='my-bucket', Delete={'Objects': keys_to_delete})
You could modify it to be "1 year ago" rather than a specific date.
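For example, a sketch of that tweak (treating a year as 365 days; the cutoff is timezone-aware so it compares cleanly with LastModified):
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=365)
# then use: obj['LastModified'] < cutoff  in the comprehension above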
I don't believe that you can apply lifecycle rules with wildcards such as *screencast*, only with prefixes such as "taxes/" or "taxes/2010".
For your case, I would probably write a script (or perhaps an Athena query) to filter an S3 Inventory report for those files that match your name/age conditions, and then prune them.
Of course, you could write a program to do this as #John Rotenstein suggests. The one time that might not be ideal is if you have millions or billions of objects because the time to enumerate the list of objects would be significant. But it would be fine for a reasonable number of objects.
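For a reasonable number of objects, the programmatic route might look like this once pagination is added (a sketch; the bucket name and 365-day cutoff are placeholders, and delete_objects accepts at most 1000 keys per call):
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client('s3')
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

# Enumerate every object page by page and collect matching keys
to_delete = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        if obj['LastModified'] < cutoff and (
                'Screenshot' in obj['Key'] or 'screencast' in obj['Key']):
            to_delete.append({'Key': obj['Key']})

# Delete in batches of at most 1000 keys
for i in range(0, len(to_delete), 1000):
    s3.delete_objects(Bucket='my-bucket', Delete={'Objects': to_delete[i:i + 1000]})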

PyTrends Historical Hourly Data

I am new to Python and coding in general. Thank you for your patience.
Using PyTrends I am trying to get hourly Google Trends results for a single search term for an entire year. I see that the Python Software Foundation page (https://pypi.org/project/pytrends/) states that 'now 1-H' "Seems to only work for 1, 4 hours only". I have tried some examples of people trying to get custom hourly searches, but none work for me. I am wondering: is it no longer possible to get Google Trends historical hourly data, and should I just stop looking?
Maybe a little bit late already, but in case anybody finds this question in the future: you can find the most recent way to get historical hourly Google Trends data on the GitHub page of the PyTrends developer:
https://github.com/GeneralMills/pytrends#historical-hourly-interest
A short snippet with the respective function applied:
# Since pytrends is returning a DataFrame object, we need pandas:
import pandas as pd
# Import of pytrends (needs to be pip installed first):
from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-US', tz=360)
kw_list = ['search_term_1', 'search_term_2']
search_df = pytrends.get_historical_interest(
    kw_list,
    year_start=2019, month_start=5, day_start=1, hour_start=0,
    year_end=2019, month_end=7, day_end=31, hour_end=0,
    cat=0, geo='', gprop='', sleep=60)
Replace the respective parameters with the desired ones and there you go - hourly historical Google Trends data.
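If useful, the returned search_df is a regular pandas DataFrame, so it can be inspected or exported as usual, e.g.:
print(search_df.head())                # first few hourly rows
search_df.to_csv('hourly_trends.csv')  # hypothetical output file name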
Hope it helped!
Best,
yawicz

Is it possible to use one parameter in another in AWS Data Pipeline?

Current setup:
There's a master data source that contains attendance records per day for students in a given school. Imagine the data is structured in a CSV format like so:
name|day|in_attendance
jack|01/01/2018|0
and so on and so forth, throughout the entire year. Now, the way we grab attendance information for a specific period in time is to specify the year & month via parameters we hand to an AWS Data Pipeline step, like so:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,01,2018
That step runs the Python file defined, and 01 and 2018 specify the month and year we're looking up. However, I want to change it so that it looks more like this:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,%myYear,%myMonth
myYear: 2018
myMonth: 01
Is there any way to achieve this kind of behavior in AWS Data Pipeline?
It turns out that the syntax I was using in the example wasn't far from the proper syntax. You can use supplied parameters in any portion of the pipeline (activities, etc.): you put #{myParameterName} where the parameter value should go.
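Applied to the example from the question, the step and parameters would presumably look like this:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,#{myYear},#{myMonth}
myYear: 2018
myMonth: 01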
This doesn't appear to be documented in AWS Data Pipeline documentation.