I want to schedule an AWS Data Pipeline job hourly and use it to create hourly partitions on S3. Something like:
s3://my-bucket/2016/07/19/09/
s3://my-bucket/2016/07/19/10/
s3://my-bucket/2016/07/19/11/
I am using this expression in my EmrActivity:
s3://my-bucket/#{year(minusHours(#scheduledStartTime,1))}/#{month(minusHours(#scheduledStartTime,1))}/#{day(minusHours(#scheduledStartTime,1))}/#{hour(minusHours(#scheduledStartTime,1))}
However, the hour and month functions give me values such as 7 for July instead of 07, and 3 for the 3rd hour instead of 03. I would like the months, days, and hours to be zero-padded (when required).
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-reference-functions-datetime.html
You can use the format function to get hours/months in the format you want.
#{format(myDateTime,'YYYY-MM-dd HH:mm:ss')}
Refer to the link for more details: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-reference-functions-datetime.html
In your case, to get the hour zero-padded, this should work:
#{format(minusHours(#scheduledStartTime,1), 'HH')}
You can replace 'HH' with 'MM' to get zero-padded months. (Use 'HH' for the 24-hour clock; 'hh' would give the 1-12 clock hour.)
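Putting it together, a sketch of the full partition path from the question rewritten with format() (assuming the same #scheduledStartTime variable):
s3://my-bucket/#{format(minusHours(#scheduledStartTime,1),'YYYY')}/#{format(minusHours(#scheduledStartTime,1),'MM')}/#{format(minusHours(#scheduledStartTime,1),'dd')}/#{format(minusHours(#scheduledStartTime,1),'HH')}/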
I'm trying to do dynamic data partitioning by date with a Kinesis Firehose delivery stream. The payload I'm expecting is JSON, in this general format:
{
"clientId": "ASGr496mndGs80oCC97mf",
"createdAt": "2022-09-21T14:44:53.708Z",
...
}
I don't control the format of this date I'm working with.
I have my Firehose delivery stream set up with "Dynamic Partitioning" and "Inline JSON Parsing" enabled (because both are apparently required, per the AWS console UI).
I've got these set as "Dynamic Partitioning Keys"
year
.createdAt| strptime("%Y-%m-%dT%H:%M:%S.%fZ")| strftime("%Y")
month
.createdAt| strptime("%Y-%m-%dT%H:%M:%S.%fZ")| strftime("%m")
day
.createdAt| strptime("%Y-%m-%dT%H:%M:%S.%fZ")| strftime("%d")
hour
.createdAt| strptime("%Y-%m-%dT%H:%M:%S.%fZ")| strftime("%h")
But that gives me errors like: date "2022-09-21T18:30:04.431Z" does not match format "%Y-%m-%dT%H:%M:%S.%fZ".
It looks like strptime expects fractional seconds to be padded out to 6 places, but I have 3. These appear to be jq expressions, but I have exactly zero experience with jq, and the AWS documentation for this leaves a lot to be desired.
Is there a way to get strptime to successfully parse this format, or to just ignore the minute, second, and millisecond part of the time (I only care about hours)?
Is there another way to achieve what I'm trying to do here?
You can try the following:
.createdAt | strptime("%Y-%m-%dT%H:%M:%S%Z") | strftime("%Y")
This trims the milliseconds while retaining the rest of the datetime information. A jq snippet example applying it to all four partitioning keys is sketched below.
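As a sketch, the same fixed strptime format applied to each of the partitioning keys from the question (note %H rather than %h for the hour; %h is the abbreviated month name in strftime):
year
.createdAt | strptime("%Y-%m-%dT%H:%M:%S%Z") | strftime("%Y")
month
.createdAt | strptime("%Y-%m-%dT%H:%M:%S%Z") | strftime("%m")
day
.createdAt | strptime("%Y-%m-%dT%H:%M:%S%Z") | strftime("%d")
hour
.createdAt | strptime("%Y-%m-%dT%H:%M:%S%Z") | strftime("%H")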
I'm trying to schedule a job to trigger on the first Monday of each month:
This is the cron expression I got: 0 5 1-7 * 1
(which, as far as I can read Unix cron expressions, triggers at 5:00 am on a Monday if it happens to fall in the first 7 days of the month)
However, the job is triggered on what seem to be random days at 5:00 am; it was triggered today, on the 16th of August!
Am I reading the expression wrong? BTW, I'm setting the timezone to AEST, if that makes a difference.
You can use the legacy cron syntax to describe the schedule.
For your case, specify something like this:
"first monday of month 05:00"
Do explore the "Custom interval" tab in the linked documentation to get a better understanding of this.
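If this happens to be an App Engine cron job (an assumption; the platform isn't stated above), the cron.yaml entry might look like this sketch, with a hypothetical handler URL:
cron:
- description: first Monday of the month job
  url: /tasks/monthly-job
  schedule: first monday of month 05:00
  timezone: Australia/Sydney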
I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that every day a query is run to get the last 1 hour of data only. According to the docs the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore and if I don't include any delta window I get the records as expected. Unfortunately when I include a delta expression I get nothing (ever). I left the data accumulating for about 10 days now so there are a couple of million records in the database. Given that I was unsure what the optimal format would be I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt, which uses the same format as my datetime except that it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00).
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics so it makes sense that they would use it for the delta but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with -3 minute (negative) offset and 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to an equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (without the delta window configured) with the correct values for your offset and time expression and see what you get. The (__dt between ...) clause is just an optimization for limiting the scanned partitions; you can remove it for troubleshooting purposes.
Please try the following:
Set query to SELECT * FROM dev_iot_analytics_datastore
Data selection filter:
Data selection window: Delta time
Offset: -1 Hours
Timestamp expression: from_unixtime(timestamp_sec)
Wait for dataset content to run for a bit, say 15 minutes or more.
Check contents
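If the dataset content still comes back empty, the equivalent manual filter from the earlier answer, with these values substituted in, would be roughly the following (a sketch using the datastore and field names from the question):
SELECT * FROM "dev_iot_analytics_datastore" WHERE
iota_latest_succeeded_schedule_time() - interval '1' hour < from_unixtime(timestamp_sec) AND
from_unixtime(timestamp_sec) <= iota_current_schedule_time() - interval '1' hour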
After several weeks of testing and trying all the suggestions in this post, along with many more, it appears that the extremely technical answer was to 'switch it off and back on'. I deleted the whole analytics stack, rebuilt everything with different names, and it now seems to be working!
It's important to note that even though I have flagged this as the correct answer because it was the actual resolution, both of the answers provided by #Populus and #Roger would have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.
Current setup:
There's a master data source that contains attendance records per day for students in a given school. Imagine the data is structured in a CSV format like so:
name|day|in_attendance
jack|01/01/2018|0
and so on and so forth, throughout the entire year. Now, the way we grab attendance information for a specific period in time is to specify the year & month via parameters we're handing to an AWS Data Pipeline step, like so:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,01,2018
That step runs the Python file defined, and 01 and 2018 specify the month and year we're looking up. However, I want to change it so that it looks more like this:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,%myYear,%myMonth
myYear: 2018
myMonth: 01
Is there any way to achieve this kind of behavior in AWS Data Pipeline?
It turns out that the syntax I was using in the example wasn't far from the proper syntax. You can use supplied parameters in any portion of the pipeline (activities, etc.): you write #{myParameterName} wherever the parameter value should go.
This doesn't appear to be documented in AWS Data Pipeline documentation.
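As a sketch, the step and parameters from the question would then look something like this:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,#{myYear},#{myMonth}
myYear: 2018
myMonth: 01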
In WSO2 CEP, I made an execution plan that includes the following query:
(it will be fired if the temperature exceeds 20 degrees 3 times in a row within 10 seconds)
from MQTTstream[meta_temperature > 20]#window.time(10 sec)
select count(meta_temperature) as meta_temperature
having meta_temperature > 3
insert into out_temperatureAlarm
How can I ensure that the query is only applied during a specific time of day, e.g. from 08:00 until 10:00?
Is there something that I could put into the query like:
having meta_temperature > 3 and HOUR_OF_THE_DAY BETWEEN 8 and 10
You can use a cron window (#window.cron) instead of a time window (#window.time). You can specify a cron expression string for the desired time periods in Siddhi [1]. Please refer to the Quartz scheduler documentation for more information on cron expression strings [2].
[1] https://docs.wso2.com/display/CEP400/Inbuilt+Windows#InbuiltWindows-croncron
[2] http://www.quartz-scheduler.org/documentation/quartz-1.x/tutorials/crontrigger
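For example, a rough sketch of the query from the question using a cron window instead (assuming a Quartz expression covering every minute from 08:00 to 09:59):
from MQTTstream[meta_temperature > 20]#window.cron('0 * 8-9 ? * *')
select count(meta_temperature) as meta_temperature
having meta_temperature > 3
insert into out_temperatureAlarm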