I have an internal API, backed by CloudWatch Events, that lets developers schedule cronjobs. The user-provided schedule expression must be a valid value for CloudWatch Events. Is there a utility/library for validating the rate and cron schedule expression values before making the API call to AWS to create the rule?
https://ap-southeast-1.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#rules:action=create
There is an undocumented API behind the console page above: you provide a cron expression and it will validate the expression and return the next 10 dates on which the cron will run.
The console does a POST to this URL:
https://ap-southeast-1.console.aws.amazon.com/cloudwatch/CloudWatch/data/jetstream.TestScheduleExpression/20191203005825803-2287194097589902
With this payload:
{"Expression":"cron(0 19 ? * MON-FRI *)","Limit":10}
!CW-Client-Metrics!
{"clientMetrics":{"cwdbSetWizardRuleScheduleExpressionAct":1,"cwdbSaveCronAndGetTriggerDatesAct":1}}
It returns the next 10 run times as epoch timestamps in the NextTriggerDates field of the response.
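For completeness, a rough sketch of what that console call looks like from code. This is an unsupported, undocumented endpoint: it requires the cookies/headers of an authenticated console session (which the browser supplies), the browser also appends the client-metrics section shown above to the body, and the URL suffix looks session-specific, so treat this purely as an illustration.
import requests

# Hypothetical sketch only: console session authentication is not shown here.
url = "https://ap-southeast-1.console.aws.amazon.com/cloudwatch/CloudWatch/data/jetstream.TestScheduleExpression/20191203005825803-2287194097589902"
payload = {"Expression": "cron(0 19 ? * MON-FRI *)", "Limit": 10}
response = requests.post(url, json=payload)  # plus the console session cookies/headers
print(response.json().get("NextTriggerDates"))  # expected: list of epoch timestamps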
I am afraid I have not seen any good regex to handle the AWS style so far.
You can use the following regexes to validate it:
// Minutes
/^([*]|([0-5]?\d)|((([0-5]?\d)|(\*))\/([0-5]?\d))|(([0-5]?\d)-([0-5]?\d))|((([0-5]?\d)|(\*))(,(([0-5]?\d)|(\*)))*))$/
// Hours
/^([*]|[01]?\d|2[0-3]|((([01]?\d|2[0-3]?)|(\*))\/([01]?\d|2[0-3]?))|(([01]?\d|2[0-3]?)-([01]?\d|2[0-3]?))|((([01]?\d|2[0-3]?)|(\*))((,)(([01]?\d|2[0-3]?)|(\*))){0,23}))$/
// Day of months
/^([*]|[?]|(([1-9]|[12]\d|3[01])[LW]?)|(([1-9]|[12]\d|3[01])-([1-9]|[12]\d|3[01]))|((([1-9]|[12]\d|3[01])|(\*))(\/)([1-9]|[12]\d|3[01]))|((([1-9]|[12]\d|3[01])|(\*))((,)(([1-9]|[12]\d|3[01])|(\*)))*))$/
// Months
/^([*]|([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|((([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(\*))\/(([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)))|((([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC))-(([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)))|((([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(\*))((,)(([2-9]|1[0-2]?)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(\*)))*))$/
// Day of Week
/^([*]|[?]|([1-7]L?)|(SUN|MON|TUE|WED|THU|FRI|SAT)|((([1-7])|(SUN|MON|TUE|WED|THU|FRI|SAT))(-|,|#)(([1-7])|(SUN|MON|TUE|WED|THU|FRI|SAT)))|((([1-7])|(SUN|MON|TUE|WED|THU|FRI|SAT)|(\*))\/(([1-7])|(SUN|MON|TUE|WED|THU|FRI|SAT)))|((([1-7])|(SUN|MON|TUE|WED|THU|FRI|SAT)|(\*))((,)(([1-7])|(SUN|MON|TUE|WED|THU|FRI|SAT)|(\*)))*))$/
// Year
/^([*]|([1-2]\d{3})|(((([1-2]\d{3})|(\*)))\/((\d{0,4})))|(([1-2]\d{3})-([1-2]\d{0,3}))|((([1-2]\d{3})|(\*))((,)(([1-2]\d{3})|(\*)))*))$/
Note: the year regex doesn't check the valid range (1970-2199), but that can easily be done in code.
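For example, here is a minimal Python sketch of how these patterns could be applied. Only the minutes pattern is repeated below; the other five field patterns above plug into the same structure, and the helper name split_cron_fields is made up for illustration.
import re

# Minutes pattern from above; the other field patterns slot in the same way.
MINUTES_RE = re.compile(r"^([*]|([0-5]?\d)|((([0-5]?\d)|(\*))\/([0-5]?\d))|(([0-5]?\d)-([0-5]?\d))|((([0-5]?\d)|(\*))(,(([0-5]?\d)|(\*)))*))$")

def split_cron_fields(expression):
    """Return the six fields inside a 'cron(...)' schedule expression, or None if malformed."""
    m = re.fullmatch(r"cron\((.+)\)", expression.strip())
    if not m:
        return None
    fields = m.group(1).split()
    return fields if len(fields) == 6 else None

fields = split_cron_fields("cron(0/15 10 ? * MON-FRI *)")
print(fields)                             # ['0/15', '10', '?', '*', 'MON-FRI', '*']
print(bool(MINUTES_RE.match(fields[0])))  # True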
I'm not sure if it helps, but I created an AWS cron expression validator in Python and published it on PyPI here https://pypi.org/project/aws-cron-expression-validator/
pip install aws-cron-expression-validator
Usage:
from aws_cron_expression_validator.validator import AWSCronExpressionValidator

my_expression = "0 18 ? * MON-FRIbad *"
try:
    AWSCronExpressionValidator.validate(my_expression)
except ValueError as e:
    print(f"Oh no! My expression was invalid: {e}")
    # Returns: Oh no! My expression was invalid: Invalid day-of-week value 'MON-FRIbad'.
I'm trying to schedule a job to trigger on the first Monday of each month:
This is the cron expression I got: 0 5 1-7 * 1
(which, as far as I can read Unix cron expressions, triggers at 5:00 am on a Monday if it happens to fall in the first 7 days of the month)
However, the job is triggered on what seem to be random days at 5:00 am. The job was triggered today, on the 16th of Aug!
Am I reading the expression awfully wrong? BTW, I'm setting the timezone to AEST, if that makes a difference.
You can use the legacy cron syntax to describe the schedule. (In standard Unix cron, when both the day-of-month and day-of-week fields are restricted, the job runs when either field matches, so 0 5 1-7 * 1 fires on days 1-7 and on every Monday, which explains the seemingly random trigger days.)
For your case, specify something like the following:
"first monday of month 05:00"
Do explore the "Custom interval" tab at the provided link to get a better understanding of this.
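If the job is actually an EventBridge/CloudWatch Events rule (as in the rest of this page), AWS cron supports a '#' qualifier in the day-of-week field, so "first Monday of the month at 05:00" can be expressed directly. A hedged boto3 sketch follows; the rule name is made up, and note that the schedule is evaluated in UTC, so the hour would need adjusting for AEST.
import boto3

events = boto3.client("events")
events.put_rule(
    Name="first-monday-0500",                  # hypothetical rule name
    ScheduleExpression="cron(0 5 ? * 2#1 *)",  # day-of-week 2 = Monday, #1 = first occurrence in the month
)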
I have a Glue job that looks at the files for the current date (each date has a folder in S3) and processes the data in that folder (e.g. "s3://bucket_name/year/month/day"). Now I want a way to define the input S3 path that tells Glue to look at both the previous day and the current day. Is there a way to do this?
current_glue_input_path = "s3://bucket_name/2021/08/12"
I want to find a regex expression (maybe a wildcard?) that tells Glue to look at both "s3://bucket_name/2021/08/11" and "s3://bucket_name/2021/08/12". Is there a way to do so?
From this documentation, under the 'Example of Excluding a Subset of Amazon S3 Partitions' section:
The second part, 2015/0[2-9]/**, excludes days in months 02 to 09, in year 2015.
Not sure if this makes sense; can someone help, please? Thanks.
(I just realized that this documentation covers the regex for the Glue crawler, while I'm talking about the Glue job; am I looking in the wrong place...?)
Would calculating the current and previous dates programmatically work? Python sample below:
from datetime import datetime, timedelta

# Build zero-padded YYYYMMDD strings for today and yesterday
date_today = datetime.today().strftime('%Y%m%d')
date_yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y%m%d')

# Slice the strings back into year/month/day path components
current_glue_input_path = f's3://bucket_name/{date_today[0:4]}/{date_today[4:6]}/{date_today[6:8]}'
yesterday_glue_input_path = f's3://bucket_name/{date_yesterday[0:4]}/{date_yesterday[4:6]}/{date_yesterday[6:8]}'
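If it helps, a sketch of how both prefixes could then be read in one go from a Glue PySpark job; this assumes a Glue job environment and JSON data, so adjust the format to whatever the files actually are.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read yesterday's and today's partitions together by passing both prefixes
frame = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [yesterday_glue_input_path, current_glue_input_path]},
    format="json",  # assumption: change to the real file format
)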
I want to set up a CloudWatch event which will get triggered every X minutes, but it should not be triggered at the 0th minute; i.e. it should be triggered at current time + X minutes, 2X minutes, 3X minutes, etc. How can I do that?
Update: I have already set up the CloudWatch event; my only problem is the cron expression. I want a cron expression that schedules the event starting 23 minutes from the current time and every 23 minutes thereafter.
0/23 * * * ? * doesn't work because it gets triggered at the 0th minute
23/23 * * * ? * doesn't work because the 1st event may not be 23 minutes after the current time
Jay,
You'll need to set up a CloudWatch event for this. Here's the link to it. You can further look at the cron expression syntax using this link. And you can build the cron expression online using this link.
The above-mentioned links should fulfill your requirements.
Let me know if you need any further help. Thanks!
UPDATE:
As per your comments, the following cron expression should fulfill your requirements. You can check by pasting the expression here.
0 23/23 * * * ? *
I think you just want the events to be triggered at :23 and :46 of every hour.
This should be the correct cron expression:
23,46 * * * ? *
I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that a query runs every hour to get only the last 1 hour of data. According to the docs, the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore, and if I don't include any delta window I get the records as expected. Unfortunately, when I include a delta expression I get nothing (ever). I have left the data accumulating for about 10 days now, so there are a couple of million records in the datastore. Given that I was unsure what the optimal format would be, I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt which uses the same format as my datetime except it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00).
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto, as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics, so it makes sense that they would use it for the delta, but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with a -3 minute (negative) offset and a 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to this equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (with no delta time filter configured) with the correct values for your offset and time expression and see what you get. The (__dt between ...) clause is just an optimization for limiting the scanned partitions; you can remove it for the purposes of troubleshooting.
Please try the following:
Set query to SELECT * FROM dev_iot_analytics_datastore
Data selection filter:
Data selection window: Delta time
Offset: -1 Hours
Timestamp expression: from_unixtime(timestamp_sec)
Wait for dataset content to run for a bit, say 15 minutes or more.
Check contents
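For reference, a boto3 sketch of how that configuration maps to the API; the dataset and action names are made up, and the delta window's offsetSeconds and timeExpression are the two settings being discussed here.
import boto3

iot = boto3.client("iotanalytics")
iot.create_dataset(
    datasetName="dev_iot_analytics_dataset",    # hypothetical dataset name
    actions=[{
        "actionName": "hourly_query",           # hypothetical action name
        "queryAction": {
            "sqlQuery": "SELECT * FROM dev_iot_analytics_datastore",
            "filters": [{
                "deltaTime": {
                    "offsetSeconds": -3600,     # -1 hour offset, as suggested above
                    "timeExpression": "from_unixtime(timestamp_sec)",
                },
            }],
        },
    }],
    triggers=[{"schedule": {"expression": "cron(0 * * * ? *)"}}],  # run hourly
)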
After several weeks of testing and trying all the suggestions in this post, along with many more, it appears that the extremely technical answer was to 'switch it off and back on'. I deleted the whole analytics stack and rebuilt everything with different names, and it now seems to be working!
It's important to note that even though I have flagged this as the correct answer because it was the actual resolution, the answers provided by #Populus and #Roger would both have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.
I want to schedule an AWS Data Pipeline job hourly. I would like to create hourly partitions on S3 using this. Something like:
s3://my-bucket/2016/07/19/09/
s3://my-bucket/2016/07/19/10/
s3://my-bucket/2016/07/19/11/
I am using expressions for my EMRActivity for this:
s3://my-bucket/#{year(minusHours(#scheduledStartTime,1))}/#{month(minusHours(#scheduledStartTime,1))}/#{day(minusHours(#scheduledStartTime,1))}/#{hour(minusHours(#scheduledStartTime,1))}
However, the hour and month functions give me values such as 7 for July instead of 07, and 3 for the 3rd hour instead of 03. I would like to get months and hours with a 0 appended (when required).
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-reference-functions-datetime.html
You can use the format function to get hours/months in the format you want.
#{format(myDateTime,'YYYY-MM-dd hh:mm:ss')}
Refer to the link for more details: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-reference-functions-datetime.html
In your case, to display hour with 0 appended this should work:
#{format(minusHours(#scheduledStartTime,1), 'hh')}
You can replace 'hh' with 'MM' to get months with a 0 appended.
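Putting that together with the path from the question, the full hourly partition prefix could be written as below (a sketch following the question's own expression style; 'YYYY', 'MM', 'dd' and 'HH' are the zero-padded year, month, day and 24-hour hour):
s3://my-bucket/#{format(minusHours(#scheduledStartTime,1),'YYYY')}/#{format(minusHours(#scheduledStartTime,1),'MM')}/#{format(minusHours(#scheduledStartTime,1),'dd')}/#{format(minusHours(#scheduledStartTime,1),'HH')}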