Get task status in AWS Step Functions (boto3)

I am currently using boto3 (the Amazon Web Services (AWS) SDK for Python) to create state machines and start executions, and also in my workers to retrieve tasks and report their status (completed successfully or failed).
I have another service that needs to know the tasks' status, and I would like to retrieve it from AWS. I searched the available methods, and it is only possible to get the status of a state machine/execution as a whole (RUNNING|SUCCEEDED|FAILED|TIMED_OUT|ABORTED).
There is also the get_execution_history method, but each step is identified by a sequentially numbered id and there is no information about the task itself (only the "stateEnteredEventDetails" event contains the name of the task, and the subsequent events may not be related to it, so it is impossible to know whether the task was successful or not).
Is it really not possible to retrieve the status of a specific task, or am I missing something?
Thank you!

I had the same problem, and it seems that Step Functions does not treat states and tasks as entities, and therefore there is no API to get info about them.
In order to get info about a task's status, you need to parse the information in the execution history. In my case, I first check the execution status:
import boto3
import json

client = boto3.client("stepfunctions")

response = client.describe_execution(
    executionArn=EXECUTION_ARN
)
status = response["status"]
and if it is "FAILED" then I analyze the history and get the most relevant fields for my use case (for events of type "TaskFailed"):
response = client.get_execution_history(
    executionArn=EXECUTION_ARN,
    maxResults=1000
)
events = response["events"]
while response.get("nextToken"):
    response = client.get_execution_history(
        executionArn=EXECUTION_ARN,
        maxResults=1000,
        nextToken=response["nextToken"]
    )
    events += response["events"]

causes = [
    json.loads(e["taskFailedEventDetails"]["cause"])
    for e in events
    if e["type"] == "TaskFailed"
]

return [
    {
        "ClusterArn": cause["ClusterArn"],
        "Containers": [
            {
                "ContainerArn": container["ContainerArn"],
                "Name": container["Name"],
                "ExitCode": container["ExitCode"],
                "Overrides": cause["Overrides"]["ContainerOverrides"][i]
            }
            for i, container in enumerate(cause["Containers"])
        ],
        "TaskArn": cause["TaskArn"],
        "StoppedReason": cause["StoppedReason"]
    }
    for cause in causes
]
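If you also need to know which state a failed task belongs to (the OP's original concern), the history events can be correlated by following each event's previousEventId back to the most recent TaskStateEntered event, which carries the state name. A minimal sketch under that assumption (the helper name and the returned dict shape are mine, not part of the answer above; parallel/map branches may need extra care):
def failed_tasks_by_state(events):
    """Map each TaskFailed event back to the name of the state that produced it."""
    by_id = {e["id"]: e for e in events}
    failures = {}
    for e in events:
        if e["type"] != "TaskFailed":
            continue
        # Walk the previousEventId chain until we reach the event that
        # recorded entering the Task state; it carries the state name.
        cur = e
        while cur and cur["type"] != "TaskStateEntered":
            cur = by_id.get(cur.get("previousEventId"))
        if cur:
            failures[cur["stateEnteredEventDetails"]["name"]] = e["taskFailedEventDetails"]
    return failures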

Related

Cron job not able to trigger AWS Step function

I followed a tutorial to configure a cron job to kick off my AWS Step Functions Data Science SDK state machine. Every time it tries to trigger the state machine, it immediately fails.
I gave the newly generated IAM role full access to SageMaker and Lambda, so I am not sure what is going on.
I am going to provide more information here. The way I set up my sklearn estimator is like this:
sklearn_estimator = SKLearn(
    entry_point=sm_script,
    role=role,
    instance_count=1,
    dependencies=[sm_utils_file, config_path, 'requirements.txt'],
    instance_type=training_instance,
    sagemaker_session=sm_sess,
    framework_version=FRAMEWORK_VERSION,
    base_job_name='{}-training'.format(base_name),
    hyperparameters={'config': config_file},
    metric_definitions=[
        {'Name': 'client_devices_validation_accuracy_top1', 'Regex': "client devices accuracy for top 1 = ([0-9.]+)"},
        {'Name': 'client_devices_validation_f1_top1', 'Regex': "client devices f1 for top 1 = ([0-9.]+)"},
        {'Name': 'account_and_password_validation_accuracy_top1', 'Regex': "account and password accuracy for top 1 = ([0-9.]+)"},
        {'Name': 'account_and_password_validation_f1_top1', 'Regex': "account and password f1 for top 1 = ([0-9.]+)"}
    ]
)
As you can see, I have set some dependencies there that the model needs in order to train. I have defined the training step like so:
training_step = steps.TrainingStep(
    "Train Step",
    estimator=sklearn_estimator,
    data={
        "train": sagemaker.TrainingInput(pre_train_utils.resolution_data_path()),
    },
    job_name=execution_input["TrainingJobName"],
    wait_for_completion=True,
)
The pre_train_utils.resolution_data_path() part grabs the newest data from Redshift, and pre_train_utils is listed as a dependency in the estimator, so it should be fine. Could this be the problem?
Update:
I was able to find the error which states this
An error occurred while executing the state 'Train Step' (entered at the event id #2). The JSONPath '$$.Execution.Input['TrainingJobName']' specified for the field 'TrainingJobName.$' could not be found in the input
Specifically it needs to have a json input that looks like this in my case
{
"TrainingJobName": "tt-resolution-classifier-training-2022-09-02",
"ModelName": "tt-resolution-classifier-model-2022-09-02",
"EndpointName": "tt-resolution-classifier-endpoint",
"LambdaFunctionName": "odi-ds-grab-ticket-training-metrics"
}
How do I pass this into the AWS CloudWatch cron job? If I cannot pass it, then I cannot automatically have the state machine train and deploy the endpoint...
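No answer is recorded here, but one way to supply that input is to set a constant JSON Input on the EventBridge (CloudWatch Events) rule target that starts the state machine. A hedged sketch with boto3 follows; the rule name, role ARN, and state machine ARN are placeholders, not values from the question:
import json
import boto3

events = boto3.client("events")

# Attach the state machine as the target of the existing cron rule and pass
# a constant JSON document as the execution input.
events.put_targets(
    Rule="my-training-cron-rule",  # assumed rule name
    Targets=[
        {
            "Id": "start-training-state-machine",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:my-machine",  # placeholder
            "RoleArn": "arn:aws:iam::123456789012:role/events-invoke-stepfunctions",  # placeholder
            "Input": json.dumps({
                "TrainingJobName": "tt-resolution-classifier-training-2022-09-02",
                "ModelName": "tt-resolution-classifier-model-2022-09-02",
                "EndpointName": "tt-resolution-classifier-endpoint",
                "LambdaFunctionName": "odi-ds-grab-ticket-training-metrics",
            }),
        }
    ],
)
Note that a static Input cannot produce a fresh date-stamped job name on every run; that would have to be generated inside the state machine or by a small upstream Lambda.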

Propagating error message through Fail state in AWS Step Functions

I am using AWS Step Functions to manage a workflow, and Fail states to handle errors within it. I would like to propagate some of the JSON from the Step Function workflow so that a user can easily identify the source of their error. For example, if the JSON input to a Fail state looked like this:
{
    "error": "some error text",
    "other_stuff": {...}
}
Then I would like to pull the source of the error. I have set up my Fail state like so:
FailState:
  Type: Fail
  Cause: States.Format($.error)
  Error: Failure Here
However, this simply produces the literal string States.Format($.error) as the Cause for the Fail state. How can I use the Amazon States Language and the Fail state to show the actual error as part of the output of the Fail state? Any solution that can successfully propagate the error text from step input to step output for the Fail state would be sufficient to solve this problem.
If anyone else stumbles on this question, I contacted AWS support and this is what they told me:
"The ‘Cause’ and ‘Error’ fields in this state only accept the string type values. This is why you are getting the literal string as a response. However, the good news is that, we already have an existing feature request, pending with Step Functions Development team, to implement a feature for sending JSON Path(like $.error) into the Fail state."
So for some reason AWS Step Functions does not allow you to pass dynamic error messages. They did offer some workarounds, such as changing the failure state to a success and propagating the error message that way, or creating an SNS topic to post to in case of state machine failure. I personally just updated the status polling API to grab the state at index [-2] to propagate the error to the user. In any case, some workaround is currently needed to get this functionality, and hopefully AWS can ship this feature quickly.
I was able to achieve something close to the desired behavior by creating a "Fail-Me" Lambda that fails with an unhandled exception, dynamically choosing the exception class based on the Error and Cause provided in its payload. If "Error" is the name of a built-in exception class, it is used; otherwise a new class is created.
In the state machine, use a Lambda Invoke task state that calls this "Fail-Me" Lambda, with no retrier and no catcher (a rough sketch of such a state follows the code below).
import inspect, sys

# Expected payload (event), example:
# {
#     "Error": "RuntimeError",
#     "Cause": "No Cause"
# }
def lambda_handler(event, context):
    ErrorType = event.get("Error") or ""
    ErrorCause = event.get("Cause") or ""
    if not ErrorType.isidentifier():
        ErrorCause = "{}: {}".format(ErrorType, ErrorCause)
        ErrorType = "OtherError"
    # If ErrorType is an existing built-in exception, raise it directly.
    for name, obj in inspect.getmembers(sys.modules["builtins"]):
        if inspect.isclass(obj) and issubclass(obj, BaseException):
            if name == ErrorType:
                raise obj(ErrorCause)
    # Otherwise, create our own dynamic exception class and raise it.
    DynamicExceptionClass = type(ErrorType, (BaseException,), {})
    raise DynamicExceptionClass(ErrorCause)
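For reference, the invoking state might look roughly like this in the Amazon States Language. This is only a sketch: the state name, the fail-me function name, and the $.error / $.cause paths are assumptions for illustration, and the deliberate absence of Retry and Catch matches the "no retrier and no catcher" advice above.
"FailMe": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "fail-me",
    "Payload": {
      "Error.$": "$.error",
      "Cause.$": "$.cause"
    }
  },
  "End": true
}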
I too want to use a JSON path in a Fail state to dynamically set the Cause and Error fields. In my case, I have a known set of errors, so for each one I created a separate Fail state. Each of those Fail states corresponds to a catcher whose Next is set to the appropriate Fail state.
{
    "TaskState": {
        "Type": "Task",
        ...
        "Catch": [
            {
                "ErrorEquals": ["ErrorA"],
                "Next": "FailureA"
            },
            {
                "ErrorEquals": ["ErrorB"],
                "Next": "FailureB"
            }
        ]
    },
    "FailureA": {
        "Type": "Fail",
        "Error": "FailureA",
        "Cause": "This failed because of A"
    },
    "FailureB": {
        "Type": "Fail",
        "Error": "FailureB",
        "Cause": "This failed because of B"
    }
}

Log entries API not retrieving log entries

I am trying to retrieve custom logs for a particular project in Google Cloud. I am using this API:
https://logging.googleapis.com/v2/entries:list
as per the example given in the documentation.
Below is the payload:
{
    "filter": "projects/projectA/logs/slow_log",
    "resourceNames": [
        "projects/projectA"
    ]
}
There is a custom log-based metric called slow_log that I created in projectA, which gathers query logs from the Cloud SQL database in that project. I also generated data before calling this API. I am able to see the data in the Stackdriver console, but I am unable to get it from the REST call.
Every time I call this API, I only get this response and nothing else:
"nextPageToken": "EAA4suKu3qnLwbtrSg8iDSIDCgEAKgYIgL7q8wVSBwibvMSMvhhglPDiiJzdjt_zAWocCgwI2buKhAYQlvTd2gESCAgLEMPV7ukCGAAgAQ"
Is there anything missing here?
How can I pass a time range in this query?
Update
I changed the request as per the comment below and gave the full path of the logs; still, only the token is returned:
{
    "filter": "projects/projectA/logs/cloudsql.googleapis.com%2Fmysql-slow.log",
    "projectIds": [
        "projectA"
    ],
    "orderBy": "timestamp desc"
}
I also ran this command from the command line:
gcloud logging read logName="projects/projectA/logs/cloudsql.googleapis.com%2Fmysql-slow.log"
and it fetches the logs, so I am not sure what I am missing in the API Explorer and Postman, where I only get the nextPageToken.
resourceNames, filter, and orderBy are mandatory; try it like this:
{
    "resourceNames": [
        "projects/projectA"
    ],
    "filter": "projects/projectA/logs/cloudsql.googleapis.com%2Fmysql-slow.log",
    "orderBy": "timestamp desc"
}
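Regarding the follow-up question about time ranges: the Logging filter language accepts timestamp comparisons, so a bounded request body could look roughly like the sketch below. The logName= prefix, the example timestamps, and the pageSize value are assumptions for illustration, not values from the question.
{
    "resourceNames": [
        "projects/projectA"
    ],
    "filter": "logName=\"projects/projectA/logs/cloudsql.googleapis.com%2Fmysql-slow.log\" AND timestamp>=\"2021-01-01T00:00:00Z\" AND timestamp<=\"2021-01-02T00:00:00Z\"",
    "orderBy": "timestamp desc",
    "pageSize": 100
}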

BigQuery displaying wrong results - Duplicating data from Cloud Function?

I am a junior developer and I was in charge of integrating the Facebook API into an existing project. However, the business team figured out that the Google Analytics results displayed on BigQuery are wrong, and they asked me to fix it. This is the architecture:
What I have done is:
On BigQuery, I checked how close the results are to Google Analytics. I found a pattern: the results I am getting on BigQuery are always either 1, 2, or 3 times the original GA value.
I checked whether there are actually multiple cron jobs on the Compute Engine instance. There is only one cron job, and it runs once a day.
I verified the results on Google Cloud Storage, and they are correct, as you can see below:
Based on that information, I strongly believe the issue is coming from the Cloud Function, as it is the only element between GCS and BQ. I have looked at the Cloud Function that is triggered by files landing in GCS and could not find any duplicate operations.
Do you know how I can find the issue?
Cloud Function
BUCKET = "xxxx"
GOOGLE_PROJECT = "xxxx"
HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Last Non-Direct Click Conversions": "last_non_direct_click_conversions",
"Last Non-Direct Click Conversion Value": "last_non_direct_click_conversion_value",
"Last Click Prio Conversions": "last_click_prio_conversions",
"Last Click Prio Conversion Value": "last_click_prio_conversion_value",
"Data-Driven Conversions": "dda_conversions",
"Data-Driven Conversion Value": "dda_conversion_value",
"% Change in Conversions from Last Non-Direct Click to Last Click Prio": "last_click_prio_vs_last_click",
"% Change in Conversions from Last Non-Direct Click to Data-Driven": "dda_vs_last_click"
}
SPEND_HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Spend": "spend"
}
tables_schema = {
"google-analytics": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("goal", bigquery.enums.SqlTypeNames.STRING, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversions", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE')
],
"google-analytics-spend": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("spend", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
]
}
def download_from_gcs(file):
client = storage.Client()
bucket = client.get_bucket(BUCKET)
blob = bucket.get_blob(file['name'])
file_name = os.path.basename(os.path.normpath(file['name']))
blob.download_to_filename(f"/tmp/{file_name}")
return file_name
def load_in_bigquery(file_object, dataset: str, table: str):
client = bigquery.Client()
table_id = f"{GOOGLE_PROJECT}.{dataset}.{table}"
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.CSV,
skip_leading_rows=1,
autodetect=True,
schema=tables_schema[table]
)
job = client.load_table_from_file(file_object, table_id, job_config=job_config)
job.result() # Wait for the job to complete.
def __order_columns(df: pd.DataFrame, spend=False) ->pd.DataFrame:
# We want to have source and medium columns at the third position
# for a spend data frame and at the fourth postion for others df
# because spend data frame don't have goal column.
pos = 2 if spend else 3
cols = df.columns.tolist()
cols[pos:2] = cols[-2:]
cols = cols[:-2]
return df[cols]
def __common_transformation(df: pd.DataFrame, date: str, goal: str) -> pd.DataFrame:
# for any kind of dataframe, we add date and week columns
# based on the file name and we split Source/Medium from the csv
# into two different columns
week_of_the_year = datetime.strptime(date, '%Y-%m-%d').isocalendar()[1]
df.insert(0, 'date', date)
df.insert(1, 'week', week_of_the_year)
mapping = SPEND_HEADER_MAPPING if goal == "spend" else HEADER_MAPPING
print(df.columns.tolist())
df = df.rename(columns=mapping)
print(df.columns.tolist())
print(df)
df["source_medium"] = df["source_medium"].str.replace(' ', '')
df[["source", "medium"]] = df["source_medium"].str.split('/', expand=True)
df = df.drop(["source_medium"], axis=1)
df["week"] = df["week"].astype(int, copy=False)
return df
def __transform_spend(df: pd.DataFrame) -> pd.DataFrame:
df["spend"] = df["spend"].astype(float, copy=False)
df = __order_columns(df, spend=True)
return df[df.columns[:6]]
def __transform_attribution(df: pd.DataFrame, goal: str) -> pd.DataFrame:
df.insert(2, 'goal', goal)
df["last_non_direct_click_conversions"] = df["last_non_direct_click_conversions"].astype(int, copy=False)
df["last_click_prio_conversions"] = df["last_click_prio_conversions"].astype(int, copy=False)
df["dda_conversions"] = df["dda_conversions"].astype(float, copy=False)
return __order_columns(df)
def transform(df: pd.DataFrame, file_name) -> pd.DataFrame:
goal, date, *_ = file_name.split('_')
df = __common_transformation(df, date, goal)
# we only add goal in attribution df (google-analytics table).
return __transform_spend(df) if "spend" in file_name else __transform_attribution(df, goal)
def main(event, context):
"""Triggered by a change to a Cloud Storage bucket.
Args:
event (dict): Event payload.
context (google.cloud.functions.Context): Metadata for the event.
"""
file = event
file_name = download_from_gcs(file)
df = pd.read_csv(f"/tmp/{file_name}")
transformed_df = transform(df, file_name)
with open(f"/tmp/bq_{file_name}", "w") as file_object:
file_object.write(transformed_df.to_csv(index=False))
with open(f"/tmp/bq_{file_name}", "rb") as file_object:
table = "google-analytics-spend" if "spend" in file_name else "google-analytics"
load_in_bigquery(file_object, dataset='attribution', table=table)
Update
Yes, the Cloud Function is triggered by the GCS object finalize event. Moreover, the function won't be automatically retried on failure.
I am following your suggestions and I am now checking the log table on my Cloud Function page. In the last 10 lines of log data, it seems that 3 different instances of the Cloud Function were run. I am not able to get more details when I expand each line.
I am also going to check the BigQuery logs now. I guess the easiest solution would be to use BigQueryAuditMetadata and get logs about when the table was updated?
From my point of view, this is a very big topic, so it might be very difficult to provide one precise solution that solves all issues. I won't be able to solve the issue outright, but I can offer some personal observations and suggestions.
The cloud function is triggered by the GCS object finalize event - can you check that this is correct, please? In that case, the event goes through Pub/Sub before triggering the cloud function invocation. There are 2 things to keep in mind:
1. Pub/Sub is based on an 'at least once' delivery paradigm, so duplicate message deliveries are possible.
2. Such a cloud function invocation has an automatic acknowledgement, and the developer has no control over that; Pub/Sub cannot be used to control the overall process state. The longer the cloud function executes (up to the 540-second timeout or more), the higher the chance that Pub/Sub (internally) decides the message has not been delivered and should be delivered again, resulting in a new invocation of the cloud function.
Some additional details are here: Issue: Cloud Function explicit acknowledgement of a pubsub message
Now, how to see if that happens? Personally, I would start with logging. When a cloud function starts, I would log the object name and some hash code (e.g. CRC32C or MD5, which are available from the event metadata) directly in the cloud function code. That way, I would be able to see from the logs whether there are many cloud function invocations for one GCS object (if that happens). Another good idea is to log how long the cloud function takes to execute.
By doing that, we can check whether the cloud function is called more than once for a given GCS object (a logging sketch follows below).
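A minimal sketch of that kind of logging at the top of the function's entry point, assuming the standard GCS object attributes delivered in the finalize event; the rest of the handler is elided:
import logging

def main(event, context):
    # Log identifying metadata on every invocation so duplicate deliveries
    # for the same GCS object become visible in Cloud Logging.
    logging.info(
        "event_id=%s object=%s generation=%s crc32c=%s md5=%s",
        context.event_id,
        event.get("name"),
        event.get("generation"),
        event.get("crc32c"),
        event.get("md5Hash"),
    )
    # ... existing download/transform/load logic goes here ...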
The next step is how the data is loaded. The 'load' is a job, which means there is a queue and a scheduler (somewhere in the GCP BigQuery service). The load job stays in the queue until it is picked up, and then it is executed. All of that is extremely asynchronous. Can you check whether there are failed load jobs, please? In the simplest case that can be done through the BigQuery UI, or programmatically as sketched below.
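For example, a hedged sketch that lists recent jobs with the BigQuery Python client and prints the ones that failed (the 24-hour lookback window is an arbitrary assumption):
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client()

# Inspect jobs from the last 24 hours and report any load job that errored out.
since = datetime.now(timezone.utc) - timedelta(hours=24)
for job in client.list_jobs(min_creation_time=since, all_users=True):
    if job.job_type == "load" and job.state == "DONE" and job.error_result:
        print(job.job_id, job.destination, job.error_result)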
On top of that, loading from inside the cloud function's memory is, from my point of view, not only expensive but also risky and unreliable. Even simply saving the csv into a GCS bucket and loading from the bucket may be much better.
Next, there is a quota of 1,500 load jobs per table per day, as far as I remember. If you have many files to load, you can easily exceed that quota.
An alternative way of 'loading' data is to use streaming inserts. They do not have such quota limitations, but they are chargeable, so you pay for streaming.
I will stop for now; please let me know if the above was useful, and in what direction you are going to develop your solution.
=> Update 04 February 2021 10:50 GMT
To avoid copy and paste - see the answer here: Cloud Function running multiple times instead of once

GCP Cloud Tasks: shorten period for creating a previously created named task

We are developing a GCP Cloud Tasks based queue process that sends a status email whenever a particular Firestore doc write-trigger fires. The reason we use Cloud Tasks is so a delay can be introduced (using the scheduleTime property, set 2 minutes in the future) before the email is sent, and to control dedup (by using a task name formatted as [firestore-collection-name]-[doc-id]), since the 'write' trigger on the Firestore doc can fire several times as the document is created and then quickly updated by backend cloud functions.
Once the task's delay period has been reached, the cloud task runs and the email is sent with updated Firestore document info included, after which the task is deleted from the queue and all is good.
Except:
If the user updates the Firestore doc (say 20 or 30 min later) we want to resend the status email but are unable to create the task using the same task-name. We get the following error:
409 The task cannot be created because a task with this name existed too recently. For more information about task de-duplication see https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create#body.request_body.FIELDS.task.
This was unexpected, as the queue is empty at this point because the last task completed successfully. The documentation referenced in the error message says:
If the task's queue was created using Cloud Tasks, then another task with the same name can't be created for ~1 hour after the original task was deleted or executed.
Question: is there some way in which this restriction can be by-passed by lowering the amount of time, or even removing the restriction all together?
The short answer is no. As you've already pointed out, the docs are very clear regarding this behavior: you should wait 1 hour to create a task with the same name as one that was previously created. Neither the API nor the client libraries allow you to decrease this time.
Having said that, I would suggest that instead of reusing the same task ID, you use a different one for each task and add an identifier in the body of the request. For example, using Python:
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2
import datetime


def create_task(project, queue, location, payload=None, in_seconds=None):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    task = {
        'app_engine_http_request': {
            'http_method': 'POST',
            'relative_uri': '/task/' + queue
        }
    }
    if payload is not None:
        converted_payload = payload.encode()
        task['app_engine_http_request']['body'] = converted_payload
    if in_seconds is not None:
        d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
        timestamp = timestamp_pb2.Timestamp()
        timestamp.FromDatetime(d)
        task['schedule_time'] = timestamp
    response = client.create_task(parent, task)
    print('Created task {}'.format(response.name))
    print(response)


# You can change DOCUMENT_ID with USER_ID or something else to identify the task
create_task(PROJECT_ID, QUEUE, REGION, DOCUMENT_ID)
Facing a similar problem of needing to debounce multiple instances of Firestore write-trigger functions, we worked around the default Cloud Tasks task-name based dedup mechanism (still a constraint in Nov 2022) by building a small debounce "helper" using Firestore transactions.
We're using a helper collection _syncHelper_ to implement a delayed throttle for the side effects of write-trigger fires - in the OP's case, send one email for all writes within 2 minutes.
In our case we are using the Firebase Functions task queue utilities rather than interacting with Cloud Tasks directly, but that's immaterial to the solution. The key is to determine the task's execution time in advance and use that as the "dedup key":
const { getFirestore, Timestamp } = require('firebase-admin/firestore');
const { getFunctions } = require('firebase-admin/functions');

async function enqueueTask(shopId) {
  const queueName = 'doSomething';
  const now = new Date();
  const next = new Date(now.getTime() + 2 * 60 * 1000);
  try {
    const shouldEnqueue = await getFirestore().runTransaction(async t => {
      const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
      const doc = await t.get(syncRef);
      let data = doc.data();
      if (data?.timestamp.toDate() > now) {
        return false;
      }
      await t.set(syncRef, { timestamp: Timestamp.fromDate(next) });
      return true;
    });
    if (shouldEnqueue) {
      let queue = getFunctions().taskQueue(queueName);
      await queue.enqueue(
        { timestamp: next.toISOString() },
        { scheduleTime: next }
      );
    }
  } catch {
    ...
  }
}
This will ensure a new task is enqueued only if the "next execution" time has passed.
The execution operation (also a cloud function in our case) will remove the sync data entry if it hasn't been changed since it was executed:
const functions = require('firebase-functions');

exports.doSomething = functions.tasks.taskQueue({
  retryConfig: {
    maxAttempts: 2,
    minBackoffSeconds: 60,
  },
  rateLimits: {
    maxConcurrentDispatches: 2,
  }
}).onDispatch(async data => {
  let { timestamp } = data;
  await sendYourEmailHere();
  await getFirestore().runTransaction(async t => {
    const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
    const doc = await t.get(syncRef);
    let data = doc.data();
    if (data?.timestamp.toDate() <= new Date(timestamp)) {
      await t.delete(syncRef);
    }
  });
});
This isn't a bulletproof solution (if the doSomething() execution function has high latency, for example), but it's good enough for 99% of our use cases.