Cron job not able to trigger AWS Step function - amazon-web-services

I followed this tutorial to configure a cron job to kick off my AWS Step Functions Data Science SDK pipeline. Every time the cron job tries to trigger the state machine, the execution immediately fails.
I gave the newly generated IAM role full access to SageMaker and Lambda, so I am not sure what is going on.
To provide more information: this is how I set up my sklearn estimator:
sklearn_estimator = SKLearn(
    entry_point=sm_script,
    role=role,
    instance_count=1,
    dependencies=[sm_utils_file, config_path, 'requirements.txt'],
    instance_type=training_instance,
    sagemaker_session=sm_sess,
    framework_version=FRAMEWORK_VERSION,
    base_job_name='{}-training'.format(base_name),
    hyperparameters={'config': config_file},
    metric_definitions=[
        {'Name': 'client_devices_validation_accuracy_top1', 'Regex': "client devices accuracy for top 1 = ([0-9.]+)"},
        {'Name': 'client_devices_validation_f1_top1', 'Regex': "client devices f1 for top 1 = ([0-9.]+)"},
        {'Name': 'account_and_password_validation_accuracy_top1', 'Regex': "account and password accuracy for top 1 = ([0-9.]+)"},
        {'Name': 'account_and_password_validation_f1_top1', 'Regex': "account and password f1 for top 1 = ([0-9.]+)"}
    ]
)
As you can see, I have set some dependencies there that the model needs in order to train. I have defined the training step as follows:
training_step = steps.TrainingStep(
    "Train Step",
    estimator=sklearn_estimator,
    data={
        "train": sagemaker.TrainingInput(pre_train_utils.resolution_data_path()),
    },
    job_name=execution_input["TrainingJobName"],
    wait_for_completion=True,
)
The call pre_train_utils.resolution_data_path() grabs the newest data from Redshift, and pre_train_utils is listed as a dependency in the estimator, so it should be fine. Or could this be the problem?
Update:
I was able to find the error, which states:
An error occurred while executing the state 'Train Step' (entered at the event id #2). The JSONPath '$$.Execution.Input['TrainingJobName']' specified for the field 'TrainingJobName.$' could not be found in the input
Specifically, it needs a JSON input that looks like this in my case:
{
    "TrainingJobName": "tt-resolution-classifier-training-2022-09-02",
    "ModelName": "tt-resolution-classifier-model-2022-09-02",
    "EndpointName": "tt-resolution-classifier-endpoint",
    "LambdaFunctionName": "odi-ds-grab-ticket-training-metrics"
}
How do I pass this into the AWS CloudWatch cron job? If I cannot pass it, then I cannot automatically have the state machine train and deploy the endpoint...
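For reference, one way this input could presumably be attached to the scheduled rule is as a constant JSON target input. Below is a rough sketch using boto3; the rule name, state machine ARN and IAM role ARN are placeholders, not values from the question:

```python
# Hypothetical sketch: attach the state machine as a target of a scheduled
# CloudWatch Events / EventBridge rule and pass the execution input as a
# static JSON string. Rule name and ARNs are placeholders.
import json
import boto3

events = boto3.client("events")

execution_input = {
    "TrainingJobName": "tt-resolution-classifier-training-2022-09-02",
    "ModelName": "tt-resolution-classifier-model-2022-09-02",
    "EndpointName": "tt-resolution-classifier-endpoint",
    "LambdaFunctionName": "odi-ds-grab-ticket-training-metrics",
}

events.put_targets(
    Rule="daily-training-schedule",  # placeholder rule created with put_rule(ScheduleExpression=...)
    Targets=[{
        "Id": "start-training-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine",  # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/events-invoke-stepfunctions",        # placeholder
        "Input": json.dumps(execution_input),  # becomes $$.Execution.Input
    }],
)
```

Note that a constant input cannot embed "today's" date, and SageMaker training job names must be unique, so for recurring runs a common alternative is to schedule a small Lambda that builds fresh names and calls `stepfunctions.start_execution(stateMachineArn=..., input=json.dumps(...))` itself.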

Related

BigQuery displaying wrong results - Duplicating data from Cloud Function?

I am a junior developer and I was in charge of integrating the Facebook API into an existing project. However, the business team figured out that the Google Analytics results displayed on BigQuery are wrong. They asked me to fix it. This is the architecture:
What I have done is:
On BigQuery, checking how close the results are to Google Analytics. I found a pattern: the results I get on BigQuery are always either 1, 2 or 3 times the original GA value.
I checked whether there are actually multiple cron jobs on the Compute Engine. There is only 1 cron job, running once a day.
I verified the results on Google Cloud Storage, and they are correct, as you can see below:
Based on that information, I strongly believe the issue is coming from the Cloud Function, as it's the only element between GCS and BQ. I have looked at the Cloud Function that is triggered by files landing in GCS and I could not find any duplicate operations.
Do you know how can I find the issue?
Cloud Function
BUCKET = "xxxx"
GOOGLE_PROJECT = "xxxx"
HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Last Non-Direct Click Conversions": "last_non_direct_click_conversions",
"Last Non-Direct Click Conversion Value": "last_non_direct_click_conversion_value",
"Last Click Prio Conversions": "last_click_prio_conversions",
"Last Click Prio Conversion Value": "last_click_prio_conversion_value",
"Data-Driven Conversions": "dda_conversions",
"Data-Driven Conversion Value": "dda_conversion_value",
"% Change in Conversions from Last Non-Direct Click to Last Click Prio": "last_click_prio_vs_last_click",
"% Change in Conversions from Last Non-Direct Click to Data-Driven": "dda_vs_last_click"
}
SPEND_HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Spend": "spend"
}
tables_schema = {
"google-analytics": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("goal", bigquery.enums.SqlTypeNames.STRING, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversions", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE')
],
"google-analytics-spend": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("spend", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
]
}
def download_from_gcs(file):
    client = storage.Client()
    bucket = client.get_bucket(BUCKET)
    blob = bucket.get_blob(file['name'])
    file_name = os.path.basename(os.path.normpath(file['name']))
    blob.download_to_filename(f"/tmp/{file_name}")
    return file_name

def load_in_bigquery(file_object, dataset: str, table: str):
    client = bigquery.Client()
    table_id = f"{GOOGLE_PROJECT}.{dataset}.{table}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        schema=tables_schema[table]
    )
    job = client.load_table_from_file(file_object, table_id, job_config=job_config)
    job.result()  # Wait for the job to complete.

def __order_columns(df: pd.DataFrame, spend=False) -> pd.DataFrame:
    # We want the source and medium columns at the third position
    # for a spend data frame and at the fourth position for other data frames,
    # because spend data frames don't have a goal column.
    pos = 2 if spend else 3
    cols = df.columns.tolist()
    cols[pos:2] = cols[-2:]
    cols = cols[:-2]
    return df[cols]

def __common_transformation(df: pd.DataFrame, date: str, goal: str) -> pd.DataFrame:
    # For any kind of dataframe, we add date and week columns
    # based on the file name, and we split Source/Medium from the csv
    # into two different columns.
    week_of_the_year = datetime.strptime(date, '%Y-%m-%d').isocalendar()[1]
    df.insert(0, 'date', date)
    df.insert(1, 'week', week_of_the_year)
    mapping = SPEND_HEADER_MAPPING if goal == "spend" else HEADER_MAPPING
    print(df.columns.tolist())
    df = df.rename(columns=mapping)
    print(df.columns.tolist())
    print(df)
    df["source_medium"] = df["source_medium"].str.replace(' ', '')
    df[["source", "medium"]] = df["source_medium"].str.split('/', expand=True)
    df = df.drop(["source_medium"], axis=1)
    df["week"] = df["week"].astype(int, copy=False)
    return df

def __transform_spend(df: pd.DataFrame) -> pd.DataFrame:
    df["spend"] = df["spend"].astype(float, copy=False)
    df = __order_columns(df, spend=True)
    return df[df.columns[:6]]

def __transform_attribution(df: pd.DataFrame, goal: str) -> pd.DataFrame:
    df.insert(2, 'goal', goal)
    df["last_non_direct_click_conversions"] = df["last_non_direct_click_conversions"].astype(int, copy=False)
    df["last_click_prio_conversions"] = df["last_click_prio_conversions"].astype(int, copy=False)
    df["dda_conversions"] = df["dda_conversions"].astype(float, copy=False)
    return __order_columns(df)

def transform(df: pd.DataFrame, file_name) -> pd.DataFrame:
    goal, date, *_ = file_name.split('_')
    df = __common_transformation(df, date, goal)
    # We only add goal in the attribution df (google-analytics table).
    return __transform_spend(df) if "spend" in file_name else __transform_attribution(df, goal)

def main(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    file_name = download_from_gcs(file)
    df = pd.read_csv(f"/tmp/{file_name}")
    transformed_df = transform(df, file_name)
    with open(f"/tmp/bq_{file_name}", "w") as file_object:
        file_object.write(transformed_df.to_csv(index=False))
    with open(f"/tmp/bq_{file_name}", "rb") as file_object:
        table = "google-analytics-spend" if "spend" in file_name else "google-analytics"
        load_in_bigquery(file_object, dataset='attribution', table=table)
Update
Yes, the cloud function is triggered by the GCS object finalize event. Moreover, the function won't be automatically retried on failure.
I am following your suggestions and am now checking the logs on my Cloud Function page. In the last 10 lines of log data, it seems that 3 different instances of the Cloud Function were run. I am not able to get more details when expanding each line.
I am also going to check BigQuery logs now. I guess the easiest solution would be to use BigQueryAuditMetadata and get logs about when the table was updated?
From my point of view, this is a very big topic, so it might be very difficult to provide one precise solution that solves all issues. I won't be able to solve the issue outright, but I can offer some personal observations and suggestions.
The cloud function is triggered by the GCS object finalize event - can you check that this is correct, please? In that case, the event is 'going through PubSub' before triggering the cloud function invocation. Now there are 2 things to keep in mind:
PubSub is based on a 'deliver at least once' paradigm, thus duplicate message deliveries are possible.
Such a cloud function invocation has an automatic acknowledgement, and the developer has no control over that; PubSub cannot be used to control the overall process state. The longer the cloud function executes (up to the 540 second timeout or more), the higher the chance that PubSub makes an (internal) decision that the message has not been delivered and should be delivered again, thus triggering a new invocation of the cloud function.
Some additional details are here: Issue: Cloud Function explicit acknowledgement of a pubsub message
Now, how to see if that happens? Personally I would start with logging. When a cloud function starts, I would log the object name and some hash code (e.g. CRC32C or MD5, which are available from the event metadata) directly in the cloud function code. That way, I would be able to see multiple cloud function invocations for one GCS object in the logs (if that happens). Another good idea is to log how long the cloud function takes to execute.
By doing that we can check whether the cloud function is called more than once for a given GCS object.
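For illustration, a minimal version of that logging might look like this (a sketch assuming the question's `main(event, context)` entry point and the standard GCS finalize event fields; in practice these lines would be merged into the existing function):

```python
# Hypothetical sketch: log enough metadata to spot duplicate invocations.
import logging

def main(event, context):
    logging.info(
        "CF invocation event_id=%s object=%s generation=%s md5=%s crc32c=%s",
        context.event_id,        # unique per delivered event
        event.get("name"),       # GCS object name
        event.get("generation"),
        event.get("md5Hash"),
        event.get("crc32c"),
    )
    # ... existing download / transform / load logic ...
```

If the same object name and generation appear under several different event IDs, the function is being invoked more than once per file.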
The next step is how we load the data. The 'load' is a job, which means there is a queue and a scheduler (somewhere in the GCP BigQuery service). The load job stays in the queue until it is picked up for loading, and then that job is executed. All of that is extremely 'asynchronous'. Can you check if there are failed load jobs, please? In the simplest case that can be done through the BigQuery UI.
On top of that, loading from inside the cloud function's memory is, from my point of view, not only expensive but also risky and unreliable. Even simply saving the csv into a GCS bucket and loading from the bucket may be much better.
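To illustrate that alternative, here is a rough sketch that uploads the transformed CSV back to GCS and lets BigQuery load it from a URI instead of from a local file object. It reuses `GOOGLE_PROJECT` from the question's code; the bucket, blob and function names are placeholders:

```python
# Hypothetical sketch: load into BigQuery from a GCS URI rather than from
# the function's memory/disk.
from google.cloud import bigquery, storage

def load_from_gcs(bucket_name: str, blob_name: str, local_path: str,
                  dataset: str, table: str, job_config: bigquery.LoadJobConfig):
    # Upload the transformed CSV back to a bucket
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)

    uri = f"gs://{bucket_name}/{blob_name}"
    client = bigquery.Client()
    load_job = client.load_table_from_uri(
        uri, f"{GOOGLE_PROJECT}.{dataset}.{table}", job_config=job_config
    )
    load_job.result()  # raises if the load job failed
```

This also leaves the transformed file in the bucket, which makes it easier to replay or audit a load later.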
Next, there is a quota of 1,500 load jobs per table per day, as far as I remember. If you have many files to load, you can easily exceed that quota.
An alternative way of 'loading' data is to use streaming. It does not have such quota limitations, but it is chargeable, so you pay for streaming inserts.
I will stop for now. Please let me know if the above was useful, and in what direction you are going to develop your solution.
=> Update 04 February 2021 10:50 GMT
To avoid copy and paste - see the answer here: Cloud Function running multiple times instead of once

How to see progress when using Glue to export DynamoDB table

I'm trying to export every item in a DynamoDB table to S3. I found this tutorial https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/ and followed the example. Basically,
table = glueContext.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percentage,
        "dynamodb.splits": splits
    }
)

glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={
        "path": output_path
    },
    format=output_format,
    transformation_ctx="datasink"
)
I tested it on a tiny table in a nonprod environment and it works fine. But my DynamoDB table in production is over 400 GB, with 200 million items. I suppose it'll take a while, but I have no idea how long to expect. Hours, or even days? Is there any way to show progress? For example, showing a count of how many items have been processed. I don't want to blindly start this job and wait.
One way would be to enable continuous logging for your AWS Glue Job to monitor its progress.
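As a rough sketch of that option, continuous logging can reportedly be enabled through the Glue special job parameters when starting the run (the job name below is a placeholder, and the parameter names should be verified against your Glue version):

```python
# Hypothetical sketch: start the Glue job with continuous CloudWatch logging enabled.
import boto3

glue = boto3.client("glue")
run = glue.start_job_run(
    JobName="export-dynamodb-to-s3",  # placeholder job name
    Arguments={
        "--enable-continuous-cloudwatch-log": "true",
        "--enable-continuous-log-filter": "true",  # filter out noisy driver/executor heartbeat logs
    },
)
print(run["JobRunId"])
```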
Another way would be to trigger a Lambda function whenever a file has been stored in S3, using Amazon S3 event notifications.
Did you try the custom waiter class from the AWS docs?
For instance, a custom waiter for a Glue job should look something like this:
class JobCompleteWaiter(CustomWaiter):
    def __init__(self, client):
        super().__init__(
            "JobComplete",
            "get_job_run",
            "JobRun.JobRunState",
            {"SUCCEEDED": WaitState.SUCCEEDED, "FAILED": WaitState.FAILED},
            client,
            max_tries=100,
        )

    def wait(self, JobName, RunId):
        self._wait(JobName=JobName, RunId=RunId)
According to the boto3 docs, you should expect one of the following possible states for a job run: 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'
So I chose to check whether it was SUCCEEDED or FAILED.
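Usage would then look roughly like this (a sketch; `CustomWaiter`/`WaitState` come from the AWS SDK code examples the waiter above is based on, and the job name is a placeholder):

```python
import boto3

glue_client = boto3.client("glue")
run = glue_client.start_job_run(JobName="export-dynamodb-to-s3")  # placeholder job name

waiter = JobCompleteWaiter(glue_client)
# Blocks (polling get_job_run) until the run reaches SUCCEEDED or FAILED
waiter.wait(JobName="export-dynamodb-to-s3", RunId=run["JobRunId"])
```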

Get tasks status in AWS Step Functions (boto3)

I am currently using boto3 (the Amazon Web Services (AWS) SDK for Python) to create state machines, start executions and also in my workers to retrieve tasks and report their status (completed successfully or failed).
I have another service that needs to know the tasks' status and I would like to do so by retrieving it from AWS. I searched the available methods and it is only possible to get the status of a state machine/execution as a whole (RUNNING|SUCCEEDED|FAILED|TIMED_OUT|ABORTED).
There is also the get_execution_history method, but each step is identified by a sequentially numbered id and there is no information about the task itself (only in the "stateEnteredEventDetails" event, where the name of the task is present; the subsequent events may not be related to it, so it is impossible to know if the task was successful or not).
Is it really not possible to retrieve the status of a specific task, or am I missing something?
Thank you!
I had the same problem, and it seems that Step Functions does not treat states and tasks as entities, and therefore there is no API to get info about them.
In order to get info about the task's status you need to parse the information in the execution history. In my case I first check the execution status:
import boto3
import json

client = boto3.client("stepfunctions")

response = client.describe_execution(
    executionArn=EXECUTION_ARN
)
status = response["status"]
and if it is "FAILED" then I analyze the history and get the most relevant fields for my use case (for events of type "TaskFailed"):
response = client.get_execution_history(
    executionArn=EXECUTION_ARN,
    maxResults=1000
)
events = response["events"]
while response.get("nextToken"):
    response = client.get_execution_history(
        executionArn=EXECUTION_ARN,
        maxResults=1000,
        nextToken=response["nextToken"]
    )
    events += response["events"]

causes = [
    json.loads(e["taskFailedEventDetails"]["cause"])
    for e in events
    if e["type"] == "TaskFailed"
]

return [
    {
        "ClusterArn": cause["ClusterArn"],
        "Containers": [
            {
                "ContainerArn": container["ContainerArn"],
                "Name": container["Name"],
                "ExitCode": container["ExitCode"],
                "Overrides": cause["Overrides"]["ContainerOverrides"][i]
            }
            for i, container in enumerate(cause["Containers"])
        ],
        "TaskArn": cause["TaskArn"],
        "StoppedReason": cause["StoppedReason"]
    }
    for cause in causes
]

AWS Lambda - Copy monthly snapshots to another region

I am trying to run a Lambda on a schedule that copies all snapshots taken the day prior to another region for DR purposes. I have a bit of code, but it does not seem to work as intended.
Symptoms:
It's grabbing the same snapshots multiple times and copying them
It always errors out on 2 particular snapshots; I don't know enough about coding to write a log to figure out why. These snapshots work if I copy them manually, though.
import boto3
from datetime import date, timedelta

SOURCE_REGION = 'us-east-1'
DEST_REGION = 'us-west-2'

ec2_source = boto3.client('ec2', region_name=SOURCE_REGION)
ec2_destination = boto3.client('ec2', region_name=DEST_REGION)

snaps = ec2_source.describe_snapshots(OwnerIds=['self'])['Snapshots']
yesterday = date.today() - timedelta(days=1)
yesterday_snaps = [s for s in snaps if s['StartTime'].date() == yesterday]

for yester_snap in yesterday_snaps:
    DestinationSnapshot = ec2_destination.copy_snapshot(
        SourceSnapshotId=yester_snap['SnapshotId'],
        SourceRegion=SOURCE_REGION,
        Encrypted=True,
        KmsKeyId='REMOVED FOR SECURITY',
        DryRun=False
    )
    DestinationSnapshotID = DestinationSnapshot['SnapshotId']
    ec2_destination.create_tags(
        Resources=[DestinationSnapshotID],
        Tags=yester_snap['Tags']
    )
    waiter = ec2_destination.get_waiter('snapshot_completed')
    waiter.wait(
        SnapshotIds=[DestinationSnapshotID],
        DryRun=False,
        WaiterConfig={'Delay': 10, 'MaxAttempts': 123}
    )
Debugging
You can debug by simply putting print() statements in your code.
For example:
for yester_snap in yesterday_snaps:
    print('Copying:', yester_snap['SnapshotId'])
    DestinationSnapshot = ec2_destination.copy_snapshot(...)
The logs will appear in CloudWatch Logs. You can access the logs via the Monitoring tab in the Lambda function. Make sure the Lambda function has AWSLambdaBasicExecutionRole permissions so that it can write to CloudWatch Logs.
Today/Yesterday
Be careful about your definition of yesterday. Amazon EC2 instances run in the UTC timezone, so your concept of today and yesterday might not match what is happening.
It might be better to add a tag to snapshots after they are copied (eg 'copied') rather than relying on dates to figure out which ones to copy.
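A sketch of that approach, reusing the clients and snapshot list from the question's code (the 'copied' tag key/value are arbitrary placeholders): tag each source snapshot once its copy has started, and skip snapshots that already carry the tag.

```python
# Hypothetical sketch: mark copied snapshots instead of relying on dates.
for snap in snaps:
    tags = {t['Key']: t['Value'] for t in snap.get('Tags', [])}
    if tags.get('copied') == 'true':
        continue  # already copied on a previous run

    copy = ec2_destination.copy_snapshot(
        SourceSnapshotId=snap['SnapshotId'],
        SourceRegion=SOURCE_REGION,
        Encrypted=True,
        KmsKeyId='REMOVED FOR SECURITY',
    )
    # Tag the source snapshot so the next run skips it
    ec2_source.create_tags(
        Resources=[snap['SnapshotId']],
        Tags=[{'Key': 'copied', 'Value': 'true'}],
    )
```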
CloudWatch Events rule
Rather than running this program once per day, an alternative method would be:
Create an Amazon CloudWatch Events rule that triggers on Snapshot creation:
{
    "source": [
        "aws.ec2"
    ],
    "detail-type": [
        "EBS Snapshot Notification"
    ],
    "detail": {
        "event": [
            "createSnapshot"
        ]
    }
}
Configure the rule to trigger an AWS Lambda function
In the Lambda function, copy the Snapshot that was just created
This way, the Snapshots are copied immediately after they are created, and there is no need to search for them or figure out which Snapshots to copy.
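A rough sketch of such a Lambda handler might look like the following. The field names (`detail.snapshot_id` as an ARN, the top-level `region`) should be verified against an actual "EBS Snapshot Notification" event, and the KMS key is the same placeholder as in the question:

```python
# Hypothetical sketch of the event-driven copy; verify the event payload shape
# for your account/region before relying on these field names.
import boto3

DEST_REGION = 'us-west-2'
ec2_destination = boto3.client('ec2', region_name=DEST_REGION)

def lambda_handler(event, context):
    detail = event['detail']
    source_region = event['region']
    # detail['snapshot_id'] is an ARN such as arn:aws:ec2::us-east-1:snapshot/snap-0123...
    snapshot_id = detail['snapshot_id'].split('/')[-1]

    copy = ec2_destination.copy_snapshot(
        SourceSnapshotId=snapshot_id,
        SourceRegion=source_region,
        Encrypted=True,
        KmsKeyId='REMOVED FOR SECURITY',
    )
    return copy['SnapshotId']
```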

How do I write a Cloud Function to receive, parse, and publish PubSub messages?

This can be considered a follow-up to this thread, but I need more help with moving things along. Hopefully someone can have a look over my attempts below and provide further guidance.
To summarize, I need a cloud function that
Is triggered by a PubSub message being published in topic A (this can be done in UI).
Reads a messy object change notification message from "push" PubSub topic A.
"Parses" it.
Publishes a message to PubSub topic B, with the original message ID as data, and other metadata (e.g. file name, size, time) as attributes.
1:
Example of a messy object change notification:
\n "kind": "storage#object",\n "id": "bucketcfpubsub/test.txt/1544681756538155",\n "selfLink": "https://www.googleapis.com/storage/v1/b/bucketcfpubsub/o/test.txt",\n "name": "test.txt",\n "bucket": "bucketcfpubsub",\n "generation": "1544681756538155",\n "metageneration": "1",\n "contentType": "text/plain",\n "timeCreated": "2018-12-13T06:15:56.537Z",\n "updated": "2018-12-13T06:15:56.537Z",\n "storageClass": "STANDARD",\n "timeStorageClassUpdated": "2018-12-13T06:15:56.537Z",\n "size": "1938",\n "md5Hash": "sDSXIvkR/PBg4mHyIUIvww==",\n "mediaLink": "https://www.googleapis.com/download/storage/v1/b/bucketcfpubsub/o/test.txt?generation=1544681756538155&alt=media",\n "crc32c": "UDhyzw==",\n "etag": "CKvqjvuTnN8CEAE="\n}\n
To clarify, is this a message with a blank "data" field, where all the information above is in attribute pairs (like "attribute name": "attribute data")? Or is it just a long string stuffed into the "data" field, with no "attributes"?
2:
In the above thread, a "pull" subscription is used. Is it better than using a "push" subscription? Push sample below:
def create_push_subscription(project_id,
                             topic_name,
                             subscription_name,
                             endpoint):
    """Create a new push subscription on the given topic."""
    # [START pubsub_create_push_subscription]
    from google.cloud import pubsub_v1

    # TODO project_id = "Your Google Cloud Project ID"
    # TODO topic_name = "Your Pub/Sub topic name"
    # TODO subscription_name = "Your Pub/Sub subscription name"
    # TODO endpoint = "https://my-test-project.appspot.com/push"

    subscriber = pubsub_v1.SubscriberClient()
    topic_path = subscriber.topic_path(project_id, topic_name)
    subscription_path = subscriber.subscription_path(
        project_id, subscription_name)

    push_config = pubsub_v1.types.PushConfig(
        push_endpoint=endpoint)

    subscription = subscriber.create_subscription(
        subscription_path, topic_path, push_config)

    print('Push subscription created: {}'.format(subscription))
    print('Endpoint for subscription is: {}'.format(endpoint))
    # [END pubsub_create_push_subscription]
Or do I need further code after this to receive messages?
Also, doesn't this create a new subscriber every time the Cloud Function is triggered by a pubsub message being published? Should I add subscription deletion code at the end of the CF, or are there more efficient ways to do this?
3:
Next, to parse the message, this sample code extracts a few attributes as follows:
def summarize(message):
    # [START parse_message]
    data = message.data
    attributes = message.attributes

    event_type = attributes['eventType']
    bucket_id = attributes['bucketId']
    object_id = attributes['objectId']
Will this work with my above notification in 1:?
4:
How do I separate the topic_name? Steps 1 and 2 use topic A, while this step is to publish into topic B. Is it as simple as re-writing the topic_name in the below code example?
# TODO topic_name = "Your Pub/Sub topic name"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

for n in range(1, 10):
    data = u'Message number {}'.format(n)
    # Data must be a bytestring
    data = data.encode('utf-8')
    # Add two attributes, origin and username, to the message
    publisher.publish(
        topic_path, data, origin='python-sample', username='gcp')

print('Published messages with custom attributes.')
The source where I got most of the sample code from (besides the above thread): python-docs-samples. Will adapting and stringing the above code samples together produce useful code? Or will I still be missing stuff like "import ****"?
You should not attempt to manually create a Subscriber running in Cloud Functions. Instead, follow the documentation here for setting up a Cloud Function which will be called with all messages sent to a given topic by passing the --trigger-topic command line parameter.
To address some of your other concerns:
“Should I add a subscription delete code at the end of the CF”- Subscriptions are long-lived resources corresponding to a specific backlog of messages. If the subscription is created and deleted at the end of the cloud function, messages sent when it does not exist will not be received.
“How do I separate the topic_name”- The ‘topic_name’ in this example refers to the last part of the string formatted like this projects/project_id/topics/topic_name that will appear on this page in the cloud console for your topic after it has been created.
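Putting that advice together, a rough sketch of a background Cloud Function that is triggered by topic A and republishes to topic B could look like this. The project ID, topic B name and function name are placeholders, and the attribute names assume the standard GCS change-notification format shown in the question:

```python
# Hypothetical sketch of the whole flow: triggered by messages on topic A,
# republishes a compact message to topic B. Deployed with something like:
#   gcloud functions deploy parse_and_republish --runtime python39 --trigger-topic topicA
import base64
import json
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"  # placeholder
TOPIC_B = "topicB"         # placeholder

publisher = pubsub_v1.PublisherClient()
topic_b_path = publisher.topic_path(PROJECT_ID, TOPIC_B)

def parse_and_republish(event, context):
    """Background function: `event` carries the Pub/Sub message from topic A."""
    attributes = event.get("attributes") or {}
    payload = {}
    if event.get("data"):
        # The GCS notification body is base64-encoded JSON object metadata
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Republish to topic B: original message ID as data, selected metadata as attributes
    publisher.publish(
        topic_b_path,
        context.event_id.encode("utf-8"),  # the Pub/Sub message ID of the original message
        objectId=attributes.get("objectId", payload.get("name", "")),
        size=str(payload.get("size", "")),
        timeCreated=payload.get("timeCreated", ""),
    ).result()
```

Because the function is deployed with --trigger-topic, no subscription is created or deleted in code; only the publisher client for topic B is needed.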