How to see progress when using Glue to export DynamoDB table - amazon-web-services

I'm trying to export every item in a DynamoDB table to S3. I found this tutorial https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/ and followed the example. Basically,
table = glueContext.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percentage,
        "dynamodb.splits": splits
    }
)
glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={
        "path": output_path
    },
    format=output_format,
    transformation_ctx="datasink"
)
I tested it on a tiny table in a nonprod environment and it works fine. But my DynamoDB table in production is over 400 GB, with 200 million items. I suppose it will take a while, but I have no idea how long to expect. Hours, or even days? Is there any way to show progress, for example a count of how many items have been processed? I don't want to blindly start this job and wait.

One way would be to enable continuous logging for your AWS Glue Job to monitor its progress.
Another way would be to trigger a Lambda function whenever a file has been stored in S3, using Amazon S3 event notifications.

Did you try the custom waiter class from the AWS docs?
For instance, a custom waiter for a Glue job should look something like this:
class JobCompleteWaiter(CustomWaiter):
    def __init__(self, client):
        super().__init__(
            "JobComplete",
            "get_job_run",
            "JobRun.JobRunState",
            {"SUCCEEDED": WaitState.SUCCEEDED, "FAILED": WaitState.FAILED},
            client,
            max_tries=100,
        )
    def wait(self, JobName, RunId):
        self._wait(JobName=JobName, RunId=RunId)
According to the boto3 docs, you should expect one of 7 possible states for a job run: 'STARTING'|'RUNNING'|'STOPPING'|'STOPPED'|'SUCCEEDED'|'FAILED'|'TIMEOUT'
So I chose to check whether the state was SUCCEEDED or FAILED.
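If the custom waiter helper is not at hand, the same check can be done with a plain polling loop around get_job_run. The following is only a minimal sketch: the function name poll_job_run and its parameters are made up for illustration, and client is assumed to be a boto3 Glue client. Note that it only reports the run state and the elapsed execution time, not a count of exported items.

```python
import time


def poll_job_run(client, job_name, run_id, delay=30, max_tries=100, sleep=time.sleep):
    """Poll a Glue job run until it reaches a terminal state.

    `client` is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    Returns the final JobRunState, or None if max_tries polls were not enough.
    """
    terminal = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}
    for _ in range(max_tries):
        run = client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        # ExecutionTime is the run time in seconds as reported by Glue.
        print(f"{job_name} {run_id}: {state} after {run.get('ExecutionTime', 0)}s")
        if state in terminal:
            return state
        sleep(delay)
    return None
```

It could be called as poll_job_run(boto3.client("glue"), "my-export-job", run_id), where the job name is a placeholder and run_id is the JobRunId returned by start_job_run.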

Related

Cron job not able to trigger AWS Step function

I followed this tutorial here to configure a cron job to kick off my AWS Step Function Data Science SDK. Every time it tries to trigger the state machine it immediately get
I added the newly generated IAM role to have full access to SageMaker and Lambda so I am not sure what is going on.
I am going to provide more information here. The way I setup my sklearn estimator is like this
sklearn_estimator = SKLearn(
    entry_point=sm_script,
    role=role,
    instance_count=1,
    dependencies=[sm_utils_file, config_path, 'requirements.txt'],
    instance_type=training_instance,
    sagemaker_session=sm_sess,
    framework_version=FRAMEWORK_VERSION,
    base_job_name='{}-training'.format(base_name),
    hyperparameters={'config': config_file},
    metric_definitions=[
        {'Name': 'client_devices_validation_accuracy_top1', 'Regex': "client devices accuracy for top 1 = ([0-9.]+)"},
        {'Name': 'client_devices_validation_f1_top1', 'Regex': "client devices f1 for top 1 = ([0-9.]+)"},
        {'Name': 'account_and_password_validation_accuracy_top1', 'Regex': "account and password accuracy for top 1 = ([0-9.]+)"},
        {'Name': 'account_and_password_validation_f1_top1', 'Regex': "account and password f1 for top 1 = ([0-9.]+)"}]
)
As you can see I have set some dependencies there that the model needs to train. For the training step I have defined it as so
training_step = steps.TrainingStep(
    "Train Step",
    estimator=sklearn_estimator,
    data={
        "train": sagemaker.TrainingInput(pre_train_utils.resolution_data_path()),
    },
    job_name=execution_input["TrainingJobName"],
    wait_for_completion=True,
)
This part pre_train_utils.resolution_data_path() grabs the newest data from redshift and the pre_train_utils is stored as a dependency in the estimator so it should be fine? I am now thinking that this could be the problem?
Update:
I was able to find the error which states this
An error occurred while executing the state 'Train Step' (entered at the event id #2). The JSONPath '$$.Execution.Input['TrainingJobName']' specified for the field 'TrainingJobName.$' could not be found in the input
Specifically it needs to have a json input that looks like this in my case
{
    "TrainingJobName": "tt-resolution-classifier-training-2022-09-02",
    "ModelName": "tt-resolution-classifier-model-2022-09-02",
    "EndpointName": "tt-resolution-classifier-endpoint",
    "LambdaFunctionName": "odi-ds-grab-ticket-training-metrics"
}
How do I pass this into the AWS CloudWatch cron job? If I cannot pass it, then I cannot have the state machine train and deploy the endpoint automatically...
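One possible direction, sketched under assumptions: a scheduled CloudWatch Events / EventBridge rule can send a constant JSON document to its target through the target's Input field, which put_targets accepts as a JSON string. The helper names, the target Id and the ARNs below are hypothetical. Since a constant input reuses the same TrainingJobName on every run, and SageMaker training job names have to be unique, a dated name would probably need to be generated somewhere else (for example by a small Lambda that starts the execution).

```python
import json


def build_state_machine_target(state_machine_arn, role_arn, execution_input):
    """Build a rule target that starts a state machine with a fixed JSON input."""
    return {
        "Id": "start-training-state-machine",  # hypothetical target id
        "Arn": state_machine_arn,
        "RoleArn": role_arn,  # role the rule assumes to call states:StartExecution
        "Input": json.dumps(execution_input),  # constant JSON passed as execution input
    }


def attach_target(events_client, rule_name, target):
    """Attach the target to an existing scheduled rule.

    `events_client` is assumed to be boto3.client("events").
    """
    return events_client.put_targets(Rule=rule_name, Targets=[target])
```

Here execution_input would be a dict with the same keys as the JSON above (TrainingJobName, ModelName, EndpointName, LambdaFunctionName).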

BigQuery displaying wrong results - Duplicating data from Cloud Function?

I am a junior developer and I was in charge of implementing the Facebook API to an existing project. However, the business team figured out that the Google Analytics results displayed on BigQuery are wrong. They asked me to fix it. This is the architecture:
What I have done is:
On BigQuery, checking how close/far are the results from Google Analytics. I found there is a pattern, the results I am getting on BigQuery are always either 1, 2 or 3 times the original value of GA.
I checked if there is actually multiple cron jobs on the Compute Engine. There is actually only 1 cron job and running once a day.
I verified the results on Google Cloud Storage, and the results there are correct, as you can see below:
Based on this information, I strongly believe that the issue is coming from the Cloud Function, as it is the only element between GCS and BQ. I have looked at the Cloud Function that is triggered by files from GCS and I could not find any duplicate operations.
Do you know how I can find the issue?
Cloud Function
import os
from datetime import datetime
import pandas as pd
from google.cloud import bigquery, storage
BUCKET = "xxxx"
GOOGLE_PROJECT = "xxxx"
HEADER_MAPPING = {
    "Source/Medium": "source_medium",
    "Campaign": "campaign",
    "Last Non-Direct Click Conversions": "last_non_direct_click_conversions",
    "Last Non-Direct Click Conversion Value": "last_non_direct_click_conversion_value",
    "Last Click Prio Conversions": "last_click_prio_conversions",
    "Last Click Prio Conversion Value": "last_click_prio_conversion_value",
    "Data-Driven Conversions": "dda_conversions",
    "Data-Driven Conversion Value": "dda_conversion_value",
    "% Change in Conversions from Last Non-Direct Click to Last Click Prio": "last_click_prio_vs_last_click",
    "% Change in Conversions from Last Non-Direct Click to Data-Driven": "dda_vs_last_click"
}
SPEND_HEADER_MAPPING = {
    "Source/Medium": "source_medium",
    "Campaign": "campaign",
    "Spend": "spend"
}
tables_schema = {
    "google-analytics": [
        bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
        bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
        bigquery.SchemaField("goal", bigquery.enums.SqlTypeNames.STRING, mode='REQUIRED'),
        bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("last_non_direct_click_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
        bigquery.SchemaField("last_non_direct_click_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("last_click_prio_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
        bigquery.SchemaField("last_click_prio_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("dda_conversions", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("dda_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("last_click_prio_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
        bigquery.SchemaField("dda_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE')
    ],
    "google-analytics-spend": [
        bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
        bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
        bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
        bigquery.SchemaField("spend", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
    ]
}
def download_from_gcs(file):
    client = storage.Client()
    bucket = client.get_bucket(BUCKET)
    blob = bucket.get_blob(file['name'])
    file_name = os.path.basename(os.path.normpath(file['name']))
    blob.download_to_filename(f"/tmp/{file_name}")
    return file_name
def load_in_bigquery(file_object, dataset: str, table: str):
    client = bigquery.Client()
    table_id = f"{GOOGLE_PROJECT}.{dataset}.{table}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        schema=tables_schema[table]
    )
    job = client.load_table_from_file(file_object, table_id, job_config=job_config)
    job.result()  # Wait for the job to complete.
def __order_columns(df: pd.DataFrame, spend=False) -> pd.DataFrame:
    # We want to have source and medium columns at the third position
    # for a spend data frame and at the fourth position for other data frames,
    # because a spend data frame doesn't have a goal column.
    pos = 2 if spend else 3
    cols = df.columns.tolist()
    cols[pos:2] = cols[-2:]
    cols = cols[:-2]
    return df[cols]
def __common_transformation(df: pd.DataFrame, date: str, goal: str) -> pd.DataFrame:
    # for any kind of dataframe, we add date and week columns
    # based on the file name and we split Source/Medium from the csv
    # into two different columns
    week_of_the_year = datetime.strptime(date, '%Y-%m-%d').isocalendar()[1]
    df.insert(0, 'date', date)
    df.insert(1, 'week', week_of_the_year)
    mapping = SPEND_HEADER_MAPPING if goal == "spend" else HEADER_MAPPING
    print(df.columns.tolist())
    df = df.rename(columns=mapping)
    print(df.columns.tolist())
    print(df)
    df["source_medium"] = df["source_medium"].str.replace(' ', '')
    df[["source", "medium"]] = df["source_medium"].str.split('/', expand=True)
    df = df.drop(["source_medium"], axis=1)
    df["week"] = df["week"].astype(int, copy=False)
    return df
def __transform_spend(df: pd.DataFrame) -> pd.DataFrame:
    df["spend"] = df["spend"].astype(float, copy=False)
    df = __order_columns(df, spend=True)
    return df[df.columns[:6]]
def __transform_attribution(df: pd.DataFrame, goal: str) -> pd.DataFrame:
    df.insert(2, 'goal', goal)
    df["last_non_direct_click_conversions"] = df["last_non_direct_click_conversions"].astype(int, copy=False)
    df["last_click_prio_conversions"] = df["last_click_prio_conversions"].astype(int, copy=False)
    df["dda_conversions"] = df["dda_conversions"].astype(float, copy=False)
    return __order_columns(df)
def transform(df: pd.DataFrame, file_name) -> pd.DataFrame:
    goal, date, *_ = file_name.split('_')
    df = __common_transformation(df, date, goal)
    # we only add goal in attribution df (google-analytics table).
    return __transform_spend(df) if "spend" in file_name else __transform_attribution(df, goal)
def main(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    file_name = download_from_gcs(file)
    df = pd.read_csv(f"/tmp/{file_name}")
    transformed_df = transform(df, file_name)
    with open(f"/tmp/bq_{file_name}", "w") as file_object:
        file_object.write(transformed_df.to_csv(index=False))
    with open(f"/tmp/bq_{file_name}", "rb") as file_object:
        table = "google-analytics-spend" if "spend" in file_name else "google-analytics"
        load_in_bigquery(file_object, dataset='attribution', table=table)
Update:
Yes, the cloud function is triggered by the GCS object finalize event. Moreover, the function won't be automatically retried on failure.
I am following your suggestions and I am now checking the log table on my Cloud Function page. In the last 10 lines of log data, it seems that 3 different instances of the Cloud Function were run. I am not able to get more details when I expand each line.
I am also going to check BigQuery logs now. I guess the easiest solution would be to use BigQueryAuditMetadata and get logs about when the table was updated?
From my point of view, this is a very big topic, so it might be very difficult to provide one precise solution that solves all issues. So I won't be able to solve the issue; I can only share some personal observations and provide some suggestions.
The cloud function is triggered by the GCS object finalize event - can you check that this is correct, please? In that case, the event is 'going through the PubSub' before triggering the cloud function invocation. Now there are 2 things to keep in mind:
The PubSub is based on the 'deliver at least once' paradigm, thus duplicate message deliveries are possible.
Such a cloud function invocation has an automatic acknowledgement, and the developer has no control over that. The PubSub cannot be used to control the overall process state. And the longer the cloud function runs (up to the timeout of 540 seconds or more), the higher the chance that the PubSub makes an (internal) decision that the message has not been delivered and therefore should be delivered again, which results in a new invocation of the cloud function.
Some additional details are here: Issue: Cloud Function explicit acknowledgement of a pubsub message
Now, how to see if that happens. Personally, I would start with logging. When a cloud function starts, I would log the object name and a hash (e.g. CRC32C or MD5, which are available from the event metadata) in the cloud function code itself. That way, I would be able to see multiple cloud function invocations for one GCS object in the logs (if that happens). It is also a good idea to log how long each cloud function invocation runs.
With that step we can check whether the cloud function is called more than once for a given GCS object.
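The logging step could look roughly like the sketch below. It assumes a background function triggered by a GCS event, where the event dict carries name, generation, crc32c and md5Hash and the context object exposes event_id; the helper names are made up for illustration.

```python
import logging

logger = logging.getLogger("gcs_trigger")


def describe_invocation(event, context):
    """Return a one-line summary identifying the object and the triggering event."""
    return (
        f"event_id={getattr(context, 'event_id', None)} "
        f"name={event.get('name')} "
        f"generation={event.get('generation')} "
        f"crc32c={event.get('crc32c')} "
        f"md5Hash={event.get('md5Hash')}"
    )


def log_invocation(event, context):
    # Call this first thing in the entry point, before any download or load,
    # so duplicate invocations show up even if a later step fails.
    logger.info(describe_invocation(event, context))
```

Two log lines with the same name and hash but different event_id values would point to separate events for the same object, while repeated lines with the same event_id would suggest a redelivery of one event.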
The next step - how we load the data. The 'load' is a job. It means that there is a queue and a scheduler (somewhere in the GCP BigQuery service). The load job stays in the queue until it is picked up for loading, and then that job is executed. All of that is extremely 'asynchronous'. Can you check if there are failed load jobs, please? In the simplest case, that can be done through the BigQuery UI.
On top of that, loading from inside the cloud function memory is, from my point of view, not only very expensive, but also very risky and unreliable. Even simply saving the csv into a GCS bucket and loading from the bucket may be much better.
Next - there is a quota of 1500 load jobs per table per day, as far as I remember. If you have many files to load, you can easily exceed that quota.
An alternative way of 'loading' data is to use streaming. It does not have such quota limitations, but it is chargeable, so you have to pay for streaming.
I will stop for now, let me know if the above was useful, please. And in what direction you are going to develop your solution.
=> Update 04 February 2021 10:50 GMT
To avoid copy and paste - see the answer here: Cloud Function running multiple times instead of once

Creating a CloudWatch Metrics from the Athena Query results

My Requirement
I want to create a CloudWatch-Metric from Athena query results.
Example
I want to create a metric like user_count of each day.
In Athena, I will write an SQL query like this
select date,count(distinct user) as count from users_table group by 1
In the Athena editor I can see the result, but I want to see these results as a metric in Cloudwatch.
CloudWatch-Metric-Name ==> user_count
Dimensions ==> Date,count
If I have this CloudWatch metric and dimensions, I can easily create a monitoring dashboard and send alerts.
Can anyone suggest a way to do this?
You can use CloudWatch custom widgets, see "Run Amazon Athena queries" in Samples.
It's somewhat involved, but you can use a Lambda for this. In a nutshell:
1. Set up your query in Athena and make sure it works using the Athena console.
2. Create a Lambda that:
   - Runs your Athena query
   - Pulls the query results from S3
   - Parses the query results
   - Sends the query results to CloudWatch as a metric
3. Use EventBridge to run your Lambda on a recurring basis
Here's an example Lambda function in Python that does step #2. Note that the Lambda function will need IAM permissions to run queries in Athena, read the results from S3, and then put a metric into CloudWatch.
import time
import boto3
query = 'select count(*) from mytable'
DATABASE = 'default'
bucket = 'BUCKET_NAME'
path = 'yourpath'
def lambda_handler(event, context):
    # Run query in Athena
    client = boto3.client('athena')
    output = "s3://{}/{}".format(bucket, path)
    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )
    # S3 file name uses the QueryExecutionId so
    # grab it here so we can pull the S3 file.
    qeid = response["QueryExecutionId"]
    # Occasionally Athena hasn't written the file
    # before the Lambda tries to pull it out of S3, so pause a few seconds.
    # Note: You are charged for time the Lambda is running.
    # A more elegant but more complicated solution would try to get the
    # file first then sleep.
    time.sleep(3)
    # Get query result from S3.
    s3 = boto3.client('s3')
    objectkey = path + "/" + qeid + ".csv"
    # load object as file
    file_content = s3.get_object(
        Bucket=bucket,
        Key=objectkey)["Body"].read()
    # split file into lines
    lines = file_content.decode().splitlines()
    # get the second line in file (the first line is the header)
    count = lines[1]
    # remove double quotes
    count = count.replace("\"", "")
    # convert string to int since CloudWatch wants a numeric value
    count = int(count)
    # post query results as a CloudWatch metric
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.put_metric_data(
        MetricData=[
            {
                'MetricName': 'MyMetric',
                'Dimensions': [
                    {
                        'Name': 'DIM1',
                        'Value': 'dim1'
                    },
                ],
                'Unit': 'None',
                'Value': count
            },
        ],
        Namespace='MyMetricNS'
    )
    return response
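As an alternative to the fixed time.sleep(3), the Lambda could ask Athena for the query state and only read the result file once the query has finished. This is a minimal sketch, not part of the original answer: wait_for_query and its parameters are illustrative names, and athena_client is assumed to be boto3.client('athena').

```python
import time


def wait_for_query(athena_client, query_execution_id, delay=1.0, max_tries=60, sleep=time.sleep):
    """Poll Athena until the query reaches a terminal state and return that state."""
    for _ in range(max_tries):
        response = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        # Still QUEUED or RUNNING: wait a little before asking again.
        sleep(delay)
    raise TimeoutError(f"Query {query_execution_id} did not finish after {max_tries} polls")
```

In the handler above, this would replace the sleep: call wait_for_query(client, qeid) and fetch the CSV from S3 only if the returned state is SUCCEEDED.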

How to delete / drop multiple tables in AWS athena?

I am trying to drop a few tables from Athena, but I cannot run multiple DROP queries at the same time. Is there a way to do it?
Thanks!
You are correct. It is not possible to run multiple queries in the one request.
An alternative is to create the tables in a specific database. Dropping the database will then cause all the tables to be deleted.
For example:
CREATE DATABASE foo;
CREATE EXTERNAL TABLE bar1 ...;
CREATE EXTERNAL TABLE bar2 ...;
DROP DATABASE foo CASCADE;
The DROP DATABASE command will delete the bar1 and bar2 tables.
You can use the aws-cli batch-delete-table command to delete multiple tables at once.
aws glue batch-delete-table \
    --database-name <database-name> \
    --tables-to-delete "<table1-name>" "<table2-name>" "<table3-name>" ...
You can use AWS Glue interface to do this now. The prerequisite being you must upgrade to AWS Glue Data Catalog.
If you Upgrade to the AWS Glue Data Catalog from Athena, the metadata for tables created in Athena is visible in Glue and you can use the AWS Glue UI to check multiple tables and delete them at once.
FAQ on Upgrading data catalog: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
You could write a shell script to do this for you:
for table in products customers stores; do
    aws athena start-query-execution --query-string "drop table $table" --result-configuration OutputLocation=s3://my-ouput-result-bucket
done
Use AWS Glue's Python shell and invoke this function:
import boto3
def run_query(query, database, s3_output):
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response
Athena configuration:
s3_input = 's3://athena-how-to/data'
s3_ouput = 's3://athena-how-to/results/'
database = 'your_database'
table = 'tableToDelete'
query_1 = "drop table %s.%s;" % (database, table)
queries = [ query_1]
#queries = [ create_database, create_table, query_1, query_2 ]
for q in queries:
    print("Executing query: %s" % (q))
    res = run_query(q, database, s3_ouput)
@Vidy
I would second what @Prateek said. Please provide an example of your code. Also, please tag your post with the language/shell that you're using to interact with AWS.
Currently, you cannot run multiple queries in one request. However, you can make multiple requests simultaneously. Currently, you can run 20 requests simultaneously (2018-06-15). You could do this through an API call or the console. In addition you could use the CLI or the SDK (if available for your language of choice).
For example, in Python you could use the multiprocessing or threading modules to manage concurrent requests. Just remember to consider thread/process safety when creating resources/clients.
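A minimal sketch of the threading approach, using the standard-library ThreadPoolExecutor. The function name drop_tables_concurrently is made up for illustration, and run_query is assumed to be any callable with the same shape as the run_query(query, database, s3_output) helper shown earlier in this thread. Note that start_query_execution only submits the query, so the returned responses confirm submission, not completion.

```python
from concurrent.futures import ThreadPoolExecutor


def drop_tables_concurrently(run_query, database, tables, s3_output, max_workers=5):
    """Submit one DROP TABLE query per table, a few at a time.

    Results are returned in the same order as `tables`.
    """
    queries = [f"DROP TABLE IF EXISTS `{database}`.`{table}`" for table in tables]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_query, q, database, s3_output) for q in queries]
        # result() re-raises any exception raised inside the worker thread.
        return [f.result() for f in futures]
```

Keeping max_workers below the account's concurrent query limit avoids throttling errors; the default of 5 here is an arbitrary, conservative choice.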
Service Limits:
Athena Service Limits
AWS Service Limits for which you can request a rate increase
I could not get Carl's method to work by executing DROP TABLE statements even though they did work in the console.
So I just thought it was worth posting my approach that worked for me, which uses a combination of the AWS Pandas SDK and the CLI
import awswrangler as wr
import boto3
import os
session = boto3.Session(
    aws_access_key_id='XXXXXX',
    aws_secret_access_key='XXXXXX',
    aws_session_token='XXXXXX'
)
database_name = 'athena_db'
athena_s3_output = 's3://athena_s3_bucket/athena_queries/'
df = wr.athena.read_sql_query(
    sql="SELECT DISTINCT table_name FROM information_schema.tables WHERE "
        "table_schema = '" + database_name + "'",
    database=database_name,
    s3_output=athena_s3_output,
    boto3_session=session
)
print(df)
# ensure that your aws profile is valid for CLI commands
# i.e. your credentials are set in C:\Users\xxxxxxxx\.aws\credentials
for table in df['table_name']:
    cli_string = 'aws glue delete-table --database-name ' + database_name + ' --name ' + table
    print(cli_string)
    os.system(cli_string)

Trying to disable all the Cloud Watch alarms in one shot

My organization is planning for a maintenance window for the next 5 hours. During that time, I do not want Cloud Watch to trigger alarms and send notifications.
Earlier, when I had to disable 4 alarms, I wrote the following code in AWS Lambda. This worked fine.
import boto3
import collections
client = boto3.client('cloudwatch')
def lambda_handler(event, context):
    response = client.disable_alarm_actions(
        AlarmNames=[
            'CRITICAL - StatusCheckFailed for Instance 456',
            'CRITICAL - StatusCheckFailed for Instance 345',
            'CRITICAL - StatusCheckFailed for Instance 234',
            'CRITICAL - StatusCheckFailed for Instance 123'
        ]
    )
But now, I was asked to disable all the alarms which are 361 in number. So, including all those names would take a lot of time.
Please let me know what I should do now?
Use describe_alarms() to obtain a list of them, then iterate through and disable them:
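import boto3
client = boto3.client('cloudwatch')
# Note: a single describe_alarms() call returns only one page of results
response = client.describe_alarms()
# AlarmNames must be a flat list of strings
names = [alarm['AlarmName'] for alarm in response['MetricAlarms']]
disable_response = client.disable_alarm_actions(AlarmNames=names)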
You might want some logic around the Alarm Name to only disable particular alarms.
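With 361 alarms, one call is unlikely to cover everything: describe_alarms returns its results in pages, and disable_alarm_actions accepts at most 100 alarm names per call. A minimal sketch that pages through all metric alarms and disables them in batches follows; the function names are illustrative, and client is assumed to be boto3.client('cloudwatch').

```python
def all_alarm_names(client):
    """Collect every metric alarm name, following pagination."""
    names = []
    for page in client.get_paginator("describe_alarms").paginate():
        names.extend(alarm["AlarmName"] for alarm in page["MetricAlarms"])
    return names


def chunks(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def disable_all_alarm_actions(client):
    """Disable actions for all metric alarms, 100 names per request."""
    names = all_alarm_names(client)
    for batch in chunks(names):
        client.disable_alarm_actions(AlarmNames=batch)
    return len(names)
```

After the maintenance window, the same loop with enable_alarm_actions would switch the actions back on.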
If you do not have the specific alarm ARNs, then you can use the logic in the previous answer. If you have a specific list of ARNs that you want to disable, you can fetch the matching alarm names using this:
def get_alarm_names(alarm_arns):
    names = []
    response = client.describe_alarms()
    for i in response['MetricAlarms']:
        if i['AlarmArn'] in alarm_arns:
            names.append(i['AlarmName'])
    return names
Here's a full tutorial: https://medium.com/geekculture/terraform-structure-for-enabling-disabling-alarms-in-batches-5c4f165a8db7