How to get last run Datetime of Crawler in Athena? - amazon-web-services

I have an AWS Glue Crawler which runs twice a day and populates data in Athena.
Quicksight takes data from Athena and shows it in a dashboard.
I am implementing LastDataRefresh (Datetime) to show in a Quicksight dashboard. Is there a way I can get the last crawler run datetime so that I can store it in an Athena table and show it in Quicksight?
Any other suggestions are also welcome.

TL;DR Extract the crawler last run time from Glue's CloudWatch logs
Glue sends a series of events to CloudWatch during each crawler run. Extract and process the "finished running" logs from the /aws-glue/crawlers log group to get the latest run for each crawler.
Logs for a single crawler run:
2021-12-15T12:08:54.448+01:00 [7dd..] BENCHMARK : Running Start Crawl for Crawler lorawan_datasets_bucket_crawler
2021-12-15T12:09:12.559+01:00 [7dd..] BENCHMARK : Classification complete, writing results to database jokerman_events_database
2021-12-15T12:09:12.560+01:00 [7dd..] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
2021-12-15T12:09:27.064+01:00 [7dd..] BENCHMARK : Finished writing to Catalog
2021-12-15T12:12:13.768+01:00 [7dd..] BENCHMARK : Crawler has finished running and is in state READY
Extract and process the BENCHMARK : Crawler has finished running and is in state READY logs:
import boto3
from datetime import datetime, timedelta

def get_last_runs():
    session = boto3.Session(profile_name='sandbox', region_name='us-east-1')
    logs = session.client('logs')
    startTime = datetime.now() - timedelta(days=14)
    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.filter_log_events
    filtered_events = logs.filter_log_events(
        logGroupName="/aws-glue/crawlers",
        # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html#matching-terms-events
        filterPattern="BENCHMARK state READY",  # match "BENCHMARK : Crawler has finished running and is in state READY" messages
        startTime=int(startTime.timestamp() * 1000)
    )
    # one event per completed run; the log stream name is the crawler name
    completed_runs = [
        {"crawler": m.get("logStreamName"), "timestamp": datetime.fromtimestamp(m.get("timestamp") / 1000).isoformat()}
        for m in filtered_events["events"]
    ]
    # rework the list to get a dictionary of the last runs by crawler
    crawlers = set([r['crawler'] for r in completed_runs])
    last_runs = dict()
    for n in crawlers:
        last_runs[n] = max([d["timestamp"] for d in completed_runs if d["crawler"] == n])
    print(last_runs)
    return last_runs
Output:
{
'lorawan_datasets_bucket_crawler': '2021-12-15T12:12:13.768000',
'jokerman_lorawan_events_table_crawler': '2021-12-15T12:12:12.007000'
}
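To surface this in Quicksight, one option is to write the last_runs dictionary produced above to S3 as a small CSV and point an Athena external table at that prefix, which Quicksight can then use as a dataset. A minimal sketch, assuming a hypothetical bucket and key (not from the question):
import csv
import io
import boto3

def publish_last_runs(last_runs: dict, bucket: str = "my-metadata-bucket") -> None:
    """Write one 'crawler,last_data_refresh' row per crawler to S3 for an Athena external table."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for crawler, ts in last_runs.items():
        writer.writerow([crawler, ts])
    # overwrite the single object on every refresh; the Athena table points at the prefix
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key="crawler_last_runs/last_runs.csv",
        Body=buf.getvalue().encode("utf-8"),
    )

publish_last_runs(get_last_runs())
An external table with two string columns over s3://my-metadata-bucket/crawler_last_runs/ (created once in Athena) then exposes LastDataRefresh to the Quicksight dashboard.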

Related

Apache Iceberg tables not working with AWS Glue in AWS EMR

I'm trying to load a table, stored in S3 in Apache Iceberg format, into a Spark EMR cluster from the Glue catalog. The table is correctly created because I can query it from AWS Athena. On the cluster creation I have set this configuration:
[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]
I have tried running SQL queries from Spark on tables in other formats (CSV) and it works, but when I try to read Iceberg tables I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is the code in the notebook:
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.dev.type": "hadoop",
        "spark.sql.catalog.dev.warehouse": "s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t

spark = SparkSession.builder.getOrCreate()

# This query works and shows the iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)

# Here the error is raised
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)
How can I read Apache Iceberg tables in an EMR cluster with Spark and the Glue catalog?
You need to reference the table through the Glue catalog name.
Example: glue_catalog.<your_database_name>.<your_table_name>
https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
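As a sketch of what the linked docs describe, you register an Iceberg catalog backed by the Glue Data Catalog and query tables through that catalog name. The catalog name glue_catalog, the warehouse path and the table names below are placeholders; the same properties can equally go into the %%configure block used in the question:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/your-warehouse-prefix/")
    .getOrCreate()
)

# Qualify the table with the catalog name instead of only the database name
spark.sql("select * from glue_catalog.iceberg_test.table_name limit 10").show(truncate=False)
Note that the question's configuration registers a Hadoop catalog ("spark.sql.catalog.dev.type": "hadoop") rather than a Glue-backed one, which is likely why the Glue-registered Iceberg table cannot be read through the default catalog.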

How to download newly uploaded files from S3 to EC2 every time

I have an S3 bucket which will receive new files throughout the day. I want to download these to my EC2 instance every time a new file is uploaded to the bucket.
I have read that it's possible using SQS, SNS or Lambda. Which is the easiest of them all? I need the file to be downloaded as early as possible once it is uploaded into the bucket.
EDIT
I will basically be getting PNG images in the bucket every few seconds or minutes. Every time a new image is uploaded, I want to download it to the instance, which is already running, and do some AI processing on it. As the images keep coming into the bucket, I want to keep downloading them to the EC2 instance and process them as soon as possible.
This is my code in the Lambda function so far.
import boto3
import json
import time

def lambda_handler(event, context):
    """Read file from s3 on trigger."""
    # print(event)
    s3 = boto3.client("s3")
    client = boto3.client("ec2")
    ssm = boto3.client("ssm")
    instanceid = "******"
    if event:
        file_obj = event["Records"][0]
        # print(file_obj)
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        print(bucketname)
        filename = str(file_obj["s3"]["object"]["key"])
        print(filename)
        response = ssm.send_command(
            InstanceIds=[instanceid],
            DocumentName="AWS-RunShellScript",
            Parameters={
                "commands": [f"aws s3 cp {filename} ."]
            },  # replace command_to_be_executed with command
        )
        # fetching command id for the output
        command_id = response["Command"]["CommandId"]
        time.sleep(3)
        # fetching command output
        output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
        print(output)
    return
However, I am getting the following error:
Test Event Name
test
Response
{
"errorMessage": "2021-12-01T14:11:30.781Z 88dbe51b-53d6-4c06-8c16-207698b3a936 Task timed out after 3.00 seconds"
}
Function Logs
START RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936 Version: $LATEST
END RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936
REPORT RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936 Duration: 3003.58 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 87 MB Init Duration: 314.81 ms
2021-12-01T14:11:30.781Z 88dbe51b-53d6-4c06-8c16-207698b3a936 Task timed out after 3.00 seconds
Request ID
88dbe51b-53d6-4c06-8c16-207698b3a936
When I remove all the lines related to SSM, it works fine. Is there a permission issue, or is there a problem with the code?
EDIT2
My code is working, but I don't see any output or change on my EC2 instance. I should be seeing an empty text file in the home directory, but I don't see anything.
Code
import boto3
import json
import time
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Read file from s3 on trigger."""
    # print(event)
    s3 = boto3.client("s3")
    client = boto3.client("ec2")
    ssm = boto3.client("ssm")
    instanceid = "******"
    print("HI")
    if event:
        file_obj = event["Records"][0]
        # print(file_obj)
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        print(bucketname)
        filename = str(file_obj["s3"]["object"]["key"])
        print(filename)
        print("sending")
        try:
            response = ssm.send_command(
                InstanceIds=[instanceid],
                DocumentName="AWS-RunShellScript",
                Parameters={
                    "commands": ["touch hi.txt"]
                },  # replace command_to_be_executed with command
            )
            # fetching command id for the output
            command_id = response["Command"]["CommandId"]
            time.sleep(3)
            # fetching command output
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print(output)
        except Exception as e:
            logger.error(e)
            raise e
There are several ways. One would be to set up S3 notifications to invoke a Lambda function. The Lambda function would then use SSM Run Command to execute an AWS CLI S3 command on your instance to download the file from S3.
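A minimal sketch of that SSM call, assuming the Lambda already has the bucket and key from the S3 event (the instance ID and target directory below are placeholders):
import boto3

ssm = boto3.client("ssm")

def download_on_instance(instance_id: str, bucket: str, key: str) -> str:
    """Ask SSM Run Command to copy the newly uploaded object onto the instance."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={
            # use the full s3:// URI; a bare object key is not enough for `aws s3 cp`
            "commands": [f"aws s3 cp 's3://{bucket}/{key}' /home/ec2-user/"]
        },
    )
    return response["Command"]["CommandId"]
The instance needs the SSM agent running and an instance profile that allows both SSM and s3:GetObject on the bucket.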
I don't know why Lambda is being recommended here. What you need is simple: an S3 object-created event notification -> SQS, and a job on your EC2 instance watching the queue with long polling.
Here is an example of such a Python script. The object key is inside the JSON-encoded S3 event in the message body. I haven't tested this, but it should be pretty close.
import json
import boto3

def main() -> None:
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    queue_url = "yourQueue"
    while True:
        res = sqs.receive_message(
            QueueUrl=queue_url,
            WaitTimeSeconds=20,
        )
        for msg in res.get("Messages", []):
            # the S3 event arrives JSON-encoded in the SQS message body
            body = json.loads(msg["Body"])
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                s3.download_file(bucket, key, "local/file/path")
            # delete the message so it is not delivered again
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    main()
You can use S3 Event Notifications, which react to a new file arriving in the S3 bucket.
The destinations supported by S3 events are SNS, SQS or AWS Lambda.
You can directly use Lambda as the destination, as described by #Marcin.
You can use SQS as a queue with a Lambda behind it pulling from the queue. It gives you capabilities like a dead-letter queue. You can then pull messages from the queue using different methods:
AWS CLI
AWS SDK
You can use SNS with different destinations behind it (you can have several of these destinations at once, which is the fan-out pattern):
a SQS queue to manage the files
an email to notify
a lambda function
...
You can find more explanation in this article: https://aws.plainenglish.io/system-design-s3-events-to-lambda-vs-s3-events-to-sqs-sns-to-lambda-2d41477d1cc9
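For instance, wiring the bucket to an SQS queue is a one-time notification configuration; a sketch with placeholder bucket and queue names (the queue policy must allow the bucket to send messages):
import boto3

s3 = boto3.client("s3")

# Deliver all object-created events from the bucket to the queue
s3.put_bucket_notification_configuration(
    Bucket="your-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:your-queue",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)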

Export from BigQuery to CSV based on client id

I have a BigQuery table filled with product data for a series of clients. The data has been flattened using a query. I want to export the data for each client to a Google Cloud Storage bucket in csv format - so each client has its own individual csv.
There are just over 100 clients, each with a client_id and the table itself is 1GB in size. I've looked into querying the table using a cloud function, but this would cost over 100,000 GB of data. I've also looked at importing the clients to individual tables directly from the source, but I would need to run the flattening query on each - again incurring a high data cost.
Is there a way of doing this that will limit data usage?
Have you thought about Dataproc?
You could write a simple PySpark script that loads the data from BigQuery and writes it into the bucket split by client_id, something like this:
"""
File takes 3 arguments:
BIGQUERY-SOURCE-TABLE
desc: table being source of data in BiqQuery
format: project.dataset.table (str)
BUCKET-DEST-FOLDER
desc: path to bucket folder where CSV files will be stored
format: gs://bucket/folder/ (str)
SPLITER:
desc: name of column on which spit will be done during data saving
format: column-name (str)
"""
import sys
from pyspark.sql import SparkSession
if len(sys.argv) != 4:
raise Exception("""Usage:
filename.py BIGQUERY-SOURCE-TABLE BUCKET-DEST-FOLDER SPLITER"""
)
def main():
spark = SparkSession.builder.getOrCreate()
df = (
spark.read
.format("bigquery")
.load(sys.argv[1])
)
(
df
.write
.partitionBy(sys.argv[2])
.format("csv")
.option("header", True)
.mode("overwrite").
save(sys.argv[3])
)
if __name__ == "__main__":
main()
You will need to:
Save this script in a Google Cloud Storage bucket,
Create a Dataproc cluster,
Run the command written below,
Delete the Dataproc cluster.
Let's say you have the following architecture:
bigquery: myproject:mydataset.mytable
bucket: gs://mybucket/
dataproc cluster: my-cluster
So you will need to run the following command:
gcloud dataproc jobs submit pyspark gs://mybucket/script-from-above.py \
--cluster my-cluster \
--region [region-of-cluster] \
--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
-- \
myproject:mydataset.mytable gs://mybucket/destination/ client_id
This will save the data split by client_id in gs://mybucket/destination/, and you will have folders named:
client_id=1
client_id=2
...
client_id=n
As mentioned by #Mr.Batra, you can create partitions on your table based on client_id to regulate cost and amount of data queried.
Implementing a Cloud Function and looping over each client id without partitions will cost more, since each
SELECT * FROM table WHERE client_id=xxx query will scan the full table.
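A sketch of such a partitioned copy, assuming client_id is an integer in a known range (table names and range bounds below are hypothetical); clustering by client_id is an alternative if it is a string:
from google.cloud import bigquery

client = bigquery.Client()

# Integer-range partitioning on client_id; a query filtered on one client
# then scans only that client's partition instead of the whole table.
ddl = """
CREATE TABLE mydataset.products_by_client
PARTITION BY RANGE_BUCKET(client_id, GENERATE_ARRAY(1, 201, 1)) AS
SELECT * FROM mydataset.products_flattened
"""
client.query(ddl).result()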

returning JSON response from AWS Glue Pythonshell job to the boto3 caller

Is there a way to send a JSON response (of a dictionary of outputs) from an AWS Glue pythonshell job, similar to returning a JSON response from AWS Lambda?
I am calling a Glue pythonshell job like below:
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'})
print(response)
The response I get is a 200, stating that the Glue start_job_run was a success. From the documentation, all I see is that the result of a Glue job is either written to S3 or some other database.
I tried adding return {'result':'some_string'} at the end of my Glue pythonshell job, with the code below, to test whether it works.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           's3_target_path_key',
                           's3_target_path_value'])
print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])

return {'result': "some_string"}
But it throws the error SyntaxError: 'return' outside function.
Glue is not made to return a response, as it is expected to run long-running operations. Blocking on a response from a long-running task is not the right approach in itself. Instead, you may use a launch job (service 1) -> execute job (service 2) -> get result (service 3) pattern. You can send the JSON response to the AWS service (3) that you launch from AWS service 2 (the executing job); e.g. if you launch a Lambda from the Glue job, you can send the JSON response to it.
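As a sketch of that pattern, the Glue pythonshell job can hand its output to a downstream consumer itself, for example by invoking a Lambda asynchronously at the end of the script (the function name and payload here are placeholders, not part of the original job):
import json
import boto3

def publish_result(result: dict) -> None:
    """Push the job's output to a downstream Lambda instead of trying to return it."""
    boto3.client("lambda").invoke(
        FunctionName="my-result-consumer",  # hypothetical downstream function
        InvocationType="Event",             # asynchronous, fire-and-forget
        Payload=json.dumps(result).encode("utf-8"),
    )

publish_result({"result": "some_string"})
Writing the same JSON to a known S3 key and having the boto3 caller poll get_job_run until the run succeeds, then read that key, is the other common variant of this pattern.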

How to insert in bigquery using pandas data frame in cron job on server

I am using the pandas to_gbq method. It works when I run it manually, but it won't run in a cron job on the server. Please help.
audience = {'adaccount_id': adaccount_id,
            'audience_id': audience_id,
            'audience_name': audience_name,
            'audience_size': audience_size,
            'created_at': created_at}
audience_dataframe = pd.DataFrame(audience)

# inserting into bigquery
audience_dataframe.to_gbq('fb_raw_data.audience_size',
                          'XXXXXXXXXXXXX',
                          if_exists='replace',
                          private_key=r'C:\Users\Vikas Chauhan\Desktop\contact\New folder\private_key.json')
This works when I run it manually, but it doesn't work in the cron job.