I want to run my Athena query through AWS Lambda, and also change the name of the output CSV file from the Query Execution ID to my-bucket/folder/my-preferred-string.csv.
I tried searching the web, but couldn't find the exact code for the Lambda function.
I am a data scientist and a beginner with AWS. This is a one-time thing for me, so I am looking for a quick solution or workaround.
This question is already posted here
import time

import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")

# Run query
queryStart = client.start_query_execution(
    # PUT_YOUR_QUERY_HERE
    QueryString='''
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext={
        # YOUR_ATHENA_DATABASE_NAME
        'Database': "covid_data"
    },
    ResultConfiguration={
        # Query result output location configured in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)

# Athena runs asynchronously; wait briefly for the query to finish
queryId = queryStart['QueryExecutionId']
time.sleep(3)

# Copy the newly generated CSV file to the preferred name
# Source: the query result output location configured in AWS Athena
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"

# Destination bucket and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource=queryLoc)

# Delete the Athena-generated CSV and its metadata file from the output location
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv").delete()
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv.metadata").delete()

print("report-2018.csv generated")
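A fixed three-second sleep may not be enough for bigger queries. As a variation (a minimal sketch using boto3's get_query_execution, not part of the original snippet), you can poll until the query reaches a terminal state before copying the result:

import time

import boto3

athena = boto3.client('athena')

def wait_for_query(query_id, poll_seconds=2):
    """Poll Athena until the query reaches a terminal state and return that state."""
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id) \
            ['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

# Usage: only copy/rename and delete the output once the query has actually succeeded
# if wait_for_query(queryId) == 'SUCCEEDED':
#     ...copy and delete as above...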
I am running the following code to get the number of records in a parquet file placed inside an S3 bucket.
import boto3

s3 = boto3.client('s3')

sql_stmt = """SELECT count(*) FROM s3object s"""

req_fact = s3.select_object_content(
    Bucket='test_hadoop',
    Key='counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}})

for event in req_fact['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))
    elif 'Stats' in event:
        print(event['Stats'])
However, I get this error: botocore.exceptions.ClientError: An error occurred (XNotImplemented) when calling the SelectObjectContent operation: This node does not support SelectObjectContent.
What is the issue?
I ran your code against a known good (uncompressed) parquet file with no errors.
resp = s3.select_object_content(
    Bucket='my-test-bucket',
    Key='counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
    ExpressionType='SQL',
    Expression="SELECT count(*) FROM s3object s",
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}},
)
Output:
{"_1":4}
In the AWS console you can navigate to the S3 bucket, find the file, select it, and run S3 Select on it (Actions > Query with S3 Select). That will let you validate that the file can be queried with S3 Select (which I think is your issue here).
Note the following: Amazon S3 Select does not support whole-object compression for Apache Parquet objects.
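If your endpoint genuinely doesn't support SelectObjectContent (the XNotImplemented error suggests a node/endpoint limitation), one fallback, sketched here with the pyarrow package and your bucket/key as placeholders, is to download the object and read the row count from the Parquet footer:

import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client('s3')

# Download the object, then read the row count from the Parquet metadata footer
obj = s3.get_object(
    Bucket='test_hadoop',
    Key='counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
)
parquet_file = pq.ParquetFile(io.BytesIO(obj['Body'].read()))
print(parquet_file.metadata.num_rows)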
I have a Cloud Run instance that receives a list of files from Cloud Storage, checks whether each item exists in BigQuery, and uploads it if it doesn't exist yet.
The pipeline structure is: Cloud Function (get list of files from GCS) > PubSub > Cloud Run > BigQuery.
I can't tell where the problem lies, but I suspect it's PubSub's at-least-once delivery. I've set the acknowledgement deadline to 300 seconds and the retry policy to "Retry after exponential backoff delay".
What I expect is to call the Function, get a list of files from GCS, have the pipeline trigger, and see a 1:1 correspondence between files in GCS and entries in BQ.
My question is: which GCP services should I be using to upload JSON files from GCS to BQ? Dataflow? I ask because the problem seems to be something between the Cloud Run instances.
Relevant code in Cloud Run instance
Check if file exists
def existInTable(table_id, dataId):
    try:
        client = bigquery.Client(project=PROJECT)
        tableref = PROJECT + "." + table_id
        sql = """
            SELECT version, dataId
            FROM `{}`
            WHERE version LIKE "{}" AND dataId LIKE "{}"
        """.format(tableref, __version__.__version__, dataId)
        query_job = client.query(sql)
        results = query_job.result()
        print("Found {} Entries for dataid {} and version {} in {}".format(results.total_rows, dataId, __version__.__version__, table_id))
        if results.total_rows > 0:
            print("Skip upload")
            return True
        return False
    except Exception as e:
        print("Failed to find entries in table {}".format(table_id))
        print(f"error: {e}")
        return False
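For reference, a variant of this existence check using BigQuery query parameters rather than string formatting (this is a sketch with a hypothetical helper name and the client passed in explicitly, not the code currently deployed):

from google.cloud import bigquery

def exist_in_table_parameterized(client, table_id, data_id, version):
    """Existence check using query parameters instead of string formatting."""
    sql = """
        SELECT COUNT(*) AS n
        FROM `{}`
        WHERE version = @version AND dataId = @dataId
    """.format(table_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("version", "STRING", version),
            bigquery.ScalarQueryParameter("dataId", "STRING", data_id),
        ]
    )
    rows = list(client.query(sql, job_config=job_config).result())
    return rows[0]["n"] > 0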
upload
def uploadCloudStorageFileToBigquery(table_id, gcsEntity):
    client = bigquery.Client(project=PROJECT)
    job_config = bigquery.LoadJobConfig(
        schema=getSchema(),
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    # Report path
    uri = "gs://{}/{}/{}file.json".format(gcsEntity.bucketName, gcsEntity.version, gcsEntity.dataId)
    # Start upload process
    try:
        load_job = client.load_table_from_uri(
            uri,
            table_id,
            location="US",  # Must match the destination dataset location.
            job_config=job_config,
        )  # Make an API request.
        load_job.result()  # Waits for the job to complete.
        destination_table = client.get_table(table_id)
        print("Successfully uploaded {},{} to {}, {} rows".format(gcsEntity.dataId, gcsEntity.version, table_id, destination_table.num_rows))
        return True
    except Exception as e:
        print("Failed uploading {},{} to {} \n URI: {} \n".format(gcsEntity.dataId, gcsEntity.version, table_id, uri))
        print(f"error: {e}")
        return False
The function that ties the two together:
def runBigQueryUpload(gcsEntity):
    # Get name of the individual table
    individualTable = BIGQUERYDATASET + "." + gcsEntity.bucketName.replace("gcsBucket-", "")
    # Only continue if no entry exists so far in the individual BigQuery table
    noExceptionCaught = True
    if existInTable(individualTable, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(individualTable, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False
    # Upload to main table
    if existInTable(BIGQUERYTABLEID, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(BIGQUERYTABLEID, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False
    # If any try block failed, return False
    if not noExceptionCaught:
        print("Failed in upload process {} from {}".format(gcsEntity.dataId, gcsEntity.bucketName))
        return 500
    return 200
Hope this might help somebody else...
So I had to rethink the architecture completely.
There were three steps I needed to take:
1. Do the long processing on the files (done with PubSub, on a per-file basis, so that Cloud Run can ack back in time).
2. Batch-process all files in GCS into BQ in one go using write_disposition="WRITE_TRUNCATE"; this overwrites the table and ensures no duplicate entries.
3. Pass a wildcard character in the URI to batch-process multiple files. I called this once the entire batch was in GCS, as in the snippet below.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=getSchema(),
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_TRUNCATE"
)
uri = "gs://{}/subfolder/*file.json".format(bucketName)  # note the wildcard
load_job = client.load_table_from_uri(
    uri,
    tableId,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)  # Make an API request.
Notes: the old code looped through each file in GCS and called load_table_from_uri each time (not batch processing, and it wastes quota).
It turns out two separate Cloud Run instances both checked that a file didn't exist in BQ and both uploaded it at the same time (that's why I needed batch processing).
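One alternative to guard against that race, sketched here with a hypothetical load_once helper (my own naming, not part of the final pipeline): derive a deterministic job ID from the file URI, so a second concurrent load of the same file fails with a Conflict instead of appending a duplicate.

import hashlib

from google.api_core.exceptions import Conflict

def load_once(client, uri, table_id, job_config):
    """Load a file with a job ID derived from the URI; a duplicate attempt raises Conflict."""
    job_id = "load-" + hashlib.sha1(uri.encode("utf-8")).hexdigest()
    try:
        job = client.load_table_from_uri(uri, table_id, job_id=job_id, job_config=job_config)
        job.result()
        return True
    except Conflict:
        # Another instance already started a load job with this ID for the same file.
        return False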
I have a Lambda function that reads a CSV file and adds each row to a DynamoDB table. I am using a print statement to print every row in the CSV to the logs in CloudWatch.
There is a problem here, as only 51 of the 129 rows are being printed.
Also, only a small number of the rows that are printed are actually being added to the DynamoDB tables.
Lambda Function:
# ChronojumpDataProcessor Lambda function
#
# This function is triggered by an object being created in an Amazon S3 bucket.
# The file is downloaded and each line is inserted into DynamoDB tables.
from __future__ import print_function

import json, urllib, boto3, csv

# Connect to S3 and DynamoDB
s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')

# Connect to the DynamoDB tables
athleteTable = dynamodb.Table('Athlete');
countermovementTable = dynamodb.Table('CMJ');
depthTable = dynamodb.Table('DepthJump');

# This handler is executed every time the Lambda function is triggered
def lambda_handler(event, context):
    # Show the incoming event in the debug log
    #print("Event received by Lambda function: " + json.dumps(event, indent=2))

    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    localFilename = '/tmp/session.csv'

    # Download the file from S3 to the local filesystem
    try:
        s3.meta.client.download_file(bucket, key, localFilename)
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

    # Read the Session CSV file. Delimiter is the ',' character
    with open(localFilename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')

        # Read each row in the file
        rowCount = 0
        for row in reader:
            rowCount += 1

            # Show the row in the debug log
            print(row['athlete_id'], row['athlete_name'], row['jump_id'], row['date_time'], row['jump_type'], row['jump_tc'], row['jump_height'], row['jump_RSI'])

            # Insert Athlete ID and Name into Athlete DynamoDB table
            athleteTable.put_item(
                Item={
                    'AthleteID': row['athlete_id'],
                    'AthleteName': row['athlete_name']})

            # Insert CMJ details into Countermovement Jump DynamoDB table
            if ((row['jump_type'] == "CMJ") | (row['jump_type'] == "Free")):
                countermovementTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'Height': row['jump_height']})
            else:
                # Insert Depth Jump details into Depth Jump DynamoDB table
                depthTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'ContactTime': row['jump_tc'],
                        'Height': row['jump_height'],
                        'RSI': row['jump_RSI']})

                # Finished!
                return "%d data inserted" % rowCount
I set the Lambda function timeout to 2 minutes, as I thought there might not be enough time for the function to read every row, but that did not fix the problem.
Your return statement is indented under the else, which means that the function will exit as soon as the if evaluates to False.
It should be indented to match the indentation used on the with line.
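To illustrate (a sketch, not part of the original answer, with a hypothetical process_file helper standing in for the body of lambda_handler): the return belongs outside the for loop, and DynamoDB's batch_writer can optionally cut down the number of write calls.

import csv

import boto3

dynamodb = boto3.resource('dynamodb')
athleteTable = dynamodb.Table('Athlete')

def process_file(localFilename):
    """Sketch of the corrected loop: the return sits outside the for loop."""
    with open(localFilename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        rowCount = 0
        # batch_writer buffers put_item calls and flushes them in batches
        with athleteTable.batch_writer() as athlete_batch:
            for row in reader:
                rowCount += 1
                athlete_batch.put_item(
                    Item={'AthleteID': row['athlete_id'],
                          'AthleteName': row['athlete_name']})
                # ...write the CMJ / depth-jump items here as before...
    # Return only after the whole file has been processed
    return "%d data inserted" % rowCount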
We receive one CSV file every day at 11am from our vendor in an S3 bucket.
I convert this file into Parquet format using Glue at 11:30am.
I've enabled the job bookmark so that already-processed files are not processed again.
Nonetheless, I see some files being reprocessed, which creates duplicates.
I read these questions and answers: AWS Glue Bookmark produces duplicates for PARQUET and AWS Glue Job Bookmarking explanation.
They gave a good understanding of job bookmarking, but still do not address the issue.
The AWS documentation says it supports CSV files for bookmarking: AWS documentation.
I'm wondering if someone can help me understand what the problem could be, and a possible solution as well :)
Edit:
Pasting sample code here as requested by Prabhakar.
import os

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import ResolveChoice
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

staging_database_name = "my-glue-db"
s3_target_path = "s3://mybucket/mydata/"

"""
'date_index': date location in the file name
'date_only': only date column is inserted
'date_format': format of date
'path': sub folder name in master bucket
"""
# fouo classified files
tables_spec = {
    'sample_table': {'path': 'sample_table/load_date=', 'pkey': 'mykey', 'orderkey': 'myorderkey'}
}

spark_conf = SparkConf().setAll([
    ("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
    ("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_id)
])
sc = SparkContext(conf=spark_conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

for table_name, spec in tables_spec.items():
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database=staging_database_name,
                                                                table_name=table_name,
                                                                transformation_ctx='datasource0')
    resolvechoice2 = ResolveChoice.apply(frame=datasource0, choice="make_struct", transformation_ctx='resolvechoice2')

    # Create a Spark data frame with an ingest_datetime column
    delta_df = resolvechoice2.toDF().withColumn('ingest_datetime', lit(str(ingest_datetime)))
    date_dyf = DynamicFrame.fromDF(delta_df, glueContext, "date_dyf")

    master_folder_path1 = os.path.join(s3_target_path, spec['path']).replace('\\', '/')
    master_folder_path = master_folder_path1 + load_date

    datasink4 = glueContext.write_dynamic_frame.from_options(frame=date_dyf,
                                                             connection_type='s3',
                                                             connection_options={"path": master_folder_path},
                                                             format='parquet', transformation_ctx='datasink4')

job.commit()
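For reference while debugging (my own addition, not part of the job itself): the bookmark state recorded for a job can be inspected or reset with boto3, which helps confirm whether the bookmark is actually advancing between runs. The job name below is a placeholder.

import boto3

glue = boto3.client('glue')

# Inspect the bookmark state recorded for the job (placeholder job name)
bookmark = glue.get_job_bookmark(JobName='my-csv-to-parquet-job')
print(bookmark['JobBookmarkEntry'])

# If needed, reset the bookmark so the next run reprocesses everything
# glue.reset_job_bookmark(JobName='my-csv-to-parquet-job')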
I spoke to an AWS Support engineer, and she mentioned that she was able to reproduce the issue and has raised it with the Glue technical team for resolution.
Nonetheless, I couldn't wait for them to fix the bug and took a different approach.
Solution:
Disable the Glue bookmark.
After the Glue job converts the CSV file to Parquet, move the CSV file to a different location in the S3 bucket (a sketch of this move step follows below).
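A minimal sketch of that move step with boto3 (the helper name, bucket, and prefixes below are placeholders, not the ones from my job):

import boto3

s3 = boto3.resource('s3')

def archive_csv(bucket_name, key, processed_prefix='processed/'):
    """Copy a processed CSV to an archive prefix, then delete the original."""
    dest_key = processed_prefix + key.split('/')[-1]
    s3.Object(bucket_name, dest_key).copy_from(
        CopySource={'Bucket': bucket_name, 'Key': key})
    s3.Object(bucket_name, key).delete()

# Example: called after job.commit() for each CSV the job just converted
# archive_csv('mybucket', 'mydata/incoming/vendor_file_2023-01-23.csv')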
I have about 200k CSVs (all with the same schema). I wrote a Cloud Function to insert them into BigQuery, such that as soon as I copy a CSV to a bucket, the function is executed and the data is loaded into the BigQuery dataset.
I basically used the same code as in the documentation.
import logging

from google.cloud import bigquery

bigquery_client = bigquery.Client()

dataset_id = 'my_dataset'  # replace with your dataset ID
table_id = 'my_table'      # replace with your table ID
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
table = bigquery_client.get_table(table_ref)  # API request

def bigquery_csv(data, context):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])

    errors = bigquery_client.load_table_from_uri(uri,
                                                 table_ref,
                                                 job_config=job_config)  # API request
    logging.info(errors)
    #print('Starting job {}'.format(load_job.job_id))
    # load_job.result()  # Waits for table load to complete.
    logging.info('Job finished.')

    destination_table = bigquery_client.get_table(table_ref)
    logging.info('Loaded {} rows.'.format(destination_table.num_rows))
However, when I copied all the CSVs to the bucket (about 43 TB in total), not all of the data was added to BigQuery; only about 500 GB was inserted.
I can't figure out what's wrong. No insert jobs are shown in Stackdriver Logging and no functions are running once the copy job is complete.
However, when I copied all the CSVs to the bucket (about 43 TB in total), not all of the data was added to BigQuery; only about 500 GB was inserted.
You are hitting the BigQuery load limits defined in this link.
You should split your files into smaller ones and the upload will work.
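As a side note (not part of the original answer): the function above never waits on the load job, so load and quota errors are dropped silently. A sketch of surfacing them, with a hypothetical load_and_report helper, so failures show up in the logs:

import logging

def load_and_report(client, uri, table_ref, job_config):
    """Start the load job, wait for it, and log any errors it reports."""
    load_job = client.load_table_from_uri(uri, table_ref, job_config=job_config)
    try:
        load_job.result()  # Blocks until the load job completes or fails
        logging.info('Loaded %s', uri)
    except Exception:
        # On failure (including quota errors), log the per-job error details
        logging.error('Load job %s failed: %s', load_job.job_id, load_job.errors)
        raise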