Cloud Run: Google Cloud Storage to BigQuery duplicate uploads - google-cloud-platform

I have a Cloud Run instance that receives a list of files from Cloud Storage, checks whether each item exists in BigQuery, and uploads it if it doesn't exist yet.
The pipeline structure is: Cloud Function (get list of files from GCS) > PubSub > Cloud Run > BigQuery.
I can't tell where the problem lies, but I suspect it's with PubSub's at-least-once delivery. I've set the Acknowledgement deadline to 300 seconds and the Retry Policy to "Retry after exponential backoff delay".
What I expect is to call the Function, get a list of files from GCS, have the pipeline triggered, and see a 1:1 match between files in GCS and entries in BQ.
My question is: which GCP services should I be using to upload JSON files from GCS to BQ? Dataflow? I ask because this seems to be an issue between the Cloud Run instances.
Relevant code in Cloud Run instance
Check if file exists
def existInTable(table_id, dataId):
    try:
        client = bigquery.Client(project=PROJECT)
        tableref = PROJECT + "." + table_id
        sql = """
        SELECT version, dataId
        FROM `{}`
        WHERE version LIKE "{}" AND dataId LIKE "{}"
        """.format(tableref,__version__.__version__,dataId)
        query_job = client.query(sql)
        results = query_job.result()
        print("Found {} Entries for dataid {} and version {} in {}".format(results.total_rows,dataId,__version__.__version__,table_id))
        if results.total_rows>0:
            print("Skip upload")
            return True
        return False
    except Exception as e:
        print("Failed to find entries in table {}".format(table_id))
        print(f"error: {e}")
        return False
upload
def uploadCloudStorageFileToBigquery(table_id,gcsEntity):
    client = bigquery.Client(project=PROJECT)
    job_config = bigquery.LoadJobConfig(
        schema= getSchema(),
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    # Report Path
    uri = "gs://{}/{}/{}file.json".format(gcsEntity.bucketName,gcsEntity.version,gcsEntity.dataId)
    # Start upload process
    try:
        load_job = client.load_table_from_uri(
            uri,
            table_id,
            location="US", # Must match the destination dataset location.
            job_config=job_config,
        ) # Make an API request.
        load_job.result() # Waits for the job to complete.
        destination_table = client.get_table(table_id)
        print("Successfully uploaded {},{} to {}, {} rows".format(gcsEntity.dataId,gcsEntity.version,table_id,destination_table.num_rows))
        return True
    except Exception as e:
        print("Failed uploading {},{} to {} \n URI: {} \n".format(gcsEntity.dataId,gcsEntity.version,table_id,uri))
        print(f"error: {e}")
        return False
function that ties the two together
def runBigQueryUpload(gcsEntity):
    # get Name for individual Table
    individualTable = BIGQUERYDATASET + "." + gcsEntity.bucketName.replace("gcsBucket-","")
    # Only continue if no entry exists so far in Individual BigQuery Table
    noExceptionCaught = True
    if existInTable(individualTable,gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId,gcsEntity.bucketName,individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(individualTable,gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable,gcsEntity.dataId ))
            print(f"error: {e}")
            noExceptionCaught = False
    # Upload to main table
    if existInTable(BIGQUERYTABLEID,gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId,gcsEntity.bucketName,individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(BIGQUERYTABLEID,gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable,gcsEntity.dataId ))
            print(f"error: {e}")
            noExceptionCaught = False
    # If any try block failed, return False
    if not noExceptionCaught:
        print("Failed in upload process {} from {}".format(gcsEntity.dataId, gcsEntity.bucketName))
        return 500
    return 200

Hope this might help somebody else...
So I had to rethink the architecture completely.
There were three steps I needed to do:
1. Do some long processing on the files (done with PubSub, on a per-file basis, so that Cloud Run can ack in time).
2. Batch process all files from GCS to BQ in one go using write_disposition="WRITE_TRUNCATE", which overwrites the table and ensures there are no duplicate entries.
3. Pass a wildcard character in the URI to batch process multiple files. I called this once the entire batch was in GCS.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema= getSchema(),
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_TRUNCATE"
)
uri = "gs://{}/subfolder/*file.json".format(bucketName) #note wildcard
load_job = client.load_table_from_uri(
    uri,
    tableId,
    location="US", # Must match the destination dataset location.
    job_config=job_config,
) # Make an API request.
Notes: the old code looped through each file in GCS and called load_table_from_uri each time (no batch processing, and it wastes quota).
It turned out that two separate Cloud Run instances both checked that a file didn't exist in BQ and then both uploaded it at the same time (that's why I needed batch processing).
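If per-file loads are still needed, one possible way to avoid the check-then-upload race described above is to give each load job a deterministic job ID, so that a second attempt for the same file is rejected by BigQuery instead of loading the rows again. The sketch below is only an illustration under assumptions: the helper names deterministic_job_id and load_once are hypothetical, and it relies on load_table_from_uri accepting a job_id argument and on a reused job ID surfacing as a Conflict (HTTP 409) error.

```python
import hashlib
import re


def deterministic_job_id(table_id: str, version: str, data_id: str) -> str:
    """Build the same BigQuery job ID every time for the same file and table."""
    raw = f"{table_id}|{version}|{data_id}"
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]
    # Job IDs may only contain letters, digits, underscores and dashes.
    prefix = re.sub(r"[^A-Za-z0-9_-]", "_", table_id)[:64]
    return f"load_{prefix}_{digest}"


def load_once(client, uri, table_id, version, data_id, job_config=None) -> bool:
    """Start a load job; return False if a job with the same ID already exists."""
    # Imported lazily so the helper above works without the client library.
    from google.api_core.exceptions import Conflict

    job_id = deterministic_job_id(table_id, version, data_id)
    try:
        load_job = client.load_table_from_uri(
            uri, table_id, job_id=job_id, job_config=job_config
        )
    except Conflict:
        # Another instance (or a redelivered message) already created this job.
        return False
    load_job.result()  # Waits for the job to complete.
    return True
```

One caveat with this design: a job ID stays taken even if the first job failed, so a retry after a genuine failure would need a different ID (for example, by adding an attempt counter to the hashed string).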

Related

How to download new uploaded files from s3 to ec2 everytime

I have an S3 bucket which will receive new files throughout the day. I want to download these to my EC2 instance every time a new file is uploaded to the bucket.
I have read that it's possible using SQS, SNS, or Lambda. Which is the easiest of them all? I need the file to be downloaded as soon as possible once it is uploaded to the bucket.
EDIT
I will basically be getting PNG images in the bucket every few seconds or minutes. Every time a new image is uploaded, I want to download it to the instance, which is already running. I will do some AI processing. As the images will keep coming into the bucket, I want to keep downloading them to the EC2 instance and process them as soon as possible.
This is my code in the Lambda function so far.
import boto3
import json
def lambda_handler(event, context):
    """Read file from s3 on trigger."""
    #print(event)
    s3 = boto3.client("s3")
    client = boto3.client("ec2")
    ssm = boto3.client("ssm")
    instanceid = "******"
    if event:
        file_obj = event["Records"][0]
        #print(file_obj)
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        print(bucketname)
        filename = str(file_obj["s3"]["object"]["key"])
        print(filename)
        response = ssm.send_command(
            InstanceIds=[instanceid],
            DocumentName="AWS-RunShellScript",
            Parameters={
                "commands": [f"aws s3 cp {filename} ."]
            }, # replace command_to_be_executed with command
        )
        # fetching command id for the output
        command_id = response["Command"]["CommandId"]
        time.sleep(3)
        # fetching command output
        output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
        print(output)
    return
However I am getting the following error
Test Event Name
test
Response
{
"errorMessage": "2021-12-01T14:11:30.781Z 88dbe51b-53d6-4c06-8c16-207698b3a936 Task timed out after 3.00 seconds"
}
Function Logs
START RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936 Version: $LATEST
END RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936
REPORT RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936 Duration: 3003.58 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 87 MB Init Duration: 314.81 ms
2021-12-01T14:11:30.781Z 88dbe51b-53d6-4c06-8c16-207698b3a936 Task timed out after 3.00 seconds
Request ID
88dbe51b-53d6-4c06-8c16-207698b3a936
When I remove all the lines related to ssm, it works fine. Is there any permission issue or is there any problem with the code?
EDIT2
My code is working, but I don't see any output or change on my EC2 instance. I should be seeing an empty text file in the home directory, but I don't see anything.
Code
import boto3
import json
import time
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
    """Read file from s3 on trigger."""
    #print(event)
    s3 = boto3.client("s3")
    client = boto3.client("ec2")
    ssm = boto3.client("ssm")
    instanceid = "******"
    print("HI")
    if event:
        file_obj = event["Records"][0]
        #print(file_obj)
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        print(bucketname)
        filename = str(file_obj["s3"]["object"]["key"])
        print(filename)
        print("sending")
        try:
            response = ssm.send_command(
                InstanceIds=[instanceid],
                DocumentName="AWS-RunShellScript",
                Parameters={
                    "commands": ["touch hi.txt"]
                }, # replace command_to_be_executed with command
            )
            # fetching command id for the output
            command_id = response["Command"]["CommandId"]
            time.sleep(3)
            # fetching command output
            output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
            print(output)
        except Exception as e:
            logger.error(e)
            raise e
There are several ways. One would be to set up S3 notifications to invoke a Lambda function. The Lambda function would then use SSM Run Command to execute an AWS CLI S3 command on your instance to download the file from S3.
I don't know why there is any recommendation of Lambda here. What you need is simple: S3 object created event notification -> SQS and some job on your EC2 instance watching a long polling queue.
Here is an example of such a python script. You need to sort out how the object key is encoded in the event, but it will be there. I haven't tested this, but it should be pretty close.
import boto3
def main() -> None:
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    while True:
        res = sqs.receive_message(
            QueueUrl="yourQueue",
            WaitTimeSeconds=20,
        )
        for msg in res.get("Messages", []):
            s3.download_file("yourBucket", msg["key"], "local/file/path")
if __name__ == "__main__":
    main()
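As a rough illustration of the part the answer leaves open (how the object key is encoded in the event), the sketch below pulls the bucket and key out of an SQS message body. It assumes the bucket notification is sent directly to SQS, so the body is the S3 event JSON itself; with SNS in between, the event would be wrapped in another envelope. The function name extract_s3_objects is hypothetical.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from the body of an S3 event notification."""
    event = json.loads(message_body)
    objects = []
    # S3 test events ("s3:TestEvent") carry no "Records" list, hence the default.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

In the polling loop this would replace msg["key"] with something like: for bucket, key in extract_s3_objects(msg["Body"]). After a successful download, the message should also be removed with sqs.delete_message(QueueUrl=..., ReceiptHandle=msg["ReceiptHandle"]), otherwise it becomes visible again and is processed a second time.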
You can use S3 Event Notifications, which react to a new file arriving in the S3 bucket.
The destinations supported by S3 events are SNS, SQS, or AWS Lambda.
You can use Lambda directly as the destination, as described by @Marcin.
You can use SQS as a queue with a Lambda behind it pulling from the queue. This gives you capabilities such as a dead-letter queue. You can then pull messages from the queue using different methods:
AWS CLI
AWS SDK
You can use SNS with different things behind it (you can have many of these destinations at once, which is the fan-out pattern):
an SQS queue to manage the files
an email to notify
a Lambda function
...
You can find more explanation in this article: https://aws.plainenglish.io/system-design-s3-events-to-lambda-vs-s3-events-to-sqs-sns-to-lambda-2d41477d1cc9

Rows not being found in Lambda function reading CSV file

I have a Lambda function that is reading a CSV file, and each row is added to a DynamoDB table. I am using a print statement to print every row in the CSV to logs in CloudWatch.
There is a problem here as only 51 of the 129 rows are being printed.
Also, only a small amount of the rows that are actually found are being added to the DynamoDB tables.
Lambda Function:
# ChronojumpDataProcessor Lambda function
#
# This function is triggered by an object being created in an Amazon S3 bucket.
# The file is downloaded and each line is inserted into DynamoDB tables.
from __future__ import print_function
import json, urllib, boto3, csv
# Connect to S3 and DynamoDB
s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')
# Connect to the DynamoDB tables
athleteTable = dynamodb.Table('Athlete');
countermovementTable = dynamodb.Table('CMJ');
depthTable = dynamodb.Table('DepthJump');
# This handler is executed every time the Lambda function is triggered
def lambda_handler(event, context):
    # Show the incoming event in the debug log
    #print("Event received by Lambda function: " + json.dumps(event, indent=2))
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    localFilename = '/tmp/session.csv'
    # Download the file from S3 to the local filesystem
    try:
        s3.meta.client.download_file(bucket, key, localFilename)
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
    # Read the Session CSV file. Delimiter is the ',' character
    with open(localFilename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        # Read each row in the file
        rowCount = 0
        for row in reader:
            rowCount += 1
            # Show the row in the debug log
            print(row['athlete_id'], row['athlete_name'], row['jump_id'], row['date_time'], row['jump_type'], row['jump_tc'], row['jump_height'], row['jump_RSI'])
            # Insert Athlete ID and Name into Athlete DynamoDB table
            athleteTable.put_item(
                Item={
                    'AthleteID': row['athlete_id'],
                    'AthleteName': row['athlete_name']})
            # Insert CMJ details into Countermovement Jump DynamoDB table
            if ((row['jump_type'] == "CMJ") | (row['jump_type'] == "Free")) :
                countermovementTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'Height': row['jump_height']})
            else :
                # Insert Depth Jump details into Depth Jump DynamoDB table
                depthTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'ContactTime': row['jump_tc'],
                        'Height': row['jump_height'],
                        'RSI': row['jump_RSI']})
                # Finished!
                return "%d data inserted" % rowCount
I added a Timeout to the Lambda function of 2 minutes as I thought that maybe there wasn't enough time provided for the function to read each row, but that did not fix the problem.
Your return statement is indented under the else, which means that the function will exit as soon as the if evaluates to False.
It should be indented to match the indent used on the with line.
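The effect of the misplaced return can be reproduced without S3 or DynamoDB. The sketch below is only an illustration: the function names and the sample data are made up, and the table writes are replaced by placeholders.

```python
import csv
import io


def count_rows_buggy(csv_text: str) -> int:
    """Return inside the else: stops at the first row that is not CMJ/Free."""
    row_count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_count += 1
        if row["jump_type"] in ("CMJ", "Free"):
            pass  # the CMJ table write would go here
        else:
            return row_count  # exits the whole function early
    return row_count


def count_rows_fixed(csv_text: str) -> int:
    """Return after the loop: every row is processed."""
    row_count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_count += 1
    return row_count


SAMPLE = "athlete_id,jump_type\n1,CMJ\n2,Drop\n3,Free\n"
```

With this sample, the buggy version stops at the second row (the first one that takes the else branch), while the fixed version reads all three.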

Change output CSV file name of AWS Athena queries

I want to run my Athena query through AWS Lambda, and also change the name of my output CSV file from the Query Execution ID to my-bucket/folder/my-preferred-string.csv
I tried searching the web, but couldn't find the exact code for the Lambda function.
I am a data scientist and a beginner with AWS. This is a one-time thing for me, so I'm looking for a quick solution or a patch.
This question is already posted here
import time
import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")
# Run query
queryStart = client.start_query_execution(
    # PUT_YOUR_QUERY_HERE
    QueryString = '''
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext = {
        # YOUR_ATHENA_DATABASE_NAME
        'Database': "covid_data"
    },
    ResultConfiguration = {
        # query result output location you mentioned in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)
# Starts the query and waits 3 seconds
queryId = queryStart['QueryExecutionId']
time.sleep(3)
# Copies newly generated csv file with appropriate name
# query result output location you mentioned in AWS Athena
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"
# Destination location and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource = queryLoc)
# Deletes the Athena-generated csv and its metadata file from the output location
# (a boto3 S3 resource has no delete_object method, so delete through Object)
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv").delete()
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv.metadata").delete()
print('report-2018.csv generated')
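The fixed 3 second sleep above only works if the query happens to finish in that time. A more robust option is to poll the query state before copying the result. The following is a minimal sketch, assuming the Athena client's get_query_execution call and its QueryExecution.Status.State field; the function name wait_for_query and its parameters are hypothetical.

```python
import time


def wait_for_query(client, query_id: str, poll_seconds: float = 1.0, max_polls: int = 60) -> str:
    """Poll Athena until the query reaches a terminal state and return that state."""
    for _ in range(max_polls):
        response = client.get_query_execution(QueryExecutionId=query_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Query {query_id} did not finish after {max_polls} polls")
```

It would be called as wait_for_query(client, queryId) in place of time.sleep(3), and the copy should only run when the returned state is "SUCCEEDED". Inside Lambda, the function timeout has to be long enough to cover the polling.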

BigQuery load job does not insert all data

I have about 200k CSVs (all with the same schema). I wrote a Cloud Function to insert them into BigQuery, so that as soon as I copy a CSV to a bucket, the function is executed and the data is loaded into the BigQuery dataset.
I basically used the same code as in the documentation.
dataset_id = 'my_dataset' # replace with your dataset ID
table_id = 'my_table' # replace with your table ID
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
table = bigquery_client.get_table(table_ref) # API request
def bigquery_csv(data, context):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])
    errors = bigquery_client.load_table_from_uri(uri,
                                                 table_ref,
                                                 job_config=job_config) # API request
    logging.info(errors)
    #print('Starting job {}'.format(load_job.job_id))
    # load_job.result() # Waits for table load to complete.
    logging.info('Job finished.')
    destination_table = bigquery_client.get_table(table_ref)
    logging.info('Loaded {} rows.'.format(destination_table.num_rows))
However, when I copied all the CSVs to the bucket (about 43 TB in total), not all of the data was added to BigQuery; only about 500 GB was inserted.
I can't figure out what's wrong. No insert jobs are shown in Stackdriver Logging, and no functions are running once the copy job is complete.
You are hitting a BigQuery load limit, as defined in this link.
You should split your files into smaller files, and the upload will work.

Error when using continuation token on S3 download

I'm trying to download a large number of small files from an S3 bucket. I'm doing this with the following:
s3 = boto3.client('s3')
kwargs = {'Bucket': bucket}
with open('/Users/hr/Desktop/s3_backup/files.csv','w') as file:
    while True:
        # The S3 API response is a large blob of metadata.
        # 'Contents' contains information about the listed objects.
        resp = s3.list_objects_v2(**kwargs)
        try:
            contents = resp['Contents']
        except KeyError:
            # No objects were returned, so stop listing.
            break
        for obj in contents:
            key = obj['Key']
            file.write(key)
            file.write('\n')
        # The S3 API is paginated, returning up to 1000 keys at a time.
        # Pass the continuation token into the next request, until we
        # reach the final page (when this field is missing).
        try:
            kwargs['ContinuationToken'] = resp['NextContinuationToken']
        except KeyError:
            break
However, after a certain amount of time I received this error message: 'EndpointConnectionError: Could not connect to the endpoint URL'.
I know that there are still considerably more files in the S3 bucket. I have three questions:
Why is this error occurring when I haven't downloaded all files in the bucket?
Is there a way to start my code from the last file I downloaded from the S3 bucket? (I don't want to have to re-download the file names I've already downloaded.)
Is there a default ordering of the S3 bucket? Is it alphabetical?
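On the second and third questions, one possible approach (a sketch, not a tested solution) relies on two things: list_objects_v2 on a general purpose bucket returns keys in UTF-8 binary order, and it accepts a StartAfter parameter. That makes it possible to resume the listing from the last key already written to the file. The helper names last_listed_key and resume_kwargs below are hypothetical.

```python
from typing import Optional


def last_listed_key(path: str) -> Optional[str]:
    """Return the last non-empty line of the key file, or None if there is none."""
    last = None
    try:
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    last = line
    except FileNotFoundError:
        return None
    return last


def resume_kwargs(bucket: str, path: str) -> dict:
    """Build list_objects_v2 arguments that continue after the last saved key."""
    kwargs = {"Bucket": bucket}
    last = last_listed_key(path)
    if last is not None:
        # StartAfter makes S3 begin listing with the first key after this one.
        kwargs["StartAfter"] = last
    return kwargs
```

To use it, kwargs = {'Bucket': bucket} would become kwargs = resume_kwargs(bucket, path), and the output file would be opened in append mode ('a') instead of 'w' so the keys saved earlier are kept.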