Rows not being found in Lambda function reading CSV file - amazon-web-services

I have a Lambda function that is reading a CSV file, and each row is added to a DynamoDB table. I am using a print statement to print every row in the CSV to logs in CloudWatch.
The problem is that only 51 of the 129 rows are being printed.
Also, only a small number of the rows that are found are actually being added to the DynamoDB tables.
Lambda Function:
# ChronojumpDataProcessor Lambda function
#
# This function is triggered by an object being created in an Amazon S3 bucket.
# The file is downloaded and each line is inserted into DynamoDB tables.

from __future__ import print_function

import json, urllib, boto3, csv

# Connect to S3 and DynamoDB
s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')

# Connect to the DynamoDB tables
athleteTable = dynamodb.Table('Athlete');
countermovementTable = dynamodb.Table('CMJ');
depthTable = dynamodb.Table('DepthJump');

# This handler is executed every time the Lambda function is triggered
def lambda_handler(event, context):

    # Show the incoming event in the debug log
    #print("Event received by Lambda function: " + json.dumps(event, indent=2))

    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    localFilename = '/tmp/session.csv'

    # Download the file from S3 to the local filesystem
    try:
        s3.meta.client.download_file(bucket, key, localFilename)
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

    # Read the Session CSV file. Delimiter is the ',' character
    with open(localFilename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')

        # Read each row in the file
        rowCount = 0
        for row in reader:
            rowCount += 1

            # Show the row in the debug log
            print(row['athlete_id'], row['athlete_name'], row['jump_id'], row['date_time'], row['jump_type'], row['jump_tc'], row['jump_height'], row['jump_RSI'])

            # Insert Athlete ID and Name into Athlete DynamoDB table
            athleteTable.put_item(
                Item={
                    'AthleteID': row['athlete_id'],
                    'AthleteName': row['athlete_name']})

            # Insert CMJ details into Countermovement Jump DynamoDB table
            if ((row['jump_type'] == "CMJ") | (row['jump_type'] == "Free")) :
                countermovementTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'Height': row['jump_height']})
            else :
                # Insert Depth Jump details into Depth Jump DynamoDB table
                depthTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'ContactTime': row['jump_tc'],
                        'Height': row['jump_height'],
                        'RSI': row['jump_RSI']})

                # Finished!
                return "%d data inserted" % rowCount
I added a timeout of 2 minutes to the Lambda function, as I thought there might not be enough time for the function to read every row, but that did not fix the problem.

Your return statement is indented under the else, which means the function will exit as soon as the if evaluates to False, i.e. on the first row that is not a CMJ or Free jump.
It should be indented to match the indentation used on the with line, so it only runs after the loop has processed every row.
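A minimal sketch of the corrected tail of the handler (only the structure around the loop is shown; the per-row put_item calls are unchanged from the question):

    # Read the Session CSV file. Delimiter is the ',' character
    with open(localFilename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')

        # Read each row in the file
        rowCount = 0
        for row in reader:
            rowCount += 1
            # ... print the row and put_item into the appropriate table ...

    # Finished! Dedented to the same level as the with statement,
    # so it only runs once every row has been read and inserted.
    return "%d data inserted" % rowCount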

Related

Cloud Run: Google Cloud Storage to BigQuery duplicate uploads

I have a Cloud Run instance that receives a list of files from Cloud Storage, checks whether each item exists in BigQuery, and uploads it if it doesn't exist yet.
The pipeline structure is: Cloud Function (get list of files from GCS) > PubSub > Cloud Run > BigQuery.
I can't tell where the problem lies, but I suspect it's with Pub/Sub's at-least-once delivery. I've set the Acknowledgement deadline to 300 seconds and the Retry Policy to "Retry after exponential backoff delay".
What I expect is to call the Function, get a list of files from GCS, have the pipeline get triggered, and see a 1:1 correspondence between files in GCS and entries in BQ.
My question is: which GCP services should I be using to upload JSON files from GCS to BQ? Dataflow? I ask because the problem seems to be something between the Cloud Run instances.
Relevant code in Cloud Run instance
Check if file exists
def existInTable(table_id, dataId):
    try:
        client = bigquery.Client(project=PROJECT)
        tableref = PROJECT + "." + table_id
        sql = """
            SELECT version, dataId
            FROM `{}`
            WHERE version LIKE "{}" AND dataId LIKE "{}"
        """.format(tableref, __version__.__version__, dataId)
        query_job = client.query(sql)
        results = query_job.result()
        print("Found {} Entries for dataid {} and version {} in {}".format(results.total_rows, dataId, __version__.__version__, table_id))
        if results.total_rows > 0:
            print("Skip upload")
            return True
        return False
    except Exception as e:
        print("Failed to find entries in table {}".format(table_id))
        print(f"error: {e}")
        return False
upload
def uploadCloudStorageFileToBigquery(table_id, gcsEntity):
    client = bigquery.Client(project=PROJECT)
    job_config = bigquery.LoadJobConfig(
        schema=getSchema(),
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    # Report Path
    uri = "gs://{}/{}/{}file.json".format(gcsEntity.bucketName, gcsEntity.version, gcsEntity.dataId)
    # Start upload process
    try:
        load_job = client.load_table_from_uri(
            uri,
            table_id,
            location="US",  # Must match the destination dataset location.
            job_config=job_config,
        )  # Make an API request.
        load_job.result()  # Waits for the job to complete.
        destination_table = client.get_table(table_id)
        print("Successfully uploaded {},{} to {}, {} rows".format(gcsEntity.dataId, gcsEntity.version, table_id, destination_table.num_rows))
        return True
    except Exception as e:
        print("Failed uploading {},{} to {} \n URI: {} \n".format(gcsEntity.dataId, gcsEntity.version, table_id, uri))
        print(f"error: {e}")
        return False
function that ties the two together
def runBigQueryUpload(gcsEntity):
    # get Name for individual Table
    individualTable = BIGQUERYDATASET + "." + gcsEntity.bucketName.replace("gcsBucket-", "")

    # Only continue if no entry exists so far in Individual BigQuery Table
    noExceptionCaught = True
    if existInTable(individualTable, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(individualTable, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False

    # Upload to main table
    if existInTable(BIGQUERYTABLEID, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(BIGQUERYTABLEID, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False

    # If any try block failed, return False
    if not noExceptionCaught:
        print("Failed in upload process {} from {}".format(gcsEntity.dataId, gcsEntity.bucketName))
        return 500
    return 200
Hope this might help somebody else...
So I had to rethink the architecture completely. There were three steps I needed to take:
1. Do some long processing on the files (done with Pub/Sub, on a per-file basis, so that Cloud Run can ack back in time).
2. Batch process all files in GCS to BQ in one go using write_disposition="WRITE_TRUNCATE"; this overwrites the table and ensured no duplicate entries.
3. Pass a wildcard character in the URI to batch process multiple files, as shown below. I called this once the entire batch was in GCS.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=getSchema(),
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_TRUNCATE"
)

uri = "gs://{}/subfolder/*file.json".format(bucketName)  # note wildcard

load_job = client.load_table_from_uri(
    uri,
    tableId,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)  # Make an API request.
Notes: the old code actually looped through each file in GCS and called load_table_from_uri each time (not batch processing, and it wastes quota).
It turned out two separate Cloud Run instances both checked that a file didn't exist in BQ and both uploaded it at the same time (that's why I needed batch processing).

How to Copy Large Files From AWS S3 bucket to another S3 buckets using boto3 Python API?

How to copy large files from one AWS S3 bucket to another using the boto3 Python API? If we use client.copy(), it fails with the error "An error occurred (InvalidArgument) when calling the UploadPartCopy operation: Range specified is not valid for source object of size:"
As per the AWS S3 boto3 API documentation, we should use multipart upload. I googled it but could not find a clear, precise answer to my question. Finally, after reading the boto3 APIs thoroughly, I found the answer. Here it is. This code also works perfectly with multi-threading.
Create an s3_client in each thread if you use multi-threading. I tested this method; it works perfectly, copying terabytes of data from one S3 bucket to a different S3 bucket.
Code to get s3_client
def get_session_client():
    # session = boto3.session.Session(profile_name="default")
    session = boto3.session.Session()
    client = session.client("s3")
    return session, client
def copy_with_multipart(local_s3_client, src_bucket, target_bucket, key, object_size):
    current_thread_name = get_current_thread_name()
    try:
        initiate_multipart = local_s3_client.create_multipart_upload(
            Bucket=target_bucket,
            Key=key
        )
        upload_id = initiate_multipart['UploadId']

        # 5 MB part size
        part_size = 5 * 1024 * 1024
        byte_position = 0
        part_num = 1
        parts_etags = []
        while byte_position < object_size:
            # The last part might be smaller than partSize, so check to make sure
            # that lastByte isn't beyond the end of the object.
            last_byte = min(byte_position + part_size - 1, object_size - 1)
            copy_source_range = f"bytes={byte_position}-{last_byte}"

            # Copy this part
            try:
                info_log(f"{current_thread_name} Creating upload_part_copy source_range: {copy_source_range}")
                response = local_s3_client.upload_part_copy(
                    Bucket=target_bucket,
                    CopySource={'Bucket': src_bucket, 'Key': key},
                    CopySourceRange=copy_source_range,
                    Key=key,
                    PartNumber=part_num,
                    UploadId=upload_id
                )
            except Exception as ex:
                error_log(f"{current_thread_name} Error while CREATING UPLOAD_PART_COPY for key {key}")
                raise ex

            parts_etags.append({"ETag": response["CopyPartResult"]["ETag"], "PartNumber": part_num})
            part_num += 1
            byte_position += part_size

        try:
            response = local_s3_client.complete_multipart_upload(
                Bucket=target_bucket,
                Key=key,
                MultipartUpload={
                    'Parts': parts_etags
                },
                UploadId=upload_id
            )
            info_log(f"{current_thread_name} {key} COMPLETE_MULTIPART_UPLOAD COMPLETED SUCCESSFULLY, response={response} !!!!")
        except Exception as ex:
            error_log(f"{current_thread_name} Error while CREATING COMPLETE_MULTIPART_UPLOAD for key {key}")
            raise ex
    except Exception as ex:
        error_log(f"{current_thread_name} Error while CREATING CREATE_MULTIPART_UPLOAD for key {key}")
        raise ex
Invoking the multipart method:
_, local_s3_client = get_session_client()
copy_with_multipart(local_s3_client, src_bucket_name, target_bucket_name, key, src_object_size)
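The invocation above assumes src_object_size is already known; obtaining it is not shown in the original answer. One way to get it (a hedged sketch, using the same client) is a head_object call on the source object:

def get_object_size(s3_client, bucket, key):
    # ContentLength is the size of the object in bytes
    return s3_client.head_object(Bucket=bucket, Key=key)['ContentLength']

_, local_s3_client = get_session_client()
src_object_size = get_object_size(local_s3_client, src_bucket_name, key)
copy_with_multipart(local_s3_client, src_bucket_name, target_bucket_name, key, src_object_size)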

Change output CSV file name of AWS Athena queries

I want to run my Athena query through AWS Lambda, but also change the name of my output CSV file from the Query Execution ID to my-bucket/folder/my-preferred-string.csv
I tried searching for the solution on the web, but couldn't find the exact code for the Lambda function.
I am a data scientist and a beginner to AWS. This is a one-time thing for me, so I'm looking for a quick solution or patch.
This question is already posted here
import time
import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")

# Run query
queryStart = client.start_query_execution(
    # PUT_YOUR_QUERY_HERE
    QueryString = '''
    SELECT *
    FROM "db_name"."table_name"
    WHERE value > 50
    ''',
    QueryExecutionContext = {
        # YOUR_ATHENA_DATABASE_NAME
        'Database': "covid_data"
    },
    ResultConfiguration = {
        # query result output location you mentioned in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)

# Executes query and waits 3 seconds
queryId = queryStart['QueryExecutionId']
time.sleep(3)

# Copies newly generated csv file with appropriate name
# Source: the query result output location mentioned in AWS Athena
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"
# Destination bucket and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource = queryLoc)

# Deletes the Athena-generated csv and its metadata file from the output location
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv").delete()
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv.metadata").delete()

print("report-2018.csv generated")
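One caveat with the snippet above: time.sleep(3) may return before Athena has finished writing the result file. A hedged alternative (not part of the original answer) is to poll the query state until it reaches a terminal status before copying:

import time

def wait_for_query(athena_client, query_id, poll_seconds=1):
    # Poll Athena until the query reaches a terminal state
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

# Wait for completion instead of sleeping a fixed 3 seconds
wait_for_query(client, queryId)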

BigQuery load job does not insert all data

I have about 200k CSVs (all with the same schema). I wrote a Cloud Function to insert them into BigQuery, so that as soon as I copy a CSV to a bucket, the function is executed and the data is loaded into the BigQuery dataset.
I basically used the same code as in the documentation.
dataset_id = 'my_dataset'  # replace with your dataset ID
table_id = 'my_table'      # replace with your table ID
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
table = bigquery_client.get_table(table_ref)  # API request

def bigquery_csv(data, context):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])

    errors = bigquery_client.load_table_from_uri(uri,
                                                 table_ref,
                                                 job_config=job_config)  # API request
    logging.info(errors)
    #print('Starting job {}'.format(load_job.job_id))

    # load_job.result()  # Waits for table load to complete.
    logging.info('Job finished.')

    destination_table = bigquery_client.get_table(table_ref)
    logging.info('Loaded {} rows.'.format(destination_table.num_rows))
However, when I copied all the CSVs to the bucket (about 43 TB in total), not all of the data was added to BigQuery; only about 500 GB was inserted.
I can't figure out what's wrong. No insert jobs are being shown in Stackdriver Logging and no functions are running once the copy job is complete.
However, when I copied all the CSVs to the bucket (about 43 TB in total), not all of the data was added to BigQuery; only about 500 GB was inserted.
You are hitting the BigQuery load limits defined in this link.
You should split your files into smaller files and the upload will work.
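Separately, the function in the question never waits on the load job, so failed loads (including ones rejected for exceeding the limits) won't show up in its logs. A minimal sketch of surfacing them, assuming the same bigquery_client, table_ref, uri, and job_config from the question:

load_job = bigquery_client.load_table_from_uri(uri, table_ref, job_config=job_config)
try:
    load_job.result()  # blocks until the load job completes
except Exception:
    # load_job.errors holds the reasons BigQuery rejected the load
    logging.error('Load job %s failed: %s', load_job.job_id, load_job.errors)
    raise
logging.info('Loaded %s rows.', bigquery_client.get_table(table_ref).num_rows)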

AWS Python Lambda Function - Upload File to S3

I have an AWS Lambda function written in Python 2.7 in which I want to:
1) Grab an .xls file from an HTTP address.
2) Store it in a temp location.
3) Store the file in an S3 bucket.
My code is as follows:
from __future__ import print_function
import urllib
import datetime
import boto3
from botocore.client import Config

def lambda_handler(event, context):
    """Make a variable containing the date format based on YYYYYMMDD"""
    cur_dt = datetime.datetime.today().strftime('%Y%m%d')

    """Make a variable containing the url and current date based on the variable
    cur_dt"""
    dls = "http://11.11.111.111/XL/" + cur_dt + ".xlsx"
    urllib.urlretrieve(dls, cur_dt + "test.xls")

    ACCESS_KEY_ID = 'Abcdefg'
    ACCESS_SECRET_KEY = 'hijklmnop+6dKeiAByFluK1R7rngF'
    BUCKET_NAME = 'my-bicket'
    FILE_NAME = cur_dt + "test.xls";

    data = open('/tmp/' + FILE_NAME, 'wb')

    # S3 Connect
    s3 = boto3.resource(
        's3',
        aws_access_key_id=ACCESS_KEY_ID,
        aws_secret_access_key=ACCESS_SECRET_KEY,
        config=Config(signature_version='s3v4')
    )

    # Uploaded File
    s3.Bucket(BUCKET_NAME).put(Key=FILE_NAME, Body=data, ACL='public-read')
However, when I run this function, I receive the following error:
'IOError: [Errno 30] Read-only file system'
I've spent hours trying to address this issue but I'm falling on my face. Any help would be appreciated.
'IOError: [Errno 30] Read-only file system'
You seem to lack some write access rights. If your Lambda has another policy, try attaching this policy to your role:
arn:aws:iam::aws:policy/AWSLambdaFullAccess
It has full access to S3 as well, in case you can't write to your bucket. If it solves your issue, you can remove some of the rights afterwards.
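If you prefer to attach it programmatically rather than through the console, a minimal sketch (the role name is a placeholder, not from the question):

import boto3

iam = boto3.client('iam')
# Attach the managed policy to the Lambda function's execution role
iam.attach_role_policy(
    RoleName='your-lambda-execution-role',  # placeholder role name
    PolicyArn='arn:aws:iam::aws:policy/AWSLambdaFullAccess'
)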
I have uploaded an image to an S3 bucket. In the Lambda test event, I created a JSON test event which contains the Base64 of the image to be uploaded to the S3 bucket and the image name.
The Lambda test JSON event is as follows:
{
    "ImageName": "Your Image Name",
    "img64": "BASE64 of Your Image"
}
Following is the code to upload an image or any file to S3:
import boto3
import base64

def lambda_handler(event, context):
    s3 = boto3.resource(u's3')
    bucket = s3.Bucket(u'YOUR-BUCKET-NAME')

    path_test = '/tmp/output'  # temp path in lambda.
    key = event['ImageName']   # assign filename to 'key' variable
    data = event['img64']      # assign base64 of an image to data variable
    data1 = data
    img = base64.b64decode(data1)  # decode the encoded image data (base64)

    with open(path_test, 'wb') as data:
        #data.write(data1)
        data.write(img)

    bucket.upload_file(path_test, key)  # Upload image directly inside bucket
    #bucket.upload_file(path_test, 'FOLDERNAME-IN-YOUR-BUCKET /{}'.format(key))  # Upload image inside folder of your s3 bucket.

    print('res---------------->', path_test)
    print('key---------------->', key)

    return {
        'status': 'True',
        'statusCode': 200,
        'body': 'Image Uploaded'
    }
Change data = open('/tmp/' + FILE_NAME, 'wb'): change the 'wb' to 'r'.
Also, I assume your IAM user has full access to S3, right?
Or maybe the problem is in the request of that URL...
Try making the download path start with "/tmp/":
urllib.urlretrieve(dls, "/tmp/" + cur_dt + "test.xls")
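Putting the fixes together (download into /tmp, then upload that same path), a minimal sketch of the corrected handler; the bucket name and URL follow the question, and credentials are assumed to come from the function's execution role rather than hard-coded keys:

import datetime
import urllib
import boto3

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    # Build the source URL from today's date, as in the question
    cur_dt = datetime.datetime.today().strftime('%Y%m%d')
    dls = "http://11.11.111.111/XL/" + cur_dt + ".xlsx"

    # /tmp is the only writable location in the Lambda filesystem
    local_path = "/tmp/" + cur_dt + "test.xls"
    urllib.urlretrieve(dls, local_path)

    # Upload the downloaded file to S3
    s3.Bucket('my-bicket').upload_file(local_path, cur_dt + "test.xls")
    return 'Uploaded ' + cur_dt + "test.xls"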