BigQuery load job does not insert all data - google-cloud-platform

I have about 200k CSVs (all with the same schema). I wrote a Cloud Function to insert them into BigQuery, so that as soon as a CSV is copied to a bucket, the function is executed and the data is loaded into the BigQuery dataset.
I basically used the same code as in the documentation.
import logging

from google.cloud import bigquery

bigquery_client = bigquery.Client()

dataset_id = 'my_dataset'  # replace with your dataset ID
table_id = 'my_table'      # replace with your table ID
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
table = bigquery_client.get_table(table_ref)  # API request

def bigquery_csv(data, context):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])

    errors = bigquery_client.load_table_from_uri(uri,
                                                 table_ref,
                                                 job_config=job_config)  # API request
    logging.info(errors)
    #print('Starting job {}'.format(load_job.job_id))
    # load_job.result()  # Waits for table load to complete.
    logging.info('Job finished.')

    destination_table = bigquery_client.get_table(table_ref)
    logging.info('Loaded {} rows.'.format(destination_table.num_rows))
However, when I copied all the CSVs to the bucket (about 43 TB in total), not all the data was added to BigQuery; only about 500 GB was inserted.
I can't figure out what's wrong. No insert jobs are shown in Stackdriver Logging and no functions are running once the copy job is complete.

You are hitting the BigQuery load limits defined in this link.
You should split your files into smaller files and the upload will work.
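As a side note, load_table_from_uri returns a load job object rather than a list of errors, so failures can go unnoticed if the job is never awaited. A minimal sketch of waiting for the job and surfacing its errors inside the function (assuming the same bigquery_client, table_ref and job_config as above):

import logging

def load_and_report(bigquery_client, uri, table_ref, job_config):
    # Start the load, block until it finishes, and log the outcome.
    load_job = bigquery_client.load_table_from_uri(uri, table_ref, job_config=job_config)
    try:
        load_job.result()  # waits for completion; raises if the job failed
    except Exception:
        # load_job.errors carries the detailed error records, if any
        logging.error('Load job %s failed: %s', load_job.job_id, load_job.errors)
        raise
    logging.info('Load job %s finished, %s rows loaded.', load_job.job_id, load_job.output_rows)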

Related

Cloud Run: Google Cloud Storage to BigQuery duplicate uploads

I have a Cloud Run instance that receives a list of files from Cloud Storage, checks if an item exists in BigQuery, and uploads it if it doesn't exist yet.
The pipeline structure is: Cloud Function (get list of files from GCS) > PubSub > Cloud Run > BigQuery.
I can't tell where the problem lies, but I suspect it's Pub/Sub's at-least-once delivery. I've set the acknowledgement deadline to 300 seconds and the retry policy to "Retry after exponential backoff delay".
What I expect is to call the Function, get a list of files from GCS, have the pipeline trigger, and end up with a 1:1 mapping between files in GCS and entries in BQ.
My question is: what GCP services should I be using to upload JSON files from GCS to BQ? Dataflow? I ask because the problem seems to be something between the Cloud Run instances.
Relevant code in Cloud Run instance
Check if file exists
def existInTable(table_id, dataId):
    try:
        client = bigquery.Client(project=PROJECT)
        tableref = PROJECT + "." + table_id
        sql = """
            SELECT version, dataId
            FROM `{}`
            WHERE version LIKE "{}" AND dataId LIKE "{}"
        """.format(tableref, __version__.__version__, dataId)
        query_job = client.query(sql)
        results = query_job.result()
        print("Found {} Entries for dataid {} and version {} in {}".format(results.total_rows, dataId, __version__.__version__, table_id))
        if results.total_rows > 0:
            print("Skip upload")
            return True
        return False
    except Exception as e:
        print("Failed to find entries in table {}".format(table_id))
        print(f"error: {e}")
        return False
upload
def uploadCloudStorageFileToBigquery(table_id, gcsEntity):
    client = bigquery.Client(project=PROJECT)
    job_config = bigquery.LoadJobConfig(
        schema=getSchema(),
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    # Report Path
    uri = "gs://{}/{}/{}file.json".format(gcsEntity.bucketName, gcsEntity.version, gcsEntity.dataId)
    # Start upload process
    try:
        load_job = client.load_table_from_uri(
            uri,
            table_id,
            location="US",  # Must match the destination dataset location.
            job_config=job_config,
        )  # Make an API request.
        load_job.result()  # Waits for the job to complete.
        destination_table = client.get_table(table_id)
        print("Successfully uploaded {},{} to {}, {} rows".format(gcsEntity.dataId, gcsEntity.version, table_id, destination_table.num_rows))
        return True
    except Exception as e:
        print("Failed uploading {},{} to {} \n URI: {} \n".format(gcsEntity.dataId, gcsEntity.version, table_id, uri))
        print(f"error: {e}")
        return False
function that ties the two together
def runBigQueryUpload(gcsEntity):
    # get Name for individual Table
    individualTable = BIGQUERYDATASET + "." + gcsEntity.bucketName.replace("gcsBucket-", "")
    # Only continue if no entry exists so far in Individual BigQuery Table
    noExceptionCaught = True
    if existInTable(individualTable, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, individualTable))
    else:
        try:
            uploadCloudStorageFileToBigquery(individualTable, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(individualTable, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False
    # Upload to main table
    if existInTable(BIGQUERYTABLEID, gcsEntity.dataId):
        print("BQ Instance {} from {} Already exists in {}".format(gcsEntity.dataId, gcsEntity.bucketName, BIGQUERYTABLEID))
    else:
        try:
            uploadCloudStorageFileToBigquery(BIGQUERYTABLEID, gcsEntity)
        except Exception as e:
            print("Failed upload to BigQuery table {} for {}".format(BIGQUERYTABLEID, gcsEntity.dataId))
            print(f"error: {e}")
            noExceptionCaught = False
    # If any try block failed, return False
    if not noExceptionCaught:
        print("Failed in upload process {} from {}".format(gcsEntity.dataId, gcsEntity.bucketName))
        return 500
    return 200
Hope this might help somebody else...
I had to rethink the architecture completely. There were three steps I needed to do:
1. Do the long processing on files (done with Pub/Sub on a per-file basis, so that Cloud Run can ack back in time).
2. Batch process all files in GCS to BQ in one go using write_disposition="WRITE_TRUNCATE"; this overwrites the table and ensures no duplicate entries.
3. Pass a wildcard character in the URI to batch process multiple files. I called this once the entire batch was in GCS:
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=getSchema(),
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_TRUNCATE",
)
uri = "gs://{}/subfolder/*file.json".format(bucketName)  # note wildcard
load_job = client.load_table_from_uri(
    uri,
    tableId,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)  # Make an API request.
Notes: the old code actually looped through each file in GCS and called load_table_from_uri each time (not batch processing, and it wastes quota).
It turns out two separate Cloud Run instances both checked that a file didn't exist in BQ and both uploaded it at the same time (that's why I needed batch processing).
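For completeness, a minimal sketch of wrapping that batch load so the job is awaited and the row count verified afterwards, mirroring the result() call from the per-file code (client, tableId, bucketName and the job_config above are assumed to exist as in the snippets above):

def run_batch_load(client, bucketName, tableId, job_config):
    # Single wildcard load for the whole batch, awaited before reporting.
    uri = "gs://{}/subfolder/*file.json".format(bucketName)  # same hypothetical layout as above
    load_job = client.load_table_from_uri(uri, tableId, location="US", job_config=job_config)
    load_job.result()  # blocks until the batch load completes (raises on failure)
    destination_table = client.get_table(tableId)
    print("Batch load finished, {} rows now in {}".format(destination_table.num_rows, tableId))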

Rows not being found in Lambda function reading CSV file

I have a Lambda function that reads a CSV file and adds each row to a DynamoDB table. I am using a print statement to print every row in the CSV to the CloudWatch logs.
There is a problem here, as only 51 of the 129 rows are being printed.
Also, only a small number of the rows that are actually found are being added to the DynamoDB tables.
Lambda Function:
# ChronojumpDataProcessor Lambda function
#
# This function is triggered by an object being created in an Amazon S3 bucket.
# The file is downloaded and each line is inserted into DynamoDB tables.
from __future__ import print_function
import json, urllib.parse, boto3, csv

# Connect to S3 and DynamoDB
s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')

# Connect to the DynamoDB tables
athleteTable = dynamodb.Table('Athlete')
countermovementTable = dynamodb.Table('CMJ')
depthTable = dynamodb.Table('DepthJump')

# This handler is executed every time the Lambda function is triggered
def lambda_handler(event, context):
    # Show the incoming event in the debug log
    #print("Event received by Lambda function: " + json.dumps(event, indent=2))

    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    localFilename = '/tmp/session.csv'

    # Download the file from S3 to the local filesystem
    try:
        s3.meta.client.download_file(bucket, key, localFilename)
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

    # Read the Session CSV file. Delimiter is the ',' character
    with open(localFilename) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')

        # Read each row in the file
        rowCount = 0
        for row in reader:
            rowCount += 1

            # Show the row in the debug log
            print(row['athlete_id'], row['athlete_name'], row['jump_id'], row['date_time'], row['jump_type'], row['jump_tc'], row['jump_height'], row['jump_RSI'])

            # Insert Athlete ID and Name into Athlete DynamoDB table
            athleteTable.put_item(
                Item={
                    'AthleteID': row['athlete_id'],
                    'AthleteName': row['athlete_name']})

            # Insert CMJ details into Countermovement Jump DynamoDB table
            if ((row['jump_type'] == "CMJ") | (row['jump_type'] == "Free")):
                countermovementTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'Height': row['jump_height']})
            else:
                # Insert Depth Jump details into Depth Jump DynamoDB table
                depthTable.put_item(
                    Item={
                        'AthleteID': row['athlete_id'],
                        'AthleteName': row['athlete_name'],
                        'DateTime': row['date_time'],
                        'JumpType': row['jump_type'],
                        'JumpID': row['jump_id'],
                        'ContactTime': row['jump_tc'],
                        'Height': row['jump_height'],
                        'RSI': row['jump_RSI']})

                # Finished!
                return "%d data inserted" % rowCount
I set the Lambda function's timeout to 2 minutes, as I thought there might not be enough time for the function to read every row, but that did not fix the problem.
Your return statement is indented under the else, which means that the function will exit as soon as the if evaluates to False.
It should be indented to match the indent used on the with line.
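A minimal sketch of the corrected control flow, with the per-row work collapsed into a hypothetical process_row helper: the return moves out of the else and out of the for loop (to the level suggested above), so it only runs once every row has been read.

def handle_rows(reader, process_row):
    # Sketch only: process_row stands in for the put_item calls above.
    rowCount = 0
    for row in reader:
        rowCount += 1
        process_row(row)
    # Finished! Runs once, after the loop has consumed all rows.
    return "%d data inserted" % rowCount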

Export from BigQuery to CSV based on client id

I have a BigQuery table filled with product data for a series of clients. The data has been flattened using a query. I want to export the data for each client to a Google Cloud Storage bucket in CSV format, so each client has its own individual CSV.
There are just over 100 clients, each with a client_id, and the table itself is 1 GB in size. I've looked into querying the table per client using a Cloud Function, but this would process over 100,000 GB of data. I've also looked at importing the clients into individual tables directly from the source, but I would need to run the flattening query on each, again incurring a high data cost.
Is there a way of doing this that will limit data usage?
Have you thought about Dataproc?
You could write a simple PySpark script that loads the data from BigQuery and writes it to the bucket split by client_id, something like this:
"""
File takes 3 arguments:
BIGQUERY-SOURCE-TABLE
desc: table being source of data in BiqQuery
format: project.dataset.table (str)
BUCKET-DEST-FOLDER
desc: path to bucket folder where CSV files will be stored
format: gs://bucket/folder/ (str)
SPLITER:
desc: name of column on which spit will be done during data saving
format: column-name (str)
"""
import sys
from pyspark.sql import SparkSession
if len(sys.argv) != 4:
raise Exception("""Usage:
filename.py BIGQUERY-SOURCE-TABLE BUCKET-DEST-FOLDER SPLITER"""
)
def main():
spark = SparkSession.builder.getOrCreate()
df = (
spark.read
.format("bigquery")
.load(sys.argv[1])
)
(
df
.write
.partitionBy(sys.argv[2])
.format("csv")
.option("header", True)
.mode("overwrite").
save(sys.argv[3])
)
if __name__ == "__main__":
main()
You will need to:
Save this script in a Cloud Storage bucket,
Create a Dataproc cluster for a while,
Run the command written below,
Delete the Dataproc cluster.
Let's say your architecture is as follows:
BigQuery table: myproject:mydataset.mytable
Bucket: gs://mybucket/
Dataproc cluster: my-cluster
So you will need to run the following command:
gcloud dataproc jobs submit pyspark gs://mybucket/script-from-above.py \
    --cluster my-cluster \
    --region [region-of-cluster] \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    -- \
    myproject:mydataset.mytable gs://mybucket/destination/ client_id
This will save the data in gs://mybucket/destination/ split by client_id, and you will end up with folders named:
client_id=1
client_id=2
...
client_id=n
As mentioned by @Mr.Batra, you can partition your table on client_id to regulate the cost and the amount of data queried.
Implementing a Cloud Function and looping over each client id without partitions will cost more, since each
SELECT * FROM table WHERE client_id=xxx
will scan the full table.
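For illustration, a minimal sketch of creating such a partitioned copy with integer-range partitioning, run through the Python client (hypothetical dataset/table names; it assumes client_id is an INT64 column, and for a STRING client_id, clustering on the column is the closer equivalent):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; one full scan builds a copy partitioned by client_id.
ddl = """
CREATE TABLE mydataset.products_by_client
PARTITION BY RANGE_BUCKET(client_id, GENERATE_ARRAY(0, 200, 1))
AS
SELECT * FROM mydataset.products_flattened
"""
client.query(ddl).result()

# Per-client exports then only scan the matching partition, e.g.:
# SELECT * FROM mydataset.products_by_client WHERE client_id = 42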

Change output CSV file name of AWS Athena queries

I want to run my Athena query through AWS Lambda, but also change the name of my output CSV file from the query execution ID to my-bucket/folder/my-preferred-string.csv.
I tried searching the web, but couldn't find the exact code for the Lambda function.
I am a data scientist and a beginner to AWS. This is a one-time thing for me, so I'm looking for a quick solution or a patch-up.
This question is already posted here
import time

import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")

# Run query
queryStart = client.start_query_execution(
    # PUT_YOUR_QUERY_HERE
    QueryString='''
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext={
        # YOUR_ATHENA_DATABASE_NAME
        'Database': "covid_data"
    },
    ResultConfiguration={
        # query result output location you mentioned in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)

# Executes query and waits 3 seconds
queryId = queryStart['QueryExecutionId']
time.sleep(3)

# Copies the newly generated csv file with the appropriate name
# (query result output location you mentioned in AWS Athena)
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"

# Destination location and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource=queryLoc)

# Deletes the Athena-generated csv and its metadata file from the output location
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv").delete()
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv.metadata").delete()

print('{file-name} csv generated')
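A fixed time.sleep(3) can be too short for bigger queries. One alternative is to poll the query state before copying the result, along these lines (a sketch, reusing the client and queryId from the snippet above):

import time

def wait_for_query(athena_client, query_id, poll_seconds=2):
    # Poll Athena until the query reaches a terminal state.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

# Usage with the snippet above:
# if wait_for_query(client, queryId) == 'SUCCEEDED':
#     copy, rename and clean up the result file as shown above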

AWS Glue job bookmark produces duplicates for csv files

We receive 1 CSV file every day at 11am from our vendor in an S3 bucket.
I convert this file into Parquet format using Glue at 11:30am.
I've enabled the job bookmark so that already processed files are not processed again.
Nonetheless, I see some files being reprocessed, thus creating duplicates.
I read these questions and answers: AWS Glue Bookmark produces duplicates for PARQUET and AWS Glue Job Bookmarking explanation.
They gave a good understanding of job bookmarking, but still do not address the issue.
The AWS documentation says it supports CSV files for bookmarking: AWS documentation.
Wondering if someone can help me understand what the problem could be, and a possible solution as well :)
Edit:
Pasting sample code here as requested by Prabhakar.
import os
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import ResolveChoice
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import lit

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
# kms_key_id, database_name, ingest_datetime and load_date are set elsewhere in the job

staging_database_name = "my-glue-db"
s3_target_path = "s3://mybucket/mydata/"

"""
'date_index': date location in the file name
'date_only': only date column is inserted
'date_format': format of date
'path': sub folder name in master bucket
"""
# fouo classified files
tables_spec = {
    'sample_table': {'path': 'sample_table/load_date=', 'pkey': 'mykey', 'orderkey': 'myorderkey'}
}

spark_conf = SparkConf().setAll([
    ("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
    ("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_id)
])
sc = SparkContext(conf=spark_conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

for table_name, spec in tables_spec.items():
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database=database_name,
                                                                table_name=table_name,
                                                                transformation_ctx='datasource0')
    resolvechoice2 = ResolveChoice.apply(frame=datasource0, choice="make_struct", transformation_ctx='resolvechoice2')

    # Create spark data frame with input_file_name column
    delta_df = resolvechoice2.toDF().withColumn('ingest_datetime', lit(str(ingest_datetime)))
    date_dyf = DynamicFrame.fromDF(delta_df, glueContext, "date_dyf")

    master_folder_path1 = os.path.join(s3_target_path, spec['path']).replace('\\', '/')
    master_folder_path = master_folder_path1 + load_date

    datasink4 = glueContext.write_dynamic_frame.from_options(frame=date_dyf,
                                                             connection_type='s3',
                                                             connection_options={"path": master_folder_path},
                                                             format='parquet', transformation_ctx='datasink4')

job.commit()
I spoke to an AWS Support engineer and she mentioned that she was able to reproduce the issue and has raised it with the Glue technical team for resolution.
Nonetheless, I couldn't wait for them to fix the bug and have taken a different approach.
Solution:
Disable the Glue bookmark.
After the Glue job converts the CSV file to Parquet, move the CSV file to a different location in the S3 bucket (a sketch of this follows below).
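A minimal sketch of that move step, assuming a hypothetical processed/ prefix in the same bucket (S3 has no rename, so it is a copy followed by a delete):

import boto3

def archive_csv(bucket, key, archive_prefix="processed/"):
    # Move an already-converted CSV out of the input prefix so the next
    # Glue run cannot pick it up again: copy it, then delete the original.
    s3 = boto3.resource("s3")
    dest_key = archive_prefix + key.split("/")[-1]
    s3.Object(bucket, dest_key).copy_from(CopySource={"Bucket": bucket, "Key": key})
    s3.Object(bucket, key).delete()

# Usage (hypothetical names):
# archive_csv("mybucket", "mydata/incoming/vendor_file.csv")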