How to run BigQuery after Dataflow job completed successfully - google-cloud-platform

I am trying to run a query in BigQuery right after a dataflow job completes successfully. I have defined 3 different functions in main.py.
The first one is for running the dataflow job. The second one checks the dataflow jobs status. And the last one runs the query in BigQuery.
The trouble is that the second function checks the Dataflow job status repeatedly over a period of time and, even after the Dataflow job completes successfully, it does not stop checking the status.
The function deployment then fails with a 'function load attempt timed out' error.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import os
import re
import config
from google.cloud import bigquery
import time

global flag

def trigger_job(gcs_path, body):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    request = service.projects().templates().launch(projectId=config.project_id, gcsPath=gcs_path, body=body)
    response = request.execute()

def get_job_status(location, flag):
    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    result = dataflow.projects().jobs().list(projectId=config.project_id, location=location).execute()
    for job in result['jobs']:
        if re.findall(r'' + re.escape(config.job_name) + '', job['name']):
            while flag == 0:
                if job['currentState'] != "JOB_STATE_DONE":
                    print('NOT DONE')
                else:
                    flag = 1
                    print('DONE')
                    break

def bq(sql):
    client = bigquery.Client()
    query_job = client.query(sql, location='US')

gcs_path = config.gcs_path
body = config.body
trigger_job(gcs_path, body)

flag = 0
location = 'us-central1'
get_job_status(location, flag)

sql = """CREATE OR REPLACE TABLE `table` AS SELECT * FROM `table`"""
bq(sql)
Cloud Function timeout is set to 540 seconds but deployment fails in 3-4 minutes.
Any help is very appreciated.

It appears from the code snippet provided that your HTTP-triggered Cloud Function is not returning an HTTP response.
All HTTP-based Cloud Functions must return an HTTP response for proper termination. From the Google documentation, Ensure HTTP functions send an HTTP response (emphasis mine):
If your function is HTTP-triggered, remember to send an HTTP response, as shown below. Failing to do so can result in your function executing until timeout. If this occurs, you will be charged for the entire timeout time. Timeouts may also cause unpredictable behavior or cold starts on subsequent invocations, resulting in unpredictable behavior or additional latency.
Thus, you must have a function in your main.py that returns some sort of value, ideally a value that can be coerced into a Flask HTTP response.
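As a hedged illustration (the entry-point name run_pipeline and the wiring through the question's existing helpers are assumptions, not the poster's actual deployment), the key point is simply that the deployed function returns something Flask can turn into a response:

def run_pipeline(request):
    # Hypothetical HTTP entry point; reuses the helpers defined in the question.
    trigger_job(config.gcs_path, config.body)
    get_job_status('us-central1', 0)
    bq(sql)
    # Returning a (body, status) tuple is coerced into a Flask HTTP response,
    # letting the function terminate cleanly instead of running until timeout.
    return 'Dataflow job triggered and BigQuery query submitted', 200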

Related

After retrying, task marked success in Airflow but it should be marked failed

I am not sure where I am going wrong. I am running a DAG and I have set retries to 5. My expectation is that if a task fails, it will retry 5 times, and if it still doesn't pass it will be marked as failed. Contrary to that, the task is marked as success and triggers the upstream task, despite not having done all the updates required for that particular task to be complete. Am I missing any required parameter? Thanks.
Note: I am using a Dataproc cluster and updating records in a BQ table, which has a limitation of 1500 updates/table/day. That limit was exceeded, but Airflow didn't bother to mark the task failed; rather, it triggered the upstream task. Is there any fix we can use so that the task would fail even when there is a limit/quota-exceeded issue?
Issue: "write to table failed: Reason: 403 limit error due to append/update limit exceeded". After this, it kept running the rest of the code. But how can I tell Airflow that this is an error and it should stop the task?
writing_to_bq(df=ent1, table=table, retry=3)

def writing_to_bq(df, table, retry):
    for _ in range(retry + 1):
        try:
            df.to_gbq(table, if_exists="append")
            break
        except Exception as e:
            log.info("Trying again")
Normally, if you set the args globally (for all your tasks) or only in the task, it should work as expected:
dag_default_args = {
    'owner': 'Airflow',
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    ...
}

with airflow.DAG(
        "your_dag_id",
        default_args=dag_default_args,
        schedule_interval=None) as dag:
    ....
In this example, I used a retry_delay of 2 minutes.
After discussing this together, you indicated that the BigQueryInsertJobOperator operator doesn't mark the task as failed when it receives a 403 status for an exceeded quota limit.
You can also use a PythonOperator with the Python BigQuery client; in this case the client will normally throw an error for a 403 status. If it doesn't, you still have more flexibility to apply logic based on the result:
def execute_bq_query():
    from google.cloud import bigquery

    client = bigquery.Client()
    QUERY = 'SELECT ....'
    query_job = client.query(QUERY)
    rows = query_job.result()

PythonOperator(
    task_id="task_id",
    python_callable=execute_bq_query
)
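To make the failure behaviour explicit, here is a hedged sketch (the task id, query, and Airflow 2 import path are assumptions): any exception that escapes the callable, including the 403 google.api_core.exceptions.Forbidden raised on quota errors, marks the attempt as failed and lets retries/retry_delay kick in, which is exactly what the original writing_to_bq prevents by catching and only logging:

from airflow.operators.python import PythonOperator
from google.api_core.exceptions import Forbidden
from google.cloud import bigquery

def execute_bq_query_or_fail():
    client = bigquery.Client()
    try:
        return list(client.query("SELECT ...").result())  # placeholder query
    except Forbidden:
        # Quota/limit errors surface as 403 Forbidden; re-raising (or simply
        # not catching) makes Airflow mark this attempt as failed and retry.
        raise

PythonOperator(
    task_id="execute_bq_query_or_fail",
    python_callable=execute_bq_query_or_fail
)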

AWS Error "Calling the invoke API action failed with this message: Rate Exceeded" when I use s3.get_paginator('list_objects_v2')

A third-party application uploads around 10,000 objects a day to my bucket+prefix. My requirement is to fetch all objects that were uploaded to my bucket+prefix in the last 24 hours.
There are a lot of files in my bucket+prefix.
So I assume that when I call
response = s3_paginator.paginate(Bucket=bucket, Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})
it makes multiple calls to the S3 API, and maybe that's why it shows the Rate Exceeded error.
Below is my Python Lambda Function.
import json
import boto3
import time
from datetime import datetime, timedelta

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    from_date = datetime.today() - timedelta(days=1)
    string_from_date = from_date.strftime("%Y-%m-%d, %H:%M:%S")
    print("Date :", string_from_date)
    s3_paginator = s3.get_paginator('list_objects_v2')
    list_of_buckets = ['kush-dragon-data']
    bucket_wise_list = {}
    for bucket in list_of_buckets:
        response = s3_paginator.paginate(Bucket=bucket, Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})
        filtered_iterator = response.search(
            "Contents[?to_string(LastModified)>='\"" + string_from_date + "\"'].Key")
        keylist = []
        for key_data in filtered_iterator:
            if "/" in key_data:
                splitted_array = key_data.split("/")
                if len(splitted_array) > 1:
                    if splitted_array[-1]:
                        keylist.append(splitted_array[-1])
            else:
                keylist.append(key_data)
        bucket_wise_list.update({bucket: keylist})
    print("Total Number Of Object = ", bucket_wise_list)
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps(bucket_wise_list)
    }
So when we execute the above Lambda function, it shows the error below.
"Calling the invoke API action failed with this message: Rate Exceeded."
Can anyone help resolve this error and achieve my requirement?
This is probably due to your account restrictions; you should add a retry with a few seconds between attempts, or increase the page size.
This is most likely due to you reaching your quota limit for AWS S3 API calls. The "bigger hammer" solution is to request a quota increase, but if you don't want to do that, there is another way: use botocore.Config's built-in retries, for example:
import json
import time
from datetime import datetime, timedelta
from boto3 import client
from botocore.config import Config

config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)

def lambda_handler(event, context):
    s3 = client('s3', config=config)
    ### ALL OF YOUR CURRENT PYTHON CODE EXACTLY THE WAY IT IS ###
This config will use exponentially increasing sleep timer for a maximum number of retries. From the docs:
Any retry attempt will include an exponential backoff by a base factor of 2 for a maximum backoff time of 20 seconds.
There is also an adaptive mode which is still experimental. For more info, see the docs on botocore.Config retries
Another (much less robust, in my opinion) option would be to write your own paginator with a sleep programmed in, though you'd probably just want to use the built-in backoff in 99.99% of cases (even if you do have to write your own paginator). Note that this code is untested and isn't asynchronous, so the sleep is in addition to the wait time for a page response; to make the "sleep time" exactly sleep_secs you'd need concurrent.futures or asyncio (the AWS built-in paginators mostly use concurrent.futures):
from boto3 import client
from typing import Generator
from time import sleep

def get_pages(bucket: str, prefix: str, page_size: int, sleep_secs: float) -> Generator:
    s3 = client('s3')
    page: dict = s3.list_objects_v2(
        Bucket=bucket,
        MaxKeys=page_size,
        Prefix=prefix
    )
    next_token: str = page.get('NextContinuationToken')
    yield page
    while next_token:
        sleep(sleep_secs)
        page = s3.list_objects_v2(
            Bucket=bucket,
            MaxKeys=page_size,
            Prefix=prefix,
            ContinuationToken=next_token
        )
        next_token = page.get('NextContinuationToken')
        yield page
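A hypothetical usage of the generator above, reusing the bucket and prefix from the question (the values are only illustrative):

# Hypothetical usage of get_pages(); bucket/prefix mirror the question.
for page in get_pages(bucket='kush-dragon-data',
                      prefix='inside-bucket-level-1/',
                      page_size=1000,
                      sleep_secs=1.0):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['LastModified'])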

returning JSON response from AWS Glue Pythonshell job to the boto3 caller

Is there a way to send a JSON response (a dictionary of outputs) from an AWS Glue Python shell job, similar to returning a JSON response from AWS Lambda?
I am calling a Glue pythonshell job like below:
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'})

print(response)
The response I get is a 200, stating the fact that the Glue start_job_run was a success. From the documentation, all I see is that the result of a Glue job is either written to S3 or to some other database.
I tried adding return {'result':'some_string'} at the end of my Glue Python shell job to test whether it works, with the code below.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           's3_target_path_key',
                           's3_target_path_value'])

print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])

return {'result': "some_string"}
But it throws the error SyntaxError: 'return' outside function.
Glue is not made to return a response, as it is expected to run long-running operations. Blocking for a response from a long-running task is not the right approach in itself. Instead, you may use a launch job (service 1) -> execute job (service 2) -> get result (service 3) pattern. You can send a JSON response to AWS service 3, which you launch from AWS service 2 (the executing job); e.g. if you launch a Lambda from the Glue job, you can send a JSON payload to it.
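To make that concrete, here is a hedged sketch of pushing the result out of the Glue Python shell job yourself instead of returning it; the downstream Lambda name is a placeholder, and this assumes the Glue job's IAM role is allowed to invoke it (writing the JSON to a known S3 key for the caller to poll works the same way):

import json
import boto3

result = {'result': 'some_string'}

# Hand the JSON result to a downstream service rather than returning it.
lambda_client = boto3.client('lambda')
lambda_client.invoke(
    FunctionName='collect-glue-result',          # placeholder function name
    InvocationType='Event',                      # asynchronous, fire-and-forget
    Payload=json.dumps(result).encode('utf-8'),
)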

AWS StepFunctions Task state gets cancelled when tearing down a Google Cloud cluster

I am using AWS StepFunctions to carry out several tasks on the Google Cloud side - creating a Dataproc cluster, submitting a task to it, and then tearing it down (each of which have their own Task state, as well as "poller" tasks that check when the jobs have been finished in order to move onto the next Task).
The issue is that, for tearing down the cluster, the Task goes into the "cancelled" (gray) status instead of "in progress", followed by the poller Task. Once the cluster deletion lambda function executes the cluster deletion method, it should move on to the poller Task.
Here is a look at the cluster deletion lambda function:
from pprint import pprint
from google.cloud import storage
import googleapiclient.discovery
from rkstr8.cloud.google import GoogleCloudLambdaAuth
import time

def handler(event, context):
    creds = event['GCP_creds']
    GoogleCloudLambdaAuth(creds).configure_google_creds()
    dataproc = googleapiclient.discovery.build('dataproc', 'v1')
    project_id = event['gcp-administrative']['project']
    zone = event['gcp-administrative']['zone']
    try:
        region_as_list = zone.split('-')[:-1]
        region = '-'.join(region_as_list)
    except (AttributeError, IndexError, ValueError):
        raise ValueError('Invalid zone provided, please check your input.')
    cluster = event['dataproc-administrative']['cluster_name']
    print('Tearing down cluster...')
    request = dataproc.projects().regions().clusters().delete(
        projectId=project_id,
        region=region,
        clusterName=cluster)
    time.sleep(30)
    result = request.execute()
    return result
Here is what the relevant part of the state machine building code looks like:
dproc_submit_state = AsyncPoller(
    stats_path=DPROC_SUBMIT_POLLER_STATUS_PATH,
    async_task=Task(
        name=DPROC_SUBMIT,
        resource=DPROC_SUBMIT_ARN_VAR,
        input_path=DPROC_SUBMIT_INPUT_PATH,
        result_path=DPROC_SUBMIT_RESULT_PATH,
        next=DPROC_SUBMIT_POLLER
    ),
    pollr_task=Task(
        name=DPROC_SUBMIT_POLLER,
        resource=DPROC_SUBMIT_POLLER_ARN_VAR,
        input_path=DPROC_SUBMIT_RESULT_PATH,
        result_path=DPROC_SUBMIT_POLLER_STATUS_PATH
    ),
    faild_task=Fail(
        name='HailScriptFailed'
    ),
    succd_task=DPROC_DELETE,
    pollr_wait_time=self.conf["POLLER_WAIT_TIME"]
).states()

dproc_delete_state = AsyncPoller(
    stats_path=DPROC_DELETE_POLLER_STATUS_PATH,
    async_task=Task(
        name=DPROC_DELETE,
        resource=DPROC_DELETE_ARN_VAR,
        input_path=DPROC_DELETE_INPUT_PATH,
        result_path=DPROC_DELETE_RESULT_PATH,
        next=DPROC_DELETE_POLLER
    ),
    pollr_task=Task(
        name=DPROC_DELETE_POLLER,
        resource=DPROC_DELETE_POLLER_ARN_VAR,
        input_path=DPROC_DELETE_RESULT_PATH,
        result_path=DPROC_DELETE_POLLER_STATUS_PATH
    ),
    faild_task=Fail(
        name='ClusterDeleteFailed'
    ),
    succd_task='PipelineSucceeded',
    pollr_wait_time=self.conf["POLLER_WAIT_TIME"]
).states()
Why are you sleeping for 30 seconds between creating a request and executing it?
The default timeout for lambda is 3 seconds. My guess is that your lambda is just timing out.
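If that is the case, one minimal change (a sketch, not the poster's actual fix) is to drop the blocking sleep inside handler() so the delete request executes immediately and the separate poller task does the waiting, or alternatively to raise the function's configured timeout well above 30 seconds:

    # Inside handler(), replacing the time.sleep(30) + execute() tail:
    request = dataproc.projects().regions().clusters().delete(
        projectId=project_id,
        region=region,
        clusterName=cluster)
    result = request.execute()  # no sleep; the poller state waits for completion
    return result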

Django HTTP Response object for GAE Cron

I am doing this in a Django / GAE / Python environment:
cron:
#run events every 12 hours
and
def events(request):
    # read all records
    # Do some processing on a few records
    return http.HTTPResponseGone('Some Records are modified')
Result in production:
The job runs on time with a 'failed' message.
However, it has done the job on the datastore exactly as required.
No error log entry is seen.
Dev: no errors; it returns the message 'Some Records are modified'.
Is it possible to avoid returning an HTTP response? There is no need for an HTTPResponse in my case; however, I have kept it because dev server testing fails in its absence. Can someone help me make the code clean?
Gone is error 410. You should return 200 Success if the operation succeeds. When you return HttpResponse, the default status is 200.
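For example, a hedged rewrite of the view (keeping the processing as-is) that returns a plain HttpResponse, whose default status of 200 the cron service treats as success:

from django.http import HttpResponse

def events(request):
    # read all records
    # Do some processing on a few records
    return HttpResponse('Some Records are modified')  # status 200 by default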