Combination of Google Bigtable and Cloud Pub/Sub occasionally hangs - Flask

I've built a web API application using a Gunicorn (gevent) + Flask stack.
When it receives data, it reads rows from 5 different Bigtable tables (the table clients follow a singleton pattern) and then does a CPU-bound task.
After that, it publishes a message to Pub/Sub.
The problem is that it sometimes hangs forever right before the row-reading loop.
data = {}
for bigtable in five_bigtables:
    rows = bigtable.read_rows(row_key)
    print('reading rows start')
    for row in rows:
        data[row.row_key] = row.cells[column_family][column_qualifier][0]

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
publisher.publish(topic_path, data)
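(For reference, pubsub_v1.PublisherClient.publish takes a bytes payload, so a runnable version of the publish step would typically serialize the dict first. A minimal sketch under that assumption; the JSON encoding and the result timeout are illustrative, not the original code:

import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# publish() takes bytes, so serialize the dict first (assumption: JSON is acceptable here)
future = publisher.publish(topic_path, json.dumps(data, default=str).encode("utf-8"))
message_id = future.result(timeout=30)  # block until the broker acks, with a timeout
)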
So I can see "reading rows start" on the console, and then it stops working.
If I comment out the function that publishes the message to Pub/Sub, it works totally fine.
The versions of the packages are:
flask==2.0.3
gevent==21.12.0
google-cloud-bigtable==2.5.1
google-cloud-pubsub==2.9.0
grpcio-status==1.44.0
gunicorn==20.1.0
The Gunicorn config is:
worker_class = 'gevent'
preload_app = False
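These settings typically live in a gunicorn.conf.py (itself a Python module); a minimal sketch, where only worker_class and preload_app come from the question and the bind address and worker count are assumptions:

# gunicorn.conf.py -- sketch; only worker_class and preload_app come from the question
bind = '127.0.0.1:8080'   # assumption: matches the URL used in the ab test below
workers = 1               # assumption
worker_class = 'gevent'
preload_app = False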
I can reproduce the hang by load testing with Apache Bench:
ab -n 10000 -c 200 "http://127.0.0.1:8080/test_url"
After some requests it shows:
Benchmarking 127.0.0.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 139 requests completed
Any comment or help would be appreciated.

Google Cloud Function: Error: memory limit exceeded. Function invocation was interrupted, but it works

I created a new Python Google Cloud Function that schedules a query in BigQuery every 10 minutes. I tested it and it works.
Deployment works fine.
Testing gives this error: Error: memory limit exceeded. Logs are not available (but I can see that the query did run as expected in BigQuery).
Using an HTTP trigger in Cloud Scheduler, I get a failure with the error message status: 503, but again, I can see in the BigQuery console that it is running as expected.
Edit: here is the code for the function
from google.cloud import bigquery

def load(request):
    client = bigquery.Client()
    dataset_id = 'yyyyyyyy'
    table_id = 'xxxxxxxxxxx'
    job_config = bigquery.QueryJobConfig()
    job_config.use_legacy_sql = False
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    sql = """
    SELECT * FROM `xxxxxx.datastudio.today_view`;
    """
    query_job = client.query(sql, location='asia-northeast1', job_config=job_config)
    query_job.result()
    print("Job finished.")
The BigQuery job is asynchronous. Your Cloud Function triggers it and waits for completion. If the function fails in between, it's not a problem; the two services aren't correlated.
If you do this via the API, when you create a job (a query) you immediately get a job ID. You then have to poll that job ID regularly to know its status. The client library does exactly the same thing!
Your out-of-memory issue comes from result(), which waits for completion and then reads the results. Set a page_size or max_results to limit the data returned.
Alternatively, you can skip waiting for the end and exit immediately (skip the query_job.result() line). You will save Cloud Functions processing time (a useless wait), and thus money!
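A minimal sketch of both options, reusing the query_job from the function above (the limit values are illustrative):

# Option 1: still wait for completion, but cap what is pulled back into memory
rows = query_job.result(page_size=500, max_results=500)  # illustrative limits

# Option 2: don't call result() at all; the job keeps running in BigQuery
# and still writes to the destination table after the function returns
print("Job {} started.".format(query_job.job_id))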

Cloud Tasks Conditional Execution

I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have completed successfully, so I need some way of reading, or being notified of, the statuses of the triggered tasks, but I see no way of doing this in GCP's documentation. I'm using the Node.js SDK to create tasks and Cloud Functions as task handlers, if that helps.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order as they don't have any dependency on each other. All of these tasks are actually implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).
Cloud Tasks only triggers tasks; the only condition you can define is a schedule time. You have to code the check manually when Task C runs.
Here is an example process (a minimal sketch of the Task C check is shown below):
Task A runs; at the end, it writes to Firestore that it has completed.
Task B runs; at the end, it writes to Firestore that it has completed.
Task C starts and checks in Firestore whether A and B have completed.
If not, the task exits with an error.
If yes, it continues the process.
You have to configure your Task C queue to retry the task in case of error.
Another, more expensive, solution is to use Cloud Composer to handle this workflow.
There is no other solution for workflow management for now.
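A minimal sketch of Task C's check, written here in Python with the Firestore client (the question mentions Node.js, so treat this as pseudocode for that case); the runs/{run_id}/status layout and field names are assumptions:

from google.cloud import firestore

def task_c_handler(run_id):
    db = firestore.Client()
    # Assumption: Tasks A and B each write a document runs/{run_id}/status/{task_name}
    # with a boolean field 'completed' when they finish successfully.
    status_ref = db.collection('runs').document(run_id).collection('status')
    completed = {doc.id: doc.to_dict().get('completed', False)
                 for doc in status_ref.stream()}
    if not (completed.get('task_a') and completed.get('task_b')):
        # Exit with an error so the Cloud Tasks queue retries Task C later
        raise RuntimeError('Tasks A/B not finished yet for run {}'.format(run_id))
    # ... continue with Task C's actual work here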
Cloud Tasks is not the tool you want to use in this case. Take a look at Cloud Composer, which is built on top of Apache Airflow, for GCP.
Edit: You could create a GCF to handle the states of those requests:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

################ TASK A
taskA_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
]

def call2TaskA(url):
    html = requests.get(url, stream=True)
    return (url, html.status_code)

processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in taskA_list:
        processes.append(executor.submit(call2TaskA, url))

isOkayToDoTaskB = True
for taskA in as_completed(processes):
    result = taskA.result()
    if result[1] != 200:  # your validation on taskA
        isOkayToDoTaskB = False
    results.append(result)

if not isOkayToDoTaskB:
    raise ValueError('Problems: {}'.format(results))

################ TASK B
def doTaskB():
    pass

doTaskB()

How do I poll a web service from a GAE service in short intervals?

I'm developing a client app that relies on a GAE service. This service needs to get updates by polling a remote web service at an interval of under one minute, so cron jobs are probably not the way to go here.
From the GAE service I need to poll the web service in intervals of a couple of seconds and then update the client app. So to break it down:
GAE service polls the remote web service in 5 sec intervals.
If a change is made, update the client app instantly.
Step 2 is already solved, but I'm struggling to find a good approach for this kind of polling. I have no control over the remote web service, so I can't make any changes on that end.
I've looked at the Task Queue API, but the documentation specifically says that it is unsuitable for interactive applications where a user is waiting for the result.
What would be the best way to solve this issue?
Use cron to schedule a bunch of task queue tasks with staggered ETAs:
def cron_job():  # scheduled to run every 5 minutes
    for i in xrange(0, 60*5, 5):
        deferred.defer(poll_web_service, _countdown=i)

def poll_web_service():
    # do stuff
Alternatively, with this level of frequency, you might as well have a dedicated instance for this. You can do that with a manual-scaling microservice, and you can have the request handler for /_ah/start never return, which lets it run forever (aside from periodic restarts). See this: https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed#instance_scaling
def on_change_detected(params):
    queue = taskqueue.Queue('default')
    task = taskqueue.Task(
        url='/some-url-on-your-default-service/',
        countdown=0,
        target='default',
        params={'params': params})
    queue.add(task)

class Start(webapp2.RequestHandler):
    def get(self):
        while True:
            time.sleep(5)
            if change_detected:  # YOUR LOGIC TO DETECT A CHANGE GOES HERE
                on_change_detected()

_routes = [
    RedirectRoute('/_ah/start', Start, name='start'),
]

for r in _routes:
    app.router.add(r)

How to retrieve the current worker count for a job in GCP Dataflow using the API

Does anyone know if there is a way to get the current worker count for an active job running in GCP Dataflow?
I wasn't able to do it using the API provided by Google.
One thing I was able to get is CurrentVcpuCount, but it is not what I need.
Thanks in advance!
The current number of workers in a Dataflow job is displayed in the message logs, under autoscaling. For example, I ran a quick example job and got the following messages when displaying the job logs in my Cloud Shell:
INFO:root:2019-01-28T16:42:33.173Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:02.166Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:05.385Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-01-28T16:43:05.433Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
Now, you can query these messages by using the projects.jobs.messages.list method in the Dataflow API, setting the minimumImportance parameter to JOB_MESSAGE_BASIC.
You will get a response similar to the following:
...
"autoscalingEvents": [
  {...}  // other events
  {
    "currentNumWorkers": "1",
    "eventType": "CURRENT_NUM_WORKERS_CHANGED",
    "description": {
      "messageText": "(fcfef6769cff802b): Worker pool started.",
      "messageKey": "POOL_STARTUP_COMPLETED"
    },
    "time": "2019-01-28T16:43:02.130129051Z",
    "workerPool": "Regular"
  },
To extend this, you could create a Python script to parse the response and read the currentNumWorkers parameter from the last element of the autoscalingEvents list, which gives the latest (and hence current) number of workers in the job.
Note that if this parameter is not present, it means that the number of workers is zero.
Edit: I wrote a quick Python script that retrieves the current number of workers from the message logs, using the API mentioned above:
from google.oauth2 import service_account
import googleapiclient.discovery

credentials = service_account.Credentials.from_service_account_file(
    filename='PATH-TO-SERVICE-ACCOUNT-KEY/key.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

service = googleapiclient.discovery.build(
    'dataflow', 'v1b3', credentials=credentials)

project_id = "MY-PROJECT-ID"
job_id = "DATAFLOW-JOB-ID"

messages = service.projects().jobs().messages().list(
    projectId=project_id,
    jobId=job_id
).execute()

try:
    print("Current number of workers is " + messages['autoscalingEvents'][-1]['currentNumWorkers'])
except:
    print("Current number of workers is 0")
A couple of notes:
The scopes are the permissions needed on the service account key you are referencing (in the from_service_account_file function) in order to call the API; this line is needed to authenticate to the API. You can use any scope from this list; to keep it easy on my side, I just used a service account key with project owner permissions.
If you want to read more about the Python API client libraries, check this documentation and these samples.
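As a final note, the minimumImportance filter mentioned above can be passed directly to the same list call; a small sketch reusing the service, project_id, and job_id from the script (the value is the one suggested earlier):

messages = service.projects().jobs().messages().list(
    projectId=project_id,
    jobId=job_id,
    minimumImportance='JOB_MESSAGE_BASIC'  # filter out lower-importance messages
).execute()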

Gmail API sends emails but some are never received

I recently tried my hand at the new Gmail API, and all seems to work fine except for one thing. My issue is as follows:
I'm working on a receptionist project that may need to generate more than one email in less than a minute during busy hours. So, just for testing purposes, I ran the following code, which works fine:
if __name__ == '__main__':
    service = setup()  # Simply a helper function to do the basic credential check. Works fine!
    print('service:' + str(service))
    for counter in range(1, 10):
        print('Sending message ' + str(counter))
        message = create_message(<SENDER_EMAIL_ID>, <RECEIVER_EMAIL_ID>, "Email Number: " + str(counter), "Sample text")
        response = send_message(service, 'me', message)
        print(response)
The setup() function is as follows:
credentials = get_credentials()
http = credentials.authorize(httplib2.Http())
service = discovery.build('gmail', 'v1', http=http)
Now, when I run the code, say, three times consecutively in less than a minute, the code runs fine and I am able to see all 27 emails in the Sent folder of SENDER_EMAIL_ID using a web browser. So the Gmail API is sending all the messages through whenever a request is made. However, only some of these emails are received at RECEIVER_EMAIL_ID; the rest are simply dropped.
However, if I run the program with, say, a 2-5 minute delay, then all the mails are received.
I have no idea why this is.
Any help would be really appreciated. :)
To expound more on @ken-y-n's response in the comments section, the Gmail API has usage limits. Specifically for this product, daily usage is about:
1 billion quota units / day
250 quota units / user / second
You may have encountered the rateLimitExceeded error during your tests.
Since you're sending emails through a loop, it will cost you about 100 units per call to send (plus other costs depending on the methods you're calling). This is the reason some emails seemed to be dropped. You can counter this by implementing exponential backoff on the messages that failed to send, as sketched below.
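A minimal sketch of that exponential backoff around the send call, assuming the same service built in the question's setup(); the retry count, status codes, and jitter are my choices:

import random
import time

from googleapiclient.errors import HttpError

def send_with_backoff(service, user_id, message, max_retries=5):
    # Retry rate-limited sends with exponential backoff plus jitter (sketch)
    for attempt in range(max_retries):
        try:
            return service.users().messages().send(userId=user_id, body=message).execute()
        except HttpError as err:
            if err.resp.status in (403, 429, 500, 503) and attempt < max_retries - 1:
                time.sleep((2 ** attempt) + random.random())
            else:
                raise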
Another alternative, instead of running it through a loop, is to use batch requests, which group your API calls together to reduce the number of HTTP connections your app makes.
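A short sketch of the batch alternative, using the client library's batch helper and the question's create_message; sender and receiver are placeholders standing in for <SENDER_EMAIL_ID> and <RECEIVER_EMAIL_ID>:

sender = 'sender@example.com'      # placeholder
receiver = 'receiver@example.com'  # placeholder

def handle_response(request_id, response, exception):
    # Called once per batched send; exception is an HttpError on failure
    if exception is not None:
        print('Batched send {} failed: {}'.format(request_id, exception))

batch = service.new_batch_http_request(callback=handle_response)
for counter in range(1, 10):
    message = create_message(sender, receiver, 'Email Number: ' + str(counter), 'Sample text')
    batch.add(service.users().messages().send(userId='me', body=message))
batch.execute()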