app_identity_service.GetAccessToken() required more quota than is available - python-2.7

I am using App Engine and BigQuery as the backend for my website. Whenever a user clicks something, I log the click to BigQuery so I can run analytics later in the day. I get close to 75k clicks a day. It was working fine until last week. This is the code I use:
import httplib2
from apiclient import discovery
from oauth2client import appengine

body = {"rows": [bodyFields]}
credentials = appengine.AppAssertionCredentials(scope=BIGQUERY_SCOPE)
http = credentials.authorize(httplib2.Http())
bigquery = discovery.build('bigquery', 'v2', http=http)
response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=BIGQUERY_DATASETID,
    tableId=BIGQUERY_TABLEID,
    body=body).execute()
Now, all of a sudden, I am getting an over-quota exception. My application runs on a paid App Engine instance. Below is the stack trace of the exception:
Attempting refresh to obtain initial access_token
The API call app_identity_service.GetAccessToken() required more quota than is available.
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
rv = self.handle_exception(request, response, e)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~projectname/bigqueryapi.387952303347375306/filename.py", line 1611, in post
bigquery = discovery.build('bigquery', 'v2', http=http)
File "/base/data/home/apps/s~projectname/bigqueryapi.387952303347375306/oauth2client/util.py", line 129, in positional_wrapper
return wrapped(*args, **kwargs)
File "/base/data/home/apps/s~projectname/bigqueryapi.387952303347375306/apiclient/discovery.py", line 198, in build
resp, content = http.request(requested_url)
File "/base/data/home/apps/s~projectname/bigqueryapi.387952303347375306/oauth2client/util.py", line 129, in positional_wrapper
return wrapped(*args, **kwargs)
File "/base/data/home/apps/s~projectname/bigqueryapi.387952303347375306/oauth2client/client.py", line 516, in new_request
self._refresh(request_orig)
File "/base/data/home/apps/s~projectname/bigqueryapi.387952303347375306/oauth2client/appengine.py", line 194, in _refresh
scopes, service_account_id=self.service_account_id)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/app_identity/app_identity.py", line 589, in get_access_token
scopes, service_account_id=service_account_id)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/app_identity/app_identity.py", line 547, in get_access_token_uncached
return rpc.get_result()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 613, in get_result
return self.__get_result_hook(self)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/app_identity/app_identity.py", line 519, in get_access_token_result
rpc.check_success()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 579, in check_success
self.__rpc.CheckSuccess()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 134, in CheckSuccess
raise self.exception
OverQuotaError: The API call app_identity_service.GetAccessToken() required more quota than is available.
My traffic hasn't gone up by much, and the number of times the handler is hit is almost the same as in the past two months of data. So why am I getting this error?

In order to determine why you're hitting the quota error, you'll need to share more detail about your usage. The quota should reset every 24 hours. Do you know how long it takes for the error to appear and how much traffic you've successfully served to that point in time?
You mentioned that you "do analytics later in the day", which suggests that you might be using the TaskQueue API or Deferred Tasks. It's possible that those tasks are failing for other reasons and retrying, which could quickly eat up your quota. If you are using TaskQueues, you might try tuning the queue configuration and retry options.
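For example, if the clicks are handed off to a push queue, a minimal sketch of that kind of retry tuning might look like this (the handler URL and payload are placeholders, not your actual code):
from google.appengine.api import taskqueue

# Cap retries and add backoff so a failing analytics task cannot keep
# retrying and burning through the GetAccessToken quota.
taskqueue.add(
    url='/tasks/log_click',  # hypothetical handler
    params={'click': 'serialized-click-data'},  # hypothetical payload
    retry_options=taskqueue.TaskRetryOptions(task_retry_limit=3,
                                             min_backoff_seconds=10))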
Another way you might be able to conserve your quota would be to cache the BigQuery discovery service that you're building, for example with the Memcache API, so that it can be reused across multiple requests to the BigQuery service.
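As a minimal sketch of that idea, you could cache the built client at module level so each instance reuses it across requests (module-level caching here purely to illustrate the reuse, rather than Memcache; BIGQUERY_SCOPE is taken from the question's code):
import httplib2
from apiclient import discovery
from oauth2client import appengine

_bigquery_service = None  # reused by every request this instance serves

def get_bigquery_service():
    """Build the BigQuery client once per instance instead of per request."""
    global _bigquery_service
    if _bigquery_service is None:
        credentials = appengine.AppAssertionCredentials(scope=BIGQUERY_SCOPE)
        http = credentials.authorize(httplib2.Http())
        _bigquery_service = discovery.build('bigquery', 'v2', http=http)
    return _bigquery_service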

Related

Error "No worker is available to serve request: model" when calling SageMaker endpoint during increaesd load

I have a custom container that takes a request, does some feature extraction, and then passes the enhanced request on to a classifier endpoint. During feature extraction, another endpoint is called to generate text embeddings. I am using the HuggingFace estimator for my embedding model.
It had been working fine, but there was an increase in requests, and it looks like the embedding endpoint timed out somehow.
I am looking at adding automatic scaling to the endpoint, but I want to make sure I understand what is happening and that it properly addresses the issue. Unfortunately, searching for this error message does not pull up much. The instance metrics are not showing the endpoint to be overloaded; CPU utilization peaked at around 30%. Would auto scaling address the no-worker issue, or is this something different? I was receiving a few hundred requests per minute at the time.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/opt/program/predictor.py", line 56, in transformation
result = preprocessor.transform(data)
File "/opt/program/preprocessor.py", line 189, in transform
response = embed_predictor.predict(data=json.dumps(payload))
File "/usr/local/lib/python3.7/site-packages/sagemaker/predictor.py", line 136, in predict
response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 386, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 705, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (503) from primary with message "{
"code": 503,
"type": "ServiceUnavailableException",
"message": "No worker is available to serve request: model"
}
I would suggest confirming that MemoryUtilization is not being overwhelmed and that there is no more specific error in the CloudWatch Logs as well.
If MemoryUtilization is overwhelmed, you can try configuring Auto Scaling to distribute the request load across multiple instances. That being said, while I am not sure of the details of your custom container, I also recommend confirming that the container itself can handle multiple concurrent requests (i.e., has multiple workers available to serve requests).
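If you do go the scaling route, here is a minimal sketch of registering the endpoint variant with Application Auto Scaling via boto3 (endpoint and variant names are placeholders, and the target value should be tuned to your own traffic):
import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/my-embedding-endpoint/variant/AllTraffic'  # placeholder

# Register the production variant as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance so SageMaker adds instances under load.
autoscaling.put_scaling_policy(
    PolicyName='invocations-per-instance',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 200.0,  # invocations per instance per minute; tune to your load
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 300,
    },
)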

Google Cloud Speech to Text API stopped working

I signed up for the Google Cloud Speech-to-Text API for a project I am undertaking. This morning, I was able to convert speech to text successfully. In the evening, I resumed the work and, all of a sudden, it stopped working. It doesn't take any input, and upon keyboard interrupt I get the following message. I am not sure why it has suddenly decided not to work. I restarted my laptop thinking it might be a microphone issue. Then I re-downloaded the code and re-ran it, but with no positive outcome. Below is the message I get.
Traceback (most recent call last):
File "/Users/ajayshah/Downloads/s2t.py", line 174, in <module>
main()
File "/Users/ajayshah/Downloads/s2t.py", line 167, in main
responses = client.streaming_recognize(streaming_config, requests)
File "/Users/ajayshah/opt/miniconda3/lib/python3.9/site-packages/google/cloud/speech_v1/helpers.py", line 81, in streaming_recognize
return super(SpeechHelpers, self).streaming_recognize(
File "/Users/ajayshah/opt/miniconda3/lib/python3.9/site-packages/google/cloud/speech_v1/services/speech/client.py", line 605, in streaming_recognize
response = rpc(requests, retry=retry, timeout=timeout, metadata=metadata,)
File "/Users/ajayshah/opt/miniconda3/lib/python3.9/site-packages/google/api_core/gapic_v1/method.py", line 142, in __call__
return wrapped_func(*args, **kwargs)
File "/Users/ajayshah/opt/miniconda3/lib/python3.9/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
return retry_target(
File "/Users/ajayshah/opt/miniconda3/lib/python3.9/site-packages/google/api_core/retry.py", line 218, in retry_target
time.sleep(sleep)
KeyboardInterrupt

Glue Boto Client -- NoCredentialsError

I've been running my Glue Jobs on a schedule for a few months. Last night my Glue Job failed due to botocore.exceptions.NoCredentialsError: Unable to locate credentials after calling bucket.objects.filter(Prefix=productionDirectory):
I am under the impression this is a result of not having defined a credentials file, but AWS Glue has always pulled credentials without issue. I just re-ran my job and everything worked perfectly. For reference, I define my Glue Client via: glue = boto3.client('glue'). Has anyone ever experienced this before? Is this just an edge-case?
Full Logs:
Traceback (most recent call last):
File "/tmp/data-deployment", line 67, in <module>
for obj in bucket.objects.filter(Prefix=productionDirectory):
File "/home/spark/.local/lib/python3.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
for page in self.pages():
File "/home/spark/.local/lib/python3.7/site-packages/boto3/resources/collection.py", line 166, in pages
for page in pages:
File "/home/spark/.local/lib/python3.7/site-packages/botocore/paginate.py", line 255, in __iter__
response = self._make_request(current_kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/paginate.py", line 332, in _make_request
return self._method(**current_kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 613, in _make_api_call
operation_model, request_dict, request_context)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 632, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
operation_name=operation_model.name)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Edit/Update: This is a known bug. I've posted the mitigation strategy provided by AWS as an answer below.
Update: I reached out to AWS via Support and they responded. Apparently this is a known bug and issue. While they do not have a solution or ETA for solution, they do have a way to mitigate the issue. Information below:
Thank you for reporting your issue to us and product team is aware of this intermittent issue.
They are working on resolution however, I do not have an ETA.
To mitigate this issue, increase the timeout and number of attempts for the metadata service request in your code:
####START######
import os
#### Increase metadata service timeout and attempts ####
os.environ['AWS_METADATA_SERVICE_NUM_ATTEMPTS'] = "5"
os.environ['AWS_METADATA_SERVICE_TIMEOUT'] = "30"
####END######
I faced a similar issue with Glue, but not exactly the same.
We used external tables with Spark SQL and S3, and sometimes an exception such as "Table not found" was raised out of nowhere. The issue could never be reproduced in testing and occurred only rarely. Since our jobs ran perfectly fine on retries, we enabled the retry mechanism to work around it.
It has something to do with the internal workings of Glue and its serverless environment.
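For reference, a rough sketch of enabling automatic retries on a job with boto3 (job name, role, and script location are placeholders; UpdateJob replaces the job definition, so the existing Role and Command are supplied alongside MaxRetries):
import boto3

glue = boto3.client('glue')

# Re-run the job automatically up to 2 times when a run fails, which covers
# these rare, transient errors without manual intervention.
glue.update_job(
    JobName='my-etl-job',  # placeholder
    JobUpdate={
        'Role': 'MyGlueServiceRole',  # placeholder
        'Command': {'Name': 'glueetl',
                    'ScriptLocation': 's3://my-bucket/scripts/my-etl-job.py'},  # placeholder
        'MaxRetries': 2,
    },
)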

Google Cloud Function pulling from Pub/Sub subscription throws exception - Deadline Exceeded

I have a Google Cloud Function in Python 3.7 reading from a Pub/Sub subscription in synchronous pull mode.
After running fine once an hour for 24 hours, it threw this exception stack trace:
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/env/local/lib/python3.7/site-packages/grpc/_channel.py", line 824, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/env/local/lib/python3.7/site-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"#1580454091.145703535","description":"Error received from peer ipv4:74.125.202.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 346, in run_http_function
result = _function_handler.invoke_user_function(flask.request)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 210, in call_user_function
return self._user_function(request_or_event)
File "/user_code/main.py", line 39, in iteration
response = sub.pull(sub_path, MAX_MESSAGES)
File "/env/local/lib/python3.7/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
File "/env/local/lib/python3.7/site-packages/google/cloud/pubsub_v1/gapic/subscriber_client.py", line 1005, in pull
request, retry=retry, timeout=timeout, metadata=metadata
File "/env/local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
return wrapped_func(*args, **kwargs)
File "/env/local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/env/local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/env/local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/env/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded
What is this about? Is it to be expected or a result of some configuration problem? If to be expected, how should it be handled?
The documentation on pull (https://googleapis.dev/python/pubsub/latest/subscriber/api/client.html) says nothing about this being a possible exception.
I ack the messages immediately after the pull completes. I only permit one function execution at a time. I have a 600-second acknowledgement deadline. A block of messages pulled at one time seems to be fewer than 100 in number. If this is about failing to ack a message, it seems like the error could be reported much better.
This exception is raised by the client when there are no messages to read in the subscription. It is a known issue in the recent Pub/Sub library versions >= 1.0.0. If necessary, you can downgrade to version 0.45.0, where this issue was not present.
However, as a workaround, you can catch the DeadlineExceeded exception and retry the operation. Also, based on Hemang's comment, here's a small monkeypatch you can add to your running code, which might help you get the same behavior as in version 0.45.0.
from google.cloud.pubsub_v1.gapic import subscriber_client_config as sub_config
sub_config.config['interfaces']['google.pubsub.v1.Subscriber']['retry_params']['messaging']['initial_rpc_timeout_millis'] = 25000
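For the catch-and-retry approach, a minimal sketch along the lines of the question's 1.x-style synchronous client (project and subscription names are placeholders):
from google.api_core.exceptions import DeadlineExceeded
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path('my-project', 'my-subscription')  # placeholders

try:
    response = subscriber.pull(sub_path, max_messages=100)
except DeadlineExceeded:
    response = None  # nothing to read this time; treat it as an empty pull

if response and response.received_messages:
    ack_ids = [msg.ack_id for msg in response.received_messages]
    subscriber.acknowledge(sub_path, ack_ids)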
Finally, keep in mind that when using synchronous pull, having many outstanding pull requests helps lower delivery latency, which in turn might result in higher-latency pull requests (and DeadlineExceeded errors). If latency is crucial for your application, you could consider using StreamingPull instead.
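A bare-bones StreamingPull sketch, again with placeholder names, would look like this:
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path('my-project', 'my-subscription')  # placeholders

def callback(message):
    # process message.data here, then acknowledge it
    message.ack()

future = subscriber.subscribe(sub_path, callback=callback)
try:
    # Block for a bounded amount of time, then shut the stream down.
    future.result(timeout=60)
except Exception:
    future.cancel()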

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 container and then process that data.
Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file paths for a list of URLs. Then I am trying to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper I am using boto3 to download the gzip contents for a specific URL from the commoncrawl S3 bucket and then output some information about the the gzip contents (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word count, URL list, etc.
The output file from the first MapReduce job is only about 6mb in size (but will be larger once we scale to the full dataset). When I run the second MapReduce, this file is only split twice. Normally this is not a problem for such a small file, but the mapper code I described above (fetching S3 data, spitting out mapped output, etc.) takes a while to run for each URL. Since the file is only splitting twice, there are only 2 mappers being run. I need to increase the number of splits so that the mapping can be done faster.
I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits taking place.
Here is some of the code from the mapper:
import gzip
import io
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client for the public commoncrawl bucket
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
offset_end = offset + length - 1
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename,
                        Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    # [do stuff]
I also tried manually splitting the input file into multiple files of at most 100 lines each. This had the desired effect of giving me more mappers, but then I began encountering a ConnectionError from the S3 client's get_object() call:
Traceback (most recent call last):
File "dmapper.py", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "dmapper.py", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I am currently running this with only a handful of URLs, but I will need to do it with several thousand (each with many subdirectories) once I get it working.
I am not certain where to start with fixing this. I feel it is highly likely that there is a better approach than what I am trying. The fact that the mapper takes so long for each URL seems like a strong indication that I am approaching this wrong. I should also mention that the mapper and the reducer both run correctly when run directly as a pipe command:
"cat short_url_list.txt | python mapper.py | sort | python reducer.py" produces the desired output, but it would take too long to run on the entire list of URLs.
Any guidance would be greatly appreciated.
The MapReduce API provides NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control the maximum number of lines (here, WARC records) passed to a single mapper. It also works with mrjob; cf. Ilya's WARC indexer.
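For a plain Hadoop Streaming job, the invocation would look roughly like this (input/output paths and the lines-per-mapper value are placeholders, assuming Hadoop 2.x property names):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.input.lineinputformat.linespermap=50 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /path/to/first_job_output \
    -output /path/to/warc_stats \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py
With 50 lines per mapper, a 6 MB list of a few thousand records would be spread over many more map tasks than the two splits you are seeing now.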
Regarding the S3 connection error: it's better to run the job in the us-east-1 AWS region where the data is located.