I'm running Django 1.5 on GAE. I have a cron job that goes over several thousand URLs, grabs their "likes" count and saves it to the DB. It can easily take more than 10 minutes to complete. It works when I run it locally as a normal Linux cron job, but it fails with this error on GAE:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 266, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/django-1.5/django/core/handlers/wsgi.py", line 255, in __call__
response = self.get_response(request)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/django-1.5/django/core/handlers/base.py", line 175, in get_response
signals.got_request_exception.send(sender=self.__class__, request=request)
DeadlineExceededError
My setup:
app.yaml:
- url: /tasks/*
  script: myproject.wsgi.application
  login: admin
cron.yaml:
- description: update_facebook_resource
  url: /tasks/update_facebook_resource
  schedule: every day 04:05
  timezone: Europe/Berlin
views.py
def update_facebook_resource(request):
    resources = Resource.objects.filter(inactive=0).order_by('id')
    url_start = "https://graph.facebook.com/fql?q=select++total_count+from+link_stat+where+url%3D"
    url_end = "&access_token=..."
    for item in resources:
        url = item.link
        url_final = url_start + "%22" + url + "%22" + url_end
        data = json.load(urllib2.urlopen(url_final))
        likes = data["data"][0]["total_count"]
        query = Resource.objects.get(id=item.id)
        query.facebook_likes = likes
        query.save(update_fields=['facebook_likes'])
    return http.HttpResponse('ok')
What should I change, and how, so that GAE lets me complete it? I've read https://developers.google.com/appengine/articles/deadlineexceedederrors but it doesn't really give me what I need.
thanks
It's not a question of just getting GAE to let you complete the function. When developing for App Engine, you do need to think in a slightly different way, precisely because of things like the request deadline. In your case, you need to break the task up into chunks, and process each of those chunks individually.
You don't say if you're using django-nonrel with the GAE datastore, or if you're using Cloud SQL and therefore the standard Django API. If the former, you can use query cursors to keep track of your progress through the Resources. After each chunk, you can use deferred tasks to trigger the next chunk, passing it the cursor so it picks up where the last one left off.
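For example, here is a minimal sketch of that idea using the deferred library, paginating by id rather than with datastore cursors. It assumes the deferred builtin is enabled in app.yaml; CHUNK_SIZE and update_likes_chunk are made-up names, and url_start/url_end/Resource are the values from your view:
import json
import urllib2

from google.appengine.ext import deferred

CHUNK_SIZE = 50  # made-up value, tune it so one chunk stays well under the deadline

def update_likes_chunk(last_id=0):
    # Process one chunk of Resources, then defer the next chunk as a new task.
    resources = (Resource.objects.filter(inactive=0, id__gt=last_id)
                 .order_by('id')[:CHUNK_SIZE])
    processed = 0
    for item in resources:
        url_final = url_start + "%22" + item.link + "%22" + url_end
        data = json.load(urllib2.urlopen(url_final))
        item.facebook_likes = data["data"][0]["total_count"]
        item.save(update_fields=['facebook_likes'])
        last_id = item.id
        processed += 1
    if processed == CHUNK_SIZE:
        # There may be more rows: schedule the next chunk in its own request.
        deferred.defer(update_likes_chunk, last_id)

def update_facebook_resource(request):
    # The cron handler only kicks off the first chunk and returns immediately.
    deferred.defer(update_likes_chunk)
    return http.HttpResponse('ok')
Each deferred task gets its own request (and its own deadline), so no single request has to walk through all several thousand rows.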
Related
I know this has been questioned before, but no real solution was proposed and I was wondering if there any new ways nowadays.
Is there any way to hook an event using any AWS service to check if a Lambda has timed out? I mean, it logs into the CloudWatch logs that it timed out, so there must be a way.
Specifically in Python, because it's not so simple to keep checking whether it's reaching the 20-minute mark as you can with JavaScript and other naturally concurrent languages.
Ideally I want to execute another Lambda if the Python Lambda times out, with the same payload the original one received.
Here's an example from cloudformation-custom-resources/lambda/python · GitHub showing how an AWS Lambda function written in Python can realise that it is about to time out.
(I've edited out the other stuff; here are the relevant bits):
import signal

def handler(event, context):
    # Setup alarm for remaining runtime minus a second
    signal.alarm((context.get_remaining_time_in_millis() / 1000) - 1)

    # Do other stuff
    ...

def timeout_handler(_signal, _frame):
    '''Handle SIGALRM'''
    raise Exception('Time exceeded')

signal.signal(signal.SIGALRM, timeout_handler)
I want to add an update to John Rotenstein's answer, which worked for me yet resulted in the following errors populating the CloudWatch logs:
START RequestId: ********* Version: $LATEST
Traceback (most recent call last):
File "/var/runtime/bootstrap", line 9, in <module>
main()
File "/var/runtime/bootstrap.py", line 350, in main
event_request = lambda_runtime_client.wait_next_invocation()
File "/var/runtime/lambda_runtime_client.py", line 57, in wait_next_invocation
response = self.runtime_connection.getresponse()
File "/var/lang/lib/python3.7/http/client.py", line 1369, in getresponse
response.begin()
File "/var/lang/lib/python3.7/http/client.py", line 310, in begin
version, status, reason = self._read_status()
File "/var/lang/lib/python3.7/http/client.py", line 271, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/var/lang/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/var/task/lambda_function.py", line 6, in timeout_handler
raise Exception('Time limit exceeded')
Exception: Time limit exceeded
END RequestId
So I just had to reset the signal's alarm before returning each response:
import logging
import signal

def timeout_handler(_signal, _frame):
    raise Exception('Time limit exceeded')

signal.signal(signal.SIGALRM, timeout_handler)

def lambda_handler(event, context):
    try:
        signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
        logging.info('Testing stuff')
        # Do work
    except Exception as e:
        logging.error(f'Exception:\n{e}')

    signal.alarm(0)  # This line fixed the issue above!
    return {'statusCode': 200, 'body': 'Complete'}
Two options I can think of; the first is quick and dirty, but also less ideal:
run it in a Step Function (check out Step Functions in AWS), which has the capability to retry on timeouts/errors
a better way would be to re-architect your code to be idempotent. In this example, the process that triggers the lambda checks a condition and, as long as this condition is true, triggers the lambda. That condition needs to remain true until the lambda has finished executing the logic successfully. This can be achieved by persisting the parameters sent to the lambda in a table in the DB, for example, with an extra field called "processed" which is set to "true" only once the lambda has finished running successfully for that event (see the sketch after this list).
Using method #2 will make your code more resilient, easy to re-run on errors, and also easy to monitor: basically all you have to do is check how many such records you have that are not processed, and what their create/update timestamps in the DB are.
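A minimal sketch of option 2, assuming a DynamoDB table as the store (the table name "events", the key "event_id" and the do_work helper are made-up examples):
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('events')  # hypothetical table holding one record per triggering event

def lambda_handler(event, context):
    # Skip events that a previous (successful) invocation already handled.
    record = table.get_item(Key={'event_id': event['event_id']}).get('Item')
    if record and record.get('processed'):
        return {'statusCode': 200, 'body': 'Already processed'}

    do_work(event)  # your actual business logic goes here

    # Flip the flag only after the work finished, so a timed-out run can simply be retried.
    table.update_item(
        Key={'event_id': event['event_id']},
        UpdateExpression='SET processed = :t',
        ExpressionAttributeValues={':t': True},
    )
    return {'statusCode': 200, 'body': 'Complete'}
If the function times out before update_item runs, the record stays unprocessed and the trigger (or a monitoring job) can safely send the same payload again.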
If you care not only about identifying the timeout, but also about giving your Lambdas an option of a "healthy" shutdown and passing the remaining payload to another execution automatically, you may have a look at the Siblings component of the sosw package.
Here is an example use-case where you call the sibling when the time is running out. You pass the Sibling a pointer to where you left off in the job. For example, you may store the remaining payload in S3 and the cursor will show where you stopped processing.
You will have to grant the Role of this Lambda permission to lambda:InvokeFunction on itself.
import logging
import time

from sosw import Processor as SoswProcessor
from sosw.app import LambdaGlobals, get_lambda_handler
from sosw.components.siblings import SiblingsManager

logger = logging.getLogger()
logger.setLevel(logging.INFO)


class Processor(SoswProcessor):

    DEFAULT_CONFIG = {
        'init_clients':    ['Siblings'],  # Automatically initialize Siblings Manager
        'shutdown_period': 10,            # Some time to shut down in a healthy manner.
    }

    siblings_client: SiblingsManager = None

    def __call__(self, event):
        cursor = event.get('cursor', 0)

        while self.sufficient_execution_time_left:
            self.process_data(cursor)
            cursor += 1
            if cursor == 20:
                return "Reached the end of data"
        else:
            # Spawning another sibling to continue the processing
            payload = {'cursor': cursor}
            self.siblings_client.spawn_sibling(global_vars.lambda_context, payload=payload, force=True)
            self.stats['siblings_spawned'] += 1

    def process_data(self, cursor):
        """ Your custom logic respecting current cursor. """
        logger.info(f"Processing data at cursor: {cursor}")
        time.sleep(1)

    @property
    def sufficient_execution_time_left(self) -> bool:
        """ Return whether there is sufficient execution time for processing ('shutdown_period' is in seconds). """
        return global_vars.lambda_context.get_remaining_time_in_millis() > self.config['shutdown_period'] * 1000


global_vars = LambdaGlobals()
lambda_handler = get_lambda_handler(Processor, global_vars)
I am working on a simple Python scraping script. I am trying to get connections from LinkedIn using their API without a redirect_uri. I have worked before with some APIs that don't require the redirect URL, or accept just https://localhost. I have the consumer_key, consumer_secret, user_token and user_secret. Here's the code I am using from https://github.com/ozgur/python-linkedin:
RETURN_URL = ''
url = 'https://api.linkedin.com/v1/people/~'

# Instantiate the developer authentication class
authentication = linkedin.LinkedInDeveloperAuthentication(CONSUMER_KEY, CONSUMER_SECRET,
                                                          USER_TOKEN, USER_SECRET,
                                                          RETURN_URL, linkedin.PERMISSIONS.enums.values())

# Pass it in to the app...
application = linkedin.LinkedInApplication(authentication)

print application.get_profile()  # works
print application.get_connections()
And here's the error I get:
Traceback (most recent call last):
File "getContacts.py", line 20, in <module>
print application.get_connections()
File "/home/imane/Projects/prjL/env/local/lib/python2.7/site-packages/linkedin/linkedin.py", line 219, in get_connections
raise_for_error(response)
File "/home/imane/Projects/prjL/env/local/lib/python2.7/site-packages/linkedin/utils.py", line 63, in raise_for_error
raise LinkedInError(message)
linkedin.exceptions.LinkedInError: 403 Client Error: Forbidden for url: https://api.linkedin.com/v1/people/~/connections: Unknown Error
This is my first question here, so excuse me if I didn't make it clear enough, and thank you for helping me out.
Here's what I tried with python-oauth2:
import oauth2 as oauth
import requests
url = 'https://api.linkedin.com/v1/people/~'
params = {}
token = oauth.Token(key=USER_TOKEN, secret=USER_SECRET)
consumer = oauth.Consumer(key=CONSUMER_KEY, secret=CONSUMER_SECRET)
# Set our token/key parameters
params['oauth_token'] = token.key
params['oauth_consumer_key'] = consumer.key
oauth_request = oauth.Request(method="GET", url=url, parameters=params)
oauth_request.sign_request(oauth.SignatureMethod_HMAC_SHA1(), consumer, token)
signed_url = oauth_request.to_url()
response = requests.get(signed_url)
The Connections API is a restricted endpoint as of March 2015. It's possible you're using sample code/documentation that was written at a time when anyone could access those endpoints. You are receiving a 403 response because your application legitimately does not have the permission required to make that request.
It is a Cartridge/Mezzanine app and is running fine with HTTPS set up properly. It's working fine until I get to the end of the checkout process, where I get the following debug error in the browser:
Exception Type: AuthenticationError
Exception Value:
No API key provided. (HINT: set your API key using "stripe.api_key = "). You can generate API keys from the Stripe web interface. See https://stripe.com/api for details, or email support@stripe.com if you have any questions.
Exception Location: /home/jamesgilbert/lib/python2.7/stripe/api_requestor.py in request_raw, line 183
Traceback:
File "/home/johnsmith/webapps/cartridgeshop/lib/python2.7/Django-1.8.4-py2.7.egg/django/core/handlers/base.py" in get_response
132. response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/home/johnsmith/webapps/cartridgeshop/lib/python2.7/Django-1.8.4-py2.7.egg/django/views/decorators/cache.py" in _wrapped_view_func
57. response = view_func(request, *args, **kwargs)
File "/home/johnsmith/lib/python2.7/cartridge/shop/views.py" in checkout_steps
282. transaction_id = payment_handler(request, form, order)
File "/home/johnsmith/lib/python2.7/cartridge_stripe/init.py" in payment_handler
34. description=order)
File "/home/johnsmith/lib/python2.7/stripe/resource.py" in create
344. response, api_key = requestor.request('post', url, params, headers)
File "/home/johnsmith/lib/python2.7/stripe/api_requestor.py" in request
140. method.lower(), url, params, headers)
File "/home/johnsmith/lib/python2.7/stripe/api_requestor.py" in request_raw
183. 'No API key provided. (HINT: set your API key using '
I then got the following in the apache error logs:
/home/johnsmith/lib/python2.7/cartridge/shop/views.py:226:
UserWarning: The SHOP_CHECKOUT_FORM_CLASS setting is deprecated - please define your own urlpattern for the checkout_steps view, passing in your own form_class argument.
I have the correct Stripe API keys in the settings and everything is set up as it should be. I have looked in other places and am coming to a dead end.
Many Thanks
You need to add the Stripe API key to your settings.py file (which you said you had already done). Something like the following line but with your own API key from the Stripe developer web site.
STRIPE_API_KEY="sk_test_XXXXXXXXXXXXXXXXXXXXXXXX"
You also need to reference the python interface file (which you must have been doing to get this error message).
SHOP_HANDLER_PAYMENT = "cartridge.shop.payment.stripe_api.process"
And you need to install the stripe-python module.
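Putting those together, the relevant settings.py lines would look roughly like this (the key value is just a placeholder, use your own key from the Stripe dashboard):
# settings.py (sketch)
STRIPE_API_KEY = "sk_test_XXXXXXXXXXXXXXXXXXXXXXXX"
SHOP_HANDLER_PAYMENT = "cartridge.shop.payment.stripe_api.process"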
Seems like you did all this so I'm not sure if this will help but maybe it will trigger something to get you over this.
Cartridge-stripe doesn't seem to be maintained. I'd advise using the stripe payment handler built into cartridge, which will be documented in cartridge's next docs release (PR).
I'm trying to request a REST API resource multiple times. In order to save time, I'm trying to use urllib3.HTTPSConnectionPool instead of urllib2. However, it keeps throwing the following error:
Traceback (most recent call last):
File "LCRestapi.py", line 135, in <module>
listedLoansFast(version, key, showAll='false')
File "LCRestapi.py", line 55, in listedLoansFast
pool.urlopen('GET',url+resource,headers={'Authorization':key})
File "/Library/Python/2.7/site-packages/urllib3/connectionpool.py", line 515, in urlopen
raise HostChangedError(self, url, retries)
urllib3.exceptions.HostChangedError: HTTPSConnectionPool(host='https://api.lendingclub.com/api/investor/v1/loans/listing?showAll=false', port=None): Tried to open a foreign host with url: https://api.lendingclub.com/api/investor/v1/loans/listing?showAll=false
I'm using Python 2.7.6.
Here's my code:
manager = urllib3.PoolManager(1)
url = 'https://api.lendingclub.com/api/investor/v1/loans/listing?showAll=false'
pool = urllib3.HTTPSConnectionPool(url+resource, maxsize=1, headers={'Authorization':key})
r = pool.request('GET',url+resource)
print r.data
Thanks for the help!
The problem is that you're creating a PoolManager but never using it. Instead, you're also creating an HTTPSConnectionPool (which is bound to a specific host) and using that instead of the PoolManager. The PoolManager will automatically manage HTTPSConnectionPool objects on your behalf, so you don't need to worry about it.
This should work:
# Your example called this `manager`
http = urllib3.PoolManager()
url = 'https://api.lendingclub.com/api/investor/v1/loans/listing?showAll=false'
headers = {'Authorization': key}
# Your example did url+resource, but let's assume the url variable
# contains the combined absolute url.
r = http.request('GET', url, headers=headers)
print r.data
You can specify the size for the PoolManager if you'd prefer, but you shouldn't need to unless you're trying to do something unusual with limiting resources to a pool of threads.
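For instance, something like this (just a sketch; maxsize is forwarded to the per-host connection pools the manager creates):
# One cached host pool, one connection kept alive per host
http = urllib3.PoolManager(num_pools=1, maxsize=1)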
I need to parse URLs like the one below with Scrapy (ads from a real estate agent):
http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea
The response from the server is limited to 200 results whatever the min/max price you use in the URL (see pxmin / pxmax in the URL).
Therefore, I would like to use a function which generates URLs for start_urls with the right price bands, so that each URL returns no more than 200 search results and the URLs together cover a price range of, say, [0:1000000].
The function would do the following :
Take the first URL
Check number of results ("nbTrouvees" tag in the XML response)
adjust price band if results > 200 or add to start_urls list if < 200
The function increments the price band until it reaches 1,000,000.
The function returns the final start_urls list, which will cover all properties for a given region (rough sketch of the idea below).
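Roughly the kind of helper I have in mind (an untested sketch using urllib2 and lxml; the step size and the way I read the result count are assumptions on my side):
import urllib2
from lxml import etree

BASE = ("http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1"
        "&pxmin=%d&pxmax=%d&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea")

def count_results(pxmin, pxmax):
    # Ask the server how many ads fall inside this price band.
    xml = urllib2.urlopen(BASE % (pxmin, pxmax)).read()
    return int(etree.fromstring(xml).findtext('.//nbTrouvees') or 0)

def build_start_urls(px_limit=1000000, step=5000):
    urls, pxmin = [], 0
    while pxmin < px_limit:
        pxmax = min(pxmin + step, px_limit)
        # Shrink the band until the server reports fewer than 200 results.
        while count_results(pxmin, pxmax) > 200 and pxmax > pxmin + 1:
            pxmax = pxmin + (pxmax - pxmin) // 2
        urls.append(BASE % (pxmin, pxmax))
        pxmin = pxmax
    return urls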
This obviously means numerous requests to the server to find out the right price ranges, plus all the requests generated by the spider for the final scraping.
1) My first question therefore is: is there a better way to tackle this, from your point of view?
2) My second question: I have tried to retrieve the content of one of these pages with Scrapy, just to see how I could parse the "nbTrouvees" tag without using a spider, but I'm stuck.
I tried using the TextResponse method but got nothing in return. I then tried the below, but it fails as the method "body_as_unicode" doesn't exist for a "Response" object.
>>> link = 'http://ws.seloger.com/search.xml?idq=1244,1290,1247&ci=830137&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea'
>>> xxs = XmlXPathSelector(Response(link))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site- packages/scrapy/selector/lxmlsel.py", line 31, in __init__
_root = LxmlDocument(response, self._parser)
File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site- packages/scrapy/selector/lxmldocument.py", line 27, in __new__
cache[parser] = _factory(response, parser)
File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site- packages/scrapy/selector/lxmldocument.py", line 13, in _factory
body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'
Any idea? (FYI, it works with my spider.)
Thank you
Gilles