AWS Lambda Canary Function

I am looking to use a Lambda function to monitor multiple websites. How would you alter the code below (from the Blueprint)?
Multiple environment variables can be set up (e.g. site2, expected2); I just need help with making them work inside the function.
Thank you.
import os
from datetime import datetime
from urllib.request import Request, urlopen

SITE = os.environ['site']          # URL of the site to check, stored in the site environment variable
EXPECTED = os.environ['expected']  # String expected to be on the page, stored in the expected environment variable

def validate(res):
    '''Return False to trigger the canary

    Currently this simply checks whether the EXPECTED string is present.
    However, you could modify this to perform any number of arbitrary
    checks on the contents of SITE.
    '''
    return EXPECTED in res

def lambda_handler(event, context):
    print('Checking {} at {}...'.format(SITE, event['time']))
    try:
        req = Request(SITE, headers={'User-Agent': 'AWS Lambda'})
        if not validate(str(urlopen(req).read())):
            raise Exception('Validation failed')
    except:
        print('Check failed!')
        raise
    else:
        print('Check passed!')
        return event['time']
    finally:
        print('Check complete at {}'.format(str(datetime.now())))

There are several ways to go here, but rather than adding new environment variables for each site and expected value, you could just use two environment variables that hold comma-separated lists. Then, in the Lambda function, split the lists, zip them together into a dictionary, and iterate through the dict items, using each key and value to call your modified validate() function.
Something like:
import os
from datetime import datetime
from urllib.request import Request, urlopen

SITES = os.environ['sites']                      # comma-separated URLs of the sites to check, stored in the sites environment variable
EXPECTED_VALUES = os.environ['expected_values']  # comma-separated strings expected on each site's page, stored in the expected_values environment variable

sites = SITES.split(',')
expected_values = EXPECTED_VALUES.split(',')
sites_to_check = dict(zip(sites, expected_values))

def validate(page_content, expected_value):
    '''Return False to trigger the canary

    Currently this simply checks whether the expected string is present.
    However, you could modify this to perform any number of arbitrary
    checks on the contents of the page.
    '''
    return expected_value in page_content

def lambda_handler(event, context):
    for url, expected_val in sites_to_check.items():
        print('Checking {} at {}...'.format(url, event['time']))
        try:
            req = Request(url, headers={'User-Agent': 'AWS Lambda'})
            if not validate(str(urlopen(req).read()), expected_val):
                raise Exception('Validation failed')
        except:
            print('Check failed!')
            raise
        else:
            print('Check passed!')
            # keep on looping until we get through them all
        finally:
            print('Check complete for {} at {}'.format(url, str(datetime.now())))
    # if we got here, we checked all the sites and they passed
    print('Checked all sites and they all passed at {}'.format(str(datetime.now())))
I have not run or tested this code, but you get the idea - make it a loop!
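If it helps to see the split-and-zip step in isolation, here is a small sketch; the URLs and expected strings are placeholders, not values from the question:
# Stand-ins for the two comma-separated Lambda environment variables described above.
SITES = 'https://example.com,https://example.org'
EXPECTED_VALUES = 'Example Domain,Example Text'

# Split each string and pair the entries up positionally into a dict.
sites_to_check = dict(zip(SITES.split(','), EXPECTED_VALUES.split(',')))

for url, expected in sites_to_check.items():
    print('{} should contain {!r}'.format(url, expected))
# https://example.com should contain 'Example Domain'
# https://example.org should contain 'Example Text'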

Related

JSON serialisation of dates in Flask-RESTful

I have the following resource:
class Image(Resource):
    def get(self, db_name, col_name, image_id):
        col = mongo_client[db_name][col_name]
        image = col.find_one({'_id': ObjectId(image_id)})
        try:
            image['_id'] = str(image['_id'])
        except TypeError:
            return {'image': 'notFound'}
        return {'image': image}
linked to a certain endpoint.
However, image contains certain datetime objects inside. I could wrap this with `json.dumps(..., default=str)`, but I see that there is a way of enforcing this in Flask-RESTful. It's just not clear to me what exactly needs to be done.
In particular, I read:
It is possible to configure how the default Flask-RESTful JSON representation will format JSON by providing a RESTFUL_JSON attribute on the application configuration. This setting is a dictionary with keys that correspond to the keyword arguments of json.dumps().
class MyConfig(object):
    RESTFUL_JSON = {'separators': (', ', ': '),
                    'indent': 2,
                    'cls': MyCustomEncoder}
But it's not clear to me where exactly this needs to be placed. I tried a few things and it didn't work.
EDIT:
I finally solved it with this:
Right after
api = Api(app)
I added:
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            # return int(obj.strftime('%s'))
            return str(obj)
        elif isinstance(obj, datetime.date):
            # return int(obj.strftime('%s'))
            return str(obj)
        return json.JSONEncoder.default(self, obj)

def custom_json_output(data, code, headers=None):
    dumped = json.dumps(data, cls=CustomEncoder)
    resp = make_response(dumped, code)
    resp.headers.extend(headers or {})
    return resp

api = Api(app)
api.representations.update({
    'application/json': custom_json_output
})
Just to clear this up: you only have to do the following:
app = Flask(__name__)
api = Api(app)
app.config['RESTFUL_JSON'] = {'cls':MyCustomEncoder}
This works both for plain Flask and Flask-RESTful.
NOTES:
1) Apparently the following part of the documentation is not that clear:
It is possible to configure how the default Flask-RESTful JSON representation will format JSON by providing a RESTFUL_JSON attribute on the application configuration. This setting is a dictionary with keys that correspond to the keyword arguments of json.dumps().
class MyConfig(object):
    RESTFUL_JSON = {'separators': (', ', ': '),
                    'indent': 2,
                    'cls': MyCustomEncoder}
2) Apart from the 'cls' argument, you can actually override any keyword argument of the json.dumps function.
Having created a Flask app, e.g. like so:
root_app = Flask(__name__)
place MyConfig in some module e.g. config.py and then configure root_app like:
root_app.config.from_object('config.MyConfig')
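Putting the pieces together, here is a minimal, hedged sketch of the config-based approach; the resource, encoder, and route names are illustrative, not taken from the question:
import datetime
import json

from flask import Flask
from flask_restful import Api, Resource

class DateTimeEncoder(json.JSONEncoder):
    def default(self, obj):
        # Convert dates and datetimes to strings; defer everything else.
        if isinstance(obj, (datetime.datetime, datetime.date)):
            return str(obj)
        return super(DateTimeEncoder, self).default(obj)

app = Flask(__name__)
api = Api(app)
app.config['RESTFUL_JSON'] = {'cls': DateTimeEncoder}

class Now(Resource):
    def get(self):
        # Without the custom encoder this datetime would not be JSON serialisable.
        return {'now': datetime.datetime.utcnow()}

api.add_resource(Now, '/now')

if __name__ == '__main__':
    app.run()
A GET to /now should then come back as JSON with the datetime rendered as a string instead of raising a serialisation TypeError.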

Beaker session in bottle

While using Beaker sessions, I want to use the same session object across the whole application.
I came across this url: Bottle.py session with Beaker
But I am still getting a 'KeyError' when I try to access, in one function, a session value saved by another function.
My rest.py file looks like:
import bottle
from bottle import route, request, default_app
from beaker.middleware import SessionMiddleware

app = bottle.default_app()

@bottle.hook('before_request')
def setup_request():
    request.session = request.environ['beaker.session']

@app.route('/login')
def login():
    request.session['uname'] = 'user'

@app.route('/logout')
def logout():
    print request.session['uname']
    # expecting to print user

session_opts = {
    'session.type': 'file',
    'session.data_dir': '/tmp/',
    'session.cookie_expires': True,
}
app = SessionMiddleware(bottle.default_app(), session_opts)
I have mentioned the SessionMiddleware at the end, as I was getting errors otherwise; I did this with the help of this link: https://groups.google.com/forum/#!topic/bottlepy/m0akSbWRpZg
But when I access request.session in the logout function I get:
KeyError: 'uname'
Can anyone give a clear example of how to adjust the code in order to maintain the same session across the whole application?
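For reference, a minimal sketch of the usual Bottle + Beaker pattern, assuming file-backed sessions as in the question; the key difference from the code above is the explicit session.save() call, without which Beaker does not persist the value between requests (route bodies and return strings are illustrative):
import bottle
from bottle import request
from beaker.middleware import SessionMiddleware

session_opts = {
    'session.type': 'file',
    'session.data_dir': '/tmp/',
    'session.cookie_expires': True,
}
app = SessionMiddleware(bottle.default_app(), session_opts)

@bottle.route('/login')
def login():
    session = request.environ['beaker.session']
    session['uname'] = 'user'
    session.save()  # persist the value so later requests can read it
    return 'logged in'

@bottle.route('/logout')
def logout():
    session = request.environ['beaker.session']
    return 'logging out %s' % session.get('uname', 'nobody')

if __name__ == '__main__':
    bottle.run(app=app, host='localhost', port=8080)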

How to scrape pages after login

I am trying to find a way to scrape and parse more pages in the signed-in area.
These are example links, accessible after signing in, that I want to parse:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each one of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to add a "for" or "while" loop to issue more requests for the other links, it did not let me, because I cannot put more return statements into a generator; it returns an error. I also tried link extractors, but they didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
When I am signed in, it returns a page whose url contains "stat"; that is why I put the first "if" condition there.
When I am signed in, I request one link and call the parse_items function.
    def parse(self, response):
        # when "stat" is in the url it means that I just signed in
        if "stat" in response.url:
            return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
        else:
            # this successful login takes me to a page whose url contains "stat"
            return [FormRequest.from_response(response,
                formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login', 'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
                callback=self.parse)]
The parse_items function simply parses the desired content from one page:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
Can you please help me update this code so it opens and parses more than one page per session?
I don't want to sign in over and over for each request.
The session most likely depends on cookies, and Scrapy manages that by itself, i.e.:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
    next_url = ''  # find url to next page in the current page
    if next_url:
        yield Request(next_url, self.parse_items)
        # scrapy will retain the session for the next page if it's managed by cookies
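For the original goal of visiting several known ids after a single login, a hedged rewrite of the question's parse callback is sketched below; it yields one Request per id instead of returning a single Request, which is what lets one callback schedule many pages (the ids are the example ones from the question, and parse_items is assumed to be the method shown earlier):
from scrapy import Spider
from scrapy.http import FormRequest, Request

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]

    def parse(self, response):
        if "stat" in response.url:
            # Signed in: queue every detail page. Yielding lets a single
            # callback emit many requests, which a lone return cannot.
            for demand_id in (305554, 305553, 305552):
                url = "http://example.com/seller/demand/?id={}".format(demand_id)
                yield Request(url, callback=self.parse_items)
        else:
            # Not signed in yet: submit the login form and come back here.
            yield FormRequest.from_response(
                response,
                formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                          'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
                callback=self.parse)

    def parse_items(self, response):
        # Same row-by-row parsing as in the question; omitted here for brevity.
        pass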
I am currently working on the same problem. I use InitSpider so I can override __init__ and init_request. The first is just for initialisation of custom stuff; the actual magic happens in my init_request:
def init_request(self):
    """This function is called before crawling starts."""
    # Do not start a request on error,
    # simply return nothing and quit scrapy
    if self.abort:
        return
    # Do a login
    if self.login_required:
        # Start with login first
        return Request(url=self.login_page, callback=self.login)
    else:
        # Start with the parse function
        return Request(url=self.base_url, callback=self.parse)
My login looks like this:
def login(self, response):
    """Generate a login request."""
    self.log('Login called')
    return FormRequest.from_response(
        response,
        formdata=self.login_data,
        method=self.login_method,
        callback=self.check_login_response
    )
self.login_data is a dict with post values.
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy
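The answer references a check_login_response callback without showing it; here is a hedged sketch of what such a callback might look like in an InitSpider-based spider (the 'Logout' marker is purely illustrative; use any text that only appears for authenticated users on the target site):
def check_login_response(self, response):
    """Verify the login worked before the real crawl starts."""
    if b'Logout' in response.body:
        self.log('Login successful')
        # Returning self.initialized() tells InitSpider that initialisation
        # is done, so normal crawling of the start urls can begin.
        return self.initialized()
    self.log('Login failed, aborting')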

Aggregate data from list of Requests in a crawler

Perhaps I am missing something simple, so I hope this is an easy question. I am using Scrapy to parse a directory listing and then pull down each appropriate web page (actually a text file) and parse it out using Python.
Each page has a set of data I am interested in, and I update a global dictionary each time I encounter such an item in each page.
What I would like to do is simply print out an aggregate summary once all the Request calls are complete; however, nothing after the yield statement runs. I am assuming that is because yield is actually returning a generator and the rest of the method bails out.
I'd like to avoid writing a file for each Request object if possible... I'd rather keep it self-contained within this Python script.
Here is the code I am using:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request

slain = {}

class GameSpider(BaseSpider):
    name = "game"
    allowed_domains = ["example.org"]
    start_urls = [
        "http://example.org/users/deaths/"
    ]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if 'txt' in link:
                l = self.start_urls[0] + link
                yield Request(l, callback=self.parse_following, dont_filter=True)
        # Ideally print out the aggregate after all Requests are satisfied
        # print "-----"
        # for k, v in slain.iteritems():
        #     print "Slain by %s: %d" % (k, v)
        # print "-----"

    def parse_following(self, response):
        parsed_resp = response.body.rstrip().split('\n')
        for line in parsed_resp:
            if "Slain" in line:
                broken = line.split()
                slain_by = broken[3]
                if slain_by in slain:
                    slain[slain_by] += 1
                else:
                    slain[slain_by] = 1
Scrapy spiders have a closed(reason) method; it is called when the spider finishes.
def closed(self, reason):
    # 'slain' here is the module-level dict from the question's code
    for k, v in slain.iteritems():
        print "Slain by %s: %d" % (k, v)

Django: custom 404 handler that returns 404 status code

The project I'm working on has some data that needs to get passed to every view, so we have a wrapper around render_to_response called master_rtr. Ok.
Now, I need our 404 pages to run through this as well. Per the instructions, I created a custom 404 handler (cleverly called custom_404) that calls master_rtr. Everything looks good, but our tests are failing, because we're receiving back a 200 OK.
So, I'm trying to figure out how to return a 404 status code instead. There seems to be an HttpResponseNotFound class that's kind of what I want, but I'm not quite sure how to construct all of that instead of using render_to_response. Or rather, I could probably figure it out, but it seems like there must be an easier way; is there?
The appropriate parts of the code:
def master_rtr(request, template, data={}):
    if request.user.is_authenticated():
        # Since we're only grabbing the enrollments to get at the courses,
        # doing select_related() will save us from having to hit database for
        # every course the user is enrolled in
        data['courses'] = \
            [e.course for e in
             Enrollment.objects.select_related().filter(user=request.user)
             if e.view]
    else:
        if "anonCourses" in request.session:
            data['courses'] = request.session['anonCourses']
        else:
            data['courses'] = []
    data['THEME'] = settings.THEME
    return render_to_response(template, data, context_instance=RequestContext(request))

def custom_404(request):
    response = master_rtr(request, '404.html')
    response.status_code = 404
    return response
The easy way:
def custom_404(request):
    response = master_rtr(...)
    response.status_code = 404
    return response
But I have to ask: why aren't you just using a context processor along with a RequestContext to pass the data to the views?
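For what it's worth, a hedged sketch of that context-processor suggestion is below; the module path and function name are illustrative, and the queries simply mirror master_rtr above (old-style Django, matching the question's code):
# myapp/context_processors.py
from django.conf import settings

from myapp.models import Enrollment  # assumption about where Enrollment lives

def courses_and_theme(request):
    if request.user.is_authenticated():
        # Same enrollment query as master_rtr above.
        courses = [e.course for e in
                   Enrollment.objects.select_related().filter(user=request.user)
                   if e.view]
    else:
        courses = request.session.get('anonCourses', [])
    return {'courses': courses, 'THEME': settings.THEME}
Once the dotted path to this function is added to TEMPLATE_CONTEXT_PROCESSORS (or, on newer Django, the context_processors list inside the TEMPLATES setting), every template rendered with a RequestContext sees courses and THEME without going through master_rtr.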
Just set status_code on the response.
Into your application's views.py add:
# Imports
from django.shortcuts import render
from django.http import HttpResponse
from django.template import Context, loader

##
# Handle 404 Errors
# @param request WSGIRequest list with all HTTP Request
def error404(request):
    # 1. Load models for this view
    # from idgsupply.models import My404Method
    # 2. Generate Content for this view
    template = loader.get_template('404.htm')
    context = Context({
        'message': 'All: %s' % request,
    })
    # 3. Return Template for this view + Data
    return HttpResponse(content=template.render(context), content_type='text/html; charset=utf-8', status=404)
The secret is in the last line: status=404
Hope it helped!
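For completeness, the view above still has to be registered as the project's 404 handler; a hedged sketch of the wiring in the root urls.py (the module path is illustrative):
# urls.py (root URLconf)
# handler404 takes a dotted path on older Django, or the view callable itself on newer versions.
handler404 = 'myapp.views.error404'
Keep in mind that Django only calls custom error handlers when DEBUG is False.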