Reading a site into a dataframe, after logging in via Requests - python-2.7

I'm trying to create a data frame from an HTML page that has my weekly football picks.
I have gotten past the login via "requests" using the following code:
import pandas as pd
import requests

with requests.session() as c:
    url = 'https://auth.cbssports.com/login?master_product=150&xurl=http%3A%2F%2Fwww.cbssports.com%2F'
    c.get(url)
    login_data = {'dummy::login_form': '1',
                  'form::login_form': 'login_form',
                  'xurl': 'http://funkybunk.football.cbssports.com/office-pool/standings/live',
                  'master_product': 'XXXX',
                  'vendor': 'cbssports',
                  'userid': 'XXXX#gmail.com',
                  'password': 'XXXX',
                  '_submit': 'Sign+In'}
    c.post(url, data=login_data,
           headers={'Referer': 'https://auth.cbssports.com/login?master_product=150&xurl=http%3A%2F%2Fwww.cbssports.com%2F'})
    page = c.get("http://funkybunk.football.cbssports.com/office-pool/standings/live/1")
However, I don't know how to get the info from "page" into a dataframe.
When I use "pd.read_html" with the URL, it still reads the login page instead of the page I pass to it.
week1 = pd.read_html("http://funkybunk.football.cbssports.com/office-pool/standings/live/1")
print(week1)
# trying to read it into a data frame; however, it prints out the login page, not the page I accessed via "page"
If I try:
week1 = pd.read_html(page)
# I get an error saying it can't read from Response objects
If I run
print(page.content)
# I'm able to see all of my info
# such as my team picks, etc.
""" I"m able to see all this info and more: id":XXX"41","snsScore":0,"numGames":16,"picks":{"NFL_MIA#SEA":{"winner":"SEA"},"time":"1473309152","NFL_BUF#BAL":{"winner":"BAL"},"NFL_OAK#NO":{"winner":"OAK"},"NFL_NYG#DAL":{"winner":"NYG"},"NFL_SD#KC":{"winner":"SD"},"NFL_GB#JAC":{"winner":"GB"},"NFL_MIN#TEN":{"winner":"MIN"},"NFL_CAR#DEN":{"winner":"CAR"},"mnf":"43","NFL_NE#ARI":{"winner":"ARI"},"NFL_DET#IND":{"winner":"IND"},"NFL_CHI#HOU":{"winner":"HOU"},"NFL_TB#ATL":{"winner":"TB"},"NFL_LAR#SF":{"winner":"LAR"},"NFL_CLE#PHI":{"winner":"PHI"},"NFL_PIT#WAS":{"winner":"PIT"},"NFL_CIN#NYJ":{"winner":"CIN"}}}"""
# However, when I try to read it into a dataframe
# week1 = pd.read_html(page.content)
#I only see:
#>>> week1
#[ 0 1 2
#0 NaN NaN NaN
#1 NaN Live Scoring Notice NaN
#2 NaN NaN NaN
#3 NaN NaN NaN]
So all the data isn't being read into the dataframe for some reason.
I don't know anything about HTML and was only able to use "requests" by watching a video and some trial and error.
I've been at it a couple of days but I'm stumped.
I'm using Python 2.7 and Windows 7.
Thanks.
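pd.read_html expects HTML text or a URL; given the URL directly it makes its own, unauthenticated request (which is why it only ever sees the login page), and given the Response object it raises the error above. Passing the body of the authenticated response (page.content) gets past that, but the picks here appear to be embedded in the page as JSON rather than as an HTML table, which is why only the "Live Scoring Notice" placeholder comes back. Below is a minimal sketch of one way to pull the picks into a DataFrame, assuming the JSON fragment quoted above sits somewhere in the page source; the regular expression and key names are guesses based on that fragment, not a confirmed structure of the CBS page.
import json
import re

import pandas as pd

html_text = page.content  # body of the authenticated request from the session above

# Assumed pattern: grab the object stored under the "picks" key.
match = re.search(r'"picks"\s*:\s*(\{.*?\}\})', html_text, re.DOTALL)
if match:
    picks = json.loads(match.group(1))
    # Flatten {"NFL_MIA#SEA": {"winner": "SEA"}, ...} into game/winner rows,
    # skipping non-game keys such as "time" and "mnf".
    rows = [(game, info['winner'])
            for game, info in picks.items()
            if isinstance(info, dict) and 'winner' in info]
    week1 = pd.DataFrame(rows, columns=['game', 'winner'])
    print(week1)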

Related

Display number of wrong attempts of user

In my Django app I want to show that the user has tried logging in with the wrong password three times.
I am using django-brutebuster.
After running migrate, I can see a table in my Postgres database named BruteBuster_failedattempt, which has the following columns:
id
username
IP
failures
timestamp
settings.py
Installed apps:
    ...
    'BruteBuster',
Middleware:
    ...
    'BruteBuster.middleware.RequestMiddleware',
Block threshold:
BB_MAX_FAILURES = 3
BB_BLOCK_INTERVAL = 3
I want to display the count of failed attempts on my django template.
This is just a model, so you can query it directly:
from BruteBuster.models import FailedAttempt
from django.db.models import Sum
total_failed = FailedAttempt.objects.filter(
    username=request.user.username
).aggregate(
    total_failures=Sum('failures')
)['total_failures']
There can, however, be some problems here:
BruteBuster tracks the number of failures per username and per IP, so we need to sum these up;
if the user changes their username, there are no longer any FailedAttempt records for that name, so this might not work perfectly in such scenarios; and
if there are no failed attempts, the query returns None, not 0, but you can add or 0 at the end to convert None to 0 (as in the sketch below).
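For the template part of the question, here is a minimal sketch of wiring this into a view; the function-based view and template name (profile and profile.html) are purely illustrative and not from the original question.
from django.db.models import Sum
from django.shortcuts import render

from BruteBuster.models import FailedAttempt


def profile(request):
    # Sum the per-username/per-IP failure counters; fall back to 0 when there
    # are no FailedAttempt rows for this username.
    total_failed = FailedAttempt.objects.filter(
        username=request.user.username
    ).aggregate(total_failures=Sum('failures'))['total_failures'] or 0
    return render(request, 'profile.html', {'failed_attempts': total_failed})
The template can then display it with {{ failed_attempts }}.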

Having issues with Python xpath scraping

I'm back again with a question for the wonderful people here :)
I've recently begun getting back into Python (50% done at Codecademy, lol) and decided to make a quick script for web-scraping the spot price of gold in CAD. This will eventually be part of a much bigger script... but I'm VERY rusty and thought it would be a good project.
My issue:
I have been following the guide at http://docs.python-guide.org/en/latest/scenarios/scrape/ to accomplish my goal; however, my script always returns/prints
<Element html at 0xRANDOM>
with RANDOM being (I assume) a random hex number. This happens no matter what website I use.
My Code:
#!/bin/python
# Scrape current gold spot price in CAD
from lxml import html
import requests

def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)
    print "The full page is: ", tree  # added for debug WHERE ERROR OCCURS
    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    print "Scraped content: ", bid
    return bid

gold_scraper = scraped_price()
My research:
1) www.w3schools.com/xsl/xpath_syntax.asp
This is where I figured out to use '//span' to find all 'span' elements and then use the @id predicate to narrow it down to the one I need.
2) Scraping web content using xpath won't work
This makes me think I simply have a bad tree.xpath setup. However, I cannot seem to figure out where or why.
Any assistance would be greatly appreciated.
<Element html at 0xRANDOM>
What you see printed is the string representation of lxml.html's Element class. If you want to see the actual HTML content, use tostring():
print(html.tostring(tree, pretty_print=True))
You are also getting Scraped content: [] printed, which means that no elements matched the locator. And, if you look at the HTML printed out above, there is actually no element with id="gpotickerLeftCAD_price" in the downloaded source.
The prices on this particular site are retrieved dynamically, via JSONP GET requests issued periodically. You can either look into simulating those requests (a rough sketch follows the demo below), or stay on a higher level and automate a browser via selenium. Demo (using the PhantomJS headless browser):
>>> import time
>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://goldprice.org/gold-price-canada.html")
>>> while True:
... print(driver.find_element_by_id("gpotickerLeftCAD_price").text)
... time.sleep(1)
...
1,595.28
1,595.28
1,595.28
1,595.28
1,595.28
1,595.19
...
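For the first option mentioned above (simulating the requests instead of driving a browser), this is a rough sketch of the usual shape of such code; the endpoint URL is deliberately left as a placeholder because it has to be copied from the browser's network tab, and the callback-stripping regex assumes the conventional callbackName({...}) JSONP wrapper.
import json
import re

import requests

# Placeholder: paste the JSONP URL observed in the browser's developer tools here.
jsonp_url = '...'

resp = requests.get(jsonp_url)
# Strip the JSONP wrapper, e.g. someCallback({...});, before parsing the payload.
payload = re.search(r'\((\{.*\})\)', resp.text, re.DOTALL).group(1)
data = json.loads(payload)
print data  # inspect the structure to find the CAD spot price field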

Tweepy location on Twitter API filter always throws 406 error

I'm using the following code (from django management commands) to listen to the Twitter stream - I've used the same code in a separate command to track keywords successfully - and I've branched this out to use location, wanting (apparently rightly) to test it without disrupting my existing analysis that's running.
I've followed the docs and have made sure the box is in Long/Lat format (in fact, I'm using the example long/lat from the Twitter docs now). It looks broadly the same as the question here, and I tried using their version of the code from the answer - same error. If I switch back to using 'track=...', the same code works, so it's a problem with the location filter.
Adding a print debug inside tweepy's streaming.py so I can see what's happening, I print out self.parameters, self.url and self.headers from _run, and get:
{'track': 't,w,i,t,t,e,r', 'delimited': 'length', 'locations': '-121.7500,36.8000,-122.7500,37.8000'}
/1.1/statuses/filter.json?delimited=length and
{'Content-type': 'application/x-www-form-urlencoded'}
respectively - which seems to me to be missing the location search in some way, shape or form. I'm obviously not the only one using tweepy's location search, so I think it's more likely a problem in my use of it than a bug in tweepy (I'm on 2.3.0), but my implementation looks right as far as I can tell.
My stream handling code is here:
consumer_key = 'stuff'
consumer_secret = 'stuff'
access_token = 'stuff'
access_token_secret_var = 'stuff'

import tweepy
import json

# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        #print type(decoded), decoded
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        try:
            user, created = read_user(decoded)
            print "DEBUG USER", user, created
            if decoded['lang'] == 'en':
                tweet, created = read_tweet(decoded, user)
                print "DEBUG TWEET", tweet, created
            else:
                pass
        except KeyError, e:
            print "Error on Key", e
            pass
        except DataError, e:
            print "DataError", e
            pass
        #print user, created
        print ''
        return True

    def on_error(self, status):
        print status

l = StdOutListener()
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret_var)
stream = tweepy.Stream(auth, l)
# locations must be long, lat
stream.filter(locations=[-121.75, 36.8, -122.75, 37.8], track='twitter')
The issue here was the order of the coordinates.
The correct format is:
southwest corner (long, lat), then northeast corner (long, lat). I had them transposed. :(
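Applied to the code in the question, the corrected bounding box would look like this:
# Southwest corner (long, lat) first, then northeast corner (long, lat).
stream.filter(locations=[-122.75, 36.8, -121.75, 37.8])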
The streaming API doesn't allow filtering by location AND keyword simultaneously.
You can refer to this answer - I had the same problem earlier:
https://stackoverflow.com/a/22889470/4432830

Tweepy rate limit / pagination issue.

I've put together a small Twitter tool to pull relevant tweets for later use in a latent semantic analysis. Ironically, that bit (the more complicated bit) works fine - it's pulling the tweets that's the problem. I'm using the code below to set it up.
This technically works, but not as expected - I thought the .items(200) parameter would pull 200 tweets per request, but it's being broken into 15-tweet chunks (so the 200 items 'cost' me 13 requests). I understand that this is the original/default RPP variable (now 'count' in the Twitter docs), but I've tried that in the Cursor setting (rpp=100, which is the maximum in the Twitter documentation), and it makes no difference.
Tweepy/Cursor docs
The other nearest similar question isn't quite the same issue
Grateful for any thoughts! I'm sure it's a minor tweak to the settings, but I've tried various settings on page and rpp, to no avail.
auth = tweepy.OAuthHandler(apikey, apisecret)
auth.set_access_token(access_token, access_token_secret_var)

from tools import read_user, read_tweet
from auth import basic

api = tweepy.API(auth)
current_results = []

from tweepy import Cursor

for tweet in Cursor(api.search,
                    q=search_string,
                    result_type="recent",
                    include_entities=True,
                    lang="en").items(200):
    current_user, created = read_user(tweet.author)
    current_tweet, created = read_tweet(tweet, current_user)
    current_results.append(tweet)

print current_results
I worked it out in the end, with a little assistance from colleagues. As far as I can tell, the rpp and items() calls are applied after the actual API call. The 'count' option from the Twitter documentation, which was formerly RPP as mentioned above and is still noted as rpp in Tweepy 2.3.0, seems to be at issue here.
What I ended up doing was modifying the Tweepy code - in api.py, I added 'count' into the search bind section (around line 643 in my install, ymmv).
""" search """
search = bind_api(
path = '/search/tweets.json',
payload_type = 'search_results',
allowed_param = ['q', 'count', 'lang', 'locale', 'since_id', 'geocode', 'max_id', 'since', 'until', 'result_type', **'count**', 'include_entities', 'from', 'to', 'source']
)
This allowed me to tweak the code above to:
for tweet in Cursor(api.search,
                    q=search_string,
                    count=100,
                    result_type="recent",
                    include_entities=True,
                    lang="en").items(200):
Which results in two calls, not fifteen; I've double-checked this with
print api.rate_limit_status()["resources"]
after each call, and it's only decrementing my remaining searches by 2 each time.
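A small sketch of the check described above, assuming the same api object; the nested keys follow the standard structure of the /1.1/application/rate_limit_status.json response.
# How many /search/tweets calls remain in the current 15-minute window.
search_status = api.rate_limit_status()["resources"]["search"]["/search/tweets"]
print "remaining:", search_status["remaining"], "of", search_status["limit"]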

Using Mechanize to login to Bing on Python 2.7.5

I am trying to use Python 2.7.5 and the mechanize library to create a program that logs me into my Microsoft account on bing.com. To start out I have created this program to print out the names of the forms on this webpage, so I can reference them in later code. My current code is this (sorry about the long URL):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
br.open("https://login.live.com/ppsecure/post.srf?wa=wsignin1.0&rpsnv=11&ct=1375231095&rver=6.0.5286.0&wp=MBI&wreply=http:<%2F%2Fwww.bing.com%2FPassport.aspx%3Frequrl%3Dhttp%253a%252f%252fwww.bing.com%252f&lc=1033&id=264960&bk=1375231423")
print(br.title())

forms_printed = 0
for form in br.forms():
    print form
    forms_printed += 1
if forms_printed == 0:
    print "No forms to print!"
Despite the fact that when I visit the webpage in Firefox I see the username and password form, when I run this code the result is always "No forms to print!" Am I using mechanize wrong here, or is the website intentionally stopping me from finding those forms? Any tips and/or advice are greatly appreciated.
If you read the HTML that you are receiving, you will see that the webpage requires JavaScript.
Example:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
page = br.open("https://login.live.com/ppsecure/post.srf?wa=wsignin1.0&rpsnv=11&ct=1375231095&rver=6.0.5286.0&wp=MBI&wreply=http:<%2F%2Fwww.bing.com%2FPassport.aspx%3Frequrl%3Dhttp%253a%252f%252fwww.bing.com%252f&lc=1033&id=264960&bk=1375231423")
print page.read()
print(br.title())

forms_printed = 0
for form in br.forms():
    print form
    forms_printed += 1
if forms_printed == 0:
    print "No forms to print!"
Output:
Microsoft account
JavaScript required to sign in
Microsoft account requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.
To find out whether your browser supports JavaScript, or to allow scripts, see the browser's online help.
See related questions about that
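The answer above stops at the diagnosis. One common next step, not part of that answer, is to load the page in a JavaScript-capable browser via selenium (similar to the PhantomJS demo earlier on this page) and work with the rendered sign-in form from there; here is a minimal sketch, assuming selenium and a Firefox driver are installed.
from selenium import webdriver

# A real browser executes the JavaScript that mechanize cannot, so the
# sign-in form is actually present in the rendered page.
driver = webdriver.Firefox()
driver.get("https://login.live.com/")
print driver.title  # title of the rendered Microsoft account sign-in page
driver.quit()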