Iterating tweets and saving them in a shapefile with tweepy and shapefile - if-statement

I am searching for tweets and want to save them in a shapefile. Iterating through the tweets goes well, and when I use print statements I get exactly what I want. I am now attempting to put these tweets into a point shapefile, but for some reason the writing inside the if statement does not work. So how do I iterate through my tweets and save them one by one in my point shapefile, with only tweet.text and tweet.id attached?
I got inspired by looking at the following link: https://code.google.com/p/pyshp/
import tweepy
import shapefile
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."
auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
tweetsaspoints = shapefile.Writer(shapefile.POINT)
page = 1
while True:
    statuses = api.search(q="*", count=1000, geocode="52.015106,5.394287,150km")
    if statuses:
        for status in statuses:
            print status.geo
            tweetsaspoints._shapes.extend([status.geo['coordinates']])
            tweetsaspoints.records.extend([("TEXT", "Test")])
    else:
        # All done
        break
    page += 1  # next page
tweetsaspoints.save('shapefiles/test/point')
I do not understand the page part. I seem to iterate through the same tweets over and over again. Also, I am not succeeding in writing my coordinates and data to a point shapefile.

As per the documentation, try:
for status in tweepy.Cursor(api.user_timeline).items():
    # process status here
    process_status(status)
Alternative:
page = 1
while True:
    statuses = api.user_timeline(page=page)
    if statuses:
        for status in statuses:
            # process status here
            process_status(status)
    else:
        # All done
        break
    page += 1  # next page
reference: http://docs.tweepy.org/en/latest/cursor_tutorial.html
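Putting the two pieces together for the original question: below is a minimal sketch that pages through search results with a Cursor and writes one point per geotagged tweet using pyshp's 1.x Writer API (paired point()/record() calls instead of touching _shapes directly). The field names, the 1000-item cap and the assumption that status.geo holds (lat, lon) are illustrative, not taken from the question; in tweepy 4.x api.search is called api.search_tweets.
import tweepy
import shapefile  # pyshp 1.x

# keys and auth exactly as in the question
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

w = shapefile.Writer(shapefile.POINT)
w.field('TEXT', 'C', 140)  # tweet text
w.field('ID', 'C', 40)     # tweet id, stored as a string

# Cursor handles the paging, so no manual page counter is needed
for status in tweepy.Cursor(api.search, q="*",
                            geocode="52.015106,5.394287,150km").items(1000):
    if status.geo:  # only tweets that actually carry a point
        lat, lon = status.geo['coordinates']  # geo is (lat, lon); a shapefile wants x=lon, y=lat
        w.point(lon, lat)
        w.record(status.text, str(status.id))

w.save('shapefiles/test/point')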

Related

How to get and save into file the full list of twitter account followers with Tweepy

I wrote this code to get the full list of twitter account followers using Tweepy:
# ... twitter connection and streaming
fulldf = pd.DataFrame()
line = {}
ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)
                line = [{'id': user.id,
                         'Name': user.name,
                         'Statuses Count': user.statuses_count,
                         'Friends Count': user.friends_count,
                         'Screen Name': user.screen_name,
                         'Followers Count': user.followers_count,
                         'Location': user.location,
                         'Language': user.lang,
                         'Created at': user.created_at,
                         'Time zone': user.time_zone,
                         'Geo enable': user.geo_enabled,
                         'Description': user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i, len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']
    print e2.args[0][0]['code']
At the end I have only 1000 lines in the CSV file. It's not the best solution to keep everything in memory (a DataFrame) and write it to the file inside the same loop, but at least I have something that works. Still, I'm not getting the full list, just 1000 out of 15000 followers.
Any help with this will be appreciated.
Consider the following part of your code:
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)
As you use extend for each page, you simply add the new set of ids onto the end of your list of ids. The way you have nested your for statements means that with every new page you return, you call get_user for all of the previous pages' ids first. As such, by the time you reach the final page of ids you'd still be looking at the first 1000 or so when you hit the rate limit and have no more pages to browse. You're also likely hitting the rate limit for your cursor, which would be why you're seeing the exception.
Let's start over a bit.
Firstly, tweepy can deal with rate limits (one of the main error sources) for you when you create your API if you use wait_on_rate_limit. This solves a whole bunch of problems, so we'll do that.
Secondly, if you use lookup_users, you can look up 100 user objects per request. I've written about this in another answer so I've taken the method from there.
Finally, we don't need to create a dataframe or export to a csv until the very end. If we get a list of user information dictionaries, this can quickly change to a DataFrame with no real effort from us.
Here is the full code - you'll need to sub in your keys and the username of the user you actually want to look up, but other than that it hopefully will work!
import tweepy
import pandas as pd

def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range((users_count / 100) + 1):
            print i
            full_users.extend(api.lookup_users(user_ids=user_id_list[i * 100:min((i + 1) * 100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print 'Something went wrong, quitting...'
consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)

all_users = [{'id': user.id,
              'Name': user.name,
              'Statuses Count': user.statuses_count,
              'Friends Count': user.friends_count,
              'Screen Name': user.screen_name,
              'Followers Count': user.followers_count,
              'Location': user.location,
              'Language': user.lang,
              'Created at': user.created_at,
              'Time zone': user.time_zone,
              'Geo enable': user.geo_enabled,
              'Description': user.description}
             for user in results]
df = pd.DataFrame(all_users)
df.to_csv('All followers.csv', index=False, encoding='utf-8')

Why does scrapy miss some links?

I am scraping the web-site "www.accell-group.com" using the "scrapy" library for Python. The site is scraped completely, in total 131 pages (text/html) and 2 documents (application/pdf) are identified. Scrapy did not throw any warnings or errors. My algorithm is supposed to scrape every single link. I use CrawlSpider.
However, when I look into the page "http://www.accell-group.com/nl/investor-relations/jaarverslagen/jaarverslagen-van-accell-group.htm", which is reported by "scrapy" as scraped/processed, I see that there are more pdf-documents, for example "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf". I cannot find any reasons for it not to be scraped. There is no dynamic/JavaScript content on this page. It is not forbidden in "http://www.airproducts.com/robots.txt".
Do you have any idea why this can happen?
Could it be because the "files" folder is not in "http://www.accell-group.com/sitemap.xml"?
Thanks in advance!
My code:
class PyscrappSpider(CrawlSpider):
    """This is the Pyscrapp spider"""
    name = "PyscrappSpider"

    def __init__(self, *a, **kw):
        # Get the passed URL
        originalURL = kw.get('originalURL')
        logger.debug('Original url = {}'.format(originalURL))
        # Add a protocol, if needed
        startURL = 'http://{}/'.format(originalURL)
        self.start_urls = [startURL]
        self.in_redirect = {}
        self.allowed_domains = [urlparse(i).hostname.strip() for i in self.start_urls]
        self.pattern = r""
        self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"]), callback="parse_data", follow=True), )
        # Get WARC writer
        self.warcHandler = kw.get('warcHandler')
        # Initialise the base constructor
        super(PyscrappSpider, self).__init__(*a, **kw)

    def parse_start_url(self, response):
        if response.request.meta.has_key("redirect_urls"):
            original_url = response.request.meta["redirect_urls"][0]
            if (not self.in_redirect.has_key(original_url)) or (not self.in_redirect[original_url]):
                self.in_redirect[original_url] = True
                self.allowed_domains.append(original_url)
        return self.parse_data(response)

    def parse_data(self, response):
        """This function extracts data from the page."""
        self.warcHandler.write_response(response)
        pattern = self.pattern
        # Check if we are interested in the current page
        if (not response.request.headers.get('Referer')
                or re.search(pattern, self.ensure_not_null(response.meta.get('link_text')), re.IGNORECASE)
                or re.search(r"/(" + pattern + r")", self.ensure_not_null(response.url), re.IGNORECASE)):
            logging.debug("This page gets processed = %(url)s", {'url': response.url})
            sel = Selector(response)
            item = PyscrappItem()
            item['url'] = response.url
            return item
        else:
            logging.warning("This page does NOT get processed = %(url)s", {'url': response.url})
            return response.request
Remove or expand appropriately your "allowed_domains" variable and you should be fine. All the URLs the spider follows, by default, are restricted by allowed_domains.
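For example, a hypothetical relaxation inside __init__ (the attribute and variable names match the spider above; the second host is made up for illustration):
# Option 1: disable offsite filtering entirely - an empty list means no domain restriction
self.allowed_domains = []
# Option 2: keep the start host but add any other hosts you expect links to point at
self.allowed_domains = [urlparse(startURL).hostname.strip(), 'files.example.com']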
EDIT: This case concerns PDFs in particular. PDFs are explicitly excluded as extensions by the default value of deny_extensions (see here), which is IGNORED_EXTENSIONS (see here).
To allow your spider to crawl PDFs, all you have to do is exclude them from IGNORED_EXTENSIONS by setting the value of deny_extensions explicitly:
from scrapy.linkextractors import IGNORED_EXTENSIONS

self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"],
                                 deny_extensions=set(IGNORED_EXTENSIONS) - set(['pdf'])),
                   callback="parse_data", follow=True), )
So, I'm afraid, this is the answer to the question "Why does Scrapy miss some links?". As you will likely see, it just opens the door to further questions, like "how do I handle those PDFs?", but I guess that is the subject of another question.

Scrapy webcrawler gets caught in infinite loop, despite initially working.

Alright, so I'm working on a scrapy-based webcrawler with some simple functionalities. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work and I've gotten the downloading to work, but I can't get the crawling to work. I've read the documentation on the Spider class and on how parse is supposed to work. I've tried returning vs. yielding, and I'm still nowhere. What seems to happen, based on a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is, or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use the Scrapy rule engine, so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next page in restrict_xpaths and the callback will receive the response of each crawled page; a fuller sketch follows below.
rules = (Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), follow=True),)

def parse(self, response):
    print response.url  # response of the followed page
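A fuller sketch of that idea, reusing the URL and the "Next page" link from the question. Note that a CrawlSpider should not override parse itself, so the callback here is named parse_item; treat this as an illustration rather than a drop-in replacement.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParadiseCrawlSpider(CrawlSpider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]
    # Follow every "Next page" link; each followed page is handed to parse_item
    rules = (
        Rule(LinkExtractor(restrict_xpaths=['//a[@title="Next page"]']),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # parsing/downloading of the page goes here
        self.logger.info("Crawled %s", response.url)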

Tweepy location on Twitter API filter always throws 406 error

I'm using the following code (from django management commands) to listen to the Twitter stream - I've used the same code in a separate command to track keywords successfully. I've branched this out to use location, and (apparently rightly) wanted to test this out without disrupting my existing analysis that's running.
I've followed the docs and have made sure the box is in Long/Lat format (in fact, I'm using the example long/lat from the Twitter docs now). It looks broadly the same as the question here, and I tried using their version of the code from the answer - same error. If I switch back to using 'track=...', the same code works, so it's a problem with the location filter.
Adding a debug print inside streaming.py in tweepy so I can see what's happening, I print out self.parameters, self.url and self.headers from _run, and get:
{'track': 't,w,i,t,t,e,r', 'delimited': 'length', 'locations': '-121.7500,36.8000,-122.7500,37.8000'}
/1.1/statuses/filter.json?delimited=length and
{'Content-type': 'application/x-www-form-urlencoded'}
respectively - which seems to me to be missing the location search in some way, shape or form. I'm obviously not the only one using tweepy's location search, so I think it's more likely a problem in my use of it than a bug in tweepy (I'm on 2.3.0), but my implementation looks right as far as I can tell.
My stream handling code is here:
consumer_key = 'stuff'
consumer_secret = 'stuff'
access_token = 'stuff'
access_token_secret_var = 'stuff'

import tweepy
import json

# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        #print type(decoded), decoded
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        try:
            user, created = read_user(decoded)
            print "DEBUG USER", user, created
            if decoded['lang'] == 'en':
                tweet, created = read_tweet(decoded, user)
                print "DEBUG TWEET", tweet, created
            else:
                pass
        except KeyError, e:
            print "Error on Key", e
            pass
        except DataError, e:
            print "DataError", e
            pass
        #print user, created
        print ''
        return True

    def on_error(self, status):
        print status

l = StdOutListener()
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret_var)
stream = tweepy.Stream(auth, l)
# locations must be long, lat
stream.filter(locations=[-121.75, 36.8, -122.75, 37.8], track='twitter')
The issue here was the order of the coordinates.
The correct format is:
SouthWest corner (long, lat), then NorthEast corner (long, lat). I had them transposed. :(
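In other words, a corrected call for the bounding box above would look like this (a sketch reusing the stream object built in the question):
# south-west corner first (long, lat), then north-east corner (long, lat)
stream.filter(locations=[-122.75, 36.8, -121.75, 37.8])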
The streaming API doesn't allow filtering by location AND keyword simultaneously.
You can refer to this answer; I had the same problem earlier:
https://stackoverflow.com/a/22889470/4432830
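A common workaround, not taken from the linked answer but just a sketch, is to stream on locations only and apply the keyword check yourself in on_data. The 'twitter' keyword and the listener name are illustrative; auth is the handler built in the question.
import json
import tweepy

class LocationKeywordListener(tweepy.StreamListener):
    def on_data(self, data):
        decoded = json.loads(data)
        # the stream itself is location-only; drop tweets that don't mention the keyword
        if 'twitter' in decoded.get('text', '').lower():
            print decoded['text']
        return True

    def on_error(self, status):
        print status

stream = tweepy.Stream(auth, LocationKeywordListener())
stream.filter(locations=[-122.75, 36.8, -121.75, 37.8])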

django subprocess p.wait() doesn't return web

With a Django button, I need to launch multiple music files (selected at random).
In my models.py, I have two functions 'playmusic' and 'playmusicrandom' :
def playmusic(self, music):
    if self.isStarted():
        self.stop()
    command = ("sudo /usr/bin/mplayer " + music.path)
    p = subprocess.Popen(command + str(music.path), shell=True)
    p.wait()

def playmusicrandom(request):
    conn = sqlite3.connect(settings.DATABASES['default']['NAME'])
    cur = conn.cursor()
    cur.execute("SELECT id FROM webgui_music")
    list_id = [row[0] for row in cur.fetchall()]
    ### Get three IDs randomly from the list ###
    selected_ids = random.sample(list_id, 3)
    for i in selected_ids:
        music = Music.objects.get(id=i)
        player.playmusic(music)
With this code, three musics are played (one after the other), but the web page is just "Loading..." during execution...
Is there a way to display the refreshed web page to the user during the loop?
Thanks.
Your view is blocked from returning anything to the web server while it is waiting for playmusicrandom() to finish.
You need to arrange for playmusicrandom() to do its task after you've returned the HTTP status from the view.
This means that you likely need a thread (or similar solution).
Your view will have something like this:
import threading

t = threading.Thread(target=player_model.playmusicrandom,
                     args=(request,))
t.setDaemon(True)
t.start()
return HttpResponse()
This code snippet came from here, where you will find more detailed information about the issues you face and possible solutions.