Tweepy rate limit / pagination issue. - django

I've put together a small twitter tool to pull relevant tweets, for later analysis in a latent semantic analysis. Ironically, that bit (the more complicated bit) works fine - it's pulling the tweets that's the problem. I'm using the code below to set it up.
This technically works, but no as expected - the .items(200) parameter I thought would pull 200 tweets per request, but it's being blocked into 15 tweet chunks (so the 200 items 'costs' me 13 requests) - I understand that this is the original/default RPP variable (now 'count' in the Twitter docs), but I've tried that in the Cursor setting (rpp=100, which is the maximum from the twitter documentation), and it makes no difference.
Tweepy/Cursor docs
The other nearest similar question isn't quite the same issue
Grateful for any thoughts! I'm sure it's a minor tweak to the settings, but I've tried various settings on page and rpp, to no avail.
auth = tweepy.OAuthHandler(apikey, apisecret)
auth.set_access_token(access_token, access_token_secret_var)
from tools import read_user, read_tweet
from auth import basic
api = tweepy.API(auth)
current_results = []
from tweepy import Cursor
for tweet in Cursor(api.search,
q=search_string,
result_type="recent",
include_entities=True,
lang="en").items(200):
current_user, created = read_user(tweet.author)
current_tweet, created = read_tweet(tweet, current_user)
current_results.append(tweet)
print current_results

I worked it out in the end, with a little assistance from colleagues. Afaict, the rpp and items() calls are coming after the actual API call. The 'count' option from the Twitter documentation which was formerly RPP as mentioned above, and is still noted as rpp in Tweepy 2.3.0, seems to be at issue here.
What I ended up doing was modifying the Tweepy Code - in api.py, I added 'count' in to the search bind section (around L643 in my install, ymmv).
""" search """
search = bind_api(
path = '/search/tweets.json',
payload_type = 'search_results',
allowed_param = ['q', 'count', 'lang', 'locale', 'since_id', 'geocode', 'max_id', 'since', 'until', 'result_type', **'count**', 'include_entities', 'from', 'to', 'source']
)
This allowed me to tweak the code above to:
for tweet in Cursor(api.search,
q=search_string,
count=100,
result_type="recent",
include_entities=True,
lang="en").items(200):
Which results in two calls, not fifteen; I've double checked this with
print api.rate_limit_status()["resources"]
after each call, and it's only deprecating my remaining searches by 2 each time.

Related

Tweepy API search doesn't have keyword

I am working with Tweepy (python's REST API client) and I'm trying to find tweets by several keywords and without url included in tweet.
But search results are not up to our satisfaction. Looks like query has erros and was stopped. Additionally we had observed that results were returned one-by-one not (as previously) in bulk packs of 100.
Could you please tell me why this search does not work properly?
We wanted to get all tweets mentioning 'Amazon' without any URL links in the text.
We used search shown below. Search results were still containing tweets with URLs or without 'Amazon' keyword.
Could you please let us know what we are doing wrong?
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
searchQuery = 'Amazon OR AMAZON OR amazon filter:-links' # Keyword
new_tweets = api.search(q=searchQuery, count=100,
result_type = "recent",
max_id = sinceId,
lang = "en")
The minus sign should be put before "filter", not before "links", like this:
searchQuery = 'Amazon OR AMAZON OR amazon -filter:links'
Also, I doubt that the count = 100 option is a valid one, since it is not listed on the API documentation (which may not be very up-to-date, though). Try to replace that with rpp = 100 to get tweets in bulk packs.
I am not sure why some of the tweets you find do not contain the "Amazon" keyword, but a possibility is that "Amazon" is contained within the username of the poster. I do not know if you can filter that directly in the query, or even if you would want to filter it, since it would mean you would reject tweets from the official Amazon accounts. I would suggest that, for each tweet the query returns, you check it to make sure it does contain "Amazon".

How to invalidate cache_page in Django?

Here is the problem: I have blog app and I cache the post output view for 5 minutes.
#cache_page(60 * 5)
def article(request, slug):
...
However, I'd like to invalidate the cache whenever a new comment is added to the post.
I'm wondering how best to do so?
I've seen this related question, but it is outdated.
I would cache in a bit different way:
def article(request, slug):
cached_article = cache.get('article_%s' % slug)
if not cached_article:
cached_article = Article.objects.get(slug=slug)
cache.set('article_%s' % slug, cached_article, 60*5)
return render(request, 'article/detail.html', {'article':cached_article})
then saving the new comment to this article object:
# ...
# add the new comment to this article object, then
if cache.get('article_%s' % article.slug):
cache.delete('article_%s' % article.slug)
# ...
This was the first hit for me when searching for a solution, and the current answer wasn't terribly helpful, so after a lot of poking around Django's source, I have an answer for this one.
Yes you can know the key programmatically, but it takes a little work.
Django's page caching works by referencing the request object, specifically the request path and query string. This means that for every request to your page that has a different query string, you will have a different cache key. For most cases, this isn't likely to be a problem, since the page you want to cache/invalidate will be a known string like /blog/my-awesome-year, so to invalidate this, you just need to use Django's RequestFactory:
from django.core.cache import cache
from django.test import RequestFactory
from django.urls import reverse
from django.utils.cache import get_cache_key
cache.delete(get_cache_key(RequestFactory().get("/blog/my-awesome-year")))
If your URLs are a fixed list of values (ie. no differing query strings) then you can stop here. However if you've got lots of different query strings (say ?q=xyz for a search page or something), then your best bet is probably to create a separate cache for each view. Then you can just pass cache="cachename" to cache_page() and later clear that entire cache with:
from django.core.cache import caches
caches["my_cache_name"].clear()
Important note about this tactic
It only really works for unauthenticated pages. The minute your user is logged in, the cookie data is made part of the cache key creation process, and therefore re-creating that key programmatically becomes much harder. I suppose you could try pulling the cookie data out of your session store, but there could be thousands of keys in there, and you'd have to invalidate/pre-cache each and every one of them.

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 gmail accounts and writes them to a csv file using python code below and the enabled admin.sdk + auth 2.0:
def get_accounts(self):
students = []
page_token = None
params = {'customer': 'my_customer'}
while True:
try:
if page_token:
params['pageToken'] = page_token
current_page = self.dir_api.users().list(**params).execute()
students.extend(current_page['users'])
# write each page of data to a file
csv_file = CSVWriter(students, self.output_file)
csv_file.write_file()
# clear the list for the next page of data
del students[:]
page_token = current_page.get('nextPageToken')
if not page_token:
break
except errors.HttpError as error:
break
I would like to retrieve all 119,000 as a lump sum, that is, without having to loop or as a batch call. Is this possible and if so, can you provide example python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accts successfully (takes about 10 minutes to download). Would like to minimize communication errors. Please advise if better method exists or non-looping method also is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by an order of 5. While each call may take slightly longer, you should notice performance increases because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
'customer': 'my_customer',
maxResults': 500,
fields: my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.

Scraping data off flipkart using scrapy

I am trying to scrape some information from flipkart.com for this purpose I am using Scrapy. The information I need is for every product on flipkart.
I have used the following code for my spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem
class WebCrawler(CrawlSpider):
name = "flipkart"
allowed_domains = ['flipkart.com']
start_urls = ['http://www.flipkart.com/store-directory']
rules = [
Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
]
#staticmethod
def parse_flipkart(response):
hxs = HtmlXPathSelector(response)
item = FlipkartItem()
item['featureKey'] = hxs.select('//td[#class="specsKey"]/text()').extract()
yield item
What my intent is to crawl through every product category page(specified by the second rule) and follow the product page(first rule) within the category page to scrape data from the products page.
One problem is that I cannot find a way to control the crawling and scrapping.
Second flipkart uses ajax on its category page and displays more products when a user scrolls to the bottom.
I have read other answers and assessed that selenium might help solve the issue. But I cannot find a proper way to implement it into this structure.
Suggestions are welcome..:)
ADDITIONAL DETAILS
I had earlier used a similar approach
the second rule I used was
Rule(LinkExtractor(allow=['/(.?)/pr?(.?)']),'parse_category', follow=True)
#staticmethod
def parse_category(response):
hxs = HtmlXPathSelector(response)
count = hxs.select('//td[#class="no_of_items"]/text()').extract()
for page num in range(1,count,15):
ajax_url = response.url+"&start="+num+"&ajax=true"
return Request(ajax_url,callback="parse_category")
Now i was confused on what to use for callback "parse_category" or "parse_flipkart"
Thank you for your patience
Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose is already taking it under control, isn't it? If you create proper rules and parse the responses properly, that is all you need. In case you are referring to the actual order in which the pages are scraped, you most likely don't need to do this. You can just parse all the items in whichever order, but gather their location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb in a list:
response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
You don't actually need Selenium, and I believe it would be an overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages, and open the Network tab in the developer window. If there is anything here, click the Clear button to clear things up a bit. Now scroll down until you see that additional items are being loaded, and you will see additional requests listed in the Network panel. Filter them by Documents (1) and click on the request in the left pane (2). You can see the URL for the request (3) and the query parameters that you need to send (4). Note the start parameter which will be the most important since you will have to call this request multiple times while increasing this value to get new items. You can check the response in the Preview pane (5), and you will see that the request from the server is exactly what you need, more items. The rule you use for the items should pick up those links too.
For a more detail overview of scraping with Firebug, you can check out the official documentation.
Since there is no need to use Selenium for your purpose, I shall not cover this point more than adding a few links that show how to use Selenium with Scrapy, if the need ever occurs:
https://gist.github.com/cheekybastard/4944914
https://gist.github.com/irfani/1045108
http://snipplr.com/view/66998/

YouTube API: youtube.search().list won't order by date

I'm trying to get a list of all recent Youtube uploads using Python and API v3.0. I'm using youtube.search().list, but the results were all uploaded at seemingly random times in the last few years, and are not ordered by date. I've attached a sample of my code.
def getSearchResults():
""" Perform an empty search """
search_results = youtube.search().list(
part="snippet",
order="date",
maxResults=50,
type="video",
q=""
).execute()
return search_results
def getVideoDetails(video):
""" Get details of search results """
video_id = video["id"]["videoId"]
video_title = video["snippet"]["title"]
video_date = video["snippet"]["publishedAt"]
return video_id, video_title, video_date
search_results = getSearchResults()
for result in search_results["items"]:
print getVideoDetails(result)
Any help is greatly appreciated!
EDIT: After retrying a few times, sometimes I will get the correct output (the most recently uploaded videos sorted by upload date) and sometimes I won't (resulting in videos from seemingly random times). I have no idea what the result depends on, but at certain times of the day it works, at others it doesn't, so the issue seems to be with the API and not my code.
EDIT 2: Further evidence that the API is at fault: ordering by date or title on the API Reference's Try It section for youtube.search().list is wack, too.
UPDATE: Google have acknowledged the issue and seem to have fixed it.