Yield multiple items using scrapy - python-2.7

I'm scraping data from the following URL:
http://www.indexmundi.com/commodities/?commodity=gasoline
There are two sections which contain price: Gulf Coast Gasoline Futures End of Day Settlement Price and Gasoline Daily Price
I want to scrape data from both sections as two different items. Here is the code which I've written:
if dailyPrice:
item['description'] = u''.join(dailyPrice.xpath(".//h1/text()").extract())
item['price'] = u''.join(dailyPrice.xpath(".//span/text()").extract())
item['unit'] = dailyPrice.xpath(".//div/p/text()").extract()[0].split(',')[-1]
regex = re.compile("Source:(.*)",re.IGNORECASE|re.UNICODE)
result = re.search(regex, u''.join(dailyPrice.xpath(".//div/p/text()").extract()))
if result:
item['source'] = result.group(1).strip()
yield item
if futurePrice:
item['description'] = u''.join(futurePrice.xpath(".//h1/text()").extract())
item['price'] = u''.join(futurePrice.xpath(".//span/text()").extract())
item['unit'] = u''.join(futurePrice.xpath(".//div[2]/table//tr[1]/td/text()").extract())
source = futurePrice.xpath(".//div[2]/table//tr[4]/td/a/text()").extract()
if source:
item['source'] = u' - '.join(source)
else:
item['source'] = ''
yield item
I want to know if this code will work fine or what should be correct way to do this?

It should work just fine. You can yield as many items from a parse callback as you need. Just some notes:
In the second case it's better to create a new item then reusing the old one. Because you never know what has happened to the old item reference. Maybe you are overwriting and losing the previous data.
You can create different item types for your two cases. And in the pipeline treat them differently.

Related

Filtering records with AND requests with django-taggits

I'm trying to find records which have both tags at the same time with the following code:
values = s.split(',')
query = Q()
for item in values:
query &= Q(tags__name__iexact=item.strip())
photos = Photo.objects.filter(query).distinct();
Which returns empty queryset. Django-taggit documentation provides with this example only
Food.objects.filter(tags__name__in=["delicious"])
Which returns all the record containing any tag, but not both at the same time
upd
This is how I did that in my code with "filter chaining".
photos = Photo.objects
for value in values:
photos = photos.filter(tags__name__icontains=value.strip())
Which actually means
photos.filter(...).filter(...).filter(...)...
It is probably superslow, but works for me.
I don't know if this is the best way of doing it, but it's definitely an easy one :-)
values = s.split(',')
sets = [];
# first get a list of sets of photos containing one tag at a time
for tag in values:
sets.append(set(Photo.objects.filter(tags__name__in=[tag])))
# now intersect all sets to get only photos with all tags
photos = set.intersection(*sets)

How expensive are `count` calls for Django querysets?

I have a list of "posts" I have to render. For each post, I must do three filter querysets, OR them together, and then count the number of objects. Is this reasonable? What factors might make this slow?
This is roughly my code:
def viewable_posts(request, post):
private_posts = post.replies.filter(permissions=Post.PRIVATE, author_profile=request.user.user_profile).order_by('-modified_date')
community_posts = post.replies.filter(permissions=Post.COMMUNITY, author_profile__in=request.user.user_profile.following.all()).order_by('-modified_date')
public_posts = post.replies.filter(permissions=Post.PUBLIC).order_by('-modified_date')
mixed_posts = private_posts | community_posts | public_posts
return mixed_posts
def viewable_posts_count(request, post):
return viewable_posts(request, post).count()
The biggest factor I can see is that you have filter actions on each post. If possible, you should query the results associated with each post in ONE query. As of the count, it's the most efficient way of getting the number of results from a query, so it's likely not a problem.
Try the following code:
def viewable_posts(request, post):
private_posts = post.replies.filter(permissions=Post.PRIVATE, author_profile=request.user.user_profile).values_list('id',flat=True)
community_posts = post.replies.filter(permissions=Post.COMMUNITY, author_profile__in=request.user.user_profile.following.values_list('id',flat=True)
public_posts = post.replies.filter(permissions=Post.PUBLIC).values_list('id',flat=True)
Lposts_id = private_posts
Lposts_id.extend(community_posts)
Lposts_id.extend(public_posts)
viewable_posts = post.filter(id__in=Lposts_id).order_by('-modified_date')
viewable_posts_count = post.filter(id__in=Lposts_id).count()
return viewable_posts,viewable_posts_count
It should improve the following things:
order_by once, instead of three times
The count method runs on a query with only the index field
django uses a faster filter with "values", both for the count and the filtering.
Depends on your database, the db own cache may pick the last queried posts for viewable_posts, and use it for viewable_posts_count
Indeed, if you can squeeze the first three filter queries into one, you will save time as well.

Django: ManyToMany filter matching on ALL items in a list

I have such a Book model:
class Book(models.Model):
authors = models.ManyToManyField(Author, ...)
...
In short:
I'd like to retrieve the books whose authors are strictly equal to a given set of authors. I'm not sure if there is a single query that does it, but any suggestions will be helpful.
In long:
Here is what I tried, (that failed to run getting an AttributeError)
# A sample set of authors
target_authors = set((author_1, author_2))
# To reduce the search space,
# first retrieve those books with just 2 authors.
candidate_books = Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors))
final_books = QuerySet()
for author in target_authors:
temp_books = candidate_books.filter(authors__in=[author])
final_books = final_books and temp_books
... and here is what I got:
AttributeError: 'NoneType' object has no attribute '_meta'
In general, how should I query a model with the constraint that its ManyToMany field contains a set of given objects as in my case?
ps: I found some relevant SO questions but couldn't get a clear answer. Any good pointer will be helpful as well. Thanks.
Similar to #goliney's approach, I found a solution. However, I think the efficiency could be improved.
# A sample set of authors
target_authors = set((author_1, author_2))
# To reduce the search space, first retrieve those books with just 2 authors.
candidate_books = Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors))
# In each iteration, we filter out those books which don't contain one of the
# required authors - the instance on the iteration.
for author in target_authors:
candidate_books = candidate_books.filter(authors=author)
final_books = candidate_books
You can use complex lookups with Q objects
from django.db.models import Q
...
target_authors = set((author_1, author_2))
q = Q()
for author in target_authors:
q &= Q(authors=author)
Books.objects.annotate(c=Count('authors')).filter(c=len(target_authors)).filter(q)
Q() & Q() is not equal to .filter().filter(). Their raw SQLs are different where by using Q with &, its SQL just add a condition like WHERE "book"."author" = "author_1" and "book"."author" = "author_2". it should return empty result.
The only solution is just by chaining filter to form a SQL with inner join on same table: ... ON ("author"."id" = "author_book"."author_id") INNER JOIN "author_book" T4 ON ("author"."id" = T4."author_id") WHERE ("author_book"."author_id" = "author_1" AND T4."author_id" = "author_1")
I came across the same problem and came to the same conclusion as iuysal,
untill i had to do a medium sized search (with 1000 records with 150 filters my request would time out).
In my particular case the search would result in no records since the chance that a single record will align with ALL 150 filters is very rare, you can get around the performance issues by verifying that there are records in the QuerySet before applying more filters to save time.
# In each iteration, we filter out those books which don't contain one of the
# required authors - the instance on the iteration.
for author in target_authors:
if candidate_books.count() > 0:
candidate_books = candidate_books.filter(authors=author)
For some reason Django applies filters to empty QuerySets.
But if optimization is to be applied correctly however, using a prepared QuerySet and correctly applied indexes are necessary.

django annotate count filter

I am trying to count daily records for some model, but I would like the count was made only for records with some fk field = xy so I get list with days where there was a new record created but some may return 0.
class SomeModel(models.Model):
place = models.ForeignKey(Place)
note = models.TextField()
time_added = models.DateTimeField()
Say There's a Place with name="NewYork"
data = SomeModel.objects.extra({'created': "date(time_added)"}).values('created').annotate(placed_in_ny_count=Count('id'))
This works, but shows all records.. all places.
Tried with filtering, but it does not return days, where there was no record with place.name="NewYork". That's not what I need.
It looks as though you want to know, for each day on which any object was added, how many of the objects created on that day have a place whose name is New York. (Let me know if I've misunderstood.) In SQL that needs an outer join:
SELECT m.id, date(m.time_added) AS created, count(p.id) AS count
FROM myapp_somemodel AS m
LEFT OUTER JOIN myapp_place AS p
ON m.place_id = p.id
AND p.name = 'New York'
GROUP BY created
So you can always express this in Django using a raw SQL query:
for o in SomeModel.objects.raw('SELECT ...'): # query as above
print 'On {0}, {1} objects were added in New York'.format(o.created, o.count)
Notes:
I haven't tried to work out if this is expressible in Django's query language; it may be, but as the developers say, the database API is "a shortcut but not necessarily an end-all-be-all.")
The m.id is superfluous in the SQL query, but Django requires that "the primary key ... must always be included in a raw query".
You probably don't want to write the literal 'New York' into your query, so pass a parameter instead: raw('SELECT ... AND p.name = %s ...', [placename]).

Improving Django performance with 350000+ regs and complex query

I have a model like this:
class Stock(models.Model):
product = models.ForeignKey(Product)
place = models.ForeignKey(Place)
date = models.DateField()
quantity = models.IntegerField()
I need to get the latest (by date) quantity for every product for every place,
with almost 500 products, 100 places and 350000 stock records on the database.
My current code is like this, it worked on testing but it takes so long with the real data that it's useless
stocks = Stock.objects.filter(product__in=self.products,
place__in=self.places, date__lt=date_at)
stock_values = {}
for prod in self.products:
for place in self.places:
key = u'%s%s' % (prod.id, place.id)
stock = stocks.filter(product=prod, place=place, date=date_at)
if len(stock) > 0:
stock_values[key] = stock[0].quantity
else:
try:
stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
except IndexError:
stock_values[key] = 0
else:
stock_values[key] = stock.quantity
return stock_values
How would you make it faster?
Edit:
Rewrote the code as this:
stock_values = {}
for product in self.products:
for place in self.places:
try:
stock_value = Stock.objects.filter(product=product, place=place, date__lte=date_at)\
.order_by('-date').values('cant')[0]['cant']
except IndexError:
stock_value = 0
stock_values[u'%s%s' % (product.id, place.id)] = stock_value
return stock_values
It works better (from 256 secs to 64) but still need to improve it. Maybe some custom SQL, I don't know...
Arthur's right, the len(stock) isn't the most efficient way to do that. You could go further along the "easier to ask for forgiveness than permission" route with something like this inside the inner loop:
key = u'%s%s' % (prod.id, place.id)
try:
stock = stocks.filter(product=prod, place=place, date=date_at)[0]
quantity = stock.quantity
except IndexError:
try:
stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
quantity = stock.quantity
except IndexError:
quantity = 0
stock_values[key] = quantity
I'm not sure how much that would improve it compared to just changing the length check, though I think this should at least restrict it to two queries with LIMIT 1 on them (see Limiting QuerySets).
Mind you, this is still performing a lot of database hits since you could run through that loop almost 50000 times. Optimize how you're looping and you're in a better position still.
maybe the trick is in that len() method!
follow docs from:
Note: Don't use len() on QuerySets if all you want to do is determine
the number of records in the set. It's much more efficient to handle a
count at the database level, using SQL's SELECT COUNT(*), and Django
provides a count() method for precisely this reason. See count()
below.
So try changing the len to count(), and see if it makes faster!