using two xpathselectors on the same page - python-2.7

I have a spider where the scraped items are 3: brand, model and price from the same page.
Brands and models are using the same sel.xpath, later extracted and differentiated by .re in loop. However, price item is using different xpath. How can I use or combine two XPathSelectors in the spider?
Examples:
for brand and model:
titles = sel.xpath('//table[#border="0"]//td[#class="compact"]')
for prices:
prices = sel.xpath('//table[#border="0"]//td[#class="cl-price-cont"]//span[4]')
Tested and exported individually by xpath. My problem is the combining these 2 to construct the proper loop.
Any suggestions?
Thanks!

Provided you can differentiate all 3 kind of items (brand, model, price) later, you can try using XPath union (|) to bundle both XPath queries into one selector :
//table[#border="0"]//td[#class="compact"]
|
//table[#border="0"]//td[#class="cl-price-cont"]//span[4]
UPDATE :
Responding your comment, above meant to be single XPath string. I'm not using python, but I think it should be about like this :
sel.xpath('//table[#border="0"]//td[#class="compact"] | //table[#border="0"]//td[#class="cl-price-cont"]//span[4]')

I believe you are having trouble associating the price with the make/model because both xpaths give you a list of all numbers, correct? Instead, what you want to do is build an xpath that will get you each row of the table. Then, in your loop, you can do further xpath queries to pull out the make/model/price.
rows = sel.xpath('//table[#border="0"]/tr') # Get all the rows
for row in rows:
make_model = row.xpath('//td[#class="compact"]/text()').extract()
# set make and model here using your regex. something like:
(make,model) = re("^(.+?)\s(.+?)$", make_model).groups()
price = row.xpath('//td[#class="cl-price-cont"]//span[4]/text()').extract()
# do something with the make/model/price.
This way, you know that in each iteration of the loop, the make/model/price you're getting all go together.

Related

Filter multiple Django model fields with variable number of arguments

I'm implementing search functionality with an option of looking for a record by matching multiple tables and multiple fields in these tables.
Say I want to find a Customer by his/her first or last name, or by ID of placed Order which is stored in different model than Customer.
The easy scenario which I already implemented is that a user only types single word into search field, I then use Django Q to query Order model using direct field reference or related_query_name reference like:
result = Order.objects.filter(
Q(customer__first_name__icontains=user_input)
|Q(customer__last_name__icontains=user_input)
|Q(order_id__icontains=user_input)
).distinct()
Piece of a cake, no problems at all.
But what if user wants to narrow the search and types multiple words into search field.
Example: user has typed Bruce and got a whole lot of records back as a result of search.
Now he/she wants to be more specific and adds customer's last name to search.So the search becomes Bruce Wayne, after splitting this into separate parts I'm having Bruce and Wayne. Obviously I don't want to search Orders model because order_id is a single-word instance and it's sufficient to find customer at once so for this case I'm dropping it out of query at all.
Now I'm trying to match customer by both first AND last name, I also want to handle the scenario where the order of provided data is random, to properly handle Bruce Wayne and Wayne Bruce, meaning I still have customers full name but the position of first and last name aren't fixed.
And this is the question I'm looking answer for: how to build query that will search multiple fields of model not knowing which of search words belongs to which table.
I'm guessing the solution is trivial and there's for sure an elegant way to create such a dynamic query, but I can't think of a way how.
You can dynamically OR a variable number of Q objects together to achieve your desired search. The approach below makes it trivial to add or remove fields you want to include in the search.
from functools import reduce
from operator import or_
fields = (
'customer__first_name__icontains',
'customer__last_name__icontains',
'order_id__icontains'
)
parts = []
terms = ["Bruce", "Wayne"] # produce this from your search input field
for term in terms:
for field in fields:
parts.append(Q(**{field: term}))
query = reduce(or_, parts)
result = Order.objects.filter(query).distinct()
The use of reduce combines the Q objects by ORing them together. Credit to that part of the answer goes to this answer.
The solution I came up with is rather complex, but it works exactly the way I wanted to handle this problem:
search_keys = user_input.split()
if len(search_keys) > 1:
first_name_set = set()
last_name_set = set()
for key in search_keys:
first_name_set.add(Q(customer__first_name__icontains=key))
last_name_set.add(Q(customer__last_name__icontains=key))
query = reduce(and_, [reduce(or_, first_name_set), reduce(or_, last_name_set)])
else:
search_fields = [
Q(customer__first_name__icontains=user_input),
Q(customer__last_name__icontains=user_input),
Q(order_id__icontains=user_input),
]
query = reduce(or_, search_fields)
result = Order.objects.filter(query).distinct()

Django Trigram: create gin index and search suggested words in Django

I have model with title and description fields.
I want to create a GIN index for all the words in the title and description field
So I do it the following way using SQL:
STEP1: Create a table with all the words in title and description using simple config
CREATE TABLE words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'',COALESCE("articles_article"."title", '''')) || to_tsvector(''simple'',COALESCE("articles_article"."description", '''')) FROM "articles_article"');
STEP2: Create GIN index
CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
STEP3: SEARCH
SELECT word, similarity(word, 'sri') AS sml
FROM words
WHERE word % 'sri'
ORDER BY sml DESC, word;
Result:
word sml
sri 1
srila 0.5
srimad 0.428571
How to do this in DJANGO and also i have to keep updating the GIN index
Django docs suggest that you install the relevant btree_gin_extension and append the following to the model's Meta class:
from django.contrib.postgres.indexes import GinIndex
class MyModel(models.Model):
the_field = models.CharField(max_length=512)
class Meta:
indexes = [GinIndex(fields=['the_field'])]
A relevant answer can be found here.
Regarding the updating of the index, heroku suggests:
Finally, indexes will become fragmented and unoptimized after some
time, especially if the rows in the table are often updated or
deleted. In those cases it may be required to perform a REINDEX
leaving you with a balanced and optimized index. However be cautious
about reindexing big indexes as write locks are obtained on the parent
table. One strategy to achieve the same result on a live site is to
build an index concurrently on the same table and columns but with a
different name, and then dropping the original index and renaming the
new one. This procedure, while much longer, won’t require any long
running locks on the live tables.

Django create date list from existing objects

i come with this doubt abobt how to make a linked dates list based on existing objects, first of all i have a model with a DateTimeField which stores the date and hour that the object was added.
I have something like:
|pk|name|date
|1|name 1|2016-08-02 16:14:30.405305
|2|name 2|2016-08-02 16:15:30.405305
|3|name 3|2016-08-03 16:46:29.532976
|4|name 4|2016-08-04 16:46:29.532976
And i have some records with the same day but different hour, what i want is to make a list displaying only the unique days:
2016-08-02
2016-08-03
2016-08-04
And also because i'm using the CBV DayArchiveView i want to add a link to that elements to list them per day with a url pattern like this:
url(r'^archive/(?P<year>[0-9]{4})/(?P<month>[-\w]+)/(?P<day>[0-9]+)/$', ArticleDayArchiveView.as_view(), name="archive_day"),
The truth is that i don't have a clue of how to achieve that, can you help me with that?
Extracting unique dates
instances = YourModel.objects.all()
unique_dates = list(set(map(lambda x: x.date.strftime("%Y-%m-%d"), instances)))
About listing them, your url pattern looks ok. You need to define a view in order to retrieve them and wire up with that url.
UPDATE:
If you want to order them, just:
sorted_dates = sorted(unique_dates)

How do I use django's Q with django taggit?

I have a Result object that is tagged with "one" and "two". When I try to query for objects tagged "one" and "two", I get nothing back:
q = Result.objects.filter(Q(tags__name="one") & Q(tags__name="two"))
print len(q)
# prints zero, was expecting 1
Why does it not work with Q? How can I make it work?
The way django-taggit implements tagging is essentially through a ManytoMany relationship. In such cases there is a separate table in the database that holds these relations. It is usually called a "through" or intermediate model as it connects the two models. In the case of django-taggit this is called TaggedItem. So you have the Result model which is your model and you have two models Tag and TaggedItem provided by django-taggit.
When you make a query such as Result.objects.filter(Q(tags__name="one")) it translates to looking up rows in the Result table that have a corresponding row in the TaggedItem table that has a corresponding row in the Tag table that has the name="one".
Trying to match for two tag names would translate to looking up up rows in the Result table that have a corresponding row in the TaggedItem table that has a corresponding row in the Tag table that has both name="one" AND name="two". You obviously never have that as you only have one value in a row, it's either "one" or "two".
These details are hidden away from you in the django-taggit implementation, but this is what happens whenever you have a ManytoMany relationship between objects.
To resolve this you can:
Option 1
Query tag after tag evaluating the results each time, as it is suggested in the answers from others. This might be okay for two tags, but will not be good when you need to look for objects that have 10 tags set on them. Here would be one way to do this that would result in two queries and get you the result:
# get the IDs of the Result objects tagged with "one"
query_1 = Result.objects.filter(tags__name="one").values('id')
# use this in a second query to filter the ID and look for the second tag.
results = Result.objects.filter(pk__in=query_1, tags__name="two")
You could achieve this with a single query so you only have one trip from the app to the database, which would look like this:
# create django subquery - this is not evaluated, but used to construct the final query
subquery = Result.objects.filter(pk=OuterRef('pk'), tags__name="one").values('id')
# perform a combined query using a subquery against the database
results = Result.objects.filter(Exists(subquery), tags__name="two")
This would only make one trip to the database. (Note: filtering on sub-queries requires django 3.0).
But you are still limited to two tags. If you need to check for 10 tags or more, the above is not really workable...
Option 2
Query the relationship table instead directly and aggregate the results in a way that give you the object IDs.
# django-taggit uses Content Types so we need to pick up the content type from cache
result_content_type = ContentType.objects.get_for_model(Result)
tag_names = ["one", "two"]
tagged_results = (
TaggedItem.objects.filter(tag__name__in=tag_names, content_type=result_content_type)
.values('object_id')
.annotate(occurence=Count('object_id'))
.filter(occurence=len(tag_names))
.values_list('object_id', flat=True)
)
TaggedItem is the hidden table in the django-taggit implementation that contains the relationships. The above will query that table and aggregate all the rows that refer either to the "one" or "two" tags, group the results by the ID of the objects and then pick those where the object ID had the number of tags you are looking for.
This is a single query and at the end gets you the IDs of all the objects that have been tagged with both tags. It is also the exact same query regardless if you need 2 tags or 200.
Please review this and let me know if anything needs clarification.
first of all, this three are same:
Result.objects.filter(tags__name="one", tags__name="two")
Result.objects.filter(Q(tags__name="one") & Q(tags__name="two"))
Result.objects.filter(tags__name_in=["one"]).filter(tags__name_in=["two"])
i think the name field is CharField and no record could be equal to "one" and "two" at same time.
in python code the query looks like this(always false, and why you are geting no result):
from random import choice
name = choice(["abtin", "shino"])
if name == "abtin" and name == "shino":
we use Q object for implement OR or complex queries
Into the example that works you do an end on two python objects (query sets). That gets applied to any record not necessarily to the same record that has one AND two as tag.
ps: Why do you use the in filter ?
q = Result.objects.filter(tags_name_in=["one"]).filter(tags_name_in=["two"])
add .distinct() to remove duplicates if expecting more than one unique object

Django DB, finding Categories whose Items are all in a subset

I have a two models:
class Category(models.Model):
pass
class Item(models.Model):
cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids (fixed thanks). For example, all categories for which all of the items associated with that category have ids in the set [1,3,5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relation ship on the model without the foreign key. You can filter on it by using its related name (usually just the model name lowercase but it can be manually overwritten), two underscores, and the field name you want to query on.
lets say you require all items to be in the following set:
allowable_items = set([1,3,4])
one bruteforce solution would be to check the item_set for every category as so:
categories_with_allowable_items = [
category for category in
Category.objects.all() if
set([item.id for item in category.item_set.all()]) <= allowable_items
]
but we don't really have to check all categories, as categories_with_allowable_items is always going to be a subset of the categories related to all items with ids in allowable_items... so that's all we have to check (and this should be faster):
categories_with_allowable_items = set([
item.category for item in
Item.objects.select_related('category').filter(pk__in=allowable_items) if
set([siblingitem.id for siblingitem in item.category.item_set.all()]) <= allowable_items
])
if performance isn't really an issue, then the latter of these two (if not the former) should be fine. if these are very large tables, you might have to come up with a more sophisticated solution. also if you're using a particularly old version of python remember that you'll have to import the sets module
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(having="num_items=(SELECT COUNT(*) FROM app_item WHERE app_item.id in % AND app_item.cat_id = app_category.id)", having_params=[allowed_item_ids])