Combine and flatten many key/value tuples into a single tuple in pig - tuples

I am using Pig 0.8.1. I am somewhat new to Pig but I know there must be a reasonable and re-usable solution for how I want to work with my tuples. I have the following format (similar to triples):
Schema: (uuid, key, val)
Data:
(id1, 'name', 'Corey')
(id1, 'location', 'USA')
(id1, 'carsOwned', 5)
(id2, 'name', 'Paul')
(id2, 'location', 'CANADA')
(id2, 'carsOwned', 10)
The reason I'm representing this data in triples is because it's possible to have multi-valued keys, so pushing the data into a map is out of the question.
What I need to be able to do is find the ids, names and locations of the people with the top 10 cars owned. I'd like it if my output format could be this when sorted in descending order:
Schema: (uuid, name, location, carsOwned)
Data:
(id2, 'Paul', 'CANADA', 10)
(id1, 'Corey', 'USA', 5)
I have tried filtering my input into 3 different aliases (one where key == 'name', one where key == 'location' and one where key == 'carsOwned') so that I can use JOIN and bring them back into one tuple, but it appears that Pig ends up loading from the inputFormat 3 times instead of one. Maybe I'm doing that wrong?
I also tried grouping by id but then I can't seem to find a reasonable way to work with the bag of the actual triple key/values since they all have the exact same schema.
What I really want is to group by the id field and then flatten each of the keys but rename the alias to the actual name of the key.
Any ideas? Thanks in advance!

This solution is a bit sloppy, because your data is not organized in a way that Pig is really set up for -- i.e., conceptually each id should be a row key, with the fields named by what you have in the second column. But this can still be done, as long as your data is all reasonable. If you erroneously wind up with multiple rows with the same id and field name, but different values, this will get complicated.
Use a nested foreach to pick out the values for the three fields you're interested in.
/* Assumes Data was loaded with the schema (id, field, val) */
keyedByID =
    /* Gather the rows by ID, then process each one in turn */
    FOREACH (GROUP Data BY id) {
        /* Pull out the fields you want. If you have duplicate rows,
           you'll need to use a LIMIT statement to ensure just a single record */
        name = FILTER Data BY field == 'name';
        location = FILTER Data BY field == 'location';
        carsOwned = FILTER Data BY field == 'carsOwned';
        GENERATE
            /* Output each field you want. You'll need to use FLATTEN since
               the things created above in the nested foreach are bags.
               Project the value column so each FLATTEN yields one field. */
            group AS id,
            FLATTEN(name.val) AS name,
            FLATTEN(location.val) AS location,
            FLATTEN(carsOwned.val) AS carsOwned;
    };
Now you've got a relation that puts all the information for an ID on a single row, and you can do with it whatever you want. For example, you said you wanted to pull out the top 10 car owners:
ordered = ORDER keyedByID BY carsOwned DESC;
top10 = LIMIT ordered 10;
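To make the grouping logic concrete, here is a rough plain-Python emulation of the same idea. It is only a sketch, not Pig: the `pivot` helper is a made-up name, and `itertools.product` stands in for what several FLATTENs in one GENERATE do (duplicate keys multiply the output rows, and an id missing one of the keys produces no row at all).

```python
from collections import defaultdict
from itertools import product

# Triples in the question's (uuid, key, val) layout.
triples = [
    ("id1", "name", "Corey"),
    ("id1", "location", "USA"),
    ("id1", "carsOwned", 5),
    ("id2", "name", "Paul"),
    ("id2", "location", "CANADA"),
    ("id2", "carsOwned", 10),
]


def pivot(rows, keys):
    """Group rows by id, then emit one output row per combination of values."""
    grouped = defaultdict(lambda: defaultdict(list))
    for uuid, key, val in rows:
        grouped[uuid][key].append(val)  # plays the role of FILTER ... BY field == key
    out = []
    for uuid, by_key in grouped.items():
        bags = [by_key.get(k, []) for k in keys]
        # product() mirrors several FLATTENs side by side: duplicates
        # multiply rows, and an empty bag yields no row at all.
        for combo in product(*bags):
            out.append((uuid, *combo))
    return out


keyed_by_id = pivot(triples, ["name", "location", "carsOwned"])
# ORDER ... BY carsOwned DESC, then LIMIT 10
top10 = sorted(keyed_by_id, key=lambda row: row[3], reverse=True)[:10]
print(top10)
```

This also shows why duplicate (id, field) rows "get complicated": two 'name' rows for one id would give two output rows for that id.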

Related

Django inner join with search terms for tables across many-to-many relationships

I have a dynamic hierarchical advanced search interface I created. Basically, it allows you to search terms, from about 6 or 7 tables that are all linked together, that can be and'ed or or'ed together in any combination. The search entries in the form are all compiled into a complex Q expression.
I discovered a problem today. If I provide a search term for a field in a many-to-many related sub-table, the output table can include results from that table that don't match the term.
My problem can be reproduced in the shell with a simple query:
qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
sqs = qs[0].msrun.sample.animal.studies.all()
sqs.count()
#result: 2
Specifically:
In [3]: s = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
In [12]: ss = s[0].msrun.sample.animal.studies.all()
In [13]: ss[0].__dict__
Out[13]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bfbf940>,
'id': 3,
'name': 'Small OBOB',
'description': ''}
In [14]: ss[1].__dict__
Out[14]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bea81f0>,
'id': 1,
'name': 'obob_fasted',
'description': ''}
The ids in sqs queryset include 1 & 3 even though I only searched on 3. I don't get back literally all studies, so it is filtering some un-matching study records. I understand why I see that, but I don't know how to execute a query that treats it like a join I could perform in SQL where I can restrict the results to only include records that match the query, instead of getting back only records in the root model and gathering everything left-joined to those root model records.
Is there a way to do such an inner join (as the result of a single complex Q expression in a filter) on the entire set of linked tables so that I only get back records that match the M:M field search term?
UPDATE:
By looking at the SQL:
In [3]: str(s.query)
Out[3]: 'SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC'
...I can see that the query is as specific as I would like it to be, but in the template, how do I specify that I want what I would have seen in the SQL result, had I supplied all of the specific related table fields I wanted to see in the output? E.g.:
SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id", "DataRepo_animal_studies"."study_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC
All the Django ORM gives you back from a query whose filters use fields from many-to-many (henceforth "M:M") related tables is a set of records from the "root" table from which the query started. It uses the same join logic you would use in an SQL query, following the foreign keys, so every root table record you get back is guaranteed to link to at least one M:M related record that matches your search term. But when you send the root table records (the queryset) to a template and use a nested loop to access records from the M:M related table, you always get everything linked to that root table record, whether it matches your search term or not. So if a root table record links to multiple records in an M:M related table, at least one of them is guaranteed to match your search term, but the others may not.
In order to get the inner join (i.e. only including combined records that match search terms in the table(s) queried), you simply have to roll-your-own, because Django doesn't support it. I accomplished this in the following way:
When rendering search results, wherever you want to include records from a M:M related table, you have to create a nested loop. At the inner-most loop, I essentially re-enforce that the record matches all of the search terms. I.e. I implement my own filter using conditionals on each combination of records (from the root table and from each of the M:M related tables). If any record combination does not match, I skip it.
Regarding the particulars as to how I did that, I won't get too into the weeds, but at each nested loop, I maintain a dict of its key path (e.g. msrun__sample__animal__studies) as the key and the current record as the value. I then use a custom simple_tag to re-run the search terms against the current record from each table (by checking the key path of the search term against those available in the dict).
Alternatively, you could do this in the view by compiling all of the matching records and sending the combined large table to the template - but I opted not to do this because of the hurdles surrounding cached_properties and since I already pass the search term form data to re-display the executed search form, the structure of the search was already available.
Watch out #1 - Even without "refiltering" the combined records in the template, note that the record count can be inaccurate when dealing with a combined/composite/joined table. To ensure that the number of records I report in my header above the table was always correct/accurate (i.e. represented the number of rows in my html table), I kept track of the filtered count and used javascript under the table to update the record count reported at the top.
Watch out #2 - There is one other thing to keep in mind when querying using terms from M:M related tables. For every M:M related table record that matches the search term, you will get back a duplicate record from the root table, and there's no way to tell the difference between them (i.e. which M:M record/field value matched in each case). For example, if you matched 2 records from an M:M related table, you would get back 2 identical root table records, and when you pass that data to the template, you would end up with 4 rows in your results, each with the same data from the root table record. To avoid this, all you have to do is append .distinct() to your results queryset.
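The "re-filter each record combination" idea can be sketched outside of Django. This is only an illustration with plain dataclasses standing in for the models: `joined_rows`, the `studies` list, and the peak group name "glucose" are made up for the example, and the real implementation described above lives in a template tag rather than a generator.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    id: int
    name: str


@dataclass
class PeakGroup:
    name: str
    studies: list = field(default_factory=list)  # stand-in for the M:M relation


def joined_rows(peak_groups, study_matches):
    """Yield (peak_group, study) pairs, keeping only studies that match."""
    for pg in peak_groups:
        for study in pg.studies:
            if study_matches(study):  # re-apply the search term per combination
                yield pg, study


small = Study(3, "Small OBOB")
fasted = Study(1, "obob_fasted")
groups = [PeakGroup("glucose", [small, fasted])]

rows = list(joined_rows(groups, lambda s: s.id == 3))
print([(pg.name, s.id) for pg, s in rows])
print(len(rows))  # the row count to report, taken after re-filtering
```

Counting the rows after the re-filter, as in the last line, is what keeps the reported total in step with the rows actually displayed.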

How can you filter a Django query's joined tables then iterate the joined tables in one query?

I have table Parent, and a table Child with a foreign key to table Parent.
I want to run a query for all Parents with a child called Eric, and report Eric's age.
I run:
parents = Parent.objects.filter(child__name='Eric')
I then iterate over the queryset:
for parent in parents:
    print(f'Parent name {parent.name} child Eric age {parent.child.age}')
Clearly this doesn't work - I need to access child through the foreign key object manager, so I try:
for parent in parents:
    print(f'Parent name {parent.name}')
    for child in parent.child_set.all():
        print(f'Child Eric age {child.age}')
Django returns all children's ages, not just children named Eric.
I can repeat the filter conditions:
parents = Parent.objects.filter(child__name='Eric')
for parent in parents:
    print(f'Parent name {parent.name}')
    for child in parent.child_set.filter(name='Eric'):
        print(f'Child Eric age {child.age}')
But this means duplicate code (so risks future inconsistency when another dev makes a change to one not the other), and runs a second query on the database.
Is there a way of getting the matching records and iterating over them? Been Djangoing for years and can't believe I can't do this!
PS. I know that I can do Child.objects.filter(name='Eric').select_related('parent'). But what I would really like to do involves a second child table. So add to the above example a table Address with a foreign key to Parent. I want to get parents with children named Eric and addresses in Timbuktu and iterate over all the Timbuktu addresses and all the little Erics. This is why I don't want to use Child's object manager.
This is the best I could come up with - three queries, repeating each filter.
from django.db.models import Prefetch

children = Child.objects.filter(name='Eric')
addresses = Address.objects.filter(town='Timbuktu')
parents = (
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .prefetch_related(Prefetch('child_set', children))
    .prefetch_related(Prefetch('address_set', addresses))
)
The .values() function gives you direct access to the recordset returned (thank you @Iain Shelvington):
parents_queryset_dicts = (
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .values('id', 'name', 'child__id', 'address__id', 'child__age', 'address__whatever')
    .order_by('id', 'child__id', 'address__id')
)
Note though that this retrieves a Cartesian product of children and addresses, so our gain in reduced query count is slightly offset by double-sized result sets and de-duplication below. So I am starting to think two queries using Child.objects and Address.objects is superior - slightly slower but simpler code.
In my actual use case I have multiple, multi-table chains of foreign key joins, so am splitting the query to prevent the Cartesian join, but still making use of the .values() approach to get filtered, nested tables.
If you then want a hierarchical structure, eg, for sending as JSON to the client, to produce:
parents = {
    parent_id: {
        'name': name,
        'children': {
            child_id: {
                'age': child_age
            },
        },
        'addresses': {
            address_id: {
                'whatever': address_whatever
            },
        },
    },
}
Run something like:
prev_parent_id = prev_child_id = prev_address_id = None
parents = {}
for parent in parents_queryset_dicts:
    if parent['id'] != prev_parent_id:
        parents[parent['id']] = {'name': parent['name'], 'children': {}, 'addresses': {}}
        prev_parent_id = parent['id']
    if parent['child__id'] != prev_child_id:
        parents[parent['id']]['children'][parent['child__id']] = {'age': parent['child__age']}
        prev_child_id = parent['child__id']
    if parent['address__id'] != prev_address_id:
        parents[parent['id']]['addresses'][parent['address__id']] = {'whatever': parent['address__whatever']}
        prev_address_id = parent['address__id']
This is dense code, and you no longer get access to any fields not explicitly extracted and copied in, including any nested ~_set querysets, and the de-duplication of the Cartesian product is not obvious to later developers. You can grab the queryset, keep it, then extract the .values, so you have both from the same, single, database query. But often the three-query approach with repeated filters is a bit cleaner, even if it costs a couple more database queries:
from django.db.models import Prefetch

children = Child.objects.filter(name='Eric')
addresses = Address.objects.filter(town='Timbuktu')
parents_queryset = (
    Parent.objects
    .filter(child__name='Eric', address__town='Timbuktu')
    .prefetch_related(Prefetch('child_set', children))
    .prefetch_related(Prefetch('address_set', addresses))
)
parents = {}
for parent in parents_queryset:
    parents[parent.id] = {'name': parent.name, 'children': {}, 'addresses': {}}
    for child in parent.child_set.all():  # this is implicitly filtered
        parents[parent.id]['children'][child.id] = {'age': child.age}
    for address in parent.address_set.all():  # also implicitly filtered
        parents[parent.id]['addresses'][address.id] = {'whatever': address.whatever}
One last approach, which someone briefly posted then deleted - I'd love to know why - is using annotate and F() objects. I have not experimented with this, the SQL generated looks fine though and it seems to run a single query and not require repeating filters:
from django.db.models import F
parents = (
    Parent.objects.filter(child__name='Eric')
    .annotate(child_age=F('child__age'))
)
Pros and cons seem identical to .values() above, although .values() seems slightly more basic Django (so easier to read) and you don't have to duplicate field names (eg, with the obfuscation above of child_age=child__age). Advantages might be convenience of . accessors instead of ['field'], you keep hold of the lazy nested recordsets, etc. - although if you're counting the queries you probably want things to fall over if you issue an accidental query per row.
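As a footnote to the .values() approach: the nesting of the flat rows can also be written with dict.setdefault instead of prev_* trackers, which makes the de-duplication of the Cartesian product a little more visible. The sketch below runs on hand-written dictionaries shaped like the .values() output; the sample data and the `nest` helper are made up for illustration.

```python
# Rows shaped like the output of .values(...); the data itself is made up.
rows = [
    {"id": 1, "name": "Ann", "child__id": 10, "child__age": 7,
     "address__id": 100, "address__whatever": "a"},
    {"id": 1, "name": "Ann", "child__id": 10, "child__age": 7,
     "address__id": 101, "address__whatever": "b"},
    {"id": 1, "name": "Ann", "child__id": 11, "child__age": 9,
     "address__id": 100, "address__whatever": "a"},
    {"id": 1, "name": "Ann", "child__id": 11, "child__age": 9,
     "address__id": 101, "address__whatever": "b"},
]


def nest(flat_rows):
    """Fold child x address product rows into one nested dict per parent."""
    parents = {}
    for row in flat_rows:
        # setdefault only inserts when the key is new, so repeated
        # parents, children and addresses collapse into one entry each.
        parent = parents.setdefault(
            row["id"], {"name": row["name"], "children": {}, "addresses": {}}
        )
        parent["children"].setdefault(row["child__id"], {"age": row["child__age"]})
        parent["addresses"].setdefault(
            row["address__id"], {"whatever": row["address__whatever"]}
        )
    return parents


parents = nest(rows)
print(parents)
```

Unlike the prev_* version, this does not depend on the rows being ordered, at the cost of a dictionary lookup per row.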

django setting filter field with a variable

I show a model of sales that can be aggregated by different fields through a form. Products, clients, categories, etc.
view_by_choice = filter_opts.cleaned_data["view_by_choice"]
sales = sales.values(view_by_choice).annotate(........).order_by(......)
In the same form I have a string input where the user can filter the results. By "product code" for example.
input_code = filter_opts.cleaned_data["filter_code"]
sales = sales.filter(prod_code__icontains=input_code)
What I want to do is filter the queryset "sales" by the input_code, defining the field dynamically from the view_by_choice variable.
Something like:
sales = sales.filter(VARIABLE__icontains=input_code)
Is it possible to do this? Thanks in advance.
You can make use of dictionary unpacking [PEP-448] here:
sales = sales.filter(
**{'{}__icontains'.format(view_by_choice): input_code}
)
Given that view_by_choice for example contains 'foo', we thus first make a dictionary { 'foo__icontains': input_code }, and then we unpack that as a named parameter with the two consecutive asterisks (**).
That being said, I strongly advise you to do some validation on view_by_choice: ensure that the number of valid options is limited. Otherwise a user might inject malicious field names, lookups, etc. to extract data from your database that should remain hidden.
For example, if your model has a ForeignKey named owner to the User model, they could use owner__email, and thus start trying to find out what emails are in the database by generating a large number of queries and each time looking at what values that query returned.
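One way to do that validation is an explicit whitelist checked before the dictionary is built. This is a minimal sketch: the `ALLOWED_FILTER_FIELDS` set, its field names other than prod_code, and the `build_icontains_filter` helper are all hypothetical and would need to match your own model.

```python
# Hypothetical whitelist: replace these names with the fields you actually expose.
ALLOWED_FILTER_FIELDS = {"prod_code", "client_code", "category_code"}


def build_icontains_filter(field_name, value):
    """Return kwargs for .filter(), refusing any field not on the whitelist."""
    if field_name not in ALLOWED_FILTER_FIELDS:
        raise ValueError(f"Filtering by {field_name!r} is not allowed")
    return {f"{field_name}__icontains": value}


kwargs = build_icontains_filter("prod_code", "AB12")
print(kwargs)
# In the view this would then be used as: sales = sales.filter(**kwargs)
```

A request for something like owner__email raises ValueError here instead of reaching the database.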

Ordered list in Django

Can anyone help? I want to return an ordered list in a Django for loop, using a model field that contains both a string and an integer in the format MM/1234. The loop should output the values in the HTML template in ascending order of the integer part (1234).
Ideally you want to change the model to have two fields, one integer and one string, so you can code a queryset with ordering based on the integer one. You can then define a property of the model to return the self.MM+"/"+str( self.nn) composite value if you often need to use that. But if it's somebody else's database schema, this may not be an option.
In which case you'll have to convert your queryset into a list (which reads all the data rows at once) and then sort the list in Python rather than in the database. You can run out of memory or bring your server to its knees if the list contains millions of objects. count=qs.count() is a DB operation that won't.
qs = Foo.objects.filter( your_selection_criteria)
# you might want to count this before the next step, and chicken out if too many
# also this simple key function will crash if there's ever no "/" in that_field,
# or if the part after the "/" is not a number
all_obj = sorted( list( qs),
    key = lambda obj: int( obj.that_field.split('/')[1] ) )
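If malformed values are a possibility, a slightly more defensive key function can push them to the end instead of crashing. This is a sketch on plain strings; `numeric_suffix` is a made-up helper name, and sending bad values to the end is just one possible policy.

```python
def numeric_suffix(value, default=float("inf")):
    """Return the integer after the last '/' in value, or default if there is none."""
    try:
        return int(value.rsplit("/", 1)[1])
    except (IndexError, ValueError):
        # No "/" at all, or a non-numeric tail: sort these after everything else.
        return default


codes = ["MM/1234", "MM/99", "AB/1000", "broken"]
ordered = sorted(codes, key=numeric_suffix)
print(ordered)
```

With model instances the same helper would be used as sorted(qs, key=lambda obj: numeric_suffix(obj.that_field)). Converting to int matters: as strings, "1000" would sort before "99".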

Django: Sort and filter rows by specific many to one value

In the provided schema I would like to sort Records by a specific Attribute of the record. I'd like to do this in native Django.
Example:
Query all Records (regardless of Attribute.color), but sort by Attribute.value where Attribute.color is 'red'. Obviously Records missing a 'red' Attribute can't be sorted, so they could be just interpreted as NULL or sent to the end.
Each Record is guaranteed to have one or zero of an Attribute of a particular color (enforced by unique_together). Given that this is a one-to-many relationship, a Record can have Attributes of more than one color.
class Record(Model):
    pass

class Attribute(Model):
    color = CharField()  # **See note below
    value = IntegerField()
    record = ForeignKey(Record)

    class Meta:
        unique_together = (('color', 'record'),)
I will also need to filter Records by Attribute.value and Attribute.color as well.
I'm open to changing the schema, but the schema above seems to be the simplest to represent what I need to model.
How can I:
Query all Records where it has an Attribute.color of 'red' and, say, an Attribute.value of 10
Query all Records and sort by the Attribute.value of the associated Attribute where Attribute.color is 'red'.
** I've simplified it above -- in reality the color field would be a ForeignKey to an AttributeDefinition, but I think that's not important right now.
I think something like this would work:
record_ids = Attribute.objects.filter(color='red', value=10).values_list('record', flat=True)
and
record_ids = Attribute.objects.filter(color='red').order_by('value').values_list('record', flat=True)
That will give you IDs of records. Then, you can do this:
records = Record.objects.filter(id__in=record_ids)
Hope this helps!
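One caveat for the sorted case: an id__in lookup does not carry the order of the id list over to the result, so the records may need to be put back in order afterwards. The sketch below shows that step in plain Python with made-up stand-in data; in Django, Record.objects.in_bulk(record_ids) should give a comparable id-keyed dictionary to index into.

```python
# Stand-ins for the two query results; ids are made up.
record_ids = [7, 3, 9]  # already sorted by the red Attribute.value
unordered_records = [{"id": 3}, {"id": 9}, {"id": 7}]  # in whatever order the database chose

# Index the records by id, then walk the sorted id list to restore the order.
by_id = {record["id"]: record for record in unordered_records}
ordered_records = [by_id[rid] for rid in record_ids if rid in by_id]
print([record["id"] for record in ordered_records])
```

Note that Records without a 'red' Attribute never appear in record_ids, so they would have to be appended separately if they should show up at the end.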