I have the following data:
publisher                  title
-------------------------- -----------------------------------
New Age Books              Life Without Fear
New Age Books              Life Without Fear
New Age Books              Sushi, Anyone?
Binnet & Hardley           Life Without Fear
Binnet & Hardley           The Gourmet Microwave
Binnet & Hardley           Silicon Valley
Algodata Infosystems       But Is It User Friendly?
Algodata Infosystems       But Is It User Friendly?
Algodata Infosystems       But Is It User Friendly?
Here is what I want to do: I want to count how many books with the same title each publisher has published.
I want to get the following result:
{publisher: New Age Books, title: Life Without Fear, count: 2},
{publisher: New Age Books, title: Sushi Anyone?, count: 1},
{publisher: Binnet & Hardley, title: The Gourmet Microwave, count: 1},
{publisher: Binnet & Hardley, title: Silicon Valley, count: 1},
{publisher: Binnet & Hardley, title: Life Without Fear, count: 1},
{publisher: Algodata Infosystems, title: But Is It User Friendly?, count: 3}
My solution goes something along the lines of:
query_set.values('publisher', 'title').annotate(count=Count('title'))
But it is not producing the desired result.
There is a peculiarity in Django here: any default ordering on the model (from Meta.ordering) gets pulled into the GROUP BY clause and breaks the grouping. Adding an explicit .order_by() clause clears the default ordering, so you can write:
query_set.values('publisher', 'title').annotate(
    count=Count('pk')
).order_by('publisher', 'title')
The ordering also makes the rows for each (publisher, title) pair arrive together, so they "fold" into a group, and we count the number of primary keys per group.
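For illustration only, the grouping this query performs is equivalent to counting (publisher, title) pairs; a minimal plain-Python sketch with the sample data from the question (no Django required):

```python
from collections import Counter

# sample rows from the question: (publisher, title) pairs
rows = [
    ("New Age Books", "Life Without Fear"),
    ("New Age Books", "Life Without Fear"),
    ("New Age Books", "Sushi, Anyone?"),
    ("Binnet & Hardley", "Life Without Fear"),
    ("Binnet & Hardley", "The Gourmet Microwave"),
    ("Binnet & Hardley", "Silicon Valley"),
    ("Algodata Infosystems", "But Is It User Friendly?"),
    ("Algodata Infosystems", "But Is It User Friendly?"),
    ("Algodata Infosystems", "But Is It User Friendly?"),
]

# roughly what .values('publisher', 'title').annotate(count=Count('pk'))
# evaluates to: one dict per distinct (publisher, title) pair
grouped = [
    {"publisher": p, "title": t, "count": n}
    for (p, t), n in Counter(rows).items()
]
```

The six dicts in `grouped` match the six result rows the question asks for.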
Related
I have the following data:
publisher                  title
-------------------------- -----------------------------------
New Age Books              Life Without Fear
New Age Books              Life Without Fear
New Age Books              Sushi, Anyone?
Binnet & Hardley           Life Without Fear
Binnet & Hardley           The Gourmet Microwave
Binnet & Hardley           Silicon Valley
Algodata Infosystems       But Is It User Friendly?
Algodata Infosystems       But Is It User Friendly?
Algodata Infosystems       But Is It User Friendly?
Here is what I want to do: I want to count the books per title for each publisher, collected into a single object per publisher.
I want to get the following result:
{publisher: New Age Books, titles: {Life Without Fear: 2, Sushi Anyone?: 1}},
{publisher: Binnet & Hardley, titles: {The Gourmet Microwave: 1, Silicon Valley: 1, Life Without Fear: 1}},
{publisher: Algodata Infosystems, titles: {But Is It User Friendly?: 3}}
My solution goes something along the lines of:
query_set.values('publisher', 'title').annotate(count=Count('title'))
But it is not producing the desired result.
You can post-process the results of the query with the groupby(…) function [Python-doc] of the itertools package [Python-doc]:
from django.db.models import Count
from itertools import groupby
from operator import itemgetter
qs = query_set.values('publisher', 'title').annotate(
    count=Count('pk')
).order_by('publisher', 'title')

result = [
    {
        'publisher': p,
        'titles': {r['title']: r['count'] for r in rs}
    }
    for p, rs in groupby(qs, itemgetter('publisher'))
]
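Because groupby(…) only folds consecutive items, the .order_by(…) is essential. A standalone sketch of the same fold, with plain dicts standing in for the ordered, annotated queryset:

```python
from itertools import groupby
from operator import itemgetter

# plain dicts standing in for the ordered, annotated queryset
qs = [
    {'publisher': 'Algodata Infosystems', 'title': 'But Is It User Friendly?', 'count': 3},
    {'publisher': 'Binnet & Hardley', 'title': 'Life Without Fear', 'count': 1},
    {'publisher': 'Binnet & Hardley', 'title': 'Silicon Valley', 'count': 1},
    {'publisher': 'Binnet & Hardley', 'title': 'The Gourmet Microwave', 'count': 1},
    {'publisher': 'New Age Books', 'title': 'Life Without Fear', 'count': 2},
    {'publisher': 'New Age Books', 'title': 'Sushi, Anyone?', 'count': 1},
]

# fold consecutive rows with the same publisher into one object
result = [
    {'publisher': p, 'titles': {r['title']: r['count'] for r in rs}}
    for p, rs in groupby(qs, itemgetter('publisher'))
]
```

If the input were not sorted by publisher, the same publisher would show up in several of the resulting objects.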
I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social...
My model looks as follows:
type StudentMarks
  @model
  @key(
    name: "filterbyClassAndMarks"
    fields: [
      "gender"
      "classCode"
      "mathsMarks"
      "socialMarks"
      "englishMarks"
      "scienceMarks"
    ]
    queryField: "filterbyClassAndMarks"
  )
  @auth(
    rules: [
      { allow: private, provider: iam, operations: [read] }
      { allow: public, provider: iam, operations: [read] }
    ]
  ) {
  id: ID!
  name: String
  gender: String
  classCode: String
  mathsMarks: String
  socialMarks: String
  englishMarks: String
  scienceMarks: String
}
The GSI created with gender as the partition key (hash key) and a compound sort/range key built from the classCode, mathsMarks, socialMarks, scienceMarks, and englishMarks fields is showing an item count of 0.
I am trying to query using the following graphql:
const listOfStudents = await API.graphql({
  query: queries.filterbyClassAndMarks,
  variables: {
    gender: "m",
    classCodeMathsMarksSocialMarksEnglishMarksScienceMarks: {
      between: [
        {
          classCode: "05",
          mathsMarks: "06",
          scienceMarks: "07",
          englishMarks: "04",
          socialMarks: "05",
        },
        {
          classCode: "11",
          mathsMarks: "90",
          scienceMarks: "91",
          englishMarks: "95",
          socialMarks: "92",
        },
      ],
    },
  },
  authMode: "AWS_IAM",
});
This table with 11 records should return 4 records as shown in green
The cloud watch logs are as below:
{
  "logType": "RequestMapping",
  "path": ["filterbyClassAndMarks"],
  "fieldName": "filterbyClassAndMarks",
  "resolverArn": "arn:aws:appsync:ap-south-1:488377001042:apis/4tbr6xolzjfctl5tdbiur7jqnu/types/Query/resolvers/filterbyClassAndMarks",
  "requestId": "5de45e66-1a5d-44dd-80cc-1a5b93a3c3aa",
  "context": {
    "arguments": {
      "gender": "m",
      "classCodeMathsMarksSocialMarksEnglishMarksScienceMarks": {
        "between": [
          {
            "classCode": "05",
            "mathsMarks": "06",
            "socialMarks": "05",
            "englishMarks": "04",
            "scienceMarks": "07"
          },
          {
            "classCode": "11",
            "mathsMarks": "90",
            "socialMarks": "92",
            "englishMarks": "95",
            "scienceMarks": "91"
          }
        ]
      }
    },
    "stash": {},
    "outErrors": []
  },
  "fieldInError": false,
  "errors": [],
  "parentType": "Query",
  "graphQLAPIId": "4tbr6xolzjfctl5tdbiur7jqnu",
  "transformedTemplate": "{\"version\":\"2018-05-29\",\"operation\":\"Query\",\"limit\":100,\"query\":{\"expression\":\"#gender = :gender AND #sortKey BETWEEN :sortKey0 AND :sortKey1\",\"expressionNames\":{\"#gender\":\"gender\",\"#sortKey\":\"classCode#mathsMarks#socialMarks#englishMarks#scienceMarks\"},\"expressionValues\":{\":gender\":{\"S\":\"m\"},\":sortKey0\":{\"S\":\"05#06#05#04#07\"},\":sortKey1\":{\"S\":\"11#90#92#95#91\"}}},\"index\":\"filterbyClassAndMarks\",\"scanIndexForward\":true}"
}
It is showing the scanned count as 0 in the result:
"result": {
"items": [],
"scannedCount": 0
},
And the final response is success with empty result as shown below:
Object { data: {…} }
  data: Object { filterbyClassAndMarks: {…} }
    filterbyClassAndMarks: Object { items: [], nextToken: null }
      items: Array [] (length: 0)
      nextToken: null
Key conditions on compound sort keys are possible as per the AWS documentation... I do not understand what I am missing exactly. Probably the GSI index is not properly configured/written.
Since the result is usually less than 1% of the records in the table and the workload is read-intensive, scanning/reading all the records and filtering them is a very naive solution. I need a better solution, either with indices or otherwise.
SECOND EDIT:
Example of the expected behavior of applying the hash key and compound sort key before filtering. The example is given to highlight the expected behavior only and does not indicate the approximate percentage of records left unread thanks to the hash key or compound sort key.
DynamoDB does not excel at supporting ad-hoc queries over an arbitrary list of attributes. If you want to fetch the same item using an undefined (or large) number of attributes (e.g. fetch a student by mathsMarks, socialMarks, scienceMarks, name, gender, etc.), you will be better off using something other than DynamoDB.
EDIT:
You've updated your question with information that fundamentally changes the access pattern. You initially said
I want to fetch records of all male or female students from dynamodb who have secured more than x1 marks in maths, more than x2 marks in science, less than y1 marks in social and less than y2 marks in English...
and later changed this access pattern to say (emphasis mine)
I want to fetch records of all male or female students from class 5 to class 11 who have secured between x1 and x2 marks in maths, between y1 and y2 marks in science, between z1 and z2 marks in english and between w1 and w2 marks in social
Partitioning your data on gender and class range might allow you to implement this access pattern. You also removed "... other variables" from your example schema, suggesting you might have a fixed list of marks: maths, science, social, and English. This is less important than gender and class, but hear me out :)
You didn't mention it in your original question, but your data suggests you have a fetch Student by ID access pattern. So I started by defining students by ID:
The partition key is STUDENT#student_id and the sort key is A. I sometimes use "METADATA" or "META" as the sort key, but "A" is nice and short.
To support your primary access pattern, I created a global secondary index named GSI1, with PK and SK attributes of GSI1PK and GSI1SK. I assigned GSI1PK the value STUDENTS#gender and GSI1SK the class attribute.
This partitions your data by gender and class. To narrow your results even further, you'd need to use filters on the various marks. For example, if you wanted to fetch all male students between class 5 and 9 with specific marks, you could do the following (in DynamoDB pseudocode):
QUERY from GSI1 where PK=STUDENTS#M AND SK BETWEEN 05 AND 09
FILTER englishMark BETWEEN 009 AND 050 AND
       mathsMark BETWEEN 050 AND 075 AND
       scienceMark BETWEEN 045 AND 065 AND
       socialMark BETWEEN 020 AND 035
Filtering in DynamoDB doesn't work the way most people expect. When filtering, DynamoDB:
Reads items from the table
Applies the filter to remove items that don't match
Returns the matching items
This can lead to awful performance (and cost) if you have a large table and are filtering for a very small set of data. For example, if you're executing a scan operation on terabytes of data and applying filters to identify a single item, you're going to have a bad time.
However, there are circumstances where filtering is a fine solution. One example is when you can use the Partition Key and Sort Key to narrow the set of data you're working with before applying a filter. In the above example, I'm dividing your data by gender (perhaps dividing your search space in half) before further narrowing the items by class. The amount of data you're left with might be small enough to filter out the remaining items effectively.
However, I'll point out that gender has rather low cardinality. Perhaps there are more specifics about your access pattern that could help with that. For example, maybe you could group students by primary/secondary school and create a PK of STUDENTS#F#1-5. Maybe you could group them by the school district, or zip code? No detail about access patterns is too specific when working with NoSQL databases!
The point of this exercise is to illustrate two points:
Filtering in DynamoDB is best achieved by selecting a good primary key.
The DynamoDB filter mechanism works best on smaller subsets of your data. Don't be afraid to use it in the right places!
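To make the read-then-filter semantics concrete, here is a plain-Python simulation; it is not the DynamoDB API, and the records and mark ranges are made up, but the distinction between what is read (and billed) and what is returned is the same:

```python
# Hypothetical student records; a real table would be queried through
# the DynamoDB API, but the read-then-filter behavior is identical.
items = [
    {"gender": "m", "classCode": "05", "mathsMarks": 60},
    {"gender": "m", "classCode": "07", "mathsMarks": 40},
    {"gender": "m", "classCode": "09", "mathsMarks": 70},
    {"gender": "f", "classCode": "06", "mathsMarks": 80},
    {"gender": "m", "classCode": "12", "mathsMarks": 65},
]

# Step 1: the key condition (PK = gender, SK = classCode) decides
# what is READ from the index -- you pay for all of these items
read = [i for i in items
        if i["gender"] == "m" and "05" <= i["classCode"] <= "09"]

# Step 2: the filter expression removes read-but-unwanted items
# AFTER the read; it does not reduce consumed capacity
returned = [i for i in read if 50 <= i["mathsMarks"] <= 75]
```

The better the key condition narrows the read set, the less the filter step costs you.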
In my application I have the models Visit & Post:
class Visit < ActiveRecord::Base
  belongs_to :post, :counter_cache => true
end

class Post < ActiveRecord::Base
  has_many :visits
end
When a visitor visits a post, I am adding it to my visits table with post_id and price (price is decimal).
In my dashboard, I want to show which posts they have viewed (grouped) and how much they have earned.
For instance:
1) post 1, viewed 54 times and earned $1.6, 2) post 2, viewed 39 times and earned $1.1, etc
I have tried with:
- a = Visit.group(:post_id).where(user: current_user).sum(:price)
- a.each do |n|
  %p
    = n
This gives me each post_id, but the price is just #<BigDecimal:7fb2625f9238,'0.15E1'...>, and I can't find the post title: doing n.post.title gives me the error undefined method 'post'.
This is result I get:
[44, #<BigDecimal:7fb2625f9238,'0.15E1',18(36)>]
[45, #<BigDecimal:7fb2625f8dd8,'0.13E1',18(36)>]
[46, #<BigDecimal:7fb2625f8928,'0.3E-1',9(36)>]
I have also tried with:
- Visit.select([:post_id, :price]).where(user: current_user).each do |e|
  %p
    = e.post.title
    = e.cpc_bid
This option gives me all the posts and prices individually and not combined.
Results are like:
Post title 1, 0.15
Post title 1, 0.01
Post title 2, 0.1
Post title 1, 0.15
Post title 2, 0.1
Post title 2, 0.1
Post title 2, 0.1
Post title 1, 0.15
I also tried with:
- Visit.select([:post_id, :price]).group(:post_id).where(user: current_user).each do |e|
  %p
    = e.post.title
    = e.price
This option gives me only one of the visits on the post with its price.
Results are:
Post title 2, 0.1
Post title 1, 0.15
My last try was:
- Visit.joins(:post).group(:post_id).select('sum(price) as earnings', :post_id, :title, 'count(visits.id) as total_views').where(user: current_user).each do |e|
  %p
    = e.title
    = e.price
This gives me this error:
PG::GroupingError: ERROR: column "posts.title" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...ECT sum(price) as earnings, "visits"."post_id", "title", c...
How can I combine them, summing the price per post, together with each post's title?
You need to join the tables and group:
Post.joins(:visits).group(:id)
    .where(visits: { user_id: current_user.id })
    .select("*, sum(price) as total_price, count(visits.id) as total_views")
This adds total_price and total_views accessors to each post instance.
The answer from #MikDiet works perfectly; the only issue I had was that I was getting this error from PostgreSQL:
PG::GroupingError: ERROR: column "posts.title" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...ECT sum(price) as earnings, "visits"."post_id", "title", c...
so I changed it to:
Visit.joins(:post).group([:post_id, :title]).select('sum(cpc_bid) as earnings', :post_id, :title, 'count(visits.id) as total_views').where(influencer: current_user)
And it worked.
NB: I couldn't have found the answer without #MikDiet's help, and all credit goes to him.
I have a text which contains different news articles about terrorist attacks. Each article starts with an html tag (<p>Advertisement) and I would like to extract from each article a specific information: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,0
article_4,40
article_5,150
article_6,94
Any suggestion in how to make it more readable?
I rewrote your loop as follows and merged in the csv writing, since you requested it:
import csv

with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)
get the index of the article using enumerate
sum the number of victims (in case more than one group matches) using a generator expression that converts the non-empty groups to integers and passes them to sum; this happens only if something matched (the conditional expression checks that)
create the row
write it as a row (one row per iteration) of the csv.writer object
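For reference, here is the whole pipeline run on two made-up article fragments (standing in for the split file contents), without the file I/O:

```python
import re

# two made-up article fragments standing in for the split file contents
splitted = [
    "BRUSSELS - A man wounded 2 police officers with a knife.",
    "At least 33 people were killed and 25 were injured in the attack.",
]

pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"

rows = []
for i, article in enumerate(splitted):
    result = re.findall(pattern, article)
    # each match is a 3-tuple with at most one non-empty group;
    # sum the groups of the first match, or 0 if nothing matched
    nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
    rows.append(["article_{}".format(i + 1), nb_casualties])
```

Each entry of `rows` is exactly what writer.writerow(row) would emit as one CSV line.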
A very similar post was made about this issue here. In cloudant, I have a document structure storing when users access an application, that looks like the following:
{"username":"one","timestamp":"2015-10-07T15:04:46Z"}---| same day
{"username":"one","timestamp":"2015-10-07T19:22:00Z"}---^
{"username":"one","timestamp":"2015-10-25T04:22:00Z"}
{"username":"two","timestamp":"2015-10-07T19:22:00Z"}
What I want to know is to count the # of unique users for a given time period. Ex:
2015-10-07 = {"count": 2} two different users accessed on 2015-10-07
2015-10-25 = {"count": 1} one different user accessed on 2015-10-25
2015 = {"count" 2} two different users accessed in 2015
This all just becomes tricky because, for example, on 2015-10-07 the user with username one has two access records, but they should add only 1 to the total of unique users.
I've tried:
function(doc) {
  var time = new Date(Date.parse(doc['timestamp']));
  emit([time.getUTCFullYear(),time.getUTCMonth(),time.getUTCDay(),doc.username], 1);
}
This suffers from several issues, which are highlighted by Jesus Alva in a comment on the post I linked to above.
Thanks!
There's probably a better way of doing this, but off the top of my head ...
You could try emitting an index for each level of granularity:
function(doc) {
  var time = new Date(Date.parse(doc['timestamp']));
  var year = time.getUTCFullYear();
  var month = time.getUTCMonth()+1;
  var day = time.getUTCDate();

  // day granularity
  emit([year,month,day,doc.username], null);

  // year granularity
  emit([year,doc.username], null);
}
// reduce function - `_count`
Day query (2015-10-07):
inclusive_end=true&
start_key=[2015, 10, 7, "\u0000"]&
end_key=[2015, 10, 7, "\uefff"]&
reduce=true&
group=true
Day query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,10,7,"one"],"value":2},
{"key":[2015,10,7,"two"],"value":1}
]}
Year query:
inclusive_end=true&
start_key=[2015, "\u0000"]&
end_key=[2015, "\uefff"]&
reduce=true&
group=true
Query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,"one"],"value":3},
{"key":[2015,"two"],"value":1}
]}
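Counting the unique users is then just counting the rows of the grouped response in your application code, since each row is one (period, username) group. For example, in Python, with the year-query result from above pasted in:

```python
# the grouped year-query response from above
response = {"rows": [
    {"key": [2015, "one"], "value": 3},
    {"key": [2015, "two"], "value": 1},
]}

# each row is one (year, username) group, so the number of rows
# is the number of unique users for that year
unique_users = len(response["rows"])
```

The per-row values (3 and 1) are the access counts per user, which you can ignore when all you need is the unique-user total.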