Django ORM: calculation only inside database query possible? - django

I have a rather simple dataset containing the following data:
id | aqi | date | state_name
1 | 17 | 2020-01-01 | California
2 | 54 | 2020-01-02 | California
3 | 37 | 2020-01-03 | California
4 | 29 | 2020-01-04 | California
What I'm trying to achieve is the average aqi (air quality index) from April 2022 minus the average aqi from April 2021, without using multiple queries. Is this even possible, or should I use two queries and compare them manually?
From my understanding, I should use a Q expression to filter the correct dates, correct?
AirQuality.objects.filter(Q(date__range=['2021-04-01', '2021-04-30']) & Q(date__range=['2022-04-01', '2022-04-30']))
The best solution I came across myself so far is:
import datetime
from django.db.models import Avg, Q

qs_apr20 = (
    AirQuality.objects
    .aggregate(apr20=Avg('aqi', filter=Q(date__range=(datetime.date(2020, 4, 1), datetime.date(2020, 4, 30)))))['apr20']
)
qs_apr21 = (
    AirQuality.objects
    .aggregate(apr21=Avg('aqi', filter=Q(date__range=(datetime.date(2021, 4, 1), datetime.date(2021, 4, 30)))))['apr21']
)
result = round(qs_apr21 - qs_apr20, 2)
Thanks for your help and have a great day!

Taking inspiration from the documentation, the following should work:
>>> import datetime
>>> from django.db.models import Avg, F, Q
>>> apr21 = Avg('aqi', filter=Q(date__range=(datetime.date(2021, 4, 1), datetime.date(2021, 4, 30))))
>>> apr22 = Avg('aqi', filter=Q(date__range=(datetime.date(2022, 4, 1), datetime.date(2022, 4, 30))))
>>> aqi_calc = (AirQuality.objects
...             .annotate(apr21=apr21)
...             .annotate(apr22=apr22)
...             .annotate(diff=F('apr22') - F('apr21')))
It should do everything in 1 query, if I'm not mistaken.
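If the goal is a single scalar difference rather than per-row annotations, one aggregate() call also keeps it to a single query. A minimal sketch, assuming the AirQuality model from the question:
import datetime
from django.db.models import Avg, Q

# Both monthly averages come back from the same aggregate() call, i.e. one database query.
totals = AirQuality.objects.aggregate(
    apr21=Avg('aqi', filter=Q(date__range=(datetime.date(2021, 4, 1), datetime.date(2021, 4, 30)))),
    apr22=Avg('aqi', filter=Q(date__range=(datetime.date(2022, 4, 1), datetime.date(2022, 4, 30)))),
)
result = round(totals['apr22'] - totals['apr21'], 2)  # difference computed in Python, single DB hit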

Related

Join 2 Dataframes with Regex in where clause pyspark

We have two dataframes
df = spark.createDataFrame([
    (1, 'Nick', 'Miller'),
    (2, 'Jessica', 'Day'),
    (3, 'Winston', 'Schmidt'),
], ['id', 'First_name', 'Last_name'])
df1 = spark.createDataFrame([
    (1, '^[a-lA-L]', 'type1'),
    (3, '^[m-zM-Z]', 'type2'),
], ['id', 'regex_match', 'vaule'])
I need to join these two dataframes so that df1.regex_match matches against df.Last_name (join df to df1 using a left join). Any suggestions, please?
You can join using an rlike condition:
import pyspark.sql.functions as F

result = df.alias('df').join(
    df1.drop('id').alias('df1'),
    F.expr('df.Last_name rlike df1.regex_match'),
    'left'
).drop('regex_match')
result.show()
+---+----------+---------+-----+
| id|First_name|Last_name|vaule|
+---+----------+---------+-----+
|  1|      Nick|   Miller|type2|
|  2|   Jessica|      Day|type1|
|  3|   Winston|  Schmidt|type2|
+---+----------+---------+-----+
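Since an rlike condition is a non-equi join, Spark cannot use a hash join here. If df1 is small, an explicit broadcast hint should keep the plan as a broadcast nested-loop join rather than a full cartesian evaluation; a variant of the same query, assuming the same dataframes as above:
import pyspark.sql.functions as F

# Same join as above, with a broadcast hint on the small regex table.
result = df.alias('df').join(
    F.broadcast(df1.drop('id')).alias('df1'),
    F.expr('df.Last_name rlike df1.regex_match'),
    'left'
).drop('regex_match')
result.show()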

match index from pyspark dataframe in pandas

I have the following pyspark dataframe (testDF = ldamodel.describeTopics().select("termIndices").toPandas()):
+-----+------------+--------------------+
|topic| termIndices|         termWeights|
+-----+------------+--------------------+
|    0| [6, 118, 5]|[0.01205522104545...|
|    1|[0, 55, 100]|[0.00125521761966...|
+-----+------------+--------------------+
and I have the following word list:
['one',
'peopl',
'govern',
'think',
'econom',
'rate',
'tax',
'polici',
'year',
'like',
........]
I am trying to match vocablist to termIndices to termWeights.
So far I have following:
for i in testDF.items():
    for j in i[1]:
        for m in j:
            t = vocablist[m], m
            print(t)
which results into:
('tax', 6)
('insur', 118)
('rate', 5)
('peopl', 1)
('health', 84)
('incom', 38)
('think', 3)
('one', 0)
('social', 162)
.......
But I wanted something like
('tax', 6, 0.012055221045453202)
('insur', 118, 0.001255217619666775)
('rate', 5, 0.0032220995010401187)
('peopl', 1, 0.008342115226031033)
('health', 84, 0.0008332053105123403)
('incom', 38, ......)
Any help will be appreciated.
I would recommend spreading the lists in the termIndices and termWeights columns downward, so that each index gets its own row. Once you've done that, you can map the indices to their term names while keeping the term weights aligned with each term index. The following is an illustration:
import pandas as pd

df = pd.DataFrame(data={'topic': [0, 1],
                        'termIndices': [[6, 118, 5],
                                        [0, 55, 100]],
                        'termWeights': [[0.012055221045453202, 0.012055221045453202, 0.012055221045453202],
                                        [0.00125521761966, 0.00125521761966, 0.00125521761966]]})
dff = df.apply(lambda s: s.apply(pd.Series).stack().reset_index(drop=True, level=1))
vocablist = ['one', 'peopl', 'govern', 'think', 'econom', 'rate', 'tax', 'polici', 'year', 'like'] * 50
dff['termNames'] = dff.termIndices.map(vocablist.__getitem__)
dff[['termNames', 'termIndices', 'termWeights']].values.tolist()
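As a side note, newer pandas versions can do the same downward spreading with DataFrame.explode. A sketch assuming pandas >= 1.3 (needed for multi-column explode) and the df and vocablist defined above:
# Multi-column explode keeps termIndices and termWeights aligned row by row.
dff = df.explode(['termIndices', 'termWeights'])
dff['termNames'] = dff['termIndices'].map(lambda i: vocablist[int(i)])
print(dff[['termNames', 'termIndices', 'termWeights']].values.tolist())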
I hope this helps.

Can't query the sum of values using aggregators

I want to sum the values of all existing rows grouping by another field.
Here's my model structure:
class Answer(models.Model):
    person = models.ForeignKey('Person')  # string reference, since Person is declared below
    points = models.PositiveIntegerField(default=100)
    correct = models.BooleanField(default=False)

class Person(models.Model):
    pass  # irrelevant model fields
Sample dataset:
Person | Points
------ | ------
4 | 90
3 | 50
3 | 100
2 | 100
2 | 90
Here's my query:
Answer.objects.values('person').filter(correct=True).annotate(points_person=Sum('points'))
And the result (you can see that all the person values are separated):
[{'person': 4, 'points_person': 90}, {'person': 3, 'points_person': 50}, {'person': 3, 'points_person': 100}, {'person': 2, 'points_person': 100}, {'person': 2, 'points_person': 90}]
But what I want (sum of points by each person):
[{'person': 4, 'points_person': 90}, {'person': 3, 'points_person': 150}, {'person': 2, 'points_person': 190}]
Is there any way to achieve this using only queryset filtering?
Thanks!
Turns out I had to do the inverse filtering, by Person and not by Answer, like so:
Person.objects.filter(answer__correct=True).annotate(points=Sum('answer__points'))
Now I get the total summed points for each person correctly.
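For reference, reading the annotated queryset back into the list-of-dicts shape from the question could look like this (a small sketch, assuming the models above):
from django.db.models import Sum

people = (
    Person.objects
    .filter(answer__correct=True)
    .annotate(points_person=Sum('answer__points'))
)
# roughly: [{'person': 4, 'points_person': 90}, {'person': 3, 'points_person': 150}, {'person': 2, 'points_person': 190}]
print([{'person': p.pk, 'points_person': p.points_person} for p in people])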

find the index of maximum element of 2nd column of a 2 dimensional list

I am a newbie in Python.
I need to find the index of the maximum element of the 2nd column of a 2-dimensional list, as follows:
3 | 23
7 | 12
5 | 42
1 | 25
so the output should be the index of 42, i.e. 2.
I tried converting the list to a numpy array and then using argmax, but in vain.
Thanks!!
>>> import numpy as np
>>> a = np.asarray([[3, 7, 5, 1], [23, 12, 42, 25]])
>>> a
array([[ 3,  7,  5,  1],
       [23, 12, 42, 25]])
>>> a[1]
array([23, 12, 42, 25])
>>> np.argmax(a[1])
2
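If the data is kept row-wise as in the question, the same argmax works on the second column. A small sketch assuming that layout:
>>> import numpy as np
>>> data = [[3, 23], [7, 12], [5, 42], [1, 25]]  # rows as listed in the question
>>> a = np.asarray(data)
>>> np.argmax(a[:, 1])  # row index of 42 in the second column
2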

Pandas: What is the best way to 'crop' a large dataframe to only the previous 1000 days?

I have a dataframe where the index is made up of datetimes. I also have an anchor date and I know that I only want the second dataframe to contain the 1000 days previous to the anchor date. What is the best way to do this?
Don't know if it's the best way, but it should work.
Create example DataFrame:
>>> dates = [pd.datetime(2012, 5, 4), pd.datetime(2012, 5, 5), pd.datetime(2012, 5, 6), pd.datetime(2012, 5, 1), pd.datetime(2012, 5, 2), pd.datetime(2012, 5, 3)]
>>> values = [1, 2, 3, 4, 5, 6]
>>> df = pd.DataFrame(values, dates)
>>> df
            0
2012-05-04  1
2012-05-05  2
2012-05-06  3
2012-05-01  4
2012-05-02  5
2012-05-03  6
Suppose we want 2 days back from 2012-05-04:
>>> date_end = pd.datetime(2012, 5, 4)
>>> date_start = date_end - pd.DateOffset(days=2)
>>> date_start, date_end
(datetime.datetime(2012, 5, 2, 0, 0), datetime.datetime(2012, 5, 4, 0, 0))
Now let's try to get rows by label indexing:
>>> df.loc[date_start:date_end]
Empty DataFrame
Columns: [0]
Index: []
That's because our index is not sorted, so let's fix it:
>>> df.sort_index(inplace=True)
>>> df.loc[date_start:date_end]
            0
2012-05-02  5
2012-05-03  6
2012-05-04  1
It's also possible to get rows by datetime indexing:
>>> df[date_start:date_end]
            0
2012-05-02  5
2012-05-03  6
2012-05-04  1
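Applied to the original question of the previous 1000 days, the same pattern could look like this (a sketch with hypothetical data and a made-up anchor date):
>>> import numpy as np
>>> import pandas as pd
>>> idx = pd.date_range('2015-01-01', periods=3000, freq='D')
>>> df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)
>>> anchor = pd.Timestamp('2020-01-01')          # hypothetical anchor date
>>> start = anchor - pd.DateOffset(days=1000)    # 1000 days before the anchor
>>> cropped = df.sort_index().loc[start:anchor]  # keep only the previous 1000 days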
Keep in mind that I'm still not an expert in Pandas, but I like it for Data Analysis very much.
Hope it helps.