I'm trying to remove records from a table that have a duplicate value within the same ID, dropping the older timestamp(s) and keeping the newest one, so the result is one row per unique (id, value) pair with the newest timestamp kept. Hopefully the samples below make sense.
sample data:
id value timestamp
10 10 9/4/20 17:00
11 17 9/4/20 17:00
21 50 9/4/20 17:00
10 10 9/4/20 16:00
10 10 9/4/20 15:00
10 11 9/4/20 14:00
11 41 9/4/20 16:00
11 41 9/4/20 15:00
21 50 9/4/20 16:00
So I'd like to remove any rows that have a duplicate value with the same id, keeping the newest timestamps. The above data would become:
id value timestamp
10 10 9/4/20 17:00
11 17 9/4/20 17:00
21 50 9/4/20 17:00
10 11 9/4/20 14:00
11 41 9/4/20 16:00
EDIT:
The query is just:
SampleData.objects.all()
One approach could be using Subquery expressions as documented here.
Suppose your SampleData model looks like this:
class SampleData(models.Model):
    id2 = models.IntegerField()
    value = models.IntegerField()
    timestamp = models.DateTimeField()
(I replaced id with id2 to avoid a conflict with the model's automatic id field.)
Then you could delete your duplicates like this:
from django.db.models import F, OuterRef, Subquery

newest = SampleData.objects.filter(id2=OuterRef('id2'), value=OuterRef('value')).order_by('-timestamp')
SampleData.objects.annotate(newest_id=Subquery(newest.values('pk')[:1])).exclude(pk=F('newest_id')).delete()
Edit:
It seems that MySQL has some issues handling deletions combined with subqueries on the same table, as documented in this SO post.
In this case a two-step approach should help: first get the ids of the objects to delete, then delete them:
newest = SampleData.objects.filter(id2=OuterRef('id2'), value=OuterRef('value')).order_by('-timestamp')
ids2delete = list(SampleData.objects.annotate(newest_id=Subquery(newest.values('pk')[:1])).exclude(pk=F('newest_id')).values_list('pk', flat=True))
SampleData.objects.filter(pk__in=ids2delete).delete()
Related
Example
Record Table
id value created_datetime
1 10 2022-01-18 10:00:00
2 11 2022-01-18 10:15:00
3 8 2022-01-18 15:15:00
4 25 2022-01-19 09:00:00
5 16 2022-01-19 12:00:00
6 9 2022-01-20 11:00:00
I want to filter this 'Record Table' to get the latest value for each date. For example, there are three dates (2022-01-18, 2022-01-19, 2022-01-20), and the latest value for each of these dates is as follows.
Latest value of each date (the result I am looking to get):
id value created_datetime
3 8 2022-01-18 15:15:00
5 16 2022-01-19 12:00:00
6 9 2022-01-20 11:00:00
So how do I filter to receive results like the table above?
It can be done in two steps:
First get the latest datetime for each day, then filter the records by that.
from django.db.models import Max

max_daily_date_times = (Record.objects
                        .extra(select={'day': 'date(created_datetime)'})
                        .values('day')
                        .annotate(latest_datetime=Max('created_datetime')))
records = (Record.objects
           .filter(created_datetime__in=[entry['latest_datetime'] for entry in max_daily_date_times])
           .values('id', 'value', 'created_datetime'))
I am trying to replicate the data that is used in the "When your fans are online" section of a business page's insights dashboard. I am using the following parameters in the /insights/page_fans_online api call, which returns the data I am after:
parameters={'period':'day','since':'2018-10-20T07:00:00','until':'2018-10-21T07:00:00','access_token':page_token['access_token'][0]}
The data returned can be seen below, where:
end_time = end_time (based on the since & until dates in the parameters)
name = metric
apiHours = hour of day returned
localDate = localized date (applied manually)
localHours = apiHours with a -6 hour offset to localize to Auckland/New Zealand (applied manually to replicate what is seen on the insights dashboard)
fansOnline = number of unique page fans online during that hour
Data:
end_time name apiHours localDate localHours fansOnline
2018-10-21T07:00:00+0000 page_fans_online 0 2018-10-19 18 21
1 2018-10-19 19 29
2 2018-10-19 20 20
3 2018-10-19 21 18
4 2018-10-19 22 20
5 2018-10-19 23 15
6 2018-10-19 0 4
7 2018-10-19 1 6
8 2018-10-19 2 5
9 2018-10-19 3 8
10 2018-10-19 4 17
11 2018-10-19 5 19
12 2018-10-19 6 26
13 2018-10-19 7 24
14 2018-10-19 8 20
15 2018-10-19 9 22
16 2018-10-19 10 19
17 2018-10-19 11 22
18 2018-10-19 12 18
19 2018-10-19 13 18
20 2018-10-19 14 18
21 2018-10-19 15 18
22 2018-10-19 16 21
23 2018-10-19 17 28
It took a while to work out that the data returned when pulling page_fans_online using the parameters specified above is for October 19th, for a New Zealand business page.
If we look at the last row in the data above:
end_time = 2018-10-21
apiHours = 23
localDate = 2018-10-19
localHours = 17
fansOnline = 28
It says that on 2018-10-21 at 11 pm there were 28 unique fans online. When the dates and times are manually localized, this translates to 28 unique fans online on 2018-10-19 at 5 pm (I worked the offset out by checking the "When your fans are online" graphs on the page insights).
There is a -54 hour offset between 2018-10-21 11:00 pm and 2018-10-19 5:00 pm, and my question is, what is the logic used behind the end_time and hour of day returned by the page_fans_online insights metric and is there any info regarding how this should be localized depending on what country the business is located?
There is only a simple description of page_fans_online in the page/insights docs; it says the hours are in PST/PDT, but that does not help with localizing the date and hour of day:
https://developers.facebook.com/docs/graph-api/reference/v3.1/insights
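For reference, the manual localization described above boils down to a small bit of arithmetic. This is only a sketch of the empirically derived rule for this particular page (hour minus 6 mod 24, date two days before end_time); it is not an officially documented conversion:

from datetime import datetime, timedelta

# Rule worked out from the sample data above (an assumption, not documented by the API):
#   local hour = (apiHours - 6) % 24
#   local date = end_time date - 2 days
end_time = datetime.fromisoformat('2018-10-21T07:00:00+00:00')
api_hour = 23  # last row of the sample data

local_hour = (api_hour - 6) % 24                    # -> 17 (5 pm)
local_date = (end_time - timedelta(days=2)).date()  # -> 2018-10-19
print(local_date, local_hour)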
I am currently trying to create a report that shows how customers behave over time, but instead of doing this by date, I am doing it by customer age (number of months since they first became a customer). So using a date field isn't really an option, considering one customer may have started in Dec 2016 and another starts in Jun 2017.
What I'm trying to find is the month-over-month change in units purchased. If I was using a date field, I know that I could use
[Previous Month Total] = CALCULATE(SUM([Total Units]), PREVIOUSMONTH([FiscalDate]))
I also thought about using EARLIER() to find this out, but I don't think it would work in this case, as it requires a row context that I'm not sure I could create. Below is a simplified version of the table that I'll be using.
ID Date Age Units
219 6/1/2017 0 10
219 7/1/2017 1 5
219 8/1/2017 2 4
219 9/1/2017 3 12
342 12/1/2016 0 500
342 1/1/2017 1 280
342 2/1/2017 2 325
342 3/1/2017 3 200
342 4/1/2017 4 250
342 5/1/2017 5 255
How about something like this?
PrevTotal =
VAR CurrAge = SELECTEDVALUE(Table3[Age])
RETURN CALCULATE(SUM(Table3[Units]), ALL(Table3[Date]), Table3[Age] = CurrAge - 1)
The CurrAge variable gives the Age evaluated in the current filter context. You then plug that into a filter in the CALCULATE line.
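If it helps to sanity-check the measure outside Power BI, the same "previous age" lookup can be sketched in pandas on the sample table (purely illustrative, not a DAX alternative):

import pandas as pd

df = pd.DataFrame({
    'ID':    [219, 219, 219, 219, 342, 342, 342],
    'Age':   [0, 1, 2, 3, 0, 1, 2],
    'Units': [10, 5, 4, 12, 500, 280, 325],
})

# Previous month's total for each customer = Units at Age - 1
df['PrevTotal'] = df.sort_values(['ID', 'Age']).groupby('ID')['Units'].shift(1)
print(df)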
I have a csv file like below.
Beat,Hour,Month,Primary Type,COUNTER
111,10AM,Apr,ASSAULT,12
111,10AM,Apr,BATTERY,5
111,10AM,Apr,BURGLARY,1
111,10AM,Apr,CRIMINAL DAMAGE,4
111,10AM,Aug,MOTOR VEHICLE THEFT,2
111,10AM,Aug,NARCOTICS,1
111,10AM,Aug,OTHER OFFENSE,18
111,10AM,Aug,THEFT,38
Now I want to find the % of each Primary Type grouped by the first three columns. For example, for Beat = 111, Hour = 10AM, Month = Apr: %ASSAULT = 12 / (12 + 5 + 1 + 4) * 100. Can anyone give a clue on how to do this using pandas?
You can use transform with 'sum':
df['New']=df.COUNTER/df.groupby(['Beat','Hour','Month']).COUNTER.transform('sum')*100
df
Out[575]:
Beat Hour Month Primary Type COUNTER New
0 111 10AM Apr ASSAULT 12 54.545455
1 111 10AM Apr BATTERY 5 22.727273
2 111 10AM Apr BURGLARY 1 4.545455
3 111 10AM Apr CRIMINAL DAMAGE 4 18.181818
4 111 10AM Aug MOTOR VEHICLE THEFT 2 3.389831
5 111 10AM Aug NARCOTICS 1 1.694915
6 111 10AM Aug OTHER OFFENSE 18 30.508475
7 111 10AM Aug THEFT 38 64.406780
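For completeness, a self-contained version of the same transform idea; the file name crimes.csv is just a placeholder for the CSV shown in the question:

import pandas as pd

df = pd.read_csv('crimes.csv')  # placeholder path for the CSV shown above

# Total COUNTER per (Beat, Hour, Month) group, broadcast back to each row
group_totals = df.groupby(['Beat', 'Hour', 'Month'])['COUNTER'].transform('sum')
df['New'] = df['COUNTER'] / group_totals * 100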
The pandas DF has datetime index with price and volume at that price.
Last Volume
Date_Time
20160907 070000 1.1249 17
20160907 070001 1.1248 12
20160907 070001 1.1249 15
20160907 070002 1.1248 13
20160907 070002 1.1249 20
I want to create a column that keeps a running total (sum) of volume through the sequence when the price repeats. I am trying to create a column that would look like this:
Last Volume VolumeCount
1.1249 17 17
1.1248 12 12
1.1249 15 32
1.1248 13 25
1.1249 20 52
I have been working on different functions and loops, and I can't seem to create a column that isn't just the total sum of the group. I would really appreciate any help or suggestions. Thank you.
Try:
DF['VolumeCount'] = DF.groupby('Last')['Volume'].cumsum()
I hope this helps.
You want to accumulate volume over contiguous sets of the same Last value.
Consider the df:
Last Volume
Date_Time
20160907-70000 1.1249 17
20160907-70001 1.1248 12
20160907-70001 1.1248 15
20160907-70002 1.1248 13
20160907-70002 1.1249 20
Then
df.Volume.groupby((df.Last != df.Last.shift()).cumsum()).cumsum()
Date_Time
20160907-70000 17
20160907-70001 12
20160907-70001 27
20160907-70002 40
20160907-70002 20
Name: Volume, dtype: int64
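To attach that result back as the VolumeCount column, the block labels from the shift/cumsum trick can be passed straight to groupby:

# Each contiguous run of an identical Last price gets its own block label
blocks = (df.Last != df.Last.shift()).cumsum()
df['VolumeCount'] = df.groupby(blocks)['Volume'].cumsum()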