Pandas Aggregate/Group by based on most recent date

Pandas Aggregate/Group by based on most recent date - python-2.7

I have a DataFrame as follows, where Id is a string and Date is a datetime:
Id Date
1 3-1-2012
1 4-8-2013
2 1-17-2013
2 5-4-2013
2 10-30-2012
3 1-3-2013
I'd like to consolidate the table to just show one row for each Id which has the most recent date.
Any thoughts on how to do this?

You can groupby the Id field:
In [11]: df
Out[11]:
Id Date
0 1 2012-03-01 00:00:00
1 1 2013-04-08 00:00:00
2 2 2013-01-17 00:00:00
3 2 2013-05-04 00:00:00
4 2 2012-10-30 00:00:00
5 3 2013-01-03 00:00:00
In [12]: g = df.groupby('Id')
If you are not certain about the ordering, you could do something along the lines:
In [13]: g.agg(lambda x: x.iloc[x.Date.argmax()])
Out[13]:
Date
Id
1 2013-04-08 00:00:00
2 2013-05-04 00:00:00
3 2013-01-03 00:00:00
which for each group grabs the row with largest (latest) date (the argmax part).
If you knew they were in order you could take the last (or first) entry:
In [14]: g.last()
Out[14]:
Date
Id
1 2013-04-08 00:00:00
2 2012-10-30 00:00:00
3 2013-01-03 00:00:00
(Note: they're not in order, so this doesn't work in this case!)
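The agg-with-lambda call above dates from an older pandas release and may not run as written on recent versions. As a minimal sketch, assuming a reasonably current pandas, the same "row with the latest Date per Id" result can be obtained by using idxmax to find the index label of the maximum date in each group and then selecting those labels with loc. The sample frame is rebuilt from the question's data; the name latest is only illustrative.

```python
import pandas as pd

# Sample data taken from the question (dates are month-day-year).
df = pd.DataFrame({
    "Id": ["1", "1", "2", "2", "2", "3"],
    "Date": pd.to_datetime(
        ["3-1-2012", "4-8-2013", "1-17-2013", "5-4-2013", "10-30-2012", "1-3-2013"],
        format="%m-%d-%Y",
    ),
})

# idxmax returns, per Id, the index label of the row with the latest Date;
# loc then selects those rows by label, so row order does not matter.
latest = df.loc[df.groupby("Id")["Date"].idxmax()]
print(latest)
```

Because idxmax returns labels rather than positions, selecting with loc stays correct even when the index is not a plain 0..n-1 range.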

In Hayden's answer, I think using x.loc instead of x.iloc is better, since the index of the df DataFrame could be sparse (in which case iloc will not work).
(I don't have enough reputation on Stack Overflow to post this as a comment on the answer.)

Related

How to filter in django to get each date latest record

Example
Record Table
id value created_datetime
1 10 2022-01-18 10:00:00
2 11 2022-01-18 10:15:00
3 8 2022-01-18 15:15:00
4 25 2022-01-19 09:00:00
5 16 2022-01-19 12:00:00
6 9 2022-01-20 11:00:00
I want to filter this table, 'Record Table', to get the latest value for each date. For example, there are three dates (2022-01-18, 2022-01-19 and 2022-01-20), and the latest record for each of them is as follows.
Latest value for each date (the result I am looking to get):
id value created_datetime
3 8 2022-01-18 15:15:00
5 16 2022-01-19 12:00:00
6 9 2022-01-20 11:00:00
So how do I filter to receive the results shown in the table above?

It can be done in two steps:
First get the latest datetime for each day and then filter the records by that.
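from django.db.models import Max
# Step 1: latest created_datetime for each calendar day
max_daily_date_times = Record.objects.extra(select={'day': 'date( created_datetime )'}).values('day') \
    .annotate(latest_datetime=Max('created_datetime'))
# Step 2: keep only the records whose timestamp is one of those daily maxima
records = Record.objects.filter(
    created_datetime__in=[entry["latest_datetime"] for entry in max_daily_date_times]
).values("id", "value", "created_datetime")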
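The two-step logic of this answer can be illustrated without a database. The following is only a plain-Python sketch of the idea, not Django code: the list of dicts stands in for the Record table from the question, and the names latest_per_day and latest_records are made up for the illustration.

```python
from datetime import datetime

# Stand-in for the Record table from the question.
records = [
    {"id": 1, "value": 10, "created_datetime": datetime(2022, 1, 18, 10, 0)},
    {"id": 2, "value": 11, "created_datetime": datetime(2022, 1, 18, 10, 15)},
    {"id": 3, "value": 8, "created_datetime": datetime(2022, 1, 18, 15, 15)},
    {"id": 4, "value": 25, "created_datetime": datetime(2022, 1, 19, 9, 0)},
    {"id": 5, "value": 16, "created_datetime": datetime(2022, 1, 19, 12, 0)},
    {"id": 6, "value": 9, "created_datetime": datetime(2022, 1, 20, 11, 0)},
]

# Step 1: find the latest datetime for each calendar day.
latest_per_day = {}
for record in records:
    stamp = record["created_datetime"]
    day = stamp.date()
    if day not in latest_per_day or stamp > latest_per_day[day]:
        latest_per_day[day] = stamp

# Step 2: keep only the records whose datetime is one of those daily maxima.
latest_times = set(latest_per_day.values())
latest_records = [r for r in records if r["created_datetime"] in latest_times]
print([r["id"] for r in latest_records])
```

As in the ORM version, two records sharing the exact same latest timestamp on a day would both be returned.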

How can Django bulk_create check for duplicates within the objs list and against existing rows?

I have a lot of data, and it is pretty dirty. For example:
Model A (ORM):
id = models.CharField(default='', max_length=50)
time = models.DateTimeField(default=timezone.now)
number = models.CharField(default='', max_length=20)
value = models.CharField(default='', max_length=20)
unique_together = ['id', 'time', 'number']
Existing data in table A:
id time number value
1 2018-07-16 00:00:00 1 64
1 2018-07-16 00:00:00 2 -99
1 2018-07-16 00:00:00 3 655
1 2018-07-16 00:00:00 4 3
2 2018-07-16 00:00:00 0 12
Data to import (sample):
id time number value
1 2018-07-16 00:00:00 1 64
3 2018-07-16 00:00:00 0 -99
3 2018-07-16 00:00:00 0 11
4 2018-07-16 00:00:00 0 -99
4 2018-07-16 00:00:00 1 -99
So, When I Do
for loop....
objs = []
objs.append(A(**kwargs))
A.objects.bulk_create(objs, batch_size=50000)
This raises two kinds of duplicate errors:
1. Table A already contains "1 2018-07-16 00:00:00 1".
2. The import data contains "3 2018-07-16 00:00:00 0" twice in objs, so bulk_create raises a duplicate error and rolls back the whole commit!
For case 1, I can use get_or_create to solve it,
but for case 2, I can't check whether the row I am about to append already exists in objs.
I tried the following check, but once there are more than 1,000,000 rows
the complexity is terrible.
def search(id, time, number, objs):
    for obj in objs:
        if obj['id'] == id and obj['time'] == time and obj['number'] == number:
            return True
    return False
Is there a better way? Thanks.

You can add a tuple with id, time and number to a set:
objs = []
duplicate_check = set()
for kwargs in rows:  # rows is a placeholder for whatever the import loop iterates over
    data = kwargs['id'], kwargs['time'], kwargs['number']
    if data not in duplicate_check:
        objs.append(A(**kwargs))
        duplicate_check.add(data)
A.objects.bulk_create(objs, batch_size=50000)
Set membership tests and insertions are O(1) on average, so the whole check is linear in the number of rows.
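The set-based check can be tried out without Django. The sketch below uses plain dicts in place of model instances and the sample import rows from the question; the helper name dedupe_rows and its existing_keys parameter are hypothetical. Seeding the set with keys already present in the table is an assumption added here to cover the first kind of duplicate as well.

```python
def dedupe_rows(rows, existing_keys=()):
    """Return rows whose (id, time, number) key has not been seen before."""
    seen = set(existing_keys)  # keys already in the table, if known
    unique = []
    for row in rows:
        key = (row["id"], row["time"], row["number"])
        if key not in seen:  # average O(1) membership test
            seen.add(key)
            unique.append(row)
    return unique


# Sample import rows from the question.
import_rows = [
    {"id": "1", "time": "2018-07-16 00:00:00", "number": "1", "value": "64"},
    {"id": "3", "time": "2018-07-16 00:00:00", "number": "0", "value": "-99"},
    {"id": "3", "time": "2018-07-16 00:00:00", "number": "0", "value": "11"},
    {"id": "4", "time": "2018-07-16 00:00:00", "number": "0", "value": "-99"},
    {"id": "4", "time": "2018-07-16 00:00:00", "number": "1", "value": "-99"},
]
# Key of a row that already exists in table A.
existing = {("1", "2018-07-16 00:00:00", "1")}

unique_rows = dedupe_rows(import_rows, existing)
print(unique_rows)
```

The first occurrence of each key wins, so the second "3 ... 0" row (value 11) is dropped. If silently skipping conflicting rows is acceptable, bulk_create in recent Django versions also accepts ignore_conflicts=True.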

Date periods based on first occurrence

I have a pandas data frame of orders:
OrderID OrderDate Value CustomerID
1 2017-11-01 12.56 23
2 2017-11-06 1.56 23
3 2017-11-08 2.67 23
4 2017-11-12 5.67 99
5 2017-11-13 7.88 23
6 2017-11-19 3.78 99
Let's look at the customer with ID 23.
His first order was placed on 2017-11-01. This date is the start of his first week, which means that all his orders between 2017-11-01 and 2017-11-07 are assigned to his week number 1 (it is NOT a calendar week such as Monday to Sunday).
For the customer with ID 99, the first week starts on 2017-11-12, as that is the date of his first order (OrderId 4).
I need to assign every order in the table to the respective index of a common table, Periods. Periods[0] will contain the orders from each customer's week number 1, Periods[1] those from week number 2, etc.
OrderId 1 and OrderId 4 will be in the same index of the Periods table, as both orders were created in the first week of their respective customers.
The Periods table containing the order IDs has to look like this:
Periods=[[1,2,4],[3,5,6]]

Is this what you want?
df['New']=df.groupby('CustomerID').OrderDate.apply(lambda x : (x-x.iloc[0]).dt.days//7)
df.groupby('New').OrderID.apply(list)
Out[1079]:
New
0 [1, 2, 4]
1 [3, 5, 6]
Name: OrderID, dtype: object
To get your period table
df.groupby('New').OrderID.apply(list).tolist()
Out[1080]: [[1, 2, 4], [3, 5, 6]]
More info
df
Out[1081]:
OrderID OrderDate Value CustomerID New
0 1 2017-11-01 12.56 23 0
1 2 2017-11-06 1.56 23 0
2 3 2017-11-08 2.67 23 1
3 4 2017-11-12 5.67 99 0
4 5 2017-11-13 7.88 23 1
5 6 2017-11-19 3.78 99 1
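A variant of the same idea, sketched under the assumption of a recent pandas version: transform('min') broadcasts each customer's first order date back onto the original rows, so the result aligns with df directly and does not depend on the rows being sorted by date. The column name Period and the variable first_order are only illustrative.

```python
import pandas as pd

# Order data from the question.
df = pd.DataFrame({
    "OrderID": [1, 2, 3, 4, 5, 6],
    "OrderDate": pd.to_datetime([
        "2017-11-01", "2017-11-06", "2017-11-08",
        "2017-11-12", "2017-11-13", "2017-11-19",
    ]),
    "Value": [12.56, 1.56, 2.67, 5.67, 7.88, 3.78],
    "CustomerID": [23, 23, 23, 99, 23, 99],
})

# First order date of each customer, repeated on every row of that customer.
first_order = df.groupby("CustomerID")["OrderDate"].transform("min")

# Whole 7-day periods elapsed since the customer's first order (0-based).
df["Period"] = (df["OrderDate"] - first_order).dt.days // 7

# Collect the order IDs falling into each period.
periods = df.groupby("Period")["OrderID"].apply(list).tolist()
print(periods)
```

Using min instead of x.iloc[0] means the first order is found even if a customer's rows are not in chronological order.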

Why does pd.to_datetime fail to convert?

I have an object column with values which are dates. I manually placed 2016-08-31 instead of NaN after reading from csv.
close_date
0 1948-06-01 00:00:00
1 2016-08-31 00:00:00
2 2016-08-31 00:00:00
3 1947-07-01 00:00:00
4 1967-05-31 00:00:00
Running df['close_date'] = pd.to_datetime(df['close_date']) results in
TypeError: invalid string coercion to datetime
Adding the coerce=True argument results in:
TypeError: to_datetime() got an unexpected keyword argument 'coerce'
Furthermore, even though I call the column 'close_date', all the columns in the dataframe, some int64, float64, and datetime64[ns], change to dtype object.
What am I doing wrong?

You need the errors='coerce' parameter, which converts values that cannot be parsed to NaT:
df['close_date'] = pd.to_datetime(df['close_date'], errors='coerce')
print (df)
close_date
0 1948-06-01
1 2016-08-31
2 2016-08-31
3 1947-07-01
4 1967-05-31
print (df['close_date'].dtypes)
datetime64[ns]
But if the column holds mixed values (numbers alongside datetimes), convert to str first:
df['close_date'] = pd.to_datetime(df['close_date'].astype(str), errors='coerce')
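To make the effect of errors='coerce' visible, here is a small sketch with a made-up object column containing one valid date string, one unparseable string, and one missing value. The sample values are assumptions for illustration only, not the asker's data.

```python
import pandas as pd

# Hypothetical object column: a valid date, garbage text, and a missing value.
raw = pd.Series(["1948-06-01 00:00:00", "not a date", None], dtype="object")

# Values that cannot be parsed become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(parsed.isna().tolist())
```

Checking parsed.isna() afterwards is a quick way to see which rows could not be converted.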

Resample pandas dataframe and count instances

If I have a dataframe such as:
index = pd.date_range(start='2014 01 01 00:00', end='2014 01 05 00:00', freq='12H')
df = pd.DataFrame(pd.np.random.randn(9),index=index,columns=['A'])
df
Out[5]:
A
2014-01-01 00:00:00 2.120577
2014-01-01 12:00:00 0.968724
2014-01-02 00:00:00 1.232688
2014-01-02 12:00:00 0.328104
2014-01-03 00:00:00 -0.836761
2014-01-03 12:00:00 -0.061087
2014-01-04 00:00:00 -1.239613
2014-01-04 12:00:00 0.513896
2014-01-05 00:00:00 0.089544
And I want to resample to daily frequency, it is quite easy:
df.resample(rule='1D',how='mean')
Out[6]:
A
2014-01-01 1.544650
2014-01-02 0.780396
2014-01-03 -0.448924
2014-01-04 -0.362858
2014-01-05 0.089544
However, I need to track how many instances are going into each day. Is there a good pythonic way of using resample to both perform the specified "how" operation AND track number of data points going into each mean value, e.g. yielding
Out[6]:
A Instances
2014-01-01 1.544650 2
2014-01-02 0.780396 2
2014-01-03 -0.448924 2
2014-01-04 -0.362858 2
2014-01-05 0.089544 1

Conveniently, how accepts a list:
df1 = df.resample(rule='1D', how=['mean', 'count'])
This will return a DataFrame with a MultiIndex column: one level for 'A' and another level for 'mean' and 'count'. To get a simple DataFrame like the desired output in your question, you can drop the extra level like df1.columns = df1.columns.droplevel(0) or, better, you can do your resampling on df['A'] instead of df.
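The how argument was later removed from resample, and pd.np is no longer available either, so the snippets above only run on old pandas releases. As a hedged sketch of an equivalent for recent versions, the aggregations can be passed to agg on the resampler. The data here is a simple 0..8 ramp rather than the random values from the question, so that the result is reproducible.

```python
import pandas as pd

# Nine samples spaced 12 hours apart, starting at 2014-01-01 00:00.
index = pd.date_range(start="2014-01-01", periods=9, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"A": [float(i) for i in range(9)]}, index=index)

# Resample the single column to daily bins and compute mean and count together.
# Working on df["A"] yields flat column names instead of a MultiIndex.
daily = df["A"].resample("1D").agg(["mean", "count"])
print(daily)
```

The last day contains a single sample, so its count is 1 while the other days have 2.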
