I have a lot of data, and the data is pretty dirty. Example:
Table A's ORM model (unique_together goes in the model's Meta):

class A(models.Model):
    id = models.CharField(default='', max_length=50)
    time = models.DateTimeField(default=timezone.now)
    number = models.CharField(default='', max_length=20)
    value = models.CharField(default='', max_length=20)

    class Meta:
        unique_together = ['id', 'time', 'number']
Table A's data:
id time number value
1 2018-07-16 00:00:00 1 64
1 2018-07-16 00:00:00 2 -99
1 2018-07-16 00:00:00 3 655
1 2018-07-16 00:00:00 4 3
2 2018-07-16 00:00:00 0 12
Import data (sample):
id time number value
1 2018-07-16 00:00:00 1 64
3 2018-07-16 00:00:00 0 -99
3 2018-07-16 00:00:00 0 11
4 2018-07-16 00:00:00 0 -99
4 2018-07-16 00:00:00 1 -99
So, when I do:

objs = []
for ...:  # loop over the import data, building kwargs for each row
    objs.append(A(**kwargs))
A.objects.bulk_create(objs, batch_size=50000)
This raises two kinds of duplicates:
1. Table A already contains the row "1 2018-07-16 00:00:00 1".
2. The import data contains "3 2018-07-16 00:00:00 0" twice, so it appears twice in objs; the bulk create then raises a duplicate error and rolls back the whole commit!
For case 1 I can use get_or_create to solve it,
but for case 2 I can't check whether the data I am about to append already exists in objs.
I tried the following to check for existence, but when there are over 1,000,000 data rows
the complexity is terrible.
def search(id, time, number, objs):
    for obj in objs:  # linear scan over everything appended so far
        if obj.id == id and obj.time == time and obj.number == number:
            return True
    return False
Is there any better way? Thanks.
You can add a tuple with id, time and number to a set:
objs = []
duplicate_check = set()
for ...:  # loop over the import data, building kwargs for each row
    data = (kwargs['id'], kwargs['time'], kwargs['number'])
    if data not in duplicate_check:
        objs.append(A(**kwargs))
        duplicate_check.add(data)
A.objects.bulk_create(objs, batch_size=50000)
Membership tests and insertions on a set have an average complexity of O(1).
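A hedged sketch for the other duplicate case (rows already present in table A): preload the existing key tuples into the same set before the loop, so both kinds of duplicates get skipped. Here rows is a stand-in for the parsed import data, and the sketch assumes kwargs['time'] is a datetime comparable to the stored values; on Django 2.2+ you could instead pass ignore_conflicts=True to bulk_create.

# Sketch only: preload keys already in the DB, then dedupe the import in one pass.
duplicate_check = set(A.objects.values_list('id', 'time', 'number'))

objs = []
for kwargs in rows:  # `rows` is a stand-in for the parsed import data
    key = (kwargs['id'], kwargs['time'], kwargs['number'])
    if key not in duplicate_check:
        objs.append(A(**kwargs))
        duplicate_check.add(key)
A.objects.bulk_create(objs, batch_size=50000)

Preloading every existing key can itself use a lot of memory on a very large table, which is where ignore_conflicts=True is the simpler option.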
Could you please help me apply different calculations to two kinds of rows in Power BI?
That is, I want to transform this table:
client_ids products purchased month
1 0 0 jan
2 1A 1 jan
2 1B 1 jan
3 0 0 jan
4 0 0 jan
5 0 0 feb
into this:
purchased  jan  feb
1          1
0          3    1
That is, to perform these calculations:
- for purchased = 0: count, over month and client
- for purchased = 1: count distinct, over month and client
Thank you.
I used this method:
- create a reference to the main query in the Query Editor
- drop the column with products
- drop duplicates
But this makes loading the report slower.
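(As an aside, not a Power BI answer: here is a small pandas sketch of the intended transformation, just to make the expected output concrete. It uses the sample rows and column names from the question; everything else is illustrative.)

import pandas as pd

df = pd.DataFrame({
    'client_ids': [1, 2, 2, 3, 4, 5],
    'products':   ['0', '1A', '1B', '0', '0', '0'],
    'purchased':  [0, 1, 1, 0, 0, 0],
    'month':      ['jan', 'jan', 'jan', 'jan', 'jan', 'feb'],
})

# purchased = 1: count distinct clients per month; purchased = 0: plain row count per month
bought = df[df['purchased'] == 1].groupby('month')['client_ids'].nunique()
not_bought = df[df['purchased'] == 0].groupby('month')['client_ids'].count()

result = pd.DataFrame({1: bought, 0: not_bought}).T.reindex(columns=['jan', 'feb'])
result.index.name = 'purchased'
print(result.fillna(''))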
To return the expected output, you can use the following steps to obtain the result from the data.
Assuming this is your table, with a date column:
First, calculate the month difference compared with today to find the most recent month (you can try other methods depending on the nature of your data):
Mon Diff = (YEAR(NOW()) - YEAR(Sheet1[date])) + (MONTH(NOW()) - MONTH(Sheet1[date]))
Second, rank the months so that the most recent one is labelled "current":
rank =
var ranking = RANKX(Sheet1,Sheet1[Mon Diff],,,Dense)
return
SWITCH(ranking,1,"prior",2,"current")
Third, generate the distinct values from the purchased column:
Table = DISTINCT(Sheet1[purchased])
Fourth, calculate the frequencies of 0 and 1 in the prior month (Jan), and do the same for the current month (Feb):
Jan = CALCULATE(COUNT(Sheet1[rank]),Sheet1[rank]="prior",
Sheet1[purchased]=EARLIER('Table'[purchased]))
Feb = CALCULATE(COUNT(Sheet1[rank]),Sheet1[rank]="current",
Sheet1[purchased]=EARLIER('Table'[purchased]))
The new table with this information (note: in Jan, purchased = 1 shows 2 occurrences instead of 1, since the count is not distinct):
Let's assume I have access to this data in QuickSight:
Id Amount Date
1 10 15-01-2019
2 0 16-01-2020
3 100 21-12-2019
4 34 15-01-2020
5 5 20-02-2020
1 50 13-09-2020
4 0 01-01-2020
I would like to create a boolean calculated field, named "Amount_in_2020", whose value is True when the Id has a strictly positive total Amount in 2020, and False otherwise.
With Python, I would have done the following:
import numpy as np
import pandas as pd

# Sample data
df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 1, 4],
                   'Amount': [10, 0, 100, 34, 5, 50, 0],
                   'Date': ['15-01-2019', '16-01-2020', '21-12-2019', '15-01-2020',
                            '20-02-2020', '13-09-2020', '01-01-2020']})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Group by to get the total amount per Id, restricted to 2020 dates
mask_2020 = (df['Date'] >= '2020-01-01') & (df['Date'] <= '2020-12-31')
df_gb = df[mask_2020].groupby('Id')[['Amount']].sum()

# Creation of the wanted column
df['Amount_in_2020'] = np.where(df['Id'].isin(df_gb[df_gb['Amount'] > 0].index), True, False)
But I can't find a way to create such a calculated field in QuickSight. Could you please help me?
Expected output :
Id Amount Date Amount_in_2020
1 10 2019-01-15 True
2 0 2020-01-16 False
3 100 2019-12-21 False
4 34 2020-01-15 True
5 5 2020-02-20 True
1 50 2020-09-13 True
4 0 2020-01-01 True
Finally found it:
ifelse(sumOver(max(ifelse(extract("YYYY",{Date})=2020,{Amount},0)), [{Id}])>0,1,0)
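For what it's worth, a hedged pandas sketch of the same logic (sumOver partitioned by {Id} corresponds roughly to a groupby('Id') transform; the data is the sample from the question, and the sketch only mirrors the intent of the expression above):

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 1, 4],
                   'Amount': [10, 0, 100, 34, 5, 50, 0],
                   'Date': pd.to_datetime(['15-01-2019', '16-01-2020', '21-12-2019',
                                           '15-01-2020', '20-02-2020', '13-09-2020',
                                           '01-01-2020'], format='%d-%m-%Y')})

# Amount counted only for 2020 rows, summed per Id across the whole table,
# then compared with 0 to produce the boolean flag on every row.
amount_2020 = df['Amount'].where(df['Date'].dt.year == 2020, 0)
df['Amount_in_2020'] = amount_2020.groupby(df['Id']).transform('sum') > 0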
I have an object-dtype column whose values are dates. I manually placed 2016-08-31 in place of NaN after reading from CSV.
close_date
0 1948-06-01 00:00:00
1 2016-08-31 00:00:00
2 2016-08-31 00:00:00
3 1947-07-01 00:00:00
4 1967-05-31 00:00:00
Running df['close_date'] = pd.to_datetime(df['close_date']) results in
TypeError: invalid string coercion to datetime
Adding the coerce=True argument results in:
TypeError: to_datetime() got an unexpected keyword argument 'coerce'
Furthermore, even though I only call it on the 'close_date' column, all the columns in the dataframe (some int64, float64, and datetime64[ns]) change to dtype object.
What am I doing wrong?
You need the errors='coerce' parameter, which converts values that cannot be parsed to NaT:
df['close_date'] = pd.to_datetime(df['close_date'], errors='coerce')
print (df)
close_date
0 1948-06-01
1 2016-08-31
2 2016-08-31
3 1947-07-01
4 1967-05-31
print (df['close_date'].dtypes)
datetime64[ns]
But if there are mixed values (e.g. numbers together with datetimes), convert to str first:
df['close_date'] = pd.to_datetime(df['close_date'].astype(str), errors='coerce')
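A tiny, self-contained check of that behaviour (made-up values), showing the unparseable entry becoming NaT:

import pandas as pd

s = pd.Series(['2016-08-31', 'not a date'], dtype=object)
print(pd.to_datetime(s, errors='coerce'))
# 0   2016-08-31
# 1          NaT
# dtype: datetime64[ns]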
I am facing a huge memory leak on a server serving a Django (1.8) app with Apache or Nginx (the issue happens with both).
When I go to certain pages (say, the specific view below), the server's RAM goes up to 16 GB in a few seconds (with only one request) and the server freezes.
def records(request):
    """Return the page listing the records of the last 14 days."""
    values = []
    time = timezone.now() - timedelta(days=14)
    record = Records.objects.filter(time__gte=time)
    return render(request,
                  'record_app/records_newests.html',
                  {
                      'active_nav_tab': ["active", "", "", ""],
                      'record': record,
                  })
When I git checkout an older version, from back when there was no such problem, the problem survives and I have the same issue.
I did a memory check with Guppy for the faulty request; here is the result:
>>> hp.heap()
Partition of a set of 7042 objects. Total size = 8588675016 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1107 16 8587374512 100 8587374512 100 unicode
1 1014 14 258256 0 8587632768 100 django.utils.safestring.SafeText
2 45 1 150840 0 8587783608 100 dict of 0x390f0c0
3 281 4 78680 0 8587862288 100 dict of django.db.models.base.ModelState
4 326 5 75824 0 8587938112 100 list
5 47 1 49256 0 8587987368 100 dict of 0x38caad0
6 47 1 49256 0 8588036624 100 dict of 0x39ae590
7 46 1 48208 0 8588084832 100 dict of 0x3858ab0
8 46 1 48208 0 8588133040 100 dict of 0x38b8450
9 46 1 48208 0 8588181248 100 dict of 0x3973fe0
<164 more rows. Type e.g. '_.more' to view.>
After a day of searching I found my answer.
While investigating I checked the statistics on my DB and saw that one table was 800 MB in size but had only 900 rows. This table contains a TextField without a max length. Somehow one text field had a huge amount of data inserted into it, and this row was slowing everything down on every page using this model.
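As a hedged follow-up sketch (the field name notes and the module path are assumptions, guessed from the template path in the question, not from the original post): if a model has an unbounded TextField, list views can avoid pulling it into memory with defer(), which leaves that column out of the SELECT:

from datetime import timedelta

from django.utils import timezone

from record_app.models import Records  # module path guessed from the template path above

def recent_records(days=14):
    """Fetch recent records without loading the (hypothetical) large `notes` TextField."""
    since = timezone.now() - timedelta(days=days)
    return Records.objects.filter(time__gte=since).defer('notes')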
I have a DataFrame as follows, where Id is a string and Date is a datetime:
Id Date
1 3-1-2012
1 4-8-2013
2 1-17-2013
2 5-4-2013
2 10-30-2012
3 1-3-2013
I'd like to consolidate the table to just show one row for each Id which has the most recent date.
Any thoughts on how to do this?
You can groupby the Id field:
In [11]: df
Out[11]:
Id Date
0 1 2012-03-01 00:00:00
1 1 2013-04-08 00:00:00
2 2 2013-01-17 00:00:00
3 2 2013-05-04 00:00:00
4 2 2012-10-30 00:00:00
5 3 2013-01-03 00:00:00
In [12]: g = df.groupby('Id')
If you are not certain about the ordering, you could do something along these lines:
In [13]: g.agg(lambda x: x.iloc[x.Date.argmax()])
Out[13]:
Date
Id
1 2013-04-08 00:00:00
2 2013-05-04 00:00:00
3 2013-01-03 00:00:00
which for each group grabs the row with largest (latest) date (the argmax part).
If you knew they were in order you could take the last (or first) entry:
In [14]: g.last()
Out[14]:
Date
Id
1 2013-04-08 00:00:00
2 2012-10-30 00:00:00
3 2013-01-03 00:00:00
(Note: they're not in order, so this doesn't work in this case!)
In Hayden's response, I think that using x.loc in place of x.iloc is better, as the index of the df dataframe could be sparse (and in that case iloc will not work).
(I don't have enough points on Stack Overflow to post this as a comment on that response.)
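A sketch of a variant that sidesteps the iloc/loc question entirely: take the index label of the latest Date per Id with idxmax and select those rows with loc (this assumes Date is already a datetime column, as in the question):

import pandas as pd

df = pd.DataFrame({'Id': ['1', '1', '2', '2', '2', '3'],
                   'Date': pd.to_datetime(['3-1-2012', '4-8-2013', '1-17-2013',
                                           '5-4-2013', '10-30-2012', '1-3-2013'],
                                          format='%m-%d-%Y')})

# Row label of the latest Date within each Id, then select those rows.
latest = df.loc[df.groupby('Id')['Date'].idxmax()]
print(latest)
#   Id       Date
# 1  1 2013-04-08
# 3  2 2013-05-04
# 5  3 2013-01-03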