I have a model called MyModel which has some dummy data as follows:
item    date       value
------------------------
ab      8/10/12        1
ab      7/10/12        2
ab      6/10/12        3
abc     8/10/12        4
abc     7/10/12        5
abc     7/10/12        6
ab      7/10/12        7
ab      7/10/12        8
ab      7/10/12        9
ab      7/10/12       10
abc     7/10/12       11
abc     7/10/12       12
abc     7/10/12       13
I would like to query this model in such a way that I get an output giving me the ranges of the items that occur in sequence, something like the following:
[{'item': 'ab', 'values': '1-3'},
{'item': 'abc', 'values': '4-6'},
{'item': 'ab', 'values': '7-10'},
{'item': 'abc', 'values': '11-13'}]
How would I be able to do this using the Django ORM?
Pretty certain you can't do that with the ORM alone... you'll need to write your own Python code to do it:
counts = []
last_item = None
for model in MyModel.objects.all().order_by('value'):
    if not counts or last_item != model.item:
        # start a new run for this item
        counts.append({'item': model.item, 'values': [model.value]})
        last_item = model.item
    else:
        # same item as the previous row: extend the current run
        counts[-1]['values'].append(model.value)

# collapse each run into a 'first-last' range string
for count in counts:
    count['values'] = '%s-%s' % (count['values'][0], count['values'][-1])
Edit: a version that tracks only the first and last value of each run, instead of accumulating the whole list:
counts = []
last_item = None
for model in MyModel.objects.all().order_by('value'):
    if not counts or last_item != model.item:
        counts.append({'item': model.item, 'first': model.value, 'last': model.value})
        last_item = model.item
    else:
        # same item: just advance the end of the current run
        counts[-1]['last'] = model.value
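For reference, the same run-grouping can be written more compactly with itertools.groupby, since the queryset is already ordered by value (a sketch, using the MyModel from the question):

from itertools import groupby
from operator import attrgetter

counts = []
for item, run in groupby(MyModel.objects.order_by('value'), key=attrgetter('item')):
    values = [m.value for m in run]  # consecutive rows sharing the same item
    counts.append({'item': item, 'values': '%s-%s' % (values[0], values[-1])})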
import pandas as pd

df = pd.DataFrame([
['abc bank', 'Delhi', 'person a'],
['abc bank', 'Delhi', 'person b'],
['abc bank', 'Bombay', 'person c'],
['abc bank', 'Bombay', 'person d'],
['abc bank', 'Surat', 'person c'],
['abc bank', 'Surat', 'person d'],
['cde bank', 'Delhi', 'person z'],
['cde bank', 'Delhi', 'person y'],
['cde bank', 'Bombay', 'person x']
],
columns = ['corporation', 'city', 'managers'])
print('DataFrame with default index\n', df)
Then when we do this:
df = df.set_index(['corporation', 'city'])
print('\nDataFrame with MultiIndex\n',df)
The output we get is:
DataFrame with MultiIndex
                    managers
corporation city
abc bank    Delhi   person a
            Delhi   person b
            Bombay  person c
            Bombay  person d
            Surat   person c
            Surat   person d
cde bank    Delhi   person z
            Delhi   person y
            Bombay  person x
What I want is:
                    managers
corporation city
abc bank    Delhi   person a
                    person b
            Bombay  person c
                    person d
            Surat   person c
                    person d
cde bank    Delhi   person z
                    person y
            Bombay  person x
So set_index is grouping the first column, 'corporation', but not the second column, 'city'. How can I get this?
Context: I'm doing this to convert a pandas df to HTML with rowspans for the first 2 columns.
You can start by sorting the two columns that will form the MultiIndex, then use pandas.DataFrame.loc with pandas.DataFrame.duplicated to replace duplicated entries with empty strings. This example uses an analogous DataFrame with columns rollno and name:

df = df.sort_values(['rollno', 'name']).reset_index(drop=True)
df.loc[df.duplicated(['rollno', 'name']), 'name'] = ''
df.loc[df.duplicated('rollno'), 'rollno'] = ''
out = df.set_index(['rollno', 'name'])
# Output :
print(out)
             physics  botony
rollno name
21     Amol       72      67
                  78      69
       Kiku       74      56
22     Ajit       54      76
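Applied to the corporation/city frame from the question (starting again from the version with the default index), the same pattern would be the following sketch; note that sort_values reorders the cities alphabetically, so Bombay ends up before Delhi:

df = df.sort_values(['corporation', 'city']).reset_index(drop=True)
df.loc[df.duplicated(['corporation', 'city']), 'city'] = ''
df.loc[df.duplicated('corporation'), 'corporation'] = ''
out = df.set_index(['corporation', 'city'])
print(out)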
Here's my code:
import pandas as pd
import numpy as np
input = {'name': ['Andy', 'Alex', 'Amy', 'Olivia'],
         'rating': ['A', 'A', 'B', 'B'],
         'score': [100, 60, 70, 95]}
df = pd.DataFrame(input)
df['valid1'] = np.where((df['score'] == 100) & (df['rating'] == 'A'), 'true', 'false')
The code above works fine: it sets the new column 'valid1' to 'true' where score is 100 and rating is 'A'.
Now suppose the condition comes from a dict variable such as
c = {'score':'100', 'rating':'A'}
How can I use the condition defined in c to get the same 'valid' column? I tried the following code:
for key, value in c.iteritems():
    df['valid2'] = np.where((df[key] == value), 'true', 'false')
got an error:
TypeError: Invalid type comparison
I'd define c as a pd.Series so that when you compare it to a DataFrame, it automatically compares against each row while matching columns with the series indices. Note that I made sure 100 was an integer and not a string.
c = pd.Series({'score':100, 'rating':'A'})
i = df.columns.intersection(c.index)
df.assign(valid1=df[i].eq(c).all(1))
     name rating  score  valid1
0    Andy      A    100    True
1    Alex      A     60   False
2     Amy      B     70   False
3  Olivia      B     95   False
You can use the same series and still use numpy to speed things up:
c = pd.Series({'score':100, 'rating':'A'})
i = df.columns.intersection(c.index)
v = np.column_stack([df[col].values for col in i])  # list, and a loop name that doesn't shadow the series c
df.assign(valid1=(v == c.loc[i].values).all(1))
     name rating  score  valid1
0    Andy      A    100    True
1    Alex      A     60   False
2     Amy      B     70   False
3  Olivia      B     95   False
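If you would rather keep c as a plain dict, a minimal sketch of the loop-style approach is to build one boolean mask per key and combine them with numpy.logical_and.reduce (note the integer 100, and .items() instead of the Python 2 .iteritems()):

c = {'score': 100, 'rating': 'A'}  # score as an int, matching the column dtype

# one boolean mask per condition, ANDed together across all keys
mask = np.logical_and.reduce([df[key] == value for key, value in c.items()])
df['valid2'] = np.where(mask, 'true', 'false')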
I would like to count the top subjects in a column. Some fields contain commas or dots; I would like to create a new row for each of those parts.
import pandas as pd
from pandas import DataFrame, Series
sbj = DataFrame(["Africa, Business", "Oceania",
                 "Business.Biology.Pharmacology.Therapeutics",
                 "French Litterature, Philosophy, Arts",
                 "Biology,Business", ""])
sbj
I would like to split into a new row any field that has a ',' or '.':
sbj_top = sbj[0].apply(lambda x: pd.value_counts(x.split(",")) if not pd.isnull(x) else pd.value_counts('---'.split(","))).sum(axis = 0)
sbj_top
I'm getting an error (AttributeError) here when I try to re-split it on '.':
sbj_top = sbj_top.apply(lambda x: pd.value_counts(x.split(".")) if not pd.isnull(x) else pd.value_counts('---'.split(","))).sum(axis = 0)
sbj_top
My desired output:
sbj_top.sort(ascending=False)
plt.title("Distribution of the top 10 subjects")
plt.ylabel("Frequency")
sbj_top.head(10).plot(kind='bar', color="#348ABD")
You can use Counter together with chain from itertools. Note that I first replace periods with commas before parsing.
from collections import Counter
import itertools
from string import whitespace
trimmed_list = [i.replace('.', ',').split(',') for i in sbj[0].tolist() if i != ""]
item_list = [item.strip(whitespace) for item in itertools.chain(*trimmed_list)]
item_count = Counter(item_list)
>>> item_count.most_common()
[('Business', 3),
('Biology', 2),
('Oceania', 1),
('Pharmacology', 1),
('Philosophy', 1),
('Africa', 1),
('French Litterature', 1),
('Therapeutics', 1),
('Arts', 1)]
If you need the output in the form of a DataFrame:
df = pd.DataFrame(item_list, columns=['subject'])
>>> df
subject
0 Africa
1 Business
2 Oceania
3 Business
4 Biology
5 Pharmacology
6 Therapeutics
7 French Litterature
8 Philosophy
9 Arts
10 Biology
11 Business
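From there, the plotting step the asker sketched can be driven by value_counts, which already sorts in descending order (a sketch; assumes matplotlib is available):

import matplotlib.pyplot as plt

sbj_top = df['subject'].value_counts()  # frequency per subject, sorted descending
plt.title("Distribution of the top 10 subjects")
plt.ylabel("Frequency")
sbj_top.head(10).plot(kind='bar', color="#348ABD")
plt.show()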
I am trying to clean up the data. For the first-name variable, I would like to 1) assign a missing value (NaN) to entries that have one character only, 2) assign a missing value if the entry contains only two characters AND one of them is a symbol (e.g. "." or "?"), and 3) convert "wm" to the string "william".
I tried the following and other code, but none of it seems to work:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
def CleanUp():
    data = pd.read_csv("C:\sample.csv")
    frame2 = DataFrame(data)
    frame2.columns = ["First Name", "Ethnicity"]

    # Convert weird values to missing value
    for Name in frame2["First_Name"]:
        if len(Name) == 1:
            Name == np.nan
        if (len(Name) == 2) and (Name.str.contain(".|?|:", na=False)):
            Name == np.nan
        if Name == "wm":
            Name == "william"

    print frame2["First_Name"]
You're looking for df.replace.
Make up some data:
np.random.seed(3)
n = 6
df = pd.DataFrame({'Name': np.random.choice(['wm', 'bob', 'harry', 'chickens'], size=n),
                   'timeStamp': np.random.randint(1000, size=n)})
print(df)
       Name  timeStamp
0     harry        256
1        wm        789
2       bob        659
3  chickens        714
4        wm        875
5        wm        681
Run the replace:
df.Name = df.Name.replace('wm','william')
print(df)
       Name  timeStamp
0     harry        256
1   william        789
2       bob        659
3  chickens        714
4   william        875
5   william        681
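The other two cleanup rules from the question (single-character names, and two-character names containing a symbol) can be handled the same vectorised way with the .str accessor instead of a Python loop. A sketch, reusing the made-up Name column above (on this toy data only the 'wm' rows actually change):

one_char = df.Name.str.len() == 1
has_symbol = (df.Name.str.len() == 2) & df.Name.str.contains(r'[.?:]', na=False)
df.loc[one_char | has_symbol, 'Name'] = np.nan
df.Name = df.Name.replace('wm', 'william')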
I have a pandas DataFrame with duplicate values for a set of columns. For example:
df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})
In [2]: df
Out[2]:
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10
Rows (1) and (3) are the same. Essentially, row (3) is a duplicate of row (1).
I am looking for the following output:
Is_Duplicate, containing whether the row is a duplicate or not [can be accomplished by using the "duplicated" method on the dataframe columns (Column2, Column3 and Column4)]
Dup_Index, the original index of the row it duplicates.
In [3]: df
Out[3]:
   Column1 Column2 Column3  Column4 Is_Duplicate  Dup_Index
0        1     ABC     DEF       10        False          0
1        2     XYZ     DEF       40        False          1
2        3     ABC     DEF       10         True          0
There is a DataFrame method, duplicated, for the first requested column:
In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0 False
1 False
2 True
In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
To do the second you could try something like this:
In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])
In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])
In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])
In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])
In [17]: df
Out[17]:
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
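An equivalent sketch without the intermediate set_index uses groupby with transform, which broadcasts each group's first original index back to every row of that group (Column1 is picked only so transform has a column to operate on):

cols = ['Column2', 'Column3', 'Column4']
df['is_duplicated'] = df.duplicated(cols)
df['dup_index'] = df.groupby(cols)['Column1'].transform(lambda s: s.index[0])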
Let's say your dataframe is stored in df.
You can use groupby to get the non-duplicated rows of your dataframe. Here we have to ignore Column1, which is not part of the grouping data:
df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()
You can then merge this new dataframe with the original one by using the merge function:
df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))
You can then use the _dupindex column merged into the dataframe to do the simple math that adds the needed columns:
df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']