Merge rows based on index - python-2.7

I have a pandas dataframe like this,
Timestamp Meter1 Meter2
0 234 NaN
1 235 NaN
2 236 NaN
0 NaN 100
1 NaN 101
2 NaN 102
and I'm having trouble merging the rows on the Timestamp index to get something like this,
Timestamp Meter1 Meter2
0 234 100
1 235 101
2 236 102
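
For reference, a minimal sketch that reconstructs the frame above (the meter columns become float because of the NaNs):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Meter1': [234, 235, 236, np.nan, np.nan, np.nan],
     'Meter2': [np.nan, np.nan, np.nan, 100, 101, 102]},
    index=pd.Index([0, 1, 2, 0, 1, 2], name='Timestamp'),
)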

Option 0
df.max(level=0)
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Option 1
df.sum(level=0)
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Option 2
Disturbing Answer
df.stack().unstack()
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
As brought up by @jezrael and linked to the related issue here.
However, as I understand groupby.first and groupby.last, they return the first (or last) valid value in the group per column. In other words, it is my belief that this is working as intended.
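A minimal sketch of that behaviour on a hypothetical two-row frame: first() returns the first non-null value per column, not the literal first row.

import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [np.nan, 1.0], 'b': [2.0, np.nan]}, index=[0, 0])
print(demo.groupby(level=0).first())
#      a    b
# 0  1.0  2.0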
Option 3
df.groupby(level=0).first()
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Option 4
df.groupby(level=0).last()
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0

Use groupby:
df.groupby(level=0).max()
OR
df.groupby('Timestamp').max()
Output
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0

Use groupby and aggregate sum:
df = df.groupby(level=0).sum()
print (df)
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
And if you want plain integers:
df = df.groupby(level=0).sum().astype(int)
print (df)
Meter1 Meter2
Timestamp
0 234 100
1 235 101
2 236 102
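One caveat worth noting: sum only agrees with the other options here because each (Timestamp, column) pair holds exactly one non-NaN value; if a timestamp ever carried two real readings, sum would add them together while first, last, and max would not.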
But maybe the problem was that you forgot axis=1 in concat:
print (df1)
Meter1
Timestamp
0 234
1 235
2 236
print (df2)
Meter2
Timestamp
0 100
1 101
2 102
print (pd.concat([df1, df2]))
Meter1 Meter2
Timestamp
0 234.0 NaN
1 235.0 NaN
2 236.0 NaN
0 NaN 100.0
1 NaN 101.0
2 NaN 102.0
print (pd.concat([df1, df2], axis=1))
Meter1 Meter2
Timestamp
0 234 100
1 235 101
2 236 102
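For completeness, a sketch of two more spellings that align the same df1 and df2 on their shared Timestamp index; both give the merged frame above:

print(df1.join(df2))
print(df1.combine_first(df2))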

Related

Adding lists in Python elif statements

I have two data files (datafile1 and datafile2) and I want to add some information from datafile2 to datafile1, but only if it meets certain requirements, and then write all of the information to a new file.
Here is an example of datafile1 (I changed the tabs so it's easier to see):
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt
Here is an example of datafile2:
#GInumber OTU Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
1366104624 OTU49 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta Hymenoptera Braconidae Leiophron NA
342734543 OTU171 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta Lepidoptera Limacodidae Euphobetron Euphobetron cupreitincta
290756623 OTU803 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae Apocheima Apocheima pilosaria
296792336 OTU2519 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
What I would like to do is for every line of datafile1, find the line in datafile2 with the same "OTU", and from datafile2 always add GInumber, Accssn, Ident, Len, M, Gap, Qs, Qe, Ss, Se, evalue, bit, phylum, and class. If Ident falls between certain numbers, then I would also like to add order, family, genus, and species, according to these criteria:
Case #1: Ident > 98.0, add order, family, genus, and species
Case #2: Ident between 96.5 and 98.0, add order, family, "NA", "NA"
Case #3: Ident between 95.0 and 96.5, add order, "NA", "NA", "NA"
Case #4: Ident < 95.0, add "NA", "NA", "NA", "NA"
The desired output would be this:
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq GInumber Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat 1366104624 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta NA NA NA NA
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt 342734543 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta Lepidoptera NA NA NA
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt 290756623 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae NA NA
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt 296792336 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
I wrote this script:
import csv
#Files
besthit_taxonomy_unique_file = "datafile2.txt"
OTUtablefile = "datafile1.txt"
outputfile = "outputfile.txt"
#Settings
OrderLevel = float(95.0)
FamilyLevel = float(96.5)
SpeciesLevel = float(98.0)
#Importing the OTU table, which is tab delimited
OTUtable = list(csv.reader(open(OTUtablefile, 'rU'), delimiter='\t'))
headerOTUs = OTUtable.pop(0)
#Importing the best hit taxonomy table, which is tab delimited
taxonomytable = list(csv.reader(open(besthit_taxonomy_unique_file, 'rU'), delimiter='\t'))
headertax = taxonomytable.pop(0)
headertax.pop(1)
#Getting the header info
totalheader = headerOTUs + headertax
#Merging and assigning the taxonomy at the appropriate level
outputtable = []
NAs = 4 * ["NA"] #This is a list of NAs so that I can add the appropriate number, depending on the Identity
for item in OTUtable:
    OTU = item #Just to prevent issues with the list of lists
    OTUIDtable = OTU[0]
    print OTUIDtable
    for thing in taxonomytable:
        row = thing #Just to prevent issues with the list of lists
        OTUIDtax = row[1]
        if OTUIDtable == OTUIDtax:
            OTU.append(row[0])
            OTU += row[2:15]
            PercentID = float(row[3])
            if PercentID >= SpeciesLevel:
                OTU += row[15:]
            elif FamilyLevel <= PercentID < SpeciesLevel:
                OTU += row[15:17]
                OTU += NAs[:2]
            elif OrderLevel <= PercentID < FamilyLevel:
                print row[15]
                OTU += row[15]
                OTU += NAs[:3]
            else:
                OTU += NAs
    outputtable.append(OTU)
#Writing the output file
f1 = open(outputfile, 'w')
for item in totalheader[0:-1]:
    f1.write(str(item) + '\t')
f1.write(str(totalheader[-1]) + '\n')
for row in outputtable:
    currentrow = row
    for item in currentrow[0:-1]:
        f1.write(str(item) + '\t')
    f1.write(str(currentrow[-1]) + '\n')
For the most part the output is correct, except in case #3 (Ident between 95 and 96.5), where the script outputs the entry for order with a tab between every letter.
Here is an example of the output:
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq GInumber Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat 1366104624 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta NA NA NA NA
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt 342734543 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta L e p i d o p t e r a NA NA NA
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt 290756623 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae NA NA
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt 296792336 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
I just can't figure out what's going wrong. The rest of the time the order seems to contain the correct info, but for this one case it seems as if the information in order is stored as a list of lists. However, the output to the screen is this:
OTU171
Lepidoptera
This doesn't seem to indicate a list of lists...
I would be happy for any insights. I would also appreciate ideas for making my code more Pythonic.
Andreanna
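
The symptom described above pins this down: in case #3 the script runs OTU += row[15], and augmented assignment on a list iterates its right-hand operand, so a bare string gets split into single characters; print row[15] shows the whole word because printing never iterates the string. A minimal reproduction (use OTU.append(row[15]) or OTU += [row[15]] instead):

OTU = ['OTU171']
OTU += 'Lepidoptera'        # extends the list with 11 one-character items
print OTU                   # ['OTU171', 'L', 'e', 'p', ...]

OTU = ['OTU171']
OTU.append('Lepidoptera')   # appends the string as a single item
OTU += ['Lepidoptera']      # += with a one-element list also keeps it whole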

Pandas Nested DataFrame assignment

I have the following DataFrame:
prefix operator_name country_name mno_subscribers
0 267.0 Airtel Botswana 490
1 373.0 Orange Moldova 207
2 248.0 Airtel Seychelles 490
3 91.0 Reliance Bostwana 92
4 233.0 Vodafone Bostwana 516
I am trying to achieve this:
prefix operator_name country_name mno_subscribers operator_proba
0 267.0 Airtel Botswana 490 0.045
1 373.0 Orange Moldova 207 0.004
2 248.0 Airtel Seychelles 490 0.135
3 91.0 Reliance India 92 0.945
4 233.0 Vodafone Ghana 516 0.002
With this:
countries = df["country_name"].unique()
df["operator_proba"] = 0
for country in countries:
    country_name = df[df["country_name"] == country]
    for operator in country:
        mno_sum = country_name["mno_subscribers"].sum()
        df["operator_proba"]["country_name"] = country_name["mno_subscribers"] / mno_sum
Where am I going wrong in assigning the operator_proba to the original DataFrame?
This line
df["operator_proba"]["country_name"] = country_name["mno_subscribers"] / mno_sum
can't really work, since df["operator_proba"] is a column (or Series); you can't use ["country_name"] indexing on that.
That is probably why things don't work for you.
It's not entirely clear what you want to achieve, but I guess this may work:
df['operator_proba'] = df.groupby('country_name')['mno_subscribers'].apply(lambda x : x/x.sum())
This saves you a double loop, and is more Pandas-style (there are probably even nicer ways to compute this). The result is:
prefix operator_name country_name mno_subscribers operator_proba
0 267.0 Airtel Botswana 490 1.000000
1 373.0 Orange Moldova 207 1.000000
2 248.0 Airtel Seychelles 490 1.000000
3 91.0 Reliance Bostwana 92 0.151316
4 233.0 Vodafone Bostwana 516 0.848684
With the limited data set (and the Botswana/Bostwana spelling difference), most "probabilities" come out as 1.
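
One of those nicer ways, as a sketch: transform('sum') broadcasts each country's total back to the original row shape, so the division is a single vectorised operation with no apply at all:

df['operator_proba'] = (
    df['mno_subscribers']
    / df.groupby('country_name')['mno_subscribers'].transform('sum')
)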

ValueError: The truth value of a Series is ambiguous

>>> df.head()
№ Summer Gold Silver Bronze Total № Winter \
Afghanistan (AFG) 13 0 0 2 2 0
Algeria (ALG) 12 5 2 8 15 3
Argentina (ARG) 23 18 24 28 70 18
Armenia (ARM) 5 1 2 9 12 6
Australasia (ANZ) [ANZ] 2 3 4 5 12 0
Gold.1 Silver.1 Bronze.1 Total.1 № Games Gold.2 \
Afghanistan (AFG) 0 0 0 0 13 0
Algeria (ALG) 0 0 0 0 15 5
Argentina (ARG) 0 0 0 0 41 18
Armenia (ARM) 0 0 0 0 11 1
Australasia (ANZ) [ANZ] 0 0 0 0 2 3
Silver.2 Bronze.2 Combined total
Afghanistan (AFG) 0 2 2
Algeria (ALG) 2 8 15
Argentina (ARG) 24 28 70
Armenia (ARM) 2 9 12
Australasia (ANZ) [ANZ] 4 5 12
Not sure why I see this error:
>>> df['Gold'] > 0 | df['Gold.1'] > 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ankuragarwal/data_insight/env/lib/python2.7/site-packages/pandas/core/generic.py", line 917, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What's ambiguous here?
But this works:
>>> (df['Gold'] > 0) | (df['Gold.1'] > 0)
Assuming we have the following DF:
In [35]: df
Out[35]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
4 1 4 6
The following command:
df.a > 5 | df.b > 5
Because | has higher precedence than > (as specified in the operator precedence table), the expression is translated to:
df.a > (5 | df.b) > 5
which will be translated to:
df.a > (5 | df.b) and (5 | df.b) > 5
step by step:
In [36]: x = (5 | df.b)
In [37]: x
Out[37]:
0 5
1 7
2 13
3 7
4 5
Name: b, dtype: int32
In [38]: df.a > x
Out[38]:
0 True
1 False
2 False
3 False
4 False
dtype: bool
In [39]: x > 5
Out[39]:
0 False
1 True
2 True
3 True
4 False
Name: b, dtype: bool
but the last operation won't work:
In [40]: (df.a > x) and (x > 5)
---------------------------------------------------------------------------
...
skipped
...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error message above might lead inexperienced users to do something like this:
In [12]: (df.a > 5).all() | (df.b > 5).all()
Out[12]: False
In [13]: df[(df.a > 5).all() | (df.b > 5).all()]
...
skipped
...
KeyError: False
But in this case you just need to set your precedence explicitly in order to get expected result:
In [10]: (df.a > 5) | (df.b > 5)
Out[10]:
0 True
1 True
2 True
3 True
4 False
dtype: bool
In [11]: df[(df.a > 5) | (df.b > 5)]
Out[11]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
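The same precedence rule applies to the other boolean operations, sketched here: & stands in for and, and ~ for not, with each comparison again wrapped in parentheses:

df[(df.a > 5) & (df.b > 5)]   # element-wise AND: both conditions must hold
df[~(df.a > 5)]               # element-wise NOT: rows where a <= 5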
This is the real reason for the error:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html
pandas follows the numpy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not. It is not clear what the result of
>>> if pd.Series([False, True, False]):
...
should be. Should it be True because it’s not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
If you see that error, you need to explicitly choose what you want to do with the Series (e.g., use any(), all(), or empty). Or, you might want to check whether the pandas object is None.
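A short sketch of those explicit choices:

s = pd.Series([False, True, False])
if s.any():        # at least one element is True
    print('some')
if s.all():        # every element is True
    print('all')
if not s.empty:    # the Series contains at least one element
    print('non-empty')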

How to cut needed values into different columns in a Python dataframe?

I have a data frame like follow:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have a ordered dictionary like following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to change the data frame as the OrderedDict specifies.
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think the logic here is quite complex for pandas. How can I solve it? Thanks.
First, your OrderedDict reuses the same key, so the second entry overwrites the first; you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k, v in d.items():
    for k1, v1 in v.items():
        if k == 1:
            df[k1] = df.value1.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
I think this would point you in the right direction.
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1 = OrderedDict([('value1_1', [1, 2]), ('value1_2', [3, 4]), ('value1_3', [5, 7])])
dct_2 = OrderedDict([('value2_1', [1, 1]), ('value2_2', [2, 5]), ('value2_3', [6, 7])])
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting 1 from the even-indexed elements (the slice starts), since Python string indices start at 0 rather than 1 (this step uses numpy, imported as np):
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
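
A more compact variant of the same idea, as a sketch (assuming value1 and value2 have already been converted to strings as above): the .str accessor accepts slices directly, so each new column is a single vectorised pass:

for src, spec in [('value1', dct_1), ('value2', dct_2)]:
    for col, (start, end) in spec.items():
        df[col] = df[src].str[start - 1:end]
df = df.drop(['value1', 'value2'], axis=1)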

Counting specific values in rows in CSV

I have a csv file:
A D1 B D2 C D3 E
1 action 0.5 action 0.35 null a
2 0 0.75 0 0.45 action b
3 action 1 action 0.85 action c
I want to count the number of 'action' keyword in each row and make a new column giving the output. So the output file would be something like this.
A D1 B D2 C D3 E TotalAction
1 action 0.5 action 0.35 null a 2
2 0 0.75 0 0.45 action b 1
3 action 1 action 0.85 action c 3
What is the best way to go forward using Pandas? Thanks.
You could use apply across the rows with str.contains for that keyword:
In [21]: df.apply(lambda x: x.str.contains('action').sum(), axis=1)
Out[21]:
0 2
1 1
2 3
dtype: int64
df['TotalAction'] = df.apply(lambda x: x.str.contains('action').sum(), axis=1)
In [23]: df
Out[23]:
A D1 B D2 C D3 E TotalAction
0 1 action 0.50 action 0.35 null a 2
1 2 0 0.75 0 0.45 action b 1
2 3 action 1.00 action 0.85 action c 3
EDIT
However, you can do it more simply and faster with isin and then sum across the rows:
In [45]: df.isin(['action']).sum(axis=1)
Out[45]:
0 2
1 1
2 3
dtype: int64
Note: you need to wrap the string keyword in a list.
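One subtle difference between the two approaches, shown on hypothetical data: str.contains matches substrings, while isin (like ==) requires the whole cell to match:

s = pd.Series(['action', 'actions', 'no'])
print(s.str.contains('action').sum())  # 2 -- the substring also matches 'actions'
print(s.isin(['action']).sum())        # 1 -- exact cell match only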
You can use select_dtypes (to select only the string columns) in conjunction with .sum(axis=1):
In [95]: df['TotalAction'] = (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
In [96]: df
Out[96]:
A D1 B D2 C D3 E TotalAction
0 1 action 0.50 action 0.35 null a 2
1 2 0 0.75 0 0.45 action b 1
2 3 action 1.00 action 0.85 action c 3
Timing against 30K rows DF:
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [6]: df.shape
Out[6]: (30000, 7)
In [4]: %timeit df.apply(lambda x: x.str.contains('action').sum(), axis=1)
1 loop, best of 3: 7.89 s per loop
In [5]: %timeit (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
100 loops, best of 3: 7.08 ms per loop
In [7]: %timeit df.isin(['action']).sum(axis=1)
10 loops, best of 3: 22.8 ms per loop
Conclusion: apply(...) is about 1114 times slower than the select_dtypes() method.
Explanation:
In [92]: df.select_dtypes(include=[object])
Out[92]:
D1 D2 D3 E
0 action action null a
1 0 0 action b
2 action action action c
In [93]: df.select_dtypes(include=[object]) == 'action'
Out[93]:
D1 D2 D3 E
0 True True False False
1 False False True False
2 True True True False
In [94]: (df.select_dtypes(include=[object]) == 'action').sum(axis=1)
Out[94]:
0 2
1 1
2 3
dtype: int64