I have a Pandas dataframe with decimal values like below.
The '+' and '-' signs can be either leading or trailing.
df = pd.DataFrame({'amt': ['11.11', '+22.22', '33.33+', '-44.44', '55.55-', '66.66', '77.77', '8a8', '99', '97-9']})
df['amt']
0 11.11
1 +22.22
2 33.33+
3 -44.44
4 55.55-
5 66.66
6 77.77
7 8a8
8 99
9 97-9
Name: amt, dtype: object
My requirement is to:
Remove leading and trailing '+'
Move trailing '-' to leading '-'
This is what I have done so far:
abs_ser = pd.to_numeric(df['amt'].str.strip().str.strip('+|-'), errors='coerce')
abs_ser
0 11.11
1 22.22
2 33.33
3 44.44
4 55.55
5 66.66
6 77.77
7 NaN
8 99.00
9 NaN
Name: amt, dtype: float64
df['clean_amt'] = np.where(df['amt'].str.match(r'(^-|-$)'), abs_ser * -1, abs_ser)
df[['amt', 'clean_amt']]
amt clean_amt
0 11.11 11.11
1 +22.22 22.22
2 33.33+ 33.33
3 -44.44 -44.44
4 55.55- 55.55
5 66.66 66.66
6 77.77 77.77
7 8a8 NaN
8 99 99.00
9 97-9 NaN
The regular expression is not matching the trailing '-'.
Can someone help with correcting the regular expression?
I have tried the following and it gives me the desired result. However, I would prefer the regex if it can do it in one pass over the 'amt' column.
df['clean_amt'] = np.where((df['amt'].str.startswith('-') | df['amt'].str.endswith('-')), abs_ser * -1, abs_ser)
You can use
abs_ser = pd.to_numeric(df['amt'].str.strip().str.replace(r'^\+|\+$|^(.+)(-)$', r'\2\1', regex=True), errors='coerce')
Details
^\+ - finds a + at the start
\+$ - finds a + at the end
^(.+)(-)$ - captures any one or more chars at the start of the string (capturing the text into Group 1) and then captures a - at the end of the string into Group 2.
The replacement is the concatenated Group 2 and 1 values.
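As a quick sanity check, here is the same pattern run through Python's re module directly (a minimal sketch; it assumes Python 3.5+, where backreferences to unmatched groups are replaced with an empty string, which is also what pandas' str.replace relies on):
import re

pattern = r'^\+|\+$|^(.+)(-)$'

# a leading '+' is dropped, a trailing '+' is dropped, a trailing '-' is
# moved to the front, and non-matching strings are left untouched
for s in ['+22.22', '33.33+', '55.55-', '97-9']:
    print(s, '->', re.sub(pattern, r'\2\1', s))

# +22.22 -> 22.22
# 33.33+ -> 33.33
# 55.55- -> -55.55
# 97-9 -> 97-9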
You could do the following:
# strip +
df['amt'] = df['amt'].str.strip('+')
# move -
mask = df['amt'].str.contains('-$')
df.loc[mask, 'amt'] = '-' + df.loc[mask, 'amt'].str.rstrip('-')
# transform to numeric
res = pd.to_numeric(df['amt'], errors='coerce')
print(res)
Output
0 11.11
1 22.22
2 33.33
3 -44.44
4 -55.55
5 66.66
6 77.77
7 NaN
8 99.00
9 NaN
Name: amt, dtype: float64
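If you'd rather keep the original amt column intact and write the result into clean_amt as in the question, one possible variant of the same steps (just a sketch using the question's sample data):
import pandas as pd

df = pd.DataFrame({'amt': ['11.11', '+22.22', '33.33+', '-44.44', '55.55-',
                           '66.66', '77.77', '8a8', '99', '97-9']})

# work on a copy so the original column is preserved
s = df['amt'].str.strip().str.strip('+')

# move a trailing '-' to the front
mask = s.str.endswith('-')
s = s.where(~mask, '-' + s.str.rstrip('-'))

df['clean_amt'] = pd.to_numeric(s, errors='coerce')
print(df[['amt', 'clean_amt']])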
I have an Input file:
ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
I'm trying to find the maximum count of consecutive days where FEES is 0 for a particular ROLL_NO. If FEES is never 0 for a particular ROLL_NO, the max count should be zero for that ROLL_NO.
Expected Output:
ID,ROLL_NO,MAX_CNT -- First occurrence of ID for a particular ROLL_NO should come as ID in output
1,12345,3
3,987654,0
This is what I've come up with so far,
import pandas as pd
df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLL_NO','ADM_DATE'])['ROLL_NO'].transform(pd.Series.value_counts)
print(df)
But I don't believe this is the right way to approach this.
Could someone help a Python newbie out here?
You can use:
#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
ID
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 4
10 5
11 5
dtype: int32
#filter 0 FEES, count, get max per first level and last add missing roll no by reindex
mask = df['FEES'].eq(0)
df = (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.max(level=0)
.reindex(df['ROLL_NO'].unique(), fill_value=0)
.reset_index(name='MAX_CNT'))
print (df)
ROLL_NO MAX_CNT
0 12345 3
1 987654 0
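Note that Series.max(level=0) has since been deprecated and removed in newer pandas releases; a minimal equivalent of the pipeline above for pandas 2.x (a sketch, assuming the same I5.txt input) replaces it with groupby(level=0).max():
import pandas as pd

df = pd.read_csv('I5.txt')

r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
mask = df['FEES'].eq(0)

res = (df[mask].groupby(['ROLL_NO', a[mask]])
               .size()
               .groupby(level=0).max()   # replaces the removed .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0)
               .reset_index(name='MAX_CNT'))
print(res)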
Explanation:
First compare the FEES column with 0 (eq is the same as ==) and multiply the mask by the ROLL_NO column:
mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0 0
1 0
2 0
3 12345
4 12345
5 0
6 12345
7 12345
8 12345
9 0
10 0
dtype: int64
Get consecutive groups by comparing r with its shifted version and taking the cumsum:
a = r.ne(r.shift()).cumsum()
print (a)
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 5
10 5
dtype: int32
Filter only the rows where FEES is 0 and groupby with size; also filter a to the same indexes:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO
12345 2 2
4 3
dtype: int64
Get max values per first level of MultiIndex:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size().max(level=0))
ROLL_NO
12345 3
dtype: int64
Last, add the missing ROLL_NO values (those with no 0 at all) by reindex:
print (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.max(level=0)
.reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345 3
987654 0
dtype: int64
And to get columns from the index, use reset_index.
EDIT:
For first ID use drop_duplicates with insert and map:
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.max(level=0)
.reindex(df['ROLL_NO'].unique(), fill_value=0)
.reset_index(name='MAX_CNT'))
df1.insert(0, 'ID', df1['ROLL_NO'].map(s))
print (df1)
ID ROLL_NO MAX_CNT
0 1 12345 3
1 3 987654 0
I have to compare a column with all the other columns in the dataframe. The column I have to compare with the others is in position 4, so I write df.iloc[x,4] to take its values. Then I have to take these values, multiply them by the values in the next column (for example df.iloc[x,5]), create a new column in the dataframe and save the results. Then I have to repeat this procedure up to the last existing column (the original dataframe has 43 columns, so the end is df.iloc[x,43]).
How can I do this in Python?
If it is possible, can you give some examples? I tried to put my code in the post, but I'm not good with my new phone.
I think you can use eq to compare the filtered DataFrame with column E in position 4:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,8,9],
'G':[1,3,5],
'H':[5,3,6],
'I':[7,4,3]})
print (df)
A B C D E F G H I
0 1 4 7 1 5 7 1 5 7
1 2 5 8 3 3 8 3 3 4
2 3 6 9 5 6 9 5 6 3
print (df.iloc[:,5:].eq(df.iloc[:,4], axis=0))
F G H I
0 False False True False
1 False True True False
2 False False True False
If you need to multiply by the column in position 4, use mul:
print (df.iloc[:,5:].mul(df.iloc[:,4], axis=0))
F G H I
0 35 5 25 35
1 24 9 9 12
2 54 30 36 18
Or if you need to multiply by the columns shifted by one position:
print (df.iloc[:,4:].mul(df.iloc[:,5:], axis=0, fill_value=1))
E F G H I
0 5.0 49 1 25 49
1 3.0 64 9 9 16
2 6.0 81 25 36 9
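If, as described in the question, each product should also be saved back as a new column of the original DataFrame, a possible sketch (using the example df above; the _x_E style column names are just an illustrative naming choice):
# multiply the column at position 4 by every later column
prod = df.iloc[:, 5:].mul(df.iloc[:, 4], axis=0)

# give the products descriptive names and attach them to df
prod.columns = ['{}_x_{}'.format(c, df.columns[4]) for c in prod.columns]
df = df.join(prod)
print(df)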
>>> df.head()
№ Summer Gold Silver Bronze Total № Winter \
Afghanistan (AFG) 13 0 0 2 2 0
Algeria (ALG) 12 5 2 8 15 3
Argentina (ARG) 23 18 24 28 70 18
Armenia (ARM) 5 1 2 9 12 6
Australasia (ANZ) [ANZ] 2 3 4 5 12 0
Gold.1 Silver.1 Bronze.1 Total.1 № Games Gold.2 \
Afghanistan (AFG) 0 0 0 0 13 0
Algeria (ALG) 0 0 0 0 15 5
Argentina (ARG) 0 0 0 0 41 18
Armenia (ARM) 0 0 0 0 11 1
Australasia (ANZ) [ANZ] 0 0 0 0 2 3
Silver.2 Bronze.2 Combined total
Afghanistan (AFG) 0 2 2
Algeria (ALG) 2 8 15
Argentina (ARG) 24 28 70
Armenia (ARM) 2 9 12
Australasia (ANZ) [ANZ] 4 5 12
Not sure why I see this error:
>>> df['Gold'] > 0 | df['Gold.1'] > 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ankuragarwal/data_insight/env/lib/python2.7/site-packages/pandas/core/generic.py", line 917, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What's ambiguous here?
But this works:
>>> (df['Gold'] > 0) | (df['Gold.1'] > 0)
Assuming we have the following DF:
In [35]: df
Out[35]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
4 1 4 6
Consider the following command:
df.a > 5 | df.b > 5
Because | has higher precedence than > (as specified in the operator precedence table), it is evaluated as:
df.a > (5 | df.b) > 5
which, being a chained comparison, expands to:
df.a > (5 | df.b) and (5 | df.b) > 5
step by step:
In [36]: x = (5 | df.b)
In [37]: x
Out[37]:
0 5
1 7
2 13
3 7
4 5
Name: b, dtype: int32
In [38]: df.a > x
Out[38]:
0 True
1 False
2 False
3 False
4 False
dtype: bool
In [39]: x > 5
Out[39]:
0 False
1 True
2 True
3 True
4 False
Name: b, dtype: bool
but the last operation won't work:
In [40]: (df.a > x) and (x > 5)
---------------------------------------------------------------------------
...
skipped
...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error message above might lead inexperienced users to do something like this:
In [12]: (df.a > 5).all() | (df.b > 5).all()
Out[12]: False
In [13]: df[(df.a > 5).all() | (df.b > 5).all()]
...
skipped
...
KeyError: False
But in this case you just need to set the precedence explicitly in order to get the expected result:
In [10]: (df.a > 5) | (df.b > 5)
Out[10]:
0 True
1 True
2 True
3 True
4 False
dtype: bool
In [11]: df[(df.a > 5) | (df.b > 5)]
Out[11]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
This is the real reason for the error:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, or not. It is not clear what the result of
>>> if pd.Series([False, True, False]):
...
should be. Should it be True because it's not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:
>>> if pd.Series([False, True, False]):
print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
If you see that, you need to explicitly choose what you want to do with it (e.g., use any(), all() or empty). Or you might want to check whether the pandas object is None.
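For instance, a small sketch of those explicit choices on such a Series:
import pandas as pd

s = pd.Series([False, True, False])

print(s.any())    # True  - at least one element is True
print(s.all())    # False - not every element is True
print(s.empty)    # False - the Series does contain elements
print(s is None)  # False - the object itself is not None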
I have a file that looks at ratings that teacher X gives to teacher Y and the date each rating occurs.
clear
rating_id RatingTeacher RatedTeacher Rating Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look back in the history to see how many times the RatingTeacher had been rated at the time they made the rating, and their cumulative rating score. The result would look like this:
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there was a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist, but there is no reason for lower standards to apply. See the Statalist FAQ.
I rarely see even programmers be precise about what they mean by "efficient", whether it means fewer lines of code, less use of memory, more speed, something else, or is just some all-purpose term of praise. This code loops over observations, which can certainly be slow for large datasets. More in this paper.
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
count if Date < Date[`i'] & RatedT == RatingT[`i']
replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
The building block is that the rater and ratee are a pair. You can use egen's group() to give a unique ID to each rater ratee pair.
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n
I have a data.frame and I want to split one of its columns into two based on a regular expression. More specifically, the strings have a suffix in parentheses that needs to be extracted into a column of its own.
So e.g. I want to get from here:
dfInit <- data.frame(VAR = paste0(c(1:10),"(",c("A","B"),")"))
to here:
dfFinal <- data.frame(VAR1 = c(1:10), VAR2 = c("A","B"))
1) gsubfn::read.pattern: read.pattern in the gsubfn package can do that. The matches to the parenthesized portions of the regular expression are regarded as the fields:
library(gsubfn)
read.pattern(text = as.character(dfInit$VAR), pattern = "(.*)[(](.*)[)]$")
giving:
V1 V2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
2) sub: Another way is to use sub:
data.frame(V1=sub("\\(.*", "", dfInit$VAR), V2=sub(".*\\((.)\\)$", "\\1", dfInit$VAR))
giving the same result.
3) read.table: This solution does not use a regular expression:
read.table(text = as.character(dfInit$VAR), sep = "(", comment = ")")
giving the same result.
You could also use extract from tidyr
library(tidyr)
extract(dfInit, VAR, c("VAR1", "VAR2"), "(\\d+).([[:alpha:]]+).", convert=TRUE) # edited and added `convert=TRUE` as per #aosmith's comments.
# VAR1 VAR2
#1 1 A
#2 2 B
#3 3 A
#4 4 B
#5 5 A
#6 6 B
#7 7 A
#8 8 B
#9 9 A
#10 10 B
See Split column at delimiter in data frame
dfFinal <- within(dfInit, VAR<-data.frame(do.call('rbind', strsplit(as.character(VAR), '[[:punct:]]'))))
> dfFinal
VAR.X1 VAR.X2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
An approach with regmatches and gregexpr:
as.data.frame(do.call(rbind, regmatches(dfInit$VAR, gregexpr("\\w+", dfInit$VAR))))
You can also use cSplit from splitstackshape.
library(splitstackshape)
cSplit(dfInit, "VAR", "[()]", fixed=FALSE)
# VAR_1 VAR_2
# 1: 1 A
# 2: 2 B
# 3: 3 A
# 4: 4 B
# 5: 5 A
# 6: 6 B
# 7: 7 A
# 8: 8 B
# 9: 9 A
#10: 10 B