Pandas: select rows from columns using Regex - regex

I want to select the rows whose feccandid value starts with H or S:
         cid  amount  date catcode  feccandid
0  N00031317    1000  2010   B2000  H0FL19080
1  N00027464    5000  2009   B1000  H6IA01098
2  N00024875    1000  2009   A5200  S2IL08088
3  N00030957    2000  2010   J2200  S0TN04195
4  N00026591    1000  2009   F3300  S4KY06072
5  N00031317    1000  2010   B2000  P0FL19080
6  N00027464    5000  2009   B1000  P6IA01098
7  N00024875    1000  2009   A5200  S2IL08088
8  N00030957    2000  2010   J2200  H0TN04195
9  N00026591    1000  2009   F3300  H4KY06072
I am using this code:
campaign_contributions.loc[campaign_contributions['feccandid'].astype(str).str.extractall(r'^(?:S|H)')]
It raises this error:
ValueError: pattern contains no capture groups
Does anyone with experience using Regex know what I am doing wrong?

Why not just use str.match instead of extract and negate?
i.e. df[df['col'].str.match(r'^(S|H)')]
(I came here looking for the same answer, but the use of extract seemed odd, so I found the docs for str.ops.)

For something this simple, you can bypass the regex:
relevant = campaign_contributions.feccandid.str.startswith('H') | \
campaign_contributions.feccandid.str.startswith('S')
campaign_contributions[relevant]
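However, if you want to use a regex, you can change your code to
relevant = campaign_contributions['feccandid'].str.extract(r'^(S|H)', expand=False).notnull()
campaign_contributions[relevant]
Note that the astype is redundant, and that extract (rather than extractall) is enough. The pattern needs a capturing group, (S|H), because (?:S|H) is non-capturing, which is what triggers the error. expand=False makes extract return a Series, which is what boolean indexing needs.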
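To make the cause of the error concrete, here is a small sketch using only Python's re module rather than pandas. It compares the pattern from the question with the capturing version from the answers, and shows that a plain match test is enough for a yes/no filter. The list of values is copied from the question. The link to the pandas error is an assumption based on the error message: extract and extractall appear to reject patterns whose group count is zero.

```python
import re

# Values copied from the feccandid column in the question.
feccandid = ["H0FL19080", "H6IA01098", "S2IL08088", "S0TN04195", "S4KY06072",
             "P0FL19080", "P6IA01098", "S2IL08088", "H0TN04195", "H4KY06072"]

non_capturing = re.compile(r"^(?:S|H)")  # pattern from the question
capturing = re.compile(r"^(S|H)")        # pattern from the answers

# (?:...) only groups the alternatives; it captures nothing to extract.
print(non_capturing.groups)  # 0
print(capturing.groups)      # 1

# For a yes/no filter a match test is enough, so no capture group is required.
keep = [bool(non_capturing.match(value)) for value in feccandid]
selected = [value for value, flag in zip(feccandid, keep) if flag]
print(selected)
```

This is why str.match and str.startswith work with either pattern, while the extract family needs at least one pair of capturing parentheses.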

Related

How to isolate specific words and years in group 1 no matter the combination

I have some incoming values in various formats, such as
autumn ux-s 2021
2021 pes-3 autumn P-S
pes-3 autumn 2021 32
autumn usd- fosd 2021 2
I really want to isolate "autumn" and "2021", both in group 1.
Of course "autumn" could also be "spring", "summer" or "winter", and the year could be any four-digit year.
It doesn't matter whether I get "2021 autumn" or "autumn 2021", as long as I can isolate both within the same group 1.
How could I achieve this? I simply can't see how to keep it within one single group.
I can isolate the location here, but of course this still matches the whole thing:
((?:(?:autumn|spring)(?:\s*[a-zA-Z]*\s*)\d{4})|(?:\d{4}(?:\s*[a-zA-Z]*\s*)(?:autumn|spring)))
Can I somehow extract only parts of this match and combine them into a single group result?
I did not find a regex that captures what you want in one group, but maybe this one-liner solution helps you?
import re
text = ["autumn ux-s 2021", "2021 pes-3 autumn P-S", "pes-3 autumn 2021 32" ,"autumn usd- fosd 2021 2"]
pattern = r"(autumn|summer|winter|spring).*(\d{4})|(\d{4}).*(autumn|summer|winter|spring)"
print([' '.join(filter(None, re.search(pattern, txt, re.IGNORECASE).groups())) for txt in text])
output: ['autumn 2021', '2021 autumn', 'autumn 2021', 'autumn 2021']
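An alternative sketch, not taken from the answer above: search for the season and the year independently and join the two results, which avoids the four-group alternation and always yields the same order. The function name season_year and the two compiled patterns are hypothetical names chosen for this example, and it assumes the first standalone four-digit number in the text is the year.

```python
import re

# Hypothetical helper patterns; \b keeps "2021" from matching inside longer tokens.
SEASON = re.compile(r"\b(autumn|spring|summer|winter)\b", re.IGNORECASE)
YEAR = re.compile(r"\b(\d{4})\b")

def season_year(text):
    """Return 'season year' if both parts are found, otherwise None."""
    season = SEASON.search(text)
    year = YEAR.search(text)
    if season is None or year is None:
        return None
    return f"{season.group(1).lower()} {year.group(1)}"

samples = ["autumn ux-s 2021", "2021 pes-3 autumn P-S",
           "pes-3 autumn 2021 32", "autumn usd- fosd 2021 2"]
results = [season_year(s) for s in samples]
print(results)  # ['autumn 2021', 'autumn 2021', 'autumn 2021', 'autumn 2021']
```

Unlike the one-liner, this returns None instead of raising when a part is missing.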

Generate a variable based on the most recent I/observation

My data is currently organized in Stata as follows:
input str2 Country gdp_2015 gdp_2016 gdp_2017 imports_2016 imports_2017 exports_2016 exports_2017
"A" 11 12 13 5 6 8 5
"B" 11 . . 5 6 10 5
"C" 12 13 . 5 6 8 5
end
gen net_imports = (imports_2017-exports_2017)
gen net_imports_toGDP = (net_imports/gdp_2017)
The code works, but it only creates the variable if a country has 2017 data. I would like to create an imports-to-GDP ratio based on the most recent observation available for GDP.
You could simply replace the missing data as follows:
replace gdp_2016 = gdp_2015 if mi(gdp_2016)
replace gdp_2017 = gdp_2016 if mi(gdp_2017)
However, a more general approach would begin by reshaping your data from wide to long:
reshape long gdp_ imports_ exports_, i(Country)
See help reshape for more detail on the command. The gdp_ etc. are the stubs that will be the new variable names, and i(Country) sets the identifier.
Then you can fill forward within each country using time-series operators:
encode Country, generate(Country_num)
xtset Country_num _j
replace gdp_=l.gdp_ if mi(gdp_) & !mi(l.gdp_)
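The "most recent available GDP" idea can also be sketched outside Stata. The Python below is only an illustration of the logic, not a translation of the commands above. It uses the three sample countries, with None standing in for Stata's missing value, and assumes that the last data column in the question is exports_2017. The helper name latest_non_missing is hypothetical.

```python
# Sample data from the question; None stands in for Stata's missing value (.).
# Assumption: the final data column is exports_2017.
rows = {
    "A": {"gdp": {2015: 11, 2016: 12, 2017: 13}, "imports_2017": 6, "exports_2017": 5},
    "B": {"gdp": {2015: 11, 2016: None, 2017: None}, "imports_2017": 6, "exports_2017": 5},
    "C": {"gdp": {2015: 12, 2016: 13, 2017: None}, "imports_2017": 6, "exports_2017": 5},
}

def latest_non_missing(series):
    """Return (year, value) for the most recent year with a non-missing value."""
    for year in sorted(series, reverse=True):
        if series[year] is not None:
            return year, series[year]
    return None, None

ratios = {}
for country, row in rows.items():
    gdp_year, gdp = latest_non_missing(row["gdp"])
    net_imports = row["imports_2017"] - row["exports_2017"]
    # Leave the ratio missing if the country has no GDP value at all.
    ratios[country] = (gdp_year, net_imports / gdp) if gdp is not None else (None, None)

for country, (gdp_year, ratio) in ratios.items():
    print(country, gdp_year, ratio)
```

Keeping the year next to the ratio makes it visible which GDP observation each country's figure rests on.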

PowerBI running Total formula

I have a dataset OvertimeHours with EMPLID, CheckDate and NumberOfHours (and other fields). I need a running total of NumberOfHours for each employee by CheckDate. I tried the Quick Measure option, but it only allows a single column and I have two. I do not want the measure to recalculate when filters are applied. Ultimately, I am trying to identify the records that make up the first 6 hours of overtime worked on each check so that they get a category of OCB, while all overtime beyond the first 6 hours is OTP. It does not have to be exact (as the output below shows). I have only been working with Power BI for about a month, and this is a pretty complex formula (for me) to figure out...
EMPLID  CheckDate  WkDate    NumberOfHours  RunningTotal  Category
124     1/1/19     12/20/18  5              5             OCB
124     1/1/19     12/21/18  9              14            OTP
125     1/1/19     12/20/18  3              3             OCB
125     1/1/19     12/20/18  2              5             OCB
125     1/1/19     12/22/18  2              7             OTP
124     1/15/19    1/8/19    3              3             OCB
*Edited to add the WkDate.
Edit:
I have tweaked my query so that I have the running total and a sequential counter now:
Using the first 12 records, I am looking to get the following results:
I can either do it in a query if that is the easiest way or if there is a way to use DAX in PowerBI with this dataset now that I have the sequential piece, I can do that too.
I got it in the query:
select r.CheckDate,
       r.EMPLID,
       case
           when PayrollRunningOTHours <= 6
               then PayrollRunningOTHours
           else 6
       end as OCBHours,
       case
           when PayRollRunningOTHours > 6
               then PayRollRunningOTHours - 6
       end as OTPHours
from #rollingtotal r
inner join lastone l
    on r.CheckDate = l.CheckDate
   and r.EMPLID = l.EMPLID
   and r.OTCounter = l.lastRec
order by r.emplid,
         r.CheckDate,
         r.OTCounter
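For readers who want to check the logic outside SQL or DAX, here is a minimal Python sketch of the same idea on the six sample rows from the question. It keeps a running total per (EMPLID, CheckDate), tags each row OCB or OTP as in the expected output, and then splits each check's total at 6 hours as the query does. The constant OCB_LIMIT and the variable names are assumptions made for this example, and the rows are assumed to arrive already ordered by work date.

```python
from collections import defaultdict

# (EMPLID, CheckDate, WkDate, NumberOfHours) rows from the question,
# assumed to be ordered by work date within each check.
rows = [
    (124, "1/1/19", "12/20/18", 5),
    (124, "1/1/19", "12/21/18", 9),
    (125, "1/1/19", "12/20/18", 3),
    (125, "1/1/19", "12/20/18", 2),
    (125, "1/1/19", "12/22/18", 2),
    (124, "1/15/19", "1/8/19", 3),
]
OCB_LIMIT = 6  # first 6 overtime hours per check count as OCB

running = defaultdict(int)
result = []
for emplid, check_date, wk_date, hours in rows:
    running[(emplid, check_date)] += hours
    total = running[(emplid, check_date)]
    # Row-level category as in the question's table (not an exact split).
    category = "OCB" if total <= OCB_LIMIT else "OTP"
    result.append((emplid, check_date, wk_date, hours, total, category))

# Per-check split of the final total into (OCB hours, OTP hours).
split = {key: (min(total, OCB_LIMIT), max(total - OCB_LIMIT, 0))
         for key, total in running.items()}

for row in result:
    print(row)
print(split)
```

One difference from the query: when a check has 6 hours or fewer, this sketch reports 0 OTP hours, whereas the second case expression has no else branch and so yields NULL.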

regex grouping and decimal plus comma and Time

I need a regex for these strings:
17 h 13 min - 43,1 km - Ankunft ca. 11:48
17 h 13 min - 1.443,1 km - Ankunft ca. 11:48
13 min - 431 m - Ankunft ca. 11:48
17 h - 3 km - Ankunft ca. 09:04
It has to be split into 3 groups (routearrivalin, routedistance, routeeta):
routearrivalin: 17 h 13 m
routedistance: 43,1 km
routeeta: 11:48
I have tried a lot but keep failing. What I have right now is this regex:
(?<routearrivalin>\d+?\s?h?\s?\d+?\s*m).+(?<routedistance>\d\d,\d\s[km|m]*).+(?<routeeta>\d\d:\d\d)
I can't get the distance part to match every possibility I have.
If anyone is curious what I'm doing here: the string is the navigation information you get on an Android phone from Google Maps. I want to split the string into the variables mentioned above, handle them with Tasker, and forward them to the display in my speedometer.
So it took a little tweaking, but I have included some extra groups to handle the optional and changing parts and I think this is what you are looking for:
(?<routearrivalin>([\d\s]+h)?([\d\s]*m)?)(in)?[\s-]+(?<routedistance>([\d.,]+\s+(km|m))).+(?<routeeta>\d\d:\d\d)
You can see it here at Regex101 working with the example you have given
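As a quick way to check the pattern against all four sample strings, here is a sketch in Python. The only change to the regex is the named-group syntax, (?P<name>...) instead of (?<name>...), which Python's re module requires. Whether Tasker accepts the pattern unchanged is not verified here.

```python
import re

# The answer's pattern, with named groups rewritten in Python's (?P<name>...) form.
ROUTE = re.compile(
    r"(?P<routearrivalin>([\d\s]+h)?([\d\s]*m)?)(in)?[\s-]+"
    r"(?P<routedistance>([\d.,]+\s+(km|m))).+"
    r"(?P<routeeta>\d\d:\d\d)"
)

samples = [
    "17 h 13 min - 43,1 km - Ankunft ca. 11:48",
    "17 h 13 min - 1.443,1 km - Ankunft ca. 11:48",
    "13 min - 431 m - Ankunft ca. 11:48",
    "17 h - 3 km - Ankunft ca. 09:04",
]

# groupdict() returns only the three named groups, not the helper groups.
parsed = [ROUTE.search(s).groupdict() for s in samples]
for fields in parsed:
    print(fields)
```

Note that routearrivalin stops at the "m" of "min" (giving "17 h 13 m"), because the trailing "in" is consumed by the separate optional (in) group.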

Creating a flag using indexes

I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that a lot of students took a test in, say, 6th grade, didn't take one in 7th, and then took one in 8th grade. This varies across years and school districts, as certain school districts adopted tests in different years for different grade levels. My code doesn't account for this.
Regardless, I think there must be more elegant ways to do this. As a side note, I wanted to know whether the use of indexes (such as [_n+1]) is appropriate for a problem like this. Thanks!
Edit
I have included a sample of what my data looks like above, in response to one of the comments. If it is still not clear, any feedback is welcome.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
bysort id (year) : gen flag = (tested - tested[_n-1]) - (year - year[_n-1])
list if flag != 0 & flag < . , sepby(id)
     +---------------------------------------+
     | id   year   subject   tested~e   flag |
     |---------------------------------------|
  5. |  2   2012         r          7      2 |
     +---------------------------------------+
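The same comparison can be sketched in Python to show what the flag measures: the change in tested grade minus the change in year, within each student, so a gap year with no test (student 3) is not flagged. This is an illustration on the sample data only. Like the Stata line above, it compares consecutive records per id and ignores subject.

```python
# (id, year, subject, tested_grade) rows from the sample data.
records = [
    (1, 2011, "m", 10), (1, 2012, "m", 11), (1, 2013, "m", 12),
    (2, 2011, "r", 4), (2, 2012, "r", 7), (2, 2013, "r", 8),
    (3, 2011, "m", 6), (3, 2013, "m", 8),
]

flags = []
prev = None
for rec in sorted(records, key=lambda r: (r[0], r[1])):  # sort by id, then year
    if prev is not None and prev[0] == rec[0]:
        # Grade change minus elapsed years: 0 means normal progression.
        flag = (rec[3] - prev[3]) - (rec[1] - prev[1])
    else:
        flag = None  # first record for a student has nothing to compare with
    flags.append(rec + (flag,))
    prev = rec

flagged = [r for r in flags if r[4] not in (None, 0)]
print(flagged)  # [(2, 2012, 'r', 7, 2)]
```

A positive flag means the student advanced faster than the calendar (skipped grades), and a negative flag means slower progression (a repeated grade).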