modification of alter text in pandas column based on names - regex

Background
I have the following df which is a modification of Alter text in pandas column based on names
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Doe works ',
'So is Mary Doe, works too',
'Jane Ann, Doe doesnt',
'Jone, Dow doesnt either'],
'P_ID': [1,2,3,4],
'P_Name' : ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone' ]
})
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
The following code blocks names like Jon J Doe, but it doesn't work when a name such as Jane Ann Doe has a character between its parts, e.g. Jane Ann, Doe or Jone! Dow
df['NewText'] = df['Text'].replace(df['P_Name'].str.split(', *').apply(lambda l: ' '.join(l[::-1])),'**BLOCK**',regex=True)
Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt Jane Ann, Doe doesnt
3 4 Dow, Jone Jone,Dow doesnt either Jone, Dow doesnt either
Goal
1) Tweak the code above to account for , (or any other characters that may appear between the names)
(I know I can strip commas, but I need to leave them in)
Desired Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt **BLOCK** doesnt
3 4 Dow, Jone Jone,Dow doesnt either **BLOCK** doesnt either
Question
How do I tweak my code to get my desired output?

I don't know whether there are many such cases, but if you only have a limited number of them, a replacement dict will do.
Sample DataSet:
>>> df
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
You can build a dict that maps each name variant to its replacement and apply it to the DataFrame to get the result.
>>> replace_values = {'Jon J Doe': '**BLOCK**', 'Mary Doe': '**BLOCK**', 'Jane Ann, Doe': '**BLOCK**', 'Jone, Dow': '**BLOCK**'}
Resulting DataFrame:
>>> df = df.replace(replace_values, regex=True)
>>> df
P_ID P_Name Text
0 1 Doe, Jon J **BLOCK** works
1 2 Doe, Mary So is **BLOCK**, works too
2 3 Doe, Jane Ann **BLOCK** doesnt
3 4 Dow, Jone **BLOCK** doesnt either

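Try this. The pattern is an alternation of the name parts and \W+ (a run of non-word characters), repeated three or more times. Note that it also consumes the punctuation around the name: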
df['NewText'] = df['Text'].replace( r'('+ df['P_Name'].str.split('\W+').str.join('|')+'|\W+){3,}', ' **BLOCK** ', regex=True)
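A row-wise alternative can be sketched with Python's re module alone. This is a minimal sketch, assuming every P_Name has the form "Last, First [Middle]" and that any run of non-word characters may separate the name parts in Text; block_name is a hypothetical helper, not part of pandas.

```python
import re

def block_name(text, p_name, placeholder="**BLOCK**"):
    """Replace the person's name in text, allowing punctuation between its parts."""
    # "Doe, Jane Ann" -> last="Doe", first="Jane Ann" (assumed "Last, First" layout)
    last, first = [part.strip() for part in p_name.split(",", 1)]
    # Name parts in the order they appear in the text: ["Jane", "Ann", "Doe"]
    tokens = first.split() + last.split()
    # \W+ between tokens matches spaces, commas, "!" and similar separators
    pattern = r"\b" + r"\W+".join(re.escape(token) for token in tokens) + r"\b"
    # A function replacement keeps the placeholder from being read as a regex template
    return re.sub(pattern, lambda _match: placeholder, text)

# Sample rows copied from the question: (Text, P_Name)
rows = [
    ("Jon J Doe works ", "Doe, Jon J"),
    ("So is Mary Doe, works too", "Doe, Mary"),
    ("Jane Ann, Doe doesnt", "Doe, Jane Ann"),
    ("Jone, Dow doesnt either", "Dow, Jone"),
]
new_text = [block_name(text, name) for text, name in rows]
```

With the DataFrame from the question, the same helper could be applied as df['NewText'] = [block_name(t, n) for t, n in zip(df['Text'], df['P_Name'])]. Punctuation after the name, such as the comma in "Mary Doe,", is left in place.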

Related

Regex: I want to separate addresses that include some exceptions

Rules I'm trying to apply:
Group 1 must always contain text, and if the string starts with "the" then also include it.
Group 2 is optional and can be (street or road).
Group 3 is optional and can contain (east or west).
I've got most of the way with the following (I think):
(.+?)\b\s?((?i)ROAD|STREET)*.?((?<= +)(?i)WEST|EAST)?$
but with a couple of exceptions:
"the street" is separated but needs to all be in Group 1 as it starts with a "the".
"STREET" is in Group 2 but needs to be in Group 1 as Group 1 always needs a value
Text | Match | Position | Length | Group 1 | Group 2 | Group 3
smith | smith | 0 | 5 | smith | |
the street | the street | 5 | 11 | the | street |
STREET | STREET | 16 | 7 | | STREET |
the street west | the street west | 23 | 16 | the | street | west
smith street | smith street | 39 | 13 | smith | street |
smith road | smith road | 52 | 11 | smith | road |
smith strreet east | smith strreet east | 63 | 19 | smith strreet | | east
SMITH | SMITH | 82 | 6 | SMITH | |
SMITH Street | SMITH Street | 88 | 13 | SMITH | Street |
SMITH Street | SMITH Street | 101 | 14 | SMITH | Street |
Smith Street West abc | Smith Street West abc | 115 | 22 | Smith Street West abc | |
Smith Street East | Smith Street East | 137 | 18 | Smith | Street | East
Smith SttReet East | Smith SttReet East | 155 | 19 | Smith SttReet | | East
Smith Street West | Smith Street West | 174 | 18 | Smith | Street | West
((.+?)\b\s?((?i)ROAD|STREET)*).?((?<=)(?i)WEST|EAST)?$
I made two changes:
I removed the whitespace and the plus symbol (" +") from the lookbehind. This fixes the blank Group 1.
I wrapped "the" and "street" in an outer group, so both appear in Group 1 when they are present.
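The rules can also be sketched in Python's re flavor. This is only a sketch under assumptions: the question does not name its regex engine, split_address is a hypothetical helper, and the "the" rule is read here as "a leading 'the' plus the following word always stays in Group 1". Python's re does not accept a variable-length lookbehind such as (?<= +), and recent versions reject an inline (?i) in the middle of a pattern, so the sketch uses re.IGNORECASE and optional non-capturing wrappers instead.

```python
import re

ADDRESS = re.compile(
    r"(the\s+\S+|.+?)"          # Group 1: "the <word>", or as little text as possible
    r"(?:\s+(road|street))?"    # Group 2: optional street type
    r"(?:\s+(east|west))?",     # Group 3: optional direction
    re.IGNORECASE,
)

def split_address(text):
    """Return (group1, group2, group3); missing optional groups are None."""
    match = ADDRESS.fullmatch(text.strip())
    return match.groups() if match else None

examples = ["smith", "the street", "STREET", "smith strreet east", "Smith Street East"]
results = {text: split_address(text) for text in examples}
```

Because Group 1 is lazy and must match at least one character, a lone "STREET" lands in Group 1, and trailing text such as "abc" pushes the whole string into Group 1.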

Removing duplicate substring in SAS

In the following sample data I am trying to remove the duplicate substrings in my string using the code below:
data z;
input pvd_name_orig $50.;
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
DATA want (keep=pvd_name_orig pvd_name_temp);
SET z;
pvd_name_temp=scan(pvd_name_orig, 1, ' ');
do i=2 to countw(pvd_name_orig,' ');
word=scan(pvd_name_orig, i, ' ');
found=find(pvd_name_temp, word, 'it');
if found=0 then pvd_name_temp=catx(' ', pvd_name_temp, word);
end;
run;
The above code works well for all names except those with a middle initial whose letter also occurs elsewhere in the string. In that case it deletes the middle initial, treating it as a repetition. Is there a way to avoid deleting single-letter words in my string?
I have tried manually adding a period after the middle initial, and then the code does not delete it in the new variable. However, I am unable to add a period after the middle initial using SAS code.
I have used the following code to add a period at the second position of the second word (randomly), but it only adds a period at the second character in the string.
data want;
set z;
if length(compress(scan(pvd_name_orig,2,' '),'.'))=1 then substr(pvd_name_orig,2,1)='.';
run;
My final desired output is
Obs pvd_name_orig pvd_name_temp
1 MD SMITH, JOHN MD MD SMITH, JOHN
2 SMITH, JOHN W SMITH, JOHN W
3 MD T. SMITH, JOHN W. MD T. SMITH, JOHN W.
4 SMITH, JOHN WILLIAM SMITH, JOHN WILLIAM
5 JOHN N MD SMITH MD JOHN N MD SMITH
6 MD JOHN W. SMITH MD MD JOHN W. SMITH
7 MD SMITH, MD JOHN MD SMITH, JOHN
Any suggestions??
A regex can resolve the issue. Here \b is a word boundary and \S+ is a run of non-space characters, so (\b\S+\b) captures a word; (.*) captures anything that follows; and (\1) is a repeat of the captured word, which is the part to delete. The replacement \1\2 keeps the (\b\S+\b)(.*) part.
data z;
input name $50.;
new_name=prxchange('s/(\b\S+\b)(.*)(\1)/\1\2/',-1,strip(name));
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
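For comparison, the whole-word idea can be sketched in Python rather than SAS. This is a minimal sketch, assuming words are separated by whitespace, that trailing commas and periods should be ignored when comparing, and that single-letter words (initials) are never treated as duplicates; drop_repeated_words is a hypothetical helper.

```python
def drop_repeated_words(name):
    """Keep the first occurrence of each word, comparing whole words only."""
    seen = set()
    kept = []
    for word in name.split():
        # Compare case-insensitively and without trailing punctuation
        key = word.strip(".,").upper()
        # Single-letter words are initials: always keep them
        if len(key) > 1 and key in seen:
            continue
        seen.add(key)
        kept.append(word)
    return " ".join(kept)

names = [
    "MD SMITH, JOHN MD",
    "SMITH, JOHN W",
    "MD T SMITH, JOHN W.",
    "JOHN N MD SMITH MD",
    "MD SMITH, MD JOHN",
]
cleaned = [drop_repeated_words(name) for name in names]
```

Comparing whole words instead of searching for a substring is what keeps the "N" in "JOHN N MD SMITH MD" from being dropped as a repeat of the "N" inside "JOHN".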

Pandas - Strip col1 values from column2 values if exists match (regex match with dynamic value)

I have a task to update values in column_1 IF it has a full match to value from column_2.
Like so
name city
Danny London London
Tim Detroit Detroit
Keith New Orleans The city of New Orleans
Mary Jane London
=>
name city
Danny London <- updated
Tim Detroit <- updated
Keith New Orleans The city of New Orleans
Mary Jane London
So far I've tried this
condlist = [df.apply(lambda x: x.name_cleaned.endswith(f"{x.city}"), axis=1)]
choicelist = [df.name_cleaned.str.replace(rf'{df.city}$', '', regex=True)]
df['name_cleaned'] = np.select(condlist, choicelist, default=df.name_cleaned)
But it returns the same df. I've checked and condlist works as expected - returns True/False for values, the problem is in choicelist - not sure how to pass regex with dynamic value. Would really appreciate any help.
Instead of testing with endswith, you can append $ to the pattern to anchor it at the end of the string, prefix it with \s+ to match the preceding whitespace, and replace the match with an empty string using re.sub:
import re
df['name'] = df.apply(lambda x: re.sub(rf"\s+{x.city}$",'',x['name']), axis=1)
print (df)
name city
0 Danny London
1 Tim Detroit
2 Keith New Orleans The city of New Orleans
3 Mary Jane London
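If a city can contain regex metacharacters (for example the "." in "St. Louis"), the interpolated value may need escaping. A minimal sketch using only the re module, where strip_city is a hypothetical helper and the extra "St. Louis" rows are made-up illustrations:

```python
import re

def strip_city(name, city):
    """Remove a trailing ' <city>' from name when it matches literally."""
    # re.escape makes characters such as "." or "(" match themselves
    return re.sub(rf"\s+{re.escape(city)}$", "", name)

# (name, city) pairs from the question
rows = [
    ("Danny London", "London"),
    ("Tim Detroit", "Detroit"),
    ("Keith New Orleans", "The city of New Orleans"),
    ("Mary Jane", "London"),
]
cleaned = [strip_city(name, city) for name, city in rows]
```

On the DataFrame this could be applied as df['name'] = df.apply(lambda x: strip_city(x['name'], x['city']), axis=1).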

modification of alter number string pandas

Background
I have the following sample df, which is a variation of Alter number string in pandas column
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Smith Record #: 0000004 is this ',
'Record #: 0000003 Mary Lisa Hider found here',
'Jane A Doe is also here Record #: 0000002',
'Record #: 0000001'],
'P_ID': [1,2,3,4],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df
Text N_ID P_ID
0 Jon J Smith Record #: 0000004 is this A1 1
1 Record #: 0000003 Mary Lisa Hider fou... A2 2
2 Jane A Doe is also here Record #: 000... A3 3
3 Record #: 0000001 A4 4
Goal
1) replace number after Record #: with **BLOCK**
Jon J Smith Record #: 0000004 is this
Jon J Smith Record #: **BLOCK** is this
2) create new column
Desired Output
Text N_ID P_ID New_Text
0 Jon J Smith Record #: **BLOCK** is this
1 Record #: **BLOCK** Mary Lisa Hider fou...
2 Jane A Doe is also here Record #: **BLOCK**
3 Record #: **BLOCK**
Tried
I have tried the following but this is not quite right
df['New_Text']= df['Text'].replace(r'(?i)record\s+#: \d+', r"Date of Birth: **BLOCK**", regex=True)
Question
How do I alter my code to get my desired output?
You are matching a single space after the :, which you could turn into \s+ (or a repeated space, " +", if it can only be spaces). Then use a capturing group for the first part, so it can be kept in the replacement. With the sample data, where the label is "Record #:", the pattern becomes
(?i)(record\s+#:\s+)\d+
In the replacement use
\1**BLOCK**
The final piece of code will look like this
df['New_Text']= df['Text'].replace(r'(?i)(record\s+#:\s+)\d+', r"\1**BLOCK**", regex=True)
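The capture-group idea can be checked with the re module alone. This is a minimal sketch, assuming the label reads "Record #:" as in the sample data; block_record_number is a hypothetical helper.

```python
import re

# Group 1 keeps the label and its spacing; \d+ is the number to hide
RECORD_NUMBER = re.compile(r"(record\s+#:\s+)\d+", re.IGNORECASE)

def block_record_number(text):
    """Replace the digits after 'Record #:' and keep the label itself."""
    return RECORD_NUMBER.sub(r"\1**BLOCK**", text)

texts = [
    "Jon J Smith Record #: 0000004 is this ",
    "Record #: 0000003 Mary Lisa Hider found here",
    "Jane A Doe is also here Record #: 0000002",
    "Record #: 0000001",
]
new_texts = [block_record_number(text) for text in texts]
```

Referring back to the group with \1 is what preserves "Record #: " in the output, whereas replacing the whole match with a fixed string would overwrite it.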

Replace Value & Shift Data Frame If Certain Condition Met

I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.
90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.
Here is an example:
Num Last-Name First-Name Cat. DOB Location
11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA
df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
"2 Pearl, Sam R 1986-11-04 UK",
"5 Livingston, Steph LL 1983-12-12 USA",
"7 Thornton, Mark LR 1982-03-26 USA",
"10 Silver, John RED LL 1983-09-14 USA")
You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.
I started to run code that used stringr to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is 1-2 characters), I created a new column at the end of the data frame, assigned the value to it, and if there was no value (i.e. it evaluates to FALSE), input NA. I'm sure I could likely create a massive nested ifelse statement in a dplyr mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:
Num Last-Name First-Name Cat. DOB Location Color
11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED
I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word or value to a new column at the end of the data frame, and shift the corresponding values in the row to the left to properly align with the others rows of data.
Here's a simpler way:
input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)
The first gsub moves the color to the end of the string. The second gsub relies on the fact that unchanged lines now end with a date and a country code (not a country code and a color), and simply appends "NA" to them.
We could use gsub to remove the extra substrings
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
"\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE,
col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <- trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
"", trimws(df1)))
d1
# Num LastName FirstName Cat DOB Location Color
#1 11 Jackson Adam L 1982-06-15 USA
#2 2 Pearl Sam R 1986-11-04 UK
#3 5 Livingston Steph LL 1983-12-12 USA
#4 7 Thornton Mark LR 1982-03-26 USA
#5 10 Silver John LL 1983-09-14 USA RED
Using strsplit instead of regex:
# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = T)
# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
# if number of items in row is 6, insert NA, else rearrange
if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))
# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")
df2
# Num Last-Name First-Name Cat. DOB Location Color
# 1 11 Jackson Adam L 1982-06-15 USA <NA>
# 2 2 Pearl Sam R 1986-11-04 UK <NA>
# 3 5 Livingston Steph LL 1983-12-12 USA <NA>
# 4 7 Thornton Mark LR 1982-03-26 USA <NA>
# 5 10 Silver John LL 1983-09-14 USA RED
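The same parsing idea can be sketched in Python with one regex that treats the extra word as an optional named group. This is a sketch under assumptions: the Cat. value is always 1-2 word characters, the extra word is 3 or more, and each row has exactly the layout shown in the question; parse_row and the group names are hypothetical.

```python
import re

ROW = re.compile(
    r"\s*(?P<Num>\d+)\s+(?P<LastName>[^,]+),\s+(?P<FirstName>\w+)\s+"
    r"(?:(?P<Color>\w{3,})\s+)?"        # optional extra word, 3+ characters
    r"(?P<Cat>\w{1,2})\s+"              # category, 1-2 characters
    r"(?P<DOB>\d{4}-\d{2}-\d{2})\s+(?P<Location>\S+)\s*"
)

def parse_row(line):
    """Return a dict of fields; Color is None when the extra word is absent."""
    match = ROW.fullmatch(line)
    return match.groupdict() if match else None

lines = [
    " 11 Jackson, Adam L 1982-06-15 USA",
    "2 Pearl, Sam R 1986-11-04 UK",
    "5 Livingston, Steph LL 1983-12-12 USA",
    "7 Thornton, Mark LR 1982-03-26 USA",
    "10 Silver, John RED LL 1983-09-14 USA",
]
parsed = [parse_row(line) for line in lines]
```

Each parsed row always has the same seven keys, so the extra word no longer shifts the other columns.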