modification of alter number string pandas - regex

Background
I have the following sample df, which is a modification of the question "Alter number string in pandas column"
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Smith Record #: 0000004 is this ',
'Record #: 0000003 Mary Lisa Hider found here',
'Jane A Doe is also here Record #: 0000002',
'Record #: 0000001'],
'P_ID': [1,2,3,4],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df
Text N_ID P_ID
0 Jon J Smith Record #: 0000004 is this A1 1
1 Record #: 0000003 Mary Lisa Hider fou... A2 2
2 Jane A Doe is also here Record #: 000... A3 3
3 Record #: 0000001 A4 4
Goal
1) replace number after Record #: with **BLOCK**
Jon J Smith Record #: 0000004 is this
Jon J Smith Record #: **BLOCK** is this
2) create new column
Desired Output
Text N_ID P_ID New_Text
0 Jon J Smith Record #: **BLOCK** is this
1 Record #: **BLOCK** Mary Lisa Hider fou...
2 Jane A Doe is also here Record #: **BLOCK**
3 Record #: **BLOCK**
Tried
I have tried the following but this is not quite right
df['New_Text']= df['Text'].replace(r'(?i)record\s+#: \d+', r"Date of Birth: **BLOCK**", regex=True)
Question
How do I alter my code to get my desired output?

Your pattern matches exactly one space after the colon. You could turn that into \s+ (or into a repeated space, " +", if only spaces can occur there) and wrap the part you want to keep in a capturing group:
(?i)(record\s+#:\s+)\d+
In the replacement, refer back to that group:
\1**BLOCK**
The final piece of code will look like this:
df['New_Text'] = df['Text'].replace(r'(?i)(record\s+#:\s+)\d+', r"\1**BLOCK**", regex=True)
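To see what the capturing group does, here is a minimal sketch that uses only the standard re module on the sample strings from the question. It assumes the label in the data is "Record #:" as in the sample frame; the names RECORD_RE and block_record_number are made up for this illustration.

```python
import re

# Group 1 captures the "Record #:" label plus the whitespace after it;
# the digits that follow are matched but not captured.
RECORD_RE = re.compile(r'(record\s+#:\s+)\d+', flags=re.IGNORECASE)

def block_record_number(text):
    """Replace the digits after 'Record #:' while keeping the label."""
    # \g<1> re-inserts the captured label (same meaning as \1 here).
    return RECORD_RE.sub(r'\g<1>**BLOCK**', text)

texts = [
    'Jon J Smith Record #: 0000004 is this ',
    'Record #: 0000003 Mary Lisa Hider found here',
    'Jane A Doe is also here Record #: 0000002',
    'Record #: 0000001',
]
new_texts = [block_record_number(t) for t in texts]
for line in new_texts:
    print(line)
```

With a DataFrame, the same function could be applied with df['Text'].map(block_record_number) to build the new column.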

Related

Pandas - Strip col1 values from column2 values if exists match (regex match with dynamic value)

I have a task to update values in column_1 IF it has a full match to value from column_2.
Like so
name               city
Danny London       London
Tim Detroit        Detroit
Keith New Orleans  The city of New Orleans
Mary Jane          London
=>
name               city
Danny              London                   <- updated
Tim                Detroit                  <- updated
Keith New Orleans  The city of New Orleans
Mary Jane          London
So far I've tried this
condlist = [df.apply(lambda x: x.name_cleaned.endswith(f"{x.city}"), axis=1)]
choicelist = [df.name_cleaned.str.replace(rf'{df.city}$', '', regex=True)]
df['name_cleaned'] = np.select(condlist, choicelist, default=df.name_cleaned)
But it returns the same df. I've checked and condlist works as expected - returns True/False for values, the problem is in choicelist - not sure how to pass regex with dynamic value. Would really appreciate any help.
Instead of testing with endswith, you can anchor the pattern with $ (end of string), add \s+ to match the whitespace before the city, and replace the match with an empty string using re.sub:
import re
df['name'] = df.apply(lambda x: re.sub(rf"\s+{x.city}$",'',x['name']), axis=1)
print (df)
                name                     city
0              Danny                   London
1                Tim                  Detroit
2  Keith New Orleans  The city of New Orleans
3          Mary Jane                   London
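One caveat with building a pattern from column values: a city that contains regex metacharacters (a dot, parentheses, and so on) would be interpreted as a pattern rather than as literal text. The sketch below is a hedged variant that wraps the city in re.escape. The helper name strip_city and the last sample row ("Washington (DC)") are assumptions added for illustration and are not part of the question's data.

```python
import re

def strip_city(name, city):
    """Remove a trailing ' <city>' from name if name ends with it."""
    # re.escape makes every character of the city match literally.
    return re.sub(rf'\s+{re.escape(city)}$', '', name)

rows = [
    ('Danny London', 'London'),
    ('Tim Detroit', 'Detroit'),
    ('Keith New Orleans', 'The city of New Orleans'),
    ('Mary Jane', 'London'),
    # Hypothetical row: the parentheses would break an unescaped pattern.
    ('Bob Washington (DC)', 'Washington (DC)'),
]
cleaned = [strip_city(name, city) for name, city in rows]
for value in cleaned:
    print(value)
```

In the DataFrame version this would correspond to df.apply(lambda x: strip_city(x['name'], x['city']), axis=1).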

modification of alter text in pandas column based on names

Background
I have the following df which is a modification of Alter text in pandas column based on names
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Doe works ',
'So is Mary Doe, works too',
'Jane Ann, Doe doesnt',
'Jone, Dow doesnt either'],
'P_ID': [1,2,3,4],
'P_Name' : ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone' ]
})
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
The following block of code blocks names like Jon J Doe, but it doesn't work when a name such as Jane Ann Doe has a character between its parts, e.g. Jane Ann, Doe or Jone! Dow
df['NewText'] = df['Text'].replace(df['P_Name'].str.split(', *').apply(lambda l: ' '.join(l[::-1])),'**BLOCK**',regex=True)
Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt Jane Ann, Doe doesnt
3 4 Dow, Jone Jone,Dow doesnt either Jone, Dow doesnt either
Goal
1) Tweak the code above to account for , (or any other characters that may appear between the names)
(I know I can strip commas, but I need to leave them in)
Desired Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt **BLOCK** doesnt
3 4 Dow, Jone Jone,Dow doesnt either **BLOCK** doesnt either
Question
How do I tweak my code to get my desired output?
I don't know how many such cases there are, but if you only have a limited number of them, you can list them explicitly.
Sample DataSet:
>>> df
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
You can create a dict of the name variants and apply it to the DataFrame to get the result.
>>> replace_values = {'Jon J Doe': '**BLOCK**', 'Mary Doe': '**BLOCK**', 'Jane Ann, Doe': '**BLOCK**', 'Jone, Dow': '**BLOCK**'}
Resulted dataFrame:
>>> df = df.replace(replace_values, regex=True)
>>> df
P_ID P_Name Text
0 1 Doe, Jon J **BLOCK** works
1 2 Doe, Mary So is **BLOCK**, works too
2 3 Doe, Jane Ann **BLOCK** doesnt
3 4 Dow, Jone **BLOCK** doesnt either
Try this:
df['NewText'] = df['Text'].replace(r'(' + df['P_Name'].str.split(r'\W+').str.join('|') + r'|\W+){3,}', ' **BLOCK** ', regex=True)
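As an alternative that does not require listing every variant by hand, a per-row pattern can be built from P_Name. The following is only a sketch under the assumption that P_Name always has the form "Last, First [Middle]" and that the text shows the name as first names followed by the last name. The helpers name_pattern and block_name are invented for this example.

```python
import re

def name_pattern(p_name):
    """Build a regex for 'First [Middle] Last' from 'Last, First [Middle]'."""
    last, first = [part.strip() for part in p_name.split(',', 1)]
    tokens = first.split() + [last]
    # \W+ between tokens allows a space, a comma, "!" or any other
    # run of non-word characters to sit between the name parts.
    return r'\b' + r'\W+'.join(re.escape(tok) for tok in tokens) + r'\b'

def block_name(text, p_name):
    """Replace the person's name in text with **BLOCK**."""
    return re.sub(name_pattern(p_name), '**BLOCK**', text)

rows = [
    ('Jon J Doe works ', 'Doe, Jon J'),
    ('So is Mary Doe, works too', 'Doe, Mary'),
    ('Jane Ann, Doe doesnt', 'Doe, Jane Ann'),
    ('Jone, Dow doesnt either', 'Dow, Jone'),
]
new_text = [block_name(text, p_name) for text, p_name in rows]
for line in new_text:
    print(line)
```

On a DataFrame this could be applied row by row, e.g. df.apply(lambda r: block_name(r['Text'], r['P_Name']), axis=1). Punctuation after the full name (such as the comma in "Mary Doe, works") is left in place because the pattern ends at the last name.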

blocking seven digit numbers in string pandas

Background
I have the following sample df
import pandas as pd
df = pd.DataFrame({'Text':['This person num is 111-888-8888 and other',
'dont block 23 here',
'two numbers: 001-002-1234 and some other 123-123-1234 here',
'block this 666-666-6666',
'1-510-999-9999 is one more'],
'P_ID': [1,2,3,4,5],
'N_ID' : ['A1', 'A2', 'A3','A4', 'A5']})
N_ID P_ID Text
0 A1 1 This person num is 111-888-8888 and other
1 A2 2 dont block 23 here
2 A3 3 two numbers: 001-002-1234 and some other 123-1...
3 A4 4 block this 666-666-6666
4 A5 5 1-510-999-9999 is one more
Goal
1) Block all seven digit numbers e.g. 111-888-8888 becomes **Block**
2) Avoid blocking non-seven digit numbers e.g. 23
3) Create new column
Tried
I have tried the following
df['New_Text'] = df['Text'].str.replace(r'\d+', '**Block**', regex=True)
But it blocks all numbers
Also Tried
I have also tried replacing \d+ with several other patterns, e.g. /^\d{7}$/ (taken from "Regexp exactly seven digits"), ^[0-9]{7} (taken from "Regex to match "<seven digits> - <filename>" with only one set of seven digits") and \b[0-9]{7}(?![0-9]) (taken from "REGEX To get seven numbers in a row?"), but none of them work.
Desired Output
N_ID P_ID Text New_Text
0 This person num is **Block** and other
1 dont block 23 here
2 two numbers: **Block** and some other **Block**
3 block this **Block**
4 1-**Block** is one more
Question
How do I tweak my code to achieve my desired output?
You can try this regular expression, which matches a run of seven or more digits, each optionally followed by a hyphen:
((?:[\d]-?){7,})
The final block of code is this:
df['New_Text'] = df['Text'].str.replace(r'((?:[\d]-?){7,})', '**Block**', regex=True)
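The behaviour of this pattern can be checked with the standard re module on the sample strings. This is a small sketch; the names PHONE_RE and block_numbers are made up for illustration. Note that in recent pandas versions Series.str.replace treats the pattern as a literal string unless regex=True is passed.

```python
import re

# Seven or more digits in a row, each optionally followed by a hyphen.
PHONE_RE = re.compile(r'(?:\d-?){7,}')

def block_numbers(text):
    """Replace long digit/hyphen runs with **Block**."""
    return PHONE_RE.sub('**Block**', text)

texts = [
    'This person num is 111-888-8888 and other',
    'dont block 23 here',
    'two numbers: 001-002-1234 and some other 123-123-1234 here',
    'block this 666-666-6666',
    '1-510-999-9999 is one more',
]
blocked = [block_numbers(t) for t in texts]
for line in blocked:
    print(line)
```

Short numbers such as 23 are left alone because they contain fewer than seven digits. In the last string the leading "1-" is part of the same digit run, so the whole number is replaced rather than only the part after "1-".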

Replace Value & Shift Data Frame If Certain Condition Met

I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.
90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.
Here is an example:
Num Last-Name First-Name Cat. DOB Location
11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA
df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
"2 Pearl, Sam R 1986-11-04 UK",
"5 Livingston, Steph LL 1983-12-12 USA",
"7 Thornton, Mark LR 1982-03-26 USA",
"10 Silver, John RED LL 1983-09-14 USA")
You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.
I started to run code that used stringr to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is 1-2 characters), I created a new column at the end of the data frame, assigned the value to it, and if there was no value (i.e. it evaluates to FALSE), input NA. I'm sure I could likely create a massive nested ifelse statement in a dplyr mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:
Num Last-Name First-Name Cat. DOB Location Color
11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED
I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word or value to a new column at the end of the data frame, and shift the corresponding values in the row to the left to properly align with the other rows of data.
Here's a simpler way:
input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)
The first gsub moves the color to the end of the string. The second gsub makes use of the fact that unchanged lines now end with a date and a country code (not a country code and a color), and simply appends "NA" to them.
We could use gsub to remove the extra substrings
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
"\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE,
col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <- trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
"", trimws(df1)))
d1
# Num LastName FirstName Cat DOB Location Color
#1 11 Jackson Adam L 1982-06-15 USA
#2 2 Pearl Sam R 1986-11-04 UK
#3 5 Livingston Steph LL 1983-12-12 USA
#4 7 Thornton Mark LR 1982-03-26 USA
#5 10 Silver John LL 1983-09-14 USA RED
Using strsplit instead of regex:
# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = T)
# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
# if number of items in row is 6, insert NA, else rearrange
if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))
# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")
df2
# Num Last-Name First-Name Cat. DOB Location Color
# 1 11 Jackson Adam L 1982-06-15 USA <NA>
# 2 2 Pearl Sam R 1986-11-04 UK <NA>
# 3 5 Livingston Steph LL 1983-12-12 USA <NA>
# 4 7 Thornton Mark LR 1982-03-26 USA <NA>
# 5 10 Silver John LL 1983-09-14 USA RED

How to replace specific characters of a string with tab in R

I have a data frame with a string in each row, and I need to replace the n'th character with a tab. Moreover, there is a variable number of spaces before the m'th character that I need to convert to a tab as well.
For instance having following row:
"00001 000 0 John Smith"
I need to replace the 6th character (a space) with a tab and replace the spaces between John and Smith with a tab as well. In all rows the last word (Smith) starts at the 75th character. So, basically I need to replace all spaces before the 78th character with a tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.
You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=T)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the \t in output use cat(x)
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith
Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Since you know the positions, I'll use substr to extract the parts, then trim the middle part, and finally paste the parts together with tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")