Background
I have the following sample df
import pandas as pd
df = pd.DataFrame({'Text':['This person num is 111-888-8888 and other',
'dont block 23 here',
'two numbers: 001-002-1234 and some other 123-123-1234 here',
'block this 666-666-6666',
'1-510-999-9999 is one more'],
'P_ID': [1,2,3,4,5],
'N_ID' : ['A1', 'A2', 'A3','A4', 'A5']})
N_ID P_ID Text
0 A1 1 This person num is 111-888-8888 and other
1 A2 2 dont block 23 here
2 A3 3 two numbers: 001-002-1234 and some other 123-1...
3 A4 4 block this 666-666-6666
4 A5 5 1-510-999-9999 is one more
Goal
1) Block all seven digit numbers e.g. 111-888-8888 becomes **Block**
2) Avoid blocking non-seven digit numbers e.g. 23
3) Create new column
Tried
I have tried the following
df['New_Text'] = df['Text'].str.replace(r'\d+','**Block**')
But it blocks all numbers
Also Tried
I have also tried replacing \d+ with several other patterns, e.g. /^\d{7}$/ (taken from "Regexp exactly seven digits"), ^[0-9]{7} (taken from "Regex to match "<seven digits> - <filename>" with only one set of seven digits") and \b[0-9]{7}(?![0-9]) (taken from "REGEX To get seven numbers in a row?"), but none of them work.
Desired Output
N_ID P_ID Text New_Text
0 This person num is **Block** and other
1 dont block 23 here
2 two numbers: **Block** and some other **Block**
3 block this **Block**
4 1-**Block** is one more
Question
How do I tweak my code to achieve my desired output?
You can try this regular expression, which matches a run of seven or more digits where each digit may be followed by a hyphen: ((?:[\d]-?){7,})
Regex Demo
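The final code is this (regex=True is needed in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df['New_Text'] = df['Text'].str.replace(r'((?:[\d]-?){7,})', '**Block**', regex=True)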
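To see what this pattern does to each sample row, here is a minimal sketch that applies it with Python's standard re module rather than pandas (an assumption made only so the snippet runs without dependencies; the names PHONE_LIKE, texts and blocked are illustrative). Note that on the last row the leading 1- is consumed as well, which differs from the 1-**Block** shown in the desired output.

```python
import re

# Same pattern as in the answer, without the outer capturing group:
# a digit optionally followed by a hyphen, repeated seven or more times.
PHONE_LIKE = re.compile(r'(?:\d-?){7,}')

# Sample rows copied from the question.
texts = [
    'This person num is 111-888-8888 and other',
    'dont block 23 here',
    'two numbers: 001-002-1234 and some other 123-123-1234 here',
    'block this 666-666-6666',
    '1-510-999-9999 is one more',
]

# Replace every match with the placeholder, row by row.
blocked = [PHONE_LIKE.sub('**Block**', t) for t in texts]

for line in blocked:
    print(line)
```

Short numbers such as 23 are left alone because they never reach seven digits.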
Related
import pandas as pd
df= pd.DataFrame({'Data':['123456A122 119999 This 1234522261 1A1619 BL171111 A-1-24',
'134456 dont 12-23-34-45-5-6 Z112 NOT 01-22-2001',
'mix: 1A25629Q88 or A13B ok'],
'IDs': ['A11','B22','C33'],
})
I have the df shown above. I am using the following to get only consecutive digits
reg = r'((?:[\d]-?){6,})'
df['new'] = df['Data'].str.findall(reg)
Data IDs new
0 [123456,119999, 1234522261, 171111]
1 [134456, 12-23-34-45-5-6, 01-22-2001]
2 []
This picks up many things I don't want, like 171111 from BL171111 and 123456 from 123456A122.
I would like the following output, which only picks up standalone runs of exactly 6 consecutive digits
Data IDs new
0 [119999]
1 [134456]
2 []
How do I change my regex to do so?
reg = r'((?:[\d]-?){6,})'
Change your regex to use word boundaries (\b) and limit the number of digits to exactly 6, like this:
reg = r'(\b\d{6}\b)'
This looks for a word boundary, exactly 6 digits, and another word boundary.
Here's a demo.
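As a quick check of the word-boundary pattern against the sample rows, here is a minimal sketch using re.findall from the standard library instead of pandas str.findall (an assumption made to keep the snippet dependency-free; the variable names are illustrative).

```python
import re

# Exactly six digits, with a word boundary on each side.
SIX_DIGITS = re.compile(r'\b\d{6}\b')

# Sample rows copied from the question.
data = [
    '123456A122 119999 This 1234522261 1A1619 BL171111 A-1-24',
    '134456 dont 12-23-34-45-5-6 Z112 NOT 01-22-2001',
    'mix: 1A25629Q88 or A13B ok',
]

# One list of matches per row, like the 'new' column in the question.
found = [SIX_DIGITS.findall(row) for row in data]
print(found)
```

Digits glued to letters (BL171111, 123456A122) are skipped because letters and digits are both word characters, so there is no boundary between them.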
Background
I have the following sample df, which is a variation on the one in "Alter number string in pandas column"
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Smith Record #: 0000004 is this ',
'Record #: 0000003 Mary Lisa Hider found here',
'Jane A Doe is also here Record #: 0000002',
'Record #: 0000001'],
'P_ID': [1,2,3,4],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df
Text N_ID P_ID
0 Jon J Smith Record #: 0000004 is this A1 1
1 Record #: 0000003 Mary Lisa Hider fou... A2 2
2 Jane A Doe is also here Record #: 000... A3 3
3 Record #: 0000001 A4 4
Goal
1) replace number after Record #: with **BLOCK**
Jon J Smith Record #: 0000004 is this
Jon J Smith Record #: **BLOCK** is this
2) create new column
Desired Output
Text N_ID P_ID New_Text
0 Jon J Smith Record #: **BLOCK** is this
1 Record #: **BLOCK** Mary Lisa Hider fou...
2 Jane A Doe is also here Record #: **BLOCK**
3 Record #: **BLOCK**
Tried
I have tried the following but this is not quite right
df['New_Text']= df['Text'].replace(r'(?i)record\s+#: \d+', r"Date of Birth: **BLOCK**", regex=True)
Question
How do I alter my code to get my desired output?
You are matching a single space after the : which you could turn into \s+ (or repeat a space with + if it can only be spaces), and you can use a capturing group for the first part so that it is kept in the replacement.
(?i)(record\s+#:\s+)\d+
Regex demo
In the replacement use
\1**BLOCK**
The final piece of code will look like this
df['New_Text']= df['Text'].replace(r'(?i)(record\s+#:\s+)\d+', r"\1**BLOCK**", regex=True)
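The capture-and-backreference idea can be tried outside pandas. The following is a minimal sketch with re.sub from the standard library, assuming the sample rows from the question, which contain Record #: without a preceding "medical" (so the group here starts at "record"). The names RECORD_NUMBER, texts and new_texts are illustrative.

```python
import re

# Group 1 captures the label and the whitespace after the colon;
# the digits that follow are matched but not captured.
RECORD_NUMBER = re.compile(r'(?i)(record\s+#:\s+)\d+')

# Sample rows copied from the question.
texts = [
    'Jon J Smith Record #: 0000004 is this ',
    'Record #: 0000003 Mary Lisa Hider found here',
    'Jane A Doe is also here Record #: 0000002',
    'Record #: 0000001',
]

# \1 re-inserts the captured label, so only the number is replaced.
new_texts = [RECORD_NUMBER.sub(r'\1**BLOCK**', t) for t in texts]

for line in new_texts:
    print(line)
```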
Working with a text dataset, I have an extraction that gives me irregular results in a dataframe. I am not very good with regular expressions and have never written a filter using one, so help would be appreciated.
I am trying to filter column a to keep rows 4 and 6. The pattern is 4 digits, a letter, a space, /, a space, 2 digits, a space, /, a space, 5 digits, a space, /, then whatever follows.
The dataframe looks like this:
a b c d
0 1234B:Program Name / Title Chapter Page Number ID Code
1 1234B:Program Name / Title Chapter Page Number ID Code
2 1234B:Program Name / Title Chapter Page Number ID Code
3 1234B / 01 / 2 (blank) (blank) ID Code
4 1234B / 01 / 23456 / Title Chapter Page Number ID Code <---- Filter for this
5 1234B / 01 / 2 (blank) (blank) ID Code
6 1234B / 01 / 23456 / Title Chapter Page Number ID Code <---- Filter for this
I've tried the following code:
# Filter by pattern
import pandas as pd
import numpy as np
import re
pattern = re.compile("[0-9][0-9][0-9][0-9][B][\s][/][\s][0-9][0-9][\s][/][\s][0-9][0-9][0-9][0-9][0-9][\s]+[/]")
df = df[df['a'].apply(pattern)]
Result is a TypeError: '_sre.SRE_Pattern' object is not callable. It looks like I'm applying it wrong. Also my regular expression does not have a wildcard to account for the rest of the data in column a. What is a pythonic way to filter column A to look at the first 20 characters in column A and do a pattern match on it?
You can use the following, based on your rules given:
df = df[df['a'].str.match(r'\d{4}[a-zA-Z]\s\/\s\d{2}\s\/\s\d{5}\s\/.*')]
this gives:
a b c d
4 1234B / 01 / 23456 / Title Chapter Page Number ID Code
6 1234B / 01 / 23456 / Title Chapter Page Number ID Code
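The TypeError in the question arises because apply expects a callable, and a compiled pattern is not one; its match method is. Below is a minimal sketch of the same filter on a plain list standing in for column a (an assumption, so the snippet runs without pandas; PATTERN, column_a and kept_rows are illustrative names).

```python
import re

# 4 digits, a letter, then ' / ', 2 digits, ' / ', 5 digits, ' /'.
# match() anchors at the start of the string, so no trailing .* is needed.
PATTERN = re.compile(r'\d{4}[A-Za-z]\s/\s\d{2}\s/\s\d{5}\s/')

# Values of column a, copied from the question (rows 0 to 6).
column_a = [
    '1234B:Program Name / Title',
    '1234B:Program Name / Title',
    '1234B:Program Name / Title',
    '1234B / 01 / 2',
    '1234B / 01 / 23456 / Title',
    '1234B / 01 / 2',
    '1234B / 01 / 23456 / Title',
]

# Boolean mask: True where the value starts with the pattern.
keep = [bool(PATTERN.match(value)) for value in column_a]
kept_rows = [i for i, flag in enumerate(keep) if flag]
print(kept_rows)

# With a DataFrame, an equivalent mask could be built with
# df['a'].apply(lambda value: bool(PATTERN.match(value))).
```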
I have a data frame with a string in each row, and I need to replace the n'th character with a tab. In addition, there is a varying number of spaces before the m'th character that I need to convert to a tab as well.
For instance having following row:
"00001 000 0 John Smith"
I need to replace the 6th character (a space) with a tab and replace the spaces between John and Smith with a tab as well. For all the rows the last word (Smith) starts from 75th character. So, basically I need to replace all spaces before 78th character with a tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.
You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=TRUE)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the \t in output use cat(x)
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith
Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Since you know the positions, I'll use substr to extract the parts, then trim the trailing spaces from the middle part, and then paste the parts back together with tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")
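For readers following along in Python, the language of the earlier examples, the same fixed-position approach can be sketched with string slicing. This is only an illustration under stated assumptions: the helper to_tabbed and its name_start parameter are hypothetical, and the sample row is padded so that the last word starts at the 75th character, as described in the question.

```python
def to_tabbed(row: str, name_start: int = 75) -> str:
    """Split a fixed-width row at known positions and join with tabs.

    Positions are 1-based, as in the R code: characters 1-5 form the
    first field, character 6 is the separator, and the last field
    starts at name_start.
    """
    part1 = row[:5]                         # characters 1 to 5
    part2 = row[6:name_start - 1].rstrip()  # characters 7 to 74, trailing spaces removed
    part3 = row[name_start - 1:]            # character 75 onwards
    return '\t'.join([part1, part2, part3])


# Hypothetical sample row, padded so that 'Smith' starts at character 75.
row = '00001 000 0 John'.ljust(74) + 'Smith'
result = to_tabbed(row)
print(repr(result))
```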
I am new to R, so please guide me with this.
Below shown is a simple table called Order.
Col1 Col2 Col3
hey hi july 12,2013
hey hi june 12,2013
hey hi April 12,2013
hey hi April 14,2012
I want to write a query that gives the result below in a new table, i.e. I need to use a regular expression to match part of the string in Col3 and then count the matches.
july june April
1 1 2
Please help me if anyone knows how to do it.
You can use sub to extract the months' names and table to count the frequencies:
dat <- read.table(text = "Col1 Col2 Col3
hey hi 'july 12,2013'
hey hi 'june 12,2013'
hey hi 'April 12,2013'
hey hi 'April 14,2012'", header = TRUE)
table(sub("^(\\w+) .*", "\\1", dat$Col3))
# April july june
# 2 1 1
How does sub("^(\\w+) .*", "\\1", dat$Col3) work?
The function sub performs replacements in strings. The first quoted string is a regular expression. ^ is the beginning of the string, \\w is a word character, and + means one or more. The space after the closing parenthesis is a literal space, and .* means any number of any character. The parentheses are used to create a group. The first (and only) group (\\w+) matches word characters at the beginning of the string. The second argument in the sub function, "\\1", is used to replace the whole string with the substring matched by the first group. In short: the whole string is replaced by its first word.
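The same extract-then-count idea can be sketched in Python, the language of the earlier examples, using re.sub for the replacement and collections.Counter in place of R's table. The variable names are illustrative, and the dates are the four values of Col3 from the question.

```python
import re
from collections import Counter

# Values of Col3, copied from the question.
dates = ['july 12,2013', 'june 12,2013', 'April 12,2013', 'April 14,2012']

# Replace each whole string with its first word (group 1),
# mirroring sub("^(\\w+) .*", "\\1", dat$Col3) in R.
months = [re.sub(r'^(\w+) .*', r'\1', d) for d in dates]

# Count how often each month name occurs.
counts = Counter(months)
print(counts)
```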
Data:
data <- read.table(text = "Col1 Col2 Col3
hey hi 'july 12,2013'
hey hi 'june 12,2013'
hey hi 'April 12,2013'
hey hi 'April 14,2012'", header = TRUE)
An answer using dates:
#transform data to POSIXlt
data$Col3 <- as.POSIXlt(data$Col3, format="%B %d, %Y")
## group using table with POSIXlt numbers (0 is january)
table(data$Col3$mon)
3 5 6
2 1 1
## group using table with normal month numbers
## (month() is not base R; this assumes a package that provides it, e.g. lubridate, is loaded)
table(month(data$Col3))
4 6 7
2 1 1
## group using aggregate with POSIXlt numbers (0 is january)
aggregate(data$Col1, by=list(data[,"Col3"]$mon), length)
#result
Group.1 x
1 3 2
2 5 1
3 6 1
## group using aggregate with normal month numbers
aggregate(data$Col1, by=list(month(data$Col3)), length)
#result
Group.1 x
1 4 2
2 6 1
3 7 1
PS: when you get data$Col3$mon from a POSIXlt object, January is 0, so April is 3 and not 4 as you might expect. To get "normal" month numbers you should use month(data$Col3); I only realised that after reading Ananda's comment.
If you want a prettier version (by Ananda Mahto):
Col3 <- as.POSIXlt(data$Col3, format="%B %d, %Y"); table(month.name[month(Col3)])
April July June
2 1 1