Matching a word after another word in R regex

I have a dataframe in R with one column (called 'city') containing a text string. My goal is to extract only one word, i.e. the city name, from the text string. The city name always follows the word 'in', e.g. the text might be:
'in London'
'in Manchester'
I tried to create a new column ('municipality'):
df$municipality <- gsub(".*in ?([A-Z]).*$","\\1",df$city)
This gives me the first letter following 'in', but I need the next word (ONLY the next word).
I then tried:
gsub(".*in ?([A-Z]\w+))")
which worked on a regex checker, but not in R. Can someone please help me? I know this is probably very simple, but I can't crack it. Thanks in advance.

We can use str_extract
library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')
#[1] "London" "Manchester"

The following regular expression will match the second word from your city column:
^in\\s([^ ]*).*$
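This matches the word 'in' at the start of the string, followed by a single whitespace character, followed by a capture group of any non-space characters, which comprises the city name. Everything after that is matched by .*$ and discarded by the replacement.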
Example:
df <- data.frame(city=c("in London town", "in Manchester city"))
df$municipality <- gsub("^in\\s([^ ]*).*$", "\\1", df$city)
> df$municipality
[1] "London" "Manchester"

How to combine independent regular expressions and apply them on all rows of a dataset using Pandas?

Problem Statement:
I have two separate regular expressions that I am trying to "combine" into one and apply to each row in a dataset. The matching part of each row should go to a new Pandas dataframe column called "Wanted". Please see the example data below for how values that match should be formatted in the "Wanted" column.
Example Data (how I want it to look):
Column0 | Wanted (how I want "Column0" to look)
Alice\t12-345-623/ 10-1234 | Alice, 12-345-623, 10-1234
Bob 201-888-697 / 12-0556a | Bob, 201-888-697, 12-0556a
Tim 073-110-101 / 13-1290 | Tim, 073-110-101, 13-1290
Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c | Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
In other words...:
2-3 digits ----- hyphen ---- 3 digits --- hyphen ---- 3 digits ---- any character ----
2 digits --- hyphen --- 4 digits ---- permit one single character
What I have tried #1:
After dinking around for a while I figured out two different regular expressions that on their own will solve part of the problem. Kinda.
This matches the first group of numbers in each row, but it doesn't get the second group, which I also want. I'm not sure how robust it is, though.
Example Problem Row (regex = r"(?:\d{1,3}-){0,3}\d{1,3}")
import re
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{1,3}\-){0,3}\d{1,3}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted: Alice, 12-345-623, 10-1234
Got: 12-345-623 # matches the group of numbers but isn't formatted how I would like (see example data)
What I have tried #2:
This matches the second part of each row, but only if it's the only value in the column. Otherwise it matches on the first group of digits instead of the second.
Example Problem Row (regex = r"(?:\d{2,3}-){1}\d{3,4}") # different regex than above!
import re
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{2,3}\-){1}\d{3,4}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted : Alice, 12-345-623, 10-1234
Got: 12-345 # matched on the first part
Known Problems:
When I try, "Alice\t12-345-623/ 10-1234", it will match "12-345" when I'm trying to match "10-1234"
Thank you!
Thanks in advance to all you wizards being willing to help me with this problem. I really appreciate it:)
Note: I have asked regarding regex that may make solving this problem easier. It might not, but here is the link anyways --> How to use regex to select a row and a fixed number of rows following a row containing a specific substring in a pandas dataframe
So this works for the four test examples you gave. How's this using the .split() method? Technically this returns a list of values and not a string.
import re
# text here
text = "Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"
# split this out to a list. remove the ending parenthesis since you are *splitting* on this
new_splits = re.split(r'\t|/|and|\(| ', text.replace(')',''))
# filter out the blank spaces
list(filter(None,new_splits))
['Joe', '74-111-333', '33-1290', 'Amy', '12-345-623', '10-1234c']
and if you are using pandas you can try the same steps above:
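# drop the closing parenthesis first, as in the plain-Python version above
df['answer_Step1'] = df['Column0'].str.replace(')', '', regex=False).str.split(r'\\t+|/|and|\(| ')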
df['answer_final'] = df['answer_Step1'].apply(lambda x: list(filter(None,x)))
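The question's "Known Problems" note (the second pattern returning "12-345" instead of "10-1234") comes from re.search returning the leftmost match. A minimal sketch of one way around it, assuming the second group is always two digits, a hyphen, four digits and an optional lowercase letter; the name strict and its lookarounds are illustrative, not part of the original attempts:

```python
import re

search_in = "Alice\t12-345-623/ 10-1234"

# The asker's second pattern, unchanged: it can match at the start of the first number group.
loose = r"(?:\d{2,3}-){1}\d{3,4}"

# Hypothetical tightened variant: exactly 2 digits, a hyphen, 4 digits, optional letter.
# The lookarounds forbid a digit or hyphen right before the match and a word
# character or hyphen right after it, so pieces of "12-345-623" cannot match.
strict = r"(?<![\d-])\d{2}-\d{4}[a-z]?(?![\w-])"

loose_match = re.search(loose, search_in).group(0)
strict_match = re.search(strict, search_in).group(0)

print(loose_match)   # the leftmost match, taken from the first group
print(strict_match)  # the second group
```

Whether the stricter pattern is robust enough depends on what else appears in the real column; it has only been reasoned through against the four example rows.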
You can use
re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text)
Pandas version:
df['Wanted'] = df['Column0'].str.replace(r'\s*\band\b\s*|[^\w-]+', ', ', regex=True)
Details:
\s*\band\b\s* - the whole word and (\b are word boundaries), enclosed with zero or more optional whitespace chars
| - or
[^\w-]+ - one or more chars other than letters, digits, _ and -
See a Python demo:
import re
texts = ['Alice 12-345-623/ 10-1234',
'Bob 201-888-697 / 12-0556a','Tim 073-110-101 / 13-1290',
'Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c']
for text in texts:
    print(re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text))
# => Alice, 12-345-623, 10-1234
# Bob, 201-888-697, 12-0556a
# Tim, 073-110-101, 13-1290
# Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
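If the goal is literally to combine the two patterns from the question rather than to replace separators, one possible sketch is a single pattern with named groups. The group names (name, first, second), the helper to_wanted, and the assumption that each record is a run of letters followed by the two number groups are all illustrative, not taken from the original posts:

```python
import re

# Hypothetical combined pattern: a name, the three-part number, then the two-part code.
record = re.compile(
    r"(?P<name>[A-Za-z]+)"              # a name made of letters (assumption)
    r"\W*"                              # separators such as a tab, a space or "("
    r"(?P<first>\d{2,3}-\d{3}-\d{3})"   # 2-3 digits, 3 digits, 3 digits
    r"\W*"                              # separators such as ")", "/" or " / "
    r"(?P<second>\d{2}-\d{4}[a-z]?)"    # 2 digits, 4 digits, optional single letter
)

def to_wanted(text):
    """Join every (name, first, second) record found in text with ', '."""
    parts = []
    for m in record.finditer(text):
        parts.extend([m["name"], m["first"], m["second"]])
    return ", ".join(parts)

print(to_wanted("Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"))
```

With pandas, the same helper could be applied row by row, e.g. df['Wanted'] = df['Column0'].apply(to_wanted). Unlike the separator-replacement approach, rows that do not contain a full record produce an empty string here, which may or may not be what you want.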

Extracting clock time from string

I have a dataframe that consists of web-scraped data. One of the fields scraped was a time in clock time, but the scraping process wasn't perfect. Most of the 'good' data look something like '4:33, or '103:20 (so a leading single quote, and two fields, minutes and seconds). Also, there is some bad data, the most common one being '],, but also some containing text. I'd like a new string that is something like 4:33, and for bad data, just blank.
So my plan of attack is to match my good data form, and then replace everything else with a blank. Something like time <- gsub('[0-9]+:[0-9]+', '', time). I know this would replace my pattern with a blank, and I want the opposite, but I'm unsure how to negate this whole pattern. A simple caret doesn't seem to work, nor does applying it to a group. I tried something like gsub("(.)+([0-9]+)(:)([0-9]+)", "\\2\\3\\4", time) but that isn't working either.
Sample:
dput(sample)
c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33",
"'43:33")
Expected output:
c("", "", "8:18", "13:33", "43:33")
We can use grepl to set the elements that do not follow the pattern to '' and then remove the quote (') from the rest. Here, the pattern describes strings that start (^) with ' followed by numbers, :, and numbers, in that order, to the end ($) of the string. All other elements (selected by negating the logical index from grepl with !) are assigned '', and we then use sub to remove the '.
sample[!grepl("^'\\d+:\\d+$", sample)] <- ''
sub("'", '', sample)
#[1] "" "" "8:18" "13:33" "43:33"
Or we can also do this in one step using gsub by replacing all those characters (.) that do not follow the pattern \\d+:\\d+ with ''.
gsub("(\\d+:\\d+)(*SKIP)(*F)|.", '', sample, perl=TRUE)
#[1] "" "" "8:18" "13:33" "43:33"
Or another option is str_extract from library(stringr). It is not clear whether there are other patterns such as "some text '08:20 value" in the OP's original dataset or not. The str_extract will also extract those time values, if present.
library(stringr)
str_extract(sample, '\\d+:\\d+')
#[1] NA NA "8:18" "13:33" "43:33"
It will give NA instead of '' for those that don't follow the pattern.
You can use sub:
sub('.+?(?=[0-9]+:[0-9]+)|.+', '', sample, perl = TRUE)
[1] "" "" "8:18" "13:33" "43:33"
The regex consists of two parts that are combined with a logical or (|).
.+?(?=[0-9]+:[0-9]+)
This regex lazily matches one or more characters that are followed by the target pattern (the lookahead checks for the pattern without consuming it).
.+ This regex matches one or more characters.
The logic: Replace everything preceding the target pattern with an empty string (''). If there is no target pattern, replace everything with the empty string.

Exact string matching in r

I am struggling with exact string matching in R. I need only an exact match of the searched string in a sentence:
sentence2 <- "laptop is a great product"
words2 <- c("top","laptop")
I was trying something like this:
sub(paste(c("^",words2,"$")),"",sentence2)
I need to replace only laptop with an empty string, i.e. only on an exact match (laptop), but this didn't work.
Please, could you help me. Thanks in advance.
Desired output:
is a great product
You can try:
gsub(paste0("^",words2," ",collapse="|"),"",sentence2)
#[1] "is a great product"
The result of paste0("^",words2," ",collapse="|") is "^top |^laptop " which means "either 'top' at the beginning of string followed by a space or 'laptop' at the beginning of string followed by a space".
If you want to match entire words, then you can use \\b to match word boundaries.
gsub(paste0('\\b', words2, '\\b', collapse='|'), '', sentence2)
## [1] " is a great product"
Add optional whitespace to the pattern if you want to replace the adjacent spaces as well.
gsub(paste0('\\s*\\b', words2, '\\b\\s*', collapse='|'), '', sentence2)
## [1] "is a great product"

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Note that .*? means match the shortest string of any characters such that the entire pattern matches; without the ? it would try to match the longest string.
How about something like this? Here I'll use the regcapturedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
  unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

R regex: specifying output selections from wider string matches

One for the regex enthusiasts. I have a vector of strings in the format:
<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" STYLE="font-size: 10px" size="10" COLOR="#FF0000" LETTERSPACING="0" KERNING="0">Desired output string containing any symbols</FONT></P></TEXTFORMAT>
I'm aware of the perils of parsing this sort of stuff with regex. It would however be useful to know how to efficiently extract an output sub-string of a larger string match - i.e. the contents of angle quotes >...< of the font tag. The best I can do is:
require(stringr)
strng = str_extract(strng, "<FONT.*FONT>") # select font statement
strng = str_extract(strng, ">.*<") # select inside tags
strng = str_extract(strng, "[^/</>]+") # remove angle quote symbols
What would be the simplest formula to achieve this in R?
Use str_match, not str_extract (or maybe str_match_all). Wrap the part that you want to extract in parentheses.
str_match(strng, "<FONT[^<>]*>([^<>]*)</FONT>")
Or parse the document and extract the contents that way.
library(XML)
doc <- htmlParse(strng)
fonts <- xpathSApply(doc, "//font")
sapply(fonts, function(x) as(xmlChildren(x)$text, "character"))
As agstudy mentioned, xpathSApply takes a function argument that makes things easier.
xpathSApply(doc, "//font", xmlValue)
You can also do it with gsub but I think there are too many permutations to your input vector that may cause this to break...
gsub( "^.*(?<=>)(.*)(?=</FONT>).*$" , "\\1" , x , perl = TRUE )
#[1] "Desired output string containing any symbols"
Explanation
^.* - match any characters from the start of the string
(?<=>) - positive lookbehind zero-width assertion where the subsequent match will only work if it is preceded by this, i.e. a >
(.*) - then match any characters (this is now a numbered capture group)...
(?=</FONT>) - ...until you match "</FONT>"
.*$ - then match any characters to the end of the string
In the replacement we replace all matched stuff by numbered capture group \\1, and there is only one capture group which is everything between > and </FONT>.
Use at your peril.