Using Regex to extract from second period to end of a string - regex

I'm using gsub in R to extract parts of a string. Everything before the first period is the building. Everything between the first and second period is the name of a piece of equipment. Everything after the second period is the point name. I've managed to figure out how to get the building and equipment, but haven't figured out the point. See below (obviously the line with "point" is incorrect):
library(tidyverse)
df <- data_frame(
var = c("buildA.equipA.point", "buildA.equipA.another.point",
"buildA.equipA.yet.another.point")
)
df2 <- df %>%
mutate(
building = gsub("(^[^.]*)(.*$)", "\\1", var),
equip = gsub("^[^.]*.([^.]+).*", "\\1", var),
point = gsub("^[^.].*", "\\1", var)
)

You may use tidyr::extract here with the regex like
^([^.]+)\.([^.]+)\.(.+)$
See the regex demo.
Details
^ - start of string
([^.]+) - Group 1 (Column "building"): one or more chars other than a dot
\. - a dot
([^.]+) - Group 2 (Column "equip"): one or more chars other than a dot
\. - a dot
(.+) - Group 3 (Column "point"): any 1 or more chars other than line break chars, as many as possible
$ - end of string (not necessary here though).
R demo:
library(tidyverse)
df <- data_frame(
var = c("buildA.equipA.point", "buildA.equipA.another.point",
"buildA.equipA.yet.another.point")
)
df2 <- df %>% extract(var, c("Building", "equip", "point"), "^([^.]+)\\.([^.]+)\\.(.+)$")
df2
# A tibble: 3 x 3
Building equip point
<chr> <chr> <chr>
1 buildA equipA point
2 buildA equipA another.point
3 buildA equipA yet.another.point
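For readers working outside R, the same three-group pattern can be sketched in Python with the standard re module. This is only an illustration of the regex; split_point is a hypothetical helper name, not part of the answer above.

```python
import re

# Same pattern as above: two dot-free fields, then everything that is left.
PATTERN = re.compile(r"^([^.]+)\.([^.]+)\.(.+)$")

def split_point(value):
    """Return (building, equip, point), or None if the string does not fit."""
    match = PATTERN.match(value)
    return match.groups() if match else None

for value in ["buildA.equipA.point",
              "buildA.equipA.another.point",
              "buildA.equipA.yet.another.point"]:
    print(split_point(value))
```

Because only the first two groups exclude dots, any further periods stay inside the third group.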

You can use something like ^(?:.*?\.){2}(.*). The ^ anchors the match at the beginning of the line; the non-capturing group (?:.*?\.){2} then matches, twice, as few characters as possible followed by a dot. What remains is the part you're interested in, which we put in a capturing group.
I'm aware this question is not about javascript, but here you can see a working version.
const regex = /^(?:.*?\.){2}(.*)$/gm;
const str = `buildA.equipA.point
buildA.equipA.another.point
buildA.equipA.yet.another.point`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
console.log(m[1]);
}
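The lazy pattern behaves the same way in other regex flavours. A minimal Python sketch, assuming the three sample strings arrive as one multi-line text (point_names is just an illustrative variable name):

```python
import re

# Two lazy "anything up to a dot" chunks, then capture the rest of the line.
# re.MULTILINE makes ^ and $ match at every line, like the JavaScript m flag.
LAZY = re.compile(r"^(?:.*?\.){2}(.*)$", re.MULTILINE)

text = """buildA.equipA.point
buildA.equipA.another.point
buildA.equipA.yet.another.point"""

# With a single capturing group, findall returns the captured text per match.
point_names = LAZY.findall(text)
print(point_names)
```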

Related

Remove given string from both start and end of a word

Data :
col 1
AL GHAITHA
AL ASEEL
EMARAT AL
LOREAL
ISLAND CORAL
My code :
def remove_words(df, col, letters):
    regular_expression = '^' + '|'.join(letters)
    df[col] = df[col].apply(lambda x: re.sub(regular_expression, "", x))
Desired output :
col 1
GHAITHA
ASEEL
EMARAT
LOREAL
ISLAND CORAL
SUNRISE
Function call :
letters = ['AL','SUPERMARKET']
remove_words(df=df, col='col 1', letters=letters)
Basically, I want to remove the given words when they appear at either the start or the end of the value (note: only when they are a separate word).
For example, "EMARAT AL" should become "EMARAT".
Note that "LOREAL" should not become "LORE".
Code to build the df :
raw_data = {'col1': ['AL GHAITHA', 'AL ASEEL', 'EMARAT AL', 'LOREAL UAE',
'ISLAND CORAL','SUNRISE SUPERMARKET']
}
df = pd.DataFrame(raw_data)
You may use
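pattern = r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters)))
df['col 1'] = df['col 1'].str.replace(pattern, r'\1', regex=True).str.strip()
The r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters))) call will create a pattern like (?s)^(?:word1|word2)\b|(.*)\b(?:word1|word2)$ and it will match any of the words as a whole word at the start or at the end of the string. The (?:...) groups keep the alternation from spilling over the ^, \b and $ anchors, and regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default.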
When checking the word at the end of the string, the whole text before it will be captured into Group 1, hence the replacement pattern contains the \1 placeholder to restore that text in the resulting string.
If your letters list contains only items made up of word characters, you may skip the escaping and replace map(re.escape, letters) with letters.
The .str.strip() will remove any resulting leading/trailing whitespaces.
See the regex demo.
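The same idea can be tried without pandas. The sketch below applies the pattern to plain strings with re.sub; strip_edge_words is a hypothetical helper name, and it relies on Python 3.5+ replacing an unmatched group in \1 with an empty string.

```python
import re

def strip_edge_words(text, words):
    """Remove any of `words` when it is a whole word at the start or end."""
    alternation = "|".join(map(re.escape, words))
    # Group 1 keeps the text before a trailing word so \1 can restore it.
    pattern = r"(?s)^(?:{0})\b|(.*)\b(?:{0})$".format(alternation)
    return re.sub(pattern, r"\1", text).strip()

letters = ["AL", "SUPERMARKET"]
for value in ["AL GHAITHA", "EMARAT AL", "LOREAL", "SUNRISE SUPERMARKET"]:
    print(strip_edge_words(value, letters))
```

"LOREAL" is left alone because there is no word boundary between "E" and "A".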

Looping over brackets with regex

Regex extracting 99% of desired result.
This is my line:
Customer Service Representative (CS) (TM PM *) **
*Can have more parameters. Example (TM PM TR) etc
**Can have more parenthesis. Example (TM PM) (RI) (AB CD) etc
Except for the first bracket (CS in this case) which is group 1, I can have any number of parenthesis and any number of parameters within those parenthesis in group 2.
My attempt yields the desired result, but with brackets
(\(.*?\))\s*(\(.*?\).*)
My result:
My desired result:
group 1 : CS
group 2 : if gg yiy rt jfjfj jhfjh uigtu
I want help on removing those parenthesis from the result.
My attempt:
\((.*?)\)\s*\((.*?\).*)
which gives me
Can someone help me with this? I need to remove all the brackets from group 2 as well. I have been at it for a long time but can't figure out a way. Thank you.
You can't match disjoint sections of text using a single match operation. When you need to repeat a group, there is no way to even use a replace approach with capturing groups.
You need a post-process step to remove ( and ) from Group 2 value.
So, after you get your matches with the current approach, remove all ( and ) from the Group 2 value with
Group2value = Group2value.Replace("(", "").Replace(")", "");
Here is one approach which uses string splitting along with the base string functions:
string input = "(CS) (if gg yiy rt) (jfjfj) (jhfjh uigtu)";
string[] parts = Regex.Split(input, "\\) \\(");
string grp1 = parts[0].Replace("(", "");
parts[0] = "";
parts[parts.Length - 1] = parts[parts.Length - 1].Replace(")", "");
string grp2 = string.Join(" ", parts).Trim();
Console.WriteLine(grp1);
Console.WriteLine(grp2);
CS
if gg yiy rt jfjfj jhfjh uigtu
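The "match first, clean up afterwards" idea can also be sketched in Python. Note that this swaps in a different technique from the split above: it collects the contents of every bracket pair with re.findall and joins all but the first. split_groups is a hypothetical name, and nested brackets are assumed not to occur.

```python
import re

def split_groups(line):
    """Return (first bracket content, remaining bracket contents joined)."""
    # Each match is the text between one "(" and the next ")".
    chunks = re.findall(r"\(([^()]*)\)", line)
    if not chunks:
        return None
    return chunks[0], " ".join(chunks[1:])

group1, group2 = split_groups("(CS) (if gg yiy rt) (jfjfj) (jhfjh uigtu)")
print(group1)
print(group2)
```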

R - Gsub return first match

I want to extract the 12 and the 0 from the test vector. Every time I try it would either give me 120 or 12:0
TestVector <- c("12:0")
gsub("\\b[:numeric:]*",replacement = "\\1", x = TestVector, fixed = F)
What can I use to extract the 12 and the 0. Can we just have one where I just extract the 12 so I can change it to extract the 0. Can we do this exclusively with gsub?
One option, which doesn't involve using explicit regular expressions, would be to use strsplit() and split the timestamp on the colon:
TestVector <- c("12:0")
parts <- unlist(strsplit(TestVector, ":"))
> parts[1]
[1] "12"
> parts[2]
[1] "0"
Try this
gsub("\\b(\\d+):(\\d+)\\b",replacement = "\\1 \\2", x = TestVector, fixed = F)
Regex Breakdown
\\b #Word boundary
(\\d+) #Find all digits before :
: #Match literally colon
(\\d+) #Find all digits after :
\\b #Word boundary
As far as I know, R has no named character class [:numeric:], but it does have [:digit:], which is written as [[:digit:]] inside a bracket expression. You can use it as
gsub("\\b([[:digit:]]+):([[:digit:]]+)\\b",replacement = "\\1 \\2", x = TestVector)
As suggested by rawr, a much simpler and intuitive way to do it would be to just simply replace : with space
gsub(":",replacement = " ", x = TestVector, fixed = F)
This can be done using scan from base R
scan(text=TestVector, sep=":", what=numeric(), quiet=TRUE)
#[1] 12 0
or with str_extract
library(stringr)
str_extract_all(TestVector, "[^:]+")[[1]]
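For comparison, the two-capture-group idea from the gsub answer can be sketched in Python, where each group is read back directly instead of being rebuilt through a replacement string. split_pair is a hypothetical helper name.

```python
import re

def split_pair(value):
    """Return the digits before and after the colon, or None if no match."""
    match = re.fullmatch(r"(\d+):(\d+)", value)
    return match.groups() if match else None

first, second = split_pair("12:0")
print(first)   # the part before the colon
print(second)  # the part after the colon
```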

R regular expression issue

I have a dataframe column including pages paths :
pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html
What I want to do is to extract the first number after a /, for example 123 from each row.
To solve this problem, I tried the following :
num = gsub("\\D"," ", mydata$pagePath) # replace every non-digit character with a space
num1 = gsub("\\s+"," ",num) # collapse runs of spaces into a single space
num2 = gsub("^\\s","",num1) # delete the leading space
my_number = gsub( " .*$", "", num2 ) # keep only the first number in the string
I thought that was what I wanted, but I ran into trouble, especially with rows like the last one in the example: /text/other_text/text/text/some_other_txet-4157/text.html
So, what I really want is to extract the first number after a /.
Any help would be very welcome.
You can use the following regex with gsub:
"^(?:.*?/(\\d+))?.*$"
And replace with "\\1". See the regex demo.
Code:
> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123" "15" "25189" "5418874" ""
The regex will match optionally (with a (?:.*?/(\\d+))? subpattern) a part of string from the beginning till the first / (with .*?/) followed with 1 or more digits (capturing the digits into Group 1, with (\\d+)) and then the rest of the string up to its end (with .*$).
NOTE that perl=T is required.
With str_extract from the stringr package, your code and pattern can be shortened to:
> str_extract(s, "(?<=/)\\d+")
[1] "123" "15" "25189" "5418874" NA
The str_extract will extract the first 1 or more digits if they are preceded with a / (the / itself is not returned as part of the match since it is a lookbehind subpattern, a zero width assertion, that does not put the matched text into the result).
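The lookbehind behaves the same way in Python's re module. A minimal sketch, with first_number_after_slash as a hypothetical helper that returns None when no number directly follows a slash:

```python
import re

# Digits that are immediately preceded by "/"; the slash is not part of the match.
FIRST_NUMBER = re.compile(r"(?<=/)\d+")

def first_number_after_slash(path):
    match = FIRST_NUMBER.search(path)  # search stops at the first hit
    return match.group(0) if match else None

paths = [
    "/text/other_text/123-some_other_txet-4571/text.html",
    "/text/other_text/text/text/some_other_txet-4157/text.html",
]
for path in paths:
    print(first_number_after_slash(path))
```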
Try this
\/(\d+).*
Demo
Output:
MATCH 1
1. [26-29] `123`
MATCH 2
1. [91-93] `15`
MATCH 3
1. [132-137] `25189`
MATCH 4
1. [197-204] `5418874`

Remove all instances of sub-string after a different sub-string has occurred N times

I've been attempting to replace the character '-' with 'Z', but only if it is preceded by 2 or more 'Z's in the string.
input = c("XX-XXZZXX-XZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZX-X",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ",
"XZXZXX-XXZXXZXX")
desired_output = c("XX-XXZZXXZXZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZXZX",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXXZXZ",
"XZXZXXZXXZXXZXX")
I've had some success in removing everything before or after the second occurrence, but can't quite bridge the gap to replacing the needed character while keeping everything else. There's no guarantee that either a Z or a - will be in the string.
This is not an easy regex, but you still can use it to achieve what you need.
input = c("XX-XXZZXX-XZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZX-X",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ",
"XZXZXX-XXZXXZXX")
gsub("(?:^([^Z]*Z){2}|(?!^)\\G)[^-]*\\K-", "Z", input, perl=T)
See IDEONE demo
The regex first matches two chunks ending with Z (to make sure there are two Zs from the beginning), then any characters other than a hyphen, and finally a hyphen. Only the hyphen is replaced by gsub because the \K operator discards everything matched before it. All subsequent hyphens are matched thanks to the \G operator, which matches at the position where the previous successful match ended.
Explanation:
(?:^([^Z]*Z){2}|(?!^)\\G) - match 2 alternatives:
^([^Z]*Z){2} - start of string (^) followed by 2 occurrences ({2}) of substrings that contain 0 or more characters other than Z ([^Z]*) followed by Z or...
(?!^)\\G - end of the previous successful match
[^-]*\\K - match 0 or more characters other than -, then discard everything matched so far with \K
- - a literal hyphen that will be replaced with Z.
The perl=T is required here.
The regex in @stribizhev's answer is way out of my league, but you can do this without regular expressions by simply splitting each string into characters, counting up the occurrences of Z, and substituting every - that comes after the nth Z:
input = c("XX-XXZZXX-XZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZX-X",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ",
"XZXZXX-XXZXXZXX")
desired_output = c("XX-XXZZXXZXZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZXZX",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXXZXZ",
"XZXZXXZXXZXXZXX")
sp <- strsplit(input, '')
f <- function(x, n = 2) {
x[x == '-' & (cumsum(x == 'Z') >= n)] <- 'Z'
paste0(x, collapse = '')
}
identical(res <- sapply(sp, f), desired_output)
# [1] TRUE
cbind(input, res, desired_output)
# input res desired_output
# [1,] "XX-XXZZXX-XZXXXXX" "XX-XXZZXXZXZXXXXX" "XX-XXZZXXZXZXXXXX"
# [2,] "XX-XXXZXXZXZXXX" "XX-XXXZXXZXZXXX" "XX-XXXZXXZXZXXX"
# [3,] "XXXXXZXXXZXXZX-X" "XXXXXZXXXZXXZXZX" "XXXXXZXXXZXXZXZX"
# [4,] "XXXZXXXZXZXZXXX" "XXXZXXXZXZXZXXX" "XXXZXXXZXZXZXXX"
# [5,] "XZXXX-XXXZXZXXX" "XZXXX-XXXZXZXXX" "XZXXX-XXXZXZXXX"
# [6,] "XX-XXX-ZZX" "XX-XXX-ZZX" "XX-XXX-ZZX"
# [7,] "XXZX-XXZXXX-XZ" "XXZX-XXZXXXZXZ" "XXZX-XXZXXXZXZ"
# [8,] "XZXZXX-XXZXXZXX" "XZXZXXZXXZXXZXX" "XZXZXXZXXZXXZXX"
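The counting approach carries over to languages whose regex engine lacks \G and \K, such as Python's standard re module. A minimal sketch, where replace_after_n and its parameter names are hypothetical:

```python
def replace_after_n(text, n=2, target="-", marker="Z"):
    """Replace `target` with `marker` once `marker` has been seen n times."""
    seen = 0
    out = []
    for ch in text:
        if ch == marker:
            seen += 1  # running count, like cumsum(x == 'Z') in the R version
        out.append(marker if ch == target and seen >= n else ch)
    return "".join(out)

for value in ["XX-XXZZXX-XZXXXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ"]:
    print(replace_after_n(value))
```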