R regular expression issue - regex

I have a data frame column containing page paths:
pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html
What I want to do is to extract the first number after a /, for example 123 from each row.
To solve this problem, I tried the following :
num = gsub("\\D", " ", mydata$pagePath) # replace every non-digit character with a space
num1 = gsub("\\s+", " ", num) # collapse runs of spaces into a single space
num2 = gsub("^\\s", "", num1) # drop the leading space
my_number = gsub(" .*$", "", num2) # keep only the first number in the string
I thought this was what I wanted, but it gives the wrong result for rows like the last one in the example, where the first number does not directly follow a /: /text/other_text/text/text/some_other_txet-4157/text.html
So, what I really want is to extract the first number after a /.
Any help would be very welcome.

You can use the following regex with gsub:
"^(?:.*?/(\\d+))?.*$"
And replace with "\\1". See the regex demo.
Code:
> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123" "15" "25189" "5418874" ""
The regex optionally matches (via the (?:.*?/(\\d+))? subpattern) the part of the string from the beginning up to the first / that is immediately followed by one or more digits (.*?/), capturing those digits into Group 1 with (\\d+), and then matches the rest of the string up to its end (.*$). When no such / exists, the optional group is skipped and the whole string is replaced with an empty string.
NOTE that perl=T is required.
With stringr's str_extract, your code and pattern can be shortened to:
> library(stringr)
> str_extract(s, "(?<=/)\\d+")
[1] "123" "15" "25189" "5418874" NA
>
str_extract extracts the first run of one or more digits that is preceded by a / (the / itself is not returned because it sits inside a lookbehind, a zero-width assertion that does not add the matched text to the result). Strings with no such run yield NA.
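For readers who want to cross-check the lookbehind pattern outside R, here is a minimal Python sketch using the standard re module. The names paths and first_numbers are illustrative only; the sample strings are the ones from the question.

```python
import re

# Sample page paths copied from the question.
paths = [
    "/text/other_text/123-some_other_txet-4571/text.html",
    "/text/other_text/another_txet/15-some_other_txet.html",
    "/text/other_text/25189-some_other_txet/45112-text.html",
    "/text/other_text/text/text/5418874-some_other_txet.html",
    "/text/other_text/text/text/some_other_txet-4157/text.html",
]

# (?<=/) asserts that a "/" sits immediately before the digits
# without including it in the match.
first_numbers = []
for path in paths:
    m = re.search(r"(?<=/)\d+", path)
    first_numbers.append(m.group(0) if m else None)

print(first_numbers)
```

The last path produces None because its only number follows a hyphen rather than a /, which mirrors the NA returned by str_extract.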

Try this
\/(\d+).*
Demo
Output:
MATCH 1
1. [26-29] `123`
MATCH 2
1. [91-93] `15`
MATCH 3
1. [132-137] `25189`
MATCH 4
1. [197-204] `5418874`

Related

Filter a string using regular expression

I tried the following code. However, the result is not what I want.
$strLine = "100.11 Q9"
$sortString = StringRegExp ($strLine,'([0-9\.]{1,7})', $STR_REGEXPARRAYMATCH)
MsgBox(0, "", $sortString[0],2)
The output shows 100.11, but I want 100.11 9. How could I display it this way using a regular expression?
$sPattern = "([0-9\.]+)\sQ(\d+)"
$strLine = "100.11 Q9"
$sortString = StringRegExpReplace($strLine, $sPattern, '\1 \2')
MsgBox(0, "$sortString", $sortString, 2)
$strLine = "100.11 Q9"
$sortString = StringRegExp($strLine, $sPattern, 3); array of global matches.
For $i1 = 0 To UBound($sortString) -1
MsgBox(0, "$sortString[" & $i1 & "]", $sortString[$i1], 2)
Next
The pattern captures two groups, 100.11 and 9.
It first matches the group of digits and dots until it reaches
\s, which matches the space. It then matches the Q. The second group
matches the digits that follow.
StringRegExpReplace replaces the whole match with the first and second groups
separated by a space.
StringRegExp returns the two groups as two array elements.
Choose whichever of the two approaches above you prefer.
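The same two approaches (replace the match with both groups, or fetch the groups as separate values) can be sketched in Python for comparison. This is not AutoIt code; it only assumes the same pattern and input string as above, and the variable names are illustrative.

```python
import re

pattern = r"([0-9.]+)\sQ(\d+)"
line = "100.11 Q9"

# Approach 1: rewrite the match as "group 1, space, group 2".
replaced = re.sub(pattern, r"\1 \2", line)

# Approach 2: retrieve the two captured groups as separate values.
groups = re.search(pattern, line).groups()

print(replaced)
print(groups)
```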

How do you find 3 UNIQUE digits in a string of digits?

I am trying to write a regex that is very specific. I want to find 3 digits in a list. The issue comes because I do not care about repeating digits (5, 555, and 55555555555555 are seen as 5). Also, within the 3 digits, they need to be 3 different digits (123 = good, 311 = bad).
Here is what I have so far to find 3 digits, ignoring repeats but it does not specify 3 unique digits.
^(?:([0]{1,}|[1]{1,}|[2]{1,}|[3]{1,}|[4]{1,}|[5]{1,}|[6]{1,}|[7]{1,}|[8]{1,}|[9]{1,}|[0]{1,})(?!.*\\1)){3}$
Here is an example of the types of data I see.
Matching:
458
3333335555111
2222555111
222255558888
111147
9533333333
And not matching:
999999999
222252
888887
Right now my regex will find all of these. How can I ignore any that do not have 3 unique digits?
If your regex tool of choice supports lookaheads, back-references, and possessive quantifiers, you could use
^(\d)\1*+(?!.*\1)(\d)\2*+(\d)\3*+$
^ and $ are anchors that ensure we check the whole string
(\d) captures a digit into the first group; \1*+ possessively matches any following occurrences of that digit, and the negative lookahead (?!.*\1) ensures the digit does not occur again later in the string.
(\d)\2*+ then matches the next, different digit, again matching any following occurrences possessively (try 122 without the possessive quantifier to see why it is needed here)
(\d)\3*+ matches the last digit together with any following occurrences.
Without possessive matching you could rely more on lookaheads, like ^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*$
See https://regex101.com/r/pV2tB2/2 for a demo.
Side note: Regex might not be the best tool for this, but as you specifically asked for it, here you are.
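To check the lookahead-only idea against the sample data from the question, here is a small Python sketch. It is an illustration, not part of the original answer: it uses plain * quantifiers throughout (Python's re module only gained possessive quantifiers in version 3.11), and the names three_runs, should_match and should_not_match are made up for the example.

```python
import re

# Three runs of digits; a negative lookahead after each of the first two
# runs forbids that digit from appearing again later in the string.
three_runs = re.compile(r"^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*$")

should_match = ["458", "3333335555111", "2222555111",
                "222255558888", "111147", "9533333333"]
should_not_match = ["999999999", "222252", "888887", "122"]

for s in should_match + should_not_match:
    print(s, "->", bool(three_runs.match(s)))
```

The extra input 122 is the case mentioned above: the lookahead after the second run stops the engine from splitting the two 2s into separate groups.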
This can be done with regex, but it's not the best tool for your work.
Instead of a regex-only approach, you can easily achieve this using Python.
Example:
strings = ['458', '3333335555111', '2222555111', '222255558888', '111147', '9533333333', '955555555', '12222211']
for s in strings:
    # A set keeps one copy of each character, so its length is the number of distinct digits
    if len(set(list(s))) == 3:
        print("Ok :", s)
    else:
        print("Error :", s)
Output:
>> Ok : 458
>> Ok : 3333335555111
>> Ok : 2222555111
>> Ok : 222255558888
>> Ok : 111147
>> Ok : 9533333333
>> Error : 955555555
>> Error : 12222211
I've used the following commands while iterating over the strings inside that list:
list()
set()
len()
Using negative lookahead, this should match any string of digits that contains at least 3 unique digits /^(\d)\1*(?!\1)(\d)(?:\2|\1)*(?!\2|\1)(\d)+$/
(\d) - Match a digit
\1* - Allow that digit to repeat
(?!\1) - Make sure that's followed by a digit that does not match the first match
(\d) - Match the new digit
(?:\2|\1)* - Allow repeats of either the first or second digit
(?!\2|\1) - Make sure that's followed by a digit that does not match the first or second match
(\d)+ - Capture the third unique digit, then allow any number of digits of any kind to follow
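One way to gain confidence in the "at least 3 unique digits" claim is to compare the regex with a plain set-based count over many inputs. The sketch below does this in Python for every integer from 0 to 9999; the names at_least_three and mismatches are illustrative, and the set-based check is an assumed reference, not something taken from the answer.

```python
import re

at_least_three = re.compile(r"^(\d)\1*(?!\1)(\d)(?:\2|\1)*(?!\2|\1)(\d)+$")

# Collect every input where the regex and the set-based count disagree.
mismatches = [
    str(n)
    for n in range(10000)
    if bool(at_least_three.match(str(n))) != (len(set(str(n))) >= 3)
]

print(mismatches)
```

Note that this pattern also accepts strings such as 1231 or 1234, so it answers a looser question than "exactly three runs of three different digits".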
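I'm not sure if an awk script will do it for you, but here it goes:
awk '
# Mark a digit as seen for the current line.
function match_func(num) {
    match_array[num] = 1;
}
{
    # Reset the per-line state before counting.
    split("", match_array);
    match_sum = 0;
    # substr() is 1-indexed in awk, so start at 1 and include the last character.
    for (i = 1; i <= length($1); i++)
        match_func(substr($1, i, 1));
    for (i = 0; i < 10; i++)
        if (match_array[i] == 1) match_sum++;
    if (match_sum == 3)
        print $1;
}'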

R: change to the same character as the previous character in a string

Suppose I have a vector c('JKA1','BP9C','SSTQ3WA') and I want to change the character before the number to that very number, so that R returns 'JK11' 'B99C' 'SST33WA'. Is there any way to do this with regex, or am I better off using something other than R?
Match the letter before the number and capture the number in a capturing group. Then replace the matched characters with \\1\\1, which writes the content of group 1 twice.
> x <- c('JKA1','BP9C','SSTQ3WA')
> gsub("[A-Za-z](\\d)", "\\1\\1", x)
[1] "JK11" "B99C" "SST33WA"
The sub function would be enough for this case, since each string contains only one such match.
> sub("[A-Z](\\d)", "\\1\\1", x)
[1] "JK11" "B99C" "SST33WA"
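For comparison, the same capture-and-duplicate replacement can be sketched in Python. This is an illustrative cross-check rather than R code; codes and fixed are made-up names.

```python
import re

codes = ["JKA1", "BP9C", "SSTQ3WA"]

# Match one letter followed by a captured digit, then emit the digit twice.
fixed = [re.sub(r"[A-Za-z](\d)", r"\1\1", code) for code in codes]

print(fixed)
```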

strsplit inconsistent with gregexpr

A comment on my answer to this question proposed a strsplit call that should give the desired result but does not, even though its regex seems to correctly match the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
In other words, at each iteration ^ will match the beginning of a new string (without the preceding items).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
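The quoted algorithm is short enough to simulate. The following Python sketch re-implements the loop from the documentation excerpt to show why ^ matches again after every cut. It is a simplified model, not R's actual implementation: strsplit_like is a hypothetical helper, and it deliberately rejects empty matches, which R handles with extra rules.

```python
import re

def strsplit_like(s, pattern):
    """Simplified model of the loop quoted above (non-empty matches only)."""
    out = []
    while s:                          # if the string is empty: break
        m = re.search(pattern, s)
        if m is None:                 # no match: emit the rest and stop
            out.append(s)
            break
        if m.end() == m.start():
            raise ValueError("empty matches are not modelled in this sketch")
        out.append(s[:m.start()])     # text to the left of the match
        s = s[m.end():]               # drop the match and all to the left of it
    return out

# Each iteration sees a fresh string, so "^." matches every time.
looped = strsplit_like("12345", "^.")

# A single left-to-right scan only sees the start of the string once.
single_pass = re.split("^.", "12345")

print(looped)
print(single_pass)
```

The looped version reproduces the five empty strings from the R output, while the single-pass split cuts only once at the real start of the string.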

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
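The effect of the two anchors can be checked quickly outside R as well. This Python sketch contrasts the anchored pattern with an unanchored literal replacement, which is a rough analogue of the fixed = TRUE loop in the question; the variable names are illustrative.

```python
import re

original = ["#1", "#2", "#10", "#11"]

# Anchored: only strings that are exactly "#1" or "#2" are rewritten.
anchored = [re.sub(r"^#([12])$", r"\1", s) for s in original]

# Unanchored literal replacement: "#1" is also found inside "#10" and "#11".
unanchored = [s.replace("#1", "1").replace("#2", "2") for s in original]

print(anchored)
print(unanchored)
```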
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to explicitly use the values of your vector vec:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...) construct: the pattern matches # at the start of a string as long as it is not followed by whatever is in .... In this case that is two digits (or, equivalently, more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"
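The same negative lookahead works unchanged in Python's re module, which makes for an easy sanity check. This is a sketch for comparison only; stripped is an illustrative name.

```python
import re

original = ["#1", "#2", "#10", "#11"]

# Remove a leading "#" unless it is followed by two digits.
stripped = [re.sub(r"^#(?![0-9]{2})", "", s) for s in original]

print(stripped)
```

Because the lookahead consumes nothing, only the # itself is removed and the digit after it is left in place.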