capture repetition of letters in a word with regex - regex

I'm trying to detect words in which letters are repeated, and I would like to replace each such run with a single copy of the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically, when a letter repeats itself 3 times or more, it should be detected and replaced.
I want to use the regular expression with R's gsub.
df$text <- gsub("?", "?", df$text)

You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE: this replaces any character (not just Hebrew) that is repeated three or more times in a row.
Or use the following to handle only word characters (letters, digits and underscore) from any language:
> gsub("(\\w)\\1{2,}", "\\1", x)

If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to {2,} limiting quantifier) same characters stored in Group 1 buffer (as \\1 is a backreference to Group 1 contents).
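The "three or more" threshold and the effect of restricting the captured class can be checked on plain ASCII strings. This is only a sketch: the sample vector and the [[:alpha:]] variant are assumptions made for illustration and do not come from the answers above.

```r
# Hypothetical ASCII samples, chosen only to illustrate the threshold.
samples <- c("good", "goood", "heeeello.....")

# (.) collapses any character repeated 3+ times in a row, dots included;
# "good" is untouched because "oo" is only two characters long.
any_char <- gsub("(.)\\1{2,}", "\\1", samples)

# Restricting the group to letters leaves the run of dots alone.
letters_only <- gsub("([[:alpha:]])\\1{2,}", "\\1", samples)

print(any_char)
print(letters_only)
```

Swapping [[:alpha:]] for \\p{Hebrew} (with perl=TRUE) should give the script-specific behaviour described above.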

Related

How to match a string and white space in R

I have a dataframe with columns having values like:
"Average 18.24" "Error 23.34". My objective is to remove the text and the following space from these values in R. Can anybody help me with a regex pattern to do this?
I am able to match the letters using [A-Z], but I am not able to combine it with the white space: [A-Z][[:space:]] gave no luck.
Your help is appreciated.
We can use sub. The pattern \\D+ matches the first run of non-digit characters, and '' in the replacement removes it.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more spaces and replace them with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")
You can use a quantifier and add a-z to the pattern (and the ^ anchor)
You can use
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
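Since the question mentions data frame columns, here is a minimal sketch of applying the anchored pattern to a column and converting the result to numbers. The data frame and the column names metric and value are hypothetical.

```r
# Hypothetical data frame; the column names are assumptions.
df <- data.frame(metric = c("Average 18.24", "Error 23.34"),
                 stringsAsFactors = FALSE)

# Drop the leading word and the whitespace after it, then convert to numeric.
df$value <- as.numeric(sub("^\\S+\\s+", "", df$metric))

print(df$value)
```

The ^ anchor keeps the substitution tied to the start of each value, so it cannot remove a word further along in the string.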

Extracting part of string using regular expressions

I'm struggling to get a bit of regular expression code to work. I have a long list of strings that I need to partially extract. I need only the strings that start with "WER", and from those I need only the last part of the string, starting with (and including) the letter.
test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")
Here is the line of code that works, but only for one letter. In my list of strings, however, the letter can be any letter.
ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")
What I’m hoping to achieve is the following:
H987654
G789456
F12
You can use the following pattern with gsub:
> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] "" "H987654" "G789456" "F12" ""
This pattern matches:
^ - start of a string
(?: - start of an alternation group with 2 alternatives:
WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to match 1+ digits, to require at least 1 digit)
| - or
.* - any 0+ characters
)$ - closing the alternation group and match the end of string with $.
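To see why the second alternative is there, it may help to compare the pattern with and without it. This is a sketch rather than part of the original answer; it uses sub with perl=TRUE, which is a choice made here, and the variable names are illustrative.

```r
test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123")

# Without the "|.*" alternative, non-matching strings are returned unchanged.
no_alt <- sub("^WER.*([a-zA-Z]\\d*)$", "\\1", test, perl = TRUE)

# With it, every string matches, and group 1 is empty for the non-WER ones.
with_alt <- sub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test, perl = TRUE)

print(no_alt)
print(with_alt)
```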
With str_match from stringr, it is even tidier:
> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA "H987654" "G789456" "F12" NA
If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:
sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12"

regex - excluding a specific part of an URL via regex match in gsub

I'm working with a vector below:
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
"http://statistics.gov.scot/id/statistical-geography/S02000003")
I would like to remove http://statistics.gov.scot/id/statistical-geography/ from the vector. My present regex syntax:
vec_cln <- gsub(replacement = "", x = vec, perl = TRUE, fixed = FALSE,
pattern = "([[:alnum:]]|[[:punct:]]|)(?<!S\\d{8})")
But this leaves only the last digit of each element of vec. I'm guessing that the problem is with \\d{8}; however, it's not clear to me how to work around it. I tried various solutions on regex101 but to no avail. Some examples:
(?<!S\d) - this leaves second digit
(?<!S[[:digit:]]) - same
What I'm trying to achieve can be simply summarised: match everything until you find a capital letter S followed by 8 digits.
Notes
I want to arrive at the solution via gsub and regex. I don't want to use:
gsubfn and proto objects
I'm not interested in using substr as I may have to work with strings of variable lengths
You can obtain the result using
sub(".*(S\\d{8})", "\\1", vec)
With .*, we match any amount of (* - 0 or more) any characters but a newline up to the S followed by 8 digits (S\\d{8}). Since (S\\d{8}) is inside unescaped parentheses, the substring matched by this subpattern is placed into a capture group #1. With \\1 backreference, we restore the captured text in the result.
See more about backreferences and capturing groups at regular-expressions.info.
NOTE: if you have more text after S+8 digits, you can use
sub("^.*(S\\d{8}).*$", "\\1", vec)
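The difference between the two calls shows up only when something follows the code. A small sketch, where the third URL with trailing text is a hypothetical entry added for illustration:

```r
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
         "http://statistics.gov.scot/id/statistical-geography/S02000003",
         # Hypothetical entry with trailing text, added only for illustration.
         "http://statistics.gov.scot/id/statistical-geography/S02000004/extra")

# Without the trailing .*, whatever follows the code survives.
codes_plain <- sub(".*(S\\d{8})", "\\1", vec)

# With ^.* and .*$ around the group, only the code remains.
codes_only <- sub("^.*(S\\d{8}).*$", "\\1", vec)

print(codes_plain)
print(codes_only)
```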
Here it is with slightly prettier syntax:
library(rex)
library(stringi)
library(magrittr)
regex_1 = rex("S", digits)
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
"http://statistics.gov.scot/id/statistical-geography/S02000003")
vec %>% stri_extract_last_regex(regex_1)

regexp - find numbers in a string in any order

I need to find a regexp that allows me to find strings in which I have all the required numbers but only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
The digits 1, 2 and 3 must each appear only once, in any order and in any position of the string.
I tried a lot of things, and this seemed to be the closest, although it's still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What I need exactly is to find all 3-digit numbers containing the digits 1, 2 AND 3.
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you need to disallow the same digit appearing more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double-escaped back-reference. The (?!.*(.).*\\1) negative look-ahead checks that the string has no repeated symbols, with the help of a capturing group (.) and a back-reference that forces the same captured text to appear again in the string. If the same character is found twice, there will be no match.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it succeeds if there is no match and fails otherwise. Thus, it does not "consume" characters: the regex engine stays at the same location in the input string. In this regex, that location is the beginning of the string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text as in group 1 with \\1. Thus, if the string is 121, there will be no match, since the look-ahead fails once it finds two 1s.
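The look-ahead can also be tried on its own to see which strings it lets through. A minimal sketch, with a hypothetical probe vector that is not part of the question's data:

```r
# Hypothetical probe strings for the look-ahead on its own.
probe <- c("123", "121", "1123", "321")

# TRUE when no character occurs twice anywhere in the string.
no_repeats <- grepl("^(?!.*(.).*\\1)", probe, perl = TRUE)

# Combined with [123]{3}$ this keeps only permutations of 1, 2 and 3.
perms <- grep("^(?!.*(.).*\\1)[123]{3}$", probe, value = TRUE, perl = TRUE)

print(no_repeats)
print(perms)
```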
You can also use this (anchored with $ so that longer strings such as "3123" are not matched):
grep('^([123])((?!\\1)[123])(?!\\1|\\2)[123]$', a, value=TRUE, perl=TRUE)

Regular expression to keep a running count of individual characters

Consider the following vector x
x <- c("000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00")
Using a single regular expression, I'd like to keep only those elements of x that contain more than three letters. Here are the rules:
The letters are not always consecutive (this is the main issue)
The string elements of x can be of any number of characters
There will be nothing in the string except digits and lower-case letters
The simple way I thought of would be to remove everything that is not a letter and then take the number of characters, something like the following.
x[nchar(gsub("[0-9]+", "", x)) > 3]
# [1] "abcd00ab" "abcdefg" "000s00r00g00t00"
I know that there are statements like [a-z]{4,} that find four or more consecutive lower-case letters. But what if individual letters are scattered about the string? How can I keep a "running count" of letters such that when it passes three, it becomes a match? Right now all I can think of is to write [a-z]+ a bunch of times, but this can get ugly if I want to match, say, five or more letters.
This gets me there, but you can see how this could be ugly for longer strings.
grep("[a-z]+.*[a-z]+.*[a-z]+.*[a-z]+.*", x)
# [1] 2 3 4
Is there a way to do that with a better regular expression?
Try this where \\D matches a non-digit, .* matches a string of 0 or more characters and (...){4} says to match four times, i.e. more than 3.
grep("(\\D.*){4}", x, value = TRUE)
This will match if there are 4 or any greater number of non-digits. Just replace 4 with 6 if you need more than 5. If it's important to have the number 3 in the regexp, then try the pattern (\\D.*){3}\\D instead.
There is a repetition operator you can use: {n} matches the previous token or group n times. To make matches more efficient, you should also be specific in what may be matched between letters (in your case only digits, not "any" character (which the dot . matches)):
^(?:[0-9]*[a-z]){4}[0-9a-z]*$
matches all strings that contain at least 4 lowercase letters (i.e. more than three).
Explanation:
^ # Start of string
(?: # Start of a (non-capturing) group:
[0-9]* # Match any number of digits
[a-z] # Match one lowercase ASCII letter
){4} # Repeat the group exactly four times
[0-9a-z]* # Then match any following digits/letters
$ # until the end of the string
In R:
grep("^(?:[0-9]*[a-z]){4}[0-9a-z]*$", x, perl=TRUE, value=TRUE)
gives you a character vector with all the elements that are matched by the regex.
The grep command below finds the elements that have four or more letters:
> grep("^(?:[^a-z]*[a-z]){4}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
OR
> grep("^(?:[^a-z]*[a-z]){3}[^a-z]*[a-z]", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
To find the elements that have 5 or more letters,
> grep("^(?:[^a-z]*[a-z]){5}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg"
Explanation:
^ the beginning of the string
(?: group, but do not capture (4 times):
[^a-z]* any character except: 'a' to 'z' (0 or more times)
[a-z] any character of: 'a' to 'z'
){4} end of grouping
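Because only the repetition count changes between the "four or more" and "five or more" cases, the pattern can be built from the threshold. The helper below is a hypothetical sketch (the name at_least_n_letters is not from any of the answers), cross-checked against the counting approach from the question.

```r
x <- c("000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00")

# Hypothetical helper: keep elements with at least n lower-case letters.
at_least_n_letters <- function(x, n) {
  pattern <- sprintf("^(?:[^a-z]*[a-z]){%d}", as.integer(n))
  grep(pattern, x, perl = TRUE, value = TRUE)
}

four_plus <- at_least_n_letters(x, 4)
five_plus <- at_least_n_letters(x, 5)

# Cross-check against the counting approach from the question.
by_count <- x[nchar(gsub("[0-9]+", "", x)) >= 4]

print(four_plus)
print(five_plus)
```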