regexp - find numbers in a string in any order - regex

I need to find a regexp that allows me to find strings in which i have all the required numbers but only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
The pattern 123 only once and in any order and in any position of the string
I tried a lot of things and this seemed to be the closer while it's still very far from what I want :
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"

What i exactly need is to find all 3 digit numbers containing 1 2 AND 3 digits
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you do not need to allow the same digit more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match. See IDEONE demo.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it checks and returns true if there is no match, otherwise it returns false. Thus, it does not not "consume" characters, it does not "match" the pattern inside the look-ahead, the regex engine stays at the same location in the input string. In this regex, it is the beginning of string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text inside group 1 with \\1. Thus, if there is 121, there will be no match since the look-ahead will return false as it will find two 1s.

you can as well use this
grep('^([123])((?!\\1)\\d)(?!\\2|\\1)\\d', a, value=TRUE, perl=T)
see demo

Related

Match all elements with n occurrences

I want to select the same element with exact n occurrences.
Match letters that repeats exact 3 times in this String: "aaaaabbbcccccccccdddee"
this should return "bbb" and "ddd"
If I define what I should match like "b{3}" or "d{3}", this would be easier, but I want to match all elements
I've tried and the closest I came up is this regex: (.)\1{2}(?!\1)
Which returns "aaa", "bbb", "ccc", "ddd"
And I can't add negative lookbehind, because of "non-fixed width" (?<!\1)
One possibility is to use a regex that looks for a character which is not followed by itself (or beginning of line), followed by three identical characters, followed by another character which is not the same as the second three i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101
You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo
This gets sticky because you cannot put a back reference inside a negative character set, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2))
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.
If your environment permits the use of (*SKIP)(*FAIL), you can manage to return a lean set of matches by consuming substrings of four or more consecutive duplicate characters then discard them. In the alternation, match the desired 3 consecutive duplicated characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
preg_match_all(
'/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
$string,
$m
)
? $m[0]
: 'no matches'
);
Output:
array (
0 => 'bbb',
1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward and by consuming the 4 or more consecutive duplicates, it can rule-out long substrings quickly.

php check ncr with negative lookbehind and greedy doesn't work

I want to find a erroneous NCR without &# and remedy it, the unicode is 4 or 5 decimal digit, I write this PHP statement:
function repl0($m) {
return '&#'.$m[0];
}
$s = "This is a good 23200; sample ship";
echo "input1= ".htmlentities($s)."<br>";
$out1=preg_replace_callback('/(?<!#)(\d{4,5};)/','repl0',$s);
echo 'output1 = '.htmlentities($out1).'<br>';
The output is:
input1= This is a good 23200; sample ship
output1 = This is a good 2ಀ sample ship
The match only happens once according to the output message.
What I want is to match '23200;' instead of '3200;'.
Default should be greedy mode and I thought it will capture 5-digit number instead 4-digit's
Do I misunderstand 'greedy' here? How can I get what I want?
The (?<!#)(\d{4,5};) pattern matches like this:
(?<!#) - matches a location that is not immediately preceded with #
(\d{4,5};) - then tries to match and consume four or five digits and a ; char immediately after these digits.
So, if you have #32000; string input, 3 cannot be a starting character of a match, as it is preceded with #, but 2 can since it is not preceded by a # and there are five digits with a ; for the pattern to match.
What you need here is to curb the match on the left by adding a digit to the lookbehind,
(?<![#\d])(\d{4,5};)
With this trick, you ensure that the match cannot be immediately preceded with either # or a digit.
You say you finally used (?<!#)(?<!\d)\d{4,5};, and this pattern is functionally equivalent to the pattern above since the lookbehinds, as all lookarounds, "stand their ground", i.e. the regex index does not move when the lookaround patterns are matched. So, the check for a digit or a # char occurs at the same location in the string.

regular expression in R, match substring only if things after

my_string = "2011, this year I made 750,000 dollars"
Is there an elegant way to match "2011" and "750,000" in the string above. The idea is to extract numeric values when it looks like to numeric values, i.e. \d+ or \d+[\.,]?\d* depending on the presence of a comma after
I tried this but it doesn't match exactly what I wanted, I got "2011," which is no good
library(stringr)
str_match_all(fkin, "(\\d+[\\.,]?\\d*)
Here is my expected resut:
"2011" "750,000"
You can do:
[0-9]+(?:[,.][0-9]+)*
It's very elegant, I tried it in front of a mirror.
Here is a one regex pure base R approach to extract integer or float values that are not part of the string of digits separated with a hyphen:
> str <- "2011, this year I made 750,000 dollars and 750,000-589 here"
> regmatches(str, gregexpr('(?<!\\d-)\\b\\d+(?:[,.]\\d+)?+(?!-)', str, perl=T))[[1]]
[1] "2011" "750,000"
See the IDEONE demo and a regex demo.
Since the regex contains lookarounds, you need to specify the perl=TRUE argument.
Pattern explanation:
(?<!\d-) - a negative lookbehind failing the match when a digit with a hyhen precedes the current location
\b\d+ - a word boundary (before the next digit, there cannot be a word char - letter, digit or _)
(?:[,.]\d+)?+ - a non-capturing group ((?:...)) matching 1 or 0 sequences of a comma or dot ([,.]) followed with 1 or more digits (and this sequence is matched possessively (see ?+) so that the regex engine did not check for a hyphen after \b\d+)
(?!-) - a negative loookahead that fails the match if there is a hyphen after the digits detected.

Remove any digit only in first N characters

I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="‌​")
It seems too complicated. I split the string at 7 as described, match any digits in the first 7 characters and bind it with the rest of the string.
Looking for a general answer, my current solution is to split the first 7 characters and just match all digits in this sub string.
Any help appreciated.
You can use the known SKIP-FAIL regex trick to match all the rest of the string beginning with the 8th character, and only match non-digit characters within the first 7 with a lookbehind:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
See IDEONE demo
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
See the regex demo.
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)

Regular expression to keep a running count of individual characters

Consider the following vector x
x <- c("000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00")
Using a single regular expression, I'd like to keep only those elements of x that contain more than three letters. Here are the rules:
The letters are not always consecutive (this is the main issue)
The string elements of x can be of any number of characters
There will be nothing in the string except digits and lower-case letters
The simple way I thought of would be to remove everything that is not a letter and then take the number of characters, something like the following.
x[nchar(gsub("[0-9]+", "", x)) > 3]
# [1] "abcd00ab" "abcdefg" "000s00r00g00t00"
I know that there are statements like [a-z]{4,} that finds four or more consecutive lower-case letters. But what if individual letters are scattered about the string? How can I keep a "running count" of letters such that when it passes three, it becomes a non-match? Right now all I can think of is to write [a-z]+ a bunch of times, but this can get ugly if I want to match say, five or more letters.
This gets me there, but you can see how this could be ugly for longer strings.
grep("[a-z]+.*[a-z]+.*[a-z]+.*[a-z]+.*", x)
# [1] 2 3 4
Is there a way to do that with a better regular expression?
Try this where \\D matches a non-digit, .* matches a string of 0 or more characters and (...){4} says to match four times, i.e. more than 3.
grep("(\\D.*){4}", x, value = TRUE)
This will match if there are 4 or any greater number of non-digits. Just replace 4 with 6 if you need more than 5. If its important to have the number 3 in the regexp then try this pattern (\\D.*){3}\\D instead.
There is a repetition operator you can use: {n} matches the previous token or group n times. To make matches more efficient, you should also be specific in what may be matched between letters (in your case only digits, not "any" character (which the dot . matches)):
^(?:[0-9]*[a-z]){4}[0-9a-z]*$
matches all strings that contain at least 3 lowercase letters.
Explanation:
^ # Start of string
(?: # Start of a (non-capturing) group:
[0-9]* # Match any number of digits
[a-z] # Match one lowercase ASCII letter
){4} # Repeat the group exactly four times
[0-9a-z]* # Then match any following digits/letters
$ # until the end of the string
In R:
grep("^(?:[0-9]*[a-z]){4}[0-9a-z]*$", x, perl=TRUE, value=TRUE);
gives you a character vector with all the elements that are matches by the regex.
The below grep command would find the elements which has four or more letters
> grep("^(?:[^a-z]*[a-z]){4}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
OR
> grep("^(?:[^a-z]*[a-z]){3}[^a-z]*[a-z]", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
To find the elements which has 5 or more letters,
> grep("^(?:[^a-z]*[a-z]){5}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg"
Explanation:
^ the beginning of the string
(?: group, but do not capture (4 times):
[^a-z]* any character except: 'a' to 'z' (0 or
more times)
[a-z] any character of: 'a' to 'z'
){4} end of grouping