Regex to not match a particular character or match empty

My list of strings is:
1. bc // should match
2. abc // should not match
3. bc-bc // should match
4. ab-bc // should match
5. bc-ab // should match
I want to match all occurrences of bc. If bc is preceded by a particular character, like the a in string 2, I don't want it to match.
I tried the regex [^a]bc. It correctly rejected string 2, but it also failed on strings 1 and 5, since [^a] requires a character to be present. Then I tried [^a]?bc, but that matched string 2 as well. How can I write a regex that matches either nothing or a character that is not in a particular list?

Do you want to match bc only if it's not preceded by a certain set of characters (like for example a, x, or y)? Then that's exactly what a negative lookbehind assertion is for:
(?<![axy])bc
will match bc or bbc, but not abc or ybc.
If you want to match bc as a complete "word", i. e. not adjacent to any letters or digits, use word boundary anchors:
\bbc\b
Note that in MongoDB, in order to be able to use features like lookbehind that are available only to the PCRE engine (and not to JavaScript), you need to follow a certain syntax (using strings instead of regex objects), for example:
{ name: { $regex: '(?<![axy])bc' } }
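As a quick check of the lookbehind and word-boundary patterns above, here is a small Python sketch run against the five strings from the question. Python's re module accepts this fixed-width lookbehind; the variable names are only illustrative.

```python
import re

# The five sample strings from the question.
samples = ["bc", "abc", "bc-bc", "ab-bc", "bc-ab"]

# bc that is not immediately preceded by a, x, or y.
lookbehind = re.compile(r"(?<![axy])bc")
# bc as a complete "word": no adjacent letters, digits, or underscores.
whole_word = re.compile(r"\bbc\b")

lookbehind_hits = {s: lookbehind.findall(s) for s in samples}
whole_word_hits = {s: whole_word.findall(s) for s in samples}

for s in samples:
    print(s, lookbehind_hits[s], whole_word_hits[s])
```

Both patterns find nothing in abc and find both occurrences in bc-bc. They differ on an input such as bbc: the lookbehind still matches the trailing bc, while the word-boundary version does not.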

Related

Match all elements with n occurrences

I want to match runs of the same character that occur exactly n times.
Match letters that repeat exactly 3 times in this string: "aaaaabbbcccccccccdddee"
this should return "bbb" and "ddd"
If I spelled out what to match, like "b{3}" or "d{3}", this would be easier, but I want to match any character.
The closest I have come is this regex: (.)\1{2}(?!\1)
Which returns "aaa", "bbb", "ccc", "ddd"
And I can't add a negative lookbehind such as (?<!\1), because of the "non-fixed width" error.
One possibility is to use a regex that looks for a character which is not followed by itself (or the beginning of the line), followed by three identical characters, which in turn are not followed by that same character, i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101
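To see the difference between the consuming pattern and the lookahead-wrapped one, here is a minimal Python sketch. The helper name runs_of_three is hypothetical; in both patterns the wanted text is read from group 2, as described above.

```python
import re

# Consuming version: swallows the character before each run.
consuming = re.compile(r"(?:(.)(?!\1)|^)((.)\3{2})(?!\3)")
# Same pattern wrapped in a lookahead, so nothing is consumed.
overlapping = re.compile(r"(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))")

def runs_of_three(pattern, text):
    """Return the group-2 capture of every match of pattern in text."""
    return [m.group(2) for m in pattern.finditer(text)]

adjacent = "aaabbbcccdddeee"
print(runs_of_three(consuming, adjacent))    # misses bbb and ddd
print(runs_of_three(overlapping, adjacent))  # finds all five runs
print(runs_of_three(overlapping, "aaaaabbbcccccccccdddee"))
```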
You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo
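A short Python sketch of this "match what you don't want, capture what you do" approach follows. The function name exact_triples is made up for illustration; the only real work is filtering out the matches where group 2 did not participate.

```python
import re

# First alternative eats runs of 4+; second captures runs of exactly 3.
pattern = re.compile(r"(.)\1{3,}|((.)\3\3)")

def exact_triples(text):
    """Keep only matches where group 2 (the run of three) took part."""
    return [m.group(2) for m in pattern.finditer(text) if m.group(2) is not None]

print(exact_triples("aaaaabbbcccccccccdddee"))
```

The filtering step is the price of this approach: the overall match list also contains the long runs, so the caller has to look at group 2 rather than at the whole match.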
This gets sticky because you cannot put a back reference inside a negative character set, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2)
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.
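Here is a minimal sketch of the compact pattern in Python, assuming the engine treats a backreference to an unset group as a failed match (Python's re does, which is what lets the ^ branch work at the start of the string). The function name is hypothetical.

```python
import re

# Either look behind at the previous character (group 1) or sit at the start.
# Group 2 is a character that differs from group 1, repeated exactly 3 times.
pattern = re.compile(r"(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)")

def exact_triples(text):
    """Return whole matches; no capture-group filtering is needed."""
    return [m.group(0) for m in pattern.finditer(text)]

print(exact_triples("aaaaabbbcccccccccdddee"))
print(exact_triples("aaabbbcccdddeee"))
```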
If your environment permits the use of (*SKIP)(*FAIL), you can return a lean set of matches by consuming substrings of four or more consecutive duplicate characters and then discarding them. The other branch of the alternation matches the desired 3 consecutive duplicate characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
preg_match_all(
'/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
$string,
$m
)
? $m[0]
: 'no matches'
);
Output:
array (
0 => 'bbb',
1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward, and by consuming the 4 or more consecutive duplicates it can rule out long substrings quickly.

Match String only if preceding character is not there

I want to completely discard a match if it begins with the letter C.
This is an example text, each line is a separate example:
C4526913CA57248560A562492460C
A000008002A20839256662C
C370694CA102000979A68008192429291C
The regex I am using is
[cC]?([0-9*dD]){5,}[cC]
Match :
1: C4526913C
2: 562492460C
3: 20839256662C
4: C370694C
5: 68008192429291C
but I don't want to match the ones that start with C, and I have tried these
(?!^[cC])[cC]?([0-9*dD]){5,}[cC]
(?![cC].*[cC])([cC]?([0-9*dD]){5,}[cC])
Both add a negative lookahead, but instead of discarding the whole match they match everything except the starting C. Like so:
C4526913C -> 4526913C
How can I achieve this with just regular expressions?
You can match what you don't want and capture in a group what you want to keep.
As there is a single character class in the group ([0-9*dD]){5,} you can omit the group and just repeat the character class.
Note that [0-9*dD] matches a digit 0-9 or * or d or D but only the digits are in the example data to match.
[cC][0-9*dD]{5,}[cC]|([0-9*dD]{5,}[cC])
Regex demo
For the example data (without D d and *) you could also use a lookbehind if that is supported:
(?<![cC0-9])[0-9]{5,}[cC]
Regex demo
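Both suggestions can be tried on the three example lines with a short Python sketch. The helper names are illustrative only; the first helper keeps group 1 and ignores matches where only the unwanted C...C alternative matched.

```python
import re

lines = [
    "C4526913CA57248560A562492460C",
    "A000008002A20839256662C",
    "C370694CA102000979A68008192429291C",
]

# Match the unwanted C...C form first; capture only the wanted form.
capture = re.compile(r"[cC][0-9*dD]{5,}[cC]|([0-9*dD]{5,}[cC])")
# Digits-only variant: the run must not be preceded by C or another digit.
lookbehind = re.compile(r"(?<![cC0-9])[0-9]{5,}[cC]")

def wanted_by_capture(text):
    """Return group 1 of every match where the second alternative matched."""
    return [m.group(1) for m in capture.finditer(text) if m.group(1)]

def wanted_by_lookbehind(text):
    """Return whole matches of the lookbehind pattern."""
    return lookbehind.findall(text)

for line in lines:
    print(wanted_by_capture(line), wanted_by_lookbehind(line))
```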

Regex match when substring present

I would like to use regex to match numbers if a substring is present, but without matching the substring. Hence,
2-4 foo
foo 4-6
bar 8
should match
2, 4
4, 6
I currently have
(\d{0,}\.?\d{1,})
which returns the numbers (int or float). Using
(\d{0,}\.?\d{1,}(?=\sfoo))
only matches 4, rather than 2 and 4. I also tried a lookahead
^(?=.*?\bfoo\b)(\d{0,}\.?\d{1,})
but that matches the 2 only.
With engines that support infinite width lookbehind patterns, you can use
(?<=\bfoo\b.*)\d*\.?\d+|\d*\.?\d+(?=.*?\bfoo\b)
See this regex demo. It matches any zero or more digits followed with an optional dot and then one or more digits when either preceded with a whole word foo (not necessarily immediately) or when followed with foo whole word somewhere to the right.
When you have access to the code, you can simply check for the word presence in the text and then extract all the match occurrences. In Python, you could use
import re

text = '2-4 foo'  # sample input line
if 'foo' in text:
    print(re.findall(r'\d*\.?\d+', text))
# Or, if you need to make sure the foo is a whole word:
if re.search(r'\bfoo\b', text):
    print(re.findall(r'\d*\.?\d+', text))
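The check-then-extract idea can be wrapped in a small function and run over the three sample lines from the question. This is only a sketch: the name numbers_if_foo is made up, and it uses the whole-word test. Python's built-in re module accepts only fixed-width lookbehind, so the single-regex alternative above would need a different engine.

```python
import re

NUMBER = re.compile(r"\d*\.?\d+")   # int or float, as in the question
FOO = re.compile(r"\bfoo\b")        # foo as a whole word

def numbers_if_foo(line):
    """Return all numbers in line, but only if the word foo is present."""
    if FOO.search(line):
        return NUMBER.findall(line)
    return []

lines = ["2-4 foo", "foo 4-6", "bar 8"]
results = {line: numbers_if_foo(line) for line in lines}
print(results)
```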

regexp - find numbers in a string in any order

I need a regexp that finds strings containing all the required digits, each of them only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
That is, the digits 1, 2 and 3, each exactly once, in any order.
I tried a lot of things, and this is the closest I got, although it is still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What I need exactly is to find all 3-digit numbers containing the digits 1, 2 AND 3.
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you do not want to allow the same digit more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match. See IDEONE demo.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it checks and returns true if there is no match, otherwise it returns false. Thus, it does not "consume" characters, it does not "match" the pattern inside the look-ahead, and the regex engine stays at the same location in the input string. In this regex, that location is the beginning of the string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text as in group 1 with \\1. Thus, if the string is 121, there will be no match, since the look-ahead will return false as it will find two 1s.
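For readers who want to try the same idea outside R, here is a rough Python equivalent of the two grep calls above. It is a sketch, not a translation of R semantics: the list a is copied from the question, and in a Python raw string the back-reference is written with a single backslash.

```python
import re

a = ["12", "13", "112", "123", "113", "1123", "23", "212", "223", "213",
     "2123", "312", "323", "313", "3123", "1223", "1213", "12123",
     "2313", "23123", "13123"]

# Exactly three characters, each one of 1, 2, 3 (repeats allowed).
any_of_123 = re.compile(r"^[123]{3}$")
# Same, but the look-ahead rejects any string with a repeated character.
no_repeats = re.compile(r"^(?!.*(.).*\1)[123]{3}$")

three_digits = [s for s in a if any_of_123.search(s)]
permutations = [s for s in a if no_repeats.search(s)]

print(three_digits)
print(permutations)
```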
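You can also use this, which requires each digit to differ from the ones before it and anchors the end of the string so that longer strings such as "2313" are rejected:
grep('^([123])((?!\\1)[123])(?!\\1|\\2)[123]$', a, value=TRUE, perl=TRUE)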

regex to match entire words containing only certain characters

I want to match entire words (or strings really) that containing only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are at the beginning of the string or preceded by whitespace, and at the end of the string or followed by whitespace. If you can't use lookbehind (for example, in JavaScript engines that predate ES2018) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
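The three variants discussed above can be compared on the sample sentence with a short Python sketch (variable names are illustrative). The last pattern is the one the answer warns against: it consumes the trailing whitespace and therefore skips the word that follows a hit.

```python
import re

sentence = "dog god ogd, dogs o"

# Lookbehind and lookahead: nothing around the word is consumed.
with_lookarounds = re.findall(r"(?<!\S)[dog]+(?!\S)", sentence)
# Leading condition spelled out; the word itself is in group 1.
leading_spelled_out = re.findall(r"(?:^|\s)([dog]+)(?!\S)", sentence)
# Trailing condition also spelled out: consumes the space after each hit.
trailing_consumed = re.findall(r"(?:^|\s)([dog]+)(?:$|\s)", sentence)

print(with_lookarounds)
print(leading_spelled_out)
print(trailing_consumed)  # "god" is lost because its leading space was consumed
```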
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Anything between two square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges such as [0-9] or [a-z], but a class still matches only one character. The + and * are quantifiers: + matches one or more repetitions of the preceding element, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}): {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses () is a capture group, whose matched text can be retrieved afterwards, for example to return it or to use it in a replacement string. The (?!...) construct is a negative lookahead: the match fails if the pattern inside it (here the character class [\w,]) matches at that position, which ensures that a run of d, o, g is rejected when it is followed by a word character or a comma.
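The quantifier and lookahead behaviour described above can be checked with a few lines of Python. This is a sketch with made-up variable names; the capture group around [dog]+ is added here only to show how a group's text is retrieved.

```python
import re

# {2} takes exactly two repetitions at a time.
exactly_two = re.findall(r"a{2}", "aaaaa")
# {1,3} takes between one and three, as many as possible.
one_to_three = re.findall(r"a{1,3}", "aaaaa")

# Parentheses capture; the negative lookahead rejects a run of d/o/g
# that is followed by a word character or a comma.
m = re.search(r"([dog]+)(?![\w,])", "ogd, dogs o")

print(exactly_two, one_to_three, m.group(1), m.start())
```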