Regex match when substring present


I would like to use regex to match numbers if a substring is present, but without matching the substring. Hence,
2-4 foo
foo 4-6
bar 8
should match
2, 4
4, 6
I currently have
(\d{0,}\.?\d{1,})
which returns the numbers (int or float). Using
(\d{0,}\.?\d{1,}(?=\sfoo))
only matches 4, rather than 2 and 4. I also tried a lookahead
^(?=.*?\bfoo\b)(\d{0,}\.?\d{1,})
but that matches the 2 only.
*edited typo

With engines that support infinite width lookbehind patterns, you can use
(?<=\bfoo\b.*)\d*\.?\d+|\d*\.?\d+(?=.*?\bfoo\b)
See this regex demo. It matches zero or more digits, an optional dot, and then one or more digits, either when preceded by the whole word foo (not necessarily immediately) or when followed by the whole word foo somewhere to the right.
When you have access to the code, you can simply check for the word presence in the text and then extract all the match occurrences. In Python, you could use
import re

if 'foo' in text:
    print(re.findall(r'\d*\.?\d+', text))
# Or, if you need to make sure the foo is a whole word:
if re.search(r'\bfoo\b', text):
    print(re.findall(r'\d*\.?\d+', text))
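Applied line by line to the sample input from the question, the check-then-extract idea could look roughly like this. It is only a sketch: the helper name numbers_if_word and its word parameter are made up for illustration.

```python
import re

# Same number pattern as above: optional digits, optional dot, one or more digits.
NUMBER = re.compile(r'\d*\.?\d+')

def numbers_if_word(line, word='foo'):
    """Return all numbers in line, but only if word occurs as a whole word."""
    if re.search(r'\b' + re.escape(word) + r'\b', line):
        return NUMBER.findall(line)
    return []

for line in ['2-4 foo', 'foo 4-6', 'bar 8']:
    print(line, '->', numbers_if_word(line))
```

Lines without the word yield an empty list, so "bar 8" contributes nothing.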

Related

Match all elements with n occurrences

I want to select the same element with exact n occurrences.
Match letters that repeats exact 3 times in this String: "aaaaabbbcccccccccdddee"
this should return "bbb" and "ddd"
If I define what I should match like "b{3}" or "d{3}", this would be easier, but I want to match all elements
I've tried and the closest I came up is this regex: (.)\1{2}(?!\1)
Which returns "aaa", "bbb", "ccc", "ddd"
And I can't add negative lookbehind, because of "non-fixed width" (?<!\1)

One possibility is to use a regex that looks for a character which is not followed by itself (or the beginning of the line), followed by three identical characters, followed by a character that differs from those three, i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101
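For anyone wanting to try the all-lookahead version outside regex101, here is a small Python sketch. It assumes Python's re module; the function name exact_triples is invented for this example. Every match is zero-width, so the run of three has to be read from group 2.

```python
import re

# The whole pattern is a lookahead, so nothing is consumed and
# adjacent runs can still be found. The run itself is in group 2.
PATTERN = re.compile(r'(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))')

def exact_triples(text):
    """Return every run of exactly three identical characters."""
    return [m.group(2) for m in PATTERN.finditer(text)]

print(exact_triples('aaaaabbbcccccccccdddee'))
print(exact_triples('aaabbbcccdddeee'))
```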

You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo
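In code, the "match what you don't want, capture what you do" idea means filtering out the matches where group 2 did not participate. A minimal Python sketch, assuming the re module (the function name triples_by_alternation is just illustrative):

```python
import re

# First alternative swallows runs of 4 or more; second captures runs of 3 in group 2.
PATTERN = re.compile(r'(.)\1{3,}|((.)\3\3)')

def triples_by_alternation(text):
    """Keep only matches where group 2 took part, i.e. runs of exactly three."""
    return [m.group(2) for m in PATTERN.finditer(text) if m.group(2) is not None]

print(triples_by_alternation('aaaaabbbcccccccccdddee'))
```

Because the longer runs are consumed whole by the first alternative, their tails can never be mistaken for a run of three.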

This gets sticky because you cannot put a back reference inside a negative character set, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2)
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.
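A quick way to check the compact pattern is a few lines of Python. This sketch assumes Python's re module, where a backreference to a group that has not matched simply fails, so (?!\1) succeeds at the start of the string. Other engines may treat an unset group differently. The function name triples_by_lookbehind is made up here.

```python
import re

# Lookbehind captures the previous character (if any) without consuming it,
# so the full match itself is the run of three.
PATTERN = re.compile(r'(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)')

def triples_by_lookbehind(text):
    """Return runs of exactly three identical characters as whole matches."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(triples_by_lookbehind('aaaaabbbcccccccccdddee'))
print(triples_by_lookbehind('aaabbbbcc'))
```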

If your environment permits the use of (*SKIP)(*FAIL), you can return a lean set of matches by consuming substrings of four or more consecutive duplicate characters and then discarding them. In the alternation, match the desired 3 consecutive duplicated characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
    preg_match_all(
        '/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
        $string,
        $m
    )
    ? $m[0]
    : 'no matches'
);
Output:
array (
  0 => 'bbb',
  1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward, and by consuming the 4 or more consecutive duplicates it can rule out long substrings quickly.

Regex match an entire number with lookbehind and look ahead logic(without word boundaries)

I am trying to detect whether a string has a number based on a few conditions. For example, I don't want to match the number if it's surrounded by parentheses. I'm using lookahead and lookbehind to do this, but I'm running into issues when the number contains multiple digits. Also, the number can sit between text without any space separators.
My regex:
(?https://regex101.com/r/RnTSMJ/1
Sample examples:
{2}: Should NOT Match. //My regex Works
{34: Should NOT Match. //My regex matches 4 in {34
45}: Should NOT Match. //My regex matches 4 in 45}
{123}: Should NOT Match. //My regex matches 2 in {123}
I looked at Regex.Match whole words but this approach doesn't work for me. If I use word boundaries, the above cases work as expected, but then cases like the ones below, where numbers are surrounded by text, don't. I also want to add some additional logic, like not matching specific strings such as 1st, 2nd, etc. or #1, #2, etc.
updated regex:
(?<!\[|\{|\(|#)(\b\d+\b)(?!\]|\}|\|st|nd|rd|th)
See here https://regex101.com/r/DhE3K4/4
123abd //should match 123
abc345 //should match 345
ab2123cd // should match 2123
Is this possible with pure regex or do I need something more comprehensive?

You could match 1 or more digits asserting what is on the left and right is not {, } or a digit or any of the other options to the right
(?<![{\d#])\d+(?![\d}]|st|nd|rd|th)
Explanation
(?<![{\d#]) Negative lookbehind, assert what is on the left is not {, # or a digit
\d+ Match 1+ digits
(?! Negative lookahead, assert what is on the right is not
[\d}]|st|nd|rd|th Match a digit, } or any of the alternatives
) Close lookahead
Regex demo
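To see how this pattern behaves on the examples from the question, a short Python check could look like the following. It assumes Python's re module; the helper name find_numbers and the extra samples '1st' and '#2' are my own additions based on the question's wording.

```python
import re

# Digits not preceded by {, # or a digit, and not followed by a digit, } or an ordinal suffix.
PATTERN = re.compile(r'(?<![{\d#])\d+(?![\d}]|st|nd|rd|th)')

def find_numbers(text):
    """Return all numbers accepted by the pattern."""
    return PATTERN.findall(text)

samples = ['{2}', '{34', '45}', '{123}', '123abd', 'abc345', 'ab2123cd', '1st', '#2']
for s in samples:
    print(repr(s), '->', find_numbers(s))
```

The digit inside both lookarounds is what stops the engine from backtracking into a partial number such as the 4 in 45}.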

The following regex is giving the expected result.
(?<![#\d{])\d+(?!\w*(?:}|(?:st|th|rd|nd)\b))
Regex Link

How can I create a regex that matches a 6[A-Z] character string, with 3 or more non-consecutive or consecutive Cs?

(?=[A-Z]{6})(?=([C]){3,6}) is what I have tried so far.
I would like it to work like this:
ABYCCC Match
CBTCAC Match
CCTYEC Match
AFEQCB Don't match
CCEEEE Don't match
EEEEEE Don't match
This however just matches strings with consecutive Cs.
I am very new, so any help is appreciated. I'm just using the search in Notepad++.

^(?=(?:.*C){3}).*$
Use this regex. See demo.
https://regex101.com/r/rP5pV8/1

So here we go
\b(?=(?:[ABD-Z]*C){3})[A-Z]{6}\b
This will match any string that consists of 6 uppercase letters, of which 3 (or more) are Cs.
It doesn't match:
strings shorter than 6 uppercase letters
strings longer than 6 uppercase letters
strings with fewer than 3 Cs, even if more Cs follow outside the string
https://regex101.com/r/vV3yS4/2

You can check for the occurrence of at least 3 Cs by using a lookahead.
^(?=(?:[^C]*C){3})[A-Z]{6}$
[^C]*C matches any number of characters that are not C, followed by a C
(?:...){3} is a non-capturing group repeated 3 times
[A-Z]{6} requires 6 uppercase letters.
See demo at regex101
(Note that for the demo I put an additional \n in the negated class so that it does not skip over newlines)
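The same pattern can be checked against the examples from the question with a few lines of Python. This is only a sketch using the re module; the function name has_three_cs is invented for illustration.

```python
import re

# Lookahead counts three Cs from the start; the body then requires exactly 6 uppercase letters.
PATTERN = re.compile(r'^(?=(?:[^C]*C){3})[A-Z]{6}$')

def has_three_cs(s):
    """True if s is six uppercase letters containing at least three Cs."""
    return PATTERN.search(s) is not None

for s in ['ABYCCC', 'CBTCAC', 'CCTYEC', 'AFEQCB', 'CCEEEE', 'EEEEEE']:
    print(s, has_three_cs(s))
```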

Here's how I would do it in Python:
import re
pattern = re.compile("[A-Z]{6}")
strings = ["AABSDC", "CCCASD", "CAVACC"]
def checkC(letters):
    # fullmatch requires the whole string to be exactly six uppercase letters
    return bool(pattern.fullmatch(letters)) and letters.count('C') >= 3
for string in strings:
    print(checkC(string))
Output:
False
True
True

(?=(.*C.*){3})[A-Z]{6}
I swapped the two parts and removed the "lookahead" from the [A-Z]{6} so that the expression matches something positive.
Then, to the left and right of "C" I added .* (a dot with a star quantifier, which is greedy rather than lazy) to match any character zero or more times. So you still match three or more Cs and allow for anything between them.
After that, I removed the ,6 because "anything" can be some more Cs.

regex: find one-digit number

I need to find all the one-digit numbers in a text.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly, I just wanted to ask whether it is correct to use spaces.
Maybe there is some other way to distinguish one-digit number.
Thanks

First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).
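The differences between these variants are easier to see side by side. Below is a rough Python comparison, assuming the re module. The constant names and the extra test strings ('1 2 3', 'Is it 7? No, 8.') are made up for this example, and the PHP delimiters are dropped.

```python
import re

SPACED = re.compile(r' \d ')                        # the question's pattern, minus [ ] and {1}
ISOLATED = re.compile(r'(?<!\S)\d(?!\S)')           # whitespace or string edge on both sides
WITH_PUNCT = re.compile(r'(?<!\S)\d(?![^\s.,?!])')  # also allows . , ? ! after the digit
WORD_BOUND = re.compile(r'\b\d\b')                  # word boundaries only

text = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text'
print(ISOLATED.findall(text))

# Consecutive single digits: the spaces get consumed, so neighbours are missed.
print(SPACED.findall('1 2 3'), ISOLATED.findall('1 2 3'))

# Word boundaries also fire inside an IP address.
print(WORD_BOUND.findall('192.168.4.5'), ISOLATED.findall('192.168.4.5'))

# Trailing punctuation needs the relaxed lookahead.
print(ISOLATED.findall('Is it 7? No, 8.'), WITH_PUNCT.findall('Is it 7? No, 8.'))
```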

Use word boundaries. Note that the quantifier {1} is redundant (a single \d already matches exactly one digit), and so is the character class [], because it contains only one element.
\b\d\b

Search around word boundaries:
\b\d\b
As explained by the others, this will extract single digits meaning that some special characters might not be respected like "." in an ip address. To address that, see F.J and Mike Brant's answer(s).

It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';

If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
Demo
The regular expression reads: match a digit (\d) that is neither preceded nor followed by a digit, (?<!\d) being a negative lookbehind and (?!\d) being a negative lookahead.
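A brief Python check of this digit-only variant, assuming the re module (the function name lone_digits is invented for illustration). Note that it also picks up the last two octets of an IP address, since only neighbouring digits are excluded.

```python
import re

# A digit with no digit directly before or after it.
PATTERN = re.compile(r'(?<!\d)\d(?!\d)')

def lone_digits(text):
    """Return every digit that is not part of a longer digit run."""
    return PATTERN.findall(text)

for s in ['a1 cat', 'Call agent 7, pronto!', 'text 4 78 text 558', '192.168.4.5']:
    print(repr(s), '->', lone_digits(s))
```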

Regex negative match query

I've got a regex issue, I'm trying to ignore just the number '41', I want 4, 1, 14 etc to all match.
I've got this [^\b41\b] which is effectively what I want but this also ignores all single iterations of the values 1 and 4.
As an example, this matches "41", but I want it to NOT match:
\b41\b

Try something like:
\b(?!41\b)(\d+)
The (?!...) construct is a negative lookahead so this means: find a word boundary that is not followed by "41" and capture a sequence of digits after it.

You could use a negative look-ahead assertion to exclude 41:
/\b(?!41\b)\d+\b/
This regular expression is to be interpreted as: At any word boundary \b, if it is not followed by 41\b ((?!41\b)), match one or more digits that are followed by a word boundary.
Or the same with a negative look-behind assertion:
/\b\d+\b(?<!\b41)/
This regular expression is to be interpreted as: Match one or more digits that are surrounded by word boundaries, but only if the substring at the end of the match is not preceded by \b41 ((?<!\b41)).
Or can even use just basic syntax:
/\b(\d|[0-35-9]\d|\d[02-9]|\d{3,})\b/
This matches only sequences of digits surrounded by word boundaries of either:
one single digit
two digits where the first is not a 4 or the second is not a 1
three or more digits
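The three variants can be compared on one input with a small Python sketch. It assumes Python's re module, which accepts \b inside a fixed-width lookbehind. The regex delimiters are dropped, the constant names are made up, and the group in the basic version is made non-capturing here only so that findall returns whole matches.

```python
import re

LOOKAHEAD = re.compile(r'\b(?!41\b)\d+\b')
LOOKBEHIND = re.compile(r'\b\d+\b(?<!\b41)')
# Same alternatives as the basic-syntax version, with (?:...) instead of (...).
BASIC = re.compile(r'\b(?:\d|[0-35-9]\d|\d[02-9]|\d{3,})\b')

text = '4 1 14 41 141 410'
for name, pattern in [('lookahead', LOOKAHEAD), ('lookbehind', LOOKBEHIND), ('basic', BASIC)]:
    print(name, pattern.findall(text))
```

All three skip the standalone 41 but keep 141 and 410, where 41 is only part of a longer number.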

This is similar to the question "Regular expression that doesn’t contain certain string", so I'll repeat my answer from there:
^((?!41).)*$
This will work for an arbitrary string, not just 41. See my response there for an explanation.
