regex underscore delimited pattern matching - regex

Hi, I am struggling to get the regex right for this pattern matching.
I basically want to use a regex to match the following pattern:
[anyCharacters]_[anyCharacters]_[anyCharacters]_[anyCharacters]_[1or2]
For example, the string below should match the above pattern:
AA_B_D_ test-adf123_1
I tried the regex below, but it doesn't work:
^[.]+_[.]+_[.]+_[.]+_(1|2)

. matches any character (once) _ included
.* matches any character (largest sequence) (_ included)
[.]+ matches only . character (at least one) (largest sequence)
[^_]+ matches any character except _ (at least one) (largest sequence)
.*? matches any character (shortest sequence)
you may need one of the last two.
^[^_]+_[^_]+_[^_]+_[^_]+_(1|2)
or
^(.*?_){4}[12]
The problem with .*? is that it can backtrack across underscores, so it also matches
one_two_three_four_five_1
The shortest is
^([^_]+_){4}[12]
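As a quick check of the patterns above, here is a minimal Python sketch. Python's re module is an assumption (the question does not name a regex flavor), and the variable names are made up for illustration. It runs the original attempt, the lazy variant and the negated-class variant against the sample string and against the six-field string one_two_three_four_five_1.

```python
import re

sample = "AA_B_D_ test-adf123_1"
six_fields = "one_two_three_four_five_1"

original = r"^[.]+_[.]+_[.]+_[.]+_(1|2)"  # [.] matches a literal dot only
lazy = r"^(.*?_){4}[12]"                   # .*? may still swallow underscores
negated = r"^([^_]+_){4}[12]"              # [^_]+ can never cross an underscore

# For each pattern: (matches the sample?, matches the six-field string?)
results = {
    name: (bool(re.search(p, sample)), bool(re.search(p, six_fields)))
    for name, p in [("original", original), ("lazy", lazy), ("negated", negated)]
}
for name, (on_sample, on_six) in results.items():
    print(f"{name:8} sample={on_sample} six_fields={on_six}")
```

Only the negated-class version accepts the sample while rejecting the six-field string, which is why it is usually the safer choice for delimiter-separated fields.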

Try
^(.+_)+(1|2)$
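If you want to specify the number of occurrences (note that .+ can itself match underscores, so this requires at least four underscores rather than exactly four):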
^(.+_){4}(1|2)$

Use a [^_] negated character class rather than [.] that only matches a dot symbol:
^[^_]+_[^_]+_[^_]+_[^_]+_[12]
If the pattern must match the whole string, add $:
^[^_]+_[^_]+_[^_]+_[^_]+_[12]$
Also, you may shorten it a bit with a limiting quantifier:
^[^_]+(?:_[^_]+){3}_[12]$
See the regex demo.
Note that [12] is a better way to match single chars; it will match 1 or 2. A grouping construct like (...) (or (?:...), a non-capturing variant) should be used when matching multicharacter values.
Pattern details:
^ - start of string
[^_]+ - 1 or more chars other than _
(?:_[^_]+){3} - 3 occurrences of:
    _ - an underscore
    [^_]+ - 1 or more chars other than _
_ - an underscore
[12] - 1 or 2
$ - end of string.
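The breakdown above can be exercised with a small Python sketch. The is_valid helper and the extra test strings are hypothetical, added only to show which inputs the anchored pattern accepts and rejects.

```python
import re

# Pattern from the answer above: four underscore-free fields, then _1 or _2.
PATTERN = re.compile(r"^[^_]+(?:_[^_]+){3}_[12]$")

def is_valid(text: str) -> bool:
    """Return True when the whole string has the four-fields-plus-flag shape."""
    return PATTERN.search(text) is not None

for candidate in ["AA_B_D_ test-adf123_1", "a_b_c_d_2", "a_b_c_1",
                  "a_b_c_d_3", "a_b_c_d_1x", "a__c_d_1"]:
    print(repr(candidate), is_valid(candidate))
```

The last four candidates fail for different reasons: too few fields, a flag other than 1 or 2, trailing text after the flag, and an empty field.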

Related

Match with optional positive lookahead

I've got 2 strings in the format:
Some_thing_here_1234 Match Me 1 & 1234 Match Me 1_1
In both cases I want the resultant match to be 1234 Match Me 1
So far I've got (?<=^|_)\d{4}\s.+ which works, but in the case of string 2 it also captures the _1 at the end. I thought I could use a lookahead with an optional part at the end, such as (?<=^|_)\d{4}\s.+(?=_\d{1}$|$), but it always seems to fall through to the second alternative, so the _1 gets through.
Any help would be great
You can use
(?<=^|_)\d{4}\s[^_]+
See the regex demo.
Details:
(?<=^|_) - a positive lookbehind that matches a location that is immediately preceded with either start of string or a _ char (equal to (?<![^_]))
\d{4} - four digits
\s - a whitespace
[^_]+ - one or more chars other than _.
Your second pattern (?<=^|_)\d{4}\s.+(?=_\d{1}$|$) is greedy, so .+ runs to the end of the string, where the second alternative |$ matches; you therefore keep matching the whole line.
Note that you can omit {1}.
If you want to use an optional part in the lookahead, you can make the match non-greedy and optionally match _\d in the lookahead, followed by the end of the string.
(?<=^|_)\d{4}\s.+?(?=(?:_\d)?$)
See a regex demo.
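Both suggestions can be tried in Python with one adjustment: Python's re module only accepts fixed-width lookbehinds, so the sketch below swaps (?<=^|_) for the equivalent (?<![^_]) mentioned in the details above. The variable names are illustrative only.

```python
import re

inputs = ["Some_thing_here_1234 Match Me 1", "1234 Match Me 1_1"]

# (?<![^_]) stands in for (?<=^|_): Python requires fixed-width lookbehinds.
negated_class = re.compile(r"(?<![^_])\d{4}\s[^_]+")
lazy_lookahead = re.compile(r"(?<![^_])\d{4}\s.+?(?=(?:_\d)?$)")

found = []
for text in inputs:
    a = negated_class.search(text).group(0)
    b = lazy_lookahead.search(text).group(0)
    found.append((a, b))
    print(repr(text), "->", repr(a), repr(b))
```

The negated-class version stops at the first underscore after the digits, while the lazy version stops only before an optional trailing _digit, so they differ if the wanted text may itself contain underscores.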

Regex to match the letter string group between 2 numbers

Is it possible to match only the letter from the following string?
RO41 RNCB 0089 0957 6044 0001 FPS21098343
What I want: FPS
What I'm trying: [0-9]{4}\s*\S+\s+(\S+)
What I get: FPS21098343
Any help is much appreciated! Thanks.
You can try with this:
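// Lower-case names avoid shadowing the built-in String global.
const input = "0258 6044 0001 FPS21098343";
// Groups of four digits, then the letters (captured), then digits to the end.
const re = /^(?:\d{4} )+ *([a-zA-Z]+)(?:\d+)$/;
const match = re.exec(input);
console.log(match);    // the full match array
console.log(match[1]); // "FPS"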
You can match up to the first one or more letters in the following way:
^[^a-zA-Z]*([A-Za-z]+)
^.*?([A-Za-z]+)
^[\w\W]*?([A-Za-z]+)
(?s)^.*?([A-Za-z]+)
If the tool treats ^ as the start of a line, replace it with \A that always matches the start of string.
The point is to match
^ / \A - start of string
[^a-zA-Z]* - zero or more chars other than letters
([A-Za-z]+) - capture one or more letters into Group 1.
The .*? part matches any text (as short as possible) before the subsequent pattern(s). (?s) makes . match line break chars.
Replace A-Za-z in all the patterns with \p{L} to match any Unicode letters. Also, note that [^\p{L}] = \P{L}.
To grep all the groups of letters that go in a row in any place in the string you can simply use:
([a-zA-Z]+)
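A short Python sketch of the two ideas above, under the assumption that Python's re module is acceptable (the question does not name a tool). Note that the "first run of letters" pattern only yields FPS on the shortened input used in the earlier answer; on the full sample string the first letters are RO.

```python
import re

short = "0258 6044 0001 FPS21098343"
full = "RO41 RNCB 0089 0957 6044 0001 FPS21098343"

# First run of letters after any number of non-letters.
first_run_short = re.search(r"^[^a-zA-Z]*([A-Za-z]+)", short).group(1)
first_run_full = re.search(r"^[^a-zA-Z]*([A-Za-z]+)", full).group(1)

# Every run of letters anywhere in the string.
all_runs = re.findall(r"([a-zA-Z]+)", full)

print(first_run_short, first_run_full, all_runs)
```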
You could use a capture group to get FPS:
\b[0-9]{4}\s+\S+\s+([A-Z]+)
The pattern matches:
\b[0-9]{4} A word boundary to prevent a partial match, then match 4 digits
\s+\S+\s+ Match 1+ non-whitespace chars between whitespace chars
([A-Z]+) Capture group 1, match 1+ chars A-Z
Regex demo
If the chars have to be followed by digits till the end of the string, you can add \d+$ to the pattern:
\b[0-9]{4}\s+\S+\s+([A-Z]+)\d+$
Regex demo
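To see how the capture-group patterns behave on the full sample string, here is a minimal Python sketch (the names loose and strict are just labels for this example). The engine skips the earlier four-digit groups because they are not followed by a token and then upper-case letters.

```python
import re

text = "RO41 RNCB 0089 0957 6044 0001 FPS21098343"

# Four digits, one whitespace-delimited token, then the captured letters.
loose = re.search(r"\b[0-9]{4}\s+\S+\s+([A-Z]+)", text)
# Same, but the letters must be followed by digits up to the end of the string.
strict = re.search(r"\b[0-9]{4}\s+\S+\s+([A-Z]+)\d+$", text)

print(loose.group(0), "|", loose.group(1), "|", strict.group(1))
```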

Regex to exclude text between first occurrence of underscore and last occurrence of hyphen

I am searching for a regex to match
the text up to (and including) the first occurrence of an underscore
and
the text after the last occurrence of a hyphen, but cutting all leading zeros
Sample input:
Text_un_important-0011
Desired result (by concatenating all matches):
Text_11
I have come up with: (?:^|(?:00))(.+?)(?:$|_) but it has some flaws: It works only if there is exactly one hyphen at the end and two leading zeros. Unfortunately I cannot fix it.
If you follow your current logic, you may use
^[^_]*_|[1-9]\d*$
See the regex demo. Details:
^ - start of string
[^_]* - 0 or more chars other than _ and then
_ - a _ char
| - or
[1-9] - a non-zero digit
\d* - 0 or more digits
$ - end of string
You may also use a replace logic:
Find: _.*-0*
Replace: _
See the regex demo. The _.*-0* pattern matches
_ - an underscore
.* - any 0 or more chars other than line break chars, as many as possible
- - a hyphen
0* - zero or more 0 chars.
Since the first _ is consumed, the replacement pattern should be _.
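Both approaches above can be checked with a few lines of Python; this is only a sketch, and using re.findall and re.sub as the "match and concatenate" and "find and replace" steps is an assumption about how the patterns would be applied.

```python
import re

text = "Text_un_important-0011"

# Matching approach: collect both alternatives and concatenate them.
pieces = re.findall(r"^[^_]*_|[1-9]\d*$", text)
joined = "".join(pieces)

# Replace approach: drop everything from the first _ through the zeros
# after the last hyphen, putting the consumed underscore back.
replaced = re.sub(r"_.*-0*", "_", text)

print(pieces, joined, replaced)
```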
So, you want to capture the first word and the final digits that are not leading zeros:
([^\W_]+)(_[^\W_]+)*(-0*)(\d+)
Use the first and last captured groups. Note that \w alone would also match _ (and digits), so the first group would run past the first underscore; [^\W_] matches a word character other than _.
([^\W_]+) captures a word that may have a number (ex. word1)
(_[^\W_]+)* captures repetitions of _word (ex. _word1_anotherword2_third3)
(-0*) captures a hyphen followed by consecutive zeros (ex. -00000000)
(\d+) captures all the digits following the consecutive zeros

Python regex match certain floating point numbers

I'm trying to match: 0 or more numbers followed by a dot followed by ( (0 or more numbers) but not (if followed by a d,D, or _))
Some examples and what should match/not:
match:
['1.0','1.','0.1','.1','1.2345']
not match:
['1d2','1.2d3','1._dp','1.0_dp','1.123165d0','1.132_dp','1D5','1.2356D6']
Currently I have:
"([0-9]*\.)([0-9]*(?!(d|D|_)))"
This correctly matches everything in the match list. However, among the strings it should not match, it incorrectly matches:
['1.2d3','1.0_dp','1.123165d0','1.132_dp','1.2356D6']
and correctly does not match on:
['1d2','1._dp','1D5']
So it appears I have a problem with the ([0-9]*(?!(d|D|_))) part, which is trying not to match if there is a d, D or _ after the dot (with zero or more numbers in between). Any suggestions?
Instead of using a negative lookahead, you might use a negated character class to match any character that is not in the character class.
If you only want to match word characters without the dD_ or a whitespace char you could use [^\W_Dd\s].
You might also remove the \W and \s to match all except dD_
^[0-9]*\.[^\W_Dd\s]*$
Explanation
^ Start of string
[0-9]*\. Match 0+ times a digit 0-9 followed by a dot
[^\W_Dd\s]* Negated character class, match 0+ times a word character without _ D d or whitespace char
$ End of string
Regex demo
If you don't want to use anchors to assert the start and the end of the string, you could also use lookarounds to assert that what is on the left and right is not a non-whitespace char:
(?<!\S)[0-9]*\.[^\W_Dd\s]*(?!\S)
Regex demo
\d*[.](?!.*[_Dd]).* is what you are looking for.
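A small Python sketch, using the lists from the question, shows why the original lookahead lets unwanted strings through and checks the anchored negated-class pattern against every example. The variable names are illustrative.

```python
import re

should_match = ['1.0', '1.', '0.1', '.1', '1.2345']
should_not = ['1d2', '1.2d3', '1._dp', '1.0_dp', '1.123165d0',
              '1.132_dp', '1D5', '1.2356D6']

original = re.compile(r"([0-9]*\.)([0-9]*(?!(d|D|_)))")
anchored = re.compile(r"^[0-9]*\.[^\W_Dd\s]*$")

# The lookahead only guards the position where [0-9]* stops, so the engine
# gives back a digit and still reports a shorter match.
partial = original.search('1.2d3').group(0)
print(partial)

accepted = [s for s in should_match if anchored.search(s)]
rejected = [s for s in should_not if not anchored.search(s)]
print(accepted == should_match, rejected == should_not)
```

Anchoring both ends removes the engine's freedom to settle for a prefix, which is what the bare negative lookahead could not prevent.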

Regex to match numbers followed by a specific character

I am so sorry, I know this is a simple question, which may not be appropriate here, but I am terrible at regex.
I am using PHP's preg functions with a pattern of the form (numbers A) to make the following replacements:
2A -> <i>2A</i>
100 A -> <i>100 A</i>
84.55A -> <i>84.55A</i>
92.1 A -> <i>92.1 A</i>
The numbers can be separated from the character or not
The numbers can be decimal
The letter should not be the beginning of a word (so it should not match 4 All; in fact, A should be followed by a space, period or line break)
My problem is applying OR conditions to match a character that may or may not be present, so that I get a single match to replace with
$str = preg_replace($pattern, '<i>$1</i>', $str);
I can suggest
'~\b(?<![\d.])\d*\.?\d+\s*A\b~'
See the regex demo. Replace with '<i>$0</i>' where the $0 is the backreference to the whole match.
Details:
\b - leading word boundary
(?<![\d.]) - a negative lookbehind that fails the match if there is a dot or digit before the current location (NOTE: this is added to avoid matching 33.333.4444 A like strings, just remove if not necessary)
\d*\.?\d+ - a usual simplified float/int value regex (0+ digits, an optional . and 1+ digits) (NOTE: if you need a more sophisticated regex for this, see Matching Floating Point Numbers with a Regular Expression)
\s* - 0+ whitespaces
A\b - a whole word A (here, \b is a trailing word boundary).
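The question uses PHP's preg_replace; the sketch below tries the same pattern in Python instead, purely as an illustration. The italicize helper is a hypothetical name, and \g<0> is Python's counterpart of the $0 whole-match backreference.

```python
import re

PATTERN = re.compile(r"\b(?<![\d.])\d*\.?\d+\s*A\b")

def italicize(text: str) -> str:
    """Wrap every 'number A' occurrence in <i>...</i>."""
    # \g<0> refers to the whole match, like $0 in the PHP replacement.
    return PATTERN.sub(r"<i>\g<0></i>", text)

for s in ["2A", "100 A", "84.55A", "92.1 A", "4 All", "33.333.4444 A"]:
    print(s, "->", italicize(s))
```

The last two inputs are left unchanged: in 4 All the A starts a longer word, and in 33.333.4444 A the lookbehind rejects every candidate number.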