Match regex for given statement - regex

I want to write a regex that matches the bolded strings in the following statement: "The following strings must be matched
xyz.90001DUS.annotations and xyz.765896DUS.courses".
I tried to write one, but it does not match the strings above. Can someone please help me?
It should match each bolded string in full; this is the only criterion.
^xyz.([0-9])?DUS.annotations(.*)?\.annotations$

Your ^xyz.([0-9])?DUS.annotations(.*)?\.annotations$ cannot match the strings inside a longer string because of the anchors, ^ and $. Besides, an unescaped . matches any char other than line break chars, and ([0-9])? matches a single optional digit (while 90001 has five). The (.*)?\.annotations part matches any zero or more chars other than line break chars, as many as possible, consuming chars up to the last occurrence of .annotations, so the pattern as a whole requires .annotations to appear twice.
What you can use is
xyz\.\d+DUS\.\w+
Or, with word boundaries:
\bxyz\.\d+DUS\.\w+ <<< In most NFA regex flavors
\yxyz\.\d+DUS\.\w+ <<< In PostgreSQL, Tcl
\mxyz\.\d+DUS\.\w+ <<< R (TRE), Tcl
\<xyz\.\d+DUS\.\w+ <<< GNU word boundary
[[:<:]]xyz\.\d+DUS\.\w+ <<< POSIX word boundary
See the regex demo. You do not need a word boundary after \w+: a trailing greedy \w+ always stops at a word boundary, in any regex pattern.
Details:
xyz\. - xyz.
\d+ - one or more digits
DUS\. - DUS.
\w+ - one or more word chars.
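As a quick check, here is a minimal Python sketch of the word-boundary variant. The sample sentence is taken from the question; the names PATTERN and find_ids are illustrative, and the sketch assumes Python's re module, where \b, \d and \w behave as described above.

```python
import re

# Word-boundary variant of the pattern suggested in the answer.
PATTERN = re.compile(r"\bxyz\.\d+DUS\.\w+")


def find_ids(text):
    """Return every xyz.<digits>DUS.<word chars> substring found in text."""
    return PATTERN.findall(text)


sample = ("The following strings must be matched "
          "xyz.90001DUS.annotations and xyz.765896DUS.courses.")
print(find_ids(sample))
```

Since there are no anchors, the pattern finds both strings inside the longer sentence, and the trailing period is left out because it is not a word char.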

Related

Regex for finding words containing more than 3 'a' characters

I need to write a regex that will find all words with 3 or more 'a' letters. Suppose that each word is on a new line.
Example of correct words:
Anagram
Assassination
Abaca
I ended up with something like this:
^([^aA]*a[^aA]*a[^aA]*a)$
But it does not work correctly if there are more than 3 'a' letters or if the word starts with 'a'.
I would keep it simple and just use:
\b\w*[Aa]\w*[Aa]\w*[Aa]\w*\b
This regex pattern matches any word containing at least three a/A characters (lower or upper case), appearing anywhere in the word.
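A small Python sketch of this pattern, assuming one word per list entry; the extra words Banana, Alpha and cat are made-up test data, and WORD_WITH_3A is an illustrative name.

```python
import re

# Three a/A characters, each optionally surrounded by other word chars.
WORD_WITH_3A = re.compile(r"\b\w*[Aa]\w*[Aa]\w*[Aa]\w*\b")

words = ["Anagram", "Assassination", "Abaca", "Banana", "Alpha", "cat"]

# fullmatch() requires the whole word to match the pattern.
matching = [w for w in words if WORD_WITH_3A.fullmatch(w)]
print(matching)
```

Alpha (two a's) and cat (one a) are filtered out; the other four words are kept.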
Here is what I tried:
^(?i)(?:[b-z]*a){3}[a-z]*$
^ - Start line anchor.
(?i) - Match rest case-insensitive.
(?:[b-z]*a){3} - A non-capture group that matches 0+ characters in the range b-z, up to a literal "a". Repeated three times.
[a-z]* - Match any possible remainder.
$ - End line anchor.
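For readers trying this in Python, a hedged sketch: recent Python versions only accept a global inline flag such as (?i) at the very start of a pattern, so ^(?i)... is rejected there. The sketch below passes the flags to re.compile instead, which is an adaptation and not the pattern exactly as written above. The word list and the name THREE_A are illustrative.

```python
import re

# Same pattern with the inline (?i) replaced by the IGNORECASE flag.
# MULTILINE makes ^ and $ match per line (one word per line is assumed).
THREE_A = re.compile(r"^(?:[b-z]*a){3}[a-z]*$", re.IGNORECASE | re.MULTILINE)

words = "Anagram\nAssassination\nAbaca\nAlpha\nCat"
found = THREE_A.findall(words)
print(found)
```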
If you want to use the anchors, you can append .* to match the rest of the line, and add \n to the negated character class to prevent crossing newlines.
^[^aA\n]*[aA][^aA\n]*[aA][^aA\n]*[aA].*$
Or a bit shorter
^(?:[^aA\n]*[aA]){3}.*$
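A minimal Python sketch of the shorter anchored pattern, assuming the input is a multi-line string; the sample lines and the name AT_LEAST_3A_LINE are made up for illustration.

```python
import re

# MULTILINE is needed so that ^ and $ anchor at every line, not just the string.
AT_LEAST_3A_LINE = re.compile(r"^(?:[^aA\n]*[aA]){3}.*$", re.MULTILINE)

text = "Anagram\nbanana split\nAlpha\nAbaca"
lines = AT_LEAST_3A_LINE.findall(text)
print(lines)
```

Because the character class excludes \n, the search for a third "a" in Alpha cannot run on into the next line.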

Matching first and last three characters of regex (including overlap)

I am trying to put together a regex that matches a word (only one per line) that starts and ends with the same three characters.
I was able to write a solution for words that are at least 6 characters long (meaning there is no overlap), but I am unsure how to do it for overlapping starts and ends such as "heheh".
This is what I have, nice and simple:
^(...).*\1$
I am inclined to believe that this might have something to do with lookahead and lookbehind, but I am not sure.
Any help would be appreciated, thank you!
You will need lookarounds since they are non-consuming patterns, i.e. the regex index is not advanced when the lookaround pattern is matched.
For example, you may do this with GNU grep:
grep -P '^(?=(...)).+\1$' file
grep -P '^(?=(\S{3})).+\1$' file # To avoid counting in spaces
grep -P '^(?=(\w{3})).+\1$' file # Or only allowing letters/digits/underscores
grep -P '^(?=(\p{L}{3})).+\1$' file # Or only allowing letters
Details
^ - start of string
(?=(...)) - a positive lookahead with a capturing group inside that matches any 3 chars
.+ - any 1+ chars other than line break chars as many as possible
\1 - Group 1 value
$ - the end of string.
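The same idea can be sketched in Python, whose re module also supports lookaheads and backreferences. The helper name same_first_and_last_three and the sample words other than "heheh" are illustrative.

```python
import re

# The lookahead captures the first three chars without consuming them,
# so .+ and \1 are free to overlap with that captured prefix.
SAME_EDGES = re.compile(r"^(?=(...)).+\1$")


def same_first_and_last_three(word):
    """True if word starts and ends with the same three characters."""
    return SAME_EDGES.match(word) is not None


for word in ["heheh", "abcabc", "abcab", "abc", "aaaa"]:
    print(word, same_first_and_last_three(word))
```

A bare 3-char word such as abc does not match, since .+ needs at least one char before \1.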
To extract words, you may use \w shorthand (that matches letters, digits and underscores) and word boundaries \b:
grep -oP '\b(?=(\w{3}))\w+\1\b' file
Details
\b - a word boundary (start of word here, because it is followed with word chars)
(?=(\w{3})) - a positive lookahead making sure there are 3 word chars while capturing them into Group 1
\w+ - 1+ word chars (not 0 or more because otherwise a 3-char word would be matched)
\1 - Group 1 value
\b - end of word here (as it is preceded with word chars).
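A rough Python counterpart of the grep -oP extraction; the sample sentence and the name extract_words are made up. One thing to watch for in Python: re.findall returns only Group 1 when the pattern has a capturing group, so the sketch uses finditer and takes the whole match.

```python
import re

WORD = re.compile(r"\b(?=(\w{3}))\w+\1\b")


def extract_words(text):
    """Return whole words that start and end with the same three word chars."""
    # group(0) is the full match; findall() would return Group 1 instead.
    return [m.group(0) for m in WORD.finditer(text)]


print(extract_words("heheh ate the tartar sauce quickly"))
```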

RegEx: don't capture match, but capture after match

There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is finding AB33(20) and then capturing everything up to the first whitespace. But AB33(20) can also be AB-33(20), AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it does not:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?
With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
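The fixed-length restriction can also be seen in Python, used here only as an illustration (the question itself is about PHP). Python's built-in re module rejects the variable-length lookbehind and does not support \K, so this sketch swaps in a capturing group for the part after the marker. The names MARKER_THEN_VALUE and value_after_marker are hypothetical.

```python
import re

line = "Name Subname 11X22 88X620 AB33(20) YA5619 77,66"

# The variable-length lookbehind from the question fails to compile in re.
try:
    re.compile(r"(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Capture-group stand-in for \K: match the marker, capture what follows.
MARKER_THEN_VALUE = re.compile(r"AB-?\d+(?:\(-?\d+\))? ([^ ]+)")


def value_after_marker(text):
    """Return the token right after the AB marker, or None if absent."""
    m = MARKER_THEN_VALUE.search(text)
    return m.group(1) if m else None


print(lookbehind_ok, value_after_marker(line))
```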
It depends on the language. In .NET, for example, your pattern matches because variable-length lookbehind is supported.
Another solution might be to use a character class listing the characters you allow after AB. Then match a whitespace character, reset the match start with \K, and match \S+, which matches 1+ non-whitespace characters.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match AB literally, preceded by a word boundary to prevent AB from being part of a longer word
[()\d-]+ Match 1+ times any of the characters listed in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ times a non-whitespace character
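A hedged Python sketch of the character-class variant. Since Python's re module has no \K, the sketch captures \S+ in a group instead; that is a substitution for illustration, not the PHP pattern itself. AFTER_MARKER and token_after_marker are made-up names.

```python
import re

# Character-class variant; (\S+) replaces the \K\S+ of the PCRE pattern.
AFTER_MARKER = re.compile(r"\bAB[()\d-]+\s(\S+)")


def token_after_marker(text):
    """Return the non-whitespace token following the AB marker, if any."""
    m = AFTER_MARKER.search(text)
    return m.group(1) if m else None


# The four marker spellings mentioned in the question.
for marker in ["AB33(20)", "AB-33(20)", "AB33(-20)", "AB33(-1)"]:
    line = f"Name Subname 11X22 88X620 {marker} YA5619 77,66"
    print(marker, token_after_marker(line))
```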

Regex to match numbers followed by a specific character

I am so sorry, I know this is a simple question, which is not appropriate here, but I am terrible at regex.
I want to match a pattern of the form (number A) and replace each match with the wrapped substring:
2A -> <i>2A</i>
100 A -> <i>100 A</i>
84.55A -> <i>84.55A</i>
92.1 A -> <i>92.1 A</i>
The numbers can be separated from the character or not
The numbers can be decimal
The letter should not be the beginning of a word (so 4 All should not match; in fact, A should be followed by a space, a period or a line break)
My problem is how to apply OR conditions for a character that may or may not be present, so that I get a single match to replace with
$str = preg_replace($pattern, '<i>$1</i>', $str);
I can suggest
'~\b(?<![\d.])\d*\.?\d+\s*A\b~'
See the regex demo. Replace with '<i>$0</i>' where the $0 is the backreference to the whole match.
Details:
\b - leading word boundary
(?<![\d.]) - a negative lookbehind that fails the match if there is a dot or digit before the current location (NOTE: this is added to avoid matching 33.333.4444 A like strings, just remove if not necessary)
\d*\.?\d+ - a usual simplified float/int value regex (0+ digits, an optional . and 1+ digits) (NOTE: if you need a more sophisticated regex for this, see Matching Floating Point Numbers with a Regular Expression)
\s* - 0+ whitespaces
A\b - a whole word A (here, \b is a trailing word boundary).
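For comparison, a small Python sketch of the same replacement (the answer itself targets PHP's preg_replace). In Python's re.sub the whole match is referenced as \g<0> rather than $0. The function name italicize and the sample sentence are illustrative.

```python
import re

# Same pattern as above, without the PHP ~...~ delimiters.
AMP_VALUE = re.compile(r"\b(?<![\d.])\d*\.?\d+\s*A\b")


def italicize(text):
    """Wrap each 'number A' occurrence in <i>...</i>."""
    return AMP_VALUE.sub(r"<i>\g<0></i>", text)


for s in ["2A", "100 A", "84.55A", "92.1 A", "4 All", "33.333.4444 A"]:
    print(s, "->", italicize(s))
```

The last two inputs come back unchanged: in 4 All the A starts a longer word, and 33.333.4444 A is blocked by the lookbehind.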

What's a RegEx for "up to three words but no more than 20 characters"?

I can use \s?(\w+\s){0,2}\w*) for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Trying to merge the two via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm.". The matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
To match up to 3 words separated with whitespace(s) at the end of a line or string, with no more than 20 word characters in total, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo, I use [^\S\r\n] instead of the \s in the lookahead since the text contains newlines. Use the same trick if you need that.
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a lookahead check that fails the match if after the initial word boundary there are 21 word characters optionally preceded with any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed with 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if there is the end-of-string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or equivalent for your language. If the regex flavor allows, use \p{L} Unicode category/property class.
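A minimal Python sketch of the pattern on single-line input; the target here is InDesign's regex engine, so Python is only a stand-in for trying the logic. The sample sentences are the question's examples without the final period, and last_words is an illustrative name. Note that the 20-character cap counts word characters only, because the lookahead skips whitespace with \s*.

```python
import re

LAST_WORDS = re.compile(
    r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])"
)


def last_words(paragraph):
    """Return up to three trailing words with at most 20 word chars in total."""
    m = LAST_WORDS.search(paragraph)
    return m.group(0) if m else None


print(last_words("The early bird catches the worm"))
print(last_words("Here we have a supercalifragilisticexpialidocious sentence"))
```

In the second sentence only the last word is returned, since any longer tail would include the 34-letter word and exceed the cap.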