Tricky Regular Expression with an Alphanumeric pattern in uppercase [duplicate] - regex

This question already has answers here:
Can you make just part of a regex case-insensitive?
(5 answers)
Closed 3 years ago.
Okay, this might not be tricky at all for some, but at the moment it is really messing with my head.
First of all, I don't know what engine I am dealing with, but it doesn't seem to distinguish uppercase from lowercase.
I have a string, for example:
Circuit Ref
Service Type
A End Address
Z End Address
52GD J32SD41 O2AE EVC001
Evolve Internet
And I am only trying to extract the string "52GD J32SD41 O2AE EVC001". I have already tried quite a few combinations like
[0-9A-Z]{4}\s[0-9A-Z]+\s[0-9A-Z]+\s[0-9A-Z]+
[A-Z0-9]{4}\s\W+\s\W+\s\W+
[A-Z0-9]{4}\s[A-Z0-9\s]*[A-Z0-9\s]*[A-Z0-9\s]*
Nothing seems to work. I want to keep the expression fairly flexible, as the order of the letters and digits can change, but the pattern is mostly the same. Any nudge in the right direction will be greatly appreciated.
Thanks

This is a wild guess, but please try the following things:
in front of the regex add (?-i) (Related question, regular-expressions.info, net page about regex)
enclose regex with (?-i: ... )
enclose regex with (?I: ... )
BTW, regarding the 2nd case that you tried: [A-Z0-9]{4}\s\W+\s\W+\s\W+.
It seems that you tried to use \W as "upper case word character", but that is not what it means.
\W means anything that is not \w, that is, any non-word character.
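To make the suggestions above concrete, here is a minimal sketch in Python. The asker's engine is unknown, so this only simulates it: the re.IGNORECASE flag stands in for an engine that ignores case by default, and the scoped (?-i: ... ) group turns case sensitivity back on. Whether the real engine accepts that syntax is an assumption.

```python
import re

text = """Circuit Ref
Service Type
A End Address
Z End Address
52GD J32SD41 O2AE EVC001
Evolve Internet"""

# First pattern from the question, unchanged.
pattern = r"[0-9A-Z]{4}\s[0-9A-Z]+\s[0-9A-Z]+\s[0-9A-Z]+"

# Assumption: the unknown engine behaves as if case-insensitivity were on.
insensitive = re.search(pattern, text, re.IGNORECASE)

# The scoped group (?-i:...) switches case-insensitivity off inside it.
sensitive = re.search(f"(?-i:{pattern})", text, re.IGNORECASE)

# \W is "non-word character", so it cannot match a token such as J32SD41.
w_upper = re.fullmatch(r"\W+", "J32SD41")

print(repr(insensitive.group()))  # matches too early, inside the header lines
print(repr(sensitive.group()))    # the circuit reference line
print(w_upper)                    # None
```

With case-insensitivity on, [0-9A-Z] also matches lowercase letters and \s also matches newlines, so the first match starts in the header lines instead of at the circuit reference.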


Regex to match only if symbol is found at most once in fixed length pattern [duplicate]

This question already has an answer here:
Restricting character length in a regular expression
(1 answer)
Closed 3 years ago.
I need help with a regex that should match a fixed length pattern.
For example, the following regex allows for at most 1 ( and 1 ) in the matched pattern:
([^)(]*\(?[^)(]*\)?[^)(]*)
However, I cannot / do not want to use this solution because of the *: the text I have to scan through is very large, and using it seems to really hurt performance.
I thus want to impose a match length limit, e.g. using {10,100} for example.
In other words, the regex should only match if
there is at most one pair of parentheses inside the string
the total length of the match is fixed, e.g. not infinite (No *!)
This seems to be a solution to my problem, however I do not get it to work and I have trouble understanding it.
I tried to use the accepted answer and created this:
^(?=[^()]{5,10}$)[^()]*(?:[()][^()]*){0,2}$
which does not seem to really work as expected: https://regex101.com/r/XUiJZz/1
Also, please do not mark this question as a duplicate of another question if the answers there make use of the Kleene star operator; that won't help me.
Edit:
I know this is a possible solution, but I'm wondering if there is a better way to do it:
([^)(]{0,100}\(?[^)(]{0,100}\)?[^)(]{0,100})
Regarding "I thus want to impose a match length limit, e.g. using {10,100}":
You may want to add anchors and a lookahead assertion to your regex:
^(?=.{10,100}$)[^)(]*(?:\(?[^)(]*\))?[^)(]*$
(?=.{10,100}$) is a lookahead condition asserting that the length of the string must be between 10 and 100. The $ inside the lookahead matters: without it, only the minimum length of 10 would be enforced.
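As a rough check of the length-limiting lookahead, here is a small Python sketch. The sample strings are made up for illustration. Note that the sketch places a $ inside the lookahead; the second pattern leaves it out to show that the assertion then only enforces the lower bound.

```python
import re

# '$' inside the lookahead makes {10,100} bound the length of the whole string.
bounded = re.compile(r"^(?=.{10,100}$)[^)(]*(?:\(?[^)(]*\))?[^)(]*$")

# Without it, the lookahead only guarantees "at least 10 characters".
min_only = re.compile(r"^(?=.{10,100})[^)(]*(?:\(?[^)(]*\))?[^)(]*$")

# Hypothetical test inputs, not taken from the question.
samples = [
    "hello (world) again",   # one pair of parentheses, length within bounds
    "short",                 # fewer than 10 characters
    "a (b) c (d) e f g h",   # two pairs of parentheses
    "x" * 100,               # exactly at the upper bound
    "x" * 101,               # one character too long
]

for s in samples:
    print(len(s), bool(bounded.search(s)), bool(min_only.search(s)))
```

The lookahead does the length check once at the start of the string, so the rest of the pattern can keep its * quantifiers without allowing an over-long match.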

negation classes regex [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I wrote this regex to tokenize a text: "\b\w+\b"
but someone suggested that I convert it into \b[^\W\d_]+\b
Can anyone explain to me why this second way (using negation) is better?
Thanks
The first one matches all letters, numbers and the underscore. Depending on the regex engine, this may include Unicode letters and numbers. (The word boundaries are superfluous in this case, by the way.)
The second regex matches only letters (the negated class excludes non-word characters, digits and the underscore). Due to the word boundaries, it will only match a run of letters if it is surrounded by non-word characters or the start/end of the string.
If your regex engine supports this, you might want to use [[:alpha:]] or \p{L} (or [A-Za-z] in case of non-unicode) instead to make your intent clearer.
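The difference is easy to see on a small sample. The sketch below uses Python's re module and a made-up input string; other engines may treat Unicode differently, as noted above.

```python
import re

# Hypothetical sample text mixing plain words with digits and underscores.
text = "foo bar_1 baz42 qux"

# \w+ keeps letters, digits and underscores together in one token.
words = re.findall(r"\b\w+\b", text)

# [^\W\d_]+ keeps letters only; the boundaries reject runs that touch
# a digit or underscore, so bar_1 and baz42 produce no token at all.
letters_bounded = re.findall(r"\b[^\W\d_]+\b", text)

# Without the boundaries, the letter runs inside those tokens are returned.
letters_unbounded = re.findall(r"[^\W\d_]+", text)

print(words)
print(letters_bounded)
print(letters_unbounded)
```

So here the word boundaries are not superfluous for the second regex: they decide whether bar and baz count as tokens.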

How to build a regular expression which prohibits hyphens from appearing at the start and end of a string? [duplicate]

This question already has answers here:
RegEx for allowing alphanumeric at the starting and hyphen thereafter
(4 answers)
Closed 5 years ago.
I want to build a regular expression which only matches [A-Za-z0-9\-] with an additional rule that hyphens (-) are not allowed to appear at the start and at the end.
For example:
my-site is matched.
m is matched.
mysite- is not matched.
-mysite is not matched.
Currently, I've come up with ^[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]+$.
But this doesn't match m.
How can I change my regular expression so that it fits my needs?
Use lookarounds:
^(?!-)[A-Za-z0-9-]*(?<!-)$
The reason this works is that lookarounds don't consume input, so the lookahead and the lookbehind can both assert on the same character (which is why the single-character string m still matches).
Note that you don't need to escape the dash within the character class if it's the first or last character.
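Here is a quick Python check of the lookaround pattern against the four examples from the question. One thing to be aware of: because of the *, the pattern also accepts the empty string, so replace * with + if that is not wanted (that variant is my own suggestion, not part of the answer).

```python
import re

# Pattern from the answer: no hyphen at the start, no hyphen at the end.
pattern = re.compile(r"^(?!-)[A-Za-z0-9-]*(?<!-)$")

# Examples from the question, plus the empty string as an edge case.
candidates = ["my-site", "m", "mysite-", "-mysite", ""]

results = {s: bool(pattern.search(s)) for s in candidates}

for s, ok in results.items():
    print(repr(s), "matched" if ok else "not matched")
```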

How do regex positive look-behinds work?

I have been working through old Stack Overflow questions to improve my regex knowledge. Since I have a basic knowledge of regex, most of them were easy, but this regex problem is tough.
It asks for a regex that extracts from this kind of string ou=persons,ou=(.*),dc=company,dc=org the last string immediately preceded by a comma not followed by (.*). In the last case, this should give dc=company,dc=org.
The solution is (?<=,(?!.*\Q(.*)\E)).* but I cannot understand its flow. I understood the (?!.*\Q(.*)\E) portion, but the rest is still a mystery to me, especially ?<=, which is a positive look-behind. Does it search from the end of the string? Can anyone explain it to me like I am a 7-year-old kid? And please note that http://regex101.com/ is not helping.
The look-behind portion of the RegEx (?<=,(?!.*\Q(.*)\E)).* works like this:
1. Start at the beginning of the string, at the first character.
2. Can we match the thing we are looking for, ,(?!.*\Q(.*)\E), ending at the current position?
3. If we can't: move forward one character, go to 2, and check for a match again.
4. If a match is found: .* captures all the remaining characters (or, more generally, the engine then tries to match the remaining RegEx).
For a wordier explanation, consider reading Lookahead and Lookbehind Zero-Length Assertions.
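The steps above can be sketched in Python as a hand-written scan, next to the regex itself. This is only an illustration of the idea, not how a real engine is implemented. The function name scan_by_hand is made up, and since Python's re has no \Q...\E, the literal (.*) is escaped with re.escape instead.

```python
import re

text = "ou=persons,ou=(.*),dc=company,dc=org"
literal = "(.*)"

def scan_by_hand(s, lit):
    """Walk left to right, testing the look-behind condition at each position."""
    for pos in range(len(s) + 1):
        # Look-behind part: is there a comma immediately before this position?
        preceded_by_comma = pos > 0 and s[pos - 1] == ","
        # Nested negative lookahead: the literal must not occur from here on.
        literal_ahead = lit in s[pos:]
        if preceded_by_comma and not literal_ahead:
            # The trailing .* then takes everything that is left.
            return s[pos:]
    return None

# The same condition as a regex; re.escape replaces \Q...\E.
pattern = r"(?<=,(?!.*" + re.escape(literal) + r")).*"
match = re.search(pattern, text)

print(scan_by_hand(text, literal))
print(match.group())
```

So the search does not start from the end of the string: the engine moves forward as usual, and at each position the look-behind peeks at the character just before it.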
A lookbehind allows you to specify a context just before the actual match.
You can say ,(dc=) and only return the capture group, or ,\Kdc=, or (?<=,)dc= to return the match on dc= but require that the comma is present just before the match.
The facility also allows for multiple lookbehinds, so (in engines that support variable-length lookbehind) you could do (?<=a.*)(?<=b.*)c to match c only if it is preceded by both a and b somewhere in the input.
A lookbehind is basically syntactic sugar, in that you can usually rephrase your conditions using some other regex construct. It can be really handy when you have multiple unanchored constraints, like in the last example.
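To compare the capture-group and lookbehind variants side by side, here is a small Python sketch on a made-up input string. Python's built-in re module has no \K, so that variant is left out, and it only accepts fixed-width lookbehinds, so the (?<=a.*) example is rejected at compile time there.

```python
import re

# Hypothetical input, shortened from the question's example.
text = "ou=persons,dc=company,dc=org"

# Capture group: the overall match includes the comma, group 1 does not.
with_group = re.search(r",(dc=)", text)

# Lookbehind: the comma is required but is not part of the match.
with_lookbehind = re.search(r"(?<=,)dc=", text)

print(repr(with_group.group(0)), repr(with_group.group(1)))
print(repr(with_lookbehind.group(0)))

# Variable-length lookbehind is engine-dependent; Python's re refuses it.
try:
    re.compile(r"(?<=a.*)(?<=b.*)c")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(variable_width_ok)
```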

My regular expression matches too much. How can I tell it to match the smallest possible pattern? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I have this RegEx:
('.+')
It has to match character literals like in C. For example, if I have 'a' b 'a' it should match the a's and the quotes around them.
However, it also matches the b (it should not), probably because it is, strictly speaking, also between quotes.
Here is a screenshot of how it goes wrong (I use this for syntax highlighting):
I'm fairly new to regular expressions. How can I tell the regex not to match this?
It is being greedy and matching the first apostrophe and the last one and everything in between.
This should match anything that isn't an apostrophe.
('[^']+')
Another alternative is to try non-greedy matches.
('.+?')
Have you tried a non-greedy version, e.g. ('.+?')?
There are usually two modes of matching (or two sets of quantifiers): maximal (greedy) and minimal (non-greedy). The first results in the longest possible match, the latter in the shortest. You can read about it (although in a Perl context) in the Perl Cookbook (Section 6.15).
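Here is a short Python sketch of the greedy, non-greedy and negated-class variants on the example from the question. The asker's highlighting engine is unknown, so treat this as an illustration of the quantifier behaviour rather than a drop-in fix.

```python
import re

text = "'a' b 'a'"

# Greedy: .+ runs to the last apostrophe, swallowing the b in between.
greedy = re.findall(r"('.+')", text)

# Non-greedy: .+? stops at the first apostrophe that completes a match.
lazy = re.findall(r"('.+?')", text)

# Negated class: the content itself is not allowed to contain an apostrophe.
negated = re.findall(r"('[^']+')", text)

print(greedy)
print(lazy)
print(negated)
```

The negated class states the intent more directly, since it never lets the match cross an apostrophe in the first place.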
Try:
('[^']+')
The ^ means match every character except the ones listed in the square brackets. This way, it won't match 'a' b 'a' as a whole because there's a ' in between, so instead it'll give both instances of 'a'.
You need to escape the quotes:
\'[^\']+\'
Edit: Hmm, well, I suppose this answer depends on what language/system you're using.