Matching words that end in any number of digits OR "X" - regex

Trying to match all of these:
{_someWord1} ... $1=someWord, $2=1
{_another82} ... $1=another, $2=82 (item in question)
{_testX} ... $1=test, $2=X
My regex: {_(\w+)(\d+|X)} matches all three, but the groups for the 2nd item are:
{_another82} ... $1=another8, $2=2
I'd like to be able to have any number of digits be in $2, and keep just the words in $1. Do I need to have a look ahead of some sort?

In most regex flavors, you could use ungreedy repetition, which consumes as little as possible (as opposed to the default - as much as possible):
{_(\w+?)(\d+|X)}
However, if the part before the digits can never contain digits or underscores (both of which are included in \w), you could simply use a more specific character class:
{_([a-zA-Z]+)(\d+|X)}
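To see the difference between the greedy original and the two fixes, here is a small sketch. The question does not name a language, so Python's re module is an assumption; the braces are escaped only as a precaution, and split_name is a hypothetical helper.

```python
import re

# Patterns from the question and the answer (braces escaped as a precaution).
GREEDY = re.compile(r"\{_(\w+)(\d+|X)\}")         # original: \w+ also swallows digits
LAZY = re.compile(r"\{_(\w+?)(\d+|X)\}")          # ungreedy repetition
LETTERS = re.compile(r"\{_([a-zA-Z]+)(\d+|X)\}")  # letters-only character class


def split_name(pattern, text):
    """Return (word, suffix) if the whole text matches the pattern, else None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None


for sample in ("{_someWord1}", "{_another82}", "{_testX}"):
    print(sample,
          split_name(GREEDY, sample),
          split_name(LAZY, sample),
          split_name(LETTERS, sample))
```

Only the second sample differs: the greedy pattern leaves a single digit for the second group, while the other two keep all the digits together.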

Try using a non-greedy match (adding a ? after \w+) to consume as little as possible and still match:
{_(\w+?)(\d+|X)}
or if your language (unspecified) supports look-arounds, then:
{_(\w+)(?<=[a-zA-Z])(\d+|X)}
which asserts that the last character of group 1 must be a letter (although digits and underscores may still appear elsewhere within group 1)
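A quick way to check the look-behind variant is the sketch below. It assumes Python's re module (the question leaves the language unspecified), escapes the braces as a precaution, and uses a hypothetical helper name. The sample "{_a1b82}" is made up to show a digit inside the word part.

```python
import re

# Look-behind variant: group 1 must end in a letter before the digits or X start.
LOOKBEHIND = re.compile(r"\{_(\w+)(?<=[a-zA-Z])(\d+|X)\}")


def split_with_lookbehind(text):
    """Return (word, suffix) if the whole text matches, else None."""
    m = LOOKBEHIND.fullmatch(text)
    return m.groups() if m else None


for sample in ("{_another82}", "{_testX}", "{_a1b82}"):
    print(sample, split_with_lookbehind(sample))
```

The greedy \w+ first grabs everything, then backtracks until the character just before group 2 is a letter, so all trailing digits end up in group 2.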

Related

regex doesn't match the word if it's not the last word

I'm trying to write a regex which can match a word in a string under these conditions:
the word must be 8 characters long.
the word must have 1 alphabetic character at any position of the word.
the word must have 7 digits at any position of the word.
\b(?=\w{8}\z)(?=[^a-zA-Z]*[a-zA-Z]{1})(?=(?:[\D]*[\d]){7}).*\b
This can find "123r1234" and "foo 123r1234" but it doesn't find "foo bar 123r1234 foo".
I tried to add word boundaries, but it didn't work.
What is wrong with my regex and how can I fix it?
Thanks.
You can use the following regex:
\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b
See demo
There are several things to note here:
It is not necessary to enclose single shorthand classes (like \d) into character classes (pattern becomes too awkward and less readable). Thus, use \D instead of [\D].
As a rule, the number of look-aheads should equal the number of conditions minus 1 (see Fine-Tuning: Removing One Condition at rexegg.com). Most often, length-restriction look-aheads containing just one character or character class are good candidates for being moved into the base pattern. Here, the (?=\w{8}) look-ahead can be dropped and \w{8} used in place of .* at the end.
The (?=\w{8}\z) look-ahead contains an end-of-string \z anchor that forces a match at the end of the string, while you need (as now I know) the end of a word.
[a-zA-Z]{1} is equal to [a-zA-Z] since {1} means exactly one repetition, so it is redundant (again, regex patterns should be as clean and concise as they can be).
UPDATE (+1 goes to #Jonny5)
There is another way of approaching the current problem: require the word to contain 8 word characters, but match only 1 letter surrounded by any number of digits. This can be achieved with
(?i)\b(?=\w{8}\b)\d*[a-z]\d*\b
See another demo (Note i modifier is used here)
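For anyone who wants to try both patterns locally, here is a minimal sketch in Python (an assumption; the question does not say which engine is used). The second test string is made up to probe an edge case.

```python
import re

# Look-ahead version from the answer above.
LOOKAHEADS = re.compile(r"\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b")
# Version from the UPDATE: digits, one letter, digits, 8 word characters in total.
ONE_LETTER = re.compile(r"(?i)\b(?=\w{8}\b)\d*[a-z]\d*\b")

text = "foo bar 123r1234 foo"
both_find_it = (LOOKAHEADS.findall(text), ONE_LETTER.findall(text))
print(both_find_it)

# Made-up edge case: an all-digit word followed by another word.
edge = "12345678 a1"
edge_results = (LOOKAHEADS.findall(edge), ONE_LETTER.findall(edge))
print(edge_results)
```

Both patterns find 123r1234 in the question's string. On the edge case, the look-aheads in the first pattern are not confined to the current word, so they can pick up the letter from a later word and accept 12345678; the UPDATE pattern consumes the letter itself and rejects it.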
You can remove the trailing .* and replace it with \w{8} (which also makes the (?=\w{8}\z) look-ahead unnecessary):
\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:[\D]*[\d]){7})\w{8}\b
You can view it running here:
https://regex101.com/r/bX6rK8/1

ColdFusion Regex Match for Digits of Exact Length

I need some assistance constructing a regular expression in a ColdFusion application. I apologize if this has been asked. I have searched, but I may not be asking for the correct thing.
I am using the following to search an email subject line for an issue number:
reMatchNoCase("[0-9]{5}", mailCheck.subject)
The issue number contains only numeric values, and should be exactly 5 digits. This is working except in cases where I have a longer number that appears in the string, such as 34512345. It takes the first 5 digits of that string as a valid issue number as well.
What I want is to retrieve only 5 digit numbers, nothing shorter or longer. I am then placing these into a list to be looped over and processed. Do I perhaps need to include spaces before and after in the regex to get the desired result?
Thank you.
The general way to exclude content from occurring before/after a match is to use negative lookbehind before the match and a negative lookahead afterwards. To do this for numeric digits would be:
(?<!\d)\d{5}(?!\d)
(Where \d is the shorthand for [0-9])
CF's regex supports lookaheads, but unfortunately not lookbehinds, so that wouldn't work directly in rematch - however that probably doesn't matter in this case because it's likely that you don't want, for example, abc12345 to match either - so what you more likely want is:
\b\d{5}\b
Where \b is a "word boundary" - roughly, it checks for a change between a "word character" and a non-word character (or vice versa) - so in this case the first \b will check that there is NOT one of [a-zA-Z0-9_] before the first digit, and the second \b will check that there isn't one after the fifth digit. A \b does not add any characters to the match (i.e. it is a zero-width assertion).
Since you're not dealing with case, you don't need the NoCase variant and can simply write:
rematch( '\b\d{5}\b' , mailCheck.subject )
The benefit of this over simply checking for spaces is that the result is five digits (no need to trim), but the downside is that it would match values such as [12345] or 3.14159^2 which are probably not what you want?
To check for spaces, or the start/end of the string, you can do:
rematch( '(?:^| )\d{5}(?= |$)' , mailCheck.subject )
Then use trim on each result to remove spaces.
If that's not what you're after, go ahead and provide more details.
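To compare the variants side by side, here is a rough sketch in Python rather than ColdFusion (an assumption made for illustration; unlike CF's rematch, Python's re module does support lookbehind). The subject strings are made up.

```python
import re

subject = "Issue 12345, ref 34512345, abc12345 and [67890]"

# Original pattern: also picks up the first five digits of 34512345.
plain = re.findall(r"\d{5}", subject)
# Lookbehind/lookahead: no digit directly before or after the five digits.
isolated = re.findall(r"(?<!\d)\d{5}(?!\d)", subject)
# Word boundaries: additionally rejects abc12345.
bounded = re.findall(r"\b\d{5}\b", subject)

# Space-delimited version: matches may carry a leading space, so trim them.
spaced_subject = "12345 is open, see 67890"
spaced = [m.strip() for m in re.findall(r"(?:^| )\d{5}(?= |$)", spaced_subject)]

print(plain, isolated, bounded, spaced, sep="\n")
```

Note that the word-boundary version still accepts [67890], as mentioned above, while the space-delimited version would not.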

match the same unknown character multiple times

I have a regex problem I can't seem to solve. I actually don't know if regex can do this, but I need to match a range of characters n times at the end of a pattern.
eg. blahblah[A-Z]{n}
The problem is that the characters matching the ending range must all be the same.
For example, I want to match
blahblahAAAAA
blahblahEEEEE
blahblahQQQQQ
but not
blahblahADFES
blahblahZYYYY
Is there some regex pattern that can do this?
You can use this pattern: blahblah([A-Z])\1+
The \1 is a back-reference to the first capture group, in this case ([A-Z]). And the + will match that character one or more times. To limit it you can replace the + with a specific number of repetitions using {n}, such as \1{3} which will match it three times.
If you need the entire string to match then be sure to prefix with ^ and end with $, respectively, so that the pattern becomes ^blahblah([A-Z])\1+$
You can read more about back-references here.
In most regex implementations, you can accomplish this by referencing a capture group in your regex. For your example, you can use the following to match the same uppercase character five times:
blahblah([A-Z])\1{4}
Note that to match the character n times, you need to use \1{n-1}, since one occurrence is already consumed by the capture group.
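The n-1 rule can be wrapped up in a small helper. This is only a sketch in Python (the question does not name a language), and repeated_letter_pattern is a hypothetical name.

```python
import re


def repeated_letter_pattern(prefix, n):
    """Compile: prefix followed by the same uppercase letter exactly n times."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The capture group consumes one letter; \1{n-1} requires n-1 more copies.
    return re.compile(re.escape(prefix) + r"([A-Z])\1{" + str(n - 1) + "}")


pattern = repeated_letter_pattern("blahblah", 5)
for s in ("blahblahAAAAA", "blahblahEEEEE", "blahblahQQQQQ",
          "blahblahADFES", "blahblahZYYYY"):
    # fullmatch plays the role of the ^ ... $ anchors mentioned above.
    print(s, bool(pattern.fullmatch(s)))
```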
blahblah(.)\1*\b should work in nearly all language flavors. (.) captures one of anything, then \1* matches that (the first match) any number of times.
blahblah([A-Z]|[a-z])\1+
This should help.

How to write hive regex to match condition 1 OR condition 2 and return whichever matches?

I need to have "or" logic in my regexp.
For example, from "foobar435" I would need the three numbers, so "435"
But from "barfoo543" I would need the three letters before the three numbers, so "foo"
Individually, the regexes would be "foobar([0-9]){3}" to get the first case, and "[a-zA-Z]{3}([0-9]{3})[a-zA-Z]{3}" to get the second case. How do I get both cases at once with one regexp? So, if the first regexp matches then return "435", but if not, return "foo"?
I am using hive so ideally I want to make one call only. So far I have...
REGEXP_EXTRACT(myString, 'foobar([0-9]){3}', 1) AS columnName
Not sure how to add the second case into this. Thanks!
You can use lookarounds for this.
In your first case, you want to match three digits preceded by "foobar" (use lookbehind):
(?<=foobar)[0-9]{3}
In your second case, you want to match three letters preceded by three letters (use lookbehind) and followed by three digits (use lookahead):
(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})
Note that, if I interpreted your requirements correctly, it looks like you flipped the numeric part with the second alpha part in your expression.
Now that you have your two expressions, you just need to combine them with an 'or':
(?<=foobar)[0-9]{3}|(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})
One thing to be aware of is that this will also match words with additional word characters on either end, i.e. "xfoobar435x". If this is undesirable, add a word boundary \b to the beginnings of the lookbehinds and to the end of the lookahead.
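It is worth testing the combined expression on both sample strings before relying on it. The sketch below uses Python's re module purely for illustration (an assumption; Hive's regex engine may differ in details), and GUARDED is a hypothetical variant, not part of the answer above.

```python
import re

# Combined expression from the answer.
COMBINED = re.compile(
    r"(?<=foobar)[0-9]{3}|(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})")

# Hypothetical variant: the second branch refuses "bar" when it completes "foobar".
GUARDED = re.compile(
    r"(?<=foobar)[0-9]{3}"
    r"|(?<=[a-zA-Z]{3})(?!bar(?<=foobar))[a-zA-Z]{3}(?=\d{3})")


def first_match(pattern, text):
    """Return the leftmost match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


for s in ("foobar435", "barfoo543"):
    print(s, first_match(COMBINED, s), first_match(GUARDED, s))
```

Because a regex search returns the leftmost match, the second branch wins on "foobar435" in this sketch: "bar" (preceded by three letters, followed by three digits) starts earlier in the string than "435". The guarded variant is one possible way to make the first case return the digits.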

Matching parts of string that contain no consecutive dashes

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
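Here is a minimal sketch of how this pattern behaves on the question's input, assuming Python's re module (the question does not say which flavor is used). The group is made non-capturing here, and empty matches are filtered out because the pattern can also match nothing; pieces is a hypothetical helper name.

```python
import re

# Same pattern as above, with a non-capturing group.
PIECE = re.compile(r"-?(?:[^-]-?)*")


def pieces(text):
    """Return the non-empty runs that contain no two consecutive dashes."""
    return [m.group(0) for m in PIECE.finditer(text) if m.group(0)]


print(pieces("qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"))
# ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```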
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.
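The splitting suggestion is easy to try out. The sketch below assumes Python 3.7 or later, where re.split accepts patterns that match the empty string; Python's built-in re module has no \G, so the last regex above is not reproduced here. split_between_dashes is a hypothetical helper name.

```python
import re


def split_between_dashes(text):
    """Split at every position that has a dash on both sides."""
    return re.split(r"(?<=-)(?=-)", text)


print(split_between_dashes("qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"))
# ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```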