Regular expression matching a numerical string exactly - regex

I have a string which looks like this 600/-4.412/11 and one which looks like this 600/11
[optional sign][float or integer]/[optional sign][float or integer]/[optional sign][float or integer]
[optional sign][float or integer]/[optional sign][float or integer]
Example:
1) 600/-4.412/11
2) 600/11
And I need to find a regular expression which matches 1 and one which matches 2. But both expressions mustn't select/match the other one. With my humble regex knowledge I managed to build this expression:
([-+]?[0-9]+(\.?[0-9]+)?\/?){3}
The problem with this expression is that it matches 1) as well as 2) according to http://gskinner.com/RegExr/. Hopefully someone can fix this or at least tell me why this is happening since I hoped that I only had to change {3} to {2} in order to get the different matching.

Problem
The problem is that your regex allows the repeated subpattern, i.e. [-+]?[0-9]+(\.?[0-9]+)?\/? to match without restricting it to each section of numbers which are delimited by /.
For this example in your question: 600/11, the first repetition will match 600/, the second will be 1 and the third one will be the last 1.
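A quick way to confirm this diagnosis is to run the original pattern against both inputs. The snippet below is a minimal Python sketch; the question does not name a language, so Python and the helper name matches_whole are assumptions made for illustration. It uses re.fullmatch so that the whole string has to be consumed.

```python
import re

# The pattern from the question (the escaped \/ is written as a plain /).
ORIGINAL = re.compile(r"([-+]?[0-9]+(\.?[0-9]+)?/?){3}")


def matches_whole(text):
    """Return True if the original pattern consumes the entire string."""
    return ORIGINAL.fullmatch(text) is not None


# Both inputs from the question match, and so does a bare run of three digits,
# because each repetition is satisfied by a single digit.
for sample in ["600/-4.412/11", "600/11", "007"]:
    print(sample, matches_whole(sample))
```

All three samples are accepted, which shows that {3} counts repetitions of the subpattern, not slash-delimited numbers.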
Solution 1
WRONG attempt
For validation, you can change it slightly to make it work as you want:
^([-+]?[0-9]+(\.[0-9]+)?(?:/|$)){3}$
(?:/|$) forces each number (floating point or integer) to be followed by either a / or the end of the string. This ensures that no two repetitions can match within the same number.
^ is added in front and $ behind to make sure that the string has exactly 3 numbers.
The explanations of (?:/|$) and of the anchors above still apply to the correct solution.
The CORRECT solution
However, the above regex is WRONG. It will still allow invalid input such as 1/2/3/ to match (ends with /). We need to add an extra assertion at the end to prevent the case above from matching:
^([-+]?[0-9]+(\.[0-9]+)?(?:/|$)){3}(?<!/)$
(?<!/) is a zero-width negative look-behind which checks that the character before the end of the string is not /.
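The difference between the two patterns is easiest to see side by side. Here is a small Python sketch, assuming Python's re module (which supports this kind of look-behind); the names WRONG and CORRECT are only labels for this example.

```python
import re

# First attempt: each number must be followed by "/" or the end of the string.
WRONG = re.compile(r"^([-+]?[0-9]+(\.[0-9]+)?(?:/|$)){3}$")

# Same pattern plus a look-behind that rejects a trailing slash.
CORRECT = re.compile(r"^([-+]?[0-9]+(\.[0-9]+)?(?:/|$)){3}(?<!/)$")

for sample in ["600/-4.412/11", "600/11", "1/2/3/"]:
    print(
        sample,
        WRONG.search(sample) is not None,
        CORRECT.search(sample) is not None,
    )
```

Only the input with the trailing slash is treated differently: the first pattern accepts 1/2/3/, the second rejects it.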
Solution 2
It is less buggy to write the regex in the form [number]([delimiter][number]){repeat} in such cases, rather than fiddling with the form ([number][delimiter/ending]){repeat}.
The regex below strictly validates the input:
^[-+]?[0-9]+(\.[0-9]+)?(?:/[-+]?[0-9]+(\.[0-9]+)?){2}$
The above is for matching the case of exactly 3 numbers. Change 2 to 1 (or remove {2}) to match exactly 2 numbers.
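The [number]([delimiter][number]){repeat} form also lends itself to being generated for any count. The following is a rough Python sketch of that idea; the function name exactly_n_numbers and the NUMBER constant are made up for this example, and re.fullmatch stands in for the ^ and $ anchors.

```python
import re

# One optionally signed integer or float, as in the regex above.
NUMBER = r"[-+]?[0-9]+(?:\.[0-9]+)?"


def exactly_n_numbers(n):
    """Compile a pattern for exactly n slash-separated numbers (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # First number, then (n - 1) repetitions of "/" followed by a number.
    return re.compile(rf"{NUMBER}(?:/{NUMBER}){{{n - 1}}}")


two = exactly_n_numbers(2)
three = exactly_n_numbers(3)

print(bool(three.fullmatch("600/-4.412/11")))  # True
print(bool(three.fullmatch("600/11")))         # False
print(bool(two.fullmatch("600/11")))           # True
```

Because the delimiter only ever appears between two numbers, a trailing slash cannot match and no extra assertion is needed.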

This happens because almost everything in your expression has been made optional. The only thing that must appear in each repetition is a single digit, so your expression would also match 007.
It follows that the solution is to make parts of the expression mandatory. There are lots of ways to approach this. One that does not exactly fit your description but which IMHO you should consider nonetheless is
([-+]?[0-9]+(\.[0-9]+)?(?=/|$))
This expression will match both types of inputs, but you can tell them apart by just looking at the count of matches (2 or 3) -- this would be equivalent to testing two different expressions before taking a branch.
Update:
The expression above is too liberal because it is not anchored at the start and end of the input. Here is one that is anchored, and thus will not match if the input contains any spurious characters:
^([-+]?[0-9]+(\.[0-9]+)?(/(?!$)|$)){2,3}$
Breaking it down:
^ start matching at the beginning
( match the following:
[-+]? optional sign
[0-9]+(\.[0-9]+)? integer or float
( either
/(?!$) a slash that's not trailing
| or
$ end of input
)
){2,3} do this exactly two or three times
$ and make sure there is no other input
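If a single expression is used for both shapes, the two cases still have to be told apart afterwards. One possible way, sketched in Python under the assumption that counting slashes is acceptable once the input has validated (the helper name count_numbers is hypothetical):

```python
import re

# The anchored expression from above, accepting two or three numbers.
TWO_OR_THREE = re.compile(r"^([-+]?[0-9]+(\.[0-9]+)?(/(?!$)|$)){2,3}$")


def count_numbers(text):
    """Return 2 or 3 for a valid input, or None if it does not validate."""
    if TWO_OR_THREE.search(text) is None:
        return None
    # A valid input has exactly one slash between neighbouring numbers.
    return text.count("/") + 1


print(count_numbers("600/11"))         # 2
print(count_numbers("600/-4.412/11"))  # 3
print(count_numbers("1/2/3/"))         # None
```

The caller can then branch on the returned count instead of testing two separate expressions.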

Related

Regular expression starts with a string but not contain a special character after that

I am trying to find a regular expression which matches the start of a string, but only if a specific character does not appear after it. This should give me routes on the same level.
Example: Let's say I have the following strings and I need to get routes starting from LAX with no stops.
LAX-LAS-JFK
LAX-PHX-JFK
LAX-JFK
LAX-PHX
The regex should match only route 3 and 4.
I have tried this ^LAX-([^-])* and it didn't work for me when I cross checked on https://www.regextester.com/15.
You can try this:
^LAX(-[A-Z]+){1}$
This matches
LAX-JFK
LAX-PHX
but not
LAX-LAS-JFK
LAX-PHX-JFK
Demo: regex101
Explanation:
^ start
$ end
{1} exact number of repetitions of a pattern, in this case 1
Fun fact: you can replace the 1 by (number of stops + 1), and it will select only the routes with the defined number of stops (another example).
So it sounds like you want to match with strings that only have 1 dash. Perhaps something like this ^(LAX)(-{1})[a-zA-Z]+$ would work? It will check to make sure the string LAX is in the beginning, followed by one dash and ending with alphabetical characters.
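The repetition-count idea can be turned into a small filter. This is only a sketch in Python; the function name routes_with_stops and its parameters are invented for the example, and it assumes airport codes consist of upper-case letters only.

```python
import re


def routes_with_stops(routes, stops, origin="LAX"):
    """Keep the routes that start at origin and have exactly `stops` stops."""
    # A route with k stops has k + 1 further airport codes after the origin.
    pattern = re.compile(rf"^{re.escape(origin)}(?:-[A-Z]+){{{stops + 1}}}$")
    return [route for route in routes if pattern.search(route)]


routes = ["LAX-LAS-JFK", "LAX-PHX-JFK", "LAX-JFK", "LAX-PHX"]
print(routes_with_stops(routes, 0))  # ['LAX-JFK', 'LAX-PHX']
print(routes_with_stops(routes, 1))  # ['LAX-LAS-JFK', 'LAX-PHX-JFK']
```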

How to handle float numbers using regular expression in VB Script

I am trying to get a number with submatches from the string below, and I am not sure how to handle the case where my string contains either an integer (without a decimal part) or a float.
Please correct me where I am making a mistake in the code below.
str="Added Quantity:12.23 Pass"
Set oReg=New RegExp
oReg.pattern="(.*Quantity.*)+((\d{1,})|(\d{1,}\.\d{1,}))(.*)"
set r=oReg.execute(str)
for i=0 to r.count-1
print r.item(1).submatches(i)
next
Your expression will match numbers alright, but it will match them in the wrong place. To see why, let’s just consider what (Quantity.*)(\d{1,}) matches in the following string:
Quantity:12.23
Here’s the result of that match:
Whole match: Quantity:12.23
Group 1: Quantity:12.2
Group 2: 3
The problem is that .* is greedy and matches as much as possible, including digits. It then backtracks only far enough for the second group, (\d{1,}), to match at least one digit. But you want to get all digits in there.
Several ways exist to solve this, but the easiest is to make your expression more specific: instead of everything (.), just match non-digits:
(.*Quantity\D*)+(\d{1,})
Furthermore, you don’t need the + quantifier here, and \d{1,} can be shortened to \d+. And in the rest of the expression you can join matching integers and decimals together, and just make the decimal part optional:
.*Quantity\D*(\d+(?:\.\d+)?).*
((?:…) just means that this group will not be captured; the parentheses are merely to enforce operator precedence.)
Finally, note this will match 1 and 0.23, but not 1., nor .23. While this is completely fine, it’s somewhat common (especially in American spelling) to omit a leading zero in front of the decimal point.
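For readers who want to try the final expression outside VBScript, here is a minimal sketch in Python (an assumption, since the question uses VBScript); the helper name extract_quantity is made up for this example.

```python
import re

# The final expression from above; group 1 holds the integer or decimal number.
QUANTITY = re.compile(r".*Quantity\D*(\d+(?:\.\d+)?).*")


def extract_quantity(text):
    """Return the number following 'Quantity' as a string, or None."""
    match = QUANTITY.search(text)
    return match.group(1) if match else None


print(extract_quantity("Added Quantity:12.23 Pass"))  # 12.23
print(extract_quantity("Added Quantity:12 Pass"))     # 12
```

One thing to be aware of: because \D* happily consumes a decimal point, an input such as Quantity:.23 does not fail but yields 23, which may or may not be what you want.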

Interesting easy looking Regex

I am re-phrasing my question to clear confusions!
I want to check whether a string contains certain letters. For this I use the character class:
[ACD]
and it works perfectly!
But I want to match only if the string contains those letter(s) 2 or more times, either the same letter repeated or 2 different letters from the set.
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because those are there but only once!
that's why I was trying to use:
[ACD]{2,}
But it's not working, so it's probably not the right regex. Can a regex guru help me solve this puzzle?
Thanks
PS: I will use it on MySQL. A different approach is also welcome, but I'd like to use regex for a smarter and shorter query!
To ensure that a string contains at least two occurrences of letters from a set (let's say A, K, L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA (at least in versions before 8.0, which switched to the ICU library), there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier, which that engine does not support at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
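The pattern itself is not MySQL-specific, so it can be checked against the examples from the question in any engine. A small sketch in Python (chosen here only for convenience; the helper name has_two is hypothetical):

```python
import re

# At least two letters from the set, with anything in between.
AT_LEAST_TWO = re.compile(r"[AKL].*[AKL]")


def has_two(text):
    """Return True if text contains two or more letters from A, K, L."""
    return AT_LEAST_TWO.search(text) is not None


should_match = ["ABCVL", "AAGHF", "KKUI", "AKL"]
should_not_match = ["ABCD", "KHID", "LOVE"]

print([has_two(s) for s in should_match])      # [True, True, True, True]
print([has_two(s) for s in should_not_match])  # [False, False, False]
```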
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As @Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for two or more consecutive characters from the set ACDFGHIJKMNOPQRSTUVWXZ.
However, your expression excludes Y (UVWXZ]), so Z cannot be matched because it is not adjacent to another character from your set. The same applies to A, because B is also excluded from your expression ([ACD). For example, Z and A would match in a string like ZABCDEFGHIJKLMNOPQRSTUVWXYZA
If those letters were not excluded on purpose, it is probably better to use ranges like [A-Z]
If you want 2 or more matches of [AKL], then you can use just [AKL] and check that the number of matches is >= 2.
I am not good at SQL regex, but may be something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check that the number of matches in the string is at least 2. Here's how I would do it in Java:
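import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns true if the string contains at least two letters from the set.
private boolean search2orMore(String string) {
    Matcher matcher = Pattern.compile("[AKL]").matcher(string);
    int counter = 0;
    // Each successful find() is one more occurrence of a letter from the set.
    while (matcher.find()) {
        counter++;
    }
    return counter >= 2;
}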
You can't use [ACD]{2,} because it requires 2 or more consecutive characters from the set, so it fails when the matching characters are separated by other characters.
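The difference between a quantifier on the character class and a count of individual matches can be shown in a few lines. This is a sketch in Python rather than Java; the function names are invented for the example, and the letters argument is assumed to contain plain letters only.

```python
import re


def adjacent_pair(text, letters="AKL"):
    """What [AKL]{2,} tests: two or more set letters directly next to each other."""
    return re.search(rf"[{letters}]{{2,}}", text) is not None


def at_least_two(text, letters="AKL"):
    """What the question asks for: two or more set letters anywhere in the string."""
    return len(re.findall(rf"[{letters}]", text)) >= 2


for sample in ["ABCVL", "AAGHF", "ABCD"]:
    print(sample, adjacent_pair(sample), at_least_two(sample))
```

For ABCVL the two functions disagree: the A and the L are both in the set but are not adjacent, so only the counting version reports a hit.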
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
pretty sure this should work in any case
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
Replacing AKL with the letters you need can easily be done dynamically; tell me if you need it.
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurrences of your characters surrounded by anything.
It is a .NET regex, but it should be the same for anything else.
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two consecutive characters from the set [ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The characters B, E, L and Y do not match your set:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
If you were to do a global search in the string, only the runs CD, FGHIJK and MNOPQRSTUVWX would be matched.

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
But my expected output is:
"." "" 'ami.12.0','allo.12'
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
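The same "match the protected words first, then put them back" technique works outside R as well. Below is a rough Python sketch of it; unlike the pattern above, it escapes the dots (an assumption about what the words really look like), and the helper name strip_numbers is made up for the example.

```python
import re

# Alternative 1 captures a protected word; alternative 2 matches any digit run.
KEEP_OR_DIGITS = re.compile(r"(ami\.12\.0|allo\.12)|[0-9]+")


def strip_numbers(text):
    """Remove digit runs, except those inside the protected words."""
    # If group 1 took part in the match, put it back; otherwise drop the digits.
    return KEEP_OR_DIGITS.sub(lambda m: m.group(1) or "", text)


inputs = ["0.12", "1245", "ami.12.0 00", "allo.12 1"]
print([strip_numbers(s) for s in inputs])
# ['.', '', 'ami.12.0 ', 'allo.12 ']
```

The order of the alternatives matters: the protected words have to be tried before the generic digit run, so that their digits are consumed as part of the word.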
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". For example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see whether that 12 is the start of either of the two ignored strings. It is not, so it goes ahead and replaces it. It would be best to generalize this, but in your specific case you can probably achieve what you want with a negative lookbehind for every prefix of the words you want to skip that can be followed by a digit sequence. A second lookbehind, (?<![[:digit:]]), is needed as well; without it, the match could simply start one digit later, at the 2 of 12, where none of the prefixes precedes it. So, you would use something like this:
gsub('(?<!ami\\.|ami\\.12\\.|allo\\.)(?<![[:digit:]])[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
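The lookbehind approach can be tried in Python too, with two caveats that are assumptions of this sketch rather than part of the answer: Python's re module only accepts fixed-width lookbehinds, so each prefix gets its own (?<!...), and a (?<![0-9]) guard is included so that a match cannot start in the middle of a digit run such as the 12 in ami.12.0.

```python
import re

SKIP_PROTECTED = re.compile(
    r"(?<!ami\.)"      # not right after "ami."
    r"(?<!ami\.12\.)"  # not right after "ami.12."
    r"(?<!allo\.)"     # not right after "allo."
    r"(?<![0-9])"      # not in the middle of a longer digit run
    r"[0-9]+"
)


def strip_numbers_lookbehind(text):
    """Remove digit runs that are not preceded by one of the protected prefixes."""
    return SKIP_PROTECTED.sub("", text)


inputs = ["0.12", "1245", "ami.12.0 00", "allo.12 1"]
print([strip_numbers_lookbehind(s) for s in inputs])
# ['.', '', 'ami.12.0 ', 'allo.12 ']
```

Note that this protects any digits following those prefixes, not just the exact words, so it is less strict than matching the whole words and putting them back.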

Can I shorten this regular expression?

I have the need to check whether strings adhere to a particular ID format.
The format of the ID is as follows:
aBcDe-fghIj-KLmno-pQRsT-uVWxy
A sequence of five blocks of five letters upper case or lower case, separated by one dash.
I have the following regular expression that works:
string idFormat = "[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}";
Note that there is no trailing dash, but all of the blocks within the ID follow the same format. Therefore, I would like to be able to represent this sequence of four blocks with a trailing dash inside the regular expression and avoid the duplication.
I tried the following, but it doesn't work:
string idFormat = "[[a-zA-Z]{5}[-]{1}]{4}[a-zA-Z]{5}";
How do I shorten this regular expression and get rid of the duplicated parts?
What is the best way to ensure that each block does also not contain any numbers?
Edit:
Thanks for the replies, I now understand the grouping in regular expressions.
I'm running a few tests against the regular expression, the following are relevant:
Test 1: aBcDe-fghIj-KLmno-pQRsT-uVWxy
Test 2: abcde-fghij-klmno-pqrst-uvwxy
With the following regular expression, both tests pass:
^([a-zA-Z]{5}-){4}[a-zA-Z]{5}$
With the next regular expression, test 1 fails:
^([a-z]{5}-){4}[a-z]{5}$
Several answers have said that it is OK to omit the A-Z when using a-z, but in this case it doesn't seem to be working.
You can try:
([a-z]{5}-){4}[a-z]{5}
and make it case insensitive.
If you can set regex options to be case insensitive, you could replace all [a-zA-Z] with just plain [a-z]. Furthermore, [-]{1} can be written as -.
Your grouping should be done with (, ), not with [, ] (although you're correctly using the latter in specifying character sets).
Depending on context, you probably want to throw in ^...$ which matches start and end of string, respectively, to verify that the entire string is a match (i.e. that there are no extra characters).
In JavaScript, something like this:
/^([a-z]{5}-){4}[a-z]{5}$/i
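The reason [a-z] alone fails on the mixed-case ID is that dropping A-Z is only safe once the case-insensitive option is actually switched on. A minimal sketch of this in Python (the question uses C#, so the language, the flag names and the helper is_valid_id are assumptions of this example):

```python
import re

# [a-z] plus the ignore-case flag; re.ASCII keeps the match to ASCII letters.
ID_FORMAT = re.compile(r"([a-z]{5}-){4}[a-z]{5}", re.IGNORECASE | re.ASCII)

# Same pattern without the flag, for comparison.
CASE_SENSITIVE = re.compile(r"([a-z]{5}-){4}[a-z]{5}")


def is_valid_id(text):
    """Return True if the whole string is five dash-separated blocks of five letters."""
    return ID_FORMAT.fullmatch(text) is not None


print(is_valid_id("aBcDe-fghIj-KLmno-pQRsT-uVWxy"))  # True
print(is_valid_id("abcde-fghij-klmno-pqrst-uvwxy"))  # True
print(CASE_SENSITIVE.fullmatch("aBcDe-fghIj-KLmno-pQRsT-uVWxy") is not None)  # False
```

Since the character class contains letters only, digits are rejected without any extra work.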
This works for me, though you might want to check it:
[a-zA-Z]{5}(-[a-zA-Z]{5}){4}
(One group of five letters, followed by [dash+group of five letters] four times)
([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}
Try
string idFormat = "([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}";
I.e. you basically replace your brackets by parentheses. Brackets are not meant for grouping but for defining a class of accepted characters.
However, be aware that with shortened versions, you can use the expression for validating the string, but not for analyzing it. If you want to process the 5 groups of characters, you will want to put them in 5 groups:
string idFormat =
"([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})";
so you can address each group and process it.
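The trade-off between the short form and the five explicit groups can be seen by inspecting what each one captures. A short sketch in Python (an assumption; the answer is written for C#), where the constant names are only labels for this example:

```python
import re

BLOCK = r"([a-zA-Z]{5})"

# Five explicit groups: every block can be addressed separately.
FIVE_GROUPS = re.compile("-".join([BLOCK] * 5))

# Shortened form: one repeated group, which only remembers its last repetition.
SHORT = re.compile(r"([a-zA-Z]{5}-){4}[a-zA-Z]{5}")

sample = "aBcDe-fghIj-KLmno-pQRsT-uVWxy"

print(FIVE_GROUPS.fullmatch(sample).groups())
# ('aBcDe', 'fghIj', 'KLmno', 'pQRsT', 'uVWxy')
print(SHORT.fullmatch(sample).groups())
# ('pQRsT-',)
```

In Python, an alternative that keeps the short pattern for validation is to split the validated string on the dash afterwards.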