Regex: how to decrement a matched digit

I would like to match one digit and later match it again (it can be done using backreferences) but decremented by one.
Here is an example regex:
"([0-9])abc\\1"
Is it possible to decrement the value of the backreference \\1 by one?
Edit
I use boost regex.

People are going to hate me for this, but I found it to be an interesting exercise. While regex can't do arithmetic, you can use conditional groups to effectively build a library that maps each numeral to its -1 value.
^(1)?(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?abc(?(1)0)(?(2)1)(?(3)2)(?(4)3)(?(5)4)(?(6)5)(?(7)6)(?(8)7)(?(9)8)$
https://regex101.com/r/47XDtD/1
The other answer posted here is a lot more straightforward and computationally efficient, but the conditional groups will allow for more flexibility in case your real data is more complex (for example, if you need to match the decremented number multiple times).
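As a rough check of the conditional-group pattern above, here is a minimal sketch that assumes Python's built-in re module, which supports the same (?(N)...) conditional syntax (the question itself is about Boost, so treat this only as an illustration). The helper name matches_decrement is hypothetical.

```python
import re

# Pattern copied from the answer above: each optional group captures one digit,
# and the matching conditional requires that digit minus one after "abc".
DECREMENT = re.compile(
    r"^(1)?(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?abc"
    r"(?(1)0)(?(2)1)(?(3)2)(?(4)3)(?(5)4)(?(6)5)(?(7)6)(?(8)7)(?(9)8)$"
)

def matches_decrement(text: str) -> bool:
    """Return True if text matches the conditional-group pattern."""
    return DECREMENT.match(text) is not None

for sample in ["3abc2", "9abc8", "3abc3", "0abc9"]:
    print(sample, matches_decrement(sample))
```

Because every capture group is optional, the pattern as written also accepts a bare abc and inputs such as 12abc01, so it may need tightening for real data.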

Ugly but works:
1abc0|2abc1|3abc2|4abc3|5abc4|6abc5|7abc6|8abc7|9abc8
Substitute your own string for abc.
The pattern simply enumerates every digit paired with its decremented value.
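If the middle string changes often, the alternation could be generated instead of typed by hand. A small Python sketch, where decrement_alternation is a hypothetical helper name and the middle string is escaped on the assumption that it should be matched literally:

```python
import re

def decrement_alternation(middle: str) -> str:
    """Build 1<middle>0|2<middle>1|...|9<middle>8 with the middle escaped."""
    literal = re.escape(middle)
    return "|".join(f"{d}{literal}{d - 1}" for d in range(1, 10))

pattern = decrement_alternation("abc")
print(pattern)
```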

Related

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
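One way to gain confidence in the factorisation steps above is to brute-force them: every intermediate pattern should accept exactly the same strings. The sketch below uses Python's re module and a tiny alphabet chosen only for illustration; STEPS and verdicts are hypothetical names.

```python
import re
from itertools import product

# The patterns from each step of the derivation, in order.
STEPS = [
    r"^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$",
    r"^(|\.|[0-9]+\.)[0-9]+$",
    r"^(\.|[0-9]+\.)?[0-9]+$",
    r"^((|[0-9]+)\.)?[0-9]+$",
    r"^([0-9]*\.)?[0-9]+$",
]

def verdicts(text):
    """Return one True/False per step, telling whether that pattern matches."""
    return [re.match(p, text) is not None for p in STEPS]

# Every string of length 0..5 over a digit, a dot, and a letter.
candidates = ["".join(chars) for n in range(6) for chars in product("1.a", repeat=n)]
disagreements = [s for s in candidates if len(set(verdicts(s))) != 1]
print(disagreements)
```

An empty list means no tested string distinguishes the steps from one another.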
Nice answer by huon (and a bit of a brain-twister to follow along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
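The effect of the alternation order is easy to see on an unanchored search. A short Python sketch, taking A as [0-9]+ and B as \.[0-9]+ from the question (the variable names are illustrative only):

```python
import re

A = r"[0-9]+"
B = r"\.[0-9]+"

a_first = re.compile(f"{A}|(?:{A})?{B}")   # A|A?B
b_first = re.compile(f"(?:{A})?{B}|{A}")   # A?B|A

text = "price 234.56 today"
print(a_first.search(text).group())  # the A branch wins as soon as it succeeds
print(b_first.search(text).group())  # the longer A?B branch is tried first
```

With anchors, or with a full-string match, both orders accept 234.56, because the engine backtracks into the second branch when the first cannot reach the end of the string.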
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on whitespace and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will; JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and a space.
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
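To see how the final pattern behaves on the inputs from the question, here is a quick Python sketch. The pattern is copied unchanged, with the /i flag expressed as re.IGNORECASE; the rejected list is my own set of illustrative inputs, not something from the question.

```python
import re

NUMBER = re.compile(
    r"^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$",
    re.IGNORECASE,
)

# Inputs listed in the question, including the previously failing ".56".
accepted = ["45", "45.988", "45,689", "34,569,098,233", "567,900.90",
            "-9", "-34 banana fries", "0.56 points", ".56"]
# Illustrative inputs that should be refused.
rejected = ["", "45.", "1,23", "12abc"]

for s in accepted + rejected:
    print(repr(s), bool(NUMBER.match(s)))
```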
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followed by line start"; it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex: your flavor must support (?(named group)true|false).

Get value from a column with regex

I have lines of text similar to this:
value1|value2|value3 etc.
The values themselves are not interesting. The type of delimiter is also unimportant, as is the number of fields. There could be 100 columns, or there could be only 5.
I would like to know what is the usual way to write a regexp which will put any given column's value into a capture group.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
My problem is, that after a while I run into catastrophic backtracking or simply extremely low running times.
What is the usual way to solve a problem like this?
Edit:
For further clarification: this is a use case where further string operations are simply not permitted. Workarounds are not possible. I would like to know if there's a way simply based on regex or not.
As you stated:
My problem is, that after a while I run into catastrophic backtracking
or simply extremely low running times.
What is the usual way to solve a problem like this?
IMHO, you should prefer string operations when the string has a predefined structure (in your case, the | character is the separator), because string operations are faster than a regex, which is designed to find patterns. A regex becomes useful when the separators can vary and have to be identified before splitting, e.g.:
value1|value2;value3-value4
For your case, you can simply split the string on the separator character and access the respective index of the resulting array.
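A minimal sketch of the split-and-index approach, written in Python purely for illustration (the question does not name a language, and nth_field is a hypothetical helper):

```python
def nth_field(line: str, n: int, sep: str = "|") -> str:
    """Return the n-th (1-based) field of a delimited line."""
    return line.split(sep)[n - 1]

print(nth_field("value1|value2|value3|value4", 3))
```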
EDIT:
If Regex is your only option then try using this regex:
^((.+?)\|){200}
Here 200 is the index of the element I wish to access; this seems a bit less time-consuming than yours.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
In terms of "steps", using capture groups will cost more steps.
However, using capture groups will allow you to condense your pattern and use a curly bracketed quantifier.
In your first pattern above, you can get away with "greedy" negated character classes (remove the ?) because they will halt at the next |, and you don't need to escape the pipe inside the square brackets.
When you want to access a "much later" positioned substring in the input string, not using a quantifier is going to require a horrifically long pattern and make it very, very difficult to comprehend the exact point that will be matched. In these cases, it would be pretty silly not to use a capture group and a quantifier.
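Putting those points together (greedy negated classes, an unescaped pipe inside the brackets, and a quantifier for the skipped fields), a regex-only lookup could be sketched as follows. This is Python, so the named group is spelled (?P<field>...) rather than (?<capture_group>...); the start anchor and the use of * instead of + (to tolerate empty fields) are my assumptions, and nth_field_regex is a hypothetical name.

```python
import re
from typing import Optional

def nth_field_regex(line: str, n: int) -> Optional[str]:
    """Capture the n-th (1-based) pipe-delimited field with one anchored regex."""
    # Skip n-1 "field|" prefixes with greedy negated classes, then capture the next field.
    pattern = r"^(?:[^|]*\|){%d}(?P<field>[^|]*)" % (n - 1)
    match = re.match(pattern, line)
    return match.group("field") if match else None

line = "|".join(f"value{i}" for i in range(1, 1001))
print(nth_field_regex(line, 3))
print(nth_field_regex(line, 1000))
```

Since [^|] can never consume a pipe, each repetition has only one way to match, which leaves the engine very little to backtrack over.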
I agree with Toto's comment; accessing an array of split results is going to be a very sensible solution if possible.

Negating Regular Expression for Price

I have a regular expression for matching price where decimals are optional like so,
/[0-9]+(\.[0-9]{1,2})?/
Now what I would like to do is get the inverse of the expression, but I am having trouble doing so. I came up with something simple like,
/[^0-9.]/g
But this allows for multiple '.' characters and more than 2 numbers after the decimal. I am using jQuery replace function on blur to correct an input price field. So if a user types in something like,
"S$4sd3.24151 . x45 blah blah text blah" or "!#%!$43.24.234asdf blah blah text blah"
it will return
43.24
Can anyone offer any suggestions for doing this?
I would do it in two steps. First delete any non-digit and non-dot-character with nothing.
/[^0-9.]//g
This will yield 43.24151.45 and 43.24.234 for the first and second example respectively.
Then you can use your first regex to match the first occurrence of a valid price.
/\d+(\.\d{1,2})?/
Doing this will give you 43.24 for both examples.
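The two steps can be sketched in Python as follows (the question uses JavaScript, so this is only an illustration of the logic). It uses the [0-9]+ form of the price regex from the question; extract_price is a hypothetical helper name.

```python
import re
from typing import Optional

def extract_price(raw: str) -> Optional[str]:
    """Strip everything except digits and dots, then take the first valid price."""
    cleaned = re.sub(r"[^0-9.]", "", raw)
    match = re.search(r"\d+(\.\d{1,2})?", cleaned)
    return match.group() if match else None

print(extract_price("S$4sd3.24151 . x45 blah blah text blah"))
print(extract_price("!#%!$43.24.234asdf blah blah text blah"))
```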
I suppose in programming, it is not always clear what "inverse" means.
To suggest a solution exclusively based on the example that you presented, I will present one that is very similar to what Vince presented. I am having difficulty composing a Regular Expression that both matches the pattern that you need and captures a potentially arbitrary number of digits, through repeating capture groups. And I am not sure whether this would be doable in some reasonable way (perhaps someone else does). But a two step approach should be straightforward.
To note, I suspect that you are referring to JavaScript's replace function, which is a member of the String Object, and not jQuery replaceWith and replaceAll functions, in referring to 'jQuery replace function.' The latter are 'Dom manipulation' functions. But, correct me if I misunderstood.
As an example, based on some hypothetical input, you could use
var numeric_raw = jQuery('input.textbox').attr('value').replace(/[^0-9.]/g, "")
to remove all characters from a value entered in a text field that are not digits or periods;
then you could use
var numeric_str = numeric_raw.replace(/^[0]*(\d+\.\d{1,2}).*$/, "$1")
The difference between the patterns here and those in Vince's answer is that I also filter out leading 0s.
To note, in Vince's first reg ex, there might be an extra '/' -- but perhaps it has a purpose that I didn't catch.
With respect to "inverse," one way to understand your initial inquiry is that you are looking for an expression that does the opposite of the one that you provided.
To note, while the expression that you provided (/[0-9]+(\.[0-9]{1,2})?/) does match both whole numbers and decimal numbers with up to two fractional digits, it also matches any single digit, so it may identify a match where one might not be envisioned for a given input string. The expression does not have anchors ('^', '$'), and so might allow multiple possible matches. For example, in the string "1.111", both "1.11" and "1" match the pattern that you provided.
It appears to me that one pattern that matches any string that does not match your pattern, or at least does so in most cases, is the following:
/^(?:(?!.*[0-9]+(\.[0-9]{1,2})?).*)*$/
-- if someone could identify a precisely 'inverse' pattern, please feel free -- I am having some trouble understanding how lookaheads are interpreted at least for some nuances.
This relies on "negative lookahead" functionality, which JavaScript these days supports. You could refer to several stackoverflow postings for more information (eg. Regular Expressions and negating a whole character group), and there are multiple resources that could be found on the Internet that discuss "lookahead" and "lookbehind."
I suppose this answer carries some redundancy with respect to the one already given -- I might have commented on the Original Poster's post or on Vince's answer (instead of writing at least parts of my answer), but I am not yet able to make comments!

Regex expression - one or multiple strings separated by comma

Novice regex question here.
I need a regex that will accept one or more of the following strings. If there are multiple strings, they need to be separated by a comma.
foo
bar
Any help or a point in the right direction would be appreciated.
^(foo|bar)(,(foo|bar))*$
does that. The capturing groups are not necessary, you could also write this (slightly more efficient) with non-capturing groups as
^(?:foo|bar)(?:,(?:foo|bar))*$
To avoid repeats, you can use a negative lookahead assertion:
^(foo|bar)(?:,(?!\1)(?:foo|bar))?$
(Notice the ? instead of * - if only a single repetition is possible, this makes more sense.)
This approach quickly becomes complicated when a higher number of strings is to be checked. While it's theoretically possible to do that with a regex as well, it's probably not a good idea.
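For reference, here is how the non-capturing list pattern and the no-repeat variant behave on a few inputs. This is a Python sketch; the sample strings and the names ANY_LIST and NO_REPEAT are illustrative only.

```python
import re

# One or more of foo/bar, comma-separated, repeats allowed.
ANY_LIST = re.compile(r"^(?:foo|bar)(?:,(?:foo|bar))*$")
# At most two items, and the second must differ from the first.
NO_REPEAT = re.compile(r"^(foo|bar)(?:,(?!\1)(?:foo|bar))?$")

for s in ["foo", "foo,bar", "foo,foo", "bar,bar,foo", "foo,", "foo,baz", ""]:
    print(repr(s), bool(ANY_LIST.match(s)), bool(NO_REPEAT.match(s)))
```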

Is there a simple regex to compare numbers to x?

I want a regex that will match if a number is greater than or equal to an arbitrary number. This seems monstrously complex for such a simple task... it seems like you need to reinvent 'counting' in an explicit regex hand-crafted for the x.
For example, intuitively to do this for numbers greater than 25, I get
(\d{3,}|[3-9]\d|2[6-9]\d)
What if the number was 512345? Is there a simpler way?
It seems that there is no simpler way; regex is not meant for numeric comparison.
You may try this one:
[1-9]\d{6,}|
[6-9]\d{5}|
5[2-9]\d{4}|
51[3-9]\d{3}|
512[4-9]\d{2}|
5123[5-9]\d|
51234[6-9]
(newlines for clarity)
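The alternation above follows a mechanical rule: accept any number with more digits, then, for each digit position, keep the prefix, raise that digit, and let the remaining digits be anything. A Python sketch of a generator for such patterns follows; greater_than_pattern is a hypothetical helper, and it assumes non-negative integers written without leading zeros.

```python
import re

def greater_than_pattern(n: int) -> str:
    """Build an alternation matching decimal integers strictly greater than n."""
    digits = str(n)
    length = len(digits)
    # Any number with more digits than n (and no leading zero) is larger.
    parts = [rf"[1-9]\d{{{length},}}"]
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 9:
            continue  # this digit cannot be raised
        rest = length - i - 1
        higher = f"[{d + 1}-9]" if d + 1 < 9 else "9"
        tail = "" if rest == 0 else r"\d" if rest == 1 else rf"\d{{{rest}}}"
        parts.append(digits[:i] + higher + tail)
    return "|".join(parts)

print(greater_than_pattern(512345))
```

The generated pattern still needs anchors or a full-string match around it when it is used.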
What if the number was 512345? Is there a simpler way?
No, a regex to match a number in a certain range will be a horrible looking thing (especially large numbers ranges).
Regex is simply not meant for such tasks. The better solution would be to "freely" match the digits, like \d+, and then compare them with the language's relational operators (<, >, ...).
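A minimal Python sketch of that approach: match the digits freely, then compare them as integers. The function name and the sample text are made up for illustration.

```python
import re
from typing import List

def numbers_greater_than(text: str, threshold: int) -> List[int]:
    """Find every run of digits in text and keep those whose value exceeds threshold."""
    return [int(m) for m in re.findall(r"\d+", text) if int(m) > threshold]

print(numbers_greater_than("ids: 7, 25, 26, 512345, 512346", 25))
```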
In Perl you can use the conditional regexp construct (?(condition)yes-pattern) where the (condition) is (?{CODE}) to run arbitrary Perl code. If you make the yes-pattern be (*FAIL) then you have a regexp fragment which succeeds only when CODE returns false. Thus:
use feature 'say';

foreach (0 .. 50) {
    if (/\A(\d+)(?(?{$1 <= 25})(*FAIL))\z/) {
        say "$_ matches";
    }
    else {
        say "$_ does not match";
    }
}
The code-evaluation feature used to be marked as experimental but the latest 'perlre' manual page (http://perldoc.perl.org/perlre.html) seems to now imply it is a core language feature.
Technically, what you have is no longer a 'regular expression' of course, but some hybrid of regexp and external code.
I've never heard of a regex flavor that can do that. Writing a Perl module to generate the appropriate regex (as you mentioned in your comment) sounds like a good idea to me. In fact, I'd be surprised if it hasn't been done already. Check CPAN first.
By the way, your regex contains a few more errors besides the excess pipes Yuriy pointed out.
First, the "three or more digits" portion will match invalid numbers like 024 and 00000007. You can solve that by requiring the first digit to be greater than zero. If you want to allow for leading zeroes, you can match them separately.
The third part, 2[6-9]\d, only matches numbers >= 260. Perhaps you meant to make the third digit optional (i.e. 2[6-9]\d?), but that would be redundant.
You should anchor the regex somehow to make sure you aren't matching part of a longer number or a "word" with digits in it. I don't know the best way to do that in your particular situation, but word boundaries (i.e. \b) will probably be all you need.
End result:
\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b
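The end result can be checked exhaustively over a small range. The Python sketch below compares the pattern with an ordinary integer comparison; the range of 0 to 1999 is an arbitrary choice for illustration.

```python
import re

GREATER_THAN_25 = re.compile(r"\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b")

# Collect every n where the regex and the numeric comparison disagree.
mismatches = [n for n in range(0, 2000)
              if bool(GREATER_THAN_25.fullmatch(str(n))) != (n > 25)]
print(mismatches)
```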