Negating Regular Expression for Price - regex

I have a regular expression for matching price where decimals are optional like so,
/[0-9]+(\.[0-9]{1,2})?/
Now what I would like to do is get the inverse of the expression, but having trouble doing so. I came up with something simple like,
/[^0-9.]/g
But this allows for multiple '.' characters and more than 2 numbers after the decimal. I am using jQuery replace function on blur to correct an input price field. So if a user types in something like,
"S$4sd3.24151 . x45 blah blah text blah" or "!#%!$43.24.234asdf blah blah text blah"
it will return
43.24
Can anyone offer any suggestions for doing this?

I would do it in two steps. First delete any non-digit and non-dot-character with nothing.
/[^0-9.]//g
This will yield 43.24151.45 and 43.24.234 for the first and second example respectively.
Then you can use your first regex to match the first occurence of a valid price.
/\d(\.\d{1,2})?/
Doing this will give you 43.24 for both examples.

I suppose in programming, it is not always clear what "inverse" means.
To suggest a solution exclusively based on the example that you presented, I will present one that is very similar to what Vince presented. I am having difficulty composing a Regular Expression that both matches the pattern that you need and captures a potentially arbitrary number of digits, through repeating capture groups. And I am not sure whether this would be doable in some reasonable way (perhaps someone else does). But a two step approach should be straightforward.
To note, I suspect that you are referring to JavaScript's replace function, which is a member of the String Object, and not jQuery replaceWith and replaceAll functions, in referring to 'jQuery replace function.' The latter are 'Dom manipulation' functions. But, correct me if I misunderstood.
As an example, based on some hypothetical input, you could use
<b>var numeric_raw = jQuery('input.textbox').attr ('value').replace (/[^0-9.]/g, "")</b>
to remove all characters from a value entered in a text field that are not digits or periods;
then you could use
<b>var numeric_str = numeric_raw.replace (/^[0]*(\d+\.\d{1,2}).*$/, "$1")</b>
The difference between the classes specified here and in Vince's answer are in that I am including filtering for leading 0s.
To note, in Vince's first reg ex, there might be an extra '/' -- but perhaps it has a purpose that I didn't catch.
With respect to "inverse," one way to understand your initial inquiry is that you are looking for an expression that does the opposite of the one that you provided.
To note, while the expression that you provided (/[0-9]+(.[0-9]{1,2})?/) does match both whole numbers and decimal numbers with up to two fractional digits, it also matches any single digit -- so, it may identify a match where one might not be envisioned, for a given input string. The expression does not have anchors ('^', '$'), and so might allow multiple possible matches. For example, in the String "1.111", both "1.11" and "1" match the pattern that you provided.
It appears to me that one pattern that matches any string that does not match your pattern is the following, or at least does this for most cases can be this:
/^(?:(?!.*[0-9]+(\.[0-9]{1,2})?).*)*$/
-- if someone could identify a precisely 'inverse' pattern, please feel free -- I am having some trouble understanding how lookaheads are interpreted at least for some nuances.
This relies on "negative lookahead" functionality, which JavaScript these days supports. You could refer to several stackoverflow postings for more information (eg. Regular Expressions and negating a whole character group), and there are multiple resources that could be found on the Internet that discuss "lookahead" and "lookbehind."
I suppose this answer carries some redundancy with respect to the one already given -- I might have commented on the Original Poster's post or on Vince's answer (instead of writing at least parts of my answer), but I am not yet able to make comments!

Related

What's the regular expression for an alphabet without the first occurrence of a letter?

I am trying to use FLEX to recognize some regular expressions that I need.
What I am looking for is given a set of characters, say [A-Z], I want a regular expression that can match the first letter no matter what it is, followed by a second letter that can be anything in [A-Z] besides the first letter.
For example, if I give you AB, you match it but if I give you AA you don't. So I am kind of looking for a regex that's something like
[A-Z][A-Z^Besides what was picked in the first set].
How could this be implemented for more occurrences of letters? Say if I want to match 3 letters without each new letter being anything from the previous ones. For instance ABC but not AAB.
Thank you!
(Mathematical) regular expressions have no context. In (f)lex -- where regular expressions are actually regular, unlike most regex libraries -- there is no such thing as a back-reference, positive or negative.
So the only way to accomplish your goal with flex patterns is to enumerate the possibilities, which is tedious for two letters and impractical for more. The two letter case would be something like (abbreviated);
A[B-Z]|B[AC-Z]|C[ABD-Z]|D[A-CE-Z]|…|Z[A-Y]
The inverse expression also has 26 cases but is easier to type (and read). You could use (f)lex's first-longest-match rule to make use of it:
AA|BB|CC|DD|…|ZZ { /* Two identical letters */ }
[[:upper:]]{2} { /* This is the match */ }
Probably, neither of those is the best solution. However, I don't think I can give better advice without knowing more specifics. The key is knowing what action you want to take if the letters do match, which you don't specify. And what the other patterns are. (Recall that a lexical scanner is intended to divide the input into tokens, although you are free to ignore a token once it is identified.)
Flex does come with a number of useful features which can be used for more flexible token handling, including yyless (to rescan part or all of the token), yymore (to combine the match with the next token), and unput (to insert a character into the input stream). There is also REJECT, but you should try other solutions first. See the flex manual chapter on actions for more details.
So the simplest solution might be to just match any two capital letters, and then in the action check whether or not they are the same.

Why doesn't regex support inverse matching?

Several sources linked below seem to indicate regex wasn't designed for inverse matching - why not?
Recently, while trying to put together an answer for a question about a regex to match everything that was left after a specific pattern, I encountered several issues that left me curious about the limitations of regex.
Suppose we have some string: a simple line of text. I have a regex [a-zA-Z]e that will match one letter, followed by an e. This matches 3 times, on le, ne, and te. What if I want to match everything except patterns that match the regex? Suppose I want to capture a simp, li, of, and xt., including spaces (line breaks optional.) I later learned this behavior is called inverse matching, and shortly after, that it's not something regex easily supports.
I've examined some resources, but couldn't find any concrete answer on why inverse matching isn't "good".
Negative lookaheads appear useful for determining if a matched string does not contain some specific string, and are in fact used in several answers as methods to achieve this behavior (or something similar) - but they seem designed to act as a way to disqualify matches, as opposed to capturing non-matching input.
Negative lookaheads apparently shouldn't try to do this and aren't good at it either, choosing to leave inverse matching to the language they're being used with.
My own attempt at inverse matching was pointed out to be situational and very fragile, and looks convoluted even to me. In the comments, Wiktor Stribizew mentioned that "[...] in Java, you can't write a regex that matches any text other than some multicharacter string. With capturing, something can be done, but it is inefficient[.]"
Capture groups (the other method I was considering) appear to have the potential to dramatically slow the regex in more than one language.
All of these seem to indicate regex wasn't designed for inverse pattern matching, but none of them are immediately obvious as to the reasoning behind that. Why wasn't regex designed with built-in ability to perform inverse pattern matching?
While direct regex, as you pointed out, does not easily support the functionality you want, a regex split, does easily support this. Consider the following two scripts, first in Java and then in Python:
String input = "a simple line of text.";
String[] parts = input.split("[a-z]e");
System.out.println(Arrays.toString(parts));
This prints:
[a simp, li, of , xt.]
In Python, we can try something very similar:
inp = "a simple line of text."
parts = re.split(r'[a-z]e', inp)
print(parts)
This prints:
['a simp', ' li', ' of ', 'xt.']
The secret sauce which is missing in pure regex is that of parsing or iteration. A good programming language, such as the above, will expose an API which can iterate an input string, using a supplied pattern, and rollup the portions from the split pattern.

Regex to match hexadecimal and integer numbers [duplicate]

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.
The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).
Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followd by line start", it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill either because they restrict their question to a very specific example, or to a specific usercase (e.g: single chars only) or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g: C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a Tennant of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, its not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned it in your question when you have an efficient pattern you use with a match method, to obtain the complementary, the more easy (and efficient) way is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
if this way is very simple to write (if the lookahead is available), it has the inconvenient to be slow since each positions of the string are tested with the lookahead.
The amount of lookahead tests can be reduced if you can use the first character of the pattern (lets say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Lets say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way becomes quickly fastidious when the number of characters is important, but it can be useful for regex flavors that haven't many features like POSIX regex.

Pattern matching language knowledge, pattern matching approach

I am trying to implement a pattern matching "syntax" and language.
I know of regular expressions but these aren't enough for my scopes.
I have individuated some "mathematical" operators.
In the examples that follow I will suppose that the subject of pattern mathing are character strings but it isn't necessary.
Having read the description bellow: The question is, does any body knows of a mathematical theory explicitating that or any language that takes the same approach implementing it ? I would like to look at it in order to have ideas !
Descprition of approach:
At first we have characters. Characters may be aggregated to form strings.
A pattern is:
a) a single character
b) an ordered group of patterns with the operator matchAny
c) an ordered group of patterns with the operator matchAll
d) other various operators to see later on.
Explanation:
We have a subject character string and a starting position.
If we check for a match of a single character, then if it matches it moves the current position forward by one position.
If we check for a match of an ordered group of patterns with the operator matchAny then it will check each element of the group in sequence and we will have a proliferation of starting positions that will get multiplied by the number of possible matches being advanced by the length of the match.
E.G suppose the group of patterns is { "a" "aba" "ab" "x" "dd" } and the string under examination is:
"Dabaxddc" with current position 2 ( counting from 1 ).
Then applying matchAny with the previous group we have that "a" mathces "aba" matches and "ab" matches while "x" and "dd" do not match.
After having those matches there are 3 starting positions 3 4 5 ( corresponding to "a" "ab" "aba" ).
We may continue our pattern matching by accepting to have more then one starting positions. So now we may continue to the next case under examination and check for a matchAll.
matchAll means that all patterns must match sequentially and are applied sequentially.
subcases of matchAll are match0+ match1+ etc.
I have to add that the same fact to try to ask the question has already helped me and cleared me out some things.
But I would like to know of similar approaches in order to study them.
Please only languages used by you and not bibliography !!!
I suggest you have a look at the paper "Parsing Permutation Phrases". It deals with recognizing a set of things in any order where the "things" can be recognizers themselves. The presentation in the paper might be a little different than what you expect; they don't compile to finite automaton. However, they do give an implementation in a functional language and that should be helpful to you.
Your description of matching strings against patterns is exactly what a compiler does. In particular, your description of multiple potential matches is highly reminiscent of the way an LR parser works.
If the patterns are static and can be described by an EBNF, then you could use an LR parser generator (such as YACC) to generate a recogniser.
If the patterns are dynamic but can still be formulated as EBNF there are other tools that can be applied. It just gets a bit more complicated.
[In Australia at least, Computer Science was a University course in 1975, when I did mine. YACC dates from around 1970 in its original form. EBNF is even older.]