Regex to match hexadecimal and integer numbers [duplicate] - regex

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.

The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/

Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A

Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).

Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.

Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followd by line start", it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

Related

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill either because they restrict their question to a very specific example, or to a specific usercase (e.g: single chars only) or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g: C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a Tennant of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, its not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned it in your question when you have an efficient pattern you use with a match method, to obtain the complementary, the more easy (and efficient) way is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
if this way is very simple to write (if the lookahead is available), it has the inconvenient to be slow since each positions of the string are tested with the lookahead.
The amount of lookahead tests can be reduced if you can use the first character of the pattern (lets say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Lets say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way becomes quickly fastidious when the number of characters is important, but it can be useful for regex flavors that haven't many features like POSIX regex.

Positive lookahead that (also) matches the empty string

I'm doing an internship with some Groovy code and I came across the following pattern:
(?=(^\w)*)(\w)+(?=(^\w)*)
It basically just finds words (contiguous collections of word characters) to sift out punctuation and such. Is there a reason to not simply use this pattern?
\w+
Since it's not my code I imagine that there might have been a reason for using something so ridiculously complicated, but at the same time it seems like it would be very inefficient. Is there any difference between the two? They seem to give the same results on http://regexpal.com/.
The answer to why not use just \w+ is capturing groups, this doesn't explain any possible subtlety or logic in the regex though.
The (optional) prefix and suffix strings are partially captured for possible later use, and as noted by m.buettner ^\w is quite likely a meant to be [^\w], meaning the second final group never matches (though there might be cases with multi-line input, see Pattern Matching Flags, I can't see one myself, since \w+ won't match and consume and end of line).
The use of both (?=) and * indicates that perhaps the author was not quite familiar with regexs, typically you use look arounds to constrain (which * effectively undoes here), or to optimise matching.
A polite approach might be assume that the regex was being "tweaked" during development, and has been left with some unneeded subpatterns...

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with RegEx, and it seems I've never seen the need for these 2 zero width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regex give the same results (i.e. matching Iraq but not quit) (tested with perl). The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could workaround this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter. For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character, if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA_Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA_Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this let's you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq" because it will look for another symbol.
q(?!u) however, will match "Iraq":
regex = /q[^u]/
/q[^u]/
regex.test("Iraq")
false
regex.test("Iraqf")
true
regex = /q(?!u)/
/q(?!u)/
regex.test("Iraq")
true
Well, another thing along with what others mentioned with the negative lookahead, you can match consecutive characters (e.g. you can negate ui while with [^...], you cannot negate ui but either u or i and if you try [^ui]{2}, you will also negate uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part