A regex that will parse 00.00 - regex

I'm trying to create a regex that will accept the following values:
(blank)
0
00
00.0
00.00
I came up with ([0-9]){0,2}\.([0-9]){0,2} which to me says "the digits 0 through 9 occurring 0 to 2 times, followed by a '.' character (which should be optional), followed by the digits 0 through 9 occurring 0 to 2 times". If only 2 digits are entered, the '.' is not necessary. What's wrong with this regex?

You didn't make the dot optional:
[0-9]{0,2}(\.[0-9]{1,2})?
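A quick way to check this pattern against the values from the question is a short Python sketch. The helper name is_valid and the use of re.fullmatch (so that the whole string has to match) are assumptions made for illustration, not part of the answer itself.

```python
import re

# Pattern from the answer above: the dot and the digits after it are
# optional as a single unit.
PATTERN = re.compile(r"[0-9]{0,2}(\.[0-9]{1,2})?")

def is_valid(text):
    """Return True when the entire string matches the pattern."""
    return PATTERN.fullmatch(text) is not None

for value in ["", "0", "00", "00.0", "00.00", "0.", "000", "foo"]:
    print(repr(value), is_valid(value))
```

Note that this pattern also accepts .0, because the leading digits are allowed to be absent.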

First off, {0-2} should be {0,2} as it was in the first instance.
Secondly, you need to group the repetition sections as well.
Thirdly, you need to make the whole last part optional. Because if there's a dot, there must be something after it, you should also change the second repetition thing to {1,2}.
([0-9]{0,2})(\.([0-9]{1,2}))?

There are a few problems with your regex:
The dot is a special character, and acts as a wildcard; if you want a literal dot, you need to escape it (\.).
Even if you replaced the dot to not be a wildcard, your regex will match strings like "0." because you did not tell the regular expression engine to only match the dot if there are numbers following it.
Because your expression isn't anchored, it could match strings that contain the pattern within another word (for example, ab12 would match).
A better pattern would be something like:
/\b[0-9]{0,2}(?:\.[0-9]{1,2})?\b/
Note that (?:...) makes the group not create a backreference, which probably is not needed in your case.

Here is one way, illustrated in Perl, to match only the strings you listed. The important part is its method for matching empty strings: it does not make every pattern element optional, a strategy that has the undesirable effect of matching almost every string.
use warnings;
use strict;
my @data = (
'',
'0',
'00',
'00.0',
'00.00',
'foo', # Should not match.
'.0', # Should not match.
);
for (@data){
print $_, "\n" if /^$|^[0-9]{1,2}(\.[0-9]{1,2})?$/;
}
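For readers who do not use Perl, the same check might look roughly like this in Python. This is only a sketch of the idea (an explicit empty-string alternative instead of making every element optional); the names STRICT, data, and matched are chosen for illustration.

```python
import re

# Same idea as the Perl pattern: either the string is empty, or it is
# one or two digits with an optional fractional part.
STRICT = re.compile(r"^$|^[0-9]{1,2}(\.[0-9]{1,2})?$")

data = ["", "0", "00", "00.0", "00.00", "foo", ".0"]

# Keep only the strings the pattern accepts.
matched = [s for s in data if STRICT.search(s)]
print(matched)
```

Here 'foo' and '.0' are filtered out, while the empty string is still accepted through the first alternative.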

Most of the above examples don't anchor the beginning ^ and ending $ of the data.
I would solve it with one of the following:
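^[[:digit:]]{0,2}([.][[:digit:]]{1,2})?$
^\d{0,2}([.]\d{1,2})?$
^[0-9]{0,2}([.][0-9]{1,2})?$
The ? after the group keeps the fractional part optional, so that values such as 0 and 00 are still accepted. For readability, I generally prefer [.] to \. and POSIX classes like [[:digit:]].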

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
Regex: \b([a-zA-Z0-9]+)-
Explanation: \b is a word boundary to prevent a partial word match; the value is in capture group 1.
Regex: \b[a-zA-Z0-9]+(?=-)
Explanation: Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex: -([a-zA-Z0-9]+)\b
Explanation: The other way around, with a leading -; get the value from capture group 1.
Regex: -\K[a-zA-Z0-9]+\b
Explanation: Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
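The question is about Perl, but the capture-group and lookahead patterns above carry over to other engines. A minimal Python sketch follows; Python's built-in re module does not support \K, so a lookbehind (?<=-) is swapped in for that variant. The variable names are illustrative only.

```python
import re

word = "sh0rt-t3rm"

# Capture-group version: the part after the hyphen is in group 1.
m = re.search(r"-([a-zA-Z0-9]+)\b", word)
second_part = m.group(1) if m else None

# Lookbehind version, used here as a stand-in for \K:
# the hyphen is asserted but not included in the match.
m2 = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word)
second_part_lookbehind = m2.group(0) if m2 else None

# First part, using the positive lookahead pattern.
m3 = re.search(r"\b[a-zA-Z0-9]+(?=-)", word)
first_part = m3.group(0) if m3 else None

print(first_part, second_part, second_part_lookbehind)
```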
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side so that the regex runs in list context, because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
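The difference between the two patterns shows up as soon as the text before the last hyphen contains a hyphen itself. A small Python sketch of that behaviour (the function names and the sample string long-sh0rt-t3rm are made up for illustration):

```python
import re

def after_first_hyphen(s):
    """Everything after the first hyphen, or None if there is no match."""
    m = re.search(r"-(.+)", s)
    return m.group(1) if m else None

def after_last_hyphen(s):
    """Everything after the last hyphen: the greedy .* swallows earlier ones."""
    m = re.search(r".*-(.+)", s)
    return m.group(1) if m else None

print(after_first_hyphen("long-sh0rt-t3rm"))
print(after_last_hyphen("long-sh0rt-t3rm"))
```

With a single hyphen, as in sh0rt-t3rm, both functions return the same value.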
There are of course many other variations in what the data may look like, other than just being one hyphenated word. In particular, if it's part of a larger text, you may want to include word boundaries:
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the occurrences enclosed by tildes, but when I tried its negation, (?!(~match-this)), it just matches empty strings.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). When I added an asterisk quantifier to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it catches every match-this, including those enclosed by tildes.
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #.
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as what is directly to the right is not match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as what is directly to the right is not match-this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
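The "discard what is not captured" idea can be sketched in Python as follows. This is an illustration under the assumption that plain "match-this" is acceptable for every match; the names kept and positions are made up here.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# The first alternative consumes the unwanted ~match-this~ without capturing;
# the second alternative captures the occurrences we want to keep.
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep only the matches where group 1 took part in the match.
hits = [m for m in pattern.finditer(s) if m.group(1) is not None]
kept = [m.group(1) for m in hits]
positions = [m.start(1) for m in hits]

print(kept)
print(positions)
```

The start positions make it possible to tell the five kept occurrences apart, since the captured text is identical each time.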
For this you need both kinds of lookaround (lookahead and lookbehind). The pattern below matches the 5 spots you want; there is a reason why only this approach works and why the prefix and/or suffix can't be included in the match:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts here or there must be a character that is not a tilde. We could do the same for the end of the text, but this is a dead end anyway, because the second problem remains.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
There is a difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
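To make the last point concrete, here is a minimal Python sketch that runs the lookaround pattern from this answer and then reads the neighbouring characters from the input using each match's start and end position. The helper name neighbours is hypothetical, and an empty string stands in for "no character" at either end of the text.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Lookaround pattern from the answer: a tilde on at most one side.
pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

def neighbours(text, match):
    """Return the characters just before and just after a match ('' at the edges)."""
    before = text[match.start() - 1] if match.start() > 0 else ""
    after = text[match.end()] if match.end() < len(text) else ""
    return before, after

results = [(m.start(),) + neighbours(s, m) for m in pattern.finditer(s)]
for start, before, after in results:
    print(start, repr(before), repr(after))
```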

how to not match a particular number at the end of a string in regex perl?

I have a file that has a list that looks like this:
$ABC01
$ABC02
$ABC03
$ABC04
$ABC05
I want all those strings that are not $ABC02. This is a part of a code and before this step I got string ABC recorded in one variable, and the number 02 recorded in other. I basically want all those entries that start with ABC, and end with a 2 digit number apart from 02.
I tried this but it is not working:
/^\$ABC[^02]$/
What is the correct regex to be used in this case?
What your regex currently does is find "$ABC" followed by exactly one character that is neither 0 nor 2. You could split that [^02] so that it matches 2 characters, like this:
/^\$ABC[^0][^2]$/
...but that will also match strings that have non-digits at the end, and it rejects every string whose first digit is 0 or whose second digit is 2 (such as $ABC01 or $ABC12). So rather than describing what you don't want to find, put in what you do want to find.
/^\$ABC([1-9][0-9]|0[013-9])$/
This will look for either 1 to 9 followed by any digit or 0 followed by any digit except for a 2. You can replace the (...) with (?:...) if you don't want a capture group.
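One way to convince yourself that this pattern excludes only 02 is to run it against every two-digit suffix. A short Python sketch (the list names are illustrative; the pattern is the one given above):

```python
import re

pattern = re.compile(r"^\$ABC([1-9][0-9]|0[013-9])$")

# Build $ABC00 through $ABC99 and collect whatever the pattern rejects.
candidates = ["$ABC%02d" % n for n in range(100)]
rejected = [c for c in candidates if not pattern.search(c)]

print(rejected)
```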
Note that [^02] is a negated character class that matches any character other than 0 and 2, and it does not have to be a digit, BTW.
You may use a (?!02$) negative lookahead after C to reject strings that end with exactly 02 while still matching any other number (including 0210, etc.):
/^\$ABC(?!02$)[0-9]+$/
See this regex demo
If you do not want to match numbers after C that start with 02, remove $ from the lookahead:
/^\$ABC(?!02)[0-9]+$/
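The two lookahead variants only differ on inputs such as $ABC0210. A small Python comparison, with sample strings and variable names chosen here for illustration:

```python
import re

# Rejects only a number that is exactly 02.
exactly_02 = re.compile(r"^\$ABC(?!02$)[0-9]+$")
# Rejects any number that starts with 02.
starts_with_02 = re.compile(r"^\$ABC(?!02)[0-9]+$")

samples = ["$ABC01", "$ABC02", "$ABC0210", "$ABC12"]
table = {
    s: (bool(exactly_02.search(s)), bool(starts_with_02.search(s)))
    for s in samples
}

for s, (first, second) in table.items():
    print(s, first, second)
```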
See another regex demo.
It looks like you just need to eliminate one specific line from your file. Rather than writing a fancy regex pattern that matches $ABC followed by anything but 02, you can simply use the language to discard the given string. Something like print if not /\$ABC02/.
In a Perl one-line command, this checks whether the whole line matches the value and skips it if so:
perl -ne 'print unless /^\$ABC02$/' myfile

Regex to match words after dot until a whitespace occurs

Given the following string
span.a.b this.is.really.confusing
I need to return the matches a and b. I've been able to get close with the following regex:
(?<=\.)[\w]+
But it's also matching is, really, and confusing. When I include a positive lookahead I get even closer, but I'm still not there.
(?<=\.)[\w]+(?=\s) # matches b, confusing
How can I match words after a dot until a whitespace occurs?
NB: this is language agnostic pseudo-code, but should work.
regex = "^[^\s.]+\.(\S+).*"
targets = <extracted_group>.split(".")
Regex explanation:
"^": begins with
"[^\s.]+\.": 1 or more non-whitespace, non-period characters, followed by a literal period
"(\S+)": group and capture all of the following non-whitespace characters
".*": matches 0 or more of any non-newline character
If the split function takes a regex instead of a string, you'll need to escape the '.' or use a character class.
NB: You can do it without the split, but I think that the split is more transparent.
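As one possible concrete rendering of the pseudo-code, here is a Python sketch. It assumes the period after the first word is meant literally and drops the trailing .* because re.match only needs to match a prefix; the variable names are illustrative.

```python
import re

line = "span.a.b this.is.really.confusing"

# First word up to the first period, then capture the rest of that token.
m = re.match(r"[^\s.]+\.(\S+)", line)

# str.split takes a literal string here, so the '.' needs no escaping.
targets = m.group(1).split(".") if m else []

print(targets)
```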
I am not sure if this is good enough for all your possible cases, but it should work with the provided example:
\.([\w]+)\.([\w]+)\s
$1 = a, $2 = b

Regular expression to match phone number?

I want to match a phone number that can have letters and an optional hyphen:
This is valid: 333-WELL
This is also valid: 4URGENT
In other words, there can be at most one hyphen but if there is no hyphen, there can be at most seven 0-9 or A-Z characters.
I don't know how to do an "if statement" in a regex. Is that even possible?
I think this should do it:
/^[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}$/
It matches 3 letters or numbers followed by an optional hyphen followed by 4 letters or numbers. This one works in ruby. Depending on the regex engine you're using you may need to alter it slightly.
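The same pattern can be tried in Python as a quick sanity check. This is only a sketch; the sample inputs beyond the two from the question are made up to show what the pattern rejects.

```python
import re

# Three alphanumerics, an optional hyphen, then four alphanumerics.
PHONE = re.compile(r"^[a-zA-Z0-9]{3}-?[a-zA-Z0-9]{4}$")

samples = ["333-WELL", "4URGENT", "33-3WELL", "333--WELL", "333WELLS"]
checks = {s: bool(PHONE.search(s)) for s in samples}

for s, ok in checks.items():
    print(s, ok)
```

With this pattern the hyphen is only accepted between the third and fourth character.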
You seek the alternation operator, indicated with pipe character: |
However, you may need either 7 alternatives (1 for each hyphen location + 1 for no hyphen), or you may require the hyphen between 3rd and 4th character and use 2 alternatives.
One use of alternation operator defines two alternatives, as in:
([0-9A-Za-z]{3}-[0-9A-Za-z]{4}|[0-9A-Za-z]{7})
Not sure if this counts, but I'd break it into two regexes:
#!/usr/bin/perl
use strict;
use warnings;
my $text = '333-URGE';
print "Format OK\n" if $text =~ m/^[\dA-Z]{1,6}-?[\dA-Z]{1,6}$/;
print "Length OK\n" if $text =~ m/^(?:[\dA-Z]{7}|[\dA-Z-]{8})$/;
This should avoid accepting multiple dashes, dashes in the wrong place, etc...
Supposing that you want to allow the hyphen to be anywhere, lookarounds will be of use to you. Something like this:
^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$
There are two main parts to this pattern: [A-Z0-9]{7} to match a hyphen-free string and (?=^[^-]+-[^-]+$)[A-Z0-9-]{8} to match a hyphenated string.
The (?=^[^-]+-[^-]+$) will match for any string with a SINGLE hyphen in it (and the hyphen isn't the first or last character), then the [A-Z0-9-]{8} part will count the characters and make sure they are all valid.
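A brief Python sketch of how this lookahead pattern behaves on a few inputs. The sample strings other than the two from the question are assumptions chosen to exercise the single-hyphen rule.

```python
import re

# Either 7 hyphen-free characters, or 8 characters containing exactly one
# hyphen that is neither first nor last (enforced by the lookahead).
PHONE_ANY = re.compile(r"^([A-Z0-9]{7}|(?=^[^-]+-[^-]+$)[A-Z0-9-]{8})$")

def accepts(text):
    """Return True when the pattern matches the whole string."""
    return PHONE_ANY.search(text) is not None

for sample in ["333-WELL", "4URGENT", "33-3WELL", "-333WELL", "333WELL-", "33-3-ELL"]:
    print(sample, accepts(sample))
```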
Thank you Heath Hunnicutt for his alternation operator answer as well as showing me an example.
Based on his advice, here's my answer:
[A-Z0-9]{7}|[A-Z0-9][A-Z0-9-]{7}
Note: I tested my regex here. (Just including this for reference)