Regular Expression to match fractions and not dates - regex

I'm trying to come up with a regular expression that will match a fraction (1/2) but not a date (5/5/2005) within a string. Any help at all would be great, all I've been able to come up with is (\d+)/(\d+) which finds matches in both strings. Thanks in advance for the help.

Assuming PCRE, use negative lookahead and lookbehind:
(?<![\/\d])(\d+)\/(\d+)(?![\/\d])
A lookahead (a (?=) group) says "match this stuff if it's followed by this other stuff." The contents of the lookahead aren't part of the match. We negate it (the (?!) group) so that the match fails when our fraction is followed by a slash or another digit - that way, we don't match the first two parts of a date.
The complement to a lookahead is a lookbehind (a (?<=) group), which does the opposite - it matches stuff if it's preceded by this other stuff. Just like the lookahead, we can negate it (the (?<!) group) so that we only match things that don't follow something.
Together, they ensure that our fraction doesn't have other parts of fractions before or after it. It places no other arbitrary requirements on the input data. It will match the fraction 2/3 in the string "te2/3xt", unlike most of the other examples provided.
If your regex flavor uses //s to delimit regular expressions, you'll have to escape the slashes in that, or use a different delimiter (Perl's m{} would be a good choice here).
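As a quick sanity check, here is a minimal Python sketch of the same pattern. Python's re module accepts the same lookaround syntax; the names FRACTION and find_fractions and the sample strings are only illustrative, not part of the original answer.

```python
import re

# Same pattern as above; no escaping is needed because Python has no / delimiters.
FRACTION = re.compile(r"(?<![/\d])(\d+)/(\d+)(?![/\d])")

def find_fractions(text):
    """Return (numerator, denominator) pairs that are not part of a longer a/b/c run."""
    return FRACTION.findall(text)

print(find_fractions("use 1/2 cup"))   # [('1', '2')]
print(find_fractions("due 5/5/2005"))  # []
print(find_fractions("te2/3xt"))       # [('2', '3')]
```

Including \d in both lookarounds is what stops the engine from backtracking to a shorter number and matching part of a date.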
Edit: Apparently, none of these regexes work because the regex engine is backtracking and matching fewer numbers in order to satisfy the requirements of the regex. When I've been working on one regex for this long, I sit back and decide that maybe one giant regex is not the answer, and I write a function that uses a regex and a few other tools to do it for me. You've said you're using Ruby. This works for me:
>> def get_fraction(s)
>> if s =~ /(\d+)\/(\d+)(\/\d+)?/
>> if $3 == nil
>> return $1, $2
>> end
>> end
>> return nil
>> end
=> nil
>> get_fraction("1/2")
=> ["1", "2"]
>> get_fraction("1/2/3")
=> nil
This function returns the two parts of the fraction, but returns nil if it's a date (or if there's no fraction). It fails for "1/2/3 and 4/5" but I don't know if you want (or need) that to pass. In any case, I recommend that, in the future, when you ask on Stack Overflow, "How do I make a regex to match this?" you should step back first and see if you can do it using a regex and a little extra. Regular expressions are a great tool and can do a lot, but they don't always need to be used alone.
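For what it's worth, the "regex plus a little extra" idea can also be sketched so that it copes with several candidates in one string. This is a rough Python illustration, not the Ruby function above; the function name is made up, and it assumes a "fraction" is any maximal run of slash-separated numbers with exactly two parts.

```python
import re

def fractions_by_filtering(text):
    """Collect every slash-separated run of numbers, then keep only the two-part ones."""
    results = []
    # A run is digits followed by one or more "/digits" groups, e.g. "1/2" or "5/5/2005".
    for run in re.findall(r"\d+(?:/\d+)+", text):
        parts = run.split("/")
        if len(parts) == 2:  # three or more parts looks like a date, so skip it
            results.append((parts[0], parts[1]))
    return results

print(fractions_by_filtering("1/2/3 and 4/5"))  # [('4', '5')]
```

The regex only finds candidates; the decision about what counts as a fraction lives in ordinary code, which is easier to adjust later.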
EDIT 2:
I figured out how to solve the problem without resorting to non-regex code, and updated the regex. It should work as expected now, though I haven't tested it. I also went ahead and escaped the /s since you're going to have to do it anyway.
EDIT 3:
I just fixed the bug j_random_hacker pointed out in my lookahead and lookbehind. I continue to see the amount of effort being put into this regex as proof that a pure regex solution was not necessarily the optimal solution to this problem.

Use negative lookahead and lookbehind.
/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/
EDIT: I've fixed my answer to trap for the backtracking bug identified by @j_random_hacker. As proof, I offer the following quick and dirty PHP script:
<?php
$subject = "The match should include 1/2 but not 12/34/56 but 11/23, now that's ok.";
$matches = array();
preg_match_all('/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/', $subject, $matches);
var_dump($matches);
?>
which outputs:
array(1) {
[0]=>
array(2) {
[0]=>
string(3) "1/2"
[1]=>
string(5) "11/23"
}
}

Lookahead is great if you're using Perl or PCRE, but if they are unavailable in the regex engine you're using, you can use:
(^|[^/\d])(\d+)/(\d+)($|[^/\d])
The 2nd and 3rd captured segments will be the numerator and denominator.
If you do use the above in a Perl regex, remember to escape the /s -- or use a different delimiter, e.g.:
m!(?:^|[^/\d])(\d+)/(\d+)(?:$|[^/\d])!
In this case, you can use (?:...) to avoid saving the uninteresting parenthesised parts.
EDIT 18/12/2009: Chris Lutz noticed a tricky bug caused by backtracking that plagues most of these answers -- I believe this is now fixed in mine.
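Here is a small Python sketch of the lookaround-free version, just to show where the captures end up. The names NO_LOOKAROUND and first_fraction are hypothetical, and Python is used only for illustration since it does support lookarounds.

```python
import re

# The boundary characters are consumed rather than asserted, so groups 2 and 3
# hold the numerator and denominator.
NO_LOOKAROUND = re.compile(r"(^|[^/\d])(\d+)/(\d+)($|[^/\d])")

def first_fraction(text):
    """Return the first (numerator, denominator) pair, or None if there is none."""
    m = NO_LOOKAROUND.search(text)
    return (m.group(2), m.group(3)) if m else None

print(first_fraction("add 1/2 cup"))   # ('1', '2')
print(first_fraction("on 5/5/2005"))   # None
```

One trade-off of this approach: because the boundary character is part of the match, two fractions separated by a single character (as in "1/2 3/4") compete for it, and a global search only reports the first.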

If the input is a single line, you can try
^(\d+)\/(\d+)$
Otherwise, perhaps use this:
^(\d+)\/(\d+)[^\\]*.

this will work: (?<![/]{1})\d+/\d+(?![/]{1})

Depending on the language you're working with, you might try negative lookahead or lookbehind assertions: in Perl, (?!pattern) asserts that /pattern/ can't follow the matched string.
Or, again, depending on the language, and anything you know about the context, a word-boundary match (\b in Perl) might be appropriate.

Related

How do you match a pattern skipping exceptions?

In vim, I'd like to match a regular expression in a search and replace operation, but with exceptions — a list of matches that I want to skip.
For example, suppose I have the text:
-one- lorem ipsum -two- blah blah -three- now is the time -four- the quick brown -five- etc. etc.
(but with lots of other possibilities) and I want to match -\(\w\+\)- and replace it with *\1* but skipping over (not matching) -two- and -four-, so the result would be:
*one* lorem ipsum -two- blah blah *three* now is the time -four- the quick brown *five* etc. etc.
It seems like I should be able to use some kind of assertion (lookbehind, lookahead, something) for this, but I'm coming up blank.
You're looking for a negative lookahead assertion. In Vim, that's done via :help /\@!, like (?!pattern) in Perl.
Basically, you say don't match FOO here, and in general match word characters:
/-\(\%(FOO\)\@!\w\+\)-/
Note how I'm using non-capturing groups (:help /\%(). What's still missing is an assertion on the end, so the above would also exclude -FOOBAR-. As we have a unique end delimiter here, it's easiest to append that:
/-\(\%(FOO-\)\@!\w\+\)-/
Applied to your example, you just need to introduce two branches (for the two exclusions) in place of FOO, and you're done:
/-\(\%(two-\|four-\)\@!\w\+\)-/
Or, by factoring out the duplicated end delimiter:
/-\(\%(\%(two\|four\)-\)\@!\w\+\)-/
This matches any word characters in between dashes, except if those words form either two or four.
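For readers more at home outside Vim, the same exclusion can be sketched with Perl-style syntax. This Python snippet is only an illustration of the technique; the variable names are arbitrary and the sample text is the one from the question.

```python
import re

text = ("-one- lorem ipsum -two- blah blah -three- now is the time "
        "-four- the quick brown -five- etc. etc.")

# (?!(?:two|four)-) plays the role of the Vim negative lookahead above:
# right after the opening dash, refuse to continue if "two-" or "four-" follows.
result = re.sub(r"-(?!(?:two|four)-)(\w+)-", r"*\1*", text)
print(result)
# *one* lorem ipsum -two- blah blah *three* now is the time -four- the quick brown *five* etc. etc.
```

Keeping the closing dash inside the lookahead is what lets a word such as "twofold" still be replaced.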
The negative lookahead in my other answer is the direct solution, but its syntax is a bit complex (and there can be patterns where the rules for the delimiter are not so simple, and the result then is much less readable).
As you're using substitution, an alternative is to put the selection logic into the replacement part of :substitute. Vim allows a Vimscript expression in there via :help sub-replace-expression.
We have the captured word in submatch(1) (equivalent to \1 in a normal replacement), and now just need to check for the two excluded words; if it's one of those, do a no-op substitution by returning the original full match (submatch(0)), else just return the captured group.
:substitute/-\(\w\+\)-/\=submatch(index(['two', 'four'], submatch(1)) == -1 ? 1 : 0)/g
It's not shorter than the lookahead pattern (well, we could golf the pattern and drop the ternary operator, as a boolean is represented by 0/1, anyway), so here I would still use the pattern. But in general, it's good to know that there's more than one way to do it :-)
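The "put the logic in the replacement" idea is not specific to Vim. As a rough sketch, assuming Python's re.sub with a callback (the names EXCLUDED and star_unless_excluded are made up for this example), it could look like this; note that this version wraps the word in asterisks, as the question asked.

```python
import re

EXCLUDED = {"two", "four"}  # words to leave untouched

def star_unless_excluded(match):
    word = match.group(1)
    # Returning the full match makes the substitution a no-op for excluded words.
    return match.group(0) if word in EXCLUDED else f"*{word}*"

text = ("-one- lorem ipsum -two- blah blah -three- now is the time "
        "-four- the quick brown -five- etc. etc.")
result = re.sub(r"-(\w+)-", star_unless_excluded, text)
print(result)
# *one* lorem ipsum -two- blah blah *three* now is the time -four- the quick brown *five* etc. etc.
```

The pattern stays trivial, and the exclusion list can grow without making the regex harder to read.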

Lookaround assertions in Perl

I'm confused: what is the use of these lookaround assertions in Perl?
For example, this one:
(?=pattern)
or the positive lookahead. So here are my questions:
How are these useful? In what sort of situations are they used?
And related to question 1, why would I want to look ahead of the regex pattern? Isn't it more work, looking ahead and then executing the pattern matching again?
I need a very clear example if possible. Thanks
To uppercase what's in between commas, you could use:
(my $x = 'a,b,c,d,e') =~ s/(?<=,)([^,]*)(?=,)/ uc($1) /eg; # a,B,C,D,e
a,b,c,d,e
  -        Pass 1 matches "b"
    -      Pass 2 matches "c"
      -    Pass 3 matches "d"
If you didn't use lookarounds, this is what you'd get,
(my $x = 'a,b,c,d,e') =~ s/,([^,]*),/ ','.uc($1).',' /eg; # a,B,c,D,e
a,b,c,d,e
 ---       Pass 1 matches ",b,"
     ---   Pass 2 matches ",d,"
Not only does the lookahead avoid repetition, it doesn't work without it!
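The same comparison can be reproduced outside Perl. Here is a minimal Python sketch (the variable names are arbitrary) showing that consuming the commas makes the second attempt skip every other field:

```python
import re

s = "a,b,c,d,e"

# Lookarounds: the commas are checked but not consumed, so every inner field matches.
with_lookaround = re.sub(r"(?<=,)([^,]*)(?=,)", lambda m: m.group(1).upper(), s)

# No lookarounds: each match swallows both commas, so the next field has
# no leading comma left to match against.
without_lookaround = re.sub(r",([^,]*),", lambda m: "," + m.group(1).upper() + ",", s)

print(with_lookaround)     # a,B,C,D,e
print(without_lookaround)  # a,B,c,D,e
```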
Another somewhat common use is as part of a string equivalent to [^CHAR].
foo(?:(?!foo|bar).)*bar # foo..bar, with no nested foo or bar
You can use it to narrow down character classes.
\w(?<!\d) # A word char that's not a digit.
Although this can now be done using (?[ ... ]).
It's also useful in more esoteric patterns.
/a/ && /b/ && /c/
can be written as
/^(?=.*?a)(?=.*?b).*?c/s
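To see that the single pattern really behaves like the three separate tests, here is a small Python sketch. The names ALL_THREE and has_a_b_c are illustrative, and re.DOTALL stands in for Perl's /s flag.

```python
import re

# Each lookahead scans from the start of the string without consuming anything,
# so the three letters may appear in any order.
ALL_THREE = re.compile(r"^(?=.*?a)(?=.*?b).*?c", re.DOTALL)

def has_a_b_c(s):
    return ALL_THREE.search(s) is not None

for sample in ["cab", "c\nb\na", "ab", "xyz"]:
    # Compare against the plain "a in s and b in s and c in s" formulation.
    print(repr(sample), has_a_b_c(sample), all(ch in sample for ch in "abc"))
```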
Lookahead lets you check for a pattern without actually matching it.
When you use a(?=b), you match a only if it's followed by b. Note: it doesn't match b.
So,
1> You can extract hello (without the #) from #hello# using
(?<=#)hello(?=#)
2> You can validate passwords with requirements such as "a password must have at least 2 digits and at least 2 letters, along with any other characters":
^(?=(.*\d){2})(?=(.*[a-z]){2}).*$
Try doing the above without lookahead and you will realize its importance.
I have found lookaheads especially useful for checking multiple conditions. For example, consider a regex that checks that a password has at least one lowercase, one uppercase, one numeric, and one symbol character, and is at least 8 characters in length:
^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9]).{8,}$
Try to devise a regex to do the same thing without lookahead assertions! It's possible, but it's extremely cumbersome.
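As a quick illustration of how the stacked lookaheads behave, here is a Python sketch using the pattern above unchanged. The function name and the sample passwords are made up for this example.

```python
import re

# Four independent conditions, each checked from the start of the string,
# followed by the length requirement.
PASSWORD = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[^a-zA-Z0-9]).{8,}$")

def is_acceptable(password):
    return PASSWORD.search(password) is not None

print(is_acceptable("Passw0rd!"))   # True
print(is_acceptable("password"))    # False: no uppercase, digit, or symbol
print(is_acceptable("Sh0rt!"))      # False: fewer than 8 characters
```

Each condition can be added or removed without touching the others, which is the main appeal over one hand-rolled alternation.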
Meanwhile, I've found lookbehinds especially useful for checking boundary conditions—that is, for example, matching a string of 0's, unless it's preceded by another number, like 1000067.
These are my experiences but certainly there are many more practical uses and the way everyone uses a tool can vary from person to person.
There are many reasons to use lookarounds, e.g.
limiting the substring that is considered to be matched: s/(?<=[0-9])\+(?=[0-9])/-/ instead of s/([0-9])\+([0-9])/$1-$2/.
and-ing various conditions together: /(?=\p{Uppercase}\p{Lowercase})\p{InBasicLatin}{2,}/.
Lookaround assertions are useful when you need a pattern to help locate the match but you don't want the pattern to be part of what is captured.
Here's a simple scenario with lookahead assertion:
Let's say I have
my $text = '98 degrees, 99 Red Balloons, 101 Dalmatians'
and I want to change the number of red balloons from its previous value to 9001, so I use
$text =~ s/\d+(?= Red Balloons)/9001/;
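The same substitution can be tried in Python as a sanity check. This is only a sketch; note that the lookahead has to include the space that separates the number from the words, otherwise nothing matches.

```python
import re

text = "98 degrees, 99 Red Balloons, 101 Dalmatians"

# Only the digits are replaced; " Red Balloons" is checked but left in place.
updated = re.sub(r"\d+(?= Red Balloons)", "9001", text)
print(updated)  # 98 degrees, 9001 Red Balloons, 101 Dalmatians
```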

Regex string transformation/extraction

Code:
https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg
How can I get 589944494365122 out of that string using regex?
The best I can do so far is _(.*) resulting 589944494365122_1446403980_n.jpg
First, you should generalize your problem description, like this: how can I get the longest non-empty substring of digits after the first _ in the string? The regexp you literally asked for is (589944494365122), but that's not what you expect.
According to my guess about what you want, the answer could be _(\d+).
The rule of extraction I can see in your input is:
211099_589944494365122_1446403980
[0-9]+_ part we want  _[0-9]+
so a regex with look-behind and look-ahead will help:
'(?<=\d_)\d+(?=_\d)'
test with grep:
kent$ echo " https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"|grep -Po '(?<=\d_)\d+(?=_\d)'
589944494365122
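If grep -P is not at hand, the same two ideas (lookarounds versus a plain capture group) can be compared in a short Python sketch. The variable names are arbitrary and the URL is the one from the question.

```python
import re

url = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg"

# Lookbehind/lookahead: the whole match is the number we want.
by_lookaround = re.search(r"(?<=\d_)\d+(?=_\d)", url).group()

# Capture group: match the underscore too, but only keep group 1.
by_capture = re.search(r"_(\d+)", url).group(1)

print(by_lookaround)  # 589944494365122
print(by_capture)     # 589944494365122
```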
This works:
var s = "https://aaa.bbb.net/ccc/211099_589944494365122_1446403980_n.jpg";
var m = /_([^_]*)/.exec(s);
console.log( m[1] ); // 589944494365122
I would go with \d+_(\d+)_\d+_n\.jpg, but depending on the exact specification of the URL this may need a little bit of tweaking.
Also depending on the language, this may need to be altered a little bit. The solution I suggest will work for instance in Ruby (as well as many other regex implementations). Here \d matches any digit and \d+ means one or more digits. I assume the letter before .jpg is always n but you may change this by replacing n with either . (any character) or \w (any word character).

TextMate: Regex replacing $1 with following 0

I'm trying to fix a file full of 1- and 2-digit numbers to make them all 2 digits long.
The file is of the form:
10,5,2
2,4,5
7,7,12
...
I've managed to match the problem numbers with:
(^|,)(\d)(,|$)
All I want to do now is replace the offending string with:
${1}0$2$3
but TextMate gives me:
10${1}05,2
Any ideas?
Thanks in advance,
Ross
According to this, TextMate supports word boundary anchors, so you could also search for \b\d\b and replace all with 0$0. (Thanks to Peter Boughton for the suggestion!)
This has the advantage of catching all the numbers in one go - your solution will have to be applied at least twice because the regex engine has already consumed the comma before the next number after a successful replace.
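The "applied at least twice" point is easy to reproduce. Below is a rough Python sketch of the question's pattern, not TextMate itself; \g<1>0 is Python's way of writing "group 1 followed by a literal 0", and the names PAD and pad_once are invented for the example.

```python
import re

PAD = re.compile(r"(^|,)(\d)(,|$)", re.MULTILINE)

def pad_once(line):
    # \g<1>0 means group 1 then a literal 0; a bare \10 would be read as group 10.
    return PAD.sub(r"\g<1>0\2\3", line)

first = pad_once("2,4,5")
second = pad_once(first)
print(first)   # 02,4,05  (the comma after "2" was consumed, so "4" was skipped)
print(second)  # 02,04,05
```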
Note: Tim's solution is simpler and solves this problem, but I'll leave this here for reference, in case someone has a similar but more complex problem, which using lookarounds can support.
A simpler way than your expression is to replace:
(?<!\d)\d(?!\d)
With:
0$0
Which is "replace all single digits with 0 then itself".
The regex is:
Negative lookbehind to not find a digit (?<!\d)
A single digit: \d
Negative lookahead to not find a digit (?!\d)
Since this is a positional match (not a character match), it caters for both comma and start/end positions.
The $0 part says "entire match" - since the lookbehind/ahead match positions, this will contain the single digit that was matched.
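For comparison, here is the same find-and-replace as a Python sketch, run on the sample data from the question. In Python the whole match is written \g<0> rather than $0; the variable names are arbitrary.

```python
import re

data = "10,5,2\n2,4,5\n7,7,12"

# A single digit with no digit on either side; nothing else is consumed,
# so one pass is enough even for neighbouring fields.
padded = re.sub(r"(?<!\d)\d(?!\d)", r"0\g<0>", data)
print(padded)
# 10,05,02
# 02,04,05
# 07,07,12
```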
To anyone coming here, as @Amarghosh suggested, it's a bug, or intentional behavior that leads to problems if nothing else.
I just had this problem and had to use the following workaround: If you set up another capture group, and then use a conditional insertion, it will work. For example, I had a string like <WebObject name=Frage01 and wanted to replace the 01 with 02, so I captured the main string in $1 and the end number in $2, which gave me a regex that looked like (<WebObject name=(Frage|Antwort))(01).
Then the replace was $1(?2:02).
The (?2:02) is the conditional insertion, and in this instance will always find something, but it was necessary in order to work around the odd conundrum of appending a number to the end of $n. Hope that helps someone. There is documentation on the conditional insertion here
In TextMate 1.5.11 (1635) ${1} does not work (like the OP described).
I appreciate the many suggestions re altering the query string, however there is a much simpler solution, if you want to break between a capture group and a number: \u.
It is a TextMate specific replacement syntax, that converts the following character to uppercase. As there is no uppercase for numbers, it does nothing and moves on. It is described in the link from Tim Pietzcker's answer.
In my case I had to clean up a csv file, where box measurements were given in cm x cm x mm. Thus I had to add a zero to the first two numbers.
Text: "80 x 40 x 5 mm"
Desired text: "800 x 400 x 5 mm"
Find: (\d+) x (\d+) x (\d+)
Replace: $1\u0 x $2\u0 x $3 mm
Regarding the support of more than 10 capture groups, I do not know if this is a bug. But as OP and @rossmcf wrote, $10 is replaced with null.
You don't need ${1}: replace strings support only up to nine groups, so $1 followed by 0 won't be mistaken for $10.
Replace with $10$2$3

Regex with exception of particular words

I have problem with regex.
I need to make regex with an exception of a set of specified words, for example: apple, orange, juice.
and given these words, it will match everything except those words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)
If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/
I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So - let's build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> - _ or
// between a-z 0-9 A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word or the end of string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when it is preceded and followed by a
// 'non-word' character (or the start/end of the string); just negate the result of
// the test to obtain your desired result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...
This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)
\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines
^(?!apple$|juice$|orange$).*$
will also work.
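To check the pattern against the list in the question, here is a short Python sketch. The name ALLOWED is made up, and the three alternatives are grouped so that \Z is written only once; otherwise it is the same negative lookahead.

```python
import re

# Reject the string only if one of the forbidden words is the entire string.
ALLOWED = re.compile(r"\A(?!(?:apple|orange|juice)\Z).*\Z")

should_match = ["applejuice", "yummyjuice", "yummy-apple-juice", "orangeapplejuice",
                "orange-apple-juice", "apple-orange-aple", "juice-juice-juice",
                "orange-juice"]
should_not_match = ["apple", "orange", "juice"]

print(all(ALLOWED.search(s) for s in should_match))          # True
print(any(ALLOWED.search(s) for s in should_not_match))      # False
```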
Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means it's not allowed to backtrack: if that group matches once but later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers. Note that the atomic pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because it's looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, this means the atomic group fails, and because it's atomic that means it's not allowed to backtrack (to try to look for the normal pattern) because it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject other patterns, such as any word containing the substring "apple", "orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Something like (PHP)
$input = "The orange apple gave juice";
if(preg_match("your regex for validating", $input) && !preg_match("/apple|orange|juice/", $input))
{
// it's ok;
}
else
{
//throw validation error
}