Regex to match sloppy fractions / mixed numbers

I have a series of text strings that contain mixed numbers (i.e., a whole part and a fractional part). The problem is that the text is full of human-coded sloppiness:
The whole part may or may not exist (ex: "10")
The fractional part may or may not exist (ex: "1/3")
The two parts may be separated by spaces and/or a hyphen (ex: "10 1/3", "10-1/3", "10 - 1/3").
The fraction itself may or may not have spaces between the number and the slash (ex: "1 /3", "1/ 3", "1 / 3").
There may be other text after the fraction that needs to be ignored
I need a regex that can parse these elements so that I can create a proper number out of this mess.

Here's a regex that will handle all of the data I can throw at it:
(\d++(?! */))? *-? *(?:(\d+) */ *(\d+))?.*$
This will put the digits into the following groups:
The whole part of the mixed number, if it exists
The numerator, if a fraction exists
The denominator, if a fraction exists
Also, here's the RegexBuddy explanation for the elements (which helped me immensely when constructing it):
Match the regular expression below and capture its match into backreference number 1 «(\d++(?! */))?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single digit 0..9 «\d++»
Between one and unlimited times, as many times as possible, without giving back (possessive) «++»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?! */)»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “/” literally «/»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “-” literally «-?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below «(?:(\d+) */ *(\d+))?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the regular expression below and capture its match into backreference number 2 «(\d+)»
Match a single digit 0..9 «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “/” literally «/»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below and capture its match into backreference number 3 «(\d+)»
Match a single digit 0..9 «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
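
As a rough sketch of how the three groups could be turned into an actual number, here is a Python version. It is not the poster's code: the name parse_mixed and the use of fractions.Fraction are assumptions made for illustration, and the possessive \d++ is replaced by \d+(?!\d| */) so that the pattern also runs on Python versions before 3.11, which lack possessive quantifiers.

```python
import re
from fractions import Fraction

# Same three groups as the regex above. The possessive \d++ is emulated by
# also forbidding a following digit, so the whole part cannot be shortened.
MIXED = re.compile(r" *(\d+(?!\d| */))? *-? *(?:(\d+) */ *(\d+))?")


def parse_mixed(text):
    """Return the value of a sloppy mixed number as a Fraction, or None."""
    whole, numerator, denominator = MIXED.match(text).groups()
    if whole is None and numerator is None:
        return None  # no digits at the start of the text
    value = Fraction(int(whole or 0))
    if numerator is not None:
        # Raises ZeroDivisionError if the denominator is 0.
        value += Fraction(int(numerator), int(denominator))
    return value


for sample in ["10", "1/3", "1 / 3", "10 1/3", "10-1/3", "10 - 1/3 cups"]:
    print(repr(sample), "->", parse_mixed(sample))
```

Trailing text is simply left unmatched here, which plays the role of the .*$ at the end of the original pattern.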

I think it may be easier to tackle the different cases (full mixed, fraction only, number only) separately from each other. For example:
sub parse_mixed {
my($mixed) = @_;
if($mixed =~ /^ *(\d+)[- ]+(\d+) *\/ *(\d+)(\D.*)?$/) {
return $1+$2/$3;
} elsif($mixed =~ /^ *(\d+) *\/ *(\d+)(\D.*)?$/) {
return $1/$2;
} elsif($mixed =~ /^ *(\d+)(\D.*)?$/) {
return $1;
}
}
print parse_mixed("10"), "\n";
print parse_mixed("1/3"), "\n";
print parse_mixed("1 / 3"), "\n";
print parse_mixed("10 1/3"), "\n";
print parse_mixed("10-1/3"), "\n";
print parse_mixed("10 - 1/3"), "\n";

If you are using Perl 5.10 or later (which supports named captures), this is how I would write it.
m{
^
\s* # skip leading spaces
(?'whole'
\d++
(?! \s*[\/] ) # there should not be a slash immediately following a whole number
)? # the whole part is optional
\s*
(?: # the rest should fail or succeed as a group
-? # optional hyphen separating the whole part from the fraction
\s*
(?'numerator'
\d+
)
\s*
[\/]
\s*
(?'denominator'
\d+
)
)?
}x
Then you can access the values from the %+ variable like this:
$+{whole};
$+{numerator};
$+{denominator};
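
For comparison, the same idea can be sketched in Python, which spells named groups as (?P<name>...) and enables free-spacing with re.VERBOSE. The name MIXED_VERBOSE is made up for this example, the whole part is treated as optional, and the possessive \d++ is approximated with an extra "no digit follows" check, so treat this as an illustration rather than an exact port.

```python
import re

MIXED_VERBOSE = re.compile(r"""
    ^\s*                      # skip leading spaces
    (?P<whole>
        \d+
        (?! \d | \s*/ )       # no further digit and no slash may follow
    )?
    \s*
    (?:                       # the fraction fails or succeeds as a group
        -?\s*                 # optional hyphen separator
        (?P<numerator>\d+)
        \s*/\s*
        (?P<denominator>\d+)
    )?
""", re.VERBOSE)

match = MIXED_VERBOSE.match("10 - 1/3 cups")
parts = match.groupdict()  # groups that did not take part map to None
print(parts["whole"], parts["numerator"], parts["denominator"])
```

groupdict() plays roughly the role of Perl's %+ here, except that groups which did not participate are present with the value None.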

regex in Perl to replace content containing double equal signs

I need a regex in Perl to turn this:
(== doc_url html/arbitrary_file_name.html ==)
into this:
(/doc_assets/legacy/html/arbitrary_file_name.html)
I've tried all kinds of things. My current attempt looks like this:
$content =~ s!\=\= doc_url ([\w\W]+?)\=\=!/doc_assets/legacy/$1!gis;
(In this particular attempt, I'm just letting the enclosing parentheses remain, since that doesn't change from the input to the output.)
Anyway, nothing is working for me. I assume it's the == throwing things off. Any help will be greatly appreciated.
I guess you need something like:
s!.*?doc_url (.*?/.*?) .*!(/doc_assets/legacy/$1)!sg
i.e.:
#!/usr/bin/perl
$subject = "(== doc_url html/arbitrary_file_name.html ==)";
$subject =~ s!.*?doc_url (.*?/.*?) .*!(/doc_assets/legacy/$1)!sg;
print $subject;
#(/doc_assets/legacy/html/arbitrary_file_name.html)
Regex Explanation:
.*?doc_url (.*?/.*?) .*
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ don’t match at line breaks; Numbered capture
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “doc_url ” literally (case sensitive) «doc_url »
Match the regex below and capture its match into backreference number 1 «(.*?/.*?)»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “/” literally «/»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “ ” literally « »
Match any single character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
(/doc_assets/legacy/$1)
Insert the character string “(/doc_assets/legacy/” literally «(/doc_assets/legacy/»
Insert the text that was last matched by capturing group number 1 «$1»
Insert the character “)” literally «)»
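
One thing to keep in mind: because the pattern above starts with .*? and ends with .* under the s flag, it also consumes any text around the marker. If the markers are embedded in a larger document, a tighter pattern may be preferable. The following Python sketch shows one way to do that; the names DOC_URL and rewrite_doc_urls are invented for the example, and it assumes the path itself contains no whitespace.

```python
import re

# Match the whole "(== doc_url PATH ==)" marker and capture only PATH.
DOC_URL = re.compile(r"\(==\s*doc_url\s+(\S+?)\s*==\)")


def rewrite_doc_urls(content):
    """Rewrite every doc_url marker, leaving surrounding text untouched."""
    return DOC_URL.sub(r"(/doc_assets/legacy/\1)", content)


print(rewrite_doc_urls("(== doc_url html/arbitrary_file_name.html ==)"))
```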

Regex for not matching a string

I have a URL:
/ice-cream/stuff/sandwich/banana
I want to write a regular expression that ONLY matches the URL if these conditions are met:
"ice-cream" is in the URL
"sandwich" is in the URL and comes after "ice-cream"
"banana" is NOT in the URL
I tried this:
ice-cream.sandwich.^[(banana)] as well as many others but haven't found the solution.
Any help is appreciated.
Try one of the regexes below:
^(?!.*banana.*).*?ice-cream.*?sandwich.*$
OR
^(?!.*banana.*)(?:(?!sandwich).)*ice-cream.*?sandwich.*$
Explanation:
^ Asserts that we are at the beginning of the line.
(?!.*banana.*) Negative lookahead that checks whether the line contains the string banana. If it does not, matching continues from the start of the line; otherwise the line is rejected.
(?:(?!sandwich).)* Matches characters one at a time, as long as the string sandwich does not start at the current position.
ice-cream.*?sandwich.* The string sandwich must come after the string ice-cream.
$ End of the line.
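
To see the three conditions in action, here is a small Python check built on the first pattern. The function name url_allowed and the sample URLs other than the one from the question are made up for illustration.

```python
import re

# No "banana" anywhere, and "ice-cream" somewhere before "sandwich".
URL_RULE = re.compile(r"^(?!.*banana).*?ice-cream.*?sandwich.*$")


def url_allowed(url):
    """Return True if the URL satisfies all three conditions."""
    return URL_RULE.search(url) is not None


for url in [
    "/ice-cream/stuff/sandwich/banana",
    "/ice-cream/stuff/sandwich/apple",
    "/sandwich/stuff/ice-cream",
    "/banana/ice-cream/sandwich",
]:
    print(url, url_allowed(url))
```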
Hard to be precise without examples of matches and non-matches, but give this a try:
^(?!.*banana)(?:(?!.*sandwich(?=.*ice-cream))).*ice-cream.*sandwich.*$
Explanation of Regex:
^(?!.*banana)(?:(?!.*sandwich(?=.*ice-cream))).*ice-cream.*sandwich.*$
Options: Case insensitive; Exact spacing; Dot doesn't match line breaks; ^$ match at line breaks; Default line breaks
Assert position at the beginning of a line «^»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*banana)»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “banana” literally «banana»
Match the regular expression below «(?:(?!.*sandwich(?=.*ice-cream)))»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*sandwich(?=.*ice-cream))»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “sandwich” literally «sandwich»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*ice-cream)»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “ice-cream” literally «ice-cream»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “ice-cream” literally «ice-cream»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “sandwich” literally «sandwich»
Match any single character that is NOT a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of a line «$»
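Try this one:
.*ice-cream.+sandwich.(?!banana).*
Note that the lookahead here is checked at only one position (right after the character that follows sandwich), so this does not reject banana elsewhere in the URL.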

Regex for 2 items but with one exclusion

I am building a RegEx that needs to find lines that have either:
DateTime.Now
or
Date.Now
But cannot have the literal "SystemDateTime" on the same line.
I started with this (DateTime\.Now|Date\.Now) but now I am stuck with where to put the "SystemDateTime"
Use this, assuming you are not using the /s modifier (DOTALL), which makes the dot (.) match newline characters:
(?!.*SystemDateTime)(DateTime\.Now|Date\.Now)
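(?!.*SystemDateTime) asserts that SystemDateTime does not appear anywhere ahead of the current position on the line. Because the lookahead is not anchored, text before the match is not checked; to exclude the whole line, anchor it at the start of the line: ^(?!.*SystemDateTime).*(DateTime\.Now|Date\.Now)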
You could use negative lookahead like this:
(?!.*SystemDateTime)\bDate(?:Time)?\.Now\b
/(?!.*SystemDateTime)Date(?:Time)?\.Now/
EXPLANATION:
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*SystemDateTime)»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the characters “SystemDateTime” literally «SystemDateTime»
Match the characters “Date” literally «Date»
Match the regular expression below «(?:Time)?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the characters “Time” literally «Time»
Match the character “.” literally «\.»
Match the characters “Now” literally «Now»
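
A negative lookahead only inspects text from the current position onward, so where it sits matters. The Python sketch below contrasts the unanchored form with a variant anchored at the start of the line; the names UNANCHORED, ANCHORED and should_flag, as well as the sample lines, are assumptions made for this example.

```python
import re

# Lookahead evaluated at the position where "Date" starts.
UNANCHORED = re.compile(r"(?!.*SystemDateTime)Date(?:Time)?\.Now")

# Lookahead evaluated once at the start of the line, covering all of it.
ANCHORED = re.compile(r"^(?!.*SystemDateTime).*\bDate(?:Time)?\.Now\b")


def should_flag(line):
    """True if the line uses Date.Now or DateTime.Now without SystemDateTime."""
    return ANCHORED.search(line) is not None


lines = [
    "var t = DateTime.Now;",
    "var t = SystemDateTime.Now;",
    "SystemDateTime.Init(); var t = Date.Now;",
]
for line in lines:
    print(line, bool(UNANCHORED.search(line)), should_flag(line))
```

On the second and third sample lines the unanchored pattern still finds a match, because SystemDateTime does not occur after the position where the match begins, while the anchored variant rejects both lines.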

Changing the image URL with a regular expression

I have to change a url that looks like
http://my-assets.s3.amazonaws.com/uploads/2011/10/PiaggioBeverly-001-106x106.jpg
into this format
http://my-assets.s3.amazonaws.com/uploads/2011/10/106x106/PiaggioBeverly-001.jpg
I understand I need to create a regular expression pattern that will divide the initial url into three groups:
http://my-assets.s3.amazonaws.com/uploads/
2011/10/
PiaggioBeverly-001-106x106.jpg
and then cut off the resolution string (106x106) from the third group, get rid of the hyphen at the end and move the resolution next to the second. Any idea how to get it done using something like preg_replace?
Search for: (.*\/)(\w+-\d+)-(.*?)\.
Replace with: \1\3/\2.
Demo here: http://regex101.com/r/fX7gC2
The pattern will be as follows (for the input uploads/2011/10/PiaggioBeverly-001-106x106.jpg):
^(.*/)(.+?)(\d+x\d+)(\.jpg)$
The groups will then hold the following:
$1 = uploads/2011/10/
$2 = PiaggioBeverly-001-
$3 = 106x106
$4 = .jpg
Now rearrange the groups as needed.
Since you mentioned preg_replace(): if this is PHP, you can also use preg_match() with this pattern to capture the groups.
<?php
$oldurl = "http://my-assets.s3.amazonaws.com/uploads/2011/10/PiaggioBeverly-001-106x106.jpg";
$newurl = preg_replace('%(.*?)/(\w+)-(\w+)-(\w+)\.(\w+)%sim', '$1/$4/$2-$3.$5', $oldurl);
echo $newurl;
#http://my-assets.s3.amazonaws.com/uploads/2011/10/106x106/PiaggioBeverly-001.jpg
?>
EXPLANATION:
Options: dot matches newline; case insensitive; ^ and $ match at line breaks
Match the regular expression below and capture its match into backreference number 1 «(.*?)»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “/” literally «/»
Match the regular expression below and capture its match into backreference number 2 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “-” literally «-»
Match the regular expression below and capture its match into backreference number 3 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “-” literally «-»
Match the regular expression below and capture its match into backreference number 4 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “.” literally «\.»
Match the regular expression below and capture its match into backreference number 5 «(\w+)»
Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
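
The same rearrangement can be sketched in Python with re.sub. This is an illustrative variant, not one of the patterns above: the names THUMB and move_resolution are invented, and it assumes the resolution always has the form digits, x, digits and sits directly before the file extension.

```python
import re

# 1: everything up to the last slash, 2: base name,
# 3: resolution such as 106x106, 4: extension including the dot.
THUMB = re.compile(r"^(.*/)(.+)-(\d+x\d+)(\.\w+)$")


def move_resolution(url):
    """Move the resolution out of the file name into its own path segment."""
    return THUMB.sub(r"\1\3/\2\4", url)


old = "http://my-assets.s3.amazonaws.com/uploads/2011/10/PiaggioBeverly-001-106x106.jpg"
print(move_resolution(old))
```

URLs that do not end in a resolution suffix do not match and are returned unchanged.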

RegEx that matches a string of numbers in a particular format?

I need a regular expression that will tell if a string is in the following format. The groups of numbers must be comma delimited. Can contain a range of numbers separated by a -
300, 200-400, 1, 250-300
The groups can be in any order.
This is what I have so far, but it's not matching the entire string. It's only matching the groups of numbers.
([0-9]{1,3}-?){1,2},?
Try this one:
^(?:\d{1,3}(?:-\d{1,3})?)(?:,\s*\d{1,3}(?:-\d{1,3})?|$)+
Since you didn't specify the number ranges, I leave this to you. In any case, you shouldn't do math with regex :)
Explanation:
^ # Assert position at the beginning of the string
(?: # Match the regular expression below
\d # Match a single digit 0..9
{1,3} # Between one and 3 times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
- # Match the character “-” literally
\d # Match a single digit 0..9
{1,3} # Between one and 3 times, as many times as possible, giving back as needed (greedy)
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
, # Match the character “,” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
{1,3} # Between one and 3 times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
- # Match the character “-” literally
\d # Match a single digit 0..9
{1,3} # Between one and 3 times, as many times as possible, giving back as needed (greedy)
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
^(\d+(-\d+)?)(,\s*(\d+(-\d+)?))*$
This should work:
/^([0-9]{1,3}(-[0-9]{1,3})?)(,\s?([0-9]{1,3}(-[0-9]{1,3})?))*$/
You need some repetition:
(?:([0-9]{1,3}-?){1,2},?)+
To ensure that the numbers are correct, i.e. that you don't match numbers like 010, you might want to change the regex slightly. I also changed the range part of the regex, so that you don't match things like 100-200- but only 100 or 100-200, and added support for whitespaces after the comma (optional):
(?:(([1-9]{1}[0-9]{0,2})(-[1-9]{1}[0-9]{0,2})?){1,2},?\s*)+
Also, depending on what you want to capture, you might want to change the capturing brackets () to non capturing ones (?:)
UPDATE
A revised version based on the latest comments:
^\s*(?:(([1-9][0-9]{0,2})(-[1-9][0-9]{0,2})?)(?:,\s*|$))+$
([0-9-]+),\s([0-9-]+),\s([0-9-]+),\s([0-9-]+)
Try this regular expression
^(([0-9]{1,3}-?){1,2},?\s*)+$
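
To compare the candidates, it can help to run a validator against a handful of inputs. The Python sketch below follows the "item, then zero or more comma-separated items" shape used by several answers above; the names ITEM, NUMBER_LIST, is_number_list and parse_number_list are made up, and the 1 to 3 digit limit is taken from the question's own attempt.

```python
import re

ITEM = r"\d{1,3}(?:-\d{1,3})?"  # a number, or a range such as 200-400
NUMBER_LIST = re.compile(rf"\s*{ITEM}(?:\s*,\s*{ITEM})*\s*")


def is_number_list(text):
    """True only if the entire string is a comma-delimited list of items."""
    return NUMBER_LIST.fullmatch(text) is not None


def parse_number_list(text):
    """Return (low, high) pairs; a single number n becomes (n, n)."""
    if not is_number_list(text):
        raise ValueError(f"not a valid number list: {text!r}")
    pairs = []
    for item in text.split(","):
        low, _, high = item.strip().partition("-")
        pairs.append((int(low), int(high or low)))
    return pairs


print(parse_number_list("300, 200-400, 1, 250-300"))
```

Using fullmatch avoids relying on ^ and $, which sidesteps the case where $ also matches just before a trailing newline.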