Regex that will get dimensions from string - regex

I'm having a heck of a time finding a regex that will pull dimensions from a string.
Here is what I have so far; it's not really doing what I want:
((\d+\s*(\d+\d+|\d+)*)\s*[xX]\s*(\d+\s*(\d+\d+|\d+)*)\s*[xX]\s*(\d+\s*(\d+\d+|\d+)*)|(\d+\s*(\d+\d+|\d+)*)\s*[xX]\s*(\d+\s*(\d+\d+|\d+)*)| (\d+\s*(\d+\d+|\d+)*))
Here are some examples of what it pulls (bolded):
-16" x 1476' 80 GA. EQ Ultra-Premium Hand Wrap - Use as replacement for 18" x 1500' 80 Gauge (4/Case)
-48 x 60" Corrugated Sheet (250/Bale)
-MP Die Cut Divider 25 7/ 8 x 19 3/4 1000/bale
-Part "B" (3"x3x 1/2" charcoal foam) only - extra pieces
I'm looking for it to grab the following:
-16" x 1476' 80 GA. EQ Ultra-Premium Hand Wrap - Use as replacement for 18" x 1500' 80 Gauge (4/Case)
-48 x 60" Corrugated Sheet (250/Bale)
-MP Die Cut Divider 25 7/8 x 19 3/4 1000/bale
-Part "B" (3"x3 x 1/2" charcoal foam) only - extra pieces
Notice the regex is not catching the lower part of the fraction because of the "/". Also, if an inch symbol (i.e. ") sits between a dimension and the next number, it won't grab the first number; you can see that in the first example.
Once I have this regex working, I can strip out the inch and foot symbols (i.e. " and '), and break each number down into each dimension. Just trying to pull the initial dimension numbers first.
Thanks so much if you have any input.

(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?(\s*[xX]\s*(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?)+
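As a quick check, the pattern above can be run against the sample strings from the question. This is only a minimal Python sketch: the pattern is copied verbatim from the answer, while the names DIMENSION_RE and find_dimensions are made up for illustration.

```python
import re

# Pattern copied verbatim from the answer; only the Python wrapper is new.
DIMENSION_RE = re.compile(
    r"""(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?(\s*[xX]\s*(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?)+"""
)


def find_dimensions(text):
    """Return every full dimension match (e.g. '48 x 60"') found in text."""
    # finditer + group(0) gives the whole match; findall would return the
    # contents of the capture groups instead.
    return [m.group(0) for m in DIMENSION_RE.finditer(text)]


if __name__ == "__main__":
    for sample in (
        '48 x 60" Corrugated Sheet (250/Bale)',
        "MP Die Cut Divider 25 7/8 x 19 3/4 1000/bale",
        'Part "B" (3"x3 x 1/2" charcoal foam) only - extra pieces',
    ):
        print(find_dimensions(sample))
```

Because the pattern requires at least one "x" group, lone numbers such as 250 in (250/Bale) are not reported.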

I don't understand the x-in-the-middle-of-the-number thing, and I didn't attempt to get that part, but otherwise, this works:
([0-9]+["']?(?: [0-9]+/[0-9]+)? x [0-9]+["']?(?: [0-9]+/[0-9]+)?)

substitution with eval and repeat the character by grouping string length?

My input as follow
my $s = '<B>Estimated:</B>
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
<B>Instability index:</B>
The instability index (II) is computed to be 31.98
This classifies the protein as stable.';
I want to remove the <B></B> tags from the string and underline the text that was in bold.
My expected output is
Estimated:
---------
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
Instability index:
------------------
The instability index (II) is computed to be 31.98
This classifies the protein as stable.
For this I tried the following regex, but I don't know what the problem is.
$s=~s/<B>(.+?)<\/B>/"$1\n";"-" x length($1)/seg; # $1\n is not working
In the above regex, I don't know how to include "$1\n". And how do I use several statements in the substitution, separated by ; or anything else?
How can I fix it?
With the e modifier, only the value of the last-executed statement becomes the replacement, so
$s=~s/<B>(.+?)<\/B>/"$1\\n";"-" x length($1)/seg;
throws away the "$1\\n" (which should really be "$1\n")
This works:
$s=~s/<B>(.+?)<\/B>/"$1\n" . "-" x length($1)/seg;
The reason I was asking about your Perl version was to assess if it was possible to do what is effectively a variable-length lookbehind with \K:
$s=~s/<B>(.+?)<\/B>\K/ "\n" . "-" x length($1)/seg;
\K is available for Perl versions 5.10+.
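For readers coming from another language, the same idea (compute the replacement from the captured text) can be sketched in Python, where a function passed to re.sub plays roughly the role of Perl's e modifier. The function name underline_bold is an assumption made for this example, not something from the question.

```python
import re


def underline_bold(text):
    """Replace <B>label</B> with the label followed by a dashed underline."""

    def repl(match):
        label = match.group(1)
        # One expression builds the whole replacement, like "$1\n" . "-" x length($1).
        return label + "\n" + "-" * len(label)

    # DOTALL mirrors the s modifier used in the Perl substitution.
    return re.sub(r"<B>(.+?)</B>", repl, text, flags=re.DOTALL)


if __name__ == "__main__":
    print(underline_bold("<B>Instability index:</B>\nThe instability index (II) is computed to be 31.98"))
```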

How can I use RegEx to differentiate between screen resolutions?

I'm trying to create a RegEx expression that will match a numeric range from 0 to 600 so I can easily differentiate between a small mobile device and tablets/desktops. I'm using Qualtrics' survey software to do the rest - all I need is the RegEx expression.
However, I'm not 100% sure how Qualtrics takes in the data. I believe it takes it in the following format:
360x640
320x568
320x480
1920x1080
360x640
1280x800
320x568
1920x1080
360x640
1280x800
1920x1080
480x800
320x480
1280x800
1366x768
320x568
1280x800
Where I'm testing the FIRST number, e.g. the number before the 'x' character.
Here's some RegEx I've tried that did not work:
([0-9]{1,2}|[1-4][0-9]{2}|600)*x
That code recognizes numbers before the 'x', but it doesn't stop at 600 - it recognizes all numbers before the 'x' (e.g. from 000 to 9999).
How do I get the range I want? Please and thank you!
Note: I've tried using the RegEx number range generator here, but it doesn't work for what I want to accomplish.
I'd do:
\b(?:600|[1-9]\d?|[1-5]\d{2})x
Where:
\b is a word boundary; it makes sure there are no digits before the number
(?: ) is a non-capturing group
600 matches 600
[1-9]\d? matches numbers from 1 to 99
[1-5]\d{2} matches numbers from 100 to 599
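The pieces listed above can be tried out with a short Python sketch. The names SMALL_WIDTH_RE and is_small_screen are hypothetical, and it is assumed that each value arrives as a single WIDTHxHEIGHT string. Note that this pattern covers 1 to 600, so a width of 0 is not accepted.

```python
import re

# Width of 1..600 followed by the literal "x" separator.
SMALL_WIDTH_RE = re.compile(r"\b(?:600|[1-9]\d?|[1-5]\d{2})x")


def is_small_screen(resolution):
    """True if the width (the number before 'x') is between 1 and 600."""
    # match() anchors at the start of the string, so only the width is tested.
    return SMALL_WIDTH_RE.match(resolution) is not None


if __name__ == "__main__":
    for res in ("360x640", "600x800", "601x800", "1920x1080"):
        print(res, is_small_screen(res))
```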
I don't believe there are widths lower than 100, so you can use this pattern:
^([1-5][0-9]{2}|600)x
You can use this regex, with the m modifier if your input contains all those lines together:
^([0-5]\d{2}|600)x
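If the resolutions do arrive as one block of lines, the multi-line variant can be sketched in Python as follows. The function name small_widths is made up for the example, and re.MULTILINE is Python's spelling of the m modifier.

```python
import re

# With MULTILINE, ^ matches at the start of every line, not just the string.
LINE_WIDTH_RE = re.compile(r"^([0-5]\d{2}|600)x", re.MULTILINE)


def small_widths(block):
    """Return the three-digit widths up to 600 found at the start of each line."""
    return [int(width) for width in LINE_WIDTH_RE.findall(block)]


if __name__ == "__main__":
    data = "360x640\n1920x1080\n600x800\n1280x800\n320x568"
    print(small_widths(data))
```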

Is it possible to increment numbers using regex substitution?

Is it possible to increment numbers using regex substitution? Not using evaluated/function-based substitution, of course.
This question was inspired by another one, where the asker wanted to increment numbers in a text editor. There are probably more text editors that support regex substitution than ones that support full-on scripting, so a regex might be convenient to float around, if one exists.
Also, often I've learned neat things from clever solutions to practically useless problems, so I'm curious.
Assume we're only talking about non-negative decimal integers, i.e. \d+.
Is it possible in a single substitution? Or, a finite number of substitutions?
If not, is it at least possible given an upper bound, e.g. numbers up to 9999?
Of course it's doable given a while-loop (substituting while matched), but we're going for a loopless solution here.
This question's topic amused me for one particular implementation I did earlier. My solution happens to be two substitutions so I'll post it.
My implementation environment is solaris, full example:
echo "0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909" |
perl -pe 's/\b([0-9]+)\b/0$1~01234567890/g' |
perl -pe 's/\b0(?!9*~)|([0-9])(?=9*~[0-9]*?\1([0-9]))|~[0-9]*/$2/g'
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
Pulling it apart for explanation:
s/\b([0-9]+)\b/0$1~01234567890/g
Replace each number (#) with 0#~01234567890. The leading 0 is there in case a carry is needed (e.g. 9 becoming 10). The 01234567890 block is the lookup table for incrementing. The example text for "9 10" is:
09~01234567890 010~01234567890
The individual pieces of the next regex can be described separately; they are joined via pipes to reduce the substitution count:
s/\b0(?!9*~)/$2/g
Select the leading "0" in front of every number that does not need a carry and discard it.
s/([0-9])(?=9*~[0-9]*?\1([0-9]))/$2/g
(?=) is positive lookahead, \1 is match group #1. So this means match all digits that are followed by 9s until the '~' mark then go to the lookup table and find the digit following this number. Replace with the next digit in the lookup table. Thus "09~" becomes "19~" then "10~" as the regex engine parses the number.
s/~[0-9]*/$2/g
This regex deletes the ~ lookup table.
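For anyone who wants to experiment outside Perl, the two substitutions can be ported to Python more or less literally. This is a sketch under a few assumptions: the function name increment_all is invented here, "~" must not occur in the input, and Python 3.5 or newer is needed because older versions raise an error when a replacement template refers to a group that did not take part in the match.

```python
import re


def increment_all(text):
    """Add 1 to every non-negative integer in text using two substitutions."""
    # Step 1: wrap each number N as "0N~01234567890"
    # (carry digit in front, lookup table behind).
    prepared = re.sub(r"\b([0-9]+)\b", r"0\g<1>~01234567890", text)
    # Step 2: three alternatives, all replaced by group 2:
    #   - a leading 0 that is not needed for a carry (group 2 unset, so deleted)
    #   - a digit followed only by 9s before "~" (replaced by its successor)
    #   - the "~" marker plus lookup table (group 2 unset, so deleted)
    return re.sub(
        r"\b0(?!9*~)|([0-9])(?=9*~[0-9]*?\1([0-9]))|~[0-9]*",
        r"\2",
        prepared,
    )


if __name__ == "__main__":
    print(increment_all("0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909"))
```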
Wow, turns out it is possible (albeit ugly)!
In case you do not have the time or cannot be bothered to read through the whole explanation, here is the code that does it:
$str = '0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139';
$str = preg_replace("/\d+/", "$0~", $str);
$str = preg_replace("/$/", "#123456789~0", $str);
do
{
$str = preg_replace(
"/(?|0~(.*#.*(1))|1~(.*#.*(2))|2~(.*#.*(3))|3~(.*#.*(4))|4~(.*#.*(5))|5~(.*#.*(6))|6~(.*#.*(7))|7~(.*#.*(8))|8~(.*#.*(9))|9~(.*#.*(~0))|~(.*#.*(1)))/s",
"$2$1",
$str, -1, $count);
} while($count);
$str = preg_replace("/#123456789~0$/", "", $str);
echo $str;
Now let's get started.
So first of all, as the others mentioned, it is not possible in a single replacement, even if you loop it (because how would you insert the corresponding increment to a single digit?). But if you prepare the string first, there is a single replacement that can be looped. Here is my demo implementation using PHP.
I used this test string:
$str = '0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139';
First of all, let's mark all digits we want to increment by appending a marker character (I use ~, but you should probably use some crazy Unicode character or ASCII character sequence that definitely will not occur in your target string).
$str = preg_replace("/\d+/", "$0~", $str);
Since we will be replacing one digit per number at a time (from right to left), we will just add that marking character after every full number.
Now here comes the main hack. We add a little 'lookup' to the end of our string (also delimited with a unique character that does not occur in your string; for simplicity I used #).
$str = preg_replace("/$/", "#123456789~0", $str);
We will use this to replace digits by their corresponding successors.
Now comes the loop:
do
{
$str = preg_replace(
"/(?|0~(.*#.*(1))|1~(.*#.*(2))|2~(.*#.*(3))|3~(.*#.*(4))|4~(.*#.*(5))|5~(.*#.*(6))|6~(.*#.*(7))|7~(.*#.*(8))|8~(.*#.*(9))|9~(.*#.*(~0))|(?<!\d)~(.*#.*(1)))/s",
"$2$1",
$str, -1, $count);
} while($count);
Okay, what is going on? The matching pattern has one alternative for every possible digit. This maps digits to successors. Take the first alternative for example:
0~(.*#.*(1))
This will match any 0 followed by our increment marker ~, then it matches everything up to our cheat-delimiter and the corresponding successor (that is why we put every digit there). If you glance at the replacement, this will get replaced by $2$1 (which will then be 1 and then everything we matched after the ~ to put it back in place). Note that we drop the ~ in the process. Incrementing a digit from 0 to 1 is enough. The number was successfully incremented, there is no carry-over.
The next 8 alternatives are exactly the same for the digits 1 to 8. Then we take care of two special cases.
9~(.*#.*(~0))
When we replace the 9, we do not drop the increment marker, but place it to the left of the resulting 0 instead. This (combined with the surrounding loop) is enough to implement carry-over propagation. Now there is one special case left. For all numbers consisting solely of 9s we will end up with the ~ in front of the number. That is what the last alternative is for:
(?<!\d)~(.*#.*(1))
If we encounter a ~ that is not preceded by a digit (therefore the negative lookbehind), it must have been carried all the way through a number, and thus we simply replace it with a 1. I think we do not even need the negative lookbehind (because this is the last alternative that is checked), but it feels safer this way.
A short note on the (?|...) around the whole pattern. This makes sure that we always find the two matches of an alternative in the same references $1 and $2 (instead of ever larger numbers down the string).
Lastly, we add the DOTALL modifier (s), to make this work with strings that contain line breaks (otherwise, only numbers in the last line will be incremented).
That makes for a fairly simple replacement string. We simply first write $2 (in which we captured the successor, and possibly the carry-over marker), and then we put everything else we matched back in place with $1.
That's it! We just need to remove our hack from the end of the string, and we're done:
$str = preg_replace("/#123456789~0$/", "", $str);
echo $str;
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 30 100 101 140
So we can do this entirely in regular expressions. And the only loop we have always uses the same regex. I believe this is as close as we can get without using preg_replace_callback().
Of course, this will do horrible things if we have numbers with decimal points in our string. But that could probably be taken care of by the very first preparation-replacement.
Update: I just realised, that this approach immediately extends to arbitrary increments (not just +1). Simply change the first replacement. The number of ~ you append equals the increment you apply to all numbers. So
$str = preg_replace("/\d+/", "$0~~~", $str);
would increment every integer in the string by 3.
I managed to get it working in 3 substitutions (no loops).
tl;dr
s/$/ ~0123456789/
s/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/$2$3$4$5/g
s/9(?=9*~)(?=.*(0))|~| ~0123456789$/$1/g
Explanation
Let ~ be a special character not expected to appear anywhere in the text.
If a character is nowhere to be found in the text, then there's no way to make it appear magically. So first we insert the characters we care about at the very end.
s/$/ ~0123456789/
For example,
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909
becomes:
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909 ~0123456789
Next, for each number, we (1) increment the last non-9 (or prepend a 1 if all are 9s), and (2) "mark" each trailing group of 9s.
s/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/$2$3$4$5/g
For example, our example becomes:
1 2 3 4 8 9 19~ 11 29~ 199~ 119~ 299~ 919~ 1999~ 1199~ 1919~ ~0123456789
Finally, we (1) replace each "marked" group of 9s with 0s, (2) remove the ~s, and (3) remove the character set at the end.
s/9(?=9*~)(?=.*(0))|~| ~0123456789$/$1/g
For example, our example becomes:
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
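The three steps also translate directly to Python, which can be handy for stepping through them. The sketch below makes some assumptions: the function name increment_three_steps is invented, the input is a single line that does not contain "~", and Python 3.5+ is required so that unmatched groups in the replacement template expand to an empty string.

```python
import re


def increment_three_steps(text):
    """Add 1 to every non-negative integer in text with three substitutions."""
    # Step 1: append the digit table at the very end of the string.
    s = re.sub(r"\Z", " ~0123456789", text)
    # Step 2: bump the last non-9 digit (or prepend a 1 when all digits
    # are 9s) and mark the trailing run of 9s with "~".
    s = re.sub(
        r"(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)",
        r"\2\3\4\5",
        s,
    )
    # Step 3: turn marked 9s into 0s, then drop the markers and the table.
    return re.sub(r"9(?=9*~)(?=.*(0))|~| ~0123456789$", r"\1", s)


if __name__ == "__main__":
    print(increment_three_steps("0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909"))
```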
PHP Example
$str = '0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909';
echo $str . '<br/>';
$str = preg_replace('/$/', ' ~0123456789', $str);
echo $str . '<br/>';
$str = preg_replace('/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/', '$2$3$4$5', $str);
echo $str . '<br/>';
$str = preg_replace('/9(?=9*~)(?=.*(0))|~| ~0123456789$/', '$1', $str);
echo $str . '<br/>';
Output:
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909 ~0123456789
1 2 3 4 8 9 19~ 11 29~ 199~ 119~ 299~ 919~ 1999~ 1199~ 1919~ ~0123456789
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
Is it possible in a single substitution?
No.
If not, is it at least possible in a single substitution given an upper bound, e.g. numbers up to 9999?
No.
You can't even replace the numbers between 0 and 8 with their respective successor. Once you have matched, and grouped this number:
/([0-8])/
you need to replace it. However, regex doesn't operate on numbers, but on strings. So you can replace the "number" (or better: digit) with twice this digit, but the regex engine does not know it is duplicating a string that holds a numerical value.
Even if you'd do something (silly) as this:
/(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)/
so that the regex engine "knows" that if group 1 is matched, the digit '0' is matched, it still cannot do a replacement. You can't instruct the regex engine to replace group 1 with the digit '1', group '2' with the digit '2', etc. Sure, some tools like PHP will let you define a couple of different patterns with corresponding replacement strings, but I get the impression that is not what you were thinking about.
It is not possible by regular expression search and substitution alone.
You have to use something else to help achieve that. You have to use the programming language at hand to increment the number.
Edit:
The regular expressions definition, as part of the Single Unix Specification, doesn't mention regular expressions supporting evaluation of arithmetic expressions or capabilities for performing arithmetic operations.
Nonetheless, I know some flavors (TextPad, an editor for Windows) allow you to use \i as a substitution term, which is an incremental counter of how many times the search string has been found, but it doesn't evaluate or parse found strings into a number, nor does it allow adding a number to it.
I have found a solution in two steps (Javascript) but it relies on indefinite lookaheads, which some regex engines reject:
const incrementAll = s =>
s.replaceAll(/(.+)/gm, "$1\n101234567890")
.replaceAll(/(?:([0-8]|(?<=\d)9)(?=9*[^\d])(?=.*\n\d*\1(\d)\d*$))|(?<!\d)9(?=9*[^\d])(?=(?:.|\n)*(10))|\n101234567890$/gm, "$2$3");
The key thing is to add a list of numbers in order at the end of the string in the first step, and in the second, to locate the relevant digit and capture the digit to its right in the list via a lookahead. There are two other branches in the second step, one for dealing with initial nines, and the other for removing the number sequence.
Edit: I just tested it in Safari and it throws an error, but it definitely works in Firefox.
I needed to increment indices of output files by one from a pipeline I can't modify. After some searches I got a hit on this page. While the readings are meaningful, they really don't give a readable solution to the problem. Yes it is possible to do it with only regex; no it is not as comprehensible.
Here I would like to give a readable solution using Python, so that others don't need to reinvent the wheel. I can imagine many of you may have ended up with a similar solution.
The idea is to partition file name into three groups, and format your match string so that the incremented index is the middle group. Then it is possible to only increment the middle group, after which we piece the three groups together again.
import re
import sys
import argparse
from os import listdir
from os.path import isfile, join
def main():
    parser = argparse.ArgumentParser(description='index shift of input')
    parser.add_argument('-r', '--regex', type=str,
                        help='regex match string for the index to be shift')
    parser.add_argument('-i', '--indir', type=str,
                        help='input directory')
    parser.add_argument('-o', '--outdir', type=str,
                        help='output directory')
    args = parser.parse_args()
    # parse input regex string
    regex_str = args.regex
    regex = re.compile(regex_str)
    # target directories
    indir = args.indir
    outdir = args.outdir
    try:
        for input_fname in listdir(indir):
            input_fpath = join(indir, input_fname)
            if not isfile(input_fpath): # not a file
                continue
            matched = regex.match(input_fname)
            if matched is None: # not our target file
                continue
            # middle group is the index and we increment it
            index = int(matched.group(2)) + 1
            # reconstruct output
            output_fname = '{prev}{index}{after}'.format(**{
                'prev' : matched.group(1),
                'index' : str(index),
                'after' : matched.group(3)
            })
            output_fpath = join(outdir, output_fname)
            # write the command required to stdout
            print('mv {i} {o}'.format(i=input_fpath, o=output_fpath))
    except BrokenPipeError:
        pass
if __name__ == '__main__':
    main()
I have this script named index_shift.py. To give an example of the usage, my files are named k0_run0.csv, for bootstrap runs of machine learning models using parameter k. The parameter k starts from zero, and the desired index map starts at one. First we prepare input and output directories to avoid overwriting files:
$ ls -1 test_in/ | head -n 5
k0_run0.csv
k0_run10.csv
k0_run11.csv
k0_run12.csv
k0_run13.csv
$ ls -1 test_out/
To see how the script works, just print its output:
$ python3 -u index_shift.py -r '(^k)(\d+?)(_run.+)' -i test_in -o test_out | head -n5
mv test_in/k6_run26.csv test_out/k7_run26.csv
mv test_in/k25_run11.csv test_out/k26_run11.csv
mv test_in/k7_run14.csv test_out/k8_run14.csv
mv test_in/k4_run25.csv test_out/k5_run25.csv
mv test_in/k1_run28.csv test_out/k2_run28.csv
It generates bash mv commands to rename the files. Now we pipe the lines directly into bash.
$ python3 -u index_shift.py -r '(^k)(\d+?)(_run.+)' -i test_in -o test_out | bash
Checking the output, we have successfully shifted the index by one.
$ ls test_out/k0_run0.csv
ls: cannot access 'test_out/k0_run0.csv': No such file or directory
$ ls test_out/k1_run0.csv
test_out/k1_run0.csv
You can also use cp instead of mv. My files are kinda big, so I wanted to avoid duplicating them. You can also refactor how many you shift as input argument. I didn't bother, cause shift by one is most of my use cases.
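The core of the script, splitting a name into three groups and incrementing the middle one, can also be pulled out into a small helper for testing without touching the file system. This is just a sketch: shift_index and its by parameter are names invented here, and the pattern is assumed to have exactly three groups as in the command above.

```python
import re


def shift_index(name, pattern, by=1):
    """Return name with its middle capture group incremented, or None if no match."""
    matched = re.match(pattern, name)
    if matched is None:
        return None
    prefix, index, suffix = matched.group(1), matched.group(2), matched.group(3)
    # Only the middle group is treated as a number; the rest is copied as-is.
    return "{}{}{}".format(prefix, int(index) + by, suffix)


if __name__ == "__main__":
    print(shift_index("k0_run0.csv", r"(^k)(\d+?)(_run.+)"))
    print(shift_index("k25_run11.csv", r"(^k)(\d+?)(_run.+)", by=3))
```

Making the shift amount a parameter is the refactoring mentioned above; the default of 1 keeps the original behaviour.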

Regular expression to match object dimensions

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .
Imagine some sentences along the following lines:
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:
11 1/2" x 32"
8 x 10-3/5
22" x 17"
42 1/2" x 60 yd
5.76 by 8
84cm
13/19"
86 cm
I imagine a world where the following rules apply:
The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
A dimension is always described numerically, may or may not have units following it and may or may not have a fractional or decimal part. Being made up of a fractional part on its own is allowed, e.g., 4/5".
Fractional parts always have a / separating the numerator / denominator, and one can assume there is no space between the parts (though if someone takes that into account that's great!).
Dimensions may be one-dimensional or two-dimensional, in which case one can assume the following are acceptable for separating two dimensions: {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
To show you how useless I am with regular expressions (and to show I at least tried!), I got this far. . .
[1-9]+[/ ][x1-9]
Update (2)
You guys are very fast and efficient! I'm going to add a few extra test cases that haven't been covered by the regular expressions below:
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
These should result in the following (# indicates nothing should match):
12 yd
99 cm
#
22" x 17" x 12 cm
#
#
#
I've adapted M42's answer below, to:
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
But while that resolves some new test cases it now fails to match the following others. It reports:
11 1/2" x 32" PASS
(nothing) FAIL
22" x 17" PASS
42 1/2" x 60 yd PASS
(nothing) FAIL
84cm PASS
13/19" PASS
86 cm PASS
22" PASS
(nothing) FAIL
(nothing) FAIL
12 yd x FAIL
99 cm by FAIL
22" x 17" [and also, but separately '12 cm'] FAIL
PASS
PASS
New version, near the target, 2 failed tests
#!/usr/local/bin/perl
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
my $xx = '22" x 17"';
while(<DATA>) {
    chomp;
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in '.$_);
    }
    $i++;
}
done_testing;
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
output:
# Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
# at C:\tests\perl\test6.pl line 42.
# Failed test ' got "no match" in They are all 5.76 by 8 frames.'
# at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 - got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 - got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 - got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 - got "no match" in This is a product code: c720 with another number 83 x better.
ok 14 - got "no match" in A number on its own 21.
ok 15 - got "no match" in A volume shouldn't match 0.332 oz.
1..15
It seems difficult to match 5.76 by 8 frames but not 0.332 oz, sometimes you have to match numbers with unit and numbers without unit.
I'm sorry, I'm not able to do better.
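One way around the "with unit versus without unit" problem is to use two branches: a multi-dimensional branch where units are optional, and a single-dimension branch where a unit is mandatory. The Python sketch below follows that idea and builds the unit part from an arbitrary list. All names (build_dimension_regex, extract_dimension, UNITS) are invented for this example, and it has only been reasoned through against the fifteen sentences in the question, so treat it as a starting point rather than a general solution.

```python
import re


def build_dimension_regex(units):
    """Compile a dimension pattern for the given list of unit strings."""
    unit_parts = []
    # Longest units first so that "yards" is tried before "yd".
    for u in sorted(units, key=len, reverse=True):
        escaped = re.escape(u)
        # Word-like units must end at a word boundary; symbols need none.
        unit_parts.append(escaped + r"\b" if u[-1].isalnum() else escaped)
    unit = "(?:" + "|".join(unit_parts) + ")"
    # A bare fraction, or an integer/decimal with an optional fraction part.
    num = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[\s-]\d+/\d+)?)"
    dim = num + r"(?:\s?" + unit + ")?"   # unit optional
    sep = r"\s*(?:x|by)\s*"
    multi = dim + "(?:" + sep + dim + ")+"  # two or more dimensions
    single = num + r"\s?" + unit            # one dimension, unit required
    # The lookbehind stops a match from starting inside a word or number.
    return re.compile(r"(?<![\w./])(?:" + multi + "|" + single + ")")


UNITS = ["cm", "mm", "yd", "yards", "feet", '"', "'"]
DIMENSION_RE = build_dimension_regex(UNITS)


def extract_dimension(sentence):
    """Return the first dimension found in sentence, or None."""
    m = DIMENSION_RE.search(sentence)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(extract_dimension("They are all 5.76 by 8 frames."))
    print(extract_dimension("A volume shouldn't match 0.332 oz."))
```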
One of many possible solutions (should be nlp compatible as it uses only basic regex syntax):
foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");
Will get your results :)
Explanation:
"
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\ # Match the character “ ” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
\. # Match the character “.” literally
| # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
"" # Match the character “""” literally
| # Or match regular expression number 5 below (the entire group fails if this one fails to match)
/ # Match the character “/” literally
)
[\d/""x -] # Match a single character present in the list below
# A single digit 0..9
# One of the characters “/""x”
# The character “ ”
# The character “-”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
\b # Assert position at a word boundary
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
by # Match the characters “by” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
yd # Match the characters “yd” literally
)
\b # Assert position at a word boundary
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
This is all I can get with a regular expression in 'Perl'. Try to adapt it to your regex flavour:
\d.*\d(?:\s+\S+|\S+)
Explanation:
\d # One digit.
.* # Any number of characters.
\d # One digit. All joined means to find all content between first and last digit.
\s+\S+ # Non-space characters after some whitespace. It tries to match any unit like 'cm' or 'yd'.
| # Or. Select one of two expressions between parentheses.
\S+ # Any number of non-space characters. It tries to match double-quotes, or units joined to the
# last number.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/
}
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
Running the script:
perl script.pl
Result:
11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm

RegEx for value Range from 1 - 365

What is the RegEx for value Range from 1- 365
Try this:
^(?:[1-9]\d?|[12]\d{2}|3[0-5]\d|36[0-5])$
The start anchor ^ and end anchor $ are there to match the whole input and not just part of it.
(?: ) is for non-capturing grouping.
| is for alternation
[1-9]\d? matches 1 to 99
[12]\d{2} matches 100 to 299
3[0-5]\d matches 300 to 359
36[0-5] matches 360 to 365
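A range pattern like this is easy to get subtly wrong, so it can be worth checking it by brute force. The following Python sketch does that for the pattern above; the names DAY_RE and accepted_numbers are made up for the example, and the upper bound of 1000 is an arbitrary choice.

```python
import re

DAY_RE = re.compile(r"^(?:[1-9]\d?|[12]\d{2}|3[0-5]\d|36[0-5])$")


def accepted_numbers(pattern, upper=1000):
    """Return every integer below upper whose decimal form the pattern accepts."""
    return [n for n in range(upper) if pattern.match(str(n))]


if __name__ == "__main__":
    accepted = accepted_numbers(DAY_RE)
    print(accepted[0], accepted[-1], len(accepted))
```

The same helper can be pointed at any of the other candidate patterns to see which numbers they accept.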
You would have to list the possible combinations 1-9, 10-99, 100-299, 300-359, 360-365:
^([1-9]\d?|[12]\d\d|3[0-5]\d|36[0-5])$
Not really a good fit for regex, but if you insist:
^(?:36[0-5]|3[0-5][0-9]|[12][0-9][0-9]|[1-9][0-9]|[1-9])$
This is not allowing leading zeroes. If you wish to allow those, let me know.
The expression above can be shortened a little to
^(?:36[0-5]|3[0-5]\d|[12]\d{2}|[1-9]\d?)$
but I find the first solution to be a bit more readable. YMMV.
A general solution for matching the numbers from 1 to XYZ
^(?!0)(?!\d{4,}$)(?![X+1-9]\d{2}$)(?!X[Y+1-9]\d$)(?!XY[Z+1-9]$)\d+$
Notes:
If any of X, Y or Z are 9 that will make X+1 etc. be 10. If that happens the regex part that would require using the 10 should be left out.
This can be extended to numbers with more or less digits following the same principles.
It does not allow left-padding zeros.
Applied to your case:
^(?!0)(?!\d{4,}$)(?![4-9]\d{2}$)(?!3[7-9]\d$)(?!36[6-9]$)\d+$
Let's explain:
(?!0) - does not start with 0
(?!\d{4,}$) - does not have 4 or more digits, i.e. is not between 1000 and infinity
(?![4-9]\d{2}$) - it's not between 400 and 999
(?!3[7-9]\d$) - it's not between 370 and 399
(?!36[6-9]$) - it's not between 366 and 369
^(?:36[0-5]|3[0-5][0-9]|[12][0-9]{2}|[1-9][0-9]?)$
^(?:3(?:6[0-5]|[0-5]\d)|[12]\d\d|[1-9]\d|[1-9])$
Or, if numbers like 05 cannot appear in the input:
^(?:3(?:6[0-5]|[0-5]\d)|[12]?\d?\d)$
P.S.: Anyway no need of regex here. Use ToInt(), <=, >=
It really depends on your regex engine since they may not all be PCRE-style. I usually work to the lowest common denominator unless I know it will be targeting a minimum engine.
To that end, I'd just use something like:
^([1-9]|[1-9][0-9]|[1-2][0-9]{2}|3[0-5][0-9]|36[0-5])$
This will take care of (in order):
1-9.
10-99.
100-299.
300-359.
360-365.
However, unless you're absolutely required to use just a regex, I wouldn't. It's like trying to kill a fly with a thermo-nuclear warhead.
Just use the much simpler ^[0-9]{1,3}$ then use whatever language features you have to convert it to an integer and check it's between 1 and 365 inclusive:
import re
def isValidDayOtherThanLeapYear(s):
    # Shape check first: one to three digits only.
    if not re.match(r"^[0-9]{1,3}$", s):
        return False
    n = int(s)
    if n < 1 or n > 365:
        return False
    return True
Your code will be more readable that way and I tend to rethink the use of regular expressions the second they start looking like they may be hard to read six months down the track.
This worked for me...
^[1-3][0-6]?[0-5]?$