I have a text file that I extracted from a PDF file. It's arranged in a tabular format; this is part of it:
DATE SESS PROF1 PROF2 COURSE SEC GRADE COUNT
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A 3
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A- 2
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B 4
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B+ 2
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B- 1
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 WU 1
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
2007/09 1 FUENTES TANIA DACSB 06500 002 A 3
2007/09 1 FUENTES TANIA DACSB 06500 002 A- 8
2007/09 1 FUENTES ALEXA DACSB 06500 002 B 5
2007/09 1 FUENTES ALEXA DACSB 06500 002 B+ 3
2007/09 1 FUENTES ALEXA DACSB 06500 002 B- 1
2007/09 1 FUENTES ALEXA DACSB 06500 002 C 1
2007/09 1 FUENTES ALEXA DACSB 06500 002 C+ 1
2007/09 1 LIGGINS FREDER DACSB 06500 003 A 1
The first line holds the column names, and the remaining lines are the data.
There are 8 columns that I want to extract. At first it seemed very easy: split each line I read with split(/\s+/, ...). But then I noticed that some lines contain additional space-separated values, for example:
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
As you can see, the data for some columns is optional.
The problem is complex, but it's not unsolvable. It seems to me that course will always contain a space between the alpha code and the numeric code and that the prof names will also always contain a space. But then you're pretty much screwed if somebody has a two-part last name like "VAN DYKE".
A regex would describe this record:
my $record_exp
= qr{ ^ \s*
(\d{4}/\d{2}) # yyyy/mm date
\s+
(\d+) # any number of digits
\s+
(\S+ \s \S+) # non-space cluster, single space, non-space cluster
\s+
# same as last, possibly not there, separating spaces are included
# in the conditional, because we have to make sure it will start
# right at the next rule.
(?:(\S+ \s \S+)\s+)?
# a cluster of alpha, single space, cluster of digits
(\p{Alpha}+ \s \d+)
\s+ # any number of spaces
(\S+) # any number of non-space
\s+ # ditto..
(\S+)
\s+
(\S+)
}x;
Which makes the loop a lot easier:
while ( <$input> ) {
my @fields = m{$record_exp};
# ... list of semantic actions here...
}
But you could also store it into structures, knowing that the only variable part of the data is the profs:
use strict;
use warnings;
my @records;
<$input>; # bleed the first line
while ( <$input> ) {
    my @fields = split; # split on white-space
    my $record = { date => shift @fields };
    $record->{session} = shift @fields;
    $record->{profs} = [ join( ' ', splice( @fields, 0, 2 )) ];
    while ( @fields > 5 ) {
        push @{ $record->{profs} }, join( ' ', splice( @fields, 0, 2 ));
    }
    # join both course tokens; splice in scalar context would return only the last one
    $record->{course} = join( ' ', splice( @fields, 0, 2 ));
    @$record{ qw<sec grade count> } = @fields;
    push @records, $record;
}
I believe it's ambiguous:
if PROF1 can contain spaces, how do you know where it ends and where PROF2 begins? What if PROF2 also contains a space? Or 3 spaces ..
You probably can't even tell yourself, and if you can it's because you can tell the difference between a first-name and a surname.
If you're on Linux/Unix, try running pdftotext on the PDF; its -layout option, which tries to preserve the original column layout, might give you better results.
It looks to me like the first four tokens and the last five tokens are always present, and the 5th and 6th tokens (PROF2) are optional.
So split the line as you were attempting, pull the first four and last five elements off the resulting array, and whatever remains is PROF2.
If, however, either the PROF1 or the PROF2 entry can be missing, you're stuck: your file format is ambiguous.
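For comparison, here is a minimal sketch of the same peel-off-both-ends idea in Python rather than Perl. It assumes every professor name is exactly two tokens (LAST FIRST) and that the course is always a two-token code; `parse_record` is a hypothetical helper name, not something from the question.

```python
def parse_record(line):
    """Split one data row; professor names are whatever is left in the middle."""
    tokens = line.split()
    if len(tokens) < 7:  # date, session and the five trailing tokens are the minimum
        raise ValueError(f"too few fields: {line!r}")
    date, sess = tokens[0], tokens[1]
    # The last five tokens are fixed: course code (2 tokens), section, grade, count.
    course = " ".join(tokens[-5:-3])
    sec, grade, count = tokens[-3:]
    # Assumption: every professor name is exactly two tokens (LAST FIRST).
    names = tokens[2:-5]
    profs = [" ".join(names[i:i + 2]) for i in range(0, len(names), 2)]
    return {"date": date, "sess": sess, "profs": profs, "course": course,
            "sec": sec, "grade": grade, "count": int(count)}


row = parse_record("2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1")
print(row["profs"], row["course"])
```

A three-token name such as "VAN DYKE JOHN" would still be split in the wrong place, so this does not remove the ambiguity; it only makes the assumption explicit.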
There is nothing that says you must use only a single regex. You can go prune off bits of your line in chunks if that makes it easier to handle the weird parts.
I would probably still use split(), but then access the data thusly:
my @values = split ' ', $string;
my $date = $values[0];
my $sess = $values[1];
my $count = $values[-1];
my $grade = $values[-2];
my $sec = $values[-3];
my $course = "@values[-5, -4]"; # the course is two tokens, e.g. "DACSB 06500"
my @profs = @values[2..($#values-5)];
With this construct you don't have to worry about how many profs you have. Even if you have none, the other values will all work fine (and you'll get an empty array for your profs).
Related
I have a data frame with a string in each row, and I need to replace the n'th character with a tab. In addition, there is a varying number of spaces before the m'th character that I need to convert to a tab as well.
For instance having following row:
"00001 000 0 John Smith"
I need to replace the 6th character (a space) with a tab and also replace the spaces between John and Smith with a tab. In every row the last word (Smith) starts at the 75th character. So, basically, I need to replace all the spaces before the 78th character with a tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.
You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=T)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the tabs in the output, use cat(x, sep = "\n"):
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith
Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Since you know the positions, I'll use substr to extract the parts, trim the middle one, and then paste them together with tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")
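The same fixed-position slicing can be sketched in Python, shown here only as an illustration of the technique. The sample row is built programmatically so that the last word really starts at character 75, since the original spacing is an assumption; `tabify` and `name_start` are hypothetical names.

```python
def tabify(line, name_start=75):
    """Put a tab after the 5-character ID and another before the last word."""
    ident = line[:5]                          # characters 1-5
    middle = line[6:name_start - 1].rstrip()  # characters 7-74, padding removed
    last = line[name_start - 1:]              # character 75 onward
    return "\t".join([ident, middle, last])


# Build a sample row whose last word starts at character 75 (assumed layout).
sample = "00001 000 0 John".ljust(74) + "Smith"
print(repr(tabify(sample)))
```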
I have a file like this:
3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360
8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249
...
When the 9th column is C, I need to subtract column 14 from 13, and when the 9th column is +, I need to subtract column 12 from 13.
I understand I can create arrays, but how can I use a regex, such as ($line =~/(\w+)\s+(\w+)/), to solve this instead?
You can split on whitespace into the @F array (the first value being $F[0]), subtract the columns, and print the values separated by spaces.
perl -lane'
$F[12] -= $F[13] if $F[8] eq "C";
$F[12] -= $F[11] if $F[8] eq "+";
print "@F";
' file
Since you wanted to use a regex, here is another solution. It is perhaps a bit imprecise, because you described your format with only two example lines rather than a full specification, but it works for those two. I commented the regex so that you can see which part of the expression matches which column and which columns are captured.
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
while( <DATA> )
{
if( $_ =~ /[0-9]+ # 1
\s+
[0-9.]+ # 2
\s+
[0-9.]+ # 3
\s+
[0-9.]+ # 4
\s+
[a-z0-9]+ # 5
\s+
[0-9]+ # 6
\s+
[0-9]+ # 7
\s+
\([a-z0-9]+\) # 8
\s+
([c+]) # 9 -> capture group 1
\s+
[a-z0-9]+ # 10
\s+
[a-z0-9\/]+ # 11
\s+
\(?([0-9]+)\)? # 12 -> capture group 2
\s+
([0-9]+) # 13 -> capture group 3
\s+
\(?([0-9]+)\)? # 14 -> capture group 4
\s+
[0-9]+? # 15
/ix )
{
say "Matched: $_";
say "Operation: $1";
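if( $1 eq "+" )
{
# "+": subtract column 12 from column 13
say "$3 - $2 = ".( $3 - $2 );
}
elsif( uc($1) eq "C" )
{
# "C" (or "c", because of the i flag): subtract column 14 from column 13
say "$3 - $4 = ".( $3 - $4 );
}
else
{
say "Nothing to do here...";
}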
}
}
exit;
#1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
__DATA__
3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360
8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249
Update:
As you can see in the Perl documentation, I used the x flag to allow comments in my regex. The i flag makes it case-insensitive.
Furthermore, I didn't just divide the columns by whitespace but also constrained them by type, which is an advantage of using a regular expression. The \s+ expressions act as column separators and allow arbitrary amounts of whitespace, while each column group is restricted to the characters it may contain. That makes it possible to detect non-conforming lines. For example, by defining capture group $1 as ([c+]), I restricted the characters that trigger an operation to C and + (and c, because of case-insensitivity).
Binding a group to a variable (capturing it) is done with parentheses.
This way, I was able to pick only the columns I really need (see the comments).
Do not use a regex for a problem like this.
If you're just working with columns separated by whitespace, the proper tool is split.
my @cols = split ' ', $line;
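To make the split-based approach concrete, here is a minimal sketch in Python (not Perl) of the column arithmetic the question asks for. The function name `adjust` is hypothetical, and the column positions are taken from the two sample lines only.

```python
def adjust(line):
    """Return the fields of one line with column 13 adjusted per the flag in column 9."""
    cols = line.split()
    flag = cols[8]
    if flag == "C":
        cols[12] = str(int(cols[12]) - int(cols[13]))  # column 13 minus column 14
    elif flag == "+":
        cols[12] = str(int(cols[12]) - int(cols[11]))  # column 13 minus column 12
    return cols


line1 = "3107 0.9 0.0 0.0 chr1 29312346 29312694 (219937927) C L1HS LINE/L1 (4) 6151 5803 54360"
line2 = "8095 0.5 0.0 0.0 chr1 31040661 31041597 (218209024) + L1HS LINE/L1 5203 6139 (16) 57249"
for line in (line1, line2):
    print(" ".join(adjust(line)))
```

Only the plain numeric columns are converted; the parenthesised column is never touched in either branch, so no extra cleanup is needed for these two lines.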
Given a noun list in a .txt file, where nouns are separated by new lines, such as this one:
hooligan
football
brother
bollocks
...and a separate .txt file containing a series of regular expressions separated by new lines, like this:
[a-z]+\tNN(S)?
[a-z]+\tJJ(S)?
...I would like to run the regular expressions through each sentence of a corpus and, every time a regexp matches a pattern, if that pattern contains one of the nouns in the list, print that noun in the output followed (separated by a tab) by the regular expression that matched it. Here is an example of what the resulting output could look like:
football [a-z]+NN(S)?\'s POS[a-z]+NN(S)?
hooligan [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
hooligan [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
football [a-z]+NN(S)?[a-z]+NN(S)?
brother [a-z]+PP$[a-z]+NN(S)?
bollocks [a-z]+DT[a-z]+NN(S)?
football [a-z]+NN(s)?(be)VBZnotRB
The corpus I would use is huge (tens of GB) and has the following format (each sentence is contained in the tag <s>):
<s>
Hooligans hooligan NNS 1 4 NMOD
, , , 2 4 P
unbridled unbridled JJ 3 4 NMOD
passion passion NN 4 0 ROOT
- - : 5 4 P
and and CC 6 4 CC
no no DT 7 9 NMOD
executive executive JJ 8 9 NMOD
boxes box NNS 9 4 COORD
. . SENT 10 0 ROOT
</s>
<s>
Hooligans hooligan NNS 1 4 NMOD
, , , 2 4 P
unbridled unbridled JJ 3 4 NMOD
passion passion NN 4 0 ROOT
- - : 5 4 P
and and CC 6 4 CC
no no DT 7 9 NMOD
executive executive JJ 8 9 NMOD
boxes box NNS 9 4 COORD
. . SENT 10 0 ROOT
</s>
<s>
Portsmouth Portsmouth NP 1 2 SBJ
bring bring VVP 2 0 ROOT
something something NN 3 2 OBJ
entirely entirely RB 4 5 AMOD
different different JJ 5 3 NMOD
to to TO 6 5 AMOD
the the DT 7 12 NMOD
Premiership Premiership NP 8 12 NMOD
: : : 9 12 P
football football NN 10 12 NMOD
's 's POS 11 10 NMOD
past past NN 12 6 PMOD
. . SENT 13 2 P
</s>
<s>
This this DT 1 2 SBJ
is be VBZ 2 0 ROOT
one one CD 3 2 PRD
of of IN 4 3 NMOD
Britain Britain NP 5 10 NMOD
's 's POS 6 5 NMOD
most most RBS 7 8 AMOD
ardent ardent JJ 8 10 NMOD
football football NN 9 10 NMOD
cities city NNS 10 4 PMOD
: : : 11 2 P
think think VVP 12 2 COORD
Liverpool Liverpool NP 13 0 ROOT
or or CC 14 13 CC
Newcastle Newcastle NP 15 19 SBJ
in in IN 16 15 ADV
miniature miniature NN 17 16 PMOD
, , , 18 15 P
wound wind VVD 19 13 COORD
back back RB 20 19 ADV
three three CD 21 22 NMOD
decades decade NNS 22 19 OBJ
. . SENT 23 2 P
</s>
I started working on a Perl script to achieve my goal, and in order not to run out of memory with such a huge dataset I used the module Tie::File so that my script would read one line at a time (instead of trying to load the entire corpus file into memory). This would work perfectly with a corpus where each sentence corresponds to a single line, but not in the current case, where sentences are spread over several lines and delimited by a tag.
Is there a way to achieve what I want using a combination of Unix terminal commands (e.g. cat and grep)? Alternatively, what would be the best solution for this issue? (Some code examples would be great.)
A simple regex alternation is sufficient to extract matching data from the noun list and Regexp::Assemble can handle the requirement for identifying which pattern from the other file matched. And, as Jonathan Leffler mentions in his comment, setting the input record separator allows you to read a single record at a time, even when each record spans multiple lines.
Combining all that into a running example, we get:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use Regexp::Assemble;
my @nouns = qw( hooligan football brother bollocks );
my @patterns = ('[a-z]+\s+NN(S)?', '[a-z]+\s+JJ(S)?');
my $name_re = '(' . join('|', @nouns) . ')'; # Assumes no regex metacharacters
my $ra = Regexp::Assemble->new(track => 1);
$ra->add(@patterns);
local $/ = '<s>';
while (my $line = <DATA>) {
my $match = $ra->match($line);
next unless defined $match;
while ($line =~ /$name_re/g) {
say "$1\t\t$match";
}
}
__DATA__
...
...where the content of the __DATA__ section is the sample corpus provided in the original question. I didn't include it here in the interest of keeping the answer compact. Note also that, in both patterns, I changed \t to \s+; this is because the tabs were not preserved when I copied and pasted your sample corpus.
Running that code, I get the output:
hooligan [a-z]+\s+NN(S)?
hooligan [a-z]+\s+NN(S)?
football [a-z]+\s+NN(S)?
football [a-z]+\s+NN(S)?
football [a-z]+\s+JJ(S)?
football [a-z]+\s+JJ(S)?
Edit: Corrected regexes. I initially replaced \t with \s, causing it to match NN or JJ only when preceded by exactly one space. It now also matches multiple spaces, which better emulates the original \t.
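The record-at-a-time reading described above is not specific to Perl. As a rough sketch in Python, a generator can yield one `<s>` ... `</s>` block at a time, so memory use stays bounded by the longest sentence rather than the corpus size. The sample data, the `sentences` helper and the simple "tag starts with NN" check are illustrative assumptions, not the regex list from the question.

```python
import io


def sentences(handle):
    """Yield the token rows of each <s>...</s> block, one block at a time."""
    block = None
    for raw in handle:  # file objects are read lazily, line by line
        line = raw.strip()
        if line == "<s>":
            block = []
        elif line == "</s>":
            if block is not None:
                yield block
            block = None
        elif block is not None and line:
            block.append(line.split())


# Tiny in-memory stand-in for the corpus file (assumed layout: word, lemma, tag, ...).
sample = io.StringIO(
    "<s>\nHooligans hooligan NNS 1 4 NMOD\npassion passion NN 4 0 ROOT\n</s>\n"
    "<s>\nfootball football NN 10 12 NMOD\n</s>\n"
)
nouns = {"hooligan", "football", "brother", "bollocks"}
hits = [cols[1] for sent in sentences(sample) for cols in sent
        if cols[1] in nouns and cols[2].startswith("NN")]
print(hits)
```

With a real corpus, the `io.StringIO` object would be replaced by an open file handle; nothing else in the generator needs to change.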
I ended up writing a quick code that solves my problem. I used Tie::File to handle huge textual datasets and specified </s> as record separator, as suggested by Jonathan Leffler (the solution proposed by Dave Sherohman seems very elegant but I couldn't try it).
After the separation of the sentences I isolate the columns that I need (2nd and 3rd) and I run the regular expressions. Before printing the output I check whether the matched word is present in my word list: if not, this is excluded from the output.
I share my code here (comments included) in case someone else needs something similar.
It's a bit dirty and it could definitely be optimized, but it works for me and it supports very large corpora (I tested it with a corpus of 10GB: it completed successfully in a few hours).
use strict;
use Tie::File; #This module makes a file look like a Perl array, each array element corresponds to a line of the file.
if ($#ARGV < 0 ) { print "Usage: perl albzcount.pl corpusfile\n"; exit; }
#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(DAT, $nouns_list) || die("Could not open the config file $nouns_list or file doesn't exist!");
my @nouns_contained_in_list=<DAT>;
close(DAT);
# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(DAT, $regex_list) || die("Could not open the config file $regex_list or file doesn't exist!");
my @regexps_contained_in_list=<DAT>;
close(DAT);
# Reading Corpus File (each sentence is spread on more lines and separated by tag <s>)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument through the command)
# With TIE I don't load the entire file in an array. Perl thinks it's an array but the file is actually read line by line
# This is the key to manipulate huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile, recsep => '</s>' or die "Can't read file: $!\n";
#START go through the sentences of the corpus (spread on multiple lines and separated by </s>), one by one
foreach my $corpus_line (@raw_corpus_data){
#take a single sentence (that is spread along different lines).
#NB each line contains "columns" separated by tab
my @corpus_sublines = split('\n', $corpus_line);
#declare variable (it shadows the loop variable from here on). Later values will be appended to it
my $corpus_line = '';
#for each line that composes a sentence
foreach my $sentence_newline(@corpus_sublines){
#explode by tab (column separator)
my @corpus_columns = split('\t', $sentence_newline);
#put together new sentences using just column 2 and 3 (noun and tag) for each original sentence
$corpus_line .= "$corpus_columns[1]\t$corpus_columns[2]\n";
#... Now the corpus has the format I want and can be processed
}
#foreach regex
foreach my $single_regexp(@regexps_contained_in_list){
# Remove the new lines (both \n and \r - depending on the OS) from the regexp present in the file.
# Without this, the regular expressions read from the file don't always work.
$single_regexp =~ s/\r|\n//g;
#if the corpus line analyzed in this cycle matches the regexp
if($corpus_line =~ m/$single_regexp/) {
# explode by tab the matched results so the first word $onematch[0] can be isolated
# $& is the entire matched string
my @onematch = split('\t', $&);
# OUTPUT RESULTS
#if the matched noun is not empty and it is part of the word list
if ($onematch[0] ne "" && grep( /^$onematch[0]$/, @nouns_contained_in_list )) {
print "$onematch[0]\t$single_regexp\n";
} # END OUTPUT RESULTS
} #END if the corpus line analyzed in this cycle matches the regexp
} #END foreach regex
} #END go through the sentences of the corpus, one by one
# Untie the source corpus file
untie @raw_corpus_data;
I'm having troubles trying to match a pattern of dates. Any of the following dates are legal:
- 121212
- 4 9 12
- 5-3-2000
- 62502
- 3/3/11
- 09-08-2001
- 8 6 07
- 12 10 2004
- 4-16-08
- 3/7/2005
What makes this date matching really challenging is that the year doesn't have to be 4 digits (a 2-digit year is assumed to be in the 21st century, i.e. 02 = 2002), a one-digit month or day may or may not be written with a leading 0, and the parts may or may not be separated by spaces, dashes, or slashes.
This is what I currently have:
/((((0[13578])|([13578])|(1[02]))[\/-]?\s*(([1-9])|(0[1-9])|([12][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/-]?\s*(([1-9])|(0[1-9])|([12][0-9])|(30)))|((2|02)[\/](([1-9])|(0[1-9])|([12][0-9])))[\/-]?\s*(20[0-9]{2})|([0-9]{2}))/g
This almost works, except right now I'm not exactly sure if I'm assuming the length of the dates and months. For example, in the case 121212, I might be assuming the month is 1 instead of 12. Also, for some reason when I'm printing out $1 and $2, it is the same value. In the case of 121212, $1 is 1212, $2 is 1212 and $3 is 12. However, I just want $1 to be 121212.
Your task is ambiguous, since you may not be able to tell mmd from mdd or mdccyy from mmddyy.
You left off the option for spaces or dashes in one place where you match /.
You aren't checking for leap years.
This is doable, but it's awfully easy to make a mistake; how about not trying to do it with a regex?
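To illustrate the ambiguity, here is a small Python sketch (an illustration only, not a parser recommendation) that lists every valid month/day/year reading of an unseparated digit string. It assumes month-day-year order, a 2-digit year meaning 20yy as stated in the question, and a 4-digit year taken literally; `interpretations` is a hypothetical helper name.

```python
from datetime import date


def interpretations(digits):
    """Return every valid reading of an unseparated month-day-year digit string."""
    found = set()
    if not digits.isdigit():
        return []
    for m_len in (1, 2):
        for d_len in (1, 2):
            year_part = digits[m_len + d_len:]
            if len(year_part) not in (2, 4):
                continue
            month = int(digits[:m_len])
            day = int(digits[m_len:m_len + d_len])
            year = int(year_part) + (2000 if len(year_part) == 2 else 0)
            try:
                found.add(date(year, month, day))
            except ValueError:  # e.g. month 0, month 62, or day 31 in April
                pass
    return sorted(found)


for text in ("12502", "62502", "121212"):
    print(text, [d.isoformat() for d in interpretations(text)])
```

For 12502 this yields two readings (January 25 and December 5, 2002), while 62502 has only one. 121212 also has two, because it can be read as mmddyy or as m, d and a 4-digit year.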
The CPAN modules Time::ParseDate and DateTime are probably what you're looking for, except the 62502 pattern:
use DateTime;
use Time::ParseDate;
foreach my $str (<DATA>) {
chomp $str;
$str =~ tr{ }{/};
my $epoch = parsedate($str, GMT => 1);
next unless $epoch; # skip 62502
my $dt = DateTime->from_epoch ( epoch => $epoch );
print $dt->ymd, "\n";
}
__DATA__
121212
4 9 12
5-3-2000
62502
3/3/11
09-08-2001
8 6 07
12 10 2004
4-16-08
3/7/2005
Once you have the DateTime object, you can extract year, month, and day information easily.
This solution handles all of the cases that you provided. But the solution isn't foolproof because the problem has ambiguities. E.g. how do we interpret the date 12502? Is it 1/25/02 or 12/5/02?
use 5.010;
while (my $line = <DATA>) {
chomp $line;
my @date = $line =~ /
\A
([01]?\d) # month is 1-2 digits, but the first digit may only be 0 or 1
[ \-\/]? # may or may not have a separator
([0123]?\d) # day is 1-2 digits
[ \-\/]?
(\d{2,4}) # year is 2-4 digits
\z
/x;
say join '_', @date;
}
__DATA__
121212
4 9 12
5-3-2000
12502
3/3/11
09-08-2001
8 6 07
12 10 2004
4-16-08
3/7/2005
This is the best I could come up with based on what info you've given. It matches all possibilities, and has error checking for month/day ranges and also the year (from 1900 to 2099)
/(1[012]|0?\d)([-\/ ]?)([12]\d|3[01]|0?\d)\2((19|20)?\d\d)/
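The `\2` backreference is what forces both separators to be the same character (or both absent). A quick way to see that behaviour is the following Python sketch, which reuses the pattern above with two assumptions of mine: the inner year group is made non-capturing, and `fullmatch` is used so the whole string must be a date (the original pattern is unanchored). `split_date` is a hypothetical helper name.

```python
import re

# Same pattern as above; (?:19|20) is non-capturing so group 4 is the year.
DATE_RE = re.compile(r"(1[012]|0?\d)([-/ ]?)([12]\d|3[01]|0?\d)\2((?:19|20)?\d\d)")


def split_date(text):
    """Return (month, day, year) strings, or None if the text is not a date."""
    m = DATE_RE.fullmatch(text)
    return (m.group(1), m.group(3), m.group(4)) if m else None


for text in ("5-3-2000", "62502", "121212", "3/7-2005"):
    print(text, split_date(text))
```

The last sample mixes `/` and `-`, so the backreference cannot match and the result is `None`.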
I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .
Imagine some sentences along the following lines:
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:
11 1/2" x 32"
8 x 10-3/5
22" x 17"
42 1/2" x 60 yd
5.76 by 8
84cm
13/19"
86 cm
I imagine a world where the following rules apply:
The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
A dimension is always described numerically, may or may not have units following it and may or may not have a fractional or decimal part. A dimension consisting of a fractional part on its own is allowed, e.g., 4/5".
Fractional parts always have a / separating the numerator / denominator, and one can assume there is no space between the parts (though if someone takes that into account that's great!).
Dimensions may be one-dimensional or two-dimensional, in which case one can assume the following are acceptable for separating two dimensions: {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
To show you how useless I am with regular expressions (and to show I at least tried!), I got this far. . .
[1-9]+[/ ][x1-9]
Update (2)
You guys are very fast and efficient! I'm going to add a few extra test cases that aren't covered by the regular expressions below:
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
These should result in the following (# indicates nothing should match):
12 yd
99 cm
#
22" x 17" x 12 cm
#
#
#
I've adapted M42's answer below, to:
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
But while that resolves some new test cases it now fails to match the following others. It reports:
11 1/2" x 32" PASS
(nothing) FAIL
22" x 17" PASS
42 1/2" x 60 yd PASS
(nothing) FAIL
84cm PASS
13/19" PASS
86 cm PASS
22" PASS
(nothing) FAIL
(nothing) FAIL
12 yd x FAIL
99 cm by FAIL
22" x 17" [and also, but separately '12 cm'] FAIL
PASS
PASS
New version, near the target, 2 failed tests
#!/usr/local/bin/perl
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
my $xx = '22" x 17"';
while(<DATA>) {
chomp;
if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
ok($1 eq $out[$i], $1 . ' in ' . $_);
} else {
ok($out[$i] eq 'no match', ' got "no match" in '.$_);
}
$i++;
}
done_testing;
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
output:
# Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
# at C:\tests\perl\test6.pl line 42.
# Failed test ' got "no match" in They are all 5.76 by 8 frames.'
# at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 - got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 - got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 - got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 - got "no match" in This is a product code: c720 with another number 83 x better.
ok 14 - got "no match" in A number on its own 21.
ok 15 - got "no match" in A volume shouldn't match 0.332 oz.
1..15
It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match numbers with a unit and sometimes numbers without one.
I'm sorry, I'm not able to do better.
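One possible way around that difficulty is to use two alternatives: a multi-part form ("A x B", "A by B") where units are optional, and a single-number form where a unit is required. The Python sketch below builds the pattern from an arbitrary unit list, as the question prefers. It is only a sketch under assumptions: the separators `x` and `by` are hard-coded, mixed numbers are written as "11 1/2" or "10-3/5", and `build_dimension_re` and `find_dimension` are hypothetical names.

```python
import re


def build_dimension_re(units):
    """Compile a regex for one- or multi-part dimensions using the given units."""
    # Word-like units get a trailing \b so "cm" does not match inside a longer word.
    alts = [re.escape(u) + (r"\b" if u.isalpha() else "")
            for u in sorted(units, key=len, reverse=True)]
    unit = "(?:" + "|".join(alts) + ")"
    # Integer or decimal with an optional " 1/2" or "-3/5" part, or a bare fraction.
    number = r"(?:\d+(?:\.\d+)?(?:[ -]\d+/\d+)?|\d+/\d+)"
    dim = rf"{number}(?:\s?{unit})?"       # unit is optional inside "A x B"
    dim_with_unit = rf"{number}\s?{unit}"  # a lone dimension must carry a unit
    # The lookbehind stops matches from starting inside "c720" or after "0.".
    return re.compile(
        rf"(?<![\w.])(?:{dim}(?:\s*(?:x|by)\s*{dim})+|{dim_with_unit})")


DIMENSION_RE = build_dimension_re(["cm", "mm", "yd", "yards", "feet", '"', "'"])


def find_dimension(text):
    """Return the first dimension found in text, or None."""
    match = DIMENSION_RE.search(text)
    return match.group(0) if match else None


print(find_dimension("They are all 5.76 by 8 frames."))
print(find_dimension("A volume shouldn't match 0.332 oz."))
```

Because "frames" is not in the unit list, the first call returns the text up to "5.76 by 8", and the second returns `None` since a lone number needs a known unit.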
One of many possible solutions (should be nlp compatible as it uses only basic regex syntax):
foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");
Will get your results :)
Explanation:
"
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\ # Match the character “ ” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
\. # Match the character “.” literally
| # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
"" # Match the character “""” literally
| # Or match regular expression number 5 below (the entire group fails if this one fails to match)
/ # Match the character “/” literally
)
[\d/""x -] # Match a single character present in the list below
# A single digit 0..9
# One of the characters “/""x”
# The character “ ”
# The character “-”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
\b # Assert position at a word boundary
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
by # Match the characters “by” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
yd # Match the characters “yd” literally
)
\b # Assert position at a word boundary
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
This is all I can get with a regular expression in Perl. Try to adapt it to your regex flavour:
\d.*\d(?:\s+\S+|\S+)
Explanation:
\d # One digit.
.* # Any number of characters.
\d # One digit. All joined means to find all content between first and last digit.
\s+\S+ # Some non-space characters after some space. It tries to match any unit like 'cm' or 'yd'.
| # Or. Select one of two expressions between parentheses.
\S+ # Any number of non-space characters. It tries to match double-quotes, or units joined to the
# last number.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/
}
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
Running the script:
perl script.pl
Result:
11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm