I'm working on a doc file that, when copied and pasted into a text file, gives me the following sample 'output':
ARTA215 ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.
This advanced study in drawing with the life ....
Prerequisite: ARTA150
Lab Fee Required
ARTA220 CERAMICS II (3 Cr) (2:2) + Studio 1 hr.
This course affords the student the opportunity to ex...
Lab Fee Required
ARTA250 SPECIAL TOPICS IN ART
This course focuses on selected topic....
ARTA260 PORTFOLIO DEVELOPMENT (3 Cr) (3:0)
The purpose of this course is to pre....
BIOS010 INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)
This course is a preparatory course designed to familiarize the begi....
BIOS101 GENERAL BIOLOGY (4 Cr) (3:3)
This course introduces the student to the principles of mo...
Lab Fee Required
BIOS102 INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)
This course is an introd....
Lab Fee Required
I want to parse it so that 3 fields are generated, and output the values into a .csv file.
The line breaks, spacing, etc. are just how they might appear at any point in the file.
My best guess is a regex that finds 4 capitalized alpha chars followed by 3 numeric chars, then checks whether the next 2 chars are capitalized (this accounts for the course number, and also avoids tripping on "Prerequisite: ARTA150" as in the first entry). After that, the regex finds the first line break and takes everything after it until the next course number. The 3 fields would be the course number, the course title, and the course description. The course number and title are always on the same line, and the description is everything beneath.
Sample end result would contain 3 fields which I'm guessing could be stored into 3 arrays:
"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
Like I said, it's quite a nightmare, but I want to automate this instead of cleaning up after someone each time the file is generated.
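A direct rendering of the heuristic described above (four capitals, three digits, a check that the next two characters are capitalized, description running up to the next course number) might look roughly like this untested sketch:

#!/usr/bin/perl
use strict;
use warnings;

local $/;                          # slurp the whole file
my $text = <>;

# course number: 4 capitals + 3 digits; title: rest of line, starting with 2 capitals
while ( $text =~ /^([A-Z]{4}\d{3})[ \t]+([A-Z]{2}[^\n]*)\n(.*?)(?=^[A-Z]{4}\d{3}[ \t]+[A-Z]{2}|\z)/msg ) {
    my ($num, $title, $desc) = ($1, $2, $3);
    $desc =~ s/\s+/ /g;            # fold the description onto one line
    $desc =~ s/^\s+|\s+$//g;
    print qq{"$num","$title","$desc"\n};
}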
Consider the following example that depends on blocks of course descriptions being completely contained within what Perl considers to be paragraphs:
#! /usr/bin/perl

$/ = "";    # paragraph mode

my $record_start = qr/
    ^            # at the start of a line
    \s*          # allow optional leading whitespace
    ([A-Z]+\d+)  # capture course tag, e.g., ARTA215
    \s+          # separating whitespace
    (.+?)        # course title on rest of line
    \s*\n        # consume trailing whitespace
/mx;

while (<>) {
    my ($course, $title);

    if (s/\A$record_start//) {
        ($course, $title) = ($1, $2);
    }
    elsif (s/(?s:^.+?)(?=$record_start)//) {
        redo;
    }
    else {
        next;
    }

    die unless s/^(.+?)(?=$record_start|\s*$)//s;
    (my $desc = $1) =~ s/\s*\n\s*/ /g;

    for ($course, $title, $desc) {
        s/^\s+//; s/\s+$//; s/\s+/ /g;
    }

    print join("," => map qq{"$_"} => $course, $title, $desc), "\n";

    redo if $_;
}
When fed your sample input, it outputs
"ARTA215","ADVANCED LIFE DRAWING (3 Cr) (2:2) + Studio 1 hr.","This advanced study in drawing with the life .... Prerequisite: ARTA150 Lab Fee Required"
"ARTA220","CERAMICS II (3 Cr) (2:2) + Studio 1 hr.","This course affords the student the opportunity to ex... Lab Fee Required"
"ARTA250","SPECIAL TOPICS IN ART","This course focuses on selected topic...."
"ARTA260","PORTFOLIO DEVELOPMENT (3 Cr) (3:0)","The purpose of this course is to pre...."
"BIOS010","INTRODUCTION TO BIOLOGICAL CONCEPTS (3IC) (2:2)","This course is a preparatory course designed to familiarize the begi...."
"BIOS101","GENERAL BIOLOGY (4 Cr) (3:3)","This course introduces the student to the principles of mo... Lab Fee Required"
"BIOS102","INTRODUCTION TO HUMAN BIOLOGY (4 Cr) (3:3)","This course is an introd.... Lab Fee Required"
Try:
my $course;
my @courses;

while ( my $line = <$input_handle> ) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        $course = [ $1, $2 ];
        push @courses, $course;
    }
    elsif ($course) {
        $course->[2] .= $line;
    }
    else {
        # garbage before first course in file
        next;
    }
}
This produces an array of arrays, as I understand you want. It would make more sense to me to have an array of hashes or even a hash of hashes.
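For example, a hash-of-hashes version of the same loop, keyed by course id, might look like this sketch ($input_handle and the key names title/description are just placeholders):

my %courses;
my $current;

while ( my $line = <$input_handle> ) {
    if ( $line =~ /^([A-Z]{4}\d+)\s+([A-Z]{2}.*)/ ) {
        $current = $1;
        $courses{$current} = { title => $2, description => '' };
    }
    elsif ($current) {
        $courses{$current}{description} .= $line;
    }
    # anything before the first course header is ignored
}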
I had roughly the same idea as Gbacon to use paragraph mode since that will neatly chunk the file into records for you. He typed faster, but I wrote one, so here's my crack at it:
#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "";    # paragraph mode

my @items;

while (<>) {
    my ( $course, $description ) = (split /\n/, $_)[0, 1];
    my ( $course_id, $name ) = ( $course =~ m/^(\w+)\s+(.*)$/ );
    push @items, [ $course_id, $name, $description ];
}

for my $record (@items) {
    print "Course id: ",        $record->[0], "\n";
    print "Name and credits: ", $record->[1], "\n";
    print "Description: ",      $record->[2], "\n";
}
As Ysth points out in a comment on Gbacon's answer, paragraph mode may not work here. If not, never mind.
regex may be overkill for this, as the pattern appears to be simply:
[course]
[description]
{Prerequisites}
{Lab Fee Required}
where [course] is composed of
[course#] [course title] {# Cr} [etc/don't care]
and the course# is just the first 7 characters.
so you can scan the file with a simple state-machine, something like:
// NOTE: THIS IS PSEUDOCODE
s = 'parseCourse'
f = openFile(blah)
l = readLine(f)
while (l) {
    if (s == 'parseCourse') {
        if (l.StartsWith('Prerequisite:')) {
            extractPrerequisite(l)
        }
        else if (l.StartsWith('Lab Fee Required')) {
            extractLabFeeRequired(l)
        }
        else {
            extractCourseInfo(l)
            s = 'parseDescription'
        }
    }
    else if (s == 'parseDescription') {
        extractDescription(l)
        s = 'parseCourse'
    }
    l = readLine(f)
}
close(f)
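A rough Perl rendering of that pseudocode might look like this (a sketch; the extract_* subs are placeholder stubs and input is read from the files named on the command line):

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder handlers; replace with whatever should happen to each piece.
sub extract_prerequisite { print "prerequisite: $_[0]\n" }
sub extract_lab_fee      { print "lab fee:      $_[0]\n" }
sub extract_course_info  { print "course:       $_[0]\n" }
sub extract_description  { print "description:  $_[0]\n" }

my $state = 'parseCourse';
while ( my $line = <> ) {
    chomp $line;
    if ( $state eq 'parseCourse' ) {
        if    ( $line =~ /^Prerequisite:/ )    { extract_prerequisite($line) }
        elsif ( $line =~ /^Lab Fee Required/ ) { extract_lab_fee($line) }
        else {
            extract_course_info($line);
            $state = 'parseDescription';
        }
    }
    elsif ( $state eq 'parseDescription' ) {
        extract_description($line);
        $state = 'parseCourse';
    }
}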
#!/usr/bin/perl

$/  = "\n\n";   # record separator
$FS = "\n";     # field separator
$,  = ',';      # output field separator

while (<>) {
    chomp;
    @F = split($FS, $_);
    print join($,, @F) . "\n";
}
Imagine I have a sequence of files, e.g.:
...
segment8_400_av.ts
segment9_400_av.ts
segment10_400_av.ts
segment11_400_av.ts
segment12_400_av.ts
...
When the filenames are known, I can match against them with a regular expression like:
/segment(\d+)_400_av\.ts/
Because I know the incremental pattern.
But what would be a generic approach to this? I mean, how can I take two file names out of the list, compare them, and find out where in the file name the counting part is, taking into account any other digits that can occur in the filename (the 400 in this case)?
Goal: What I want to do is run the script against various file sequences to check, for example, for missing files, so this should be the first step: finding out the numbering scheme. File sequences can occur in many different forms, e.g.:
test_1.jpg (simple counting suffix)
test_2.jpg
...
or
segment9_400_av.ts (counting part inbetween, with other static digits)
segment10_400_av.ts
...
or
01_trees_00008.dpx (padded with zeros)
01_trees_00009.dpx
01_trees_00010.dpx
Edit 2: Probably my problem can be described more simply: with a given set of files, I want to:
Find out if they are a numbered sequence of files, following the rules below
Get the first file number, get the last file number and file count
Detect missing files (gaps in the sequence)
Rules:
As melpomene summarized in his answer, the file names only differ in one substring, which consists only of digits
The counting digits can occur anywhere in the filename
The digits can be padded with 0's (see example above)
I can do #2 and #3; what I am struggling with is #1 as a starting point.
You tagged this question regex, so here's a regex-based solution:
use strict;
use warnings;

my $name1 = 'segment12_400_av.ts';
my $name2 = 'segment10_400_av.ts';

if (
    "$name1\0$name2" =~ m{
        \A
        ( \D*+ (?: \d++ \D++ )* )   # prefix
        ( \d++ )                    # numeric segment 1
        ( [^\0]* )                  # suffix
        \0                          # separator
        \1                          # prefix
        ( \d++ )                    # numeric segment 2
        \3                          # suffix
        \z
    }xa
) {
    print <<_EOT_;
Result of comparing "$name1" and "$name2"
Common prefix: $1
Common suffix: $3
Varying numeric parts: $2 / $4
Position of varying numeric part: $-[2]
_EOT_
}
Output:
Result of comparing "segment12_400_av.ts" and "segment10_400_av.ts"
Common prefix: segment
Common suffix: _400_av.ts
Varying numeric parts: 12 / 10
Position of varying numeric part: 7
It assumes that
the strings are different (guard the condition with $name1 ne $name2 && ... if that's not guaranteed)
there's only one substring that's different between the input strings (otherwise it won't find any match)
the differing substring consists of digits only
all digits surrounding the first point of difference are part of the varying increment (e.g. the example above recognizes segment as the common prefix, not segment1)
The idea is to combine the two names into a single string (separated by NUL, which is unambiguous because filenames can't contain \0), then let the regex engine do the hard work of finding the longest common prefix (using greediness and backtracking).
Because we're in a regex, we can get a bit more fancy than just finding the longest common prefix: We can make sure that the prefix doesn't end with a digit (see the segment1 vs. segment case above) and we can verify that the suffix is also the same.
See if this works for you:
use strict;
use warnings;

sub compare {
    my ( $f1, $f2 ) = @_;
    my @f1 = split /(\d+)/sxm, $f1;
    my @f2 = split /(\d+)/sxm, $f2;
    my $i    = 0;
    my $out1 = q{};
    my $out2 = q{};
    foreach my $p (@f1) {
        if ( $p eq $f2[$i] ) {
            $out1 .= $p;
            $out2 .= $p;
        }
        else {
            $out1 .= sprintf ' ((%s)) ', $p;
            $out2 .= sprintf ' ((%s)) ', $f2[$i];
        }
        $i++;
    }
    print $out1 . "\n";
    print $out2 . "\n";
    return;
}

print "Test1:\n";
compare( 'segment8_400_av.ts', 'segment9_400_av.ts' );

print "\n\nTest2:\n";
compare( 'segment999_8_400_av.ts', 'segment999_9_400_av.ts' );
You basically split the strings on runs of digits (keeping the digits as separate pieces), then loop through the pieces and compare each pair. If they are equal, you accumulate. If not, you highlight the difference and accumulate.
Output (I'm using ((number)) for the highlight)
Test1:
segment ((8)) _400_av.ts
segment ((9)) _400_av.ts
Test2:
segment999_ ((8)) _400_av.ts
segment999_ ((9)) _400_av.ts
I assume that only the counter differs across the strings
use warnings;
use strict;
use feature 'say';

my ($fn1, $fn2) = ('segment8_400_av.ts', 'segment12_400_av.ts');

# Collect all numbers from all strings
my @nums = map { [ /([0-9]+)/g ] } ($fn1, $fn2);

my ($n, $pos);    # which number in the string, at what position

# Find which differ
NUMS:
for my $j (1 .. $#nums) {                  # strings
    for my $i (0 .. $#{ $nums[0] }) {      # numbers in a string
        if ($nums[$j]->[$i] != $nums[0]->[$i]) {   # it is the i-th number
            $n = $i;
            $fn1 =~ /($nums[0]->[$i])/g;   # to find its position
            $pos = $-[1];
            say "It is $i-th number in a string. Position: $pos";
            last NUMS;
        }
    }
}
We loop over the array with arrayrefs of numbers found in each string, and over elements of each arrayref (eg [8, 400]). Each number in a string (0th or 1st or ...) is compared to its counterpart in the 0-th string (array element); all other numbers are the same.
The number of interest is the one that differs and we record which number in a string it is ($n-th).
Then its position in the string is found by matching that number again and using the @- regex variable, which gives the offset of the start of the capture. This part may be unneeded; while the question edits helped, I am still unsure whether the position may or may not be useful.
Prints, with position counting from 0
It is 0-th number in a string. Position: 7
Note that, once it is found that it is the $i-th number, we can't use index to find its position; a number earlier in the string may happen to be the same as the $i-th one.
To test, modify input strings by adding the same number to each, before the one of interest.
Per question update, to examine the sequence (for missing files for instance), with the above findings you can collect counters for all strings in an array with hashrefs (num => filename)
use Data::Dump qw(dd);

my @seq = map { { $nums[$_]->[$n] => $fnames[$_] } } 0 .. $#fnames;
dd \@seq;

where @fnames contains the filenames (like the two picked for the example above, $fn1 and $fn2). This assumes that the file list was sorted to begin with; add the sort if it wasn't:

my @seq =
    sort { (keys %$a)[0] <=> (keys %$b)[0] }
    map  { { $nums[$_]->[$n] => $fnames[$_] } }
    0 .. $#fnames;
The order is maintained by array.
Adding this to the above example (with two strings) adds to the print
[
{ 8 => "segment8_400_av.ts" },
{ 12 => "segment12_400_av.ts" },
]
With this, all goals in "Edit 2" should be straightforward.
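For instance, gap detection over the sorted @seq can then be a few lines (a sketch; the counters are treated as plain integers, so zero-padding is ignored):

my @counters = map { 0 + (keys %$_)[0] } @seq;
my ($first, $last) = @counters[0, -1];
say "first: $first  last: $last  count: " . @counters;

my %have    = map { $_ => 1 } @counters;
my @missing = grep { !$have{$_} } $first .. $last;
say "missing: @missing" if @missing;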
I suggest that you build a regex pattern by changing all digit sequences to (\d+) and then see which captured values have changed
For instance, with segment8_400_av.ts and
segment9_400_av.ts you would generate a pattern /segment(\d+)_(\d+)_av\.ts/. Note that s/\d+/(\d+)/g will return the number of numeric fields, which you will need for the subsequent check
The first would capture 8 and 400, while the second would capture 9 and 400. 8 is different from 9, so it is in that region of the string that the number varies
I can't really write much code as you don't say what sort of result you want from this process
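Even so, a minimal sketch of that idea might look like the following (the quotemeta over the literal pieces is an extra precaution that the answer above does not mention):

use strict;
use warnings;

my ($name1, $name2) = ('segment8_400_av.ts', 'segment9_400_av.ts');

# Quote the literal parts and turn every digit run into (\d+)
my @literal = split /\d+/, $name1, -1;
my $pattern = join '(\d+)', map { quotemeta } @literal;

my @caps1 = $name1 =~ /^$pattern\z/ or die "first name does not match its own pattern";
my @caps2 = $name2 =~ /^$pattern\z/ or die "names do not share the same layout";

for my $i (0 .. $#caps1) {
    print "numeric field $i differs: $caps1[$i] vs $caps2[$i]\n"
        if $caps1[$i] ne $caps2[$i];
}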
I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( @{$body} ) {
    next if /^$/;
    if ( /NEMS/i ) {
        /.*?(\d{5,7}).*/;
        $nems = $1;
        next;
    }
    if ( $delimit ) {
        next if (/$delimit/ && ! $tp);
        last if (/$delimit/ && $tp);
        $tp = 1, next if /text.plain/;
        $tp = 0, next if /text.html/;
        s/<[^>]*>//g;
        $newbody .= $_ if $tp;
    } else {
        s/<[^>]*>//g;
        $newbody .= $_;
    }
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( @{$body_text} ) {
    next if /^$/;
    if ( /NEMS/i ) {
        /.*?(\d{5,7}).*/;
        $nems = $1;
        next;
    }
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach ( split(/\n/, $body_text) ) {
    next if /^$/;
    if ( /NEMS/i ) {
        /.*?(\d{5,7}).*/;
        $nems = $1;
        next;
    }
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach ( split(/\n/, $body_text) ) {
    next if /^$/;
    if ( /NEMS/i ) {
        /^\s*NEMS\s*=\s*(\d+)/i;
        $nems = $1;
        next;
    }
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
    # ...
    next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
    # ...
}
which incorporates the clarification that the new $body_text is a multiline string.
Note that $nems must be declared (it is needed) outside of the loop, and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit It's been clarified that there can only be 5 or 7 digits. Then the regex can be tightened, to check whether input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The clarification of the evolved problem understanding states that all digits following = need to be captured into $nems (and there may be 5, 7, 8, 9, or 10 digits, but not 6). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means a digit, one or more times. So a string of digits (match fails if there are none).
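Put together, a minimal self-contained example (with a made-up $body_text) would be:

use strict;
use warnings;

my $body_text = "Subject: whatever\nNEMS = 1234567\nmore text\n";   # made-up sample

my $nems;
for my $line ( split /\n/, $body_text ) {
    last if ($nems) = $line =~ /^\s*NEMS\s*=\s*(\d+)/i;
}
print defined $nems ? "NEMS: $nems\n" : "no NEMS line found\n";     # prints "NEMS: 1234567"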
I have an Ispell list of English words (nearly 50,000 words), and my homework in Perl is to quickly (say, under one minute) get a list of all strings that are substrings of some other word. I have tried a solution with two foreach loops comparing all words, but even with some optimizations, it's still too slow. I think the right solution could be some clever use of regular expressions on the array of words. Do you know how to solve this problem quickly (in Perl)?
I have found a fast solution, which can find all these substrings in about 15 seconds on my computer, using just one thread. Basically, for each word, I create an array of every possible substring (eliminating substrings which differ only in an "s" or "'s" ending):
# take a word and return a list of all valid substrings
sub split_to_all_valid_subwords {
    my $word = $_[0];
    my @split_list;
    my ($i, $j);
    for ($i = 0; $i < length($word); ++$i) {
        for ($j = 1; $j <= length($word) - $i; ++$j) {
            unless
            (
                ($j == length($word)) or
                ($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
                ($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
            )
            {
                push(@split_list, substr($word, $i, $j));
            }
        }
    }
    return @split_list;
}
Then I just create a list of all substring candidates and intersect it with the word list:
my @substring_candidates;
foreach my $word (@words) {
    push( @substring_candidates, split_to_all_valid_subwords($word) );
}

# make intersection between substring candidates and words
my %substring_candidates = map { $_ => 1 } @substring_candidates;
my %words                = map { $_ => 1 } @words;

my @substrings = grep( $substring_candidates{$_}, @words );
Now @substrings holds all the words that are substrings of some other word.
Perl regular expressions will optimize patterns like foo|bar|baz into an Aho-Corasick match - up to a certain limit of total compiled regex length. Your 50000 words will probably exceed that length, but could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)
Alternatively, you could just implement Aho-Corasick in your perl code - that's kind of fun to do.
update
Ondra supplied a beautiful solution in his answer; I leave my post here as an example of overthinking a problem and failed optimisation techniques.
My worst case kicks in for a word that doesn't match any other word in the input. In that case, it goes quadratic. OPT_PRESORT was an attempt to avert the worst case for most words. OPT_CONSECUTIVE was a linear-complexity filter that reduced the total number of items in the main part of the algorithm, but it is just a constant factor when considering the complexity. However, it is still useful with Ondra's algorithm and saves a few seconds, as building his split list is more expensive than comparing two consecutive words.
I updated the code below to select Ondra's algorithm as a possible optimisation. Paired with zero threads and the presort optimisation, it yields maximum performance.
I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. Therefore, it computes the opposite of ysth's ideas, but I took the idea of optimisation #2 from his answer. There are the following three main optimisations that can be deactivated if required.
Multithreading
The questions "Is word A in list L? Is word B in L?" can be easily parallelised.
Pre-sorting all the words for their length
I create an array that points to the list of all words that are longer than a certain length, for every possible length. For long words, this can cut down the number of possible words dramatically, but it trades quite a lot of space, as one word of length n appears in all lists from length 1 to length n.
Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar:
Abby
Abby's
for example. As every word that would match the first word also matches the second one, I immediately add the first word to the list of matching words, and only keep the second word for further testing. This saved about 30% of words in my test cases. Because I do that before optimisation No 2, this also saves a lot of space. Another trade-off is that the output will not be sorted.
The script itself is ~120 lines long; I explain each sub before showing it.
head
This is just a standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Add the number of processors of your machine in that field. The OPT_MAX variable can take the number of words you want to process, however this is evaluated after the optimisations have taken place, so the easy words will already have been caught by the OPT_CONSECUTIVE optimisation. Adding anything there will make the script seemingly slower. $|++ makes sure that the status updates are shown immediately. I exit after the main is executed.
#!/usr/bin/perl

use strict; use warnings; use feature qw(say); use threads;
$| = 1;

use constant PROCESSORS      => 0;  # (false, n) number of threads
use constant OPT_MAX         => 0;  # (false, n) number of words to check
use constant OPT_PRESORT     => 0;  # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1;  # (true / false) prefilter data while loading
use constant OPT_ONDRA       => 1;  # select the awesome Ondra algorithm
use constant BLABBER_AT      => 10; # (false, n) print progress at n percent

die q(The optimisations Ondra and Presort are mutually exclusive.)
    if OPT_PRESORT and OPT_ONDRA;

exit main();
main
Encapsulates the main logic, and does multi-threading. The number of words that will match is considerably smaller than the number of input words, if the input was sorted. After I have selected all matched words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.
sub main {
    my @matching;                          # the matching words.
    my @words = load_words(\@matching);    # the words to be searched
    say STDERR 0+@words . " words to be matched";
    my $prepared_words = prepare_words(@words);
    # do the matching, possibly multithreading
    if (PROCESSORS) {
        my @threads =
            map { threads->new(
                \&test_range,
                $prepared_words,
                @words[$$_[0] .. $$_[1]] )
            } divide(PROCESSORS, OPT_MAX || 0+@words);
        push @matching, $_->join for @threads;
    } else {
        push @matching, test_range(
            $prepared_words,
            @words[0 .. (OPT_MAX || 0+@words)-1]);
    }
    say STDERR 0+@matching . " words matched";
    say for @matching;    # print out the matching words.
    0;
}
load_words
This reads all the words from the input files which were supplied as command line arguments. Here the OPT_CONSECUTIVE optimisation takes place. The $last word is either put into the list of matching words, or into the list of words to be matched later. The -1 != index($a, $b) decides if the word $b is a substring of word $a.
sub load_words {
    my $matching = shift;
    my @words;
    if (OPT_CONSECUTIVE) {
        my $last;
        while (<>) {
            chomp;
            if (defined $last) {
                push @{ -1 != index($_, $last) ? $matching : \@words }, $last;
            }
            $last = $_;
        }
        push @words, $last // ();
    } else {
        @words = map {chomp; $_} <>;
    }
    @words;
}
prepare_words
This "blows up" the input words, sorting them after their length into each slot, that has the words of larger or equal length. Therefore, slot 1 will contain all words. If this optimisation is deselected, it is a no-op and passes the input list right through.
sub prepare_words {
    if (OPT_ONDRA) {
        my $ondra_split = sub {    # evil: using $_ as implicit argument
            my @split_list;
            for my $i (0 .. length $_) {
                for my $j (1 .. length($_) - ($i || 1)) {
                    push @split_list, substr $_, $i, $j;
                }
            }
            @split_list;
        };
        return +{ map {$_ => 1} map &$ondra_split(), @_ };
    } elsif (OPT_PRESORT) {
        my @prepared = ([]);
        for my $w (@_) {
            push @{$prepared[$_]}, $w for 1 .. length $w;
        }
        return \@prepared;
    } else {
        return [@_];
    }
}
test
This tests if the word $w is a substring in any of the other words. $wbl points to the data structure that was created by the previous sub: Either a flat list of words, or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using index is much faster than using a regex.
sub test {
    my ($w, $wbl) = @_;
    my $l = length $w;
    if (OPT_PRESORT) {
        for my $try (@{ $$wbl[$l + 1] }) {
            return 1 if -1 != index $try, $w;
        }
    } else {
        for my $try (@$wbl) {
            return 1 if $w ne $try and -1 != index $try, $w;
        }
    }
    return 0;
}
divide
This just encapsulates an algorithm that guarantees a fair distribution of $items items into $parcels buckets. It outputs the bounds of a range of items.
sub divide {
    my ($parcels, $items) = @_;
    say STDERR "dividing $items items into $parcels parcels.";
    my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
    my @distributions =
        map [
            $_       * $min_size + ($_ < $rest ? $_ : $rest),
            ($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
        ], 0 .. $parcels - 1;
    say STDERR "range division: @$_" for @distributions;
    return @distributions;
}
test_range
This calls test for each word in the input list, and is the sub that is multithreaded. grep selects all those elements in the input list where the code (given as first argument) returns true. It also regularly outputs a status message like thread 2 at 10%, which makes waiting for completion much easier. This is a psychological optimisation ;-).
sub test_range {
    my $wbl = shift;
    if (BLABBER_AT) {
        my $range = @_;
        my $step  = int($range / 100 * BLABBER_AT) || 1;
        my $i     = 0;
        return
            grep {
                if (0 == ++$i % $step) {
                    printf STDERR "... thread %d at %2d%%\n",
                        threads->tid,
                        $i / $step * BLABBER_AT;
                }
                OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
            } @_;
    } else {
        return grep { OPT_ONDRA ? $wbl->{$_} : test($_, $wbl) } @_;
    }
}
invocation
Using bash, I invoked the script like
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
Where 1000 is the number of lines I wanted to input, dict/words was the word list I used, and /dev/null is the place I want to store the output list, in this case, throwing the output away. If the whole file should be read, it can be passed as an argument, like
$ perl script.pl input-file >output-file
time just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed in just over two minutes in my case, which is actually quite good.
update: more like 6–7 seconds now, with the Ondra + Presort optimisation, and no threading.
further optimisations
update: overcome by better algorithm. This section is no longer completely valid.
The multithreading is awful. It allocates quite some memory and isn't exactly fast. This isn't surprising considering the amount of data. I considered using a Thread::Queue, but that thing is slow like $#*! and therefore is a complete no-go.
If the inner loop in test was coded in a lower-level language, some performance might be gained, as the index built-in wouldn't have to be called. If you can code C, take a look at the Inline::C module. If the whole script were coded in a lower language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).
When trying to validate that a string is made up of alphabetic characters only, two possible regex solutions come to my mind.
The first one checks that every character in the string is alphabetic:
/^[a-z]+$/
The second one tries to find a character somewhere in the string that is not alphabetic:
/[^a-z]/
(Yes, I could use character classes here.)
Is there any significant performance difference for long strings?
(If anything, I'd guess the second variant is faster.)
Just by looking at it, I'd say the second method is faster.
However, I made a quick non-scientific test, and the results seem to be inconclusive:
Regex Match vs. Negation.
P.S. I removed the group capture from the first method. It's superfluous, and would only slow it down.
Wrote this quick Perl code:
@testStrings = qw(asdfasdf asdf as aa asdf as8up98;n;kjh8y puh89uasdf ;lkjoij44lj 'aks;nasf na ;aoij08u4 43[40tj340ij3 ;salkjaf; a;lkjaf0d8fua ;alsf;alkj
    a a;lkf;alkfa as;ldnfa;ofn08h[ijo ok;ln n ;lasdfa9j34otj3;oijt 04j3ojr3;o4j ;oijr;o3n4f;o23n a;jfo;ie;o ;oaijfoia ;aosijf;oaij ;oijf;oiwj;
    qoeij;qwj;ofqjf08jf0 ;jfqo;j;3oj4;oijt3ojtq;o4ijq;onnq;ou4f ;ojfoqn;aonfaoneo ;oef;oiaj;j a;oefij iiiii iiiiiiiii iiiiiiiiiii);

print "test 1: \n";
foreach my $i (1..1000000) {
    foreach (@testStrings) {
        if ($_ =~ /^([a-z])+$/) {
            #print "match"
        } else {
            #print "not"
        }
    }
}
print `date` . "\n";

print "test 2: \n";
foreach my $j (1..1000000) {
    foreach (@testStrings) {
        if ($_ =~ /[^a-z]/) {
            #print "match"
        } else {
            #print "not"
        }
    }
}
then ran it with:
date; <perl_file>; date
It isn't 100% scientific, but it gives us a good idea. The first regex took 10 or 11 seconds to execute, while the second took 8 seconds.
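For a less ad-hoc measurement, the core Benchmark module can compare the two patterns directly; a minimal sketch (the test string is arbitrary):

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $long = 'a' x 1_000_000;    # a long string that is all lowercase letters

cmpthese(-3, {
    anchored => sub { $long =~ /^[a-z]+$/ },
    negated  => sub { $long =~ /[^a-z]/ },
});

Note that on an all-lowercase string both patterns have to scan to the very end (one to succeed, one to fail), so this measures steady-state scanning speed rather than early exits.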
I want to split a command-line-like string into single string parameters. What would the regular expression for that look like? The problem is that the parameters can be quoted. For example:
"param 1" param2 "param 3"
should result in:
param 1, param2, param 3
You should not use regular expressions for this. Write a parser instead, or use one provided by your language.
I don't see why I get downvoted for this. This is how it could be done in Python:
>>> import shlex
>>> shlex.split('"param 1" param2 "param 3"')
['param 1', 'param2', 'param 3']
>>> shlex.split('"param 1" param2 "param 3')
Traceback (most recent call last):
[...]
ValueError: No closing quotation
>>> shlex.split('"param 1" param2 "param 3\\""')
['param 1', 'param2', 'param 3"']
Now tell me that racking your brain about how a regex will solve this problem is ever worth the hassle.
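For what it's worth, the Perl counterpart is the core Text::ParseWords module, so there is no need to hand-roll a regex there either:

use Text::ParseWords qw(shellwords);

my @params = shellwords(q{"param 1" param2 "param 3"});
# @params is ('param 1', 'param2', 'param 3')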
I tend to use regexlib for this kind of problem. If you go to: http://regexlib.com/ and search for "command line" you'll find three results which look like they are trying to solve this or similar problems - should be a good start.
This may work:
http://regexlib.com/Search.aspx?k=command+line&c=-1&m=-1&ps=20
("[^"]+"|[^\s"]+)
What I use:
C++
#include <iostream>
#include <iterator>
#include <string>
#include <regex>

void foo()
{
    std::string strArg = " \"par 1\" par2 par3 \"par 4\"";
    std::regex word_regex( "(\"[^\"]+\"|[^\\s\"]+)" );

    auto words_begin =
        std::sregex_iterator(strArg.begin(), strArg.end(), word_regex);
    auto words_end = std::sregex_iterator();

    for (std::sregex_iterator i = words_begin; i != words_end; ++i)
    {
        std::smatch match = *i;
        std::string match_str = match.str();
        std::cout << match_str << '\n';
    }
}
Output:
"par 1"
par2
par3
"par 4"
Without regard to implementation language, your regex might look something like this:
("[^"]*"|[^"]+)(\s+|$)
The first part "[^"]*" looks for a quoted string that doesn't contain embedded quotes, and the second part [^"]+ looks for a sequence of non-quote characters. The \s+ matches a separating sequence of spaces, and $ matches the end of the string.
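Applied from Perl, for example, it might be used like this (the sample string and the quote-stripping step are my additions):

my $line = q{"param 1" param2 "param 3"};
my @params;
while ( $line =~ /("[^"]*"|[^"]+)(\s+|$)/g ) {
    (my $p = $1) =~ s/^"|"$//g;    # drop surrounding quotes, if any
    push @params, $p;
}
# @params is ('param 1', 'param2', 'param 3')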
Regex: /[\/-]?((\w+)(?:[=:]("[^"]+"|[^\s"]+))?)(?:\s+|$)/g
Sample: /P1="Long value" /P2=3 /P3=short PwithoutSwitch1=any PwithoutSwitch2
This regex parses a parameter list built by the following rules:
Parameters are separated by spaces (one or more).
A parameter can have a switch symbol (/ or -).
A parameter consists of a name and a value, divided by the symbol = or :.
The name can be a set of alphanumerics and underscores.
The value can be absent.
If a value exists, it can be any set of symbols, but if it contains a space the value should be quoted.
This regex has three groups:
the first group contains the whole parameter without the switch symbol,
the second group contains the name only,
the third group contains the value only (if it exists).
For sample above:
Whole match: /P1="Long value"
Group#1: P1="Long value",
Group#2: P1,
Group#3: "Long value".
Whole match: /P2=3
Group#1: P2=3,
Group#2: P2,
Group#3: 3.
Whole match: /P3=short
Group#1: P3=short,
Group#2: P3,
Group#3: short.
Whole match: PwithoutSwitch1=any
Group#1: PwithoutSwitch1=any,
Group#2: PwithoutSwitch1,
Group#3: any.
Whole match: PwithoutSwitch2
Group#1: PwithoutSwitch2,
Group#2: PwithoutSwitch2,
Group#3: absent.
Most languages have other functions (either built-in or provided by a standard library) which will parse command lines far more easily than building your own regex, plus you know they'll do it accurately out of the box. If you edit your post to identify the language that you're using, I'm sure someone here will be able to point you at the one used in that language.
Regexes are very powerful tools and useful for a wide range of things, but there are also many problems for which they are not the best solution. This is one of them.
This will split an exe from its params, stripping quotes from the exe; it assumes clean data:
^(?:"([^"]+(?="))|([^\s]+))["]{0,1} +(.+)$
You will get three match groups (only one of the first two will be populated):
The exe if it was wrapped in quotes
The exe if it was not wrapped in quotes
The clump of parameters
Examples:
"C:\WINDOWS\system32\cmd.exe" /c echo this
Match 1: C:\WINDOWS\system32\cmd.exe
Match 2: $null
Match 3: /c echo this
C:\WINDOWS\system32\cmd.exe /c echo this
Match 1: $null
Match 2: C:\WINDOWS\system32\cmd.exe
Match 3: /c echo this
"C:\Program Files\foo\bar.exe" /run
Match 1: C:\Program Files\foo\bar.exe
Match 2: $null
Match 3: /run
Thoughts:
I'm pretty sure that you would need to create a loop to capture a possibly infinite number of parameters.
This regex could easily be looped on its third match until the match fails, meaning there are no more params.
If its just the quotes you are worried about, then just write a simple loop to dump character by character to a string ignoring the quotes.
Alternatively if you are using some string manipulation library, you can use it to remove all quotes and then concatenate them.
There's a Python answer, so we shall have a Ruby answer as well :)
require 'shellwords'
Shellwords.shellsplit '"param 1" param2 "param 3"'
#=> ["param 1", "param2", "param 3"]
or:
'"param 1" param2 "param 3"'.shellsplit
Though this answer is not regex-specific, it covers Python command-line argument parsing:
dash and double dash flags
int/float conversion based on SO answer
import sys

def parse_cmd_args():
    _sys_args = sys.argv
    _parts = {}
    _key = "script"
    _parts[_key] = [_sys_args.pop(0)]
    for _part in _sys_args:
        # Parse numeric values (floats and integers)
        if _part.replace("-", "1", 1).replace(".", "1").replace(",", "").isdigit():
            _part = _part.replace(",", "")
            _part = float(_part) if "." in _part else int(_part)
            _parts[_key].append(_part)
        elif "=" in _part:
            _part = _part.split("=")
            _parts[_part[0].strip("-")] = _part[1].strip().split(",")
        elif _part.startswith("-"):
            _key = _part.strip("-")
            _parts[_key] = []
        else:
            _parts[_key].extend(_part.split(","))
    return _parts
Something like:
"(?:(?<=")([^"]+)"\s*)|\s*([^"\s]+)
or a simpler one:
"([^"]+)"|\s*([^"\s]+)
(just for the sake of finding a regexp ;) )
Apply it several times, and group n°1 will give you the parameter, whether it is surrounded by double quotes or not.
If you are looking to parse the command and the parameters I use the following (with ^$ matching at line breaks aka multiline):
(?<cmd>^"[^"]*"|\S*) *(?<prm>.*)?
In case you want to use it in your C# code, here it is properly escaped:
try {
    Regex RegexObj = new Regex("(?<cmd>^\\\"[^\\\"]*\\\"|\\S*) *(?<prm>.*)?");
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
It will parse the following and know what is the command versus the parameters:
"c:\program files\myapp\app.exe" p1 p2 "p3 with space"
app.exe p1 p2 "p3 with space"
app.exe
Here's a solution in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

sub parse_arguments {
    my $text = shift;
    my $i = 0;
    my @args;

    while ($text ne '') {
        $text =~ s{^\s*(['"]?)}{};      # look for (and remove) leading quote
        my $delimiter = ($1 || ' ');    # use space if not quoted
        if ($text =~ s{^(([^$delimiter\\]|\\.|\\$)+)($delimiter|$)}{}) {
            $args[$i++] = $1;           # acquired an argument; save it
        }
    }
    return @args;
}

my $line = <<'EOS';
"param 1" param\ 2 "pa\"ram' '3" 'pa\'ram" "4'
EOS

say "ARG: $_" for parse_arguments($line);
Output:
ARG: param 1
ARG: param\ 2
ARG: pa"ram' '3
ARG: pa'ram" "4
Note the following:
Arguments can be quoted with either " or ' (with the "other"
quote type treated as a regular character for that argument).
Spaces and quotes in arguments can be escaped with \.
The solution can be adapted to other languages. The basic approach is to (1) determine the delimiter character for the next string, (2) extract the next argument up to an unescaped occurrence of that delimiter or to the end-of-string, then (3) repeat until empty.
\s*("[^"]+"|[^\s"]+)
that's it
(Reading your question again just prior to posting, I note you say command line LIKE string, so this information may not be useful to you, but as I have written it I will post it anyway - please disregard if I have misunderstood your question.)
If you clarify your question I will try to help, but from the general comments you have made I would say don't do that :-) - you are asking for a regexp to split a series of parameters into an array. Instead of doing this yourself, I would strongly suggest you consider using getopt; there are versions of this library for most programming languages. Getopt will do what you are asking and scales to manage much more sophisticated argument processing should you require that in the future.
If you let me know what language you are using I will try and post a sample for you.
Here are a sample of the home pages:
http://www.codeplex.com/getopt
(.NET)
http://www.urbanophile.com/arenn/hacking/download.html
(java)
A sample (from the java page above)
Getopt g = new Getopt("testprog", argv, "ab:c::d");
//
int c;
String arg;
while ((c = g.getopt()) != -1)
{
    switch(c)
    {
        case 'a':
        case 'd':
            System.out.print("You picked " + (char)c + "\n");
            break;
            //
        case 'b':
        case 'c':
            arg = g.getOptarg();
            System.out.print("You picked " + (char)c +
                             " with an argument of " +
                             ((arg != null) ? arg : "null") + "\n");
            break;
            //
        case '?':
            break; // getopt() already printed an error
            //
        default:
            System.out.print("getopt() returned " + c + "\n");
    }
}