Regular expression to match strings with embedded spaces - regex

I am trying to write a regular expression, but I can't get it to match past the spaces between words.
I have a data file like this (generated by another utility):
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
The pattern has to match and capture data like this
"field" "100"
"blahbla" "<Set>"
"scree" "<what>"
".Cont.asasd" ""
"Othreaol" "Value, Other value"
My early solution is
/^([\s\*]+)([\w]+[\s\.\-\>]{0,2}[\w]+)(\s*\:\s)(.*)/
but I have a problem with strings such as
Z.15 example : No
the space stops the pattern from matching
H.25 miss here : No
same thing here

There are some complicated answers here. I think I'd use a simple split:
while( <DATA> ) {
chomp;
my( $field, $value ) = split /\s*:\s*/, $_, 2;
print "Field [$field] value [$value]\n";
}
__DATA__
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
This gives:
Field [* field] value [100]
Field [blahbla] value [<Set>]
Field [scree] value [<what>]
Field [.Cont.asasd] value []
Field [Othreaol] value [Value, Other value]
Field [Point->IP] value [0.0.0.0 Port 5060]
From there, I'd filter the names and values as needed instead of trying to do it all in a single regex:
my @pairs =
grep { $_->[0] !~ /->/ } # filter keys
map { $_->[0] =~ s/\A\*\s+//; $_ } # transform keys
map { chomp; [ split /\s*:\s*/, $_, 2 ] } # parse line
<DATA>;
use Data::Printer;
p @pairs;
__DATA__
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060

Since you want to separate the values by colon, use the complement of that character in your regex for all those characters before the split.
my $regex
= qr{
( # v- no worry, this matches the first non-space, non-colon
[^\s:]
(?> [^:\n]* # this matches all non-colon chars on the line
[^\s:] # match the last non-space, non-colon, if there
)? # but possibly not there
) # end group
\s* # match any number of whitespace
: # match the colon
\s* # followed by any number of whitespace
( \S # Start second capture with any non space
(?> .* # anything on the same line
\S # ending in a non-space
)? # But, possibly not there at all
| # OR
) # nothing - this gives the second capture as an
# empty string instead of an undef
}x;
while ( <$in> ) {
$hash{ $1 } = $2 if m/$regex/;
}
%hash then looks like this:
{ '* field' => '100'
, '.Cont.asasd' => ''
, 'H.25 miss here' => 'No'
, Othreaol => 'Value, Other value'
, 'Point->IP' => '0.0.0.0 Port 5060'
, 'Z.15 example' => 'No'
, blahbla => '<Set>'
, scree => '<what>'
}
Of course, as I begin to think on it, if you could be assured of a /\s+:\s+/ pattern or at least a /\s{2,}:\s{2,}/ pattern, it might be simpler to just split the line like so:
while ( <$in> ) {
if ( my ( $k, @v )
= grep {; length } split /\A\s+|\s+\z|(\s+:\s+)/
) {
shift @v; # the first one will be the separator
$hash{ $k } = join( '', @v );
}
}
It does the same thing, but does not have to do nearly as much backtracking to trim results. It also ignores escaped colons without much extra syntax, because the separator has to be a bare colon surrounded by whitespace. To unescape them in the key, you could simply add the following to the if block:
$k =~ s/(?<!\\)(\\\\)*\\:/$1:/g;
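The split-based idea above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the helper name parse_pair is hypothetical, the separator is taken to be a colon preceded by whitespace, and "\:" in a key is assumed to stand for a literal colon.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer above: split one line into
# (key, value) on a bare colon that has whitespace in front of it.
sub parse_pair {
    my ($line) = @_;
    $line =~ s/\A\s+|\s+\z//g;    # trim both ends first
    # LIMIT 2 keeps any later " : " inside the value and keeps an empty value
    my ( $key, $value ) = split /\s+:(?:\s+|\z)/, $line, 2;
    return unless defined $value;    # no separator on this line
    $key =~ s/\\:/:/g;               # assumption: "\:" means a literal colon
    return ( $key, $value );
}

for my $line ( 'Z.15 example : No', '.Cont.asasd :', 'a\:b : c' ) {
    my ( $k, $v ) = parse_pair($line) or next;
    print "[$k] [$v]\n";
}
```

Keys with embedded spaces such as "Z.15 example" come through intact because the key is simply everything before the separator, rather than something a character class has to describe.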

I don't understand why the Point->IP line is omitted from your example output, but something like the code below should suit you.
use strict;
use warnings;
while (<DATA>) {
next unless /([^\s*].+?)\s*:\s*(.*?)\s*$/;
printf qq("%s" "%s"\n), $1, $2;
}
__DATA__
* field : 100
blahbla : <Set>
scree : <what>
.Cont.asasd :
Othreaol : Value, Other value
Point->IP : 0.0.0.0 Port 5060
Z.15 example : No
H.25 miss here : No
output
"field" "100"
"blahbla" "<Set>"
"scree" "<what>"
".Cont.asasd" ""
"Othreaol" "Value, Other value"
"Point->IP" "0.0.0.0 Port 5060"
"Z.15 example" "No"
"H.25 miss here" "No"

Related

How can I find sentences nested deeper than one bracket '()' set?

I want to print sentences from text file placed in () brackets deeper than one pair of brackets.
For example for this text file :
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
the output should be :
print me
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
hhhhh
This is what I've done so far:
#!/usr/bin/perl -w
open(FILE, "<", $ARGV[0]) or die "file open error";
if ( @ARGV ) #if there are args
{
if ( -f $ARGV[0] ) #if its regular file
{
while(<FILE>)
{
my @array = split('\)',$_);
foreach(@array)
{
if ($_ =~ /.*\((.*)/)
{
print "$1\n";
}
}
}
close(FILE);
}
else{
print "Arg is not a file\n";}
}
else{
print "no args\n";}
My code can't separate the sentences placed in deeper brackets.
Assuming brackets are balanced:
use strict;
use warnings;
my @a;
while (<DATA>) {
while (/\(([^()]*(?:\(((?1))\)[^()]*(?{push @a, $2}))*+)\)/g){}
}
print join "\n", @a;
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bb(xxxx)b))aa)
blabla (blabla(hhhhh))
It returns:
print me
cccc
xxxx
bbbb(cccc)bb(xxxx)b
eeee(bbbb(cccc)bb(xxxx)b)
hhhhh
The idea is to store the capture group 2 content after each recursion, using the (?{...}) construct to execute code in the pattern.
Note that the order of results isn't ideal since the innermost content appears first. Unfortunately, I didn't find a way to change the order of results.
Pattern details:
\( # opening bracket level 1
( # open capture group 1
[^()]* # all that is not a bracket
(?:
\( # opening bracket for level 2 (or more when a recursion occurs)
( # capture group 2: to store the result
(?1) # recursion
)
\) # closing bracket for level 2 (or more ...)
[^()]* #
(?{push @a, $2}) # store the capture group 2 content in @a
)*+ # repeat when needed
)
\) # closing bracket level 1
EDIT: This pattern assumes that brackets are balanced, but if it isn't the case, this may cause problems of unwanted results for certain strings. The reason is that results are stored before the whole pattern succeeds.
Example with the string 1234 ( 5678 (abcd(efgh)ijkl) where a closing bracket is missing:
1234 ( 5678 (abcd(efgh)ijkl)
# ^ ^---- second attempt succeeds, "efgh" is stored
# '---- first attempt fails, but "efgh", "abcd(efgh)ijkl" are stored
To solve the problem, you can choose between two default behaviours:
the strict behaviour that only accepts balanced brackets. All you need is to store the results in a temporary array and to reset this array in the while loop or when a closing bracket is missing. In this case the result will only be "efgh":
my @a;
my @b;
while (<DATA>) {
while (/\(([^()]*(?:\(((?1))\)[^()]*(?{push @b, $2}))*+)(?:\)|(?{undef @b})(*F))/g) {
push @a, @b;
undef @b;
}
}
a more tolerant behaviour that doesn't require the closing bracket. To do that, replace each \) with (?:\)|$). In this case, the first attempt succeeds and consumes characters until the end of the string (in other words, there isn't a second attempt). The results are "efgh" and "abcd(efgh)ijkl".
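For comparison, here is a minimal sketch of a different technique: a plain character scan with a stack of open-bracket positions, not the recursive regex above. It reports the content of every pair nested at depth 2 or deeper, ordered by where the opening bracket appears (outermost first), and simply ignores brackets that are never closed. The function name nested_contents is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: contents of every bracket pair at nesting depth >= 2,
# ordered by the position of the opening bracket.
sub nested_contents {
    my ($line) = @_;
    my @open;     # stack: position just after each currently open '('
    my @found;    # [ start position, content ] for each closed nested pair
    for my $i ( 0 .. length($line) - 1 ) {
        my $ch = substr $line, $i, 1;
        if ( $ch eq '(' ) {
            push @open, $i + 1;
        }
        elsif ( $ch eq ')' && @open ) {
            my $depth = @open;       # depth of the pair being closed
            my $start = pop @open;
            push @found, [ $start, substr( $line, $start, $i - $start ) ]
                if $depth >= 2;
        }
    }
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @found;
}

print "$_\n" for nested_contents('blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)');
```

For the second line of the question's sample this prints eeee(bbbb(cccc)bbb), then bbbb(cccc)bbb, then cccc, which is the order the question asks for. On the unbalanced string 1234 ( 5678 (abcd(efgh)ijkl) it behaves like the tolerant variant and yields "abcd(efgh)ijkl" and "efgh".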
This is probably easiest, and the most maintainable with a two-pass solution.
The initial pass captures all first level parentheses. The second pass captures all enclosed parenthesis groups, only advancing a single character in order to match every level of embedded paren groups:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
my $data = do { local $/; <DATA> };
my $parens_content_re = qr{
\(
(
(?:
[^()]*+
|
\( (?1) \)
)*
)
\)
}x;
say for map {/(?=$parens_content_re)\(/g} map {/$parens_content_re/g} $data;
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
----(----(aaaa(123)bbbb(456)cccc)----)----
Outputs:
$ perl parens.pl
print me
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
hhhhh
aaaa(123)bbbb(456)cccc
123
456
This code works by capturing levels recursively, using a simple regex for ) and split-ing by ( for the opening paren. It first prepares by peeling off the two starting layers of nesting. It works for shown examples, and a few others. However, there are other ways to nest pairs, for which rules are not specified. Also, this is probably rough around the edges. There is no magic of any kind involved and adjusting code for new cases should be feasible.
use warnings;
use strict;
my ($lev, @el, @res, $rret);
while (my $str = <DATA>)
{
print "\nString: $str\n";
@res = ();
# Drop two layers to start: strip last two ), split by ( and drop 0,1
$str =~ s/ (.*) \) [^)]* \) [^)]* $/$1/x;
@el = split '\(', $str;
@el = @el[2..$#el];
# Edge case: may have one element and be done, but with extra )
if (@el > 1) { $lev = join '(', @el }
else { ($lev = $el[0]) =~ s|\)||g }
push @res, $lev;
# Get next level and join string back, recursively
while ( $rret = nest_one($lev) ) {
$lev = join '(', @$rret;
push @res, $lev;
last if @$rret == 1;
}
print "\t$_\n" for @res;
}
# Strip last ) and past it, split by ( and drop first element
sub nest_one {
(my $lev = $_[0]) =~ s/(.*) \) [^)]* $/$1/x;
my @el = split '\(', $lev;
shift @el;
return (@el) ? \@el : undef;
}
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
It prints
blabla(nothing(print me)) nanana (nanan)
print me
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
blabla (blabla(hhhhh))
hhhhh

Perl regex with a negative lookahead behaves unexpectedly

I'm attempting to match /ezmlm-(any word except 'weed' or 'return')\s+/ with a regex. The following demonstrates a foreach loop which does the right thing, and an attempted regex which almost does:
#!/usr/bin/perl
use strict;
use warnings;
my @tests = (
{ msg => "want 'yes', string has ezmlm, but not weed or return",
str => q[|/usr/local/bin/ezmlm-reject '<snip>'],
},
{ msg => "want 'yes', array has ezmlm, but not weed or return",
str => [ <DATA> ],
},
{ msg => "want 'no' , has ezmlm-weed",
str => q[|/usr/local/bin/ezmlm-weed '<snip>'],
},
{ msg => "want 'no' , doesn't have ezmlm-anything",
str => q[|/usr/local/bin/else '<snip>'],
},
{ msg => "want 'no' , ezmlm email pattern",
str => q[crazy/but/legal/ezmlm-wacky@example.org],
},
);
print "foreach regex\n";
foreach ( @tests ) {
print doit_fe( ref $_->{str} ? @{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t";
print doit_re( ref $_->{str} ? @{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t<--- $_->{msg}\n";
};
# for both of the following subs:
# @_ will contain one or more lines of data
# match the pattern /ezmlm-(any word except 'weed' or 'return')\s+/
sub doit_fe {
my $has_ezmlm = 0;
foreach ( @_ ) {
next if $_ !~ m/ezmlm-(.*?)\s/;
return 0 if $1 eq 'weed' or $1 eq 'return';
$has_ezmlm++;
};
return $has_ezmlm;
};
sub doit_re { return grep /ezmlm-(?!weed|return)/, @_; };
__DATA__
|/usr/local/bin/ezmlm-reject '<snip>'
|/usr/local/bin/ezmlm-issubn '<snip>'
|/usr/local/bin/ezmlm-send '<snip>'
|/usr/local/bin/ezmlm-archive '<snip>'
|/usr/local/bin/ezmlm-warn '<snip>'
The output of the sample program is as follows:
foreach regex
yes yes <--- want 'yes', string has ezmlm, but not weed or return
yes yes <--- want 'yes', array has ezmlm, but not weed or return
no no <--- want 'no' , has ezmlm-weed
no no <--- want 'no' , doesn't have ezmlm-anything
no yes <--- want 'no' , ezmlm email pattern
In the last instance, the regex fails, matching a goofy but legal email address. If I amend the regex placing a \s after the negative lookahead pattern like so:
grep /ezmlm-(?!weed|return)\s+/
The regex fails to match at all. I'm supposing it has to do with the how the negative pattern works. I've tried making the negation non-greedy, but it seems there's some lesson buried in 'perldoc perlre' that is escaping me. Is it possible to do this with a single regex?
The negative look-ahead is zero-width which means that the regex
/ezmlm-(?!weed|return)\s+/
will only match if one or more space characters immediately follow "ezmlm-".
The pattern
/ezmlm-(?!weed|return)/
will match
"crazy/but/legal/ezmlm-wacky@example.org"
because it contains "ezmlm-" not followed by "weed" or "return".
Try
/ezmlm-(?!weed|return)\S+\s+/
where \S+ is one or more non-space characters (or instead use [^@\s]+ if you want to deny email addresses even if followed by a space).
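A short sketch of the stricter variant, run against strings from the question. The helper name has_wanted_ezmlm is hypothetical, and it only checks whether any line matches; unlike doit_fe it does not return early when one line contains ezmlm-weed or ezmlm-return.

```perl
use strict;
use warnings;

# Assumption: the "word" after "ezmlm-" is a run of characters that are
# neither whitespace nor "@", so email-like strings never qualify.
my $ezmlm_re = qr/ezmlm-(?!weed|return)[^\@\s]+\s+/;

# Hypothetical helper: number of lines that match (0 means "no").
sub has_wanted_ezmlm {
    return scalar grep { $_ =~ $ezmlm_re } @_;
}

my @samples = (
    q[|/usr/local/bin/ezmlm-reject '<snip>'],
    q[|/usr/local/bin/ezmlm-weed '<snip>'],
    q[crazy/but/legal/ezmlm-wacky@example.org],
);
printf "%-3s %s\n", ( has_wanted_ezmlm($_) ? 'yes' : 'no' ), $_ for @samples;
```

Because the lookahead is not anchored to the end of the word, a name that merely starts with "weed" or "return" is rejected as well; if that matters, something like (?!(?:weed|return)\s) would restrict the exclusion to the exact words.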

How to remove strings which do not start or end with specific substring?

Unfortunately, I'm not a regex expert, so I need a little help.
I'm looking for the solution how to grep an array of strings to get two lists of strings which do not start (1) or end (2) with the specific substring.
Let's assume we have an array with strings matching to the following rule:
[speakerId]-[phrase]-[id].txt
i.e.
10-phraseone-10.txt 11-phraseone-3.txt 1-phraseone-2.txt
2-phraseone-1.txt 3-phraseone-1.txt 4-phraseone-1.txt
5-phraseone-3.txt 6-phraseone-2.txt 7-phraseone-2.txt
8-phraseone-10.txt 9-phraseone-2.txt 10-phrasetwo-1.txt
11-phrasetwo-1.txt 1-phrasetwo-1.txt 2-phrasetwo-1.txt
3-phrasetwo-1.txt 4-phrasetwo-1.txt 5-phrasetwo-1.txt
6-phrasetwo-3.txt 7-phrasetwo-10.txt 8-phrasetwo-1.txt
9-phrasetwo-1.txt 10-phrasethree-10.txt 11-phrasethree-3.txt
1-phrasethree-1.txt 2-phrasethree-11.txt 3-phrasethree-1.txt
4-phrasethree-3.txt 5-phrasethree-1.txt 6-phrasethree-3.txt
7-phrasethree-1.txt 8-phrasethree-1.txt 9-phrasethree-1.txt
Let's introduce variables:
$speakerId
$phrase
$id1, $id2
I would like to grep a list and obtain an array:
with elements which contain specific $phrase but we exclude those strings which simultaneously start with specific $speakerId AND end with one of specified id's (for instance $id1 or $id2)
with elements which have specific $speakerId and $phrase but do NOT contain one of specific ids at the end (warning: remember to not exclude the 10 or 11 for $id=1 , etc.)
Maybe someone could use the following code to write the solution:
@AllEntries = readdir(INPUTDIR);
@Result1 = grep(/blablablahere/, @AllEntries);
@Result2 = grep(/anotherblablabla/, @AllEntries);
closedir(INPUTDIR);
Assuming a basic pattern to match your example:
(?:^|\b)(\d+)-(\w+)-(\d+)\.txt(?:\b|$)
Which breaks down as:
(?:^|\b) # starts with a new line or a word delimiter
(\d+)- # speakerid and a hyphen
(\w+)- # phrase and a hyphen
(\d+) # id
\.txt # file extension
(?:\b|$) # end of line or word delimiter
You can assert exclusions using negative look-ahead. For instance, to include all matches that do not have the phrase phrasetwo you can modify the above expression to use a negative look-ahead:
(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)
Note how I include (?!phrasetwo). Alternatively, you can find all phrasethree entries that end in an even number by using a look-behind instead of a look-ahead:
(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)
(?<![13579]) just makes sure the last number of the ID falls on an even number.
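Applied to the two lists the question asks for, the look-ahead idea might look like the sketch below. The helper name filter_entries and its argument order are assumptions made for illustration. Anchoring the id alternation with \.txt\z inside the look-ahead is what keeps 10 or 11 from being excluded when the listed id is 1.

```perl
use strict;
use warnings;

# Hypothetical helper: returns two array refs, one per list in the question.
sub filter_entries {
    my ( $speaker, $phrase, $ids, @entries ) = @_;
    my $id_alt = join '|', map { quotemeta($_) } @$ids;

    # List 1: has $phrase, but is NOT ($speaker at the start AND a listed id at the end).
    my $list1_re = qr/\A(?!\Q$speaker\E-\Q$phrase\E-(?:$id_alt)\.txt\z)\d+-\Q$phrase\E-\d+\.txt\z/;

    # List 2: has $speaker and $phrase, but does not end in a listed id.
    my $list2_re = qr/\A\Q$speaker\E-\Q$phrase\E-(?!(?:$id_alt)\.txt\z)\d+\.txt\z/;

    return ( [ grep { /$list1_re/ } @entries ], [ grep { /$list2_re/ } @entries ] );
}

my @files = qw(2-phrasethree-1.txt 2-phrasethree-11.txt 1-phrasethree-1.txt 2-phrasetwo-1.txt);
my ( $list1, $list2 ) = filter_entries( 2, 'phrasethree', [ 1, 3 ], @files );
print "list 1: @$list1\n";
print "list 2: @$list2\n";
```

With speaker 2, phrase phrasethree and ids 1 and 3, the entry 2-phrasethree-11.txt stays in both lists, while 2-phrasethree-1.txt is dropped from both.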
It sounds a bit like you're describing a query function.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my ( $set_a, $set_b ) = query( 2, 'phrasethree', [ 1, 3 ] );
print Dumper( { a => $set_a, b => $set_b } );
# a) fetch elements which
# 1. match $phrase
# 2. exclude $speakerId
# 3. match @ids
# b) fetch elements which
# 1. match $phrase
# 2. match $speakerId
# 3. exclude @ids
sub query {
my ( $speakerId, $passPhrase, $id_ra ) = @_;
my %has_id = map { ( $_ => 0 ) } @{$id_ra};
my ( @a, @b );
while ( my $filename = glob '*.txt' ) {
if ( $filename =~ m{\A ( \d+ )-( .+? )-( \d+ ) [.] txt \z}xms ) {
my ( $_speakerId, $_passPhrase, $_id ) = ( $1, $2, $3 );
if ( $_passPhrase eq $passPhrase ) {
if ( $_speakerId ne $speakerId
&& exists $has_id{$_id} )
{
push @a, $filename;
}
if ( $_speakerId eq $speakerId
&& !exists $has_id{$_id} )
{
push @b, $filename;
}
}
}
}
return ( \@a, \@b );
}
I like the approach with pure regular expressions using negative lookaheads and lookbehinds. However, it's a little hard to read. Code like this may be more self-explanatory; it uses standard Perl idioms that in some cases read almost like English:
my @all_entries = readdir(...);
my @matching_entries = ();
foreach my $entry (@all_entries) {
# split file name
next unless $entry =~ /^(\d+)-(.*?)-(\d+)\.txt$/;
my ($sid, $phrase, $id) = ($1, $2, $3);
# filter
next unless $sid eq "foo";
next unless $id == 42 or $phrase eq "bar";
# more readable filter rules
# match
push @matching_entries, $entry;
}
# do something with @matching_entries
If you really want to express something that complex in a grep list transformation, you could write code like this:
my @matching_entries = grep {
/^(\d+)-(.*?)-(\d+)\.txt$/
and $1 eq "foo"
and ($3 == 42 or $2 eq "bar")
# and so on
} readdir(...);

Regular expressions to match protected separated values

I'd like to have a regular expression to match a separated values with some protected values that can contain the separator character.
For instance:
"A,B,{C,D,E},F"
would give:
"A"
"B"
"{C,D,E}"
"F"
Please note the protected values can be nested, as follows:
"A,B,{C,D,{E,F}},G"
would give:
"A"
"B"
"{C,D,{E,F}}"
"G"
I already coded that feature with a character iteration as follow:
sub Parse
{
my @item;
my $curly;
my $string;
foreach(split //)
{
$_ eq "{" and ++$curly;
$_ eq "}" and --$curly;
if(!$curly && /[,:]/)
{
push @item, $string;
undef $string;
next;
}
$string .= $_;
}
push @item, $string;
return @item;
}
But it would definitively be so much nicer with a regexp.
A regex that supports nesting would look as follows:
my @items;
push @items, $1 while
/
(?: ^ | \G , )
(
(?: [^,{}]+
| (
\{
(?: [^{}]
| (?2)
)*
\}
)
| # Empty
)
)
/xg;
$ perl -E'$_ = shift; ... say for @items;' 'A,B,{C,D,{E,F}},G'
A
B
{C,D,{E,F}}
G
Assumes valid input since it can't extract and validate at the same time. (Well, not without making things really messy.)
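If validation is wanted, one option is to do it as a separate step before extracting. The sketch below is an assumption about how that could be arranged, with hypothetical helper names is_balanced and parse_items: a recursive pattern first checks that the braces pair up, then the extraction regex from this answer is applied unchanged.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and possessive quantifiers need Perl 5.10+

# Hypothetical helper: true if every "{" has a matching "}" and vice versa.
sub is_balanced {
    my ($s) = @_;
    return $s =~ / \A (?: [^{}]++ | ( \{ (?: [^{}]++ | (?1) )* \} ) )* \z /x ? 1 : 0;
}

# Hypothetical helper: validate first, then extract with the regex shown above.
sub parse_items {
    my ($s) = @_;
    die "unbalanced braces in '$s'\n" unless is_balanced($s);
    my @items;
    push @items, $1 while $s =~ /
        (?: ^ | \G , )
        (
            (?: [^,{}]+
              | ( \{ (?: [^{}] | (?2) )* \} )
              |    # Empty
            )
        )
    /xg;
    return @items;
}

print "$_\n" for parse_items('A,B,{C,D,{E,F}},G');
```

Keeping the two steps apart means the extraction pattern stays as simple as it was, at the cost of scanning the string twice.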
Improved from nhahtdh's answer.
$_ = "A,B,{C,D,E},F";
while ( m/(\{.*?\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "[$&]\n";
}
Improved it again. Please look at this one!
$_ = "A,B,{C,D,{E,F}},G";
while ( m/(\{.*\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "$&\n";
}
It will get:
A
B
{C,D,{E,F}}
G
$a = "A,B,{C,D,E},F";
while ($a =~ s/(\{[\{\}\w,]+\}|\w)//) {
push (@res, $1);
}
print "\@res: @res\n"
Result:
@res: A B {C,D,E} F
Explanation : we try to match either the protected block \{[\{\}\w,]+\} or just a single character \w successively in a loop, deleting it from the original string if there is a match. Every time there is a match, we store it (meaning the $1) in the array, et voilà!
Here is a regex in bash:
chronos@localhost / $ echo "A,B,{C,D,E},F" | grep -oE "(\{[^\}]*\}|[A-Z])"
A
B
{C,D,E}
F
Try this regex. Use the regex to match and extract the token.
/(\{.*?\}|(?<=,|^).*?(?=,|$))/
I have not tested this code in Perl.
There is an assumption about on how the regex engine works here (I assume that it will try to match the first part \{.*?\} before the second part). I also assume that there are no nested curly bracket, and badly paired curly brackets.
$s = "A,B,{C,D,E},F";
@t = split /,(?=.*{)|,(?!.*})/, $s;

Finding results from and between groups of parentheses with regexp

Text format:
(Superships)
Eirik Raude - olajkutató fúrósziget
(Eirik Raude - Oil Patch Explorer)
I need a regex to match the text between the first set of parentheses. Result: text1.
I need a regex to match the text between the first and second sets of parentheses. Result: text2.
I need a regex to match the text inside the second set of parentheses. Result: text3.
text1: Superships, the English title,
text2: Eirik Raude - olajkutató fúrósziget, the Hungarian subtitle,
text3: Eirik Raude - Oil Patch Explorer, the English subtitle.
I need regex for perl script to match this title and subtitle. Example script:
($anchor) = $tree->look_down(_tag=>"h1", class=>"blackbigtitle");
if ($anchor) {
$elem = $anchor;
my ($engtitle, $engsubtitle, $hunsubtitle, @tmp);
while (($elem = $elem->right()) &&
((ref $elem) && ($elem->tag() ne "table"))) {
@tmp = get_all_text($elem);
push @lines, @tmp;
$line = join(' ', @tmp);
if (($engtitle) = $line =~ m/**regex need that return text1**/) {
push @{$prog->{q(title)}}, [$engtitle, 'en'];
t "english-title added: $engtitle";
}
elsif (($engsubtitle) = $line =~ m/**regex need that return text3**/) {
push @{$prog->{q(sub-title)}}, [$engsubtitle, 'en'];
t "english_subtitle added: $engsubtitle";
}
elsif (($hunsubtitle) = $line =~ m/**regex need that return text2**/) {
push @{$prog->{q(hun-subtitle)}}, [$hunsubtitle, 'hu'];
t "hungarian_subtitle added: $hunsubtitle";
}
}
}
Considering your comment, you can do something like :
if (($english_title) = $line =~ m/^\(([^)]+)\)$/) {
$found_english_title = 1;
# do stuff
} elsif (($english_subtitle) = $line =~ m/^([^()]+)$/) {
# do stuff
} elsif ($found_english_title && (($hungarian_title) = $line =~ m/^\(([^)]+)\)$/)) {
# do stuff
}
If you need to match them all in one expression:
\(([^)]+)\)([^(]+)\(([^)]+)\)
This matches (, then anything that's not ), then ), then anything that's not (, then, (, ... I think you get the picture.
First group will be text1, second group will be text2, third group will be text3.
You can also just make a more generic regex that matches something like "(text1)", "(text1)text2(text3)" or "text1(text2)" when applied several times:
(?:^|[()])([^()]+)(?:[()]|$)
This matches the beginning of the string or ( or ), then one or more characters that are not ( or ), then ( or ) or the end of the string. ?: marks a non-capturing group, so the first capture group will hold the string. Something more complex is necessary to pair each ( with a ) every time; as written, it can also match "(text1(".
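As a rough illustration of the single-expression approach, the sketch below applies it to the three lines joined with spaces, as the question's script does. The function name split_titles is hypothetical, and the added \s* and the lazy quantifier are a small assumption on top of the pattern above so that the middle part comes back without surrounding whitespace.

```perl
use strict;
use warnings;
use utf8;                                  # the sample text contains non-ASCII letters
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns (text1, text2, text3), or an empty list
# if the input does not have the "(a) b (c)" shape.
sub split_titles {
    my ($text) = @_;
    return $text =~ /\(([^)]+)\)\s*([^(]+?)\s*\(([^)]+)\)/;
}

my ( $eng_title, $hun_subtitle, $eng_subtitle ) = split_titles(
    '(Superships) Eirik Raude - olajkutató fúrósziget (Eirik Raude - Oil Patch Explorer)'
);
print "text1: $eng_title\n";
print "text2: $hun_subtitle\n";
print "text3: $eng_subtitle\n";
```

Matching all three parts at once avoids having to track which parenthesised line has already been seen, but it only works when the whole block is available as one string.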