Regex newbie question on start and end of captures

I need some help with regular expressions. Please see the example below. I am capturing specific rid values that are contained between this
","children":[
and ending with this
}]}]}
as shown below.
My problem is that the block shown below repeats itself several times and I want all rids between the start of ","children":[ to }]}]} per block only.
I know I can capture individual rid value with: rid":"([\w\d\-\."]+)
But I don't know how to capture every rid":"([\w\d\-\."]+) that exists between the start of ","children":[ and }]}]}
Example:
","children":[{"type":"stub","context":"","rid":"b1c4922237ce.ee6a3644443fe.10711226e93.d0af7aadbd0-4be3-4353ddd.8b47.f2f4aaf2474f","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.290c6e93.91c15f91-a1c-4c36.9939.4ab7b94a39ad","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.27c3ee93.22e90c22-7406-463a.8bff.f6ea88f6ffcc","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6a182e93.5c0e7d5c-ff65-451d.afc0.cfc7fbcfc02d","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6970ae93.8ea3978e-112b-4bbb.8405.d17071d105d2","metaclass":"ASAPModel.BarrierCategory"}]}]},
","children":[{"type":"stub","context":"","rid":"b1c4922237ce.ee6a3644443fe.10711226e93.d0af7aadbd0-4be3-4353ddd.8b47.f2f4aaf2474f","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.290c6e93.91c15f91-a1c-4c36.9939.4ab7b94a39ad","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.27c3ee93.22e90c22-7406-463a.8bff.f6ea88f6ffcc","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6a182e93.5c0e7d5c-ff65-451d.afc0.cfc7fbcfc02d","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6970ae93.8ea3978e-112b-4bbb.8405.d17071d105d2","metaclass":"ASAPModel.BarrierCategory"}]}]},
My problem is that I don't understand how to specify where the non-capturing group should begin and end, or how to say "one or more of these capture groups", sort of like []+

This looks like JSON (though your example data is too incomplete to be valid).
If so, then perhaps the JSON module from CPAN might be the best way forward:
use strict;
use warnings;
use feature 'say';
use JSON qw( from_json );
# my example data
my $data = q( [
{"children":[ {"type":"stub","rid":"aa"}, {"type":"stub2","rid":"bb"} ] },
{"children":[ {"type":"stub","rid":"cc"}, {"type":"stub2","rid":"dd"} ] } ]
);
my $json = from_json( $data );
for my $rec ( @$json ) {
for my $child ( @{ $rec->{children} } ) {
say "rid: ", $child->{rid};
}
}
This prints:
rid: aa
rid: bb
rid: cc
rid: dd

You need to break this up into two steps:
Get the length of data
Get the rids
# Make sure you get the first one
my ( $child ) = $record =~ m/"children":\[([^\]]+)\]/g;
# Get all in span - the g operator tells the regex to get all ( 'global' )
my @rids = $child =~ m/"rid":"([^"]+)"/g; # <-- g operator
But it looks like JSON to me, and you could parse data like this with JSON::Syck
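A minimal sketch of the two-step approach, applied to every block rather than only the first one. The sample record and the name rids_per_block are made up for illustration, and it assumes a children array never contains a nested ] character:

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the question's data (rids shortened).
my $record = join '',
    '{"x":"1","children":[{"type":"stub","rid":"aa.1"},{"type":"stub","rid":"bb.2"}]},',
    '{"x":"2","children":[{"type":"stub","rid":"cc.3"}]}';

sub rids_per_block {
    my ($text) = @_;
    my @blocks;
    # Step 1: one iteration per "children":[ ... ] span.
    while ( $text =~ m/"children":\[([^\]]+)\]/g ) {
        my $span = $1;
        # Step 2: collect every rid found inside that span only.
        my @rids = $span =~ m/"rid":"([^"]+)"/g;
        push @blocks, \@rids;
    }
    return @blocks;
}

# One output line per block, rids separated by commas.
print join( ',', @$_ ), "\n" for rids_per_block($record);
```

Each array reference returned holds the rids of one block, so the grouping per block is preserved.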

Something like \",\"children\":(.*)(?=\\]\\}\\]\\})
Play around with it.
The forum is absorbing some of my backslashes; a word of warning to anyone else to double them up.
In response to edits:
Try breaking up the data into its bracketed groups first, then doing one search for each in a for loop. You can get all the groups at once using regex groups.


php regexp to search replace string functions to mb string functions

The solution was to look into lookaheads and lookbehinds: the concept of lookarounds in regex solved my issue, since the replacements were eating into each other when I ran them.
We've been working for a while on transitioning some of our older projects (and perhaps bad or old coding habits) to make them PHP 7-ready.
In this process I have made some adjustments in the .php files of the project so that for example
The problem at hand is that I'm facing some issues with Danish characters in PHP string functions (strlen, substr etc.) and would like them to use the mb_string functions instead. From what I can read on the internet, using the "overload" feature is not the way to go, so I've decided to do a file-based search and replace.
My search-and-replace function looks like this right now (updated thanks to @SeanBright):
$testfile = file_get_contents($file);
$array = array ( 'strlen'=>'mb_strlen',
'strpos'=>'mb_strpos',
'substr'=>'mb_substr',
'strtolower'=>'mb_strtolower',
'strtoupper'=>'mb_strtoupper',
'substr_count'=>'mb_substr_count',
'split'=>'mb_split',
'mail'=>'mb_send_mail',
'ereg'=>'mb_ereg',
'eregi'=>'mb_eregi',
'strrchr' => 'mb_strrchr',
'strichr' => 'mb_strichr',
'strchr' => 'mb_strchr',
'strrpos' => 'mb_strrpos',
'strripos' => 'mb_strripos',
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr'
);
foreach($array as $function_name => $mb_function_name){
$search_string = '/(^|[\s\[{;(:!\=\><?.,\*\/\-\+])(?<!->)(?<!new )' . $function_name . '(?=\s?\()/i';
$testfile = preg_replace($search_string, '$1'.$mb_function_name, $testfile,-1,$count);
}
print "<pre>";
print $testfile;
The $file has this content:
<?php
print strtoupper('test');
print strtolower'test');
print substr('tester',0,1);
print astrtoupper('test');
print bstrtolower('test');
print csubstr(('tester',0,1);
print [substr('tester',0,1)];
print {substr('tester',0,1)};
substr('test',0,1);
substr('test',0,1);
(substr('test',0,1));
!substr();
if(substr()==substr()=>substr()<substr()){
?substr('test');
}
"test".substr('test');
'asd'.substr('asd');
'asd'.substr('asd');
substr( substr('asdsadsadasd',0,-1),strlen("1"),strlen("100"));
substr (substr ('Asdsadsadasd',0,-1), strlen("1"), strlen("100"));
substr(substr(substr('Asdsadsadasd',0,-1),0,-1), strlen("1"), strlen("100"));
mailafsendelse(substr('asdsadsadasd',0,-1), strlen("1"), strlen("100"));
mail(test);
substr ( tester );
substr ( tester );
mail mail mail mail ( tester );
$mail->mail ();
$mail -> mail ();
new Mail();
new mail ();
strlen ( tester )*strlen ( tester )+strlen ( tester )/strlen ( tester )-strlen ( tester )
;
The point here is that the actual PHP code does not have to be valid syntax; I just wanted it to work in different scenarios.
My regex problem is that I cannot find out why this line:
substr(substr(substr('Asdsadsadasd',0,-1),0,-1), strlen("1"), strlen("100"));
is not working. The 1st and 3rd substr are replaced correctly, but the 2nd is not, and the result looks like this:
mb_substr(substr(mb_substr('Asdsadsadasd',0,-1),0,-1), mb_strlen("1"), mb_strlen("100"));
As a note, my search string is made to work with all sorts of characters in front of the function name and requires that the character after the function name is a "(".
In a perfect world I would also like to exclude string functions that are methods in classes, for example $order->mail(), which would send an email. I would like this NOT to be converted to $order->mb_send_mail().
From my understanding all parameters are the same, so it should not be a problem.
Complete script can be found here
https://github.com/welrachid/phpStringToMBString
The problem is that some of the characters you are using to delimit your function call checks are being consumed by matching. If you switch the last group to be a positive lookahead, this will fix the problem:
$search_string = '/([ \[{\n\t\r;(:!=><?\.,])'.($function_name).'([\ |\t]{0,1})(?=[(]{1})/i';
(The ?= at the start of the last group is the addition.)
Your current expression also won't match function calls at the beginning of the line. The following handles that and also simplifies things a bit:
$search_string = '/(^|[\s\[{;(:!=><?.,])' . $function_name . '(?=\s?\()/i';
I've set up an example on regex101.com.
You might even be able to get away with:
$search_string = '/(^|\W)' . $function_name . '(?=\s?\()/i';
Where \W will match a non-word character.
Update
To prevent matching method calls, you can add a negative lookbehind to your pattern:
$search_string = '/(^|[\s\[{;(:!=><?.,])(?<!->)' . $function_name . '(?=\s?\()/i';
(The (?<!->) right before the function name is the addition.)
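To see the difference between consuming the parenthesis and only asserting it, here is a small sketch. It is written in Perl rather than PHP purely for illustration (the lookaround syntax used is the same in PCRE), and the sub names and the shortened function map are hypothetical:

```perl
use strict;
use warnings;

# Shortened, hypothetical version of the question's function map.
my %mb = ( substr => 'mb_substr', strlen => 'mb_strlen', mail => 'mb_send_mail' );

# Variant 1: the "(" after the name is part of the match, so it is consumed
# and cannot act as the leading delimiter of the next nested call.
sub convert_consuming {
    my ($src) = @_;
    for my $name ( sort keys %mb ) {
        $src =~ s/(^|[\s\[{;(:!=><?.,])\Q$name\E(\s?\()/$1$mb{$name}$2/gi;
    }
    return $src;
}

# Variant 2: the "(" is only asserted with a lookahead, and a negative
# lookbehind rejects names that directly follow "->" (method calls).
sub convert_lookahead {
    my ($src) = @_;
    for my $name ( sort keys %mb ) {
        $src =~ s/(^|[\s\[{;(:!=><?.,])(?<!->)\Q$name\E(?=\s?\()/$1$mb{$name}/gi;
    }
    return $src;
}

my $line = q{substr(substr(substr('Asdsadsadasd',0,-1),0,-1), strlen("1"), strlen("100"));};
print convert_consuming($line), "\n";    # the 2nd substr is skipped
print convert_lookahead($line), "\n";    # all three substr calls are converted
print convert_lookahead('$order->mail();'), "\n";    # left unchanged
```

Note that the lookbehind only covers "->" written directly before the name; a call spelled with a space, such as $mail -> mail (), would still be rewritten by this sketch.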

perl string catenation and substitution in a single line?

I need to modify a perl variable containing a file path; it needs to begin and end with a forward slash (/) and have all instances of multiple forward slashes reduced to a single slash.
(This is because an existing process does not enforce a consistent configuration syntax, so there are hundreds of config files scattered everywhere that may or may not have slashes in the right places in file names and path names.)
Something like this:
foreach ( ($config->{'backup_path'},
$config->{'work_path'},
$config->{'output_path'}
) ) {
$_ = "/" . $_ . "/";
$_ =~ s/\/{2,}/\//g;
}
but this does not look optimal or particularly readable to me; I'd rather have a more elegant expression (if it ends up using an unusual regex I'll use a comment to make it clearer.)
Input & output examples
home/datamonster//c2counts becomes /home/datamonster/c2counts/
home/////teledyne/tmp/ becomes /home/teledyne/tmp/
and /var/backup/DOC/all_instruments/ will pass through unchanged
Well, just rewriting what you got:
my @vars = qw ( backup_path work_path output_path );
for ( @{$config}{@vars} ) {
s,^/*,/,; #prefix
s,/*$,/,; #suffix
s,/+,/,g; #double slashes anywhere else.
}
I'd be cautious - optimising for a magic regex is not an advantage in every situation, because such patterns quickly become unreadable.
The above uses the hash slice mechanism to select values out of a hash (reference in this case), and the fact that s/// implicitly operates on $_ anyway. And modifies the original var when it does.
But it's also useful to know, if you're operating on patterns containing / it's helpful to switch delimiters, because that way you don't get the "leaning toothpicks" effect.
s/\/{2,}/\//g can be written as:
s,/+,/,g
or
s|/{2,}|/|g
if you want to keep the numeric quantifier. + means one or more, which gives the same result here because a double slash still collapses into a single one; technically it also matches a lone / (and replaces it with /), which the original pattern does not. You wouldn't want to use , as the delimiter if your pattern itself contains a comma, for the same reason.
However I think this does the trick;
s,(?:^/*|\b\/*$|/+),/,g for @{$config}{qw ( backup_path work_path output_path )};
This matches an alternation grouping, replacing either:
start of line, zero or more /
word boundary, zero or more / end of line
one or more slashes anywhere else.
with a single /.
It uses the hash slice mechanism as above, but without the intermediate @vars.
(For some reason the second grouping doesn't work correctly without the word boundary \b zero width anchor - I think this is a backtracking issue, but I'm not entirely sure)
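A likely explanation, rather than backtracking: with the g flag, once the trailing slashes have been matched and replaced, the branch /*$ can match a second time as an empty string at the very end, which appends another slash. The \b prevents that, because there is no word boundary between a final / and the end of the string. A small sketch to check this (the sub names are hypothetical):

```perl
use strict;
use warnings;

# Same alternation as above, with the \b before the end-of-string branch.
sub normalise_with_b {
    my ($path) = @_;
    $path =~ s,(?:^/*|\b/*$|/+),/,g;
    return $path;
}

# Identical, except that the \b has been dropped.
sub normalise_without_b {
    my ($path) = @_;
    $path =~ s,(?:^/*|/*$|/+),/,g;
    return $path;
}

for my $path ( 'home/datamonster//c2counts', 'home/////teledyne/tmp/' ) {
    printf "%-28s with \\b: %-30s without \\b: %s\n",
        $path, normalise_with_b($path), normalise_without_b($path);
}
```

For the input that already ends in a slash, the version without \b yields /home/teledyne/tmp// while the version with \b yields /home/teledyne/tmp/. Paths ending in a non-word character other than / are not covered by this sketch.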
For bonus points - you could probably select @vars using grep if your source data structure is appropriate:
my @vars = grep { /_path$/ } keys %$config;
#etc. Or inline with:
s,(?:^/*|\b\/*$|/+),/,g for @{$config}{grep { /_path$/ } keys %$config };
Edit: Or as Borodin notes:
s,(?:/|\A|\b\z)/*,/,g
Giving us:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $config = {
backup_path => "/fish/",
work_path => "narf//zoit",
output_path => "/wibble",
test_path => 'home/datamonster//c2counts',
another_path => "/home/teledyne/tmp/",
again_path => 'home/////teledyne/tmp/',
this_path => '/var/backup/DOC/all_instruments/',
};
s,(?:/|\A|\b\z)/*,/,g for @{$config}{grep { /_path$/ } keys %$config };
print Dumper $config;
Results:
$VAR1 = {
'output_path' => '/wibble/',
'this_path' => '/var/backup/DOC/all_instruments/',
'backup_path' => '/fish/',
'work_path' => '/narf/zoit/',
'test_path' => '/home/datamonster/c2counts/',
'another_path' => '/home/teledyne/tmp/',
'again_path' => '/home/teledyne/tmp/'
};
you could do it like this, but I wouldn't call it more readable:
foreach ( ($config->{'backup_path'},
$config->{'work_path'},
$config->{'output_path'}
) ) {
( $_ = "/$_/" ) =~ s/\/{2,}/\//g;
}
This question already got many fantastic answers.
From the view of non-perl-expert (me), some are hard to read / understand. ;)
So, I would probably use this:
my @vars = qw ( backup_path work_path output_path );
for my $var (@vars) {
my $value = '/' . $config->{$var} . '/';
$value =~ s|//+|/|g;
$config->{$var} = $value;
}
For me, this will still be readable after a year. :)

Why isn't my regex matching my input data?

I have a column of values (strings) that look like this:
arg123ala
arg345ala_r
thr567por thr789pro
pro1ala,thr2leu
I am trying to identify those values where the following pattern is met only once and no extra text is present:
three letters-some numbers-three letters
In the previous example, this would match the first value, but not the other three, because they have extra bits of text or there are two instances of the pattern separated by blank spaces or commas.
I tried using something like this in Perl:
if ( $value =~ /^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$) {
$qualifier = "ok";
}
else {
$qualifier = "needs cleaning";
}
And actually checked the regular expression in regexplanet.com, where it worked beautifully. However, when I used it in my code it wasn't matching any of the values I listed above, missing even the first one. Any idea why this could be happening? Any advice on an alternative for this?
It works fine. Here it is fixed (you didn't terminate your regex) and incorporated into a working program
use strict;
use warnings;
use v5.10;
while ( my $value = <DATA> ) {
my $qualifier;
if ( $value =~ /^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$/ ) {
$qualifier = "ok";
}
else {
$qualifier = "needs cleaning";
}
say $qualifier;
}
__DATA__
arg123ala
arg345ala_r
thr567por thr789pro
pro1ala,thr2leu
output
ok
needs cleaning
needs cleaning
needs cleaning
It looks like the original poster forgot the final / in the regexp.
I would use an expression like this: /^[a-z]{3}\d+[a-z]{3}$/
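The two patterns in this thread are not quite interchangeable. A short sketch comparing them, where the sub name qualifier and the extra test values are made up for illustration:

```perl
use strict;
use warnings;

my $posix = qr/^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$/;    # the question's pattern, terminated
my $ascii = qr/^[a-z]{3}\d+[a-z]{3}$/;                    # the pattern suggested above

sub qualifier {
    my ( $value, $re ) = @_;
    return $value =~ $re ? 'ok' : 'needs cleaning';
}

for my $value ( 'arg123ala', 'ARG123ALA', 'arg1234567890ala', 'arg345ala_r' ) {
    printf "%-18s alpha: %-15s a-z: %s\n",
        $value, qualifier( $value, $posix ), qualifier( $value, $ascii );
}
```

[[:alpha:]] accepts upper-case letters while [a-z] does not, and \d{1,9} rejects a run of ten digits that \d+ accepts. Which behaviour is wanted depends on the data.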

Use regular expressions to find host name

I have a little problem: I need to print the host name that sits between "(?#" and ")", for example:
Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu
And I need to print "researchscan425.eecs.umich.edu".
I tried something like:
if(my ($test) = $linelist =~ /\b\(\?\#(\S*)/)
{
print "$test\n";
}
But it doesn't print anything.
You can use this regex:
\(\?#(.*?)\)
researchscan425.eecs.umich.edu will be captured into Group 1.
See demo
Sample code:
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?#(.*?)\)/)
{
print "$test\n";
}
How about:
if(my ($test) = $linelist =~ /\(\?\#([^\s)]+)/)
You need to remove the \b before the (. There is no word boundary between the space (a non-word character) and the ( (also a non-word character).
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?\#([^)]*)/)
{
print "$test\n";
}
The problem here is the definition of \b.
It's "word boundary" - on regex101 that means:
(^\w|\w$|\W\w|\w\W)
Now, why this is causing you problems - ( is not a word character. So the transition from space to bracket doesn't trigger this pattern.
Switch your pattern to:
\s\(\?\#(\S+)
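And it'll work, although \S+ also captures the closing ) that follows the host name, so you may prefer [^\s)]+ inside the group. (Note - I've changed * to + because you probably want one or more, not zero or more).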
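A quick way to compare the patterns from this thread against the sample log line. This is only a sketch; the helper name first_capture is hypothetical:

```perl
use strict;
use warnings;

my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';

# Return the first capture group, or undef when the pattern does not match.
sub first_capture {
    my ( $text, $re ) = @_;
    my ($capture) = $text =~ $re;
    return $capture;
}

my @patterns = (
    [ 'with \b'    => qr/\b\(\?\#(\S*)/ ],       # the question's attempt
    [ '\s and \S+' => qr/\s\(\?\#(\S+)/ ],       # space instead of \b
    [ '[^\s)]+'    => qr/\(\?\#([^\s)]+)/ ],     # stops before the ")"
);

for my $p (@patterns) {
    my $got = first_capture( $linelist, $p->[1] );
    printf "%-12s %s\n", $p->[0], defined $got ? $got : '(no match)';
}
```

The first pattern finds nothing, the second returns the host name with the trailing ) attached, and the third returns the bare host name.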
It's amazing what you can do with logging tools or with perl as part of the logging service itself (c.f. Ubic), but even if you're just writing a "quick script" to parse logs for reporting (i.e. something you or someone else won't look at again for months or years) it helps to make them easy to maintain.
One approach to doing this is to process the lines of your log file with Regexp::Common. One advantage is that Regexp::Common matches practically "self-document" what you are doing. For example, to match on specific "RFC compliant" definitions of what constitutes a "domain", using the $linelist you posted:
use feature 'say';
use Regexp::Common qw /net/;
if ( $linelist =~ /\?\#$RE{net}{domain}{-keep}/ ) { say $1 }
Then, later, if you need you can add other matches e.g "numeric" IPv4 or IPv6 addresses, assign them for use later in the script, etc. (Perl6::Form and IO::All used for demonstration purposes only - try them out!):
use IO::All ;
use Regexp::Common qw/net/;
use Perl6::Form;
my $purelog = io 'logfile.lines.txt' ;
sub _get_ftphost_names {
my @hosts = () ;
while ($_ = $purelog->getline) {
/\(\?\#$RE{net}{IPv6}{-sep => ":" }{-keep}/ ||
/\(\?\#$RE{net}{IPv4}{-keep}/ ||
/\(\?\#$RE{net}{domain}{-keep}/ and push @hosts , $1 ;
}
return \@hosts ;
}
sub _get_bytes_transfered {
... ;
}
my @host_list = _get_ftphost_names ;
print form
"{[[[[[[[[[[(30+)[[[[[[[[[[[[[}", @host_list ;
One of the great things about Regexp::Common (besides stealing regexp ideas from the source) is that it also makes it fairly easy to roll your own matches. You can use those to capture other parts of the file in an easily understandable way, adding them piece by piece. Then, as what was supposed to be your four line script grows and transforms itself into an ITIL compliant corporate reporting tool, you and your career can advance apace :-)

How can I parse quoted CSV in Perl with a regex?

I'm having some issues with parsing CSV data with quotes. My main problem is with quotes within a field. In the following example lines 1 - 4 work correctly but 5,6 and 7 don't.
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSVs are more complicated than they look, I'm using a recipe from the Perl Cookbook.
sub parse_csv {
my $text = shift; # record containing CSVs
my @columns = ();
push(@columns ,$+) while $text =~ m{
# The first part groups the phrase inside quotes
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@columns ,undef) if substr($text, -1,1) eq ',';
return @columns ; # list of vars that was comma separated.
}
Does anyone have a suggestion for improving the regex to handle the above cases?
Please, Try Using CPAN
There's no reason you couldn't download a copy of Text::CSV, or any other non-XS based implementation of a CSV parser, and install it in your local directory, or in a lib/ subdirectory of your project so it's installed along with your project's rollout.
If you can't store text files in your project, then I'm wondering how it is you are coding your project.
http://novosial.org/perl/life-with-cpan/non-root/
Should be a good guide on how to get these into a working state locally.
Not using CPAN is really a recipe for disaster.
Please consider this before trying to write your own CSV implementation.
Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.
note: I learnt this the hard way. Took me a full day to get a working CSV parser in PHP before I discovered an inbuilt one had been added in a later version. It really is something awful.
You can parse CSV using Text::ParseWords which ships with Perl.
use feature 'say';
use Text::ParseWords;
while (<DATA>) {
chomp;
my @f = quotewords ',', 0, $_;
say join ":" => @f;
}
__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
which parses your CSV correctly....
# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:
The only issue I've had with Text::ParseWords is when nested quotes in data aren't escaped correctly. However this is badly built CSV data and would cause problems with most CSV parsers ;-)
So you may notice that
# S,"YBELT,"V"",000323030,
came out as (ie. quotes dropped around "V")
# S:YBELT,V:000323030:
however, if it's escaped like so
# S,"YBELT,\"V\"",000323030,
then quotes will be retained
# S:YBELT,"V":000323030:
tested; working:-
$_.=','; # fake an ending delimiter
while($_=~/"((?:""|[^"])*)",|([^,]*),/g) {
$cell=defined($1) ? $1:$2; $cell=~s/""/"/g;
print "$cell\n";
}
# The regexp strategy is as follows:
# First - we attempt a match on any quoted part starting the CSV line:-
# "((?:""|[^"])*)",
# It must start with a quote and end with a quote followed by a comma, and is allowed to contain either doubled quotes - "" - or anything except a quote character [^"] - this goes into $1
# If we can't match that, we accept anything up to the next comma instead, & put it into $2
# Lastly, we convert "" to " and print out the cell.
be warned that CSV files can contain cells with embedded newlines inside the quotes, so you'll need to do this if reading the data in line-at-a-time:
if("$pre$_"=~/,"[^,]*\z/) {
$pre.=$_; next;
}
$_="$pre$_";
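The single-line part of this answer can be wrapped in a subroutine and tried on the question's data. A minimal sketch, assuming embedded quotes are written doubled (""), as the regex expects; the name split_csv_line is hypothetical:

```perl
use strict;
use warnings;

sub split_csv_line {
    my ($line) = @_;
    $line .= ',';    # fake an ending delimiter, as in the answer
    my @cells;
    # Try a quoted cell first, then fall back to "anything up to the next comma".
    while ( $line =~ /"((?:""|[^"])*)",|([^,]*),/g ) {
        my $cell = defined($1) ? $1 : $2;
        $cell =~ s/""/"/g;    # un-double embedded quotes
        push @cells, $cell;
    }
    return @cells;
}

print join( '|', split_csv_line('S,"BELT,FAN",003541547,') ),        "\n";
print join( '|', split_csv_line('S,"YBELT,""V""",000323030,') ),     "\n";
print join( '|', split_csv_line('S,"D" REL VALVE ASSY,000771881,') ), "\n";
```

A line whose inner quotes are not doubled, such as the question's S,"YBELT,"V"",000323030, falls through to the second branch and is split at the embedded comma, so this approach relies on well-formed input.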
This works like a charm.
The line is assumed to be comma-separated, with commas embedded in quoted fields.
my #columns = Text::ParseWords::parse_line(',', 0, $line);
Finding matching pairs using regexes is a non-trivial and, in general, unsolvable task. There are plenty of examples in Jeffrey Friedl's book Mastering Regular Expressions. I don't have it at hand now, but I remember that he used CSV for some examples, too.
You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; - it's probably better).
Tested:
use Test::More tests => 2;
use strict;
sub splitCommaNotQuote {
my ( $line ) = @_;
my @fields = ();
while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
if ( $2 ) {
push( @fields, $3 );
} else {
push( @fields, $1 );
}
last if ( ! $4 );
}
return( @fields );
}
is_deeply(
+[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
+['S', '"D" REL VALVE ASSY', '000771881', ''],
"Quote in value"
);
is_deeply(
+[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
+['S', 'BELT V,FAN', '000324244', ''],
"Strip quotes from entire value"
);