How can I parse quoted CSV in Perl with a regex? - regex

I'm having some issues with parsing CSV data with quotes. My main problem is with quotes within a field. In the following example lines 1 - 4 work correctly but 5,6 and 7 don't.
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSV's are are more complicated than they look I'm using a recipe from the Perl Cookbook.
sub parse_csv {
my $text = shift; #record containg CSVs
my #columns = ();
push(#columns ,$+) while $text =~ m{
# The first part groups the phrase inside quotes
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(#columns ,undef) if substr($text, -1,1) eq ',';
return #columns ; # list of vars that was comma separated.
}
Does anyone have a suggestion for improving the regex to handle the above cases?

Please, Try Using CPAN
There's no reason you couldn't download a copy of Text::CSV, or any other non-XS based implementation of a CSV parser and install it in your local directory, or in a lib/ sub directory of your project so its installed along with your projects rollout.
If you can't store text files in your project, then I'm wondering how it is you are coding your project.
http://novosial.org/perl/life-with-cpan/non-root/
Should be a good guide on how to get these into a working state locally.
Not using CPAN is really a recipe for disaster.
Please consider this before trying to write your own CSV implementation.
Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.
note: I learnt this the hard way. Took me a full day to get a working CSV parser in PHP before I discovered an inbuilt one had been added in a later version. It really is something awful.

You can parse CSV using Text::ParseWords which ships with Perl.
use Text::ParseWords;
while (<DATA>) {
chomp;
my #f = quotewords ',', 0, $_;
say join ":" => #f;
}
__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
which parses your CSV correctly....
# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:
The only issue I've had with Text::ParseWords is when nested quotes in data aren't escaped correctly. However this is badly built CSV data and would cause problems with most CSV parsers ;-)
So you may notice that
# S,"YBELT,"V"",000323030,
came out as (ie. quotes dropped around "V")
# S:YBELT,V:000323030:
however if its escaped like so
# S,"YBELT,\"V\"",000323030,
then quotes will be retained
# S:YBELT,"V":000323030:

tested; working:-
$_.=','; # fake an ending delimiter
while($_=~/"((?:""|[^"])*)",|([^,]*),/g) {
$cell=defined($1) ? $1:$2; $cell=~s/""/"/g;
print "$cell\n";
}
# The regexp strategy is as follows:
# First - we attempt a match on any quoted part starting the CSV line:-
# "((?:""|[^"])*)",
# It must start with a quote, and end with a quote followed by a comma, and is allowed to contain either doublequotes - "" - or anything except a sinlge quote [^"] - this goes into $1
# If we can't match that, we accept anything up to the next comma instead, & put it into $2
# Lastly, we convert "" to " and print out the cell.
be warned that CSV files can contain cells with embedded newlines inside the quotes, so you'll need to do this if reading the data in line-at-a-time:
if("$pre$_"=~/,"[^,]*\z/) {
$pre.=$_; next;
}
$_="$pre$_";

This works like charm
line is assumed to be comma separated with embeded ,
my #columns = Text::ParseWords::parse_line(',', 0, $line);

Finding matching pairs using regexs is non-trivial and generally unsolvable task. There are plenty of examples in the Jeffrey Friedl's Mastering regular expressions book. I don't have it at hand now, but I remember that he used CSV for some examples, too.

You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to #INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; - it's probably better).

Tested:
use Test::More tests => 2;
use strict;
sub splitCommaNotQuote {
my ( $line ) = #_;
my #fields = ();
while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
if ( $2 ) {
push( #fields, $3 );
} else {
push( #fields, $1 );
}
last if ( ! $4 );
}
return( #fields );
}
is_deeply(
+[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
+['S', '"D" REL VALVE ASSY', '000771881', ''],
"Quote in value"
);
is_deeply(
+[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
+['S', 'BELT V,FAN', '000324244', ''],
"Strip quotes from entire value"
);

Related

Use regular expressions to find host name

so I have little problem, because I need to print host name which is bettwen "(?# )", for example:
Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu
And I need to print "researchscan425.eecs.umich.edu".
I tried something like:
if(my ($test) = $linelist =~ /\b\(\?\#(\S*)/)
{
print "$test\n";
}
But it doesn't print me anything.
You can use this regex:
\(\?#(.*?)\)
researchscan425.eecs.umich.edu will be captured into Group 1.
See demo
Sample code:
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?#(.*?)\)/)
{
print "$test\n";
}
How about:
if(my ($test) = $linelist =~ /\(\?\#([^\s)]+)/)
You need to remove the \b which exists before (. Because there isn't a word boundary exists before ( (non-word character) and after space (non-word charcater).
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?\#([^)]*)/)
{
print "$test\n";
}
The problem here is the definition of \b.
It's "word boundary" - on regex101 that means:
(^\w|\w$|\W\w|\w\W)
Now, why this is causing you problems - ( is not a word character. So the transition from space to bracket doesn't trigger this pattern.
Switch your pattern to:
\s\(\?\#(\S+)
And it'll work. (Note - I've changed * to + because you probably want one or more, not zero or more).
It's amazing what you can do with logging tools or with perl as part of the logging service itself (c.f. Ubic), but even if you're just writing a "quick script" to parse logs for reporting (i.e. something you or someone else won't look at again for months or years) it helps to make them easy to maintain.
One approach to doing this is to process the lines of your log file lines with Regexp::Common. One advantage is that RX::Common matches practically "self document" what you are doing. For example, to match on specific "RFC compliant" definitions of what constitutes a "domain" using the $linelist you posted:
use Regexp::Common qw /net/;
if ( $line =~ /\?\#$RE{net}{domain}{-keep}/ ) { say $1 }
Then, later, if you need you can add other matches e.g "numeric" IPv4 or IPv6 addresses, assign them for use later in the script, etc. (Perl6::Form and IO::All used for demonstration purposes only - try them out!):
use IO::All ;
use Regexp::Common qw/net/;
use Perl6::Form;
my $purelog = io 'logfile.lines.txt' ;
sub _get_ftphost_names {
my #hosts = () ;
while ($_ = $purelog->getline) {
/\(\?\#$RE{net}{IPv6}{-sep => ":" }{-keep}/ ||
/\(\?\#$RE{net}{IPv4}{-keep}/ ||
/\(\?\#$RE{net}{domain}{-keep}/ and push #hosts , $1 ;
}
return \#hosts ;
}
sub _get_bytes_transfered {
... ;
}
my #host_list = _get_ftphost_names ;
print form
"{[[[[[[[[[[(30+)[[[[[[[[[[[[[}", #host_list ;
One of the great things about Regexp::Common (besides stealing regexp ideas from the source) is that it also makes it fairly easy to roll your own matches, You can use those to capture other parts of the file in an easily understandable way adding them piece by piece. Then, as what was supposed to be your four line script grows and transforms itself into a ITIL compliant corporate reporting tool, you and your career can advance apace :-)

Regex : conditions in captured variables

This is my data (in a file):
5807035;Fab;2015/01/05;04;668100;18:06:01,488;18:06:02,892
5807028;Opt;2015/01/05;04;836100;17:12:45,223;17:12:47,407
5807028;Fab;2015/01/05;04;836100;17:12:47,470;17:12:48,172
5807027;Opt;2015/01/05;04;926100;17:12:31,807;17:12:34,365
5807027;Fab;2015/01/05;04;926100;17:12:34,443;17:12:37,095
5807026;Opt;2015/01/05;04;682100;17:12:11,698;17:12:19,062
5807026;Fab;2015/01/05;04;682100;17:12:19,124;17:12:21,667
5807025;Opt;2015/01/05;04;217100;17:12:00,669;17:12:02,635
This is my Perl code :
while ( $data =~ m/(\d+);(Opt|Fab);(.+);(\d{2});(.+);(.+);(.+)\n(\d+);(Opt|Fab);.+;\d{2};.+;(.+);(.+)\n/g ) {
if ( "$1" eq "$8" && "$2" ne "$9" ) {
print OUTFILE "$1;$3;$4;$5;$6;$7;$10;$11\n";
}
}
The lines 1 and 2 match the regex, but do not satisfy the condition of the if statement. That's fine.
On the other hand, the lines 2 and 3 satisfy the regex, AND the condition of the if statement. However, it these lines are not retrieved.
I suppose it's because the regex read two lines, then the next two lines, etc. I think I should include the condition of the if statement in the regex (if I'm not mistaken).
What do you guys think ?
The variable $data holds the content of my CSV file.
Since you want to check line 1 & 2, then 2 & 3, you need to prevent the regex engine from consuming the 2nd line by placing the regex to match the second line in a look-ahead:
while ( $data =~ m/(\d+);(Opt|Fab);(.+);(\d{2});(.+);(.+);(.+)\n(?=(\d+);(Opt|Fab);.+;\d{2};.+;(.+);(.+)\n)/g ) {
I didn't think too much when I first answer, but as #ThisSuitIsBlackNot suggested in the comment, using regular expression to parse CSV results in low maintainability code. Using CSV library to parse the data and process them is a better idea here.

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last modified .csv in a folder, and parses it's values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second line (that are the column names from the .csv) to be ignored;
The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while($my line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}-\d{2}-d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}-\d{2}-d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}-\d{2}-d{4}",/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation Mark.
\d{2} - Followed by two digits.
- - Followed by a dash.
\d{2] - Followed by two more digits.
- - Followed by a dash.
\d{4} - Followed by four more digits
This should be describing the first part of your line which is the date in MM-DD-YYYY format surrounded by quotes and followed by a comma. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and is one of the reasons why Perl has such a reputation of being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions is an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}-\d{2}-d{4}",/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what do you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a file handle value of zero. The or states that if open doesn't return a non-zero file handle, execute the line that follows which kills your program.
So, look through your program, and see any place where you make an assumption that something works as expected and think what happens if it didn't. Then, add checks in your program to something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. What ever you do, check for possible errors (especially from user input) and handle possible errors.
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.

Using perl to split a line that may contain whitespace

Okay, so I'm using perl to read in a file that contains some general configuration data. This data is organized into headers based on what they mean. An example follows:
[vars]
# This is how we define a variable!
$var = 10;
$str = "Hello thar!";
# This section contains flags which can be used to modify module behavior
# All modules read this file and if they understand any of the flags, use them
[flags]
Verbose = true; # Notice the errant whitespace!
[path]
WinPath = default; # Keyword which loads the standard PATH as defined by the operating system. Append with additonal values.
LinuxPath = default;
Goal: Using the first line as an example "$var = 10;", I'd like to use the split function in perl to create an array that contains the characters "$var" and "10" as elements. Using another line as an example:
Verbose = true;
# Should become [Verbose, true] aka no whitespace is present
This is needed because I will be outputting these values to a new file (which a different piece of C++ code will read) to instantiate dictionary objects. Just to give you a little taste of what it might look like (just making it up as I go along):
define new dictionary
name: [flags]
# Start defining keys => values
new key name: Verbose
new value val: 10
# End dictionary
Oh, and here is the code I currently have along with what it is doing (incorrectly):
sub makeref($)
{
my #line = (split (/=/)); # Produces ["Verbose", " true"];
}
To answer one question, why I am not using Config::Simple, is that I originally did not know what my configuration file would look like, only what I wanted it to do. Making it up as I went along - at least what seemed sensible to me - and using perl to parse the file.
The problem is I have some C++ code that will load the information in the config file, but since parsing in C or C++ is :( I decided to use perl. It's also a good learning exercise for me since I am new to the language. So that's the thing, this perl code is not really apart of my application, it just makes it easier for the C++ code to read the information. And, it is more readable (both the config file, and the generated file). Thanks for the feedback, it really helped.
If you're doing this parsing as a learning exercise, that's fine. However, CPAN has several modules that will do a lot of the work for you.
use Config::Simple;
Config::Simple->import_from( 'some_config_file.txt', \my %conf );
split splits on a regular expression, so you can simply put the whitespace around the = sign into its regex:
split (/\s*=\s*/, $line);
You obviously do not want to remove all whitespace, or such a line would be produced (whitespace missing in the string):
$str="Hellothere!";
I guess that only removing whitespace from the beginning and end of the line is sufficient:
$line =~ s/^\s*(.*?)\s*$/$1/;
A simpler alternative with two statements:
$line =~ s/^\s+//;
$line =~ s/\s+$//;
Seems like you've got it. Strip the whitespaces before splitting.
sub makeref($)
{
s/\s+//g;
my #line = (split(/=/)); # gets ["verbose", "true"]
}
This code does the trick (and is more efficient without reversing).
for (#line) {
s/^\s+//;
s/\s+$//;
}
You probably have it all figured out, but I thought I'd add a little. If you
sub makeref($)
{
my #line = (split(/=/));
foreach (#line)
{
s/^\s+//g;
s/\s+$//g;
}
}
then you will remove the whitespace before and after both the left and right side. That way something like:
this is a parameter = all sorts of stuff here
will not have crazy spaces.
!!Warning: I probably don't know what I'm talking about!!

Regex newbie question on start and end of captures

I need some help with regular expressions. Please see the example below. I am capturing specific rid values that are contained between between this
","children":[
and ending with this
}]}]}
as shown below.
My problem is that the block shown below repeats itself several times and I want all rids between the start of ","children":[ to }]}]} per block only.
I know I can capture individual rid value with: rid":"([\w\d\-\."]+)
But I don't know how to specify to capture all rid":"([\w\d\-\."]+) that exist between between the start of ","children":[ to }]}]}
Example:
","children":[{"type":"stub","context":"","rid":"b1c4922237ce.ee6a3644443fe.10711226e93.d0af7aadbd0-4be3-4353ddd.8b47.f2f4aaf2474f","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.290c6e93.91c15f91-a1c-4c36.9939.4ab7b94a39ad","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.27c3ee93.22e90c22-7406-463a.8bff.f6ea88f6ffcc","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6a182e93.5c0e7d5c-ff65-451d.afc0.cfc7fbcfc02d","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6970ae93.8ea3978e-112b-4bbb.8405.d17071d105d2","metaclass":"ASAPModel.BarrierCategory"}]}]},
","children":[{"type":"stub","context":"","rid":"b1c4922237ce.ee6a3644443fe.10711226e93.d0af7aadbd0-4be3-4353ddd.8b47.f2f4aaf2474f","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.290c6e93.91c15f91-a1c-4c36.9939.4ab7b94a39ad","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.27c3ee93.22e90c22-7406-463a.8bff.f6ea88f6ffcc","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6a182e93.5c0e7d5c-ff65-451d.afc0.cfc7fbcfc02d","metaclass":"ASAPModel.BarrierCategory"},
{"type":"stub","context":"","rid":"b1c497ce.ee6a64fe.6970ae93.8ea3978e-112b-4bbb.8405.d17071d105d2","metaclass":"ASAPModel.BarrierCategory"}]}]},
My problem is that I don't understand how to specify the beginning and end values of where to start the non capturing group and how to say identify one or more of these capture groups sort of like []+
This looks like JSON (though you example data is incomplete to be valid).
If so then perhaps JSON module from CPAN might be best way forward:
use strict;
use warnings;
use JSON qw( from_json );
# my example data
my $data = q( [
{"children":[ {"type":"stub","rid":"aa"}, {"type":"stub2","rid":"bb"} ] },
{"children":[ {"type":"stub","rid":"cc"}, {"type":"stub2","rid":"dd"} ] } ]
);
my $json = from_json( $data );
for my $rec ( #$json ) {
for my $child ( #{ $rec->{children} } ) {
say "rid: ", $child->{rid};
}
}
This prints:
rid: aa
rid: bb
rid: cc
rid: dd
You need to break this up into two steps:
Get the length of data
Get the rids
# Make sure you get the first one
my ( $child ) = $record =~ m/"children":\[([^\]]+)\]/g;
# Get all in span - the g operator tells the regex to get all ( 'global' )
my #rids = $child =~ m/"rid":"([^"]+)"/g; # <-- g operator
But it looks like JSON to me, and you could parse data like this with JSON::Syck
some thing like \",\"children\":(.*)(?=\\]\\}\\]\\})
play around with it
the forum is absorbing some of my backslashes, word of warning to double up for anyone else
in response to edits
Try breaking up the data into its bracketed groups first, then doing one search for each in a for loop. you can get all the groups at once using regex groups.