Multiline match with irregular new line - regex

I have a text file with many entries like this:
[...]
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
State: 2,102,83,476,224
[...]
From the part above I would like to extract:
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
It would be simple if EnterValues: existed after every Solution:, but unfortunately it doesn't. Sometimes it is Speed, sometimes something different. I don't know how to construct the end of the regex (I assume it should be something like this: Solution:.*?(?<!~)\n).
My file uses \n as the newline delimiter.

What you need is a "record separator" that works like a regex. Unfortunately, you cannot use $/ for this, because it cannot be a regex. You can, however, read the entire file into one string and split that string using a regex:
use strict;
use warnings;
use Data::Dumper;
my $str = do {
local $/; # disable input record separator
<DATA>; # slurp the file
};
my @lines = split /^(?=\pL+:)/m, $str; # lines begin with letters + colon
print Dumper \@lines;
__DATA__
Wind: 83,476,224
Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
EnterValues: 656,136,1
Speed: 48,32
State: 2,102,83,476,224
Output:
$VAR1 = [
'Wind: 83,476,224
',
'Solution: (category,runs)~
0.235,6.52312667,~
0.98962,14.33858333,~
sdasd,cccc,~
0.996052905,sdsd
',
'EnterValues: 656,136,1
',
'Speed: 48,32
',
'State: 2,102,83,476,224
'
];
You will do some sort of post processing on these variables, I assume, but I will leave that to you. One way to go from here is to split the values on newline.

As I see, you first read the whole file into memory, which is not good practice. Try using the flip-flop operator instead:
while ( <$fh> ) {
    if ( /Solution:/ ... !/~$/ ) {
        print $_; # $_ still carries its own newline
    }
}
I can't test it right now, but I think this should work fine.

You can match from Solution: up to the next word followed by a colon:
my ($solution) = $text =~ /(Solution:.*?) \w+: /xs;
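The asker's own idea (stop at the first line that does not end in ~) can also be written as a single regex. The following is only a sketch under the assumption that every line of a Solution block except the last ends with ~; the helper name extract_solutions is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every "Solution:" block found in $text.
# Assumption: all lines of a block except the last one end with "~".
sub extract_solutions {
    my ($text) = @_;
    # (?:.*~\n)* consumes the header and continuation lines ending in "~";
    # the final .* takes the one line that does not end in "~".
    return $text =~ /^(Solution:(?:.*~\n)*.*)$/mg;
}

my $sample = "Wind: 83,476,224\nSolution: (category,runs)~\n0.235,6.52312667,~\n"
           . "0.996052905,sdsd\nSpeed: 48,32\n";
print "$_\n" for extract_solutions($sample);
```

Because the pattern never names the following keyword, it should not matter whether EnterValues:, Speed: or something else comes next.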

Related

Perl Regex: How to parse string from " to" without \"?

I have to parse the line "abc\",","\"," with a regex in Perl,
and get the results "abc\"," and "\",".
I do this:
while (/(\s*)/gc) {
if (m{\G(["])([^\1]+)\1,}gc){
say $2;
}
}
but it is wrong, because this regexp runs on to the last ",
My question is: how can I skip over the \" and stop at the first ", ?
The following program performs matches according to your specification:
while (<>) {
    @arr = ();
    while (/("(?:\\"|[^"])*")/) {
        push @arr, $1;
        $_ = $';
    }
    print join(' ', @arr), "\n";
}
Input file input.txt:
"abc", "def"
"abc\",","\","
Output:
$ ./test.pl < input.txt
"abc" "def"
"abc\"," "\","
It can be improved to match more strictly, because in this form it accepts a lot of input that may not be desirable, but it serves as a first pointer. Additionally, a CSV file is better parsed with the corresponding module than with regular expressions, but you have not stated whether your input really is CSV.
Don't reinvent the wheel. If you have CSV, use a CSV parser.
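use Text::CSV_XS qw( );
my $string = '"abc\",","\","';
# escape_char is set because this data escapes quotes with a backslash (the default is ")
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, escape_char => '\\' });
$csv->parse($string);
my @fields = $csv->fields();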
Regexes aren't the best tool for this task. The standard Text::ParseWords module does this easily.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::ParseWords;
my $line = '"abc\",","\","';
my @fields = parse_line(',', 1, $line);
for (0 .. $#fields) {
say "$_: $fields[$_]"
}
The output is:
0: "abc\","
1: "\","
split /(?<!\\)",(?<!\\)"/, $_
(preceded by stripping the outer quotes from $_ with s/^"// && s/"$//; because the enclosing outer quotes did not need to be part of the input string, but you have them)
returns the array you want directly, with no need for an explicit loop, since the looping happens inside Perl's built-in split. You might add \s* around the comma, depending on how the string is supplied.
However (just a note, as you didn't mention it), there could be a deeper case.
If \" stands for ", then \\ probably stands for \ as well, so you might also meet \\\" and \\". The latter case (more generally, an even number of \ before ") is complicated to handle with a one-line regexp, because look-behind cannot contain an unbounded quantifier. The unsupported form (?<!\\(?:\\\\)*)", which would correctly treat the " in \\" as a string delimiter rather than as an escaped quote, is therefore not applicable, and less efficient code than mine would be required. Again, this marginal consideration only matters if \\ also has to be interpreted.
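One way to sidestep look-behind altogether is to match the quoted fields themselves instead of splitting on the separators. The sketch below assumes that a backslash escapes whatever single character follows it (so \\ is a literal backslash and \" a literal quote); the helper name quoted_fields is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return each double-quoted field, quotes included.
# Assumption: a backslash escapes the single character that follows it.
sub quoted_fields {
    my ($line) = @_;
    # Inside the quotes, accept either an ordinary character (neither a
    # quote nor a backslash) or a backslash plus the character it escapes.
    return $line =~ /("(?:[^"\\]|\\.)*")/g;
}

print "$_\n" for quoted_fields('"abc\",","\","');
```

Since escape pairs are consumed two characters at a time, an even run of backslashes before a quote leaves that quote free to close the field.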

Match multiple line string in Perl

I'm new to Perl and I was wondering if someone can help me.
I have an input like this:
a,b,
c,d,e,f,g,h,
i,j,q // Letras
I'm trying to get the letters before // separately and then print them between {} separated by :.
I tried with this RE ([\w,;:\s\t]*)(\n|\/\/)/m and I could get in $1 all letters for each line (as a string including separators) but not what I want.
I need to match that pattern more than one time in the same file so I was using /g.
Edit:
Here is my code block:
while ( <> ) {
if ( /([\w,;:\s\t]*)(\n|\/\/)/m ) {
print "$1\n";
}
}
/m makes ^ and $ match at line boundaries within a string that contains multiple lines.
On the other hand, you are reading the input line by line. You cannot expect to match across lines with a single expression if you only look at one line at a time.
Instead, read by chunks by setting $/ to an appropriate value. If the chunks always end in the exact string "// Letras\n\n", the task is even simpler.
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = '//';
while (my $chunk = <DATA>) {
chomp $chunk;
my @fields = ($chunk =~ /([a-z])[, ]/g);
next unless @fields;
printf "{%s}\n", join(':', @fields);
}
__DATA__
a,b,
c,d,e,f,g,h,
i,j,q // Letras
a,b,
c,d,e,f,g,h,
i,j,q // Metras
Output:
{a:b:c:d:e:f:g:h:i:j:q}
{a:b:c:d:e:f:g:h:i:j:q}
You can also use File::Stream:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Stream;
my $stream = File::Stream->new(
\*DATA,
separator => qr{ (?: \s+ // [^\n]+ ) \n\n }x
);
while (my $chunk = <$stream>) {
$chunk =~ s{ \s+ // .* \z }{}sx;
$chunk =~ s{ ,\n? }{:}gx;
print "{$chunk}\n";
}
__DATA__
a,b,
c,d,e,f,g,h,
i,j,q // Letras
a,b,
c,d,e,f,g,h,
i,j,q // Metras
I think what you're aiming for is to remove comments (denoted by a double slash) from each line, and print it out enclosed by braces, and with a colon : separator instead of commas
First of all you should remove the trailing linefeed character from each line using chomp
Then all you need to remove any trailing comment is s|\s*//.*||. That removes any spaces before the // as well. I'm using a pipe character | as the delimiter so as to avoid having to escape the slashes within the regex pattern. And the data is being processed one line at a time so there is no need for the global /g modifier
This program reads from the file specified on the command line, which I've set up to contain the data you show in the question
use strict;
use warnings;
while ( <> ) {
chomp;
s|\s*//.*||;
print "{$_}\n";
}
output
{a,b,}
{c,d,e,f,g,h,}
{i,j,q}
Update
Thanks to Sinan Ünür's solution I notice that you've asked to "print [the letters] between {} separated by :"
This is a modification of the while loop above, which finds all substrings within the current line that don't contain commas, and joins them together again using colons :
while ( <> ) {
chomp;
s|\s*//.*||;
my $values = join ':', /[^,]+/g;
print "{$values}\n";
}
output
{a:b}
{c:d:e:f:g:h}
{i:j:q}
I am sure the true solution is much more simple, but unless you elaborate your question we have to cater for all possibilities
Are you looking to combine the letters on all 3 lines into the output, or convert each line?
In other words, is your desired output
{a:b}
{c:d:e:f:g:h}
{i:j:q}
or
{a:b:c:d:e:f:g:h:i:j:q}
?
If you want the former, Borodin's answer works.
If you want the latter, then you should load the contents into an array, and print it using a join statement. To do that, I've modified Borodin's answer:
my @values; # collects the letters from every line
while ( <> ) { # read each line
    chomp; # remove \n from line
    s|\s*//.*||; # remove comment
    push @values, /[^,]+/g; # store letters in array
}
my $values = join ':', @values; # convert array to string
print "{$values}\n"; # print the results
my $str = "a,b,
c,d,e,f,g,h,
i,j,q // Letras";
$str = join "",map {s/,/:/g ;(split)[0]} split '\n', $str;
print "{$str}";
Sample output
{a:b:c:d:e:f:g:h:i:j:q}
I am considering a string with multilines separated by newline character.
join "",map {s/,/:/g ;(split)[0]} split '\n', $str
This is evaluated from right to left.
Split with \n on $str produces 3 elements which is input for map.
(split)[0] : default delimiter for split is whitespace. so each element is split for whitespace and 0th element is only considered discarding others.
Ex (split)[0] for i,j,q // Letras produces 3 elements "i,j,q" "//" "Letras" where only element 0 i.e., "i,j,q" is considered.
, is replaced with :
join is used to combine all the resulting elements from map.

Pattern match in perl

my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my $name = "";
@name = ( $line =~ m/Name:([\w\s\_\,/g );
foreach (@name) {
print $name."\n";
}
I want to capture the word between Name: and ,Region wherever it occurs in the whole line. The main catch is that the name can be in any format:
Amanda_Marry_Rose
Amanda.Marry.Rose
Amanda Marry Rose
Amanda/Marry/Rose
I need help capturing such a pattern every time it occurs in the line. So for the line I provided, the output should be
Amanda_Marry_Rose
Raghav.S.Thomas
Does anyone have any idea how to do this? I tried the line below, but it gives me the wrong output.
@name=($line=~m/Name:([\w\s\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\´]+)\,/g);
Output
Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE
To capture between Name: and the first comma, use a negated character class:
/Name:([^,]+)/g
This says to match one or more characters following Name: that are not commas:
while (/Name:([^,]+)/g) {
print $1, "\n";
}
This is more efficient than a non-greedy quantifier, e.g.:
/Name:(.+?),/g
because it doesn't require backtracking.
Reg-ex corrected:
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my @name = ($line =~ /Name\:([\w\s_.\/]+)\,/g);
foreach my $name (@name) {
print $name."\n";
}
What you have there is comma separated data. How you should parse this depends a lot on your data. If it is full-fledged csv data, the most safe approach is to use a proper csv parser, such as Text::CSV. If it is less strict data, you can get away with using the light-weight parser Text::ParseWords, which also has the benefit of being a core module in Perl 5. If what you have here is rather basic, user entered fields, then I would recommend split -- simply because when you know the delimiter, it is easier and safer to define it, than everything else inside it.
use strict;
use warnings;
use Data::Dumper;
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
# Simple split
my @fields = split /,/, $line;
print Dumper map /^Name:(.*)/, @fields;
# Text::ParseWords
use Text::ParseWords;
print Dumper map /^Name:(.*)/, quotewords(',', 0, $line);
# Text::CSV
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
});
$csv->parse($line);
print Dumper map /^Name:(.*)/, $csv->fields;
Each of these options gives the same output, save for the one that uses Text::CSV, which also issues an undefined warning, quite correctly, because your data has a trailing comma (meaning an empty field at the end).
Each of these has different strengths and weaknesses. Text::CSV can choke on data that does not conform with the CSV format, and split cannot handle embedded commas, such as Name:"Doe, John",....
The regex we use to extract the names very simply just captures the entire rest of the lines that begin with Name:. This also allows you to perform sanity checks on the field names, for example issue a warning if you suddenly find a field called Doe;Name:
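The sanity check mentioned above could look roughly like the following sketch. It uses the plain split approach; the helper name names_with_check and the list of accepted field names passed to it are assumptions made for illustration, not part of the original data description.

```perl
use strict;
use warnings;

# Hypothetical helper: return the Name values from $line and warn about
# any field name that is not in the caller-supplied list of known names.
sub names_with_check {
    my ($line, @known) = @_;
    my %known = map { lc($_) => 1 } @known;
    my @names;
    for my $field (grep { length } split /,/, $line) {
        # Limit of 2 keeps any further colons inside the value.
        my ($key, $value) = split /:/, $field, 2;
        warn "Unexpected field name '$key'\n" unless $known{lc $key};
        push @names, $value if $key eq 'Name';
    }
    return @names;
}

my $record = 'Name:Amanda_Marry_Rose,Region:US,host:USE,Name:Raghav.S.Thomas,Region:UAE,';
print "$_\n" for names_with_check($record, qw(Name Region host cardType product));
```

A field such as Doe;Name:... would trigger the warning and be skipped, rather than being silently treated as a name.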
The simple way is to look for all sequences of non-comma characters after every instance of Name: in the string.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my @names = $line =~ /Name:([^,]+)/g;
print "$_\n" for @names;
output
Amanda_Marry_Rose
Raghav.S.Thomas
However, it may well be useful to parse the data into an array of hashes so that related fields are gathered together.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my %info;
my @persons;
while ( $line =~ / ([a-z]+) : ([^:,]+) /gix ) {
    my ($key, $val) = (lc $1, $2);
    if ($info{$key}) {
        push @persons, { %info };
        %info = ();
    }
    $info{$key} = $val;
}
push @persons, { %info };
use Data::Dump;
dd \@persons;
print "\nNames:\n";
print "$_\n" for map $_->{name}, @persons;
output
[
{
cardtype => "DebitCard",
host => "USE",
name => "Amanda_Marry_Rose",
product => "Satin",
region => "US",
},
{
name => "Raghav.S.Thomas",
region => "UAE",
},
]
Names:
Amanda_Marry_Rose
Raghav.S.Thomas

Regex only with starting match

I am familiar with capturing multiple words with a definite match in Perl, e.g.:
$string="dasd 341312 ddas 42 fsd 5345";
@numbers=$string=~/(\d+)/g;
This returns an array of numbers in my string.
I have data in this form:
random
text
START=somenumber
lines
of
text
here
START=someothernumber
other
text
here
START=thirdnumber
more
text
...
How can I capture into an array all data blocks beginning with START= and running on (multiline) until the next START= (not including it)?
So, e.g.:
$array[1] = " START=someothernumber
other
text
here"
Perhaps the following will be helpful:
use strict;
use warnings;
use Data::Dumper;
my $data = do { local $/; <DATA> };
my @array = $data =~ /(START=.+?)(?=START=|\z)/gs;
print Dumper \@array;
__DATA__
random
text
START=somenumber
lines
of
text
here
START=someothernumber
other
text
here
START=thirdnumber
more
text
Output:
$VAR1 = [
'START=somenumber
lines
of
text
here
',
'START=someothernumber
other
text
here
',
'START=thirdnumber
more
text
'
];
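There is a simpler way to do this. Switch on multiline (/m) and global (/g) matching, and remember that the character class has to allow newlines; that is the key to making this work. Note that this captures the text after each START= and assumes the blocks contain only word characters and whitespace: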
while ($string =~ /^.*?START=([\w\s\n]*$)/mg) {
print $1,"\n";
}
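Another option, sketched here rather than taken from the answers above, is to split the slurped text at a zero-width position in front of every line that begins with START=. The helper name start_blocks is hypothetical, and the sketch assumes that START= always appears at the beginning of a line.

```perl
use strict;
use warnings;

# Hypothetical helper: split $data into blocks that each begin with "START=".
# Assumption: "START=" always appears at the beginning of a line.
sub start_blocks {
    my ($data) = @_;
    # The look-ahead makes the split point zero-width, so "START=" stays
    # in its block; grep drops any text that precedes the first START=.
    return grep { /\ASTART=/ } split /^(?=START=)/m, $data;
}

my $sample_data = "random\ntext\nSTART=1\nlines\nof\nSTART=2\nother\ntext\n";
print "[$_]\n" for start_blocks($sample_data);
```

Unlike the character-class approach, this places no restriction on which characters the blocks may contain.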

Print line of pattern match in Perl regex

I am looking for a keyword in a multiline input using a regex like this,
if($input =~ /line/mi)
{
# further processing
}
The data in the input variable could be like this,
this is
multi line text
to be matched
using perl
The code works and matches the keyword line correctly. However, I would also like to obtain the line where the pattern was matched - "multi line text" - and store it into a variable for further processing. How do I go about this?
Thanks for the help.
You can grep out the lines into an array, which will then also serve as your conditional:
my @match = grep /line/mi, split /\n/, $input;
if (@match) {
# ... processing
}
TLP's answer is better but you can do:
if ($input =~ /([^\n]+line[^\n]+)/i) {
$line = $1;
}
I'd check whether the multiline string matches at all and, if it does, split it into lines and look for the index of each matching line (starting at 0!):
#!/usr/bin/perl
use strict;
use warnings;
my $data=<<END;
this is line
multi line text
to be matched
using perl
END
if ($data =~ /line/mi){
my @lines = split(/\r?\n/,$data);
for (0..$#lines){
if ($lines[$_] =~ /line/){
print "LineNr of Match: " . $_ . "\n";
}
}
}
Did you try this?
This works for me. $1 holds whatever the parenthesised part of the regex captured.
This assumes the match occurs in only one of the lines. If several lines match, only the first one is captured.
if($var=~/(.*line.*)/)
{
print $1
}
If you want to capture all the lines that contain the string line, use the following:
my @a;
push @a,$var=~m/(.*line.*)/g;
print "@a";
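If both the matching line and its line number are wanted, one possibility is to open the string as an in-memory filehandle and let Perl's $. variable do the counting. This is only a sketch; the helper name matching_lines is made up, and line numbers here start at 1 rather than 0.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, line_text] pairs for every
# line of $input that matches $re. Line numbers are 1-based (from $.).
sub matching_lines {
    my ($input, $re) = @_;
    # A reference to a scalar opens an in-memory filehandle on the string.
    open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, [ $., $line ] if $line =~ $re;
    }
    close $fh;
    return @hits;
}

my $multi = "this is\nmulti line text\nto be matched\nusing perl\n";
printf "%d: %s\n", @$_ for matching_lines($multi, qr/line/i);
```

Reading line by line this way avoids building an intermediate array of all lines when the input is large.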