Regex: Interpret groups with the same content as a single group

I have the following situation:
^ID[ \t]*=[ \t]*('(.*)'|"(.*)")
The group with content
01
when a file contains:
ID = '01'
is the second.
Instead if:
ID = "01"
is the third.
This causes a problem for me with perl:
perl -lne "print \$2 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/" test.txt
If the group with single quotes matches, I get the output:
01
Otherwise I get an empty string.
How do I make both the single-quote case and the double-quote case end up in group two of the regex?

You can print both the groups, as they can never match at the same time:
perl -lne "print \$2.\$3 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/"
or remember the quotes in $2 and use $3 for the quoted string, followed by the remembered quote:
perl -lne "print \$3 if /^ID[ \t]*=[ \t]*((['\"])(.*)\2)/"

This looks like it's a good candidate for the branch reset operator, (?|...). Either capture in that alternation is $1, and the branch-reset construct takes care of the grouping without capturing anything:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
    say $1 if m/^ID \h* = \h* (?|'(\d+)'|"(\d+)") /x
}
You need v5.10, and that allows you to use the \h to match horizontal whitespace.
But, you don't need to repeat the pattern. You can match the quote and match that same quote later. A relative backreference, \g{N}, can do that:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
    say $2 if m/^ID \h* = \h* (['"])(\d+)\g{-2} /x
}
I prefer that \g{-2} because I usually don't have to update the numbering if I change the pattern to include more captures before the thing it refers to.
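To illustrate why the relative form is convenient, here is a minimal sketch in which an extra capture for the key is added in front of the quote. The sub name key_and_value is hypothetical and not part of the answer above; the point is only that \g{-2} still refers to the quote group even though its absolute number has moved from 1 to 2.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: returns (key, value) for lines such as ID="01".
sub key_and_value {
    my ($line) = @_;
    # Groups: 1 = key, 2 = opening quote, 3 = digits.
    # \g{-2} counts back two groups from where it appears,
    # so it still refers to the quote group.
    return unless $line =~ m/^(\w+) \h* = \h* (['"])(\d+)\g{-2} /x;
    return ($1, $3);
}

for my $line (q{ID='01'}, q{ID = "01"}, q{ID="01'}) {
    my @kv = key_and_value($line);
    say @kv ? "$kv[0] -> $kv[1]" : "no match: $line";
}
```

With an absolute backreference, the same change would have required rewriting \1 as \2.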
And, since this is a one-liner, don't type out the literal quotes (as ikegami has already shown):
say $2 if m/^ID \h* = \h* ([\x22\x27])(\d+)\g{-2} /x

Only one of the two will be defined, so simply use the one that's defined.
perl -nle'print $1//$2 if /^ID\h*=\h*(?:\x27(.*)\x27|"(.*)")/' # \x27 is '
You could also use a backreference.
perl -nle'print $2 if /^ID\h*=\h*(["\x27])(.*)\1/'
Note that all the provided solutions including these two fail (leave the escape sequence in) if you have something like ID="abc\"def" or ID="abc\ndef", assuming those are supported.
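If escaped quotes do need to be supported, one possible sketch is shown below. It assumes that a backslash only needs to be undone in front of a quote or another backslash; other sequences such as \n are left as they are. The sub name quoted_value is hypothetical.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper. Assumption: only \" \' and \\ are unescaped.
sub quoted_value {
    my ($line) = @_;
    # Inside the quotes, accept either an escaped character (\\.)
    # or any character that is neither the opening quote nor a backslash.
    return unless $line =~ /^ID\h*=\h*(["'])((?:\\.|(?!\1)[^\\])*)\1/;
    my $value = $2;
    $value =~ s/\\(["'\\])/$1/g;    # drop the escaping backslash
    return $value;
}

for my $line ('ID="abc\"def"', q{ID = 'it\'s'}, q{ID="01"}) {
    my $value = quoted_value($line);
    say defined $value ? $value : "no match: $line";
}
```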

Thank you @brian_d_foy:
perl -lne "print \$1 if /^ID\h*=\h*(?|'(.*)'|\"(.*)\")/" test.txt
Or better:
perl -lne "print \$2 if /^ID\h*=\h*(['\"])(.*)\1/" test.txt
I have also decided to accept
ID = 01 #Followed by one or more horizontal spaces.
In addition to:
ID = "01" #Followed by one or more horizontal spaces.
And:
ID = '01' #Followed by one or more horizontal spaces.
Therefore I have adopted a rather complex solution:
perl -lne "print \$2 if /^ID\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/" test.txt
I have combined both of your solutions, @brian_d_foy. The doubled parentheses put the second alternative's value into group two as well; otherwise it would be group one, and without the "branch reset operator" it would be group 4.
Afterwards I wrapped the syntax in a function:
function parse-config {
command perl -pe "s/\R/\n/g" "$2" | command perl -lne "print \$2 if /^$1\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/"
return $?
}
parse-config "ID" "test.txt"
In this:
"s/\R/\n/g"
This converts every CRLF, CR or LF into LF. \R is a very powerful escape available since perl v5.10. That version apparently introduced several features that are fundamental for me, and as luck would have it I needed all of them (\h, \R and (?|...)). Whoever did that update was brilliant.
I needed this because the "$" at the end of the pattern did not match: there was a "\r" before the Linux end-of-line "\n".
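The same idea can be sketched in pure Perl, without the shell wrapper. This is only an illustration under assumptions: the sub name parse_config_line is hypothetical, the key is quoted with \Q...\E so that it is taken literally, and the line ending is removed with \R instead of being rewritten in a separate perl process.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: return the value for $key on one config line,
# or undef if the line does not match.
sub parse_config_line {
    my ($key, $line) = @_;
    $line =~ s/\R\z//;    # drop a trailing LF, CRLF or CR
    return $2
        if $line =~ /^\Q$key\E\h*=\h*(?|(['"])(.*)\1|(([^\h'"]*)))\h*(?:#.*)?$/;
    return;
}

for my $line ("ID = '01'\r\n", qq{ID = "01"   # comment\n}, "ID = 01\n", "NAME = x\n") {
    my $value = parse_config_line('ID', $line);
    say defined $value ? $value : '(no match)';
}
```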

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?
A linefeed is matched by \n
echo "a\n\nb" | perl -pe's/\n/z/'
This prints azzbz, without a trailing newline, so the next prompt appears on the same line. Note that the program is fed one line at a time, so there is no need for the /g modifier (which is also why \n\n doesn't match). The /m modifier is unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two
last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will insert z at the end of every line inside the string. Since $ is a zero-width assertion, the newlines themselves stay in place, so this does not actually replace them.
You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'
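Inside a script, the equivalent of -0777 is to undefine the input record separator $/ before reading. The following is a minimal sketch; the sub name replace_double_newlines is hypothetical, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole filehandle and replace each pair
# of consecutive newlines with "z".
sub replace_double_newlines {
    my ($fh) = @_;
    local $/;             # undef record separator: one read returns everything
    my $text = <$fh>;
    $text =~ s/\n\n/z/g;
    return $text;
}

my $input = "aaa\n\n\nbbb\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
print replace_double_newlines($fh);    # prints "aaaz", a newline, then "bbb"
```

Of the three newlines after aaa, the first two are replaced as a pair and the third one remains.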

How to group string of characters by 4?

I have string 1234567890 and I want to format it as 1234 5678 90
I write this regex:
$str =~ s/(.{4})/$1 /g;
But for this case 12345678 this does not work. I get excess whitespace at the end:
>>1234 5678 <<
I try to rewrite regex with lookahead:
s/((?:.{4})?=.)/$1 /g;
How to rewrite regex to fix that case?
Just use unpack
use strict;
use warnings 'all';
for ( qw/ 12345678 1234567890 / ) {
printf ">>%s<<\n", join ' ', unpack '(A4)*';
}
output
>>1234 5678<<
>>1234 5678 90<<
Context is your friend:
join(' ', $str =~ /(.{1,4})/g)
In list context, the match returns all four-character chunks (and anything shorter than that at the end of the string, thanks to greediness). join ensures the chunks are separated by spaces and that there is no trailing space at the end.
If $str is huge and the temporary list increases the memory footprint too much, then you might just want to do the s///g and strip the trailing space.
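As a small runnable sketch of the list-context approach (the sub name group_by_4 is hypothetical):

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: split a string into chunks of up to four
# characters and join them with single spaces.
sub group_by_4 {
    my ($str) = @_;
    # join imposes list context, so m//g returns every captured chunk.
    return join ' ', $str =~ /(.{1,4})/g;
}

say ">>", group_by_4($_), "<<" for qw( 12345678 1234567890 );
```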
My preference is for using the simplest possible patterns in regexes. Also, I haven't measured but with long strings, just a single chop might be cheaper than a conditional pattern in the s///g:
$ echo $'12345678\n123456789' | perl -lnE 's/(.{1,4})/$1 /g; chop; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<
You had the syntax almost right. Instead of just ?=., you need (?=.) (parens are part of the lookahead syntax). So:
s/((?:.{4})(?=.))/$1 /g
But you don't need the non-capturing grouping:
s/(.{4}(?=.))/$1 /g
And I think it is more clear if the capture doesn't include the lookahead:
s/(.{4})(?=.)/$1 /g
And given your example data, a non-word-boundary assertion works too:
s/(.{4})\B/$1 /g
Or using \K to automatically Keep the matched part:
s/.{4}\B\K/ /g
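The \B variants depend on the data consisting of word characters only. A short comparison sketch, with hypothetical sub names, shows how the two assertions differ once a non-word character such as - appears in the input:

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helpers wrapping the two substitutions for comparison.
sub group_with_lookahead   { my ($s) = @_; $s =~ s/(.{4})(?=.)/$1 /g; return $s }
sub group_with_nonboundary { my ($s) = @_; $s =~ s/(.{4})\B/$1 /g;    return $s }

for my $str ('12345678', '1234-567') {
    say "lookahead: >>", group_with_lookahead($str),   "<<";
    say "\\B:        >>", group_with_nonboundary($str), "<<";
}
```

For 12345678 both give 1234 5678. For 1234-567 the lookahead still splits after the fourth character, while \B refuses to match next to the dash and splits in a different place.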
To fix the regex I should write:
$str =~ s/(.{4}(?=.))/$1 /g;
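I just had to add parentheses around ?=. because they are part of the lookahead syntax. Without them, the ? is a quantifier that makes the preceding group optional, and =. matches a literal = followed by any character.
So we match four characters and append a space after them, but only if the lookahead confirms that at least one more character follows. For example, the regex will not match the string 1234.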
Just use a look ahead to see that you have at least one character remaining:
$ echo $'12345678\n123456789' | perl -lnE 's/.{4}\K(?=.{1})/ /g; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<

perl regex to remove dashes

I have some files I am processing, and I would like to remove the dashes from the non date fields.
I came up with s/([^0-9]+)-([^0-9]+)/$1 $2/g but that only works if there is one dash only in the string, or I should say it will only remove one dash.
So let's say I have:
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
What regex would I use to produce
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Don't do it with one regex. There is no requirement that a single regex must contain all of your code's logic.
Use one regex to see if it's a date, and then a second one to do your transformation. It will be much clearer to the reader (that's you, in the future) if you split it up into two.
#!/usr/bin/perl
use warnings;
use strict;
while ( my $str = <DATA>) {
chomp $str;
my $old = $str;
if ( $str !~ /^\d{4}-\d{2}-\d{2}$/ ) { # First regex to see if it's a date
$str =~ s/-/ /g; # Second regex to do the transformation
}
print "$old\n$str\n\n";
}
__DATA__
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
Running that gives you:
2014-05-01
2014-05-01
this-and
this and
this-and-that
this and that
this-and-that-and-that-too
this and that and that too
2015-01-01
2015-01-01
Using lookarounds:
$ perl -pe 's/
(?<!\d) # a negative look-behind with a digit: \d
- # a dash, literal
(?!\d) # a negative look-ahead with a digit: \d
/ /gx' file
OUTPUT
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Lookarounds are assertions that ensure, in this case, that there is no digit on either side of the -. A lookaround doesn't capture or consume anything; it only tests a condition. It's a good tool to have at hand.
See:
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
Lose the + - it's catching the string up until the last -, including any previous - characters:
s/([^0-9]|^)-+([^0-9]|$)/$1 $2/g;
Example: https://ideone.com/r2CI7v
As long as your program receives each field separately in the $_ variable, all you need is
tr/-/ / if /[^-\d]/
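A minimal sketch of that one-liner applied to a list of fields follows. It assumes the fields have already been split apart; the sub name undash_fields is hypothetical.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: translate dashes to spaces in every field that
# contains something other than digits and dashes.
sub undash_fields {
    my @fields = @_;                 # work on copies
    for (@fields) {
        tr/-/ / if /[^-\d]/;         # date-like fields have no such character
    }
    return @fields;
}

say join ',', undash_fields(qw( 2014-05-01 this-and this-and-that 2015-01-01 ));
```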
This should do it
$line =~ s/(\D)-/$1 /g;
As I explained in a comment, you really need to use Text::CSV to split each record into fields before you edit the data. That's because data that contain whitespace need to be enclosed in double quotes, so a field like this-and-that will start out without spaces, but needs them added when the hyphens are translated to spaces.
This program shows a simple example that uses your own data.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({eol => $/});
while (my $row = $csv->getline(\*DATA)) {
for (@$row) {
tr/-/ / unless /^\d\d\d\d-\d\d-\d\d$/;
}
$csv->print (\*STDOUT, $row);
}
__DATA__
2014-05-01,this-and-that,this-and-that,this-and-that-and-that-too,2015-01-01
output
2014-05-01,"this and that","this and that","this and that and that too",2015-01-01

perl replace space with tab

I would like to replace each two spaces from the beginning of each line, with a tab.
I tried the following:
s/^(\s{2})+/\t/gm;
It didn't work.
If you're reading the file line by line:
$line =~ s/\G[ ]{2}/\t/g;
If you've slurped the entire file:
$file =~ s/(?:\G|^)[ ]{2}/\t/mg;
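To see what the \G anchor does here, consider this minimal sketch (the sub name indent_to_tabs is hypothetical). \G matches only where the previous match ended, or at the start of the string for the first attempt, so only pairs of spaces that form an unbroken run from the beginning of the line are converted.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: turn each leading pair of spaces into a tab.
sub indent_to_tabs {
    my ($line) = @_;
    $line =~ s/\G[ ]{2}/\t/g;
    return $line;
}

for my $line ("      six", "   three", "none  here") {
    (my $shown = indent_to_tabs($line)) =~ s/\t/\\t/g;    # make tabs visible
    say $shown;
}
```

Six leading spaces become three tabs, three leading spaces become one tab plus one space, and the two spaces in the middle of the last string are left alone.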
Remember that the + quantifier means “one or more of,” and it’s applied to \s{2} which means “exactly two whitespace characters.” For a simple example, consider a program that creates strings of zero to ten spaces and attempts to match them against a similar pattern.
#! /usr/bin/env perl
use strict;
use warnings;
for (0 .. 10) {
$_ = " " x $_;
printf "%-13s %s\n", "[$_]:", /^(\s{2})+$/ ? "match!" : "no match.";
}
Output:
[]:           no match.
[ ]:          no match.
[  ]:         match!
[   ]:        no match.
[    ]:       match!
[     ]:      no match.
[      ]:     match!
[       ]:    no match.
[        ]:   match!
[         ]:  no match.
[          ]: match!
As written, your pattern substitutes a single TAB character for any positive even number of whitespace characters at logical beginning-of-line.
You don’t provide the broader context of your code. From the use of the /m and /g switches, I assume you have some hunk of text, perhaps the entire contents of a file, that you want to operate on as a whole. The program below simulates this assumed situation using a here-document and replaces the first two spaces only of each line with a TAB.
#! /usr/bin/env perl
use strict;
use warnings;
$_ = <<EOText;
   Three
  Two
    Four
     Five
Zero
 One
EOText
s/^  /\t/mg;
# for display purposes only
s/\t/\\t/g;
print;
Output:
\t Three
\tTwo
\t  Four
\t   Five
Zero
 One
Note that the extra commented s/// would not remain in your code. It is there to add contrast between space and TAB characters.
If this is the sole purpose of your program, it becomes an easy one-liner. To create a new file with the modified contents, use
$ perl -pe 's/^  /\t/' input-file >output-file
Editing in place looks like
$ perl -i.bak -pe 's/^  /\t/' input-file
How about this?
my $test_string = "  some test stuff\ndivided to\n  provide the challenge";
$test_string =~ s/^[ ]{2}/\t/gm;
print $test_string;
Explanation: \s is not an alias for a single symbol but a 'whitespace' character class: it includes both \n and \t, for example. If you want to replace only spaces, use spaces in your regexes; to me, putting the space in a character class (instead of just /^ {2}/) is more readable, and it won't break with the /x modifier.
Besides, if you want to replace just two space symbols, you don't need the + quantifier.
UPDATE: if you need to replace each two spaces, I guess I'd use this instead:
$test_string =~ s#^((?:[ ]{2})+)#"\t" x (length($1)/2)#gme;
... or just \G anchor as in the ikegami's answer.
As an alternative without the /m modifier, you can use a positive lookbehind. This approach can be useful when you need to check for something other than the beginning of a line, where the /m modifier would not help:
$_ = "  123\n  456\n  789";
s/(?:(?<=^)|(?<=\n))\s{2}/\t/g;
print $_;
In the example above, every (/g) pair of whitespace characters \s{2} that directly follows either the beginning of the string (?<=^) or a newline character (?<=\n) is replaced by a tab \t.

replace newlines within quoted string with \n

I need to write a quick (by tomorrow) filter script to replace line breaks (LF or CRLF) found within double quoted strings by the escaped newline \n. The content is a (broken) javascript program, so I need to allow for escape sequences like "ab\"cd" and "ab\\"cd"ef" within a string.
I understand that sed is not well suited for the job as it works per line, so I turn to perl, of which I know nothing :)
I've written this regex: "(((\\.)|[^"\\\n])*\n?)*" and tested it with the http://regex.powertoy.org. It indeed matches quoted strings with line breaks, however, perl -p -e 's/"(((\\.)|[^"\\\n])*(\n)?)*"/TEST/g' does not.
So my questions are:
how to make perl to match line breaks?
how to write the "replace-by" part so that it keeps the original string and only replaces newlines?
There is this similar question with awk solution, but it is not quite what I need.
NOTE: I usually don't ask "please do this for me" questions, but I really don't feel like learning perl/awk by tomorrow... :)
EDIT: sample data
"abc\"def" - matches as one string
"abc\\"def"xy" - match "abc\\" and "xy"
"ab
cd
ef" - is replaced by "ab\ncd\nef"
Here is a simple Perl solution:
s§
\G # match from the beginning of the string or the last match
([^"]*+) # till we get to a quote
"((?:[^"\\]++|\\.)*+)" # match the whole quote
§
$a = $1;
$b = $2;
$b =~ s/\r?\n/\\n/g; # replace what you want inside the quote
"$a\"$b\"";
§gex;
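For reference, the same idea can be sketched as a self-contained sub with ordinary brace delimiters. This is an illustration under assumptions, not the answer's exact code: the sub name escape_newlines_in_strings is hypothetical, the whole input is assumed to be in one string already, and every double quote outside a string literal is assumed to start one.

```perl
use strict;
use warnings;

# Hypothetical helper: escape line breaks inside double-quoted strings.
sub escape_newlines_in_strings {
    my ($text) = @_;
    # A string body is a run of ordinary characters or backslash escapes.
    # /s lets \\. also cover a backslash followed by a line break.
    $text =~ s{"((?:[^"\\]|\\.)*)"}{
        my $body = $1;                 # copy before the inner s/// resets $1
        $body =~ s/\r?\n/\\n/g;
        qq("$body");
    }gse;
    return $text;
}

my $sample = qq{x = "ab\ncd\r\nef";\ny = "abc\\"def";\n};
print escape_newlines_in_strings($sample);
```

The line breaks between the two statements are outside any string, so they are left untouched.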
Here is another solution in case you wouldn't want to use /e and just do it with one regex:
use strict;
$_=<<'_quote_';
hai xtest "aa xx aax" baix "xx"
x "axa\"x\\" xa "x\\\\\"x" ax
xbai!x
_quote_
print "Original:\n", $_, "\n";
s/
(
(?:
# at the beginning of the string match till inside the quotes
^(?&outside_quote) "
# or continue from last match which always stops inside quotes
| (?!^)\G
)
(?&inside_quote) # eat things up till we find what we want
)
x # the thing we want to replace
(
(?&inside_quote) # eat more possibly till end of quote
# if going out of quote make sure the match stops inside them
# or at the end of string
(?: " (?&outside_quote) (?:"|\z) )?
)
(?(DEFINE)
(?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
(?<inside_quote> (?:[^"\\x]++|\\.)*+ ) # handle escapes
)
/$1Y$2/xg;
print "Replaced:\n", $_, "\n";
Output:
Original:
hai xtest "aa xx aax" baix "xx"
x "axa\"x\\" xa "x\\\\\"x" ax
xbai!x
Replaced:
hai xtest "aa YY aaY" baix "YY"
x "aYa\"Y\\" xa "Y\\\\\"Y" ax
xbai!x
To work with line breaks instead of x, just replace it in the regex like so:
s/
(
(?:
# at the beginning of the string match till inside the quotes
^(?&outside_quote) "
# or continue from last match which always stops inside quotes
| (?!^)\G
)
(?&inside_quote) # eat things up till we find what we want
)
\r?\n # the thing we want to replace
(
(?&inside_quote) # eat more possibly till end of quote
# if going out of quote make sure the match stops inside them
# or at the end of string
(?: " (?&outside_quote) (?:"|\z) )?
)
(?(DEFINE)
(?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
(?<inside_quote> (?:[^"\\\r\n]++|\\.)*+ ) # handle escapes
)
/$1\\n$2/xg;
Until the OP posts some example content to test by, try adding the "m" (and possibly the "s") flag to the end of your regex; from perldoc perlreref (reference):
m Multiline mode - ^ and $ match internal lines
s match as a Single line - . matches \n
For testing, you might also find it useful to add the command-line argument "-i.bak" so that you keep a backup of the original file (with the extension ".bak").
Note also that if you want to group but not capture something you can use (?:PATTERN) rather than (PATTERN). Once you have your captured content, use $1 through $9 to access the stored matches from the matching section.
For more info see the link above as well as perldoc perlretut (tutorial) and perldoc perlre (full-ish documentation).
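A small sketch of what the two modifiers change, using a made-up two-line string:

```perl
use strict;
use warnings;
use v5.10;

my $text = "first\nsecond\n";

# /m: ^ and $ also match at the start and end of inner lines.
say "without /m: ", ($text =~ /^second$/  ? "match" : "no match");
say "with /m:    ", ($text =~ /^second$/m ? "match" : "no match");

# /s: . also matches a newline.
say "without /s: ", ($text =~ /first.second/  ? "match" : "no match");
say "with /s:    ", ($text =~ /first.second/s ? "match" : "no match");
```

In both pairs only the second line reports a match.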
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common;
$_ = '"abc\"def"' . '"abc\\\\"def"xy"' . qq("ab\ncd\nef");
print "before: {{$_}}\n";
s{($RE{quoted})}
{ (my $x=$1) =~ s/\n/\\n/g;
$x
}ge;
print "after: {{$_}}\n";
Using Perl 5.14.0 (install with perlbrew) one can do this:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use Regexp::Common qw/delimited/;
my $data = <<'END';
"abc\"def"
"abc\\"def"xy"
"ab
cd
ef"
END
my $output = $data =~ s/$RE{delimited}{-delim=>'"'}{-keep}/$1=~s!\n!\\n!rg/egr;
print $output;
I need 5.14.0 for the /r flag of the internal replace. If someone knows how to avoid this please let me know.