So I'm working on my own little formatting correction script that uses Perl regex for substitution, but I can't get this one to match. I've used similar matching for other fixes but this one doesn't work and I can't figure out why.
# basically takes in a string to modify and the match and substitution strings
perlRegex(){
PERL_BADLANG=0 perl -le '
$string = '"'"''"${1}"''"'"';
$string =~ s/'"${2}"'/'"${3}"'/gm;
print "$string\n";
exit';
}
LINE_BREAK='\n'
# contents is the example below
EDITED=$(cat file.txt);
EDITED=$(perlRegex "${EDITED}" '(?<='"${LINE_BREAK}"')( +)([^{]+{$'"${LINE_BREAK}"')([^\s][^;]+;$)' '$1$2$1$1$3')
My current attempt is https://regex101.com/r/vgatOd/1 which gives me the output I want.
(?<=\n)( +)([^{]+{$\n)([^\s][^;]+;$)
to
$1$2$1$1$3
(?<=\n)( +) $1: copies the spaces at the beginning of the line
([^{]+{$\n) $2: captures the remaining content of the line with ending {
([^\s][^;]+;$) $3: captures the next line without a leading spaces, with ending ;
The substitution will add the spaces twice on before the second line.
Example input:
if (debug) {
Tools.DebugLine("Log");
}
Aim is to pad the Tools line to be at the correct column:
if (debug) {
Tools.DebugLine("Log");
}
Given the regex101 does what I would like it to do, I'm perplexed as to what part of it does not work in Perl regex.
Taken approach to make java file formatting isn't error prone, you should consider a better way to achieve the desired effect.
The chosen regular expression is quite excessive and mixes \n, , \s all of which falls under \s class.
The following demo code strips regular expression for simplification.
use strict;
use warnings;
my $data = do { local $/; <DATA> };
my $re = qr/([^\n]+?\{\s+)(\S+?;)(\s+\})/;
my $indent = ' ' x 8;
$data =~ s/$re/$1${indent}$2$3/gsm;
print $data;
__DATA__
if (debug) {
Tools.DebugLine("Log");
}
Output
if (debug) {
Tools.DebugLine("Log");
}
Please see Perl regular expressions
Solved by changing to just using \n instead of $ along with them
https://regex101.com/r/poQlwm/1
(?<=\n)([ ]+)([^{]+\{\n)([^\s][^;]+;\n)
Related
The following regular expression gives me proper results when tried in Notepad++ editor but when tried with the below perl program I get wrong results. Right answer and explanation please.
The link to file I used for testing my pattern is as follows:
(http://sainikhil.me/stackoverflow/dictionaryWords.txt)
Regular expression: ^Pre(.*)al(\s*)$
Perl program:
use strict;
use warnings;
sub print_matches {
my $pattern = "^Pre(.*)al(\s*)\$";
my $file = shift;
open my $fp, $file;
while(my $line = <$fp>) {
if($line =~ m/$pattern/) {
print $line;
}
}
}
print_matches #ARGV;
A few thoughts:
You should not escape the dollar sign
The capturing group around the whitespaces is useless
Same for the capturing group around the dot .
which leads to:
^Pre.*al\s*$
If you don't want words like precious final to match (because of the middle whitespace, change regex to:
^Pre\S*al\s*$
Included in your code:
while(my $line = <$fp>) {
if($line =~ /^Pre\S*al\s*$/m) {
print $line;
}
}
You're getting messed up by assigning the pattern to a variable before using it as a regex and putting it in a double-quoted string when you do so.
This is why you need to escape the $, because, in a double-quoted string, a bare $ indicates that you want to interpolate the value of a variable. (e.g., my $str = "foo$bar";)
The reason this is causing you a problem is because the backslash in \s is treated as escaping the s - which gives you just plain s:
$ perl -E 'say "^Pre(.*)al(\s*)\$";'
^Pre(.*)al(s*)$
As a result, when you go to execute the regex, it's looking for zero or more ses rather than zero or more whitespace characters.
The most direct fix for this would be to escape the backslash:
$ perl -E 'say "^Pre(.*)al(\\s*)\$";'
^Pre(.*)al(\s*)$
A better fix would be to use single quotes instead of double quotes and don't escape the $:
$ perl -E "say '^Pre(.*)al(\s*)$';"
^Pre(.*)al(\s*)$
The best fix would be to use the qr (quote regex) operator instead of single or double quotes, although that makes it a little less human-readable if you print it out later to verify the content of the regex (which I assume to be why you're putting it into a variable in the first place):
$ perl -E "say qr/^Pre(.*)al(\s*)$/;"
(?^u:^Pre(.*)al(\s*)$)
Or, of course, just don't put it into a variable at all and do your matching with
if($line =~ m/^Pre(.*)al(\s*)$/) ...
Try removing trailing newline character(s):
while(my $line = <$fp>) {
$line =~ s/[\r\n]+$//s;
And, to match only words that begin with Pre and end with al, try this regular expression:
/^Pre\w*al$/
(\w means any letter of a word, not just any character)
And, if you want to match both Pre and pre, do a case-insensitive match:
/^Pre\w*al$/i
I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my #matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parenthesis around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing _null with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my #text;
if ($text =~ /,/) {
#text = split /,(?:null_)?/, $text;
}
else {
#text = $text;
}
for (#text) {
print "1 is : $_\n";
push (#matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
I am reading in another perl file and trying to find all strings surrounded by quotations within the file, single or multiline. I've matched all the single lines fine but I can't match the mulitlines without printing the entire line out, when I just want the string itself. For example, heres a snippet of what I'm reading in:
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
so the output I'd like is
'Hello World!';
"chmod";
"This is a fun multiple line string, please match";
but I am getting:
'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
This is the code I am using to find the strings - all file content is stored in #contents:
my #strings_found = ();
my $line;
for(#contents) {
$line .= $_;
}
if($line =~ /(['"](.?)*["'])/s) {
push #strings_found,$1;
}
print #strings_found;
I am guessing I am only getting 'Hello World!'; correctly because I am using the $1 but I am not sure how else to find the others without looping line by line, which I would think would make it hard to find the multi line string as it doesn't know what the next line is.
I know my regex is reasonably basic and doesn't account for some caveats but I just wanted to get the basic catch most regex working before moving on to more complex situations.
Any pointers as to where I am going wrong?
Couple big things, you need to search in a while loop with the g modifier on your regex. And you also need to turn off greedy matching for what's inside the quotes by using .*?.
use strict;
use warnings;
my $contents = do {local $/; <DATA>};
my #strings_found = ();
while ($contents =~ /(['"](.*?)["'])/sg) {
push #strings_found, $1;
}
print "$_\n" for #strings_found;
__DATA__
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
Outputs
'Hello World!'
"chmod"
"This is a fun
multiple line string, please match"
You aren't the first person to search for help with this homework problem. Here's a more detailed answer I gave to ... well ... you ;) finding words surround by quotations perl
regexp matching (in perl and generally) are greedy by default. So your regexp will match from 1st ' or " to last. Print the length of your #strings_found array. I think it will always be just 1 with the code you have.
Change it to be not greedy by following * with a ?
/('"*?["'])/s
I think.
It will work in a basic way. Regexps are kindof the wrong way to do this if you want a robust solution. You would want to write parsing code instead for that. If you have different quotes inside a string then greedy will give you the 1 biggest string. Non greedy will give you the smallest strings not caring if start or end quote are different.
Read about greedy and non greedy.
Also note the /m multiline modifier.
http://perldoc.perl.org/perlre.html#Regular-Expressions
I get a question about parse a vector has strings like this:
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"
I want to get:
"M92R R236G"
"G98K"
"M34* G87K M389L"
When I use
while ($info1=~s/^(.*)\:(([A-Z\*]){1}([\d]+)([A-Z\*]){1})\,//)
{
$pos=$2;
}
the result $pos only give me the last one for each row, that is:
"R236G"
"G98K"
"M389L"
How should I correct the script?
The reason your code isn't working is that you have a greedy ^(.*) at the start of of the regular expression. That will take up as much of the target string as possible as long as the rest of the pattern matches, so you will find only the last occurrence of the substring. You can fix it by just changing it to a non-greedy pattern ^(.*?).
A few other notes on your regular expression:
There is no need to escape : or ,, or * when it is inside a character class [...]
There is never a need for the quantifier {1} as that is the effect of a pattern without a quantifier
There is no need to put \d inside a character class [\d], as it works fine on its own
There is no need to enclose subpatterns in parentheses unless you need access to whatever substring matched that subpattern when the match succeeds. So, for instance ^.* is fine without the parentheses
This modification of your code works identically to yours, but is very much more concise
while ($info1 =~ s/^.*?:([A-Z*]\d+[A-Z*]),// ) {
my $pos = $1;
...
}
But the best solution is to use a global match that finds all occurrences of a pattern within a string, and doesn't need to modify the string in the process.
This program does what you describe. It just looks for all the alphanumeric or asterisk strings that follow a colon in each record.
use strict;
use warnings;
while (<DATA>) {
my #fields = /:([A-Z0-9*]+)/g;
print "#fields\n";
}
__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"
output
M92R R236G
G98K
M34* G87K M389L
Using a one-liner :
$ perl -ne 'print q/"/ . join(" ", m/:([^,]+),/g) . qq/"\n/' file
"M92R R236G"
"G98K"
"M34* G87K M389L"
In a script :
$ perl -MO=Deparse -ne 'print "\042" . join(" ", m/:([^,]+),/g) . "\042\n"' file
script :
LINE: while (defined($_ = <ARGV>)) {
print '"' . join(' ', /:([^,]+),/g) . qq["\n];
}
You can use as regex a colon and some alpanumerics characters, use an array to save them and print at the end of the loop. Here you have an example:
#!/usr/bin/env perl;
use strict;
use warnings;
my (#data);
while ( <DATA> ) {
while ( m/:([[:alnum:]*]+)/g ) {
push #data, $1;
}
printf qq|"%s"\n|, join q| |, #data;
undef #data;
}
__DATA__
"chr1-247751935-G-.:M92R,chr1-247752366-G-.:R236G,"
"chr1-247951785-G-.:G98K,"
"chr13-86597895-S-78:M34*,chr13-56891235-S-8:G87K,chr13-235689125-S-7:M389L,"
Run it like:
perl script.pl
That yields:
"M92R R236G"
"G98K"
"M34* G87K M389L"
I'm using this regex in Perl to match and replace the following expressions:
_HI2_
_HI_2
HI2_
_HI_2
if ($subject =~ m/_?HI2?_?|HI2?_?/) {
# Successful match
} else {
# Match attempt failed
}
I also want to do this though:
The text is: ABCDEMAFGHIJ
This is a sequence HI in there but must be ignored because if you look left you can see that this line starts with The text is:.
The text is: ABCDEHI2FGHI
As above, two sequence of HI here.
How can I build into this regex a match and ignore it because of a line prefix?
Why not just match twice?
If $subject does not match /^The text is:/, run the replace ..
Try this regex:
/^(?!The text is:).*(?:_?HI2?_?|HI2?_?)/
Or use two matches like:
if($subject !~ /^This text is:/i && $subject =~ /_?HI2?_?|HI2?_?/)
I just discovered this brilliant resource here and the section on Perl.
You can find there details of a (*SKIP)(*F) construct which will blow your mind; your described problem as a one-liner:
cat > test.txt <<EOF
_HI2_
_HI_2xxxHI_2
The text is: ABCDEMAFGHIJ
HI2_
The text is: ABCDEHI2FGHI
_HI_2
EOF
perl -ne '/^The text is:.*$(*SKIP)(*F)|.+/ && s/_?HI_?2?_?/HAPPY/; print' test.txt
# or
perl -ne 's/(^The text is:.*$)(*SKIP)(*F)|_?HI_?2?_?/HAPPY/g; print' test.txt
I have new found love and respect for Perl; Sed is my go-to, but now I know how to skip lines (read: leave unchanged) in Perl, I will hesitate less
Try telling it is the start of the line with "^", ignore whitespaces if that you think is needed(I always tend to do it). Also you could mark the end of the string with "$"
if ($subject =~ m/^\s*_?HI2?_?|HI2?_?/) {
# Successful match
} else {
# Match attempt failed
}
Not the most elegant method but easy to understand (TIMTOWTDI :)
#!/usr/bin/perl
use strict;
use warnings;
my #text = ("ABCDEHI2FGHI", "The text is: ABCDEHI2FGHI");
for (#text) {
my $new = my_replace($_); # do the replacement
print "$new\n"; # print result
}
sub my_replace {
my ($text) = #_;
return $text if ($text =~ m/The text is:/); # return if prefixed / no replacement
$text =~ s/(_?HI2?_?|HI2?_?)/__replacement__/g; # do replace (give a replacement string here)
return $text; # return result of replacement
}
Otherwise you can use a "negative lookbehind".
To try see regex101 or debuggex.
/(?<!^The text is.*)(_?HI2?_?|HI2?_?)/