Regex to search, replace and repeat while there's a match? - regex
I'm trying to come up with a way to do the following. Given a line of text like (actual Fortran77 code, these declarations begin at column 7)
CHARACTER FOO*1, BAR*2, OTHER*6
replace it by
CHARACTER*1 FOO
CHARACTER*2 BAR
CHARACTER*6 OTHER
I tried this
perl -pe '1 while s/^[^*][ ]+CHARACTER[ ]+([a-z-A-Z0-9_]+)\*([0-9]+)(,?)(.*)/ CHARACTER\*$2 $1\n CHARACTER $4/' foo.f
The [^*] bit is intentional (I ignore lines with * at column 1). The output I get is
CHARACTER*1 FOO
CHARACTER BAR*2, OTHER*6
I already see I'll have a problem when only CHARACTER OTHER*6 remains, but I haven't even got there yet. Any help with also this problem would be appreciated.
What should I do to end having the original line replaced by the three lines? Thanks.
perl -ple 'if (/^( +CHARACTER) +(\w+\*\d+(?:, *\w+\*\d+)*) *$/) { my $p = $1; $_ = join "\n", map /(\w+)\*(\d+)/ ? "$p*$2 $1" : die("wtf: $_"), split /,/, $2; }' foo.f
Man, this looks awful in one line. Formatted:
perl -ple '
if (/^( +CHARACTER) +(\w+\*\d+(?:, *\w+\*\d+)*) *$/) {
my $p = $1;
$_ =
join "\n",
map /(\w+)\*(\d+)/ ? "$p*$2 $1" : die("wtf: $_"),
split /,/,
$2;
}
' foo.f
First thing that came to my mind:
# Open file containing the line and loop through lines (I put only one line)
open (INFH, '<', "num.txt");
while(<INFH>){
#arr = split(',',$_);
foreach $arrEle (#arr){
if ($arrEle =~ /CHARACTER/){
$arrEle =~ s/(\s+)(\w+)(\s+)(\w+)(\*)(\d+)/$1$2$5$6$3$4/;
}else{
$arrEle =~ s/(\s+)(\w+)(\*)(\d+)/\tCHARACTER$3$4$1$2/;
}
print "$arrEle\n";
}
}
I will update with more details as we discuss in the comments section.
This should work with a line any big, like this:
CHARACTER FOO*1, BAR*2, OTHER*6, OTHERS*7, OTHERS*8 and so on
Your original try is just missing one thing: ^ by default only matches at the beginning of the string, but you want to match at the beginning of a line in the middle of the string. Change your s/.../.../ to s/.../.../m.
Then you still have to deal with the final substitution; either make the , mandatory to match and add a second substitution to deal with the move of the final *length or use /e and make what's being substituted an expression that does different things when $3 is empty.
Related
Repeating regex pattern
I have a string such as this word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl> where, if there is one ore more words enclosed in tags. In those instances where there are more than one words (which are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be: word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl> So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so: perl -pe 's///g;' in out This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... : /<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/ It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated? EDIT: This is part of a workflow that does some other transformations within a shell script: #!/bin/sh perl -pe '# s/replace/me/g; s/replace/me/g; ' $1 > tmp ... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as easier to maintain use warnings; use strict; use feature 'say'; my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>); my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter) $str =~ s{$tag_re}{ my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag) $t =~ s/(\w+)/$o$1$c/g; $t; }ge; say $str; The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code. One way to provide input for a script is via command-line arguments, available in #ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded my $str = q(...); with my $str = shift #ARGV; # first argument on the command line and then use that script in your shell script as #!/bin/sh ... script.pl $1 > output_file where $1 is the shell variable as shown in the "Edit" to the question. In a one-liner echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" | perl -wpe' s{(<[^>]+>) (.+?) (</[^>]+>)} {($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex; ' what in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from #ARGV and drop the -n switch, and add a print #!/bin/sh ... perl -wE'$_=shift; ...; say' $1 > output_file where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed. The shift takes an element off of an array's front and returns it. Without an argument it does that to #ARGV when outside a subroutine, as here (inside a subroutine its default target is #_).
This will do it: s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g; The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word". The s/ at the beginning is a substitution, so the command would be something like: cat in | perl -pne 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;$_' > out You need the $_ at the end because the result of the global substitution is the number of substitutions made. This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like: my $e = $in; while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){ my $p = $1; my $s = $2; print($p); $s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g; print($s); $e = $'; # ' (stop markup highlighter) } print($e); You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.
Perl grep a multi line output for a pattern
I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it. Multiline text in $output looks like this _skv_version=1 COMPONENTSEQUENCE=C1- BEGIN_C1 COMPONENT=SecurityJNI TOOLSEQUENCE=T1- END_C1 CMD_ID=null CMD_USES_ASSET_ENV=null_jdk1.7.0_80 CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80 BEGIN_C1_T1 CMD_ID=msdotnet_VS2013_x64 CMD_ID=ant_1.7.1 CMD_FILE=path/to/abcI.vc12.sln BEGIN_CMD_OPTIONS_RELEASE -useideenv The code I am using to grep for the pattern use strict; use warnings; my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV="; my #matching_lines; my $output = `cmd to get output` ; print "output is : $output\n"; if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) { print "1 is : $1\n"; push (#matching_lines, $1); } I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results. Desired output jdk1.7.0_80 ivy jdk1.7.3_80 msdotnet_VS2013_x64 ant_1.7.1
Regarding your regular expression: You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end) You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space You need parenthesis around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression You can also simplify the (\w+([\.]?\w+)*) bit down to (.+). The result would be: while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) { print "1 is : $1\n"; push (#matching_lines, $1); } That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing _null with something like: while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) { my $text = $1; my #text; if ($text =~ /,/) { #text = split /,(?:null_)?/, $text; } else { #text = $text; } for (#text) { print "1 is : $_\n"; push (#matching_lines, $_); } } The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-) (I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
Matching string in perl which is parsed with awk
I did next with awk: awk '/secon/ {print $1}' vladtest.sh |cut -c2-6 How to match this string in perl loop on appropriate way: my $var1= `awk '/secon/ {print $1}' vladtest.sh |cut -c2-6` if ($var1 eq "secon"){ print "OK"; } else { print "FALSE" }
First of all, your problem is the fact that the result of `...` includes the newline. So $var1 is not "secon", but "secon\n". You could deal with that any number of ways - wrapping chomp(...) around the whole assignment to $var1, or changing the right side of the eq to include the "\n", or using a regex instead: if ($var1 =~ /^secon$/) works with or without the chomp, because $ matches at a string-final newline if present. Second of all, you're using about two programs too many here: my $var1= `awk '/secon/ {print $1}' vladtest.sh |cut -c2-6` I mean,awk can do anything cut can do, and perl can do anything either can do. When I'm typing one-liners at the shell prompt, I'm a big fan of awk and cut, simply for their economy of expression; within their respective specialities, their command lines are generally shorter than the equivalent (even using perl -a etc). But since you're already in a Perl program, there's no reason to shell out to something else here. Here's an example drop-in replacement for the above: my $var1; { open(my $fh, '<', 'vladtest.sh'); while (<$fh>) { $var1 .= substr(split(' ')[0],1,5)."\n" if /secon/; } } But you don't need to go through all that if you just want to detect if there's a match. {open(my $fh, '<', 'vladtest.sh'); print 'NOT ' unless grep { /^\s*\Ssecon/ } <$fh>; say 'OK';} # or print "OK\n"; if you don't have 5.10 extensions enabled.
My awk's a bit rusty, but you're grabbing $1 then looking for characters 2 through 6 to match 'secon'. If any line matches this, then print "OK", otherwise "NOT OK". Here's a perl only solution: use 5.10.0; while (<>) { if (/^\s*\Ssecon/) { say 'OK'; exit 0; } } say 'NOT OK'; exit 1; The regexp is made up of: ^ match the start of each line, followed by \s* zero or more whitespace characters, followed by \S one non-whitespace character, followed by secon the literal string that you're interested in As soon as we have a line that matches, we can print 'OK' and exit. Otherwise we'll fall through the end of the loop and print 'NOT OK'. The use 5.10.0; line is needed so you can use say().
There are better ways to implement this request, but follow your current idea , change the first line to resolve the issue. my $var1= `awk '$1~/secon/ {print substr($1,2,5)}' vladtest.sh`
Print line of pattern match in Perl regex
I am looking for a keyword in a multiline input using a regex like this, if($input =~ /line/mi) { # further processing } The data in the input variable could be like this, this is multi line text to be matched using perl The code works and matches the keyword line correctly. However, I would also like to obtain the line where the pattern was matched - "multi line text" - and store it into a variable for further processing. How do I go about this? Thanks for the help.
You can grep out the lines into an array, which will then also serve as your conditional: my #match = grep /line/mi, split /\n/, $input; if (#match) { # ... processing }
TLP's answer is better but you can do: if ($input =~ /([^\n]+line[^\n]+)/i) { $line = $1; }
I'd look if the match is in the multiline-String and in case it is, split it into lines and then look for the correct index number (starting with 0!): #!/usr/bin/perl use strict; use warnings; my $data=<<END; this is line multi line text to be matched using perl END if ($data =~ /line/mi){ my #lines = split(/\r?\n/,$data); for (0..$#lines){ if ($lines[$_] =~ /line/){ print "LineNr of Match: " . $_ . "\n"; } } }
Did you try his? This works for me. $1 represents the capture of regex inside ( and ) Provided there is only one match in one of the lines.If there are matches in multiple lines, then only the first one will be captured. if($var=~/(.*line.*)/) { print $1 } If you want to capture all the lines which has the string line then use below: my #a; push #a,$var=~m/(.*line.*)/g; print "#a";
How do I use Perl to intersperse characters between consecutive matches with a regex substitution?
The following lines of comma-separated values contains several consecutive empty fields: $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n" I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution. I tried this first of all: $rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A' which returned 2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string. I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out: $rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A' which resulted in: 2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n That didn't work either. It just shifted the comma-pairings by one. I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions? The final string should look like this: 2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings: #!/usr/bin/perl use strict; use warnings; use autodie; my $str = <<EO_DATA; 2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,, EO_DATA open my $str_h, '<', \$str; while(my $row = <$str_h>) { chomp $row; print join(',', map { length $_ ? $_ : 'N/A'} split /,/, $row, -1 ), "\n"; } Output: E:\Home> t.pl 2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A You can also use: pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g; Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use $str =~ s{,(,|\n)}{,N/A$1}g; Therefore, I used a loop to move pos $str back by a character after each successful substitution. Now, as #ysth shows: $str =~ s!,(?=[,\n])!,N/A!g; would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind. Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline: s!,(?=[,\n])!,N/A!g; Example: my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"; use Data::Dumper; $Data::Dumper::Useqq = $Data::Dumper::Terse = 1; print Dumper($rawData); $rawData =~ s!,(?=[,\n])!,N/A!g; print Dumper($rawData); Output: "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n" "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for (?<=,)(?=,|$) and replace that with N/A. This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version: my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear 2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"; while ($rawData =~ s/,,/,N\/A,/g) {}; print $rawData; Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either: $string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1); The ,-1 is needed at the end to force split to include any empty fields at the end of the string.