Regex only inside multiline match - regex

I have an old app that generates something like:
USERLIST (
"jasonr"
"jameso"
"tommyx"
)
ROLELIST (
"op"
"admin"
"ro"
)
I need some form of regex that changes ONLY the USERLIST section to USERLIST("jasonr", "jameso", "tommyx") and the rest of the text remain intact:
USERLIST("jasonr", "jameso", "tommyx")
ROLELIST (
"op"
"admin"
"ro"
)
In addition to the multiline issue, I don't know how to handle the replacement in only part of the string. I've tried perl (-0pe) and sed, can't find a solution. I don't want to write an app to do this, surely there is a way...

perl -0777 -wpe'
s{USERLIST\s*\(\K ([^)]+) }{ join ", ", $1 =~ /("[^"]+")/g }ex' file
Prints the desired output on the shown input file. Broken over lines for easier view.
With -0777 switch the whole file is read at once into a string ("slurped") and is thus in $_. With /x modifier literal spaces in the pattern are ignored so can be used for readability.
Explanation
Capture what follows USERLIST (, up to the first closing parenthesis. This assumes no such paren inside USERLIST( ... ). With \K lookbehind all matches prior to it stay (are not "consumed" out of the string) and are excluded from $&, so we don't have to re-enter them in the replacement side
The replacement side is evaluated as code, courtesy of /e modifier. In it we capture all double-quoted substrings from the initial $1 capture (assuming no nested quotes) and join that list by , . The obtained string is then used for the replacement for what was in the parentheses following USERLIST

With your shown samples in GNU awk please try following awk code.
awk -v RS='(^|\n)USERLIST \\(\n[^)]*\\)\n' '
RT{
sub(/[[:space:]]+\(\n[[:space:]]+/,"(",RT)
sub(/[[:space:]]*\n\)\n/,")",RT)
gsub(/"\n +"/,"\", \"",RT)
print RT
}
END{
printf("%s",$0)
}
' Input_file
Explanation: Setting RS(record separator) as (^|\n)USERLIST \\(\n[^)]*\\)\n for all lines of Input_file. Then in main program checking condition if RT is NOT NULL then substituting [[:space:]]+\(\n[[:space:]]+ with "(" and then substituting [[:space:]]*\n\)\n with ) and then substituting "\n +" with \" finally printing its value. Then in this program's END block printing line's value in printf function to get rest of the values.
Output will be as follows:
USERLIST("jasonr", "jameso", "tommyx")
ROLELIST (
"op"
"admin"
"ro"
)

This might work for you (GNU sed):
sed '/USERLIST/{:a;N;/^)$/M!ba;s/(\n\s*/(/;s/\n)/)/;s/\n\s*/, /g}' file
If a line contains USERLIST, gather up the list and format as required.

Related

Perl regex - print only modified line (like sed -n 's///p')

I have a command that outputs text in the following format:
misc1=poiuyt
var1=qwerty
var2=asdfgh
var3=zxcvbn
misc2=lkjhgf
etc. I need to get the values for var1, var2, and var3 into variables in a perl script.
If I were writing a shell script, I'd do this:
OUTPUT=$(command | grep '^var-')
VAR1=$(echo "${OUTPUT}" | sed -ne 's/^var1=\(.*\)$/\1/p')
VAR2=$(echo "${OUTPUT}" | sed -ne 's/^var2=\(.*\)$/\1/p')
VAR3=$(echo "${OUTPUT}" | sed -ne 's/^var3=\(.*\)$/\1/p')
That populates OUTPUT with the basic content that I want (so I don't have to run the original command multiple times), and then I can pull out each value using sed VAR1 = 'qwerty', etc.
I've worked with perl in the past, but I'm pretty rusty. Here's the best I've been able to come up with:
my $output = `command | grep '^var'`;
(my $var1 = $output) =~ s/\bvar1=(.*)\b/$1/m;
print $var1
This correctly matches and references the value for var1, but it also returns the unmatched lines, so $var1 equals this:
qwerty
var2=asdfgh
var3=zxcvbn
With sed I'm able to tell it to print only the modified lines. Is there a way to do something similar with in perl? I can't find the equivalent of sed's p modifier in perl.
Conversely, is there a better way to extract those substrings from each line? I'm sure I could match match each line and split the contents or something like that, but was trying to stick with regex since that's how I'd typically solve this outside of perl.
Appreciate any guidance. I'm sure I'm missing something relatively simple.
One way
my #values = map { /\bvar(?:1|2|3)\s*=\s*(.*)/ ? $1 : () } qx(command);
The qx operator ("backticks") returns a list of all lines of output when used in list context, here imposed by map. (In a scalar context it returns all output in a string, possibly multiline.) Then map extracts wanted values: the ternary operator in it returns the capture, or an empty list when there is no match (so filtering out such lines). Please adjust the regex as suitable.
Or one can break this up, taking all output, then filtering needed lines, then parsing them. That allows for more nuanced, staged processing. And then there are libraries for managing external commands that make more involved work much nicer.
A comment on the Perl attempt shown in the question
Since the backticks is assigned to a scalar it is in scalar context and thus returns all output in a string, here multiline. Then the following regex, which replaces var1=(.*) with $1, leaves the next two lines since . does not match a newline so .* stops at the first newline character.
So you'd need to amend that regex to match all the rest so to replace it all with the capture $1. But then for other variables the pattern would have to be different. Or, could replace the input string with all three var-values, but then you'd have a string with those three values in it.
So altogether: using the substitution here (s///) isn't suitable -- just use matching, m//.
Since in list context the match operator also returns all matches another way is
my #values = qx(command) =~ /\bvar(?:1|2|3)\s*=\s*(.*)/g;
Now being bound to a regex, qx is in scalar context and so it returns a (here multiline) string, which is then matched by regex. With /g modifier the pattern keeps being matched through that string, capturing all wanted values (and nothing else). The fact that . doesn't match a newline so .* stops at the first newline character is now useful.
Again, please adjust the regex as suitable to yoru real problem.
Another need came up, to capture both the actual names of variables and their values. Then add capturing parens around names, and assign to a hash
my %val = map { /\b(var(?:1|2|3))\s*=\s*(.*)/ ? ($1, $2) : () } qx(command);
or
my %val = qx(command) =~ /\b(var(?:1|2|3))\s*=\s*(.*)/g;
Now the map for each line of output from command returns a pair of var-name + value, and a list of such pairs can be assigned to a hash. The same goes with subsequent matches (under /g) in the second case..
In scalar context, s/// and s///g return whether it found a match or not. So you can use
print $s if $s =~ s///;

Replace newline in quoted strings in huge files

I have a few huge files with values seperated by a pipe (|) sign.
The strings our quoted but sometimes there is a newline in between the quoted string.
I need to read these files with external table from oracle but on the newlines he will give me errors. So I need to replace them with a space.
I do some other perl commands on these files for other errors, so I would like to have a solution in a one line perl command.
I 've found some other similar questions on stackoverflow, but they don't quite do the same and I can't find a solution for my problem with the solution mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } #$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and once the second line doesn't start with four digits followed with a | char (see the [0-9]\{4\}| POSIX BRE regex pattern), the or more line break between the two is replaced with a space. The search and replace repeats until no match or the end of file.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' <<< "$s"
With -0777, you slurp the file and the \R++(?!\d{4,}\|) pattern matches any one or more line breaks (\R++) not followed with four or more digits followed with a | char. The ++ possessive quantifier is required to make (?!...) negative lookahead to disallow backtracking into line break matching pattern.
With your shown samples, this could be simply done in awk program. Written and tested in GNU awk, should work in any awk. This should work fast even on huge files(better than slurping whole file into memory, having mentioned that OP may use it on huge files).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking condition if number of " are EVEN or not, because if they are NOT even then it means they are NOT closed properly.
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val $0;val=""} ##Else(if val NOT NULL) then print val current line and nullify val here.
next ##next will skip further statements from here.
}
1 ##In case number of " are EVEN in any line it will skip above condition(gusb one) and simply print the line.
' Input_file ##Mentioning Input_file name here.

How to find and replace a pattern string using sed/perl/awk?

I have a file foo.properties with contents like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
In my script, I need to replace whatever value is against ph (The current value is unknown to the bash script) and change it to 0.5. So the the file should look like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
I know it can be easily done if the current value is known by using
sed "s/\,ph\:0.03\,/\,ph\:0.5\,/" foo.properties
But in my case, I have to actually read the contents against allNames and search for the value and then replace within a for loop. Rest all is taken care of but I can't figure out the sed/perl command for this.
I tried using sed "s/\,ph\:.*\,/\,ph\:0.5\,/" foo.properties and some variations but it didn't work.
A simpler sed solution:
sed -E 's/([=,]ph:)[0-9.]+/\10.5/g' file
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
Here we match ([=,]ph:) (i.e. , or = followed by ph:) and capture in group #1. This should be followed by 1+ of [0-9.] character to natch any number. In replacement we put \1 back with 0.5
With your shown samples, please try following awk code.
awk -v new_val="0.5" '
match($0,/,ph:[0-9]+(\.[0-9]+)?/){
val=substr($0,RSTART+1,RLENGTH-1)
sub(/:.*/,":",val)
print substr($0,1,RSTART) val new_val substr($0,RSTART+RLENGTH)
next
}
1
' Input_file
Detailed Explanation: Creating awk's variable named new_val which contains new value which needs to put in. In main program of awk using match function of awk to match ,ph:[0-9]+(\.[0-9]+)? regex in each line, if a match of regex is found then storing that matched value into variable val. Then substituting everything from : to till end of value in val variable with : here. Then printing values as pre requirement of OP(values before matched regex value with val(edited matched value in regex) with new value and rest of line), using next will avoid going further and by mentioning 1 printing rest other lines which are NOT having a matched value in it.
2nd solution: Using sub function of awk.
awk -v newVal="0.5" '/^allNames=/{sub(/,ph:[^,]*/,",ph:"newVal)} 1' Input_file
Would you please try a perl solution:
perl -pe '
s/(?<=\bph:)[\d.]+(?=,|$)/0.5/;
' foo.properties
The -pe option makes perl to read the input line by line, perform
the operation, then print it as sed does.
The regex (?<=\bph:) is a zero-length lookbehind which matches
the string ph: preceded by a word boundary.
The regex [\d.]+ will match a decimal number.
The regex (?=,|$) is a zero-length lookahead which matches
a comma or the end of the string.
As the lookbehind and the lookahead has zero length, they are not
substituted by the s/../../ operator.
[Edit]
As Dave Cross comments, the lookahead (?=,|$) is unnecessary as long as the input file is correctly formatted.
Works with decimal place or not, or no value, anywhere in the line.
sed -E 's/(^|[^-_[:alnum:]])ph:[0-9]*(.[0-9]+)?/ph:0.5/g'
Or possibly:
sed -E 's/(^|[=,[:space:]])ph:[0-9]+(.[0-9]+)?/ph:0.5/g'
The top one uses "not other naming characters" to describe the character immediately before a name, the bottom one uses delimiter characters (you could add more characters to either). The purpose is to avoid clashing with other_ph or autograph.
Here you go
#!/usr/bin/perl
use strict;
use warnings;
print "\nPerl Starting ... \n\n";
while (my $recordLine =<DATA>)
{
chomp($recordLine);
if (index($recordLine, "ph:") != -1)
{
$recordLine =~ s/ph:.*?,/ph:0.5,/g;
print "recordLine: $recordLine ...\n";
}
}
print "\nPerl End ... \n\n";
__DATA__
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
output:
Perl Starting ...
recordLine: allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5 ...
Perl End ...
Using any sed in any shell on every Unix box (the other sed solutions posted that use sed -E require GNU or BSD seds):
a) if ph: is never the first tag in the allNames list (as shown in your sample input):
$ sed 's/\(,ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
b) or if it can be first:
$ sed 's/\([,=]ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5

How to append after Regex Search in Perl script

I'm trying to convert a MySQL dump into SQLite database, for database migration. I need to edit the date to append time, so for example 2018-09-19 should be converted to 2018-09-19 00:00:00.00. The reason for this format has to do with how our application works. This is the solution I came up with but it doesn't work.
#!/usr/bin/perl
while (<>){
<Other Stuff>
....
s/([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))\[/$1[00:00:00.00]][/
print;
}
For testing I created a test.txt file with just for testing
2019-03-06
And in command line or terminal I used the following command to test if the append works.
perl -pe 's/([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))\[/$1[00:00:00.00]][/' < test.txt > testout.txt
This gives a clear error of:
syntax error at -e line 1, near "00:" Execution of -e aborted due to compilation errors.
Using this #dada's solution that looks like this gives no error but also doesn't append the 00:00:00.00 at the end of the line
The Expected output should be
2019-03-06 00:00:00.00
Your problem statement says you want to turn:
2018-09-19
into:
2018-09-19 00:00:00.00
However, your code is:
s/([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))\[/$1[00:00:00.00]][/
Using /x we can write it a bit more legibly as:
s/
(
[12]\d{3} # year
- # hyphen
( 0[1-9] | 1[0-2] ) # month (saved as $2)
- # hyphen
( 0[1-9] | [12]\d | 3[01] ) # day (saved as $3)
) # save this as $1
\[ # square bracket
/$1[00:00:00.00]][/x
From this, it is clear that 2018-09-19 does not match because it does not end with a square bracket.
The replacement value is:
$1[00:00:00.00]][
This (tries to) say:
lookup index 00:00:00.00 in array #1 and substitute value
append ][
However this is not valid perl and not what you wanted anyway.
What is happening is that instead of $x + [y] (scalar followed by string value), perl is seeing $x[y] (value from array). To prevent this, either use braces ( ${x}[y] ) or escape the bracket ( $x\[y] ). This results in:
${1}[00:00:00.00]][
which is still not what the problem said was needed as the zeros are wrapped in brackets.
To get what you say you want, remove the \[ from the end of the search part and remove the unnecessary brackets from the replacement part:
s/
(
[12]\d{3}
- ( 0[1-9] | 1[0-2] )
- ( 0[1-9] | [12]\d | 3[01] )
)
# no bracket here
/$1 00:00:00.00/x; # no brackets here
Note that your code as given has another bug which is that the final print needs to be separated from the s/// by a semi-colon.

Substituting multiple occurrences of a character inside a grep match

I am trying to use TextWrangler to take a bunch of text files, match everything within some angle-bracket tags (so far so good), and for every match, substitute all occurrences of a specific character with another.
For instance, I'd like to take something like
xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz
match everything within <f> and </f> and then substitute all +'s with, say, *'s (but ONLY inside the "f" tag).
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
I think I can easily match "f" tags containing +'s with an expression like
<f>[^<]*\+[^<]*</f>
but I have no idea on how to substitute only a subclass of character for each match. I don't know a priori how many +'s there are in each tag.
I think I should run a regular expression for all matches of the first regular expression, but I am not really sure how to do that.
(In other words, I would like to match all +'s but only inside specific angle-bracket tags).
Does anyone have a hint?
Thanks a lot,
Daniele
In case you're OK with an awk solution:
$ awk '{
while ( match($0,/<f>[^<]*\+[^<]*<\/f>/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/\+/,"*",tgt)
$0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
print
}' file
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
The above will work using any awk in any shell on any UNIX box. It relies on there being no < within each <f>...</f> as indicated by your sample code. If there can be then include that in your example and we can tweak the script to handle it:
$ awk '{
gsub("</f>",RS)
while ( match($0,/<f>[^\n]*\+[^\n]*\n/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/\+/,"*",tgt)
$0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
gsub(RS,"</f>")
print
}' file
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz