How to append after Regex Search in Perl script - regex

I'm trying to convert a MySQL dump into an SQLite database, for database migration. I need to edit the date to append a time, so for example 2018-09-19 should be converted to 2018-09-19 00:00:00.00. The reason for this format has to do with how our application works. This is the solution I came up with, but it doesn't work.
#!/usr/bin/perl
while (<>){
<Other Stuff>
....
s/([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))\[/$1[00:00:00.00]][/
print;
}
For testing I created a test.txt file containing just:
2019-03-06
And in command line or terminal I used the following command to test if the append works.
perl -pe 's/([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))\[/$1[00:00:00.00]][/' < test.txt > testout.txt
This gives a clear error of:
syntax error at -e line 1, near "00:" Execution of -e aborted due to compilation errors.
Using @dada's solution gives no error, but it also doesn't append 00:00:00.00 at the end of the line.
The expected output should be:
2019-03-06 00:00:00.00

Your problem statement says you want to turn:
2018-09-19
into:
2018-09-19 00:00:00.00
However, your code is:
s/([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))\[/$1[00:00:00.00]][/
Using /x we can write it a bit more legibly as:
s/
(
[12]\d{3} # year
- # hyphen
( 0[1-9] | 1[0-2] ) # month (saved as $2)
- # hyphen
( 0[1-9] | [12]\d | 3[01] ) # day (saved as $3)
) # save this as $1
\[ # square bracket
/$1[00:00:00.00]][/x
From this, it is clear that 2018-09-19 does not match because it does not end with a square bracket.
The replacement value is:
$1[00:00:00.00]][
This (tries to) say:
lookup index 00:00:00.00 in array @1 and substitute value
append ][
However this is not valid perl and not what you wanted anyway.
What is happening is that instead of $x + [y] (scalar followed by string value), perl is seeing $x[y] (value from array). To prevent this, either use braces ( ${x}[y] ) or escape the bracket ( $x\[y] ). This results in:
${1}[00:00:00.00]][
which is still not what the problem said was needed as the zeros are wrapped in brackets.
To get what you say you want, remove the \[ from the end of the search part and remove the unnecessary brackets from the replacement part:
s/
(
[12]\d{3}
- ( 0[1-9] | 1[0-2] )
- ( 0[1-9] | [12]\d | 3[01] )
)
# no bracket here
/$1 00:00:00.00/x; # no brackets here
Note that your code as given has another bug which is that the final print needs to be separated from the s/// by a semi-colon.
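For comparison, the same substitution can be sketched in Python, where \g<1> plays the role of Perl's ${1} in keeping the group reference separate from the text that follows it. This is only an illustrative sketch, not part of the Perl script: the name append_time is hypothetical, and the inner groups are made non-capturing because only the whole date is reused.

```python
import re

# Same date pattern as in the Perl answer; the month and day groups are
# non-capturing here because only the full date (group 1) is reused.
DATE_RE = re.compile(r"([12]\d{3}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]))")


def append_time(line):
    """Append a zero time to every YYYY-MM-DD date found in line."""
    # \g<1> is the unambiguous spelling of \1 (compare ${1} in Perl).
    return DATE_RE.sub(r"\g<1> 00:00:00.00", line)


if __name__ == "__main__":
    print(append_time("2019-03-06"))
```

Like the Perl version, this does not check whether a time already follows the date.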

Related

Regex only inside multiline match

I have an old app that generates something like:
USERLIST (
"jasonr"
"jameso"
"tommyx"
)
ROLELIST (
"op"
"admin"
"ro"
)
I need some form of regex that changes ONLY the USERLIST section to USERLIST("jasonr", "jameso", "tommyx") and the rest of the text remain intact:
USERLIST("jasonr", "jameso", "tommyx")
ROLELIST (
"op"
"admin"
"ro"
)
In addition to the multiline issue, I don't know how to handle the replacement in only part of the string. I've tried perl (-0pe) and sed, can't find a solution. I don't want to write an app to do this, surely there is a way...
perl -0777 -wpe'
s{USERLIST\s*\(\K ([^)]+) }{ join ", ", $1 =~ /("[^"]+")/g }ex' file
Prints the desired output on the shown input file. Broken over lines for easier view.
With -0777 switch the whole file is read at once into a string ("slurped") and is thus in $_. With /x modifier literal spaces in the pattern are ignored so can be used for readability.
Explanation
Capture what follows USERLIST (, up to the first closing parenthesis. This assumes there is no such parenthesis inside USERLIST( ... ). With \K, everything matched before it stays in the string (is not "consumed") and is excluded from $&, so we don't have to re-enter it on the replacement side.
The replacement side is evaluated as code, courtesy of the /e modifier. In it we capture all double-quoted substrings from the initial $1 capture (assuming no nested quotes) and join that list with a comma and a space. The resulting string is then used as the replacement for what was in the parentheses following USERLIST.
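The two steps described above (capture the parenthesised block, then rebuild it from the quoted items) can also be sketched in Python using a replacement function. This is a hedged illustration under the same assumptions as the Perl version (no ")" inside the block, no nested quotes). The name flatten_userlist is hypothetical. Unlike the \K version, it rewrites the whole USERLIST(...) construct, so the space before the parenthesis is dropped.

```python
import re


def flatten_userlist(text):
    """Collapse a multi-line USERLIST ( ... ) block onto a single line."""

    def repl(match):
        # Pull every double-quoted item out of the captured block.
        names = re.findall(r'"[^"]+"', match.group(1))
        return "USERLIST(" + ", ".join(names) + ")"

    # [^)]+ also matches newlines, so the block may span several lines.
    return re.sub(r"USERLIST\s*\(([^)]+)\)", repl, text)


if __name__ == "__main__":
    sample = 'USERLIST (\n  "jasonr"\n  "jameso"\n)\nROLELIST (\n  "op"\n)\n'
    print(flatten_userlist(sample), end="")
```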
With your shown samples in GNU awk please try following awk code.
awk -v RS='(^|\n)USERLIST \\(\n[^)]*\\)\n' '
RT{
sub(/[[:space:]]+\(\n[[:space:]]+/,"(",RT)
sub(/[[:space:]]*\n\)\n/,")",RT)
gsub(/"\n +"/,"\", \"",RT)
print RT
}
END{
printf("%s",$0)
}
' Input_file
Explanation: RS (the record separator) is set to the regex (^|\n)USERLIST \\(\n[^)]*\\)\n, so the USERLIST block itself acts as the separator and is available in RT. In the main block, if RT is not empty, substitute [[:space:]]+\(\n[[:space:]]+ with (, then [[:space:]]*\n\)\n with ), then every "\n +" with ", ", and finally print RT. The END block then prints the last record with printf to output the rest of the file.
Output will be as follows:
USERLIST("jasonr", "jameso", "tommyx")
ROLELIST (
"op"
"admin"
"ro"
)
This might work for you (GNU sed):
sed '/USERLIST/{:a;N;/^)$/M!ba;s/(\n\s*/(/;s/\n)/)/;s/\n\s*/, /g}' file
If a line contains USERLIST, gather up the list and format as required.
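The "gather up the list" idea can also be sketched line by line in Python. This is only an illustration of the technique, not a translation of the sed script: the function name join_userlist is hypothetical, and it assumes the block is closed by a line containing only ")".

```python
def join_userlist(lines):
    """Join the lines of a USERLIST ( ... ) block; pass other lines through."""
    out = []
    it = iter(lines)
    for line in it:
        if line.startswith("USERLIST"):
            items = []
            # Consume lines from the same iterator up to the closing ")".
            for inner in it:
                if inner.strip() == ")":
                    break
                items.append(inner.strip())
            out.append("USERLIST(" + ", ".join(items) + ")")
        else:
            out.append(line)
    return out


if __name__ == "__main__":
    demo = ['USERLIST (', '  "jasonr"', '  "jameso"', ')', 'ROLELIST (', '  "op"', ')']
    print("\n".join(join_userlist(demo)))
```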

Remove everything after a changing string [duplicate]

This question already has an answer here:
How to get first N parts of a path?
I'm having some trouble with the following problem;
As input, I get a few lines of paths to files as follows:
root/child/abc/somefile.txt
root/child/def/123/somefile.txt
root/child/ghijklm/somefile.txt
The root/child piece is always in the path, everything after can differ.
I would like to remove everything after the grandchild folder. So the output would be:
root/child/abc/
root/child/def/
root/child/ghijklm/
I've tried the following:
sed 's/\/child\/.*/\/child\/.*/'
But of course that would just give the following output:
root/child/.*
root/child/.*
root/child/.*
Any help would be appreciated!
with cut:
cut -d\/ -f1,2,3 file
With awk: Could you please try following, written and tested with shown samples in GNU awk.
awk 'match($0,/root\/child\/[^/]*/){print substr($0,RSTART,RLENGTH)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/root\/child\/[^/]*/){ ##Using match function to match root/child/... till next / in current line.
print substr($0,RSTART,RLENGTH) ##Printing the substring that starts at RSTART and is RLENGTH characters long.
}
' Input_file ##Mentioning Input_file name here.
With sed:
sed 's/.*\(root\/child\/[^/]*\).*/\1/' Input_file
Explanation: Using sed's substitution method to match root/child/ till next occurrence of / and saving it into temp buffer(back reference method) and substituting whole line with only matched back referenced value.
This might work for you (GNU sed):
sed -E 's/^(([^/]*[/]){3}).*/\1/' file
Delete everything after the third group of non-forward-slashes/slash.
You were close.
sed 's%\(/child/[^/]*\)/.*%\1%'
The regex [^/]* matches as many characters as possible which are not a slash; then we replace the entire match with just the part we captured in parentheses, effectively trimming off the rest.
With Perl:
perl -pe 's{ ^ ( ( [^/]+ / ){3} ) .* $ }{$1}x' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
The regex uses this modifier:
x : Disregard whitespace and comments, for readability.
The substitution statement, explained:
^ : beginning of the line.
$ : end of the line.
[^/]+ / : one or more characters that are not slashes (/), followed by a slash.
( [^/]+ / ){3} : one or more non-slash characters, followed by a slash, repeated exactly 3 times.
( ( [^/]+ / ){3} ) : the above, with parentheses to capture the matched part into the first capture variable, $1, to be used later in the substitution. Capture groups are counted left to right.
.* : zero or more occurrences of any character.
s{THIS}{THAT} : replace THIS with THAT.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
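For readers who want to try the same pattern outside Perl, here is a minimal Python sketch of the "keep the first three slash-terminated components" substitution. The function name trim_to_grandchild is hypothetical, and the inner group is non-capturing so that \1 refers to the whole prefix.

```python
import re

# Three runs of "non-slash characters followed by a slash", then the rest.
FIRST_THREE = re.compile(r"^((?:[^/]+/){3}).*$")


def trim_to_grandchild(path):
    """Keep only the first three path components, including the trailing slash."""
    return FIRST_THREE.sub(r"\1", path)


if __name__ == "__main__":
    for p in ("root/child/abc/somefile.txt", "root/child/def/123/somefile.txt"):
        print(trim_to_grandchild(p))
```

A path with fewer than three components does not match and is returned unchanged.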

defaulting unmatched backreferences in perl or sed

Is there a way to default back referenced variables $1, $2 and $3 here ?
start="a" hi="1" bye="2"
start="b" bye="3"
start="c" hi="4"
I am using this command to filter out:
perl -ne 'print if s/.*start="([^"]+).*?hi="([^"]+).*?bye="([^"]+).*/$1 $2 $3/g'
a 1 2
Is there a way to generate below result :
a 1 2
b null 3
c 4 null
I also searched for a way to default a backreferenced variable, but found no working solution. E.g., in bash we use ${var:-null} to default var to the string null.
The special number variables ($1 etc) get introduced for capture groups even if their subpatterns fail to match, when the capture groups are optional (otherwise the whole match fails if any one subpattern fails). Those without a match stay undef.
For example, if a pattern has three optional capture groups, like (...)?, then after the regex (or after the matching part in a substitution operator) there will exist all $1,$2,$3 variables, some possibly being undef if their subpattern didn't match (that ? still made those formally match, by there being zero occurrences of that pattern).
Then test each $N and, if it is undef, replace it with a desired phrase ('null' here):
perl -wnE'/
(?: start \s*=\s* "([^"]+)"\s* )?
(?: hi \s*=\s* "([^"]+)"\s* )?
(?: bye \s*=\s* "([^"]+)"\s* )? /x;
say join " ", map { $_//"null" } ($1,$2,$3)
' file
(broken over lines and spaced-out for readability)   Since each term has the same structure the pattern can be prepared far more flexibly from a list of expected words.†
For the given sample file this prints
a 1 2
b null 3
c 4 null
† This is an overkill for a specific case and in a one-liner but is useful in a more rounded script which may be used with different keyword sets, since all hard-coded input is in the definition of the input array (@w)
perl -wnE'
BEGIN {
@w = qw(start hi bye); # keywords to form a pattern with
$re = join " ",
map { q{(?:} . $_ . q{\s*=\s*"([^"]+)"\s*)?} } @w;
};
@m = /$re/x;
say join " ", map { $_//"null" } @m
' file
This prints the same for the given input file. In bash shell it can simply be copy-pasted as it stands; in other shells you may need to make it back into one line, and remove comments. (Given as a command-line program, "one"-liner, for easy testing.)
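The same "optional capture groups, then default the undefined ones" idea carries over to Python, where an optional group that did not participate in the match is reported as None. This is a sketch under the same assumption as the Perl version (keywords appear in the listed order). The names KEYWORDS, PATTERN and summarize are hypothetical.

```python
import re

KEYWORDS = ["start", "hi", "bye"]  # keywords to form a pattern with

# One optional non-capturing wrapper per keyword, each holding one capture group.
PATTERN = re.compile(
    "".join(rf'(?:{re.escape(k)}\s*=\s*"([^"]+)"\s*)?' for k in KEYWORDS)
)


def summarize(line):
    """Return the captured values, with 'null' for groups that did not match."""
    match = PATTERN.match(line)  # always matches: every part is optional
    return " ".join(g if g is not None else "null" for g in match.groups())


if __name__ == "__main__":
    for row in ('start="a" hi="1" bye="2"', 'start="b" bye="3"', 'start="c" hi="4"'):
        print(summarize(row))
```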
Something like:
$ perl -nE 'my %vals=();
while (m/(\w+)="([^"]+)"/g) { $vals{$1} = $2 }
printf "%s %s %s\n", $vals{start}, $vals{hi}//"null", $vals{bye}//"null"
' input.txt
a 1 2
b null 3
c 4 null
Splits up the input into individual key/value pairs, saves them in a hash table, and then prints out the values using the // operator, which returns the left hand argument if it's defined, otherwise the right hand argument.
Variation if start, hi and bye are the only keys you can have and they always appear in that order:
$ perl -ne 'm/start="([^"]+)"(?:\s+hi="([^"]+)")?(?:\s+bye="([^"]+)")?/;
printf "%s %s %s\n", $1, $2//"null", $3//"null"' input.txt
a 1 2
b null 3
c 4 null
This uses an uglier regular expression in which the hi and bye parts are optional matches.
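The key/value approach from this answer (collect every key="value" pair, then look the keys up with a default) might look like this in Python. It is a sketch only: the function name fields and the keys parameter are hypothetical, and unlike the ordered regex it does not care in which order the keys appear.

```python
import re


def fields(line, keys=("start", "hi", "bye")):
    """Print-ready values for keys, using 'null' for any key that is absent."""
    # findall returns (key, value) tuples, which dict() turns into a lookup table.
    vals = dict(re.findall(r'(\w+)="([^"]+)"', line))
    return " ".join(vals.get(k, "null") for k in keys)


if __name__ == "__main__":
    for row in ('start="a" hi="1" bye="2"', 'start="b" bye="3"', 'start="c" hi="4"'):
        print(fields(row))
```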
This might work for you (GNU sed):
sed -E 's/^/start=null hi=null bye=null\n/ # insert a template
:a # loop name
s/(\S+=)\S+(.*\n.*)\1"(\S+)"/\3\2/ # replace lookup with value
ta # repeat till failure
s/\S+=//g # remove any template
P # print
d' file # delete debris
Insert a template and loop replacing matches with original values.
When no more matches, remove any unmatched template keys and debris from the original line.

Gawk - Regexp - unable to get results

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To elaborate further: if the values in column 1 (field 1) are Vijay, Viijay and Vijayini, I want only Viijay to be returned. That is, only values in which a character ("i" in this example) is repeated consecutively, not merely "recurring" as in Vijayini, where "i" appears several times but never twice in a row.
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2
As awk regexes don't support backreferences in matching, you need to find the duplicated characters some other way. This script doubles every character in $1 and collects the results in a regex, which is then matched against the original string, i.e. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (note that it does not check whether an entry is already in re; you might want to add such a check, and there are further considerations in the comments):
$ awk '
{ # you should test for empty $1
re="(" # reset re
for(i=1;i<=length($1);i++) # for each char in $1
re=re (i==1?"":"|") (b=substr($1,i,1)) b # generate duplicated re entry
re=re ")" # terminating )
if($1~re) # match
print # and print if needed
}' file
Output:
Viijay 2
Ironically or exemplarily, it fails on Busybox awk, in which backreferences can be used:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2
Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
That might be GNU-only, google to check.
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters, as it's the only character that has a non-literal meaning when it's the first character in a bracket expression (i.e. negation), so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ differently because [\] means the same as [] which is an unterminated bracket expression: [ is the start, but ] is just the first character inside the bracket expression, not the ] needed to terminate it.
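To make the two strategies discussed in these answers concrete, here is a hedged Python sketch. The first function uses a backreference directly (which Python's re supports). The second mirrors the awk workaround by building an alternation of doubled characters from the name itself. All names here are hypothetical. re.escape takes the place of the manual escaping of ^ and \ described above.

```python
import re

# Backreference version: a lowercase letter immediately followed by itself.
ADJACENT = re.compile(r"([a-z])\1")


def has_adjacent_repeat(name):
    return ADJACENT.search(name) is not None


def doubled_alternation(name):
    """Build a regex such as 'VV|ii|jj|aa|yy' without using backreferences."""
    unique_chars = dict.fromkeys(name)  # de-duplicates, keeps first-seen order
    return "|".join(re.escape(ch) * 2 for ch in unique_chars)


def has_repeat_no_backref(name):
    # An empty name would give an empty regex that matches anything, so guard it.
    return bool(name) and re.search(doubled_alternation(name), name) is not None


if __name__ == "__main__":
    for n in ("Vijay", "Viijay", "Vijayini"):
        print(n, has_adjacent_repeat(n), has_repeat_no_backref(n))
```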

Regular Expression: Capture character pattern zero or one positions from start of string

I have a series of entries, which can be represented by this string:
my_string="-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"
For each entry, I need to return whether it starts with 'R' or 'D'. In order to do this, I need to ignore any character that comes before it. So, I wrote this regular expression:
for i in $my_string; do echo $i | grep -E -o "^*?[RD]"; done
However, this is only returning R or D for entries which are not preceded by a character.
How do I get this regex to return the R or D value in every case, whether there is a character in front of it or not? Keep in mind that the only thing which can be 'hard-coded' into the expression is the pattern to be matched.
It will be easy if you use sed:
sed -r 's/^.?([RD]).*$/\1/'
i.e.
for i in $my_string; do echo $i | sed -r 's/^.?([RD]).*$/\1/'; done
Update:
Here is what each part of the command means:
-r : extended regular expressions. I thought -e would work, but in my
testing I needed -r in order to use the (unescaped) capturing group
in the regex. Anyway, not the main point.
The script can be read as:
s/XXXX/YYYY/ : substitute XXXX with YYYY
The "from" pattern (XXXX) means:
^ : start of the line
.? : zero or one occurrence of any character
( : start of group
[RD] : either R or D
) : end of group (which means the group will contain either R or D)
.* : any number of any character
$ : till the end
the "to" pattern (YYYY):
\1 : content of capture group 1 in the "from" pattern (which is the "R or D")
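The pattern breakdown above can be checked with a short Python sketch that applies the same "optional leading character, then R or D" regex to each entry. The function name leading_letter is hypothetical. The sample string is the one from the question.

```python
import re

MY_STRING = (
    "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz "
    "R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz "
    "_R-K4_NNNN_M987_R1_001.gz"
)


def leading_letter(entry):
    """Return 'R' or 'D' if it is the first or second character, else None."""
    # re.match anchors at the start; .? first tries one character, then none.
    m = re.match(r".?([RD])", entry)
    return m.group(1) if m else None


if __name__ == "__main__":
    for entry in MY_STRING.split():
        print(leading_letter(entry))
```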
Use a parameter expansion to remove the prefix before using grep:
for i in $my_string; do echo ${i#[^RD]} | grep -o "^[RD]" ; done
or use a simple test without grep (since you already know that each item starts with a R or a D):
for i in $my_string; do
if [[ $i =~ ^[^D]?R ]] ; then
echo 'R'
else
echo 'D'
fi
done
This regex worked in my local tests. Please have a try:
^.?[RD]
I can't think of a way to ONLY return the letter you want. I'd have a command after to detect whether the returned string is greater than 1 character long, and if so, I'd return only the second character.
I'm not 100% sure what you are asking (I understood that you want to match only the R or D at the beginning of a filename, whatever the character before it, if there is one), but I think you should use a lookbehind. In PHP you would do:
$re = "/(?<=^\S|\s\S|\s)[RD]/";
$str = "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz";
preg_match_all($re, $str, $matches);
You can see the output here.
To use Perl syntax in bash you must enable it. https://unix.stackexchange.com/questions/84477/forcing-bash-to-use-perl-regex-engine
You can test your regexp here if you need https://regex101.com/r/vV3nS3/1
This does it when using the modifier 'g' for global: (^| ).?(R|D)
See the regex101 here
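As a rough check of the global-match idea, the pattern can be applied to the whole string at once in Python, where re.findall behaves like a global match. This is a sketch only: the first group is made non-capturing so that findall returns just the letter, and the variable names are hypothetical.

```python
import re

MY_STRING = (
    "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz "
    "R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz "
    "_R-K4_NNNN_M987_R1_001.gz"
)

# Start of string or a space, an optional extra character, then R or D.
ENTRY_LETTER = re.compile(r"(?:^| ).?([RD])")

letters = ENTRY_LETTER.findall(MY_STRING)

if __name__ == "__main__":
    print(letters)
```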