Why does this bash/sed call work? - regex

I've been looking at examples of using sed to extract a substring using regex and I have a test script working. Problem is I don't understand why and would like to. Here's the script:
#!/bin/bash
string=" ID : s0016b54e23bc.ab.cd.efghig\
Name : cd167095"
echo -e "string: '$string'"
name=`echo $string | sed 's/.*\(cd.*\)/\1/'`
echo -e "\nExtracted: $name"
And it outputs:
string: ' ID : s0016b54e23bc.ab.cd.efghigName : cd167095'
Extracted: cd167095
The regex should have two matches:
cd.efghigName : cd167095
and
cd167095
Why is the second match returned?

Because it's "greedy".
The first .* matches as much as it can while still letting the expression as a whole succeed, so it runs past the first cd and stops only at the last one.
To see this, change the second cd to ef or something, and you will see the script return the first.
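The experiment described above can be sketched as follows. This is only an illustration: the variable names (last, altered, first) are made up here, and the input is the one-line string that the script in the question actually prints.

```shell
string=' ID : s0016b54e23bc.ab.cd.efghigName : cd167095'

# Greedy .* consumes as much as it can, so \(cd.*\) ends up on the LAST "cd".
last=$(printf '%s\n' "$string" | sed 's/.*\(cd.*\)/\1/')

# Same expression, but the second "cd" has been changed to "ef":
# now the only "cd" left is the first one, so that is what gets captured.
altered=' ID : s0016b54e23bc.ab.cd.efghigName : ef167095'
first=$(printf '%s\n' "$altered" | sed 's/.*\(cd.*\)/\1/')

printf '%s\n' "$last" "$first"
```

The first line printed is cd167095 and the second is cd.efghigName : ef167095.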
Now, if you use something like Ruby, Python, or Perl, you get more elaborate regular expressions, and you can use .*? which is the "non-greedy" form of .*.
#!/usr/bin/env ruby
string=" ID : s0016b54e23bc.ab.cd.efghig\
Name : cd167095"
puts string.gsub /.*?(cd.*)/, '\1'
so ross$ ./qq3
cd.efghigName : cd167095
Though really, I would just write:
string[/cd.*/]

Related

regex that works with sed not honored with ${var//search/replace}

I am trying to simply do a regular expression replace in bash but cannot figure it out. In my test, I would like the following string transformed:
test_data(123)
to
test_xyz
I've tried the following:
echo "test_data(123)" | sed -e 's/.*\(data(.*)\).*/xyz/g'
And that gets me: xyz
Then I tried:
var=${"test_data(123)"//.*\(data(.*)\).*/xyz}
But I get an error - bad substitution
How do I get my desired results on the regex replace in bash?
${foo//$match/$replace} uses fnmatch (glob-style) patterns, not any form compatible with BRE/ERE/PCRE or other conventional regex syntax formats.
input="test_data(123)"
match='data(*)'
replace='xyz'
result=${input//$match/$replace}
echo "$result"
...properly emits test_xyz.

Conditional in perl regex replacement

I'm trying to return different replacement results with a perl regex one-liner if it matches a group. So far I've got this:
echo abcd | perl -pe "s/(ab)(cd)?/defined($2)?\1\2:''/e"
But I get
Backslash found where operator expected at -e line 1, near "1\"
(Missing operator before \?)
syntax error at -e line 1, near "1\"
Execution of -e aborted due to compilation errors.
If the input is abcd I want to get abcd out, if it's ab I want to get an empty string. Where am I going wrong here?
You used regex atoms \1 and \2 (match what the first or second capture captured) outside of a regex pattern. You meant to use $1 and $2 (as you did in another spot).
Furthermore, dollar signs inside double-quoted strings have meaning to your shell. It's best to use single quotes around your program[1].
echo abcd | perl -pe's/(ab)(cd)?/defined($2)?$1.$2:""/e'
Simpler:
echo abcd | perl -pe's/(ab(cd)?)/defined($2)?$1:""/e'
Simpler:
echo abcd | perl -pe's/ab(?!cd)//'
[1] Either avoid single-quotes in your program[2], or use '\'' to "escape" them.
[2] You can usually use q{} instead of single-quotes. You can also switch to using double-quotes. Inside of double-quotes, you can use \x27 for an apostrophe.
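The quoting point can be checked without running Perl at all, by looking at what the shell hands over as the program text. A minimal sketch, assuming bash or any POSIX shell; the positional parameters "first" and "second" are hypothetical values set only to make the expansion visible.

```shell
# Hypothetical positional parameters, set only so the expansion is visible.
set -- first second

# Double quotes: the shell replaces $2 before perl ever sees the program.
double_quoted="s/(ab)(cd)?/defined($2)?\1\2:''/e"

# Single quotes: the text reaches perl untouched.
single_quoted='s/(ab)(cd)?/defined($2)?$1.$2:""/e'

printf '%s\n' "$double_quoted" "$single_quoted"
```

In the double-quoted version, defined($2) has already become defined(second) by the time Perl would parse it; in the single-quoted version $1 and $2 survive for Perl to interpret.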
Why torture yourself, just use a branch reset.
Find (?|(abcd)|ab())
Replace $1
And a couple of even better ways
Find abcd(*SKIP)(*FAIL)|ab
Replace ""
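Find (?:abcd)*+\Kab (the possessive *+ matters: with a plain * the engine backtracks to zero repetitions and matches the ab inside abcd)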
Replace ""
These use regex wisely.
There is really no need nowadays to have to use the eval form of the regex substitution construct s///e in conjunction with defined(). This is especially true when using the perl command line.
Good luck...

Regular Expression: Capture character pattern zero or one positions from start of string

I have a series of entries, which can be represented by this string:
my_string="-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"
For each entry, I need to return whether it starts with 'R' or 'D'. In order to do this, I need to ignore any character that comes before it. So, I wrote this regular expression:
for i in $my_string; do echo $i | grep -E -o "^*?[RD]"; done
However, this is only returning R or D for entries which are not preceded by a character.
How do I get this regex to return the R or D value in every case, whether there is a character in front of it or not? Keep in mind that the only thing which can be 'hard-coded' into the expression is the pattern to be matched.
It will be easy if you use sed:
sed -r 's/^.?([RD]).*$/\1/'
i.e.
for i in $my_string; do echo $i | sed -r 's/^.?([RD]).*$/\1/'; done
Update:
Here is what each part of the command means:
-r : use extended regular expressions. I thought -e would be enough, but
in my testing the unescaped capturing group ( ) only worked with -r
(without it, sed expects basic syntax, where a group is written \( \)). Anyway, not the main point
The script can be read as:
s/XXXX/YYYY/ : substitute YYYY for whatever matches XXXX
The "from" pattern (XXXX) means:
^ : start with
.? : zero or one occurrence of any character
( : start of group
[RD] : either R or D
) : end of group (which means the group will contain either R or D)
.* : any number of any character
$ : till the end
the "to" pattern (YYYY):
\1 : content of capture group 1 in the "from" pattern (which is the "R or D")
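Putting the pieces of that breakdown together on the sample data from the question gives the sketch below. It uses -E, which GNU sed accepts as a synonym for -r and which BSD sed also understands; the variable names letters and letter are made up for the illustration.

```shell
my_string="-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"

letters=""
for i in $my_string; do
  # ^.? skips at most one leading character, ([RD]) captures the letter,
  # .*$ swallows the rest so that only \1 is left on the line.
  letter=$(printf '%s\n' "$i" | sed -E 's/^.?([RD]).*$/\1/')
  letters="$letters$letter"
done

echo "$letters"
```

For the five sample entries this prints DDRDR, one letter per entry in order.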
Use a parameter expansion to remove the prefix before using grep:
for i in $my_string; do echo ${i#[^RD]} | grep -o "^[RD]" ; done
or use a simple test without grep (since you already know that each item starts with an R or a D):
for i in $my_string; do
if [[ $i =~ ^[^D]?R ]] ; then
echo 'R'
else
echo 'D'
fi
done
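If the letter itself is wanted rather than a yes/no test, bash can also return the captured group directly through BASH_REMATCH. A minimal sketch, assuming bash (the [[ =~ ]] operator is not available in plain sh); the names re and letters are made up here.

```shell
my_string="-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"

# Keep the regex in a variable so the shell does not try to parse it.
re='^.?([RD])'

letters=""
for i in $my_string; do
  if [[ $i =~ $re ]]; then
    # BASH_REMATCH[1] holds whatever the first parenthesised group matched.
    letters+="${BASH_REMATCH[1]}"
  fi
done

echo "$letters"
```

This avoids starting grep or sed for every entry, which may matter when the list is long.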
This regex worked in my local tests. Please have a try:
^.?[RD]
I can't think of a way to ONLY return the letter you want. I'd have a command after to detect whether the returned string is greater than 1 character long, and if so, I'd return only the second character.
I'm not 100% sure what you are asking (I understood that you want to match only an R or D at the beginning of a filename, whatever character comes before it, if there is one), but I think you should use a lookbehind. In PHP you would do:
$re = "/(?<=^\S|\s\S|\s)[RD]/";
$str = "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz";
preg_match_all($re, $str, $matches);
To use Perl syntax in bash you must enable it. https://unix.stackexchange.com/questions/84477/forcing-bash-to-use-perl-regex-engine
You can test your regexp here if you need https://regex101.com/r/vV3nS3/1
This does it when using the modifier 'g' for global: (^| ).?(R|D)

perl regex - anchors and pattern matching

I wrote a Perl regex to extract the words after a certain anchor, but it doesn't seem to be working. What am I doing wrong?
This is my actual output; I need to extract every number after the groups keyword:
$id cuser301 uid=2301(cuser301) gid=32(rpc) groups=32(rpc),1001(cgrp1),1002(cgrp2),1003(cgrp3),1004(cgrp4),1005(cgrp5),1006(cgrp6),1007(cgrp7),1008(cgrp8),1009(cgrp9),1010(cgrp10),1011(cgrp11),1012(cgrp12),1013(cgrp13),1014(cgrp14),1015(cgrp15),1016(cgrp16),1017(cgrp17),1018(cgrp18),1019(cgrp19),1020(cgrp20),1021(cgrp21),1022(cgrp22),1023(cgrp23),1024(cgrp24),1025(cgrp25),1026(cgrp26),1027(cgrp27),1028(cgrp28),1029(cgrp29),1030(cgrp30),1031(cgrp31),1032(cgrp32)
From the above, I run the id command and then would like to capture the numbers after groups. Please help.
I am using the following.
my $check_groups = execute("\id $user"); #---> (execute is to run commands on the linux client, please ignore it)
my $new_groups = ('/^groups/',$check_groups); # ---> Now $new_groups should have all numbers after groups.
my $input = '$id cuser301 uid=2301(cuser301) gid=32(rpc) groups=32(rpc),1001(cgrp1),1002(cgrp2),1003(cgrp3),1004(cgrp4),1005(cgrp5),1006(cgrp6),1007(cgrp7),1008(cgrp8),1009(cgrp9),1010(cgrp10),1011(cgrp11),1012(cgrp12),1013(cgrp13),1014(cgrp14),1015(cgrp15),1016(cgrp16),1017(cgrp17),1018(cgrp18),1019(cgrp19),1020(cgrp20),1021(cgrp21),1022(cgrp22),1023(cgrp23),1024(cgrp24),1025(cgrp25),1026(cgrp26),1027(cgrp27),1028(cgrp28),1029(cgrp29),1030(cgrp30),1031(cgrp31),1032(cgrp32)';
print join ',', $input =~ /(?:.*groups=|\G.*?)\b([0-9]+)/g;
This is a common pattern; in more complicated cases where you want to ensure the \G branch only applies after the first non-zero-length match, you can use \G(?!\A) instead of just \G.
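For comparison, the same "anchor first, then match repeatedly" idea can be approximated in the shell without \G. This is a different technique from the Perl one above, not a translation of it: the anchor is applied by cutting off everything up to groups= with a parameter expansion, and grep -o then collects the numbers. The input is a shortened, hypothetical sample of the id output.

```shell
# Shortened sample of the id output (hypothetical, three groups only).
input='uid=2301(cuser301) gid=32(rpc) groups=32(rpc),1001(cgrp1),1002(cgrp2)'

# Drop everything up to and including "groups=" so uid/gid cannot match.
groups_part=${input#*groups=}

# Each group id is a run of digits followed by "("; the digits inside
# names like cgrp1 are followed by ")" and are therefore skipped.
ids=$(printf '%s\n' "$groups_part" | grep -oE '[0-9]+\(' | tr -d '(' | paste -sd, -)

echo "$ids"
```

For the sample above this prints 32,1001,1002.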
Try doing this :
$ echo <INPUT> | perl -ne 'print "$1," while /,(\d+)\(/g'
Check https://regex101.com/r/uZ9tO6/1

Monster perl regex

I'm trying to change strings like this:
<a href='../Example/case23.html'><img src='Blablabla.jpg'
To this:
<a href='../Example/case23.html'><img src='<?php imgname('case23'); ?>'
And I've got this monster of a regular expression:
find . -type f | xargs perl -pi -e \
's/<a href=\'(.\.\.\/Example\/)(case\d\d)(.\.html\'><img src=\')*\'/\1\2\3<\?php imgname\(\'\2\'); \?>\'/'
But it isn't working. In fact, I think it's a problem with Bash, which could probably be pointed out rather quickly.
r: line 4: syntax error near unexpected token `('
r: line 4: ` 's/<a href=\'(.\.\.\/Example\/)(case\d\d)(.\.html\'><img src=\')*\'/\1\2\3<\?php imgname\(\'\2\'); \?>\'/''
But if you want to help me with the regular expression that'd be cool, too!
Teaching you how to fish:
s/…/…/
Use a separator other than / for the s operator because / already occurs in the expression.
s{…}{…}
Cut down on backslash quoting, prefer [.] over \. because we'll shellquote later. Let's keep backslashes only for the necessary or important parts, namely here the digits character class.
s{<a href='[.][.]/Example/case(\d\d)[.]html'>…
Capture only the variable part. No need to reassemble the string later if most of it is static.
s{<a href='[.][.]/Example/case(\d\d)[.]html'><img src='[^']*'}{<a href='../Example/case$1.html'><img src='<?php imgname('case$1'); ?>'}
Use $1 instead of \1 to denote backreferences. [^']* means everything until the next '.
To serve as the argument for the Perl -e option, this program now needs to be shellquoted. Employ the following helper program; you can also use an alias or shell function instead:
> cat `which shellquote`
#!/usr/bin/env perl
use String::ShellQuote qw(shell_quote); undef $/; print shell_quote <>
Run it, paste the program body, and terminate the input with Ctrl+d. You receive:
's{<a href='\''[.][.]/Example/case(\d\d)[.]html'\''><img src='\''[^'\'']*'\''}{<a href='\''../Example/case$1.html'\''><img src='\''<?php imgname('\''case$1'\''); ?>'\''}'
Put this together with the shell pipeline.
find . -type f | xargs perl -pi -e 's{<a href='\''[.][.]/Example/case(\d\d)[.]html'\''><img src='\''[^'\'']*'\''}{<a href='\''../Example/case$1.html'\''><img src='\''<?php imgname('\''case$1'\''); ?>'\''}'
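If String::ShellQuote is not installed, the same single-quote wrapping can be approximated with a small shell function. This is a sketch only: shell_quote_sq is a hypothetical name, it always wraps in single quotes (the Perl module may choose a different but equivalent quoting), and the program text below is a shortened stand-in for the real substitution.

```shell
# Hypothetical helper: wrap the argument in single quotes and turn every
# embedded ' into '\'' (close quote, escaped quote, reopen quote).
shell_quote_sq() {
  printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Shortened stand-in for the Perl program body.
program="s{<img src='[^']*'}{<img src='x'}"

quoted=$(shell_quote_sq "$program")
echo "$quoted"

# Round trip: letting the shell parse the quoted form gives the original back.
eval "roundtrip=$quoted"
```

The quoted form has the same shape as the output shown above, with each ' replaced by '\''.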
Bash single-quotes do not permit any escapes.
Try this at a bash prompt and you'll see what I mean:
FOO='\'foo'
will cause it to prompt you looking for the fourth single-quote. If you satisfy it, you'll find FOO's value is
\foo
You'll need to use double-quotes around your expression. Although in truth, your HTML should be using double-quotes in the first place.
Single quotes within single quotes in Bash:
set -xv
echo ''"'"''
echo $'\''
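The spellings above, plus the '\'' form used in the earlier answer, all produce the same string. A small sketch, assuming bash (the $'...' form is not available in every sh); the variable names are arbitrary.

```shell
# Close the single-quoted string, add an escaped quote, reopen it.
a='it'\''s'

# Close the single-quoted string, add a double-quoted quote, reopen it.
b='it'"'"'s'

# ANSI-C quoting: backslash escapes are honoured inside $'...'.
c=$'it\'s'

printf '%s\n' "$a" "$b" "$c"
```

Each of the three lines printed is it's.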
I wouldn't use a one-liner. Put your Perl code in a script, which makes it much easier to get the regex right without wondering about escaping quotes and such.
I'd use a script like this:
#!/usr/bin/perl -pi
use strict;
use warnings;
s{
( <a \b [^>]* \b href=['"] [^'"]*/case(\d+)\.html ['"] [^>]* > \s*
<img \b [^>]* \b src=['"] ) [^'"<] [^'"]*
}{$1<?php imgname('case$2'); ?>}gix;
and then do something like:
find . -type f | xargs fiximgs
– Michael
If you install the mysql package, it comes with a command called replace.
With the replace command you can:
while read line
do
X=`echo $line| replace "<a href='../Example/" ""|replace ".html'><" " "|awk '{print $1}'`
echo "<a href='../Example/$X.html'><img src='<?php imgname('$X'); ?>'">>NewFile
done < myfile
The same can be done with sed, e.g. sed 's/my string/replace string/g'. replace is just easier to work with when the text contains special characters.