Regular expression with conditional replacement - regex

I am trying to write a RegEx for replacing a character in a string, given that a condition is met. In particular, if the string ends in y, I would like to replace all instances of a to o and delete the final y. To illustrate what I am trying to do with examples:
Katy --> Kot
cat --> cat
Kakaty --> Kokot
avidly --> ovidl
I was using the RegEx s/\(\w*\)a\(\w*\)y$/\1o\2/g, but it does not work. How can one capture the "conditional" nature of this task with a RegEx?
Your help is always most appreciated.

With GNU sed:
If a line ends with y (/y$/), replace every a with o and replace trailing y with nothing (s/y$//).
sed '/y$/{y/a/o/;s/y$//}' file
Output:
Kot
cat
Kokot
ovidl
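To check the command against the four words from the question, a small harness can help. This is only a sketch: the function name convert is made up for illustration, and the script is split across lines so that it does not depend on one-line brace syntax.

```shell
# convert: hypothetical wrapper around the sed script from the answer above.
convert() {
  sed '/y$/{
    y/a/o/
    s/y$//
  }'
}

# Feed the question's examples through the wrapper, one word per line.
printf '%s\n' Katy cat Kakaty avidly | convert
```

Only lines ending in y enter the braces, so cat passes through unchanged.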

You may use awk:
Input:
cat file
Katy
cat
KaKaty
avidly
Command:
awk '/y$/{gsub(/a/, "o"); sub(/.$/, "")} 1' file
Output:
Kot
cat
KoKot
ovidl

You could use some sed spaghetti code, but please don't
sed '
s/y$// ; # try to replace trailing y
ta ; # if successful, goto a
bb ; # otherwise, goto b
:a
y/a/o/ ; # replace a with o
:b
'

Related

Regex contain match that should not match

Given this ; delimited string
hap;; z
z ;d;hh
z;d;hh ;gfg;fdf ;ppp
ap;jj
lo mo;z
d;23
;;io;
b yio;b;12
b
a;b;bb;;;34
I am looking to get columns $1 $2 $3 from any line that contains ap or b or o m in column 1
Using this regex
^(?:(.*?(?:ap|b|o m).*?)(?:;([^\r\n;]*))?(?:;([^\r\n;]*))?(?:;.*)?|.*)$
As the demo shows, line 11 (a;b;bb;;;34) matches, but it should not, because b appears only in column 2.
As far as I understand, I cannot use a negated character class to match the parts of column 1 before and after the keyword.
How can I make line 11 not match?
You may consider this perl one-liner that works like awk:
perl -F';' -MEnglish -lane 'BEGIN {$OFS=";"} print $F[0],$F[1],$F[2] if $F[0] =~ /ap|b|o m/' file
An awk solution would be even simpler:
awk 'BEGIN {FS=OFS=";"} $1 ~ /ap|b|o m/{print $1,$2,$3}' file
hap;; z
ap;jj;
lo mo;z;
b yio;b;12
b ;;
Here is a regex that matches your data:
^([^;\n]*(?:ap|b|o m)[^;]*);((?(1)[^;]*));?((?(1)[^;]*))$
You can see it in action.
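The key point in the answers above is that the keyword test must not be allowed to cross a ; boundary. A minimal sketch of that idea with grep and cut, assuming POSIX grep -E and a made-up function name first_three_cols:

```shell
# first_three_cols: hypothetical helper, reads ';'-delimited lines on stdin.
first_three_cols() {
  # [^;]* cannot cross a ';', so the keyword has to lie inside column 1.
  grep -E '^[^;]*(ap|b|o m)' | cut -d';' -f1-3
}

printf '%s\n' 'hap;; z' 'lo mo;z' 'b yio;b;12;x' 'a;b;bb;;;34' | first_three_cols
```

Unlike the awk command, cut does not pad short lines with extra semicolons, so lo mo;z is printed as is. The line a;b;bb;;;34 is dropped because its only b sits in column 2.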

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to @potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
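The substr($0,1,2) call above assumes a two-character prefix. A sketch that lifts this restriction, under the assumption that the prefix is everything before the first digit; the function name add_prefix is made up for illustration:

```shell
# add_prefix: hypothetical helper, reads one list per line on stdin.
add_prefix() {
  awk '{
    # Assumption: the prefix is the text before the first digit.
    if (match($0, /[0-9]/)) {
      p = substr($0, 1, RSTART - 1)
      # Re-insert the prefix after every comma. A prefix containing
      # & or \ would need escaping, which this sketch does not do.
      gsub(/,/, "," p)
    }
    print
  }'
}

echo 'AB-1,20,300' | add_prefix
```

Lines without any digit are printed unchanged.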
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma by the first column prefix and that number.
Assuming this is for shell scripting (the set syntax below is csh/tcsh), you can do so with 2 seds:
set string = "A-1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A-1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
Could you please try the following (if you are OK with awk)? Note that it hardcodes the prefix A-.
awk '
BEGIN{
  FS=OFS=","
}
{
  for(i=1;i<=NF;i++){
    if($i !~ /^A/ && $i !~ /\"A/){
      $i="A-"$i
    }
  }
}
1' Input_file
If your data is in a file named d, this works with GNU sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d

How do I use super-sed's Perl regex dot match?

I'm trying to use super-sed's Perl regex /S flag, but can't get it to work at all. This flag makes dots match newlines. This would be a very handy tool, if only I could understand how it's used! For example, I expect the following command to match the pattern, which spans a newline, and replace it with Xs:
echo "(123) 456-7890\n(212) 567-9050" | ssed -R -e "s/78.*?5/x/S"
So, I am expecting this output:
(123) 456-XXXX
XXXXXXX67-9050
Instead I get (no match):
(123) 456-7890
(212) 567-9050
Ssed, like sed, works in a line-based manner. If you want to work on multiple lines at the same time, you have to fetch them first. One way to do that in sed (and ssed) is
:a $! { N; ba; }
Where :a is a jump label, N fetches the next line, ba jumps back to :a and the $! check sees to it that this only happens as long as there are more lines to read.
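The slurp loop can be tried on its own with plain sed before adding any ssed-specific flags. A minimal sketch, assuming a POSIX sed; the function name join_lines is made up, and the final s command is there only to make the effect of the loop visible:

```shell
# join_lines: hypothetical helper showing the ":a $! { N; ba; }" idiom.
join_lines() {
  sed ':a
$!{
N
ba
}
s/\n/,/g'
}

printf 'one\ntwo\nthree\n' | join_lines
```

Once the loop has run, the whole input sits in the pattern space, so the single s command can see and replace the embedded newlines.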
Once we have that, the other difficulty is to get the right number of Xs into the right places. Ssed, like sed, does not make this very convenient, and it requires some shuffling around with the hold buffer to get the substituted part isolated and ready for processing. I came up with the following:
$ ssed -R ':a $! { N; ba; }; h; s/(.*?78)(.*?5)(.*)/\2/S; s/./X/g; s/^/#/; x; G; s/(.*?78)(.*?5)(.*)\n#(.*)/\1\4\3/S' << EOF
> (123) 456-7890
> (212) 567-9050
> EOF
(123) 456-78XX
XXXXXXX67-9050
This works as follows:
:a $! { N; ba; } # read full input into pattern space
h # save a copy of it in the hold buffer
s/(.*?78)(.*?5)(.*)/\2/S # isolate the part to substitute
s/./X/g # replace non-newlines with X
s/^/#/ # Put an # as marker before the X's.
x # Swap hold buffer and pattern space
G # append hold buffer (now the X's) to
# the pattern space. The PS now contains
# the input followed by an # followed by
# the X's.
s/(.*?78)(.*?5)(.*)\n#(.*)/\1\4\3/S # Use the # marker (that we know to be
# the last # in the PS) to isolate the
# X's and the original regex to isolate
# the part we want to replace, then
# reassemble.
As you can see, this is about as messy in ssed as it would be in sed, so I still suggest that it might be saner to use Perl:
$ perl -0777 -pe 's/(?<=78)(.*?5)/$1=~s{[^\n]}{X}gr/se' << EOF
> (123) 456-7890
> (212) 567-9050
> EOF
(123) 456-78XX
XXXXXXX67-9050
Here, the -0777 option puts perl into slurp mode, which makes it read the whole input in one go rather than linewise, and the code is a simple substitution, where
(?<=78) is a lookbehind expression that matches an empty string if it is preceded by 78
/e enables us to use a perl expression in the replacement clause of s///, and
$1=~s{[^\n]}{X}gr takes the first capture and replaces all non-newline characters in it with X, yielding the result of the substitution. This is then substituted into the string where (.*?5) was matched.
Noooo!!!! It's bad enough people are using sed for all sorts of wacky machinations but now there's super-sed for even more crazy rune combinations???
You don't tell us what ssed's /S command does, so I'm guessing it's for doing substitutions across multi-line blocks. But sed is for simple substitutions on individual lines, that is all, and you should forget you ever heard about super-sed. For anything interesting related to manipulating text you should just use awk, e.g. with GNU awk for multi-char RS:
$ printf "(123) 456-7890\n(212) 567-9050\n" |
awk -v RS='78[^5]*5' -v ORS= '{print $0 gensub(/[^\n]/,"X","g",RT)}'
(123) 456-XXXX
XXXXXXX67-9050
or if you didn't want the 78 to be replaced:
$ printf "(123) 456-7890\n(212) 567-9050\n" |
awk -v RS='78[^5]*5' -v ORS= '{print $0 substr(RT,1,2) gensub(/[^\n]/,"X","g",substr(RT,3))}'
(123) 456-78XX
XXXXXXX67-9050
or:
$ printf "(123) 456-7890\n(212) 567-9050\n" |
awk -v RS='^$' -v ORS= 'match($0,/(.*78)([^5]*5)(.*)/,a){print a[1] gensub(/[^\n]/,"X","g",a[2]) a[3]}'
(123) 456-78XX
XXXXXXX67-9050
and if you don't like that for some reason then just use perl, it's got to be every bit as readily available as ssed, probably more so!

Skipping line and regex in awk [duplicate]

I'm working with awk and I need to skip lines that are blank or comments. Inside the main block I've been testing whether the line matches a regex for this and then calling next:
{if ($0 ~"/^($|#)/" ) {next;}}
but the if statement is never getting hit and I can't figure out why. (My input has blank lines and comments)
I need to add this line inside of an awkscript in the block, not a command line argument.
Assuming you're inside a block of awk code that doesn't benefit from the default print of matching patterns and you need to use an if test, here is the basis for a solution:
$ echo "a
b
c
d
#
#e
f
" | awk '{if ($0 ~ /^(#|$)/ ) {next;} ;print}'
produces output of
a
b
c
d
f
If you want to skip blank lines that have spaces/tabs included, you can add
awk '{if ($0 ~ /^(#|[ \t]*$)/ ) {next;} ;print}'
#-------------------^^^^^^
# [ \t] is a character class matching a space or a tab
# * means zero or more of the preceding
IHTH
In awk, a regular expression literal is delimited by slashes. If you place it inside quotes, it becomes a string, and the slashes become literal characters that the input would have to contain. Thus, replace:
{if ($0 ~"/^($|#)/" ) {next;}}
With:
{if ($0 ~ /^($|#)/ ) {next;}}
Example
Consider the input file:
$ cat input
one
#comment
two
three
four
Now observe the awk script:
$ awk '{if ($0 ~ /^($|#)/ ) {next;}} 1' input
one
two
three
four
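The difference between a regex literal and a quoted string can be seen side by side. This is a sketch with made-up function names; the third variant relies on awk treating a string on the right of ~ as a dynamic regex, which works once the slashes are dropped:

```shell
# Regex literal: skips blank lines and lines starting with #.
skip_literal() { awk '{ if ($0 ~ /^($|#)/) next; print }'; }

# Quoted, slashes kept: the slashes are now part of the pattern,
# so nothing in the sample input matches and nothing is skipped.
skip_quoted_slashes() { awk '{ if ($0 ~ "/^($|#)/") next; print }'; }

# Quoted, slashes dropped: the string is used as a dynamic regex.
skip_string() { awk '{ if ($0 ~ "^($|#)") next; print }'; }

printf '%s\n' one '' '#comment' two | skip_literal
```

The first and third functions print one and two for this input; the second prints all four lines.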
You can use the following one:
awk '! /^($|#)/' infile
It relies on the default print action, so it prints every line that is neither blank nor begins with #.
awk '/^($|#)/{next} {print $0}'
would do the job (the parentheses matter: /^$|#/ would also skip any line that merely contains a #),
or,
more simply,
awk '/^[^$#]/ '
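What does it do?
/^[^$#]/ is matched against each line, and if it matches, the default action of printing the entire record is performed.
^ anchors the regex at the beginning of the line.
[^$#] is a negated character class: the first character of the line must be anything other than
$ => a literal dollar sign (inside brackets, $ is not an end-of-line anchor), so lines starting with $ are skipped too
# => comment
Empty lines are skipped because they have no first character for the class to match.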
eg
$ cat input
hello
#world
this is a
test
$ awk '/^[^$#]/ ' input
hello
this is a
test
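Inside a bracket expression $ is a literal dollar sign, so /^[^$#]/ also drops lines that start with $. A short sketch comparing it with /^[^#]/, which drops only empty lines and comments; the function names are made up for illustration:

```shell
# skip_class: the pattern from the answer above.
skip_class()  { awk '/^[^$#]/'; }

# skip_simple: same idea without the literal $ in the class.
skip_simple() { awk '/^[^#]/'; }

printf '%s\n' 'hello' '$HOME is set' '#world' '' 'test' | skip_class
printf '%s\n' 'hello' '$HOME is set' '#world' '' 'test' | skip_simple
```

The first pipeline prints hello and test; the second also keeps the line starting with $HOME. Neither variant skips a comment that is preceded by whitespace.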

How to exclude patterns in regex conditionally in bash?

This is the content of input.txt:
hello=123
1234
stack=(23(4))
12341234
overflow=345
=
friends=(987)
Then I'm trying to match all the lines containing an equals sign, removing the outer parentheses (if the line has them).
To be clear, this is the result I'm looking for:
hello=123
stack=23(4)
overflow=345
friends=987
I thought of something like this:
cat input.txt | grep -Poh '.+=(?=\()?.+(?=\))?'
But it returns nothing. What am I doing wrong? Do you have any idea how to do this? I'd be very interested to know.
Using awk:
awk 'BEGIN{FS=OFS="="} NF==2 && $1!=""{gsub(/^\(|\)$/, "", $2); print}' file
hello=123
stack=23(4)
overflow=345
friends=987
Here is an alternate way with sed:
sed -nr ' # Use n to disable default printing and r for extended regex
/.+=.+/ { # Look for lines with key value pairs separated by =
/[(]/!ba; # If the line does not contain a paren branch out to label a
s/\(([^)]+)\)/\1/; # Otherwise strip the parentheses around the first (...) run
:a # Our label
p # print the line
}' file
$ sed -nr '/.+=.+/{/[(]/!ba;s/\(([^)]+)\)/\1/;:a;p}' file
hello=123
stack=23(4)
overflow=345
friends=987
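The substitution above removes the parentheses around the first run of characters without a closing parenthesis, which happens to give the right result for this input. A sketch that instead anchors on the outer pair, assuming the value is wrapped only when ( directly follows the = and ) ends the line; the function name strip_outer_parens is made up, and the script uses basic regex syntax so it needs no -r flag:

```shell
# strip_outer_parens: hypothetical helper, reads key=value lines on stdin.
strip_outer_parens() {
  sed -n '/..*=..*/{
    s/=(\(.*\))$/=\1/
    p
  }'
}

printf '%s\n' 'hello=123' '1234' 'stack=(23(4))' '=' 'friends=(987)' | strip_outer_parens
```

In basic regex syntax a bare ( is literal and \( opens a group, so the pattern reads: = then a literal (, the captured value, a literal ), end of line.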