Regex for strings optionally surrounded with quotes

I'm trying to build a regex that matches strings which are either surrounded with quotes or have no quotes at either side. Moreover, a string the regex has to match may have quotes in the middle. Here's a result of my efforts at the moment:
^("?+)(.*[^"])(\1)$
It works well with strings having quotes both at start and end, having no quotes at any side or having quotes at start only:
$ echo '"blah "blah" blah"' | perl -ne 'if(/^("?+)(.*[^"])(\1)$/){print "$1\n$2\n$3"}'
"
blah "blah" blah
"
$ echo 'blah "blah" blah' | perl -ne 'if(/^("?+)(.*[^"])(\1)$/){print "$1\n$2\n$3"}'
blah "blah" blah
$ echo '"blah "blah" blah' | perl -ne 'if(/^("?+)(.*[^"])(\1)$/){print "$1\n$2\n$3"}'
But it also matches strings that have a quote only at the end:
$ echo 'blah "blah" blah"' | perl -ne 'if(/^("?+)(.*[^"])(\1)$/){print "$1\n$2\n$3"}'
blah "blah" blah"
Any ideas what's the problem with the regex and how to fix it?

In your last case, ("?+) matches the empty string. (\1) effectively becomes a no-op: It also matches an empty string.
That leaves us with ^(.*[^"])$. This matches because your input string has a non-" character at the end: a newline ("\n").
You can fix this by removing the newline before running the regex (perl -ne 'chomp; ...').
As a side note, you might want to make the middle part of your regex optional. Otherwise it won't match the empty string or a string consisting of two quotes ("").
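To see the effect of the trailing newline in isolation, here is a minimal Python sketch. It uses only the reduced pattern ^(.*[^"])$ from the explanation above, not the full regex, and relies on Python's re module behaving like Perl on these two points: . does not match a newline, and $ also matches just before a final newline. The function name is made up for illustration.

```python
import re

# Reduced pattern from the explanation above: the optional-quote group and
# the backreference both matched the empty string, so only this part is left.
REDUCED = re.compile(r'^(.*[^"])$')

def matches_reduced(line: str) -> bool:
    """Return True if the reduced pattern matches the given line."""
    return REDUCED.search(line) is not None

raw = 'blah "blah" blah"\n'                 # what echo sends: text plus a newline
print(matches_reduced(raw))                 # True: the newline satisfies [^"]
print(matches_reduced(raw.rstrip("\n")))    # False: after a chomp-like strip
```

Stripping the newline before matching plays the same role as chomp in the Perl one-liner.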


Capture word after pattern with slash

I want to extract word1 from:
something /CLIENT_LOGIN:word1 something else
I would like to extract the first word after matching pattern /CLIENT_LOGIN:.
Without the slash, something like this works:
A='something /CLIENT_LOGIN:word1 something else'
B=$(echo $A | awk '$1 == "CLIENT_LOGIN" { print $2 }' FS=":")
With the slash, though, I can't get it working (I tried putting / and \/ in front of CLIENT_LOGIN). I don't mind whether this is done with awk, grep, sed, ...
Using sed:
s='=something /CLIENT_LOGIN:word1 something else'
sed -E 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~' <<< "$s"
word1
Details:
We use ~ as regex delimiter in sed
/CLIENT_LOGIN:([^[:blank:]]+) matches /CLIENT_LOGIN: followed by 1+ characters other than spaces and tabs, which are captured in group #1
.* on both sides matches text before and after our match
\1 is used in substitution to put 1st group's captured value back in output
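For comparison, the same replace-the-whole-line-with-group-1 idea can be sketched in Python. This is only an illustration, not part of the sed answer: the function name is hypothetical, and [^ \t]+ is used as a stand-in for [^[:blank:]]+.

```python
import re

def extract_login(line: str) -> str:
    """Replace the whole line with the token following ' /CLIENT_LOGIN:'."""
    # [^ \t]+ stands in for [^[:blank:]]+ (anything but a space or a tab).
    return re.sub(r'.* /CLIENT_LOGIN:([^ \t]+).*', r'\1', line)

print(extract_login('something /CLIENT_LOGIN:word1 something else'))  # word1
```

As with the sed command, a line that does not contain the pattern comes back unchanged rather than empty.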
1st solution: With your shown samples, please try following GNU grep solution.
grep -oP '^.*? /CLIENT_LOGIN:\K(\S+)' Input_file
Explanation: grep's -o option prints only the matched part and -P enables PCRE regex. The regex ^.*? /CLIENT_LOGIN:\K(\S+) uses a lazy match from the start of the line up to /CLIENT_LOGIN: so that the very first occurrence of that string is found. \K then discards everything matched so far, so only what follows is printed: \S+, which matches all non-space characters up to the next whitespace.
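Python's re module has no \K, so a rough equivalent of "forget what was matched before this point" is a fixed-width lookbehind. The sketch below is an illustration under that assumption; the function name is hypothetical, and unlike the grep pattern it does not require a space before /CLIENT_LOGIN:.

```python
import re

def login_after_marker(line: str):
    """Return the non-space run right after '/CLIENT_LOGIN:', or None."""
    # The lookbehind checks for the marker without including it in the match,
    # which is the part \K takes care of in the PCRE version.
    m = re.search(r'(?<=/CLIENT_LOGIN:)\S+', line)
    return m.group(0) if m else None

print(login_after_marker('something /CLIENT_LOGIN:word1 something else'))  # word1
```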
2nd solution: Using awk's match function along with its split function to print the required value.
awk '
match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){
split(substr($0,RSTART,RLENGTH),arr,":")
print arr[2]
}
' Input_file
3rd solution: Using GNU awk's FPAT variable, please try the following solution. FPAT is set to /CLIENT_LOGIN: followed by all non-space characters, so that each such match becomes a field. In the main program, sub replaces everything up to and including the : in the first field with an empty string, and then the first field is printed.
awk -v FPAT='/CLIENT_LOGIN:[^[:space:]]+' '{sub(/.*:/,"",$1);print $1}' Input_file
Performing a regex match and capturing the resulting string in BASH_REMATCH[]:
$ regex='.*/CLIENT_LOGIN:([^[:space:]]*).*'
$ A='something /CLIENT_LOGIN:word1 something else'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
word1
Verifying B remains undefined if we don't find our match:
$ A='something without the desired string'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
<<<=== nothing output
Fixing your awk command, you can use
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(echo "$A" | awk 'match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){print substr($0,RSTART+14,RLENGTH-14)}')
See the online demo yielding xdfmb1d. Details:
\/CLIENT_LOGIN: - a /CLIENT_LOGIN: string
[^[:space:]]+ - one or more non-whitespace chars
The pattern above is what awk searches for, and once matched, the part of this match value after /CLIENT_LOGIN: is "extracted" using substr($0,RSTART+14,RLENGTH-14) (where 14 is the length of the /CLIENT_LOGIN: string).
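The offset arithmetic can be checked with a short Python sketch. It mirrors the match-then-cut idea only; PREFIX and the function name are hypothetical, and Python slicing is 0-based where awk's substr is 1-based.

```python
import re

PREFIX = '/CLIENT_LOGIN:'   # 14 characters, the constant used in the awk command

def login_by_offset(line: str):
    """Match the marker plus value, then cut the marker off by its length."""
    m = re.search(r'/CLIENT_LOGIN:\S+', line)
    if m is None:
        return None
    # Same idea as substr($0, RSTART+14, RLENGTH-14) in awk.
    return m.group(0)[len(PREFIX):]

print(len(PREFIX))  # 14
print(login_by_offset('/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C'))  # xdfmb1d
```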

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that every "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?
The (.*sym.*?)lopy and (.*sym.+?)lopy patterns are almost the same: .+? matches one or more chars other than line break chars, as few as possible, while .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers; *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it, and then lopy. Adding g therefore only asks for more occurrences of the whole sym....lopy sequence, and there is only one such occurrence in your string.
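The "only one occurrence" point can be checked by counting substitutions. Here is a minimal Python sketch, assuming Python's re backtracks the same way Perl does for this pattern; the function name is hypothetical.

```python
import re

def sub_with_count(line: str):
    """Apply the question's Perl pattern globally and report how often it matched."""
    # re.subn replaces every non-overlapping match, like the g flag,
    # and also returns the number of replacements made.
    return re.subn(r'(.*sym.+?)lopy', r'\1lad', line)

result, count = sub_with_count('lopy,lopy1,sym,lopy,lopy1,sym')
print(result)  # lopy,lopy1,sym,lad,lopy1,sym
print(count)   # 1
```

The single match already consumes the only sym that is followed by a lopy, so the global search has nothing left to find.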
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).
Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the shown samples, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'
The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between the syms and run code in the replacement side, in which a regex replaces all lopys:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between the syms I use \K after the first sym, which drops everything matched before it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r, since $1 is read-only and we want the substitution to return the modified string anyway. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).
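The capture-then-run-code approach has a close analogue in Python, where re.sub accepts a function as the replacement. The sketch below is an illustration only: the function name is hypothetical, and a lookbehind stands in for \K, which Python's re does not have.

```python
import re

def replace_between_syms(line: str) -> str:
    """Replace 'lop' with 'lad' only in the text between 'sym,' and the next 'sym'."""
    return re.sub(
        r'(?<=sym,)(.+?)(?=sym)',                     # substring between the syms
        lambda m: m.group(1).replace('lop', 'lad'),   # code run on each match
        line,
    )

print(replace_between_syms('lopy,lopy1,sym,lopy,lopy1,sym'))
# lopy,lopy1,sym,lady,lady1,sym
```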
sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
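The split-then-replace idea needs no regex at all in some languages. As a minimal Python sketch (the function name is hypothetical, and str.partition stands in for the two capture groups):

```python
def replace_after_first_sym(line: str) -> str:
    """Leave everything up to the first 'sym,' alone; replace 'lop' after it."""
    head, sep, tail = line.partition('sym,')
    if not sep:                 # no 'sym,' in the line: nothing to do
        return line
    return head + sep + tail.replace('lop', 'lad')

print(replace_after_first_sym('lopy,lopy1,sym,lopy,lopy1,sym'))
# lopy,lopy1,sym,lady,lady1,sym
```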
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind is only supported in Perl 5.30+ and generates a warning (you can turn the warning off with no warnings qw(experimental::vlb);).
Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
The loop stops when the s command has nothing left to match, i.e. when there is no lopy substring after sym,
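The label-and-branch loop amounts to "substitute until nothing changes". A minimal Python sketch of that control flow, with a hypothetical function name and re.subn standing in for the s command plus the t test:

```python
import re

def replace_until_stable(line: str) -> str:
    """Repeat the greedy substitution until it no longer matches (like ':a ... ta')."""
    while True:
        # Each pass rewrites the LAST remaining lopy after 'sym,', because .* is greedy.
        line, n = re.subn(r'(sym,.*)lopy', r'\1lady', line)
        if n == 0:              # nothing substituted: same as falling through 'ta'
            return line

print(replace_until_stable('lopy,lopy1,sym,lopy,lopy1,sym'))
# lopy,lopy1,sym,lady,lady1,sym
```

The loop terminates because the replacement text lady does not itself contain lopy.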

How to process a regular expression after being evaluated (sed)

I need to replace each character of a regular expression, once evaluated, with each character plus the # symbol.
For example:
If the regular expression is: POS[AB]
and the input text is: POSA_____POSB
I want to get this result: P#O#S#A#_____P#O#S#B#
Please, using sed or awk.
I have tried this:
$ echo "POSA_____POSB" | sed "s/POS[AB]/&#/g"
POSA#_____POSB#
$ echo "POSA_____POSB" | sed "s/./&#/g"
P#O#S#A#_#_#_#_#_#P#O#S#B#
But what I need is:
P#O#S#A#_____P#O#S#B#
Thank you in advance.
Best regards,
Octavio
Perl to the rescue!
perl -pe 's/(POS[AB])/$1 =~ s:(.):$1#:gr/ge'
The /e interprets the replacement as code, and it contains another substitution which replaces each character with itself plus #.
In ancient Perls before 5.14 (i.e. without the /r modifier), you need something slightly more complex:
perl -pe 's/(POS[AB])/$x = $1; $x =~ s:(.):$1#:g; $x/ge'
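The same "substitution inside the replacement" idea can be sketched in Python, where re.sub takes a function that receives each match. The function name is hypothetical; the pattern is passed in so that any regex can be used, not just POS[AB].

```python
import re

def hash_each_char(pattern: str, text: str) -> str:
    """Append '#' to every character of each substring matched by pattern."""
    return re.sub(
        pattern,
        lambda m: ''.join(ch + '#' for ch in m.group(0)),  # rewrite the matched text only
        text,
    )

print(hash_each_char(r'POS[AB]', 'POSA_____POSB'))
# P#O#S#A#_____P#O#S#B#
```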
echo "POSA_____POSB" | sed "s/[^_]/&#/g"
or
echo "POSA_____POSB" | sed "s/[POSAB]/&#/g"
Try this regex:
echo "POSA_____POSB" | sed "s/[A-Z]/&#/g"
Output:
P#O#S#A#_____P#O#S#B#
You may replace text matching a regex in awk with sub (first matching substring, like sed "s///") or gsub (all matching substrings, like sed "s///g"). The regexes themselves will not differ between sed and awk. In your case you want:
Solution 1
EDIT: edited to match the comments
The following awk will limit substitution to a given substring (e.g.'POSA_____POSB'):
echo "OOPS POSA_____POSB" | awk '{str="POSA_____POSB"}; {gsub(/[POSAB]/,"&#",str)}; {gsub(/'POSA_____POSB'/, str); print $0} '
If your input consists only of the matched string, try this:
echo "POSA_____POSB" | awk '{gsub(/[POSAB]/,"&#");}1'
Explanation:
Separate '{}' blocks for each action and the explicit print are there for clarity's sake.
gsub accepts 3 arguments, gsub(pattern, substitution [, target]), where target must be a variable (gsub changes it in place and stores the result there).
We use a variable named 'str' and initialize it with the value (your string) before doing any substitutions.
The second gsub puts the modified str back into $0 (the whole record/line).
The expressions are greedy by default: they match the longest string possible.
[] introduces a set of characters to be matched: every occurrence of any character in the set is matched. The expression above tells awk to match each occurrence of any of "POSAB".
Your first regexp does not work as you expected because you told sed to match POS followed by one of [AB] (the whole string at once).
In the other expression you told it to match any single character (including "_") by using '.' (dot).
If you want to generalize this solution you may use [[:alnum:]_], which matches any of [a-zA-Z0-9_] (GNU sed and gawk also accept \w for this, but only outside brackets), or [a-z], [A-Z], [0-9] to match lowercase letters, uppercase letters and digits respectively.
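The difference between the three patterns discussed above can be seen side by side in a short Python sketch. The function name is hypothetical, and \g<0> plays the role of & in sed and awk.

```python
import re

def append_hash(pattern: str, text: str) -> str:
    """Append '#' after every match of pattern (\\g<0> is the whole match, like &)."""
    return re.sub(pattern, r'\g<0>#', text)

text = 'POSA_____POSB'
print(append_hash(r'POS[AB]', text))   # POSA#_____POSB#   (whole word matched at once)
print(append_hash(r'.', text))         # P#O#S#A#_#_#_#_#_#P#O#S#B#   (every character)
print(append_hash(r'[POSAB]', text))   # P#O#S#A#_____P#O#S#B#   (one listed letter at a time)
```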
Solution 2
Note that you might negate character sets with [^] so: [^_] would also work in this particular case.
Explanation:
Negation means: match anything but the character between '[]'. The '^' character must come as first char, right after opening '['.
Sidenotes:
Note that [POSAB] already matches exactly one character at a time; [POSAB]{1} is an explicit equivalent, whereas [POSAB]? would make the character optional.
Also note that some implementations of sed might need the -r (or -E) switch to use extended (more complicated) regexps.
With the given example you can use
echo "POSA_____POSB" | sed -r 's/POS([AB])/P#O#S#\1#/g'
This will fail for more complicated expressions.
When your input contains no \v or \r characters, you can use
echo "POSA_____POSB" |
sed -r 's/POS([AB])/\v&\r/g; :loop;s/\v([^\r])/\1#\v/;t loop; s/[\v\r]//g'

why doesn't this Perl capture work

I expected this to capture and print just the group defined in parens, but instead it prints the whole line. How can I capture and print just the group in parens?
echo "abcdef" | perl -ne "print $1 if /(cd)/ "
What I want this to print: cd
What it actually prints: abcdef
How to fix?
In the perl command, you have to use single quotes or protect variables:
echo "abcdef" | perl -ne "print \$1 if /(cd)/"
or
echo "abcdef" | perl -ne 'print $1 if /(cd)/'
In double quotes, the shell expands $1.
The instant fix to your question is to change your double quotes to single quotes, like this:
$ echo abcdef | perl -ne 'print $1 if /(cd)/'
cd
With double quotes, the shell environment interprets your unprotected variable $1, which in your environment apparently evaluates to an empty string. So perl only receives the command print if /(cd)/ which is an implied command print $_ if /(cd)/ which prints the entire line.
You can also use a protected variable like this:
$ echo abcdef | perl -ne "print \$1 if /(cd)/"
cd
Note that matches which use delimiters other than / must begin with the m keyword rather than using the shorthand form. This does not matter in your case, but it is worth being aware of when working with matches; e.g., m|/| would match a / character using the pipe as the delimiter for the regular expression.
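To see what the shell actually hands to perl under each quoting style, here is a small Python sketch that asks sh to print the argument after its own expansion. It assumes a POSIX sh is available on the PATH and that $1 is unset in that shell (which is the case for sh -c with no extra arguments); the helper name is hypothetical.

```python
import subprocess

def as_seen_by_program(quoted_arg: str) -> str:
    """Return the argument as a program would receive it after sh processes the quotes."""
    # printf '%s' prints its argument verbatim, so only the shell's own
    # quote handling and variable expansion affect the result.
    done = subprocess.run(
        ["sh", "-c", f"printf '%s' {quoted_arg}"],
        capture_output=True, text=True, check=True,
    )
    return done.stdout

print(as_seen_by_program('"print $1 if /(cd)/"'))    # print  if /(cd)/   ($1 expanded to nothing)
print(as_seen_by_program("'print $1 if /(cd)/'"))    # print $1 if /(cd)/
print(as_seen_by_program('"print \\$1 if /(cd)/"'))  # print $1 if /(cd)/
```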

Regex to find string without curly braces but "\{", "\}" is allowed

I have a regex to find a string without curly braces: "([^\{\}]+)". It can extract "cde" from the following string:
"ab{cde}f"
Now I need to escape "{" with "\{" and "}" with "\}".
So if my original string is "ab{cd\{e\}}f" then I need to extract "cd{e}" or "cd\{e\}" (I can remove "\" later).
Thanks in advance.
This should work:
([^{}\\]|\\{|\\})+
To allow escapes inside your braces you can use:
{((?:[^\\{}]+|\\.)*)}
Perl example:
my $str = "ab{cd\\{e\\}} also foo{ad\\}ok\\{a\\{d}";
print "$str\n";
print join ', ', $str =~ /{((?:[^\\{}]+|\\.)*)}/g;
Output:
ab{cd\{e\}} also foo{ad\}ok\{a\{d}
cd\{e\}, ad\}ok\{a\{d
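The same pattern carries over to Python more or less unchanged. The sketch below is an illustration, not part of the Perl answer: the names are hypothetical, the outer braces are escaped for clarity, and a second step removes the backslashes, as the question says it may do afterwards.

```python
import re

# { ... } where the body is made of non-brace, non-backslash runs or backslash escapes.
BRACED = re.compile(r'\{((?:[^\\{}]+|\\.)*)\}')

def braced_groups(text: str):
    """Return the contents of each {...} group, keeping escaped braces as written."""
    return BRACED.findall(text)

def unescape_braces(group: str) -> str:
    """Turn \\{ and \\} back into plain braces."""
    return re.sub(r'\\([{}])', r'\1', group)

sample = r'ab{cd\{e\}} also foo{ad\}ok\{a\{d}'
groups = braced_groups(sample)
print(groups)                        # ['cd\\{e\\}', 'ad\\}ok\\{a\\{d']
print(unescape_braces(groups[0]))    # cd{e}
```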
Note that most regex special characters are effectively escaped by putting them inside a character class (i.e. square brackets). So:
[.] matches a literal period.
[[] matches a left square bracket.
[a] matches the letter a.
[{] matches a left curly brace.
So:
$ echo "ab{cde}f" | sed -r 's/[^{]*[{](.+)}.*/\1/'
cde
$ echo "ab{c\{d\}e}f" | sed -r 's/[^{]*[{](.+)}.*/\1/'
c\{d\}e
Or:
$ echo "ab{cde}f" | sed 's/[^{]*{//;s/}[^}]*$//'
cde
$ echo "ab{c\{d\}e}f" | sed 's/[^{]*{//;s/}[^}]*$//'
c\{d\}e
Or even:
$ php -r '$s="ab{cde}f"; print preg_replace("/[^{]*[{](.+)}.*/", "$1", $s) . "\n";'
cde
$ php -r '$s="ab{c\{d\}e}f"; print preg_replace("/[^{]*[{](.+)}.*/", "$1", $s) . "\n";'
c\{d\}e
Obviously, this does not handle escaped backslashes. :-)
\{(.+)\} would extract everything between the first and last curly bracket