Escaping braces with regexp in the middle of a string

I want to write a regular expression in Tcl that detects curly braces ({ and }) in the middle of a string and escapes each one with a backslash.
Example input:
designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc
Expected output:
designs/abc/def {/designs/abc/def/abc\{123\}defg} {abc/sed/123erf} -conect abc

Since you mentioned that only braces surrounded by characters on both sides should be replaced, then I think that you need word boundaries:
% set input "designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc"
designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc
% regsub -all {\y[{}]\y} $input {\\\0} result
2
% puts $result
designs/abc/def {/designs/abc/def/abc\{123\}defg} {abc/sed/123erf} -conect abc
In Tcl, \y matches only at the beginning or end of a word, that is, between a word character (\w) and a non-word character (\W), or between a word character and the beginning/end of the string.
The replacement \\\0 yields a backslash followed by the matched string.
If braces at the beginning/end of the string should also be escaped, you'll need something a bit different:
% set input "{/designs/abc/def/abc{123}defg}"
{/designs/abc/def/abc{123}defg}
% regsub -all {(?:\y|^)[{}](?:\y|$)} $input {\\\0} result
4
% puts $result
\{/designs/abc/def/abc\{123\}defg\}

Usually you can use lookaround to make that elegant, but you can fake it by including part of the match in the output: replace (\S)([{}])(\S) by \1\\\2\3.
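For what it's worth, here is a rough sketch of both variants in Python rather than Tcl (Python's re module is my substitution here, and the function names are made up for illustration). It suggests one practical difference: the capture-group version consumes the neighbouring characters, so two adjacent braces such as a{}b are not both escaped, whereas the lookaround version handles them.

```python
import re

def escape_inner_braces_groups(text):
    # Capture the neighbours and put them back around the escaped brace.
    return re.sub(r"(\S)([{}])(\S)", r"\1\\\2\3", text)

def escape_inner_braces_lookaround(text):
    # Lookbehind/lookahead check the neighbours without consuming them.
    return re.sub(r"(?<=\S)[{}](?=\S)", r"\\\g<0>", text)

sample = "designs/abc/def {/designs/abc/def/abc{123}defg} {abc/sed/123erf} -conect abc"
print(escape_inner_braces_groups(sample))
print(escape_inner_braces_lookaround(sample))
print(escape_inner_braces_groups("a{}b"))      # only the first brace is escaped
print(escape_inner_braces_lookaround("a{}b"))  # both braces are escaped
```

On the sample input from the question both functions should give the same result; they only diverge when two braces sit next to each other.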

What does -line flag do in tcl regular expression?

Below I have copied the code I had written. I don't know what the line flag does.
set value "hi this is venkat345
hi this is venkat435
hi this is venkat567"
regexp -all -line -- {(venkat.+)$} $value a b
puts "Full Match: $a"
puts "Sub Match1: $b"
The above code gives the following output
Full Match: venkat567
Sub Match1: venkat567
Can anyone explain when and where I should use the -line flag in a Tcl regular expression?
The man page has defined it well I believe:
-line
Enables newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning. With this flag, [^ bracket expressions and . never match newline, ^ matches an empty string after any newline in addition to its normal function, and $ matches an empty string before any newline in addition to its normal function. This flag is equivalent to specifying both -linestop and -lineanchor, or the (?n) embedded option (see the re_syntax manual page).
If you want to understand it another way, . and [^ ... ] usually match newlines, for example:
regexp -- {^....$} "ab\nc"
returns 1 (meaning the regexp matches the string, counting \n as one character), but using the -line switch will prevent . from matching \n.
Similarly:
regexp -- {^[^abc]+$} "de\nf"
will also return 1 because the negated class [^abc] matches any character other than a, b, or c, which includes \n.
The second function of the -line switch makes ^ match at every beginning of line instead of matching only at the start of the whole string, and makes $ match at every end of line instead of matching only at the end of the whole string.
% set text {abc
abc}
abc
abc
% regexp -- {^abc$} $text
0
% regexp -line -- {^abc$} $text
1
As for the when and where, it depends on what you are trying to do. Based on your sample code, it seems you want all the usernames beginning with venkat that appear at the end of any line. Since you want to match many, you will need the -all and -inline switches to get the matched strings, and I would recommend changing the regexp a bit:
set value "hi this is venkat345
hi this is venkat435
hi this is venkat567"
# I removed the capture group and changed . to \S to match non-space characters
set results [regexp -all -inline -line -- {venkat\S+$} $value]
puts $results
# venkat345 venkat435 venkat567
-line makes sure that . never matches a newline, and that ^ and $ also match at line boundaries.
According to the Tcl regexp documentation:
-line
Enables newline-sensitive matching. By default, newline is a
completely ordinary character with no special meaning. With this flag,
‘[^’ bracket expressions and ‘.’ never match newline, ‘^’ matches an
empty string after any newline in addition to its normal function, and
‘$’ matches an empty string before any newline in addition to its
normal function. This flag is equivalent to specifying both -linestop
and -lineanchor, or the (?n) embedded option (see METASYNTAX, below).
Here is the output without -line option:
Full Match: venkat345
hi this is venkat435
hi this is venkat567
Sub Match1: venkat345
hi this is venkat435
hi this is venkat567
Without -line, .+ also matches newlines, so the match runs from the first venkat to the end of the string in value.
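If it helps to compare with another engine, here is a small sketch in Python (my choice of language, not Tcl). Note that the defaults are the other way round there: in Python . does not match a newline unless re.DOTALL is given, and ^/$ only match at line boundaries with re.MULTILINE. So, as an approximation, Tcl without -line behaves like DOTALL alone, and Tcl with -line behaves like MULTILINE alone.

```python
import re

value = "hi this is venkat345\nhi this is venkat435\nhi this is venkat567"

# Roughly Tcl's default: "." matches newline, "$" matches only at the end.
no_line = re.findall(r"venkat.+$", value, flags=re.DOTALL)

# Roughly Tcl's -line: "." stops at newline, "$" matches before each newline.
with_line = re.findall(r"venkat.+$", value, flags=re.MULTILINE)

print(no_line)    # one long match spanning all three lines
print(with_line)  # one match per line
```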

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to do what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at the start or end of the string. Do note that \b also matches next to non-word characters that you might consider part of a word, such as a hyphen:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
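To see the difference between the two patterns without firing up Perl, here is a hedged sketch in Python (a stand-in language; its \b, \w and lookaround behave closely enough for this purpose, and the sample strings are my own). It also shows that a pattern without parentheses matches but captures nothing, which is the original problem.

```python
import re

line = "x = 1\n"

# Without parentheses the pattern can match, but there is no group to read.
no_group = re.search(r"\s+[a-zA-Z0-9]+\s+", line)
# With parentheses, group 1 holds the captured word.
with_group = re.search(r"\b(\w+)\b", line)

print(no_group.groups())    # empty tuple: nothing was captured
print(with_group.group(1))  # first word on the line

text = "aaa-xxx x = 1"
# \b treats the hyphen as a boundary, so both halves of aaa-xxx are found.
words_b = re.findall(r"\b(\w+)\b", text)
# The whitespace lookarounds only accept words delimited by whitespace
# or by the start/end of the string.
words_ws = re.findall(r"(?<!\S)(\w+)(?!\S)", text)

print(words_b)
print(words_ws)
```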
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

Wildcard beginning of a line in perl

How to use wildcard for beginning of a line?
Example, I want to replace abc with def.
This is what my file looks like
abc
abc
abc
hg abc
Now I want abc to be replaced only in the first 3 lines. How do I do it?
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
What condition should I put before the first space?
Thanks
What about:
s/(^ *)abc/$1def/g
(^ *) -> zero or more spaces at start of line
This will strictly replace abc with def.
Also note I've used a real space and not \s because you said "beginning of first space". \s matches more characters than only space.
You are making a couple of mistakes in your regex
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
You don't need /g (global, match as many times as possible) if you only want to replace from the beginning of the string (since that can only match once).
Inside a character class bracket most characters are literal (the exceptions are ], \, - and ^), so ['\s'] means "match whitespace or apostrophe '"
Spaces inside the regex are interpreted literally, unless the /x modifier is used (which it is not)
Quantifiers apply to whatever immediately precedes them, so \s* means "zero or more whitespace", but \s * means "exactly one whitespace, followed by zero or more spaces". Again, unless /x is used.
You do not need to supply $_ =~, since that is the variable any regex uses unless otherwise specified.
If you want to replace abc, and only abc when it is the first non-whitespace in a line, you can do this:
s/^\s*\Kabc/def/
An alternate for the \K (keep) escape is to capture and put back
s/^(\s*)abc/$1def/
If you want to keep the whitespace following the target string abc, you do not need to do anything. If you want it removed, just add \s* at the end
s/^\s*\Kabc\s*/def/
Also note that this is simply a way to condense logic into one statement. You can also achieve the same by using very simple building blocks:
if (/^\s*abc/) { # if abc is the first non-whitespace
s/abc/def/; # ...substitute it
}
Since the substitution only happens once (if the /g modifier is not used), and only the first match is affected, this will reliably replace abc with def.
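As a quick way to check the capture-and-put-back idea outside Perl, here is a minimal sketch in Python (a stand-in language; Python's re has no \K, so only the capture variant is shown). The function name and the indentation in the sample lines are assumptions of mine, since the question does not show the exact whitespace.

```python
import re

def replace_leading_abc(line):
    # ^(\s*) captures the indentation so it can be put back unchanged;
    # count=1 mirrors a substitution without /g.
    return re.sub(r"^(\s*)abc", r"\g<1>def", line, count=1)

# Hypothetical sample: three lines starting with abc (with varying
# indentation) and one line where abc is not the first non-whitespace.
lines = ["abc", "  abc", "\tabc", "hg abc"]
result = [replace_leading_abc(l) for l in lines]
print(result)
```

Only the lines where abc is the first non-whitespace text should change; "hg abc" is left alone.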
If you need to match from the start of a line, use ^:
$_ =~ s/^['\s'] * abc ['\s'] * /def/g;
However, the ' and the literal spaces in your regex are matched as-is, so this still will not match a plain abc. I am not sure why you have them there. Without them, this should work for you:
$_ =~ s/^[\s]*abc[\s]*/def/g;
Use the ^ character, and remove the unnecessary apostrophes, spaces and [ ]:
$_ =~ s/^\s*abc/def/g;
If you want to keep those spaces that were before the "abc":
$_ =~ s/^(\s*)abc/$1def/g;

TCL - obtain the list of strings separated by white space in another string using regular expressions

How do I write a regexp in Tcl that matches a word and the whitespace after it? For example I have
aaaa bbbb cccc
and I want to match "aaaa ", "bbbb ", "cccc ".
Also, please tell me what the regex symbols for whitespace and non-whitespace are. I can't find them anywhere.
Thanks.
My thought would be to just search for groupings of non-whitespace characters:
set text {aaaa bbbb cccc}
regexp -all -inline {\S+} $text
> aaaa bbbb cccc
You can find the writeup for the Tcl regular expression syntax on the re_syntax man page
I'm not quite sure exactly what you want, but here's an example:
set str "aaaa bbbb cccc "
regexp {(\S+)\s+(\S+)\s+(\S+)} $str -> wordA wordB wordC
puts "The first is \"$wordA\", second \"$wordB\", and third \"$wordC\""
Which produces this output:
The first is "aaaa", second "bbbb", and third "cccc"
Within the RE, \S+ means a non-empty sequence of non-whitespace characters and \s+ means a non-empty sequence of whitespace. I could have used \w+ (“word” chars) and \W+ (“non-word” chars) respectively. The parentheses in the RE surround capture groups; Tcl does not require REs to match the whole input string.
A literal space " " in a regex matches a space character. For example, [a-z .] matches a space, a period, or a lowercase letter. To match any whitespace character use \s, and \S for any non-whitespace character.
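Since the question asks for each word together with its trailing whitespace, a pattern like \S+\s* may be closer to what is wanted than \S+ alone. Here is a small sketch in Python (a stand-in for Tcl; \s and \S mean the same thing in both, but the pattern with \s* is my own suggestion and not taken from the answers above):

```python
import re

text = "aaaa bbbb cccc "

# \S+ : runs of non-whitespace only.
words = re.findall(r"\S+", text)
# \S+\s* : each word plus whatever whitespace follows it.
words_with_space = re.findall(r"\S+\s*", text)

print(words)
print(words_with_space)
```

In Tcl the same pattern could presumably be passed to regexp -all -inline, as in the first answer.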

Insertion with Regex to format a date (Perl)

Suppose I have a string 04032010.
I want it to be 04/03/2010. How would I insert the slashes with a regex?
To do this with a regex, try the following:
my $var = "04032010";
$var =~ s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
print $var;
The \d means match a single digit, and {n} means match the preceding item exactly n times. Combined you get \d{2} to match two digits or \d{4} to match four digits. By surrounding each set in parentheses the match will be stored in a variable: $1, $2, $3, etc.
Some of the prior answers used a . to match. This is not a good thing because it'll match any character. The one we've built here is much more strict in what it'll accept.
You'll notice I used extra spacing in the regex. I used the x modifier to tell the engine to ignore whitespace in my regex, which can make the regex a bit more readable.
Compare s{(\d{2})(\d{2})(\d{4})}{$1/$2/$3}x; vs s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
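For comparison, the same idea can be sketched in Python (not Perl; re.VERBOSE plays the role of the x modifier, and the function names are invented for this example). The second function uses dots instead of \d to illustrate why the stricter pattern is preferable.

```python
import re

# Verbose mode ignores whitespace and allows comments inside the pattern.
DATE_RE = re.compile(r"""
    (\d{2})   # first two digits
    (\d{2})   # next two digits
    (\d{4})   # four-digit year
""", re.VERBOSE)

def add_slashes(s):
    # Strict: only rewrites a run of eight digits.
    return DATE_RE.sub(r"\1/\2/\3", s)

def add_slashes_loose(s):
    # Loose: "." accepts any character, so non-dates get rewritten too.
    return re.sub(r"(..)(..)(....)", r"\1/\2/\3", s)

print(add_slashes("04032010"))
print(add_slashes("ab03cd10"))        # left unchanged
print(add_slashes_loose("ab03cd10"))  # slashes inserted anyway
```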
Well, a regular expression just matches, but you can try something like this:
s/(..)(..)(....)/$1\/$2\/$3/
#!/usr/bin/perl
$var = "04032010";
$var =~ s/(..)(..)(....)/$1\/$2\/$3/;
print $var, "\n";
Works for me:
$ perl perltest
04/03/2010
I always prefer to use a different delimiter if / is involved so I would go for
s| (\d\d) (\d\d) |$1/$2/|x ;