How to check for multiple special characters using Regex in Perl - regex

I am a beginner in Perl and I am trying to find out how to search for multiple special characters in a file using a regex. Basically, I have a closing tag /> which I need to verify in a file. I have read that special characters need to be preceded by '\'. But I have two special characters together and I am not sure how to write this check.
I am using something like the line below, with /> between the regex delimiters, but it's not working:
$line =~ /\/>/
Could someone help me with this pattern matching using Regex?

I believe \Q..\E is your friend here.
/text\Q$literal\Etext/
/\Q.*\Etext/
\Q..\E is listed as quoting (disabling) pattern metacharacters until \E. Variables are still interpolated, but the interpolated text, like everything else between \Q and \E, is treated as literal symbols, so there is no need to escape the usual regex special characters by hand.
Note that in /> only the / needs attention, and only because it is also the pattern delimiter; > is not a metacharacter. Either escape it, as in /\/>/, or pick another delimiter, as in m{/>}.
http://perldoc.perl.org/perlre.html
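For readers who want to try the idea outside Perl, here is a minimal sketch in Python, where re.escape plays roughly the role of \Q..\E. The helper name has_closing_tag and the sample lines are made up for illustration; this is not the asker's code.

```python
import re

closing_tag = "/>"  # the literal text to look for

# re.escape quotes any regex metacharacters in the literal, much like \Q...\E
pattern = re.compile(re.escape(closing_tag))


def has_closing_tag(line: str) -> bool:
    """Return True if the line contains the literal closing tag."""
    return pattern.search(line) is not None


print(has_closing_tag('<img src="a.png" />'))  # True
print(has_closing_tag("<p>text</p>"))          # False
```

Python has no pattern delimiter, so the / needs no special treatment there; in Perl it still does, as noted above.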

Related

vim search replace including newline

I've been googling for 3 hours now without success.
I have a huge file which is a concatenation of many XML files.
Thus I want to search and replace all occurrences of <?xml [whatever is between]<body> (including those words).
And then the same for </body>[whatever is between]</html> (including those words).
The closest I have come is
:%s/<?xml \(.*\n\)\{0,180\}\/head>//g
FYI If I try this:
:%s/<?xml \(\(.*\)\+\n\)\+\/head>\n//g
I get an E363: pattern uses more memory than 'maxmempattern'.
I've tried to follow this without success.
To match any number of symbols including a newline between <?xml and <body>, you can use
:%s/<?xml \_.*<\/head>//g
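\_. matches any symbol including a newline, so \_.* matches any run of them. Because * is greedy, on a concatenated file the match runs from the first <?xml to the last </head>. To match as few symbols as possible, use \{-} instead of *: :%s/<?xml \_.\{-}<\/head>//g.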
See Vim wiki, Patterns including end-of-line section:
\_. Any character including a newline
And from the Vim regex help, 4.3 Quantifiers, Greedy and Non-Greedy section:
\{-}
matches 0 or more of the preceding atom, as few as possible
UPDATE
As far as escaping regex metacharacters is concerned, you can refer to Vim Regular Expression Special Characters: To Escape or Not To Escape help page. You can see } is missing on the list. Why? Because a regex engine is usually able to tell what kind of } it is from the context. It knows if it is preceded with \{ or not, and can parse the expression correctly. Thus, there is no reason to escape this closing brace, which keeps the pattern "clean".
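The difference between the greedy and the non-greedy form is easy to see on a small sample. The following is only a sketch in Python rather than Vim: re.DOTALL stands in for \_. and *? stands in for \{-}. The sample text is invented.

```python
import re

# Two concatenated documents, as in the question (contents are made up)
text = (
    '<?xml version="1.0"?>\n<html>\n<head>\n<title>one</title>\n</head>\n'
    "<body>first</body>\n"
    '<?xml version="1.0"?>\n<html>\n<head>\n<title>two</title>\n</head>\n'
    "<body>second</body>\n"
)

# Greedy: one match from the first "<?xml" to the LAST "</head>"
greedy = re.sub(r"<\?xml .*</head>", "", text, flags=re.DOTALL)

# Non-greedy: one match per document, each ending at the nearest "</head>"
stripped = re.sub(r"<\?xml .*?</head>", "", text, flags=re.DOTALL)

print(repr(greedy))
print(repr(stripped))
```

The greedy version also deletes the first body, which is why the non-greedy quantifier is the one to use on a concatenated file.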

Why does this sed command output "[18" instead of "18"?

echo [18%] | sed s:[\[%\]]::g
I'm really confused by this, because the same exact pattern successfully replaces [18%] in vim. I've also tested the expression in a few online regex tools and they all say that it will match on the [, %, and ] as intended. I have tried adding the -r option as well as surrounding the substitution command in quotes.
I know that there are other commands that I could use to accomplish this task, but I want to know why it is behaving this way so I can get a better understanding of sed.
$ echo [18%] | sed s:[][%]::g
18
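sed supports POSIX.2 regular expression syntax: basic (BRE) syntax by default, extended syntax with the -r (or -E) flag. In POSIX.2 syntax, basic or extended, you include a right square bracket by making it the first character in the character class. Backslashes do not help: inside a bracket expression a backslash is just another literal character. With the command quoted, [\[%\]] is therefore read as the bracket expression [\[%\] (a backslash, [ or %) followed by a literal ], and the only place that matches in [18%] is %], which leaves [18. Unquoted, the shell removes the backslashes first and sed sees [[%]], with the same outcome.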
This is annoying because almost every other modern language and tool uses Perl or Perl-like regex syntax. POSIX syntax is an anachronism.
You can read about the POSIX.2 syntax in the regex(7) man page.
A bracket expression is a list of characters enclosed in "[]". It normally
matches any single character from the list (but see below). If the list begins
with '^', it matches any single character (but see below) not from the rest of
the list. If two characters in the list are separated by '-', this is shorthand
for the full range of characters between those two (inclusive) in the collating
sequence, for example, "[0-9]" in ASCII matches any decimal digit. It is illegal(!)
for two ranges to share an endpoint, for example, "a-c-e". Ranges are
very collating-sequence-dependent, and portable programs should avoid relying on
them.
To include a literal ']' in the list, make it the first character (following a
possible '^'). To include a literal '-', make it the first or last character, or
the second endpoint of a range. To use a literal '-' as the first endpoint of a
range, enclose it in "[." and ".]" to make it a collating element (see below).
With the exception of these and some combinations using '[' (see next paragraphs),
all other special characters, including '\', lose their special significance
within a bracket expression.
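To see why the original command prints [18, it can help to spell out how a POSIX engine reads each pattern. The sketch below uses Python's re module, which is not a POSIX engine, so both patterns are translations written by hand into Python syntax, not the sed patterns themselves.

```python
import re

text = "[18%]"

# How POSIX reads [\[%\]]: a bracket expression holding backslash, "[" and "%"
# (closed by the first "]"), followed by one literal "]".
as_sed_reads_it = re.compile(r"[\\\[%]\]")

# How POSIX reads [][%]: a single bracket expression holding "]", "[" and "%".
fixed = re.compile(r"[\]\[%]")

print(as_sed_reads_it.sub("", text))  # [18
print(fixed.sub("", text))            # 18
```

In the first pattern only the pair %] satisfies "one of \ [ % followed by ]", so the leading [ survives.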

Basic Vim - Search and Replace text bounded by specific characters

Say I wanted to replace :
"Christoph Waltz" = "That's a Bingo";
"Gary Coleman" = "What are you talking about, dear Willis?";
to just have :
"Christoph Waltz"
"Gary Coleman"
i.e. I want to remove all the characters including and after the = and the ;
I thought the regex for finding the pattern would be \=.*?\;. In vim, I tried :
:%s/\=.*?\;$//g
but it gave me an Invalid Command error and Nothing after \=. How do I remove the above text? Apologies, but I'm new to this.
Vim's regular expression dialect is different; its escaping is optimized for text searches. See :help perl-patterns for a comparison with Perl regular expressions. As @EvergreenTree has noted, you can influence the escaping behavior with special atoms; cf. :help /\v.
For your example, the non-greedy match is .\{-}, not .*?, and, as mentioned, you mustn't escape several literal characters:
:%s/ =.\{-};$//
(The /g flag is superfluous, too; there can be only one match anchored to the end via $.)
This is because of Vim's unusual default handling of regexes. Instead of treating \= as a literal =, Vim interprets it as a quantifier (match 0 or 1 of the preceding atom). To make Vim's regex syntax behave more like other engines, you can prefix the pattern with \v, which is "very magic" mode. This should do the trick:
%s/\v\=.*\;$//g
I won't explain how Vim interprets every single character in very magic mode here, but you can read about it in this help topic:
:help /magic
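For comparison, here is the same substitution as a small Python sketch. Python uses the Perl-style lazy quantifier .*? that the question started from, and re.MULTILINE makes $ match at the end of each line, as it does in Vim. The variable names are arbitrary.

```python
import re

text = (
    '"Christoph Waltz" = "That\'s a Bingo";\n'
    '"Gary Coleman" = "What are you talking about, dear Willis?";\n'
)

# " =" starts the match, ".*?" is lazy (Vim: .\{-}), ";$" anchors at each line end
cleaned = re.sub(r" =.*?;$", "", text, flags=re.MULTILINE)

print(cleaned)
```

Neither = nor ; needs escaping here, which mirrors the advice above not to escape ordinary literal characters.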

Specific search pattern using regex

I would like to search for a pattern in following type of strings.
I have both of these patterns
"<deliveries!ntg5!intel!api!ntg5!avt!tuner!src>CDAVTTunerTVProxy.cpp"
and
"<.>api/sys/mocca/pf/comm/component/src\HBServices.hpp"
I would like to extract the file names from the patterns above
I tried the following
if(m/(\|>[0-9a-zA-Z_]\.cpp"$|\.hpp"$|\.h"$|\.c")$/){
The above expression does not match file names of the form >xxxxx.cpp (or .hpp, .h, .c).
Any idea would be of great help.
There are a few mistakes in your regex
if(m/(\|>[0-9a-zA-Z_]\.cpp"$|\.hpp"$|\.h"$|\.c")$/){
I assume that \|> is supposed to match either \ or >, but this is incorrect. It will try to match a pipe | followed by >. Backslash is used to escape characters, and so if you want to match a literal backslash, you need to escape it: \\. This is the wrong way to use an alternation, though (see more below), and there is a better way, which is to use a character class: [\\>].
[0-9a-zA-Z_] is a character class that is represented by \w, so it makes sense to use that instead to make your regex more readable. Also, you are only matching one character. If you want to match more than that, you need to supply a quantifier, such as +, which is suitable in this case. The quantifier + means to match 1 or more times.
Your alternations | are mixed up. Unless you group them properly, each alternative spans the entire pattern inside the enclosing parentheses. Your regex as it is now would capture strings like:
|>A.cpp"
.hpp"
.c"
Which is not what you want. If you want to apply the different extensions to the main file name body, you have to group the alternate extensions properly:
\w+\.(?:cpp|hpp|h|c)"$
Non-capturing parentheses (?: ... ) are suitable for grouping. As you can also see, there is no need to repeat the parts of the string which are identical for all extensions.
So what do we end up with?
/([\\>]\w+\.(?:cpp|hpp|h|c)")$/
Although I do not think that you really want to include the leading [\\>] in the match, or the trailing ". So more properly it would be
/[\\>](\w+\.(?:cpp|hpp|h|c))"$/
Note that as I said in the comment, there is a module to use if these are paths, and you want to extract the file name. File::Basename is included in Perl core since version 5.
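The final pattern can be checked against the two sample strings from the question. This is a sketch in Python, whose syntax for this particular pattern is the same as Perl's; the function name extract_file_name is invented, and the sketch assumes the double quotes are part of each string, as shown in the question.

```python
import re

# Same pattern as above: a leading "\" or ">", the captured file name, a closing quote
FILE_NAME = re.compile(r'[\\>](\w+\.(?:cpp|hpp|h|c))"$')

samples = [
    '"<deliveries!ntg5!intel!api!ntg5!avt!tuner!src>CDAVTTunerTVProxy.cpp"',
    '"<.>api/sys/mocca/pf/comm/component/src\\HBServices.hpp"',
]


def extract_file_name(s):
    """Return the captured file name, or None if the string does not match."""
    m = FILE_NAME.search(s)
    return m.group(1) if m else None


for s in samples:
    print(extract_file_name(s))
```

Only group 1 is returned, so neither the leading separator nor the trailing quote ends up in the result.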
Please try this regex:
m/([0-9a-zA-Z_]+\.(?:cpp|hpp|h|c))$/
This one is looking for the extension cpp, hpp, h or c at the end of the string(using $) and then looking for the file name just before the period(.) with extension.

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L. It is not part of POSIX, but GNU sed supports it as an extension in the replacement:
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate [1], there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example [2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*\.m4a)/$1\L$2\E$3/' *
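In Perl, there's also tr, which lowercases every ASCII letter in the string rather than a single group (the ranges are written without square brackets):
$string =~ tr/A-Z/a-z/;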
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.
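As one illustration of the callback approach, here is a sketch using Python's re.sub, which accepts a function as the replacement. It reuses the file name from the rename example above; the function name lowercase_group is made up.

```python
import re


def lowercase_group(match):
    """Rebuild the match with only the middle group lowercased."""
    return match.group(1) + match.group(2).lower() + match.group(3)


name = "artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a"

# Same three groups as the rename example: prefix, part to lowercase, suffix
renamed = re.sub(r"^(.*_-_)(.*)(_-_.*\.m4a)$", lowercase_group, name)

print(renamed)
# artist_-_album_-_song_title_to_be_lowercased_-_MultiCaseHash.m4a
```

The callback receives the match object, so any group can be transformed independently, which covers engines that lack \L and \E in the replacement string.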