Perl regex with exclamation marks - regex

How do you define/explain this Perl regex:
$para =~ s!//!/!g;
I know the s means search, and g means global (search), but not sure how the exclamation marks ! and extra slashes / fit in (as I thought the pattern would look more like s/abc/def/g).

Perl's regex operators s, m and tr (though it's not really a regex operator) allow you to use any symbol as your delimiter.
What this means is that you don't have to use /; you could use !, as in your question.
# the regex
s!//!/!g
means search and replace all instances of '//' with '/'
you could write the same thing as
s/\/\//\//g
or
s#//#/#g
or
s{//}{/}g
if you really wanted but as you can see the first one, with all the backslashes, is very hard to understand and much more cumbersome.
More information can be found in the perldoc's perlre

The substitution regex (and other regex operators, like m///) can take any punctuation character as delimiter. This saves you the trouble of escaping meta characters inside the regex.
If you want to replace slashes, it would be awkward to write:
s/\/\//\//g;
Which is why you can write
s!//!/!g;
...instead. See http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
And no, s/// is the substitution. m/// is the search, though I do believe the intended mnemonic is "match".

The exclamation marks are the delimiter; perl lets you choose any character you want, within reason. The statement is equivalent to the (much uglier) s/\/\//\//g — that is, it replaces // with /.
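For readers more at home in Python, here is a rough sketch of what the substitution itself does. Python has no delimiter to choose (the pattern is just a string), and the variable name para and its sample value are assumptions for illustration only.

```python
import re

# Hypothetical sample value; the question does not show what $para holds.
para = "/usr//local//bin"

# Same effect as Perl's  $para =~ s!//!/!g;  every "//" becomes "/".
cleaned = re.sub(r"//", "/", para)

print(cleaned)  # /usr/local/bin
```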

Related

matching cond in perl using double exclamation

if ($a =~ m!^$var/!)
$var is a key in a two dimensional hash and $a is a key in another hash.
What is the meaning of this expressions?
This is a regular expression ("regex"), where the ! character is used as the delimiter for the pattern that is to be matched in the string that it binds to via the =~ operator (the $a† here).
It may clear it up to consider the same regex with the usual delimiter instead, $a =~ /^$var\// (then m may be omitted); but now any / used in the pattern clearly must be escaped. To avoid that unsightly and noisy \/ combo one often uses another character for the delimiter, as nearly any character may be used (my favorite is the curlies, m{^$var/}). ‡ §
This regex in the question tests whether the value in the variable $a begins with (by ^ anchor) the value of the variable $var followed by / (variables are evaluated and the result used). §
† Not a good choice for a variable name since $a and $b are used by the builtin sort
‡ With the pattern prepared ahead of time the delimiter isn't even needed
my $re = qr{^$var/};
if ($string =~ $re) ...
(but I do like to still use // then, finding it clearer altogether)
Above I use qr but a simple q() would work just fine (while I absolutely recommend qr). These take nearly any characters for the delimiter, as well.
§ Inside a pattern the evaluated variables are used as regex patterns, which is wrong in general (when this is intended they should be compiled using qr and thus used as subpatterns).
An unimaginative example: a variable $var = q(\s) (literal backslash followed by letter s) evaluated inside a pattern yields the \s sequence which is then treated as a regex pattern, for whitespace. (Presumably unintended; we just wanted \ and s.)
This is remedied by using quotemeta, /\Q$var\E/, so that possible metacharacters in $var are escaped; this results in the correct pattern for the literal characters, \\s. So a correct way to write the pattern is m{^\Q$var\E/}.
Failure to do this also allows the injection bug. Thanks to ikegami for commenting on this.
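The same pitfall can be sketched in Python, where re.escape plays roughly the role of Perl's quotemeta. This is an illustration under assumptions, not the asker's code: the names var and key and their values are made up.

```python
import re

var = r"\s"        # two literal characters: a backslash and the letter s
key = r"\s/rest"   # hypothetical hash key that starts with those characters

# Interpolated as-is, \s is read as "any whitespace", not as backslash + s.
unescaped = re.compile("^" + var + "/")

# Escaped first, the pattern matches the literal characters, like ^\Q$var\E/.
escaped = re.compile("^" + re.escape(var) + "/")

print(bool(unescaped.search(key)))      # False: key does not start with whitespace
print(bool(unescaped.search(" /rest")))  # True: a leading space satisfies \s
print(bool(escaped.search(key)))        # True: literal backslash + s is matched
```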
The match operator (m/.../) is one of Perl's "quote-like" operators. The standard usage is to use slashes before and after the regex that goes in the middle of the operator (and if you use slashes, then you can omit the m from the start of the operator). But if the regex itself contains a slash then it is convenient to use a different delimiter instead to avoid having to escape the embedded slash. In your example, the author has decided to use exclamation marks, but any non-whitespace character can be used.
Many Perl operators work like this - m/.../, s/.../.../, tr/.../.../, q/.../, qq/.../, qr/.../, qw/.../, qx/.../ (I've probably forgotten some).

RegEx to match string between delimiters or at the beginning or end

I am processing a CSV file and want to search and replace strings as long as it is an exact match in the column. For example:
xxx,Apple,Green Apple,xxx,xxx
Apple,xxx,xxx,Apple,xxx
xxx,xxx,Fruit/Apple,xxx,Apple
I want to replace 'Apple' if it is the EXACT value in the column (if it is contained in text within another column, I do not want to replace). I cannot see how to do this with a single expression (maybe not possible?).
The desired output is:
xxx,GRAPE,Green Apple,xxx,xxx
GRAPE,xxx,xxx,GRAPE,xxx
xxx,xxx,Fruit/Apple,xxx,GRAPE
So the expression I want is: match the beginning of input OR a comma, followed by desired string, followed by a comma OR the end of input.
You cannot put ^ or $ in character classes, so I tried \A and \Z but that didn't work.
([\A,])Apple([\Z,])
This didn't work, sadly. Can I do this with one regular expression? Seems like this would be a common enough problem.
It will depend on your language, but if the one you use supports lookarounds, then you would use something like this:
(?<=,|^)Apple(?=,|$)
Replace with GRAPE.
Otherwise, you will have to put back the commas:
(^|,)Apple(,|$)
Or
(\A|,)Apple(,|\Z)
And replace with:
\1GRAPE\2
Or
$1GRAPE$2
Depending on what's supported.
The above are raw regex (and replacement) strings. Escape as necessary.
Note: The disadvantage of the latter solution is that it will not work on strings like:
xxx,Apple,Apple,xxx,xxx
Since the comma after the first Apple got consumed. You'd have to call the regex replacement at most twice if you have such cases.
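The consumed-comma problem is easy to reproduce. The sketch below uses Python's re module purely as a test bed; the same behaviour is expected from any engine that resumes searching after the end of the previous match.

```python
import re

pattern = re.compile(r"(^|,)Apple(,|$)")
line = "xxx,Apple,Apple,xxx,xxx"

# The first match eats the comma that the second Apple would need in front of it.
first_pass = pattern.sub(r"\1GRAPE\2", line)

# A second call picks up the column that was skipped.
second_pass = pattern.sub(r"\1GRAPE\2", first_pass)

print(first_pass)   # xxx,GRAPE,Apple,xxx,xxx
print(second_pass)  # xxx,GRAPE,GRAPE,xxx,xxx
```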
Oh, and I forgot to mention, you can have some 'hybrids' since some languages have different levels of support for lookbehinds (in all the below ^ and \A, $ and \Z, \1 and $1 are interchangeable, just so I don't make it longer than it already is):
(?:(?<=,)|(?<=^))Apple(?=,|$)
For those where lookbehinds cannot be of variable width, replace with GRAPE.
(^|,)Apple(?=,|$)
And the above one for where lookaheads are supported but not lookbehinds. Replace with \1GRAPE.
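As a concrete case of such a hybrid, here is a sketch in Python, whose re module only accepts fixed-width lookbehinds. The pattern is a small variation on the ones above (a bare ^ instead of a lookbehind around it), and the name exact_apple is invented for the example. Because nothing but the word itself is consumed, the replacement is just GRAPE and consecutive columns are handled in one pass.

```python
import re

# Start of string, or a comma just before; comma or end of string just after.
exact_apple = re.compile(r"(?:^|(?<=,))Apple(?=,|$)")

rows = [
    "xxx,Apple,Green Apple,xxx,xxx",
    "Apple,xxx,xxx,Apple,xxx",
    "xxx,xxx,Fruit/Apple,xxx,Apple",
    "xxx,Apple,Apple,xxx,xxx",  # consecutive columns
]

replaced = [exact_apple.sub("GRAPE", row) for row in rows]

for row in replaced:
    print(row)
```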
This does as you wish:
Find what: (^|,)(?:Apple)(,|$)
Replace with: $1GRAPE$2
This works on regex101, in all flavors.
http://regex101.com/r/iP6dZ8
I wanted to share my original work-around (before the other answers), though it feels like more of a hack.
I simply prepend and append a comma on the string before doing the simpler:
/,Apple,/,GRAPE,/g
then cut off the first and last character.
PHP looks like:
$line = substr(preg_replace($search, $replace, ','.$line.','), 1, -1);
This still suffers from the problem of consecutive columns (e.g. ",Apple,Apple,").

How do I need to escape search string in vim?

I need to search and replace this:
ExecIf($["${debug}" = "1"]?NoOp
with this:
GoSub(chanlog,s,1(1,[${CHANNEL}]
I can't seem to do it in vim, and I'm not sure what needs to be escaped, as nothing I've tried works.
If you want to change a long string with lots of punctuation characters, and it's an exact match (you don't want any of them to be treated as regex syntax) you can use the nomagic option, to have the search pattern interpreted as a literal string.
:set nomagic
:%s/ExecIf($["${debug}" = "1"]?NoOp/GoSub(chanlog,s,1(1,[${CHANNEL}]/
:set magic
You still have to watch out for the delimiters (the slashes of the s/// command) but you can use any character for that, it doesn't have to be a slash, so when you have something like this and there are slashes in the search or replace string, just pick something else, like s#foo#bar# or s:bar:baz:.
If you're having problems with which characters to escape in a vim substitution (:s//), remember the nomagic concept, and in particular the nomagic version of a substitute: :snomagic// or :sno//. nomagic means: interpret almost every character literally (only ^ at the start, $ at the end, and the backslash keep a special meaning).
So this should work without worrying about escaping characters in the substitution:
:sno/ExecIf($["${debug}" = "1"]?NoOp/GoSub(chanlog,s,1(1,[${CHANNEL}]/
Get to know magic vs. nomagic, :sno//, and \v, \V:
:help magic
The "very nomagic" version of a search for your string uses \V:
/\VExecIf($["${debug}" = "1"]?NoOp
you have to escape the [] and the spaces:
:s/ExecIf($\["${debug}"\ =\ "1"\]?NoOp/GoSub(chanlog,s,1(1,\[${CHANNEL}\]/
just a bit trial and error

Regex - Multiline Problem

I think I'm burnt out, and that's why I can't see an obvious mistake. Anyway, I want the following regex:
#BIZ[.\s]*#ENDBIZ
to grab me the #BIZ tag, #ENDBIZ tag and all the text in between the tags. For example, if given some text, I want the expression to match:
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
At the moment, the regex matches nothing. What did I do wrong?
ADDITIONAL DETAILS
I'm doing the following in PHP
preg_replace('/#BIZ[.\s]*#ENDBIZ/', 'my new text', $strMultiplelines);
The dot loses its special meaning inside a character class — in other words, [.\s] means "match period or whitespace". I believe what you want is [\s\S], "match whitespace or non-whitespace".
preg_replace('/#BIZ[\s\S]*#ENDBIZ/', 'my new text', $strMultiplelines);
Edit: A bit about the dot and character classes:
By default, the dot does not match newlines. Most (all?) regex implementations have a way to specify that it match newlines as well, but it differs by implementation. The only way to match (really) any character in a compatible way is to pair a shorthand class with its negation — [\s\S], [\w\W], or [\d\D]. In my personal experience, the first seems to be most common, probably because this is used when you need to match newlines, and including \s makes it clear that you're doing so.
Also, the dot isn't the only special character which loses its meaning in character classes. In fact, the only characters which are special in character classes are ^, -, \, and ]. Check out the "Metacharacters Inside Character Classes" section of the character classes page on Regular-Expressions.info.
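The difference between the two classes can be checked quickly. This is a sketch in Python rather than PHP, but the character-class behaviour shown is the same in both; the sample text is a shortened version of the one in the question.

```python
import re

text = "#BIZ\nsome text some test\nmore text\n#ENDBIZ"

# [.\s] means "a literal period or whitespace", so the letters block the match.
broken = re.search(r"#BIZ[.\s]*#ENDBIZ", text)

# [\s\S] means "whitespace or non-whitespace", i.e. any character, newlines included.
fixed = re.search(r"#BIZ[\s\S]*#ENDBIZ", text)

print(broken)                  # None
print(fixed.group(0) == text)  # True
```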
// Replaces all of your code with "my new text", but I do not think
// this is actually what you want based on your description.
preg_replace('/#BIZ(.+?)#ENDBIZ/s', 'my new text', $contents);
// Actually "gets" the text, which is what I think you might be looking for.
preg_match('/(#BIZ)(.+?)(#ENDBIZ)/s', $contents, $matches);
list($dummy, $startTag, $data, $endTag) = $matches;
This should work
#BIZ[\s\S]*#ENDBIZ
You can try this online Regular Expression Testing Tool
The mistake is the character group [.\s] that will match a dot (not any character) or white space. You probably tried to get .* with . matching newline characters, too. You achieve this by enabling the single line option ((?s:) does this in .NET regex).
(?s:#BIZ.*?#ENDBIZ)
Depending on the environment you're using your regex in, it may need special care to properly parse multiline text, eg re.DOTALL in Python. So what environment is that?
you can use
preg_replace('/#BIZ.*?#ENDBIZ/s', 'my new text', $strMultiplelines);
the 's' modifier says "match the dot with anything, even the newline character". the '?' says don't be greedy, such as for the case of:
foo
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
bar
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
hello world
the non-greediness won't get rid of the "bar" in the middle.
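To see the greedy and non-greedy versions side by side, here is a sketch in Python, where re.DOTALL is the counterpart of the 's' modifier. The sample text is a shortened version of the one above.

```python
import re

contents = (
    "foo\n#BIZ\nsome text\n#ENDBIZ\n"
    "bar\n#BIZ\nmore text\n#ENDBIZ\nhello world\n"
)

# Greedy: runs from the first #BIZ to the last #ENDBIZ, swallowing "bar".
greedy = re.sub(r"#BIZ.*#ENDBIZ", "my new text", contents, flags=re.DOTALL)

# Non-greedy: stops at the nearest #ENDBIZ, so each block is replaced separately.
lazy = re.sub(r"#BIZ.*?#ENDBIZ", "my new text", contents, flags=re.DOTALL)

print(greedy)
print(lazy)
```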
Unless I am missing something, you handle this the same way that you would in Perl, with either the /m or /s modifier at the end? Oddly enough the other answers that rather correctly pointed this out got down voted?!
It looks like you're doing a javascript regex, you'll need to enable multiline by specifying the m flag at the end of the expression:
var re = /^deal$/mg

regular expression to split up searchphrase

I was hoping someone could help me writing a regex for c++ that matches words in a searchphrase, and explain it bit by bit for learning purposes.
What I need is a regex that matches string within " " like "Hello you all", and single words that starts/ends with * like *ack / overfl*.
For the quote part I have \"[\^\\s][\^\"]*\" but I can't figure out the wildcard (*) part, and how I should combine it with the quote regex.
Try this regular expression:
(?:\*?\w+\*?|"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*")+
For readability I replaced the backslash characters by \x5C.
The expression "(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*" will also match "foo \"bar\"" and other properly escaped quote sequences (but only the " may be escaped).
So foo* bar *baz *quux* "foo \"bar\"" should be split into:
foo*
bar
*baz
*quux*
"foo \"bar\""
If you don’t want to match bar in the example above, use this:
(?:\*\w+|\w+\*|"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*")+
As long as there is no quote nesting (nesting in general is something regex is bad at):
"(?:(?<=\\)"|[^"])*"|\*[^\s]+|[^\s]+\*
This regex allows for escaped double quotes ('\"'), though, if you need that. And the match includes the enclosing double quotes.
This regex matches:
"A string in quotes, possibly containing \"escaped quotes\""
*a_search_word_beginning_with_a_star
a_search_word_ending_with_a_star*
*a_search_word_enclosed_in_stars*
Be aware that it will break at strings like this:
A broken \"string "with the quotes all \"mangled up\""
If you expect (read: can't entirely rule out the possibility) to get these, please don't use regex, but write a small quote-aware parser. For a one-shot search and replace activity or input in a guaranteed format, the regex is okay to use.
For validating/parsing user input, it is not okay to use. That's where I would recommend a parser. Knowing the difference is the key.
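For a quick experiment with the lookbehind-based pattern above, here is a sketch using Python's re module rather than a C++ regex library, so treat it as a test bed only. The names token_re and phrase are invented for the example.

```python
import re

# Quoted string (escaped quotes allowed), or a word starting or ending with *.
token_re = re.compile(r'"(?:(?<=\\)"|[^"])*"|\*[^\s]+|[^\s]+\*')

phrase = r'foo* bar *baz *quux* "foo \"bar\""'

tokens = token_re.findall(phrase)

# The plain word bar is skipped because it carries no wildcard.
for token in tokens:
    print(token)
```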