Meaning of (?s) in regex - regex

I am very new to the regular expression arena. Recently I searched for a regular expression for PowerShell that would let me match an HTML tag, and I found the following on this site.
$content -match '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'
I have looked through all the regular expression references and books I could find (Perl and PowerShell) for the meaning of (?s), with no luck. It looks like a condition, but it is missing the "then" part.
Can someone point me in the right direction for the meaning of this?
Thanks

According to the Regular Expressions reference site, (?s) means:
Turn on "dot matches newline" for the remainder of the regular
expression. (Older regex flavors may turn it on for the entire regex.)
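
To see the effect of the flag, here is a minimal Python sketch (Python is used only for illustration; the question itself is about PowerShell, and the sample HTML is made up). It runs the pattern from the question with and without (?s). Python's re.DOTALL flag is the same setting expressed as an option.

```python
import re

# Hypothetical sample input; the table markup is invented for illustration.
html = '<p>intro</p>\n<table border="0" width="300px">\n<tr><td>cell</td></tr>\n</table>'

# The pattern from the question, minus the leading (?s).
pattern = r'<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'

# Without (?s), "." stops at newlines, so the multi-line table body cannot match.
no_flag = re.search(pattern, html)

# With (?s) at the start, "." also matches "\n".
with_flag = re.search('(?s)' + pattern, html)

# re.DOTALL expresses the same setting as a flag argument.
with_dotall = re.search(pattern, html, re.DOTALL)

print(no_flag)                     # no match object
print(repr(with_flag.group(1)))    # the table body, newlines included
```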
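
As a quantifier, "?" means 0 or 1 matches, but directly after an opening parenthesis it introduces a special construct instead, here an inline modifier. "(?s)" enables dot matching newlines. A period is normally a wildcard that will match any character, save the newline.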

Related

Basic Vim - Search and Replace text bounded by specific characters

Say I wanted to replace :
"Christoph Waltz" = "That's a Bingo";
"Gary Coleman" = "What are you talking about, dear Willis?";
to just have :
"Christoph Waltz"
"Gary Coleman"
i.e. I want to remove all the characters including and after the = and the ;
I thought the regex for finding the pattern would be \=.*?\;. In vim, I tried :
:%s/\=.*?\;$//g
but it gave me an Invalid Command error and Nothing after \=. How do I remove the above text? Apologies, but I'm new to this.
Vim's regular expression dialect is different; its escaping is optimized for text searches. See :help perl-patterns for a comparison with Perl regular expressions. As @EvergreenTree has noted, you can influence the escaping behavior with special atoms; cf. :help /\v.
For your example, the non-greedy match is .\{-}, not .*?, and, as mentioned, you mustn't escape several literal characters:
:%s/ =.\{-};$//
(The /g flag is superfluous, too; there can be only one match anchored to the end via $.)
This is because of vim's weird handling of regexes by default. Instead of interpreting \= as a literal =, it interprets it as a special regex character. To make vim's regex system work more normally, you can prefix the pattern with \v, which is "very magic" mode. This should do the trick:
:%s/\v\=.*\;$//g
I won't explain how vim interprets every single character in very magic mode here, but you can read about it in this help topic:
:help /magic
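
For comparison with Perl-style engines, here is a rough Python sketch of the same substitution (an illustration only, not a Vim command). It uses .*? for the lazy match, where Vim needs .\{-}, and re.MULTILINE so that $ anchors at the end of each line.

```python
import re

# The two sample lines from the question.
text = '''"Christoph Waltz" = "That's a Bingo";
"Gary Coleman" = "What are you talking about, dear Willis?";'''

# " =" followed by a lazy run of characters, up to a ";" at the end of the line.
cleaned = re.sub(r' =.*?;$', '', text, flags=re.MULTILINE)

print(cleaned)
```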

A regular expression that matches two long strings and ignores everything in between

I am searching through a 1.5 million line Premiere Pro project for any text that matches one of my audio filters and is set to mono.
The text I am searching for begins with the <ChannelType> tag and ends with the <FilterMatchName> tag, so it looks like this:
<ChannelType>0</ChannelType>
<FrameRate>5292000</FrameRate>
</AudioComponent>
<FilterPreset>0</FilterPreset>
<OpaqueData Encoding="base64" Checksum="53060659">AAAAAD8L8lo+AUr+Pac1NjwTmoUAAAAAP0uQDD37nIg9ui6MPjwU5j+AAAA+C/JaAAAAAD8qqqsAAAAAP4AAAD92L8w9py8FAAAAAHNvZnQgY29tcHJlc3Npb24AIiBkZWZhdWx0PSIwIiBzdGVwPSIxIiBtaW49IjAiIG1heD0iMSIvPgoJICA8Zmw=</OpaqueData>
<FilterIndex>-1</FilterIndex>
<FilterMatchName>1094998321 Dynamics1</FilterMatchName>
If I were in a Word doc, I would just do a find as
<ChannelType>0</ChannelType>*<FilterMatchName>1094998321 Dynamics1</FilterMatchName>
I am terrible with Regex. I was hoping someone could help me out. Everything I have tried either doesn't match anything, or matches EVERYTHING in the document. I am using Notepad++.
Since you are working in Notepad++, you have access to PCRE regular expressions. This one will get all the text between <ChannelType> and </FilterMatchName>
(?s)<ChannelType>.*?</FilterMatchName>
The (?s) allows the . to match newline characters. After matching <ChannelType>, the .*? lazily matches all characters up to the closing </FilterMatchName>, which we then match.
Let me know if you have any questions. :)
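
As a quick check outside Notepad++, here is a minimal Python sketch of the same idea. The project text is a shortened, made-up stand-in for the real file, and the second pattern (with the literal values from the question filled in) is an assumption about what the asker ultimately wants.

```python
import re

# Shortened, hypothetical stand-in for the Premiere Pro project file.
project = """<ChannelType>0</ChannelType>
<FrameRate>5292000</FrameRate>
<FilterIndex>-1</FilterIndex>
<FilterMatchName>1094998321 Dynamics1</FilterMatchName>
<ChannelType>1</ChannelType>
<FilterMatchName>Other</FilterMatchName>
"""

# The generic pattern from the answer: any <ChannelType> through the next </FilterMatchName>.
generic = re.findall(r'(?s)<ChannelType>.*?</FilterMatchName>', project)

# A narrower variant using the literal tag contents from the question.
specific = re.findall(
    r'(?s)<ChannelType>0</ChannelType>.*?'
    r'<FilterMatchName>1094998321 Dynamics1</FilterMatchName>',
    project,
)

print(len(generic), len(specific))
```

Note that the lazy .*? only stops at the first closing text that follows. If a mono block is followed by a different filter, the narrower pattern can run on into a later block, so the matches are worth checking by eye.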
What type of regular expressions are you using (which language/library)?
Basically you can use .* instead of * in regular expressions. If your text is long, though, it's better to use a reluctant quantifier[1] if your regex implementation allows it.
This is a good site with comparison of different re implementations and tutorials:
http://www.regular-expressions.info
[1] http://docs.oracle.com/javase/tutorial/essential/regex/quant.html

Negative lookahead alternative

For a URL pattern such as this one:
/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a
I would like Google Analytics to match only the URLs with the a and b parameters in them:
/detail.php?a=BYGhs5w8e9o&b=234844617545
And thus strip out:
&h=9827a
The main goal is to be able to set up a goal in Google Analytics which covers only the a and b parameters and ignores the h parameter.
Is there an easy way to accomplish this without a negative lookahead?
Standard regular expressions do not need negative lookahead for this. Just do a match and replace. Searching for:
(/detail\.php\?a=\w+&b=\w+)&h=\w+
and replacing with \1 works with the regular expressions in Notepad++ version 6.5.5. Google's regular expressions may be subtly different.
The above works by surrounding the wanted text with capturing parentheses and leaving the unwanted part outside. The ? needs escaping because unescaped it means the previous item (i.e. the p) is optional, and the . is escaped so that it matches only a literal dot rather than any character. The \w sequence means any "word" character, so \w+ matches one or more word characters.
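
The same match-and-replace can be sketched in Python. This only illustrates the capture-group idea; it is not Google Analytics syntax, which may differ.

```python
import re

# The sample URL from the question.
url = '/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a'

# Keep the part inside the capturing group, drop the trailing &h=... parameter.
stripped = re.sub(r'(/detail\.php\?a=\w+&b=\w+)&h=\w+', r'\1', url)

print(stripped)
```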

Is there a way to compare regular expression backreferences?

I have the following sample expression that I'm passing to egrep over a word list:
^([a-z])lu([a-z])\2er$
I'd like to further stipulate that the content of \1 and \2 must be different, e.g. this would match "bluffer" but not "blubber". Is there a way to build this into the expression itself (so I can get my results right from egrep or something like it), or am I stuck doing this in some real language with regular expression support and manually checking that none of my groups are the same?
You could add the negative lookahead (?!\1) in front of the 2nd match group. The following regex:
([a-z])lu(?!\1)([a-z])\2er
matches "bluffer" but not "blubber". This only works properly if both groups match the same number of characters.
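
A small Python sketch of this lookahead follows; the word list is made up for illustration. Plain egrep (POSIX ERE) has no lookahead, so on the command line this would need a Perl-compatible engine, for example GNU grep with -P.

```python
import re

# The lookahead (?!\1) rejects a second letter equal to the first captured letter.
pattern = re.compile(r'^([a-z])lu(?!\1)([a-z])\2er$')

# Hypothetical word list for illustration.
words = ['bluffer', 'blubber', 'fluffer', 'slugger']

matched = [w for w in words if pattern.match(w)]

print(matched)
```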
You need something more powerful. Regular expressions can't track state. Sed could probably do what you need.

RegExp: want to find all links that do not end in ".html"

I'm a relative novice to regular expressions (although I've used them many times successfully).
I want to find all links in a document that do not end in ".html"
The regular expression I came up with is:
href=\"([^"]*)(?<!html)\"
In Notepad++, my editor, href=\"([^"]*)\" finds all the links (both those that end in "html" and those that do not).
Why doesn't negative lookbehind work?
I've also tried lookahead:
href=\"[^"]*(?!html\")
but that didn't work either.
Can anybody help?
Cheers, grovel
That regular expression would work fine if you were using Perl or PCRE (e.g. preg_match in PHP). However, lookahead and lookbehind assertions are not supported by many regular expression engines, especially the simpler ones, like the one used by Notepad++. Only the most basic syntax, such as quantifiers, subpatterns and character classes, is supported by almost all regular expression engines.
You can find the documentation for the notepad++ regular expression engine at: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
Edit: Notepad++ uses the SciTE regular expression engine, which does not support lookaround expressions.
For more info take a look here http://www.scintilla.org/SciTERegEx.html
Original Answer
^.*(?<!\.html)$
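
In an engine that does support lookbehind, the asker's first pattern behaves as intended. Here is a minimal Python sketch with a made-up document (the dot is added to the lookbehind so that only ".html" endings are excluded). It also shows why the lookahead attempt cannot work: [^"]* runs up to the closing quote first, and at that position the text ahead is never html".

```python
import re

# Hypothetical sample document for illustration.
doc = ('<a href="index.html">a</a> <a href="notes.txt">b</a> '
       '<a href="img/logo.png">c</a> <a href="old.htm">d</a>')

# Negative lookbehind: the closing quote must not be preceded by ".html".
not_html = re.findall(r'href="([^"]*)(?<!\.html)"', doc)

# The lookahead attempt from the question still matches an .html link,
# because the assertion is checked only after [^"]* has consumed the name.
lookahead_hit = re.search(r'href="[^"]*(?!html")', 'href="index.html"')

print(not_html)
print(lookahead_hit is not None)
```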
You can make a regexp that does it, but it would probably be too complex:
href=\"((([^"]*)([^h"][^"][^"][^"]|[^t"][^"][^"]|[^m"][^"]|[^l]))|([^"]|)([^"]|)([^"]|))\"
Thank you all very much.
In the end the regular expression did indeed not work.
I simply used a workaround, and replaced all links with themselves+".html", then replaced all occurrences of ".html.html" with ".html".
So I replaced href=\"([^"]*)\" with href="\1.html" and then .html.html with .html
Thanks anyway, grovel
Note that Notepad++ (now?) supports assertions like this. (I have Notepad++ 6.3, dated Feb 3 2012.)
I believe the Regular Expressions documentation implies that both replace-variants use the same PCRE-dialect:
standard: Search | Replace (default shortcut Ctrl H)
plugin: TextFX | TextFX Quick | Find/Replace (default shortcut Ctrl R)