What's the difference between these two regular expressions? [duplicate] - regex

The short version of the question is: why do these two regexes behave differently? i.e.,
href=(['"]).+?\1
vs
href=(['"]).+?['"] or href=(['"]).+?(['"])
I am practicing regex on this site and I am trying to solve this level
http://play.inginf.units.it/#/level/6
I am posting the entire content here in case the site goes down in future.
<tr>
<a href="javascript:openurl('/Xplore/accessinfo.jsp')" class="topUnderlineLinks">
PDF(3141 KB)
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
<td width="33%" ><div align="right"> Help <a href="/xpl/contactus.jsp" class="subNavLinks">Contact
Kimya ile ilgili çeşitli temel referans
<a href="http://search.epnet.com/login.asp?profile=web&defaultdb=geh"
<a href="http://iimpft.chadwyck.com/" target="_parent">International
NFPA Standartları
Project Gutenberg
<a href="http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&uicode=istanbultek"
Scitation
dergilerin listesini görmek için bu yolu
<a href="http://www3.interscience.wiley.com/journalfinder.html"
<td width="46%"><a href="/xpl/periodicals.jsp" class="dropDownNav" accesskey="j">Journals & Magazines
<td>IEEE Xplore Demo</td>
| Alerts
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
Abstract
<td>View Session History</td>
<td>New Search</td>
<a href="http://web5s.silverplatter.com/webspirs/start.ws?customer=kaynak"
Türk Standartları
Web of Science
<a href='deneme.html#bg'>Butler Group </a>veritabanına 31 Mart 2007 tarihine kadar deneme erişimi alınmıştır. <span class="tarih">(19.03.2007)</span>
<a href='deneme.html#ps'>Productscan</a> veritabanına 31 Mart 2007 tarihine kadar deneme erişimi alınmıştır. <span class="tarih">(19.03.2007)</span>
I am supposed to match text like this
href="history.jsp"
That is, I need to match every href in the text above.
According to the Solutions page, the answer is href=(['"]).+?\1
But if I drop that trailing backreference and repeat the group instead (I believe the parentheses are called a group; correct me if I am wrong), I get different, wrong results. That is, neither href=(['"]).+?['"] nor href=(['"]).+?(['"]) gives the expected matches. Why?

The backreference has to match the same thing that the capture group matched. So the first regexp will match
"abcd"
or
'abcd'
The second version doesn't link the two ends of the match, so it will match the following as well:
"abcd'
or
'abcd"
So the version with the back-reference only matches a string surrounded by the same types of quotes.
This difference is important if you have embedded quotes in a string, e.g.
some text "<div id='foo'>" more text
The version with the back-reference will match "<div id='foo'>", but the version without the back-reference will match "<div id='.
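As a rough illustration, the embedded-quote case can be reproduced in Python with the standard re module. Only the two patterns and the sample string come from the discussion above; the variable names are made up for this sketch.

```python
import re

# Sample string from the answer: double quotes wrapping single quotes.
text = 'some text "<div id=\'foo\'>" more text'

# \1 must repeat whatever quote character group 1 captured.
with_backref = re.compile(r"""(['"]).+?\1""")
# Here the closing quote may be either kind, independent of the opening one.
without_backref = re.compile(r"""(['"]).+?['"]""")

matched_with = with_backref.search(text).group(0)
matched_without = without_backref.search(text).group(0)

print(matched_with)     # "<div id='foo'>"
print(matched_without)  # "<div id='
```

The second pattern stops at the first quote of either kind, which is why it cuts the attribute short.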

The regex snippet (['"]).+?\1 captures the opening quote with (...) and refers back to it later with \1. That means that 'xyzzy' or "plugh" will match but not 'twisty".
That's probably the correct form since, with (['"]).+?['"], it can open and close with either quote.
As an aside, there's little point capturing the groups in your latter expression, unless you're going to use them in the code somehow. If you capture both, you could check to ensure they're identical but that's probably best handled by the use of the back-reference version.
In other words, if you wanted to allow something like 'twisty", all you need is ['"].+?['"].
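A small Python sketch of the three variants discussed in this answer, using the standard re module. The helper quotes_agree is hypothetical and only shows what "capture both and compare them" would look like in code; fullmatch is used so that each sample string has to match in its entirety.

```python
import re

same_quote = re.compile(r"""(['"]).+?\1""")       # back-reference version
any_quote = re.compile(r"""['"].+?['"]""")        # no groups at all
two_groups = re.compile(r"""(['"]).+?(['"])""")   # both ends captured


def quotes_agree(s):
    """Hypothetical helper: match with two groups, then compare them by hand."""
    m = two_groups.fullmatch(s)
    return m is not None and m.group(1) == m.group(2)


samples = ["'xyzzy'", '"plugh"', "'twisty\""]
for s in samples:
    print(s, bool(same_quote.fullmatch(s)), bool(any_quote.fullmatch(s)), quotes_agree(s))
```

The manual comparison gives the same verdicts as the back-reference on these samples, which is why the back-reference form is the simpler choice.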

Related

Sublime text replace pattern with regex

Using Sublime Text 3 I am trying to find all instances of a <span> element where the class value is not enclosed in quotes – e.g. <span class=foo> – and I want to wrap the class value in quotes.
The following is not working as expected as a search + replace with the regex option activated:
Find what: <span class=[A-Za-z0-9]*>
Replace with: <span class="$1">
The result I am getting (which I don't want) is <span class="">
Highlighting shows that the search term is correctly matching what I want but the $1 part where I want to insert the previously captured pattern does not work. I have also tried \1 in the replace pattern.
What is wrong with my syntax?
The answer was supplied as a comment: the pattern to be captured was not wrapped in parentheses, so there was no group 1 for $1 to refer to.
Tell it what you want to (capture): <span class=([A-Za-z0-9]*)>
Alex K.
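The same fix can be sketched outside the editor with Python's re module. This is an illustration only: Python writes the group reference as \1 in the replacement, and the sample string and variable names are invented for the example.

```python
import re

html = "<span class=foo> and <span class=bar2>"

# Without parentheses the pattern has no capture groups to refer back to.
no_group = re.compile(r"<span class=[A-Za-z0-9]*>")
# With parentheses the class value becomes group 1.
with_group = re.compile(r"<span class=([A-Za-z0-9]*)>")

print(no_group.groups)    # number of capture groups in the first pattern
print(with_group.groups)  # number of capture groups in the second pattern

quoted = with_group.sub(r'<span class="\1">', html)
print(quoted)  # <span class="foo"> and <span class="bar2">
```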

regex substitute two patterns in one match

I'm trying to do a find/replace in Notepad++ where the string is similar to
<span class="CharOverride-1">Q</span>
With a single replace command I'd like the result to be
<span class="somethingNew">somethingElse</span>
This matches the two things I want replaced but I don't know how to form the substitution
(?<=<span class="(CharOverride-1)">)(Q)(?=<\/span>)
If possible I'd like to avoid doing something like this
(<span class=")(CharOverride-1)(">)(Q)(<\/span>)
and
\1somethingNew\3somethingElse\5
You can simply use three capture groups:
Search:
(<span class=").*?(">).*?(</span>)
Replace:
\1somethingNew\2somethingElse\3
Don't forget to check the "regular expression" checkbox.
But if I can give you a piece of very personal advice: don't use Notepad++...
The regular expression (?<=<span class=")CharOverride-1">Q(?=<\/span>) uses lookahead and lookbehind to find the string CharOverride-1">Q, but only where it follows the string <span class=" and is followed by </span>. Use somethingNew">somethingElse as the replacement string.
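Both answers can be tried side by side in Python's re module; this is only a sketch, since Notepad++ uses a different regex engine, but these particular patterns behave the same way here. The variable names are invented for the example.

```python
import re

html = '<span class="CharOverride-1">Q</span>'

# Approach 1: keep the fixed parts in three groups and re-emit them.
three_groups = re.sub(
    r'(<span class=").*?(">).*?(</span>)',
    r'\1somethingNew\2somethingElse\3',
    html,
)

# Approach 2: match only the changing middle part, anchored by lookarounds.
lookaround = re.sub(
    r'(?<=<span class=")CharOverride-1">Q(?=</span>)',
    'somethingNew">somethingElse',
    html,
)

print(three_groups)  # <span class="somethingNew">somethingElse</span>
print(lookaround)    # <span class="somethingNew">somethingElse</span>
```

The first pattern replaces any class value and any element content, while the second only touches the exact string CharOverride-1">Q.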

Using regex in Find/Replace to replace multiple matches

I'm using Sublime Text, and I want to use Find/Replace to convert HTML to Markdown. One problem I ran into is how to replace multiple matches.
The HTML is below:
<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>
And I want to change it to
><p> text 1 </p>
><p> text 2 </p>
><p> text 3 </p>
><p> text 4 </p>
I use
<blockquote>\n(^.+$\n)+?.+</blockquote>
to capture the p tags within the blockquote. But how do I write the replacement?
Thanks a lot.
I have tested this only against your simple test case. It may not work for more complex input, where you may need to customize the regex further.
Find what:
(?:<blockquote>\s*+|(?<!\A)(?<!</blockquote>)\G)(.*)\s++(?:</blockquote>)?
This solution removes the closing tag as it matches the last line, which fixes the caveat in my first solution, where the end tag </blockquote> was not removed.
Replace with:
\n> $1
Use regular expression mode and highlight matches to check what will be replaced.
It will strip all leading spaces, and leave only 1 space between > and the text.
The regex above is built based on my own answer to the question of solving this class of problem with regex alone: Collapse and Capture a Repeating Pattern in a Single Regex Expression.
My earlier solution is based on the second construct, while the current solution is based on the first construct. The initial solution is quoted here, in case you want to customize the regex to be more flexible with its end tag (e.g. free spacing):
(?:<blockquote>\s*+|(?!\A)\G\s++(?!</blockquote>))(.*)
You can do this in two steps.
1) Replace <blockquote>((?:(?!<\/blockquote>).)*)<\/blockquote> by $1.
See demo.
http://regex101.com/r/dZ1vT6/35
2) Replace ^\s+ by >
See demo.
http://regex101.com/r/dZ1vT6/36
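Outside an editor, the same two-step idea (pull out the blockquote body, then prefix each line with >) can be sketched in a single pass with a replacement function in Python. This swaps the find/replace dialog for re.sub with a callback and is not either answer's exact regex; the names BLOCKQUOTE and quote_lines are made up for the example.

```python
import re

html = """<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>"""

# DOTALL lets . cross line breaks so the group spans the whole body.
BLOCKQUOTE = re.compile(r"<blockquote>\s*(.*?)\s*</blockquote>", re.DOTALL)


def quote_lines(match):
    """Prefix every line of the captured body with '>'."""
    lines = match.group(1).splitlines()
    return "\n".join(">" + line.strip() for line in lines)


markdown = BLOCKQUOTE.sub(quote_lines, html)
print(markdown)
```

A callback avoids the need for \G or possessive quantifiers, which Python's re module does not provide.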

Regex to match content before string

I am trying to extract a URL from some content using Yahoo Pipes, but for that I need to match everything before the URL and everything after it:
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The URL I want is this one: http://soundcloud.com/tom-misch/dream-ft-notorious-big
I have tried to learn a bit about regex, but whenever I think I understand it, nothing I try works.
I hope some of you can help me with that!
Cheers
This will probably do. It only matches SoundCloud URLs that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and a lazy quantifier matches up to the first quote:
(http://soundcloud.*?)"
Here is an alternative that does not use a lazy quantifier; instead it uses a negated character class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexes will match both SoundCloud URLs in your sample. Depending on the library and the flags you use, you may get back only the first occurrence or both; you can take the first one or check the results further for the correct format.
If you really want to use just a regex and your regex library supports look-ahead, you can do this:
(http://soundcloud.*?)\s+(?!class="user-name")
The negative look-ahead (?!...) fails the match if the text that follows is class="user-name".
I couldn't find out which regex library Yahoo Pipes uses either. If you want to replace everything around the URL, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
Then use $1 in the replacement string to get the URL back. Note that I mixed .*? with [^"]+ on purpose: I want to replace the whole string with the first URL and not the second one, so the first .* has to match up to the first URL and stop there; that's what the lazy quantifier is for.
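For anyone who wants to try these patterns outside Yahoo Pipes, here is a minimal Python sketch. It uses a shortened, single-line version of the HTML from the question (an assumption made to keep the example small and to sidestep how . treats line breaks), and Python writes the group reference as \1 rather than $1.

```python
import re

# Shortened single-line excerpt of the HTML from the question.
html = (
    '<h3><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">'
    'Dream ft. Notorious BIG</a></h3> '
    '<a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

# Lazy quantifier: stop at the first closing quote.
lazy = re.findall(r'(http://soundcloud.*?)"', html)
# Negated class: consume anything that is not a quote.
negated = re.findall(r'(http://soundcloud[^"]+)', html)
# Replace the whole string with the first URL only.
first_only = re.sub(r'^.*?(http://soundcloud[^"]+).*$', r'\1', html)

print(lazy)
print(negated)
print(first_only)
```

Both findall calls return the track URL followed by the user URL, while the anchored substitution keeps only the track URL.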

Regular expressions: Find and replace url with c:url tag

I am having trouble building a regular expression for find and replace. I need to replace all URLs in many .jsf files: every URL starting with XXX should become <c:url value="URL_WITHOUT_XXX"/>. Examples below.
I am stuck with the find expression "XXX(.*)" and the replace expression "<c:url value="\1"/>". My find expression matches too long a string, for example "XXX/a" style="", but I need it to stop at the first " (the end of the href). Can anybody help?
I have:
<a href="XXX/a" style="">
<a href="XXX/b" >
<a href="XXX/c" ...>
I want:
<a href="<c:url value="/a"/>" style="">
<a href="<c:url value="/b"/>" >
<a href="<c:url value="/c"/>" ...>
PS: Sorry for my poor english ;)
Edit:
I use Find/Replace in Eclipse (regular expressions on)
You should specify the language you're working with.
The following regex will match what you want:
<a href="XXX[^\"]*"
If you want to have some particular value, you can group the regex according to your needs. For example:
<a href="(XXX[^\"]*)"
will give you in the first group:
XXX/a
XXX/b
XXX/c
If you want to have only /a, /b, and /c, you can group it like that:
<a href="XXX([^\"]*)"
Edit:
I will explain what <a href="XXX[^\"]*" does:
It first matches: <a href="XXX
Then it matches any character except a " zero or more times: [^\"]*
Finally it matches the closing ", which is not strictly necessary
When you write [^abc], you're telling it to match any character that is not a, b, or c.
So [^\"] means: match any character except a ".
And the quantifier * means zero or more times, so a* will match either an empty string, or a, aa, aaa, ...
And the last thing: groups
When you want to keep a value apart from the entire match, so that you can do something with it, you can use groups: (something).
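To see why the negated class fixes the original "XXX(.*)" attempt, here is a short Python sketch. The question uses Eclipse's Find/Replace, so this is only an illustration in a different engine; the pattern below is the question's own expression with (.*) swapped for ([^"]*), and the variable names are invented.

```python
import re

lines = [
    '<a href="XXX/a" style="">',
    '<a href="XXX/b" >',
]

# Greedy .* runs to the LAST quote on the line, which is the reported problem.
greedy_capture = re.search(r'"XXX(.*)"', lines[0]).group(1)
print(greedy_capture)  # /a" style="

# [^"]* cannot cross a quote, so the match ends at the end of the href value.
pattern = re.compile(r'"XXX([^"]*)"')
replacement = r'"<c:url value="\1"/>"'
converted = [pattern.sub(replacement, line) for line in lines]

for line in converted:
    print(line)
```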