Order-agnostic regex - is it possible? - regex

I'm trying to match the following video url:
<iframe width="420" height="315" src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" frameborder="0" allowfullscreen></iframe>
I have the following:
^<iframe
(\swidth="\d{1,3}")?
(\sheight="\d{1,3}")?
(\salt=""[^""<>]*"")?
(\stitle=""[^""<>]*"")?
\ssrc="//(www.youtube.com|player.vimeo.com)/[-a-z0-9+&##/%?=~_|!:,.;\(\)]+"
(\sframeborder="[^""<>]*")?
(\sallowfullscreen)?
\s?/?></iframe>$
This is working, but I can't rely on the fact that youtube will always provide embed links that follow this structure. If they move the width attribute to after src, my regex will fail.
Is there any way to do order-agnostic groupings, to address this?

You can make each of the search terms a lookahead - these don't consume the strings, so they can be in any order. Example:
<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*
will match both
<iframe width="123" height="321"
and
<iframe height="321" width="123"
demo on regex101.com
I am sure you can finish this yourself (adding all the terms you want to match).
Note - this "matches" - it does not "extract". But it will tell you that all these terms are present in the expression, in any order.
EDIT since I started writing this answer a number of comments appeared that change my understanding of your request. If you "just" want to extract the src= thing, you simply do
<iframe.*?src="([^"]+)"
and the match (the thing in brackets) will be whatever is between the first and the second double quote. Typically there are better tools than regex for parsing HTML - my personal preference is BeautifulSoup (Python).

Related

Ignore tags and javascript with regex

I'm trying to perform a regex replacement on the HTML below. I'm using an existing (I didn't write it and don't really understand it) regex pattern that ignores anything inside of an HTML tag, but I need it to also ignore anything between script tags. The pattern is (?<!<[^>]*)(diversity|and|inclusion). The problem is that the and in 'playerBrandingId' in the javascript is getting matched and ultimately replaced. In case it matters, I'm using C#. You can see what I get here.
<p>When it comes to building more diverse and inclusive workforces, the sports industry is already a leader, but it can do much more. One of the ways SBD/SBJ is focusing on diversity and inclusion is by talking to business leaders about what the industry can do better. In our first video in the “SBJ Diversity and Inclusion” series, we hear from execs working in leagues, technology, recruitment and academia.</p>
<div class="article-offset-block article-video article-offset-block--half">
<div class="u-vr2">
<div id='video-F17F523A70EB43ECAF54DF46144835B4'></div>
</div>
</div>
<script>
var playerParam = {
'pcode': 'poeXI63BtIsR_ugBoy3Z6X8KfiMo',
'playerBrandingId': 'video-F17F523A70EB43ECAF54DF46144835B4',
'autoplay': false,
'loop': false
};
OO.ready(function () { window.ppF17F523A70EB43ECAF54DF46144835B4 = OO.Player.create('video-F17F523A70EB43ECAF54DF46144835B4', 'w5cW9qZTE6qRRDqfBdi861XWJTXci9uE', playerParam); });
</script>
EDIT:
The pattern is generated by a user's query, so the pattern could include the word window or player which would be matched in the javascript when I change the pattern to include the \b like so: (?<!<[^>]*)\b(window|player|and)\b
Another example
Change your regex to (?<!<[^>]*)\b(diversity|and|inclusion)\b The \b adds a test for a word boundary. forcing each word inside the ( and ) to be whole words.
EDIT:
You are trying to parse the HTML to extract the text nodes then check them,
you should not under any circumstances try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library see this page for some ways to do it or search for extracting text nodes from HTML with .NET and C#
The answer is that you cannot do what I'm trying to do with Regex according to this.

Getting whole iframe with regex

I am trying to scrape an iframe from a website, but cannot seem to scrape the whole iframe (not just attributes) (for post purpose I'll do a basic iframe)
<iframe src="http://google.com"></iframe>
The content on each iframe is prone to change, so need to regex the iframe tags some how, I have tried with the following buy can't get it to work:
<iframe[^>]*>(.*?)</iframe[^>]*>"
It might be because your iframe is spanning multiple lines. In that case, you should know that . doesn't match newline character, so you can replace it with (?:.|\n) or [^<] or use the dot-all/single line flag so that dot matches all characters. Also you might want to use this regex instead: <iframe[^>]*?(?:\/>|>[^<]*?<\/iframe>) which also matches <iframe />

Regex to match content before string

I am trying to extract an url from content using yahoo pipes but for that I need to match everything before the url, and everything after :
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The url I want is that one : http://soundcloud.com/tom-misch/dream-ft-notorious-big
I tried to learn a bit about regex but when I think I understand, nothing I try works
Hope some of you can help me on that guys !
cheers
This probably will do, it only matches URLs from soundcloud, that uses the http protocol and have no subdomain, the group will capture the full url so that you can use it, and it uses a lazy quantifier to match up to the first quote:
(http://soundcloud.*?)"
Here is an alternative:, that does not uses a lazy quatifier, instead it uses a negated class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexs will actually match both URLs, depending on the library and the flags that you use it might return only the first occurrence or both, you can just use the first one or further check the results for the correct format.
If you really want to use just a regex and your regex library supports look-ahead, you can do this:
(http://soundcloud.*?)\s+(?!class="user-name")
The look-ahead (?!= will not match if the string that follows is class="user-name"
I didn't too, find what library yahoo pipes uses, if you want to replace everything around the url, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
And use $1 in the replacement string to get the url back (keep in mind that I mixed .*? with [^"]+, that's because I want to replace the whole string with the first url and not the second one, so I need the first .* to match up to the point of the first url and stop, that's what the lazy quantifier if for).

REGEX Pattern - How do I match upto a certain tag in html

I have some html which I want to grab between 2 tags. However nested tags exist in the html so looking for wouldn't work as it would return on the first nested div.
Basically I want my regex to..
Match some text literally, followed by ANY character upto another literal text string. So my question is how do I get [^<]* to continue matching until it see's the next div.
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use a html/xml parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/$2>[^<]*)*<\/div>/

Writing Regex pattern for HTML tags

I'm very new to PHP writing and regular expressions. I need to write a Regex pattern that will allow me to "grab" the headlines in the following html tags:
<title>My news</title>
<h1>News</h1>
<h2 class=\"yiv1801001177first\">This is my first headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is another headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the third headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the last headline</h2> <p>This is a summary of a fascinating article.</p>
So I need a pattern to match all the <h2> tags. This is my first attempt at writing a pattern, and I'm seriously struggling...
/(<h+[2])>(.*?)\<\/h2>/ is what I've attempted. Help is much appreciated!
I'm not too familiar with PHP, but in cases like this it's usually easier to use XML parser (which will automatically detect <h2> as well as <h2 class="whatever"> rather than regex, which you'll have to add a bunch of special cases to. Javascript, for example has XML DOM exactly for this purpose, I'd be surprised if PHP didn't have something similar.
The easiest way to do it via regex is
#<h2\b[^>]*>(.*?)</h2>#is
This will match any h2 tag and capture its contents in backreference $1. I've used # as a regex delimiter to avoid escaping the / later on in the regex, and the is options to make the regex case-insensitive and to allow newlines within the tag's contents.
There are circumstances where this regex will fail, though, as pointed out correctly by others in this thread.
I have only checked in RegexBuddy, there following regex works:
<h2.*</h2>