I am trying to scrape an iframe from a website, but I cannot seem to capture the whole iframe element (not just its attributes). For the purposes of this post I'll use a basic iframe:
<iframe src="http://google.com"></iframe>
The content of each iframe is prone to change, so I need to match the iframe tags with a regex somehow. I have tried the following but can't get it to work:
<iframe[^>]*>(.*?)</iframe[^>]*>
It might be because your iframe spans multiple lines. In that case, note that . doesn't match a newline character, so you can replace it with (?:.|\n) or [^<], or use the dot-all (single-line) flag so that the dot matches every character. You might also want to use this regex instead: <iframe[^>]*?(?:\/>|>[^<]*?<\/iframe>), which also matches <iframe />.
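A minimal Python sketch of the dot-all suggestion, assuming the page source is already held in a string. The sample markup and variable names are made up for illustration.

```python
import re

# Hypothetical page source: an iframe whose body spans several lines.
html = """<p>before</p>
<iframe src="http://google.com" width="300">
  fallback text
</iframe>
<p>after</p>"""

# re.DOTALL makes "." match newlines as well; re.IGNORECASE tolerates <IFRAME>.
iframe_re = re.compile(r"<iframe[^>]*>(.*?)</iframe\s*>", re.DOTALL | re.IGNORECASE)

match = iframe_re.search(html)
whole_iframe = match.group(0) if match else None  # the full element, tags included
inner_text = match.group(1) if match else None    # only what sits between the tags

# Without the dot-all flag the same pattern finds nothing in this input.
without_flag = re.search(r"<iframe[^>]*>(.*?)</iframe\s*>", html)

print(whole_iframe)
```

Group 0 is the whole element, which is what the question asks for; group 1 is only the content between the tags.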
Related
I'm trying to perform a regex replacement on the HTML below. I'm using an existing regex pattern (I didn't write it and don't really understand it) that ignores anything inside of an HTML tag, but I need it to also ignore anything between script tags. The pattern is (?<!<[^>]*)(diversity|and|inclusion). The problem is that the "and" in 'playerBrandingId' in the JavaScript is getting matched and ultimately replaced. In case it matters, I'm using C#. You can see what I get here.
<p>When it comes to building more diverse and inclusive workforces, the sports industry is already a leader, but it can do much more. One of the ways SBD/SBJ is focusing on diversity and inclusion is by talking to business leaders about what the industry can do better. In our first video in the “SBJ Diversity and Inclusion” series, we hear from execs working in leagues, technology, recruitment and academia.</p>
<div class="article-offset-block article-video article-offset-block--half">
<div class="u-vr2">
<div id='video-F17F523A70EB43ECAF54DF46144835B4'></div>
</div>
</div>
<script>
var playerParam = {
'pcode': 'poeXI63BtIsR_ugBoy3Z6X8KfiMo',
'playerBrandingId': 'video-F17F523A70EB43ECAF54DF46144835B4',
'autoplay': false,
'loop': false
};
OO.ready(function () { window.ppF17F523A70EB43ECAF54DF46144835B4 = OO.Player.create('video-F17F523A70EB43ECAF54DF46144835B4', 'w5cW9qZTE6qRRDqfBdi861XWJTXci9uE', playerParam); });
</script>
EDIT:
The pattern is generated by a user's query, so the pattern could include the word window or player, which would be matched in the JavaScript when I change the pattern to include the \b like so: (?<!<[^>]*)\b(window|player|and)\b
Another example
Change your regex to (?<!<[^>]*)\b(diversity|and|inclusion)\b. The \b adds a test for a word boundary, forcing each word inside the ( and ) to match only as a whole word.
EDIT:
You are trying to parse the HTML to extract the text nodes and then check them.
You should not, under any circumstances, try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library instead: see this page for some ways to do it, or search for extracting text nodes from HTML with .NET and C#.
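To illustrate the parser-based approach, here is a rough sketch. The question is about C#, so this swaps in Python's standard html.parser purely for illustration; the class name TextHighlighter, the sample markup, and the <b> replacement are all made up. It re-emits the document and applies the word replacement only to text nodes that are not inside script or style elements.

```python
import re
from html.parser import HTMLParser


class TextHighlighter(HTMLParser):
    """Re-emits the markup, applying a regex replacement to visible text only."""

    SKIPPED = {"script", "style"}

    def __init__(self, pattern, repl):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.repl = repl
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # tag exactly as written

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br />

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = self.pattern.sub(self.repl, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


# Hypothetical input: the word "and" appears in the text and in the script.
html = ("<p>diversity and inclusion</p>"
        "<script>var p = {'playerBrandingId': 'and'};</script>")
words = re.compile(r"\b(diversity|and|inclusion)\b", re.IGNORECASE)

parser = TextHighlighter(words, r"<b>\1</b>")
parser.feed(html)
parser.close()
result = "".join(parser.out)
print(result)
```

Because the parser reports script content separately from ordinary text, user-supplied words such as window or player are left alone inside the JavaScript without any lookbehind tricks.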
The answer is that you cannot do what I'm trying to do with Regex according to this.
I'm trying to figure out how to use look-ahead to try to capture the descriptive text in an html page such as
<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (<?php ?> <%php ?> <% %>). It will also replace sequence of new line characters (multiple) with only one. <b>Allow tags</b> feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.<p></p>You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.<p></p>
<b>Known issues:</b><br />
I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.
The closest I've gotten so far is:
(([^.<]){1,500})<
Which still misses on things like periods and other characters before and after the string.
Your regex will match anything that's neither "." nor "<" 1 to 500 times, then a "<".
Assuming you want to capture everything from the itemBanner div until the very next occurrence of a closing div, you can use these elements:
<div class="itemBanner"> - explicit match
() - parenthetical wrap for referencing, e.g. match[1]
.*? - any length of characters, non-greedily (as few as possible)
<\/div> - explicit match, with escaped '/'
to form this Ruby regex:
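# /m lets . match newlines, since the div's text starts on its own line in the sample
item_banner_div_regex = /<div class="itemBanner">(.*?)<\/div>/m
match = item_banner_div_regex.match(html)
inside_item_banner_div = match && match[1]  # nil when nothing matched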
Note: The exact regex will depend on the implementation you're using.
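As an example of how the flavor changes the details, here is a rough Python equivalent. The sample markup is shortened from the question and the closing tags are assumed. In Python the flag that lets . cross line breaks is re.DOTALL.

```python
import re

# Shortened, hypothetical version of the markup from the question.
html = '''<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text.
</div>
</div>
</div>'''

# Explicit opening tag, lazy capture, explicit closing tag.
item_banner_re = re.compile(r'<div class="itemBanner">(.*?)</div>', re.DOTALL)

match = item_banner_re.search(html)
inside_item_banner_div = match.group(1).strip() if match else None
print(inside_item_banner_div)
```

The outer div is skipped because it carries a style attribute, so only the inner div matches the explicit opening tag.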
I'm trying to match the following video url:
<iframe width="420" height="315" src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" frameborder="0" allowfullscreen></iframe>
I have the following:
^<iframe
(\swidth="\d{1,3}")?
(\sheight="\d{1,3}")?
(\salt=""[^""<>]*"")?
(\stitle=""[^""<>]*"")?
\ssrc="//(www.youtube.com|player.vimeo.com)/[-a-z0-9+&##/%?=~_|!:,.;\(\)]+"
(\sframeborder="[^""<>]*")?
(\sallowfullscreen)?
\s?/?></iframe>$
This is working, but I can't rely on the fact that youtube will always provide embed links that follow this structure. If they move the width attribute to after src, my regex will fail.
Is there any way to do order-agnostic groupings, to address this?
You can make each of the search terms a lookahead - these don't consume the strings, so they can be in any order. Example:
<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*
will match both
<iframe width="123" height="321"
and
<iframe height="321" width="123"
demo on regex101.com
I am sure you can finish this yourself (adding all the terms you want to match).
Note - this "matches" - it does not "extract". But it will tell you that all these terms are present in the expression, in any order.
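A small Python sketch of the lookahead idea, extended to the src attribute from the question. One assumed tweak: each lookahead uses [^>]* instead of .* so that it cannot look past the end of the opening tag. The test strings other than the original embed code are made up.

```python
import re

# Each lookahead checks for one attribute without consuming input,
# so the attributes may appear in any order inside the opening tag.
iframe_re = re.compile(
    r'<iframe\s'
    r'(?=[^>]*\bheight="\d{1,3}")'
    r'(?=[^>]*\bwidth="\d{1,3}")'
    r'(?=[^>]*\bsrc="//(?:www\.youtube\.com|player\.vimeo\.com)/)'
    r'[^>]*>'
)

original_order = ('<iframe width="420" height="315" '
                  'src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" '
                  'frameborder="0" allowfullscreen></iframe>')
shuffled_order = ('<iframe src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" '
                  'height="315" width="420"></iframe>')
missing_height = ('<iframe width="420" '
                  'src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0"></iframe>')

matches_original = iframe_re.search(original_order) is not None
matches_shuffled = iframe_re.search(shuffled_order) is not None
matches_missing = iframe_re.search(missing_height) is not None
print(matches_original, matches_shuffled, matches_missing)
```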
EDIT: since I started writing this answer, a number of comments have appeared that change my understanding of your request. If you "just" want to extract the src= value, you simply do
<iframe.*?src="([^"]+)"
and the match (the thing in brackets) will be whatever is between the first and the second double quote. Typically there are better tools than regex for parsing HTML - my personal preference is BeautifulSoup (Python).
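For comparison, a parser-based sketch of the same extraction. BeautifulSoup is a third-party package, so this uses Python's standard html.parser instead; the class name IframeSrcCollector is made up for illustration.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every iframe, whatever the attribute order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes get None.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


collector = IframeSrcCollector()
collector.feed('<iframe width="420" height="315" '
               'src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" '
               'frameborder="0" allowfullscreen></iframe>')
collector.close()
sources = collector.sources
print(sources)
```

The host check from the question (youtube.com or vimeo.com) could then be applied to each collected value as a plain string test.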
I have some HTML which I want to grab between 2 tags. However, nested tags exist in the HTML, so looking for </div> wouldn't work as it would stop at the first nested div.
Basically I want my regex to:
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
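The same comparison as a short Python sketch, using the test string from the PHP one-liner. The variable names are illustrative.

```python
import re

text = "<div>Test<div>foo</div></div>"

# Lazy: stops at the first closing tag it can reach.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)

# Greedy: runs on to the last closing tag.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)

# Lazy up to the start of the first nested div.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(repr(lazy), repr(greedy), repr(before_nested))
```

Neither the lazy nor the greedy capture balances nested tags: the lazy one is cut off at the inner closing tag, and the greedy one swallows it.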
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/
I have an apparently simple regex query for Pipes: I need to truncate each item from its <img> tag onwards. I thought a loop with a string regex replacing <img[.]* with a blank field would take care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink that's why I'm chopping the rest off
Going one better because all I'm really doing here is making the item string more manageable by cutting it down before further manipulation - anyone know if it's possible to extract a href from a certain link in the page (in this case the 1st one) using Regex in Yahoo Pipes? I've seen the regex answer to this SO q but I'm not sure how to use it to map a url to an item attribute in a Pipes module?
You need to remove the line breaks first: use a Regex module to replace the pattern [\r\n] with empty text on the content or description field, which makes it a single line of text. Then you can use the .* wildcard, which runs to the end of the line. Note that [.] matches only a literal period, so the pattern should be <img.* rather than <img[.]*.
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
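The same steps can be tried outside Pipes. This is a Python sketch of the regex logic only, not of any Pipes module, and the sample item is adapted from the question with a made-up trailing link. The last step also pulls the href out of the first link.

```python
import re

# Hypothetical item text, adapted from the question.
item = ('sample text title\n'
        '<a rel="nofollow" target="_blank" href="http://example.com">'
        '<img border="0" src="http://example.com/image.png" alt="Yes" '
        'width="20" height="23"/></a>\n'
        '<a href="http://example.com/other">more</a>')

# Step 1: drop line breaks so the item becomes a single line.
single_line = re.sub(r"[\r\n]", "", item)

# Step 2: cut everything from the <img tag onwards.
truncated = re.sub(r"<img.*", "", single_line)

# Step 3: capture the href of the first remaining link.
href_match = re.search(r'<a\b[^>]*\bhref="([^"]*)"', truncated)
first_href = href_match.group(1) if href_match else None

print(truncated)
print(first_href)
```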