regex for all characters on yahoo pipes - regex

I have an apparently simple regex query for Pipes: I need to truncate each item from its <img> tag onwards. I thought a Loop with a String Regex replacing <img[.]* with a blank field would take care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink; that's why I'm chopping the rest off.
Going one better, since all I'm really doing here is making the item string more manageable by cutting it down before further manipulation: does anyone know if it's possible to extract the href from a certain link in the page (in this case the first one) using regex in Yahoo Pipes? I've seen the regex answer to this SO question, but I'm not sure how to use it to map a URL to an item attribute in a Pipes module.

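You need to remove the line breaks first: use a Regex module to replace the pattern [\r\n] with empty text on the content or description field, so that it becomes a single line of text. Then you can use the .* wildcard, which runs to the end of the line. (Inside square brackets, as in <img[.]*, the dot only matches a literal period, which is why that pattern never got past <img.)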
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
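The same two steps can be tried outside Pipes. This is a minimal Python sketch, not Pipes configuration: the function names truncate_at_img and first_href are made up for illustration, and the href pattern assumes double-quoted attribute values like the ones in the sample item.

```python
import re

def truncate_at_img(item: str) -> str:
    """Flatten the item to one line, then cut everything from <img onwards."""
    flat = re.sub(r"[\r\n]", "", item)   # step 1: remove line breaks
    return re.sub(r"<img.*", "", flat)   # step 2: .* now runs to the end of the item

def first_href(item: str):
    """Return the href of the first <a> tag, or None if there is none."""
    match = re.search(r'<a\b[^>]*\bhref="([^"]*)"', item)
    return match.group(1) if match else None
```

Pulling the href out of the untouched item first and truncating afterwards avoids depending on how much of the link survives the cut.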

Related

Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to use regex to parse HTML, but I think this is a special case that should work, were it not for the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote with the same result as seen in Apple Notes: each "header/title" line is split into multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to strip the unnecessary <h1> or <h2> tags (all but the first in every line) and </h1> or </h2> tags (all but the last in every line) from the example HTML output above, to get the following result, which can be imported into OneNote without problems.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to Regex, especially advanced Regex, but I think I have found a way to clean the erroneous HTML code using TWO different Regex expressions as follows.
Both work well when tested using regex101.com, I think.
The first one removes the unnecessary </h1> or </h2> elements and uses a positive lookahead (it works both in regex101 and in Notepad++):
(</h[1-6]>)(?=.*?\1)
(Demo)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The second one removes the unnecessary <h1> or <h2> elements and uses a positive lookbehind (it works in regex101 but NOT fully in Notepad++):
(?<=(<(h[1-6])>))(?:.*?)\K\1
(Demo)
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
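and replace it with \1\2. The first capture group preserves everything up to and including the first <h#> on each line; it only matches at the start of a line and is empty for every later match on that line. The second capture group captures everything up to the next <h#> or </h#> tag. The optional slash (/?) makes the pattern match, and so delete, both opening and closing tags. The last part is a positive lookahead that requires another </h#> later on the same line, so the last </h#> is never deleted.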
In the two lines of your examples, all the header levels on a line were the same, and this regex is fine. If the first opening tag and the last closing tag don't match, then you have a problem, but I think your solutions have that same problem. In any case you can fix that in a second pass with ^(.*<h)([1-6])(.*</h)[1-6] and replace it with \1\2\3\2, which copies the level of the opening tag onto the closing tag.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
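For anyone who wants to check the one-pass replacement outside Notepad++, here is a small Python sketch that applies the same pattern to the two example lines. The function name merge_headers is made up for illustration, and re.MULTILINE is assumed so that ^ and $ work per line as they do in Notepad++.

```python
import re

# Same pattern as above; MULTILINE makes ^ and $ anchor at every line.
ONE_PASS = re.compile(
    r"(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)",
    re.MULTILINE,
)

def merge_headers(html: str) -> str:
    """Drop every <h#>/</h#> except the first opener and last closer per line."""
    # Python 3.5+ substitutes an empty string for group 1 when it did not match.
    return ONE_PASS.sub(r"\1\2", html)
```

Because each match consumes exactly one tag and the scan continues from where the previous match ended, a single call handles all the tags on a line.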
In any event here is a REGEX101 screenprint with this regex working on your examples:

Ignore tags and javascript with regex

I'm trying to perform a regex replacement on the HTML below. I'm using an existing regex pattern (I didn't write it and don't really understand it) that ignores anything inside of an HTML tag, but I need it to also ignore anything between script tags. The pattern is (?<!<[^>]*)(diversity|and|inclusion). The problem is that the "and" in 'playerBrandingId' in the JavaScript is getting matched and ultimately replaced. In case it matters, I'm using C#. You can see what I get here.
<p>When it comes to building more diverse and inclusive workforces, the sports industry is already a leader, but it can do much more. One of the ways SBD/SBJ is focusing on diversity and inclusion is by talking to business leaders about what the industry can do better. In our first video in the “SBJ Diversity and Inclusion” series, we hear from execs working in leagues, technology, recruitment and academia.</p>
<div class="article-offset-block article-video article-offset-block--half">
<div class="u-vr2">
<div id='video-F17F523A70EB43ECAF54DF46144835B4'></div>
</div>
</div>
<script>
var playerParam = {
'pcode': 'poeXI63BtIsR_ugBoy3Z6X8KfiMo',
'playerBrandingId': 'video-F17F523A70EB43ECAF54DF46144835B4',
'autoplay': false,
'loop': false
};
OO.ready(function () { window.ppF17F523A70EB43ECAF54DF46144835B4 = OO.Player.create('video-F17F523A70EB43ECAF54DF46144835B4', 'w5cW9qZTE6qRRDqfBdi861XWJTXci9uE', playerParam); });
</script>
EDIT:
The pattern is generated by a user's query, so the pattern could include the word window or player which would be matched in the javascript when I change the pattern to include the \b like so: (?<!<[^>]*)\b(window|player|and)\b
Another example
Change your regex to (?<!<[^>]*)\b(diversity|and|inclusion)\b. The \b adds a test for a word boundary, forcing each alternative inside the ( and ) to match only as a whole word.
EDIT:
You are trying to parse the HTML to extract the text nodes and then check them.
You should not under any circumstances try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library: see this page for some ways to do it, or search for extracting text nodes from HTML with .NET and C#.
The answer is that you cannot do what I'm trying to do with Regex according to this.
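To illustrate the parser-based route, here is a rough sketch in Python rather than C#, using the standard library's html.parser. It applies the word replacement only to text nodes and leaves tags, attributes, and anything inside <script> or <style> untouched. The class name TextNodeHighlighter, the highlight helper, and the choice to wrap matches in <mark> are all assumptions for illustration; the question does not say what the replacement text should be.

```python
import re
from html.parser import HTMLParser

class TextNodeHighlighter(HTMLParser):
    """Rebuilds the document, rewriting only text outside <script>/<style>."""

    def __init__(self, words):
        super().__init__(convert_charrefs=False)
        alternatives = "|".join(re.escape(w) for w in words)
        self.pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
        self.out = []
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())   # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = self.pattern.sub(r"<mark>\1</mark>", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

def highlight(html, words):
    parser = TextNodeHighlighter(words)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Since the words come from a user's query, they are passed through re.escape before being joined into the pattern. Words such as window or player inside the script block are never seen by the pattern at all, which is the case the \b version could not cover.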

Regex To Delete <ahref> tags but leave url

I'm rubbish with regex; if someone could help I'd be very appreciative.
It's going to be a bit of a tough one, I imagine, so my hat's off to anyone who can solve it!
So say we have a file that contains two HTML tags in the following formats:
abc1234
Some Text <P>
Some Text
abc1234
I'm trying to remove everything in those tags except the URL (leaving the other text), so the output of the regex on this document would be
abc1234
http://google.com <P>
http://www.google.com
abc1234
Can any guru figure this one out? I'd prefer one regex to handle both cases, but two separate ones would be fine too.
Thanks in advance.
ScottStevens, it is well known that trying to parse HTML with regex is difficult; in fact, there is quite a verbose post on this issue. However, if those are the only two formats the <a> ever takes, here is an approach to the problem:
Your first clue is that both tags start with <a href=", and you want to take that out. For that, a simple removal of '<a href="' will do, no regex required.
Your next clue is that the end of the tag sometimes looks like ">...</a> and sometimes like " rel=...</a> (what goes between rel= and </a> doesn't matter from a regex point of view). Now notice that " rel="...</a> contains a ">...</a> somewhere within it. This means you can remove " rel="...</a> in two steps: remove " rel="... up to the ">, and then remove ">...</a>. Additionally, to make sure each removal stays within a single <a...>...</a> tag, add the constraint that the ... in ">...</a> cannot contain any <a.
That and a regex cheat sheet can help you get started.
That said, you should really use an html parser. Robust and Mature HTML Parser for PHP
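The step-by-step removal described above might look like this as a Python sketch. The function name anchors_to_urls is made up, and the patterns assume the only two tag shapes are <a href="URL">text</a> and <a href="URL" rel="...">text</a>; the exact markup in the question did not survive, so those shapes are an assumption.

```python
import re

def anchors_to_urls(text: str) -> str:
    """Reduce each <a href="URL" ...>text</a> to just URL."""
    # Step 1: drop the opening '<a href="' (plain string replace, no regex).
    text = text.replace('<a href="', "")
    # Step 2: drop '" rel=...' up to, but not including, the closing '">'.
    text = re.sub(r'" rel=[^>]*(?=">)', "", text)
    # Step 3: drop '">...</a>', refusing to run across another '<a'.
    return re.sub(r'">(?:(?!<a).)*?</a>', "", text)
```

The (?!<a) check in step 3 is the extra constraint from the answer: it stops one removal from swallowing the start of the next anchor on the same line.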
I'm a Rubyist, so my example is going to be in Ruby. I'd recommend using two regexes, just to keep things straight:
url_reg = /<a href="(.*?)"/ # Matches first string within <a href=""> tag
tag_reg = /(<a href=.*?a>)/ # Matches entire <a href>...</a> tag
You'll want to pull the URL with the first regex out and store it temporarily, then replace the entire contents of the tag (matched with the tag_reg) with the stored URL.
You might be able to combine it, but it doesn't seem like a good idea. You're fundamentally altering (by deleting) the original tag, and replacing it with something inside itself. Less chance of things going wrong if you separate those two steps as much as possible.
Example in Ruby
def replace_tag(input)
  url_reg = /<a href="(.*?)"/    # Match URLs within an <a href> tag
  tag_reg = /(<a href=.*?a>)/    # Match an entire <a href></a> tag
  while (input =~ tag_reg)       # While the input has matching <a href> tags
    url = input.scan(url_reg).flatten[0]   # Retrieve the first URL match
    input = input.sub(tag_reg, url)        # Replace first tag contents with URL
  end
  return input
end

File.open("test.html", "r") do |html_input|       # Open original HTML file
  File.open("output.html", "w") do |html_output|  # Open an output file
    while line = html_input.gets                  # Read each line
      output = replace_tag(line)                  # Perform necessary substitutions
      html_output.puts(output)                    # Write output lines to file
    end
  end
end
Even if you don't use Ruby, I hope the example makes sense. I tested this on your given input file, and it produces the expected output.

Regex match for contents of <li> element

I have the following content
<li>Title: [...]</li>
and I'm looking for a regex that will match and replace this so that I can parse it as XML. I just want to use a regex find and replace inside Sublime Text 2, so I want to match everything in the above example except for the [...], which is the content.
Why not extract the content and use it to build the XML, rather than trying to mold the wrapper of the content into XML? (Or am I misunderstanding you?)
<li>Title: ([^<]*)<\/li>
is the regular expression to extract the content.
It's pretty self-explanatory, other than the [^<]*, which means "match any number of characters that are not a <".
I don't know Sublime, but something like this should suffice to get you the contents of the li. It allows for optional extra attributes on the tag. Make sure to turn off case sensitivity, in case of LI or Li etc. (lifted straight from http://www.regular-expressions.info/examples.html ):
<li\b[^>]*>(.*?)</li>
<li>\S*(.*)?</li>
That should match your string, with the content being capturing group 1.
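As a quick way to see what the attribute-tolerant pattern captures, here is a short Python sketch. The helper name li_contents is made up, and the DOTALL flag is an addition of mine so that content spanning several lines is still captured; the patterns above do not specify it.

```python
import re

# IGNORECASE covers <LI>/<Li>; DOTALL (an assumption) lets content span lines.
LI_PATTERN = re.compile(r"<li\b[^>]*>(.*?)</li>", re.IGNORECASE | re.DOTALL)

def li_contents(html: str) -> list:
    """Return the inner text of every <li> element, in document order."""
    return LI_PATTERN.findall(html)
```

In an editor's find and replace, the same capture group is what the replacement field would refer to (commonly as \1 or $1, depending on the editor).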

REGEX Pattern - How do I match upto a certain tag in html

I have some HTML which I want to grab between two tags. However, nested tags exist in the HTML, so looking for </div> wouldn't work, as it would stop at the first nested div.
Basically I want my regex to:
Match some text literally, followed by ANY characters up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
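The difference between the greedy and lazy star is easy to see side by side. The following Python sketch reuses the same test string as the PHP one-liner; the variable names are only for illustration.

```python
import re

text = "<div>Test<div>foo</div></div>"

# Lazy star: stops at the first </div> it can reach.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)

# Greedy star: runs to the end, then backs up to the last </div>.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)

# Lazy star up to the next opening <div: text before the first nested div.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(lazy)           # Test<div>foo
print(greedy)         # Test<div>foo</div>
print(before_nested)  # Test
```

Neither variant counts nesting, which is why the lazy match ends at the inner </div> rather than at the one that actually closes the outer div.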
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use a html/xml parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/