Extract specific matching text (regex?) from YQL query

I would like to use YQL to extract textual IDs from particular paragraph divs on a webpage. Each ID is in the form "ApplicationRef:NNN". The basic query I use is along the lines of:
select content from html where url="..." and content matches "ApplicationRef*"
which returns the whole of the paragraph containing that text. However, I'd like to further process the result so that the query returns just the NNN part of each paragraph instead. Is this possible?
I realise that I can process the result of the YQL query further locally, but it would be neater if I could do all the processing in one go within the YQL query itself.
thanks,
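For the local post-processing fallback mentioned above, a minimal sketch in Python could look like this. It assumes the paragraphs returned by the query are already available as plain strings and that NNN is made of digits. The sample paragraphs and the extract_refs helper are made up for illustration and are not part of YQL.

```python
import re

# Assumption: the ID looks like "ApplicationRef:NNN", where NNN is one or more digits.
APP_REF = re.compile(r"ApplicationRef:\s*(\d+)")

def extract_refs(paragraphs):
    """Return the NNN part of every ApplicationRef found in the given strings."""
    refs = []
    for text in paragraphs:
        refs.extend(APP_REF.findall(text))
    return refs

# Hypothetical paragraphs standing in for the YQL result.
sample = [
    "Planning notice ApplicationRef:123 was received on Monday.",
    "No reference in this paragraph.",
    "See ApplicationRef:4567 for details.",
]
print(extract_refs(sample))
```

The capture group is what keeps only the NNN part. The rest of the pattern just locates it.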

Related

How to find a specific text between HTML table tags with RegEx

I want to find a RegEx that allows me to find a specific text between HTML table tags.
I have: This is a test text <tr><td>text inside table</td></tr> and I want the RegEx to return me just the second 'text' because it is inside the table.
I have tried <tr>(text)<\/tr> but it returns nothing.
It needs to be done with RegEx; it cannot be done with an HTML parser.
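Your <tr>(text)<\/tr> matches only the exact string <tr>text</tr>, but in your input there is other markup and text around the word.
So you need <tr>.*?(text).*?<\/tr> instead, which allows anything between the row tags and the captured word.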
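A quick way to check the difference between the two patterns is to run them against the sample string from the question. This is only a sketch in Python's re module. The forward slash needs no escaping there, so it is written as </tr>.

```python
import re

html = "This is a test text <tr><td>text inside table</td></tr>"

# The original attempt only matches when "text" sits directly between the row tags.
strict = re.search(r"<tr>(text)</tr>", html)

# Lazy .*? on both sides lets other markup and words surround the target inside the row.
relaxed = re.search(r"<tr>.*?(text).*?</tr>", html)

print(strict)            # no match
print(relaxed.group(1))  # the captured word
print(relaxed.start(1) > html.index("<tr>"))  # the capture lies inside the table row
```

Note that without the DOTALL flag the dot does not match line breaks, so this form only works when the row is on a single line.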

Ignore or eliminate HTML formatting <tags> from text using Regular Expressions in JMeter

We have an HTML response from which we need to extract the text of a paragraph (<p>) tag and store it for comparison with XML text like the example below. There is a tag in the middle of the text that should be ignored, which is what I am trying to achieve with a regular expression.
xml content:
<p>testing content<italic>text</italic>testing content</p>
html content:
<p>testing content<i>text</i>testing content</p>
For this I used the following regular expression in JMeter:
<p>(.*)</p>
This fetches the entire text, and when I try to match it with a Beanshell assertion it fails, since the <italic> tag shows up as <i> in the HTML response.
I also tried:
<p>(.*)<i
That gives the same issue.
How can I ignore or eliminate the italic tag using a regular expression in JMeter, or is there any other way to achieve the same in JMeter?
You should not be using regular expressions to extract data from HTML/XML responses.
JMeter provides the XPath Extractor, which is much handier for extracting data from XML/HTML responses.
The relevant XPath query would be as simple as //p/text()
Beanshell is not the recommended way of scripting; if you need advanced comparison logic, consider a JSR223 Assertion instead. If you just need to compare two variables, a normal Response Assertion will be more than enough.
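To illustrate the idea of comparing text content rather than markup, here is a small sketch outside JMeter, using Python's standard xml.etree.ElementTree. It assumes both fragments are well-formed XML, and the paragraph_text helper is a name made up for this example. Unlike //p/text(), which selects only the text nodes that are direct children of <p>, this variant also keeps the text inside the inline tag.

```python
import xml.etree.ElementTree as ET

def paragraph_text(markup):
    """Return the text of a <p> element with any inline tags dropped."""
    root = ET.fromstring(markup)
    # itertext() walks the element and its children, yielding only text.
    return "".join(root.itertext())

xml_content = "<p>testing content<italic>text</italic>testing content</p>"
html_content = "<p>testing content<i>text</i>testing content</p>"

# Once the tags are gone, the two fragments compare equal.
print(paragraph_text(xml_content) == paragraph_text(html_content))
```

Real-world HTML is often not well-formed XML, in which case a lenient HTML parser would be needed instead.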

SilverStripe 3: Search Results template showing unwanted code

I'm using "$Content.LimitWordCountXML" in my search results template as shown here, but it's showing results like "Attendance[file_link,id=214]" when the content is a link. How do I stop this so that it shows just the link text and not the code / ID? Thanks
LimitWordCountXML is a function on the StringField class, not the HTMLText class. It acts as if the string variable it is working on is plain text, not HTML. Therefore it does not strip HTML.
We can use the HTMLText Summary function instead, which does strip HTML and accepts a word limit as the first parameter.

Extract the href value with apostrophe in Java

I am a new user to JSoup. I want to extract the href value from the html.
For example:
String html = "<p>An <a href='http://exa'mple.com'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href");
I am getting the output as "http://exa" , but I need the output as "http://exa'mple.com" (the raw text in href). link.outerHtml() is providing some different text.
I can't alter the HTML. HTML is the user's input.
Try this:
String html = "<p>An <a href='http://exa%27mple.com'><b>example</b></a> link.</p>";
I can't see how this will be possible, given that the jsoup parser will be expecting a ' to close the href argument and that's exactly what it gets. I think your only option is to pre-parse the string provided by the user, but even that will be tricky, as you'll have to come up with a rule to distinguish between "correct" and "incorrect" quote marks.
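As a rough illustration of the pre-parsing idea, here is one possible rule, sketched in Python rather than Java. The assumption is that the "real" closing quote is the first apostrophe followed by whitespace or ">". That is only a heuristic and will misfire on input where an apostrophe inside the URL happens to be followed by one of those characters. The raw_href helper is a name made up for this example.

```python
import re

# Heuristic (an assumption, not a general rule): the closing quote of the
# attribute is the first apostrophe followed by whitespace or ">".
HREF = re.compile(r"href='(.*?)'(?=[\s>])")

def raw_href(html):
    """Return the raw single-quoted href value, or None if none is found."""
    match = HREF.search(html)
    return match.group(1) if match else None

html = "<p>An <a href='http://exa'mple.com'><b>example</b></a> link.</p>"
print(raw_href(html))
```

Once the raw value is known, it could be replaced with an escaped form (for example %27 for the apostrophe) before handing the string to the parser.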

regex for all characters on yahoo pipes

I have an apparently simple regex query for Pipes: I need to truncate each item from its <img> tag onwards. I thought a loop with a string regex of <img[.]* replaced by a blank field would have taken care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and the hyperlink; that's why I'm chopping the rest off.
Going one better because all I'm really doing here is making the item string more manageable by cutting it down before further manipulation - anyone know if it's possible to extract a href from a certain link in the page (in this case the 1st one) using Regex in Yahoo Pipes? I've seen the regex answer to this SO q but I'm not sure how to use it to map a url to an item attribute in a Pipes module?
You need to remove the line returns first: use a RegEx Pipe to replace the pattern [\r\n] with empty text on the content or description field, which turns it into a single line of text. Then you can use the .* wildcard, which will run to the end of the line.
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
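The same two steps can be tried outside Pipes. The sketch below uses Python's re module, not the Pipes Regex module, and the item text is a shortened stand-in for the one in the question. The truncate_at_img and first_href helpers are names made up for this example. It also shows why <img[.]* did not work: inside a character class the dot is a literal ".", so that pattern only matches "<img" followed by dots.

```python
import re

def truncate_at_img(item):
    # Step 1: collapse to one line, since "." does not match line breaks by default.
    single_line = re.sub(r"[\r\n]", "", item)
    # Step 2: drop everything from the first "<img" onward.
    return re.sub(r"<img.*", "", single_line)

def first_href(item):
    """Return the value of the first double-quoted href, or None."""
    match = re.search(r'href="([^"]*)"', item)
    return match.group(1) if match else None

# Shortened, hypothetical item text modelled on the question.
item = (
    "sample text title\n"
    '<a rel="nofollow" target="_blank" href="http://example.com">'
    '<img border="0" src="http://example.com/image.png" alt="Yes"/></a>\n'
    '<a href="http://example.com/other">irrelevant</a>\n'
)

print(truncate_at_img(item))
print(first_href(item))
```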