Extract an href value containing an apostrophe in Java

I am new to jsoup and want to extract the href value from some HTML.
For example:
String html = "<p>An <a href='http://exa'mple.com'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href");
I am getting the output "http://exa", but I need "http://exa'mple.com" (the raw text in href). link.outerHtml() gives yet another, different text.
I can't alter the HTML; it is user input.

Try this:
String html = "<p>An <a href='http://exa%27mple.com'><b>example</b></a> link.</p>";

I can't see how this will be possible, given that the jsoup parser will be expecting a ' to close the href argument and that's exactly what it gets. I think your only option is to pre-parse the string provided by the user, but even that will be tricky, as you'll have to come up with a rule to distinguish between "correct" and "incorrect" quote marks.
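One way to sketch the pre-parsing rule suggested above: treat a single quote as the closing quote only when it is followed by whitespace, / or >. This is a heuristic and an assumption, not jsoup behaviour, and it will misfire on input where an apostrophe inside the value happens to be followed by one of those characters. It is shown in Python for brevity; the pattern uses only a lazy quantifier and a lookahead, which java.util.regex also supports. The function name is hypothetical.

```python
import re

# Heuristic (an assumption, not a jsoup feature): the closing quote is the
# first single quote that is followed by whitespace, '/' or '>'.
HREF_SINGLE_QUOTED = re.compile(r"href='(.*?)'(?=[\s/>])")


def raw_single_quoted_hrefs(html):
    """Return the raw text of every single-quoted href, apostrophes included."""
    return HREF_SINGLE_QUOTED.findall(html)


html = "<p>An <a href='http://exa'mple.com'><b>example</b></a> link.</p>"
print(raw_single_quoted_hrefs(html))
```

The extracted values could then be used directly, or percent-encoded before the string is handed to the parser.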

Unable to accurately search a particular text in a html tag using Python

I have the regex below to identify text in an HTML tag, but it doesn't yield the expected result.
HTML Tag:
<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>
Find = re.findall(?<=Issue Amount</td> <td>:</td> <td>) [0-9,]),soup_string)[0]
I need to get the numerical value 20,000,000.00 from this tag.
Any advice on what I am doing wrong here? I tried a couple of other ways, but with no success.
Do not under any circumstances try to parse XML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it.
However, in your case the regex is broken because it looks for a single space between your </td> and <td> tags, whereas your data has line breaks there. You can use the \s metacharacter to match any whitespace character.
Below is the regex that gave me the desired output. Thanks, all, for your input.
(?<=Issue Amount[td\W]{21})([\d,.]+)
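As a minimal sketch of the \s suggestion in Python: because the re module only accepts fixed-width lookbehinds, this version uses a capture group with \s* instead of a lookbehind. The sample string is assumed to match the three table cells shown in the question.

```python
import re

# Sample input, assumed to mirror the question: cells separated by newlines.
soup_string = "<td>Issue Amount</td>\n<td>:</td>\n<td>20,000,000.00</td>"

# \s* tolerates spaces, tabs or line breaks between the tags.
# The amount is taken from group 1 rather than from a lookbehind,
# since Python's re requires lookbehinds to have a fixed width.
pattern = re.compile(r"Issue Amount</td>\s*<td>:</td>\s*<td>([\d,.]+)")

match = pattern.search(soup_string)
amount = match.group(1) if match else None
print(amount)
```

The fixed-width lookbehind above works for the same input only because exactly 21 tag and whitespace characters separate the label from the number; the \s* version does not depend on that count.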

How can I use a regular expression to replace &#098; with the respective ASCII character?

I wrote a VB.NET application that asks the user for a URL, then pulls the HTML content of that URL and filters out most of it except for anything between <td> </td> tags.
So if the HTML of that url is something like this
<html><body><table><tr><td>My content here</td></tr></table>
</body>
</html>
then the application will simply print out:
My content here
However, the problem is that some URLs have populated these <td></td> with the ASCII codes of the letters rather than the letters themselves, so here is an example:
<html><body><table><tr><td>&#098;&#097;&#110;&#100;&#105;&#116;</td></tr></table>
</body>
</html>
so my program will display:
&#098;&#097;&#110;&#100;&#105;&#116;
but any browser will display the above as
bandit
I tried to use RegEx to replace those numbers with their respective characters (using the 'Chr' function), but I failed.
Here is what I tried:
Me.TextBox3.Text = Regex.Replace(htmlDoc, "&#\d\d\d;", chr("$&"))
but that raises an error.
My question is: how can I replace all occurrences of &#\d\d\d; with Chr(value of the \d\d\d that was matched earlier)?
This can be done easily with the HtmlDecode method.
http://social.msdn.microsoft.com/Forums/vstudio/en-US/5cd2251d-1359-49ce-b6a2-7ca492d560a5/converting-nbsp-when-using-serverurldecode?forum=csharpgeneral
string subject = HttpUtility.HtmlDecode(HttpUtility.UrlDecode(Request.QueryString["subject"]));
This is C#, but you can easily convert it to VB.NET.
You can use HttpUtility.HtmlDecode to decode HTML into a plain string.
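For comparison, here is a sketch of both routes in Python rather than VB.NET. The attempt in the question fails because Chr("$&") is evaluated once, before the replacement runs; a regex replacement needs a callback that receives each match (in .NET, the Regex.Replace overload taking a MatchEvaluator plays that role). The sample string is an assumed input, not taken from a real page.

```python
import html
import re

# Assumed sample input: "bandit" written as numeric character references.
encoded = "<td>&#098;&#097;&#110;&#100;&#105;&#116;</td>"

# Library route: decodes numeric and named character references in one call.
decoded = html.unescape(encoded)

# Regex route: the callback is invoked per match and converts the digits
# captured in group 1 into the corresponding character.
decoded_by_regex = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), encoded)

print(decoded)
print(decoded_by_regex)
```

The library route is usually preferable because it also covers named references such as &amp;, which the regex above ignores.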

How to get the string of everything between these two em tags?

I want to get the string between the em tags, including any other HTML inside them.
For example:
<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown</em>
The output should be:
UNIVERSALPOSTAL UNION - International Bureau Circular<br />
By: K.J.S. McKeown
please help me.
Thanks
Use the regular expression function like this:
REMatch("(?s)<em>.*?</em>", html)
See also: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=regexp_01.html
The (?s) sets single-line mode, so that . also matches line feeds and the input is treated as one line. As Peter pointed out in a comment, this is not the default and therefore must be set.
The .*? matches all characters between <em> and </em>. The question mark after the quantifier makes it "non-greedy", so that as few characters as possible are matched. This is needed in case the input HTML contains something like <em>foo</em><em>bar</em>, where a greedy match would otherwise run from the first <em> to the last </em>.
The returned array contains all matches found, i.e. every piece of text, including HTML, that was inside <em> tags.
Note that this could fail where </em> also occurs as attribute text and is incorrectly not HTML-encoded, for example: <em><a title="Help for </em> tag">click</a></em>, or in other rare circumstances (e.g. JavaScript script tags). A regex cannot replace a full HTML/XML parser, and if you need 100% accuracy, you should consider using one: http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_t-z_23.html
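The effect of (?s) and of the non-greedy quantifier can be seen in a small Python sketch. Python's re is used here as a stand-in for ColdFusion's REMatch, and a capture group is added so that only the inner text is returned; the second <em>bar</em> element is an assumed addition to the sample.

```python
import re

# The question's sample, followed by a second (assumed) em element.
html_text = (
    "<em>UNIVERSALPOSTAL UNION - International Bureau Circular<br />\n"
    "By: K.J.S. McKeown</em><em>bar</em>"
)

# Non-greedy with (?s): each em element is matched separately.
inner = re.findall(r"(?s)<em>(.*?)</em>", html_text)

# Greedy with (?s): one match from the first <em> to the last </em>.
greedy = re.findall(r"(?s)<em>(.*)</em>", html_text)

# Non-greedy without (?s): . cannot cross the line feed in the first element.
no_dotall = re.findall(r"<em>(.*?)</em>", html_text)

print(inner)
print(greedy)
print(no_dotall)
```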
If your input is exactly in the format given above, you don't even need regex - just strip the outer tags:
<cfsavecontent variable="Input">[text from above]</cfsavecontent>
<cfset Output = mid( Input, 5, len(Input) - 9 ) />
If your input is more than this (i.e. a significant piece of HTML, or a full HTML document), regex is still not the ideal tool - instead, you should be using a HTML parser, such as JSoup:
<cfset jsoup = createObject('java','org.jsoup.Jsoup') />
<cfset Output = jsoup.parse(Input).select('em').html() />
(With CF8, this code requires placing the jsoup JAR file in CF's lib directory, or using a tool such as JavaLoader.)
If you are using jQuery, you can also do this pretty easily.
$("em").html();
This returns the HTML inside the first matching em element.
See this fiddle
I had to remove any text that followed a particular tag. The HTML content was generated dynamically from a database that caters to 5 different languages, so I only had the div tag to help me. I am not sure why REMatch("(?s).*?", html) did not work for me. However, Ben helped me here (http://www.bennadel.com/blog/769-Learning-ColdFusion-8-REMatch-For-Regular-Expression-Matching.htm). My code looks like this:
<cfset extContentArr = REMatch("(?i)<div class=""inlineBlock"" style=""margin-right:30px;"">.+?</div>",qry_getContent.colval) />
<cfif !ArrayIsEmpty(extContentArr)>
Loop the array and do whatever you need with the extract , I just deleted them.
</cfif>

Regex to delete <a href> tags but leave the URL

I'm rubbish with regex; if someone could help, I'd be very appreciative.
It's going to be a bit of a tough one, I imagine, so my hat's off to anyone who can solve it!
Say we have a file that contains 2 HTML tags in the following formats:
abc1234
<a href="http://google.com">Some Text</a> <P>
<a href="http://www.google.com" rel="...">Some Text</a>
abc1234
I'm trying to remove everything in those tags except the URL (leaving the other text), so the output of the regex for this document would be
abc1234
http://google.com <P>
http://www.google.com
abc1234
Can any guru figure this one out? I'd prefer one regex to handle both cases, but two separate ones would be fine too.
Thanks in advance.
ScottStevens, it is well known that trying to parse HTML with regex is difficult; in fact, there is quite a verbose post on this issue. However, if those are the only two formats the <a> ever takes, here is an approach to the problem:
Your first clue is that both tags start with <a href=", and you want to take that out. For that, a simple remove on '<a href="' will do, no regex required.
Your next clue is that the end of the tag is sometimes ">...</a> and sometimes " rel=...</a> (what follows rel= doesn't matter from a regex point of view). Notice that " rel="...</a> contains a ">...</a> somewhere within it. This means you can remove " rel="...</a> in two steps: remove " rel="... up to the ">, then remove ">...</a>. Additionally, to make sure you only remove within a single <a...>...</a> tag, add the constraint that the ... in ">...</a> cannot contain any <a.
That and a regex cheat sheet can help you get started.
That said, you should really use an HTML parser; see "Robust and Mature HTML Parser for PHP".
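The clues above can also be folded into a single substitution. The following is a rough Python sketch, assuming the tags always start with <a href=" exactly as in the question; the rel value in the sample is a made-up placeholder.

```python
import re

# Hypothetical sample input; the rel value is an assumption.
text = (
    'abc1234\n'
    '<a href="http://google.com">Some Text</a> <P>\n'
    '<a href="http://www.google.com" rel="nofollow">Some Text</a>\n'
    'abc1234'
)

# Group 1 captures the URL; [^>]* skips any further attributes such as rel;
# (?:(?!<a\b).)*? consumes the link text but refuses to cross another <a.
A_TAG = re.compile(r'<a href="([^"]*)"[^>]*>(?:(?!<a\b).)*?</a>', re.DOTALL)

# Replace each whole tag with the URL captured from it.
result = A_TAG.sub(r"\1", text)
print(result)
```

Tags whose href is not the first attribute, or that use single quotes, are left untouched by this pattern, which is one more reason to prefer a parser for anything less regular.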
I'm a Rubyist, so my example is going to be in Ruby. I'd recommend using two regexes, just to keep things straight:
url_reg = /<a href="(.*?)"/ # Matches first string within <a href=""> tag
tag_reg = /(<a href=.*?a>)/ # Matches entire <a href>...</a> tag
You'll want to pull the URL with the first regex out and store it temporarily, then replace the entire contents of the tag (matched with the tag_reg) with the stored URL.
You might be able to combine it, but it doesn't seem like a good idea. You're fundamentally altering (by deleting) the original tag, and replacing it with something inside itself. Less chance of things going wrong if you separate those two steps as much as possible.
Example in Ruby
def replace_tag(input)
  url_reg = /<a href="(.*?)"/ # Match URLs within an <a href> tag
  tag_reg = /(<a href=.*?a>)/ # Match an entire <a href></a> tag
  while (input =~ tag_reg) # While the input has matching <a href> tags
    url = input.scan(url_reg).flatten[0] # Retrieve the first URL match
    input = input.sub(tag_reg, url) # Replace first tag contents with URL
  end
  return input
end

File.open("test.html", "r") do |html_input| # Open original HTML file
  File.open("output.html", "w") do |html_output| # Open an output file
    while line = html_input.gets # Read each line
      output = replace_tag(line) # Perform necessary substitutions
      html_output.puts(output) # Write output lines to file
    end
  end
end
Even if you don't use Ruby, I hope the example makes sense. I tested this on your given input file, and it produces the expected output.

regex for all characters on yahoo pipes

I have an apparently simple regex query for Pipes: I need to truncate each item from its <img> tag onwards. I thought a loop with a string regex replacing <img[.]* with a blank field would take care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink; that's why I'm chopping the rest off.
Going one better, since all I'm really doing here is making the item string more manageable by cutting it down before further manipulation: does anyone know if it's possible to extract the href from a certain link in the page (in this case the first one) using regex in Yahoo Pipes? I've seen the regex answer to this SO question, but I'm not sure how to use it to map a URL to an item attribute in a Pipes module.
You need to remove the line returns with a RegEx Pipe and replace the pattern [\r\n] with null text on the content or description field to make it a single line of text, then you can use the .* wildcard which will run to the end of the line.
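The same two steps can be sketched in Python to show what each pattern does; this is only an illustration of the regex behaviour, not Yahoo Pipes module syntax. Note that inside a character class a dot is literal, so <img[.]* matches only <img followed by actual dots. The trailing link in the sample item is a made-up placeholder for the irrelevant hyperlinks.

```python
import re

# Sample item based on the question; the last link is a hypothetical stand-in.
item = (
    'sample text title\n'
    '<a rel="nofollow" target="_blank" href="http://example.com">'
    '<img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>\n'
    '<a href="http://example.org/other">other link</a>'
)

# Step 1: remove line breaks so the item becomes a single line.
one_line = re.sub(r"[\r\n]", "", item)

# Step 2: drop everything from the <img tag to the end of the line.
truncated = re.sub(r"<img.*", "", one_line)

# Optional: pull the href of the first remaining link.
first_href = re.search(r'href="([^"]*)"', truncated)
link = first_href.group(1) if first_href else None

print(truncated)
print(link)
```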
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/