I am getting this html string returned via a web service call.
I am having trouble displaying the HTML, probably because of the odd format (notice how the opening tag "<head>" shows up as "&lt;head&gt;" instead).
This is my truncated html formatted response.
What I am trying to do is display this html page on a form. But I am even having trouble getting it to open when I write the string to a file.
Any help is greatly appreciated,
Thanks
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://tempuri.org/">&lt;html&gt;&lt;head&gt;&lt;title&gt;......
...html......&lt;/div&gt;&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;</string>
&lt; and &gt; are XML and HTML character entities. For some reason (probably so that it can return malformed HTML) this web service returns the tag brackets < and > replaced with &lt; and &gt;. If you can assume that inside the returned <string></string> element &lt; and &gt; are used only as tag brackets, you can simply replace the entities with the proper brackets. If you can't assume that, you need to parse the text of the string element to obtain valid HTML.
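The entity replacement described above can be sketched in Python with the standard library's html.unescape. This is only an illustration: the `escaped` value is a made-up stand-in for the text of the <string> element, not the actual service response.

```python
import html

# Hypothetical stand-in for the escaped text inside the <string> element.
escaped = "&lt;html&gt;&lt;head&gt;&lt;title&gt;Demo&lt;/title&gt;&lt;/head&gt;&lt;/html&gt;"

# html.unescape converts &lt;, &gt;, &amp; and other character references
# back to the characters they stand for.
decoded = html.unescape(escaped)
print(decoded)
```

Unlike a plain search-and-replace of the two bracket entities, this also decodes any other entities (such as &amp;) that the service may have produced.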
Thanks for your input. This is what worked for me.
uses
  HTTPApp;

var
  HtmlXmlText, HtmlText: String;

// HtmlXmlText holds the escaped string returned by the web service
HtmlText := HTMLDecode(HtmlXmlText);
I used HTMLDecode to clean up the odd characters to the standard formatting.
If you are just writing code to deal with this string (and you don't have access to code that retrieved it) then the most correct way is to use an XML parser.
uses
  XMLIntf, XMLDoc;

procedure blah;
var
  doc: IXMLDocument;
  HtmlText: String;
begin
  doc := NewXMLDocument;
  doc.LoadFromFile(...);
  // The parser has already decoded the entities in the element text
  HtmlText := doc.DocumentElement.Text;
  DoWhatever(HtmlText);
end;
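The same idea, letting an XML parser do the decoding, can be sketched in Python with xml.etree.ElementTree. The `response` string below is a hypothetical, shortened response shaped like the one in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical response, shaped like the one in the question.
response = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<string xmlns="http://tempuri.org/">'
    '&lt;html&gt;&lt;body&gt;Hi&lt;/body&gt;&lt;/html&gt;'
    '</string>'
)

# Pass bytes, since the XML declaration names an encoding.
root = ET.fromstring(response.encode("utf-8"))

# The parser decodes the entities, so the element text is plain HTML.
html_text = root.text
print(html_text)
```

No manual replacement is needed here, because the entities only exist in the serialized XML and disappear once the document is parsed.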
I have a string which I need to send in an XML node to a third-party application. That string is then run through an HTML parser on their side. The string can contain HTML, but problems occur with text that looks like a tag and is not HTML. For example:
<cfset str = "This mail was <b>sent</b> by Jen Myke <jmyke@mail.com> on June 20th.<br/> Click on <a href='http://google.com'>this link</a> for more information.">
There can be non-UTF characters in the string too, which also cause issues, but I found an old blog post that helps remove them.
<cfset str = reReplace(str, "[^\x20-\x7E]", "", "ALL")>
But I am unable to figure out how to remove the HTML look-alikes.
Try wrapping the string with encodeForXML(). This should encode any non-ASCII character for use within an XML node.
<node>#encodeForXml(str)#</node>
If you need to pass data in an attribute, then
<node attr="#encodeForXmlAttribute(str)#"/>
Edit: You can try using getSafeHTML() before encoding the rest of the string. This will remove HTML tags from a string using an XML configuration file to set your AntiSamy settings. Check the docs for more info.
Try replacing
< with &lt;
> with &gt;
I have a String containing the following:
"<CV-ALL><CURRICULO-VITAE SISTEMA-ORIGEM-XML='DEGOIS_ONLINE' DATA-ATUALIZACAO='26032015' HORA-ATUALIZACAO='193918' VERSAO-DA-GRAMATICA='2.0'><DADOS-GERAIS NOME-COMPLETO='Gonçalves' NOME-EM-CITACOES-BIBLIOGRAFICAS='Pte, A.' NACIONALIDADE='P' CURRICULUM-CONCLUIDO='SIM' OUTRAS-INFORMACOES-RELEVANTES='' ID-DEGOIS='267296113190873275' ORCID='0000-0001-5944-3218'><ENDERECO FLAG-DE-PREFERENCIA='ENDERECO_INSTITUCIONAL'><ENDERECO-PROFISSIONAL CODIGO-INSTITUICAO-EMPRESA='1124000002312' NOME-INSTITUICAO-EMPRESA='Instituto Politécnico' CODIGO-ORGAO='43400884' NOME-ORGAO='Escola Superior' CODIGO-UNIDADE='11241886' NOME-UNIDADE='Centro de Investigação' PAIS='Portugal' UF='CE'/> .... "
Please note that, for the sake of simplicity, I have omitted the remaining content of this string; its content is in fact an XML document.
Does anybody know of (or have) an XSL transformation script that takes a string such as this as input and transforms it into an XML document, so that it is possible to navigate it using XPath expressions?
Thanks!
Assuming you are passing the string to the XSLT stylesheet as a parameter, and processing a dummy XML document, you have two options:
1) Output the string with output escaping disabled, save the result as a new document, then process the new document with another XSLT stylesheet;
2) Use XSLT 3.0 (or a processor that supports parsing XML as an extension) to parse the string as XML before processing it.
Of course, you could save yourself all this trouble by writing the string to a file using your parent process, then pointing your XSLT processor to the resulting file.
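If the parent process happens to be able to parse XML itself, the string can also be parsed there directly. Below is a minimal Python sketch using xml.etree.ElementTree, which supports a limited XPath subset. The `xml_string` value is a shortened, hand-closed stand-in for the string in the question, not the full document.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the string in the question (closing tags added by hand).
xml_string = (
    "<CV-ALL><CURRICULO-VITAE SISTEMA-ORIGEM-XML='DEGOIS_ONLINE'>"
    "<DADOS-GERAIS NOME-COMPLETO='Gonçalves' NACIONALIDADE='P'>"
    "<ENDERECO FLAG-DE-PREFERENCIA='ENDERECO_INSTITUCIONAL'/>"
    "</DADOS-GERAIS></CURRICULO-VITAE></CV-ALL>"
)

# Parse the string into an element tree.
root = ET.fromstring(xml_string)

# Navigate with XPath-style expressions (ElementTree implements a subset of XPath).
dados = root.find("./CURRICULO-VITAE/DADOS-GERAIS")
name = dados.get("NOME-COMPLETO")
flags = [e.get("FLAG-DE-PREFERENCIA")
         for e in root.findall(".//ENDERECO[@FLAG-DE-PREFERENCIA]")]

print(name, flags)
```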
I am trying to convert an XML file to HTML. The XML file has a bunch of HTML tags of the form:
<item>&lt;text&gt;Line 1&lt;br/&gt;Line 2&lt;br/&gt;Line 3&lt;/text&gt;</item>
Ultimately, the output that appears in Internet Explorer is:
<text>Line 1<br/>Line 2<br/>Line 3</text>
When I would like:
Line 1Line 2Line 3
Once I discovered disable-output-escaping, the text rendered properly in IE. Unfortunately, MarkLogic does not support this attribute.
I was able to eliminate the tags altogether using replace(), but I cannot replace the line break tags with an actual new line character.
Does anyone have any ideas on how to either:
1) Render the HTML properly in MarkLogic, or
2) Properly parse the HTML tags in XSLT.
Thanks!
Maybe you want this
let $foo := <item>&lt;text&gt;Line 1&lt;br/&gt;Line 2&lt;br/&gt;Line 3&lt;/text&gt;</item>
return xdmp:unquote($foo/text())
AS 3.0 / Flash
I am consuming XML which I have no control over.
The XML has HTML in it, which I am styling and displaying in an HTML text field.
I want to remove all the HTML except the links.
The approach from "Strip all HTML tags except links" is not working for me.
Does anyone have any tips? A regex?
The following removes tables:
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
But now I realize I need to keep the content that is in the tables, and I also need the links.
thanks!!!
cp
Probably shouldn't use regex to parse HTML, but if you don't care, something simple like this (\b and \s* instead of \s+, so that a bare <table> and </table> also match):
find /<table\b[^>]*>.*?<\/table\s*>/s
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
clearTables(xml);
trace(xml.toXMLString()); // will output everything but the tables

// Replaces the <table> children of this node, then recurses into the remaining elements
function clearTables(xml:XML):void {
    xml.replace("table", "");
    for each (var child:XML in xml.elements("*")) clearTables(child);
}
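The tree-based approach is not specific to E4X. As a rough sketch of the same idea in Python, using xml.etree.ElementTree and assuming the input is well-formed XML (the `remove_tables` helper and the sample `page` are made up for illustration):

```python
import xml.etree.ElementTree as ET

def remove_tables(element):
    """Remove every <table> descendant of element, at any depth."""
    # Iterate over a copy, because children are removed while looping.
    for child in list(element):
        if child.tag == "table":
            element.remove(child)
        else:
            remove_tables(child)

# Hypothetical sample document, similar to the E4X example.
page = ET.fromstring(
    "<page><p>Other</p><table><tr><td>1</td></tr></table>"
    "<div><table><tr><td>2</td></tr></table></div><p>kept</p></page>"
)
remove_tables(page)
result = ET.tostring(page, encoding="unicode")
print(result)
```

One caveat of this sketch: ElementTree stores text that follows an element on that element, so removing a table also removes any text directly after it.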
I want to extract the image URLs from any website. I am reading the page source through WebRequest. I want a regular expression which will fetch the image URL from this content, i.e. the src value of the <img> tag.
I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
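// HtmlWeb comes from the HtmlAgilityPack library (using HtmlAgilityPack;)
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
// Select only <img> elements that have a src attribute; null if none match
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.GetAttributeValue("src", ""));
    }
}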
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.
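For contrast, here is a minimal sketch of the parser-based approach in Python, using only the standard library's html.parser. The `ImgSrcCollector` class name is made up for illustration. Because the parser splits attributes properly, the edge case above does not trip it up.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

collector = ImgSrcCollector()
collector.feed('<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">')
print(collector.sources)  # ['/content/img/so/logo.png']
```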