Stripping images from an .rtf by passing through textbox with Coldfusion - coldfusion

In one part of my app users can copy a Word or rtf document and paste it into a textbox on a form, and on submitting the form any images and a lot of formatting are stripped out of the form field content.
I want to achieve the same result by reading directly from the file rather than through a manual form submit, i.e. strip out the hidden characters and image data and leave just the text and linefeeds / carriage returns.
How can I achieve a similar thing?

If you just want to extract the text from Word documents, you could try POI. CF9 already includes a version that can handle most .doc or .docx files (it does not handle .rtf files). For CF8, you will need to use JavaLoader to load a newer version. See: Reading Office documents with ColdFusion (2).

I found this blog post that might assist: http://www.leavethatthingalone.com/blog/index.cfm/2005/6/11/Using-ColdFusion-to-convert-RTF-to-XHTML
This process converts the rtf file to xml, and then you can use ColdFusion's xml tags to read the converted file.
Process:
1) Download the Majix library as directed.
2) Extract the lib folder and save it to the ColdFusion server.
3) Add the extracted location to the ColdFusion class path and restart the server.
4) Follow the code sample in the blog.
Note that this library automatically creates the xml file. If your input file is mydoc.rtf your xml output file is mydoc.xml
Sample output created by this process:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<?xml:stylesheet type="text/xsl" href="mydoc.xsl"?>
<!-- generated by Majix from c:\doc.rtf on Mon Jan 31 12:04:03 EST 2011 using template MyDoc -->
<!DOCTYPE mydoc PUBLIC "-//TetraSix//DTD mydoc v1.1//EN" "mydoc.dtd" [
<!NOTATION wmf PUBLIC "-/TetraSix/NOTATION Windows Metafile/EN" "wmf">
<!ENTITY g001 SYSTEM "images/doc_001.wmf" NDATA wmf>
]>
<mydoc>
<p>This is my rtf document</p>
<p></p>
<p><graphic url='images/doc_001.wmf'/></p>
<p></p>
<p></p>
</mydoc>
I created my own test bed using the linked library with ColdFusion 9 without any problems.
Note that I skipped the second rereplacenocase from the blog post as it resulted in a malformed xml document.
Once you have your xml file you can read it like so:
<cffile action="read" file="c:\doc.xml" variable="xmldoc">
<cfdump var="#xmlparse(xmldoc)#" />
This dumps the parsed xml object.
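Once the document is in XML form, pulling out only the text is a matter of walking the paragraph elements and ignoring the graphics. As a rough, language-neutral illustration (Python rather than ColdFusion, and assuming the `<mydoc>` / `<p>` / `<graphic>` structure from the sample output above, trimmed of its DOCTYPE for brevity), the idea looks like this; `extract_paragraph_text` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample output shown above (assumption: same element names).
SAMPLE = """<mydoc>
<p>This is my rtf document</p>
<p></p>
<p><graphic url='images/doc_001.wmf'/></p>
<p></p>
</mydoc>"""


def extract_paragraph_text(xml_text):
    """Return the text of every non-empty <p>, one paragraph per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for p in root.iter("p"):
        # itertext() yields only character data, so <graphic/> contributes nothing.
        text = "".join(p.itertext()).strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


print(extract_paragraph_text(SAMPLE))
```

The same traversal could be done in ColdFusion on the object returned by xmlparse(); only the walking logic is being illustrated here.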

Related

Why do browsers omit XML tags?

(Better title, anyone?) I am rendering some XML made with lxml.builder using a small Flask app in Python 3.6. The function makeXML in module mkX builds and returns the XML like so:
from lxml import etree as ET
...
def makeXML():
    ...
    # myxml is type <class 'lxml.etree._Element'>
    f = ET.tostring(myxml, method='xml', xml_declaration=True, encoding='utf-8', pretty_print=True)
    return f
Where method=xml could be omitted, as it's the default. The Flask app does:
@app.route('/getXML')
def getXML():
    xml = mkX.makeXML()
    print(type(xml))  # xml is type <class 'bytes'>
    return xml
When I go to [myurl]/getXML in Chrome or Firefox, I see this:
eggs bacon sausage spam
It omits the XML tags. Why does that happen? Hitting view source, I see this:
<?xml version='1.0' encoding='utf-8'?>
<someXML>
<reclist>
<dat>eggs</dat>
<dat>bacon</dat>
<dat>sausage</dat>
<dat>spam</dat>
</reclist>
</someXML>
With pretty_print=True it's nicely formatted. Without it:
<?xml version='1.0' encoding='utf-8'?>
<someXML><reclist><dat>eggs</dat><dat>bacon</dat><dat>sausage</dat><dat>spam</dat></reclist></someXML>
Looking at other webservices that return XML, the browser does not omit the XML tags, for example this one.
Does this mean that myxml isn't valid XML? If so, what's the difference & how should I fix it?
A browser renders HTML, not XML, unless it is told otherwise. Most browsers try to show whatever they can from a document. In your case they show you all the text nodes but not the XML elements, which have no meaning in HTML.
Check if the HTTP response includes a line saying
Content-Type: application/xml
Only if this is set can the browser decide to display the XML document.
As you can see when you open the source view, the XML is complete. Everything works as it is supposed to do.
For completeness' sake, in addition to Lutz Horn's answer, this is how to set Flask to return a specific mimetype:
...
from flask import Response
...
def getXML():
    xml = mkX.makeXML()
    return Response(xml, mimetype='application/xml')
Since the xml is records rather than text, 'application/xml' is preferable over 'text/xml', more info here.
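To see the header independently of Flask, here is a minimal sketch using only the standard library's WSGI support. The names `xml_app` and `call_app` are made up for illustration, and the XML body is a shortened copy of the one in the question; the point is only that the response carries `Content-Type: application/xml`.

```python
from wsgiref.util import setup_testing_defaults


def xml_app(environ, start_response):
    """Minimal WSGI app that labels its body as XML."""
    body = (
        b"<?xml version='1.0' encoding='utf-8'?>\n"
        b"<someXML><reclist><dat>eggs</dat><dat>spam</dat></reclist></someXML>"
    )
    headers = [
        ("Content-Type", "application/xml; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app):
    """Invoke a WSGI app in-process and return (status, headers dict, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {}
    setup_testing_defaults(environ)  # fills in a plausible fake GET request
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app(xml_app)
print(status, headers["Content-Type"])
```

Serving `xml_app` with `wsgiref.simple_server.make_server` and opening it in a browser should show the browser's XML tree view rather than the bare text.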

Goutte with behat: xml string as textarea value is filled with html entities

I have a page that contains a form with some input elements, including a textarea. Those input fields are populated with some values. Think of this whole page as an edit page for some entity. The textarea contains an XML string that displays properly in a normal browser (e.g. Firefox and Chrome) and looks like the following:
<front>
<!-- top row -->
<cell>
<page>8</page>
</cell>
</front>
But when I run the test case with the Goutte Mink driver, the page is loaded and the value of the textarea is encoded with special characters, like so:
&lt;front&gt;
&lt;!-- top row --&gt;
&lt;cell&gt;
&lt;page&gt;8&lt;/page&gt;
&lt;/cell&gt;
&lt;/front&gt;
And when I press the submit button, that mess is sent to the server and saved instead of the initial correct XML. Note that I do not touch it at all. I can just load the page and press the submit button and it's all screwed up. This happens only with Goutte, but not with, say, Selenium2.
So the question is: how can I force Goutte to interpret those HTML entities automatically and send them as correct data, not that encoded mess?
There's no solution for that; it seems to be normal behavior of Goutte/Guzzle. As a workaround we ended up with the following: when the Goutte driver is used, we look for all <textarea> elements in the page contents and, for each one found, get its contents and simply reinsert them as follows:
$elements = $this->pageContent->findAll('xpath', '//textarea');
foreach ($elements as $element) {
    $element->setValue($element->getText());
}
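The workaround works because reading the text back decodes the entities before the value is set again. The decoding step on its own can be sketched in Python with the standard `html` module. This assumes the value arrives entity-encoded (for example `&lt;front&gt;`); `decode_textarea_value` is a hypothetical helper, not part of Goutte or Mink:

```python
import html


def decode_textarea_value(raw_value):
    """Turn entity-encoded markup (&lt;, &gt;, &amp;, ...) back into literal characters."""
    return html.unescape(raw_value)


# Assumed shape of the value as seen through the Goutte driver.
encoded = (
    "&lt;front&gt;\n"
    "&lt;!-- top row --&gt;\n"
    "&lt;cell&gt;\n"
    "&lt;page&gt;8&lt;/page&gt;\n"
    "&lt;/cell&gt;\n"
    "&lt;/front&gt;"
)

decoded = decode_textarea_value(encoded)
print(decoded)
```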

Parsing HTML tags using XSLT/MarkLogic

I am trying to convert an XML file to HTML. The XML file has a bunch of HTML tags of the form:
<item><text>Line 1<br/>Line 2<br/>Line 3</text></item>
Ultimately, the output that appears in Internet Explorer is:
<text>Line 1<br/>Line 2<br/>Line 3</text>
When I would like:
Line 1Line 2Line 3
Once I discovered disable-output-escaping, the text rendered properly in IE. Unfortunately, MarkLogic does not support this attribute.
I was able to eliminate the tags altogether using replace(), but I cannot replace the line break tags with an actual new line character.
Does anyone have any ideas on how to either:
1) Render the HTML properly in MarkLogic, or
2) Properly parse the HTML tags in XSLT.
Thanks!
Maybe you want this
let $foo := <item><text>Line 1<br/>Line 2<br/>Line 3</text></item>
return xdmp:unquote($foo/text())
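The idea behind xdmp:unquote is to parse a string of escaped markup back into real nodes, after which the `<br/>` elements can be handled as elements rather than as text. A rough Python analogue, assuming the markup inside `<item>` is stored escaped (an assumption, since the sample above shows it unescaped) and using hypothetical helper names, could look like this:

```python
import xml.etree.ElementTree as ET

# Assumption: the inner markup is stored as escaped text inside <item>.
ESCAPED_DOC = (
    "<item>&lt;text&gt;Line 1&lt;br/&gt;Line 2&lt;br/&gt;Line 3&lt;/text&gt;</item>"
)


def unquote_item(doc):
    """Parse the outer document, then parse its text content as markup again."""
    item = ET.fromstring(doc)
    # item.text is now the literal string "<text>Line 1<br/>...</text>".
    return ET.fromstring(item.text)


def text_with_line_breaks(element):
    """Flatten an element to text, turning each direct <br/> child into a newline."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "br":
            parts.append("\n")
        else:
            # Other children are reduced to their text content.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


print(text_with_line_breaks(unquote_item(ESCAPED_DOC)))
```

Only `<br/>` elements that are direct children are converted; nested ones would need a recursive version.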

actionscript htmltext. removing a table tag from dynamic html text

AS 3.0 / Flash
I am consuming XML which I have no control over.
The XML has HTML in it, which I am styling and displaying in an HTML text field.
I want to remove all the HTML except the links.
Strip all HTML tags except links
is not working for me.
Does anyone have any tips? A regex?
The following removes tables.
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
But now I realize I need to keep the content that is in the tables, and I also need the links.
thanks!!!
cp
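You probably shouldn't use regex to parse HTML, but if you don't care, something simple like this works (the s flag lets . match newlines, and g replaces every table rather than only the first):
find /<table\b[^>]*>.*?<\/table\s*>/gs
replace ""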
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
clearTables (xml);
trace (xml.toXMLString()); // will output everything but the tables
function clearTables (xml : XML ) : void {
    xml.replace( "table", "");
    for each (var child:XML in xml.elements("*")) clearTables(child);
}
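The same tree-editing idea, removing `<table>` elements from the parsed tree instead of pattern-matching the markup, can be sketched outside ActionScript as well. This Python version uses the standard library's ElementTree on the sample document from the answer; `remove_tables` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

PAGE = """<page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>"""


def remove_tables(root):
    """Remove every <table> element, at any depth, from the tree in place."""
    # Snapshot the elements first so the tree is not mutated while being walked.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "table":
                # Note: this also drops any text directly following the table (its tail).
                parent.remove(child)
    return root


page = remove_tables(ET.fromstring(PAGE))
print(ET.tostring(page, encoding="unicode"))
```

Like the E4X version, this drops the table contents too; keeping the cell text or links would mean copying those children into the parent before removing the table.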

print a page using xslt

How can I print a page using XSLT?
I need a link or a button which, when clicked, invokes the printer dialog box.
I suspect you need to specify a bit more about what you are trying to do.
XSLT is simply a way to turn one block of text into another. The input is generally an xml buffer and the output is some text rendering of that buffer.
It is possible that you are trying to generate a script using XSLT and that you want that script to be able to open a print dialog when it is run by something e.g. you generate javascript, that then runs on a browser.
Can you describe in more detail what you want to achieve?
The following in an HTML page gives you a print link:
<a href="javascript:window.print()">Print</a>
XSLT is a language for transforming XML documents. That means you can add/modify content. Assuming your output is HTML, you can do this:
<xsl:template match="top">
<html>
<head>
</head>
<body>
<input name="print" type="button" value="Print"
onclick="window.print()"/>
<xsl:apply-templates />
</body>
</html>
</xsl:template>
But of course, where exactly the button goes depends on your needs. I'd additionally add a media="print" stylesheet at the top so that the printed document comes out neat!