How to replace text in a content control after XML binding using docx4j

I am using docx4j 2.8.1 with content controls in my .docx file. I can replace the CustomXML part by injecting my own XML and then calling BindingHandler.applyBindings after supplying the input XML. I can add a token such as ¶ to my XML, and I would then like to replace that token in the MainDocumentPart. However, when I iterate through the content of the MainDocumentPart with this (link) method, none of the text from my XML is in the collection extracted from the MainDocumentPart. I am thinking that even after binding, the XML remains separate from the MainDocumentPart (??)

I haven't tried this with anything more than a little test doc yet. My token is the pilcrow, ¶. Since it's a single character, it won't be split across separate runs. My code is:
private void injectXml(WordprocessingMLPackage wordMLPackage) throws JAXBException {
    MainDocumentPart part = wordMLPackage.getMainDocumentPart();
    String xml = XmlUtils.marshaltoString(part.getJaxbElement(), true);
    xml = xml.replaceAll("¶", "</w:t><w:br/><w:t>");
    Object obj = XmlUtils.unmarshalString(xml);
    part.setJaxbElement((Document) obj);
}
The pilcrow character comes from the XML and is injected by applying the XML bindings to the content controls. The problem is that the bound content does not seem to end up in the MainDocumentPart, so the replace never matches. (Using docx4j 2.8.1.)
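For reference, the iteration I have in mind looks roughly like this (a minimal sketch using docx4j's getJAXBNodesViaXPath and XmlUtils.unwrap, which I am assuming behave as documented; exception handling omitted):

import org.docx4j.XmlUtils;
import org.docx4j.wml.Text;

// Walk every w:t in the MainDocumentPart and strip the pilcrow token
List<Object> texts = wordMLPackage.getMainDocumentPart()
        .getJAXBNodesViaXPath("//w:t", false);
for (Object o : texts) {
    Text t = (Text) XmlUtils.unwrap(o);
    if (t.getValue() != null && t.getValue().contains("\u00B6")) { // U+00B6 = pilcrow
        t.setValue(t.getValue().replace("\u00B6", ""));
    }
}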

Related

Removing invalid URLs from MSHTML document before it is loaded

I use MSHTML (IHTMLDocument) to display offline HTML, loaded from emails, which may contain various links.
Some of these have URLs starting with // or /, for example:
<img src="//www.example.com/image.jpg">
<img src="/www.example.com/image.jpg">
These take a long time to resolve before the document displays, because the URL obviously cannot be found: it doesn't start with http:// or https://.
I tried injecting a <base> tag into <head>, pointing it at a known local folder (which is empty), and that stopped the problem. For example:
<base href="C:\myemptypath\">
However, if links begin with \\ (a UNC path), the same problem and long loading time return, e.g.:
<img src="\\www.something.com\image.jpg">
I also tried placing the WebBrowser control into "offline" mode and every other trick I could think of, and came up with nothing short of regex-replacing all the links in the HTML (which would be a terribly slow solution) or parsing the HTML myself (which defeats the purpose of MSHTML).
Is there a way to:
Detect these invalid URLs before the document is loaded? Note: I have already navigated the DOM (e.g. the WebBrowser1.Document.body.all collection) to find and modify all possible links from all tags, and that works, but it only happens after the document is already loaded, so the long wait before loading gives up still occurs.
Trigger some event to avoid loading these invalid links and simply replace them with about:blank or an empty "" string, some sort of "OnURLPreview" event in which I could inspect and reject invalid URLs? There is only the OnDownloadBegin event, which is not it.
Any examples in any language are welcome, although I use Delphi and C++ (C++ Builder); I only need the principle, i.e. which direction to look in.
After a long time this is the solution I used:
Created an instance of CLSID_HTMLDocument to parse HTML:
DelphiInterface<IHTMLDocument2> diDoc;
OleCheck(CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&diDoc)));
Write IHTMLDocument2 to diDoc using Document->write
// Creates a new one-dimensional array (one VT_VARIANT element holding the HTML)
WideString HTML = "Example HTML here...";
SAFEARRAY *psaStrings = SafeArrayCreateVector(VT_VARIANT, 0, 1);
if (psaStrings)
{
    VARIANT *param;
    BSTR bstr = SysAllocString(HTML.c_bstr());
    SafeArrayAccessData(psaStrings, (LPVOID*)&param);
    param->vt = VT_BSTR;
    param->bstrVal = bstr;
    SafeArrayUnaccessData(psaStrings);
    diDoc->write(psaStrings);
    diDoc->close();
    // SafeArrayDestroy calls SysFreeString for each contained BSTR,
    // so a separate SysFreeString(bstr) is not needed here
    SafeArrayDestroy(psaStrings);
    return S_OK;
}
Parse unwanted links in diDoc
DelphiInterface<IHTMLElementCollection> diCol;
if (SUCCEEDED(diDoc->get_all(&diCol)) && diCol)
{
    // Parse IHTMLElementCollection here...
}
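The link-parsing step itself might look roughly like this (a hedged sketch of walking the collection; the actual tag/attribute checks and URL rewriting go where indicated, and error handling is omitted):

long count = 0;
diCol->get_length(&count);
for (long i = 0; i < count; i++)
{
    VARIANT vIndex;
    vIndex.vt = VT_I4;
    vIndex.lVal = i;
    DelphiInterface<IDispatch> diDisp;
    if (SUCCEEDED(diCol->item(vIndex, vIndex, &diDisp)) && diDisp)
    {
        DelphiInterface<IHTMLElement> diElem;
        if (SUCCEEDED(diDisp->QueryInterface(IID_IHTMLElement, (void**)&diElem)) && diElem)
        {
            // Inspect diElem here (tag name, src/href attribute) and rewrite
            // any URL beginning with //, / or \\ to e.g. about:blank
        }
    }
}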
Extract parsed HTML into WideString and write into TWebBrowser
DelphiInterface<IHTMLElement> diBODY;
OleCheck(diDoc->get_body(&diBODY));
if (diBODY)
{
    DelphiInterface<IHTMLElement> diHTML;
    OleCheck(diBODY->get_parentElement(&diHTML));
    if (diHTML)
    {
        WideString wsHTML;
        OleCheck(diHTML->get_outerHTML(&wsHTML));
        // And finally use Document->write as above to write into your final TWebBrowser document here...
    }
}

Why is libxml not storing html in my htmlDocPtr?

I am working on a piece of software that uses libxml to parse the XML on web pages and store it in an xmlDocPtr. I need to expand this functionality to do the same for HTML.
The original code:
xmlDocPtr doc = xmlParseEntity(filename.c_str());
where filename = 10.1.1.135/poll_data.xml, and everything works just fine.
Now I have an HTML file, filename = 10.1.1.165/index.htm, and would like to store this as well. I have tried using htmlParseFile with no success:
htmlDocPtr doc = htmlParseFile(filename.c_str(), "windows-1252");
The resulting doc object is not null, but it does not contain the contents of index.htm.
NetBeans spits out:
http://10.1.1.165/index.htm:1: HTML parser error : Document is empty
Any suggestions?
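For comparison, libxml2 also exposes the htmlReadFile entry point, which takes explicit parser options; a minimal sketch (assuming libxml2's documented API, and not a confirmed fix for the problem above):

#include <libxml/HTMLparser.h>

/* Parse the same URL with explicit HTML parser options: RECOVER builds a
 * tree even from non-well-formed markup, and NOERROR/NOWARNING silence
 * console output while testing. */
htmlDocPtr doc = htmlReadFile("http://10.1.1.165/index.htm", "windows-1252",
                              HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                              HTML_PARSE_NOWARNING);
if (doc == NULL || xmlDocGetRootElement(doc) == NULL) {
    /* still empty: inspect what the server actually returns for this URL */
}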

Aspose.PDF: How to replace text on a PDF page with all upper case

I am trying to replace the text on a specific page with its upper-case equivalent using Aspose.PDF for .NET. If anyone can provide any help, that would be great. Thank you.
My name is Tilal Ahmad and I am a developer evangelist at Aspose.
You may use the documentation link for searching and replacing text on a specific page of the PDF document. You should call the Accept method for the specific page index, as suggested at the bottom of the documentation. Furthermore, to replace text with upper case you can use the ToUpper() method of the String object, as follows:
....
textFragment.Text = textFragment.Text.ToUpper();
....
Edit: sample code to change text case on a specific PDF page
//open the document
Document pdfDocument = new Document(myDir + "testAspose.pdf");
//create a TextFragmentAbsorber to find all instances of the input search phrase (empty here, to catch all fragments)
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("");
//accept the absorber for one specific page (page indexes are 1-based)
pdfDocument.Pages[2].Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    //update the text and other properties
    textFragment.Text = textFragment.Text.ToUpper();
}
pdfDocument.Save(myDir + "replacetext_output.pdf");

Why can't I use xpath to parse nodes in the default namespace?

I'm an XML newbie and I have an XML document (which I can't edit because it comes from somewhere else), but it has a root node like this:
<Configuration xmlns="http://schemas.mycomp.com/product/settings" version="2.0.0">
I'm trying to parse this document with MSXML and XPath, and I've done it successfully if I remove the xmlns attribute. For some reason, with the xmlns attribute in place, the document won't parse. I've attempted to make the MSXML parser recognise the document using:
m_pXMLDoc->setProperty( _bstr_t(L"AllowDocumentFunction"), _variant_t(true));
m_pXMLDoc->setProperty( _bstr_t(L"AllowXsltScript"), _variant_t(true));
m_pXMLDoc->setProperty( _bstr_t(L"SelectionLanguage"), _variant_t(L"XPath"));
m_pXMLDoc->setProperty( _bstr_t(L"SelectionNamespaces"), _variant_t(L"xmlns='http://schemas.mycomp.com/product/settings'"));
m_pXMLDoc->preserveWhiteSpace = VARIANT_FALSE;
m_pXMLDoc->resolveExternals = VARIANT_TRUE;
m_pXMLDoc->validateOnParse = VARIANT_FALSE;
From reading around, it looks like XPath only works on the "no name" namespace, and this document sets the default namespace so that it's no longer "no name". Can I set the namespace that XPath uses with MSXML?
From Microsoft: this behaviour is by design.
See http://support.microsoft.com/kb/288147
Use prefixes with the namespaces when you specify the SelectionNamespaces property.
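That is, bind a prefix of your own to the default namespace URI and qualify each XPath step with it; a minimal sketch (the prefix p is arbitrary, and m_pXMLDoc is the same document pointer as above):

// Bind an arbitrary prefix ("p" here) to the document's default namespace,
// then qualify every element step in the XPath with that prefix.
m_pXMLDoc->setProperty(_bstr_t(L"SelectionNamespaces"),
    _variant_t(L"xmlns:p='http://schemas.mycomp.com/product/settings'"));
// Elements in the default namespace are now reachable as p:Name
MSXML2::IXMLDOMNodePtr pNode =
    m_pXMLDoc->selectSingleNode(_bstr_t(L"/p:Configuration"));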

How to use regex in selenium locators

I'm using Selenium RC and I would like, for example, to get all the link elements with an href attribute that matches:
http://[^/]*\d+.com
I would like to use:
sel.get_attribute('//a[regx:match(@href, "http://[^/]*\d+.com")]/@name')
which would return a list of the name attributes of all the links that match the regex.
(Or something like it.)
Thanks
The answer above is probably the right way to find ALL of the links that match a regex, but I thought it'd also be helpful to answer the other part of the question: how to use regex in XPath locators. You need to use the regex matches() function, like this:
xpath=//div[matches(@id,'che.*boxes')]
(This, of course, would click the div with id=checkboxes, or id=cheANYTHINGHEREboxes.)
Be aware, though, that the matches function is not supported by all native browser implementations of XPath (most conspicuously, using this in FF3 will throw an error: invalid xpath[2]).
If you have trouble with your particular browser (as I did with FF3), try using Selenium's allowNativeXpath("false") to switch over to the JavaScript XPath interpreter. It'll be slower, but it does seem to work with more XPath functions, including matches and ends-with. :)
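In the Java client, that combination would look something like this (a short sketch, assuming the standard DefaultSelenium API):

// Switch Selenium RC to its bundled JavaScript XPath engine (which
// implements matches()), then use the regex-based locator as usual.
selenium.allowNativeXpath("false");
selenium.click("xpath=//div[matches(@id,'che.*boxes')]");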
You can use the Selenium command getAllLinks to get an array of the ids of the links on the page, which you can then loop through, checking the href with getAttribute, which takes the locator followed by an @ and the attribute name. For example, in Java this might be:
String[] allLinks = selenium.getAllLinks();
List<String> matchingLinks = new ArrayList<String>();
for (String linkId : allLinks) {
    String linkHref = selenium.getAttribute("id=" + linkId + "@href");
    if (linkHref.matches("http://[^/]*\\d+.com")) {
        matchingLinks.add(linkId);
    }
}
A possible solution is to use sel.get_eval() and write a JS script that returns a list of the links, something like the following answer:
selenium: Is it possible to use the regexp in selenium locators
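A rough sketch of that approach from the Java client (hedged: inside getEval, window refers to the application window, and the comma join assumes the collected values contain no commas):

// Collect the name attribute of every link whose href matches the
// regex, entirely in the browser, then split the result in Java.
String js =
      "var out = [];"
    + "var links = window.document.links;"
    + "for (var i = 0; i < links.length; i++) {"
    + "  if (/http:\\/\\/[^\\/]*\\d+\\.com/.test(links[i].href)) {"
    + "    out.push(links[i].name);"
    + "  }"
    + "}"
    + "out.join(',');";
String[] matchingNames = selenium.getEval(js).split(",");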
Here are some alternate methods for Selenium RC. These aren't pure Selenium solutions; they let you combine your programming language's data structures with Selenium.
You can also get the HTML page source, then run a regular expression over the source to return a matching set of links. Use regex grouping to separate out URLs, link text/ID, etc., and you can then pass them back to Selenium to click on or navigate to.
Another method is to get the HTML page source or the innerHTML (via DOM locators) of a parent/root element, then convert the HTML to XML as a DOM object in your programming language. You can then traverse the DOM with the desired XPath (with regular expressions or not) and obtain a node set of only the links of interest. From there, parse out the link text/ID or URL, and you can pass it back to Selenium to click on or navigate to.
Upon request, I'm providing examples below. The languages are mixed, since the post didn't appear to be language-specific anyway; I'm just using what I had available to hack together examples. They aren't fully tested, but I've worked with bits of the code before in other projects, so these are proof-of-concept examples of how you'd implement the solutions just mentioned.
//Example of element attribute processing by page source and regex (in PHP)
$pgSrc = $sel->getPageSource();
//simple hyperlink extraction via regex below, replace with better regex pattern as desired
preg_match_all("/<a.+href=\"(.+)\"/",$pgSrc,$matches,PREG_PATTERN_ORDER);
//$matches is a 2D array, $matches[0] is array of whole string matched, $matches[1] is array of what's in parenthesis
//you either get an array of all matched link URL values in parenthesis capture group or an empty array
$links = count($matches) >= 2 ? $matches[1] : array();
//now do as you wish, iterating over all link URLs
//NOTE: these are URLs only, not actual hyperlink elements
//Example of XML DOM parsing with Selenium RC (in Java)
String locator = "id=someElement";
String htmlSrcSubset = sel.getEval("this.browserbot.findElement(\""+locator+"\").innerHTML");
//using JSoup XML parser library for Java, see jsoup.org
Document doc = Jsoup.parse(htmlSrcSubset);
/* once you have this document object, can then manipulate & traverse
it as an XML/HTML node tree. I'm not going to go into details on this
as you'd need to know XML DOM traversal and XPath (not just for finding locators).
But this tutorial URL will give you some ideas:
http://jsoup.org/cookbook/extracting-data/dom-navigation
the example there seems to indicate first getting the element/node defined
by content tag within the "document" or source, then from there get all
hyperlink elements/nodes and then traverse that as a list/array, doing
whatever you want with an object oriented approach for each element in
the array. Each element is an XML node with properties. If you study it,
you'd find this approach gives you the power/access that WebDriver/Selenium 2
now gives you with WebElements but the example here is what you can do in
Selenium RC to get similar WebElement kind of capability
*/
Selenium's By.Id and By.CssSelector methods do not support regex, and By.XPath does only where XPath 2.0 is enabled. If you want to use regex, you can do something like this:
void MyCallingMethod(IWebDriver driver)
{
    //Search by ID:
    string attrName = "id";
    //Regex = 'a number that is 1-10 digits long'
    string attrRegex = "[0-9]{1,10}";
    SearchByAttribute(driver, attrName, attrRegex);
}

IEnumerable<IWebElement> SearchByAttribute(IWebDriver driver, string attrName, string attrRegex)
{
    List<IWebElement> elements = new List<IWebElement>();
    //Allows spaces around the equals sign. Ex: id = 55
    string searchString = attrName + "\\s*=\\s*\"" + attrRegex + "\"";
    //Search the page source
    MatchCollection matches = Regex.Matches(driver.PageSource, searchString, RegexOptions.IgnoreCase);
    //iterate over matches
    foreach (Match match in matches)
    {
        //Get the exact attribute value out of the match
        Match innerMatch = Regex.Match(match.Value, attrRegex);
        //Find the element by its exact attribute value
        string cssSelector = "[" + attrName + "='" + innerMatch.Value + "']";
        elements.Add(driver.FindElement(By.CssSelector(cssSelector)));
    }
    return elements;
}
Note: this code is untested. Also, you can optimize this method by figuring out a way to eliminate the second search.