I use MSHTML (IHTMLDocument) to display offline HTML, loaded from email messages, which may contain various links.
Some of these have URLs starting with // or /, for example:
<img src="//www.example.com/image.jpg">
<img src="/www.example.com/image.jpg">
Resolving these takes a long time and delays showing the document, because the URLs obviously cannot be found; they don't start with http:// or https://.
I tried injecting a <base> tag into <head> pointing to a known local folder (which is empty), and that stopped the problem. For example:
<base href="C:\myemptypath\">
However, if links begin with \\ (a UNC path), the same problem and long loading time return. For example:
<img src="\\www.something.com\image.jpg">
I also tried placing the WebBrowser control into "offline" mode and every other trick I could think of, but came up with nothing short of replacing all the links in the HTML with RegEx, which would be a terribly slow solution (or parsing the HTML myself, which defeats the purpose of MSHTML).
Is there a way to:
Detect these invalid URLs before the document is loaded? Note: I already navigated the DOM (e.g. the WebBrowser1.Document.body.all collection) to collect all possible links from all tags and modify them, and that works, but it only happens after the document has already loaded, so the long wait before loading gives up still occurs (see the sketch below).
Trigger some event to avoid loading these invalid links and simply replace them with about:blank or an empty "" string, like a hypothetical "OnURLPreview" event in which I could inspect and reject invalid URLs? The only candidate is the OnDownloadBegin event, and that is not it.
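Roughly, the post-load fixup I already have looks like the following simplified sketch (the about:blank replacement value, the walk over document.all instead of body.all, and the missing error handling are just for illustration):
// Simplified sketch of the post-load fixup - it works, but only runs after the
// document (and the slow URL resolution) has already finished loading.
DelphiInterface<IHTMLDocument2> diDoc = WebBrowser1->Document;
DelphiInterface<IHTMLElementCollection> diAll;
if (diDoc && SUCCEEDED(diDoc->get_all(&diAll)) && diAll)
{
    long Len = 0;
    diAll->get_length(&Len);
    for (int i = 0; i < Len; ++i)
    {
        DelphiInterface<IDispatch> diItem;
        if (FAILED(diAll->item(OleVariant(i), OleVariant(i), &diItem)) || !diItem) continue;
        DelphiInterface<IHTMLElement> diElem;
        if (FAILED(diItem->QueryInterface(IID_IHTMLElement, (void**)&diElem)) || !diElem) continue;
        OleVariant vSrc;
        // lFlags = 2 returns the attribute exactly as authored in the source
        if (SUCCEEDED(diElem->getAttribute(OleVariant("src"), 2, vSrc)) && VarIsStr(vSrc))
        {
            WideString wsSrc = vSrc;
            // "//host/...", "/path" and "\\server\share" all stall the loader offline
            if (wsSrc.Length() > 0 && (wsSrc[1] == L'/' || wsSrc[1] == L'\\'))
                diElem->setAttribute(OleVariant("src"), OleVariant("about:blank"), 0);
        }
    }
}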
Examples in any language are welcome. I use Delphi and C++ (C++ Builder), but I only need the principle, i.e. which direction to look in.
After a long time, this is the solution I used:
Create an instance of CLSID_HTMLDocument to parse the HTML:
DelphiInterface<IHTMLDocument2> diDoc;
OleCheck(CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&diDoc)));
Write the HTML into diDoc using Document->write:
// Creates a new one-dimensional array
WideString HTML = "Example HTML here...";
SAFEARRAY *psaStrings = SafeArrayCreateVector(VT_VARIANT,0,1);
if (psaStrings)
{
VARIANT *param;
BSTR bstr = SysAllocString(HTML.c_bstr());
SafeArrayAccessData(psaStrings, (LPVOID*)&param);
param->vt = VT_BSTR;
param->bstrVal = bstr;
SafeArrayUnaccessData(psaStrings);
diDoc->write(psaStrings);
diDoc->close();
// SafeArrayDestroy calls SysFreeString for each BSTR
//SysFreeString(bstr); // SafeArrayDestroy should be enough
SafeArrayDestroy(psaStrings);
return S_OK;
}
Parse the unwanted links in diDoc:
DelphiInterface<IHTMLElementCollection> diCol;
if (SUCCEEDED(diDoc->get_all(&diCol)) && diCol)
{
// Parse IHTMLElementCollection here...
}
Extract the parsed HTML into a WideString and write it into the TWebBrowser:
DelphiInterface<IHTMLElement> diBODY;
OleCheck(diDoc->get_body(&diBODY));
if (diBODY)
{
DelphiInterface<IHTMLElement> diHTML;
OleCheck(diBODY->get_parentElement(&diHTML));
if (diHTML)
{
WideString wsHTML;
OleCheck(diHTML->get_outerHTML(&wsHTML));
// And finally use the `Document->write` like above to write into your final TWebBrowser document here...
}
}
Related
I am loading a local _test.htm file from disk through the IPersistMoniker Load method. As I understand it, that is supposed to use the file's path as the base path for relative URLs. The problem is that it does not do so. Instead, it spends a very long time trying to resolve the path from the Internet until it gives up (about 20-30 seconds). I want it to give up instantly, as soon as the unresolvable path is detected (since it is a local disk file anyway).
This is an example HTML I am loading:
<html>
<head>
<script src="//test/test.js"></script>
</head>
<body>
<img src="image.jpg">
<img src="/image.jpg">
<img src="//image.jpg">
</body>
</html>
Simplified code (C++ Builder) with no error checking:
WideString URL = "file:///" + StringReplace(ExtractFilePath(Application->ExeName), "\\", "/", TReplaceFlags() << rfReplaceAll) + "_test.htm";
TCppWebBrowser* WB = CppWebBrowser1;
DelphiInterface<IMoniker> pMoniker;
OleCheck(CreateURLMonikerEx(NULL, URL.c_bstr(), &pMoniker, URL_MK_UNIFORM));
DelphiInterface<IHTMLDocument2> diDoc2 = WB->Document;
DelphiInterface<IPersistMoniker> pPrstMnkr;
OleCheck(diDoc2->QueryInterface(IID_IPersistMoniker, (LPVOID*)&pPrstMnkr));
DelphiInterface<IBindCtx> pBCtx;
OleCheck(CreateBindCtx(0, &pBCtx));
pPrstMnkr->Load(0, pMoniker, pBCtx, STGM_READWRITE);
The problem: image.jpg loads fine, but the paths //test/test.js, /image.jpg, and //image.jpg take a very long time to resolve/load. As I understand it, CreateURLMonikerEx is supposed to use file:///path/to/executable/ as the base and prepend it automatically to these paths, in which case they would fail instantly (file:///path/to/executable//test/test.js, for example). That does not happen.
I additionally tried moving image.jpg to a subfolder and creating a custom IMoniker implementation whose GetDisplayName and BindToStorage load the image from a custom path. However, it does not do the same for paths which start with // or /, even though I return file:///path/to/executable/ from GetDisplayName through the *ppszDisplayName parameter.
How can I avoid the extended loading time for such unusable links (i.e. discard them), or redirect them to a local path as above?
I found a partial solution: returning about:blank in *ppszDisplayName. But then images with the valid relative path image.jpg no longer load, because they resolve to about:image.jpg, which is again an invalid path.
Additionally, I've tried adding an IDocHostUIHandler interface with an Invoke implementation that answers DISPID_AMBIENT_DLCONTROL with pVarResult->lVal = DLCTL_NO_SCRIPTS | DLCTL_NO_JAVA | DLCTL_NO_RUNACTIVEXCTLS | DLCTL_NO_DLACTIVEXCTLS | DLCTL_NO_FRAMEDOWNLOAD | DLCTL_FORCEOFFLINE;. It blocks the download of images entirely, but it still spends 20-30 seconds checking the links starting with // or /.
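Roughly, the DLCONTROL part of that attempt looks like the sketch below. It is a simplified stand-alone IDispatch ambient host rather than the exact IDocHostUIHandler-based code I used; TAmbientHost is a placeholder name, and wiring it into the WebBrowser's client site is not shown.
#include <windows.h>
#include <mshtmdid.h>   // DISPID_AMBIENT_DLCONTROL and the DLCTL_* flags

// Minimal ambient-property host (sketch). Only Invoke matters here; the other
// IUnknown/IDispatch members are reduced to stubs for brevity.
struct TAmbientHost : IDispatch
{
    STDMETHODIMP QueryInterface(REFIID riid, void **ppv)
    {
        if (riid == IID_IUnknown || riid == IID_IDispatch) { *ppv = this; return S_OK; }
        *ppv = NULL;
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef()  { return 2; } // static lifetime in this sketch
    STDMETHODIMP_(ULONG) Release() { return 1; }
    STDMETHODIMP GetTypeInfoCount(UINT *pctinfo) { *pctinfo = 0; return S_OK; }
    STDMETHODIMP GetTypeInfo(UINT, LCID, ITypeInfo **ppTInfo) { *ppTInfo = NULL; return E_NOTIMPL; }
    STDMETHODIMP GetIDsOfNames(REFIID, LPOLESTR*, UINT, LCID, DISPID*) { return E_NOTIMPL; }

    // Answer the browser's ambient-property query with the DLCTL_* flags
    STDMETHODIMP Invoke(DISPID dispIdMember, REFIID, LCID, WORD, DISPPARAMS*,
                        VARIANT *pVarResult, EXCEPINFO*, UINT*)
    {
        if (dispIdMember == DISPID_AMBIENT_DLCONTROL && pVarResult)
        {
            pVarResult->vt   = VT_I4;
            pVarResult->lVal = DLCTL_NO_SCRIPTS | DLCTL_NO_JAVA | DLCTL_NO_RUNACTIVEXCTLS |
                               DLCTL_NO_DLACTIVEXCTLS | DLCTL_NO_FRAMEDOWNLOAD | DLCTL_FORCEOFFLINE;
            return S_OK;
        }
        return DISP_E_MEMBERNOTFOUND;
    }
};
Even with these flags answered, the 20-30 second check for the // and / links still happens before anything would actually be downloaded.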
Update: this doesn't work well! The code below loses the <BODY> tag attributes; the BODY tag ends up entirely empty after loading. I ended up loading the message using the IHTMLDocument2.write method instead. See: Assigning IHTMLDocument2 instance to a TWebBrowser instance.
After spending a lot of time with no guidance of any kind here, I believe it is not possible to avoid this 20-30 second wait when the links are invalid. I found another solution; if someone wants to supplement it, feel free to do so.
Instead, what I had to do is create an instance of CLSID_HTMLDocument (via the IHTMLDocument3 or IHTMLDocument2 interface), load the document into that container, and parse the links before doing anything else with them. This is described at:
https://learn.microsoft.com/en-us/previous-versions/aa703592(v=vs.85)
This also helped:
How to load html contents from stream and then how to create style sheet to display the html file in preview pane (like HTML preview handler)
After parsing the document URLs and fixing the invalid ones, it can be saved/displayed in the actual TWebBrowser.
Rough solution (C++ Builder):
try
{
DelphiInterface<IHTMLDocument2> diDoc2;
OleCheck(CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void**)&diDoc2));
DelphiInterface<IPersistStreamInit> diPersist;
OleCheck(diDoc2->QueryInterface(IID_IPersistStreamInit, (void**)&diPersist));
OleCheck(diPersist->InitNew());
DelphiInterface<IMarkupServices> diMS;
OleCheck(diDoc2->QueryInterface(IID_IMarkupServices, (void**)&diMS));
DelphiInterface<IMarkupPointer> diMkStart;
DelphiInterface<IMarkupPointer> diMkFinish;
OleCheck(diMS->CreateMarkupPointer(&diMkStart));
OleCheck(diMS->CreateMarkupPointer(&diMkFinish));
// ...Load from file or memory stream into your WideString here...
DelphiInterface<IMarkupContainer> diMC;
OleCheck(diMS->ParseString(WideString(MsgHTMLSrc).c_bstr(), 0, &diMC, diMkStart, diMkFinish));
DelphiInterface<IHTMLDocument2> diDoc;
OleCheck(diMC->QueryInterface(IID_PPV_ARGS(&diDoc)));
DelphiInterface<IHTMLElementCollection> diCol;
OleCheck(diDoc->get_all(&diCol));
long ColLen = 0;
OleCheck(diCol->get_length(&ColLen));
for (int i = 0; i < ColLen; ++i)
{
DelphiInterface<IDispatch> diItem;
diCol->item(OleVariant(i), OleVariant(i), &diItem);
DelphiInterface<IHTMLElement> diElem;
OleCheck(diItem->QueryInterface(IID_IHTMLElement, (void**)&diElem));
WideString wTagName;
OleCheck(diElem->get_tagName(&wTagName));
if (StartsText("img", wTagName))
{
OleVariant vSrc;
OleCheck(diElem->getAttribute(OleVariant("src"), 4, vSrc));
// Make changes to vSrc here....
// And save it back to src
OleCheck(diElem->setAttribute(OleVariant("src"), vSrc, 0));
}
else if (StartsText("script", wTagName))
{
// More parsing here...
}
}
}
catch (EOleSysError& e)
{
// Process exception as needed
}
catch (Exception& e)
{
// Process exception as needed
}
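For illustration, the "Make changes to vSrc here" step might look something like the helper below. The rewrite policy (https for protocol-relative links, about:blank for rooted and UNC paths) is an assumption on my part, so adjust it to whatever the source emails need:
// Sketch of one possible rewrite policy for a src/href value pulled out of the
// parsed document. The choices below are assumptions - adjust as needed.
static WideString FixupSrc(const WideString &wsSrc)
{
    if (wsSrc.Length() >= 2 && wsSrc[1] == L'/' && wsSrc[2] == L'/')
    {
        // "//host/image.jpg" -> "https://host/image.jpg"
        WideString wsFixed = L"https:";
        wsFixed += wsSrc;
        return wsFixed;
    }
    if (wsSrc.Length() >= 2 && wsSrc[1] == L'\\' && wsSrc[2] == L'\\')
        return WideString(L"about:blank"); // "\\server\share\image.jpg" - discard
    if (wsSrc.Length() >= 1 && wsSrc[1] == L'/')
        return WideString(L"about:blank"); // "/image.jpg" - no usable base, discard
    return wsSrc; // relative and absolute http(s) URLs are left untouched
}
Inside the img branch, convert vSrc to a WideString, pass it through FixupSrc, and write the result back with setAttribute.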
After fully parsing all required elements (img/src, script/src, base/href, etc.), save the result and load it into the TWebBrowser.
I only now have to see whether the parsed IHTMLDocument2 can be assigned directly to the TWebBrowser without loading it again, but that is another question (see: Assigning IHTMLDocument2 instance to a TWebBrowser instance).
I want to display a PDF file in a region (a PL/SQL dynamic content region). I tried to do that by calling an application process using the code below, but the same file always opens.
DECLARE
V_URL VARCHAR2(2500);
BEGIN
V_URL :='f?p=&APP_ID.:1:&APP_SESSION.:APPLICATION_PROCESS=display_emp_blob:::FILE_ID:' ||:P6_ID;
Sys.htp.p('<p align="center">');
sys.htp.p('<iframe src="'||V_URL||'" width="99%" height="1000">');
sys.htp.p('</iframe>');
sys.htp.p('</p>');
END;
And the application process code is below:
CREATE OR REPLACE PROCEDURE OPEN_FILE (P_ID NUMBER)
IS
vBlob blob;
vmimetype varchar2(50);
BEGIN
SELECT ORG_FILES.FILE_CONTENT,MIME_TYPE INTO vBlob, vmimetype
FROM ORG_FILES
WHERE ID =P_ID ;
sys.HTP.init;
owa_util.mime_header(vmimetype,false);
htp.p('Content-Length: ' || dbms_lob.getlength(vBlob));
owa_util.http_header_close;
wpg_docload.download_file(vBlob);
apex_application.stop_apex_engine;
exception
when no_data_found then
null;
END;
How can I open a different PDF file in the region based on the value of the item (P6_ID)?
I think the problem you have is that the browser caches the file.
You can control how long the browser caches the file with the "Cache-Control" header. Below is the code I use (I have this code in the application process, not in the database):
sys.htp.init;
sys.owa_util.mime_header( 'application/pdf', FALSE );
sys.htp.p('Content-length: ' || sys.dbms_lob.getlength( v_blob));
sys.htp.p('Content-Disposition: inline; filename="'|| v_filename || '"' ); -- "attachment" for download, "inline" for display
sys.htp.p('Cache-Control: max-age=3600'); -- in seconds. Tell the browser to cache for one hour, adjust as necessary
sys.owa_util.http_header_close;
sys.wpg_docload.download_file( v_blob );
apex_application.stop_apex_engine;
You can also try lazy loading, which is how I access my files (the way you access your file may also be part of the problem). This way the page loads without making the user wait, and then the file is loaded and shown. I don't use the iframe tag but the embed tag. The way to do it is as follows:
Create a static content region with this HTML:
<div id="view_pdf"></div>
Create a dynamic action on page load that executes JavaScript, and add the following code:
$('#view_pdf').html('');
var url = 'f?p=&APP_ID.:1:&APP_SESSION.:APPLICATION_PROCESS=display_emp_blob:::FILE_ID:' + apex.item('P6_ID').getValue();
var preview = document.createElement('embed');
preview.type = "application/pdf";
preview.width="100%";
preview.height="625px";
preview.src = url;
$("#view_pdf").append(preview);
You can modify the values depending on what you need. The embed tag uses the browser's default PDF viewer.
Also, if you want to change the PDF without reloading the page, use the previous JavaScript in a dynamic action fired when the value of the item changes.
I hope you find it useful.
My apologies, on the rare occasion I use this region type, I always think it can be refreshed.
https://spendolini.blogspot.com/2015/11/refreshing-plsql-regions-in-apex.html
The solution is to create a classic report that calls a PL/SQL function that returns your HTML.
SELECT package_name.function_name(p_item => :P1_ITEM) result FROM dual
I'm using the standard urllib2 library to open a webpage.
But when I print out the raw HTML data, I'm not able to see what I get when I use "inspect element" mode in the Chrome browser.
Has the HTML code been encoded by the webpage?
But if that is the case, why can I still see the actual HTML code when I use inspect element in the browser?
Here is the code:
import urllib2
url = 'http://www.bursamalaysia.com/market/listed-companies/list-of-companies/plc-profile.html?stock_code=0140'
page = urllib2.urlopen(url).read()
print (page)
Allow me to elaborate more:
1) From the above webpage, I right-click the highlighted text and view it using inspect element. Then I can see HTML code like this:
<th scope="row">Buy</th>
But when I use urllib2 to extract the same webpage and search for the same keyword, I can't find any of the related things that I saw in inspect element.
All I get is something like this:
function test(){var table = "00000000 77073096 EE0E612C 990951BA 076DC419 706AF48F E963A535 9E6495A3 0EDB8832 79DCB8A4 E0D5E91E 97D2D988 09B64C2B 7EB17CBD E7B82D07 90BF1D91 1DB71064 6AB020F2 F3B97148 84BE41DE 1ADAD47D 6DDDE4EB F4D4B551 83D385C7 136C9856 646BA8C0 FD62F97A 8A65C9EC 14015C4F 63066CD9 FA0F3D63 8D080DF5 3B6E20C8 4C69105E D56041E4 A2677172 3C03E4D1 4B04D447 D20D85FD A50AB56B 35B5A8FA 42B2986C DBBBC9D6 ACBCF940 32D86CE3 45DF5C75 DCD60DCF ABD13D59 26D930AC 51DE003A C8D75180 BFD06116 21B4F4B5 56B3C423 CFBA9599 B8BDA50F 2802B89E 5F058808 C60CD9B2 B10BE924 2F6F7C87 58684C11 C1611DAB B6662D3D 76DC4190 01DB7106 98D220BC EFD5102A 71B18589 06B6B51F 9FBFE4A5 E8B8D433 7807C9A2 0F00F934 9609A88E E10E9818 7F6A0DBB 086D3D2D 91646C97 E6635C01 6B6B51F4 1C6C6162 856530D8 F262004E 6C0695ED 1B01A57B 8208F4C1 F50FC457 65B0D9C6 12B7E950 8BBEB8EA FCB9887C 62DD1DDF 15DA2D49 8CD37CF3 FBD44C65 4DB26158 3AB551CE A3BC0074 D4BB30E2 4ADFA541 3DD895D7 A4D1C46D D3D6F4FB 4369E96A 346ED9FC AD678846 DA60B8D0 44042D73 33031DE5 AA0A4C5F DD0D7CC9 5005713C 270241AA BE0B1010 C90C2086 5768B525 206F85B3 B966D409 CE61E49F 5EDEF90E 29D9C998 B0D09822 C7D7A8B4 59B33D17 2EB40D81 B7BD5C3B C0BA6CAD EDB88320 9ABFB3B6 03B6E20C 74B1D29A EAD54739 9DD277AF 04DB2615 73DC1683 E3630B12 94643B84 0D6D6A3E 7A6A5AA8 E40ECF0B 9309FF9D 0A00AE27 7D079EB1 F00F9344 8708A3D2 1E01F268 6906C2FE F762575D 806567CB 196C3671 6E6B06E7 FED41B76 89D32BE0 10DA7A5A 67DD4ACC F9B9DF6F 8EBEEFF9 17B7BE43 60B08ED5 D6D6A3E8 A1D1937E 38D8C2C4 4FDFF252 D1BB67F1 A6BC5767 3FB506DD 48B2364B D80D2BDA AF0A1B4C 36034AF6 41047A60 DF60EFC3 A867DF55 316E8EEF 4669BE79 CB61B38C BC66831A 256FD2A0 5268E236 CC0C7795 BB0B4703 220216B9 5505262F C5BA3BBE B2BD0B28 2BB45A92 5CB36A04 C2D7FFA7 B5D0CF31 2CD99E8B 5BDEAE1D 9B64C2B0 EC63F226 756AA39C 026D930A 9C0906A9 EB0E363F 72076785 05005713 95BF4A82 E2B87A14 7BB12BAE 0CB61B38 92D28E9B E5D5BE0D 7CDCEFB7 0BDBDF21 86D3D2D4 F1D4E242 68DDB3F8 1FDA836E 81BE16CD F6B9265B 6FB077E1 18B74777 88085AE6 FF0F6A70 66063BCA 11010B5C 8F659EFF F862AE69 616BFFD3 166CCF45 A00AE278 D70DD2EE 4E048354 3903B3C2 A7672661 D06016F7 4969474D 3E6E77DB AED16A4A D9D65ADC 40DF0B66 37D83BF0 A9BCAE53 DEBB9EC5 47B2CF7F 30B5FFE9 BDBDF21C CABAC28A 53B39330 24B4A3A6 BAD03605 CDD70693 54DE5729 23D967BF B3667A2E C4614AB8 5D681B02 2A6F2B94 B40BBE37 C30C8EA1 5A05DF1B 2D02EF8D";
OK, let's assume all the data is encoded. But why can I still see it in inspect element? Any idea, or a method to decode it?
Thanks
I am using docx4j 2.8.1 with content controls in my .docx file. I can replace the CustomXML part by injecting my own XML and then calling BindingHandler.applyBindings after supplying the input XML. I can add a token in my XML, such as ¶, and then I would like to replace that token in the MainDocumentPart. But using that approach, when I iterate through the content in the MainDocumentPart with this (link) method, none of the text from my XML is even in the collection extracted from the MainDocumentPart. I am thinking that even after binding the XML, it remains separate from the MainDocumentPart (??)
I haven't tried this with anything more than a little test doc yet. My token is the pilcrow: ¶. Since it's a single character, it won't be split across separate runs. My code is:
private void injectXml (WordprocessingMLPackage wordMLPackage) throws JAXBException {
MainDocumentPart part = wordMLPackage.getMainDocumentPart();
String xml = XmlUtils.marshaltoString(part.getJaxbElement(), true);
xml = xml.replaceAll("¶", "</w:t><w:br/><w:t>");
Object obj = XmlUtils.unmarshalString(xml);
part.setJaxbElement((Document) obj);
}
The pilcrow character comes from the XML and is injected by applying the XML bindings to the content controls. The problem is that the content from the XML does not seem to be in the MainDocumentPart so the replace doesn't work.
(Using docx4j 2.8.1)
I'm using Selenium RC and I would like, for example, to get all the link elements with an href attribute that matches:
http://[^/]*\d+.com
I would like to use:
sel.get_attribute( '//a[regx:match(@href, "http://[^/]*\d+.com")]/@name' )
which would return a list of the name attribute of all the links that match the regex.
(or something like it)
thanks
The answer above is probably the right way to find ALL of the links that match a regex, but I thought it would also be helpful to answer the other part of the question: how to use regex in XPath locators. You need to use the XPath matches() function, like this:
xpath=//div[matches(@id,'che.*boxes')]
(This, of course, would match the div with 'id=checkboxes', or 'id=cheANYTHINGHEREboxes'.)
Be aware, though, that the matches function is not supported by all native browser implementations of XPath (most conspicuously, using this in FF3 will throw an error: invalid xpath[2]).
If you have trouble with your particular browser (as I did with FF3), try using Selenium's allowNativeXpath("false") to switch over to the JavaScript XPath interpreter. It will be slower, but it does seem to work with more XPath functions, including 'matches' and 'ends-with'. :)
You can use the Selenium command getAllLinks to get an array of the IDs of links on the page, which you could then loop through and check the href using getAttribute, which takes the locator followed by an @ and the attribute name. For example, in Java this might be:
String[] allLinks = session().getAllLinks();
List<String> matchingLinks = new ArrayList<String>();
for (String linkId : allLinks) {
String linkHref = selenium.getAttribute("id=" + linkId + "@href");
if (linkHref.matches("http://[^/]*\\d+.com")) {
matchingLinks.add(linkId);
}
}
A possible solution is to use sel.get_eval() and write a JavaScript snippet that returns a list of the links, something like the following answer:
selenium: Is it possible to use the regexp in selenium locators
Here are some alternate methods for Selenium RC as well. These aren't pure Selenium solutions; they combine your programming language's data structures with Selenium.
You can also get the HTML page source, then run regular expressions against the source to return a matching set of links. Use regex grouping to separate out URLs, link text/IDs, etc., and you can then pass them back to Selenium to click on or navigate to.
Another method is to get the HTML page source or the innerHTML (via DOM locators) of a parent/root element, then convert the HTML to an XML DOM object in your programming language. You can then traverse the DOM with the desired XPath (with regular expressions or not) and obtain a node set of only the links of interest. From there, parse out the link text/ID or URL, which you can pass back to Selenium to click on or navigate to.
Upon request, I'm providing examples below. They are in mixed languages since the post didn't appear to be language-specific anyway; I'm just using what I had available to hack together examples. They aren't fully tested, or tested at all, but I've worked with bits of the code before in other projects, so these are proof-of-concept examples of how you'd implement the solutions I just mentioned.
//Example of element attribute processing by page source and regex (in PHP)
$pgSrc = $sel->getPageSource();
//simple hyperlink extraction via regex below, replace with better regex pattern as desired
preg_match_all("/<a.+href=\"(.+)\"/",$pgSrc,$matches,PREG_PATTERN_ORDER);
//$matches is a 2D array, $matches[0] is array of whole string matched, $matches[1] is array of what's in parenthesis
//you either get an array of all matched link URL values in parenthesis capture group or an empty array
$links = count($matches) >= 2 ? $matches[1] : array();
//now do as you wish, iterating over all link URLs
//NOTE: these are URLs only, not actual hyperlink elements
//Example of XML DOM parsing with Selenium RC (in Java)
String locator = "id=someElement";
String htmlSrcSubset = sel.getEval("this.browserbot.findElement(\""+locator+"\").innerHTML");
//using JSoup XML parser library for Java, see jsoup.org
Document doc = Jsoup.parse(htmlSrcSubset);
/* once you have this document object, can then manipulate & traverse
it as an XML/HTML node tree. I'm not going to go into details on this
as you'd need to know XML DOM traversal and XPath (not just for finding locators).
But this tutorial URL will give you some ideas:
http://jsoup.org/cookbook/extracting-data/dom-navigation
the example there seems to indicate first getting the element/node defined
by content tag within the "document" or source, then from there get all
hyperlink elements/nodes and then traverse that as a list/array, doing
whatever you want with an object oriented approach for each element in
the array. Each element is an XML node with properties. If you study it,
you'd find this approach gives you the power/access that WebDriver/Selenium 2
now gives you with WebElements but the example here is what you can do in
Selenium RC to get similar WebElement kind of capability
*/
Selenium's By.Id and By.CssSelector methods do not support regex, and By.XPath only does where XPath 2.0 is enabled. If you want to use regex, you can do something like this:
void MyCallingMethod(IWebDriver driver)
{
//Search by ID:
string attrName = "id";
//Regex = 'a number that is 1-10 digits long'
string attrRegex= "[0-9]{1,10}";
SearchByAttribute(driver, attrName, attrRegex);
}
IEnumerable<IWebElement> SearchByAttribute(IWebDriver driver, string attrName, string attrRegex)
{
List<IWebElement> elements = new List<IWebElement>();
//Allows spaces around equal sign. Ex: id = 55
string searchString = attrName +"\\s*=\\s*\"" + attrRegex +"\"";
//Search page source
MatchCollection matches = Regex.Matches(driver.PageSource, searchString, RegexOptions.IgnoreCase);
//iterate over matches
foreach (Match match in matches)
{
//Get exact attribute value
Match innerMatch = Regex.Match(match.Value, attrRegex);
string cssSelector = "[" + attrName + "='" + innerMatch.Value + "']";
//Find element by exact attribute value
elements.Add(driver.FindElement(By.CssSelector(cssSelector)));
}
return elements;
}
Note: this code is untested. Also, you can optimize this method by figuring out a way to eliminate the second search.