Obtaining visible text on a page from an IHTMLDocument2* - c++

I am trying to obtain the text content of an Internet Explorer web browser window.
I am following these steps:
1. Obtain a pointer to IHTMLDocument2.
2. From the IHTMLDocument2, obtain the body as an IHTMLElement.
3. On the body, call get_innerText.
Edit
I obtain all the children of the body and make a recursive call on each IHTMLElement.
If I get an element that is not visible, or an element whose tag is script, I ignore that element and all its children.
My problem is that, along with the text that is visible on the page, I also get content from elements that have style="display: none".
For google.com, I also get JavaScript along with the text.
I have tried a recursive approach, but I am clueless as to how to deal with scenarios like this:
<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>
In this scenario I won't be able to get "Hello World 1".
Can anyone please help me with the best way to obtain the visible text from an IHTMLDocument2*?
I am using C++ Win32, no MFC, ATL.
Thanks,
Ashish.

If you iterate backwards over the document.body.all elements, you always visit the elements inside out, so you don't need to recurse yourself; the DOM does that for you. For example (the code is in Delphi):
procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  for i := document.body.all.length - 1 downto 0 do // iterate backwards
  begin
    el := document.body.all.item(i);
    // filter the elements
    if (el.style.display = 'none') then
    begin
      el.removeNode(true);
    end;
  end;
  ShowMessage(document.body.innerText);
end;
A Side Comment:
As for your scenario with the recursive approach:
<div>Hello World 1<div style="display: none">Hello world 2</div></div>
If, for example, our element is the first DIV, el.getAdjacentText('afterBegin') returns "Hello World 1". So we could probably iterate forward over the elements and collect getAdjacentText('afterBegin') for each one, but this is a bit more difficult because we would need to test the parents of each element for el.currentStyle.display.
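The forward walk described in the side comment can be sketched outside MSHTML. The following is a minimal Python illustration of the logic only, not of the IHTMLDocument2 API: it uses the standard-library html.parser, and the names VisibleTextParser and visible_text are made up for this sketch. It assumes hiding is expressed through an inline style="display: none" attribute; a real page would need the computed style (currentStyle in MSHTML) instead.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text content is never visible page text.
SKIPPED_TAGS = {"script", "style"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a hidden or skipped element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # One flag per open element: True if the element or any ancestor is hidden.
        self.hidden_stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Assumption: only inline styles are inspected, not stylesheets.
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hidden_here = tag in SKIPPED_TAGS or "display:none" in style
        parent_hidden = self.hidden_stack[-1] if self.hidden_stack else False
        self.hidden_stack.append(parent_hidden or hidden_here)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        hidden = self.hidden_stack[-1] if self.hidden_stack else False
        text = data.strip()
        if text and not hidden:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = '<div>\nHello World 1\n<div style="display: none">Hello world 2</div>\n</div>'
print(visible_text(sample))
```

Because the hidden flag is inherited from the parent on every start tag, text that belongs directly to a visible element is kept even when one of its children is hidden, which is the case the question struggles with.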

Related

Finding xpath siblings after declaring a variable with find_elements

I start by finding all menu items on a site with .find_elements_by_xpath. This works fine.
(The buttons are either text or an image.)
Then I want to loop through each of these elements and return either the text between the tags or, when there is no text, the src of the image in the span that precedes the text span.
Returning the text works fine, but I am unable to return the src. I am having trouble building an XPath that is rooted at the element of the current loop iteration. I either get an 'unable to locate' error or get the first menu image over and over again.
Here is the code I currently have running (note that I am unable to give out the URL of the site):
browser = webdriver.Chrome(...)
menu = browser.find_elements_by_xpath('//td[@onmouseover]')
for menu_part in menu:
    try:
        if len(menu_part.text) < 2:
            menu_button = menu_part.find_element_by_xpath(
                '/span[@class="ThemeOfficeMainFolderText"]/preceding-sibling::span/img').get_attribute('src')
        else:
            menu_button = menu_part.text
        print menu_button
    except Exception as e:
        print e
        pass
I am unsure whether the syntax is completely correct, and whether I can use the currently iterated element as the 'root' of my find_element call (menu_part.find_element_by_xpath).
Also, there is no way to further specify the tags by attributes, because all menu items have identical attributes.
Lastly, the following code returns the first image in the menu:
menu_button = browser.find_element_by_xpath(
    '//span[@class="ThemeOfficeMainFolderText"]/preceding-sibling::span/img').get_attribute('src')
Therefore, I am relatively confident that the part of the XPath from "span[@class..." onward works fine, and that the issue is in what precedes it.
I am hopeful that there is a simple solution and that I just made a mistake while writing the XPath, but I am completely out of ideas at the moment.
EDIT:
here is the basic html structure I am dealing with
<td class="ThemeOfficeMainItem" onmouseover="ItemMouseOverOpenSub ()">
<span class="ThemeOfficeMainFolderLeft">
<img src="img1.png"></span>
<span class="ThemeOfficeMainFolderText">TEXT</span>
<span class="ThemeOfficeMainFolderRight"> </span>
</td>
<td class="ThemeOfficeMainItem" onmouseover="ItemMouseOverOpenSub ()">
<span class="ThemeOfficeMainFolderLeft">
<img src="img2.png"></span>
<span class="ThemeOfficeMainFolderText"></span>
<span class="ThemeOfficeMainFolderRight"> </span>
</td>
If you want to search for the span starting from the previously defined parent element menu_part, then you should use
./span[@class="ThemeOfficeMainFolderText"]/preceding-sibling::span/img
Note the dot at the beginning of the XPath, which points to the current (menu_part) element.
Update
As for the logic of your code, try below:
from selenium import webdriver

browser = webdriver.Chrome()
browser.get(URL)  # URL is the address of the page (not given in the question)
menu = browser.find_elements_by_xpath('//td[@onmouseover]')
for menu_part in menu:
    text_span = menu_part.find_element_by_xpath('./span[@class="ThemeOfficeMainFolderText"]')
    if not text_span.text:
        # no text in this item: use the src of the image in the preceding span
        menu_button = menu_part.find_element_by_xpath('./span[@class="ThemeOfficeMainFolderText"]/preceding-sibling::span/img').get_attribute('src')
    else:
        menu_button = text_span.text
    print menu_button
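The effect of rooting the search at the current element can be tried without a browser. The sketch below is an illustration under stated assumptions, not the Selenium code from this answer: it uses the standard-library xml.etree.ElementTree, the markup from the question has been wrapped in a tr and made well-formed XML (self-closed img tags), and the function name menu_labels is hypothetical. ElementTree does not support the preceding-sibling axis, so the image span is selected by its ThemeOfficeMainFolderLeft class instead.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <tr> and made well-formed for an XML parser.
ROW = """<tr>
<td class="ThemeOfficeMainItem" onmouseover="ItemMouseOverOpenSub ()">
<span class="ThemeOfficeMainFolderLeft"><img src="img1.png"/></span>
<span class="ThemeOfficeMainFolderText">TEXT</span>
<span class="ThemeOfficeMainFolderRight"> </span>
</td>
<td class="ThemeOfficeMainItem" onmouseover="ItemMouseOverOpenSub ()">
<span class="ThemeOfficeMainFolderLeft"><img src="img2.png"/></span>
<span class="ThemeOfficeMainFolderText"></span>
<span class="ThemeOfficeMainFolderRight"> </span>
</td>
</tr>"""


def menu_labels(row_xml):
    """Return the text of each menu item, or its image src when it has no text."""
    row = ET.fromstring(row_xml)
    labels = []
    for td in row.findall("./td"):
        # The leading "./" roots each search at the current <td>, not at the document.
        text_span = td.find('./span[@class="ThemeOfficeMainFolderText"]')
        text = (text_span.text or "").strip() if text_span is not None else ""
        if text:
            labels.append(text)
        else:
            img = td.find('./span[@class="ThemeOfficeMainFolderLeft"]/img')
            labels.append(img.get("src") if img is not None else None)
    return labels


print(menu_labels(ROW))  # ['TEXT', 'img2.png']
```

Each td yields its own result because every path starts at that td; a path starting with // would search the whole document and return the first image every time, which matches the behaviour described in the question.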

JasperReports list + each record on new page

I have the following report in JasperReports (jrxml); this is the Detail section.
It shows a list that I get from Java containing 2 objects. Both objects are output on the first page, so the test field is printed once for each of them on the same page.
<detail>
    <band height="200" splitType="Stretch">
        <componentElement>
            <reportElement key="table" style="table" x="0" y="49" width="500" height="140"/>
            <jr:list xmlns:jr="http://jasperreports.sourceforge.net/jasperreports/components" xsi:schemaLocation="http://jasperreports.sourceforge.net/jasperreports/components http://jasperreports.sourceforge.net/xsd/components.xsd">
                <datasetRun subDataset="Data Set">
                    <datasetParameter name="REPORT_DATA_SOURCE">
                        <datasetParameterExpression><![CDATA[$P{REPORT_DATA_SOURCE}]]></datasetParameterExpression>
                    </datasetParameter>
                </datasetRun>
                <jr:listContents height="50" >
                    <textField isBlankWhenNull="true">
                        <reportElement x="0" y="0" width="200" height="20" isRemoveLineWhenBlank="true"/>
                        <textElement textAlignment="Left" verticalAlignment="Middle" >
                            <font size="10" fontName="DejaVu Serif" isBold='true'/>
                        </textElement>
                        <textFieldExpression><![CDATA[$F{test}]]></textFieldExpression>
                    </textField>
                </jr:listContents>
            </jr:list>
        </componentElement>
    </band>
</detail>
So I have a list of 2 objects (beans). Everything works, but the test field is shown twice on each page (both objects are printed on the first page) instead of one object per page. I would like to put a break after the first test is printed, so that the next test in the list is printed on the next page.
Can anyone point me in the right direction?
I have the answer and I will post it here in case someone else has the same problem.
JasperReports is quite buggy. First, when you print a collection of elements (objects), you send the collection to the Jasper engine, which for some unknown reason doesn't recognize the first element in the collection. You solve this by adding one dummy object at index 0 of your collection.
After searching, I found that the Jasper API has the method getPages(). It returns the pages that will be printed, as a list; each entry in the list is one page. You can call it on the JasperPrint you get when you fill the report.
Here ist is the input stream of your jrxml, parameters is a HashMap, and yourList stands for your own collection of beans, which is wrapped in a JRBeanCollectionDataSource:
JasperReport jasperReport = null;
JasperPrint jasperPrint = null;
// yourList is a placeholder for your collection of beans
JRBeanCollectionDataSource beanColDataSource = new JRBeanCollectionDataSource(yourList);
jasperReport = JasperCompileManager.compileReport(ist);
jasperPrint = JasperFillManager.fillReport(jasperReport, parameters, beanColDataSource);
List<JRPrintPage> pages = jasperPrint.getPages();
Once you have this jasperPrint, you can call this function I wrote:
/**
 * Removes the last page if there are more pages than objects to print,
 * i.e. if for some reason the last page of the PDF comes out empty. We
 * compare against arraySize - 1 because one dummy (empty) object was put
 * into the collection at index 0, since for some unknown reason Jasper
 * always outputs collections from index 1 onward instead of 0.
 *
 * @author Uros
 * @param pages     the pages of the filled report
 * @param arraySize the size of the collection, including the dummy object
 */
private void removeBlankPage(List<JRPrintPage> pages, int arraySize)
{
    int numOfPages = pages.size();
    if (numOfPages > arraySize - 1)
    {
        pages.remove(numOfPages - 1);
    }
}
In this case the function removes only the last page; you can modify it so that it removes any other empty page if there is one. The comparison uses arraySize - 1 because you have put 1 dummy object into the collection.
Since each object is printed on its own page, if there are more pages than objects, there must be one empty page, so you remove it. You need to make sure that the data won't stretch over to the next page, of course, but I print pages that always look the same.
Hope it helps..
Yes, put a page break element (from the palette) after your first object, so that the two objects are printed on different pages.
I design my reports in iReport, which allows me to drag and drop a page break into the detail band, but maybe this can help you. The resulting XML looks like:
<break>
<reportElement uuid="*uuid here*" x="0" y="28" width="100" height="1"/>
</break>
</band>
</detail>
I inserted mine at the bottom of the detail band, so you should try inserting your <break> between your two objects.
Use a SubReport. Feed your list as a JRBeanCollectionDataSource to a SubReport. Place that subreport in the detail band along with a page break just below the subreport.
<break>
<reportElement uuid="*uuid here*" x="0" y="28" width="100" height="1"/>
</break>

Print button in Windows 8 Store app HTML + JS for Split Template app

I have an app made with WinJS for the Windows 8 Store, and I use the Split Page template. On the split page I have two columns: the list column and the item detail column. The app is for recipes: the list shows pictures and short details, and the item detail column on the right shows the recipe with details and pictures. I want a Print button that prints only the right column with the details, not the entire window with the list and so on.
Can someone give me an example better than the one from msdn?
As taken from codeshow.codeplex.com to print a fragment:
function printFrag(printEvent) {
    var printTask = printEvent.request.createPrintTask("codeSHOW Print Frag", function (args) {
        var frag = document.createDocumentFragment();
        frag.appendChild(q(".print #printFromApp").cloneNode(true));
        args.setSource(MSApp.getHtmlPrintDocumentSource(frag));
        // Register the handler for print task completion event
        printTask.oncompleted = printTaskCompleted;
    });
}
This in turn clones the following HTML node and prints it:
<div id="printFromApp">
<h2>Invoke print</h2>
<p>Wow, printing is fun.</p>
<button id="invokePrint">Print</button>
</div>
All the print samples can be mostly cut and pasted from there, check it out. It's also available in the Windows Store and is an essential HTML/JS dev tool for Windows Store apps.

How can I get the raphael element of a DOM object that is created using raphael?

With Raphael I can get a reference to the DOM object of an element using the following code:
element.node
How can I get the element that is linked to the DOM object? In other words the inverse of the function above (e.g. DOMobject.element).
A node created with RaphaelJS has a raphaelid property (or something very close to it), which you can inspect by logging the node in Chrome DevTools or a similar tool.
Once you know this ID and have a reference to the Raphael Paper instance (here, the paper variable), you can get the element with:
paper.getById(node.raphaelid)
Note that the raphaelid property itself is undocumented; only the getById method is documented in the RaphaelJS documentation (Paper.getById section).
Update for the comment about not being able to get raphaelid on the DOM element
Please have a look at this jsfiddle about getting raphaelid.
HTML
<div id="c"></div>
<div><code>rect.node.raphaelid</code> : <span id="i"></span></div>
<div><code>rect2.node.raphaelid</code> : <span id="i2"></span></div>
JS
var paper = Raphael(c,400,400);
var rect = paper.rect(100,100,200,200);
var rect2 = paper.rect(150,150,200,200);
i.textContent = rect.node.raphaelid;
i2.textContent = rect2.node.raphaelid;
Text result
rect.node.raphaelid : 0
rect2.node.raphaelid : 1
All this with version 2.1.0 of RaphaelJS

How Do I Replace New Elements Added To A Page With jQuery

Here is the scenario: I have a checkbox next to each field, which I replace on page load, using jQuery, with "Delete" text that lets me delete the field via jQuery. That part works fine. Like so:
$(".profile-status-box").each(function(){ $(this).replaceWith('<span class="delete">' + 'Delete' + '</span>') });
The problem is that after page load I also give the user the option to dynamically add new fields. The newly added fields still have the checkbox and not the delete link, because they are added after the initial page load and so are not replaced by jQuery.
Is it possible to replace the content of new elements added to the page without doing a page refresh? If not, I can always have two templates with different markup, one for JS and one for non-JS, but I was trying to avoid that.
Thanks in advance.
You can use the .livequery() plugin, like this:
$(".profile-status-box").livequery(function(){
    $(this).replaceWith('<span class="delete">Delete</span>');
});
The anonymous function is run against every element found, and against each new element matching the selector as it is added.
Have a look at this kool demo. It removes and adds elements like a charm.
http://www.dustindiaz.com/basement/addRemoveChild.html
Here's how:
First of all, the (x)html is real simple.
xHTML Snippet
<input type="hidden" value="0" id="theValue" />
<p><a href="javascript:;" onclick="addElement();">Add Some Elements</a></p>
<div id="myDiv"> </div>
The hidden input element simply gives you a chance to dynamically set the number you start with. This could, for instance, be set with PHP or ASP. The onclick event handler is used to call the function. Lastly, the div element is set and ready to receive some children appended unto itself (gosh, that sounds weird).
Mkay, so far so easy. Now the JS functions.
addElement JavaScript Function
function addElement() {
    var ni = document.getElementById('myDiv');
    var numi = document.getElementById('theValue');
    // parse the stored counter as a number and increment it
    var num = parseInt(numi.value, 10) + 1;
    numi.value = num;
    var newdiv = document.createElement('div');
    var divIdName = 'my' + num + 'Div';
    newdiv.setAttribute('id', divIdName);
    // pass the id to removeElement as a quoted string
    newdiv.innerHTML = 'Element Number ' + num + ' has been added! <a href=\'#\' onclick=\'removeElement("' + divIdName + '")\'>Remove the div "' + divIdName + '"</a>';
    ni.appendChild(newdiv);
}
And if you want to,
removeElement JavaScript Function
function removeElement(divNum) {
    var d = document.getElementById('myDiv');
    var olddiv = document.getElementById(divNum);
    d.removeChild(olddiv);
}
And that's that. Bob's your uncle.
This is taken from this article/tutorial: http://www.dustindiaz.com/add-and-remove-html-elements-dynamically-with-javascript/
I've just learnt this myself. thank you for the question
Hope that helps.
PK