Retrieving Tags from web pages - c++

I am working on a Browser Helper Object (BHO) plugin for Internet Explorer using C++. I intend to use the IHTMLDocument2 interface to fetch all tags from a web page, identify phone numbers, and add a hyperlink next to each one.
Could someone provide basic sample code for retrieving all tags from the web page using IHTMLDocument2, or point me to a web page that explains it?
UPDATE
An alternative approach would be to have the BHO inject JavaScript into the page; the script could then add the hyperlinks beside the phone numbers. Is this a better approach?
Thanks
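For the "identify phone numbers and add a hyperlink" step, here is a minimal sketch in portable C++ that works on plain text, however that text is obtained (for example from IHTMLElement::get_innerText on elements enumerated through IHTMLDocument2::get_all). The function names findPhoneNumbers and makeTelLink and the regular expression are assumptions made for illustration; the pattern is deliberately loose and would need tuning for real number formats.

```cpp
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of `text` that looks like a
// phone number under the loose pattern below.
std::vector<std::string> findPhoneNumbers(const std::string& text) {
    // Assumed pattern: optional '+', optional '(', a digit, then at least six
    // digits/spaces/parentheses/dots/hyphens, ending with a digit.
    static const std::regex phone(R"(\+?\(?[0-9][0-9 ().-]{6,}[0-9])");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), phone), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}

// Hypothetical helper: builds the HTML for a tel: hyperlink. Only digits and
// a leading '+' are kept in the href; the visible text is left unchanged.
std::string makeTelLink(const std::string& number) {
    std::string dialable;
    for (char c : number) {
        const bool isDigit = std::isdigit(static_cast<unsigned char>(c)) != 0;
        if (isDigit || (c == '+' && dialable.empty())) {
            dialable += c;
        }
    }
    return "<a href=\"tel:" + dialable + "\">" + number + "</a>";
}
```

In a BHO, the resulting markup could then be inserted next to the matched text, for instance with IHTMLElement::insertAdjacentHTML; that wiring is Windows-specific and is not shown here.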

Related

How to launch IE from C++ code and be able to get html data after web page is changed

Here is my problem:
I am about to implement a C++ method that takes a URL as its parameter. The method launches the default Windows browser and visits the URL. The URL leads to a page where the user of this program has to fill in some information and submit it; the browser then jumps to a result page, and my method needs to read and analyze that page's data.
I know how to launch a browser such as IE, but:
How do I read the page data into my program?
How does my program know that the page in the browser has been updated?
Maybe I should just write a web browser inside my program?
It looks like you want to automate the IE browser: launch IE and fetch data from the web page.
With C++ alone you can open IE through the shell and load the URL, but launching it that way will not let you fetch the page data into your application.
I suggest checking the documentation for Selenium WebDriver.
I checked and found that there is currently no Selenium framework available for C++.
If you are able to use C# or JavaScript, that can help solve your issue.
Reference:
Programming Languages & Frameworks
If you are able to use VBA, you can also refer to the links below for IE automation using VBA.
(1) Automate Internet Explorer (IE) Using VBA
(2) IE (Internet Explorer) Automation using Excel VBA
(3) VBA Internet Explorer Object

Opening a YouTube search via a link in an embedded IWebBrowser2 control fails

I have a simple IWebBrowser2 browser in my application, like the one in this sample.
We use this browser control in our application to research address information. The user can click a button to run a selective search for given keywords in the address, and the result is shown in this browser control.
For example we execute a YouTube search for
https://www.youtube.com/results?search_query=test+video
I can copy the link into a browser (Chrome, IE, Edge) and the search is executed.
But from within the embedded Control the search shows the following text:
Google Sorry...
We're sorry...
... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.
See Google Help for more information.
The help links are not useful, and the problem arises only for searches on YouTube from within the IWebBrowser2 control. No CAPTCHA is shown. We use AV and firewall software... so something at YouTube/Google doesn't like browsing from an IWebBrowser2.
Hint: If you want to use the sample code from CodeProject, you should set ES_AUTOHSCROLL for the URL edit control. Otherwise you will not be able to enter a long search URL.
Set the User-Agent field of the HTTP request header to emulate a known browser.
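One way to do this with the embedded control is the Headers argument of IWebBrowser2::Navigate, which accepts additional request header lines as a string. The sketch below only builds that string in portable C++; the helper name makeUserAgentHeader is hypothetical, and any user-agent value passed to it is an assumption you would replace with the string of the browser you want to emulate.

```cpp
#include <string>

// Hypothetical helper: formats a single additional HTTP request header line.
// Each header line is terminated with CRLF, as in an HTTP request.
std::wstring makeUserAgentHeader(const std::wstring& userAgent) {
    return L"User-Agent: " + userAgent + L"\r\n";
}
```

On Windows the result would be wrapped in a BSTR VARIANT and passed as the Headers argument of Navigate. As far as I know, headers supplied this way apply only to that one navigation, so links the user clicks afterwards may still go out with the control's default user agent.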

Extracting out the important information from a web page when provided with only the URL

What I'm referring to is what apps like Facebook and Twitter do when someone posts a link. They are able to convert that link into a title, an important image and (sometimes) a short summary.
What I'm asking is: is there some trick to this using tags, RSS or metadata, or do you have to sign up for a web service that does this for you, or write the code yourself, downloading the HTML and parsing it to extract a guess at the components you want?
http://ogp.me - They all use the open graph protocol or others. The answers are in the meta tags.
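Open Graph pages declare their title, image and description in tags of the form <meta property="og:title" content="...">. As a rough sketch of reading them once the HTML has been downloaded, the hypothetical function extractOpenGraph below pulls those pairs out with a regular expression. It assumes double-quoted attributes with property written before content; a real HTML parser would be more robust.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: maps each Open Graph property name (without the "og:"
// prefix) to its content value. If a property repeats, the last one wins.
std::map<std::string, std::string> extractOpenGraph(const std::string& html) {
    // Assumption: attributes are double-quoted and `property` precedes
    // `content`, separated only by whitespace.
    static const std::regex meta(
        R"re(<meta\s+property="og:([^"]+)"\s+content="([^"]*)")re",
        std::regex::icase);
    std::map<std::string, std::string> tags;
    for (std::sregex_iterator it(html.begin(), html.end(), meta), end;
         it != end; ++it) {
        tags[(*it)[1].str()] = (*it)[2].str();
    }
    return tags;
}
```

Calling extractOpenGraph on a page's HTML and reading the "title", "image" and "description" keys gives the pieces that a link preview typically shows.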

Does Facebook support Hash Bang #! Ajax Crawlable Urls?

Does Facebook support Google's ajax crawling specification and, if so, what do you need to do to implement it?
I am trying to get the Facebook "Like" button to work with AJAX crawlable urls as defined here: code.google.com/web/ajaxcrawling/docs/specification.html
I have this url which I can go to directly and it loads. Note the "#!" in the url:
http://www.idkshouldi.com/?#!idkDetails_idkKey=agppZGtzaG91bGRpcmMLEiljb21faWRrc2hvdWxkaV93ZWJfc2VydmVyX2dhZV9vYmpfSWRrVXNlciIDamltDAsSKWNvbV9pZGtzaG91bGRpX3dlYl9zZXJ2ZXJfZ2FlX29ial9JZGtJdGVtGN6kBgw
When I "Like" this page it should crawl this "escaped fragment" url:
http://www.idkshouldi.com/?_escaped_fragment_=idkDetails_idkKey=agppZGtzaG91bGRpcmMLEiljb21faWRrc2hvdWxkaV93ZWJfc2VydmVyX2dhZV9vYmpfSWRrVXNlciIDamltDAsSKWNvbV9pZGtzaG91bGRpX3dlYl9zZXJ2ZXJfZ2FlX29ial9JZGtJdGVtGN6kBgw
Why won't it crawl this page? The Facebook linter is not properly crawling my page. If one uses the Facebook linter tool here: developers.facebook.com/tools/debug
It won't properly crawl an AJAX enabled URL with the "#!" in it. This is Google's specification. What Facebook's lint crawler needs to do is to replace the "#!" with "_escaped_fragment_". It doesn't appear to do that with my AJAX enabled links.
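To make the expected mapping concrete, here is a small sketch of the rewrite a crawler following the specification would perform. The function name toEscapedFragmentUrl is hypothetical, and the set of characters that get percent-escaped reflects my reading of the specification, so treat it as an assumption.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrites a "#!" URL into its _escaped_fragment_ form.
// URLs without "#!" are returned unchanged.
std::string toEscapedFragmentUrl(const std::string& url) {
    const std::string::size_type pos = url.find("#!");
    if (pos == std::string::npos) {
        return url;
    }
    std::string result = url.substr(0, pos);
    const std::string fragment = url.substr(pos + 2);

    // Start a query string, or extend the existing one.
    if (result.find('?') == std::string::npos) {
        result += '?';
    } else if (result.back() != '?' && result.back() != '&') {
        result += '&';
    }
    result += "_escaped_fragment_=";

    // Assumed escaping set: control characters and space, '#', '%', '&',
    // '+', and bytes from 0x7F upward. Everything else is copied as-is.
    for (unsigned char c : fragment) {
        if (c <= 0x20 || c == '#' || c == '%' || c == '&' || c == '+' ||
            c >= 0x7F) {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            result += buf;
        } else {
            result += static_cast<char>(c);
        }
    }
    return result;
}
```

Applied to a URL shaped like the one above (http://www.idkshouldi.com/?#!idkDetails_idkKey=...), this yields the http://www.idkshouldi.com/?_escaped_fragment_=idkDetails_idkKey=... form that the server is expected to answer.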
This is also a big problem for me, but unfortunately it appears Facebook does not support this Google URL notation. Facebook's crawler/parser does not translate from hash bang (#!) to an _escaped_fragment_ format URL.
Like you I have tested my page on Facebook's URL linter and it only picks up static Open Graph tags within the dynamic original page, rather than the page-specific Open Graph tags in the _escaped_fragment_ server-side variant of my page. Unfortunately, this means that Facebook sees my Open Graph tags as site-specific, rather than page specific.
It is rather an irony that this appears to be unsupported as Facebook uses this approach itself to allow Google's crawlers to pick up Facebook pages.
One potential workaround, that may help you a little bit, is:
1) Use your _escaped_fragment_ page version in Facebook links
2) Add an automatic redirect from your _escaped_fragment_ variant to the proper version.
This should mean that Facebook will pick up the proper meta tags, and the user will click the link and end up on the correct page. The downside of this approach is that the user has to know the rather ugly _escaped_fragment_ URL. In other words, it will probably only be you that knows it, unless you add some sort of 'generate shareable link' button to your page.
It is surely only a matter of time before Facebook adds support for this as single-page hash bang sites are only going to become more prevalent.

Is there a Web Service API to the Google Product Search?

I want to call the Google product search and get back a parse-able XML file rather than having to scrape the HTML. I'm not looking for a SOAP based service, but a service that returns XML based on a URL passed in.
Correction--this did NOT work:
The Google Base API lists only a subset of Google product sellers (apparently only those who are active users of the Google Base product.)
http://code.google.com/apis/base/docs/2.0/attrs-queries.html
I eventually ended up using a screen scraping solution and then found that the data was too inconsistent to use for my purposes at all. :-(
http://answers.oreilly.com/topic/2165-how-to-search-google-and-bing-in-c/
Use the link above for reference; hopefully it will be useful to you all.