Suppose you know the location of the text you want to read, for example under a particular category. How would you connect to a website and search for and read that text?
What steps do I need to follow to learn about that?
You could use libcurl/cURL for your HTML retrieval.
You're probably looking for a web crawler.
Here's an example of a simple crawler written in C++.
Moreover, you might want to have a look at wget, a tool for retrieving files via HTTP, HTTPS and FTP.
If you are looking at a specific web page, you could try retrieving the page and parsing it to get to the exact location you want, e.g. a specific div, etc.
Since you are using C++, you could try reading up on using libcurl to retrieve the information you need from the URL.
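For a feel for what that looks like, here is a minimal, self-contained sketch (the URL is a placeholder and error handling is kept to a bare minimum):

    // Build with: g++ fetch.cpp -lcurl
    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // libcurl hands the response body to this callback chunk by chunk.
    static size_t append_chunk(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        std::string html;
        if (curl) {
            curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/"); // placeholder URL
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_chunk);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
            CURLcode res = curl_easy_perform(curl);
            if (res == CURLE_OK)
                std::cout << html << '\n';   // raw HTML, ready for searching/parsing
            else
                std::cerr << curl_easy_strerror(res) << '\n';
            curl_easy_cleanup(curl);
        }
        curl_global_cleanup();
        return 0;
    }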
You can download an HTML file with WinHTTP (working example) and then search the file. There are some find algorithms in the std::string class that will do if your needs are relatively basic.
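If a plain substring search is enough, something along these lines would do once the page body is in a std::string (however it was downloaded):

    #include <string>

    // Count how often a keyword occurs in already-downloaded HTML;
    // 'html' would come from WinHTTP (or any other retrieval code).
    int count_occurrences(const std::string& html, const std::string& keyword) {
        int count = 0;
        for (std::string::size_type pos = html.find(keyword);
             pos != std::string::npos;
             pos = html.find(keyword, pos + keyword.size()))
            ++count;
        return count;
    }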
I created a very customized Leaflet map on a Bitrix website (they forced me to, not my choice). Now other coworkers who are basically "afraid" of code need to be able to add markers to it. I already created a C++ program where they can simply enter all the details they want (which category, what the popup content is, etc.) and it spits out the geoJSON code for the marker for them to copy and paste into the website.
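The generator part of such a program can be as small as the sketch below; the property names (category, popupContent) are my assumptions about the marker format, not something taken from the actual site:

    #include <sstream>
    #include <string>

    // Builds one geoJSON Feature for a marker. The property names are
    // hypothetical; quoting/escaping of user input is omitted for brevity.
    std::string make_marker(double lon, double lat,
                            const std::string& category,
                            const std::string& popup) {
        std::ostringstream out;
        out << "{ \"type\": \"Feature\","
            << " \"properties\": { \"category\": \"" << category
            << "\", \"popupContent\": \"" << popup << "\" },"
            << " \"geometry\": { \"type\": \"Point\","
            << " \"coordinates\": [" << lon << ", " << lat << "] } }";
        return out.str();
    }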
To make it even easier for them, I am wondering if there is a way to have my program connect to the internet, go to the backend of my website and, after asking for login credentials, add the code to the respective .js file that contains only the marker code.
I have been googling the problem but unfortunately couldn't find any related posts.
Okay, I finally found what is probably the easiest way: I will have my colleagues install Python, and I'll write a little script that concatenates the code and uploads it using Selenium. Thanks for your help, guys!
As part of my application, my client has requested that I include an automated e-mailing system. As part of this system, I generate HTML code and use automation to send it via Outlook.
However, they also require a PDF copy of the HTML document to be sent as an attachment. My initial attempts involved using libHaru, which proved difficult to use efficiently: I had to create the PDF document from scratch, computing the position of every line in a table, positioning all the text, and so on.
I was wondering if there would be a way to programmatically convert HTML code (or an HTML file if need be) into a PDF document either by using Win32/MFC itself or an external library.
Thanks in advance!
EDIT: Just to clarify, I am looking for solutions which minimize external dependencies.
You should evaluate this utility wkhtmltopdf:
http://code.google.com/p/wkhtmltopdf/
You can call it from the command line without the need to run a setup.
I use it by generating my output documents as HTML, then calling ShellExecute(...) to convert them to PDF. It's great!
Internally it uses WebKit + Qt, so compatibility with modern HTML is good.
Hope it helps.
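For illustration, a rough sketch of that ShellExecute call (file names are placeholders, and it assumes wkhtmltopdf.exe is on the PATH):

    #include <windows.h>
    #include <shellapi.h>   // ShellExecuteA; link against shell32.lib

    // Fire-and-forget conversion via the wkhtmltopdf command line.
    // ShellExecute returns immediately; use ShellExecuteEx + WaitForSingleObject
    // if the PDF must exist before you attach it to the e-mail.
    void html_to_pdf() {
        ShellExecuteA(NULL, "open", "wkhtmltopdf.exe",
                      "report.html report.pdf",   // placeholder input/output names
                      NULL, SW_HIDE);
    }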
I'd take a look at PDF Creator, which can be used as a COM object (that acts pretty much like a printer). I haven't used it to print HTML, so I'm not sure, but my guess is that you'll probably end up having to instantiate a web browser control to render the HTML, and then feed it from there to the PDF control.
Some possible answers are in this thread:
C++ Library to Convert HTML to PDF?
Not sure if they will satisfy your particular requirements, but these might at least get you started.
Edit:
Some other possible options here.
It's not MFC, but you can try QtWebKit. It can render HTML and export it to PDF, PNG or JPEG.
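A rough Qt 4 / QtWebKit sketch of the idea (URL and output name are placeholders):

    // Build with Qt 4: QT += webkit in the .pro file
    #include <QApplication>
    #include <QPrinter>
    #include <QUrl>
    #include <QWebView>

    int main(int argc, char* argv[]) {
        QApplication app(argc, argv);
        QWebView view;
        // Block until the page (and its resources) have finished loading.
        QObject::connect(&view, SIGNAL(loadFinished(bool)), &app, SLOT(quit()));
        view.load(QUrl("http://example.com/"));    // placeholder URL
        app.exec();

        QPrinter printer;
        printer.setOutputFormat(QPrinter::PdfFormat);
        printer.setOutputFileName("page.pdf");     // placeholder output name
        view.print(&printer);                      // render the loaded page to PDF
        return 0;
    }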
I am working on a simple client-server project. The client is written in Java; it sends key words to a C++ server running under Linux and receives a list of URLs with the best ranks (depending on the number of occurrences of the key words). The server's job is to go through some URLs in search of the key words and return the best-fitting URLs.

The problem is that I have to parse HTML sites to find occurrences of the key words, plus I need to extract links from each visited page so I can search those as well. My question is: what library can I use to do that? Remember, only C++ Linux libraries are suitable for me.

There were some similar topics, so I tried to go through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit; I want to parse it on the fly and just store its rank and URL. Some of them look a bit complicated to me, for instance first parsing the HTML to XML or something else and only then working on the results in C++. Is there something simple and sufficient for what I need? Any advice will be appreciated.
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
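As a taste of the API, here is a hedged sketch that parses an in-memory page with libxml2's HTML parser and prints every link; the inline HTML stands in for whatever the crawler fetched:

    // Build with: g++ links.cpp $(xml2-config --cflags --libs)
    #include <libxml/HTMLparser.h>
    #include <cstdio>
    #include <cstring>

    // Recursively print the href of every <a> element in the parsed tree.
    static void print_links(xmlNode* node) {
        for (xmlNode* cur = node; cur; cur = cur->next) {
            if (cur->type == XML_ELEMENT_NODE &&
                xmlStrcasecmp(cur->name, BAD_CAST "a") == 0) {
                xmlChar* href = xmlGetProp(cur, BAD_CAST "href");
                if (href) {
                    std::printf("%s\n", (const char*)href);
                    xmlFree(href);
                }
            }
            print_links(cur->children);
        }
    }

    int main() {
        // Stand-in for a page fetched into memory (e.g. via libcurl);
        // nothing has to be written to disk first.
        const char* html =
            "<html><body><a href='http://example.com/x'>x</a></body></html>";
        htmlDocPtr doc = htmlReadMemory(html, std::strlen(html),
                                        "http://example.com/", NULL,
                                        HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc) {
            print_links(xmlDocGetRootElement(doc));
            xmlFreeDoc(doc);
        }
        xmlCleanupParser();
        return 0;
    }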
To get URLs from the web using C/C++ you could use the libcurl library. To parse URLs and other not-so-easy stuff out of the page, you can use a regex library.
Separating the HTML tags from the real content can also be done without the use of a library.
For more advanced stuff, you could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow you to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily, as in the sketch below.
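A sketch of what that can look like (Qt 4.6+ with QtWebKit; the URL and the CSS selector are placeholders):

    // Qt 4.6+ with QtWebKit (QT += webkit)
    #include <QApplication>
    #include <QDebug>
    #include <QUrl>
    #include <QWebElement>
    #include <QWebFrame>
    #include <QWebPage>

    int main(int argc, char* argv[]) {
        QApplication app(argc, argv);
        QWebPage page;
        // Wait for the page (including any JavaScript) to finish loading.
        QObject::connect(&page, SIGNAL(loadFinished(bool)), &app, SLOT(quit()));
        page.mainFrame()->load(QUrl("http://example.com/")); // placeholder URL
        app.exec();

        // CSS-style selectors address individual HTML objects,
        // e.g. every single cell of every table.
        foreach (QWebElement cell, page.mainFrame()->findAllElements("table td"))
            qDebug() << cell.toPlainText();
        return 0;
    }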
You can try xerces-c. It's a powerful library for XML parsing. It supports reading XML on the fly, and both DOM and SAX parsing.
The repository I am asking about is on Linux, but my problem relates to the client, i.e. to retrieving the data, and the client can be Linux, Windows, Mac OS X, etc. So I opted against asking this question on the Unix & Linux site; if the admins feel it should be a U&L question, please move it there.
Consider a repository such as http://download.opensuse.org/repositories/LCD/openSUSE_11.4/x86_64/ -- you can fetch its HTML, parse it, and get the list of files. However, I hardly believe that is the correct way: since the HTML is created by a website engine (MirrorBrain in this case), there should be some web service API to get this list directly.
I googled, but didn't find anything relevant.
So -- how do I get the list of files directly? No parsing, just a call that returns the collection of file names.
MirrorBrain doesn't have an API call to retrieve a list of files. (It only has API calls to retrieve a list of mirrors for a single file, by appending .mirrorlist or .meta4 to a file's URL.) It would be a worthwhile idea to add such an API call (patches welcome!).
So there's only the standard HTTP server directory index to read a file list from. The format varies from server to server, and even Apache has different variants. With Apache, a little trick that can help is to append ?F=0 to the directory URL if you want to get only the filenames (it will simplify the index), or to append ?F=1 to switch to the fancier variant which includes more details.
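Strictly speaking that still means a little parsing, but with the simplified ?F=0 index a naive extraction sketch like this may already be enough (the regex is an assumption about Apache's output and will need adjusting per server):

    #include <regex>
    #include <string>
    #include <vector>

    // 'body' is the response for the directory URL with "?F=0" appended,
    // fetched however you like (libcurl, WinHTTP, ...). The regex is naive:
    // it grabs every href, including sort links and the parent directory.
    std::vector<std::string> list_files(const std::string& body) {
        std::vector<std::string> files;
        std::regex href_re("href=\"([^\"]+)\"");
        for (std::sregex_iterator it(body.begin(), body.end(), href_re), end;
             it != end; ++it)
            files.push_back((*it)[1].str());
        return files;
    }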
Hope this helps.
I have many source/text files, say file.cpp or file.txt. Now I want to see all my code/text in a browser, so that it will be easy for me to navigate the many files.
My main motive for doing all this is that I am teaching myself C++, so whenever I learn something new, I create some sample code and then compile and run it. Alongside the code there are comments/tips for me to be aware of. I then create links to each file for easy navigation. Since there are many such files, I thought it would be easy to navigate them if I use this HTML method. I am not sure if it is a good approach; I would like some feedback.
What I did was save file.cpp/file.txt as file.html and then use the pre and code HTML tags for formatting, plus some more necessary HTML tags for viewing the files.
But when I use it, everything inside < > is lost.
E.g. #include <iostream> is just seen as #include, and <iostream> disappears.
Is there any way to see it? Is there any tag or method that I can use?
I know I can use the regular HTML escape codes &lt; and &gt; for this, to see < and >, but since I have many include files and changing them all is a bit time-consuming, I want to know if there is any other idea.
So, is there any solution other than s/</&lt;/ and s/>/&gt;/?
I would also like to know if there are any other ideas/tips besides converting the cpp file into HTML.
What I want to have in my main page is something like this:
tip1 Do this
tip2 Do that
When I click tip1, it will open tip1.html, which has my code for that tip. There is also a back link in tip1.html, which takes me back to the main page when I click it. Everything is OK except that everything inside < > is lost, not seen.
Thanks.
You might want to take a look at online tools such as CodeHtmler, which allows you to copy your code into the browser and select the appropriate language, and it'll convert it to HTML for you, together with keyword colourisation etc.
Or, do like many other people and put your documentation in Doxygen format (/** */) with code samples in @verbatim/@endverbatim tags. Doxygen is good stuff.
A few ideas:
If you serve the files as mimetype text/plain, the browser should display the text for you.
You could also possibly configure your browser to assume .cpp is text/plain.
Instead of opening the files directly in the browser, you could serve them with a web server that can change the characters for you.
You could also use SyntaxHighlighter to display the code on the client side using JavaScript.
It is pretty much essential that somewhere along the line you use a program to prevent the characters '<>&' from being (mis-)interpreted by your browser (and to expand significant repeated blanks into '&nbsp;'). You have a couple of options for when/how to do that.

You could use static HTML, simply converting each file once before putting it into the web server document hierarchy. This has the least conversion overhead if the files are looked at more often than they are modified.

Alternatively, you can configure your web server to serve the pages via a filter program (CGI, or something more sophisticated) and serve the output of that in lieu of the file. The advantage is that files are only converted when needed; the disadvantage is that the files are converted every time they are needed.

You could get fancy and consider a caching solution: convert the file on first demand but retain the converted file for future use. The main downside there is that the web server needs to be able to write to wherever the converted file is cached, which is not necessarily a good idea for security reasons. (A minimalist approach to security requires the document hierarchy to be owned by and only writable by one user, say webmaster, while the web server runs as another user, say webserver. Now the web server cannot do any damage because it cannot write anywhere in the document hierarchy. Simple; effective; restrictive.)
The program can be a simple Perl script or a simple C program (the C source for webcode 1.3 is available here).
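In the same spirit, here is a minimal C++ filter sketch: it reads source on stdin and writes an HTML-safe version wrapped in pre tags on stdout, so you would invoke it once per file (e.g. escape < tip1.cpp > tip1.html, names hypothetical):

    #include <iostream>

    // Replace the characters that HTML would misinterpret, pass the rest through.
    int main() {
        std::cout << "<pre>\n";
        char c;
        while (std::cin.get(c)) {
            switch (c) {
                case '&': std::cout << "&amp;"; break;
                case '<': std::cout << "&lt;";  break;
                case '>': std::cout << "&gt;";  break;
                default:  std::cout << c;       break;
            }
        }
        std::cout << "</pre>\n";
        return 0;
    }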