Read HTML source to string - c++

I hope you don't frown on me too much, but this should be answerable by someone fairly easily. I want to read a file on a website into a string, so I can extract information from it.
I just want a simple way to get the HTML source read into a string. After looking around for hours, all I see are libraries, curl, and so on. All I need is the raw HTML data. I don't even need a definite answer, just something that will help me refine my search.
Just to be clear, I want the raw code in a string I can manipulate; I don't need any parsing.

You need an HTTP client library; libcurl is one of many. You would then issue a GET request to a URL and read the response back however your chosen library provides it.
Here is an example to get you started. It is C, so I am sure you can work it out.
#include <stdio.h>
#include <curl/curl.h>
int main(void)
{
    CURL *curl;
    CURLcode res;
    curl = curl_easy_init();
    if(curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com");
        res = curl_easy_perform(curl);
        /* always cleanup */
        curl_easy_cleanup(curl);
    }
    return 0;
}
But you tagged this C++, so if you want a C++ wrapper for libcurl, use curlpp:
#include <iostream>
#include <curlpp/curlpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>
using namespace curlpp::options;
int main(int, char **)
{
    try
    {
        // That's all that is needed to do cleanup of used resources
        curlpp::Cleanup myCleanup;
        // Our request to be sent.
        curlpp::Easy myRequest;
        // Set the URL.
        myRequest.setOpt<Url>("http://example.com");
        // Send request and get a result.
        // By default the result goes to standard output.
        myRequest.perform();
    }
    catch(curlpp::RuntimeError & e)
    {
        std::cout << e.what() << std::endl;
    }
    catch(curlpp::LogicError & e)
    {
        std::cout << e.what() << std::endl;
    }
    return 0;
}
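Both examples above print the response to standard output. To get it into a string instead, libcurl lets you register a write callback with CURLOPT_WRITEFUNCTION and pass it a pointer of your choice with CURLOPT_WRITEDATA. Below is a minimal sketch of such a callback. The name append_to_string is my own, and the libcurl wiring is shown only as a comment so that the function itself compiles without libcurl.

```cpp
#include <cstddef>
#include <string>

// Same parameter list libcurl uses for a CURLOPT_WRITEFUNCTION callback.
// userdata is the pointer registered with CURLOPT_WRITEDATA; here it is
// assumed to point at a std::string that collects the response.
std::size_t append_to_string(char* data, std::size_t size, std::size_t nmemb,
                             void* userdata) {
    std::string* out = static_cast<std::string*>(userdata);
    out->append(data, size * nmemb);
    // Returning the number of bytes handled tells libcurl to keep going.
    return size * nmemb;
}

// Wiring it up (needs <curl/curl.h> and linking against libcurl):
//   std::string html;
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_to_string);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
//   curl_easy_perform(curl);   // on success, html holds the page source
```

The callback may be called several times per transfer, which is why it appends rather than assigns.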

HTTP is built on top of TCP. If you know socket programming, you can write a simple networking application that opens a socket to the desired server and issues an HTTP GET command. Whatever the server responds with, you'll have to remove the HTTP headers that precede the actual document you want.
If that sounds complicated, then just stick with libcurl.
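To give an idea of the "remove the HTTP headers" step: in an HTTP response the headers end at the first empty line, so the document starts right after the first "\r\n\r\n". This is only a sketch with a made-up function name (extract_body); it assumes the whole response is already in one string and does not handle chunked transfer encoding or compression.

```cpp
#include <string>

// Returns everything after the blank line that ends the HTTP headers,
// or an empty string if no such blank line is found.
std::string extract_body(const std::string& response) {
    const std::string separator = "\r\n\r\n";
    const std::string::size_type pos = response.find(separator);
    if (pos == std::string::npos) {
        return std::string();
    }
    return response.substr(pos + separator.size());
}
```

For a response such as "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>", this returns "<html></html>".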

If it is a hack, then just grab the source from "view source" and save it as a .txt file. Then you can open it with a normal file I/O stream.
All those pesky libraries are a hint that it is a common and non-trivial exercise to do it right... :)

If all you want to do is grab the entire HTML code without any kind of parsing or external libraries, my suggestion would be to copy the code into a string with an I/O stream.
It is the simplest way I can think of, but be aware that it isn't the most efficient way to do it.
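For a page that has already been saved to disk, reading it into a string with a stream could look like the sketch below. The function name read_file is just an illustration; it reads a local file and does not download anything.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads the whole file at `path` into a std::string.
// Throws std::runtime_error if the file cannot be opened.
std::string read_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copy the entire stream in one go
    return buffer.str();
}
```

Opening in binary mode keeps the bytes exactly as they are on disk, including line endings.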

Related

Get the HTML of a site [closed]

Closed 9 years ago.
I'm trying to get the HTML of a page into a string (or a char[]).
I know how to use basic sockets and connect as a client/server.
I've written a client in the past that takes an IP and port, connects to it, and sends images and such between the client and the server using sockets.
I've searched the internet a bit and found that I can connect to the website and send a GET request to get the HTTP content of a page and store it in a variable, though I have a few problems:
1) I'm trying to get the HTML of a page that isn't the main page of a site: not stackoverflow.com, but stackoverflow.com/help and such (not the "official page of the site", but something inside that site).
2) I'm not sure how to either send or store the data I get from the GET request.
I saw there are outside libraries I could use, but I'd rather use sockets only.
By the way, I'm using Windows 7, and I only need it to work on Windows (so it's fine if it won't work on Linux).
Thanks for your help! :)
To access a resource on some host you just specify the path to the resource in the first line of the request, just after the 'GET'. E.g. check http://www.jmarshall.com/easy/http/#http1.1
GET /path/file.html HTTP/1.1
Host: www.host1.com:80
[blank line here]
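As a small illustration of the request format above, here is a sketch that builds the request text for a given host and path. The function name build_get_request is made up, and the added "Connection: close" header is my own choice so that the server closes the socket once the response is complete.

```cpp
#include <string>

// Builds the text of a minimal HTTP/1.1 GET request.
// `path` is the part after the host name, e.g. "/help".
std::string build_get_request(const std::string& host, const std::string& path) {
    std::string target = path;
    if (target.empty()) {
        target = "/";            // the site's main page
    }
    std::string request;
    request += "GET " + target + " HTTP/1.1\r\n";
    request += "Host: " + host + "\r\n";
    request += "Connection: close\r\n";
    request += "\r\n";           // the blank line that ends the request
    return request;
}
```

For stackoverflow.com/help you would connect a socket to stackoverflow.com on port 80 and send the result of build_get_request("stackoverflow.com", "/help").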
I'd also recommend using a portable library like Boost.ASIO instead of raw sockets. But I'd strongly recommend using an existing, portable library that implements the HTTP protocol, unless, of course, the point is to learn how to implement it.
Even if you want to implement it yourself, it is worth knowing the existing solutions. For instance, this is how you can get a web page using cpp-netlib (http://cpp-netlib.org/0.10.1/index.html):
using namespace boost::network;
using namespace boost::network::http;
client::request request_("http://127.0.0.1:8000/");
request_ << header("Connection", "close");
client client_;
client::response response_ = client_.get(request_);
std::string body_ = body(response_);
This is how you can do it using cURL library (http://curl.haxx.se/libcurl/c/simple.html):
#include <stdio.h>
#include <curl/curl.h>
int main(void)
{
    CURL *curl;
    CURLcode res;
    curl = curl_easy_init();
    if(curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com");
        /* example.com is redirected, so we tell libcurl to follow redirection */
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        /* Perform the request, res will get the return code */
        res = curl_easy_perform(curl);
        /* Check for errors */
        if(res != CURLE_OK)
            fprintf(stderr, "curl_easy_perform() failed: %s\n",
                    curl_easy_strerror(res));
        /* always cleanup */
        curl_easy_cleanup(curl);
    }
    return 0;
}
Both libraries are portable but if you'd like to use some Windows-specific API you might check WinINet (http://msdn.microsoft.com/en-us/library/windows/desktop/aa383630%28v=vs.85%29.aspx) but it's less pleasant to use.

How-to: Send text from webpage and send to external application

I'm writing a program for my algorithm class that is supposed to traverse a webpage, find a random address, and then, using a browser extension (Firefox/Chrome), do a Google Maps search for that address. It just occurred to me that I could use the extension to capture text, put it into a text file, and then have my program read that text file, but I have no clue how that would be implemented.
My code so far (don't worry, it will get longer after I add a Windows UI; this is just a test console app):
#include <iostream>
#include <string>
#include <windows.h>
using namespace std;
int main ()
{
    string address;
    cout << "Please input address: ";
    //cin >> address;
    getline(cin, address);
    //word_list = getRecursiveURLs(url, DEPTH)
    //return cleaner(word_list)
    //string address = "Houston, Tx ";
    std::string str = "http://mapof.it/" + address;
    //cout << mapSearch;
    // ShellExecuteA takes narrow (char) strings regardless of the UNICODE setting
    ShellExecuteA(NULL, "open", str.c_str(), NULL, NULL, SW_SHOWNORMAL);
    return 0;
}
Right now, my code takes in an address and adds it to the end of a "Mapof.it" url that basically initiates a GMaps search.
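One detail about the code above: the address is appended to the URL exactly as typed, so spaces and commas end up in the URL unescaped. A browser may tolerate that, but it is safer to percent-encode the address first. The following is a sketch only; percent_encode is a made-up helper that leaves ASCII letters, digits and "-", "_", ".", "~" alone and escapes every other byte.

```cpp
#include <string>

// Percent-encodes every byte except the characters that are always
// safe in a URL (ASCII letters, digits, '-', '_', '.', '~').
std::string percent_encode(const std::string& text) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') ||
                          c == '-' || c == '_' || c == '.' || c == '~';
        if (safe) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

With this, the URL would be built as "http://mapof.it/" + percent_encode(address).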
It looks like the user interacts with your C++ program directly, so it doesn't need to communicate with the browser process.
You can send an HTTP request from the C++ program, fetch the response text, and then parse it.
First, check whether the website provides an API URL that returns JSON or XML, because those formats are easier to parse. For example, Google Maps does provide an API.
If not, try to use regular expressions to parse the HTML, or find a DOM library to parse it as a DOM.
If the text you need can't be extracted from the raw HTML because it is created dynamically by JavaScript, you can find a "headless browser" library to help you.
If you need a full-featured browser, use Qt; it provides the QtWebKit widget.
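As an example of the regular-expression route, here is a sketch that pulls the text of the title element out of a page that is already in a string. The function name extract_title is my own. Regular expressions only cope with simple, predictable markup like this; for anything nested, a real HTML parser is the better tool.

```cpp
#include <regex>
#include <string>

// Returns the text between <title> and </title>, or an empty string
// if the page has no title element. Matching is case-insensitive.
std::string extract_title(const std::string& html) {
    static const std::regex pattern("<title[^>]*>([^<]*)</title>",
                                    std::regex::ECMAScript | std::regex::icase);
    std::smatch match;
    if (std::regex_search(html, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

The same approach works for other simple patterns, such as an address wrapped in a known tag.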

Sending and receiving strings over http via curl

I have a situation where my program on a server (a Windows machine) outputs some strings. I need to send those strings from the server to the client via HTTP using curl. Once sent, I need to receive the data on the client side as a string, decode it, and perform subsequent actions.
I already achieved this functionality with C sockets using the Berkeley API, as I was familiar with that. But for some reason I am not allowed to use a program of my own.
I poked around and it seems curl can be my solution. However, I am very new to curl and can't figure out how to achieve this functionality. On the client side, I found this, which may be useful:
#include <stdio.h>
#include <curl/curl.h>
int main(void)
{
    CURL *curl;
    CURLcode res;
    curl = curl_easy_init();
    if(curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com");
        /* Perform the request, res will get the return code */
        res = curl_easy_perform(curl);
        /* Check for errors */
        if(res != CURLE_OK)
            fprintf(stderr, "curl_easy_perform() failed: %s\n",
                    curl_easy_strerror(res));
        /* always cleanup */
        curl_easy_cleanup(curl);
    }
    return 0;
}
I understand that you have to use the write callback functions to receive data?
Also on the client side I need to develop a program using curl that whenever the server sends over a string, it should receive it and decode it. Any pointers to tutorials related to the specific problems will be highly appreciated. Or if someone has already tried this I'll highly appreciate any help here.
Thanks.
Take a look at this example code from their site. It details how to get your response data written to a region of memory rather than a file:
http://curl.haxx.se/libcurl/c/getinmemory.html
Also take a look at the generic tutorial on the curl website:
http://curl.haxx.se/libcurl/c/libcurl-tutorial.html
One final thing to consider: if you are using C++, make sure your callbacks are not non-static member functions (see "libcurl - unable to download a file").
This should get you started at least.
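To illustrate the point about member functions: a common pattern is to make the callback a static member function and pass the object itself through the user-data pointer. The sketch below uses a made-up class name (ResponseBuffer) and keeps the libcurl calls in a comment so that it compiles on its own.

```cpp
#include <cstddef>
#include <string>

class ResponseBuffer {
public:
    // Static, so its address is an ordinary function pointer that can be
    // handed to CURLOPT_WRITEFUNCTION. userdata is assumed to be the
    // ResponseBuffer* registered with CURLOPT_WRITEDATA.
    static std::size_t write_cb(char* data, std::size_t size, std::size_t nmemb,
                                void* userdata) {
        ResponseBuffer* self = static_cast<ResponseBuffer*>(userdata);
        return self->on_data(data, size * nmemb);
    }

    const std::string& text() const { return text_; }

private:
    // The real work happens in a normal member function.
    std::size_t on_data(const char* data, std::size_t length) {
        text_.append(data, length);
        return length;
    }

    std::string text_;
};

// Usage with libcurl (needs <curl/curl.h>):
//   ResponseBuffer buffer;
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, &ResponseBuffer::write_cb);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buffer);
```

A non-static member function cannot be used directly because it needs a hidden this argument that libcurl knows nothing about.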

Downloading a file from URL to disk in C++

I have a simple question. Is it possible to write simple code to download a file from the internet (from URL to disk) without using C++ (for mac osx) libraries like curl?
I have seen some examples but all of these use the Curl library.
I use this code in my Xcode project, but I get some compilation (linking) errors:
#define CURL_STATICLIB
#include <stdio.h>
#include <curl/curl.h>
/* <curl/types.h> is omitted: recent libcurl releases no longer ship it */
#include <curl/easy.h>
#include <string>
/* Write callback: libcurl calls this with each chunk of the response */
size_t write_data(void *ptr, size_t size, size_t nmemb, FILE *stream) {
    size_t written;
    written = fwrite(ptr, size, nmemb, stream);
    return written;
}
int main(void) {
    CURL *curl;
    FILE *fp;
    CURLcode res;
    const char *url = "http://localhost/aaa.txt";
    char outfilename[FILENAME_MAX] = "bbb.txt";
    curl = curl_easy_init();
    if (curl) {
        fp = fopen(outfilename,"wb");
        if (fp) {
            curl_easy_setopt(curl, CURLOPT_URL, url);
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
            res = curl_easy_perform(curl);
            fclose(fp);
        }
        curl_easy_cleanup(curl);
    }
    return 0;
}
How can I link the curl library to my Xcode project?
You can launch a console command, it is very simple :D
system("curl -o ...")
or
system("wget ...")
"Downloading a file from URL" means basically doing an GET request to some remote HTTP server. So you need to have your application know how to do that HTTP request.
But HTTP is now quite a complex protocol. Its specification alone is long and complex (more than a hundred pages). libcurl is a good library implementing it.
Why do you want to avoid using a good free library implementing a complex protocol? Of course, you could implement the complex HTTP protocol yourself (that probably needs years of work), or make a minimal program that doesn't implement all the details of the HTTP protocol but might work (though not with weird HTTP servers).
You have to learn bits of "socket programming" and implement a very basic HTTP protocol; the minimalist approach is to send a string like "GET /this/path/to/file.png HTTP/1.0\r\n\r\n" to the site (the empty line ends the request). The server will then likely answer with an HTTP header that you have to parse to learn at least the length of the binary data that follows (if the request succeeded; otherwise you have to handle HTTP errors, or an unexpected content type such as an HTML page).
This guide should give you the basics to start with; as for HTTP, it depends on your needs: sometimes sending a "raw" GET suffices, sometimes not.
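As a rough idea of the "parse the header to learn the length" step, the sketch below looks for a Content-Length header in the header block of a response. The function name content_length is made up, header names are matched case-insensitively, and -1 is my own convention for "not present". A server is not required to send this header at all, in which case an HTTP/1.0 client simply reads until the connection closes.

```cpp
#include <cctype>
#include <string>

// Returns the value of the Content-Length header, or -1 if the header
// is missing or has no digits after the colon.
long content_length(const std::string& headers) {
    // Lower-case a copy so the header name matches in any letter case.
    std::string lower(headers);
    for (std::string::size_type i = 0; i < lower.size(); ++i) {
        lower[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(lower[i])));
    }
    const std::string key = "content-length:";
    std::string::size_type pos = 0;
    while ((pos = lower.find(key, pos)) != std::string::npos) {
        // Only accept the name at the start of a line.
        if (pos == 0 || lower[pos - 1] == '\n') {
            std::string::size_type i = pos + key.size();
            while (i < lower.size() && (lower[i] == ' ' || lower[i] == '\t')) {
                ++i;
            }
            long value = 0;
            bool any_digit = false;
            while (i < lower.size() && std::isdigit(static_cast<unsigned char>(lower[i]))) {
                value = value * 10 + (lower[i] - '0');
                any_digit = true;
                ++i;
            }
            return any_digit ? value : -1;
        }
        pos += key.size();
    }
    return -1;
}
```

Once the length is known, the client reads exactly that many bytes after the blank line that ends the headers.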
EDIT
Changed to pretend that the request comes from an HTTP/1.0-compliant client, since HTTP/1.1 requires the Host header to be sent, as a commenter rightly pointed out.
EDIT2
The OP changed the question, which became something about how to link with a library in Xcode. There's already a similar question on SO.

any good and simple RPC library for inter-process calls? [closed]

Closed 9 years ago.
I need to send a (probably one) simple one-way command from client processes to server process with arguments of builtin C++ types (so serialization is pretty simple). C++, Windows XP+.
I'm looking for a library that doesn't require complicated configuration, provides simple interface, doesn't require hours to days of learning and doesn't have commercial usage restrictions. Simple solution for simple problem.
Boost.Interprocess is too low-level for this simple task because it doesn't provide an RPC interface. Sockets are probably overkill too because I don't need to communicate between machines. The same goes for DCOM, CORBA, et al. Named pipes? I've never used them; is there any good library over WinAPI? OpenMPI?
I don't think sockets are really overkill. The alternatives all have their own problems and sockets are far better supported than named pipes, shared memory, etc., because almost everyone is using them. The speed of sockets on local system is probably not an issue.
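Whichever transport you pick, the command itself has to be flattened into bytes. For arguments of builtin types between processes on the same machine, that can be as simple as the sketch below. The Command struct and its fields are invented for illustration, and copying raw bytes like this assumes both processes run on the same architecture.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical one-way command; the fields are made up for illustration.
struct Command {
    std::uint32_t id;
    std::int32_t  count;
    double        value;
};

// Copies each field into a byte string in a fixed order.
std::string pack(const Command& c) {
    std::string bytes(sizeof c.id + sizeof c.count + sizeof c.value, '\0');
    char* p = &bytes[0];
    std::memcpy(p, &c.id, sizeof c.id);
    p += sizeof c.id;
    std::memcpy(p, &c.count, sizeof c.count);
    p += sizeof c.count;
    std::memcpy(p, &c.value, sizeof c.value);
    return bytes;
}

// Rebuilds a Command from bytes; returns false if the size is wrong.
bool unpack(const std::string& bytes, Command& c) {
    if (bytes.size() != sizeof c.id + sizeof c.count + sizeof c.value) {
        return false;
    }
    const char* p = bytes.data();
    std::memcpy(&c.id, p, sizeof c.id);
    p += sizeof c.id;
    std::memcpy(&c.count, p, sizeof c.count);
    p += sizeof c.count;
    std::memcpy(&c.value, p, sizeof c.value);
    return true;
}
```

The resulting byte string can be written to a socket, a named pipe, or any of the other transports mentioned here.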
There's Apache Thrift:
http://incubator.apache.org/thrift/
There are a few RPC implementations wrapped around Google's protobuf library as the marshaling mechanism:
https://github.com/google/protobuf/blob/master/docs/third_party.md#rpc-implementations
There's XML-RPC:
http://xmlrpc-c.sourceforge.net/
If your messages are really simple, I might consider using UDP packets, then there are no connections to manage.
You might like ZeroMQ for something like this. It is perhaps not so much a complete RPC as a raw byte messaging framework you could use to build an RPC. It's simple and lightweight, with impressive performance. You can easily implement an RPC on top of it. Here's an example server straight from the manual:
//
// Hello World server in C++
// Binds REP socket to tcp://*:5555
// Expects "Hello" from client, replies with "World"
//
#include <zmq.hpp>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
int main () {
    // Prepare our context and socket
    zmq::context_t context (1);
    zmq::socket_t socket (context, ZMQ_REP);
    socket.bind ("tcp://*:5555");
    while (true) {
        zmq::message_t request;
        // Wait for next request from client
        socket.recv (&request);
        printf ("Received Hello");
        // Do some 'work'
        sleep (1);
        // Send reply back to client
        zmq::message_t reply (5);
        memcpy ((void *) reply.data (), "World", 5);
        socket.send (reply);
    }
    return 0;
}
This example uses tcp://*:5555, but ZeroMQ uses more efficient IPC techniques if you use:
socket.bind("ipc://route.to.ipc");
or even faster inter thread protocol:
socket.bind("inproc://path.for.client.to.connect");
If you only need to support Windows I'd use the Windows built-in RPC, I've written two introductory articles about this:
http://www.codeproject.com/KB/IP/rpcintro1.aspx
http://www.codeproject.com/KB/IP/rpcintro2.aspx
You could use the ncalrpc protocol if you only need local inter-process communication.
Boost.MPI. Simple, fast, scalable.
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/serialization/string.hpp> // needed to send a std::string
#include <iostream>
#include <sstream>
#include <string>
namespace mpi = boost::mpi;
int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;
    std::stringstream ss;
    ss << "Hello, I am process " << world.rank() << " of " << world.size() << ".";
    world.send(1, 0, ss.str());
}
If you are working on windows only, and really need a C++ interface, use COM/DCOM. It is based on RPC (in turn based on DCE RPC).
It is extremely simple to use -- provided you take the time to learn the basics.
ATL: http://msdn.microsoft.com/en-us/library/3ax346b7(VS.71).aspx
Interface Definition Language: http://msdn.microsoft.com/en-us/library/aa367091(VS.85).aspx
You probably don't even need a library. Windows has an IPC mechanism built deeply into its core APIs (windows.h). You can basically send a Windows message to the main window of a different process. Windows even defines a standard message to do just that: WM_COPYDATA.
MSDN docu on WM_COPYDATA
MSDN demo code
More demo code in the following Stack Overflow response
The sending process basically calls:
FindWindow
SendMessage
The receiving process (window):
On Vista and later, has to modify its message filter using ChangeWindowMessageFilter (or ChangeWindowMessageFilterEx on Windows 7 and later)
Overrides its WindowProc in order to handle the incoming WM_COPYDATA
I know that we are far away from easy to use. But of course you can stick to CORBA. E.g. ACE/TAO
I'm told RPC with Raknet is nice and simple.
Also, you might look at msgpack-rpc
Update
While Thrift and Protobuf are more flexible, I think, they require you to write some code in a specific format. For example, Protobuf needs a .proto file, which is compiled by the compiler shipped with the package to generate some classes. In some cases that can be more difficult than the other parts of the code.
msgpack-rpc is much simpler. It doesn't require writing any extra code. Here is an example:
#include <iostream>
#include <msgpack/rpc/server.h>
#include <msgpack/rpc/client.h>
class Server: public msgpack::rpc::dispatcher {
public:
    typedef msgpack::rpc::request request_;
    Server() {};
    virtual ~Server() {};
    void dispatch(request_ req)
    try {
        std::string method;
        req.method().convert(&method);
        if (method == "id") {
            id(req);
        } else if (method == "name") {
            name(req);
        } else if (method == "err") {
            msgpack::type::tuple<> params;
            req.params().convert(&params);
            err(req);
        } else {
            req.error(msgpack::rpc::NO_METHOD_ERROR);
        }
    }
    catch (msgpack::type_error& e) {
        req.error(msgpack::rpc::ARGUMENT_ERROR);
        return;
    }
    catch (std::exception& e) {
        req.error(std::string(e.what()));
        return;
    }
    void id(request_ req) {
        req.result(1);
    }
    void name(request_ req) {
        req.result(std::string("name"));
    }
    void err(request_ req) {
        req.error(std::string("always fail"));
    }
};
int main() {
    // { run RPC server
    msgpack::rpc::server server;
    std::auto_ptr<msgpack::rpc::dispatcher> dispatcher(new Server);
    server.serve(dispatcher.get());
    server.listen("0.0.0.0", 18811);
    server.start(1);
    // }
    msgpack::rpc::client c("127.0.0.1", 18811);
    int64_t id = c.call("id").get<int64_t>();
    std::string name = c.call("name").get<std::string>();
    std::cout << "ID: " << id << std::endl;
    std::cout << "name: " << name << std::endl;
    return 0;
}
Output
ID: 1
name: name
You can find more complicated examples here: https://github.com/msgpack/msgpack-rpc/tree/master/cpp/test
I'm using XmlRpc C++ for Windows found here
Really easy to use :) The only drawback is that it is only a client!
There's also Microsoft Message Queuing (MSMQ), which is fairly straightforward to use when all processes are on the local machine.
The simplest solution for interprocess-communication is to use the filesystem. Requests and responses can be written as temp files. You can work out a naming convention for request and response files.
This will not give you the best performance, but maybe it will be good enough.
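A sketch of what the file-based approach could look like is below. The function names and the ".tmp" naming convention are my own assumptions. The request is written to a temporary name first and then renamed, so the reader never sees a half-written file. Note that std::rename replaces an existing destination on POSIX systems but fails on Windows if the destination already exists, so each request should get a fresh file name.

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

// Writes `payload` to `path` via a temporary file and a rename.
// Returns false if the file could not be written or renamed.
bool write_request(const std::string& path, const std::string& payload) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp.c_str(), std::ios::binary | std::ios::trunc);
        if (!out) {
            return false;
        }
        out << payload;
        out.flush();
        if (!out) {
            return false;
        }
    }   // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Reads a request file into `payload` and deletes it afterwards.
// Returns false if no request file exists at `path`.
bool read_request(const std::string& path, std::string& payload) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) {
        return false;
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();
    payload = buffer.str();
    in.close();
    std::remove(path.c_str());   // consume the request
    return true;
}
```

The server would poll a known directory for request files, which is where the performance cost comes from.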