C++ web crawler [closed]

C++ web crawler [closed] - c++

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am experimenting and attempting to make a minimal web crawler. I understand the whole process at a very high level. So getting into the next layer of details, how does a program 'connect' to different websites to extract the HTML?
Am I using Sockets to connect to servers and sending http requests? Am I giving commands to the terminal to run telnet or ssh?
Also, is C++ a good language of choice for a web crawler?
Thanks!

Also, is C++ a good language of choice for a web crawler?
Depends. How good are you at C++.
C++ is a good language to write an advanced high speed crawler in, because of its speed (and you need that to processes the HTML pages). But it is not the easiest language to write a crawler with so probably not a good choice if you are experimenting.
Based on your question you don't have the experience to write an advanced crawler so are probably looking to build a simple serial crawler. For this speed is not a priority as the bottleneck is the download of the page across the web (not the processing of the page). So I would pick another language (maybe python).

Short answer, no. I prefer coding in C++ but this instance calls for a Java application. The API has many html parsers plus built in socket protocols. This project will be a pain in C++. I coded one once in java and it was somewhat blissful.
BTW, there are many web crawlers out there but I assume you have custom needs :-)

If you plan to stick with C++, then you should consider using the libcurl library, instead of implementing the HTTP protocol from scratch using sockets. There are C++ bindings available for that library.
From curl's webpage:
libcurl is a free and easy-to-use client-side URL transfer library,
supporting DICT, FILE, FTP, FTPS, Gopher, HTTP, HTTPS, IMAP, IMAPS,
LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, Telnet
and TFTP. libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP
uploading, HTTP form based upload, proxies, cookies, user+password
authentication (Basic, Digest, NTLM, Negotiate, Kerberos), file
transfer resume, http proxy tunneling and more!
libcurl is highly portable, it builds and works identically on
numerous platforms, including Solaris, NetBSD, FreeBSD, OpenBSD,
Darwin, HPUX, IRIX, AIX, Tru64, Linux, UnixWare, HURD, Windows, Amiga,
OS/2, BeOs, Mac OS X, Ultrix, QNX, OpenVMS, RISC OS, Novell NetWare,
DOS and more...
libcurl is free, thread-safe, IPv6 compatible, feature rich, well
supported, fast, thoroughly documented and is already used by many
known, big and successful companies and numerous applications.

Related

How to use gssapi kerberos in c / c++ client server cross-platform programs? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I had to "sporadically" work with Heimdal / MIT Gssapi for kerberos authentication over past couple of years. I had to build an application that was to be used as a web-service running on a Linux box, and serve client applications like browsers, running on Windows and/or Linux Desktops and Workstations. Surely not the easiest of beasts to tame. Eventually when summarizing my work, I could record that the difficulties emanated due to challenges in multiple dimensions. Getting started with gssapi programming is truly a challenge just because of poor documentation, and practically non-existant tutorials. Googling mostly results in either some theoretical discussion on what's kerberos, or leads to content written with presumption that you already know everything besides some particular semantic issue.
Some really good hacks around here contributed to help me, I therefore suppose it would be a good idea to summarize the stuff, from a developer's perspective, and share it here as some sort of a wiki, to give something back to this fantastic place, and fellow programmers.
Haven't really done a wiki like this before, and I am surely no authority on GSSAPI nor Kerberos, so please be kind, but more than that please contribute and correct my mistakes. Site Editors, I am counting on you to do your magic ;)
Getting your project completed successfully will require 3 specific things to be done correctly:
Setup of your test environment
Setup of your libraries
Your code
As I said already, such projects are beasts, just because all the three haven't been put together on the same page anywhere.
Ok So let's begin at the beginning.
Unavoidable theory for a newbie
GSSAPI helps a client application to provide credentials for a server to authoritatively identify the user. Extremely useful because the server applications can modulate their served responses if they wish to, as per the user. Very naturally therefore both - the client and the server applications must be kerberized, or as some would state kerberos-aware.
The kerberos based authentication, requires both the client and server applications, to be members of a Kerberos Realm. KDC (Kerberos Domain Controller) is the designated authority that rules the realm. Microsoft's AD servers are one of the most popularly experienced examples of a KDC, though you can of course be using a *NIX based KDC. But surely without a KDC there can be no Kerberos business at all. Desktops, Servers & workstations joined into the domain identify each other as long as all of them remain joined into the domain.
For your initial experiments, setup the client & server applications in the same realm.
Though Kerberos Authentication can surely be also used across realms by creating trusts between KDCs of these realms, or even merging keytabs from different KDCs that do not trust each other. Your code will not really need any change to accommodate such different and complex-sounding scenarios.
Kerberos Authentication basically works via "tickets (or tokens)". When a member joins the realm, the KDC "grants tokens" to each of them. These tokens are unique; time and FQDN are essential factors for these tickets.
Before you even think of the very first line of your code make sure you have got these two right:
Pitfall #1 When you setup your development and test environment, make sure everything is tested and addressed as FQDN. For example if you want to check connectivity, ping using FQDN, not IP. Needless to say therefore, they must necessarily have the same DNS service configuration.
Pitfall #2 Make sure all the host systems - that are running your KDC, client software, server software have the same time server. Time synchronization is something that one forgets, and realizes to be amiss after a lot of hair-splitting, and head-banging!
Both, the client and server applications NEED kerberos keytabs. So if your application is going to run in a *NIX host, and be a part of a Microsoft Domain, you have to get a kerberos keytab generated, before we start to look at the remaining preparatory steps for gss programming.
Step-by-Step Guide to Kerberos 5 (krb5 1.0) Interoperability at is an absolute must-read.
GSS-API Programming Guide is an excellent bookmark.
Depending upon your *NIX distribution you can install the headers & libraries for building your code. My suggestion however is to download the source and build it yourself. Yes, you might not get it right at one go, but it surely is worth the trouble.
Pitfall #3 Make sure that your application is running in an Kerberos aware environment.
I really learnt this the hard way, but maybe because I am not so smart. In my earliest stages of gssapi programming struggle, I had discovered that kerberos keytabs were absolutely necessary for making my application kerberos-aware. But I simply couldn't find anything about how to load these keytabs in my application. You know why?!! Because no such api exists!!!
Because: The application is to be run in an environment which is aware of the keytabs.
Ok, let me make this simple: Your application that is supposed to do the GSSAPI / Kerberos things has to run after you have set environment variable KRB5_KTNAME to the path where you have stored the keytabs. So either you do something like:
export KRB5_KTNAME=<path/to/your/keytab>
or make use of setenv to set KRB5_KTNAME in your application sufficiently before the very first line of your code that uses gssapi is run.
We are now ready to do the necessary things in the application's code.
I understand there are quite a few other aspects that must be reviewed by the application developer, to write and test an application. I know of a few environment variables, that can be important.
Can anybody please shed some more light upon that?

Building high-performance HTTP compliant upload server with Boost Asio or extending Nginx? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'd like to hear your guys opinion on architecture decision.
I'm building high-performance upload server. The purpose of server(s), in simple terms, is to receive huge amount of files from clients and save them on NAS.
Requirements are:
Files will be uploaded via HTTP POSTs in "multipart/form-data" encoding.
Each file will be at least few megabytes long.
Clients are browsers on computers and mobile clients.
Because at least 50% of clients will be mobile clients I expect these TCP/IP sessions to be pretty long.
It should not be affected by Slowloris attacks.
Target platform is Linux CentOS 6.2.
As server receives files it saves some info to database, performs authentication and etc.
My current approach is to implement everything (http parsing and etc) myself using Boost Asio. I like it because I have greater control on what happens with data. I'm thinking about writing it to file directly from socket using memory mapped files.
Would you say it might be easier to implement it as extension to Nginx ? Is Nginx a viable option given the requirements listed above?

Socket programing vs. web service? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I want to create a Mobile messaging service, but I don't know which is better to use a socket programming or a web service ?.
What are the concerns that I need to take in consideration when creating such service ? such as cost of connection .. etc.
If you need more details just tell me before voting the question down or to be closed !

Webservices are generally speaking "easier" to do, thanks to the tremendous interest in them and the support for them in developer tools and through libraries and frameworks.
However, especially if your payload is small (think messages the size of a typical SMS or tweet), the overhead you create with webservices is prohibitive: bytes sent over a wireless network like GPRS or UMTS are still very expensive, compared to bytes carried over cable or ADSL. And web services carry several layers of invisible info around that the end customer will also have to pay.
So, if your use case is based on short messages, I'd at least advise to do some bandwidth simulation calculations, and base your decision on bandwidth savings vs increased complexity of your app.
While looking at sockets, also have a look at UDP: if you can live with the fact that basically you throw someone a packet, and without designing some ack mechanism into your protocol you'll never be sure the message arrived, it's very efficient because there is no traffic to create and maintain a connection, and even pretty long messages can very well be transported inside 1 UDP packet.
EDIT based on comment:
stream socket: not sure how you define streams, but streams and messages are two very distinct concepts for me, a stream is a typically longer sequence of data being sent, whereas a message is an entity that's sent, and optionally acknowledged or answered by the receiver.
bandwidth simulation: the easiest way to understand what I'm talking about is to get Wireshark and add up everything that gets transported across the net to do a simple request of a very short string - you'll see several layers of administrative info (ie invisible, just there to make the different protocol layers work) that are all traffic paid for by the end user. Then, write a small mock service using UDP to transport the same message, or use a tool like netcat, good tutorial here, and add up the bytes that get transported. You'll see pretty huge differences in the number of bytes that are exchanged.
EDIT2, something I forgot to mention: mobile networks used to be open, transparent networks with devices identified by public IP addresses. There's a rapid evolution towards NATed mobile networks which has its impact on how devices inside and outside this "walled garden" can communicate (NAT traversal). You'll need to take this into account when designing your communication channel.
As for the use of streams for a chat application: it offers some conceptual advantages, but you can very well layer a chat app on top of UDP, look here or here

Django / Comet (Push): Least of all evils?

I have read all the questions and answers I can find regarding Django and HTTP Push. Yet, none offer a clear, concise, beginning-to-end solution about how to accomplish a basic "hello world" of so-called "comet" functionality.
First question (1): To what extent is the problem that HTTP simply isn't (at least so far) made for this? Are all the potential solutions essentially hacks?
2) What's the best currently available solution?
Orbited?
Some other Twisted-based solution?
Tornado?
node.JS?
XMPP w/ BOSH?
Some other solution?
3) How does nginx push module play into this discussion?
4) Which of these solutions require replacement of the typical mod_wsgi / nginx (or apache) deployment model? Why do they require this? Is this a favorable transition in any case?
5) How significant are the advantages of using a solution that is already in Python?
Alex Gaynor's presentation from PyCon 2010, which I just watched on blip.tv, is amazing and informative, but not terrifically specific on the current state of HTTP Push in Django. One thing that he said that gave me some confidence was this: Orbited does a good job of abstracting and simulating the concept of network sockets. Thus, when WebSockets actually land, we'll be in a good place for a transition.
6) How does HTML5 Websockets differ from current solutions? Is Gaynor's assessment of the ease of transition from Orbited accurate?

I'd take a look at evserver (http://code.google.com/p/evserver/) if all you need is comet.
It "supports [the] little known Asynchronous WSGI extension" and is build around libevent. Works like a charm and supports django. The actual handler code is a bit ugly, but it scales well as it really is async io.
I have used evserver and I'm currently moving to cyclone (tornado on twisted) because I need a little more than evserver offsers. I need true bidirectional io (think socket.io (http://socket.io/)) and while evserver could support it I thought it was easier to reimplement tornado's socket.io in cyclone (I opted for cyclone instead of tornado as cyclone is build on twisted, thus allowing for more transports that aren't implemented in twisted (i.c. zeromq)) Socket.io supports websockets, comet style polling, and, much more interseting, flash based websockets. I think that in most practical situations websockets + flash based websockets are enough to support 99% (according to adobe flash penetration is about 99% (http://www.adobe.com/products/player_census/flashplayer/version_penetration.html)) of a websites visitors (only people not using flash need to fallback to one of socket.io its (less perfomant and resource hogging) backup transports)
Be aware though websockets are not an http transport thus putting them behind http based proxies (e.g haproxy in http mode) breaks the connection. Better serve them on an alternate ip or port so you can proxy in tcp mode (e.g haproxy in tcp mode).
To answer your questions:
(1) If you don't need a bidirectional transport longpolling based solutions are good enough (all they do is keep a connection open). Things do get iffy when you need your connection to be statefull or you need to be able to both send and receive data. In the latter case socket.io helps. However websockets are made for this scenario and with the support of flash its available to most of a websites vistors (via socket.io or standalone, however socket.io has the added benefit of backup transports for those people not wanting to install flash)
(2) if all you need is push, evserver is your best bet. It uses the the same javascripts on the client side as orbited. Else look at socket.io (this also needs a supporting server, the only python one available is tornado.)
(3) It's just one other server implementation. If i read it correctly it's push only. pushing data to a client is done by making http equest from your app to the nginx server. (nginx then takes care they reach the client). If you're inteersted in this, look at mongrel2 (http://mongrel2.org/home) it not only has handlers for longpolling but also for websockets.(instead of making http request to mongrel, this time you use zeromq handlers to get data to your mongrel server) (Do take note of the developer's lack of enthusiasm for websockets and flash based websockets. Especially taking into account that the websocket protocol tends to evolve you might, at some point, need to recode mongrel2's websocket support yourself keep having support for websockets)
(4) All solutions except evserver replace wsgi with something else. Though most servers also have some wsgi support ontop of this "something else". No matter what solution you choose be careful that one cpu intensive or otherwise io blocking request doesn't block the server. (you either need multiple instances or threads).
(5) Not very significant. All solutions depend on some custom handlers to push (and, if applicable, receive) data to the client. All solutions i mentioned allow these handlers to be written in python. If you want to use a completely different framework (node.js) then you have to weigh the ease of node.js (it's assumed to be easy, but it's also rather experimental, and i found very few libraries to be actually stable) against the convenience of using your existing code base and the available libraries (e.g. if your app needs a blog ther are plenty django blogs you could plug in, but none for node.js) Also don't stare yourself blind on performance stats. unless you plan to push dumb predefined data (what all benchmarks do) to the client you'll find that the actual processing of data adds much more overhead than even the worst async io implementation. (But you still want to use an async io based server if you plan to have many simultaneous clients, threading simply isn't meant to keep thousands of connections alive)
(6) websockets offer bidirectional communication, long polling/comet only pushes data but does not accept writes. (Socket.io simulates this bidirectional support by using two http requests, one to longpoll, one to send data. It tracks their interdependance by a (session) id that's part of both requests query string). flash based websockets are similar to real websockets (the difference is that their implementation is in the swf, not your browser). Also the websockets protocol does not follow the http protocol; longpolling/comet stuff does (technically the websocket client sends an upgrade request to websocket server, the upgraded protocol isn't http anymore)

There is support for WebSockets with django-websocket, but unfortunately there are major issues with it for getting it working; here's a quote from that page:
Disclaimer (what you should know when using django-websocket)
BIG FAT DISCLAIMER - right at the moment its technically NOT possible in any way to use a websocket with WSGI. This is a known issue but cannot be worked around in a clean way due to some design decision that were made while the WSGI stadard was written. At this time things like Websockets etc. didn't exist and were not predictable.
...
But not only WSGI is the limiting factor. Django itself was designed around a simple request to response scenario without Websockets in mind. This also means that providing a standard conform websocket implemention is not possible right now for django. However it works somehow in a not-so pretty way. So be aware that tcp sockets might get tortured while using django-websocket.
So at the moment, WSGI: no go; Django: hardly any go, even with django-websockets; see also a comment in the author's original announcement:
I can't say this looks like a good idea. You're doing long-lived connections in a way that is going to require threading. django-websocket requires threading setup, and won't work if you've got processes (because you'd just have too many processes) but threads won't scale for a lot of connections at the same time, either, so its just a false safety. You need an asynchronous platform for long-lived things, and I do this by doing my app in Django and my comet and websocket in Node.js
Personally if trying to use WebSockets (which I expect to be next year), I would try the combination of Twisted and Cyclone first. They're designed to cope with WebSockets, and scale well. If you write your code properly to remove unnecessary dependencies on Django, you should be able to use much of your code in a Twisted-based system. This is a very distinct advantage over using Node.js or Comet or any system in another language. You could also make a simple push
Finally, you could also just decide it's too hard and use an external service to provide the push support. That then becomes a matter of sending a simple JSON request to their servers instead of worrying about how to make the connection and how concurrency will work and things like that. Of course, you'll need to pay for it (though currently it may be free while in Beta), but you don't need to worry about implementation details; you won't have the full power of WebSockets that way though - just push support.

I can't believe it's been over six years since I asked this question.
Async with Django (and the associated network traffic, eg websockets) has been an itch for many of us in the community. I have taken these past few years, to among other things, scratch this itch.
hendrix
hendrix is a WSGI/ASGI conatiner that runs on Twisted. It has been a project mainly driven by 5 enthusiasts, with help and funding from some visionary organizations. It is in production today at dozens, but not hundreds, of companies.
I'll leave it to you to read the documentation to see why it's the best solution to this problem, but a few quick highlights:
it's based on Twisted, requires no knowledge or use of Twisted internals, but leaves them all available
It "just works" in the sense that you don't need any special server or process configuration to do async and socket traffic from within your Django (or Pyramid, or Flask) app
It is very likely to be forward-compatible with ASGI, the Django Channels standard, and is in some meaningful ways the first ASGI container
It ships with simple APIs that maintain the flow of your view logic and are easy to unit test.
Please see this talk that I gave at Django-NYC (at the Buzzfeed offices) for more information about why I think this is the best answer to this question.

Re question #2, I recently was given a tour of the internals of a Django app that uses Comet heavily, and Orbited was the solution they chose.

Tool to monitor HTTP, TCP, etc. Web Service traffic [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
What's the best tool that you use to monitor Web Service, SOAP, WCF, etc. traffic that's coming and going on the wire? I have seen some tools that made with Java but they seem to be a little crappy. What I want is a tool that sits in the middle as a proxy and does port redirection (which should have configurable listen/redirect ports). Are there any tools work on Windows to do this?

For Windows HTTP, you can't beat Fiddler. You can use it as a reverse proxy for port-forwarding on a web server. It doesn't necessarily need IE, either. It can use other clients.

Wireshark does not do port redirection, but sniffs and interprets a lot of protocols.

You might find Microsoft Network Monitor helpful if you're on Windows.

Wireshark (or Tshark) is probably the defacto standard traffic inspection tool. It is unobtrusive and works without fiddling with port redirecting and proxying. It is very generic, though, as does not (AFAIK) provide any tooling specifically to monitor web service traffic - it's all tcp/ip and http.
You have probably already looked at tcpmon but I don't know of any other tool that does the sit-in-between thing.

I tried Fiddler with its reverse proxy ability which is mentioned by #marxidad and it seems to be working fine, since Fiddler is a familiar UI for me and has the ability to show request/responses in various formats (i.e. Raw, XML, Hex), I accept it as an answer to this question. One thing though. I use WCF and I got the following exception with reverse proxy thing:
The message with To 'http://localhost:8000/path/to/service' cannot be processed at the receiver, due to an AddressFilter mismatch at the EndpointDispatcher. Check that the sender and receiver's EndpointAddresses agree
I have figured out (thanks Google, erm.. I mean Live Search :p) that this is because my endpoint addresses on server and client differs by port number. If you get the same exception consult to the following MSDN forum message:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2302537&SiteID=1
which recommends to use clientVia Endpoint Behavior explained in following MSDN article:
http://msdn.microsoft.com/en-us/magazine/cc163412.aspx

I've been using Charles for the last couple of years. Very pleased with it.

I second Wireshark. It is very powerful and versatile.
And since this tool will work not only on Windows but also on Linux or Mac OSX, investing your time to learn it (quite easy actually) makes sense. Whatever the platform or the language you use, it makes sense.
Regards,
Richard
Just Programmer
http://sili.co.nz/blog

I find WebScarab very powerful

Check out Paros Proxy.

JMeter's built-in proxy may be used to record all HTTP request/response information.
Firefox "Live HTTP headers" plugin may be used to see what is happening on the browser side when sending/receiving request.
Firefox "Tamper data" plugin may be useful when you need to intercept and modify request.

I use LogParser to generate graphs and look for elements in IIS logs.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js