Very Simple C++ Web Crawler/Spider? - c++

I am trying to write a very simple web crawler/spider app in C++. I have been searching Google for a simple example to understand the concept. I found this:
spider_simpleCrawler
However, it is difficult for me to understand, since I started learning C++ only about a month ago.
This is, for example, what I'm trying to do:
Enter the URL: www.example.com (I will use bash's wget to get the contents/source code),
Look for an "a href" link, and then store it in some data file.
Is there a simpler tutorial or guide on the Internet?

All right, I'll try to point you in the right direction. Conceptually, a webcrawler is pretty simple. It revolves around a FIFO queue data structure which stores pending URLs. C++ has a built-in queue structure in the standard library, std::queue, which you can use to store URLs as strings.
The basic algorithm is pretty straightforward:
1. Begin with a base URL that you select, and place it on the top of your queue
2. Pop the URL at the top of the queue and download it
3. Parse the downloaded HTML file and extract all links
4. Insert each extracted link into the queue
5. Go to step 2, or stop once you reach some specified limit
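To make that concrete, here is a rough sketch of that loop in C++. It is only illustrative: it shells out to wget (as you suggested), uses a naive string search for href="..." attributes, and the base URL, output file name, and page limit are placeholders.

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <queue>
    #include <set>
    #include <sstream>
    #include <string>

    // Very naive link extraction: scans the raw HTML for href="..." occurrences.
    // Real pages will break this; it only illustrates the idea.
    std::set<std::string> extractLinks(const std::string& html) {
        std::set<std::string> links;
        const std::string marker = "href=\"";
        std::size_t pos = 0;
        while ((pos = html.find(marker, pos)) != std::string::npos) {
            pos += marker.size();
            std::size_t end = html.find('"', pos);
            if (end == std::string::npos) break;
            links.insert(html.substr(pos, end - pos));
            pos = end;
        }
        return links;
    }

    int main() {
        std::queue<std::string> pending;          // FIFO of URLs still to visit
        pending.push("http://www.example.com");   // step 1: the base URL

        const int limit = 10;                     // stop after this many pages
        int crawled = 0;

        while (!pending.empty() && crawled < limit) {
            std::string url = pending.front();
            pending.pop();

            // Step 2: download the page by shelling out to wget, as the question suggests.
            std::string cmd = "wget -q -O page.html \"" + url + "\"";
            if (std::system(cmd.c_str()) != 0) continue;

            // Read the downloaded file into a string.
            std::ifstream file("page.html");
            std::stringstream buffer;
            buffer << file.rdbuf();

            // Steps 3 and 4: extract links and push them onto the queue.
            for (const std::string& link : extractLinks(buffer.str()))
                pending.push(link);

            std::cout << "Crawled: " << url << std::endl;
            ++crawled;
        }
    }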
Now, I said that a webcrawler is conceptually simple, but implementing it is not so simple. As you can see from the above algorithm, you'll need: an HTTP networking library to allow you to download URLs, and a good HTML parser that will let you extract links. You mentioned you could use wget to download pages. That simplifies things somewhat, but you still need to actually parse the downloaded HTML docs. Parsing HTML correctly is a non-trivial task. A simple string search for <a href= will only work sometimes. However, if this is just a toy program that you're using to familiarize yourself with C++, a simple string search may suffice for your purposes. Otherwise, you need to use a serious HTML parsing library.
There are also other considerations you need to take into account when writing a webcrawler, such as politeness. People will be pissed and possibly ban your IP if you attempt to download too many pages, too quickly, from the same host. So you may need to implement some sort of policy where your webcrawler waits for a short period before downloading each site. You also need some mechanism to avoid downloading the same URL again, obey the robots exclusion protocol, avoid crawler traps, etc... All these details add up to make actually implementing a robust webcrawler not such a simple thing.
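A minimal version of those two safeguards (don't fetch the same URL twice, and pause between requests) might look like the following sketch; the one-second delay and the in-memory visited set are arbitrary choices for illustration.

    #include <chrono>
    #include <queue>
    #include <string>
    #include <thread>
    #include <unordered_set>

    // Sketch of the two simplest "politeness" safeguards mentioned above.
    void crawlPolitely(std::queue<std::string>& pending) {
        std::unordered_set<std::string> visited;   // URLs we have already downloaded

        while (!pending.empty()) {
            std::string url = pending.front();
            pending.pop();

            // Skip anything we have seen before so we never fetch it twice.
            if (!visited.insert(url).second)
                continue;

            // ...download url and enqueue its links here, as in the earlier sketch...

            // Wait a bit before the next request so we don't hammer the host.
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }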
That said, I agree with larsmans in the comments. A webcrawler isn't the greatest way to learn C++. Also, C++ isn't the greatest language to write a webcrawler in. The raw-performance and low-level access you get in C++ is useless when writing a program like a webcrawler, which spends most of its time waiting for URLs to resolve and download. A higher-level scripting language like Python or something is better suited for this task, in my opinion.

Check out this web crawler and indexer written in C++ at: Mitza web crawler
The code can be used as a reference. It is clean and provides a good starting point for web crawler coding. Sequence diagrams can be found on the pages linked above.

A web-crawler has the following components in it:
Downloading an HTML file
Extracting links from it
Pushing all the links into a queue
Web indexing and ranking, if necessary
Repeating this with the front element of the queue
This one has it all: Web-Crawler.
It is very helpful for beginners who want to gain a complete understanding of a web-crawler, along with the concepts of multithreading and web-ranking.

Related

GET/POST using boost-asio for a REST API

I am new to network programming and got started with Boost, REST, etc. I wanted to know if I could use REST APIs with boost-asio, such as using Google Maps' Distance Matrix in my program. But I couldn't find proper documentation for Boost.
I don't expect you to give me complete working code; rather, I need an idea or some sort of guidance as to what to do, where to find things, etc. Also, this program will be purely in C++ (I don't know if it can even be done in C++, given this answer: https://stackoverflow.com/a/28736632/4846740). Thanks
Note: This post was not very helpful: Integrating Google maps with C++ Program
You'd want to either:
use a REST library (e.g. cpprestsdk or some other framework like autobahn-cpp?), or
at least write the REST requests on top of an HTTP library, such as Boost Beast
The library examples show you everything you need to send requests and receive responses. If you want, you can use additional libraries like https://github.com/0xdead4ead/BeastHttp to make it even more high-level/instant.
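As a rough illustration of the second option, here is a minimal synchronous GET with Boost.Beast. It deliberately uses plain HTTP on port 80 to keep things short; the real Distance Matrix API requires HTTPS, so in practice you would wrap the stream with boost::asio::ssl. The host, target path, and API key below are placeholders.

    #include <boost/asio/ip/tcp.hpp>
    #include <boost/beast/core.hpp>
    #include <boost/beast/http.hpp>
    #include <boost/beast/version.hpp>
    #include <iostream>
    #include <string>

    namespace beast = boost::beast;
    namespace http  = beast::http;
    namespace net   = boost::asio;
    using tcp = net::ip::tcp;

    int main() {
        try {
            const std::string host   = "maps.googleapis.com";   // placeholder host
            const std::string target = "/maps/api/distancematrix/json"
                                       "?origins=Seattle&destinations=Boston&key=YOUR_API_KEY";

            net::io_context ioc;
            tcp::resolver resolver{ioc};
            beast::tcp_stream stream{ioc};

            // Resolve the host name and open a TCP connection on port 80.
            auto const results = resolver.resolve(host, "80");
            stream.connect(results);

            // Build and send a plain HTTP GET request.
            http::request<http::string_body> req{http::verb::get, target, 11};
            req.set(http::field::host, host);
            req.set(http::field::user_agent, BOOST_BEAST_VERSION_STRING);
            http::write(stream, req);

            // Read the response and print the (JSON) body.
            beast::flat_buffer buffer;
            http::response<http::string_body> res;
            http::read(stream, buffer, res);
            std::cout << res.body() << std::endl;

            beast::error_code ec;
            stream.socket().shutdown(tcp::socket::shutdown_both, ec);
        } catch (const std::exception& e) {
            std::cerr << "Error: " << e.what() << '\n';
        }
    }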
I would 100% recommend that you do not do this in C++. While I'm not a huge fan of Python, it's undoubtedly the hammer that is made for this nail. Check out BeautifulSoup, Mechanize, and Scrapy (+XPath) for really convenient ways of obtaining/parsing HTML, filling out web forms, and getting responses. Typically, unless you're doing realtime target tracking, you do not need the low latency you get from running everything in C/C++. You can get away with quarter-second, or even half-second, updates.
I'm not sure what you're trying to do, but I would say save yourself the headache, and just work with Python.

Ember Way to Add Rss Feed without third party widget, Front-end only

I am using Ember 3.0 at the moment. Wrote my first lines of code in ANY language about 1 year ago (I switched careers from something totally unrelated to development), but I quickly took to ember. So, not a ton of experience, but not none. I am writing a multi-tenant site which will include about 20 different sites, all with one Ember frontend and a RubyOnRails backend. I am about 75% done with the front end, now just loading content into it. I haven’t started on the backend yet, one, because I don’t have MUCH experience with backend stuff, and two, because I haven’t needed it yet. My sites will be informational to begin with and I’ll build it up from there.
So. I am trying to implement a news feed on my site. I need it to pull in multiple rss feeds, perhaps dozens, filtered by keyword, and display them on my site. I’ve been scouring the web for days just trying to figure out where to get started. I was thinking of writing a service that parses the incoming xml, I tried using a third party widget (which I DON’T really want to do. Everything on my site so far has been built from scratch and I’d like to keep it that way), but in using these third party systems I get some random cross domain errors and node-child errors which only SOMETIMES pop up. Anyway, I’d like to write this myself, if possible, since I’m trying to learn (and my brain is wired to do the code myself - the only way it sticks with me).
Ultimately, every google result I read says RSS feeds are easy to implement. I don’t know where I’m going wrong, but I’m simply looking for:
1. An “Ember-way” starting point.
2. Is this possible without a backend?
3. Do I have to use a third-party widget/aggregator?
4. Whatever else you think might help on the subject.
Any help would be appreciated. Here in New Hampshire, there are basically no resources, no meetings, nothing. Thanks for any help.
Based on the results I get back when searching on this topic, it looks like you’ll get a few snags if you try to do this in the browser:
CORS header issues (sounds like you’ve already hit this)
The joy of working with XML in JavaScript (that just might be sarcasm 😉, it’s actually unlikely to be fun)
If your goal is to do this as a learning exercise, then doing it in JavaScript/Ember will definitely help you learn lots of new things. You might start with this article as a jumping-off point: https://www.raymondcamden.com/2015/12/08/parsing-rss-feeds-in-javascript-options/
However, if you want to have this be maintainable for the long run and want things to go quickly and smoothly, I would highly recommend moving the RSS parsing system into your backend and feeding simple data out to Ember. There are enough gotchas and complexities to RSS feeds over time that using a battle-tested library is going to be your best way to stay sane. And loading that type of library up in Ember (while quite doable) will end up increasing your application size. You will avoid all those snags (and more I’m probably not thinking of) if you move your parsing back to the server ...

How to start working on QuickFix library

I have been given a project to develop an algorithmic trading system using C++ and the QuickFIX library. I searched Google for the QuickFIX library but didn't find any useful information.
Can anybody give me some information on where I should start?
You provide very little detail in your question, so I can only guess at a helpful approach. I have done what you are starting, in Python, and can give you some orientation. All the links Karl mentioned are crucial (you should pay special attention to the QuickFIX documentation on the config file), to which I would add FIXIMATE.
To do something like this in QF you need to answer a number of questions.
Logon. Figure out how to logon. Try to get a data dictionary from your counterparty. You don't want to be forced to modify your DD too much.
Interface. How will you tell QF to logon, logoff, exit terrible positions, and so forth? I use a command line tool (cmd2) that gives me this ability. Other people code GUI windows.
Message Cracking. Some versions of QF come with a cracker, but if you don't have it in C++ you will have to write your own so you can parse the incoming messages (a minimal application/cracker skeleton is sketched below).
Data Management. How will you save incoming market data, both in RAM and to disk for analysis later? How will you represent and monitor your positions, your working orders, your audit trail? Familiarize yourself with the ScreenLogFactory and FileLogFactory in QF.
Auxiliary Functions. You will need a lot of functions you will write yourself to help at all stages. Save them all in one place and organize them into categories for easy access.
Monitoring. How will you know if something goes wrong (or right) when you are not in front of your computer monitoring the algo? I launch a completely separate process which consumes messages via a queue and sends me texts and emails.
Risk. You don't want your machine sending 1000 orders to market in the blink of an eye. You need to code some checks that will veto bad orders as a final stage before they go out. Also some code that will tell you if you are in a position when you are supposed to be flat. This part is very important.
Strategy. You will need the ability to quickly ingest data, analyze it, and generate signals. For flexibility you should not design your strategy into your system, but you should design a strategy object which can support any strategy you come up with. Then you deploy those objects within your system.
Order handling. Your algo needs to know when and how to enter orders, cancel them, move stops, etc. It will need to deal with partial fills, and be able to support multiple order types.
This is just the beginning, off the top of my head. It is a long road to do all by yourself with no help. Very interesting though, and rewarding.
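To give you a feel for the moving parts (the logon callbacks, message cracking, and the config file), here is a minimal QuickFIX initiator skeleton in C++. Treat it as a sketch only: the config file name is a placeholder, the session and data dictionary details live in that file, and older QuickFIX releases declare the callbacks with throw(...) exception specifications, so match the signatures in your installed headers.

    #include <iostream>
    #include <quickfix/Application.h>
    #include <quickfix/FileLog.h>
    #include <quickfix/FileStore.h>
    #include <quickfix/MessageCracker.h>
    #include <quickfix/SessionSettings.h>
    #include <quickfix/SocketInitiator.h>
    #include <quickfix/fix44/ExecutionReport.h>

    // Skeleton application: QuickFIX calls these hooks as sessions log on/off
    // and as admin/application messages flow in and out.
    class MyApplication : public FIX::Application, public FIX::MessageCracker {
        void onCreate(const FIX::SessionID&) {}
        void onLogon(const FIX::SessionID& id)  { std::cout << "Logon: "  << id << std::endl; }
        void onLogout(const FIX::SessionID& id) { std::cout << "Logout: " << id << std::endl; }
        void toAdmin(FIX::Message&, const FIX::SessionID&) {}
        void toApp(FIX::Message&, const FIX::SessionID&) {}
        void fromAdmin(const FIX::Message&, const FIX::SessionID&) {}
        void fromApp(const FIX::Message& message, const FIX::SessionID& sessionID) {
            crack(message, sessionID);   // dispatch to the typed onMessage() overloads
        }

        // Called by crack() for incoming FIX 4.4 execution reports.
        void onMessage(const FIX44::ExecutionReport& report, const FIX::SessionID&) {
            FIX::ClOrdID clOrdID;
            report.get(clOrdID);
            std::cout << "Execution report for order " << clOrdID.getValue() << std::endl;
        }
    };

    int main() {
        // "initiator.cfg" is a placeholder: your settings file names the sessions,
        // data dictionary, and counterparty host/port (see the QuickFIX config docs).
        FIX::SessionSettings settings("initiator.cfg");
        MyApplication application;
        FIX::FileStoreFactory storeFactory(settings);
        FIX::FileLogFactory logFactory(settings);
        FIX::SocketInitiator initiator(application, storeFactory, settings, logFactory);

        initiator.start();    // connect and log on as described in the config file
        std::cin.get();       // keep running until the user presses Enter
        initiator.stop();
    }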
You can find the QuickFIX downloads on the quickfixengine.org website here: http://www.quickfixengine.org/. From there, you can download either the source code or pre-built packages for Visual Studio 2010, 2012, and 2013.
Documentation for QuickFIX can be found in their documentation area here: http://www.quickfixengine.org/quickfix/doc/html/. The documentation includes compilation/installation instructions and a "Getting Started" section which discusses setting up a project and writing your first QuickFIX application.
If you wish to know more about the FIX protocol, you are advised to look at the FIX website here: http://www.fixtradingcommunity.org. There are specifications on that website that will give you information on the types of messages supported by FIX and how they should be used.

Best way of logging a user in C++

I am trying to get into C++ programming, so apologies if this is a bit of a stupid question.
I am attempting to create a program in C++ that is linked to a website via the database; that's all sorted. In this program, the user must log in to be able to use its features, and I've also managed to do this fine. My question is, what is the best way of storing that user's session so I can refer to their username, display that user's settings from the database, etc.?
I am unsure, but I don't think C++ has session options like PHP does, so I cannot do it that way. I did some googling before I posted this, spent all night trying to find a solution, and found nothing.
My knowledge of C++ is slim and this may sound like a more complicated or unnecessary route to take, but I was thinking of creating a txt file when the user logs in that stores that user's username, and then calling on it when I need to refer to their username for queries and such; then, when the user logs out or closes the program, the file gets deleted. Is that stupid? Forgive me if it is.
Is there better way to go about this?
Thanks for your time!
EDIT
I read your comments. If it needs to be a stand-alone application, like some sort of client, you could take a look at the C++ libraries I mentioned, but I'd use any higher-level language (Java or C# have good documentation and there are many tutorials for creating GUIs, if that's what you're looking for. I think even Python would make a good candidate).
If you really must use C++, your best bet would be to use an existing library to implement your web solution. POCO includes an HTTP server framework, and a library for sockets and other forms of low-level network programming. Boost ASIO can also serve your purposes. But this is hardly something I'd recommend to start learning programming, or C++ for that matter.
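If you do stay in C++, you also don't need the temporary-file approach from the question: session state can simply live in memory for the lifetime of the program. Here is a sketch, with made-up names, assuming a single-process client; a real application would want a cryptographically strong token and an expiry policy.

    #include <optional>
    #include <random>
    #include <string>
    #include <unordered_map>

    // Hypothetical in-memory session store: maps an opaque token to the
    // logged-in user's name. It lives only as long as the process, so nothing
    // has to be written to (or cleaned out of) a temp file.
    class SessionStore {
    public:
        std::string create(const std::string& username) {
            std::string token = makeToken();
            sessions_[token] = username;
            return token;
        }
        std::optional<std::string> user(const std::string& token) const {
            auto it = sessions_.find(token);
            if (it == sessions_.end()) return std::nullopt;
            return it->second;
        }
        void destroy(const std::string& token) { sessions_.erase(token); }

    private:
        static std::string makeToken() {
            static std::mt19937_64 rng{std::random_device{}()};
            return std::to_string(rng());   // fine for a toy; use a real CSPRNG in production
        }
        std::unordered_map<std::string, std::string> sessions_;
    };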
If you want to learn about web programming, then you should definitely take a look at other languages. PHP or ASP.NET come to mind. As you learn, you'll most likely also end up writing some form of JavaScript. You can find a lot of info out there; just Google for tutorials. A site to get started is w3Schools, but any site with tutorials will do. Good luck!

Web Design - Templates vs Include

I am currently developing a website. I would like to separate content and presentation. I am currently using a Dreamweaver Template to achieve this. However, I find that Dreamweaver's edit regions are very limiting in the design view. I have found that the same goal can be achieved by including the header and footer of my website.
What are the pros and cons of using includes rather than using templates?
First, if I were to rephrase your question, it's more like asking "Should I buy the wire frame of a kite, or buy the glue to stick together what I'm making?" And then you ask about the pros and cons of buying the wireframe versus buying the glue. There are far too many variables, as you can see...
And back to your question... At some point your template will use include files. And for a start, it's worth knowing what you're choosing between... Let's look at some basics.
Web design - usually refers to making websites that aren't really interactive. They don't have server-side elements, so most of the site is 'static' content. If that's your case, you're better off with Dreamweaver, particularly if you're not into HTML/CSS editing.
Web development/programming - starts off with something as elementary as mailing a form and goes up to highly interactive sites like Facebook. Here you'll need to use some server-side language, usually PHP, ASP, or JSP. The choices are many, but you've got to choose your own platform or combination of them.
Now to the second option (above). If for example, you were building a site using PHP, one of the nice things you'll do is to include your header, footer and side panels that need to be repeated across all pages. This way, you'll eliminate the need to re-write those sections. But if you were using a program like DreamWeaver, it does this duplication for you. Yes, it physically copy-pastes those sections into every file that needs it. Of course the end result may not be any different. But as a developer, you will be tied down to the DreamWeaver platform or for that matter, any other specific platform.
On the other hand, if you get used to working with an editor like NotePad++ or GEdit, you may switch between editors at any time. But you have the task of hand-coding everything from scratch. But then again, since you would use include files to bring in your headers and stuff, you save development time as well.
I don't know how much of html/css or php you know, but here's one of my demos to show you how to hand-code a site. This ain't complete but you should get an idea.
Link to the video introduction
Link to the video on youtube