I'm designing a web crawler with C++,but there is a web page asking me "Do you at least 18 years of age?" when I first fetch the web page by using URLDownloadToFileW,and of course I must click YES.
In javascript,I can use document.getElementsByTagName('button')[0].click(); to simulate a button click,so is there any other way to solve such problem with C++?
That is not really easy to do, but if you want to do it, you need several requests.
What the click (i.e. document.getElementsByTagName('button')[0].click(); in JavaScript) does is to trigger an associated click event. Your first step should be to find the event handler code and take a look into it. The event may for example send another (AJAX) request to the website. If that is the case, you have to perform the request in C++ in your crawler, too. Many sites also use cookies to store the user's answer to such questions (or at least the fact that the user selected "I'm at least 18 years of age"). So your crawler has to accept such cookies, too, and store them between requests.
I am aware of the fact that this answer is rather general, but it is difficult to give a more specific answer without knowing the exact website you are crawling.
Alternative approach: Instead of writing a crawler that downloads the website content directly, you might utilize frameworks like Selenium. Selenium allows to automate a browser and is intended to be used for testing, but one could also use it to crawl a website. The advantage is that you can also perfom things like clicks easier in the browser, given you know the ID or the XPath of the element you want to click. This might be easier to do than a "classical" crawler.
However, you should be aware that many websites have some kind of protection against flooding them with requests in place. That is, if you intent to do a lot of request to the same server in a short amount of time, you might get blocked from the server. So try to limit the requests to the absolute minimum.
Related
I am sorry in advance; I am just learning Web development and my knowledge of it is quite limited.
I will describe my problem first.
I have relatively large amount of data (1.8-2 GB), which should be hidden from a public web access. However, a user should be able to request via url call a specific small subset of data and see it on his / her webpage.
Ideally, I would like to write a program on a web server. Let's call it ./oracle, which stores the large amount of data in primary memory.
Each web user should be able to make a specific string calls to oracle and see oracle'sresponse on a web page as html elements.
There should only one instance of oracle, and web users should make asynchronous calls to it.
Can I accomplish the above task with FastCGI or any other protocols?
If yes could you please explain which tools / protocols should I use / learn?
I would recommend setting up an Apache server because it's very common and you'll be able to find a lot of answers to any specific questions here on StackOverflow already.
You could also look into things like http://Swagger.io which can help you generate your API.
Unfortunately, everything past this really depends on what you use to set up your server. Big picture though:
You'll need to open up a port to listen to incoming requests
You'll need to have requests include the parameters they want to send to oracle
You could accomplish this the URI, like localhost/oracle-request?PARAMETER="foo"
You could alternatively use JSON in the body of the http request
Again, this largely depends on how you set up step 1
You'll need to route those requests to the oracle
This implementation depends entirely on step 1
You'll need to capture the output from the oracle and return it to the user
Once you decide on how you want to set up your server, feel free to edit your question and we may be able to provide more specific help.
I need to write an API which would provide access to data being served as HTML documents from a web server. I need for my users to be able to perform queries over the data.
Say on a web site there is a page which lists items and their owners. Then there is additional set of profile pages for owners which for each owner provide information about their reputation. An example query I may need to answer is "Give me ID's and owners of all items submitted in 2013 whose owners have reputation of at least 10".
Given a query to answer, I need to be able to screen scrape only the parts of the web site I need for answering the query at hand. And ideally cache the obtained information for future use with new queries.
I have no problem writing the screen scraping part, but I am struggling with designing the storage/query/cache part. Is there something about Clojure/Datomic that makes it an especially suitable technology choice for this kind of processing of data? I have been pointed in this direction before.
It seems a nice challenge but not sure about a few things: a) would you like to expose to your users a Datalog query box and so make them learn datalog-like syntax? b) what exact kind of results do you wish to cache, raw DB responses, html fomatted text, json ?
Anyway I suggest you to install and play a little bit with the Datomic console to get a grasp if you didn't before as it seems to me the more close idea to what you want to achieve atm https://www.youtube.com/watch?v=jyuBnl0XQ6s http://blog.datomic.com/2013/10/datomic-console.html
For the API I suggest you to use http://clojure-liberator.github.io/liberator/ as it provides sane defaults to implement REST services and let you focus on your app behaviour
To start off, I am extremely sorry if my question is not clear but I have very little knowledge about web services in general and the vast nature of varying available information has driven me crazy over the past few weeks. So please do bear with me.
Summary: I want to create a live score update app for android. (I haven't added android as a tag because I do know how to retrieve data from say twitter's JSON api.) However, like the twitter JSON api, I want to be able to add(POST maybe?) data to the Apache 7.0 service that I have running. I then want the app to be able to be able to retrieve this data that I have posted.
I had asked a more generic question earlier and I was told that I should look up some api's. I did that but I have still not been unable to make a break through.
So my questions is:
Is setting up an API on my local web service the correct way to do this?
If so, how can I setup an API that will return JSON objects to the Android app. Also, I would need to be able to constantly update this API with new data.
Additionally, would I also need to setup a database for all this?
Any links to well explained matter would be appreciated too.
Note: I would like to carry this out using a RESTful Web Service through Jersey and use JSON Objects during retrieval.
Again, I am sorry about my terrible knowledge with web services in general despite trying my best to research a lot. The best I could do was get my RESTful Web to respond to a GET with some pre-defined text that I had set in Eclipse.
Thanks.
If I understand you correctly, what you try to do is something like this:
There will be a match or multiple matches of some sort. Whenever a team/player scores someone (i.e. you) will use the app to update the score. People who previously subscribed to the match, will be notified and see the updated score.
Even though I'm not familiar with backends based on Java, the implementation should be fairly similar to other programming languages.
First of all a few words to REST in general. REST is generally needed, when you need to share information between multiple devices and or users. This seems to be the case here. To implement the REST you are going to need an API of some sorts. Within the web APIs are implemented by webservers answering to certain predefined HTTP Requests.
Thus setting up an API on a web server is the correct way.
Next a few words on databases. A database is generally needed, if you want to store information persistently. This might, or might not be what you are planning to do. If there are just going to be a few matches at the same time and you don't care about persistence of the data, you can use Java to store a collection of match objects in memory. I'm just saying it is possible, not that it is a good idea. Once your server crashes or you run out of memory due to w/e reason, data is going to be lost. (Of course within the actual implementation you want to cache data for current matches in some way and keeping objects in memory is way to do so).
I'd recommend to use a database.
Within the database, you can then store and access information about the matches like the score, which users subscribed, who played, etc.
JSON is just a way to represent the data/objects that will be shared between the server and the client. You can use JSON to encode request and response data/bodies.
The user has to be informed about the updated score. There are two basic ways to do so. Push or Pull. With pull, the client will check for updated scores after fixed intervals or actions. With push, the server will notify the client about changed scores which will cause him to update the information. Since you are planning on doing a live application and using Java anyways, push seems to be the better way to go.
Last but not least let's have a look at a possible implementation using
Webserver (API endpoints + database)
Administrator (keeps score updated)
User (receives updates)
We assume that the server will respond to HTTP Requests (POST#/api/my-endpoint) with JSON-Objects.
Possible flow
1)
First the administrator creates a match
REQUEST
POST # /api/matches
body: team1=someteam&team2=someotherteam
The server now will create a match object and store it in the database. The response will contain information about the object and whether the action was successful.
2)
The user asks for a list of matches
REQUEST
GET # /api/matches/curret
The response will be a JSON object containing a list of current matches.
RESPONSE
{
matches: [
{id: 1, teams:...}, ...
]
}
3)
(If push)
A user subscribes to a match
REQUEST
GET # /api/SOME_MATCH_ID/observe
The user will now be added as an observer for the match. Again, the response contains information about whether the action was successful or not.
4)
The administrator updates a score
REQUEST
UPDATE # /api/SOME_MATCH_ID
body: team1scored...
The score now gets update on the server (in memory/database) and the user will be notified about the updated score.
5)
The user gets the updated score
REQUEST
GET # /api/SOME_MATCH_ID
RESPONSE
... (Updated score in some way)
I need to design and implement a Java web application that can be used by multiple users at the same time. The data that is handled by this application is going to be huge and may take about 5 minutes for a page to display the results(database records).
I had designed this application using HTML, Servlets and JSP. But when two users would try to get the records, only one user was able to view the results while the other faced an error.
I always thought a web application would take care of handling multiple users but this is not the case.
Any insights on this would be highly appreciated.
Thanks.
I always thought a web application would take care of handling multiple users but this is not the case.
They do if they're written correctly. Obviously yours is not. That's all we can tell you unless you give more information, most importantly details of the error shown to the second user.
One possibility is that everything is OK on the web layer but your DB access for the first user causes an exclusive lock so that the second user cannot access the data at the same time. This could be fixed by using non-exclusive read locks. How to do that depends mainly on what DB you're using.
Getting concurrency right requires you to choose the correct tools and use them correctly. It doesn't just happen magically because it's a web app.
What are are using to develop this web-application? If you are developing it in your own way from the start I must say you are trying to re-invent the same wheel which has been already created and enhanced by very solid frameworks.
I suggest you analyze your requirements thoroughly and study some available frameworks. Let them handle the things like multi threading and other aspects in the best possible manner.
Handling multiple request at a time is a container work and as an application developer we have to concentrate how we are handling and processing those requret being forwarded by the container.
I must suggest you to get some insight how web-application work and how request -response cycle happens
I want to trace user's actions in my web site by logging their requests to database as plain text in Django.
I consider to write a custom decorator and place it to every view that I want to trace.
However, I have some troubles in my design.
First of all, is such logging mecahinsm reasonable or because of my log table will be enlarging rapidly it causes some preformance problems ?
Secondly, how should be my log table's design ?
I want to keep keywords if the user call search view or keep the item's id if the user call details of item view.
Besides, IP addresses of user's should be kept but how can I seperate users if they connect via single IP address as in many companies.
I am glad to explain in detail if you think my question is unclear.
Thanks
I wouldn't do that. If this is a production service then you've got a proper web server running in front of it, right? Apache, or nginx or something. That can do logging, and can do it well, and can write to a form that won't bloat your database, and there's a wealth of analytical tools for log analysis.
You are going to have to duplicate a lot of that functionality in your decorator, such as when you want to switch it on or off, or change the log level. The only thing you'll get by doing it all in django is the possibility of ultra-fine control, such as only logging views of blog posts with id numbers greater than X or something. But generally you'd not want that level of detail, and you'd log everything and do any stripping at the analysis phase. You've not given any reason currently why you need to do it from Django.
If you really want it in a RDBMS, reading an apache log file into Postgres or MySQL or one of those expensive ones is fairly trivial.
One thing you should keep in mind is that SQL databases don't offer you a very good writing performance (in comparison with reading), so if you are experiencing heavy loads you should probably look for a better in-memory solution (eg. some key-value-store like redis).
But keep in mind, that, especially if you would use a non-sql solution you should be aware what you want to do with the collected data (just display something like a 'log' or do some more in-deep searching/querying on the data).
If you want to identify different users from the same IP address you should probably look for a cookie-based solution (if you are using django's session framework the session's are per default identified through a cookie - so you could just simply use sessions). Another solution could be doing the logging 'asynchronously' via javascript after the page has loaded in the browser (which could give you more possibilities in identifying the user and avoid additional load when generating the page).