I am planning to do a network analysis of the BMTC bus connectivity network, so I need to acquire data regarding bus routes. The best website as far as I know is
http://www.narasimhadatta.info/bmtc_query.html
Under the "search by route" option the whole list of routes is given; one can select any one of them and, on clicking "submit", it displays the detailed route. Previously, when I acquired data online, I capitalized on the fact that each item (in this case, each route number) led to a distinct URL, and I acquired the data from the source page using Python. But here, irrespective of the bus route, the final page always has the URL
http://www.narasimhadatta.info/cgi-bin/find.cgi
and its source page doesn't contain the route details!
I am only comfortable with Python and MATLAB. I couldn't figure out any means to acquire data from that website. If you can see something on the screen, technically one should be able to download it (at least that's what I believe). So can you please help me out with code that crawls through each bus route number automatically and downloads the route details?
I looked at the URL you mentioned. If you have a list of route numbers, you can use the following URL structure to extract data.
http://www.narasimhadatta.info/cgi-bin/find.cgi?route=270S
or
http://www.narasimhadatta.info/cgi-bin/find.cgi?route=[route number from your list]
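For example, here is a minimal Python sketch along those lines. It assumes the CGI script accepts the route number as a route GET parameter (as in the URLs above); the route numbers below are placeholders, and in practice you would first scrape the full list from the drop-down on the bmtc_query.html page.

import time
import urllib.parse
import urllib.request

BASE_URL = "http://www.narasimhadatta.info/cgi-bin/find.cgi"

# Placeholder route numbers; scrape the real list from the
# "search by route" drop-down on bmtc_query.html first.
route_numbers = ["270S", "500D", "201"]

for route in route_numbers:
    url = BASE_URL + "?" + urllib.parse.urlencode({"route": route})
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Save the raw HTML for each route; parse it later (e.g. with BeautifulSoup).
    with open("route_%s.html" % route, "w", encoding="utf-8") as f:
        f.write(html)
    time.sleep(1)  # be polite to the server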
We have a company program designed to help us get control over data. It has a feature to group all the applications of one Client. If I want to take a look at them, I click on the Client and I see a list of all applications made for him. Take a look at the picture below:
I was wondering if Microsoft Access can do the same. If yes, where should I start looking?
I did some internet searching and found no solution.
That is built in, and it is called a Subdatasheet. If you have relationships properly set between Clients and Orders, for instance, then when you open the Clients table you will see a small "+" allowing you to view the Orders of the current client. You may have to set the Subdatasheet Name property of the Clients table to "Orders" in this case.
If you want to work with forms, you can build a continuous form for Clients, then one for Orders, then insert the Orders subform in the Footer of the Clients form. Access might tell you that you can't do this; just ignore it, it works.
In Access that would simply be a continuous form with a filter. Typically opened from a list of clients, setting a filter for the applications of the selected client.
Unless I'm misunderstanding the question.
I have been working on a Django back-end whose main use case is the capability to store a given set of pictures with their related tags.
The current design foresees dedicated RESTful APIs for creating a new set, adding a picture to a given set, and associating tags to a given set: this results in distinct client calls.
For instance:
BEGIN the "create new set" transaction
create a new set and receive the set ID
upload the first picture of the set
upload the second picture of the set (and so on, depending on the total number of pictures...)
add the tags related to this newly added set
END the transaction
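To make that sequence concrete, here is a hypothetical client-side sketch in Python using the requests library; the endpoint paths and the request/response shapes are assumptions for illustration, not part of the actual design:

import requests

API = "https://example.com/api"  # placeholder base URL

# 1) create a new set and receive the set ID
resp = requests.post(API + "/sets/")
set_id = resp.json()["id"]  # assumes the set ID comes back as "id"

# 2) upload the pictures one by one
for path in ["pic1.jpg", "pic2.jpg"]:
    with open(path, "rb") as f:
        requests.post(API + "/sets/%s/pictures/" % set_id, files={"image": f})

# 3) add the tags related to the newly added set
requests.post(API + "/sets/%s/tags/" % set_id, json={"tags": ["holiday", "beach"]})

Each step above is an independent HTTP request, which is precisely what makes atomicity hard here.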
How can I commit/rollback such a transaction, knowing that it is split among different HTTP requests?
Am I having a design issue here? Should I favor a single cumulative HTTP request approach?
Please take into account that such a back-end is to be used with mobile devices which might suffer from temporary signal loss.
Any advice is welcome.
UPDATE:
Would it be convenient to use model versioning packages such as django-revisions to solve the issue?
According to Google Analytics, I had 5 visits from zero unique visitors. Is that a bug, or did I perhaps implement something incorrectly? Or has the data processing simply not finished yet (I created this view 2 days ago)?
The view is based on a custom Include filter that's supposed to include only traffic from any of three IP addresses. The regex I used for this is
62\.58\.32\.193|77\.172\.143\.12$|213\.125\.166\.98
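As a quick sanity check, the regex can be tested outside of Analytics; a small Python sketch (the three addresses come from the filter above, the fourth is a deliberate near miss):

import re

pattern = re.compile(r"62\.58\.32\.193|77\.172\.143\.12$|213\.125\.166\.98")

for ip in ["62.58.32.193", "77.172.143.12", "213.125.166.98", "77.172.143.120"]:
    print(ip, bool(pattern.search(ip)))  # the last one should print False

Note that the $ anchor applies only to the second alternative, since alternation binds more loosely than anchors.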
My best guess would be that it comes down to the way Google defines unique visitors. Sometimes I have visited my own website periodically and ended up showing up as a unique visitor (my site isn't so popular, so it's easy for me to track that). I would say it has to do either with the nature of the visits or with the way unique visitors are counted. According to Google, this is how they find unique visitors:
The other Unique Visitors metric calculation (Calculation #2) is based on the __utma cookie. Calculation #2 is used when segmenting the Audience Overview report or when viewing Unique Visitors over any dimension other than date. As such, Calculation #2 is used in custom reports to allow for the calculation of Unique Visitors over any dimension, such as browser, city, or traffic source.
source: https://support.google.com/analytics/answer/2992042?hl=en
Occasionally, there are problems with Google Analytics reporting. Check the product forums. For example, here is an issue that happened on Nov 11, 2013:
http://productforums.google.com/forum/#!topic/analytics/fsurDK8AOcY
This issue can also crop up when you are using the page dimension. Unique visitors are only assigned to the first page in a visit, as described here. But it doesn't seem like that is the case for you.
Finally, it's possible, analogous to the page-dimension situation, that unique visitors are only assigned to the first IP address that a visitor came from. If that is true, and the people who came to your site had previously come from a different IP address, they wouldn't show up as unique visitors in your filter.
I'm looking for a solution to an issue where large data sets force Ember to lock up the browser while it tries to process the data.
For pagination, I'm using tchak's handy pagination mixin to paginate the 13,000+ objects being loaded from a backend API.
The Ember Data objects contain an ID, one text attribute and several number attributes.
The problem is it takes close to a minute before the browser finishes processing the data, rendering the browser unusable in the meantime. Firefox even goes as far as to issue a warning that a script is using up all browser resources and suggests that script be terminated.
I've written my own pagination mixin that requests objects by range, i.e. items 10-25, and it works generally well except for one serious limitation: sorting. To sort the data, I need to make additional requests to the backend and reload the objects even if some of them have already been loaded.
I would love to be able to load all of the content upfront to simplify the process of sorting without doing additional requests to the backend API. I'm looking for guidance on how to tackle this issue but I'm open to an entirely alternative approach.
If nothing else, is it possible to reduce the resource footprint Ember places on the browser as it tries to load all 13k objects into the ArrayController?
I'm using Ember 1.0.0-pre2 with the latest Ember Data (currently at Revision 10).
On the backend is Rails 3.2.8.
Update: I sidestepped the issue by loading the data into an ArrayController property other than content. This brought the load times down from over a minute to only a few seconds. I then slice the requested number of items and load those into content. This works well for any number of items, at the cost of not being able to easily sort the data.
I suggest you take a look at Ember Table. The demo shows a table with 500,000 records and works very fast. Digging around the source code might help.
Can't you query a view from your db that handles the sorting? Pass the sort conditions in the query string: ?sortBy=name&sortAsc=true
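A rough server-side sketch of that idea in Python with sqlite3 (the backend here is actually Rails, so this is only to illustrate whitelisting the sort column and building the ORDER BY from the query-string parameters; the table and columns are invented):

import sqlite3

ALLOWED_SORT_COLUMNS = {"name", "created_at"}  # whitelist: never interpolate raw user input

def fetch_sorted_page(conn, sort_by="name", sort_asc=True, page=1, page_size=25):
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError("unsupported sort column: %r" % sort_by)
    direction = "ASC" if sort_asc else "DESC"
    sql = ("SELECT id, name, created_at FROM items "
           "ORDER BY %s %s LIMIT ? OFFSET ?" % (sort_by, direction))
    return conn.execute(sql, (page_size, (page - 1) * page_size)).fetchall()

# Demo setup and usage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT)")
conn.executemany("INSERT INTO items (name, created_at) VALUES (?, ?)",
                 [("item %02d" % i, "2013-01-%02d" % (i % 28 + 1)) for i in range(40)])
print(fetch_sorted_page(conn, sort_by="name", sort_asc=False, page=2, page_size=5))

This keeps sorting on the server, so the client only ever holds one page at a time.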
In SOA we should not be building or holding state (or designing dependencies) between client and server. This is understood. But what patterns can be followed in the case that a client wants to consume a real-time service that may return an open-ended number of 'rows'?
Web applications, which are similar to SOA but allow for state (sessions), have solved this with pagination. Pagination requires (in most cases, especially with SQL) that the server hold the data and that the client request the data in chunks.
If we were to consider pagination-like scenarios for web services, what patterns would these follow that would still allow the tenets of SOA to be adhered to (or as closely as possible)?
Some rules for the thinkers:
1) Backed by a SQL database (therefore there is no concept of a row number in a select set)
2) It is important to not skip a row or duplicate a row in a set during pagination
3) Data may be inserted and deleted at any time into the database by other clients
4) There is no need to consider the dataset a live (update-able) dataset
Personally, I think that 1 and 2 above already spell out the solution by constraining the solution space with the requirements.
My proposed solution would have the data (as much as is selected) stored in a read-only store/cache where it can be assigned a row number within the result set, allowing pagination to occur on this data snapshot. I would have infrastructure to store snapshots (servers, external caches, memcached or ehcache; this must scale quite large). The result of such a query would be a snapshot ID, and clients could retrieve the data from the snapshot using a snapshot API (web services) and the snapshot ID. Results would be processed in a read-only, forward-only manner, x records at a time, where x is something reasonable.
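A minimal Python sketch of that proposal; a plain dict stands in for the memcached/ehcache tier, and all names are made up:

import uuid

snapshots = {}  # stand-in for memcached/ehcache

def create_snapshot(rows):
    # Freeze the selected data; each row gets an implicit row number.
    snapshot_id = str(uuid.uuid4())
    snapshots[snapshot_id] = list(rows)  # read-only copy
    return snapshot_id

def read_page(snapshot_id, offset, count):
    # Read-only, forward-only access to one page of the snapshot.
    data = snapshots[snapshot_id]
    return data[offset:offset + count]

# Usage: run the query once, snapshot it, then page through the frozen copy.
sid = create_snapshot({"id": i} for i in range(100))
first_page = read_page(sid, 0, 25)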
Competing thoughts and ideas, criticisms or accolades would be greatly appreciated.
Paginated results in a Web Service is actually quite easy to achieve.
All you have to do is add two parameters to the web service call: Page Size, Page Number.
Page Size is the number of results to include in a page. Page Number is the number of the page of results you are looking for.
Your web service then goes back to the database (or cache), retrieves the results, figures out which results fit on the requested page, and returns only those results.
The client then has to make a single request per page of results they want from the service.
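The page arithmetic involved is trivial; a one-function Python sketch, assuming 1-based page numbers:

def page_slice(results, page_number, page_size):
    # Return only the results that fit on the requested page.
    start = (page_number - 1) * page_size
    return results[start:start + page_size]

# Usage: page 2 with a page size of 10 covers items 10 through 19.
items = list(range(95))
print(page_slice(items, page_number=2, page_size=10))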
What you propose with memcached will also work with a caching table. The first service call would (1) INSERT results INTO the caching table with a snapshot ID (2) return the first page from the caching table and the snapshot ID. Subsequent calls would return pages based on page size and page number by querying the caching table using the snapshot ID.
I should think this could also be optimized by using an in-memory caching table, but that depends on whether your database supports INSERT-INTO from a disk table to an in-memory table. That might get complicated in a clustered environment though.
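A rough sqlite3 sketch of that caching-table flow (all table and column names are invented for illustration):

import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE result_cache "
             "(snapshot_id TEXT, row_num INTEGER, item_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [("item %d" % i,) for i in range(100)])

def run_query_and_cache():
    # First call: INSERT the results INTO the caching table under a snapshot ID.
    snapshot_id = str(uuid.uuid4())
    rows = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
    conn.executemany("INSERT INTO result_cache VALUES (?, ?, ?, ?)",
                     [(snapshot_id, n, r[0], r[1]) for n, r in enumerate(rows)])
    return snapshot_id

def fetch_cached_page(snapshot_id, page_number, page_size):
    # Subsequent calls: page through the snapshot by row number.
    start = (page_number - 1) * page_size
    return conn.execute("SELECT item_id, name FROM result_cache "
                        "WHERE snapshot_id = ? AND row_num >= ? AND row_num < ? "
                        "ORDER BY row_num",
                        (snapshot_id, start, start + page_size)).fetchall()

sid = run_query_and_cache()
print(fetch_cached_page(sid, page_number=1, page_size=10))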
Such a cache is stateful by its very nature if you are retaining a client-specific copy between requests, whether storage is in a session object, a database table, or a memcached data store. Given the requirements, though, you have no choice but to cache results in some form or another, or else you risk returning deleted or no-longer-relevant records as legitimate results.
SOA is not meant for such low-level functionality.
SOA is meant to glue together business areas, not frontends to backends. Your application is not a "SOA" application just because it talks to the back end using web services; that is nonsense, since SOA is meaningless in the context of one isolated system.
From that point of view, it is clear that, in SOA, the caller should not know about the SQL table you are paginating; that's an implementation detail that SOA should hide. On the other hand, the server should not know about the client's state, because it should be agnostic to the details of its clients in order to be really open.
So, just understand that pagination is not SOA. Do as you wish, but understand that the web service you are using to paginate is an internal artifact of your application, not to be used by external clients on a SOA bus. Also remember that it cannot be transactionally consistent without state on the server. Probably the problem is that you have only one service layer for both the application's UI and the SOA bus; you need to separate them.
Using this web service on a SOA bus would be bad. It cannot be consistent as the user paginates, and as other applications latch onto it they become tied to the specific SQL.
... then you might as well have granted direct SQL access to the table for all that matters.
SOA is for business messages between systems, not to glue an application's frontend to the backend.
Same problem, resolved using the Navision approach.
$ws->getList($first_record_id, $limit)
This returns a page of $limit elements starting from the passed id:
SELECT * FROM collection WHERE id > $first_record_id ORDER BY id ASC LIMIT $limit
Navision uses a Key (each element has a key), but in MySQL an auto-increment id is better.
In this case pagination is intended for handling large result sets, not for frontend pagination...
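A small Python/sqlite3 sketch of this keyset approach (get_list mirrors the $ws->getList call above; the table and data are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO collection (name) VALUES (?)",
                 [("row %d" % i,) for i in range(50)])

def get_list(first_record_id, limit):
    # Return the next page: rows whose id is greater than the passed id.
    return conn.execute(
        "SELECT id, name FROM collection WHERE id > ? ORDER BY id ASC LIMIT ?",
        (first_record_id, limit)).fetchall()

# Usage: page forward by feeding the last id of one page into the next call.
page = get_list(0, 10)
while page:
    print(page[0], "...", page[-1])
    page = get_list(page[-1][0], 10)

Because the cursor is the last seen id rather than a numeric offset, rows deleted mid-scan cannot shift the window, so nothing gets skipped or duplicated the way it can with LIMIT/OFFSET paging.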
I am not sure that SOA is the concern here. The problem you have seems to be with paginating your APIs. I will point you to how Twitter handles their pagination: dev.twitter.com/rest/public/timelines