Huge discrepancy between access.log, Google Analytics and Mapbox Statistics - django

I have a website built with Django and served by Gunicorn, in which a single Mapbox map is loaded on the home page using mapbox-gl.js. Users can then navigate the map and change styles at will. The map is initialized and loaded only once, and only on the home page. The service is billed on a "map load" basis. The Mapbox pricing page says:
A map load occurs whenever a Map object is initialized, offering users unlimited interactivity with your web map.
I would have expected the counts to be, if not identical, at least comparable between the data recorded by Mapbox billing, the home page accesses recorded by Google Analytics, and the hits on the home page recorded in the server's access.log.
Instead, the Mapbox count is on average about 25 times higher than Analytics and the access.log, which report similar numbers to each other.
As an example, here are the numbers for yesterday:
Analytics: home page was loaded 890 times
access.log: 1261 requests for the home page
Mapbox: 23331 map loads
I am using URL restriction from the Mapbox control panel, but I guess the enforcement is not that strict, since they strongly suggest also rotating the token periodically (which I am already doing, on a daily basis). Since I started rotating the token I have noticed a slight drop in map loads (from an average of 28k to an average of 24k) and no noticeable change in the access.log or Analytics reports.
The map implementation in JavaScript is the following:
mapboxgl.accessToken = MY_TOKEN;
var map = new mapboxgl.Map({
    container: 'map',                           // id of the map container element
    style: 'mapbox://styles/myaccount/mystyle', // custom style
    center: [12.381384, 42.059164],             // initial center as [lng, lat]
    zoom: 5,
});
As I mentioned, this script is contained in the home page and is executed only once, when the page is loaded. Do you have any suggestions on how to keep the map loads low? I have no problem paying for what I'm using, but I feel there's either something wrong with the way map loads are calculated by Mapbox, something wrong in my implementation, or some sort of bot actively stealing the token.

When you change styles, that triggers a reload of the map, which will be recorded as a separate map load.
What are you trying to achieve by changing styles? Is there a small set of styles that you want to toggle between? How do they differ? There may be a way to implement it that does not require fully changing the style of the map.

Related

How do I scrape data from an ArcGIS Online map?

I want to scrape the data from an ArcGIS map. The following map has a popup when we click the red features. How do I access that data programmatically?
Link: https://cslt.maps.arcgis.com/apps/MapSeries/index.html?appid=2c9f3e737cbf4f6faf2eb956fa26cdc5
Note: Please respect the access and use constraints of any ArcGIS Online item you access. When in doubt, don't save a copy of someone else's data.
The ArcGIS Online REST interface makes it relatively simple to get the data behind ArcGIS Online items. You need to use an environment that can make HTTP requests and parse JSON text. Most current programming languages either have these capabilities built in or have libraries available with these capabilities.
Here's a general workflow that your code could follow.
Use the app ID and the item data endpoint to see the app's JSON text:
https://www.arcgis.com/sharing/rest/content/items/2c9f3e737cbf4f6faf2eb956fa26cdc5/data
Search that text for webmap and see that the app uses the following web maps:
d2b4a98c39fd4587b99ac0878c420125
7b1af1752c3a430184fbf7a530b5ec65
c6e9d07e4c2749e4bfe23999778a3153
Look at the item data endpoint for any of those web maps:
https://www.arcgis.com/sharing/rest/content/items/d2b4a98c39fd4587b99ac0878c420125/data
The list of operationalLayers specifies the feature layer URLs from which you could harvest data. For example:
https://services2.arcgis.com/gWRYLIS16mKUskSO/arcgis/rest/services/VHR_Areas/FeatureServer/0
Then just run a query with a where of 0=0 (or whatever you want) and an outFields of *:
https://services2.arcgis.com/gWRYLIS16mKUskSO/arcgis/rest/services/VHR_Areas/FeatureServer/0/query?where=0%3D0&outFields=%2A&f=json
Use f=html instead if you want to see a human-readable request form and results.
Note that feature services have a limit on how many features you can get per request, so you will probably want to filter by geometry or attribute values. Read the documentation to learn everything you can do with feature service queries.
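For reference, here is a minimal Python sketch of the tail end of that workflow (reading a web map's operationalLayers and running the query), assuming the requests library is available; the lookup of a "url" key on each operational layer is an assumption about the web map JSON, so adjust it to whatever your item actually contains.

import requests

ITEM_DATA = "https://www.arcgis.com/sharing/rest/content/items/{}/data"

# Read one of the web maps found above and collect its feature layer URLs.
# Assumption: each entry in operationalLayers exposes a "url" key.
webmap = requests.get(ITEM_DATA.format("d2b4a98c39fd4587b99ac0878c420125")).json()
layer_urls = [layer["url"] for layer in webmap.get("operationalLayers", [])
              if "url" in layer]

# Query the first layer for every feature and every attribute.
params = {"where": "0=0", "outFields": "*", "f": "json"}
result = requests.get(layer_urls[0] + "/query", params=params).json()

for feature in result.get("features", []):
    print(feature["attributes"])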

Is there a way to force Sitecore to sync MongoDB data with its SQL database?

I am setting up Sitecore xDB and am trying to test exactly what info gets through the system for authenticated and non-authenticated users. I would like to be able to make a change and see the results quickly in Sitecore. I found the setting to lower the session lifetime to 1 minute rather than 20. I have not found a way to force Sitecore to sync with Mongo on demand, or at least within 1-5 minutes rather than what currently appears to be about 20 minutes. Does such a mechanism exist, or is "rebuilding" the database explained here the only existing process?
See this blog post by Martina Welander for this and more good info about xDB sessions: https://mhwelander.net/2016/08/24/whats-in-a-session-what-exactly-happens-during-a-session-and-how-does-the-xdb-know-who-you-are/
You just need a utility page that calls System.Web.HttpContext.Current.Session.Abandon(). You may also want to redirect the user to a page that doesn't exist.
Update to address comment
My understanding is that once an xDB session has expired, processing should take place quickly. In the Sitecore.Analytics.Processing.Services.config file, the BackgroundService agent is set to run on an interval of 15 seconds by default.
You may just be seeing cached reporting data. Try clearing the cache using the /sitecore/admin/cache.aspx page. You could also decrease the defaultCacheExpiration setting for the reporting cacheProvider in the Sitecore.Analytics.Reporting.config file. The default is 10 minutes.

Script to crawl through different pages and acquire data

I am planning to do a network analysis of the BMTC bus connectivity network, so I need to acquire data regarding bus routes. The best website, as far as I know, is
http://www.narasimhadatta.info/bmtc_query.html
Under the "search by route" option, the whole list of routes is given; one can select any one of them and, on clicking "submit", the detailed route is displayed. Previously, when I acquired data online, I relied on the fact that each item (in this case, each route number) led to a distinct URL, and I would acquire the data from the source page using Python. But here, irrespective of the bus route, the final page always has the URL
http://www.narasimhadatta.info/cgi-bin/find.cgi
and its source page doesn't contain the route details!
I am only comfortable with Python and Matlab. I couldn't figure out any way to acquire the data from that website. If you can see the data, then technically one should be able to download it (at least that's what I believe). So can you please help me out with a script that crawls through each bus route number automatically and downloads the route details?
I looked at the URL you mentioned. If you have a list of route numbers, you can use the following URL structure to extract the data:
http://www.narasimhadatta.info/cgi-bin/find.cgi?route=270S
or
http://www.narasimhadatta.info/cgi-bin/find.cgi?route=[route number from your list]
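In case it helps, here is a rough Python sketch of that loop, assuming the requests library and a list of route numbers collected beforehand (for instance, scraped from the "search by route" dropdown on bmtc_query.html); only 270S, the route from the example URL above, is listed here.

import requests

BASE_URL = "http://www.narasimhadatta.info/cgi-bin/find.cgi"

# Assumption: fill this in with the full route list from the dropdown.
routes = ["270S"]

for route in routes:
    response = requests.get(BASE_URL, params={"route": route})
    response.raise_for_status()
    # Save each route's HTML page for later parsing
    # (e.g. with BeautifulSoup).
    with open("route_{}.html".format(route), "w") as f:
        f.write(response.text)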

Ember choking upon encountering large data sets

Looking for a solution to an issue caused by large data sets forcing Ember to lock up the browser while it tries to process the data.
For pagination, I'm using tchak's handy pagination mixin to paginate the roughly 13,000 objects being loaded from a backend API.
The Ember Data objects contain an ID, one text attribute and several number attributes.
The problem is that it takes close to a minute before the browser finishes processing the data, rendering the browser unusable in the meantime. Firefox even goes as far as issuing a warning that a script is using up all browser resources, and suggests that the script be terminated.
I've written my own pagination mixin that requests objects by range, i.e. items 10-25, and it works generally well except for one serious limitation: sorting. To sort the data, I need to make additional requests to the backend and reload the objects even if some of them have already been loaded.
I would love to be able to load all of the content upfront to simplify the process of sorting without doing additional requests to the backend API. I'm looking for guidance on how to tackle this issue but I'm open to an entirely alternative approach.
If nothing else, is it possible to reduce the resource footprint Ember places on the browser as it tries to load all 13k objects into the ArrayController?
I'm using Ember 1.0.0-pre2 with the latest Ember Data (currently at Revision 10).
On the backend is Rails 3.2.8.
Update: I sidestepped the issue by loading the data into an ArrayController property other than content. This brought load times down from over a minute to only a few seconds. I then slice the requested number of items and load those into content. This works well for any number of items, at the cost of not being able to easily sort the data.
I suggest you take a look at Ember Table. The demo shows a table with 500 000 records and works very fast. Digging around the source code might help.
Can't you query a view from your db that handles the sorting? Pass the sort conditions in the query string: ?sortBy=name&sortAsc=true

accurate page view count in Django

What is a good approach to keeping accurate counts of how many times a page has been viewed?
I'm using Django. Specifically, I don't want refreshing the page to up the count.
As far as I'm aware, no browsers currently send any kind of message/header to the server indicating whether the request came from a refresh.
The only way I can see to avoid counting refreshes is to track the IPs and times at which users view a page; then, if a user last viewed the page less than, say, 30 minutes ago, you dismiss the request as a refresh and don't increment the view count.
IMO most page refreshes should be counted as page views anyway, since the only reasons I have for refreshing are to see new data that might have been added, or the occasional accidental refresh/reload after a browser crash (which the above method would dismiss).
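A minimal sketch of that time-window idea in Python, using an in-memory dict purely for illustration (a real Django site would keep this state in the database or a cache):

import time

RECENT_VIEWS = {}   # (ip, page_id) -> timestamp of the last view
WINDOW = 30 * 60    # 30 minutes, in seconds

def should_count(ip, page_id):
    now = time.time()
    last = RECENT_VIEWS.get((ip, page_id))
    RECENT_VIEWS[(ip, page_id)] = now
    # Count the view only if this visitor hasn't seen the page recently.
    return last is None or now - last > WINDOW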
You could give each user a cookie that expires at the end of the day, containing a unique number. If the user reloads a page, you can check whether they have already been counted that day.
You could create a table of unique visitors per page, e.g. visitor IP + X-Forwarded-For content and User-Agent string, along with a PageID of some sort. If the data itself is irrelevant, you can store an md5/sha1 hash of these values (excluding the PageID, of course). Be warned, however, that this table will grow really fast.
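As a sketch, such a hash could be computed in a Django view or middleware along these lines; the choice of request headers simply mirrors the list above:

import hashlib

def visitor_hash(request):
    # Fingerprint from IP, X-Forwarded-For and User-Agent; the PageID
    # is stored alongside the hash rather than inside it.
    parts = [
        request.META.get("REMOTE_ADDR", ""),
        request.META.get("HTTP_X_FORWARDED_FOR", ""),
        request.META.get("HTTP_USER_AGENT", ""),
    ]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()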
I'd advise against setting cookies for that purpose. They have a limited size, and with many pages visited by the user you could reach that limit and make the solution unreliable. It also makes it harder to cache such a page on the client side (see Cacheability), since it becomes interactive content.
You can write a Django middleware that catches the requested URL (request.path), then set up a table with url / accesses columns. Beware of transactions for concurrent updates.
If you have load problems, you can use memcached with its incr or add functions and periodically update the database table, to avoid transaction locks.
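A rough sketch of that combination, assuming Django's cache framework is backed by memcached; the periodic flush of the counters into the url / accesses table (e.g. from a cron job or management command) is left out:

from django.core.cache import cache

class PageViewCountMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        key = "pageviews:%s" % request.path
        # add() creates the counter only if it is missing; incr() then
        # bumps it atomically, so no database transaction is needed.
        cache.add(key, 0)
        try:
            cache.incr(key)
        except ValueError:
            # The key was evicted between add() and incr(); recreate it.
            cache.set(key, 1)
        return self.get_response(request)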