Analytics Django app with a NoSQL DB and Google Analytics

I've started a Django project that will include an analytics app. I want that app to use either CouchDB or MongoDB for storing data.
The initial idea was (since the client is already using Google Analytics) to grab data from GA once a day/week/month and store it locally as values in the database. That would ultimately build up a database of entries, one entry per user per month, with summed values like
{"date": "11.2011", "clicks": 21, "pageviews": 40, "n": n},
For premium users there could be one entry per user per week or even per day.
The question would be:
grab analytics from GA and sum the entries for clicks, visits, etc.,
or
store clicks and whatever other values locally and do the sums once a month for display?

Lukasz, unless Google Analytics has really relaxed their privacy levels, you're not going to be able to access user-level records (but check out the answer here: Django saving the whole request for statistics, whats available?)

Right, old question, but I've just finished the project so I'll write down what I did.
Since I didn't need concurrency and wanted the faster approach, I found that MongoDB was better for this.
The final document schema that I've used is
{'date': '11.2009', 'pageviews': 40, 'clicks': 13, 'otherdata': 'that i can use as filters'}
The scope of my local analytics is monthly, so I create one entry in MongoDB per user per month and update it each day. I update the data daily and store only summaries and averages.
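A minimal sketch of that daily roll-up, assuming pymongo; the collection, field names, and values below are illustrative rather than my actual project code:

```python
from datetime import date

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
stats = client.analytics.monthly_stats              # placeholder db/collection names

def add_daily_summary(user_id, pageviews, clicks):
    """Upsert one document per user per month and add today's totals to it."""
    month_key = date.today().strftime("%m.%Y")       # e.g. "11.2009", matching the schema above
    stats.update_one(
        {"user_id": user_id, "date": month_key},
        {"$inc": {"pageviews": pageviews, "clicks": clicks}},
        upsert=True,                                 # creates the month's document on first update
    )

add_daily_summary(user_id=42, pageviews=40, clicks=13)
```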
What else... Re: Jamie's answer: the system is using GA events, so I've got access to all the data that I need.
Hope someone finds this interesting.
Cheers, and thanks for the ideas!


The correct way to remove or update Item

I am building a recommendation system for a classified ads website; ads are added and deleted daily.
What I thought of is to use PutItems to add new ads with a field called status = 0. If a user deletes an ad, I will use the same PutItems API with the same ITEM_ID to update the stored item, and use a filter to select only ads with status = 0 when generating recommendations.
Is that correct? Will the PutItems API update the existing ad? And is there any way to delete the item?
Currently there is no way to remove items that were already added to Datasets.
Your workaround looks good; however, from my experience working with Personalize, the filter might decrease the quality of your recommendations.
To understand why, this is more or less the algorithm that Personalize uses for filtering recommendations:
Get recommended items for user
Filter recommendations using filter expression
Return first N recommended items left after filtering
Because the filtering is done after getting recommendations, Personalize will simply fill the recommendations list with items that were somewhere further down the recommended list.
And there is a problem with that approach: items lower on the list have a lower "Score" value, which indicates the accuracy of the recommendation. That's why you will end up with generally worse recommendations, although it will depend on how many ads with status = 0 were recommended before being filtered out.
To check your recommendation scores, simply get recommendations in the Personalize web UI. It will return the list of recommendations with their scores.
Better approach
If your ads are updated daily, then you can definitely work around it with the following steps (a rough sketch follows this list):
Create a Lambda function that is triggered every 24 hours.
The Lambda fetches all of the ads and puts them into an S3 bucket as a CSV file. It should exclude ads that are no longer available (status = 0).
Call the CreateDatasetImportJob API using any AWS SDK of your choice and point it at the data stored in the S3 bucket.
Personalize will start an import job. When it finishes, all of the items are replaced with the newest dump.
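A rough sketch of those steps with boto3; the bucket name, ARNs, CSV columns, and the fetch_active_ads() helper are placeholders rather than real Personalize resources:

```python
import csv
import io
from datetime import datetime

import boto3

s3 = boto3.client("s3")
personalize = boto3.client("personalize")

BUCKET = "my-personalize-items-bucket"                              # placeholder
DATASET_ARN = "arn:aws:personalize:region:account:dataset/items"    # placeholder
ROLE_ARN = "arn:aws:iam::account:role/PersonalizeS3Access"          # placeholder

def fetch_active_ads():
    # Placeholder: read the still-available ads from your own ads database.
    return [("ad-123", "electronics", 99.0)]

def handler(event, context):
    # 1. Build a CSV containing only the ads that are still available.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["ITEM_ID", "CATEGORY", "PRICE"])               # must match your Items schema
    writer.writerows(fetch_active_ads())

    key = f"items/items-{datetime.utcnow():%Y-%m-%d}.csv"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())

    # 2. Replace the Items dataset with the newest dump.
    personalize.create_dataset_import_job(
        jobName=f"items-import-{datetime.utcnow():%Y-%m-%d}",
        datasetArn=DATASET_ARN,
        dataSource={"dataLocation": f"s3://{BUCKET}/{key}"},
        roleArn=ROLE_ARN,
    )
```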
However, it has some downsides.
If you are not using the User-Personalization (aws-user-personalization) recipe, then after each import of items you need to update your Solution by creating a new Solution Version. Otherwise it won't include the changes made by the items dataset import job.
Creating a new Solution Version is quite slow and expensive, which is why I would recommend the User-Personalization recipe if you want to use this approach; and since the HRNN recipes are marked as legacy, it's a good idea to migrate anyway.
If you are using the User-Personalization recipe, then according to the AWS documentation:
Amazon Personalize automatically updates your latest solution version every two hours to include new data. Your campaign automatically uses the updated solution version. For more information see Automatic Updates.
So pretty much all of the work is done on the Personalize side and you don't have to worry about retraining the Solution after each items import job.
And the last problem...
Since the User-Personalization recipe documentation claims that your Solution will be updated within two hours, you might end up recommending items that are no longer available for some short period of time. If you are updating items daily, that might be a significant problem.
To cover that case, I would recommend also using the filter approach that you mentioned. That way you get the benefits of both approaches
and your recommendations are always valid.
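If you do keep the filter, creating it with boto3 might look roughly like this; the dataset group ARN is a placeholder, and the expression assumes STATUS is an item metadata column where "0" means the ad is still available, as in your question:

```python
import boto3

personalize = boto3.client("personalize")

response = personalize.create_filter(
    name="only-available-ads",
    datasetGroupArn="arn:aws:personalize:region:account:dataset-group/ads",  # placeholder
    filterExpression='INCLUDE ItemID WHERE Items.STATUS IN ("0")',
)
print(response["filterArn"])   # pass this as filterArn to GetRecommendations at inference time
```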

DynamoDB - Reducing number of queries

After my users log in, the app makes too many requests to DynamoDB, and I am thinking about different ways to reduce the number of calls.
The app allows user to trigger certain alerts that get sent to other users. For instance: "Shipment received, come to the deck", "Shipment completed", etc.
These are the calls made:
Get company's software license expiration date.
Get the computer's location in the building (i.e. "Office A").
Get the kinds of alerts that can be triggered (i.e. "Shipment received, come to the deck", "Shipment completed", etc).
Get information about the user (i.e. the company teams the user belongs to, and the admin level the user has, which can be 0, 1, 2, or 3).
Potential solutions I have thought about:
Put the company's license expiration date as an attribute of each computer (this would reduce the number of queries by 1). However, if I need to update the company's license expiration date, then I need to update it for EVERY SINGLE computer I have in the system, which sounds impractical to me since I may have 200, 300, or perhaps even more computers in the database.
Add the company's license expiration date as an attribute of the alerts (this would reduce the number of queries by 1), which seems more reasonable because there are only about 15 different kinds of alerts, so if I need to change the license expiration date later on, it is not too bad.
Cache information on the user's device; however, I can't seem to find a good strategy to keep the locally stored information as up to date as possible.
I still think these 3 options do not sound too good, so I am hoping someone can point me in the right direction. Is there a good way to reduce the number of calls? I am retrieving information about 4 different entities (license, computer, alert, user); should I just keep those 4 calls after users log in?
Here are a few things that can be done with respect to each component.
Get information about the user
Keep it in a session store, and whenever the details change, update the store. Session stores are usually implemented using a cache like Redis.
Computer location
Keep it in a distributed cache like Redis and initialise it lazily. Whenever a new write happens to a computer location (rare, IMO), remove the entry from Redis using DynamoDB Streams and AWS Lambda.
Kind of alerts
Same as Computer location
License expiration date
If possible, don't allow the license expiry date to change (issue a new license in those cases, so that traceability is maintained) and cache the licence expiry forever. Or handle it the same way as the computer location.
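A minimal sketch of the lazy cache plus stream-based invalidation described above, assuming the redis and boto3 packages; the table, key, and attribute names are made up:

```python
import boto3
import redis

dynamodb = boto3.resource("dynamodb")
computers = dynamodb.Table("Computers")          # placeholder table name
cache = redis.Redis(host="my-redis-host")        # placeholder Redis endpoint

def get_computer_location(computer_id):
    """Lazily cache the computer's location; fall back to DynamoDB on a miss."""
    key = f"computer:{computer_id}:location"
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()

    item = computers.get_item(Key={"computer_id": computer_id}).get("Item", {})
    location = item.get("location", "")
    cache.set(key, location)
    return location

def invalidate_on_stream(event, context):
    """Lambda handler on the table's DynamoDB stream: drop cache entries that changed."""
    for record in event["Records"]:
        if record["eventName"] in ("MODIFY", "REMOVE"):
            computer_id = record["dynamodb"]["Keys"]["computer_id"]["S"]
            cache.delete(f"computer:{computer_id}:location")
```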

Cannot get data (100k+ rows) for a dashboard

I'm pretty new to DynamoDB and the whole AWS ecosystem; it's very exciting, but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a React Native mobile app which stores one row into a DynamoDB table each time a user does a search (the database is a search history with a UUID and then the search criteria). On average we get a few thousand new searches in the table every day. The table has just a primary key, which is the search id.
The app is quite new, but we are already reaching a few hundred thousand rows in the table and can expect a million in the following months. The data is plain and simple: a unique id, with strings and numbers in the other attributes. No connections, no relationships, etc. That's when I started to feel that maybe DynamoDB was not the best choice, but still, I read everywhere that it can be suitable for anything if properly managed.
Next to this there is a webapp dashboard which, thanks to a REST API using Node.js Lambdas, queries DynamoDB to compute statistics about the searches: how many searches per day, a list of the last searches... The problem is that DynamoDB is not really suitable for querying hundreds of thousands of items (the 1 MB limit, query limitations, credits...).
When I do a scan I only get about 3000 searches. I tried to loop on the scan using the last evaluated key (sketched below), but after a few tests I stopped getting data and hit the maximum throughput. It seems really clear that I don't have the right approach to bring all these searches to my web app. So what would be the right approach? My ideas are the following, but I am open to more experienced ones:
Switching to a SQL database (using the AWS migration service?). Will it really be easier then?
Creating Lambdas to execute scheduled jobs every night that compute each day's statistics, so that I don't have to query the full database all the time but just some of the most recent searches and the statistics rows? Is that doable? Any Node.js / Lambda tutorial you may know regarding this?
Better management of indexes? I am still very lost regarding those.
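For reference, the kind of paginated scan loop I was attempting looks roughly like this (a minimal boto3 sketch; the table name is made up, and a full scan of a large table still burns read capacity):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
searches = dynamodb.Table("SearchHistory")        # placeholder table name

def scan_all_searches():
    """Scan the whole table by following LastEvaluatedKey across pages."""
    items = []
    kwargs = {}
    while True:
        response = searches.scan(**kwargs)
        items.extend(response["Items"])
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break                                 # no more pages
        kwargs["ExclusiveStartKey"] = last_key    # continue after the last returned item
    return items
```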
Looking forward to your opinions.
Add another layer to take care of full-text search, for example Elasticsearch, Algolia, or other similar services.
Notes:
Elasticsearch may cost you a lot compared to DynamoDB.
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/

Best way to count number of records from NDB library

I am developing an application in which I am explicitly using memcache with Google App Engine's NDB library. I want something like this:
1) Get 100 records from the datastore and put them in memcache.
2) Now whenever a user wants these records, serve them from memcache instead of the datastore.
3) Invalidate the memcache if there is a new record in the datastore, and then repopulate the memcache with the 101 records.
The approach I am thinking of is to compare the number of records in memcache and in the datastore, and if there is a difference, update the memcache.
But looking at the NDB documentation, we can only get a count by retrieving all the records, which defeats the purpose, since the datastore query is not avoided that way.
Can anyone help? Or is there a different approach I could take?
Thanks in advance.
Rather than relying on counts, you could give each record a creation timestamp and keep the most recent timestamp in memcache. Then, to see if there are new records, you just need to check whether there are any timestamps newer than that, which, assuming you have an index on that field, is a very quick query.
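A minimal sketch of that timestamp check with NDB and memcache; the model, property, and cache key names are illustrative:

```python
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Record(ndb.Model):
    data = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)   # creation timestamp used for freshness

CACHE_KEY = "records"
LATEST_KEY = "records_latest_created"

def get_records():
    cached = memcache.get(CACHE_KEY)
    latest = memcache.get(LATEST_KEY)

    # Cheap freshness check: is there any record newer than the newest one we cached?
    if cached is not None and latest is not None:
        newer = Record.query(Record.created > latest).get(keys_only=True)
        if newer is None:
            return cached

    # Cache miss or stale: reload from the datastore and refresh memcache.
    records = Record.query().order(-Record.created).fetch(100)
    memcache.set(CACHE_KEY, records)
    if records:
        memcache.set(LATEST_KEY, records[0].created)
    return records
```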

Power BI REST API AddRows

I am working on a real-time dashboard and I'd like to use the Power BI REST API.
My question is how the updating of rows works. I have 1300 records to load once, and then I need to update 2 columns of each row every 20 seconds.
The only REST call I see is AddRows, but it's not clear how it handles updating rows, if it does at all.
You have two patterns you can choose from:
You can send data in batches: upload the 1300 rows, then call DELETE on the rows, then call upload with the next payload of rows.
Here's the DELETE method you need to call. We're adopting REST standards for our APIs, so the 'methods' are the REST verbs :). https://msdn.microsoft.com/en-us/library/mt238041.aspx
Alternatively, you can incrementally update the data: you'd add a 'timestamp' column to your data set. Then in your query (like in Q&A) you'd ask for "show data for the last 20 seconds". If you do this, set the FIFO retention policy when you create the data set so you don't run out of space.
In either case, double check that the number of rows you're pushing fits within the limits we spell out. https://msdn.microsoft.com/en-US/library/dn950053.aspx
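A rough sketch of the first (batch) pattern using Python and the requests package; the dataset id, table name, and access token are placeholders:

```python
import requests

ACCESS_TOKEN = "<AAD access token>"   # placeholder
DATASET_ID = "<dataset id>"           # placeholder
TABLE = "Measurements"                # placeholder table name

ROWS_URL = f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/tables/{TABLE}/rows"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def replace_rows(rows):
    """Clear the table, then push a fresh batch of rows."""
    requests.delete(ROWS_URL, headers=HEADERS).raise_for_status()
    requests.post(ROWS_URL, headers=HEADERS, json={"rows": rows}).raise_for_status()

replace_rows([{"id": 1, "value": 42.0}, {"id": 2, "value": 17.5}])
```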
HTH,
-Lukasz
I was searching for something in the Power BI docs that could help me create a report with the REST APIs. I couldn't find exactly that, but I made a workaround.
Firstly, I created a push dataset schema in Power BI with the help of the Post Push Dataset REST API:
https://learn.microsoft.com/en-us/rest/api/power-bi/push-datasets/datasets-post-dataset-in-group
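Roughly, the request for that step looked like the sketch below (using Python's requests package; the workspace id, access token, and column names are made up, not my actual schema):

```python
import requests

ACCESS_TOKEN = "<AAD access token>"   # placeholder
GROUP_ID = "<workspace id>"           # placeholder

dataset = {
    "name": "RealtimeReportData",     # placeholder dataset name
    "defaultMode": "Push",
    "tables": [
        {
            "name": "Events",
            "columns": [
                {"name": "id", "dataType": "Int64"},
                {"name": "label", "dataType": "String"},
                {"name": "value", "dataType": "Double"},
            ],
        }
    ],
}

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP_ID}/datasets",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=dataset,
)
resp.raise_for_status()
print(resp.json()["id"])              # dataset id used by the later row push/delete calls
```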
Then I pushed rows/records/data into my dataset with the Post Rows in Push Dataset API:
https://learn.microsoft.com/en-us/rest/api/power-bi/push-datasets/datasets-post-rows-in-group
Then I went to the Power BI service and created a visual report manually there.
After this I embedded that report in my React app.
Finally my report was live.
Now, if I wanted to update my report in real time, I called the Delete Push Dataset Rows API to delete the existing rows/records from my dataset:
https://learn.microsoft.com/en-us/rest/api/power-bi/push-datasets/datasets-delete-rows-in-group
Then I called the Post Push Dataset Rows API again with the new, updated data (repeating step 2).
And then finally I refreshed my website page, and now I see the updated visual report on my website.
It took me quite a lot of time, so I can feel it if you are struggling with the Power BI REST API; it's not straightforward. Feel free to ask anything down below, I'll be happy to help.