How to efficiently implement a page view counter in DynamoDB? - amazon-web-services

I am essentially trying to build a website where members can post blog entries, and I want to record unique and overall page views for the different posts, both in absolute terms and over different time frames, e.g., last 24h, last week etc.
My initial approach was to use the date as primary key and the blogPostId as secondary key; I could then add all the posts visited during a given day. If I then include the userIds as an attribute, I should be able to get a) unique page views and b) overall page views (which might include duplicate visits by a specific user) for a given day. Finally, I would pull the primary key for, let's say, the last 7 days and extract the most popular post.
As far as I can tell this should work fine as long as there aren't too many entries; however, I'm sceptical whether this will scale. More specifically, if the number of blog posts increases a lot for a given interval, or if I want to find the all-time most viewed post, I'd essentially have to read the whole table.
Does anyone have an idea how I could implement this more efficiently?
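For illustration, the scheme described above could be sketched as one UpdateItem call per page view. This is only a sketch under assumptions: the key names `day` and `postId` and the attributes `viewCount` and `viewerIds` are hypothetical, and the function only builds the request parameters (in the low-level client format), so it runs without AWS access. Passing the result to a DynamoDB client's `update_item` is left out.

```python
from datetime import datetime, timezone


def build_view_update(table_name, post_id, user_id, when=None):
    """Build UpdateItem parameters that record one page view.

    Assumed (hypothetical) layout: partition key `day` (YYYY-MM-DD),
    sort key `postId`, a numeric `viewCount` and a string set `viewerIds`.
    """
    when = when or datetime.now(timezone.utc)
    return {
        "TableName": table_name,
        "Key": {
            "day": {"S": when.strftime("%Y-%m-%d")},
            "postId": {"S": post_id},
        },
        # ADD increments the number and adds to the set in one atomic write;
        # adding a user id that is already in the set changes nothing.
        "UpdateExpression": "ADD #count :one, #viewers :user",
        "ExpressionAttributeNames": {
            "#count": "viewCount",
            "#viewers": "viewerIds",
        },
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":user": {"SS": [user_id]},
        },
    }
```

With this layout, overall views for a post and day would be `viewCount` and unique views the size of `viewerIds`. The set of user ids counts toward DynamoDB's 400 KB item size limit, so this only suits posts with a modest number of distinct visitors per day.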

DynamoDB will almost certainly work for you, and if you need an excuse to use it, by all means give it a try. If you get a ton of traffic it might end up being expensive.
Personally, I would consider using Redis for what you are asking to do, and here is a pretty good/detailed question/answer on how you might implement it:
Scalable way of logging page request data from a PHP application?

DynamoDB can be used to iterate and create this feature quickly.
Nonetheless, this is really a use case for Amazon Kinesis Data Streams, which lets you ingest data and then manipulate it to your needs.
Be aware that Kinesis can be expensive if you are trying to keep costs as low as possible.
But if you start receiving a lot of traffic, Kinesis will work as a queue and let you manipulate the data before writing it to DynamoDB (or another data store), which will be cheaper than sending every view as its own write request.
Another limitation to check is that DynamoDB returns at most 1 MB of data per Query call; larger result sets have to be paged through using LastEvaluatedKey.
Amazon recommends Redshift for those operations, as it is better suited to aggregation and calculation over data-warehouse-sized data.
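To make the "manipulate the data before ingesting it" idea concrete, here is a minimal sketch of collapsing a batch of raw view events into one count per post, so that a consumer could issue one write per post instead of one per view. The event shape (`post_id`, `user_id`) is an assumption for illustration, and reading the batch from a stream is not shown.

```python
from collections import Counter


def aggregate_views(events):
    """Collapse raw view events into one (post_id, count) pair per post."""
    counts = Counter(event["post_id"] for event in events)
    return sorted(counts.items())


# Hypothetical batch, as it might arrive from a stream consumer.
batch = [
    {"post_id": "a", "user_id": "u1"},
    {"post_id": "b", "user_id": "u2"},
    {"post_id": "a", "user_id": "u3"},
]
print(aggregate_views(batch))  # [('a', 2), ('b', 1)]
```

Three events turn into two writes here; the saving grows with how often the same post shows up within one batch.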

Related

DynamoDB Indexing Assistance and Getting My Data Out

I preface all of this to say I’m still actively learning DynamoDB, and I think an answer to my question will help me understand a few things.
I have an analytics microservice that pushes custom (internal) analytics events into a DynamoDB table. Columns in our Dynamo rows/items include data like:
User ID
IP Address
Event Action
Timestamp
Split Test ID
Split Test Value
One of the main questions we want to pull from this db is:
"How many users saw split test x with values y?"
I’m struggling to understand how I should index my database to account for this kind of request. I set up a “Keys Only” index targeting Split Test ID, and the queries to gather these are fairly efficient, but they only pull UserID and Split Test ID. Ideally I want an efficient query that returns multiple other associated values as well…
How do I achieve this? Do I need to be doing something much differently? Additionally, if any of my understanding of Dynamo, based on my explanations, sounds completely lacking in some regard, please point me in the right direction!
You're thinking of DynamoDB as a schema-less database, which it obviously is. However, that does not mean that a schema is not important. Schemas in NoSQL databases are usually more important than they are in SQL databases and they are usually less straightforward.
The most important thing to determine how you will store your data is how you will access it. You will have to take into account all the ways that you will want to access your data and ensure it is possible by creating the necessary data columns and necessary indexes. In this case, if you want to know how many times two values are combined in a certain way, you could easily add a column that has these combined values (e.g., splitId#splitValue ) and use that in your indexes.
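A minimal sketch of the combined-value idea, building request parameters only (low-level client format) so it runs without AWS access. The attribute name `splitKey`, the index name and the helper names are hypothetical.

```python
def split_key(split_test_id, split_test_value):
    """Combine the two values into one indexable string, e.g. 'x#y'."""
    return f"{split_test_id}#{split_test_value}"


def build_split_query(table_name, index_name, split_test_id, split_test_value):
    """Build Query parameters for all events of one test/value combination.

    Assumes a global secondary index whose partition key is `splitKey`.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#sk = :sk",
        "ExpressionAttributeNames": {"#sk": "splitKey"},
        "ExpressionAttributeValues": {
            ":sk": {"S": split_key(split_test_id, split_test_value)},
        },
    }
```

Each item would store `splitKey` alongside the original attributes when it is written. For the query to return the other associated values, the index would need to project them (projection type ALL or INCLUDE) rather than KEYS_ONLY.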
If you want to know more about advanced patterns and such, I advise you to watch this pretty famous re:Invent talk by Rick Houlihan or to read the DynamoDB book.
As a last note, I want to add that switching to a SQL server usually is not the solution. Picking NoSQL over SQL is usually based on non-functional requirements. There is a reason NoSQL databases are used in applications that require very low-latency retrieval of data in huge datasets, but as with everything, trade-offs are the name of the game.

Cannot get data (100k+ rows) for a dashboard

I'm pretty new to DynamoDB and AWS as a whole. It's very exciting, but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a mobile React Native app which stores one row in a DynamoDB table each time a user does a search (the table is a search history with a UUID and then the search criteria). On average we get a few thousand new searches in the table every day. The table has just a primary key, which is the search id.
The app is quite new, but we are already reaching a few hundred thousand rows in the table and can expect a million in the coming months. The data is plain, simple data with a unique id and strings and numbers in the other attributes. No connections, no relationships, etc. That's when I started to feel that DynamoDB may not have been the best choice, but still, I read everywhere that it can be suitable for anything if properly managed.
Next to this there is a web app dashboard which, through a REST API using Node.js Lambdas, queries DynamoDB to make statistics about the searches: how many searches per day, list of last searches... The problem is that DynamoDB is not really suited to querying hundreds of thousands of rows (the 1 MB limit, query limitations, credits...).
When I do a scan I get only 3000 searches. I tried looping over the scan using the last index returned, but after a few tests I got no data and hit the maximum throughput. It seems really clear that I don't have the right approach to bring all these searches to my web app. So what would be the right approach? My ideas are the following, but I am open to more experienced ones:
Switching to a SQL database (using an AWS migration tool?). Will it really be easier then?
Creating Lambdas that run as scheduled jobs every night to compute statistics for the day, so that I don't have to query the full database all the time but just some of the most recent searches and the statistics rows? Is it doable? Any Node.js / Lambda tutorial you may know regarding this?
Better management of indexes? I am still very lost regarding those.
Looking forward to your opinions.
Add another layer to take care of full-text search.
For example, Elasticsearch, Algolia or something similar.
Notes:
Elasticsearch may cost you a lot compared with the cost of DynamoDB.
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
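On the scan loop and the nightly statistics idea from the question: a Scan returns one page at a time, and the next page is requested by passing the previous response's LastEvaluatedKey as ExclusiveStartKey. A minimal sketch follows, assuming `scan_page` is any callable with the shape of a boto3 Table's `scan` method and that items carry a hypothetical ISO 8601 `createdAt` string. It is in Python rather than Node.js, and throttling or retries are not handled.

```python
from collections import Counter


def scan_all(scan_page):
    """Yield every item, following LastEvaluatedKey until it is absent.

    `scan_page` accepts Scan keyword arguments and returns a response dict.
    """
    kwargs = {}
    while True:
        page = scan_page(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break  # no more pages
        kwargs["ExclusiveStartKey"] = last_key


def searches_per_day(items):
    """Count items per calendar day from the assumed `createdAt` attribute."""
    return Counter(item["createdAt"][:10] for item in items)
```

A scheduled job could run something like this once a night and store the per-day counts as small statistics rows, so the dashboard reads those instead of scanning the whole table on every request.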

AWS hosted data storage for storing simple entities

I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id and type. No joins. Just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder if using Redshift is like taking a sledgehammer to crack a nut?
For your requirements you can use Amazon DynamoDB and define a TTL attribute so that older items are removed automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (Though it can be costly)
Use sort key with timestamp to query latest items.
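As a rough sketch of the TTL and sort-key points above, the helpers below build an item and a "latest events" Query in the low-level client format, without calling AWS. The attribute names (`userId`, `createdAt`, `eventType`, `expiresAt`) and the table layout (partition key `userId`, sort key `createdAt`) are assumptions. DynamoDB TTL expects the expiry as a Number holding Unix epoch seconds.

```python
import time

TWO_YEARS_SECONDS = 2 * 365 * 24 * 60 * 60


def build_event_item(user_id, event_type, created_at=None):
    """Build an item whose `expiresAt` lies two years after creation."""
    created_at = int(created_at if created_at is not None else time.time())
    return {
        "userId": {"S": user_id},
        "createdAt": {"N": str(created_at)},
        "eventType": {"S": event_type},
        # The attribute that TTL would be enabled on (epoch seconds).
        "expiresAt": {"N": str(created_at + TWO_YEARS_SECONDS)},
    }


def build_latest_events_query(table_name, user_id, limit=10):
    """Build Query parameters returning one user's newest events first."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "userId = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
        "ScanIndexForward": False,  # descending sort-key order
        "Limit": limit,
    }
```

For a list of users this would mean one Query per user. Expired items are deleted in the background rather than at the exact expiry time, so reads may still see them for a while.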
I would also suggest looking at Amazon SimpleDB, as at first glance it looks like a better fit for your requirements.
Please refer to this article, which describes some practical user experience.
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html

Modeling data in NoSQL DynamoDB

I'm trying to figure out how to model the following data in AWS DynamoDB table.
I have a lot of IoT devices; each sends telemetry data every few seconds.
Attributes
device_id
timestamp
malware_name
company_name
action_performed (two possible values)
Queries
Show all incidents that happened in the last week.
Show all incidents for a specific device_id.
Show all incidents with action "unable_to_remove".
Show all incidents related to specific malware.
Show all incidents related to specific company.
Thoughts
I understand that I can add GSIs for each attribute, but I would like to use GSIs only if there is no other choice, as they cost me more money.
What would be the main primary key (partition-key:sort-key)?
Please share your thoughts. I care about them more than I care about the perfect answer, as I'm trying to learn how to think and what to consider instead of having an answer for a specific question.
Thanks a lot !
If you absolutely need the query patterns mentioned, you have no way out but to create a GSI for each. That too has its set of caveats:
For query #1, your GSI would be incident_date (or whatever) as partition-key and device_id as sort-key. This might lead to hot partitions in DynamoDB, depending on your access patterns.
There is a limit on GSIs per table (the default quota is now 20; it used to be 5, which these five queries would use up right away). What will you do if you need to support another kind of query in the future?
While evaluating the pros and cons of using NoSQL for a given situation, one needs to consider both read and write access patterns. So, the question you should ask is, why DynamoDB?
For example, do you really need realtime queries? If not, you can use DynamoDB as the main database and periodically sync data (using AWS Lambda or Kinesis Firehose) to EMR or Redshift for later batch processing.
Edit: Proposed primary key:
device_id as partition-key and incident_date as sort-key, if you know that no two incidents for a given device_id can arrive at the exact same time.
If the above doesn't work, then incident_id as partition-key and incident_date as sort-key.
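The first proposed key could be written down as CreateTable parameters roughly like this. It is a sketch only: the string attribute types and on-demand billing are assumptions, and the dict is built without calling AWS.

```python
def build_incident_table_definition(table_name):
    """Build CreateTable parameters for device_id + incident_date keys."""
    return {
        "TableName": table_name,
        # Only key attributes are declared; other attributes are schemaless.
        "AttributeDefinitions": [
            {"AttributeName": "device_id", "AttributeType": "S"},
            {"AttributeName": "incident_date", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "device_id", "KeyType": "HASH"},       # partition key
            {"AttributeName": "incident_date", "KeyType": "RANGE"},  # sort key
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumed; provisioned also works
    }
```

With this key, "all incidents for a specific device_id" is a plain Query on the table, optionally narrowed to a range of incident_date values, while the other listed queries would still need an index or a scan.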

Storing Chat Log on AWS DynamoDB?

I am thinking of building a chat app with AWS DynamoDB. The app will support 1:1 and group chats.
I want to create one table for each one of the chats, where there is a record for each sent chat text line. Is DynamoDB suitable for this kind of job?
I am also thinking of merging both tables. But is this a good idea, if there are – let's assume – 100k or 1000k users?
I think you may run into problems with the read capacity on your table. The write capacity should be ok, as there are not so many messages coming in per second (e.g. 10 or so), but you'll need to constantly read from it for all users, so that'll be expensive.
If you want to use DynamoDB just as storage and distribute the chat messages over the network like in any normal chat, then it may make sense, depending on your use cases. Assuming you have a hash key UserId and a range key Timestamp, you could query all messages from a specific user during a specific period. If, however, you want to search within the chat text (probably a much more useful feature), then DynamoDB won't work per se. It's not like SQL, where you could do a LIKE '%abc%' query (which isn't a good idea in SQL either).
You're probably better off using S3 as data storage and Elasticsearch as the search tool. If you require the aforementioned use case "get all messages from user X in timespan S" (as a simple example), you could additionally use DynamoDB to store metadata such as UserId, Timestamp, PositionInFile or something like that.
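The "all messages from user X in timespan S" lookup could be expressed as a Query with a BETWEEN condition on the sort key. A minimal sketch that only builds the parameters (low-level client format), assuming Timestamp is stored as a number of epoch seconds; the function name is hypothetical.

```python
def build_messages_query(table_name, user_id, start_ts, end_ts):
    """Build Query parameters for one user's messages in [start_ts, end_ts]."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "UserId = :u AND #ts BETWEEN :start AND :end",
        # Timestamp is on DynamoDB's reserved word list, hence the placeholder.
        "ExpressionAttributeNames": {"#ts": "Timestamp"},
        "ExpressionAttributeValues": {
            ":u": {"S": user_id},
            ":start": {"N": str(start_ts)},
            ":end": {"N": str(end_ts)},
        },
    }
```

BETWEEN is inclusive on both ends, so a message sent exactly at `start_ts` or `end_ts` is returned.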