I'm pretty new to DynamoDB and to AWS as a whole. It's very exciting, but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a React Native mobile app that stores one row in a DynamoDB table each time a user runs a search (the table is a search history: a UUID plus the search criteria). On average we add a few thousand new searches to the table every day. The table has just a primary key, which is the search ID.
The app is quite new, but we are already reaching a few hundred thousand rows in the table and can expect a million in the coming months. The data is plain and simple: a unique ID, with strings and numbers in the other attributes. No connections, no relationships, etc. That's when I started to feel that DynamoDB may not have been the best choice, but I still read everywhere that it can be suitable for anything if properly managed.
Next to this there is a web app dashboard which, through a REST API built on Node.js Lambdas, queries DynamoDB to produce statistics about the searches: how many searches per day, a list of the latest searches, and so on. The problem is that DynamoDB is not really suited to querying hundreds of thousands of items (the 1 MB limit, query limitations, credits...).
When I do a scan I get only 3000 searches. I tried looping on the scan using the last index requested, but after a few tests I did not get any data and I hit the maximum throughput. It seems clear that I don't have the right approach to bring all these searches to my web app. So what would be the right approach? My ideas are the following, but I am open to more experienced ones:
Switching to a SQL database (using the AWS migration?). Will it really be easier then?
Creating Lambdas that run as scheduled jobs every night to compute daily statistics, so that I don't have to query the full database all the time but just the most recent searches and the statistics rows? Is it doable? Any Node.js / Lambda tutorial you may know of regarding this?
Better management of indexes? I am still very lost regarding those.
Looking forward to your opinions.
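For reference, the scan loop described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: `scan_all`, `page_limit` and `pause_seconds` are hypothetical names, and `scan` is assumed to be any callable that behaves like the DynamoDB Scan operation (it accepts `Limit` and `ExclusiveStartKey` and returns `Items` plus an optional `LastEvaluatedKey`). It is written in Python for brevity; the same request and response fields exist in the Node.js SDK.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional


def scan_all(scan: Callable[..., Dict[str, Any]],
             page_limit: int = 500,
             pause_seconds: float = 0.0) -> Iterator[Dict[str, Any]]:
    """Yield every item of a table by following LastEvaluatedKey.

    `scan` is assumed to behave like the DynamoDB Scan operation.
    """
    start_key: Optional[Dict[str, Any]] = None
    while True:
        kwargs: Dict[str, Any] = {"Limit": page_limit}
        if start_key is not None:
            # Resume exactly where the previous page stopped.
            kwargs["ExclusiveStartKey"] = start_key
        page = scan(**kwargs)
        # A page may be empty and still carry a LastEvaluatedKey.
        yield from page.get("Items", [])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break  # No key left: the whole table has been read.
        if pause_seconds > 0:
            # Optional pause between pages to spread out read consumption.
            time.sleep(pause_seconds)
```

With boto3, something like `scan_all(boto3.resource("dynamodb").Table("searches").scan)` should work, where the table name `searches` is an assumption. Note that the loop only stops when `LastEvaluatedKey` is absent; an empty page on its own does not mean the scan is finished.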
Add another layer to take care of full-text search.
For example, Elasticsearch, Algolia, or something similar.
Notes:
Elasticsearch may cost you a lot compared to the cost of DynamoDB.
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
I'll preface all of this by saying I'm still actively learning DynamoDB, and I think an answer to my question will help me understand a few things.
I have an analytics microservice that pushes custom (internal) analytics events into a DynamoDB table. Columns in our Dynamo rows/items include data like:
User ID
IP Address
Event Action
Timestamp
Split Test ID
Split Test Value
One of the main questions we want to pull from this db is:
"How many users saw split test x with values y?"
I'm struggling to understand how I should index my database to support this kind of request. I set up a “Keys Only” index targeting Split Test ID, and the queries against it are fairly efficient, but they only return User ID and Split Test ID. Ideally I want an efficient query that returns multiple other associated values as well…
How do I achieve this? Do I need to be doing something very different? Additionally, if any of my understanding of Dynamo, based on my explanations, sounds completely lacking in some regard, please point me in the right direction!
You're thinking of DynamoDB as a schema-less database, which it obviously is. However, that does not mean that a schema is not important. Schemas in NoSQL databases are usually more important than they are in SQL databases and they are usually less straightforward.
The most important factor in deciding how to store your data is how you will access it. You have to take into account all the ways in which you will want to access your data and make sure each of them is possible by creating the necessary data columns and indexes. In this case, if you want to know how many times two values are combined in a certain way, you could add a column that holds the combined values (e.g., splitId#splitValue) and use that in your indexes.
If you want to know more about advanced patterns and such, I advise you to watch this pretty famous re:Invent talk by Rick Houlihan or to read the DynamoDB book.
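To illustrate the combined attribute, here is a minimal sketch. The attribute names `SplitKey`, `SplitTestId`, `SplitTestValue` and `UserId`, and all function names, are assumptions made for the example and do not come from the question. `count_users` works on items already fetched (for instance from a query on an index keyed by the combined attribute) and counts distinct users.

```python
from typing import Any, Dict, Iterable, Set


def split_key(split_id: str, split_value: str) -> str:
    """Build the combined value, e.g. 'checkout#blue'."""
    return f"{split_id}#{split_value}"


def with_split_key(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event with the combined attribute added.

    Attribute names here are hypothetical; adapt them to your items.
    """
    item = dict(event)
    item["SplitKey"] = split_key(event["SplitTestId"], event["SplitTestValue"])
    return item


def count_users(items: Iterable[Dict[str, Any]], key: str) -> int:
    """Count distinct users among items whose SplitKey equals `key`."""
    users: Set[str] = {i["UserId"] for i in items if i.get("SplitKey") == key}
    return len(users)
```

The item would be written with `SplitKey` already computed, so that an index on that attribute can answer "split test x with value y" with a single key lookup instead of a filter over the whole table.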
As a last note, I want to add that switching to a SQL server usually is not the solution. Picking NoSQL over SQL is usually based on non-functional requirements. There is a reason NoSQL databases are used in applications that require very low-latency retrieval of data in huge datasets, but as with everything, trade-offs are the name of the game.
Hi, I'm trying to query a table in DynamoDB. However, from what I have read, I can only do it using code or from the CLI. Is there a way to do complex queries from the GUI? I tried playing with it but can't seem to figure out how to do a simple COUNT(*). Please help.
Go to the DynamoDB console
Select the table that you want to count
Go to the "Overview" page/tab
In table properties, click on Manage Live Count
Click Start Scan
This will give you the count of items in the table at that moment. Just be warned that this count is eventually consistent, which means that if someone is changing the table at that exact moment, your end result will not be exact (but probably very close to reality).
Digressing a little bit (only in case you're new to DynamoDB):
DynamoDB is a NoSQL database. It doesn't support the same commands that are common in SQL databases, mainly because it doesn't support the same consistency model provided by SQL databases.
In SQL databases, when you send a count(*) query, your RDBMS makes some very educated guesses and takes some shortcuts to discover the number of rows in the table. It does that because reading your entire table to give you this answer would take too much time.
DynamoDB doesn't have the means to make these educated guesses. When you want to know how many items a table has, its only option is to read all of them, counting one by one. That is exactly what the command mentioned at the beginning of this answer does: it scans the entire table, counting the items one by one.
Because of that, when you perform this task you will be billed for reading the entire table (DynamoDB bills you per read and write). And maybe, after you have started the scan, someone puts another item in the table while you are still counting. In that case it will not restart the count, because by design DynamoDB is eventually consistent.
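The same count can be done from code. The sketch below is only illustrative: `count_items` is a hypothetical name, and `scan` is assumed to be a callable that behaves like the DynamoDB Scan operation. It asks for `Select="COUNT"`, so each page returns a `Count` instead of the items themselves, and it keeps going while a `LastEvaluatedKey` is returned.

```python
from typing import Any, Callable, Dict, Optional


def count_items(scan: Callable[..., Dict[str, Any]]) -> int:
    """Count all items by scanning page by page with Select=COUNT.

    `scan` is assumed to behave like the DynamoDB Scan operation.
    """
    total = 0
    start_key: Optional[Dict[str, Any]] = None
    while True:
        kwargs: Dict[str, Any] = {"Select": "COUNT"}
        if start_key is not None:
            kwargs["ExclusiveStartKey"] = start_key
        page = scan(**kwargs)
        total += page["Count"]  # Items counted in this page only.
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            return total  # No more pages.
```

This returns less data over the network, but it still reads every item, so it is billed like a full table scan.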
I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering the Logstash/Elasticsearch/Kibana combination, but we have additional data on our users stored in a Redshift database. So in addition to the mobile data, I would like to pull in additional data from Redshift about the user at the time of the interaction with the mobile app.
However, I've read in some places that doing an actual database query through Logstash isn't feasible, but that you can use a dictionary file to look up each user.
I have two questions regarding this approach:
Is there a limit to how large this lookup file can be? Mine would be < 500K records, so I'd imagine it would be fine?
Can the process of making the lookup file from Redshift tables be fully automated (ideally through AWS services)? I.e., each night the lookup table is refreshed and posted to Logstash, and then used for breakouts in Kibana.
The way we're currently doing it is processing a daily JSON file with a Lambda function, posting it to S3 and then reading it into a Redshift table. This data is then processed into sessions and joined with other tables to generate the final dataset used for visualization. This is currently done in Tableau, but we are exploring other options (such as QuickSight, or possibly the ELK stack).
Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.
Thanks!
Logstash 7 has a jdbc_streaming filter plugin for dynamically adding stuff to your events, as well as the jdbc_static filter for static stuff.
As you found, you can also use the translate filter. The documentation says they've tested "very large" datasets up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (e.g. cron) without restarting Logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be updated.
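As a rough sketch of the publishing side, a scheduled job could turn the rows exported from Redshift into a JSON dictionary file for the translate filter's `dictionary_path`. Everything below is an assumption made for illustration: the function name `write_lookup_file`, the `(user_id, value)` row shape, and the file name. How the rows are fetched from Redshift is left out. The file is written to a temporary name and then swapped in, so Logstash should never see a half-written dictionary.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Tuple


def write_lookup_file(rows: Iterable[Tuple[str, str]], path: str) -> int:
    """Write a {user_id: value} JSON dictionary and replace `path` atomically.

    `rows` is assumed to be (user_id, value) pairs already read from the
    warehouse. Returns the number of entries written.
    """
    mapping: Dict[str, str] = {str(uid): str(value) for uid, value in rows}
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(mapping, handle)
        os.replace(tmp_path, path)  # Swap the new file in one step.
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return len(mapping)
```

Run from cron or a nightly job, this would be called with the fresh rows and the path that the translate filter is configured to read.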
I am essentially trying to build a website where members can post blog entries, and I want to record unique and overall page views for the different posts, in absolute terms as well as over different time frames, e.g., last 24h, last week, etc.
My initial approach was to use the date as primary key and the blogPostId as secondary key; I could then add all the posts visited during a given day. If I then include the userIds as an attribute, I should be able to get a) unique page views and b) overall page views (which might include duplicate visits by a specific user) for a given day. Finally, I would pull the primary key for, let's say, the last 7 days and extract the most popular post.
As far as I can tell this should work fine as long as there aren't too many entries; however, I'm sceptical whether this will scale. More specifically, if the number of blog posts increases a lot for a given interval, or if I want to find the all-time most viewed post, I'd essentially have to read the whole table.
Does anyone have an idea how I could implement this more efficiently?
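To make the layout described above concrete, here is a small in-memory sketch of it. It is only a stand-in for the table, not a DynamoDB client: each `(day, post_id)` pair plays the role of one item, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Optional, Set, Tuple

Key = Tuple[str, str]  # (day, post_id), mirroring date + blogPostId


class PageViews:
    """In-memory stand-in for a table keyed by (day, post_id)."""

    def __init__(self) -> None:
        self.total: DefaultDict[Key, int] = defaultdict(int)
        self.users: DefaultDict[Key, Set[str]] = defaultdict(set)

    def record(self, day: str, post_id: str, user_id: str) -> None:
        self.total[(day, post_id)] += 1          # Overall views.
        self.users[(day, post_id)].add(user_id)  # Set keeps users unique.

    def overall(self, day: str, post_id: str) -> int:
        return self.total.get((day, post_id), 0)

    def unique(self, day: str, post_id: str) -> int:
        return len(self.users.get((day, post_id), set()))

    def most_popular(self, days: List[str]) -> Optional[str]:
        """Post with the most overall views across the given days."""
        per_post: Dict[str, int] = defaultdict(int)
        for (day, post_id), views in self.total.items():
            if day in days:
                per_post[post_id] += views
        if not per_post:
            return None
        return max(per_post, key=lambda post: per_post[post])
```

Note that `most_popular` has to look at every entry for the requested days, which is the scaling concern raised above: the work grows with the number of posts viewed in the interval.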
DynamoDB will almost certainly work for you, and if you need an excuse to use it, by all means give it a try. If you get a ton of traffic it might end up being expensive.
Personally, I would consider using Redis for what you are asking to do, and here is a pretty good/detailed question/answer on how you might implement it:
Scalable way of logging page request data from a PHP application?
DynamoDB can be used to iterate and create this feature quickly.
Nonetheless, this is a use case for Amazon Kinesis Data Streams, which lets you ingest data and then manipulate it to suit your needs.
Be aware that Kinesis can be expensive if you are trying to be as frugal as possible.
But if you start receiving a lot of traffic, Kinesis will work as a queue and let you manipulate the data before ingesting it into DynamoDB (or another data store), which will be cheaper than sending all those write requests.
Another limitation to be aware of is that DynamoDB returns at most 1 MB per Query.
Amazon recommends using Redshift to handle all those operations, as it is better suited to aggregation and calculation across data warehouses.
I am thinking of building a chat app with AWS DynamoDB. The app will support 1:1 and group chats.
I want to create one table for each one of the chats, where there is a record for each sent chat text line. Is DynamoDB suitable for this kind of job?
I am also thinking of merging both tables. But is this a good idea, if there are – let's assume – 100k or 1000k users?
I think you may run into problems with the read capacity on your table. The write capacity should be ok, as there are not so many messages coming in per second (e.g. 10 or so), but you'll need to constantly read from it for all users, so that'll be expensive.
If you want to use DynamoDB just as storage and distribute the chat messages over the network like in any normal chat, then it may make sense, depending on your use cases. Assuming you have a hash key UserId and a range key Timestamp, you could query all messages from a specific user during a specific period. If, however, you want to search within the chat text (probably a much more useful feature), then DynamoDB won't work per se. It's not like SQL, where you could do a LIKE '%abc%' query (which isn't a good idea in SQL either).
You're probably better off using S3 for data storage and Elasticsearch as the search engine. If you require the aforementioned use case "get all messages from user X in timespan S" (as a simple example), you could additionally use DynamoDB to store metadata, such as UserId, Timestamp, PositionInFile or something like that.
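As an illustration of the "messages from one user during a period" query, the sketch below only builds the request parameters in the shape the low-level DynamoDB Query operation expects. It assumes UserId is a string hash key and Timestamp is a numeric range key (for example epoch seconds); the function name and the table name used with it are hypothetical.

```python
from typing import Any, Dict


def messages_query(table_name: str, user_id: str,
                   start_ts: int, end_ts: int) -> Dict[str, Any]:
    """Build Query parameters for one user's messages in [start_ts, end_ts].

    Assumes UserId (string) is the hash key and Timestamp (number) is the
    range key; both assumptions must match your actual table.
    """
    return {
        "TableName": table_name,
        # Placeholders (#u, #t) avoid clashes with reserved words.
        "KeyConditionExpression": "#u = :user AND #t BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#u": "UserId", "#t": "Timestamp"},
        "ExpressionAttributeValues": {
            ":user": {"S": user_id},
            # Numbers are sent as strings in the low-level format.
            ":start": {"N": str(start_ts)},
            ":end": {"N": str(end_ts)},
        },
    }
```

With boto3 this dictionary could be passed as `boto3.client("dynamodb").query(**messages_query("chat-messages", "user-42", 1600000000, 1600086400))`. Because the condition uses only key attributes, DynamoDB reads just the matching items rather than the whole table.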