I am moving from an RDBMS, and I am not sure how best to design for the scenario below.
I have a table with around 200,000 questions, with question ID as the partition key.
Users view questions, and I do not wish to show a question to a user who has already viewed it. Which of these is the better option?
1. Have a table with question ID as the partition key and a set of user IDs as an attribute.
2. Have a table with user ID as the partition key and a set of question IDs they have viewed as an attribute.
3. Have a table with question ID as the partition key and user ID as the sort key; once a user has viewed a question, add a row to this table.
Options 1 and 2 might run into the 400 KB item size limit. The third seems the better option, though I would end up with 100 million items, since there will be one row per user per question viewed; but I assume this is not a problem for DynamoDB?
Another problem is how to get 10 random questions not yet viewed by the user. Do I generate 10 random numbers between 1 and 200,000 (the number of questions) and then check that each one is not in the table from option 3 above?
I definitely would not go with option 1 or 2, for the reason you mentioned: you would already be limiting your scalability by the 400 KB item size limit. With 128-bit (16-byte) UUIDs, you would be limited to roughly 25,000 users per question.
Option 3 is the way to go with DynamoDB, but you need to consider which attribute is the partition key and which is the range key; you could just as well have user_id as the partition key and question_id as the range key. The right choice depends on how your data is going to be accessed. DynamoDB divides the total table throughput evenly across its partitions: each of the n partitions gets 1/nth of the table throughput. So if a subset of partition keys is accessed more than the others, you won't be utilizing your table throughput efficiently, because the partition keys that actually use less than 1/nth of the throughput are still provisioned for 1/nth of it. The general idea is that you want each of your partition keys to be utilized equally. I think you have it the right way around: each question is served at random and is no more popular than any other, while some users might be more active than others, so question_id distributes the load more evenly.
The other part of your question is a bit more difficult to answer. You could do it your way, with a table containing question-user pairs for the questions those users have read, or you could store the pairs for the questions those users haven't read. The trade-off is between the initial write cost and the subsequent read cost, and the answer depends on the number of questions compared to the consumption rate.
When you have a large number of questions compared to the rate at which users progress through them, the chance of randomly selecting an already-viewed question is small, so you want to store have-read question-user pairs. With this setup you don't pay much to initialize a user (you don't have to write a question-user pair for every question), and you won't incur many miss-read costs (i.e., cases where you select a question-user pair and it turns out the user has already read it; such a lookup still consumes read capacity units).
If you have a small number of questions compared to the rate at which users consume them, then you want to store the haven't-read question-user pairs. You pay something to initialize each user (writing one question-user pair per question), but then you never have accidental miss-reads. If you stored have-read pairs when there is a small number of questions, you would encounter a lot of miss-reads as the percentage of read questions approaches 100% (to the point where you would have been better off setting them up as haven't-read pairs).
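To make the have-read variant concrete, here is a minimal Python sketch using boto3; the table name "viewed_questions" and the exact key layout (question_id as partition key, user_id as sort key, matching option 3) are just illustrative assumptions:

    import random
    import boto3

    # Sketch only: assumes a table "viewed_questions" keyed by
    # question_id (partition key) and user_id (sort key).
    dynamodb = boto3.resource("dynamodb")
    viewed = dynamodb.Table("viewed_questions")

    def record_view(question_id, user_id):
        # Write one have-read pair when a user views a question.
        viewed.put_item(Item={"question_id": question_id, "user_id": user_id})

    def pick_unseen(user_id, total_questions=200000, wanted=10):
        # Draw random question ids and keep the ones this user hasn't seen.
        unseen = set()
        while len(unseen) < wanted:
            qid = str(random.randint(1, total_questions))
            resp = viewed.get_item(Key={"question_id": qid, "user_id": user_id})
            if "Item" not in resp:  # a hit here would be the "miss-read" cost
                unseen.add(qid)
        return list(unseen)

Each GetItem is a single read, so the miss-read cost stays small exactly when the question pool is large relative to the consumption rate, as described above.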
I hope this helps with your design considerations. Drop a comment if you need clarification!
I need to create an application that lets me get the phone numbers of users matching specific conditions as fast as possible. For example, we've got four columns in an SQL table: region, income, age, and a fourth with the phone number itself. I want to get the phone numbers from the table for a specific region and income. Just running an SQL query won't do, because it takes a significant amount of time. The database updates once per day, and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers under specific conditions as fast as possible, O(1) in the best case? Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if a particular condition is false and 1 if it is true. But I'm not sure I can implement this for columns with non-boolean values.
Create a vector with the phone numbers.
Create a second vector with the phone numbers' bitsets.
To get phone numbers, iterate over the second vector and compare each bitset with the required one.
This is not O(1) at all, and I still don't know what to do about the non-boolean columns. I thought maybe something good could be done with std::unordered_map (all phone numbers are unique), or my vector-and-masks idea could be improved.
P.S. The SQL table consumes 4 GB and I can store up to 8 GB in RAM. There are 500 columns.
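One way to make this idea O(1): since the data is rebuilt only once a day, precompute a hash map from the filter columns straight to the phone numbers. A rough Python sketch, where pymysql and all table/column names are assumptions:

    from collections import defaultdict
    import pymysql  # assumed driver; any DB-API connector works the same way

    # Daily rebuild: map each (region, income) pair directly to its phone
    # numbers, so a later query is a single O(1) hash lookup.
    lookup = defaultdict(list)
    conn = pymysql.connect(host="localhost", user="app", database="crm")
    with conn.cursor() as cur:
        cur.execute("SELECT region, income, phone FROM users")
        for region, income, phone in cur:
            lookup[(region, income)].append(phone)

    numbers = lookup[("north", 50000)]  # O(1) per query

This also sidesteps the non-boolean-column problem: the raw column values themselves form the hash key, so no bitset encoding is needed.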
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
If you really want it to be fast, I think you should consider Elasticsearch. Think of every phone number in the DB as a document with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use an Elasticsearch filter to find the results.
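A minimal sketch of such a filter query with the official Python client (the index name, field names, and a recent client version are assumptions):

    from elasticsearch import Elasticsearch

    # Assumed index "phones" holding one document per phone number.
    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="phones",
        query={
            "bool": {
                "filter": [  # filters skip scoring and are cacheable
                    {"term": {"region": "north"}},
                    {"term": {"income": 50000}},
                ]
            }
        },
        size=1000,
    )
    phone_numbers = [hit["_source"]["phone"] for hit in resp["hits"]["hits"]]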
Another option is to have an index on every column; in that case the engine will do an index merge to increase performance. I would also consider MEMORY tables, and if you write to this table, consider having a read replica just for reads.
To optimize the table, log your queries somewhere and add a composite index covering just the top X most popular searches, depending on your memory limitations.
You can also use NVMe as your DB disk if you can't load everything into memory.
The Amazon DynamoDB documentation stresses that uniform distribution of the partition key is the most important point in designing a correct database architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow out of a single partition. According to the docs:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
the partition calculation formula is
(readCapacityUnits / 3,000) + (writeCapacityUnits / 1,000) = initialPartitions (rounded up)
So you need a demand of more than 1,000 writes per second (for 1 KB items) to grow out of one partition. By my calculation, most small applications don't even need the default 5 writes per second; 1 is enough. (To be precise, you can also outgrow a single partition if your data exceeds 10 GB, but that too is a big number.)
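As a quick sanity check of that formula (plain arithmetic, not an AWS API):

    import math

    def initial_partitions(read_capacity_units, write_capacity_units):
        # The AWS formula quoted above, rounded up.
        return math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)

    print(initial_partitions(5, 5))        # 1: a small app stays in one partition
    print(initial_partitions(3000, 1000))  # 2: only heavy throughput splits it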
The question becomes more important when you realize that every additional index requires its own additional writes-per-second allocation.
Just imagine: I have some data related to a particular user, for example "posts".
I create a "posts" table and then, following the Amazon guidelines, choose this key format:
partition: id // post id, e.g. a uuid
sort: // not needed
Since no two posts share the same id, we don't need a sort key here. But then you realize that the most common operation is requesting the list of posts for a particular user, so you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires its own read/write capacity units, so the cost of this decision is doubled!
On the other hand, keeping in mind that you have only one partition anyway, you could have had this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
So the question is: have I missed something? Maybe a partition key is much more effective than a sort key, even inside a single partition?
Addendum: you may say, "OK, having userId as the partition key for posts is fine now, but with 100,000 users in your app you'll run into scaling trouble." In reality, though, trouble can arise only in a "transition" phase, when you have just a few partitions and one group of active users' posts all lands in one partition while the inactive ones sit in another. Once you have thousands of users, you naturally have many users with active posts; the impact of any single user is negligible, and statistically their posts are spread evenly across many partitions by the law of large numbers.
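For reference, the single-table layout argued for above would be created something like this (a boto3 sketch; the table and attribute names come from the example):

    import boto3

    # userId as partition key, post id as sort key: one table, no GSI,
    # so no doubled read/write capacity.
    client = boto3.client("dynamodb")
    client.create_table(
        TableName="posts",
        KeySchema=[
            {"AttributeName": "userId", "KeyType": "HASH"},  # partition key
            {"AttributeName": "id", "KeyType": "RANGE"},     # sort key
        ],
        AttributeDefinitions=[
            {"AttributeName": "userId", "AttributeType": "S"},
            {"AttributeName": "id", "AttributeType": "S"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )

Fetching a user's posts is then a single Query on the userId partition key, with no secondary index to pay for.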
I think it's absolutely fine if you make sure you won't exceed the partition limits, whether through increased RCU/WCU or through growth of your data. Moreover, the best practices say:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.
I am considering taking advantage of sparse indexes, as described in the AWS guidelines. In the example described:
... in the GameScores table, certain players might have earned a particular achievement for a game - such as "Champ" - but most players have not. Rather than scanning the entire GameScores table for Champs, you could create a global secondary index with a partition key of Champ and a sort key of UserId.
My question is: what happens when the number of Champs becomes very large? I suppose the "Champ" partition would become very large and you would start to experience uneven load distribution. To get uniform load distribution, would I need to randomize the "Champ" value by (effectively) sharding it over n shards, e.g. Champ.0, Champ.1, ..., Champ.99?
Alternatively, is there a different access pattern that can be used when fetching entities by a specific attribute value, where the number of matching entities may grow large over time?
This is exactly the solution you need (Champ.0, Champ.1, ..., Champ.N).
N should be the expected number of partitions for this index plus some growth headroom (if you expect high load or many 'champs', you might choose N = 200) so that you get a good hash distribution over the partitions. I recommend computing the shard number as userId modulo N; keeping it deterministic lets you do some manipulations by userId later.
We also use this solution when the hash key is a boolean (in DynamoDB you can represent a boolean as a string): in that case the hash values are "true.0", "true.1", ..., "true.N", and likewise for "false".
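A sketch of that write-sharding scheme (the shard count and key format are illustrative):

    N_SHARDS = 200  # expected partitions plus growth headroom

    def sharded_champ_key(user_id: int) -> str:
        # Modulo on userId keeps the mapping deterministic: the same user
        # always lands in the same shard, so no extra lookup state is needed.
        return f"Champ.{user_id % N_SHARDS}"

    # Reading every champ means N parallel Query calls, one per shard key.
    all_shard_keys = [f"Champ.{i}" for i in range(N_SHARDS)]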
I have a db.r3.2xlarge instance with 4,000 PIOPS, and I'm inserting about 1 billion rows from EC2 instances. There is around 40 GB of free RAM right now.
Currently, out of the 4,000 PIOPS, reads are taking 3,000 and I'm only getting 1,000 write PIOPS, so writing has been slow.
How do I check what is consuming the read PIOPS? And how do I speed things up?
Thank you.
Edit:
insert ignore into dna (hash, time, song_id) values (b%s, b%s, %s)
I'm using self.cursor.executemany(query, rows) from Python.
hash + time + song_id is a composite primary key.
I'm using InnoDB on AWS RDS.
I have 4,000 PIOPS; however, it is now stuck at 2,000 total. I'm getting 60 MB/s of write throughput.
If the hash is your primary key, or is indexed, you're not inserting in primary key and/or index order.
Also, you're using INSERT IGNORE, which suggests you are trying to avoid otherwise inevitable duplicate-key errors because there is duplicate data among the rows you're inserting.
For both of these reasons, InnoDB has to do a lot of reading to load the appropriate pages from the tablespaces on disk into memory and find the spot(s) in the primary and/or any secondary indexes where each next row needs to go. That effort may turn out to be wasted if the row is a duplicate, and it may require a page split so that space is available to insert the next, effectively random, hash value into its proper place.
If hash is the primary key, it would probably be to your advantage to drop all the other indexes while inserting and then add them back at the end, where they can be built more efficiently.
Pre-sorting the inserts by hash should also help somewhat, if the batches are large enough and hash is indeed the leading column of the primary key.
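A sketch of that pre-sorting suggestion around the executemany call from the question (assuming rows holds (hash, time, song_id) tuples matching the query):

    # The original query, unchanged; only the batch is sorted first so that
    # InnoDB visits primary-key pages in roughly sequential order.
    query = "insert ignore into dna (hash, time, song_id) values (b%s, b%s, %s)"

    def insert_batch(cursor, rows):
        # Tuples sort by hash first, then time, then song_id: the PK order.
        cursor.executemany(query, sorted(rows))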
I'm currently developing a strategy for incrementally updating our user data. We assume 100,000,000 records in our database, of which approximately 1,000,000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed store (e.g. Cassandra) so current records can be accessed randomly? Or is it preferable to retrieve the data from HDFS and join the new information to the existing records?
The record size is O(200 bytes). The user data has a fixed length but should be extensible. The log events have a similar, but not identical, structure. The number of user records is likely to grow. Near-real-time updates are desirable; a 3-hour time gap is not acceptable, but a few minutes is OK.
Do you have any experience with either of these strategies and with data of this size?
Is the Pig JOIN fast enough? Is reading all the records always a bottleneck? Can Cassandra hold this amount of data efficiently? Which solution is scalable? And what about the complexity of the system?
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are the records fixed length, with a fixed number of fields, or likely to change format over time? Are we talking about 100-byte records or 100,000-byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work with map/reduce. Will the number of user records stay at 100 million (one server will probably suffice), or will it grow by 100% per year (probably multiple servers, with new ones added over time)?
How you access records for updating depends on whether you need to update them in real time or whether you can run a batch job. Will updates happen every minute, every hour, or every month?
I would strongly suggest you do some experimenting. Have you done any testing already? That will give you a context for your questions and lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution from your question alone.