Show top 10 scores for a game using DynamoDB - amazon-web-services

I created a table "scores" in DynamoDB to store the scores of a game.
The table has the following attributes:
uuid
username
score
datetime
Now I want to query the "top 10" scores for the game (leaderboard).
In DynamoDB, it is not possible to create an index without a partition key.
So, how do I perform this query in a scalable manner?

No. You will always need the partition key. DynamoDB is able to provide the required speed at scale because it stores records with the same partition key physically close to each other instead of on 100 different disks (or SSDs, or even servers).
If you have a well-defined querying use case (which is what DynamoDB was designed for), e.g. "top 10 monthly scores", then you can derive a month attribute from datetime (12/01/2017, 01/01/2018, and so on) and use it as your partition key, so all the scores generated in the same month get "bucketized" into the same partition for the best lookup performance. You can then keep score as the sort key.
You can of course have other tables (or secondary indexes) for weekly and daily scores. Or you can even choose the most granular bucket and sum up the results yourself in your application code. You'll probably need to pull a lot more than 10 records to get close enough to 100% accuracy at the aggregate level... I don't know. Test it. I wouldn't rely on it if I were dealing with money/commissions, but for game scores, who cares :)
Note: If you decide to go this route, then instead of using "12/01/2017" etc. as the partition values I'd use integer offsets, e.g. the UNIX epoch (rounded down to the first midnight of the month), to make it easier to compute/code against. I used the friendly dates to better illustrate the approach.
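Here is a minimal boto3 sketch of that monthly-bucket query (the GSI name "month-score-index" and the derived "month" attribute are assumptions for illustration, not something from the question; score is assumed to be a Number so the ordering is numeric):

from datetime import datetime, timezone

import boto3  # AWS SDK for Python
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("scores")

def month_bucket(ts: datetime) -> int:
    # Partition value: UNIX epoch of the first midnight of the month.
    first = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return int(first.timestamp())

def top_10_this_month() -> list:
    # Assumes a GSI keyed on the derived month bucket with score as the sort key.
    resp = table.query(
        IndexName="month-score-index",
        KeyConditionExpression=Key("month").eq(month_bucket(datetime.now(timezone.utc))),
        ScanIndexForward=False,  # highest scores first
        Limit=10,
    )
    return resp["Items"]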

Application for filtering a database in a short period of time

I need to create an application that would allow me to get phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table with specific region and income. Just running a SQL query won't help because it takes a significant amount of time. The database updates once per day, and I have some time to prepare the data as I wish.
The question is: How would you make the process of getting phone numbers with specific conditions as fast as possible. O(1) in the best scenario. Consider storing values from sql table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if a particular condition is false and 1 if the condition is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers - iterate over the second vector and compare each bitset with the required one.
It's not O(1) at all. And I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique), or to improve my idea with vectors and masks.
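A minimal Python sketch of that bitmask idea (the column names and predicate buckets are made up for illustration; non-boolean columns are handled by giving each interesting bucket its own bit):

# Each row gets an integer bitmask; bit i is set if predicate i holds for it.
PREDICATES = [
    lambda row: row["region"] == 1,
    lambda row: 50_000 <= row["income"] < 80_000,  # an income bucket gets its own bit
    lambda row: row["age"] >= 18,
]

def build_masks(rows):
    phones, masks = [], []
    for row in rows:
        mask = 0
        for i, pred in enumerate(PREDICATES):
            if pred(row):
                mask |= 1 << i
        phones.append(row["phone"])
        masks.append(mask)
    return phones, masks

def lookup(phones, masks, required):
    # Still an O(n) scan: keep phones whose mask contains all required bits.
    return [p for p, m in zip(phones, masks) if m & required == required]

rows = [{"region": 1, "income": 60_000, "age": 30, "phone": "555-0100"}]
phones, masks = build_masks(rows)
print(lookup(phones, masks, required=0b011))  # region == 1 AND that income bucket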
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
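A minimal sketch of that suggestion, using SQLite for brevity (the table and column names here are assumptions; the real engine may differ, but the composite index idea is the same):

import sqlite3

conn = sqlite3.connect("phones.db")  # hypothetical database file

# Composite index on the columns you filter by.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_region_income ON users (region, income)"
)

rows = conn.execute(
    "SELECT phone FROM users WHERE region = ? AND income BETWEEN ? AND ?",
    (1, 50_000, 80_000),
).fetchall()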
If you really want it to be fast I think you should consider ElasticSearch. Think of every phone in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in realtime) but when it's time to search you just use the filter of ElasticSearch to find the results.
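A sketch of such a filter against the Elasticsearch _search API (the index name, field names, and local endpoint are assumptions):

import requests  # plain HTTP against the _search endpoint

query = {
    "query": {
        "bool": {
            "filter": [  # filter context: no scoring, results are cacheable
                {"term": {"region": 1}},
                {"range": {"income": {"gte": 50000, "lte": 80000}}},
            ]
        }
    },
    "_source": ["phone"],
    "size": 10000,
}
resp = requests.post("http://localhost:9200/phones/_search", json=query)
phones = [hit["_source"]["phone"] for hit in resp.json()["hits"]["hits"]]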
Another option is to have an index on every column. In this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY tables. If you write to this table, consider having a read replica just for reads.
To optimize your table, save your queries somewhere and add indexes (on multiple columns) just for the top X most popular searches, depending on your memory limitations.
You can use NVMe as your DB disk (if you can't load the data into memory).

Azure SQL DW table not using all the distributions on the compute nodes to store data

One of the Fact tables in our Azure SQL DW (stores the train telemetry data) is created as a HASH distributed table (HASH key is VehicleDimId – integer field referencing the Vehicle Dimension table). The total number of records in the table are approx. 1.3 billion.
There are 60 unique VehicleDimId values in the table (i.e. we have data for 60 unique vehicles), which means there are 60 unique hash keys as well. Based on my understanding, I expect the records corresponding to these 60 unique hash key values to be distributed across the 60 available distributions (one hash key per distribution).
However, currently all the data is distributed across just 36 distributions, leaving the other 24 distributions with no records. In effect, that is just 60% usage of the available distributions. Changing the Data Warehouse scale does not have any effect, as the number of distributions remains the same at 60. We are currently running our SQL DW at the DW400 level. Below are the compute-node-level record counts of the table.
You can see that the data is not evenly distributed across compute nodes (which is due to the data not being distributed evenly across the underlying distributions).
I am struggling to understand what I need to do to get the SQL DW to use all the distributions rather than just 60% of them.
Hash distribution takes a hash of the binary representation of your distribution key then deterministically sends the row to the assigned distribution. Basically an int value of 999 ends up on the same distribution on every Azure SQL DW predictably. It doesn't look at your specific 60 unique vehicle IDs and evenly divide them.
The best practice is to choose a field (best if it is used in joins or group bys or distinct counts) which has at least 600 (10x the number of distributions) fairly evenly used values. Are there other fields that meet that criteria?
To quote from this article adding some emphasis:
Has many unique values. The column can have some duplicate values. However, all rows with the same value are assigned to the same distribution. Since there are 60 distributions, the column should have at least 60 unique values. Usually the number of unique values is much greater.
If you only have 60 distinct values your likelihood of ending up with even distribution is very small. With 10x more distinct values your likelihood of achieving even distribution is much higher.
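A quick back-of-the-envelope simulation of that point (this is not the engine's actual hash function, just any deterministic hash): hashing 60 distinct keys into 60 buckets typically leaves about a third of the buckets empty, which is consistent with seeing only 36 of 60 distributions populated.

import hashlib
import random

# Expected non-empty buckets for 60 keys into 60 buckets:
# 60 * (1 - (59/60)**60), roughly 38.
vehicle_ids = random.sample(range(1, 100_000), 60)  # 60 arbitrary distinct ids
buckets = {
    int(hashlib.md5(str(v).encode()).hexdigest(), 16) % 60 for v in vehicle_ids
}
print(f"{len(buckets)} of 60 buckets are non-empty")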
The fallback is to use round robin distribution. Only do this if there are no other good distribution keys which produce even distribution and which are used in queries. Round robin should achieve optimal loading performance but query performance will suffer because the first step of every query will be a shuffle.
In my opinion, concatenating two columns together (as Ellis' answer suggests) to use as the distribution key is usually a worse option than round robin distribution unless you actually use the concatenated column in group bys or joins or distinct counts.
It is possible that keeping the current Vehicle ID distribution is the best choice for query performance since it will eliminate a shuffle step in many queries that join or group on Vehicle ID. However the load performance may be much worse because of the heavy skew (uneven distribution).
Another option is to create a concatenated join key: concatenate two different keys to get a much higher cardinality than you have now (60 × the cardinality of the second key, which should generally land in the thousands or greater). The caveat is that this key then needs to be referenced in every join so that the work is done on each node. Then, when you hash on this key, you'll get a more even spread.
The only downside is that you have to propagate this concatenated key to the dimension table as well and ensure that your join conditions include this concatenated key until the last query. As an example, you keep the surrogate key in the subqueries and remove it only in the top level query to force collocated joins.
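As a rough sketch of that approach (the CTAS pattern below is standard Azure SQL DW syntax, but the table, column, and DSN names are hypothetical, and TelemetryDate stands in for whichever second column you concatenate):

import pyodbc  # assumes an ODBC connection to the SQL DW instance

conn = pyodbc.connect("DSN=sqldw", autocommit=True)  # hypothetical DSN

# Rebuild the fact table hash-distributed on a derived, higher-cardinality key.
# The same VehicleDateKey must also be added to the dimension table and used in
# the join conditions so the joins stay collocated.
conn.execute("""
CREATE TABLE dbo.FactTelemetry_New
WITH
(
    DISTRIBUTION = HASH(VehicleDateKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
    f.*,
    CAST(f.VehicleDimId AS varchar(10)) + '_'
        + CONVERT(varchar(8), f.TelemetryDate, 112) AS VehicleDateKey
FROM dbo.FactTelemetry AS f;
""")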

Is there any real sense in uniform distributed partition keys for small applications using DynamoDB?

The Amazon DynamoDB documentation stresses that uniform distribution of partition keys is the most important point in creating a correct DB architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow out of one partition. That is, according to the doc:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need a demand of more than 1,000 writes per second (for 1 KB items) to grow beyond one partition. But according to my calculations, for most small applications you don't even need the default 5 writes per second - 1 is enough. (To be precise, you can also outgrow one partition if your data exceeds 10 GB, but that's also a big number.)
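Plugging numbers into that formula (a trivial helper; the function name is mine):

import math

def initial_partitions(read_capacity_units: int, write_capacity_units: int) -> int:
    # Initial partitions per the guidelines formula quoted above.
    return math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)

print(initial_partitions(5, 5))        # 1 -- a typical small table stays in one partition
print(initial_partitions(3000, 1000))  # 2 -- it takes serious throughput to split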
The question becomes more important when you realize that creating any additional index requires an additional writes-per-second allocation.
Just imagine I have some data related to a particular user, for example "posts".
I create a "posts" table and then, according to the Amazon guidelines, I choose the following key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation you have is requesting the list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of such a decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could have used this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
So the question is: have I missed something? Maybe the partition key is much more effective than the sort key even inside one partition?
Addition: you may say "OK, having userId as the partition key for posts is fine now, but when you have 100,000 users in your app you'll run into trouble with scaling." But in reality the trouble can only appear in some "transition" case - when you have just a few partitions, with one group of active users' posts all in one partition and the inactive ones in the other. If you have thousands of users, it's natural that many of them have active posts; the impact of any single user is negligible, and statistically their posts are evenly distributed across many partitions thanks to the law of large numbers.
I think it's absolutely fine as long as you make sure you won't exceed the partition limits by increasing RCU/WCU or through growth of your data. Moreover, the best practices say:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.
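For illustration, here is a boto3 sketch of the userId/id layout discussed in the question, provisioned at the 1 RCU / 1 WCU the question argues is enough (attribute types are assumed to be strings):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# userId as partition key, post id as sort key -- one table, no secondary index.
table = dynamodb.create_table(
    TableName="posts",
    KeySchema=[
        {"AttributeName": "userId", "KeyType": "HASH"},
        {"AttributeName": "id", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "userId", "AttributeType": "S"},
        {"AttributeName": "id", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
)
table.wait_until_exists()

# "All posts of a user" is a single Query on the primary key.
posts = table.query(KeyConditionExpression=Key("userId").eq("user-123"))["Items"]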

Storing Time Series in AWS DynamoDb

I would like to store 1M+ different time series in Amazon's DynamoDb database. Each time series will have about 50K data points. A data point is comprised of a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve time series (usually a whole series) from time to time, for analytics.
How should I structure the database? Should I create a separate table for each timeseries? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In Dynamo, I can think of a few viable designs. In the first, you could use one table with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO 8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering), and there would be an extra attribute on each item: a 'value'. This gives you the ability to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and a rangeKey BETWEEN clause).
However, your main problem is the "hotspot" problem: internally, Dynamo will partition your data by hashKey and will disperse your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored it might take much longer to read, as your single partition will throttle you much more heavily.
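A boto3 sketch of that first design (the table name "timeseries" and attribute names are assumptions; values go in as Decimal because DynamoDB's Number type does not accept Python floats):

from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("timeseries")

def put_point(series: str, ts_iso: str, value: float) -> None:
    # ts_iso like "2013-05-01T14:05:00Z"; ISO 8601 strings sort chronologically.
    table.put_item(
        Item={"series": series, "ts": ts_iso, "value": Decimal(str(value))}
    )

def read_range(series: str, start_iso: str, end_iso: str) -> list:
    resp = table.query(
        KeyConditionExpression=Key("series").eq(series)
        & Key("ts").between(start_iso, end_iso)
    )
    return resp["Items"]  # large ranges come back paginated (LastEvaluatedKey)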
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try to spread your time series across 10 tables, where "highly read" data goes in table 1, "almost never read" data in table 10, and everything else somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at a high degree of complexity in your design. Overall, it's probably not worth it: where do you put new time series? How do you remember where they all are? How do you move a time series?
From my own experience, I think DynamoDB supports some internal "bursting" on these kinds of reads, so it's possible my numbers are off and you will get adequate performance. However, my verdict is to look into Redshift.
How about dumping each time series into JSON or similar and storing it in S3? At most you'd need a lookup from somewhere like Dynamo.
You may still need Redshift to process your inputs.

Reducing query time in table with unsorted timeranges

I asked a question regarding this matter some days ago, but I'm still wondering how to tune the performance of this query.
I have a table looking like this (SQLite)
CREATE TABLE ZONEDATA (
TIME INTEGER NOT NULL,
CITY INTEGER NOT NULL,
ZONE INTEGER NOT NULL,
TEMPERATURE DOUBLE,
SERIAL INTEGER ,
FOREIGN KEY (SERIAL) REFERENCES ZONES,
PRIMARY KEY ( TIME, CITY, ZONE));
I'm running a query like this:
SELECT temperature, time, city, zone from zonedata
WHERE (city = 1) and (zone = 1) and (time BETWEEN x AND y);
x and y are variables; there may be several hundred thousand records between them.
temperature ranges from -10.0 to 10.0, and city and zone from 0-20 (in this case they are 1 and 2, but could be something else). Records are logged continuously at intervals of about 5-6 seconds from different zones and cities. This creates a lot of data, and it does not necessarily mean that every record is logged in the correct order of time.
The question is how I can optimize retrieval of records in a big time range (where records are not sorted 100% correctly by time). This can take a lot of time, especially when I'm retrieving from several cities and zones. That means running the mentioned query with different parameters several times. What I'm looking for is specific changes to the query, table structure (preferably not) or other changeable settings.
My application using this is btw implemented in c++.
Your data already is sorted by Time.
By having a Primary Key on (Time, City, Zone) all the records with that same Time value will be next to each other. (Unless you have specified a CLUSTER INDEX elsewhere, though I'm not familiar enough with SQLite to know if that's possible.)
In your particular case, however, that means the records that you want are not next to each other. Instead they're in bunches. Each bunch of records will have (city=1, zone=1) and have the same Time value. One bunch for Time1, another bunch for Time2, etc, etc.
It's like putting it all in Excel and ordering by Time, then by City, then by Zone.
To bunch ALL the records you want (for the same City and Zone) change that to (City, Zone, Time).
Note, however, that if you also have a query for all cities and zones at a specific time, the key I suggested won't be ideal for that; your original key would be better.
For that reason you may wish/need to add different indexes in different orders, for different queries.
This means that to give you a specific recommended solution we need to know the specific query you will be running. My suggested key/index order may be ideal for your simplified example, but the real-life scenario may be different enough to warrant a different index altogether.
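A concrete version of that suggestion in SQLite (the index name is mine; the original primary key stays in place for time-only queries):

import sqlite3

conn = sqlite3.connect("zones.db")  # hypothetical database file

# Composite index ordered (CITY, ZONE, TIME): all rows for one city/zone sit
# together, so the BETWEEN on TIME becomes a single range scan.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_city_zone_time ON ZONEDATA (CITY, ZONE, TIME)"
)

rows = conn.execute(
    "SELECT TEMPERATURE, TIME, CITY, ZONE FROM ZONEDATA "
    "WHERE CITY = ? AND ZONE = ? AND TIME BETWEEN ? AND ?",
    (1, 1, 1340323200, 1340409600),  # example epoch bounds
).fetchall()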
You can index those columns; the index keeps the data sorted internally for faster queries, though you won't see that ordering in the table itself.
For a database, BETWEEN over a large range is hard to optimize. One way out of this is adding extra fields so you can replace most of the BETWEEN with an =. For example, if you add a day field, you could query for:
where city = 1 and zone = 1 and day = '2012-06-22' and
time between '2012-06-22 08:00' and '2012-06-22 12:00'
This query is relatively fast with an index on city, zone, day.
This requires thought to pick the proper extra fields. It requires additional code to maintain the field. If this query is in an important performance path of your application, it might be worth it.
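A sketch of that extra-field idea against the schema from the question (the DAY column and its maintenance at insert time are assumptions; TIME is treated as a UNIX epoch, matching its INTEGER declaration):

import sqlite3
import time

conn = sqlite3.connect("zones.db")  # hypothetical database file

conn.execute("ALTER TABLE ZONEDATA ADD COLUMN DAY TEXT")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_city_zone_day ON ZONEDATA (CITY, ZONE, DAY)"
)

def insert_reading(ts: int, city: int, zone: int, temperature: float) -> None:
    # Maintain the extra field at write time, e.g. '2012-06-22'.
    day = time.strftime("%Y-%m-%d", time.gmtime(ts))
    conn.execute(
        "INSERT INTO ZONEDATA (TIME, CITY, ZONE, TEMPERATURE, DAY) "
        "VALUES (?, ?, ?, ?, ?)",
        (ts, city, zone, temperature, day),
    )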