DynamoDB Schema Design - amazon-web-services

I'm thinking of using Amazon AWS DynamoDB for a project that I'm working on. Here's the gist of the situation:
I'm going to be gathering a ton of energy usage data for hundreds of machines (energy readings are taken around every 5 minutes). Each machine is in a zone, and each zone is in a network.
I'm then going to roll up these individual readings by zone and network, by hour and day.
My thinking is that by doing this, I'll be able to perform one query against DynamoDB on the network_day table, and return the energy usage for any given day quickly.
Here's my schema at this point:
table_name      | hash_key   | range_key  | attributes
----------------|------------|------------|-----------
machine_reading | machine.id | epoch      | energy_use
machine_hour    | machine.id | epoch_hour | energy_use
machine_day     | machine.id | epoch_day  | energy_use
zone_hour       | machine.id | epoch_hour | energy_use
zone_day        | machine.id | epoch_day  | energy_use
network_hour    | machine.id | epoch_hour | energy_use
network_day     | machine.id | epoch_day  | energy_use
I'm not immediately seeing that great of performance in tests when I run the rollup cronjob, so I'm just wondering if someone with more experience could comment on my key design? The only experience I have so far is with RDS, but I'm very much trying to learn about DynamoDB.
EDIT:
Basic structure for the cronjob that I'm using for rollups:
foreach network
    foreach zone
        foreach machine
            add_unprocessed_readings_to_dynamo()
            roll_up_fixture_hours_to_dynamo()
            roll_up_fixture_days_to_dynamo()
        end
        roll_up_zone_hours_to_dynamo()
        roll_up_zone_days_to_dynamo()
    end
    roll_up_network_hours_to_dynamo()
    roll_up_network_days_to_dynamo()
end
I use the previous function's values in Dynamo for the next rollup, i.e. I use zone hours to roll up zone days, and I then use zone days to roll up network days.
This is what (I think) is causing a lot of unnecessary reads/writes. Right now I can manage with low throughputs because my sample size is only 100 readings. My concerns begin when this scales to what is expected to contain around 9,000,000 readings.

First things first, time series data in DynamoDB is hard to do right, but not impossible.
DynamoDB uses the hash key to shard the data, so using machine.id means that some of your keys are going to be hot. However, this is really a function of the amount of data and what you expect your IOPS to be. DynamoDB doesn't create a 2nd shard until you push past 1000 read or write IOPS. If you expect to stay well below that level you may be fine, but if you expect to scale beyond it then you may want to redesign; specifically, include a date component in your hash key to break things up.
Regarding performance, are you hitting your provisioned read or write throughput level? If so, raise them to some crazy high level and re-run the test until the bottleneck becomes your code. This could be as simple as setting the throughput level appropriately.
However, regarding your actual code: without seeing the DynamoDB queries you are performing, a likely issue is reading too much data. Make sure you are not reading more data than you need from DynamoDB. Since your range key is a date field, use a range key condition (not a filter) to reduce the number of records you need to read.
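For illustration, here is a minimal boto3 sketch of such a range-key-conditioned query; the table and attribute names ("machine_reading", "machine_id", "epoch") are assumptions based on the schema in the question, not your actual code:
import boto3
from boto3.dynamodb.conditions import Key

# Sketch only: table and attribute names are assumed from the question's schema.
dynamodb = boto3.resource("dynamodb")
readings = dynamodb.Table("machine_reading")

def readings_for_window(machine_id, start_epoch, end_epoch):
    # The key condition is applied by DynamoDB before items are returned,
    # so only the requested window is read (and charged), unlike a filter
    # expression which still consumes capacity for every item it discards.
    response = readings.query(
        KeyConditionExpression=Key("machine_id").eq(machine_id)
        & Key("epoch").between(start_epoch, end_epoch)
    )
    return response["Items"]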
Make sure your code executes the rollup using multiple threads. If you are not able to saturate the DynamoDB provisioned capacity, the issue may not be DynamoDB; it may be your code. By performing the rollups using multiple threads in parallel you should be able to see some performance gains.
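As a rough sketch of that idea (roll_up_machine here is a hypothetical stand-in for the per-machine steps in your pseudocode, not your actual function):
from concurrent.futures import ThreadPoolExecutor

def roll_up_machine(machine):
    # Placeholder for add_unprocessed_readings / roll_up_fixture_* steps
    # from the question's pseudocode.
    ...

def roll_up_zone(machines):
    # Fan the independent per-machine rollups out over a thread pool so the
    # job can consume more of the table's provisioned throughput at once.
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(roll_up_machine, m) for m in machines]
        for future in futures:
            future.result()  # surface any exception raised in a worker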

What's the provisioned throughput on the tables you are using? How are you performing the rollup? Are you reading everything and then filtering, or are you filtering on range keys, etc.?
Do you even need rollups / a cron job in this situation?
Why not use a table for the readings
machine_reading | machine.id | epoch_timestamp | energy_use
and a table for the aggregates
The hash_key can be the aggregate type and the range key can be the aggregate name.
example:
zone, zone1
zone, zone3
day, 03/29/1940
When getting machine data, dump it into the first table, and after that use atomic counters to increment the entries in the 2nd table:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
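A minimal boto3 sketch of that pattern; the table and attribute names ("aggregates", "aggregate_type", "aggregate_name", "energy_use") are placeholders, not something from the question:
import boto3

# Sketch only: placeholder names for the 2nd (aggregate) table described above.
dynamodb = boto3.resource("dynamodb")
aggregates = dynamodb.Table("aggregates")

def increment_aggregate(agg_type, agg_name, amount):
    # ADD performs an atomic, server-side increment, so concurrent writers
    # never overwrite each other's updates.
    aggregates.update_item(
        Key={"aggregate_type": agg_type, "aggregate_name": agg_name},
        UpdateExpression="ADD energy_use :val",
        ExpressionAttributeValues={":val": amount},
    )

# e.g. increment_aggregate("zone", "zone1", 42) right after storing the raw reading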

Related

Retroactive calculation of VM instance total running time (GCP)

I have a number of instances in a GCP project, and I want to check retroactively how long they've been in use in the last 30 days, i.e. sum the total time each instance was not stopped or terminated during a specific month.
Does anyone know if this can be calculated, and if so - how?
Or maybe another idea that would allow me to sum the total time an instance was in use?
Based on this other post, I would recommend something like:
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| align delta(30d)
| every 30d
| group_by [metric.instance_name]
I would also consider creating uptime_checks, as one of the answers in said post recommends, for future checks, but those wouldn't work retroactively. If you need more info about MQL, see the MQL reference documentation.

How do I keep a running count in DynamoDB without a hot table row?

We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to deal with tabulating global numbers at a large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and these choices could happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without using another service like CouchDB or ElastiCache?
You could bucket your users by the first letter of their usernames (or something similar) as the partition key, and either A or B as the sort key, with a regular attribute holding the count.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a | A | 5
a | B | 7
b | B | 15
c | A | 1
c | B | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count that chose A, and another scan + filter(B) for the total count of B. But if you're writing a bunch and only reading on rare occasions, this may be ok.
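A minimal boto3 sketch of this layout; the table and attribute names ("choice_counts", "bucket", "choice", "count") are assumptions for illustration:
import boto3
from boto3.dynamodb.conditions import Attr

# Sketch only: assumed names for the table sketched above.
dynamodb = boto3.resource("dynamodb")
counts = dynamodb.Table("choice_counts")

def record_choice(username, choice):
    # Bucket by first letter so writes spread across many partitions.
    counts.update_item(
        Key={"bucket": username[0].lower(), "choice": choice},
        UpdateExpression="ADD #c :one",
        ExpressionAttributeNames={"#c": "count"},
        ExpressionAttributeValues={":one": 1},
    )

def total_for(choice):
    # The trade-off: totals need a scan + filter across all buckets.
    total = 0
    kwargs = {"FilterExpression": Attr("choice").eq(choice)}
    while True:
        response = counts.scan(**kwargs)
        total += sum(item["count"] for item in response["Items"])
        if "LastEvaluatedKey" not in response:
            return total
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]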

Amazon redshift large table VACUUM REINDEX issue

My table is 500 GB with 8+ billion rows, INTERLEAVED SORTED by 4 keys.
One of the keys has a big skew (680+). On running a VACUUM REINDEX, it's taking very long, about 5 hours for every billion rows.
When I track the vacuum progress it says the following:
SELECT * FROM svv_vacuum_progress;
table_name | status | time_remaining_estimate
-----------------------------+--------------------------------------------------------------------------------------+-------------------------
my_table_name | Vacuum my_table_name sort (partition: 1761 remaining rows: 7330776383) | 0m 0s
I am wondering how long it will be before it finishes, as it is not giving any time estimate either. It's currently processing partition 1761... is it possible to know how many partitions there are in a certain table? Note these seem to be lower-level storage partitions internal to Redshift.
These days, it is recommended that you should not use Interleaved Sorting.
The sort algorithm places a tremendous load on the VACUUM operation and the benefits of Interleaved Sorts are only applicable for very small use-cases.
I would recommend you change to a compound sort on the fields most commonly used in WHERE clauses.
The most efficient sorts are those involving date fields that are always incrementing. For example, imagine a situation where rows are added to the table with a transaction date. All new rows have a date greater than the previous rows. In this situation, a VACUUM is not actually required because the data is already sorted according to the Date field.
Also, please realise that 500 GB is actually a LOT of data. Doing anything that rearranges that amount of data will take time.
If your vacuum is running slow, you probably don't have enough space on the cluster. I suggest you temporarily double the number of nodes while you do the vacuum.
You might also want to think about changing how your schema is set up. It’s worth going through this list of redshift tips to see if you can change anything:
https://www.dativa.com/optimizing-amazon-redshift-predictive-data-analytics/
The way we recovered to the previous state was to drop the table and restore it from a backup snapshot taken before the VACUUM REINDEX.

Dynamo throughput not reaching provisioned level - using Hive and EMR 5.2

We're using Hive running on EMR 5.2.0 to load many, many files into a DynamoDB table. The provisioned throughput on the table is 3000 writes per second.
We are only able to hit 2000 writes regardless of the throughput percentage that is set in the Hive script.
The Hive execution engine is set to mr, and the dynamo.throughput.read.percent is set to 1.0.
We use EMR to run the step using command-runner. Thus far we're unable to find any reason why it's only using 2/3 of the provisioned writes.
Any advice or help would be greatly appreciated, thanks.
Edited to add hive script:
SET hive.execution.engine=mr;
DROP TABLE IF EXISTS s3_import;
DROP TABLE IF EXISTS dynamo_import;
CREATE EXTERNAL TABLE s3_import(fld string, dateRef string)
ROW FORMAT
DELIMITED FIELDS
TERMINATED BY ','
ESCAPED BY '\\'
LOCATION 's3n://${s3Path}';
CREATE EXTERNAL TABLE dynamo_import(fld string, dateRef string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = '${tableName}',
"dynamodb.throughput.read.percent" = '${rp}',
"dynamodb.throughput.write.percent" = '${wp}',
"dynamodb.column.mapping" = "fld:fld,dateRef:dateRef");
INSERT OVERWRITE TABLE dynamo_import SELECT * FROM s3_import;
Edit:
If I run two applications in parallel that each use 0.5 as the write throughput percentage, we're able to achieve the optimal writes within the provisioned amount. This leads me to think that there may be a setting on the cluster that is causing the problem?
The read and write percent settings are best effort rate limiters. The DynamoDB connector estimates the read and write capacity based on an item size heuristic and may not always get it right. That's why you can actually "over provision" reads and writes up to 1.5 (150%) so you should try that.
The other thing that can cause your write capacity to not hit the provisioned limit is the presence of hot spots in the key space. If there are more items in one partition than in others, then utilization will be uneven and you will hit throttling on one or two partitions even though you're not using the full provisioned rate for the whole table. With 3000 write capacity units and some reads, your table has at least 4 partitions, so this can definitely be a factor.

What does ECU units, CPU core and memory mean when I launch a instance

When I launch an instance on EC2, it gives me options for t1.micro, m1.small, m1.large, etc. There is a comparison chart of vCPU, ECU, CPU cores, memory, and instance store. Is this memory the RAM of the system?
I am not able to understand what all these terms refer to. Can anyone give me a clear picture of what these terms mean?
ECU = EC2 Compute Unit. More from here: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
Amazon EC2 uses a variety of measures to provide each instance with a consistent and predictable amount of CPU capacity. In order to make it easy for developers to compare CPU capacity between different instance types, we have defined an Amazon EC2 Compute Unit. The amount of CPU that is allocated to a particular instance is expressed in terms of these EC2 Compute Units. We use several benchmarks and tests to manage the consistency and predictability of the performance from an EC2 Compute Unit. One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. This is also the equivalent to an early-2006 1.7 GHz Xeon processor referenced in our original documentation. Over time, we may add or substitute measures that go into the definition of an EC2 Compute Unit, if we find metrics that will give you a clearer picture of compute capacity.
For Linux instances, I've figured out that the ECU can be approximated with sysbench:
sysbench --num-threads=128 --test=cpu --cpu-max-prime=50000 --max-requests=50000 run
The ECU figure can then be estimated from sysbench's reported total time t (in seconds) with the formula:
ECU = 1925 / t
And my example test results:
| instance type | time (s) | ECU |
|---------------|----------|-----|
| m1.small      | 1735.62  | 1   |
| m3.xlarge     | 147.62   | 13  |
| m3.2xlarge    | 74.61    | 26  |
| r3.large      | 295.84   | 7   |
| r3.xlarge     | 148.18   | 13  |
| m4.xlarge     | 146.71   | 13  |
| m4.2xlarge    | 73.69    | 26  |
| c4.xlarge     | 123.59   | 16  |
| c4.2xlarge    | 61.91    | 31  |
| c4.4xlarge    | 31.14    | 62  |
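As a quick sanity check, applying the formula to a few of the measured times above reproduces the ECU column (plain Python):
# Reproduce the ECU column from the sysbench total times above.
times = {"m1.small": 1735.62, "m3.xlarge": 147.62, "c4.4xlarge": 31.14}
for instance, t in times.items():
    print(instance, round(1925 / t))  # -> 1, 13, 62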
Responding to the forum thread for the sake of completeness: Amazon has stopped using the ECU (Elastic Compute Unit) and moved to a vCPU-based measure. So, ignoring the ECU, you can pretty much start comparing EC2 instance sizes by CPU (clock speed), number of vCPUs, RAM, storage, etc.
Each instance family's configurations are published as the number of vCPUs and the underlying physical processor. Detailed info is available here: http://aws.amazon.com/ec2/instance-types/#instance-type-matrix
ECUs (EC2 Compute Units) are a rough measure of processor performance that was introduced by Amazon to let you compare their EC2 instances ("servers").
CPU performance is of course a multi-dimensional measure, so putting a single number on it (like "5 ECU") can only be a rough approximation. If you want to know more exactly how well a processor performs for a task you have in mind, you should choose a benchmark that is similar to your task.
In early 2014, there was a nice benchmarking site comparing cloud hosting offers by tens of different benchmarks, over at CloudHarmony benchmarks. However, this seems gone now (and archive.org can't help as it was a web application). Only an introductory blog post is still available.
Also useful: ec2instances.info, which at least aggregates the ECU information of different EC2 instances for comparison. (Add column "Compute Units (ECU)" to make it work.)