Are these access patterns possible with DynamoDB, a NoSQL database?

I am relatively new to NoSQL database structures.
To my thinking, the following access patterns involve relations and analytical queries, in which case a SQL data structure would be a better approach than a NoSQL one.
I am wondering whether the following access patterns are even possible with DynamoDB, or whether one has to first pull the data out of DynamoDB into, e.g., a Lambda function to process it.
Get all customers of this month who increased their spending by more than 25% compared to last month
Get the annual spending of a customer (only monthly spending is entered into DynamoDB)
Get customers who have spending over 0 in a specific timeframe
Get all orders in a specific timeframe where the customer who placed the order has the following attributes (female, 25 years old, 170 cm tall)
Get all active customers, active software suppliers, and active raw-material suppliers for a given timeframe

Related

Schema design for Google BigTable

In my project, I'm using Google BigQuery, which holds lots of data.
The BigQuery columns are:
account_id, session_id, transaction_id, username, event, timestamp.
In my dashboard, I'm fetching all of the data based on timestamp (last 30 days).
Since I have a very large amount of data, performance is pretty slow (13 seconds to fetch the last 30 days of data).
Lately, I have been looking at Google Bigtable, and I saw that it has an option to get data based on time.
In my tests, Bigtable performs slower than BigQuery.
Is there a suggested schema that can improve performance with Bigtable?
This is an example of my schema in Bigtable:
const row = {
  key: `transactions#${timestamp_micros}`,
  data: {
    identifiers: {
      session_id: `session_id-${startCounter}`,
      account_id: `acount-${startCounter}`,
      device_id: `device-${startCounter}`,
      transaction_id: `transaction_id-${startCounter}`,
      runtime_id: 'AQW+2Xx5AQAAstvxskK0c8NTk+vP5eBM',
      page_id: `page_id-${startCounter}`,
      start_time: timestamp,
    },
  },
};
Can anyone suggest a better schema that will help me fetch the data (based on a timestamp range) with the best performance?
A good schema results in excellent performance and scalability, while a bad schema can lead to a poorly performing system. However, no single schema design is the best fit for all use cases, so the answer to your question is partly a matter of opinion and will vary from person to person. The patterns described on this page provide a starting point for deciding on a Bigtable schema. Your unique dataset and the queries you plan to run are the most important things to consider as you design a schema for your time-series data.
As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. A row key of the form transaction_id#reverse_timestamp keeps your data sorted from the latest timestamp. This could avoid hotspotting issues, which are one of the major causes of slow query results.
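As a minimal sketch of the transaction_id#reverse_timestamp idea, assuming millisecond timestamps: the helper name `reverseTimestampKey` and the use of `Number.MAX_SAFE_INTEGER` as the ceiling are illustrative assumptions, not part of any Bigtable client API.

```javascript
// Hypothetical helper: builds "transaction_id#reverse_timestamp" row keys.
// MAX_TS is an assumed ceiling; any constant larger than every timestamp works.
const MAX_TS = Number.MAX_SAFE_INTEGER; // 9007199254740991 (16 digits)

function reverseTimestampKey(transactionId, timestampMillis) {
  const reversed = MAX_TS - timestampMillis;
  // Zero-pad so that lexicographic (byte) order matches numeric order.
  const padded = String(reversed).padStart(16, '0');
  return `${transactionId}#${padded}`;
}

const older = reverseTimestampKey('tx-1', 1600000000000);
const newer = reverseTimestampKey('tx-1', 1700000000000);

// Within the same transaction_id prefix, the newer event sorts first.
console.log(newer < older);
```

Because the timestamp is subtracted from a fixed ceiling, a scan over one transaction_id prefix returns the most recent rows first. A range scan across all transactions by time alone would still need a different key layout.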
However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:
Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable can't support this query without performing a full table scan (hence it is slower than BigQuery).
Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates.
Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster.
You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
My personal choice for your use case is BigQuery. You can take advantage of partition pruning, where BigQuery scans only the partitions that match the filter and skips the rest. Partitioning not only makes your data easier to manage and query: by dividing a large table into smaller partitions you can also improve query performance and control costs by reducing the number of bytes a query reads. You can use time-unit column partitioning or ingestion-time partitioning. When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time at which it ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions.
So, with an ingestion-time partitioned table, your query for fetching all data by timestamp should look something like this in BigQuery (adjust the bounds to cover the last 30 days):
SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

How to partition DynamoDB table with time-series data from users of different organizations?

I have an application being built using AWS AppSync, with a primary focus on sending telemetry data from a mobile application. I am stuck on how to partition and structure the DynamoDB tables for this, because the users of the application belong to different organizations. Each organization will have admins who are able to view the data specific to their organization.
OrganizationA
-->Admin # View all the telemetry data
---->User # Send the telemetry data from their mobile application
Based on some research from these resources,
Link 1.
Link 2.
The advised approach is to create tables for individual periods, i.e., a table for every day with the telemetry readings.
Example (not sure what pk is in this example):
I am planning to separate the users with AWS Cognito by attaching custom attributes when the user signs up, such as Organization and Role (Admin or User), as per this answer, and then using a Pre-Signup Lambda Trigger.
How should I achieve this?
Since you really don't need users from one organization to read data from another organization, and for all your access patterns you will always know the organization id, then that attribute should be a factor in partitioning: either at the table level, or at the partition key level.
Then you have to determine whether you can simply use the organization id as the partition key, or whether you need to partition further, say, by concatenating the organization id and the hour value for each sample. This will depend on the amount of data you expect each organization to generate in a given day. The tradeoff is between more granular partitioning and the cost of querying for data.
If organizations generate small amounts of data each day (say, a few events an hour) then just use organization id as the partition key. Otherwise, partition the data further.
In all of the above, the sort key should probably be the timestamp of the events, either with second or millisecond precision depending on your needs. That way your queries can retrieve ordered time-series data.
Keep in mind that when you make queries, you may need to execute multiple queries and stick the results together in your application to fully represent the results as the range may span multiple partitions, or even multiple tables.
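A rough sketch of the finer-grained option, assuming the partition key is the organization id concatenated with the UTC hour and the sort key is the event timestamp. The key format and the function names `partitionKey` and `partitionKeysForRange` are hypothetical; the sketch only lists which partitions a time range touches, so that one query per key can be issued and the results merged in the application.

```javascript
// Hypothetical key scheme: partition key = "<orgId>#<UTC hour>", sort key = event timestamp.
const HOUR_MS = 60 * 60 * 1000;

function partitionKey(orgId, timestampMillis) {
  // Truncate the ISO timestamp to the hour, e.g. "2021-03-01T22".
  const hour = new Date(timestampMillis).toISOString().slice(0, 13);
  return `${orgId}#${hour}`;
}

// List every partition key touched by the range [startMillis, endMillis].
function partitionKeysForRange(orgId, startMillis, endMillis) {
  const keys = [];
  const firstHour = Math.floor(startMillis / HOUR_MS) * HOUR_MS;
  for (let t = firstHour; t <= endMillis; t += HOUR_MS) {
    keys.push(partitionKey(orgId, t));
  }
  return keys;
}

const rangeStart = Date.UTC(2021, 2, 1, 22, 30); // 2021-03-01 22:30 UTC
const rangeEnd = Date.UTC(2021, 2, 2, 0, 15); // 2021-03-02 00:15 UTC
const rangeKeys = partitionKeysForRange('orgA', rangeStart, rangeEnd);
console.log(rangeKeys);
```

A range of under two hours already spans three partitions here, which is the query-cost side of the tradeoff described above.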

Use Case for Amazon Athena

We are building a web application to give customers insight into their activity, based on events currently streaming into Elasticsearch. A customer is an organisation sending messages to people.
A concern has been raised that a requirement to host this data for three years implies a very large amount of storage and a high cost of implementation with Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency would be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways. One is to do as you say: generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into Elasticsearch, stream it to S3 too. Depending on how you do that, you will probably need some ETL in the form of Lambda functions that partition the data per customer and time (day or month, depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double digit seconds for most queries, but I don't know your data volumes); the upside would be full flexibility in what you can query.
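To illustrate the partitioning step, here is a small sketch of how an ETL function might group events by customer and month before writing them to S3. The `events/customer=<id>/month=<YYYY-MM>/` prefix layout (Hive-style key=value path segments), the event shape, and the function names are all assumptions made for the example.

```javascript
// Hypothetical layout: s3://<bucket>/events/customer=<id>/month=<YYYY-MM>/<file>
function partitionPrefix(customerId, timestampMillis) {
  const month = new Date(timestampMillis).toISOString().slice(0, 7); // "YYYY-MM" in UTC
  return `events/customer=${customerId}/month=${month}/`;
}

// Group events by the prefix they belong under, as an ETL step might do before writing files.
function groupByPartition(events) {
  const groups = new Map();
  for (const e of events) {
    const prefix = partitionPrefix(e.customerId, e.timestamp);
    if (!groups.has(prefix)) groups.set(prefix, []);
    groups.get(prefix).push(e);
  }
  return groups;
}

// Made-up sample events, for illustration only.
const sampleEvents = [
  { customerId: 'acme', timestamp: Date.UTC(2023, 0, 15), type: 'sent' },
  { customerId: 'acme', timestamp: Date.UTC(2023, 0, 20), type: 'opened' },
  { customerId: 'acme', timestamp: Date.UTC(2023, 1, 1), type: 'sent' },
];
const grouped = groupByPartition(sampleEvents);
console.log([...grouped.keys()]);
```

With a layout like this, a query that filters on customer and month only needs to read the objects under the matching prefixes.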
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds or minutes.

Using fake timestamps to create partitions on Google BigQuery

Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that some customers have 5M rows and others 200K, and since BQ does not have indexes, a query for one customer always processes the other customers' data as well (and the costs are rising).
I intend to create a timestamp field where each customer has a different timestamp that is repeated for every insert into every customer-sensitive table, so that I can query by a fixed timestamp just as I would with a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but as it is always a full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a clustering field to your table, which plays a role similar to an index in other databases.
This link provides the basics of how to use clustering fields.
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a clustering field, a BigQuery dry run doesn't show the cost improvement, which can only be seen after execution.

DynamoDb table design: Single table or multiple tables

I’m quite new to NoSQL and DynamoDB, and I’m used to RDBMS. I’m designing a database for a game, and we're using DynamoDB and AWS Lambda for our backend. I created a table named “Users” for player profiles, which contains the user information and resources. Because the game has an inventory system, I also created a table named “UserItems”.
It was all good until I realized DynamoDB doesn’t have transactions, so any operation that is executed on both tables (for example, using an item that increases a resource) has a chance of failing on one table while succeeding on the other, which will cause missing data that affects our customers.
So I was thinking that maybe my multiple-table design is not good, since it’s a habit of mine to design multiple tables when I’m working with an RDBMS. This led me to think of storing the entire “UserItems” as a hash in “Users”, but I’m not sure this is good practice, because the size of a single row in the Users table would be really big (we may have 500 unique items per user), and each time I pull or put data from/to “Users” (most of the time I don’t need the “UserItems” data) the read/write throughput would also be really large.
What should I do: keep the multiple-table design and handle transactions manually, or switch to a single-table design? Or maybe there is a third option?
Updated: more information about my use case
Currently I have 2 tables
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buys an item: Users.Gold will be deducted, and a new UserItem will be added to the UserItems table.
User sells an item: Users.Gold will be increased, and the item will be deleted from the UserItems table.
In both scenarios above I have to perform two update operations on two tables, and without transactions there is a chance that one of them fails.
To solve that, I am considering a single-table solution: a single Users table with four columns, UserId (key), Username, Gold, and UserItems. However, there are two things I'm worried about:
1. The data in UserItems might become too big for a single cell, because one user could have up to 500 items.
2. To add or delete an item I have to pull UserItems from DynamoDB, add or delete the item, and then put it back into Users. So I have to do one read and one write operation for one action, and because of issue (1) the size of the data read and written could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests to use a single table:
As a general rule, you should maintain as few tables as possible in a
DynamoDB application. As emphasized earlier, most well designed
applications require only one table, unless there is a specific reason
for using multiple tables.
Exceptions are cases where high-volume time series data are involved,
or datasets that have very different access patterns—but these are
exceptions. A single table with inverted indexes can usually enable
simple queries to create and retrieve the complex hierarchical data
structures required by your application.
A NoSQL database is best suited for non-transactional data. If you bring normalization (splitting your data into multiple tables) into NoSQL, you defeat its whole purpose. If performance is what matters most, you should consider having only a single table for your use case. DynamoDB supports range keys, and it also supports secondary indexes. For your use case, it would be better to redesign your table to use range keys.
If you can share more details about your current table, maybe I can help with more input.
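One possible reading of the range-key suggestion is sketched below: the user profile and each inventory item become separate rows that share a partition key and differ in the sort (range) key. The `USER#`/`ITEM#`/`PROFILE` key naming, the sample rows, and the in-memory `query` function are assumptions for illustration only; this is not the DynamoDB SDK.

```javascript
// Hypothetical single-table layout: both row types share the partition key "USER#<id>".
const profileKey = (userId) => ({ PK: `USER#${userId}`, SK: 'PROFILE' });
const itemKey = (userId, itemId) => ({ PK: `USER#${userId}`, SK: `ITEM#${itemId}` });

// In-memory stand-in for the table, used only to illustrate the access patterns.
const table = [
  { ...profileKey('u1'), Username: 'alice', Gold: 100 },
  { ...itemKey('u1', 'sword'), Name: 'Sword', GoldValue: 30 },
  { ...itemKey('u1', 'shield'), Name: 'Shield', GoldValue: 20 },
  { ...profileKey('u2'), Username: 'bob', Gold: 50 },
];

// Mimics a key-condition query: exact partition key plus a sort-key prefix.
function query(pk, skPrefix = '') {
  return table.filter((row) => row.PK === pk && row.SK.startsWith(skPrefix));
}

const profileOnly = query('USER#u1', 'PROFILE'); // profile without the inventory
const itemsOnly = query('USER#u1', 'ITEM#'); // inventory without the profile
console.log(profileOnly.length, itemsOnly.length);
```

With this layout, reading or updating the profile does not pull up to 500 items along with it, and adding or removing an item touches a single small row. It does not by itself make a change to Gold and a change to an item atomic.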