How to implement GDPR Right to erasure with AWS Timestream - amazon-web-services

In an application that collects data from IoT devices from multiple customer in a single AWS Timestream table, what if a customer leaves and requests all of its data to be deleted, according to GDPR's right to erasure?
Timestream doesn't support deleting records, only updating them.
Possible solutions we came up so far:
update all records of the customer with zero values
use table per customer and then delete the whole table (problematic with large number of customers)
don't actually delete the data, but make it not relatable to the customer anymore (probably not GDPR compliant anyway)
Are there any other options here? Or is there no solution and we have to dump Timestream altogether and use a different product that supports deletes?

One other alternative is to use data retention to delete the data in Timestream after a period of time you define.
GDPR mentions that you need to delete the data after 1 month (it seems you can have 2 more month but then you need to inform the user that data deletion would take longer).
Keeping data in Timestream can become expensive. I usually store the data in Timestream (short retention) and S3. I use Timestream for fast analytics or near real time dashboarding. I use S3 for adhoc query, ML or historical dashboarding.

Related

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a
Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then
loaded into DynamoDB for our service to consume the aggregated data
in its business logic.
The DynamoDB table needs to be updated hourly. (hourly is not a hard requirement but we would like DynamoDB to be updated as soon as possible, at the longest intervals of daily updates if required)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is on the customerId's/deviceId's MAX(countryCode) for each day over the last 29 days, and then the MAX(countryCode) overall over the last 29 days.
Only the CustomerIds/deviceIds that had their countryCode change from the last aggregation (from an hour ago) should be written to DynamoDB to keep required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job's run aggregated results vs. the current runs aggregated results to find the customerIds/deviceIds that had their countryCode change
Writes the customerIds/deviceIds rows that had their countryCode change to DynamoDB
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, AWS documentation: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html suggests to use time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, updating the view to include the new time-series table and not the one that was dropped could be error prone.
What would be an optimal way to find the the rows (customerId/deviceId combinations) that had their countryCode change since the last aggregation? I was thinking the Glue job could create a table from the previous runs aggregated results S3 file and another table from the current runs aggregated results S3 file, run some variation of a FULL OUTER JOIN to find the rows that have different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tldr: Use step functions, not Glue. Use Redshift Spectrum with data in S3. Otherwise you overall structure looks on track.
You are on the right track IMHO but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads being done that will justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time series event tables? If not you may want to make the time series tables external - read directly from S3 using Redshift Spectrum. This could be a big win as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
Next I would advise not using Glue unless you have a need (transform) that cannot easily be done elsewhere. I find Glue to require some expertise to get to do what you want and it sounds like you would just be using it for a data movement orchestrator. If this impression is correct you will be better off with a step function or even a data pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations and you go the Spectrum route above you will want to get as small a cluster as you can get away with. Redshift can be pricy and if you don't use its power, not cost effective. In this case you can run the cluster only as needed but Redshift boot up times are not fast and the smallest clusters are not expensive. So this is a possibility but only in the right circumstances. Depending on how difficult the aggregation is that you are doing you might want to look at Athena. If you are just running a few aggregating queries per hour then this could be the most cost effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old which are in S3. This is easily done with Redshift Spectrum or Athena as they can makes files (or sets of files) the source for a table. Then it is just running the queries.
In my opinion Glue is an ETL tool that can do high power transforms. It can do a lot of things but is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock you self out; If not, I would avoid.
As for data management, yes you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said if you organize you data in S3 in "month" folders then unneeded removing months of data can just be deleting these folders.
As for finding changing country codes this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data this shouldn't be expensive either. Again Redshift Spectrum or Athena are tools that allow you to do this on S3 data.
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you say "Is Redshift the best storage choice here?". You seem to be recognizing the importance of where the data resides wrt the compute elements which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again then the Redshift is the best option - The data is moved once to a place where the analytics need to run. However, Redshift is an expensive storage solution - it's not what it is meant to do. Redshift Spectrum is very interesting in that the initial aggregations of data is done in S3 and much reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described on area where you need a solution and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. This would change the optimization point.

Use Case for Amazon Athena

We are building an web application to allow customers insight into their activity based on events currently streaming into ElasticSearch. A customer is an organisation sending messages to people.
A concern has been raised that a requirement to host this data for three years infers a very large amount of storage and high cost of implementation given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways, either you do as you say and generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into ElasticSearch, stream it to S3 too. Depending on how you do that you probably need some ETL in the form of Lambda functions that partitions the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility on what you can query.
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds/mins.

Database suggestion for large unstructured datasets to integrate with elasticsearch

A scenario where we have millions of records saved in database, currently I was using dynamodb for saving metadata(and also do write, update and delete operations on objects), S3 for storing files(eg: files can be images, where its associated metadata is stored in dynamoDb) and elasticsearch for indexing and searching. But due to dynamodb limit of 400kb for a row(a single object), it was not sufficient for data to be saved. I thought about saving for an object in different versions in dynamodb itself, but it would be too complicated.
So I was thinking for replacement of dynamodb with some better storage:
AWS DocumentDb
S3 for saving metadata also, along with object files
So which one is better option among both in your opinion and why, which is also cost effective. (Also easy to sync with elasticsearch, but this ES syncing is not much issue as somehow it is possible for both)
If you have any other better suggestions than these two you can also tell me those.
I would suggest looking at DocumentDB over Amazon S3 based on your use case for the following reasons:
Pricing of storing the data would be $0.023 for standard and $0.0125 for infrequent access per GB per month (whereas Document DB is $0.10per GB-month), depending on your size this could add up greatly. If you use IA be aware that your costs for retrieval could add up greatly.
Whilst you would not directly get the data down you would use either Athena or S3 Select to filter. Depending on the data size being queried it would take from a few seconds to possibly minutes (not the milliseconds you requested).
For unstructured data storage in S3 and the querying technologies around it are more targeted at a data lake used for analysis. Whereas DocumentDB is more driven for performance within live applications (it is a MongoDB compatible data store after all).

Single query to get the data from DynamoDB and RDS

Looking for an advice on AWS architecture. Did some research on my own, but I'm far from an expert and I would really love to hear other opinions. This seems to be a pretty common problem for miscroservice architecture, but AWS looks like a different universe to me with its own rules (and tools), there should be best practices that I'm not aware of yet.
What we have:
SOA: Lambda per entity (usually node.js + DynamoDB)
Some Lambda functions use RDS (MySQL) as a DB (this data was supposed to be used by Quicksight)
GraphQL (AppSync)
First problem occurred when we understood that we have to display in Quicksight the data that is stored in DynamoDB. This was solved by Data Pipeline job that transfers the data from DynamoDB to S3 and then is fetched by Quicksight using Athena. In this case it's acceptable that the data for analysis is not updated in real time.
But now we need to create a table in the main application and combine the data that is stored in different data sources - DynamoDB and MySQL. For example, we have an entity payment with attributes like amount and currency, this data is stored in MySQL. And then there is a contract entity which is stored in DynamoDB. Payment can have a link to a contract (one to many relation). We need to create a table with a list of contracts, so the user can filter contracts by payments attributes like seeing the contracts that have payments in EUR or with total amount > 500 USD. This table must contain real time data and have common data grid features: filtering, sorting, pagination.
Options that I see at the moment:
use SQS to transfer payment attributes from payment service to DynamodDB and store it as a String Set in DynamoDB (e.g. column currencies: ['EUR', 'USD']).
use streams (DynamoDB streams, Kinesis?) to transfer data from DynamoDB to S3, and then query the data with Athena. Not sure it will work for us, I got really bad performance issues with Athena, queries stuck in queue for a couple of minutes, did I do something wrong?
remodel the architecture, merge entities into one DB. Probably this one will take far too long to be allowed by project managers.
Data duplication (and consistency issues as a result) was always a pain for me, but it seems to be unavoidable here.
Any thoughts or links to the articles that might help are highly appreciated.
P.S. The architecture was designed by a previous development team.

Deleting Data from DynamoDb Table automatically

Is there any kind of life retention period concept in DynamoDB.
I mean is there any way such that data inside a table will be deleted after some time like we can set some retention period in S3.
Thanks,
DynamoDB has introduced a Time to Live (TTL) feature. You can create a numeric field and set the value to "time in seconds" (since epoch) when you want the record to be deleted.
DynamoDB will automatically delete the record at the specified time. Of course, TTL has to be configured on a per table basis.
You can fnd more details at: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-how-to.html
No, there is no "retention" setting available in DynamoDB.
You could run a daily/monthly query that uses a date field to filter results, and use that output to determine which Items to delete. This would need to be implemented in your own programming code.
Some users choose to use separate tables to provide ageing. For example, create a separate table for each month. Then, delete old tables once they pass a certain age. However, your software would have to know how to handle multiple tables of data.
Examples:
Reference to monthly rotation of tables
Understand Access Patterns for Time Series Data