Single query to get the data from DynamoDB and RDS

Looking for advice on AWS architecture. I did some research on my own, but I'm far from an expert and I would really love to hear other opinions. This seems to be a pretty common problem for microservice architecture, but AWS looks like a different universe to me, with its own rules (and tools), so there should be best practices that I'm not aware of yet.
What we have:
SOA: Lambda per entity (usually node.js + DynamoDB)
Some Lambda functions use RDS (MySQL) as a DB (this data was supposed to be used by Quicksight)
GraphQL (AppSync)
The first problem occurred when we realized we had to display data stored in DynamoDB in Quicksight. This was solved with a Data Pipeline job that transfers the data from DynamoDB to S3, from where Quicksight fetches it via Athena. In this case it's acceptable that the data for analysis is not updated in real time.
But now we need to create a table in the main application that combines data stored in different data sources: DynamoDB and MySQL. For example, we have a payment entity with attributes like amount and currency, and this data is stored in MySQL. Then there is a contract entity, which is stored in DynamoDB. A payment can have a link to a contract (a one-to-many relation). We need to create a table with a list of contracts so the user can filter contracts by payment attributes, e.g. show only the contracts that have payments in EUR or with a total amount > 500 USD. This table must contain real-time data and support common data grid features: filtering, sorting, pagination.
Options that I see at the moment:
use SQS to transfer payment attributes from the payment service to DynamoDB and store them as a String Set in DynamoDB (e.g. a column currencies: ['EUR', 'USD']); a rough sketch follows this list.
use streams (DynamoDB Streams, Kinesis?) to transfer data from DynamoDB to S3, and then query the data with Athena. I'm not sure this will work for us: I ran into really bad performance issues with Athena, with queries stuck in the queue for a couple of minutes. Did I do something wrong?
remodel the architecture and merge the entities into one database. This one will probably take far too long to be approved by the project managers.
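A rough sketch of what the first option's consumer could look like: a Lambda function triggered by the SQS queue that denormalizes the payment currency onto the contract item as a String Set. The table name, key, and message fields here are hypothetical.

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    contracts = dynamodb.Table("contracts")  # hypothetical table name

    def handler(event, context):
        for record in event["Records"]:            # one record per SQS message
            payment = json.loads(record["body"])   # e.g. {"contractId": "...", "currency": "EUR"}
            # ADD on a set attribute creates the String Set if it is missing and
            # appends otherwise, so reprocessing the same message is harmless.
            contracts.update_item(
                Key={"id": payment["contractId"]},
                UpdateExpression="ADD currencies :c",
                ExpressionAttributeValues={":c": {payment["currency"]}},
            )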
Data duplication (and the consistency issues that come with it) has always been a pain for me, but it seems unavoidable here.
Any thoughts or links to the articles that might help are highly appreciated.
P.S. The architecture was designed by a previous development team.

Related

How to deal with failing Athena queries as AWS Glue Data Catalog metadata size grows large?

Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is using Athena and querying the information_schema database. The article below has come up frequently in my research and is written by Amazon's team:
Querying AWS Glue Data Catalog
However, under the section titled Considerations and limitations the following is written:
Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
Unfortunately, in this article there do not seem to be any indications or suggestions regarding what constitutes a "large amount of metadata" and exactly what errors could occur when the metadata is large and one needs to query it.
My question is: how does one deal with the ever-growing size of the Data Catalog's metadata so that these errors are never encountered when querying it with Athena?
Is there a best practice for this? Or perhaps a better way of getting the same metadata that querying the catalog through Athena provides, without making a great many API calls (using boto3, Hive DDL, etc.)?
I talked to AWS Support and did some research on this. Here's what I gathered:
The information_schema database is built at query execution time; there doesn't seem to be any caching.
If you access information_schema.tables, it will make separate calls for each schema you have to the Hive Metastore (Glue Data Catalog).
If you access information_schema.columns, it will make separate calls for each schema and each table in that schema you have to the Hive Metastore.
These queries are affected by the general service quotas; in this case, DML queries like your SELECT must finish within 30 minutes.
If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this may result in slow performance. As a rough guesstimate, support told me that you should be fine as long as you have fewer than ~10,000 tables, which should be the case for most people.
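Since information_schema ends up calling the Glue catalog per schema and per table anyway, one workaround (which the question already mentions) is to read the catalog directly with boto3's paginators; that sidesteps the 30-minute Athena limit at the cost of explicit API calls. A rough sketch, with a placeholder database name:

    import boto3

    glue = boto3.client("glue")

    def list_columns(database_name="my_database"):  # placeholder name
        """Yield (table, column, type) for every table in one Glue database."""
        paginator = glue.get_paginator("get_tables")
        for page in paginator.paginate(DatabaseName=database_name):
            for table in page["TableList"]:
                columns = table.get("StorageDescriptor", {}).get("Columns", [])
                for column in columns:
                    yield table["Name"], column["Name"], column["Type"]

    for table_name, column_name, column_type in list_columns():
        print(table_name, column_name, column_type)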

Use Case for Amazon Athena

We are building a web application to allow customers insight into their activity based on events currently streaming into Elasticsearch. A customer is an organisation sending messages to people.
A concern has been raised that the requirement to host this data for three years implies a very large amount of storage and a high cost of implementation given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways. One is to do as you say: generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into Elasticsearch, stream it to S3 too. Depending on how you do that, you will probably need some ETL in the form of Lambda functions that partition the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double-digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility in what you can query.
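To give a feel for the application side, here is a minimal sketch of submitting an ad-hoc query scoped to one customer/month partition and polling for the result; the database, table, partition columns and output location are all placeholders.

    import time
    import boto3

    athena = boto3.client("athena")

    def report_for(customer_id, month):
        # In real code, validate customer_id/month rather than interpolating blindly.
        query = f"""
            SELECT event_type, count(*) AS events
            FROM reports
            WHERE customer = '{customer_id}'  -- partition column
              AND month = '{month}'           -- partition column, e.g. '2019-05'
            GROUP BY event_type
        """
        execution = athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": "analytics"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )
        query_id = execution["QueryExecutionId"]
        # Athena is asynchronous: poll until the query finishes. On a well
        # partitioned table this is typically a few seconds.
        while True:
            state = athena.get_query_execution(QueryExecutionId=query_id)[
                "QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(0.5)
        if state != "SUCCEEDED":
            raise RuntimeError(f"Athena query ended in state {state}")
        return athena.get_query_results(QueryExecutionId=query_id)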
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds or minutes.

Logstash and looking up additional data from a relational table?

I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering the Logstash/Elasticsearch/Kibana combination, but we have additional data on our users stored in a Redshift database. So in addition to the mobile data, I would like to pull in additional data from Redshift about the user at the time of interaction with the mobile app.
However, I've read in some places that doing an actual database query through logstash isn't feasible, but you can use a dictionary file to do a lookup of each user.
I have two questions regarding this approach
Is there a limit to how large this lookup file can be? Mine would be < 500K records, so I'd imagine it would be fine?
Can the process of making the lookup file from Redshift tables be fully automated (ideally through AWS services)? I.e. each night the lookup table is refreshed and posted to Logstash, and then used for breakouts in Kibana.
The way we're currently doing it is processing a daily JSON file with a Lambda function, posting it to S3 and then reading it into a Redshift table. This data is then processed into sessions and joined up with other tables to generate the final dataset to be used for visualization. This is currently done in Tableau, but we are exploring other options (such as Quicksight, or possibly the ELK stack).
Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.
Thanks!
Logstash 7 has a jdbc_streaming filter plugin for dynamically adding stuff to your events, as well as the jdbc_static filter for static stuff.
As you found, you can also use the translate filter. The man page says they've tested "very large" datasets of up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (e.g. cron) without restarting Logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be updated.
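For the automation part (question 2), a rough sketch of the nightly refresh: query Redshift and overwrite the CSV dictionary that the translate filter points at. It assumes the cluster is reachable from the Logstash host, uses a hypothetical users table, and would be run from cron or any scheduler.

    import csv
    import psycopg2  # assumes network access to the Redshift cluster

    QUERY = "SELECT user_id, user_segment FROM users"         # hypothetical lookup
    DICTIONARY_PATH = "/etc/logstash/dictionaries/users.csv"  # the translate filter's dictionary_path

    def refresh_dictionary():
        conn = psycopg2.connect(
            host="my-cluster.example.redshift.amazonaws.com",  # placeholder endpoint
            port=5439, dbname="analytics", user="readonly", password="...",
        )
        try:
            with conn.cursor() as cur:
                cur.execute(QUERY)
                rows = cur.fetchall()
        finally:
            conn.close()
        # The translate filter reloads the file when it changes, so overwriting
        # it is enough; no Logstash restart needed.
        with open(DICTIONARY_PATH, "w", newline="") as f:
            csv.writer(f).writerows(rows)

    if __name__ == "__main__":
        refresh_dictionary()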

AWS hosted data storage for storing simple entities

I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id and type. No joins. Just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder if using Redshift is like taking a sledgehammer to crack a nut?
For your requirements you can use Amazon DynamoDB and define TTL values to remove older items automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (though it can be costly)
Use a sort key with the timestamp to query the latest items.
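For example, a minimal sketch of that design with boto3, using user_id as the partition key, an ISO-8601 timestamp as the sort key, and a TTL attribute for the two-year retention (the table and attribute names are hypothetical):

    import time
    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    events = boto3.resource("dynamodb").Table("events")  # hypothetical table
    TWO_YEARS = 2 * 365 * 24 * 3600

    def put_event(user_id, event_type, timestamp):
        events.put_item(Item={
            "user_id": user_id,          # partition key
            "timestamp": timestamp,      # sort key, e.g. "2019-05-01T12:00:00Z"
            "type": event_type,
            "expires_at": int(time.time()) + TWO_YEARS,  # TTL attribute
        })

    def latest_events(user_id, limit=20, event_type=None):
        kwargs = {
            "KeyConditionExpression": Key("user_id").eq(user_id),
            "ScanIndexForward": False,   # newest first
            "Limit": limit,
        }
        if event_type:
            # The filter is applied after the Limit, so this can return fewer
            # than `limit` items; a GSI or a composite sort key such as
            # "type#timestamp" would serve heavy per-type reads better.
            kwargs["FilterExpression"] = Attr("type").eq(event_type)
        return events.query(**kwargs)["Items"]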
I would also suggest looking at AWS SimpleDB, as at first glance it looks like an even better fit for your requirements.
Please refer to this article, which explains some practical user experience:
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html

IoT and DynamoDB

We have to work on an IoT system. Basically sensors sending data to the cloud, and users being able to access the data belonging to them.
The amount of data can be pretty substantial, so we need something that covers both security and heavy load.
The shape of the data is pretty straightforward: basically a measurement and its value at a specified time.
The idea was to use DynamoDB for this, having a table with:
[id of sensor-array]
[id of sensor]
[type of measure]
[value of measure]
[date of measure]
The idea was for the IoT system to put directly (in python) data into the database.
Our questions are:
In terms of performance:
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
will querying the table by the id of the sensor array and a minimum date allow us to retrieve the data in an efficient fashion?
In terms of security is it okay to proceed this way?
We used to use NoSQL databases like MongoDB, so we're finding it hard to apply our notions to DynamoDB, where the data seems to be arranged in a pretty simple fashion.
Thanks.
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
Yes, DynamoDB will sustain (and cost) all the write throughput you provision for the table. As your items are small, aggregating before writing in batches (BatchWriteItem) is probably more cost-efficient than individual writes.
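A rough sketch of such batching with boto3's batch_writer, which wraps BatchWriteItem (25 items per request) and resends unprocessed items; the table and attribute names simply mirror the layout proposed in the question:

    import boto3

    table = boto3.resource("dynamodb").Table("sensor_measures")  # hypothetical name

    def write_measurements(measurements):
        """measurements: iterable of dicts with the five attributes listed above."""
        with table.batch_writer() as batch:   # flushes in chunks of 25 items
            for m in measurements:
                batch.put_item(Item={
                    "sensor_array_id": m["sensor_array_id"],  # hash (partition) key
                    "measure_date": m["measure_date"],        # range (sort) key, ISO-8601
                    "sensor_id": m["sensor_id"],
                    "measure_type": m["measure_type"],
                    # numeric values must be Decimal (not float) with the resource API
                    "value": m["value"],
                })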
will querying the table by the id of the sensor array and a minimum date allow us to retrieve the data in an efficient fashion?
Yes, queries by hash (id) and range keys (date) would be very efficient. You may need secondary indexes for more complex queries though.
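For illustration, the corresponding query by hash and range key, using a minimum date (same hypothetical table as above):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("sensor_measures")

    def measurements_since(sensor_array_id, min_date):
        """min_date as an ISO-8601 string, so lexicographic order matches time order."""
        response = table.query(
            KeyConditionExpression=(
                Key("sensor_array_id").eq(sensor_array_id)
                & Key("measure_date").gte(min_date)
            )
        )
        return response["Items"]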
In terms of security is it okay to proceed this way?
Although data is encrypted in transit and client-side encryption is straightforward, there is a lot left uncovered here. For example, AWS IoT provides TLS mutual authentication with certificates, IAM or Cognito, among other security features. AWS IoT can store data in DynamoDB using a simple rule.