Recommendation for implementing backend caching in a serverless AWS architecture (amazon-web-services)

I have a serverless app using AWS Lambda. My Lambda fetches data from third-party APIs, and I'm planning to implement backend caching for the data it fetches. I've read some articles online and looked at some sample architectures.
In the architectures I've seen, DynamoDB is used to store the cache of the data fetched by the Lambda. People suggest DynamoDB because the latency it adds while fetching the cache is minimal. But I can't go with DynamoDB, as each data item I need to cache is very large even after gzipping it (approx. 5-7 MB, well above DynamoDB's 400 KB item size limit). Hence, I'm planning to use DocumentDB instead of DynamoDB. I'm just learning how DocumentDB works and have no idea whether it is as fast and efficient as DynamoDB. Could anyone comment on this idea and suggest which options I could use in this case apart from DynamoDB, and whether DocumentDB is a good alternative?

If you need to cache 5-7 MB per item, have you considered using S3 as a cache? S3 can work as an effective (and cheap) NoSQL-style key-value store. I'm not sure it will meet your specific use case, but it may be worth exploring depending on your needs.
I found a good conversation about this topic on StackOverflow if you want to dive deeper.
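To make the idea concrete, here is a minimal sketch of an S3-backed cache for a Lambda. The bucket name, key scheme, and TTL handling via object metadata are all assumptions for illustration, not anything from the thread; an S3 lifecycle rule could expire old objects instead of the timestamp check.

```
import gzip
import json
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-cache-bucket"   # hypothetical bucket name
TTL_SECONDS = 3600           # how long a cached item stays valid

def get_cached(key):
    """Return the cached payload, or None if missing or expired."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            return None
        raise
    cached_at = float(obj["Metadata"].get("cached-at", 0))
    if time.time() - cached_at > TTL_SECONDS:
        return None  # stale; caller should refresh from the third-party API
    return json.loads(gzip.decompress(obj["Body"].read()))

def put_cached(key, data):
    """Gzip and store a JSON-serialisable payload with a timestamp."""
    body = gzip.compress(json.dumps(data).encode("utf-8"))
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=body,
        Metadata={"cached-at": str(time.time())},
    )
```

Note that S3 returns user metadata keys lowercased, which is why the sketch sticks to a lowercase key.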

Related

Communication between concurrent AWS Lambdas

I want a set of concurrent AWS Lambda functions to write and read data from a common place. They will read the last post introduced and perhaps write a new (last) post. Posts will be short (less than 1 kB) and in JSON format. Once the last concurrent Lambda finishes execution, the posts will be deleted.
I am considering using AWS DynamoDB. Do you think that this is the best option? Is it possible to use AWS CloudWatch?
There are definitely two options: DynamoDB and ElastiCache. This is a pretty basic answer and I'm sure there are more things to consider; I'm just focusing on this use case.
DynamoDB
Pros: serverless.
Cons:
You will probably need strongly consistent reads to fetch an update made only milliseconds earlier (see the sketch below).
It could also become more expensive than ElastiCache, depending on how many reads/writes you are making.
ElastiCache
Pros: faster than DynamoDB, as it is in-memory.
Cons: you need to run a server, so you will be paying some minimum amount.
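For the DynamoDB route, this is a rough sketch of reading the latest post with a strongly consistent read. The table name, key names, and single-partition layout are assumptions for illustration only.

```
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table "posts" with partition key thread_id and sort key ts
table = boto3.resource("dynamodb").Table("posts")

def write_post(thread_id, timestamp, body):
    """Append a new post; the sort key orders posts by time."""
    table.put_item(Item={"thread_id": thread_id, "ts": timestamp, "post": body})

def read_latest_post(thread_id):
    """Fetch the most recently written post with a strongly consistent read."""
    resp = table.query(
        KeyConditionExpression=Key("thread_id").eq(thread_id),
        ScanIndexForward=False,   # newest first (descending by sort key)
        Limit=1,
        ConsistentRead=True,      # see writes made milliseconds earlier
    )
    items = resp.get("Items", [])
    return items[0] if items else None
```

Strongly consistent reads cost twice as much read capacity as eventually consistent ones, which is part of the price comparison with ElastiCache.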

Django and AWS Lambda: best solution for big data with Amazon RDS, GraphQL, or AWS AppSync

We have a system with large data (about 10 million rows in one table). We developed it with the Django framework and we also want to use AWS Lambda to serve it. Now I have some questions about it:
1- If we want to use Amazon RDS (MySQL, PostgreSQL), which one is better? And is a relational database a good solution for this at all?
2- I read somewhere that if we use a relational database with AWS Lambda, Django opens a new connection to the DB for each instance, and that this is awful. Is this correct?
3- If we want to use GraphQL and a graph database, is that a good solution? Or can we combine a Django REST API and GraphQL together?
4- If we don't use Django and use AWS AppSync instead, is that better or not? What are the limitations of using it?
Please help me.
Thanks
GraphQL is very useful for graph-shaped data, not time series. Your choice will depend on the growth factor, not the current row count. I currently run an RDS instance with 5 billion rows just fine; the problem is how it will grow over time. I suggest looking into archival strategies using things like S3 or IoT Analytics (this one is really cool).
I wouldn't worry about concurrent connections until you have a proper reason to (50+ per second). Your DB will be the largest server you have anyway.
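On the connection question specifically, the usual pattern with Lambda is to create the database connection outside the handler so that warm invocations reuse it instead of opening a new one per request. A rough sketch, assuming PostgreSQL via psycopg2 and hypothetical environment variables and table names:

```
import os

import psycopg2

# Created once per Lambda container; warm invocations reuse the same connection.
_conn = None

def _get_conn():
    global _conn
    if _conn is None or _conn.closed:
        _conn = psycopg2.connect(
            host=os.environ["DB_HOST"],        # hypothetical env vars
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            connect_timeout=5,
        )
    return _conn

def handler(event, context):
    with _get_conn().cursor() as cur:
        cur.execute("SELECT count(*) FROM my_table;")  # hypothetical table
        (count,) = cur.fetchone()
    return {"rows": count}
```

RDS Proxy is another option if the number of concurrent Lambdas ever does push connection counts up.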

Best practices for setting up a data pipeline on AWS? (Lambda/EMR/Redshift/Athena)

*Disclaimer:* This is my first time ever posting on Stack Overflow, so excuse me if this is not the place for such a high-level question.
I just started working as a data scientist and I've been asked to set up an AWS environment for 'external' data. This data comes from different sources, in different formats (although it's mostly csv/xlsx). They want to store it on AWS and be able to query/visualize it with Tableau.
Despite my lack of AWS experience I managed to come up with a solution that's more or less working. This is my approach:
1. Raw csv/xlsx files are grabbed using a Lambda
2. Data is cleaned and transformed using pandas/numpy in the same Lambda as 1.
3. The processed data is written to S3 folders as CSV (still within the same Lambda)
4. Athena is used to index the data
5. Extra tables are created using Athena (some of which are views, others aren't)
6. The Athena connector is set up for Tableau
It works, but it feels like a messy solution: the queries are slow and the Lambdas are huge. The data is often not as normalized as it could be, since normalizing it increases query time even more. Storing it as CSV also seems silly.
I've tried to read up on best practices, but it's a bit overwhelming. I've got plenty of questions, but it boils down to: what services should I be using in a situation like this? What does the high-level architecture look like?
I have a fairly similar use case; however, it all comes down to the size of the project and how far you want to take the robustness / future-proofing of the solution.
As a first iteration, what you have described above seems like it works and is a reasonable approach, but as you pointed out it is quite basic and clunky. If the external data is something you will be consistently ingesting and can foresee growing, I would strongly suggest you design a data lake system first; my recommendation would be either to use the AWS Lake Formation service or, if you want more control and to build from the ground up, to use something like the 3x3x3 approach.
By designing your data lake correctly, managing the data in the future becomes much simpler, and it nicely partitions your files for future use / data diving.
A high-level architecture would be something like:
Lambda gets the request from the source and pastes it to S3
The data lake system handles the file and auto-partitions + tags it
then,
Depending on how quickly you need to visualise your data, and if it is a large amount of data, potentially use an AWS Glue Python shell or PySpark job instead of Lambda, which will handle your pandas/numpy a lot more cleanly.
I would also recommend converting your files into Parquet if you're using Athena or equivalent, for improved query speed (see the sketch below). Remember, file partitioning is important to performance!
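As a rough illustration of the Parquet suggestion, here is a sketch of writing a cleaned pandas DataFrame to partitioned Parquet on S3. The bucket, prefix, and `event_date` column are hypothetical, and it assumes pyarrow and s3fs are available; the same idea works inside a Glue Python shell job or a Lambda.

```
import pandas as pd

def write_partitioned_parquet(df: pd.DataFrame, bucket: str) -> None:
    """Write the frame as Parquet, partitioned by year/month for Athena."""
    df["year"] = df["event_date"].dt.year    # hypothetical datetime column
    df["month"] = df["event_date"].dt.month
    df.to_parquet(
        f"s3://{bucket}/external-data/",     # hypothetical destination prefix
        engine="pyarrow",
        partition_cols=["year", "month"],    # becomes year=YYYY/month=MM/ keys
        index=False,
    )
```

Athena then only scans the columns and partitions a query touches, which is where most of the speed-up over flat CSV comes from.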
Note, the above is for quite a robust ingestion system and may be overkill if you have a basic use case with infrequent data ingestion.
If your data arrives in small packets but very frequently, you could even put a Kinesis layer in front of the Lambda-to-S3 step to pipe your data in a more organised manner. You could also use Redshift to host your data instead of S3 if you wanted a more contemporary warehouse solution. However, if you have x sources, I would suggest sticking with S3 for simplicity.

DynamoDB has very little supporting documentation on the web

I am trying to use DynamoDB as the backend DB for my application, but I am having a hard time finding useful information about it.
What is the best source of examples and tutorial information, for syntax, structures, etc.?
The AWS docs are really confusing. Or am I the only person sitting with these problems?
Oh, and is the newly launched AWS DocumentDB (basically MongoDB) going to make DynamoDB pointless to learn, or is there still merit in learning DynamoDB?
The pricing models of DocumentDB and DynamoDB are completely different; there is definitely a place for both. IMO, DynamoDB is not going away any time soon.
As far as tutorials go, there are tons of AWS re:Invent videos on YouTube, and this site lets you search and find them easily: https://reinventvideos.com/. It's a good place to start.
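If it helps while you work through the docs, here is a minimal boto3 example of the basic put/get/update pattern, with a hypothetical `users` table keyed on `user_id`:

```
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # hypothetical table, partition key "user_id"

# Write an item (attributes other than the key are schemaless)
table.put_item(Item={"user_id": "u-123", "name": "Ada", "logins": 7})

# Read it back by key
resp = table.get_item(Key={"user_id": "u-123"})
print(resp.get("Item"))

# Update a single attribute in place
table.update_item(
    Key={"user_id": "u-123"},
    UpdateExpression="SET logins = logins + :inc",
    ExpressionAttributeValues={":inc": 1},
)
```

The resource-level API shown here handles type marshalling for you; the lower-level client API in the AWS docs is wordier, which is part of why the documentation can feel confusing at first.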

Sanity check on AWS Big Data Architecture

We're currently looking to move our AWS architecture over to something that supports large amounts of data and can scale as we gain more customers. When this project started we stuck with what we knew: a Ruby app on an EC2 instance making RESTful API calls, storing the results in S3, and also storing everything in RDS. We have an SPA front end written in VueJS that presents the stored data.
As our client list has grown, the outbound API calls and the resulting data we are storing have also grown. I'm currently tasked with looking for a better solution, and I wanted to get some feedback on what I'm thinking so far. Currently we have around 5 million rows of relational data, which will only increase as our client list does. I could see us being in the low billions of rows within a year or two.
The Ruby app does a great job of handling the queuing of outbound API calls, retries, and everything else in between. For this reason we thought about keeping the app and, rather than inserting directly into RDS, having it simply dump the results into S3 as CSV.
An S3 trigger could then convert the raw CSV data into Parquet format using a Lambda function (I was looking at something like PyArrow). From there we could move from the traditional RDS over to something like Athena, which supports Parquet and would allow us to reuse most of our existing SQL queries.
To further optimize performance for the user, we thought about caching commonly used queries in a DynamoDB table. Because the data is driven by the scheduled external API calls, we can control when to bust the cache of those queries.
Big data backends aren't really my thing, so any feedback is greatly appreciated. I know I have a lot more research to do into Parquet as it's new to me. Eventually we'd like to do some ML on this data, so I believe Parquet will also support that. Thanks.
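Since PyArrow was mentioned, here is a rough sketch of what the S3-triggered conversion Lambda could look like. The destination bucket and key layout are placeholders, and pyarrow is not in the standard Lambda runtime, so it would need to ship as a layer or container image.

```
import urllib.parse

import boto3
import pyarrow.csv as pv
import pyarrow.parquet as pq

s3 = boto3.client("s3")
DEST_BUCKET = "my-parquet-bucket"  # hypothetical destination bucket

def handler(event, context):
    """Triggered by s3:ObjectCreated; converts the uploaded CSV to Parquet."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Download the raw CSV to Lambda's scratch space
        local_csv = f"/tmp/{key.split('/')[-1]}"
        s3.download_file(bucket, key, local_csv)

        # Read with PyArrow and write back out as Parquet
        table = pv.read_csv(local_csv)
        local_parquet = local_csv.rsplit(".", 1)[0] + ".parquet"
        pq.write_table(table, local_parquet)

        dest_key = key.rsplit(".", 1)[0] + ".parquet"
        s3.upload_file(local_parquet, DEST_BUCKET, dest_key)
```

Lambda's /tmp space and memory put a ceiling on file size, so very large CSVs may be better handled by a Glue job; either way, Parquet's columnar layout is also a good fit for the ML work you mention later.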