where to store 10kb pieces of text in amazon aws? - amazon-web-services

These will be indexed and randomly accessed in a web app, like SO questions. SimpleDB has a 1,024-byte limit per attribute; you could split the text across multiple attributes, but that sounds inelegant.
Examples: blog posts; facebook status messages; recipes (in a blogging application; facebook-like application; recipe web site).
If I were to build such an application on Amazon AWS, where/how should I store the pieces of text?

You could put all the actual text files in S3, then index them with Amazon RDS, or Postgres on Heroku, or whatever suits you at that time.
Also, you can get the client to download the multi-kB text blurbs directly from S3, so your app could just deliver URLs to the messages. That makes for a massively parallel server: even if the main server is just a single thread on one machine, it only has to construct pages out of S3 asset URLs. S3 could store all assets, like images, etc.
The advantages are big. This also solves backup, etc., and allows you to play with many indexing and searching schemes. Search could, for instance, be done using Google...
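As a rough sketch of that pattern (boto3; the bucket name and key scheme are made up, and the index in RDS/Postgres is only hinted at), storing a post body in S3 and handing out its URL might look something like this:

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-posts"  # hypothetical bucket name


def save_post(body_text: str) -> str:
    """Store the post body in S3; the returned key is what you'd index in RDS/Postgres."""
    key = f"posts/{uuid.uuid4()}.txt"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=body_text.encode("utf-8"),
        ContentType="text/plain; charset=utf-8",
    )
    return key


def public_url(key: str) -> str:
    # Assumes the objects are publicly readable; for private buckets use
    # s3.generate_presigned_url("get_object", Params={"Bucket": BUCKET, "Key": key}).
    return f"https://{BUCKET}.s3.amazonaws.com/{key}"


key = save_post("This is a ~10 KB blog post body...")
print(public_url(key))  # the page just embeds this URL; the browser fetches the text from S3
```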

I'd say you want to look at Amazon RDS, which runs a relational database like MySQL in the cloud. A single DynamoDB read capacity unit can only (consistently) read a 1 KB item, so that's probably not going to work for you.
Alternatively, you could store the text files in S3 and put pointers to these files in SimpleDB. Which option is more cost-effective depends on a lot of factors: how many files you add every day, how often these files are expected to change, how often they are requested, and so on.
Personally, I think that using S3 would not be the best approach. If you store all questions and answers in separate text files, you're looking at a number of requests for displaying even a simple page. Let alone search, which would require you to fetch all the files from S3 and search through them. So for search, you need a database anyway.
You could use SDB for keeping an index but frankly, I would just use MySQL on Amazon RDS (there's a free two-month trial period right now, I think) where you can do all the nice things that relational databases can do, and which also offers support for full-text search. RDS should be able to scale up to huge numbers of visitors every day: you can easily scale up all the way to a High-Memory Quadruple Extra Large DB Instance with 68 GB of memory and 26 ECUs.
As far as I know, SO is also built on top of a relational database: https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
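If you do go the MySQL-on-RDS route, a minimal sketch of the full-text search part (plain MySQL FULLTEXT indexing via pymysql; the host, table, and column names are made up):

```python
import pymysql

# Placeholder connection details for an RDS MySQL instance.
conn = pymysql.connect(host="mydb.xxxxxx.rds.amazonaws.com",
                       user="app", password="secret", database="myapp")

with conn.cursor() as cur:
    # A FULLTEXT index enables MATCH ... AGAINST queries (InnoDB supports this since MySQL 5.6).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            body TEXT,                      -- a ~10 KB post fits easily in a TEXT column
            FULLTEXT KEY ft_title_body (title, body)
        ) ENGINE=InnoDB
    """)
    cur.execute("INSERT INTO posts (title, body) VALUES (%s, %s)",
                ("My first post", "A ~10 KB blob of text goes here..."))
    conn.commit()

    # Full-text search across title and body.
    cur.execute("""
        SELECT id, title
        FROM posts
        WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE)
    """, ("blob",))
    print(cur.fetchall())
```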

DynamoDB might be what you want; there is even a forum use case in its documentation: Example Tables and Data in Amazon DynamoDB
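For scale, a 10 KB post also fits in a single DynamoDB item these days (the item size limit is now 400 KB). A minimal boto3 sketch, with a made-up table name and key schema:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Posts")  # hypothetical table with a "PostId" partition key

# A ~10 KB body fits comfortably within DynamoDB's 400 KB per-item limit.
table.put_item(Item={
    "PostId": "post-123",
    "Author": "alice",
    "Body": "...the full ~10 KB text of the post...",
})

item = table.get_item(Key={"PostId": "post-123"})["Item"]
print(item["Body"])
```

Read capacity consumption still scales with item size, as another answer notes.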

There is insufficient information in the question to provide a reasonable answer to "where should I store text that I'm going to use?"
Depending on how you build your application and what the requirements are for speed, redundancy, latency, volume, scalability, size, cost, robustness, reliability, searchability, modifiability, security, etc., the answer could be any of:
Drop the text in files on an EBS volume attached to an instance.
Drop the text into a MySQL or RDS database.
Drop the text into a distributed file system spread across multiple instances.
Upload the text to S3.
Store the text in SimpleDB.
Store the text in DynamoDB.
Cache the text in ElastiCache.
There are also a number of variations on this, like storing the master copy in S3, caching copies in ElastiCache and on the local disk, indexing it with specific keys in DynamoDB, and making it searchable in Amazon CloudSearch.
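As one illustration of the "master copy in S3, cached copy in ElastiCache" variation, a cache-aside sketch (redis-py against an assumed ElastiCache Redis endpoint; all names are placeholders):

```python
import boto3
import redis

s3 = boto3.client("s3")
cache = redis.Redis(host="my-cache.abc123.cache.amazonaws.com", port=6379)  # placeholder endpoint
BUCKET = "my-app-posts"  # placeholder bucket


def get_post(key: str) -> str:
    # 1. Try the cache first.
    cached = cache.get(key)
    if cached is not None:
        return cached.decode("utf-8")
    # 2. Fall back to the master copy in S3.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode("utf-8")
    # 3. Populate the cache for subsequent readers (expire after an hour).
    cache.setex(key, 3600, body)
    return body
```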

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect a backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to a traditional WYSIWYG editor. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server that handles the image upload. In response, the server has to send the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronizing with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English; I do my best.
Thank you for your help!
I think you should upload them to S3 as soon as they are available. This way you ensure their availability and resistance to failure of your instance. S3 stores files across multiple Availability Zones (AZs), ensuring reliable long-term storage. An instance, on the other hand, operates only within one AZ, and if something happens to it, all the data on the instance is lost. So you could potentially lose an entire batch of images if you wait to upload them.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
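As a rough sketch of such an upload endpoint (Flask and boto3 here purely for illustration; the bucket name is a placeholder, and the response shape is what the editor-js/image plugin expects as far as I know):

```python
import uuid
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-notepad-uploads"  # placeholder bucket name


@app.route("/upload-image", methods=["POST"])
def upload_image():
    # The editor-js/image tool posts the file as multipart form data
    # ("image" is its default field name, but it is configurable).
    file = request.files["image"]
    key = f"images/{uuid.uuid4()}-{file.filename}"
    s3.upload_fileobj(file, BUCKET, key)  # straight to S3, no local copy kept
    url = f"https://{BUCKET}.s3.amazonaws.com/{key}"
    return jsonify({"success": 1, "file": {"url": url}})
```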

Django and Amazon Lambda: Best solution for big data with Amazon RDS or GraphQL or Amazon AppSync

We have a system with a lot of data (about 10 million rows in one table). We developed it with the Django framework, and we also want to use Amazon Lambda for serving it. Now I have some questions about it:
1- If we want to use Amazon RDS (MySQL, PostgreSQL), which one is better? And is a relational database a good solution for this?
2- I read somewhere that if we use a relational database with Amazon Lambda, Django opens a new connection to the DB for each instance, and that is awful. Is this correct?
3- If we want to use GraphQL and a graph database, is that a good solution? Or can we combine a Django REST API and GraphQL together?
4- If we don't use Django and use Amazon AppSync instead, is that better or not? What are the limitations of using it?
Please help me.
Thanks
GraphQL is very useful for graph data, not time series. Your choice will depend on the growth factor, not the actual row count. I currently run an RDS instance with 5 billion rows just fine, but the problem is how it will increase over time. I suggest looking into archival strategies using things like S3 or IoT Analytics (this one is really cool).
I wouldn't worry about concurrent connections until you have a proper reason to (50+ per second). Your DB will be the largest server you have anyway.
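On the connection question specifically: a common mitigation is to create the connection outside the handler, so that warm Lambda invocations reuse it instead of reconnecting every time (Amazon RDS Proxy is another option for pooling). A minimal sketch with pymysql; the environment variable names are assumptions:

```python
import os
import pymysql

# Created once per Lambda execution environment (on cold start) and then
# reused across warm invocations, so each request doesn't open a new connection.
connection = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)


def handler(event, context):
    connection.ping(reconnect=True)  # re-open if the idle connection was dropped
    with connection.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM my_table")  # placeholder query
        (count,) = cur.fetchone()
    return {"rows": count}
```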

Sanity check on AWS Big Data Architecture

We're currently looking to move our AWS architecture over to something that supports large amounts of data and can scale as we gain more customers. When this project started we stuck with what we knew, a Ruby app on an EC2 making RESTful API calls, storing the results in S3, and also storing everything in an RDS. We have a SPA front end written in VueJS to support the stored data.
As our client list has grown, the outbound API calls and the resulting data we are storing have also grown. I'm currently tasked with looking for a better solution, and I wanted to get some feedback on what I'm thinking so far. Currently we have around 5 million rows of relational data, which will only increase as our client list does. I could see us being in the low billions of rows in a year or two.
The Ruby app does a great job of queuing the outbound API calls, handling retries, and everything else in between. For this reason we thought about keeping the app, and rather than inserting directly into the RDS, it would simply dump the results into S3 as CSV.
An S3 trigger could then convert the raw CSV data into Parquet format using a Lambda function (I was looking at something like PyArrow). From there we could move from the traditional RDS to something like Athena, which supports Parquet and would allow us to reuse most of our existing SQL queries.
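A rough sketch of what I had in mind for that Lambda (assuming PyArrow is bundled as a layer; bucket names are placeholders):

```python
import boto3
import pyarrow.csv as pv
import pyarrow.parquet as pq

s3 = boto3.client("s3")
PARQUET_BUCKET = "my-parquet-bucket"  # placeholder destination bucket


def handler(event, context):
    # The S3 put event carries the bucket and key of the uploaded CSV.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    local_csv = "/tmp/input.csv"
    local_parquet = "/tmp/output.parquet"
    s3.download_file(bucket, key, local_csv)

    # Convert the raw CSV to Parquet with PyArrow.
    table = pv.read_csv(local_csv)
    pq.write_table(table, local_parquet)

    # Write the Parquet file where Athena can query it.
    s3.upload_file(local_parquet, PARQUET_BUCKET, key.replace(".csv", ".parquet"))
```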
To further optimize the performance for the user we thought about caching commonly used queries in a Dynamo table. Because the data is based on the scheduled external API calls, we could control when to bust the cache of the queries.
Big Data backends aren't really my thing, so any feedback is greatly appreciated. I know I have a lot more research to do on Parquet, as it's new to me. Eventually we'd like to do some ML on this data, which I believe Parquet will also support. Thanks.

What is the best way to transfer data from AWS SQS to S3?

Here is the case - I have a large dataset, temporarily retained in AWS SQS (around 200 GB).
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while that is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integrating them that it is kinda confusing.
Thanks for your help!
for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket.
Imho a good idea. Indeed, S3 is the best option to retain data and be able to reuse it (unlike SQS). AWS tools (SageMaker, ML) can directly use content stored in S3. Most machine learning frameworks can read files, and you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
And while that is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store each message reasonably fast).
If you want to aggregate and/or concatenate the source messages, or if processing a message would take too long, you may rather write a script to read and process the data on a server.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there?
Well - in theory you can do it on your laptop, but it would mean downloading 200 GB and uploading 200 GB (not counting the overhead and transfer latency).
Your intuition is IMHO good; having an EC2 instance in the same region would be the most feasible option, accessing all the data almost locally.
Amazon has so many different solutions and ways of integrating them that it is kinda confusing.
You have many options feasible for different use cases, often overlapping, so indeed it may look confusing.
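For the "script on a server" variant, a minimal sketch of draining the queue into S3 with boto3 (queue URL and bucket name are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
BUCKET = "my-ml-dataset-bucket"  # placeholder


def drain_queue():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # maximum allowed per call
            WaitTimeSeconds=20,      # long polling to reduce empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue looks empty
        for msg in messages:
            # One object per message, keyed by MessageId to avoid collisions.
            s3.put_object(
                Bucket=BUCKET,
                Key=f"raw/{msg['MessageId']}.json",
                Body=msg["Body"].encode("utf-8"),
            )
            # Delete only after a successful upload.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Run it on an EC2 instance in the same region as the queue and the bucket, so the 200 GB never leaves AWS.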

What is the difference between AWS Elasticsearch and AWS Redshift

I read the documentation; both are for data analysis and both run as clusters, but I don't understand how their use cases differ.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. Amazon Elasticsearch
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift
Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (afaik) mostly used for BI purposes and other compute-intensive jobs, while Amazon Elasticsearch is an out-of-the-box managed ElasticSearch cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything in order to manage your servers (this is what you pay for). Using the AWS Console you can add a new cluster, and you don't need to run any commands in order to install any software - you just need to choose which servers to run your cluster on (number of nodes, disk, RAM, etc.).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch
I agree with #IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear - whether speaking of the AWS ElasticSearch Service or FOSS / Enterprise ElasticSearch (which have significant differences between them, even) - ElasticSearch is NOT a Relational Database (RDBMS), nor is it quite a NoSQL (Document Store) database, either...
ElasticSearch is a Search Engine / Index. It does some things very well, for very specific use cases; however, most significantly, and unlike RDBMS data models, ElasticSearch and NoSQL are not going to provide you with FULL ACID compliance or transactional statement processing. So if your use case prioritizes data integrity, constrainability, reliability, auditability, regulatory compliance, recoverability (even to a point in time), and a normalized data model for performance and minimal repetition of data while providing deep cardinality and enforcing model constraints for optimal integrity, "NoSQL and Elastic are not the Droids you're looking for..." and you should be implementing an RDBMS solution. As already mentioned, the AWS Redshift Service is based on PostgreSQL - one of the most popular open-source RDBMS flavors out there, offered by AWS as a fully managed solution / service for their customers.
Elastic falls between the RDBMS and NoSQL categories, as it is a Search Engine / Index that works most optimally with "single index" type use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing. But perhaps the most important thing I could stress is that in my experience it typically does not scale very cost-effectively (even as a managed cluster service) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - for most it will likely become cost PROHIBITIVE VERY fast. That said, ElasticSearch DOES still have very optimal use cases, so it is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly, let's call NoSQL what it is: a Document Store that stores collections of documents (most often in JSON format). While Document Stores also do indexing, offer some semblance of an Authentication and Authorization model, and provide CRUD operability (or even SQL support nowadays, which makes the career Enterprise Data Engineer in me giggle - SQL is now the preferred means of querying data from their NoSQL instances! :D ), they are still NOT a traditional database and likely won't provide you with much control over your data's integrity - BUT that is precisely what "NoSQL" Document Stores were designed to work best for - UNSTRUCTURED DATA - where you may not always know what your data model is going to look like from the start, or your use case prioritizes data model flexibility over enforcing data integrity in general (non mission critical data). Last - while most modern NoSQL Document Stores may have SOME features that appear on the surface to resemble an RDBMS, I am not aware of ANY in that category at present that could claim to offer all that a relational database does, with Oracle MySQL's Document Store being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope Developers with similar questions come across this thread, and after reading are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves - everything we do in our profession is about data - either generating it, transporting it, rendering it, transforming it....it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!
This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
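To give a feel for that search use case, a minimal sketch of a full-text product query against an Elasticsearch endpoint (the domain URL, index, and field names are made up, and request signing/authentication is omitted):

```python
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder domain

query = {
    "query": {
        "match": {"name": "toothbrush"}  # full-text match on the product name field
    },
    "size": 10,
}

resp = requests.post(f"{ES_ENDPOINT}/products/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```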
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.