I have an S3 bucket that gets data loaded into it daily. I am still pretty new to this process, but was wondering what the most cost-effective and efficient way to query the data would be. I am considering a Lambda function to aggregate the data within the S3 bucket, and then either loading it into a Redshift schema or querying it directly with Athena.
Are there better alternatives? Different data warehouses I should consider? Cost is a priority, but ultimately this needs to be a scalable solution.
Thanks in advance!
Related
I have information in Amazon DynamoDB with frequently updated/added rows (it is updated by a Lambda that processes events received from a Kinesis Stream).
I want to provide a way for other teams to query that data through Athena.
It has to be as close to real time as possible (minimizing the period between receiving an event and an Athena query being able to see that new/updated information).
What is the best/most cost-optimized way to do that?
I know about some of the options:
Scan the table regularly and dump the information somewhere Athena can query it. This is going to be quite expensive and not real time.
Start putting the raw events in S3 as well, not just DynamoDB, and make a Glue crawler that scans only the new records. That's going to be closer to real time, but I don't know how to deal with duplicate events (the information in DynamoDB is updated quite frequently, and it overwrites old records). I'm also not sure if it is the best way.
Maybe update the Data Catalog directly from the Lambda? I'm not sure if that is even possible; I'm still new to the AWS tech stack.
Any better ways to do that?
You can use Athena Federated Query for this use-case.
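If it helps, here is a minimal boto3 sketch of what running such a federated query could look like. It assumes you have already deployed the Athena DynamoDB connector Lambda and registered it as a data catalog; the catalog name "ddb_catalog", the table "my_events_table", and the results bucket are placeholders, not real resource names.

```python
# Sketch: querying DynamoDB through Athena Federated Query with boto3.
# Assumes the Athena DynamoDB connector has been deployed and registered
# as a data catalog named "ddb_catalog", and that "s3://my-athena-results/"
# is a bucket you own for query output -- both names are placeholders.
import time
import boto3

athena = boto3.client("athena")

def query_dynamodb_via_athena(sql: str) -> list:
    """Run a federated query and return the raw result rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Catalog": "ddb_catalog", "Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (Athena is asynchronous).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

# Example: read items straight from the live DynamoDB table.
rows = query_dynamodb_via_athena(
    'SELECT * FROM "ddb_catalog"."default"."my_events_table" LIMIT 10'
)
```

Because the connector reads the live table, there is no sync delay, but each query does consume DynamoDB read capacity, so it is worth weighing that against the S3-export options you listed.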
I have stored changelogs (data with information about data) from non-relational, schemaless data tables in S3. Now I want some structured, relational way to query all of that data, so I need to create a database from S3. I am confused about what I should do: keep the data in S3, or move it into some traditional database?
You can create a Glue catalog over the data and query it using serverless Athena.
This way you are not bound to any RDBMS and can query your data whenever required while keeping the files in S3.
This will also be cost-effective.
Or you can spin up an RDS instance in AWS at any time if required. So keeping the files in S3 is a good option.
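As a rough sketch of that setup with boto3: the bucket, IAM role, database, and crawler names below are placeholders, and the table name is whatever the crawler infers from your S3 path (assumed here to be "changelogs").

```python
# Sketch: cataloging the S3 changelog files with a Glue crawler so Athena
# can query them in place. All resource names are placeholders.
import boto3

glue = boto3.client("glue")

# One-time setup: a crawler that infers the schema of the files in S3
# and registers a table in the Glue Data Catalog.
glue.create_crawler(
    Name="changelog-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="changelog_db",
    Targets={"S3Targets": [{"Path": "s3://my-changelog-bucket/changelogs/"}]},
    Schedule="cron(0 1 * * ? *)",  # optional: re-crawl daily to pick up new files
)
glue.start_crawler(Name="changelog-crawler")

# Once the crawler has run, Athena can query the table directly from S3.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT * FROM changelog_db.changelogs LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```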
I have a table with about 6 million records and want to start archiving records. I have thought of creating a backup version of the same table and moving records across once they meet the criteria for being archived. However, I have been told that it is also possible to use Hive to copy this data to S3.
Could someone please explain why I would opt to copy the data into an S3 bucket rather than store it in another DynamoDB table?
DynamoDB has a time-to-live (TTL) mechanism, and you can set up a stream of record deletions that invokes an AWS Lambda and puts the data in S3. Check this detailed guide on how to set it up. Also, you can try AWS Data Pipeline with an EMR cluster, which is a common way to set up one-time or periodic migrations.
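A minimal sketch of the Lambda behind that TTL + Streams pattern: it filters the stream for records deleted by the TTL process and archives the old item image to S3. The bucket name and key layout are placeholders, and the table's stream is assumed to be configured with OLD_IMAGE (or NEW_AND_OLD_IMAGES) view type.

```python
# Sketch: archive TTL-expired DynamoDB items to S3 from a Streams-triggered Lambda.
import json
import boto3

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "my-archive-bucket"  # placeholder bucket

def handler(event, context):
    for record in event.get("Records", []):
        # TTL expirations show up as REMOVE events performed by the
        # DynamoDB service principal, so ignore everything else.
        if record.get("eventName") != "REMOVE":
            continue
        identity = record.get("userIdentity") or {}
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue

        old_image = record["dynamodb"].get("OldImage", {})
        # Use the stream sequence number as a unique, ordered object key.
        key = f"archive/{record['dynamodb']['SequenceNumber']}.json"
        s3.put_object(
            Bucket=ARCHIVE_BUCKET,
            Key=key,
            Body=json.dumps(old_image).encode("utf-8"),
        )
```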
If you actively use full-scan operations over your DynamoDB table, then it's better to archive and remove the records you don't use. If you query the records only by the primary key, then archiving most probably isn't worth the effort. You can verify against your bill, but the first 25 GB stored in DynamoDB are free.
I'm comparing cloud storage options for a large set of files with certain 'attributes' to query. Right now it's about 2.5 TB of files and growing quickly. I need high-throughput writes and queries. I'll first write the file and its attributes to the store, then query to summarize the attributes (counts, etc.), and additionally query the attributes to pull a small set of files (by date, name, etc.).
I've explored Google Cloud Datastore as a NoSQL option, but I'm trying to compare it to AWS services.
One option would be to store the files in S3 with 'tags'. I believe you can query these with the REST API, but I'm concerned about performance. I have also seen suggestions to connect Athena, but I'm not sure whether it will pull in the tags or whether that is the correct use case.
The other option would be using something like DynamoDB or possibly a large RDS instance? Redshift says it's for petabyte scale, which we're not quite at...
Thoughts on the best AWS storage solution? Pricing is a consideration, but I'm more concerned with the best solution moving forward.
You don't want to store the files themselves in a database like RDS or Redshift. You should definitely store the files in S3, but you should probably store or copy the metadata somewhere that is more indexable and searchable.
I would suggest setting up a new-object trigger in S3 that invokes a Lambda function whenever a new file is uploaded to S3. The Lambda function could take the file location, size, any tags, etc. and insert that metadata into Redshift, DynamoDB, Elasticsearch, or an RDS database like Aurora, where you could then perform queries against that metadata. Unless you are talking about many millions of files, the metadata will be fairly small and you probably won't need the scale of Redshift. The exact database you pick to store the metadata will depend on your use case, such as the specific queries you want to perform.
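A minimal sketch of that trigger, assuming DynamoDB is the metadata store you pick; the table name "file_metadata", its "s3_key" partition key, and the attribute names are placeholders for whatever you settle on.

```python
# Sketch: Lambda invoked by S3 ObjectCreated events that records each file's
# location, size, and tags in a DynamoDB table for later querying.
import urllib.parse
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("file_metadata")  # placeholder table

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"]["size"]

        # Pull the object's tags so they become queryable attributes.
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]

        table.put_item(Item={
            "s3_key": f"s3://{bucket}/{key}",
            "size_bytes": size,
            "uploaded_at": record["eventTime"],
            "tags": {t["Key"]: t["Value"] for t in tags},
        })
```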
My particular scenario: Expecting to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query outside of Redshift or ML.
Question: How do I decide if I should store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported with Redshift, but I did notice that Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query using Redshift, you have to load the data into Redshift. Redshift is again an independent, full-fledged database (a data warehousing solution).
You can use Athena to query data directly from S3.
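For illustration, here is a hedged sketch of both paths using boto3, assuming the JSON entries already land in S3; the cluster, bucket, IAM role, table, and database names are all placeholders, and the Athena table is assumed to already exist in the Glue catalog.

```python
# Sketch: load S3 data into Redshift with COPY, or query it in place with Athena.
import boto3

# Path 1: load the S3 data into Redshift with COPY, then analyze it there.
redshift_data = boto3.client("redshift-data")
redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder cluster
    Database="analytics",
    DbUser="admin",
    Sql="""
        COPY price_history
        FROM 's3://my-price-bucket/entries/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """,
)

# Path 2: skip the load and query S3 in place with Athena
# (Redshift Spectrum works similarly against an external schema).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT item_id, AVG(price) FROM price_history_s3 GROUP BY item_id",
    QueryExecutionContext={"Database": "pricing_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

The trade-off is roughly: COPY into Redshift when you will analyze the same data repeatedly and want warehouse performance; leave it in S3 and use Athena or Spectrum when you want cheap storage and occasional or exploratory queries.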