I am planning to store data in S3, on top of which SQL queries would later be executed. The S3 files would basically contain JSON records. I would be getting these records through DynamoDB Streams triggering AWS Lambda executions, so it's difficult to handle duplication at that layer, as AWS Lambda guarantees at-least-once delivery.
To avoid handling duplicate records in queries, I would like to ensure that the records being inserted are unique.
As far as I know, the only way to achieve uniqueness is to have a unique S3 key. If I were to opt for this approach, I would end up creating a couple of million S3 files per day, each consisting of a single JSON record.
Would creating so many files be a concern when executing Athena queries?
Are there any alternative approaches?
I think you would be better off handling the deduplication in Athena itself. For Athena, weeding out a few duplicates will be an easy job. Set up a view that groups by the unique property and uses ARBITRARY or MAX_BY (if you have something to order by to pick the latest) for the non-unique properties, and run your queries against this view so that you don't have to worry about deduplication in each individual query.
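As a minimal sketch of such a view, assuming a table named events with a unique id column, an updated_at column to order by, and a payload column (all hypothetical names):

    -- Deduplicating view; the table and column names (events, id, updated_at,
    -- payload) are assumptions and need to match your actual schema.
    CREATE OR REPLACE VIEW events_deduplicated AS
    SELECT
        id,
        MAX_BY(payload, updated_at) AS payload,   -- keep the latest value per id
        MAX(updated_at)             AS updated_at
    FROM events
    GROUP BY id;

Queries then go against events_deduplicated instead of the raw table. If you have nothing to order by, ARBITRARY(payload) simply picks one of the duplicate values instead.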
You could also run a daily or weekly deduplication job using CTAS, depending on how fresh the data has to be (you can also build more complex hybrids, with pre-deduplicated historical data unioned with on-the-fly-deduplicated recent data).
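A deduplication job along those lines might look like the following CTAS statement; the table names and the S3 location are assumptions:

    -- Materialize a deduplicated copy as Snappy-compressed Parquet.
    -- events_raw, events_dedup and the bucket name are hypothetical.
    CREATE TABLE events_dedup
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-bucket/events-dedup/'
    ) AS
    SELECT
        id,
        MAX_BY(payload, updated_at) AS payload,
        MAX(updated_at)             AS updated_at
    FROM events_raw
    GROUP BY id;

Note that Athena requires the external_location of a CTAS to be empty, so a scheduled job would typically write to a dated prefix and swap the table or partition afterwards.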
When running a query, Athena lists the objects in S3, and this is not a parallelizable operation (except for partitioned tables, where it is parallelizable down to the grain of the partitioning), and S3 listings are limited to a page size of 1000. You really don't want to run Athena queries against tables (or partitions) with more than 1000 files.
Write to S3 via a Kinesis Firehose and then query that via Athena. The Firehose will group your records into a relatively small number of files, such that it will then be efficient to query them via Athena. Indeed, it will even organize them into a folder structure that is nicely partitioned by write timestamp.
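As a sketch of what a table over the Firehose output could look like, assuming the default YYYY/MM/DD/HH prefix layout, a hypothetical bucket name, and JSON records:

    -- Table over the Firehose delivery prefix; the bucket, columns, and
    -- partition layout are assumptions.
    CREATE EXTERNAL TABLE events (
        id         string,
        payload    string,
        updated_at string
    )
    PARTITIONED BY (ingest_date string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-bucket/firehose-output/';

    -- Register one day of Firehose output as a partition.
    ALTER TABLE events ADD IF NOT EXISTS
        PARTITION (ingest_date = '2020-01-15')
        LOCATION 's3://my-bucket/firehose-output/2020/01/15/';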
I did ETL on our data and ran simple aggregations on it in Athena. Our plan is to use our BI tool to access those tables through Athena. It works for now, but I'm worried that these tables are static, i.e. they only reflect the data from when I last created the Athena table. When called, are Athena tables automatically run again? If not, how do I make them update automatically when they are called by our BI tool?
My only solution so far for overwriting the tables is to run two different queries: one query to drop the table, and another to re-create it. Since they are two different queries, I'm not sure you can run them at the same time (at least in Athena, you can't run them both in one go).
Amazon Athena is a query engine, not a database.
When a query is sent to Amazon Athena, it looks at the location stored in the table's DDL. Athena then goes to the Amazon S3 location specified and scans the files for the requested data.
Therefore, every Athena query always reflects the data currently stored in the underlying Amazon S3 objects:
Want to add data to a table? Then store an additional object in that location.
Want to delete data from a table? Then delete the underlying object that contains that data.
There is no need to "drop a table, then re-create the table". The table will always reflect the current data stored in Amazon S3. In fact, the table doesn't actually exist -- rather, it is simply a definition of what the table should contain and where to find the data in S3.
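As a minimal sketch of this (the table name, columns, and S3 location are hypothetical):

    -- The table is only a definition pointing at an S3 prefix.
    CREATE EXTERNAL TABLE sales (
        order_id   string,
        amount     double,
        order_date string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/sales/';

    -- Any new object written under s3://my-bucket/sales/ is picked up by the
    -- next query automatically; no DROP/CREATE cycle is needed.
    SELECT COUNT(*) FROM sales;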
The best use-case for Athena is querying large quantities of rarely-accessed data stored in Amazon S3. If the data is often accessed and updated, then a traditional database or data warehouse (eg Amazon Redshift) would be more appropriate.
Pointing a Business Intelligence tool to Athena is quite acceptable, but you need to have proper processes in place for updating the underlying data in Amazon S3.
I would also recommend storing the data in Snappy-compressed Parquet files, which will make Athena queries faster and cheaper (because queries are charged based upon the amount of data read from disk).
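One way to do that conversion from within Athena itself is a CTAS statement like the following (the table names and location are assumptions):

    -- Convert an existing table to Snappy-compressed Parquet.
    CREATE TABLE sales_parquet
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://my-bucket/sales-parquet/'
    ) AS
    SELECT * FROM sales;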
I am looking for advice on which service I should use. I am new to big data and confused by the differences between these services on AWS.
Use case:
I receive 60-100 CSV files daily (each one can be from a few MB to a few GB). There are six corresponding schemas, and each file can be treated as part of only one table.
I need to load those files into the six database tables, execute joins between them, and generate a daily output. After the output is generated, the data in the database is no longer needed, so we can truncate the tables and wait for the next day.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to redshift)
AWS EMR (spark)
or maybe something else? I have heard that Spark could be used to do the joins, but what is the proper, optimal, and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue: set up six crawlers that load the incoming files into their specific AWS Glue Data Catalog tables when triggered, then execute the SQL joins with Athena (a sketch of such a join follows below)
Use AWS Glue: set up the same six crawlers, then trigger a Spark job (AWS Glue in its serverless form) to do the SQL joins and write the output to S3
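For the first option, the daily join in Athena could look roughly like this; the table names a, b, c and the join key record_id are placeholders for the six real tables and their keys:

    -- Daily output query; table names and the join key are hypothetical.
    SELECT
        a.record_id,
        a.value AS a_value,
        b.value AS b_value,
        c.value AS c_value
    FROM a
    JOIN b ON b.record_id = a.record_id
    JOIN c ON c.record_id = a.record_id;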
Edit 2:
But according to the: https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (i.e. Presto) is not as scalable as I need. I have seen "Query exhausted resources at this scale factor" in my case.
Is there any possibility of changing the file type to a columnar format like Parquet? Then you can use AWS EMR, and Spark should be able to handle the joins easily. Obviously, you still need to optimize the query depending on the data, cluster size, etc.
I have information in Amazon DynamoDB with frequently updated/added rows (it is updated by receiving events from a Kinesis stream and processing those events with a Lambda).
I want to provide a way for other teams to query that data through Athena.
It has to be as close to real time as possible (minimizing the period between receiving the event and an Athena query returning that new/updated information).
What is the best/most cost-optimized way to do that?
I know about some of the options:
Scan the table regularly and put the information into S3 for Athena. This is going to be quite expensive and not real time.
Start putting the raw events into S3 as well, not just DynamoDB, and create a Glue crawler that scans only the new records. That's going to be closer to real time, but I don't know how to deal with duplicate events (the information in DynamoDB is updated quite frequently, so old records get updated). I'm also not sure it is the best way.
Maybe update the Data Catalog directly from the Lambda? I'm not sure that is even possible; I'm still new to the AWS tech stack.
Any better ways to do that?
You can use Athena Federated Query for this use-case.
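Once the Athena DynamoDB connector is deployed and registered as a data source, the DynamoDB table can be queried live from Athena. A sketch, where the data source name ddb and the table/column names are assumptions:

    -- Federated query against DynamoDB; "ddb" is a hypothetical data source
    -- name, and the DynamoDB connector exposes tables under the "default" schema.
    SELECT id, status, updated_at
    FROM "ddb"."default"."my_dynamodb_table"
    WHERE status = 'ACTIVE';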
I wanted to know the performance improvement when using Amazon Athena with partitioning versus without partitioning. I know for sure that Athena with partitioning is much better than Athena without it. But does Athena without partitioning give any improvement over querying Amazon S3 directly?
Partitioning separates data files into separate directories. If the column used for partitioning is part of a query's WHERE clause, it allows Athena to skip over directories that do not contain relevant data. This is highly effective at improving query performance (and lowering cost) because it reduces the need for disk access and memory.
There are several ways to improve the performance of Amazon Athena:
Store data in a columnar format, such as Parquet. This allows Athena to go directly to specific columns without having to read all columns in a wide table. (This is similar to Amazon Redshift.)
Compress data (eg using Snappy compression) to reduce the amount of data that needs to be read from disk. This also reduces the cost of queries since they are charged based on the amount of data read from disk. (Instant savings!)
Partition data to completely skip over input files when the partition key is used in a query's WHERE clause.
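A sketch that combines these points, with hypothetical names and a Hive-style dt= partition layout:

    -- Partitioned Parquet table (Snappy compression is a property of the
    -- Parquet files themselves); all names and the layout are assumptions.
    CREATE EXTERNAL TABLE events (
        id      string,
        payload string
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/';

    -- Register the partitions that already exist under the location.
    MSCK REPAIR TABLE events;

    -- Only the files under dt=2020-01-15/ are read; all other partitions are skipped.
    SELECT COUNT(*) FROM events WHERE dt = '2020-01-15';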
For some examples of these benefits, see: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
I have a table with about 6 million records and want to start archiving records. I have thought of creating a backup version of the same table and moving the records across once they meet the criteria for being archived. However, I have been told that it is also possible to use Hive to copy this data to S3.
Could someone please explain why I would opt to copy the data into an S3 bucket rather than store it in another DynamoDB table?
DynamoDB has a time-to-live (TTL) mechanism, and you can set up a stream of record deletions that will invoke an AWS Lambda function and put the data into S3. Check this detailed guide on how to set it up. You can also try AWS Data Pipeline with an EMR cluster, which is a common way to set up one-time or periodic migrations.
If you actively use full-scan operations over your DynamoDB table, then it's better to archive and remove the records you don't use. If you query the records only by the primary key, then archiving most probably isn't worth the effort. You can check your bill, but the first 25 GB stored in DynamoDB are free.