Smart sampling with AWS Glue Crawlers

I have a couple of tables in my S3 bucket. The tables are big both in size and in the number of files; they are stored as JSON (suboptimal, I know) and have a lot of partitions.
Now I want to enable the AWS Glue Data Catalog and AWS Glue Crawlers, but I am terrified by the cost of the crawlers going through all of the data.
The schema doesn't change often so it is not necessary to go through all of the files on S3.
Will the Crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that would look inside just some of the files instead of all of them?

Depending on your bucket structure, you might be able to use exclude patterns and point the crawlers only at the specific prefixes you want crawled. If the partitioning is Hive-style, you can use Athena to run MSCK REPAIR TABLE to add the partitions. Alternatively, you can create the tables manually in Athena and then run MSCK REPAIR TABLE, although that is bound to take a very long time if you have too many partitions and the files are as huge as you mention.
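As a rough sketch of that suggestion: a crawler pointed at one prefix with exclude patterns, plus MSCK REPAIR TABLE submitted through the Athena API. The crawler, IAM role, database, bucket, and table names below are placeholders, not a drop-in configuration.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Hypothetical names throughout: crawler, IAM role, database, bucket, table.
glue.create_crawler(
    Name="events-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="events_db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/events/",
                # Skip prefixes/files the crawler does not need to look at.
                "Exclusions": ["**/_tmp/**", "**.metadata"],
            }
        ]
    },
)

# For Hive-style partitions (e.g. .../dt=2023-01-01/), load them via Athena
# instead of re-crawling everything.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE events",
    QueryExecutionContext={"Database": "events_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```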

Related

Joining many large files on AWS

I am looking for advice on which service I should use. I am new to big data and confused by the differences between the AWS options.
Use case:
I receive 60-100 CSV files daily (each one can be from a few MB to a few GB). There are six corresponding schemas, and each file belongs to exactly one table.
I need to load those files into the six database tables, execute joins between them, and generate a daily output. After the output is generated, the data in the database is no longer needed, so we can truncate the tables and wait for the next day.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to Redshift)
AWS EMR (spark)
or maybe something else? I heard that Spark could be used to do the joins, but what is the proper, optimal, and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue: set up six crawlers that load the files into specific AWS Glue Data Catalog tables on a trigger, then execute the SQL joins with Athena.
Use AWS Glue: set up six crawlers that load the files into specific AWS Glue Data Catalog tables on a trigger, then trigger a Spark job (AWS Glue in its serverless form) to do the SQL joins and write the output to S3 (see the sketch below).
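A rough sketch of the second option as a Glue PySpark job, assuming the crawlers have already catalogued the tables; the database, table, column, and bucket names are made up for illustration:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical catalog database/table names created by the crawlers.
a = glue_context.create_dynamic_frame.from_catalog(database="daily_db", table_name="a").toDF()
b = glue_context.create_dynamic_frame.from_catalog(database="daily_db", table_name="b").toDF()

a.createOrReplaceTempView("a")
b.createOrReplaceTempView("b")

# The join key "id" and the selected columns are assumptions.
joined = spark.sql("SELECT a.id, a.value, b.other_value FROM a JOIN b ON a.id = b.id")

joined.write.mode("overwrite").parquet("s3://my-output-bucket/daily_output/")
```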
Edit 2:
But according to this article: https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (i.e. Presto) is not as scalable as I need. I have already seen "Query exhausted resources at this scale factor" for my case.
Is there any possibility of changing the file type to a columnar format like Parquet? Then you can use AWS EMR, and Spark should be able to handle the joins easily. Obviously, you still need to optimize the query depending on the data size, cluster size, etc.
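A minimal PySpark sketch of that approach for an EMR cluster, assuming the CSVs have header rows and share an "id" join key; the bucket names, prefixes, and columns are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-joins").getOrCreate()

# Hypothetical input locations; the A_<timestamp>.csv / B_<timestamp>.csv naming
# pattern from the question makes prefix globs straightforward.
a = spark.read.option("header", "true").csv("s3://my-input-bucket/incoming/A_*.csv")
b = spark.read.option("header", "true").csv("s3://my-input-bucket/incoming/B_*.csv")

# Persist the raw CSV as Parquet once, so subsequent reads are columnar and cheaper.
a.write.mode("overwrite").parquet("s3://my-staging-bucket/parquet/A/")
b.write.mode("overwrite").parquet("s3://my-staging-bucket/parquet/B/")

a_pq = spark.read.parquet("s3://my-staging-bucket/parquet/A/")
b_pq = spark.read.parquet("s3://my-staging-bucket/parquet/B/")

# The join key "id" is an assumption; use the real keys for all six tables.
joined = a_pq.join(b_pq, on="id", how="inner")
joined.write.mode("overwrite").parquet("s3://my-output-bucket/daily_output/")
```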

Comparison of Amazon S3, Amazon Athena, and Amazon Athena with partitioning

I want to know the performance improvement when using Amazon Athena without partitioning versus with partitioning. I know for sure that Athena with partitioning is much better than Athena without it. But does Athena without partitioning give any improvement over plain Amazon S3?
Partitioning separates data files into separate directories. If a column used for partitioning appears in a query's WHERE clause, Athena can skip over the directories that do not contain relevant data. This is highly effective at improving query performance (and lowering cost) because it reduces the need for disk access and memory.
There are several ways to improve the performance of Amazon Athena:
Store data in a columnar format, such as Parquet. This allows Athena to go directly to specific columns without having to read all the columns of a wide table. (This is similar to Amazon Redshift.)
Compress data (e.g. using Snappy compression) to reduce the amount of data that needs to be read from disk. This also reduces the cost of queries, since they are charged based on the amount of data scanned. (Instant savings!)
Partition data to skip over input files entirely when the partition key is used in a query's WHERE clause.
For some examples of these benefits, see: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
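A rough sketch combining these tips: a partitioned Parquet table plus a partition-pruned query, submitted through the Athena API. The database, table, column, and bucket names are invented for illustration.

```python
import boto3

athena = boto3.client("athena")

def run(sql):
    # Helper to submit a statement to Athena; database and output location are placeholders.
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )

# Columnar (Parquet) and partitioned by date; Parquet written by Spark or Glue is
# Snappy-compressed by default, which covers the compression tip as well.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
    order_id string,
    amount   double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales_parquet/'
""")

# Load the Hive-style partitions, then put the partition key in the WHERE clause
# so Athena only scans the matching dt= prefixes.
run("MSCK REPAIR TABLE sales")
run("SELECT SUM(amount) FROM sales WHERE dt = '2023-01-01'")
```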

What are some good strategies/applications for viewing/analyzing metrics I have stored in an S3 bucket?

I have an S3 bucket full of plaintext metrics and want a way in which to analyze and view this data. One option I am considering is Amazon Athena, but I would like to consider the pros and cons of a few approaches.
Amazon Athena is really good for ad hoc analysis. If your files are in a format supported by Athena and you only want to run a few ad hoc analyses, you can get started with Athena quickly.
If you want to make your ad hoc analysis faster, create an external table over your existing files, then consider running an Athena CTAS query to transform your data to Avro/Parquet and partition/bucket it as necessary.
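A rough CTAS sketch of that transformation, run via the Athena API; the table, column, partition, and bucket names are placeholders, and the WITH properties shown are assumptions about the current Athena engine:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical source/target tables and columns; in a CTAS statement the
# partition column must be selected last.
ctas = """
CREATE TABLE metrics_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-bucket/metrics_parquet/',
    partitioned_by = ARRAY['dt']
) AS
SELECT metric_name, metric_value, dt
FROM metrics_raw
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "metrics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```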
If cost is not an issue, you can also look into Redshift. See if it's possible to run the Redshift COPY command on your files to import them all into a Redshift database. Use appropriate sort keys and distribution keys to improve your query performance in Redshift.
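If you go that route, a minimal sketch of issuing the COPY through the Redshift Data API might look like the following; the cluster, database, table, IAM role, and the assumption that the metrics are CSV are all placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="my-cluster",   # hypothetical cluster name
    Database="analytics",
    DbUser="loader",
    Sql="""
        COPY metrics
        FROM 's3://my-bucket/metrics/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """,
)
```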

Athena performance on too many S3 files

I am planning to store data in S3, and SQL queries would later be executed on top of it. The S3 files would basically contain JSON records. I would be getting these records through DynamoDB Streams triggering AWS Lambda executions, so it's difficult to handle duplication at that layer, as AWS Lambda guarantees at-least-once delivery.
To avoid handling duplicate records in queries, I would like to ensure that the records being inserted are unique.
As far as I know, the only way to achieve uniqueness is to have a unique S3 key. If I were to opt for this approach, I would end up creating a couple of million S3 files per day, each consisting of a single JSON record.
Would creating so many files be a concern when executing Athena queries?
Are there any alternative approaches?
I think you would be better off handling the deduplication in Athena itself. For Athena, weeding out a few duplicates is an easy job. Set up a view that groups by the unique property and uses ARBITRARY or MAX_BY (if you have something to order by, to pick the latest) for the non-unique properties, and run your queries against this view so you don't have to worry about deduplication in each individual query.
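A sketch of such a view, submitted through the Athena API; the table and column names are invented ("event_id" as the unique property, "updated_at" as the ordering column):

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table/column names: event_id is the unique property,
# updated_at orders versions so MAX_BY can pick the latest one.
create_view = """
CREATE OR REPLACE VIEW events_deduped AS
SELECT
    event_id,
    ARBITRARY(payload)          AS payload,
    MAX_BY(status, updated_at)  AS status
FROM events_raw
GROUP BY event_id
"""

athena.start_query_execution(
    QueryString=create_view,
    QueryExecutionContext={"Database": "events_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```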
You could also run a daily or weekly deduplication job using CTAS, depending on how fresh the data has to be (you can also do complex hybrids with pre-deduplicated historical data union'ed with on-the-fly-deduplicated data).
When running a query, Athena lists the objects on S3. This is not a parallelizable operation (except for partitioned tables, where it can be parallelized down to the grain of the partitioning), and S3 listings are limited to a page size of 1,000. You really don't want to run Athena queries against tables (or partitions) with more than 1,000 files.
Write to S3 via a Kinesis Data Firehose delivery stream and then query that via Athena. Firehose will group your records into a relatively small number of files, so it will be efficient to query them via Athena. Indeed, it will even organize them into a folder structure that is nicely partitioned by write timestamp.
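A minimal sketch of that idea from the DynamoDB-stream Lambda, with a made-up delivery stream name; a real handler would also unmarshal the DynamoDB-typed attributes before writing them out:

```python
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # Hypothetical: event is a DynamoDB Streams batch delivered to this Lambda.
    records = []
    for r in event.get("Records", []):
        new_image = r.get("dynamodb", {}).get("NewImage", {})
        # Newline-delimited JSON so Athena can read the concatenated Firehose objects.
        # (A real handler would convert the DynamoDB-typed attributes to plain JSON.)
        records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # Firehose buffers these and writes a small number of timestamp-prefixed
        # objects to S3, instead of one object per record.
        firehose.put_record_batch(
            DeliveryStreamName="events-to-s3",   # hypothetical stream name
            Records=records,
        )
```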

AWS S3 w/ tags, DynamoDB, Redshift?

I'm comparing cloud storage options for a large set of files with certain 'attributes' to query. Right now it's about 2.5 TB of files and growing quickly. I need high-throughput writes and queries. I'll first write the file and its attributes to the store, then query to summarize the attributes (counts, etc.), and additionally query the attributes to pull a small set of files (by date, name, etc.).
I've explored Google Cloud Datastore as a noSQL option, but trying to compare it to AWS services.
One option would be to store the files in S3 with 'tags'. I believe you can query these with the REST API, but I'm concerned about performance. I've also seen suggestions to connect Athena, but I'm not sure whether it will pull in the tags or whether that's the right use case.
The other option would be to use something like DynamoDB, or possibly a large RDS instance? Redshift says it's for petabyte scale, which we're not quite at...
Thoughts on the best AWS storage solution? Pricing is a consideration, but I'm more concerned with the best solution moving forward.
You don't want to store the files themselves in a database like RDS or Redshift. You should definitely store the files in S3, but you should probably store or copy the metadata somewhere that is more indexable and searchable.
I would suggest setting up an S3 object-created trigger that invokes a Lambda function whenever a new file is uploaded. The Lambda function could take the file location, size, any tags, etc., and insert that metadata into Redshift, DynamoDB, Elasticsearch, or an RDS database like Aurora, where you could then run queries against it. Unless you are talking about many millions of files, the metadata will be fairly small and you probably won't need the scale of Redshift. The exact database you pick to store the metadata will depend on your use case, such as the specific queries you want to perform.
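A minimal sketch of that Lambda, assuming the metadata goes to a DynamoDB table; the table name and attribute names are invented for illustration:

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("file-metadata")   # hypothetical table name

def handler(event, context):
    # Invoked by an S3 ObjectCreated event notification.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"]["size"]
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        table.put_item(
            Item={
                "s3_key": key,          # assumed partition key
                "bucket": bucket,
                "size_bytes": size,
                "tags": {t["Key"]: t["Value"] for t in tags},
            }
        )
```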