When issuing the MSCK REPAIR TABLE statement, is the table still accessible for querying during the update?
I ask because I'm trying to figure out the best update schedule for a relatively large S3 Hive table that is used to drive some reports in QuickSight. Will issuing this command break queries for anyone who happens to be running a QuickSight report based on this table at the same time?
Yes, the table will be available for running queries while MSCK REPAIR TABLE is executing; it runs as a background process. Queries run while the command is in progress may see different sets of partitions, though, because the partitions the command discovers are added to the metastore as they are found.
Be aware that MSCK REPAIR TABLE is a very inefficient process: with many partitions it will run for a very long time, and it is not incremental. This doesn't affect query performance, but if the command takes a long time now, it will only ever take longer, and it might not be a viable long-term strategy. There are other questions here on Stack Overflow that you can read to find other strategies for keeping your tables up to date, such as adding partitions explicitly as new data arrives.
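For comparison, one common alternative is to add each new partition explicitly instead of rescanning everything. A minimal sketch, where the table name, partition columns, and S3 path are all hypothetical placeholders:
ALTER TABLE reports ADD IF NOT EXISTS
PARTITION (year = '2020', month = '01', day = '15')
LOCATION 's3://my-example-bucket/reports/year=2020/month=01/day=15/';
Because this statement only touches the one partition named, it stays fast no matter how many partitions the table already has.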
Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is to use Athena and query the information_schema database. The article below has come up frequently in my research and is written by Amazon's team:
Querying AWS Glue Data Catalog
However, under the section titled Considerations and limitations the following is written:
Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.
Unfortunately, the article does not seem to give any indication of what constitutes a "large amount of metadata" or exactly which errors can occur when the metadata is large and one needs to query it.
My question is: how do I deal with the ever-growing size of the Data Catalog's metadata so that I never encounter errors when using Athena to query it?
Is there a best practice for this? Or is there a better way to get the same metadata that querying the catalog through Athena provides, without making a great many API calls (using boto3, Hive DDL, etc.)?
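For context, the kind of information_schema query being discussed looks roughly like the sketch below; the filter is only there to skip the built-in schema and is not a recommendation:
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_schema <> 'information_schema';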
I talked to AWS Support and did some research on this. Here's what I gathered:
The information_schema is built at query execution time; there doesn't seem to be any caching.
If you access information_schema.tables, Athena makes a separate call to the Hive metastore (the Glue Data Catalog) for each schema you have.
If you access information_schema.columns, Athena makes a separate call to the Hive metastore for each schema and for each table in that schema.
These queries are affected by the general service quotas. In this case, DML queries like your select must finish within 30 minutes.
If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this may result in slow performance. As a rough guesstimate, support told me that you should be fine as long as you have fewer than roughly 10,000 tables, which should be the case for most people.
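A practical consequence of the per-schema and per-table calls described above is that narrowing the query to a single schema and table keeps the number of metastore calls small. A sketch, where logs_db and access_logs are hypothetical names:
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'logs_db'
  AND table_name = 'access_logs';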
I have a large set of historical log files on AWS S3 that add up to billions of lines.
I used a Glue crawler with a grok deserializer to generate an external table in Athena, but querying it has proven to be infeasible.
My queries have timed out and I am trying to find another way of handling this data.
From what I understand, Athena's external tables are not actual database tables but rather representations of the data in the files, and queries run over the files themselves, not over database tables.
How can I turn this large dataset into a query friendly structure?
Edit 1: For clarification, I am not interested in reshaping the log files from here on; those are taken care of. Rather, I want a way to work with the current file base I have on S3. I need to query these old logs, and in their current state that is impossible.
I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries.
Right now, by the crawler's defaults, the external table is only partitioned by day and instance. My grok pattern explodes the formatted logs into a few more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.
Your WHERE condition should include at least one partition column. You may be able to increase the Athena timeout by sending a support ticket, or alternatively you may use Redshift Spectrum.
But you should seriously think about optimizing the query. The Athena query timeout is 30 minutes, which means your query ran for 30 minutes before it timed out.
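To illustrate partition pruning on the table described in the question (partitioned by day and instance), a query would constrain those columns directly. The table name, partition column names, and values here are hypothetical placeholders:
SELECT *
FROM raw_logs
WHERE day = '2020-01-15'
  AND instance = 'i-0abc1234example'
LIMIT 100;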
By default, Athena times out after 30 minutes. This timeout period can be increased by raising a support ticket with the AWS team. However, you should first optimize your data and your query, as 30 minutes is enough time for executing most queries.
Here are a few tips for optimizing the data that will give a major boost to Athena performance:
Use columnar formats like ORC/Parquet with compression to store your data (see the CTAS sketch after this list).
Partition your data. In your case you can partition your logs based on year -> month -> day.
Create fewer, larger files per partition instead of many small files.
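One way to get both the columnar format and the partitioning in a single step is an Athena CREATE TABLE AS SELECT (CTAS). The sketch below is illustrative only: the source and target table names, the S3 location, and the column list are hypothetical, and the partition columns must come last in the SELECT list:
CREATE TABLE logs_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-example-bucket/logs_parquet/',
  partitioned_by = ARRAY['year', 'month', 'day']
)
AS
SELECT message, status_code, year, month, day
FROM logs_raw;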
The following AWS article gives detailed information on performance tuning in Amazon Athena:
Top 10 performance tuning tips for Amazon Athena
I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table.
To do that, it would seem the command only needs to do an ls on the root folder of the table (given the table is partitioned by only one column) to get all its partitions, clearly a sub-second operation.
But in practice, the operation can take a very long time to execute (or even time out if run on AWS Athena).
So my question is, what does MSCK REPAIR TABLE actually do behind the scenes and why?
How does MSCK REPAIR TABLE find the partitions?
Additional data in case it's relevant:
Our data is all on S3. The command is slow both when running on EMR (Hive) and on Athena (Presto). There are ~450 partitions in the table, each partition has ~90 files on average and around 3 GB of data in total, and the files are in Apache Parquet format.
You are right in the sense that it reads the directory structure, creates partitions out of it, and then updates the Hive metastore. In fact, more recently the command was improved to remove non-existing partitions from the metastore as well.
The example you are giving is very simple since it has only one level of partition keys. Consider a table with multiple partition keys (2-3 partition keys is common in practice). MSCK REPAIR will have to do a full tree traversal of all the sub-directories under the table directory, parse the file names, make sure the file names are valid, check whether each partition already exists in the metastore, and then add only the partitions which are not present in the metastore.
Note that each listing on the filesystem is an RPC to the namenode (in the case of HDFS) or a web-service call in the case of S3 or ADLS, which can add a significant amount of time. Additionally, in order to figure out whether a partition is already present in the metastore, it needs to do a full listing of all the partitions the metastore knows of for the table. Both these steps can potentially increase the time taken for the command on large tables.
The performance of MSCK REPAIR TABLE was improved considerably in Hive 2.3.0 (see HIVE-15879 for more details). You may want to tune hive.metastore.fshandler.threads and hive.metastore.batch.retrieve.max to improve the performance of the command.
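If you are on a Hive version that includes the HIVE-15879 improvements, the tuning mentioned above might look roughly like the sketch below. The values are illustrative only, not recommendations, and depending on your setup these properties may need to be set in hive-site.xml rather than per session:
-- Illustrative values only; the table name is a hypothetical placeholder.
SET hive.metastore.fshandler.threads=20;    -- threads used for listing the filesystem
SET hive.metastore.batch.retrieve.max=300;  -- partitions fetched from the metastore per batch
MSCK REPAIR TABLE my_example_table;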
I'm using AWS Redshift as a back end for Tableau Desktop. The AWS cluster is running two dc1.large nodes, and the database table I'm analyzing is 30 GB (with Redshift compression enabled). I chose Redshift over a Tableau extract for performance reasons, but the Redshift live connection seems much slower than an extract. Any suggestions on where I should look?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less IO and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to its user in under 5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is super fast for a database query), then you are still only just barely reaching your target performance. Now, if just one of those queries takes 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
A few pointers, as below:
1) Run VACUUM & ANALYZE once your ETL completes (see the sketch after these pointers).
2) Have you created the table with a proper distribution key and sort key?
3) Aggregate the data, if that is acceptable from the point of view of data granularity, requirements, etc.
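A minimal sketch of the first pointer, run after the ETL load finishes; the table name is a hypothetical placeholder:
VACUUM FULL my_example_table;  -- reclaim deleted space and re-sort rows
ANALYZE my_example_table;      -- refresh the planner's statistics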
1. Remove cursors: Tableau accesses data from the Redshift leader node using a cursor, and cursors work iteratively, which hurts performance.
2. Perform a manual ANALYZE on the table after running heavy load operations. https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3. Check the distribution key to avoid data skew and improve performance.
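To check for skew as suggested in point 3, Redshift's SVV_TABLE_INFO system view exposes skew metrics. A sketch, where the table name in the filter is a hypothetical placeholder:
SELECT "table", diststyle, skew_rows, skew_sortkey1
FROM svv_table_info
WHERE "table" = 'my_example_table';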
In Redshift, there's an STL_QUERY table that stores queries that were run over the last 5 days. I'm trying to find a way to keep more than 5 days worth of records. Here are some things that I've considered:
Is there a Redshift setting for this? It would appear not.
Could I use a trigger? Triggers are not available in Redshift, so this is a no-go.
Could I create an Amazon Data Pipeline job to periodically "scrape" the STL_QUERY table? I could, so this is an option. Unfortunately, I would have to give the pipeline some EC2 instance to use to run this work. It seems like a waste to have an instance sitting around to scrape this table once a day.
Could I use an Amazon Simple Work Flow job to scrape the table? I could, but it suffers from the same issues as 3.
Are there any other options/ideas that I'm missing? I would prefer some other option that does not involve me dedicating an EC2 instance, even if it means paying for an additional service (provided that it's cheaper than the EC2 instance I would have used in its stead).
Keep it simple, do it all in Redshift.
First, use "CREATE TABLE … AS" to save all current history into a permanent table.
CREATE TABLE admin.query_history AS SELECT * FROM stl_query;
Second, schedule a job on a machine you control to run the following every day, using psql to execute it.
INSERT INTO admin.query_history SELECT * FROM stl_query WHERE query > (SELECT MAX(query) FROM admin.query_history);
Done. :)
Notes:
You need an 8.x version of psql if you haven't set this up yet.
Even if your job doesn't run for a few days stl_query keeps enough history that you'll be covered.
As per your comment, it might be safer to use starttime instead of query as the criterion.
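A sketch of what that variant could look like, using the same admin.query_history table created above; verify the boundary behavior against your own data before relying on it:
INSERT INTO admin.query_history
SELECT * FROM stl_query
WHERE starttime > (SELECT MAX(starttime) FROM admin.query_history);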