MSCK REPAIR TABLE working strangely on delta tables - amazon-athena

I have a Delta table in S3 and, for the same table, I have defined an external table in Athena. After creating the Athena table and generating manifests, I am loading the partitions using MSCK REPAIR TABLE. All the partition columns are in snake_case. But I am still getting
Partitions not in metastore.
Any idea what I am missing here?

The IAM user or role doesn't have a policy that allows the glue:BatchCreatePartition action. Allow glue:BatchCreatePartition in the IAM policy and it should work.
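A minimal policy statement granting that action might look like the sketch below. The Resource is left wide open here only for illustration; in practice you would scope it to your own catalog, database, and table ARNs, and you may need further glue:Get* actions depending on your setup:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:BatchCreatePartition",
        "glue:GetTable",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  ]
}
```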

Resolved the issue. I was putting the partition columns in the wrong order while creating the table.
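For reference, a sketch of what "correct order" means here: the columns in PARTITIONED BY must be declared in the same order in which they appear in the S3 key prefixes. The table name, column names, and bucket path below are hypothetical; the SerDe/input-format pair is the one the Delta documentation describes for manifest-based Athena tables:

```sql
-- Manifests live under:
-- s3://my-bucket/delta/_symlink_format_manifest/event_date=2021-01-01/region=eu/...
CREATE EXTERNAL TABLE my_delta_table (
  id string,
  payload string
)
-- event_date first, then region, matching the order in the S3 path
PARTITIONED BY (event_date string, region string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/delta/_symlink_format_manifest/';

MSCK REPAIR TABLE my_delta_table;
```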

Related

AWS athena table empty result

I have an AWS Athena table that has records which are returned by a standard "SELECT * FROM ..." query. I run SHOW CREATE TABLE on it and use that DDL to create a new table, test2.
The same SELECT query on test2 always returns empty rows. Why is this happening?
Tables in Athena store their data in an external source, which in AWS is S3. When you look at the DDL from SHOW CREATE TABLE, there is a LOCATION clause that points to an S3 bucket. If the LOCATION is different, that is probably the reason you see no rows when you execute a SELECT on this table.
CREATE EXTERNAL TABLE `test_table`(
...
)
ROW FORMAT ...
STORED AS INPUTFORMAT ...
OUTPUTFORMAT ...
LOCATION 's3://bucketname/folder/'
If the location is correct, it could be that you have to run the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive-compatible partitions. From the docs:
Use the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive compatible partitions.
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created.
Make sure to check the Troubleshooting section as well. One thing that I was missing once was the glue:BatchCreatePartition policy on my IAM role.
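Putting the two checks together, using the table name from the DDL above:

```sql
-- 1. Confirm the table points at the S3 location you expect
SHOW CREATE TABLE test_table;

-- 2. If the location is right but partitions were added on S3 after the
--    table was created, pick them up in the catalog
MSCK REPAIR TABLE test_table;
```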

Amazon Redshift Spectrum with partitioned by does not return results

Given an S3 bucket partitioned by date in this way:
year
|___month
    |___day
        |___file_*.parquet
I am trying to create a table in Amazon Redshift Spectrum with this command:
create external table spectrum.visits(
ip varchar(100),
user_agent varchar(2000),
url varchar(10000),
referer varchar(10000),
session_id char(32),
store_id int,
category_id int,
page_id int,
product_id int,
customer_id int,
hour int
)
partitioned by (year char(4), month varchar(2), day varchar(2))
stored as parquet
location 's3://visits/visits-parquet/';
Although no error message is thrown, the queries never return results. The bucket is not empty. Does someone know what I am doing wrong?
When an External Table is created in Amazon Redshift Spectrum, it does not scan for existing partitions. Therefore, Redshift is not aware that they exist.
You will need to execute an ALTER TABLE ... ADD PARTITION command for each existing partition.
(Amazon Athena has a MSCK REPAIR TABLE option, but Redshift Spectrum does not.)
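For the table above, that means one statement per date. The date values here are hypothetical, and the LOCATION assumes the year/month/day folder layout shown in the question:

```sql
ALTER TABLE spectrum.visits
ADD PARTITION (year='2020', month='01', day='15')
LOCATION 's3://visits/visits-parquet/2020/01/15/';
```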
As I can't comment on other people's solutions, I need to add another one.
I would like to point out that if your Spectrum table comes from the AWS Glue Data Catalog, you don't need to manually add partitions to tables; you can have a crawler update the partitions in the data catalog and the changes will be reflected in Spectrum.
One can create an external table in Athena and run MSCK REPAIR TABLE on it. Make sure you add a "/" at the end of the location. Then create an external schema in Redshift. This solved my problem of results showing up blank. Alternatively, you can run a Glue crawler on the Athena database, which will generate the partitions automatically.
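The Redshift side of that approach is a one-time CREATE EXTERNAL SCHEMA pointing at the Glue database. The schema name, database name, and IAM role ARN below are placeholders:

```sql
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'mydb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```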

AWS Athena - duplicate columns due to partitionning

We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly.
The thing is that we have a column named 'foo' that comes from the Avro schema, and we also have something like 'foo=XXXX' in the S3 bucket path, to have Hive partitions.
What we did not know is that the crawler would then create a table which has two columns with the same name, hence our issue when querying the table:
HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns
Is there a way to tell Glue to map the partition 'foo' to another column name like 'bar'?
That way we would avoid having to reprocess our data by specifying a new partition name in the S3 bucket path.
Or any other suggestions?
Glue Crawlers are pretty terrible; this is just one of the many ways in which they create unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, and then write a script that lists your files on S3 and makes the Glue API calls (BatchCreatePartition), or execute ALTER TABLE … ADD PARTITION … statements in Athena.
Whenever new data is added on S3, just add the new partitions with the API call or an Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notifications to run Lambda functions that make the Glue API calls instead. Almost all solutions are better than Glue Crawlers.
The beauty of Athena and Glue Catalog is that it's all just metadata, it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3, you just need a different table and a different mechanism to add partitions to it.
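A sketch of that approach, with hypothetical table/column names and bucket path: keep the foo column that exists inside the Avro records, but expose the path component under a different partition name (bar), so the two no longer collide:

```sql
CREATE EXTERNAL TABLE mytable_v2 (
  foo string,      -- the column that exists inside the Avro records
  payload string   -- placeholder for the remaining Avro columns
)
-- renamed partition key; its values still come from the foo=... prefixes
PARTITIONED BY (bar string)
STORED AS AVRO
LOCATION 's3://my-bucket/data/';

-- partitions are added explicitly, pointing each bar value at its foo=... prefix
ALTER TABLE mytable_v2 ADD IF NOT EXISTS
  PARTITION (bar = 'XXXX') LOCATION 's3://my-bucket/data/foo=XXXX/';
```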
You can fix this by updating the schema of the glue table and rename the duplicate column:
Open the AWS Glue console.
Choose the table name from the list, and then choose Edit schema.
Choose the column name foo (not the partitioned column foo), enter a new name, and then choose Save.
Reference:
Resolve HIVE_INVALID_METADATA error

While Running AWS Athena query Query says Zero Records Returned

SELECT * FROM "sampledb"."parquetcheck" limit 10;
I am trying to use a Parquet file in S3; I created a table in AWS Athena and it was created perfectly.
However, when I run the SELECT query, it says "Zero Records Returned",
although my Parquet file in S3 has data.
I have created partitions too. IAM has full access to Athena.
If your specified column names are correct, then you may need to load the partitions using MSCK REPAIR TABLE EnterYourTableName;. This will add the new partitions to the Glue Catalog.
If any of the above fails, you can create a temporary Glue crawler to crawl your table, and then validate the metadata in Athena by clicking the three dots next to the table name and selecting Generate Create Table DDL. You can then compare any differences in the DDL.

Athena MSCK repair table returns 'tables not in metastore'

While running the MSCK REPAIR tablename command, the Athena query editor returns the error tables not in metastore.
But the table exists and I can query it.
I have data kept in S3 in the form of Parquet files, partitioned with
hash as the partition key (partitions look like hash=0, hash=100, and so on), and I am running a Glue crawler to create the table in Athena.
I know partitions not in metastore is a common issue and there are solutions to fix it. But I am not able to find a solution for tables not in metastore.
Has anyone solved a similar issue, or has an idea of what could be wrong?
Does the IAM role being used to execute the query have permission to read that S3 bucket? I had this error when running a query from Lambda using a role which did not have ListBucket permission on the bucket in question.
I solved this by selecting the correct database from the dropdown menu on the left of the query editor. I had run the previous setup query on sampledb, and then I was trying to run a new query, but the new tab had changed the db to default. Changing default to sampledb fixed the issue!