I have an AWS Athena table whose records are returned by a standard "SELECT * FROM ..." query. I ran SHOW CREATE TABLE on it and used the output to create a new table, test2.
The same SELECT query on test2 always returns no rows. Why is this happening?
Tables in Athena store their data in an external source, which in AWS is S3. The DDL of the CREATE TABLE statement contains a LOCATION that points to the S3 bucket. If the LOCATION is different, that is probably why you see no rows when you run a SELECT on this table.
CREATE EXTERNAL TABLE `test_table`(
...
)
ROW FORMAT ...
STORED AS INPUTFORMAT ...
OUTPUTFORMAT ...
LOCATION 's3://bucketname/folder/'
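To check quickly whether two tables point at the same place, you can pull the LOCATION out of each DDL and compare them. This is only a minimal sketch: `extract_location` and `same_location` are hypothetical helper names, and it assumes the DDL quotes the S3 URI in single quotes.

```python
import re
from typing import Optional


def extract_location(ddl: str) -> Optional[str]:
    """Return the S3 URI from the LOCATION clause, or None if there is none.

    Assumes the URI is wrapped in single quotes.
    """
    match = re.search(r"LOCATION\s+'([^']+)'", ddl, flags=re.IGNORECASE)
    return match.group(1) if match else None


def same_location(ddl_a: str, ddl_b: str) -> bool:
    """True if both DDLs have a LOCATION and it is the same, ignoring a trailing slash."""
    loc_a = extract_location(ddl_a)
    loc_b = extract_location(ddl_b)
    if loc_a is None or loc_b is None:
        return False
    return loc_a.rstrip("/") == loc_b.rstrip("/")
```

Paste the output of SHOW CREATE TABLE for both tables into two strings and call `same_location(ddl_a, ddl_b)`; a `False` result means the new table is reading from a different (possibly empty) prefix.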
If the location is correct, you may need to run the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive compatible partitions. From the docs:
Use the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive compatible partitions.
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created.
Make sure to check the Troubleshooting section as well. One thing I was missing once was the glue:BatchCreatePartition permission on my IAM role.
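"Hive compatible" here means the folders under the table location are named key=value (for example year=2021/month=01/). MSCK REPAIR TABLE only picks up that layout; plain folders such as 2021/01/ have to be added with ALTER TABLE ADD PARTITION. Here is a rough sketch for checking an object key, where `is_hive_compatible` and `partition_segments` are hypothetical helper names and the prefix is whatever follows the bucket in your table's LOCATION:

```python
import re

# One folder name of the form key=value, with no extra "=" or "/".
_HIVE_SEGMENT = re.compile(r"^[^=/]+=[^=/]+$")


def partition_segments(key: str, table_prefix: str) -> list:
    """Return the folder names between the table prefix and the file name."""
    relative = key[len(table_prefix):] if key.startswith(table_prefix) else key
    parts = relative.strip("/").split("/")
    return parts[:-1]  # drop the file name itself


def is_hive_compatible(key: str, table_prefix: str) -> bool:
    """True if every partition folder in the key looks like key=value."""
    segments = partition_segments(key, table_prefix)
    return bool(segments) and all(_HIVE_SEGMENT.match(s) for s in segments)
```

A key directly under the table prefix (no partition folders at all) returns `False`, because there is nothing for MSCK REPAIR TABLE to discover.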
I created a table in a Glue database using a crawler job. The table was created successfully.
However, when I try to select data from that table in the Athena query editor, I get the error below:
Query:
select * from DB1.data_tbl;
Output:
Hive File Not Found: Partition location does not exist
I haven't found where the partition location is defined.
Please assist.
By default, Athena can read only data in S3. It will not read your PostgreSQL databases. To connect to anything other than S3, you have to set up and use Amazon Athena Federated Query.
Alternatively, set up a Glue job to copy all the data from your PostgreSQL database into S3, and then use Athena to query the data from S3.
Given an S3 bucket partitioned by date in this way:
year
|___month
    |___day
        |___file_*.parquet
I am trying to create a table in Amazon Redshift Spectrum with this command:
create external table spectrum.visits(
ip varchar(100),
user_agent varchar(2000),
url varchar(10000),
referer varchar(10000),
session_id char(32),
store_id int,
category_id int,
page_id int,
product_id int,
customer_id int,
hour int
)
partitioned by (year char(4), month varchar(2), day varchar(2))
stored as parquet
location 's3://visits/visits-parquet/';
Although no error message is thrown, the queries always come back empty, i.e., they return no results. The bucket is not empty. Does anyone know what I am doing wrong?
When an External Table is created in Amazon Redshift Spectrum, it does not scan for existing partitions. Therefore, Redshift is not aware that they exist.
You will need to execute an ALTER TABLE ... ADD PARTITION command for each existing partition.
(Amazon Athena has a MSCK REPAIR TABLE option, but Redshift Spectrum does not.)
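Writing one ALTER TABLE per day by hand gets tedious, so you can generate the statements. The sketch below is not an official tool: `add_partition_statements` is a hypothetical helper, and it assumes the folders are named with the bare, zero-padded values (for example 2021/01/30/) directly under the table location, as in the question's layout.

```python
from datetime import date, timedelta


def add_partition_statements(table: str, base_location: str,
                             start: date, end: date) -> list:
    """Build one ALTER TABLE ... ADD PARTITION statement per day, inclusive.

    Assumes the S3 layout <base_location>/<yyyy>/<mm>/<dd>/.
    """
    base = base_location.rstrip("/")
    statements = []
    current = start
    while current <= end:
        y = f"{current.year:04d}"
        m = f"{current.month:02d}"
        d = f"{current.day:02d}"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year='{y}', month='{m}', day='{d}') "
            f"LOCATION '{base}/{y}/{m}/{d}/';"
        )
        current += timedelta(days=1)
    return statements
```

For example, `add_partition_statements("spectrum.visits", "s3://visits/visits-parquet/", date(2021, 1, 30), date(2021, 2, 1))` returns three statements that you can then run against Redshift. If your month or day folders are not zero-padded, adjust the formatting to match the actual folder names.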
As I can't comment on other people's solutions, I have to add another answer.
I would like to point out that if your Spectrum table comes from the AWS Glue Data Catalog, you don't need to add partitions to the table manually. You can have a crawler update the partitions in the Data Catalog, and the changes will be reflected in Spectrum.
You can create the external table in Athena and run MSCK REPAIR TABLE on it. Make sure you add "/" at the end of the location. Then create the external schema in Redshift. This solved my problem of the results coming back blank. Alternatively, you can run a Glue crawler on the Athena database, which will generate the partitions automatically.
I am new to AWS and trying to figure out how to populate a table within an external schema residing in Amazon Redshift. I used AWS Glue to create a table from a .csv file that sits in an S3 bucket. I can query the newly created table via Amazon Athena.
My task is to take the data and populate a table living in a Redshift external schema. I tried creating a job within Glue, but had no luck.
This is where I am stuck. Am I supposed to first create an empty destination table that mirrors the table that I can query using Athena?
Thank you in advance to anyone who might be able to assist!
Redshift Spectrum and Athena both use the Glue Data Catalog for external tables. When you create a new Redshift external schema that points at your existing Glue catalog, the tables it contains will immediately exist in Redshift.
-- Create the Redshift Spectrum schema
CREATE EXTERNAL SCHEMA IF NOT EXISTS my_redshift_schema
FROM DATA CATALOG DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam:::role/MyIAMRole'
;
-- Review the schema info
SELECT *
FROM svv_external_schemas
WHERE schemaname = 'my_redshift_schema'
;
-- Review the tables in the schema
SELECT *
FROM svv_external_tables
WHERE schemaname = 'my_redshift_schema'
;
-- Confirm that the table returns data
SELECT *
FROM my_redshift_schema.my_external_table LIMIT 10
;
SELECT * FROM "sampledb"."parquetcheck" limit 10;
I am trying to use a Parquet file in S3. I created a table for it in AWS Athena, and the table was created without errors.
However, when I run the SELECT query, it says "Zero Records Returned."
My Parquet file in S3 does have data.
I have created the partition too. IAM has full access to Athena.
If the column names you specified are correct, then you may need to load the partitions using: MSCK REPAIR TABLE EnterYourTableName;. This will add new partitions to the Glue Catalog.
If any of the above fails, you can create a temporary Glue Crawler to crawl your table and then validate the metadata in Athena by clicking the 3 dots next to the table name. Then select Generate Create Table DDL. You can then compare any differences in DDL.
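If eyeballing two DDLs is error-prone, a line diff makes the differences obvious. This is just a sketch using Python's standard `difflib`; `ddl_diff` is a hypothetical helper name, and it only compares text line by line after trimming whitespace, so reordered clauses will show up as differences.

```python
import difflib


def ddl_diff(ddl_a: str, ddl_b: str) -> list:
    """Return only the lines that differ between two DDL strings.

    Lines only in ddl_a are prefixed with "- ", lines only in ddl_b with "+ ".
    """
    lines_a = [line.strip() for line in ddl_a.strip().splitlines()]
    lines_b = [line.strip() for line in ddl_b.strip().splitlines()]
    return [
        line
        for line in difflib.ndiff(lines_a, lines_b)
        if line.startswith(("- ", "+ "))
    ]
```

Pass in the DDL generated for the crawler's table and the DDL of your own table; an empty list means the two are textually identical apart from indentation.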
I have defined a partitioned table that points to an S3 bucket which uses date partitioning. I have data for the past 3 months in the S3 bucket. I have loaded the partitions for the 1st month. However, I haven't loaded the partitions for the past 2 months using the MSCK REPAIR TABLE or ALTER TABLE commands. When I try to query the table, data for the past 2 months is not loaded from S3; only the most recent partitioned data shows up in the query results. Is this expected? If so, why?
I tried to create another partitioned table for the same S3 bucket, but this time I did not load any of the partitions. When I query this table, I get the most recent records.
Yes, it is expected.
Athena uses metadata to recognize data in S3. The most important metadata for locating data in S3 is the partition. Athena keeps details about all partitions in its metadata. Using this partition information, it goes to the corresponding folder in S3 to fetch data.
If you add more files to an existing partition: if the partition is already in the Athena metadata, all new files are detected automatically, because Athena reads all files from the S3 folder using the partition metadata and the S3 location.
If you add files in a new partition: if the partition is not in the Athena metadata, Athena doesn't know how to locate the corresponding folder in S3. Therefore, it doesn't read data from that folder.
There are three ways to register new partitions:
1. Run a Glue crawler over the S3 bucket, and it will refresh the partition metadata.
2. Use the ALTER TABLE command in Athena to add the new partitions.
3. Run MSCK REPAIR TABLE if your partition folders use the Hive compatible key=value layout.
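To see which folders Athena does not know about yet, you can compare the partition folders present in S3 with the ones already registered. The sketch below works on plain Python data so it runs without AWS access: `missing_partitions` and `partitions_in_keys` are hypothetical helpers, the list of object keys is assumed to come from an S3 listing, and the registered partitions from something like SHOW PARTITIONS.

```python
def partitions_in_keys(keys, table_prefix: str) -> set:
    """Collect the folder path (relative to the table prefix) of every data file."""
    found = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue  # object belongs to some other table or prefix
        folder, _, filename = key[len(table_prefix):].rpartition("/")
        if folder and filename:  # skip folder markers and unpartitioned files
            found.add(folder)
    return found


def missing_partitions(keys, table_prefix: str, registered) -> list:
    """Return partition folders that hold files but are not in the metadata."""
    return sorted(partitions_in_keys(keys, table_prefix) - set(registered))
```

Every folder returned by `missing_partitions` is one that queries will silently skip until it is registered with one of the three options above.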