No data in query on Redshift Spectrum External Table - amazon-web-services

I have created an external table in Redshift Spectrum to read Parquet files in an S3 bucket. I have created the corresponding bucket and database in AWS Lake Formation, attached the associated IAM roles, and added a partition to the table.
CREATE STATEMENT:
CREATE EXTERNAL TABLE schemadb.invoices(
purchase_id varchar,
invoice_id varchar,
invoice_number varchar,
po_number varchar,
invoice_amount float8
)
PARTITIONED BY (date varchar)
STORED AS PARQUET
LOCATION 's3://my-dev-bucket/invoices/';
ADD PARTITION STATEMENT:
ALTER TABLE spectrumdb.invoices
add PARTITION (date='2023-02-10')
LOCATION 's3://my-dev-bucket/invoices/date=2023-02-10/';
ROLES Applied to database:
Admin-Role
IAMAllowedPrincipals
RedshiftSpectrum-S3ReadOnly
I'm still unsure why I cannot see the data. I am using DBeaver to run the CREATE statement and query the table.
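For reference, what Spectrum has actually registered can be checked from Redshift's system views (a minimal sketch; the table name follows the statements above):
-- Confirm the external table and its S3 location are registered
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE tablename = 'invoices';
-- Confirm the partition added above is visible to Spectrum
SELECT values, location
FROM svv_external_partitions
WHERE tablename = 'invoices';
If the partition row is missing, or the locations do not match the S3 paths, that is usually why the query returns no data.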

Related

AWS Athena table empty result

I have an AWS Athena table whose records are returned by a standard "SELECT * FROM ..." query. I ran SHOW CREATE TABLE on it and used the resulting DDL to create a new table, test2.
The same SELECT query on test2 always returns empty rows. Why is this happening?
Athena tables store their data in an external source, which in AWS is S3. When you look at the CREATE TABLE DDL, there is a LOCATION clause that points to the S3 bucket. If the LOCATION is different, that is probably why you see no rows when you execute a SELECT on this table.
CREATE EXTERNAL TABLE `test_table`(
...
)
ROW FORMAT ...
STORED AS INPUTFORMAT ...
OUTPUTFORMAT ...
LOCATION 's3://bucketname/folder/'
If the location is correct, it could be that you have to run the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive-compatible partitions. From the docs:
Use the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive compatible partitions.
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created.
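A minimal sketch of that repair step, assuming the table name from the DDL above:
-- Register partitions that already exist in S3
MSCK REPAIR TABLE test_table;
-- Verify which partitions Athena now knows about
SHOW PARTITIONS test_table;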
Make sure to check the Troubleshooting section as well. One thing that I was missing once was the glue:BatchCreatePartition permission on my IAM role.

Amazon Redshift Spectrum with partitioned by does not return results

Given an S3 bucket partitioned by date in this way:
year
|___month
    |___day
        |___file_*.parquet
I am trying to create a table in Amazon Redshift Spectrum with this command:
create external table spectrum.visits(
ip varchar(100),
user_agent varchar(2000),
url varchar(10000),
referer varchar(10000),
session_id char(32),
store_id int,
category_id int,
page_id int,
product_id int,
customer_id int,
hour int
)
partitioned by (year char(4), month varchar(2), day varchar(2))
stored as parquet
location 's3://visits/visits-parquet/';
Although no error message is thrown, the queries always come back empty, i.e., they do not return results. The bucket is not empty. Does someone know what I am doing wrong?
When an External Table is created in Amazon Redshift Spectrum, it does not scan for existing partitions. Therefore, Redshift is not aware that they exist.
You will need to execute an ALTER TABLE ... ADD PARTITION command for each existing partition.
(Amazon Athena has a MSCK REPAIR TABLE option, but Redshift Spectrum does not.)
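A minimal sketch for one such partition, assuming the folder layout shown in the question (the date values are placeholders; repeat or script this for every year/month/day prefix that already exists in S3):
ALTER TABLE spectrum.visits
ADD PARTITION (year='2020', month='07', day='15')
LOCATION 's3://visits/visits-parquet/2020/07/15/';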
As I can't comment on other people's solutions, I need to add another one.
I would like to point out that if your Spectrum table comes from the AWS Glue Data Catalog, you don't need to manually add partitions to the tables; you can have a crawler update the partitions in the Data Catalog and the changes will be reflected in Spectrum.
One can create an external table in Athena and run MSCK REPAIR on it. Make sure you add a "/" at the end of the location. Then create an external schema in Redshift. This solved my problem of results showing up blank. Alternatively, you can run a Glue crawler on the Athena database, which will generate the partitions automatically.
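A minimal sketch of that Athena-first approach (the database, schema, table, and role names are placeholders):
-- In Athena, with the relevant database selected: register existing partitions
MSCK REPAIR TABLE visits;
-- In Redshift: expose the same Glue database as an external schema
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'sampledb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';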

Populate external schema table in Redshift from S3 bucket file

I am new to AWS and trying to figure out how to populate a table within an external schema residing in Amazon Redshift. I used AWS Glue to create a table from a .csv file that sits in an S3 bucket. I can query the newly created table via Amazon Athena.
My task is to take that data and populate a table living in a Redshift external schema. I tried creating a job within Glue, but had no luck.
This is where I am stuck. Am I supposed to first create an empty destination table that mirrors the table that I can query using Athena?
Thank you in advance to anyone who might be able to assist!
Redshift Spectrum and Athena both use the Glue data catalog for external tables. When you create a new Redshift external schema that points at your existing Glue catalog, the tables it contains will immediately exist in Redshift.
-- Create the Redshift Spectrum schema
CREATE EXTERNAL SCHEMA IF NOT EXISTS my_redshift_schema
FROM DATA CATALOG DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam:::role/MyIAMRole'
;
-- Review the schema info
SELECT *
FROM svv_external_schemas
WHERE schemaname = 'my_redshift_schema'
;
-- Review the tables in the schema
SELECT *
FROM svv_external_tables
WHERE schemaname = 'my_redshift_schema'
;
-- Confirm that the table returns data
SELECT *
FROM my_redshift_schema.my_external_table LIMIT 10
;

While running an AWS Athena query, the query says "Zero Records Returned"

SELECT * FROM "sampledb"."parquetcheck" limit 10;
I am trying to use a Parquet file in S3 and have created a table in AWS Athena; the table is created perfectly.
However, when I run the SELECT query above, it says "Zero Records Returned", although my Parquet file in S3 has data.
I have created a partition too, and IAM has full access to Athena.
If your specified column names are correct, then you may need to load the partitions using: MSCK REPAIR TABLE EnterYourTableName;. This will add the new partitions to the Glue Catalog.
If any of the above fails, you can create a temporary Glue crawler to crawl your table, and then validate the metadata in Athena by clicking the three dots next to the table name and selecting Generate Create Table DDL. You can then compare any differences in the DDL.
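The same check can also be done from SQL instead of the console (a sketch; the table name comes from the query above):
-- Dump the DDL Athena has registered for the table
SHOW CREATE TABLE parquetcheck;
-- List the partitions Athena currently knows about
SHOW PARTITIONS parquetcheck;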

Use external table redshift spectrum defined in glue data catalog

I have a table defined in the Glue Data Catalog that I can query using Athena. As there is some data in the table that I want to use with other Redshift tables, can I access the table defined in the Glue Data Catalog from Redshift?
What will the CREATE EXTERNAL TABLE query be to reference the table definition in the Glue catalog?
From AWS (Creating External Schemas):
create external schema athena_schema from data catalog
database 'sampledb'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
region 'us-east-2';
This creates a schema athena_schema that points to the sampledb database in Athena / Glue.
You need to grant appropriate access to the IAM role you specify: the Redshift cluster needs to be able to assume the role, and the role needs access to Glue.
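Once the schema exists, the Glue-defined table can be queried and joined with local Redshift tables like any other table (a sketch; the table and column names here are hypothetical):
SELECT g.order_id, g.order_total, c.customer_name
FROM athena_schema.orders g
JOIN public.customers c ON c.customer_id = g.customer_id
LIMIT 10;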