Getting 0 rows while querying external table in redshift - amazon-web-services

We created the schema as follows:
create external schema spectrum
from data catalog
database 'test'
iam_role 'arn:aws:iam::20XXXXXXXXXXX:role/athenaaccess'
create external database if not exists;
and the table as follows:
create external table spectrum.Customer(
Subr_Id integer,
SUB_CURRENTSTATUS varchar(100),
AIN integer,
ACCOUNT_CREATED timestamp,
Subr_Name varchar(100),
LAST_DEACTIVATED timestamp)
partitioned by (LAST_ACTIVATION timestamp)
row format delimited
fields terminated by ','
stored as textfile
location 's3://cequity-redshiftspectrum-test/'
table properties ('numRows'='1000');
The access rights are as follows:
Roles granting AthenaQuickSight access, full Athena access, and full S3 access are attached to the Redshift cluster.
However, when we query as below, we get 0 records. Please help.
select count(*) from spectrum.Customer;

If your query returns zero rows from a partitioned external table, check whether a partition has been added to the external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using ALTER TABLE … ADD PARTITION. Query the SVV_EXTERNAL_PARTITIONS view to find existing partitions, and run ALTER TABLE … ADD PARTITION for each missing one.
Reference
I had the same issue, and doing the above resolved it.
P.S. The explicit ALTER TABLE … ADD PARTITION runs can also be automated.
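As a minimal sketch against the Customer table above (the S3 prefix and the LAST_ACTIVATION value are placeholders, since the question does not show how the files are laid out under the bucket):
-- list the partitions Redshift Spectrum currently knows about
select * from svv_external_partitions where tablename = 'customer';
-- register one partition explicitly (prefix and value are illustrative)
alter table spectrum.Customer
add if not exists partition (LAST_ACTIVATION='2020-01-01 00:00:00')
location 's3://cequity-redshiftspectrum-test/last_activation=2020-01-01/';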

Related

creating external table with partition in ATHENA results in empty table

I have an s3 location with a parquet table partitioned by a date column.
parquet_data/
-- dt=2021-07-27/
   -- files
-- dt=2021-07-26/
   -- files
Now I want to create an external table partitioned by the dt column.
CREATE EXTERNAL TABLE IF NOT EXISTS database.tbl_name (
ACCOUNT_NUM bigint
, ID bigint
, NAME string
)
PARTITIONED BY (
dt date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://location/of/data/'
TBLPROPERTIES (
'classification'='parquet',
'typeOfData'='file'
);
When I select from this new table, there is no data in it at all, just the headers.
Is there something glaring I've missed?
Things I've tried:
re-creating the parquet table
creating the table without the partition - this works, but then I can't see the partition column, and adding dt to the table definition just comes out blank.
When creating a new table with existing partitioned data, run this command:
MSCK REPAIR TABLE database.tbl_name
From MSCK REPAIR TABLE - Amazon Athena:
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created. MSCK REPAIR TABLE compares the partitions in the table metadata and the partitions in S3. If new partitions are present in the S3 location that you specified when you created the table, it adds those partitions to the metadata and to the Athena table.
This is required because the partitions were not created by Amazon Athena or AWS Glue, so the Glue Data Catalog does not yet know that they exist.
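If you prefer not to run MSCK REPAIR TABLE, the same partitions can be registered explicitly; here is a sketch assuming the table name and S3 layout from the question:
ALTER TABLE database.tbl_name ADD IF NOT EXISTS
PARTITION (dt = date '2021-07-27') LOCATION 's3://location/of/data/dt=2021-07-27/'
PARTITION (dt = date '2021-07-26') LOCATION 's3://location/of/data/dt=2021-07-26/';
-- verify that the partitions are now registered
SHOW PARTITIONS database.tbl_name;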

Does hierarchical partitioning work in AWS Athena/S3?

I am new to AWS and trying to use S3 and Athena for a use case.
I want the data saved as JSON files in S3 to be queried from Athena. To reduce the data scanned, I have created a directory structure like this:
../customerid/date/*.json (format)
../100/2020-04-29/*.json
../100/2020-04-30/*.json
.
.
../101/2020-04-29/*.json
In Athena, the table structure has been created according to the data we are expecting, and two partition columns have been created, namely customer (customerid) and dt (date).
I want to query all the data for customer '100' and limit my scan to its directory, for which I am trying to load the partition as follows:
alter table <table_name> add
partition (customer=100) location 's3://<location>/100/'
But I get the following error
FAILED: SemanticException partition spec {customer=100} doesn't contain all (2) partition columns
Clearly it's not loading a single partition when multiple partition columns have been defined.
Giving both partitions in ALTER TABLE:
alter table <table_name> add
partition (customer=100, dt=2020-04-22) location 's3://<location>/100/2020-04-22/'
I get this error
missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;
Am I doing something wrong?
Does this even work?
If not, is there a way to work with hierarchical partitions?
The issue is with the S3 hierarchical structure that you have given. It should be
../customer=100/dt=2020-04-29/*.json
instead of
../100/2020-04-29/*.json
If you have data in S3 stored in the correct prefix structure as mentioned, then you can add all partitions with a simple MSCK REPAIR TABLE <table_name> command.
Hope this clarifies.
I figured out the mistake I was making, so I wanted to share it in case anyone finds themselves in the same situation.
This applies to data that is not partitioned in Hive format (refer to the Athena partitioning documentation for Hive-style and non-Hive-style formats).
Taking the example above, the following is the ALTER command that works:
alter table <table_name> add
partition (customer=100, dt=date '2020-04-22') location 's3://<location>/100/2020-04-22/'
Notice the change in the syntax of the "dt" partition value. My partition data type was set to "date", and not specifying it while loading the partition was causing the error.
Omitting the data type also works; we just need to use single quotes, which default the partition value to string/varchar:
alter table <table_name> add
partition (customer=100, dt='2020-04-22') location 's3://<location>/100/2020-04-22/'
I prefer giving the date data type while adding the partition, as that is how I configured the partition.
Hope this helps.
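Building on this, Athena also accepts several partition specs in one statement, so the whole layout from the question can be registered at once; a sketch using the prefixes shown above:
alter table <table_name> add if not exists
partition (customer=100, dt=date '2020-04-29') location 's3://<location>/100/2020-04-29/'
partition (customer=100, dt=date '2020-04-30') location 's3://<location>/100/2020-04-30/'
partition (customer=101, dt=date '2020-04-29') location 's3://<location>/101/2020-04-29/';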

Bulk upload to Amazon Redshift

I need to insert data on a daily basis into AWS Redshift.
The requirement is to analyze only the daily batch inserted into Redshift. The Redshift cluster is used by BI tools for analytics.
Question:
What are the best practices to "renew" the data set on a daily basis?
My concern is that it is quite a heavy operation and performance will be poor, but at the same time it is quite a common situation, and I believe it has been done before by multiple organizations.
If the data is on S3, why not create an EXTERNAL TABLE over it? Then, if the query speed over the external table is not enough, you can load it into a temporary table with a CREATE TABLE AS SELECT statement and, once loaded, rename it to your usual table name.
Sketched SQL:
CREATE EXTERNAL TABLE external_daily_batch_20190422 (
<schema ...>
)
PARTITIONED BY (
<if anything to partition on>
)
ROW FORMAT SERDE <data format>
LOCATION 's3://my-s3-location/2019-04-22';
CREATE TABLE internal_daily_batch_temp
DISTKEY ...
SORTKEY ...
AS
SELECT * from external_daily_batch_20190422;
DROP TABLE IF EXISTS internal_daily_batch__backup CASCADE;
ALTER TABLE internal_daily_batch rename to internal_daily_batch__backup;
ALTER TABLE internal_daily_batch_temp rename to internal_daily_batch;
Incremental load not possible?
By the way, is all of your 10TB of data mutable? Isn't incremental update possible?
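If an incremental load is an option, here is a rough sketch of a daily append instead of the full rename swap above; batch_date is a hypothetical partition column on the external table, and the schema and table names are placeholders:
-- register the new day's prefix on the external table (batch_date is hypothetical)
ALTER TABLE spectrum_schema.external_daily_batch
ADD IF NOT EXISTS PARTITION (batch_date='2019-04-22')
LOCATION 's3://my-s3-location/2019-04-22/';
-- append only that day's rows to the internal table
INSERT INTO internal_daily_batch
SELECT * FROM spectrum_schema.external_daily_batch
WHERE batch_date = '2019-04-22';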

Redshift spectrum timestamp column issues

I have a few files in S3. I used the Glue Data Catalog to get the table definition. I have a field called log_time, and I manually set its data type to timestamp in the Glue catalog. Now when I query that table from Athena, I can see the timestamp values correctly.
Then I go to Redshift Spectrum and create an external schema pointing to the schema created by the Glue Data Catalog. I can see the tables that are defined there, and when I check the data type of the column I see that it is defined as timestamp. However, when I run the same query I ran in Athena, the log_time field displays the date part correctly, but the time part is 00:00:00 for all rows.
Any idea?
Date value in the file: 2018-12-16 00:47:20.28
When I manually change the field data type to timestamp in the Glue Data Catalog and then query in Athena, I see the value 2018-12-16 00:47:20.280.
When I create a Redshift Spectrum schema pointing to the Data Catalog's schema and then query it, I see the value 2018-12-16 00:00:00.

can athena table be created for s3 bucket sub-directories?

Our S3 buckets generally have a number of sub-directories, so that the path to the data is something like s3://top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data.
We're looking into Athena to be able to query against data at that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
Is it possible to create an Athena table that queries against this structure? And what would the S3 location be in the create table wizard? I tried it with s3://top-level-function-group/more-specific-folder/, but when it ran, I think it said something like '0 KB data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
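Once partitions are loaded, queries can prune the scan to a single tenant by filtering on the partition column; a minimal example against the table sketched above:
SELECT id, stuff
FROM mydb.mytable
WHERE orgtenantcompanyid = 'org1'
LIMIT 10;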
Yes, it is possible to create tables that only use contents of a specific subdirectory.
It's normal that you see 0 KB read after creating your table; that's because no data is read when you CREATE a table.
To check whether you can actually query the data, run something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, that does not happen automatically if the structure is not in the /key=value/ format. You can still use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables
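One common way to do that, assuming the mydb.mytable sketch from the first answer, is Athena's hidden "$path" pseudo-column, which exposes the S3 object each row came from:
SELECT "$path", id, stuff
FROM mydb.mytable
WHERE "$path" LIKE '%org-tenant-company-id%'
LIMIT 10;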