creating external table with partition in ATHENA results in empty table - amazon-web-services

I have an s3 location with a parquet table partitioned by a date column.
parquet_data/
  dt=2021-07-27/
    files
  dt=2021-07-26/
    files
Now I want to create an external table over this data, partitioned by the dt column:
CREATE EXTERNAL TABLE IF NOT EXISTS database.tbl_name (
ACCOUNT_NUM bigint
, ID bigint
, NAME string
)
PARTITIONED BY (
dt date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://location/of/data/'
TBLPROPERTIES (
'classification'='parquet',
'typeOfData'='file'
);
When I select from this new table, there is no data in it at all, just the headers.
Is there something glaring I've missed?
Things I've tried:
re-creating the Parquet table
creating the table without the partition - this works, but then I can't see the partition column, and if I add dt to the table definition it comes out blank.

When creating a new table with existing partitioned data, run this command:
MSCK REPAIR TABLE database.tbl_name
From MSCK REPAIR TABLE - Amazon Athena:
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created. MSCK REPAIR TABLE compares the partitions in the table metadata and the partitions in S3. If new partitions are present in the S3 location that you specified when you created the table, it adds those partitions to the metadata and to the Athena table.
This is required because the partitions were not created by Amazon Athena or AWS Glue, so the Glue Data Catalog does not know they exist yet.
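Alternatively, if you prefer not to have Athena scan the whole location, you can register each partition explicitly. A minimal sketch, using the table and S3 layout from the question:
ALTER TABLE database.tbl_name ADD IF NOT EXISTS
PARTITION (dt = '2021-07-27') LOCATION 's3://location/of/data/dt=2021-07-27/'
PARTITION (dt = '2021-07-26') LOCATION 's3://location/of/data/dt=2021-07-26/';
This avoids a full scan of the prefix, but you have to run it for every new partition you add.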

Related

Redshift Spectrum to Delta Lake integration using manifest files (an issue in the partitioned table when updating a partition column)

I am working with Delta tables and Redshift Spectrum and I noticed some strange behaviour.
I followed this article to set up a Redshift Spectrum to Delta Lake integration using manifest files and query Delta tables: https://docs.delta.io/latest/redshift-spectrum-integration.html
Environment information
Delta Lake version: 1.0.1 (io.delta)
Spark version: 2.4.3
Scala version: 2.12.10
Describe the problem
In my use case, the Delta table is partitioned by 3 columns (year, month and day). The Delta table also has an "application_id" column used as the key for insert/update operations.
CREATE EXTERNAL TABLE yyyyy.xxxxxxxx (
application_id string,
general_status_startingoffer string,
general_status_offer string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://xxxxxxx/v0/data/_symlink_format_manifest/'
TBLPROPERTIES ('delta.compatibility.symlinkFormatManifest.enabled'='true');
Furthermore, an external schema has been created in Redshift to mirror the latest version of the Delta table.
create external schema yyyyy
from data catalog database 'yyyyy'
iam_role '${iam_role}';
We are using Athena and Redshift Editor to query the records.
The issue seems linked to an old partition that was deleted in the latest version of the Delta table. In particular, Redshift raises an error when fetching the Delta Lake manifest.
Steps to reproduce
The steps to replicate the error:
The first Glue job run creates 1 record in the following partition: year=2022, month=10, day=05
The symlink manifest is generated for Redshift
Data has been correctly written
The Delta log shows a newly added partition (year=2022, month=10, day=05)
A query on Athena shows correct results
A query on Redshift shows correct results
The second Glue job run updates this record, changing 2 columns (general_status_offer and day)
The symlink manifest is re-generated for Redshift
The Redshift manifest seems to be correctly updated
The Delta log shows a newly added partition (year=2022, month=10, day=13) and a deleted partition (year=2022, month=10, day=05). A query on Athena shows correct results
A query on Redshift fails with the following error message:
Caught exception in worker_thread loader thread for location=s3://xxxxxxx/v0/data/_symlink_format_manifest/year=2022/month=10/day=05: error=DeltaManifest context=Error fetching Delta Lake manifest xxxxx/v0/data/_symlink_format_manifest/year=2022/month=10/day=05/manifest Message: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 8ZCB2E6TTZ7JMHFD,ExtRid vnD7YB7JAPkW/
It can be resolved by adding a new symlink folder (year=2022, month=10, day=05) with an empty manifest file or running this statement on redshift:
ALTER TABLE xxxx DROP PARTITION (year='2022', month='10', day='05');
We need to automate this step, and it is not easy to recognize which partitions have been deleted. Are there any properties that I can set to force the automatic resolution?
If not, this could be a good enhancement :)
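As a workaround, you can at least enumerate which partitions Spectrum still has registered and drop the ones whose manifest prefixes no longer exist in S3 (the existence check itself has to happen outside SQL, e.g. with an S3 listing in the Glue job). A sketch against the schema and table names used above:
-- List the partitions Spectrum still has registered for the table
SELECT schemaname, tablename, values, location
FROM svv_external_partitions
WHERE schemaname = 'yyyyy' AND tablename = 'xxxxxxxx';
-- Drop any partition whose manifest folder is gone
ALTER TABLE yyyyy.xxxxxxxx DROP PARTITION (year='2022', month='10', day='05');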

AWS Glue job to convert table to Parquet w/o needing another crawler

Is it possible to have a Glue job re-classify a JSON table as Parquet instead of needing another crawler to crawl the Parquet files?
Current set up:
JSON files in partitioned S3 bucket are crawled once a day
Glue Job creates Parquet files in specified folder
Run ANOTHER crawler to RECREATE the same table that was made in step 1
I have to believe that there is a way to convert the table classification without another crawler (but I've been burned by AWS before). Any help is much appreciated!
For convenience considerations - 2 crawlers is the way to go.
For cost considerations - a hacky solution would be:
Get the JSON table's CREATE TABLE DDL from Athena using the SHOW CREATE TABLE <json_table>; command;
In the CREATE TABLE DDL, replace the table name, the SerDe, and the input/output formats from JSON to Parquet. You don't need the other table properties from the original CREATE TABLE DDL except LOCATION.
Execute the new CREATE TABLE DDL in Athena.
For example:
SHOW CREATE TABLE json_table;
Original DDL:
CREATE EXTERNAL TABLE `json_table`(
`id` int,
`name` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
...
LOCATION
's3://bucket_name/table_data'
...
New DDL:
CREATE EXTERNAL TABLE `parquet_table`(
`id` int,
`name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/table_data'
You can also do it the same way with the Glue API methods: get_table() > replace > create_table().
Notice - if you want to run it periodically, you would need to wrap it in a script and schedule it with another scheduler (crontab etc.) after the first crawler runs.

Bulk upload to Amazon Redshift

I need to insert data on a daily basis into AWS Redshift.
The requirement is to analyze only the daily batch inserted to Redshift. Redshift cluster is used by BI tools for analytics.
Question:
What are the best practices to "renew" the data set on a daily basis?
My concern is that it is a quite heavy operation and performance will be poor, but at the same time it is a quite common situation and I believe it has been done before by multiple organizations.
If the data is on S3, why not create an EXTERNAL TABLE over it? Then, if the query speed over the external table is not enough, you can load it into a temporary table using a CREATE TABLE AS SELECT statement and, once loaded, rename it to your usual table name.
Sketched SQL:
CREATE EXTERNAL TABLE external_daily_batch_20190422 (
<schema ...>
)
PARTITIONED BY (
<if anything to partition on>
)
ROW FORMAT SERDE <data format>
LOCATION 's3://my-s3-location/2019-04-22';
CREATE TABLE internal_daily_batch_temp
DISTKEY ...
SORTKEY ...
AS
SELECT * from external_daily_batch_20190422;
DROP TABLE IF EXISTS internal_daily_batch__backup CASCADE;
ALTER TABLE internal_daily_batch rename to internal_daily_batch__backup;
ALTER TABLE internal_daily_batch_temp rename to internal_daily_batch;
Incremental load not possible?
By the way, is all of your 10TB of data mutable? Isn't incremental update possible?
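If only the daily batch actually changes, a common incremental pattern is to COPY the batch into a staging table and merge it instead of rebuilding everything. A sketch only, with a hypothetical key column (id), IAM role, S3 path, and file format:
-- Load the daily batch into a staging table shaped like the target
CREATE TEMP TABLE daily_batch_staging (LIKE internal_daily_batch);
COPY daily_batch_staging
FROM 's3://my-s3-location/2019-04-22/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
-- Upsert: remove the rows being replaced, then insert the new batch
BEGIN;
DELETE FROM internal_daily_batch
USING daily_batch_staging
WHERE internal_daily_batch.id = daily_batch_staging.id;
INSERT INTO internal_daily_batch SELECT * FROM daily_batch_staging;
COMMIT;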

Getting 0 rows while querying external table in redshift

We created the schema as follows:
create external schema spectrum
from data catalog
database 'test'
iam_role 'arn:aws:iam::20XXXXXXXXXXX:role/athenaaccess'
create external database if not exists;
and table as follows:
create external table spectrum.Customer(
Subr_Id integer,
SUB_CURRENTSTATUS varchar(100),
AIN integer,
ACCOUNT_CREATED timestamp,
Subr_Name varchar(100),
LAST_DEACTIVATED timestamp)
partitioned by (LAST_ACTIVATION timestamp)
row format delimited
fields terminated by ','
stored as textfile
location 's3://cequity-redshiftspectrum-test/'
table properties ('numRows'='1000');
The access rights are as follows:
Roles with AthenaQuickSight access, full Athena access, and S3 full access are attached to the Redshift cluster.
However, when we query as below, we get 0 records. Please help.
select count(*) from spectrum.Customer;
If your query returns zero rows from a partitioned external table, check whether a partition has been added to this external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using ALTER TABLE … ADD PARTITION. Query the SVV_EXTERNAL_PARTITIONS view to find existing partitions. Run ALTER TABLE … ADD PARTITION for each missing partition.
Reference
I had the same issue. Doing the above resolved it.
P.S. Explicitly running the ALTER TABLE command to create partitions can also be automated.
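A minimal sketch for the spectrum.Customer table above (the partition value and the S3 prefix layout are assumptions about how the data is organized):
-- See which partitions Spectrum already knows about
SELECT schemaname, tablename, values, location
FROM svv_external_partitions
WHERE tablename = 'customer';
-- Register a missing partition explicitly
ALTER TABLE spectrum.Customer
ADD IF NOT EXISTS PARTITION (LAST_ACTIVATION = '2021-01-01 00:00:00')
LOCATION 's3://cequity-redshiftspectrum-test/2021-01-01/';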

AWS Athena: use "folder" name as partition

I have thousands of individual json files (corresponding to one Table row) stored in s3 with the following path: s3://my-bucket/<date>/dataXX.json
When I create my table in DDL, is it possible to have the data partitioned by the <date> present in the S3 path? (or at least add the value as a new column)
Thanks
Sadly this is not supported in Athena. For partitioning to work automatically, there are requirements on how the folders must be named.
e.g.
s3://my-bucket/{columnname}={columnvalue}/data.json
In your case, you can still use partitioning if you add those partitions manually to the table.
e.g.
ALTER TABLE tablename ADD PARTITION (datecolumn='2017-01-01') LOCATION 's3://my-bucket/2017-01-01/';
The AWS docs have some good examples on that topic.
AWS Athena Partitioning
It is possible to do this now using storage.location.template. This will partition by some part of your path. Be sure NOT to include the new column in the column list, as it will be added automatically. There are a lot of options you can look up to tweak this for your date example. I used "id" to show the simplest version I could think of.
CREATE EXTERNAL TABLE `some_table`(
`col1` bigint
)
PARTITIONED BY (
`id` string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://path/bucket/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'projection.enabled'='true',
'projection.id.type' = 'injected',
'storage.location.template'='s3://path/bucket/${id}/'
)
official docs: https://docs.amazonaws.cn/en_us/athena/latest/ug/partition-projection-dynamic-id-partitioning.html
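For the <date> case in the question, a hedged variant using a date-typed projection instead of an injected one (the column name dt, the date range, and the bucket path are assumptions):
CREATE EXTERNAL TABLE `my_table`(
`col1` bigint
)
PARTITIONED BY (
`dt` string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION
's3://my-bucket/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.dt.type'='date',
'projection.dt.range'='2017-01-01,NOW',
'projection.dt.format'='yyyy-MM-dd',
'storage.location.template'='s3://my-bucket/${dt}/'
)
With this in place, a predicate like WHERE dt = '2017-01-01' resolves the S3 prefix directly, without any ALTER TABLE or MSCK REPAIR.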
It's not necessary to do this manually. Set up a Glue crawler and it will pick up the folder (in the prefix) as a partition, provided all the folders in the path have the same structure and all the data has the same schema design.
But it will name the partition partition0. You can go into edit-schema and change the name of this partition to date or whatever you like.
Just make sure you go into your Glue crawler and under "configuration options" select the option "Add new columns only". Otherwise, on the next crawler run, it will reset the partition name back to partition0.
You need to name each S3 folder using the Hive-style key=value convention, e.g. s3://my-bucket/dt=2017-01-01/.
When setting up the table in Athena, specify dt as the partition column.
After that, run MSCK REPAIR TABLE <your table name>; on Athena.
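A minimal sketch of that approach (the table name, column, and bucket are assumptions):
CREATE EXTERNAL TABLE my_table (
col1 string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION
's3://my-bucket/';
MSCK REPAIR TABLE my_table;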