Redshift external catalog error when copying Parquet from S3

I am trying to copy Google Analytics data into Redshift in Parquet format. When I limit the columns to a few select fields, I am able to copy the data. But when I include a few specific columns, I get an error:
ERROR: External Catalog Error. Detail: ----------------------------------------------- error: External Catalog Error. code: 16000 context: Unsupported column type found for column: 6. Remove the column from the projection to continue. query: 18669834 location: s3_request_builder.cpp:2070 process: padbmaster [pid=23607] -----------------------------------------------
I know the issue is most probably with the data, but I am not sure how I can debug it, as this error is not helpful in any way. I have tried changing the data types of the columns to SUPER, but without any success. I am not using Redshift Spectrum here.

I found the solution. The error message says Unsupported column type found for column: 6, and Redshift column ordinality starts from 0; I was counting columns from 1 instead of 0 (my mistake). So the issue was with the column at index 6, i.e. the 7th column by my 1-based counting, which was a string/varchar column in my case. I created a table with just this column and tried loading data into just this column. Then I got
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': '\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15001\n context: The length of the data column display_name is longer than the length defined in the table. Table: 256, Data: 1020
Recreating those columns as varchar(max) solved the issue.
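Here is a minimal sketch of the fix, following the single-column test described above. The table name, bucket path and IAM role are hypothetical placeholders; display_name comes from the error message, and I'm assuming the Parquet files at that path contain just this one column, as in the test above.
-- single-column test table with the problem column widened to varchar(max) (names are illustrative)
CREATE TABLE test_display_name (
    display_name varchar(max)
);

-- retry the load against a Parquet file that contains only this column
COPY test_display_name
FROM 's3://my-bucket/path/to/ga-parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS PARQUET;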

I assume you have semistructured data in your Parquet (like an array).
In this case, you can have a look at the very bottom of this page: https://docs.aws.amazon.com/redshift/latest/dg/ingest-super.html
It says:
If your semistructured or nested data is already available in either
Apache Parquet or Apache ORC format, you can use the COPY command to
ingest data into Amazon Redshift.
The Amazon Redshift table structure should match the number of columns
and the column data types of the Parquet or ORC files. By specifying
SERIALIZETOJSON in the COPY command, you can load any column type in
the file that aligns with a SUPER column in the table as SUPER. This
includes structure and array types.
COPY foo FROM 's3://bucket/somewhere'
...
FORMAT PARQUET SERIALIZETOJSON;
For me, the last line
...
FORMAT PARQUET SERIALIZETOJSON;
did the trick.
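For completeness, a fuller sketch of that COPY. The table name and S3 path come from the documentation example above; the IAM role is a hypothetical placeholder standing in for the elided credentials.
COPY foo
FROM 's3://bucket/somewhere'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'  -- hypothetical role replacing the '...' above
FORMAT PARQUET SERIALIZETOJSON;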

Related

HIVE_CURSOR_ERROR when Querying S3 Inventory With Athena - Is size column correct?

I'm attempting to do some analysis on one of our S3 buckets using Athena and I'm getting some errors that I can't explain or find solutions for anywhere I look.
The guide I'm following is https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html.
I created my S3 inventory yesterday and have now received the first report in S3. The format is Apache ORC, the last export shows as yesterday and the additional fields stored are Size, Last modified, Storage class, Encryption.
I can see the data stored under s3://{my-inventory-bucket}/{my-bucket}/{my-inventory} so I know there is data there.
The default encryption on the inventory bucket and inventory configuration both have SSE-S3 encryption enabled.
To create the table, I am using the following query:
CREATE EXTERNAL TABLE my_table (
`bucket` string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';
Once the table has been created, I load the data using:
MSCK REPAIR TABLE my_table;
The results from loading the data show that data has been loaded:
Partitions not in metastore: my_table=2021-07-17-00-00
Repair: Added partition to metastore my_table=2021-07-17-00-00
Once that's loaded, I verify the data is available using:
SELECT DISTINCT dt FROM my_table ORDER BY 1 DESC limit 10;
Which outputs:
1 2021-07-17-00-00
Now if I run something like the below, everything runs fine and I get the expected results:
SELECT key FROM my_table ORDER BY 1 DESC limit 10;
But as soon as I include the size column, I receive an error:
SELECT key, size FROM my_table ORDER BY 1 DESC limit 10;
Your query has the following error(s):
HIVE_CURSOR_ERROR: Failed to read ORC file: s3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/data/{UUID}.orc
This query ran against the "my_table" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: {UUID}.
I feel like I've got something wrong with my size column. Can anyone help figure this out?
So frustrating. I think I found the answer here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
IsLatest – Set to True if the object is the current version of the object. (This field is not included if the list is only for the current version of objects.)
Removing that column fixed the problem. Presumably my inventory lists only the current versions of objects, so that field isn't present in the ORC files and the declared schema no longer matched the data.
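For reference, here's a sketch of the table definition with that column dropped. I'm assuming (not confirmed in the thread) that a current-versions-only inventory also omits version_id and is_delete_marker, so I've removed those as well; adjust the column list to whatever your inventory report actually contains.
CREATE EXTERNAL TABLE my_table (
  `bucket` string,
  key string,
  size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';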

BigQuery: Some rows belong to different partitions rather than destination partition

I am running an Airflow DAG which moves data from GCS to BQ using the GoogleCloudStorageToBigQueryOperator operator. I am on Airflow version 1.10.2.
This task moves data from MySQL to BQ (a partitioned table). Until now we were partitioning by ingestion time, and the incremental load of the past three days was working fine when the data was loaded using the Airflow DAG.
Now we have changed the partitioning type to date or timestamp on a DATE column of the table, after which we started getting this error. Since the incremental load brings the last three days of data from the MySQL table, I was expecting the BQ job to append the new records, or to recreate the partitions with 'WRITE_TRUNCATE' (which I have tested earlier); both of them fail with the error message below.
Exception: BigQuery job failed. Final error was: {'reason': 'invalid', 'message': 'Some rows belong to different partitions rather than destination partition 20191202'}.
I won't be able to post the code since all modules are called based on JSON parameters, but here is what I am passing to the operator for this table, along with the other regular parameters:
create_disposition='CREATE_IF_NEEDED',
time_partitioning = {'field': 'entry_time', 'type': 'DAY'}
write_disposition = 'WRITE_APPEND' #Tried with 'WRITE_TRUNCATE'
schema_update_options = ('ALLOW_FIELD_ADDITION',
'ALLOW_FIELD_RELAXATION')
I believe these are the fields which might be causing the issue; any help on this is appreciated.
When using BigQuery tables partitioned by date or timestamp, you should specify the partition to load the data into, e.g.:
table_name$20160501
Also, your column values should match the partition. For example, if you create this table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
The date column is the partitioning column, and if you try to load the following row:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
You will get this error because you are loading data dated 2018-11-07 while targeting the partition 20181105:
Some rows belong to different partitions rather than destination partition 20181105
I suggest using a destination_project_dataset_table value like the following, and verifying that the data matches the partition date.
destination_project_dataset_table='dataset.table$YYYYMMDD',
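To verify that, one quick check (assuming, purely for illustration, that the incremental data is queryable in BigQuery as a staging table called my_dataset.my_staging_table and that entry_time is a DATE) is to look for entry_time values that fall outside the destination partition reported in the error:
-- hypothetical diagnostic: rows whose entry_time does not belong to partition 20191202
SELECT entry_time, COUNT(*) AS row_count
FROM my_dataset.my_staging_table
WHERE entry_time != DATE '2019-12-02'
GROUP BY entry_time;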

Redshift Spectrum: Getting no values / empty result on select using Parquet

I am using Redshift Spectrum. I have tried using a text file and it works perfectly. To increase performance, I am trying PARQUET. The table gets created, but no values are returned when I fire a SELECT query. Below are my queries:
CREATE EXTERNAL TABLE gf_spectrum.order_headers
(
header_id numeric(38,18) ,
org_id numeric(38,18) ,
order_type_id numeric(38,18) )
partitioned by (partition1 VARCHAR(240))
stored as PARQUET
location 's3://aws-bucket/Spectrum/order_headers';
Select * from gf_spectrum.order_headers limit 1000;
Also, is partitioning compulsory for PARQUET? I tried that as well, and the table got created. But while retrieving data, I got an S3 fetch error about an invalid version number, which did not happen with the text file. Is it something related to the PARQUET format?
Thanks for your help.

Skipping header rows in AWS Redshift External Tables

I have a file in S3 with the following data:
name,age,gender
jill,30,f
jack,32,m
And a redshift external table to query that data using spectrum:
create external table spectrum.customers (
"name" varchar(50),
"age" int,
"gender" varchar(1))
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 's3://...';
When querying the data I get the following result:
select * from spectrum.customers;
name,age,g
jill,30,f
jack,32,m
Is there an elegant way to skip the header row as part of the external table definition, similar to the tblproperties ("skip.header.line.count"="1") option in Hive? Or is my only option (at least for now) to filter out the header rows as part of the select statement?
I answered this in: How to skip headers when we are reading data from a csv file in s3 and creating a table in aws athena.
This works in Redshift:
You want to use table properties ('skip.header.line.count'='1'), along with other properties if you want, e.g. 'numRows'='100'.
Here's a sample:
create external table exreddb1.test_table
(ID BIGINT
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
Currently, AWS Redshift Spectrum does not support skipping header rows. If you can, you could raise a support issue; that would allow tracking the availability of this feature, and the request could be forwarded to the development team for consideration.
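As for the fallback mentioned in the question, a simple sketch of filtering the header row out in the SELECT itself (using the example table above, and relying on the header row's name field containing the literal string 'name'):
select * from spectrum.customers where "name" <> 'name';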

Converting avro to parquet (using hive maybe?)

I'm trying to convert a bunch of multi-part Avro files stored on HDFS (100s of GBs) to Parquet files (preserving all data).
Hive can read the avro files as an external table using:
CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<location>'
TBLPROPERTIES ('avro.schema.url'='<schema.avsc>');
But when I try to create a parquet table:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
it throws an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<...>
Is it possible to convert uniontype to something that is a valid datatype for the external parquet table?
I'm open to alternative, simpler methods as well. MR? Pig?
Looking for a way that's fast, simple and has minimal dependencies to bother about.
Thanks
Try splitting this:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
into 2 steps:
CREATE EXTERNAL TABLE as_parquet (col1 col1_type, ... , coln coln_type) STORED AS parquet LOCATION 'hdfs:///xyz.parquet';
INSERT INTO TABLE as_parquet SELECT * FROM as_avro;
Or, if you have partitions, which I guess you have for this amount of data:
INSERT INTO TABLE as_parquet PARTITION (year=2016, month=07, day=13) SELECT <all_columns_except_partition_cols> FROM as_avro WHERE year='2016' and month='07' and day='13';
Note:
For step 1, in order to avoid any typos or small mistakes in column types and such, you can (see the sketch at the end of this answer):
Run SHOW CREATE TABLE as_avro and copy the create statement of the as_avro table
Replace the table name, file format and location of the table
Run the new create statement.
This works for me...
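To make that note concrete, here is a hypothetical sketch, assuming (purely for illustration) that SHOW CREATE TABLE as_avro reported two columns, id bigint and name string. Substitute the real column list from your own output, including whatever concrete type you decide to use in place of the uniontype column.
-- step 1: the edited create statement (illustrative column list)
CREATE EXTERNAL TABLE as_parquet (
  id bigint,
  name string
)
STORED AS PARQUET
LOCATION 'hdfs:///xyz.parquet';

-- step 2: copy the data across
INSERT INTO TABLE as_parquet SELECT * FROM as_avro;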