Redshift Spectrum : Getting no values/ empty while select using Parquet - amazon-web-services

I have tried using textfile and it works perfectly. I am using Redshift spectrum. To increase performance, I am trying using PARQUET. The table gets created but I get no value returned while firing a Select query. Below are my queries:
CREATE EXTERNAL TABLE gf_spectrum.order_headers
(
header_id numeric(38,18) ,
org_id numeric(38,18) ,
order_type_id numeric(38,18) )
partitioned by (partition1 VARCHAR(240))
stored as PARQUET
location 's3://aws-bucket/Spectrum/order_headers';
Select * from gf_spectrum.order_headers limit 1000;
Also, does PARQUET require partitioning compulsory? I tried that as well, and the table got created. But while retrieving data, I got an S3 Fetch error of invalid version number which did not happen with the text file. Is it something related to PARQUET format?
Thanks for your help.

Related

Redshift external catalog error when copying parquet from s3

I am trying to copy Google Analytics data into redshift via parquet format. When I limit the columns to a few select fields, I am able to copy the data. But on including few specific columns I get an error:
ERROR: External Catalog Error. Detail: ----------------------------------------------- error: External Catalog Error. code: 16000 context: Unsupported column type found for column: 6. Remove the column from the projection to continue. query: 18669834 location: s3_request_builder.cpp:2070 process: padbmaster [pid=23607] -----------------------------------------------
I know the issue is most probably with the data, but I am not sure how can I debug as this error is not helpful in anyway. I have tried changing data types of the columns to super, but without any success. I am not using redshift spectrum here.
I found the solution. In the error message it says Unsupported column type found for column: 6. Redshift column ordinality starts from 0. I was counting columns from 1, instead of 0 (my mistake). So this means issue was with column 6 (which I was reading as column 7), which was a string or varchar column in my case. I created a table with just this column and tried uploading data in just this column. Then I got
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': '\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15001\n context: The length of the data column display_name is longer than the length defined in the table. Table: 256, Data: 1020
Recreating the column with varchar(max) for those columns solved the issue
I assume you have semistructured data in your parquet (like an array).
In this case, you can have a look at this page at the very bottom https://docs.aws.amazon.com/redshift/latest/dg/ingest-super.html
It says:
If your semistructured or nested data is already available in either
Apache Parquet or Apache ORC format, you can use the COPY command to
ingest data into Amazon Redshift.
The Amazon Redshift table structure should match the number of columns
and the column data types of the Parquet or ORC files. By specifying
SERIALIZETOJSON in the COPY command, you can load any column type in
the file that aligns with a SUPER column in the table as SUPER. This
includes structure and array types.
COPY foo FROM 's3://bucket/somewhere'
...
FORMAT PARQUET SERIALIZETOJSON;
For me, the last line
...
FORMAT PARQUET SERIALIZETOJSON;
did the trick.

HIVE_CURSOR_ERROR when Querying S3 Inventory With Athena - Is size column correct?

I'm attempting to do some analysis on one of our S3 buckets using Athena and I'm getting some errors that I can't explain or find solutions for anywhere I look.
The guide I'm following is https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html.
I created my S3 inventory yesterday and have now received the first report in S3. The format is Apache ORC, the last export shows as yesterday and the additional fields stored are Size, Last modified, Storage class, Encryption.
I can see the data stored under s3://{my-inventory-bucket}/{my-bucket}/{my-inventory} so I know there is data there.
The default encryption on the inventory bucket and inventory configuration both have SSE-S3 encryption enabled.
To create the table, I am using the following query:
CREATE EXTERNAL TABLE my_table (
`bucket` string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';
Once the table has been created, I load the data using:
MSCK REPAIR TABLE my_table;
The results from loading the data show that data has been loaded:
Partitions not in metastore: my_table=2021-07-17-00-00
Repair: Added partition to metastore my_table=2021-07-17-00-00
Once that's loaded, I verify the data is available using:
SELECT DISTINCT dt FROM my_table ORDER BY 1 DESC limit 10;
Which outputs:
1 2021-07-17-00-00
Now if I run something like the below, everything runs fine and I get the expected results:
SELECT key FROM my_table ORDER BY 1 DESC limit 10;
But as soon as I include the size column, I receive an error:
SELECT key, size FROM my_table ORDER BY 1 DESC limit 10;
Your query has the following error(s):
HIVE_CURSOR_ERROR: Failed to read ORC file: s3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/data/{UUID}.orc
This query ran against the "my_table" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: {UUID}.
I feel like I've got something wrong with my size column. Can anyone help figure this out?
So frustrating. Think I found the answer here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
IsLatest – Set to True if the object is the current version of the object. (This field is not included if the list is only for the current version of objects.)
Removing that column fixed the problem.

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data(stored as parquet files in partitions) in S3.
Some of the things we are interested to find out is how does adding additional columns in where clause of the query affect the query run time.
For example, we have 10million records in one hive partition(partition based on column 'date')
And all queries below return same volume - 10million. would all these queries take same time or does it reduce query run when we add additional columns in where clause(as parquet is columnar fomar)?
I tried to test this but results were not consistent as there was some queuing time as well I guess
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet file contains page "indexes" (min, max and bloom filters.) If you sorting the data by columns in question during insert for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
type,
subtype,
dt
distribute by dt
sort by type, subtype
then these indexes may work efficiently because data withe the same type, subtype will be loaded into the same pages, data pages will be selected using indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Switch-on predicate-push-down: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation and dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual json files I am querying are at the 'hour' level with all year, month, day directories only containing sub directories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its sub directories without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour level directories since I can create a table and define the column name and type represented in my .json file, but I am not sure how to create a table in Athena (if possible) to represent a day directory with sub directories instead of actual files.
To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up, the traditional, and the new way. The new way uses a feature that was released just a couple of days ago and if you try to find more examples of it you may not find many, so I'm going to show you the traditional too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date (please note that you need to quote this in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word, if you want to avoid having to quote it use another name, dt seems popular, also note that it's escaped with backticks in DDL and with double quotes in DML statements). When you query this table and specify a criteria for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, which is a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data – here I'm telling Athena to assume that there exists data on S3 from 2020-06-01 up until the time the query runs (adjust the start date necessary), which means that if you specify a date before that time, or after "now" Athena will know that there is no such data and not even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05' Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020-06-06, s3://cszlos-data/is/here/2020-06-07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straight forward for interactive use is to use ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.

Converting avro to parquet (using hive maybe?)

I'm trying to convert a bunch of multi-part avro files stored on HDFS (100s of GBs) to parquet files (preserving all data)
Hive can read the avro files as an external table using:
CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<location>'
TBLPROPERTIES ('avro.schema.url'='<schema.avsc>');
But when I try to create a parquet table:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
it throws an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<...>
Is it possible to convert uniontype to something that is a valid datatype for the external parquet table?
I'm open to alternative, simpler methods as well. MR? Pig?
Looking for a way that's fast, simple and has minimal dependencies to bother about.
Thanks
Try splitting this:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
into 2 steps:
CREATE EXTERNAL TABLE as_parquet (col1 col1_type, ... , coln coln_type) STORED AS parquet LOCATION 'hdfs:///xyz.parquet';
INSERT INTO TABLE as_parquet SELECT * FROM as_avro;
Or, if you have partitions, which I guess you have for this amount of data:
INSERT INTO TABLE as_parquet PARTITION (year=2016, month=07, day=13) SELECT <all_columns_except_partition_cols> FROM as_avro WHERE year='2016' and month='07' and day='13';
Note:
For step 1, in order to save any typos or small mistakes in columns types and such, you can:
Run SHOW CREATE TABLE as_avro and copy the create statement of as_avro table
Replace the table name, file format and location of the table
Run the new create statement.
This works for me...