I have a file in S3 with the following data:
name,age,gender
jill,30,f
jack,32,m
And a redshift external table to query that data using spectrum:
create external table spectrum.customers (
"name" varchar(50),
"age" int,
"gender" varchar(1))
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 's3://...';
When querying the data I get the following result:
select * from spectrum.customers;
name,age,g
jill,30,f
jack,32,m
Is there an elegant way to skip the header row as part of the external table definition, similar to the tblproperties ("skip.header.line.count"="1") option in Hive? Or is my only option (at least for now) to filter out the header rows as part of the select statement?
Answered this in: How to skip headers when we are reading data from a csv file in s3 and creating a table in aws athena.
This works in Redshift:
You want to use the table property 'skip.header.line.count'='1', along with any other properties you need, e.g. 'numRows'='100'.
Here's a sample:
create external table exreddb1.test_table
(ID BIGINT
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
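If the external table already exists, the same property can also be applied after the fact; a minimal sketch for the asker's spectrum.customers table (assuming Redshift's ALTER TABLE ... SET TABLE PROPERTIES syntax for external tables):
alter table spectrum.customers
set table properties ('skip.header.line.count'='1');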
Currently, AWS Redshift Spectrum does not support skipping header rows. If you can, raise a support case so the availability of this feature can be tracked; the request can then be forwarded to the development team for consideration.
Related
I am trying to copy Google Analytics data into Redshift via Parquet format. When I limit the columns to a few select fields, I am able to copy the data. But on including a few specific columns I get an error:
ERROR: External Catalog Error. Detail: ----------------------------------------------- error: External Catalog Error. code: 16000 context: Unsupported column type found for column: 6. Remove the column from the projection to continue. query: 18669834 location: s3_request_builder.cpp:2070 process: padbmaster [pid=23607] -----------------------------------------------
I know the issue is most probably with the data, but I am not sure how I can debug it, as this error is not helpful in any way. I have tried changing the data types of the columns to SUPER, but without any success. I am not using Redshift Spectrum here.
I found the solution. The error message says Unsupported column type found for column: 6. Redshift column ordinality starts from 0; I was counting columns from 1 instead of 0 (my mistake). So the issue was with column 6 (which I had been reading as column 7), which was a string or varchar column in my case. I created a table with just this column and tried uploading data into just this column. Then I got:
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': '\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15001\n context: The length of the data column display_name is longer than the length defined in the table. Table: 256, Data: 1020
Recreating those columns as varchar(max) solved the issue.
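For reference, a minimal sketch of the kind of redefinition that resolved it (the table name and the placeholder comment are hypothetical; only display_name comes from the error above):
create table ga_data (
-- other columns as before
display_name varchar(max)
);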
I assume you have semistructured data in your parquet (like an array).
In this case, you can have a look at the very bottom of this page: https://docs.aws.amazon.com/redshift/latest/dg/ingest-super.html
It says:
If your semistructured or nested data is already available in either
Apache Parquet or Apache ORC format, you can use the COPY command to
ingest data into Amazon Redshift.
The Amazon Redshift table structure should match the number of columns
and the column data types of the Parquet or ORC files. By specifying
SERIALIZETOJSON in the COPY command, you can load any column type in
the file that aligns with a SUPER column in the table as SUPER. This
includes structure and array types.
COPY foo FROM 's3://bucket/somewhere'
...
FORMAT PARQUET SERIALIZETOJSON;
For me, the last line
...
FORMAT PARQUET SERIALIZETOJSON;
did the trick.
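For context, a fuller sketch of that COPY statement (the table name and bucket path come from the quoted example; the IAM role is a placeholder, not taken from the original post):
COPY foo
FROM 's3://bucket/somewhere'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT PARQUET SERIALIZETOJSON;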
I'm attempting to do some analysis on one of our S3 buckets using Athena and I'm getting some errors that I can't explain or find solutions for anywhere I look.
The guide I'm following is https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html.
I created my S3 inventory yesterday and have now received the first report in S3. The format is Apache ORC, the last export shows as yesterday and the additional fields stored are Size, Last modified, Storage class, Encryption.
I can see the data stored under s3://{my-inventory-bucket}/{my-bucket}/{my-inventory} so I know there is data there.
The default encryption on the inventory bucket and inventory configuration both have SSE-S3 encryption enabled.
To create the table, I am using the following query:
CREATE EXTERNAL TABLE my_table (
`bucket` string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';
Once the table has been created, I load the data using:
MSCK REPAIR TABLE my_table;
The results from loading the data show that data has been loaded:
Partitions not in metastore: my_table=2021-07-17-00-00
Repair: Added partition to metastore my_table=2021-07-17-00-00
Once that's loaded, I verify the data is available using:
SELECT DISTINCT dt FROM my_table ORDER BY 1 DESC limit 10;
Which outputs:
1 2021-07-17-00-00
Now if I run something like the below, everything runs fine and I get the expected results:
SELECT key FROM my_table ORDER BY 1 DESC limit 10;
But as soon as I include the size column, I receive an error:
SELECT key, size FROM my_table ORDER BY 1 DESC limit 10;
Your query has the following error(s):
HIVE_CURSOR_ERROR: Failed to read ORC file: s3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/data/{UUID}.orc
This query ran against the "my_table" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: {UUID}.
I feel like I've got something wrong with my size column. Can anyone help figure this out?
So frustrating. Think I found the answer here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
IsLatest – Set to True if the object is the current version of the object. (This field is not included if the list is only for the current version of objects.)
Removing that column fixed the problem.
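For reference, a sketch of the adjusted DDL, i.e. the definition above with the is_latest column dropped (if the inventory covers current versions only, version_id and is_delete_marker are also version-related fields and may need to be removed as well):
CREATE EXTERNAL TABLE my_table (
  `bucket` string,
  key string,
  version_id string,
  is_delete_marker boolean,
  size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';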
I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation and dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual json files I am querying are at the 'hour' level with all year, month, day directories only containing sub directories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its sub directories without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour level directories since I can create a table and define the column name and type represented in my .json file, but I am not sure how to create a table in Athena (if possible) to represent a day directory with sub directories instead of actual files.
To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up: the traditional way and the new way. The new way uses a feature that was released just a couple of days ago, and if you try to find more examples of it you may not find many, so I'm going to show you the traditional way too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date. Please note that you need to quote date in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word; if you want to avoid having to quote it, use another name (dt seems popular). Also note that it's escaped with backticks in DDL and with double quotes in DML statements. When you query this table and specify a criterion for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data. Here I'm telling Athena to assume that there exists data on S3 from 2020-06-01 up until the time the query runs (adjust the start date as necessary), which means that if you specify a date before that time, or after "now", Athena will know that there is no such data and won't even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05', Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020/06/06, s3://cszlos-data/is/here/2020/06/07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
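For example, a whole-day query against this table would look something like this (using the asker's 2020/06/15 example day):
SELECT * FROM cszlos_firehose_data WHERE "date" = '2020/06/15';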
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straightforward for interactive use is ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.
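After the partitions have been added, a whole-day query works the same way as with projection, e.g. (a sketch matching the partition values added above):
SELECT * FROM cszlos_firehose_data WHERE "date" = '2020-06-06';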
I am currently evaluating Amazon Athena and Amazon S3.
I have created a database (testdb) with one table (awsevaluationtable). The table has two columns, x (bigint) and y (bigint).
When I run:
SELECT *
FROM testdb."awsevaluationtable"
I get all of the test data.
However, when I try a basic WHERE query:
SELECT *
FROM testdb."awsevaluationtable"
WHERE x > 5
I get:
SYNTAX_ERROR: line 3:7: Column 'x' cannot be resolved
I have tried all sorts of variations:
SELECT * FROM testdb.awsevaluationtable WHERE x > 5
SELECT * FROM awsevaluationtable WHERE x > 5
SELECT * FROM testdb."awsevaluationtable" WHERE X > 5
SELECT * FROM testdb."awsevaluationtable" WHERE testdb."awsevaluationtable".x > 5
SELECT * FROM testdb.awsevaluationtable WHERE awsevaluationtable.x > 5
I have also confirmed that the x column exists with:
SHOW COLUMNS IN sctawsevaluation
This seems like an extremely simple query, yet I can't figure out what is wrong. I don't see anything obvious in the documentation. Any suggestions would be appreciated.
In my case, changing double quotes to single quotes resolved this error.
Presto uses single quotes for string literals and double quotes for identifiers.
https://trino.io/docs/current/migration/from-hive.html#use-ansi-sql-syntax-for-identifiers-and-strings
Strings are delimited with single quotes and identifiers are quoted with double quotes, not backquotes:
SELECT name AS "User Name"
FROM "7day_active"
WHERE name = 'foo'
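Applied to the original query, only identifiers would take double quotes (quoting x is optional here) and any string values would take single quotes; a sketch:
SELECT * FROM testdb."awsevaluationtable" WHERE "x" > 5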
I have edited my response to this issue based on my current findings and my contact with both the AWS Glue and Athena support teams.
We were having the same issue - an inability to query on the first column in our CSV files. The problem comes down to the encoding of the CSV file. In short, AWS Glue and Athena currently do not support CSVs encoded as UTF-8-BOM. If you open up a CSV encoded with a Byte Order Mark (BOM) in Excel or Notepad++, it looks like any comma-delimited text file. However, opening it up in a hex editor reveals the underlying issue: there are a few special characters at the start of the file, i.e. the BOM.
When a UTF-8-BOM CSV file is processed in AWS Glue, it retains these special characters and associates them with the first column name. When you try to query on the first column within Athena, you will get an error.
There are ways around this on AWS:
In AWS Glue, edit the table schema and delete the first column, then reinsert it back with the proper column name, OR
In AWS Athena, execute the SHOW CREATE TABLE DDL to script out the problematic table, remove the special character in the generated script, then run the script to create a new table which you can query on.
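A minimal sketch of the second option (the table name is hypothetical):
SHOW CREATE TABLE my_csv_table;
-- In the DDL this generates, remove the BOM characters from the first column name,
-- change the table name, and run the edited statement to create a clean copy of the table.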
To make your life simple, just make sure your CSVs are encoded as UTF-8.
I noticed that the csv source of the original table had column headers with capital letters (X and Y) unlike the column names that were being displayed in Athena.
So I removed the table, edited the csv file so that the headers were lowercase (x and y), then recreated the table and now it works!
I have in my cloud, inside a S3 bucket, a CSV file with some data.
I would like to export that data into a DynamoDB table with columns "key" and "value".
Here's the current hive script I wrote:
CREATE EXTERNAL TABLE FromCSV(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ', '
LOCATION 's3://mybucket/output/';
CREATE EXTERNAL TABLE hiveTransfer(col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InvertedIndex",
"dynamodb.column.mapping" = "col1:key,col2:value");
INSERT OVERWRITE TABLE hiveTransfer SELECT * FROM FromCSV;
Now, basically the script works, though I would like to make some modifications to it as follows:
1) The script works only if the table "InvertedIndex" already exists in DynamoDB; I would like the script to create the new table by itself and then put the data in as it already does.
2) In the CSV the key is always a string, but I have two kinds of values, string or integer. I would like the script to distinguish between the two and make two different tables.
Any help with those two modifications will be appreciated.
Thank you
Hi, this could be accomplished, but it is not a trivial case.
1) Creating the DynamoDB table can't be done by Hive, because DynamoDB tables are managed by the Amazon cloud. One thing that comes to mind is to create a Hive UDF that creates the DynamoDB table and call it inside some dummy query before running the insert. For example:
SELECT CREATE_DYNO_TABLE() FROM dummy;
Where the dummy table has only one record.
2) You can split the loading into two queries: in one query you use the RLIKE operator with a [0-9]+ regular expression to detect numeric values, and in the other just the negation of that.
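A rough sketch of that split (the two target table names are hypothetical, and both DynamoDB-backed tables are assumed to exist already):
INSERT OVERWRITE TABLE hiveTransferNumeric
SELECT key, value FROM FromCSV WHERE value RLIKE '^[0-9]+$';
INSERT OVERWRITE TABLE hiveTransferString
SELECT key, value FROM FromCSV WHERE NOT (value RLIKE '^[0-9]+$');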
HTH,
Dino