How to properly import TSV to Athena - amazon-web-services

I am following this example:
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files - TSV example
Summary of the code:
CREATE EXTERNAL TABLE flight_delays_tsv (
yr INT,
quarter INT,
month INT,
...
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
My questions are:
My TSV does not have column names
(my tsv)
Is it OK if I just list the columns as c1, c2, … and make all of them STRING?
I do not understand this:
PARTITIONED BY (year STRING)
In the example, the column ‘year’ is not listed among the other columns…

Column names
The column names are defined by the CREATE EXTERNAL TABLE command. I recommend you name them something useful so that it is easier to write queries. The column names do not need to match any names in the actual file. (Athena does not interpret header rows.)
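Listing the columns as c1, c2, … with type STRING would work, but meaningful names make queries easier to read. A minimal sketch for a headerless TSV, assuming made-up column names and a made-up S3 location (substitute your own):
CREATE EXTERNAL TABLE my_tsv_table (
  order_date STRING,
  customer_id STRING,
  amount STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://my-bucket/my-tsv-prefix/';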
Partitioning
From Partitioning Data - Amazon Athena:
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
The fields used to partition the data are NOT stored in the files themselves, which is why they are not in the column list. Rather, the partition value is stored in the name of the directory.
This might seem strange (storing values in a directory name!) but actually makes sense because it avoids situations where an incorrect value is stored in a folder. For example, if there is a year=2018 folder, what happens if a file contains a column where the year is 2017? This is avoided by storing the year in the directory name, such that any files within that directory are assigned the value denoted in the directory name.
Queries can still use WHERE year = '2018' even though year isn't stored as a column in the files.
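As a hedged sketch (the bucket and paths below are made up), the year value lives only in the directory name; you register the directory as a partition and can then filter on it:
-- hypothetical layout: the year appears only in the directory names
--   s3://my-bucket/flight/tsv/year=2017/file1.tsv
--   s3://my-bucket/flight/tsv/year=2018/file2.tsv
ALTER TABLE flight_delays_tsv ADD PARTITION (year = '2018')
LOCATION 's3://my-bucket/flight/tsv/year=2018/';

SELECT count(*) FROM flight_delays_tsv WHERE year = '2018';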
See also: LanguageManual DDL - Apache Hive - Apache Software Foundation
The other neat thing is that data can be updated by simply moving a file to a different directory. In this example, it would change the year value as a result of being in a different directory.
Yes, it's strange, but the trick is to stop thinking of it like a normal database and appreciate the freedom that it offers. For example, appending new data is as simple as dropping a file into a directory. No loading required!

Related

When storing an Impala table as textfile, is it possible to tell it to save column names in the textfile?

I have created an Impala table as
create table my_schema.my_table stored as textfile as select ...
As per the definition, the table has its data stored in text files somewhere in HDFS. Now when I run an HDFS command such as:
hadoop fs -cat path_to_file | head
I do not see any column names. I suppose Impala stores the column names somewhere else, but since I would like to work with these text files outside of Impala as well, it would be great if the files included the headers.
Is there some option I can set when creating the table to add the headers to the text files? Or do I need to figure out the names by parsing the results of show create table?

Athena Partition Projection for Date column vs. String

I'm looking to use Athena Partition Projection to analyze log files from AWS Application Load Balancers and Firehose-emitted logs. The data in S3 is prefixed with year/month/day and potentially hour as well. I've been able to accomplish this using the Firehose example; however, that example uses a string-formatted partition column.
I'm looking to see if it's possible to use a date-formatted partition column instead (with partition projection and the Firehose-emitted S3 prefix format), as our query writers are already used to most of our queries involving date columns, and it avoids the need to format strings for relative date queries. Is this possible, or would the S3 prefixes need to be changed to accomplish this?
Table Properties for String column: WORKS
PARTITIONED BY (
`logdate` string)
TBLPROPERTIES (
'projection.enabled'='true',
'projection.logdate.format'='yyyy/MM/dd',
'projection.logdate.interval'='1',
'projection.logdate.interval.unit'='DAYS',
'projection.logdate.range'='NOW-2YEARS,NOW',
'projection.logdate.type'='date',
'storage.location.template'='s3://bucket/prefix/${logdate}')
Table Properties for Date Partition column: DOES NOT WORK
PARTITIONED BY (
`logdate` date)
TBLPROPERTIES (
'projection.enabled'='true',
'projection.logdate.format'='yyyy/MM/dd',
'projection.logdate.interval'='1',
'projection.logdate.interval.unit'='DAYS',
'projection.logdate.range'='NOW-2YEARS,NOW',
'projection.logdate.type'='date',
'storage.location.template'='s3://bucket/prefix/${logdate}')
HIVE_INVALID_PARTITION_VALUE: Invalid partition value '2018/11/13' for DATE partition key: logdate=2018%2F11%2F13
I think the only thing you need to do is make sure the type of the logdate partition key is string:
PARTITIONED BY (logdate string)
This is not the same as projection.logdate.type, which should continue to be date.
Partition keys with type date are treated as dates only within the calculations that partition projection performs. For all other purposes they are strings. PP will parse values using the date format you specify, do its calculations, then output strings using the same date format. This happens during query planning, before the Presto engine is involved.
Presto's schema-on-read approach means that you can say a column has type date if its values match the expected format of dates, i.e. yyyy-MM-dd in Java date format. The format you get from Firehose's S3 keys, yyyy/MM/dd, can't be cast to date automatically; it needs to be parsed explicitly:
parse_datetime(logdate, 'yyyy/MM/dd')
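As an illustrative sketch (the alb_logs table name here is a placeholder), one way to keep relative-date queries while the partition key stays a string is to filter on the raw yyyy/MM/dd string, which partition projection can prune on, and parse it only where an actual date value is needed:
-- yyyy/MM/dd sorts lexicographically in date order, so the string comparison is safe
SELECT date(parse_datetime(logdate, 'yyyy/MM/dd')) AS log_day,
       count(*) AS requests
FROM alb_logs
WHERE logdate >= format_datetime(date_add('day', -7, current_date), 'yyyy/MM/dd')
GROUP BY 1
ORDER BY 1;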
It would have been great if PP were aware of the types of partition keys, so that what you tried would just work, but since PP happens during query planning, most likely nowhere near where column types are known, I assume it is too difficult to achieve.

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data and AWS Lambda performing data transformation before dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual JSON files I am querying are at the 'hour' level, with the year, month, and day directories only containing subdirectories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its subdirectories, without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour-level directories, since I can create a table and define the column names and types represented in my .json files, but I am not sure how to create a table in Athena (if possible) to represent a day directory that contains subdirectories instead of actual files.
To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up: the traditional way and the new way. The new way uses a feature that was released just a couple of days ago, and if you try to find more examples of it you may not find many, so I'm going to show you the traditional way too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date. Note that you need to quote the name in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word; if you want to avoid having to quote it, use another name (dt seems popular). Also note that it's escaped with backticks in DDL and with double quotes in DML statements. When you query this table and specify a criterion for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data. Here I'm telling Athena to assume that there is data on S3 from 2020-06-01 up until the time the query runs (adjust the start date as necessary), which means that if you specify a date before that time, or after "now", Athena knows that there is no such data and won't even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05', Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020/06/06, s3://cszlos-data/is/here/2020/06/07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straightforward for interactive use is ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
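Once the partitions are registered, a query that restricts the partition key only reads the matching locations, for example:
SELECT *
FROM cszlos_firehose_data
WHERE "date" = '2020-06-07';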
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive-style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it here.
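For reference, this is what that statement looks like; it would only help if the data were laid out in key=value form matching the table's partition key:
-- MSCK only discovers partitions laid out as <partition_key>=<value>, e.g. for this
-- table the objects would need to live under s3://cszlos-data/is/here/date=2020-06-07/
MSCK REPAIR TABLE cszlos_firehose_data;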

Dataprep importing files with different number of columns into a dataset

I am trying to create a parameterized dataset that imports files from GCS and stacks them under each other. This all works fine (Import Data > Parameterize).
To give a bit of context, each day I store a .csv file with a different name referring to that date.
Now it happens that my provider added a new column to the files last month. This means that files from before this date have 8 columns, whereas files from this date on have 9 columns.
However, when I parameterize, Dataprep only takes into account the columns that match across files (thus only 8 columns). Ideally I would want empty observations for the rows coming from files that did not have this new column.
How can this be achieved?
Datasets with parameters only work on a fixed schema, as mentioned in the documentation:
Avoid creating datasets with parameters where individual files or tables have differing schemas.
This fixed schema is generated using one of the files found during the creation of the dataset with parameters.
If the schema has changed, then you can "refresh" it by editing the dataset with parameters and clicking save. If all the matching files contain 9 columns, you should now see 9 columns in the transformer.

Amazon Athena - Column cannot be resolved on basic SQL WHERE query

I am currently evaluating Amazon Athena and Amazon S3.
I have created a database (testdb) with one table (awsevaluationtable). The table has two columns, x (bigint) and y (bigint).
When I run:
SELECT *
FROM testdb."awsevaluationtable"
I get all of the test data.
However, when I try a basic WHERE query:
SELECT *
FROM testdb."awsevaluationtable"
WHERE x > 5
I get:
SYNTAX_ERROR: line 3:7: Column 'x' cannot be resolved
I have tried all sorts of variations:
SELECT * FROM testdb.awsevaluationtable WHERE x > 5
SELECT * FROM awsevaluationtable WHERE x > 5
SELECT * FROM testdb."awsevaluationtable" WHERE X > 5
SELECT * FROM testdb."awsevaluationtable" WHERE testdb."awsevaluationtable".x > 5
SELECT * FROM testdb.awsevaluationtable WHERE awsevaluationtable.x > 5
I have also confirmed that the x column exists with:
SHOW COLUMNS IN sctawsevaluation
This seems like an extremely simple query yet I can't figure out what is wrong. I don't see anything obvious in the documentation. Any suggestions would be appreciated.
In my case, changing double quotes to single quotes resolves this error.
Presto uses single quotes for string literals, and uses double quotes for identifiers.
https://trino.io/docs/current/migration/from-hive.html#use-ansi-sql-syntax-for-identifiers-and-strings
Strings are delimited with single quotes and identifiers are quoted with double quotes, not backquotes:
SELECT name AS "User Name"
FROM "7day_active"
WHERE name = 'foo'
I have edited my response to this issue based on my current findings and my contact with both the AWS Glue and Athena support teams.
We were having the same issue - an inability to query on the first column in our CSV files. The problem comes down to the encoding of the CSV file. In short, AWS Glue and Athena currently do not support CSVs encoded as UTF-8 with a byte order mark (BOM). If you open up a BOM-encoded CSV in Excel or Notepad++, it looks like any comma-delimited text file. However, opening it up in a hex editor reveals the underlying issue: there are a few extra bytes (EF BB BF) at the start of the file, i.e. the BOM.
When a UTF-8-BOM CSV file is processed in AWS Glue, it retains these special characters and associates them with the first column name. When you then try to query on the first column within Athena, you get an error.
There are ways around this on AWS:
In AWS Glue, edit the table schema and delete the first column, then reinsert it back with the proper column name, OR
In AWS Athena, execute the SHOW CREATE TABLE DDL to script out the problematic table, remove the special character in the generated script, then run the script to create a new table which you can query on (a rough sketch of this follows below).
To make your life simple, just make sure your CSVs are encoded as plain UTF-8 without a BOM.
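A rough sketch of that second workaround (the ROW FORMAT and LOCATION below are placeholders; the real DDL comes from your own SHOW CREATE TABLE output):
-- 1. Script out the problematic table and inspect the first column name
SHOW CREATE TABLE testdb.awsevaluationtable;

-- 2. Re-run the generated DDL under a new name, with the BOM characters
--    removed from the first column name
CREATE EXTERNAL TABLE testdb.awsevaluationtable_clean (
  x bigint,
  y bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/evaluation-data/';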
I noticed that the CSV source of the original table had column headers with capital letters (X and Y), unlike the column names that were being displayed in Athena.
So I removed the table, edited the CSV file so that the headers were lowercase (x and y), then recreated the table, and now it works!