Amazon Athena - Column cannot be resolved on basic SQL WHERE query

I am currently evaluating Amazon Athena and Amazon S3.
I have created a database (testdb) with one table (awsevaluationtable). The table has two columns, x (bigint) and y (bigint).
When I run:
SELECT *
FROM testdb."awsevaluationtable"
I get all of the test data.
However, when I try a basic WHERE query:
SELECT *
FROM testdb."awsevaluationtable"
WHERE x > 5
I get:
SYNTAX_ERROR: line 3:7: Column 'x' cannot be resolved
I have tried all sorts of variations:
SELECT * FROM testdb.awsevaluationtable WHERE x > 5
SELECT * FROM awsevaluationtable WHERE x > 5
SELECT * FROM testdb."awsevaluationtable" WHERE X > 5
SELECT * FROM testdb."awsevaluationtable" WHERE testdb."awsevaluationtable".x > 5
SELECT * FROM testdb.awsevaluationtable WHERE awsevaluationtable.x > 5
I have also confirmed that the x column exists with:
SHOW COLUMNS IN awsevaluationtable
This seems like an extremely simple query yet I can't figure out what is wrong. I don't see anything obvious in the documentation. Any suggestions would be appreciated.

In my case, changing double quotes to single quotes resolved this error.
Presto uses single quotes for string literals and double quotes for identifiers.
https://trino.io/docs/current/migration/from-hive.html#use-ansi-sql-syntax-for-identifiers-and-strings
Strings are delimited with single quotes and identifiers are quoted with double quotes, not backquotes:
SELECT name AS "User Name"
FROM "7day_active"
WHERE name = 'foo'
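For instance, a minimal sketch of the pitfall (the name column here is hypothetical, not one of the question's columns):
-- Fails: double quotes make "foo" an identifier, so Athena/Presto reports
-- SYNTAX_ERROR: Column 'foo' cannot be resolved
SELECT * FROM testdb.awsevaluationtable WHERE name = "foo";
-- Works: single quotes denote a string literal
SELECT * FROM testdb.awsevaluationtable WHERE name = 'foo';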

I have edited my response to this issue based on my current findings and my contact with both the AWS Glue and Athena support teams.
We were having the same issue - an inability to query on the first column in our CSV files. The problem comes down to the encoding of the CSV file. In short, AWS Glue and Athena currently do not support CSVs encoded as UTF-8-BOM. If you open up a CSV encoded with a Byte Order Mark (BOM) in Excel or Notepad++, it looks like any comma-delimited text file. However, opening it up in a hex editor reveals the underlying issue: there are a bunch of special characters at the start of the file, i.e. the BOM.
When a UTF-8-BOM CSV file is processed in AWS Glue, it retains these special characters and associates them with the first column name. When you then try to query the first column within Athena, you will get an error.
There are ways around this on AWS:
In AWS Glue, edit the table schema and delete the first column, then reinsert it with the proper column name, OR
In AWS Athena, execute the SHOW CREATE TABLE DDL to script out the problematic table, remove the special characters from the generated script, then run the script to create a new table which you can query (a sketch follows below).
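A minimal sketch of the second workaround, using the table from the question and a hypothetical S3 location:
-- 1. Script out the problematic table:
SHOW CREATE TABLE testdb.awsevaluationtable;
-- 2. In the generated DDL the first column name carries the BOM bytes
--    (it often renders as something like "ï»¿x"). Remove them and recreate:
CREATE EXTERNAL TABLE testdb.awsevaluationtable_fixed (
  x bigint,   -- was the BOM-prefixed column in the generated script
  y bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-prefix/';   -- hypothetical location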
To make your life simple, just make sure your CSVs are encoded as UTF-8.

I noticed that the CSV source of the original table had column headers with capital letters (X and Y), unlike the column names that were being displayed in Athena.
So I removed the table, edited the CSV file so that the headers were lowercase (x and y), then recreated the table, and now it works!

Related

How do I query a CSV dataset that's been imported?

I have imported data from CSV into QuestDB using the web console, but I can't query any values from it. I can see it in the list of tables, but if I run something like this:
select * from imported-data.csv;
I get an error saying:
function, literal or constant is expected
If the table hasn't been renamed, you have to escape the filename with single quotes, i.e.:
select * from 'imported-data.csv';
And if you want to rename the table to something else, you can also easily do this with RENAME:
RENAME TABLE 'imported-data.csv' TO 'newTable';
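For example, once the table has a plain identifier as its name, the quotes are no longer needed (the new name below is just an example):
RENAME TABLE 'imported-data.csv' TO 'imported_data';
SELECT * FROM imported_data;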

How to properly import tsv to athena

I am following this example:
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files - TSV example
Summary of the code:
CREATE EXTERNAL TABLE flight_delays_tsv (
yr INT,
quarter INT,
month INT,
...
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
My questions are:
My tsv does not have column names
Is it OK if I just list the columns as c1, c2, … and declare all of them as string?
I do not understand this:
PARTITIONED BY (year STRING)
In the example, the column ‘year’ is not listed among the other columns…
Column names
The column names are defined by the CREATE EXTERNAL TABLE command. I recommend you name them something useful so that it is easier to write queries. The column names do not need to match any names in the actual file. (It does not interpret header rows.)
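In other words, yes: listing the columns as c1, c2, … and declaring them all as string will work, because Athena takes the names and types from the DDL rather than from the file. A minimal sketch (the table name, column count, and S3 location are assumptions):
CREATE EXTERNAL TABLE flight_delays_noheader (
  c1 STRING,
  c2 STRING,
  c3 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://my-bucket/flight/tsv/';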
Partitioning
From Partitioning Data - Amazon Athena:
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
The field used to partition the data is NOT stored in the files themselves, which is why it does not appear in the main column list. Rather, the column value is stored in the name of the directory.
This might seem strange (storing values in a directory name!) but actually makes sense because it avoids situations where an incorrect value is stored in a folder. For example, if there is a year=2018 folder, what happens if a file contains a column where the year is 2017? This is avoided by storing the year in the directory name, such that any files within that directory are assigned the value denoted in the directory name.
Queries can still use WHERE year = '2018' even though it isn't listed as a regular column.
See also: LanguageManual DDL - Apache Hive - Apache Software Foundation
The other neat thing is that data can be updated by simply moving a file to a different directory. In this example, it would change the year value as a result of being in a different directory.
Yes, it's strange, but the trick is to stop thinking of it like a normal database and appreciate the freedom that it offers. For example, appending new data is as simple as dropping a file into a directory. No loading required!
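A short sketch of how this fits together for the table above (the bucket paths are hypothetical):
-- Directory layout:
--   s3://my-bucket/flight/tsv/year=2018/file1.tsv
--   s3://my-bucket/flight/tsv/year=2019/file2.tsv
-- Register the partition directories so Athena knows about them:
MSCK REPAIR TABLE flight_delays_tsv;
-- or add one partition explicitly:
ALTER TABLE flight_delays_tsv ADD PARTITION (year = '2018')
  LOCATION 's3://my-bucket/flight/tsv/year=2018/';
-- The partition key is queryable even though it is not stored in the files:
SELECT count(*) FROM flight_delays_tsv WHERE year = '2018';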

Skipping header rows in AWS Redshift External Tables

I have a file in S3 with the following data:
name,age,gender
jill,30,f
jack,32,m
And a Redshift external table to query that data using Spectrum:
create external table spectrum.customers (
"name" varchar(50),
"age" int,
"gender" varchar(1))
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 's3://...';
When querying the data I get the following result:
select * from spectrum.customers;
name,age,g
jill,30,f
jack,32,m
Is there an elegant way to skip the header row as part of the external table definition, similar to the tblproperties ("skip.header.line.count"="1") option in Hive? Or is my only option (at least for now) to filter out the header rows as part of the select statement?
Answered this in: How to skip headers when we are reading data from a csv file in s3 and creating a table in aws athena.
This works in Redshift:
You want to use table properties ('skip.header.line.count'='1')
Along with other properties if you want, e.g. 'numRows'='100'.
Here's a sample:
create external table exreddb1.test_table
(ID BIGINT
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
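Applied to the customers table from the question, the definition might look like this (the S3 location is assumed):
create external table spectrum.customers (
"name" varchar(50),
"age" int,
"gender" varchar(1))
row format delimited
fields terminated by ','
stored as textfile
location 's3://my-bucket/customers/'
table properties ('skip.header.line.count'='1');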
Currently, AWS Redshift Spectrum does not support skipping header rows. If you can, raise a support case so the availability of this feature can be tracked; the request can then be forwarded to the development team for consideration.

Postgres Copy select rows from CSV table

This is my first post to stackoverflow. Your forum has been SO very helpful as I've been learning Python and Postgres on the fly for the last 6 months, that I haven't needed to post yet. But this task is tripping me up and I figure I need to start earning reputation points:
I am creating a Python script for backing up data into an SQL database daily. I have a CSV file with an entire month's worth of hourly data, but I only want to select a single day of data from the file and copy those select rows into my database. Am I able to query the CSV table and append the query results into my database? For example:
sys.stdin = open('file.csv', 'r')
cur.copy_expert("COPY table FROM STDIN
SELECT 'yyyymmddpst LIKE 20140131'
WITH DELIMITER ',' CSV HEADER", sys.stdin)
This code and other variations aren't working out - I keep getting syntax errors. Can anyone help me out with this task? Thanks!!
You need to create a temporary table first:
cur.execute('CREATE TEMPORARY TABLE "temp_table" (LIKE "your_table")')
Then copy the data from the CSV:
cur.execute("COPY temp_table FROM '/full/path/to/file.csv' WITH CSV HEADER DELIMITER ','")
Insert the necessary records:
cur.execute("INSERT INTO your_table SELECT * FROM temp_table WHERE yyyymmddpst LIKE 20140131")
And don't forget to do conn.commit().
The temp table will be dropped automatically when the connection is closed.
You can COPY (SELECT ...) TO an external file, because PostgreSQL just has to read the rows from the query and send them to the client.
The reverse is not true. You can't COPY (SELECT ....) FROM ... . If it were a simple SELECT PostgreSQL could try to pretend it was a view, but really it doesn't make much sense, and in any case it'd apply to the target table, not the source rows. So the code you wrote wouldn't do what you think it does, even if it worked.
In this case you can create an unlogged or temporary table, copy the full CSV to it, and then use SQL to extract just the rows you want, as pointed out by Dmitry.
An alternative is to use the file_fdw to map the CSV file as a table. The CSV isn't copied, it's just read on demand. This lets you skip the temporary table step.
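A sketch of the file_fdw approach (the server, foreign table, and second column name are hypothetical; the path and filter column follow the question):
-- One-time setup:
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
-- Map the CSV file as a read-on-demand table:
CREATE FOREIGN TABLE hourly_csv (
  yyyymmddpst text,
  reading     numeric
) SERVER csv_files
  OPTIONS (filename '/full/path/to/file.csv', format 'csv', header 'true');
-- Pull only the rows you want straight into the target table:
INSERT INTO your_table
SELECT * FROM hourly_csv
WHERE yyyymmddpst LIKE '20140131';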
From PostgreSQL 12 onward, you can add a WHERE clause to your COPY statement and get only the rows that match the condition.
So your COPY statement could look like:
COPY table
FROM '/full/path/to/file.csv'
WITH (FORMAT csv, HEADER, DELIMITER ',')
WHERE yyyymmddpst LIKE '20140131'

Create DynamoDB tables using Hive

I have, in my cloud, inside an S3 bucket, a CSV file with some data.
I would like to export that data into a DynamoDB table with columns "key" and "value".
Here's the current hive script I wrote:
CREATE EXTERNAL TABLE FromCSV(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ', '
LOCATION 's3://mybucket/output/';
CREATE EXTERNAL TABLE hiveTransfer(col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InvertedIndex",
"dynamodb.column.mapping" = "col1:key,col2:value");
INSERT OVERWRITE TABLE hiveTransfer SELECT * FROM FromCSV;
Now, basically the script works, though I would like to make some modifications to it as follows:
1) The script works only if the table "InvertedIndex" already exists in DynamoDB. I would like the script to create the new table by itself and then put in the data as it already does.
2) In the CSV the key is always a string but I have 2 kinds of values, string or integer. I would like the script to distinguish between the two and make two different tables.
Any help with those two modifications will be appreciated.
Thank you
Hi, this could be accomplished, but it is not a trivial case.
1) Creating a DynamoDB table can't be done by Hive, because DynamoDB tables are managed by the Amazon cloud. One thing that comes to mind is to write a Hive UDF that creates the DynamoDB table and call it inside some dummy query before running the insert. For example:
SELECT CREATE_DYNO_TABLE() FROM dummy;
where the dummy table has only one record.
2) You can split the loading into two queries: in one query, use the RLIKE operator with a [0-9]+ regular expression to detect numeric values, and in the other, just the negation of that.
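A minimal sketch of that split, assuming two target tables (hiveTransferInt and hiveTransferString), each mapped to its own DynamoDB table via DynamoDBStorageHandler as above; the regex is anchored so that only whole-number values count as numeric:
INSERT OVERWRITE TABLE hiveTransferInt
SELECT key, value FROM FromCSV
WHERE value RLIKE '^[0-9]+$';
INSERT OVERWRITE TABLE hiveTransferString
SELECT key, value FROM FromCSV
WHERE NOT (value RLIKE '^[0-9]+$');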
HTH,
Dino