Adding LIMIT fixes "Invalid digit, Value N" error in Amazon Redshift. Why?

I have a standard listings table in Redshift with all columns as varchars (a side effect of how the data is loaded into the database).
This query (simplified) gives me an error:
with AL as (
select
L.price::int as price
from listings L
where L.price <> 'NULL'
and L.listing_type <> 'NULL'
)
select price from AL
where price < 800
and the error:
-----------------------------------------------
error: Invalid digit, Value 'N', Pos 0, Type: Integer
code: 1207
context: NULL
query: 2422868
location: :0
process: query0_24 [pid=0]
-----------------------------------------------
If I remove the where price < 800 condition, the query returns just fine... but I need that condition to be there.
I've also checked the numeric validity of the price field, and all the values look good.
After playing around, the following actually makes it work, and I can't quite explain why.
with AL as (
select
L.price::int as price
from listings L
where L.price <> 'NULL'
and L.listing_type <> 'NULL'
limit 10000000000
)
select price from AL
where price < 800
Note that the table has far fewer records than the number stated in the LIMIT.
Can anyone (possibly from the Redshift engineering team) explain why this is the way it is? Possibly something to do with how the query plan is executed and parallelized?
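A plausible explanation (not confirmed by the Redshift team) is predicate pushdown: the optimizer may push the price < 800 filter down into the CTE, so the ::int cast can be evaluated on rows before the <> 'NULL' filters have excluded them, while the LIMIT forces the CTE to be materialized first and blocks that reordering. A rewrite that does not depend on evaluation order guards the cast itself. A minimal sketch, assuming valid prices contain only digits:
with AL as (
select
case when L.price ~ '^[0-9]+$' then L.price::int end as price
from listings L
where L.price <> 'NULL'
and L.listing_type <> 'NULL'
)
select price from AL
where price < 800
Here the CASE returns NULL for any non-numeric value, so the cast is safe no matter when the outer filter runs.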

I had a query that could be expressed simply as:
SELECT TOP 10 field1, field2
FROM table1
INNER JOIN table2
ON table1.field3::int = table2.field3
ORDER BY table1.field1 DESC
Removing the explicit ::int cast solved a similar error for me.
Meanwhile, PostgreSQL locally requires the ::int for the query to work.
For what it's worth, my local PostgreSQL version is
PostgreSQL 9.6.4 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
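If the cast cannot simply be dropped, it can help to first find the offending values. A quick diagnostic sketch (table and column names taken from the query above), using Redshift's POSIX regex operator:
-- list values of field3 that are not plain integers and would break the cast
SELECT field3
FROM table1
WHERE field3 !~ '^-?[0-9]+$'
LIMIT 10;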

Loading CSV data with NaN into AWS Redshift
I found this post while searching Google, but the link above had what I needed. I was importing a numeric column containing the value NaN, which Redshift's numeric type does not support.
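One way to handle this is to have COPY load the literal string NaN as NULL via the NULL AS option. A sketch with hypothetical bucket and role names:
COPY my_table
FROM 's3://my-bucket/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ','
IGNOREHEADER 1
NULL AS 'NaN';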

Related

Column does not exist AWS Timestream Query error

I am trying to apply a WHERE clause on a DIMENSION of the AWS Timestream records. However, I got the error: Column does not exist.
Here is my table schema:
[screenshot: the table schema and measures]
First, I will show all the sample data I put in the table
SELECT username, time, manual_usage
FROM "meter-reading"."meter-metrics"
ORDER BY time DESC
LIMIT 4
The result:
[screenshot: the query result]
What I wanted to do is to query and filter the records by the Dimension ("username" specifically).
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = "OnceADay"
ORDER BY time DESC LIMIT 10
Then I got the Error: Column 'OnceADay' does not exist
I tried searching the quotas for Dimension names and checking my schema for errors:
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.naming
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.system_identifier
But I didn't find that my "username" dimension violated any of the above rules.
I also checked some other example queries in an AWS blog post, where the author uses a WHERE clause on a Dimension normally:
https://aws.amazon.com/blogs/database/effective-queries-for-common-query-patterns-in-amazon-timestream/
I figured it out after trying the sample code. It turned out to be a silly mistake, I believe.
Using single quotes (') instead of double quotation marks (") solved my problem: in Timestream's query language, double quotes delimit identifiers such as column names, while single quotes delimit string literals, so "OnceADay" was being parsed as a column reference.
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = 'OnceADay'
ORDER BY time DESC LIMIT 10

Data type shifts in amazon redshift

I am working on loading my data from S3 into Redshift. I noticed a shift in the data types in my query from the Redshift error logs.
This is the table I am creating...
main_covid_table_create = ("""
CREATE TABLE IF NOT EXISTS main_covid_table(
SNo INT IDENTITY(1, 1),
ObservationDate DATE,
state VARCHAR,
country VARCHAR,
lastUpdate DATE,
Confirmed DOUBLE PRECISION,
Deaths DOUBLE PRECISION,
Recovered DOUBLE PRECISION
)
""")
with the COPY command as
staging_main_covid_table_copy = ("""
COPY main_covid_table
FROM {}
iam_role {}
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT AS 'auto'
NULL AS 'NA'
""").format(COVID_DATA, IAM_ROLE)
I get this error from Redshift after running the script:
[screenshot: the error output]
My interpretation of this error is that the data type of lastUpdate is being used for the country column. Can anyone help with this?
Presumably, your error output is from STL_LOAD_ERRORS, in which case the third-to-last column is defined as: "The pre-parsing value for the field "colname" that lead to the parsing error."
Thus, it is saying that there is a problem with country, and that it is trying to interpret it as a date. This does not make sense given the definitions you have provided. In fact, it looks as if it is trying to load the header line as data, which again doesn't make sense given the presence of IGNOREHEADER 1. It also looks like there is a column misalignment.
I recommend that you examine the full error details from the STL_LOAD_ERRORS row, including the colname, and try to figure out what is happening with the data. You could start with just one line of data in the file and see whether it works, then keep adding the data back to find what is breaking the load.
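A sketch of a query for pulling those details (all columns named here exist in STL_LOAD_ERRORS):
-- most recent load errors, with the raw value that failed to parse
SELECT starttime, filename, line_number, colname, type,
raw_field_value, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
Comparing colname, type, and raw_field_value row by row usually makes a column misalignment obvious.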

"Where clause" is not working in AWS Athena

I used the AWS Glue console to create a table from an S3 bucket in Athena. You can see the relevant part in the screenshot above. I obfuscated the column name, so assume the column name is "a test column". I would like to select the records with value D in that column. The query I tried to run is:
SELECT
*
FROM
table
WHERE
"a test column" = "D"
Nothing is returned. I also tried to use IS instead of =, as well as to surround D with single quotes instead of double quotes within the WHERE clause:
-- Tried this
WHERE
"a test column" = 'D'
-- Tried this
WHERE
"a test column" IS "D"
-- Tried this
WHERE
"a test column" IS 'D'
Nothing works. Can someone help? Thank you.
The error message I got is
Mismatched input 'where' expecting (service: amazon athena; status code: 400; error code: invalid request exception; request id: 8f2f7c17-8832-4e34-8fb2-a78855e3c17d)
The problem is with the query syntax. Use single quotes (') when you refer to string values, because double quotes refer to column names in your table.
SELECT
*
FROM
table
WHERE
"column_name" = 'D'
The unexpected answer (and apologies if I did not say it clearly in the original post) is that I cannot add "limit 200" before the WHERE clause; it has to go at the end. Hope it helps others.
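To illustrate the clause order (a sketch using the placeholder names from above):
-- works: LIMIT comes after WHERE (and after ORDER BY, if any)
SELECT *
FROM table
WHERE "a test column" = 'D'
LIMIT 200
-- fails with "mismatched input 'where'": LIMIT placed before WHERE
-- SELECT * FROM table LIMIT 200 WHERE "a test column" = 'D'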

Date ranges in where clause of a proc SQL statement

There is a large table containing, among other fields, the following:
ID, effective_date, expiration_date.
expiration_date is in datetime20. format and can be NULL.
I'm trying to extract rows that expire after Dec 31, 2014 or do not expire (NULL).
Adding the following where clause to the proc sql query gives me no results:
where coalesce(datepart(expiration_date),input('31/Dec/2020',date11.))
> input('31/Dec/2014',date11.);
However, when I only select NULL expiration dates and add the following fields:
put(coalesce(datepart(expiration_date),input('31/Dec/2020',date11.)),date11.) as value,
put(input('31/Dec/2014',date11.),date11.) as threshold,
case when coalesce(datepart(expiration_date),input('31/Dec/2020',date11.)) > input('31/Dec/2014',date11.)
then 'pass' else 'fail' end as tag
It shows 'pass' under TAG and all the other fields are correct.
This is an effort to duplicate what I used in SQL Server
where isnull(expiration_date,'9999-12-31') > '2014-12-31'
I am using SAS Enterprise Guide 7.1, and while trying to figure it out I've been using
proc sql inobs=100;
What am I doing wrong? Thank you.
Some Expiration Dates:
30OCT2015:00:00:00
30OCT2015:00:00:00
29OCT2015:00:00:00
30OCT2015:00:00:00
I would recommend using a date constant ("31DEC2014"d) rather than date functions, or else either using explicit passthrough or disabling implicit passthrough. Date functions are challenging when moving between databases, so avoiding them when possible is best.
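A minimal sketch of that where clause with date constants, mirroring the SQL Server isnull pattern from the question ('31DEC9999'd is a hypothetical never-expires sentinel):
where coalesce(datepart(expiration_date), '31DEC9999'd) > '31DEC2014'd;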

Getting table information for Redshift `stl_load_errors` errors

I am using the Redshift COPY command to load data into a Redshift table from S3. When something goes wrong, I typically get an error like ERROR: Load into table 'example' failed. Check 'stl_load_errors' system table for details. I can always look up stl_load_errors manually to get the details. Now I am trying to figure out how to do that automatically.
From the documentation it looks like the following query should give me all the details I need:
SELECT *
FROM stl_load_errors errors
INNER JOIN svv_table_info info
ON errors.tbl = info.table_id
AND info.schema = '<schema-name>'
AND info.table = '<table-name>'
However, it always returns nothing. I also tried using stv_tbl_perm instead of svv_table_info, and still nothing.
After some troubleshooting, I see two things I don't understand:
I see multiple different IDs in stv_tbl_perm and svv_table_info for the exact same table. Why is that?
I see the tbl field on stl_load_errors referencing ids that do not exist in stv_tbl_perm or svv_table_info. Again, why?
It feels like I'm not understanding something about the structure of these tables, but what it is completely escapes me.
This is because tbl and table_id have different types: the first one is integer, the second one is oid.
When you cast the oid to integer, the columns have the same values. You can check with this query:
SELECT table_id::integer, table_id
FROM SVV_TABLE_INFO
I get results when I execute
SELECT errors.tbl, info.table_id::integer, info.table_id, *
FROM stl_load_errors errors
INNER JOIN svv_table_info info
ON errors.tbl = info.table_id
Please note that the inner join is ON errors.tbl = info.table_id
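Applying the cast to the original lookup from the question would look like this (a sketch; <schema-name> and <table-name> are placeholders as above, and "schema"/"table" are quoted since table is a reserved word):
SELECT errors.*
FROM stl_load_errors errors
INNER JOIN svv_table_info info
ON errors.tbl = info.table_id::integer
AND info."schema" = '<schema-name>'
AND info."table" = '<table-name>';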
I finally got to the bottom of it, and it is surprisingly boring and probably not useful to many...
I had an existing table. My code that created the table was wrapped in a transaction, and it dropped the table inside the transaction. The code that queried stl_load_errors was outside the transaction, so the table_id outside and inside the transaction were different, as it was a different table.
You could try looking by filename. This doesn't really answer the question about joining the various tables, but I use a query like the following to group files that are part of the same manifest file and compare the counts to the maxerror setting:
select min(starttime) over (partition by substring(filename, 1, 53)) as starttime,
substring(filename, 1, 53) as filename, btrim(err_reason) as err_reason, count(*)
from stl_load_errors where filename like '%/some_s3_path/%'
group by starttime, filename, err_reason order by starttime desc;
This worked for me without any casting:
schemaz=# select i.database, e.err_code from stl_load_errors e join svv_table_info i on e.tbl=i.table_id limit 5
schemaz-# ;
database | err_code
-----------+----------
schemaz | 1204
schemaz | 1204
schemaz | 1204
schemaz | 1204
schemaz | 1204