AWS Redshift does not set integer defaults when copying from S3

Create query:
CREATE TABLE IF NOT EXISTS test.test_table (
    "test_field1" INTEGER DEFAULT 0 NOT NULL,
    "test_field2" INTEGER DEFAULT 0 NOT NULL
);
test_file.csv (note that the second data row is missing the first column's value):
test_field1,test_field2
100, 234
,30542
test.manifest:
{"entries":[{"url":"s3://MY_BUCKET_NAME/some_path/test_file.csv","mandatory":true}]}
COPY Query:
COPY test.test_table
("test_field1", "test_field2")
FROM 's3://MY_BUCKET_NAME/some_path/test.manifest'
CREDENTIALS 'aws_access_key_id=some_access_key;aws_secret_access_key=some_secret'
CSV
IGNOREHEADER 1
MANIFEST
REGION AS 'some-region-n'
TIMEFORMAT 'auto'
ACCEPTINVCHARS
When I execute the COPY query, I get this error from stl_load_errors:
Missing data for not-null field
Why? I tried omitting test_field1 from the COPY column list, but got another error:
Extra column(s) found.
When I change the field type to VARCHAR(256), Redshift loads the file without errors.
Maybe I'm missing something; this is too strange for me to understand.

Related

Redshift COPY error: "Assert code: 1000 context: Reached unreachable code - Invalid type: 6551 query"

We are trying to copy data from S3 (Parquet files) to Redshift.
Here are the respective details.
Athena DDL:
CREATE EXTERNAL TABLE tablename(
`id` int,
`col1` int,
`col2` date,
`col3` string,
`col4` decimal(10,2),
`binarycol` binary);
Redshift DDL:
CREATE TABLE IF NOT EXISTS redshiftschema.tablename(
id int,
col1 int,
col2 date,
col3 varchar(512),
col4 decimal(10,2),
binarycol varbyte);
And the copy command is:
COPY <tgt_schema>.tablename FROM 's3://<path>/<tablename>.manifest' iam_role 'redshift-role' FORMAT AS PARQUET manifest;
The above works well for all other tables, but when the Athena table has a binary column (I believe), we get the following error:
Redshift COPY error: "Assert code: 1000 context: Reached unreachable code - Invalid type: 6551 query"
Could anyone please guide us on the issue we are facing?

Redshift copy command not getting default value

I have a file in S3 with the following format:
col1,col2
number1,content1
number2,content2
number3,content3
I am creating a Redshift table with the structure below:
CREATE TABLE IF NOT EXISTS general.name.name_test (
col1 VARCHAR(255),
col2 VARCHAR(255),
inserted_timestamp TIMESTAMP DEFAULT GETDATE()
);
After that, I am using Redshift copy command to have the data available in the table I just created:
COPY general.name.name_test
FROM 's3://.../name_test.txt'
ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN ''
DELIMITER AS ','
IGNOREHEADER AS 1
csv;
The problem is that "inserted_timestamp" is NULL and Redshift is not taking the default value.
Am I missing something? This is what I will get in Redshift:
col1,col2,inserted_timestamp
number1,content1,null
number2,content2,null
number3,content3,null
It only works if I specify the columns, but I wanted to avoid that if possible:
COPY general.name.name_test
(col1,col2)
FROM 's3://.../name_test.txt'
ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN ''
DELIMITER AS ','
IGNOREHEADER AS 1
csv;
Thank you!
Since CSV doesn’t name the columns, Redshift doesn’t know which column is which. You need to add the column names to the COPY command to clear up the confusion. Alternatively, you could add a trailing comma to each row of your data file to indicate that the missing column is the last one.
Other than these approaches I don’t know of a way to make it clear to Redshift.
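For reference, the trailing-comma alternative means reshaping the data file itself so every line carries an empty third field for inserted_timestamp, roughly like this (illustrative):
col1,col2,
number1,content1,
number2,content2,
number3,content3,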

Snowflake table is not accepting null values in date field

I have a table in Snowflake into which I am performing a bulk load.
One of the columns in the table is a date, but the source table, which is on SQL Server, has null values in that date column.
The flow of data is:
sql_server --> S3 buckets --> snowflake_table
I am able to run the Sqoop job in EMR, but I am not able to load the data into the Snowflake table, as it is not accepting null values in the date column.
The error is:
Date '' is not recognized File 'schema_name/table_name/file1', line 2, character 18 Row 2,
column "table_name"["column_name":5] If you would like to continue loading when an error is
encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
Can anyone help? Where am I going wrong?
Using the command below, you can see the values in the staged file:
select t.$1, t.$2 from @mystage1 (file_format => 'myformat') t;
Based on the data you can change your copy command as below:
COPY INTO my_table(col1, col2, col3) from (select $1, $2, try_to_date($3) from @mystage1)
file_format=(type = csv FIELD_DELIMITER = '\u00EA' SKIP_HEADER = 1 NULL_IF = ('') ERROR_ON_COLUMN_COUNT_MISMATCH = false EMPTY_FIELD_AS_NULL = TRUE)
on_error='continue'
The error shows that the dates are not arriving as nulls. Rather, they're arriving as blank strings. You can address this a few different ways.
The cleanest way is to use the TRY_TO_DATE function on your COPY INTO statement for that column. This function will return database null when trying to convert a blank string into a date:
https://docs.snowflake.com/en/sql-reference/functions/try_to_date.html#try-to-date
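As a quick sanity check of that behavior (a minimal sketch; the literal values are just examples):
select try_to_date('2021-07-15') as parsed_date,    -- parses normally
       try_to_date('')           as blank_as_null;  -- returns NULL instead of raising
                                                    -- the "Date '' is not recognized" error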

Data type shifts in amazon redshift

I am working on loading my data from S3 to Redshift. Looking at the Redshift error logs, I noticed a shift in the data types for my query.
This is the table I am creating...
main_covid_table_create = ("""
CREATE TABLE IF NOT EXISTS main_covid_table(
SNo INT IDENTITY(1, 1),
ObservationDate DATE,
state VARCHAR,
country VARCHAR,
lastUpdate DATE,
Confirmed DOUBLE PRECISION,
Deaths DOUBLE PRECISION,
Recovered DOUBLE PRECISION
)
""")
with the COPY command as:
staging_main_covid_table_copy = ("""
COPY main_covid_table
FROM {}
iam_role {}
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT AS 'auto'
NULL AS 'NA'
""").format(COVID_DATA, IAM_ROLE)
I get this error from Redshift after running the script:
My interpretation of this error is that the data type of lastUpdate is being used for the country column. Can anyone help with this?
Presumably, your error output is from STL_LOAD_ERRORS, in which case the third last column is defined as: "The pre-parsing value for the field "colname" that lead to the parsing error.".
Thus, it is saying that there is a problem with country, and that it is trying to interpret it as a date. This does not make sense given the definitions you have provided. In fact, it looks as if it is trying to load the header line as data, which again doesn't make sense given the presence of IGNOREHEADER 1. It also looks like there is a column mis-alignment.
I recommend that you examine the full error details from the STL_LOAD_ERRORS line including the colname and try to figure out what is happening with the data. You could start with just one line of data in the file and see whether it works, then keep adding the data back to find what is breaking the load.
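If it helps, a query along these lines pulls the most recent load errors together with the raw value that failed to parse (the column list and LIMIT are just a starting point):
select starttime, filename, line_number, colname, type,
       raw_field_value, err_reason
from stl_load_errors
order by starttime desc
limit 10;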

how to copy data when delimiters are missing in source file

Let's say I have a table in Redshift with 4 columns:
Create Table m.mytab(
col_1 BIGINT NOT NULL,
col_2 Varchar(200),
col_3 Varchar(200),
col_4 INT
);
And my source raw file contains data like:
col_1^col_2^col_3^col_4
myrowdata1^myrowdata2
myrowdata3^myrowdata4
.....
I want to load this data into mytab, so I tried the Redshift COPY command:
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS;
Since the last 2 delimiters are missing in each row, I am unable to load the data. Can someone suggest how to resolve this issue?
Thanks
1) Try adding the FILLRECORD parameter to your COPY statement (see the sketch after this answer).
For more information, see the Data Conversion Parameters documentation.
2) If all rows are missing col_3 and col_4, you can create a staging table with col_1 and col_2 only, copy the data into the staging table, and then issue:
ALTER TABLE target_tablename
APPEND FROM staging_tablename
FILLTARGET;
This will move the data to target_tablename very efficiently (just changing pointers without writing or deleting data) and take care of the missing col_3 and col_4.
More information about the command: ALTER TABLE APPEND
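To make the two options concrete, here is a rough sketch built from the COPY command in the question (the staging table name is made up for illustration):
-- Option 1: add FILLRECORD so the missing trailing fields are filled in
-- (NULLs, or zero-length strings for VARCHAR columns).
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS
FILLRECORD;

-- Option 2: stage only the columns that exist in the file, then append.
Create Table m.mytab_staging(
col_1 BIGINT NOT NULL,
col_2 Varchar(200)
);

copy m.mytab_staging
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS;

-- Moves the staged rows into m.mytab; col_3 and col_4 are filled with their
-- DEFAULT value (or NULL) because of FILLTARGET.
ALTER TABLE m.mytab
APPEND FROM m.mytab_staging
FILLTARGET;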