I have a file in S3 with the following format:
col1,col2
number1,content1
number2,content2
number3,content3
I am creating a Redshift table with the structure below:
CREATE TABLE IF NOT EXISTS general.name.name_test (
col1 VARCHAR(255),
col2 VARCHAR(255),
inserted_timestamp TIMESTAMP DEFAULT GETDATE()
);
After that, I am using the Redshift COPY command to make the data available in the table I just created:
COPY general.name.name_test
FROM 's3://.../name_test.txt'
ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN ''
DELIMITER AS ','
IGNOREHEADER AS 1
csv;
The problem is that "inserted_timestamp" is NULL; Redshift is not applying the default value.
Am I missing something? This is what I will get in Redshift:
col1,col2,inserted_timestamp
number1,content1,null
number2,content2,null
number3,content3,null
It only works if I specify the columns, but I wanted to avoid that if possible:
COPY general.name.name_test
(col1,col2)
FROM 's3://.../name_test.txt'
ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN ''
DELIMITER AS ','
IGNOREHEADER AS 1
csv;
Thank you!
Since a CSV file doesn’t name its columns, Redshift doesn’t know which column is which. You need to add the column names to the COPY command to clear up the confusion. Alternatively, you could add a trailing comma to each row of your data file to indicate that the missing column is the last one.
Other than these approaches I don’t know of a way to make it clear to Redshift.
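If you go the trailing-comma route, a tiny pre-processing step can add the extra field before the file is uploaded; a rough sketch (the function name and sample data are made up). One caveat: an empty CSV field still loads as NULL rather than firing the column default, so the explicit column list remains the more reliable fix.

```python
def add_trailing_comma(lines):
    # Append one empty field to every data row so the file has an
    # explicit (empty) last column. The header row is left alone,
    # since the COPY uses IGNOREHEADER 1 anyway.
    header, *rows = lines
    return [header] + [row + "," for row in rows]

sample = ["col1,col2", "number1,content1", "number2,content2"]
print(add_trailing_comma(sample))
# -> ['col1,col2', 'number1,content1,', 'number2,content2,']
```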
When I use Athena CTAS to generate CSV files, I found that null values in the Athena table are replaced by "\N".
How do I get it to just leave these values as empty columns?
The CTAS query I'm using is something like this:
CREATE TABLE table_name WITH (format = 'TEXTFILE', field_delimiter=',', external_location='s3://bucket_name/location') AS SELECT * FROM "db_name"."src_table_name";
Am I doing something wrong?
"\N" is the default NULL token for LazySimpleSerDe, and CTAS unfortunately does not expose any mechanism for changing it.
If you'd rather have empty fields for your NULL values, you have to ensure they are all empty strings, e.g. … AS SELECT COALESCE(col1, ''), COALESCE(col2, ''), ….
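For a table with many columns, generating that COALESCE list can be scripted; a small sketch (function name is illustrative, and non-string columns would additionally need a CAST to VARCHAR, since COALESCE requires compatible types):

```python
def coalesce_select(columns):
    # Build a SELECT list that forces NULLs to empty strings so
    # LazySimpleSerDe never has to emit its \N token.
    exprs = ", ".join(f"COALESCE({c}, '') AS {c}" for c in columns)
    return f"SELECT {exprs}"

print(coalesce_select(["col1", "col2"]))
# -> SELECT COALESCE(col1, '') AS col1, COALESCE(col2, '') AS col2
```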
I am working on loading my data from S3 to Redshift. I noticed a shift in the data types in my query from the Redshift error logs.
This is the table I am creating...
main_covid_table_create = ("""
CREATE TABLE IF NOT EXISTS main_covid_table(
SNo INT IDENTITY(1, 1),
ObservationDate DATE,
state VARCHAR,
country VARCHAR,
lastUpdate DATE,
Confirmed DOUBLE PRECISION,
Deaths DOUBLE PRECISION,
Recovered DOUBLE PRECISION
)
""")
with the COPY command as:
staging_main_covid_table_copy = ("""
COPY main_covid_table
FROM {}
iam_role {}
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT AS 'auto'
NULL AS 'NA'
""").format(COVID_DATA, IAM_ROLE)
I get this error from Redshift after running the script:
My interpretation of this error is that the data type of lastUpdate is being used for the country column. Can anyone help with this?
Presumably, your error output is from STL_LOAD_ERRORS, in which case the third last column is defined as: "The pre-parsing value for the field "colname" that lead to the parsing error.".
Thus, it is saying that there is a problem with country, and that it is trying to interpret it as a date. This does not make sense given the definitions you have provided. In fact, it looks as if it is trying to load the header line as data, which again doesn't make sense given the presence of IGNOREHEADER 1. It also looks like there is a column mis-alignment.
I recommend that you examine the full error details from the STL_LOAD_ERRORS line including the colname and try to figure out what is happening with the data. You could start with just one line of data in the file and see whether it works, then keep adding the data back to find what is breaking the load.
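The "one line at a time" debugging can also be scripted as a pre-flight check; here is a hypothetical sketch that mirrors the table definition and parses a single CSV line roughly the way COPY would, reporting the first field that fails. It assumes SNo is absent from the file (it is an IDENTITY column), uses a naive comma split (fine only for unquoted data), and guesses the date format, so adjust all of that to match the real file.

```python
from datetime import datetime

# Expected layout of a data row, mirroring the CREATE TABLE (minus SNo).
SCHEMA = [
    ("ObservationDate", "date"),
    ("state", "varchar"),
    ("country", "varchar"),
    ("lastUpdate", "date"),
    ("Confirmed", "float"),
    ("Deaths", "float"),
    ("Recovered", "float"),
]

def check_row(line):
    # Naive split; quoted fields containing commas would need csv.reader.
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(SCHEMA):
        return f"expected {len(SCHEMA)} fields, got {len(fields)}"
    for (name, typ), raw in zip(SCHEMA, fields):
        if raw == "NA":  # matches NULL AS 'NA' in the COPY command
            continue
        try:
            if typ == "date":
                datetime.strptime(raw, "%m/%d/%Y")  # guessed format
            elif typ == "float":
                float(raw)
        except ValueError:
            return f"column {name}: cannot parse {raw!r} as {typ}"
    return None  # row looks loadable

print(check_row("01/22/2020,Hubei,China,01/22/2020,444.0,17.0,28.0"))
```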
We have a Java-based ETL application developed using Spring Boot. In the pipeline, we get data from downstream (by calling another application's endpoint), transform the data from the input file (CSV), copy the transformed output file to S3, and insert the data from S3 into Redshift using a COPY command like this:
COPY schema_nm.table1
FROM 's3://<bucket_name>/projectNamefolder/output/<file name>'
CREDENTIALS 'aws_iam_role=arn:aws:iam::12345678:role/<role_Name>'
REGION 'us-east-1'
TRIMBLANKS
TRUNCATECOLUMNS
ACCEPTANYDATE
ACCEPTINVCHARS
ESCAPE
REMOVEQUOTES
BLANKSASNULL
FILLRECORD
MAXERROR 100
delimiter ''
DATEFORMAT 'auto'
TIMEFORMAT 'auto';
We have some special characters like \n \ " ' within the data, and we want these characters to be inserted into the table along with the data. Because of these characters, records are dropped or rejected and go to the stl_load_errors table. Note that these fields are VARCHAR in the table. Can anyone please help with how to make sure we insert the data along with these characters?
I tried options in the COPY command like ACCEPTINVCHARS, ESCAPE, and REMOVEQUOTES, but no luck; the records are still rejected.
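One thing worth checking: the ESCAPE option only helps if the file is written with backslash escapes in the first place. A sketch of the writer-side escaping (the '|' delimiter is a placeholder, since the actual delimiter in the question was elided; quote characters would also need attention when REMOVEQUOTES is in play):

```python
def escape_field(value, delimiter="|"):
    # With COPY ... ESCAPE, a backslash in the input file escapes the
    # character after it, so the writer must escape: the backslash
    # itself, the delimiter, and embedded newlines. Order matters:
    # backslashes first, or they would be escaped twice.
    out = value.replace("\\", "\\\\")
    out = out.replace(delimiter, "\\" + delimiter)
    out = out.replace("\n", "\\\n")
    return out

print(escape_field('say "hi"\nback\\slash|pipe'))
```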
Create query:
CREATE TABLE IF NOT EXISTS test.test_table ( "test_field1" INTEGER DEFAULT 0 NOT NULL, "test_field2" INTEGER DEFAULT 0 NOT NULL)
test_file.csv (note that the second data row is missing the first column's value):
test_field1,test_field2
100, 234
,30542
test.manifest:
{"entries":[{"url":"s3://MY_BUCKET_NAME/some_path/test_file.csv","mandatory":true}]}
COPY Query:
COPY test.test_table
("test_field1", "test_field2")
FROM 's3://MY_BUCKET_NAME/some_path/test.manifest'
CREDENTIALS 'aws_access_key_id=some_access_key;aws_secret_access_key=some_secret'
CSV
IGNOREHEADER 1
MANIFEST
REGION AS 'some-region-n'
TIMEFORMAT 'auto'
ACCEPTINVCHARS
When I execute the COPY query, I get this error from stl_load_errors:
Missing data for not-null field
Why? I tried omitting test_field1 from the column list but got another error:
Extra column(s) found.
When I change the field type to VARCHAR(256), Redshift returns OK.
Maybe I'm missing something; this is too strange for me to understand.
Let's say I have a table in Redshift with 4 columns:
Create Table m.mytab(
col_1 BIGINT NOT NULL,
col_2 VARCHAR(200),
col_3 VARCHAR(200),
col_4 INT
);
And my Source row file contains data as:
col_1^col_2^col_3^col_4
myrowdata1^myrowdata2
myrowdata3^myrowdata4
.....
Here I want to load this data into mytab. I tried the Redshift COPY command as:
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS;
Since the last 2 delimiters are missing in each row, I am unable to load the data here. Can someone suggest how to resolve this issue?
Thanks
1) Try adding the FILLRECORD parameter to your COPY statement.
For more information, see the Data Conversion Parameters documentation.
2) If all rows are missing col_3 and col_4, you can just create a staging table with col_1 and col_2 only, copy the data into the staging table, and then issue:
ALTER TABLE target_tablename
APPEND FROM staging_tablename
FILLTARGET;
This will move the data to target_tablename very efficiently (just changing a pointer, without writing or deleting data) and take care of the missing col_3 and col_4.
More information about the command: ALTER TABLE APPEND
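Option 2 end to end, as a sketch in the Python-wrapped-SQL style of the earlier snippets (table and column names come from the question, the credentials placeholder is the question's own, and the statements are illustrative, not tested against a cluster). Note the staging table is a permanent table, since ALTER TABLE APPEND works on permanent tables:

```python
# Step 1: staging table with only the columns present in the file.
create_staging = """
CREATE TABLE IF NOT EXISTS m.mytab_staging (
    col_1 BIGINT NOT NULL,
    col_2 VARCHAR(200)
)
"""

# Step 2: COPY into the staging table; widths now match the file.
copy_staging = """
COPY m.mytab_staging
FROM 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS
"""

# Step 3: move the rows into the target; FILLTARGET fills the
# missing col_3 / col_4 with NULL (or their defaults).
append_to_target = """
ALTER TABLE m.mytab
APPEND FROM m.mytab_staging
FILLTARGET
"""
```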