How to copy data when delimiters are missing in the source file - amazon-web-services

Let's say I have a table in Redshift with 4 columns:
Create Table m.mytab(
col_1 BIGINT NOT NULL,
col_2 Varchar(200),
col_3 Varchar(200),
col_4 INT
);
And my source raw file contains data like:
col_1^col_2^col_3^col_4
myrowdata1^myrowdata2
myrowdata3^myrowdata4
.....
I want to load this data into mytab, so I tried Redshift's COPY command:
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS;
Since the last 2 delimiters are missing in each row, I am unable to load the data. Can someone suggest how to resolve this issue?
Thanks

1) Try adding the FILLRECORD parameter to your COPY statement (see the sketch after this list).
For more information, see the Data Conversion Parameters documentation.
2) If all rows are missing col_3 and col_4, you can create a staging table with col_1 and col_2 only, COPY the data into the staging table, and then issue:
ALTER TABLE target_tablename
APPEND FROM staging_tablename
FILLTARGET;
This moves the data to target_tablename very efficiently (it only changes pointers, without writing or deleting data) and takes care of the missing col_3 and col_4.
More information about the command: ALTER TABLE APPEND
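A minimal sketch of both options, reusing the table and file from the question (credentials elided as before; the staging table name m.staging_mytab is made up for illustration):

-- Option 1: FILLRECORD pads the missing trailing columns with NULLs
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS
FILLRECORD;

-- Option 2: load a narrow staging table, then append it into the target
Create Table m.staging_mytab(
col_1 BIGINT NOT NULL,
col_2 Varchar(200)
);

copy m.staging_mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS;

ALTER TABLE m.mytab
APPEND FROM m.staging_mytab
FILLTARGET;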

Related

Redshift copy command not getting default value

I have a file in S3 with the following format:
col1,col2
number1,content1
number2,content2
number3,content3
I am creating a Redshift table with the structure below:
CREATE TABLE IF NOT EXISTS general.name.name_test (
col1 VARCHAR(255),
col2 VARCHAR(255),
inserted_timestamp TIMESTAMP DEFAULT GETDATE()
);
After that, I am using Redshift copy command to have the data available in the table I just created:
COPY general.name.name_test
FROM 's3://.../name_test.txt'
ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN ''
DELIMITER AS ','
IGNOREHEADER AS 1
csv;
The problem is that "inserted_timestamp" is NULL and Redshift is not taking the default value.
Am I missing something? This is what I will get in Redshift:
col1,col2,inserted_timestamp
number1,content1,null
number2,content2,null
number3,content3,null
It only works if I specify the columns but I wanted to avoid that if possible:
COPY general.name.name_test
(col1,col2)
FROM 's3://.../name_test.txt'
ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN ''
DELIMITER AS ','
IGNOREHEADER AS 1
csv;
Thank you!
Since a CSV file doesn't name its columns, Redshift doesn't know which column is which. You need to add the column names to the COPY command to clear up the confusion. Alternatively, you could add a trailing comma to each line of your data file to indicate that the missing column is the last one.
Other than these approaches, I don't know of a way to make this clear to Redshift.
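To illustrate the trailing-comma alternative (using only the sample rows from the question), each line would carry the extra delimiter so that the empty third field lines up with inserted_timestamp:

col1,col2,
number1,content1,
number2,content2,
number3,content3,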

Athena CTAS replacing null values in tables with \N

When I use Athena CTAS to generate CSV files, I find that null values in the Athena table are replaced by "\N".
How do I get it to just leave these values as empty columns?
The CTAS query I'm using is something like this:
CREATE TABLE table_name
WITH (format = 'TEXTFILE', field_delimiter = ',', external_location = 's3://bucket_name/location')
AS SELECT * FROM "db_name"."src_table_name";
Am I doing something wrong?
"\N" is the default NULL token for LazySimpleSerDe, and unfortunately CTAS does not expose any mechanism for changing it.
If you'd rather have empty fields for your NULL values, you have to ensure they are all empty strings, e.g. … AS SELECT COALESCE(col1, ''), COALESCE(col2, ''), …
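A sketch of what that looks like in full, assuming the source table has just two string columns col1 and col2 (adjust the column list to your real schema; non-string columns would need a CAST before the COALESCE):

CREATE TABLE table_name
WITH (format = 'TEXTFILE', field_delimiter = ',', external_location = 's3://bucket_name/location')
AS SELECT
  COALESCE(col1, '') AS col1,
  COALESCE(col2, '') AS col2
FROM "db_name"."src_table_name";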

Redshift: create external table returns 0 rows

I have a text file test.txt located at s3://myBucket/ (see the sample below), from which I want to create an external table in Redshift.
When I select from the table, it returns 0 rows.
1,One
2,Two
3,Three
create external table spectrum_schema.test(
Id integer,
Name varchar(255))
row format delimited
fields terminated by ','
stored as textfile
location 's3://myBucket/';
select * from spectrum_schema.test; -- returns 0 rows
Any suggestions on how I can fix this?
I fixed this by moving the file to s3://myBucket/test
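Presumably the table's LOCATION also has to point at that new prefix instead of the bucket root (the answer doesn't spell this out, so treat it as an assumption); the adjusted DDL would be:

create external table spectrum_schema.test(
Id integer,
Name varchar(255))
row format delimited
fields terminated by ','
stored as textfile
location 's3://myBucket/test/';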

AWS Redshift does not set integer defaults when copying from S3

Create query:
CREATE TABLE IF NOT EXISTS test.test_table (
"test_field1" INTEGER DEFAULT 0 NOT NULL,
"test_field2" INTEGER DEFAULT 0 NOT NULL
);
test_file.csv (note that the second data row has no value for the first column):
test_field1,test_field2
100, 234
,30542
test.manifest:
{"entries":[{"url":"s3://MY_BUCKET_NAME/some_path/test_file.csv","mandatory":true}]}
COPY Query:
COPY test.test_table
("test_field1", "test_field2")
FROM 's3://MY_BUCKET_NAME/some_path/test.manifest'
CREDENTIALS 'aws_access_key_id=some_access_key;aws_secret_access_key=some_secret'
CSV
IGNOREHEADER 1
MANIFEST
REGION AS 'some-region-n'
TIMEFORMAT 'auto'
ACCEPTINVCHARS
When I execute the COPY query, I get an error in stl_load_errors:
Missing data for not-null field
Why? I tried omitting test_field1 from the column list, but got another error:
Extra column(s) found.
When I change the field type to VARCHAR(256), the COPY succeeds.
Maybe I'm missing something; this is too strange for me to understand.

Redshift copy doesn't insert data into my table

I have a table SampleTable
and run the following Redshift command through a SQL client (JackDB):
copy SampleTable
from 's3://bucket-name/backup/data.csv.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
gzip
csv;
The command returns
Executed successfully Updated 0 rows in 2.771 seconds.
but no data is inserted into the empty table SampleTable.
select count(*)
from SampleTable;
returns 0, even though there is 100 MB of data in data.csv.gz.
Solved it myself: the data did not match the COPY options.
I needed to include a DELIMITER to override the default and IGNOREHEADER 1 to skip the CSV header.
I'm just bothered by the fact that nothing is recorded in stl_load_errors in this case.
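For reference, the corrected COPY would look something like this (the '|' delimiter is only a placeholder, since the question doesn't say what separator the file actually uses):

copy SampleTable
from 's3://bucket-name/backup/data.csv.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
gzip
csv
delimiter '|'
ignoreheader 1;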