I am trying to sync a table from MySQL RDS to Redshift through AWS Data Pipeline.
There was no issue copying the data from RDS to S3, but while copying from S3 to Redshift I see the following error:
amazonaws.datapipeline.taskrunner.TaskExecutionException: java.lang.RuntimeException: Unable to load data: Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]
Looking at the data, it appears that an extra ".0" is appended to every timestamp when the data is copied to S3: 2015-04-28 10:25:58 in the MySQL table becomes 2015-04-28 10:25:58.0 in the CSV file, and that seems to be what causes the error.
I also tried loading the file directly with a COPY command:
copy XXX
from 's3://XXX/rds//2018-02-27-14-38-04/1d6d39b9-4aac-408d-8275-3131490d617d.csv'
iam_role 'arn:aws:iam::XXX:role/XXX' delimiter ',' timeformat 'auto';
but I still get the same error.
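One workaround I am considering (just a sketch, not verified; the column and table names below are placeholders) is to strip the fractional seconds at the extraction step, by formatting the timestamp in the pipeline's RDS SELECT query:
-- MySQL side of the pipeline: write the timestamp without fractional seconds,
-- so the CSV landing in S3 contains '2015-04-28 10:25:58' rather than '...:58.0'
SELECT id,
       DATE_FORMAT(updated_at, '%Y-%m-%d %H:%i:%s') AS updated_at
FROM my_table;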
Can anyone help me sort out this issue? Thanks in advance.
Related
I have Parquet files that I need to load into Redshift using the COPY command. The command is failing with a Spectrum scan error, so I would like to skip any file that causes an error.
Is there any way to ignore bad records (something like the MAXERROR option) in the Redshift COPY command for a Parquet load?
COPY <targettablename> from '<s3 path>' iam_role 'arn:aws:iam::1232432' format as parquet maxerror 250
Error:- MAXERROR argument is not supported for PARQUET based COPY
To copy data from a Parquet file into Redshift, use the following format:
COPY SchemaName.TableName
FROM 's3://buckets/file path'
ACCESS_KEY_ID 'Access key id details'
SECRET_ACCESS_KEY 'Secret access key details'
FORMAT AS PARQUET
STATUPDATE OFF;
You get a Spectrum scan error when there is a mismatch between the source column data types and the destination column data types; to fix it, change the data types so they conform to Redshift's supported data types.
To check the errors, you can use this query:
SELECT * FROM SVL_S3LOG WHERE query = <query_id>;
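If you really do need to skip one bad file rather than fix it, one option (a sketch with placeholder bucket, table, and role names) is to list only the files you want in a manifest file and point COPY at the manifest instead of the whole prefix; MANIFEST is accepted for Parquet loads even though MAXERROR is not:
-- good_files.manifest is a JSON file in S3 listing only the Parquet files to load, e.g.
-- {"entries": [{"url": "s3://mybucket/data/part-0000.parquet", "mandatory": true}]}
COPY SchemaName.TableName
FROM 's3://mybucket/manifests/good_files.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
MANIFEST;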
I am getting an error when loading my file; the CSV file contains backslashes. What delimiter or options can I use in my COPY command so that loading the data from S3 to Redshift does not fail?
I tried the QUOTE parameter, but it gave me a syntax error, so it seems the format I am using does not accept the QUOTE keyword. Can anyone provide a correct command, or do I need to clean or preprocess my data before uploading it to S3? If the data size is too big, that might not be a very feasible solution. If I do have to preprocess it, should I use PySpark or Python (pandas)?
Below is the COPY command I am using to copy data from S3 to Redshift. I tried passing QUOTE in the COPY command, but it no longer seems to be accepted, and there is no example in the Amazon docs of how to achieve this. I would appreciate a command that can handle or replace these special characters while loading the data.
COPY redshifttable from 'mys3filelocation'
CREDENTIALS 'aws_access_key_id=myaccess_key;aws_secret_access_key=mysecretID'
region 'us-west-2'
CSV
DATASET:
US063737,2019-11-07T10:23:25.000Z,richardkiganga,536737838,Terminated EOs,"",f,Uganda,Richard,Kiganga,Business owner,Round Planet DTV Uganda,richardkiganga,0.0,4,7.0,2021-06-1918:36:05,"","",panama-
Disc.s3.amazon.com/photos/…,\"\",Mbale,Wanabwa p/s,Eastern,"","",UACE Certificate,"",drive.google.com/file/d/148dhf89shh499hd9303-JHBn38bh/… phone,Mbale,energy_officer's_id_type,letty
mainzi,hakuna Cell,Agent,8,"","",4,"","","",+647739975493,Feature phone,"",0,Boda goda,"",1985-10-12,Male,"",johnatlhnaleviski,"",Wife
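For reference, this is the kind of variant I have been trying, based on my reading of the CSV options for COPY (a sketch, not yet verified against this dataset): QUOTE is only accepted together with the CSV keyword, and ESCAPE cannot be combined with CSV at all.
-- With CSV, backslashes are not treated as escape characters; only the quote
-- character is declared (ESCAPE is not allowed together with CSV).
COPY redshifttable
FROM 'mys3filelocation'
CREDENTIALS 'aws_access_key_id=myaccess_key;aws_secret_access_key=mysecretID'
REGION 'us-west-2'
CSV QUOTE AS '"';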
I have five JSON files in one folder in Amazon S3. I am trying to load all five files from S3 into Redshift using the COPY command. I am getting an error while loading one of the files. Is there any way in Redshift to skip that file and load the next one?
Use the MAXERROR parameter in the COPY command to increase the number of errors permitted. This will skip over any lines that produce errors.
Then, use the STL_LOAD_ERRORS table to view the errors and diagnose the data problem.
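A rough sketch of both steps (the table name, bucket path, and IAM role below are placeholders):
-- Allow up to 100 parse errors across the load instead of failing on the first one
COPY my_table
FROM 's3://my-bucket/json-folder/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 'auto'
MAXERROR 100;

-- Then inspect what was skipped and why
SELECT query, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;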
I have data in a table
select * from my_table
It contains 10k observations. How do I export the data in the table as CSV to an S3 bucket?
(I don't want to export the data to my local machine and then push it to S3.)
Please, please, please STOP labeling your questions with both PostgreSQL and Greenplum. The answer to your question is very different if you are using Greenplum versus PostgreSQL. I can't stress this enough.
If you are using Greenplum, you should use the S3 protocol in external tables to read and write data to S3.
So your table:
select * from my_table;
And your external table:
CREATE WRITABLE EXTERNAL TABLE ext_my_table (LIKE my_table)
LOCATION ('s3://s3_endpoint/bucket_name')
FORMAT 'TEXT' (DELIMITER '|' NULL AS '' ESCAPE AS E'\\');
And then writing to your s3 bucket:
INSERT INTO ext_my_table SELECT * FROM my_table;
You will also need to do some configuration on your Greenplum cluster so that you have an s3 configuration file; it goes in every segment data directory:
gpseg_data_dir/gpseg-prefixN/s3/s3.conf
Example of the file contents:
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864
More information on S3 can be found here: http://gpdb.docs.pivotal.io/5100/admin_guide/external/g-s3-protocol.html#amazon-emr__s3_config_file
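Since the goal here is CSV output specifically, the same approach should also work with the CSV format (a sketch along the same lines, not tested):
-- Writable external table that emits CSV instead of pipe-delimited text
CREATE WRITABLE EXTERNAL TABLE ext_my_table_csv (LIKE my_table)
LOCATION ('s3://s3_endpoint/bucket_name')
FORMAT 'CSV' (DELIMITER ',' NULL '' QUOTE '"');

INSERT INTO ext_my_table_csv SELECT * FROM my_table;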
I'd suggest first getting the data onto your master node using WinSCP or another file-transfer tool, and then moving that file from the master node to S3 storage.
Moving data from the master node to S3 uses Amazon's bandwidth, which is much faster than using your local connection bandwidth to transfer the file from your local machine to S3.
I am new to AWS. I'm trying to create a Data Pipeline to transfer S3 files into Redshift.
I have already performed the same task manually; now, with the pipeline, I am unable to proceed further.
The problem is with the copy options.
Sample data in the S3 files looks like this:
15,NUL next, ,MFGR#47,MFGR#3438,indigo,"LARGE ANODIZED BRASS",45,LG CASE
22,floral beige,MFGR#4,MFGR#44,MFGR#4421,medium,"PROMO, POLISHED BRASS",19,LG DRUM
23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR
24,dim white,MFGR#4,MFGR#45,MFGR#459,saddle,"MEDIUM , ""PLATED"" STEEL",20,MED CASE
When doing this manually, I used this COPY command:
copy table from 's3://<your-bucket-name>/load/key_prefix'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
csv
null as '\000';
and it worked perfectly
In the pipeline I tried the basic copy options:
1. csv
2. null as '\000'
But neither of them works.
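For comparison, the manual command that worked combines both options in a single statement. My guess (an assumption on my part, not something I have confirmed) is that the pipeline needs the same options passed together as one set of copy options rather than one at a time, so that the COPY it runs ends up looking like this:
-- 'table' is the placeholder target table name from the manual command above.
-- CSV handles the quoted fields with embedded commas and doubled quotes,
-- and NULL AS '\000' handles the null marker; both are needed together.
COPY table
FROM 's3://<your-bucket-name>/load/key_prefix'
CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
CSV
NULL AS '\000';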