Move Redshift file to S3 as CSV

I'm trying to move a file from Redshift to S3. Is there an option to move this file as a .csv?
Currently I am writing a shell script to get the Redshift data, save it as a .csv, and then upload it to S3. I'm assuming that since these are all AWS services, there would be an argument or option that lets me do this.

Use the UNLOAD command. It will create at least one file per slice; you will have to merge the files yourself (a merge sketch follows the example below), or add PARALLEL OFF to the UNLOAD if a single output file is acceptable (subject to a maximum per-file size).
unload ('__SQL__')
to 's3://__BUCKET__/__PATH__'
credentials 'aws_access_key_id=__S3_KEY__;aws_secret_access_key=__S3_SECRET__'
delimiter as ','
addquotes
escape
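If you do need a single .csv at the end, the per-slice parts can simply be concatenated once the UNLOAD finishes. A minimal boto3 sketch, assuming the same placeholder bucket and prefix as above and merging to a local file:
# Minimal sketch: concatenate the per-slice UNLOAD parts into one local CSV.
# __BUCKET__ and __PATH__ are the same placeholders used in the UNLOAD above.
import boto3

s3 = boto3.client("s3")
bucket = "__BUCKET__"
prefix = "__PATH__"

with open("merged.csv", "wb") as out:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            part = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            out.write(part.read())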

Related

remove backslash from a .csv file to load data to redshift from s3

I am getting an error when loading my file because it contains backslashes. What delimiter or options can I use in my COPY command so that the load from S3 to Redshift doesn't fail?
I tried the QUOTE keyword, but it gave me a syntax error, so it seems the new format doesn't accept it.
Can anyone provide a correct command, or do I need to clean or preprocess my data before uploading it to S3? If the data size is too big, preprocessing might not be a very feasible solution. If I do have to process it, should I use PySpark or Python (pandas)?
Below is the COPY command I am using to copy data from S3 to Redshift. I tried passing a QUOTE option in the COPY command, but it doesn't seem to accept it anymore, and there is no example in the Amazon docs of how to achieve this. It would help if someone could suggest a command that replaces special characters while loading the data.
COPY redshifttable from 'mys3filelocation'
CREDENTIALS 'aws_access_key_id=myaccess_key;aws_secret_access_key=mysecretID'
region 'us-west-2'
CSV
DATASET:
US063737,2019-11-07T10:23:25.000Z,richardkiganga,536737838,Terminated EOs,"",f,Uganda,Richard,Kiganga,Business owner,Round Planet DTV Uganda,richardkiganga,0.0,4,7.0,2021-06-1918:36:05,"","",panama-Disc.s3.amazon.com/photos/…,\"\",Mbale,Wanabwa p/s,Eastern,"","",UACE Certificate,"",drive.google.com/file/d/148dhf89shh499hd9303-JHBn38bh/… phone,Mbale,energy_officer's_id_type,letty mainzi,hakuna Cell,Agent,8,"","",4,"","","",+647739975493,Feature phone,"",0,Boda goda,"",1985-10-12,Male,"",johnatlhnaleviski,"",Wife
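For reference, a hypothetical pandas preprocessing pass (only practical if the data fits in memory; file names are placeholders) might look like this:
import csv
import pandas as pd

# Hypothetical cleanup: treat backslash as an escape character on read,
# drop any remaining backslashes, then rewrite with standard CSV quoting
# before uploading the cleaned file to S3.
df = pd.read_csv("input.csv", escapechar="\\", dtype=str)
df = df.replace(r"\\", "", regex=True)
df.to_csv("cleaned.csv", index=False, quoting=csv.QUOTE_ALL)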

process non-csv/json/parquet files from s3 using glue

Little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only S3 source formats offered were CSV, JSON, and Parquet, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but that command is a tool I would need to download to a machine. Is there any way I can do that conversion and then use Glue on the resulting JSON?
Thanks.

Copy ~200,000 s3 files to new prefixes

I have ~200,000 s3 files that I need to partition, and I have made an Athena query to produce a target s3 key for each of the original s3 keys. I can clearly create a script out of this, but how do I make the process robust/reliable?
I need to partition csv files using info inside each csv so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can generate a single big script to copy everything through Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure everything is copied across without the script failing, timing out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those kinds of questions.
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the contents of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
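A minimal sketch of that loop, with a hypothetical get_partition() helper that peeks at each file to work out var1/var2 (the bucket name, prefix, and column positions are assumptions):
import boto3
from smart_open import open as s3_open

s3 = boto3.client("s3")
bucket = "bucket"

def get_partition(bucket, key):
    # Hypothetical helper: read the first data row and assume var1/var2
    # are the first two columns.
    with s3_open(f"s3://{bucket}/{key}", "r") as f:
        f.readline()  # skip header
        first_row = f.readline().strip().split(",")
    return first_row[0], first_row[1]

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="top_prefix/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        var1, var2 = get_partition(bucket, key)
        new_key = f"top_prefix/var1={var1}/var2={var2}/{key.rsplit('/', 1)[-1]}"
        # Server-side copy; nothing is downloaded except the lines peeked at above.
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})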

No extension while using 'from_options' in DynamicFrameWriter in AWS Glue Spark context

I am new to AWS. I am writing an **AWS Glue job** for some transformation, and that part works. After the transformation I used **'from_options' in the DynamicFrameWriter class** to write the data frame out as a CSV file, but the file is copied to S3 without any extension. Also, is there any way to rename the copied file, using DynamicFrameWriter or anything else? Please help.
Step 1: Triggered an AWS Glue job for transforming files in S3 into an RDS instance.
Step 2: On successful job completion, transfer the contents of the file to another S3 location using 'from_options' in the DynamicFrameWriter class. But the file doesn't have any extension.
You have to set the format of the file you are writing, e.g. format="csv".
This should set the .csv file extension. You cannot, however, choose the name of the file you write; the only option you have is some sort of S3 operation afterwards that changes the object's key name.
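For reference, the call shape being described might look like this (glueContext, the frame, and the S3 path are placeholders for whatever the job already uses):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# transformed_frame stands in for the DynamicFrame produced by the transformation step.
glueContext.write_dynamic_frame.from_options(
    frame=transformed_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/output/"},
    format="csv",
)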

Retaining source file name while importing data from s3 to Redshift

I have a large number of files in an S3 bucket that I regularly import into Redshift. Since the number of files is large, I need a column in the Redshift table that contains the source file name from the S3 location.
Is there any way to accomplish this?
Agree with Ketan that this is currently not possible in Redshift. If this is what you want to achieve, it is possible through either of these approaches:
Reading the S3 files programmatically, writing new S3 files with the file name as an extra column, and loading those new files (see the sketch below).
Alternatively, use Hive: create an external table on the S3 bucket location, use INPUT__FILE__NAME to get the file names, create a new table, and then write back to S3. You can also do some pre-processing in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
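A minimal sketch of the first option using boto3 (bucket and prefixes are placeholders), appending each object's key as a trailing column and writing the result under a new prefix for the COPY to load:
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="incoming/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
        # Append the source key to every row, then store the stamped copy
        # under a prefix that the Redshift COPY will load from.
        stamped = "\n".join(f"{line},{key}" for line in body.splitlines())
        s3.put_object(Bucket=bucket, Key=f"with-filename/{key}", Body=stamped.encode())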
Hope this helps.
That isn't possible. During a COPY operation, Redshift only loads file contents into a table; it doesn't provide access to S3 file names.
To achieve what you want, you need to preprocess the data to add additional information inside the files.