Unload a table from Redshift to S3 in Parquet format without a Python script - amazon-web-services

I found that we can use the Spectrify Python module to convert to Parquet format, but I want to know which command will unload a table to an S3 location in Parquet format.
One more thing: I found that we can load Parquet-formatted data from S3 to Redshift using the COPY command, https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-load-listing-from-parquet
Can we do the same to unload from Redshift to S3?

There is no need to use AWS Glue or third-party Python libraries to unload Redshift data to S3 in Parquet format. This is now supported natively:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET
Documentation can be found at UNLOAD - Amazon Redshift
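No script is required; the statement above runs from any SQL client (note that a real UNLOAD also needs an authorization clause such as IAM_ROLE). Purely for completeness, here is a minimal sketch of triggering the same statement from code with boto3's Redshift Data API; the cluster name, database, user, S3 path, and IAM role ARN are placeholders, not values from the question.

import time
import boto3

client = boto3.client("redshift-data")

# The same UNLOAD as above, with a placeholder IAM role for authorization.
sql = """
UNLOAD ('SELECT * FROM my_schema.my_table')
TO 's3://my-bucket/exports/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT PARQUET;
"""

resp = client.execute_statement(
    ClusterIdentifier="my-cluster",  # placeholder cluster name
    Database="dev",                  # placeholder database
    DbUser="awsuser",                # placeholder database user
    Sql=sql,
)

# The statement runs asynchronously on the cluster, so poll until it finishes.
while True:
    status = client.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        print(status)
        break
    time.sleep(2)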

Have you considered AWS Glue? You can create a Glue Catalog based on your Redshift sources and then convert the data into Parquet. There is an AWS blog post for reference; although it talks about converting CSV to Parquet, you get the idea.

Related

Why does Spark need S3 to connect to a Redshift warehouse, while Python pandas can read a Redshift table directly?

Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the PySpark library and I see that PySpark needs a tempdir in S3 to be able to read data from Redshift. My question is why PySpark needs this temporary S3 directory. Other libraries, like pandas for instance, can read Redshift tables directly without using any temporary directory.
Thanks to everyone.
Luis
The Redshift data source uses Amazon S3 to efficiently transfer data in and out of Redshift and uses JDBC to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.
See https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html
This is an implementation choice made for performance: bulk COPY and UNLOAD through S3 are much faster than pulling rows out of Redshift over a plain JDBC connection, as Spark would otherwise have to do.
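To make the tempdir requirement concrete, here is a minimal read sketch with the spark-redshift connector. The JDBC URL, table, bucket and IAM role are placeholders, and the exact format name depends on the connector build you use (on Databricks it is simply "redshift").

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read-sketch").getOrCreate()

# The connector UNLOADs the query result to tempdir in S3, then reads those
# files in parallel -- which is why the tempdir option is mandatory.
df = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")  # "redshift" on Databricks
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?user=awsuser&password=...")  # placeholder
    .option("dbtable", "my_schema.my_table")                                  # placeholder table
    .option("tempdir", "s3a://my-bucket/spark-redshift-tmp/")                 # the S3 staging area
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/MyRedshiftRole")  # placeholder role
    .load()
)

df.show(5)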

How to unload table data in Redshift to an S3 bucket in Excel format

I have a table stored in Redshift. I want to share the data with my colleagues in Excel format via an S3 bucket.
I know how to share it in CSV format but not in Excel format. Please help.
This can be done via a Lambda function that you program. You can use the Redshift Data API client to read the data from a Redshift table. In your Lambda function, you can write the data to an Excel file using an Excel API such as Apache POI. Then use the Amazon S3 API to write the Excel file to an Amazon S3 bucket.
Amazon Redshift only UNLOADs in CSV or Parquet format.
Excel can open CSV files.
If you want the files in Excel format, you will need 'something' to do the conversion. This could be a Python program, an SQL client, or probably many other options. However, S3 will not do it for you.
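As an example of that 'something', here is a minimal Python sketch that converts a CSV produced by UNLOAD into an .xlsx workbook and puts it back in S3. The bucket and key names are hypothetical, and it assumes pandas, openpyxl and boto3 are available.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Read the CSV that UNLOAD wrote to S3.
obj = s3.get_object(Bucket="my-bucket", Key="exports/my_table_000.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Write the same rows into an in-memory Excel workbook
# (note Excel's limit of roughly one million rows per sheet).
buffer = io.BytesIO()
df.to_excel(buffer, index=False, engine="openpyxl")
buffer.seek(0)

# Upload the workbook back to S3.
s3.put_object(Bucket="my-bucket", Key="exports/my_table.xlsx", Body=buffer.getvalue())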

Can I convert CSV files sitting on Amazon S3 to Parquet format using Athena and without using Amazon EMR

I would like to convert the CSV data files that are currently sitting on Amazon S3 into Parquet format using Amazon Athena and push them back to Amazon S3, without any help from Amazon EMR. Is this possible? Has anyone done something similar?
Amazon Athena can query data but cannot convert data formats.
You can use Amazon EMR to Convert to Columnar Formats. The steps are:
Create an external table pointing to the source data
Create a destination external table with STORED AS PARQUET
INSERT OVERWRITE TABLE <destination_table> SELECT * FROM <source_table>
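The same three steps can also be run on EMR with Spark instead of Hive; here is a minimal PySpark sketch under that substitution, with hypothetical S3 paths.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Stands in for the external source table: read the CSV files in place on S3.
source = spark.read.option("header", "true").csv("s3://my-bucket/csv-data/")

# Stands in for the STORED AS PARQUET destination table plus the INSERT OVERWRITE:
# rewrite the same rows as Parquet files back to S3.
source.write.mode("overwrite").parquet("s3://my-bucket/parquet-data/")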

AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

I'm trying to create an AWS Glue ETL Job that would load data from Parquet files stored in S3 into a Redshift table.
The Parquet files were written using pandas with the 'simple' file schema option into multiple folders in an S3 bucket.
The layout looks like this:
s3://bucket/parquet_table/01/file_1.parquet
s3://bucket/parquet_table/01/file_2.parquet
s3://bucket/parquet_table/01/file_3.parquet
s3://bucket/parquet_table/02/file_1.parquet
s3://bucket/parquet_table/02/file_2.parquet
s3://bucket/parquet_table/02/file_3.parquet
I can use an AWS Glue Crawler to create a table in the AWS Glue Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL Job that would copy the same table to Redshift.
If I crawl a single file, or multiple files in one folder, it works; as soon as there are multiple folders involved, I get the above-mentioned error
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
Similar issues appear if, instead of the 'simple' schema, I use 'hive'. Then we have multiple folders and also empty Parquet files that throw
java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)
Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?
Redshift doesn't support parquet format. Redshift Spectrum does. Athena also supports parquet format.
The error that you're facing is because, when reading the Parquet files from S3, Spark/Glue expects the data to be in Hive-style partitions, i.e. the partition directory names should be key=value pairs. You'll have to lay out the S3 hierarchy as Hive-style partitions, something like below:
s3://your-bucket/parquet_table/id=1/file1.parquet
s3://your-bucket/parquet_table/id=2/file2.parquet
and so on.
Then use the path below to read all the files in the bucket:
location : s3://your-bucket/parquet_table
If the data in S3 is partitioned the above way, you won't face any issues.
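Since the files in the question were written with pandas, here is a minimal sketch of producing that Hive-style layout directly from pandas. The bucket name and the "id" partition column are hypothetical; it assumes pyarrow and s3fs are installed.

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "value": ["a", "b", "c", "d"]})

# partition_cols makes pyarrow create directories like .../id=1/ and .../id=2/,
# which is exactly the key=value layout Spark/Glue uses to infer partitions.
df.to_parquet(
    "s3://your-bucket/parquet_table/",
    engine="pyarrow",
    partition_cols=["id"],
    index=False,
)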

Can we use Sqoop to export data from Hadoop (Hive) to Amazon Redshift?

I have a local Hadoop cluster and want to load data into Amazon Redshift. Informatica/Talend is not an option considering the costs, so can we leverage Sqoop to export the tables from Hive into Redshift directly? Does Sqoop connect to Redshift?
The most efficient way to load data into Amazon Redshift is by placing data into Amazon S3 and then issuing the COPY command in Redshift. This performs a parallel data load across all Redshift nodes.
While Sqoop might be able to insert data into Redshift by using traditional INSERT SQL commands, it is not a good way to insert data into Redshift.
The preferred method would be:
Export the data into Amazon S3 in CSV format (preferably gzip (.gz) or bzip2 (.bz2) compressed)
Trigger a COPY command in Redshift
You should be able to export data to S3 by copying data to a Hive External Table in CSV format.
Alternatively, Redshift can load data from HDFS. It needs some additional setup to grant Redshift access to the EMR cluster. See the Redshift documentation: Loading Data from Amazon EMR
The COPY command does not support upsert; it simply loads the data as many times as you run it, so you end up with duplicate data. A better way is to use a Glue job and modify it to update-else-insert, or use a Lambda function to upsert into Redshift, typically by loading into a staging table and merging, as sketched below.
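A common shape for that upsert is: COPY into a staging table, then delete-and-insert into the target inside one transaction. Here is a minimal sketch with the redshift_connector driver; the connection details, table names, S3 path, IAM role and the "id" key column are all hypothetical.

import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

# 1. Load the new batch into an empty staging table with the target's layout.
cur.execute("CREATE TEMP TABLE stage (LIKE target_table);")
cur.execute("""
    COPY stage
    FROM 's3://my-bucket/incoming/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET;
""")

# 2. Delete the rows being replaced, then insert everything from staging.
cur.execute("DELETE FROM target_table USING stage WHERE target_table.id = stage.id;")
cur.execute("INSERT INTO target_table SELECT * FROM stage;")

# 3. Commit so the delete + insert become visible atomically
#    (redshift_connector runs with autocommit off by default).
conn.commit()
conn.close()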