Does AWS Redshift Spectrum support JSON

I saw the text below in a blog post and it did not mention JSON. Does AWS Redshift Spectrum support processing JSON? In normal Redshift, we need to create the table structure before processing JSON.
Amazon Redshift Spectrum supports structured and semi-structured data formats which include Parquet, Textfile, Sequencefile and Rcfile. Amazon recommends using a columnar format because it will allow you to choose only the columns you need to transfer data from S3.
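For example, my understanding is that a Spectrum external table declares its format when it is created; a minimal sketch (the external schema, table, and bucket names here are hypothetical) would be:
CREATE EXTERNAL TABLE spectrum_schema.sales (
  sale_id INT,
  amount DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
What I want to know is whether the same kind of definition works when the underlying files are JSON.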
Thanks

Related

Migrate data from Hive to BigQuery

I need to migrate 70 TB of data (2400 tables) from on-premises Hive to BigQuery. The initial plan is to load ORC files from Hive to Cloud Storage and then to BigQuery tables.
What is a better way of achieving this, through automation or any other GCP service?
I would suggest leveraging data pipelines for this purpose.
Here’s some reference on how to use it - https://cloud.google.com/architecture/dw2bq/dw-bq-data-pipelines#what-is-a-data-pipeline
Also, you can explore different ways to transfer your on-premises data to BigQuery here - https://cloud.google.com/architecture/dw2bq/dw-bq-migration-overview
Also, please note that ORC is not supported in BigQuery, so you would have to convert your ORC data into one of these three formats: Avro, JSON, or CSV.
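As a rough sketch of the final load step, assuming the files have already been converted to Avro and staged in a hypothetical Cloud Storage bucket, a BigQuery LOAD DATA statement would look something like this:
-- Load staged Avro files into a BigQuery table (dataset, table, and bucket names are hypothetical)
LOAD DATA INTO my_dataset.my_table
FROM FILES (
  format = 'AVRO',
  uris = ['gs://my-migration-bucket/hive-export/my_table/*.avro']
);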

Unload a table from redshift to S3 in parquet format without python script

I found that we can use the spectrify Python module to convert to Parquet format, but I want to know which command will unload a table to an S3 location in Parquet format.
One more thing: I found that we can load Parquet-formatted data from S3 into Redshift using the COPY command, https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-load-listing-from-parquet
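For reference, the kind of Parquet load I mean looks roughly like this (the bucket path and role ARN are placeholders):
COPY listing
FROM 's3://my-bucket/parquet/listing/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
FORMAT AS PARQUET;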
Can we do the same for UNLOAD from Redshift to S3?
There is no need to use AWS Glue or third-party Python modules to unload Redshift data to S3 in Parquet format. This is now supported natively:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'  -- authorization is required; the ARN here is a placeholder
FORMAT PARQUET
Documentation can be found at UNLOAD - Amazon Redshift
Have you considered AWS Glue? You can create a Glue Catalog based on your Redshift sources and then convert the data into Parquet. There is an AWS blog post for your reference; although it talks about converting CSV to Parquet, you get the idea.

When to use Amazon Redshift spectrum over AWS Glue ETL to query on Amazon S3 data

As AWS Glue ETL can be a Python script, it can be used to perform SQL queries using database interfaces, and the data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift Spectrum to query S3 data.
AWS Glue is used for gathering metadata (crawling) and for ETL. It is not for reporting or analytics. It can apply highly complex transformations (ideal for complex ETL requirements).
Redshift Spectrum is primarily used to produce reports and analysis against data stored in S3, usually combined with data stored on Redshift. However, it CAN also be used for simple ETL, and it is much simpler to set up and use than Glue if simple ETL is all you need.
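As a rough illustration of what that simple ETL can look like (the external schema, table, and column names below are hypothetical), Spectrum lets you materialize S3 data into a local Redshift table with a plain CTAS:
-- Copy a filtered subset of an external (S3) table into a local Redshift table
CREATE TABLE recent_orders AS
SELECT order_id, customer_id, order_date, amount
FROM spectrum_schema.orders_external
WHERE order_date >= '2020-01-01';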
There is one other option that you do not mention: Amazon Athena. It is a great tool to run queries directly against S3 data, similar to Redshift Spectrum but usually faster and cheaper, depending on your use case. It cannot, however, combine S3 data with Redshift data.

Can I convert CSV files sitting on Amazon S3 to Parquet format using Athena and without using Amazon EMR

I would like to convert the CSV data files that are currently sitting on Amazon S3 into Parquet format using Amazon Athena and push them back to Amazon S3, without any help from Amazon EMR. Is this possible? Has anyone done something similar?
Amazon Athena can query data but cannot convert data formats.
You can use Amazon EMR to Convert to Columnar Formats. The steps, sketched after this list, are:
Create an external table pointing to the source data
Create a destination external table with STORED AS PARQUET
INSERT OVERWRITE TABLE <destination_table> SELECT * FROM <source_table>
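Put together in HiveQL on EMR, those steps might look like this sketch (table names, columns, and bucket paths are hypothetical):
-- 1. External table over the existing CSV files in S3
CREATE EXTERNAL TABLE csv_source (
  id STRING,
  value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/csv/';
-- 2. Destination external table stored as Parquet, also backed by S3
CREATE EXTERNAL TABLE parquet_dest (
  id STRING,
  value STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/';
-- 3. Rewrite the data into the Parquet table
INSERT OVERWRITE TABLE parquet_dest SELECT * FROM csv_source;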

Confusion related to Redshift about datasets (structured, unstructured, semi-structured) and the format to be used

Can anyone explain clearly what kind of data Redshift can handle (structured, unstructured, or any other format)?
How can CloudFront logs be copied into Amazon Redshift, even though the logs are unstructured, without going through Amazon EMR?
How to find the size of a database created in Amazon Redshift?
Could someone please explain all three questions mentioned above clearly? It would be better if you could explain with some examples, sample code, or any source; it would be very helpful for my project.
Amazon Redshift provides a standard SQL interface (based on PostgreSQL). Therefore, it is best suited for structured data that is stored in Tables, Rows and Columns.
It is also possible to store JSON records within a field and access them via JSON functions.
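For example (the table and column names here are hypothetical), a JSON document stored in a VARCHAR column can be queried with Redshift's JSON functions:
-- Pull individual attributes out of a JSON string stored in a VARCHAR column
SELECT json_extract_path_text(event_payload, 'user', 'id') AS user_id,
       json_extract_path_text(event_payload, 'event_type') AS event_type
FROM events;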
To load data into Amazon Redshift, it needs to be in a supported file format, such as comma-delimited, tab-delimited, fixed-width fields, or JSON. Any data that is not in a suitable format will need to be pre-processed and converted to a suitable format. This could be done with tools such as Amazon Athena (Presto) or Amazon EMR (Hadoop).
Amazon CloudFront logs are in tab-delimited format and can be loaded directly into Amazon Redshift. For an example, see: Analyzing S3 and CloudFront Access Logs with AWS Redshift
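A rough sketch of such a load (the table, bucket, and role below are placeholders, and the options reflect CloudFront's gzipped, tab-delimited log layout with two header lines):
-- Load CloudFront access logs; a matching cloudfront_logs table must already exist
COPY cloudfront_logs
FROM 's3://my-log-bucket/cloudfront/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
DELIMITER '\t'
IGNOREHEADER 2
GZIP
MAXERROR 10;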
Information about disk space consumed by tables can be obtained via the SVV_DISKUSAGE system view.
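Since each row in SVV_DISKUSAGE represents one 1 MB disk block, a rough per-table size estimate can be obtained with a query along these lines:
-- Approximate size per table in MB (one SVV_DISKUSAGE row = one 1 MB block)
SELECT TRIM(name) AS table_name,
       COUNT(*) AS size_mb
FROM svv_diskusage
GROUP BY TRIM(name)
ORDER BY size_mb DESC;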