Confusion about the kinds of data Redshift can handle (structured, unstructured, semi-structured) and the formats to use

Can anyone clearly explain what kind of data Redshift can handle (structured, unstructured, or any particular formats)?
How can I copy CloudFront logs into Amazon Redshift, even though the logs are unstructured, without going through Amazon EMR?
How do I find the size of a database created in Amazon Redshift?
Could someone please explain all three questions above? An example, some sample code, or a pointer to a source would be very helpful for my project.

Amazon Redshift provides a standard SQL interface (based on PostgreSQL). Therefore, it is best suited for structured data that is stored in Tables, Rows and Columns.
It is also possible to store JSON records within a field and access them via JSON functions.
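For instance, here is a minimal sketch using Redshift's JSON_EXTRACT_PATH_TEXT function (the events table and its payload VARCHAR column are hypothetical):

    -- Pull a nested value out of a JSON string stored in a VARCHAR column
    SELECT json_extract_path_text(payload, 'user', 'id') AS user_id
    FROM events;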
To load data into Amazon Redshift, it needs to be in a supported file format, such as comma-delimited, tab-delimited, fixed-width fields, or JSON. Any data that is not in a suitable format will need to be pre-processed and converted, which could be done with tools such as Amazon Athena (Presto) or Amazon EMR (Hadoop).
Amazon CloudFront logs are in Tab-Delimited format and can be loaded directly into Amazon Redshift. For an example, see: Analyzing S3 and CloudFront Access Logs with AWS Redshift
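As a hedged sketch, a COPY along these lines should work, assuming a cloudfront_logs table whose columns match the log fields (the bucket, prefix, and IAM role ARN here are placeholders):

    -- CloudFront access logs are gzipped, tab-delimited, with two header lines
    COPY cloudfront_logs
    FROM 's3://my-log-bucket/cf-logs/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER '\t'
    GZIP
    IGNOREHEADER 2
    MAXERROR 10;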
Information about disk space consumed by tables can be obtained via the SVV_DISKUSAGE system view.
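Since SVV_DISKUSAGE contains one row per allocated 1 MB disk block, a rough per-database total can be sketched like this (join db_id to pg_database if you want database names; this is an approximation, not an official sizing query):

    -- Approximate size per database: each row in SVV_DISKUSAGE is one 1 MB block
    SELECT db_id, COUNT(*) AS mb_used
    FROM svv_diskusage
    GROUP BY db_id
    ORDER BY mb_used DESC;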


Does Amazon Redshift have its own storage backend

I'm new to Redshift and would like some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing, and transforming, holding only temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage in the cloud, the way Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
I ask because, if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question may seem basic, but I've been unable to find an answer on the Amazon websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance in the millions of rows, Amazon Redshift can handle billions of rows. Data is distributed across multiple nodes and stored in a columnar format, making it suitable for the "wide" tables that are typical in data warehouses. It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that an application uses for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and uses caching to run fast queries. The benefit is that it separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture. However, to get you started, Redshift:
has its own data storage; it is not tied to S3.
can also query data held in S3 via Amazon Redshift Spectrum (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, so just use it for that!

Amazon S3 to Amazon Athena to Tableau

I am working on a project to get data from an Amazon S3 bucket into Tableau.
The data needs to be reorganised and combined from multiple .CSV files. Can Amazon Athena connect the S3 data to Tableau directly, and is it relatively easy/cheap? Or should I instead look at another software package to achieve this?
I am looking to visualise the data and provide a forecast based on the observed trend (I may need to incorporate functions to fit a linear regression).
It appears that Tableau can query data from Amazon Athena.
See: Connect to your S3 data with the Amazon Athena connector in Tableau 10.3 | Tableau Software
Amazon Athena can query multiple CSV files in a given path (directory) and run SQL against the data. So, it sounds like this is a feasible solution for you.
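As a rough sketch, a table over your CSV files could be declared like this (the bucket path and columns are placeholders; adjust them to your data):

    -- Expose all CSV files under one S3 prefix as a single queryable table
    CREATE EXTERNAL TABLE sales_csv (
      order_id   string,
      order_date string,
      amount     double
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/sales/'
    TBLPROPERTIES ('skip.header.line.count' = '1');

Tableau's Athena connector can then query sales_csv like any other table.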
Yes, you can integrate Athena with Tableau to query your data in S3. There are plenty of resources online that describe how to do that, e.g. link 1, link 2, link 3. But obviously, the tables that define the meta information of your data have to be defined beforehand.
Amazon Athena pricing is based on the amount of data scanned by each query, i.e. $5 per TB of data scanned (so a query that scans 100 GB costs about $0.50). So it all comes down to how much data you have and how it is structured, i.e. partitioning, bucketing, file format, etc. Here is a nice blog post that covers these aspects.
While you prototype a dashboard, there is one thing to keep in mind. By default, each time you change the list of parameters, filters, etc., Tableau automatically sends a request to AWS Athena to execute your query. Luckily, you can disable auto-querying of the data source and run it manually.

How to upload data via SQL to Amazon Redshift?

I created a cluster and connected to the database via SQL Workbench, but how can I upload data to Amazon Redshift via SQL?
I guess I have to use Amazon S3, but I could not find a sample video or text that describes it well.
There are two ways to insert information into Amazon Redshift:
Via the COPY command
Via INSERT statements
It is not recommended to use INSERT statements because they are not efficient for large data volumes. They are okay for doing ETL-type processes such as copying data between tables, but as a general rule data should be loaded via COPY.
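For example, a hedged sketch of the acceptable INSERT patterns (the table names are illustrative); even so, COPY remains the right tool for bulk loads:

    -- Works, but slow for loading bulk data row-by-row
    INSERT INTO staging_sales (sale_id, amount) VALUES (1, 9.99), (2, 19.99);

    -- Fine for ETL-type copying between tables already inside Redshift
    INSERT INTO sales SELECT * FROM staging_sales;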
As per Using a COPY Command to Load Data, the COPY command can load data from:
Amazon S3 (recommended, highly parallel)
Amazon EMR (Hadoop)
Amazon DynamoDB
Remote hosts (via SSH)
The load from Amazon S3 is performed in parallel across all nodes and is the most efficient way to load data.
The Amazon Redshift COPY command can read several file formats:
Delimited (eg CSV)
Fixed-Width
AVRO
JSON
And these formats can also be compressed (eg gzip)
Bottom line: Get your data into Amazon S3 in a compatible format, then use COPY to load it.
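A minimal sketch of such a load, assuming gzipped CSV files and placeholder names for the table, bucket, and IAM role:

    -- Load all gzipped CSV files under the given S3 prefix, in parallel
    COPY my_table
    FROM 's3://my-bucket/data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    GZIP;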
Also, try to understand DISTKEY and SORTKEY to get full performance benefits out of Redshift. Definitely read the manual -- it will save you more time than it takes to read!
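To illustrate (a sketch with hypothetical columns, not a recommendation for your schema): DISTKEY controls which node each row lands on, and SORTKEY controls the on-disk order within each node.

    -- Distribute rows by customer, sort by date to speed up range filters
    CREATE TABLE sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      sale_date   DATE,
      amount      DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date);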

AWS: best approach to process data from S3 to RDS

I'm trying to implement what I think is a very simple process, but I don't really know the best approach.
I want to read a big CSV file (around 30 GB) from S3, make some transformations, and load it into RDS MySQL, and I want this process to be replicable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found the dataduct wrapper from Coursera, but after some research it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they're simpler.
Then I saw a video of AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused; can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require the data to be moved from S3 to RDS in order to run SQL-like queries.
You can use Redshift Spectrum to read and query information from S3 now.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1. Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
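A hedged sketch of Steps 3 and 4 (the role ARN, catalog database, columns, and bucket path are placeholders):

    -- Step 3: external schema backed by the Glue/Athena data catalog
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Step 3 (continued): external table over CSV files in S3
    CREATE EXTERNAL TABLE spectrum_schema.orders (
      order_id INT,
      amount   DECIMAL(10,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/orders/';

    -- Step 4: query the S3 data without loading it into Redshift
    SELECT COUNT(*) FROM spectrum_schema.orders;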
Or you can use Athena to query the data in S3, if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
You could use an ETL tool to do the transformations on your csv data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations and then the tool will load the data into your MySQL database. For example there is Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages), and has JDBC/ODBC compliant drivers. With this you could create a script that would perform your transformations and then load the data into your MySQL database. And you would be using familiar SQL (I'm assuming you already can create SQL scripts) so there isn't a big learning curve.

CREATE TABLE AS substitution

I am currently working with AWS Athena, and it does not support CREATE TABLE AS, which is fine, so I thought I would approach it by doing INSERT OVERWRITE DIRECTORY S3://PATH and then loading from S3, but apparently that doesn't work either. How would I create a table from a query if both of these options are out the window?
Amazon Athena is read-only. It cannot be used to create tables in Amazon S3.
However, the output of an Amazon Athena query is stored in Amazon S3 and could be used as input for another query, though you'd have to know the path of the output.
Amazon Athena is ideal for individual queries against data stored in Amazon S3, but is not the best tool for ETL actions, which typically involve transforming data, storing it and then sequentially processing it again.
You don't have to use INSERT; just create an external table over the location of the previous query's results:
https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-results/
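A minimal sketch of that approach, assuming the result CSV of the earlier query has been copied to a clean S3 prefix of its own (so the .metadata files Athena writes alongside results are not picked up); all names here are placeholders, and the columns are declared as string because the results are plain CSV:

    -- Table over the copied query-result CSV; skip its header row
    CREATE EXTERNAL TABLE my_query_result (
      col1 string,
      col2 string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    LOCATION 's3://my-bucket/query-results-clean/'
    TBLPROPERTIES ('skip.header.line.count' = '1');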