We are researching how to create a data lake solution on AWS, similar to what's outlined here: https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
We will be storing all the "raw" data in S3 and loading it into EMR or Redshift as needed.
At this stage, I am looking for suggestions on whether to use the ETL or the ELT approach for loading data into Amazon Redshift. We will be using Talend for ETL/ELT.
Should we stage the "raw" data from S3 in Redshift first before transforming it or should we transform the data in S3 and load it into Redshift?
I would appreciate any suggestions or advice.
Thank you.
Definitely ELT.
The only case where ETL may be better is if you are simply taking one pass over your raw data, then using COPY to load it into Redshift, and then doing nothing transformational with it. Even then, because you'll be shifting data in and out of S3, I doubt this use case will be faster.
As soon as you need to filter, join, and otherwise transform information, it is much faster to do it in the DBMS. If you hit a case where the data transformation relies on data that is already in the DW, it will be orders of magnitude faster.
We run hundreds of ELT jobs a day on different DW platforms, performance testing alternative methods of ingesting and transforming data. In our experience the difference between ETL and ELT in an MPP DW can be 2000+ percent.
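To make the ELT pattern concrete, here is a minimal sketch (not the poster's actual pipeline): raw files are bulk-loaded from S3 into a Redshift staging table with COPY, and the transformation then runs as SQL inside the cluster. The bucket, table names, IAM role, and connection details are placeholders, and it assumes the psycopg2 driver.

```python
# Minimal ELT sketch for Redshift: COPY raw data from S3 into a staging
# table, then transform it with SQL inside the cluster. All identifiers
# (bucket, tables, IAM role, connection details) are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
conn.autocommit = True
cur = conn.cursor()

# 1. Load ("EL"): bulk-copy the raw files from S3 into a staging table.
cur.execute("""
    COPY staging.raw_orders
    FROM 's3://my-data-lake/raw/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV IGNOREHEADER 1;
""")

# 2. Transform ("T"): filter, join, and aggregate inside Redshift, where
#    the MPP engine can also use data already in the warehouse.
cur.execute("""
    INSERT INTO marts.daily_order_totals (order_date, customer_id, total_amount)
    SELECT o.order_date, c.customer_id, SUM(o.amount)
    FROM staging.raw_orders o
    JOIN dims.customers c ON c.source_key = o.customer_key
    GROUP BY o.order_date, c.customer_id;
""")

cur.close()
conn.close()
```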
It depends on the purpose of having Redshift. If your business case is for users to query the data in Redshift (or a front-end application using Redshift as the backend), then I would not recommend doing ETL in Redshift. In this case, it would be better to perform your business transformations ahead of time (e.g., S3 -> EMR -> S3) and then load the processed data into Redshift.
I have some configuration data I want to read from my Spring Boot application. The configuration can grow over time, so I don't want to keep it in a YAML file in the project. What would be the cheapest way to store this data in AWS? Something like DynamoDB, or do you have better and cheaper ideas?
DynamoDB would be a good solution; it allows you to quickly retrieve and edit the config set. DynamoDB also has a free tier of 25 RCU/WCU and 25 GB of storage per month, which I assume your use case falls within, making it effectively a free option.
The only limitation with DynamoDB is the 400 KB maximum item size; however, you can split your configuration into an "item collection", where all of the items share the same partition key and the sort key identifies the sub-components.
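As an illustration of that item-collection idea, here is a hedged boto3 sketch that splits a configuration into sections sharing one partition key; the table name "app-config", the key names, and the config contents are made up for the example.

```python
# Sketch: store a larger config as an item collection in DynamoDB.
# Assumes a (hypothetical) table "app-config" with partition key
# "config_id" and sort key "section".
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("app-config")  # hypothetical table name

config_sections = {
    "datasource": {"url": "jdbc:postgresql://db:5432/app", "pool_size": 10},
    "feature_flags": {"new_checkout": True, "beta_search": False},
}

# Write each section as its own item so no single item nears the 400 KB limit.
with table.batch_writer() as batch:
    for section, values in config_sections.items():
        batch.put_item(Item={"config_id": "spring-app", "section": section, **values})

# Read the whole collection back with a single Query on the partition key.
response = table.query(KeyConditionExpression=Key("config_id").eq("spring-app"))
config = {item["section"]: item for item in response["Items"]}
print(config)
```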
AWS AppConfig is the purpose-built service for application configuration, but it is not cheaper than using DynamoDB within its free tier, so the trade-off comes down to what you need from your configuration data.
Ignoring the "cheapest" part of the question, AWS AppConfig is the right tool for storing configuration, as it comes with many useful features out of the box (such as gradual deployments, rollbacks, and validation). You can find pricing information here. Note that the total price will also depend on how often you refresh the configuration (i.e., your cache invalidation time). Other tools, like S3, might be cheaper, but are not as well suited to the configuration use case.
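For a sense of what consuming AppConfig looks like at runtime, below is a rough boto3 sketch using the AppConfig Data API; the application, environment, and configuration-profile identifiers are placeholders, and it assumes the profile stores JSON.

```python
# Sketch: fetch configuration from AWS AppConfig via the AppConfig Data API.
# Application/environment/profile identifiers are placeholders.
import json
import boto3

client = boto3.client("appconfigdata")

# Start a session for one app/environment/configuration-profile combination.
session = client.start_configuration_session(
    ApplicationIdentifier="my-spring-app",        # placeholder
    EnvironmentIdentifier="prod",                 # placeholder
    ConfigurationProfileIdentifier="main-config", # placeholder
)
token = session["InitialConfigurationToken"]

# Poll for the latest configuration; an empty body means "unchanged since
# the last poll", so keep the previously cached value in that case.
response = client.get_latest_configuration(ConfigurationToken=token)
token = response["NextPollConfigurationToken"]  # reuse on the next poll
body = response["Configuration"].read()
if body:
    config = json.loads(body)  # assumes the profile stores JSON
    print(config)
```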
what would be the cheapest way to save this data in AWS
The cheapest way to store data in AWS is S3 Glacier Deep Archive (or the Deep Archive Access tier of S3 Intelligent-Tiering) at $0.00099 per GB-month.
Reference: https://aws.amazon.com/s3/pricing/
I started working on an ML Ops project on AWS SageMaker, and I have a question about how to store and process the data. I have a company's DB with tens of millions of invoices and clients that should be cleaned and transformed a bit for some classification and regression jobs. Which would be the better approach: to create a new DB and develop ETL jobs that take the data from the standard DB, clean and transform it, and put it into the "ML DB" (which I then use directly for my models), or to write jobs that take the data from the standard DB, process it, and save it as huge CSV files in S3 buckets?

Intuitively, it seems that relational DB -> process -> NoSQL/relational DB is a better approach than relational DB -> process -> huge CSV file. I didn't find anything about this on Google, and all the AWS SageMaker docs use CSV files on S3 as examples and don't mention anywhere how to build ML pipelines directly on relationally stored data. What would be the best approach, and why?
Your first approach sounds fine:
The original data is left as-is in case you need to repeat the process later (perhaps with changes / improvements).
The system will work for any data source once you plumb it in; i.e. you can re-use the transformation and load parts.
I don't know anything about SageMaker, etc., or the whole CSV thing, but once you have the data in your ML DB you can obviously export it to any format you like later, such as CSV.
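As a rough illustration of that export step, here is a hedged pandas sketch that pulls already-cleaned data from a hypothetical "ML DB" and writes it to S3 in the file-based form the SageMaker examples expect; the connection string, table, filter, and bucket are all assumptions, and it relies on sqlalchemy, pyarrow, and s3fs being installed.

```python
# Sketch: export cleaned data from an "ML DB" to S3 for SageMaker training.
# The connection string, table, and bucket are hypothetical; requires
# sqlalchemy, pyarrow, and s3fs.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://ml_user:password@ml-db.example.com:5432/ml_db"
)

# Pull the already-cleaned features in manageable chunks to avoid loading
# tens of millions of rows into memory at once.
chunks = pd.read_sql_query(
    "SELECT * FROM cleaned_invoices WHERE invoice_year = 2023",
    engine,
    chunksize=500_000,
)

# Write each chunk to S3; Parquet is usually a better fit than one huge CSV
# (columnar, compressed, splittable), and SageMaker can read either.
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"s3://my-ml-bucket/training/invoices/part-{i:05d}.parquet")
```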
Based on many sources, I understand that both BigQuery and Bigtable are considered solutions for data storage.
For example, it is written here that we can consider BigQuery as data storage if we need statistics (for example sums, averages, counts) over huge amounts of data. In contrast, Bigtable can be considered a usual NoSQL store.
On the other hand, I've read a snippet from a book which mentions that:
From this snippet I understood that BigQuery is a tool for querying from anywhere but not for data storage. Could you please clarify?
BigQuery is a data warehouse solution in Google Cloud Platform. In BigQuery you can have two kinds of persistent tables:
Native/internal: in this kind of table, you load data from some source (a file in GCS, a file you upload in the load job, or you can even create an empty table). Once created, the table stores its data in BigQuery's own storage system.
External: this kind of table is basically a pointer to some external storage such as GCS, Google Drive, or Bigtable (a sketch of creating one is shown after this list).
Furthermore, you can also create temporary tables that exist only during query execution.
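To illustrate the external-table case, here is a hedged sketch using the google-cloud-bigquery client that defines a table whose data stays in GCS and then queries it; the project, dataset, table, and bucket names are placeholders.

```python
# Sketch: create a BigQuery external table backed by CSV files in GCS,
# then query it. Project/dataset/table/bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Define an external configuration pointing at files in GCS.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/exports/*.csv"]
external_config.autodetect = True  # let BigQuery infer the schema

# The table itself stores no data; it is a pointer to the GCS files.
table = bigquery.Table("my-project.my_dataset.orders_external")
table.external_data_configuration = external_config
client.create_table(table)

# Query it like any other table; BigQuery reads the GCS files at query time.
query = """
    SELECT COUNT(*) AS row_count
    FROM `my-project.my_dataset.orders_external`
"""
for row in client.query(query).result():
    print(row.row_count)
```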
Both BigQuery and BigTable can be used as storage for data.
The point is that BigQuery is more like an engine to run queries and make aggregations on huge amounts of data that can be on external sources or even inside BigQuery's own storage system. That makes it good for data analysis.
Bigtable is more like a NoSQL database that can deal with petabytes of data and gives you mechanisms to perform data analysis as well. As a rule of thumb, Bigtable should be chosen instead of BigQuery if you need low latency.
I wish to record time-series data at a very high frequency. I am wondering if there is an elegant serverless solution that allows me to store, and react to, real-time data.
I want to use stored data to create statistical models, and then I want to process new data in real-time based on those models.
Amazon Kinesis Streams seems to fit the bill; however, I am unsure whether it is only for reacting in real time, or whether it also collects historical data that I might be able to use offline to build models.
Google Dataflow and Pub/Sub also seem relevant, but I am not sure whether they would be appropriate for the above.
If you go with AWS, you might use Kinesis and EMR to achieve your goal. First, you can create a delivery stream in the fully managed Kinesis Data Firehose and route it to S3 or Redshift to collect historical data.
Once your data is in S3, you can do the statistical analysis by pointing an EMR job at the S3 bucket to process the fresh data that S3 receives. Read this article for more information.
On EMR's managed Hadoop framework, you can set up open-source R and RStudio for statistical analysis if you wish. Here is a guide on that.
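To make the ingestion side concrete, below is a minimal producer sketch with boto3 that pushes time-series readings into a Kinesis Data Firehose delivery stream; it assumes the stream (name is a placeholder) has already been configured to buffer and deliver records into S3.

```python
# Sketch: push high-frequency time-series readings into a Kinesis Data
# Firehose delivery stream that (by assumption) is configured to buffer
# and deliver the records into S3 for later offline modelling.
import json
import time
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "timeseries-to-s3"  # placeholder delivery stream name

def send_reading(sensor_id: str, value: float) -> None:
    record = {
        "sensor_id": sensor_id,
        "value": value,
        "timestamp": time.time(),
    }
    # Firehose batches these records and writes them to the configured
    # S3 prefix; newline-delimited JSON keeps them easy to query later.
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

send_reading("sensor-42", 21.7)
```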
We accomplished this using Kinesis with Apache Flink. Flink is a very scalable solution.
Data locality with MapReduce and HDFS is very important (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:
EC2
EMR + S3
The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.
What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled to HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?
What comforts me is the fact that I'll be pulling the data only the first time, and all subsequent jobs will have the intermediate results locally.
I'm hoping for an answer from someone with practical experience with this. Thank you.
EMR does not pull data from S3 into HDFS. It uses its own implementation of HDFS support on top of S3 (EMRFS), so jobs operate on S3 as if it were an actual HDFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.
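In practice this means a Spark job on EMR reads S3 paths directly through EMRFS rather than staging them into HDFS first; a minimal sketch is below (the bucket and paths are placeholders, not from the question).

```python
# Sketch: on EMR, Spark reads s3:// paths directly through EMRFS, so there
# is no explicit "copy from S3 to HDFS" step. Bucket and paths are
# placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-example").getOrCreate()

# Read raw data straight from the data lake in S3 (served via EMRFS).
events = spark.read.parquet("s3://my-data-lake/raw/events/")

# Intermediate results reused by later stages can still be cached in
# memory or written to HDFS local to the cluster.
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("hdfs:///intermediate/daily_counts/")
```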
As per the source mentioned below, EMR+S3 with EMRFS doesn't maintain data locality and is not suitable for analytics processing based on tools such as SQL. Redshift is the right choice for such use cases, where compute and data are in one place. Please refer to 39:00 to 42:00 in the link below:
https://youtu.be/08G9NfDETVE
This is also mentioned in https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html. Please refer to the performance per dollar section.
To see how EMR works with S3, please refer to the book Programming Elastic MapReduce by Kevin Schmidt and Christopher Phillips (Chapter 1, the "Amazon Elastic MapReduce Versus Traditional Hadoop Installs" section).