S3 and EMR data locality [closed] - amazon-web-services

Data locality with MapReduce and HDFS is very important (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:
EC2
EMR + S3
The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example of the advantages of using S3.
What bugs me is the issue of data locality. If the data is stored in S3 it will need to be pulled into HDFS every time a job is run. My question is - how big can this issue be, and is it still worth it?
What comforts me is the fact that I'll be pulling the data only the first time, and then all subsequent jobs will have the intermediate results locally.
I'm hoping for an answer from someone with practical experience in this. Thank you.

EMR does not pull data from S3 into HDFS. It uses its own Hadoop filesystem implementation backed by S3 (EMRFS), so jobs operate on S3 as if it were an actual HDFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.
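To make the point concrete, here is a minimal PySpark sketch, assuming a job running on an EMR cluster where EMRFS serves s3:// paths; the bucket, prefix and HDFS path are placeholders, not anything from the question.

    # Minimal PySpark sketch: on EMR, s3:// paths are handled by EMRFS,
    # so the job reads S3 directly instead of copying data into HDFS first.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-locality-demo").getOrCreate()

    # Hypothetical bucket/prefix - replace with your own.
    events = spark.read.parquet("s3://my-data-bucket/events/year=2023/")

    # Intermediate results can still be cached or written to HDFS
    # if subsequent jobs reuse them locally.
    events.cache()
    events.write.mode("overwrite").parquet("hdfs:///tmp/events_local/")
    print(events.count())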

As per the sources mentioned below, EMR + S3 with EMRFS doesn't maintain data locality and is not well suited to analytics processing with SQL-style tools. Redshift is the right choice for such use cases, where compute and data live in one place. Please refer to 39:00 to 42:00 in the following link:
https://youtu.be/08G9NfDETVE
This is also mentioned in https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html - please refer to the performance-per-dollar section.
To see how EMR works with S3, refer to the book Programming Elastic MapReduce by Kevin Schmidt & Christopher Phillips (Chapter 1, the "Amazon Elastic MapReduce Versus Traditional Hadoop Installs" section).

Related

Best way to save configuration data in database AWS [closed]

I have some configuration data I want to read from a Spring Boot application. The configuration can get bigger and bigger, therefore I don't want to keep it as a YAML file in the project. I was thinking about what would be the cheapest way to store this data in AWS - DynamoDB, for example? Or do you maybe have better and cheaper ideas?
DynamoDB would be a good solution; it will allow you to quickly retrieve and edit the config set. DynamoDB also has a free tier of 25 RCU/WCU and 25 GB of storage per month, which I assume you will fall within, making it effectively a free option.
The only limitation with DynamoDB is that 400 KB is the maximum item size; however, you can split your item into an "item collection" where all of the items share the same partition key and the sort key identifies the sub-components.
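As an illustration of that item-collection layout, here is a hedged boto3 sketch (Python rather than Spring Boot, purely to show the table design); the table name, key names and payloads are hypothetical.

    # boto3 sketch: one partition key per config set, a sort key per section,
    # so a config bigger than the 400 KB item limit is split into an item collection.
    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("app-config")  # hypothetical table: pk = config_id, sk = section

    # Write each section as its own item.
    table.put_item(Item={"config_id": "service-a", "section": "database",
                         "payload": {"host": "db.example.com", "pool_size": 10}})
    table.put_item(Item={"config_id": "service-a", "section": "features",
                         "payload": {"new_checkout": True}})

    # Read the whole collection back with a single Query on the partition key.
    resp = table.query(KeyConditionExpression=Key("config_id").eq("service-a"))
    config = {item["section"]: item["payload"] for item in resp["Items"]}
    print(config)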
AWS AppConfig is the purpose-built service for application configuration, but it is not cheaper than using DynamoDB's free tier, so the trade-off comes down to what you need from your configuration data.
Ignoring the cheaper part of the question, AWS AppConfig is the right tool for storing configuration, as it comes with many useful features out of the box (gradual deployments, rollbacks, validation, etc.). You can find pricing information here. Note that the total price will also depend on how often you refresh the configuration (i.e. your cache invalidation time). Other tools, like S3, might be cheaper, but are not as well suited to the configuration use case.
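For completeness, a rough sketch of how a client might poll AppConfig for its configuration using boto3's appconfigdata client; the application, environment and profile identifiers are placeholders.

    # Sketch of polling AWS AppConfig via the appconfigdata API; identifiers are placeholders.
    import boto3

    client = boto3.client("appconfigdata")
    session = client.start_configuration_session(
        ApplicationIdentifier="my-app",
        EnvironmentIdentifier="prod",
        ConfigurationProfileIdentifier="service-a-config",
    )
    token = session["InitialConfigurationToken"]

    # Each poll returns the next token; an empty body means the config is unchanged,
    # so cache the last non-empty payload between polls.
    resp = client.get_latest_configuration(ConfigurationToken=token)
    token = resp["NextPollConfigurationToken"]
    payload = resp["Configuration"].read()
    if payload:
        print(payload.decode("utf-8"))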
what would be the cheapest way to save this data in AWS
The cheapest way to store data in AWS is S3 Glacier Deep Archive (or the Deep Archive Access tier of S3 Intelligent-Tiering) at roughly $0.00099 per GB-month.
Reference: https://aws.amazon.com/s3/pricing/
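For reference, a small boto3 sketch of writing an object straight into the Deep Archive storage class; the bucket, key and file name are placeholders. Keep in mind that Deep Archive objects must be restored (which can take hours) before they can be read, so it suits backups rather than live configuration.

    # Sketch: upload an object directly into the Glacier Deep Archive storage class.
    import boto3

    s3 = boto3.client("s3")
    with open("config-2024-01.json.gz", "rb") as body:  # placeholder local file
        s3.put_object(
            Bucket="my-archive-bucket",           # placeholder bucket
            Key="backups/config-2024-01.json.gz",
            Body=body,
            StorageClass="DEEP_ARCHIVE",
        )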

Best way to transfer data between vm instaces in different projects in Google cloud platform [closed]

In my project, vm1 has 1 TB of data and vm2 has 1.5 TB. I want to back up and copy that data to another VM in a different project (and that project is in a different organization).
What is the best way to do this? I've first tried compressing the data on vm1 and uploading it to a Cloud Storage bucket, but that is very time-consuming.
Thank you.
There are several ways to do it.
If this is a Linux machine you can use rsync to synchronise folders with the remote machine; it can be run from cron daily or hourly, depending on how often you want the backup to be created.
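As a rough illustration (not from the original answer), here is a small Python wrapper around rsync over SSH that a cron job could invoke; the hostnames, paths and key file are placeholders.

    # Sketch: drive rsync over SSH from a script that cron can run on a schedule.
    import subprocess

    subprocess.run(
        [
            "rsync", "-az", "--partial",
            "-e", "ssh -i /home/backup/.ssh/id_rsa",    # placeholder key
            "/data/",                                   # source folder on vm1
            "backup@vm2.example.com:/backup/vm1/",      # placeholder destination
        ],
        check=True,
    )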
You can try to compress the data before the transfer, but that will take a lot of time (considering you have over 1 TB).
In my opinion the fastest way would be to transfer the uncompressed data over the Internet, but note that you will be charged for egress traffic (from $0.01/GB up to as much as $0.15/GB!).
You can't use a Shared VPC, because that is only possible inside a single organisation and your VMs are in different orgs;
Shared VPC connects projects within the same organization
You can use other protocols for the transfer, such as FTP, Samba, or regular folder sharing in Windows (if we're talking about such machines), or an rsync port for Windows such as cwRsync.
If you provide more details about your use case, I can update my answer with a more to-the-point solution.

What is the right cloud service(s) for real-time streaming and backtesting applications? [closed]

I wish to record time-series data at a very high frequency. I am wondering if there is an elegant serverless solution that allows me to store, and react to, real-time data.
I want to use stored data to create statistical models, and then I want to process new data in real-time based on those models.
AWS Kinesis Streams seems to fit the bill - however, I am unsure whether it is only for reacting in real time, or whether it also retains historical data that I could use offline to build models.
Google Dataflow and Pub/Sub also seem relevant, but I'm not sure whether they would be appropriate for the above.
If you go with AWS, you might use Kinesis and EMR to achieve your goal. First, you can create a delivery stream in the fully managed Kinesis Firehose and route it to S3 or Redshift to collect historical data.
Once your data is on S3, you can do the statistical analysis by pointing an EMR job at the S3 bucket to process the fresh data S3 receives. Read this article for more information.
On the EMR managed Hadoop framework you can set up open-source R and RStudio for statistical analysis if you wish. Here is a guide on that.
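As a small illustration of the ingestion side, here is a hedged boto3 sketch that pushes a record into a Kinesis Data Firehose delivery stream assumed to be configured with an S3 destination; the stream name and record fields are placeholders.

    # Sketch: push a time-series record into a Firehose delivery stream
    # that is assumed to deliver to S3 (or Redshift).
    import json
    import time
    import boto3

    firehose = boto3.client("firehose")
    record = {"ts": time.time(), "symbol": "ABC", "price": 101.25}
    firehose.put_record(
        DeliveryStreamName="ticks-to-s3",   # placeholder stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )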
We accomplished this using Kinesis with Apache Flink. Flink is a really scalable solution.

ETL vs ELT in Amazon Redshift [closed]

We are researching on creating a Data Lake solution on AWS - similar to what's outlined here - https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
We will be storing all the "raw" data in S3 and loading it into EMR or Redshift as needed.
At this stage, I am looking for suggestions on whether to use the ETL or the ELT approach for loading data into Amazon Redshift. We will be using Talend for ETL/ELT.
Should we stage the "raw" data from S3 in Redshift first before transforming it or should we transform the data in S3 and load it into Redshift?
I would appreciate any suggestions/advice.
Thank you.
Definitely ELT.
The only case where ETL may be better is if you are simply taking one pass over your raw data, then using COPY to load it into Redshift, and then doing nothing transformational with it. Even then, because you'll be shifting data in and out of S3, I doubt this use case will be faster.
As soon as you need to filter, join, and otherwise transform information, it is much faster to do it in the DBMS. If you hit a case where the data transformation relies on data that is already in the DW, it will be orders of magnitude faster.
We run hundreds of ELT jobs a day on different DW platforms, performance testing alternative methods of ingesting and transforming data. In our experience the difference between ETL and ELT in an MPP DW can be 2000+ percent.
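As a sketch of what ELT looks like in practice (placeholders throughout - the cluster endpoint, tables, bucket and IAM role are assumptions, and psycopg2 is just one possible driver): load the raw files with COPY, then transform inside Redshift.

    # Sketch of the ELT pattern: load raw data as-is, transform inside the warehouse.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="etl", password="REPLACE_ME")
    cur = conn.cursor()

    # 1. "L" before "T": COPY the raw files from S3 into a staging table.
    cur.execute("""
        COPY staging.raw_events
        FROM 's3://my-data-lake/raw/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS PARQUET;
    """)

    # 2. Transform in the database, where joins against existing DW tables are cheap.
    cur.execute("""
        INSERT INTO marts.daily_events
        SELECT event_date, customer_id, COUNT(*) AS event_count
        FROM staging.raw_events
        GROUP BY event_date, customer_id;
    """)
    conn.commit()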
It depends on the purpose of having Redshift. If your business case is for users to query the data in Redshift (or for a front-end application using Redshift as its backend), then I would not recommend doing ETL in Redshift. In that case, it is better to perform your business transformations ahead of time (e.g. S3 -> EMR -> S3) and then load the processed data into Redshift.

Clusters available for using Hadoop/MapReduce framework [closed]

Does anyone know of any freely accessible clusters that are open to the public and use a Hadoop/MapReduce framework? There are plenty of tutorials on how to use MapReduce, but is there a way to test the examples without installing the required framework on my local single machine?
Thanks!
Amazon EC2 has ready-to-use Hadoop clusters for rent by the hour, which is not very expensive even just to play with. Another way is to play with the Cloudera Hadoop VM: http://www.cloudera.com/downloads/virtual-machine/. You can run a cluster on several virtual machines.
I will soon have a solution - it's not free, but it is VERY cheap.
I have built a small cluster for training and education (via web access), which will be live in May 2013.
I will rent out a 4-node cluster for $2 a day or $10 a week.
Since the cluster is not very big, it will handle data sets of only 20-40 GB, but it will have full web access to run MapReduce jobs and Pig scripts.
Whilst I am asking for some money, it's not really a business - I'm just hoping I can pay the power bills!
http://jyrocluster.com
Regards,
Serge
You could also use Apache Whirr to deploy your own test cluster on Amazon EC2. This gives you more control than Elastic MapReduce. It should be cheap if you only use it to test MapReduce jobs for short periods of time.
You can give CloudxLab a try. Though it is not free, it is quite affordable. It provides a complete environment to practice Hadoop, Spark, Kafka, Hive, Pig, HBase, Oozie, Zookeeper, Flume, Sqoop, Mahout, R, Linux, Python, Scala, NumPy, Scipy, scikit-learn etc. You will not have to install or configure any software on your local machine to use CloudxLab. Many of the popular trainers are already using CloudxLab.