Is using map-reduce necessary?

When doing a cloud computing project, is it necessary to use Amazon S3 as shown in figure 1 of http://www.ibm.com/developerworks/aix/library/au-cloud_apache/#figure2, or can I just use MapReduce and a database?
Thanks in advance.

S3 is just one of the many storage options in the world. It is not mandatory for any cloud application, just sometimes highly practical.

What is the difference between AWS Elastic MapReduce and AWS Kinesis Data Analytics?

I'm executing a Flink job with these tools.
I think both can do exactly the same thing with the proper configuration. Does Kinesis Data Analytics do something that EMR cannot do, or vice versa?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time.
Amazon Elastic MapReduce provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR.
The major difference is maintainability and management on your side.
If you want more independent management and more control, then I would say go for AWS EMR, where it's your responsibility to manage the EMR infrastructure as well as the Apache Flink cluster running on it.
But if you want less to manage, more focus on application development, and you need to deliver faster (tight deadlines), then KDA is the way to go. Here AWS provides all the bells and whistles you need to run your application. It also sets up easily with Amazon S3 as the code source and provides bare-minimum configuration management through the UI.
It scales automatically as well (you do need to understand KPUs, the Kinesis Processing Units it is billed in, though).
It provides the same Flink dashboard where you can monitor your application, plus AWS CloudWatch integration for debugging it.
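To give a feel for how little of this you manage yourself with KDA, here is a minimal sketch in Python/boto3 of registering a Flink application whose packaged code sits in S3. The application name, bucket ARN, file key, and role ARN are all placeholders, and the runtime version is just an example:

    import boto3

    kda = boto3.client("kinesisanalyticsv2")

    # Register a Flink application; KDA pulls the packaged code from S3
    # and provisions/scales the underlying cluster itself.
    kda.create_application(
        ApplicationName="my-flink-app",              # placeholder name
        RuntimeEnvironment="FLINK-1_15",             # example runtime version
        ServiceExecutionRole="arn:aws:iam::123456789012:role/kda-role",  # placeholder
        ApplicationConfiguration={
            "ApplicationCodeConfiguration": {
                "CodeContent": {
                    "S3ContentLocation": {
                        "BucketARN": "arn:aws:s3:::my-code-bucket",  # placeholder
                        "FileKey": "flink-app.jar",
                    }
                },
                "CodeContentType": "ZIPFILE",
            }
        },
    )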
Please go through this presentation and let me know if it helps: https://www.youtube.com/watch?v=c_LswkrwOvk
One major difference between the two is that Kinesis does not provide a hosted Hadoop service, unlike Elastic MapReduce (now EMR).
I had this same question. This video was helpful in explaining it with a real architecture scenario, and the AWS explanation here tries to show how Kinesis and EMR can fit together, with possible use cases.

Transporting data to Amazon WorkSpaces

I am managing a bunch of users on Amazon WorkSpaces; they have terabytes of data that they want to start playing around with on their workspaces.
I am wondering what the best way to handle the data upload is. Can everything just be downloaded from Google Drive or Dropbox? Or should I use something like AWS Snowball, which is built specifically for migration?
While something like AWS Snowball is probably the safest bet, I'm hesitant to add another AWS product to the mix, which is why I might just have everything uploaded to and then downloaded from Google Drive / Dropbox. Then again, I am building an AWS environment for long-term use, and long term Google Drive / Dropbox won't be a solution.
Any thoughts on how to architect this, short term and long term?
Why would you be hesitant to include more AWS products in the mix? Generally speaking, if you aren't combining multiple AWS products to build your solutions, then you aren't making very good use of AWS.
For the specific task at hand I would look into Amazon WorkDocs, which is integrated very well with WorkSpaces. If that doesn't suit your needs, I would suggest placing the data files on Amazon S3.
You can use FileZilla Pro to upload your data to an AWS S3 bucket, then use FileZilla Pro within the WorkSpaces instance to download the files.
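If you would rather script the transfer than use FileZilla Pro, a minimal sketch with Python/boto3 looks like this (the bucket name and paths are placeholders; credentials are assumed to come from the environment or an IAM role):

    import boto3

    s3 = boto3.client("s3")

    # On the machine that holds the data: upload to S3.
    # boto3 transparently switches to multipart uploads for large files.
    s3.upload_file("/data/dataset.tar.gz",
                   "my-staging-bucket", "datasets/dataset.tar.gz")

    # Inside the WorkSpaces instance: pull the file down to local disk.
    s3.download_file("my-staging-bucket",
                     "datasets/dataset.tar.gz", "dataset.tar.gz")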

Using both AWS and GCP at the same time?

I have a social network website and I store all the media in S3. I'm planning to use AWS for S3 + Lambda and GCP for GCE and Cloud SQL. What are the cons of using it this way? Bandwidth between GCP and S3 (since they're not on the same network)?
Thanks.
Using both services together can make sense when you're leveraging one provider's strengths, or for redundancy / disaster recovery. You might also find the pricing model of one provider suits your use-case better. The tradeoff is inconvenience, extra code to manage interoperability, learning two sets of APIs and libraries, and possibly latency.
A few use-cases I've seen personally:
Backing up S3 buckets to Cloud Storage in COLDLINE via the Transfer Job system; the goal is to protect code and data backups against worst-case S3 data loss or account hacking in AWS
Using BigQuery to analyze logs pre-processed in AWS EMR and synced into Cloud Storage; depending on your workload BigQuery might cost a lot less than running a Redshift cluster
I've also heard arguments that Google's ML pipelines are superior in some domains, so this might be a common crossover case.
Given the bulk of your infrastructure is already in Google, have you considered using Cloud Functions and Cloud Storage instead of Lambda and S3?
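As a sketch of what the cross-cloud glue code looks like (one reason to weigh the single-provider option), here is a minimal S3-to-Cloud-Storage copy using boto3 and the google-cloud-storage client. Bucket and object names are placeholders, and for terabyte-scale syncs the managed Transfer Job system mentioned above is the better tool:

    import boto3
    from google.cloud import storage

    s3 = boto3.client("s3")
    gcs = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS

    # Copy one object from S3 into a Cloud Storage bucket. The object is
    # read fully into memory, so this suits small/medium objects only.
    body = s3.get_object(Bucket="my-s3-bucket", Key="media/photo.jpg")["Body"].read()
    gcs.bucket("my-gcs-backup").blob("media/photo.jpg").upload_from_string(body)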

Amazon Kinesis emulator

We are evaluating real-time event processing engines (like Twitter Storm).
One of the options is the recently released Amazon Kinesis.
I'm wondering if there is any sort of emulator/sandbox environment available that would let us play around with Kinesis a bit without the need to set up an AWS account and pay for use of the service.
Thank you in advance
I've been using kinesalite for a while now and it seems to work quite well.
Since I don't really like to have Node packages on my machine, I've also created a Docker image for it.
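To show what using kinesalite looks like in practice, here is a minimal sketch that points the standard boto3 Kinesis client at a local kinesalite instance. It assumes kinesalite is listening on its default port 4567; the stream name and dummy credentials are arbitrary (kinesalite does not validate them):

    import boto3

    # Ordinary Kinesis client, but aimed at the local emulator.
    kinesis = boto3.client(
        "kinesis",
        endpoint_url="http://localhost:4567",
        region_name="us-east-1",
        aws_access_key_id="fake",
        aws_secret_access_key="fake",
    )

    kinesis.create_stream(StreamName="test-stream", ShardCount=1)
    # Stream creation is asynchronous, so wait for it to become ACTIVE.
    kinesis.get_waiter("stream_exists").wait(StreamName="test-stream")
    kinesis.put_record(StreamName="test-stream", Data=b"hello", PartitionKey="k1")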
Hope this helps.
EMR recently launched Kinesis integration for tools in the Hadoop ecosystem; you may want to try it and see if your use cases can be addressed there. http://aws.typepad.com/aws/2014/02/process-streaming-data-with-kinesis-and-elastic-mapreduce.html

MapReduce on Amazon / Account management

I am totally new to AWS.
I have $8700 in credit from Amazon AWS. We are 87 people, and I want to share this $8700 among us (preferably $40 for each person).
1- Please guide me on how I can create accounts for them and allocate about $40 to each. Or assume that they already have AWS accounts and I want to allocate $40 to each.
2- We are going to use Amazon AWS to get familiar with MapReduce. I don't know what service we should use (like EC2, Elastic MapReduce, ...). We prefer the easiest one. We want to use one machine first, run a MapReduce function on a big data set, and see how long the process takes, then test it again using 4 and 8 machines to see the difference.
3- What language should we use for MapReduce? Is it possible to use Java or C++? Where should we write our code (NetBeans, Microsoft Visual Studio, ...)? And where can I find some sample code?
4- I am not sure about the data set either. Should it be on Oracle? Microsoft SQL Server? ...
Thank you so much for your help in advance; I really appreciate it.
1 - You need to understand AWS Identity and Access Management (IAM): http://aws.amazon.com/iam/. You can create users linked to the same billing account. I'm not sure if you can allot credits to individual users, but you can control access.
2 - The service you want is Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
3 - What language are you most comfortable with? From the EMR FAQ:
Q: What programming languages does Amazon Elastic MapReduce support?
You can use Java to implement Hadoop custom jars. Alternatively, you may use other languages including Perl, Python, Ruby, C++, PHP, and R via Hadoop Streaming. Please refer to the Developer's Guide for instructions on using Hadoop Streaming.
http://aws.amazon.com/elasticmapreduce/faqs/#dev-8
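To make the Hadoop Streaming option concrete, here is a minimal word-count sketch in Python (the file names are arbitrary). Streaming feeds each input line to the mapper on stdin, sorts the mapper output by key, and pipes it into the reducer:

    # mapper.py: emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py: sum the counts; input arrives sorted by word
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))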
4 - I think you mean databases... I favor MySQL because it's cheaper to run than MSSQL or Oracle. Have you read about Amazon RDS? http://aws.amazon.com/rds/
For sharing the amount you can look at IAM roles; put yourself as admin.
For MapReduce, EMR is the best option, as you don't have to take care of complicated time-sync and DNS issues.
Yes, it is possible to use Java/C++. For sample code you can refer to the S3 bucket s3n://elasticmapreduce/samples, and the documentation from Amazon is pretty good: http://docs.aws.amazon.com/ElasticMapReduce/. You can write your code on AWS instances, or you can create your local dev environment and build your code on EMR instances.
Do you mean a dataset or a database?
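To tie the pieces together, here is a sketch of launching such an experiment with Python/boto3: it runs the streaming word count on a cluster whose size you can vary between runs (1, 4, 8 machines) to compare timings. The bucket names, instance type, and release label are placeholders, and the default EMR roles are assumed to already exist in the account:

    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="wordcount-experiment",
        ReleaseLabel="emr-6.10.0",                 # example release
        Instances={
            "MasterInstanceType": "m5.xlarge",     # placeholder instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 4,                    # rerun with 1, 4, 8 to compare
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
        },
        Steps=[{
            "Name": "streaming word count",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://my-bucket/input/",
                    "-output", "s3://my-bucket/output/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )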