MapReduce on Amazon / Account management

I am totally new to AWS.
I have $8700 in credit from Amazon AWS. There are 87 of us, and I want to share this $8700 among the group (preferably $40 for each person).
1- Please guide me on how I can create accounts for them and allocate about $40 to each. Alternatively, assume they already have AWS accounts and I want to allocate $40 to each.
2- We are going to use Amazon AWS to get familiar with MapReduce. I don't know which service we should use (EC2, Elastic ...); we'd prefer the easiest one. We want to start with one machine, run a MapReduce job on a big dataset, see how long the processing takes, and then repeat the test with 4 and 8 machines to see the difference.
3- What language should we use for MapReduce? Is it possible to use Java or C++? Where should we write our code (NetBeans, Microsoft Visual Studio, ...)? And where can I find some sample code?
4- I am not sure about the dataset either. Should it be in Oracle? Microsoft SQL Server? ...
Thank you so much for your help in advance. I really appreciate it.

1 - You need to understand AWS Identity and Access Management (IAM): http://aws.amazon.com/iam/. You can create users under the same billing account. I'm not sure whether you can allot credits to individual users, but you can control their access.
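If you do end up creating the 87 users programmatically, here's a minimal sketch with the AWS SDK for Java v2. The student-NN naming scheme is just an illustration; attaching permissions (and any credit allotment, if that's even possible) would be separate steps:

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.iam.IamClient;
import software.amazon.awssdk.services.iam.model.CreateUserRequest;

public class CreateStudentUsers {
    public static void main(String[] args) {
        // IAM is a global service, so use the AWS_GLOBAL pseudo-region.
        try (IamClient iam = IamClient.builder().region(Region.AWS_GLOBAL).build()) {
            // Hypothetical naming scheme: student-01 ... student-87.
            for (int i = 1; i <= 87; i++) {
                String name = String.format("student-%02d", i);
                iam.createUser(CreateUserRequest.builder().userName(name).build());
                System.out.println("Created IAM user " + name);
            }
        }
    }
}
```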
2 - The service you want is AWS Elastic MapReduce http://aws.amazon.com/elasticmapreduce/
3 - What language are you most comfortable with? From the EMR FAQ:
Q: What programming languages does Amazon Elastic MapReduce support?
You can use Java to implement Hadoop custom jars. Alternatively, you may use other languages including Perl, Python, Ruby, C++, PHP, and R via Hadoop Streaming. Please refer to the Developer's Guide for instructions on using Hadoop Streaming.
http://aws.amazon.com/elasticmapreduce/faqs/#dev-8
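For example, here is the canonical Hadoop "word count" job in Java, which you could package as a custom jar and submit to EMR (the input/output paths are supplied as arguments, e.g. s3:// locations):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3:// input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an s3:// output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```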
4 - I think you mean databases... I favor MySQL because it's cheaper to run than MSSQL or Oracle. Have you read about Amazon RDS? http://aws.amazon.com/rds/

For sharing the credit, you can look at IAM roles; put yourself in as the admin.
For MapReduce, EMR is the best option, as you don't have to take care of complicated time-sync and DNS issues.
Yes, it is possible to use Java or C++. For sample code you can refer to the S3 bucket s3n://elasticmapreduce/samples, and the documentation from Amazon is pretty good: http://docs.aws.amazon.com/ElasticMapReduce/. You can write your code on AWS instances, or you can set up a local dev environment and build your code on EMR instances.
Do you mean a dataset or a database?

Related

What is the difference between AWS Elastic MapReduce and AWS Kinesis Data Analytics?

I'm executing a Flink job with these tools.
I think both can do exactly the same thing with the proper configuration. Does Kinesis Data Analytics do something that EMR cannot, or vice versa?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time.
Amazon Elastic MapReduce provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink on EMR.
The major difference is maintainability and management from your side.
If you want more independent management and more control, then I would say go for AWS EMR, where it's your responsibility to manage the EMR infrastructure as well as the Apache Flink cluster running on it.
But if you want less control, more focus on application development, and you need to deliver faster (tight deadlines), then KDA is the way to go. Here AWS provides all the bells and whistles you need to run your application. It also sets up easily with AWS S3 as the code source and provides bare-minimum configuration management through the UI.
It scales automatically as well (you need to understand KPUs, though).
It provides the same Flink dashboard where you can monitor your application, plus AWS CloudWatch integration for debugging.
Please go through this nice presentation and let me know if it helps:
https://www.youtube.com/watch?v=c_LswkrwOvk
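Either way, the job code itself is plain Flink; here is a minimal sketch of a Java DataStream job that would run unchanged on EMR or KDA (the sample elements and job name are made up, and a real job would read from Kinesis, Kafka, S3, etc.):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalFlinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Made-up sample input; swap in a Kinesis/Kafka source for real streaming.
        env.fromElements("emr", "kda", "flink")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .print();

        env.execute("minimal-flink-job"); // the same jar can be submitted to EMR or KDA
    }
}
```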
I will say one major difference between the two is that Kinesis does not provide a hosted Hadoop service, unlike Elastic MapReduce (now EMR).
I had this same question as well. This video was helpful in explaining it with a real architecture scenario, and the AWS explanation here tries to show how Kinesis and EMR can fit together, with possible use cases.

Transporting data to Amazon WorkSpaces

I am managing a bunch of users on Amazon WorkSpaces, and they have terabytes of data they want to start playing around with on their WorkSpaces.
I am wondering what the best way to handle the data upload is. Can everything just be downloaded from Google Drive or Dropbox? Or should I use something like AWS Snowball, which is specifically designed for migration?
While something like AWS Snowball is probably the safest bet, I'm hesitant to add another AWS product to the mix, which is why I might just have everything uploaded to and then downloaded from Google Drive / Dropbox. Then again, I am building an AWS environment that will be used long term, and long term Google Drive / Dropbox won't be a solution.
Any thoughts on how to architect this (short term and long term)?
Why would you be hesitant to include more AWS products in the mix? Generally speaking, if you aren't combining multiple AWS products to build your solutions then you aren't making very good use of AWS.
For the specific task at hand I would look into Amazon WorkDocs, which integrates very well with Amazon WorkSpaces. If that doesn't suit your needs, I would suggest placing the data files on Amazon S3.
You can use FileZilla Pro to upload your data to an AWS S3 bucket, and then use FileZilla Pro within the WorkSpaces instance to download the files.
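If you'd rather script the transfer than use FileZilla, here is a minimal sketch with the AWS SDK for Java v2 (the bucket name, key, and file path are placeholders; credentials come from the default provider chain):

```java
import java.nio.file.Paths;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class UploadToS3 {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Placeholder bucket/key/file; loop over a directory for bulk uploads.
            s3.putObject(
                PutObjectRequest.builder()
                    .bucket("my-workspaces-staging-bucket")
                    .key("datasets/sample.csv")
                    .build(),
                RequestBody.fromFile(Paths.get("/path/to/sample.csv")));
            System.out.println("Upload complete.");
        }
    }
}
```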

Querying Amazon Aurora with Java

I apologize if this is too broad, but I am very new to AWS, and I have a specific task I want to accomplish but can't seem to find any resources explaining how to do it.
I have a Java application that, at a high level, manages data, and I want that application to be able to store and retrieve information from Amazon Aurora. The simplest task I want to achieve is to run the query "SELECT * FROM Table1" (where Table1 is some example table name in Aurora) from Java. I feel like I'm missing something fundamental about how AWS works, because I've thus far been drowning in a sea of links to AWS SDKs, none of which seem relevant to this task.
If anyone could provide some concrete information on how I could achieve this, or what I'm missing about AWS, I would really appreciate it. Thank you for your time.
You don't use the AWS SDK to query an RDS database. The API/SDK is for managing the servers themselves, not for accessing the RDBMS software running on them. You connect to Amazon Aurora from Java just as you would connect to any other MySQL database (or PostgreSQL, if you are using that flavor of Aurora): via the JDBC driver. There's nothing AWS-specific about that, other than making sure your code runs from a location that has network access to the RDS instance.
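A minimal sketch, assuming the MySQL-compatible flavor of Aurora and the MySQL Connector/J driver on the classpath (the endpoint, database name, and credentials below are placeholders; copy the real endpoint from the RDS console):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AuroraExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder cluster endpoint and database name.
        String url = "jdbc:mysql://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/mydb";

        try (Connection conn = DriverManager.getConnection(url, "myuser", "mypassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM Table1")) {
            while (rs.next()) {
                // Print the first column of each row; adjust to your schema.
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

The only AWS-specific parts are the hostname and making sure the instance's security group allows inbound traffic from wherever this code runs.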

Which AWS services for mobile app backend?

I'm trying to figure out which AWS services I need for the mobile application I'm working on with my startup. The application should go into the App Store / Play Store later this year, so we need a best-practice solution for our case. It must be highly scalable, so that if there are thousands of requests to the server it remains stable and fast. We may also want to deploy a website on it.
Currently we are using Uberspace (link) servers with a Node.js application and MongoDB running on them. Everything works fine, but for the release version we want to move to AWS. What we need is something we can run Node.js / MongoDB (or something similar to MongoDB) on, plus something to store images like profile pictures that can be requested by users.
I have already read some information about AWS on their website, but it didn't help much. There are so many services, and we don't know which of them fit our needs.
A friend told me to just use AWS EC2 for the Node.js server + MongoDB and S3 to store the images, but on some websites I have read that it is better to use this architecture:
We would be glad if there is someone who can share his/her knowledge with us!
To run code: you can use Lambda, but be careful: the benefit is that you don't have to worry about servers; the downside is that Lambda can sometimes be unreasonably slow. If you need it really fast, then you need EC2 with auto-scaling. If you tune it properly, it works like a charm.
To store data: DynamoDB if you want it really fast (single-digit milliseconds regardless of load and DB size) and in line with best practices. It REQUIRES a proper schema or it will cost you a fortune; otherwise use MongoDB on EC2.
If you need an RDBMS, then RDS (benefits: scalability, availability, no maintenance headaches).
Cache: they have both Redis and Memcached.
S3: to store static assets.
I do not suggest CloudFront; there are other CDNs on the market with better prices/capabilities.
API Gateway: yes, if you have an API.
Depending on your app, you may need SQS.
Cognito is a good service if you want to authenticate your users via Google/Facebook/etc.
CloudWatch: if you're a metrics addict then it's not for you; perhaps a standalone monitoring stack on EC2 would be better. But for most people CloudWatch is absolutely OK. Create all the necessary alarms (CPU overload etc.).
You should use roles to allow access to your S3/DB from Lambda/AWS.
You should not use the root account; create a separate user instead.
Create a billing alarm: you'll know if you're about to break your budget.
Create Lambda functions to back up your EBS volumes (and whatever else you may need to back up). There's no problem if a backup starts a second late, so Lambda is fine here (a sketch follows this list).
Run Trusted Advisor now and then.
It'd be better to set all this up using a CloudFormation stack: you'll be able to deploy the same infrastructure easily in another region if/when needed, and it's easier to manage infrastructure-as-code than infrastructure built manually.
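Here's a minimal sketch of that EBS-backup Lambda in Java. The handler class name and volume ID are placeholders; it assumes the aws-lambda-java-core library plus the SDK v2 EC2 client on the classpath, a scheduled trigger (e.g. a CloudWatch Events / EventBridge rule), and an execution role that allows ec2:CreateSnapshot:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.CreateSnapshotRequest;
import software.amazon.awssdk.services.ec2.model.CreateSnapshotResponse;

// Invoked on a schedule; snapshots one EBS volume per invocation.
public class EbsBackupHandler implements RequestHandler<Object, String> {

    private static final String VOLUME_ID = "vol-0123456789abcdef0"; // placeholder

    @Override
    public String handleRequest(Object event, Context context) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            CreateSnapshotResponse response = ec2.createSnapshot(
                CreateSnapshotRequest.builder()
                    .volumeId(VOLUME_ID)
                    .description("scheduled-backup")
                    .build());
            return "Started snapshot " + response.snapshotId();
        }
    }
}
```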
If you want a highly scalable application, you may want to use a serverless architecture with AWS Lambda.
There is a framework called Serverless that helps you manage and organize all your Lambda functions and put them behind Amazon API Gateway.
For storage you can use AWS EC2 with MongoDB installed, or you can go with AWS DynamoDB as your NoSQL store.
If you want a frontend, both web and mobile, you may want to look into the React Native approach.
I hope I've been helpful.

How to list/view all created resources on AWS?

Is there a way to list/view (graphically?) all created resources on Amazon? All the DBs, users, pools, etc.
The best way I can think of is to run each of the CLI's aws <service> list/describe commands in a bash file.
What would be great would be to have a graphical tool that showed all the relationships. Is anyone aware of such a tool?
UPDATE
I decided to make my own start on this. Currently it's CLI-only, but it might move to graphical output. Help needed!
https://github.com/QuantumInformation/aws-x-ray
No, it is not possible to easily list all services created on AWS.
Each service has its own set of API calls and will typically have Describe* or List* calls that can enumerate resources. However, these commands need to be issued to each service individually, and they typically have different syntax.
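For example, with the AWS SDK for Java v2 you would call each service's own list/describe operation; EC2 and S3 are shown here, and every other service needs its own client and calls:

```java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.s3.S3Client;

public class ListSomeResources {
    public static void main(String[] args) {
        // Each service has its own client and its own list/describe API.
        try (Ec2Client ec2 = Ec2Client.create();
             S3Client s3 = S3Client.create()) {

            ec2.describeInstances().reservations().forEach(res ->
                res.instances().forEach(i ->
                    System.out.println("EC2 instance: " + i.instanceId())));

            s3.listBuckets().buckets().forEach(b ->
                System.out.println("S3 bucket: " + b.name()));
        }
    }
}
```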
There are third-party services (e.g. Kumolus) that offer functionality to list and visualize resources, but they are typically focused on Amazon EC2 and Amazon VPC-based services. They definitely would not go 'into' a database to list DB users, but they would show Amazon RDS instances.