Medium Hadoop / Spark Cluster Administration - amazon-web-services

Please let me know if this question is more appropriate for a different channel but I was wondering what the recommended tools are for being able to install, configure and deploy hadoop/spark across a large number of remote servers. I'm already familiar with how to setup all of the software but I'm trying to determine what I should start using that would allow me to easily deploy across a large number of servers. I've started to look into configuration management tools (ie. chef, puppet, ansible) but was wondering what the best and most user friendly option to start off with is out there. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IP? Should I use pssh? pscp? etc. I want to just be able to ssh with as many servers as needed and install all of the software.

If you have some experience in scripting language then you can go for chef. The recipes are already available for deployment and configuration of cluster and it's very easy to start with.
And if wants to do it by your own then you can use sshxcute java API which runs the script on remote server. You can build up the commands there and pass them to sshxcute API to deploy the cluster.

Check out Apache Ambari. Its a great tool for central management of configs, adding new nodes, monitoring the cluster, etc. This would be your best bet.

Related

Does anyone know how to use AWS App2Container(A2C)?

AWS App2Container (A2C) is a recently launched feature by AWS. It is a CLI tool to help you lift and shift applications that run in your on-premises data centres or on virtual machines so that they run in containers that are managed by Amazon ECS or Amazon EKS. Since there is not much info on the internet about this, apart from the AWS document so does anybody knows how to implement it and what are the dependencies required for it?
This is a fairly new service so most people will be relying on reading at the moment.
For JAVA applications the setup instructions on Linux indicate that you just download the app2container package and then run the following over your code
sudo app2container containerize --application-id java-app-id
For .NET applications the setup instructions on Windows indicate that it is exactly the same process, run the install file and that will have all dependencies.
The best way to try and implement this will be by following these tutorials step by step. Also remember at this time it is JAVA or .NET only.

What is the best way to host Apache Camel in AWS?

As we move our workloads to AWS I am looking for an ETL tool which is widely used and has the appropriate connectors - Apache Camel appears to fit the bill. However, I am struggling to find information on how Camel can be deployed in AWS - the obvious one is on an EC2 instance, but we would like to avoid the setup and maintenance required by Virtual Machines. I don't see anyone offering it as a managed service, so the option I'd like to explore is running it as a container in ECS, as we will have numerous other containers running.
Containers don't seem to be an installation option on the Apache Camel website - perhaps it is just too limiting for a tool whose purpose is to connect to everything else?
Is it acceptable and practical to run Camel in a container, and where could I find more information about it?
Apache Camel appears to fit the bill.
Indeed the Apache Camel is a great integration framework. And that's the point. It is a framework, not a product. So there are multiple ways to run the Camel flows. As a web app, as a standalone app, as a part if our own code. Camel itself is pretty agnostic to the way you run the flows and that's why you don't have very specific way enforced in the web site.
If you want an out of box product, which can generate containerized deployments with Apache Camel, you could have a look at Apache ServiceMix, Apache Karaf or it's supported RedHat Fuse.
Is it acceptable and practical to run Camel in a container, and where could I find more information about it?
It is perfectly fine.
Question: Can you (are you able) to create a docker container with your (any other) application?. Based on the question this skill is lacking and I really suggest to learn it.
You may check folowing post https://medium.com/#wkrzywiec/how-to-put-your-java-application-into-docker-container-5e0a02acdd6b
FROM java:8-jdk-alpine
COPY ./target/myapp.jar /usr/app/myapp.jar
WORKDIR /usr/app
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "myapp.jar"]
Let's assume you can run your ETL tasks as a standalone application, then just run it in the container as any other standalone application.
it we would like to avoid the setup and maintenance required by Virtual Machines
Question: how do you distribute your camel tasks? I mean - what is result of your build? A war file? A standalone app?
To build a web app you could see https://www.baeldung.com/spring-apache-camel-tutorial
The most convenient way to deploy a war file in AWS is AWS Beanstalk service.
If you build a standalone application (or use servicemix) and you can build a container, then indeed ECS or Fargate seem as natural options.

implement a tool that uses the technologies - Jenkins, Docker, Docker Swarm, and AWS

I want to implement a tool that uses the technologies - Jenkins, Docker, Docker Swarm, and AWS - to achieve a deployment tool that our team of developers can use to manage instances and deploys.
Please recommend what technologies should we (both administrators and developers) be using, what needs to be built and what sorts of machines must be having.
Any help here would be much appreciated.
Your question is too generic to provide a specific answer, as there are different approaches to implement what you are trying to achieve. IMHO the best approach would be to talk with your existing dev team & administrators and come up with a solution which all parties find easy to manage and maintain container based environment rather than specifying several specific technologies.
Each tool you have mentioned has different capabilities and also there are other tools that provide the same features which would be more ideal for your situation. (Thats why proper understanding between Devs and admins are necessary on what you really want to achieve.) .
Since you have asked about what kind of machines you must be having (I suppose this is on AWS env) try Core OS on AWS instances. CoreOS (Container Linux) will be the best option to manage and run your container based environments. [About CoreOS]
Jenkins can run in a docker container and issue docker commands to deploy new docker containers that reside in the same swarm as jenkins. You also need to hook into a software repo like git. Jenkins Blue Ocean is something you could look at for pipe-lining your dev->build->test->deploy->maintain pipes. Also, Travis-ci, github, JIRA, and Dockerhub are useful components to what you are trying to achieve.

Deploy an instance of my application for every customer?

Ok, so I would like to build my application in a way that allows for each organization to get their own instance.
My way of thinking here, is that I could do something with AWS or digital ocean or whatever to deploy my java (dropwizard) application every time a new client registers their company with us.
This would be virtualized, I would be hoping, so I would have those instances running on various virtual servers.
Basically, when a company registers... I would like to spin up an instance of the core API, and an instance of the DB server (or the two could be one instance here, I guess)
Is this a thing? I would google it, but I am not fully sure what to be looking for!
I know this is not a dropwizard question - but I tagged it this way because it is a dropwizard application I am building - and I figure people in that community may have had similar concerns! Please feel free to edit!
You would need to automate the process of spinning up an environment using something like CloudFormation, Ansible, Terraform, Chef, Puppet, etc. There are a lot of tools in this space. These tools are called Infrastructure as Code (IaC). Once you have it automated, setting up a new environment for a new customer would be a simple task of kicking off the appropriate script.

Develping with Django, Git, and Cloud Server

I'm currently working with a team in my University to put together a new webapp. Nothing too fancy, just run of the mill MySQL + Django. We are also hoping to use Git for source control. We were wondering what hosting options were available to us. We're all very competent with Unix, so a ssh connection would be preferable. We also looked into the Amazon Cloud, but are not sure if that's right for us. What does Stackoverflow suggest for a provider to host both a Git repo for us and our webapp. The simpler, the better. It should also run a Linux environment.
I have had great success using the Rackspace Cloud servers. You get root SSH into the server, so you can set up your Git repo and your web app there. They have a lot of options for which flavor of Linux you want to use as well.
I'm doing Django/Postgres on an Ubuntu server and haven't had any problems at all. As a bonus, it includes very easy web and API integration with their CDN if you're interested in that.
I looked into a variety of cloud providers and RS had the best options for me, although CDN integration was a big deal for my site so that factor weighed heavier than it might for you.
I use the cheapo 256MB RAM/10GB HD install and pay around ~$12/month after bandwidth costs are figured into it.
Here's the pricing: http://www.rackspace.com/cloud/cloud_hosting_products/servers/pricing/
Why not AWS? It has a free tier that is able to run basic Django apps well. You can run it using a Django AMI directly or a service like BitNami Cloud Hosting (Disclaimer: I am a BitNami developer, I am actually in charge of many of the Python-based stacks). Both options allow you to run a micro instance of an Amazon Machine for free (680Mb Ram, 10Gb disk).
On BitNami Cloud Hosting, we recently added support for Python and Django (Python 2.6.5 and Django 1.3) and we already included Git. When you select to create a new server you will have access to all those components on top of Ubuntu 10.04.
Also if you are interested in using Redmine (as dgel suggests) you can select to install it when you create your server in the same machine. Since it is an university project, you may also want to consider hosting the Git part on github.com for free.
I would highly recommend sourcerepo.com for git and redmine hosting. $6.95 per month for unlimited projects including redmine instances with git hooks. You don't need to worry about setting up or maintaining the git repos or redmine instances yourself.
Then for your project's public hosting you can't beat linode.com for $19.95 per month.