I have to access some big files in Amazon S3 buckets and process them. I was planning to use Apache Spark for this. I have 2 EC2 instances for this learning project. They are only used for a few small cron jobs, so could I install and run Spark on them? If so, how do I install Spark on existing EC2 boxes so that I can make one the master and the other a slave?

If it helps, I installed Spark in standalone mode on one box, and on the other as well, setting one as master and the other as slave. The detailed instructions I followed are:

See the tutorial on Apache Spark Cluster on EC2 here

Yes, you can easily create a master/slave cluster with 2 AWS instances. Set SPARK_MASTER_IP to instance 1's private IP on both instances, and put instance 2's private IP in the slaves file in the conf folder. Keep these settings, and any others such as memory and cores, identical on both machines, and make sure Spark is installed at the same location on both. You can then start the cluster from the master.
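As a minimal sketch of the configuration described above, the following Python helper renders the two files so that they come out identical on both machines. The file name spark-env.sh, the memory and core values, and the 10.0.0.x addresses are assumptions for illustration; newer Spark releases use SPARK_MASTER_HOST and a workers file instead, so check the names against your version.

```python
from pathlib import Path


def render_spark_conf(master_ip, worker_ips, worker_memory="2g", worker_cores=2):
    """Return (spark_env_text, slaves_text) for a standalone cluster."""
    spark_env = "\n".join([
        "#!/usr/bin/env bash",
        f"export SPARK_MASTER_IP={master_ip}",          # private IP of the master
        f"export SPARK_WORKER_MEMORY={worker_memory}",  # assumed value
        f"export SPARK_WORKER_CORES={worker_cores}",    # assumed value
    ]) + "\n"
    # The slaves file lists one worker address per line.
    slaves = "\n".join(worker_ips) + "\n"
    return spark_env, slaves


def write_spark_conf(conf_dir, master_ip, worker_ips):
    """Write both files into conf_dir (for example $SPARK_HOME/conf)."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    spark_env, slaves = render_spark_conf(master_ip, worker_ips)
    (conf_dir / "spark-env.sh").write_text(spark_env)
    (conf_dir / "slaves").write_text(slaves)
    return conf_dir


if __name__ == "__main__":
    # Hypothetical private IPs: instance 1 is the master, instance 2 the slave.
    env_text, slaves_text = render_spark_conf("10.0.0.11", ["10.0.0.12"])
    print(env_text)
    print(slaves_text)
```

Run the same helper with the same arguments on both instances, then start the cluster from the master, typically with sbin/start-all.sh, which needs passwordless SSH from the master to the slave.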


Centralized & versioned file based system for a web application deployed in Kubernetes

I am trying to create a centralized file-based repository where I can upload all the configuration files needed by an application that is deployed as a pod in Kubernetes. Any suggestions on how to achieve this? Can the file-based repository version the uploaded files?
I see that s3fs-fuse can be used to achieve this, but as far as I can tell it won't support versioning the config files added to the S3 bucket.
Any other suggestions?
You could use Amazon Elastic File System (EFS), which is supported by EKS:
Applications running in Kubernetes can use EFS file systems to share data between pods in a scale-out group, or with other applications running within or outside of Kubernetes. EFS can also help Kubernetes applications be highly available because all data written to EFS is written to multiple AWS Availability zones. If a Kubernetes pod is terminated and relaunched, the CSI driver will reconnect the EFS file system, even if the pod is relaunched in a different AWS Availability Zone.
But it's not S3, and it does not version files the way S3 does. You would have to add such functionality yourself, e.g. by keeping everything in a git repository on the EFS file system.
Why not use git?
The following article contains an example which runs a git clone within an initContainer:

AWS CloudEndure migration specifics

My company plans on using AWS CloudEndure to migrate a bunch of on-site Hyper-V servers to the AWS cloud.
I want to know specifically what folder structure is migrated, and I have not been able to find this anywhere. For example, if there's VS Code with a very specific configuration and plugins on those servers, is all that configuration migrated as well? Does that mean that the "/user/appdata/.vscode" folder is migrated?
I understand that the agent migrates all the server volumes to an EBS cluster and that they are then replicated to EC2 instances, but can anybody show an example of the file structure that is migrated?
CloudEndure does block-level replication of your disks, so you will end up with an identical replica on your AWS target.
On a Windows machine, check the logs at C:\Program Files (x86)\Cloudendure\agent.log.0 to get the details on what the agent is working on.
CloudEndure works at the disk level: it replicates the data on the disk, following the re-hosting (lift and shift) method. Once the replication is done, you can use the migrated server the same way as your on-premise server, but on AWS.

How to install software on multiple aws ec2 instances?

I created multiple (say 16) AWS EC2 Ubuntu instances.
I want these instances to keep the same settings for later jobs. My question is how I can manage them jointly. For example, how could I install Docker on all of them at once so that I can use Docker Swarm?
Ideally you would actually configure the server build before you deploy the 16 instances.
You would launch a fresh Ubuntu server and install all of the software on it with its configuration. Once all software is installed you'd create an AMI. When you go to launch the 16 servers you'd go ahead with launching them from your AMI instead of the Ubuntu image.
To follow best practices you'd not do this installation by hand, instead using a configuration automation tool such as Ansible, Chef or Puppet to configure the server to your liking.
You can use EC2 user data to install the same software on all the instances when they are created.
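A minimal sketch of that approach, assuming Ubuntu instances and the docker.io package from the Ubuntu repositories: the helper below only builds the user data script as a string. The function name and the package list are hypothetical. Keep in mind that user data normally runs only on the first boot, so it helps for new instances rather than ones that are already running.

```python
def build_user_data(packages):
    """Return a bash script that installs the given apt packages on first boot."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "export DEBIAN_FRONTEND=noninteractive",  # avoid interactive prompts
        "apt-get update -y",
        "apt-get install -y " + " ".join(packages),
    ]
    if "docker.io" in packages:
        # Start Docker now and on every later boot.
        lines.append("systemctl enable --now docker")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Save the output to a file and pass it when launching the instances,
    # e.g. with `aws ec2 run-instances ... --user-data file://user-data.sh`.
    print(build_user_data(["docker.io"]))
```

The script runs as root, so no sudo is needed. Joining the instances into a swarm is a separate step, because each worker needs the join token from the manager.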

Build system when using auto scaling group with ELB in aws

I was using a free-tier AWS account in which I had one EC2 machine (Linux). I have a simple website with a backend server running on Django on port 8000 and a frontend server written in Angular running on the HTTP port (80). I used nginx for HTTPS and for redirecting calls to the backend and frontend servers.
For the backend build system, I did these 3 main steps (which I automated by running Jenkins on the same machine).
1) git pull (Pull the latest code from repo).
2) Do migrations (Updating my db with any new table).
3) Restarting the django server. (I was using gunicorn).
Now I have split my frontend and backend servers onto 2 different machines using auto scaling groups, and I am using ELB (AWS Elastic Load Balancer) to route the requests. I am done with the setup, but I am having a problem with continuous deployment. The main thing is that the ELB uses auto scaling groups, which in turn use an AMI.
Since AMIs are created once, my first question is how to automate this process and deploy my latest code to already running AWS servers.
Second, if I want to run a few steps just once for all the servers, like my second step of updating the db with new tables, how do I achieve that?
Third, if these steps need to run on a machine, do I need another EC2 instance to automate the process of creating the AMI, updating the auto scaling groups with it, and then deploying the latest code to it?
So, basically, I want to know the best practices that people follow for deploying the latest code to AWS machines that were created by auto scaling groups from an AMI. Also, I use Bitbucket for code management.
First Question: how to automate 'package based deployment'.
Instead of creating a new AMI for every release, create a baseline AMI that only changes when a new release requires OS changes, security patches, etc. Look into tools such as Packer to create AMIs automatically. To automate your code deployment when the code changes, you can use a package-based deployment approach: you create a package for every release (this should be part of your CI process) and store it in a repository such as Nexus, Artifactory, or even a simple S3 bucket.
When you deploy a new instance of your application, it should run some sort of script to pull and unpack/install that package on the instance. This is the basic concept; there are many tools that can help you achieve it, for example Chef or AWS CloudFormation.
So essentially, Step 1 should pull the code, create the package, and store it in a repository available to your application servers. This can be done offline.
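The two halves of that flow can be sketched as follows, using only the Python standard library. The function names, the app-<version>.tar.gz naming scheme, and the local directory standing in for the package repository are all assumptions for illustration; in practice the package would be uploaded to and downloaded from Nexus, Artifactory, or S3.

```python
import tarfile
import tempfile
from pathlib import Path


def build_package(source_dir, repo_dir, version):
    """CI side: archive the checked-out code as app-<version>.tar.gz."""
    source_dir, repo_dir = Path(source_dir), Path(repo_dir)
    repo_dir.mkdir(parents=True, exist_ok=True)
    package = repo_dir / f"app-{version}.tar.gz"
    with tarfile.open(package, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the source root.
                tar.add(path, arcname=str(path.relative_to(source_dir)))
    return package


def install_package(package, releases_dir):
    """Instance side: unpack the package into releases/<package name>."""
    package = Path(package)
    release_name = package.name[: -len(".tar.gz")]
    target = Path(releases_dir) / release_name
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(package, "r:gz") as tar:
        tar.extractall(target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "checkout"
        src.mkdir()
        (src / "manage.py").write_text("# placeholder\n")
        pkg = build_package(src, Path(tmp) / "repo", "1.0.0")
        print(install_package(pkg, Path(tmp) / "releases"))
```

A boot script or user data on each new instance would call the install half, so instances launched by the auto scaling group pick up the current release without a new AMI.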
Second Question: How to run other tasks such as updating database schema.
As mentioned above, this can also be part of your deployment automation. If you are using Chef, or even a simple bash script, it can update the database schema before unpacking the new code. This really depends on your database, how you manage it, and who orchestrates the deployment.
For example, you could have a Jenkins job that pulls the new schema and updates your database whenever you roll out a release.
Your third question can be solved by Packer: it can spin up an instance, create an AMI from it, and terminate the instance.
Read more about CI/CD and CI/CD-related tools.

How to change yarn scheduler configuration on aws EMR?

Unlike Hortonworks or Cloudera, AWS EMR does not seem to provide any GUI for changing the XML configurations of the various Hadoop ecosystem frameworks.
Logging into my EMR namenode and doing a quick
find / -iname yarn-site.xml
I found it at /etc/hadoop/conf.empty/yarn-site.xml, and the capacity scheduler configuration at /etc/hadoop/conf.empty/capacity-scheduler.xml.
But note that these are under conf.empty, and I suspect they might not be the actual locations of the yarn-site and capacity-scheduler XMLs.
I understand that I can change these configurations when creating a cluster, but what I need to know is how to change them without tearing down the cluster.
I just want to play around with scheduling properties and try out different schedulers to identify what might work well with my Spark applications.
Thanks in advance!
Well, yarn-site.xml and capacity-scheduler.xml are indeed in the correct location (/etc/hadoop/conf.empty/). On a running cluster, editing them on the master node and restarting the YARN ResourceManager daemon will change the scheduler.
When spinning up a new cluster, you can use the EMR Configurations API to change the appropriate values.
For example, specify the appropriate values in the capacity-scheduler and yarn-site classifications of your configuration, and EMR will change those values in the corresponding XML files.
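As a rough sketch of what such a configuration might look like, the snippet below builds the list of classifications in Python and prints it as JSON. The helper name and the chosen property values are assumptions for illustration; the classification names come from the answer above, and the scheduler class names are the standard Hadoop ones.

```python
import json

FAIR_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)
CAPACITY_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"
)


def scheduler_configurations(scheduler_class, capacity_properties=None):
    """Build an EMR-style configurations list for the given YARN scheduler."""
    configurations = [{
        "Classification": "yarn-site",
        "Properties": {"yarn.resourcemanager.scheduler.class": scheduler_class},
    }]
    if capacity_properties:
        configurations.append({
            "Classification": "capacity-scheduler",
            # Property values are passed as strings.
            "Properties": {k: str(v) for k, v in capacity_properties.items()},
        })
    return configurations


if __name__ == "__main__":
    configs = scheduler_configurations(
        CAPACITY_SCHEDULER,
        {"yarn.scheduler.capacity.maximum-am-resource-percent": 0.5},  # example value
    )
    print(json.dumps(configs, indent=2))
```

The printed JSON can be saved to a file and supplied when the cluster is created, for example through the --configurations option of aws emr create-cluster.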
Edit (Sep 4, 2019):
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Please see