Data Intensive process in EC2 - any tips? - amazon-web-services

We are trying to run an ETL process in an High I/O Instance on Amazon EC2. The same process locally on a very well equipped laptop (with a SSD) take about 1/6th the time. This process is basically transforming data (30 million rows or so) from flat tables to a 3rd normal form schema in the same Oracle instance.
Any ideas on what might be slowing us down?

Or another option is to simply move off of AWS and rent beefy boxes (raw hardware) with SSDs in something like Rackspace.
We have moved most of our ETL processes off of AWS/EMR. We host most of it on Rackspace and getting a lot more CPU/Storage/Performance for the money. Don't get me wrong AWS is awesome but there comes a point where it's not cost effective. On top of that you never know how they are really managing/virtualizing the hardware that applies to your specific application.
My two cents.

Related

How much would AWS ec2 cost for a project of my type

I have tried many times to install the R server on an AWS instance using terminal commands without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and following a Youtube video but I cannot get the dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and Putty and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents which could take 2 weeks to download. I had hoped to save these on a large dropbox account I have access to but unfortunately library("RStudioAMI")
linkDropbox()
excludeSyncDropbox("*") doesn`t seem to work for me and the whole dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget dropbox and just use AWS storage.
I want to download appox 500GB - or perhaps 1TB worth of data (running an R script to download documents and save them), it just connects to a website and downloads a document, so no ML or high computing power needed. Just a consistent connection. Once the documents are fully downloaded I would like to then just transfer them to an external hard drive I have for further analysis.
So my question is, "approximately" how much do you think this may cost, I don't care about paying 20-30$ I just don`t want to go in with inexperience/without knowledge and rack up hundreds$.
Additionally: What other instances/servers do you suggest I pay for, I feel like I dont need that much power just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."
There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.
If you are not strictly limited to EC2 ( which I think you are not, considering the requirement you stated and the AMI approach failed for you) , AWS Lightsail would be a much better solution
It has bundled data transfer package and acceptable performance
Here is the 1-month plan
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer ( Data in will cost nothing, only data Out, Ex: From LightSail to your local PC )
Additional SSD - $10 for 1 TB
Average network performance for that instance I see is about 30 Megabyte per second. You can just shutdown everything and only billed for the hours you used in the month

Can long running query improves performance using AWS?

As we are a data warehouse team, we deals with millions of records in and out on daily basis. We have jobs running ever day, and loads on to SQL Server Flex clones from oracle DB through ETL loads. As we are dealing with huge amount of data and complex queries, query runs pretty longer and it goes to hours. So we are looking towards using AWS. We wanted to setup our own licensed Microsoft SQL server on EC2. But I was wondering, how this will improve performance of long running query. What would be the main reason that same query takes longer on our own servers and executes faster on AWS. Or did I misunderstood the concept?(just letting you know I am at a learning phase)
PS: We are still in a R&D phase. Any thoughts or opinion would be greatly appreciated regarding AWS for long running queries.
You need to provide more details on your question.
What is your query ?
How big is the tables ?
What is the bottle neck ? CPU ? IO ? RAM ?
AWS is just infrastructure.
It does makes your life easier because you can scale up or down your machine in a click of buttons.
Well, I guess you can crank up your machine to however big you want, but even so, nothing will solve a bad query and bad architecture.
Keep in mind, EC2 comes with 2 type of disk. EBS and Ephemeral.
EBS is SAN. Ephemeral is attached to the EC2 instance it self.
By far, Ephemeral will be much faster of course, but the downside is that when you shutdown your EC2 and start it up again, all of the data in that drive is wiped clean.
As for licensing (windows and SQL Server), it is baked into the EC2 instance pre baked AMI (Amazon Machine Image).
I've never used my own license in EC2.
With same DB, Same hardware configuration, query will perform similarly on AWS or on prim. You need to check whether you have configured DB / indexes etc optimally. Also, think of replicating data to some other database which is optimized for querying huge amount of data.

Elastic Beanstalk Auto Scaling - Which metric should I use?

I've got an application that is built in node.js, and is primarily used to post photos to (up to 25mb). The app resizes to thumbnail size, and moves both the thumbnail and full size image to S3. When the uploads begin happening, they usually come in bursts of 10-15 pictures, rinse, wash, repeat in 5 minute durations. I'm seeing a lot of scaling, and the trigger is the default 6MB NetworkOut trigger. My question is, is the moving the photos to S3 considered NetworkOut? Or should I consider a different scaling trigger, so far the app hasn't stuttered so I'm hesitant to not fix what ain't broken, but I am seeing quite a big of scaling so I thought I would investigate. Thanks for any help!
The short answer - scale when ever a resource is constrained. eg, If your instances can keep up with network IO or cpu is above 80% then scale. Yes, sending any data from your ec2 instance is network out traffic. You got to get that data from point A to B somehow :)
As you go up in size on ec2 instances you get more memory and cpu along with more network IO. If you don't see issue with transfers you may want to switch the auto scale over to watch cpu or memory. In an app I'm working on users can start jobs which require a bit of cpu. So I have my auto-scale to scale if my cpu is over 80%. But you might have a process that consumes a lot of memory and not much cpu...
On a side note - you may want to think about having your uploads go directly to your s3 bucket and use a lambda to trigger the resize routine. This has several advantages over your current design. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
I suggest getting familiar with the instance metrics. You can then recognize your app-specific bottlenecks on the current instance type and count.
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-metrics.html

Good setup on AWS for ELK

We are looking into getting an ELK stack setup on Amazon but we don't really know what we need of machines to handle it smoothly.
Now I know that it will become obvious if it doesn't run smooth but still we hoped to get an idea on what we would need for our situation.
So we 4 servers that generate log files in a custom format. About ~45 million lines of logs each day, generating about 4 files of 600mb (gzipped) so around ~24GB of logs each day.
Now we are looking into the ELK stack and would like the dashboards of Kibana display realtime data, so I was thinking of logging using syslog to logstash.
4 Servers -> Rsyslog (on those 4 servers) -> Logstash (AWS) -> ElasticSearch (AWS) -> Kibana (AWS)
So now we need to figure out what kind of hardware we would need in AWS to handle this.
I read somewhere 3 masters for ElasticSearch and 2 datanodes at minimum.
So that would total 5 servers + 1 server for Kibana and 1 for Logstash?
So I would need a total of 7 servers to get started, but that kinda seems overkill?
I would like to keep my data for 1 month, so 31 days at most, so I would have around ~1.4TB of raw logdata in Elastic Search (~45GB x 31)
But since I don't really have a clue on what the best setup would be, any hints/tips/info would be welcome.
Also a system or tool that would handle this for me (node failure, etc) could be useful.
Thanks in advance,
darkownage
Here's how I've architected my cloud clusters:
3 Master nodes - these nodes coordinate the cluster and keeping three of them helps tolerate failure. Ideally these will spread across availability zones. These can be fairly small and ideally do not receive any requests - their only job is to maintain the cluster. In this case set discovery.zen.minimum_master_nodes = 2 to maintain quorum. These IPs and these IPs only are what you should provide to all cluster nodes in discovery.zen.ping.unicast.hosts
Indexes: you should probably take advantage of daily indexes - see https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html This will make more sense below but will also be beneficial if you begin to scale up - you can increase shard count over time without re-indexing.
Data Nodes: Depending on your scale or performance requirements there are a few options - i2.xlarge or d2.xlarge will work well but r3.2xlarge are also a good option. Make sure to keep the JVM heap <30GB. Keep the data paths on ephemeral drives local to the instances - EBS is not really so ideal for this use case but depending on your requirements might be sufficient. Be sure you have multiple data nodes so the replica shards can split across availability zones. As your data requirements increase, just scale these up.
Hot/Warm: Depending on the use case - it sometimes is beneficial to split your data nodes into Hot/Warm (Fast SSD/Slow HDD). This is mainly due to the fact that all writes are in realtime, and the majority of reads are on the past few hours. If you can move yesterday's data onto cheaper, slower drives, it helps out quite a bit. This is a little more involved but you can read more at https://www.elastic.co/blog/hot-warm-architecture. This requires adding some tags and using curator on a nightly basis but is generally worth it due to the cost savings of moving largely unsearched data off of more expensive SSD.
In production, I run ~20 r3.2xlarge for the hot tier and 4-5 d2.xlarge for the warm tier with a replication factor of 2 - this allows ~TB per day of ingest and a decent amount of retention. We scale Hot for volume and Warm for retention.
Overall - good luck! It's a fun stack to build and operate once everything is running smoothly.
PS - Depending on the time/resources you have available, you can run the managed elasticsearch service on AWS, but the last time i looked its ~60% more expensive than running it on your own instances, and YMMV.
Seems like you need something to start with ELK Stack on AWS
Did u tried this couple of CloudFormation scripts, It would ease your installation process and will help you setup your environment in one go.
ELK-Cookbook - CloudFormation Script
ELK-Stack with Google OAuth in Private VPC
Comment below if this doesn't solves your problem.

Amazon EC2 and EBS using Windows AMIs

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O intensive to storage -- typically we use DAS with NTFS mount points (usually around 32 mount points, each to 1TB drives) so i tried to replicate that using EBS but the I/O rates are bad as in 22MB/s tops. We suspect the NIC card to the EBS (which are dymanic SANs if i read correctly) is limiting the pipeline. Our app uses mostly streaming for disk access (not random) so for us it works better when very little gets in the way of our talking to the disk controllers and handling IO directly.
Also when I create a volume and attach it, I see it appear in the instance (fine) and then i make it into a dymamic disk pointing to my mount point, then quick format it -- when I do this does all the data on the volume get wiped? Because it certainly seems so when i attach it to another AMI. I must be missing something.
I'm curious if anyone has any experience putting IO intensive apps up on the EC2 cloud and if so what's the best way to setup the volumes?
Thanks!
I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you make a big file fill it with data, and do a lot of random access I/O to it, it gets better on the second time writing to any specific location.