How much would AWS ec2 cost for a project of my type - amazon-web-services

I have tried many times to install the R server on an AWS instance using terminal commands without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and following a Youtube video but I cannot get the dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and Putty and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents which could take 2 weeks to download. I had hoped to save these on a large dropbox account I have access to but unfortunately library("RStudioAMI")
linkDropbox()
excludeSyncDropbox("*") doesn`t seem to work for me and the whole dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget dropbox and just use AWS storage.
I want to download appox 500GB - or perhaps 1TB worth of data (running an R script to download documents and save them), it just connects to a website and downloads a document, so no ML or high computing power needed. Just a consistent connection. Once the documents are fully downloaded I would like to then just transfer them to an external hard drive I have for further analysis.
So my question is, "approximately" how much do you think this may cost, I don't care about paying 20-30$ I just don`t want to go in with inexperience/without knowledge and rack up hundreds$.
Additionally: What other instances/servers do you suggest I pay for, I feel like I dont need that much power just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."

There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.

If you are not strictly limited to EC2 ( which I think you are not, considering the requirement you stated and the AMI approach failed for you) , AWS Lightsail would be a much better solution
It has bundled data transfer package and acceptable performance
Here is the 1-month plan
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer ( Data in will cost nothing, only data Out, Ex: From LightSail to your local PC )
Additional SSD - $10 for 1 TB
Average network performance for that instance I see is about 30 Megabyte per second. You can just shutdown everything and only billed for the hours you used in the month

Related

Costs associated with Data Analysis (data cleaning) on the cloud

I am data analyst. My company is moving all data science to a cloud provider (it could be Azure, GCP,AWS). All the data science programming tools like Jupyter notebook will be installed on the cloud environment (there will be no local installations of Python, or Jupyter Notebooks on the laptop).
For most of my work, I will be reading/ingesting relational database tables directly from an on-premise Database. Also most of my data analysis work does not require any GPU instances for data processing. Sometimes, I also do simple research or experimentation data analysis programming such as data cleaning using Jupyter notebooks without the need for usage of GPU instances.
I would like to find out if it would be possible to do such activities without incurring any pay-per-use costs or unnecessary expenses for my company on their data science cloud computing platform given that none of my tasks utilize GPUs? Please advise, thank you.
EDIT Note: It is difficult to work & develop locally with Jupyter on my company PC because I do not have full permissions to install Python packages(usually this has to be requested for approval, which is very painful and takes a very long time).
Jupyter Notebook can be installed in the cloud, but also on prem and on your workstation. You pay either resource in the cloud, on prem, or your worstation.
Of course, if you add large disk, GPUs, CPUs, memory, it costs more! The problem isn't the cost, it is more where do you want to run your notebook?
I think, there is a bad alternative. With Colab you have free Jupyter Notebook instance. But, AFAIK, it's not private, it's public instances and if you work for your company, you can have data leakage. (Not sure, to validate, but it's not a recommended solution in any case)
EDIT 1
Considering your latest comment, I wondering if you need a jupyter notebook to run your code.
Indeed, Jupyter is simply and IDE: you could create your script, even this one that need GPU locally, and to run it on production data on Compute Engine that you provision only for the process. At the end of the script destroy the VM. No Jupyter notebook environment for that, no?
EDIT 2
Thanks to your note, I understand that developing locally isn't an option. In this case, I recommend you to use a managed Jupyter Notebook solution. You can provision this VM on Google Cloud if you want, you can also have different VM, with or without GPU.
The principle is the same: when you stop to work with your instance, stop it. You will only pay for the storage (the disk) when the instance is down.
And the dev principle can be the same: use a small CPU/GPU for your dev, and when you have to process big data, run your script on a powerful VM. Because you pay only when the VM is running, you can optimize cost like that.
In addition to Guillaume's answer, if you want to keep track or to plan ahead if there are cost that will occur while using instances. You can use Google Cloud Platform's Pricing calculator:
https://cloud.google.com/products/calculator?hl=en
With this, you can can choose what product do you're interested to, what kind of components do want in your set-up (e.g. how many RAM, capacity of your storage space, CPU)in case you choose to use GCP Compute Engine, choose what location you are and check if that location price suits your company's budget.
If you want to have more information regarding Google Cloud Platform pricing, you can check out this link:
https://cloud.google.com/compute/all-pricing#compute-optimized_machine_types

Can long running query improves performance using AWS?

As we are a data warehouse team, we deals with millions of records in and out on daily basis. We have jobs running ever day, and loads on to SQL Server Flex clones from oracle DB through ETL loads. As we are dealing with huge amount of data and complex queries, query runs pretty longer and it goes to hours. So we are looking towards using AWS. We wanted to setup our own licensed Microsoft SQL server on EC2. But I was wondering, how this will improve performance of long running query. What would be the main reason that same query takes longer on our own servers and executes faster on AWS. Or did I misunderstood the concept?(just letting you know I am at a learning phase)
PS: We are still in a R&D phase. Any thoughts or opinion would be greatly appreciated regarding AWS for long running queries.
You need to provide more details on your question.
What is your query ?
How big is the tables ?
What is the bottle neck ? CPU ? IO ? RAM ?
AWS is just infrastructure.
It does makes your life easier because you can scale up or down your machine in a click of buttons.
Well, I guess you can crank up your machine to however big you want, but even so, nothing will solve a bad query and bad architecture.
Keep in mind, EC2 comes with 2 type of disk. EBS and Ephemeral.
EBS is SAN. Ephemeral is attached to the EC2 instance it self.
By far, Ephemeral will be much faster of course, but the downside is that when you shutdown your EC2 and start it up again, all of the data in that drive is wiped clean.
As for licensing (windows and SQL Server), it is baked into the EC2 instance pre baked AMI (Amazon Machine Image).
I've never used my own license in EC2.
With same DB, Same hardware configuration, query will perform similarly on AWS or on prim. You need to check whether you have configured DB / indexes etc optimally. Also, think of replicating data to some other database which is optimized for querying huge amount of data.

Estimate AWS cost

The company which I work right now planning to use AWS to host a new website for a client. Their old website had roughly 75,000 sessions and 250,000 page views per year. We haven't used AWS before and I need to give a rough cost estimate to my project manager.
This new website is going to be mostly content-driven with a cms backend (probably WordPress) + a cost calculator for their services. Can anyone give me a rough idea about the cost to host such kind of a website in aws?
I have used simple monthly calculator with a single Linux t2.small 3 Year upfront which gave me around 470$.
(forgive my English)
The only way to know the cost is to know the actual services you will consume (Amazon EC2, Amazon EBS, database, etc). It is not possible to give an accurate "guess" of these requirements because it really does depend upon the application and usage patterns.
It is normally recommended that you implement the system and run it for a while before committing to Reserved Instances so that you have a chance to measure performance and test a few different instance types.
Be careful using T2 instances for production workloads. They are very powerful instances, but if the CPU Credits run out, the amount of CPU is limited.
Bottom line: Implement, measure, test. Then you'll know what is right for your needs.
Take Note
When you are new in AWS you have a 1 year free tier on a single t2.micro
Just pulled it out, looking into your requirement you may not need this
One load balancer and App server should be fine (Just use route53 to serve some static pages from s3 while upgrading or scalling )
Use of email subscription and processing of Some document can be handled with AWS Lambda, SNS and SWQ which may further reduce the cost ( you may reduce the server size and do all the hevay lifting from Lambda)
A simple webpage with 3000 request/monthly can be handled by T2 micro which is almost free for one year as mentioned above in the note
You don't have a lot of details in your question. AWS has a wide variety of services that you could be using in that scenario. To accurately estimate costs, you should gather these details:
What will the AWS storage be used for? A database, applications, file storage?
How big will the objects be? Each type of storage has different limits on individual file size, estimate your largest object size.
How long will you store these objects? This will help you determine static, persistent or container storage.
What is the total size of the storage you need? Again, different products have different limits.
How often do you need to do backup snapshots? Where will you store them?
Every cloud vendor has a detailed calculator to help you determine costs. However, to use them effectively you need to have all of these questions answered and you need to understand what each product is used for. If you would like to get a quick estimate of costs, you can use this calculator by NetApp.

Good setup on AWS for ELK

We are looking into getting an ELK stack setup on Amazon but we don't really know what we need of machines to handle it smoothly.
Now I know that it will become obvious if it doesn't run smooth but still we hoped to get an idea on what we would need for our situation.
So we 4 servers that generate log files in a custom format. About ~45 million lines of logs each day, generating about 4 files of 600mb (gzipped) so around ~24GB of logs each day.
Now we are looking into the ELK stack and would like the dashboards of Kibana display realtime data, so I was thinking of logging using syslog to logstash.
4 Servers -> Rsyslog (on those 4 servers) -> Logstash (AWS) -> ElasticSearch (AWS) -> Kibana (AWS)
So now we need to figure out what kind of hardware we would need in AWS to handle this.
I read somewhere 3 masters for ElasticSearch and 2 datanodes at minimum.
So that would total 5 servers + 1 server for Kibana and 1 for Logstash?
So I would need a total of 7 servers to get started, but that kinda seems overkill?
I would like to keep my data for 1 month, so 31 days at most, so I would have around ~1.4TB of raw logdata in Elastic Search (~45GB x 31)
But since I don't really have a clue on what the best setup would be, any hints/tips/info would be welcome.
Also a system or tool that would handle this for me (node failure, etc) could be useful.
Thanks in advance,
darkownage
Here's how I've architected my cloud clusters:
3 Master nodes - these nodes coordinate the cluster and keeping three of them helps tolerate failure. Ideally these will spread across availability zones. These can be fairly small and ideally do not receive any requests - their only job is to maintain the cluster. In this case set discovery.zen.minimum_master_nodes = 2 to maintain quorum. These IPs and these IPs only are what you should provide to all cluster nodes in discovery.zen.ping.unicast.hosts
Indexes: you should probably take advantage of daily indexes - see https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html This will make more sense below but will also be beneficial if you begin to scale up - you can increase shard count over time without re-indexing.
Data Nodes: Depending on your scale or performance requirements there are a few options - i2.xlarge or d2.xlarge will work well but r3.2xlarge are also a good option. Make sure to keep the JVM heap <30GB. Keep the data paths on ephemeral drives local to the instances - EBS is not really so ideal for this use case but depending on your requirements might be sufficient. Be sure you have multiple data nodes so the replica shards can split across availability zones. As your data requirements increase, just scale these up.
Hot/Warm: Depending on the use case - it sometimes is beneficial to split your data nodes into Hot/Warm (Fast SSD/Slow HDD). This is mainly due to the fact that all writes are in realtime, and the majority of reads are on the past few hours. If you can move yesterday's data onto cheaper, slower drives, it helps out quite a bit. This is a little more involved but you can read more at https://www.elastic.co/blog/hot-warm-architecture. This requires adding some tags and using curator on a nightly basis but is generally worth it due to the cost savings of moving largely unsearched data off of more expensive SSD.
In production, I run ~20 r3.2xlarge for the hot tier and 4-5 d2.xlarge for the warm tier with a replication factor of 2 - this allows ~TB per day of ingest and a decent amount of retention. We scale Hot for volume and Warm for retention.
Overall - good luck! It's a fun stack to build and operate once everything is running smoothly.
PS - Depending on the time/resources you have available, you can run the managed elasticsearch service on AWS, but the last time i looked its ~60% more expensive than running it on your own instances, and YMMV.
Seems like you need something to start with ELK Stack on AWS
Did u tried this couple of CloudFormation scripts, It would ease your installation process and will help you setup your environment in one go.
ELK-Cookbook - CloudFormation Script
ELK-Stack with Google OAuth in Private VPC
Comment below if this doesn't solves your problem.

Data Intensive process in EC2 - any tips?

We are trying to run an ETL process in an High I/O Instance on Amazon EC2. The same process locally on a very well equipped laptop (with a SSD) take about 1/6th the time. This process is basically transforming data (30 million rows or so) from flat tables to a 3rd normal form schema in the same Oracle instance.
Any ideas on what might be slowing us down?
Or another option is to simply move off of AWS and rent beefy boxes (raw hardware) with SSDs in something like Rackspace.
We have moved most of our ETL processes off of AWS/EMR. We host most of it on Rackspace and getting a lot more CPU/Storage/Performance for the money. Don't get me wrong AWS is awesome but there comes a point where it's not cost effective. On top of that you never know how they are really managing/virtualizing the hardware that applies to your specific application.
My two cents.