Costs associated with Data Analysis (data cleaning) on the cloud - amazon-web-services

I am data analyst. My company is moving all data science to a cloud provider (it could be Azure, GCP,AWS). All the data science programming tools like Jupyter notebook will be installed on the cloud environment (there will be no local installations of Python, or Jupyter Notebooks on the laptop).
For most of my work, I will be reading/ingesting relational database tables directly from an on-premise Database. Also most of my data analysis work does not require any GPU instances for data processing. Sometimes, I also do simple research or experimentation data analysis programming such as data cleaning using Jupyter notebooks without the need for usage of GPU instances.
I would like to find out if it would be possible to do such activities without incurring any pay-per-use costs or unnecessary expenses for my company on their data science cloud computing platform given that none of my tasks utilize GPUs? Please advise, thank you.
EDIT Note: It is difficult to work & develop locally with Jupyter on my company PC because I do not have full permissions to install Python packages(usually this has to be requested for approval, which is very painful and takes a very long time).

Jupyter Notebook can be installed in the cloud, but also on prem and on your workstation. You pay either resource in the cloud, on prem, or your worstation.
Of course, if you add large disk, GPUs, CPUs, memory, it costs more! The problem isn't the cost, it is more where do you want to run your notebook?
I think, there is a bad alternative. With Colab you have free Jupyter Notebook instance. But, AFAIK, it's not private, it's public instances and if you work for your company, you can have data leakage. (Not sure, to validate, but it's not a recommended solution in any case)
EDIT 1
Considering your latest comment, I wondering if you need a jupyter notebook to run your code.
Indeed, Jupyter is simply and IDE: you could create your script, even this one that need GPU locally, and to run it on production data on Compute Engine that you provision only for the process. At the end of the script destroy the VM. No Jupyter notebook environment for that, no?
EDIT 2
Thanks to your note, I understand that developing locally isn't an option. In this case, I recommend you to use a managed Jupyter Notebook solution. You can provision this VM on Google Cloud if you want, you can also have different VM, with or without GPU.
The principle is the same: when you stop to work with your instance, stop it. You will only pay for the storage (the disk) when the instance is down.
And the dev principle can be the same: use a small CPU/GPU for your dev, and when you have to process big data, run your script on a powerful VM. Because you pay only when the VM is running, you can optimize cost like that.

In addition to Guillaume's answer, if you want to keep track or to plan ahead if there are cost that will occur while using instances. You can use Google Cloud Platform's Pricing calculator:
https://cloud.google.com/products/calculator?hl=en
With this, you can can choose what product do you're interested to, what kind of components do want in your set-up (e.g. how many RAM, capacity of your storage space, CPU)in case you choose to use GCP Compute Engine, choose what location you are and check if that location price suits your company's budget.
If you want to have more information regarding Google Cloud Platform pricing, you can check out this link:
https://cloud.google.com/compute/all-pricing#compute-optimized_machine_types

Related

How does Google Cloud Run spin up instantly

So, I really like the idea of server-less. I came across Google Cloud Functions and Google Cloud Run.
So google cloud functions are individual functions, which is a broad perspective, I assume google must be securely running on a huge nodejs server. And it contains all the functions of all the google consumers and fulfils the request using unique URLs. Now, Google takes care of the cost of this one big server and charges users for every hit their function gets. So its pay to use. And makes sense.
But when it comes to Cloud Run. I fail to understand how does it work. Obviously the container must not always be running because then they will simply charge a monthly basis instead of a per-hit basis, just like a normal VM where docker image is deployed. But no, in reality, they charge on per hit basis, that means they spin up the container when a request arrives. So, I don't understand how does it spin it up so fast? The users have the flexibility of running any sort of environment, that means the docker container could contain literally anything. Maybe a full-fledged Linux OS. How does it load up the environment OS so quickly and fulfils the request? Well, maybe it maintains the state of the machine and shut it down when not in use, but even then, it will require a decent amount of time to restore the state.
So how does google really does it? How is it able to spin up a customer's container in literally no time?
The idea of fast spinning-up sandboxes containers (that run on their own kernel for security reasons) have been around for a pretty long time. For example, Intel Clear Linux Containers and Firecracker provide fast startup through various optimizations.
As you can imagine, implementing something like this would require optimizations at many layers (scheduling, traffic serving, autoscaling, image caching...).
Without giving away Google’s secrets, we can probably talk about image storage and caching: Just like how VMs use initramfs to pre-cache the state of the VM, instead of reading all the files from harddisk and following the boot sequence, we can do similar tricks with containers.
Google uses a similar solution for Cloud Run, called gVisor. It's a user-space virtualization technique (not an actual VMM or hypervisor). To run containers on a Linux-like environment, gVisor doesn't need to boot a Linux kernel from scratch (because gVisor reimplements the linux kernel in go!).
You’ll find many optimizations on other serverless platforms across most cloud providers (such as how to keep a container instance around, should you be predictively scheduling inactive containers before the load arrives). I recommend reading the Peeking Behind the Curtains of Serverless Platforms paper to get an idea about what are the problems in this space and what are cloud providers trying to optimize for speed and cost.
You have to decouple the containers to the VMs. The second link of Dustin is great because if you understand the principles of Kubernetes (and more if you have a look to Knative), it's easy to translate this to Cloud Run.
You have a pool of resources (Nodes in Kubernetes, the VM in fact with CPU and memory) and on these resources, you can run container: 1, 2, 1000 per VM, maybe, you don't know and you don't care. The power of the container, is the ability to be packaged with all the dependency that it needs. Yes, I talked about package because your container isn't an OS, it contains the dependencies for interacting with the host OS.
For preventing any problem between container from different project/customer, the container run into a sandbox (GVisor, first link of Dustin).
So, there is no VM to start and to stop, no VM to create when you deploy a Cloud Run services,... It's only a start of your container on existing resources. It's also for this reason that you need to have a stateless container, without disks attached to it.
Do you want 3 "secrets"?
It's exactly the same things with Cloud Functions! Your code is packaged into a container and deploy exactly as it's done with Cloud Run.
The underlying platform that manages Cloud Functions and Cloud Run is the same. That's why the behavior and the feature are very similar! Cloud Functions is longer to deploy because Google need to build the container for you. With Cloud Run the container is already built.
Your Compute Engine instance is also managed as a container on the Google infrastructure! More generally, all is container at Google!

Manually install every thing in GCP VM module

I am new to cloud and still learning GCP, I exhausted almost all my free credit for GCP within 2 months while learning different modules.
GCP is great and provides a lot of things to ease the development and maintenance process.
But I realized using different modules cost me a lot.
So I was wondering if I could have a big VM box, install MySQL, Docker, and Java and React required components by myself, I can achieve pretty much what I want without using extra modules.
Am I right?
Can I use the same VM to host multiple sites by changing ports of API, or do I need to have different boxes for that?
Your question is out of GCP domain but about IT architecture. You can create a big VM with all installed on it. But you have to manage it by yourselves and the scalability is hard.
You can also have 1 VM per website, but the management cost is higher (patching and updgrade)! However you can scale with a better granularity (per website).
The standard pattern today is to explode your monolith server into dedicated services. The database on a specific server, the docker and Java in another one, and the react in a static component (like Google Cloud Platform).
If you want to use VM, you can use GKE and you containerize your application. It's far more easier to maintain your VM with an automatic tool like K8S.
The ultimate step, is to use serverless and/or full managed solution. Cloud SQL for your database, GCS for your static content, and App Engine or Cloud Run for your backend. Like this, you pay as you use and if you website is not very used, you won't be charged on it (except for the database).

How much would AWS ec2 cost for a project of my type

I have tried many times to install the R server on an AWS instance using terminal commands without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and following a Youtube video but I cannot get the dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and Putty and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents which could take 2 weeks to download. I had hoped to save these on a large dropbox account I have access to but unfortunately library("RStudioAMI")
linkDropbox()
excludeSyncDropbox("*") doesn`t seem to work for me and the whole dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget dropbox and just use AWS storage.
I want to download appox 500GB - or perhaps 1TB worth of data (running an R script to download documents and save them), it just connects to a website and downloads a document, so no ML or high computing power needed. Just a consistent connection. Once the documents are fully downloaded I would like to then just transfer them to an external hard drive I have for further analysis.
So my question is, "approximately" how much do you think this may cost, I don't care about paying 20-30$ I just don`t want to go in with inexperience/without knowledge and rack up hundreds$.
Additionally: What other instances/servers do you suggest I pay for, I feel like I dont need that much power just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."
There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.
If you are not strictly limited to EC2 ( which I think you are not, considering the requirement you stated and the AMI approach failed for you) , AWS Lightsail would be a much better solution
It has bundled data transfer package and acceptable performance
Here is the 1-month plan
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer ( Data in will cost nothing, only data Out, Ex: From LightSail to your local PC )
Additional SSD - $10 for 1 TB
Average network performance for that instance I see is about 30 Megabyte per second. You can just shutdown everything and only billed for the hours you used in the month

Economic possibility to execute many workflow tasks

so I have these Airflow DAGs which consists of several tasks. Basically each task executes some independent analysis steps against a given code snippet and finally it is decided if this snippet can be used from a regulatory point of view.
Each tasks - depending on the code snippet - is quite short (1-25 minutes at most) and mostly it boils down to executing some external analysis tool (open source and internally) and processing the output of this tool.
All this works quite nice on my development machine but since we are analyzing quite a lot of code snippets during working hours (~50 per hour) and none outside of working hours, I'd like to get all of this up and running somewhere in the cloud (I don't really care if on google cloud, aws or azure).
So my question is what would be an economic way of getting this up and running in the cloud? I thought about using google cloud composer and these google preemptible VMs (the ones that shut down randomly but are super cheap) but it seems that I can not use the PVMs together with cloud composer.
Since the various steps in the DAG are independent the PVMs would be IMO great - if during task execution they are shut down I just retry this one task on a different PVM.
Thank you
On Google Cloud, there are a few options for you.
Run self-managed Airflow on a Compute Engine VM
Run Cloud Composer
The best option will be a mix of how much you want to spend and what features you need. Self-managed Airflow is a great option if you want to have very low cost (less than $100 per month) and are OK self-managing the VM and taking on the risk that the SLA from Google will only cover the VM, so if Airflow malfunctions, you're going to have to detect it and fix it.
The benefit of Composer is the fact that it's integrated so you get things like IAM, Stackdriver, WebUI proxying and so on. You will pay more for the service, however, since it's managed. Presently there is no way to run Composer with preemptible VMs.
Your use case sounds like it could run on a default size cluster on Cloud Composer, though.
It's worth noting that if you go self-managed, you also get the benefit that Google actively contributes to Airflow, so things like the operators should work against the current product APIs. Google also contributes fixes and new operators pretty regularly.

Google Colaboratory vs Google Datalab. How are they different?

I understand both are built over Jupyter noteboooks but run in cloud. Why do we have two then?
Jupyter is the only thing these two services have in common.
Colaboratory is a tool for education and research. It doesn’t require any setup or other Google products to be used (although notebooks are stored in Google Drive). It’s intended primarily for interactive use and long-running background computations may be stopped. It currently only supports Python.
Cloud Datalab allows you to analyse data using Google Cloud resources. You can take full advantage of scalable services such as BigQuery and Machine Learning Engine to analyse, manipulate and visualise data. You can use it with Python, SQL, and JavaScript.
Google Colaboratory is free. But, you are limited to one spec of cpu/ram/disk/gpu.
Google Datalab is paid. You pay for whatever specs you want.
The notebook interface is also a bit different between the two.