I understand both are built on top of Jupyter notebooks but run in the cloud. Why do we have two of them then?
Jupyter is the only thing these two services have in common.
Colaboratory is a tool for education and research. It doesn’t require any setup or other Google products to be used (although notebooks are stored in Google Drive). It’s intended primarily for interactive use and long-running background computations may be stopped. It currently only supports Python.
Cloud Datalab allows you to analyse data using Google Cloud resources. You can take full advantage of scalable services such as BigQuery and Machine Learning Engine to analyse, manipulate and visualise data. You can use it with Python, SQL, and JavaScript.
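For example, both environments can run queries against BigQuery with the official Python client library. A minimal sketch, using one of the BigQuery public sample tables (the google-cloud-bigquery package and authentication are assumed to be set up):

```python
# Minimal sketch: querying BigQuery from a notebook with the official Python client.
# Assumes google-cloud-bigquery is installed and credentials/project are configured.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT word, SUM(word_count) AS occurrences
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY occurrences DESC
    LIMIT 10
"""

for row in client.query(query).result():  # runs the query and waits for results
    print(row.word, row.occurrences)
```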
Google Colaboratory is free, but you are limited to a single CPU/RAM/disk/GPU configuration.
Cloud Datalab is paid: you pay for whatever specs you want.
The notebook interface is also a bit different between the two.
Cloud Data Fusion offers the ability to create ETL jobs using their graphical pipeline UI representation whereas Dataproc lets us run previously created Spark/Hadoop/Hive jobs.
With my limited experience in both these services, I have found Cloud Data Fusion to be the easier of the two to use & manage. I would like to know the use cases in which creating & running jobs in Dataproc is preferred over Cloud Data Fusion.
You asked for an opinion, so your question should be closed...
Anyway, it mainly depends on what you prefer! If you are a developer and you want to handle, manage, and customize/tweak all the steps of your pipeline for performance, observability, or security reasons, then code, and therefore Dataproc, is the better fit for you. The same applies if all your developers already know the Hadoop ecosystem.
If you prefer to focus on data transformation/wrangling with a low-code/no-code solution, Data Fusion is for you, especially if you have little or no development experience (business users).
In the end, all the pipelines will run on Dataproc anyway.
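To make the code-first Dataproc side concrete, here is a minimal sketch of submitting an already-written PySpark job to an existing Dataproc cluster with the Python client library; the project, region, cluster name, and gs:// path are placeholders:

```python
# Minimal sketch: submitting a PySpark job to an existing Dataproc cluster.
# Assumes google-cloud-dataproc is installed; project/region/cluster/script are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "europe-west1"          # placeholder
cluster_name = "my-cluster"      # placeholder

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/analysis.py"},  # placeholder
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()  # blocks until the job finishes
print(result.status.state)
```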
I am a data analyst. My company is moving all data science to a cloud provider (it could be Azure, GCP, or AWS). All data science programming tools like Jupyter Notebook will be installed in the cloud environment (there will be no local installations of Python or Jupyter Notebooks on the laptop).
For most of my work, I will be reading/ingesting relational database tables directly from an on-premise database. Most of my data analysis work also does not require any GPU instances for data processing. Sometimes I also do simple research or experimental data analysis, such as data cleaning in Jupyter notebooks, that does not need GPU instances.
I would like to find out whether it is possible to do such activities without incurring pay-per-use costs or unnecessary expenses for my company on their data science cloud computing platform, given that none of my tasks use GPUs. Please advise, thank you.
EDIT Note: It is difficult to work and develop locally with Jupyter on my company PC because I do not have full permissions to install Python packages (usually this has to be requested for approval, which is very painful and takes a very long time).
Jupyter Notebook can be installed in the cloud, but also on-premises or on your workstation. Either way, you pay for the resources, whether they are in the cloud, on-premises, or on your workstation.
Of course, if you add large disks, GPUs, CPUs, or memory, it costs more! The real question isn't the cost, it's where you want to run your notebook.
There is also a less attractive alternative: with Colab you get a free Jupyter Notebook instance. But, AFAIK, these are not private instances, and if you use them with company data you risk data leakage. (Not sure, to be validated, but it's not a recommended solution in any case.)
EDIT 1
Considering your latest comment, I wonder whether you really need a Jupyter notebook to run your code.
Indeed, Jupyter is simply an IDE: you could develop your script locally, even one that needs a GPU, and run it on production data on a Compute Engine VM that you provision only for that process. At the end of the script, destroy the VM. No Jupyter notebook environment is needed for that, no?
EDIT 2
Thanks to your note, I understand that developing locally isn't an option. In this case, I recommend using a managed Jupyter Notebook solution. You can provision such a VM on Google Cloud if you want, and you can also have different VMs, with or without GPUs.
The principle is the same: when you stop working with your instance, stop it. You will only pay for the storage (the disk) while the instance is down.
And the development principle can be the same: use a small CPU/GPU for your development, and when you have to process big data, run your script on a powerful VM. Because you pay only while the VM is running, you can optimize costs this way.
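As a small illustration of the stop-it-when-idle principle, a VM can be stopped from code with the Compute Engine Python client (the same can be done from the console or the gcloud CLI); project, zone, and instance name below are placeholders:

```python
# Minimal sketch: stopping a Compute Engine VM so you only pay for its disk.
# Assumes a recent google-cloud-compute is installed; project/zone/instance are placeholders.
from google.cloud import compute_v1

def stop_instance(project: str, zone: str, instance: str) -> None:
    client = compute_v1.InstancesClient()
    operation = client.stop(project=project, zone=zone, instance=instance)
    operation.result()  # wait for the stop operation to complete
    print(f"{instance} stopped; you now pay only for its persistent disk.")

stop_instance("my-project", "europe-west1-b", "notebook-vm")  # placeholders
```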
In addition to Guillaume's answer, if you want to keep track of, or plan ahead for, the costs that will occur while using instances, you can use Google Cloud Platform's pricing calculator:
https://cloud.google.com/products/calculator?hl=en
With this, you can choose the product you're interested in and the components you want in your setup (e.g. how much RAM, how much storage capacity, which CPU) in case you choose to use GCP Compute Engine, select your location, and check whether that location's pricing suits your company's budget.
If you want to have more information regarding Google Cloud Platform pricing, you can check out this link:
https://cloud.google.com/compute/all-pricing#compute-optimized_machine_types
I am new to the cloud and still learning GCP. I exhausted almost all of my free GCP credit within 2 months while learning the different modules.
GCP is great and provides a lot of things to ease the development and maintenance process.
But I realized that using the different modules cost me a lot.
So I was wondering: if I get one big VM box and install MySQL, Docker, Java, and the required React components myself, I can achieve pretty much everything I want without using the extra modules.
Am I right?
Can I use the same VM to host multiple sites by changing the API ports, or do I need different boxes for that?
Your question is less about GCP and more about IT architecture. You can create one big VM with everything installed on it, but you have to manage it yourself and scaling it is hard.
You can also have one VM per website, but the management cost is higher (patching and upgrades)! However, you can scale with finer granularity (per website).
The standard pattern today is to split your monolithic server into dedicated services: the database on a specific server, Docker and Java on another one, and the React frontend served as static content (for example from Cloud Storage).
If you want to stay with VMs, you can containerize your application and use GKE. It's far easier to maintain your VMs with an automated tool like Kubernetes.
The ultimate step is to use serverless and/or fully managed solutions: Cloud SQL for your database, GCS for your static content, and App Engine or Cloud Run for your backend. This way you pay as you go, and if your website is not heavily used, you won't be charged much for it (except for the database).
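To make the serverless option concrete, here is a minimal sketch of a backend suitable for Cloud Run once containerized; Cloud Run passes the port to listen on in the PORT environment variable (Flask is just one possible framework):

```python
# Minimal sketch of a containerizable backend for Cloud Run.
# Assumes Flask is installed; Cloud Run injects the PORT environment variable (default 8080).
import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from a serverless backend"

if __name__ == "__main__":
    # Listen on all interfaces on the port provided by the platform.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```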
So I have these Airflow DAGs, each of which consists of several tasks. Basically, each task executes some independent analysis steps against a given code snippet, and finally it is decided whether this snippet can be used from a regulatory point of view.
Each task, depending on the code snippet, is quite short (1-25 minutes at most), and mostly it boils down to executing some external analysis tool (open source or internal) and processing its output.
All this works quite nicely on my development machine, but since we analyze quite a lot of code snippets during working hours (~50 per hour) and none outside of working hours, I'd like to get all of this up and running somewhere in the cloud (I don't really care whether it's Google Cloud, AWS, or Azure).
So my question is: what would be an economical way of getting this up and running in the cloud? I thought about using Google Cloud Composer together with Google's preemptible VMs (the ones that shut down randomly but are super cheap), but it seems that I cannot use the PVMs together with Cloud Composer.
Since the various steps in the DAG are independent, the PVMs would IMO be great: if they are shut down during task execution, I just retry that one task on a different PVM.
Thank you
On Google Cloud, there are a few options for you.
Run self-managed Airflow on a Compute Engine VM
Run Cloud Composer
The best option is a trade-off between how much you want to spend and what features you need. Self-managed Airflow is a great option if you want very low cost (less than $100 per month) and are OK with self-managing the VM and taking on the risk that Google's SLA will only cover the VM, so if Airflow malfunctions, you're going to have to detect and fix it yourself.
The benefit of Composer is the fact that it's integrated so you get things like IAM, Stackdriver, WebUI proxying and so on. You will pay more for the service, however, since it's managed. Presently there is no way to run Composer with preemptible VMs.
Your use case sounds like it could run on a default size cluster on Cloud Composer, though.
It's worth noting that if you go self-managed, you also get the benefit that Google actively contributes to Airflow, so things like the operators should work against the current product APIs. Google also contributes fixes and new operators pretty regularly.
I want to plan an architecture based on the GCP cloud platform. Below are the subject areas I have to cover. Can someone please help me find the proper services to perform each operation?
Data ingestion (Batch, Real-time, Scheduler)
Data profiling
AI/ML based data processing
Analytical data processing
Elastic search
User interface
Batch and Real-time publish
Security
Logging/Audit
Monitoring
Code repository
If I am missing something that I have to take care of, then please add that too.
GCP offers many products whose functionality can partially overlap. Which product to use depends on the more specific use case, and you can find an overview here.
That being said, an overall summary of the services you asked about would be:
1. Data ingestion (Batch, Real-time, Scheduler)
That will depend on where your data comes from, but the most common options are Dataflow (both for batch and streaming) and Pub/Sub for streaming messages.
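As a small illustration of the Dataflow side, pipelines are written with Apache Beam and can be run locally or handed to the Dataflow runner. A minimal sketch, with placeholder bucket paths and a made-up record format:

```python
# Minimal sketch: an Apache Beam batch pipeline that can run locally or on Dataflow.
# Assumes apache-beam is installed; the gs:// paths and record format are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc. to run on Dataflow

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")        # placeholder
        | "ParseAmount" >> beam.Map(lambda line: float(line.split(",")[1]))   # hypothetical schema
        | "Sum" >> beam.CombineGlobally(sum)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/total")       # placeholder
    )
```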
2. Data profiling
Dataprep (which actually runs on top of Dataflow) can be used for data profiling; here is an overview of how you can do it.
3. AI/ML based data processing
For this, you have several options depending on your needs. For developers with limited machine learning expertise there is AutoML, which allows you to quickly train and deploy models. For more experienced data scientists there is ML Engine, which allows training and prediction with custom models built with frameworks like TensorFlow or scikit-learn.
Additionally, there are some pre-trained models for things like video analysis, computer vision, speech to text, speech synthesis, natural language processing or translation.
Plus, it's even possible to perform some ML tasks directly in GCP's data warehouse, BigQuery, using SQL (BigQuery ML).
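As a sketch of that last point, a BigQuery ML model is created with plain SQL, which you can submit from Python like any other query; the dataset, table, and columns below are hypothetical:

```python
# Minimal sketch: training a logistic regression model with BigQuery ML from Python.
# Assumes google-cloud-bigquery is installed; the dataset/table/columns are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
    CREATE OR REPLACE MODEL `mydataset.purchase_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['label']) AS
    SELECT age, country, visits, label
    FROM `mydataset.purchases`
"""

client.query(create_model_sql).result()  # training runs entirely inside BigQuery
print("Model trained with BigQuery ML")
```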
4. Analytical data processing
Depending on your needs, you can use Dataproc, which is a managed Hadoop and Spark service, or Dataflow for stream and batch data processing.
BigQuery is also designed with analytical operations in mind.
5. Elastic search
There is no managed Elasticsearch service provided directly by GCP, but you can find several options on the Marketplace, such as an API service or a Kubernetes app for Google Kubernetes Engine.
6. User interface
If you are referring to a user interface for your own use, GCP’s console is what you’d be using. If you are referring to a UI for end-users, I’d suggest using App Engine.
If you are referring to a UI for data exploration, there is Datalab, which is essentially a managed notebook service, and Data Studio, where you can build plots of your data in real time.
7. Batch and Real-time publish
The publishing service in GCP, for both synchronous and asynchronous messages is Pub/Sub.
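A minimal sketch of publishing a message with the Pub/Sub Python client (project and topic names are placeholders):

```python
# Minimal sketch: publishing a message to a Pub/Sub topic.
# Assumes google-cloud-pubsub is installed; project and topic names are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")  # placeholders

future = publisher.publish(topic_path, b"hello", origin="batch-job")  # attributes are optional
print(f"Published message {future.result()}")  # result() returns the message ID
```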
8. Security
Most security concerns in GCP are addressed here. This is a wide topic by itself and would probably warrant a separate question.
9. Logging/Audit
GCP uses Stackdriver for logging of most of its products, and provides many ways to process and analyze those logs.
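For example, your own applications can write log entries to Stackdriver (now Cloud Logging) with the Python client; the log name below is a placeholder:

```python
# Minimal sketch: writing application log entries to Stackdriver / Cloud Logging.
# Assumes google-cloud-logging is installed; the log name is a placeholder.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
logger = client.logger("my-app-log")  # placeholder log name

logger.log_text("Pipeline finished", severity="INFO")
logger.log_struct({"event": "pipeline_finished", "rows": 1234}, severity="INFO")
```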
10. Monitoring
Stackdriver also has monitoring features.
11. Code repository
For this there is Cloud Source Repositories, which integrates with GCP's automated build system and can also easily be synced with a GitHub repository.
12. Analytical data warehouse
You did not ask for this one, but I think it's an important part of a data analysis stack.
In the case of GCP, this would be BigQuery.