An API I am building on AWS Lambda performs some basic scikit-learn/numpy operations, such as vector arithmetic and clustering. However, the dependencies are very large (scipy alone is 100 MB+) and run into the deployment limits of many serverless platforms (e.g. Lambda's 250 MB limit). These dependencies also exceed the layer limits.
I'm guessing this is a basic problem for ML engineers, so I'm wondering what the de facto practice for microservice-based architectures is. Is it typical for organizations to host all custom ML logic on a service like AWS SageMaker, even if it's something as simple as scikit-learn? Note: I'm not looking to migrate to Docker (asked here).
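Before picking an architecture, it can help to measure how far over the limit the unpacked dependencies really are. The following is a minimal sketch, assuming the dependencies have been installed into a local directory; the function names are hypothetical and the limit is the 250 MB figure quoted above.

```python
import os

# Unzipped deployment package limit quoted in the question (250 MB).
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_lambda(root, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return (fits, size_in_bytes) for the package directory at root."""
    size = directory_size_bytes(root)
    return size <= limit, size
```

Calling `fits_in_lambda("build")` on a directory populated with `pip install --target build ...` returns whether the unpacked package fits and its size in bytes, which also shows which packages are worth trimming.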
I've been building serverless applications on AWS for the past few years, utilizing services such as Lambda, DynamoDB, SNS, SQS, Kinesis, etc., relying on the Serverless framework for local development and deployments. Due to professional reasons, I now have to switch to Google Cloud, and I've been exploring the serverless space on that platform. Unfortunately, at first glance it doesn't seem to be as mature as AWS, though I don't know whether that's true or just down to my lack of expertise. The reasons that make me claim that are basically the following:
There is no logical grouping of functions and resources: on AWS, Lambda functions are grouped in Applications, and can be deployed as a whole via SAM or the Serverless framework, which also allow creating any associated resource (databases, queues, event buses, etc.). It seems that on GCP functions are treated as individual entities, which makes managing them and orchestrating them harder.
Lack of tooling: both the SAM CLI and the Serverless framework provide tools for local development and deployments. I haven't found anything on GCP like the former (the Functions Framework seems to cover it partially, but it doesn't handle deployments), and even though the latter supports GCP, it's missing basic features, such as creating resources other than functions. What is more, GCP is not in the core framework and the plugin is looking for maintainers.
Fewer event sources: Lambda is directly integrated with a long list of services. On the other hand, Cloud Functions is integrated with just a few services, making HTTP triggers the only option in most cases. It seems they're trying to address this issue with Eventarc, but I don't think it's generally available yet.
Does anybody have any tips on how to set up a local environment for this kind of application and how to manage such applications effectively?
Here is some documentation that might be helpful for your case, though you will need to take a close look at it.
Configure Serverless VPC Access (which I think applies to 'setting up your local environment').
Cloud Run Quick start (which covers how to build and deploy serverless services with GCP Cloud Run using Node.js, Python, Java, etc.).
We are deploying a data consortium between more than 10 companies. We will deploy several machine learning models (in general, advanced analytics models) for all the companies and we will administrate all the models. We are looking for a solution that administrates several servers, clusters and data science pipelines. I love kedro, but I'm not sure what the best option is to administrate everything while using kedro.
In summary, we are looking for the best solution to administrate several models, tasks and pipelines in different servers and possibly Spark clusters. Our current options are:
AWS as our data warehouse and Databricks for administrating servers, clusters and tasks. I don't feel that Databricks notebooks are a good solution for building pipelines and working collaboratively, so I would like to connect kedro to Databricks (is it good? is it easy to schedule kedro pipeline runs using Databricks?)
Using GCP as the data warehouse and Kubeflow (in GCP) for deploying models and for administrating and scheduling the pipelines and the needed resources
Setting up servers on AWS or GCP, installing kedro and scheduling the pipelines with Airflow (I see a big problem administrating 20 servers and 40 pipelines)
I would like to know which of these alternatives is the best option, what their downsides and advantages are, and whether there are more possibilities.
I'll try and summarise what I know, but be aware that I've not been part of a KubeFlow project.
Kedro on Databricks
Our approach was to build our project with CI and then execute the pipeline from a notebook. We did not use the kedro-recommended approach of using databricks-connect due to the large price difference between Jobs and Interactive Clusters (which are needed for DB-connect). If you're working on several TBs of data, this quickly becomes relevant.
As a DS, this approach may feel natural; as a SWE, though, it does not. Running pipelines in notebooks feels hacky. It works, but it feels non-industrialised. Databricks performs well at automatically spinning clusters up and down and taking care of the runtime for you. So their value add is abstracting IaaS away from you (more on that later).
GCP & "Cloud Native"
Pro: GCP's main selling point is BigQuery. It is an incredibly powerful platform, simply because you can be productive from day 0. I've seen people build entire web APIs on top of it. Kubeflow isn't tied to GCP, so you could port this somewhere else later on. Kubernetes will also allow you to run anything else you wish on the cluster: APIs, streaming, web services, websites, you name it.
Con: Kubernetes is complex. If you have 10+ engineers to run this project long-term, you should be OK. But don't underestimate the complexity of Kubernetes. It is to the cloud what Linux is to the OS world. Think log management, noisy neighbours (one cluster for web APIs + batch spark jobs), multi-cluster management (one cluster per department/project), security, resource access etc.
IaaS server approach
Your last alternative, the manual installation of servers, is one I would recommend only if you have a large team, extremely large data and are building a long-term product whose revenue can sustain the large maintenance costs.
The people behind it
What does the talent market look like in your region? If you can hire experienced engineers with GCP knowledge, I'd go for the 2nd solution. GCP is a mature, "native" platform in the sense that it abstracts a lot away for customers. If your market has mainly AWS engineers, that may be a better road to take. If you have a number of kedro engineers, that is also relevant. Note that kedro is agnostic enough to run anywhere. It's really just Python code.
Subjective advice:
Having worked mostly on AWS projects and a few GCP projects, I'd go for GCP. I'd use the platform's components (BigQuery, Cloud Run, PubSub, Functions, K8S) as a toolbox to choose from and build an organisation around that. Kedro can run in any of these contexts: as a job triggered by the Scheduler, as a container on Kubernetes or as an ETL pipeline bringing data into (or out of) BigQuery.
While Databricks is "less management" than raw AWS, it's still servers to think about and VPC networking charges to worry over. BigQuery is simply GB queried. Functions are simply invocation count. These high level components will allow you to quickly show value to customers and you only need to go deeper (RaaS -> PaaS -> IaaS) as you scale.
AWS also has these higher level abstractions over IaaS but in general, it appears (to me) that Google's offering is the most mature. Mainly because they have published tools they've been using internally for almost a decade whereas AWS has built new tools for the market. AWS is the king of IaaS though.
Finally, a bit of content: two former colleagues discussed ML industrialisation frameworks earlier this fall.
I recently started working with AWS and IaC. I'm using CloudFormation to provision my AWS resources, but I discovered that AWS provides both an SDK and a CDK that let you provision resources programmatically instead of with plain JSON/YAML.
Based on the documentation, however, I did not really understand how they differ. Can someone explain to me how they differ and which one to use for which use case?
CDK: A framework to model and provision your infrastructure or stack. A stack can consist of, for example, a DynamoDB database, an S3 bucket, Lambda functions, API Gateway, etc. It lets you write code to create infrastructure in AWS, an approach also called Infrastructure as Code.
Check here
SDK: These are the code libraries provided by Amazon in various languages, like Java, Python, PHP, JavaScript, TypeScript, etc. These libraries help you interact with AWS services (like creating data in DynamoDB) that you create either through the CDK or the console. SDKs simplify using AWS services in your application with an API.
Check here
The AWS SDK is a library whose primary purpose is to ease access to AWS services by handling data (de)serialization, credentials management, failure handling, etc. for you. For specific scenarios you could perhaps use the AWS SDK as an infrastructure-as-code tool, but it would be cumbersome, as that is not the intended usage of the library.
According to https://docs.aws.amazon.com/whitepapers/latest/develop-deploy-dotnet-apps-on-aws/infrastructure-as-code.html, the dedicated tools for IaC are AWS CloudFormation and AWS CDK.
AWS CDK is an abstraction on top of CloudFormation. CDK scripts are in fact transformed to the CloudFormation definitions when scripts are synthesized.
The difference is best described with an example: imagine that for each Lambda function in your stack you want to create a CloudWatch error alarm and connect it to an SNS topic.
With CloudFormation you will either a) need to write a pretty much similar bunch of yaml/json definitions for each lambda function to ensure the monitoring, b) use the nested stack templates, c) use CloudFormation modules.
With CDK you can write a generic code construct - class or method, which can create the alarm for the given lambda function and create the SNS alarm action for given topic.
In other words, CDK helps you generalize and re-use your IaC in much the same way as you develop your business code. The code is shorter and more readable than the CF definitions.
The difference is even more remarkable when you need to set up similar resources in different AWS regions and when you have a different AWS account per environment. You can manage all AWS accounts and regions with a single CDK codebase.
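To make the idea concrete without depending on the CDK itself, here is a plain-Python sketch that generates the CloudFormation resources for the alarm example above. This is not the CDK API: the helpers `error_alarm` and `build_template`, the logical IDs and the alarm thresholds are all assumptions for illustration. The real CDK gives you typed constructs instead of dictionaries, but the principle is the same: one reusable function replaces a near-identical block of YAML per Lambda function.

```python
import json


def error_alarm(function_logical_id, topic_logical_id):
    """Build one CloudFormation alarm resource for a Lambda function.

    Hypothetical helper: the alarm fires when the function reports at
    least one error within a minute and notifies the given SNS topic.
    """
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Errors",
            "Dimensions": [
                {"Name": "FunctionName", "Value": {"Ref": function_logical_id}}
            ],
            "Statistic": "Sum",
            "Period": 60,
            "EvaluationPeriods": 1,
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [{"Ref": topic_logical_id}],
        },
    }


def build_template(function_ids, topic_id="AlertTopic"):
    """Create a template with one shared topic and one alarm per function.

    The Lambda function resources themselves are omitted here; in a real
    template they must exist under the same logical IDs.
    """
    resources = {topic_id: {"Type": "AWS::SNS::Topic"}}
    for function_id in function_ids:
        resources[f"{function_id}ErrorAlarm"] = error_alarm(function_id, topic_id)
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = build_template(["CreateUser", "DeleteUser"])
print(json.dumps(template, indent=2))
```

Adding a third function means adding one string to the list, rather than copying another alarm definition.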
Some background first: CloudFormation is Amazon's solution for an “Infrastructure as Code” approach to managing the definition, provisioning and deployment of a bunch of resources across accounts/regions. This is done by using their declarative yaml/json-based template language to define it all, and then executing the templates through various means (console, cli, APIs...). More info:
white paper: https://docs.aws.amazon.com/whitepapers/latest/develop-deploy-dotnet-apps-on-aws/infrastructure-as-code.html
faq: https://aws.amazon.com/cloudformation/faqs/
There are other popular IaC solutions or tools to help achieve it more easily out there, such as Terraform and Kubernetes (container orchestration that also uses declarative templates to define desired states).
Potential benefits of IaC: At a high level, you can better track & audit your infra, reuse definitions/processes, make all your changes in a more consistent manner, faster thanks to all the automation and assurances you can get with an infra-as-code approach. You may be familiar with these as mentioned in previous answers and more, such as:
version controlling your infrastructure definitions,
more efficient and logically complex ways of constructing templates,
ability to write tests against them,
do diffs (see "change sets") before making real infra changes with the templates,
detect when live infra differs from your definitions,
automate rollbacks,
and lots of other state management assistance through a framework like CF that might be needed when performing regular ops duties.
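The "diff before applying" and "detect drift" items both come down to comparing two descriptions of the same infrastructure. The following toy sketch shows the idea in plain Python; it is not how CloudFormation change sets are implemented, and the `diff_resources` helper and resource names are made up for illustration.

```python
def diff_resources(desired, live):
    """Compare two {name: properties} mappings and classify the differences."""
    desired_names = set(desired)
    live_names = set(live)
    return {
        # Declared in the template but not yet provisioned.
        "add": sorted(desired_names - live_names),
        # Provisioned but no longer declared.
        "remove": sorted(live_names - desired_names),
        # Present on both sides with different properties.
        "modify": sorted(
            name for name in desired_names & live_names
            if desired[name] != live[name]
        ),
    }


desired = {
    "UsersTable": {"BillingMode": "PAY_PER_REQUEST"},
    "AlertTopic": {},
}
live = {
    "UsersTable": {"BillingMode": "PROVISIONED"},
    "OldQueue": {},
}
changes = diff_resources(desired, live)
```

Run with the template as `desired`, this is a change preview; run with the last applied template as `desired` and the actual state as `live`, any non-empty result indicates drift.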
CDK:
This is for helping to automate CloudFormation as part of an IaC approach to provisioning and deploying resources. It lets you use various popular programming languages to help with the creation, testing, and management of your CF setup. Some of AWS’s motivations: “YAML is an excellent format for describing the desired state of your cluster, but it is does not have primitives for expressing logic and reusable abstractions.“ “AWS CDK uses the familiarity and expressive power of programming languages for modeling your applications.”
More info: https://docs.aws.amazon.com/cdk/v2/guide/home.html
However, Amazon knows about other solutions, and happily points them out on the main CDK page now, downplaying its original connection to CF. You don't need to use CloudFormation if you don't want to; specifically, they mention you can use the same CDK constructs with the help of:
cdktf for Terraform, maintained by its creators, HashiCorp
cdk8s for Kubernetes by AWS. re: “We realized this was exactly the same problem our customers had faced when defining their applications through CloudFormation templates, a problem solved by the AWS Cloud Development Kit (AWS CDK), and that we could apply the same design concepts from the AWS CDK to help all Kubernetes users.”
SDK:
AWS has an API for all of their services, and the various SDKs give you access to them. For example, I can use AWS’s Java SDK to manage an API Gateway. If I wanted to script some custom deployment process, I could do so with the SDK, managing all the state, etc. myself. You could probably even re-implement the CloudFormation service with the various underlying APIs... The APIs have varying levels of documentation though. E.g. CloudFormation Java APIs are only mentioned in the raw API reference, not the friendlier Developer Guide.
For me, the difference is that the CDK codifies the CloudFormation JSON/YAML. The first response is "great, okay, it's in code", but the benefit on the code side of things is that you can write unit tests against the code. You therefore get to build a sense of security, or an insurance policy, around the services provisioned by the CDK.
There are other ways to test CF, however, with a dev background, this feels more comfortable.
I am learning GCP, and came across Kubeflow and Google Cloud Composer.
From what I have understood, it seems that both are used to orchestrate workflows, empowering the user to schedule and monitor pipelines in GCP.
The only difference that I could figure out is that Kubeflow deploys and monitors Machine Learning models. Am I correct? In that case, since Machine Learning models are also objects, can't we orchestrate them using Cloud Composer? In what way is Kubeflow better than Cloud Composer when it comes to managing Machine Learning models?
Thanks
Kubeflow and Kubeflow Pipelines
Kubeflow is not exactly the same as Kubeflow Pipelines. The Kubeflow project mostly develops Kubernetes operators for distributed ML training (TFJob, PyTorchJob). The Pipelines project, on the other hand, develops a system for authoring and running pipelines on Kubernetes. KFP also has some sample components, but the main product is the pipeline authoring SDK and the pipeline execution engine.
Kubeflow Pipelines vs. Cloud Composer
The projects are pretty similar, but there are differences:
KFP uses Argo for execution and orchestration. Cloud Composer uses Apache Airflow.
KFP/Argo is designed for distributed execution on Kubernetes. Cloud Composer/Apache Airflow are more for single-machine execution.
KFP/Argo are language-agnostic - components can use any language (components describe containerized command-line programs). Cloud Composer/Apache Airflow use Python (Airflow operators are defined as Python classes).
KFP/Argo have a concept of data passing. Every component has inputs and outputs, and the pipeline connects them into a data-passing graph. Cloud Composer/Apache Airflow do not really have data passing (Airflow has global variable storage and XCom, but that is not the same thing as explicit data passing), and the pipeline is a task dependency graph rather than mostly a data dependency graph (KFP can also have task dependencies, but usually they're not needed).
KFP supports an execution caching feature that skips tasks that have already been executed before.
KFP records all artifacts produced by pipeline runs in the ML Metadata database.
KFP has an experimental adapter which allows using Airflow operators as components.
KFP has a large, fast-growing ecosystem of custom components.
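The data-passing and caching points above can be illustrated with a toy pipeline in plain Python. This is not the KFP SDK; the `Pipeline` class and its methods are hypothetical and only sketch the concept: steps are wired together by named inputs and outputs, and a step whose name and inputs were seen before is served from a cache instead of being run again.

```python
import hashlib
import json


class Pipeline:
    """Toy data-passing pipeline: each step consumes named values and
    publishes its result under its own name."""

    def __init__(self):
        self.steps = []      # (name, function, input names), in declaration order
        self.cache = {}      # cache key -> previously computed output
        self.executions = 0  # number of times a step function actually ran

    def add_step(self, name, func, inputs=()):
        # Simplification: steps must be declared after the steps they depend on.
        self.steps.append((name, func, tuple(inputs)))

    def run(self, parameters):
        values = dict(parameters)
        for name, func, inputs in self.steps:
            args = [values[i] for i in inputs]
            # Cache key from step name and inputs (must be JSON-serializable).
            # Changes to the function body are not detected in this sketch.
            key = hashlib.sha256(json.dumps([name, args]).encode()).hexdigest()
            if key not in self.cache:
                self.cache[key] = func(*args)
                self.executions += 1
            values[name] = self.cache[key]
        return values


pipeline = Pipeline()
pipeline.add_step("cleaned", lambda rows: [r for r in rows if r is not None], inputs=["raw"])
pipeline.add_step("mean", lambda rows: sum(rows) / len(rows), inputs=["cleaned"])

first = pipeline.run({"raw": [1, None, 3]})
second = pipeline.run({"raw": [1, None, 3]})  # both steps are served from the cache
```

In a task-dependency model you would only state that `mean` runs after `cleaned`; here the edge exists because `mean` consumes the output of `cleaned`, which is also what makes caching by inputs possible.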
Kubeflow is a platform for developing and deploying machine learning (ML) systems. Its components are focused on creating workflows aimed at building ML systems.
Cloud Composer provides the infrastructure to run Apache Airflow workflows. Its components are known as Airflow Operators, and the workflows are connections between these operators, known as DAGs.
Both services run on Kubernetes, but they are based on different programming frameworks; so yes, you are correct that Kubeflow deploys and monitors Machine Learning models. See the answers to your questions below:
In that case, since Machine Learning models are also objects, can't we orchestrate them using Cloud Composer?
You would need to find an operator that meets your needs, or create a custom operator with the structure required to create a model (see this example). Even though it can be done, it could be more difficult than using Kubeflow.
How does Kubeflow help in any way, better than Cloud Composer when it comes to managing Machine Learning models??
Kubeflow hides complexity because it is focused on Machine Learning models. Frameworks specialized in machine learning make those things easier than Cloud Composer does, which in this context can be considered a general-purpose tool (focused on linking existing services supported by the Airflow Operators).
Taking this straight from kubeflow.org
The Kubeflow project is dedicated to making deployments of machine
learning (ML) workflows on Kubernetes simple, portable and scalable.
Our goal is not to recreate other services, but to provide a
straightforward way to deploy best-of-breed open-source systems for ML
to diverse infrastructures. Anywhere you are running Kubernetes, you
should be able to run Kubeflow.
And as you can see, it is a suite made up of many pieces of software that are useful in the life cycle of an ML model. It comes with TensorFlow, Jupyter, etc.
Now the real deal when it comes to Kubeflow is "easy deployment of an ML model at scale on a Kubernetes cluster".
However, on GCP you already have an ML suite in the cloud: Datalab, Cloud Build, etc. So I don't know how efficient spinning up a Kubernetes cluster will be if you don't need the "portability" factor.
Cloud Composer is the real deal when talking about orchestration of a workflow. It is a "managed" version of Apache Airflow and it is ideal for any "simple" workflow that changes a lot, since you can change it via a visual UI and with Python.
It is also ideal to automate infrastructure operations:
It's more of an open question and I'm just hoping for any opinions and suggestions. I have AWS in mind but it probably can relate also to other cloud providers.
I'd like to provision an IaC solution that will be easily maintainable and cover all the requirements of a modern serverless architecture. Terraform is a great tool for defining the infrastructure; it has many official resources and stable support from the community. I really like its syntax and the whole concept of modules. However, it's quite bad for working with Lambdas. It also raises another question: should code changes be deployed using the same flow as infrastructure changes? Where to draw the line between code and infrastructure?
On the other hand, the Serverless Framework allows for super easy development and deployment of Lambdas. It's strongly opinionated when it comes to the usage of resources, but it comes with so many out-of-the-box features that it's worth it. It shouldn't really be used for defining the whole infrastructure.
My current approach is to define any shared resources using Terraform and any domain-related resources using Serverless. Here I have another issue that is related to my previous questions: deployment dependency. The simple scenario: Lambda.1 adds users to Cognito (shared resource) which has Lambda.2 as a trigger. I have to create a custom solution for managing the deployment order (Lambda.2 has to be deployed first, etc.). It's possible to hook up the Serverless Framework deployment into Terraform but then again: should the code deployment be mixed with infrastructure deployment?
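For illustration, the ordering problem in that scenario can be written down as a dependency graph and sorted topologically. This is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); the unit names are hypothetical, and a real setup would map each name to a Terraform or Serverless Framework deploy command.

```python
from graphlib import TopologicalSorter

# Map each deployable unit to the units that must exist before it.
dependencies = {
    "lambda_2": set(),
    "cognito_user_pool": {"lambda_2"},  # the pool references Lambda.2 as its trigger
    "lambda_1": {"cognito_user_pool"},  # Lambda.1 adds users to the pool
}


def deployment_order(deps):
    """Return the units in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = deployment_order(dependencies)
# order == ["lambda_2", "cognito_user_pool", "lambda_1"]
```

A circular dependency raises `CycleError`, which is a useful early signal that two units belong in the same deployment.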
It is totally possible to mix the two and I have had to do so a few times. How this looks actually ends up being simpler than it seems.
First off, if you think about whatever you do with the Serverless Framework as developing microservices (without the associated infrastructure management burden), that takes it one step in the right direction. Then, what you can do is decide that everything that is required to make that microservice work internally is defined within that microservice as a part of the service's configuration in the serverless.yml, whether that be DynamoDB tables, Auth0 integrations, Kinesis streams, SQS, SNS, IAM permissions allocated to functions, etc. Keep that all defined as a part of that microservice. Terraform not required.
Now think about what that and other microservices might need to interact with more broadly. These things aren't critical for that service's internal operation but are critical for integration into the rest of the organisation's infrastructure. This includes things like deployment IAM roles used by the Serverless Framework services to deploy into CloudFormation, relational databases that have to be shared amongst multiple services and resources, networking elements (VPCs, Security Groups, etc.), monolithic clusters like ElasticSearch and Redis ... all of these elements are great candidates for definition outside of the Serverless Framework and work really well with Terraform.
Any resource would be able to connect to these Terraform-defined resources as needed, unlike a hard association such as Lambda functions triggered off of an API Gateway endpoint.
Hope that helps