I'm playing around with Google Cloud Functions. My first conclusion: they are really perfect! I created a function which is triggered by a modification to a document stored in a bucket (or a new upload). This works fine.
But then I started to think: what if I want all files already inside the bucket to be run against a NEW function? The previous functions have already run against all files, so I'd prefer to only run the NEW function against all documents.
How do you guys do this? So basically my questions are:
How do you keep track of what functions are already applied to the files?
How do you trigger all files to re-apply all functions?
How do you trigger all files for just ONE (new) function?
How do you keep track of what functions are already applied to the files?
Cloud Functions trigger on events. Once an event fires, a Cloud Function is called (if set up to do so). Nothing within GCP keeps track of this except Stackdriver logging. Your functions will need to keep track of their own actions, including which objects they were triggered for.
How do you trigger all files to re-apply all functions?
There is no command or feature to trigger a function for all files. You will need to implement this feature yourself.
How do you trigger all files for just ONE (new) function?
There is no command or feature to trigger all files against just one new function. You will need to implement this feature yourself.
Depending on the architecture that you are trying to implement, most people use a database such as Cloud Datastore to track objects within a bucket, transformations that occur and results.
Using a database will allow you to accomplish your goals, but with some effort.
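As a rough illustration of the database approach, a Cloud Function could record each object it has processed in Datastore. This is only a sketch: the kind and property names ("ProcessedObject", "bucket", "name", "function") are made up, not a GCP convention.

```python
# Sketch only: record which function has processed which object.
from google.cloud import datastore

client = datastore.Client()

def mark_processed(bucket, name, function_name):
    key = client.key("ProcessedObject", f"{function_name}:{bucket}/{name}")
    entity = datastore.Entity(key=key)
    entity.update({"bucket": bucket, "name": name, "function": function_name})
    client.put(entity)

def on_change(event, context):
    """Background Cloud Function triggered by a change in a bucket."""
    # ... the actual document processing goes here ...
    mark_processed(event["bucket"], event["name"], "my-new-function")
```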
Keep in mind that Cloud Functions time out after running for 540 seconds. This means that if you have millions of files, you will need to implement an overlapping strategy for processing that many objects.
For cases where I need to process millions of objects, I usually launch App Engine Flexible or Compute Engine to complete large tasks and then shut down once completed. The primary reason is the very high bandwidth to Cloud Storage and Datastore.
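For the backfill itself, one workable sketch (not a built-in feature) is to list every object in the bucket from such a VM or batch job and republish each object name to a Pub/Sub topic that triggers only the new function. The project, bucket, and topic names below are placeholders.

```python
# Sketch: backfill existing objects against a NEW function by publishing each
# object name to a Pub/Sub topic that only the new function subscribes to.
from google.cloud import pubsub_v1, storage

PROJECT = "my-project"
BUCKET = "my-bucket"
TOPIC = "reprocess-with-new-function"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)

for blob in storage.Client().list_blobs(BUCKET):
    # The new Cloud Function (subscribed to TOPIC) processes each object name.
    publisher.publish(topic_path, blob.name.encode("utf-8"))
```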
Related
I am looking at AWS Lambda to create a Python function that would process data.
I need to load a heavy model to run my script (trained word2vec model), it takes about 5 min to do it on my computer for example. But once it's loaded, the execution of the function is very fast.
If I use AWS Lambda, will this model load only once, or will it load each time I call my function?
Thanks,
Maybe.
AWS Lambda uses reusable containers. So, for your use case, the Lambda function will execute quickly if it happened in an already initialized container. It'll be slow otherwise. However, there is no way you can predict the behavior.
Relevant documentation:
From here:
The first time a function executes after being created or having its code or resource configuration updated, a new container with the appropriate resources will be created to execute it, and the code for the function will be loaded into the container.
Let’s say your function finishes, and some time passes, then you call it again. Lambda may create a new container all over again, in which case the experience is just as described above. This will be the case for certain if you change your code. However, if you haven’t changed the code and not too much time has gone by, Lambda may reuse the previous container.
Remember, you can’t depend on a container being reused, since it’s Lambda’s prerogative to create a new one instead.
More official documentation here.
It will MAYBE (thanks Michael-sqlbot for the correction) load each time you invoke Lambda.
We can infer that AWS Lambda functions are stateless based on the following:
Lambda is stateless
"Lambda functions are 'stateless' with no affinity to the underlying infrastructure, so that Lambda can rapidly launch as many copies of the function as needed to scale to the rate of incoming events
Lambda must be coded in stateless style
Your Lambda function code must be written in a stateless style, and have no affinity with the underlying compute infrastructure. Your code should expect local file system access, child processes, and similar artifacts to be limited to the lifetime of the request
However Container reuse is possible in Lambda
If you haven’t changed the code and not too much time has gone by, Lambda may reuse the previous container
So basically, to answer your question: it is possible that you get the model back, and the probability of that is inversely proportional to the time span between two Lambda invocations. But you simply cannot rely on it.
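In practice, the common pattern is to exploit container reuse deliberately: load the model at module level, outside the handler, so warm invocations skip the load and only a cold start pays the cost. A minimal sketch, assuming a gensim word2vec model at a placeholder path:

```python
# Sketch: keep the expensive load at module level so a warm container reuses it.
# The model path and the gensim usage are assumptions for illustration.
from gensim.models import KeyedVectors

# Runs only when a new container is initialized (cold start).
MODEL = KeyedVectors.load_word2vec_format("/opt/model.bin", binary=True)

def handler(event, context):
    # Warm invocations reuse MODEL; a cold start pays the load time again.
    word = event.get("word", "example")
    return {"similar": MODEL.most_similar(word, topn=5)}
```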
I've been reading some articles regarding this topic and have some preliminary thoughts about what I should do, but I still want to see if anyone with more experience running machine learning on AWS can share comments. I was doing a project for a professor at school, and we decided to use AWS. I need to find a cost-effective and efficient way to deploy a forecasting model on it.
What we want to achieve is:
read the data from S3 bucket monthly (there will be new data coming in every month),
run a few Python files (.py) for custom-built packages and install dependencies (including the files, no more than 30 KB),
produce predicted results into a file back in S3 (JSON or CSV works), or push to other endpoints (most likely to be some BI tools - tableau etc.) - but really this step can be flexible (not web for sure)
My first thought is AWS SageMaker. However, we'll be using the "fb prophet" model to predict the results, and we built a customized package to use with the model, so I don't think a notebook instance is going to help us (please correct me if I'm wrong). My understanding is that SageMaker is an environment to build and train models, but we already built and trained the model. Plus, we won't be using AWS pre-built models anyway.
Another thing is that if we want to use a custom-built package, we will need to create a container image, and I've never done that before, so I'm not sure how much effort that would take.
The 2nd option is to create multiple Lambda functions:
one that is triggered to run the Python scripts from the S3 bucket (2-3 .py files) every time a new file is imported into the S3 bucket, which will happen monthly.
one that triggers after the Python scripts are done running, produces the results, and saves them back into the S3 bucket.
The 3rd option would combine both options:
- Use a Lambda function to trigger the execution of the Python scripts in the S3 bucket when a new file comes in.
- Push the result using a SageMaker endpoint, which means we host the model on SageMaker and deploy from there.
I am still not entirely sure how to put a pre-built model and Python scripts onto a SageMaker instance and host them from there.
I'm hoping whoever has more experience with AWS services can give me some guidance on a more cost-effective and efficient way to run the model.
Thank you!!
I would say it all depends on how heavy your model is / how much data you're running through it. You're right to identify that Lambda will likely be less work. It's quite easy to get a lambda up and running to do the things that you need, and Lambda has a very generous free tier. The problem is:
Lambda functions are fundamentally limited in their processing capacity (they time out after a maximum of 15 minutes).
Your model might be expensive to load.
If you have a lot of data to run through your model, you will need multiple lambdas. Multiple lambdas means you have to load your model multiple times, and that's wasted work. If you're working with "big data" this will get expensive once you get through the free tier.
If you don't have much data, Lambda will work just fine. I would eyeball it as follows: assuming your data processing step is dominated by your model step, and if all your model interactions (loading the model + evaluating all your data) take less than 15min, you're definitely fine. If they take more, you'll need to do a back-of-the-envelope calculation to figure out whether you'd leave the Lambda free tier.
Regarding Lambda: You can literally copy-paste code in to set up a prototype. If your execution takes more than 15 minutes for all your data, you'll need a method of splitting your data up between multiple Lambdas. Consider Step Functions for this.
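For orientation, an S3-triggered Lambda for this workflow would roughly look like the sketch below. The bucket names, key layout, and the run_forecast() call are placeholders for your own code, and (as the next answer notes) packaging Prophet itself within Lambda's size limits is the hard part.

```python
# Sketch: Lambda triggered by a new monthly file in S3, runs a forecast,
# writes predictions back to S3. Names and run_forecast() are placeholders.
import json
import boto3

s3 = boto3.client("s3")
RESULTS_BUCKET = "my-forecast-results"   # placeholder

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    predictions = run_forecast(body)     # your Prophet-based code goes here

    s3.put_object(
        Bucket=RESULTS_BUCKET,
        Key=f"predictions/{key}.json",
        Body=json.dumps(predictions).encode("utf-8"),
    )

def run_forecast(raw_bytes):
    # Placeholder: parse the data, apply the trained model, return results.
    return {"status": "not implemented in this sketch"}
```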
SageMaker is a set of services, each of which is responsible for a different part of the machine learning process. What you might want to use is the hosted version of Jupyter notebooks in SageMaker. You get a lot of freedom in the size of the instance that you are using (CPU/GPU, memory, and disk), and you can install various packages on that instance (such as FB Prophet). If you need it once a month, you can stop and start the notebook instance between those times and "Run all" the cells in your notebooks on that instance. It will only cost you the minutes of execution.
Regarding the other alternatives: it is not trivial to run FB Prophet in Lambda due to the size limit on the libraries that you can install in Lambda (meant to avoid overly long cold starts). You can also use ECS (Container Service), where you can have much larger images, but you need to know how to build a Docker image of your code and an endpoint to be able to call it.
We are trying out Lambda for our ETL job, which is written in Clojure.
Our architecture is: the scheduler triggers the parent lambda, then the parent lambda triggers 100 child lambdas and a counter lambda. Each child lambda, after completing its work, writes its data to S3. The counter lambda checks the number of files in S3; if it is 100, it combines all the files and saves the result to S3, otherwise it spawns a new counter lambda and dies.
The happy-path scenario works fine, but if any child fails, the counter lambda ends up in an indefinite loop, because there will never be 100 files.
Is there a proper way of spawning a child lambda, monitoring it, and, if it fails, restarting or retrying just that one?
Is there any good Clojure lambda framework ?
Process monitoring is not built into any Lambda Clojure libraries that I know of, so for this case I'd recommend taking a page out of the Erlang playbook (supervisor trees): to have a dependable distributed system, every actor needs a monitor, so a decent approach would be to have a watcher for each lambda task. This can really simplify the error-handling cases, along the lines of the "let it crash" philosophy.
So this would leave you with this list of lambdas:
counters:
a watcher/restarter for the counter (you kind of already have this)
workers x100
supervisors x100
Each supervisor only checks for the presence of one particular file and restarts one particular lambda if that file does not exist. This gets much easier if your process is idempotent, so you don't have to worry too much if a file is produced twice, though it's not too hard to check whether the lambda a supervisor is watching is still running using the AWS API. The supervisor can be started by the thing it's supervising or by the thing that starts the rest of the system, whichever is easier for your codebase. You likely don't need to explicitly start the workers; the supervisor can do that.
The important part is to add CloudWatch or whatever your favourite eventing system is (mine is Riemann) so you can add alerts and know when you need to watch the watchers.
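As a concrete (if simplified) illustration, a per-worker supervisor could check whether its worker's output file exists in S3 and re-invoke the worker if it does not. This sketch is in Python rather than Clojure, and the bucket, key layout, and function names are all assumptions:

```python
# Sketch of a per-worker supervisor: if the worker's output file is missing,
# re-invoke that worker. Bucket, key layout and function names are placeholders.
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
lam = boto3.client("lambda")

def handler(event, context):
    worker_id = event["worker_id"]                  # e.g. 0..99
    bucket, key = "etl-output", f"part-{worker_id:03d}.json"

    try:
        s3.head_object(Bucket=bucket, Key=key)      # output exists: worker finished
        return {"worker_id": worker_id, "status": "done"}
    except ClientError:
        # Output missing: restart just this worker (safe if workers are idempotent).
        lam.invoke(
            FunctionName="etl-child-worker",        # placeholder name
            InvocationType="Event",
            Payload=json.dumps({"worker_id": worker_id}),
        )
        return {"worker_id": worker_id, "status": "restarted"}
```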
There is an easy way to do this in AWS, called AWS Step Functions. Step Functions provides a graphical console to arrange and visualize the components of your application as a series of steps. You can define steps using the AWS Step Functions console or API, a fluent Java API, or AWS CloudFormation templates.
Step Functions makes it simple to orchestrate AWS Lambda functions. Irrespective of the language each function is written in, it manages all the lambdas.
Step Functions is good for the following use cases:
Running functions in sequence
Running functions in parallel
Selecting functions based on data
Retrying functions (a minimal sketch of this follows the list)
try/catch/finally for functions
Running code for hours
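As a rough idea of what the retry handling looks like, here is a minimal state machine with a Retry policy, created via boto3. This is a sketch only; the ARNs, names, and retry settings are placeholders.

```python
# Sketch: a Step Functions state machine that calls one Lambda and retries
# it on failure. All ARNs and names are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunChild",
    "States": {
        "RunChild": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:etl-child-worker",
            "Retry": [
                {"ErrorEquals": ["States.ALL"], "MaxAttempts": 3,
                 "IntervalSeconds": 10, "BackoffRate": 2.0}
            ],
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="etl-with-retries",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",  # placeholder
)
```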
I have by mistake given a wrong name to an AWS Lambda function. Now I want to change its name. I found from the Stack Overflow question below that the best way to do that is to just create a new function and copy the exact same code into it.
Is it possible to rename an AWS Lambda function?
I am thinking of doing that, but I am worried about data loss, since my lambda currently has 2 SNS triggers from which it is constantly receiving data. If I stop this lambda and create a new one, I would lose some data during that time. Also, if I start the new lambda before deleting the previous one, I would get some entries in my datastore twice. So, is there any way to get this done?
As @John Rotenstein said, it is not possible to rename an AWS Lambda function. If you look at the documentation for Lambda (http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html), you will see that updating FunctionName requires replacement of the entity.
If you specify a name, you cannot perform updates that require replacement of this resource. You can perform updates that require no or some interruption. If you must replace the resource, specify a new name.
If you are working with more complex systems, as it seems due to your note of SNS triggers, I would highly encourage you to take a look at CloudFormation (https://aws.amazon.com/cloudformation/), which uses code to manage deployed services. This not only has the benefit of allowing easier updates, but also enables other fun things which are inherent with code, such as integration with a VCS.
As a data loss prevention strategy while you perform this migration, you can create a new Lambda and point it to a staging database, delete the old Lambda, repoint your new Lambda to your production database, and push updates from your staging database into your production database. Check out the import/export docs (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html) to see one method in which you might perform data migration.
There is no rename function for an AWS Lambda function.
You could instead try creating an alias to a Lambda function that would allow both names to function simultaneously. (This is normally used when different versions exist.)
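If you go that route, an alias can be added with boto3 roughly as below. This is only a sketch with placeholder names; note that an alias is addressed as function-name:alias rather than as a fully independent function name.

```python
# Sketch: add an alias so the function can also be referenced under an extra
# name (as "my-badly-named-function:better-name"). All names are placeholders.
import boto3

lam = boto3.client("lambda")

lam.create_alias(
    FunctionName="my-badly-named-function",   # the existing function
    Name="better-name",                       # the alias
    FunctionVersion="$LATEST",
)

# Callers (or triggers) can then reference the alias ARN:
#   arn:aws:lambda:<region>:<account>:function:my-badly-named-function:better-name
```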
We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything under DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use for a scheduler. There are going to be dynamically set schedules in the range of thousands or more, possibly with many running at the same time. For languages, Java or at least the JVM would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler I'm thinking of something all-purpose like Quartz. I want to set a cron expression and have it run at that time with the code I give it. This isn't some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems a bit too much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant, as that might be biting off a lot, with a lot of hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
Lastly, this SO thread may have other options for you to consider.
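For the DynamoDB-triggered option mentioned above, the handler shape is roughly as follows. This is a sketch: it assumes a stream is enabled on the table and mapped to the function, and what each record means (e.g. a schedule entry) is up to your application.

```python
# Sketch: Lambda attached to a DynamoDB Stream. Each inserted/updated item
# could represent a schedule entry; what you do with it is application-specific.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"].get("NewImage", {})
            # NewImage values are in DynamoDB attribute-value form, e.g. {"S": "..."}.
            print("Schedule entry changed:", new_image)
```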
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression:
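The blog's example expression isn't reproduced here, but to make this concrete, a schedule rule can be wired to a function roughly as follows. This is a sketch via boto3; the rule name, function ARN, and cron expression are placeholders.

```python
# Sketch: run a Lambda function on a cron schedule via CloudWatch Events.
import boto3

events = boto3.client("events")

# Create (or update) the schedule rule.
events.put_rule(
    Name="monthly-job",
    ScheduleExpression="cron(0 3 1 * ? *)",   # 03:00 UTC on the 1st of every month
    State="ENABLED",
)

# Point the rule at the Lambda function.
events.put_targets(
    Rule="monthly-job",
    Targets=[{
        "Id": "monthly-job-target",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-scheduled-function",
    }],
)

# The function also needs a resource-based permission (lambda add-permission)
# allowing events.amazonaws.com to invoke it from this rule.
```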