I am using parameter store for storing the database credentials, and accessing them using a talend job. However, due to multiple jobs accessing these parameters at the same time; we are facing issue to scale the parameter store, as it reached threshold.
I have decided to go with the AWS Secrets Manager, and so far I am able to successfully create/fetch these secrets using an AWS Lambda function (Python); and I am using Python client side caching library SecretCache and SecretCacheConfig.
Had the below queries regarding the same:
Is it possible to know the location where my lambda function is caching these secrets?
If the secrets changes while my talend job is executing, how to make sure that the talend job uses the updated value for a secret?
How to make sure that the latest value for a secret is being used.
The SecretCache implementation is an in-memory only cache so you will not be able to find the cache on disk anywhere.
Questions ii) and iii) are somewhat related and depend on how the secret is used. In order to have a highly available rotation strategy you must be able to have two active secrets or users and alternate between them. By ensuring the cache is refreshed at more than twice the rotation rate, the cache will switch to the latest user password before the current one is over-written.
For example, if you are using a database, you could setup multi-user rotation and rotate once a day. Since the default cache refresh rate (secret_refresh_interval) is 1 hour, the cache will pick up the latest version of the secret before the next rotation.
Related
I currently am retrieving a JWKS keys using the Auth0 JWKS library for my Lambda custom authoriser function.
As explained in this issue on the JWKS library, apparently the caching built into JWKS for the public key ID does not work on lambda functions and as such they recommend writing the key to the tmp file.
What reasons could there be as to why cache=true would not work?
As far as I was aware, there should be no difference that would prevent in-memory caching working with lambda functions but allow file-based caching on the tmp folder to be the appropriate solution.
As far as I can tell, the only issues that would occur would be from the spawning of containers rate-limiting JWKS API and not the act of caching using the memory of the created containers.
In which case, what would be the optimal pattern of storing this token externally in Lambda?
There are a lot of option how to solve this. All have different advantages and disadvantages.
First of, storing the keys in memory or on the disk (/tmp) has the same result in terms of persistence. Both are available across calls to the same Lambda instance.
I would recommend storing the keys in memory, because memory access is a lot faster than reading from a file (on every request).
Here are other options to solve this:
Store the keys in S3 and download during init.
Store the keys on an EFS volume, mount that volume in your Lambda instance, load the keys from the volume during init.
Download the keys from the API during init.
Package the keys with the Lambdas deployment package and load them from disk during init.
Store the keys in AWS SSM parameter store and load them during init.
As you might have noticed, the "during init" phase is the most important part for all of those solutions. You don't want to do that for every request.
Option 1 and 2 would require some other "application" that you build do regularly download the keys and store them on S3 or a EFS volume. That is extra effort, but might in certain circumstances be a good idea for more complex setups.
Option 3 is basically what you are already doing at the moment and is probably the best tradeoff between simplicity and sound engineering for simple use cases. As stated before, you should store the key in memory.
Option 4 is a working "hack" that is the easiest way to get your key to your Lambda. I'd never recommend doing this, because sudden changes to the key would require a re-deployment of the Lambda, while in the meantime requests can't be authenticated, resulting in a down time.
Option 5 can be a valid alternative to option 3, but requires the same key management by another application like option 1 and 2. So it is not necessarily a good fit for a simple authorizer.
Wanted to know if cloud based platforms such as Azure and Amazon zeroize the content on the hard disk whenever an 'instance' is 'deleted' and prior to making it available for other users?
I've tried using 'dd' command on an Amazon-LightSail instance and it appears that the raw data is indeed zeroized. However was not sure if it was by chance (i just tried few random lengths) or if they actually take care to do that.
The concern is, if I leave passwords in configuration files, then someone who comes along would be able to read them (theoretically). Same goes for data in a database.
Generically, the solution to your concern typically used by Azure is storage encryption.
Your data is encrypted by default at the platform level with a key specific to your subscription; when the data or resource is removed, whether or not the storage is zeroed, it is effective inaccessible to a resource deployed on the same storage in another subscription.
Some functions I am writing would need to store and share a set of cryptographic keys (<1kb) somewhere so that:
it is shared across functions and within instances of the same function
it is maintained after function deploys
The keys are modified (and written) every 4 hours or so, based on whether a key has expired or a new key needs to be created.
Right now, I am storing the keys as encrypted binary in a cloud bucket with access limited to that function. It works, except that it is fairly slow (~500ms for the read / write that is required when updating the keys).
I have considered some other solutions:
Redis: fast, but overkill given the price ($40/month) it would cost to store a single value
Cloud SQL: the functions are already connected to a cloud instance so it would not incur more costs
Dropping everything and using a KMS. Unfortunately it would not meet the requirements I have.
The library I use in my functions is available here.
Is there a better way to store a single small blob of data for cloud functions (and possibly other tools like GKE) ?
Edit
The solution I ended up using was using a single table in a database that the app was already connected to. It is also about 5 times faster than using a bucket (<100ms).
The moral of the story is to use whatever is already provisioned to store the keys. If storing a key is a problem, then using the combo KMS + cloud functions for rotations described below seems like a good option.
All the code + more details are available here.
A better approach would be to manage your keys with Cloud KMS. However, as you mentioned before Cloud KMS does not automatically delete the old key version material and you will need to manually delete old versions which I suspect is a thing that you don’t want to do.
Another possibility is to just keep the keys in Firestore. Since for this you don’t have to provision any specific infrastructure such as with Redis Memorystore and Postgres Cloud SQL it will be easier to manage and to scale in the long run.
The general idea would be to have a Cloud Function triggered by Cloud Scheduler every 4 hours, and this function will rotate the keys on your Cloud Firestore.
How does this sound to you?
I've been reading some articles regarding this topic and have preliminary thoughts as what I should do with it, but still want to see if anyone can share comments if you have more experience with running machine learning on AWS. I was doing a project for a professor at school, and we decided to use AWS. I need to find a cost-effective and efficient way to deploy a forecasting model on it.
What we want to achieve is:
read the data from S3 bucket monthly (there will be new data coming in every month),
run a few python files (.py) for custom-built packages and install dependencies (including the files, no more than 30kb),
produce predicted results into a file back in S3 (JSON or CSV works), or push to other endpoints (most likely to be some BI tools - tableau etc.) - but really this step can be flexible (not web for sure)
First thought I have is AWS sagemaker. However, we'll be using "fb prophet" model to predict the results, and we built a customized package to use in the model, therefore, I don't think the notebook instance is gonna help us. (Please correct me if I'm wrong) My understanding is that sagemaker is a environment to build and train the model, but we already built and trained the model. Plus, we won't be using AWS pre-built models anyways.
Another thing is if we want to use custom-built package, we will need to create container image, and I've never done that before, not sure about the efforts to do that.
2nd option is to create multiple lambda functions
one that triggers to run the python scripts from S3 bucket (2-3 .py files) every time a new file is imported into S3 bucket, which will happen monthly.
one that trigger after the python scripts are done running and produce results and save into S3 bucket.
3rd option will combine both options:
- Use lambda function to trigger the implementation on the python scripts in S3 bucket when the new file comes in.
- Push the result using sagemaker endpoint, which means we host the model on sagemaker and deploy from there.
I am still not entirely sure how to put pre-built model and python scripts onto sagemaker instance and host from there.
I'm hoping whoever has more experience with AWS service can help give me some guidance, in terms of more cost-effective and efficient way to run model.
Thank you!!
I would say it all depends on how heavy your model is / how much data you're running through it. You're right to identify that Lambda will likely be less work. It's quite easy to get a lambda up and running to do the things that you need, and Lambda has a very generous free tier. The problem is:
Lambda functions are fundamentally limited in their processing capacity (they timeout after max 15 minutes).
Your model might be expensive to load.
If you have a lot of data to run through your model, you will need multiple lambdas. Multiple lambdas means you have to load your model multiple times, and that's wasted work. If you're working with "big data" this will get expensive once you get through the free tier.
If you don't have much data, Lambda will work just fine. I would eyeball it as follows: assuming your data processing step is dominated by your model step, and if all your model interactions (loading the model + evaluating all your data) take less than 15min, you're definitely fine. If they take more, you'll need to do a back-of-the-envelope calculation to figure out whether you'd leave the Lambda free tier.
Regarding Lambda: You can literally copy-paste code in to setup a prototype. If your execution takes more than 15min for all your data, you'll need a method of splitting your data up between multiple Lambdas. Consider Step Functions for this.
SageMaker is a set of services that each is responsible for a different part of the Machine Learning process. What you might want to use is the hosted version of Jupyter notebooks in SageMaker. You get a lot of freedom in the size of the instance that you are using (CPU/GPU, memory, and disk), and you can install various packages on that instance (such as FB Prophet). If you need it once a month, you can stop and start the notebook instances between these times and "Run all" the cells in your notebooks on this instance. It will only cost you the minutes of execution.
regarding the other alternatives, it is not trivial to run FB Prophet in Lambda due to the size limit of the libraries that you can install on Lambda (to avoid too long cold start). You can also use ECS (container Service) where you can have much larger images, but you need to know how to build a Docker image of your code and endpoint to be able to call it.
I am trying to benchmark WSO2 identity server 4.5, using postgresql, to measure how many policies can be supported without having a too bad decision time.
I have java program to upload all my policies using EntitlementPolicyAdminServiceStub from org.wso2.carbon.identity.entitlement.stub-4.2.0.jar:
adminStub.addPolicy(myPolicy)
After the 100 first policies there is an important degradation of upload time (more than 2 sec for each policy) and it gets worse with time.
For 3000 policies, WSO2 is no more responsive and, when I have a look to database statistics I can see there are more than 10^12 Tuples Returned for the all the database and 10^11 Sequential Tuples Read for the reg_resource_property table.
Is it something normal or is there a mistake in my WSO2 configuration?
Yes.. there can be some limit.. In your case, it seems to be an issue with policy storing.. By default identity server stores XACML policies in the wso2 registry.. You could see such database statistics due to that.. Because it is not like just putting policy as a database entry in a table.. As it is governance registry, there are lot things happened behind that... If you need more performance with policy storing, I guess it is better to implement a new policy store by extending the default behavior. Basically you can write a policy store to persist policies in a simple database table or even in file system. You can find the source of the Registry policy store from here.
Also, In runtime, all policies are loaded in to the memory, normally this is happened when entitlement engine is initialized.. or less any update is happened.. When number of policies are large there can be delay in retrieving policies from registry (but registry itself as caching and indexing...so may be not as slow as we think). As runtime, all policies are kept in the memory, we may need to consider about the memory footprint of the server. You can increase it using wso2server.sh file.
Also, there are some doc that has been mentioned about performance test with WSO2IS, Please refer it for more details