How to version control Airflow variables and connections?

From a development perspective, defining variables and connections inside the UI is effective but not robust, as it is impossible to keep track of what has been added and removed.
Airflow does provide a way to store variables as environment variables. But a few natural questions arise from this:
Does this need to be defined before every DAG? What if I have multiple DAGs sharing the same env values? It seems a bit redundant to be defining them every time.
If defined this way, do they still display in the UI? The UI is still great for taking a quick look at some of the key/value pairs.
In a perfect world, the solution I would be looking for is to define the values of the variables and connections in the airflow.cfg file and have them automatically populate the variables and connections in the UI.
Any kind of help is appreciated. Thank you in advance!

There is one more way of storing and managing variables and connections, one that is the most versatile and secure and gives you full versioning and auditing support: Secret Backends.
https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html
There are built-in integrations with HashiCorp Vault, GCP Secret Manager, and AWS Secrets Manager; there is a Local Filesystem Secrets Backend; and you can also roll your own backend.
When you use one of those, you get all the versioning, management, security, and access control provided by the Secret Backend you use (most secret backends have all of those built in).
This also means that you CANNOT see or edit the values via the Airflow UI, as it is all delegated to those backends. But the backends usually come with their own UIs for that.
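For completeness, here is a minimal sketch of the Local Filesystem Secrets Backend, which is the closest thing to the "just define everything in a file" workflow the question asks for. The two settings are Airflow's [secrets] backend and backend_kwargs options; they are written below as environment variables from a small Python bootstrap snippet purely to keep the example self-contained, and the file paths are made-up examples (in practice you would set these in airflow.cfg or in your deployment environment).

    import json
    import os

    # [secrets] backend: which secrets backend class Airflow should use
    os.environ["AIRFLOW__SECRETS__BACKEND"] = (
        "airflow.secrets.local_filesystem.LocalFilesystemBackend"
    )
    # [secrets] backend_kwargs: where that backend reads variables/connections from
    os.environ["AIRFLOW__SECRETS__BACKEND_KWARGS"] = json.dumps({
        "variables_file_path": "/opt/airflow/secrets/variables.json",      # hypothetical path
        "connections_file_path": "/opt/airflow/secrets/connections.json",  # hypothetical path
    })

The JSON files themselves can then live in git, which gives you the version control the question asks for (keep real secrets in one of the proper secret stores above).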
Answering your questions:
If you define connections/variables via env vars, you should define the variables in your Workers and Scheduler, not in the DAGs. That means that (if your system is distributed) you need a mechanism to update those variables and restart all Airflow processes when they change (for example by deploying new images with those variables, upgrading the Helm chart, or similar); see the naming-convention sketch below.
No, they do not show up in the UI. The UI only displays variables/connections defined in the metadata DB.
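To make the first answer concrete, here is a minimal sketch of the naming convention, assuming Airflow 2.x (the names and values below are made up; exact lookup rules can differ slightly between versions, so check the docs for yours). Airflow resolves AIRFLOW_VAR_* and AIRFLOW_CONN_* environment variables without ever touching the metadata DB:

    import os

    from airflow.hooks.base import BaseHook
    from airflow.models import Variable

    # Variables: AIRFLOW_VAR_<KEY>
    os.environ["AIRFLOW_VAR_MY_API_KEY"] = "some-value"
    # Connections: AIRFLOW_CONN_<CONN_ID>, with the connection encoded as a URI
    os.environ["AIRFLOW_CONN_MY_POSTGRES"] = "postgres://user:secret@db.example.com:5432/mydb"

    # DAG/task code keeps using the normal APIs -- no DB rows required
    print(Variable.get("MY_API_KEY"))                    # "some-value"
    print(BaseHook.get_connection("my_postgres").host)   # "db.example.com"

In a real deployment the two os.environ assignments would instead be environment variables set on the scheduler and worker processes (e.g. baked into your image or Helm values).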

Related

Cache JWKS in Lambda memory vs in temp

I am currently retrieving JWKS keys using the Auth0 JWKS library for my Lambda custom authorizer function.
As explained in this issue on the JWKS library, apparently the caching built into JWKS for the public key ID does not work in Lambda functions, and as such they recommend writing the key to a file in /tmp.
What reasons could there be as to why cache=true would not work?
As far as I was aware, there should be no difference that would prevent in-memory caching from working in Lambda functions while making file-based caching in the /tmp folder the appropriate solution.
As far as I can tell, the only issue would be newly spawned containers hitting the JWKS API's rate limits, not the act of caching in the memory of the created containers.
In which case, what would be the optimal pattern for storing these keys externally in Lambda?
There are a lot of options for solving this. All have different advantages and disadvantages.
First off, storing the keys in memory or on disk (/tmp) has the same result in terms of persistence: both are available across calls to the same Lambda instance.
I would recommend storing the keys in memory, because memory access is a lot faster than reading from a file (on every request).
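As an illustration, here is a minimal sketch of that pattern for a Python Lambda (the JWKS URL is a placeholder). Everything at module level runs once per container during init, so warm invocations reuse the keys already sitting in memory:

    import json
    import urllib.request

    JWKS_URL = "https://YOUR_TENANT.example.com/.well-known/jwks.json"  # placeholder

    _jwks_cache = None  # survives for the lifetime of this Lambda container


    def _get_jwks():
        global _jwks_cache
        if _jwks_cache is None:  # only hit the network on a cold start / first use
            with urllib.request.urlopen(JWKS_URL) as response:
                _jwks_cache = json.loads(response.read())
        return _jwks_cache


    def handler(event, context):
        jwks = _get_jwks()
        # ... verify the incoming token against jwks["keys"] here ...
        return {"cached_keys": len(jwks.get("keys", []))}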
Here are other options to solve this:
1. Store the keys in S3 and download them during init.
2. Store the keys on an EFS volume, mount that volume in your Lambda, and load the keys from the volume during init.
3. Download the keys from the API during init.
4. Package the keys with the Lambda's deployment package and load them from disk during init.
5. Store the keys in AWS SSM Parameter Store and load them during init.
As you might have noticed, the "during init" phase is the most important part for all of those solutions. You don't want to do that for every request.
Options 1 and 2 would require some other "application" that you build to regularly download the keys and store them in S3 or on an EFS volume. That is extra effort, but might in certain circumstances be a good idea for more complex setups.
Option 3 is basically what you are already doing at the moment and is probably the best tradeoff between simplicity and sound engineering for simple use cases. As stated before, you should store the keys in memory.
Option 4 is a working "hack" that is the easiest way to get your key to your Lambda. I'd never recommend doing this, because sudden changes to the key would require a re-deployment of the Lambda, while in the meantime requests can't be authenticated, resulting in downtime.
Option 5 can be a valid alternative to option 3, but requires the same key management by another application as options 1 and 2. So it is not necessarily a good fit for a simple authorizer.

Netlify restrictions on environment variable size

As I understand it, Netlify Environment Variables have some restrictions on size. Looking into it, they use AWS under the hood and are subject to the same restrictions as this service. Most notably:
Keys can contain up to 128 characters. Values can contain up to 256 characters.
The combined size of all environment properties cannot exceed 4,096 bytes when stored as strings with the format key=value.
I'm passing JWT keys to my serverless functions via environment variables in Netlify. The keys in question (particularly the private key) are long enough to exceed these restrictions. My private key is at least 3K characters, well over the 256-character limit outlined above.
How have others managed to get round this issue? Is there another way to add lengthy keys without having to include them in your codebase?
You should not store those keys in environment variables. Even though environment variables can be encrypted, I would not recommend using them for sensitive information like this. There are two possible solutions I can think of.
Solution 1
Use AWS Systems Manager (SSM). Specifically, you should use the parameter store. There you can create key-value pairs like your environment variables and they can be marked as "SecureString", so they are encrypted.
Then you use the AWS SDK to read the value from SSM in your application.
One benefit of this approach is that you can use IAM to restrict access to those SSM parameters and make sure that only trusted people/applications have access. If you use environment variables, you cannot manage access to those values separately.
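As a rough sketch (shown in Python for illustration; the AWS SDK for JavaScript exposes the same GetParameter operation for Netlify/Node functions), reading a SecureString parameter looks like this. The parameter name is a made-up example, and the value is memoized so SSM is called once per process rather than once per request:

    import boto3

    _ssm = boto3.client("ssm")
    _cache = {}


    def get_secret(name):
        # Fetch (and memoize) a SecureString parameter from SSM Parameter Store
        if name not in _cache:
            response = _ssm.get_parameter(Name=name, WithDecryption=True)
            _cache[name] = response["Parameter"]["Value"]
        return _cache[name]


    # e.g. private_key = get_secret("/myapp/jwt/private-key")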
Solution 2
As far as I am aware, those keys must be coming from somewhere. Typically, there are "key endpoints" from whatever authentication provider you use (e.g. Auth0, Okta). In your application you could get the keys from the endpoint with an HTTP call and then cache them for a while to avoid unnecessary HTTP requests.
The benefit of this approach is that you don't have to manage those keys yourself. When they change for whatever reason, you will not need to change or deploy anything to make your application work with the new keys. That said, this should not happen too often, so it is still reasonable from my point of view to "hardcode" the keys in SSM.
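A minimal sketch of solution 2, assuming a standard JWKS-style key endpoint (the URL and the one-hour TTL are made-up examples): fetch the keys over HTTP and keep them behind a small time-based cache so the endpoint is not hit on every request:

    import json
    import time
    import urllib.request

    JWKS_URL = "https://YOUR_TENANT.example.com/.well-known/jwks.json"  # placeholder
    TTL_SECONDS = 3600  # refresh at most once an hour

    _cache = {"keys": None, "fetched_at": 0.0}


    def get_signing_keys():
        now = time.time()
        if _cache["keys"] is None or now - _cache["fetched_at"] > TTL_SECONDS:
            with urllib.request.urlopen(JWKS_URL) as response:
                _cache["keys"] = json.loads(response.read())
            _cache["fetched_at"] = now
        return _cache["keys"]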
A little late to the party here, but there's now a build plugin that allows inlining of extra-long environment variables for use in Netlify functions.
There's a tutorial here: https://ntl.fyi/3Ie1MXH
And you can find the build plugin and docs here: https://github.com/bencao/netlify-plugin-inline-functions-env
If your environment variable needs to preserve new lines (such as in the case of a private key), you'll need to do something like this with the environment variable:
process.env.PRIVATE_KEY.replace(/\\n/g, "\n")
Other than that, it worked a treat for me!

Better way of using AWS parameter store? 1. Keeping all variables in a single parameter as JSON 2. Keeping every variable as a separate parameter

I have a web app to be hosted on the AWS cloud. We are reading all application configuration from AWS Parameter Store. However, I am not sure whether I should have all the variables in a single parameter as JSON or have one parameter for each variable in Parameter Store.
The problem with having a single parameter as a JSON string is that AWS Parameter Store does not return a JSON object, but a string. So we have to bind the string to a model, which involves reflection (a very heavy operation). Having a separate parameter for each variable means additional lines of code in the program (which is not expensive).
Also, my app is a multi-tenant app, which has a tenant resolver in the middleware. So configuration variables will be present for every tenant.
There is no right answer here - it depends. What I can share is my team's logic.
1) Applications are consistently built to read env variables to override defaults
All configuration/secrets are designed this way in our applications. The primary reason is we don't like secrets stored unencrypted on disk. Yes, env variables can still be read, but that is less risky than disk, which might get backed up.
2) SSM Parameter Store can feed values into environment variables
This includes Lambda, ECS Containers, etc.
This allows us to store encrypted (SSM SecureString), transmit encrypted, and inject into applications. It handles KMS decryption for you (assuming you set up the permissions).
3) Jenkins (our CI) can also inject env variables from Jenkins Credentials
4) There is nothing stopping you from building a library that supports both techniques
Our code reads an env variable called secrets_json and, if it exists and passes validation, sets the key/value pairs in it as env variables.
Note: This also handles the aspect you mentioned about JSON being a string.
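A minimal sketch of that pattern, assuming the variable is literally named secrets_json as above (the validation shown is illustrative): parse the JSON blob once at startup and explode it into individual env variables, so the rest of the application reads plain env vars either way:

    import json
    import os


    def load_secrets_json(var_name="secrets_json"):
        raw = os.environ.get(var_name)
        if not raw:
            return  # nothing injected; individually set env variables are used as-is
        secrets = json.loads(raw)  # raises if the blob is not valid JSON
        if not isinstance(secrets, dict):
            raise ValueError(f"{var_name} must be a JSON object of key/value pairs")
        for key, value in secrets.items():
            os.environ.setdefault(key, str(value))  # don't clobber explicitly set values


    load_secrets_json()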
Conclusion
The key here is I believe you want to have code that is flexible and handles several different situations. Use it as a default in all your application designs. We have historically used 1:1 mapping because initially SSM length was limited. We may still use it because it is flexible and supports some of our special rotation policies that secrets manager doesn't yet support.
Hope our experience lets you choose the best way for you and your team.

Building Erlang applications for the cloud

I'm working on a socket server that'll be deployed to AWS, and so far we have the basic OTP application set up following a structure similar to the sample project in Erlang in Practice, but we wanted to avoid having a global message router because that's not going to scale well.
Having looked through the OTP design guide on Distributed Applications and the corresponding chapters (Distribunomicon and Distributed OTP) in Learn You Some Erlang, it seems the built-in distributed application mechanism is geared towards on-premise solutions where you have known hostnames and IPs and the cluster configuration is determined ahead of time. In our intended setup, however, the application will need to scale dynamically up and down and the IP addresses of the nodes will be random.
Sorry, that's a bit of a long-winded build-up; my question is whether there are design guidelines for distributed Erlang applications that are deployed to the cloud and need to deal with all this dynamic scaling?
Thanks,
There are a few possible approaches:
In Erlang and OTP in Action, one method presented is to use one or two central nodes with known domains or IPs, and have all the other nodes connect to them to discover each other
Applications like https://github.com/heroku/redgrid/tree/logplex instead require a central Redis node where all Erlang nodes register themselves, and handle membership management that way
Third-party services like ZooKeeper and whatnot to do something similar
Whatever else people may recommend
Note that whichever approach you pick, you'll likely need to protect your communication, either by switching the distribution protocol to use SSL, or by using AWS security groups and whatnot to restrict who can access your network.
I'm just learning Erlang so can't offer any practical advice of my own, but it sounds like your situation might require a "Resource Discovery" type of approach, as I've read about in Erlang & OTP in Action.
Erlware also have an application to help with this: https://github.com/erlware/resource_discovery
Other stupid answers in addition to Fred's smart answers include:
Using Route53 and targeting a name instead of an IP
Keeping an IP address in AWS KMS or AWS Secrets Manager, and connecting to that (nice thing about this is it's updatable without a rebuild)
Environment variables: scourge or necessary evil?
Stuffing it in a text file in an obscured, password-protected S3 bucket
VPNs
Hardcoding and updating the build in CI/CD
I mostly do #2; a rough sketch of that approach is below.
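A rough sketch of #2, shown in Python purely to illustrate the AWS call (the secret name is a made-up example): keep the coordinator node's address in AWS Secrets Manager and look it up at startup, so it can be rotated without rebuilding anything:

    import boto3


    def coordinator_address(secret_id="erlang/coordinator-address"):
        # Fetch the current coordinator address; rotate it in Secrets Manager, not in the build
        client = boto3.client("secretsmanager")
        return client.get_secret_value(SecretId=secret_id)["SecretString"]


    # e.g. feed coordinator_address() to whatever bootstraps and connects your nodes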

Is it possible to use Django and Node.js?

I have a Django backend set up for user logins and user management, along with my entire set of templates, which are used by visitors to the site to display HTML files. However, I am trying to add real-time functionality to my site, and I found a perfect library in Node.js that allows two users to type in a text box and have the text appear on both their screens. Is it possible to merge the two backends?
It's absolutely possible (and sometimes extremely useful) to run multiple back-ends for different purposes. However it opens up a few cans of worms, depending on what kind of rigour your system is expected to have, who's in your team, etc:
State. You'll want session state to be shared between the different app servers. The easiest way to do this is to store session state externally in a framework-agnostic way. I'd suggest JSON objects in a key/value store, and you'll probably benefit from JSON Schema (see the sketch after this list).
Domains/routing. You'll need your login cookie to be available to both app servers, which means either a single domain routed by Apache/Nginx or separate subdomains routed via DNS. I'd suggest separate subdomains for the following reason
Websockets. I may be out of date, but to my knowledge neither Apache nor Nginx supports proxying of websockets, which means if you want to use that you'll sacrifice the flexibility of using an http server as an app proxy and instead expose Node directly via a subdomain.
Non-specified requirements. Things like monitoring, logging, error notification, build systems, testing, continuous integration/deployment, documentation, etc. all need to be extended to support a new type of component
Skills. You'll have to pay in time or money for the skill-sets required to manage a more complex application architecture
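Here is the sketch referred to in the "State" point above, assuming Redis as the key/value store and the redis-py client (host, key naming, and TTL are illustrative). Both back-ends read and write plain JSON keyed by session ID, so neither framework's native session serialization gets in the way:

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)


    def save_session(session_id, data, ttl_seconds=3600):
        # Store the session as plain JSON so Django and Node.js can both read it
        r.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))


    def load_session(session_id):
        raw = r.get(f"session:{session_id}")
        return json.loads(raw) if raw else {}


    # e.g. save_session("abc123", {"user_id": 42, "is_authenticated": True})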
So, my advice would be to think very carefully about whether you need this. There can be a lot of time and thought involved.
Update: There are actually companies springing around who specialise in adding real-time to existing sites. I'm not going to name any names, but if you look for 'real-time' on the add-on marketplace for hosting platforms (e.g. Heroku) then you'll find them.
Update 2: Nginx now has support for Websockets
You can't merge them. You can, however, send messages from Django to Node.js through a message broker such as Redis.
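For that Redis suggestion, here is a minimal sketch of the Django-side publisher using the redis-py client (channel name and payload are illustrative); the Node.js process would subscribe to the same channel and push updates out over its websockets:

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)


    def notify_text_changed(document_id, text):
        # The Node.js subscriber on "doc-updates" relays this to connected browsers
        r.publish("doc-updates", json.dumps({"document_id": document_id, "text": text}))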
If you really want to use two backends, you could use a database that is supported by both, though I would not recommend it.