I'm building a Dataflow pipeline in Python.
I want to share global variables across pipeline transforms and across worker nodes (i.e., across multiple workers).
Is there any way to do this?
Thanks in advance.
Stateful processing may be of use for sharing state between the workers executing a specific pipeline node (it would not let you share state between transforms, though):
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
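To make that concrete, here is a minimal sketch of a stateful DoFn in the Beam Python SDK that keeps a per-key running index (the element shape and state name are illustrative, and runner support varies). Note that this state is scoped per key and window within that one transform; it is not a global variable visible to other transforms.

```python
# Minimal sketch of Beam stateful processing in Python: a per-key counter.
# The state is managed by the runner and partitioned by key and window of
# this one transform; it is not shared across transforms or all workers.
import apache_beam as beam
from apache_beam.transforms.userstate import CombiningValueStateSpec


class IndexAssigningDoFn(beam.DoFn):
    # Runner-managed state: a running sum maintained separately for each key.
    INDEX_STATE = CombiningValueStateSpec('index', sum)

    def process(self, element, index=beam.DoFn.StateParam(INDEX_STATE)):
        key, value = element  # input must be keyed, e.g. ('user-1', event)
        current_index = index.read()
        index.add(1)
        yield key, value, current_index


# Usage (illustrative):
#   keyed = events | beam.Map(lambda e: (e['user_id'], e))
#   indexed = keyed | beam.ParDo(IndexAssigningDoFn())
```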
From a development perspective, defining variables and connections inside the UI is effective but not robust, as it is impossible to keep track of what has been added and removed.
Airflow provides a way to store variables as environment variables. But a few natural questions arise from this:
Does this need to be defined before every DAG? What if I have multiple DAGs sharing the same env values? It seems a bit redundant to be defining them every time.
If defined this way, do they still display in the UI? The UI is still great for taking a quick look at some of the key-value pairs.
I guess in a perfect world, the solution I'm looking for would be to define the values of the variables and connections in the airflow.cfg file and have them automatically populate the variables and connections in the UI.
Any kind of help is appreciated. Thank you in advance!
There is one more way of storing and managing variables and connections, one that is the most versatile and secure and gives you full versioning and auditing support: Secret Backends.
https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html
It has built-in integration with HashiCorp Vault, GCP Secret Manager, and AWS Secrets Manager; you can use the Local Filesystem Secrets Backend; and you can also roll your own backend.
When you use one of those, you get all the versioning, management, security, and access control provided by the secret backend you use (most secret backends have all of those built in).
This also means that you CANNOT see or edit the values via the Airflow UI, as it is all delegated to those backends. But the backends usually come with their own UIs for that.
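For completeness, rolling your own backend mostly means subclassing Airflow's BaseSecretsBackend and pointing the [secrets] backend option in airflow.cfg at your class. A purely illustrative sketch (the class name and hard-coded values are made up):

```python
# Purely illustrative custom secrets backend; class name and values are made up.
# Enable it in airflow.cfg:
#   [secrets]
#   backend = my_package.secrets.DictSecretsBackend
from airflow.secrets import BaseSecretsBackend


class DictSecretsBackend(BaseSecretsBackend):
    """Serves connections and variables from an in-memory dict."""

    CONNECTIONS = {"my_postgres": "postgres://user:pass@host:5432/db"}
    VARIABLES = {"my_setting": "some-value"}

    def get_conn_uri(self, conn_id):
        # Return a connection URI, or None to fall through to the next backend.
        return self.CONNECTIONS.get(conn_id)

    def get_variable(self, key):
        return self.VARIABLES.get(key)
```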
Answering your questions:
If you define connections/variables via env vars, you should define those variables for your Workers and Scheduler, not in the DAGs. That means that (if your system is distributed) you need a mechanism to update those variables and restart all Airflow processes when they change (for example by deploying new images with those variables, upgrading the Helm chart, or similar).
No. The UI only displays variables/connections defined in the DB.
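To make the env var approach concrete, here is a small sketch of the naming convention Airflow uses (the variable and connection names are illustrative). In a real deployment you would export these in the Scheduler and Worker environments rather than setting them from Python:

```python
# Illustrative only: Airflow resolves Variables from AIRFLOW_VAR_<NAME> and
# Connections from AIRFLOW_CONN_<CONN_ID> (a URI) before looking at the
# metadata DB, which is also why they do not show up in the UI.
import os

os.environ["AIRFLOW_VAR_MY_SETTING"] = "some-value"
os.environ["AIRFLOW_CONN_MY_POSTGRES"] = "postgres://user:pass@host:5432/mydb"

from airflow.models import Variable
from airflow.hooks.base import BaseHook

print(Variable.get("my_setting"))         # -> "some-value"
conn = BaseHook.get_connection("my_postgres")
print(conn.host, conn.port, conn.schema)  # -> host 5432 mydb
```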
We're setting up the back-end architecture for our mobile application to connect to, but we need some help. Our application is built to mimic "take a number" tickets you would see at a deli or pharmacy. Users will use our mobile application to send a request to our node controller and our node controller will respond with a spot number.
We currently have our node controller set up on Amazon Elastic Beanstalk and have enabled load balancing to handle any surges in requests. Our question is: how do we persist our spotNumber across multiple instances of our node controller? We have it built now as a local variable that starts at 1 and increments with each request, but will this persist if AWS spins up a new instance of our node controller to handle increased traffic? If not, what would be the best practice for preserving our spotNumber across all potential instances of our server?
Thank you in advance!
Use a database.
Clearly you can't store the value within the node application, not only due to scaling but to prevent data loss if the application shuts down unexpectedly.
It sounds like you don't already have a database, so DynamoDB might be a good choice, as long as your only use case is to share a counter between applications. You can find an example here.
You could also use Redis on ElastiCache, but I think it's overkill for a single counter.
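If you go the DynamoDB route, a minimal sketch of an atomic counter with boto3 might look like the following (the table and attribute names are made up, and a table with a string partition key counter_id is assumed to exist):

```python
# Illustrative atomic "take a number" counter in DynamoDB. The ADD update is
# atomic, so concurrent instances each receive a unique spot number.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")


def next_spot_number():
    resp = dynamodb.update_item(
        TableName="ticket_counters",
        Key={"counter_id": {"S": "deli_line"}},
        UpdateExpression="ADD spot_number :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["spot_number"]["N"])
```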
Keeping accurate counters at different scales may require different implementations. At small scale, a simple session variable and locking logic in the application would be enough. At a larger scale, however, session synchronization and locking are better managed with a database. In particular for your case, DynamoDB conditional writes or Redis counters seem useful. That said, keep your requirements simple and clear; managing counters at scale may require algorithms and data structures with funny names, like HyperLogLog.
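For comparison, the equivalent Redis counter is essentially a one-liner with INCR (the ElastiCache endpoint below is a placeholder):

```python
# Illustrative Redis counter; INCR is atomic, so every caller gets a distinct,
# monotonically increasing number. The endpoint is a placeholder.
import redis

r = redis.Redis(host="my-cache.example.use1.cache.amazonaws.com", port=6379)


def next_spot_number():
    return r.incr("spot_number")
```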
I have created a Terraform stack for all the resources we use to build out a virtual data center within AWS: VPC, subnets, security groups, etc.
It all works beautifully :). I am having a constant argument with network engineers who want a completely separate state for networking. As a result, we have to manage multiple state files, and it requires 10 to 15 terraform plan/apply commands to bring up the data center. Not only do we have to run the commands multiple times, we also cannot reference the module output variables when creating EC2 instances etc., so "magic" variables are now appearing in variable files. I want to put the scripts that create the EC2 instances, ELBs, etc. in the same directory as the "data center" configuration so that we manage one state file (encrypted in S3 with a DynamoDB lock) and our Git repo has a one-to-one relationship with our infrastructure. There is also the added benefit that a single terraform plan/apply will build the whole data center in a single command.
The question really is: is it a good idea to manage data center resources (VPC, subnets, security groups) and compute resources in a single state file? Are there any issues that I may come across? Does anybody have experience managing an AWS environment with Terraform this way?
Regards,
David
To begin with, the terraform_remote_state data source lets you access output variables from other state files, so you don't have to use magic variables. The rest is just a matter of style. Do you frequently bring the whole data center infrastructure up? If so, you may consider doing it in one project. If, on the other hand, you only change some things, you may want to make it more modular, relying on the outputs of other projects. Keeping them separate makes planning faster and avoids a very costly terraform destroy mistake.
Over the last few years there has been a lot of discussion about layouts for Terraform projects.
Times have also changed with Terraform 1.0, so I think this question deserves some love.
As a result, we have to manage multiple state files, and it requires 10 to 15 terraform plan/apply commands to bring up the data center.
Using modules, it is possible to maintain separate states without having to run commands for each state.
Not only do we have to run the commands multiple times, we also cannot reference the module output variables
Terraform supports output values. By leveraging Terraform Cloud or Terraform remote state, it is possible to introduce dependencies between states.
In my opinion, a prerequisite for venturing into multiple Terraform states is using state locking (the OP refers to the AWS DynamoDB lock mechanism, but other storage backends support this too).
Generally, having everything in a single state is not the best solution and may be considered an anti-pattern.
Having multiple states is referred to as state isolation.
Why would you want to isolate states?
The reasons are multiple and the benefits are clear:
Bug blast radius. If you introduce a bug somewhere in the code and you apply all the code for the entire data center, in the worst possible scenario everything will be affected. On the other hand, if networking were separate, in the worst scenario the bug could only affect networking (which in a DC would still be a very severe issue, but better than affecting everything).
State (write) lock. If you use state locking, Terraform locks the state for any operation that may write to it. This means that with a single state, multiple teams working on separate areas cannot write to the state at the same time; updating the networking blocks instance provisioning, for example.
Secrets. Secrets are written in plain text to the state. A single state means all teams' secrets end up in the same state file (which you must encrypt; the OP is correctly doing this). As with anything in security, having all your eggs in one basket is a risk.
A side benefit of isolating state is that the file layout may help with code ownership (across teams or projects).
How to isolate state?
There are mainly 3 ways:
via file layout (with or without modules)
via workspaces (not to be confused with Terraform Cloud Workspaces)
a mix of the above ways (here be dragons!)
There is no wide consensus on how to do it, but for further reading:
https://blog.gruntwork.io/how-to-manage-terraform-state-28f5697e68fa (a 2016 article; I think this is sort of the root of the discussion)
https://terragrunt.gruntwork.io/docs/features/keep-your-terraform-code-dry/
https://www.terraform-best-practices.com/code-structure
A tool worth looking at may be Terragrunt from Gruntwork.
I am developing a multi-process application using Cassandra. I have a single session opened at the beginning of the server and I want to share that session with the other processes. I just want to know: is this possible with the Cassandra Python driver? If not, why?
Yes, it is possible and recommended to use one session.
4 simple rules when using the DataStax drivers for Cassandra
When using one of the DataStax drivers for Cassandra, whether it's C#, Python, or Java, there are 4 simple rules that should clear up the majority of questions and that will also make your code efficient:
Use one Cluster instance per (physical) cluster (per application lifetime)
Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a PreparedStatement
You can reduce the number of network roundtrips and also have atomic operations by using Batches
Source: http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra
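A small sketch of rules 1 to 3 with the Python driver (the contact points, keyspace, and table are illustrative):

```python
# Illustrative sketch: one Cluster per physical cluster, one shared Session,
# and prepared statements for queries executed repeatedly. Names are made up.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # one Cluster instance per app
session = cluster.connect()                  # one Session, reused everywhere

# Qualify the keyspace in the query rather than opening a Session per keyspace.
select_user = session.prepare("SELECT name FROM my_keyspace.users WHERE id = ?")


def get_user_name(user_id):
    row = session.execute(select_user, (user_id,)).one()
    return row.name if row else None
```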
No, it's not recommended.
Quoting the official datastax documentation:
Be sure to never share any Cluster, Session, or ResponseFuture objects across multiple processes. These objects should all be created after forking the process, not before.
Source: https://datastax.github.io/python-driver/performance.html#multiprocessing
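In practice that means each worker process creates its own Cluster and Session after the fork, along the lines of this sketch (the contact point, keyspace, and query are illustrative):

```python
# Illustrative sketch: every worker process builds its own Cluster/Session
# after forking; nothing driver-related is created in the parent process.
from multiprocessing import Pool
from cassandra.cluster import Cluster

_session = None  # per-process singleton, initialised lazily inside the worker


def _get_session():
    global _session
    if _session is None:
        cluster = Cluster(["127.0.0.1"])
        _session = cluster.connect("my_keyspace")
    return _session


def fetch_name(user_id):
    row = _get_session().execute(
        "SELECT name FROM users WHERE id = %s", (user_id,)
    ).one()
    return row.name if row else None


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(fetch_name, [1, 2, 3, 4]))
```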
Good day. I do not know if this is possible from what I understand of the AWS API documentation, but I was wondering if it is possible to use multithreading to list all my instances asynchronously. By that I mean: can I create a thread to list a number of instances while another thread lists a different set? I have a ridiculously large number to get through, and waiting for the return from a single API call takes far too long. Thank you in advance for any help.
Yes, some SDKs do support async operations; see AmazonEC2Client.describeInstancesAsync() in the Java SDK, for example.
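The question doesn't name an SDK, but as a sketch in Python with boto3 you could fan DescribeInstances calls out across threads, for example one thread per region (the regions below are just examples), paginating inside each thread so large fleets are fully enumerated:

```python
# Illustrative: parallel DescribeInstances calls with one thread per region.
# Each thread creates its own client and paginates through all results.
from concurrent.futures import ThreadPoolExecutor

import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # example regions


def list_instance_ids(region):
    ec2 = boto3.client("ec2", region_name=region)
    ids = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            ids.extend(inst["InstanceId"] for inst in reservation["Instances"])
    return region, ids


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        for region, ids in pool.map(list_instance_ids, REGIONS):
            print(region, len(ids))
```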