Migrating On Premise Apache Airflow workflow scripts to Cloud Composer - google-cloud-platform

I have an on-premise environment running Airflow v2.2.0 and wish to migrate all the workflows in this instance to a Cloud Composer instance. While doing this migration, some of the operators used on the on-premise environment do not work after running the same workflow on Cloud Composer 1 (Airflow 2.1.4).
Below is a single task from such a workflow:
hive_task = HiveOperator(
hql="./scripts/hive-script.sql",
task_id="survey_data_aggregator",
hive_cli_conn_id="hive_server_conn",
dag=data_aggregator,
hiveconfs={"input_path": "{{ params['input_path'] }}", "output_path": "{{ params['output_path'] }}"}
)
Execution of the workflow results in the following error [Errno 2] No such file or directory: 'beeline'
Having faced this error before, I know it is due to the worker nodes not having the beeline binary in its PATH. However, I do not want to SSH into every worker instance and update its PATH variable.
Upon further research, it was found that when I replace HiveOperator with DataProcHiveOperator, and update the arguments accordingly, the workflow works as expected. This too is an unacceptable solution as manually editing each workflow script is not a practical workaround. Additionally, there might be more operators which require manual intervention similar to this which would further increase the manual effort needed by the migration.
What is the optimal way of handling situations such as these without having to amend workflow scripts manually? Furthermore, is there an official documentation to a Google recommended way to migrate Apache Airflows from on-premise instances to Cloud Composer as I am unable to find a reference to it.
EDIT 1:
I found out that if a task requires 3rd party binaries, the KubernetesPodOperator could be used to spin up a pod containing a docker image which contains all the required binaries and dependencies.
kubernetes_min_pod = KubernetesPodOperator(
task_id='pod-ex-minimum',
name='pod-ex-minimum',
cmds=['echo', '"hello world"'],
namespace='default',
image='gcr.io/gcp-runtimes/ubuntu_18_0_4'
)
However this pod is killed as soon as after the command specified in the cmds is executed. Is there a way to specify that all succeeding tasks be run inside this pod?
EDIT 2:
There's been some confusion about my question. Please let me clarify:
While I understand that if my DAG needs a binary to work, I can download that binary on the airflow worker pod and update the PATH variable. I am keeping this option as a last resort for now.
I am basically looking for a way of spinning up a Google Composer instance with my environment having all prerequisite binaries. Ideally, this would work if I could specify a custom image for the airflow worker pod, where I can preinstall all my required dependencies, but I have been unable to find a way to do this.

Related

Best way to update code on Azure Linux VMSS from Git using JENKINS

I am planning to use Azure VMSS for deploying a set of spring boot apps. I am planning to create a custom linux VM image with all the required softwares/utilities as well as the required directory structure and configure this image in VMSS. We use jenkins as CI/CD tool and Git as source code repo. What is the best way to build and deploy these spring boot apps on VMSS?
I think one way is to write a custom script extension which downloads code from Git repo and then starts these spring boot apps. I believe this script will then get executed every time a new VM is provisioned.
But what about cases where already multiple VMs are running on top of minimum scale instance count. I believe a manual restart will not trigger the CSE script to run on these already running VMs right?
Could anyone advise the best way to handle this?
Also once a VM is deallocated due to auto scale down, what is the best/cost optimal way to back up the log files from VM to storage (blob or file share)?
You could enable Automatically tear down virtual machines after every use in the organization settings/project setting >> agent pool >> VMSS agent pool >> settings. Then, a new VM instance is used for every job. After running a job, the VM will go offline and be reimaged before it picks up another job. The Custom Script Extension will be executed on every virtual machine in the scaleset immediately after it is created or reimaged. Here is the reference document: Create the scale set agent pool.
To back up the log files from VM, you could refer to Troubleshoot and support about related file path on the target virtual machine.

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster that has one of the outdated versions (for example, 1.3.94-debian10) that contain the vulnerabilities in Apache Log4j 2 utility. The goal is to get the alert related (DATAPROC_IMAGE_OUTDATED), in order to check how SCC works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster, which makes sense, in order to protect the cluster.
I did some research and discovered that I will have to create a custom image with said version and generate the cluster from that. The thing is, I have tried to read the documentation or find some tutorial, but I still can't understand how to start or to run the file generate_custom_image.py, for example, since I am not confortable with cloud shell (I prefer the console).
Can someone help? Thank you

How airflow loads/updates DagBag from dags home folder on google cloud platform?

Please do not down vote my answer. If needed then I will update and correct my words. I have done my home-work research. I am little new so trying to understand this.
I would like to understand that how do airflow on Google cloud platform gets the changes from dags home folder to UI. Also Please help me with my dags setup script. I have read so many answers along with books. book link is here
I tried figuring out my answer from page 69 which says
3.11 Scheduling & Triggers The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have
been met. Behind the scenes, it monitors and stays in sync with a
folder for all DAG objects it may contain, and periodically (every
minute or so) inspects active tasks to see whether they can be
triggered.
My understanding from this book is that scheduler regularly takes changes from dags home folder. (Is it correct?)
I also read multiple answers on stack overflow , I found this one useful Link
But still answer does not contain process that is doing this creation/updation of dagbag from script.py in dag home folder. How changes are sensed.
Please help me with my dags setup script.
We have created a generic python script that dynamically creates dags by reading/iterating over config files.
Below is directory structure
/dags/workflow/
/dags/workflow/config/dag_a.json
/dags/workflow/config/dag_b.json
/dags/workflow/task_a_with_single_operator.py
/dags/workflow/task_b_with_single_operator.py
/dags/dag_creater.py
Execution flow dag_creater.py is as following :-
1. Iterate in dags/workflow/config folder get the Config JSON file and
read variable dag_id.
2. create Parent_dag = DAG(dag_id=dag_id,
start_date=start_date, schedule_interval=schedule_interval,
default_args=default_args, catchup=False)
3. Read tasks with dependencies of that dag_id from config json file
(example :- [[a,[]],[b,[a]],[c,[b]]]) and code it as task_a >>
task_b >> task_c
This way dag is created. All works fine. Dags are also visible on UI and running fine.
But problem is, My dag creation script is running every time. Even in each task logs I see logs of all the dags. I expect this script to run once. just to fill entry in metadata. I am unable to understand like why is it running every time.
Please make me understand the process.
I know airflow initdb is run once we setup metadata first time. So that is not doing this update all time.
Is it scheduler heart beat updating all?
Is my setup correct?
Please Note: I can't type real code as that is the restriction from my
organization. However if asked, i will provide more information.
Airflow Scheduler is actually continuously running in Airflow runtime environment as a main contributor for monitoring changes in DAG folder and triggering the relevant DAG tasks residing in this folder. The main settings for Airflow Scheduler service can be found in airflow.cfg file, essentially the heart beat intervals which effectively impact the general DAG tasks maintenance.
However, the way how the particular task will be executed is defined as per the Executor's model in Airflow configuration.
To store DAGs being available for the Airflow runtime environment GCP Composer uses Cloud Storage, implementing the specific folder structure, synchronizing any object arriving to /dags folder with *.py extension be verified if it contains the DAG definition.
If you expect to run DAG spreading script within Airflow runtime, then in this particular use case I would advise you to look at PythonOperator, using it in the separate DAG to invoke and execute your custom generic Python code with guarantees scheduling it only once a time. You can check out this Stack thread with implementation details.

How Docker and Ansible fit together to implement Continuous Delivery/Continuous Deployment

I'm new to the configuration management and deployment tools. I have to implement a Continuous Delivery/Continuous Deployment tool for one of the most interesting projects I've ever put my hands on.
First of all, individually, I'm comfortable with AWS, I know what Ansible is, the logic behind it and its purpose. I do not have same level of understanding of Docker but I got the idea. I went through a lot of Internet resources, but I can't get the the big picture.
What I've been struggling is how they fit together. Using Ansible, I can manage my Infrastructure as Code; building EC2 instances, installing packages... I can even deploy a full application by pulling its code, modify config files and start web server. Docker is, itself, a tool that packages an application and ensures that it can be run wherever you deploy it.
My problems are:
How does Docker (or Ansible and Docker) extend the Continuous Integration process!?
Suppose we have a source code repository, the team members finish working on a feature and they push their work. Jenkins detects this, runs all the acceptance/unit/integration test suites and if they all passed, it declares it as a stable build. How Docker fits here? I mean when the team pushes their work, does Jenkins have to pull the Docker file source coded within the app, build the image of the application, start the container and run all the tests against it or it runs the tests the classic way and if all is good then it builds the Docker image from the Docker file and saves it in a private place?
Should Jenkins tag the final image using x.y.z for example!?
Docker containers configuration :
Suppose we have an image built by Jenkins stored somewhere, how to handle deploying the same image into different environments, and even, different configurations parameters ( Vhosts config, DB hosts, Queues URLs, S3 endpoints, etc...) What is the most flexible way to deal with this issue without breaking Docker principles? Are these configurations backed in the image when it gets build or when the container based on it is started, if so how are they injected?
Ansible and Docker:
Ansible provides a Docker module to manage Docker containers. Assuming I solved the problems mentioned above, when I want to deploy a new version x.t.z of my app, I tell Ansible to pull that image from where it was stored on, start the app container, so how to inject the configuration settings!? Does Ansible have to log in the Docker image, before it's running ( this sounds insane to me ) and use its Jinja2 templates the same way with a classic host!? If not, how is this handled?!
Excuse me if it was a long question or if I misspelled something, but this is my thinking out loud. I'm blocked for the past two weeks and I can't figure out the correct workflow. I want this to be a reference for future readers.
Please, it would very helpful to read your experiences and solutions because this looks like a common workflow.
I would like to answer in parts
How does Docker (or Ansible and Docker) extend the Continuous Integration process!?
Since docker images same everywhere, you use your docker images as if they are production images. Therefore, when somebody committed a code, you build your docker image. You run tests against it. When all tests pass, you tag that image accordingly. Since docker is fast, this is a feasible workflow.
Also docker changes are incremental; therefore, your images will have minimal impact on storage. Also when your tests fail, you may also choose to save that image too. In this way, developer will pull that image and investigate easily why your tests failed. Developer may choose to run tests in their machine too since docker images in jenkins and their machine are not different.
What this brings that all developers will have same environment, same version of all software since you decide which one will be used in docker images. I have come across to bugs that are due to differences between developer machines. For example in the same operating system, unicode settings may affect your code. But in docker images all developers will test against same settings, same version software.
Docker containers configuration :
If you are using a private repository, and you should use one, then configuration changes will not affect hard disk space much. Therefore except security configurations, such as db passwords, you can apply configuration changes to docker images(Baking the Configuration into the Container). Then you can use ansible to apply not-stored configurations to deployed images before/after startup using environment variables or Docker Volumes.
https://dantehranian.wordpress.com/2015/03/25/how-should-i-get-application-configuration-into-my-docker-containers/
Does Ansible have to log in the Docker image, before it's running (
this sounds insane to me ) and use its Jinja2 templates the same way
with a classic host!? If not, how is this handled?!
No, ansible will not log in the Docker image, but ansible with Jinja2 templates can be used to change dockerfile. You can change dockerfile with templates and can inject your configuration to different files. Tag your files accordingly and you have configured images to spin up.
Regarding your question about handling multiple environment configurations using the same Docker image, I have been planning on using a Service Discovery tool like Consul as a centralized config/property management tool. So, when you start your container up, you set an ENV var that tells it what application it is (appID), and what environment config it should use (ex: MyApplication:Dev) and it will pull its config from Consul at startup. I still have to investigate the security around Consul (as if we are storing DB connection credentials in there for example, how do we restrict who can query/update those values). I don't want to just use this for containers, but all apps in general. Another cool capability is to change the config value in Consul and have a hook back into your app to apply the changes immediately (maybe like a REST endpoint on your app to push changes down to and dynamically apply it). Of course your app has to be written to support this!
You might be interested in checking out Martin Fowler's blog articles on immutable infrastructure and on Phoenix servers.
Although not a complete solution, I have suggestions for two of your issues. Although they might not be perfect, these are the practices we are using in our workflow, and prove themselves so far.
Defining different environments - supposing you've written a different Ansible role for each environment you launch, we define an environment variable setting the environment we wish the container to belong to. We then download the suitable configuration file from an S3 bucket using the env variable set before into the container (which should be possible if you supply AWS creds or give your server an IAM role) and inject these parameters into the code when building it.
Ansible doesn't need to log into the docker app, but the solution is a bit tricky. I've tried two ways of tackling this problem, and both aren't ideal. The first one is to download the configuration file as part of the docker image command line, and build the app on container startup. While this solution works - it breaches the Docker philosophy and makes the image highly prone to build errors.
Another solution is pushing several images to your docker hub repo, and then pulling the appropriate image according to the environment at hand.
In a broader stroke, I've tried launching our app completely with Ansible and it was hell, many configuration steps are tricky and get trickier when you try to implement them as a playbook. When I switched to maintaining the severs alone with Ansible, and deploying the app itself with Docker things got a lot easier.

How does ElasticBeanStalk deploy your application version to instances?

I am currently using AWS ElasticBeanStalk and I was curious as to how (as in internally) it knows that when you fire up an instance (or it automatically does with scaling), to unpack the zip I uploaded as a version? Is there some enviroment setting that looks up my zip in my S3 bucket and then unpacks automatically for every instance running in that environment?
If so, could this be used to automate a task such as run an SQL query on boot-up (instance deployment) too? Are these automated tasks changeable or viewable at all?
Thanks
I don't know how beanstalk knows which version to download and unpack, but running a task on start-up is trivial. Check out cloud-init, a tool written by Ubuntu that's now packaged in Amazon Linux. It allows you to pass arbitrary shell scripts into the UserData section of the instance configuration, and those shell scripts will run on startup.
It's a great way to bootstrap instances on startup, which avoids the soul-sucking misery of managing AMIs.
A quick (possibly non-applicable) warning: If you're running a SQL query on a database that lives on the beanstalk AMI, you're pretty much guaranteed to lose your database at some point. Those machines are designed to be entirely transient. Do not put databases on them. See this answer for more details.
Since your goal seems to be to run custom configuration tasks, the answer is yes, there is a way to do that. You can define custom actions in an .ebextensions file packaged with your app. For example, you can configure a command to run every time a new machine is deployed:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html#linux-commands