For a project, I have to create a Dataproc cluster that uses one of the outdated image versions (for example, 1.3.94-debian10) that contain the Apache Log4j 2 vulnerabilities. The goal is to trigger the related finding (DATAPROC_IMAGE_OUTDATED) in order to check how Security Command Center (SCC) works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message: ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster. This makes sense, since it protects the cluster.
I did some research and discovered that I will have to create a custom image with that version and generate the cluster from it. The thing is, I have tried reading the documentation and looking for tutorials, but I still can't understand how to get started or how to run the generate_custom_image.py script, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you
I have an on-premise environment running Airflow v2.2.0 and wish to migrate all the workflows in this instance to a Cloud Composer instance. While doing this migration, some of the operators used in the on-premise environment do not work when the same workflow is run on Cloud Composer 1 (Airflow 2.1.4).
Below is a single task from such a workflow:
from airflow.providers.apache.hive.operators.hive import HiveOperator

hive_task = HiveOperator(
    hql="./scripts/hive-script.sql",
    task_id="survey_data_aggregator",
    hive_cli_conn_id="hive_server_conn",
    dag=data_aggregator,
    hiveconfs={"input_path": "{{ params['input_path'] }}",
               "output_path": "{{ params['output_path'] }}"},
)
Execution of the workflow results in the following error: [Errno 2] No such file or directory: 'beeline'
Having faced this error before, I know it is due to the worker nodes not having the beeline binary on their PATH. However, I do not want to SSH into every worker instance and update its PATH variable.
Upon further research, I found that when I replace HiveOperator with DataProcHiveOperator and update the arguments accordingly, the workflow works as expected. This too is an unacceptable solution, as manually editing each workflow script is not a practical workaround. Additionally, there might be more operators requiring similar manual intervention, which would further increase the manual effort needed for the migration.
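For reference, the working replacement looked roughly like the sketch below. The GCS path, cluster name, and region are placeholders, and parameter names differ between google provider versions (newer providers expose this as DataprocSubmitHiveJobOperator), so treat it as illustrative rather than exact.

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitHiveJobOperator

hive_task = DataprocSubmitHiveJobOperator(
    task_id="survey_data_aggregator",
    query_uri="gs://my-bucket/scripts/hive-script.sql",  # placeholder GCS location of the script
    variables={"input_path": "{{ params['input_path'] }}",
               "output_path": "{{ params['output_path'] }}"},
    cluster_name="my-dataproc-cluster",  # placeholder
    region="us-east1",                   # placeholder
    dag=data_aggregator,
)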
What is the optimal way of handling situations such as these without having to amend workflow scripts manually? Furthermore, is there official documentation on a Google-recommended way to migrate Apache Airflow from on-premise instances to Cloud Composer? I am unable to find a reference to it.
EDIT 1:
I found out that if a task requires 3rd-party binaries, the KubernetesPodOperator can be used to spin up a pod from a Docker image that contains all the required binaries and dependencies.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

kubernetes_min_pod = KubernetesPodOperator(
    task_id='pod-ex-minimum',
    name='pod-ex-minimum',
    cmds=['echo', '"hello world"'],
    namespace='default',
    image='gcr.io/gcp-runtimes/ubuntu_18_0_4',
)
However, this pod is killed as soon as the command specified in cmds finishes executing. Is there a way to specify that all succeeding tasks be run inside this pod?
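As far as I can tell, the most that can be done within a single task is to chain several shell commands inside one pod, roughly like this (sketch, reusing the placeholder image above); each task still gets its own pod:

chained_pod = KubernetesPodOperator(
    task_id='pod-chained-commands',
    name='pod-chained-commands',
    namespace='default',
    image='gcr.io/gcp-runtimes/ubuntu_18_0_4',
    cmds=['bash', '-c'],
    arguments=['echo "step 1" && echo "step 2"'],  # commands run back to back in the same container
)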
EDIT 2:
There's been some confusion about my question. Please let me clarify:
I understand that if my DAG needs a binary to work, I can download that binary onto the Airflow worker pod and update the PATH variable. I am keeping this option as a last resort for now.
I am basically looking for a way to spin up a Cloud Composer instance whose environment already has all the prerequisite binaries. Ideally, this would work if I could specify a custom image for the Airflow worker pod, in which I could preinstall all my required dependencies, but I have been unable to find a way to do this.
I'm writing wrappers to create, start, and stop GCP services. I can use the gcloud command in a shell script or the Google Cloud API client libraries in a Python script.
I want to know which one is recommended and preferred. Are there any limitations of one over the other?
I am fluent in both shell and Python, so language is not a problem.
One thing I would like is to see the output of the executions and extract some information from it.
E.g., if I create a Dataproc cluster, I would like to know the master node and worker node names.
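On the Python side, something along these lines is what I have in mind (untested sketch using the google-cloud-dataproc client library; project, region, and cluster names are placeholders):

from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

region = "us-east1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = client.get_cluster(
    project_id="my-project", region=region, cluster_name="my-cluster"  # placeholders
)
print("master nodes:", list(cluster.config.master_config.instance_names))
print("worker nodes:", list(cluster.config.worker_config.instance_names))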
Thanks in advance.
In my Django project I would like to be able to delete certain entries in the database automatically if they're too old. I can write a function that checks the creation_date and, if it's too old, deletes the entry, but I want this function to run automatically at regular intervals. Is it possible to do this?
Thanks
This is what cron is for.
You will be better off reading this section of the Django docs http://docs.djangoproject.com/en/1.2/howto/custom-management-commands/#howto-custom-management-commands
Then you can create your function as a Django management command and use it in conjunction with cron on *nix (or scheduled tasks on Windows) to run it on a schedule.
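For example, a management command along these lines would do it (sketch; the app, model, and field names are made up, the 30-day cutoff is arbitrary, and django.utils.timezone assumes a reasonably recent Django, otherwise use datetime.datetime.now()):

# myapp/management/commands/delete_old_entries.py  (hypothetical app and model)
from datetime import timedelta

from django.core.management.base import BaseCommand
from django.utils import timezone

from myapp.models import Entry


class Command(BaseCommand):
    help = "Delete entries whose creation_date is older than 30 days"

    def handle(self, *args, **options):
        cutoff = timezone.now() - timedelta(days=30)
        stale = Entry.objects.filter(creation_date__lt=cutoff)
        count = stale.count()
        stale.delete()
        self.stdout.write("Deleted %d old entries" % count)

# A crontab entry such as the following then runs it nightly:
# 0 3 * * * /path/to/python /path/to/manage.py delete_old_entries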
See this for a good intro guide to cron http://www.unixgeeks.org/security/newbie/unix/cron-1.html
What you require is a cron job.
A cron job is a task run by cron, a time-based job scheduler. Most web hosting companies provide this feature, which lets you run a service or a script at a time of your choosing. Most Unix-based OSes have this feature.
You would get more direct help by asking this question on serverfault.com, the sister site of Stack Overflow.
How do people deploy/version control cronjobs to production? I'm more curious about conventions/standards people use than any particular solution, but I happen to be using git for revision control, and the cronjob is running a python/django script.
If you are using Fabric for deployment, you could add a function that edits your crontab.
from fabric.api import run  # Fabric 1.x

def add_cronjob():
    run('crontab -l > /tmp/crondump')
    run('echo "@daily /path/to/dostuff.sh 2> /dev/null" >> /tmp/crondump')
    run('crontab /tmp/crondump')
This would append a job to your crontab (disclaimer: totally untested and not very idempotent; a slightly more defensive sketch follows after the steps below).
Save the crontab to a tempfile.
Append a line to the tmpfile.
Write the crontab back.
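The more defensive variant mentioned above (still an untested sketch, with a hypothetical script path) only appends the entry if an identical line is not already present:

from fabric.api import run  # Fabric 1.x

CRON_LINE = '@daily /path/to/dostuff.sh 2> /dev/null'  # hypothetical job

def add_cronjob_idempotent():
    # Dump the current crontab, or start from an empty file if none exists yet.
    run('crontab -l > /tmp/crondump || true')
    # Append the line only if an identical line is not already there.
    run('grep -qxF "{0}" /tmp/crondump || echo "{0}" >> /tmp/crondump'.format(CRON_LINE))
    # Install the edited crontab.
    run('crontab /tmp/crondump')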
This is probably not exactly what you want to do, but along those lines you could think about checking the crontab into git and overwriting it on the server with every deploy (if there's a dedicated user for your project).
Using Fabric, I prefer to keep a pristine version of my crontab locally; that way I know exactly what is on production and can easily edit entries in addition to adding them.
The Fabric script I use looks something like this (some code redacted, e.g. taking care of backups):
from fabric.api import put, sudo  # Fabric 1.x

def deploy_crontab():
    put('crontab', '/tmp/crontab')
    sudo('crontab < /tmp/crontab')
You can also take a look at:
http://django-fab-deploy.readthedocs.org/en/0.7.5/_modules/fab_deploy/crontab.html#crontab_update
The django-fab-deploy module has a number of convenient scripts, including crontab_set and crontab_update.
You can probably use something like CFEngine/Chef for deployment (they can deploy everything, including cron jobs).
However, if you are asking this question, it could be that you have many production servers, each running a large number of scheduled jobs.
If this is the case, you probably want a tool that can not only deploy jobs, but also track success and failure, let you easily look at logs from the last run and at run statistics, and let you easily change the schedule for many jobs and servers at once (due to planned maintenance, etc.).
I use a commercial tool called "UC4". I don't really recommend it, so I hope you can find a better program that can solve the same problem. I'm just saying that administration of jobs doesn't end when you deploy them.
There are really three options for manually deploying a crontab if you cannot connect your system to a configuration management system like CFEngine/Puppet.
You could simply use crontab -u user -e, but you run the risk of someone making an error in their copy/paste.
You could also copy the file into the cron directory, but there is no syntax checking for the file, and on Linux you must run touch /var/spool/cron in order for crond to pick up the changes.
Note: everyone will forget the touch command at some point.
In my experience, the following method is my favorite manual way of deploying a crontab:
diff /var/spool/cron/<user> /var/tmp/<user>.new
crontab -u <user> /var/tmp/<user>.new
I think the method above is the best because you don't run the risk of copy/paste errors, which helps you maintain consistency with your version-controlled file. It performs syntax checking of the cron tasks inside the file, and you won't need to run the touch command as you would if you simply copied the file.
Having your project under version control, including your crontab.txt, is what I prefer. Then, with Fabric, it is as simple as this:
from fabric.api import task, run  # Fabric 1.x

@task
def crontab():
    run('crontab deployment/crontab.txt')
This will install the contents of deployment/crontab.txt as the crontab of the user you connect to the server as. If you don't have your complete project on the server, you'd want to put the crontab file there first, for example as sketched below.
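A variant of that (equally an untested sketch) which pushes the file to the server before installing it:

from fabric.api import task, put, run  # Fabric 1.x

@task
def crontab():
    put('deployment/crontab.txt', '/tmp/crontab.txt')
    run('crontab /tmp/crontab.txt')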
If you're using Django, take a look at the jobs system from django-command-extensions.
The benefits are that you can keep your jobs inside your project structure, with version control, write everything in Python, and configure crontab only once.
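In the current django-extensions package (the successor to django-command-extensions), a job looks roughly like the sketch below; the app name and job body are hypothetical, and a single cron entry then runs the daily jobs:

# myapp/jobs/daily/cleanup_job.py  (hypothetical app and job)
from django_extensions.management.jobs import DailyJob


class Job(DailyJob):
    help = "Daily cleanup of stale entries"

    def execute(self):
        # Whatever housekeeping your project needs goes here.
        pass

# One crontab entry runs every registered daily job:
# 0 3 * * * /path/to/python /path/to/manage.py runjobs daily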
I use Buildout to manage my Django projects. With Buildout, I use z3c.recipe.usercrontab to install cron jobs in deploy or update.
You said:
I'm more curious about conventions/standards people use than any particular solution
But, to be fair, the particular solution will depend on your environment, and there is no universal, elegant silver bullet. Given that you happen to be using Python/Django, I recommend Celery. It is an asynchronous task queue for Python which integrates nicely with Django. And, on top of the features it provides as an asynchronous task queue, it also has specific features for periodic tasks.
I have personally used the django-celery-beat integration; it integrates perfectly with Django settings and behaves correctly in distributed environments. If your periodic tasks are related to Django stuff, I strongly recommend taking a look at Celery. I started using it only for certain asynchronous mailing and ended up using it for a lot of asynchronous tasks, plus periodic sanity checks and other web application maintenance stuff.
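To give an idea of the shape of it, a periodic task can be wired up roughly like this (sketch with made-up project, app, and task names; with django-celery-beat the schedule can instead be managed from the Django admin rather than hard-coded):

# myproject/celery.py  (hypothetical project layout)
from celery import Celery
from celery.schedules import crontab

app = Celery("myproject")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# Run the cleanup task every night at 03:00.
app.conf.beat_schedule = {
    "delete-old-entries-nightly": {
        "task": "myapp.tasks.delete_old_entries",
        "schedule": crontab(hour=3, minute=0),
    },
}

# myapp/tasks.py
from celery import shared_task

@shared_task
def delete_old_entries():
    # Periodic maintenance logic goes here.
    pass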