GCP Composer: Run Python Script in another GCS bucket

I'm new to Airflow, and I'm trying to run a Python script that reads data from BigQuery, does some preprocessing, and exports a table back to BigQuery. This is the DAG I have:
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

YESTERDAY = datetime.now() - timedelta(days=1)

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': YESTERDAY,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'max_tries': 0,
}

with DAG(
    dag_id='my_code',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:
    import_data = BashOperator(
        task_id='daily_task',
        bash_command='python gs://project_id/folder1/python_script.py'
    )
This gives a 'No such file or directory' error. I did not set up the environment in Composer, so I'm not sure if it requires specific credentials. I tried storing the script in the dags folder, but then it wasn't able to access the BigQuery tables.
I have two questions:
How do I properly define the location of the python script within another GCS bucket? Should the gs:// location work if proper credentials are applied? Or do I necessarily have to store the scripts in a folder within the dags folder?
How do I provide the proper credentials (like a login ID and password) within the DAG, in case that is all that's needed to solve the issue?
I handwrote the code since the original is on a work laptop and I cannot copy it. Let me know if there are any errors. Thank you!

To solve your issue, I propose a solution which, in my opinion, is easier to manage.
Whenever possible it is better to keep the Python scripts within Composer's bucket.
Copy your Python script into the Composer bucket's DAG folder, either with a separate process outside of Composer (gcloud) or directly in the DAG. If you want to do that in the DAG, you can check this link.
Use a PythonOperator that invokes your Python script's logic inside the DAG.
The service account used by Composer needs the right privileges to read and write data to BigQuery. If you copy the Python scripts directly in the DAG, the service account also needs the privileges to download files from GCS in the second project.
import airflow
from airflow.operators.python import PythonOperator

from your_script import your_method_with_bq_logic

with airflow.DAG(
        'your_dag',
        default_args=your_args,
        schedule_interval=None) as dag:

    bq_processing = PythonOperator(
        task_id='bq_processing',
        python_callable=your_method_with_bq_logic
    )

    bq_processing
You can import the Python script's main method in the DAG code because the script exists in the DAG folder.
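If the script currently lives in a bucket outside Composer's own, the in-DAG copy mentioned above could be done with the Google provider's GCS-to-GCS transfer operator. A rough sketch, assuming the google provider package is installed (bucket and object names below are placeholders):

from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

copy_script = GCSToGCSOperator(
    task_id='copy_script',
    source_bucket='project_id',                 # bucket that currently holds the script
    source_object='folder1/python_script.py',
    destination_bucket='your-composer-bucket',  # Composer environment's own bucket
    destination_object='dags/your_script.py',
)

Keep in mind that the module must already be present in the DAG folder when the DAG file is parsed, so a one-off copy (gcloud/gsutil) is usually the simpler route.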

Related

python script path compute engine for cloud composer bashoperator

I am working on GCP for the first time and I am trying to execute a python script on a Compute Engine instance using a BashOperator. I stored my python script in the bucket where the composer script is located, but when I try to run the BashOperator it throws a 'file not found' error. Can I know where I should place my python script so that I can execute it on the Compute Engine instance?
bash_task = bash_operator.BashOperator(
    task_id='script_execution',
    bash_command='gcloud compute ssh --project '+PROJECT_ID+' --zone '+REGION+' '+GCE_INSTANCE+ '--command python3 script.py',
    dag=dag)

python3: can't open file '/home/airflow/gcs/dags/script.py': [Errno 2] No such file or directory
Solution 1:
Instead of executing a Python script on a separate Compute Engine VM instance from Cloud Composer, you can directly execute Python code in Composer with the PythonOperator:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from time import sleep
from datetime import datetime

def my_func(*op_args):
    print(op_args)
    # your Python script and logic here

with DAG('python_dag', description='Python DAG', schedule_interval='*/5 * * * *', start_date=datetime(2018, 11, 1), catchup=False) as dag:
    dummy_task = DummyOperator(task_id='dummy_task', retries=3)
    python_task = PythonOperator(task_id='python_task', python_callable=my_func, op_args=['one', 'two', 'three'])

    dummy_task >> python_task
Solution 2:
Use the SSHOperator in Cloud Composer; this topic can help:
ssh launch script VM Composer
In this case the script is located on the VM.
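As a rough sketch of that approach, assuming an SSH connection to the VM has already been configured in Airflow and the ssh provider is installed (the connection ID and script path are placeholders):

from airflow.providers.ssh.operators.ssh import SSHOperator

run_remote_script = SSHOperator(
    task_id='run_remote_script',
    ssh_conn_id='my_vm_ssh_conn',            # hypothetical Airflow connection to the VM
    command='python3 /home/user/script.py',  # script already present on the VM
)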
Solution 3:
You can also think about rewriting your Python script with Beam and running it on Dataflow (Python SDK), if it's not too complicated to rewrite.
Dataflow has the advantage of being serverless, and Airflow provides built-in operators to launch Dataflow jobs.
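For illustration only, a hedged sketch of launching such a job from the DAG with the Google provider's Dataflow operator (newer provider versions route this through the Beam operators; the file path, job name and bucket are placeholders):

from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

start_beam_job = DataflowCreatePythonJobOperator(
    task_id='start_beam_job',
    py_file='gs://my-bucket/pipelines/my_pipeline.py',  # hypothetical pipeline file on GCS
    job_name='my-beam-job',
    location='us-central1',
    options={'temp_location': 'gs://my-bucket/temp'},
    py_interpreter='python3',
)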

AWS SageMaker Processing job

I was able to run simple python code in a Notebook instance to read and write csv files from/to an S3 bucket. Now I want to create a SageMaker Processing job to run the same code without any input/output data configuration. I have downloaded the same code and pushed the image to an ECR repository. How do I run this code in a Processing job so that it is able to install the 's3fs' module? I just want to run python code in Processing jobs without giving any input/output algorithms/configuration. I used boto3 to read/write from the s3 bucket. With the current code the job is stuck "In Progress".
Downloaded code (in VS Code):
!pip install s3fs
import boto3
import pandas as pd
from io import StringIO

client = boto3.client('s3')
path = 's3://weatheranalysis/weatherset.csv'
df = pd.read_csv(path)
df.head()

filename = 'newdata.csv'
bucketName = 'weatheranalysis'
csv_buffer = StringIO()
df.to_csv(csv_buffer)
client = boto3.client('s3')
response = client.put_object(
    ACL='private',
    Body=csv_buffer.getvalue(),
    Bucket=bucketName,
    Key=filename
)
You can do so by defining a 'sagemaker-processing-container' image and setting up the ScriptProcessor from the SageMaker Python SDK to run your existing python script in preprocessing.py.
A simple example can be found here:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri='image_uri',
                                   role='role_arn',
                                   instance_count=1,
                                   instance_type='ml.m5.xlarge')

script_processor.run(code='preprocessing.py',
                     inputs=[ProcessingInput(
                         source='s3://path/to/my/input-data.csv',
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                              ProcessingOutput(source='/opt/ml/processing/output/validation'),
                              ProcessingOutput(source='/opt/ml/processing/output/test')])
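Regarding the s3fs part of the question: the !pip install s3fs line only works in a notebook. Inside a Processing container you would normally bake the dependency into the ECR image; as a hedged fallback (assuming the container has internet access, which is an assumption rather than something stated above), the script itself can install it at startup:

# Hedged workaround: install s3fs at runtime before importing it.
import subprocess
import sys

subprocess.check_call([sys.executable, '-m', 'pip', 'install', 's3fs'])

import s3fs  # now importable inside the processing container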

Scheduled Django management commands using Zappa & Lambda

I've got my Django site working on Lambda using Zappa. It was very simple. I'm now trying to find out how to set up scheduled Django management commands. From what I've read, the workaround is to create Python functions that execute the management commands and then schedule those functions to run using the Zappa settings file. Is this still the right method, since the help manual doesn't say anything about it?
At the time of writing there is an open Zappa issue about this.
symroe came up with this solution which seems to work nicely:
class Runner:
    def __getattr__(self, attr):
        from django.core.management import call_command
        return lambda: call_command(attr)

import sys
sys.modules[__name__] = Runner()
This allows you to specify any Django management command in your zappa_settings.json file without further code modifications. That bit looks like this, where zappa_schedule.py is the name of the file containing the above code and publish_scheduled_pages is a registered management command:
"events": [{
"function": "zappa_schedule.publish_scheduled_pages",
"expression": "rate(1 hour)"
}],
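For context, publish_scheduled_pages stands for an ordinary registered management command; a minimal, purely hypothetical one (app name and logic are placeholders, not part of the original answer) would look like:

# yourapp/management/commands/publish_scheduled_pages.py (hypothetical)
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Publish pages whose scheduled publication time has passed."

    def handle(self, *args, **options):
        # placeholder for the real publishing logic
        self.stdout.write("Published scheduled pages.")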

issue with importing dependencies while running Dataflow from google cloud composer

I'm running Dataflow from Google Cloud Composer. The Dataflow script contains some non-standard dependencies like zeep and googleads,
which need to be installed on the Dataflow worker nodes, so I packaged them with setup.py. When I try to run this in a DAG, Composer validates the Dataflow files and complains about 'No module named zeep, googleads'. So I created a PythonVirtualenvOperator, installed all the non-standard dependencies required, and tried to run the Dataflow job, and it still complained about importing zeep and googleads.
Here is my codebase:
PULL_DATA = PythonVirtualenvOperator(
    task_id=PROCESS_TASK_ID,
    python_callable=execute_dataflow,
    op_kwargs={
        'main': 'main.py',
        'project': PROJECT,
        'temp_location': 'gs://bucket/temp',
        'setup_file': 'setup.py',
        'max_num_workers': 2,
        'output': 'gs://bucket/output',
        'project_id': PROJECT_ID},
    requirements=['google-cloud-storage==1.10.0', 'zeep==3.2.0',
                  'argparse==1.4.0', 'google-cloud-kms==0.2.1',
                  'googleads==15.0.2', 'dill'],
    python_version='2.7',
    use_dill=True,
    system_site_packages=True,
    on_failure_callback=on_failure_handler,
    on_success_callback=on_success_handler,
    dag='my-dag')
and my python callable code:
def execute_dataflow(**kwargs):
    import subprocess
    TEMPLATED_COMMAND = """
    python main.py \
        --runner DataflowRunner \
        --project {project} \
        --region us-central1 \
        --temp_location {temp_location} \
        --setup_file {setup_file} \
        --output {output} \
        --project_id {project_id}
    """.format(**kwargs)
    process = subprocess.Popen(['/bin/bash', '-c', TEMPLATED_COMMAND])
    process.wait()
    return process.returncode
My main.py file
import zeep
import googleads
{Apache-beam-code to construct dataflow pipeline}
Any suggestions?
My job has a requirements.txt. Rather than using the --setup_file option as yours does, it specifies the following:
--requirements_file prod_requirements.txt
This tells Dataflow to install the libraries listed in the requirements file prior to running the job.
Reference: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
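If you build the pipeline options in code rather than on the command line, the same flag can be passed programmatically; a rough sketch (project, bucket and file name are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical options; prod_requirements.txt sits next to the pipeline code.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/temp',
    '--requirements_file=prod_requirements.txt',
])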
Using a sample Dataflow pipeline script that imports googleads and zeep, I set up a test Composer environment. The DAG is just like yours, and I got the same error.
Then I make a couple of changes, to make sure the dependencies can be found on the worker machines.
In the DAG, I use a plain PythonOperator, not a PythonVirtualenvOperator.
I have my dataflow pipeline and setup file (main.py and setup.py) in a Google Cloud Storage bucket, so Composer can find them.
The setup file has a list of requirements where I need to have e.g. zeep and googleads. I adapted a sample setup file from here, changing this:
REQUIRED_PACKAGES = [
    'google-cloud-storage==1.10.0', 'zeep==3.2.0',
    'argparse==1.4.0', 'google-cloud-kms==0.2.1',
    'googleads==15.0.2', 'dill'
]

setuptools.setup(
    name='Imports test',
    version='1',
    description='Imports test workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        # Command class instantiated and run during pip install scenarios.
        'build': build,
        'CustomCommands': CustomCommands,
    }
)
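(The build and CustomCommands names above come from the Beam sample setup file this answer adapted; roughly, and only as a sketch of that sample rather than code from the answer, they look like this:)

import subprocess
import setuptools
from distutils.command.build import build as _build

class build(_build):
    # A build command that also triggers the custom commands below.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

CUSTOM_COMMANDS = [['echo', 'Custom command worked!']]

class CustomCommands(setuptools.Command):
    # Runs arbitrary shell commands (e.g. apt-get) during pip install on the workers.
    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)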
My DAG is
with models.DAG('composer_sample',
                schedule_interval=datetime.timedelta(days=1),
                default_args=default_dag_args) as dag:

    PULL_DATA = PythonOperator(
        task_id='PULL_DATA',
        python_callable=execute_dataflow,
        op_kwargs={
            'main': '/home/airflow/gcs/data/main.py',
            'project': PROJECT,
            'temp_location': 'gs://dataflow-imports-test/temp',
            'setup_file': '/home/airflow/gcs/data/setup.py',
            'max_num_workers': 2,
            'output': 'gs://dataflow-imports-test/output',
            'project_id': PROJECT_ID})

    PULL_DATA
with no changes to the Python callable. However, with this configuration I still get the error.
As a next step, in the Google Cloud Platform (GCP) console, I go to "Composer" through the navigation menu and click on the environment's name. On the "PyPI packages" tab, I add zeep and googleads and click "Submit". It takes a while to update the environment, but it works.
After this step, my pipeline is able to import the dependencies and run successfully. I also tried running the DAG with the dependencies listed in the GCP console but not in the requirements of setup.py, and the workflow breaks again, although in different places. So make sure to indicate them in both places.
You need to install the libraries in your Cloud Composer environment (check out this link). There is also a way to do it outside the console (for example with the gcloud CLI), but I find these steps easier:
Open your environments page
Select the actual environment where your Composer is running
Navigate to the PyPI Packages tab
Click edit
Manually add each line of your requirements.txt
Save
You might get an error if the version you provided for a library is too old, so check the logs and update the numbers, as needed.

reading files in google cloud machine learning

I tried to run tensorflow-wavenet on the Google Cloud ML Engine with gcloud ml-engine jobs submit training, but the cloud job crashed when it was trying to read the json configuration file:
with open(args.wavenet_params, 'r') as f:
    wavenet_params = json.load(f)
args.wavenet_params is simply a file path to a json file which I uploaded to the Google Cloud Storage bucket. The file path looks like this: gs://BUCKET_NAME/FILE_PATH.json.
I double-checked that the file path is correct, and I'm sure that this part is responsible for the crash since I commented out everything else.
The crash log file doesn't give much information about what has happened:
Module raised an exception for failing to call a subprocess Command '['python', '-m', u'gcwavenet.train', u'--data_dir', u'gs://wavenet-test-data/VCTK-Corpus-Small/', u'--logdir_root', u'gs://wavenet-test-data//gcwavenet10/logs']' returned non-zero exit status 1.
I replaced wavenet_params = json.load(f) by f.close() and I still get the same result.
Everything works when I run it locally with gcloud ml-engine local train.
I think the problem is with reading files with gcloud ml-engine in general or that I can't access the google cloud bucket from within a python file with gs://BUCKET_NAME/FILE_PATH.
Python's open function cannot read files from GCS. You will need to use a library capable of doing so. TensorFlow includes one such library:
import json

import tensorflow as tf
from tensorflow.python.lib.io import file_io

with file_io.FileIO(args.wavenet_params, 'r') as f:
    wavenet_params = json.load(f)
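In more recent TensorFlow versions the same facility is exposed as tf.io.gfile, so an equivalent form (assuming TF 1.13+ or 2.x) would be:

import json
import tensorflow as tf

with tf.io.gfile.GFile(args.wavenet_params, 'r') as f:
    wavenet_params = json.load(f)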