I use Google Cloud Composer. I have a DAG that uses the pandas.read_csv() function to read a .csv.gz file. The DAG just keeps trying without showing any errors. Here is the Airflow log:
*** Reading remote log from gs://us-central1-data-airflo-dxxxxx-bucket/logs/youtubetv_gcpbucket_to_bq_daily_v2_csv/file_transfer_gcp_to_bq/2018-11-04T20:00:00/1.log.
[2018-11-05 21:03:58,123] {cli.py:374} INFO - Running on host airflow-worker-77846bb966-vgrbz
[2018-11-05 21:03:58,239] {models.py:1196} INFO - Dependencies all met for <TaskInstance: youtubetv_gcpbucket_to_bq_daily_v2_csv.file_transfer_gcp_to_bq 2018-11-04 20:00:00 [queued]>
[2018-11-05 21:03:58,297] {models.py:1196} INFO - Dependencies all met for <TaskInstance: youtubetv_gcpbucket_to_bq_daily_v2_csv.file_transfer_gcp_to_bq 2018-11-04 20:00:00 [queued]>
[2018-11-05 21:03:58,298] {models.py:1406} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of
--------------------------------------------------------------------------------
[2018-11-05 21:03:58,337] {models.py:1427} INFO - Executing <Task(BranchPythonOperator): file_transfer_gcp_to_bq> on 2018-11-04 20:00:00
[2018-11-05 21:03:58,338] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run youtubetv_gcpbucket_to_bq_daily_v2_csv file_transfer_gcp_to_bq 2018-11-04T20:00:00 --job_id 15096 --raw -sd DAGS_FOLDER/dags/testdags/youtubetv_gcp_to_bq_v2.py']
The Python code in the DAG:
from datetime import datetime,timedelta
from airflow import DAG
from airflow import models
import os
import io,logging, sys
import pandas as pd
from io import BytesIO, StringIO
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.bash_operator import BashOperator
#GCP
from google.cloud import storage
import google.cloud
from google.cloud import bigquery
from google.oauth2 import service_account
from airflow.operators.slack_operator import SlackAPIPostOperator
from airflow.models import Connection
from airflow.utils.db import provide_session
from airflow.utils.trigger_rule import TriggerRule
def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    df_chunks = pd.read_csv(fileobj, compression='gzip', memory_map=True, chunksize=1000000)  # returns TextFileReader
    print("done readCSV")
    return df_chunks
DAG:
file_transfer_gcp_to_bq = BranchPythonOperator(
    task_id='file_transfer_gcp_to_bq',
    provide_context=True,
    python_callable=readCSV,
    op_kwargs={'checked_date': '2018-11-03', 'file_name': 'daily_events_xxxxx_partner_report.csv.gz'}
)
The DAG runs successfully on my local Airflow installation.
def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    df = pd.read_csv(fileobj, compression='gzip', memory_map=True)
    return df
I tested get_byte_fileobj and it works as a standalone function.
Based on this discussion in the Airflow / Google Composer group, it is a known issue.
One of the possible reasons is exhausting the Composer resources (in my case, memory).
I had a similar issue recently.
In my case it was because the Kubernetes workers were overloaded.
You can watch the worker performance on the Kubernetes dashboard to see whether your case is a cluster-overloading issue.
If it is, you can try lowering the Airflow configuration value celeryd_concurrency to reduce the parallelism in each worker and see whether the cluster load goes down.
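If memory is indeed the bottleneck, another option is to consume the chunks inside the task and return only a small value instead of the whole TextFileReader. A minimal sketch, assuming the get_byte_fileobj helper and the bucket/project constants from the question, with a placeholder downstream task_id:
def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    total_rows = 0
    # Iterate over bounded chunks so only ~1M rows are held in memory at a time
    for chunk in pd.read_csv(fileobj, compression='gzip', chunksize=1000000):
        total_rows += len(chunk)  # process / load each chunk here instead of returning it
    print("done readCSV, rows:", total_rows)
    # A BranchPythonOperator callable should return the task_id to follow,
    # not a DataFrame (large return values are pushed to XCom)
    return 'next_task_id'  # placeholder task_id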
Related
I'm new to using the AWS managed Airflow service (MWAA). I want to use Airflow to start an EC2 instance, make sure it's running, and then do some further work on the instance.
So far I have the DAG below, which is basically a copy of this.
This, however, fails every time, and I'm not adept enough to know why.
import os
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.providers.amazon.aws.operators.ec2 import EC2StartInstanceOperator, EC2StopInstanceOperator
from airflow.providers.amazon.aws.sensors.ec2 import EC2InstanceStateSensor

INSTANCE_ID = os.getenv("INSTANCE_ID", "instance-id")

with DAG(
    dag_id='example_ec2',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    tags=['example'],
    catchup=False,
) as dag:
    # [START howto_operator_ec2_start_instance]
    start_instance = EC2StartInstanceOperator(
        task_id="ec2_start_instance",
        instance_id=INSTANCE_ID,
    )
    # [END howto_operator_ec2_start_instance]

    # [START howto_sensor_ec2_instance_state]
    instance_state = EC2InstanceStateSensor(
        task_id="ec2_instance_state",
        instance_id=INSTANCE_ID,
        target_state="running",
    )
    # [END howto_sensor_ec2_instance_state]

    chain(start_instance, instance_state)
Just going to answer my own question in case anyone else finds this.
The line below doesn't work, for whatever reason:
INSTANCE_ID = os.getenv("INSTANCE_ID", "instance-id")
The alternative is to do the following:
from airflow.models import Variable # import Variable
INSTANCE_ID = Variable.get("INSTANCE_ID") # get INSTANCE_ID
Make sure you have added INSTANCE_ID as a variable by going to Admin -> Variables.
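For reference, Variable.get also accepts a default, so the DAG can still parse if the variable has not been created yet; a minimal sketch (the instance id below is just a placeholder):
from airflow.models import Variable

# Falls back to the placeholder value if INSTANCE_ID isn't defined under Admin -> Variables
INSTANCE_ID = Variable.get("INSTANCE_ID", default_var="i-0123456789abcdef0")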
I submit Spark jobs to EMR via the AWS managed Airflow service (MWAA). The jobs were running fine with MWAA version 1.10.12. Recently, AWS released a newer version of MWAA, 2.0.2. I created a new environment with this version and tried submitting the same job to EMR, but it failed with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at org.apache.hadoop.fs.Path.initialize(Path.java:263)
at org.apache.hadoop.fs.Path.<init>(Path.java:221)
at org.apache.hadoop.fs.Path.<init>(Path.java:129)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2096)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2078)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:192)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$5(SparkSubmit.scala:364)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:902)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at java.net.URI$Parser.fail(URI.java:2847)
at java.net.URI$Parser.failExpecting(URI.java:2853)
at java.net.URI$Parser.parse(URI.java:3056)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:260)
... 27 more
Command exiting with ret '1'
The spark-submit command looks like this:
spark-submit --deploy-mode cluster
--master yarn
--queue low
--jars s3://bucket-name/jars/mysql-connector-java-8.0.20.jar,s3://bucket-name/jars/postgresql-42.1.1.jar,s3://bucket-name/jars/delta-core_2.12-1.0.0.jar
--py-files s3://bucket-name/dependencies/spark.py, s3://bucket-name/dependencies/helper_functions.py
--files s3://bucket-name/configs/spark_config.json
s3://bucket-name/jobs/data_processor.py [command-line-args]
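In the DAG, this command is handed to EMR as a step via EmrAddStepsOperator, roughly like the sketch below (the step name and structure here are illustrative placeholders rather than my exact definition; each spark-submit token is its own element of Args, and the comma-separated file lists contain no spaces):
SPARK_STEPS = [
    {
        "Name": "data_processor",  # placeholder step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                "--queue", "low",
                "--jars", "s3://bucket-name/jars/mysql-connector-java-8.0.20.jar,s3://bucket-name/jars/postgresql-42.1.1.jar,s3://bucket-name/jars/delta-core_2.12-1.0.0.jar",
                "--py-files", "s3://bucket-name/dependencies/spark.py,s3://bucket-name/dependencies/helper_functions.py",
                "--files", "s3://bucket-name/configs/spark_config.json",
                "s3://bucket-name/jobs/data_processor.py",
                # job-specific command-line args go here
            ],
        },
    },
]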
The job submission failed in under 10 seconds. Hence, the YARN application ID did not get created.
What I tried to resolve the error:
I added the Amazon-related packages to requirements.txt:
apache-airflow[mysql]==2.0.2
pycryptodome==3.9.9
apache-airflow-providers-amazon==1.3.0
I changed the import statements from:
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
to
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow import DAG
from airflow.models import Variable
I also changed the URI scheme to s3n and s3a.
I checked the official documentation and blogs on MWAA as well as Airflow 2.0.2 and made the above changes, but nothing has worked so far. Any help resolving this error would be appreciated. Thanks in advance.
I am trying to use the airflow.providers.amazon.aws.operators.s3_list S3ListOperator to list files in an S3 bucket in my AWS account with the DAG operator below:
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default'
)
I have configured my Extra Connection details in the form of: {"aws_access_key_id": "<MY_ACCESS_KEY>", "aws_secret_access_key": "<MY_SECRET_KEY>"}
When I run my Airflow job, it appears to execute fine and my task status is Success. Here is the log output:
[2021-04-27 11:44:50,009] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=s3_default
[2021-04-27 11:44:50,013] {base_aws.py:170} INFO - Credentials retrieved from extra_config
[2021-04-27 11:44:50,013] {base_aws.py:84} INFO - Creating session with aws_access_key_id=<MY_ACCESS_KEY> region_name=None
[2021-04-27 11:44:50,027] {base_aws.py:157} INFO - role_arn is None
[2021-04-27 11:44:50,661] {taskinstance.py:1185} INFO - Marking task as SUCCESS. dag_id=two_step, task_id=list_files_in_bucket, execution_date=20210427T184422, start_date=20210427T184439, end_date=20210427T184450
[2021-04-27 11:44:50,676] {taskinstance.py:1246} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-04-27 11:44:50,700] {local_task_job.py:146} INFO - Task exited with return code 0
Is there anything I can do to print the files in my bucket to Logs?
TIA
This code is enough and you don't need to use the print function. Just check the corresponding log, then go to XCom, and the returned list is there.
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='ob-air-pre',
    prefix='data/',
    delimiter='/',
    aws_conn_id='aws'
)
The result from executing S3ListOperator is an XCom object that is stored in the Airflow database after the task instance has completed.
You need to declare another operator to feed in the results from the S3ListOperator and print them out.
For example, in Airflow 2.0.0 and up you can use the TaskFlow API:
from airflow.models import DAG
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

@dag.task(task_id="print_objects")
def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
    dag=dag
)

print_objects(list_bucket.output)
In older versions,
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    dag=dag,
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
)

print_objects_in_bucket = PythonOperator(
    dag=dag,
    task_id='print_objects_in_bucket',
    python_callable=print_objects,
    op_args=("{{ ti.xcom_pull(task_ids='list_files_in_bucket') }}",)
)

list_bucket >> print_objects_in_bucket
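In Airflow 2.0+ you can also pass the operator's output directly instead of the Jinja string, which hands the callable the actual Python list rather than its string rendering; a sketch reusing the same names:
print_objects_in_bucket = PythonOperator(
    dag=dag,
    task_id='print_objects_in_bucket',
    python_callable=print_objects,
    op_args=[list_bucket.output],  # XComArg, resolved to the returned list at runtime
)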
I can run a single file as a Dataflow job in Cloud Composer, but when I run it as a package it fails.
pipeline_jobs/
-- __init__.py
-- run.py (main file)
-- setup.py
-- data_pipeline/
----- __init__.py
----- tasks.py
----- transform.py
----- util.py
I'm getting this error:
WARNING - File "/tmp/dataflowd232f-run.py", line 14, in <module>
{gcp_dataflow_hook.py:120} WARNING - from data_pipeline.tasks import task
WARNING - ImportError: No module named data_pipeline.tasks.
This is the DAG configuration:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.strptime("2017-11-01", "%Y-%m-%d"),
    'py_options': [],
    'dataflow_default_options': {
        'start-date': '20171101',
        'end-date': '20171101',
        'project': '<project-id>',
        'region': '<location>',
        'temp_location': 'gs://<bucket>/flow/tmp',
        'staging_location': 'gs://<bucket>/flow/staging',
        'setup_file': 'gs://<bucket>/dags/pipeline_jobs/setup.py',
        'runner': 'DataFlowRunner',
        'job_name': 'job_name_lookup',
        'task-id': 'run_pipeline'
    },
}

dag = DAG(
    dag_id='pipeline_01',
    default_args=default_args,
    max_active_runs=1,
    concurrency=1
)

task_1 = DataFlowPythonOperator(
    py_file='gs://<bucket>/dags/pipeline_jobs/run.py',
    gcp_conn_id='google_cloud_default',
    task_id='run_job',
    dag=dag)
I tried putting run.py into the dags folder but I'm still getting the same error.
Any suggestions would be really helpful.
I tried doing this as well:
from pipeline_jobs.data_pipeline.tasks import task
but still the same issue.
Try putting the entire pipeline_jobs/ directory in the dags folder, following this instruction, and refer to the Dataflow py file as /home/airflow/gcs/dags/pipeline_jobs/run.py.
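For example, a minimal sketch of the operator once the package lives under the dags folder (the other arguments stay as in the question):
task_1 = DataFlowPythonOperator(
    # Local path on the Composer workers; the dags bucket is mounted at /home/airflow/gcs/dags
    py_file='/home/airflow/gcs/dags/pipeline_jobs/run.py',
    gcp_conn_id='google_cloud_default',
    task_id='run_job',
    dag=dag)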
I have two files.
test.py
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext

class Connection():
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("Remote_Spark_Program - Leschi Plans")
    conf.set('spark.executor.instances', 1)
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    print('all done.')

con = Connection()
test_test.py
from test import Connection
sparkConnect = Connection()
When I run test.py the connection is made successfully, but with test_test.py it gives:
raise KeyError(key)
KeyError: 'SPARK_HOME'
A KeyError arises if SPARK_HOME is not found or is invalid, so it's better to add it to your ~/.bashrc, then check and reload it in your code. Add this at the top of your test.py:
import os
import sys

# Create a variable for our root path and check it before using it
SPARK_HOME = os.environ.get('SPARK_HOME', None)
if SPARK_HOME is None:
    raise EnvironmentError("SPARK_HOME is not set; add it to ~/.bashrc and reload your shell")

# Add the PySpark/py4j libraries to the Python path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))

# Make sure the submit args end with pyspark-shell so the gateway can start
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Import pyspark only after the path and submit args are in place
import pyspark
from pyspark import SparkContext, SparkConf, SQLContext
Also add this at the end of your ~/.bashrc file.
Command: vim ~/.bashrc if you are using any Linux-based OS.
# needed for Apache Spark
export SPARK_HOME="/opt/spark"
export IPYTHON="1"
export PYSPARK_PYTHON="/usr/bin/python3"
export PYSPARK_DRIVER_PYTHON="ipython3"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"
export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
export CLASSPATH="$CLASSPATH:/opt/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar"
Note:
In the bashrc above, I have set SPARK_HOME to /opt/spark; use the location where you keep your Spark folder (the one downloaded from the website).
Also, I'm using python3; you can change it to python in the bashrc if you are using a Python 2.x version.
I was using IPython for easy testing at runtime, e.g. loading the data once and testing the code many times. If you are using a plain old text editor, let me know and I will update the bashrc accordingly.