Cloud Composer Parallelism issue DAG and Tasks

I have created a simple DAG using the Bash operator and the BigQuery operator. It has approximately 40 BigQuery tasks, and I want all 40 to execute in parallel, but only 10-15 tasks run in parallel at a time. I have tried different configuration parameters.
I also need to run 15-20 DAGs of the same kind, and I need all of them, and all tasks from all of them, to execute in parallel.
My DAG looks like this:
start >> bigquery task_1 >> end
start >> bigquery task_2 >> end
... ... ... ... ... ... ...
... ... ... ... ... ... ...
start >> bigquery task_n.. >> end
start and end are Bash operators.
My composer configuration as below:
Image version: composer-1.13.3-airflow-1.10.12
Python version: 3
Worker nodes:
Node count: 3
Disk size (GB): 50
Machine type: n1-standard-2
I have tried the default Airflow configuration as well as the custom configurations below, and I don't see much improvement.
custom configuration 1
parallelism = 300
dag_concurrency = 60
max_active_runs_per_dag = 15
enable_xcom_pickling = False
sql_alchemy_pool_recycle = 570
store_serialized_dags = False
min_serialized_dag_update_interval = 30
dag_concurrencymax_active_runs_per_dag = 60
custom configuration 2
parallelism = 230
dag_concurrency = 46
max_active_runs_per_dag = 15
enable_xcom_pickling = False
sql_alchemy_pool_recycle = 570
store_serialized_dags = False
min_serialized_dag_update_interval = 30
dag_concurrencymax_active_runs_per_dag = 46
Any suggestions would be highly appreciated.
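One way to reason about the settings above: the number of tasks from one DAG that can run at once is bounded by the smallest of several limits, not only by parallelism and dag_concurrency. The sketch below is plain Python arithmetic, not Airflow code. The worker_concurrency value of 5 and the pool size of 128 are hypothetical numbers chosen for illustration; they are not taken from the question.

```python
def effective_task_slots(parallelism, dag_concurrency, worker_count,
                         worker_concurrency, pool_slots):
    """Upper bound on tasks from a single DAG that can run at the same time."""
    # Each worker can only run worker_concurrency tasks at once.
    worker_slots = worker_count * worker_concurrency
    # The tightest limit wins.
    return min(parallelism, dag_concurrency, worker_slots, pool_slots)


# parallelism and dag_concurrency come from "custom configuration 1";
# worker_concurrency=5 and pool_slots=128 are assumed values.
limit = effective_task_slots(
    parallelism=300,
    dag_concurrency=60,
    worker_count=3,
    worker_concurrency=5,
    pool_slots=128,
)
print(limit)
```

If the worker-side limit is the smallest one in your environment, raising parallelism or dag_concurrency alone would not change how many tasks run together.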

Related

Boto3 and AWS RDS: properly wait for database creation from snapshot

I have the following code in my Lambda (Python and Boto3):
rds.restore_db_instance_from_db_snapshot(
    DBSnapshotIdentifier=snapshot_name,
    DBInstanceIdentifier=db_id,
    DBInstanceClass=rds_instance_class,
    VpcSecurityGroupIds=secgroup,
    DBSubnetGroupName=rds_subnet_groupname,
    MultiAZ=False,
    PubliclyAccessible=False,
    CopyTagsToSnapshot=True
)
waiter = rds.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier=db_id)
# some other operation that expects that DB is up and running.
The waiter was added as an attempt to properly wait for DB. However, it looks like the waiter times out.
What would be the correct waiter to use in this case?

Try setting waiter.config.delay and/or waiter.config.max_attempts.
waiter = rds.get_waiter('db_instance_available')
waiter.config.delay = 123  # this is in seconds
waiter.config.max_attempts = 123
waiter.wait(DBInstanceIdentifier=db_id)
OR
waiter = rds.get_waiter('db_instance_available')
waiter.wait(
    DBInstanceIdentifier=db_id,
    WaiterConfig={
        'Delay': 123,
        'MaxAttempts': 123
    }
)
WaiterConfig (dict) A dictionary that provides parameters to control
waiting behavior.
Delay (integer) The amount of time in seconds to wait between
attempts. Default: 30
MaxAttempts (integer) The maximum number of attempts to be made.
Default: 60
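To see how long a waiter will keep polling before it gives up, multiply the two values. The sketch below is plain Python arithmetic using the defaults quoted above (Delay 30, MaxAttempts 60). The two-hour target is a hypothetical figure for illustration, and the helper names are not part of boto3.

```python
import math


def max_wait_seconds(delay, max_attempts):
    """Rough upper bound on the time a waiter spends polling."""
    return delay * max_attempts


def attempts_needed(target_seconds, delay):
    """Smallest max_attempts that covers target_seconds at the given delay."""
    return math.ceil(target_seconds / delay)


default_wait = max_wait_seconds(delay=30, max_attempts=60)
print(default_wait / 60)  # minutes covered by the default settings

# Hypothetical target: allow up to two hours for the restore.
needed = attempts_needed(target_seconds=2 * 60 * 60, delay=30)
print(needed)
```

The resulting number could then be passed as 'MaxAttempts' in WaiterConfig.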

Could it be that your waiter is actually checking the existing db and seeing that it's available before the status can update on the previous command to restore the snapshot?

Timedeltasensor delaying from schedule interval

I have a job that runs at 13:30. Its first task takes almost 1 hour to complete, after which we need to wait 15 minutes. So I am using TimeDeltaSensor as below.
waitfor15min = TimeDeltaSensor(
    task_id='waitfor15min',
    delta=timedelta(minutes=15),
    dag=dag)
However, the logs show the sensor waiting for the schedule interval plus 15 minutes, as below:
[2020-11-05 20:36:27,013] {time_delta_sensor.py:45} INFO - Checking if the time (2020-11-05T13:45:00+00:00) has come
[2020-11-05 20:36:27,013] {base_sensor_operator.py:79} INFO - Success criteria met. Exiting.
[2020-11-05 20:36:30,655] {logging_mixin.py:95} INFO - [2020-11-05 20:36:30,655] {jobs.py:2612} INFO - Task exited with return code 0
How can I create a delay between tasks in the job?

You could use PythonOperator and write a function that simply waits 15 minutes. Here is an example of what a wait task could look like:
import time
from airflow.operators.python_operator import PythonOperator

def my_sleeping_function(random_base, **kwargs):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)

# Generate 5 sleeping tasks, sleeping from 0.0 to 0.4 seconds respectively
# (dag and run_this are assumed to be defined earlier in the DAG file)
for i in range(5):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
        provide_context=True,
        dag=dag,
    )
    run_this >> task
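The log line in the question shows the sensor checking for 13:45, not for 15 minutes after the first task finished. That is consistent with the target time being measured from the schedule rather than from the wall clock. Below is a minimal sketch of that arithmetic in plain Python. It is not Airflow's actual code, and the daily schedule and the execution date of 2020-11-04 13:30 are assumptions, since the question does not state them.

```python
from datetime import datetime, timedelta


def sensor_target_time(execution_date, schedule_interval, delta):
    """Time the sensor waits for: end of the schedule interval plus delta."""
    return execution_date + schedule_interval + delta


# Assumed values: a daily run whose interval starts on 2020-11-04 at 13:30.
target = sensor_target_time(
    execution_date=datetime(2020, 11, 4, 13, 30),
    schedule_interval=timedelta(days=1),
    delta=timedelta(minutes=15),
)
print(target.isoformat())
```

Because the first task finishes well after that target, the sensor succeeds on its first check, which matches the "Success criteria met" line in the log. A sleep-based task waits relative to when it starts instead.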

How to use boto3 waiters to take snapshot from big RDS instances

I started migrating my code to Boto3, and one nice addition I noticed is the waiters.
I want to create a snapshot from a DB instance and check for its availability before I resume with my code.
My approach is the following:
# Notice: Step : Check snapshot availability [1st account - Oregon]
print "--- Check snapshot availability [1st account - Oregon] ---"
new_snap = client1.describe_db_snapshots(DBSnapshotIdentifier=new_snapshot_name)['DBSnapshots'][0]
# print pprint.pprint(new_snap) #debug
waiter = client1.get_waiter('db_snapshot_completed')
print "Manual snapshot is -pending-"
sleep(60)
waiter.wait(
    DBSnapshotIdentifier = new_snapshot_name,
    IncludeShared = True,
    IncludePublic = False
)
print "OK. Manual snapshot is -available-"
However, the documentation says that it polls the status every 15 seconds, up to 40 times. That is 10 minutes, yet a rather big DB will need more than that.
How can I configure the waiter to allow for that?

Waiters have the configuration parameters 'delay' and 'max_attempts', like this:
waiter = rds_client.get_waiter('db_instance_available')
print( "waiter delay: " + str(waiter.config.delay) )
See waiter.py on GitHub.

You could do it without the waiter if you like.
From the documentation for that waiter:
Polls RDS.Client.describe_db_snapshots() every 15 seconds until a successful state is reached. An error is returned after 40 failed checks.
Basically that means it does the following:
RDS = boto3.client('rds')
RDS.describe_db_snapshots()
You can just run that but filter to your snapshot ID. Here is the syntax: http://boto3.readthedocs.io/en/latest/reference/services/rds.html#RDS.Client.describe_db_snapshots
response = client.describe_db_snapshots(
    DBInstanceIdentifier='string',
    DBSnapshotIdentifier='string',
    SnapshotType='string',
    Filters=[
        {
            'Name': 'string',
            'Values': [
                'string',
            ]
        },
    ],
    MaxRecords=123,
    Marker='string',
    IncludeShared=True|False,
    IncludePublic=True|False
)
This will end up looking something like this:
snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
Then you can just loop until that returns a snapshot which is available. Here is a very rough idea.
import boto3
import time
RDS = boto3.client('rds')
snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
# Poll every 15 seconds until the snapshot reports 'available'
while snapshot_description['DBSnapshots'][0]['Status'] != 'available':
    print("still waiting")
    time.sleep(15)
    snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
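The loop above never gives up, so it would hang if the snapshot ended in a failed state. Here is a minimal sketch of the same idea with an attempt limit. The helper wait_until and its parameters are hypothetical names, not part of boto3, and the status lookup is replaced by a stand-in so the example runs without AWS.

```python
import time


def wait_until(check, expected, delay=15, max_attempts=40, sleep=time.sleep):
    """Call check() until it returns expected; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if check() == expected:
            return attempt
        if attempt < max_attempts:
            sleep(delay)  # wait before the next poll
    raise TimeoutError("gave up after %d attempts" % max_attempts)


# Stand-in for the real status lookup: reports 'creating' twice, then 'available'.
statuses = iter(["creating", "creating", "available"])
attempts = wait_until(lambda: next(statuses), "available", sleep=lambda s: None)
print(attempts)
```

With boto3, check could be a function that returns RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')['DBSnapshots'][0]['Status'], with delay and max_attempts raised to suit a large database.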

I think the other answer alluded to this solution but here it is expressly.
[snip]
...
# Create your waiter
waiter_db_snapshot = client1.get_waiter('db_snapshot_completed')
# Increase the max number of tries as appropriate
waiter_db_snapshot.config.max_attempts = 120
# Add a 60 second delay between attempts
waiter_db_snapshot.config.delay = 60
print "Manual snapshot is -pending-"
....
[snip]

Is it possible to avoid the 60 seconds limit in urllib2.urlopen with GAE?

I'm requesting a file of around 14 MB from a slow server with urllib2.urlopen. It takes more than 60 seconds to get the data, and I'm getting the error:
Deadline exceeded while waiting for HTTP response from URL:
http://bigfile.zip?type=CSV
Here is my code:
class CronChargeBT(webapp2.RequestHandler):
    def get(self):
        taskqueue.add(queue_name = 'optimized-queue', url='/cronChargeBTB')

class CronChargeBTB(webapp2.RequestHandler):
    def post(self):
        url = "http://bigfile.zip?type=CSV"
        url_request = urllib2.Request(url)
        url_request.add_header('Accept-encoding', 'gzip')
        urlfetch.set_default_fetch_deadline(300)
        response = urllib2.urlopen(url_request, timeout=300)
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        # ...work with the data inside the file...
I created a cron task that calls CronChargeBT. Here is the cron.yaml:
- description: cargar BlueTomato
  url: /cronChargeBT
  target: charge
  schedule: every wed,sun 01:00
It creates a new task and inserts it into a queue. Here is the queue configuration:
- name: optimized-queue
  rate: 40/s
  bucket_size: 60
  max_concurrent_requests: 10
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 10
    max_backoff_seconds: 200
Of course, timeout=300 isn't working because of the 60-second limit in GAE, but I thought I could avoid it by using a task. Does anyone know how I can get the data in the file while avoiding this timeout?
Thanks a lot!

Cron jobs are limited to a 10-minute deadline, not 60 seconds. If your download fails, perhaps just retry? Does the download work if you download it from your computer? There's nothing you can do on GAE if the server you are downloading from is too slow or unstable.
Edit: According to https://cloud.google.com/appengine/docs/java/outbound-requests#request_timeouts, there is a maximum deadline of 60 seconds for cron job requests. Therefore, you can't get around it.

Flume roll settings not working

Edit*: Here is the full config file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1
tier1.sources.source1.kafka.consumer.timeout.ms = 20000000
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.channels.channel1.brokerList= ip.address:9092
tier1.channels.channel1.topic= test1
tier1.channels.channel1.zookeeperConnect=ip.address:2181
tier1.channels.channel1.parseAsFlumeEvent=false
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /user/flume/
tier1.sinks.sink1.hdfs.rollInterval = 5000
tier1.sinks.sink1.hdfs.rollSize = 5000
tier1.sinks.sink1.hdfs.rollCount = 1000
tier1.sinks.sink1.hdfs.idleTimeout= 10
tier1.sinks.sink1.hdfs.maxOpenFiles=1
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
I didn't have idleTimeout and maxOpenFiles until recently, so it wasn't working even with the default values for those two options.
My question is about using Flume to aggregate Kafka data. Currently, Flume creates a new file every second while reading in streaming data. These are my settings:
tier1.sinks.sink1.hdfs.rollInterval = 500 (should be 500 seconds)
tier1.sinks.sink1.hdfs.rollSize = 5000 (should be bytes)
tier1.sinks.sink1.hdfs.rollCount = 1000 (number of events)
The one setting I'm not completely sure about is rollCount, so here is some additional info:
I'm getting 80 bytes/second. Some of my files are 80 bytes with 2 messages, and some are 160 bytes with 4 messages. So it's not rolling based on time or size, which suggests it may be count, but I don't see why such small messages would register as 1000 events.
Thank you for the help!

Could the rollInterval be milliseconds? I think I may have had this issue before.
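A quick arithmetic check may help narrow this down. The sketch below is plain Python, not Flume code. It assumes the units stated in the question (seconds, bytes, events), assumes the first threshold reached closes the file, and treats a value of 0 as "disabled". The rate of 2 events per second is inferred from the "80 bytes with 2 messages" figure and is an assumption.

```python
def first_roll_trigger(roll_interval_s, roll_size_bytes, roll_count_events,
                       bytes_per_s, events_per_s):
    """Return (setting name, seconds) for the threshold that is reached first."""
    candidates = {}
    if roll_interval_s > 0:
        candidates["rollInterval"] = float(roll_interval_s)
    if roll_size_bytes > 0:
        candidates["rollSize"] = roll_size_bytes / bytes_per_s
    if roll_count_events > 0:
        candidates["rollCount"] = roll_count_events / events_per_s
    trigger = min(candidates, key=candidates.get)
    return trigger, candidates[trigger]


# Settings and data rate as described in the question.
trigger, seconds = first_roll_trigger(
    roll_interval_s=500,
    roll_size_bytes=5000,
    roll_count_events=1000,
    bytes_per_s=80,
    events_per_s=2,
)
print(trigger, seconds)
```

Under these assumptions no threshold is reached within one second, so a new file every second would point to something other than these three values being applied as written, for example a different unit or a setting that is not being picked up.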
