Scheduling issues with Control-M

I am working on scheduling some jobs using Control-M. My scenario is as below:
I have the following jobs - Job 1, Job 2, Job 3 and Job 4. All of them insert into the same table. I have to schedule all four jobs to start at the same time. Since they insert into the same table, I am running into lock issues.
I cannot add a dependency between these jobs because I will be adding more jobs to this stream. Also, there are no logical dependencies between these jobs.
Also, all these jobs call the same script, but with different parameters.
Is there any way to handle this issue?

One way is to use the "Resources" properties for the jobs. If they all require the same resource, either exclusive or limited to a quantity of 1, they will run one at a time.

You should use a Control Resource, not Quantitative Resources.
In the Control Resources field, enter the name of the table being used and enable the Exclusive option. This parameter should be added to every job that can lock that table. You can leave Exclusive unselected for jobs that use the table but don't lock it.
Control Resource and Quantitative Resources are not the same.


PowerBI - Problem with parallel auto-update

My dataset consists of a dozen tables, each with its own clickhouse query. Some of the requests are quite heavy, but each of them is executed separately, without exceeding the limit on the resources used.
But when the dashboard is refreshed at the scheduled time, all these queries are executed simultaneously, which overwhelms the source and results in the error: DB::Exception: Memory limit (total) exceeded.
Anyone have any ideas how to ask PowerBI to execute requests sequentially (not simultaneously) with a scheduled update?
Maybe it's possible to add M-code with "sleep" functions? Or something like this:
if (
nothing updating now,
let Source = Odbc.Query(...) in Source
)
There is no direct way to do this in Power BI. The desktop does have a setting to enable/disable parallel loading of tables, but it doesn't work once you deploy to the service. The only option would be to use dataflows. You can then set a schedule to populate them, or set up a Power Automate flow, Logic App, or Data Factory pipeline to hit the API and call them in a specific order.
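If you go the dataflow route, a rough sketch of the "hit the API in some order" idea is below, using Python and the Power BI REST API's Refresh Dataflow endpoint. The workspace and dataflow IDs, the access token, and the fixed wait between refreshes are placeholders made up for illustration; a real script would poll the dataflow's refresh status instead of sleeping.

import time
import requests

# Assumptions: you already have an Azure AD access token with dataflow
# permissions, and you know the workspace (group) id and the dataflow ids
# you want to refresh one after another. All values below are placeholders.
ACCESS_TOKEN = "<aad-access-token>"
GROUP_ID = "<workspace-guid>"
DATAFLOW_IDS = ["<dataflow-1>", "<dataflow-2>"]  # refresh in this order

HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
BASE = "https://api.powerbi.com/v1.0/myorg"

for dataflow_id in DATAFLOW_IDS:
    # Trigger a refresh for one dataflow at a time.
    url = f"{BASE}/groups/{GROUP_ID}/dataflows/{dataflow_id}/refreshes"
    resp = requests.post(url, headers=HEADERS,
                         json={"notifyOption": "NoNotification"})
    resp.raise_for_status()
    # Crude pacing: wait before kicking off the next dataflow. Replace this
    # with proper status polling in anything beyond a quick test.
    time.sleep(600)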

Generate automatic records in a table in Django even if the system is not being used

Does anyone know a way to generate a record in a table without the system being used by the user?
I need to generate something like a notification or reminder, with data obtained from other tables - something similar to a report.
Thank you
To run periodic tasks, you will need some sort of task scheduler like Celery or Huey. With that in place, you can create and save instances of whatever model you have in mind from the task code, and the scheduler will repeat it periodically.
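For example, with Celery plus celery beat the whole thing could look roughly like the sketch below. The project, app, and Notification model names are made up for illustration; swap in your own models and schedule.

# celery_app.py -- a minimal sketch, assuming Celery is installed, the Django
# project is called "myproject", and a hypothetical Notification model lives
# in "myapp". All names are illustrative.
import os

from celery import Celery
from celery.schedules import crontab

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject")
app.config_from_object("django.conf:settings", namespace="CELERY")


@app.task
def create_daily_notification():
    # Imported here so Django's app registry is loaded when the worker runs.
    from myapp.models import Notification  # hypothetical model
    # Pull whatever data you need from other tables, then save it like any
    # other record -- no user interaction required.
    Notification.objects.create(message="Daily report is ready")


# Have celery beat trigger the task every day at 07:00.
app.conf.beat_schedule = {
    "daily-notification": {
        "task": "celery_app.create_daily_notification",
        "schedule": crontab(hour=7, minute=0),
    },
}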

Alternatives for Athena to query the data on S3

I have around 300 GB of data on S3. Let's say the data looks like:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, the resources are wasted while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.
Now there are roughly 1000 jobs which start together, and when they query Athena to read the data, the containers fail because they hit the Athena query limits.
Therefore, I would like to know what other possible ways there are to tackle this problem. Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, since it doesn't have query limitations? But it is expensive, and it takes a lot of time to warm up.
The other option would be to read the data on EMR and use Hive or Presto on top of it to query the data, but again it will reach the query limitation.
It would be great if someone could suggest better options to tackle this problem.
As I understand it, you simply send a query to the AWS Athena service, and after all the aggregation steps finish you retrieve the resulting CSV file from the S3 bucket where Athena saves results, so you end up with 1000 files (one for each job). The problem is the number of concurrent Athena queries, not the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries? I see Airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retry logic. Airflow even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending queries is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime

# A one-off DAG (no schedule) that submits a single query to Athena.
with DAG(dag_id='simple_athena_query',
         schedule_interval=None,
         start_date=datetime(2019, 5, 21)) as dag:

    run_query = AWSAthenaOperator(
        task_id='run_query',
        query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
        output_location='s3://my-bucket/my-path/',
        database='my_database'
    )
I use it for similar type of daily/weekly tasks (processing data with CTAS statements) which exceed limitation on a number of concurrent queries.
There are plenty of blog posts and documentation pages that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even set up an integration with Slack to send notifications when your queries terminate in either a success or a failure state.
However, the main drawback I am facing is that only 4-5 queries actually get executed at the same time, whereas all the others just idle.
One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
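For illustration, that retry logic inside each container could look roughly like the following boto3 sketch; the query, database, output location and backoff numbers are placeholders, not values from the question.

# Rough sketch: back off and retry when Athena rejects a query because the
# concurrency limit has been reached. All identifiers are placeholders.
import random
import time

import boto3

athena = boto3.client("athena")


def start_query_with_retry(query, database, output_location, max_attempts=8):
    for attempt in range(max_attempts):
        try:
            response = athena.start_query_execution(
                QueryString=query,
                QueryExecutionContext={"Database": database},
                ResultConfiguration={"OutputLocation": output_location},
            )
            return response["QueryExecutionId"]
        except athena.exceptions.TooManyRequestsException:
            # Exponential backoff with jitter before trying again.
            time.sleep(min(2 ** attempt, 60) + random.random())
    raise RuntimeError("Athena kept throttling the query, giving up")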
Another option would be to not do it as separate queries, but to try to bake everything together into fewer queries, or even a single one: either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. Whether this is possible is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day and you wouldn't use the cluster very much. Athena is a much better choice; you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
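Stripped of the Step Functions wiring, the keep-N-queries-running loop described above could look roughly like this; the queries, database and output location are placeholders, and in a real setup the state and the waiting would live in the Step Functions execution rather than a local loop.

# Sketch of the poll loop: keep at most MAX_RUNNING Athena queries in flight,
# starting new ones as old ones finish. Placeholders throughout.
import time

import boto3

athena = boto3.client("athena")
MAX_RUNNING = 20  # stay a bit under the default 25-query concurrency limit

pending = ["SELECT 1", "SELECT 2"]  # stand-ins for the real per-country/day queries
running = []  # QueryExecutionIds currently in flight

while pending or running:
    # Top up: start new queries while we are below the concurrency budget.
    while pending and len(running) < MAX_RUNNING:
        qid = athena.start_query_execution(
            QueryString=pending.pop(),
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-bucket/results/"},
        )["QueryExecutionId"]
        running.append(qid)

    # Drop the queries that have finished; report any that did not succeed.
    still_running = []
    for qid in running:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("QUEUED", "RUNNING"):
            still_running.append(qid)
        elif state != "SUCCEEDED":
            print(f"query {qid} finished in state {state}")
    running = still_running

    time.sleep(30)  # in Step Functions this would be a Wait state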
As far as I know, Redshift Spectrum and Athena cost the same. You shouldn't compare Redshift to Athena; they have different purposes. But first of all, I would think about addressing your data skew issue. Since you mentioned AWS EMR, I assume you use Spark. To deal with large and small partitions, you need to repartition your dataset by month, or some other evenly distributed value, or you can group by month and country. You get the idea.
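A rough PySpark illustration of that repartitioning idea is below; the column names (country, month) are assumed from the S3 path layout and would need to exist in, or be derived onto, the DataFrame.

# Minimal sketch: spread the work by (country, month) instead of country
# alone, so one huge country is split across many partitions while small
# countries share executors. Paths and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-repartition").getOrCreate()

# Read every country/month/day file under the bucket.
df = spark.read.csv("s3://Countries/*/*/*/*.csv", header=True)

# Assumes country and month are columns (or have been derived from the path).
balanced = df.repartition("country", "month")

# Stand-in for the real complex aggregation.
aggregated = balanced.groupBy("country", "month").count()
aggregated.write.mode("overwrite").parquet("s3://my-bucket/aggregated/")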
You can use redshift spectrum for this purpose. Yes, it is a bit costly but it is scalable and very good for performing complex aggregations.

How to create User Task only once a day

We want to collect data during the day and create a User Task once a day. How can that be done with Camunda? Is there a possibility to use process variables, or do we need to access our own database and mark the corresponding items as processed (as soon as we create the daily user task)?
Do we need to create these user tasks programmatically? (We are using an embedded Spring Boot Camunda instance.)
One very good option would be to use a Timer Start Event per the documentation here: https://docs.camunda.org/manual/7.10/reference/bpmn20/events/timer-events/#timer-start-event.
It seems that you may want to use that in conjunction with a Timer Intermediate Catching Event (https://docs.camunda.org/manual/7.10/reference/bpmn20/events/timer-events/#timer-intermediate-catching-event) in something like the following manner:
Start a process instance at a specific time in the morning with the Timer Start Event. Perhaps 6:30AM in your local time zone?
Execute certain steps to gather data, perhaps through external service invocations, etc.
At a specific time (in the afternoon?), create the User Task and present the data. The User Task could follow the Timer Intermediate Catching Event noted above.
I hope this helps!

Concurrent Queries, COPY and Connections in AWS Redshift

I am trying to understand the difference between concurrent connections and concurrent queries in Redshift. As per the documentation, we can make 500 concurrent connections to a Redshift cluster, but it says a maximum of 15 queries can be run at the same time in a cluster. Now what is the exact value?
How many queries can be in a running state in a cluster at the same time? If it is 15, does it include RETURNING state queries as well?
How many concurrent COPY statements can run in a cluster?
We are evaluating Redshift as our primary reporting data store. If we cannot run a large number of queries simultaneously it may be difficult for us to go with this model.
I think you have misread somewhere; the maximum number of concurrent queries is 50 per WLM queue. Refer to the thread below for the Amazon support response for more detail.
How many queries can be in a running state in a cluster at the same time? If it is 15, does it include RETURNING state queries as well?
At a time, a maximum of 50 queries can be running concurrently. Yes, it does include INSERT/UPDATE/DELETE and so on.
How many concurrent COPY statements can run in a cluster?
Ideally, you could go up to 50 concurrently, but COPY works a bit differently.
Amazon Redshift automatically loads in parallel from multiple data files.
If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, which is much slower and requires a VACUUM at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see Loading Data from Amazon S3.
Meaning, you can run concurrent COPY commands, but make sure there is only one COPY command at a time per table.
So practically, it doesn't depend only on the nodes in the cluster, but on the number of tables as well.
So if you have only 1 table and you want to execute 50 loads concurrently, it will effectively result in only 1 COPY running at a time.
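To make that concrete, a rough sketch of the one-COPY-per-table pattern from Python is below; the cluster endpoint, credentials, table name, S3 prefix and IAM role are all placeholders.

# Sketch: instead of many concurrent COPYs into the same table, issue one COPY
# pointing at a common S3 prefix; Redshift then loads the files under that
# prefix in parallel across slices. All connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="<password>",
)

copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/my_table/2019-06-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # one COPY per target table at a time
conn.close()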