Scheduling an Informatica workflow with a customized frequency

Hello Informatica admin/platform experts,
I have a workflow that I need to schedule for Monday through Friday plus Sunday. On all six days the job should run at specific times, 10 times a day, but the timings are not evenly spaced; they are predefined (9 AM, 11 AM, 1:30 PM, etc.), so the interval between runs is not uniform. Because of that, we had 10 different scheduling workflows, one per schedule/run, each triggering a shell script that uses the pmcmd command.
That looked a bit odd to me, so what I did was create a single workflow that triggers the pmcmd shell script, with a link between the Start task and the shell script task on which I specified a time condition, and I scheduled it to run every 30 minutes on Monday through Friday and Sunday.
So it runs 48 times a day but triggers the "actual" workflow only 10 times; the remaining 38 times it runs and does nothing.
One of my Informatica admin colleagues says that these 38 runs (which do nothing) still consume Informatica resources. I was fairly sure they do not, but as I am just an Informatica developer and not an expert, I thought I'd post here to check whether it is really true.
Thanks.
Regards
Raghav

Well... it does consume some resources. Each time a workflow starts, it performs quite a few operations on the Repository. It also allocates some memory on the Integration Service and creates a log file for the workflow, even if no sessions are executed at all.
So there is an impact. Multiply that by the number of workflows, times the number of executions, and there might be a problem.
Not to mention there are limits on the number of workflows that can be executed at the same time.
I don't know your platform and setup, but this does look like an area for improvement. A cron scheduler should help you a lot.
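For what it's worth, a minimal crontab sketch for this pattern might look like the following (the wrapper script path is a placeholder; it would call pmcmd startworkflow with your domain, Integration Service, user, folder, and workflow name):

    # Sunday plus Monday-Friday is day-of-week 0-5 in cron (Sunday = 0).
    # /path/to/run_workflow.sh is a hypothetical wrapper around
    # "pmcmd startworkflow -sv <IS> -d <domain> -u <user> -p <pwd> -f <folder> <wf_name>"
    0  9  * * 0-5  /path/to/run_workflow.sh
    0  11 * * 0-5  /path/to/run_workflow.sh
    30 13 * * 0-5  /path/to/run_workflow.sh
    # ...one entry each for the remaining predefined times

That way the Integration Service is only touched on the 10 real runs, and the 38 no-op executions disappear entirely.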

Related

AWS CloudWatch rule schedule has irregular intervals (when it shouldn't)

There is an Elastic Container Service cluster running an application internally referred to as Deltaload. It compares the data in an Oracle production database against a dev database in Amazon RDS and loads whatever is missing into RDS. A CloudWatch rule is set up to trigger this process every hour.
Now, for some reason, every 20-30 hours there is one interval of a different length. Usually the anomalous gap is ~25 min, but on other occasions it can be 80-90 min instead of 60. I could understand a difference of 1-2 minutes, but being off by 30 min from an hourly schedule sounds really problematic, especially given that a full run takes ~45 min. Does anyone have any ideas on what could be the reason for this? Or at least how I can figure out why it happens?
The interesting part is that this glitch in the schedule either breaks or fixes the Deltaload app. What I mean is: if it has been running successfully every hour for a whole day and then the short interval happens, it will then crash every hour for the next day until the next glitch arrives, after which it works again (the very same process, same container, same everything). It crashes because the connection to RDS times out. This "day of crashes, day of runs" pattern has been going on since early February.
I am not too proficient with AWS, and the Deltaload app is written in C#, which I don't know. The only thing I managed to do is increase the RDS connection timeout to 10 min, which did not fix the problem. The guy who wrote the app left the company a while ago and is unavailable, and there are no other developers on this project, as everyone was let go because of corona. So far, the best alternative I see is to just rewrite the whole thing in Python (which I know). If anyone has any other thoughts on how to understand/fix it, I'd greatly appreciate any input.
To restate my actual question: why does a CloudWatch rule fire at irregular intervals on what should be a regular schedule? And how can I prevent this from happening?
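As a starting point for investigating, one hedged sketch (assuming the AWS CLI; the rule name is a placeholder) is to confirm what schedule expression the rule actually carries and pull its Invocations metric to see exactly when it fired:

    # Check whether the rule uses rate(1 hour) or a cron() expression.
    aws events describe-rule --name deltaload-hourly

    # List when the rule actually fired, hour by hour.
    aws cloudwatch get-metric-statistics \
        --namespace AWS/Events \
        --metric-name Invocations \
        --dimensions Name=RuleName,Value=deltaload-hourly \
        --start-time 2020-04-01T00:00:00Z \
        --end-time 2020-04-03T00:00:00Z \
        --period 3600 \
        --statistics Sum

If the rule fires on time but the task starts late, the gap is on the ECS side rather than in CloudWatch.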

What is the best way to monitor and show the results of async jobs (like EMR and AWS Glue) that take 20-30 minutes to execute

I have a job in my program that takes a long time to execute, and I want to show the status of this job in my UI once it completes. I have found two solutions to this problem:
Have an API call execute at the end of the 30-minute job to update the status to complete. This is good because it can provide additional information about what happened in the job, but its drawback is that if something goes completely wrong, there's a chance the code that calls the API never runs, and the status never updates.
Continuously monitor the task once it has started: run a while loop that keeps checking whether the task is done. This approach almost always yields the correct status of the task, but often only exposes a high-level yes/no rather than the fine-grained execution details that might otherwise be available.
One thing I haven't implemented, though, which I think may be a good solution, is running both of these in tandem (see the sketch below): on success I get the details of the execution, and in case of total failure I still get that output from the monitoring side. What are the general principles for building this kind of monitoring for jobs that take a long time to process?
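For reference, the polling half of that tandem approach might look roughly like this (a sketch assuming an AWS Glue job and the AWS CLI; the job name and run ID are placeholders):

    #!/bin/sh
    # Poll a Glue job run until it reaches a terminal state.
    JOB_NAME="my-glue-job"            # placeholder
    RUN_ID="jr_0123456789abcdef"      # placeholder returned by start-job-run
    while true; do
        STATE=$(aws glue get-job-run \
            --job-name "$JOB_NAME" --run-id "$RUN_ID" \
            --query 'JobRun.JobRunState' --output text)
        case "$STATE" in
            SUCCEEDED|FAILED|STOPPED|TIMEOUT|ERROR) break ;;
        esac
        sleep 60    # poll once a minute; tune to taste
    done
    echo "Job finished with state: $STATE"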
Use AWS Step Functions as a serverless state machine. It supports interacting directly with a number of services: https://docs.aws.amazon.com/en_us/step-functions/latest/dg/connect-supported-services.html
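For example, a minimal sketch of a one-state machine that starts a Glue job and waits for it to finish (all names and ARNs below are placeholders):

    # The .sync suffix on the resource ARN makes Step Functions wait
    # for the Glue job run to complete before the state finishes.
    aws stepfunctions create-state-machine \
        --name job-monitor \
        --role-arn arn:aws:iam::123456789012:role/StepFunctionsGlueRole \
        --definition '{
            "StartAt": "RunGlueJob",
            "States": {
                "RunGlueJob": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::glue:startJobRun.sync",
                    "Parameters": { "JobName": "my-glue-job" },
                    "End": true
                }
            }
        }'

Executions of the state machine then give you start/end times, per-state status, and error output in one place, which covers both the success case and the total-failure case.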

How to reduce the time taken by a Glue ETL job (Spark) to actually start executing?

I want to start a Glue ETL job. The execution time itself is fair, but the time Glue takes to actually begin executing the job is too long.
I looked into various documentation and answers, but none of them gave me a solution. There was some explanation of the behavior (cold start) but no fix.
I expect the job to be up ASAP; it sometimes takes around 10 minutes to start a job that executes in 2 minutes.
Unfortunately, it's not possible right now. Glue uses EMR under the hood, and it takes some time to spin up a new cluster with the desired number of executors. As far as I know, they keep a pool of spare EMR clusters in the most common DPU configurations, so if you are lucky your job can grab one and start immediately; otherwise it has to wait.

Scheduler, is it too heavy?

I have a scheduled task that is too heavy, so my question is: could a too-heavy scheduled task bring down the ColdFusion server? Sometimes my scheduled task exceeds the loop time limit. In any case, I am looking for another way to do the same thing without it being so heavy.
Well, a scheduled task is really just an automated call to a normal CF page request. Which raises the question: if you manually bring up the scheduled task URL in a browser window, does it time out there as well?
Remember that a scheduled task is called and run by the server, which means the session, CGI, request, and form scope values can differ from those of an actual user. However, you can use the requestTimeout attribute of the CFSETTING tag to extend how long the page has to complete the task before it times out. The requestTimeout attribute takes a value in seconds; if the request has not completed after that many seconds, CF considers the page unresponsive.
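For example, a minimal sketch at the top of the scheduled-task page (600 is an arbitrary value; size it to your task):

    <!--- give this request up to 10 minutes before CF deems it unresponsive --->
    <cfsetting requestTimeout="600">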
However, it depends greatly on what your scheduled task is actually doing. There are all kinds of ways you could break the code into constituent parts for quicker processing. Figuring out what the loop is doing (and whether it really needs to do everything it is doing) is a good place to start.
There are a few things to consider.
Firstly, your task takes a while to run and currently times out. There are some things to investigate here:
Well, not an investigation as such, but you can set the requestTimeout for that request via <cfsetting>, as per #coldfusiondevshop's suggestion.
You could audit your code to see if it can be written differently so it doesn't take so long to run. Given the nature of the work that's generally done as a task, this might not be possible.
Depending on your ColdFusion version (please always tag your questions with the CF version, as well as just "ColdFusion"), you could leverage the enhanced scheduling in CF10 to split the task into smaller chunks and chain them together, making each part of the sum less likely to time out. This too is not always possible, but it is a consideration.
The other thing to think about is whether it might be worth having a separate CF instance running for your tasks. As well as your task timing out, it could also be slowing down processing for everyone else while it runs. Have you checked that? For a single task this is probably overkill, but it's good to bear in mind that tasks don't need to be run by the same CF instance(s) as the rest of the site, and indeed there's a compelling case not to do so if there are a lot of tasks that would strain a CF instance's resources.
In summary: increase your timeout via <cfsetting>, then audit your code to see if you can improve its performance. The other things I say are just "bonus point" suggestions.

Using any of the Amazon Web Services, how could I schedule something to happen 1 year from now?

I'd like to be able to create a "job" that will execute at an arbitrary time from now... let's say 1 year from now. I'm trying to come up with a stable, distributed system that doesn't rely on me maintaining a server and scheduling code. (Obviously, I'll have to maintain the servers that execute the job.)
I realize I could poll SimpleDB every few seconds and check whether there's anything that needs to be executed, but this seems very inefficient. Ideally I could create an Amazon SNS topic that would fire off at the appropriate time, but I don't think that's possible.
Alternatively, I could create a message in Amazon SQS that would not be visible for 1 year. After 1 year, it becomes visible and my polling code picks it up and executes it.
It would seem this is a topic, like Singletons or Inversion of Control, that PhDs have discussed and come up with best practices for, but I can't find the articles, if there are any.
Any ideas?
Cheers!
The easiest way for most people to do this would be to run an EC2 server with a cron job on it to trigger the action. However, the cost of running an EC2 server 24 hours a day for a year just to trigger an action would be around $170 at the cheapest (a t1.micro with a Heavy Utilization Reserved Instance). Plus, you have to monitor that server and recover from failures.
I have sketched out a different approach to running jobs on a schedule that uses AWS resources entirely. It's a bit more work, but it does not have the expense or maintenance issues of running an EC2 instance.
You can set up an Auto Scaling schedule (in cron format) to start an instance at some point in the future, or on a recurring schedule (e.g., nightly). When you set this up, you specify the job to be run in a user-data script in the launch configuration.
I've written out sample commands in the following article, along with special settings you need to take care of for this to work with Auto Scaling:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
With this approach, you only pay for the EC2 instance hours when the job is actually running and the server can shut itself down afterwards.
This wouldn't be a reasonable way to schedule tens of thousands of emails with an individual timer for each, but it can make a lot of sense for large, infrequent jobs (a few times a day to once per year).
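With the current AWS CLI, the one-off scheduling piece of this approach might look roughly like the following sketch (group name, action name, and date are placeholders; the launch configuration's user-data script does the actual work):

    # Scale a normally zero-sized Auto Scaling group up to one instance
    # at a specific future time; the instance runs the job from its
    # user-data script and can terminate itself when done.
    aws autoscaling put-scheduled-update-group-action \
        --auto-scaling-group-name yearly-job-asg \
        --scheduled-action-name run-yearly-job \
        --start-time 2026-01-01T09:00:00Z \
        --desired-capacity 1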
I think it really depends on what kind of job you want to execute in a year and whether that value (1 year) is actually hypothetical. There are many ways to schedule a task: Windows and Linux both offer a service for scheduling tasks, Task Scheduler and crontab respectively. In addition to those operating-system-specific solutions, you can use maintenance plans on MS SQL Server, and I'm sure many of the larger databases have similar features.
Without knowing more about what you plan on doing, it's kind of hard to suggest more alternatives, since many of the other solutions would be specific to the technologies and platforms you plan to use. If you provide more insight into what you're going to be doing with these tasks, I'd be more than happy to expand my answer to be more helpful.