AWS CloudWatch rule schedule has irregular intervals (when it shouldn't)

There is an Elastic Container Service (ECS) cluster running an application internally referred to as Deltaload. It compares the data in an Oracle production database against a dev database in Amazon RDS and loads whatever is missing into RDS. A CloudWatch rule is set up to trigger this process every hour.
Now, for some reason, every 20-30 hours there is one interval of a different length. Usually it is a ~25 min gap, but on other occasions it can be 80-90 min instead of 60. I could understand a difference of 1-2 minutes, but being off by 30 min on an hourly schedule sounds really problematic, especially given that a full run takes ~45 min. Does anyone have any idea what the reason could be? Or at least how I can figure out why this happens?
The interesting part is that this glitch in the schedule either breaks or fixes the Deltaload app. What I mean is: if it has been running successfully every hour for a whole day and then the short interval happens, it will then crash every hour for the next day until the next glitch arrives, after which it works again (the very same process, same container, same everything). It crashes because the connection to RDS times out. This 'day of crashes, day of runs' pattern has been going on since early February. I am not too proficient with AWS, and the Deltaload app is written in C#, which I don't know. The only thing I managed to do was increase the RDS connection timeout to 10 min, which did not fix the problem. The guy who wrote the app left the company a while ago and is unavailable, and there are no other developers on this project, as everyone was let go because of corona. So far, the best alternative I see is to rewrite the whole thing in Python (which I do know). If anyone has any other thoughts on how to understand or fix this, I'd greatly appreciate any input.
To restate my actual question: why does a CloudWatch rule fire at irregular intervals on a regular schedule, and how can I prevent this from happening?
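For anyone who wants to check the actual firing times themselves, here is roughly how I have been pulling the rule's invocation timestamps to compute the gaps. The gap calculation is plain Python; the boto3 call is only a sketch (the rule name 'deltaload-hourly' is a placeholder, not the real name):

```python
from datetime import datetime, timedelta

def interval_gaps(timestamps):
    """Given invocation timestamps, return the gaps between consecutive
    invocations in minutes, so irregular intervals stand out."""
    ts = sorted(timestamps)
    return [round((b - a).total_seconds() / 60) for a, b in zip(ts, ts[1:])]

# Sketch of pulling the real timestamps (assumes boto3 is configured and
# the rule is named 'deltaload-hourly' -- both are assumptions):
# cw = boto3.client('cloudwatch')
# resp = cw.get_metric_statistics(
#     Namespace='AWS/Events', MetricName='Invocations',
#     Dimensions=[{'Name': 'RuleName', 'Value': 'deltaload-hourly'}],
#     StartTime=datetime.utcnow() - timedelta(days=3),
#     EndTime=datetime.utcnow(), Period=300, Statistics=['Sum'])
# gaps = interval_gaps(p['Timestamp'] for p in resp['Datapoints']
#                      if p['Sum'] > 0)
```

Anything in the resulting list that isn't ~60 is one of the glitches.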

Related

scheduling an informatica workflow with a customized frequency

Hello Dear Informatica admin/platform experts,
I have a workflow that I need to schedule Monday-Friday and Sunday. On all 6 days the job should run at specific times, say 10 times a day, but the timing is not uniform: it runs at predefined times (9 AM, 11 AM, 1:30 PM, etc.), so the gaps between runs vary. Because of that, we had 10 different scheduling workflows, one for each run, each triggering a shell script that uses the pmcmd command.
That looked a bit weird to me, so what I did was create a single workflow that triggers the pmcmd shell script, with a link between the Start task and the shell script on which I specified a time condition, and scheduled it to run Monday-Friday and Sunday every 30 minutes.
So what happens is, it runs 48 times a day but actually triggers the "actual" workflow only 10 times; the remaining 38 times it runs but does nothing.
One of my Informatica admin colleagues says that running it those 38 extra times (which do nothing) still consumes Informatica resources. I was fairly sure it does not, but as I am just an Informatica developer and not an expert, I thought of posting it here to check whether it is really true.
Thanks.
Regards
Raghav
Well... it does consume some resources. Each time a workflow starts, it performs quite a few operations on the Repository. It also allocates some memory on the Integration Service and creates a log file for the workflow, even if no sessions are executed at all.
So there is an impact. Multiply that by the number of workflows, times the number of executions - and there might be a problem.
Not to mention there are limitations on the number of workflows that can be executed at the same time.
I don't know your platform and setup, but this does look like a field for improvement. A cron scheduler should help you a lot.
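For example, a crontab along these lines would fire pmcmd only at your predefined times, with no idle runs at all (the times and script path here are made up - substitute your own 10 entries):

```cron
# m  h   dom mon dow   command       (dow 0-5 = Sunday through Friday)
0    9   *   *   0-5   /opt/scripts/run_pmcmd.sh
0    11  *   *   0-5   /opt/scripts/run_pmcmd.sh
30   13  *   *   0-5   /opt/scripts/run_pmcmd.sh
```

One line per scheduled run replaces both the 10 workflows and the every-30-minutes polling.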

DynamoDB on-demand mode suddenly stops working

I have a table that is incrementally populated with a lambda function every hour. The write capacity metric is full of predictable spikes and throttling was normally avoided by relying on the burst capacity.
The first three loads after turning on-demand mode on kept working. After that, it stopped loading new entries into the table and began to time out (from ~10 seconds up to the current limit of 4 minutes). The lambda function was not modified at all.
Does anyone know why this might be happening?
EDIT: I just see timeouts in the logs. (Screenshots attached: logs before failure, logs after failure, and errors and availability %.)
Since you are using Lambda to perform the incremental writes, this issue is more than likely on the Lambda side, and that is where I would start looking. Do you have CloudWatch logs to look through? If you cannot find anything there, open a case with AWS support.
Unless this was recently fixed, there is a known bug in Lambda where you can get a series of timeouts. We encountered it on a project I worked on: a Lambda would just start up and sit there doing nothing, much like yours.
So, like Kirk, I'd guess the problem is with the Lambda, not DynamoDB.
At the time there was no fix. As a workaround, we had another Lambda checking the one that suffered from failures and rerunning it. Not sure if there are other solutions. Maybe deleting everything and setting it back up again (with your fingers crossed :))? That should be easy enough if everything is in CloudFormation.
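For what it's worth, our watchdog boiled down to something like the sketch below. The detection logic is plain Python; the boto3 wiring is commented out because the function and log-group names ('incremental-loader') are hypothetical and depend entirely on your setup:

```python
def needs_rerun(log_events):
    """Return True if the monitored Lambda's recent log events contain
    a Lambda timeout message."""
    return any('Task timed out' in e.get('message', '')
               for e in log_events)

# Hedged wiring -- assumes boto3 is available (it is inside Lambda) and
# that these resource names exist in your account:
# import time, boto3
# logs = boto3.client('logs')
# resp = logs.filter_log_events(
#     logGroupName='/aws/lambda/incremental-loader',
#     filterPattern='"Task timed out"',
#     startTime=int((time.time() - 3600) * 1000))
# if needs_rerun(resp.get('events', [])):
#     boto3.client('lambda').invoke(FunctionName='incremental-loader',
#                                   InvocationType='Event')
```

Run the watchdog itself on a schedule (e.g., a CloudWatch rule every hour, shortly after the loader's own schedule).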

Google Cloud Functions warmup time

Do we know what the warmup/cold-start time is for a function on the new (still beta as of now) Google Cloud Functions? Or the timeout before it cools down again?
I've been trying Azure Functions in the per-use mode, and it's been ridiculously bad - it seemed inconsistent, but I've seen 30-second cold starts, and it seems to be about 5 minutes of no use before it cools down again.
I'm assuming Google's functions have the same issue, but I don't see even preliminary documentation on those time periods. The usual answer seems to be a "ping" every X minutes to keep it alive, but knowing what that X should be would make a difference in billing.
There is an interesting blog post by Mikhail Shilkov, published in August 2018. At that time, based on his experiments, the minimum alive period observed for Google Cloud Functions (GCF) was 3 minutes:
As a consequence, using X = 2 should be enough to keep a single function warm and ready to serve a request. However, in that case, you should ask yourself whether a serverless solution such as GCF is still the right fit.
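If you want to measure the cold-start penalty for your own function, a rough sketch follows. The URL is a placeholder, and the `fetch` hook exists only so the timing logic can be exercised without a deployed function:

```python
import time
import urllib.request

def measure_latency(url, fetch=None):
    """Time one request to the function. A cold start typically shows
    up as a multi-second outlier compared with warm invocations."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=30).read())
    start = time.monotonic()
    fetch(url)
    return time.monotonic() - start

# Hypothetical usage against a deployed HTTP function:
# for _ in range(5):
#     print(measure_latency('https://REGION-PROJECT.cloudfunctions.net/hello'))
#     time.sleep(240)  # > 3 min between calls, so each one should be cold
```

Alternating the sleep between just under and just over the suspected cooldown lets you bracket the real value empirically.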

CouchDB load spike (even with low traffic)?

We've been running CouchDB v1.5.0 on AWS and it has been working fine. Recently AWS came out with new prices for their new m3 instances, so we switched our CouchDB instance to an m3.large. We have a relatively small database with < 10 GB of data in it.
Our steady-state metrics are a system load of 0.2 and memory usage of 5% or so. However, we noticed that every few hours (3-4 times per day) we get a huge spike that pushes our load to 1.5 or so and memory usage to close to 100%.
We don't run any cron jobs that involve the database, and our traffic flows at about the same rate over the day. We do run a continuous replication from one database on the west coast to another on the east coast.
This has been stumping me for a bit - any ideas?
Just wanted to follow up on this question in case it helps anyone.
While I didn't figure out the direct answer to my load spike question, I did discover another bug from inspecting the logs that I was able to solve.
In my case, running "sudo service couchdb stop" was not actually stopping CouchDB. On top of that, every couple of seconds a new couch process would try to spawn, only to be blocked by the existing couchdb process.
Ultimately, removing the respawn flag in /etc/init.d/couchdb fixed this error.

Using any of the Amazon Web Services, how could I schedule something to happen 1 year from now?

I'd like to be able to create a "job" that will execute in an arbitrary time from now... Let's say 1 year from now. I'm trying to come up with a stable, distributed system that doesn't rely on me maintaining a server and scheduling code. (Obviously, I'll have to maintain the servers to execute the job).
I realize I can poll SimpleDB every few seconds and check to see if there's anything that needs to be executed, but this seems very inefficient. Ideally I could create an Amazon SNS topic that would fire off at the appropriate time, but I don't think that's possible.
Alternatively, I could create a message in the Amazon SQS that would not be visible for 1 year. After 1 year, it becomes visible and my polling code picks up on it and executes it.
It would seem this is a topic, like Singletons or Inversion of Control, that PhDs have discussed and come up with best practices for. I can't find the articles, if there are any.
Any ideas?
Cheers!
The easiest way for most people to do this would be to run an EC2 server with a cron job on it to trigger the action. However, the cost of running an EC2 server 24 hours a day for a year just to trigger one action would be around $170 at the cheapest (8G t1.micro with Heavy Utilization Reserved Instance). Plus, you have to monitor that server and recover from failures.
I have sketched out a different approach to running jobs on a schedule that uses AWS resources exclusively. It's a bit more work, but it does not have the expense or maintenance issues of running an EC2 instance.
You can set up an Auto Scaling schedule (cron format) to start an instance at some point in the future, or on a recurring schedule (e.g., nightly). When you set this up, you specify the job to be run in a user-data script for the launch configuration.
I've written out sample commands in the following article, along with special settings you need to take care of for this to work with Auto Scaling:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
With this approach, you only pay for the EC2 instance hours when the job is actually running and the server can shut itself down afterwards.
This wouldn't be a reasonable way to schedule tens of thousands of emails with an individual timer for each, but it can make a lot of sense for large, infrequent jobs (a few times a day to once per year).
I think it really depends on what kind of job you want to execute in 1 year and whether that value (1 year) is actually hypothetical. There are many ways to schedule a task; Windows and Linux both offer a service for this: Task Scheduler on Windows and crontab on Linux. In addition to those operating-system-specific solutions, you can use maintenance tasks on MS SQL Server, and I'm sure many of the larger databases have similar features.
Without knowing more about what you plan on doing, it's kind of hard to suggest further alternatives, since many of the other solutions would be specific to the technologies and platforms you plan on using. If you provide more insight into what you're going to do with these tasks, I'd be more than happy to expand my answer to be more helpful.