Does ACI billing continue even when my Python code is waiting for messages on a Service Bus subscription?

I have a simple Python script that subscribes to a Service Bus subscription. I have containerized it and deployed it as an Azure Container Instance (ACI).
When a message arrives on the Service Bus subscription, the code runs, executes its logic, and then waits indefinitely for another message to appear.
The code is based on what Azure provides in its Python SDK documentation here.
Since ACI is serverless and bills per second, I just wanted confirmation: will I get billed even while it is not executing my code and is just waiting for a message to appear on the topic/subscription (event-based)?

Yes. You are charged for any container instance that is in the running state, and the charges only stop once you stop all of your container instances. So even though your code is only waiting, the instance is still running and still billed.
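For context, the blocking receiver pattern the question describes looks roughly like the sketch below, assuming the azure-servicebus v7 SDK (the connection string, topic, and subscription names are placeholders). The process never exits on its own, which is why the container instance stays in the running state and keeps accruing charges.

```python
from azure.servicebus import ServiceBusClient

CONN_STR = "<service-bus-connection-string>"  # placeholder
TOPIC = "<topic-name>"                        # placeholder
SUBSCRIPTION = "<subscription-name>"          # placeholder

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    receiver = client.get_subscription_receiver(
        topic_name=TOPIC, subscription_name=SUBSCRIPTION
    )
    with receiver:
        # Iterating the receiver blocks indefinitely while waiting for messages.
        for msg in receiver:
            print(str(msg))                 # your processing logic goes here
            receiver.complete_message(msg)  # settle the message so it is removed
```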

Related

Is there an AWS / PagerDuty service that will alert me if it's NOT notified?

We've got a little Java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith: it fires up (Fargate) tasks in Docker containers. We've got a task that runs every hour, and it's quite important to us. I want to know if it crashes or fails to run for any reason (e.g. the Java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then, if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is: what's it called, and where can I go to get that kind of thing? (We're using AWS, obviously, and we've got a PagerDuty account.)
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an HTTP-based service that reads that file and checks whether the timestamp is still valid, i.e. has been updated within the last hour. This could be a simple PHP or Node.js script. The process is exposed to the public web, e.g. https://example.com/heartbeat.php. The script returns an HTTP response code of 200 if the timestamp file is present and valid, or a 500 if not. We then use StatusCake to monitor the URL and notify us via its PagerDuty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
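The answer above describes a PHP or Node.js script; the same idea, sketched in Python with boto3 and the standard library (the bucket name, key, and port are made up):

```python
from datetime import datetime, timedelta, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

import boto3

BUCKET = "my-heartbeat-bucket"   # hypothetical bucket the task writes to
KEY = "hourly-task/heartbeat"    # hypothetical key the task touches on success
MAX_AGE = timedelta(hours=1)

class HeartbeatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            head = boto3.client("s3").head_object(Bucket=BUCKET, Key=KEY)
            age = datetime.now(timezone.utc) - head["LastModified"]
            ok = age <= MAX_AGE
            body = f"last successful run was {age} ago".encode()
        except Exception as exc:   # missing file, permissions error, etc.
            ok, body = False, str(exc).encode()
        self.send_response(200 if ok else 500)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Expose this behind your usual TLS termination and point StatusCake at it.
    HTTPServer(("", 8080), HeartbeatHandler).serve_forever()
```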
This may seem tedious, but it is foolproof: any failure anywhere along the line is immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in the same way. We've learned the hard way that critical cron-type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer-critical. 24x7x365 monitoring of these types of tasks is necessary, and it helps us sleep better at night.
Note: we always have a daily system test event that triggers a PagerDuty notification at 9am each day. For the truly paranoid, this assures that PagerDuty itself has not failed in some way, e.g. through misconfiguration. Our support team knows that if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to acknowledge the incident as per SOP. If they do not acknowledge it, it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to ensure you have a robust monitoring infrastructure.
Opsgenie has a heartbeat service, which is basically a watchdog timer. You can configure it to call you if you don't ping them within x number of minutes.
Unfortunately, I would not recommend them. I have been using them for 4 years, and they have changed their account system twice and silently left my paid account orphaned. I have to find a new vendor as soon as I have some free time.

Azure WebJob reading a message from Service Bus doesn't delete the message afterwards

The scenario here is that we have a Service Bus queue and a WebJob. The WebJob reads the message from the Service Bus queue and calls a logic app, which then goes on and does other stuff.
The problem we are facing is that after the WebJob reads the message from the Service Bus, it occasionally doesn't delete it afterwards, which causes the logic app to be called repeatedly and flood our database with data.
Here is the message in question as seen from Azure Management Studio:
https://gyazo.com/7f57b460421d1bb4a69fcb8b5a9ff01f
As you can see, there is no lock time on the message. I have tried to play around with the settings to no avail.
When I manually try to delete that message from Azure Management Studio, that is also unsuccessful, but there is no error message.
Does anyone know what is going on here? I feel like this is a problem with the queue itself rather than a bug in our code, since the 2-3 tools I have used are also unable to delete this message from the queue.
It looks like the message is only deleted after a specific time (it does not go to the dead-letter queue, however).
Thanks
So, just for information, I figured my own issue out. When the file scraper job runs, it puts a message on the Service Bus. The WebJob then runs, picks up that file, and stores it locally as well as in blob storage.
The problem was that the WebJob keeps a local queue of what it has processed, which was never cleared, so every time the WebJob ran it processed all the previous files as well.

Approach to crashed workers in Amazon SWF

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after the specified timeout. Make sure that the workflow type's defaultTaskStartToCloseTimeout is not set too high. If the crash is not related to code correctness, then the rescheduled task is processed and the workflow execution continues normally.
Workflow worker doesn't crash, but the workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows (see the sketch after these cases).
Workflow worker doesn't crash, but a decision task cannot complete because RespondDecisionTaskCompleted fails due to a bug in the Flow framework. Since, from SWF's point of view, the task is never completed, it is eventually marked as timed out and rescheduled. Because the bug is still present, the new task again never completes and is rescheduled, and so on. A workflow execution experiencing this issue has a history whose tail consists of repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit, the best way to catch this issue is to set a reasonable executionStartToCloseTimeout and look for timed-out workflow executions. If the decision task timeout is set too low, such workflows can also hit the limit on history size before the execution timeout.
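A rough boto3 sketch of counting such closed executions via ListClosedWorkflowExecutions (the domain name and time window are placeholders; FAILED covers the second case above, and TIMED_OUT catches executions that hit executionStartToCloseTimeout in the third case):

```python
from datetime import datetime, timedelta, timezone

import boto3

swf = boto3.client("swf")
since = datetime.now(timezone.utc) - timedelta(days=1)

def count_closed(status):
    """Count executions closed with the given status over the last day."""
    total, token = 0, None
    while True:
        kwargs = {
            "domain": "my-domain",                      # hypothetical domain
            "startTimeFilter": {"oldestDate": since},
            "closeStatusFilter": {"status": status},
        }
        if token:
            kwargs["nextPageToken"] = token
        page = swf.list_closed_workflow_executions(**kwargs)
        total += len(page["executionInfos"])
        token = page.get("nextPageToken")
        if not token:
            return total

print("failed:", count_closed("FAILED"))
print("timed out:", count_closed("TIMED_OUT"))
```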
All SWF metrics are now published to CloudWatch, so completed and failed workflows send metrics there, and you can create alarms that notify you when any workflow fails.
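For illustration, an alarm on failed workflows could look roughly like this with boto3 (the domain, workflow type, and SNS topic ARN are placeholders, and the metric name and dimensions are my assumption about the SWF CloudWatch metrics, so verify them against your account):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever any execution of this workflow type fails in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="swf-workflow-failures",
    Namespace="AWS/SWF",
    MetricName="WorkflowsFailed",  # assumed metric name
    Dimensions=[
        {"Name": "Domain", "Value": "my-domain"},             # placeholder
        {"Name": "WorkflowTypeName", "Value": "MyWorkflow"},  # placeholder
        {"Name": "WorkflowTypeVersion", "Value": "1.0"},      # placeholder
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],  # placeholder
)
```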

Break out of loop in AWS SWF activity

I'm running a permanent loop in an SWF activity, say a web crawler crawling the website www.example1.com. However, I don't want to wait until it finishes crawling; at a certain point I want to terminate the activity and switch it to crawl www.example2.com instead.
I have tried 'try-cancel' and 'terminate' on the workflow by workflow ID. It seems like this just sends a signal to SWF to mark the task as finished in the AWS console, but the activity process on the worker keeps running.
Any solution for this?
When an activity is cancelled, a heartbeat call returns a flag that indicates it, so your activity loop should include heartbeating code to support cancellation. See the "activity heartbeat" section of the "error handling" page in the AWS Flow Framework for Java Developer Guide for an example; a rough sketch using the low-level API is below.
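A rough sketch of a cancellable crawl loop using the low-level SWF API via boto3 (the Flow Framework makes these calls for you; crawl_until_cancelled, fetch, and the task token plumbing are placeholders):

```python
import boto3

swf = boto3.client("swf")

def crawl_until_cancelled(task_token, pages):
    """Hypothetical activity body: crawl pages while heartbeating so a cancel is seen."""
    for page in pages:
        # The heartbeat keeps the activity alive and returns any pending cancel request.
        beat = swf.record_activity_task_heartbeat(taskToken=task_token)
        if beat["cancelRequested"]:
            # Acknowledge the cancellation instead of completing or timing out.
            swf.respond_activity_task_canceled(
                taskToken=task_token, details="cancelled mid-crawl"
            )
            return
        fetch(page)  # placeholder for the actual crawling work

    swf.respond_activity_task_completed(taskToken=task_token, result="done")
```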

How to kill / re-start a long-running task

Is there a way to kill / re-start a long-running task in AWS SWF? Sometimes some of our tasks run for a long duration, and we would like to manually kill a certain task (either via the UI or programmatically) and re-start it if possible. How can we achieve this?
The console is an option for manually killing a workflow.
You can also set timeouts on the whole workflow execution or on individual activities. These can be set when you register your activity (defaultTaskStartToCloseTimeout) or when you start it; a registration sketch follows below.
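For illustration, setting those defaults at activity registration might look roughly like this with boto3 (the domain, names, and timeout values are placeholders; SWF takes the timeouts as strings of seconds, or "NONE" for unlimited):

```python
import boto3

swf = boto3.client("swf")

# Register a new activity type version with default timeouts baked in.
swf.register_activity_type(
    domain="my-domain",                         # placeholder
    name="CrawlSite",                           # placeholder
    version="1.1",                              # placeholder
    defaultTaskList={"name": "crawler-tasks"},  # placeholder
    defaultTaskStartToCloseTimeout="3600",      # fail the task if it runs over an hour
    defaultTaskHeartbeatTimeout="60",           # expect a heartbeat at least every minute
    defaultTaskScheduleToStartTimeout="600",
    defaultTaskScheduleToCloseTimeout="4200",
)
```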
It's not clear what language you're using.
If you're using Java, then you should look into Exponential Retry in the Flow Framework. This makes the SDK restart your activity if it fails.
A long-running activity is expected to heartbeat using RecordActivityTaskHeartbeat. If the activity process hangs or crashes, this leads to a timeout failure after the short heartbeat interval instead of after the long task-execution timeout.
The workflow code (decider) can always request activity cancellation through the RequestCancelActivityTask decision. The cancellation request is returned as output of the RecordActivityTaskHeartbeat call, and the activity implementation should then cancel itself and report back to the service using the RespondActivityTaskCanceled API call. A decider-side sketch is below.
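Taking the low-level API route, the decider side of that cancellation might look roughly like this with boto3 (the decision task token and activity id are placeholders; the Flow Framework emits this decision for you):

```python
import boto3

swf = boto3.client("swf")

# Decider-side sketch: while handling a decision task, ask SWF to cancel a
# specific in-flight activity. The request surfaces on the activity's next heartbeat.
swf.respond_decision_task_completed(
    taskToken="<decision-task-token>",  # placeholder
    decisions=[{
        "decisionType": "RequestCancelActivityTask",
        "requestCancelActivityTaskDecisionAttributes": {
            "activityId": "crawl-example1",  # hypothetical activity id
        },
    }],
)
```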
See the Error Handling section of the AWS Flow Framework Developer Guide for the AWS Flow Framework way of cancelling activities.
Sometimes an activity implementation cannot support heartbeating and self-cancellation. The solution is to execute another "kill" activity that terminates the first activity's execution. For example, on Unix such a kill activity could issue a "kill -9" against the process that runs the first one.