We are having 37 Sessions, each sessions having tables varying between 20 to 25, our Target DB is Greenplum. Due to huge queries running at different times on DB few segments nodes are going down so few of our CDC sessions gets failed.
So, we are planning to enable Resume from Last Checkpoint for CDC sessions. Do we need to check "Enable HA recovery" at workflow to use Resume from Last Checkpoint for Sessions?
The workflows should be started in warm mode, this way it automatically takes cares of resume from last point.
Please note that if you started the PowerExchange logger in cold mode, you need to start the logger also with cold mode.
A Restart folder is created at $INFA_HOME/server/infa_shared/ which stores the init and term tokens for workflow to resume from last failure.
Related
I am executing a external job using DBMS_SCHEDULER through apex page by clicking a button in below manner.(Dynamic action=>Execute PlSql)
dbms_scheduler.run_job(job_name => 'APEXDATA.myJobName', use_current_session=> TRUE);
Its executing the external job correctly.(taking 1-2 minutes).My issue is that, in between the time while its executing i can not able to access any other page or can not able to login with new session nothing.showing below error in every task i am performing.
**503 Service Unavailable
The connection pool named: |apex|| is not correctly configured, due to the following error(s):
Exception occurred while getting connection: oracle.ucp.UniversalConnectionPoolException:
All connections in the Universal Connection Pool are in use**
Is this the general or known issue?if yes how to resolve the issue,because in same time other user also has to perform any other task or other may login same time.
Thank You.
I think you're mixing 2 things that hard to combine:
Dynamic actions are designed to submit code from the page without a page submit so the user can continue to work on the page after he has done something (eg run pl/sql code)
Running a process in the database that takes up the database session until it is completed ( use_current_session=> TRUE). Your dbms_scheduler.run_job process will run in the current session and as long as that job is running no other operations can be run in that database session (the connection is in use as shown in the error message).
Solutions:
use_current_session=> FALSE so the job runs in the background
In the dynamic action, set "Wait for result" to true, so the user is forced to wait until the job completes.
Execute the job on page submit which will also force the user to wait for the job to be completed.
Since your job takes 1-2 mins to complete, options 2 and 3 are probably not feasible because the user experience is not optimal. If you execute the job in the background, then you probably need to write some additional code to prevent the user from clicking a couple of times and submitting the job multiple times. You could do that by checking if the job is running before you submit it and not submit it if it is currently running.
Here is my case:
When my server receieve a request, it will trigger distributed tasks, in my case many AWS lambda functions (the peek value could be 3000)
I need to track each task progress / status i.e. pending, running, success, error
My server could have many replicas
I still want to know about the task progress / status even if any of my server replica down
My current design:
I choose AWS S3 as my helper
When a task start to execute, it will create marker file in a special folder on S3 e.g. running folder
When the task fail or success, it will move the marker file from running folder to fail folder or success folder
I check the marker files on S3 to check the progress of the tasks.
The problems:
There is a limit for AWS S3 concurrent access
My case is likely to exceed the limit some day
Attempt Solutions:
I had tried my best to reduce the number of request to S3
I don't want to track the progress by storing data in my DB because my DB has already been under heavy workload.
To be honest, it is kind of wierd that using marker files on S3 to track progress of the tasks. However, it worked before.
Is there any recommendations ?
Thanks in advance !
This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an Origination Key on events that are supposed to be grouped together so they don't get mixed up with a subsequent event. Also, each Lambda should be given its own key so you can trace progress per Lambda. Guids are perfect for this.
We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which will flush DB 0 of our redis cache and then repopulate it with updated data. However we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous including the flush so it will block all other requests. In our case we are actually flushing 19m keys and it takes more then 30 seconds.
Elasticache performs a health check periodically and since the flush is running the health check will be blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush is only causing a failover 3-4 times a week. The best answer we can get is "We think its every 30 seconds". However our flush consistently takes more then 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
Currently for Redis versions from 6.2 AWS ElastiCache has a new feature of thread monitoring. So the health check doesn't happen in the same thread as all other actions of Redis. Redis can continue to proceed a long command / lua script, but will still considered healthy. Because of this new feature failovers should happen less.
The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way that the Java Flow SDK does it and the way that you create an ActivityWorker, give it a tasklist, domain, activity implementations, and a few other settings. You set both the setPollThreadCount and setTaskExecutorSize. The polling threads long poll and then hand over work to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up and when wanting to shutdown the workers, you can call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
I ended using a solution where I had another script file that is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.
Scenario : Few jobs are running currently. If cluster reboot happens in the middle of the job execution, I shall be able to observe the continuity of process instance execution with proper state after reboot.
Will Camunda take care of preserving the process instance state by using some checkpoints and resumes automatically from where it halted ?
If you have reached at least one asynchronous continuation (e.g. check the property "async after" or on the start event), then the process instance has been persistent to the database and a job scheduled. Any crash would lead the following transaction to not commit and rollback. The job executor will restart processing from the last commit point when it detects a due job.