AWS Beanstalk: Cannot retrieve logs in degraded state

After serving for some time, my app died and went into a "degraded" state. I have no idea what happened, because no one was using it. Maybe it was hibernated and did not wake up?
Now I am trying to check the logs, but I cannot. Requesting logs takes ages, and from time to time I get timeouts. When I click Request Logs (last 100 lines or full logs), I get this message:
Elastic Beanstalk is updating your environment.
To cancel this operation select Abort Current Operation from the Actions dropdown.
This takes some time and ultimately nothing happens. Moreover, I cannot abort the operation as suggested, because I get:
Error
Could not abort the current environment operation for MY_APP_NAME: Environment named MY_ENV_NAME is in an invalid state for this operation. Must be pending deployment.
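For reference, the console actions above correspond roughly to these API calls (a minimal boto3 sketch; the region and environment name are placeholders):

```python
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")  # placeholder region
env = "MY_ENV_NAME"  # placeholder environment name

# Check which state the environment is actually in.
status = eb.describe_environments(EnvironmentNames=[env])["Environments"][0]["Status"]
print(status)

# Ask the environment to gather the tail of its logs ...
eb.request_environment_info(EnvironmentName=env, InfoType="tail")

# ... and fetch the pre-signed S3 URLs where the logs were uploaded.
info = eb.retrieve_environment_info(EnvironmentName=env, InfoType="tail")
for item in info["EnvironmentInfo"]:
    print(item["Message"])

# The abort that the console refuses to perform:
# eb.abort_environment_update(EnvironmentName=env)
```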

Related

"Error: error - Service Unavailable" while running any job via oracle Apex Page

I am executing an external job using DBMS_SCHEDULER from an APEX page by clicking a button, in the following manner (Dynamic Action => Execute PL/SQL):
dbms_scheduler.run_job(job_name => 'APEXDATA.myJobName', use_current_session => TRUE);
It executes the external job correctly (taking 1-2 minutes). My issue is that while it is executing, I cannot access any other page or log in with a new session; every task I perform shows the error below.
**503 Service Unavailable
The connection pool named: |apex|| is not correctly configured, due to the following error(s):
Exception occurred while getting connection: oracle.ucp.UniversalConnectionPoolException:
All connections in the Universal Connection Pool are in use**
Is this a general or known issue? If yes, how do I resolve it? Other users also need to perform tasks or log in at the same time.
Thank You.
I think you're mixing two things that are hard to combine:
1. Dynamic actions are designed to execute code from the page without a page submit, so the user can continue to work on the page after triggering something (e.g. running PL/SQL code).
2. Running a process in the database that occupies the database session until it completes (use_current_session => TRUE). Your dbms_scheduler.run_job call runs in the current session, and as long as that job is running, no other operations can run in that database session (the connection is in use, as the error message shows).
Solutions:
1. Use use_current_session => FALSE so the job runs in the background.
2. In the dynamic action, set "Wait for result" to true, so the user is forced to wait until the job completes.
3. Execute the job on page submit, which will also force the user to wait for the job to be completed.
Since your job takes 1-2 minutes to complete, options 2 and 3 are probably not feasible, because the user experience is not optimal. If you execute the job in the background, you probably need to write some additional code to prevent the user from clicking repeatedly and submitting the job multiple times. You could do that by checking whether the job is already running before you submit it, and not submitting it if it is.
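A minimal PL/SQL sketch of that check (the owner and job names follow the question; the exact guard is an assumption to adapt to your schema):

```sql
DECLARE
  l_running NUMBER;
BEGIN
  -- Is the job already running? (the data dictionary stores unquoted names in upper case)
  SELECT COUNT(*)
    INTO l_running
    FROM all_scheduler_running_jobs
   WHERE owner = 'APEXDATA'
     AND job_name = 'MYJOBNAME';

  IF l_running = 0 THEN
    -- Run the job in the background so the APEX database session is not blocked.
    dbms_scheduler.run_job(job_name            => 'APEXDATA.myJobName',
                           use_current_session => FALSE);
  END IF;
END;
/
```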

When I get 'service has reached a steady state' in Amazon ECS, does it mean some tasks have stopped?

Does this mean that my service tasks are stopping, or is it OK to get these log messages?
Actually, it's the opposite of this. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy and there are no scaling requests or deployments.
No, it doesn't mean that any of your tasks have stopped. If a task stops, you will see an event that clearly states so and includes a link to the specific task that was stopped. For example, you will get something like: "service xxx has stopped 1 running tasks: task xxx."
If no tasks have been created or stopped in the last six hours, the ECS console will duplicate the last event message to let you know that everything works as expected.
From the ECS docs:
"To ensure that this event view is helpful, we only show the 100 most recent events and duplicate event messages are omitted until either the cause is resolved or six hours passes. If the cause is not resolved within six hours, you will receive another service event message for that cause."
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html
Check this thread on the AWS forums: https://forums.aws.amazon.com/thread.jspa?threadID=182793
This sounds like normal behavior. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments. Are you seeing any issues?
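If you want to confirm programmatically that no tasks were stopped, a rough boto3 sketch (the region, cluster, and service names are placeholders) that scans the recent service events looks like this:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # placeholder region

# Placeholder cluster/service names -- replace with your own.
response = ecs.describe_services(cluster="my-cluster", services=["my-service"])

for event in response["services"][0]["events"]:
    # Stopped tasks produce explicit "... has stopped N running tasks ..." events;
    # the periodic "has reached a steady state" message is informational only.
    if "has stopped" in event["message"]:
        print(event["createdAt"], event["message"])
```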

Cloud Composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometimes in the morning, I see that a few tasks failed without a reason during the night.
When I check the logs using the UI, I see no log, and I see no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email notification, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the absence of logs, is proof that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.
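The worker concurrency itself is a Composer/Airflow configuration setting, but as a complementary illustration, here is a DAG-level sketch (all ids, addresses, and values are placeholders) that caps how many task instances run in parallel while keeping the retry and email settings mentioned in the question:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "airflow",
    "retries": 5,                        # keep the retry behaviour from the question
    "retry_delay": timedelta(minutes=5),
    "email": ["me@example.com"],         # placeholder address
    "email_on_failure": True,
}

dag = DAG(
    dag_id="nightly_example",            # placeholder DAG id
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 3 * * *",       # 3:00 AM daily, as in the question
    concurrency=4,                       # fewer simultaneous task instances -> less memory pressure
    max_active_runs=1,
)

task = DummyOperator(task_id="placeholder_task", dag=dag)
```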

Hello World Pipeline with ShellCommandActivity

I'm trying to create a simple Data Pipeline with a single activity of the ShellCommandActivity type. I've attached the configuration of the activity and the EC2 resource.
When I execute this, the Ec2Resource sits in the WAITING_ON_DEPENDENCIES state, then after some time changes to TIMEDOUT. The ShellCommandActivity is always in the CANCELED state. I see the instance launch and very quickly change to the terminated state.
I've specified an S3 log file URL, but that never gets updated.
Can anyone give me any pointers? Also is there any guidance out there on debugging this?
Thanks!!
You are currently forcing your instance to shut down after 1 minute, which gives the TIMEDOUT status if the activity can't complete in that time. Try increasing it to 50 minutes.
Also make sure you are using an AMI that runs Amazon Linux and that you are using full absolute paths in your scripts.
S3 log files are written as:
s3://bucket/folder/
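As far as I know, the timeout above corresponds to the terminateAfter field on the Ec2Resource. As an illustration of these settings, a rough boto3 sketch of the relevant pipeline objects (the region, pipeline ID, object names, and script path are placeholders; a full definition also needs a schedule and a default object, omitted here):

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")  # placeholder region

ec2_resource = {
    "id": "MyEc2Resource",
    "name": "MyEc2Resource",
    "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        # Give the activity enough time before the instance is terminated.
        {"key": "terminateAfter", "stringValue": "50 Minutes"},
    ],
}

shell_activity = {
    "id": "MyShellCommand",
    "name": "MyShellCommand",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        # Use a full absolute path to the script.
        {"key": "command", "stringValue": "/home/ec2-user/scripts/hello.sh"},
        {"key": "runsOn", "refValue": "MyEc2Resource"},
    ],
}

dp.put_pipeline_definition(
    pipelineId="df-XXXXXXXXXXXX",  # placeholder pipeline ID
    pipelineObjects=[ec2_resource, shell_activity],
)
```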

Operation "timing out" during new item creation in Sitecore Editor

I've created a command in the Sitecore Editor that automatically builds out up to 25 items at a time. The problem that I'm experiencing is that the operation just "hangs" and does not complete. I don't think it's an error because I've added error handling and logging.
I'm getting the following error message "The operation could not be completed. Your session may have been lost due to a time-out or a server failure. Try again."
How can I increase the "time-out" duration (if this is a setting somewhere) - or is there another solution to this problem?
Long-running operations will eventually time out, depending on your IIS settings, usually after 20 minutes. Instead, you should run your commands as a scheduled task, since scheduled tasks run in the background without a waiting IIS request.
However, it seems strange that inserting 25 items is such a long operation that the browser times out. You might have another issue in your code.