All jobs failing in C COMPSs execution - c++

I have downloaded COMPSs 1.4 and some test programs from http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation and I am trying to test them. Java executions went fine; however, I am having problems with C.
I am currently trying to execute the Simple application. The Readme states that I only need two commands:
buildapp simple
runcompss --lang=c master/simple 1
The app builds fine, but when executing with this command, I get the following error:
[ERRMGR] - WARNING: Job 1 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR] - WARNING: Task 1 execution on worker localhost has failed; rescheduling task execution. (changing worker)
[ERRMGR] - WARNING: No task could be scheduled to any of the available resources.
This could end up blocking COMPSs. Will check it again in 20 seconds.
Possible causes:
-Network problems: non-reachable nodes, sshd service not started, etc.
-There isn't any computing resource that fits the defined tasks constraints.
If this happens 2 more times, the runtime will shutdown.
After 3 checks, the execution ends with no results. Is there something I am missing?

When running an application with the C binding, the default project.xml is not valid. You have to define a project.xml that specifies where the worker binaries are deployed on each host:
<Project>
  <Worker Name="localhost">
    <InstallDir>/opt/COMPSs/Runtime/scripts/system/</InstallDir>
    <WorkingDir>[/path/to/dir/used_as_working_dir]</WorkingDir>
    <AppDir>[/path/to/installation]</AppDir>
    <LimitOfTasks>4</LimitOfTasks>
  </Worker>
</Project>

Related

App Engine Flexible deployment fails to become healthy in the allotted time

My Flask app deployment via App Engine Flex is timing out. After setting debug=True, I see the following line repeating over and over until it fails. However, I am not sure what it means, and I cannot find anything useful in Logs Explorer.
Updating service [default] (this may take several minutes)...working DEBUG: Operation [apps/enhanced-bonito-349015/operations/81b83124-17b1-4d90-abdc-54b3fa28df67] not complete. Waiting to retry.
Could anyone share advice on where to look to resolve this issue?
Here is my app.yaml (I thought this was due to a memory issue..):
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app
runtime_config:
  python_version: 3
resources:
  cpu: 4
  memory_gb: 12
  disk_size_gb: 1000
readiness_check:
  path: "/readines_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
Error logs:
ERROR: (gcloud.app.deploy) Error Response: [4] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-10T23:21:10.941Z47607.vt.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
There are several possible ways to resolve this kind of deployment error:
Increase the value of app_start_timeout_sec to the maximum value, which is 1800.
Make sure that all the Google Cloud services that Endpoints and ESP require are enabled on your project.
Assuming that the splitHealthChecks feature is enabled, make sure to follow all the steps needed when migrating from the legacy version.

amazon-ssm-agent failing to restart after reboot on Windows Server 2019 instance

We are applying patches to our Windows instances using the Patch Manager function in AWS Systems Manager. We have a patch baseline that is executed against a set of Windows instances (each of which is part of a patch group) by executing a maintenance window, which in turn executes a run command against each of the instances. However, we are finding the following:
The instances in question seem to get patches installed correctly. Executing wmic qfe list shows that the patches have been installed on the target machines
The target instances are then rebooted after patches are installed
The run command remains in progress indefinitely
After further investigation, we found that the amazon-ssm-agent failed to start when the instances were rebooted. The error logs were as follows:
[devInstanceA]: PS C:\ProgramData\Amazon\SSM\Logs> get-content .\errors.log -tail 20
2020-11-09 09:36:02 ERROR [func1 # coremanager.go.246] [instanceID=i-04b3ce4e6e53b0b6f] error occurred trying to start core module. Plugin name: StartupProcessor. Error: Internal error occurred by startup processor: runtime error: invalid memory address or nil pointer dereference
Once we manually restarted the amazon-ssm-agent, the run command completed successfully. The issue is that we don't want to have to manually start the amazon-ssm-agent on each instance, especially as we have a lot of instances!
Any ideas on what is causing this, i.e. why is the amazon-ssm-agent not starting up successfully after an automatic reboot?

Running Django checks in production runtime

I have a Django app that is deployed on Kubernetes. The container also has a mount to a persistent volume containing some files that are needed for operation. I want a check that verifies the files are there and accessible at runtime every time a pod starts. The Django documentation recommends against running checks in production (the app runs in uwsgi), and because the files are only available in the production environment, the check will fail when unit tested.
What would be an acceptable process for executing the checks in production?
This is a community wiki answer posted for better visibility. Feel free to expand it.
Your use case can be addressed from Kubernetes perspective. All you have to do is to use the Startup probes:
The kubelet uses startup probes to know when a container application
has started. If such a probe is configured, it disables liveness and
readiness checks until it succeeds, making sure those probes don't
interfere with the application startup. This can be used to adopt
liveness checks on slow starting containers, avoiding them getting
killed by the kubelet before they are up and running.
With it you can use the ExecAction that would execute a specified command inside the container. The diagnostic would be considered successful if the command exits with a status code of 0. An example of a simple command check could be one that checks if a particular file exists:
exec:
  command:
  - stat
  - /file_directory/file_name.txt
You could also use a shell script but remember that:
Command is the command line to execute inside the container, the
working directory for the command is root ('/') in the container's
filesystem. The command is simply exec'd, it is not run inside a
shell, so traditional shell instructions ('|', etc) won't work. To use
a shell, you need to explicitly call out to that shell.
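If the check needs more than stat can express, the same ExecAction can invoke a small script, with the interpreter named explicitly in the command list because no shell is involved. Below is a minimal sketch in Python; the script name check_files.py and the idea of passing the paths as arguments are assumptions made for illustration, not something from the question or the Kubernetes documentation.

```python
import os
import sys


def missing_files(paths):
    """Return the paths that are not regular files readable by this process."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.R_OK))]


def main(paths):
    """Report missing files on stderr and return a process exit status."""
    missing = missing_files(paths)
    for path in missing:
        print(f"missing or unreadable: {path}", file=sys.stderr)
    # Status 0 marks the probe as successful; any other status is a failure.
    return 1 if missing else 0


if __name__ == "__main__":
    # Paths to check are taken from the command line (hypothetical convention).
    status = main(sys.argv[1:])
    if status:
        sys.exit(status)
```

The probe's command list would then be python, the path to the script inside the container, and the files to check, for example /file_directory/file_name.txt.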

AWS-RunBashScript errors/warnings with Python

I have many EC2 instances that retain Celery jobs for processing. To efficiently start the overall task of completing the queue, I have tested AWS-RunBashScript in AWS' SSM with a BASH script that calls a Python script. For example, for a single instance this begins with sh start_celery.sh.
When I run the command in SSM, this is the following output (compare to other output below, after reading on):
/home/ec2-user/dh2o-py/venv/local/lib/python2.7/dist-packages/celery/utils/imports.py:167:
UserWarning: Cannot load celery.commands extension u'flower.command:FlowerCommand':
ImportError('No module named compat',)
namespace, class_name, exc))
/home/ec2-user/dh2o-py/tasks/task_harness.py:49: YAMLLoadWarning: calling yaml.load() without
Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
task_configs = yaml.load(conf)
Running a worker with superuser privileges when the worker accepts messages serialized with pickle is a very bad idea!
If you really want to continue then you have to set the C_FORCE_ROOT
environment variable (but please think about this before you do).
User information: uid=0 euid=0 gid=0 egid=0
failed to run commands: exit status 1
Note that only warnings are thrown. When I SSH to the same instance and run the same command (i.e. sh start_celery.sh), the following (same) output results BUT the process runs:
I have verified that the process does NOT run when doing this via SSM, and I have no idea why. As a work-around, I tried running the sh start_celery.sh command with bootstrapping in user data for each EC2, but that failed too.
So why does SSM fail to run the process when I can start it successfully by SSHing into each instance and running identical commands? The details below relate to machine and Python configuration:

Missing log lines when writing to cloudwatch from ECS Docker containers

(Docker container on AWS-ECS exits before all the logs are printed to CloudWatch Logs)
Why are some streams of a CloudWatch Logs Group incomplete (i.e., the Fargate Docker Container exits successfully but the logs stop being updated abruptly)? Seeing this intermittently, in almost all log groups, however, not on every log stream/task run. I'm running on version 1.3.0
Description:
A Dockerfile runs node.js or Python scripts using the CMD command.
These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.
Sample Dockerfile:
FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]
All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS Tasks on Fargate, the log driver is set to awslogs from a CloudFormation template.
...
LogConfiguration:
  LogDriver: 'awslogs'
  Options:
    awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: ecs
...
Seeing that the CloudWatch Logs output is sometimes incomplete, I have run tests and checked every limit from CW Logs Limits, and I am certain the problem is not there.
I initially thought this was an issue with Node.js exiting asynchronously before console.log() is flushed, or that the process was exiting too soon, but the same problem occurs when I use a different language as well, which makes me believe this is not an issue with the code but rather with CloudWatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.
It's possible that since the docker container exits immediately after the task is completed, the logs don't get enough time to be written over to CWLogs, but there must be a way to ensure that this doesn't happen?
sample logs:
incomplete stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
completed log stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below
I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!
I think it's due to CloudWatch Logs Agent publishing log events in batches:
How are log events batched?
A batch becomes full and is published when any of the following conditions are met:
The buffer_duration amount of time has passed since the first log event was added.
Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.
The number of log events has reached batch_count.
Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.
(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)
There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.
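To make the suspected failure mode concrete, here is a toy model of the batching conditions quoted above. It is only a sketch for reasoning about timing, not the agent's code: the Batch class, the should_publish function, and the size and count limits are invented for illustration, and only the 5-second buffer_duration comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

BUFFER_DURATION = 5.0      # seconds; the default mentioned above
BATCH_SIZE = 1024          # bytes; placeholder value, not the agent's setting
BATCH_COUNT = 100          # events; placeholder value, not the agent's setting
MAX_SPAN = 24 * 60 * 60    # seconds; the 24-hour constraint


@dataclass
class Batch:
    first_added_at: Optional[float] = None  # when the first event was buffered
    first_event_ts: Optional[float] = None  # timestamp carried by the first event
    size: int = 0                           # accumulated message bytes
    count: int = 0                          # accumulated events

    def add(self, event_ts, message, now):
        if self.count == 0:
            self.first_added_at = now
            self.first_event_ts = event_ts
        self.size += len(message.encode("utf-8"))
        self.count += 1


def should_publish(batch, now, new_size=0, new_ts=None):
    """True if any of the quoted conditions holds for a non-empty batch."""
    if batch.count == 0:
        return False
    span_exceeded = new_ts is not None and new_ts - batch.first_event_ts > MAX_SPAN
    return (
        now - batch.first_added_at >= BUFFER_DURATION
        or batch.size + new_size > BATCH_SIZE
        or batch.count >= BATCH_COUNT
        or span_exceeded
    )


batch = Batch()
batch.add(event_ts=0.0, message="Hello world", now=0.0)
# One second later no condition has fired, so the event is still only buffered.
print(should_publish(batch, now=1.0))  # False
# Once buffer_duration has passed, the batch would be published.
print(should_publish(batch, now=5.0))  # True
```

If the container stops at the one-second mark in this model, the buffered event is never published, which is the behaviour the waiting approach is meant to avoid.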
With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.
Dockerfile
FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .
# The original command
# ENTRYPOINT ["python", "main.py"]
# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
run_then_wait.sh
#!/bin/sh
set -e
# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
# Run the given command with all of its arguments
"$@"
main.py
import logging
import time
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")
Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
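For the in-script variant, a minimal sketch using try/finally, so the delay runs whether the work returns normally or raises. The run_job function and the FLUSH_WAIT_SECONDS name are placeholders invented for this example. Note that a finally block does not run if the process is killed outright (for example by SIGKILL), which is part of why the wrapper script can be the more robust option.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

# Assumed value: twice the 5-second buffer_duration default, as in the wrapper script.
FLUSH_WAIT_SECONDS = 10


def run_job():
    """Placeholder for the real work of the task."""
    logger.info("Hello world")


def main(wait_seconds=FLUSH_WAIT_SECONDS, job=run_job):
    try:
        job()
    finally:
        # Runs on normal return and on uncaught exceptions alike.
        logger.info("Waiting for logs to flush to CloudWatch Logs...")
        time.sleep(wait_seconds)
```

Calling main() as the last statement of the script gives the same effect as the wrapper for normal exits and uncaught exceptions.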
It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!
Just contacted AWS support about this issue and here is their response:
...
Based on that case, I can see that this occurs for containers in a
Fargate Task that exit quickly after outputting to stdout/stderr. It
seems to be related to how the awslogs driver works, and how Docker in
Fargate communicates to the CW endpoint.
Looking at our internal tickets for the same, I can see that our
service team are still working to get a permanent resolution for this
reported bug. Unfortunately, there is no ETA shared for when the fix
will be deployed. However, I've taken this opportunity to add this
case to the internal ticket to inform the team of the similar and try
to expedite the process
In the meantime, this can be avoided by extending the lifetime of the
exiting container by adding a delay (~>10 seconds) between the logging
output of the application and the exit of the process (exit of the
container).
...
Update:
Contacted AWS around August 1st, 2019, they say this issue has been fixed.
I observed this as well. It must be an ECS bug?
My workaround (Python 3.7):
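import atexit
import logging
from time import sleep

logger = logging.getLogger()

def finalizer():
    logger.info("All tasks have finished. Exiting.")
    # Workaround:
    # Fargate will exit and final batch of CloudWatch logs will be lost
    sleep(10)

# Register only after finalizer is defined, otherwise this raises NameError
atexit.register(finalizer)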
I had the same problem with flushing logs to CloudWatch.
Following asavoy's answer I switched from exec form to shell form of the ENTRYPOINT and added a 10 sec sleep at the end.
Before:
ENTRYPOINT ["java","-jar","/app.jar"]
After:
ENTRYPOINT java -jar /app.jar; sleep 10