Amazon S3 cp keeps stopping partway through a copy process

I have an overnight scheduled task (batch script) on a WS2K3 server which copies a ~2GB zip file to an S3 bucket for archival purposes.
I'm using AWS CLI v1, which I appreciate isn't ideal (nor is the fact that the OS is out of date and unsupported). I had enough problems coercing the AWS CLI into actually running on the server! (It's a legacy system the client refuses to update.)
The copy process appears to have problems. I suspect there are intermittent connectivity issues at the client's site and they're causing the copy process to exit. I've added a debug argument to the command, but for some reason the debug info isn't being echoed to the logfile, so I can't identify the specific reason why it keeps exiting. All I can do is retry until the copy completes.
The S3 upload section of the script is below
:s3cp
aws configure set default.s3.max_concurrent_requests 1
REM echo S3 Backup Started: %date% %time%
aws s3 cp %backup%.zip %bucket% --debug
IF %ERRORLEVEL% NEQ 0 goto :s3cp
REM echo S3 Backup Finished: %date% %time%
goto exit
I've reduced concurrent connections in an attempt to restrict the resource overhead for this routine (the server struggles with most tasks!), but it's had no effect.
The last time it ran, the upload started at 20:04 and restarted 23 times before finally completing a successful upload nearly 6 hours after the first copy process was called.
Are there any additional arguments I can pass to the AWS CLI to cope with timeouts and connection unreliability to ensure the process doesn't exit immediately?
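For what it's worth, the direction I'm experimenting with is capping the retries, redirecting the --debug output (which goes to stderr, not stdout) into the logfile, and shrinking the multipart chunk size. This is only a sketch: %logfile% is a placeholder, and the --cli-connect-timeout / --cli-read-timeout options and the multipart_chunksize setting are assumptions about what my CLI v1 version actually supports.
aws configure set default.s3.max_concurrent_requests 1
aws configure set default.s3.multipart_chunksize 8MB
SET attempts=0
:s3cp
SET /A attempts+=1
echo S3 Backup attempt %attempts% started: %date% %time% >> %logfile%
REM --debug is written to stderr, so 2>&1 is needed to capture it in the logfile
aws s3 cp %backup%.zip %bucket% --debug --cli-connect-timeout 60 --cli-read-timeout 300 >> %logfile% 2>&1
IF %ERRORLEVEL% NEQ 0 IF %attempts% LSS 30 goto :s3cp
echo S3 Backup finished: %date% %time% >> %logfile%
goto exit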

Related

Google Cloud Ops Agent

I am having an issue where Google Cloud Ops Agent logging gathers a lot of data and fills up my entire Debian server's hard drive in about 3 weeks due to the ever-increasing size of the log file.
I do not want to increase the size of my server's hard drive.
Does anyone know how to configure Google Cloud Ops Agent so that it only retains log data for the previous 7 days?
EDIT: the Google Cloud Ops Agent log file is stored at the path below:
/var/log/google-cloud-ops-agent/subagents/logging-module.log
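In the meantime I'm considering just rotating that file with logrotate. This is an untested sketch, not anything from the Ops Agent docs: the /etc/logrotate.d/google-cloud-ops-agent path and the daily/rotate 7/copytruncate choices are my own assumptions.
# /etc/logrotate.d/google-cloud-ops-agent
/var/log/google-cloud-ops-agent/subagents/logging-module.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # copytruncate so the agent can keep writing to the same file handle
    copytruncate
}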
I faced the same issue recently while using agent 2.11.0, and it's not just an enormous log file, it's also ridiculous CPU usage! Check it out in htop.
If you open the log file you'll see it spamming errors about buffer chunks. Apparently they somehow got corrupted, so the agent can't read them and send them on, hence the high IO and CPU usage.
The solution is to stop the service:
sudo service google-cloud-ops-agent stop
Then clear all buffer chunks:
sudo rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers/
And delete log file if you want:
sudo rm -f /var/log/google-cloud-ops-agent/subagents/logging-module.log
Then start the agent:
sudo service google-cloud-ops-agent start
This helped me out.
By the way, this issue is described here, and it seems that Google "fixed" it as of 2.7.0-1. Whatever they mean by that, since we still ran into it...
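To keep an eye on it afterwards, I occasionally check the size of the buffer directory and the log file, e.g.:
sudo du -sh /var/lib/google-cloud-ops-agent/fluent-bit/buffers/
sudo du -sh /var/log/google-cloud-ops-agent/subagents/logging-module.log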

Missing log lines when writing to cloudwatch from ECS Docker containers

(The Docker container on AWS ECS exits before all the logs are printed to CloudWatch Logs.)
Why are some streams of a CloudWatch Logs group incomplete (i.e., the Fargate Docker container exits successfully but the logs stop being updated abruptly)? I'm seeing this intermittently, in almost all log groups, but not on every log stream/task run. I'm running on version 1.3.0.
Description:
A Dockerfile runs node.js or Python scripts using the CMD command.
These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.
Sample Dockerfile:
FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]
All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS tasks on Fargate, the log driver is set to awslogs via a CloudFormation template.
...
LogConfiguration:
  LogDriver: 'awslogs'
  Options:
    awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: ecs
...
Seeing that the CloudWatch Logs output is sometimes incomplete, I have run tests and checked every limit from CW Logs Limits and am certain the problem is not there.
I initially thought this was an issue with Node.js exiting asynchronously before console.log() is flushed, or the process exiting too soon, but the same problem occurs when I use a different language as well, which makes me believe this is not an issue with the code but rather with CloudWatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.
It's possible that since the docker container exits immediately after the task is completed, the logs don't get enough time to be written over to CWLogs, but there must be a way to ensure that this doesn't happen?
sample logs:
incomplete stream:
{ "message": "configs to run", "data": {"dailyConfigs":["filename.json"]}}
running for filename
completed log stream:
{ "message": "configs to run", "data": {"dailyConfigs":["filename.json"]}}
running for filename
stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below
I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!
I think it's due to CloudWatch Logs Agent publishing log events in batches:
How are log events batched?
A batch becomes full and is published when any of the following conditions are met:
The buffer_duration amount of time has passed since the first log event was added.
Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.
The number of log events has reached batch_count.
Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.
(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)
There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.
With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.
Dockerfile
FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .
# The original command
# ENTRYPOINT ["python", "main.py"]
# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
run_then_wait.sh
#!/bin/sh
set -e
# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
# Run the given command
"$@"
main.py
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")
Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. the run_then_wait.sh wrapper, 30 runs each. The results were that the issue was observed 30% of the time vs. 0% of the time, respectively. Hope this is similarly effective for you!
Just contacted AWS support about this issue and here is their response:
...
Based on that case, I can see that this occurs for containers in a
Fargate Task that exit quickly after outputting to stdout/stderr. It
seems to be related to how the awslogs driver works, and how Docker in
Fargate communicates to the CW endpoint.
Looking at our internal tickets for the same, I can see that our
service team are still working to get a permanent resolution for this
reported bug. Unfortunately, there is no ETA shared for when the fix
will be deployed. However, I've taken this opportunity to add this
case to the internal ticket to inform the team of the similar and try
to expedite the process
In the meantime, this can be avoided by extending the lifetime of the
exiting container by adding a delay (~>10 seconds) between the logging
output of the application and the exit of the process (exit of the
container).
...
Update:
I contacted AWS around August 1st, 2019, and they say this issue has been fixed.
I observed this as well. It must be an ECS bug?
My workaround (Python 3.7):
import atexit
import logging
from time import sleep

logger = logging.getLogger()

def finalizer():
    logger.info("All tasks have finished. Exiting.")
    # Workaround:
    # Fargate will exit and the final batch of CloudWatch logs will be lost
    sleep(10)

atexit.register(finalizer)
I had the same problem with flushing logs to CloudWatch.
Following asavoy's answer, I switched from the exec form to the shell form of ENTRYPOINT and added a 10-second sleep at the end.
Before:
ENTRYPOINT ["java","-jar","/app.jar"]
After:
ENTRYPOINT java -jar /app.jar; sleep 10
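Note that with the shell form the container now exits with the sleep's status, so a failing java process can still look successful to ECS. If that matters, a variant like the following (an untested sketch, not part of the original fix) should preserve the application's exit code while still delaying shutdown:
ENTRYPOINT java -jar /app.jar; app_exit=$?; sleep 10; exit $app_exit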

Running updates on EC2s that roll back on failure of status check

I’m setting up a patch process for EC2 servers running a web application.
I need to build an automated process that installs system updates but reverts to the last working EC2 instance if the web application fails a status check.
I’ve been trying to do this using an Automation Document in EC2 Systems Manager that performs the following steps:
1. Stop EC2 instance
2. Create AMI from instance
3. Launch new instance from newly created AMI
4. Run updates
5. Run status check on web application
6. If check fails, stop new instance and restart original instance
The Automation Document runs the first 5 steps successfully, but I can't work out how to trigger step 6. Can I do this within the Automation Document? What output would I be able to use from step 5? If it uses aws:runCommand, should the runCommand trigger a new Automation Document or another AWS tool?
I tried the following to solve this, which more or less worked:
Included an aws:runCommand action in the automation document
This ran the DocumentName "AWS-RunShellScript" with the following parameters:
Downloaded the script from s3:
sudo aws s3 cp s3://path/to/s3/script.sh /tmp/script.sh
Set the file to executable:
chmod +x /tmp/script.sh
Executed the script using variables set in, or generated by, the automation document:
bash /tmp/script.sh -o {{VAR1}} -n {{VAR2}} -i {{VAR3}} -l {{VAR4}} -w {{VAR5}}
The script included the following getopts command to set the inputted variables:
while getopts o:n:i:l:w: option
do
    case "${option}" in
        o) VAR1=${OPTARG};;
        n) VAR2=${OPTARG};;
        i) VAR3=${OPTARG};;
        l) VAR4=${OPTARG};;
        w) VAR5=${OPTARG};;
    esac
done
The bash script used the variables to run the status check and roll back to the last working instance if it failed.
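For reference, that aws:runCommand step can be expressed in the Automation Document roughly like the sketch below. The step name and the instance ID reference are placeholders (they depend on how your earlier launch step is named), and the S3 path is the same placeholder as above:
- name: runStatusCheckScript
  action: aws:runCommand
  inputs:
    DocumentName: AWS-RunShellScript
    InstanceIds:
      - '{{ launchNewInstance.InstanceIds }}'
    Parameters:
      commands:
        - sudo aws s3 cp s3://path/to/s3/script.sh /tmp/script.sh
        - chmod +x /tmp/script.sh
        - bash /tmp/script.sh -o {{VAR1}} -n {{VAR2}} -i {{VAR3}} -l {{VAR4}} -w {{VAR5}}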

Disable progress output aws s3 sync without disabling all output

Is there any way to disable the
Completed 1 of 12 part(s) with 11 file(s) remaining...
progress output of the aws s3 sync command (from the AWS CLI tools)?
I know there is a --quiet option but I don't want to use it because I still want the Upload... details in my logfile.
It's not a big issue, but it creates a mess in the logfile, like:
Completed 1 of 12 part(s) with 11 file(s) remaining^Mupload: local/file to s3://some.bucket/remote/file
Where ^M is a control character.
As of October 2017, it is possible to only suppress upload progress with aws s3 cp and aws s3 sync by using the --no-progress option:
--no-progress (boolean) File transfer progress is not displayed. This flag is only applied when the quiet and only-show-errors flags are not provided.
Example:
aws s3 sync /path/to/directory s3://bucket/folder --no-progress
Output:
upload: /path/to/directory to s3://bucket/folder
I had a quick look at the CLI tools code and currently it is not possible to disable that message.
You should use the --only-show-errors flag when running the command. You may also want --no-progress. This minimizes the logging.
More specs: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
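For example (same placeholder paths as the earlier example):
aws s3 sync /path/to/directory s3://bucket/folder --only-show-errors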
You can't disable the message completely. You can only remove it from the logfile by editing it afterwards, and it will show up again the next time you run the command.

AWS S3 upload fails: RequestTimeTooSkewed

I'm using
aws s3 sync ~/folder/ s3:// --delete
to upload (and sync) a large number of files to an S3 bucket. Some - but not all - of the files fail, throwing this error message:
upload failed: to s3://bucketname/folder/
A client error (RequestTimeTooSkewed) occurred when calling the UploadPart operation: The difference between the request time and the current time is too large
I know that the cause of this error is usually a local time that's out of sync with Internet time, but I'm running NTP (on my Ubuntu PC) and the date/time seem absolutely accurate - and this error has only been reported for about 15 out of the forty or so files I've uploaded so far.
Some of the files are relatively large - up to about 70MB each - and my upload speeds aren't fantastic: could S3 possibly be comparing the initial and completion times and reporting their difference as an error?
The time verification happens at the start of your upload to S3, so it won't be related to files taking too long to upload.
Try comparing your system time with what S3 is reporting and see if there is any unnecessary time drift, just to make sure:
# Time from Amazon
$ curl http://s3.amazonaws.com -v
# Time on your local machine
$ date -u
(Time is returned in UTC)
I was running aws s3 cp inside a Docker container on a MacBook Pro and got this error. Restarting Docker for Mac fixed the issue.
Amazon S3 uses NTP for its system clocks, so your machine's clock needs to be kept in sync with it too.
Run
sudo apt-get install ntp
then open /etc/ntp.conf and add at the bottom
server 0.amazon.pool.ntp.org iburst
server 1.amazon.pool.ntp.org iburst
server 2.amazon.pool.ntp.org iburst
server 3.amazon.pool.ntp.org iburst
Then run
sudo service ntp restart
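To check that the daemon is actually syncing against those servers (assuming the ntp package's ntpq tool is installed, as it is on Ubuntu), the selected peer should be marked with a * in:
ntpq -p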
In my case it seems that multipart uploads were failing on aws s3. Using s3cmd instead works perfectly.
You have to sync the local time on your machine; it has drifted from the actual world time.
I was having the issue on macOS.
I fixed it via
System Preferences -> Date & Time -> check the box "Set date and time automatically"
Restarting the machine fixed this issue for me.
In cmd, run aws configure and set a default region name, e.g. us-east-1. Whichever region you choose, it should not be left as none:
Default region name [none]        ✗
Default region name [us-east-1]   ✓
Then create a bucket via the GUI on the AWS website and check its creation date and time.
Note down that date and time from AWS, and set your PC's date and time to match it in your PC's settings.
Now try the command in cmd:
aws s3 ls