Failing to push 500GB container every single time - google-container-registry

So, I've created a container image of about 430 GB, and the push fails every single time on the same layer.
15d907c6c4d1: Preparing
....
15d907c6c4d1: Retrying in 20 seconds
....
15d907c6c4d1: Retrying in 1 second
write tcp 10.132.0.5:50149->74.125.133.82:443: write: broken pipe
I'm doing this push from a GCP virtual machine, so the network should be fast and stable.
$ gcloud docker -- --version
Docker version 1.12.3, build 6b644ec
I'm quite lost as to how to debug this further.

The likely issue is that the user you're pushing as does not have write access to the Cloud Storage destination bucket. The bucket is in the format [region].artifacts.[PROJECT-ID].appspot.com and it uses standard GCS access controls; see the documentation for further details.
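As a rough sketch of how to check this (the bucket and service account names below are illustrative, not taken from the question), you can inspect and adjust the bucket's IAM policy with gsutil:
# Show who currently has access to the registry's backing bucket
gsutil iam get gs://eu.artifacts.my-project.appspot.com
# Grant the identity doing the push write access to that bucket
gsutil iam ch serviceAccount:my-vm@my-project.iam.gserviceaccount.com:objectAdmin gs://eu.artifacts.my-project.appspot.com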

Related

Simple HelloWorld app on Cloud Run (or Knative) seems too slow

I deployed a sample HelloWorld app on Google Cloud Run, which is basically Knative, and every call to the API takes 1.4 seconds at best, end-to-end. Is it supposed to be this slow?
The sample app is at https://cloud.google.com/run/docs/quickstarts/build-and-deploy
I deployed the very same app on my localhost as a docker container and it takes about 22ms, end-to-end.
The same app on my GKE cluster takes about 150 ms, end-to-end.
import os
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    target = os.environ.get('TARGET', 'World')
    return 'Hello {}!\n'.format(target)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
I have a little experience with FaaS, and I expected API calls to get faster as I invoked them in a row (as in cold start vs. warm start).
But no matter how many times I execute the command, it doesn't go below 1.4 seconds.
I don't think network distance is the dominant factor here; the round-trip time to the API endpoint, measured via ping, is only about 50 ms.
So my questions are as follows:
Is it potentially an unintended bug? Is it a technical difficulty that will be resolved eventually? Or is nothing wrong, and this is simply the expected latency of Knative?
If nothing's wrong with Google Cloud Run and/or Knative, what is the dominant time-consuming factor here for my API call? I'd love to learn the mechanism.
Additional Details:
Where I am located: Seoul, Asia
The region for my Cloud Run app: us-central1
type of Internet connection I am testing under: Business, Wired
app's container image size: 343.3MB
the bucket location that Container Registry is using: gcr.io
WebPageTest from Seoul/Asia (warmup time):
Content Type: text/html
Request Start: 0.44 s
DNS Lookup: 249 ms
Initial Connection: 59 ms
SSL Negotiation: 106 ms
Time to First Byte: 961 ms
Content Download: 2 ms
WebPageTest from Chicago/US (warmup time):
Content Type: text/html
Request Start: 0.171 s
DNS Lookup: 41 ms
Initial Connection: 29 ms
SSL Negotiation: 57 ms
Time to First Byte: 61 ms
Content Download: 3 ms
[Update: This person has a networking problem in his area. I tested his endpoint from Seattle with no problems. Details in the comments below.]
I have worked with Cloud Run constantly for the past several months. I have deployed several production applications and dozens of test services. I am in Seattle, Cloud Run is in us-central1. I have never noticed a delay. Actually, I am impressed with how fast a container starts up.
For one of my services, I am seeing cold start time to first byte of 485ms. Next invocation 266ms, 360ms. My container is checking SSL certificates (2) on the Internet. The response time is very good.
For another service which is a PHP website, time to first byte on cold start is 312ms, then 94ms, 112ms.
What could be factors that are different for you?
How large is your container image? Check Container Registry for the size. My containers are under 100 MB. The larger the container, the longer the cold start time.
Where is the bucket located that Container Registry is using? You want the bucket to be in us-central1, or at least in the US. This will change soon when new Cloud Run regions are announced.
What type of Internet connection are you testing under? Home-based or business? Wireless or Ethernet? Where in the world are you testing from? Launch a temporary Compute Engine instance, repeat your tests against Cloud Run, and compare. This will remove your ISP from the equation.
Increase the memory allocated to the container. Does this affect performance? Python/Flask does not require much memory; my containers are typically 128 MB and 256 MB. Container images are loaded into memory, so if you have a bloated container, you might not have enough memory left, reducing performance.
What do the Stackdriver logs show you? You can see container starts, requests, and container terminations. (A rough sketch of these checks follows below.)
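For reference, here is a hedged sketch of the kind of checks described above; the service URL and service name are placeholders, not taken from the original post:
# Measure TTFB and total request time from your test machine (or a temporary Compute Engine VM)
curl -s -o /dev/null \
  -w "DNS: %{time_namelookup}s  TLS: %{time_appconnect}s  TTFB: %{time_starttransfer}s  total: %{time_total}s\n" \
  https://helloworld-xyz-uc.a.run.app/
# Read the Cloud Run (Stackdriver) request logs for the service
gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=helloworld' \
  --limit=20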
(Cloud Run product manager here)
We have detected high latency when calling Cloud Run services from some particular regions in the world. Sadly, Seoul seems to be one of them.
We will explicitly capture this as a Known issue and we are working on fixing this before General Availability. Feel free to open a new issue in our public issue tracker

Cloud Composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But some mornings, I see that a few tasks failed without a reason during the night.
When checking the logs using the UI, I see no log, and I see no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean that the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Task Instances" that said tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a better machine (a sketch of both is included at the end of this answer).
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.
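A hedged sketch of the checks and the concurrency change mentioned above; the environment name, location, and concurrency value are placeholders, not from the original post:
# Look for evicted airflow-worker pods in the GKE cluster behind Composer
kubectl get pods --all-namespaces | grep -i evicted
# Lower the Celery worker concurrency for the Composer environment
gcloud composer environments update my-environment \
  --location us-central1 \
  --update-airflow-configs=celery-worker_concurrency=6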

Signature expired: is now earlier than error: InvalidSignatureException

I am trying a small example with AWS API Gateway and IAM authorization. API Gateway generated the endpoint below:
https://xyz1234.execute-api.us-east-2.amazonaws.com/Users/users
with POST action and no parameters.
Initially I had turned off IAM for this POST method, and I verified using Postman that it works.
Then I created a new IAM user and attached the AmazonAPIGatewayInvokeFullAccess policy to the user, thereby giving it permission to invoke any API. I then enabled IAM for the POST method.
I then went back to Postman, added AWS authorization with the access key, secret key, AWS region us-east-2, and service name execute-api, and tried to execute the request, but I got an InvalidSignatureException error with a 403 return code.
The body contains the following message:
Signature expired: 20170517T062414Z is now earlier than 20170517T062840Z (20170517T063340Z - 5 min.)"
What am I missing?
A request signed with AWS sigV4 includes a timestamp for when the signature was created. Signatures are only valid for a short amount of time after they are created. (This limits the amount of time that a replay attack can be attempted.)
When the signature is validated, the timestamp is compared to the current time. If this indicates that the signature was not created recently, then signature validation fails with the error message you mentioned.
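A quick, hedged way to check whether clock skew is your problem (the endpoint here is just an example; any HTTPS AWS endpoint works) is to compare your local UTC time with the Date header an AWS service returns:
# Local clock in UTC
date -u
# Time as reported by an AWS endpoint
curl -sI https://sts.amazonaws.com | grep -i '^date:'
If the two differ by more than a few minutes, fix your clock as described in the answers below.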
If you get this in a Docker container on Windows that uses WSL, then it may help to fix the WSL time by running wsl -d docker-desktop -e /sbin/hwclock -s in PowerShell. You can verify this is the case beforehand by logging into the container, typing date in the terminal, and comparing it with your host machine's time.
A common cause of this is when the local clock on the host generating the signature is off by more than a couple of minutes.
You need to synchronize your machine's local clock with NTP.
For example, on an Ubuntu machine:
sudo ntpdate pool.ntp.org
System time goes out of sync quite often, so you need to keep it in sync periodically.
You can run a daily cron job to keep your system time in sync, as mentioned at this link: Periodically synchronize time in Linux
Create a bash script called ntpdate to sync the time, and put the following into it:
#!/bin/sh
# sync server time
/usr/sbin/ntpdate pool.ntp.org >> /tmp/ntpdate.log
You can place this script anywhere you like and then set up a cron job for it. I will be putting it into the daily cron directory so that it runs once every day; my ntpdate script is now at /etc/cron.daily/ntpdate and it will run every day.
Make this script executable:
chmod +x /etc/cron.daily/ntpdate
Test it by running the script once and look for output in /tmp/ntpdate.log:
/etc/cron.daily/ntpdate
In your log file you should see something like:
26 Aug 12:19:06 ntpdate[2191]: adjust time server 206.108.0.131 offset 0.272120 sec
I faced a similar issue when I used the timedatectl command to change the datetime of the underlying machine. The explanation given by MikeD and others is really informative for fixing the issue.
sudo apt install ntp
sudo apt install ntpdate
sudo ntpdate ntp.ubuntu.com
After synchronizing the time with the correct current datetime, this issue will be resolved.
For me, the issue happened while using WSL. The date in WSL was out of sync.
The solution was to run the command wsl --shutdown and then restart Docker.
This one command did the trick
sudo ntpdate pool.ntp.org
Make sure your PC's clock is set correctly. I faced the same issue and then realized my clock wasn't showing the right time due to some reason. As soon as I corrected the time, it started working fine again! Hope this helped.
I was also facing this issue; I added correctClockSkew: true and the issue was fixed for me:
const nodemailer = require('nodemailer');
const ses = require('nodemailer-ses-transport');
let transporter = nodemailer.createTransport(ses({
    correctClockSkew: true,
    accessKeyId: **,
    secretAccessKey: **,
    region: **
}));
If you are on an AWS EC2 Ubuntu server and somehow not able to fix the time with NTP, try:
sudo date -s "$(wget -qSO- --max-redirect=0 google.com 2>&1 | grep Date: | cut -d' ' -f5-8)Z"
Source: https://askubuntu.com/a/655528
Had this problem on Windows. The current time got out of sync after a power outage. Solved it via: Settings -> Date & time -> Sync now.
For those who face this issue while running Lambda functions (that use other AWS services like DynamoDB) locally with sam local invoke:
The time in the Docker container used by sam may not be in sync with the host. Restarting Docker on the host (Docker Desktop on Windows) should resolve this issue.
I was making AWS API requests from a VM on my local machine. I checked the date was correct and was syncing, but I was still getting the error above. I halted and re-upped my VM and the error went away. I never figured out the exact cause, but "turning it off and back on again" fixed it.
Complementing what miked-at-aws posted about AWS sigV4, there are at least two main possible root causes for the clock skew:
Your CPU is overloaded (reaching 99% usage, or on EC2 instances with CPU limits that run out of CPU credits).
Why would this generate time skew? Because between the moment the AWS SDK creates the timestamp and the moment the request is sent there should normally be no more than a few nano- or microseconds, but if your CPU is overwhelmed it may take several seconds or even minutes to process. For this root cause you will not lose 100% of your events, just some percentage that may not be too big (a sketch for checking this follows below).
For the second root cause, which is that your machine clock simply isn't adjusted, probably 100% of your events are being lost, and you just have to make sure that your machine clock is set and adjusted correctly.
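For the first root cause, a rough way to check whether a burstable EC2 instance has run out of CPU credits is to query the CPUCreditBalance metric; the instance ID and time range below are placeholders:
# Average CPU credit balance over six hours, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2020-08-26T00:00:00Z \
  --end-time 2020-08-26T06:00:00Z \
  --period 300 \
  --statistics Average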
I tried all the solutions related to time sync, but nothing worked. What I did was set the correctClockSkew option to true while creating the service client. This solved my problem.
For instance:
let dynamodb = new AWS.DynamoDB({correctClockSkew: true});
Hope this sorts it out.
Reference: https://github.com/aws/aws-sdk-js/issues/527
I faced this same problem while fetching video from Amazon Kinesis for my local website. To solve it, I installed chrony on my computer, and that solved my problem. You can see the Amazon chrony installation instructions at the following link:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html
What worked for me was to change the time on my computer. I am in the UK, so I put it forward one hour to a European time zone. Then it worked. This is not the best fix, but it worked for me to move forward.
I set the region to eu-west-2, which is London, so I am not sure why it only worked when I put the time on my computer forward an hour. I need to look into that.
Just try to update the system date and time; they might be outdated. Synchronize your clock and reload your console. This worked for me.
This is a question asking for any recent updates/suggestions.
Can this problem be solved using AWS Amplify, the AWS Cognito SDK, or a service worker?
Time synchronization is not working.

Forwarding journald to CloudWatch Logs

I'm a newbie to CentOS and wanted to know the best way to parse journal logs to CloudWatch Logs.
My thought processes so far are:
Use a FIFO to parse the journal logs and ingest them into CloudWatch Logs. It looks like this could come with drawbacks where logs could be dropped if we hit buffering limits.
Forward journal logs to syslog and send the syslog output to CloudWatch Logs.
The idea is essentially to have everything logging to journald as JSON and then forward this across to CloudWatch Logs.
What is the best way to do this? How have others solved this problem?
Take a look at https://github.com/advantageous/systemd-cloud-watch
We had problems with journald-cloudwatch-logs. It just did not work for us at all.
It does not limit the size of the message or commandLine that it sends to CloudWatch, and CloudWatch sends back an error that journald-cloudwatch-logs cannot handle, which makes it go out of sync.
systemd-cloud-watch is stateless and it asks CloudWatch where it left off.
systemd-cloud-watch also creates the log-group if missing.
systemd-cloud-watch also uses the name tag and the private ip address so that you can easily find the log you are looking for.
We also include a Packer file to show you how to build and configure a systemd-cloud-watch image with EC2/CentOS/systemd. There is no question about how to configure systemd because we have a working example.
Take a look at https://github.com/saymedia/journald-cloudwatch-logs by Martin Atkins.
This open source project creates a binary that does exactly what you want - ship your (systemd) journald logs to AWS CloudWatch Logs.
The project depends on libsystemd to forward directly to CloudWatch. It does not rely on forwarding to syslog. This is a good thing.
The project appears to use golang's concurrent channels to read the logs and batch writes.
Vector can be used to ship logs from journald to AWS CloudWatch Logs.
journald can be used as a source and AWS Cloudwatch Logs as a sink.
I'm working on integrating this with an existing deployment of about 6 EC2 instances that generate about 30 GB of logs daily. I'll update this answer with any caveats or gotchas after we've used Vector in production for a few weeks.
EDIT 8/17/2020
A few things to be aware of. The max batch size for PutLogEvents is 1 MB, and there is a max of 5 requests per second per stream. See the limits here.
To help with that, in my setup each journald unit has its own log stream. Also, there are a lot of fields that the Vector journald source includes; I used a Vector transform to remove all the ones I didn't need. However, I'm still running into rate limits.
EDIT 10/6/2020
I have this running in production now. I had to update the version of Vector I was using from 0.8.1 to 0.10.0 to take care of an issue with Vector not respecting the max bytes per batch requirement for AWS CloudWatch Logs. As far as the rate limit issues I was experiencing, it turns out I wasn't actually having any issues. I was getting this message in the Vector logs: tower_limit::rate::service: rate limit exceeded, disabling service. What that actually means is that Vector is temporarily pausing sending logs to respect the rate limit of the sink. Also, each CloudWatch Log Stream can consume up to 18 GB per hour, which is fine for my 30 GB per day requirement for over 30 different services on 6 VMs.
One issue I did run into was causing the CPU to spike on our main API service. I had a source for each service unit to tail the journald logs. I believe this somehow blocked our API from being able to write to journald (not 100% sure though). What I did was use one source and specify multiple units to follow, so there was only one command tailing the logs, and I increased the batch size since each service generates a lot of logs. I then used Vector's template syntax to split the Log Group and Log Stream based on the service name. Below is an example configuration:
[sources.journald_logs]
type = "journald"
units = ["api", "sshd", "vector", "review", "other-service"]
batch_size = 100
[sinks.cloud_watch_logs]
type = "aws_cloudwatch_logs"
inputs = ["journald_logs"]
group_name = "/production/{{host}}/{{_SYSTEMD_UNIT}}"
healthcheck = true
region = "${region}"
stream_name = "{{_SYSTEMD_UNIT}}"
encoding = "json"
I have one final issue I need to iron out, but it's not related to this question. I'm using a file source for nginx since it writes to an access log file. Vector is consuming 80% of the CPU on that machine getting the logs and sending them to AWS CloudWatch. Filebeat also runs on the same box sending the logs to Logstash, but it's never caused any issues. Once we get vector working reliably we'll retire the Elastic Stack, but for now we have them running side by side.

Hello World pipeline with ShellCommandActivity

I'm trying to create a simple Data Pipeline with a single activity of type ShellCommandActivity. I've attached the configuration of the activity and EC2 resource.
When I execute this, the Ec2Resource sits in the WAITING_ON_DEPENDENCIES state and then after some time changes to TIMEDOUT. The ShellCommandActivity is always in the CANCELED state. I see the instance launch and very quickly change to the terminated state.
I've specified an S3 log file URL, but that never gets updated.
Can anyone give me any pointers? Also is there any guidance out there on debugging this?
Thanks!!
You are currently forcing your instance to shut down after 1 minute, which gives the TIMEDOUT status if it can't execute in that time. Try increasing it to 50 minutes.
Also make sure you are using an AMI that runs Amazon Linux and that you are using full absolute paths in your scripts.
S3 log files are written as:
s3://bucket/folder/