When triggering an AWS Batch job (Fargate Job Queue), the status goes to FAILED with the following error message:
Cannotstartcontainererror: ResourceInitializationError: unable to
create new container: mount callback failed on
/tmp/containerd-mount3975084381: no users found
Unfortunately I can't find any similar errors online.
For reference, the Dockerfile that I'm building is simply the following:
FROM python:3.8-slim-buster
WORKDIR /app
USER root
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
COPY . .
CMD [ "python3", "run.py"]
And the contents of run.py are as follows:
print("Python script has run!")
The only other file in the image is requirements.txt, which contains just the line requests.
Fixed my own issue:
The job definition had the user set to ubuntu, which doesn't exist in the python:3.8-slim-buster image (there is no such entry in its /etc/passwd, which is presumably why the mount callback reports "no users found").
Changing this to root fixed the issue.
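For reference, the relevant fragment of the Batch job definition looks roughly like this (a minimal sketch; the image URI is a placeholder, and the user field is the part that had to change):
{
  "containerProperties": {
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/my-image:latest",
    "user": "root"
  }
}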
I was trying to deploy a GCP Cloud Function using GitHub Actions and I got this error:
ERROR: (gcloud.functions.deploy) OperationError: code=3, message=Build failed: ERROR: Invalid requirement: 'Creating virtualenv xxxxxx-nJ6OriN1-py3.8 in /github/home/.cache/pypoetry/virtualenvs' (from line 1 of requirements.txt)
Hint: It looks like a path. File 'Creating virtualenv retry-failed-reports-nJ6OriN1-py3.8 in /github/home/.cache/pypoetry/virtualenvs' does not exist.; Error ID: 0ea8a540
Up until today, my CI/CD pipeline was working as expected; then all GitHub workflows started failing with this problem.
I was able to deploy the function manually using gcloud; it fails at the Cloud Build step.
How can I solve this issue?
We were using this command to generate the requirements.txt:
poetry export -f requirements.txt --without-hashes > requirements.txt
However, this stopped working: the shell redirect captures everything Poetry writes to stdout, so when Poetry creates a virtualenv in the CI environment its "Creating virtualenv ..." message ends up as the first line of requirements.txt, which is exactly what the error above complains about.
I tried this command instead, which writes only the exported requirements to the file, and it works fine:
poetry export -f requirements.txt --output requirements.txt --without-hashes
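A quick sanity check after the export (a sketch; the exact first line depends on your dependencies):
head -n 1 requirements.txt   # should be a package spec, not a Poetry log line like "Creating virtualenv ..."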
I have a weird error. I'm trying to run a Python script in ECS; the Dockerfile is pretty basic:
FROM python:3.8
COPY . /
RUN pip install -r requirements.txt
CMD ["python", "./get_historical_data.py"]
Building this on my local machine works perfectly:
docker run --network=host historical-price
I uploaded this image to ECR and ran it on ECS with a basic config: I just set the container name, pointed the image to my ECR repo, and set some environment variables. When I run this I get:
Status reason CannotStartContainerError: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "python": executable file not found in $PATH: unknown
but (really weird) if I log in to the EC2 server and run the container manually
docker run -it -e TICKER='SOL/USDT' -e EXCHANGE='BINANCE' -e DB_HOST='xxx' -e DB_NAME='xxx' -e DB_PASSWORD='xxx' -e DB_PORT='xxx' -e DB_USER='xxx' xxx.dkr.ecr.ap-southeast-2.amazonaws.com/xxx:latest /bin/bash
I can see it running OK...
I've tried several Dockerfiles, using
CMD python ./get_historical_data.py
or using the python3 command instead of python.
I also tried skipping the CMD instruction in the Dockerfile and adding the command in the ECS task definition instead (roughly as in the sketch below).
Nothing works...
I really don't know what could be happening here, because last week I ran a similar task and it worked perfectly. I hope you can help me.
Thank you; please let me know if you need more details.
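For reference, overriding the container command in the ECS task definition, as mentioned above, would look roughly like this (a sketch; the container name is an assumption, and the image and script path are taken from the question):
{
  "containerDefinitions": [
    {
      "name": "historical-price",
      "image": "xxx.dkr.ecr.ap-southeast-2.amazonaws.com/xxx:latest",
      "command": ["python", "./get_historical_data.py"]
    }
  ]
}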
I'm trying to run my Python Dataflow job with a Flex Template. The job works fine locally when I run it with the direct runner (without the Flex Template); however, when I try to run it with the Flex Template, the job gets stuck in the "Queued" status for a while and then fails with a timeout.
Here are some of the logs I found in the GCE console:
INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template/requirements.txt', '--exists-action', 'i', '--no-binary', ':all:'
Shutting down the GCE instance, launcher-202011121540156428385273524285797, used for launching.
Timeout in polling result file: gs://my_bucket/staging/template_launches/2020-11-12_15_40_15-6428385273524285797/operation_result.
Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service my_service_account@developer.gserviceaccount.com may not have enough permissions to pull container image gcr.io/indigo-computer-272415/samples/dataflow/streaming-beam-py:latest or create new objects in gs://my_bucket/staging/template_launches/2020-11-12_15_40_15-6428385273524285797/operation_result.
3. Transient errors occurred, please try again.
For 1, I see no useful log. For 2, the service account is the default service account, so it should have all the permissions.
How can I debug this further?
Here is my Dockerfile:
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
ADD localdeps localdeps
COPY requirements.txt .
COPY main.py .
COPY setup.py .
COPY bq_field_pb2.py .
COPY bq_table_pb2.py .
COPY core_pb2.py .
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
RUN pip install -U --no-cache-dir -r ./requirements.txt
I'm following this guide - https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates
A possible cause of this issue can be found in the requirements.txt file. If you are trying to install apache-beam via the requirements file, the Flex Template will experience exactly the issue you are describing: jobs stay in the Queued state for some time and finally fail with Timeout in polling result.
The reason is that they are affected by this issue; the launcher runs pip download with --no-binary :all: over requirements.txt (as in the log above), which forces apache-beam and its dependencies to be built from source and can easily exceed the launch timeout. This only affects Flex Templates; the jobs run properly locally or with Standard Templates.
The solution is to install it separately in the Dockerfile:
RUN pip install -U apache-beam==<your desired version>
RUN pip install -U -r ./requirements.txt
You can also download the requirements into the image to speed up launching the Dataflow job:
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
COPY . .
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
RUN apt-get update \
# Upgrade pip and install the requirements.
&& pip install --no-cache-dir --upgrade pip \
&& pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE \
# Download the requirements to speed up launching the Dataflow job.
&& pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
# Since we already downloaded all the dependencies, there's no need to rebuild everything.
ENV PIP_NO_DEPS=True
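With this image built and pushed, the template is then built and launched roughly as follows (a sketch; the template spec path and region are assumptions, while the image and bucket names are taken from the logs above):
gcloud dataflow flex-template build gs://my_bucket/templates/streaming-beam-py.json \
    --image gcr.io/indigo-computer-272415/samples/dataflow/streaming-beam-py:latest \
    --sdk-language PYTHON

gcloud dataflow flex-template run "streaming-beam-py-$(date +%Y%m%d-%H%M%S)" \
    --template-file-gcs-location gs://my_bucket/templates/streaming-beam-py.json \
    --region us-central1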
I have a machine learning model with a Flask API, dockerized, and when I try to deploy it I get a vague error; checking the expanded logs I can see that it was a memory error when installing TensorFlow.
As an aside, I have also included the model.h5 file inside the Docker image; should I not be doing this? Otherwise the image just contains the .py and config files (no venv in the directory).
This is my Dockerfile (it works locally, by the way, but Beanstalk has a memory limit which is getting hit and I'm not sure why):
FROM python:latest
COPY . /app
WORKDIR /app
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 5000
ENTRYPOINT ["python"]
CMD ["app.py", "--host=0.0.0.0"]
I have a custom build step in Google Cloud Build which first builds a Docker image and then deploys it as a Cloud Run service.
This last step fails, with the following log output;
Step #2: Deploying...
Step #2: Setting IAM Policy.........done
Step #2: Creating Revision............................................................................................................................failed
Step #2: Deployment failed
Step #2: ERROR: (gcloud.run.deploy) Cloud Run error: Invalid argument error. Invalid ENTRYPOINT. [name: "gcr.io/opencobalt/silo@sha256:fb860e758eb1957b90ff3761fcdf68dedb9d10f832f2bb21375915d3de2aaed5"
Step #2: error: "Invalid command \"/bin/sh\": file not found"
Step #2: ].
Finished Step #2
ERROR
ERROR: build step 2 "gcr.io/cloud-builders/gcloud" failed: step exited with non-zero status: 1
The build steps look like this;
["run","deploy","silo","--image","gcr.io/opencobalt/silo","--region","us-central1","--platform","managed","--allow-unauthenticated"]}
The image is built and exists in the registry, and if I change the last build step to deploy a Compute Engine VM instead, it works. Those build steps look like this;
{"name":"gcr.io/cloud-builders/gcloud","args":["compute","instances",
"create-with-container","silo","--container-image","gcr.io/opencobalt/silo","--zone","us-central1-a","--tags","silo,pharo"]}
I can also build the image locally but run into the same error when running gcloud run deploy locally.
I am trying to figure out how to solve this problem. The image works, since it runs fine locally and runs fine when deployed as a Compute Engine VM; the error only shows up when I try to deploy the image as a Cloud Run service.
(added) The Dockerfile looks like this;
######################################
# Based on Ubuntu image
######################################
FROM ubuntu
######################################
# Basic project infos
######################################
LABEL maintainer="PeterSvensson"
######################################
# Update Ubuntu apt and install some tools
######################################
RUN apt-get update \
&& apt-get install -y wget \
&& apt-get install -y git \
&& apt-get install -y unzip \
&& rm -rf /var/lib/apt/lists/*
######################################
# Have an own directory for the tool
######################################
RUN mkdir webapp
WORKDIR webapp
######################################
# Download Pharo using Zeroconf & start script
######################################
RUN wget -O- https://get.pharo.org/64/80+vm | bash
COPY service_account.json service_account.json
RUN export certificate="$(cat service_account.json)"
COPY load.st load.st
COPY setup.sh setup.sh
RUN chmod +x setup.sh
RUN ./setup.sh; echo 0
RUN ./pharo Pharo.image load.st; echo 0
######################################
# Expose port 8080 of Zinc outside the container
######################################
EXPOSE 8080
######################################
# Finally run headless as server
######################################
CMD ./pharo --headless Pharo.image --no-quit
Any advice warmly welcome.
Thank you.
After a lot of testing, I managed to get further. It seems that the missing /bin/sh error is a red herring.
I tried to change the startup command from CMD to ENTRYPOINT, since that was mentioned in the error, but it did not work. However, when I copied the startup instruction into a new file 'startup.sh' and changed the last line of the Dockerfile to;
ENTRYPOINT ./startup.sh
It did work. I needed to chmod +x the new file of course, but the strange thing is that ENTRYPOINT ./pharo --headless Pharo.image --no-quit gave the same error, and even ENTRYPOINT ["./pharo", "--headless", "Pharo.image", "--no-quit"] also gave the same error.
But having just one argument to ENTRYPOINT made Cloud Run work. Go figure.
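For reference, startup.sh in this workaround is just the original startup command wrapped in a script, along these lines:
#!/bin/sh
# Same command as the original CMD line, moved into a script
exec ./pharo --headless Pharo.image --no-quit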
It appears that Google Cloud Run has a dislike for the ubuntu:20.04 image. I have the exact same problem with a Play framework application.
The command
ENTRYPOINT /opt/play-codecheck/bin/play-codecheck -Dconfig.file=/opt/codecheck/production.conf
failed with
error: "Invalid command \"/bin/sh\": file not found"
I also tried
ENTRYPOINT ["/bin/bash", "/opt/play-codecheck/bin/play-codecheck", "-Dconfig.file=/opt/codecheck/production.conf"]
and was rewarded with
error: "Invalid command \"/bin/bash\": file not found"
The trick of putting the command in a shell script didn't work for me either. However, when I changed
FROM ubuntu:20.04
to
FROM ubuntu:18.04
the image deployed. At this point, that's an acceptable fix for me, but it seems like something that Google needs to address.
See also:
Unable to deploy Ubuntu 20.04 Docker container on Google Cloud Run
My workaround was to use a CMD directive that calls Python directly rather than a shell (either /bin/sh or /bin/bash). It's working well so far.
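In Dockerfile terms, that means using the exec (JSON array) form of CMD, which starts the process directly without going through /bin/sh; the script name here is a placeholder:
CMD ["python3", "main.py"]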