How can one download files from a GCP Storage bucket to a Container-Optimised OS (COS) on instance startup?
I know of the following solutions:
gcloud compute copy-files
SSH through console
SCP
Yet all of these have to be done manually and externally after an instance is started.
There is also cloud-init, yet I can't find any info on how to copy files from a Storage bucket with it. Examples seem to suggest that it's better to include the content of files in the cloud-init file directly, which is not something I want to do for security reasons. Is it possible to download files from a Storage bucket using cloud-init?
I considered using a startup script, yet COS lacks CLI tools such as gcloud or gsutil, so there is nothing to run such commands with in a startup script.
I know I could copy the files manually and then save the image as a boot disk, but I'm hoping there are solutions that avoid having to do so.
Most of all, I'm assuming I'm not asking for something impossible, given that COS instance setup allows me to specify Docker volumes that I could mount onto the starting container. This seems to suggest I should be able to have some private files on the instance the moment COS will attempt to run my image on startup. But how?
Trying to execute a startup-script with a cloud-sdk image and copying files there, as suggested by Guillaume, didn't work for me for a while, showing this log. Eventually I realised that the cloud-sdk image is 2.41GB when uncompressed and takes over 2 minutes to pull. I tried again with an empty COS instance and the startup script completed successfully, downloading the data from a Storage bucket.
However, a 2.41GB image and over 2 minutes of boot time sound like overkill for downloading a 2KB file. Don't they?
I'm glad to see a working solution to my question (thanks Guillaume!) although I'm still wondering: isn't there a nicer way to do this? I feel that this method is even less tidy than manually putting the files on the COS instance and then creating a machine image to use in the future.
Based on Guillaume's answer I created and published a gsutil wrapper image, available as voyz/gsutil_wrap. This way I am able to run a startup-script with the following command:
docker run -v /host/path:/container/path \
--entrypoint gsutil voyz/gsutil_wrap \
cp gs://bucket/path /container/path
It's essentially a copy of what Guillaume suggested, except it uses an image containing only the minimal setup required to run gsutil. As a result it weighs 0.22GB and pulls within 10-20 seconds on average, as opposed to 2.41GB and over 2 minutes respectively for the google/cloud-sdk image suggested by Guillaume.
Also, credit to this incredibly useful StackOverflow answer that shows how to let gsutil use the default service account for authentication.
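For reference, my understanding of that answer (an assumption on my part, since the link isn't reproduced here) is that it amounts to giving gsutil a boto config that points at the VM's default service account. A rough sketch of that idea, with placeholder paths:

# Written from the startup script: a minimal boto config telling gsutil to use
# the VM's default service account via the metadata server.
cat > /tmp/boto.cfg <<'EOF'
[GoogleCompute]
service_account = default
EOF

docker run -v /tmp/boto.cfg:/etc/boto.cfg:ro \
    -v /host/path:/container/path \
    --entrypoint gsutil voyz/gsutil_wrap \
    cp gs://bucket/path /container/path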
The startup-script is the correct place to do this. And YES, COS lacks some useful libraries.
BUT you can run containers! And, for example, the Google Cloud SDK container!
So, add this startup-script in the VM metadata:
key -> startup-script
value ->
docker run -v /local/path/to/copy/files:/dummy/container/path \
--entrypoint gsutil google/cloud-sdk \
cp gs://your_bucket/path/to/file /dummy/container/path
Note: the startup script is run as root. Perform a chmod/chown in your startup script if you need to change the file access mode.
Let me know if you need more explanation of this command line.
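If you prefer setting this from the command line rather than in the console, the metadata can be attached when creating the VM, for example (the instance and file names are placeholders):

# Create a COS VM and attach the local startup.sh as its startup-script metadata.
gcloud compute instances create my-cos-vm \
    --image-family=cos-stable \
    --image-project=cos-cloud \
    --metadata-from-file=startup-script=startup.sh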
Of course, with a fresh COS image, the startup time is quite long (the container image has to be pulled and extracted).
To reduce the startup time, you can "bake" your image. I mean, start with a COS image, download/install what you want on it (or only perform a docker pull of the google/cloud-sdk container) and create a custom image from this.
Like this, all the required dependencies will be present on the image and boot will be quicker.
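For example, once an instance has been prepared this way (the instance, zone, and image names are placeholders, and the boot disk is assumed to keep the default name of its instance):

# Stop the prepared instance, then create a reusable custom image from its boot disk.
gcloud compute instances stop my-prepared-cos-vm --zone=us-central1-a
gcloud compute images create my-baked-cos-image \
    --source-disk=my-prepared-cos-vm \
    --source-disk-zone=us-central1-a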
Related
I have a batch of Python jobs that only differ in the input file they read, say:
python main.py --input=file1.json > log_file1.txt
python main.py --input=file2.json > log_file2.txt
python main.py --input=file3.json > log_file3.txt
...
All these jobs are independent, and use a prebuilt anaconda environment.
I'm able to run my code on an on-demand EC2 instance using the following workflow:
Mount an EBS volume with the input files and prebuilt conda environment.
Activate the conda environment.
Run python programs, such that each program reads a different input file, and writes to a separate log file. The input files are stored in the EBS volume, and the log files will be written to the EBS volume.
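In shell terms, that workflow looks roughly like this (the device name, mount point, environment name, and file names are placeholders):

# Mount the EBS volume holding the input files and the prebuilt conda environment.
sudo mount /dev/xvdf /data

# Activate the prebuilt conda environment stored on the volume.
source /data/miniconda3/bin/activate my_env

# Each job reads a different input file and writes its own log back to the volume.
python main.py --input=/data/file1.json > /data/log_file1.txt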
Now I want to scale this to AWS spot instances -- basically, if I have N jobs, request N spot instances, each running one of the above jobs, reading a different file from an existing volume and writing its output to a different file on the same volume. But I couldn't find a comprehensive guide on how to go about it. Any help would be appreciated.
Maybe this will give you something to ponder, as my solution isn't exactly like yours, but here goes (oh, and I'm going to look at Batch as well, just haven't gotten there). I have decent-sized stock option files that I analyze and transform for 500 different symbols. I've used some tools to figure out that my memory demands on the largest files are around 4MB max. I spin up one spot instance with at least 30 MB, built from an image I make of the EC2 instance and EBS store, so it's always like the one I test on, just with more memory.
I run a shell script that breaks up the 500 or so symbols into 6-10 different chunks and runs them concurrently on one machine. I'm not time-sensitive, so I don't really need multiple machines in parallel. But I could; I would just run a different script.
Here's the script:
for y in {0..500..50}
do
    start_slice=$y
    end_slice=$(($y + 50))
    # echo $start_slice
    # echo $end_slice
    /usr/local/bin/pipenv run ~/.local/share/virtualenvs/ec2-user-zzkNbF-x/bin/python /home/ec2-user/code/etrade_getoptionsdata/get_bidask_of_baseline_combos_intraday_parallel.py -s $start_slice -e $end_slice &
    # echo 'next file'
done
My environment is pipenv, and I put the environment path in so it has access to all my modules.
Again, the script just breaks the same analysis up into chunks of 50 symbols each.
In my file I use a for loop that uses the passed-in arguments -s and -e:
for key_cons in keys_list[s:e]:
To launch the shell script I've been playing around with nohup ./shell.sh & so it runs in the background and won't stop when my SSH session ends.
If you need one instance per job, then that's what it takes. Each individual transformation I run takes 30-45 seconds, so it still takes a couple of hours.
Let me know if you have any questions.
I have multiple projects in Google Cloud and I need to find out the unused external IP addresses in all the projects. I have a query that works for one project, but is there a way to run a query across all projects together?
I am trying to avoid the time and effort of switching projects every time.
Command to list reserved external IPs in a single project: gcloud compute addresses list --filter=status:reserved
For a process like this, it would be better to create a script that runs it for you! One great thing about gcloud commands is that they can be used in shell scripts to make things like this possible!
Open Cloud Shell in GCP, create a file called "script.sh" and write something like this to the file...
# The line below does an action for every project in the project list
for project in $(gcloud projects list --format='value(project_id)');
do
    # This gcloud command will run for every project in the project list
    echo $(gcloud compute addresses list --project=$project --filter=status:reserved)
    # output to csv
done >> output.csv
Once this is done, make the script executable by typing...
chmod 755 script.sh
then run the script...
./script.sh
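If you want tidier CSV output, a variant along these lines should also work (the chosen columns are an assumption; the filter is the same as above):

# One header row, then one CSV row per reserved address, prefixed with its project ID.
echo "project,name,address,region,status" > output.csv
for project in $(gcloud projects list --format='value(project_id)'); do
    gcloud compute addresses list --project="$project" --filter=status:reserved \
        --format='csv[no-heading](name,address,region,status)' | sed "s/^/$project,/" >> output.csv
done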
Let me know if this helps! Comment to this answer if you need more clarification or help!
Google Cloud Run does not support the Docker registry (Docker Hub), so I have to manually pull the image, tag it, and push it to GCR.
Container image URL should match pattern [region.]gcr.io/repo-path[:tag or #digest]
Is there any simpler way to do this?
Sadly, that's the easiest way to move a Docker image from one container registry to another one.
Just for documentation purposes, I will add the steps for the benefit of the community:
Pull the Docker image using the following command:
docker pull [REPOSITORY-NAME]/[IMAGE]:[TAG]
Then, tag that pulled image using the following command:
docker tag [IMAGE] gcr.io/[PROJECT-ID]/[IMAGE]
Push that image to your gcr repository using the following command:
docker push gcr.io/[PROJECT-ID]/[IMAGE]
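For example, with hypothetical names (nginx pulled from Docker Hub into a project called my-project):

docker pull nginx:latest
docker tag nginx:latest gcr.io/my-project/nginx:latest
docker push gcr.io/my-project/nginx:latest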
I'm afraid that, in any case, "simpler" won't be a thing. Though you may try to use Docker webhooks to call a simple Cloud Function (pull, tag, push) in order to keep your images in sync in your GCR.
There seem to be some projects that manage that kind of hassle, like dregsy, but I didn't try them...
I've been working on some tooling called regclient that supports this use case. For copying a single image, the command would be:
regctl image copy ${source} ${target}
e.g.
regctl image copy ubuntu:latest gcr.io/your-project/ubuntu:latest
This checks the digests before copying with a HEAD request to allow the command to be run frequently but only using your quota when the upstream image doesn't match what's on GCR. It also copies multi-platform images which you wouldn't get with a docker pull and docker push (docker dereferences the image to your platform on the pull). And unlike the docker pull, the individual layers are only copied when they don't exist on the target registry.
If you have lots of images to continuously mirror, there's also a regsync command that copies according to a yaml file with a list of images, tags, and schedule to run the copies.
These can run as containers, but they are also available as standalone binaries that don't require docker to run.
Does anyone know of a way to persist configurations done using "gcloud init" commands inside cloudshell, so they don't vanish each time you disconnect?
I figured out how to persist Python pip installs using the --user flag, for example: pip install --user pandas
But, when I create a new configuration using gcloud init, use it for a bit, close cloudshell (or cloudshell times out on me), then reconnect later, the configurations are gone.
Not a big deal, but I bounce between projects etc., so it's nice to have the configs saved so I can simply run
gcloud config configurations activate config-name
Thanks...Rich Murnane
Google Cloud Shell only persists data in your $HOME directory. Commands like gcloud init modify environment variables and store configuration files in /tmp, which is deleted when the VM is restarted. The VM is terminated after being idle for 20 minutes or 60 minutes, depending on which document you read.
Google Cloud Shell is a Docker container. You can modify the Docker image to customize it to fit your needs. This method will allow you to install packages, tools, etc. that are not located in your $HOME directory.
You can also store your files and configuration scripts on Google Cloud Storage. Modify .bashrc to download your cloud files and run your configuration script.
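For example, the .bashrc hook could look something like this (the bucket and script names are assumptions):

# Added at the end of ~/.bashrc: fetch a personal setup script from Cloud Storage and run it.
if gsutil -q cp gs://my-config-bucket/cloudshell-setup.sh "$HOME/cloudshell-setup.sh"; then
    source "$HOME/cloudshell-setup.sh"
fi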
Either method will allow you to create a persistent environment.
This StackOverflow answer covers in detail what gcloud init does and how to basically emulate the same thing via script or command line.
gcloud init details
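In short, the same state can be recreated with a few explicit commands, along these lines (the configuration, account, and project names are placeholders):

# Create a named configuration and set the properties gcloud init would normally prompt for.
gcloud config configurations create my-config
gcloud config set account rich@example.com
gcloud config set project my-project-id
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a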
This isn't exactly what I wanted, but since my account (userid) isn't changing, I'm simply going to do the command
gcloud config set project second-project-name
good enough, thanks...Rich
We have some docker images we build with sbt-native-packager that need to interact with AWS services. When running them outside of AWS, we need to explicitly provide credentials.
I know we can explicitly pass environment variables containing the AWS credentials. Doing this complicates keeping our credentials secret. One option is to provide them via the command line, typically storing them in our shell history (yes, I know this can be avoided by adding a space to the start of the command, but that is easy to forget) and putting them at higher risk of accidental copy/paste sharing. Alternatively, we can provide them via an env-file, but this exposes us to possibly checking them into version control or pushing them to another server unintentionally.
We've found that the ideal practice is to mount our local ~/.aws/ directory into the running user's home directory for the docker container. However, our attempts at getting this to work with the sbt-native-packager images have been unsuccessful.
One unique detail of sbt-native-packager images (compared to our others) is that they are built using Docker's ENTRYPOINT instead of CMD to start the application. I don't know if this has a bearing on the problem.
So the question: Is it possible to provide AWS credentials to a docker container created by sbt-native-packager by mounting the AWS credentials folder via command line parameters at startup?
The problem I was running into was related to permissions. The .aws files have very restricted access on my machine, and the default user within the sbt-native-packager image is daemon. This user does not have access to read my files when mounted into the container.
I am able to obtain the behavior I desire by adding the following flags to my docker run command: -v ~/.aws/:/root/.aws/ --user=root
I was able to discover this by using the --entrypoint=ash flag when running the container, looking at the HOME environment variable (the location to mount the /.aws/ folder) and attempting to cat the contents of the mounted folder.
Now I just need to understand what security vulnerabilities I'm opening myself up to by running docker containers in this way.
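Put together, the run command looks roughly like this (the image name is a placeholder, and the read-only flag on the mount is an extra precaution beyond the flags above):

# Mount the host's AWS credentials into root's home and run as root so the
# restricted ~/.aws files are readable inside the container.
docker run --rm \
    --user=root \
    -v ~/.aws/:/root/.aws/:ro \
    my-sbt-native-packager-image:latest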
I'm not entirely sure why mounting ~/.aws would be a problem - typically it could be related to read permissions on that directory and the different UID between the host system and the container.
That said, I can suggest a couple of workarounds:
Use an environment variable file instead of explicitly specifying the variables on the command line. In docker run, you can do this by specifying --env-file. To me this sounds like the simplest approach.
Mount a different credentials file and provide the AWS_CONFIG_FILE environment variable to specify its location.
Both are sketched below.
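Minimal sketches of both workarounds, with placeholder file and image names:

# Workaround 1: keep the credentials in an env file that never enters version control,
# e.g. aws.env containing AWS_ACCESS_KEY_ID=... and AWS_SECRET_ACCESS_KEY=... lines.
docker run --rm --env-file aws.env my-app-image:latest

# Workaround 2: mount the file elsewhere and tell the SDK where to find it.
# (AWS_SHARED_CREDENTIALS_FILE is the matching variable if you mount a credentials file.)
docker run --rm \
    -v ~/.aws/config:/opt/aws/config:ro \
    -e AWS_CONFIG_FILE=/opt/aws/config \
    my-app-image:latest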