I have a batch of Python jobs that differ only in the input file they read, say:
python main.py --input=file1.json > log_file1.txt
python main.py --input=file2.json > log_file2.txt
python main.py --input=file3.json > log_file3.txt
...
All these jobs are independent, and use a prebuilt anaconda environment.
I'm able to run my code on an on-demand EC2 instance using the following workflow:
Mount an EBS volume with the input files and prebuilt conda environment.
Activate the conda environment.
Run python programs, such that each program reads a different input file, and writes to a separate log file. The input files are stored in the EBS volume, and the log files will be written to the EBS volume.
Now, I want to scale this to use AWS spot instances -- basically, if I have N jobs, request N spot instances that run one of the above jobs each to read different files from an existing volume, and write the outputs to different files on the same volume. But I couldn't find a comprehensive guide on how to go about it. Any help would be appreciated.
Maybe this will give you something to ponder, as my solution isn't exactly like yours, but here goes. (Oh, and I'm going to look at Batch as well, I just haven't gotten there.) I have decent-sized stock option files that I analyze and transform for 500 different symbols. I've used some tools to figure out that my memory demands on the largest files are around 4MB max. I spin up one spot instance with at least 30 MB, launched from an image I make of the EC2 instance and EBS store, so it's always like the one I test on, just with more memory.
I run a shell script that breaks up the 500 or so symbols into 6-10 different chunks and runs them concurrently on one machine. I'm not time sensitive, so I don't really need multiple machines in parallel. But I could; I would just run a different script.
Here's the script:
# Launch one background job per 50-symbol slice: 0-50, 50-100, ..., 500-550.
for y in {0..500..50}
do
    start_slice=$y
    end_slice=$((y + 50))
    # echo "$start_slice"
    # echo "$end_slice"
    /usr/local/bin/pipenv run ~/.local/share/virtualenvs/ec2-user-zzkNbF-x/bin/python /home/ec2-user/code/etrade_getoptionsdata/get_bidask_of_baseline_combos_intraday_parallel.py -s "$start_slice" -e "$end_slice" &
    # echo 'next file'
done
My environment is pipenv, and I put the environment path in so it has access to all my modules.
Again, the script just breaks up the same analysis into chunks of 50 symbols each.
In my file I use a for loop that uses the passed-in arguments -s and -e:
for key_cons in keys_list[s:e]:
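A minimal sketch of how the -s and -e arguments might be consumed on the Python side. Only the -s/-e flags and the keys_list[s:e] slice come from the description above; the function names parse_slice and select_keys and the sample symbol list are made up for illustration.

```python
import argparse


def parse_slice(argv=None):
    """Read the -s/-e slice bounds passed in by the shell script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", type=int, default=0, help="start index (inclusive)")
    parser.add_argument("-e", type=int, default=None, help="end index (exclusive)")
    args = parser.parse_args(argv)
    return args.s, args.e


def select_keys(keys_list, s, e):
    """Return the chunk of symbols this process is responsible for."""
    # Slicing past the end of a list is safe: it simply returns fewer items.
    return keys_list[s:e]


# Hypothetical symbol list; the real script would load its own keys_list.
keys_list = [f"SYM{i}" for i in range(500)]

# Equivalent to being started with "-s 50 -e 100" by the shell loop.
s, e = parse_slice(["-s", "50", "-e", "100"])
for key_cons in select_keys(keys_list, s, e):
    pass  # the per-symbol analysis would go here
```

In the real script, parse_slice() would be called without arguments so that it reads the values from the command line. Because out-of-range slices are simply empty, the final 500-550 chunk from the shell loop does no harm when there are exactly 500 symbols.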
To launch the shell script I've been playing around with nohup ./shell.sh & so it runs in the background and won't stop when my SSH session ends.
If you need one instance per job, then that's what it takes. Each individual transformation I run takes 30-45 seconds, so it still takes a couple of hours.
Let me know if you have any questions.
I have an AWS Cloud9 instance that starts running at 11:52 PM MST and stops running at 11:59 PM MST. I have a dockerfile within the instance that, when run with the correct mount, will run a set of C++ .cpp files that collect live web data. The ultimate goal of this instance is to be fully automatic, so that every night it collects the live web data for that date, which is why the instance is open at the very end of the day each night. Is it possible to have my AWS instance run a given command in a terminal window at a certain time, say 11:55 PM, or even upon startup? That is, at that time, or at startup, the command "docker run -it...." would be run within the instance.
Is automating this process possible? I have looked into CloudWatch events and think that might be the best way to go about automating this process but I am not quite sure how I would create a rule to fulfill the job. If it is not possible to automate a certain command within a terminal window, could I automate the dockerfile to run at a certain time?
Of course you can automate running commands, not just docker but any command, using the cron daemon. All you need to do is place your command in a shell script file, say doc.sh, in your desired directory.
SSH into your instance.
Open a terminal and type crontab -e
Enter an entry of the form a b c d e /directory/command
where a = minute, b = hour, c = day of the month, d = month, e = day of the week.
/directory/command specifies the location of the script you want to run.
For more cron examples, see https://www.cyberciti.biz/faq/how-do-i-add-jobs-to-cron-under-linux-or-unix-oses/
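As a rough illustration of the five-field layout described above, here is a small Python helper that assembles a crontab entry. The helper name cron_line and the path /home/ec2-user/doc.sh are assumptions made up for this example; the resulting line would still have to be pasted into crontab -e by hand.

```python
def cron_line(command, minute="*", hour="*", day_of_month="*", month="*", day_of_week="*"):
    """Format one crontab entry: five time fields followed by the command."""
    fields = [minute, hour, day_of_month, month, day_of_week]
    return " ".join(str(f) for f in fields) + " " + command


# Run the (hypothetical) script every day at 23:55.
print(cron_line("/home/ec2-user/doc.sh", minute=55, hour=23))
# prints: 55 23 * * * /home/ec2-user/doc.sh
```

Two things worth checking before relying on this: cron interprets the fields in the instance's own time zone, which may well be UTC rather than MST, and cron jobs have no terminal attached, so the -it flags of docker run would likely need to be dropped inside doc.sh.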
If you have a dockerfile that you want to run for a few minutes a day, you should look into Fargate. You can schedule an event with CloudWatch, run the container and then shut it down when it's done.
It will probably cost around $0.01/day to run this.
How can one download files from a GCP Storage bucket to a Container-Optimised OS (COS) on instance startup?
I know of the following solutions:
gcloud compute copy-files
SSH through console
SCP
Yet all of these have to be done manually and externally after an instance is started.
There is also cloud-init, yet I can't find any info on how to copy files from a Storage bucket. Examples seem to suggest that it's better to include the content of the files in the cloud-init file directly, which is not something I want to do for security reasons. Is it possible to download files from a Storage bucket using cloud-init?
I considered using a startup script, yet COS lacks CLI tools such as gcloud or gsutil to be able to run any such commands in a startup script.
I know I could copy the files manually and then save the image as a boot disk, but I'm hoping there are solutions that avoid having to do so.
Most of all, I'm assuming I'm not asking for something impossible, given that COS instance setup allows me to specify Docker volumes that I could mount onto the starting container. This seems to suggest I should be able to have some private files on the instance the moment COS will attempt to run my image on startup. But how?
Trying to execute a startup-script with a cloud-sdk image and copying files there as suggested by Guillaume didn't work for me for a while, showing this log. Eventually I realised that the cloud-sdk image is 2.41GB when uncompressed and takes over 2 minutes to complete pulling. I tried again with an empty COS instance and the startup script completed successfully, downloading the data from a Storage bucket.
However, a 2.41GB image and over 2 minutes of boot time sound like a bit of an overkill to download a 2KB file. Don't they?
I'm glad to see a working solution to my question (thanks Guillaume!) although I'm still wondering: isn't there a nicer way to do this? I feel that this method is even less tidy than manually putting the files on the COS instance and then creating a machine image to use in the future.
Based on Guillaume's answer I created and published a gsutil wrapper image, available as voyz/gsutil_wrap. This way I am able to run a startup-script with the following command:
docker run -v /host/path:/container/path \
--entrypoint gsutil voyz/gsutil_wrap \
cp gs://bucket/path /container/path
It's essentially a copy of what Guillaume suggested, except it is using an image containing only a minimum setup required to run gsutil. As a result it weighs 0.22GB and pulls within 10-20 seconds on average - as opposed to 2.41GB and over 2 minutes respectively for the google/cloud-sdk image suggested by Guillaume.
Also, credit to this incredibly useful StackOverflow answer that allows gsutil to use the default service account for authentication.
The startup-script is the correct place to do this. And YES, COS lacks some useful libraries.
BUT you can run container! And, for example, the Google Cloud SDK container!
So, add this startup-script in the VM metadata:
key -> startup-script
value ->
docker run -v /local/path/to/copy/files:/dummy/container/path \
--entrypoint gsutil google/cloud-sdk \
cp gs://your_bucket/path/to/file /dummy/container/path
Note: the startup script runs as root. Perform a chmod/chown in your startup script if you need to change the file access mode.
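If the startup-script value has to be generated for several buckets or paths, the command above can be assembled programmatically. The sketch below is meant to run wherever the instance metadata is prepared, not on the COS instance itself; the function name gsutil_copy_command is an assumption, while the paths and image name are the placeholders from the command above.

```python
import shlex


def gsutil_copy_command(gcs_uri, host_dir, container_dir="/dummy/container/path",
                        image="google/cloud-sdk"):
    """Build the argv for the `docker run ... gsutil cp` startup command."""
    return [
        "docker", "run",
        "-v", f"{host_dir}:{container_dir}",  # bind-mount the host directory
        "--entrypoint", "gsutil", image,      # run gsutil instead of the image default
        "cp", gcs_uri, container_dir,         # copy from the bucket into the mount
    ]


cmd = gsutil_copy_command("gs://your_bucket/path/to/file", "/local/path/to/copy/files")
# shlex.join (Python 3.8+) quotes arguments where needed for a shell script.
print(shlex.join(cmd))
```

Passing a different image argument would produce the same command for a smaller gsutil-only image.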
Let me know if you need more explanation on this command line
Of course, with a fresh COS image, the startup time is quite long (pull the container image and extract it).
To reduce the startup time, you can "bake" your image. I mean, start with a COS instance, download/install what you want on it (or only perform a docker pull of the google/cloud-sdk container) and create a custom image from this.
This way, all the required dependencies will be present on the image and the boot will be quicker.
Currently I am trying to download a large dataset (200k+ large images). It's all stored on Google Cloud. The authors provide a wget script to download it:
wget -r -N -c -np --user username --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/
Now it downloads etc., but it's been 2 days and it's still going, and I don't know how long it's going to take. AFAIK it's downloading each file individually. Is there a way for me to download it in parallel?
EDIT: I don't have sudo access to the machine doing the downloading. I just have user access.
wget is a great tool but it is not designed to be efficient for downloading 200K files.
You can either wait for it to finish or find another tool that does parallel downloads. Provided that you have an Internet connection fast enough to support parallel downloads, this might cut the time roughly in half compared to wget.
Since the source is an HTTPS web server, there really is not much you can do to speed this up besides downloading two to four files in parallel. Depending on your Internet speed, distance to the source server, you might not achieve any improvement with parallel downloads.
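A minimal sketch of downloading a handful of files at a time, using only the Python standard library so that no sudo access is needed. It assumes you already have a list of file URLs, which the recursive wget command does not give you directly, and it leaves out the username/password authentication the server requires. The function names fetch_to_dir and download_all are made up for this example.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def fetch_to_dir(url, dest_dir="."):
    """Download one URL into dest_dir and return the local path."""
    name = os.path.basename(urlparse(url).path) or "index.html"
    path = os.path.join(dest_dir, name)
    urllib.request.urlretrieve(url, path)
    return path


def download_all(urls, fetch=fetch_to_dir, max_workers=4):
    """Run fetch over urls with at most max_workers downloads in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with urls.
        return list(pool.map(fetch, urls))
```

Calling download_all(list_of_urls) would then keep up to four transfers running at once, in line with the two-to-four range suggested above. Whether this is actually faster depends on the connection and the server.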
Note: You do not specify what you are downloading onto. If the destination is a Compute Engine VM and you picked a tiny one (f1-micro), you may be resource limited. For any high-speed data transfer, pick at least an n1 instance size.
If you don't know the URLs, use the good old HTTrack website copier to download files in parallel:
httrack -v -w https://user:password@example.com/
The default is 8 parallel connections, but you can use the -cN option to increase it.
If the files are large, you can use aria2c, which downloads a single file over multiple connections:
aria2c -x 16 url
You could find out whether the files are stored in GCS; if so, you can just use
gsutil -m cp -r <src> <destination>
This will download the files in parallel (multi-threaded mode).
Take a look at the updated official MIMIC-CXR https://mimic-cxr.mit.edu/about/download/downloads page.
There you'll find info on how to download via wget (locally) and gsutil (Google Cloud Storage).
Just wondering if there is a way (either with third-party solutions or native) to take snapshots of persistent disks every 10 minutes (or less).
At the moment, the automatic schedule only allows hourly backups.
Thanks,
Anil.
I have found a workaround that uses a couple of bash scripts to make a snapshot of a subset of persistent disks in a project with a manually specified period.
The subset is defined by filtering disks with a label backup=yes. To apply this label to a disk, run this command:
gcloud beta compute disks add-labels <DISK-NAME> --zone=<DISK-LOCATION> --labels=backup=yes
Step by step, this is how it worked for me:
Get the scripts: git clone https://github.com/cizara/google-cloud-auto-snapshot.git
cd into the directory where the code is
Change lines 8 and 11 of entrypoint.sh, writing the period in seconds (e.g. SLEEP=600, for 10 minutes) and the path to the other script, for instance ./google-cloud-auto-snapshot.sh.
Give both scripts execute permission with chmod +x entrypoint.sh google-cloud-auto-snapshot.sh, then run entrypoint.sh.
Note that performing this operation with short periods and too many/large disks can be very expensive.
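To illustrate the general idea of a fixed-period snapshot loop, here is a rough Python sketch. It is not the code from the repository above: the function names, the timestamped naming scheme, and the disk and zone values in the usage note are assumptions, and it skips the backup=yes label filtering entirely. It relies on the gcloud compute disks snapshot command with its --zone and --snapshot-names flags.

```python
import subprocess
import time
from datetime import datetime, timezone


def snapshot_command(disk, zone, now=None):
    """Build a `gcloud compute disks snapshot` call with a timestamped name."""
    now = now or datetime.now(timezone.utc)
    name = f"{disk}-{now:%Y%m%d-%H%M%S}"
    return ["gcloud", "compute", "disks", "snapshot", disk,
            "--zone", zone, "--snapshot-names", name]


def take_snapshot(disk, zone):
    """Run the snapshot command; raises if gcloud reports an error."""
    subprocess.run(snapshot_command(disk, zone), check=True)


def run_every(period_seconds, action, iterations=None, sleep=time.sleep):
    """Call action repeatedly, sleeping period_seconds after each call."""
    count = 0
    while iterations is None or count < iterations:
        action()
        count += 1
        sleep(period_seconds)
```

With a hypothetical disk my-disk in us-central1-a, run_every(600, lambda: take_snapshot("my-disk", "us-central1-a")) would take a snapshot roughly every 10 minutes until interrupted. The cost warning above applies just as much here.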
I have a machine learning project and I have to get data from a website every 15 minutes. I cannot use my own computer, so I will use Google Cloud. I am trying to use Google Compute Engine and I have a script for getting the data (here is the link: https://github.com/BurkayKirnik/Automatic-Crypto-Currency-Data-Getter/blob/master/code.py). This script gets data every 15 minutes and writes it to CSV files. I can run this code by opening an SSH terminal and executing it from there, but it stops working when I close the terminal. I tried to run it from a startup script, but it doesn't work that way either. How can I run this and save the CSV files? BTW, I have to install an API to run the code and I am doing that in the startup script. There is no problem with this part.
Instances running in Google Cloud Platform can be configured with the same tools available in the operating system that they are running. If your instance is a Linux instance, the best method would be to use a cronjob to execute your script repeatedly at your chosen interval.
Once you have accessed the instance via SSH, you can open the crontab configuration file by running the following command:
$ crontab -e
The above command will provide access to your personal crontab configuration (for the user you are logged in as). If you want to run the script as root you can use this instead:
$ sudo crontab -e
You can now edit the crontab configuration and add an entry that tells cron to execute your script at your required interval (in your case every 15 minutes).
Therefore, your crontab entry should look something like this:
*/15 * * * * /path/to/your/script.sh
Notice the first entry is for minutes, so by using the */15, you are telling the cron daemon to execute the script once every 15 minutes.
cron normally picks up the new crontab as soon as you save and exit the editor, so a restart is usually unnecessary. If you want to be certain the change has taken effect, you can restart the cron daemon:
$ sudo service cron restart
If you would like to check the status to ensure the cron service is running you can run:
$ sudo service cron status
Your script will now execute every 15 minutes.
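Since cron now provides the 15-minute schedule, the script itself only needs to do a single fetch and append one row per run, rather than loop forever. A minimal sketch of that shape follows; fetch_price is a placeholder standing in for the real API call, and the column names and file name are made up for illustration.

```python
import csv
import os
from datetime import datetime, timezone


def append_row(csv_path, row, header):
    """Append one row to csv_path, writing the header first if the file is new."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)


def fetch_price():
    """Placeholder for the real API call; returns a made-up value."""
    return 0.0


def main(csv_path):
    """One cron invocation: fetch once, append one timestamped row."""
    timestamp = datetime.now(timezone.utc).isoformat()
    append_row(csv_path, [timestamp, fetch_price()], ["timestamp", "price"])


# main("prices.csv")  # call this from the script that cron runs
```

An absolute path for the CSV file is the safer choice here, because cron jobs do not start in the directory you were in when you edited the crontab.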
In terms of storing the CSV files, you could either program your script to store them on the instance or, alternatively, use a Google Cloud Storage bucket. Files can be copied to buckets easily by making use of the gsutil command (part of the Cloud SDK) as described here. It's also possible to mount buckets as a file system as described here.