Recently I have been working with Google Cloud Compute Engine to train an ML model.
So I am trying to extract a .7z file that contains the data.
But it is too big, and the machine either freezes or stops with an uncaught error.
I am using the Linux command below:
!7zr 'path of the file'
Any help with extracting the file would be appreciated. Thanks in advance.
You could try doing this through GCS (Google Cloud Storage).
Create a directory that only has the compressed file in it and nothing else,
yourdir/myfile.7z
Create an environment variable MYFILE=myfile.7z
Create a bucket on GCS using the gsutil CLI (mb takes a bucket name only, not a path inside the bucket):
gsutil mb gs://MYBUCKET
Next, upload the file to the bucket, like so:
gsutil -m cp -v $MYFILE gs://MYBUCKET/MY_DIR_FOR_ZIP_FILE/
Within the VM you can now download the file, again using the gsutil CLI:
gsutil -m cp -v gs://MYBUCKET/MY_DIR_FOR_ZIP_FILE/$MYFILE /YOUR_DIR/
Then extract it and remove the compressed file (set MYFILE on the VM as well, and run this from /YOUR_DIR):
7z x $MYFILE && rm -v $MYFILE
You should now have the uncompressed file on the VM
Make sure to use the -m flag, which is a top-level gsutil option and goes before the cp subcommand (gsutil -m cp ...). It performs a parallel (multi-threaded/multi-processing) copy.
Here is the reference cp - Copy files and objects
Using the gsutil tool
The instructions above assume that the size of your data is less than 1TB, and that you are using a VM with a disk large enough to accommodate the data.
If your data is more than 1TB, you will need to use Transfer service for on-premises data.
The steps to follow when setting up transfer jobs are listed here
Creating a transfer job
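The disk-size assumption above can be checked before extracting. This is only a minimal sketch, assuming a POSIX shell with df and awk available; the function name has_free_space and the kilobyte threshold are hypothetical and not part of gsutil or 7z:

```shell
#!/bin/sh
# has_free_space DIR REQUIRED_KB
# Succeeds (exit 0) when the filesystem holding DIR has at least REQUIRED_KB
# kilobytes available. The name and the threshold are illustrative only.
has_free_space() {
    dir=$1
    required_kb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 of the
    # second line is the available space.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}
```

For example, has_free_space /YOUR_DIR 50000000 && 7z x "$MYFILE" would only start the extraction when roughly 50 GB is free. The right threshold depends on the uncompressed size, which 7z l lists for an archive.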
Related
I am trying to backup all of our Google Cloud data to an external storage device.
There is a lot of data so I am attempting to download the entire bucket at once and am using the following command to do so, but it halts saying that there isn't enough storage on the device to complete the transfer.
gsutil -m cp -r \
"bucket name" \
.
What do I need to add to this command to download this information to my local D: drive? I have searched through the available docs and have not been able to find the answer.
I used the gsutil command that GCP provided for me automatically, but it seems to be trying to copy the files to a destination without enough storage to hold the needed data.
Remember that you are running the command from Cloud Shell and not in a local terminal or the Windows Command Line. If you inspect Cloud Shell's file system, it resembles a Unix environment, so the destination has to be a Unix-style path such as ~/bucketfiles/. Even a simple gsutil -m cp -R gs://bucket-name.appspot.com ./ will work, since Cloud Shell resolves ./ to the current directory.
A workaround to this is to perform the command on your Windows Command Line. You would have to install Google Cloud SDK beforehand.
Alternatively, this can also be done in Cloud Shell, albeit with an extra step:
Download the bucket objects by running gsutil -m cp -R gs://bucket-name ~/ which will download it into the home directory in Cloud Shell
Transfer the files downloaded in the ~/ (home) directory from Cloud Shell to the local machine either through the User Interface or by running gcloud alpha cloud-shell scp.
Using FFmpeg, I'm trying to write the output file to an S3 bucket.
ffmpeg -i myfile.mp4 -an -crf 20 -vf crop=200:200 -s 800x600 -f mp4 pipe:1 | aws s3 cp - s3://my.test.bucket
I have already been advised that this cannot be done, since creating an mp4 file requires seeking and piping doesn't allow seeking. If I change this command to store the file on the local disk
ffmpeg -i myfile.mp4 -an -crf 20 -vf crop=200:200 -s 800x600 myfile.mp4
it is stored locally under the project root folder, which is fine.
But since I'm running my app from a container, and ffmpeg itself is installed in the Dockerfile, I'm trying to figure out what the possible options are (if an mp4 cannot be stored on S3 directly from the ffmpeg command).
I need to download the output file myfile.mp4 to a path on the server. If I use IWebHostEnvironment, where would it actually be saved? Is it inside the container? Can I mount an S3 bucket folder in the Dockerfile and use it from the ffmpeg command?
Since my input file is in an S3 bucket and I want my output file to be in the same S3 bucket, is there any solution where I wouldn't need to download the output file from ffmpeg and upload it again?
I guess this is a lot of questions, but I feel like I have gone down a rabbit hole here.
There are really a lot of questions. :D
To make it fair, a few questions from me, to see if I understand everything.
Where is your docker container running? Lambda, ec2 machine, kubernetes cluster?
If it is on ec2, you can use https://aws.amazon.com/efs/ but....
Can you simply save the file in /tmp, and then run an aws s3 cp command from the /tmp folder?
In some environments (for example Lambda), /tmp was the only place where I had programmatic access to the file system.
Although, if I understand correctly, you do have write permissions in your environment, because you download the original file from the S3 bucket. So can you do something like this?
download source file from s3
create new file with ffmpeg
upload the file to s3
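The three steps above can be sketched as a single shell function. This is an illustration rather than a definitive implementation: it assumes the aws CLI and ffmpeg are both on the PATH inside the container and that /tmp is writable. The function name transcode_via_tmp and its arguments are hypothetical; the ffmpeg flags are the ones from the question.

```shell
#!/bin/sh
# transcode_via_tmp BUCKET INPUT_KEY OUTPUT_KEY
# Hypothetical helper: downloads the source from S3 into a scratch directory
# under /tmp, runs ffmpeg on it, uploads the result, then removes the scratch
# files whether or not the steps succeeded.
transcode_via_tmp() {
    bucket=$1
    input_key=$2
    output_key=$3
    workdir=$(mktemp -d /tmp/transcode.XXXXXX) || return 1

    # Each step runs only if the previous one succeeded.
    aws s3 cp "s3://$bucket/$input_key" "$workdir/input.mp4" &&
    ffmpeg -i "$workdir/input.mp4" -an -crf 20 -vf crop=200:200 -s 800x600 "$workdir/output.mp4" &&
    aws s3 cp "$workdir/output.mp4" "s3://$bucket/$output_key"
    status=$?

    rm -rf "$workdir"
    return $status
}
```

Because ffmpeg writes to a regular file here, it can seek while building the mp4, and the local copy only lives for the duration of the call.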
I have a problem downloading an entire folder in GCP. How should I download the whole bucket? I run this command in the GCP Shell environment:
gsutil -m cp -R gs://my-uniquename-bucket ./C:\Users\Myname\Desktop\Bucket
and I get an error message: "CommandException: Destination URL must name a directory, bucket, or bucket subdirectory for the multiple source form of the cp command. CommandException: 7 files/objects could not be transferred."
Could someone please point out the mistake in the code line?
To download an entire bucket, you must first install the Google Cloud SDK,
then run this command:
gsutil -m cp -R gs://project-bucket-name path/to/local
where path/to/local is the destination path on your local machine.
The error lies within the destination URL as specified by the error message.
"I run this code in GCP Shell Environment"
Remember that you are running the command from Cloud Shell and not in a local terminal or the Windows Command Line. Thus, it is throwing that error because it cannot find the path you specified. If you inspect Cloud Shell's file system, it resembles a Unix environment, so the destination has to be a Unix-style path such as ~/bucketfiles/. Even a simple gsutil -m cp -R gs://bucket-name.appspot.com ./ will work, since Cloud Shell resolves ./ to the current directory.
A workaround to this issue is to perform the command on your Windows Command Line. You would have to install Google Cloud SDK beforehand.
Alternatively, this can also be done in Cloud Shell, albeit with an extra step:
Download the bucket objects by running gsutil -m cp -R gs://bucket-name ~/ which will download it into the home directory in Cloud Shell
Transfer the files downloaded in the ~/ (home) directory from Cloud Shell to the local machine either through the User Interface or by running gcloud alpha cloud-shell scp
Your destination path is invalid:
./C:\Users\Myname\Desktop\Bucket
Change to:
/Users/Myname/Desktop/Bucket
C: is a reserved device name. You cannot specify reserved device names in a relative path. ./C: is not valid.
There is not a one-button solution for downloading a full bucket to your local machine through the Cloud Shell.
The best option for an environment like yours (only using the Cloud Shell interface, without gcloud installed on your local system), is to follow a series of steps:
Download the whole bucket to the Cloud Shell environment
Zip the contents of the bucket
Upload the zipped file
Download the file through the browser
Clean up:
Delete the local files (local in the context of the Cloud Shell)
Delete the zipped bucket file
Unzip the bucket locally
This has the advantage of only having to download a single file on your local machine.
This might seem a lot of steps for a non-developer, but it's actually pretty simple:
First, run this on the Cloud Shell:
mkdir /tmp/bucket-contents/
gsutil -m cp -R gs://my-uniquename-bucket /tmp/bucket-contents/
pushd /tmp/bucket-contents/
zip -r /tmp/zipped-bucket.zip .
popd
gsutil cp /tmp/zipped-bucket.zip gs://my-uniquename-bucket/zipped-bucket.zip
Then, download the zipped file through this link: https://storage.cloud.google.com/my-uniquename-bucket/zipped-bucket.zip
Finally, clean up:
rm -rf /tmp/bucket-contents
rm /tmp/zipped-bucket.zip
gsutil rm gs://my-uniquename-bucket/zipped-bucket.zip
After these steps, you'll have a zipped-bucket.zip file in your local system that you can unzip with the tool of your choice.
Note that this might not work if you have too much data in your bucket and the Cloud Shell environment can't store all the data, but you could repeat the same steps on folders instead of buckets to have a manageable size.
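The per-folder variant mentioned above could look roughly like this. It is a sketch under several assumptions: gsutil and zip are available (as they are in the steps above), folder names contain no spaces, and objects sitting directly in the bucket root are handled separately. The function name zip_bucket_by_folder is hypothetical.

```shell
#!/bin/sh
# zip_bucket_by_folder BUCKET
# Hypothetical helper: for each top-level folder in BUCKET, download it to
# /tmp, zip it, upload the zip to the bucket root, and free the local space
# before moving on to the next folder.
zip_bucket_by_folder() {
    bucket=$1
    # "gsutil ls" prints one URL per line; folder URLs end with a slash.
    gsutil ls "gs://$bucket/" | grep '/$' | while read -r prefix; do
        name=$(basename "$prefix")
        workdir="/tmp/bucket-contents-$name"
        mkdir -p "$workdir"
        # </dev/null keeps the commands from consuming the loop's input.
        gsutil -m cp -R "$prefix" "$workdir/" </dev/null &&
        (cd "$workdir" && zip -q -r "/tmp/$name.zip" . </dev/null) &&
        gsutil cp "/tmp/$name.zip" "gs://$bucket/$name.zip" </dev/null
        rm -rf "$workdir" "/tmp/$name.zip"
    done
}
```

Calling zip_bucket_by_folder my-uniquename-bucket would leave one zip per folder in the bucket, each of which can then be downloaded through the browser and deleted with gsutil rm as in the clean-up step.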
We are using automation scripts to upload thousands of files from MAPR HDFS to GCP storage. Sometimes files in the main bucket appear with a tmp~!# suffix, which causes failures in our pipeline.
Example:
gs://some_path/.pre-processing/file_name.gz.tmp~!#
We are using rsync -m and in certain cases cp -I
some_file | gsutil -m cp -I '{GCP_DESTINATION}'
gsutil -m rsync {MAPR_SOURCE} '{GCP_DESTINATION}'
It's possible that a copy attempt failed and was retried later from a different machine, so that eventually we have both the file and another one with the tmp~!# suffix.
I'd like to get rid of these files without actively looking for them.
We have gsutil 4.33. I'd appreciate any leads. Thanks.
To elaborate,
There is a tar.gz file on my AWS S3, let's call it example.tar.gz.
So, what I want to do is download the extracted contents of example.tar.gz to /var/home/.
One way to do it is to simply download the tar.gz, extract it, then delete the tar.gz.
However, I don't want to use space downloading the tar.gz file, I just want to download the extracted version or only store the extracted version.
Is this possible?
Thanks!
What you need is the following:
aws s3 cp s3://example-bucket/file.tar.gz - | tar -xz
This will stream the file.tar.gz from s3 and extract it directly (in-memory) to the current directory. No temporary files, no extra storage and no clean up after this one command.
Make sure you write the command exactly as above.
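The question asks for /var/home/ as the destination rather than the current directory. One way to get that, assuming a tar that supports -z and -C (GNU and BSD tar both do), is to wrap the extraction side of the pipe in a small function. The name extract_stream is hypothetical:

```shell
#!/bin/sh
# extract_stream DEST_DIR
# Reads a gzip-compressed tar archive from standard input and unpacks it into
# DEST_DIR, so the archive itself is never written to disk.
extract_stream() {
    dest=$1
    mkdir -p "$dest" &&
    tar -xzf - -C "$dest"
}

# Usage with the bucket and key from the command above:
#   aws s3 cp s3://example-bucket/file.tar.gz - | extract_stream /var/home/
```

Only the extracted files take up space under the destination; the download is consumed by tar as it arrives.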
Today I tested with Python Boto 3 and aws cli and I noticed that tar.gz is extracted automatically when the file is downloaded
There isn't currently a way you can do this with S3.
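You could create the following script, though, and just run it whenever you wish to download the tar, as long as you have the IAM role / access keys set up.
#!/bin/bash
# Usage: ./myScript BUCKET_NAME FILE_LOCATION OUTPUT_FILE
# Download the archive, extract it, and delete the archive only if both steps succeeded.
aws s3 cp "s3://$1/$2" "$3" &&
tar -xvf "$3" &&
rm "$3"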
Then just call the script using ./myScript BUCKET_NAME FILE_LOCATION OUTPUT_FILE