This command copies a huge number of files from Google Cloud Storage to my local server:
gsutil -m cp -r gs://my-bucket/files/ .
There are 200+ files, each of which is over 5GB in size.
Once all the files are downloaded, another process kicks in, reads the files one by one, and extracts the information it needs.
The problem is that even though the gsutil copy is fast and downloads batches of files in parallel at very high speed, I still need to wait until all the files are downloaded before I can start processing them.
Ideally I would like to start processing the first file as soon as it is downloaded. But in parallel (-m) copy mode there seems to be no way of knowing when an individual file has finished downloading (or is there?).
According to the Google docs, this can be done when copying files individually:
if ! gsutil cp ./local-file gs://your-bucket/your-object; then
<< Code that handles failures >>
fi
That means that if I run cp without the -m flag, I get an exit status for each file and can kick off the processing for that file.
The problem with this approach is that the overall download takes much longer, because the files are now downloaded one by one.
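Something like this is what I have in mind (a rough sketch; process-file is just a placeholder for whatever does the processing):
for uri in $(gsutil ls gs://my-bucket/files/); do
  if gsutil cp "$uri" .; then
    ./process-file "$(basename "$uri")"   # start processing this file as soon as it lands
  fi
done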
Any insight?
One thing you could do is have a separate process that periodically lists the directory, filters out the files that are still being downloaded (in-progress downloads are written to a filename ending in '.gstmp' and renamed once the download completes), and keeps track of the files you haven't processed yet. You could terminate that periodic listing process when the gsutil cp process completes, or you could just leave it running, so it handles the downloads the next time you fetch all the files.
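For example, here is a minimal polling sketch. Assumptions: the download lands in ./files/ (as the question's command would create), in-progress downloads carry a .gstmp suffix, process-file is a placeholder for your downstream step, and a .done marker file is used to remember what has already been handled:
while true; do
  for f in ./files/*; do
    [ -f "$f" ] || continue                         # skip if the glob matched nothing, or directories
    case "$f" in *.gstmp|*.done) continue ;; esac   # skip in-progress downloads and marker files
    [ -e "$f.done" ] && continue                    # skip files we've already handled
    ./process-file "$f" && touch "$f.done"          # placeholder for the real processing step
  done
  sleep 10
done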
Two potential complications with doing that are:
If the number of files being downloaded is very large, the periodic directory listings could be slow. How big "very large" is depends on the type of file system you're using. You could experiment by creating a directory with the approximate number of files you expect to download and seeing how long it takes to list. Another option would be to use the gsutil cp -L option, which builds a manifest showing which files have been downloaded. You could then have a loop reading through the manifest, looking for files that have downloaded successfully (see the sketch after this list).
If the multi-file download fails partway through (e.g., due to a network connection that drops for longer than gsutil will retry), you'll end up with a partial set of files. For this case you might consider using gsutil rsync, which can be restarted and will pick up where it left off.
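A rough sketch of the manifest approach mentioned above. Assumptions: the copy was started with something like gsutil -m cp -L manifest.csv -r gs://my-bucket/files/ ., the manifest's Destination and Result columns are the 2nd and 9th CSV fields (check the header row of your manifest to confirm), and process-file is again a placeholder:
# skip the CSV header, keep rows whose Result column is OK, and hand the
# destination path to the processing step (destinations may be file:// URLs,
# so strip that prefix if present)
tail -n +2 manifest.csv | awk -F, '$9 == "OK" {print $2}' | sed 's|^file://||' | while read -r dest; do
  ./process-file "$dest"
done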
Related
Just like the Unix command tar -czf xxx.tgz xxx/, is there a way to do the same thing in HDFS? I have a folder in HDFS with over 100k small files and want to download it to the local file system as fast as possible. hadoop fs -get is too slow, and I know hadoop archive can output a har, but it doesn't seem to solve my problem.
From what I see here,
https://issues.apache.org/jira/browse/HADOOP-7519
it is not possible to perform a tar operation using hadoop commands. This has been filed as an improvement, as mentioned above, and is not yet resolved/available for use.
Hope this answers your question.
Regarding your scenario: having 100k small files in HDFS is not a good practice. You could find a way to merge them (perhaps by creating tables over this data with Hive or Impala), or move all the small files into a single folder in HDFS and use hadoop fs -copyToLocal <HDFS_FOLDER_PATH> to get the whole folder, along with all the files in it, to your local machine.
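As a rough sketch of the second option (all paths here are made up for illustration):
# gather the small files into a single HDFS folder, then pull that folder down in one go
hadoop fs -mkdir -p /data/consolidated
hadoop fs -mv '/data/scattered/*/part-*' /data/consolidated/
hadoop fs -copyToLocal /data/consolidated /local/target/dir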
When I resume copying files, it works in two phases: 1) skipping already-copied files and 2) copying files.
Because of this it takes a long time.
Is there any way to skip the already-copied files during the process?
Yes, this can be achieved by using the gsutil rsync command.
The gsutil rsync command synchronises the contents of the source directory and the destination directory by copying only the missing files. It's therefore much more efficient than the standard vanilla cp command.
However, if you use the -n switch with the cp command, it forces cp to skip files that have already been copied. So whether gsutil rsync is faster than gsutil cp -n is open to debate, and may depend on the scenario.
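For comparison, the no-clobber variant would look something like this (the bucket name here is just an example):
gsutil -m cp -n -r ./source gs://mybucket/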
To use gsutil rsync you can run something like this (the -r flag makes the command recursive):
gsutil rsync -r source gs://mybucket/
For more details on the gsutil rsync command, please take a look at the documentation.
I understand you have some concerns about the time it takes for both commands to work out which files need to be copied. As both gsutil cp -n and gsutil rsync need to compare the source and destination directories, there is always going to be a certain amount of overhead/delay on top of the copying itself, especially with very large collections.
If you want to cut out this part of the process altogether, and you only want to copy files newer than a certain age, you could select them at the source and use a standard gsutil cp command to see if that is faster. However, doing so would remove some of the benefits of gsutil cp -n and gsutil rsync, as there would no longer be a direct comparison between the source and destination directories.
For example, you could generate a variable at the source containing the files which have been modified recently, say within the last day, and then use a standard gsutil cp command to copy only those files.
For example, to create a variable containing a list of files modified within the last day:
modified="$(find . -mtime -1)"
Then use the variable as the source list for the copy command.
gsutil -m cp $modified gs://publicobject/
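If the list of modified files is long, or the file names contain spaces, a variation that may be more robust is to pipe the names straight into gsutil cp -I, which reads the list of files to copy from stdin (the bucket name is again just an example):
find . -mtime -1 -type f | gsutil -m cp -I gs://publicobject/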
You need to work out whether or not this would work for your use case: although there is a chance it may be faster, some of the advantages of the other two methods are lost (the automatic syncing of directories).
I want to copy all files from server A to server B that have the same parent directory name at different levels of the filesystem hierarchy, e.g.:
/var/lib/data/sub1/sub2/commonname/filetobecopied.foo
/var/lib/data/sub1/sub3/commonname/filetobecopied.foo
/var/lib/data/sub2/sub4/commonname/anotherfiletobecopied.foo
/var/lib/data/sub3/sub4/differentname/fileNOTtobecopied.foo
I want to copy the first three files, which all have commonname in their path, to server B. I have already spent a lot of time trying to find the correct include/exclude patterns for rsync, but I can't get it right. The following command does not work:
rsync -a --include='**/commonname/*.foo' --exclude='*' root@1.2.3.4:/var/lib/data /var/lib/data
I either match too many or too few of the files. How can I sync only the files with commonname in their path?
I guess you're looking for this:
rsync -a -m --include='**/commonname/*.foo' --include='*/' --exclude='*' root@1.2.3.4:/var/lib/data /var/lib/data
There are two differences from your command:
The most important one is --include='*/'. Without it, since you specified --exclude='*', rsync will never enter the subdirectories, because everything is excluded. With --include='*/', the subdirectories are no longer excluded, so rsync can happily recurse.
The least important one is -m: this prunes empty directories. Without it, you'd also get the (empty) subdirectory /var/lib/data/sub3/sub4/differentname/ copied.
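If you want to verify the filter rules before transferring anything, rsync's -n (--dry-run) flag combined with -v lists what would be copied without actually copying it:
rsync -a -m -n -v --include='**/commonname/*.foo' --include='*/' --exclude='*' root@1.2.3.4:/var/lib/data /var/lib/data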
I made a very foolish error with a large image directory on our server, which is mounted via S3FS to an EC2 instance: I ran Image_Optim on it. It seemed to do a good job until I noticed missing files on the website, which, when I looked, turned out to be files that had been left at 0 KB...
...Fortunately I have versioning on, and a quick look seems to show that, for each of the 0 KB files, there is also a correct version from the exact same time.
This has happened to about 1,300 files in a directory of 2,500. The question is: is it possible for me to batch process all the 0 KB files and restore each of them to the latest version that is bigger than 0 KB?
The only batch restore tool I can find is S3 Browser, but that restores all files in a folder to their latest version. In some cases this would work for the 0 KB files, but for many it won't. I also don't own the program, so I would rather do it with a command-line script if possible.
Once your file(s) have become 0 bytes / 0 KB, you cannot recover them, at least not easily. If you mean restoring/recovering from an external backup, then that will work.
I am compressing a big directory of about 50 GB containing files and folders over an SSH command line (PuTTY). I am using this command:
tar czspf file.tar.gz directory/
It starts off fine, but after some time it gets terminated with the single-word message "Terminated", and the compression stops with the tar archive at about 16 GB.
Is there any way to avoid this termination, or any other way to tar the directory that gets around the problem? Thanks.
Probably you are running into some kind of file size limit; not all file systems support very big files. In that case you could pipe the output of tar into a split command, like this:
tar czspf - directory/ | split -b 4G fileprefix-
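To put the pieces back together later, the chunks can be concatenated and fed straight back into tar (fileprefix- matches the prefix used above):
cat fileprefix-* | tar xzpf -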