aws s3 sync doesn't download the file and instead creates an empty directory - amazon-web-services

The question was originally posted on Super User (https://superuser.com/questions/1768933/aws-sync-doesnt-download-the-file-and-instead-create-an-empty-directory), but received no answers. I thought Stack Overflow, although not officially intended for this type of question, could really help given the large number of experts here.
The file I'm trying to download is:
s3://ont-open-data/gm24385_q20_2021.10/analysis/20210805_1713_5C_PAH79257_0e41e938/guppy_5.0.15_sup/align_unfiltered/chr19/calls2ref.bam
This is an open dataset. I followed the tutorial here: https://labs.epi2me.io/tutorials/.
The command I used was:
aws s3 sync --no-sign-request s3://ont-open-data/gm24385_q20_2021.10/analysis/20210805_1713_5C_PAH79257_0e41e938/guppy_5.0.15_sup/align_unfiltered/chr19/calls2ref.bam destination.bam
However, instead of downloading the file, aws simply created an empty directory named destination.bam, which makes no sense...
$ aws --version
aws-cli/1.15.15 Python/3.6.10 Linux/4.4.0-210-generic botocore/1.10.15

Couldn't solve the problem with sync, but I realized that cp works! The following command downloaded the file:
aws s3 cp --no-sign-request s3://ont-open-data/gm24385_q20_2021.10/analysis/20210805_1713_5C_PAH79257_0e41e938/guppy_5.0.15_sup/align_unfiltered/chr19/calls2ref.bam destination.bam

Your issue is caused by the fact that calls2ref.bam is not a directory.
If you list the chr19/ directory, it returns:
2021-10-07 00:57:41 228 basecall_stats.log
2021-10-07 00:57:42 1033176198 calls2ref.bam
2021-10-07 00:58:33 507904 calls2ref.bam.bai
2021-10-07 00:58:33 15443858 calls2ref_stats.txt
2021-10-07 00:58:35 0 extract_region_from_bam.log
The cp command works because you have provided the full Key of the object that you wish to copy. The sync command, however, expects to be given a directory (prefix) and will copy the entire contents of that directory, including sub-directories.
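For example, pointing sync at the chr19/ prefix (note the trailing slash) together with a local directory should download everything under it, including calls2ref.bam:
aws s3 sync --no-sign-request s3://ont-open-data/gm24385_q20_2021.10/analysis/20210805_1713_5C_PAH79257_0e41e938/guppy_5.0.15_sup/align_unfiltered/chr19/ ./chr19/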

Related

How to use gsutil rsync: log in and download bucket contents to a local directory

I have the following questions.
I was given access to a cloud bucket via my email ID. Now I want to download the whole bucket into a local directory on Ubuntu. I installed gsutil from pip.
Is the command correct?
gsutil rsync gs://bucket_name .
The command seems generic; how do I give my Gmail credentials to it? The data is 1 TB in size and I am allowed to download it only once, so I want to get the command right.
The command is correct if you want your current directory to mirror the contents of the bucket (add the -d flag if you also want it to delete any files on the right that are not found on the left). If you merely want to copy, you might want cp -r instead.
Here are the current docs on how to authenticate when running a standalone gsutil. It looks like you just need to run gsutil config.
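Roughly, the whole flow would then look like this (a sketch; -m enables parallel transfers, which helps with a 1 TB download):
gsutil config
gsutil -m cp -r gs://bucket_name .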

Errno 22 When downloading multiple files from S3 bucket "sub-folder"

I've been trying to use the AWS CLI to download all files from a sub-folder in AWS; however, after the first few files download, it fails to download the rest.
I'm using the following command:
aws s3 cp s3://my_bucket/sub_folder /tmp/ --recursive
It gives me the following error for almost all of the files in the sub-folder:
[Errno 22] Invalid argument: 'C:\\tmp\\2019-08-15T16:15:02.tif.deDBF2C2
I think this is because of the .deDBF2C2 extension it seems to be adding to the files when downloading, though I don't know why it does. In the actual bucket, the filenames all end with .tif.
Does anyone know what causes this?
Update: The command worked once I executed it from a Linux machine. It seems to be specific to Windows.
This is an oversight by AWS: it uses Windows reserved characters in log file names! When you execute the command it will create all of the directories; however, any logs with :: in the name fail to download.
The issue is discussed here: https://github.com/aws/aws-cli/issues/4543
Frustrated, I came up with a workaround: execute a dry run, which prints the expected output, and redirect that to a text file, e.g.:
>aws s3 cp s3://config-bucket-7XXXXXXXXXXX3 c:\temp --recursive --dryrun > c:\temp\aScriptToDownloadFilesAndReplaceNames.txt
The output file is filled with entries like the one below, which we can turn into aws commands:
(dryrun) download: s3://config-bucket-7XXXXXXXXXXX3/AWSLogs/7XXXXXXXXXXX3/Config/ap-southeast-2/2019/10/1/ConfigHistory/7XXXXXXXXXXX3_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz to \AWSLogs\7XXXXXXXXXXX3\Config\ap-southeast-2\2019\10\1\ConfigHistory\703014955993_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz
In Notepad++ or another text editor, replace the (dryrun) download: with aws s3 cp.
Each line will then contain the aws s3 cp command, the bucket object and the local file path. We need to remove the :: in the local file path, on the right-hand side of the to:
aws s3 cp s3://config-bucket-7XXXXXXXXXXX3/AWSLogs/7XXXXXXXXXXX3/Config/ap-southeast-2/2019/10/1/ConfigHistory/7XXXXXXXXXXX3_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz to AWSLogs\7XXXXXXXXXXX3\Config\ap-southeast-2\2019\10\1\ConfigHistory\7XXXXXXXXXXX3_Config_ap-southeast-2_ConfigHistory_AWS::RDS::DBInstance_20191001T103223Z_20191001T103223Z_1.json.gz
We can replace the :: with - in the local paths only (not the S3 bucket paths) using the regex (.*)::, which, being greedy, matches up to the last occurrence of :: on each line.
Replace with $1- and click 'Replace All' twice, so that both ::'s in each local path become hyphens.
Next remove the to (replace it with nothing):
FIND: json.gz to AWSLogs
REPLACE: json.gz AWSLogs
Finally, select all the lines and copy/paste them into a command prompt to download all of the files with reserved characters in their names!
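If a Unix-style shell is available (see the WSL note below), the same transformation can be scripted instead of hand-edited. A rough sketch that reads the dry-run file (tr strips the Windows line endings), converts the backslashes, and strips the :: from the local path only, assuming none of the keys contain spaces:
tr -d '\r' < aScriptToDownloadFilesAndReplaceNames.txt | while read -r _ _ src _ dst; do
  dst="${dst//\\//}"                 # Windows backslashes -> forward slashes
  aws s3 cp "$src" ".${dst//::/-}"   # replace :: with - in the local path only
done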
UPDATE:
If you have WSL (Windows Subsystem for Linux) you should be able to download the files there and then do a simple rename, replacing the ::'s, before copying them to the mounted Windows file system.
I tried it from my Raspberry Pi and it worked. It seems to only be an issue on Windows.
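A sketch of that rename step, assuming the files were first downloaded somewhere inside the WSL (or Linux) filesystem and only the file names contain :::
find . -depth -name '*::*' -execdir bash -c 'mv -- "$1" "${1//::/-}"' _ {} \;   # rename each match, replacing :: with -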

Amazon S3 not deleting a directory that contains a large number of files? How to delete a large folder?

The command that I'm using is:
aws s3 rm --recursive --debug s3://abc-elastic-snap-shot/sample-bash-rest-ex/abccom/sarthak/ --endpoint-url https://abc.xyz.com
The output that I see on the screen is the list of files being deleted, but when I check the directory those files have not been deleted. The command doesn't throw any error at all. When I use the same command on a directory that has fewer files, it works. I think it has something to do with --page-size, as the maximum value is 1000, but is there a way to efficiently delete a directory and at least get a meaningful error to handle?
The issue seems to be with the latest version of the AWS command line tool, i.e. 1.16; it was re-uploading files. When I installed an older version of the tool, i.e. 1.15, it worked fine.
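If the CLI was installed with pip (an assumption; use whichever install method you originally used), pinning the older 1.15 release might look like this:
pip install 'awscli==1.15.*'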

How to copy a file from a GCS bucket to my local machine

I need to copy files from Google Cloud Storage to my local machine.
I tried this command in the terminal of a Compute Engine instance:
$sudo gsutil cp -r gs://mirror-bf /var/www/html/mydir
/var/www/html/mydir is the directory on my local machine.
I get this error:
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
Where is the mistake?
You must first create the directory /var/www/html/mydir.
Then, you must run the gsutil command on your local machine and not in the Google Cloud Shell. The Cloud Shell runs on a remote machine and can't deal directly with your local directories.
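In other words, something like this, run from your own machine rather than from Cloud Shell:
mkdir -p /var/www/html/mydir
gsutil cp -r gs://mirror-bf /var/www/html/mydir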
I had a similar problem and went through the painful process of having to figure it out too, so I thought I would provide my step-by-step solution (under Windows, hopefully similar for Unix users) and hope it helps others:
The first thing (as many others have pointed out in various Stack Overflow threads): you have to run a local console (in admin mode) for this to work (i.e. do not use the Cloud Shell terminal).
Here are the steps:
Assuming you already have Python installed on your machine, you will then need to install the gsutil python package using pip from your console:
pip install gsutil
You will then be able to run the gsutil config from that same console:
gsutil config
Running it will create a .boto file, which is needed to make sure you have permission to access your storage.
Also note that you are now given a URL, which is needed in order to get the authorization code (prompted in the console).
Open a browser and paste this URL in, then:
Log in to your Google account (i.e. the account linked to your Google Cloud).
Google asks you to confirm that you want to give access to gsutil. Click Allow.
You will then be given an authorization code, which you can copy and paste into your console.
Finally, you are asked for a project ID:
Get the project ID of interest from your Google Cloud.
To find these IDs, click on "My First Project" in the Cloud Console.
You will then see a list of all your projects and their IDs.
Paste the ID into your console, hit Enter and there you are! You have now created your .boto file. This should be all you need to be able to work with your Cloud Storage.
Console output:
Boto config file "C:\Users\xxxx\.boto" created. If you need to use a proxy to access the Internet please see the instructions in that file.
You will then be able to copy your files and folders from the cloud to your PC using the following gsutil command:
gsutil -m cp -r gs://myCloudFolderOfInterest/ "D:\MyDestinationFolder"
Files from within "myCloudFolderOfInterest" should then get copied to the destination "MyDestinationFolder" (on your local computer).
gsutil -m cp -r gs://bucketname/ "C:\Users\test"
I put a "r" before file path, i.e., r"C:\Users\test" and got the same error. So I removed the "r" and it worked for me.
Try with '.' as in ./var:
$sudo gsutil cp -r gs://mirror-bf ./var/www/html/mydir
Or maybe it is the problem below:
gsutil cp does not support copying special file types such as sockets, device files, named pipes, or any other non-standard files intended to represent an operating system resource. You should not run gsutil cp with sources that include such files (for example, recursively copying the root directory on Linux that includes /dev ). If you do, gsutil cp may fail or hang.
Source: https://cloud.google.com/storage/docs/gsutil/commands/cp
The syntax that worked for me when downloading to a Mac was:
gsutil cp -r gs://bucketname dir Dropbox/directoryname

Moving several files from one bucket to another in AWS returns error: Timed Out

I'm using s3cmd to move over 5000 files from one bucket to a folder within another bucket.
Like this:
s3cmd mv --recursive -v s3://test.bucket/1111_stuff/ s3://actual.bucket/input/dataloader_input/
However, it keeps giving me this:
INFO: Retrieving list of remote files for s3://dataloader.bucket/1111_stuff/ ...
INFO: Summary: 5186 remote files to move
ERROR: timed out
It gets stuck on Retrieving list of remote files for quite some time, and all I get out of it is an error.
Is this a problem on AWS side, or is it something I can fix? Is there any other way to do this?
Thanks.
This looks like the socket is timing out. You might want to try changing the value of socket_timeout in your .s3cfg to 180, which is 3 minutes.
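In ~/.s3cfg that would be something like this (the option sits in the [default] section):
[default]
socket_timeout = 180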
And if the above does not help, you might want to read some similar questions:
Can I move a file into a 'folder' inside an S3 bucket using the s3cmd mv command?
s3cmd failed too many times