AWS CLI - A file containing items to be ignored for S3 copy or sync - amazon-web-services

Is it somewhat possible to have a file containing ignored files and folders during uploading items through AWS CLI.
It has an --exclude flag like mentioned here. However, the concept I seek is something like .gitignore or .dockerignore file rather than enlisting with a flag.

No, there is no in-built capability within the AWS Command-Line Interface (CLI) to support .ignore file capabilities.

I know it's not exactly what you are looking for but you could set an alias in your ~/.bash_profile something like:
alias s3_cp=`aws s3 cp --exclude "yadda, yadda, yadda"`
This would at least reduce the need to type them every time, even though it isn't in a concise file.
Edit: Here is a link that shows it doesn't look like the base config file supports what you are looking for. https://docs.aws.amazon.com/cli/latest/topic/s3-config.html

Related

how to access files in s3 bucket for other commands using cli

I want to use some commands with the aws cli on large files that are stored in a s3 bucket without copying the files to the local directory
(I'm familiar with the aws cp command, it's not what I want).
For example, let's say I want to use a simple bash commands like "head" or "more".
If I try to use it like this:
head s3://bucketname/file.txt
but then I get:
head: cannot open ‘s3://bucketname/file.txt’ for reading: No such file or directory
How else can I do it?
How else can I do it?
Thanks in advance.
Wether a command will be able to access a file over s3 bucket or not depends entirely on the command. Under the hood, every command is just a program. When you run something like head filename, the argument filename is passed as an argument to the head command's main() function. You can check out the source code here: https://github.com/coreutils/coreutils/blob/master/src/head.c
Essentially, since the head command does not support S3 URIs, you cannot do this. You can either:
Copy the s3 file to stdout and then pipe it to head: aws s3 cp fileanme - | head. This doesn't seem the likely option if file is too big for the pipe buffer.
Use s3curl to copy a range of bytes: how to access files in s3 bucket for other commands using cli

Where is a sensible place to put kube_config.yaml files on MWAA?

The example code in the MWAA docs for connecting MWAA to EKS has the following:
#use a kube_config stored in s3 dags folder for now
kube_config_path = '/usr/local/airflow/dags/kube_config.yaml'
This doesn't make me think that putting the kube_config.yaml file in the dags/ directory is a sensible long-term solution.
But I can't find any mention in the docs about where would be a sensible place to store this file.
Can anyone link me to a reliable source on this? Or make a sensible suggestion?
From KubernetesPodOperator Airflow documentation:
Users can specify a kubeconfig file using the config_file parameter, otherwise the operator will default to ~/.kube/config.
In a local environment, the kube_config.yaml file can be stored in specific directory reserved for Kubernetes (e.g. .kube, kubeconfig). Reference: KubernetesPodOperator (Airflow).
In the MWAA environment, where DAG files are stored in S3, the kube_config.yaml file can be stored anywhere in the root DAG folder (including any subdirectory in the root DAG folder, e.g. /dags/kube). The location of the file is less important than explicitly excluding it from DAG parsing via the .airflowignore file. Reference: .airflowignore (Airflow).
Example S3 directory layout:
s3://<bucket>/dags/dag_1.py
s3://<bucket>/dags/dag_2.py
s3://<bucket>/dags/kube/kube_config.yaml
s3://<bucket>/dags/operators/operator_1.py
s3://<bucket>/dags/operators/operator_2.py
s3://<bucket>/dags/.airflowignore
Example .airflowignore file:
kube/
operators/

How I Can Search Unknown Folders in S3 Bucket. I Have millions of object in my bucket I only want Folder List?

I Have a bucket with 3 million objects. I Even don't know how many folders are there in my S3 bucket and even don't know the names of folders in my bucket.I want to show only list of folders of AWS s3. Is there any way to get list of all folders ?
I would use AWS CLI for this. To get started - have a look here.
Then it is a matter of almost standard linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
grep command just reads what aws s3 ls command has returned and searches for entries with ending /.
ending > folders.txt saves output to a file.
Note: grep (if I'm not wrong) is unix only utility command. But I believe, you can achieve this on windows as well.
Note 2: depending on the number of files there this operation might (will) take a while.
Note 3: usually in systems like AWS S3, term folder is there only for user to maintain visual similarity with standard file systems however inside it does treat it as a part of a key. You can see in your (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.

Rename parent folder in S3

We know that in S3 objects, the folder path is just a part of the object key itself.
Having an object structure similar to the following:
/files/user/09874/01/
/files/user/09875/01/
/files/user/09875/02/
/files/user/09876/01/
/files/user/09876/02/
What kind of operation would you recommend to rename the parent /files/ to /something/, having in mind that there are thousands of files and that the number of requests should be the cheapest/minimum?
(with the following docs under consideration)
As you said prefix are part of the file name. Unfortunately, The only way that i can think of is to iterate the list of objects and move to the new prefix.
in cli:
aws s3 mv --recursive s3://bucket/files s3://bucket/something/

How to copy only files from many subdirectory under the directory to another project bucket in GCP?

I have huge number of data in my Google Cloud storage bucket. I have to copy all the files to another project bucket. But the main problem is, in this bucket i created some folder and under this folder have many sub-folders and all sub-folders have data. So when i am using normal gsutil copy command then its copying all the data along with folders.
I need help to resolve this problem. Because it is taking too much time to copy from one project to another project bucket.
You can use this command to have all the files in the root path.
gsutil cp 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
If you have nested directories inside your bucket, use this command:
gsutil cp -r 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
Pay attention to single quotes around the first command.
You can take a look at the Wildcard Names if you need more advanced features.
You can use Google Data Transfer Service
It is the second option in the Google Cloud Storage subcategory.
Use gsutil cp command without -r option.
The -R and -r options are synonymous. Causes directories,
buckets, and bucket subdirectories to be copied recursively.
If you neglect to use this option for an upload, gsutil will
copy any files it finds and skip any directories. Similarly,
neglecting to specify this option for a download will cause
gsutil to copy any objects at the current bucket directory
level, and skip any subdirectories.
If I understand well, you want to copy all the files from one bucket to another bucket, but you don't want to have the same hierarchy, instead, you want to have all the files in the root path.
Nowadays there’s no possible way to do that with gsutil, but you can do it with a script, here you have my solution:
from google.cloud import storage
bucketOrigin = storage.Client().get_bucket("<BUCKET_ID_ORIGIN>")
bucketDestination = storage.Client().get_bucket("<BUCKET_ID_DESTINATION")
for blob in bucketOrigin.list_blobs():
strfile=blob.download_as_string()
blobDest = bucketDestination.blob(blob.name[blob.name.rfind("/")+1:])
blobDest.upload_from_string(strfile)
As mentioned by Akash Dathan, you can use the Cloud Storage Transfer Service to move your bucket content. I recommend you to take a look on this Moving and Renaming Buckets guide, where you can find the steps required to perform this task.
Bear in mind the following requirments:
Transfer Service service account must have permission to read from
your source and write to your destination.
If you're deleting the source files, the Transfer Service's service account will need delete access to the source.
If your service account doesn't have these
permissions yet, a bucket owner must grant them.
Note. If you have 'storage.buckets.setIamPolicy' permission for the source and destination buckets, creating a transfer job will grant that service account the required source and destination permissions to complete the transfer.
You can list all files from your subfolders and get the file name by using split() method. Then you can use use a copy() method to copy the file to another bucket. The method below remove all subfolders:
const [files] = await storage.bucket(srcBucketName).getFiles();
files.forEach((file) => {
let fileName = file.name.split("/").pop();
if (fileName)
file.copy(storage.bucket(destBucketName).file(`${prefix}/${fileName}`));
});