We have multiple files on our S3 bucket with the same file extensions.
I would like to list all of these file extensions along with the total amount of space they take up in our bucket, in human-readable format.
For example, instead of just listing out all the files with aws s3 ls s3://ebio-rddata --recursive --human-readable --summarize
I'd like to list only the file extensions with the total size they're taking:
.basedon.peaks.l2inputnormnew.bed.full | total size: 100 GB
.adapterTrim.round2.rmRep.sorted.rmDup.sorted.bam | total size: 200 GB
.logo.svg | total size: 400 MB
Here's a Python script that will count objects by extension and compute the total size by extension:
import boto3

s3_resource = boto3.resource('s3')

sizes = {}
quantity = {}

for obj in s3_resource.Bucket('jstack-a').objects.all():
    if not obj.key.endswith('/'):  # skip folder placeholder objects
        extension = obj.key.split('.')[-1]
        sizes[extension] = sizes.get(extension, 0) + obj.size
        quantity[extension] = quantity.get(extension, 0) + 1

for extension, size in sizes.items():
    print(extension, quantity[extension], size)
It goes a bit funny if there is an object without an extension: since split('.')[-1] falls back to the whole key, every extensionless object ends up in its own bucket.
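If that matters, here is a small variation on the script above (only a sketch; the bucket name is still a placeholder) that groups extensionless keys together, keeps compound extensions such as .bed.full intact to match the desired output, and prints the totals in human-readable form:

import boto3

s3_resource = boto3.resource('s3')
sizes, quantity = {}, {}

def human_readable(num_bytes):
    # Roughly format a byte count, e.g. 209715200 -> '200.0 MB'.
    for unit in ('B', 'KB', 'MB', 'GB', 'TB'):
        if num_bytes < 1024 or unit == 'TB':
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

for obj in s3_resource.Bucket('jstack-a').objects.all():  # bucket name is a placeholder
    if obj.key.endswith('/'):
        continue  # skip folder placeholder objects
    name = obj.key.rsplit('/', 1)[-1]  # file name without its key prefix
    # Take everything after the first dot so compound extensions like '.bed.full' stay intact.
    extension = ('.' + name.split('.', 1)[1]) if '.' in name else '(no extension)'
    sizes[extension] = sizes.get(extension, 0) + obj.size
    quantity[extension] = quantity.get(extension, 0) + 1

for extension, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{extension} | {quantity[extension]} files | total size: {human_readable(size)}")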
You will have to use the SDK for that: write a script in your favorite language that lists the objects recursively, filters them by the file formats you want,
and then exports the list as CSV or JSON, whichever you find more readable.
Here's an idea for how to solve this with the awscli and a couple of other command-line tools (grep and awk, freely available on Mac and Linux).
aws s3 ls s3://mybucket --recursive \
| grep -v -E '^.+/$' \
| awk '{na=split($NF, a, "."); tot[a[na]] += $3; num[a[na]]++;} END {for (e in tot) printf "%15d %6d %s\n", tot[e], num[e], e};'
Step by step, aws s3 ls s3://mybucket --recursive results in output like this:
2021-11-24 12:45:39 57600 cat.png
2021-09-29 13:15:48 93651 dog.png
2021-09-29 14:16:06 1448 names.csv
2021-02-15 15:09:56 0 pets/
2021-02-15 15:09:56 135 pets/pets.json
Piping that through grep -v -E '^.+/$' removes the folders, and the result looks like this:
2021-11-24 12:45:39 57600 cat.png
2021-09-29 13:15:48 93651 dog.png
2021-09-29 14:16:06 1448 names.csv
2021-02-15 15:09:56 135 pets/pets.json
Finally, the AWK script is called for each line. It splits the last word of each line on the period character (split($NF, a, ".")) so it can work out what the file extension is (stored in a[na]). It then aggregates the file size by extension in tot[extension] and the file count by extension in num[extension]. It finally prints out the aggregated file size and file count by extension, which looks something like this:
151251 2 png
1448 1 csv
135 1 json
You could also solve this fairly simply in Python using the boto3 SDK.
I'm getting an error when I run a command to extract data to a CSV file, using the AWS CLI with jq.
Command:
aws dynamodb scan --table-name MyTable --select ALL_ATTRIBUTES --page-size 500 --max-items 100000 --output json --profile production | jq -r '.Items' | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv' > export.my-table.csv
Error:
'charmap' codec can't encode characters in position 1-3: character maps to <undefined>
parse error: Unfinished JSON term at EOF at line 5097, column 21
I believe that is a query I wrote previously; it does not work on nested attributes, so you will have to modify it accordingly.
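If reworking the jq pipeline gets awkward, another option (only a sketch, not the approach above; it assumes top-level string attributes and a hypothetical region, and writes UTF-8 explicitly, which also sidesteps the 'charmap' codec error on Windows consoles) is to do the export with boto3 and the csv module:

import csv
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # region is an assumption

def export_table_to_csv(table_name, out_path):
    # Paginate through the full scan instead of relying on --max-items.
    paginator = dynamodb.get_paginator("scan")
    items = []
    for page in paginator.paginate(TableName=table_name, Select="ALL_ATTRIBUTES"):
        items.extend(page["Items"])
    if not items:
        return
    # Column order taken from the first item, like the jq version.
    keys = list(items[0].keys())
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(keys)
        for item in items:
            # '.S' mirrors the jq query; extend this for N, BOOL, M, etc. as needed.
            writer.writerow([item.get(k, {}).get("S", "") for k in keys])

export_table_to_csv("MyTable", "export.my-table.csv")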
I want to copy the latest file from a GCS bucket to local storage using Airflow (Cloud Composer).
I was trying to use gsutil cp to get the latest file and load it into local Airflow storage, but got the error CommandException: No URLs matched. If I check the XCom, I get value='Objects'. Any suggestions?
download_file = BashOperator(
task_id='download_file',
bash_command="gsutil cp $(gsutil ls -l gs://<bucket_name> | sort -k 2 | tail -1 | awk '''{print $3}''') /home/airflow/gcs/dags",
xcom_push=True
)
The command gsutil ls -l gs://<bucket_name> | sort -k 2 | tail -1 | awk '''{print $3}''' also includes the summary row (total size, object count, etc.) in its output: it sorts by date, takes the last row, and prints that row's third column. That's why you get 'objects' as the value, as in the sample output below:
TOTAL: 6 objects, 28227013 bytes (26.92 MiB)
Try this code, which takes the second-to-last row only:
download_file = BashOperator(
task_id='download_file',
bash_command="gsutil cp $(gsutil ls -l gs://bucket_name | sort -k 2 | tail -2 | head -n1 | awk '''{print $3}''') /home/airflow/gcs/dags",
xcom_push=True
)
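Alternatively (a sketch, not part of the answer above), you could skip the shell pipeline entirely and pick the latest object from a PythonOperator with the google-cloud-storage client, which avoids the summary row altogether; the bucket name and destination path are placeholders:

from google.cloud import storage

def download_latest_blob(bucket_name, destination_dir):
    # List all objects and pick the most recently updated one.
    client = storage.Client()
    blobs = [b for b in client.list_blobs(bucket_name) if not b.name.endswith("/")]
    latest = max(blobs, key=lambda b: b.updated)
    latest.download_to_filename(f"{destination_dir}/{latest.name.split('/')[-1]}")
    return latest.name

# e.g. from a PythonOperator callable:
# download_latest_blob("<bucket_name>", "/home/airflow/gcs/dags")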
I defined several tables in AWS Glue.
Over the past few weeks, I've had different issues with the table definitions which I had to fix manually: changing column names or types, or changing the serialization lib. However, if partitions have already been created, repairing the table doesn't update them, so I have to delete all partitions manually and then repair again.
Is there a simple way to do this? Delete all partitions from an AWS Glue table?
I'm using the aws glue batch-delete-partition CLI command, but its syntax is tricky, there is a limit on the number of partitions you can delete in one go, and the whole thing is cumbersome...
For now, I found this command-line solution, running aws glue batch-delete-partition iteratively for batches of 25 partitions using xargs
(here I am assuming there are max 1000 partitions):
aws glue get-partitions --database-name=<my-database> --table-name=<my-table> | jq -cr '[ { Values: .Partitions[].Values } ]' > partitions.json
seq 0 25 1000 | xargs -I _ bash -c "cat partitions.json | jq -c '.[_:_+25]'" | while read X; do aws glue batch-delete-partition --database-name=<my-database> --table-name=<my-table> --partitions-to-delete=$X; done
Hope it helps someone, but I'd prefer a more elegant solution
Using python3 with boto3 looks a little bit nicer. Albeit not by much :)
Unfortunately AWS doesn't provide a way to delete all partitions without batching 25 requests at a time. Note that this will only work for deleting the first page of partitions retrieved.
import boto3

glue_client = boto3.client("glue", "us-west-2")

def get_and_delete_partitions(database, table, batch=25):
    partitions = glue_client.get_partitions(
        DatabaseName=database,
        TableName=table)["Partitions"]

    for i in range(0, len(partitions), batch):
        # batch_delete_partition accepts at most 25 partitions per call.
        to_delete = [{k: v[k]} for k, v in zip(["Values"] * batch, partitions[i:i + batch])]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)
EDIT: To delete all partitions (beyond just the first page), using a paginator makes it look cleaner.
import boto3

glue_client = boto3.client("glue", "us-west-2")

def delete_partitions(database, table, partitions, batch=25):
    for i in range(0, len(partitions), batch):
        to_delete = [{k: v[k]} for k, v in zip(["Values"] * batch, partitions[i:i + batch])]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)

def get_and_delete_partitions(database, table):
    paginator = glue_client.get_paginator('get_partitions')
    itr = paginator.paginate(DatabaseName=database, TableName=table)
    for page in itr:
        delete_partitions(database, table, page["Partitions"])
Here is a PowerShell version FWIW:
$database = 'your db name'
$table = 'your table name'
# Set the variables above
$batch_size = 25
Set-DefaultAWSRegion -Region eu-west-2
$partition_list = Get-GLUEPartitionList -DatabaseName $database -TableName $table
$selected_partitions = $partition_list
# Uncomment and edit predicate to select only certain partitions
# $selected_partitions = $partition_list | Where-Object {$_.Values[0] -gt '2020-07-20'}
$selected_values = $selected_partitions | Select-Object -Property Values
for ($i = 0; $i -lt $selected_values.Count; $i += $batch_size) {
    $chunk = $selected_values[$i..($i + $batch_size - 1)]
    Remove-GLUEPartitionBatch -DatabaseName $database -TableName $table -PartitionsToDelete $chunk -Force
}
# Now run `MSCK REPAIR TABLE db_name.table_name` to add the partitions again
From an EC2 terminal, I am trying to read part of an S3 bucket's directory names into an array, excluding the "date=" prefix, but cannot figure out a complete solution.
I've already tried the following code and I'm getting close:
origin="bucket/path/to/my/directory/"
for path in $(aws s3 ls $origin --profile crossaccount --recursive | awk '{print $4}');
do echo "$path"; done
Note: the directory contains multiple subdirectories like /date=YYYYMMDD/, and all I want returned into an array is the YYYYMMDD part, where YYYYMMDD is >= a certain value.
I expect the output to be an array:
YYYYMMDD, YYYYMMDD, YYYYMMDD
actual result is:
path/to/my/directory/date=YYYYMMDD/file#1
path/to/my/directory/date=YYYYMMDD/file#2
path/to/my/directory/date=YYYYMMDD/file#3
https://docs.aws.amazon.com/cli/latest/reference/s3/ls.html
path="bucket/path/to/my/directory/date="
for i in $(aws s3 ls $path --profile crossaccount --recursive | awk -F'[=/]' '{if($6>20190000)print $6}');
do python3.6 my_python_program.py $i; done
I used awk. The brackets contain the field delimiters = and /, and $6 is the sixth field once the full key has been split on them. It gives me the date I need to feed into my Python program.
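If you'd rather not count awk fields, a boto3 sketch (the bucket, prefix, and 20190000 cutoff below are placeholders taken from the example) can pull the date= prefixes directly by listing with a delimiter:

import boto3

s3 = boto3.client("s3")

def list_dates(bucket, prefix, min_date="20190000"):
    # With Delimiter='/', S3 returns the 'date=YYYYMMDD/' subdirectories as CommonPrefixes.
    dates = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            date = cp["Prefix"].rsplit("date=", 1)[-1].strip("/")
            if date >= min_date:
                dates.append(date)
    return dates

# e.g. list_dates("bucket", "path/to/my/directory/") -> ["YYYYMMDD", ...]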
I'm trying to list files from a virtual folder in S3 within a specific date range. For example: all the files that have been uploaded for the month of February.
I currently run an aws s3 ls command, but that gives all the files:
aws s3 ls s3://Bucket/VirtualFolder/VirtualFolder --recursive --human-readable --summarize > c:File.txt
How can I get it to list only the files within a given date range?
You could filter the results with a tool like awk:
aws s3 ls s3://Bucket/VirtualFolder/VirtualFolder --recursive --human-readable --summarize \
| awk -F'[-: ]' '$1 >= 2016 && $2 >= 3 { print }'
where awk splits each record using -, :, and space delimiters, so you can address fields as:
$1 - year
$2 - month
$3 - day
$4 - hour
$5 - minute
$6 - second
The aws cli ls command does not support filters, so you will have to bring back all of the results and filter locally.
Realizing this question was tagged command-line-interface, I have still found that the best way to handle non-trivial aws-cli tasks is to write a Python script.
Tersest example:
$ python3 -c "import boto3; print(boto3.client('s3').list_buckets()['Buckets'][0])"
Returns: (for me)
{'Name': 'aws-glue-scripts-282302944235-us-west-1', 'CreationDate': datetime.datetime(2019, 8, 22, 0, 40, 5, tzinfo=tzutc())}
That one-liner isn't a profound script, but it can be expanded into one (probably with less effort than munging a bash script, much as I love bash). After looking up a few boto3 calls, you can work out the rest from the equivalent CLI commands.
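For example, a sketch of such a script for this question (the bucket name, prefix, and February 2016 range are assumptions) filters objects locally by their LastModified timestamp:

from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

def list_objects_in_range(bucket, prefix, start, end):
    # Bring back everything under the prefix and filter locally by LastModified.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if start <= obj["LastModified"] < end:
                print(obj["LastModified"], obj["Size"], obj["Key"])

list_objects_in_range(
    "Bucket", "VirtualFolder/VirtualFolder/",
    start=datetime(2016, 2, 1, tzinfo=timezone.utc),
    end=datetime(2016, 3, 1, tzinfo=timezone.utc))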