Much like using prefixes in S3 Bucket > Management > Lifecycle rules, I'd like to prune old files whose names contain certain words.
I'm looking to remove files that start with Screenshot or have screencast in the filename and are older than 365 days.
Examples
/Screenshot 2017-03-19 10.11.12.png
folder1/Screenshot 2019-03-01 14.31.55.png
folder2/sub_folder/project-screencast.mp4
I'm currently testing if lifecycle prefixes work on files too.
You can write a program to do it, such as this Python script:
import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3', region_name='ap-southeast-2')

# Note: list_objects_v2 returns at most 1,000 keys per call; use a paginator for bigger buckets
response = s3.list_objects_v2(Bucket='my-bucket')

# LastModified comes back timezone-aware, so the cutoff date must be timezone-aware too
keys_to_delete = [
    {'Key': obj['Key']}
    for obj in response['Contents']
    if obj['LastModified'] < datetime(2018, 3, 20, tzinfo=timezone.utc)
    and ('Screenshot' in obj['Key'] or 'screencast' in obj['Key'])
]

s3.delete_objects(Bucket='my-bucket', Delete={'Objects': keys_to_delete})
You could modify it to be "1 year ago" rather than a specific date.
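For example, a minimal tweak to the script above (keeping everything else the same) could compute the cutoff relative to the current time instead of hard-coding a date:

from datetime import datetime, timedelta, timezone

# Anything older than 365 days from "now"; LastModified is timezone-aware,
# so the cutoff needs to be timezone-aware too
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

You would then compare obj['LastModified'] < cutoff in the list comprehension.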
I don't believe that you can apply lifecycle rules with wildcards such as *screencast*, only with prefixes such as "taxes/" or "taxes/2010".
For your case, I would probably write a script (or perhaps an Athena query) to filter an S3 Inventory report for those files that match your name/age conditions, and then prune them.
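As a rough sketch of the script route, assuming a CSV-format inventory report downloaded locally that was configured to include the bucket, key and last-modified-date fields in that order (the filename and column layout are assumptions about your inventory configuration):

import csv
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=365)
keys_to_prune = []

with open('inventory.csv', newline='') as f:  # hypothetical local copy of the inventory report
    for bucket, key, last_modified in csv.reader(f):  # column order is an assumption
        modified = datetime.fromisoformat(last_modified.replace('Z', '+00:00'))
        if modified < cutoff and ('Screenshot' in key or 'screencast' in key):
            keys_to_prune.append(key)

The keys_to_prune list could then be fed to delete_objects in batches of up to 1,000 keys.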
Of course, you could write a program to do this as @John Rotenstein suggests. The one time that might not be ideal is if you have millions or billions of objects, because the time to enumerate the list of objects would be significant. But it would be fine for a reasonable number of objects.
Related
I have AWS S3 buckets with hundreds of top-level prefixes (folders). Each prefix contains somewhere between five thousand and a few million files, most growing at a rate of 10-100k per year. 99% of the time, all I care about are the newest 1-2000 or so in each folder...
Using ListObjectsV2 returns me 1000 files and that is the max (setting "MaxKeys" to a higher value still truncates the list at 1000). This would be reasonably fine; however (per the documentation) it returns the file list in ascending alphabetical order (which, given my keys/filenames have the date in them, effectively results in an oldest->newest sort)... which is considerably less useful than if it returned the NEWEST files (or reverse-alphabetical).
One option is to do a continuation allowing me to pull the entire prefix, then use the tail end of the entire array of keys as needed... but that would be (most importantly) slow for large 'folders'. A prefix with 2 million files would require 2,000 separate API calls, just to get the newest few-hundred filenames. (not to mention the costs incurred by pulling the entire bucket list even though I'm only really interested in the newest 1-2000 files.)
Is there a way to have the ListObjectsV2 call (or any other S3 call) give me the list of the newest (or reverse-alphabetical) files? New files come in every few minutes - and the most important file is THE most recent file, so doing an S3 Inventory doesn't seem like it would do the trick.
(or, perhaps, a call that gives me filenames in a created-by date range...?)
Using javascript - but I'm sure every language has more-or-less the same features when it comes to trying to list objects from an S3 bucket.
Edit: weird idea: If AWS doesn't offer a 'sort' option on a basic API call for one of its most popular services... Would it make sense to document all the filenames/keys in a DynamoDB table and query that instead?
No. The ListObjectsV2() will always return up to 1000 objects alphabetically in the requested Prefix.
You could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
If you need real-time or fairly fast access to a list of all available objects, your other option would be to trigger an AWS Lambda function whenever objects are created/deleted. The Lambda function would store/update information in a database (eg DynamoDB) that can provide very fast access to the list of objects. You would need to code this solution.
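As a rough sketch of that approach (the table name objects-index, its key schema, and the event wiring for both ObjectCreated and ObjectRemoved notifications are all assumptions):

import boto3
from urllib.parse import unquote_plus

table = boto3.resource('dynamodb').Table('objects-index')  # hypothetical table name

def handler(event, context):
    for record in event['Records']:
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record['s3']['object']['key'])
        if record['eventName'].startswith('ObjectRemoved'):
            table.delete_item(Key={'key': key})
        else:
            # Store the event time so the newest objects can be queried quickly
            table.put_item(Item={'key': key, 'event_time': record['eventTime']})

To answer "newest N keys" queries efficiently you would likely also want a global secondary index with a constant partition key and event_time as the sort key, but that is a design detail beyond this sketch.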
I have a Glue job that looks at the files for the current date (each date has a folder in S3, e.g. "s3://bucket_name/year/month/day") and processes the data in that folder. Now I want to find a way to define the input S3 path that tells Glue to look at both the previous day and the current day. Is there a way to do this?
current_glue_input_path = "s3://bucket_name/2021/08/12"
I want to find a regex expression (maybe a wildcard?) that tells Glue to look at both "s3://bucket_name/2021/08/11" and "s3://bucket_name/2021/08/12". Is there a way to do so?
From this documentation, under the 'Example of Excluding a Subset of Amazon S3 Partitions' section:
The second part, 2015/0[2-9]/**, excludes days in months 02 to 09, in year 2015.
Not sure if this makes sense, can someone help please? Thanks.
(I just realized that this documentation is the regex for Glue crawler, I'm talking about the Glue job, am I looking at the wrong place...?)
Would calculating the current and previous dates programmatically work? Python sample below:
from datetime import datetime, timedelta

today = datetime.today()
yesterday = today - timedelta(days=1)

# Build year/month/day paths matching the S3 folder layout
current_glue_input_path = f"s3://bucket_name/{today.strftime('%Y/%m/%d')}"
yesterday_glue_input_path = f"s3://bucket_name/{yesterday.strftime('%Y/%m/%d')}"
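Both paths could then be passed as a list to whatever reader the job uses, for example via the DynamicFrame reader (the format and the glueContext variable name are assumptions about your job):

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': [yesterday_glue_input_path, current_glue_input_path]},
    format='json'  # assumption - adjust to whatever your data actually is
)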
I was wondering if anyone knew what exactly an S3 prefix is and how it interacts with Amazon's published S3 rate limits:
Amazon S3 automatically scales to high request rates. For example,
your application can achieve at least 3,500 PUT/POST/DELETE and 5,500
GET requests per second per prefix in a bucket. There are no limits to
the number of prefixes in a bucket.
While that's really clear, I'm not quite certain what a prefix actually is.
Does a prefix require a delimiter?
If we have a bucket where we store all files at the "root" level (completely flat, without any prefixes/delimiters), does that count as a single "prefix", and is it subject to the rate limits posted above?
The way I'm interpreting Amazon's documentation suggests to me that this IS the case, and that the flat structure would be considered a single "prefix" (i.e. it would be subject to the published rate limits above).
Suppose that your bucket (admin-created) has four objects with the
following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
The s3-dg.pdf key does not have a prefix, so its object appears
directly at the root level of the bucket. If you open the Development/
folder, you see the Projects.xlsx object in it.
In the above example would s3-dg.pdf be subject to a different rate limit (5500 GET requests /second) than each of the other prefixes (Development/Finance/Private)?
What's more confusing is that I've read a couple of blogs about Amazon using the first N bytes as a partition key and encouraging the use of high-cardinality prefixes; I'm just not sure how that interacts with a bucket with a "flat file structure".
You're right, the announcement seems to contradict itself. It's just not written properly, but the information is correct. In short:
Each prefix can achieve up to 3,500/5,500 requests per second, so for many purposes, the assumption is that you wouldn't need to use several prefixes.
Prefixes are considered to be the whole path (up to the last '/') of an object's location, and are no longer hashed only by the first 6-8 characters. Therefore it would be enough to just split the data between any two "folders" to achieve twice the maximum requests per second (provided requests are divided evenly between the two).
For reference, here is a response from AWS support to my clarification request:
Hello Oren,
Thank you for contacting AWS Support.
I understand that you read AWS post on S3 request rate performance
being increased and you have additional questions regarding this
announcement.
Before this upgrade, S3 supported 100 PUT/LIST/DELETE requests per
second and 300 GET requests per second. To achieve higher performance,
a random hash / prefix schema had to be implemented. Since last year
the request rate limits increased to 3,500 PUT/POST/DELETE and 5,500
GET requests per second. This increase is often enough for
applications to mitigate 503 SlowDown errors without having to
randomize prefixes.
However, if the new limits are not sufficient, prefixes would need to
be used. A prefix has no fixed number of characters. It is any string
between a bucket name and an object name, for example:
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Prefixes of the object 'file' would be: /folder1/sub1/ ,
/folder1/sub2/, /1/, /2/. In this example, if you spread reads
across all four prefixes evenly, you can achieve 22,000 requests per
second.
This looks like it is (somewhat obscurely) addressed in an Amazon release announcement:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
Performance scales per prefix, so you can use as many prefixes as you
need in parallel to achieve the required throughput. There are no
limits to the number of prefixes.
This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster performance.
That means you can now use logical or sequential naming patterns in S3
object naming without any performance implications. This improvement
is now available in all AWS Regions. For more information, visit the
Amazon S3 Developer Guide.
S3 prefixes used to be determined by the first 6-8 characters;
This changed in mid-2018 - see the announcement:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
But that is only half the truth. Actually, prefixes (in the old sense) still matter.
S3 is not traditional "storage" - each directory/filename is a separate object in a key/value object store. The data also has to be partitioned/sharded to scale to a quadzillion objects. So yes, this new sharding is kind of "automatic", but not really if you create a new process that writes to it with crazy parallelism across different subdirectories. Before S3 learns the new access pattern, you may run into S3 throttling before it reshards/repartitions the data accordingly.
Learning new access patterns takes time. Repartitioning of the data takes time.
Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data is partitioned properly. Although, to be fair, this may not apply to you if you don't have a ton of data, or if your access pattern isn't hugely parallel (e.g. running a Hadoop/Spark cluster on many TBs of data in S3 with hundreds+ of tasks accessing the same bucket in parallel).
TLDR:
"Old prefixes" still do matter.
Write data to the root of your bucket; the first-level directory there will determine the "prefix" (make it random, for example).
"New prefixes" do work, but not initially - it takes time for S3 to adapt to the load.
PS. Another approach - you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to be flooding it soon.
In order for AWS to handle billions of requests per second, they need to shard up the data so it can optimise throughput. To do this they split the data into partitions based on the first 6 to 8 characters of the object key. Remember S3 is not a hierarchical filesystem, it is only a key-value store, though the key is often used like a file path for organising data, prefix + filename.
Now this is not an issue if you expect less than 100 requests per second, but if you have serious requirements over that then you need to think about naming.
For maximum parallel throughput you should consider how your data is consumed and use the most varying characters at the beginning of your key, or even generate 8 random characters for the first 8 characters of the key.
e.g. assuming first 6 characters define the partition:
files/user/bob would be bad as all the objects would be on one partition files/.
2018-09-21/files/bob would be almost as bad if only today's data is being read from partition 2018-0, but slightly better if objects from past years are also being read.
bob/users/files would be pretty good if different users are likely to be using the data at the same time from partition bob/us. But not so good if Bob is by far the busiest user.
3B6EA902/files/users/bob would be best for performance but more challenging to reference, where the first part is a random string, this would be pretty evenly spread.
Depending on your data, you need to think of any one point in time, who is reading what, and make sure that the keys start with enough variation to partition appropriately.
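For illustration, a minimal sketch of the random-prefix idea (the key layout and helper function are just examples, not anything prescribed by S3):

import secrets

def make_key(user, filename):
    # 8 random hex characters up front spread objects evenly across partitions
    return f"{secrets.token_hex(4)}/files/users/{user}/{filename}"

# e.g. make_key('bob', 'report.xls') might return '3b6ea902/files/users/bob/report.xls'

The trade-off, as noted above, is that such keys are harder to reference without some external index mapping the random part back to the logical name.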
For your example, let's assume the partition is taken from the first 6 characters of the key:
for the key Development/Projects1.xls the partition key would be Develo
for the key Finance/statement1.pdf the partition key would be Financ
for the key Private/taxdocument.pdf the partition key would be Privat
for the key s3-dg.pdf the partition key would be s3-dg.
The upvoted answer on this was a bit misleading for me.
If these are the paths
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Your prefix for file would actually be
folder1/sub1/
folder1/sub2/
1/
2/
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
Please see the docs. I had issues with the leading '/' when trying to list keys with the Airflow S3Hook.
In case you query S3 using Athena, EMR/Hive or Redshift Spectrum, increasing the number of prefixes could mean adding more partitions (as the partition id is part of the prefix). If you use datetime as (one of) your partition keys, the number of partitions (and prefixes) will automatically grow as new data is added over time, and the total maximum S3 GETs per second will grow as well.
S3 - What Exactly Is A Prefix?
S3 recently updated their document to better reflect this.
"A prefix is a string of characters at the beginning of the object key name. A prefix can be any length, subject to the maximum length of the object key name (1,024 bytes). "
From - https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
Note: "You can use another character as a delimiter. There is nothing unique about the slash (/) character, but it is a very common prefix delimiter."
As long as two objects have different prefixes, S3 will provide the documented throughput over time.
Update: https://docs.aws.amazon.com/general/latest/gr/glos-chap.html#keyprefix reflecting the updated definition.
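A quick way to see prefixes and delimiters in action is a delimited listing, e.g. with boto3 (the bucket name and prefix here are placeholders):

import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='Development/', Delimiter='/')

# Keys directly under the prefix come back in 'Contents';
# anything "deeper" is rolled up into 'CommonPrefixes'.
for obj in response.get('Contents', []):
    print(obj['Key'])
for cp in response.get('CommonPrefixes', []):
    print(cp['Prefix'])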
I am running a PySpark job on EMR (5.5.1) with Spark 2.1.0, Hadoop 2.7.3, Hive 2.1.1, Sqoop 1.4.6 and Ganglia 3.7.2, which loads data from S3. There are multiple buckets that contain input files, so I have a function that uses boto to traverse them and filter them according to some pattern.
Cluster Size: Master => r4.xlarge , Workers => 3 x r4.4xlarge
Problem:
The function getFilePaths returns a list of S3 paths which is fed directly to the Spark DataFrame load method.
Using Dataframe
file_list = getFilePaths() # ['s3://some_bucket/log.json.gz','s3://some_bucket/log2.json.gz']
schema = getSchema() # for mapping to the json files
df = sparkSession.read.format('json').load(file_list, schema=schema)
Using RDD
master_rdd = sparkSession.sparkContext.union(
    map(lambda file: sparkSession.sparkContext.textFile(file), file_list)
)
df = sparkSession.createDataFrame(master_rdd, schema=schema)
The file_list can be huge (up to 500k files) due to the large amount of data and files. Calculating these paths only takes 5-20 minutes, but when trying to load them as a DataFrame with Spark, the Spark UI remains inactive for hours, i.e. not processing anything at all. The inactivity period for 500k files is above 9 hours, while for 100k files it is around 1.5 hours.
Viewing Ganglia metrics shows that only the driver is running/processing while the workers are idle. No logs are generated until the Spark job has finished, and I haven't had any success with 500k files.
I have tried the s3 and s3n connectors, but with no success.
Questions:
What is the root cause of this delay?
How can I debug it properly?
In general, Spark/Hadoop prefer to have large files they can split instead of huge numbers of small files. One approach you might try though would be to parallelize your file list and then load the data in a map call.
I don't have the resources right now to test this out, but it should be something similar to this:
import boto3
import json

file_list = getFilePaths()
schema = getSchema()  # for mapping to the json files

paths_rdd = sc.parallelize(file_list)

def get_data(path):
    # A new boto3 client per file; see the mapPartitions note below
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, path)  # bucket and key need to be split out of the full s3:// path
    data = obj.get()['Body'].read().decode('utf-8')
    return [json.loads(r) for r in data.split('\n') if r]

rows_rdd = paths_rdd.flatMap(get_data)
df = spark.createDataFrame(rows_rdd, schema=schema)
You could also make this a little more efficient by using mapPartitions instead, so you don't need to recreate the S3 client for every file.
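A rough sketch of that mapPartitions variant, under the same assumptions as the snippet above about how bucket and key are derived:

def get_data_partition(paths):
    # One boto3 client per partition rather than per file
    s3 = boto3.resource('s3')
    for path in paths:
        obj = s3.Object(bucket, path)  # bucket/key split still assumed, as above
        data = obj.get()['Body'].read().decode('utf-8')
        for r in data.split('\n'):
            if r:
                yield json.loads(r)

rows_rdd = paths_rdd.mapPartitions(get_data_partition)
df = spark.createDataFrame(rows_rdd, schema=schema)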
EDIT 6/14/18:
With regards to handling the gzip data, you can decompress a stream of gzip data using Python as detailed in this answer: https://stackoverflow.com/a/12572031/1461187 . Basically just pass obj.get()['Body'].read() into the function defined in that answer.
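Alternatively, for small objects, the standard library's gzip module can decompress the whole body in one go; a sketch adjusting the get_data function above (same assumptions as before):

import gzip

def get_data(path):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, path)  # bucket/key split still assumed
    # gzip.decompress handles the whole object in memory, which is fine for small files
    data = gzip.decompress(obj.get()['Body'].read()).decode('utf-8')
    return [json.loads(r) for r in data.split('\n') if r]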
There are two performance issues surfacing here:
Reading the files: gzip files can't be split so that their workload can be shared across workers, though with 50 MB files there's little benefit in splitting things up anyway.
The way the S3 connectors Spark uses mimic a directory structure is a real performance killer for complex directory trees.
Issue #2 is what slows down partitioning: the initial work of deciding what to do, which happens before any of the computation.
How would I go about trying to deal with this? Well, there's no magic switch here. But
Have fewer, bigger files; as noted, Avro is good, and so are Parquet and ORC.
Use a very shallow directory tree. Are these files all in one single directory, or in a deep directory tree? The latter is worse.
Coalesce the files first.
I'd also avoid any kind of schema inference; it sounds like you aren't doing that (good!), but for anyone else reading this answer: know that for CSV and presumably JSON, schema inference means "read through all the data once just to work out the schema"
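For anyone unsure what avoiding schema inference looks like in practice, it just means handing the reader an explicit schema instead of letting it scan the data first; the field names below are made up for illustration:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical fields - define whatever your JSON records actually contain
schema = StructType([
    StructField('event_id', StringType()),
    StructField('timestamp', TimestampType()),
    StructField('payload', StringType()),
])

df = spark.read.schema(schema).json(file_list)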
So this problem has been driving me nuts, and it is starting to feel like Spark with S3 is not the right tool for this specific job. Basically, I have millions of smaller files in an S3 bucket. For reasons I can't necessarily get into, these files cannot be consolidated (for one, they are unique encrypted transcripts). I have seen similar questions to this one, and every single solution has not produced good results. The first thing I tried was wildcards:
sc.wholeTextFiles(s3aPath + "/*/*/*/*.txt").count();
Note: the count was more for debugging how long it would take to process the files. This job took almost an entire day with over 10 instances, but still failed with the error posted at the bottom of the listing. I then found this link, where it basically says this isn't optimal: https://forums.databricks.com/questions/480/how-do-i-ingest-a-large-number-of-files-from-s3-my.html
Then I decided to try another solution that I can't find at the moment, which said to load all of the paths, then union all of the RDDs:
ObjectListing objectListing = s3Client.listObjects(bucket);
List<JavaPairRDD<String, String>> rdds = new ArrayList<>();
List<JavaPairRDD<String, String>> tempMeta = new ArrayList<>();

// initializes objectListing
tempMeta.addAll(objectListing.getObjectSummaries().stream()
        .map(func)
        .filter(item -> item != null && item.getMediaType().equalsIgnoreCase("transcript"))
        .map(item -> SparkConfig.getSparkContext().wholeTextFiles("s3a://" + bucket + "/" + item.getFileName()))
        .collect(Collectors.toList()));

while (objectListing.isTruncated()) {
    objectListing = s3Client.listNextBatchOfObjects(objectListing);
    tempMeta.addAll(objectListing.getObjectSummaries().stream()
            .map(func)
            .filter(item -> item != null && item.getMediaType().equalsIgnoreCase("transcript"))
            .map(item -> SparkConfig.getSparkContext().wholeTextFiles("s3a://" + bucket + "/" + item.getFileName()))
            .collect(Collectors.toList()));
    if (tempMeta.size() > 5000) {
        rdds.addAll(tempMeta);
        tempMeta = new ArrayList<>();
    }
}

if (!tempMeta.isEmpty()) {
    rdds.addAll(tempMeta);
}
return SparkConfig.getSparkContext().union(rdds.get(0), rdds.subList(1, rdds.size()));
Then, even when I set the emrfs-site config to:
{
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent.retryPolicyType": "fixed",
        "fs.s3.consistent.retryPeriodSeconds": "15",
        "fs.s3.consistent.retryCount": "20",
        "fs.s3.enableServerSideEncryption": "true",
        "fs.s3.consistent": "false"
    }
}
I got this error within 6 hours of every time I tried running the job:
17/02/15 19:15:41 INFO AmazonHttpClient: Unable to execute HTTP request: randomBucket.s3.amazonaws.com:443 failed to respond
org.apache.http.NoHttpResponseException: randomBucket.s3.amazonaws.com:443 failed to respond
So first, is there a way to use smaller files with Spark from S3? I don't care if the solution is suboptimal, I just want to try to get something working. I thought about trying Spark Streaming, since its internals are a little different for loading all of the files; I would then use fileStream and set newFiles to false, and batch process them. However, that is not what Spark Streaming was built for, so I am conflicted about going that route.
As a side note, I generated millions of small files into HDFS and tried the same job, and it finished within an hour. This makes me feel like it is S3-specific. Also, I am using s3a, not the ordinary s3.
If you are using Amazon EMR, then you need to use s3:// URLs; the s3a:// ones are for the ASF releases.
A big issue is just how long it takes to list directory trees in S3, especially that recursive tree walk. The Spark code assumes it's a fast filesystem where listing directories and stat-ing files is low cost, whereas in fact each operation takes 1-4 HTTPS requests, which, even on reused HTTP/1.1 connections, hurts. It can be so slow you can see the pauses in the log.
Where this really hurts is that it is the up front partitioning where a lot of the delay happens, so it's the serialized bit of work which is being brought to its knees.
Although there's some speedup in treewalking on S3A coming in Hadoop 2.8 as part of the S3A phase II work, wildcard scans of the /*/*/*/*.txt form aren't going to get any speedup. My recommendation is to try to flatten your directory structure so that you move from a deep tree to something shallow, maybe even all in the same directory, so that it can be scanned without the walk, at a cost of one LIST request per 1,000 entries.
Bear in mind that many small files are pretty expensive anyway, including in HDFS, where they use up storage. There's a special aggregate format, HAR files, which are like tar files except that Hadoop, Hive and Spark can all work inside the file itself. That may help, though I've not seen any actual performance test figures there.