How do I get the last modified date of a directory in Amazon S3? - amazon-web-services

So I'm aware that Amazon S3 doesn't really have directories. My question is: does this make it impossible to reliably get the last-modified timestamp of a "directory" in S3?
I know you can get the last-modified date of a file, as in this question.
I say "reliably" because it would be possible to define the latest last-modified timestamp of a file inside a directory as the last-modified timestamp of the directory. But that's not really accurate, since if a file inside a directory gets deleted, it wouldn't register as a change to that directory (indeed the deletion might cause the last-modified date to go backwards in time).
We're using boto to scrape S3.

If its really important for you to know this, you could develop a solution using the S3 event notifications. Each time a file is put or deleted from a folder you can have either an SNS or Lamba event get fired, and you could use that information to update a table/log someplace where this information is kept for use when you need it.
Probably not a ton of work to do it, but if its critical to know, it is an avenue worth exploring.
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

Since what we label as a directory is just part of the object name, there is no creation time, modify time, etc. since it does not really exist as an entity on its own. The object has a path, and when you add '/' to the name, client presentation applications treat that as a separator, split the name, and make it look like a path. Like you suggested, there is no directory, and this is where that concept really is different than a traditional file system and how end users interact with it.
I suggest asking what you are trying to do and why the timestamp of the directory is important. E.J. Brennan suggests what you may be trying to do and is not a bad idea for the case he mentions. There is likely a different way to skin your cat.

Related

S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree

We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar hadoop-aws-2.7.5.jar to write parquet files to an S3 bucket.
We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7' in s3 (d7 being a text file). We also have keys 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet' (a.parquet being a file)
When we run a spark job to write b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/' (ie would like to have 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet' get created in s3) we get the below error
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
As discussed in HADOOP-15542. you can't have files under directories in a "normal" FS; you don't get them in the S3A connector, at least where it does enough due diligence.
It just confuses every single tree walking algorithm, renames, deletes, anything which scans for files. This will include the spark partitioning logic. That new directory tree you are trying to create would probably appear invisible to callers. (you could test this by creating it, doing the PUT of that text file into place, see what happens)
We try to define what an FS should do in The Hadoop Filesystem Specification, including defining things "so obvious" that nobody bothered to write them down or write tests for, such as
Only directories can have children
All children must have a parent
Only files can have data (exception: ReiserFS)
Files are as long as they say they are (this is why S3A doesn't support client-side encryption, BTW).
Every so often we discover some new thing we forgot to consider, which "real" filesystems enforce out the box, but which object stores don't. Then we add tests, try our best to maintain the metaphor except when the performance impact would make it unusable. Then we opt not to fix things and hope nobody notices. Generally, because people working with data in the hadoop/hive/spark space have those same preconceptions of what a filesystem does, those ambiguities don't actually cause problems in production.
Except of course eventual consistency, which is why you shouldn't be writing data straight to S3 from spark without a consistency service (S3Guard, consistent EMRFS), or a commit protocol designed for this world (S3A Committer, databricks DBIO).

OSX- Auto Delete file after x-time

Can we add metadata to unlink/remove a file after x-time automatically. That is system automatically removes that file, if it finds that particular metadata attached with that file
Note- file can be present at any location, and user may move that file anywhere on their system, but based on that metadata file should get deleted(i.e system should call unlink/remove) for that file.
Is there a cocoa/objective-c/c++ api to set such metadata/attributes of a file?
The main point is i am creating an application through which i am providing some trial files to the user, and those files are also usable by other application which recognises them. After trial expiry, i want to delete those files, but user can always move my files to a different location and use them forever, how to protect those files from permanent use?
No, there is no built-in mechanism to auto-delete a file based on some metadata.
You could add the feature yourself, with an accompanying agent that would trawl for files with the metadata and delete them when the time came.
If you are doing this for good housekeeping you can follow #Petesh answer.
If you are doing this because you really want those files gone then no. The user could move the file to a USB stick and remove it, or edit the metadata, etc.
Your earlier question "Completely restricting all types of access to a folder" seems to addressing the same issue and the suggestions are the same as given there - use encryption or implement your own file system.
E.g. have a special "trial file" format which is the same as the ordinary format - which is readable by other apps - but encrypted and includes an expiry date. Your app then decrypts the file, checks the date, and either does its thing or reports to the user the file is out of date.
The system isn't unbreakable, but its a reasonable barrier - easy for you to do, too hard for the average user to break.

What are the best practices for user uploads with S3?

I was wondering what you recommend for running a user upload system with s3. I plan on using MongoDB for storing metadata such as the uploader, size, etc. How should I go about storing the actual file in s3.
Here are some of my ideas, what do you think is the best? All of these examples would involve saving the metadata to MongoDB.
1.Should I just store all the files in a bucket?
2. Maybe organize them into dates (e.g. 6/8/2014/mypicture.png)?
3.Should I save them all in one bucket, but with an added string (such as d1JdaZ9-mypicture.png) to avoid duplicates.
4. Or should I generate a long string for a folder, and store the file in that folder. (to retain the original file name). e.g. sh8sb36zkj391k4dhqk4n5e4ndsqule6/mypicture.png
This depends primarily on how you intend to use the pictures and which objects/classes/modules/etc. in your code will actually deal with retrieving them.
If you find yourself wanting to do things like - "all user uploads on a particular day" - A simple naming convention with folders for the year, month and day along with a folder at the top level for the user's unique ID will solve the problem.
If you want to ensure uniqueness and avoid collisions in your bucket, you could generate a unique string too.
However, since you've got MongoDB which (i'm assuming) will actually handle these queries for user uploads by date, etc., it makes the choice of your bucket more aesthetic than functional.
If all you're storing in mongoDB is the key/URL, it doesn't really matter what the actual structure of your bucket is. Nevertheless, it makes sense to still split this up in some coherent way - maybe group all a user's uploads and give each a unique name (either generate a unique name or prefix a unique prefix to the file name).
That being said, do you think there might be a point when you might look at changing how your images are stored? You might move to a CDN. A third party might come up with an even cheaper/better product which you might want to try. In a case like that, simply storing the keys/URLs in your MongoDB is not a good idea since you'll have to update every entry.
To make this relatively future-proof, I suggest you give your uploads a definite structure. I usually opt for:
bucket_name/user_id/yyyy/mm/dd/unique_name.jpg
Your database then only needs to store the file name and the upload time stamp.
You can introduce a middle layer in your logic (a new class perhaps or just a helper function/method) which then generates the URL for a file based on this info. That way, if you change your storage method later, you only need to make a small change in this middle layer (after migrating your files of course) and not worry about MongoDB.

What's wrong with this user ignore file for Mercurial?

A little retrospective now that I've settled into Mercurial. Forget forget files combined with hg remove. It's crazy and bass-ackwards. You can use hg remove once you've established that something is in a forget file that isn't forgetting because the item in question was tracked before the original repo was created. Note that hg remove effectively clears tracked status but it also schedules the file for deletion in anything that gets changes from your repo. If ignored, however the tracking deactivation still happens but that delete-me change set won't ever reach another repo and for some reason will never delete in yours which IMO is counter-intuitive. It is a very sure sign that somebody and I don't know these guys, is UNWILLING TO COMPROMISE ON DUH DESIGN PROBLEMS. The important thing to understand is that you don't determine what's important, Mercurial does. Except when you're merging on a pull of course. It's entirely reasonable then. But I digress...
Ignore-file/remove is a good combo for already-tracked but very specific files you want forgotten but if you're dealing with a larger quantity of built files determined with broader patterns it's not worth the risk. Just go with a double-repo and pull -u from the remote repo to your syncing repo and then pull -u commits from your working repo and merge in a repo whose sole purpose is to merge changes and pass them on in a place where your not-quite tracked or untracked files (behavior is different when pulling rather than pushing of course because hey, why be consistent?) won't cause frustration. Trust me. The idea that you should have to have two repos just to get 'er done offends for good reason AND THAT SO MANY OF US ARE DOING IT should suggest a serioush !##$ing design problem, but it's much less painful than all the other awful things that will make you regret seeking a sensible alternative.
And use hg help. It's actually Mercurial's best feature and often better than the internet (which I don't fault for confusion on the matter of all things hg) for getting answers to everything that is confusing and counter-intuitive in this VCS.
/retrospective
# switch to regexp syntax.
syntax: regexp
#Config Files
#.Net
^somecompany\.Net[\\/]MasterSolution[\\/]SomeSolution[\\/]SomeApp[\\/]app\.config
^somecompany\.Net[\\/]MasterSolution[\\/]SomeSolution[\\/]SomeApp_test[\\/]App\.config
#and more of the same following
And in my mercurial.ini at the root of my user directory
[ui]
username = ereppen
merge = bcomp
ignore = C:\<path to user ignore file>\.hgignore-config
Context:
I wrote an auto-config utility in node. I just want changes to the files it changes to get ignored. We have two teams and both aren't on the same page with making this a universal thing so it needs to be user-specific for now.
The config file is in place and pointed at by my ini file. I clone. I run the config utility and change the files and stat reveals a list of every single file with an M next to it. I thought it was the utf-8 thing and explicitly set the file to utf-16 little endian. I don't think I'm doing with the regEx that any modern flavor of regEx worth actually calling regEx wouldn't support.
The .hgignore file has no effect for files that are tracked. Its function is to stop you from seeing files you want ignored listed as "untracked". If you're seeing "M" then they're already added (you got them with the clone) so .hgignore does nothing.
The usual way config files that differ from machine to machine are handled is to put a app.config.sample in source control, have app.config in .hgignore and have people do a copy when they're making their config edits.
Alternately if your config files allow for includes and overrides you end them with include app-local.config and override any settings in a app-local.config which you don't add and do include in .hgignore.

Easiest way to sign/certify text file in C++?

I want to verify if the text log files created by my program being run at my customer's site have been tampered with. How do you suggest I go about doing this? I searched a bunch here and google but couldn't find my answer. Thanks!
Edit: After reading all the suggestions so far here are my thoughts. I want to keep it simple, and since the customer isn't that computer savy, I think it is safe to embed the salt in the binary. I'll continue to search for a simple solution using the keywords "salt checksum hash" etc and post back here once I find one.
Obligatory preamble: How much is at stake here? You must assume that tampering will be possible, but that you can make it very difficult if you spend enough time and money. So: how much is it worth to you?
That said:
Since it's your code writing the file, you can write it out encrypted. If you need it to be human readable, you can keep a second encrypted copy, or a second file containing only a hash, or write a hash value for every entry. (The hash must contain a "secret" key, of course.) If this is too risky, consider transmitting hashes or checksums or the log itself to other servers. And so forth.
This is a quite difficult thing to do, unless you can somehow protect the keypair used to sign the data. Signing the data requires a private key, and if that key is on a machine, a person can simply alter the data or create new data, and use that private key to sign the data. You can keep the private key on a "secure" machine, but then how do you guarantee that the data hadn't been tampered with before it left the original machine?
Of course, if you are protecting only data in motion, things get a lot easier.
Signing data is easy, if you can protect the private key.
Once you've worked out the higher-level theory that ensures security, take a look at GPGME to do the signing.
You may put a checksum as a prefix to each of your file lines, using an algorithm like adler-32 or something.
If you do not want to put binary code in your log files, use an encode64 method to convert the checksum to non binary data. So, you may discard only the lines that have been tampered.
It really depends on what you are trying to achieve, what is at stakes and what are the constraints.
Fundamentally: what you are asking for is just plain impossible (in isolation).
Now, it's a matter of complicating the life of the persons trying to modify the file so that it'll cost them more to modify it than what they could earn by doing the modification. Of course it means that hackers motivated by the sole goal of cracking in your measures of protection will not be deterred that much...
Assuming it should work on a standalone computer (no network), it is, as I said, impossible. Whatever the process you use, whatever the key / algorithm, this is ultimately embedded in the binary, which is exposed to the scrutiny of the would-be hacker. It's possible to deassemble it, it's possible to examine it with hex-readers, it's possible to probe it with different inputs, plug in a debugger etc... Your only option is thus to make debugging / examination a pain by breaking down the logic, using debug detection to change the paths, and if you are very good using self-modifying code. It does not mean it'll become impossible to tamper with the process, it barely means it should become difficult enough that any attacker will abandon.
If you have a network at your disposal, you can store a hash on a distant (under your control) drive, and then compare the hash. 2 difficulties here:
Storing (how to ensure it is your binary ?)
Retrieving (how to ensure you are talking to the right server ?)
And of course, in both cases, beware of the man in the middle syndroms...
One last bit of advice: if you need security, you'll need to consult a real expert, don't rely on some strange guys (like myself) talking on a forum. We're amateurs.
It's your file and your program which is allowed to modify it. When this being the case, there is one simple solution. (If you can afford to put your log file into a seperate folder)
Note:
You can have all your log files placed into a seperate folder. For eg, in my appplication, we have lot of DLLs, each having it's own log files and ofcourse application has its own.
So have a seperate process running in the background and monitors the folder for any changes notifications like
change in file size
attempt to rename the file or folder
delete the file
etc...
Based on this notification, you can certify whether the file is changed or not!
(As you and others may be guessing, even your process & dlls will change these files that can also lead to a notification. You need to synchronize this action smartly. That's it)
Window API to monitor folder in given below:
HANDLE FindFirstChangeNotification(
LPCTSTR lpPathName,
BOOL bWatchSubtree,
DWORD dwNotifyFilter
);
lpPathName:
Path to the log directory.
bWatchSubtree:
Watch subfolder or not (0 or 1)
dwNotifyFilter:
Filter conditions that satisfy a change notification wait. This parameter can be one or more of the following values.
FILE_NOTIFY_CHANGE_FILE_NAME
FILE_NOTIFY_CHANGE_DIR_NAME
FILE_NOTIFY_CHANGE_SIZE
FILE_NOTIFY_CHANGE_SECURITY
etc...
(Check MSDN)
How to make it work?
Suspect A: Our process
Suspect X: Other process or user
Inspector: The process that we created to monitor the folder.
Inpector sees a change in the folder. Queries with Suspect A whether he did any change to it.
if so,
change is taken as VALID.
if not
clear indication that change is done by *Suspect X*. So NOT VALID!
File is certified to be TAMPERED.
Other than that, below are some of the techniques that may (or may not :)) help you!
Store the time stamp whenever an application close the file along with file-size.
The next time you open the file, check for the last modified time of the time and its size. If both are same, then it means file remains not tampered.
Change the file privilege to read-only after you write logs into it. In some program or someone want to tamper it, they attempt to change the read-only property. This action changes the date/time modified for a file.
Write to your log file only encrypted data. If someone tampers it, when we decrypt the data, we may find some text not decrypted properly.
Using compress and un-compress mechanism (compress may help you to protect the file using a password)
Each way may have its own pros and cons. Strength the logic based on your need. You can even try the combination of the techniques proposed.