What are the best practices for user uploads with S3?

I was wondering what you recommend for running a user upload system with S3. I plan on using MongoDB to store metadata such as the uploader, size, etc. How should I go about storing the actual files in S3?
Here are some of my ideas; which do you think is best? All of these options would involve saving the metadata to MongoDB.
1. Should I just store all the files in a bucket?
2. Should I organize them into dates (e.g. 6/8/2014/mypicture.png)?
3. Should I save them all in one bucket, but with an added string (such as d1JdaZ9-mypicture.png) to avoid duplicates?
4. Or should I generate a long string for a folder and store the file in that folder (to retain the original file name), e.g. sh8sb36zkj391k4dhqk4n5e4ndsqule6/mypicture.png?

This depends primarily on how you intend to use the pictures and which objects/classes/modules/etc. in your code will actually deal with retrieving them.
If you find yourself wanting to do things like "fetch all user uploads on a particular day", a simple naming convention with folders for the year, month, and day, along with a top-level folder for the user's unique ID, will solve the problem.
If you want to ensure uniqueness and avoid collisions in your bucket, you could generate a unique string too.
However, since you've got MongoDB, which (I'm assuming) will actually handle these queries for user uploads by date, etc., the structure of your bucket becomes more aesthetic than functional.
If all you're storing in MongoDB is the key/URL, it doesn't really matter what the actual structure of your bucket is. Nevertheless, it makes sense to split things up in some coherent way: for example, group all of a user's uploads together and give each one a unique name (either generate a fresh name or prepend a unique prefix to the original file name).
That being said, do you think there might come a point when you change how your images are stored? You might move to a CDN, or a third party might come up with an even cheaper/better product you want to try. In a case like that, storing the full keys/URLs in your MongoDB is not a good idea, since you'd have to update every entry.
To make this relatively future-proof, I suggest you give your uploads a definite structure. I usually opt for:
bucket_name/user_id/yyyy/mm/dd/unique_name.jpg
Your database then only needs to store the file name and the upload timestamp.
You can introduce a middle layer in your logic (a new class perhaps or just a helper function/method) which then generates the URL for a file based on this info. That way, if you change your storage method later, you only need to make a small change in this middle layer (after migrating your files of course) and not worry about MongoDB.
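As a minimal sketch of such a middle layer in Python (the base URL, bucket name, and function names here are assumptions for illustration, not a fixed API):

from datetime import datetime

S3_BASE_URL = "https://s3.amazonaws.com"  # assumption: path-style URL; swap for a CDN later
BUCKET_NAME = "my-uploads-bucket"         # hypothetical bucket name

def upload_url(user_id, uploaded_at, unique_name):
    # Rebuild the full URL from the metadata stored in MongoDB.
    key = "{}/{:%Y/%m/%d}/{}".format(user_id, uploaded_at, unique_name)
    return "{}/{}/{}".format(S3_BASE_URL, BUCKET_NAME, key)

print(upload_url("user42", datetime(2014, 6, 8), "d1JdaZ9-mypicture.png"))
# -> https://s3.amazonaws.com/my-uploads-bucket/user42/2014/06/08/d1JdaZ9-mypicture.png

If you later move storage providers, only S3_BASE_URL (and a file migration) changes; the documents in MongoDB stay untouched.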


Does Amazon S3 have a limit to the number of subfolders you can create?

I am thinking of using Amazon S3 to implement my own backup solution. The idea is to have a script that accepts a directory and recursively uploads all files underneath that directory to S3. However, I am not sure it would work, for the following reasons:
S3 apparently doesn't have folders.
S3 imposes a limit on the length of object names (1024 bytes).
I take this to mean that if an object is identified as "/foo/bar/baz.txt", then the "/foo/bar/" portion of that "filepath" is actually part of the object's name and counts toward the limit on object names. If this is true, I could see it becoming an issue when uploading deeply nested files with long filepaths (although 1024 bytes does seem fairly generous).
Am I understanding things correctly?
Yes, this is accurate.
S3 is a key/value store, not a filesystem, though backups are certainly something its authors expect it to be used for (as evidenced by the documentation's example keys being mostly filepaths!). If your computer has directory structures and filenames so long and so deeply nested that an entire path exceeds a thousand characters, I'd strongly recommend reorganising your hard drive!
If you can't do that and have lots of long paths, you may wish to try something other than a one-to-one mapping between the two. For example, you could store data blobs (the contents of a file) under a key that is some GUID, and keep a separate key/value store that maps GUIDs to filepaths (although that alone doesn't give you reverse lookup). Basically, do the same thing you'd do if you were structuring this efficiently in code, with algorithms and data structures; because, really, that's what you're doing here too!
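A rough sketch of that idea in Python, using SQLite as the separate GUID-to-filepath store (the database file, table, and column names are my own invention):

import sqlite3
import uuid

db = sqlite3.connect("backup_index.db")  # hypothetical side-car index database
db.execute("CREATE TABLE IF NOT EXISTS files (guid TEXT PRIMARY KEY, filepath TEXT)")

def key_for(filepath):
    # Map an arbitrarily long local path to a short, unique S3 object key.
    guid = uuid.uuid4().hex
    db.execute("INSERT INTO files (guid, filepath) VALUES (?, ?)", (guid, filepath))
    db.commit()
    return guid  # 32 characters, comfortably under the key-length limit

key = key_for("/very/deeply/nested/directory/with/a/long/path/file.txt")
# upload the file's bytes to S3 under `key`; restore paths by querying the table

With a relational store like this, the reverse lookup is just a SELECT on filepath.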
Putting backups aside and speaking more generally, if you were using subdirectories on disk only as a sort of metadata, there are other metadata properties you can use in S3 for that. But your object keys would still have to be unique across the whole dataset.
You can read more about S3 objects in the AWS documentation.

How do I get the last modified date of a directory in Amazon S3?

So I'm aware that Amazon S3 doesn't really have directories. My question is: does this make it impossible to reliably get the last-modified timestamp of a "directory" in S3?
I know you can get the last-modified date of a file, as in this question.
I say "reliably" because it would be possible to define the latest last-modified timestamp of a file inside a directory as the last-modified timestamp of the directory. But that's not really accurate, since if a file inside a directory gets deleted, it wouldn't register as a change to that directory (indeed the deletion might cause the last-modified date to go backwards in time).
We're using boto to scrape S3.
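For reference, that naive definition looks roughly like this in boto3 (the thread uses classic boto; the bucket and prefix are placeholders), and it shows exactly the caveat raised: deletions leave no trace.

import boto3

s3 = boto3.client("s3")

def latest_modified(bucket, prefix):
    # Newest LastModified among all objects under a "directory" prefix.
    latest = None
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if latest is None or obj["LastModified"] > latest:
                latest = obj["LastModified"]
    return latest  # None for an empty prefix; deleted objects simply vanish

print(latest_modified("my-bucket", "some/dir/"))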
If it's really important for you to know this, you could build a solution using S3 event notifications. Each time a file is put into or deleted from a folder, you can have an SNS or Lambda event fire, and use that information to update a table/log somewhere, kept for when you need it.
Probably not a ton of work to do, but if it's critical to know, it's an avenue worth exploring.
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
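A minimal sketch of such a handler as a Python Lambda function; the DynamoDB table name and attribute names are assumptions for illustration:

import boto3

table = boto3.resource("dynamodb").Table("folder-timestamps")  # hypothetical table

def handler(event, context):
    # Fired by S3 on object created/removed; record the change time per "folder".
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        folder = key.rsplit("/", 1)[0] if "/" in key else ""
        table.put_item(Item={
            "folder": folder,                     # partition key: the "directory"
            "last_changed": record["eventTime"],  # ISO-8601 string from the event
            "event": record["eventName"],         # e.g. ObjectCreated:Put
        })

Since put_item overwrites by partition key, the table always holds the most recent change per "folder", including deletions.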
Since what we label as a directory is just part of the object name, there is no creation time, modified time, etc. for it; it does not really exist as an entity of its own. An object has a key, and when you add '/' to the name, client applications treat that as a separator, split the name, and present it as a path. As you suggested, there is no directory, and this is where the concept really differs from a traditional file system and from how end users interact with it.
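You can see that client-side illusion directly: listing with a '/' delimiter makes S3 group keys into CommonPrefixes at query time, even though only flat keys exist. A quick boto3 sketch with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Delimiter="/")

for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # the "folders": key prefixes computed at listing time
for obj in resp.get("Contents", []):
    print(obj["Key"])    # keys with no further '/', i.e. "top-level files"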
I suggest asking yourself what you are trying to do and why the timestamp of the directory matters. E.J. Brennan's answer covers what you may be trying to do and is not a bad idea for the case he mentions. There is likely a different way to skin your cat.

Structure for storing data from thousands of files on a mobile device

I have more than 32000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0-400kb. I need to be able to access the content of these files randomly and at various time points. I don't like the idea of having 32000+ separate files of data installed on a mobile device (even though the total file size is < 100mb). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions as to what the best way to do this is. Any suggestions should have C/C++ libs for accessing the data and should have a liberal license that allows inclusion in commercial, closed-source applications without any issue.
The only thing I've thought of so far is storing everything in an SQLite database, though I'm not sure that's the best method, or what considerations I need to take into account for storing blob data with quick lookup times (i.e., what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling you which file starts where and how big it is.
30 lines of C++ code max. Invest a good 10 minutes designing a good interface for it so you could replace the implementation when and if the need occurs.
That is, of course, if the data is read-only. If you need to change it as you go, it gets hairy fast.
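A sketch of that layout, in Python for brevity (a C++ translation is mechanical; the 64-byte name cap and record format are choices I made up): a record count, fixed-size index entries of (name, offset, size), then the raw blobs.

import struct

def write_bundle(path, files):
    # files: dict of name -> bytes. Layout: [count][index records][raw blobs].
    names = sorted(files)
    offset = 4 + len(names) * 80  # count + one 80-byte index record per file
    with open(path, "wb") as out:
        out.write(struct.pack("<I", len(names)))
        for name in names:
            # name capped at 64 bytes in this sketch, then offset and size
            out.write(struct.pack("<64sQQ", name.encode(), offset, len(files[name])))
            offset += len(files[name])
        for name in names:
            out.write(files[name])

def read_index(path):
    # Returns name -> (offset, size); a seek+read then fetches any blob directly.
    index = {}
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        for _ in range(count):
            raw, off, size = struct.unpack("<64sQQ", f.read(80))
            index[raw.rstrip(b"\x00").decode()] = (off, size)
    return index

On the device you keep one file handle open and seek straight to any blob, which preserves the random access the 32000 separate files provided.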

YAHOO! Place Finder API

Is there a way to get all of the POI (Points of Interest) and AOI (Areas of Interest), let's say for a specific state?
I would like to auto-complete while someone is typing if the input matches any of that data, but I don't know how to get a list of it all.
I would even be fine with storing it in my database if need be, because I can't see it changing very often.
No, PlaceFinder itself does not provide a way to download its places database. It's only intended for individual lookups.
You might look at the GeoPlanet data downloads, which make the entire WOEID (Where On Earth ID) geo database available under a CC license.

Easiest way to sign/certify text file in C++?

I want to verify whether the text log files created by my program running at my customer's site have been tampered with. How do you suggest I go about doing this? I searched a bunch here and on Google but couldn't find my answer. Thanks!
Edit: After reading all the suggestions so far, here are my thoughts. I want to keep it simple, and since the customer isn't that computer savvy, I think it is safe to embed the salt in the binary. I'll continue to search for a simple solution using the keywords "salt checksum hash", etc., and post back here once I find one.
Obligatory preamble: How much is at stake here? You must assume that tampering will be possible, but that you can make it very difficult if you spend enough time and money. So: how much is it worth to you?
That said:
Since it's your code writing the file, you can write it out encrypted. If you need it to be human-readable, you can keep a second, encrypted copy, or a second file containing only a hash, or write a hash value for every entry. (The hash must incorporate a "secret" key, of course.) If this is too risky, consider transmitting hashes or checksums or the log itself to other servers. And so forth.
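For the hash-per-entry variant, here's a minimal Python sketch using an HMAC; the embedded key is a placeholder, and as the answers below note, embedding it in the binary only raises the bar:

import hashlib
import hmac

SECRET = b"replace-with-embedded-key"  # assumption: the salt baked into the shipped binary

def write_entry(log, message):
    # Append a log line followed by its keyed hash.
    tag = hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()
    log.write("{}\t{}\n".format(message, tag))

def entry_is_intact(line):
    # Recompute the HMAC; False means the line (or its tag) was altered.
    message, _, tag = line.rstrip("\n").rpartition("\t")
    expected = hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)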
This is quite a difficult thing to do, unless you can somehow protect the keypair used to sign the data. Signing the data requires a private key, and if that key is on a machine, a person can simply alter the data or create new data and use that private key to sign it. You can keep the private key on a "secure" machine, but then how do you guarantee that the data wasn't tampered with before it left the original machine?
Of course, if you are protecting only data in motion, things get a lot easier.
Signing data is easy, if you can protect the private key.
Once you've worked out the higher-level theory that ensures security, take a look at GPGME to do the signing.
You could put a checksum as a prefix on each of your file's lines, using an algorithm like Adler-32 or similar.
If you do not want binary data in your log files, use base64 encoding to convert the checksum to printable text. That way, you can discard only the lines that have been tampered with.
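A rough sketch of that per-line scheme in Python (zlib provides Adler-32, and base64 keeps the prefix printable):

import base64
import struct
import zlib

def checksummed(line):
    # Prefix a log line with the base64 of its Adler-32 checksum.
    cksum = struct.pack("<I", zlib.adler32(line.encode()))
    return base64.b64encode(cksum).decode() + " " + line

def line_ok(prefixed):
    # True if the stored checksum still matches the line's content.
    _, _, line = prefixed.partition(" ")
    return checksummed(line) == prefixed

Note that without a secret mixed in, anyone who knows the scheme can recompute the checksum after editing a line, so this mainly catches accidental corruption or casual tampering.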
It really depends on what you are trying to achieve, what is at stakes and what are the constraints.
Fundamentally: what you are asking for is just plain impossible (in isolation).
Now, it's a matter of complicating the life of anyone trying to modify the file, so that it costs them more to modify it than they could gain by the modification. Of course, it means that hackers motivated by the sole goal of cracking your protection measures will not be deterred that much...
Assuming it must work on a standalone computer (no network), it is, as I said, impossible. Whatever process you use, whatever key/algorithm, it is ultimately embedded in the binary, which is exposed to the scrutiny of the would-be hacker. It's possible to disassemble it, examine it with hex readers, probe it with different inputs, plug in a debugger, etc... Your only option is thus to make debugging and examination a pain: break up the logic, use debugger detection to change code paths, and, if you are very good, use self-modifying code. It does not mean tampering becomes impossible; it merely means it should become difficult enough that any attacker abandons the attempt.
If you have a network at your disposal, you can store a hash on a remote server (under your control) and then compare against it. Two difficulties here:
Storing (how do you ensure it is your binary uploading the hash?)
Retrieving (how do you ensure you are talking to the right server?)
And of course, in both cases, beware of man-in-the-middle attacks...
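A bare-bones sketch of the remote-hash idea (the endpoint and payload are hypothetical, and authenticating both ends is exactly the hard part described above):

import hashlib
import requests  # third-party: pip install requests

def publish_hash(log_path, machine_id):
    # Send the log's SHA-256 to a server under your control.
    with open(log_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    requests.post("https://example.com/log-hashes",  # hypothetical endpoint
                  json={"machine": machine_id, "sha256": digest},
                  timeout=10)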
One last bit of advice: if you need security, consult a real expert; don't rely on strange guys (like myself) talking on a forum. We're amateurs.
It's your file, and your program is what's allowed to modify it. That being the case, there is one simple solution (if you can afford to put your log files into a separate folder).
Note:
You can place all your log files in a separate folder. For example, in my application we have a lot of DLLs, each with its own log file, and of course the application has its own.
So have a separate process running in the background that monitors the folder for change notifications such as:
a change in file size
an attempt to rename the file or folder
deletion of the file
etc...
Based on these notifications, you can certify whether the file has been changed!
(As you and others may be guessing, your own process and DLLs will also change these files, which likewise raises notifications. You need to synchronize this action smartly. That's it.)
The Windows API to monitor a folder is given below:
HANDLE FindFirstChangeNotification(
LPCTSTR lpPathName,
BOOL bWatchSubtree,
DWORD dwNotifyFilter
);
lpPathName:
Path to the log directory.
bWatchSubtree:
Watch subfolders or not (0 or 1)
dwNotifyFilter:
Filter conditions that satisfy a change notification wait. This parameter can be one or more of the following values.
FILE_NOTIFY_CHANGE_FILE_NAME
FILE_NOTIFY_CHANGE_DIR_NAME
FILE_NOTIFY_CHANGE_SIZE
FILE_NOTIFY_CHANGE_SECURITY
etc...
(Check MSDN)
How to make it work?
Suspect A: Our process
Suspect X: Other process or user
Inspector: The process that we created to monitor the folder.
The Inspector sees a change in the folder and queries Suspect A: did he make any change to it?
If so,
the change is taken as VALID.
If not,
it is a clear indication that the change was made by Suspect X, so it is NOT VALID:
the file is certified to be TAMPERED with.
Other than that, below are some of the techniques that may (or may not :)) help you!
Store the timestamp whenever the application closes the file, along with the file size.
The next time you open the file, check its last-modified time and its size. If both are the same, the file has most likely not been tampered with (a sketch of this check follows the list).
Change the file's privileges to read-only after you write logs to it. If some program or person wants to tamper with it, they must try to clear the read-only attribute, and that action changes the file's modified date/time.
Write only encrypted data to your log file. If someone tampers with it, the data will no longer decrypt properly.
Use a compress/uncompress mechanism (compression can help you protect the file with a password).
Each approach has its own pros and cons. Strengthen the logic based on your needs; you can even try combining the techniques proposed.
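As an illustration of the first technique, a small Python sketch that snapshots and re-checks the file's size and modification time (where you'd stash the snapshot somewhere the user can't casually edit):

import os

def snapshot(path):
    # Record size and last-modified time when the application closes the log.
    st = os.stat(path)
    return {"size": st.st_size, "mtime": st.st_mtime}

def looks_untouched(path, saved):
    # Compare the current size and mtime against the saved snapshot.
    st = os.stat(path)
    return st.st_size == saved["size"] and st.st_mtime == saved["mtime"]

Both values are easy to forge (an mtime can be reset deliberately), so treat this as a tripwire rather than proof.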