This may seem like a really basic question, but if I am downloading a file from S3 while it is being updated by another process, do I have to worry about getting an incomplete file?
Example: a 200MB CSV file. User A starts to update the file with 200MB of new content at 1Mbps. 16 seconds later, User B starts downloading the file at 200Mbps. Does User B get all 200MB of the original file, or does User B get ~2MB of User A's changes and nothing else?
User B gets all 200MB of the original file.
Here's why:
PUT operations on S3 are atomic. There's technically no such thing as "modifying" an object. What actually happens when an object is overwritten is that it is replaced with another object having the same key. But the original object is not actually replaced until the new (overwriting) object has been uploaded successfully and in its entirety... and even then, the overwritten object is not technically "gone" yet -- it has only been replaced in the bucket's index, so that future requests will be served the new object.
(Serving the new object is actually documented as not being guaranteed to always happen immediately. In contrast with uploads of new objects, which are immediately available for download, overwrites of existing objects are eventually consistent, meaning that it's possible -- however unlikely -- that for a short period after you overwrite an object, the old copy could still be served for subsequent requests. Note that since December 2020, S3 provides strong read-after-write consistency for overwrite PUTs as well, so this caveat applies to the older consistency model.)
But when you overwrite an object, and versioning is not enabled on the bucket, the old and new objects are actually stored independently in S3, despite having the same key. The old object is simply no longer referenced by the bucket's index, so you are no longer billed for storing it, and it is shortly purged from S3's backing store. Exactly how much later this happens isn't documented... but (tl;dr) overwriting an object that is currently being downloaded should not cause any unexpected side effects.
Updates to a single key are atomic. For example, if you PUT to an existing key, a subsequent read might return the old data or the updated data, but it will never write corrupted or partial data.
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
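For completeness, here is a minimal Python sketch of that behavior (just an illustration: it assumes boto3 is installed and credentials are configured, and the bucket, key, and local file names are hypothetical). The point is that a GET issued while an overwrite is still in flight is served a complete object, never a partial mix of old and new bytes.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"   # hypothetical bucket name
KEY = "reports/large.csv"      # hypothetical key

# "User A": overwrite the object. S3 does not swap the key over to the new
# object until this upload has completed successfully.
with open("new-version.csv", "rb") as f:
    s3.upload_fileobj(f, BUCKET, KEY)

# "User B": a GET on the same key (even one issued while the upload above is
# still running) returns either the complete old object or the complete new
# object -- never a truncated or mixed body.
response = s3.get_object(Bucket=BUCKET, Key=KEY)
body = response["Body"].read()
print(len(body))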
I just uploaded a folder of 5 images to IPFS (using the Mac Desktop IPFS Client App, so it was a very simple drag and drop operation.)
So being that I’m the one that created and published this folder, does that mean that I’m the only one that’s allowed to make further modifications to it - like adding or deleting more images from it? Or can anyone out there on IPFS do that as well?
If they can, is there a way to prevent that from happening?
=======================================
UPDATED QUESTION:
My specific use-case has to do with updating the metadata of ERC721 Tokens - after they’ve already been minted.
Imagine for example a game where certain objects - like say a magical sword - gain special powers after a certain amount of usage or after the completion of certain missions by their owner. So we'd want to update this sword's attributes by editing its metadata and re-committing this updated metadata file to the blockchain.
If our game has 100 swords, for example, and we initially uploaded to IPFS a folder containing all 100 JSON files (one for each sword), then I'm pretty sure IPFS still lets you access the specific files within the hashed folder by their specific human-readable names (and not only by their hash.)
So if our sword happens to be sword #76, and our naming convention for our JSON files was of this format: "sword000.json", then sword #76's JSON metadata file would have a path such as:
http://ipfs.infura.io/QmY2xxxxxxxxxxxxxxxxxxxxxx/sword076.json
If we then edited the "sword076.json" file and drag-n-dropped it back into our master JSON folder, it would obviously cause that folder's Hash/CID value to change. BUT, as long as we're able to update our Solidity contract's "tokenURI" method to look for and serve our ".json" files from this newly updated hash/CID folder name, we could still refer to the individual files within it by their regular English names. Which means we'd be good to go.
Whether or not this is a good scheme to employ is something we can definitely discuss, but I FIRST want to go back to my original question/concern, which is that I want to make sure that WE are the ONLY ones that can update the contents of our folder - and that no one else has permission to do that.
Does that make sense?
IPFS is immutable, meaning when you add your directory along with the files, the directory gets a unique CID based on the contents of the directory. So in a sense, nobody can modify it, not even you, because it's immutable. I believe this confusion can be resolved with more background on how IPFS works.
When you add things to IPFS, each file is hashed and given a CID. The same is true for directories, but a directory's CID is easier to think of as a hash of its contents (the names and CIDs of the files it links to). So if any files in the directory are updated, added, or deleted, the directory gets a new CID.
Understanding this, if someone else added the exact same content in the exact same way, they'd end up with the exact same CID! So if two people added the same content, and a third person requested that file (or directory), both nodes would be able to serve the data, because we know it's exactly the same. The same is true if you simply shared your CID and another node pinned it: both nodes would have the same data, so if anyone requested it, both nodes would be able to serve it.
So your local copy cannot be edited by anyone. In a sense, if you're relying on the IPFS CID as the address of your data, not even you can edit it! This is why IPFS is typically referred to as "immutable": any data you request via an IPFS CID will always be the same. If you change any of the data, you'll get a new CID.
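As a quick illustration of that determinism -- just a sketch, assuming a local IPFS daemon (kubo/go-ipfs) exposing its HTTP API on the default 127.0.0.1:5001 -- hashing the same bytes always yields the same CID:

import requests

ADD_ENDPOINT = "http://127.0.0.1:5001/api/v0/add"  # default local daemon API

def cid_of(content: bytes) -> str:
    # only-hash=true computes the CID without actually storing the data
    resp = requests.post(ADD_ENDPOINT, params={"only-hash": "true"},
                         files={"file": ("sword076.json", content)})
    resp.raise_for_status()
    return resp.json()["Hash"]

print(cid_of(b'{"power": 10}') == cid_of(b'{"power": 10}'))   # True: same bytes, same CID
print(cid_of(b'{"power": 10}') == cid_of(b'{"power": 11}'))   # False: any change, new CID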
More info can be found here: Content Addressing & Immutability
If you read all this and thought "well what if I want mutable data?", I'd recommend looking into IPNS and possibly ipfs-sync if you're looking for a tool to automatically update IPNS for you.
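For example (again only a sketch against a local daemon's HTTP API; replace the placeholder with your folder's real CID), you publish a CID under your node's IPNS key, and only the holder of that key can point the name at a new CID later:

import requests

API = "http://127.0.0.1:5001/api/v0"   # default local daemon API

def publish(cid: str) -> str:
    # Points this node's IPNS name at /ipfs/<cid>; only this node's private key can update it.
    resp = requests.post(f"{API}/name/publish", params={"arg": f"/ipfs/{cid}"})
    resp.raise_for_status()
    return resp.json()["Name"]

folder_cid = "QmY2xxxxxxxxxxxxxxxxxxxxxx"   # placeholder from the question; use the real folder CID
print(f"/ipns/{publish(folder_cid)}/sword076.json")   # stable path to hand to tokenURI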
A repository pattern is there to abstract away the actual data source, and I do see a lot of benefits in that, but a repository should not expose IQueryable (to prevent leaking DB details) and it should always return domain objects, not DTOs or POCOs, and it is this last point I have trouble getting my head around.
If a repository pattern always has to return a domain object, doesn't that mean it fetches way too much data most of the time? Let's say it returns an employee domain object with forty properties, and the service and view layers consuming that object only use five of those properties.
It means the database has fetched a lot of unnecessary data and pumped it across the network. Doing that with one object is hardly noticeable, but if millions of records are pushed across that way and a lot of the data is thrown away every time, is that not considered bad behavior?
Yes, when adding, editing, or deleting the object you will use the entire object, but reading the entire object and pushing it to another layer that uses only a fraction of it is not using the underlying database and network in the most optimal way. What am I missing here?
There's nothing preventing you from having a separate read model (which could be a separately stored projection of the domain or a query-time projection) and separating out the command and query concerns - CQRS.
If you then put something like GraphQL in front of your read side, the consumer can decide exactly what data they want, from the full model down to individual field/property level.
Your commands still interact with the full domain model as before (except where it's a performance no-brainer to use set based operations).
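As a rough sketch (all names here are invented for illustration, and the storage details are stubbed out), the split might look like this: the command side loads and saves the full domain object, while the query side returns a slim projection with only the fields the view needs.

from dataclasses import dataclass

@dataclass
class Employee:
    # Full domain model -- imagine ~40 fields in the real thing.
    id: int
    name: str
    email: str
    department: str
    salary: float

@dataclass
class EmployeeListItem:
    # Read model: only the handful of fields the view actually uses.
    id: int
    name: str
    department: str

class EmployeeRepository:
    # Command side: whole aggregates in, whole aggregates out.
    def get(self, employee_id: int) -> Employee: ...
    def save(self, employee: Employee) -> None: ...

class EmployeeQueries:
    # Query side: projects straight from the data store, bypassing the domain model.
    def __init__(self, connection):
        self._conn = connection

    def list_by_department(self, department: str) -> list[EmployeeListItem]:
        rows = self._conn.execute(
            "SELECT id, name, department FROM employees WHERE department = ?",
            (department,),
        )
        return [EmployeeListItem(*row) for row in rows]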
I'm trying to build a connection to a file in a Google Storage bucket, but I'm having difficulty implementing an ObjectWriteStream. The problem is that if I create an ObjectWriteStream to a file that is already in the cloud, it deletes the old file and starts from the beginning of it. Here is some example code:
#include "google/cloud/storage/client.h"
#include <string>

namespace gcs = google::cloud::storage;

// Opens a write stream to the object; note this always creates/replaces the object.
void test(gcs::Client client, std::string const& bucket_name, std::string const& file_name) {
  auto writeCon = client.WriteObject(bucket_name, file_name);
  writeCon << "This is a test";
  writeCon.Close();
}
What should I do to prevent the ObjectWriteStream from deleting my file, and to upload data from the location I want (e.g., append data to the file)? I have tried calling the standard ostream function seekp to set the stream position, but this does not work since ObjectWriteStream does not support it. Strangely, ObjectReadStream does not support this operation either, but it has an option gcs::ReadRange(start, end) to set the starting location. Therefore, I am wondering if there is a non-standard way to set the position for ObjectWriteStream. I would appreciate any advice.
it will delete the old file and start from the beginning of it.
This is by design. Remember that GCS is not a filesystem. GCS is an object store. In an object store, the object is the atomic unit. You cannot modify objects.
If you require filesystem semantics, you may want to use Cloud Filestore instead.
The answers indicating that objects are immutable are correct. However, two or more objects can be concatenated together using the compose API. Here's the relevant javadoc.
So you could combine a few techniques to effectively append to objects in GCS.
You could copy your existing object (A) to a new object (B) in the same location and storage class (this will be very fast), delete A, upload new data into object C, and then compose B+C into A's original location. Then delete B and C. This will require a copy, delete, upload, compose, and then two deletes -- so six operations. Be mindful of operations costs.
You could simply upload a new object (B) and compose A+B into a new object, C, and record the name of the new object in a metadata database, if you're using one. This would require only an upload, compose, and two deletes.
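As a sketch of that second approach in Python with the google-cloud-storage client (the question uses the C++ client, which exposes an equivalent ComposeObject call; all bucket and object names below are hypothetical):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")           # hypothetical bucket name

existing = bucket.blob("logs/data.csv")       # object A: already in the bucket
chunk = bucket.blob("logs/data.csv.part")     # object B: the new data to append
chunk.upload_from_string("new rows to append\n")

combined = bucket.blob("logs/data-v2.csv")    # object C: the composed result
combined.compose([existing, chunk])           # server-side concatenation, nothing is downloaded

# Record "logs/data-v2.csv" wherever you track the current object, then clean up.
existing.delete()
chunk.delete()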
Within Google Cloud Storage, objects are immutable. See:
https://cloud.google.com/storage/docs/key-terms#immutability
What this means is that you simply can't append to an object. You can, however, rewrite the object by uploading the original content plus the additional content as a new object under the same name.
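In practice that is a read-modify-write (a minimal Python sketch with hypothetical names; the same pattern applies with the C++ client used in the question): download the current content, append to it locally, and upload the result under the same name.

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").blob("notes.txt")   # hypothetical names

current = blob.download_as_bytes() if blob.exists() else b""    # read the existing content, if any
blob.upload_from_string(current + b"This is a test\n")          # replaces the object wholesale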
This is more of a conceptual question. I have a working application that allows users to upload a CSV file of addresses, then parses the data into an Array of Address objects, then validates each Address object against certain rules (certain fields are required, etc.). The page then displays any addresses that failed validation, giving the user the ability to edit or delete each.
Right now, I am storing the entire Array in a SESSION variable, assigning each Address an Id value, then updating each Address in the SESSION Array when the user makes edits and submits the form.
I'm trying to think of a way to do this without using the SESSION scope, or using a physical database, or physical file. Any ideas?
If you don't use a physical database you would have to use some sort of persistent scope. That means the SESSION scope, the CLIENT scope (if you have that enabled), the APPLICATION scope, or the SERVER scope. But I think the safest way (as all those persistent scopes are cleared if your server goes down) is to store the data in a database -- whether that database is an RDBMS, a text file, or a Verity or Solr collection. I apologize in advance if that doesn't answer your question.
The data needs to be stored / preserved somewhere if you want to work with it across multiple requests. Aside from the options your question rules out (session, database, file), I can think of two other (non-ideal) options:
External cache mechanism like memcached -- not necessarily recommended because it's inherently volatile and makes no guarantee that your data will be preserved
Pass the contents of the CSV around from request to request, e.g. via a hidden FORM field containing JSON -- not recommended if the CSV can get large
Personally, my colleagues and I tend to use temporary database storage for this type of issue.
I don't know anything about your requirements or use cases, but if you can depend on your users to have modern browsers, another viable option might be HTML5 localStorage.
Skimming some of the localstorage questions might give you some ideas.
I'm new to Symfony2 and Doctrine.
Here is the problem as I see it.
I cannot use:
$repository = $this->getDoctrine()->getRepository('entity');
$my_object = $repository->findOneBy($index);
on an object that is persisted, BUT NOT FLUSHED YET !!
I think getRepository reads from the DB, so it will not find an object that hasn't been flushed.
My question: how can I read those objects that are persisted (I think they are somewhere in a "Doctrine session") so I can re-use them before I flush my entire batch?
Every profile has 256 physical plumes.
Every profile has 1 plumeOptions record assigned to it.
In plumeOptions, I have a cartridgeplume which is a FK to PhysicalPlume.
Every plume is identified by an ID (auto-generated) and an INDEX (user-generated).
Rule: say profile 1 has physical_plume_index number 3 (= index) connected to it.
Now, I want to copy a profile with all its related data to another profile.
A new profile is created. 256 new plumes are created and copied from the older profile.
I want to link the new profile to the new plume with index 3.
Check here: http://pastebin.com/WFa8vkt1
I think you might want to have a look at this function:
$entityManager->getUnitOfWork()->getScheduledEntityInsertions()
It gives you back a list of entity objects that have been persisted but not yet flushed.
Hmm, I didn't really read your question well. With the above you will retrieve the full list (as an array), but you cannot query it like you can with getRepository. I will try to find something for you.
I think you might be looking at the problem from the wrong angle. Doctrine is your persistence layer and database access layer. It is the responsibility of your domain model to provide access to objects once they are in memory. So the problem boils down to: how do you get a reference to an object without the persistence layer?
Where do you create the object you need to get hold of later? Can the method/service that creates the object return a reference to the controller so it can propagate it to the other place you need it? Can you dispatch an event that you listen to elsewhere in your application to get hold of the object?
In my opinion, Doctrine should be used at the startup of the application (as early as possible), to initialize the domain model, and at the shutdown of the application, to persist any changes to the domain model during the request. To use a repository to get hold of objects in the middle of a request is, in my opinion, probably a code smell and you should look at how the application flow can be refactored to remove that need.
Yours is effectively a business logic problem.
Running a findBy query against the database for objects that have not been flushed yet means hitting the DB layer for objects that you already have in your function scope.
Also keep in mind that findOneBy will also retrieve other objects previously saved with the same criteria.
If you need to search only among those newly created objects, you could, for example, keep them in a session array variable and iterate over them with a foreach.
If you need a mix of already-saved items plus some new items, you should treat the two parts separately: one with a foreach, the other with the repository query!