What are the minimum SAS permissions that the EventProcessorClient needs from Storage Accounts? - azure-eventhub

I couldn't find which SAS permissions I need to grant on a storage account that I'm using solely for Event Hub consumption.
[screenshot of the SAS permission options]
Since the data is stored in blobs, it definitely needs Read... but does it also need Update, or Write?
The documentation only shows examples that use connection strings.

The EventProcessorClient needs to be able to:
- List blobs in a container
- Add a new blob to a container
- Update an existing blob in the container (metadata only)
- Read an existing blob in the container (metadata only)
We generally recommend using a container dedicated to the processor and allowing the processor control over that container.
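For reference, here's a minimal sketch of a container-level SAS that maps onto those four operations, using the Python azure-storage-blob package (the account, container, and key values are placeholders, not anything from the question):

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Placeholder names; substitute your own storage account, container, and key.
sas_token = generate_container_sas(
    account_name="mystorageaccount",
    container_name="eventhub-checkpoints",  # container dedicated to the processor
    account_key="<account-key>",
    permission=ContainerSasPermissions(
        read=True,    # read existing checkpoint/ownership blobs (metadata)
        add=True,     # add new blobs
        create=True,  # create new checkpoint/ownership blobs
        write=True,   # update blob metadata when checkpointing or claiming ownership
        list=True,    # list the blobs in the container
    ),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)
```

Treat the exact permission bits as a starting point rather than an authoritative mapping; the processor's behaviour can vary slightly between SDK versions.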

Related

AWS S3 filename

I'm trying to build an application with a Java backend that allows users to create text with images in it (something like a personal blog). I'm planning to store these images in an S3 bucket. When uploading image files to the bucket, I hash the original name and store the hashed one in the bucket. The images are for display purposes only; no user will be able to download them. The frontend displays them by getting a path from the server. So the question is: is there any need to store the original name of the image file in the database? And what are the reasons, if any, for doing so?
I guess in general it is not needed, because what matters more is how these resources are used and managed in the system.
Assuming your service is something like a file-access service (similar to Google Drive), I don't think it's necessary to store the original name in the DB, unless you want to make faster search queries.
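If you do want to keep the original name without adding a database column, one option is to attach it as object metadata at upload time. A rough sketch with boto3 (in Python, even though the question's backend is Java; the bucket and metadata key names are made up):

```python
import hashlib

import boto3

s3 = boto3.client("s3")

def upload_image(data: bytes, original_name: str, bucket: str = "blog-images") -> str:
    # Hash the original name to build the key, as described in the question,
    # and keep the original name as object metadata instead of in the database.
    # Note: S3 user metadata values must be ASCII, so non-ASCII names would
    # need to be encoded first.
    key = hashlib.sha256(original_name.encode("utf-8")).hexdigest()
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType="image/jpeg",  # or detect from the upload
        Metadata={"original-name": original_name},
    )
    return key
```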

Storing S3 URLs vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit, preview, and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. The caveat I'm finding is that this can lead to edge cases between S3 and the DB.
For example, if a file gets removed from S3 but its URL does not get removed from the DB (or vice versa), the data becomes inconsistent and may mislead users.
I was thinking of just fetching the URLs at request time by using listObjects in the S3 client SDK. I don't really need to store the URLs, and this guarantees the user gets what's actually in S3.
The only con here is that it makes an API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index of files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns 1000 objects per call. That might be okay if every user has their own path (so you can call ListObjects(Prefix='user1/')), but it's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created or deleted outside of your app, or if there is an error in the app. If this concerns you, use Amazon S3 Inventory to produce a regular listing of objects in the bucket, and write some code to compare it against the database entries. That will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
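As a rough illustration of the database-index approach, here is a Python/boto3 sketch (the bucket name and the idea that the DB stores the object key are assumptions on my part) that pre-signs a download URL from a key kept in your own table:

```python
import boto3

s3 = boto3.client("s3")

def presigned_download_url(key_from_db: str, bucket: str = "attachments-bucket") -> str:
    # The DB stores the object key plus whatever attributes you want to filter on;
    # a short-lived URL is generated on demand when the user asks for the file.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key_from_db},
        ExpiresIn=900,  # 15 minutes
    )
```

If you do go the listing route instead, use a paginator, since each ListObjectsV2 call returns at most 1000 keys per page.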

How to restrict Google Cloud Storage uploads

I have a mobile application that uses Google Cloud Storage. The application allows each registered user to upload a specific number of files.
My question is, is there a way to do some kind of checks before the storage upload? Or do I need to implement a separate reservation API of sorts that OKs an upload step?
Any alternative suggestions are welcome too, of course.
Warning: not an authoritative answer. Happy to accept removal or update requests.
I am not aware of any GCS or Firebase Cloud Storage mechanisms that will inherently limit the number of files (objects) that a given user can create. If it were me, this is how I would approach the puzzle.
I would create a database (eg. Firestore / Datastore) that has a key for each user and a value which is the number of files they have uploaded. When a user wants to upload a new file, it would first make a REST call to a Cloud Function that I would write. This Cloud Function would implicitly know the identity of the calling user. It would look up the record in the database and determine if we are allowed to upload a new file. If no, then return an error and end of story. If yes, then increment the value in the database. Next I would create a GCS "signed URL" that can be used to permit an upload. It would be that signed URL that the Cloud Function would return. The app that now wishes to upload can use that signed URL to perform the actual upload.
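Very roughly, that flow might look like the sketch below (all names, the quota, and the bucket are assumptions; it also assumes the caller's identity has already been established):

```python
from datetime import timedelta
from typing import Optional
from uuid import uuid4

from google.cloud import firestore, storage

MAX_FILES_PER_USER = 20  # assumed quota

db = firestore.Client()
storage_client = storage.Client()

@firestore.transactional
def reserve_upload_slot(transaction, user_ref) -> bool:
    """Atomically check the user's upload count and increment it if under the limit."""
    snapshot = user_ref.get(transaction=transaction)
    count = (snapshot.to_dict() or {}).get("upload_count", 0)
    if count >= MAX_FILES_PER_USER:
        return False
    transaction.set(user_ref, {"upload_count": count + 1}, merge=True)
    return True

def request_upload_url(user_id: str) -> Optional[str]:
    user_ref = db.collection("users").document(user_id)
    if not reserve_upload_slot(db.transaction(), user_ref):
        return None  # quota exceeded; the Cloud Function would return an error here
    blob = storage_client.bucket("user-uploads-bucket").blob(f"{user_id}/{uuid4()}")
    # V4 signed URLs require credentials that can sign (a service account key
    # or the IAM signBlob permission on the runtime service account).
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),
        method="PUT",
        content_type="application/octet-stream",
    )
```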
I would also add metadata to each file uploaded to identify the logical uploader (user) of the file. That can be then used for reconciliation if needed. We could examine all the files in the bucket and re-build the database of how many files each user had uploaded.
A possible alternative to this story is for the Cloud Function to not return a signed URL but instead receive the data to be uploaded in the same request. If the check on the number of files passes, then the Cloud Function could act as a proxy and write directly to GCS to create the file. This alternative needs to be carefully examined as a function of the sizes of the files to be uploaded. If the files are large, this may be a very poor solution: we want to be in and out of Cloud Functions as quickly as possible, and keeping a Cloud Function around to serve as a data pass-through isn't great. We may want to look at Cloud Run in that case, as it supports concurrency within an instance without increasing the cost per call.

Can Aerospike be used as an alternative to S3?

I am new to both technologies, so I need some guidance here.
I have an S3 bucket with lots of images (20 million objects, 870 GB) which unfortunately has poorly thought-out keys, and that makes reads slow (at least 1 s to 1.8 s per read).
We are planning to migrate this to a better, read-optimised bucket.
Then I came across Aerospike, and in some docs I read that we can even store images as blobs in key-value pairs in Aerospike. While the storage consumption would be high, reads would be faster than anything else, given its SSD integration.
Would it be recommended to use Aerospike with S3-like keys, where the values would be the corresponding images? Is there any other alternative to S3 with faster reads?
No. Aerospike is not an alternative to S3; they are meant for different purposes. Aerospike is a key-value store used as an OLTP database, while S3 is a file storage service. Note that Aerospike has a limit on how large a record can be: it's 1 MB (the maximum allowed value for write-block-size). If your objects are smaller than 1 MB, then you may consider Aerospike for your use case.
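For completeness, storing a small image as a record with the Aerospike Python client would look something like this (host, namespace, and set names are placeholders); the write-block-size ceiling mentioned above is the main constraint:

```python
import aerospike

# Placeholder cluster config; adjust the hosts and namespace to your deployment.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

def put_image(image_id: str, image_bytes: bytes) -> None:
    # The whole record (key + bins) must fit within write-block-size
    # (1 MB in the versions referred to above), so this only suits small images.
    key = ("images_ns", "images", image_id)
    client.put(key, {"data": bytearray(image_bytes)})

def get_image(image_id: str) -> bytes:
    _, _, bins = client.get(("images_ns", "images", image_id))
    return bytes(bins["data"])
```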

File Storage usage with Django and Amazon S3

I have a scenario where my app needs to determine storage use per user, since it will place limits on how much storage can be used. I'm currently using django-storage, boto, and S3 to manage and store files. What is the best way to aggregate storage use on a per-user basis?
I thought about keeping track of each file that's uploaded, incrementing the aggregate by the file size on upload and decrementing it on delete, and storing that aggregated size in the DB, but I'm wondering if there is a cleaner way to get this. What solutions have others used? Many thanks.
It's not a full solution, but a start might be to store the user-uploaded files in folders that are specific to the user. The "upload_to" attribute of a model field can be a callable that returns the location to store the media; this could end up putting users' uploads into user-specific folders (prefixes) within your S3 storage.
Then you can use boto (or other tools) to query the total size of a user's folder.
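A minimal sketch of both halves, assuming django-storages with an S3 backend and boto3 (the model, bucket, and prefix names are illustrative):

```python
import boto3
from django.conf import settings
from django.db import models

def user_directory_path(instance, filename):
    # "upload_to" may be a callable; keep each user's files under their own prefix,
    # e.g. "uploads/42/report.pdf".
    return f"uploads/{instance.owner_id}/{filename}"

class Attachment(models.Model):
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    file = models.FileField(upload_to=user_directory_path)

def storage_used_bytes(user_id, bucket="my-app-media"):
    """Sum object sizes under the user's prefix (paginating past the 1000-key page limit)."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=f"uploads/{user_id}/"):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total
```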