I've gone through the Amazon SDK and documentation, and there isn't much on programmatically querying or searching for documents in an S3 bucket.
Sure, I can get a document by ID or name, but I want to be able to search by other metadata, such as author.
I would appreciate some guidance and a specific example of a query being executed on the service side, not a local iteration after all documents or items have been pulled down.
[…] there isn't much on programmatically querying or searching for documents in an S3 bucket.
Right. S3 is flat object storage and doesn't provide a query interface for searching across objects.
[…] I want to be able to search by other metadata, such as author.
This will need to be solved by your application logic. It is not built into S3.
For example, you can store the metadata about an S3 document/file in DynamoDB. You query DynamoDB for the metadata, which includes a pointer to the file in S3.
Unfortunately, if you already have a bunch of files in S3, you'll need to find a way to build that initial index of your data.
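A minimal sketch of that idea, assuming a DynamoDB table keyed on a hypothetical `s3_key` attribute with a hypothetical global secondary index named `author-index`, and assuming the author is stored as user-defined S3 metadata under the key `author`. None of these names come from the question. The `s3` and `dynamodb` arguments are expected to be boto3 clients (`boto3.client("s3")` and `boto3.client("dynamodb")`).

```python
def backfill_index(s3, dynamodb, bucket, table_name):
    """Copy each object's user-defined metadata from S3 into a DynamoDB table.

    Returns the number of objects indexed.
    """
    count = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            # head_object returns the user metadata without downloading the body.
            meta = s3.head_object(Bucket=bucket, Key=obj["Key"]).get("Metadata", {})
            dynamodb.put_item(
                TableName=table_name,
                Item={
                    "s3_key": {"S": obj["Key"]},  # pointer back to the file in S3
                    "author": {"S": meta.get("author", "unknown")},
                },
            )
            count += 1
    return count


def find_keys_by_author(dynamodb, table_name, author):
    """Query the assumed 'author-index' GSI and return the matching S3 keys.

    Only the first page of results is read; a full version would follow
    LastEvaluatedKey until it is absent.
    """
    resp = dynamodb.query(
        TableName=table_name,
        IndexName="author-index",  # hypothetical index name
        KeyConditionExpression="#a = :a",
        ExpressionAttributeNames={"#a": "author"},
        ExpressionAttributeValues={":a": {"S": author}},
    )
    return [item["s3_key"]["S"] for item in resp["Items"]]
```

The query runs inside DynamoDB, so nothing is pulled locally except the matching keys, which you can then fetch from S3 by name.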
Amazon just released new features for Amazon CloudSearch:
http://aws.amazon.com/about-aws/whats-new/2014/03/24/amazon-cloudsearch-introduces-powerful-new-search-and-admin-features/
Related
What AWS service is appropriate for storing a single key-value pair that is updated daily? The stored data will be retrieved by several other services throughout the day (~100 times total per day).
My current solution is to create and upload a JSON to an S3 bucket. All other services download the JSON and get the data. When it's time to update the data, I create a new JSON and upload it to replace the previously uploaded JSON. This works pretty well but I'm wondering if there is a more appropriate way.
There are many:
AWS Systems Manager Parameter Store
AWS Secrets Manager
DynamoDB
S3
^ those are some of the most common. Without knowing more, I'd suggest you consider DynamoDB or Parameter Store. Both are simple and inexpensive, although S3 is fine, too.
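As a rough sketch of the Parameter Store option: the daily writer overwrites one parameter and the readers fetch it. The `ssm` argument is assumed to be a boto3 SSM client (`boto3.client("ssm")`), and the parameter name in the usage note is hypothetical.

```python
def publish_value(ssm, name, value):
    """Create the parameter, or overwrite it if it already exists."""
    ssm.put_parameter(Name=name, Value=value, Type="String", Overwrite=True)


def read_value(ssm, name):
    """Return the current value of the parameter."""
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]
```

For example, the daily job would call `publish_value(ssm, "/myapp/daily-value", "42")` and each consumer would call `read_value(ssm, "/myapp/daily-value")`.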
The only reason not to use S3 is governance: things like key expiry are not handled automatically on the AWS side, as they are with a secrets manager, so sharing the data with third parties will be much harder.
Your solution seems very good, especially since S3 is an object store and a JSON file is an object.
The usage you described is so low that you shouldn't spend time wondering whether there is a better way :)
Just be aware of S3's consistency model. At the time of this answer, Amazon S3 provided read-after-write consistency for PUTs of new objects in all regions, with one caveat: if you made a HEAD or GET request to the key name (to find out whether the object exists) before creating the object, S3 provided only eventual consistency for read-after-write. Since December 2020, S3 provides strong read-after-write consistency for all PUT and DELETE requests, including overwrites of an existing key.
And regarding your comment:
The S3 way seemed a little hacky, so I am trying to see if there is a better approach
The S3 way is not hacky at all. Storing objects under keys is exactly the intended use of S3 :)
I'm trying to get the size of folders in my S3 bucket from a React Native app.
I couldn't find a way to do that yet and would really appreciate any help.
Amazon S3 does not have folders in the traditional sense of a filesystem. Folders are merely supported as a concept of grouping objects.
To get the size of a set of objects with the same prefix, you have to list the objects and sum the sizes of the individual objects. The S3 API for listing objects accepts a Prefix parameter so that only the objects you need are returned. This parameter should be exposed by the SDK that you are using.
An aside: S3 reports the size of the entire bucket to Amazon CloudWatch once per day, but this is insufficient for your purposes.
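The listing approach described above can be sketched as follows. It is written in Python with boto3 for brevity; the AWS SDK for JavaScript exposes the same ListObjectsV2 operation with the same `Prefix` parameter. The `s3` argument is assumed to be a boto3 S3 client, and the bucket and prefix names in the usage note are hypothetical.

```python
def prefix_size(s3, bucket, prefix):
    """Return (total_bytes, object_count) for all objects under a key prefix."""
    total_bytes = 0
    object_count = 0
    # Each page holds at most 1000 keys, so the paginator follows the
    # continuation tokens until the listing is exhausted.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
            object_count += 1
    return total_bytes, object_count
```

A call such as `prefix_size(s3, "my-bucket", "photos/2020/")` treats `photos/2020/` as the "folder". Include the trailing slash so that a sibling prefix like `photos/2020-old/` is not counted.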
I use PySpark to read objects from an Amazon S3 bucket. My bucket is composed of many JSON files, which I read and then save as Parquet files with
df = spark.read.json('s3://my-bucket/directory1/')
df.write.parquet('s3://bucket-with-parquet/', mode='append')
Every day I will upload some new files to s3://my-bucket/directory1/, and I would like to append them to s3://bucket-with-parquet/. Is there a way to ensure that I do not load the same data twice? My idea is to tag every file that I read with Spark (I do not know how to do that) and then use those tags to tell Spark not to read the file again (I do not know how to do that either). If an AWS guru could help me with that, I would be very grateful.
There are a couple of things you could do. One is to write a script that reads the last-modified timestamp of each object in the bucket and produces the list of files added on that day. You can then process only the files in that list. (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729)
Second, you can enable versioning on the S3 bucket so that if you overwrite a file, you can still retrieve the old version. You can also set an ACL for read-only and write-once permissions, as described in "Amazon S3 ACL for read-only and write-once access".
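The first suggestion (filtering by timestamp) might look roughly like this. It is a sketch, not the linked article's code: `s3` is assumed to be a boto3 S3 client, and `cutoff` is assumed to be a timezone-aware datetime that you persist between runs, for example the start time of the previous successful job.

```python
from datetime import datetime, timezone


def keys_modified_since(s3, bucket, prefix, cutoff):
    """Return the keys under `prefix` whose LastModified is later than `cutoff`.

    boto3 returns LastModified as a timezone-aware datetime, so `cutoff`
    must be timezone-aware as well.
    """
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > cutoff:
                new_keys.append(obj["Key"])
    return new_keys
```

The resulting keys can be turned into paths and passed as a list to `spark.read.json`, e.g. `spark.read.json([f"s3://my-bucket/{k}" for k in keys])`, so that only the new files are appended to the Parquet output.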
I hope this helps.
I'm comparing cloud storage options for a large set of files with certain 'attributes' to query. Right now it's about 2.5 TB of files and growing quickly. I need high-throughput writes and queries. I'll first write the file and its attributes to the store, then query to summarize attributes (counts, etc.), and additionally query attributes to pull a small set of files (by date, name, etc.).
I've explored Google Cloud Datastore as a NoSQL option, but I'm trying to compare it to AWS services.
One option would be to store files in S3 with 'tags'. I believe you can query these with the REST API, but I'm concerned about performance. I have also seen suggestions to connect Athena, but I'm not sure whether it will pull in the tags or whether it fits this use case.
The other option would be using something like DynamoDB or possibly a large RDS instance? Redshift says it's for petabyte scale, and we're not quite there...
Thoughts on the best AWS storage solution? Pricing is a consideration, but I'm more concerned with the best solution moving forward.
You don't want to store the files themselves in a database like RDS or Redshift. You should definitely store the files in S3, but you should probably store or copy the metadata somewhere that is more indexable and searchable.
I would suggest setting up a new-object trigger in S3 that invokes a Lambda function whenever a new file is uploaded. The Lambda function could take the file location, size, any tags, etc. and insert that metadata into Redshift, DynamoDB, Elasticsearch, or an RDS database like Aurora, where you could then run queries against it. Unless you are talking about many millions of files, the metadata will be fairly small and you probably won't need the scale of Redshift. The exact database you pick to store the metadata will depend on your use case, such as the specific queries you want to perform.
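A minimal sketch of such a Lambda function, covering only the part that is independent of the database you choose: pulling the bucket, key, and size out of the S3 event notification. The `save` parameter is a hypothetical hook standing in for the insert into your metadata store; it defaults to `print` so the sketch runs as-is.

```python
from urllib.parse import unquote_plus


def extract_metadata(event):
    """Turn an S3 event notification into a list of metadata records."""
    records = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        records.append({
            "bucket": s3_info["bucket"]["name"],
            # Object keys arrive URL-encoded in the event (spaces become '+').
            "key": unquote_plus(s3_info["object"]["key"]),
            "size": s3_info["object"].get("size", 0),
            "event_time": record.get("eventTime"),
        })
    return records


def lambda_handler(event, context, save=print):
    """Lambda entry point: index every object mentioned in the event."""
    items = extract_metadata(event)
    for item in items:
        save(item)  # replace with a write to DynamoDB, Aurora, etc.
    return {"indexed": len(items)}
```

Object tags are not part of the event payload, so if you need them the function would have to fetch them separately for each key.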
I am planning to develop a web application that can perform some basic text editing functions (like insert and delete) on S3 files. Could anyone show me a path forward? I am currently learning Lambda and have followed the tutorial here: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
I can create a Lambda function that modifies files on S3, and I can now invoke it from the AWS CLI. What else do I need to know and do to create this web application? Thank you very much.
You would need to look at Amazon API Gateway. It can act as the HTTP front end through which your web application calls the Lambda function.
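Also note that S3 is an object storage service, not block storage, so it is not suitable for your use case if your file edits are too frequent: every time you want to edit the text, you have to download the entire object, modify it, and upload it again. Also be mindful of S3's consistency model, quoted below from the documentation as it stood at the time; since December 2020, S3 provides strong read-after-write consistency for all PUT and DELETE requests.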
Amazon S3 Data Consistency Model
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
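The download, modify, upload cycle described in this answer could be sketched like this for an insert operation. It assumes the object is UTF-8 text and that `s3` is a boto3 S3 client; the function name and the character-offset interface are illustrative choices, not something from the tutorial.

```python
def insert_text(s3, bucket, key, offset, text):
    """Insert `text` at character `offset` in a UTF-8 text object and re-upload it.

    The whole object is downloaded and rewritten, because S3 has no API
    for editing part of an existing object in place.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    updated = body[:offset] + text + body[offset:]
    s3.put_object(Bucket=bucket, Key=key, Body=updated.encode("utf-8"))
    return updated
```

Two users editing the same file at once would overwrite each other's changes with this approach, so a real application would need some form of locking or version check on top of it.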