Scenario
I have a full-text search requirement: I need to search inside documents. I upload the documents to an S3 bucket and encrypt them using envelope encryption.
Can we do a full-text search on encrypted documents in an S3 bucket? If yes, what are the REST APIs (Node.js APIs) for this?
Example => bucket1 =>Encrypted content in the files
bucket1/abc.pdf
bucket1/def.doc
bucket1/ghi.txt
I want to search for text like "I am from planet earth" in the files above.
The result should be the name(s) of the file(s) that contain this text.
Solution
I am reading the following articles:
aws article here
encryption of data at rest
Problem
Will it work if the S3 bucket data is encrypted?
What will be the best solution for this scenario?
Elasticsearch does not search inside documents stored elsewhere. You need to index the content of the documents in Elasticsearch to be able to search them. It also does not support searching encrypted data; the content has to be indexed in clear text.
What you can do is configure SSL/TLS and authentication on Elasticsearch, so that requests succeed only with the correct certificate, username, and password.
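To make the indexing step concrete, here is a minimal sketch of the two request bodies involved, assuming the file has already been decrypted and its text extracted. The field names (bucket, file_name, content) and the helper names are made up for illustration; match_phrase is the Elasticsearch query type for exact phrases.

```javascript
// Sketch only: field names and helper names are assumptions, not a fixed schema.

// Build the document to index for one S3 object.
// `plainText` must already be decrypted and extracted (e.g. from a PDF or DOC).
function buildIndexDocument(bucket, key, plainText) {
  return {
    bucket,
    file_name: key,
    content: plainText,
  };
}

// Build a search body that matches an exact phrase in `content`
// and returns only the file name of each hit.
function buildPhraseQuery(phrase) {
  return {
    _source: ["file_name"],
    query: {
      match_phrase: { content: phrase },
    },
  };
}

const doc = buildIndexDocument("bucket1", "ghi.txt", "Hello. I am from planet earth.");
const query = buildPhraseQuery("I am from planet earth");
console.log(JSON.stringify(doc));
console.log(JSON.stringify(query));
```

The first body would go to the index API and the second to the search API of whichever Elasticsearch client you use; the hits then carry the file names you asked for.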
Related
I'm trying to build an application with a Java backend that allows users to create text with images in it (something like a personal blog). I'm planning to store these images in an S3 bucket. When uploading an image file to the bucket, I hash the original name and store the file under the hashed name. Images are for display purposes only; no user will be able to download them. The frontend displays these images by getting their paths from the server. So the question is: is there any need to store the original name of the image file in the database? And what are the reasons, if any, for doing so?
I guess in general it is not needed because what is more important is how these resources are used or managed in the system.
Assuming your service is something like data access (similar to Google Drive), I don't think it's necessary to store it in the DB, unless you want to make search queries faster.
I want to do the following: a user in a browser types some text and after he presses a 'Save' button, the text should be saved in a file (for example: content.txt) in a folder (for example: /username_text) on the root of an S3 bucket.
Also, I want the user to be able, when he visits the same page, load the content from S3 and continue working on the file. Then, if he/she is done, save the file to S3 again.
Probably important to mention, but I plan on using NodeJS for my back-end...
My question now is: What is the best way to set this storing-and-retrieving thing up? Do I create an API gateway + Lambda function to GET and POST files through that? Or do I for example use the aws-sdk in Node to directly push and pull files from S3? Or is there a better way to do this?
I looked at the following two guides:
Using AWS S3 Buckets in a NodeJS App – Codebase – Medium
Image Upload and Retrieval from S3 Using AWS API Gateway and Lambda
Welcome to StackOverflow!
I think you are worrying too much about the not-so-important stuff. S3 is nothing but a storage system. You could have decided to store the content of these files in DynamoDB, RDS, etc. What would you do if you stored the contents in these real databases? You'd fetch the data and display it to the user, wouldn't you?
This is what you need to do with S3! S3 is a smart choice in your scenario because your "file" can grow very big and S3 is a great place for storing files. However, apparently, you're not actually storing files (think of .pdf, .mp4, .mov, etc.); you're essentially only storing human-readable text.
So here's one approach on how to solve your problem:
FETCHING FILE CONTENT
User logs in
You fetch the user's personal information based on some token. You can store all the metadata in DynamoDB, where, given a user_id, you fetch all the "files" belonging to this user. These "files" (metadata only) would be the bucket and key of the actual file on S3.
You use the getObject API from S3 to fetch the file based on your query and display the body of your file to your user in a RESTful way. Your response should look something like this:
{
"content": "some content"
}
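As a rough sketch of the fetch flow above: the two Maps below stand in for DynamoDB (metadata) and S3 (objects) so the example runs on its own, and fetchBody is a hypothetical wrapper that you would replace with a real getObject call. All names and the 404 response are assumptions for illustration.

```javascript
// Sketch only: `metadataTable` stands in for DynamoDB, `objectStore` for S3.
const metadataTable = new Map([
  ["file-1", { user_id: "user-1", bucket: "my-bucket", key: "user-1/content.txt" }],
]);
const objectStore = new Map([
  ["my-bucket/user-1/content.txt", Buffer.from("some content", "utf8")],
]);

// Hypothetical stand-in for S3 getObject: resolves to the object body as a Buffer.
async function fetchBody(bucket, key) {
  const body = objectStore.get(`${bucket}/${key}`);
  if (!body) throw new Error("NoSuchKey");
  return body;
}

// Look up the metadata, read the object, and return the JSON response shape.
async function getFileContent(userId, fileId) {
  const meta = metadataTable.get(fileId);
  if (!meta || meta.user_id !== userId) {
    return { statusCode: 404, body: JSON.stringify({ error: "not found" }) };
  }
  const buffer = await fetchBody(meta.bucket, meta.key);
  // The only conversion needed: Buffer -> String.
  return { statusCode: 200, body: JSON.stringify({ content: buffer.toString("utf8") }) };
}

getFileContent("user-1", "file-1").then((res) => console.log(res.body));
```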
SAVING FILE CONTENT
User logs in
The user writes anything in a form and submits it. In your Lambda function, you grab the content of this form and process it. This request should look something like this:
{
"file_id": "some-id",
"user_id": "some-id",
"content": "some-content"
}
If the file_id exists, update the content in S3. Otherwise, upload a new file to S3 and then create a new entry in DynamoDB. You'd then, of course, have to check that the user submitting the changes actually owns the file. If you're using UUIDs this shouldn't be much of a problem, but it is still worth checking in case an ID is leaked somehow.
This way, you don't need to worry about uploading and downloading files, which are CPU-intensive tasks, so you can keep your costs low and use very little RAM in your functions (128MB should be more than enough); after all, you're now only serving text. Not only will this simplify your design, it will also make things simpler both in API Gateway and in your code, as you won't have to deal with binary types. The most you'll do is convert the buffer from S3 to a String when serving content, but this should be completely fine.
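The save flow could look roughly like this. Again the Maps are in-memory stand-ins (fileIndex for DynamoDB, bucketContents for S3), and the key layout, status codes, and function name are assumptions, not something S3 or Lambda prescribes.

```javascript
const { randomUUID } = require("crypto");

// In-memory stand-ins (assumptions): fileIndex for DynamoDB, bucketContents for S3.
const fileIndex = new Map();
const bucketContents = new Map();

// Handle a save request of the form { file_id, user_id, content }.
async function saveFileContent(request) {
  const { file_id, user_id, content } = request;
  const existing = file_id ? fileIndex.get(file_id) : undefined;

  if (existing) {
    // The file exists: only its owner may overwrite it.
    if (existing.user_id !== user_id) {
      return { statusCode: 403, body: JSON.stringify({ error: "forbidden" }) };
    }
    bucketContents.set(existing.key, Buffer.from(content, "utf8"));
    return { statusCode: 200, body: JSON.stringify({ file_id }) };
  }

  // Unknown file_id: upload a new object first, then record its metadata.
  const newId = file_id || randomUUID();
  const key = `${user_id}/${newId}.txt`;
  bucketContents.set(key, Buffer.from(content, "utf8"));
  fileIndex.set(newId, { user_id, key });
  return { statusCode: 201, body: JSON.stringify({ file_id: newId }) };
}
```

The ownership check runs before any write, so a leaked file_id on its own is not enough to overwrite someone else's content.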
EDIT
On your question regarding whether you should upload it from the browser or not, I suggest you take a look into this answer where I cover the pros/cons of doing it via API Gateway vs from the Browser.
I have an application that has been running for a long time and uploads files (images) to S3 storage.
Now I've been asked to update this application to upload files using SSE-C encryption (server-side encryption with customer-provided keys). So I did it.
I'm also able to upload SSE-C encrypted files using the AWS CLI.
What I need now, and here is my question, is a way to apply SSE-C encryption to earlier files that are already on S3 without SSE-C encryption.
Could someone explain whether and how this can be accomplished, or point me to some documentation or a support page that would help me find a solution?
One (maybe inefficient) way I found is doing the following for each file:
copy filename to filename.encrypted applying the SSE-C encryption
move filename.encrypted to filename
Is this the only way to do it, or is there a better one?
NOTES:
Since I have a very large number of files, I have ruled out downloading each file and uploading it again with SSE-C encryption, because that would be too slow and too expensive.
I'm looking for a solution that applies SSE-C without transferring the data out of S3 and back.
Thank you very much for any feedback on this.
You can apply encryption to already-existing objects by simply copying the object on top of itself:
aws s3 cp s3://bucket/foo.txt s3://bucket/foo.txt --sse-c --sse-c-key fileb://key.bin
This works as long as something (e.g. the encryption) changes in the copy.
I got the --sse-c syntax from: How to supply a key on the command line that's not Base 64 encoded
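If you need to do the same thing from code rather than the CLI, the copy request carries the SSE-C settings as headers. Below is a small sketch that builds those headers for an in-place copy; the function name is made up, and it assumes the source object is not yet SSE-C encrypted (otherwise the copy-source SSE-C headers would be needed too). Most SDKs expose equivalent parameters, so treat this as an illustration of what gets sent rather than something you must hand-roll.

```javascript
const crypto = require("crypto");

// Build the REST headers for a CopyObject request that copies an object
// onto itself while applying SSE-C.
// `rawKey` is the 32-byte customer key as a Buffer (the contents of key.bin).
function buildSseCCopyHeaders(bucket, key, rawKey) {
  if (rawKey.length !== 32) {
    throw new Error("SSE-C with AES256 needs a 256-bit (32-byte) key");
  }
  // URL-encode each path segment of the key but keep the "/" separators.
  const encodedKey = key.split("/").map(encodeURIComponent).join("/");
  return {
    "x-amz-copy-source": `${bucket}/${encodedKey}`,
    "x-amz-server-side-encryption-customer-algorithm": "AES256",
    "x-amz-server-side-encryption-customer-key": rawKey.toString("base64"),
    "x-amz-server-side-encryption-customer-key-MD5": crypto
      .createHash("md5")
      .update(rawKey)
      .digest("base64"),
  };
}

// Example with a throwaway random key; in practice load your own key material.
const exampleHeaders = buildSseCCopyHeaders("bucket", "foo.txt", crypto.randomBytes(32));
console.log(Object.keys(exampleHeaders));
```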
I want to index some sample PDFs and then search for keywords in them. I have tried using Elasticsearch on my local desktop, with fscrawler to index the PDFs. But my main aim is to create a web application where I can upload PDFs and then enter a search term. I have created an Elasticsearch cluster on AWS but cannot figure out how to index PDFs in AWS. Can I store the PDFs on S3 and then index them?
Supporting S3 as a FS implementation is something I'd love to support in the future. See https://github.com/dadoonet/fscrawler/issues/263
That being said, I believe that Workplace Search will support it at some point.
I'm testing out S3 Select and as far as I understand from the examples, you can treat a single object (CSV or JSON) as a data store.
I wanted to have a single JSON document per S3 object and search the entire bucket as a 'database'. I'm saving each 'file' as <ID>.json and each file has JSON documents with the same schema.
Is it possible to search multiple objects in a single call? i.e. Find all JSON documents where customerId = 123 ?
It appears that Amazon S3 Select operates on only one object.
You can use Amazon Athena to run queries across paths, which will include all files within that path. It also supports partitioning.
Simply iterate over the keys under the prefix ("folder") that holds all the files, and run S3 Select against each key in turn.
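A minimal sketch of that loop, assuming one JSON document per object. listKeys and selectFromObject are hypothetical wrappers that you would implement around the SDK's list-objects and S3 Select calls; they are passed in as arguments here so the loop itself can be shown (and tested) without AWS access.

```javascript
// Build the S3 Select SQL for one JSON object.
// customerId is assumed to be an integer, so it can be inlined safely.
function buildSelectExpression(customerId) {
  if (!Number.isInteger(customerId)) {
    throw new TypeError("customerId must be an integer");
  }
  return `SELECT * FROM S3Object s WHERE s.customerId = ${customerId}`;
}

// Run the same expression against every key under `prefix`.
//   listKeys(prefix)                  -> Promise<string[]>   (hypothetical wrapper)
//   selectFromObject(key, expression) -> Promise<object[]>   (hypothetical wrapper)
async function searchPrefix(listKeys, selectFromObject, prefix, customerId) {
  const expression = buildSelectExpression(customerId);
  const matches = [];
  for (const key of await listKeys(prefix)) {
    const records = await selectFromObject(key, expression);
    for (const record of records) {
      matches.push({ key, record });
    }
  }
  return matches;
}
```

This issues one S3 Select request per object, so for a large bucket the Athena route mentioned above is likely the better fit.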