Index PDF files in AWS Elasticsearch - amazon-web-services

I want to index some sample PDFs and then search for keywords in them. I have tried Elasticsearch on my local desktop and used FSCrawler to index the PDFs, but my main aim is to create a web application where I can upload a PDF and then enter a search term. I have created an Elasticsearch cluster on AWS but cannot figure out how to index PDFs there. Can I store the PDFs on S3 and then index them?

Supporting S3 as a filesystem implementation is something I'd love to add to FSCrawler in the future. See https://github.com/dadoonet/fscrawler/issues/263
That being said, I believe that Workplace Search will support it at some point.
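In the meantime, a common workaround (not an FSCrawler feature, just a sketch of the manual approach) is to pull the object from S3 yourself, extract the text, and index it as a plain field. A minimal Python sketch, assuming the pypdf and elasticsearch packages are installed; the bucket name, index name, and cluster endpoint are placeholders you would replace with your own:

import io
import boto3
from pypdf import PdfReader            # assumption: pypdf is used here for text extraction
from elasticsearch import Elasticsearch

s3 = boto3.client("s3")
es = Elasticsearch("https://your-domain.es.amazonaws.com")  # placeholder endpoint

def index_pdf(bucket, key):
    # Download the PDF from S3 into memory
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # Extract the text page by page
    reader = PdfReader(io.BytesIO(body))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Index the extracted text so it becomes full-text searchable
    es.index(index="documents", id=key, document={"filename": key, "content": text})

index_pdf("my-pdf-bucket", "sample.pdf")

# A keyword search is then an ordinary match query:
# es.search(index="documents", query={"match": {"content": "your search term"}})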

Related

AWS S3 filename

I'm trying to build an application with a Java backend that allows users to create a text with images in it (something like a personal blog). I'm planning to store these images in an S3 bucket. When uploading image files to the bucket, I hash the original name and store the hashed one in the bucket. Images are for display purposes only; no user will be able to download them. The frontend displays these images by getting a path to them from the server. So the question is: is there any need to store the original name of the image file in the database? And what are the reasons, if any, for doing so?
I guess in general it is not needed, because what matters more is how these resources are used and managed in the system.
Assuming your service is something like data access (similar to Google Drive), I don't think it's necessary to store it in the DB, unless you want to make faster search queries.
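If you think you might want the original name later (for example as a friendly display label), one lightweight alternative to a database column is to keep it as S3 object metadata. A rough Python/boto3 sketch (bucket and file names are placeholders; note that S3 user metadata values must be ASCII, so non-ASCII names would still need a DB column or some encoding):

import hashlib
import boto3

s3 = boto3.client("s3")

def upload_image(bucket, local_path, original_name):
    # Derive a stable object key from the original filename
    hashed_key = hashlib.sha256(original_name.encode("utf-8")).hexdigest()
    # Keep the original name as object metadata so it can be recovered later
    # without a separate lookup table
    s3.upload_file(
        local_path,
        bucket,
        hashed_key,
        ExtraArgs={"Metadata": {"original-name": original_name}},
    )
    return hashed_key

key = upload_image("my-image-bucket", "photo.jpg", "photo.jpg")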

Custom vocabulary with AWS Transcribe - Japanese language in AWS

While using AWS Transcribe, I want to create a custom vocabulary, but I am not able to create one with Japanese words, nor can I find any sample custom vocabulary phrases file.
I tried the character codes from the table and a plain array of Japanese word strings. Neither worked.
I got the error "The vocabulary that you're trying to create contains invalid characters or incorrectly formatted terms. See the developer guide for more information."
Here is my code:

import boto3

transcribe = boto3.client('transcribe')

response = transcribe.create_vocabulary(
    VocabularyName='vocab2',
    LanguageCode='ja-JP',
    Phrases=["0x3005 0x3005"]
)
Any leads would be appreciated!
Upload to S3 first, forget the upload file button
AWS provides two ways to create a custom vocabulary on the console: upload a file, or fetch one from S3. For the same file, I failed when uploading directly but succeeded when uploading to S3 first. I guess it's a bug in AWS, but we have to live with it.
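For what it's worth, the same S3-first approach also works through the API: instead of passing Phrases inline, point create_vocabulary at a vocabulary file that is already in S3 via VocabularyFileUri. A rough boto3 sketch, with placeholder bucket and file names:

import boto3

transcribe = boto3.client('transcribe')

# Reference a vocabulary file already uploaded to S3 instead of passing Phrases inline
response = transcribe.create_vocabulary(
    VocabularyName='vocab2',
    LanguageCode='ja-JP',
    VocabularyFileUri='s3://my-vocab-bucket/ja-vocab.txt'  # placeholder bucket/key
)
print(response['VocabularyState'])  # PENDING until processing finishes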

Does indexing metadata in Amazon Elasticsearch work with encrypted content?

Scenario
I have a full-text search requirement that must search inside documents. I am uploading the documents to an S3 bucket and encrypting them using envelope encryption.
Can we do full-text search in encrypted documents (in the S3 bucket)? If yes, what are the REST APIs (Node.js APIs) for this?
Example => bucket1 => encrypted content in the files
bucket1/abc.pdf
bucket1/def.doc
bucket1/ghi.txt
I want to search for text like "I am from planet earth" in the above files.
I want the result to be the file name(s) containing that text.
Solution
I am reading the following articles:
aws article here
encryption of data at rest
Problem
Will it work if the S3 bucket data is encrypted?
What would be the best solution for this scenario?
Elasticsearch does not search inside documents; you need to index the content of the documents into Elasticsearch to be able to perform searches. It also does not support searching encrypted data; the data needs to be stored in clear text.
What you can do is configure SSL/TLS and authentication on Elasticsearch, so requests are only accepted with the correct certificate plus a username and password.
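To make the flow concrete: the document has to reach Elasticsearch as clear text, so you decrypt (or let S3 server-side encryption decrypt transparently) before indexing, and protect the data in transit with TLS and credentials. A rough Python sketch (the question mentions Node.js, but the flow is the same; endpoint, credentials, and bucket/key names are placeholders):

import boto3
from elasticsearch import Elasticsearch

# With server-side encryption (SSE-S3/SSE-KMS) S3 returns decrypted bytes over TLS;
# with client-side envelope encryption you would decrypt locally at this point.
s3 = boto3.client("s3")
content = s3.get_object(Bucket="bucket1", Key="ghi.txt")["Body"].read().decode("utf-8")

# Index the clear-text content over HTTPS with authentication
es = Elasticsearch("https://your-domain.es.amazonaws.com",
                   basic_auth=("user", "password"))  # placeholder endpoint/credentials
es.index(index="documents", id="ghi.txt",
         document={"filename": "ghi.txt", "content": content})

# Full-text query that returns the matching file name(s)
result = es.search(index="documents",
                   query={"match": {"content": "I am from planet earth"}})
print([hit["_source"]["filename"] for hit in result["hits"]["hits"]])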

Specify output filename of Cloud Vision request

So I'm sitting with Google Cloud Vision (for Node.js) and I'm trying to dynamically upload a document to a Google Cloud bucket, process it using the Google Cloud Vision API, and then download the .json afterwards. However, when Cloud Vision processes my request and places it in my bucket for saved text extractions, it appends output-1-to-n.json at the end of the filename. So let's say I'm processing a file called foo.pdf that's 8 pages long; the output will not be foo.json (even though I specified that), but rather foooutput-1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like such an unnecessarily hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for your given prefix (so you should make sure your prefix is unique).
For more details see this answer.
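In practice that means listing every object under the output prefix you passed as the destination and downloading each shard. A rough sketch with the Python google-cloud-storage client (the question uses Node.js, but the pattern is the same; bucket and prefix names are placeholders):

from google.cloud import storage

client = storage.Client()

# List and download every JSON shard the Vision API wrote under the chosen prefix
for blob in client.list_blobs("my-output-bucket", prefix="results/foo"):
    if blob.name.endswith(".json"):
        # Save each shard locally under its own name, e.g. foooutput-1-to-8.json
        blob.download_to_filename(blob.name.split("/")[-1])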

.csv upload not working in Amazon Web Services Machine Learning - AWS

I have uploaded a simple 10-row .csv file (on S3) to the AWS ML website. It keeps giving me the error,
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward to build the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files I created myself to S3 is by downloading an existing .csv file from my S3 server, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the .csv file? I am able to upload my own .csv file along with a schema that I have created, and it works. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc. in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on Windows, the same file worked perfectly.
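If the file format is the culprit, one quick check is to rewrite the data as plain UTF-8 CSV with standard line endings, which sidesteps the old Excel-for-Mac quirks (CR-only line endings, stray byte-order marks). A small Python sketch with made-up column names, just to illustrate:

import csv

# Hypothetical rows; replace with your own columns and values
rows = [
    ["feature1", "feature2", "target"],
    [1.2, 3.4, 10.0],
    [2.1, 0.7, 12.5],
]

# newline="" plus UTF-8 encoding produces a plain, consistently formatted CSV
with open("clean.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)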