ElasticSearch search pdf document for content - amazon-web-services

I'm working on a project leveraging an AWS Lex chatbot and Elasticsearch. My goal is to parse a query with the intent of searching a single PDF document and pulling out some relevant information.
I'm under the impression this is possible with Elasticsearch, though my research has hit a roadblock. I understand Elasticsearch can index documents, but that seems to be limited to indexing files so that a search returns the files matching the query. I'm hoping to get at the actual content within the PDF document and pull some of it out based on a query. Is this possible?

Elasticsearch can't index PDFs directly. You can extract the text of the PDF, index it, then query as usual. Apache Tika "detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)."
You can run Tika as a Docker container: docker-tikaserver
To index a PDF, send your data to Tika (for this example, running as a docker container accessible via http://tika:9998), get the text, and index it:
import logging
import requests
from elasticsearch import Elasticsearch

log = logging.getLogger(__name__)
es = Elasticsearch("http://elasticsearch:9200")  # adjust to your cluster

doc = {...}  # other content to index
try:
    # open PDF and read contents into data
    with open("document.pdf", "rb") as f:
        data = f.read()
    # send content to Tika to extract text
    doc["content"] = requests.put("http://tika:9998/tika", data=data).text
    es.index(index="my-index", id=doc["id"], body=doc)
except Exception as e:
    log.error("error extracting text: %s", e)
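Once the text is indexed, you can query it like any other field. A minimal sketch against the index above; the query string is just an example:

# full-text search over the extracted PDF text
results = es.search(
    index="my-index",
    body={"query": {"match": {"content": "invoice total"}}},
)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_score"])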

Related

neo4j use Load CSV to read data from Google Cloud Storage

My original data is in BigQuery. I have created a DAG job to extract the relevant fields, based on a WHERE condition, into a CSV file stored in Google Cloud Storage.
As the next step, I am aiming to use LOAD CSV WITH HEADERS FROM "gs://link-to-bucket/file.csv" to read the data from the CSV into a Neo4j database.
It seems, however, that I cannot just give the GCS URI as the CSV link. Is there any way to establish a secure connection to read the file, other than making the bucket public?
My attempt
uri = "gs://link-to-bucket/file.csv"

def create_LP_query(uri):
    query_string = f"""
    LOAD CSV WITH HEADERS FROM '{uri}' AS row
    MERGE (l:Limited_Partner:Company {{id: row.id}})
    SET l.Name = row.Name
    """
    return query_string
It is not possible; you would have to create a Neo4j plugin that acts as a new ProtocolHandler.
I did one in the past for S3. You might take it as inspiration; it would be similar for GCS.
https://github.com/ikwattro/neo4j-load-csv-s3-protocol
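For completeness, a small sketch of executing the generated statement with the official neo4j Python driver; the connection details are placeholders, and LOAD CSV runs on the server, so the URI still has to be something the Neo4j server itself can fetch (for example an HTTPS URL, or a gs:// URI once such a protocol-handler plugin exists):

from neo4j import GraphDatabase

# placeholder connection details for the Neo4j server
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # the server, not the client, fetches the CSV named in the query
    session.run(create_LP_query(uri))
driver.close()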

Is there any limit on number of pdf pages to be OCRed using AWS Textract?

I am OCRing image-based PDFs using AWS Textract.
Each of my PDFs has 60+ pages, but when I try to OCR a file it only processes the first 4 pages.
Is there any limit on the number of pages in a PDF file for AWS Textract?
I found https://docs.aws.amazon.com/textract/latest/dg/limits.html, but it does not mention any limit on the number of pages.
Does anyone know if there is a limit on PDF pages, and if so, how can I OCR the whole 60+ page file?
The hard limits for Textract are 1,000 pages or 500 MB for PDFs.
I think your problem is related to the paginated response of Textract. You have to check whether the key "NextToken" in the JSON output is populated and, if so, make another request with that token.
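A minimal boto3 sketch of that NextToken loop for an asynchronous text-detection job; the bucket, key, and region are placeholders, and real code should poll JobStatus until it is SUCCEEDED instead of sleeping a fixed time:

import time
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# start the async job for a multi-page PDF stored in S3 (placeholder bucket/key)
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "my-file.pdf"}}
)
job_id = job["JobId"]

time.sleep(60)  # naive wait; poll JobStatus in real code

# keep requesting pages until NextToken is no longer returned
blocks, next_token = [], None
while True:
    kwargs = {"JobId": job_id}
    if next_token:
        kwargs["NextToken"] = next_token
    response = textract.get_document_text_detection(**kwargs)
    blocks.extend(response["Blocks"])
    next_token = response.get("NextToken")
    if not next_token:
        break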

Test data to requests for Postman monitor

I run my collection using test data from a CSV file; however, there is no option to upload the test data file when adding a monitor for the collection. Searching the internet, I could see that the test data file has to be provided via a URL (saved in the cloud, e.g. Google Drive), but I couldn't find a source explaining how to provide this URL to the collection. Can anyone please help?
https://www.postman.com/praveendvd-public/workspace/postman-tricks-and-tips/request/8296678-d06b3fc0-6b8b-4370-9847-aee0f526e7db
You cannot use a CSV file in a monitor, but you can store the content of the CSV as a variable and use that to drive the monitor. An example can be seen in the above public workspace.

Custom vocabulary with AWS Transcribe- Japanese Language in AWS

While using AWS Transcribe, I want to create a custom vocabulary, but I am not able to create one with Japanese words, nor am I able to find any sample of a custom-vocabulary phrases file.
I tried the character codes from the table and a direct array of Japanese word strings. Neither worked.
I got the error "The vocabulary that you’re trying to create contains invalid characters or incorrectly formatted terms. See the developer guide for more information."
Here is my code
import boto3

transcribe = boto3.client("transcribe")
response = transcribe.create_vocabulary(
    VocabularyName="vocab2",
    LanguageCode="ja-JP",
    Phrases=["0x3005 0x3005"],
)
Any leads would be appreciated!
Upload to S3 first, forget the upload file button
AWS provides two ways to create a custom vocabulary in the console: upload a file, or fetch it from S3. For the same file, I failed when uploading directly but succeeded when uploading to S3 first. I guess it's a bug in AWS, but we have to live with it.
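A hedged boto3 sketch of that S3-first route; the bucket name and file key are placeholders, and the vocabulary file itself still has to follow the Japanese custom-vocabulary format described in the Transcribe developer guide:

import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

# upload the vocabulary file to S3 first (placeholder bucket and key)
s3.upload_file("vocab-ja.txt", "my-bucket", "vocab-ja.txt")

# create the vocabulary from the S3 object instead of passing Phrases inline
response = transcribe.create_vocabulary(
    VocabularyName="vocab2",
    LanguageCode="ja-JP",
    VocabularyFileUri="s3://my-bucket/vocab-ja.txt",
)
print(response["VocabularyState"])  # PENDING until processing finishes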

GoCD compare output in json in bash

My situation is that I need to get the GoCD compare result as JSON in bash, so that I can extract the commit IDs from the output.
https://<go server url>/go/compare/<pipeline name>/<old build>/with/<new build>
The output of the above URL for an authenticated GET request is an HTML document.
However, I need to get this in JSON format so that I can fetch the list of commit IDs from the comparison page output. Please suggest any ideas.
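One possible route, assuming your GoCD version exposes the pipeline compare API (the endpoint path and Accept header below are assumptions to verify against your server's API documentation): request the comparison through the API instead of the HTML page and parse the JSON. A curl call with the same URL, credentials, and Accept header would do the same from bash; a Python sketch:

import json
import requests

# assumed compare API endpoint; check the API docs for your GoCD version
url = "https://<go server url>/go/api/pipelines/<pipeline name>/compare/<old build>/<new build>"
response = requests.get(
    url,
    auth=("<user>", "<password>"),
    headers={"Accept": "application/vnd.go.cd.v1+json"},
)
response.raise_for_status()

# dump the JSON; the commit revisions can then be picked out of the material changes
print(json.dumps(response.json(), indent=2))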