Reading the documentation, it's not really clear to me.
What I want is to be able to store and retrieve simple JSON documents. With CloudSearch it seems possible to store documents in SDF format and then search for them, but the search only returns the document ID and a small part (200 characters, I think) of the fields specified.
Is there a way to retrieve the full document by ID using just CloudSearch? Or is it intended to work as an additional tool for searching, with retrieval going through your primary storage service?
If you index the id as a literal and search with that exact id, then yes, you can, but it seems like a waste to use CloudSearch in that way. What about S3?
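
In case it helps, here is a minimal sketch of what that exact-id lookup could look like with boto3 (the domain endpoint and the doc_id/body field names are made up, and doc_id is assumed to be a return-enabled literal field):

# Rough sketch: fetch a document by an exact literal id with the structured
# query parser. Endpoint and field names are placeholders.
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

response = client.search(
    query="(term field=doc_id 'abc-123')",  # exact match on the literal field
    queryParser="structured",
    returnFields="doc_id,body",             # fields must be return-enabled in the domain
    size=1,
)
hits = response["hits"]["hit"]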
We need to extract details from documents such as invoices and delivery challans. I was going through the AWS Textract demo, where you can simply upload a PDF document and see what details it extracts as key-value pairs, tables, etc.
While doing this, I found that a few specific keys that are very important to us, like Invoice Number and PAN, are sometimes extracted and sometimes not, even though the document I am using is of quite high quality.
So my question is: is there any way to specify exactly which keys we need extracted from the document?
If they are present in the document, AWS should extract them; otherwise, it should leave those fields empty in the response.
Thanks,
Kavita
As we use CloudSearch to find our documents and data, we have this issue: for some of the data that is returned to us, we need to know how it was found.
I know that we can specify which fields to search, but is there any way Amazon gives us a hint or some information about how the returned data was found, i.e. in which fields the match exists?
This could be really useful information for us and affects the way we show data to our users.
I know Amazon provides a highlighting service, but highlights change the results; we don't want to change results or values, we just want to use this knowledge for backend purposes.
I have a requirement to get OCR (optical character recognition) data from PDF and image files in S3 so that users can search that OCR data. I am using AWS Textract for the text extraction.
I was planning to store the OCR data in DynamoDB and run search queries against it.
The issue I am facing is the DynamoDB item size limit of 400 KB.
I have situations where a user uploads a 100+ MB PDF file to S3 and the extracted text will exceed this limit. What is the best approach in this case?
Please help
Thanks in advance!
I'm sure you could still use DynamoDB; you would just need to split the data up across multiple items. In this case, your partition key might be the PDF file key/name and the sort key might be some kind of part number. You can then get all the items containing text for the file using Query (rather than GetItem).
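
A minimal sketch of that layout, assuming boto3 and made-up names ("ocr-text", "pdf_key", "part") and an arbitrary chunk size:

# Split the extracted text into numbered parts under one partition key,
# then reassemble it with a single Query.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ocr-text")
CHUNK = 350_000  # rough character budget to stay under the 400 KB item limit

def save_text(pdf_key: str, text: str) -> None:
    """Write the extracted text as numbered parts under one partition key."""
    with table.batch_writer() as batch:
        for part, start in enumerate(range(0, len(text), CHUNK)):
            batch.put_item(Item={
                "pdf_key": pdf_key,          # partition key: the S3 object key
                "part": part,                # sort key: chunk number
                "text": text[start:start + CHUNK],
            })

def load_text(pdf_key: str) -> str:
    """Reassemble the document text (pagination via LastEvaluatedKey omitted)."""
    resp = table.query(KeyConditionExpression=Key("pdf_key").eq(pdf_key))
    return "".join(item["text"] for item in resp["Items"])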
DynamoDB gets really expensive when you're dealing with a lot of data, so another option could be S3 and Athena:
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
Basically, you write the OCR data to a text file and store that in S3. You can then use Athena to run queries over that data. This solution is very flexible and is likely to be much cheaper than DynamoDB, though there may be some performance trade-offs.
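
A very rough sketch of that route, one JSON line per document in S3 and a LIKE query through Athena (bucket, database, and table names are made up, and the Athena table definition is assumed to exist already):

import json
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1) Store the OCR output as a JSON-lines object.
record = {"pdf_key": "invoices/1234.pdf", "text": "extracted text goes here"}
s3.put_object(
    Bucket="my-ocr-bucket",
    Key="ocr/invoices-1234.json",
    Body=json.dumps(record).encode("utf-8"),
)

# 2) Search it with Athena (results land in the configured output location).
query = athena.start_query_execution(
    QueryString="SELECT pdf_key FROM ocr_documents WHERE text LIKE '%invoice%'",
    QueryExecutionContext={"Database": "ocr_db"},
    ResultConfiguration={"OutputLocation": "s3://my-ocr-bucket/athena-results/"},
)
print(query["QueryExecutionId"])  # poll get_query_execution / get_query_results next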
I have looked into this post on S3 vs. databases, but I have a different use case and want to know whether S3 is enough. The primary reason for using S3 instead of a cloud database is cost.
I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert it into MongoDB. I then run analysis by querying for data on a specific date, on specific fields, or for records that match certain criteria. After querying the data, I usually load it into a dataframe and do what is needed.
The data will not be updated; it just needs to be stored and ready for retrieval according to some criteria. I am aware of S3 Select, which may be able to handle the retrieval task.
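For reference, this is roughly the kind of S3 Select call I have in mind (bucket, key, and field names are invented; it assumes one JSON record per line):

import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-scraper-data",
    Key="2023-05-01/site-a.json",
    ExpressionType="SQL",
    Expression="SELECT s.* FROM s3object s WHERE s.category = 'books'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; collect the record payloads.
rows = b"".join(
    event["Records"]["Payload"] for event in resp["Payload"] if "Records" in event
)
print(rows.decode("utf-8"))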
Any recommendations?
From the use cases you have mentioned above, it seems that you are not using MongoDB's capabilities (or any database capabilities, for that matter) to any great degree.
I think S3 suits your use case well. In fact, you could go for S3 Infrequent Access with a lifecycle policy to archive and then finally purge the data, to be cost efficient.
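
As a rough sketch of that lifecycle setup (bucket name, prefix, and day counts are arbitrary examples, not recommendations):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-scraper-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-purge",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # move to S3-IA
                ],
                "Expiration": {"Days": 365},  # finally purge
            }
        ]
    },
)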
I hope it helps!
I think your code will be more efficient if you use DynamoDB with all its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve the file from S3 every time and then iterate through it every time. With DynamoDB, you can easily query and filter for the data you need. In the end, S3 is file storage and DynamoDB is a database.
I am building a service that will have millions of rows of data in it. We want good search on it, e.g. searching by some field values. The structure of a row will be as follows:
{
    "field1": "value1",
    "field2": "value2",
    "field3": {
        "field4": "value4",
        "field5": "value5"
    }
}
Also, the structure of field3 can vary, with field4 sometimes present and sometimes not.
We want to be able to filter on the following fields: field1, field2, and field4. We can create indexes in DynamoDB to do that, but I am not sure whether we can easily create an index on field4 in DynamoDB without flattening the JSON.
Now, my question is: should we use Elasticsearch as the datastore, which as far as I know will create indexes on every field in the document so that one can search on every field? Is that right? Or should we use DynamoDB, or a completely different data store?
Please provide some suggestions.
If search is a key requirement for your application, then use a search product - not a database. DynamoDB is great for a lot of things, but ad hoc search is not one of them - you are going to end up running lots of very expensive (slow) scans if you go with DynamoDB; this is what ES was built for.
I have decent working experience with DynamoDB and extensive working experience with Elasticsearch (ES).
Let's first understand the key difference between these two:
DynamoDB is
Amazon DynamoDB is a key-value and document database
while Elasticsearch
Elasticsearch is a distributed, open source search and analytics
engine for all types of data, including textual, numerical,
geospatial, structured, and unstructured data.
Now, coming to the question, let's discuss how these systems work internally and how that affects performance.
DynamoDB is great for fetching documents based on keys but not great for filtering and searching. Just as in a relational database you create indexes on columns to improve the performance of these operations, in DynamoDB you have to create an index, because it is a database, not a search engine. Creating indexes on fields on the fly is a pain, and results are not cached in DynamoDB.
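
For concreteness, this is roughly what "creating an index" means in DynamoDB terms: a global secondary index on field4, which has to be a top-level attribute (so the nested field4 from the question would need to be copied up to the top level first). The table name and capacity numbers are placeholders; the throughput block applies to provisioned-capacity tables:

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="my-table",
    AttributeDefinitions=[{"AttributeName": "field4", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "field4-index",
                "KeySchema": [{"AttributeName": "field4", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)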
Elasticsearch stores data differently: it creates an inverted index for all indexed fields (the default, as mentioned by the OP), and filtering on these fields is super fast if you use the filter context, which is exactly the use case here. More info with examples is in the official ES docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context. Also, because these filters are not used for score calculation and are cached by Elasticsearch, their performance (both read and write) is super fast compared to DynamoDB, and you can benchmark that as well.
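
For comparison, a sketch of the kind of filter-context query the linked page describes, using the Elasticsearch Python client (8.x-style API; index name, field values, and keyword mappings are assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-documents",
    query={
        "bool": {
            "filter": [                               # filter context: no scoring, cacheable
                {"term": {"field1": "value1"}},
                {"term": {"field2": "value2"}},
                {"term": {"field3.field4": "value4"}},  # dotted path into the object field
            ]
        }
    },
)
print(resp["hits"]["total"])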