Get data on how results were found in Amazon CloudSearch

As we use CloudSearch to find our documents and data, we have the issue that, for some of the data returned to us, we need to know how it was found.
I know that we can specify which fields to search, but is there any way Amazon gives us a hint or some information about how the returned data was found, i.e., on which fields it matched?
This could be really useful information for us and would affect the way we show data to our users.
I know Amazon provides a highlighting service, but highlights change the results; we don't want to change the results or values, we just want to use this knowledge for backend purposes.
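Since highlighting comes up: one backend-only workaround is to request highlights but never display them, using the highlight markers solely to detect which fields matched. A minimal sketch with boto3, assuming text fields `title` and `description` and a hypothetical domain endpoint:

```python
# A minimal backend-only sketch: request plain-text highlights with a known
# marker, check which fields contain it, and show the user only the untouched
# values from "fields". Endpoint and field names are assumptions.
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

resp = client.search(
    query="wireless headphones",
    queryParser="simple",
    returnFields="title,description",  # unmodified values for the UI
    # Ask for text-format highlights with a marker we can test for.
    highlight='{"title": {"format": "text", "pre_tag": "*", "post_tag": "*"},'
              ' "description": {"format": "text", "pre_tag": "*", "post_tag": "*"}}',
)

for hit in resp["hits"]["hit"]:
    # A field whose highlight contains the marker is one the match occurred on.
    matched = [f for f, v in hit.get("highlights", {}).items() if "*" in v]
    print(hit["id"], "matched on:", matched)
```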

Related

DynamoDB Indexing Assistance and Getting My Data Out

I preface all of this by saying I’m still actively learning DynamoDB, and I think an answer to my question will help me understand a few things.
I have an analytics microservice that pushes custom (internal) analytics events into a DynamoDB table. Columns in our Dynamo rows/items include data like:
User ID
IP Address
Event Action
Timestamp
Split Test ID
Split Test Value
One of the main questions we want to pull from this db is:
"How many users saw split test x with values y?"
I’m struggling to understand how I should index my database to account for this kind of request. I set up a “Keys Only” index targeting Split Test ID, and the query to gather these is fairly efficient, but it only pulls UserID and Split Test ID. Ideally I want an efficient query that returns multiple other associated values as well…
How do I achieve this? Do I need to be doing something much differently? Additionally, if any of my understanding of Dynamo, based on my explanations, sounds completely lacking in some regard, please point me in the right direction!
You're thinking of DynamoDB as a schema-less database, which it obviously is. However, that does not mean that a schema is not important. Schemas in NoSQL databases are usually more important than they are in SQL databases, and they are usually less straightforward.
The most important factor in deciding how you will store your data is how you will access it. You will have to take into account all the ways you will want to access your data and make sure each of them is possible by creating the necessary data columns and the necessary indexes. In this case, if you want to know how many times two values are combined in a certain way, you can simply add a column that holds the combined values (e.g., splitId#splitValue) and use that in your indexes.
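A minimal sketch of that pattern (all names here are assumptions, not from the question): a table `AnalyticsEvents` keyed on userId plus timestamp, with a GSI `SplitTestIndex` whose partition key is the combined `splitTestKey` attribute. Projecting the other attributes into the index (projection type ALL or INCLUDE) is what lets a single query return the associated values the question asks about.

```python
# Sketch only: table "AnalyticsEvents" (partition key userId, sort key
# timestamp) and GSI "SplitTestIndex" on the combined attribute "splitTestKey"
# are hypothetical names.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AnalyticsEvents")

def record_event(user_id, timestamp, split_id, split_value, **attrs):
    # Store the combined value alongside the raw fields so the GSI can key on it.
    table.put_item(Item={
        "userId": user_id,
        "timestamp": timestamp,
        "splitTestKey": f"{split_id}#{split_value}",
        "splitTestId": split_id,
        "splitTestValue": split_value,
        **attrs,
    })

def count_users_for_split(split_id, split_value):
    # Answers "how many users saw split test x with value y?" with one Query.
    # (A very large result set would need LastEvaluatedKey pagination.)
    resp = table.query(
        IndexName="SplitTestIndex",
        KeyConditionExpression=Key("splitTestKey").eq(f"{split_id}#{split_value}"),
        Select="COUNT",
    )
    return resp["Count"]
```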
If you want to know more about advanced patterns and such, I advise you to watch this pretty famous re:Invent talk by Rick Houlihan or to read the DynamoDB book.
As a last note, I want to add that switching to an SQL server is usually not the solution. Picking NoSQL over SQL is usually based on non-functional requirements. There is a reason NoSQL databases are used in applications that require very low-latency retrieval of data from huge datasets, but as with everything, trade-offs are the name of the game.

Extract business-related data from invoices using AWS Textract

We need to extract details from documents like invoices, delivery challans, etc. I was going through the AWS Textract demo, where we can simply upload a PDF document and see what details it extracts as key-value pairs, tables, etc.
While doing the above, I found that a few specific keys, like Invoice Number and PAN, which are very important for us, sometimes get extracted but sometimes do not, even though the document I am using is of quite high quality.
So my question is: is there any way to specify exactly which keys we need extracted from the document?
If they are available in the document, AWS should extract them; otherwise, it should leave those fields empty in the response.
Thanks,
Kavita
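Textract's FORMS output does not take a list of wanted keys, but a common workaround is to extract all key-value pairs and filter them yourself, leaving missing keys empty, as asked. A rough sketch (bucket, file name, and key spellings are placeholders; multi-page PDFs would need the asynchronous StartDocumentAnalysis API instead):

```python
# Sketch: run FORMS analysis, flatten the KEY_VALUE_SET blocks into a dict,
# then keep only the keys we care about, empty when absent.
import boto3

WANTED_KEYS = ["invoice number", "pan"]  # placeholder spellings

textract = boto3.client("textract")
resp = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "invoice.pdf"}},
    FeatureTypes=["FORMS"],
)
blocks = {b["Id"]: b for b in resp["Blocks"]}

def text_of(block):
    # Concatenate the WORD children of a key or value block.
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for cid in rel["Ids"]:
                if blocks[cid]["BlockType"] == "WORD":
                    words.append(blocks[cid]["Text"])
    return " ".join(words)

pairs = {}
for b in resp["Blocks"]:
    if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
        key = text_of(b).lower().replace(":", "").strip()
        value = ""
        for rel in b.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for vid in rel["Ids"]:
                    value = text_of(blocks[vid])
        pairs[key] = value

# Keep only the keys we care about; absent keys stay empty, as requested.
result = {k: pairs.get(k, "") for k in WANTED_KEYS}
print(result)
```

Newer Textract releases also added a QUERIES feature type, where AnalyzeDocument accepts a QueriesConfig with natural-language questions such as "What is the invoice number?", which is closer to specifying keys directly.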

How to efficiently implement a page view counter in DynamoDB?

I am essentially trying to build a website where members can post blog entries, and I want to record unique and overall page views for the different posts, both in absolute terms and over different time frames, e.g., last 24h, last week, etc.
My initial approach was to use the date as partition key and the blogPostId as sort key; I could then add all the posts visited during a given day. If I then include the userIds as an attribute, I should be able to get a) unique page views and b) overall page views (which might include duplicate visits by a specific user) for a given day. Finally, I would pull the partition key for, say, the last 7 days and extract the most popular post.
As far as I can tell this should work fine as long as there aren't too many entries; however, I'm sceptical whether this will scale. More specifically, if the number of blog posts increases a lot in a given interval, or if I want to find the all-time most viewed post, I'd essentially have to read the whole table.
Does anyone have an idea how I could implement this more efficiently?
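To make the described approach concrete, here is a minimal sketch under assumed names (table `PageViews`, partition key `date`, sort key `blogPostId`): a single UpdateItem can increment the overall counter and collect unique viewers in a string set at the same time.

```python
# Sketch of the question's design; table and attribute names are assumptions.
import datetime
import boto3

table = boto3.resource("dynamodb").Table("PageViews")

def record_view(blog_post_id, user_id):
    today = datetime.date.today().isoformat()  # e.g. "2023-05-01"
    table.update_item(
        Key={"date": today, "blogPostId": blog_post_id},
        # ADD on a number increments atomically; ADD on a string set only
        # inserts the userId if it is not already present.
        UpdateExpression="ADD totalViews :one, viewers :uid",
        ExpressionAttributeValues={
            ":one": 1,
            ":uid": {user_id},  # a Python set maps to a DynamoDB string set
        },
    )

def views_for_day(day, blog_post_id):
    item = table.get_item(Key={"date": day, "blogPostId": blog_post_id}).get("Item", {})
    return item.get("totalViews", 0), len(item.get("viewers", set()))
```

Note the 400 KB item size limit caps how many unique viewer IDs one day/post item can hold, which is part of the scaling concern raised above.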
DynamoDB will almost certainly work for you, and if you need an excuse to use it, by all means give it a try. If you get a ton of traffic it might end up being expensive.
Personally, I would consider using redis for what you are asking to do, and here is a pretty good/detailed question/answer on how you might implement it:
Scalable way of logging page request data from a PHP application?
DynamoDB can be used to iterate on and build this feature quickly.
Nonetheless, this is really a use case for Amazon Kinesis Data Streams, which will let you ingest the data and then manipulate it to your needs.
Know that Kinesis can become expensive, even if you try to be as frugal as possible.
But if you start receiving a lot of traffic, Kinesis will work as a queue and let you manipulate the data before ingesting it into DynamoDB (or another data store), which will be cheaper than sending all those individual write requests.
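For illustration, pushing an event into a (hypothetical) stream before any DynamoDB write might look like this; a consumer would then batch and aggregate before storing:

```python
# Sketch: publish raw view events to Kinesis; the stream name is a placeholder.
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_view_event(blog_post_id, user_id):
    kinesis.put_record(
        StreamName="page-view-events",
        Data=json.dumps({"blogPostId": blog_post_id, "userId": user_id}).encode("utf-8"),
        PartitionKey=blog_post_id,  # keeps one post's events ordered on a shard
    )
```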
Another limitation to be aware of is that DynamoDB will only return up to 1 MB of data per Query.
Amazon recommends using Redshift to handle these kinds of operations, as it is better suited to performing aggregation and calculation across a data warehouse.

Is it possible to use CloudSearch *only* for storage?

Reading the documentation, it's not really clear.
What I want is to be able to store and retrieve simple JSON documents. With CloudSearch it seems possible to store documents in SDF format and then search for them, but it only returns the document ID and a small part (200 characters, I think) of the specified fields.
Is there a way to retrieve the full document by ID using just CloudSearch? Or is it intended to work as an additional tool for searching, on top of your primary storage service?
If you index the id as a literal and search for that exact id, then yes, you can, but it seems like a waste to use CloudSearch in that way. What about S3?
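A minimal sketch of that lookup, assuming the document id was also indexed as a returnable literal field called `docid` and a hypothetical domain endpoint:

```python
# Sketch: exact-match fetch by id via a structured query, returning all fields.
import boto3

client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-xxxx.us-east-1.cloudsearch.amazonaws.com",
)

resp = client.search(
    query="docid:'abc-123'",     # structured syntax: exact match on a literal field
    queryParser="structured",
    returnFields="_all_fields",  # ask for every returnable field
    size=1,
)

hits = resp["hits"]["hit"]
document = hits[0]["fields"] if hits else None  # None if the id was not indexed
```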

Retrieving "businesses" with Google Maps API?

This is an example of a Business on Google Maps
It has elements attached such as:
Reviews from various sites (qype, viewlondon, etc...)
Details provided by various sites
Photos and other content
I don't know how to go about retrieving such a business and associating it with any items generated on my website.
What I have implemented to date is a system using geocoding (geopy) which, given an address, returns the latitude and longitude, but such a system does not help me with this dilemma.
What you want is this API:
http://code.google.com/apis/ajaxsearch/local.html
Also check this:
http://googleajaxsearchapi.blogspot.com/2007/06/local-search-control-for-maps-api.html
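Those links point to the old AJAX Local Search API, which Google has since deprecated; as a rough sketch of the same kind of lookup with its successor, the Places API Text Search endpoint (the API key is a placeholder):

```python
# Sketch: look up businesses via the Places API Text Search endpoint, which
# superseded the AJAX Local Search API linked above.
import requests

resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={"query": "pizza in London", "key": "YOUR_API_KEY"},
)

for place in resp.json().get("results", []):
    # Each result carries a stable place_id you can store and associate with
    # items on your own site, plus name/address/rating fields.
    print(place["place_id"], place["name"], place.get("formatted_address"))
```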
By writing a relay server script you could do things like this, which obtains most of that information in a different layout. I don't know whether it's legal to do that, though.