Google Document AI - Add New Label Workflow

We’re just getting started with Document AI. So far, we have about 80 labeled documents and one trained version.
We are making changes to the schema and adding a property, and we'd like to go back and apply this new label to the previously labeled documents.
The Document AI user interface presents some challenges here.
I want to isolate the documents that don’t contain this label. With the filtering capabilities, it looks like I’m only able to filter on documents that have that label, not the inverse.
I also don’t see a way to mark a bunch of documents as unlabeled once they have been marked as labeled. That would be useful for indicating which previously labeled documents need some additional work.
For those that are making schema changes and need to go back and re-label documents, what does your workflow look like?

Related

Using google cloud for image classification, cropping and OCR

Please allow me to ask a rather newbie question. So far, I have been using local tools like ImageMagick or GOCR to do the job, but that is rather old-fashioned, and I am being urged to "move to Google Cloud AI".
The setup
I have a (training) data set of various documents (as JPG and PDF) of different kinds, and by certain features (like prevailing color or repetitive layout) I intend to classify them, e.g. as invoice type 1, invoice type 2, or not an invoice. In a second step, I would like to OCR certain predefined areas of each document and extract e.g. the address of the company sending the invoice and the date.
The architecture I am envisioning
In a modern platform as a service (PaaS), I have already set up a UI where I can upload new files. These are then stored locally in a directory with filenames (or in MongoDB). Meta info like the upload timestamp, user, and original file name is stored in a DB.
The newly uploaded file should then be submitted to Google Cloud, which should perform the classification step and deliver back the label to be saved in the database.
The document pages should be auto-cropped, i.e. black or white margins removed, most probably with Google Cloud as well. The parameters of the crop should be persisted in the DB.
In case it is e.g. an invoice, OCR should be performed (again by Google Cloud) on certain regions of the document, e.g. a bounding box spanning from the middle of the page to the right margin in the upper 10% of the cropped page. The results of the OCR should again be persisted locally.
The problem
I seem to be missing the correct search term to figure out how to do this with Google Cloud. Is there a Google API (e.g. REST) I can use for the upload which gives me back the results of steps 2 to 4?
I think that your best option here is to use Document AI (REST API and Libraries).
Using Document AI, you can:
Convert images to text
Classify documents
Analyze and extract entities
Additionally, for your use case, we have a new Document AI feature that is still in preview and has limited access: the Invoice parser.
The Invoice parser is similar to the Form parser, but for invoices instead of forms. Check out the Invoice parser page and you will see what I mean by preview and limited access.
AFAIK, there isn't any GCP tool for image editing.
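For the classification/OCR part, here is a minimal sketch of calling a Document AI processor with the Python client library (the REST API exposes the same processDocument method). The project, location, processor ID, and file name below are placeholders for resources you would create in the Document AI console:

```python
from google.cloud import documentai_v1 as documentai

project_id = "my-project"   # placeholder
location = "eu"             # "us" or "eu", depending on where the processor lives
processor_id = "abc123"     # placeholder: e.g. a Document OCR, classifier, or invoice processor

client = documentai.DocumentProcessorServiceClient(
    client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
)
name = client.processor_path(project_id, location, processor_id)

# Read the uploaded file and send it for processing.
with open("invoice.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)
document = result.document

# Full OCR text of the document.
print(document.text[:500])

# Extracted entities (e.g. supplier name, invoice date) when using a
# specialized processor such as the invoice parser.
for entity in document.entities:
    print(entity.type_, entity.mention_text, entity.confidence)
```

The label/entities returned here are what you would persist in your database alongside the upload metadata.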

Extract Business Related data from Invoice using aws textract

We need to extract details from documents like invoices, delivery challans, etc. So I was going through the AWS Textract demo, where we can simply upload a PDF document and see what details it extracts as key-value pairs, tables, etc.
While doing the above, I found that a few specific keys like Invoice Number, PAN, etc., which are very important for us, are sometimes extracted and sometimes not, even though the document I am using is of quite high quality.
So my question is: is there any way we can specify exactly which keys we need to extract from the document?
If they are available in the document, AWS should extract them; otherwise, it should keep those fields empty in the response.
Thanks,
Kavita
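For reference, the key-value pairs shown in the demo come from the AnalyzeDocument API with the FORMS feature. A minimal boto3 sketch that walks the returned key-value blocks (the file name and region are placeholders):

```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Read the document bytes. The synchronous API handles images and
# single-page PDFs; multi-page PDFs need StartDocumentAnalysis instead.
with open("invoice.pdf", "rb") as f:
    document_bytes = f.read()

# FORMS returns key-value pairs, TABLES returns table cells.
response = textract.analyze_document(
    Document={"Bytes": document_bytes},
    FeatureTypes=["FORMS", "TABLES"],
)

blocks = {b["Id"]: b for b in response["Blocks"]}

def block_text(block):
    """Concatenate the WORD children of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

# Walk the KEY blocks and print each detected key with its value.
for block in response["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        key_text = block_text(block)
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_text = block_text(blocks[value_id])
        print(f"{key_text!r} -> {value_text!r}")
```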

The correct way to remove or update Item

I am building a recommendation system for a classified ads website; ads are added and deleted daily.
What I thought of is to use PutItems to add new ads with a field called status = 0. If a user deletes an ad, I will use the same PutItems API with the same ITEM_ID to update the stored item, and use a filter to select only ads with status = 0 when generating recommendations.
Is that correct? Will the PutItems API update the existing ad? And is there any way to delete the item?
Currently there is no way to remove items that were already added to Datasets.
Your workaround looks good; however, from my experience working with Personalize, the filter might decrease your recommendation quality.
To understand why, this is, more or less, the algorithm that Personalize uses for filtering recommendations:
Get recommended items for user
Filter recommendations using filter expression
Return first N recommended items left after filtering
Because the filtering is done after getting recommendations, Personalize will simply fill the recommendations list with items that were further down the recommended list.
And there is a problem with that approach: items lower on the list have a lower "Score" value, which indicates the accuracy of the recommendation. That's why you will end up with generally worse recommendations, though it will depend on how many ads with status = 0 were recommended before being filtered out.
To check your recommendation scores, simply get recommendations in the Personalize web UI. It will return a list of recommendations with scores.
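You can also check the scores programmatically; a minimal boto3 sketch, where the campaign ARN and user ID are placeholders:

```python
import boto3

personalize_runtime = boto3.client("personalize-runtime")

# GetRecommendations returns a score per item (for recipes such as
# User-Personalization), so you can compare items that survive the
# filter against the top of the unfiltered list.
response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:eu-west-1:123456789012:campaign/ads-campaign",
    userId="user-42",
    numResults=25,
)
for item in response["itemList"]:
    print(item["itemId"], item.get("score"))
```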
Better approach
If your ads are updated daily, then you can definitely work around it by following these steps:
Create a Lambda function that is triggered every 24 hours
Lambda will fetch all of the ads and put them into an S3 bucket as a CSV file. It should exclude ads that are no longer available (status = 0)
Call the CreateDatasetImportJob API using any AWS SDK of your choice and provide the data stored in the S3 bucket (see the sketch after these steps)
Personalize will start an import job. When it finishes, all of the items are replaced with the newest dump
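A sketch of the CreateDatasetImportJob call from step 3 using boto3; the job name, dataset ARN, S3 path and IAM role ARN are placeholders for resources you would create yourself:

```python
import time
import boto3

personalize = boto3.client("personalize")

# Kick off a full replacement of the Items dataset from the CSV that
# the Lambda exported to S3.
response = personalize.create_dataset_import_job(
    jobName=f"items-refresh-{int(time.time())}",
    datasetArn="arn:aws:personalize:eu-west-1:123456789012:dataset/ads-group/ITEMS",
    dataSource={"dataLocation": "s3://my-ads-bucket/exports/items.csv"},
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3AccessRole",
)
print("Import job ARN:", response["datasetImportJobArn"])
```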
However, this approach has some downsides.
If you are not using the User-Personalization (aws-user-personalization) Recipe, then after each import of Items you need to update your Solution by creating a new Solution Version. Otherwise it won't include the changes made by the items dataset import job.
Creating a new Solution Version is quite slow and expensive, which is why I would recommend using the User-Personalization Recipe if you want to use this approach; and since the HRNN Recipes are marked as legacy, it's a good idea to migrate anyway.
If you are using the User-Personalization Recipe, then according to the AWS documentation:
Amazon Personalize automatically updates your latest solution version every two hours to include new data. Your campaign automatically uses the updated solution version. For more information see Automatic Updates.
So pretty much all of the work is done on the Personalize side and you don't have to worry about retraining the Solution after each Items import job.
And the last problem...
Since the documentation claims that for the User-Personalization Recipe your solution will be updated within two hours, you might end up recommending items that are no longer available for some short period of time. If you are updating items daily, that might be a significant problem.
To fix that, I would recommend also using the Filter approach that you mentioned. Thanks to this, you get the benefits of both approaches and your recommendations are always valid.
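A sketch of that Filter approach with boto3; the dataset group and campaign ARNs are placeholders, and the filter expression assumes your Items dataset has a STATUS metadata column (adjust the field name and value to match however you encode availability):

```python
import boto3

personalize = boto3.client("personalize")
personalize_runtime = boto3.client("personalize-runtime")

# One-time setup: exclude ads flagged as unavailable. Wait for the
# filter status to become ACTIVE before using it in requests.
filter_response = personalize.create_filter(
    name="exclude-unavailable-ads",
    datasetGroupArn="arn:aws:personalize:eu-west-1:123456789012:dataset-group/ads-group",
    filterExpression='EXCLUDE ItemID WHERE Items.STATUS IN ("0")',
)

# At request time, pass the filter ARN so unavailable ads never reach the client.
recommendations = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:eu-west-1:123456789012:campaign/ads-campaign",
    userId="user-42",
    filterArn=filter_response["filterArn"],
    numResults=25,
)
for item in recommendations["itemList"]:
    print(item["itemId"], item.get("score"))
```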

Do I need to update item csv in AWS personalize?

I'm trying to use AWS Personalize, and following their documentation.
So I've uploaded the dataset files (interactions, users, items) to S3, then created a solution and a campaign.
And I implemented the PutEvents API using Java.
The GetRecommendations API call works well.
At this point I'm curious whether I need to update the dataset files, especially the item CSV.
In general, you're done at this point for very basic recommendations.
Since you are using the PutEvents call, all of the real-time events are added to the Interactions dataset that way. Interactions datasets created by manual import and by PutEvents calls are kept separate from each other. You can actually see them in the Personalize Datasets web console.
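For reference, a boto3 (Python) sketch of the PutEvents flow described above; the question uses the Java SDK, but the call is equivalent. The tracking ID comes from an event tracker created for your dataset group, and all IDs are placeholders:

```python
import time
import uuid
import boto3

personalize_events = boto3.client("personalize-events")

# Record a single real-time interaction; it lands in the PutEvents
# interactions dataset mentioned above.
personalize_events.put_events(
    trackingId="11111111-2222-3333-4444-555555555555",  # from your event tracker
    userId="user-42",
    sessionId=str(uuid.uuid4()),
    eventList=[
        {
            "eventType": "click",
            "itemId": "item-123",
            "sentAt": int(time.time()),
        }
    ],
)
```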
Still, you might want to update the dataset files using the dataset import job feature, but it's going to replace your existing dataset. In general I would recommend using it only when:
You just created a fresh/bigger/better dump of your database with Interactions.
You've found that your previous interactions dataset was invalid.
The schema of the dataset changed (pretty much you are forced to do it then).
The User or Item dataset changed/improved; it's actually a good idea to refresh it often, so Personalize can produce better recommendations. Keep in mind that it also requires retraining the Solution, so the new Items/Users will be included during recommendation generation.
So for interactions you usually don't want to update the dataset. For the other datasets, it might even be a good idea to create an automatic import mechanism.
Keep in mind that the Items and Users datasets are used only with Personalize Recipes that support metadata. Otherwise they are simply ignored.

Sitecore 6 Filtering Items based on a profile

I am looking for a generic method of filtering a series of Sitecore items based on the user's current profile. I found one promising example:
How do I trigger a profile in Sitecore DMS?
However, a few critical references are missing, which is a shame as it looks to be a suitably generic function.
Resources.Settings.AnalyticsUserProfileEnableSwitch I assume to simply be a boolean switch.
The killer is ApplyUserProfile(filter).
Please keep in mind that user profiles are NOT the same thing as profiles in DMS. In DMS these are Analytics profiles related not to the specific user but to visitor profiles, i.e. marketing personas.
If you want to filter items based on user profiles, you simply get Sitecore.Context.User.Profile, read whatever the property is, and implement your own logic for how you want to filter.
If you want to filter items based on DMS profiles, then that's going to be difficult, because personas are not entered into the Analytics database in real time. You really won't even be aware of them at run time, and therefore it's going to be difficult to categorize the persona at run time. You could, however, use the rules system to do some filtering based on other criteria (such as using Engagement Plans or something else), but without more information, that's about as much as can be said.