Can I use pre-labeled data in AWS SageMaker Ground Truth NER?

Let's say I have some text data that has already been labeled in SageMaker. This data could have been labeled either by humans or by an NER model. Now suppose I want a human to go back over the dataset, either to label a new entity class or to correct existing labels. How would I set up a labeling job to allow this? I tried using an output manifest from another labeling job, but none of the documents that were already labeled can be accessed by workers for re-labeling.

Yes, this is possible. What you are looking for is custom labeling workflows. You can also apply either Majority Voting (MV) or MDS to evaluate the accuracy of the job.
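One known gotcha explains what you saw: Ground Truth skips any data object whose manifest entry already contains the new job's label attribute name, treating it as labeled. Giving the second job a different LabelAttributeName makes every document available to workers again, and in a custom workflow the pre-annotation Lambda can read the existing labels out of each manifest entry and pre-fill them in the worker UI. Below is a minimal boto3 sketch; all names and ARNs are placeholders for your own resources.

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_labeling_job(
        LabelingJobName="ner-relabel-pass-2",
        # Use a NEW label attribute name: objects whose manifest entry already
        # contains this attribute would otherwise be skipped as "labeled".
        LabelAttributeName="ner-pass-2",
        InputConfig={
            "DataSource": {
                "S3DataSource": {
                    # Feed the previous job's output manifest as input.
                    "ManifestS3Uri": "s3://my-bucket/prev-job/manifests/output/output.manifest"
                }
            }
        },
        OutputConfig={"S3OutputPath": "s3://my-bucket/relabel-output/"},
        RoleArn="arn:aws:iam::123456789012:role/MyGroundTruthRole",
        HumanTaskConfig={
            "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
            # Custom workflow: your own task template plus pre-annotation and
            # consolidation Lambdas (the consolidation Lambda is where
            # majority voting across workers can be implemented).
            "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/templates/ner-audit.liquid.html"},
            "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:pre-annotation",
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:consolidation"
            },
            "TaskTitle": "Review and correct NER labels",
            "TaskDescription": "Correct existing entity spans and add the new entity class",
            "NumberOfHumanWorkersPerDataObject": 3,
            "TaskTimeLimitInSeconds": 3600,
        },
    )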

Related

Vertex AI batch predictions - getting feature names for DataFrame

I'm using batch predictions with custom trained models. Generally, one would want to write a line like,
df = pd.DataFrame(instances)
...as one of the first steps prior to doing any custom preprocessing of features. However, this doesn't work with batch predictions - the resulting DataFrame will not have column names as expected. It appears to be a numpy array.
Is there a decent or canonical approach to retrieving the feature (column) names, in case the table changes? (It's better not to assume that the table's columns and their positions all stay the same.)
I'm initiating the batch prediction job with the python client. I based my model off of this example.
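One common workaround (a pattern, not a Vertex-specific API): persist the training-time column order as a small artifact next to the model, then reattach it when batch prediction hands you positional rows. A minimal sketch, where feature_names.json is a hypothetical artifact written during training:

    import json
    import pandas as pd

    # Written at training time, e.g. json.dump(list(train_df.columns), f),
    # and shipped alongside the model (locally or in GCS).
    with open("feature_names.json") as f:
        feature_names = json.load(f)  # e.g. ["age", "income", "tenure"]

    # "instances" is the positional array handed to the prediction code.
    df = pd.DataFrame(instances, columns=feature_names)

Alternatively, if the batch input comes from a BigQuery table, you can fetch that table's schema at job time instead of assuming fixed column positions.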

How do I apply my model to a new dataset in WEKA?

I have created a new prediction model based on a dataset that was given to me. It predicts a nominal (binary) class attribute (positive/negative) based on a number of numerical attributes.
Now I have been asked to use this prediction model to predict classes for a new dataset. This dataset has all the same attributes except for the class column, which does not exist yet. How do I apply my model to this new data? I have tried adding an empty class column to my new dataset and then doing the following:
Simply loading the new dataset in WEKA's explorer and loading the model. It tells me there is no training data.
Opening my training set in WEKA's explorer and then opening my training model, then choosing my new data as a 'supplied test set'. It runs but does not output any predictions.
I should note that the model works fine when testing on the training data with cross-validation. It also works fine with a subset of the training data I separated ages ago for test/eval use. I suspect it is a problem with how I am adding the new class column.
For making predictions, Weka requires the two datasets, the training set and the one to make predictions for, to have exactly the same structure, down to the order of the labels. That also means that you need to have a class attribute with the correct labels present. For the values of your class attribute, simply use the missing value (denoted by a question mark).
See the FAQ How do I make predictions with a trained model? on the Weka wiki for more information on how to make predictions.
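For example, a new dataset prepared this way could look like the ARFF below (attribute names and values are placeholders; note the '?' in the class column):

    @relation new_data

    @attribute attr1 numeric
    @attribute attr2 numeric
    @attribute class {positive,negative}

    @data
    1.5,0.2,?
    3.1,4.7,?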

Does Google BigQuery ML automatically make the time series data stationary?

I am a newbie to Google BigQuery ML and was wondering whether auto.ARIMA automatically makes my time series data stationary.
Suppose I have data that is not stationary and I give the data as-is to the auto.ARIMA model using Google BigQuery ML. Will it first make my data stationary before taking it as input?
Yes, but that's part of the modeling procedure. From the documentation that explains what's inside a BigQuery ML time series model, it does appear that auto.ARIMA will make the data stationary.
However, I would not expect it to alter the source data table; it won't make that stationary. Rather, in the course of building candidate models it may transform the input data prior to the actual model fitting (e.g. a Box-Cox transform, making the series stationary, etc.).
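In other words, you can hand the raw series to auto.ARIMA and let it pick the differencing order itself. A minimal sketch using the Python client, with placeholder dataset/table/column names:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train an ARIMA_PLUS model directly on the raw (possibly non-stationary)
    # series; auto.ARIMA tunes p, d, q, so no manual differencing is needed.
    query = """
    CREATE OR REPLACE MODEL `my_dataset.demand_arima`
    OPTIONS(
      model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'ts',
      time_series_data_col = 'value'
    ) AS
    SELECT ts, value FROM `my_dataset.raw_series`
    """
    client.query(query).result()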

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for an ecommerce site.
One index will hold the products; in the products index I'll store categories along with the other attributes. A product can have multiple categories, but each attribute has a single field value (e.g. color).
Let's say a user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a request body search.
I have thought of the following options:
Applying a regex on the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities).
Using the OpenNLP extension (but this one only works at indexing time; in the above scenario we want it on the query side).
Using the NER of any good NLP framework (this is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added frequently).
What's the best way to solve the above issue?
Edit:
I found a couple of libraries that allow fuzzy text matching in regex (see the sketch below). But there will be many entities to find, so what's the best way to optimise that?
Still not sure about OpenNLP.
NER won't work in this case because there is a fixed set of entities, so predictions go wrong when no entity is present in the query.
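For example, with Python's third-party regex module, the fuzzy matching I found looks like this (the brand list is just a placeholder):

    import regex  # pip install regex; not the stdlib "re"

    brands = ["nike", "adidas", "puma"]
    # {e<=1} allows up to one edit (insert/delete/substitute) per brand term.
    pattern = "|".join(f"(?:{b}){{e<=1}}" for b in brands)

    m = regex.search(pattern, "black addidas shoes",
                     flags=regex.IGNORECASE | regex.BESTMATCH)
    print(m.group(0) if m else None)  # matches despite the extra "d"

But with many entities the alternation grows large, which is the optimisation problem I'm asking about.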
If you cannot achieve the desired results by tuning Elasticsearch's built-in scoring/boosting, you'll most likely need some kind of 'natural language query' (NLQ) processing:
Tokenize the free-form query. A regex can be used for splitting lexemes, but very often it is better to write a custom tokenizer for that.
Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name), etc. In fact you don't need OpenNLP for this, since it can just be an index (keyword -> field(s)); you can try to use the Elasticsearch 'suggest' API for this purpose.
(Optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
Generate the possible combinations of matches and, with the help of a scoring function, determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or it may be the result of a machine learning algorithm.
From the recognition result (the match metadata), produce a formal query that yields the search results - this may be an Elasticsearch query with field hints, or even a SQL query.
In general, efficient NLQ processing needs significant development effort - I don't recommend implementing it from scratch unless you have enough resources and time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely this will be a commercial product (I don't know of any good free/open-source NLQ components that are really ready for production use).
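As an illustration of steps 1 and 2 above, here is a minimal sketch; the keyword->field dictionary is a placeholder that would in practice be built from your product data (or replaced by Elasticsearch suggest/term lookups):

    # Placeholder keyword -> candidate-fields index built from product data.
    KEYWORD_FIELDS = {
        "black": ["color", "product_name"],
        "nike": ["brand"],
        "shoes": ["category"],
    }

    def recognize(query: str):
        # Step 1: tokenize (a custom tokenizer can replace the naive split).
        tokens = query.lower().split()
        # Step 2: associate each token with its possible fields.
        return [(t, KEYWORD_FIELDS.get(t, ["full_text"])) for t in tokens]

    print(recognize("Black Nike shoes"))
    # [('black', ['color', 'product_name']), ('nike', ['brand']), ('shoes', ['category'])]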
I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach for this problem would be as follows:
Create an annotated dataset of queries with each word tagged with one of the tags, say {color, brand, category}.
Train an NER model (CRFs/LSTMs).
"This is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added frequently."
To handle this situation, I suggest you don't use the words in the query as features, but rather use the attributes of the words as features. For example, create an indicator function f(x', y) for word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function would be:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word before 'blue' is in the `product attribute` column of the DB, and y is `colors`; else 0.
Create lots of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model using CRFs or LSTMs. Finally, we use the Viterbi algorithm to find the best tagging sequence for the query. For CRFs you can use packages like CRFsuite or CRF++; with these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use the model to predict the best sequence for your queries. CRFs are very fast.
This way of training, without using vector representations of words, will let your model generalise without the need for retraining. (Look up NER using CRFs.)
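A minimal sketch of this approach with sklearn-crfsuite (the lookup sets stand in for the DB columns; a real model needs many annotated queries):

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    # Placeholder lookups standing in for the color/brand/category DB columns.
    COLORS, BRANDS, CATEGORIES = {"black", "blue"}, {"nike", "adidas"}, {"shoes"}

    def word2features(tokens, i):
        w = tokens[i].lower()
        return {
            # Attribute-based features (not word identity), so new products
            # only require refreshing the lookups, not retraining the model.
            "in_color_col": w in COLORS,
            "in_brand_col": w in BRANDS,
            "in_category_col": w in CATEGORIES,
            "prev_in_brand_col": i > 0 and tokens[i - 1].lower() in BRANDS,
            "is_first": i == 0,
        }

    def featurize(sentence):
        tokens = sentence.split()
        return [word2features(tokens, i) for i in range(len(tokens))]

    # Tiny placeholder training set.
    X_train = [featurize("black nike shoes"), featurize("blue adidas shoes")]
    y_train = [["COLOR", "BRAND", "CATEGORY"], ["COLOR", "BRAND", "CATEGORY"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict([featurize("black adidas shoes")]))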

Weka setDataset(), do I need to use a full data set

Do I need to use my full training data set, or can I use a data set with only the attribute descriptions, built from an ARFF file with the exact same attributes and, say, one instance?
I am using a classifier on an EC2 instance, so I don't want to keep the entire data set on the EC2 instance, as it is very large and keeps growing.
Does Weka require the entire data set, or only the description from the ARFF file?
The setDataset() method only takes the attribute descriptions for your Instance from the Instances object (your dataset) that you defined previously. Therefore it does not matter how large the dataset you refer to via setDataset() is.
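So a header-only ARFF with a single dummy instance is enough. For example (all names and values are placeholders):

    @relation header_only

    @attribute attr1 numeric
    @attribute attr2 numeric
    @attribute class {yes,no}

    @data
    0,0,?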