When training an entity recognizer on comprehend using a entity list of abbreviation. Would it start identifying other abbreviations not on the list?

When training an entity recognizer on comprehend using a entity list of abbreviation. Would it start identifying other abbreviations not on the list? - amazon-web-services

Basically:
If I train a custom entity recogniser on comprehend using an entity list of gene abreviations, would it potentially start identifying other abbreviations that weren't in the entity lists as gene names?
The model has identified the abbreviation UVB as a gene name, however it's not in any of my dictionaries and I'm not sure why this would have happened

Correct, Comprehend's entity list mode is an easy way to provide a non-exhaustive list of entities while training a model. The model will learn the context in documents with these entities and subsequently generalize to unseen entities in similar context.

Related

Gremlin load data format

I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?

You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends), might look like this
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, so long as they are unique, any user provided string can be used as a node or edge ID. So in your example, if customer_no is guaranteed to be unique, rather than store it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a Primary Key in a relational database.
So in summary, you need to always provide the required fields like ~id and ~label. They are accessed differently using Gremlin steps such as hasLabel and hasId once the data is loaded. Columns with names from your domain like order_no will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123')

To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I may derive a "naive" model that looks like:
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no's fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then want to look at the model and see if we can answer relevant questions that we would want to ask of this data. By example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id, ~label, zip_code, date_order_created, item_short_description
customer001, Customer, 90210, ,
order001, Order, , 2023-01-10,
item001, Item, , , "A small, non-descript black box"
Note that I'm reusing the no's/ids for the customer, item, and order as the ID for their related vertices. This is always good practice as you can then easily lookup a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we are including the vertices from which the relationship is deriving and what it is applying to, respectively:
~id, ~label, ~from, ~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.

When training an entity recognizer on comprehend using a entity list of abbreviation, why would entities included in the list not be identified?

I've trained a custom entity recogniser on AWS Comprehend using the entity list method to recognise gene abbreviations. However, when analysing .txt files, it doesn't identify some of the genes included in the entity list.
They have the same capitalisation as the entity list and I've adhered to the training guidelines at: https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-custom-entity-recognition concerning entity number and document size.
Do you know why this might happen and how it may be fixed?

How to assign an ID to a entity in a Google AuoML natural language entity extraction model?

How to give assign an ID to the entities added into AutoML Natural Language entity recognition model. For example, I Have added an entity "Chelsea" under "Sports" Label. How will I assign "Chelsea" an ID, so that whenever an article with an entity "Chelsea" comes in, it gets auto-tagged to a database?

You'll probably want to follow the documentation which will be helpful in explaining how AutoML Natural Language works.
Basically, you are labeling your entities by category, and then an ML model will be built to recognize when a word or phrase should fall within that category -- even one's you haven't labeled. You are not building a list of future labeled entities to extract from text, that would be a simple algorithm and not machine learning.

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for ecommerece site.
One index will have products stored in it, in products index I'll store categories in it's other attributes along with. Categories can be multiple but the attribute will have single field value. (E.g. color)
Let's say user types in Black(color) Nike(brand) shoes(Categories)
I want to process this query so that I can extract entities (brand, attribute, etc...) and I can write Request body search.
I have tought of following option,
Applying regex on query first to extract those entities (But with this approach not sure how Fuzzyness would work, user may have typo in any of the entity)
Using OpenNLP extension (But this one only works on indexation time, in above scenario we want it on query side)
Using NER of any good NLP framework. (This is not time & cost effective because I'll have millions of products in engine also they get updated/added on frequent basis)
What's the best way to solve above issue ?
Edit:
Found couple of libraries which would allow fuzzy text matching in regex. But the entities to find will be many, so what's the best solution to optimise that ?
Still not sure about OpenNLP
NER won't work in this case because there are fixed number of entities so prediction is not right when there are no entity available in the query.

If you cannot achieve desired results with tuning of built-in ElasticSearch scoring/boosting most likely you'll need some kind of 'natural language query' processing:
Tokenize free-form query. Regex can be used for splitting lexems, however very often it is better to write custom tokenizer for that.
Perform named-entity recognition to determine possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name) etc. In fact you don't need OpenNLP for that as this should be just an index (keyword -> field(s)), and you can try to use ElasticSearch 'suggest' API for this purpose.
(optional) Recognize special phrases or combinations like "released yesterday", "price below $20"
Generate possible combinations of matches, and with help of special scoring function determine 'best' recognition result. Scoring function may be hardcoded (reflect 'common sense' heuristics) or it this may be a result of machine learning algorithm.
By recognition result (matches metadata) produce formal query to produce search results - this may be ElasticSearch query with field hints, or even SQL query.
In general, efficient NLQ processing needs significant development efforts - I don't recommend to implement it from scratch until you have enough resources & time for this feature. As alternative, you can try to find existing NLQ solution and integrate it, but most likely this will be commercial product (I don't know any good free/open-source NLQ components that really ready for production use).

I would approach this problem as NER tagging considering you already have corpus of tags. My approach for this problem will be as below:
Create a annotated dataset of queries with each word tagged to one of the tags say {color, brand, Categories}
Train a NER model (CRF/LSTMS).
This is not time & cost effective because I'll have millions of
products in engine also they get updated/added on frequent basis
To handle this situation I suggest dont use words in the query as features but rather use the attributes of the words as features. For example create an indicator function f(x',y) for word x with context x' (i.e the word along with the surrounding words and their attributes) and tag y which will return a 1 or 0. A sample indicator function will be as below
f('blue', 'y') = if 'blue' in `color attribute` column of DB and words previous to 'blue' is in `product attribute` column of DB and 'y' is `colors` then return 1 else 0.
Create lot of these indicator functions also know as features maps.
These indicator functions are then used to train a models using CRFS or LSTMS. Finially we use viterbi algorithm to find the best tagging sequence for your query. For CRFs you can use packages like CRFSuite or CRF++. Using these packages all you have go do is create indicator functions and the package will train a model for you. Once trained you can use this model to predict the best sequence for your queries. CRFs are very fast.
This way of training without using vector representation of words will generalise your model without the need of retraining. [Look at NER using CRFs].

Basic List on Entity Form

In Dynamics CRM online, I can add a list of entities to another entity, for example a list of products to an opportunity.
Is there any way I can have a list that is not picked from pre-populated items, e.g. just a simple list of {number, date, text} that you type in each time you want to add to the list, not picking items from a pre-defined list.
I am just using the web interface to customise at the moment, but I am open to any suggestions.
EDIT:
So far i have;
Created two entities, proposal and proposal version
Added a 1:N relationship between proposal and proposal version
Added a sub-grid to the proposal form, tried to make it editable but it refuses to work
This lets me add new rows by opening up the proposal version form and adding a new one or picking from already created ones for other proposals but that is rather clunky for a simple list.
I don't want it to offer to search for previous entries, just let me add to the list by typing stuff in, surely this should be fairly simple?

If you want a pre-defined list of items that are simple (number, date, text..) then you can create an option set field in CRM. These lists are fixed and can only be extended by customising the system. An example option set field might be Organisation Type:
Prospect
Site
Head Office
...
If you want a pre-defined list that can be extended, you need to create a new entity. Following from the previous example, you would create a custom entity called Organisation Type and then create a record for each type you wanted, populating only the name field with the type: Prospect, Site etc.
Then you would add a lookup field pointing to the Organisation Type entity on any other entity that used the field, such as Organisation (Account).
You see how the custom entity still appears as simple data because you're only populating the name field, which can be text, a number etc. You can also apply security roles to this entity, limiting which users can create and delete options from your list.
Edit: to only allow the creation of new records in a subgrid, make sure the lookup attribute to the parent entity on the child entity is business required.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js