GATE MatchesAnnots Feature Output

I'm attempting to perform coreference using the GATE MatchesAnnots document feature, and I get the following output:
{null = [[866, 871, 872], [869, 873, 877, 879], [874, 895, 896]]}
Can anyone help me understand what this means? I'm assuming each of these arrays is a coreference chain, but what are the numbers? Character start offsets? I'm a bit lost.

Question
what are the numbers?
Answer
They are GATE annotation ids of the chained annotations.
Explanation
The GATE document feature MatchesAnnots contains a map (Map<String, List<List<Integer>>>) with the following content:
Each key corresponds to the name of the corresponding AnnotationSet (null refers to the default, unnamed set).
Each value is a list of all the coreference chains.
Each coreference chain is a list of ids of annotations belonging to the chain.
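For illustration only, here is a minimal Python sketch of walking that structure once it has been copied into a plain dict (the variable name matches_annots and the printing are my own, not part of the GATE API):
# matches_annots mirrors the MatchesAnnots map: annotation-set name -> list of chains,
# where each chain is a list of GATE annotation ids (the numbers from the question).
matches_annots = {None: [[866, 871, 872], [869, 873, 877, 879], [874, 895, 896]]}
for set_name, chains in matches_annots.items():
    label = set_name if set_name is not None else "<default annotation set>"
    for i, chain in enumerate(chains, start=1):
        print(f"{label} - coreference chain {i}: annotation ids {chain}")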
See also
Parse GATE Document to get Co-Reference Text (similar SO question)
GATE Annotations (official documentation)

Related

Parse Days in Status field from Jira Cloud for Google Sheets

I am using the Jira Cloud for Sheets add-on to get the Days in Status field from Jira. It seems to have the following syntax, from this post:
<STATUS_ID>_*:*_<NUMBER_OF_TIMES_ISSUE_WAS_IN_THIS_STATUS>_*:*_<SECONDS>_*|
Here is an example:
10060_*:*_1_*:*_1121033406_*|*_3_*:*_1_*:*_7409_*|*_10000_*:*_1_*:*_270003163_*|*_10088_*:*_1_*:*_2595005_*|*_10087_*:*_1_*:*_1126144_*|*_10001_*:*_1_*:*_0
I am trying to extract, for example, how many times the issue was in the In QA status and the duration spent in a given status. I need to parse this pattern to obtain that information and return it using an ARRAYFORMULA. Days in Status information is provided only when the issue is completed (in Done status); otherwise, no information is provided. If the issue is in Done status but never transitioned through a given status, that status will not appear in the Days in Status string.
I am trying to use REGEXEXTRACT function to match a pattern for example:
=REGEXEXTRACT(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
and it returns an empty value, where I expect 10060. It caught my attention that when I use the REGEXMATCH function it returns TRUE:
=REGEXMATCH(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
so the syntax is not clear to me. Google points to the following documentation as a reference for regular expressions. It seems to be an issue with the vertical bar |; per this documentation it is a special character that should be represented as \v, but this doesn't work either: REGEXMATCH then returns FALSE. I also tried to find an online regex tester that implements the Google Sheets syntax (RE2); I found ReGo, but I don't know whether it is a valid one.
I was trying to use the SPLIT function like this:
=query(SPLIT(C2, "_*:*_"), "SELECT Col1")
but it seems to be a more complicated approach for getting all the values I need from the Days in Status string, although it does separate all the values from the pattern above. In this case, I am getting the first Status ID. The number of columns returned by SPLIT will vary because it depends on the number of statuses the issue transitioned through in order to reach the Done status.
It seems to be a complex task given all the issues I have encountered, but maybe some of you have dealt with this before and can offer some ideas. It requires properly parsing the information and then extracting it into specific columns using the ARRAYFORMULA function, when it applies to a given status from the Status column.
Here is a Google spreadsheet sample with the input information. I would like to populate the columns Times In QA (column C) and Duration in QA (column D; the information is provided in seconds and I would need days, but this is a minor task) for the In QA status, and then the same would apply for the rest of the statuses. I added a Settings tab for mapping the Status ID to my Status; I would need to use a lookup function to match the Status column in the Jira Issues tab. I would like a solution without adding helper columns; maybe it will require some script.
https://docs.google.com/spreadsheets/d/1ys6oiel1aJkQR9nfxWJsmEyd7XiNkVB-omcNL0ohckY/edit?usp=sharing
try:
=INDEX(IFERROR(1/(1/QUERY(1*IFNA(REGEXEXTRACT(C2:C, "10087.{5}(\d+).{5}(\d+)")),
"select Col1,Col2/86400 label Col2/86400''"))))
...so after we do REGEXEXTRACT, some rows (which cannot be extracted from) will output an #N/A error, so we wrap it in IFNA to remove those errors. Then we multiply by 1 to convert everything to numbers (regex always outputs plain text). Then we use QUERY to convert the 2nd column from seconds into days in one go. At this point every row has some value, so to get rid of zeros for the rows we don't need (like rows 2, 3, 5, 8, 9, etc.) and keep the output numeric, we use the IFERROR(1/(1/ wrapping. And finally, we use INDEX or ARRAYFORMULA to process our array.
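For reference, here is how the raw string decomposes outside of Sheets; this is only an illustrative Python sketch using the sample value from the question, not part of the formula above:
# "_*|*_" separates statuses, "_*:*_" separates the fields of one status;
# 86400 is the same seconds-to-days conversion the QUERY applies to Col2.
value = ("10060_*:*_1_*:*_1121033406_*|*_3_*:*_1_*:*_7409_*|*_10000_*:*_1_*:*_270003163"
         "_*|*_10088_*:*_1_*:*_2595005_*|*_10087_*:*_1_*:*_1126144_*|*_10001_*:*_1_*:*_0")
for chunk in value.split("_*|*_"):
    status_id, times, seconds = chunk.split("_*:*_")
    print(status_id, times, int(seconds) / 86400)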

Use a custom classifier in Glue for multi line records

I have some files in the following format
AB1|STUFF|1234|
AB2|SF|STUFF|
AB1|STUFF|45670|
AB2|AF|STUFF|
Each bit of data is delimited by '|' and a record is made up of the data in lines AB1 and AB2.
I would like to use a custom grok classifier in Glue something like the following:
(?<LINE1>AB1)\|%{WORD:ignore1}\|%{NUMBER:id}\|\n%{WORD:LINE2}\|%{WORD:make}\|%{WORD:stuff2}\|
That is, a multi-line grok expression to extract the data from a multi-line record as shown above. I am unsure how the classifiers in Glue work; any comments or advice would be very helpful.
According to the Glue Documentation:
Grok patterns can process only one line at a time. Multiple-line
patterns are not supported. Also, line breaks within a pattern are not
supported.
I am not sure what the actual question is; if you need general guidance on how to create your own classifier, I would advise you to read this and this.
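Since grok only sees one line at a time, one possible workaround (my own sketch, not something proposed by the asker or the answer) is to pre-process the files so that each AB1/AB2 pair becomes a single pipe-delimited line that a one-line grok or CSV classifier can handle; the file names here are hypothetical:
# Merge every AB1 line with the AB2 line that follows it into one record,
# e.g. "AB1|STUFF|1234|" + "AB2|SF|STUFF|" -> "AB1|STUFF|1234|AB2|SF|STUFF|".
with open("input.txt") as src, open("merged.txt", "w") as dst:
    lines = [line.rstrip("\n") for line in src if line.strip()]
    for ab1, ab2 in zip(lines[0::2], lines[1::2]):
        dst.write(ab1 + ab2 + "\n")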

Clear approach for assigning semantic tags to each sentence (or short documents) in python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that has product description. The values in this column can be very messy and would have a lot of other words that are not related to the product. I want to know which rows are about the same product, so I would need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative, however, could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e., words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick (or erase) the words that you deem important (or unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
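A minimal sketch of that idea in Python, assuming the descriptions are available as a list of strings (the sample data and the exact smoothing formula are my own choices, not something prescribed above):
import math
from collections import Counter

descriptions = [
    "500 units shoe green sport tennis import oversea plastic",
    "shoe sport running blue import 200 units",
    "plastic chair green oversea import",
]  # hypothetical data

# Document frequency: in how many descriptions does each token appear at least once?
df = Counter()
for text in descriptions:
    df.update(set(text.lower().split()))

# Smoothed idf; tokens with low idf appear in many descriptions and are the ones
# to inspect and handpick/erase by hand, as suggested above.
n_docs = len(descriptions)
idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}
for tok in sorted(idf, key=idf.get):
    print(tok, round(idf[tok], 3))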

How to store graph in files

I want to store the following information in a file.
My program consists of a set of strings that are connected, forming a graph.
I call each single string a "Tag".
Let's say we have 3 main tags: $Mohammed, $car, $color.
Each of the main tags contains sub-tags, and each sub-tag has a value, another sub-tag, or a set of sub-tags.
$Mohammed:
    $Age: "18"
    $color: $red
    $kind_of: $human
    $car:
        $type: $toyota
        $color: $blue
        $doors:
            $number: "3"
$car:
    $made_of: $metal
    $used_for: $transporting
    $types: {$mercedes, $toyota, $nissan}
    $best_color: $red
$color:
    $usedto: $coloring_things
    $example: {$red, $green, $blue, ...}
But this is not the only thing: there is a connection between tags of the same name, so that $Mohammed->$car->$color must be connected with the main tag $color, and $Mohammed->$color:$red, $car->$best_color:$red, $color->$best_color:$red and the main tag $red must all be connected to each other.
By connected I mean the tags are stored in a way that lets me recall the connected tags at once, just like computer memory: when something is fetched from memory, the information before and after the requested information comes along with it.
When I first looked at my situation, I thought that XML would solve it, but then I realized that XML can't represent a graph.
I don't want to use a database for this; I want to keep databases as my last resort.
Any idea or suggestion about how I can store, connect, and recall this information from my program?
Thanks in advance.
You actually could use XML, but I would recommend JSON or YAML.
Your example format is already very close to YAML.
Take a look at Boost's property_tree.
It provides a nice C++ way to represent your structure, and lets you very easily decide what kind of file representation you want, be that XML, JSON, or INFO.
Also, I don't see why your graph can't be represented in XML, as it supports named nodes.
Note that although property_tree also supports the INI format, that format can't represent your tree, which is more than two levels deep.
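To make the JSON suggestion concrete, here is a minimal Python sketch (my own illustration, not part of the answer) that stores every tag once in a flat dictionary keyed by its name, so each occurrence of, say, $red or $color refers to the same entry and the "same name, same node" connection comes for free:
import json

# Values refer to other tags purely by name, which is what links $Mohammed->$color,
# $car->$best_color and the main tags $color and $red together.
tags = {
    "$Mohammed": {"$Age": "18", "$color": "$red", "$kind_of": "$human",
                  "$car": {"$type": "$toyota", "$color": "$blue",
                           "$doors": {"$number": "3"}}},
    "$car": {"$made_of": "$metal", "$used_for": "$transporting",
             "$types": ["$mercedes", "$toyota", "$nissan"], "$best_color": "$red"},
    "$color": {"$usedto": "$coloring_things", "$example": ["$red", "$green", "$blue"]},
}

with open("tags.json", "w") as f:
    json.dump(tags, f, indent=2)      # store

with open("tags.json") as f:
    loaded = json.load(f)             # recall
print(loaded["$Mohammed"]["$car"]["$color"])  # "$blue"; look that name up in loaded to follow the link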

Reading a file of keyword sets for documents used to search in a library computer system

Description of the project scenario: To search for articles in a library computer system, one uses keywords in combination with Boolean operators such as 'and' (&) and 'or' (|). For example, if you are going to search for articles and books dealing with uses of nanotechnology and bridge construction, the query would be nanotechnology & bridge construction. In order to retrieve the books and articles properly, every document is represented using a set of keywords that represent the content of the document.
Assume that each document (books, articles, etc.) is represented by a document number which is unique. You will be provided with a set of documents represented by their numbers and keywords that are contained in that document as given below.
887 5
nanotechnology
bridge construction
carbon fiber
digital signal processing
wireless
The number 887 above corresponds to the document number and 5 is the number of keywords that are given for the document. Each keyword will be on a separate line. The input for your project will contain a set of document numbers and keywords for each document. The first line of the input will contain an integer that corresponds to the number of document records to process.
An Inverted List data structure is one where, for each keyword, we store the set of document numbers that contain that keyword. For example, for the keywords bridge construction and carbon fiber we will have the following:
bridge construction 887, 117, 665, 900
carbon fiber 887, 1098, 654, 665, 117
The documents numbered 887, 1098, 654, 665, and 117 all contain the keyword carbon fiber, and the keyword bridge construction is found in documents numbered 887, 117, 665, and 900.
There are two main aspects to this project:
1. read a file (using standard input) that contains the document information and build the inverted list data structure, and
2. apply Boolean queries to the inverted list data structure.
The Boolean queries are processed as illustrated in the following example. To obtain the documents containing the keywords bridge construction & carbon fiber we perform a set intersection operation and get the documents 887, 117, and 665. The Boolean query bridge construction | carbon fiber will result in a set union operation and the documents for this query are 887, 1098, 654, 665, and 900.
OK SO MY QUESTION IS:
How do I read the document, given that my first class is a setClass that stores a set of document numbers?
My problem is that the documents are all in one text file, for example:
25 //first document number
329 7 //second document number
ARAMA
ROUTING ALGORITHM
AD-HOC
CSMA
MAC LAYER
JARA
MANET
107 4 //third document number
ANALYSIS
CROSS-LAYER
GEOGRAPHIC FORWARDING
WIRELESS SENSOR NETWORKS
So how can I read the document numbers, since each document has a different number of keywords, one right after another?
Is the "25" on the first line actually the number of documents in the file? I'll go with that assumption (if not, just read documents until you hit EOF)
Here is some pseudo-code for reading the file:
int numDocs = readLine // assuming the first number is the number of documents
for (int i = 0; i < numDocs; ++i)
{
    string line = readLine
    int docNumber = getFirstNumber(line)
    int numKeywords = getSecondNumber(line)
    for (int j = 0; j < numKeywords; ++j)
    {
        string keyword = readLine
        associate keyword with docNumber // however this works
    }
}
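For a concrete version of that idea, here is a minimal Python sketch (my own illustration; the pseudo-code above is language-agnostic) that builds the inverted list as a dict of sets from standard input and then answers the two example Boolean queries with set intersection and union:
import sys

def read_inverted_index(stream):
    # First line: number of documents; then, per document, a "<doc_number> <keyword_count>"
    # line followed by that many keyword lines.
    lines = iter(stream.read().splitlines())
    index = {}                                   # keyword -> set of document numbers
    num_docs = int(next(lines))
    for _ in range(num_docs):
        doc_number, num_keywords = next(lines).split()
        for _ in range(int(num_keywords)):
            keyword = next(lines).strip()
            index.setdefault(keyword, set()).add(int(doc_number))
    return index

index = read_inverted_index(sys.stdin)
# "&" is a set intersection, "|" is a set union, as in the project description.
print(index.get("bridge construction", set()) & index.get("carbon fiber", set()))
print(index.get("bridge construction", set()) | index.get("carbon fiber", set()))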