AIML - topic - unexpected answer does not match with STAR (*) - aiml

I'm using the AB.jar Google reference (ALICE) bot with this short script:
<category><pattern>TOPIC 1</pattern>
<template>Topic 2 with current topic '<get name="topic"/>'.<think><set name="topic">topic2</set></think></template>
</category>
<topic name="TOPIC2">
<category><pattern>YES</pattern>
<template>Going to topic3-yes <think><set name="topic">topic3-yes</set></think></template>
</category></topic>
<topic name="TOPIC2">
<category><pattern>*</pattern>
<template>Going to topic3-rest on '<star/>' <think><set name="topic">topic3-rest</set></think></template>
</category></topic>
... answering anything other than 'yes' does not navigate to the topic-3 '*' pattern. Why is that?
This is the conversation. I marked the unexpected answer with '// here'.
Human : topic 1
Robot : Topic 2 with current topic 'unknown'.
Human : any
Robot : any is a name. // here -- expected to go to topic-3-rest

Using the '_' pattern instead of the '*' pattern inside the topic solves the problem.
Thanks to Ubercoder:
The <that> element takes priority over other patterns at the same pattern level. I don't know if you're using AIML v1 or v2, but broadly speaking there are 3 levels of patterns:
Most important level = patterns including underscore wildcards (_)
Middle level = atomic patterns without any wildcards
Lowest level = patterns including star wildcards (*)


Regex NLTK chunking - Can't get my regex rule to identify certain pos tags

Hi, I am attempting to identify very specific sentence structures, but the rule I am writing in regex seems to skip occasional parts of my test samples. Here is an example:
chunkRule= r"""Action: {<PRP|PRP$|NNP|NN>+<VB|VBD|VBG|VBN|VBZ|RB|JJ|NNP|NN>+<VBG|RP|RB|NNP|NN|PRP$>*}"""
Input text: My wife goes out
POS Tag: [('My', 'PRP$'), ('wife', 'NN'), ('goes', 'VBZ'), ('out', 'RP')]
Return Value: (Action wife/NN goes/VBZ out/RP)
As you can see, it's skipping the "My"/PRP$ POS tag. Does anyone have any ideas how to adjust the rule so that this tag is detected?
Thanks for your help in advance!
If I understand correctly, you're trying to define a rule for 3-word sequences. If so, your rule doesn't match a structure like PRP$ -> NN -> VBZ. I don't know what you're expecting as output. If it is "Action my/PRP$ wife/NN goes/VBZ", add VBZ to your third word:
chunkRule= r"""Action: {<PRP|PRP$|NNP|NN>+<VB|VBD|VBG|VBN|VBZ|RB|JJ|NNP|NN>+<VBG|RP|RB|NNP|NN|PRP$|VBZ>*}"""
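One more thing that may be worth checking: NLTK tag patterns are regex-based, so an unescaped $ in PRP$ behaves as an end-of-string anchor rather than a literal dollar sign, and the NLTK book writes such tags with a backslash (for example PP\$). The sketch below does not use NLTK itself; it uses only Python's re on a hand-encoded tag string, which is a simplified stand-in for what the chunker matches against, so treat the encoding as an assumption.

```python
import re

# Hand-encoded tag sequence for "My wife goes out" (assumed encoding,
# mimicking a tag string such as "<PRP$><NN><VBZ><RP>").
tags = "<PRP$><NN><VBZ><RP>"

# First element of the rule, with and without escaping the dollar sign.
unescaped = re.compile(r"(?:<(?:PRP|PRP$|NNP|NN)>)+")
escaped = re.compile(r"(?:<(?:PRP|PRP\$|NNP|NN)>)+")

# Unescaped: "$" is an anchor, so "<PRP$>" can never match and is skipped.
print(unescaped.search(tags).group())  # <NN>

# Escaped: "\$" is a literal dollar sign, so "My"/PRP$ is included.
print(escaped.search(tags).group())    # <PRP$><NN>
```

If the same holds in the chunker, writing PRP\$ everywhere in chunkRule should keep "My" inside the Action chunk.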

AIML - context - why does context not have highest priority in all cases?

When using an AIML context (via <that>), I get some conversations I cannot explain.
I expected a <that> context to have priority over anything else.
Below I first show the script, then a few conversations. I marked the unexpected parts with a // comment after the response.
I added this AIML file to the standard ALICE files.
The script:
<category><pattern>STEP 1</pattern>
<template>Step 2</template>
</category>
<category><pattern>YES</pattern><that>STEP 2</that>
<template>step 3</template>
</category>
<category><pattern>NO</pattern><that>STEP 2</that>
<template>step 3</template>
</category>
<category><pattern>*</pattern><that>STEP 2</that>
<template>step 3</template>
</category>
<category><pattern>*</pattern><that>STEP 3</that>
<template>Step 4! and you typed '<star/>'</template>
</category>
In the following conversation I marked the unexpected responses with // ?
Human : step 1
Robot : Step 2
Human : yes
Robot : step 3
Human : yes
Robot : Step 4! and you typed 'yes'
Human : step 1
Robot : Step 2
Human : no
Robot : step 3
Human : no
Robot : So. // ? I expected here step 4
Human : step 1
Robot : Step 2
Human : any
Robot : any is a name. // ? I expected here step 3
Can you explain both unexpected flows of conversation?
The <that> element takes priority over other patterns at the same pattern level. I don't know if you're using AIML v1 or v2, but broadly speaking there are 3 levels of patterns [but see note below]
Most important level = patterns including underscore wildcards (_)
Middle level = atomic patterns without any wildcards
Lowest level = patterns including star wildcards (*)
Your unexpected responses occur because there is an ALICE category at a higher priority level. For example, when the robot replies "step 3" and the human says "no", you want the <pattern>*</pattern><that>STEP 3</that> category to take effect. But if there is an ALICE category at a higher level (e.g. <pattern>NO</pattern> or <pattern>STEP _</pattern>), it will take effect over your level 3 category <pattern>*</pattern><that>STEP 3</that>. The quickest way of finding the ALICE category is just to ask "NO" and see what the bot replies. You could also search the ALICE files, but this would be very time-consuming.
[note] In AIML v2 there are at least two extra levels: level 0 above underscore wildcards, and level 2.5 using pattern side sets. However the simpler levels of AIML v1 explain your anomalies.
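The three levels can be sketched in a few lines of Python. This is a rough approximation for illustration, not the AB.jar matching algorithm: it ranks each word of the pattern and then of the <that> (underscore first, exact word second, star last) and picks the lowest-ranked category that matches. The ("NO", "*", "So.") entry is a hypothetical stand-in for whichever stock ALICE category produced "So." in the conversation above.

```python
import re

WILDCARD_RANK = {"_": 0, "*": 2}  # exact words get rank 1

def to_regex(pattern):
    # A wildcard matches one or more words; other tokens match literally.
    parts = [r"\S+(?: \S+)*" if tok in WILDCARD_RANK else re.escape(tok)
             for tok in pattern.split()]
    return re.compile(" ".join(parts))

def normalize(text):
    # Upper-case and drop punctuation, roughly as AIML normalizes input.
    return " ".join(re.findall(r"[A-Z0-9]+", text.upper()))

def priority(category):
    pattern, that, _template = category
    # Input pattern words are compared first, <that> words afterwards.
    return [WILDCARD_RANK.get(tok, 1) for tok in pattern.split() + that.split()]

def respond(categories, user_input, last_reply):
    text, that = normalize(user_input), normalize(last_reply)
    candidates = [c for c in categories
                  if to_regex(c[0]).fullmatch(text) and to_regex(c[1]).fullmatch(that)]
    return min(candidates, key=priority)[2] if candidates else None

# Hypothetical stock categories: (pattern, that, template).
stock = [("NO", "*", "So."), ("*", "*", "fallback")]
with_star = stock + [("*", "STEP 3", "Step 4!")]
with_underscore = stock + [("_", "STEP 3", "Step 4!")]

print(respond(with_star, "no", "step 3"))        # So.
print(respond(with_underscore, "no", "step 3"))  # Step 4!
```

With the star pattern the atomic NO category wins even though it has no specific <that>; switching to the underscore pattern lifts the custom category above it.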

Webscraping (potentially) ill-formatted HTML in R with xpath or regex

I'm trying to extract the abstract from this link. However, I'm unable to extract only the content of the abstract. Here's what I accomplished so far:
url <- "http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1981-38212013000100001&lng=en&nrm=iso&tlng=en"
textList <- readLines(url)
text <- textList[grep("Abstract[^\\:]", textList)] # get the correct element
text1 <- gsub("\\b(.*?)\\bISSN", "" , text)
Up to this point I have almost what I want, but I couldn't get rid of the rest of the string that isn't of interest to me.
I also tried another approach, with XPath, but without success. I tried something like the code below, to no effect whatsoever.
library(XML)
arg.xpath <- "//p/@xmlns"
doc <- htmlParse(url) # parse the URL
linksAux <- xpathSApply(doc, arg.xpath)
free(doc)
How can I accomplish what I want, either with regex or XPath, or maybe both?
P.S.: My general aim is to scrape several similar pages like the one I provided. I can already extract the links; I only need to get the abstract now.
I would strongly recommend the XML approach because regular expressions with HTML can be quite a headache. I think your xpath expression was just a bit off. Try
doc <- htmlParse(url)
xpathSApply(doc, "//p[@xmlns]", xmlValue)
This returns (clipped for length)
[1] "HOLLANDA, Cristina Buarque de. Human rights ..."
[2] "This article is dedicated to recounting the main ..."
[3] "Keywords\n\t\t:\n\t\tHuman rights; transitional ..."
[4] ""
Someone more experienced could probably give you a better answer, but this more or less works:
reg=regexpr("<p xmlns=\"\">(.*?)</p>",text1)
begin=reg[[1]]+12
end=attr(reg,which = "match.length")+begin-17
substr(text1,begin,end)
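The same idea can be written with a capture group instead of offset arithmetic. A minimal sketch in Python, assuming the page wraps each paragraph in <p xmlns=""> exactly as in the regex above; the html string here is a made-up fragment, not the real page.

```python
import re

# Made-up fragment standing in for the relevant part of the downloaded page.
html = (
    '<b>Abstract</b>'
    '<p xmlns="">This article is dedicated to recounting ...</p>'
    '<p xmlns="">Keywords</p>'
)

# The capture group keeps only the text between the opening and closing tags,
# so no manual begin/end offsets are needed.
paragraph = re.compile(r'<p xmlns="">(.*?)</p>', re.DOTALL)
paragraphs = paragraph.findall(html)

print(paragraphs[0])  # This article is dedicated to recounting ...
```

This still breaks as soon as the markup changes, which is why the parser-based approach is the safer choice for many pages.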
Here is another approach, which is clunky as written, but shows the technique of keeping the right parts after splitting at tag tokens:
text2 <- sapply(strsplit(x = text1, ">"), "[", 3)
text2
[1] "This article is dedicated to recounting the main initiative of Nelson Mandela's government to manage the social resentment inherited from the segregationist regime. I conducted interviews with South African intellectuals committed to the theme of transitional justice and with key personalities who played a critical role in this process. The Truth and Reconciliation Commission is presented as the primary institutional mechanism envisioned for the delicate exercise of redefining social relations inherited from the apartheid regime in South Africa. Its founders declared grandiose political intentions to the detriment of localized more palpable objectives. Thus, there was a marked disparity between the ambitious mandate and the political discourse about the commission, and its actual achievements.</p"
text3 <- sapply(strsplit(text2, "<"), "[", 1)

Find all paragraphs of text that are related to a topic

Given a set of words ["college", "sports", "coding"], and a set of paragraphs of text (i.e. facebook posts), how can I see for each word the paragraphs that are related to that topic?
So for college, how can I find all the paragraphs of text that may be about the topic college?
I'm new to natural language processing and not very advanced at regex. Clues about how to get started, the right terms to google, etc. are appreciated.
One basic idea would be to iterate over your posts and see whether any post matches any of the topics.
Let's say we have the following posts:
Post 1:
Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.
Post 2:
Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.
Post 3:
Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.
and the following topics:
["college", "sports", "coding"]
The regex could be: (topicName)+
E.g.: (college)+ or (sports)+ or (coding)+
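A small JavaScript sketch:
// Assumes `topics` and `posts` are arrays of strings.
for (const topicName of topics) {
  for (const post of posts) {
    // The 'i' flag makes the match case-insensitive ("Sports" vs. "sports").
    const customRegex = new RegExp('(' + topicName + ')+', 'i');
    if (customRegex.test(post)) {
      // post matches topicName
    } else {
      // post doesn't match topicName
    }
  }
}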
Hope it could give you a starting point.
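For comparison, here is roughly the same loop in Python, collecting the matching posts per topic. It is only a sketch: the function name posts_by_topic and the shortened sample posts are made up for illustration, and the word boundaries are an addition so that a topic does not match inside a longer word.

```python
import re

def posts_by_topic(topics, posts):
    """Map each topic to the posts that mention it as a whole word."""
    result = {}
    for topic in topics:
        # re.escape guards against topics containing regex metacharacters.
        pattern = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)
        result[topic] = [post for post in posts if pattern.search(post)]
    return result

# Shortened versions of the sample posts above.
posts = [
    "Dadadad adada college fgdssfgoksh sports hfjksh.",
    "Sports dadadad adada fgdssfgoksh.",
    "Coding adskjdsf lsdjk coding fhksajhdf.",
]
matches = posts_by_topic(["college", "sports", "coding"], posts)

print(len(matches["sports"]))  # 2
```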
Exact string matching won't take you far, especially with small fragments of text. I suggest using semantic similarity for this. A simple web search will turn up several implementations.

How to do ANDing of conditions in a regular expression?

I want to match and modify part of a string if following conditions are true:
I want to capture information regarding a project, such as project duration, client, technologies used, etc.
So I want to select a string starting with the word "project"; the string may also start with other words, like "details of project", "project details", or "project #1".
The regex should first look for the word "project" and select the string only when some or all of the following words are found after it.
1) client
2) duration
3) environment
4) technologies
5) role
I want to select a string if it matches at least 2 of the above words. Words can appear in any order and if the string contains ANY two or three of these words, then the string should get selected.
I have sample text given below.
Details of Projects :
*Project #1: CVC – Customer Value Creation (Sep 2007 – till now) Time
Warner Cable is the world's leading
media and entertainment company, Time
Warner Cable (TWC) makes coaxial
quiver.
Client : Time Warner Cable,US. ETL
Tool : Informatica 7.1.4
Database : Oracle 9i.
Role : ETL Developer/Team Lead.
O/S : UNIX.
Responsibilities: Created Test Plan and Test Case Book. Peer reviewed team members > Mappings. Documented Mappings. Leading the Development Team. Sending Reports to onsite. Bug >fixing for Defects, Data and Performance related.
Details of Project #2: MYER – Sales
Analysis system (Nov 2005 – till now)
Coles Myer is one of Australia's largest retailers with more than 2,000 > stores throughout Australia,
Client : Coles Myer
Retail, Australia. ETL Tool :
Informatica 7.1.3 Database : Oracle
8i. Role : ETL Developer. O/S :
UNIX. Responsibilities: Extraction,
Transformation and Loading of the data
using Informatica. Understanding the
entire source system.
Created and Run Sessions and
Workflows. Created Sort files using
Syncsort Application.*
Does anyone know how to achieve this using regular expressions?
Any clues or regular expressions are welcome!
Many thanks!
(client|duration|environment|technologies|role).+(?!\1)(client|duration|environment|technologies|role)
I would break it down into a few simpler regexes to get these results. The first would select only the chunk of text between projects: (?=Project #).*(?<=Project #)
On the match that this produces, I would run a separate regex to ask whether it contains any of those words: client|duration|environment|technologies|role
If this comes back with at least 2 distinct matches, you know to select the original string!
Edit:
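// Requires: using System.Linq; using System.Text.RegularExpressions;
string originalText; // assumed to be assigned the full text before use
MatchCollection projectDescriptions = Regex.Matches(originalText, "(?=Project #).(?:(?!Project #).)*", RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match projectDescription in projectDescriptions)
{
    // No spaces around the alternation bars: a space in the pattern would have to match literally.
    MatchCollection keyWordMatches = Regex.Matches(projectDescription.Value, "client|duration|environment|technologies|role", RegexOptions.IgnoreCase);
    // Count distinct keywords, ignoring case.
    if (keyWordMatches.Cast<Match>().Select(m => m.Value.ToLowerInvariant()).Distinct().Count() >= 2)
    {
        // At this point, do whatever you need to with the original projectDescription match; the Match object will give you the index etc. of the match inside the original string.
    }
}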
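The two-step idea (split the text into project chunks, then count distinct keywords per chunk) could look like this in Python. It is a sketch under the same assumption as above, namely that every project starts with the literal text "Project #"; the function name select_projects and the sample string are made up.

```python
import re

KEYWORDS = re.compile(r"\b(client|duration|environment|technologies|role)\b",
                      re.IGNORECASE)
# One chunk runs from "Project #" up to the next "Project #" or the end.
PROJECT = re.compile(r"Project #.*?(?=Project #|\Z)", re.IGNORECASE | re.DOTALL)

def select_projects(text, minimum=2):
    """Return the project chunks that mention at least `minimum` distinct keywords."""
    selected = []
    for match in PROJECT.finditer(text):
        chunk = match.group()
        # A set removes repeats, so "Client ... Client" counts only once.
        found = {kw.lower() for kw in KEYWORDS.findall(chunk)}
        if len(found) >= minimum:
            selected.append(chunk)
    return selected

sample = ("Project #1: Client : A. Role : Dev. "
          "Details of Project #2: Client : B. Client again.")
chosen = select_projects(sample)

print(len(chosen))  # 1
```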
Maybe you need to break the requirement into two steps: first, extract the key/value pairs from your string, then apply your filter.
string input = @"Project #...";
Regex projects = new Regex(@"(?<key>\S+).:.(?<value>.*?\.)");
foreach (Match project in projects.Matches(input))
{
Console.WriteLine ("{0} : {1}",
project.Groups["key" ].Value,
project.Groups["value"].Value);
}
Try
^(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*$
One note: This will also match if only one of the terms appears twice.
In C#:
foundMatch = Regex.IsMatch(subjectString, @"\A(?:(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*)\Z", RegexOptions.Singleline | RegexOptions.IgnoreCase);
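If the same term appearing twice should not count, one option is a negative lookahead with a backreference, so that the second keyword must differ from the first. A sketch in Python rather than C#; the function name has_two_distinct_keywords is made up, and the pattern only checks for a match instead of capturing the whole string.

```python
import re

KEYWORD = r"(client|duration|environment|technologies|role)"

# Group 1 is the first keyword; (?!\1\b) rejects the same word as the second one.
TWO_DISTINCT = re.compile(
    r"\A(?:details of )?project.*?\b" + KEYWORD + r"\b.*?\b(?!\1\b)" + KEYWORD + r"\b",
    re.IGNORECASE | re.DOTALL,
)

def has_two_distinct_keywords(text):
    """True if the text starts with 'project' and names two different keywords."""
    return TWO_DISTINCT.search(text) is not None

print(has_two_distinct_keywords("Project #1: Client : X. Role : Dev."))   # True
print(has_two_distinct_keywords("Project #1: Client : X. Client : Y."))   # False
```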