Determine the point, topic, subject of a NL sentence - clojure

How can I best figure out the topic, point, or subject of a natural-language sentence with Clojure or ClojureScript?
Currently, I am using a POS tagger and taking the nouns as the subject, together with any adjective that immediately precedes them.
However, this method doesn't always work. For example, it fails when the subject is not a noun.

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd.
Annoyingly, everything is in lowercase, which makes it more difficult. Since I only care about English, I'm essentially looking for the opposite of valid consonant clusters: groups of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me:
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could pass every word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the lookup, use a hash map :)
You could also use an auto-correction API or library.
Algorithm to combine both methods:
1. Run the sentence through auto-correction
2. Run every word through the dictionary
3. Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
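A minimal Python sketch of the dictionary step, assuming a plain-text word list with one word per line. The function names and the file format are assumptions for illustration, and the auto-correction step is left out because it depends on whichever API or library you pick. A Python set is hash-based, so each lookup is fast.

```python
import string


def load_dictionary(path):
    """Read a word list (one word per line) into a set for fast lookups."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def remove_non_words(sentence, dictionary):
    """Keep only the tokens whose letters form a word found in the dictionary."""
    kept = []
    for token in sentence.split():
        # Ignore surrounding punctuation when looking the word up.
        word = token.strip(string.punctuation).lower()
        if word in dictionary:
            kept.append(token)
    return " ".join(kept)
```

For example, with a dictionary containing "please", "remove" and "now", the input "please remove kuashdixbkjshakd now." comes back as "please remove now.".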
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns. My suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia. It's hard to predict exactly how precise this will be.
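As a rough illustration of the sliding-window counting, here is a small Python sketch. It is a simplification, not the SpamAssassin or libtextcat implementation: it does no tokenization or padding, and the function names are made up for this example.

```python
from collections import Counter


def char_ngrams(text, max_n=5):
    """Count every run of 1..max_n adjacent characters in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def unseen_ngrams(word, model, n=3):
    """Return the n-grams of word that never occurred in the training counts."""
    grams = (word[i:i + n] for i in range(len(word) - n + 1))
    return [g for g in grams if g not in model]


model = char_ngrams("abracadabra")
print(model["a"], model["ab"], model["abrac"])  # 5 2 1
print(unseen_ngrams("abrxc", model))            # ['brx', 'rxc']
```

A word could then be flagged when it contains one or more unseen 3-grams. The threshold is something to tune against your own data.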
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words containing a run of more than 5 consecutive consonants (you probably want "y" to not be considered a consonant, but it's up to you). The && character-class intersection is Java regex syntax:
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
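Python's re module does not support the && intersection, so a Python version has to spell out the consonant ranges explicitly. The following is a sketch of the same idea. The function name is hypothetical, and "y" is treated as a vowel as in the pattern above.

```python
import re

# b-z without the vowels a, e, i, o, u and without y, written as explicit ranges.
CONSONANT_RUN = re.compile(r"\b[a-z]*[b-df-hj-np-tv-xz]{6}[a-z]*\b")


def find_gibberish(text):
    """Return lowercase words containing 6 or more consecutive consonants."""
    return CONSONANT_RUN.findall(text)


print(find_gibberish("the kuashdixbkjshakd witchcraft"))  # ['kuashdixbkjshakd']
```

A few real words, such as "catchphrase", contain six consecutive consonant letters, so this threshold will produce occasional false positives.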

how to match a sentence having particular word in different patterns

We have a problem here...
We have a text containing sentences in various patterns.
We want to extract the sentences that contain a particular word.
For example:
One further point, by way of providing another model. The analysis in
the second paragraph could lead in the following direction. 'The
Destructors' deals with, obviously, destruction, whilst the book of
Genesis deals with creation. The vocabulary is similar: Blackie
notices that 'chaos had advanced', an ironic reversal of God's
imposing of form on a void. Furthermore, the phrase 'streaks of light
came in through the closed shutters where they worked with the
seriousness of creators', used in the context of destruction, also
parodies the creation of light and darkness in the early passages of
the Biblical book. Greene's ironic use of the vocabulary of the Bible
might be making the point that, for him, the Second World War
signalled the end of a particular Christian era. Now, it is perfectly
arguable that the rise of fascism is linked to this, or that it is the
cause. The cult of personality and secular leadership has, for Greene,
taken over from the key role of the church in Western societies. In
this way the two main themes identified above - the tension between
individual and community, and religion - are linked. In terms of essay
writing this link could well be made after the discussion of the theme
of the individual and the community, and its links with the theme of
leadership. This might be the general conclusion to the essay. After
thoughtful consideration and interpretation a student may well decide
that this is what the (destructors.)' boils down to: Greene is making a
clear link between the rise of fascism and the decline of the Church's
influence. Despite the fact that fascism has been recently defeated,
Greene sees the lack of any contemporary values which could provide
social cohesion as providing the potential for its reappearance.
In the above text, the word (destructors) is in bold. We want to extract the sentences that contain the word "destructors".
The word "destructors" can appear in different forms, e.g. (destructors), (DesTrucTors), (Des.tructors), DESTRUCTORS, destructors, des-tructors.
When we tried writing a regex to match the sentences, it failed in some cases (for example, we got only half of a sentence).
Could you please help us with this?
If this information is not enough to solve the problem, please let us know and we will update the question.
Thank you.
I'm not too sure about Python, but I believe this might work:
import re
# subject holds the text to search
for match in re.finditer(r"[^.]*destructors[^.]*\.[^\w\s]*", subject, re.IGNORECASE):
    start = match.start()     # match start
    end = match.end()         # match end (exclusive)
    sentence = match.group()  # matched text
In any case, I think the regex you want is:
[^.]*destructors[^.]*\.[^\w\s]*
with the case-insensitive and global flags set.
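Note that a literal "destructors" in the pattern will not match the forms "Des.tructors" or "des-tructors", and [^.]* stops at any period, including one inside "(destructors.)'". A possible variant, sketched here under the assumption that sentences end with ., ! or ? followed by whitespace, splits first and then tests each sentence. The function name is hypothetical.

```python
import re


def sentences_containing(text, word):
    """Return the sentences of text that contain word, allowing a single
    period or hyphen between its letters and ignoring case."""
    # e.g. "destructors" -> d[.\-]?e[.\-]?s[.\-]?t...
    pattern = re.compile(r"[.\-]?".join(map(re.escape, word)), re.IGNORECASE)
    # Collapse hard line wraps so each sentence sits on one line.
    flat = " ".join(text.split())
    # Assumption: a sentence ends at . ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return [s for s in sentences if pattern.search(s)]
```

Because the split needs whitespace right after the punctuation mark, the period inside "Des.tructors" or "(destructors.)'" does not cut the sentence in half.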
It would help if you could post the regex patterns you have tried so far. The best I can come up with is:
import re
str_text = 'your text here containing DESTRUCTORS'
match = re.search('pass all the destructors combination here', str_text, flags=re.IGNORECASE)
For more of the pattern syntax available in Python's re module, see https://docs.python.org/3/library/re.html

Part of speech tagging Web Service?

I am looking for a POS tagging web service. There are many solutions available (mostly in Java) that can be integrated, but I couldn't find an online service that does the job.
My problem statement is really simple: I want to send a single word and get back what part of speech it is, e.g. Noun, Verb, Adjective etc.
"I want to send a single word and get back what part of speech it is e.g. Noun, Verb, Adjective etc."
This is impossible in English.
A part-of-speech tagger has to take the whole sentence into account to determine the part of speech of each word.
Some English words are homonyms and have to be interpreted in context.
Billy read the book.
read, verb
Billy, please give the book to Read.
Read, noun.
Billy, please give the book to Susie to read.
read, verb.

How to get sentence number from input?

It seems hard to detect sentence boundaries in a text. Punctuation marks like . ! ? may be used to delimit sentences, but this is not very accurate because of ambiguous words and abbreviations such as U.S.A, Prof. or Dr. I am studying the TPerlRegEx library and the Regular Expressions Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects a sentence.
What would be a comparatively accurate expression using TPerlRegEx in Delphi?
Thanks
First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:
He said: "It's OK!"
Is it one sentence or two? A general answer is irrelevant. Decide whether you want to interpret it as one or two sentences, and proceed accordingly.
Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary, perhaps editable by your users, since each language will have its own set of common abbreviations.
Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.
In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about
Most people called him Professor Jones, but to me he was simply The Prof.
Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be
Most people called him Professor Jones, but to me he was simply Prof. Bill.
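The character-scanning approach described above could be sketched as follows, in Python rather than Delphi. The abbreviation set and the function name are assumptions for illustration. In practice the dictionary would be language-specific and editable.

```python
# Assumed sample dictionary of abbreviations that do not end a sentence.
ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs.", "jr."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on . ! ? only when followed by whitespace or end of string,
    and skip periods that end a known abbreviation."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in ".!?":
            continue
        at_end = i + 1 == len(text)
        if not at_end and not text[i + 1].isspace():
            continue  # e.g. the periods inside "U.S.A"
        last_token = text[start:i + 1].split()[-1].lower()
        if ch == "." and last_token in abbreviations and not at_end:
            continue  # "Prof. Bill" is not a sentence boundary
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

As the two Prof. examples show, a scanner like this still cannot tell an abbreviation that ends a sentence from one that precedes a name, unless the abbreviation is the last thing in the input.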
Check my tutorial here http://code.google.com/p/graph-expression/wiki/SentenceSplitting. This concrete example can be easily rewritten to regular expressions and some imperative code.
It would be wise to use an NLP processor with a pre-trained model. EnglishSD.nbin is one such model that is available for OpenNLP, and it can be used in Visual Studio with SharpNLP.
The advantages of this method are numerous. For example, consider the input
Prof. Jessica is a wonderful woman. She is a native of U.S.A. She is married to Mr. Jacob Jr.
If you are using a regex split, for example
string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
Then the above input will be split as
Prof.
Jessica is a wonderful woman.
She is a native of U.S.A.
She is married to Mr.
Jacob Jr.
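The behaviour of such a split is easy to check. Here is a sketch that applies the same pattern with Python's re module (the function name is made up for this example). The split needs whitespace after the punctuation mark, so it breaks after "Prof." and "Mr." even though neither ends a sentence.

```python
import re

# Same pattern as the C# example: whitespace that follows a letter, digit or
# quote plus . ! or ? and that is followed by a capital letter.
NAIVE_SPLIT = re.compile(r"""(?<=['"A-Za-z0-9][.!?])\s+(?=[A-Z])""")


def naive_split(text):
    """Split text into pieces using the naive punctuation rule."""
    return NAIVE_SPLIT.split(text)


sample = ("Prof. Jessica is a wonderful woman. She is a native of U.S.A. "
          "She is married to Mr. Jacob Jr.")
for piece in naive_split(sample):
    print(piece)
```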
However the desired output is
Prof. Jessica is a wonderful woman.
She is a native of U.S.A.
She is married to Mr. Jacob Jr.
This kind of logical sentence split can be achieved using trained models from the OpenNLP project. The method is as simple as this.
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;
private string[] SplitSentences(string paragraph)
{
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }
    return mSentenceDetector.SentenceDetect(paragraph);
}
where mModelPath is the path of the directory containing the nbin file.
The mSentenceDetector is derived from the OpenNLP dll.
You can get the desired output by
string[] sentences = SplitSentences(text);
For more details, see the article I have written on integrating SharpNLP with your application in Visual Studio to make use of the NLP tools.

Regexp to parse out a person's name?

This might be a hard one (if not impossible), but can anyone think of a regular expression that will find a person's name, in say, a resume? I know this won't be 100% accurate, but I can't come up with something.
Let's assume the name only shows up once in the document.
No, you can't use regular expressions for this. The only chance you have is if the document is always in the same format and you can find the name based on the context surrounding it. But this probably isn't the case for you.
If you are asking your applicants to submit their résumé online you could provide a separate field for them to enter their name and any other information you need instead of trying to automatically parse résumés.
Forget it - seriously.
Or expect to get a lot of applications from a Mr C Vitae
In my experience, having written something very similar (but a very long time ago), about 95% of resumes have the person's name as the very first line. You could probably have a pretty loose regex checking for alpha, hyphens, periods, and assume that's the name.
Obviously there's no way to do this 100% accurately, as you said, but this would be close.
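A loose version of that first-line heuristic might look like this in Python. It is only a sketch: the function name is hypothetical, and allowing spaces in addition to letters, hyphens and periods is an assumption needed for multi-word names.

```python
import re

# Letters, hyphens, periods and spaces only, starting with a letter.
NAME_LINE = re.compile(r"^[A-Za-z][A-Za-z.\- ]*$")


def guess_name(resume_text):
    """Return the first non-empty line if it looks like a name, else None."""
    for line in resume_text.splitlines():
        line = line.strip()
        if not line:
            continue
        return line if NAME_LINE.match(line) else None
    return None
```

A heading such as "Curriculum Vitae" on the first line passes this check too, which is exactly the kind of wrong guess to expect from a heuristic this loose.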
Unless you wanted to build an expression that contained every possible name, or-ed together, the expression you are referring to is not "Regular," with a capital R. A good guess might be to go looking for the largest-font words in the document. If they follow a pattern that looks like firstname-lastname, name-initial-name, etc., you could call it a good guess...
That's a really hairy problem to tackle. The regex has to match two words that could be someone's name. The problem with that is that some people, of Hispanic origin, for example, might have a name that's more than 2 words. Also, how would you define two words to match for a name? Would you use a database of common first and last name fields? That might work unless someone has an uncommon name.
I'm reminded of a story a COBOL teacher in college told me about an individual of Asian origin whose name would break every rule the programmers defined for a bank's internal system. His first name was "O.", just the letter O.
The only remotely dependable way to nail down the regex would be if you had something to set off your search with; maybe if a line of text in the resume began with "Name: " then you'd know where to start looking.
tl;dr: People's names and individual resumes are too heavily varied for a regular expression to pick apart.
You could do something like Amazon does for book overviews: SIPs (Statistically Improbable Phrases). This would require some after-the-fact double checking by humans, but you might find the person's name(s) in there.