Finding a good way to split an English document into sentences in C# - regex

Is there a good way to split an English document into sentences? English text frequently includes abbreviations such as Mr., Mrs., and U.S.A., which makes sentence boundaries difficult to separate out. Do we need a special natural language library to accomplish this? I suspect that we do.
Thank you.

Technically, you need a complete understanding of English to do the job.
As a decent "almost" solution, you could use a dictionary of "things that end in period" and split on periods which do not immediately follow one of those tokens.
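A minimal sketch of that dictionary idea, written in Python for brevity (the same logic ports directly to C#). The abbreviation set and the function name split_sentences are illustrative assumptions, not part of any library, and a real list would be far longer:

```python
# Hypothetical, deliberately small abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.a.", "etc."}

def split_sentences(text):
    """Split after tokens ending in '.', '?' or '!', unless the token
    is a known abbreviation from the dictionary above."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".?!" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith lives in the U.S.A. today. He is happy."))
```

This is only the "almost" solution: an abbreviation that really does end a sentence (for example "... in the U.S.A.") will not produce a split.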

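If every sentence begins with a capital letter and ends with a period, then I would define a sentence as exactly that, with two extra conditions: it contains more than one word, and it does not end with an entry from a list of common abbreviations (or with something matching the regex [a-zA-Z].+).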

You can use the sentence detector provided by one of the numerous NLP tools, such as OpenNLP or Stanford CoreNLP. They can handle cases like Mr., Mrs., U.S.A., etc.
Both OpenNLP and Stanford CoreNLP are written in Java.
SharpNLP is a C# port of OpenNLP.

Related

Is it possible using NLP? Natural Language processing

I have a set of Project Names, a set of keywords and a set of paragraphs.
Now my task is to check whether the keywords match any project names and whether the keywords match any word in any paragraph.
If any paragraphs match a keyword and a project matches the same keyword, then I have to assign those paragraphs to that project.
I have been using string regexes for this, but can it be implemented using Natural Language Processing concepts?
If yes, please let me know how it can be implemented. It would be very helpful for me.
Thanks in advance.
There's no NLP involved in this as such. No matter what you do, you have to go through all the projects and all the paragraphs at least once. Yes, you can optimize the process by using hashmaps or dictionaries, but at the end of the day you will be searching and matching strings no matter what.
You can do it using dictionaries, since they make the mapping easy, and regex will still be in play for the matching.
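A rough sketch of the dictionary approach in Python. The function name assign_paragraphs and the matching rules (case-insensitive substring match against project names, whole-word match against paragraphs) are assumptions chosen for illustration; the question does not pin them down:

```python
import re
from collections import defaultdict

def assign_paragraphs(projects, keywords, paragraphs):
    """Map each project name to the paragraphs that share a keyword with it."""
    # Tokenize every paragraph once so each keyword lookup is a set membership test.
    paragraph_words = [(p, set(re.findall(r"\w+", p.lower()))) for p in paragraphs]
    assignments = defaultdict(list)
    for keyword in keywords:
        needle = keyword.lower()
        matching_projects = [p for p in projects if needle in p.lower()]
        if not matching_projects:
            continue  # keyword matches no project, so nothing to assign
        for paragraph, words in paragraph_words:
            if needle in words:
                for project in matching_projects:
                    if paragraph not in assignments[project]:
                        assignments[project].append(paragraph)
    return dict(assignments)

projects = ["Apollo Migration", "Billing Revamp"]
keywords = ["apollo", "billing", "unused"]
paragraphs = ["The Apollo rollout starts Monday.",
              "Billing errors were fixed.",
              "Nothing relevant here."]
print(assign_paragraphs(projects, keywords, paragraphs))
```

Every project and paragraph is still visited, as the answer says; the dictionary only keeps the bookkeeping simple.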

What is the proper way to check if a string contains a set of words in regex?

I have a string, let's say, jkdfkskjak some random string containing a desired word
I want to check if the given string has a word from a set of words, say {word1, word2, word3} in latex.
I can easily do it in Java, but I want to achieve it using regex. I am very new to regular expressions.
If you want to recognise the words even when they are part of a larger word, use:
(word1|word2|...|wordn)
(see first demo)
If you want them to appear as isolated words, then
\b(word1|word2|...|wordn)\b
should be the answer (see second demo)
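Both patterns can also be generated from the word set. The sketch below uses Python's re module, whose syntax for alternation and \b matches what is shown above; build_word_pattern is a hypothetical helper name, and re.escape is used so that words containing regex metacharacters are treated literally:

```python
import re

def build_word_pattern(words, whole_words=True):
    """Compile an alternation of the given words, optionally bounded by \\b."""
    alternation = "|".join(re.escape(w) for w in words)
    pattern = rf"\b(?:{alternation})\b" if whole_words else rf"(?:{alternation})"
    return re.compile(pattern)

words = ["word1", "word2", "word3"]
text = "jkdfkskjak some random string containing word2"

print(bool(build_word_pattern(words).search(text)))
print(bool(build_word_pattern(words).search("myword1x")))
print(bool(build_word_pattern(words, whole_words=False).search("myword1x")))
```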
I don't fully understand the context, such as what kind of text you have or what kind of words these will be, but I can offer an easy solution. You can generate the regex programmatically from the literal words, for example (dormammu|bargain), and then search for it in text such as "dormammu I come to bargain". I have no clue about LaTeX, but I don't think that is your question.
For more information you can tinker with it at [regex101][1].
If you are having trouble understanding it, [regexone][2] is the place to go. It's a good start for beginners.
[1]: http://regex101.com
[2]: https://regexone.com/

Regular Expression to match sentences

I'm trying to make a regular expression in Python that matches sentences. The one I've seen that mostly works is: [^\.\?\!].*?[\.\?\!], but with the test sentences below it has a few errors. You can see them using the site https://regex101.com/. I'm looking for a regular expression that handles all the problems below, such as ellipses, honorifics, and the i.e. thing.
For performing tokenization in languages other than English, we can
load the respective language pickle file found in tokenizers/punkt and
then tokenize the text in another language, which is an argument of
the tokenize() function. For the tokenization of French text, we will
use the french.pickle file as follows: Mr. Smith bought cheapsite.com
for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam
Jones Jr. thinks he didn't. In any case, this isn't true... Well, with
a probability of .9 it isn't.
P.S. If you're wondering, I got the above sentences from a natural language processing book and another Stack Overflow question on the same subject.
The easiest way is to split it into 3 operations.
Substitute i.e., the ellipsis, and whatever else you want with markers that contain no dots, like ###ie### and ###ellipsis###.
Match sentences.
After that, restore i.e. and the ellipsis.
Update: Some code showing how to do it. You have to do the substitution for each dotted item you want to exclude from the sentence matcher.
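import re
sentences = re.sub(r'i\.e\.', "###ie###", sentences)  # hide the dots in "i.e."
matches = re.findall(r'[^.?!].*?[.?!]', sentences)  # findall returns each sentence as a string
matches = [m.replace("###ie###", "i.e.") for m in matches]  # restore "i.e." in every sentence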
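The three operations can be put together as a self-contained sketch. The PROTECTED table and the function name match_sentences are assumptions made for illustration; extend the table with whichever dotted tokens occur in your text. Note that protecting the ellipsis this way means it no longer ends a sentence, which may or may not be what you want:

```python
import re

# Hypothetical protected items and their dot-free markers (assumptions for this sketch).
PROTECTED = {"i.e.": "###ie###", "...": "###ellipsis###", "Mr.": "###mr###"}

def match_sentences(text):
    # 1. Substitute every protected item with its marker.
    for original, marker in PROTECTED.items():
        text = text.replace(original, marker)
    # 2. Match sentences: a non-space start, then anything up to the next terminator.
    raw = re.findall(r"[^.?!\s][^.?!]*[.?!]", text)
    # 3. Restore the protected items in each matched sentence.
    restored = []
    for sentence in raw:
        for original, marker in PROTECTED.items():
            sentence = sentence.replace(marker, original)
        restored.append(sentence)
    return restored

print(match_sentences("Mr. Smith paid a lot, i.e. too much. Did he mind?"))
```

Decimal numbers such as 1.5 and domain names such as cheapsite.com would still need their own entries or a pattern-based substitution.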

Best practices for localization

Hi there,
I am localizing my application. I have some regular expressions (for English) for client-side validation. If I want to localize for non-English languages, what is the best approach?
Should I have a regex for each language chosen for localization?
Comments/suggestions appreciated.
Thanks
DEE
Having separate regular expressions for different language inputs stored in external resource files would be the best route. If you are doing .NET development, you can use resource files. If you are doing Java, you can use property files.
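The idea is independent of the platform. As a rough illustration in Python, the JSON string below stands in for an external resource file; the language keys, the "name" field, the patterns, and the helper is_valid are all assumptions invented for this sketch:

```python
import json
import re

# Stand-in for an external resource file; keys and patterns are illustrative only.
RESOURCE_JSON = r'''
{
  "en": {"name": "^[A-Za-z' -]+$"},
  "de": {"name": "^[A-Za-zÄÖÜäöüß' -]+$"}
}
'''

PATTERNS = json.loads(RESOURCE_JSON)

def is_valid(field, value, language, fallback="en"):
    """Validate value against the regex registered for this language,
    falling back to the default language when none is registered."""
    table = PATTERNS.get(language, PATTERNS[fallback])
    return re.fullmatch(table[field], value) is not None

print(is_valid("name", "Müller", "de"))
print(is_valid("name", "Müller", "en"))
```

Keeping the patterns in data rather than code means a new language can be added without touching the validation logic.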
Depends on what you're validating with your regexps. I'll assume you're doing some very basic natural language processing (may the giant spaghetti monster help you if so).
Obviously, if some parts of your regexps are English-dependent (like full English words), you will need to write the corresponding regexps in the other language. Translating the English word by word might not be enough, since you might be capturing some grammar in your regexp.
An example of your regexps might help.
Can you post some examples? As Derek mentioned, separate regexes/files for each language is probably a must-have, but hard to tell without looking at what you are doing...

Regex misspellings

I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing and others with letters added.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
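Since the question asks for something not PHP specific, here is a sketch of that approach in Python, which has no built-in Levenshtein function, so the standard dynamic-programming version is written out. The helper name find_buildings and the threshold of 2 are assumptions for illustration:

```python
import re

def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def find_buildings(text, buildings, max_distance=2):
    """Return (word, building) pairs whose distance is within the threshold."""
    found = []
    for word in re.split(r"\W+", text):
        if not word:
            continue
        for building in buildings:
            if levenshtein(word.lower(), building.lower()) <= max_distance:
                found.append((word, building))
    return found

print(find_buildings("Upgrade the Unversity now", ["University", "Barracks"]))
```

A fixed threshold is too generous for short names, so scaling it with the length of the building name is worth considering. Multi-word building names would also need the comparison to run over word groups rather than single words.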
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.