Words to exclude from a search - list

I am looking for a list of words that I can exclude from a product search, as they will give too many false positives.
This would include things like 'a', 'with', 'and', 'the' and so forth. Does anyone have or know where I could download a list of these types of words? The list would only need to be in English (British if possible).
Thanks in advance

What you are looking for is called a stopword list. Such lists are available for many languages, for example at https://github.com/Alir3z4/stop-words
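As a rough illustration of how such a list is typically applied to a search query, here is a minimal Python sketch. The tiny `STOPWORDS` set and the `strip_stopwords` helper are assumptions made up for this example; a real list, such as the one linked above, would be loaded from a file.

```python
import re

# Illustrative subset only (an assumption for this sketch); a real
# stopword list would be loaded from a downloaded file instead.
STOPWORDS = {"a", "an", "and", "the", "with", "of", "for", "in"}

def strip_stopwords(query):
    """Return the query's words, lowercased, with stopwords removed."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    return [w for w in words if w not in STOPWORDS]

print(strip_stopwords("The red kettle with a lid"))  # ['red', 'kettle', 'lid']
```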

Form Input Pattern & Validation for Full Name

I have a Google form where I am asking for the user to include their "full name" to keep the form short and sweet (without two inputs for first name/last name). You are able to validate answers in Google Forms using regular expressions, but I'm not sure where to start.
I want at minimum two words in the input, each with at least 2 characters, and I don't want to block any special characters (so that people with names like O'Leary can still write it). Basically, I just want to make sure there are two words included in a field, each with at least 2 letters.
I have no experience with regex or the patterns so any help is appreciated!
I built this regular expression to accept full names from many countries:
^([a-zA-Z\-ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïòóôõöùúûüýÿ']{2,75} ?){2,5}$
You can test it on regex101.com. This site also helps you understand this regex with explanations on the top right.
Hope it helps.
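For anyone who wants to try the pattern outside regex101, here is a small Python sketch. `ANSWER_PATTERN` is the expression above, copied unchanged. Because the space between repetitions is optional, a single word of four or more letters (such as "John") also matches, so the sketch includes `STRICT`, a hypothetical variant that is not part of the original answer and requires a space between words.

```python
import re

# Character class taken from the answer above.
LETTERS = r"[a-zA-Z\-ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïòóôõöùúûüýÿ']"

# The pattern from the answer, unchanged.
ANSWER_PATTERN = re.compile(rf"^({LETTERS}{{2,75}} ?){{2,5}}$")

# Hypothetical stricter variant (an assumption, not from the answer):
# two to five words, each separated by exactly one space.
STRICT = re.compile(rf"^{LETTERS}{{2,75}}( {LETTERS}{{2,75}}){{1,4}}$")

for name in ["John Smith", "Sean O'Leary", "John", "A B"]:
    print(name, bool(ANSWER_PATTERN.match(name)), bool(STRICT.match(name)))
```

With these inputs, both patterns accept "John Smith" and "Sean O'Leary" and reject "A B"; only the original pattern accepts "John".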

Is it possible using NLP? Natural Language processing

I have a set of Project Names, a set of keywords and a set of paragraphs.
My task is to check whether any keyword matches a project name and whether the same keyword matches any word in any paragraph.
If a paragraph matches a keyword and a project matches the same keyword, then I have to assign that paragraph to that project.
I have been using string regexes for this. Can it be implemented using Natural Language Processing concepts instead?
If yes, please let me know how it can be implemented. It would be very helpful for me.
Thanks in advance.
There's no NLP involved in this as such. No matter what you do, you have to go through all the projects and all the paragraphs at least once. Yes, you can optimize the process by using hashmaps or dictionaries, but at the end of the day you will be searching and matching strings.
Dictionaries make the mapping from keyword to project easy, and regex can still be used for the matching itself.
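The dictionary approach described above could be sketched roughly as follows in Python. The function name `assign_paragraphs`, the sample data, and the choice of case-insensitive whole-word matching for single-word keywords are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def assign_paragraphs(projects, keywords, paragraphs):
    """Map each project name to the paragraphs that share a keyword with it."""
    def words(text):
        # Lowercased set of whole words in the text.
        return set(re.findall(r"\w+", text.lower()))

    # Index: keyword -> projects whose name contains that keyword.
    projects_by_keyword = defaultdict(list)
    for project in projects:
        project_words = words(project)
        for keyword in keywords:
            if keyword.lower() in project_words:
                projects_by_keyword[keyword.lower()].append(project)

    # Assign each paragraph to every project reachable through a keyword.
    assignments = defaultdict(list)
    for paragraph in paragraphs:
        paragraph_words = words(paragraph)
        for keyword, matched in projects_by_keyword.items():
            if keyword in paragraph_words:
                for project in matched:
                    if paragraph not in assignments[project]:
                        assignments[project].append(paragraph)
    return dict(assignments)

result = assign_paragraphs(
    ["Solar Farm", "Wind Turbine"],
    ["solar", "wind", "budget"],
    ["The solar panels arrived.", "Wind speeds were low.", "Budget review pending."],
)
print(result)
```

The "budget" paragraph is left unassigned here because no project name contains that keyword.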

Regex to find common letters between two strings

I've been searching on Google for a few hours and got a partial solution.
I'm new to both Groovy and regular expressions. I've used regex sporadically over the years, but I am far from comfortable with it.
I've got a simple game that checks how many letters you have in common with a hidden word.
For simplicity's sake, let's say the word is "pan" and the person types "can".
I want the result of the regex to give me "an".
Right now, I've got this partly working by doing this (in Groovy):
// Where "guess" is the user's try and "word" is the word they need to guess.
def expr = "[$word]"
def result = guess.find(expr)
The result string contains only the first matching letter. Anyone have any more elegant solutions?
Thanks in advance
I don't think this is a use case for a regex. You'd have to take care of things like not letting the user win automatically by entering .* or the like.
Typical collection work is better suited for this task IMO. One solution would be to find the intersection of both words treating them as sets of characters:
(word as Set).intersect(guess as Set).join()
Or filtering the guess' characters that appear in the secret word:
guess.findAll { word.contains(it) }.unique().join()
Suppose the two strings are s1 and s2.
To find the letters of s1 that also occur in s2, do:
commonString = s1.replaceAll("[^" + s2 + "]", "");
If s2 may contain regex metacharacters, quote it first and use the quoted string in the character class:
String quoted = Pattern.quote(s2);
commonString = s1.replaceAll("[^" + quoted + "]", "");
You could try:
guess.findAll( /[$word]/ ).join()
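For comparison, the collection-based idea from the answers above (keep the guess's characters that appear in the secret word, without repeats) can be sketched in Python. The helper name `common_letters` is made up for this example; no regex is involved, so input such as `.*` is treated as ordinary characters.

```python
def common_letters(word, guess):
    """Letters of guess that also occur in word, in guess order, no repeats."""
    seen = set()
    result = []
    for ch in guess:
        if ch in word and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return "".join(result)

print(common_letters("pan", "can"))        # an
print(repr(common_letters("pan", ".*")))   # ''
```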

Try to figure out a good way to split English document into sentences in C#

Is there a good way to split an English document into sentences? English documents frequently include abbreviations such as Mr., Mrs., U.S.A., etc., which makes it difficult to tell where sentences end. Do we need a special natural language library to accomplish this? I suspect that we do.
Thank you.
Technically, you need a complete understanding of English to do the job.
As a decent "almost" solution, you could use a dictionary of "things that end in period" and split on periods which do not immediately follow one of those tokens.
If every sentence begins with a capital and ends with a period, then I would define a sentence as above, with the added conditions that it contains more than one word and does not end with an entry from a common abbreviation list (or match a regex such as [a-zA-Z].+).
You can use a sentence detector provided by one of the many NLP tools, such as OpenNLP or Stanford CoreNLP. They can handle cases like Mr., Mrs., U.S.A., etc.
Both OpenNLP and Stanford CoreNLP are written in Java.
SharpNLP is a C# port of OpenNLP.
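The "dictionary of things that end in a period" idea suggested above can be sketched in a few lines. This is shown in Python rather than C# purely for brevity; the `ABBREVIATIONS` set and the `split_sentences` helper are illustrative assumptions, and the approach will not split a sentence that genuinely ends with an abbreviation.

```python
# Illustrative abbreviation list (an assumption); a real one would be longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.a.", "etc."}

def split_sentences(text):
    """Split on tokens ending in . ! or ? unless the token is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text without closing punctuation.
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith moved to the U.S.A. in 1990. He likes it."))
# ['Mr. Smith moved to the U.S.A. in 1990.', 'He likes it.']
```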

Pattern matching with regex

I am a novice in regex and trying to understand it by solving small problems. So here I am with a problem which I couldn't solve (warning: it may be extremely silly). Your inputs will help me understand the concept.
I want to write a regex which will match all items in list1 but none of those from list2
list1
pit
spot
spate
slap two
respite
list2
pt
Pot
peat
part
My thinking was: "give me all the items that start with p|s|r and end with it|ot|e|o".
So I wrote ^[p|s|r].*[it|ot|e|o]$, which gave an undesired result.
Thanks in advance for your inputs.
In notepad you can't do or operations (taken from Notepad++: A guide to using regular expressions and extended search mode and tested on my Notepad++ 5.9.3)
This would work in other "standard" regexes :-)
^[psr].*(it|ot|e|o)$
Try here. http://gskinner.com/RegExr/?2uudn
What you were doing was using [] instead of the grouping (). Your [it|ot|e|o] was equivalent to [itoe|] (where the | is a "standard" character instead of an or); in general, everything inside [] is already or-ed :-) [ab] means a or b.
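The difference between the character class and the group can be checked with a short Python sketch. The variable names and the `matches` helper are made up for this example, and Python's `re` module stands in for the regex engines mentioned above.

```python
import re

list1 = ["pit", "spot", "spate", "slap two", "respite"]
list2 = ["pt", "Pot", "peat", "part"]

# Pattern from the question: [] treats | as a literal character.
bracket = re.compile(r"^[p|s|r].*[it|ot|e|o]$")
# Corrected pattern from the answer: () groups real alternatives.
grouped = re.compile(r"^[psr].*(it|ot|e|o)$")

def matches(pattern, items):
    """Return the items that the pattern matches."""
    return [item for item in items if pattern.search(item)]

print(matches(bracket, list2))            # ['pt', 'peat', 'part']
print(matches(grouped, list2))            # []
print(matches(grouped, list1) == list1)   # True
```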
/(pit|spot|spate|slap two|respite)/.test('Pot')
This matches the words from list one, and none from list two
I feel like I must be missing something.
^(pit|spot|spate|slap two|respite)$
It depends entirely on how you categorise the differences between the lists:
/p[ioa\s]t/
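As a quick sanity check, the last two suggestions can be run against both lists. This Python sketch uses a hypothetical `separates` helper and assumes the patterns behave the same in Python's `re` module as in the engines the answers were written for.

```python
import re

list1 = ["pit", "spot", "spate", "slap two", "respite"]
list2 = ["pt", "Pot", "peat", "part"]

candidates = {
    "alternation": re.compile(r"^(pit|spot|spate|slap two|respite)$"),
    "short class": re.compile(r"p[ioa\s]t"),
}

def separates(pattern):
    """True if the pattern matches every list1 item and no list2 item."""
    return (all(pattern.search(item) for item in list1)
            and not any(pattern.search(item) for item in list2))

for name, pattern in candidates.items():
    print(name, separates(pattern))
```

Both candidates print True for these two lists; whether either generalises depends on what actually distinguishes the lists.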