List of words containing a string in a given language - aspell

Is there a way to use either ispell or aspell to generate a list of words that contain a given substring? For example, I would like to list all words in the English language that contain the string 'th'.
Note that my question is not limited to English only.

Related

A Regex to ignore a set of words

Is there a way to set regex to ignore a set of words separated by space?
I have different products names like:
"Matrix 10X, 10 ml + DISPENSER"
"Matrix 10X,10ml + DISPENSER" where the quantity varies
What I'm trying to do is to replace using regex all words except for:
"10 ml" | "10 ML" | "10ml" ---> these are to be ignored
I have found a pattern that replaces all characters except the words separated by a space (like "10 ml")
https://regex101.com/r/bG8vB4/5
and to replace them when they are together (like "10ml")
https://regex101.com/r/bG8vB4/4
but I can't find a way to combine them so that only "10 ml" OR "10 ML" OR "10ml" is kept and all other characters up to the end of the string are removed.
Regexps are a mathematical model for efficient computer recognition of strings. It is easy to write a regular expression that matches a string if it contains any of some words, and theory shows that a regexp that matches a string only if it contains none of those words also exists. Building that regexp, however, is far more complex.
In regular expression theory, a regular language is one that a finite automaton can recognize, and such an automaton can be built from a regular expression. The automaton that accepts exactly the strings the original rejects is obtained (from a deterministic automaton) by switching all accepting states into non-accepting states and vice versa. The hardest part is then to build a regular expression that matches that automaton: it is possible, but the final regular expression is, in general, far more complex than the original. (Of course, some regexp libraries give you an operator for this, but you don't say whether the one you are using does.)
A simple example shows how complex this gets: recognizing a C language comment. A comment is a string delimited by the sequences /* and */, but the inner part cannot contain the sequence */.
The first approach could be to use the following regexp:
\/\*.*\*\/
but that fails, because the inner .* also matches */, so /* bla bla bla */ bla bla bla */ is recognized as one comment as a whole (it should end at the first */). So we need a regexp that recognizes anything that does not include */.
Such a subexpression is:
([^*]|\*+[^*/])*
which means an indefinite concatenation of characters other than *, or of runs of * followed by a character that is neither * nor /. If you follow that concatenation, you'll see that it's impossible to form the sequence */. The closing delimiter then has to absorb any asterisks that come immediately before the final /, leading to our final regexp:
\/\*([^*]|\*+[^*/])*\*+\/
(now you see how things get complicated)
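Patterns like these are easy to check against a few inputs. The following Python sketch compares the greedy pattern with a stricter one. The stricter pattern is one common formulation that treats runs of asterisks explicitly, and the helper name first_comment is made up for illustration.

```python
import re

# Greedy version: ".*" runs on to the LAST "*/" on the line.
NAIVE = re.compile(r"/\*.*\*/")

# Stricter version (an assumed formulation): the body is made of non-asterisk
# characters, or runs of asterisks followed by something that is neither
# "*" nor "/", so the body can never contain "*/".
STRICT = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")


def first_comment(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


sample = "/* bla bla bla */ bla bla bla */"
print(first_comment(NAIVE, sample))   # the whole sample
print(first_comment(STRICT, sample))  # /* bla bla bla */
print(first_comment(STRICT, "/* stars **/ tail */"))  # /* stars **/
```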
To extend this to a single word (a word of more than two letters), a first approximation is to allow:
([^w]|w[^o]|wo[^r]|wor[^d])*
in the set, and if you have two words (like foo and bar) you have to write:
([^fb]|f[^o]|fo[^o]|b[^a]|ba[^r])*
(note the single class [^fb]: writing [^f]|[^b] would match every character). Each word contributes its own set of alternatives, which makes the final regexp rather complicated. These patterns are also only approximate: there can be interactions between words if one is a prefix of another or if they share prefix characters, and an alternative such as w[^o] can swallow the first letter of a following occurrence (as in wword). In addition, many regexp libraries treat the | operator as non-commutative (they try the alternatives in order), which can lead to erroneous results.
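Where the regexp library supports negative lookahead, the same "contains none of these words" test can be written much more compactly. This is a different technique from the hand-built alternation above, shown here as a minimal Python sketch. The function name none_of is hypothetical, and the words are matched as plain substrings.

```python
import re


def none_of(words):
    """Compile a pattern whose fullmatch succeeds only on strings that
    contain none of the given words as a substring."""
    alternatives = "|".join(re.escape(w) for w in words)
    # At every position, refuse to advance if a forbidden word starts here.
    return re.compile(rf"(?:(?!{alternatives}).)*", re.DOTALL)


pattern = none_of(["foo", "bar"])
for s in ["a fo ba", "rabbit", "xfoo", "ffoo", "crowbar"]:
    print(s, bool(pattern.fullmatch(s)))
```

The first two strings are accepted and the last three are rejected, including ffoo, which the hand-built alternation handles incorrectly.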
You have also not explained what you mean by ignoring. Matching the words and skipping over them is different from ignoring the whole line they appear on, and the regexps (and the definition of the problem you need to solve) are then quite different. My explanation was in the sense of rejecting a full sentence if it contains any of the words, which is probably not what you mean. So please explain (in your question) what you mean by:
accepting you have matched a sentence containing a word.
rejecting such a sentence.
what you are rejecting (or ignoring) at all.
Rejecting just a word simply means selecting a sentence that contains that word and marking the word so that you can pass over it. But that's a different problem, and it requires selecting sentences that do contain the word.

Pattern matching for strings independent from symbols

I need an algorithm which can find pre-defined patterns in data (which is present in the form of strings) independent from the actual symbols/characters of the data and the pattern. I only care about the relations between the symbols, not the symbols themselves. It is also legal to have different pattern symbols for the same symbol in the data. The only thing the pattern matching algorithm has to enforce is that multiple occurrences of the same symbol in the pattern are preserved. To give you an example:
The pattern is abca, so the first and the last letter are the same. For my application, an equivalent way to write this would be 1 2 3 1, where the digits are just variables. The data I have is thistextisatest. The resulting algorithm should give me two correct matches here, text and test. Because only in these two cases, the first and the fourth letter are the same, as in the pattern.
As a second example, the pattern abcd should return 12 matches (one for each position in thistextisat). Since no variable in the pattern is repeated, it is trivially matched everywhere. Even in the case of text and test, because it is legal that the variables a and d of the pattern map to the same symbol.
The goal of this algorithm should be to detect similarities in written language. Imagine having a dictionary of the English language and parsing it with the pattern unseen or equivalently 1 2 3 4 4 2. You would then see that, for example, the word belittle contains the same pattern of letters.
So, now that I hopefully made clear what I need, I have some questions:
What is this algorithm called? Is it a well-known problem that has been solved?
Are there publications on the matter? It is really hard to find anything useful when you don't know the correct search terms to separate this problem from regular pattern matching.
Is there a ready implementation of this?
I have not used Regex for anything too complicated, so I don't know if anything like this would even be possible in Regex, when you basically do not care about the symbols as such, but only consider the pattern of their occurrences.
I'd really appreciate your help!
I don't think you need regular expressions here. Your search term:
unseen
123442
This has six characters, so index each word of your text into 6-mers
belittle
12,12,12,12,11,12,12 2-mers
123,123,123,122,112,123 3-mers
1234,1234,1233,1223,1123 4-mers
12345,12344,12331,12234 5-mers
123455,123442,123314 6-mers
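So just looking at the 6-mers, you've got a match. To allow for the abcd (1234) case matching an abca (1231) word, an n-mer should also count as a match whenever the positions that share a digit in your search term share a digit in the n-mer. A plain numeric "less than" comparison is not quite enough for this: 1123 is less than 1231, but ttle does not fit abca.
So given a search term of n characters, just split each word into its constituent n-mers and check each one against the search term in this way.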
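The n-mer idea can be sketched in a few lines of Python. This is only an illustration under the rules stated in the question (repeated pattern symbols must be preserved, while distinct pattern symbols may map to the same data symbol). The function names signature, fits and find_matches are made up here.

```python
def signature(s):
    """Number symbols by order of first appearance, e.g. 'unseen' -> (1, 2, 3, 4, 4, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen) + 1) for ch in s)


def fits(pattern, window):
    """True if positions that are equal in pattern are also equal in window."""
    bound = {}
    for p, ch in zip(pattern, window):
        # Bind each pattern symbol to the first data symbol it meets,
        # then require every later occurrence to meet the same symbol.
        if bound.setdefault(p, ch) != ch:
            return False
    return True


def find_matches(pattern, data):
    """Return every substring of data (one per start position) that fits pattern."""
    n = len(pattern)
    windows = (data[i:i + n] for i in range(len(data) - n + 1))
    return [w for w in windows if fits(pattern, w)]


print(find_matches("abca", "thistextisatest"))      # ['text', 'test']
print(len(find_matches("abcd", "thistextisatest")))  # 12
print(find_matches("unseen", "belittle"))            # ['elittl']
print(signature("elittl"))                           # (1, 2, 3, 4, 4, 2)
```

For a large dictionary, the signatures of all n-mers could be computed once and indexed, so that patterns without "free" symbols become a plain lookup.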

Get uncommon words from sentence

I need a function that can take a sentence and extract only the uncommon words.
I was thinking of building a dictionary that will hold the common words for each language.
Then I will look up each word from my sentence (one by one) in this dictionary, and that way I can tell if it is common or uncommon.
Is there a better way?
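The lookup approach described above could be sketched as follows in Python. The COMMON_WORDS set is a tiny hypothetical stand-in for a real per-language list of common words, and the tokenization with \w+ is a simplifying assumption.

```python
import re

# Hypothetical placeholder: a real list would hold the most frequent
# words of the language, loaded from a file or a frequency table.
COMMON_WORDS = {"the", "a", "is", "on", "of", "and", "to", "in"}


def uncommon_words(sentence, common=COMMON_WORDS):
    """Return the words of sentence (lowercased, in order) not found in common."""
    words = re.findall(r"\w+", sentence.lower())
    return [w for w in words if w not in common]


print(uncommon_words("The cat is on the ottoman"))  # ['cat', 'ottoman']
```

Using a set makes each lookup constant time on average, so the cost is linear in the length of the sentence.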

Java Regex to find if a given String contains a set of characters in the same order of their occurrence.

We need Java Regex to find if a given String contains a set of characters in the same order of their occurrence.
E.g. if the given String is "TYPEWRITER",
the following strings should return a match:
"YERT", "TWRR" & "PEWRR" (character by character match in the order of occurrence),
but not
"YERW" or "YERX" (this contains characters either not present in the given string or doesn't match the order of occurrence).
This can be done by character by character matching in a for loop, but it will be more time consuming. A regex for this or any pointers will be highly appreciated.
First of all, regex is not the tool for this. Regex is powerful, but it is not well suited to this task.
What you are asking for is a variant of the Longest Common Subsequence (LCS) problem. For your case you need to change the algorithm a bit: instead of finding a partial subsequence common to both strings, you need to match one string as a whole subsequence of the larger one.
LCS is a dynamic programming algorithm and a well-known way to approach this kind of problem. If you take a look at the LCS Example here you'll find what I am talking about.
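Because the shorter string has to match as a whole, the check reduces to a plain subsequence test, which a single left-to-right scan over the larger string can do. A minimal sketch in Python follows (the question asks for Java, where the same loop would use two indices). The function name is_subsequence is made up for illustration.

```python
def is_subsequence(candidate, text):
    """True if the characters of candidate occur in text in the same order."""
    remaining = iter(text)
    # "ch in remaining" advances the iterator until it finds ch, so every
    # later character is searched for only after the previous hit.
    return all(ch in remaining for ch in candidate)


for candidate in ["YERT", "TWRR", "PEWRR", "YERW", "YERX"]:
    print(candidate, is_subsequence(candidate, "TYPEWRITER"))
```

The first three candidates give True and the last two give False. The scan visits each character of the larger string at most once.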

How do I find words with all the specified characters, with repetition?

Is there a way to find the words containing all the given characters, including repeated ones, with a regular expression? For example, I want to find all words from the list
aabc, abbc, bbbc, aaac, aaab, baac, caab, abca
that contain exactly one 'b' and two 'a's, i.e. aabc, baac, caab, and abca (but NOT aaab as it has an additional 'a'). Word length doesn't matter.
While this question
GREP How do I only retrieve words with only the specified letters?
could give me some hints, I wasn't able to extend it to handle repeated characters.
I am just playing with the re module from Python, but there is no restriction on language / tool for the question.
EDIT:
A better example / use case would be: given a list of words, show only those that contain all the letters entered by a user, e.g. I would like to find all words containing exactly one 'a', two 'd's and one 's'. Is this something regex is capable of? (I already know how to do it without regex.)
To match exactly 2 a's and 1 b (in any order) in your input string use this regex:
(?=^(?:[^a]*a){2}[^a]*$)(?=^[^b]*b[^b]*$)^.+$
Here is a live demo for you.
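As a quick check, the regex can be run over the word list from the question with Python's re module. The variable names are illustrative only.

```python
import re

# Two lookaheads, each anchored at the start: exactly two a's, exactly one b.
EXACT = re.compile(r"(?=^(?:[^a]*a){2}[^a]*$)(?=^[^b]*b[^b]*$)^.+$")

words = ["aabc", "abbc", "bbbc", "aaac", "aaab", "baac", "caab", "abca"]
selected = [w for w in words if EXACT.match(w)]
print(selected)  # ['aabc', 'baac', 'caab', 'abca']
```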
If your regex flavor supports lookaheads, then you can use this:
\b(?=.*b)(?=([^a]*a){2}[^a]*\b)[abc]+\b
This requires at least one b and exactly 2 a's, and allows only a, b and c in the string. If you want to require exactly one b and exactly 4 characters in total, use this:
\b(?=[^b]*b[^b]*\b)(?=([^a]*a){2}[^a]*\b)[abc]{4}\b
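For the use case in the EDIT, where the letters and counts come from a user, the lookaheads can be generated instead of written by hand. The sketch below is in Python and follows the anchored whole-string style rather than the \b style. The function name exact_counts_regex is hypothetical, and each key is assumed to be a single character.

```python
import re


def exact_counts_regex(counts):
    """Build a regex matching whole strings that contain each letter exactly
    counts[letter] times; other letters are unrestricted."""
    lookaheads = []
    for letter, n in counts.items():
        c = re.escape(letter)
        # n repetitions of "anything but c, then c", then no further c.
        lookaheads.append(f"(?=^(?:[^{c}]*{c}){{{n}}}[^{c}]*$)")
    return re.compile("".join(lookaheads) + r"^.+$")


pattern = exact_counts_regex({"a": 1, "d": 2, "s": 1})
words = ["adds", "dads", "sad", "daddy", "saddle"]
print([w for w in words if pattern.match(w)])  # ['adds', 'dads', 'saddle']
```

Each lookahead scans the whole word, so the cost grows with the number of distinct letters requested. For long lists a letter count per word may be simpler.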