Regex: Find a word that consists of certain characters - regex

I have a list of dictionary words, I would like to find any word that consists of (some or all) certain characters of a source word in any order :
For Example:
Characters (source word) to look for : stainless
Found Words : stainless, stain, net, ten, less, sail, sale, tale, tales, ants, etc.
Also if a letter is found once in the source word it can't be repeated in the found word
Unacceptable words to find : tent (t is repeated), tall (l is repeated) , etc.
Acceptable words to find : less (s is already repeated in the source word), etc.

You could take this approach:
Match any sequence of characters that are in the search word, requiring that the match is a word (word-boundaries)
Prohibit that a certain character occurs more often than it is present in the search word, using a negative look-ahead. Do this for every character that is in the search word.
For the given example the regular expression would be:
(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})\b[stainless]+\b
The biggest part of the pattern deals with the negative look-ahead. For example:
(\S*s){4} would match four times an 's' in a single word.
(?! | ) places these patterns as different options in a negative look-ahead so that none of them should match.
Automation
It is clear that making such a regular expression for a given word needs some work, so that is where you could use some automation. Notepad++ cannot help with that, but in a programming environment it is possible. Here is a little snippet in JavaScript that will give you the regular expression that corresponds to a given search word:
function regClassEscape(s) {
// Escape "[" and "^" and "-":
return s.replace(/[\]^-]/g, "\\$&");
}
function buildRegex(searchWord) {
// get frequency of each letter:
let freq = {};
for (let ch of searchWord) {
ch = regClassEscape(ch);
freq[ch] = (freq[ch] ?? 0) + 1;
}
// Produce negative options (too many occurrences)
const forbidden = Object.entries(freq).map(([ch, count]) =>
"(\\S*[" + ch + "]){" + (count + 1) + "}"
).join("|");
// Produce character set
const allowed = Object.keys(freq).join("");
return "(?!" + forbidden + ")\\b[" + allowed + "]+\\b";
}
// I/O management
const [input, output] = document.querySelectorAll("input,div");
input.addEventListener("input", refresh);
function refresh() {
if (/\s/.test(input.value)) {
output.textContent = "Input should have no white space!";
} else {
output.textContent = buildRegex(input.value);
}
}
refresh();
input { width: 100% }
Search word:<br>
<input value="stainless">
Regular expression:
<div></div>

Related

Regex find most center

I need to manully hyphante words that are too long. Using hyphen.js, I get soft hyphens between every syllable, like below.
I want to find the hyphen closes to the middle. All words will be more than 14 characters long. Regex that works in https://regex101.com/ or node/js example.
Basically, find the middle character excluding hyphens, check if there is a hyphen there, then step backwards one step and then forwards one step, then backwards to steps etc.
re-spon-si-bil-i-ties => [re-spon-si,-bil-i-ties]
com-pe-ten-cies. => [com-pe,-ten-cies.]
ini-tia-tives. => [ini-tia,-tives]
vul-ner-a-bil-i-ties => [vul-ner-a,-bil-i-ties]
Here's a simple js approach based on string splitting. There could be a binary search style algorithm as you mentioned which would avoid the array allocation but that seems overkill for these small data sets.
function halve(str) {
var right = str.split('-');
var left = right.splice(0, Math.ceil(right.length / 2));
return right.length > 0 ? [left.join('-'), '-' + right.join('-')] : left;
}
console.log(halve('re-spon-si-bil-i-ties'));
console.log(halve('com-pe-ten-cies.'));
console.log(halve('ini-tia-tives.'));
console.log(halve('vul-ner-a-bil-i-ties'));
console.log(halve('none')); // no hyphens returns ["none"]
You can work this out with this method:
Get middle point of string
From the middle point, and checking each character in both directions (left from middle, right from middle) check if that position is the - character. Set the index to the first such match.
If it matches that character, stop the loop and split the string on that index, otherwise return the original word.
words = [
're-spon-si-bil-i-ties',
'com-pe-ten-cies.',
'ini-tia-tives.',
'vul-ner-a-bil-i-ties',
'test',
'-aa',
'aa-'
];
split = '-'
for(word of words) {
m=Math.floor(word.length/2),offset=0,i=null
do{
if(word[m-offset] == split) i = m-offset
else if(word[m+offset] == split) i = m+offset
else offset++
}while(offset<=m && i == null)
if(i!=null && i>0) console.log([word.substring(0,i),word.substring(i)])
else console.log(word)
}
You can achieve this with:
var words = [
're-spon-si-bil-i-ties',
'com-pe-ten-cies.',
'ini-tia-tives.',
'vul-ner-a-bil-i-ties',
're-ports—typ-i-cal-ly',
'none'
];
for(var i = 0; i < words.length; ++i){
var matches = words[i]
.match(
new RegExp(
'^((?:[^-]+?-?){' // Start the regex
+parseInt(
words[i].replace( /-/g, '' ).length/2 // Round down the halfway point of this word's length without the hyphens
)
+'})(-.+)?$' // End the regex
)
)
.slice( 1 ); // Remove position 0 because it is the entire word
console.log( matches );
}
Regex explanation for re-spon-si-bil-i-ties:
^((?:[^-]+?-?){8})(-.+)$
^( - start the capture group leading up to the half way point
(?:[^-]+?-?) - find everything not a hyphen with an optional hyphen after it. Make the hyphen optional so that the second capture group can greedily claim it
{8} - 8 times; this will get us half way
) - close the half way capture group
(-.+)?$ - greedily get the hyphen and everything after it till the end of the string

Regular expression: Numeric + alphanum + special characters Only

I am trying to build a regular expression that can find patterns that MUST contain both numeric and alphanumeric values along side special characters.
I found an answer that deals with this type of regular expressions but without the special characters.
How can I include the special characters including: ^$=()_"'[\# in the Regular expression?
^([0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+)[0-9a-zA-Z]*$
can you explain it a little please ?
Regex tester : http://regexlib.com/RETester.aspx
Thank you.
AS a solution I found this regular expression: ^(?=.*\d)(?=.*[a-zA-Z]).{4,8}$
Maybe it can help you.
Why so complicated !?
enum { numeric = 1; alpha = 2, special = 4; }
bool check(const std::string& s) {
for(std::string::size_type i = 0; i < s.size; ++i) {
if(is_numeric(s[i])) result |= numeric;
if(is_alpha(s[i])) result |= alpha;
if(is_special(s[i])) result |= special;
if(result == numeric | alpha | special)
return true;
}
return false;
}
A little more typing but less brain damage
Your regex is formed of two parts, both must capture a complete line as they're between start-of-line (^) and end-of-line ($):
([0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+)
This is formed of two regexs or'd (|) together. The first regex is one or more numbers ([0-9]+) followed by one or more letters ([a-zA-Z]+). This regex is or'd with the opposite case regex: one or more letters followed by one or more numbers.
The second group says that the above is followed by a regex zero or more letters or numbers ([0-9a-zA-Z]*)

Regex for parenthesis (JavaScript)

This is the regexp I created so far:
\((.+?)\)
This is my test string: (2+2) + (2+3*(2+3))
The matches I get are:
(2+2)
And
(2+3*(2+3)
I want my matches to be:
(2+2)
And
(2+3*(2+3))
How should I modify my regular expression?
You cannot parse parentesized expressions with regular expression.
There is a mathematical proof that regular expressions can't do this.
Parenthesized expressions are a context-free grammar, and can thus be recognized by pushdown automata (stack-machines).
You can, anyway, define a regular expression that will work on any expression with less than N parentheses, with an arbitrary finite N (even though the expression will get complex).
You just need to acknowledge that your parentheses might contain another arbitrary number of parenteses.
\(([^()]+(\([^)]+\)[^)]*)*)\)
It works like this:
\(([^()]+ matches an open parenthesis, follwed by whatever is not a parenthesis;
(\([^)]+\)[^)]*)* optionally, there may be another group, formed by an open parenthesis, with something inside it, followed by a matching closing parenthesis. Some other non-parenthesis character may follow. This can be repeated an arbitrary amount of times. Anyway, at last, there must be
)\) another closed parenthesis, which matches with the first one.
This should work for nesting depth 2. If you want nesting depth 3, you have to further recurse, allowing each of the groups I described at point (2) to have a nested parenthesized group.
Things will get much easier if you use a stack. Such as:
foundMatches = [];
mStack = [];
start = RegExp("\\(");
mid = RegExp("[^()]*[()]?");
idx = 0;
while ((idx = input.search(start.substr(idx))) != -1) {
mStack.push(idx);
//Start a search
nidx = input.substr(idx + 1).search(mid);
while (nidx != -1 && idx + nidx < input.length) {
idx += nidx;
match = input.substr(idx).match(mid);
match = match[0].substr(-1);
if (match == "(") {
mStack.push(idx);
} else if (mStack.length == 1) {
break;
}
nidx = input.substr(idx + 1).search(mid);
}
//Check the result
if (nidx != -1 && idx + nidx < input.length) {
//idx+nidx is the index of the last ")"
idx += nidx;
//The stack contains the index of the first "("
startIdx = mStack.pop();
foundMatches.push(input.substr(startIdx, idx + 1 - startIdx));
}
idx += 1;
}
How about you parse it yourself using a loop without the help of regex?
Here is one simple way:
You would have to have a variable, say "level", which keeps track of how many open parentheses you have come across so far (initialize it with a 0).
You would also need a string buffer to contain each of your matches ( e.g. (2+2) or (2+3 * (2+3)) ) .
Finally, you would need somewhere you can dump the contents of your buffer into whenever you finish reading a match.
As you read the string character by character, you would increment level by 1 when you come across "(", and decrement by 1 when you come across ")". You would then put the character into the buffer.
When you come across ")" AND the level happens to hit 0, that is when you know you have a match. This is when you would dump the contents of the buffer and continue.
This method assumes that whenever you have a "(" there will always be a corresponding ")" in the input string. This method will handle arbitrary number of parentheses.

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
if (match.Length != 3 || replace.Length != 3)
throw new ArgumentOutOfRangeException();
var output = new StringBuilder(input.Length);
int i = 0;
while (i + 2 < input.Length) {
if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
output.Append(replace);
} else {
output.Append(input[i]);
output.Append(input[i]+1);
output.Append(input[i]+2);
}
i += 3;
}
// pick up trailing letters.
while (i < input.Length) output.Append(input[i]);
return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, #"\G((?:.{3})*?)" + codon, "$1" + replacement);
DEMO
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codon, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only starts from positions whose index is multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), follows by the replacement codon.
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.

Matching token sequences

I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
11111
012345678901234
ab ab bc a cba
In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
I am hoping to implement this in Java.
Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):
public static void find(String haystack, String tokens, int windowLen) {
char[] tokenChars = tokens.toCharArray();
int hayLen = haystack.length();
int pos = 0;
nextPos:
while (pos + windowLen <= hayLen) {
for (char c : tokenChars) {
int i = haystack.indexOf(c, pos);
if (i < 0) return;
if (i - pos >= windowLen) {
pos = i - windowLen + 1;
continue nextPos;
}
}
// match found at pos
System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
pos++;
}
}
This tested Java program has a commented regex which does the trick:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "ab ab bc a cba";
Pattern p = Pattern.compile(
"# Match 5 char sequences containing: a and b and c\n" +
"(?=[abc]) # Assert first char is a, b or c.\n" +
"(?=.{0,4}a) # Assert an 'a' within 5 chars.\n" +
"(?=.{0,4}b) # Assert an 'b' within 5 chars.\n" +
"(?=.{0,4}c) # Assert an 'c' within 5 chars.\n" +
".{5} # If so, match the 5 chers.",
Pattern.COMMENTS);
Matcher m = p.matcher(s);
while (m.find()) {
System.out.print("Match = \""+ m.group() +"\"\n");
}
}
}
Note that there is another valid sequence S9:13" a cb" in your test data (before the S12:14"cba". Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5 char window must begin with an a, b or c.
Here is the output from the script:
Match = "ab bc"
Match = "a cba"
Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:
abc..|ab.c.|ab..c| .... etc.
This can be factorised somewhat:
ab(c..|.c.|..c)|a.(bc.|b.c .... etc.
I'm not sure if you can do better with regex.
Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc a cba";
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.group());
}
output:
ab bc
a cb
This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.
The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.