string comparison using Regular Expression - regex

Over on the Excel VBA forum, someone has asked for help with matching strings like the ones below:
Examples:
ACBD,AC - Match
ACBD,CA - Match
ACBD,ADB - Match
AC,ABCD - Match
ABC, ABD - No Match
The rule is that two strings match if all of the letters in one string are contained in the other (i.e. either one of the two strings contains all the letters of the other).
So it occurred to me that a regular expression might be the answer, but I am an absolute newbie at that, so can you help, please?
Is it possible to match both strings against each other?
Thanks
Philip

While regex would certainly make the check easier, I don't think this is possible without additional coding. You would need the code to do one of the following things:
1) match each character individually, then see whether all matches were true, or
2) rearrange the characters into every possible permutation and check whether any of them matches
Either way, you would need to manipulate the "checking" string in order to cover all of the possible requirements of the match.
If you had asked for "any of these characters" or "all of these characters, in this order", you might be able to do it without extra logic, but since you need "any of these characters, in any order", you'll need to manipulate the inputs.

I haven't got an answer for you in VBA but can tell you the steps you need to take.
For each element create a variable with the characters sorted into alphabetical order - you will need to search the net for a sort function to do this as there is not one built into VBA.
Insert a .* between each character in both variables - these are your regexes. You probably want to incorporate this step into the sort function.
Then all you need to do is match element one of your array with the regex variable created from the second element and then do the second with the first.
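The steps above can be sketched in Python rather than VBA. This is a minimal illustration, not the poster's code; the function names `letters_pattern` and `strings_match` are hypothetical.

```python
import re


def letters_pattern(s: str) -> str:
    """Sort the letters and join them with '.*' to build the regex."""
    return ".*".join(re.escape(ch) for ch in sorted(s))


def strings_match(a: str, b: str) -> bool:
    """True if every letter of one string is contained in the other."""
    sorted_a = "".join(sorted(a))
    sorted_b = "".join(sorted(b))
    # Check the regex built from b against a, then the other way round.
    return bool(
        re.search(letters_pattern(b), sorted_a)
        or re.search(letters_pattern(a), sorted_b)
    )


for left, right in [("ACBD", "AC"), ("ACBD", "CA"), ("ACBD", "ADB"),
                    ("AC", "ABCD"), ("ABC", "ABD")]:
    print(left, right, "Match" if strings_match(left, right) else "No Match")
```

Sorting both the regex letters and the string being searched is what makes the order of the original letters irrelevant.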

Related

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It is already case insensitive and allows any number of characters, including 0, before the first word found and after the second word found. I am unsure whether it is more correct to terminate with $.
Without the any-order matching, I have got as far as:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle, but I am unsure how it could work if I add this to both sets instead; for instance, this could cause a problem if the first word is the very first part of the source text being parsed, so that there is no character before it.
I have seen examples where \b is used with lookaheads, but that also seems like it may cause a problem, as I want it to match both when the first word is at the start of the source text and when it is not.
How should I edit my RegEx here please?
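Lookaheads are zero-width, so (?=new|launch|releas) consumes no text and the following .{1,20} starts at the same position as the word itself. One possible approach, sketched here in Python as an assumption rather than a confirmed answer, is to spell out both orders with alternation. The names `FIRST`, `SECOND`, `GAP` and `mentions_both` are hypothetical.

```python
import re

FIRST = r"(?:new|launch|releas)"      # first word set, from the question
SECOND = r"(?:product1|product2)"     # second word set, from the question
GAP = r".{1,20}"                      # 1 to 20 characters in the middle

# Either FIRST ... SECOND or SECOND ... FIRST, case insensitive.
EITHER_ORDER = re.compile(
    rf"(?:{FIRST}{GAP}{SECOND}|{SECOND}{GAP}{FIRST})",
    re.IGNORECASE,
)


def mentions_both(text: str) -> bool:
    # search() scans the whole text, so no leading or trailing .* is needed.
    return EITHER_ORDER.search(text) is not None


print(mentions_both("We will launch our great product1 today"))
print(mentions_both("Product2 is being released"))
```

Because `search` looks for the pattern anywhere in the text, it does not matter whether the first word is at the very start of the source text.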

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below, but since I need to add other controls (e.g. words can contain spaces), I would like to know if I'm on the right track.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advice are very welcome since I have just started learning how to build regular expressions. I ran some tests using http://regexr.com/ and it seems to be okay, but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=) is a positive lookahead. It requires everything inside, i.e. here !!, to come directly after the current match. It will not, however, be part of the resulting match.
Validating the string
If, however, you want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to consist only of ([^!]+!!), repeated one or more times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar.
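As a rough Python sketch of both uses, extraction with the lookahead and whole-string validation with [a-zA-Z]{1,5} substituted for [^!]+ to reflect the length and alphabet rules from the question. The names `words_of` and `is_valid` are hypothetical.

```python
import re

# Extraction: each run of non-'!' characters that is followed by "!!".
WORD_BEFORE_DELIMITER = re.compile(r"[^!]+(?=!!)")

# Validation: 1 to 5 letters followed by "!!", repeated one or more times.
VALID_STRING = re.compile(r"(?:[a-zA-Z]{1,5}!!)+")


def words_of(s: str) -> list[str]:
    return WORD_BEFORE_DELIMITER.findall(s)


def is_valid(s: str) -> bool:
    # fullmatch() plays the role of the ^ and $ anchors.
    return VALID_STRING.fullmatch(s) is not None


print(words_of("word!!cat!!DOG!!"))
for candidate in ["word!!cat!!DOG!!Phone!!home!!", "a!!b!!aaaaaa!!",
                  "a123!!b!!c!!", "aAaa!!bbb", "aAaa!!bbb!"]:
    print(candidate, is_valid(candidate))
```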

Regex for string that contains characters in unspecified order

Given strings like
helloworld
worldhello
ollehdlrow
Is there a regex that can match all those cases? So, basically a pattern that will match all strings that contain all characters, in unspecified order.
I tried using
/[helloworld]{10}/
but this doesn't work for obvious reasons, as it will also match eeeeeeeeee.
You definitely don't want to use regular expressions for this.
In order to check whether a character exists in the string, in your case, you would have to use a positive lookahead. It would look something like (?=a) to check for the character a. That's fine. If we want to check for a string containing both a and b, we can use /^(?=.*a)(?=.*b)/. Problems arise if we want to check for multiple a's.
View this example: http://regex101.com/r/iV2jC8
As you can see, the regex has been "told" to look two times for the letter 'a'. However, the first case still matches. This is because the engine does not save the position where it initially found the first 'a', and thus the next assertion finds the very same a. This is the case in all three of the examples. So in reality, none of them are really being validated.
You would have to do something like this: http://regex101.com/r/cR8eR4
As you can probably imagine, this will quickly get out of hand with larger patterns.
I hope this helps, best of luck.
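The linked regex101 examples are not reproduced here, so the two patterns below are assumed stand-ins for the behaviour described: repeating a lookahead does not count occurrences, while chaining the occurrences inside one lookahead does. The last function shows a non-regex alternative using letter counts; `same_letters` is a hypothetical name.

```python
import re
from collections import Counter

# Two separate lookaheads: both are satisfied by the same single 'a'.
REPEATED_LOOKAHEAD = re.compile(r"^(?=.*a)(?=.*a)")

# One lookahead that must find an 'a' and then another 'a' after it.
CHAINED_LOOKAHEAD = re.compile(r"^(?=.*a.*a)")


def same_letters(candidate: str, reference: str = "helloworld") -> bool:
    """Compare letter counts, ignoring the order of the letters."""
    return Counter(candidate) == Counter(reference)


print(bool(REPEATED_LOOKAHEAD.search("abc")))  # only one 'a' in the input
print(bool(CHAINED_LOOKAHEAD.search("abc")))
print(same_letters("worldhello"), same_letters("eeeeeeeeee"))
```

Comparing counts scales to any string without the pattern growing, which is the main reason to prefer it over a regex here.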

issue in a regexp

I'm using the following expression:
/^[alopinme]{5}$/
This regexp gives me words built from the letters contained within the brackets.
Now I need to add some more functionality to this expression, because the fetched words may contain ONLY one more letter from another set of letters. Let's say that I want to get words formed with letters from set A that may (optionally) contain one more letter from set B.
I'm trying to work out how to complete my regular expression, but I can't find the right way.
Could anyone help me?
Thanks.
EDIT:
Here is an example:
SELECT sin_acentos FROM Finder.palabras_esp WHERE sin_acentos REGEXP '^[tehsolm]{5}$'
This expression selects words like: helms, moths, meths, homes and so on.
But I need to add a set B of letters and get words that may contain ONLY one letter from that set. Let's say I have another set of letters [xzk], so the expression could return more words, but with at most one letter taken from set B.
The result could include words like: mozes, hoxes, tozes, and so on. If you check these words, you can see that most of the letters in each word are from set A but only one is from set B.
If one of the other characters may appear at most once, you can use:
^(?=.{5}$)[alopinme]*(?:[XYZ][alopinme]*)?$
(?=.{5}$) - Check that the string is 5 characters long, even before matching. (This might not work on MySQL.)
[alopinme]* - Characters from A
(?:[XYZ][alopinme]*)? - Optional - one character from B, and some more from A.
Working example: http://rubular.com/r/aw6l561Int
Or, if you want to allow them up to 3 times, for example:
^(?=.{5}$)[alopinme]*(?:[XYZ][alopinme]*){0,3}$
Since the words that you are looking for are all five characters long, I can think of a rather ugly expression that would do the trick: let's say [alopinme] is your base set, and [xyz] is your optional set. Then the expression
/^([alopinmexyz][alopinme]{4}|[alopinme][alopinmexyz][alopinme]{3}|[alopinme]{2}[alopinmexyz][alopinme]{2}|[alopinme]{3}[alopinmexyz][alopinme]|[alopinme]{4}[alopinmexyz])$/
should allow five-letter words of the structure that you are looking for.
In general, a need to count anything makes your regex unreadable. Problems like this one illustrate the point well: it is much easier to write the expression /^[alopinmexyz]{5}$/ and add an extra step in code to check that [xyz] appears in the text no more than once. You can even use a regexp to do the additional check:
/^[^xyz]*[xyz]?[^xyz]*$/
The result in SQL would look as follows:
SELECT sin_acentos
FROM Finder.palabras_esp
WHERE sin_acentos REGEXP '^[tehsolmxyz]{5}$' -- Length == 5, all from tehsolm+xyz
AND sin_acentos REGEXP '^[^xyz]*[xyz]?[^xyz]*$' -- No more than one character from xyz
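The same two-step check can be sketched outside SQL. This Python version assumes the sets tehsolm and xyz from the query above; the names `BASE`, `EXTRA` and `accepts` are hypothetical.

```python
import re

BASE = "tehsolm"   # set A, taken from the SQL example
EXTRA = "xyz"      # set B, taken from the SQL example

# Step 1: exactly five characters, all drawn from A or B.
ALL_ALLOWED = re.compile(rf"[{BASE}{EXTRA}]{{5}}")

# Step 2: no more than one character from B anywhere in the word.
AT_MOST_ONE_EXTRA = re.compile(rf"[^{EXTRA}]*[{EXTRA}]?[^{EXTRA}]*")


def accepts(word: str) -> bool:
    return bool(ALL_ALLOWED.fullmatch(word)
                and AT_MOST_ONE_EXTRA.fullmatch(word))


for word in ["helms", "moths", "hoxes", "mozes", "xoxes", "homed"]:
    print(word, accepts(word))
```

Keeping the two conditions in separate patterns means either set can be changed without rewriting the other check.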

RegExp Find skip letter in the word

I want to find a word even when it is written with a letter skipped.
For example, I want to find
references
I also want to find refrences or refernces, but not refer.
I wrote this regexp:
(\brefe?r?e?n?c?e?s?\b)
And I want to add a check on the length of the matched group; it should be greater than 8.
Can I do this with regexp methods only?
I don't think regex is a good tool for finding similar words the way you are trying to. What do you do if two letters are swapped, like "refernece"? Your regex will not find it.
But to show the regex way to check for the length, you could do this by using a lookahead like this
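(\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b)
The (?=\w{8,}\b) checks that the word starting at the first \b is at least 8 characters long ({8,}). Using \w instead of . matters here: . also matches spaces, so the lookahead could otherwise be satisfied by the words that follow.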
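A quick way to try the length check is the Python sketch below. It uses \w{8,} inside the lookahead so that the length check stays within a single word; the name `find_candidates` is hypothetical.

```python
import re

# Optional letters after "ref", plus a lookahead requiring 8+ word characters.
SKIPPED_LETTER = re.compile(r"\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b")


def find_candidates(text: str) -> list[str]:
    return SKIPPED_LETTER.findall(text)


print(find_candidates("references refrences refernces refer to the docs"))
```

Without the lookahead, the short word refer would be accepted too, since every letter after "ref" is optional.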
I think that using regex is not a good idea. You need more powerful functions. For example, if you are programming in PHP, you need a function like similar_text. More details here: http://www.php.net/manual/en/function.similar-text.php
Basically, you are asking for this (in pseudocode):
input == "references" or (levenshtein("references", input)==1 and length(input) == (length("references")-1))
Levenshtein distance is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Since you want to detect only the strings where a char was skipped, you must add the constraint on the string length.
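The pseudocode above can be made runnable as follows. This is a minimal Python sketch with a textbook dynamic-programming Levenshtein distance; the function names `levenshtein` and `matches_with_skip` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def matches_with_skip(candidate: str, target: str = "references") -> bool:
    """Exact match, or exactly one letter of the target skipped."""
    return candidate == target or (
        levenshtein(target, candidate) == 1
        and len(candidate) == len(target) - 1
    )


for word in ["references", "refrences", "refernces", "refer", "refernece"]:
    print(word, matches_with_skip(word))
```

The length constraint is what rejects single substitutions and insertions, which also have distance 1.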