RegExp Find skip letter in the word - regex

I want to find word even this word is written with skip letter.
For example I want to find
references
I want also find refrences or refernces, but not refer
I write this Regexp
(\brefe?r?e?n?c?e?s?\b)
And I want to add checking for length of matched group, this group should be greather than 8.
Can I do only with regexp methods?

I don't think regex is a good tool to find similar words like you try to. What are you doing if two letters are swapped, like "refernece"? Your regex will not find it.
But to show the regex way to check for the length, you could do this by using a lookahead like this
(\b(?=.{8,}\b)refe?r?e?n?c?e?s?\b)
The (?=.{8,}\b) will check if the length from the first \b to the next \b is at least 8 characters ({8,})
See it here on Regexr

I think that using regex is not a good idea. You need more power functions. For example, if you are programming in php, you need function like similar_text. More details here: http://www.php.net/manual/en/function.similar-text.php

Basically you are asking that (in pseudo code):
input == "references" or (levenshtein("references", input)==1 and length(input) == (lenght("references")-1))
Levenshtein distance is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Since you want to detect only the strings where a char was skipped, you must add the constraint on the string length.

Related

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It also is already case insensitive, allow for any number of characters including 0 before the first word found and the same after the second word found - I am unsure if it is more correct to terminate in $.
Without the any order matching I am so far as:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle but I am unsure how it could work if I add this to both sets instead, as for instance this could cause a problem if the first word is the very first part of the source text it is parsing, and so no character before it.
I have seen example where \b is used for lookaheads but that also seems like it may cause a problem as I want it to match when the first word is at the start of the source text but also when it is not.
How should I edit my RegEx here please?

Regex for Matching Pinyin

I'm looking for a regular expression that can correctly match valid pinyin (e.g. "sheng", "sou" (while ignoring invalid pinyin, e.g. "shong", "sei"). Most of the regex provided in the top Google results match invalid pinyin in some cases.
Obviously, no matter what approach one takes, this will be a monster regex, and I'm especially interested in the different approaches one could take to solve this problem. For example, "Optimizing a regular expression to parse chinese pinyin" uses lookbacks.
A table of valid pinyin can be found here:
http://pinyin.info/rules/initials_finals.html
I went for a regex that grouped smaller regexes by the pinyin's initial (usually the first letter). So, the first group includes all "b", "p" and "m" sounds, then "f", then "d" and "t", etc.
This approach seems easy to read and should be easy to edit (if it needs corrections or additions). I also added exceptions to the begging of groups in order to improve readability.
([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)
Here's the Debuggex example I created.
I would use a combination approach that is not solely regex.
Check for valid pinyin:
grab word
grab letters from the beginning of the word as long as they are consonants. This separates the initial sound from the final sound.
check that the initial and final are valid...
...and if so, see if their combination is allowed (via a table like this, but the entries are simply 1's and 0's).

string comparison using Regular Expression

over on the Excel VBA forum here someone has asked for help with matching strings like the below:
Examples:
ACBD,AC - Match
ACBD,CA - Match
ACBD,ADB - Match
AC,ABCD - Match
ABC, ABD - No Match
the rule is that strings match on condition that all of the letters in one string is contained in the other (i.e either one of the two strings contain all the letters of the other)
So it occurred to me that a Regular expression might be the answer, but I am an absolute newbie on that so can you help please?
Is it possible to match both strtings against each other ?
thanks
Philip
While Regex would certainly make the check easier, I don't that this is not possible without additional coding. You would need the code to do one of the following things:
1) match each character individually then see if all matches were true,
2) re-arrange the order of the characters in all possible order permutations and check each order to see if that matched
Either way, you would need to manipulate the "checking" string in order to cover all of the possible requirements of the match.
If you had asked for "any of these characters" or "all of these characters, in this order", you might be able to do it without extra logic, but since you need "any of these characters, in any order", you've need to manipulate the inputs.
I haven't got an answer for you in VBA but can tell you the steps you need to take.
For each element create a variable with the characters sorted into alphabetical order - you will need to search the net for a sort function to do this as there is not one built into VBA.
Insert a .* between each character in both variables - these are your regexs. You probably want to incorporate this step in with the sort function.
Then all you need to do is match element one of your array with the regex variable created from the second element and then do the second with the first.

Regex for string that contains characters in unspecified order

When having strings like
helloworld
worldhello
ollehdlrow
Is there a regex that can match all those cases? So, basically a pattern that will match all strings that contain all characters, in unspecified order.
I tried using
/[helloworld]{10}/
but this doesn't work for obvious reasons, as it will also match eeeeeeeeee.
You definitely don't want to use regular expressions for this.
In order to check if a character exists in the string, in your case, you would have to use a positive lookahead. It would look something like this (?=a) to check for the character a. Thats fine. If we want to check for a string containing the character a and b we can do /^(?=.*a)(?=.*b)/. Problems arise if we want to check for multiple as.
View this example: http://regex101.com/r/iV2jC8
As you can see, the regex has been "told" to look two times for the letter 'a'. However, the first case still matches. This is because the engine does not save the position where it initially found the first 'a', and thus the next assertion finds the very same a. This is the case in all three of the examples. So in reality, none of them are really being validated.
You would have to do something like this: http://regex101.com/r/cR8eR4
Which as you probably can imagine will quickly get out of hand with larger patterns.
I hope this helps, best of luck.

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously if I'm here, the invalid ones are accepted by the regex. The goal of this regex is accept either 1 to 4 numbers with an optional letter, or a single letter or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This look for any number 0 through 4 times at the beginning of the string
[A-Za-z] = This look for any characters (Both cases)
(?!.*[0-9]) = This will only allow the letters if there are no numbers anywhere after the letter.
I haven't quite figured out how to validate against a null character, but that might be easier done using tools from whatever language you are using. Something along this logic:
if String Doesn't equal $null Then check the Rexex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments