Regex for string that contains characters in unspecified order - regex

Given strings like
helloworld
worldhello
ollehdlrow
Is there a regex that can match all those cases? So, basically a pattern that will match all strings that contain all characters, in unspecified order.
I tried using
/[helloworld]{10}/
but this doesn't work for obvious reasons, as it will also match eeeeeeeeee.

You definitely don't want to use regular expressions for this.
To check whether a character exists in the string, in your case, you would have to use a positive lookahead. It would look something like (?=a) to check for the character a. That's fine. If we want to check for a string containing both the characters a and b, we can do /^(?=.*a)(?=.*b)/. Problems arise if we want to check for multiple a's.
View this example: http://regex101.com/r/iV2jC8
As you can see, the regex has been "told" to look two times for the letter 'a'. However, the first case still matches. This is because the engine does not save the position where it initially found the first 'a', and thus the next assertion finds the very same a. This is the case in all three of the examples. So in reality, none of them are really being validated.
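The same effect can be reproduced locally. This is a minimal Python sketch, not the contents of the regex101 link; the sample strings are made up for illustration:

```python
import re

# Two identical lookaheads. Both start scanning at the same position,
# so a single 'a' in the input satisfies each of them.
twice_a = re.compile(r"^(?=.*a)(?=.*a)")

one_a = "banjo"     # hypothetical input with exactly one 'a'
no_a = "bingo"      # hypothetical input with no 'a'

print(bool(twice_a.search(one_a)))  # matches although there is only one 'a'
print(bool(twice_a.search(no_a)))   # no 'a' at all, so no match
```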
You would have to do something like this: http://regex101.com/r/cR8eR4
Which as you probably can imagine will quickly get out of hand with larger patterns.
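For comparison, here is a non-regex sketch in Python that counts the characters instead. It assumes "contains all characters" means the candidate uses exactly the same letters as helloworld, the same number of times each; the function name is made up for this example:

```python
from collections import Counter


def same_letters(candidate: str, reference: str = "helloworld") -> bool:
    """Return True if candidate is a rearrangement of reference's characters."""
    # Counter builds a per-character tally; equal tallies mean the same
    # letters with the same multiplicities, regardless of order.
    return Counter(candidate) == Counter(reference)


print(same_letters("worldhello"))
print(same_letters("ollehdlrow"))
print(same_letters("eeeeeeeeee"))
```

The first two calls report a match and the last one does not, which is the case the character class [helloworld]{10} could not rule out.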
I hope this helps, best of luck.

Related

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It is already case insensitive and allows any number of characters (including 0) before the first word found and after the second word found. I am unsure whether it is more correct to terminate it with $.
Without the any-order matching, this is what I have so far:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle but I am unsure how it could work if I add this to both sets instead, as for instance this could cause a problem if the first word is the very first part of the source text it is parsing, and so no character before it.
I have seen examples where \b is used with lookaheads, but that also seems like it may cause a problem, as I want it to match when the first word is at the start of the source text but also when it is not.
How should I edit my RegEx here please?

Regex not separating n't (not)

I am trying to write a complex regex for a large corpus. However, due to many ORs, I am not able to capture the "not" in weren't don't wasn't didn't shouldn't doesn't.
I would like it to match base verb and n't separately: E.g. were and n't
I have added it in the first line on: https://www.regexpal.com/?fam=106183 with the regex.
Any clue why it is not picking it up, despite it being present in the expression as the first alternative: [a-z]{1}'\w
Edit:
The regex is long because it is part of a large corpus. My problem is that the n't is not getting separated out, even though I placed in first order of preference for OR.
Thanks in advance
Trying to parse natural language perfectly with a regular expression is never going to be "perfect". Language contains too many quirks and exceptions.
However, with that said, trying to cover all scenarios explicitly like you have done ("a 2 letter lower case word", "a 4 letter capitalised word", "a word with a multiple of 3 letters" (??!), ...) is a doomed approach.
Keep the pattern as simple as you possibly can, and only add exceptions if you really need to.
Here's a basic approach:
/n't|\b\w+(?!'t)/
This matches "n't", or any word, excluding its last letter if that letter is followed by "'t".
You may wish to build upon that slightly, but it solves the use case you've provided:
Demo
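The basic pattern can also be tried in Python. This is a small sketch; the sample sentence is taken from the contractions listed in the question:

```python
import re

# "n't" as its own token, or a word that stops short of a trailing "n't".
pattern = re.compile(r"n't|\b\w+(?!'t)")

text = "weren't don't wasn't didn't"
tokens = pattern.findall(text)
print(tokens)
# ['were', "n't", 'do', "n't", 'was', "n't", 'did', "n't"]
```

The greedy \w+ first grabs "weren", the negative lookahead rejects that because "'t" follows, and the engine backtracks by one letter to "were".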
In order to understand why your original pattern doesn't work, let's consider a Minimal, Complete, Verifiable Example:
Cutting your pattern down to:
/[a-z]?'[a-z]{1,}|[\w-]+/
Consider how it matches the string:
"weren't"
First, the characters weren are matched by the [\w-]+ portion of the pattern.
Then, the 't characters are matched by the [a-z]?'[a-z]{1,} portion of the pattern.
Fundamentally, having the greedy [\w-]+ section in this pattern will mean it cannot work. This will always match up-to-and-including the "n" in "n't", which means the overall match fails for non-3-letter words.
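The walkthrough above can be confirmed with a short Python check of the reduced pattern:

```python
import re

# The cut-down pattern from the question: an apostrophe suffix, or a run of
# word characters and hyphens.
reduced_pattern = re.compile(r"[a-z]?'[a-z]{1,}|[\w-]+")

parts = reduced_pattern.findall("weren't")
print(parts)
# ['weren', "'t"]
```

At the start of the string the first alternative cannot match (there is no apostrophe within reach), so the second alternative consumes "weren", leaving only "'t" for the next match.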

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously, if I'm here, the invalid ones are accepted by the regex. The goal of this regex is to accept either 1 to 4 digits with an optional letter, or a single letter, or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
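The difference the anchors make can be checked against the expected results from the question. This is a Python sketch; the constant and function names are chosen for this example only:

```python
import re

# Original pattern from the question: no anchors on the first two options.
UNANCHORED = re.compile(r"([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)")
# Anchored pattern from the answer.
ANCHORED = re.compile(r"^[0-9]{0,4}[A-Z]?$")


def is_valid_apartment(value: str) -> bool:
    """Return True if the whole value is up to 4 digits plus an optional letter."""
    return ANCHORED.search(value) is not None


for sample in ["1234A", "Z", "", "A1234", "fhfdsahds527523832dvhsfdg"]:
    # Show side by side what each pattern accepts.
    print(repr(sample), bool(UNANCHORED.search(sample)), is_valid_apartment(sample))
```

The unanchored pattern accepts all five samples because it only needs to find a digit run or a capital letter somewhere in the input, while the anchored one rejects the last two.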
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This looks for a digit, 0 to 4 times, at the beginning of the string
[A-Za-z] = This looks for a single letter (either case)
(?!.*[0-9]) = This only allows the letter if there are no digits anywhere after it.
I haven't quite figured out how to validate against a null character, but that might be easier done using tools from whatever language you are using. Something along this logic:
if String doesn't equal $null then check the Regex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments

issue in a regexp

I'm using the following expression:
/^[alopinme]{5}$/
This regexp gives me the words, from a set of words, that are made up only of the letters within the brackets.
Now I need to add some more functionality to this expression, because the fetched words may contain ONLY one more letter from another set of letters. Let's say that I want to get words formed with letters from set A that may also contain one letter from set B.
I'm trying to work out how to complete my regular expression, but I cannot find the right way.
Could anyone help me?
Thanks.
EDIT:
Here i post an example:
SELECT sin_acentos FROM Finder.palabras_esp WHERE sin_acentos REGEXP '^[tehsolm]{5}$'
This expression selects words like: helms, moths, meths, homes and so on.
But I need to add a set B of letters and get words that may contain ONLY one letter from that set. Let's say I have another set of letters [xzk], so the expression could return more words, but with at most one letter chosen from set B.
The result could include words like: mozes, hoxes, tozes, and so on. If you check these words, you can see that most of the letters in every word are from set A and only one is from set B.
If one of the other characters may appear at most once, you can use:
^(?=.{5}$)[alopinme]*(?:[XYZ][alopinme]*)?$
(?=.{5}$) - Check that the string is 5 characters long, even before matching. (This might not work on MySQL.)
[alopinme]* - Characters from A
(?:[XYZ][alopinme]*)? - Optional - one character from B, and some more from A.
Working example: http://rubular.com/r/aw6l561Int
Or, if you want to allow them up to 3 times, for example:
^(?=.{5}$)[alopinme]*(?:[XYZ][alopinme]*){0,3}$
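Both lookahead patterns can be tried in Python, whose re module supports lookaheads. This is a sketch; the test strings are invented, and set B is written in upper case exactly as in the patterns above:

```python
import re

# Five characters in total, letters from set A, at most one letter from set B.
at_most_one = re.compile(r"^(?=.{5}$)[alopinme]*(?:[XYZ][alopinme]*)?$")
# Same idea, allowing up to three letters from set B.
at_most_three = re.compile(r"^(?=.{5}$)[alopinme]*(?:[XYZ][alopinme]*){0,3}$")

for word in ["lemon", "leXon", "lXXon", "melona"]:
    # Print which of the two patterns accepts each word.
    print(word, bool(at_most_one.search(word)), bool(at_most_three.search(word)))
```

Here "lemon" and "leXon" pass both patterns, "lXXon" passes only the {0,3} variant, and "melona" fails both because the lookahead requires exactly five characters.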
Since the words that you are looking for are all five characters long, I can think of a rather ugly expression that would do the trick: let's say [alopinme] is your base set, and [xyz] is your optional set. Then the expression
/^([alopinmexyz][alopinme]{4}|[alopinme][alopinmexyz][alopinme]{3}|[alopinme]{2}[alopinmexyz][alopinme]{2}|[alopinme]{3}[alopinmexyz][alopinme]|[alopinme]{4}[alopinmexyz])$/
should allow five-letter words of the structure that you are looking for.
In general, a need to count anything makes your regex non-readable. Problems like this one are good to illustrate this point: it is much easier to write /^[alopinmexyz]{5}$/ expression, and add an extra step in code to check that [xyz] appears in the text no more than once. You can even use a regexp to do the additional check:
/^[^xyz]*[xyz]?[^xyz]*$/
The result in SQL would look as follows:
SELECT sin_acentos
FROM Finder.palabras_esp
WHERE sin_acentos REGEXP '^[tehsolmxyz]{5}$' -- Length == 5, all from tehsolm+xyz
AND sin_acentos REGEXP '^[^xyz]*[xyz]?[^xyz]*$' -- No more than one character from xyz
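If the filtering happens in application code rather than in SQL, the "extra step in code" could look like the following Python sketch. The function name and the sample words are assumptions made for this example; the letter sets are the ones used in the SQL above:

```python
import re

# Step 1: exactly five letters, all drawn from set A plus set B.
FIVE_FROM_BOTH = re.compile(r"^[tehsolmxyz]{5}$")
SET_B = set("xyz")


def is_candidate(word: str) -> bool:
    """Five letters from tehsolm+xyz, with no more than one letter from xyz."""
    if FIVE_FROM_BOTH.search(word) is None:
        return False
    # Step 2: count how many letters come from set B.
    return sum(ch in SET_B for ch in word) <= 1


for word in ["helms", "moths", "hoxes", "xoxes", "homes!"]:
    print(word, is_candidate(word))
```

Counting in code keeps the regular expression trivial and makes the limit (here 1) an ordinary number that is easy to change.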

RegExp Find skip letter in the word

I want to find a word even when it is written with a letter skipped.
For example, I want to find
references
I also want to find refrences or refernces, but not refer.
I wrote this regexp:
(\brefe?r?e?n?c?e?s?\b)
I also want to add a check on the length of the matched group: it should be greater than 8.
Can I do this with regexp methods only?
I don't think regex is a good tool for finding similar words the way you are trying to. What do you do if two letters are swapped, like "refernece"? Your regex will not find it.
But to show the regex way to check for the length, you could do this by using a lookahead like this:
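(\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b)
The (?=\w{8,}\b) checks that the first \b is followed by at least 8 word characters ({8,}) before the next word boundary. (With .{8,} instead of \w{8,}, the lookahead can run past the end of the word, so a short word such as refer followed by more text would still match.)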
See it here on Regexr
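The length check can be exercised in Python as well. This sketch writes the lookahead with \w{8,}, so that only the word's own characters are counted, and compares it with a .{8,} lookahead; the sample sentences are made up:

```python
import re

# Lookahead counts word characters only, so it cannot reach past the word.
word_chars = re.compile(r"\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b")
# Lookahead counts any characters, so following text can satisfy it.
any_chars = re.compile(r"\b(?=.{8,}\b)refe?r?e?n?c?e?s?\b")

text = "refer to the refrences, refernces and references"
found = [m.group() for m in word_chars.finditer(text)]
print(found)
# ['refrences', 'refernces', 'references']

short = "refer to something"
print(bool(word_chars.search(short)), bool(any_chars.search(short)))
# False True
```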
I think that using regex is not a good idea. You need more powerful functions. For example, if you are programming in PHP, you need a function like similar_text. More details here: http://www.php.net/manual/en/function.similar-text.php
Basically, you are asking for this (in pseudocode):
input == "references" or (levenshtein("references", input) == 1 and length(input) == (length("references") - 1))
Levenshtein distance is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Since you want to detect only the strings where a char was skipped, you must add the constraint on the string length.
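A Levenshtein distance of 1 combined with a length that is one shorter means exactly one character was deleted, so the condition can be checked without a Levenshtein library. This is a minimal Python sketch with a made-up function name:

```python
def is_word_or_one_letter_skipped(candidate: str, target: str = "references") -> bool:
    """Return True if candidate equals target or is target with one letter removed."""
    if candidate == target:
        return True
    if len(candidate) != len(target) - 1:
        return False
    # Try removing each position of target in turn and compare.
    return any(target[:i] + target[i + 1:] == candidate for i in range(len(target)))


for word in ["references", "refrences", "refernces", "refer", "refernece"]:
    print(word, is_word_or_one_letter_skipped(word))
```

The first three words are accepted; refer is too short and refernece is a letter swap rather than a single skipped letter, so both are rejected.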