Regex for this particular pattern - regex

I have three different things
xxx
xxx>xxx
xxx>xxx>xxx
Where xxx can be any combination of letters and number
I need a regex that can match the first two but NOT the third.

To match ASCII letters and digits try the following:
^[a-zA-Z0-9]{3}(>[a-zA-Z0-9]{3})?$
If letters and digits outside of the ASCII character set are required then the following should suffice:
^[^\W_]{3}(>[^\W_]{3})?$

^\w+(?:>\w+)?$
matches an entire string.
\w+(?:>\w+)?\b(?!>)
matches strings like this in a larger substring.
If you want to exclude the underscore from matching, you can use [\p{L]\p{N}] instead (if your regex engine knows Unicode), or [^\W_] if it doesn't, as a substitute for \w.

Related

Regex that only allows empty string, letters, numbers or spaces?

Need help coming up with a regex that only allows numbers, letters, empty string, or spaces.
^[_A-z0-9]*((-|\s)*[_A-z0-9])*$
This one is the closest I've found but it allows underscores and hyphen.
Only letters, numbers, space, or empty string?
Then 1 character class will do.
^[A-Za-z0-9 ]*$
^ : start of the string or line (depending on the flag)
[A-Za-z0-9 ]* : zero or more upper-case or lower-case letters, or digits, or spaces.
$ : end of the string or line (depending on the flag)
The A-z range contains more than just letters.
You can see that in the ASCII table.
And \s for whitespace also includes tabs or linebreaks (depending on the flag).
But if you also want those, then just use that instead of the space.
^[A-Za-z0-9\s]*$
Also, depending on the regex engine/dialect that your language/tool uses, you could use \p{L} for any unicode letter.
Since [A-Za-z] only includes the normal ascii letters.
Reference here
Your regex is too complicated for what you need.
the first part is fine, you are allowing letter and number, you could simply add the space character with it.
Then, if you use the * character, which translate to 0 or any, you could take care of your empty string problem.
See here.
/^[a-z0-9 ]*$/gmi
Notice here that i'm not using A-z like you were because this translate to any character between the A in ascii (101) and the z(172). this mean it will also match char in between (133 to 141 that are not number nor letter). I've instead use a-z which allow lowercase letter and used the flag i which tell the regex to not take care of the case.
Here is a visual explanation of the regex
You can also test more cases in this regex101
Matching only certain characters is equivalent to not matching any other character, so you could use the regex r = /[^a-z\d ]/i to determine if the string contains any character other than the ones permitted. In Ruby that would be implemented as follows.
"aBc d01e e$9" !~ r #=> false
"aBc d01e ex9" !~ r #=> true
In this situation there may not much to choose between this approach and attempting to match /\A[a-z\d ]+\z/i, but in other situations the use of a negative match can simplify the regex considerably.

Regex that excludes spaces and requires 2 capital letters or more

I'm trying to create a regular expression that matches strings with:
19 to 90 characters
symbols
at least 2 uppercase alphabetical characters
lowercase alphabetical characters
no spaces
I already know that for the size and space exclusion the regex would be:
^[^ ]{19,90}$
And I know that this one will match any a string with at least 2 uppercase characters:
^(.*?[A-Z]){2,}.*$
What I don't know is how to combine them. There is no context for the strings.
Edit: I forgot to say that it is better ifthe regex excludes strings that finish with a .com or .jpeg or .png or any .something (that "something" being of 2-5 characters).
This regex should do what you want.
^(?=(?:\w*\W+)+\w*$)(?=(?:\S*?[A-Z]){2,}\S*?$)(?=(?:\S*?[a-z])+\S*?$)(?!.*?\.\w{2,5}$).{19,90}$
Basically it uses three positive lookaheads and a negative lookahead to guarantee the conditions that you specified:
(?=(?:\w*\W+)+\w*$)
ensures that there is at least one non-word (symbol) character
(?=(?:\S*?[A-Z]){2,}\S*?$)
ensures that there are at least two uppercase characters, and also excludes a match if there are any spaces in the string
(?=(?:\S*?[a-z])+\S*?$)
ensures that there is at least one lowercase character in the string. The negative lookahead
(?!.*?\.\w{2,5}$)
ensures that strings that end with a . and 2-5 characters are excluded
Finally,
.{19.90}
performs the actual match and ensures that there are between 19 and 90 characters.
Following your requrements, I suggest to use the following pattern:
^(?=.*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[^\s]).{19,90}$
Demo
Instead of just excluding spaces, I used \ssince you probably don't want allow tabs, newlines, etc. either. However, it is still unclear which symbols you want to allow, e.g. [a-zA-Z!"§$%&\/()=?+]
^(?=.*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[^\s])(?=[a-zA-Z!"§$%&\/()=?+]).{19,90}$
To match your additional requirement not to match file-like extensions at the end of the string, add a negative look-ahead: (?!.*\.\w{2,5}$)
^(?=.*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[^\s])(?=[a-zA-Z!"§$%&\/()=?+]).{19,90}$
Demo2
You can use backreferences as described here: https://www.ocpsoft.org/tutorials/regular-expressions/and-in-regex/
Another reference with examples here: https://www.regular-expressions.info/refcapture.html

How to search for a negated Unicode backreference (PCRE)?

I'd like to find two non-identical Unicode words separated by a colon using a PCRE regex.
Take for example, this string:
Lôrem:ipsüm dõlör:sït amêt:amêt cønsectetûr:cønsectetûr âdipiscing:elït
I can easily find the two identical words separated by a colon using:
(\p{L}+):(\1)
which will match: cønsectetûr:cønsectetûr and amêt:amêt
However, I want to negate the backreference to find only non-identical Unicode words separated by a colon.
What's the proper way to negate a backreference in PCRE?
(\p{L}+):(^\1) obviously does not work.
You start by using a negative lookahead to prevent a match if the captured part repeats after the colon:
(\p{L}+):(?!\1)
Then you need to match the second unicode word, another \p{L}+:
(\p{L}+):(?!\1)\p{L}+
And last, to prevent false matches, use word boundaries:
\b(\p{L}+):(?!\1\b)\p{L}+\b
regex101 demo

Regex to convert words in TitleCase

I use this regex to convert words in TitleCase and confirm each substitution:
:s/\%V\<\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]\)\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]*\)\>/\u\1\L\2/gc
However this matches also the words who are already in Titlecase.
Does anyone know how to change the above regex in order to jump over words who are already in TitleCase?
:s/\%V\<\([a-z0-9àäâæèéëêìòöôœùüûç]\)\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]*\)\>/\u\1\L\2/gc
seems to do the trick, here.
Because you have explicitely included uppercase characters in the range you use in the first letter capture group, your pattern is going to match both foo and Foo. Removing the uppercase characters from that range seems to resolve your immediate problem.
To match only non-titlecase words, you want to match those that start either (a) with a lowercase letter or (b) with two uppercase letters. The following will do it (add accented letters and digits to taste):
\b([A-Z])([A-Z][A-Za-z]*)|\b([a-z])([a-zA-Z]+)
But some words match at groups \1 and \2, others at \3 and \4. I don't use vim so I can't say if it'll let you substitute with this kind of pattern. (E.g., \u\1\3\L\2\4; only two of the four will ever be non-empty)

Confused about regex pattern

What is the difference between search pattern like [a-zA-Z][a-zA-Z]* and [a-zA-Z]* ?
The first matches one [a-zA-Z] followed by zero or more [a-zA-Z].
The second matches zero or more [a-zA-Z].
The first can also be written as [a-zA-Z]+.
The regex [a-zA-Z][a-zA-Z]* means that you are mandating that there should be one alpabetic character optionally followed by any number of alphabets. On the other hand, [a-zA-Z]* means that the alphabet mandate is entirely off.
For example, your first regex matches the strings azxxx, abccdef but fails 2abcd, 22 and blank strings. But the second regex can match a blank string too.
For the first regex, you may just want to say: [a-zA-Z]+ instead.