Regular expression for non-consecutive characters - regex

I'm trying to create a regular expression that validates the following requirements:
Simultaneous use of Cyrillic and numbers is possible (without spaces and special characters)
Simultaneous use of Latin and numbers is possible (without spaces and special characters)
Simultaneous use of Cyrillic and Latin characters is not possible
The first letter must be capitalized, cannot be a number
Sequence length - from 2 to 16 digits inclusive
It is impossible to use 3 or more identical symbols in a row
I am using the following solution:
(?:([A-Z][A-Za-z0-9]{1,15}|[А-Я][А-ЯЁа-яё0-9]{1,15}))$
How do I change the regex to match the last requirement?
I use Google Sheets, in which it is impossible to use negative lookahead.
Sorry for my English.

I don't you can do this with a single regex without lookbehinds.
But there are workarounds for the "don't repeat same character 3 times" functionality.
The workarounds could be simpler if RE2 supported backreferences, but it does not. So the resulting rule will be longer.
You may define a column ValidNoThreeRepeats like this:
=
NOT(
OR(
AND(MID(A1;1 ;1)=MID(A1;2 ;1);MID(A1;2 ;1)=MID(A1;3 ;1));
AND(MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)=MID(A1;4 ;1));
AND(MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)=MID(A1;5 ;1));
AND(MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)=MID(A1;6 ;1));
AND(MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)=MID(A1;7 ;1));
AND(MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)=MID(A1;8 ;1));
AND(MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)=MID(A1;9 ;1));
AND(MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)=MID(A1;10;1));
AND(MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)=MID(A1;11;1));
AND(MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)=MID(A1;12;1));
AND(MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)=MID(A1;13;1));
AND(MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)=MID(A1;14;1));
AND(MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)=MID(A1;15;1))
)
)
Or in a compacted way like this:
=NOT(OR(AND(MID(A1;1 ;1)=MID(A1;2 ;1);MID(A1;2 ;1)=MID(A1;3 ;1));AND(MID(A1;2 ;1)=MID(A1;3 ;1);MID(A1;3 ;1)=MID(A1;4 ;1));AND(MID(A1;3 ;1)=MID(A1;4 ;1);MID(A1;4 ;1)=MID(A1;5 ;1));AND(MID(A1;4 ;1)=MID(A1;5 ;1);MID(A1;5 ;1)=MID(A1;6 ;1));AND(MID(A1;5 ;1)=MID(A1;6 ;1);MID(A1;6 ;1)=MID(A1;7 ;1));AND(MID(A1;6 ;1)=MID(A1;7 ;1);MID(A1;7 ;1)=MID(A1;8 ;1));AND(MID(A1;7 ;1)=MID(A1;8 ;1);MID(A1;8 ;1)=MID(A1;9 ;1));AND(MID(A1;8 ;1)=MID(A1;9 ;1);MID(A1;9 ;1)=MID(A1;10;1));AND(MID(A1;9 ;1)=MID(A1;10;1);MID(A1;10;1)=MID(A1;11;1));AND(MID(A1;10;1)=MID(A1;11;1);MID(A1;11;1)=MID(A1;12;1));AND(MID(A1;11;1)=MID(A1;12;1);MID(A1;12;1)=MID(A1;13;1));AND(MID(A1;12;1)=MID(A1;13;1);MID(A1;13;1)=MID(A1;14;1));AND(MID(A1;13;1)=MID(A1;14;1);MID(A1;14;1)=MID(A1;15;1))))
The idea is to have a rule that compares 1st, 2nd and 3rd character, then another rule that compares 2nd, 3rd, 4th, then another rule for 3rd, 4th, 5th, and so on and so forth. You join this rules with an OR, since if any of those match, it means that at some place some repetition exists. Finally, you negate the whole expresion with a NOT
Than you can check that both your regex and that column are valid.

Donno with which script language you're using
If's in PHP code form,I'd be using `Filter_var($param1, FILTER_VALIDATE..., FILTER_FLAG..)` if i were in your shoes .
It makes your way into both **validating** n **sanitizing** your snippet.
**PEACE**.

Related

Can you restrict two characters based on their ASCII order in regex?

Let's say I have a string of 2 characters. Using regex (as a thought exercise), I want to accept it only if the first character has an ascii value bigger than that of the second character.
ae should not match because a is before e in the the ascii table.
ea, za and aA should match for the opposite reason
f$ should match because $ is before letters in the ascii table.
It doesn't matter if aa or a matches or not, I'm only interested in the base case. Any flavor of regex is allowed.
Can it be done ? What if we restrict the problem to lowercase letters only ? What if we restrict it to [abc] only ? What if we invert the condition (accept when the characters are ordered from smallest to biggest) ? What if I want it to work for N characters instead of 2 ?
I guess that'd be almost impossible for me to do it then, however bobble-bubble impressively solved the problem with:
^~*\}*\|*\{*z*y*x*w*v*u*t*s*r*q*p*o*n*m*l*k*j*i*h*g*f*e*d*c*b*a*`*_*\^*\]*\\*\[*Z*Y*X*W*V*U*T*S*R*Q*P*O*N*M*L*K*J*I*H*G*F*E*D*C*B*A*#*\?*\>*\=*\<*;*\:*9*8*7*6*5*4*3*2*1*0*\/*\.*\-*,*\+*\**\)*\(*'*&*%*\$*\#*"*\!*$(?!^)
bobble bubble RegEx Demo
Maybe for abc only or some short sequences we would approach solving the problem with some expression similar to,
^(abc|ab|ac|bc|a|b|c)$
^(?:abc|ab|ac|bc|a|b|c)$
that might help you to see how you would go about it.
RegEx Demo 1
You can simplify that to:
^(a?b?c?)$
^(?:a?b?c?)$
RegEx Demo 2
but I'm not so sure about it.
The number of chars you're trying to allow is irrelevant to the problem you are trying to solve:
because you can simply add an independent statement, if you will, for that, such as with:
(?!.{n})
where n-1 would be the number of chars allowed, which in this case would be
(?!.{3})^(?:a?b?c?)$
(?!.{3})^(a?b?c?)$
RegEx Demo 3
A regex is not the best tool for the job.
But it's doable. A naive approach is to enumerate all the printable ascii characters and their corresponding lower range:
\x21[ -\x20]|\x22[ -\x21]|\x23[ -\x22]|\x24[ -\x23]|\x25[ -\x24]|\x26[ -\x25]|\x27[ -\x26]|\x28[ -\x27]|\x29[ -\x28]|\x2a[ -\x29]|\x2b[ -\x2a]|\x2c[ -\x2b]|\x2d[ -\x2c]|\x2e[ -\x2d]|\x2f[ -\x2e]|\x30[ -\x2f]|\x31[ -\x30]|\x32[ -\x31]|\x33[ -\x32]|\x34[ -\x33]|\x35[ -\x34]|\x36[ -\x35]|\x37[ -\x36]|\x38[ -\x37]|\x39[ -\x38]|\x3a[ -\x39]|\x3b[ -\x3a]|\x3c[ -\x3b]|\x3d[ -\x3c]|\x3e[ -\x3d]|\x3f[ -\x3e]|\x40[ -\x3f]|\x41[ -\x40]|\x42[ -\x41]|\x43[ -\x42]|\x44[ -\x43]|\x45[ -\x44]|\x46[ -\x45]|\x47[ -\x46]|\x48[ -\x47]|\x49[ -\x48]|\x4a[ -\x49]|\x4b[ -\x4a]|\x4c[ -\x4b]|\x4d[ -\x4c]|\x4e[ -\x4d]|\x4f[ -\x4e]|\x50[ -\x4f]|\x51[ -\x50]|\x52[ -\x51]|\x53[ -\x52]|\x54[ -\x53]|\x55[ -\x54]|\x56[ -\x55]|\x57[ -\x56]|\x58[ -\x57]|\x59[ -\x58]|\x5a[ -\x59]|\x5b[ -\x5a]|\x5c[ -\x5b]|\x5d[ -\x5c]|\x5e[ -\x5d]|\x5f[ -\x5e]|\x60[ -\x5f]|\x61[ -\x60]|\x62[ -\x61]|\x63[ -\x62]|\x64[ -\x63]|\x65[ -\x64]|\x66[ -\x65]|\x67[ -\x66]|\x68[ -\x67]|\x69[ -\x68]|\x6a[ -\x69]|\x6b[ -\x6a]|\x6c[ -\x6b]|\x6d[ -\x6c]|\x6e[ -\x6d]|\x6f[ -\x6e]|\x70[ -\x6f]|\x71[ -\x70]|\x72[ -\x71]|\x73[ -\x72]|\x74[ -\x73]|\x75[ -\x74]|\x76[ -\x75]|\x77[ -\x76]|\x78[ -\x77]|\x79[ -\x78]|\x7a[ -\x79]|\x7b[ -\x7a]|\x7c[ -\x7b]|\x7d[ -\x7c]|\x7e[ -\x7d]|\x7f[ -\x7e]
Try it online!
A (better) alternative is to enumerate the ascii characters in reverse order and use the ^ and $ anchors to assert there is nothing else unmatched. This should work for any string length:
^\x7f?\x7e?\x7d?\x7c?\x7b?z?y?x?w?v?u?t?s?r?q?p?o?n?m?l?k?j?i?h?g?f?e?d?c?b?a?`?\x5f?\x5e?\x5d?\x5c?\x5b?Z?Y?X?W?V?U?T?S?R?Q?P?O?N?M?L?K?J?I?H?G?F?E?D?C?B?A?#?\x3f?\x3e?\x3d?\x3c?\x3b?\x3a?9?8?7?6?5?4?3?2?1?0?\x2f?\x2e?\x2d?\x2c?\x2b?\x2a?\x29?\x28?\x27?\x26?\x25?\x24?\x23?\x22?\x21?\x20?$
Try it online!
You may replace ? with * if you want to allow duplicate characters.
ps: some people can come up with absurdly long regexes when they aren't the right tool for the job: to parse email, html or the present question.

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space where ever there is any of the characters
-:\*_/;, present for example JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should beJET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some words exceptions like a/c w/d etc. \D conditions given to avoid date/time values getting separated, but this created an issue, the numbers followed by the above mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)

Particular substring length in regular expression

I want to check, if given string is in this format:
Substring of \w chars can be delimited (not start or end) by , - or .. And there can be more than 2 delimiters. For example weerwer, as-sas.a are valid. -assa, a-s a-s, asd#d are not. For that I use ^\w+([ .-]\w+){0,2}$. Seems to work.
Whole string, that matches regexp above, I want to restrict to length of 8. For example asd, asd-asd are valid. asd-asd-asd is not. And that is my question. How to do that?
This is only in case, the solution depends also in another restriction I need. I need the substrings passed the above two to be able to be delimited by /. For example asd-asd/asd.asd/asdasd/asd asd/aaaaaa is valid. Pretty much same pattern as in 1. but not restricted to number of delimiters. I put it here only in case the 2. depends on it.
To close it here, I'm posting solution based on #sareed`s comment. Thank you.
I changed some things...
1. \w was changed to alphabetic and decimal numbers. Also _ delimiter was added.
2. Length of substring above was restricted to 16.
3. Number of / delimiters was restricted to 6.
Pattern:
\A(?=.{1,16}(\z|/))(((\p{Alphabetic}|\p{Numeric_Type=Decimal})+)([\. _-]((\p{Alphabetic}|\p{Numeric_Type=Decimal})+)){0,2})(/(?=.{1,16}(\z|/))(((\p{Alphabetic}|\p{Numeric_Type=Decimal})+)([\. _-]((\p{Alphabetic}|\p{Numeric_Type=Decimal})+)){0,2})){0,6}\z
It's not pretty but should work:
/^((?=[^/]{1,8}(\/|$))\w+([ .\-]\w+)*)(\/(?=[^/]{1,8}(\/|$))\w+([ .\-]\w+)*)*$/

Grep for Pattern in File in R

In a document, I'm trying to look for occurences of a 12-digit string which contains alpha and numerals. A sample string is: "PXB111X2206"
I'm trying to get the line numbers that contain this string in R using the below:
FileInput = readLines("File.txt")
prot_pattern="([A-Z0-9]{12})";
prot_string<-grep(prot_pattern,FileInput)
prot_string
This worked fine until it hit a document containing all upper-case titles and returned a line containing the word "CONCENTRATIO"
The string I am trying to look for is: "PXB111X2206". I am expecting the grep to return the line numbers containing the string : "PXB111X2206". It however is returning the line number containing the word: "CONCENTRATIO"
What is wrong with my expression above? Any idea what I am doing wrong here?
Here is some sample input:
Each design objective described herein is significantly important, yet it is just one aspect of what it takes to achieve a successful project.
A successful project is one where project goals are identified early on and where the >interdependencies of all building systems are coordinated concurrently from the planning and programming phase.
CONCENTRATION:
The areas of concentration for design objectives: accessible, aesthetics, cost effective, >functional/operational, historic preservation, productive, secure/safe, and sustainable and >their interrelationships must be understood, evaluated, and appropriately applied.
Each of these design objectives is presented in the design objectives document number. >PXB111X2206.
>
Thanks & Regards,
Simak
You are using a very powerful tool for a very simple task, the expression
[A-Z0-9]{12}
will match any alphanumeric 12 sized uppercased string, for example the word "CONCENTRATIO", however, your "PXB111X2206" is not even 12 symbols long, so it is not possible that is being matched. If you only want to match "PXB111X2206" you only have to use it as a regular expression itself, for example, if you file contents are:
foo
CONCENTRATIO.
bazz
foo bar bazz PXB111X2206 foo bar bazz
foo
bar
bazz
and you use:
grep('PXB111X2206',readLines("File.txt"))
then R will only match line 4 as you would wish.
EDIT
If you are looking for that specific pattern try:
grep('[A-Z]{3}[0-9]{3}[A-Z]{1}[0-9]{4}',readLines("File.txt"))
That expression will match strings like 'AAADDDADDDD' where A is an capital letter, and D a digit, the regular expression contains a group (symbols inside square brackets) and a quantifier (the number inside the brackets) that tells how many of the previous symbol will the expression accept, if no quantifier is present it assumes it is 1.
Let's take a look at what your regular expression means. [A-Z0-9] means any capitalized letter or number and {12} means the previous expression must occur exactly 12 times. The string CONCENTRATIO is 12 capitaized letters, so it's no surprise that grep picks it up. If you want to take out the matches that match to just letters or just numbers you could try something like
allleters <- grep("[A-Z]{12}",strings)
allnumbers <-grep("[0-9]{12}",strings)
both <- grep("[A-Z0-9]{12}",strings)
the matches you wanted would then be something like
both <- both[!both %in% union(allletters,allnumbers)]
Someone with better regexfu might have a more elegant solution, but this will work too.

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that does not only contain numbers? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
#Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
May be a digit at the beginning, then at least one letter, then letters or digits
Try this:
/^.*\D+.*$/
It returns true if there is any simbol, that is not a number. Works fine with all languages.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
if you are trying to match worlds that have at least one letter but they are formed by numbers and letters (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
/\w+/:
Matches any letter, number or underscore. any word character
.*[^0-9]{1,}.*
Works fine for us.
We want to use the used answer, but it's not working within YANG model.
And the one I provided here is easy to understand and it's clear:
start and end could be any chars, but, but there must be at least one NON NUMERICAL characters, which is greatest.
I am using /^[0-9]*$/gm in my JavaScript code to see if string is only numbers. If yes then it should fail otherwise it will return the string.
Below is working code snippet with test cases:
function isValidURL(string) {
var res = string.match(/^[0-9]*$/gm);
if (res == null)
return string;
else
return "fail";
};
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123"; // fail here
console.log(isValidURL(testCase5));
I had to do something similar in MySQL and the following whilst over simplified seems to have worked for me:
where fieldname regexp ^[a-zA-Z0-9]+$
and fieldname NOT REGEXP ^[0-9]+$
This shows all fields that are alphabetical and alphanumeric but any fields that are just numeric are hidden. This seems to work.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed