Regular expression (alphanumeric) - regex

I need a regular expression to allow the user to enter an alphanumeric string that starts with a letter (not a digit).

This should work in any of the Regular Expression (RE) engines. There is a nicer syntax in the PCRE world but I prefer mine to be able to run anywhere:
^[A-Za-z][A-Za-z0-9]*$
Basically, the first character must be a letter, followed by zero or more alphanumerics. The start and end anchors are there to ensure that the whole line is matched. Without those, you may match the AB12 of the "###AB12!!!" string.
Full explanation:
^ start anchor.
[A-Za-z] any one of the upper/lower case letters.
[A-Za-z0-9] any one of the upper/lower case letters or digits,
* repeated zero or more times.
$ end anchor.
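A quick way to see what the anchors buy you is to run the pattern with and without them. This is only a minimal sketch in Python (my choice for illustration, the regex itself is engine-neutral), and the helper name is_valid_identifier is made up:

```python
import re

# The pattern from above: one letter, then zero or more letters or digits.
ANCHORED = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")
# Same pattern without the anchors, for comparison only.
UNANCHORED = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_identifier(text: str) -> bool:
    """Return True if text starts with a letter and is alphanumeric throughout."""
    return ANCHORED.search(text) is not None


print(is_valid_identifier("AB12"))        # True
print(is_valid_identifier("1AB2"))        # False: starts with a digit
print(is_valid_identifier("###AB12!!!"))  # False: the anchors reject the junk
print(UNANCHORED.search("###AB12!!!").group())  # AB12
```

In Python specifically, $ also matches just before a trailing newline, so re.fullmatch (or \Z instead of $) is the stricter choice if that matters to you.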
Update:
As Richard Szalay rightly points out, this is ASCII only (or, more correctly, any encoding scheme where the A-Z, a-z and 0-9 groups are contiguous) and only for the "English" letters.
If you want true internationalized REs (only you know whether that is a requirement), you'll need to use one of the more appropriate RE engines, such as the PCRE mentioned above, and ensure it's compiled for Unicode mode. Then you can use "characters" such as \p{L} and \p{N} for letters and numerics respectively. I think the RE in that case would be:
^\p{L}[\pL\pN]*$
but I'm not certain. I've never used REs for our internationalized software. See here for more than you ever wanted to know about PCRE.
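Not every engine accepts \p{L}. Python's built-in re module, for one, rejects it, so the sketch below swaps in \w-based classes instead. Treat it as an approximation only: [^\W\d_] means "word character that is neither a decimal digit nor an underscore", which is close to \p{L} but not identical. The function name is made up.

```python
import re

# Approximation of ^\p{L}[\p{L}\p{N}]*$ for Python's built-in re module,
# which has no \p{...} classes. This is NOT an exact equivalent.
#   [^\W\d_]  word character that is not a decimal digit or an underscore
#   [^\W_]    word character that is not an underscore
UNICODE_ALNUM = re.compile(r"[^\W\d_][^\W_]*")


def starts_with_letter_unicode(text: str) -> bool:
    """Return True if the whole string fits the approximate pattern."""
    return UNICODE_ALNUM.fullmatch(text) is not None


for candidate in ["Zoë2", "élan9", "9élan", "a_b"]:
    print(candidate, starts_with_letter_unicode(candidate))
```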

I think this should do the work:
^[A-Za-z][A-Za-z0-9]*$

You're looking for a pattern like this:
^[a-zA-Z][a-zA-Z0-9]*$
That one requires one letter and any number of letters/numbers after that. You may want to adjust the allowed lengths.

Related

VBScript regex to support all accented characters

I have a below regex in VBScript, Pattern:
^(?=.*[a-z])(?=.*[A-Z])(?!.*\s)(?=.*[0-9])(?=.*[!##\$&\*])(?=.{8,20}$)
This validates: length between 8 and 20, with at least one lowercase letter, one capital letter, one special character and one digit.
Issue#1
When I enter à, it passes the validation, which shouldn't happen. How can I restrict it?
Issue#2
Later, I realized I can use a keyboard of any language, so I modified my regex to support all accented letters, but it's not working either. Pattern:
^(?=.*\\p{L})(?!.*\s)(?=.*[0-9])(?=.*[!##\$&\*])(?=.{8,20}$)
Does VBScript allow \p{L} in a regex? Is there any alternative?
Your current pattern actually does not validate à. But it is slightly off and won't implement what you have in mind. Try this instead:
^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!##\$&*])[A-Za-z0-9!##$&*]{8,20}$
^^^ important
This says to assert that there is at least one:
lowercase letter
uppercase letter
digit
special character (!##$&*)
Then, it matches any of the above four types of characters 8 to 20 times.
The critical problem with your pattern, and the reason it would admit accented characters provided the other assertions pass, is this:
(?=.{8,20})
Your final lookahead does enforce 8 to 20 characters, but it admits any character. Instead, I use a range limited only to the possible types of characters you want to appear.
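To make the difference concrete, here is a small sketch that runs both patterns over a few candidates. It uses Python rather than VBScript purely for illustration, so engine details may differ. Both patterns are copied exactly as printed above, including the special-character class, and the names STRICT and LOOSE are mine.

```python
import re

# Corrected pattern: the final character class limits what may appear at all.
STRICT = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!##\$&*])[A-Za-z0-9!##$&*]{8,20}$"
)
# Pattern from the question: the last lookahead only counts characters.
LOOSE = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?!.*\s)(?=.*[0-9])(?=.*[!##\$&\*])(?=.{8,20}$)"
)

results = {}
for candidate in ["Abcdef1!", "Abcdéf1!", "abcdef1!"]:
    results[candidate] = (
        STRICT.search(candidate) is not None,
        LOOSE.search(candidate) is not None,
    )
    print(candidate, results[candidate])

# Abcdef1! (True, True)
# Abcdéf1! (False, True)    the accented letter slips through the loose pattern
# abcdef1! (False, False)   no uppercase letter
```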

Here a word is a string of letters, preceded and followed by nonletters

I asked this question earlier but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line. Here a word is a string of letters, preceded and followed by nonletters.
I tried every example that was given, but I get syntax errors when using them.
Does anyone know how I can solve this?
Thanks
If your regex flavor supports it, you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non-letter character (any character except a letter)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
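One thing worth checking when choosing between the two forms: lookarounds do not consume the neighbouring characters, while the group-based version does. The following sketch (Python, chosen only for illustration) shows what that means when words are separated by a single character:

```python
import re

text = "foo bar baz"

# Lookarounds only inspect the neighbours, they do not consume them.
lookaround = re.findall(r"(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])", text)

# The group-based version consumes the separator after each word, so the
# next word has no separator left in front of it and is skipped.
grouped = re.findall(r"(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])", text)

print(lookaround)  # ['foo', 'bar', 'baz']
print(grouped)     # ['foo', 'baz']
```

So if your engine has lookarounds, they are the safer option for text where words sit one character apart.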
In order to be Unicode compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, such as whitespace, certain punctuation or terminal positions. Your requirement suggests you want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look at what precedes or follows the letters. This already captures all letters and only letters. The greedy quantifier (+) ensures that nothing is left out of a capture.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
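A short sketch of the difference between the two patterns, in Python (picked only for illustration), using the "1example2" case mentioned above:

```python
import re

sample = "1example2 plain words"

# \b treats digits as word characters, so there is no boundary inside 1example2.
bounded = re.findall(r"\b[A-Za-z]+\b", sample)
# Without \b, every maximal run of letters is returned.
unbounded = re.findall(r"[A-Za-z]+", sample)

print(bounded)    # ['plain', 'words']
print(unbounded)  # ['example', 'plain', 'words']
```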
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at terminals (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Test here. Note that this uses positive lookarounds around negated character classes, not negative lookarounds, since those would also match at the string terminals (which are, to the regex engine as much as to me, non-letters).
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
When talking about regex, you need to be extremely specific and accurate in your requirements.
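To illustrate how the terminal positions are treated, here is a small sketch in Python (chosen for illustration only) comparing the positive lookarounds above with the negative ones from the other answer:

```python
import re

text = "alpha 1beta2 gamma"

# Positive lookarounds demand an actual non-letter on both sides, so words
# touching the start or end of the string are left out.
inner_only = re.findall(r"(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])", text)

# Negative lookarounds are also satisfied at the string terminals.
with_terminals = re.findall(r"(?<![A-Za-z])[A-Za-z]+(?![A-Za-z])", text)

print(inner_only)      # ['beta']
print(with_terminals)  # ['alpha', 'beta', 'gamma']
```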

Block all caps sentences

I'm trying to use a regular expression to block all caps sentences (sentences with only capital letters) but I can't find the right pattern. I was thinking about ^[a-z] but this doesn't work at all.
Any suggestion?
You can perhaps use something like this to make sure there's at least one lowercase character (note that this is a kind of reverse logic):
^.*[a-z].*$
(Unless the function you're using matches the regex against the whole string by default, you can drop the beginning and end of line anchors.)
If you want the regex to be more strict (though I don't think that's very practical here), you can perhaps use something of the sort...
^[A-Z.,:;/() -]*[A-Z]+[A-Z.,:;/() -]*$
This allows only uppercase letters, spaces and some punctuation (you can add or remove characters from the character classes as you need).
Simply look for [a-z]... If that matches, your sentence passes. If not, it is all caps (or punctuation).
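As a minimal sketch of that reverse logic in Python (the language and the function name is_all_caps are my own choices for illustration):

```python
import re


def is_all_caps(sentence: str) -> bool:
    """Treat a sentence as all caps when it has no lowercase ASCII letter."""
    return re.search(r"[a-z]", sentence) is None


print(is_all_caps("STOP SHOUTING!"))  # True
print(is_all_caps("Stop shouting"))   # False
print(is_all_caps("123 !!!"))         # True: no letters at all also counts
```

The last case is the "or punctuation" caveat: a string without any letters is reported as all caps too, so add a check for [A-Z] if that matters.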
It depends on what flavour of regex you're using but if you have one that supports lookaheads then you can use the following expression:
(?-i)^(?!(?=.*?[A-Z])(?:[A-Z]|(?i)[^a-z])*$)
It won't capture anything but will return false if the letters used are all in caps, and return true if any of the letters used are lower case.
Can't ^[A-Z]+$ simply suit your needs? If it matches, it means that the input string contains only capital letters.
Demo on RegExr.
The following regex
(^|\.)[[:space:]A-Z]+\.
will find any sentence containing only uppercase letters and whitespace, starting at the start of the line or at the preceding full stop and ending at the next full stop.
It appears that you want to detect sentences that have words that have upper case nested inside the word, ex: hEllo, gOODbye, worD; that is any word that has an uppercase after a lower case, or any word with two or more uppercase beside each other.
uppercase after lowercase
[a-z][A-Z]
two or more paired uppercase
[A-Z][A-Z]
Combine them with alternation:
/([a-z][A-Z]|[A-Z][A-Z])/
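Here is a quick sketch of the two alternatives joined together, written without the surrounding delimiters and run in Python (chosen only for illustration) against the example words:

```python
import re

# Uppercase after lowercase, or two uppercase letters side by side.
MIXED_CASE = re.compile(r"[a-z][A-Z]|[A-Z][A-Z]")

words = ["hEllo", "gOODbye", "worD", "Hello", "hello"]
flagged = [word for word in words if MIXED_CASE.search(word)]

print(flagged)  # ['hEllo', 'gOODbye', 'worD']
```

Note that a fully capitalised word such as HELLO is flagged as well, because of the [A-Z][A-Z] branch.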

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a lookahead to validate the input before attempting the match. It looks for between 3 and 10 characters from the class [\d-], followed by a dash and a digit. After that comes the actual match, which confirms that the string has the form digits(dash)digits(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
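If you want to reproduce that list yourself, a sketch along these lines should do it. It is written in Python rather than the RegExr tester, purely for illustration, and tests each sample string on its own:

```python
import re

PATTERN = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

samples = [
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22",
]

# Keep only the samples in which the pattern finds a match.
matched = [s for s in samples if PATTERN.search(s)]
print(matched)
```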
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
The patterns are also built to fail fast:
I have chosen to start them with a digit \d; this way each beginning of a line or word boundary not followed by a digit is immediately discarded. Also, having consumed exactly one digit, I know the maximum length of the remaining string.
Then I test the upper limit of the string length with a negative lookahead that tests whether there is one more character than the maximum length (if there are 12 characters at this position, there are at least 13 characters in the string). There is no need to use anything more descriptive than the dot metacharacter here; the goal is to quickly test the length.
Finally, I describe the end of the string without doing anything particular. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
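As a rough check of the multiline version against the sample text from the question, here is a sketch in Python (my choice for illustration, where re.MULTILINE plays the role of the m modifier):

```python
import re

FAIL_FAST = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)

text = "\n".join([
    "----------", "22-22-1", "22-22-22", "333-333-1", "333-4444-1",
    "4444-4444-1", "4444-55555-1", "55555-4444-1", "666666-7777777-1",
    "88888888-88888888-1", "1-1-1", "88888888-88888888-22",
    "22-333-", "333-22", "----------",
])

# With no capturing groups, findall returns the full text of every match.
found = FAIL_FAST.findall(text)
print(found)
```

The length check is what enforces the 10-digit limit: 10 digits plus two hyphens is 12 characters, so 1-12345678-1 passes and 1-123456789-1 does not.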

Regex code question

I'm new to this site and don't know if this is the right place to ask this question.
I was wondering if someone can explain the 3 regex code examples below in detail?
Thanks.
Example 1
`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i
Example 2
\\1
Example 3
`[^a-z0-9]`i','`[-]+`
The first regex looks like it'll match the HTML entities for accented characters (e.g., é is &eacute;; ø is &oslash;; æ is &aelig;; and Â is &Acirc;).
To break it down, & will match an ampersand (the start of the entity), ([a-z]{1,2}) will match any lowercase letter one or two times, (acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig) will match one of the terms in the pipe-delimited list (e.g., circ, grave, cedil, etc.), and ; will match a semicolon (the end of the entity). I'm not sure what the i character means at the end of the line; it's not part of the regex.
All told, it will match the HTML entities for accented/diacritic/ligatures. Compared, though, to this page, it doesn't seem that it matches all of the valid entities (although it does catch many of them). Unless you run in case-insensitive mode, the [a-z] will only match lowercase letters. It will also never match the entities &eth; or &thorn; (ð, þ, respectively) or their capital versions (Ð, Þ, also respectively).
The second regex is simpler. \1 in a regex (or in regex find-replace) simply looks for the contents of the first capturing group (denoted by parentheses ()) and (in a regex) matches them or (in the replace of a find) inserts them. What you have there \\1 is the \1, but it's probably written in a string in some other programming language, so the coder had to escape the backslash with another backslash.
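To show how the first two examples could fit together, here is a sketch in Python. It rests on two assumptions of mine that the question does not confirm: that Example 1 is the search pattern and Example 2 its replacement, and that the trailing i is a case-insensitive flag. The function name is made up as well.

```python
import re

# Example 1, with the assumed case-insensitive flag.
ENTITY = re.compile(
    r"&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);",
    re.IGNORECASE,
)


def strip_accent_entities(html: str) -> str:
    # \1 in the replacement inserts whatever group 1 captured (the base letters).
    return ENTITY.sub(r"\1", html)


print(strip_accent_entities("caf&eacute; &Oslash;re &aelig;on &eth;"))
# cafe Ore aeon &eth;
```

The &eth; at the end is left untouched, which matches the point above that the pattern never catches that entity.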
For your third example, I'm less certain what it does, but I can explain the regexes. [^a-z0-9] will match any character that's not a lowercase letter or number (or, if running in case-insensitive mode, anything that's not a letter or a number). The caret (^) at the beginning of the character class (that's anything inside square brackets []) means to negate the class (i.e., find anything that is not specified, instead of the usual find anything that is specified). [-]+ will match one or more hyphens (-). I don't know what the i',' between the regexes means, but, then, you didn't say what language this is written in, and I'm not familiar with it.