Regex Expression Differences - regex

I would like to understand the difference between the following 3 regular expressions:
I wanted to display all the lines in a file that consist only of lowercase letters.
Here are the 3 regular expressions I tried:
cat filename.txt | grep ^[a-z]*
Regex Description: This will display all the lines starting with 0 or more lowercase letters. So, it will match any of the following:
zapato
113078
OLIVIA
Not exactly what I wanted.
cat filename.txt | grep ^[a-z]*$
Regex Description: This will display all the lines starting with 0 or more lowercase letters till the end of the line. This matches the following:
fubuki
BALLIN
Kristine
This time there were no results with digits in them.
cat filename.txt | grep ^[a-z]*[a-z]$
Regex Description: This one works well for me. It searches for all the lines starting with 0 or more lowercase letters and it matches it till it finds another lowercase letter. For some reason, this works for me. However, I want to know how this is different from the previous regular expressions.
tonia
ecurby
totonno
Also, when the asterisk (*) in the regular expression means 0 or more, then it should include all the results when I write ^[a-z]*

Short explanations of your regular expressions:
^[a-z]*
Match string starting with 0 or more characters from [a-z].
Matches empty string and every string starting with character of set [a-z].
^[a-z]*$
Match string containing nothing but 0 or more characters from [a-z].
Matches empty string and every string containing only characters of set [a-z].
^[a-z]*[a-z]$
Match string starting with 0 or more characters from [a-z] followed by exactly one last character from [a-z].
Matches every non-empty string containing only characters of set [a-z].
Use this instead of your current third option:
^[a-z]+$
It is semantically equivalent but simpler.
The expression x*x (or xx*) is equivalent to x+ in regular expressions (with x being any expression). The latter is basically just syntactic sugar for either of the former more verbose expressions.
Or put differently: while * means 0 or more, + means 1 or more.
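The difference between the patterns can be checked directly. The following is a minimal sketch that assumes Python's re module instead of grep; the sample lines come from the question, plus one empty line, and the dictionary keys are only illustrative labels.

```python
import re

# Sample lines from the question, plus an empty line.
lines = ["zapato", "113078", "OLIVIA", "Kristine", ""]

patterns = {
    "prefix": r"^[a-z]*",            # zero or more lowercase letters at the start
    "whole_or_empty": r"^[a-z]*$",   # the whole line, which may be empty
    "star_then_one": r"^[a-z]*[a-z]$",  # the whole line, at least one letter
    "plus": r"^[a-z]+$",             # same meaning, shorter spelling
}

def matching_lines(pattern, candidates):
    """Return the candidates in which the pattern is found (grep-style search)."""
    return [line for line in candidates if re.search(pattern, line)]

results = {name: matching_lines(p, lines) for name, p in patterns.items()}
for name, matched in results.items():
    print(name, matched)
```

The first pattern is satisfied by every line because an empty match at the start of the line already counts as "0 or more".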

Related

How does this regular expression work?

I want to understand how this regular expression (aka regex), stored in the "regex" variable, works.
regex='^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}$'
I am new to bash scripting and am having a hard time working with regular expressions!
Numbers from 1-9, 0-9, 0-4 and 0-5 are repeated at least twice, which is creating confusion!
Thank you!
Look at this part alone:
[1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
It's a series of alternatives (separated by |), here on separate lines:
[1-9] # Matches 1-9
[1-9][0-9] # Matches 10-99
1[0-9][0-9] # Matches 100-199
2[0-4][0-9] # Matches 200-249
25[0-5] # Matches 250-255
In other words, it matches any number from 1 to 255 inclusive. It's a bit roundabout because regex has no concept of numbers, only of character strings.
The regex attempts to match four such numbers with periods between them (the last three may also be 0), in order to match a whole IPv4 address.
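The 1 to 255 claim can be verified exhaustively. This is a small sketch, assuming Python's re module; the constant name is made up for the example.

```python
import re

# The first-octet alternation from the regex, anchored so the whole string must match.
OCTET_1_TO_255 = re.compile(r"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$")

# Collect every integer in 0..999 whose decimal form the pattern accepts.
accepted = [n for n in range(1000) if OCTET_1_TO_255.match(str(n))]

print(accepted[0], accepted[-1], len(accepted))

# Leading zeros are rejected because every alternative starts with a non-zero digit.
print(bool(OCTET_1_TO_255.match("01")))
```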
It looks like someone was trying to match an IPv4 address. The group
([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
matches a number from 1 to 255. The next group then matches a dot followed by a number from 0 to 255, three times:
(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}
They tried to separate the four numbers with dots. The original regex you posted didn't escape the "." so it would match any character between the four groups. Someone has since edited the regex to fix that character.
The regex is wrapped in ^ and $ to make sure the string contains that and only that. ^ matches the beginning of a string. $ matches the end.
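To see the whole expression at work, here is a short sketch using Python's re module (an assumption; the question is about bash). The sample addresses are invented for illustration, and the UNESCAPED variant rebuilds the pattern with a bare "." to show why the escaping matters.

```python
import re

# The regex from the question, unchanged, split over two lines for readability.
IPV4 = re.compile(
    r"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
    r"(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}$"
)

# Same pattern with the dot left unescaped, so it matches any character.
UNESCAPED = re.compile(IPV4.pattern.replace(r"\.", "."))

samples = ["192.168.0.1", "255.255.255.255", "0.10.0.1",
           "256.1.1.1", "1.2.3", "1.2.3.04", "1a2b3c4"]
verdicts = {s: bool(IPV4.match(s)) for s in samples}
for s, ok in verdicts.items():
    print(s, ok)

print("unescaped accepts 1a2b3c4:", bool(UNESCAPED.match("1a2b3c4")))
```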

Limit length of string containing at least 1 digit, 0 or more letters and an optional dash

I am trying to make a regular expression for consumer products models.
I have this regular expression: ([a-z]*-?[0-9]+-?[a-z]*-?){4,}
which I expect to limit this whole special string to 4 or more characters, but what happens is that the limit is applied only to the digits.
So this example matches: E1912H while this does not: EM24A1BF although both should match.
Can you tell me what I am doing wrong or how can I make the limit to the whole special string not only the digits?
Limitations:
1- String contains at least 1 digit
2- String can contain letters
3- String can contain "-"
4- Minimum length = 4
Summary of your conditions so far:
require at least 1 digit [0-9]
require at least 4 symbols {4,}
can have characters [a-zA-Z]
can have short dash [-]
The following regexp meets them all:
^(?=.*\d)([A-Za-z0-9-]+){4,}$
Note: the ^ and $ anchors mean the entire input string is validated. Adjust this if that is not what you need.
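A quick way to try these conditions is the sketch below, assuming Python's re module. It uses the flat form [A-Za-z0-9-]{4,} rather than the nested ([A-Za-z0-9-]+){4,}; both accept the same strings, but the flat form avoids nested quantifiers, which can backtrack heavily on long inputs that fail near the end. The function name and the extra sample strings are made up for the example.

```python
import re

# A lookahead requires a digit somewhere; then the whole string must be
# 4 or more letters, digits or dashes.
MODEL = re.compile(r"^(?=.*\d)[A-Za-z0-9-]{4,}$")

def is_model(text):
    """Return True if text satisfies all four conditions from the question."""
    return MODEL.match(text) is not None

for s in ["E1912H", "EM24A1BF", "eM-24A-1BF", "ABCD", "A1", "E19 12H"]:
    print(s, is_model(s))
```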
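It can't match: the {4,} quantifier repeats the whole group, and each repetition needs at least one digit ([0-9]+), so the string must contain at least four digits. EM24A1BF contains only three.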
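The behaviour of the original pattern can be reproduced with a short sketch in Python's re module. Each repetition of the group has to consume at least one digit, so {4,} effectively asks for four or more digits, not four or more characters. The IGNORECASE flag is an assumption, since the examples are uppercase while the pattern only lists [a-z].

```python
import re

# The pattern from the question; IGNORECASE is an assumption (see above).
ORIGINAL = re.compile(r"([a-z]*-?[0-9]+-?[a-z]*-?){4,}", re.IGNORECASE)

def digit_count(text):
    """Count the decimal digits in text."""
    return sum(ch.isdigit() for ch in text)

report = {s: (digit_count(s), bool(ORIGINAL.search(s)))
          for s in ["E1912H", "EM24A1BF"]}
for s, (digits, matched) in report.items():
    print(s, "digits:", digits, "matched:", matched)
```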
Something like this
[a-z]*-?\d+-?[a-z]*-?\d*[a-z]+
matches both of your examples and all of these:
E1912H
EM24A1BF
eM24A1BF
eM-24A-1BF
eM-24A-
eM24A-1BF
eM-24A1BF
To be sure your string meets both your requirements (the characters' position and composition AND the length requirement), you need a non-consuming assertion such as a lookahead.
Check this out
([\w-]*\d+[\w-]*){4,}
it matches the following
32ES5200G
LE32K900
N55XT770XWAU3D

C# regular expression to match square brackets

I'm trying to use a regular expression in C# to match a software version number that can contain:
a 2 digit number
a 1 or 2 digit number (not starting in 0)
another 1 or 2 digit number (not starting in 0)
a 1, 2, 3, 4 or 5 digit number (not starting in 0)
an optional letter at the end enclosed in square brackets.
Some examples:
10.1.23.26812
83.33.7.5
10.1.23.26812[d]
83.33.7.5[q]
Invalid examples:
10.1.23.26812[
83.33.7.5]
10.1.23.26812[d
83.33.7.5q
I have tried the following:
string rex = @"[0-9][0-9][.][1-9]([0-9])?[.][1-9]([0-9])?[.][1-9]([0-9])?([0-9])?([0-9])?([0-9])?([[][a-zA-Z][]])?";
(note: if I try without the "@" and just escape the square brackets by doing "\[" I get an error saying "Unrecognised escape sequence")
I can get to the point where the version number is validating correctly, but it accepts anything that comes after (for example: "10.1.23.26812thisShouldBeWrong" is being matched as correct).
So my question is: is there a way of using a regular expression to match / check for square brackets in a string, or would I need to convert them to a different character (e.g. change [a] to *a* and match for *s instead)?
This happens because the regex matches part of the string, and you haven't told it to force the entire string to match. Also, you can simplify your regex a lot (for example, you don't need all those capturing groups):
string rex = @"^[0-9]{2}\.[1-9][0-9]?\.[1-9][0-9]?\.[1-9][0-9]{0,4}(?:\[[a-zA-Z]\])?$";
The ^ and $ are anchors that match the start and end of the string.
The error message you mentioned has to do with the fact that you need to escape the backslash, too, if you don't use a verbatim string. So a literal opening bracket can be matched in a regex as "[[]" or "\\[" or @"\[". The last form is preferred.
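The effect of the anchors can be checked against the examples from the question. The sketch below uses Python's re module purely for illustration (the question itself is about C#); the pattern syntax used here is the same in both engines, and the variable names are invented for the example.

```python
import re

# Core of the simplified pattern, without anchors.
CORE = r"[0-9]{2}\.[1-9][0-9]?\.[1-9][0-9]?\.[1-9][0-9]{0,4}(?:\[[a-zA-Z]\])?"
UNANCHORED = re.compile(CORE)
ANCHORED = re.compile("^" + CORE + "$")

valid = ["10.1.23.26812", "83.33.7.5", "10.1.23.26812[d]", "83.33.7.5[q]"]
invalid = ["10.1.23.26812[", "83.33.7.5]", "10.1.23.26812[d", "83.33.7.5q",
           "10.1.23.26812thisShouldBeWrong"]

for s in valid + invalid:
    # search() finds a match anywhere, so the unanchored pattern is satisfied
    # by the leading version number even when junk follows it.
    print(s, bool(UNANCHORED.search(s)), bool(ANCHORED.search(s)))
```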
You need to anchor the regex with ^ and $:
string rex = @"^[0-9][0-9][.][1-9]([0-9])?[.][1-9]([0-9])?[.][1-9]([0-9])?([0-9])?([0-9])?([0-9])?([[][a-zA-Z][]])?$";
The reason 10.1.23.26812thisShouldBeWrong matches is that the regex matches the substring 10.1.23.26812.
The regex can be simplified slightly for readability:
string rex = @"^\d{2}\.([1-9]\d?\.){2}[1-9]\d{0,4}(\[[a-zA-Z]\])?$";
In response to TimCross warning, an updated regex that uses [0-9] instead of \d:
string rex = @"^[0-9]{2}\.([1-9][0-9]?\.){2}[1-9][0-9]{0,4}(\[[a-zA-Z]\])?$";

Can't match string using regular expression

I'm a newbie to regex and I'm trying to come up with a regular expression that matches any string that begins with one or two digits and ends with a letter, for example: 03C, 4B, 34A.
I came up with this regular expression: ^[0-9]{0,2}\w[A-Z]$ and it works most of the time, but it also matches two letters, e.g. AA or CD.
How can I force at least one number at the beginning of the string? Strings should be no more than 3 characters long and use all uppercase letters.
Try this regular expression
^[0-9]{1,2}[A-Z]$
You are close.
Change your regex pattern to:
^[0-9]{1,2}[A-Z]$
This will match strings that begin with either 1 or 2 numbers, and end with a single uppercase letter.
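The two patterns can be compared side by side. This is a minimal sketch with Python's re module (an assumption, since the question names no language). In the original pattern, {0,2} allows zero digits and \w accepts a letter or a digit, which is why AA slips through and why a four-character string such as 123A is accepted too.

```python
import re

ORIGINAL = re.compile(r"^[0-9]{0,2}\w[A-Z]$")  # pattern from the question
FIXED = re.compile(r"^[0-9]{1,2}[A-Z]$")       # pattern from the answers

samples = ["03C", "4B", "34A", "AA", "CD", "123A", "4b"]
comparison = {s: (bool(ORIGINAL.match(s)), bool(FIXED.match(s))) for s in samples}
for s, (old, new) in comparison.items():
    print(s, "original:", old, "fixed:", new)
```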

Limit number of alpha characters in regular expression

I've been struggling to figure out how to best do this regular expression.
Here are my requirements:
Up to 8 characters
Can only be alphanumeric
Can only contain up to three alpha characters [a-z] (zero alpha characters are valid too)
Any ideas would be appreciated.
This is what I've got so far, but it only looks for contiguous letter characters:
^(\d|([A-Za-z])(?!([A-Za-z]{3,}))){0,8}$
I'd write it like this:
^(?=[a-z0-9]{0,8}$)(?:\d*[a-z]){0,3}\d*$
It has two parts:
(?=[a-z0-9]{0,8}$)
Looks ahead and matches up to 8 alphanumeric characters to the end of the string
(?:\d*[a-z]){0,3}\d*$
Essentially allowing injection of up to 3 [a-z] among \d*
Test cases on rubular.com:
12345678 // matches
123456789
#(#*#$
12345 // matches
abc12345 // matches
abcd1234
12a34b5c // matches
12ab34cd
123a456 // matches
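The same test strings can be run locally. The sketch below assumes Python's re module rather than Rubular. Note that abc12345 has exactly 8 characters and 3 letters, so the pattern accepts it.

```python
import re

# Lookahead limits the total length; the rest allows at most 3 letters among digits.
PATTERN = re.compile(r"^(?=[a-z0-9]{0,8}$)(?:\d*[a-z]){0,3}\d*$")

cases = ["12345678", "123456789", "#(#*#$", "12345", "abc12345",
         "abcd1234", "12a34b5c", "12ab34cd", "123a456"]
verdicts = {s: bool(PATTERN.match(s)) for s in cases}
for s, ok in verdicts.items():
    print(s, "matches" if ok else "no match")
```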
Alternatives
I do think regex is the best solution for this, but since the string is short, it would be a lot more readable to do this in two steps as follows:
It must match [a-z0-9]{0,8}
Then, delete all \d
The length must now be <= 3
Do you have to do this in exactly one regular expression? It is possible to do that with standard regular expressions, but the regular expression will be rather long and complicated. You can do better with some of the Perl extensions, but depending on what language you're using, they may or may not be supported. The cleanest solution is probably to check whether the string matches:
^[A-Za-z0-9]{0,8}$
but doesn't match:
([A-Za-z].*){4}
i.e. it's an alphanumeric string of up to 8 characters (first regular expression) that doesn't contain 4 or more alpha characters, possibly separated by other characters (second regular expression).
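The "matches one, doesn't match the other" idea might look like this in Python (a sketch; the function and constant names are invented for the example). fullmatch is used for the first pattern so that a trailing newline is not silently accepted.

```python
import re

ALLOWED = re.compile(r"^[A-Za-z0-9]{0,8}$")    # up to 8 alphanumeric characters
FOUR_LETTERS = re.compile(r"([A-Za-z].*){4}")  # 4 letters, anything in between

def is_valid(text):
    """True if text passes the first pattern and does not contain 4 letters."""
    return bool(ALLOWED.fullmatch(text)) and not FOUR_LETTERS.search(text)

for s in ["", "12345678", "abc12345", "abcd1234", "12ab34cd", "123456789", "ab_1"]:
    print(repr(s), is_valid(s))
```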
/^(?!(?:\d*[a-z]){4})[a-z0-9]{0,8}$/i
Explanation:
[a-z0-9]{0,8} matches up to 8 alphanumerics.
The lookahead must be placed before the part that consumes the string.
The (?:\d*[a-z]) matches one alphabetic character, optionally preceded by digits. The {4} repeats it 4 times, so the negative lookahead stops the regex from matching when 4 alphabetic characters can be found (i.e. it limits the count to ≤3).
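A short sketch of this negative-lookahead pattern in Python, where the /i flag corresponds to re.IGNORECASE. The sample strings are made up for illustration.

```python
import re

# Reject up front if 4 letters can be found; then require 0..8 alphanumerics.
PATTERN = re.compile(r"^(?!(?:\d*[a-z]){4})[a-z0-9]{0,8}$", re.IGNORECASE)

samples = ["", "A1B2C3", "abc12345", "AbCd", "12ab34cd", "123456789"]
verdicts = {s: bool(PATTERN.match(s)) for s in samples}
for s, ok in verdicts.items():
    print(repr(s), ok)
```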
It's better not to exploit regex like this. Suppose you use this solution, are you sure you will know what the code is doing when you revisit it 1 year later? A clearer way is just check rule-by-rule, e.g.
if len(theText) <= 8 and theText.isalnum():
    if sum(1 for c in theText if c.isalpha()) <= 3:
        pass  # valid
The easiest way to do this would be in multiple steps:
Test the string against /^[a-z0-9]{0,8}$/i -- the string is up to 8 characters and only alphanumeric
Make a copy of the string, delete all non-alphabetic characters
See if the resulting string has a length of 3 or less.
If you want to do it in one regular expression, you can use something like:
/^(?=\d*(?:[a-z]?\d*){0,3}$)[a-z0-9]{0,8}$/i
Which looks for an alphanumeric string between length 0 and 8 (^[a-z0-9]{0,8}$), but first uses a lookahead ((?=\d*(?:[a-z]?\d*){0,3}$)) to make sure that the string has at most 3 alphabetic characters.
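One way to gain confidence in the single-regex version is to compare it with the multi-step approach on many inputs. The following is a sketch in Python; the function name, the small test alphabet and the length limit of 5 are arbitrary choices for the example.

```python
import re
from itertools import product

LOOKAHEAD = re.compile(r"^(?=\d*(?:[a-z]?\d*){0,3}$)[a-z0-9]{0,8}$", re.IGNORECASE)

def stepwise(text):
    """Reference check: alphanumeric and at most 8 long, then count the letters."""
    if not re.fullmatch(r"[a-z0-9]{0,8}", text, re.IGNORECASE):
        return False
    letters_only = re.sub(r"[^a-z]", "", text, flags=re.IGNORECASE)
    return len(letters_only) <= 3

# Every string of length 0..5 over a small alphabet: two letters, a digit,
# and one character that is not allowed at all.
alphabet = "aZ0_"
candidates = ["".join(p) for n in range(6) for p in product(alphabet, repeat=n)]
candidates += ["12a34b5c", "12ab34cd", "123456789"]

disagreements = [s for s in candidates if bool(LOOKAHEAD.match(s)) != stepwise(s)]
print(len(candidates), "strings checked,", len(disagreements), "disagreements")
```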