Here is my regex.
I am trying to capture the files *08.tgz, *09.tgz, and *01.tgz.
This is what I have, but it also captures *10.tgz, due to the 09:
.*\/*[09|8|1].tgz
I know I can use .*\/*[9|8|1].tgz, which captures only *08.tgz, *09.tgz, and *01.tgz, but what I want to understand is: why does the 0 capture the 10.tgz file?
data
./backup_public_html_20160308.tgz
./backup_public_html_20160301.tgz
./backup_public_html_20160302.tgz
./backup_public_html_20160306.tgz
./backup_public_html_20160304.tgz
./backup_public_html_20160303.tgz
./backup_public_html_20160307.tgz
./backup_public_html_20160305.tgz
./backup_public_html_20160309.tgz
./backup_public_html_20160310.tgz
[09|8|1] is a character class, which matches any single one of the characters listed, so it will match 0, 9, 8, 1, or |.
You might be looking for 0[189], which matches 0 followed by 1, 8, or 9.
I would be explicit and use
.*\/*(08|09|01).tgz
Let's look at the part of your regex where the digits are actually matched.
[09|8|1] says: match one character that is
either 0
or 9
or 8
or 1
or a |
You think it is matching 10.tgz, but it is actually matching only 0.tgz.
When you change it to [9|8|1], it says: match one character that is
either 9
or 8
or 1
or a |
Now 0.tgz won't match.
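To see which characters are actually consumed, here is a minimal Python sketch that prints the matched text for each file name. It assumes Python's re module stands in for whatever tool the question used, and it takes only a subset of the data above; the variable names are illustrative.

```python
import re

# A subset of the file names from the question's data.
names = [
    "./backup_public_html_20160301.tgz",
    "./backup_public_html_20160302.tgz",
    "./backup_public_html_20160308.tgz",
    "./backup_public_html_20160309.tgz",
    "./backup_public_html_20160310.tgz",
]

# The class from the question: any ONE of the characters 0, 9, |, 8, 1,
# followed by any character (unescaped dot) and then "tgz".
original = re.compile(r"[09|8|1].tgz")
# An explicit alternation over the wanted two-digit endings.
explicit = re.compile(r"(08|09|01)\.tgz$")

# Text matched by the original pattern, keyed by file name.
original_hits = {n: m.group(0) for n in names if (m := original.search(n))}
# File names accepted by the explicit pattern.
explicit_hits = [n for n in names if explicit.search(n)]

for name, text in original_hits.items():
    print(name, "->", text)
print(explicit_hits)
```

For the name ending in 10.tgz, the matched text is only 0.tgz, which is why that file slips through.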
You are using a character class as if it were a group. Your regex .*\/*[09|8|1].tgz matches zero or more characters other than a newline (with .*), as many as possible (since * is a greedy quantifier), followed by zero or more / symbols, then one symbol from the character class [09|8|1] (that is, 0, 9, |, 8, or 1), followed by any character but a newline (since an unescaped . matches any character but a newline), and then tgz.
For more details on how character classes work, see Character classes or Character Sets:
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey.
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
To capture the files *08.tgz, *09.tgz, and *01.tgz, use
.*0[981]\.tgz
OR
^.*0[981]\.tgz$
The ^ is a start-of-string anchor and $ is an end-of-string anchor, so the ^.*0[981]\.tgz$ pattern requires a full string match.
NOTE: To match a literal . you need to escape it or place it inside a character class, where . loses its special meaning and just denotes a literal dot.
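The effect of escaping the dot can be checked with a short Python sketch. The two file names are made up for illustration (the one with an x in place of the dot is hypothetical), and Python's re module is assumed.

```python
import re

# Hypothetical names: the first has an "x" where the dot should be.
lookalike = "backup_20160308xtgz"
real = "backup_20160308.tgz"

unescaped = re.compile(r"0[981].tgz$")   # . matches any character
escaped = re.compile(r"0[981]\.tgz$")    # \. matches only a literal dot
in_class = re.compile(r"0[981][.]tgz$")  # [.] also matches only a literal dot

# Record whether each pattern accepts each name.
results = {
    "unescaped": (bool(unescaped.search(lookalike)), bool(unescaped.search(real))),
    "escaped": (bool(escaped.search(lookalike)), bool(escaped.search(real))),
    "in_class": (bool(in_class.search(lookalike)), bool(in_class.search(real))),
}

for label, (hits_lookalike, hits_real) in results.items():
    print(label, hits_lookalike, hits_real)
```

Only the unescaped pattern accepts the look-alike name; all three accept the real one.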
You've confused a character class with an alternation.
Try this:
.*0(9|8|1)\.tgz
Or more simply:
.*0[981]\.tgz
Note that these versions also repair other parts of your regex: the stray \/* is dropped and the dot before tgz is escaped.
I want to write a regular expression for QR8.4_Z4J25 in a shell script. How can I do it?
Is this correct?
[QR][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]
It's obviously wrong because it'll only match Q8.4_Z4J25 or R8.4_Z4J25, but not QR8.4_Z4J25
A bracket expression matches any one of the characters listed, so you would need to write:
[Q][R][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]
You don't need to use brackets for a single character, though, so it can be simplified to
QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]
Be sure to escape the dot if it's outside of a bracket because it would otherwise match any single character.
In case you want to match QR9.1_8A9YK as well, you should change it to
QR[0-9]\.[0-9]_[A-Z0-9]\{5\}
If you're using Extended Regular Expressions, usually by supplying the -E option to the tool you're using, then you shouldn't escape the braces:
QR[0-9]\.[0-9]_[A-Z0-9]{5}
Square brackets in regular expressions denote a collection of characters.
[MX_5] will match one character that is M, X, _ or 5.
[0-9] will match one character that is between 0 and 9.
[a-z] will match one character that is between lowercase a and z.
Notice the pattern? The square brackets match a single character. In order to match multiple characters they need to be followed by a + or * or {} to denote how many of those characters it should match.
However, in your case, you just want to match the actual letters QR in that order, so simply don't use square brackets.
QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]
The same goes for characters like the underscore which are always in the same place. Note that the . was escaped with a \ because it has a special meaning in regex.
Going back to matching multiple characters with square brackets, if the order of the last 5 characters doesn't matter, you can further reduce your expression using a single square bracket and a {} to match all your trailing characters after the underscore.
QR[0-9]\.[0-9]_[A-Z0-9]{5}
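The difference between these variants can be tried out with a small Python sketch. Python's re module is used here only as a stand-in for the shell tools in the question; like Extended Regular Expressions, it takes the braces unescaped. The helper name full is illustrative.

```python
import re

# The bracketed attempt: [QR] matches ONE character, either Q or R.
bracketed = re.compile(r"[QR][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]")
# Literal QR followed by the fixed letter/digit layout.
literal = re.compile(r"QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]")
# Any five uppercase letters or digits after the underscore.
relaxed = re.compile(r"QR[0-9]\.[0-9]_[A-Z0-9]{5}")


def full(pattern, text):
    """Return True when the pattern matches the whole text."""
    return pattern.fullmatch(text) is not None


print(full(bracketed, "QR8.4_Z4J25"))  # False: only one of Q or R is consumed
print(full(bracketed, "R8.4_Z4J25"))   # True
print(full(literal, "QR8.4_Z4J25"))    # True
print(full(literal, "QR9.1_8A9YK"))    # False: 8 is not in [A-Z]
print(full(relaxed, "QR9.1_8A9YK"))    # True
```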
I want to match a string which starts with a number, is followed by any characters, and ends with .html.
I have tried the following:
/([0-9]*[^\.html]*.html)/g
But for an example like "21212dfsd.htmlfdf.html", RegExr reports 2 matches. Why is that?
Thanks
You get two matches because of the * quantifier following the [0-9] character class. * means match the preceding token "zero or more" times, so a match may begin with no digits at all, and the second match starts at fdf.html. Use + instead, meaning "one or more".
Also, you can't place whole words inside a character class. A character class matches any one character from a set of characters, so [^\.html] only excludes the single characters ., h, t, m and l. Outside a class, the dot . needs to be escaped (it's a character with a special meaning).
You can use the below regular expression:
/\d+.*?\.html/g
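A quick way to compare the two patterns is to list every match each one finds in the sample string. This sketch assumes Python's re module, whose findall plays the role of the /g flag here.

```python
import re

sample = "21212dfsd.htmlfdf.html"

# Pattern from the question: [0-9]* may match zero digits.
original = re.compile(r"([0-9]*[^\.html]*.html)")
# Suggested fix: at least one digit, then a lazy run up to a literal ".html".
fixed = re.compile(r"\d+.*?\.html")

original_matches = original.findall(sample)
fixed_matches = fixed.findall(sample)

print(original_matches)  # ['21212dfsd.html', 'fdf.html']
print(fixed_matches)     # ['21212dfsd.html']
```

The second match of the original pattern starts at fdf.html, where [0-9]* matches nothing.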
I know it seems a bit redundant but I'd like a regex to match anything.
At the moment we are using ^*$ but it doesn't seem to match no matter what the text.
I do a manual check for no text but the test view we use is always validated with a regex. However, sometimes we need it to validate anything using a regex. i.e. it doesn't matter what is in the text field, it can be anything.
I don't actually produce the regex and I'm a complete beginner with them.
The regex .* will match anything (including the empty string, as Junuxx points out).
The chosen answer is slightly incorrect, as it won't match line breaks or carriage returns. This regex to match anything is useful if your desired selection includes any line breaks:
[\s\S]+
[\s\S] matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. The + matches one or more of the preceding expression.
^ is the beginning-of-line anchor, so it will be a "zero-width match," meaning it won't match any actual characters (and the first character matched after the ^ will be the first character of the string). Similarly, $ is the end-of-line anchor.
* is a quantifier. It will not by itself match anything; it only indicates how many times a portion of the pattern can be matched. Specifically, it indicates that the previous "atom" (that is, the previous character or the previous parenthesized sub-pattern) can match any number of times.
To actually match some set of characters, you need to use a character class. As RichieHindle pointed out, the character class you need here is ., which represents any character except newlines (and it can be made to match newlines as well using the appropriate flag). So .* represents * (any number) matches on . (any character). Similarly, .+ represents + (at least one) matches on . (any character).
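The behaviour around newlines and empty input can be checked with a short Python sketch. It assumes Python's re module, where the flag that lets . match newlines is called re.DOTALL; other flavors name it differently.

```python
import re

text = "first line\nsecond line"

# By default . does not match a newline, so .* cannot span both lines.
dot_default = re.fullmatch(r".*", text)
# With re.DOTALL, . matches newlines as well.
dot_all = re.fullmatch(r".*", text, flags=re.DOTALL)
# [\s\S] matches any character without needing a flag.
either_class = re.fullmatch(r"[\s\S]+", text)

# .* accepts the empty string; [\s\S]+ needs at least one character.
empty_star = re.fullmatch(r".*", "")
empty_plus = re.fullmatch(r"[\s\S]+", "")

print(dot_default is None)       # True
print(dot_all is not None)       # True
print(either_class is not None)  # True
print(empty_star is not None)    # True
print(empty_plus is None)        # True
```

If the empty string must be accepted along with line breaks, [\s\S]* is one option.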
I know this is a rather old post, but there are other ways as well, such as:
.*
(.*?)
I'm new to regular expressions and I'm having trouble finding out what "\'.-" means.
'/^[A-Z \'.-]{2,20}$/i'
So far from my research, I have found that the regular expression starts (^) and requires two to twenty ({2,20}) alphabetical (A-Z) characters. The expression is also case insensitive (/i).
Any hints about what "\'.-" means?
The character class is the entire expression [A-Z \'.-], meaning any of A-Z, space, single quote, period, or hyphen. The \ is needed to protect the single quote, since it's also being used as the string quote. This charclass must be repeated 2 to 20 times, and because of the leading ^ and trailing $ anchors that must be the entire content of the matching string.
It escapes the single quote (') that delimits the string containing the regex (so as not to end the string prematurely), followed by a . which means a literal . and a - which means a literal -.
Inside a character class, the . is treated literally, and if the - isn't part of a valid range, e.g. a-z, then it is treated literally as well.
Your regex says: match the characters a-zA-Z '.- between 2 and 20 times as the entire string, with an optional trailing \n.
This regex is in a string. The backslash is there to escape the single quote so the string doesn't end early, in the middle of the regex. The dot and dash are just what they are, a period and a dash.
So, you were nearly right, except it's 2-20 characters that are letters, space, single quote, period, or dash.
It's quoting the quote.
The regular expression is ^[A-Z '.-]{2,20}$.
In the programming language you are using, you write it as a quoted string:
'SOMETHING'
To get a single quote in there, it's been backslashed.
Everything inside the square brackets is part of the character class, and will match a single character listed. In your example, the characters listed are the letters A through Z, a space, a single quote, a period, or a hyphen. (Note the hyphen must be listed last to avoid indicating a range, like A-Z.) Your full regular expression will match between 2 and 20 of the listed characters. The backslash before the single quote is needed so the compiler knows you are not ending the string that defines the regular expression.
Some examples of things this will match:
....................
abaca af - .
AAfa- - ..
.z
And so on.
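The points made in these answers can be tried out in a short Python sketch. The original pattern looks like a PHP string; Python's re module is assumed here as a stand-in, with re.IGNORECASE playing the role of /i. Apart from the three taken from the examples above, the sample strings are made up.

```python
import re

# In a single-quoted string the quote must be backslashed; in a
# double-quoted string it need not be. Both spell the same regex.
single_quoted = '^[A-Z \'.-]{2,20}$'
double_quoted = "^[A-Z '.-]{2,20}$"
same_regex = single_quoted == double_quoted

pattern = re.compile(double_quoted, re.IGNORECASE)

samples = ["." * 20, "abaca af - .", ".z", "O'Neil", "A", "abc1", "." * 21]
results = {s: bool(pattern.search(s)) for s in samples}

# A hyphen between two characters forms a range: " -." covers every
# character from space to period, which includes "+".
hyphen_last = bool(re.fullmatch(r"[A-Z .-]", "+"))
hyphen_middle = bool(re.fullmatch(r"[A-Z -.]", "+"))

print(same_regex)
for sample, ok in results.items():
    print(repr(sample), ok)
print(hyphen_last, hyphen_middle)
```

The single letter, the string with a digit, and the 21-character string are rejected; the rest are accepted.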
I have a question about regular expressions. In my case, I have to make sure that
the first character is a letter, and from the second character onwards it can be any alphanumeric character plus some special characters.
Regards,
Anto
Try something like this:
^[a-zA-Z][a-zA-Z0-9.,$;]+$
Explanation:
^ Start of line/string.
[a-zA-Z] Character is in a-z or A-Z.
[a-zA-Z0-9.,$;] Alphanumeric or `.` or `,` or `$` or `;`.
+ One or more of the previous token (change to * for zero or more).
$ End of line/string.
The special characters I have chosen are just an example. Add your own special characters as appropriate for your needs. Note that a few characters need escaping inside a character class otherwise they have a special meaning in the regular expression.
I am assuming that by "alphabet" you mean A-Z. Note that in some other countries there are also other characters that are considered letters.
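As a rough check of this pattern, the following Python sketch runs it against a few made-up inputs. Python's re module is assumed, and the special characters are the example set from the answer, not a recommendation.

```python
import re

# First character a letter, then one or more allowed characters.
one_or_more = re.compile(r"^[a-zA-Z][a-zA-Z0-9.,$;]+$")
# Same, but the tail may be empty (+ changed to *).
zero_or_more = re.compile(r"^[a-zA-Z][a-zA-Z0-9.,$;]*$")

# Hypothetical inputs chosen to exercise each part of the pattern.
samples = ["a1.,$;", "1abc", "a", "ab c"]
plus_results = {s: bool(one_or_more.search(s)) for s in samples}
star_results = {s: bool(zero_or_more.search(s)) for s in samples}

for s in samples:
    print(repr(s), plus_results[s], star_results[s])
```

Inside the class, $ is an ordinary character, so a1.,$; is accepted. A single letter passes only with the * variant.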
Try this :
/^[a-zA-Z]/
where
^ -> Starts with
[a-zA-Z] -> characters to match
I think the simplest answer is to pick and match only the first character with regex.
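String str = "s12353467457458";
// Test only the first character; matches() must match the whole input, so no ^ is needed.
if (!str.isEmpty() && Character.toString(str.charAt(0)).matches("[a-zA-Z]")) {
    System.out.println("Valid");
}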