Regular expression for letters, numbers and - _ .

I'm having trouble checking in PHP whether a value consists only of the following characters:
letters (upper or lowercase)
numbers (0-9)
underscore (_)
dash (-)
point (.)
no spaces! or other characters
a few examples:
OK: "screen123.css"
OK: "screen-new-file.css"
OK: "screen_new.js"
NOT OK: "screen new file.css"
I guess I need a regex for this, since I need to throw an error when a given string contains characters other than the ones mentioned above.

The pattern you want is something like (see it on rubular.com):
^[a-zA-Z0-9_.-]*$
Explanation:
^ is the beginning of the line anchor
$ is the end of the line anchor
[...] is a character class definition
* is "zero-or-more" repetition
Note that the literal dash - is the last character in the character class definition, otherwise it has a different meaning (i.e. range). The . also has a different meaning outside character class definitions, but inside, it's just a literal .
References
regular-expressions.info/Anchors, Character Classes and Repetition
In PHP
Here's a snippet to show how you can use this pattern:
<?php
$arr = array(
    'screen123.css',
    'screen-new-file.css',
    'screen_new.js',
    'screen new file.css'
);
foreach ($arr as $s) {
    if (preg_match('/^[\w.-]*$/', $s)) {
        print "$s is a match\n";
    } else {
        print "$s is NO match!!!\n";
    }
}
?>
The above prints (as seen on ideone.com):
screen123.css is a match
screen-new-file.css is a match
screen_new.js is a match
screen new file.css is NO match!!!
Note that the pattern is slightly different, using \w instead. This is the character class for "word character".
API references
preg_match
Note on specification
This seems to follow your specification, but note that it will also match things like ....., which may or may not be what you want. If you can be more specific about what you want to match, a stricter (and slightly more complicated) regex can be written.
The above regex also matches the empty string. If you need at least one character, then use + (one-or-more) instead of * (zero-or-more) for repetition.
In any case, you can further clarify your specification (always helps when asking regex question), but hopefully you can also learn how to write the pattern yourself given the above information.
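To make the two caveats above concrete, here is a minimal sketch. It uses Python's re module purely for illustration (the question is about PHP), and the names ZERO_OR_MORE, ONE_OR_MORE and accepts are made up for this example:

```python
import re

# Hypothetical names for illustration; the character class is the one from the answer.
ZERO_OR_MORE = re.compile(r'[a-zA-Z0-9_.-]*')  # '*' also accepts the empty string
ONE_OR_MORE = re.compile(r'[a-zA-Z0-9_.-]+')   # '+' needs at least one character


def accepts(pattern, text):
    # fullmatch() requires the whole string to match, like ^...$ in the answer.
    return pattern.fullmatch(text) is not None


for sample in ['screen123.css', 'screen new file.css', '.....', '']:
    print(repr(sample), accepts(ZERO_OR_MORE, sample), accepts(ONE_OR_MORE, sample))
```

Both variants still accept a string of dots; only a stricter pattern can rule that out.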

you can use
^[\w.-]+$
the + is to make sure it has at least 1 character. Need the ^ and $ to denote the begin and end, otherwise if the string has a match in the middle, such as ####xyz%%%% then it is still a match.
\w already includes letters (upper and lower case), digits, and the underscore. So the remaining two characters, . and -, are simply added to the "class" to match. The + means one occurrence or more.
P.S. thanks for the note in the comment about preventing - to denote a range.
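A small sketch of why the anchors matter, again in Python only for illustration (re.ASCII is an assumption added here so that \w stays limited to A-Z, a-z, 0-9 and _; the variable names are hypothetical):

```python
import re

CHARS = r'[\w.-]+'       # the character class from the answer, unanchored
ANCHORED = r'^[\w.-]+$'  # same class, anchored at both ends

text = '####xyz%%%%'

# Without anchors, search() is satisfied by the 'xyz' in the middle.
unanchored_hit = re.search(CHARS, text, re.ASCII)
# With anchors, the whole string must consist of allowed characters.
anchored_hit = re.search(ANCHORED, text, re.ASCII)

print(unanchored_hit.group(0) if unanchored_hit else None)
print(anchored_hit)
```

One caveat for Python specifically: $ also matches just before a trailing newline, so re.fullmatch is the safer way to demand a whole-string match there.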

This is the pattern you are looking for
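/^[\w.-]*$/
What this means:
^ Start of string
[...] Match any one of the characters inside
\w Any word character: 0-9, a-z, A-Z and the underscore (_)
.- A literal . and a literal -; the dash goes last so it cannot be read as a range (written as [\w-_.], it sits between \w and _, which some regex engines reject as an invalid range)
* Zero or more of the preceding class, with no upper limit
$ End of string
If you want to limit the number of characters:
/^[\w.-]{0,5}$/
{0,5} Means 0-5 characters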
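On dash placement: a dash that sits between \w and another character, as in [\w-_.], is not accepted by every engine. Python's re module is one that refuses it, which the sketch below uses as a stand-in to show the difference; whether your PHP/PCRE version accepts it is something to check separately. The helper name compiles and the constant LIMITED are made up for this example:

```python
import re


def compiles(pattern):
    # Returns True if Python's re module accepts the pattern.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r'^[\w-_.]*$'))  # dash between \w and _
print(compiles(r'^[\w.-]*$'))   # dash as the last character of the class

# The length-limited variant: at most five allowed characters.
LIMITED = re.compile(r'[\w.-]{0,5}', re.ASCII)
print(LIMITED.fullmatch('a-b.c') is not None)
print(LIMITED.fullmatch('toolong') is not None)
```

Putting the dash last (or first, or escaping it) sidesteps the question entirely.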

To actually cover your pattern, i.e, valid file names according to your rules, I think that you need a little more. Note this doesn't match legal file names from a system perspective. That would be system dependent and more liberal in what it accepts. This is intended to match your acceptable patterns.
^([a-zA-Z0-9]+[_-])*[a-zA-Z0-9]+\.[a-zA-Z0-9]+$
Explanation:
^ Match the start of a string. This (plus the end match) forces the string to conform to the exact expression, not merely contain a substring matching the expression.
([a-zA-Z0-9]+[_-])* Zero or more occurrences of one or more letters or numbers followed by an underscore or dash. This causes all names that contain a dash or underscore to have letters or numbers between them.
[a-zA-Z0-9]+ One or more letters or numbers. This covers all names that do not contain an underscore or a dash.
\. A literal period (dot). Forces the file name to have an extension and, by exclusion from the rest of the pattern, only allow the period to be used between the name and the extension. If you want more than one extension that could be handled as well using the same technique as for the dash/underscore, just at the end.
[a-zA-Z0-9]+ One or more letters or numbers. The extension must be at least one character long and must contain only letters and numbers. This is typical, but if you wanted to allow underscores, that could be addressed as well. You could also supply a length range such as {2,3} instead of the one-or-more + quantifier, if that were more appropriate.
$ Match the end of the string. See the starting character.
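For a quick check of what this stricter pattern accepts and rejects, here is a sketch in Python (used only for illustration; FILENAME and is_valid_name are hypothetical names, and the samples beyond the question's four are made up):

```python
import re

# The pattern from this answer; fullmatch() plays the role of ^ and $.
FILENAME = re.compile(r'([a-zA-Z0-9]+[_-])*[a-zA-Z0-9]+\.[a-zA-Z0-9]+')


def is_valid_name(name):
    return FILENAME.fullmatch(name) is not None


samples = [
    'screen123.css',        # from the question
    'screen-new-file.css',  # from the question
    'screen_new.js',        # from the question
    'screen new file.css',  # from the question: contains spaces
    '.....',                # only dots
    'screen--new.css',      # two dashes in a row
    'screen.min.css',       # more than one dot
]
for name in samples:
    print(name, is_valid_name(name))
```

Note the trade-off: names with several dots, such as screen.min.css, are rejected unless the pattern is extended as described above.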

Something like this should work
$code = "screen new file.css";
if (!preg_match("/^[-_a-zA-Z0-9.]+$/", $code))
{
    echo "not valid";
}
This will echo "not valid"

[A-Za-z0-9_.-]*
This will also match empty strings; if you do not want that, exchange the last * for a +. Wrap it in ^ and $ so that the whole string has to match, not just part of it.

Related

Why doesn't this regex accept "s 1" type input?

I have the following regex:
/([^\s*][\l\u\w\d\s]+) (\d)/
It should match strings of the form: "some-string digit", e.g. "stackoverflow 1". Those strings cannot have whitespace at the beginning.
It works great, except for the simple strings with one character on the beginning, e.g.: "s 1". How can I fix it? I am using it in boost::regex (PCRE-compatible).
The [^\s*] is eating up your first string character, so when you require one-or-more string characters after it, that'll fail:
/([^\s*][\l\u\w\d\s]+) (\d)/
   ^^^^  ^^^^^^^^^^     ^^
   "s"    no match      "1"
If you fix your misplaced *:
/([^\s]*[\l\u\w\d\s]+) (\d)/
   ^^^   ^^^^^^^^^^     ^^
   "s";     "s"         "1"
   match
   then cancelled
   by backtracking
But in order to avoid the backtracking, I would instead write the regex like this:
/([\l\u\w\d]+[\l\u\w\d\s]*) (\d)/
Note that I am only showing the regex itself — re-apply your extra backslashes for use in a C++ string literal as required; e.g.
const std::string my_regex = "/([\\l\\u\\w\\d]+[\\l\\u\\w\\d\\s]*) (\\d)/";
This can probably be done more optimally anyway (I'm sure most of those character classes are redundant), but this should fix your immediate problem.
You can test your regexes here.
The problem is that you have the * in the wrong place: [^\s*] matches exactly one character that is neither whitespace nor an asterisk. (The s in "s 1" qualifies as "neither whitespace nor an asterisk", so it is matched and consumed, and no longer available to serve as a match for the next part, [\l\u\w\d\s]+. Note that "s  1", with two spaces, would succeed.)
You probably meant [^\s]*, which matches any number (including zero) of non-whitespace characters. If you make that small change, that will fix your regular expression.
However, there are other improvements to be made. First, the backslash+letter sequences that are short for character classes can be negated by capitalizing the letter: the character class "everything that's not in \s" can be written as above, with [^\s], but it can also be written more simply as \S.
Next, I don't know what \l and \u are. You've tagged this c++, so you're presumably using the standard regex library, which uses ECMAScript regex syntax. But the ECMAScript regular expression specification doesn't define those metacharacters.
If you're trying to match "lowercase letters" and "uppercase letters", those are [:lower:] and [:upper:] - but both sets of letters are already included in \w, so you don't need to include them in a character class that also has \w.
Pulling those out leaves a character class of [\w\d\s] - which is still redundant, because \w also includes the digits, so we don't need \d. Removing that, we have [\w\s], which matches "an underscore, letter, digit, space, tab, formfeed, or linefeed (newline)."
That makes the whole regular expression \S*[\s\w]+ (\d): zero or more non-whitespace characters, followed by at least one whitespace or word character, followed by exactly one space, followed by a digit. That seems like an unusual set of criteria to me, but it should definitely match "s 1". And it does, in my testing.
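The behaviour described in these answers can be reproduced with a small sketch. It is written in Python rather than boost::regex, and since Python's re module does not support \l and \u, both patterns are simplified stand-ins built from \w and \s (an assumption of this example, as are the names ORIGINAL, FIXED and groups):

```python
import re

# Simplified stand-ins for the patterns discussed above (\l, \u and \d folded into \w).
ORIGINAL = re.compile(r'([^\s*][\w\s]+) (\d)')  # misplaced * inside the class
FIXED = re.compile(r'(\w+[\w\s]*) (\d)')        # one-or-more word chars first


def groups(pattern, text):
    # Returns the captured groups, or None when the pattern does not match.
    m = pattern.search(text)
    return m.groups() if m else None


for text in ['stackoverflow 1', 's 1', 's  1']:
    print(repr(text), groups(ORIGINAL, text), groups(FIXED, text))
```

The single-character case "s 1" only matches with the fixed pattern, while the two-space variant matches with both.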
I would expect you could do something like this: add {X,}, where X is a number, after the second set of brackets, like below:
([^\\s*][\\l\\u\\w\\d\\s]{2,}) (\d)
Replace 2 with whatever you want to be your minimum string length.

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is numbers, or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a literal period; inside a character class it does not stand for "any character"
\s matches whitespace (spaces, tabs, and line breaks)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).
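Here is a sketch that runs both suggested patterns against the question's samples. It is Python, chosen only for illustration; NAME, NAME_I and the rejected samples are made up for this example:

```python
import re

# First answer: \s allows any whitespace, + requires at least one character.
NAME = re.compile(r'[A-Za-z.\s_-]+')
# Second answer: a plain space only, case-insensitive flag, * allows the empty string.
NAME_I = re.compile(r'[a-z ._-]*', re.IGNORECASE)

ok = ['Dr. Marshall', 'sam smith', '.george con-stanza .great',
      'peter.', 'josh_stinson', 'smith _.gorne']
bad = ['agent 007', 'r2d2', 'a@b']  # hypothetical rejects: digits or other symbols

for s in ok + bad:
    print(repr(s), NAME.fullmatch(s) is not None, NAME_I.fullmatch(s) is not None)
```

The two patterns agree on these samples but differ at the edges: the \s version also lets tabs and line breaks through, and the * version accepts an empty input.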

Match a specific string with several constants using Regex

There are now different requirements to the regex I am looking for, and it is too complex to solve it on my own.
I need to search for a specific string with the following requirements:
String starts with "fu: and ends with "
In between those start and end requirements there can be any other string which has the following requirements:
2.1. Less than 50 characters
2.2. Only lower case
2.3. No trailing spaces
2.4. No space between "fu: and the other string.
The result of the regex should be cases where case no' 1 matches but cases no' 2./2.1/2.2/2.3/2.4 don't.
At the moment I have following regex: "fu:([^"]*?[A-Z][^"]*?)",
which finds strings which start with "fu: and end with " with any upper case letter in between, like this one:
"fu:this String is wrong cause the s from string is upper case"
I hope it all makes sense. I tried to get into regex, but this problem seems too complex for someone who is not working with regex every day.
[Edit]
Apparently I was not clear enough. I want to have matches which are "wrong".
I am looking for the complement of this regex: "fu:(?:[a-z][a-z ]{0,47}[a-z]|[a-z]{0,2})"
some examples:
Match: "fu: this is a match"
Match: "fu:This is a match"
Match: "fu:this is a match "
NO Match: "fu:this is no match"
Sorry, it's not easy to explain :)
Try the following:
"fu:([a-z](?:[a-z ]{0,48}[a-z])?)"
This will match any string that begins with "fu: and ends with a ", where the text in between contains 1-50 characters, all lower-case letters or spaces, and neither begins nor ends with a space.
"fu:                # begins with "fu:
(                   # group to capture
  [a-z]             # starts with one lower-case letter
  (?:               # non-capturing sub-group
    [a-z ]{0,48}    # matches 0-48 a-z or space characters
    [a-z]           # sub-group must end with a lower-case letter
  )?                # sub-group is optional
)
"                   # ends with "
EDIT: In the event that you need an empty-string to match too, i.e. the full string is "fu:", you can add another ? to the end of the matching-group in the regex:
"fu:([a-z](?:[a-z ]{0,48}[a-z])?)?"
I've kept the two regexes separated (one that allows 1-50 characters in the string and one that allows 0-50) to show the minor difference.
EDIT #2: To match the inverse of the above, i.e. - to find all strings that do not match the required format, you can use:
^((?!"fu:([a-z](?:[a-z ]{0,48}[a-z])?)?").)*$
This will explicitly match any line that does not match that pattern. This will consequently also match lines that do not contain "fu: - if that matters.
The only way I can figure out to truly match the opposite of the above and still include the anchors of "fu: and " are to explicitly attempt to match the rules that fail:
"fu:([^a-z].*|[^"]{51,}|[a-z]([^"]*?[A-Z][^"]*?)+|[a-z ]{0,49}[ ])"
This regex will match anything that starts with not a lowercase a-z character, any string that's longer than 50 characters, any string that contains an uppercase letter, or any string that has trailing whitespace. For each additional rule, you'll need to update the regex to match the opposite of what's needed.
My recommendation is, in whatever language you're using, to match all input strings that actually follow your requirements - and if there are no matches then that string must violate your rules.
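That recommendation could look roughly like the sketch below: first grab every "fu:..." literal, then test its body against the valid pattern and report the ones that fail. It is written in Python as an assumption (the question does not name a language), and CANDIDATE, VALID_BODY and find_violations are hypothetical names:

```python
import re

# Any quoted literal that starts with "fu: (body captured, may be empty).
CANDIDATE = re.compile(r'"fu:([^"]*)"')
# The valid body from the answer above: 1-50 lower-case letters/spaces,
# no leading or trailing space.
VALID_BODY = re.compile(r'[a-z](?:[a-z ]{0,48}[a-z])?')


def find_violations(text):
    """Return every "fu:..." literal whose body breaks the rules."""
    return [m.group(0) for m in CANDIDATE.finditer(text)
            if VALID_BODY.fullmatch(m.group(1)) is None]


sample = '''
"fu: this is a match"
"fu:This is a match"
"fu:this is a match "
"fu:this is no match"
'''
print(find_violations(sample))
```

Adding a rule then only means tightening VALID_BODY, instead of writing a new "wrong" alternative for every case. Note that this sketch keeps the answer's limit of 50 characters; adjust the quantifier if "less than 50" is meant strictly.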
"fu:([^A-Z" ](?:[^A-Z"]{0,48}[^A-Z" ])?)"
The above regex should match the specified requirements.
That's probably what you need
"fu:([a-z](?:[a-z ]{,48}[a-z])?)"
Try this:
"fu:(?:[a-z][a-z ]{0,47}[a-z]|[a-z]?)"

Match Regular Expression if string contains exactly N occurrences of a character

I'd like a regular expression to match a string only if it contains a character that occurs a predefined number of times.
For example:
I want to match all strings that contain the character "_" 3 times;
So
"a_b_c_d" would pass
"a_b" would fail
"a_b_c_d_e" would fail
Does someone know a simple regular expression that would satisfy this?
Thank you
For your example, you could do:
\b[a-z]*(_[a-z]*){3}[a-z]*\b
(with an ignore case flag).
You can play with it here
It says "match 0 or more letters, followed by '_[a-z]*' exactly three times, followed by 0 or more letters". The \b means "word boundary", ie "match a whole word".
Since I've used '*' this will match if there are exactly three "_" in the word regardless of whether it appears at the start or end of the word - you can modify it otherwise.
Also, I've assumed you want to match all words in a string with exactly three "_" in it.
That means the string "a_b a_b_c_d" would say that "a_b_c_d" passed (but "a_b" fails).
If you mean that globally across the entire string you only want three "_" to appear, then use:
^[^_]*(_[^_]*){3}[^_]*$
This anchors the regex at the start of the string and goes to the end, making sure there are only three occurrences of "_" in it.
Elaborating on Rado's answer, which is so far the most versatile but could be a pain to write if there are more occurrences to match:
^([^_]*_){3}[^_]*$
It will match entire strings (from the beginning ^ to the end $) in which there are exactly 3 ({3}) times the pattern consisting of 0 or more (*) times any character not being underscore ([^_]) and one underscore (_), the whole being followed by 0 or more times any character other than underscore ([^_]*, again).
Of course one could alternatively group the other way round, as in our case the pattern is symmetric:
^[^_]*(_[^_]*){3}$
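The same construction can be generated for any character and any count. Below is a minimal sketch in Python (an assumption, since the question names no language); has_exactly is a hypothetical helper, and re.escape is used so that special characters work as the counted character too:

```python
import re


def has_exactly(text, char, n):
    """True if `char` (a single character) occurs exactly n times in `text`."""
    c = re.escape(char)
    # Same shape as ^[^_]*(_[^_]*){3}$, with the character and count filled in.
    pattern = rf'[^{c}]*(?:{c}[^{c}]*){{{n}}}'
    return re.fullmatch(pattern, text) is not None


for s in ['a_b_c_d', 'a_b', 'a_b_c_d_e', '___']:
    print(s, has_exactly(s, '_', 3))
```

If a regex is not strictly required, comparing text.count(char) with n gives the same answer for a single character.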
This should do it:
^[^_]*_[^_]*_[^_]*_[^_]*$
If your examples are the only possibilities (like a_b_c_...), then the others are fine, but I wrote one that will handle some other possibilities. Such as:
a__b_adf
a_b_asfdasdfasfdasdfasf_asdfasfd
___
_a_b_b
Etc.
Here's my regex.
\b(_[^_]*|[^_]*_|_){3}\b

Simple Regex for upper and lower case letters, numbers, and a few symbols

How can I create a Regular expression to match the following characters:
A-Z a-z 0-9 " - ? . ', !
... as well as new lines and spaces
This will match any single one of those characters:
[A-Za-z0-9"?.',! \n\r-]
There's a good chance you want something like:
^[A-Za-z0-9"?.',! \n\r-]+$
Or possibly a bit simpler will meet your needs:
^[\w\s"?.',!-]+$
Remember that if this is inside a string literal, you will need to escape either the " or the ' in it (either by doubling it up, or by prefixing it with a backslash, depending on the language).
Also note that the - is last so that it is not treated as a range inside the character class. (Can also be placed first, or prefixed with backslash to prevent that).
The \w will match a "word" character, which is almost always [A-Za-z0-9_].
The \s will match a whitespace character (i.e. space, tab, newline, carriage return).
But really you need to give more context to what you're trying to do so people can suggest more fitting solutions.
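To see where the explicit class and the \w/\s shorthand diverge, here is a sketch in Python (chosen only for illustration; EXPLICIT, SHORTHAND and the sample strings are made up, and re.ASCII is an added assumption to keep \w to A-Z, a-z, 0-9 and _):

```python
import re

# Triple-quoted raw strings avoid having to escape " and ' inside the pattern.
EXPLICIT = re.compile(r'''[A-Za-z0-9"?.',! \n\r-]+''')
SHORTHAND = re.compile(r'''[\w\s"?.',!-]+''', re.ASCII)

samples = [
    'He said, "Isn\'t it 9 o\'clock?"\nYes - it is!',  # only requested characters
    'snake_case',   # underscore: allowed by \w only
    'tab\there',    # tab: allowed by \s only
    'a#b',          # '#' is in neither class
]
for s in samples:
    print(repr(s), EXPLICIT.fullmatch(s) is not None, SHORTHAND.fullmatch(s) is not None)
```

So the shorthand version is slightly more permissive than the list in the question: it also lets underscores and tabs through.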