Does \w match all alphanumeric characters defined in the Unicode standard? - regex

Does Perl's \w match all alphanumeric characters defined in the Unicode standard?
For example, will \w match all (say) Chinese and Russian alphanumeric characters?
I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.
#!/usr/bin/perl
use utf8;
binmode(STDOUT, ':utf8');
my #ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";
foreach my $ok (#ok) {
die unless ($ok =~ /^\w+$/);
}

perldoc perlunicode says
Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.
So it looks like the answer to your question is "yes".
However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.

Yes and no.
If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w contains both more and less than that. It specifically excludes any \pN which is not \p{Nd} nor \p{Nl}, like the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
Because unlike most regex systems, Perl complies with Requirement 1.2a, “Compatibility Properties” from UTS #18 on Unicode Regular Expressions, then assuming you have Unicode strings, a \w in a regex matches any single code point that has any of the following four properties:
\p{GC=Alphabetic}
\p{GC=Mark}
\p{GC=Connector_Punctuation}
\p{GC=Decimal_Number}
Number 4 above can be expressed in any of these ways, which are all considered equivalent:
\p{Digit}
\p{General_Category=Decimal_Number}
\p{GC=Decimal_Number}
\p{Decimal_Number}
\p{Nd}
\p{Numeric_Type=Decimal}
\p{Nt=De}
Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number} or \p{No}. It does, however, have the \p{Numeric_Value=2} property as you would imagine.
It’s really point number 1 above, \p{Alphabetic} ,that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.
Alphabetics include much more than that, all because of the \p{Other_Alphabetic} property, as this in turn
includes some but not all \p{GC=Mark}, all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}) and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}).
That’s how it pulls in \p{GC=Letter_Number} like Roman numerals and also
all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
Aren’t you glad we get to use \w? :)

In particular \w also matches the underscore character.
#!/usr/bin/perl -w
$name = 'Arun_Kumar';
($name =~ /\w+/)? print "Underscore is a word character\n": print "No underscores\n";
$ underscore.pl
Underscore is a word character.

Related

data mismatch with spark regexp_extract [duplicate]

In Java RegEx, how to find out the difference between .(dot) the meta character and the normal dot as we using in any sentence. How to handle this kind of situation for other meta characters too like (*,+,\d,...)
If you want the dot or other characters with a special meaning in regexes to be a normal character, you have to escape it with a backslash. Since regexes in Java are normal Java strings, you need to escape the backslash itself, so you need two backslashes e.g. \\.
Solutions proposed by the other members don't work for me.
But I found this :
to escape a dot in java regexp write [.]
Perl-style regular expressions (which the Java regex engine is more or less based upon) treat the following characters as special characters:
.^$|*+?()[{\ have special meaning outside of character classes,
]^-\ have special meaning inside of character classes ([...]).
So you need to escape those (and only those) symbols depending on context (or, in the case of character classes, place them in positions where they can't be misinterpreted).
Needlessly escaping other characters may work, but some regex engines will treat this as syntax errors, for example \_ will cause an error in .NET.
Some others will lead to false results, for example \< is interpreted as a literal < in Perl, but in egrep it means "word boundary".
So write -?\d+\.\d+\$ to match 1.50$, -2.00$ etc. and [(){}[\]] for a character class that matches all kinds of brackets/braces/parentheses.
If you need to transform a user input string into a regex-safe form, use java.util.regex.Pattern.quote.
Further reading: Jan Goyvaert's blog RegexGuru on escaping metacharacters
Escape special characters with a backslash. \., \*, \+, \\d, and so on. If you are unsure, you may escape any non-alphabetical character whether it is special or not. See the javadoc for java.util.regex.Pattern for further information.
Here is code you can directly copy paste :
String imageName = "picture1.jpg";
String [] imageNameArray = imageName.split("\\.");
for(int i =0; i< imageNameArray.length ; i++)
{
system.out.println(imageNameArray[i]);
}
And what if mistakenly there are spaces left before or after "." in such cases? It's always best practice to consider those spaces also.
String imageName = "picture1 . jpg";
String [] imageNameArray = imageName.split("\\s*.\\s*");
for(int i =0; i< imageNameArray.length ; i++)
{
system.out.println(imageNameArray[i]);
}
Here, \\s* is there to consider the spaces and give you only required splitted strings.
I wanted to match a string that ends with ".*"
For this I had to use the following:
"^.*\\.\\*$"
Kinda silly if you think about it :D
Heres what it means. At the start of the string there can be any character zero or more times followed by a dot "." followed by a star (*) at the end of the string.
I hope this comes in handy for someone. Thanks for the backslash thing to Fabian.
If you want to end check whether your sentence ends with "." then you have to add [\.\]$ to the end of your pattern.
I am doing some basic array in JGrasp and found that with an accessor method for a char[][] array to use ('.') to place a single dot.
I was trying to split using .folder. For this use case, the solution to use \\.folder and [.]folder didn't work.
The following code worked for me
String[] pathSplited = Pattern.compile("([.])(folder)").split(completeFilePath);

Regex: match string unless it contains a word [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (is whether it is desired)
xxxSTRING2xxx KO (is whether it is desired)
xxxSTRING3xxx KO (is whether it is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'.
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exits. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b\[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b\[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
I wish to complement the accepted answer and contribute to the discussion with my late answer.
#ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Regular Expression for validating username

I'm looking for a regular expression to validate a username.
The username may contain:
Letters (western, greek, russian etc.)
Numbers
Spaces, but only 1 at a time
Special characters (for example: "!##$%^&*.:;<>?/\|{}[]_+=-") , but only 1 at a time
EDIT:
Sorry for the confusion
I need it for cocoa-touch but i'll have to translate it for php for the server side anyway.
And with 1 at a time i mean spaces or special char's should be separated by letters or numbers.
Instead of writing one big regular expression, it would be clearer to write separate regular expressions to test each of your desired conditions.
Test whether the username contains only letters, numbers, ASCII symbols ! through #, and space: ^(\p{L}|\p{N}|[!-#]| )+$. This must match for the username to be valid. Note the use of the \p{L} class for Unicode letters and the \p{N} class for Unicode numbers.
Test whether the the username contains consecutive spaces: \s\s+. If this matches, the username is invalid.
Test whether symbols occur consecutively: [!-#][!-#]+. If this matches, the username is invalid.
This satisfies your criteria exactly as written.
However, depending on how the usernames have been written, perfectly valid names like "Éponine" may still be rejected by this approach. This is because the "É" could be written either as U+00C9 LATIN CAPITAL E WITH ACUTE (which is matched by \p{L}) or something like E followed by U+02CA MODIFIER LETTER ACUTE ACCENT (which is not matched by \p{L}.)
Regular-Expressions.info says it better:
Again, "character" really means "Unicode code point". \p{L} matches a
single code point in the category "letter". If your input string is à
encoded as U+0061 U+0300, it matches a without the accent. If the
input is à encoded as U+00E0, it matches à with the accent. The reason
is that both the code points U+0061 (a) and U+00E0 (à) are in the
category "letter", while U+0300 is in the category "mark".
Unicode is hairy, and restricting the characters in usernames is not necessarily a good idea anyway. Are you sure you want to do this?
The expression
^(\w| (?! )|["!##$%^&*.:;<>?/\|{}\[\]_+=\-")](?!["!##$%^&*.:;<>?/\|{}\[\]_+=\-")]))*$
will mostly do what you want, if your dialect support look-ahead assertions.
See it in action at RegExr.
Please ask yourself why you want to limit usernames in this way. Most of the time usernames starting with "!!" should be not an issue, and you annoy users if you reject their desired username.
Edit: \w does not match non-latin characters. To do this, replace \w with \p{L} wich may, or may not work depending on your regex implementation. Regexr unfortunately does not support it.
Try this:
^[!##$%^&*.:;<>?\/\|{}\[\]_+= -]?([\p{L}\d]+[!##$%^&*.:;<>?/\|{}\[\]_+= -]?)+$
See on rubular
You want something like
string strUserName = "BillYBob Stev#nS0&";
Regex regex = new Regex(#"(?i)\b(\w+\p{P}*\p{S}*\p{Z}*\p{C}*\s?)+\b");
Match match = regex.Match(strUserName);
If you want this explaining, let me know.
I hope this helps.
Note: This is case insensitive.
Since I don't know in what language you need this solution, I am providing answer in Java. It can be translated in any other platform:
String str = "à123 àà#bcà#";
String regex = "^([\\p{L}\\d]+[!##$%\\^&\\*.:;<>\\?/\\|{}\\[\\]_\\+=\\s-]?)+$";
Pattern p = Pattern.compile(regex);
matcher = p.matcher(str);
if (matcher.find())
System.out.println("Matched: " + matcher.group());
One assumption I made is that username will start with either an unicode letter or a number.

How to negate specific word in regex? [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (is whether it is desired)
xxxSTRING2xxx KO (is whether it is desired)
xxxSTRING3xxx KO (is whether it is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'.
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exits. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b\[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b\[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
I wish to complement the accepted answer and contribute to the discussion with my late answer.
#ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

RegEx for String.Format

Hiho everyone! :)
I have an application, in which the user can insert a string into a textbox, which will be used for a String.Format output later. So the user's input must have a certain format:
I would like to replace exactly one placeholder, so the string should be of a form like this: "Text{0}Text". So it has to contain at least one '{0}', but no other statement between curly braces, for example no {1}.
For the text before and after the '{0}', I would allow any characters.
So I think, I have to respect the following restrictions: { must be written as {{, } must be written as }}, " must be written as \" and \ must be written as \.
Can somebody tell me, how I can write such a RegEx? In particular, can I do something like 'any character WITHOUT' to exclude the four characters ( {, }, " and \ ) above instead of listing every allowed character?
Many thanks!!
Nikki:)
I hate to be the guy who doesn't answer the question, but really it's poor usability to ask your user to format input to work with String.Format. Provide them with two input requests, so they enter the part before the {0} and the part after the {0}. Then you'll want to just concatenate the strings instead of use String.Format- using String.Format on user-supplied text is just a bad idea.
[^(){}\r\n]+\{0}[^(){}\r\n]+
will match any text except (, ), {, } and linebreaks, then match {0}, then the same as before. There needs to be at least one character before and after the {0}; if you don't want that, replace + with *.
You might also want to anchor the regex to beginning and end of your input string:
^[^(){}\r\n]+\{0}[^(){}\r\n]+$
(Similar to Tim's answer)
Something like:
^[^{}()]*(\{0})[^{}()]*$
Tested at http://www.regular-expressions.info/javascriptexample.html
It sounds like you're looking for the [^CHARS_GO_HERE] construct. The exact regex you'd need depends on your regex engine, but it would resemble [^({})].
Check out the "Negated Character Classes" section of the Character Class page at Regular-Expressions.info.
I think your question can be answered by the regexp:
^(((\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*(\{0\}))+(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*$
Explanation:
The expression is built up as follows:
^(allowed chars {0})+(allowed chars)*$
one or more sequences of allowed chars followed by a {0} with optional allowed chars at the end.
allowed chars is built of the 4 sequences you mentioned (I assumed the \ escape is \\ instead of \.) plus all chars that do not contain the escapes chars:
(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])
combined they make up the regexp I started with.