word start with uppercase (unicode) in laravel validation [duplicate] - regex

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.

I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
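For comparison, JavaScript has the same requirement: \p{L} is only treated as a Unicode property when the u flag is set. Here is a minimal sketch of the same validator in JavaScript rather than PHP; the names NAME_PATTERN and isValidName are made up for illustration and are not part of the question.

```javascript
// Unicode letters, apostrophe, hyphen, space (same character set as the PHP pattern).
// The `u` flag is what makes \p{L} mean "any Unicode letter".
const NAME_PATTERN = /^[-' \p{L}]+$/u;

// Hypothetical helper: true if `name` consists only of the allowed characters.
function isValidName(name) {
  return NAME_PATTERN.test(name);
}

console.log(isValidName("Anne-Marie O'Neil")); // true
console.log(isValidName('\u0102'));            // true (Ă)
console.log(isValidName('\u5F20'));            // true (张)
console.log(isValidName('R2-D2'));             // false: digits are not letters
```

The characters are written as \u escapes here only so that the example does not depend on the encoding of the source file.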

To replace a whole word or phrase in Unicode text, you should write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
So the key here is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5,
as mentioned at php.net | Pattern Modifiers:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me that key in "preg_replace match whole word in arabic".
I tried it and it worked on localhost, but it did not work on the remote server. I then found on php.net that support for the u modifier depends on the PHP version (4.3.5, as quoted above), so I upgraded PHP and it worked.
It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
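If you need the same whole-word replacement in JavaScript, note that \b there is based on ASCII word characters even with the u flag, so it cannot be used around Arabic text. A rough sketch of one workaround is to emulate the boundary with lookarounds on Unicode properties. The helper names escapeRegExp and replaceWholeWord are my own, and treating letters, marks, digits and underscore as "word characters" is an assumption, not a rule from PCRE.

```javascript
// Escape regex metacharacters so the search phrase is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace `phrase` only where it is not glued to another letter, mark,
// digit or underscore (an emulation of \b for non-ASCII text).
function replaceWholeWord(text, phrase, replacement) {
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  const re = new RegExp(
    `(?<!${wordChar})${escapeRegExp(phrase)}(?!${wordChar})`,
    'gu'
  );
  // A replacer function keeps "$" in the replacement from being interpreted.
  return text.replace(re, () => replacement);
}

console.log(replaceWholeWord('مرحبا بك يا صديقي', 'مرحبا بك', 'NEW'));
```

Lookbehind needs a reasonably recent JavaScript engine, so check your target environment before relying on it.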

First of all, your life would be a lot easier if you used single quotes instead of double quotes when writing these patterns: you then need only one backslash. Second, combining marks (\pM) should also be included. If you find a character that is not matched, find out its Unicode code point and then use http://www.fileformat.info/info/unicode/ to figure out which category it is in. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when debugging UTF-8 properties (don't forget to convert to hex before trying to look it up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu, so L should match it, and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also a letter (isLetter), and indeed it matches for me. Do you have the Unicode character tables compiled in?
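To make the two points above concrete (looking up code points in hex, and including combining marks), here is a small sketch in JavaScript rather than PHP. The helper name codePointsHex is invented for this example; it plays the role of the array_map('dechex', ...) call.

```javascript
// List the code points of a string in hex, ready to look up in a Unicode table.
function codePointsHex(text) {
  return Array.from(text, (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const lettersOnly = /^\p{L}+$/u;
const lettersAndMarks = /^[\p{L}\p{M}]+$/u;

const precomposed = '\u0102'; // Ă as a single code point
const decomposed = 'A\u0306'; // A followed by COMBINING BREVE

console.log(codePointsHex(precomposed + '\u5F20')); // [ '0102', '5F20' ]
console.log(lettersOnly.test(precomposed));         // true
console.log(lettersOnly.test(decomposed));          // false: U+0306 is a mark, not a letter
console.log(lettersAndMarks.test(decomposed));      // true
```

So a name typed in decomposed form only passes if the character class also allows marks.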

For anyone else looking here and not getting this to work: please note that /u does not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Incosistent regex result for Thai characters across different PHP version

<?php preg_match('/[a-zığüşöç]/u',$title) ?>

Related

Unexpected regex results with polytonic Greek capitals

I am trying to select only capital letters in polytonic Greek text using regex. The specific application is PHP, but I had trouble with it so I started playing around with it in RegExr:
https://regexr.com/6ellt
([Α-ΩΗΙΟΥΩᾼῌῼΡΆΈΉΊΌΎΏᾺῈῊῚῸῪῺἈἘἨἸὈὨᾈᾘᾨἌἜἬἼὌὬᾌᾜᾬἊἚἪἺὊὪᾊᾚᾪἎἮἾὮᾎᾞᾮἉἙἩἹὉὙὩᾉᾙᾩῬἍἝἭἽὍὝὭᾍᾝᾭἋἛἫἻὋὛὫᾋᾛᾫἏἯἿὟὯᾏᾟᾯΪΫᾹῙῩᾸῘῨ])
When the JavaScript engine is selected, the behaviour is as expected. However, if I select PCRE not only are capital letters selected, but also a bunch of seemingly random lowercase letters.
Can anyone shed some light on what is going on here? Is this a bug? Is there a way to get the desired result using the PCRE engine?
You need to tell the PCRE regex engine the input is to be parsed as a Unicode string.
In a PCRE regex, you can prepend the pattern with the (*UTF) verb. The pattern (*UTF)[Α-ΩΗΙΟΥΩᾼῌῼΡΆΈΉΊΌΎΏᾺῈῊῚῸῪῺἈἘἨἸὈὨᾈᾘᾨἌἜἬἼὌὬᾌᾜᾬἊἚἪἺὊὪᾊᾚᾪἎἮἾὮᾎᾞᾮἉἙἩἹὉὙὩᾉᾙᾩῬἍἝἭἽὍὝὭᾍᾝᾭἋἛἫἻὋὛὫᾋᾛᾫἏἯἿὟὯᾏᾟᾯΪΫᾹῙῩᾸῘῨ] highlights the correct matches.
However, you can also make it a bit shorter with
(*UTF)(?=\p{Lu})\p{Greek}
Here,
(*UTF) - a PCRE verb telling the PCRE engine the input is a Unicode string
(?=\p{Lu}) - a positive lookahead requiring the next char to be an uppercase char
\p{Greek} - a Greek char.
Note: if your PCRE host environment supports a u flag, that is most probably the way to go (as in PHP, /(?=\p{Lu})\p{Greek}/u).
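The same lookahead idea can be tried in JavaScript, where the script property is spelled \p{Script=Greek}. This is only a sketch, and the variable names are mine. One thing worth checking against your data: a few of the polytonic capitals in the original class (for example ᾼ, U+1FBC) are in the titlecase category Lt rather than Lu, so the second pattern below also accepts Lt.

```javascript
// Uppercase Greek letters: the lookahead checks the category,
// \p{Script=Greek} checks the script.
const greekUpper = /(?=\p{Lu})\p{Script=Greek}/gu;

// Variant that also accepts titlecase letters (Lt), e.g. U+1FBC.
const greekUpperOrTitle = /(?=[\p{Lu}\p{Lt}])\p{Script=Greek}/gu;

// "Ἀθῆναι καὶ Σπάρτη ᾼ", written with escapes to avoid encoding surprises.
const sample =
  '\u1F08\u03B8\u1FC6\u03BD\u03B1\u03B9 \u03BA\u03B1\u1F76 ' +
  '\u03A3\u03C0\u03AC\u03C1\u03C4\u03B7 \u1FBC';

console.log(sample.match(greekUpper));        // the two Lu capitals: Ἀ and Σ
console.log(sample.match(greekUpperOrTitle)); // additionally ᾼ
```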

RegEx: A way to handle both English and non-English characters (and my solution)

I would like to know if there is a recommended RegEx pattern to match both English and non-English characters. So far I have come up with [^\x00-\x7F]+|[a-zA-Z'-]* based on the answer provided at SO. My solution seems to work, but since I am very new to RegEx I would like to ask you to check this pattern and suggest some improvements. I am aware of most solutions that touch on this subject, like this, but I don't think there is already a good RegEx for this.
The answer depends mostly on the language. But in general, you have to enable the "Unicode flag" (in many engines by prepending (?u) to your regex, or by appending the u modifier as in /.../u) and use Unicode strings. In engines where the flag affects the shorthand classes, \w, \s and others will then match the corresponding Unicode characters.
An example in Python 2 (Python 3 uses unicode by default):
>>> re.match('\w', 'è') # byte string, no unicode flag: no match
>>> re.match('(?u)\w', u'è') # unicode string and unicode flag: match
<_sre.SRE_Match object at 0x7f258bac07e8>
>>> re.match('\w', u'è', re.UNICODE) # another way to enable the unicode flag
<_sre.SRE_Match object at 0x7f258bac0850>
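As an illustration of how much this depends on the language: in JavaScript the u flag does not widen \w, so you would reach for a Unicode property such as \p{L} instead. A minimal sketch follows; the pattern named word is just one possible replacement for the pattern in the question, not the only reasonable one.

```javascript
const accented = '\u00E8'; // è

console.log(/^\w$/.test(accented));     // false: \w is [A-Za-z0-9_] in JavaScript
console.log(/^\w$/u.test(accented));    // false: the u flag does not change \w
console.log(/^\p{L}$/u.test(accented)); // true

// Letters of any script, plus apostrophe and hyphen.
const word = /^[\p{L}'-]+$/u;
console.log(word.test("d'Arc"));  // true
console.log(word.test('\u5F20')); // true (张)
console.log(word.test('abc123')); // false
```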

Using "#" in regular expression with VB.NEt

Assuming I have to check whether "#" exists in a given string, should I put a backslash before it or not? So far I have found that both work for me, but I'm not sure whether that holds on every Windows host (this is part of a VB.NET application that has to work worldwide).
The string: Hello #world
Pattern1: Hello #world
Pattern 2: Hello \#world
Which one should I use to get the most precise matching? pattern1 or pattern2?
I work with VB.NET on VS2010 (.NET FW 3.5)
Thank you
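# is not a special regex character, at least not in VB.NET with the default options, which means that both patterns are equivalent and you can use whichever you prefer. For readability's sake you should probably stick to the pattern without the backslash. The one exception is RegexOptions.IgnorePatternWhitespace: with that option an unescaped # starts a comment, so there the backslash is needed.
You can find the complete list of special regex characters in .NET here.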
I would suggest leaving this to the Regex engine: just use its Regex.Escape function, which escapes whatever needs escaping.

Blogger weird behavior with Japanese brackets

I'm experiencing a weird behavior from Blogger. The code works fine when I test it locally, but Blogger seems to skip Japanese brackets: () in my code.
I need to remove them, with a simple regex:
.replace(/\(/g,'').replace(/\)/g,'')
(I tried without using the backslash as well, it works locally, and omits brackets on Blogger in both cases.)
It seems to work well with other Japanese characters though, the only problem I've encountered so far are brackets. I'm looking for both solution/cheat/workaround for this specific case, but I'm also interested in more detailed information about why it happens.
Instead of the literal brackets, put their Unicode escapes in the pattern.
In most regex engines, the format is:
\uFFFF
where FFFF is the hex value of the Unicode character.
In this case, the Japanese (full-width) opening bracket is U+FF08 and the closing bracket is U+FF09.
So replace:
\( and \)
with:
\uFF08 and \uFF09
in your replace() regex.
Good Luck!
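Put together, the suggestion above could look like the following sketch. The function name stripFullWidthParens is invented for this example, and a single character class is used instead of two chained replace() calls.

```javascript
// Remove full-width parentheses. They are written as \u escapes so that
// nothing between your editor and the page can alter the characters.
function stripFullWidthParens(text) {
  return text.replace(/[\uFF08\uFF09]/g, '');
}

// Input is "こんにちは（世界）", output is "こんにちは世界".
console.log(stripFullWidthParens('\u3053\u3093\u306B\u3061\u306F\uFF08\u4E16\u754C\uFF09'));

// ASCII parentheses are different characters and are left alone.
console.log(stripFullWidthParens('a(b)')); // a(b)
```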

How do you reference unicode characters in ColdFusion regex?

I'm trying to match this character ’ which I can type with alt-0146. Word tells me it's Unicode 0x2019, but I can't seem to match it using regular expressions in ColdFusion. Here's a snippet I'm using to match between 2 and 10 letters, apostrophes and this character:
[[:alpha:]'\x2019]{2,10}
but it's not working. Any ideas?
It looks like the \x shorthand in CF only supports character codes up to 255 (two hex digits). In order to go above that number, you need to use the chr command inline like this:
<cfscript>
yourString = "it’s"; // at least two matching characters are needed for {2,10}
result = refind("[[:alpha:]'" & chr(8217) & "]{2,10}", yourString);
writeOutput(result);
</cfscript>
That should give you a match.
Another thing you could try is directly including the character:
[[:alpha:]'#Chr(8217)#]{2,10}
However I'm not sure if that will work with a CF regex. If not, you still have the option to use Java regex within CF. This is easy to do, and enables you to use a far wider range of regex functionality, almost certainly including unicode support.
If you're doing replacements, you can do a Java Regex directly on a CF string, for example:
<cfset NewString = OrigString.replaceAll( 'ajavaregex' , 'replacement' )/>
For other functionality (e.g. getting an array of matches, callback functions on replace), I have created Java RegEx Utilities, a single component that wraps this functionality in single function calls.
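For what it's worth, in regex flavors that accept four-digit \uXXXX escapes (Java's does, and so does JavaScript's) the character can be written straight into the class. Below is a minimal JavaScript sketch of the pattern from the question, with \p{L} standing in for [[:alpha:]]. The name wordWithApostrophes is made up, and the ^ and $ anchors are my addition so that the whole string is checked.

```javascript
// chr(8217) in ColdFusion and \u2019 refer to the same code point.
console.log(0x2019 === 8217); // true

// 2 to 10 letters, straight apostrophes or right single quotation marks.
const wordWithApostrophes = /^[\p{L}'\u2019]{2,10}$/u;

console.log(wordWithApostrophes.test('it\u2019s')); // true
console.log(wordWithApostrophes.test("it's"));      // true
console.log(wordWithApostrophes.test('\u2019'));    // false: only one character
```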