Blogger weird behavior with Japanese brackets - regex

I'm seeing weird behavior from Blogger. The code works fine when I test it locally, but Blogger seems to skip Japanese (full-width) brackets: （） in my code.
I need to remove them with a simple regex:
.replace(/\（/g,'').replace(/\）/g,'')
(I tried it without the backslash as well; it works locally, but on Blogger the brackets are skipped in both cases.)
It seems to work well with other Japanese characters; the only problem I've encountered so far is the brackets. I'm looking for a solution or workaround for this specific case, but I'm also interested in a more detailed explanation of why it happens.

Instead of the literal brackets, put their Unicode escapes in the pattern.
In JavaScript (and many other regex engines), the format is:
\uFFFF
where FFFF is the hex value of the Unicode code point.
In this case, the Japanese (full-width) opening bracket is U+FF08 and the closing bracket is U+FF09.
So replace:
\（ and \）
with:
\uFF08 and \uFF09
in your replace() regex.
Good luck!
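As a minimal sketch of the escape-based approach above: the helper name stripFullWidthParens and the sample string are made up for illustration, and the two replace() calls are folded into one character class. Because the pattern contains only ASCII characters, it does not depend on how the page or template stores the literal brackets.

```javascript
// Remove full-width parentheses (U+FF08, U+FF09) using Unicode escapes,
// so the pattern itself contains only ASCII characters.
// stripFullWidthParens is a hypothetical helper name.
function stripFullWidthParens(text) {
  return text.replace(/[\uFF08\uFF09]/g, '');
}

// Sample input, also written with escapes: kanji, bracket, kanji, bracket.
const sample = '\u6771\u4EAC\uFF08\u65E5\u672C\uFF09';
console.log(stripFullWidthParens(sample));
```

ASCII parentheses are left untouched by this pattern; add ( and ) to the character class if those should go too.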

Related

word start with uppercase (unicode) in laravel validation [duplicate]

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.
I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
If you want to replace an old pattern with a new one in Unicode text, write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
The key here is the u modifier.
Note: your server's PHP version should be at least 4.3.5, as mentioned on php.net | Pattern Modifiers:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me this hint in "preg_replace match whole word in arabic".
I tried it and it worked on localhost but not on the remote server. I then found the PHP 4.3.5 note on php.net, upgraded PHP, and it worked.
This is especially helpful for Arabic (عربي) text because, as I believe, Unicode is the best encoding for Arabic, and the replacement will not work without the u modifier. The following example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
First of all, your life would be a lot easier if you used single quotes instead of double quotes when writing these patterns: you then need only one backslash. Second, combining marks (\pM) should also be included. If you find a character that is not matched, find its Unicode code point and look it up at http://www.fileformat.info/info/unicode/ to see which category it belongs to. I found http://hsivonen.iki.fi/php-utf8/ invaluable when debugging UTF-8 properties (don't forget to convert to hex before looking anything up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă is http://www.fileformat.info/info/unicode/char/0102/index.htm, which is in category Lu, so \p{L} should match it, and it does for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm, which is also a letter and also matches for me. Do you have the Unicode character tables compiled in?
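The two points made in these answers (Unicode property classes need Unicode mode, and combining marks need \p{M}) can be sketched in JavaScript, where the u flag plays a role comparable to PCRE's u modifier. This is an analogue for illustration, not the PHP code under discussion; the names lettersOnly, lettersAndMarks and isValidName are assumptions.

```javascript
// Letters, apostrophe, space, hyphen. \p{L} is only recognized with the u flag.
const lettersOnly = /^[\p{L}' -]+$/u;

// Same class plus combining marks, so decomposed accents are accepted too.
const lettersAndMarks = /^[\p{L}\p{M}' -]+$/u;

// Hypothetical validator built on the more permissive pattern.
function isValidName(name) {
  return lettersAndMarks.test(name);
}

const decomposed = 'e\u0301'; // "e" followed by U+0301 COMBINING ACUTE ACCENT

console.log(isValidName('\u0102'));        // Ă (U+0102) -> true
console.log(isValidName('\u5F20'));        // 张 (U+5F20) -> true
console.log(lettersOnly.test(decomposed)); // false: U+0301 is a mark, not a letter
console.log(isValidName(decomposed));      // true
```

Whether names should also be normalized (for example to NFC) before validation is a separate design choice that this sketch does not make.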
For anyone else looking here who can't get this to work: note that /u does not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Inconsistent regex result for Thai characters across different PHP version
<?php preg_match('/[a-zığüşöç]/u',$title) ?>

REGEX: Special Characters detected for [[a-zA-Z0-9]]

I built a filter with the rule: [[a-zA-Z0-9]]
The intention was to keep content containing at least one letter or digit and to remove content consisting only of special characters, for example "?" and ":)" or any emoji.
So far it has worked well, but I noticed that the rule does not work for symbols whose code point starts with "U+1": they seem to be recognized as a letter or a digit. Other special letters/symbols starting with "U+0", for example ◌́, work as intended.
Can somebody please explain the reason for this?
Thanks

CQ5 textfield validation with regex

I have a simple CQ dialog with a textfield. The authors somehow managed to paste illegal characters into it, the last two times it was a vertical tab (VT) copied from a PowerPoint file.
I played around with some regex and came up with the following to exclude anything below SPACE and DEL:
/^[^\0-\x1F\x7F]*$/
Sadly I can't really test the vertical tab as I am not able to enter this character on regex101. So I tried it with TAB and this seems to be working: https://regex101.com/r/yH0lN5/1
But if I use this in my regex property of the textfield, no matter what I enter the validation fails. Any idea what I am doing wrong?
Whitelisting isn't an option, as I need to support Unicode characters such as Chinese in the future.
You should double the backslashes to make sure they are treated as literal backslashes by the regex engine.
Also, I suggest using consistent notation, and replace \0 with \x00:
regex="/^[^\\x00-\\x1F\\x7F]*$/"
This regex matches entire strings that contain zero or more characters (due to *) other than (due to the negated character class [^...]) those from NUL to US ([\x00-\x1F]) and the DEL character (\x7F).
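Since the vertical tab is hard to type into an online tester, the pattern can be checked in plain JavaScript instead. This is only a sketch of the regex itself: how CQ5 reads the regex property is not shown in the thread, so the string-built variant below merely illustrates what the doubled backslashes are meant to deliver to the regex engine. The variable names are made up.

```javascript
// Pattern from the answer: only characters outside NUL..US (0x00-0x1F) and DEL (0x7F).
const noControlChars = /^[^\x00-\x1F\x7F]*$/;

// The same pattern built from a string. Inside a string literal each backslash
// is doubled, so the regex engine receives \x00, \x1F and \x7F as escapes.
const fromString = new RegExp("^[^\\x00-\\x1F\\x7F]*$");

const verticalTab = String.fromCharCode(0x0B); // the VT pasted from PowerPoint

console.log(noControlChars.test('Title' + verticalTab + 'text')); // false
console.log(fromString.test('Title' + verticalTab + 'text'));     // false
console.log(noControlChars.test('Title \u6807\u9898'));           // true
```

Note that TAB, LF and CR also fall inside the excluded range, so multi-line input is rejected as well.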

Regex accepting alphabets from languages like Å,Ø or र

I need a regex that accepts everything except whitespace, followed by # (exactly one occurrence), and then everything except whitespace again.
E.g. abc#abc
By "everything" I mean all characters, including letters from Norwegian or other Nordic alphabets, or from any other language.
I tried this...
^\S[^#]+\b#\b\S[^#]+$
but this would fail for characters like Ø#Ø, Å#Å or र#र...
Edit-I want this for javascript...
Try something easier like:
^[^#\s]+#[^#\s]+$
?
[^#\s]+ matches everything except # or spaces.
First of all: did you try running this regex on "a#a", "b#b", "c#c"? It fails for those too :).
Your regex expects one non-space character and then at least one non-# character before the #.
A corrected version of your regex would be:
^\S+\b#\S+$
Note that in JavaScript \b is defined in terms of ASCII word characters only, so this version still rejects Ø#Ø, and \S+ also lets additional # characters through.
The other thing that may be affecting your results is the encoding of the script file that contains the regex. If it isn't Unicode, there may be problems, but I'm not sure. What are you using to run the regex? npp? php?
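Since the question is about JavaScript, the two suggested patterns can be compared directly. A small sketch, with variable names and sample strings chosen for illustration:

```javascript
// First suggestion: one or more non-#, non-space characters on each side of a single #.
const singleHash = /^[^#\s]+#[^#\s]+$/;

// Second suggestion: relies on \b, which in JavaScript only knows ASCII word characters.
const withBoundary = /^\S+\b#\S+$/;

const samples = ['abc#abc', '\u00D8#\u00D8', '\u00C5#\u00C5', '\u0930#\u0930', 'a#b#c', 'a b#c', '#abc'];

for (const s of samples) {
  console.log(s, singleHash.test(s), withBoundary.test(s));
}
```

The character-class version accepts Ø#Ø, Å#Å and र#र and rejects a#b#c. The \b version rejects the three non-ASCII samples, because neither Ø nor # is an ASCII word character and so there is no boundary between them, and it accepts a#b#c.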

Regex to match any strings containing Cyrillic symbols, except comments marked with //, ///, ///, etc

I want to find all strings containing at least 1 Cyrillic character (basically /.*[А-я].*/) but with exception of comments.
Comment is a string or part of a string which starts with 2 or more / characters.
Currently I have this regex, which does part of the trick:
^(?=^.*?[А-я]+).*?((?=[\/]{2,})|(^(?:(?![\/]{2,}).)*$))
But I'd like to get less bloated and faster expression.
And as additional question: could anyone explain why this one is working? I combined it by trial-and-error but I'm not sure I completely understood how it works, because when I try to change it in any part - it stops working.
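The following regex matches any Cyrillic character that is not preceded by a double forward slash:
(?<!/{2}.*)[А-я]
It uses a negative lookbehind to require that no double slash appears anywhere earlier on the line.
You haven't specified which regex flavour you're using, so be aware that not every flavour accepts this. Your own regex already uses three lookaheads, so lookarounds in general are presumably available, but lookbehind is less widely supported, and many engines that have it (PCRE, for example) require the lookbehind to have a fixed length, which the .* here does not. .NET and current JavaScript engines accept it.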
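As a sketch of how the lookbehind behaves, here it is in JavaScript, chosen only because its current engines accept an unbounded lookbehind; the question does not say which flavour is actually in use. The function name and sample lines are made up, and each line is tested separately because . does not cross line breaks.

```javascript
// A Cyrillic character with no "//" anywhere before it on the same line.
// The slashes are escaped only because of JavaScript's /.../ literal syntax.
const cyrillicOutsideComment = /(?<!\/{2}.*)[А-я]/;

// Hypothetical helper: true if the line has Cyrillic text outside a // comment.
function hasCyrillicOutsideComment(line) {
  return cyrillicOutsideComment.test(line);
}

console.log(hasCyrillicOutsideComment('var имя = 1; // note'));   // true
console.log(hasCyrillicOutsideComment('int x; // комментарий'));  // false
console.log(hasCyrillicOutsideComment('/// только комментарий')); // false
console.log(hasCyrillicOutsideComment('plain ascii'));            // false
```

A // inside a string literal (a URL, for instance) is treated as the start of a comment by this pattern, just as in the original.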