How do you reference unicode characters in ColdFusion regex?

I'm trying to match this character ’ which I can type with alt-0146. Word tells me it's unicode 0x2019 but I can't seem to match it using regular expressions in ColdFusion. Here's the snippet I'm using to match between 2 and 10 letters, apostrophes, and this character:
[[:alpha:]'\x2019]{2,10}
but it's not working. Any ideas?

It looks like the \x shorthand in CF only supports character codes up to 255 (\xFF). To go above that number, you need to concatenate the character into the pattern with the chr() function, like this:
<cfscript>
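// The test string needs at least two matching characters to satisfy {2,10}.
yourString = "it’s";
// chr(8217) is U+2019 (8217 decimal = 2019 hex).
result = refind("[[:alpha:]'" & chr(8217) & "]{2,10}", yourString);
writeOutput(result);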
</cfscript>
That should give you a match.

Another thing you could try is directly including the character:
[[:alpha:]'#Chr(8217)#]{2,10}
However I'm not sure if that will work with a CF regex. If not, you still have the option to use Java regex within CF. This is easy to do, and enables you to use a far wider range of regex functionality, almost certainly including unicode support.
If you're doing replacements, you can do a Java Regex directly on a CF string, for example:
<cfset NewString = OrigString.replaceAll( 'ajavaregex' , 'replacement' )/>
For other functionality (e.g. getting an array of matches, callback functions on replace), I have created Java RegEx Utilities - a single component that wraps each of these in a single function call.

Related

Match Unicode character with regular expression

I can use regular expressions in VBA for Word 2019:
Dim RegEx As New RegExp
Dim Matches As MatchCollection
RegEx.Pattern = "[\d\w]+"
Text = "HelloWorld"
Set Matches = RegEx.Execute(Text)
But how can I match all Unicode characters and all digits too?
\p{L} works fine for me in PHP, but this doesn't work for me in VBA for Word 2019.
I would like to find words with characters and digits. So in PHP I use for this [\p{L}\p{N}]+. Which pattern can I use for this in VBA?
Currently, I would like to match words with German characters, like äöüßÄÖÜ. But maybe I need this for other languages too.
But how can I match all Unicode characters and all digits too?
"VBScript Regular Expressions 5.5" (which I am pretty sure you are using here) are not "VBA Regular Expressions", they are a COM library that you can use in - among other things - VBA. They do not support Unicode with the built-in metacharacters (such as \w) and they have no knowledge of Unicode character classes (such as \p{L}). But of course you can still match Unicode characters with them.
Direct Matches
The simplest way is of course to directly use the Unicode characters you search for in the pattern. VBA uses Unicode strings, so matching Unicode is not a problem per se. Representing Unicode in your VBA source code, which itself is not Unicode, is a different matter. But ChrW() can help with that.
Assuming you have a certain character you want to match,
RegEx.Pattern = ChrW(&h4E16) & ChrW(&h754C)
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
The above uses hex numbers (&h...) and ChrW() to create the Unicode characters U+4E16 and U+754C (世界) at run-time. When they are in your text, they will be found. This is tedious, but it works well if you already know what words you're looking for.
Ranges
If you want to match character ranges, you can do that as well. Use the start point and end point of the range. For example, the basic block of the "CJK Unified Ideographs" range goes from U+4E00 to U+9FFF:
RegEx.Pattern = "[" & ChrW(&h4E00) & "-" & ChrW(&h9FFF) & "]+"
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
So this creates a natural range just like [a-z]+ to span all of the CJK characters. You'd have to define which ranges you want to match, so it's less convenient than having built-in support, but nothing is stopping you.
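For comparison, the same range construction can be sketched in Python (used here only as an illustration, not the VBA from the answer). The helper name first_cjk_run is made up for this example; the character class is built from the two code points just like the ChrW() concatenation:

```python
import re

# Build "[<U+4E00>-<U+9FFF>]+" from the two code points,
# mirroring the ChrW(&h4E00) & "-" & ChrW(&h9FFF) concatenation.
CJK_RUN = re.compile("[" + chr(0x4E00) + "-" + chr(0x9FFF) + "]+")

def first_cjk_run(text):
    """Return the first run of characters in the range, or None (hypothetical helper)."""
    match = CJK_RUN.search(text)
    return match.group(0) if match else None
```

Calling first_cjk_run("Hello 世界!") picks out the two ideographs, since U+4E16 and U+754C both fall inside the range.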
Caveats
The above is about matching characters inside of the BMP (Basic Multilingual Plane). Matching characters outside of the BMP, such as emoji, is a lot more difficult because of the way these characters work in Unicode. It's still possible, but it's not going to be pretty.
There are multiple ways of representing the same character. For example, ä could be represented by its own, singular code-point, or by a followed by a second code-point for the dots (U+0308 "◌̈"). Since there is no telling how your input string represents certain characters, you should look into Unicode Normalization to make strings uniform before you search in them. In VBA this can be done by using the Win32 API.
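To make the normalization caveat concrete, here is a small Python sketch (Python is only used for illustration; in VBA you would go through the Win32 API as mentioned). The helper name to_nfc is hypothetical, while unicodedata.normalize is part of the Python standard library:

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point
decomposed = "a\u0308"   # a followed by U+0308 COMBINING DIAERESIS

def to_nfc(text):
    """Normalize to NFC so both spellings compare equal (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)
```

The two strings look identical on screen but differ as code point sequences; after to_nfc() both are the single code point U+00E4, so a pattern written for one form also finds the other.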
Helpers
You can research Unicode ranges manually, but since there are so many of them, it's easy to miss some. I remember a useful helper for manually picking Unicode ranges, which now still lives on the Internet Archive: http://web.archive.org/web/20191118224127/http://kourge.net/projects/regexp-unicode-block
It allows you to quickly build regexes that span multiple ranges. It's aimed at JavaScript, but it's easy enough to adapt the output for VBA code.

word start with uppercase (unicode) in laravel validation [duplicate]

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.
I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
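The difference the u modifier makes can be imitated in Python, used here purely as an analogy: a bytes pattern looks at the UTF-8 bytes one at a time (roughly what PCRE does without u), while a str pattern looks at whole characters. Python's standard re module has no \p{L}, so letters are approximated by [^\W\d_] (an assumption that is close to, but not identical with, \p{L}); the function names are hypothetical.

```python
import re

# "Letter" approximated as: a word character that is neither a digit nor "_".
NAME_AS_TEXT = re.compile(r"(?:[^\W\d_]|['\- ])+")
NAME_AS_BYTES = re.compile(rb"(?:[^\W\d_]|['\- ])+")

def is_name_text(name):
    """Match against decoded characters (comparable to using /u)."""
    return NAME_AS_TEXT.fullmatch(name) is not None

def is_name_bytes(name):
    """Match against raw UTF-8 bytes (comparable to omitting /u)."""
    return NAME_AS_BYTES.fullmatch(name.encode("utf-8")) is not None
```

Both functions accept plain ASCII names, but only the text version accepts Ă or 张, because in byte mode the multi-byte sequences are not recognized as letters.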
If you want to replace text that contains Unicode characters, you should write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
So the key here is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5,
as mentioned on php.net | Pattern Modifiers:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me that key in "preg_replace match whole word in arabic".
I tried it and it worked on localhost, but when I tried it on the remote server it didn't work. Then I found the php.net note quoted above about PHP 4.3.5, upgraded the PHP version, and it worked.
It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
First of all, your life would be a lot easier if you used single quotes instead of double quotes when writing these patterns: you then need only one backslash. Second, combining marks (\pM) should also be included. If you find a character that is not matched, find out its Unicode code point and then use http://www.fileformat.info/info/unicode/ to figure out which category it is in. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when debugging UTF-8 properties (don't forget to convert to hex before trying to look up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu and so L should match it and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also isLetter and indeed matches for me. Do you have the Unicode character tables compiled in?
Anyone else looking here and not getting this to work, please note that /u will not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Incosistent regex result for Thai characters across different PHP version
<?php preg_match('/[a-zığüşöç]/u',$title) ?>

How to escape regular expression characters from variable in JMeter?

The problem is simple. I have a regular expression used to extract some data from a response. It looks like this:
<input type="hidden" +name="reportpreset_id" +value="(\w+)" *>${reportPresetName}</td>
The problem is that the variable ${reportPresetName} may contain characters used by regular expressions, like parentheses or dots.
I've tried to surround this variable with \Q and \E (based on this) but apparently these markers don't work (Java apparently supports these markers, so I'm confused).
When I add those markers, the expression fails for any content of the ${reportPresetName} variable (even for cases where it was working without those markers).
I've checked the list of functions in JMeter, but I didn't find anything useful.
Does anyone know how to escape regular expression characters in JMeter?
update:
When I use \Q and \E in an assertion, it fails. When I copy the regular expression from the assertion log in "View Results Tree" and test it on the recorded response data, it works! So it looks like some kind of bug in JMeter.
JMeter uses Jakarta ORO as its regexp engine in the Regexp Extractor and Regexp Tester:
http://jmeter.apache.org/usermanual/regular_expressions.html
But it uses the Java regexp engine for search in the HTML/Text viewer.
Read:
http://jmeter.apache.org/usermanual/regular_expressions.html section 20.4
Please note that ORO does not support the \Q and \E meta-characters.
[In other RE engines, these can be used to quote a portion of an RE so that the
meta-characters stand for themselves.]
A solution for you would be to add a JSR223 post processor using Groovy after regexp that extracts the var and escapes regexp chars using:
org.apache.oro.text.regex.Perl5Compiler.quotemeta(String valueToEscape)
As of upcoming version 2.9, a new function has been created to do so:
__escapeOroRegexpChars(String to escape, Variable Name)
\Q and \E work in Java, see Pattern.
In Java source code, backslashes in string literals have to be doubled, though, so you might need to use (\\w+) and, of course, \\Q and \\E.
I am not sure in your case, as I don't understand your context, actually (never used JMeter so far).
In case JMeter does not support \\Q and \\E (I don't know whether it does), you can write your own function that goes through the string character by character and escapes each one as follows:
if the character is \, replace it with \\\\
if it is any other character that is not a letter or digit, prefix it with \\
leave letters and digits unchanged, because a backslash in front of them changes their meaning (\\s is whitespace, \\1 is a back-reference)
This is not the optimal method, but it will work as needed.
For example, for the input
This is-a\string 12&$34|!`^5
you will get
This\\ is\\-a\\\\string\\ 12\\&\\$34\\|\\!\\`\\^5
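The idea of escaping a variable before splicing it into a pattern can be sketched in Python, used here only as an illustration (in JMeter itself you would use the ORO quotemeta call or the function mentioned earlier). re.escape is a real standard-library function that never puts a backslash in front of letters or digits; build_pattern, the sample response and the preset name are made up for this example.

```python
import re

def build_pattern(preset_name):
    """Splice an escaped variable into the extraction regex (hypothetical helper)."""
    # re.escape() neutralizes metacharacters such as ( ) and . in the variable.
    return re.compile(r'value="(\w+)" *>' + re.escape(preset_name) + r"</td>")

# Sample data invented for the illustration.
response = '<input type="hidden" name="reportpreset_id" value="abc123" >Report (v1.0)</td>'
match = build_pattern("Report (v1.0)").search(response)
```

Without the escaping step, the parentheses in the preset name would be read as a capturing group and the literal text in the response would no longer match.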

Check string for email with regular expressions or other way

I've tried the following code, but it gives me nomatch.
re:run("qw@qc.com", "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b").
I got the regexp from http://www.regular-expressions.info/email.html
EDITED:
The following doesn't work either:
re:run("345345", "\b[0-9]+\b").
If the string contains nothing but an email address, then this one will match:
re:run("qw@qc.com", "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$").
I hesitate to answer this question, since I believe it relies on an incorrect assumption - that you can determine whether an email address is valid or not with a regular expression. See this question for more details; from a short glance I'd note that the regexp in your question doesn't accept the .museum and .рф top-level domains.
That said, you need to escape the backslashes. You want the string to contain backslashes, but in Erlang, backslashes are used inside strings to escape various characters, so any literal backslash needs to be written as \\. Try this:
3> re:run("qw@qc.com", "\\b[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,4}\\b").
{match,[{0,9}]}
Or even better, this:
8> re:run("qw@qc.com", "\\b[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)*\\b").
{match,[{0,9}]}
That's the regexp used in the HTML 5 standard, modified to use \\b instead of ^ and $.
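The backslash problem is not specific to Erlang. As an illustration in Python (whose ordinary string literals also turn "\b" into a backspace character), the sketch below compares the unescaped and escaped versions of the number pattern from the question; the variable and function names are made up for this example.

```python
import re

unescaped = "\b[0-9]+\b"    # "\b" becomes a backspace character (0x08) in the string
escaped = "\\b[0-9]+\\b"    # the regex engine receives \b, a word boundary

def matches(pattern, text):
    """True if the pattern is found anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None
```

With the unescaped literal the engine looks for digits surrounded by backspace characters, so "345345" is not matched; with the doubled backslashes it is.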
It looks like you need a case-insensitive match.
Currently [A-Z0-9._%+-] (for example) only matches upper-case characters (plus numbers etc).
One solution is to specify [A-Za-z]. Another solution is to convert your email address to uppercase prior to matching.

internationalized regular expression in postgresql

How can I write regular expressions to match names like 'José' in postgres? In other words, I need to set up a constraint to check that only valid names are entered, but I want to allow unicode characters as well.
"Regular expressions, unicode style" has some reference on this. But it seems I can't write it in postgres.
If it is not possible to write a regex for this, will it be sufficient to check only on the client side using javascript?
PostgreSQL doesn't support character classes based on the Unicode Character Database like .NET does. You get the more-standard [[:alpha:]] character class, but this is locale-dependent and probably won't cover it.
You may be able to get away with just blacklisting the ASCII characters you don't want, and allowing all non-ASCII characters. eg something like
[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+
(JavaScript doesn't have non-ASCII character classes either. Or even [[:alpha:]].)
For example, given v_text as a text variable to be sanitized:
-- Allow internationalized text characters and remove undesired characters
v_text = regexp_replace( lower(trim(v_text)), '[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+', '', 'g' );