Regular expression to allow some characters and text symbols - regex

I've made a regex to validate my form and guarantee that only certain characters are allowed. For 'normal' characters such as abcde... it works, but I can't make it work with text symbols like ✈ or ❤.
Here is what I use to allow characters:
^[\x00-\x21\x23-\x26\x28-\x3a\x3d\x3f-\xff\xa8\xE2\x9C\x88]
And to allow text symbols, I tried this:
^[\x00-\x21\x23-\x26\x28-\x3a\x3d\x3f-\xff\xa8\xE2\x9C\x88\x2708\x2764]
I checked https://www.branah.com/ascii-converter to get the hex codes, and I saw that text symbols are a mix of "something" and characters like <> ' " (which aren't allowed by my regex).
Any idea how I could make this work while still not allowing characters like <>'"?
Thanks in advance.

I believe that in C# regex, you want to use \u as the unicode hex escape prefix (see MSDN). So try something like this:
[\u2708\u2764]
to match ✈ or ❤. It's equivalent to
[\x{2708}\x{2764}]
in some other languages.
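As a rough illustration in Python (an assumption for this sketch; the question itself concerns C#), the snippet below shows the \u escape matching the two symbols. It also shows what happens with \x2708 in a flavor where \x takes exactly two hex digits, as Python's re does: the escape is read as \x27 (an apostrophe) followed by the literal characters 0 and 8, which may explain why disallowed characters seemed to appear.

```python
import re

# \uXXXX takes exactly four hex digits, so each escape names one code point.
symbols = re.compile(r"[\u2708\u2764]")

# "\u2708" is the airplane symbol, "\u2764" is the heart symbol.
found = symbols.findall("fly \u2708 with \u2764")
print(found)

# \x takes only two hex digits here: "\x2708" means \x27 (apostrophe),
# then the literal characters "0" and "8".
two_digit = re.compile(r"[\x2708]")
leaked = two_digit.findall("it's 2708 \u2708")
print(leaked)  # the apostrophe, "0" and "8" match; the airplane does not
```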


Regular Expression - removing a line(English) and attaching it to the end of upper line(Korean)

I have this text like below:
아니다
bukan
싫다
tidak suka
훌륭하다
bagus
I am trying to remove each English line (English alphabet) and attach it to the end of the line above it (Korean alphabet), like this:
아니다bukan
싫다tidak suka
훌륭하다bagus
I have finally found a regular expression that comes close:
[가-힣]\R
However, it turns the text file into this:
아니bukan
싫tidak suka
훌륭하bagus
The problem is that it also removes one Korean character.
How can I solve this problem?
C++ std::regex does not support Unicode property classes like \p{Hangul}, but you may use the equivalent character class, [\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC], see this reference.
Besides, \R is not supported either. You can probably just use \r?\n to match Windows/Linux style line endings, or (?:\r\n?|\n) to also support classic Mac OS (\r) line endings.
Next, because the pattern matches and consumes a Korean char, you need to capture that char in a capturing group and refer to the group with a backreference in the replacement pattern.
So, you may use
([\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC])(?:\r\n?|\n)
Replace with $1 to put back the Korean char into the resulting string.
See the regex demo online.
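The capture-and-backreference replacement can be sketched in Python as follows. This is only an illustration under two assumptions: Python writes the backreference as \1 rather than $1, and the class is narrowed to the Hangul Syllables block (U+AC00 to U+D7A3), which covers the sample text but not every range in the full class quoted above. The names JOIN_LINES and join_translation are made up for this example.

```python
import re

# Group 1 captures the last Korean syllable so it can be put back;
# the non-capturing group consumes the line break (\r\n, \r, or \n).
# Assumption: only the Hangul Syllables block is needed for this input.
JOIN_LINES = re.compile(r"([\uAC00-\uD7A3])(?:\r\n?|\n)")

def join_translation(text: str) -> str:
    """Attach each following line to the Korean line above it."""
    return JOIN_LINES.sub(r"\1", text)

sample = "아니다\nbukan\n싫다\ntidak suka\n훌륭하다\nbagus"
joined = join_translation(sample)
print(joined)
```

Line breaks after the non-Korean lines are left alone, because the character before them is not in the captured class.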
The regex for the set of all Korean characters in unicode is this:
\p{Hangul}
There is more information here: https://www.regular-expressions.info/unicode.html
Maybe you also need a + after your group of characters?
Use the regular expression ([\p{Hangul}]+)\R instead of what you're using now, and replace with $1 so that the matched Korean characters are kept.

internationalized regular expression in postgresql

How can I write regular expressions to match names like 'José' in Postgres? In other words, I need to set up a constraint to check that only valid names are entered, but I also want to allow Unicode characters.
Regular expressions, unicode style has some reference material on this. But it seems I can't write it in Postgres.
If it is not possible to write a regex for this, will it be sufficient to check only on the client side using JavaScript?
PostgreSQL doesn't support character classes based on the Unicode Character Database like .NET does. You get the more-standard [[:alpha:]] character class, but this is locale-dependent and probably won't cover it.
You may be able to get away with just blacklisting the ASCII characters you don't want, and allowing all non-ASCII characters, e.g. something like
[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+
(JavaScript doesn't have non-ASCII character classes either. Or even [[:alpha:]].)
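To see how the blacklist pattern above behaves, here is a small Python sketch that applies the same character class as a whole-string check. Python is used only for illustration; the names NAME and is_valid_name are hypothetical. Note that the class also rejects whitespace and the apostrophe, so multi-word names or names like O'Brien would fail unless the list is adjusted.

```python
import re

# Same negated class as above: any run of characters that are not
# whitespace and not one of the listed ASCII punctuation marks.
NAME = re.compile(r'''[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+''')

def is_valid_name(name: str) -> bool:
    """True when the whole string consists of allowed characters."""
    return NAME.fullmatch(name) is not None

for candidate in ["José", "Zoë", "Jos<e>", "Anne Marie"]:
    print(candidate, is_valid_name(candidate))
```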
For example, given v_text as a text variable to be sanitized:
-- Allow internationalized text characters and remove undesired characters
v_text = regexp_replace( lower(trim(v_text)), '[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+', '', 'g' );

regex unicode character in vim

I'm being an idiot.
Someone cut and pasted some text from Microsoft Word into my lovely HTML files.
I now have these Unicode characters instead of regular quote symbols (i.e. quotes appear as <92> in the text).
I want to do a regex replace but I'm having trouble selecting them.
:%s/\u92/'/g
:%s/\u5C/'/g
:%s/\x92/'/g
:%s/\x5C/'/g
...all fail. My google-fu has failed me.
From :help regexp (lightly edited), you need to use some specific syntax to select Unicode characters with a regular expression in Vim:
\%u match specified multibyte character (eg \%u20ac)
That is, to search for the Unicode character with hex code 20AC, enter this into your search pattern:
\%u20ac
The full table of character search patterns includes some additional options:
\%d match specified decimal character (eg \%d123)
\%x match specified hex character (eg \%x2a)
\%o match specified octal character (eg \%o040)
\%u match specified multibyte character (eg \%u20ac)
\%U match specified large multibyte character (eg \%U12345678)
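For comparison outside Vim, the characters used in the examples above can be written by code point in Python as well. This is only a side-by-side sketch: the helper replace_code_point is hypothetical, and it assumes the <92> shown in the question is the single character U+0092.

```python
import re

decimal_char = chr(123)   # same character as Vim's \%d123
hex_char = "\x2a"         # same character as Vim's \%x2a
octal_char = "\040"       # same character as Vim's \%o040
euro = "\u20ac"           # same character as Vim's \%u20ac

def replace_code_point(text: str, code_point: int, replacement: str) -> str:
    """Replace every occurrence of one code point with a replacement string."""
    return re.sub(re.escape(chr(code_point)), replacement, text)

# Assumption: the stray quote is stored as U+0092.
fixed = replace_code_point("don\u0092t", 0x92, "'")
print(fixed)
```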
This solution might not address the problem as originally stated, but it does address a different but very closely related one and I think it makes a lot of sense to place it here.
I don't know in which version of Vim it was implemented, but I was working on 7.4 when I tried it.
In Insert mode, the key sequence to enter a Unicode character is ctrl-v u xxxx, where xxxx is the code point. For instance, entering the euro sign would be ctrl-v u 20ac.
I tried it in Command-line mode as well and it worked. That is, to replace all instances of "20 euro" in my document with "20 €", I'd do:
:%s/20 euro/20 <ctrl-v u 20ac>/gc
In the above <ctrl-v u 20ac> is not literal, it's the sequence of keys that will output the € character.

RegEx: how to find whether a text begins with currency symbols?

I would like to check whether a given text begins with a currency symbol, such as $€£¥. How can I achieve that using a regex?
It depends on your language, but something like ^[\$€£¥].*
[] is a character class matching one of the characters inside.
You might have to write \$ because the $ sign sometimes has a special meaning in regexps.
.* matches "everything else" (except a newline).
Edit: After re-reading your question: If you really want to match some currency symbols (maybe more than one), try ^[\$€£¥]+.*
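A minimal Python sketch of the character-class approach, assuming Python's re module (the asker's language is not stated). The name leading_currency is made up for this example. Inside a character class the $ needs no escaping in this flavor, and re.match already anchors at the start of the string.

```python
import re

# One or more of the listed currency symbols at the start of the text.
LEADING_CURRENCY = re.compile(r"[$€£¥]+")

def leading_currency(text: str) -> str:
    """Return the run of currency symbols the text starts with, or ''."""
    match = LEADING_CURRENCY.match(text)
    return match.group(0) if match else ""

print(repr(leading_currency("$100")))
print(repr(leading_currency("€£5")))
print(repr(leading_currency("100$")))
```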
Which regex flavor? If it's one that supports Unicode properties, you can use this:
^\p{Sc}
(I didn't add quotes or regex delimiters because I don't know which flavor you're using.)
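Python's built-in re module does not support \p{Sc}, so as a hedged alternative the same Unicode property can be checked with the standard unicodedata module, whose category function returns "Sc" for currency symbols. The function name starts_with_currency is an assumption for this sketch.

```python
import unicodedata

def starts_with_currency(text: str) -> bool:
    """True when the first character has Unicode general category Sc."""
    return bool(text) and unicodedata.category(text[0]) == "Sc"

# "\u20b9" is the Indian rupee sign, which is not in the $€£¥ list.
for sample in ["$5", "\u20b9500", "5$", ""]:
    print(repr(sample), starts_with_currency(sample))
```

Compared with a fixed character class, this accepts any currency symbol known to the Unicode database that ships with the interpreter.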

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
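As an illustration of counting affected file names with this class, here is a short Python sketch. The list of names is made up; in practice it could come from something like os.listdir on the directory in question.

```python
import re

# Matches a single character outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def count_non_ascii(names) -> int:
    """Count the names that contain at least one non-ASCII character."""
    return sum(1 for name in names if NON_ASCII.search(name))

# Hypothetical file names for demonstration only.
names = ["report.txt", "café.txt", "naïve résumé.doc", "données.csv"]
affected = count_non_ascii(names)
print(affected)
```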
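You can also use the POSIX-style bracket shorthands (supported by Perl and PCRE; [:ascii:] is an extension, not one of the classes defined by POSIX itself):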
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
No, [^\x20-\x7E] does not mean "non-ASCII"; it only excludes printable ASCII.
This is the class that excludes all of real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
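The difference between the two classes can be checked with a short Python sketch (Python is an assumption here; the escapes behave the same way in PCRE). The tab and newline are ASCII control characters, so only the narrower printable-range class flags them.

```python
import re

# A tab, a newline, and one non-ASCII letter (é, written as \u00e9).
text = "tab\there\nnext caf\u00e9"

# Everything outside printable ASCII: also catches the tab and newline.
non_printable = re.findall(r"[^\x20-\x7E]", text)

# Everything outside the whole ASCII table: only the accented letter.
non_ascii = re.findall(r"[^\x00-\x7F]", text)

print(non_printable)
print(non_ascii)
```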
You could also check this page, Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
In case you ask, the option used is Multiline.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so the strings utility can sometimes be the better option. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
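The same glob can be tried from Python with the standard fnmatch module, which understands the [!...] negation and the space-to-tilde range. This is a sketch on a made-up list of names rather than a real directory listing, and the function name has_unusual_char is hypothetical.

```python
import fnmatch

# Same idea as the shell glob: any name containing a character
# outside the printable ASCII range from space to tilde.
PATTERN = "*[! -~]*"

def has_unusual_char(name: str) -> bool:
    """True for names with non-ASCII or control characters."""
    return fnmatch.fnmatchcase(name, PATTERN)

# Hypothetical file names for demonstration only.
names = ["plain.txt", "café.txt", "tab\tname"]
flagged = [name for name in names if has_unusual_char(name)]
print(flagged)
```

Unlike the shell, an empty result here is simply an empty list, so there is no unexpanded pattern to watch out for.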
This turned out to be very flexible and extensible:
$field =~ s/[^\x00-\x7F]//g;  # strip every non-ASCII character
This way all non-ASCII characters, or specific items in question, can be cleaned out. It is very nice either for selection or for pre-processing items that will eventually become hash keys.