I want to identify whether text is in Chinese, Japanese or German, using Regular Expressions.
For example I have some text like this "MainWindow_Button_save".
Its German translation is "MainWindow_Button_sparen".
Its Chinese translation is "MainWindow_Button_保存".
And Japanese is "MainWindow_Button_保存".
I want a regular expression which finds the prefix "MainWindow_Button" and determines whether the following text is Chinese, Japanese or German. I'm not concerned about the text itself, only about which of the three languages it is in.
What I have done is just this "^MainWindow_Button_[^a-zA-Z]*", but how do I identify the language?
I worked out a regular expression for this example.
I would suggest taking the first and last characters of the Chinese/Japanese text and putting them in a character range, as in "MainWindow_Button_([保-存])+". Note that this range only covers the code points between 保 (U+4FDD) and 存 (U+5B58), so it matches the example but not every Chinese/Japanese character.
If you are not tied to regular expressions, I would suggest another way, for example in Java:
Read the Unicode value of the first character after "MainWindow_Button_" and check whether it falls in the Chinese or Japanese character ranges; if it falls in neither, the text is German.
The following regex will help to provide the verification that the text is in either Chinese or Japanese:
^[\u3000-\u9FFF ]+$
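The range check described above can be sketched in Python. This is only an illustration under some assumptions: the function name `classify` and the returned labels are made up for this example, and the ranges cover only Hiragana/Katakana (U+3040 to U+30FF) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF). Because Chinese and Japanese share the ideographs, a string such as 保存 cannot be assigned to one of them from its characters alone; only the presence of kana points to Japanese.

```python
import re

PREFIX = "MainWindow_Button_"

# Hiragana and Katakana are specific to Japanese text.
KANA = re.compile(r"[\u3040-\u30FF]")
# CJK Unified Ideographs are shared by Chinese and Japanese.
HAN = re.compile(r"[\u4E00-\u9FFF]")


def classify(key: str) -> str:
    """Guess the script family of the text that follows PREFIX."""
    if not key.startswith(PREFIX):
        raise ValueError(f"missing prefix: {key!r}")
    text = key[len(PREFIX):]
    if KANA.search(text):
        return "japanese"
    if HAN.search(text):
        # Ideographs alone do not distinguish the two languages.
        return "chinese-or-japanese"
    # No CJK characters found: treated as German here (an assumption).
    return "german"


for key in ("MainWindow_Button_sparen",
            "MainWindow_Button_保存",
            "MainWindow_Button_保存する"):
    print(key, "->", classify(key))
```

If the two CJK languages really must be told apart for ideograph-only strings, some information outside the string (a locale tag, the file name, a dictionary) is needed.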
Related
I have written a lot of text in Greek, with the language set to "Greek", in LyX 2.1. The problem is that the text also contains English words and characters, and when I output to PDF with pdflatex they are transformed into the matching Greek characters. I suppose I need to change the language of every English word to English, but that sounds very repetitive. I am wondering whether I can use a regex to match the English characters and change their language with find and replace. I have successfully managed to match English characters with this capture group \regexp{([\backslash x00-\backslash x7F]+)\endregexp{}} but I don't know what to replace it with. Any ideas?
PS: Does LyX regex even have backreferences, or are capture groups just there for no reason?
I have a question:
I want to validate first and last names with a regex.
Only Hebrew and English letters should be allowed, with no numbers.
Can someone help me write that code?
Seemingly Hebrew has the range \u0590-\u05fe (according to this nice JavaScript Unicode Regex generator).
/^[a-z\u0590-\u05fe]+$/i
While the selected answer is correct about "Hebrew", the OP wanted to limit validation to Hebrew and English letters only. The Hebrew Unicode block also contains a lot of punctuation and symbols (as you can see in the table here) that are irrelevant for such validation. If you want only Hebrew letters (along with English letters), the regex would be:
/^[a-z\u05D0-\u05EA]+$/i
I would consider adding ' (single quote) as well, because foreign consonants that are missing in Hebrew (such as G in George and Ch in Charlie) are written with it next to a letter:
/^[a-z\u05D0-\u05EA']+$/i
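To see the difference between the whole Hebrew block and the letters-only range, here is a small Python sketch. The names `BLOCK`, `LETTERS` and `is_valid_name` are hypothetical, chosen for this example. It spells out `A-Za-z` instead of using a case-insensitive flag, since in Python that flag applies Unicode case folding to ranges, and it uses `fullmatch` in place of the `^...$` anchors.

```python
import re

# Whole-block range from the first answer (includes points and punctuation).
BLOCK = re.compile(r"[A-Za-z\u0590-\u05FE]+")
# Letters only (alef U+05D0 .. tav U+05EA), plus the apostrophe.
LETTERS = re.compile(r"[A-Za-z\u05D0-\u05EA']+")


def is_valid_name(name: str) -> bool:
    """True if name consists only of English/Hebrew letters and apostrophes."""
    return LETTERS.fullmatch(name) is not None


MAQAF = "\u05BE"  # HEBREW PUNCTUATION MAQAF, part of the block but not a letter

print(is_valid_name("David"))           # English letters
print(is_valid_name("דוד"))             # Hebrew letters
print(is_valid_name("ג'ורג'"))          # Hebrew with apostrophes
print(is_valid_name("David7"))          # digits are rejected
print(BLOCK.fullmatch(MAQAF) is not None, is_valid_name(MAQAF))
```

The last line shows the point made above: the block-wide range accepts a punctuation mark that the letters-only range rejects.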
English & Hebrew FULL regex
I'm using the regex below in my application. My users are just fine with it:
RegExp(r'^[a-zA-Z\u0590-\u05FF\u200f\u200e ]+$');
The regex supports:
English letters (includes Capital letters). a-zA-Z
Hebrew (includes special end-letters). \u0590-\u05FF
Directional marks (RLM, LRM). \u200f\u200e
White space.
Enjoy!
Try this. Not sure if it will work. If not, these references should help.
[A-Za-z\u0590-\u05FF]*
Hebrew Unicode
Unicode in Regular Expressions
Hebrew letters only:
/^[\u0590-\u05ea]+$/i
You can also use \p{Hebrew} in your regex to detect any Hebrew Unicode characters (if your regex engine supports it).
Well, the regex pattern sits between the two /'s. The i at the end is a flag that makes the match case-insensitive. ^ means the start of a line, and $ means the end of a line. Brackets ([ and ]) mean any one of the characters inside them. - means a range. Note that characters are ordered by code point, so a-z or א-ת make sense; a-z means all letters from a to z, inclusive. The same goes for א-ת. + means one or more of the preceding item. So this pattern matches every sequence of English or Hebrew letters.
P.S.: Also, note that the regex flavor differs between languages and platforms. For example, in Sublime Text the pattern would be: (?i)^[א-תa-z]+$.
/^[א-תa-z]+$/i
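The remark that ranges follow code point order can be checked directly. The following Python sketch (variable names are arbitrary) shows that the literal range א-ת is the same as the escaped range \u05D0-\u05EA used in the other answers.

```python
import re

# A range in a character class runs over code points, so the literal
# range alef..tav equals U+05D0..U+05EA.
assert ord("א") == 0x05D0
assert ord("ת") == 0x05EA

literal = re.compile(r"[א-תA-Za-z]+")
escaped = re.compile(r"[\u05D0-\u05EAA-Za-z]+")

for sample in ("שלום", "Hello", "שלום1", ""):
    # Both patterns give the same answer for every sample.
    print(repr(sample),
          literal.fullmatch(sample) is not None,
          escaped.fullmatch(sample) is not None)
```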
I am trying to detect Arabic characters in a webpage's HTML using Notepad++ CTRL+F with regular expressions. I am entering the following as my search terms and it is returning all characters.
[\u0600-\u06FF]
Sample block of random text I'm working with -
awr4tgagas
بqa4tq4twْq4tw4twtfwd
awfasfrw34جَ4tw4tg
دِيَّة عَرqaw4trawfَبِيَّ
Any ideas why this Regular Expression won't detect the Arabic characters properly and how I should go about this? I have the document encoded as UTF-8.
Thanks!
This is happening because the Notepad++ regex engine uses PCRE-style syntax, which doesn't support the \uNNNN notation you have used.
To match a Unicode code point you have to use \x{NNNN}, so your regular expression becomes:
[\x{0600}-\x{06FF}]
Because Notepad++'s implementation of Regular Expressions requires that you use the
\x{NNNN}
notation to match Unicode characters.
In your example,
\x{0628}
can be used to match the ب (bāʾ,bet,beth,vet) character.
The \u symbol is used to match uppercase letters.
See http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions#Ranges_or_kinds_of_characters
for an explanation of Notepad++'s regex syntax.
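For comparison, and to check the range itself outside Notepad++, the same Arabic block can be searched from Python, where the \uNNNN escape is the supported form. This is only a sketch; the helper name `arabic_chars` is made up for this example, and the sample lines are shortened versions of the text from the question, written with escapes to avoid mixed-direction display issues.

```python
import re

# Arabic block U+0600..U+06FF. Python's re accepts \uNNNN here,
# whereas Notepad++ needs \x{NNNN} for the same range.
ARABIC = re.compile(r"[\u0600-\u06FF]")


def arabic_chars(text: str) -> list:
    """Return every character of text that lies in the Arabic block."""
    return ARABIC.findall(text)


samples = [
    "awr4tgagas",              # no Arabic at all
    "\u0628qa4tq4tw",          # starts with ب (U+0628)
    "awfasfrw34\u062c\u064e",  # ends with ج plus a fatha mark
]
for line in samples:
    found = arabic_chars(line)
    print(len(found), [f"U+{ord(c):04X}" for c in found])
```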
I need a regular expression to match a-zA-Z0-9 as well as whitespace and special characters, but only including English whitespace/special characters, not those of other languages like French or Spanish.
Thanks.
It's not possible/practical to write a regular expression that matches English, but not French, Spanish and other languages.
If you really want to test whether a word is from the English language, you can write some code to look it up in an English dictionary. That should be simple enough.
Depending on the regex engine, you may be able to use:
^\p{IsBasicLatin}*$
This allows only characters in the Basic Latin character set, which includes standard English language punctuation (i.e., the characters that can be directly entered on a U.S. keyboard).
I was looking for a regular expression that would match regular English text (and maybe avoid HTML/XML/URLs etc.) and landed on this page. I think the questioner just wanted to exclude characters with diacritics while still allowing English punctuation. I ended up writing something myself by looking at my keyboard:
[A-Za-z\d,.?;:\'"!$%() ]*
I don't claim this will work for everyone but was good enough for me.
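Both ideas above can be tried in Python, with two caveats. Python's built-in `re` module has no `\p{IsBasicLatin}`, so the sketch below writes the Basic Latin block out as the range U+0000 to U+007F. It also passes `re.ASCII` to the keyboard pattern, because in Python `\d` would otherwise accept digits from other scripts. The function names are hypothetical.

```python
import re

# Basic Latin is exactly U+0000..U+007F (the ASCII range).
BASIC_LATIN = re.compile(r"[\x00-\x7F]*")

# The hand-written keyboard pattern from above; re.ASCII restricts \d to 0-9.
KEYBOARD = re.compile(r"""[A-Za-z\d,.?;:'"!$%() ]*""", re.ASCII)


def is_basic_latin(text: str) -> bool:
    """True if every character is in the Basic Latin block."""
    return BASIC_LATIN.fullmatch(text) is not None


def is_keyboard_text(text: str) -> bool:
    """True if text uses only the letters, digits and punctuation listed."""
    return KEYBOARD.fullmatch(text) is not None


for sample in ("Hello, world!", "café", "<b>bold</b>"):
    print(repr(sample), is_basic_latin(sample), is_keyboard_text(sample))
```

The third sample shows how the two checks differ: markup characters such as `<` are Basic Latin but are not in the hand-picked punctuation list.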
How can I write regular expressions to match names like 'José' in Postgres? In other words, I need to set up a constraint to check that only valid names are entered, but I also want to allow Unicode characters.
Regular expressions, unicode style have some reference on this. But, it seems I can't write it in postgres.
If it is not possible to write a regex for this, will it be sufficient to check only on the client side using JavaScript?
PostgreSQL doesn't support character classes based on the Unicode Character Database like .NET does. You get the more-standard [[:alpha:]] character class, but this is locale-dependent and probably won't cover it.
You may be able to get away with just blacklisting the ASCII characters you don't want and allowing all non-ASCII characters, e.g. something like:
[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+
(JavaScript doesn't have non-ASCII character classes either. Or even [[:alpha:]].)
For example, given v_text as a text variable to be sanitized:
-- Allow internationalized text characters and remove undesired characters
v_text = regexp_replace( lower(trim(v_text)), '[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+', '', 'g' );
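The blacklist approach can also be tried outside the database. The Python sketch below reuses the two character classes from the answers above; the function names `is_acceptable_name` and `sanitize` are made up for this example, and it is a rough counterpart of the SQL, not a drop-in replacement for a Postgres constraint.

```python
import re

# Reject whitespace and the listed ASCII punctuation; everything else,
# including non-ASCII letters such as é, is allowed.
NAME = re.compile(r"""[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+""")

# Characters stripped by the regexp_replace call above.
UNWANTED = re.compile(r"""[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+""")


def is_acceptable_name(name: str) -> bool:
    """True if name is non-empty and contains no blacklisted character."""
    return NAME.fullmatch(name) is not None


def sanitize(text: str) -> str:
    """Trim spaces, lowercase, then drop the unwanted characters."""
    return UNWANTED.sub("", text.strip(" ").lower())


print(is_acceptable_name("José"))
print(is_acceptable_name("Jos<e>"))
print(sanitize("  José!! "))
```

Note that this blacklist still lets digits and a few symbols through, and rejects spaces and hyphens, so it may need adjusting for names such as "Jean-Luc".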