I have written a lot of text in Greek, with the language set to Greek, in LyX 2.1. The problem is that there are English words and characters in there, and when I output to PDF with pdflatex they are transformed into the corresponding Greek characters. I suppose I need to change the language of every English word to English, but that sounds very repetitive... I am wondering if I can use a regex to match English characters and change their language with find & replace somehow. I have successfully managed to match English characters in advanced find & replace with the capture group ([\x00-\x7F]+), but I don't know what to replace it with. Any ideas?
PS: Does LyX's regex support even have backreferences, or are capture groups just there for no reason?
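For what it's worth, the character class itself behaves as expected outside LyX. Here is a minimal Java sketch of what ([\x00-\x7F]+) captures in mixed Greek/English text (the sample string is invented):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiRuns {
    public static void main(String[] args) {
        // ([\x00-\x7F]+) captures maximal runs of ASCII characters,
        // i.e. the Latin-script words (plus adjacent spaces and
        // punctuation) embedded in the Greek text.
        Pattern ascii = Pattern.compile("([\\x00-\\x7F]+)");
        Matcher m = ascii.matcher("κείμενο with English λέξεις inside");
        while (m.find()) {
            System.out.println("ASCII run: [" + m.group(1) + "]");
        }
    }
}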
I want to identify whether text is in Chinese, Japanese or German, using Regular Expressions.
For example I have some text like this "MainWindow_Button_save".
Its German translation is "MainWindow_Button_sparen".
Its Chinese translation is "MainWindow_Button_保存".
And Japanese is "MainWindow_Button_保存".
I want a regular expression which finds the prefix "MainWindow_Button_" and determines whether the following text is Chinese, Japanese, or German. I'm not very concerned about the text itself; the only thing I care about is which of the three languages it is in.
What I have done is just this "^MainWindow_Button_[^a-zA-Z]*", but how do I identify the language?
Here is a tried and working regular expression for the example:
I would suggest getting the first and last characters of the Chinese/Japanese text and putting them in the regular expression "MainWindow_Button_([保-存])+", so that it matches any characters in that range.
If you are not limited to regular expressions, I would suggest another approach, in Java:
Read the Unicode value of the first character after "MainWindow_Button_" and check whether it falls in the Chinese or Japanese character ranges; if it is in neither, the text is German.
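A minimal Java sketch of that idea, using Character.UnicodeBlock to classify the first code point (the prefix handling and return strings are just illustrative; note that Kanji alone cannot distinguish Chinese from Japanese):

import java.lang.Character.UnicodeBlock;

public class GuessLanguage {
    static String guess(String s) {
        String prefix = "MainWindow_Button_";
        if (!s.startsWith(prefix) || s.length() == prefix.length()) return "unknown";
        int cp = s.codePointAt(prefix.length());
        UnicodeBlock block = UnicodeBlock.of(cp);
        if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) return "Japanese";
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) return "Chinese or Japanese"; // Kanji is shared
        return "German"; // neither block, so fall back to German as suggested above
    }

    public static void main(String[] args) {
        System.out.println(guess("MainWindow_Button_sparen")); // German
        System.out.println(guess("MainWindow_Button_保存"));    // Chinese or Japanese
    }
}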
The following regex will verify that the text is in either Chinese or Japanese:
^[\u3000-\u9FFF ]+$
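A quick Java demo of that check (\u3000-\u9FFF spans CJK punctuation, Hiragana, Katakana, and the CJK unified ideographs; the sample strings come from the question):

import java.util.regex.Pattern;

public class CjkCheck {
    public static void main(String[] args) {
        Pattern cjk = Pattern.compile("^[\\u3000-\\u9FFF ]+$");
        System.out.println(cjk.matcher("保存").matches());   // true:  Chinese/Japanese
        System.out.println(cjk.matcher("sparen").matches()); // false: assume German
    }
}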
I have a question:
I want to validate first and last names with a regex.
I want to allow only Hebrew and English letters, with no numbers.
Can someone help me write that code?
Seemingly Hebrew has the range \u0590-\u05fe (according to this nice JavaScript Unicode regex generator).
/^[a-z\u0590-\u05fe]+$/i
While the selected answer is correct about "Hebrew", the OP wanted to limit validation to only Hebrew and English letters. The Hebrew Unicode block adds a lot of punctuation and symbols (as you can see in the table here) that are irrelevant for such validation. If you want only Hebrew letters (along with English letters), the regex would be:
/^[a-z\u05D0-\u05EA]+$/i
I would consider adding ' (single quote) as well, since foreign consonants that are missing in Hebrew (such as the G in George and the Ch in Charlie) are written with an apostrophe alongside a letter:
/^[a-z\u05D0-\u05EA']+$/i
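To make that concrete, a small Java sketch of the validation (the sample names are arbitrary):

import java.util.regex.Pattern;

public class NameCheck {
    public static void main(String[] args) {
        // Hebrew letters U+05D0-U+05EA, English letters, and an apostrophe
        // for the foreign-consonant spellings mentioned above.
        Pattern name = Pattern.compile("^[a-z\\u05D0-\\u05EA']+$", Pattern.CASE_INSENSITIVE);
        System.out.println(name.matcher("David").matches());    // true
        System.out.println(name.matcher("ג'ורג'").matches());   // true ("George")
        System.out.println(name.matcher("David123").matches()); // false: digits rejected
    }
}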
English & Hebrew FULL regex
I'm using the regex below in my application. My users are just fine with it:
RegExp(r'^[a-zA-Z\u0590-\u05FF\u200f\u200e ]+$');
The regex supports:
English letters (including capitals): a-zA-Z
Hebrew (including the special final letters): \u0590-\u05FF
Direction-change marks (RLM, LRM): \u200f\u200e
Whitespace.
Enjoy!
Try this. Not sure if it will work. If not, these references should help.
[A-Za-z\u0590-\u05FF]*
Hebrew Unicode
Unicode in Regular Expressions
Hebrew letters only (the letters proper run from \u05d0 to \u05ea; starting the range at \u0590 would also admit vowel points and cantillation marks):
/^[\u05d0-\u05ea]+$/i
You can also use \p{Hebrew} in your regex to detect any Hebrew Unicode characters (if your regex engine supports it).
Well, the regex pattern is between the two /'s. The i at the end is a flag that says to ignore case. ^ means the start of a line, and $ means the end of a line. Brackets ([ and ]) mean any one of the characters inside the brackets. - denotes a range. Note that characters are ordered by code point, so a-z or א-ת make sense; a-z means all letters from and including a to and including z, and the same goes for א-ת. + means one or more of the preceding. So this pattern matches every sequence of letters from English or Hebrew.
P.S.: Also note that the flavor of regex differs between languages and platforms. For example, in Sublime Text the pattern would be: (?i)^[א-תa-z]+$.
/^[א-תa-z]+$/i
I have a text like this
English text||Arabic text||Japanese text||Arabic text||numbers
I tried using (\|\|\p{Han}\p{Hiragana}\p{Katakana}\|\|), but I'm getting an "invalid regular expression" error message in Notepad++, although it's right as I tested it in this regex tester. Plus, this will only match Japanese text with Katakana after Hiragana after Kanji; how can I make it match the Japanese text without that order?
Notepad++ does not support the \p modifier; try \p{Letter} (which should match any letter in any language) and you will see no match.
You can use some other application, e.g. a very good one is EditPad.
I figured out the main problem: I have to use [\p{Han}\p{Hiragana}\p{Katakana}] if I don't want a fixed order. So all I had to do was find \|\|([\p{Han}\p{Hiragana}\p{Katakana}]*?)\|\| and replace it with ||.
Of course Notepad++ didn't work, so I used EditPad as NikitOn suggested.
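The same unordered class-of-scripts trick works in any engine with Unicode script support. For instance, in Java the script properties are spelled \p{IsHan}, \p{IsHiragana}, and \p{IsKatakana} (a sketch with an invented sample string):

import java.util.regex.Pattern;

public class StripJapanese {
    public static void main(String[] args) {
        String text = "English text||نص||日本語のテキスト||نص||12345";
        // One character class holding all three scripts matches the
        // Japanese segment no matter how the scripts are ordered.
        Pattern jp = Pattern.compile("\\|\\|[\\p{IsHan}\\p{IsHiragana}\\p{IsKatakana}]+\\|\\|");
        System.out.println(jp.matcher(text).replaceAll("||"));
        // -> English text||نص||نص||12345
    }
}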
I have a text box that a user can input any text in any language in and I need to split that text into words so that I could pass those words into hunspell spell check. For splitting I use a regexp that matches word delimiters.
At first I used \W as a word delimiter to split a text into words, but that works only with Latin letters, such as in English. If I use a non-Latin language, it treats every letter of it as \W. That's because \W is defined as any character that is [^a-zA-Z0-9_].
So far, (?![-'])[\pP|\pZ|\pC] seems to tokenize English, Spanish and Russian correctly. It basically says to treat all punctuation characters (except for the hyphen and the apostrophe), all separator characters and all "other" characters (control, private use, etc) as word delimiters. I have excluded hyphen and apostrophe because those usually shouldn't be treated as word delimiters.
I haven't tested it much, just came up with it today, so I thought it would be wise to ask if someone knew of any regex that is more suited for matching word delimiters in a multilingual text.
Note that I'm not concerned with languages that can't be tokenized, such as Japanese, Chinese, Thai, etc.
Update: Since people were asking what language I'm using (though it probably shouldn't matter much), I'm using C++ and Qt5's QRegularExpression class.
With Java (for example), you can emulate word boundaries like this (don't forget to double-escape backslashes in string literals):
(?<![\p{L}\p{N}_])[\p{L}\p{N}_]+(?![\p{L}\p{N}_])
Where \p{L} matches any letters and \p{N} any digits.
Thus, you can easily split a string into "words" with: [^\p{L}\p{N}_]+
(I don't know the regex flavor you use, but you can probably remove the curly brackets).
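A self-contained Java sketch of that split, also keeping the apostrophe and hyphen the question wanted to preserve (the sample sentence is arbitrary):

import java.util.Arrays;

public class SplitWords {
    public static void main(String[] args) {
        // Split on runs of anything that is not a letter, digit,
        // underscore, apostrophe, or hyphen.
        String text = "don't re-do l'été, привет мир!";
        String[] words = text.split("[^\\p{L}\\p{N}_'-]+");
        System.out.println(Arrays.toString(words));
        // -> [don't, re-do, l'été, привет, мир]
    }
}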
In PHP this should work (note that preg patterns need delimiters, and the u modifier for UTF-8):
/[\pL]*/u
In JavaScript you can use this (add the u flag for Unicode after the closing delimiter):
/[\p{L}]*/u
I need a regular expression to match a-zA-Z0-9 as well as whitespace and special characters, but only including English whitespace/special characters, not those of other languages like French or Spanish.
Thanks.
It's not possible/practical to write a regular expression that matches English, but not French, Spanish and other languages.
If you really want to test whether a word is from the English language, you can write some code to look it up in an English dictionary. That should be simple enough.
Depending on the regex engine, you may be able to use:
^\p{IsBasicLatin}*$
To allow only characters in the Basic Latin character set, which includes standard English-language punctuation (i.e., the characters that can be directly entered on a U.S. keyboard).
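The same idea works in Java, where Unicode block names are spelled with an In prefix rather than .NET's Is (a minimal sketch):

import java.util.regex.Pattern;

public class BasicLatinCheck {
    public static void main(String[] args) {
        // Basic Latin is U+0000-U+007F: ASCII letters, digits, punctuation, whitespace.
        Pattern basicLatin = Pattern.compile("^\\p{InBasicLatin}*$");
        System.out.println(basicLatin.matcher("Hello, world! 123").matches()); // true
        System.out.println(basicLatin.matcher("café").matches());             // false: é is outside the block
    }
}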
I was looking for a regular expression that would match regular English text (and perhaps avoid HTML/XML/URLs, etc.) and landed on this page. I think the questioner just wanted to avoid accented characters while allowing English punctuation. I ended up writing something myself by looking at my keyboard:
[A-Za-z\d,.?;:\'"!$%() ]*
I don't claim this will work for everyone, but it was good enough for me.