Anyone know how to use Regex in Notepad++ to find Arabic characters?

I am trying to detect Arabic characters in a webpage's HTML using Notepad++ CTRL+F with regular expressions. I am entering the following as my search terms and it is returning all characters.
[\u0600-\u06FF]
Sample block of random text I'm working with -
awr4tgagas
بqa4tq4twْq4tw4twtfwd
awfasfrw34جَ4tw4tg
دِيَّة عَرqaw4trawfَبِيَّ
Any ideas why this Regular Expression won't detect the Arabic characters properly and how I should go about this? I have the document encoded as UTF-8.
Thanks!

This is happening because Notepad++'s regex engine uses PCRE-style syntax, which doesn't support the \uNNNN notation you have provided.
To match a Unicode code point you have to use \x{NNNN}, so your regular expression becomes:
[\x{0600}-\x{06FF}]

Notepad++'s implementation of regular expressions requires that you use the
\x{NNNN}
notation to match Unicode characters.
In your example,
\x{0628}
can be used to match the ب (bāʾ,bet,beth,vet) character.
In Notepad++, the \u symbol is used to match uppercase letters, not to introduce a Unicode code point.
See http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions#Ranges_or_kinds_of_characters
for an explanation of Notepad++'s regex syntax.
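For checking the same range outside Notepad++, here is a minimal Python sketch. It assumes Python 3's built-in re module, which (unlike Notepad++) accepts the \uNNNN notation in patterns. The find_arabic helper name is made up for illustration.

```python
import re

# Arabic block U+0600..U+06FF, the same range the question targets.
ARABIC = re.compile(r"[\u0600-\u06FF]")

def find_arabic(text):
    """Return every character of text that falls in the Arabic block."""
    return ARABIC.findall(text)

# A mixed Latin/Arabic line in the style of the sample text,
# written with escapes so the code points are explicit.
sample = "awfasfrw34\u062c\u064e4tw4tg"
arabic_chars = find_arabic(sample)
```

Only the two Arabic code points in the sample are returned; the Latin letters and digits around them are skipped.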

Related

Need a regular expression so that ^[a-zA-Z0-9][.a-zA-Z0-9_-]*$ works for all languages

I need a regular expression which should match in all languages, like Japanese, Korean, Spanish, French, Italian, etc.
The expression I'm trying to replicate is:
^[a-zA-Z0-9][.a-zA-Z0-9_-]*$
Assuming you're using PCRE or a compatible engine (Perl, PHP; note that Python's built-in re module does not support \p{...}), do this instead:
^[\p{L}\p{N}][\p{L}\p{N}._-]*$
Or:
^[\p{L}\p{Nd}][\p{L}\p{Nd}._-]*$
This will match any letter or number in any UTF-8 string.
Now, ., _, and - get a bit trickier! You could match any punctuation by \p{P}, but this will include more than the three characters listed. These Unicode escape sequences are explained here and here.
As @Amadan mentioned, the following will work in Ruby or with the Oniguruma library, and it will also work in POSIX regular expression engines:
/[[:alpha:][:digit:]][[:alpha:][:digit:]._-]*/
A POSIX regular expression uses bracket expressions instead of \ escape sequences.
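If you are in Python, whose built-in re module has no \p{...} support, a rough equivalent can be sketched with \w, which is Unicode-aware for str patterns in Python 3. This is an approximation of \p{L}\p{N}, not an exact match, and the is_valid_name name is hypothetical.

```python
import re

# [^\W_] = any Unicode word character except underscore (letters and digits).
# [\w.-] = word characters (including underscore), dot, or hyphen.
NAME = re.compile(r"[^\W_][\w.-]*")

def is_valid_name(text):
    """True if text starts with a letter or digit and continues
    with letters, digits, '.', '_' or '-' only."""
    return NAME.fullmatch(text) is not None
```

The sketch uses fullmatch rather than ^...$ because $ in Python also matches just before a trailing newline.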

Regex to identify German, Chinese and Japanese

I want to identify whether text is in Chinese, Japanese or German, using Regular Expressions.
For example I have some text like this "MainWindow_Button_save".
Its German translation is "MainWindow_Button_sparen".
Its Chinese translation is "MainWindow_Button_保存".
And Japanese is "MainWindow_Button_保存".
I want a regular expression which finds the prefix "MainWindow_Button" and determines whether the following text is Chinese/Japanese/German. I'm not very much concerned about the text. The only thing I am concerned about is which of the three languages it is in.
What I have done is just this "^MainWindow_Button_[^a-zA-Z]*", but how do I identify the language?
I tried working regular expression for the example here
I would suggest taking the first and last characters of the Chinese/Japanese range and putting them in the regular expression, as in "MainWindow_Button_([保-存])+", so that it matches any Chinese/Japanese character in that range.
If you are not using a regular expression, I would suggest another approach, as follows, in Java:
Read the Unicode value of the first character after "MainWindow_Button_" and check whether it falls in the Chinese or Japanese character set; if it falls in neither, the text is German.
The following regex will help to provide the verification that the text is in either Chinese or Japanese:
^[\u3000-\u9FFF ]+$
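The code-point check described above can be sketched in Python (used here only for illustration; the answer mentions Java). It reuses the \u3000-\u9FFF range from the regex above and additionally assumes that Hiragana or Katakana (U+3040 to U+30FF) signals Japanese. Kanji such as 保存 are shared between Chinese and Japanese, so they cannot be told apart by code point alone. The classify_suffix name and the returned labels are hypothetical.

```python
import re

PREFIX = "MainWindow_Button_"
CJK = re.compile(r"[\u3000-\u9FFF]")   # range taken from the answer above
KANA = re.compile(r"[\u3040-\u30FF]")  # assumption: kana implies Japanese

def classify_suffix(key):
    """Classify the text after PREFIX; returns None if the prefix is missing."""
    if not key.startswith(PREFIX):
        return None
    suffix = key[len(PREFIX):]
    if KANA.search(suffix):
        return "japanese"
    if CJK.search(suffix):
        return "chinese_or_japanese"
    # Anything else is treated as non-CJK (German in the question's setup).
    return "other"
```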

A regular expression that matches two long strings and ignores everything in between

I am searching through a 1.5 million line Premiere Pro project for any text that matches one of my audio filters and is set to mono.
The text that I am searching for begins with the <ChannelType> tag and ends with the <FilterMatchName> tag, so it looks like this:
<ChannelType>0</ChannelType>
<FrameRate>5292000</FrameRate>
</AudioComponent>
<FilterPreset>0</FilterPreset>
<OpaqueData Encoding="base64" Checksum="53060659">AAAAAD8L8lo+AUr+Pac1NjwTmoUAAAAAP0uQDD37nIg9ui6MPjwU5j+AAAA+C/JaAAAAAD8qqqsAAAAAP4AAAD92L8w9py8FAAAAAHNvZnQgY29tcHJlc3Npb24AIiBkZWZhdWx0PSIwIiBzdGVwPSIxIiBtaW49IjAiIG1heD0iMSIvPgoJICA8Zmw=</OpaqueData>
<FilterIndex>-1</FilterIndex>
<FilterMatchName>1094998321 Dynamics1</FilterMatchName>
If I were in a Word doc, I would just do a find as
<ChannelType>0</ChannelType>*<FilterMatchName>1094998321 Dynamics1</FilterMatchName>
I am terrible with Regex. I was hoping someone could help me out. Everything I have tried either doesn't match anything, or matches EVERYTHING in the document. I am using Notepad++.
Since you are working in Notepad++, you have access to PCRE regular expressions. This one will get all the text between <ChannelType> and </FilterMatchName>
(?s)<ChannelType>.*?</FilterMatchName>
The (?s) allows the . to match newline characters.
After matching <ChannelType>, the .*? lazily matches all characters up to the closing </FilterMatchName>, which we then match.
Let me know if you have any questions. :)
What type of regular expressions are you using (which language/library)?
Basically you can use .* instead of * in regular expressions. If your text is long, though, it's better to use a reluctant quantifier[1] if your regex implementation allows it.
This is a good site with a comparison of different regex implementations and tutorials:
http://www.regular-expressions.info
[1] http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
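To see the difference between the lazy .*? and a greedy .* on multi-line input, here is a small Python sketch. The sample text is a shortened, made-up version of the project snippet, and Python's re module stands in for Notepad++'s engine, so treat it as an illustration only.

```python
import re

# Shortened, invented stand-in for the Premiere Pro project text.
SAMPLE = """<ChannelType>0</ChannelType>
<FilterIndex>-1</FilterIndex>
<FilterMatchName>1094998321 Dynamics1</FilterMatchName>
<Other>ignored</Other>
<ChannelType>1</ChannelType>
<FilterMatchName>Something Else</FilterMatchName>
"""

# (?s) lets . match newlines; .*? stops at the first closing tag.
LAZY = re.compile(r"(?s)<ChannelType>.*?</FilterMatchName>")
# Greedy .* runs on to the last closing tag in the whole text.
GREEDY = re.compile(r"(?s)<ChannelType>.*</FilterMatchName>")

lazy_matches = LAZY.findall(SAMPLE)
greedy_matches = GREEDY.findall(SAMPLE)
```

The lazy pattern yields one match per block, while the greedy one swallows both blocks and the line between them in a single match.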

Removing Japanese text between two vertical bars using regex and notepad++

I have a text like this
English text||Arabic text||Japanese text||Arabic text||numbers
I tried using (\|\|\p{Han}\p{Hiragana}\p{Katakana}\|\|) but I'm getting an "invalid Regular Expression" error message in Notepad++, although the pattern is valid, as I tested it in a regex tester. Also, this only matches Japanese text with Kanji first, then Hiragana, then Katakana. How can I make it match the Japanese text regardless of that order?
Notepad++ does not support the \p escape. Try \p{Letter} (which should match any letter in any language) and you will see no match.
You can use some other application, e.g. a very good one is EditPad.
I figured out the main problem: I have to use a character class, [\p{Han}\p{Hiragana}\p{Katakana}], if I don't want a specific order. So all I had to do was find \|\|([\p{Han}\p{Hiragana}\p{Katakana}]*?)\|\| and replace it with ||.
Of course this didn't work in Notepad++, so I used EditPad as NikitOn suggested.
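For anyone without EditPad, the same replacement can be approximated in Python. The built-in re module has no \p{Han} and friends, so this sketch assumes explicit ranges instead: Hiragana and Katakana (U+3040 to U+30FF) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF). Rarer kanji, Japanese punctuation, and spaces fall outside those ranges and would stop the match. It also uses + rather than *? so that an empty |||| is left alone. strip_japanese_field is a made-up name.

```python
import re

# Assumed ranges standing in for \p{Hiragana}, \p{Katakana} and \p{Han}.
JAPANESE_FIELD = re.compile(r"\|\|[\u3040-\u30FF\u4E00-\u9FFF]+\|\|")

def strip_japanese_field(line):
    """Replace ||<Japanese text>|| with a single || separator."""
    return JAPANESE_FIELD.sub("||", line)
```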

How can I search in Vim, using regular expressions for letters (both ascii and non ascii)?

In .NET, \p{L} matches any ascii or non-ascii letter (so it will match both a and ü).
http://www.regular-expressions.info/unicode.html#prop
Is there a Vim equivalent for this?
In Vim \a or \w will only match characters in range [a-z] (or [0-9A-Za-z_]).
You can explicitly tell Vim which ranges of hex values to match. This is kind of a shotgun approach, but if you know what the possible ranges are (like UTF-8, for example), this would work:
/[\x7f-\xffa-zA-Z]
You can also search for explicit Unicode values by entering the Unicode character directly or its code in the following format:
/\%u0300