Regex accepting letters from languages like Å, Ø or र

I need a regex that accepts everything except whitespace, followed by # (exactly one occurrence), and then everything except whitespace again.
E.g. abc#abc
By "everything" I mean all characters, including letters from Norwegian or other Nordic alphabets, or from any other language.
I tried this:
^\S[^#]+\b#\b\S[^#]+$
but it fails for input like Ø#Ø, Å#Å or र#र.
Edit: I want this for JavaScript.

Try something simpler, like:
^[^#\s]+#[^#\s]+$
[^#\s]+ matches one or more characters that are neither # nor whitespace.
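A quick way to check this pattern is to run it over a few inputs in JavaScript. This is only a sketch; the names `oneHashPattern` and `hashSamples` are made up for the illustration.

```javascript
// Exactly one "#", with at least one non-space, non-"#" character on each side.
const oneHashPattern = /^[^#\s]+#[^#\s]+$/;

const hashSamples = ["abc#abc", "Ø#Ø", "Å#Å", "र#र", "a#a", "abc#", "a b#c", "a#b#c"];
for (const s of hashSamples) {
  // The first five samples are accepted, the last three are rejected.
  console.log(JSON.stringify(s), oneHashPattern.test(s));
}
```

Because the character class is defined by what it excludes, no Unicode flag or property class is needed for letters such as Ø or र.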

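First of all: did you try to run this regex on "a#a", "b#b" or "c#c"? It fails for those too :).
Your regex expects one non-space character and then at least one non-# character before the #, so each side needs at least two characters.
A regex closer to your description is:
^\S+\b#\S+$
Be aware, though, that in JavaScript \b only treats ASCII letters, digits and _ as word characters, so this still rejects Ø#Ø, and \S also matches #, so more than one # gets through. Dropping \b and excluding # from both sides gives ^[^#\s]+#[^#\s]+$.
The other thing that may be affecting your results is the encoding of the script that contains your regex. If it is not Unicode, there may be problems, but I'm not sure. What are you using to run the regex? npp? php?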
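The effect of \b on non-ASCII input can be seen directly in JavaScript, where \w (and therefore \b) only covers [A-Za-z0-9_]. The sketch below uses the made-up name `withBoundary` for the regex from this answer.

```javascript
// The regex from the answer above, unchanged.
const withBoundary = /^\S+\b#\S+$/;

console.log(withBoundary.test("abc#abc")); // true: "c" is a word character, so \b matches before "#"
console.log(withBoundary.test("Ø#Ø"));     // false: "Ø" and "#" are both non-word, so there is no boundary
console.log(withBoundary.test("a#b#c"));   // true: \S also matches "#", so a second "#" is accepted
```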

Related

Regular Expression - removing a line (English) and attaching it to the end of the line above (Korean)

I have text like this:
아니다
bukan
싫다
tidak suka
훌륭하다
bagus
I am trying to remove each line written in the English (Latin) alphabet and attach it to the end of the Korean line above it, like this:
아니다bukan
싫다tidak suka
훌륭하다bagus
I have finally found a regular expression that is almost right:
[가-힣]\R
However, it turns the text file into this:
아니bukan
싫tidak suka
훌륭하bagus
The problem is that it also removes the last Korean character of each line.
How can I solve this problem?
C++ std::regex does not support Unicode property classes like \p{Hangul}, but you may use the equivalent character class: [\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC].
Besides, \R is not supported either. You can use \r?\n to match Windows/Linux style line endings, or (?:\r\n?|\n) to also support classic Mac OS (CR-only) line endings.
Next, since the pattern matches and consumes a Korean character, you need to capture it in a capturing group and use a backreference to that group in the replacement pattern.
So, you may use
([\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC])(?:\r\n?|\n)
Replace with $1 to put back the Korean char into the resulting string.
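The capture-and-backreference idea can be sketched in JavaScript rather than C++ (a swap made only for illustration; `hangulThenBreak`, `koreanInput` and `joinedLines` are made-up names). The character class is the one given in this answer, and String.prototype.replace understands $1 in the replacement string.

```javascript
// Group 1 captures one Hangul character; the line break after it is matched but not captured.
const hangulThenBreak = /([\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC])(?:\r\n?|\n)/g;

const koreanInput = "아니다\nbukan\n싫다\ntidak suka\n훌륭하다\nbagus";

// "$1" puts the captured Korean character back, so only the line break is removed.
const joinedLines = koreanInput.replace(hangulThenBreak, "$1");
console.log(joinedLines);
// 아니다bukan
// 싫다tidak suka
// 훌륭하다bagus
```

Line breaks that follow a Latin letter are left alone, which keeps each joined pair on its own line.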
In engines that support Unicode properties, the regex that matches any Korean (Hangul) character is:
\p{Hangul}
There is more information here: https://www.regular-expressions.info/unicode.html
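Maybe you also need a + after your group of characters?
If so, capture the run and restore it in the replacement: search for ([\p{Hangul}]+)\R and replace with $1, instead of what you're using now. Without the capture, the matched Korean text would be removed along with the line break.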
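For comparison, here is a minimal sketch of the property-class approach in JavaScript. Two details are specific to that engine and are not taken from the answer: the property must be spelled \p{Script=Hangul} with the u flag, and \R does not exist, so \r?\n stands in for it. The names `hangulRun` and `mixedLines` are hypothetical.

```javascript
// One or more Hangul characters (captured), followed by a line break.
const hangulRun = /(\p{Script=Hangul}+)\r?\n/gu;

const mixedLines = "아니다\nbukan\n싫다\ntidak suka\n훌륭하다\nbagus";

// Restore the captured Korean text and drop only the line break.
const mergedLines = mixedLines.replace(hangulRun, "$1");
console.log(mergedLines);
// 아니다bukan
// 싫다tidak suka
// 훌륭하다bagus
```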

Sublime Text 2 snippet substitution RegEx: lookahead not taking accented characters

I'm trying to do a substitution where I need to find spaces followed by a number or a letter, with or without accents, to replace them with an underscore. I currently have this (note the space in the beginning):
\b(?=[a-zA-Z0-9àéèêëîïôöûü])
With the string test string école test, the replacement looks like this:
test_string école_test
I guess you got the problem, but just in case, the expected result is this:
test_string_école_test
The strangest thing is that if I just search for [a-zA-Z0-9àéèêëîïôöûü], it matches every single one of the letters, so my RegEx seems just fine...
Is this a bug or am I missing something?
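Drop the \b - it's not essential to your query (you're already matching the space), and Unicode support is patchy in regular expressions. Word-boundary detection is ASCII-only in Sublime Text 2, so there is no boundary between the space and é (both count as non-word characters) and the match fails at that position.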
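The same behaviour can be reproduced in JavaScript, whose \b is also ASCII-only. This is an illustration in a different engine, not Sublime Text itself. The accented letters are written as \u escapes (à é è ê ë î ï ô ö û ü) so the example does not depend on file encoding, and the variable names are made up.

```javascript
const accentInput = "test string \u00e9cole test"; // "test string école test"

// Original pattern: space, word boundary, then a lookahead for a letter or digit.
const withWordBoundary = / \b(?=[a-zA-Z0-9\u00e0\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f6\u00fb\u00fc])/g;
// Same pattern without \b: the lookahead alone decides.
const withoutWordBoundary = / (?=[a-zA-Z0-9\u00e0\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f6\u00fb\u00fc])/g;

console.log(accentInput.replace(withWordBoundary, "_"));    // test_string école_test
console.log(accentInput.replace(withoutWordBoundary, "_")); // test_string_école_test
```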

Regex to match any strings containing Cyrillic symbols, except comments marked with //, ///, ////, etc.

I want to find all strings containing at least one Cyrillic character (basically /.*[А-я].*/), with the exception of comments.
A comment is a string, or part of a string, that starts with two or more / characters.
Currently I have this regex, which does part of the trick:
^(?=^.*?[А-я]+).*?((?=[\/]{2,})|(^(?:(?![\/]{2,}).)*$))
But I'd like a less bloated and faster expression.
As an additional question: could anyone explain why this one works? I put it together by trial and error, and I'm not sure I completely understand how it works, because whenever I change any part of it, it stops working.
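The following regex will match any Cyrillic character that is not preceded by a double forward slash:
(?<!/{2}.*)[А-я]
The negative lookbehind specifies that no double slash may occur anywhere earlier on the line.
You haven't specified which regex flavour you're using, so be aware that support for lookbehind varies. This pattern needs a variable-length lookbehind, which .NET supports; PCRE does not accept an unbounded .* inside a lookbehind, and JavaScript gained lookbehind only in ES2018. Your own regex uses three lookaheads, which are far more widely supported, so it does not tell us whether lookbehind will work for you.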
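Where lookbehind is unavailable, one possible alternative is to anchor at the start of the line and only step over characters while no // begins at the current position. The sketch below is one way to write that in JavaScript, not the answerer's pattern. `cyrillicOutsideComment` is a made-up name, and [\u0410-\u044F] is the same range as [А-я] (so Ё and ё fall outside it, as in the original).

```javascript
// (?:(?!\/\/).)* consumes characters only while "//" does not start at the current position,
// so the Cyrillic letter that follows must sit before any comment marker.
const cyrillicOutsideComment = /^(?:(?!\/\/).)*[\u0410-\u044F]/;

const codeLines = [
  'print("привет"); // greeting',  // Cyrillic in code: match
  "int x = 1; // счётчик",          // Cyrillic only in the comment: no match
  "/// документация",               // whole line is a comment: no match
  "путь a/b",                       // a single slash is not a comment: match
  "plain ascii",                    // no Cyrillic at all: no match
];
for (const line of codeLines) {
  console.log(cyrillicOutsideComment.test(line), line);
}
```

The pattern is meant to be applied one line at a time, since . does not match line breaks.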

What is the regex syntax for a file name with spaces

I'm using a custom blog syndication tool and having problems in using the regex syntax.
Example:
The original code
<img src="http://www.mydomain.com/some image.png">
I tried:
/\<img src\=\"http\:\/\/www\.mydomain\.com\/some\%20image\.png\"\>/
and
/\<img src\=\"http\:\/\/www\.mydomain\.com\/some image\.png\"\>/
But none of them seem to work.
Any suggestions?
The pattern delimiters (/) at either end mean the regex engine should know where the pattern ends, so the engine shouldn't be confused by a space. Are you sure there isn't something else wrong with the pattern? I suspect this is the most likely problem.
You might like to try the %20 without the % being escaped.
Another thing to try could be an escaped space (with a preceding backslash).
Otherwise, you could try using \s to match a space - it's reasonably standard in regex engines (but also matches tabs, line feeds and carriage returns).
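As a rough sketch of these suggestions, the pattern below accepts either a literal space or an unescaped %20 in the file name. It is written for JavaScript purely as an illustration; the regex flavour of the syndication tool is not known, and `imgPattern` is a made-up name.

```javascript
// "(?: |%20)" matches a literal space or its URL-encoded form.
// Only ".", "/" and the delimiters need escaping; "<", ">", "=", ":", "%" and the quotes do not.
const imgPattern = /<img src="http:\/\/www\.mydomain\.com\/some(?: |%20)image\.png">/;

console.log(imgPattern.test('<img src="http://www.mydomain.com/some image.png">'));   // true
console.log(imgPattern.test('<img src="http://www.mydomain.com/some%20image.png">')); // true
console.log(imgPattern.test('<img src="http://www.mydomain.com/someimage.png">'));    // false
```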

How to make a regular expression looking for a list of extensions separated by a space

I want to be able to take a string of text from the user that should be formatted like this:
.ext1 .ext2 .ext3 ...
Basically, I am looking for a dot, a string of alphanumeric characters of any length, a space, and rinse and repeat. I am a little confused about how to say "I need a period, a string of characters and a space". Also, the last extension could be followed by nothing, a space, or a series of spaces. And I guess the extensions could be separated by any number of spaces?
EDIT: I made it clearer what I was looking for.
Thanks!
Try this:
^(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *$
In a Java string literal you need to escape the backslashes:
"^(?:\\.[A-Za-z0-9]+ +)*\\.[A-Za-z0-9]+ *$"
(\.\w+)\s* Match this repeatedly to extract each extension.
^((\.\w+)\s*)*$ Check this and if it's true, your String is exactly what you want.
For the last pattern thing, you can't (AFAIK) do both getting all extensions (separated) and checking that the last is followed by other things. Either you check your string, or you extract the extensions from it.
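I'd start with something like: ^\.[a-z0-9]+([\t\n\v ]+\.[a-z0-9]+)*$ (the dots are escaped so that they match a literal period; as written it accepts lowercase only and no trailing spaces).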
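Checking the whole string and extracting the individual extensions can be combined as two separate steps. The JavaScript sketch below uses the first answer's pattern for validation and a simpler global match for extraction; `parseExtensions` is a hypothetical helper, not part of any library.

```javascript
// Validation: dot + alphanumerics, separated by one or more spaces, optional trailing spaces.
const extensionList = /^(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *$/;

// Returns the list of extensions, or null when the input is not in the expected format.
function parseExtensions(input) {
  if (!extensionList.test(input)) {
    return null;
  }
  // The input is already known to be well formed, so a plain global match is enough.
  return input.match(/\.[A-Za-z0-9]+/g);
}

console.log(parseExtensions(".ext1 .ext2 .ext3")); // [ '.ext1', '.ext2', '.ext3' ]
console.log(parseExtensions(".txt   .md  "));      // [ '.txt', '.md' ]
console.log(parseExtensions("txt .md"));           // null
```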