Regular expressions, Greek characters and the *-quantifier doesn't work (but the +-quantifier does)? - regex

I use the regular expression [\p{Greek}] to match any Greek character. It works as expected and matches the first Greek character on the line. However, I want to match all the Greek characters that follow that first character, and the *-quantifier doesn't seem to work for Greek characters.
This is my input data. Each line starts with three spaces, followed by a double quote, then a Greek or Latin string containing one or more spaces, and ends with ", and a newline.
   "ξηλξλκξ λκλξλξ",
   "lkjlkj kjljl",
   "δδσασα ασδ ασδφ",
   "xxaax asdsd dsds",
   "δερεφε αδσφδσ",
a. ^.*?[\p{Greek}|\s] - just matches the first space on all lines.
b. ^.*?[\p{Greek}|\s]+ - matches all three initial spaces, on all lines.
c. ^.*?"[\p{Greek}|\s]+ - matches the whole line when it is written with Greek characters
d. ^.*?"[\p{Greek}|\s]* - matches the initial spaces and the " on the Latin lines and the whole line excluding the ", at the end on the Greek lines.
e. [\p{Greek}]* - matches all characters on the Latin lines, but just one at a time (in spite of the *). On the Greek lines it matches the initial spaces, one at a time, but not the first ". Then it matches the first word, but not the space between the words.
(e) is super confusing. If I do a search-and-replace using that regular expression on the string "XYZ NOP", inserting A for each match one at a time ("replace and find next"), the result looks like this: A A"XAYZA NAOPA",A. However, if I perform a "replace all", this is the result: ´A A A A"AXAYAZA ANAOAPA"A,´. All the original characters remain, in spite of the search-and-replace, with As inserted more or less randomly.
I have no idea what is going on here.
A couple of questions here:
Why does ^.*? just match the first spaces in (a) and not the " in (b)?
Why do + and * give different results?
(e) - what to say!? I don't understand the *-quantifier's behaviour here.
I am using BBEdit for this. I have used BBEdit with regular expressions since the 90s and have never encountered any issues with its regexp implementation. But OTOH, I have never tried working with Greek characters before.
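One possible explanation, sketched in Python rather than BBEdit: a pattern quantified with * is allowed to match the empty string, so it "succeeds" at every position, including positions where there is no Greek character at all, whereas + needs at least one character. Also, inside a character class | is a literal pipe, not alternation. Python's built-in re module does not support \p{Greek}, so the GREEK range below (U+0370 to U+03FF) is an assumption standing in for it, and BBEdit's engine may differ in the details.

```python
import re

# Assumption: approximate \p{Greek} with the Greek and Coptic block,
# because Python's re module has no \p{...} script classes.
GREEK = r"\u0370-\u03FF"

latin_line = '   "XYZ NOP",'
greek_line = '   "δερεφε αδσφδσ",'

# '*' may match zero characters, so there is an empty match at every
# position of a line that contains no Greek letters at all.
star_replaced = re.sub(f"[{GREEK}]*", "A", latin_line)

# '+' needs at least one Greek letter, so the Latin line is left untouched.
plus_replaced = re.sub(f"[{GREEK}]+", "A", latin_line)

# On a Greek line '+' finds each run of Greek letters (the space splits them).
greek_words = re.findall(f"[{GREEK}]+", greek_line)

# Inside [...] the '|' is an ordinary character, not an "or".
pipe_is_literal = re.fullmatch(r"[a|b]", "|") is not None

print(star_replaced)
print(plus_replaced)
print(greek_words)
print(pipe_is_literal)
```

The first printed line closely resembles the "replace all" result described in (e): nothing is consumed, and an A is inserted at every position where the empty match succeeds.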

Related

I need a Regular Expression to find whitespaces and replace with a dash?

I thought this might work: ^['\s+', '-', "This should be connected"\w\s]{1,}$
But something is wrong with it. Does anyone know of a regex that will place dashes between words without placing a dash in front of the very first word or behind the very last word? Sometimes I will only have one word, so no dashes are required.
The tool I am using is www.import.io, which allows me to turn any website into a table of data or an API in seconds, no coding required. It uses regex and XPath to help refine and reformat the data it captures.
I don't know about www.import.io, but in plain JavaScript
" This is my test string ".replace(/(\w+)\s+(?=\w)/g, "$1-")
has the result:
" This-is-my-test-string "
The regex replaces the whitespace between words with dashes, but not at the beginning or the end of the string.
(To be more precise, it replaces every group of word characters followed by whitespace, which in turn is followed by a word character, with the same word characters and a dash in place of the whitespace.)
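For comparison, here is a minimal sketch of the same substitution in Python. The function name dash_between_words is made up for illustration, and whether www.import.io accepts this exact syntax is not something this sketch can confirm.

```python
import re

def dash_between_words(text: str) -> str:
    # (\w+)   a word, captured so it can be put back
    # \s+     the whitespace after it
    # (?=\w)  only if another word follows (look-ahead, consumes nothing)
    return re.sub(r"(\w+)\s+(?=\w)", r"\1-", text)

print(dash_between_words(" This is my test string "))
print(dash_between_words("single"))
```

Because the look-ahead requires a following word character, leading and trailing whitespace is left alone and a single word gets no dash.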

regex 1 character and space only

Hi, I am learning regex.
I was trying to write a regular expression for the following condition:
any one letter from the set C-MPSTV-XZ, and the letter must not be repeated.
This letter can have one blank space in front or behind, i.e. it can be " C" or "C ".
[C-MPSTV-XZ{1} ]{2}
I was trying the above expression: I expected {1} to allow only one character, and the space after it to allow only one space. At the end of the expression I put {2} to get exactly 2 characters.
I was expecting regex_match to be false for the input "XX", but it's not working.
Appreciate your help.
Try \s?[C-MPSTV-XZ]\s?. If you are using std::regex_match, you shouldn't need anything else, since regex_match requires a match over the entire string.
Your posted regex will match two characters neither of which is a space, because you're asking for any two characters from inside the character class. You're also going to accept {, 1 and } as characters, because quantifiers are treated as literal characters inside a character class.
The simple alternative is to just spell out the two conditions explicitly:
( [C-MPSTV-XZ]|[C-MPSTV-XZ] )
This assumes that your regex engine is treating whitespace within regexes as significant. If not, or if you don't like that, replace the spaces with a suitable escape sequence.
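A quick way to compare the three patterns is to run them as whole-string matches. This is a sketch in Python, where re.fullmatch plays the role of std::regex_match; the helper name full is made up for illustration.

```python
import re

posted = re.compile(r"[C-MPSTV-XZ{1} ]{2}")               # the question's attempt
optional = re.compile(r"\s?[C-MPSTV-XZ]\s?")              # optional space on either side
explicit = re.compile(r"( [C-MPSTV-XZ]|[C-MPSTV-XZ] )")   # exactly one space, before or after

def full(rx, s):
    # True only if the pattern matches the entire string.
    return rx.fullmatch(s) is not None

for s in ["XX", "{1", " C", "C ", "C"]:
    print(repr(s), full(posted, s), full(optional, s), full(explicit, s))
```

The posted pattern accepts both "XX" and "{1", which shows that the braces and the 1 are being read as ordinary members of the class. The optional-space version also accepts a bare "C", while the explicit alternation insists on exactly one space.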

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is, numbers or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a literal period; inside a character class it does not mean "any character"
\s matches whitespace (spaces, tabs and, in most engines, line breaks)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
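As a rough check, the pattern can be run against the sample names from the question. This is a sketch in Python; the entries in bad_samples are hypothetical inputs added for illustration.

```python
import re

NAME_RX = re.compile(r"^[A-Za-z.\s_-]+$")

ok_samples = ["Dr. Marshall", "sam smith", ".george con-stanza .great",
              "peter.", "josh_stinson", "smith _.gorne"]
# Hypothetical inputs that should be refused (digits, symbols, empty string).
bad_samples = ["agent 007", "a+b", "name@example", ""]

accepted = [s for s in ok_samples if NAME_RX.match(s)]
rejected = [s for s in bad_samples if not NAME_RX.match(s)]

print(accepted)
print(rejected)
```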
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).
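One difference from the + version is worth seeing in practice: with * the empty string is accepted too. A small sketch in Python, where re.IGNORECASE stands in for the case-insensitive modifier:

```python
import re

star_rx = re.compile(r"^[a-z ._-]*$", re.IGNORECASE)  # zero or more characters
plus_rx = re.compile(r"^[a-z ._-]+$", re.IGNORECASE)  # one or more characters

name_ok = bool(star_rx.match("Dr. Marshall"))   # upper case allowed via the flag
empty_star = bool(star_rx.match(""))            # '*' accepts an empty input
empty_plus = bool(plus_rx.match(""))            # '+' does not
tab_ok = bool(star_rx.match("tab\there"))       # a literal space is not a tab

print(name_ok, empty_star, empty_plus, tab_ok)
```

If empty input should be rejected, + is the safer quantifier. If tabs should be allowed, \s is needed instead of the literal space.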

Regular expression - finding specific string with at least one capital letter

I am looking for a regular expression which matches a specific string which:
always starts with "fu:
always ends with "
and contains at least one capital letter in between those start and ending points
Point 3 is the part I really can't solve.
The regex "fu:(.*)?" satisfies everything apart from point 3.
[edit]
It's pretty close now; the only problem is that it doesn't stop after the second ".
Basically this string:
"fu:no capital letter:,some other random text WITH CAPITAL LETTERS"
is a match but shouldn't be.
The regex that will work for you is this:
/^"fu:.*?[A-Z].*?"$/
^"fu:.*[A-Z].*"$
Don't forget about multiline mode if you wish to search in several lines of text.
^"fu: - starts with "fu:
.* - any other characters
[A-Z] - capital letter
.* - other characters
"$ - " at the end
Good tool to test it: http://www.regexplanet.com/advanced/java/index.html
Something like
^"fu:([^"]*?[A-Z][^"]*?)"$
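The practical difference between .* and [^"]* shows up when text continues after the closing quote. A sketch in Python; the three test strings are hypothetical and not taken from the question.

```python
import re

greedy = re.compile(r'^"fu:.*[A-Z].*"$')               # '.' may run across quotes
no_quotes = re.compile(r'^"fu:([^"]*?[A-Z][^"]*?)"$')  # stays inside one quoted string

# Hypothetical test strings.
good = '"fu:One Capital"'
no_capital = '"fu:no capital"'
runs_on = '"fu:no capital", other TEXT"'

for s in (good, no_capital, runs_on):
    print(s, bool(greedy.match(s)), bool(no_quotes.match(s)))
```

With [^"] the capital letter has to appear before the first closing quote, so the third string is rejected, while the .* version accepts it because of the capitals further along the line.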
I commented on a problem with anubhava's solution (that it only matches upper case letters in the range A through Z), but then found the solution myself. Note that this requires a POSIX-compliant regular expression engine with support for Unicode.
My solution is
/^"fu:.*[[:upper:]].*"$/
It solves the problem of finding upper case letters in other languages than English (with partially or completely different alphabets).
An example in Ruby:
rx = /^"fu:.*[[:upper:]].*"$/
arr = ['"fu:Berlin"', '"fu:İstanbul"', '"fu:Washington"', '"fu:Örebro"', '"fu:Москва"']
arr.map {|s| s.scan rx}
In this case, all of the strings are matched.
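Python's built-in re module has no [[:upper:]] class, so a rough Python equivalent has to swap in something else. The sketch below uses str.isupper() on the captured text instead; the function name has_uppercase_fu is made up for illustration.

```python
import re

FU_RX = re.compile(r'"fu:([^"]*)"')

def has_uppercase_fu(s: str) -> bool:
    # The whole string must be "fu:..." with no inner quotes,
    # and the part after fu: must contain an upper-case letter
    # (str.isupper() is Unicode-aware).
    m = FU_RX.fullmatch(s)
    return m is not None and any(ch.isupper() for ch in m.group(1))

cities = ['"fu:Berlin"', '"fu:İstanbul"', '"fu:Washington"', '"fu:Örebro"', '"fu:Москва"']
results = [has_uppercase_fu(s) for s in cities]
print(results)
```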

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see that this passes any alphanumeric ASCII string that contains at least one letter.
The key to this is the \d* at the front. Without it the regex gets much more awkward to write.
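A few such tests, sketched in Python rather than ASP.NET. The re.ASCII flag is an addition here so that \d means [0-9] only, matching the "ASCII digits" reading above; the sample inputs are hypothetical.

```python
import re

RX = re.compile(r"^\d*[a-zA-Z][a-zA-Z0-9]*$", re.ASCII)

def is_valid(s: str) -> bool:
    return RX.match(s) is not None

for s in ["abc", "123a", "a123", "12a34", "123", "", "ab_c", "12 a"]:
    print(repr(s), is_valid(s))
```

The leading \d* soaks up any digits before the first letter, the single [a-zA-Z] guarantees that a letter exists, and the tail accepts whatever alphanumerics remain.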
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumeric characters, but only if the look-ahead (?=.*[a-zA-Z].*) also succeeds, i.e. the string contains at least one letter. It's a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
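To see the two constraints working independently, here is a sketch in Python with a handful of hypothetical inputs. The look-ahead checks "contains a letter" without consuming anything, and the main group then checks "6 to 12 alphanumerics".

```python
import re

RX = re.compile(r"^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$")

def is_valid(s: str) -> bool:
    return RX.match(s) is not None

for s in ["abc123", "abcdef123456", "123456", "ab1", "a" * 13, "abc 123"]:
    print(repr(s), is_valid(s))
```

Adding another rule later, say "at least one digit", would only mean adding another look-ahead at the front rather than restructuring the whole pattern.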
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
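Python's built-in re module does not understand \p{L} or \p{N}, so a Python sketch of the Unicode-aware rule has to swap in something else. The version below uses the standard unicodedata module instead of a regex; the function name is made up for illustration.

```python
import unicodedata

def is_alnum_with_letter(s: str) -> bool:
    # First letter of each character's general category:
    # 'L' = letter (like \p{L}), 'N' = number (like \p{N}).
    major = [unicodedata.category(ch)[0] for ch in s]
    only_letters_and_numbers = all(m in "LN" for m in major)
    # An empty string has no 'L', so it is rejected as well.
    return only_letters_and_numbers and "L" in major

for s in ["abc123", "12345", "Москва2024", "abc_1", "", "\u00e91"]:
    print(repr(s), is_alnum_with_letter(s))
```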
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the input to prevent other characters. You could replace the [0-9A-Za-z] block with \w (which also allows the underscore), but I prefer the more verbose form because it's easier to extend with other characters if you want.
Add a regular expression validator to your asp.net page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, digits and connector punctuation such as the underscore, which you might not want), that contains one letter character (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
This can be:
a number followed by a letter,
or an alphanumeric expression that starts with a letter,
or an alphanumeric expression that starts with a number, followed by a letter and ending with an alphanumeric subexpression.