Complicated regexp for matching words (is it possible?)

I want a regex to match all Greek (utf-8) words that do NOT:
end with .
end with -
end with '
end with numbers (1-9)
start with .
start with ,
start with -
start with a capital letter
consist entirely of capital letters
Is this possible? To match Greek words I use \p{Greek}{3,} which matches Greek UTF-8 words that have at least 3 characters.
I write programs in Ruby, but if it can be done in Perl or any other CLI tool/language, I'll write a script to dump the output to a text file.

(?<!\S)(?=\S*\p{Greek})(?![-,.\p{Lu}])(?![\p{Lu}\P{L}]+\b)\S+(?<![-.'1-9])(?!\S)
Let's break this beasty down:
The core of the regex is the \S+ in the middle which is surrounded by a bunch of positive and negative assertions.
(?<!\S) - The word must not be preceded by a non-whitespace character. This makes sure we don't start our match in the middle of a word.
(?=\S*\p{Greek}) - There must be at least one Greek letter in there somewhere.
(?![-,.\p{Lu}]) - The word must not start with a dash, comma, dot, or uppercase letter \p{Lu}.
(?![\p{Lu}\P{L}]+\b) - The word must not be all uppercase letters and symbols.
(?<![-.'1-9]) - The word must not end with a dash, dot, apostrophe, or digit 1 through 9.
(?!\S) - The word must not be followed by a non-whitespace character. This makes sure we don't end our match in the middle of a word.
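Since the question mentions Ruby, here is a minimal sketch of how the pattern could be applied with String#scan. The constant GREEK_WORD, the helper greek_words, and the sample sentence are illustrative assumptions, not part of the original answer.

```ruby
# The pattern from the answer, unchanged. Ruby's regex engine supports
# \p{Greek}, \p{Lu} and lookbehind on UTF-8 strings.
GREEK_WORD = /(?<!\S)(?=\S*\p{Greek})(?![-,.\p{Lu}])(?![\p{Lu}\P{L}]+\b)\S+(?<![-.'1-9])(?!\S)/

# Hypothetical helper: returns every whitespace-delimited word that passes
# all of the assertions.
def greek_words(text)
  text.scan(GREEK_WORD)
end

sample = "καλημέρα Ελλάδα ΑΘΗΝΑ σπίτι. -νερό ψωμί1 ουρανός hello θάλασσα"
greek_words(sample)
# => ["καλημέρα", "ουρανός", "θάλασσα"]
```

The capitalized words, the word ending in a dot, the word starting with a dash, the word ending in a digit, and the non-Greek word are all skipped. The question's extra {3,} length requirement is not part of this pattern.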

Related

Regex command to match combinations but not only uppercase letters

Is there a regex to match all combinations of uppercase letters, lowercase letters, underscores, brackets, and numbers, but not words made up only of uppercase letters or only of numbers?
I thought I had it with this one:
(/\b(?![A-Z]+\b)(?![0-9]+\b)[a-zA-Z0-9_{}]+\b/)
That was until I encountered: ABC{hello}_HI_HelLo
This is not a match, and I would like my regex to match this string.
There seems to be a problem with the negative lookahead: it reads "ABC", treats it as an uppercase-only word, and so does not match the whole string; only the part after the "{" is matched.
When you add an underscore after "ABC" you get a matching string: ABC_{hello}_HI_HelLo
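In ABC{hello}_HI_HelLo there is a word boundary between C and {, so [A-Z]+\b matches ABC and the negative lookahead rejects that starting position. In ABC_{hello}_HI_HelLo the character after ABC is _, which is a word character, so there is no boundary at that point and the lookahead passes.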
You can assert a whitespace boundary to the left (?<!\S) and the right (?!\S) instead.
The pattern matches:
(?<!\S) Assert a whitespace boundary to the left
(?![A-Z]+(?!\S)) Assert not only uppercase chars followed by a whitespace boundary at the right
(?![0-9]+(?!\S)) Assert not only digits followed by a whitespace boundary at the right
[a-zA-Z0-9_{}]+ Match 1 or more occurrences of any of the listed
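Putting the listed parts together gives the pattern sketched below in Ruby. The trailing (?!\S) is an assumption based on the sentence about asserting a whitespace boundary on the right; the constant names and sample strings are illustrative as well.

```ruby
# The asker's original pattern, kept for comparison.
OLD_TOKEN = /\b(?![A-Z]+\b)(?![0-9]+\b)[a-zA-Z0-9_{}]+\b/

# Assembled from the parts listed above; the trailing (?!\S) is assumed.
TOKEN = /(?<!\S)(?![A-Z]+(?!\S))(?![0-9]+(?!\S))[a-zA-Z0-9_{}]+(?!\S)/

"ABC{hello}_HI_HelLo"[OLD_TOKEN]  # => "{hello}_HI_HelLo" (ABC is skipped)
"ABC{hello}_HI_HelLo"[TOKEN]      # => "ABC{hello}_HI_HelLo"
"ABC"[TOKEN]                      # => nil (uppercase letters only)
"12345"[TOKEN]                    # => nil (digits only)
"abc123"[TOKEN]                   # => "abc123"
```

With whitespace boundaries, the "uppercase only" check has to reach the end of the whole token, so a leading ABC no longer disqualifies the rest.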

Regex to match if a word starts and ends with a letter and has no more than one consecutive non-letter (. *')

I'm currently trying to find a regex for a specific use case and haven't found a way to achieve it. As the title says, I would like to match a word that starts and ends with a letter and contains only letters and these characters: "\ *- \'". It should also have no more than one consecutive non-letter.
I currently have this, but it accepts consecutive non-letters and doesn't accept single letters: [a-zA-Z][a-zA-Z \-*']+[a-zA-Z]
I want my regex to accept this string
This is accepted since it contains only spaces and letter and there is no consecutive space
a should be accepted
This is --- not accepted because it contains 5 consecutive non-letters characters (3 dashes and 2 spaces)
" This is not accepted because it starts with a space"
Neither is this one since it ends with a dash -
You may use
^[a-zA-Z]+(?:[ *'-][a-zA-Z]+)*$
Details
^ - start of string anchor
[a-zA-Z]+ - 1+ ASCII letters
(?:[ *'-][a-zA-Z]+)* - 0 or more sequences of:
[ *'-] - a space, *, ' or -
[a-zA-Z]+ - 1+ ASCII letters
$ - end of string anchor.
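A quick way to check the pattern against the examples from the question is sketched below in Ruby. Ruby treats ^ and $ as line anchors, so \A and \z are swapped in to anchor the whole string; the constant and helper names are assumptions made for the example.

```ruby
# \A and \z are used instead of ^ and $ because Ruby treats ^ and $ as
# line anchors rather than string anchors.
LETTER_RUNS = /\A[a-zA-Z]+(?:[ *'-][a-zA-Z]+)*\z/

# Hypothetical helper: true when the text is runs of ASCII letters joined
# by exactly one separator character each.
def valid_phrase?(text)
  LETTER_RUNS.match?(text)
end

valid_phrase?("I want my regex to accept this string")  # => true
valid_phrase?("a")                                      # => true
valid_phrase?("This is --- not accepted")               # => false
valid_phrase?(" starts with a space")                   # => false
valid_phrase?("ends with a dash -")                     # => false
```

Because every separator must be followed by at least one letter, two separators in a row can never match, and the string cannot end on a separator.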

How to write a regex in title case

I'm working with an SAP application called Information Steward and creating a rule where names have to be in title case (i.e. each word is capitalized).
I've formulated the following rule:
BEGIN
IF(match_regex($name, '(^(\b[A-Z]\w*\s*)+$)', null)) RETURN TRUE;
ELSE RETURN FALSE;
END
Although it is successful it appears to accept inputs which should be identified as 'FALSE'. Please see the attached screenshot.
'TesT Name' and 'TEST NAME' should be FALSE but are instead passing under this regex.
Any help/guidance with the regex would be very useful.
The (^(\b[A-Z]\w*\s*)+$) regex matches a string that consists entirely of:
^ - start of string
(\b[A-Z]\w*\s*)+ - 1 or more occurrences (due to (...)+) of
\b - a word boundary
[A-Z] - an uppercase ASCII letter
\w* - 0 or more letters/digits/underscores
\s* - 0+ whitespaces
$ - end of string.
As you see, it allows trailing whitespace, and \w matches what [A-Za-z0-9_] matches, i.e. it matches both lower- and uppercase letters.
You want to only match lowercase letters after initial uppercase ones, also allowing - and _ chars. You may use
^[A-Z][a-z0-9_-]*(\s+[A-Z][a-z0-9_-]*)*$
Details
^ - start of string anchor
[A-Z][a-z0-9_-]* - an uppercase letter followed with 0+ lowercase letters, digits, _ or - chars
(\s+[A-Z][a-z0-9_-]*)* - zero or more occurrences of:
\s+ - 1 or more whitespaces
[A-Z][a-z0-9_-]* - an uppercase letter followed with 0+ lowercase letters, digits, _ or - chars
$ - end of string.
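The difference between the two patterns can be checked with a short Ruby sketch. Ruby is used here only to exercise the regexes, and Information Steward's match_regex may differ in details; the constant and helper names are assumptions, and \A and \z replace ^ and $ in the new pattern because Ruby anchors ^ and $ at line boundaries.

```ruby
# The rule's original pattern and the suggested replacement.
ORIGINAL_RULE = /(^(\b[A-Z]\w*\s*)+$)/
TITLE_CASE    = /\A[A-Z][a-z0-9_-]*(\s+[A-Z][a-z0-9_-]*)*\z/

# Hypothetical helper wrapping the stricter pattern.
def title_case?(name)
  TITLE_CASE.match?(name)
end

["Test Name", "TesT Name", "TEST NAME", "Test Name "].each do |name|
  puts "#{name.inspect}: original=#{ORIGINAL_RULE.match?(name)} fixed=#{title_case?(name)}"
end
# "Test Name": original=true fixed=true
# "TesT Name": original=true fixed=false
# "TEST NAME": original=true fixed=false
# "Test Name ": original=true fixed=false
```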
I would write your regex as:
^[A-Z]\w*(?:\s+[A-Z]\w*)*$
This matches one word starting with a capital letter, followed by zero or more repetitions of: one or more whitespace characters and another word starting with a capital letter.
I phrase a matching word as starting with [A-Z] followed by \w*, meaning zero or more word characters. This allows for things like A to match.
Edit:
Based on the comments above, if you want some other character class to represent what follows the initial uppercase letter, then do that instead:
^[A-Z][something]*(?:\s+[A-Z][something]*)*$
where [something] is your character class.
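One way to experiment with different choices for [something] is to build the regex from a character-class string, as in this Ruby sketch. The helper title_case_regex and its parameter are hypothetical, and \A and \z stand in for ^ and $ because Ruby treats those as line anchors.

```ruby
# Hypothetical helper: tail_class is regex source for the characters allowed
# after the initial capital, for example '\w' or '[a-z]'.
def title_case_regex(tail_class)
  /\A[A-Z]#{tail_class}*(?:\s+[A-Z]#{tail_class}*)*\z/
end

loose  = title_case_regex('\w')     # the pattern as first written
strict = title_case_regex('[a-z]')  # lowercase letters only after the capital

loose.match?("TEST NAME")   # => true, because \w also matches uppercase letters
strict.match?("TEST NAME")  # => false
strict.match?("Test Name")  # => true
strict.match?("A")          # => true
```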

Regexp at least 8 symbols and only one uppercase character

I need a regular expression for a string which has at least 8 symbols and only one uppercase character. The language is Java.
For example, it should match:
Asddffgf
asdAsadasd
asdasdaA
But not:
adadAasdasAsad
AsdaAadssadad
asdasdAsadasdA
I tried this: ^[a-z]*[A-Z][a-z]*$ This works well, but I need at least 8 symbols.
Then I tried this: (^[a-z]*[A-Z][a-z]*$){8,} But it doesn't work.
^(?=[^A-Z]*[A-Z][^A-Z]*$).{8,}$
https://regex101.com/r/zTrbyX/6
Explanation:
^ - Anchor to the beginning of the string, so that the following lookahead restriction doesn't skip anything.
(?= ) - Positive lookahead; assert that the beginning of the string is followed by the contained pattern.
[^A-Z]*[A-Z][^A-Z]*$ - A sequence of any number of characters that are not capital letters, then a single capital letter, then more non-capital letters until the end of the string. This ensures that there is exactly one capital letter in the string.
.{8,} - Any non-newline character eight or more times.
$ - Anchor at the end of the string (possibly unnecessary depending on your requirements).
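The question targets Java, but the pattern can be exercised the same way in any engine with lookahead. Below is a small Ruby sketch run against the question's examples; the constant name is an assumption, and \A and \z replace ^ and $ because Ruby treats those as line anchors.

```ruby
# Exactly one uppercase ASCII letter, at least 8 characters overall.
ONE_UPPER_MIN8 = /\A(?=[^A-Z]*[A-Z][^A-Z]*\z).{8,}\z/

%w[Asddffgf asdAsadasd asdasdaA].all? { |s| ONE_UPPER_MIN8.match?(s) }
# => true
%w[adadAasdasAsad AsdaAadssadad asdasdAsadasdA].none? { |s| ONE_UPPER_MIN8.match?(s) }
# => true
ONE_UPPER_MIN8.match?("Asdf")       # => false (too short)
ONE_UPPER_MIN8.match?("asd1Asdfg")  # => true (non-letters are allowed)
```

The last line shows a consequence of using [^A-Z] and the dot: digits and punctuation count toward the 8 characters.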
In your first regex ^[a-z]*[A-Z][a-z]*$ you could add a positive lookahead (?=[a-zA-Z]{8,}) right after the ^.
That asserts that at least 8 lowercase or uppercase ASCII letters follow.
^(?=[a-zA-Z]{8,})[a-z]*[A-Z][a-z]*$
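A short Ruby sketch of this variant follows. The constant name is an assumption, and \A and \z replace ^ and $ because Ruby treats those as line anchors. Unlike a pattern built on the dot, this one accepts ASCII letters only.

```ruby
# Only ASCII letters, exactly one of them uppercase, at least 8 in total.
LETTERS_ONE_UPPER = /\A(?=[a-zA-Z]{8,})[a-z]*[A-Z][a-z]*\z/

LETTERS_ONE_UPPER.match?("Asddffgf")        # => true
LETTERS_ONE_UPPER.match?("Asdf")            # => false (fewer than 8 letters)
LETTERS_ONE_UPPER.match?("adadAasdasAsad")  # => false (two uppercase letters)
LETTERS_ONE_UPPER.match?("asd1Asdfg")       # => false (digits are not allowed)
```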

How to find a string with specified length that has specified letter?

So, I know how to find a string with a specified length and how to find a string that contains a specified letter. But how can I find a string that matches both conditions? For example, I want to find a 4-letter string that contains the letter "g".
What I did:
\b[A-Za-z].[Gg][A-Za-z].\b
This regex matches any word that contains the letter "g". So now I need to limit the length, but when I try
\b([A-Za-z].[Gg][A-Za-z].){4}\b
it fails.
To match only ASCII-letter sequences with length of 4 containing a specific letter, you can use
\b(?=\w*[Gg])[a-zA-Z]{4}\b
The regex breakdown:
\b - a word boundary (we need the next letter to be a word character: [a-zA-Z0-9_], but we'll restrict it to [a-zA-Z] with the subsequent consuming pattern)
(?=\w*[Gg]) - a positive lookahead that makes sure there is at least one g or G in the word (\w* matches 0 or more word characters: letters, digits, or underscores)
[a-zA-Z]{4} - 4 ASCII letters
\b - trailing word boundary
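The pattern can be tried on a sample sentence with String#scan, as in this Ruby sketch. The constant, the helper name, and the sentence are assumptions made for the example.

```ruby
FOUR_WITH_G = /\b(?=\w*[Gg])[a-zA-Z]{4}\b/

# Hypothetical helper: collects every 4-letter word containing g or G.
def four_letter_g_words(text)
  text.scan(FOUR_WITH_G)
end

four_letter_g_words("frog jumps over a big gate and golden gift near tall Gulf")
# => ["frog", "gate", "gift", "Gulf"]
```

Words such as "big" and "golden" contain a g but have the wrong length, and "over" has the right length but no g, so none of them are returned.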
This was already answered here by @Alan Moore.
You just have to adapt it:
(?<!\S)(?=[a-zA-Z]{4}(?!\S))\S*[gG]\S*
(?<!\S) matches a position that is not preceded by a non-whitespace character.
(?=[a-zA-Z]{4}(?!\S)) further asserts that the position is followed by exactly 4 letters.
Once the lookarounds are satisfied, \S*[gG]\S* goes ahead and consumes the string, assuming at least one of the characters is g or G.
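A small Ruby sketch of this whitespace-delimited variant is shown below. The constant, the helper name, and the sample text are assumptions made for the example.

```ruby
FOUR_WITH_G_WS = /(?<!\S)(?=[a-zA-Z]{4}(?!\S))\S*[gG]\S*/

# Hypothetical helper: collects whitespace-delimited tokens that are exactly
# 4 ASCII letters long and contain g or G.
def four_letter_g_tokens(text)
  text.scan(FOUR_WITH_G_WS)
end

four_letter_g_tokens("frog gate, golden gift e-gg")
# => ["frog", "gift"]
```

Because the boundaries are whitespace rather than \b, a token with attached punctuation such as "gate," is rejected, which a word-boundary version would accept.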