Matching parts of string that contain no consecutive dashes - regex

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?

This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.

Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.

Related

match any number of consecutive words following a backslash

I'm trying to match a TeX command, i.e. a backslash followed by a word (in a desired list) using regex, but with any number of them. For example, if the list I want is test, other, list, then the sequences \test, \other, and \list should be matched, while \sdfsdf should not. I also would want \test\list and \test\other\list to be matched. However, I don't want to match things like \testagain (although \test again should be). I tried the following regex
(\\)(test|other|list)([^a-z])
to no avail, since it does not match \test\other. How would I do this? I am not very experienced with regex.
Use \b to match the word boundary at the end of the word.
\\(test|other|list)\b

Regex check for name Initials

I am trying to create a regex that checks if one or more middle-name initials have the following stucture:
INITIAL.[BLANK]INITIAL.[BLANK]INITIAL.
There can be multiple Initials as long as they are followed by a dot (.) - blank spaces are only allowed between two initials (e.g. L. B.)
It should not be possible to have a space after an initial if there's no other initial following.
At the moment, I have the following Regex which doesn't work perfectly as of now:
([A-Z]\. (?=[A-Z]|$))+
Using regex101, this is an example:
As you can see, it still matches the string even though there's a blank space at the end, without having another Initial following.
I am not sure why this is happening. I am just learning regex and would be glad if anyone could provide me with a solution to my problem :)
The error you're seeing is because at the last step, your expression reads in [A-Z]\. looks ahead for $ (and finds it). I would express the pattern this way: (?:[A-Z]\. )*[A-Z]\.$. Treat the last initial specially because it does not have a final space.
The pattern you tried ([A-Z]\. (?=[A-Z]|$))+ uses a repeated capturing group which will give you the value of the last iteration.
In that repetition you match a space <code>[A-Z]\. </code> effectively meaning that it should be present in the match.
You could repeat 0+ times matching a char [A-Z] followed by a space to match multiple occurrences.
Then match a char [A-Z] asserting what is on the right is not a non whitespace char.
\b(?:[A-Z]\. )*[A-Z]\.(?!\S)
Regex demo
If there can be multiple spaces but it should not match a newline:
\b(?:[A-Z]\.[^\S\r\n]*)*[A-Z]\.(?!\S)
Regex demo

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result to That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature that removes from the match everything that has been captured from the regex up to that point, and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that, the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text on one language and not on another language, you can modify it accordingly, by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The following pattern matches for each pair of two words (note that it won't match the last word since it does not have a pair). The first word is in the first capture group, (\w+). Then a positive lookahead includes a match for a sequence of non word characters \W+ and then another string of word characters \w+. The lookahead (?=...) the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1, group 2)
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that is a problem of an overlapping regex pattern.

Regex to match everything except this regex

I think this is a simple thing for a lot of you, but I have a very limited knowlegde of regex at the moment. I want to match everything except a double digit number in a string.
For example:
TEST22KLO4567
QE45C2C
LOP10G7G400
Now I found out the regex to match the double digit numbers:
\d{2}
Which matches the following:
TEST22KLO4567
QE45C2C
LOP10G7G400
Now it seems to me that it would be fairly easy to turn that regex around to match everything BUT "\d{2}". I searched a lot but I can't seem to get it done. I hope someone here can help.
This only works if your regex engine supports look behinds:
^.+?(?=\d{2})|(?<=\d{2}).+$
Explanation:
The | separates two cases where this would match:
^.+?(?=\d{2})
This matches everything from the start of the string (^) until \d{2} is encountered.
(?<=\d{2}).+$
This matches the end of the string, from the place just after two digits.
If your regex engine doesn't support look behinds (JavaScript for example), I don't think it is possible using a pure regex solution.
You can match the first part:
^.+?(?=\d{2})
Then get where the match ends, add 2 to that number, and get the substring from that index.
You are right rejecting a search in regex is usually rather tricky.
In your case I think you want to have [^\d{2}], however, this is tricky as your other strings also contain two digits so your regex using it won't select them.
I would go with this regex (using PCRE 8.36 but should work also in others):
\*{2}\w*\*{2}
Explanation:
\*{2} .... matches "*" literally exactly two times
\w* .... matches "word character" zero or unlimited times
Found one regex pretty straightforward :
^(.*?[^\d])\d{2}([^\d].*?)$
Explanations :
^ : matches the beginnning of a line
(.*?[^\d]) : matches and catches the first part before the two numbers. It can contain anything (.*?) but needs to end with something different to a number ([^\d]) so we ensure that there is only 2 numbers in the middle
\d{2} : is the part you found yourself
([^\d].*?) : is the symetric of (.*?[^\d]) : begins with something different from a number ([^\d]) and matches anything next.
$ : up to the end of the line.
To test this reges you can use this link
It will match the first occurence of double digit, but because OP said there was only one it does the job correctly. I expect it to work with every regex engine as nothing too complex is used.

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$