How to create a short regular expression [closed] - regex

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
How can I create a short regular expression that only matches words in which no character is immediately followed by the same character?
Only the following syntax elements are allowed:
. * + ? | ()
The alphabet is {a, b}.
Example:
Is matching: abababab
Not matching: abbab
Thank you :)

Well, your exercise is not very clear (which regex engine are you using? etc),
but I managed to do something:
(?<=^|\P{L})(?:(\p{L})(?!\1))+(?=\P{L}|$)
https://regex101.com/r/R2t2ik/1
Explanation
We are looking for a letter from any language, not just [a-z] and not
just the \w word character, because letters such as àéêï would typically
not match those. So instead, use \p{L}, which selects the Unicode
"letter" category.
More details here:
https://www.regular-expressions.info/unicode.html#category
We will capture this char with a capturing group: (\p{L})
This creates group number 1. Group 0 is the match of the entire regular
expression. Each capturing group, counted from left to right, gets the
next number. In our case we can then refer to the captured group with
the \1 backreference.
To check that two consecutive characters are not identical, we will use a
negative lookahead, meaning that the match fails at that position if the
lookahead pattern succeeds.
The regex becomes: (\p{L})(?!\1)
This means: "Find a letter of any language that is not followed by itself."
Now, a word is made of one or more characters, so it could be matched with
\w+ but, as explained before, this would not cover letters of every
language. So for any language it becomes \p{L}+, or (\p{L})+ with a group
around the letter. The group matters in the next step, where the + has to
repeat a longer sub-pattern rather than a single token.
Okay, that's good, but it's not exactly what we want. We only want to find
characters that are not followed by themselves. So we have to use the
pattern built earlier with the negative lookahead.
This becomes: (?:(\p{L})(?!\1))+
You may ask why we have this (?: and ) around all of it.
Well, we could simply use ( and )+ but that would create a new capturing
group, which we don't need. To create a non-capturing group, add ?: right
after the opening parenthesis.
Capturing group = (abc) vs non-capturing group = (?:abc)
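The core of the pattern built so far can be tried out in Python. This is only a sketch under one assumption: Python's built-in re module has no \p{L}, so the ASCII class [a-z] stands in for it here. The name has_no_repeats is made up for illustration.

```python
import re

# ASCII stand-in for \p{L}: the built-in re module has no Unicode property classes.
# Group 1 captures one letter; (?!\1) rejects it when the same letter follows.
NO_REPEAT = re.compile(r"(?:([a-z])(?!\1))+")

def has_no_repeats(word: str) -> bool:
    """Return True if no letter in `word` is immediately followed by itself."""
    return NO_REPEAT.fullmatch(word) is not None

print(has_no_repeats("abababab"))  # True
print(has_no_repeats("abbab"))     # False
```

Using fullmatch takes the place of the word-boundary lookarounds, because the whole input is treated as one word.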
To finish, we want to capture word beginnings and ends with the help of
a positive lookbehind and a positive lookahead. I started with the usual
\b for word boundary but it did not work. Don't ask me why. I expect
that it's related to the use of the Unicode classes or perhaps the way the
selector is written. Someone may find an explanation, I'm not a specialist.
Well, I solved that by matching either the beginning of the string
with the ^ anchor or, with the \P{L} Unicode class, a char that is
not a letter. I did the same for the end by using the
$ anchor.
So at the beginning, I added a positive lookbehind meaning "start with or
has a non-letter char before" done with this (?<=^|\P{L}) rule.
And at the end, I added a positive lookahead meaning "finish with or has
a non-letter char after" done with this (?=\P{L}|$) rule.
Putting everything together:
(?<=^|\P{L}) + (?:(\p{L})(?!\1))+ +
(?=\P{L}|$) results in:
(?<=^|\P{L})(?:(\p{L})(?!\1))+(?=\P{L}|$)
I hope it's what you were looking for and that it's not too complicated to
understand.
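The pattern above relies on lookarounds, backreferences and Unicode classes, none of which appear in the list of syntax elements the question allows. As a sketch of what a solution restricted to . * + ? | () over the alphabet {a, b} might look like (this pattern is not from the answer above, and all names are illustrative): an optional b, any number of ab pairs, then an optional a. The loop checks it against a brute-force comparison for every word up to length 8.

```python
import re
from itertools import product

# Uses only ?, * and () from the allowed syntax; the alphabet is {a, b}.
ALTERNATING = re.compile(r"b?(ab)*a?")

def is_alternating(word: str) -> bool:
    """True if the whole word matches, i.e. no two neighbours are equal."""
    return ALTERNATING.fullmatch(word) is not None

def brute_force(word: str) -> bool:
    """Reference check: compare every character with its successor."""
    return all(x != y for x, y in zip(word, word[1:]))

# Compare the regex with the reference on every word of length 0 to 8.
for n in range(9):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert is_alternating(word) == brute_force(word)

print(is_alternating("abababab"), is_alternating("abbab"))  # True False
```

Note that this pattern also accepts the empty word. Whether that is wanted depends on how the exercise defines a word.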

Related

Find and replace a Regex pattern occurring more than once [duplicate]

This question already has answers here:
How can I match overlapping strings with regex?
(6 answers)
Matching when an arbitrary pattern appears multiple times
(1 answer)
Closed 2 years ago.
I'm trying to find-and-replace instances where consecutive commas appear throughout a string; replacing them w/ something like ",N/A,". I was using a very simple /,,/g pattern, and that works on things like ",,abc" and ",,,,abc" (with even numbers of commas). However, it doesn't catch things like ",,,abc". That's because the first two commas are considered a match, and then the third comma is just considered part of a new ",abc" string. Is there a way to handle this w/ a RegEx pattern or options? Otherwise, I'm going to need to perform multiple searches.
FWIW - I'm working in JavaScript, but I'm guessing this is just a general RegEx question/answer.
The reason why /,,/g only matches once with three commas is because the global match restarts after the position of the final consumed characters. You need a way to match the pattern of ,, without consuming those characters for pattern matching purposes.
If your language supports it, use a positive lookahead. A positive lookahead lets a regex check for some additional characters without consuming them in the match.
/,(?=,)/g
In English, this means:
, # match a comma, then
(?= #start a group that must exist, and if so, isn't consumed by the pattern,
, # a comma
)
See more about this here: https://www.regular-expressions.info/lookaround.html
Javascript supports positive lookahead. :)
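For illustration, here is the same lookahead applied in Python rather than JavaScript (the pattern itself is unchanged). The function name fill_empty_fields and the placeholder default are assumptions made for this sketch.

```python
import re

def fill_empty_fields(line: str, placeholder: str = "N/A") -> str:
    """Insert `placeholder` between every pair of adjacent commas."""
    # ,(?=,) matches a comma only when another comma follows. The second
    # comma is not consumed, so it can start the next match.
    return re.sub(r",(?=,)", "," + placeholder, line)

print(fill_empty_fields(",,abc"))   # ,N/A,abc
print(fill_empty_fields(",,,abc"))  # ,N/A,N/A,abc
```

Because only the first comma of each pair is replaced, runs with an odd number of commas are handled in a single pass.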

How do I create a regular expression that doesn't match for certain variations of a string? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I wish to build a regular expression that does not match the first example below but does match the next two lines.
Distribute( SystemEvent, device, NULL)
Distribute(123)
Distribute( 123)
In words I want to match Distribute, followed by (, followed by an optional space, followed by anything that does not start with a capital S.
The expression below matches the first line though I thought the [^S] would stop that.
Distribute\( ?[^S]s
Distribute\( ?[^ S]
The first line is matched because ? can match with length zero, and then the space is matched by [^S]. So add a space to the negated character class.
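The backtracking described here can be observed directly. The sketch below assumes the trailing s in the question's pattern is a typo and compares Distribute\( ?[^S] with the corrected class. The helper name matching is made up.

```python
import re

lines = [
    "Distribute( SystemEvent, device, NULL)",
    "Distribute(123)",
    "Distribute( 123)",
]

# Assumed form of the question's pattern (without the stray trailing "s").
original = re.compile(r"Distribute\( ?[^S]")  # a space can satisfy [^S]
fixed = re.compile(r"Distribute\( ?[^ S]")    # space excluded from the class

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [s for s in candidates if pattern.search(s)]

print(matching(original, lines))  # all three lines
print(matching(fixed, lines))     # only the two lines with 123
```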
One way to fix it is this (assuming it is PCRE):
Distribute\( ?+[^S]
Demo
By adding a + after ? you make it possessive (it matches as much as it can and never gives anything back).
In your regexp, since the ? was not possessive, it backtracked to match 0 times and then [^S] matched a non-'S' character (the space).
I'm not exactly sure of your intentions, but if you are relying on human input, you should probably allow for more than one space:
Distribute\( *+[^S]
Try this
Distribute\( ?(?=\S)(?=[^S])[^)]+\)
This matches Distribute( followed by an optional space. It then uses a positive lookahead to check that the next character is not a whitespace character, and a second one to check that this character is also not an S. Finally, it matches one or more characters that are not ), followed by a ).
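A quick way to check the double-lookahead pattern against the three lines from the question, sketched in Python. The helper first_match is illustrative only.

```python
import re

# (?=\S) blocks the "optional space matched nothing" escape route,
# and (?=[^S]) then rejects arguments that start with a capital S.
pattern = re.compile(r"Distribute\( ?(?=\S)(?=[^S])[^)]+\)")

def first_match(text: str):
    """Return the matched call text, or None when the pattern is not found."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match("Distribute( SystemEvent, device, NULL)"))  # None
print(first_match("Distribute(123)"))                         # Distribute(123)
print(first_match("Distribute( 123)"))                        # Distribute( 123)
```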

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a url that can be in two states EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match alphanumerical characters \w only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - in addition to the underscore _ (which \w already includes), you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
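To see the difference between the greedy .* and the negated character class, the two patterns can be run against both example URLs. This is a Python sketch, where / needs no escaping. The helper extract is made up for illustration.

```python
import re

urls = [
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA",
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA",
]

greedy = re.compile(r"(?<=track/)(.*)(?=\??)")      # .* runs past the "?"
negated = re.compile(r"(?<=track/)([^?]*)(?=\??)")  # stops before any "?"

def extract(pattern, url):
    """Return group 1 of the first match, or None."""
    m = pattern.search(url)
    return m.group(1) if m else None

for url in urls:
    print(extract(greedy, url), "|", extract(negated, url))
```

With the greedy version the first URL yields the ID plus the query string, while the negated class returns just the ID for both.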
As pointed out by gaganso, a look-behind is not necessary in this situation (or indeed the lookahead), however it is indeed a good idea to start playing around with them. The look-around assertions do not actually consume the characters in the string. As you can see here, the full match for both matches only consists of what is captured by the capture group. You may find more information here.
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
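The same idea without any lookarounds, sketched in Python. The function name track_id is an assumption for this example. The ID is read from capture group 1 rather than from the full match.

```python
import re

TRACK_ID = re.compile(r"track/(\w+)")

def track_id(url: str):
    """Return the ID after "track/", or None if the URL has no track part."""
    m = TRACK_ID.search(url)
    # group(0) would include the literal "track/"; group(1) is just the ID.
    return m.group(1) if m else None

print(track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"))
print(track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA"))
```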
Regex : (\w+(?=\?))|(\w+$)
See the demo for the regex, https://regexr.com/3s4gv .
This will first try to search for a word that has '?' just after it and, if that's unsuccessful, it will fetch the last word.

?: Notation in Regular Expression [duplicate]

This question already has answers here:
What is a non-capturing group in regular expressions?
(18 answers)
Closed 6 years ago.
for one of my classes I have to describe the following regular expression:
\b4[0-9]{12}(?:[0-9]{3})\b
I understand that it selects a number that: begins with 4, is followed by 12 digits (each between 0-9), and is followed by another 3 digits.
What I don't understand is the question mark with the colon (?:....). I've tried looking online to find out what this means but the links I've found were somewhat confusing; I was hoping someone could give me a quick basic idea of what the question mark does in this example.
This is going to be a short answer.
When you use (?:) it means that the group is matched but is not captured for back-referencing, i.e. it is a non-capturing group. It's not stored in memory to be referenced later on.
For example:
(34)5\1
This regex means that you are looking for 34 followed by 5 and then again 34. Of course you could write it as 34534, but sometimes the captured group is a complex pattern whose match you cannot predict beforehand.
So whatever is matched by the capturing group must appear again.
Regex101 demo for back-referencing
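A small Python sketch of the back-reference behaviour. The second pattern, (\d\d)5\1, is a hypothetical variation added here to show the case where the group's content is not known in advance.

```python
import re

literal = re.compile(r"(34)5\1")
print(bool(literal.fullmatch("34534")))  # True
print(bool(literal.fullmatch("34535")))  # False

# With a pattern inside the group, \1 must repeat whatever was captured.
flexible = re.compile(r"(\d\d)5\1")
print(bool(flexible.fullmatch("12512")))  # True
print(bool(flexible.fullmatch("12534")))  # False
```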
Back-referencing is also used in replacements.
For Example:
([A-Z]+)[0-9]+
This regex will look for one or more upper case letters followed by one or more digits. Say I wish to replace this whole pattern with just the upper case letters that were found.
Then I would replace the whole pattern with \1, which is a back-reference to the first captured group.
Regex101 demo for replacement
If you change to (?:[A-Z]+)[0-9]+ this will no longer capture it and hence cannot be referenced back.
Regex101 demo for non-capturing group
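The replacement case can be sketched in Python as follows. The sample strings are made up. In Python's re module, referring to \1 when the pattern has no capturing group raises re.error.

```python
import re

capturing = re.compile(r"([A-Z]+)[0-9]+")
non_capturing = re.compile(r"(?:[A-Z]+)[0-9]+")

# \1 in the replacement stands for whatever group 1 captured.
print(capturing.sub(r"\1", "ABC123 and XY9"))  # ABC and XY

# .groups is the number of capturing groups in each pattern.
print(capturing.groups, non_capturing.groups)  # 1 0

try:
    non_capturing.sub(r"\1", "ABC123")
except re.error as exc:
    # There is no group 1 to refer to, so the replacement template is rejected.
    print("cannot reference group 1:", exc)
```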
A live answer.
It's called a 'non-capturing group', which means the regex does not create a group from the match inside the parentheses as it otherwise would (normally, parentheses create a group).

Here a word is a string of letters, preceded and followed by nonletters

I asked this question earlier but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line. Here a word is a string of letters, preceded and followed by nonletters.
I tried every single example that was given below, but I get syntax errors when using them.
Does anyone know how I can solve this??
Thanks
If your regex flavor supports it, you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non-letter character (any character except a letter)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
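The lookaround and the capturing-group variants are not fully interchangeable when every word in a text is wanted. In the group version the delimiter is consumed by the match, so a word separated from the previous one by a single non-letter can be skipped. A Python sketch with a made-up sample string:

```python
import re

text = "a b c"

lookaround = re.compile(r"(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])")
grouped = re.compile(r"(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])")

# Lookarounds only inspect the neighbours, so every word is found.
print(lookaround.findall(text))  # ['a', 'b', 'c']

# The space after "a" is consumed by the first match, so "b" has no
# non-letter left in front of it and is missed.
print(grouped.findall(text))     # ['a', 'c']
```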
In order to be unicode compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, like whitespaces, certain punctuations or terminal positions. Your requirement suggests you want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look for what precedes or follows the letters. This already captures all letters and only letters. The greedy quantifier (+) ensures that nothing is left out of a match.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
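This claim can be spot-checked in Python by comparing the plain pattern with a version guarded by negative lookarounds. The sample strings below are arbitrary.

```python
import re

plain = re.compile(r"[A-Za-z]+")
guarded = re.compile(r"(?<![A-Za-z])[A-Za-z]+(?![A-Za-z])")

samples = ["1example2", "foo,bar baz", "", "42", "x"]
for s in samples:
    # Greedy matching already stops at the first non-letter on each side.
    assert plain.findall(s) == guarded.findall(s)

print(plain.findall("1example2 foo,bar"))  # ['example', 'foo', 'bar']
```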
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at terminals (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Test here. Note that this uses positive lookarounds around negated character classes, not negative lookarounds, as the latter would also match at the string terminals (which are, to the regex engine as much as to me, non-letters).
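The difference at the string terminals shows up in a short Python comparison. The sample text is made up for the sketch.

```python
import re

text = "start middle1end"

negative = re.compile(r"(?<![A-Za-z])[A-Za-z]+(?![A-Za-z])")
positive = re.compile(r"(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])")

# Negative lookarounds are satisfied at the edges of the string.
print(negative.findall(text))  # ['start', 'middle', 'end']

# Positive lookarounds need an actual non-letter character on both sides,
# so the words touching the start and the end of the string are excluded.
print(positive.findall(text))  # ['middle']
```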
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
When talking about regex, you need to be extremely specific and accurate in your requirements.