How to only match a single instance of a character?


Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).

Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
\K discards the previously matched characters from the reported match, and the lookahead asserts whether a match is possible at that point. So the above matches only the letter a that satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a

You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
Details
(?<!a) - a negative lookbehind that fails the match if there is an a char immediately to the left of the current location
a - an a char
(?!a) - a negative lookahead that fails the match if there is an a char immediately to the right of the current location.
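
As an illustration of the lookaround pattern above, here is a minimal sketch using Python's re module. The choice of Python and the helper name single_a_positions are assumptions made for the example; the question does not name a language.

```python
import re

# Pattern from the answer: an "a" with no "a" directly before or after it.
SINGLE_A = re.compile(r"(?<!a)a(?!a)")

def single_a_positions(text):
    """Return the start index of every isolated "a" in text (hypothetical helper)."""
    return [m.start() for m in SINGLE_A.finditer(text)]

for sample in ["a", "fooaxyz", "aa", "aaa", "fooaaxyz"]:
    print(repr(sample), single_a_positions(sample))
```

The first two samples each yield one position; the runs of two or more a characters yield none.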

You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace, or remove it in whatever way suits the language you use.
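
A minimal sketch of this whole-string check, assuming Python's re module (the function name has_exactly_one_a is made up for the example):

```python
import re

# Anchored pattern from the answer: non-"a" characters, one "a", non-"a" characters.
EXACTLY_ONE_A = re.compile(r"^[^a]*a[^a]*$")

def has_exactly_one_a(text):
    """True when text contains exactly one "a" (hypothetical helper)."""
    return EXACTLY_ONE_A.match(text) is not None

for sample in ["a", "fooaxyz", "aa", "fooaaxyz", "xyz", "axa"]:
    print(repr(sample), has_exactly_one_a(sample))
```

Note that this validates the string as a whole, so a string such as "axa" is rejected here even though each of its a characters stands alone.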

Related

How can I use a negative lookahead in an anchored regular-expression pattern?

My web-application allows users to specify custom URI path components which comply with the following restrictions:
All characters must be lowercase.
Be at least 2 characters long.
First character must match [a-z].
The last character must match [0-9a-z].
All other characters must match [0-9a-z_\-].
The - and _ characters must not exist as a consecutive run of 2 or more.
i.e. The string must not contain --, __, _-, or -_.
I've implemented the first 5 rules in a regular-expression easily enough:
^[a-z][0-9_a-z\-]*[0-9a-z]$
...however I don't know how to implement the last rule in a single regex.
I thought I'd start by just trying to change the regex so it won't match -- (as in a--b) - and I was thinking it could be a negative-lookahead, as it's asserting that that regex does not contain -- (right?):
Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions just like the start and end of line, and start and end of word anchors. [...] The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not
But adding (?!\-\-) to the regular expression (on regex101.com) in various spots, or as a lookbehind (?<!\-\-), does not stop strings like a--b from matching.
i.e. all of these patterns match foo--bar when they shouldn't.
(?!\-\-)^[a-z][0-9_a-z\-]*[0-9a-z]$
^(?!\-\-)[a-z][0-9_a-z\-]*[0-9a-z]$
^[a-z](?!\-\-)[0-9_a-z\-]*[0-9a-z]$
^[a-z](?!\-\-)(?:[0-9_a-z\-]*)[0-9a-z]$
^[a-z][0-9_a-z\-]*(?!\-\-)[0-9a-z]$
^[a-z][0-9_a-z\-]*(?<!\-\-)[0-9a-z]$

You can place the negative lookahead right after matching a-z at the start of the string.
As you don't want to match any combination of - and _ you can use 2 character classes: (?!.*[_-][_-])
As the [_-][_-] part can occur anywhere in the string, you can precede it with .* which matches any characters, zero or more times.
If you omit .* the assertion only checks the current position, which in this case would be right after matching the a-z at the start of the string.
^[a-z](?!.*[_-][_-])[0-9_a-z-]*[0-9a-z]$
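
To see the pattern against a few inputs, here is a minimal sketch assuming Python's re module. The name is_valid_path_component and the sample strings other than foo--bar are invented for the example.

```python
import re

# Pattern from the answer: the lookahead after the first letter rejects any
# two consecutive characters drawn from "_" and "-" anywhere later in the string.
COMPONENT = re.compile(r"^[a-z](?!.*[_-][_-])[0-9_a-z-]*[0-9a-z]$")

def is_valid_path_component(text):
    """True when text satisfies all six rules (hypothetical helper)."""
    # fullmatch is used so that a trailing newline is not silently accepted.
    return COMPONENT.fullmatch(text) is not None

for sample in ["foo-bar", "a_b-c", "ab", "foo--bar", "foo__bar", "a_-b", "a-", "a"]:
    print(repr(sample), is_valid_path_component(sample))
```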

Why the character ^ is required in an regex ^(?!.*?spam) to filter strings?

I try to filter strings, that don't contain word "spam".
I use the regex from here!
But I can't understand why I need the symbol ^ at the start of the expression. I know that it marks the start of the string, but I do not understand why the regex doesn't work without ^ in my case.
UPD. All the answers below are very useful.
It's completely clear now. Thank you!

The regex (?!.*?spam) matches a position in a string that is not followed by something matching .*?spam.
Every single string has such a position, because if nothing else, the very end of the string is certainly not followed by anything matching .*?spam.
So every single string contains a match for the regex (?!.*?spam).
The anchor ^ in ^(?!.*?spam) restricts the regex, so that it only matches strings where the very beginning of the string isn't followed by anything matching .*?spam — i.e., strings that don't contain spam at all (or anywhere in the first line, at least, depending on whether . matches newlines).
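
The difference can be checked with a minimal sketch, assuming Python's re module (the sample strings are made up for the example):

```python
import re

unanchored = re.compile(r"(?!.*?spam)")
anchored = re.compile(r"^(?!.*?spam)")

with_spam = "buy spam now"
without_spam = "hello world"

# Without the anchor, the engine simply moves on to the first position
# that is no longer followed by "spam" (just after the "s" of "spam").
print(unanchored.search(with_spam).start())  # 5
print(unanchored.search(without_spam).start())  # 0

# With the anchor, only position 0 is tried, so a string containing
# "spam" produces no match at all.
print(anchored.search(with_spam))  # None
print(anchored.search(without_spam) is not None)  # True
```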

The lookahead is a zero-width assertion (that is, it checks a condition at a position in your string without consuming any characters). In your case it is a negative lookahead making sure that "zero or more characters, followed by the word spam" do not follow. This is true for several positions in your string, see a demo on regex101.com without the anchor.
With the anchor, matching can only start at the very beginning, so the whole string is checked, see the altered demo on regex101.com as well.

Regex pattern to start with first digit and end at last hyphen (if it exists)

I have a regex pattern that almost works, but I can't quite get it totally correct. My goal is that if a string starts with letters, to ignore them up to the first digit. The second part of the pattern needs to make the match stop at the last hyphen in the string, if one exists. Here are some examples of strings that I would be working on:
PCKG6JUB-0330M3-0-812 wanting returned 6JUB-0330M3-0
CCP352878 wanting returned 352878
0972543107 wanting returned 0972543107
This is the pattern that I have so far: \d[\S]*- The problem is that on the top example, it includes the last hyphen in the match, so I get 6JUB-0330M3-0-. Also, if no hyphen exists, then nothing is returned.
I'm using the VBScript engine.

Use this:
\d(?:\S*(?=-)|\S*)
First, I used a positive lookahead, (?=...), so we don't actually match the last hyphen. Then, I used alternation, |, to check for a match with a hyphen or without a hyphen. So that we don't need to match the digit on both sides of the alternation, I put this part in a non-capturing group, (?:...). Finally, \S is shorthand for a character class and doesn't need to be in brackets.
One would think we'd just be able to make the hyphen optional (i.e. \d\S*(?=-?)), but that doesn't work. This is because our \S* match is greedy (and it needs to be, since you want to match up until the last hyphen) and will just blow right past the hyphen.
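
The question uses the VBScript engine; the following is only a sketch in Python's re module, on the assumption that \d, \S, lookahead, and alternation behave the same way there. The helper name extract_code is invented for the example.

```python
import re

# Pattern from the answer: first digit, then either "up to the last hyphen"
# (hyphen checked by lookahead, not consumed) or "everything that remains".
PATTERN = re.compile(r"\d(?:\S*(?=-)|\S*)")

# The "optional hyphen" variant discussed above, for comparison.
OPTIONAL_HYPHEN = re.compile(r"\d\S*(?=-?)")

def extract_code(text):
    """Return the matched part of text, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

for sample in ["PCKG6JUB-0330M3-0-812", "CCP352878", "0972543107"]:
    print(sample, "->", extract_code(sample))

# The greedy \S* runs to the end and the optional lookahead never forces it back.
print(OPTIONAL_HYPHEN.search("PCKG6JUB-0330M3-0-812").group(0))
```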

Regular expression to match non-integer values in a string

I want to match the following rules:
One dash is allowed at the start of a number.
Only values between 0 and 9 should be allowed.
I currently have the following regex pattern. I'm matching the inverse so that I can throw an exception upon finding a match that doesn't follow the rules:
[^-0-9]
The downside to this pattern is that it works for all cases except a hyphen in the middle of the String will still pass. For example:
"-2304923" is allowed correctly but "9234-342" is also allowed and shouldn't be.
Please let me know what I can do to specify the first character as [^-0-9] and the rest as [^0-9]. Thanks!

This regex will work for you:
^-?\d+$
Explanation: the start of the string (^), then an optional - (the ? makes it optional), then a digit \d repeated one or more times (+), and the string must end here ($).
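
A minimal sketch of this positive check, assuming Python's re module (the question does not name its language, and is_integer_string is a made-up helper name). In Python 3, \d also accepts non-ASCII digits unless re.ASCII is passed, so the flag is added here to keep to 0 through 9.

```python
import re

# Pattern from the answer; re.ASCII restricts \d to the characters 0-9.
INTEGER = re.compile(r"^-?\d+$", re.ASCII)

def is_integer_string(text):
    """True for an optional leading dash followed only by digits (hypothetical helper)."""
    return INTEGER.match(text) is not None

for sample in ["-2304923", "9234-342", "42", "-", ""]:
    print(repr(sample), is_integer_string(sample))
```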

You can do this:
(?:^|\s)(-?\d+)(?:["'\s]|$)
(?:^|\s) - non-capturing group for start of line or space
(-?\d+) - capture the number
(?:["'\s]|$) - non-capturing group for end of line, space or quote
This will capture all strings of numbers in a line with an optional hyphen in front.
-2304923" "9234-342" 1234 -1234
-2304923 - captured
"9234-342" - NOT captured
1234 - captured
-1234 - captured

I don't understand how your pattern - [^-0-9] is matching those strings you are talking about. That pattern is just the opposite of what you want. You have simply negated the character class by using caret(^) at the beginning. So, this pattern would match anything except the hyphen and the digits.
Anyway, for your requirement, you first need to match an optional hyphen at the beginning. So, keep it outside the character class and follow it with ?. Then, to match the digits that come after it, you can use [0-9]+ or \d+.
So, your pattern to match the required format should be:
-?[0-9]+ // or -?\d+
The above regex finds the pattern inside a larger string. If you want the entire string to match this pattern, add anchors at both ends of the regex:
^-?[0-9]+$

For a regular expression like this, it's sometimes helpful to think of it in terms of two cases.
Is the first character messed up somehow?
If not, are any of the other characters messed up somehow?
Combine these with |
(^[^-0-9]|^.+?[^0-9])
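
This inverse pattern matches strings that break the rules, which fits the "throw an exception on a match" approach from the question. A minimal sketch, assuming Python's re module and a made-up helper name:

```python
import re

# Pattern from the answer: either the first character is neither "-" nor a digit,
# or some later character is not a digit.
INVALID = re.compile(r"(^[^-0-9]|^.+?[^0-9])")

def is_invalid(text):
    """True when text breaks one of the two rules (hypothetical helper)."""
    return INVALID.search(text) is not None

for sample in ["-2304923", "123", "9234-342", "x123", "12a"]:
    print(repr(sample), is_invalid(sample))
```

A lone "-" or an empty string is not flagged by this pattern, so those cases would need a separate check if they matter.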

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!

In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which don't consume the characters they check, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don't need the surrounding \S at all, as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
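
The three patterns can be compared on the query string from the question. This is a minimal sketch assuming Python's re module; the question does not say which engine it uses.

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Original pattern: the character after the plus is consumed, so the "e"
# between the last two pluses cannot start another match.
consuming = re.findall(r"\S\++\S", query)
print(consuming)  # ['s+n', 'e+c', 's+e']

# Lookahead version: the following character is checked but not consumed.
with_lookahead = re.findall(r"\S\++(?=\S)", query)
print(with_lookahead)  # ['s+', 'e+', 's+', 'e+']

# Plain version: just the runs of pluses.
plain = re.findall(r"\++", query)
print(plain)  # ['+', '+', '+', '+']
```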

You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it; JavaScript did not support lookbehind before ES2018, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
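
The start-of-string difference between the two patterns can be seen in a minimal sketch, assuming Python's re module (the sample strings are made up for the example):

```python
import re

LOOKBEHIND = re.compile(r"(?<!\s)\++(?!\s)")
NO_LOOKBEHIND = re.compile(r"\S\++(?!\s)")

leading_plus = "+a+b"

# The lookbehind version accepts the plus at position 0, because
# "no whitespace before" is also true at the start of the string.
print(LOOKBEHIND.findall(leading_plus))  # ['+', '+']

# The fallback needs a real character before the plus, so the leading
# plus is skipped and the match includes the preceding "a".
print(NO_LOOKBEHIND.findall(leading_plus))  # ['a+']

# A plus surrounded by spaces is rejected by both.
print(LOOKBEHIND.findall("a + b"), NO_LOOKBEHIND.findall("a + b"))  # [] []
```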

You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
