Negative lookbehind and square brackets - regex

I'd like to create a regex that matches unmatched right square brackets. Examples:
]ichael ==> match ]
[my name is Michael] ==> no match
No nested pairs of square brackets occur in my text.
I tried to use negative lookbehind for that, more specifically I use this regex: (?<!\[(.)+)\] but it doesn't seem to do the trick.
Any suggestions?

In most regex flavors (.NET is a notable exception), lookbehinds have to be of fixed or at least bounded length. Since you just want to detect whether there are any unmatched closing brackets, you don't actually need a lookbehind though:
^[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*\]
If this matches, you have an unmatched closing bracket.
It's a bit easier to understand if you realise that [^\[\]] is a negated character class that matches anything but square brackets, and if you lay it out in free-spacing mode:
^ # start from the beginning of the string
[^\[\]]* # match non-bracket characters
(?: # this group matches matched brackets and what follows them
\[ # match [
[^\[\]]* # match non-bracket characters
\] # match ]
[^\[\]]* # match non-bracket characters
)* # repeat 0 or more times
\] # match ]
So this tries to find a ] after matching 0 or more well-matched pairs of brackets.
Note that the part between ^ and ] is functionally equivalent to Tim Pietzker's solution (which is a bit easier to understand conceptually, I think). What I have done is apply an optimization technique called "unrolling the loop". If your flavor provides possessive quantifiers, you can turn all * into *+ to increase efficiency even further.
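As a quick check, here is a minimal Python sketch of the pattern above using the standard re module. The helper name has_unmatched_closer is made up for illustration, and the possessive variant is left out so that the sketch also runs on older Python versions.

```python
import re

# Pattern from the answer above: skip well-matched pairs, then require a stray ].
UNMATCHED_CLOSER = re.compile(r'^[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*\]')

def has_unmatched_closer(text):
    """Return True if text contains a ] that no [ opens (no nesting assumed)."""
    return UNMATCHED_CLOSER.search(text) is not None

for sample in ["]ichael", "[my name is Michael]", "[abc]def]"]:
    print(repr(sample), has_unmatched_closer(sample))
```

The third sample is the interesting one: the first pair is consumed by the group, and the trailing ] is the one that gets reported.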
About your attempt
Even if you are using .NET, the problem with your pattern is that . allows you to go past other brackets. Hence, you'd get no match in
[abc]def]
This is because both the first and the second ] have a [ somewhere in front of them. If you are using .NET, the simplest solution is
(?<!\[[^\[\]]*)\]
Here we use non-bracket characters in the repetition, so that we don't look past the first [ or ] we encounter to the left.
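For comparison, Python's built-in re module is one of the flavors that rejects this variable-length lookbehind. The sketch below first shows that, then emulates the same check with a plain scan instead of a regex. The function name unmatched_closers is hypothetical, and the scan assumes, as the question does, that brackets are never nested.

```python
import re

# Python's re requires fixed-width lookbehinds, so this pattern fails to compile.
try:
    re.compile(r'(?<!\[[^\[\]]*)\]')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

def unmatched_closers(text):
    """Return the indexes of every ] that is not preceded by an open [."""
    positions = []
    open_seen = False  # True while a [ is waiting for its ]
    for i, ch in enumerate(text):
        if ch == '[':
            open_seen = True
        elif ch == ']':
            if open_seen:
                open_seen = False  # this ] closes the pending [
            else:
                positions.append(i)
    return positions

print(lookbehind_ok)
print(unmatched_closers("[abc]def]"))
```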

You don't need lookaround at all (and it would be difficult to use, since most languages don't allow unlimited-length lookbehind assertions):
((?:\[[^\[\]]*]|[^\[\]]*)*+)\]
will match any text that ends in a closing bracket unless there's a corresponding opening bracket before it. It does not (and according to your question doesn't need to) handle nested brackets.
The part before the ] can be found in $1 so you can reuse it later.
Explanation:
( # Match and capture in group number 1:
(?: # the following regex (start of non-capturing group):
\[ # Either a [
[^\[\]]* # followed by non-brackets
\] # followed by ]
| # or
[^\[\]]* # Any number of non-bracket characters
)*+ # repeat as needed, match possessively to avoid backtracking
) # End of capturing group
\] # Match ]

This should do it:
'^[^\[]*\]'
Basically, this says: pick out any closing square bracket that has no opening square bracket between it and the beginning of the line.
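A small Python sketch of this shorter pattern, tried on the question's examples plus one extra input of my own. As far as I can tell it handles the two examples from the question, but it does not report a stray ] that comes after a complete pair, because any earlier [ blocks the match.

```python
import re

# Shorter pattern from the answer above: a ] with no [ anywhere before it.
SIMPLE = re.compile(r'^[^\[]*\]')

samples = ["]ichael", "[my name is Michael]", "[abc]def]"]
results = {s: SIMPLE.search(s) is not None for s in samples}

for sample, matched in results.items():
    print(repr(sample), matched)
```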

\](.*)
Will match on everything after the ]:
]ichael -> ichael
[my name is Michael] ->

Related

regex check if a character exists in combination with other characters, but not by itself

^\[[FfMmHhTt/]+\]
The above RegEx will detect any combination of "m", "f", "h", "t" or "/" within square brackets, upper or lower case, just as it should. However, I would like to modify it so that the forward slash character cannot be found in the square brackets by itself. For example, [F/m], [t/m/h] or [Hh] should still pass, but [/] or [///] should not.
Leading and trailing slashes such as [/t] or [h/m/] should also fail to match.
Can't find any regex tutorials that describe such a thing.
You could phrase the pattern as:
^\[[FfMmHhTt](?:/?[FfMmHhTt])*\]$
Here is an explanation:
^ from the start of the string
\[ match a literal opening square bracket
[FfMmHhTt] followed by [fmht], in any case
(?:/?[FfMmHhTt])* followed by an optional forward slash separator, and
another matching letter, together zero or more times
\] match a literal closing square bracket
$ end of the string
The idea here is that we match an initial letter, since at least one letter is required for a match. Then, we match subsequent letters, each of which may or may not be prefixed with a forward slash separator.
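To see how this behaves on the inputs listed in the question, here is a short Python sketch. The helper name is_valid_tag is hypothetical.

```python
import re

# Pattern from the answer above: a letter, then letters each optionally
# preceded by a single slash separator.
TAG = re.compile(r'^\[[FfMmHhTt](?:/?[FfMmHhTt])*\]$')

def is_valid_tag(text):
    """Return True if text is a bracketed tag with no lone, leading or trailing slash."""
    return TAG.search(text) is not None

should_pass = ["[F/m]", "[t/m/h]", "[Hh]"]
should_fail = ["[/]", "[///]", "[/t]", "[h/m/]"]

for sample in should_pass + should_fail:
    print(sample, is_valid_tag(sample))
```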
Use
^\[\/*[FfMmHhTt][FfMmHhTt\/]*\]
Details
^ - start of string
\[ - [ char
\/* - zero or more /s
[FfMmHhTt] - an allowed letter
[FfMmHhTt\/]* - 0 or more / or allowed letters
\] - a ] char.
Another option, if lookaheads are supported, is a negative lookahead asserting that what follows is not one or more / followed by ]:
^\[(?!/+\])[FfMmHhTt/]+\]
^ Start of string
\[ Match [
(?!/+\]) Negative lookahead, assert what is directly on the right is not 1+ times a forward slash followed by ]
[FfMmHhTt/]+ Match 1+ times any of the listed
\] Match ]
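The following Python sketch runs the last two patterns (the one with \/* and the lookahead one) against the question's inputs. Both reject slash-only brackets, but as written they appear to accept a leading or trailing slash such as [/t] or [h/m/], which the question also wanted to fail, so that is worth checking against your own requirements.

```python
import re

# The two patterns from the answers above, unchanged.
SLASH_PREFIX = re.compile(r'^\[\/*[FfMmHhTt][FfMmHhTt\/]*\]')
LOOKAHEAD = re.compile(r'^\[(?!/+\])[FfMmHhTt/]+\]')

samples = ["[F/m]", "[Hh]", "[/]", "[///]", "[/t]", "[h/m/]"]

# Map each sample to (result of first pattern, result of second pattern).
outcome = {
    s: (SLASH_PREFIX.search(s) is not None, LOOKAHEAD.search(s) is not None)
    for s in samples
}

for sample, flags in outcome.items():
    print(sample, flags)
```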

Regex pattern without one case

I would like to remove some strings from filename.
I want to remove every string in parentheses, but not if it contains the string "remix" or "Remix" or "REMIX".
Now I have got
sed "s/\s*\(\s?[A-z0-9. ]*\)//g"
but how do I exclude the cases where the string contains remix?
You can use a capture group:
sed 's/\(\s*([^)]*remix[^)]*)\)\|\s*(\s\?[a-z0-9. ]*)/\1/gi'
When the "remix branch" doesn't match, the capture group is not defined and the matched part is replaced with an empty string.
When the "remix branch" succeeds, the matched part is replaced by the content of the capture group, so by itself.
Note: if it helps to avoid false positives, you can add word boundaries around "remix": \bremix\b
pattern details:
\( # open the capture group 1
\s* # zero or more white-spaces
( # a literal parenthesis
[^)]* # zero or more characters that are not a closing parenthesis
remix
[^)]*
)
\) # close the capture group 1
\| # OR
# something else between parentheses
\s* # note that it is essential that the two branches are able to
# start at the same position. If you remove \s* in the first
# branch, the second branch will always win when there's a space
# before the opening parenthesis.
(\s\?[a-z0-9. ]*)
\1 is the reference to the capture group 1
i makes the pattern case-insensitive
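The same capture-group trick can be sketched in Python with re.sub. This is only an illustration of the idea, not a drop-in replacement for the sed command; the function name strip_non_remix and the sample file names are made up.

```python
import re

# Branch 1 captures a parenthesized part containing "remix" (kept);
# branch 2 matches any other parenthesized part (dropped).
PATTERN = re.compile(
    r'(\s*\([^)]*remix[^)]*\))|\s*\(\s?[a-z0-9. ]*\)',
    re.IGNORECASE,
)

def strip_non_remix(name):
    """Remove parenthesized parts unless they contain "remix"."""
    # group(1) is None when the second branch matched, so substitute ''.
    return PATTERN.sub(lambda m: m.group(1) or '', name)

print(strip_non_remix("Song (Original Mix) (Club Remix).mp3"))
print(strip_non_remix("Track (live)"))
```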
[EDIT]
If you want to do it in a POSIX-compliant way, you must use a different approach because several GNU features are not available, in particular the alternation \| (but also the i modifier, the \s character class, and the optional quantifier \?).
This other approach consists of matching any characters that are not an opening parenthesis and any parenthesized substrings containing "remix", followed by optional white-space and an optional parenthesized substring.
As you can see, everything is optional and the pattern can match an empty string, but that isn't a problem.
Everything before the parenthesized part to remove is captured in group 1.
sed 's/\(\([^(]*([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)[^ \t(]*\([ \t]\{1,\}[^ \t(]\{1,\}\)*\)*\)\([ \t]*([^)]*)\)\{0,1\}/\1/g;'
pattern details:
\( # open the capture group 1
\(
[^(]* # all that is not an opening parenthesis
# substring enclosed between parentheses containing "remix"
( [^)]* [Rr][Ee][Mm][Ii][Xx] [^)]* )
# Let's reach the next parenthesis without matching the white-spaces
# before it (otherwise the leading white-spaces are not removed)
[^ \t(]* # all that is not a white-space or an opening parenthesis
# optional groups of white-spaces followed by characters that are
# neither white-spaces nor opening parentheses
\( [ \t]\{1,\} [^ \t(]\{1,\} \)*
\)*
\) # close the capture group 1
\(
[ \t]* # leading white-spaces
([^)]*) # parenthesis
\)\{0,1\} # makes this part optional (this avoids removing a "remix" part
# alone at the end of the string)
Word boundaries aren't available in this mode either, so the only way to emulate them is to list the four possibilities:
([Rr][Ee][Mm][Ii][Xx]) # poss1
([Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss2
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx]) # poss3
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss4
and to replace ([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*) with:
\(poss1\)\{0,\}\(poss2\)\{0,\}\(poss3\)\{0,\}\(poss4\)\{0,\}
Just skip the lines matching "remix":
sed '/([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)/! s/([^)]*)//g'
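A rough Python equivalent of this "skip the line" approach, for illustration only (strip_parens is a hypothetical name). Note the consequence of working per line: if a line contains a parenthesized "remix", all of its other parenthesized parts are kept as well.

```python
import re

# A parenthesized part containing "remix" in any letter case.
HAS_REMIX = re.compile(r'\([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*\)')
# Any parenthesized part.
PARENS = re.compile(r'\([^)]*\)')

def strip_parens(line):
    """Leave lines with a parenthesized "remix" alone; strip parentheses elsewhere."""
    if HAS_REMIX.search(line):
        return line
    return PARENS.sub('', line)

print(strip_parens("Song (Live).mp3"))
print(strip_parens("Song (Live) (Remix).mp3"))
```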
If "brackets" means square brackets, [] (US usage):
sed '/remix\|REMIX\|Remix/ !s/\[[^]]*]//g'
If "brackets" means parentheses, () (rest of the world):
sed '/remix\|REMIX\|Remix/ !s/([^)]*)//g'
assuming:
- there are no nested brackets
- other spellings of remix (ReMix, ...) are not recognized, so the brackets on such lines are removed
- remix can appear anywhere in the title (i love remix) [if needed, specify which occurrences to keep and which to remove]

regular expression to find content within square brackets, but with some exceptions:

I want to create a regular expression to find content within square brackets, but with some exceptions:
E.g.,
[abc] -> It should match
['abc'] -> it should not match
[$abc] -> it should not match
[integer] Like [0] -> it should not match
I have used this regular expression
\[((?!')[^]]*)\]
It works for the first two conditions but not for the other two.
This regex could do the job,
\[([^'$\d]+?)\]
Explanation:
\[ Matches a literal [ symbol.
() Capturing group
[^'$\d]+? Matches one or more characters that are not ', $ or a digit (\d). The ? after + makes the match reluctant (non-greedy).
\] Matches a literal ] symbol.
You could add a $ to your negative lookahead assertion and assert that no integer number can be matched:
\[((?!['$]|\d+\])[^]]*)\]
Explanation:
\[ # Match [
( # Capture in group 1:
(?! # unless the following matches here: Either...
['$] # one of the characters ' or $
| # or
\d+\] # an integer (one or more digits), followed by ]
) # End of lookahead assertion
[^]]* # Match any number of characters except closing brackets
) # End of group 1
\] # Match ]
Test it live on regex101.com.
You might be able to avoid the negative lookahead altogether:
\[[^]'$\d]*\]
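The three patterns above are not interchangeable, which a short Python sketch can show. The extra sample [a1] is my own addition: the character-class versions reject a digit anywhere inside the brackets, while the lookahead version only rejects content that is entirely digits.

```python
import re

# The three patterns from the answers above, unchanged.
PATTERNS = {
    "negated class, lazy": re.compile(r"\[([^'$\d]+?)\]"),
    "negative lookahead": re.compile(r"\[((?!['$]|\d+\])[^]]*)\]"),
    "negated class only": re.compile(r"\[[^]'$\d]*\]"),
}

samples = ["[abc]", "['abc']", "[$abc]", "[0]", "[a1]"]

# For each sample, record which patterns find a match.
table = {
    s: [p.search(s) is not None for p in PATTERNS.values()]
    for s in samples
}

for sample, row in table.items():
    print(sample, row)
```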

Regular expression to match a regular expression inside square brackets

I have a string that contains a regular expression within square brackets, and it can contain more than one item inside square brackets. Below is an example of a string I'm using:
[REGEX:^([0-9])*$][REGEXERROR:That value is not valid]
In the example above, I'd like to match for the item [REGEX:^([0-9])*$], but I can't figure out how.
I thought I'd try using the regular expression \[REGEX:.*?\], but it matches [REGEX:^([0-9] (i.e., it stops at the first ] it finds).
I also tried \[REGEX:.*\], but it matches everything right to the end of the string.
Any ideas?
Assuming you are using PCRE, this should be able to handle brackets nested inside the regular expression:
\[REGEX:[^\[]*(\[[^\]]*\][^\[]*)*\]
This technique is called unrolling. The basic idea of this regex is:
1. match the opening bracket
2. match all characters that are not brackets
3. match one pair of brackets
4. match all trailing characters that are not brackets
5. then repeat 3 and 4 until the last closing bracket comes
Explanation with free-space:
\[ # start brackets
REGEX: # plain match
[^\[]* # match any symbols other than [
( # then match nested brackets
\[ # the start [ of nested
[^\]]* # anything inside the bracket
\] # closing bracket
[^\[]* # trailing symbols after brackets
)* # repeatable
\] # end brackets
Reference: Mastering Regular Expressions
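The pattern is not PCRE-specific; as a sanity check, this minimal Python sketch runs it on the string from the question. It handles one level of brackets inside the item, which is all the pattern is written for.

```python
import re

# Unrolled pattern from the answer above, unchanged.
REGEX_ITEM = re.compile(r'\[REGEX:[^\[]*(\[[^\]]*\][^\[]*)*\]')

text = "[REGEX:^([0-9])*$][REGEXERROR:That value is not valid]"
match = REGEX_ITEM.search(text)

# The match stops at the ] that closes the REGEX item, not at the one in [0-9].
print(match.group(0))
```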

Match a dot/period if not followed AND preceded by a single character

So I know I need to use the lookahead and lookbehind stuff, but I'm starting to lose my mind.
Therefore, can you provide an example and an explanation of what it means?
I need to match the dots in the following sequence, but not the ones in between the individual characters.
this.is.a.sentence.e.g.
When removing the matched dots you should be left with:
this is a sentence e.g
The answer needs to work in a variety of different regex engines, so something generic is preferred, but if it's easier, I'm sure I can work it out from a .NET based one.
Lookbehinds are not widely supported, and your requirements make it difficult not to use them. Perhaps a superior regex guru can provide a solution that does not use them, but for now here is what I have:
(?: # do not capture
^ # anchor to start of line
| # alternation
(?<= # lookbehind
[^.]{2} # two non-period characters
)
)
\. # a literal period
| # alternation
\. # a literal period
(?: # do not capture
$ # anchor to end of line
| # alternation
(?= # lookahead
[^.]{2} # two non-period characters
)
)
Essentially this does two alternating checks: A period that is preceded either by the start of the line or two non-period characters, or a period that is followed either by the end of the line or two non-period characters.
This works for your specific example: http://rubular.com/r/3ueTN37Smh
You could also handle doing the replacement this way:
s/(^|[^.]{2})\.|\.($|[^.]{2})/\1 \2/
This captures the two preceding or following characters instead and inserts them back as part of the replacement. It's simpler and probably available for more languages.
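Both variants can be tried in Python, whose re module accepts the fixed-width lookbehind used here. In this sketch each matched dot is replaced by a space, which is my reading of the expected output in the question; note that the final dot then becomes a trailing space.

```python
import re

text = "this.is.a.sentence.e.g."

# Variant 1: lookarounds, so only the dot itself is matched.
LOOKAROUND = re.compile(r'(?:^|(?<=[^.]{2}))\.|\.(?:$|(?=[^.]{2}))')
with_lookaround = LOOKAROUND.sub(' ', text)

# Variant 2: capture the neighbouring characters and put them back.
CAPTURING = re.compile(r'(^|[^.]{2})\.|\.($|[^.]{2})')
with_capture = CAPTURING.sub(
    lambda m: (m.group(1) or '') + ' ' + (m.group(2) or ''),
    text,
)

print(repr(with_lookaround))
print(repr(with_capture))
```

Because the capturing variant consumes the neighbouring characters, the two can differ on inputs where matches would need to overlap; on the example from the question they agree.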