Find an item in the text with exceptions[Regular Expression] - regex

Please help me create a regular expression that matches the "|" character everywhere except inside parentheses.
example|example (example(example))|example|example|example(example|example|example(example|example))|example
After the selection is made, the 5 "|" characters outside the parentheses should be matched. Note that the contents of the parentheses should remain unchanged, including the "|" characters inside them.

Considering you want to match pipes that are outside any set of parentheses, with nested sets, here's the pattern to achieve what you want:
Regex:
(?x) # Allow comments in regex (ignore whitespace)
(?: # Repeat *
[^(|)]*+ # Match every char except ( ) or |
( # 1. Group 1
\( # Opening paren
(?: # chars inside:
[^()]++ # a. everything inside parens except nested parens
| # or
(?1) # b. nested parens (recurse group 1)
)*+ # repeated any number of times (possessive)
\) # Until closing paren.
)?+ # (end of group 1)
)*+ #
\K # Keep text out of match
\| # Match a pipe
regex101 Demo
One-liner:
(?:[^(|)]*+(\((?:[^()]++|(?1))*+\))?+)*+\K\|
regex101 Demo
This pattern uses some advanced features:
Possessive quantifiers
Recursion
Resetting the match start
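For engines without recursion or \K (Python's built-in re module, for instance), the same selection can be sketched without a regex by tracking the nesting depth. This is a different technique from the pattern above, not a translation of it; the function name pipes_outside_parens is made up for illustration.

```python
def pipes_outside_parens(text):
    """Return the index of every '|' that is not inside parentheses."""
    depth = 0          # current nesting level of parentheses
    positions = []
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif ch == "|" and depth == 0:
            positions.append(i)
    return positions


sample = ("example|example (example(example))|example|example|"
          "example(example|example|example(example|example))|example")
print(len(pipes_outside_parens(sample)))  # 5
```

An unmatched closing parenthesis is simply ignored here; whether that is the right behaviour depends on the input.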

Related

Notepad++ regex to find single character bounded by |

I'm having trouble coming up with the regex I need to do this find/replace in Notepad++. I'm fine with needing a couple of separate searches to complete the process.
Basically I need to add a | at the beginning and end of every line from a CSV, plus replace all the , with |. Then, on any value with only 1 character, I need to put two spaces around the character on each side ("A" becomes " A ")
Source:
col1,col2,col3,col4,col5,col6
name,desc,something,else,here,too
another,,three,,,
single,characters,here,a,b,c
last,line,here,,almost,
Results:
|col1|col2|col3|col4|col5|col6|
|name|desc|something|else|here|too|
|another||three||||
|single|characters|here| a | b | c |
|last|line|here||almost||
Adding the | to the beginning and the end of the line is simple enough, and replacing , with | is obviously straightforward. But I can't come up with the regex to find |x| where x is limited to a single character. I'm sure it is simple, but I'm new to regex.
Regex:
(?:(^)|(?!^)\G)(?:([^\r\n,]{2,})|([^\r\n,]))?(?:(,$)|(,)|($))
Replacement string:
(?{1}|)(?{2}\2)(?{3} \3 )(?{4}||)(?{5}|)(?{6}|)
Ugly, dirty and long but works.
Regex Explanation:
(?: # Start of non-capturing group (a)
(^) # Assert beginning of line (CP #1)
| # Or
(?!^) # Negative lookahead: not at the beginning of a line
\G # Match at previous matched position
) # End of non-capturing group (a)
(?: # Start of non-capturing group (b)
([^\r\n,]{2,}) # Match a value of two or more characters (any except \r, \n or `,`) (CP #2)
| # Or
([^\r\n,]) # Match one-char string (CP #3)
)? # Optional - End of non-capturing group (b)
(?: # Start of non-capturing group (c)
(,$) # Match `,$` (CP #4)
| # Or
(,) # Match single comma (CP #5)
| # Or
($) # Assert end of line (CP #6)
) # End of non-capturing group (c)
Three Step Solution:
Pattern: ^.+$ Replacement: |$0|
Pattern: , Replacement: |
Pattern: (?<=\|)([^|\r\n])(?=\|) Replacement: $0 with one space added before and after it
The first replace adds | at the beginning and at the end, and replaces commas:
Search: ^|$|,
Replace: |
The second replace adds space around single character matches:
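Search: (?<=[|])([^|\r\n])(?=[|])
Replace: $1
Add a space to the left and to the right of $1. Excluding \r and \n keeps the line break between two lines from being treated as a single-character value.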
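To check the three steps outside Notepad++, here is a rough Python sketch of the same sequence. It assumes Python's re syntax (\g<0> and \1 instead of $0 and $1) and one space on each side of a single character, as in the expected results; the function name csv_to_pipes is made up.

```python
import re


def csv_to_pipes(text):
    # Step 1: wrap every non-empty line in pipes.
    text = re.sub(r"^.+$", r"|\g<0>|", text, flags=re.MULTILINE)
    # Step 2: turn every comma into a pipe.
    text = text.replace(",", "|")
    # Step 3: pad single-character values with one space on each side.
    return re.sub(r"(?<=\|)([^|\r\n])(?=\|)", r" \1 ", text)


source = """col1,col2,col3,col4,col5,col6
name,desc,something,else,here,too
another,,three,,,
single,characters,here,a,b,c
last,line,here,,almost,"""

print(csv_to_pipes(source))
```

Because the lookbehind and lookahead do not consume the pipes, adjacent single-character values such as a, b and c are all padded in one pass.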

Regex pattern without one case

I would like to remove some strings from a filename.
I want to remove every bracketed string, except when it contains "remix", "Remix" or "REMIX".
So far I have:
sed "s/\s*\(\s?[A-z0-9. ]*\)//g"
but how do I exclude the cases where the string contains "remix"?
You can use a capture group:
sed 's/\(\s*([^)]*remix[^)]*)\)\|\s*(\s\?[a-z0-9. ]*)/\1/gi'
When the "remix branch" doesn't match, the capture group is not defined and the matched part is replaced with an empty string.
When the "remix branch" succeeds, the matched part is replaced by the content of the capture group, so by itself.
Note: if it helps to avoid false positives, you can add word boundaries around "remix": \bremix\b
pattern details:
\( # open the capture group 1
\s* # zero or more white-spaces
( # a literal parenthesis
[^)]* # zero or more characters that are not a closing parenthesis
remix
[^)]*
)
\) # close the capture group 1
\| # OR
# something else between parentheses
\s* # note that it is essential that the two branches are able to
# start at the same position. If you remove \s* in the first
# branch, the second branch will always win when there's a space
# before the opening parenthesis.
(\s\?[a-z0-9. ]*)
\1 is the reference to the capture group 1
i makes the pattern case-insensitive
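The capture-group trick is not specific to sed. Below is a minimal Python sketch of the same idea, assuming Python's re syntax (unescaped parentheses for groups, escaped ones for literals); the function name strip_brackets and the sample filename are made up.

```python
import re

# Branch 1 (captured): a bracketed part containing "remix".
# Branch 2 (not captured): any other bracketed part to remove.
BRACKETS = re.compile(
    r"(\s*\([^)]*remix[^)]*\))|\s*\(\s?[a-z0-9. ]*\)",
    re.IGNORECASE,
)


def strip_brackets(name):
    # Keep group 1 when the "remix" branch matched, otherwise drop the match.
    return BRACKETS.sub(lambda m: m.group(1) or "", name)


print(strip_brackets("Song (Live) (Club Remix) (2010).mp3"))
# Song (Club Remix).mp3
```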
[EDIT]
If you want to do it in a POSIX-compliant way, you must use a different approach, because several GNU features are not available: in particular the alternation \| (but also the i modifier, the \s character class and the optional quantifier \?).
This other approach consists of matching any characters that are not an opening parenthesis, together with any parenthesized substrings that contain "remix", followed by optional white-space and an optional parenthesized substring.
As you can see, everything is optional and the pattern can match an empty string, but that isn't a problem.
Everything before the parenthesized part to remove is captured in group 1.
sed 's/\(\([^(]*([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)[^ \t(]*\([ \t]\{1,\}[^ \t(]\{1,\}\)*\)*\)\([ \t]*([^)]*)\)\{0,1\}/\1/g;'
pattern details:
\( # open the capture group 1
\(
[^(]* # all that is not an opening parenthesis
# substring enclosed between parentheses with "remix" inside
( [^)]* [Rr][Ee][Mm][Ii][Xx] [^)]* )
# Let's reach the next parenthesis without matching the white-spaces
# before it (otherwise the leading white-spaces are not removed)
[^ \t(]* # all that is not a white-space or an opening parenthesis
# optional groups of white-spaces followed by characters that are
# neither white-spaces nor an opening parenthesis
\( [ \t]\{1,\} [^ \t(]\{1,\} \)*
\)*
\) # close the capture group 1
\(
[ \t]* # leading white-spaces
([^)]*) # parenthesis
\)\{0,1\} # makes this part optional (this avoids removing a "remix" part
# alone at the end of the string)
Word boundaries aren't available in this mode either, so the only way to emulate them is to list the four possibilities:
([Rr][Ee][Mm][Ii][Xx]) # poss1
([Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss2
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx]) # poss3
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss4
and to replace ([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*) with:
\(poss1\)\{0,\}\(poss2\)\{0,\}\(poss3\)\{0,\}\(poss4\)\{0,\}
Just skip the lines matching "remix":
sed '/([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)/! s/([^)]*)//g'
where brackets are (US usage): []
sed '/remix\|REMIX\|Remix/ !s/\[[^]]*]//g'
where brackets are (rest of the world): ()
sed '/remix\|REMIX\|Remix/ !s/([^)]*)//g'
assuming:
- there are no nested brackets
- other spellings of remix (ReMix, ...) are not recognized, so brackets on those lines are still removed
- "remix" can appear anywhere in the title (i love remix) [if needed, specify which occurrences to keep and which to remove]
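The line-skipping approach can be sketched in Python as well. This is only an illustration under the same assumptions (no nested brackets, round brackets); the function name clean_title is made up, and the remix test here is case-insensitive, like the [Rr][Ee][Mm][Ii][Xx] variant.

```python
import re

# A bracketed part that contains "remix" in any letter case.
REMIX = re.compile(r"\([^)]*remix[^)]*\)", re.IGNORECASE)


def clean_title(line):
    # Leave the whole line alone when any bracket mentions remix.
    if REMIX.search(line):
        return line
    # Otherwise remove every bracketed part.
    return re.sub(r"\([^)]*\)", "", line)


print(clean_title("Song (Live).mp3"))               # Song .mp3
print(clean_title("Song (Live) (Club Remix).mp3"))  # unchanged
```

Note the trade-off: a line with a remix bracket keeps all of its other brackets too, which the capture-group answer avoids.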

how to interpret this regular expression

I am having a lot of trouble interpreting this expression and I am getting really lost trying to read it. can someone help me?
^[^?](?:htaccess|access_log)(?:[.][^/?])?(?:[~])?(?:[?].*)?$
I know that ^ means to start at the beginning of the line, [^?] means not matching a "?" I think, and then (?:) I'm not sure what this does or how to interpret the rest of the line. I'm thinking that htaccess|access_log is an or statement, so either htaccess or access_log. [.][^/?] is a . followed by not a "?", but then what would the earlier [^?] mean...
What would an example of something this matches?
There are plenty of explainers that will break down a regular expression for you.
To be concise: the caret at the start of a character class [^ ] is the negation operator, meaning match anything NOT in the character class. The ?: placed right after an opening parenthesis makes a non-capturing group, which groups expressions without capturing what they match, and | is the alternation operator.
I would recommend taking a look at these sites for basic use of regular expressions.
Regular-Expressions.info
Rexegg (Regex Tutorial)
Regular Expression:
^ # the beginning of the string
[^?] # any character except: '?'
(?: # group, but do not capture:
htaccess # 'htaccess'
| # OR
access_log # 'access_log'
) # end of grouping
(?: # group, but do not capture (optional):
[.] # any character of: '.'
[^/?] # any character except: '/', '?'
)? # end of grouping
(?: # group, but do not capture (optional):
[~] # any character of: '~'
)? # end of grouping
(?: # group, but do not capture (optional):
[?] # any character of: '?'
.* # any character except \n (0 or more times)
)? # end of grouping
$ # before an optional \n, and the end of the string
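As for what it matches, a quick way to find out is to run the expression against a few candidates. Here is a small Python sketch; the candidate strings are made up for illustration, and the pattern is copied unchanged from the question.

```python
import re

PATTERN = re.compile(
    r"^[^?](?:htaccess|access_log)(?:[.][^/?])?(?:[~])?(?:[?].*)?$"
)


def matches(candidate):
    return PATTERN.search(candidate) is not None


for candidate in [".htaccess", "/access_log", ".htaccess.1", ".htaccess~",
                  ".htaccess?foo=bar", "htaccess", ".htaccess.bak", "?htaccess"]:
    print(f"{candidate!r}: {matches(candidate)}")
```

The first five candidates match. "htaccess" does not, because [^?] must consume one character before the name, ".htaccess.bak" does not, because only one character is allowed after the dot, and "?htaccess" does not, because the first character may not be a "?".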

Perl Regex match balanced parentheses

Following strings - match:
"MNO(A=(B=C) D=(E=F)) PQR(X=(G=H) I=(J=(K=L)))" - "MNO"
"MNO(A=(B=C) D=(E=F))" - "MNO"
"MNO" - "MNO"
"RAX.MNO(A=(B=C) D=(E=F)) PQR(X=(G=H) I=(J=(K=L)))" - "RAX.MNO"
"RAX.MNO(A=(B=C) D=(E=F))" - "RAX.MNO"
"RAX.MNO" - "RAX.MNO"
Inside every brace, there can be unlimited groups of them, but they have to be closed properly.
Any ideas? Don't know how to test properly for closure.
I have to use a Perl-Regular-Expression.
In Perl or PHP, for example, you could use a regex like
/\((?:[^()]++|(?R))*\)/
to match balanced parentheses and their contents.
See it on regex101.
To remove all those matches from a string $subject in Perl, you could use
$subject =~ s/\((?:[^()]++|(?R))*\)//g;
Explanation:
\( # Match a (
(?: # Start of non-capturing group:
[^()]++ # Either match one or more characters except (), don't backtrack
| # or
(?R) # Match the entire regex again, recursively
)* # Any number of times
\) # Match a )
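For comparison, in a language whose standard regex engine has no (?R), such as Python's re module, the same removal can be sketched with a depth counter. This is a swapped-in technique rather than a port of the regex, and the function names are made up. Like the regex, it only removes groups that are properly closed.

```python
def _matching_close(text, start):
    """Index of the ')' closing the '(' at start, or -1 if it is never closed."""
    depth = 0
    for j in range(start, len(text)):
        if text[j] == "(":
            depth += 1
        elif text[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    return -1


def strip_balanced_parens(text):
    out = []
    i = 0
    while i < len(text):
        if text[i] == "(":
            end = _matching_close(text, i)
            if end != -1:
                i = end + 1      # skip the whole balanced group
                continue
        out.append(text[i])      # ordinary character or unbalanced paren
        i += 1
    return "".join(out)


print(strip_balanced_parens("MNO(A=(B=C) D=(E=F)) PQR(X=(G=H) I=(J=(K=L)))"))
# MNO PQR
```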

regex to match postgresql bytea

In PostgreSQL, there is a BLOB datatype called bytea. It's just an array of bytes.
bytea literals are output in the following way:
'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'
See PostgreSQL docs for full definition of the format.
I'm trying to construct a Perl regular expression which will match any such string.
It should also match standard ANSI SQL string literals, like 'Joe', 'Joe''s Mom', 'Fish Called ''Wendy'''
It should also match the backslash-escaped variant: 'Joe\'s Mom'.
The first approach (shown below) works only for some bytea representations.
s{ ' # Opening apostrophe
(?: # Start group
[^\\\'] # Anything but a backslash or an apostrophe
| # or
\\ . # Backslash and anything
| # or
\'\' # Double apostrophe
)* # End of group
' # Closing apostrophe
}{LITERAL_REPLACED}xgo;
For others (longer ones, with many escaped apostrophes), Perl gives this warning:
Complex regular subexpression recursion limit (32766) exceeded at ./sqa.pl line 33, <> line 1.
So I am looking for a better (but still regex-based) solution, it probably requires some regex alchemy (avoiding backreferences and all).
OK, here the best solution I could put together, thanks to Leon and hobbs.
Note: This is not the solution I was looking for! It still makes Perl fail with warning "recursion limit (32766) exceeded", for some long strings. (try to stuff 400k random bytes into a bytea field, then export with pg_dump --inserts).
However, it matches most bytea strings (as they appear in SQL code and in server logs), and ANSI SQL string literals. For example:
'\014cS\0059\036a4JEd\021o\005t\0015K7'
'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'
' Joe''s Mom friend\'s dog is called \'Fluffy'''
And here's the regex:
m{
' # opening apostrophe
(?> # start non-backtracking group
[^\\']+ # anything but a backslash or an apostrophe, one or more times
| # or
(?: # group of
\\ \\? [0-7]{3} # one or two backslashes and three octal digits
)+ # one or more times
| # or
'' # double apostrophe
| # or
\\ [\\'] # backslash-escaped apostrophe or backslash
)* # end of group
' # closing apostrophe
}x;
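To try the pattern against the three sample literals, here is a rough Python transcription. It is a sketch, not the Perl original: the atomic group (?> ) is replaced by an ordinary group that matches one plain character at a time, on the assumption that this avoids needing Python 3.11+, where atomic groups were added. The name is_sql_literal is made up.

```python
import re

SQL_LITERAL = re.compile(r"""
    '                       # opening apostrophe
    (?:
        [^\\']              # anything but a backslash or an apostrophe
      | \\ \\? [0-7]{3}     # one or two backslashes and three octal digits
      | ''                  # doubled apostrophe
      | \\ [\\']            # backslash-escaped apostrophe or backslash
    )*
    '                       # closing apostrophe
""", re.VERBOSE)


def is_sql_literal(text):
    return SQL_LITERAL.fullmatch(text) is not None


samples = [
    r"'\014cS\0059\036a4JEd\021o\005t\0015K7'",
    r"'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'",
    r"' Joe''s Mom friend\'s dog is called \'Fluffy'''",
]
for sample in samples:
    print(is_sql_literal(sample))
```

Whether this form copes with the 400k-byte case mentioned above is untested here; it only shows which literals the alternatives accept.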
If you don't care about correctness, at least for now, couldn't you just try to match against regular quoted string literals? Probably something like
m{
(?> # start of a quote group
' # opening apostrophe
(?> # start non-backtracking group
[^\\']+ # anything but a backslash or an apostrophe, one or more times
| # or
\\ . # backslash-escaped something
)* # end of group
' # closing apostrophe
)+ # end of a quote group, many of these
}x;
First of all, it seems like you're trying to do two very different things in one regexp:
Matching it for correctness.
Unquoting it.
To match it, you could try something like this:
m{ ^ # Start of string
' # Opening apostrophe
(?> # Start non-backtracking group
[^\\\'] # Anything but a backslash or an apostrophe
| # or
\\ # A backslash, followed by
(?: # Start group
\d{3} # 3 digits
|
. # one other character
) # end group
| # or
'' # Double apostrophe
)* # End of group
' # Closing apostrophe
$ # End of string
}xms;