Grep ambiguity nested square bracket - regex

sample.txt contains
abcde
abde
Can anybody explain the output of following commands -
grep '[[ab]]' sample.txt - no output
grep '[ab[]]' sample.txt - no output
grep '[ab[]' sample.txt - output is abcde , abde
grep '[ab]]' sample.txt - no output
And what does [(ab)] and [^(ab)] mean? Is it the same as [ab] and [^ab] ?

The first thing to understand is that inside a character class, most regex meta-characters lose their special meaning and are matched literally (the exceptions are ], a ^ in first position, and a - between two characters). For example, a * will match a literal * and will not mean zero or more repetitions. Similarly, ( and ) will match ( and ), and will not create a capture group.
Now, the first ] that is not the very first character of a character class closes it, and any characters after it are not part of that class. With that in mind, let's look at what is happening above:
In 1, 2, and 4, your character class ends at the first closing ]. So the last closing bracket, ], is not part of the character class. It has to be matched separately. So, your patterns will match something like this:
'[[ab]]' is the same as '([|a|b)(])' // The last `]` has to match.
'[ab[]]' is the same as '(a|b|[)(])' // Again, the last `]` has to match.
'[ab]]' is the same as '(a|b)(])' // Same, the last `]` has to match.
In all three patterns, the character class closes at the first `]`.
Now, since neither string contains a ], no match is found.
In the 3rd pattern, on the other hand, the character class is closed only by the last ], so everything before it is inside the character class.
'[ab[]' means: match a string that contains 'a', or 'b', or '['
which is perfectly valid and matches both strings.
And what does [(ab)] and [^(ab)] mean?
[(ab)] means match any of the (, a, b, ). Remember, inside a character class, no meta-character of regex has any special meaning. So, you can't create groups inside a character class.
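[^(ab)] is the negation of [(ab)]. It matches any single character that is not one of (, a, b, ). With grep, a line is therefore printed as soon as it contains at least one character outside that set, not only when it contains none of those characters.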
Is it the same as [ab] and [^ab] ?
No. Those two do not include ( and ), so they are slightly different.
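To see the difference on concrete input, here is a minimal Python sketch. It assumes that Python's re module treats these particular classes the same way grep does (that holds for these simple cases, although the two engines differ elsewhere). The grep helper and the sample strings are made up for illustration and are not the original sample.txt.

```python
import re

def grep(pattern, lines):
    """Keep the lines in which the pattern matches somewhere, like grep does."""
    return [line for line in lines if re.search(pattern, line)]

# Hypothetical sample lines, chosen to separate the four patterns.
samples = ["abcde", "ab", "(ab)", "(x)"]

with_parens = grep(r"[(ab)]", samples)   # '(' and ')' are ordinary set members
plain = grep(r"[ab]", samples)           # "(x)" contains neither 'a' nor 'b'
neg_parens = grep(r"[^(ab)]", samples)   # needs one char outside ( a b )
neg_plain = grep(r"[^ab]", samples)      # here '(' and ')' count as "outside"

print(with_parens, plain, neg_parens, neg_plain, sep="\n")
```

Note that "(ab)" is matched by [^ab] but not by [^(ab)], which is exactly the difference the parentheses make.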

I'll give it a try:
grep '[[ab]]' - matches a string which has one of "[,a,b" followed by a "]" char
grep '[ab[]]' - matches a string which has one of "a,b,[" followed by a "]" char
grep '[ab[]' - matches a string which has one of "a,b,["
grep '[ab]]' - matches a string which has one of "a,b" followed by a "]" char
grep '[(ab)]' - matches a string which has one of "(,a,b,)"
grep '[^(ab)]' - matches a string which has at least one char other than "(,a,b" and ")"
grep '[ab]' - matches a string which contains one of "a,b"
grep '[^ab]' - matches a string which has at least one char other than "a" and "b"
you can go through those grep cmds on this example:
#create a file with below lines:
abcde
abde
[abcd
abcd]
abc[]foo
abc]bar
[ab]cdef
a(b)cde
You will see the difference; compare the output with my comments above.
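If you would rather not create the file, the same experiment can be sketched in Python. This assumes Python's re parses these bracket expressions like grep does, which is the case for the three patterns below. I left out '[[ab]]' because Python's re may emit a FutureWarning about a possible nested set when a class starts with [[. The grep helper is a made-up name.

```python
import re

# The test file suggested above, as a list of lines.
lines = ["abcde", "abde", "[abcd", "abcd]", "abc[]foo",
         "abc]bar", "[ab]cdef", "a(b)cde"]

def grep(pattern, candidates):
    """Keep the candidates in which the pattern matches somewhere."""
    return [s for s in candidates if re.search(pattern, s)]

# Map each pattern to the lines it selects.
results = {p: grep(p, lines) for p in (r"[ab[]]", r"[ab]]", r"[ab[]")}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

Only the lines where a, b (or [ for the first pattern) is directly followed by ] survive the first two patterns, while '[ab[]' keeps every line because each one contains an a.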

Related

Vim complex regex

I have these strings in a file:
a b
a-b
a / b / c
I want to replace these with:
"a b" => a_b
"a-b" => a_b
"a / b / c" => a_b_c
How do I write the regex ? Please also explain the regex and name the concepts involved.
Yet another way:
:g/^/co.|-s/.*/"&" =>/|+s/\W\+/_/g|-j
Overview:
For every line, :g/^/, copy the line (:copy), then substitute to add the "..." => on the first copy and replace the non-word characters on the second copy with _. Then join the two lines, -j.
Glory of Details:
:g/{pat}/{cmd} - run {cmd} on each line matching {pat}. Use ^ to match every line
copy . - copy the current line below the current line (.). Short: co.
-1s/.*/.../ - :s the line above (-1). Replace entire line, .*
"&" => - & is the entire match (or \0 in PCRE)
+s/\W\+/_/g - do a global :s on the next line (+1), replacing each run of non-word characters (\W) with _
-j - do a :join starting from the line above with the next line
For more help:
:h :g
:h :copy
:h :s
:h :j
:h :range
This is beyond simple capturing and reordering in the replacement. The modification of the non-alphabetic characters to _ requires a contained substitution of the match. This can be done via :help sub-replace-expr:
:%substitute/.*/\='"' . submatch(0) . '" => ' . substitute(submatch(0), '\A\+', '_', 'g')/
Basically, this matches entire lines, then replaces with the match in double quotes, followed by =>, followed by the match with non-alphabetic character sequences (\A\+) replaced with a single _.
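For readers who want to check the logic outside Vim, here is a rough Python analogue of that nested substitution. It is only a sketch: [^A-Za-z]+ stands in for Vim's \A\+, and quote_and_convert is a made-up helper name.

```python
import re

def quote_and_convert(line):
    """Return '"<line>" => <line with non-letter runs replaced by _>'."""
    # Inner substitution: each run of non-letters becomes a single underscore.
    converted = re.sub(r"[^A-Za-z]+", "_", line)
    # Outer step: quote the untouched original and append the converted copy.
    return f'"{line}" => {converted}'

for line in ["a b", "a-b", "a / b / c"]:
    print(quote_and_convert(line))
```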
alternative
You can also do this in two separate steps: First duplicating and quoting the line:
:%substitute/.*/"&" => &/
Then, the second copy needs to be modified. To apply the substitution to only match after the => separator, a positive lookbehind (must match after => + any characters) must be given:
:%substitute/\%(=> .*\)\#<=\A\+/_/g
This achieves what you're asking for, although the question is somewhat ambiguous:
%s/\(\a\)\A\+/\1_/g
%s/[find_pattern]/[replace_pattern]/g does find and replace on every line (%) of a file, and replaces every match on a line (g), as opposed to the default behaviour of just the first one.
\(\a\) captures a group (the parentheses have to be escaped), containing an alphabetic character.
\A\+ means one or more non-alphabetic characters.
\1 is a backreference to the first captured group in the pattern, in this case the alphabetic character in parentheses.
_ is just the literal.
So together it replaces every letter followed by 1 or more non-letters with that letter followed by _. This only gives the desired result when the line ends with a letter; trailing non-letters would turn into a trailing _.
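The trailing-character caveat is easy to check with a small Python sketch. The character classes below are my stand-ins for Vim's \a and \A, and the function name is hypothetical.

```python
import re

def letters_then_underscore(line):
    # ([A-Za-z]) stands in for Vim's \(\a\); [^A-Za-z]+ stands in for \A\+.
    # \1 puts the captured letter back, followed by a single underscore.
    return re.sub(r"([A-Za-z])[^A-Za-z]+", r"\1_", line)

print(letters_then_underscore("a / b / c"))  # separators between letters
print(letters_then_underscore("a-b "))       # note the trailing space
```

The second call shows the limitation: the trailing space after b is also a run of non-letters, so it is turned into an underscore as well.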
One way of doing this:
:%s/[\ -]\/*\ */_/g
[\ -] looks for either a space \ (note the space between \ and -) or a dash -.
The asterisk * means 0 or N occurrences. So \/* 0 or N occurrences of slash /; \ * 0 or N occurrences of space. Finally g replace all occurrences in the line.
[Edit]
I had misunderstood the question. Your problem can be solved using multiple sub-expressions in 2 steps.
step 1) Put an underscore before the c
:%s/c/_c/g
step 2) find and replace
:%s/a\([\ -]\/*\ *\)b\(\1\)*\(_\)*\(c\)*/"a\1b\2\4" => a_b\3\4/g
This will give you
"a b" => a_b
"a-b" => a_b
"a / b / c" => a_b_c
Explanation:
\(\) denotes a sub-expression, order of appearance matters so \1 matches to sub-expression one and so forth.
The trick is to add a _ somewhere so we can use it and at the same time get information about the length. Because it only appears before c, the subexpression \3 will only match _ for that line.
Now, by replacing by "a\1b\2\4" we skip \3 avoiding to add an underscore.
:%s:[\ /-]\+:_:g
Explanation:
s: : : - Substitute command (with delimiter `:`)
[\ /-] - Match a ` ` (space), `/`, or `-` character
\+ - Match one or more of the previous group consecutively
_ - Replace with one `_` character
g - Replace all matches in line
% - Execute command on every line in file (optional)
I interpreted your question to be very generic. If you need to match more specific patterns, please indicate exactly what needs to be matched.
[Edit]
If you need to match ' / ' exactly, use:
:%s:\ /\ \|[\ -]:_:g
s: : : - Substitute command (with delimiter `:`)
\| - Match left pattern OR right pattern
\ /\ - Match ` / ` exactly
[\ -] - Match a ` ` (space) or `-` character
_ - Replace with one `_` character
g - Replace all matches in line
% - Execute command on every line in file (optional)
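The difference between the generic class and the exact ' / ' alternation only shows up on input that is not in the question, so here is a minimal Python sketch with a few made-up strings. Both function names are hypothetical, and Python's | plays the role of Vim's \|.

```python
import re

def collapse_runs(line):
    # Like :s:[\ /-]\+:_:g - any run of spaces, slashes or dashes becomes one _.
    return re.sub(r"[ /-]+", "_", line)

def exact_separators(line):
    # Like :s:\ /\ \|[\ -]:_:g - either ' / ' as a whole, or one space or dash.
    return re.sub(r" / |[ -]", "_", line)

for sample in ["a / b / c", "a  b", "a/b"]:
    print(repr(sample), collapse_runs(sample), exact_separators(sample))
```

On the three strings from the question both versions agree; they only diverge on doubled separators or a slash without surrounding spaces.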
[Edit 2]
I misunderstood what you wanted to substitute.
You're making your life very difficult if you're trying to do this with a
single regex. It will get so complicated, at that point you're better off
writing a small function, like some of the other answers. But you should be
able to get away with two substitution commands without it getting too crazy.
One for the first two strings (a b and a-b), and one for the third
(a / b / c).
%s:\v(\a+)[\ -](\a+):"\0"\ =>\ \1_\2
%s:\v(\a+)\s*/\s*(\a+)\s*/\s*(\a+):"\0"\ =>\ \1_\2_\3
Explanation:
%s:\v(\a+)[\ -](\a+):"\0"\ =>\ \1_\2
s: : - Substitute command (with delimiter `:`)
\v - Very Magic mode *
( ) ( ) - Capture contained matches into numbered sub-expressions
\a+ \a+ - Match at least one alphabetic character
[\ -] - Match either ` ` (space) or `-`
" "\ =>\ _ - Literal text
\0 - Replace with entire matched text
\1 \2 - Replace with first and second `()` sub-expression, respectively
% - Execute command on every line in file (optional)
%s:\v(\a+)\s*/\s*(\a+)\s*/\s*(\a+):"\0"\ =>\ \1_\2_\3
s: : - Substitute command (with delimiter `:`)
\v - Very Magic mode *
( ) ( ) ( ) - Capture contained matches into numbered sub-expressions
\a+ \a+ \a+ - Match at least one alphabetic character
\s*/\s* \s*/\s* - Match a `/` and any surrounding spaces
" "\ =>\ _ _ - Literal text
\0 - Replace with entire matched text
\1 \2 \3 - Replace with first, second, and third `()` sub-expression, respectively
% - Execute command on every line in file (optional)
* This eliminates the need for a lot of ugly backslashes.
See `:h /magic` and `:h /\v`
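As a cross-check of the two-pattern idea, here is a sketch in Python rather than Vim. It deviates in one respect, which is my assumption and not part of the commands above: it requires the whole line to match (fullmatch) and leaves other lines untouched. The names TWO, THREE and convert are made up.

```python
import re

# [A-Za-z]+ stands in for Vim's \a+ in very magic mode.
TWO = re.compile(r"([A-Za-z]+)[ -]([A-Za-z]+)")
THREE = re.compile(r"([A-Za-z]+)\s*/\s*([A-Za-z]+)\s*/\s*([A-Za-z]+)")

def convert(line):
    """Quote the line and append its parts joined by underscores."""
    m = THREE.fullmatch(line) or TWO.fullmatch(line)
    if m is None:
        return line  # not one of the two known shapes: leave it alone
    return f'"{m.group(0)}" => ' + "_".join(m.groups())

for line in ["a b", "a-b", "a / b / c"]:
    print(convert(line))
```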

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text-file with the following line:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change this line with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with sed. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand it. Please help.
You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])
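The same two-group idea can be tried out in Python, where ( ) and | need no backslashes. This is only a sketch: the sample line is shortened from the question, and TRAILING_MINUS and move_minus are made-up names.

```python
import re

# Group 1: a number such as 12,90; group 2: end of string or a non-digit.
TRAILING_MINUS = re.compile(r"([0-9][0-9,.]+)-($|[^0-9])")

def move_minus(line):
    # Put the minus in front of the number and restore whatever followed it.
    return TRAILING_MINUS.sub(r"-\1\2", line)

print(move_minus("30-06 14956 180ST 12,90-"))
```

The 30-06 field is left alone because the character after its hyphen is a digit, so group 2 cannot match there.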
In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to replace the space in front of the matched digits and comma with a dash, and drop the trailing dash. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/-\1/' <file.txt >file1.txt
Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed (spelled -E in BSD sed, and also accepted by GNU sed), which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
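The 25-07 pitfall can be reproduced in a couple of lines of Python. This is a sketch on a shortened, made-up line; Python's syntax is close to sed -r here.

```python
import re

line = "25-07 11585 12,90-"

# No anchor: the date-like field 25-07 is rewritten as well.
unanchored = re.sub(r"([0-9,]+)-", r"-\1", line)

# Anchored at the end of the line and requiring exactly one comma.
anchored = re.sub(r"([0-9]+,[0-9]+)-$", r"-\1", line)

print(unanchored)
print(anchored)
```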

Regular expression replace, back referencing replaced characters sublime text 3

I have a file with the following lines:
A 123
B 323
Each line starts with either A or B, and is followed by a blank and a number.
I am trying to convert this into
'C [a-z]*A 123'
for each line. I use a regex in Find and replace. The regex [AB] [0-9]* selects all the lines without a problem. I'm trying to replace it with 'C [a-z]*$1' that does not print $1 in the replaced string, and returns:
'C [a-z]*'
What am I missing?
Your regex - [AB] [0-9]* - has no round brackets (i.e. no capturing groups, which must be present if you wish to reference the captured subtexts later in the replacement string), and thus you do not get the expected result.
You can use
(?m)^([AB][ ][0-9]{3})
Or, if the digits are optional, use the * quantifier, which means match 0 or more occurrences of the preceding subpattern
(?m)^([AB][ ][0-9]*)
And replace with
'C [a-z]*$1'
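To illustrate how a capturing group feeds the replacement, here is a small Python sketch. Python writes the backreference as \1 where Sublime Text uses $1, and I am assuming, based on the desired output in the question, that the whole "A 123" part should be reused.

```python
import re

text = "A 123\nB 323"

# One pair of parentheses around everything that should be reused;
# (?m) makes ^ match at the start of every line.
replaced = re.sub(r"(?m)^([AB] [0-9]*)", r"'C [a-z]*\1'", text)

print(replaced)
```

The [a-z]* in the replacement is plain text; only the backreference is interpreted.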

Pipe separated values in groups of 3 regex

I have the following string
abc|ghy|33d
The regex below matches it fine
^([\d\w]{3}[|]{1})+[\d\w]{3}$
The string changes but the characters separated by the pipe are always in 3's ... so we can have
krr|455
we can also have
ddc
Here's where the problem happens: The regex explained above doesn't match the string if there is only one set of letters ... i.e. "dcc"
Let's do this step by step.
Your regex :
^([\d\w]{3}[|]{1})+[\d\w]{3}$
We can already make some simplifications. [|]{1} is equivalent to \|.
Then, we see that you match the first part (aaa|) at least once (the + operator matches at least once). Also, \w already matches digits, so [\d\w] can be written as \w.
The * operator matches 0 or more times. So:
^(?:\w{3}\|)*\w{3}$
works.
Explanation
^ matches beginning of string
(?:something)* matches something zero or more times. The group is non-capturing, as you won't need to refer to it
\w{3} matches 3 alphanumeric characters
\| matches |
$ matches end of string.
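A quick way to convince yourself is to run the pattern against a few strings. This is a minimal Python sketch; the rejected strings are my own additions, and is_valid is a made-up name.

```python
import re

TRIPLETS = re.compile(r"^(?:\w{3}\|)*\w{3}$")

def is_valid(s):
    """True if s is one or more 3-character groups separated by |."""
    return TRIPLETS.match(s) is not None

for s in ["abc|ghy|33d", "krr|455", "ddc", "abc|", "ab|cde", "abcd"]:
    print(s, is_valid(s))
```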
^[\d\w]{3}(?:[|][\d\w]{3}){0,2}$
You simply quantify the variable part. See demo.
https://regex101.com/r/tS1hW2/18
You can modify your regex as below:
^([\d\w]{3})(\|[\d\w]{3})*$
Here, first match 3 alphanumeric characters, and then any number of groups of 3 alphanumeric characters prefixed with |.
Your description is a little awkward, but I'm guessing you want to be able to match
abc
abc|def
abc|def|ghi
You can do that with
/^\w{3}(?:\|\w{3}){0,2}$/
Explanation
^ - match beginning of string
\w{3} - match any 3 of [A-Za-z0-9_]
(?: ... ){0,2} - non-capturing group, 0 to 2 matches
\| - literal | character
$ - end of string
If the goal is to match any amount of 3-letter segments, you can use
/^(?:\w{3}(?:\||$))+$/
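Here is a short Python sketch comparing the two patterns on inputs I made up. One thing it brings out: as far as I can tell, the second pattern also accepts a trailing |, because the inner alternative \| can be followed directly by the final $. UP_TO_THREE and ANY_NUMBER are hypothetical names.

```python
import re

UP_TO_THREE = re.compile(r"^\w{3}(?:\|\w{3}){0,2}$")
ANY_NUMBER = re.compile(r"^(?:\w{3}(?:\||$))+$")

def accepts(pattern, s):
    return pattern.match(s) is not None

for s in ["abc", "abc|def|ghi", "abc|def|ghi|jkl", "abc|"]:
    print(s, accepts(UP_TO_THREE, s), accepts(ANY_NUMBER, s))
```

If a trailing | must be rejected, the (?:\|\w{3})* shape is the safer choice.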

Reg Ex question

What does the following reg ex code mean?
'/^\w{4,20}$/'
It means that string should contain from 4 to 20 word characters (letters, digits, and underscores). Here:
^ (caret) matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well
$ (dollar) matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break
\w shorthand character class matching word characters (letters, digits, and underscores). Can be used inside and outside character classes.
{n,m} where n >= 0 and m >= n Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times
Let me show you a usage example. Say, we have the file with the following contents:
[spongebob#conductor /tmp]$ cat file.txt
between4and20
therearetoomanyalphanumcharacters
foo
okay
Now you want to get only those strings which match your pattern '/^\w{4,20}$/':
[spongebob#conductor /tmp]$ grep -E '^\w{4,20}$' file.txt
between4and20
okay
On output you see only those lines, which fulfil your regular expression.
Ah, also, don't confuse the ^ (caret) anchor with a ^ placed immediately after the opening [ of a character class. The latter negates the character class, causing it to match a single character not listed in it; for example, [^a-d] matches x (any character except a, b, c or d). A ^ placed anywhere else inside the class is just a literal caret.
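The same filtering, plus the two meanings of ^, can be sketched in Python. Its \w, {n,m}, ^ and $ behave like the ones described above for these inputs; the variable names are made up.

```python
import re

lines = ["between4and20", "therearetoomanyalphanumcharacters", "foo", "okay"]

# Keep only the lines that consist of 4 to 20 word characters.
kept = [s for s in lines if re.search(r"^\w{4,20}$", s)]

# ^ right after [ negates the class: every char except a, b, c, d.
outside_a_to_d = re.findall(r"[^a-d]", "abxd")

# ^ elsewhere in the class is a literal caret.
a_or_caret = re.findall(r"[a^]", "a^b")

print(kept, outside_a_to_d, a_or_caret)
```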
It means:
^ Between the beginning,
$ and the end of a given string,
\w{4,20} there should be only 4-20 alphanumeric characters (like a,b,c,d,1,2,3...etc, and also _)
I think you'll find Wikipedia's page on Regular Expressions a big, big help while learning regexes.
And just so there is no confusion, ^ and $ don't necessarily need each other,
If the regex was:
'/^\w{4,20}/'
That'd mean: The match should be at the start of the string, followed by 4-20 alphanumeric characters.
Example: in "Foobar baz", the match is "Foobar".
And if the regex pattern was:
'/\w{4,20}$/'
That'd mean: The match should be at the end of the string, preceded by 4-20 alphanumeric characters
Example: in "Foo barbaz", the match is "barbaz".
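These two examples can be checked with a few lines of Python. This is a sketch that drops the / delimiters, which Python does not use.

```python
import re

# Anchored at the start only: the leading word is enough.
start = re.search(r"^\w{4,20}", "Foobar baz")

# Anchored at the end only: the trailing word is enough.
end = re.search(r"\w{4,20}$", "Foo barbaz")

# Anchored on both sides: the space makes the whole string fail.
both = re.search(r"^\w{4,20}$", "Foobar baz")

print(start.group(), end.group(), both)
```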
/ = opening delimiter
^ = start of string
\w = word character
{x,y} = min, max number of repetitions
$ = end of string
/ = end delimiter