git grep <regex containing newline> - regex

I'm trying to grep all line breaks after some binary operators in a project using git bash on a Windows machine.
Tried the following commands which did not work:
$ git grep "[+-*\|%]\ *\n"
fatal: command line, '[+-*\|%]\ *\n': Invalid range end
$ git grep "[+\-*\|%]\ *\n"
fatal: command line, '[+\-*\|%]\ *\n': Invalid range end
OK, I don't know how to include "-" in a character set, but still after removing it the \n matches the character n literally:
$ git grep "[+*%] *\n"
somefile.py: self[:] = '|' + name + '='
^^^
Escaping the backslash once (\\n) has no effect, and escaping it twice (\\\n) causes the regex to match \n (literally).
What is the correct way to grep here?

I don't know how to include "-" in a character set
There is no need to escape the dash character (-) to include it in a character set. If you put it as the first or the last character in the set, it loses its special meaning.
Also, there is no need to escape | inside a bracket expression. Apart from ^ (when it is the first character), - (when it is neither the first nor the last character) and ] (which is literal only when it comes first, or right after a leading ^), all other characters, including \, have their literal meaning inside a bracket expression. That is why [+\-*\|%] fails: \-* is read as a range from \ to *, which is not a valid range.
There is also no need to put \n in the regexp. Grep-like tools, by default, match the regexp against one line at a time, and git grep does the same. If you need the regexp to match only at the end of a line, put $ (the end-of-line anchor) as the last character of the regexp.
Your regexp should be [-+*|%] *$.
Put together, the complete command line is:
git grep '[-+*|%] *$'
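As a quick check of the bracket expression and the $ anchor, here is a minimal sketch that uses plain grep on a throwaway file instead of git grep (git grep uses basic regular expressions by default, like grep). The file contents and the variable names tmp and matches are made up for the demonstration.

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
printf '%s\n' 'x = a +' '    b' 'y = c - ' 'z = d * e' "w = 'a-b'" > "$tmp"

# '-' comes first inside the brackets, so it is literal and starts no range.
# ' *$' allows trailing spaces before the end of the line.
matches=$(grep -n '[-+*|%] *$' "$tmp")
printf '%s\n' "$matches"

rm -f "$tmp"
```

Only the lines that end in an operator, optionally followed by spaces, should be listed; the lines that merely contain an operator somewhere in the middle are skipped.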

How to find a newline in the middle of a line
For lack of a better option, I think I'll start with:
sudo apt install pcregrep
git grep --cached -Il '' | xargs pcregrep -Mb 'y\nl'
This combines:
How to list all text (non-binary) files in a git repository?
https://unix.stackexchange.com/questions/112132/how-can-i-grep-patterns-across-multiple-lines/112134#112134
The output clearly shows the filename and line number, e.g.:
myfile.txt:123:my
love
myfile.txt:234:my
life
otherfile.txt:11:my
lion
Tested on Ubuntu 22.04.

Related

How to find and replace text within ".." using bash script

I want to replace the line #discovery.seed_hosts: ["host1","host2"] with discovery.seed_hosts: ["${extraNode1}","${extraNode2}","${masterIP}"]. I need to remove the #, replace host1 and host2 with the given arguments, and add a third value to the array.
sudo sed -i "/#discovery.seed_hosts: ["host1","host2"]/s/#discovery.seed_hosts: ["host1","host2"]/discovery.seed_hosts: ["${extraNode1}","${extraNode2}","${masterIP}"]/" check.yml
I tried the above command, but it fails with the following error because of the ["host1","host2"] in the command:
sed: -e expression #1, char 49: Invalid range end
You need to use the s (substitute) command, and escape ., in addition to [. Like this:
sudo sed -i 's/#\(discovery\.seed_hosts: \["\)host1","host2"]/\1${extraNode1}","${extraNode2}","${masterIP}"]/' check.yml
If you don't escape the ., sed will match any character in its place, as in #discovery7seed_hosts: ["host1","host2"].
The sed command is pretty straightforward. I added parentheses around the part of the match that I wanted to reuse in the substitution, which creates a group. The \1 is replaced with "group 1", that is, whatever matched between the parentheses, which must themselves be escaped.
EDIT: The ", double quotes, don't need to be escaped because the sed command is in single quotes: 's/.../.../'. Also, the ], closing square bracket, doesn't need to be escaped as long as its corresponding [, opening square bracket, has been escaped. Finally, both parentheses ( and ) need to be escaped to create the group. (END OF EDIT)
Test:
$ cat check.yml
This is a test
Another line
#discovery.seed_hosts: ["host1","host2"]
#discovery7seed_hosts: ["host1","host2"]
OK. Good bye?
$ sed 's/#\(discovery\.seed_hosts: \["\)host1","host2"]/\1${extraNode1}","${extraNode2}","${masterIP}"]/' check.yml
This is a test
Another line
discovery.seed_hosts: ["${extraNode1}","${extraNode2}","${masterIP}"]
#discovery7seed_hosts: ["host1","host2"]
OK. Good bye?
$
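Note that in the test above ${extraNode1} and the other names stay literal, because the sed script is single-quoted. If the values should come from shell variables instead, one option is to double-quote the script and escape the double quotes. A minimal sketch, assuming the values contain no /, & or backslash; the host names are made up:

```shell
# Hypothetical values; in the question these come from script arguments.
extraNode1="node-a"
extraNode2="node-b"
masterIP="master-0"

line='#discovery.seed_hosts: ["host1","host2"]'

# Double quotes let the shell expand the variables. Literal double quotes are
# written as \" while \( \) \. \[ and \1 reach sed unchanged.
result=$(printf '%s\n' "$line" |
  sed "s/#\(discovery\.seed_hosts: \[\"\)host1\",\"host2\"]/\1${extraNode1}\",\"${extraNode2}\",\"${masterIP}\"]/")
printf '%s\n' "$result"
```

Here the input line is piped in rather than edited in place, so the sketch can be tried without touching check.yml.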
You'll need to escape [s and "s with backslashes as:
sudo sed -i "/#discovery.seed_hosts: \[\"host1\",\"host2\"]/s/#discovery.seed_hosts: \[\"host1\",\"host2\"]/discovery.seed_hosts: [\"${extraNode1}\",\"${extraNode2}\",\"${masterIP}\"]/" check.yml

Grep with reg ex

Trying to use a regex with grep on the command line to get lines that start with either whitespace or lowercase int followed by a space. They must also end with either a semicolon or an o.
I tried
grep ^[\s\|int]\s+[\;\o]$ fileName
but I don't get what I'm looking for. I also tried
grep ^\s*int\s+([a-z][a-zA-Z]*,\s*)*[a-z]A-Z]*\s*;
but nothing.
Let's consider this test file:
$ cat file
keep marco
polo
int keep;
int x
If I understand your rules correctly, two of the lines above should be kept (note that the first line starts with a space) and the other two discarded.
Let's try grep:
$ grep -E '^(\s|int\s).*[;o]$' file
keep marco
int keep;
The above uses \s to match whitespace. \s is supported by GNU grep. For other greps, we can use a POSIX character class instead. After reorganizing the pattern slightly to reduce typing:
grep -E '^(|int)[[:blank:]].*[;o]$' file
How it works
In a Unix shell, the single quotes in the command are critical: they stop the shell from interpreting or expanding any character inside the single quotes.
-E tells grep to use extended regular expressions. This reduces the need for backslashes.
Let's examine the regular expression, one piece at a time:
^ matches at the beginning of a line.
(\s|int\s) This matches either a space or int followed by a space.
.* matches zero or more of any character.
[;o] matches any character in the square brackets which means that it matches either ; or o.
$ matches at the end of a line.
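The pieces above can be tried end to end with a small sketch. It rebuilds the test file with printf so that the leading space on the first line is explicit, and it writes the optional int as (int)? rather than (|int), which is an assumption made here to avoid relying on an empty alternative; tmp and kept are made-up names.

```shell
# Recreate the test file; the first line deliberately starts with a space.
tmp=$(mktemp)
printf '%s\n' ' keep marco' 'polo' 'int keep;' 'int x' > "$tmp"

# Optional "int", then a blank, then anything, ending in ';' or 'o'.
kept=$(grep -E '^(int)?[[:blank:]].*[;o]$' "$tmp")
printf '%s\n' "$kept"

rm -f "$tmp"
```

polo is dropped because it does not start with a blank, and int x is dropped because it ends in neither ; nor o.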

Match a string using grep

I want to match the below string using a regular expression in grep command.
File name is test.txt,
Unknown Unknown
Jessica Patiño
Althea Dubravsky 45622
Monique Outlaw 49473
April Zwearcan 45758
Tania Horne 45467
I want to match the lines containing special characters alone from the above list of lines; the line which I exactly need is 'Jessica Patiño', which contains a non-ASCII character.
I used,
grep '[^0-9a-zA-Z]' test.txt
But it returns all lines.
The following command should return the lines you want:
grep -v '^[0-9a-zA-Z ]*$' test.txt
Explanation
[0-9a-zA-Z ] matches a space or any alphanumeric character.
Adding the asterisk matches any string containing only these characters.
Prepending the pattern with ^ and appending it with $ anchors the string to the beginning and end of line so that the pattern matches only the lines which contain only the desired characters.
Finally, the -v or --invert-match option to grep inverts the sense of matching, i.e., select non-matching lines.
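A minimal sketch of the inverted match on part of the sample data. Setting LC_ALL=C is an assumption added here, not part of the answer: it keeps a-z and A-Z limited to ASCII letters regardless of the machine's locale. The names tmp and odd are made up.

```shell
# A few of the sample lines from the question.
tmp=$(mktemp)
printf '%s\n' 'Unknown Unknown' 'Jessica Patiño' 'Althea Dubravsky 45622' 'Monique Outlaw 49473' > "$tmp"

# Select every line that is NOT made up solely of ASCII letters, digits and spaces.
odd=$(LC_ALL=C grep -v '^[0-9a-zA-Z ]*$' "$tmp")
printf '%s\n' "$odd"

rm -f "$tmp"
```

Names containing hyphens or apostrophes would be selected too, since those characters are not in the bracket expression either.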
The provided answers should work for the example text given. However, you're likely to come across people with hyphens or apostrophes in their names, etc. To search for all non-ASCII characters, this should do the trick:
grep -P "[\x00-\x1F\x7F-\xFF]" test.txt
-P enables "Perl" mode and allows use of character code searches. \x00-\x1F are control characters, and \x7F-\xFF is everything above 126.
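I would use:
grep -P '[^0-9a-zA-Z\s]' test.txt
Or, even better:
grep -Pi '[^\da-z\s]' test.txt
-P is needed here because \s inside brackets and \d are Perl-style escapes; without it, grep treats \s in a bracket expression as a literal backslash and s.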

Egrep expression: how to unescape single quotes when reading from file?

I need to use egrep to obtain an entry in an index file.
In order to find the entry, I use the following command:
egrep "^$var_name" index
$var_name is the variable read from a var list file:
while read var_name; do
egrep "^$var_name" index
done < list
One of the possible keys comes usually in this format:
$ERROR['SOME_VAR']
My index file is in the form:
$ERROR['SOME_VAR'] --> n
Where n is the line where the variable is found.
The problem is that $var_name is automatically escaped when read. When I enable the debug mode, I get the following command being executed:
+ egrep '^$ERRORS['\''SELECT_COUNTRY'\'']' index
The command above doesn't work, because egrep will try to interpret the pattern.
If I don't use the extended version, using grep or fgrep, the command will work only if I remove the ^ anchor:
grep -F "$var_name" index # this actually works
The problem is that I need to ensure that the match is made at the beginning of the line.
Ideas?
set -x shows the command being executed in shell notation.
The backslashes you see do not become part of the argument, they're just printed by set -x to show the executed command in a copypastable format.
Your problem is not too much escaping, but too little: $ in regex means "end of line", so ^$ERROR will never match anything. Similarly, [ ] is a bracket expression, and will not match literal square brackets.
The correct regex to match your pattern would be ^\$ERROR\['SOME_VAR'], equivalent to the shell argument in egrep "^\\\$ERROR\['SOME_VAR']".
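To see the escaped pattern at work, here is a small sketch that feeds two made-up index lines through a pipe. It uses grep -E, which is the equivalent of egrep; the variable name hit and the line numbers 12 and 40 are hypothetical.

```shell
# Hypothetical index contents; only the first line starts with the key.
hit=$(printf '%s\n' "\$ERROR['SOME_VAR'] --> 12" "x = \$ERROR['SOME_VAR'] --> 40" |
  grep -E "^\\\$ERROR\['SOME_VAR']")
printf '%s\n' "$hit"
```

Inside double quotes the shell turns \\ into \ and \$ into $, so grep receives ^\$ERROR\['SOME_VAR'] and matches the dollar sign and the opening bracket literally.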
Your options to fix this are:
If you expect to be able to use regex in your input file, you need to include regex escapes like above, so that your patterns are valid.
If you expect to be able to use arbitrary, literal strings, use a tool that can match flexibly and literally. This requires jumping through some hoops, since UNIX tools for legacy reasons are very sloppy.
Here's one with awk:
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
It passes the string in through the environment (because -v would interpret backslash escape sequences in the value) and then matches it literally against the start of each input line.
Here's an example invocation:
$ cat script
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
$ cat list
$ERRORS['SOME_VAR']
\E and \Q
'"'%##%*'
$ cat index
hello world
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
etc
$ bash script
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
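The same literal prefix test can also be written with awk's index() function, which returns the position of the first occurrence of a string, so a result of 1 means the line starts with it. This is a variation on the script above, not the answer's own code; the file contents and the names tmp, key and found are made up.

```shell
# Hypothetical index file: only the second line starts with the key.
tmp=$(mktemp)
printf '%s\n' 'hello world' "\$ERRORS['SOME_VAR'] = 'foo';" "see \$ERRORS['SOME_VAR']" > "$tmp"

# Pass the key through the environment so awk does not process backslashes in it.
key="\$ERRORS['SOME_VAR']"
export key

# index() == 1 means the literal key occurs at the very start of the line.
found=$(awk 'BEGIN { k = ENVIRON["key"] } index($0, k) == 1' "$tmp")
printf '%s\n' "$found"

rm -f "$tmp"
```

No regex is involved at any point, so characters such as $, [ and ' in the key need no escaping.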
You can use printf "%q":
while read -r var_name; do
egrep "^$(printf "%q\n" "$var_name")" index
done < list
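Update: With a grep that supports Perl-compatible regular expressions (-P, for example GNU grep), you can also do:
while IFS= read -r var_name; do
grep -P "^\Q$var_name\E" index
done < list
Here \Q and \E make the string between them literal, removing the special meaning of all regex symbols. This breaks if the string itself contains \E.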

unix sed command regular expression

Can anyone explain to me how the regular expression works in the sed substitute command?
$ cat path.txt
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/sbin:/sbin:/bin/:/usr/sbin:/usr/bin:/opt/omni/bin:
/opt/omni/lbin:/opt/omni/sbin:/root/bin
$ sed 's/\(\/[^:]*\).**/\1/g' path.txt
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
The above sed command uses a back reference and the save operator.
Can anyone explain how the regular expression, especially /[^:]*, works in the substitute command to get only the first path in each line?
I think you wrote an extra asterisk * in your sed code, so it should be like this:
$ sed 's/\(\/[^:]*\).*/\1/g' file
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
Changing the delimiter makes it a little easier to read:
sed 's#\(/[^:]*\).*#\1#g'
s#something#otherthing#g is a basic sed command that looks for something and replaces it with otherthing, for every match on each line.
If you write s#\(something\)#\1#g then you "save" that something and can print it back with \1.
Hence, what it does is capture a pattern like /[^:]* and then print it back. /[^:]* means / followed by any number of characters other than :. So it takes / plus everything up to the first colon :. It stores that piece of the string and then prints it back, while the trailing .* matches the rest of the line, which is therefore dropped.
Small examples:
# get the leading run of lowercase letters
$ echo "hello123bye" | sed 's#\([a-z]*\).*#\1#g'
hello
# get everything up to (but not including) the first digit 3
$ echo "hello123bye" | sed 's#\([^3]*\).*#\1#g'
hello12
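Applied to the first line of path.txt, the same substitution can be sketched as follows. The parameter expansion ${path_line%%:*} is an alternative added here for comparison, not something from the question; path_line, first and first_pe are made-up names.

```shell
path_line='/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin'

# The group captures "/" plus everything up to the first ":"; ".*" swallows the rest.
first=$(printf '%s\n' "$path_line" | sed 's#\(/[^:]*\).*#\1#')

# Shell-only alternative: strip the longest suffix that starts with ":".
first_pe=${path_line%%:*}

printf '%s\n' "$first" "$first_pe"
```

The g flag is left out because the pattern already consumes the whole line, so there is nothing left for a second match.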
[^:]*
in a regex matches any run of characters other than :, so at the start of the first line it matches up to this:
/usr/kbos/bin
It would also match each of these:
/usr/local/bin
/usr/jbin
/usr/bin
/usr/sas/bin
since these all consist of characters that are not :.
.* matches any character, zero or more times.
Thus the regex [^:]*.* would match all of these expressions:
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/jbin:/usr/bin:/usr/sas/bin
/usr/bin:/usr/sas/bin
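However, you get only the first field (i.e., /usr/kbos/bin) through the back reference in sed, because the regex engine takes the leftmost match and makes it as long as possible: the group \(/[^:]*\) captures the first path, .* consumes the rest of the line, and the whole match is replaced by \1.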