bash grep - negative match - regex

I want to flag places in my Python unittests where I have been lazy and de-activated tests.
But I also have conditional executions that are not laziness; they are motivated by performance or system conditions at time of testing. Those are the skipUnless ones, and I want to ignore them entirely.
Let's take some inputs that I have put in a file, test_so_bashregex.txt, with some comments.
!ignore this, because skipUnless means I have an acceptable conditional flag
#unittest.skipUnless(do_test, do_test_msg)
def test_conditional_function():
xxx
!catch these 2, lazy test-passing
#unittest.skip("fb212.test_urls_security_usergroup Test_Detail.test_related fails with 302")
def sometest_function():
xxx
#unittest.expectedFailure
def test_another_function():
xxx
!bonus points... ignore things that are commented out
# #unittest.expectedFailure
Additionally, I can't use a grep -v skipUnless in a pipe because I really want to use egrep -A 3 xxx *.py to give some context, as in:
grep -A 3 "#unittest\." *.py
test_backend_security_meta.py: #unittest.skip("rewrite - data can be legitimately missing")
test_backend_security_meta.py- def test_storage(self):
test_backend_security_meta.py- with getMultiDb() as mdb:
test_backend_security_meta.py-
What I have tried:
Trying https://www.debuggex.com/
I tried #unittest\.(.+)(?!(Unless\()) and that didn't work, as it matches the first 3.
Ditto #unittest\.[a-zA-Z]+(?!(Unless\())
#unittest\.skip(?!(Unless\()) worked partially, on the 2 with skip.
All of those do partial matches despite the presence of Unless.
On bash egrep, which is where this is going to end up, things don't look much better.
jluc#explore$ egrep '#unittest\..*(?!(Unless))' test_so_bashregex.txt
egrep: repetition-operator operand invalid

you could try this regex:
(?<!#\s)#unittest\.(?!skipUnless)(skip|expectedFailure).*
if you don't care if 'skip' or 'expectedFailure' appear you could simplify it:
(?<!#\s)#unittest\.(?!skipUnless).*
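Note these rely on lookarounds, which classic grep/egrep can't handle. If your grep has PCRE support (GNU grep's -P switch, where available), the whole thing works in a single stage, context flag included:
grep -P -A 3 '(?<!#\s)#unittest\.(?!skipUnless)' test_so_bashregex.txt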

How about something like this - grep seems a bit restrictive
items=$(find . -name "*.py")
for item in $items; do
    awk '
        # flag lazy deactivations, but let skipUnless pass through
        /^#unittest.*expectedFailure/ { seen_skip = 1 }
        /^#unittest.*skip/ && !/skipUnless/ { seen_skip = 1 }
        /^def/ {
            if (seen_skip == 1)
                print "Being lazy at " FILENAME ": " $2
            seen_skip = 0
        }
    ' "$item"
done

OK, I'll put up what I found with sweaver2112's help, but if someone has a good single-stage grep-ready regex, I'll take it.
bash's egrep/grep doesn't support lookaheads like ?! (hence grep: repetition-operator operand invalid). End of story there.
What I have done instead is to pipe it through some extra filters: a negative grep -v skipUnless and another one to strip leading comments. These 2 strip out the unwanted lines. Then I pipe their output back into another grep looking for #unittest, again with the -A 3 flag.
If the negative greps have cleared out a line, it won't show in the last pipe stage, so it drops out of the input. If not, I get my context right back.
egrep -A 3 -n '#unittest\.' test_so_bashregex.txt | egrep -v "^\s*#" | egrep -v "skipUnless\(" | grep #unittest -A 3
output:
7:#unittest.skip("fb212.test_urls_security_usergroup Test_Detail.test_related fails with 302")
8-def sometest_function():
9- xxx
10:#unittest.expectedFailure
11-def test_another_function():
12- xxx
And my actual output from running it on *.py files, rather than my test file:
egrep -A 3 -n '#unittest\.' *.py | egrep -v "\d:\s*#" | egrep -v "skipUnless\(" | grep #unittest -A 3
output:
test_backend_security_meta.py:77: #unittest.skip("rewrite - data can be legitimately missing")
test_backend_security_meta.py-78- def test_storage(self):
test_backend_security_meta.py-79- with getMultiDb() as mdb:
test_backend_security_meta.py-80-
--
test_backend_security_meta.py:98: #unittest.skip("rewrite - data can be legitimately missing")
test_backend_security_meta.py-99- def test_get_li_tag_for_object(self):
test_backend_security_meta.py-100- li = self.mgr.get_li_tag()
test_backend_security_meta.py-101-

Related

Print 1 Occurrence for Each Pattern Match

I have a file that contains a pattern at the beginning of each line:
./bob/some/text/path/index.html
./bob/some/other/path/index.html
./bob/some/text/path/index1.html
./sue/some/text/path/index.html
./sue/some/text/path/index2.html
./sue/some/other/path/index.html
./john/some/text/path/index.html
./john/some/other/path/index.html
./john/some/more/text/index1.html
... etc.
I came up with the following code to match the ./{name}/ pattern and would like to print one occurrence of each name. But it either prints out every line matching that pattern or, when using the -m 1 flag, just one and stops:
I've tried it as a simple grep line (below) and also put it in a for loop
name=$(grep -iEoha -m 1 '\.\/([^/]*)\/' ./without_localnamespace.txt)
echo $name
My expected results are:
./bob/
./sue/
./john/
Actual Results are:
./bob/
This awk one-liner prints the first two /-separated fields the first time each name ($2) is seen:
awk -F'/' '!a[$2]++{print $1 FS $2 FS}' input
./bob/
./sue/
./john/
You can do
cut -d "/" -f2 ./without_localnamespace.txt | sort -u
(note this prints the bare names, e.g. bob, rather than the ./bob/ form)
You seem to want unique occurrences, use
grep -Eoha '\./[^/]*/' ./without_localnamespace.txt | uniq
Regarding the pattern, you do not need to escape forward slashes, they are not special regex metacharacters. The -i flag is redundant here, too.
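Also note that uniq only collapses adjacent duplicates, so the grep | uniq version relies on the names arriving grouped as they do in the sample; if they might be interleaved, sort instead:
grep -Eoha '\./[^/]*/' ./without_localnamespace.txt | sort -u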

Separating specific words with underscores, but not the plural form

I've been working with regex on strings recently and I've hit a snag. You see, I'm trying to get this:
chocolatecakes
thecakeismine
cakessurpassexpectation
to do this:
chocolate_cakes
the_cake_ismine
cakes_surpassexpectation
However, when I use this:
#!/bin/bash
words_array=(is cake)
number_of_times=0
word_underscorer (){
echo $1 | sed -r "s/([a-z])($2)/\1_\2/g" | sed -r "s/($2)([a-z])/\1_\2/g"
}
for words_to_underscore in "${words_array[@]}"; do
if [ "$number_of_times" -eq 0 ]; then
first=`word_underscorer "chocolatecakes" "$words_to_underscore"`
second=`word_underscorer "thecakeismine" "$words_to_underscore"`
third=`word_underscorer "cakessurpassexpectation" "$words_to_underscore"`
else
word_underscorer "$first" "$words_to_underscore"
word_underscorer "$second" "$words_to_underscore"
word_underscorer "$third" "$words_to_underscore"
fi
echo "$first"
echo "$second"
echo "$third"
done
I get this:
chocolate_cake_s
the_cake_ismine
cake_ssurpassexpectation
I'm not sure how to fix this.
Based on what you've shown you could do something such as:
sed -r -e "s/($2)/_\1_/g" -e "s/($2)_s|^($2)(_*)/\1s\2_/g" -e "s/^_|_$//g"
That should return the final result of:
chocolate_cakes
the_cake_ismine
cakes_surpassexpectation
The idea here is process of elimination. That is not to say this method doesn't have potential issues; you'll hopefully understand what I mean below. Each sed operation is labeled by number to help you see what is happening.
The sed commands work on the array, which starts out with "is" and then "cake":
1. is -> _is_
2. is_s or is_ -> iss or is_
3. _is_ -> is
1. cake -> _cake_
2. cake_s or cake_ -> cakes or cake_
3. _cake_ -> cake
string one:
1. chocolatecakes -> chocolate_cake_s
2. chocolate_cake_s -> chocolate_cakes_
3. chocolate_cakes_ -> chocolate_cakes
string two:
1. thecake_is_mine -> the_cake_ismine
2. the_cake_ismine -> no change
3. the_cake_ismine -> no change
string three:
1. cakessurpassexpectation -> _cake_ssurpassexpectation
2. _cake_ssurpassexpectation -> _cakes_surpassexpectation
3. _cakes_surpassexpectation -> cakes_surpassexpectation
So you can see here what the issue might be with the "is" portion of the array; it could get broken up in an undesired way during the sed operation if it somehow ends up becoming "is_s" on operation number 2. This is where you'll want to test multiple combinations of your strings to ensure you've covered all the scenarios you don't want. Once you've done that, you can go back and refine the patterns as needed, or find ways to optimize things so that you use fewer piped commands. See the quick check after this paragraph.
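For instance, running just the cake pass of that pipeline over the three sample strings already produces the desired output:
for s in chocolatecakes thecakeismine cakessurpassexpectation; do
    echo "$s" | sed -r -e "s/(cake)/_\1_/g" -e "s/(cake)_s|^(cake)(_*)/\1s\2_/g" -e "s/^_|_$//g"
done
chocolate_cakes
the_cake_ismine
cakes_surpassexpectation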
If you write the words to a file (words) then you can do something like this:
sed -e 's/\('"$(sed ':l;N;s/\n/\\|/;bl' words)"'\)/\1_/g' -e 's/_$//' input
This gives you:
chocolate_cakes
the_cake_ismine
cakes_surpassexpectation
The main point is to construct this sed command:
sed -e s/\(chocolate\|cake\|the\|cakes\)/\1_/g -e s/_$// input
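To see the construction step on its own, with an illustrative words file containing chocolate, cake, the, and cakes on separate lines (GNU sed, which prints the pattern space when N runs out of input):
$ printf '%s\n' chocolate cake the cakes > words
$ sed ':l;N;s/\n/\\|/;bl' words
chocolate\|cake\|the\|cakes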
This might work for you (GNU sed):
sed -r 's/\B([^_])\B(cakes?|is)\B/\1_\2/g;s/(cakes?|is)\B([^_])\B/\1_\2/g' file
Insert an underscore in front of/behind a particular word if that word occurs within another word and the character before/after it is not an underscore.

Grep to select the searched-for regexp surrounded on either/both sides by a certain number of characters? [duplicate]

I want to run ack or grep on HTML files that often have very long lines. I don't want to see very long lines that wrap repeatedly. But I do want to see just that portion of a long line that surrounds a string that matches the regular expression. How can I get this using any combination of Unix tools?
You could use the grep options -oE, possibly in combination with changing your pattern to ".{0,10}<original pattern>.{0,10}" in order to see some context around it:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e., force grep to behave as egrep).
For example (from @Renaud's comment):
grep -oE ".{0,10}mysearchstring.{0,10}" myfile.txt
Alternatively, you could try -c:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
Pipe your results through cut. I'm also considering adding a --cut switch so you could say --cut=80 and only get 80 columns.
You could use less as a pager for ack and chop long lines: ack --pager="less -S" This retains the long line but leaves it on one line instead of wrapping. To see more of the line, scroll left/right in less with the arrow keys.
I have the following alias setup for ack to do this:
alias ick='ack -i --pager="less -R -S"'
grep -oE ".\{0,10\}error.\{0,10\}" mylogfile.txt
In the unusual situation where you cannot use -E, use lowercase -e instead.
Explanation:
cut -c 1-100
gets characters from 1 to 100.
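For instance, to trim each matching line to its first 100 characters (reusing the mylogfile.txt example from above):
grep error mylogfile.txt | cut -c 1-100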
The Silver Searcher (ag) supports this natively via the --width NUM option. It will replace the rest of longer lines with [...].
Example (truncate after 120 characters):
$ ag --width 120 '@patternfly'
...
1:{"version":3,"file":"react-icons.js","sources":["../../node_modules/#patternfly/ [...]
In ack3, a similar feature is planned but currently not implemented.
Taken from: http://www.topbug.net/blog/2016/08/18/truncate-long-matching-lines-of-grep-a-solution-that-preserves-color/
The suggested approach ".{0,10}<original pattern>.{0,10}" is perfectly good, except that the highlighting color is often messed up. I've created a script with similar output where the color is preserved:
#!/bin/bash
# Usage:
# grepl PATTERN [FILE]
# how many characters around the searching keyword should be shown?
context_length=10
# What is the length of the control character for the color before and after the
# matching string?
# This is mostly determined by the environmental variable GREP_COLORS.
control_length_before=$(($(echo a | grep --color=always a | cut -d a -f '1' | wc -c)-1))
control_length_after=$(($(echo a | grep --color=always a | cut -d a -f '2' | wc -c)-1))
grep -E --color=always "$1" $2 |
grep --color=none -oE \
".{0,$(($control_length_before + $context_length))}$1.{0,$(($control_length_after + $context_length))}"
Assuming the script is saved as grepl, then grepl pattern file_with_long_lines should display the matching lines but with only 10 characters around the matching string.
I put the following into my .bashrc:
grepl() {
$(which grep) --color=always "$@" | less -RS
}
You can then use grepl on the command line with any arguments that are available for grep. Use the arrow keys to see the tail of longer lines. Use q to quit.
Explanation:
grepl() {: Define a new function that will be available in every (new) bash console.
$(which grep): Get the full path of grep. (Ubuntu defines an alias for grep that is equivalent to grep --color=auto. We don't want that alias but the original grep.)
--color=always: Colorize the output. (--color=auto from the alias won't work since grep detects that the output is put into a pipe and won't color it then.)
"$@": Put all arguments given to the grepl function here.
less: Display the lines using less
-R: Show colors
-S: Don't break long lines
Here's what I do:
function grep () {
tput rmam;
command grep "$@";
tput smam;
}
In my .bash_profile, I override grep so that it automatically runs tput rmam before and tput smam after, which disables wrapping and then re-enables it.
ag can also take the regex trick, if you prefer it:
ag --column -o ".{0,20}error.{0,20}"

how to replace part of a string using sed

echo "/home/repository/tags/1.9.1/1.9.1.8/core" | sed "s/HELP/XXX/g"
I would like some HELP in replacing what is in between tags and core with let's say XXX. So my desired output would be /home/repository/tags/XXX/core.
The string is a directory path, where /home/repository/tags is the only constant part. The path is always six levels deep, so the part to replace may not always be between tags and core.
echo "/home/repository/whatever/1.9.1/1.9.1.8/core/and/more/junk" \
| sed 's#\(/[^/]*/[^/]*/[^/]*\)/[^/]*/[^/]*#\1/XXX#'
yields ...
/home/repository/whatever/XXX/core/and/more/junk
By using repetition quantifiers, you can easily adjust where your replacement is made:
echo "/home/repository/tags/1.9.1/1.9.1.8/core" | \
sed -r 's|(/([^/]+/){3})([^/]+/){2}(.*)|\1XXX/\4|'
3 represents how many components to keep at the beginning
2 represents how many to replace
You could even use variables:
$ dirs='/one/two/three/four/five/six/seven/eight'
$ for keep in {0..3}; do for replace in {0..3}; do echo "$dirs" | \
sed -r "s|(/([^/]+/){$keep})([^/]+/){$replace}(.*)|\1XXX/\4|"; done; done
/XXX/one/two/three/four/five/six/seven/eight
/XXX/two/three/four/five/six/seven/eight
/XXX/three/four/five/six/seven/eight
/XXX/four/five/six/seven/eight
/one/XXX/two/three/four/five/six/seven/eight
/one/XXX/three/four/five/six/seven/eight
/one/XXX/four/five/six/seven/eight
/one/XXX/five/six/seven/eight
/one/two/XXX/three/four/five/six/seven/eight
/one/two/XXX/four/five/six/seven/eight
/one/two/XXX/five/six/seven/eight
/one/two/XXX/six/seven/eight
/one/two/three/XXX/four/five/six/seven/eight
/one/two/three/XXX/five/six/seven/eight
/one/two/three/XXX/six/seven/eight
/one/two/three/XXX/seven/eight
If your directory is always 6 levels deep, this works (remember to escape the round brackets):
echo "/home/repository/tags/1.9.1/1.9.1.8/core" |
sed 's/\(\/home\/repository\/tags\/\).*\/.*\(\/.*\)/\1XXX\2/'
produces:
/home/repository/tags/XXX/core
Here, spare yourself some regex agony:
echo "/home/repository/tags/1.9.1/1.9.1.8/core" | sed 's#/home/repository/tags/.*/\(.\+\)$#/home/repository/tags/XXX/\1#'
No need to explicitly match the components if all you're really trying to do is strip out everything between tags/ and the last component. Note that I used + not *, so the component must be nonempty. That'll guard against having a trailing slash.

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppress automatic printing of lines (only print when asked, see /p)
-r extended regex, so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
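Applied to the sample file, this prints just the captured digits (GNU sed; on BSD/macOS use -E instead of -r):
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp' example.txt
12345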
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl; the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
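Run against the example file, this yields:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//' example.txt
12345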
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
You can use GNU awk (gawk) with the three-argument form of match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
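The three-argument match() is a gawk extension. On a strictly POSIX awk, a sketch of the same idea using RSTART/RLENGTH, stripping the surrounding abc/xyz after extracting the match:
awk 'match($0, /abc[0-9]+xyz/) { s = substr($0, RSTART, RLENGTH); sub(/^abc/, "", s); sub(/xyz$/, "", s); print s }' file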
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks for the pattern [0-9]+ when it occurs between abc and xyz and prints just the digits.
perl has the cleanest syntax, but if you don't have perl (not always there, I understand), then one way to use the components of a regex in gawk is the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/, "\\1", "g"); }' < file
Output for the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after abc([0-9]+)xyz to get rid of the text before and after the number in the substitution. Keeping abc and xyz in the pattern also anchors the capture, so the greedy leading .* can't eat into the digits.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
(Edited after realizing a zero-length $2 would trip up my previous solution.)
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but with a complex regex that matches multiple times per line, or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
}'
If you also run OFS = ""; $1 = $1;, then instead of needing 4-argument split() or patsplit(), both of which are gawk-specific, the entire $0's fields are laid out in a data1-sep1-data2-sep2-... pattern, all while $0 looks EXACTLY the same as when you first read in the line. A straight-up print will be byte-for-byte identical to immediately printing upon reading.
Once I tested it to the extreme using a regex that represents valid UTF8 characters. It took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single character of either ASCII or multi-byte UTF8 Unicode.
You can do it with the shell:
while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            t="${line##*abc}"
            echo "num is ${t%%xyz*}";;
    esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
    print $0;
    next;
}
{
    # default: do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file
(Note this prints each whole matching line, not just the captured digits.)