regex in bash-script to exclude certain word - regex

I want to exclude "cgs" and "CGS" but select all other data.
Testdata:
Exclude these:
CSP999_20151204080019_0054236_000_CGS.csv
CSP999_20151204080019_0054236_000_cgs.csv
accept all other.
I tried something like this .*([Cc][Gg][Ss]).* to select the cgs, but I don't understand the exclude thing =) It must be a filename_pattern without grep.
Kind Regards,
Bobby

Does it have to be a regexp? You can easily do it with a glob pattern, if you set in your script
shopt -s extglob
to enable extended globbing. You would then use the pattern
!(*[Cc][Gg][Ss]*)
to generate all entries which do NOT have CGS in their name.
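A minimal sketch of that pattern in use, assuming bash with extglob support. The scratch directory and the shortened file names are made up for illustration.

```bash
#!/usr/bin/env bash
# extglob has to be enabled before bash parses a line that uses !(...).
shopt -s extglob

# Hypothetical scratch directory with sample files (names are illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/CSP999_000_CGS.csv" \
      "$demo_dir/CSP999_000_cgs.csv" \
      "$demo_dir/CSP999_000_abc.csv" \
      "$demo_dir/CSP999_000_xyz.csv"
cd "$demo_dir" || exit 1

# Collect every entry whose name does NOT contain cgs in any letter case.
kept=( !(*[Cc][Gg][Ss]*) )
printf '%s\n' "${kept[@]}"
```

Storing the matches in an array keeps names with spaces intact, which parsing the output of ls would not.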

grep --invert-match --ignore-case cgs < filenames_list

extglob bash option
Try this:
ls -ld $path/!(*[cC][gG][sS].csv)
And have a look at
man -Pless\ +/extglob bash
If the extglob shell option is enabled using the shopt builtin, several
extended pattern matching operators are recognized. In the following
description, a pattern-list is a list of one or more patterns separated
by a |. Composite patterns may be formed using one or more of the
following sub-patterns:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
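
To see how the operators behave, here is a small sketch that tests strings against patterns with [[ ... == ... ]]. The helper names matches and report are made up for this example, and the sample strings are assumptions.

```bash
#!/usr/bin/env bash
shopt -s extglob   # enable the extended pattern matching operators

# matches STRING PATTERN: succeed when the whole STRING matches PATTERN.
# $2 is left unquoted on purpose so that it is treated as a pattern.
matches() {
    [[ $1 == $2 ]]
}

# report STRING PATTERN: print whether STRING matches PATTERN.
report() {
    if matches "$1" "$2"; then
        printf '%-14s matches    %s\n' "$1" "$2"
    else
        printf '%-14s no match   %s\n' "$1" "$2"
    fi
}

report 'file.csv'     'file?(_old).csv'      # zero occurrences of _old
report 'file_old.csv' 'file?(_old).csv'      # one occurrence of _old
report 'ababab'       '*(ab)'                # repeated occurrences
report 'ababab'       '+(ab)'
report ''             '+(ab)'                # + needs at least one occurrence
report 'CGS'          '@(cgs|CGS)'
report 'CGSCGS'       '@(cgs|CGS)'           # @ takes exactly one alternative
report 'data.csv'     '!(*[Cc][Gg][Ss]*)'
report 'x_cgs.csv'    '!(*[Cc][Gg][Ss]*)'
```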

Following may help you:
ls |grep -iEv "cgs"

Using invert match of grep:
grep -v 'cgs\|CGS' <filelist
Or,
ls | grep -v 'cgs\|CGS'

Related

regex quantifiers in bash --simple vs extended matching {n} times

I'm using the bash shell and trying to list files in a directory whose names match regex patterns. Some of these patterns work, while others don't. For example, the * wildcard is fine:
$ls FILE_*
FILE_123.txt FILE_2345.txt FILE_789.txt
And the range pattern captures the first two of these with the following:
$ls FILE_[1-3]*.txt
FILE_123.txt FILE_2345.txt
but not the filename with the "7" character after "FILE_", as expected. Great. But now I want to count digits:
$ls FILE_[0-9]{3}.txt
ls: FILE_[0-9]{3}.txt: No such file or directory
Shouldn't this give me the filenames with three numeric digits following "FILE_" (i.e. FILE_123.txt and FILE_789.txt, but not FILE_2345.txt)? Can someone tell me how I should be using the {n} quantifier (i.e. "match this pattern n times")?
ls receives the result of glob expansion, and glob patterns have no {3} quantifier. You have to use FILE_[0-9][0-9][0-9].txt. Or you could use the following command, with the dot escaped and the pattern anchored so that only whole names match:
ls | grep -E '^FILE_[0-9]{3}\.txt$'
Edit:
Or you can also use the find command:
find . -regextype egrep -regex '.*/FILE_[0-9]{3}\.txt'
The .*/ prefix is needed to match a complete path. On Mac OS X :
find -E . -regex ".*/FILE_[0-9]{3}\.txt"
Bash filename expansion does not use regular expressions. It uses glob pattern matching, which is distinctly different. In FILE_[0-9]{3}.txt the {3} is not a quantifier: brace expansion leaves it unchanged because it contains neither a comma nor a .. range, so the glob would only match names that literally contain {3}. Even bash's extended globbing feature doesn't have an equivalent to regular expression's {N}, so as already mentioned you have to use FILE_[0-9][0-9][0-9].txt
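A short sketch of both routes inside bash itself, assuming a scratch directory that reproduces the three file names from the question. The variable names are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical scratch directory with the file names from the question.
demo_dir=$(mktemp -d)
touch "$demo_dir/FILE_123.txt" "$demo_dir/FILE_2345.txt" "$demo_dir/FILE_789.txt"
cd "$demo_dir" || exit 1

# Option 1: plain glob, one bracket expression per digit.
by_glob=( FILE_[0-9][0-9][0-9].txt )

# Option 2: broad glob first, then a real regular expression via [[ =~ ]].
re='^FILE_[0-9]{3}\.txt$'
by_regex=()
for f in FILE_*; do
    if [[ $f =~ $re ]]; then
        by_regex+=( "$f" )
    fi
done

printf 'glob:  %s\n' "${by_glob[*]}"
printf 'regex: %s\n' "${by_regex[*]}"
```

Keeping the regular expression in a variable avoids quoting surprises on the right-hand side of =~.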

Match fixed string + numbers 0-10 with grep

I have a list of files such as this:
Sample_lane1-Bob10_R1.fastq.gz
Sample_lane1-Bob1_R1.fastq.gz
Sample_lane1-Bob2_R1.fastq.gz
Sample_lane1-Bob4_R1.fastq.gz
Sample_lane1-Bob5_R1.fastq.gz
Sample_lane1-Bob7_R1.fastq.gz
Sample_lane1-Bob8_R1.fastq.gz
Sample_lane1-Bob9_R1.fastq.gz
Sample_lane2-Bob10_R1.fastq.gz
Sample_lane2-Bob1_R1.fastq.gz
Sample_lane2-Bob3_R1.fastq.gz
Sample_lane2-Bob4_R1.fastq.gz
Sample_lane2-Bob6_R1.fastq.gz
Sample_lane2-Bob7_R1.fastq.gz
Sample_lane2-Bob8_R1.fastq.gz
Sample_lane2-Bob9_R1.fastq.gz
Sample_lane3-Bob11_R1.fastq.gz
Sample_lane3-Bob12_R1.fastq.gz
Sample_lane3-Bob13_R1.fastq.gz
Sample_lane3-Bob15_R1.fastq.gz
Sample_lane3-Bob16_R1.fastq.gz
Sample_lane3-Bob18_R1.fastq.gz
Sample_lane3-Bob19_R1.fastq.gz
Sample_lane3-Bob20_R1.fastq.gz
Sample_lane5-Bob11_R1.fastq.gz
Sample_lane5-Bob12_R1.fastq.gz
Sample_lane5-Bob16_R1.fastq.gz
Sample_lane5-Bob17_R1.fastq.gz
Sample_lane5-Bob19_R1.fastq.gz
Sample_lane5-Bob20_R1.fastq.gz
Sample_lane8-Sample1_R1.fastq.gz
Sample_lane8-Sample2_R1.fastq.gz
Sample_lane8-Sample3_R1.fastq.gz
Sample_lane8-Sample4_R1.fastq.gz
Sample_lane8-Sample5_R1.fastq.gz
I want to return only the files that are labeled 'Bob1' through 'Bob10' in order to perform some downstream actions, and I want to return the files labeled 'Bob11' through 'Bob20' similarly.
I have been trying to use grep for this with a regular expression, but have not been able to match both 'Bob' and the adjacent numeric range. For example, this is one of the many lines that have not worked:
grep -E "Bob#([10|0-9])"
I have tried many different combinations of Bob, 10|0-9, ", (), and [] in different places based on different tutorials I have found online but none have worked so far.
EDIT: For completeness, this solution given by @anubhava solved the above question:
grep -E "Bob(10|[0-9])_"
I did not specifically ask for the regex to return the other half of the range, 'Bob11'-'Bob20', but came up with this solution for it as per this page:
grep -E "Bob(1[1-9]|20)_"
You can use this regex for grep against a file:
grep -E "Bob(10|[0-9])_" file
However if you are using glob pattern in a directory then use this extended glob:
shopt -s extglob
printf "%s\n" *Bob@(10|[[:digit:]])_*
Output:
Sample_lane1-Bob10_R1.fastq.gz
Sample_lane1-Bob1_R1.fastq.gz
Sample_lane1-Bob2_R1.fastq.gz
Sample_lane1-Bob4_R1.fastq.gz
Sample_lane1-Bob5_R1.fastq.gz
Sample_lane1-Bob7_R1.fastq.gz
Sample_lane1-Bob8_R1.fastq.gz
Sample_lane1-Bob9_R1.fastq.gz
Sample_lane2-Bob10_R1.fastq.gz
Sample_lane2-Bob1_R1.fastq.gz
Sample_lane2-Bob3_R1.fastq.gz
Sample_lane2-Bob4_R1.fastq.gz
Sample_lane2-Bob6_R1.fastq.gz
Sample_lane2-Bob7_R1.fastq.gz
Sample_lane2-Bob8_R1.fastq.gz
Sample_lane2-Bob9_R1.fastq.gz
If you use a tool that can do math instead of relying on a regexp then you can select any range you like:
$ awk -F'-Bob|_' '$3+0>7 && $3+0<13' file
Sample_lane1-Bob10_R1.fastq.gz
Sample_lane1-Bob8_R1.fastq.gz
Sample_lane1-Bob9_R1.fastq.gz
Sample_lane2-Bob10_R1.fastq.gz
Sample_lane2-Bob8_R1.fastq.gz
Sample_lane2-Bob9_R1.fastq.gz
Sample_lane3-Bob11_R1.fastq.gz
Sample_lane3-Bob12_R1.fastq.gz
Sample_lane5-Bob11_R1.fastq.gz
Sample_lane5-Bob12_R1.fastq.gz
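The same numeric idea can be sketched in bash alone: capture the digits with a regular expression, then compare them arithmetically. The function name in_bob_range and its argument order are assumptions made for this example.

```bash
#!/usr/bin/env bash
# in_bob_range NAME LOW HIGH: succeed when NAME contains "Bob<number>_"
# and LOW <= number <= HIGH.
in_bob_range() {
    local re='Bob([0-9]+)_'
    [[ $1 =~ $re ]] || return 1
    # Force base 10 so a value such as "08" is not read as octal.
    local n=$((10#${BASH_REMATCH[1]}))
    (( n >= $2 && n <= $3 ))
}

for name in Sample_lane1-Bob8_R1.fastq.gz \
            Sample_lane3-Bob12_R1.fastq.gz \
            Sample_lane3-Bob20_R1.fastq.gz \
            Sample_lane8-Sample1_R1.fastq.gz; do
    if in_bob_range "$name" 11 20; then
        echo "$name"
    fi
done
```

Changing the two bounds selects any other range without rewriting the pattern.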

bash grep regexp - excluding subpattern

I have a script written in bash, with one particular grep command I need to modify.
Generally I have two patterns: A & B. There is a textfile that can contain lines with all possible combinations of those patterns, that is:
"xxxAxxx", "xxxBxxx", "xxxAxxxBxxx", "xxxxxx", where "x" are any characters.
I need to match ALL lines APART FROM the ones containing ONLY "A".
At the moment, it is done with "grep -v (A)", but this is a false track, as this would exclude also lines with "xxxAxxxBxxx" - which are OK for me. This is why it needs modification. :)
The tricky part is that this one grep lies in the middle of a 'multiply-piped' command with many other greps, seds and awks inside. Thus forming a smarter pattern would be the best solution. Others would cause much additional work on changing other commands there, and even would impact another parts of the code.
Therefore, the question is: is there a possibility to match pattern and exclude a subpattern in one grep, but allow them to appear both in one line?
Example:
A file contains those lines:
fooTHISfoo
fooTHISfooTHATfoo
fooTHATfoo
foofoo
and I need to match
fooTHISfooTHATfoo
fooTHATfoo
foofoo
A line containing only "THIS" (without "THAT") is not allowed.
You can use this awk command:
awk '!(/THIS/ && !/THAT/)' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
Or by reversing the boolean expression:
awk '!/THIS/ || /THAT/' file
fooTHISfooTHATfoo
fooTHATfoo
foofoo
You want to match lines that contain B, or don't contain A. Equivalently, to delete lines containing A and not B. You could do this in sed:
sed -e '/A/{;/B/!d}'
Or in this particular case:
sed '/THIS/{/THAT/!d}' file
Tricky for grep alone. However, replace that with an awk call: Filter out lines with "A" unless there is a "B"
echo "xxxAxxx
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx" | awk '!/A/ || /B/'
xxxBxxx
xxxAxxxBxxx
xxxBxxxAxxx
xxxxxx
A grep solution. It uses Perl-compatible regular expressions (-P, a GNU grep option) for negative lookahead, which asserts that a pattern does not follow:
grep -Pv '^((?!THAT).)*THIS((?!THAT).)*$' file
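All of the answers above implement one rule: drop a line only when it contains THIS and does not contain THAT. As a sketch, the same rule written as a plain bash filter that reads stdin and writes stdout, so it could stand in for the grep stage of a pipeline. The function name is made up for this example.

```bash
#!/usr/bin/env bash
# keep_unless_only_this: read lines on stdin, drop those that contain
# THIS but not THAT, and print every other line unchanged.
keep_unless_only_this() {
    local line
    # The || clause also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == *THIS* && $line != *THAT* ]]; then
            continue
        fi
        printf '%s\n' "$line"
    done
}

printf '%s\n' fooTHISfoo fooTHISfooTHATfoo fooTHATfoo foofoo | keep_unless_only_this
```

For large inputs the awk or sed one-liners will be faster than a shell loop.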

How do I match PATTERN but not PREFIX_PATTERN?

Say I've got some documents that contain several lines with KEYWORD and some lines with PREFIX_KEYWORD.
How would I match only these lines that have KEYWORD and ignore the lines that have PREFIX_KEYWORD on them?
Yes, I could grep for KEYWORD, feed the results into the editor of my choice and let the editor delete all lines that have PREFIX_KEYWORD but I'm asking whether there's a built-in way in grep to do this.
If this helps: I'm not interested in the exact match but only want to know whether there are occurrences of KEYWORD in the file.
One way would be to grep for your KEYWORD and filter out the rest. This could look like
grep KEYWORD file | grep -v PREFIX_KEYWORD
Another way with a perl expression:
grep -P '(?<!PREFIX_)KEYWORD' file
The same answer was given here: Regex to match specific strings without a given prefix
Try this:
grep -w KEYWORD your_file
man page for -w says:
-w Searches for the expression as a word as if surrounded
by \< and \>.
If you need the word KEYWORD by itself, why not regex for KEYWORD with any special character before (space, newline, etc.)?
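Since the question only asks whether KEYWORD occurs at all, a yes/no check may be enough. A minimal sketch using grep -q, which prints nothing and only sets the exit status. The function name has_bare_keyword and the sample file contents are assumptions.

```bash
#!/usr/bin/env bash
# has_bare_keyword FILE: succeed when FILE has at least one line that
# contains KEYWORD and does not contain PREFIX_KEYWORD.
has_bare_keyword() {
    grep -v 'PREFIX_KEYWORD' "$1" | grep -q 'KEYWORD'
}

# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' 'uses PREFIX_KEYWORD only' 'uses KEYWORD here' > "$sample"

if has_bare_keyword "$sample"; then
    echo "KEYWORD found without prefix"
else
    echo "no unprefixed KEYWORD"
fi
```

Note that this drops whole lines, so a line holding both KEYWORD and PREFIX_KEYWORD is ignored. The lookbehind variant with grep -P does not have that limitation.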

Bash string replacement with regex repetition

I have a file: filename_20130214_suffix.csv
I'd like replace the yyyymmdd part in bash. Here is what I intend to do:
file=`ls -t /path/filename_* | head -1`
file2=${file/20130214/20130215}
#this will not work
#file2=${file/[0-9]{8}/20130215/}
The problem is that parameter expansion does not use regular expressions, but patterns or globs (compare the regular expression "filename_.*\.csv" with the glob "filename_*.csv"). Globs cannot match a fixed number of repetitions of a pattern.
However, you can enable extended patterns in bash, which should be close enough to what you want.
shopt -s extglob # Turn on extended pattern support
file2=${file/+([0-9])/20130215}
You can't match exactly 8 digits, but the +(...) lets you match one or more of the pattern inside the parentheses, which should be sufficient for your use case.
Since all you want to do in this case is replace everything between the _ characters, you could also simply use
file2=${file/_*_/_20130215_}
[[ $file =~ ^([^_]+_)[0-9]{8}(_.*) ]] && file2="${BASH_REMATCH[1]}20130215${BASH_REMATCH[2]}"
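The regex approach in the last line can be wrapped in a small function, sketched below. The name replace_date is made up, and the function assumes the date is the 8-digit field directly after the first underscore.

```bash
#!/usr/bin/env bash
# replace_date NAME NEWDATE: print NAME with the 8-digit field that follows
# the first underscore replaced by NEWDATE. When NAME has no such field,
# print it unchanged.
replace_date() {
    local re='^([^_]+_)[0-9]{8}(_.*)$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}$2${BASH_REMATCH[2]}"
    else
        printf '%s\n' "$1"
    fi
}

replace_date filename_20130214_suffix.csv 20130215
```

Unlike the +([0-9]) pattern, the {8} quantifier here really does require exactly eight digits between the underscores.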