I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required to be (or still need some more logic to put in it).
Here, it should print only 2 letter words but with this it is giving different output
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command is just searching for lines with 2 in text.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n=$var 'length() == n' file
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting in exactly the number 2, nothing else. Of course this does not return anything, since /usr/share/dict/words has words and not numbers (as far as I know).
If you want to print those lines consisting in two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use a quantifier {} like (note the usage of \ to have sed's BRE understand properly):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that we are putting the variable outside the quotes for safety (thanks Ed Morton in comments for the reminder).
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
[[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex in a form ^..$ (the number of dots is variable). So doing it in 2 steps:
create a string of the desired length e.g: %2s. without args the printf prints only the filler spaces for the desired length e.g.: 2
but we have a variable var, therefore %${var}s
replace all spaces in the string with .
but don't use this solution. It is too slow, and here are better utilities for this, best is imho grep.
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
Try awk-
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
GNU awk and mawk at least (due to empty FS):
$ awk -F '' 'NF==2' /usr/share/dict/words #| head -5
aa
Ab
ad
ae
Ah
Empty FS separates each character on its own field so NF tells the record length.
I have a bunch of files with filenames composed of underscore and dots, here is one example:
META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
I want to remove the part that contains .bed.nodup.sortedbed.roadmap.sort.fgwas.gz. so the expected filename output would be META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
I am using these sed commands but neither one works:
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo $stringZ | sed -e 's/\([[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.\)//g'
echo $stringZ | sed -e 's/\[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.//g'
Any solution is sed or awk would help a lot
Don't use external utilities and regexes for such a simple task! Use parameter expansions instead.
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo "${stringZ/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz}"
To perform the renaming of all the files containing .bed.nodup.sortedbed.roadmap.sort.fgwas.gz, use this:
shopt -s nullglob
substring=.bed.nodup.sortedbed.roadmap.sort.fgwas.gz
for file in *"$substring"*; do
echo mv -- "$file" "${file/"$substring"}"
done
Note. I left echo in front of mv so that nothing is going to be renamed; the commands will only be displayed on your terminal. Remove echo if you're satisfied with what you see.
Your regex doesn't really feel too much more general than the fixed pattern would be, but if you want to make it work, you need to allow for more than one lower case character between each dot. Right now you're looking for exactly one, but you can fix it with \+ after each [[:lower:]] like
printf '%s' "$stringZ" | sed -e 's/\([[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.\)//g'
which with
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"
give me the output
META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
Try this:
#!/bin/bash
for line in $(ls -1 META*);
do
f2=$(echo $line | sed 's/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz//')
mv $line $f2
done
I need something like:
grep ^"unwanted_word"XXXXXXXX
You can do it using -v (for --invert-match) option of grep as:
grep -v "unwanted_word" file | grep XXXXXXXX
grep -v "unwanted_word" file will filter the lines that have the unwanted_word and grep XXXXXXXX will list only lines with pattern XXXXXXXX.
EDIT:
From your comment it looks like you want to list all lines without the unwanted_word. In that case all you need is:
grep -v 'unwanted_word' file
I understood the question as "How do I match a word but exclude another", for which one solution is two greps in series: First grep finding the wanted "word1", second grep excluding "word2":
grep "word1" | grep -v "word2"
In my case: I need to differentiate between "plot" and "#plot" which grep's "word" option won't do ("#" not being a alphanumerical).
If your grep supports Perl regular expression with -P option you can do (if bash; if tcsh you'll need to escape the !):
grep -P '(?!.*unwanted_word)keyword' file
Demo:
$ cat file
foo1
foo2
foo3
foo4
bar
baz
Let us now list all foo except foo3
$ grep -P '(?!.*foo3)foo' file
foo1
foo2
foo4
$
The right solution is to use grep -v "word" file, with its awk equivalent:
awk '!/word/' file
However, if you happen to have a more complex situation in which you want, say, XXX to appear and YYY not to appear, then awk comes handy instead of piping several greps:
awk '/XXX/ && !/YYY/' file
# ^^^^^ ^^^^^^
# I want it |
# I don't want it
You can even say something more complex. For example: I want those lines containing either XXX or YYY, but not ZZZ:
awk '(/XXX/ || /YYY/) && !/ZZZ/' file
etc.
Invert match using grep -v:
grep -v "unwanted word" file pattern
grep provides '-v' or '--invert-match' option to select non-matching lines.
e.g.
grep -v 'unwanted_pattern' file_name
This will output all the lines from file file_name, which does not have 'unwanted_pattern'.
If you are searching the pattern in multiple files inside a folder, you can use the recursive search option as follows
grep -r 'wanted_pattern' * | grep -v 'unwanted_pattern'
Here grep will try to list all the occurrences of 'wanted_pattern' in all the files from within currently directory and pass it to second grep to filter out the 'unwanted_pattern'.
'|' - pipe will tell shell to connect the standard output of left program (grep -r 'wanted_pattern' *) to standard input of right program (grep -v 'unwanted_pattern').
The -v option will show you all the lines that don't match the pattern.
grep -v ^unwanted_word
I excluded the root ("/") mount point by using grep -vw "^/".
# cat /tmp/topfsfind.txt| head -4 |awk '{print $NF}'
/
/root/.m2
/root
/var
# cat /tmp/topfsfind.txt| head -4 |awk '{print $NF}' | grep -vw "^/"
/root/.m2
/root
/var
I've a directory with a bunch of files. I want to find all the files that DO NOT contain the string "speedup" so I successfully used the following command:
grep -iL speedup *
I want to find files that have "abc" AND "efg" in that order, and those two strings are on different lines in that file. Eg: a file with content:
blah blah..
blah blah..
blah abc blah
blah blah..
blah blah..
blah blah..
blah efg blah blah
blah blah..
blah blah..
Should be matched.
Grep is an awkward tool for this operation.
pcregrep which is found in most of the modern Linux systems can be used as
pcregrep -M 'abc.*(\n|.)*efg' test.txt
where -M, --multiline allow patterns to match more than one line
There is a newer pcre2grep also. Both are provided by the PCRE project.
pcre2grep is available for Mac OS X via Mac Ports as part of port pcre2:
% sudo port install pcre2
and via Homebrew as:
% brew install pcre
or for pcre2
% brew install pcre2
pcre2grep is also available on Linux (Ubuntu 18.04+)
$ sudo apt install pcre2-utils # PCRE2
$ sudo apt install pcregrep # Older PCRE
Here is a solution inspired by this answer:
if 'abc' and 'efg' can be on the same line:
grep -zl 'abc.*efg' <your list of files>
if 'abc' and 'efg' must be on different lines:
grep -Pzl '(?s)abc.*\n.*efg' <your list of files>
Params:
-P Use perl compatible regular expressions (PCRE).
-z Treat the input as a set of lines, each terminated by a zero byte instead of a newline. i.e. grep treats the input as a one big line. Note that if you don't use -l it will display matches followed by a NUL char, see comments.
-l list matching filenames only.
(?s) activate PCRE_DOTALL, which means that '.' finds any character or newline.
I'm not sure if it is possible with grep, but sed makes it very easy:
sed -e '/abc/,/efg/!d' [file-with-content]
sed should suffice as poster LJ stated above,
instead of !d you can simply use p to print:
sed -n '/abc/,/efg/p' file
I relied heavily on pcregrep, but with newer grep you do not need to install pcregrep for many of its features. Just use grep -P.
In the example of the OP's question, I think the following options work nicely, with the second best matching how I understand the question:
grep -Pzo "abc(.|\n)*efg" /tmp/tes*
grep -Pzl "abc(.|\n)*efg" /tmp/tes*
I copied the text as /tmp/test1 and deleted the 'g' and saved as /tmp/test2. Here is the output showing that the first shows the matched string and the second shows only the filename (typical -o is to show match and typical -l is to show only filename). Note that the 'z' is necessary for multiline and the '(.|\n)' means to match either 'anything other than newline' or 'newline' - i.e. anything:
user#host:~$ grep -Pzo "abc(.|\n)*efg" /tmp/tes*
/tmp/test1:abc blah
blah blah..
blah blah..
blah blah..
blah efg
user#host:~$ grep -Pzl "abc(.|\n)*efg" /tmp/tes*
/tmp/test1
To determine if your version is new enough, run man grep and see if something similar to this appears near the top:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE, see
below). This is highly experimental and grep -P may warn of
unimplemented features.
That is from GNU grep 2.10.
This can be done easily by first using tr to replace the newlines with some other character:
tr '\n' '\a' | grep -o 'abc.*def' | tr '\a' '\n'
Here, I am using the alarm character, \a (ASCII 7) in place of a newline.
This is almost never found in your text, and grep can match it with a ., or match it specifically with \a.
awk one-liner:
awk '/abc/,/efg/' [file-with-content]
If you are willing to use contexts, this could be achieved by typing
grep -A 500 abc test.txt | grep -B 500 efg
This will display everything between "abc" and "efg", as long as they are within 500 lines of each other.
You can do that very easily if you can use Perl.
perl -ne 'if (/abc/) { $abc = 1; next }; print "Found in $ARGV\n" if ($abc && /efg/); }' yourfilename.txt
You can do that with a single regular expression too, but that involves taking the entire contents of the file into a single string, which might end up taking up too much memory with large files.
For completeness, here is that method:
perl -e '#lines = <>; $content = join("", #lines); print "Found in $ARGV\n" if ($content =~ /abc.*efg/s);' yourfilename.txt
I don't know how I would do that with grep, but I would do something like this with awk:
awk '/abc/{ln1=NR} /efg/{ln2=NR} END{if(ln1 && ln2 && ln1 < ln2){print "found"}else{print "not found"}}' foo
You need to be careful how you do this, though. Do you want the regex to match the substring or the entire word? add \w tags as appropriate. Also, while this strictly conforms to how you stated the example, it doesn't quite work when abc appears a second time after efg. If you want to handle that, add an if as appropriate in the /abc/ case etc.
If you need both words are close each other, for example no more than 3 lines, you can do this:
find . -exec grep -Hn -C 3 "abc" {} \; | grep -C 3 "efg"
Same example but filtering only *.txt files:
find . -name *.txt -exec grep -Hn -C 3 "abc" {} \; | grep -C 3 "efg"
And also you can replace grep command with egrep command if you want also find with regular expressions.
I released a grep alternative a few days ago that does support this directly, either via multiline matching or using conditions - hopefully it is useful for some people searching here. This is what the commands for the example would look like:
Multiline:
sift -lm 'abc.*efg' testfile
Conditions:
sift -l 'abc' testfile --followed-by 'efg'
You could also specify that 'efg' has to follow 'abc' within a certain number of lines:
sift -l 'abc' testfile --followed-within 5:'efg'
You can find more information on sift-tool.org.
Possible with ripgrep:
$ rg --multiline 'abc(\n|.)+?efg' test.txt
3:blah abc blah
4:blah abc blah
5:blah blah..
6:blah blah..
7:blah blah..
8:blah efg blah blah
Or some other incantations.
If you want . to count as a newline:
$ rg --multiline '(?s)abc.+?efg' test.txt
3:blah abc blah
4:blah abc blah
5:blah blah..
6:blah blah..
7:blah blah..
8:blah efg blah blah
Or equivalent to having the (?s) would be rg --multiline --multiline-dotall
And to answer the original question, where they have to be on separate lines:
$ rg --multiline 'abc.*[\n](\n|.)*efg' test.txt
And if you want it "non greedy" so you don't just get the first abc with the last efg (separate them into pairs):
$ rg --multiline 'abc.*[\n](\n|.)*?efg' test.txt
https://til.hashrocket.com/posts/9zneks2cbv-multiline-matches-with-ripgrep-rg
Sadly, you can't. From the grep docs:
grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN.
While the sed option is the simplest and easiest, LJ's one-liner is sadly not the most portable. Those stuck with a version of the C Shell (instead of bash) will need to escape their bangs:
sed -e '/abc/,/efg/\!d' [file]
Which line unfortunately does not work in bash et al.
With silver searcher:
ag 'abc.*(\n|.)*efg' your_filename
similar to ring bearer's answer, but with ag instead. Speed advantages of silver searcher could possibly shine here.
#!/bin/bash
shopt -s nullglob
for file in *
do
r=$(awk '/abc/{f=1}/efg/{g=1;exit}END{print g&&f ?1:0}' file)
if [ "$r" -eq 1 ];then
echo "Found pattern in $file"
else
echo "not found"
fi
done
you can use grep incase you are not keen in the sequence of the pattern.
grep -l "pattern1" filepattern*.* | xargs grep "pattern2"
example
grep -l "vector" *.cpp | xargs grep "map"
grep -l will find all the files which matches the first pattern, and xargs will grep for the second pattern. Hope this helps.
If you have some estimation about the distance between the 2 strings 'abc' and 'efg' you are looking for, you might use:
grep -r . -e 'abc' -A num1 -B num2 | grep 'efg'
That way, the first grep will return the line with the 'abc' plus #num1 lines after it, and #num2 lines after it, and the second grep will sift through all of those to get the 'efg'.
Then you'll know at which files they appear together.
With ugrep released a few months ago:
ugrep 'abc(\n|.)+?efg'
This tool is highly optimized for speed. It's also GNU/BSD/PCRE-grep compatible.
Note that we should use a lazy repetition +?, unless you want to match all lines with efg together until the last efg in the file.
You have at least a couple options --
DOTALL method
use (?s) to DOTALL the . character to include \n
you can also use a lookahead (?=\n) -- won't be captured in match
example-text:
true
match me
false
match me one
false
match me two
true
match me three
third line!!
{BLANK_LINE}
command:
grep -Pozi '(?s)true.+?\n(?=\n)' example-text
-p for perl regular expressions
-o to only match pattern, not whole line
-z to allow line breaks
-i makes case-insensitive
output:
true
match me
true
match me three
third line!!
notes:
- +? makes modifier non-greedy so matches shortest string instead of largest (prevents from returning one match containing entire text)
you can use the oldschool O.G. manual method using \n
command:
grep -Pozi 'true(.|\n)+?\n(?=\n)'
output:
true
match me
true
match me three
third line!!
I used this to extract a fasta sequence from a multi fasta file using the -P option for grep:
grep -Pzo ">tig00000034[^>]+" file.fasta > desired_sequence.fasta
P for perl based searches
z for making a line end in 0 bytes rather than newline char
o to just capture what matched since grep returns the whole line (which in this case since you did -z is the whole file).
The core of the regexp is the [^>] which translates to "not the greater than symbol"
As an alternative to Balu Mohan's answer, it is possible to enforce the order of the patterns using only grep, head and tail:
for f in FILEGLOB; do tail $f -n +$(grep -n "pattern1" $f | head -n1 | cut -d : -f 1) 2>/dev/null | grep "pattern2" &>/dev/null && echo $f; done
This one isn't very pretty, though. Formatted more readably:
for f in FILEGLOB; do
tail $f -n +$(grep -n "pattern1" $f | head -n1 | cut -d : -f 1) 2>/dev/null \
| grep -q "pattern2" \
&& echo $f
done
This will print the names of all files where "pattern2" appears after "pattern1", or where both appear on the same line:
$ echo "abc
def" > a.txt
$ echo "def
abc" > b.txt
$ echo "abcdef" > c.txt; echo "defabc" > d.txt
$ for f in *.txt; do tail $f -n +$(grep -n "abc" $f | head -n1 | cut -d : -f 1) 2>/dev/null | grep -q "def" && echo $f; done
a.txt
c.txt
d.txt
Explanation
tail -n +i - print all lines after the ith, inclusive
grep -n - prepend matching lines with their line numbers
head -n1 - print only the first row
cut -d : -f 1 - print the first cut column using : as the delimiter
2>/dev/null - silence tail error output that occurs if the $() expression returns empty
grep -q - silence grep and return immediately if a match is found, since we are only interested in the exit code
This should work too?!
perl -lpne 'print $ARGV if /abc.*?efg/s' file_list
$ARGV contains the name of the current file when reading from file_list
/s modifier searches across newline.
The filepattern *.sh is important to prevent directories to be inspected. Of course some test could prevent that too.
for f in *.sh
do
a=$( grep -n -m1 abc $f )
test -n "${a}" && z=$( grep -n efg $f | tail -n 1) || continue
(( ((${z/:*/}-${a/:*/})) > 0 )) && echo $f
done
The
grep -n -m1 abc $f
searches maximum 1 matching and returns (-n) the linenumber.
If a match was found (test -n ...) find the last match of efg (find all and take the last with tail -n 1).
z=$( grep -n efg $f | tail -n 1)
else continue.
Since the result is something like 18:foofile.sh String alf="abc"; we need to cut away from ":" till end of line.
((${z/:*/}-${a/:*/}))
Should return a positive result if the last match of the 2nd expression is past the first match of the first.
Then we report the filename echo $f.
To search recursively across all files (across multiple lines within each file) with BOTH strings present (i.e. string1 and string2 on different lines and both present in same file):
grep -r -l 'string1' * > tmp; while read p; do grep -l 'string2' $p; done < tmp; rm tmp
To search recursively across all files (across multiple lines within each file) with EITHER string present (i.e. string1 and string2 on different lines and either present in same file):
grep -r -l 'string1\|string2' *
Here's a way by using two greps in a row:
egrep -o 'abc|efg' $file | grep -A1 abc | grep efg | wc -l
returns 0 or a positive integer.
egrep -o (Only shows matches, trick: multiple matches on the same line produce multi-line output as if they are on different lines)
grep -A1 abc (print abc and the line after it)
grep efg | wc -l (0-n count of efg lines found after abc on the same or following lines, result can be used in an 'if")
grep can be changed to egrep etc. if pattern matching is needed
This should work:
cat FILE | egrep 'abc|efg'
If there is more than one match you can filter out using grep -v
I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
#Use code that im working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
I Know this is an old topic but I came her along same searches and found another great possibility apply a regex on a String/Variable using grep:
# Simple
$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
With using lookaround capabilities search expressions can be extended for better matching. Where (?i) indicates the Pattern before the searched Pattern (lookahead),
\K indicates the actual search pattern and (?=) contains the pattern after the search (lookbehind).
https://www.regular-expressions.info/lookaround.html
The given example matches the same as the PCRE regex TestT([0-9]{3})String
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=`expr "$newNameTXT" : '[^0-9]*\([0-9]\+\)'`; then
echo "contains $num"
fi
}
Well , the Sed with the s/"pattern1"/"pattern2"/g just replaces globally all the pattern1s to pattern 2.
Besides that, sed while by default print the entire line by default .
I suggest piping the instruction to a cut command and trying to extract the numbers u want :
If u are lookin only to use sed then use TRE:
sed -n 's/.*\(0-9\)\(0-9\)\(0-9\).*/\1,\2,\3/g'.
I dint try and execute the above command so just make sure the syntax is right.
Hope this helped.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo ${array[#]}