Grep lines with a maximum number of characters - regex

How can I improve the following command:
grep 'string' filename
so that it prints only lines up to a maximum length? For example, only the lines that contain fewer than 100 characters?

You can do it like this:
grep -E '^.{1,100}$' filename | grep 'string'
Or use a single awk command:
awk '/string/ && length() <= 100' filename
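Both commands above also let lines of exactly 100 characters through, while the question asks for fewer than 100. A minimal sketch of a strict variant follows; the helper name short_matches and its argument order are assumptions made for illustration, not part of the answer above.

```shell
# short_matches PATTERN LIMIT FILE  (hypothetical helper, for illustration only)
# Prints the lines of FILE that match PATTERN (an awk regex) and are
# strictly shorter than LIMIT characters.
short_matches() {
    awk -v pat="$1" -v limit="$2" '$0 ~ pat && length($0) < limit' "$3"
}

# Example call: lines containing "string" with fewer than 100 characters
# short_matches 'string' 100 filename
```

Passing the limit with -v keeps the awk program itself constant, so the same function works for any maximum length.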

Here is another version in awk:
awk '$0 ~ /string/ { if(length($0) <= 100) print}' filename

With sed, to keep only lines of 10 to 90 characters (note that -i edits the file in place):
sed -i -r '/^.{10,90}$/!d' "$file"

Find all text between $...$ delimiters using bash script

I have a text file, and I'm trying to get an array of the strings found between $...$ delimiters (LaTeX formulas) using a bash script. My current code doesn't work; the result is empty:
#!/bin/bash
array=($(grep -o '\$([^\$]*)\$' test.txt))
echo ${array[@]}
I tested this regex here, it finds the matches. I use the following test string:
b5f1e7$bfc2439c621353$d1ce0$629f$b8b5
Expected result is
bfc2439c621353 629f
But echo prints nothing, although if I use '[0-9]\+' it works:
5 1 7 2439 621353 1 0 629 8 5
What am I doing wrong?
How about:
grep -o '\$[^$]*\$' test.txt | tr -d '$'
This is basically performing your original grep (but without the parentheses, which were causing it to not match), then deleting the $ characters from each match.
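As a sketch of how that pipeline could feed a script, the function below wraps it and the loop consumes one match per line. The name extract_dollar_pairs is hypothetical, and the loop is just one way to use the results; in bash 4 or later, mapfile -t array < <(...) would fill an array instead.

```shell
# extract_dollar_pairs: hypothetical helper for this sketch.
# Reads text on stdin and prints the content of each $...$ pair, one per line.
extract_dollar_pairs() {
    grep -o '\$[^$]*\$' | tr -d '$'
}

# Consume the results line by line instead of relying on word splitting
printf '%s\n' 'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5' |
    extract_dollar_pairs |
    while IFS= read -r formula; do
        printf 'formula: %s\n' "$formula"
    done
```

Reading line by line keeps matches that contain spaces intact, which an unquoted array=($(...)) assignment would split.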
You may use awk with input field separator as $:
s='b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
awk -F '$' '{for (i=2; i<=NF; i+=2) print $i}' <<< "$s"
Note that this awk command doesn't validate input. If you want awk to accept only properly delimited matches, you can use this GNU awk command with FPAT:
awk -v FPAT='\\$[^$]*\\$' '{for (i=1; i<=NF; i++) {gsub(/\$/, "", $i); print $i}}' <<< "$s"
bfc2439c621353
629f
What about this?
grep -Eo '\$[^$]+\$' a.txt | sed 's/\$//g'
I'm using sed to remove the $ characters.
Try escaping your parentheses:
tst> grep -o '\$\([^\$]*\)\$' test.txt
$bfc2439c621353$
$629f$
Of course, you then have to strip out the $ signs (-o prints the entire match). You can try sed instead:
tst> sed 's/[^\$]*\$\([^\$]*\)\$[^\$]*/\1\n/g' test.txt
bfc2439c621353
629f
Why is your expected output given b5f1e7$bfc2439c621353$d1ce0$629f$b8b5 the two elements bfc2439c621353 629f rather than the three elements bfc2439c621353 d1ce0 629f?
Here's a single grep command to extract those:
$ grep -Po '\$\K[^\$]*(?=\$)' <<<'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
bfc2439c621353
d1ce0
629f
(This requires GNU grep as compiled with libpcre for -P)
This uses \$\K (similar in effect to the lookbehind (?<=\$)) to exclude the first $ from the reported match, and (?=\$) to look ahead to the next $. Because the lookahead does not consume the closing $, that $ is still available to open the next match, which is why d1ce0 is found.
Here's a single POSIX sed command to extract those:
$ sed 's/^[^$]*\$//; s/\$[^$]*$//; s/\$/\n/g' \
<<<'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
bfc2439c621353
d1ce0
629f
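This does not rely on GNU-only options and should work on most POSIX-compatible systems (such as OS X). It removes the leading and trailing portions that aren't wanted, then replaces each $ with a newline. One caveat: POSIX does not define \n in the replacement text, so if your sed prints a literal n there, write a backslash followed by an actual newline instead.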
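One way to sidestep the question of how sed writes a newline in the replacement is to let tr do that step. This is a minimal sketch, not the answerer's command: the name between_dollars is hypothetical, and it assumes every input line contains at least two $ characters (a line with fewer is not handled correctly).

```shell
# between_dollars: hypothetical helper for this sketch.
# Reads stdin and prints every segment that has a $ on both sides,
# one per line. Assumes each line contains at least two $ characters.
between_dollars() {
    sed -e 's/^[^$]*\$//' -e 's/\$[^$]*$//' | tr '$' '\n'
}

printf '%s\n' 'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5' | between_dollars
```

The two sed expressions trim everything up to the first $ and from the last $ onward; tr then turns each remaining $ into a line break.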
Using bash regex:
var="b5f1e7\$bfc2439c621353\$d1ce0\$629f\$b8b5" # string to var
while [[ $var =~ ([^$]*\$)([^$]*)\$(.*) ]] # matching
do
echo -n "${BASH_REMATCH[2]} " # 2nd element has the match
var="${BASH_REMATCH[3]}" # 3rd is the rest of the string
done
echo # trailing newline
bfc2439c621353 629f

Find words with exact number of characters

I have hundreds of lines like
1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ
and from each line, I want to get only words with exactly 9 characters (dfsdfdsfa in the example). How could I do it?
I tried many regexes with sed, grep, and awk, but without success.
With grep:
$ grep -oE '\b.{9}\b' infile
dfsdfdsfa
-o returns only matches and not the complete lines; -E is because I'm lazy and don't want to escape the {} (as in \{\}).
The regex itself is "any 9 characters between word boundaries". This is not exactly foolproof and would also match abcd efgh, which can be avoided by indicating that we want non-blank characters only:
grep -oE '\b[^[:blank:]]{9}\b' infile
Instead of using \b...\b, we could use the -w option to grep, which ensures the same.
grep with -w (--word-regexp) option:
grep -wo '.\{9\}' file.txt
Note that the word-constituent characters are:
[[:alnum:]_]
Example:
% grep -wo '.\{9\}' <<<'1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ'
dfsdfdsfa
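To see why restricting the match to non-blank characters matters, here is a small sketch run against a made-up sample line (the line is an assumption for illustration, not from the question). It relies on \b, which GNU grep supports as an extension.

```shell
# Made-up sample: two short words that together span 9 characters,
# followed by one genuine 9-character word
line='abcd efgh 123456789'

# Any 9 characters between word boundaries: also reports "abcd efgh"
printf '%s\n' "$line" | grep -oE '\b.{9}\b'

# Non-blank characters only: reports just the single word 123456789
printf '%s\n' "$line" | grep -oE '\b[^[:blank:]]{9}\b'
```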
Here is a pure bash solution:
filename="test.txt"
declare -a record
while read -ra record
do
    for field in "${record[@]}"
    do
        if (( ${#field} == 9 ))
        then
            echo "$field"
        fi
    done
done < "$filename"
and here is an awk solution embedded in bash:
filename='test.txt'
awk -f - "$filename" << '_END_'
{
    # Use <= so the last field on each line is checked too
    for (i = 1; i <= NF; i++) {
        if (length($i) == 9) print $i
    }
}
_END_
cat foo.txt | sed -e 's/[\t ]/\n/g' | awk '/^.{9}$/'
should do the trick too.
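The same split-then-filter idea can be written without relying on \t and \n escapes in sed or on interval expressions in awk, which not every implementation supports. A minimal sketch, with a hypothetical function name and the word length passed as a parameter:

```shell
# words_of_length N: hypothetical helper for this sketch.
# Reads stdin, puts each blank-separated word on its own line,
# and prints the words that are exactly N characters long.
words_of_length() {
    tr -s '[:blank:]' '\n' | awk -v n="$1" 'length($0) == n'
}

printf '%s\n' '1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ' |
    words_of_length 9
```

Here a "word" is any run of non-blank characters, which differs from grep -w, where only letters, digits, and underscores count as word characters.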

bash regex multiple match in one line

I'm trying to process my text.
For example, I have:
asdf asdf get.this random random get.that
get.it this.no also.this.no
My desired output is:
get.this get.that
get.it
So the regexp should catch only this pattern (get.\w), but it has to be applied repeatedly because of multiple occurrences in one line, so the easiest way with sed
sed 's/.*(REGEX).*/\1/'
does not work (it shows only one occurrence).
Probably the best way is to use grep -o, but I have an old version of grep and the -o flag is not available.
This grep may give what you need:
grep -o "get[^ ]*" file
Try awk:
awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
You might need to tweak the regex between the slashes for your specific issue. Sample output:
$ awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
get.this
get.that
get.it
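The awk command above tests whole fields and prints one match per line. If the matches should instead stay grouped per input line, as in the desired output, a loop around awk's match() function is one option. This is a sketch under assumptions: the helper name get_matches is hypothetical, and the regex get\.[^ \t]+ would also match inside a longer word such as forget.this.

```shell
# get_matches: hypothetical helper for this sketch.
# Reads stdin and prints, for each input line that has any, every
# "get.<non-blanks>" match joined by single spaces.
get_matches() {
    awk '{
        out = ""
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost match in rest
        while (match(rest, /get\.[^ \t]+/)) {
            out = out (out == "" ? "" : " ") substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        if (out != "") print out
    }'
}

printf '%s\n' 'asdf asdf get.this random random get.that' \
              'get.it this.no also.this.no' | get_matches
```

This needs neither grep -o nor GNU extensions such as \w, since match(), RSTART, and RLENGTH are part of standard awk.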
With awk:
awk -v patt="^get" '{
for (i=1; i<=NF; i++)
if ($i ~ patt)
printf "%s%s", $i, OFS;
print ""
}' <<< "$text"
bash
while read -a words; do
    for word in "${words[@]}"; do
        if [[ $word == get* ]]; then
            echo -n "$word "
        fi
    done
    echo
done <<< "$text"
perl
perl -lane 'print join " ", grep {$_ =~ /^get/} @F' <<< "$text"
This might work for you (GNU sed):
sed -r '/\bget\.\S+/{s//\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1 /g;s/ $//}' file
or if you want one per line:
sed -r '/\n/!s/\bget\.\S+/\n&\n/g;/^get/P;D' file

Regular expression not showing multiple line content

I have a file with the following format.
<hello>
<random1>
<random2>
....
....
....
<random100>
<bye>
I want to check whether both hello and bye are present, with bye below hello. I tried this regular expression:
grep "hello.*bye" filename
but it fails to match what I expected.
You could use pcregrep:
pcregrep -M 'hello(\n|.)*bye' filename
The -M option makes it possible to search for patterns that span line boundaries.
For your input, it'd produce:
<hello>
<random1>
<random2>
....
....
....
<random100>
<bye>
If the input file is small enough, you can try:
grep "hello.*bye" <(tr $'\n' ' ' < filename)
This replaces all newlines with spaces and thus turns the file contents into a single line that grep searches at once.
If you'd rather simply remove newlines, use:
grep "hello.*bye" <(tr -d $'\n' < filename)
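The original grep fails because grep tests one line at a time, so hello and bye never appear in the same candidate string. As a sketch, the newline-joining approach can be wrapped in a function that reports through its exit status. The name hello_before_bye is hypothetical, and the version below uses a plain pipe rather than bash process substitution.

```shell
# hello_before_bye FILE: hypothetical helper for this sketch.
# Succeeds (exit status 0) when "hello" occurs somewhere before "bye"
# in FILE, even if the two are on different lines.
hello_before_bye() {
    tr '\n' ' ' < "$1" | grep -q 'hello.*bye'
}

# Example call:
# hello_before_bye filename && echo found || echo "not found"
```

Like the commands above, this joins the whole file into one line before searching, so it is only suitable for inputs that fit comfortably in memory.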
$ cat file1.txt
<hello>
<bye>
$ awk '/<hello>/ {hello=1} /<bye>/&&hello {bye=1; exit} END {exit !(hello && bye)}' \
file1.txt \
&& echo found || echo not found
found
$ cat file2.txt
<bye>
<hello>
$ awk '/<hello>/ {hello=1} /<bye>/&&hello {bye=1; exit} END {exit !(hello && bye)}' \
file2.txt \
&& echo found || echo not found
not found
Perl:
perl -0777 -lne 'print (/hello.*bye/s ? "y" : "n")'
or
perl -0777 -ne 'exit(! /hello.*bye/s)'
The -0777 option slurps the whole file as a single string. The "s" flag tells perl to allow "." to match a newline.
With GNU awk for a multi-char RS:
awk -v RS='^$' '{print (/hello.*bye/ ? "y" : "n")}'

I have a file and I need to extract a particular string that follows the regex 'LN:' on the second line

Please refer to the file contents below.
@HD VN:1.0 SO:unsorted
@SQ SN:Chr1 LN:30427680
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
How can I extract just the number 30427680 using awk or any other Unix command?
Using sed
sed -n 's/.*LN://p' < input.txt
This will erase everything up until LN:, and print what's left, and only if a substitution did take place.
Using awk
awk -v FS=: '/LN:/ { print $3; }' < input.txt
This will match lines that contain LN:, use : as field separator, and print the 3rd column.
Using grep
grep -o '[0-9]\{3,\}' < input.txt
This will match sequences of 3 or more digits, and print only the matched pattern thanks to the -o.
Depending on other cases not included in your question, you might have to make the patterns more strict.
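As one possible stricter pattern, the sketch below only accepts a whitespace-separated field that consists of LN: followed by digits, so a stray LN: inside another value or an unrelated number elsewhere in the file is ignored. The helper name ln_values is hypothetical.

```shell
# ln_values [FILE...]: hypothetical helper for this sketch.
# Prints the numeric value of every field of the form LN:<digits>.
# Reads the named files, or stdin when no file is given.
ln_values() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^LN:[0-9]+$/) print substr($i, 4)
    }' "$@"
}

# Example call:
# ln_values input.txt
```

substr($i, 4) skips the three characters of the LN: prefix and prints the rest of the field.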
Using grep:
grep -oP 'LN:\K.*' filename
Just use grep:
grep -o 30427680 file
-o, --only-matching
Prints only the matching part of the lines.
Using perl :
perl -ne 'print $& if /LN:\K.*/' filename
or
perl -ne 'print $1 if /LN:(.*)/' filename
Another awk
awk -F"LN:" 'NF>1 {print $2}' file