Find words with exact number of characters - regex

I have hundreds of lines like
1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ
and from each line, I want to get only words with exactly 9 characters (dfsdfdsfa in the example). How could I do it?
I tried many regexs/sed/grep/awk but without success.

With grep:
$ grep -oE '\b.{9}\b' infile
dfsdfdsfa
-o returns only matches and not the complete lines; -E is because I'm lazy and don't want to escape the {} (as in \{\}).
The regex itself is "any 9 characters between word boundaries". This is not exactly foolproof and would also match abcd efgh, which can be avoided by indicating that we want non-blank characters only:
grep -oE '\b[^[:blank:]]{9}\b' infile
Instead of using \b...\b, we could use the -w option to grep, which ensures the same.

grep with -w (--word-regexp) option:
grep -wo '.\{9\}' file.txt
Note that, word constituent characters are:
[[:alnum:]_]
Example:
% grep -wo '.\{9\}' <<<'1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ'
dfsdfdsfa

Here is a pure bash solution:
filename="test.txt"
declare -a record
while read -ra record
do
for field in ${record[#]}
do
if (( ${#field} == 9 ))
then
echo $field
fi
done
done < "$filename"
and here is an awk solution embedded in bash:
filename='test.txt'
awk -f - "$filename" << '_END_'
{
for (i=1; i < NF; i++) {
if (length($i) == 9) print $i
}
}
_END_

cat foo.txt | sed -e 's/[\t ]/\n/g' | awk '/^.{9}$/
should do the trick too.

Related

Find all text between $...$ delimiters using bash script

I have a text file, and I'm trying to get an array of strings containing between $..$ delimiters (LaTeX formulas) using bash script. My current code doesn't work, result is empty:
#!/bin/bash
array=($(grep -o '\$([^\$]*)\$' test.txt))
echo ${array[#]}
I tested this regex here, it finds the matches. I use the following test string:
b5f1e7$bfc2439c621353$d1ce0$629f$b8b5
Expected result is
bfc2439c621353 629f
But echo returns empty. Although if I use '[0-9]\+' it works:
5 1 7 2439 621353 1 0 629 8 5
What do I do wrong?
How about:
grep -o '\$[^$]*\$' test.txt | tr -d '$'
This is basically performing your original grep (but without the brackets, which were causing it to not match), then removing the first/last characters from each match.
You may use awk with input field separator as $:
s='b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
awk -F '$' '{for (i=2; i<=NF; i+=2) print $i}' <<< "$s"
Note that this awk command doesn't validate input. If you want awk to allow for only valid inputs then you may use this gnu awk command with FPAT:
awk -v FPAT='\\$[^$]*\\$' '{for (i=1; i<=NF; i++) {gsub(/\$/, "", $i); print $i}}' <<< "$s"
bfc2439c621353
629f
What about this?
grep -Eo '\$[^$]+\$' a.txt | sed 's/\$//g'
I'm using sed to replace the $.
Try escaping your braces:
tst> grep -o '\$\([^\$]*\)\$' test.txt
$bfc2439c621353$
$629f$
of course, you then have to strip out the $ signs (-o prints the entire match). You can try sed instead:
tst> sed 's/[^\$]*\$\([^\$]*\)\$[^\$]*/\1\n/g' test.txt
bfc2439c621353
629f
Why is your expected output given b5f1e7$bfc2439c621353$d1ce0$629f$b8b5 the two elements bfc2439c621353 629f rather than the three elements bfc2439c621353 d1ce0 629f?
Here's a single grep command to extract those:
$ grep -Po '\$\K[^\$]*(?=\$)' <<<'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
bfc2439c621353
d1ce0
629f
(This requires GNU grep as compiled with libpcre for -P)
This uses \$\K (equivalent to (?<=\$)to look behind at the first $ and (?=\$) to look ahead to the next $. Since these are lookarounds, they are not absorbed by grep in the process and therefore d1ce0 is available to be found.
Here's a single POSIX sed command to extract those:
$ sed 's/^[^$]*\$//; s/\$[^$]*$//; s/\$/\n/g' \
<<<'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
bfc2439c621353
d1ce0
629f
This does not use any GNU notation and should work on any POSIX-compatible system (such as OS X). It removes the leading and trailing portions that aren't wanted, then replaces each $ with a newline.
Using bash regex:
var="b5f1e7\$bfc2439c621353\$d1ce0\$629f\$b8b5" # string to var
while [[ $var =~ ([^$]*\$)([^$]*)\$(.*) ]] # matching
do
echo -n "${BASH_REMATCH[2]} " # 2nd element has the match
var="${BASH_REMATCH[3]}" # 3rd is the rest of the string
done
echo # trailing newline
bfc2439c621353 629f

How to display words as per given number of letters?

I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required to be (or still need some more logic to put in it).
Here, it should print only 2 letter words but with this it is giving different output
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command is just searching for lines with 2 in text.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n=$var 'length() == n' file
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting in exactly the number 2, nothing else. Of course this does not return anything, since /usr/share/dict/words has words and not numbers (as far as I know).
If you want to print those lines consisting in two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use a quantifier {} like (note the usage of \ to have sed's BRE understand properly):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that we are putting the variable outside the quotes for safety (thanks Ed Morton in comments for the reminder).
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
[[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex in a form ^..$ (the number of dots is variable). So doing it in 2 steps:
create a string of the desired length e.g: %2s. without args the printf prints only the filler spaces for the desired length e.g.: 2
but we have a variable var, therefore %${var}s
replace all spaces in the string with .
but don't use this solution. It is too slow, and here are better utilities for this, best is imho grep.
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
Try awk-
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
GNU awk and mawk at least (due to empty FS):
$ awk -F '' 'NF==2' /usr/share/dict/words #| head -5
aa
Ab
ad
ae
Ah
Empty FS separates each character on its own field so NF tells the record length.

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
#EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2; i++) printf"-%s",$i}'

Grep lines with a maximum number of characters

How can I improve the following command:
grep 'sting' filename
such that to print only the lines with a maximum number of characters? For example only the lines which contain less than 100 characters?
You can use like this:
grep -E '^.{1,100}$' filename | grep 'string'
OR using a single awk command like this:
awk '/string/ && length() <= 100' filename
Here is the another version in awk:
awk '$0 ~ /string/ { if(length($0) <= 100) print}'
With sed for lines between >=10 and <=90 characters:
sed -i -r '/^.{10,90}$/!d' $file;

Match two strings in one line with grep

I am trying to use grep to match lines that contain two different strings. I have tried the following but this matches lines that contain either string1 or string2 which not what I want.
grep 'string1\|string2' filename
So how do I match with grep only the lines that contain both strings?
You can use
grep 'string1' filename | grep 'string2'
Or
grep 'string1.*string2\|string2.*string1' filename
I think this is what you were looking for:
grep -E "string1|string2" filename
I think that answers like this:
grep 'string1.*string2\|string2.*string1' filename
only match the case where both are present, not one or the other or both.
To search for files containing all the words in any order anywhere:
grep -ril \'action\' | xargs grep -il \'model\' | xargs grep -il \'view_type\'
The first grep kicks off a recursive search (r), ignoring case (i) and listing (printing out) the name of the files that are matching (l) for one term ('action' with the single quotes) occurring anywhere in the file.
The subsequent greps search for the other terms, retaining case insensitivity and listing out the matching files.
The final list of files that you will get will the ones that contain these terms, in any order anywhere in the file.
If you have a grep with a -P option for a limited perl regex, you can use
grep -P '(?=.*string1)(?=.*string2)'
which has the advantage of working with overlapping strings. It's somewhat more straightforward using perl as grep, because you can specify the and logic more directly:
perl -ne 'print if /string1/ && /string2/'
Your method was almost good, only missing the -w
grep -w 'string1\|string2' filename
You could try something like this:
(pattern1.*pattern2|pattern2.*pattern1)
The | operator in a regular expression means or. That is to say either string1 or string2 will match. You could do:
grep 'string1' filename | grep 'string2'
which will pipe the results from the first command into the second grep. That should give you only lines that match both.
And as people suggested perl and python, and convoluted shell scripts, here a simple awk approach:
awk '/string1/ && /string2/' filename
Having looked at the comments to the accepted answer: no, this doesn't do multi-line; but then that's also not what the author of the question asked for.
Don't try to use grep for this, use awk instead. To match 2 regexps R1 and R2 in grep you'd think it would be:
grep 'R1.*R2|R2.*R1'
while in awk it'd be:
awk '/R1/ && /R2/'
but what if R2 overlaps with or is a subset of R1? That grep command simply would not work while the awk command would. Lets say you want to find lines that contain the and heat:
$ echo 'theatre' | grep 'the.*heat|heat.*the'
$ echo 'theatre' | awk '/the/ && /heat/'
theatre
You'd have to use 2 greps and a pipe for that:
$ echo 'theatre' | grep 'the' | grep 'heat'
theatre
and of course if you had actually required them to be separate you can always write in awk the same regexp as you used in grep and there are alternative awk solutions that don't involve repeating the regexps in every possible sequence.
Putting that aside, what if you wanted to extend your solution to match 3 regexps R1, R2, and R3. In grep that'd be one of these poor choices:
grep 'R1.*R2.*R3|R1.*R3.*R2|R2.*R1.*R3|R2.*R3.*R1|R3.*R1.*R2|R3.*R2.*R1' file
grep R1 file | grep R2 | grep R3
while in awk it'd be the concise, obvious, simple, efficient:
awk '/R1/ && /R2/ && /R3/'
Now, what if you actually wanted to match literal strings S1 and S2 instead of regexps R1 and R2? You simply can't do that in one call to grep, you have to either write code to escape all RE metachars before calling grep:
S1=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<< 'R1')
S2=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<< 'R2')
grep 'S1.*S2|S2.*S1'
or again use 2 greps and a pipe:
grep -F 'S1' file | grep -F 'S2'
which again are poor choices whereas with awk you simply use a string operator instead of regexp operator:
awk 'index($0,S1) && index($0.S2)'
Now, what if you wanted to match 2 regexps in a paragraph rather than a line? Can't be done in grep, trivial in awk:
awk -v RS='' '/R1/ && /R2/'
How about across a whole file? Again can't be done in grep and trivial in awk (this time I'm using GNU awk for multi-char RS for conciseness but it's not much more code in any awk or you can pick a control-char you know won't be in the input for the RS to do the same):
awk -v RS='^$' '/R1/ && /R2/'
So - if you want to find multiple regexps or strings in a line or paragraph or file then don't use grep, use awk.
git grep
Here is the syntax using git grep with multiple patterns:
git grep --all-match --no-index -l -e string1 -e string2 -e string3 file
You may also combine patterns with Boolean expressions such as --and, --or and --not.
Check man git-grep for help.
--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.
--no-index Search files in the current directory that is not managed by Git.
-l/--files-with-matches/--name-only Show only the names of files.
-e The next parameter is the pattern. Default is to use basic regexp.
Other params to consider:
--threads Number of grep worker threads to use.
-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.
To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and other.
Related:
How to grep for two words existing on the same line?
Check if all of multiple strings or regexes exist in a file
How to run grep with multiple AND patterns? & Match all patterns from file at once
For OR operation, see:
How do I grep for multiple patterns with pattern having a pipe character?
Grep: how to add an “OR” condition?
Found lines that only starts with 6 spaces and finished with:
cat my_file.txt | grep
-e '^ .*(\.c$|\.cpp$|\.h$|\.log$|\.out$)' # .c or .cpp or .h or .log or .out
-e '^ .*[0-9]\{5,9\}$' # numers between 5 and 9 digist
> nolog.txt
Let's say we need to find count of multiple words in a file testfile.
There are two ways to go about it
1) Use grep command with regex matching pattern
grep -c '\<\(DOG\|CAT\)\>' testfile
2) Use egrep command
egrep -c 'DOG|CAT' testfile
With egrep you need not to worry about expression and just separate words by a pipe separator.
grep ‘string1\|string2’ FILENAME
GNU grep version 3.1
Place the strings you want to grep for into a file
echo who > find.txt
echo Roger >> find.txt
echo [44][0-9]{9,} >> find.txt
Then search using -f
grep -f find.txt BIG_FILE_TO_SEARCH.txt
grep '(string1.*string2 | string2.*string1)' filename
will get line with string1 and string2 in any order
for multiline match:
echo -e "test1\ntest2\ntest3" |tr -d '\n' |grep "test1.*test3"
or
echo -e "test1\ntest5\ntest3" >tst.txt
cat tst.txt |tr -d '\n' |grep "test1.*test3\|test3.*test1"
we just need to remove the newline character and it works!
You should have grep like this:
$ grep 'string1' file | grep 'string2'
I often run into the same problem as yours, and I just wrote a piece of script:
function m() { # m means 'multi pattern grep'
function _usage() {
echo "usage: COMMAND [-inH] -p<pattern1> -p<pattern2> <filename>"
echo "-i : ignore case"
echo "-n : show line number"
echo "-H : show filename"
echo "-h : show header"
echo "-p : specify pattern"
}
declare -a patterns
# it is important to declare OPTIND as local
local ignorecase_flag filename linum header_flag colon result OPTIND
while getopts "iHhnp:" opt; do
case $opt in
i)
ignorecase_flag=true ;;
H)
filename="FILENAME," ;;
n)
linum="NR," ;;
p)
patterns+=( "$OPTARG" ) ;;
h)
header_flag=true ;;
\?)
_usage
return ;;
esac
done
if [[ -n $filename || -n $linum ]]; then
colon="\":\","
fi
shift $(( $OPTIND - 1 ))
if [[ $ignorecase_flag == true ]]; then
for s in "${patterns[#]}"; do
result+=" && s~/${s,,}/"
done
result=${result# && }
result="{s=tolower(\$0)} $result"
else
for s in "${patterns[#]}"; do
result="$result && /$s/"
done
result=${result# && }
fi
result+=" { print "$filename$linum$colon"\$0 }"
if [[ ! -t 0 ]]; then # pipe case
cat - | awk "${result}"
else
for f in "$#"; do
[[ $header_flag == true ]] && echo "########## $f ##########"
awk "${result}" $f
done
fi
}
Usage:
echo "a b c" | m -p A
echo "a b c" | m -i -p A # a b c
You can put it in .bashrc if you like.
grep -i -w 'string1\|string2' filename
This works for exact word match and matching case insensitive words ,for that -i is used
When the both strings are in sequence then put a pattern in between on grep command:
$ grep -E "string1(?.*)string2" file
Example if the following lines are contained in a file named Dockerfile:
FROM python:3.8 as build-python
FROM python:3.8-slim
To get the line that contains the strings: FROM python and as build-python then use:
$ grep -E "FROM python:(?.*) as build-python" Dockerfile
Then the output will show only the line that contain both strings:
FROM python:3.8 as build-python
If git is initialized and added to the branch then it is better to use git grep because it is super fast and it will search inside the whole directory.
git grep 'string1.*string2.*string3'
searching for two String and highlight only string1 and string2
grep -E 'string1.*string2|string2.*string1' filename | grep -E 'string1|string2'
or
grep 'string1.*string2\|string2.*string1' filename | grep -E 'string1\|string2'
ripgrep
Here is the example using rg:
rg -N '(?P<p1>.*string1.*)(?P<p2>.*string2.*)' file.txt
It's one of the quickest grepping tools, since it's built on top of Rust's regex engine which uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.
Use it, especially when you're working with a large data.
See also related feature request at GH-875.