Pattern matching in if statement in bash - regex

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is a sample from a .txt file:
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is that it doesn't match the pattern in the if statement for any of the words in the text file; it goes straight to the else branch. Secondly, the wordcount in echo $i ':' $wordcount is 0 when it should be some nonzero value.

Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
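For reference, a minimal POSIX sh sketch of the comparison that note is about (the string "hello" is just a placeholder):
# POSIX test/[ only specifies = for string comparison; == is an extension.
if [ "$w" = "hello" ]; then
    echo "strings are equal"
fi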
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is inherently unsafe. Use read -a to read a line into an array of words. Also, cat $i | while read ... runs the loop in a subshell, so any increments to wordcount are lost when the pipeline ends (which is why your per-file count is 0); redirecting the file into the loop with done <"$i" avoids that:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
    while read -r -a words; do
        for word in "${words[@]}"; do
            if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
                (( ++wordcount ))
            fi
        done
    done <"$i"
    printf '%s: %s\n' "$i" "$wordcount"
    wordcount=0
done

Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace character as the record separator, so each word becomes its own record; awk reads the input one record at a time. (Both ENDFILE and a regex-valued RS are GNU awk features.)
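As a quick illustration of that record splitting (the sample input here is made up for the demo):
printf 'one two\nthree\n' | awk '{print NR": "$0}' RS='[[:space:]]'
# 1: one
# 2: two
# 3: three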
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.

Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=$(xargs -n1 < "$file" | grep -ie "[aeiou].*[aeiou]" | wc -l)
wordcount=$(expr "$wordcount" + "$count")
done
echo $wordcount

Related

Extracting part of path containing a number in bash

In bash, given a path such as:
mypath='my/path/to/version/5e/is/7/here'
I would like to extract the first part that contains a number. For the example I would want to extract: 5e
Is there a better way than looping over the parts using while and checking each part for a number?
while IFS=/ read part
do
if [[ $part =~ *[0-9]* ]]; then
echo "$part"
fi
done <<< "$mypath"
Using Bash's regex:
[[ "$mypath" =~ [^/]*[0-9]+[^/]* ]] && echo "${BASH_REMATCH[0]}"
5e
Method using 'grep -o'.
echo "$mypath" | grep -o -E '\b[^/]*[0-9][^/]*\b' | head -1
Replace each / with a newline, then grab the first line that contains a number:
mypath='my/path/to/version/5e/is/7/here'
<<<"${mypath//\//$'\n'}" grep -m1 '[0-9]'
and a safer alternative that uses a zero-separated stream with GNU tools, in case there are newlines in the path:
<<<"${mypath}" tr '/' '\0' | grep -z -m1 '[0-9]'
Is there a better way than looping over the parts using while and checking each part for a number?
No, one way or another you have to loop through all the parts until the first part with a number is found. The loop may be hidden behind other tools, but it's still going to loop through the parts. Your solution seems pretty good by itself; just break after you've found the first matching part if you only want the first.
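A minimal sketch of that idea, splitting the path on / into an array and stopping at the first part that contains a digit:
mypath='my/path/to/version/5e/is/7/here'
IFS=/ read -r -a parts <<< "$mypath"
for part in "${parts[@]}"; do
    if [[ $part =~ [0-9] ]]; then
        echo "$part"
        break   # stop after the first matching part
    fi
done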
Could you please try the following, written and tested with the shown samples. It will also print multiple matching parts if a line has more than one. As for a better way: awk should be faster than a pure bash loop + regex solution IMHO, so adding it here.
awk -F'/' '
{
val=""
for(i=1;i<=NF;i++){
if($i~/[0-9][a-zA-Z]/ || $i~/[a-zA-Z][0-9]/){
val=(val?val OFS:"")$i
}
}
print val
}' Input_file
Explanation: Adding detailed explanation for above.
awk -F'/' ' ##Starting awk program from here and setting field separator as / here.
{
val="" ##Nullifying val here.
for(i=1;i<=NF;i++){ ##Running for loop till value of NF.
if($i~/[0-9][a-zA-Z]/ || $i~/[a-zA-Z][0-9]/){ ##Checking condition if field value is matching regex of digit alphabet then do following.
val=(val?val OFS:"")$i ##Creating variable val where keep on adding current field value in it.
}
}
print val ##Printing val here.
}' Input_file ##Mentioning Input_file name here.
Using Perl:
mypath='my/path/to/version/5e/is/7/here'
# Method 1 (using for loop):
echo "${mypath}" | perl -F'/' -lane 'for my $dir ( #F ) { next unless $dir =~ /\d/; print $dir; last; }'
# Method 2 (using grep):
echo "${mypath}" | perl -F'/' -lane 'my $dir = ( grep { /\d/ } #F )[0]; print $dir if defined $dir;'
# Prints:
# 5e
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/' : Split into @F on /, rather than on whitespace.
next unless $dir =~ /\d/; : skip the rest of the loop if the current part of the path does not contain a digit (\d).
last; : exit the loop (here, it also exits the script), so that it prints only the first occurrence of the matching directory.
grep { ... } LIST : for the LIST argument, returns the list of elements for which the expression ... is true, here returns the list of all path elements that have a digit.
(LIST)[0] : returns the first element of the LIST, here, the first path element with a digit.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
With awk, set RS to / and print the first record containing a number.
awk -v RS=/ '/[0-9]/{print;exit}' <<< "$mypath"
5e
Another bash variant
mypath='my/path/to/app version/5e/is/7/here'
until [[ ${mypath:0:1} =~ [0-9] ]]; do
mypath=${mypath#*/}
done
echo ${mypath%%/*}

Search for substring matches in a file bash

The premise is to store a database file of colon separated values representing items.
var1:var2:var3:var4
I need to sort through this file and extract the lines where any of the values match a search string.
For example
Search for "Help"
Hey:There:You:Friends
I:Kinda:Need:Help (this line would be extracted)
I'm using a function to pass in the search string, and then passing the found lines to another function to format the output. However, I can't seem to get the format right when passing. Here is sample code I've tried, using different approaches I've found on this site, but they don't seem to be working for me:
#Option 1, it doesn't ever find matches
function retrieveMatch {
if [ -n "$1" ]; then
while read line; do
if [[ *"$1"* =~ "$line" ]]; then
formatPrint "$line"
fi
done
fi
}
#Option 2, it gets all the matches, but then passes the value in a
#format different than a file? At least it seems to...
function retrieveMatch {
if [ -n "$1" ]; then
formatPrint `cat database.txt | grep "$1"`
fi
}
function formatPrint {
list="database.txt" #default file for printing all info
if [ -n "$1" ]; then
list="$1"
fi
IFS=':'
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
}
I can't seem to get the first one to find any matches
The second option gets the right values, but when I try to formatPrint, it throws an error saying that the list of values passed in is not a file or directory.
Honestly, I'd replace the whole thing with
function retrieveMatch {
grep "$1" | tr ':' '\n'
}
To be called as
retrieveMatch Help < filename
...like the original function (Option 1) appeared to be designed. To do more complicated things with matching lines, have a look at awk:
# in the awk script, the fields in the line will be $1, $2 etc.
awk -v pattern="$1" -F : '$0 ~ pattern { for(i = 1; i <= NF; ++i) print $i }'
Awk is made to process exactly this sort of data, so if you plan to do complex things with it, it is definitely worth a look.
Answering the question more directly, there are two/three problems in your code. One is, as was pointed out in the comments to the question, that the line
if [[ *"$1"* =~ "$line" ]]; then
will try to use "$line" as a regular expression to find a match in the string *"$1"*; the operands are reversed, since the right-hand side of =~ is the pattern. Assuming the * are meant to match anything, the way they would in a glob expression (but not in a regular expression), this could be replaced with
if [[ "$line" =~ "$1" ]]; then
because =~ will report a match if the regex matches any part of the string.
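For example (a quick check with the sample data from the question):
line='I:Kinda:Need:Help'
[[ $line =~ Help ]] && echo "match"    # =~ matches anywhere in the string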
The second problem is that you're divided on whether you want "$list" in formatPrint to be a file or a line. You say in retrieveMatch that it should be a line:
formatPrint "$line"
But you set it to a filename default in formatPrint:
list="database.txt" #default file for printing all info
You'll have to decide on one. If you decide that formatPrint should format lines, then the third problem is that the redirection in
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
tries to use "$list" as a filename. This could be fixed by replacing the last line with
done <<< "$list" # using a here-string (bash-specific)
Or
done <<EOF
$list
EOF
(note: in the latter case, do not indent the code; it's a here-document that's taken verbatim). And, of course, read will only split four fields the way you wrote it.
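To illustrate that last point (a small sketch with a made-up sample line): with four variables everything past the third separator lands in the last one, while read -a handles any number of fields:
IFS=: read -r var1 var2 var3 var4 <<< 'a:b:c:d:e'
echo "$var4"                  # prints d:e
IFS=: read -r -a fields <<< 'a:b:c:d:e'
printf '%s\n' "${fields[@]}"  # one field per line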
I feel I must be missing something, but..
cat > foo.txt
Hey:There:You:Friends I:Kinda:Need:Help
Foo:Bar
[Give control-D]
grep -i help foo.txt
Hey:There:You:Friends I:Kinda:Need:Help
Does it fit the bill?
EDIT: To expand a little further on this thought..
cat > foo.bsh
#!/bin/bash
hits="$(grep -i help foo.txt)"
while read -r line; do
echo "${line}"
done <<< "$hits"
[Give control-D]

Regex with fswatch - Exclude files not ending with ".txt"

For a list of files, I'd like to match the ones not ending with .txt. I am currently using this expression:
.*(txt$)|(html\.txt$)
This expression will match everything ending in .txt, but I'd like it to do the opposite.
Should match:
happiness.html
joy.png
fear.src
Should not match:
madness.html.txt
excitement.txt
I'd like to get this so I can use it in pair with fswatch:
fswatch -0 -e 'regex here' . | xargs -0 -n 1 -I {} echo "{} has been changed"
The problem is it doesn't seem to work.
PS: I use the tag bash instead of fswatch because I don't have enough reputation points to create it. Sorry!
Try using a lookbehind, like this:
.*$(?<!\.txt)
Basically, this matches any line of text so long as the last 4 characters are not ".txt".
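If you want to sanity-check the expression outside fswatch, grep -P (where available) understands the same lookbehind syntax; for example:
printf '%s\n' happiness.html madness.html.txt | grep -P '.*$(?<!\.txt)'
# prints only: happiness.html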
You can use Negative Lookahead for this purpose.
^(?!.*\.txt).+$
You can use this expression with grep using option -P:
grep -Po '^(?!.*\.txt).+$' file
Since the question is tagged bash, lookaheads may not be supported (except with grep -P), so here is a grep solution that doesn't need lookaheads:
grep -v '\.txt$' file
happiness.html
joy.png
fear.src
EDIT: You can use this xargs command to avoid matching *.txt files:
xargs -0 -n 1 -I {} bash -c '[[ "{}" != *".txt" ]] && echo "{} has been changed"'
It really depends what regular expression tool you are using. Many tools provide a way to invert the sense of a regex. For example:
bash
# succeeds if filename ends with .txt
[[ $filename =~ "."txt$ ]]
# succeeds if filename does not end with .txt
! [[ $filename =~ "."txt$ ]]
# another way of writing the negative
[[ ! $filename =~ "."txt$ ]]
grep
# succeeds if filename ends with .txt
egrep -q "\.txt$" <<<"$filename"
# succeeds if filename does not end with .txt
egrep -qv "\.txt$" <<<"$filename"
awk
/\.txt$/ { print "line ends with .txt" }
! /\.txt$/ { print "line doesn't end with .txt" }
$1 ~ /\.txt$/ { print "first field ends with .txt" }
$1 !~ /\.txt$/ { print "first field doesn't end with .txt" }
For the adventurous, a POSIX ERE which will work in any POSIX-compatible regex engine:
/[^t]$|[^x]t$|[^t]xt$|[^.]txt$/
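A quick check of that expression against the sample names with grep -E (any POSIX ERE tool should behave the same):
printf '%s\n' happiness.html joy.png fear.src madness.html.txt excitement.txt |
    grep -E '[^t]$|[^x]t$|[^t]xt$|[^.]txt$'
# happiness.html
# joy.png
# fear.src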

bash regular expression test: if vs grep

I need to scan each line of a file looking for any characters above hex \x7E. The file has several million rows, so improving efficiency would be great. So far, reading each line in a while loop, this works and finds lines with invalid characters:
echo "$line" | grep -P "[\x7F-\xFF]" > /dev/null 2>&1
if [ $? -eq 0 ]; then...
But this doesn't:
if [[ "$line" =~ [\x7F-\xFF] ]]; then...
I'm assuming it would be more efficient the second way, if I could get it to work. What am I missing?
If you're interested in efficiency, you shouldn't write your loop in bash. You should rethink your program in terms of pipes and use efficient tools.
That said, you can do this with
LC_CTYPE=C LC_COLLATE=C
if [[ "$line" =~ [$'\x7f'-$'\xff'] ]]
then
echo "It contains bytes \x7F or up"
fi
I basically have to split the file. Valid records go to one file, invalid records go to another.
sed -n '/[^\x0-\x7e]/w badrecords
//! w goodrecords'
If you're already using Perl regular expressions, you might as well use perl for the task:
perl -ne '
if (/[\x7F-\xFF]/) {print STDERR $_} else {print}
' file > valid 2> invalid
I'd bet that's faster than a bash loop.
I suspect this would be more efficient, even though it processes the file twice:
grep -P "[\x7F-\xFF]" file > invalid
grep -vP "[\x7F-\xFF]" file > valid
You'd want to write your grep code as
if grep -qP "[\x7F-\xFF]" <<< "$line"; then...

Bash regex match spanning multiple lines

I'm trying to create a bash script that validates files. One of the requirements is that there has to be exactly one "2" in the file.
Here's my code at the moment:
regex1="[0-9b]*2[0-9b]*2[0-9b]*"
# This regex will match if there are at least two 2's in the file
if [[ ( $(cat "$file") =~ $regex1 ) ]]; then
# stuff to do when there's more than 1 "2"
fi
#...
regex2="^[013456789b]*$"
# This regex will match if there are no 2's in the file
if [[ ( $(cat "$file") =~ $regex2 ) ]]; then
# stuff to do when there are no 2's
fi
What I'm trying to do is match the following pieces:
654654654654
254654845845
845462888888
(because there are 2 2's in there, it should be matched)
987886546548
546546546848
654684546548
(because there are no 2's in there, it should be matched)
Any idea how I make it search all lines with the =~ operator?
I'm trying to create a bash script that validates files. One of the
requirements is that there has to be exactly one "2" in the file.
Try using grep
#!/bin/bash
file='input.txt'
n=$(grep -o '2' "$file" | wc -l)
# echo $n
if [[ $n -eq 1 ]]; then
echo 'Valid'
else
echo 'Invalid'
fi
How about this:
twocount=$(tr -dc '2' < input.txt | wc -c)   # tr reads stdin, so redirect the file in
if (( twocount != 1 ))
then
    # there was either no 2, or more than one 2
    echo 'Invalid'
else
    # exactly one 2
    echo 'Valid'
fi
Using anchors as you've been, match a string of non-2s, a 2, and another string of non-2s.
^[^2]*2[^2]*$
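Plugging that into the original test (a sketch using the question's $(cat "$file") approach; note that [^2] also matches newlines, so the anchors cover the whole file):
regex="^[^2]*2[^2]*$"
if [[ $(cat "$file") =~ $regex ]]; then
    echo "exactly one 2 in the file"
fi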
Multiline regex match is indeed possible using awk with null record separator.
Consider below code:
awk '$0 ~ /^.*2.*2/ || $0 ~ /^[013456789]*$/' RS= file
654654654654
254654845845
845462888888
Take note of RS=, which puts awk into paragraph mode: multiple lines are joined into a single record $0 until a blank line (a double newline) is hit.