Search for substring matches in a file bash - regex

The premise is to store a database file of colon separated values representing items.
var1:var2:var3:var4
I need to sort through this file and extract the lines where any of the values match a search string.
For example
Search for "Help"
Hey:There:You:Friends
I:Kinda:Need:Help (this line would be extracted)
I'm using a function to pass in the search string, and then passing the found lines to another function to format the output. However, I can't seem to get the format right when passing. Here is sample code I've tried, based on different approaches I've found on this site, but none of them seem to be working for me.
#Option 1, it doesn't ever find matches
function retrieveMatch {
if [ -n "$1" ]; then
while read line; do
if [[ *"$1"* =~ "$line" ]]; then
formatPrint "$line"
fi
done
fi
}
#Option 2, it gets all the matches, but then passes the value in a
#format different than a file? At least it seems to...
function retrieveMatch {
if [ -n "$1" ]; then
formatPrint `cat database.txt | grep "$1"`
fi
}
function formatPrint {
list="database.txt" #default file for printing all info
if [ -n "$1" ]; then
list="$1"
fi
IFS=':'
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
}
I can't seem to get the first one to find any matches.
The second option gets the right values, but when I try to formatPrint, it throws an error saying that the list of values passed in is not a directory.

Honestly, I'd replace the whole thing with
function retrieveMatch {
grep "$1" | tr ':' '\n'
}
To be called as
retrieveMatch Help < filename
...just as the original function (Option 1) appears to have been designed to be used. To do more complicated things with matching lines, have a look at awk:
# in the awk script, the fields in the line will be $1, $2 etc.
awk -v pattern="$1" -F : '$0 ~ pattern { for (i = 1; i <= NF; ++i) print $i }'
Awk is made to process exactly this sort of data, so if you plan to do more complex things with it, it is definitely worth a look.
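For instance, a minimal sketch of how retrieveMatch could wrap that awk call, keeping the same calling convention (database.txt is the file from the question):
function retrieveMatch {
    # split each matching line on ':' and print one field per line
    awk -v pattern="$1" -F : '$0 ~ pattern { for (i = 1; i <= NF; ++i) print $i }'
}
retrieveMatch Help < database.txt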
Answering the question more directly, there are two or three problems in your code. One is, as was pointed out in the comments to the question, that the line
if [[ *"$1"* =~ "$line" ]]; then
will try to use "$line" as a regular expression to find a match in *"$1"* (and that's assuming *"$1"* doesn't expand to more than one word, since the * are unquoted). Assuming the * are meant to match anything, the way they would in a glob expression (but not in a regular expression), this could be replaced with
if [[ "$line" =~ "$1" ]]; then
because =~ will report a match if the regex matches any part of the string.
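Put together, a minimal sketch of Option 1 with the operands swapped (read -r added so backslashes aren't mangled; it still assumes formatPrint can accept a line, which is the next problem):
function retrieveMatch {
    if [ -n "$1" ]; then
        while read -r line; do
            # matches if the search string occurs anywhere in the line
            if [[ "$line" =~ "$1" ]]; then
                formatPrint "$line"
            fi
        done
    fi
}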
The second problem is that you're divided on whether you want "$list" in formatPrint to be a file or a line. You say in retrieveMatch that it should be a line:
formatPrint "$line"
But you set it to a filename default in formatPrint:
list="database.txt" #default file for printing all info
You'll have to decide on one. If you decide that formatPrint should format lines, then the third problem is that the redirection in
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
tries to use "$list" as a filename. This could be fixed by replacing the last line with
done <<< "$list" # using a here-string (bash-specific)
Or
done <<EOF
$list
EOF
(note: in the latter case, do not indent the code; it's a here-document that's taken verbatim). And, of course, read will only split into four fields the way you wrote it; anything after the fourth colon ends up in $var4.
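If you go the line route, a minimal sketch of formatPrint taking a single line as its argument (here-string form; the while loop disappears because there is only one line to split):
function formatPrint {
    local var1 var2 var3 var4
    IFS=':' read -r var1 var2 var3 var4 <<< "$1"
    echo "$var1"
    echo "$var2"
    echo "$var3"
    echo "$var4"
}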

I feel I must be missing something, but..
cat > foo.txt
Hey:There:You:Friends
I:Kinda:Need:Help
Foo:Bar
[Give control-D]
grep -i help foo.txt
I:Kinda:Need:Help
Does it fit the bill?
EDIT: To expand a little further on this thought..
cat > foo.bsh
#!/bin/bash
hits="$(grep -i help foo.txt)"
while read -r line; do
echo "${line}"
done <<< "$hits"
[Give control-D]
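And if each hit should also be broken up on the colons, as in the question's formatPrint, a hedged variation of that loop:
while IFS=':' read -r -a fields; do
    printf '%s\n' "${fields[@]}"   # one field per line
done <<< "$hits"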

Related

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is that the pattern in the if statement doesn't match any of the words in the text file; it goes straight to the else branch. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0, when it should be some positive value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
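To make the distinction concrete (a small illustration, not from the original post):
[ "$w" = "word" ]          # POSIX test: plain string equality only
[[ $w = *[aeiouAEIOU]* ]]  # bash [[ ]]: the right-hand side is a glob pattern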
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[@]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record, and awk reads the input one record at a time. (Note that both a regular-expression RS and ENDFILE are GNU awk features.)
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=`cat "$file" | xargs -n1 | grep -ie "[aeiou].*[aeiou]" | wc -l`
wordcount=`expr $wordcount + $count`
done
echo $wordcount

Find regular expression in a file matching a given value

I have some basic knowledge on using regular expressions with grep (bash).
But I want to use regular expressions the other way around.
For example I have a file containing the following entries:
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
Now I want to use bash to figure out to which line a particular number matches.
For example:
grep 8 file
should return:
line_three=[7-9]
Note: I am aware that the example of "grep 8 file" doesn't make sense, but I hope it helps to understand what I am trying to achieve.
Thanks for your help,
Marcel
As others have pointed out, awk is the right tool for this:
awk -F'=' '8~$2{print $0;}' file
... and if you want this tool to feel more like grep, a quick bash wrapper:
#!/bin/bash
awk -F'=' -v seek_value="$1" 'seek_value~$2{print $0;}' "$2"
Which would run like:
./not_exactly_grep.sh 8 file
line_three=[7-9]
My first impression is that this is not a task for grep, maybe for awk.
Trying to do things with grep, I only see this:
for line in $(cat file); do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done
Using a while loop to read the file (following the comments):
while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
This can be done in native bash using the syntax [[ $value =~ $regex ]] to test:
find_regex_matching() {
local value=$1
while IFS= read -r line; do # read from input line-by-line
[[ $line = *=* ]] || continue # skip lines not containing an =
regex=${line#*=} # prune everything before the = for the regex
if [[ $value =~ $regex ]]; then # test whether we match...
printf '%s\n' "$line" # ...and print if we do.
fi
done
}
...used as:
find_regex_matching 8 <file
...or, to test it with your sample input inline:
find_regex_matching 8 <<'EOF'
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
EOF
...which properly emits:
line_three=[7-9]
You could replace printf '%s\n' "$line" with printf '%s\n' "${line%%=*}" to print only the key (contents before the =), if so inclined. See the bash-hackers page on parameter expansion for a rundown on the syntax involved.
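As a quick illustration of those expansions against the sample data:
line='line_three=[7-9]'
printf '%s\n' "${line%%=*}"   # line_three  (the key before the =)
printf '%s\n' "${line#*=}"    # [7-9]       (the regex after the =)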
This is not built-in functionality of grep, but it's easy to do with awk, with a change in syntax:
/[0-3]/ { print "line one" }
/[4-6]/ { print "line two" }
/[7-9]/ { print "line three" }
If you really need to, you could programmatically change your input file to this syntax, if it doesn't contain any characters that need escaping (mainly / in the regex or " in the string):
sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#'
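One way to wire the two together, sketched with process substitution (assuming bash, with the sample input saved as file):
echo 8 | awk -f <(sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#' file)
For the sample input this should print line_three.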
As I understand it, you are looking for a range that includes some value.
You can do this in gawk:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
$ awk -v n=8 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[7-9]
Since the digits are being treated as numbers (vs a regex) it supports larger ranges:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[75-95]
line_four=[55-105]
$ awk -v n=92 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[75-95]
line_four=[55-105]
If you are just looking to interpret the right hand side of the = as a regex, you can do:
$ awk -F= -v tgt=8 'tgt~$2' /tmp/file
You would like to do something like
grep -Ef <(cut -d= -f2 file) <(echo 8)
This will grep what you want but will not show where it matched.
To also show where it matched, you can have sed print a message:
echo "8" | sed -n '/[7-9]/ s/.*/Found it in line_three/p'
Now you would like to transform your regexp file into such commands:
sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file
Store these commands in a virtual command file and you will have
echo "8" | sed -nf <(sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file)

Find a string in a file name (shell script)

I am trying to use regex to match a file name and extract only a portion of the file name. My file names have this pattern: galax_report_for_Sample11_8757.xls, and I want to extract the string Sample11 in this case. I have tried the following regex, but it does not work for me, could someone help with the correct regex?
name=galax_report_for_Sample11_8757.xls
sampleName=$([[ "$name" =~ ^[^_]+_([^_]+) ]] && echo ${BASH_REMATCH[2]})
edit:
just found this works for me:
sampleName=$([[ "$name" =~ ^[^_]+_([^_]+)_([^_]+)_([^_]+) ]] && echo ${BASH_REMATCH[3]})
In a simple case like this, where you essentially have just a list of values separated by a single instance of a separator character each, consider using cut to extract the field of interest:
sampleName=$(echo 'galax_report_for_Sample11_8757.xls' | cut -d _ -f 4)
If you're using bash or zsh or ksh, you can make it a little more efficient:
sampleName=$(cut -d _ -f 4 <<< 'galax_report_for_Sample11_8757.xls')
Here is a slightly shorter alternative to the approach you used:
sampleName=$([[ "$name" =~ ^([^_]+_){3}([^_]+) ]] && echo ${BASH_REMATCH[2]})

bash regular expression test: if vs grep

I need to scan each line of a file looking for any characters above hex \x7E. The file has several million rows, so improving efficiency would be great. So far, reading each line in a while loop, this works and finds lines with invalid characters:
echo "$line" | grep -P "[\x7F-\xFF]" > /dev/null 2>&1
if [ $? -eq 0 ]; then...
But this doesn't:
if [[ "$line" =~ [\x7F-\xFF] ]]; then...
I'm assuming it would be more efficient the second way, if I could get it to work. What am I missing?
If you're interested in efficiency, you shouldn't write your loop in bash. You should rethink your program in terms of pipes and use efficient tools.
That said, you can do this with
LC_CTYPE=C LC_COLLATE=C
if [[ "$line" =~ [$'\x7f'-$'\xff'] ]]
then
echo "It contains bytes \x7F or up"
fi
I basically have to split the file. Valid records go to one file, invalid records go to another.
sed -n '/[^\x0-\x7e]/w badrecords
//! w goodrecords'
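If you do stay in bash, a hedged sketch of that split using the [[ test from above (goodrecords/badrecords mirror the sed version, and file is a placeholder for your input; expect this to be far slower than sed, grep or perl on millions of rows):
LC_CTYPE=C LC_COLLATE=C
: > goodrecords   # truncate the output files
: > badrecords
while IFS= read -r line; do
    if [[ $line =~ [$'\x7f'-$'\xff'] ]]; then
        printf '%s\n' "$line" >> badrecords
    else
        printf '%s\n' "$line" >> goodrecords
    fi
done < file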
If you're already using Perl regular expressions, you might as well use perl for the task:
perl -ne '
if (/[\x7F-\xFF]/) {print STDERR $_} else {print}
' file > valid 2> invalid
I'd bet that's faster than a bash loop.
I suspect this would be more efficient, even though it processes the file twice:
grep -P "[\x7F-\xFF]" file > invalid
grep -vP "[\x7F-\xFF]" file > valid
You'd want to write your grep code as
if grep -qP "[\x7F-\xFF]" <<< "$line"; then...

How can I assign the match of my regular expression to a variable?

I have a text file with various entries in it. Each entry is ended by a line containing all asterisks.
I'd like to use shell commands to parse this file and assign each entry to a variable. How can I do this?
Here's an example input file:
***********
Field1
***********
Lorem ipsum
Data to match
***********
More data
Still more data
***********
Here is what my solution looks like so far:
#!/bin/bash
for error in `python example.py | sed -n '/.*/,/^\**$/p'`
do
echo -e $error
echo -e "\n"
done
However, this just assigns each word in the matched text to $error, rather than a whole block.
I'm surprised not to see a native bash solution here. Yes, bash has regular expressions. You can find plenty of documentation online, particularly if you include "bash_rematch" in your query, or just look at the man pages. Here's a silly example, adapted from one found online, which prints the whole match, and each of the captured matches, for a regular expression.
if [[ $str =~ $regex ]]; then
echo "$str matches"
echo "matching substring: ${BASH_REMATCH[0]}"
i=1
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo " capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
else
echo "$str does not match"
fi
The important bit is that the extended test [[ ... ]] using its regex comparison =~ stores the entire match in ${BASH_REMATCH[0]} and the captured matches in ${BASH_REMATCH[i]}.
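For instance, a minimal way to exercise that and assign a captured group to a variable (str and regex here are made-up values, not from the question):
str='Field1: 42'
regex='^([[:alnum:]]+): ([[:digit:]]+)$'
if [[ $str =~ $regex ]]; then
    name=${BASH_REMATCH[1]}    # Field1
    value=${BASH_REMATCH[2]}   # 42
    echo "$name -> $value"
fi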
If you want to do it in Bash, you could do something like the following. It uses globbing instead of regexps (the extglob shell option enables extended pattern matching, so that we can match a line consisting only of asterisks).
#!/bin/bash
shopt -s extglob
entry=""
while read line
do
case $line in
+(\*))
# do something with $entry here
entry=""
;;
*)
entry="$entry$line
"
;;
esac
done
Try putting double quotes around the command.
#!/bin/bash
for error in "`python example.py | sed -n '/.*/,/^\**$/p'`"
do
echo -e $error
echo -e "\n"
done
Depending on what you want to do with the variables:
awk '
f && /\*/{print "variable:"s;f=0}
/\*/{ f=1 ;s="";next}
f{
s=s" "$0
}' file
output:
# ./test.sh
variable: Field1
variable: Lorem ipsum Data to match
variable: More data Still more data
The above just prints them out. If you want, store the records in an array for later use, e.g. array[++d]=s.
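For example, a hedged sketch of that array variant, collecting each record and printing them all again in an END block:
awk '
f && /\*/ { a[++d] = s; f = 0 }
/\*/      { f = 1; s = ""; next }
f         { s = s " " $0 }
END       { for (i = 1; i <= d; i++) print "variable:" a[i] }
' file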
Splitting records in (ba)sh is not so easy, but can be done using IFS to split on single characters (simply set IFS='*' before your for loop, but this generates multiple empty records and is problematic if any record contains a '*'). The obvious solution is to use perl or awk and use RS to split your records, since those tools provide better mechanisms for splitting records. A hybrid solution is to use perl to do the record splitting, and have perl call your bash function with the record you want. For example:
#!/bin/bash
foo() {
echo record start:
echo "$#"
echo record end
}
export -f foo
perl -e "$/='********'; while(<>){chomp;system( \"foo '\$_'\" )}" << 'EOF'
this is a 2-line
record
********
the 2nd record
is 3 lines
long
********
a 3rd * record
EOF
This gives the following output:
record start:
this is a 2-line
record
record end
record start:
the 2nd record
is 3 lines
long
record end
record start:
a 3rd * record
record end