bash regular expression test: if vs grep - regex

I need to scan each line of a file looking for any characters above hex \x7E. The file has several million rows, so improving efficiency would be great. So far, reading each line in a while loop, this works and finds lines with invalid characters:
echo "$line" | grep -P "[\x7F-\xFF]" > /dev/null 2>&1
if [ $? -eq 0 ]; then...
But this doesn't:
if [[ "$line" =~ [\x7F-\xFF] ]]; then...
I'm assuming it would be more efficient the second way, if I could get it to work. What am I missing?

If you're interested in efficiency, you shouldn't write your loop in bash. You should rethink your program in terms of pipes and use efficient tools.
That said, you can do this with
LC_CTYPE=C LC_COLLATE=C
if [[ "$line" =~ [$'\x7f'-$'\xff'] ]]
then
echo "It contains bytes \x7F or up"
fi

I basically have to split the file. Valid records go to one file, invalid records go to another.
sed -n '/[^\x0-\x7e]/w badrecords
//! w goodrecords'

If you're already using Perl regular expressions, you might as well use perl for the task:
perl -ne '
if (/[\x7F-\xFF]/) {print STDERR $_} else {print}
' file > valid 2> invalid
I'd bet that's faster than a bash loop.
I suspect this would be more efficient, even though it processes the file twice:
grep -P "[\x7F-\xFF]" file > invalid
grep -vP "[\x7F-\xFF]" file > valid
You'd want to write your grep code as
if grep -qP "[\x7F-\xFF]" <<< "$line"; then...
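For the split itself, here is a minimal sketch of the two-pass grep idea above that does not depend on -P. It assumes a grep that compares bytes when run with LC_ALL=C. The function name split_records and its argument order are made up for this example, and some older grep builds may also need -a to treat such input as text.

```shell
# Sketch: split a file into valid/invalid records with two grep passes.
# split_records and its argument order are hypothetical names for this example.
split_records() {
    # $1 = input file
    # $2 = output file for lines containing only bytes below 0x7F
    # $3 = output file for lines containing at least one byte in 0x7F-0xFF
    high_bytes=$(printf '[\177-\377]')   # bracket expression built from octal escapes
    LC_ALL=C grep -v -e "$high_bytes" "$1" > "$2"
    LC_ALL=C grep -e "$high_bytes" "$1" > "$3"
    return 0   # grep exits 1 when it selects no lines; that is not an error here
}
```

It would be called as split_records file valid invalid. The file is still read twice, but all the per-line work happens inside grep rather than in a shell loop.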

Related

How to pass regular expression matching string from a file in awk?

I have a requirement where I have to split a large file into small files. Each line of the large file containing the matching string should be put into another file with the output file name same as the matching string. For one string I can get it done via awk as shown below.
awk '/apple/{print}' large_file.txt > apple.txt
I want a script which takes the regular expression matching string from another file and puts the results into a file with the same name as the matching string. How to get it done with awk command?
Let's say the string to be matched is put into a file called matching_string.txt the contents of which would look like this:
apple
orange
mango
If the large_file.txt is something like:
apple is a great fruit
we should eat apple
orange is juicy
mango is the king of fruits
litchi is a seasonal fruit
then the resulting file should be
apple.txt:
apple is a great fruit
we should eat apple
orange.txt:
orange is juicy
mango.txt:
mango is the king of fruits
I am new to the Linux environment and beginner level at scripting. Any other solution using regular expression, sed, python etc. should be also okay.
EDIT
Working Script:
I tweaked my script a little based on the answer by @Stephen Quan; it works for the tcsh shell.
#!/bin/tcsh -f
foreach word ("`cat pattern.txt`")
if (-r ${word}.txt) then
rm -rf ${word}.txt
endif
awk "/${word}/ { print }" large.txt > ${word}.txt
end
Why use awk? Grep does the job too. Usually, awk '/pattern/{print}' can be replaced by the shorter grep -e 'pattern'.
pattern=apple
grep -e "$pattern" large.txt > "$pattern.txt"
Write a script or a shell function. For instance, a simple shell function can be defined ad-hoc and then called.
filter() { grep -e "$1" large.txt > "$1.txt"; }
for pattern in apple orange mango; do filter "$pattern"; done
As a shell script (e.g. filter.sh):
#!/bin/sh
grep -e "$1" large.txt > "$1.txt"
Needless to say, the script file must have the executable bit set, otherwise it cannot be executed (obviously).
Assuming your pattern file (e.g. pattern.txt) contains one pattern per line:
#!/bin/sh
while IFS= read -r pattern <&3; do
filter "$pattern"
# or: ./filter.sh "$pattern"
done 3< pattern.txt
All of that can be done without script or function if you simply want a one-shot task to be done (but defining and using the function is not really more complicated than calling its body directly):
while IFS= read -r pattern <&3; do
grep -e "$pattern" large.txt > "$pattern.txt"
done 3< pattern.txt
Note that a for loop cannot be used here, since your program will break as soon as one of your patterns contains space or tab characters.
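The loops above rescan large.txt once per pattern. If that becomes a bottleneck, a single-pass variant could look like the sketch below. The function name split_by_patterns and the outdir variable are invented for this example, and it assumes no pattern contains a / (the pattern is used as a file name) and that the pattern file is not empty.

```shell
# Sketch: read all patterns first, then scan the large file only once.
# split_by_patterns and outdir are hypothetical names for this example.
split_by_patterns() {
    # $1 = pattern file (one regex per line)
    # $2 = large input file
    # $3 = directory that receives <pattern>.txt
    awk -v outdir="$3" '
        # First file: remember every non-empty pattern.
        NR == FNR { if ($0 != "") patterns[++n] = $0; next }
        # Second file: append the line to the file of each pattern it matches.
        {
            for (i = 1; i <= n; i++)
                if ($0 ~ patterns[i])
                    print > (outdir "/" patterns[i] ".txt")
        }
    ' "$1" "$2"
}
```

Unlike the grep loop, this creates no output file for a pattern that never matches, and awk keeps one file open per matched pattern, which may hit the open-file limit with a very long pattern list.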
To do this in awk:
for word in $(cat matching_string.txt)
do
awk "/${word}/ { print }" large_file.txt > ${word}.txt
done
while IFS= read -r word
do
if [ -f ${word}.txt ]; then rm ${word}.txt; fi
awk "/${word}/ { print }" large_file.txt > ${word}.txt
done < matching_string.txt
The pattern is a regex pattern followed by a command. Note that when you get into regex-capture groups, you may find that the implementation of awk varies from one platform to another.
If it is a simplistic regex, I prefer perl because in cross-platform environments (particularly osx and git-bash on Windows), perl has a more consistent implementation for regex handling. In this case, the perl solution would be:
while IFS= read -r word
do
if [ -f ${word}.txt ]; then rm ${word}.txt; fi
perl -ne "if (/${word}/) { print }" < large_file.txt > ${word}.txt
done < matching_string.txt
I also wanted to demonstrate capture groups. In this case, it is a bit over-engineered to represent your line as 3 capture groups (prefix, word, postfix), but I do this because it serves as a template for you to create more complex regex capture group processing scenarios:
while IFS= read -r word
do
if [ -f ${word}.txt ]; then rm ${word}.txt; fi
perl -ne "if (/(.*)(${word})(.*)/) { print \"\$1\$2\$3\\n\" }" < large_file.txt > ${word}.txt
done < matching_string.txt
use grep -e pattern:
pattern=orange
grep -e "$pattern" large.txt > "$pattern.txt"
then use the read command to read all Patterns and generate all files:
filename='patternfile.txt'
while IFS= read -r pattern; do
grep -e "$pattern" large.txt > "$pattern.txt"
done < "$filename"

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is it doesn't match the pattern in the if statement for all the words in the text file. It goes directly to the else statement. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0 which should be some value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
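To make the difference concrete, here is a small sketch with both forms side by side. The function names are made up for this example. The regex is kept in a variable, which sidesteps quoting surprises on the right-hand side of =~.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: each returns 0 when $1 contains at least two vowels.

has_two_vowels_regex() {
    # =~ does an unanchored ERE match, so no leading or trailing .* is needed.
    local re='[aeiouAEIOU].*[aeiouAEIOU]'
    [[ $1 =~ $re ]]
}

has_two_vowels_glob() {
    # = does a glob match against the whole string; * plays the role of .*
    [[ $1 = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
}
```

With the sample words from the question, both helpers accept remove and sudo and reject PPAs and To.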
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[@]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
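If the work has to stay in bash, the same wildcard trap can be avoided by letting read split the line. A minimal sketch follows. The helper name split_words and the global array words are names chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch: split a line into words without triggering glob expansion.
split_words() {
    # Fills the global array "words"; read -a splits on IFS but never globs.
    IFS=$' \t\n' read -r -a words <<< "$1"
}

split_words '* Item One'
printf 'w=%s\n' "${words[@]}"
# prints:
# w=*
# w=Item
# w=One
```

The * stays a literal word here regardless of which files are in the current directory.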
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=`cat $file | xargs -n1 | grep -ie "[aeiou].*[aeiou]" | wc -l`
wordcount=`expr $wordcount + $count`
done
echo $wordcount

Find regular expression in a file matching a given value

I have some basic knowledge on using regular expressions with grep (bash).
But I want to use regular expressions the other way around.
For example I have a file containing the following entries:
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
Now I want to use bash to figure out to which line a particular number matches.
For example:
grep 8 file
should return:
line_three=[7-9]
Note: I am aware that the example of "grep 8 file" doesn't make sense, but I hope it helps to understand what I am trying to achieve.
Thanks for your help,
Marcel
As others have pointed out, awk is the right tool for this:
awk -F'=' '8~$2{print $0;}' file
... and if you want this tool to feel more like grep, a quick bash wrapper:
#!/bin/bash
awk -F'=' -v seek_value="$1" 'seek_value~$2{print $0;}' "$2"
Which would run like:
./not_exactly_grep.sh 8 file
line_three=[7-9]
My first impression is that this is not a task for grep, maybe for awk.
Trying to do things with grep I only see this:
for line in $(cat file); do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done
Using while for file reading (following comments):
while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
This can be done in native bash using the syntax [[ $value =~ $regex ]] to test:
find_regex_matching() {
local value=$1
while IFS= read -r line; do # read from input line-by-line
[[ $line = *=* ]] || continue # skip lines not containing an =
regex=${line#*=} # prune everything before the = for the regex
if [[ $value =~ $regex ]]; then # test whether we match...
printf '%s\n' "$line" # ...and print if we do.
fi
done
}
...used as:
find_regex_matching 8 <file
...or, to test it with your sample input inline:
find_regex_matching 8 <<'EOF'
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
EOF
...which properly emits:
line_three=[7-9]
You could replace printf '%s\n' "$line" with printf '%s\n' "${line%%=*}" to print only the key (contents before the =), if so inclined. See the bash-hackers page on parameter expansion for a rundown on the syntax involved.
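As a quick illustration of the two expansions mentioned above, the following sketch splits one sample line at its first =. The variable names key and regex are chosen for this example.

```shell
# Sketch: split "key=regex" at the first "=" using parameter expansion only.
line='line_three=[7-9]'
key=${line%%=*}     # remove the longest suffix starting at an "="  -> line_three
regex=${line#*=}    # remove the shortest prefix ending at an "="   -> [7-9]
printf 'key=%s regex=%s\n' "$key" "$regex"
# prints: key=line_three regex=[7-9]
```

Using %% for the key and a single # for the regex means a regex that itself contains = stays intact.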
This is not built-in functionality of grep, but it's easy to do with awk, with a change in syntax:
/[0-3]/ { print "line one" }
/[4-6]/ { print "line two" }
/[7-9]/ { print "line three" }
If you really need to, you could programmatically change your input file to this syntax, if it doesn't contain any characters that need escaping (mainly / in the regex or " in the string):
sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#'
As I understand it, you are looking for a range that includes some value.
You can do this in gawk:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
$ awk -v n=8 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[7-9]
Since the digits are being treated as numbers (vs a regex) it supports larger ranges:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[75-95]
line_four=[55-105]
$ awk -v n=92 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[75-95]
line_four=[55-105]
If you are just looking to interpret the right hand side of the = as a regex, you can do:
$ awk -F= -v tgt=8 'tgt~$2' /tmp/file
You would like to do something like
grep -Ef <(cut -d= -f2 file) <(echo 8)
This will grep what you want but will not display where.
With sed you can show a message:
echo "8" | sed -n '/[7-9]/ s/.*/Found it in line_three/p'
Now you would like to transfer your regexp file into such commands:
sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file
Store these commands in a virtual command file and you will have
echo "8" | sed -nf <(sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file)

Search for substring matches in a file bash

The premise is to store a database file of colon separated values representing items.
var1:var2:var3:var4
I need to sort through this file and extract the lines where any of the values match a search string.
For example
Search for "Help"
Hey:There:You:Friends
I:Kinda:Need:Help (this line would be extracted)
I'm using a function to pass in the search string, and then passing the found lines to another function to format the output. However, I can't seem to get the format right when passing them. Here is sample code showing the different approaches I've found on this site and tried, but they don't seem to work for me:
#Option 1, it doesn't ever find matches
function retrieveMatch {
if [ -n "$1" ]; then
while read line; do
if [[ *"$1"* =~ "$line" ]]; then
formatPrint "$line"
fi
done
fi
}
#Option 2, it gets all the matches, but then passes the value in a
#format different than a file? At least it seems to...
function retrieveMatch {
if [ -n "$1" ]; then
formatPrint `cat database.txt | grep "$1"`
fi
}
function formatPrint {
list="database.txt" #default file for printing all info
if [ -n "$1" ]; then
list="$1"
fi
IFS=':'
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
}
I can't seem to get the first one to find any matches
The second option gets the right values, but when I try to formatPrint, it throws an error saying that the list of values passed in is not a directory.
Honestly, I'd replace the whole thing with
function retrieveMatch {
grep "$1" | tr ':' '\n'
}
To be called as
retrieveMatch Help < filename
...like the original function (Option 1) appeared to be designed. To do more complicated things with matching lines, have a look at awk:
# in the awk script, the fields in the line will be $1, $2 etc.
awk -v pattern="$1" -F : '$0 ~ pattern { for(i = 1; i <= NF; ++i) print $i }'
See this link. Awk is made to process exactly this sort of data, so if you plan to do complex things with it, it is definitely worth a look.
Answering the question more directly, there are two/three problems in your code. One is, as was pointed out in the comments to the question, that the line
if [[ *"$1"* =~ "$line" ]]; then
will try to use "$line" as a regular expression and look for a match in *"$1"*. Inside [[ ]] no pathname expansion is performed, so the unquoted * are literal asterisks in the string being searched. Assuming that the * are supposed to match anything the way they would in glob expressions (but not in regular expressions), this could be replaced with
if [[ "$line" =~ "$1" ]]; then
because =~ will report a match if the regex matches any part of the string.
The second problem is that you're divided on whether you want "$list" in formatPrint to be a file or a line. You say in retrieveMatch that it should be a line:
formatPrint "$line"
But you set it to a filename default in formatPrint:
list="database.txt" #default file for printing all info
You'll have to decide on one. If you decide that formatPrint should format lines, then the third problem is that the redirection in
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
tries to use "$list" as a filename. This could be fixed by replacing the last line with
done <<< "$list" # using a here-string (bash-specific)
Or
done <<EOF
$list
EOF
(note: in the latter case, do not indent the code; it's a here-document that's taken verbatim). And, of course, read will only split four fields the way you wrote it.
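Putting these fixes together, and assuming formatPrint is meant to format a single line rather than a file, the two functions might look like the sketch below. The substring test uses a quoted glob instead of =~ so the search string is taken literally. That is a choice made for this example, not something the question requires.

```shell
#!/usr/bin/env bash
# Sketch: formatPrint formats ONE colon-separated line passed as $1.
formatPrint() {
    local var1 var2 var3 var4
    # IFS is set only for this read, so the caller's IFS is left alone.
    while IFS=: read -r var1 var2 var3 var4; do
        printf '%s\n' "$var1" "$var2" "$var3" "$var4"
    done <<< "$1"
}

# Reads the database on stdin and formats every line containing $1.
retrieveMatch() {
    local line
    [ -n "$1" ] || return 0
    while IFS= read -r line; do
        if [[ $line == *"$1"* ]]; then
            formatPrint "$line"
        fi
    done
}
```

It is called as retrieveMatch Help < database.txt. Because formatPrint reads from a here-string, it does not consume the stdin that the outer loop is reading.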
I feel I must be missing something, but..
cat > foo.txt
Hey:There:You:Friends I:Kinda:Need:Help
Foo:Bar
[Give control-D]
grep -i help foo.txt
Hey:There:You:Friends I:Kinda:Need:Help
Does it fit the bill?
EDIT: To expand a little further on this thought..
cat > foo.bsh
#!/bin/bash
hits="$(grep -i help foo.txt)"
while read -r line; do
echo "${line}"
done <<< "$hits"
[Give control-D]

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
#Use code that im working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
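If you would rather avoid external processes, every match in a single string can also be collected with =~ in a loop. This is only a sketch. The function name extract_numbers and the global array matches are names invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: collect every run of digits in $1 into the global array "matches".
extract_numbers() {
    local rest=$1 re='[0-9]+'
    matches=()
    while [[ $rest =~ $re ]]; do
        matches+=("${BASH_REMATCH[0]}")
        # Drop everything up to and including this match, then search again.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

extract_numbers "TestT100String42x7"
printf '%s\n' "${matches[@]}"
# prints:
# 100
# 42
# 7
```

Each iteration takes the leftmost match, so the array ends up in the same order as the numbers appear in the input.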
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
I know this is an old topic, but I came here through the same searches and found another good option: applying a regex to a string or variable using grep:
# Simple
$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
Using lookaround constructs, search expressions can be tied to their context for more precise matching. Here (?i) turns on case-insensitive matching,
\K discards everything matched so far from the reported match (so the leading TestT behaves like a lookbehind), and (?=String) is a lookahead that requires String to follow without including it in the match.
https://www.regular-expressions.info/lookaround.html
The given example matches the same as the PCRE regex TestT([0-9]{3})String
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=`expr "$newNameTXT" : '[^0-9]*\([0-9]\+\)'`; then
echo "contains $num"
fi
}
Well, sed with s/pattern1/pattern2/g just replaces every occurrence of pattern1 with pattern2.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want from there.
If you want to use only sed, then use tagged regular expressions (capture groups):
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
I haven't run the command above, so double-check the syntax.
Hope this helps.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo "${array[@]}"