Extract all text between two directory entries in a file - regex

I have a file that documents the structure of several directories. I am trying to print the text for each directory individually. My input file looks like this:
$ cat file.txt
/bin:
file_1
file_2
file_3
/sbin:
file_a
file_b
file_c
/usr/local/bin:
doc_a
doc_b
doc_c
What I'm trying to do is print a specific section of the file based on user selection:
#!/bin/bash
PS3=$'\nMake a selection '
select dir in $(grep ':' file.txt;) do
case $REPLY in
[0-9]) echo $dir
# Need something here. Maybe a pcregrep regex?
# pcregrep '(<= $dir)*(some_fancy_regex)' file.txt
break;;
esac
done
The user is presented with menu options:
1) /bin:
2) /sbin:
3) /usr/local/bin:
Make a selection
Suppose the user chooses 2. Currently, this just prints the chosen directory on the screen. I would like to display the directory as well as the files it contains.
/sbin:
file_a
file_b
file_c
From what I've read it seems like a pcre regex would work here. I barely understand non-pcre style regex. I'm trying to wrap my brain around positive and negative lookahead & lookbehind but I really don't know what I'm doing yet. If someone could help me figure this out I would appreciate it.
All directories begin with a / and end with :
File names listed under each directory may contain:
[a-z], [A-Z],[0-9]
Literal characters . _ - [
All directory / file structures end with a blank empty line

It can be done purely in bash 4 in a single pass without using any external tool. Here is the script to solve this problem:
#!/bin/bash
# declare an associative array
declare -A dirs=()
# loop through all lines and populate our associative array
# with dir as key and \n separated file names as value
while read -r; do
[[ -z $REPLY ]] && continue
if [[ $REPLY == *: ]]; then
d="$REPLY"
else
dirs["$d"]+=$'\n'"$REPLY"
fi
done < file.txt
# present a menu to the user and print the selected dir name with its file names
select dir in "${!dirs[@]}"; do
if [[ -n $dir ]]; then
printf '%s%s\n' "$dir" "${dirs[$dir]}"
break
fi
done
Output:
1) /usr/local/bin:
2) /bin:
3) /sbin:
#? 1
/usr/local/bin:
doc_a
doc_b
doc_c
and this:
1) /usr/local/bin:
2) /bin:
3) /sbin:
#? 3
/sbin:
file_a
file_b
file_c
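The menu above does not follow the order of the file because the keys of an associative array come back in no particular order. If file order matters, one option is to keep the headers in a second, indexed array. This is only a sketch: the names load_sections and order are made up for illustration, and it assumes bash 4 or later.

```shell
#!/bin/bash
# order: directory headers in the order they appear in the file
# dirs:  header -> newline-separated entries (same layout as in the answer above)
declare -a order=()
declare -A dirs=()

# load_sections FILE: fill both arrays from FILE (hypothetical helper)
load_sections() {
  local line d=
  while IFS= read -r line; do
    [[ -z $line ]] && continue
    if [[ $line == *: ]]; then
      d=$line
      order+=("$d")        # remember the position of this header
      dirs["$d"]=          # keep directories without entries as keys too
    elif [[ -n $d ]]; then
      dirs["$d"]+=$'\n'"$line"
    fi
  done < "$1"
}
```

After load_sections file.txt, using select dir in "${order[@]}" instead of "${!dirs[@]}" should present the menu in file order; the printf line stays the same.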

With GNU sed and bash:
dir="/usr/local/bin:"
sed -n "/${dir//\//\/}/,/^$/{/^$/d;p}" file
With bash:
dir="/usr/local/bin:"
while IFS= read -r line; do
[[ $line == $dir ]] && switch=1
[[ $line == "" ]] && switch=0
[[ $switch == 1 ]] && echo "$line"
done < file
Output in both cases:
/usr/local/bin:
doc_a
doc_b
doc_c

Don't mistake shell for a text-processing tool, that's what awk is for. All you need are these 4 lines:
$ cat tst.sh
awk -v RS= -F'\n' -v OFS=') ' '{print NR, $1}' file.txt >&2
printf '\nMake a selection: ' >&2
IFS= read -r rsp
awk -v RS= -v nr="$rsp" 'NR==nr' file.txt
$ ./tst.sh
1) /bin:
2) /sbin:
3) /usr/local/bin:
Make a selection: 2
/sbin:
file_a
file_b
file_c

Grep is not the best tool for this because it is line-oriented; you can't really have grep match expressions that span multiple lines, except with some contortion, and the -z option that this requires is not specified by POSIX.
You could do it like this:
#!/bin/bash
PS3=$'\nMake a selection '
mapfile -t opts < <(grep ':' file.txt)
select dir in "${opts[@]}"; do
sed -n "\#$dir#,/^$/{/^$/q;p}" file.txt
break
done
First, I've changed your menu creation. Notice that you have a spare semicolon within the command substitution and a missing one after it; using grep like this would also break if there are spaces in the directory names. I've thus used mapfile to get all the lines containing : into an array.
Then, once I know about the directory, I use sed to print "from the directory name on until the next empty line". That would simply be
sed -n "/$dir/,/^$/p"
but this falls short on multiple fronts. First of all, the directory name can contain slashes, which trips up the / delimited addressing. We can use \%regexp% instead of /regexp/, where % can be any character; I've chosen #.
Now, we have
sed -n "\#$dir#,/^$/p"
That's almost there, but prints trailing blank lines; we suppress that by using {/^$/q;p} instead of just p, which says "if the line is blank, quit, else print it".
Sample output (edited to use a directory name with a space):
1) /bin blah:
2) /sbin:
3) /usr/local/bin:
Make a selection 1
/bin blah:
file_1
file_2
file_3
Remark: Non-GNU seds (like the one found in macOS) might complain about the two commands in curly braces; using {/^$/q;p;} instead (extra semicolon) might help.
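One caveat with the sed address: $dir is used as a regex and is not anchored, so /bin: would also match the line /usr/local/bin: if that section came first, and names containing regex metacharacters could misbehave. A minimal sketch that compares the header as a plain string with awk instead (print_section is a made-up name, and this assumes the header contains no backslashes, since awk -v interprets escape sequences):

```shell
#!/bin/bash
# print_section FILE DIR: print the line equal to DIR and everything after it,
# up to (not including) the next blank line. Hypothetical helper.
print_section() {
  awk -v dir="$2" '
    $0 == dir         { found = 1 }   # exact string comparison, no regex
    found && $0 == "" { exit }        # the blank line ends the section
    found                             # print while inside the section
  ' "$1"
}
```

Inside the select loop this would be called as print_section file.txt "$dir".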

Related

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is it doesn't match the pattern in the if statement for all the words in the text file. It goes directly to the else statement. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0 which should be some value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
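To see the difference side by side, here is a small sketch with three made-up helper functions: the pattern from the question (a glob containing literal dots), the glob fix, and the regex fix. Each prints yes or no.

```shell
#!/bin/bash
# The question's test: as a glob, each "." only matches a literal dot.
two_vowels_original() {
  if [[ $1 == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]; then echo yes; else echo no; fi
}

# Glob version: * matches any run of characters.
two_vowels_glob() {
  if [[ $1 = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then echo yes; else echo no; fi
}

# Regex version: =~ is unanchored, so no leading or trailing .* is needed.
two_vowels_regex() {
  local re='[aeiouAEIOU].*[aeiouAEIOU]'
  if [[ $1 =~ $re ]]; then echo yes; else echo no; fi
}
```

With these, two_vowels_original install prints no, while the other two print yes for the same word.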
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[@]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
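The rewrite above also takes care of the second symptom in the question (wordcount printing 0 after the loop). In bash, each part of a pipeline runs in a subshell by default, so a counter incremented inside cat ... | while read ... is lost when the loop ends; reading from a redirection keeps the loop in the current shell. A small sketch of the effect, with made-up function names and assuming the default bash settings (lastpipe not enabled):

```shell
#!/bin/bash
# The loop is the last stage of a pipeline, so it runs in a subshell.
count_lines_pipe() {
  local n=0 line
  printf '%s\n' "$@" | while IFS= read -r line; do n=$((n + 1)); done
  echo "$n"    # the increments happened in the subshell and are gone
}

# The loop reads from a redirection, so it runs in the current shell.
count_lines_redirect() {
  local n=0 line
  while IFS= read -r line; do n=$((n + 1)); done < <(printf '%s\n' "$@")
  echo "$n"
}
```

Calling both with the same three arguments shows the pipeline version losing the count.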
Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=`cat $file | xargs -n1 | grep -ie "[aeiou].*[aeiou]" | wc -l`
wordcount=`expr $wordcount + $count`
done
echo $wordcount

Find regular expression in a file matching a given value

I have some basic knowledge on using regular expressions with grep (bash).
But I want to use regular expressions the other way around.
For example I have a file containing the following entries:
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
Now I want to use bash to figure out to which line a particular number matches.
For example:
grep 8 file
should return:
line_three=[7-9]
Note: I am aware that the example of "grep 8 file" doesn't make sense, but I hope it helps to understand what I am trying to achieve.
Thanks for your help,
Marcel
As others have pointed out, awk is the right tool for this:
awk -F'=' '8~$2{print $0;}' file
... and if you want this tool to feel more like grep, a quick bash wrapper:
#!/bin/bash
awk -F'=' -v seek_value="$1" 'seek_value~$2{print $0;}' "$2"
Which would run like:
./not_exactly_grep.sh 8 file
line_three=[7-9]
My first impression is that this is not a task for grep, maybe for awk.
Trying to do things with grep I only see this:
for line in $(cat file); do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done
Using while for file reading (following comments):
while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
This can be done in native bash using the syntax [[ $value =~ $regex ]] to test:
find_regex_matching() {
local value=$1
while IFS= read -r line; do # read from input line-by-line
[[ $line = *=* ]] || continue # skip lines not containing an =
regex=${line#*=} # prune everything before the = for the regex
if [[ $value =~ $regex ]]; then # test whether we match...
printf '%s\n' "$line" # ...and print if we do.
fi
done
}
...used as:
find_regex_matching 8 <file
...or, to test it with your sample input inline:
find_regex_matching 8 <<'EOF'
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
EOF
...which properly emits:
line_three=[7-9]
You could replace printf '%s\n' "$line" with printf '%s\n' "${line%%=*}" to print only the key (contents before the =), if so inclined. See the bash-hackers page on parameter expansion for a rundown on the syntax involved.
This is not built-in functionality of grep, but it's easy to do with awk, with a change in syntax:
/[0-3]/ { print "line one" }
/[4-6]/ { print "line two" }
/[7-9]/ { print "line three" }
If you really need to, you could programmatically change your input file to this syntax, if it doesn't contain any characters that need escaping (mainly / in the regex or " in the string):
sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#'
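Put together, the generated awk program can be fed straight to awk -f through process substitution. A rough sketch (match_rules is a made-up name; it assumes bash and that every line of the rules file has the name=regex form with no / or " in it, as noted above):

```shell
#!/bin/bash
# match_rules FILE VALUE: print the name of every rule in FILE whose regex
# matches VALUE. Hypothetical wrapper around the sed-to-awk conversion above.
match_rules() {
  printf '%s\n' "$2" |
    awk -f <(sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#' "$1")
}
```

For the sample file, match_rules file 8 should print line_three.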
As I understand it, you are looking for a range that includes some value.
You can do this in gawk:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
$ awk -v n=8 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[7-9]
Since the digits are being treated as numbers (vs a regex) it supports larger ranges:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[75-95]
line_four=[55-105]
$ awk -v n=92 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[75-95]
line_four=[55-105]
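If gawk is not available, the same numeric test can be sketched in plain bash with =~ and BASH_REMATCH. The function name is made up, and treating both endpoints as part of the range is an assumption about what is wanted:

```shell
#!/bin/bash
# lines_containing N: read name=[LOW-HIGH] lines on stdin and print those
# whose numeric range contains N (endpoints included). Hypothetical helper.
lines_containing() {
  local n=$1 line
  local re='\[([0-9]+)-([0-9]+)\]'
  while IFS= read -r line; do
    [[ $line =~ $re ]] || continue     # skip lines without a LOW-HIGH range
    # 10# forces base 10 so a value such as 08 is not read as octal
    if (( 10#${BASH_REMATCH[1]} <= 10#$n && 10#$n <= 10#${BASH_REMATCH[2]} )); then
      printf '%s\n' "$line"
    fi
  done
}
```

It would be used as lines_containing 92 < /tmp/file.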
If you are just looking to interpret the right hand side of the = as a regex, you can do:
$ awk -F= -v tgt=8 'tgt~$2' /tmp/file
You would like to do something like
grep -Ef <(cut -d= -f2 file) <(echo 8)
This will grep what you want but will not display where.
With sed you can show a message:
echo "8" | sed -n '/[7-9]/ s/.*/Found it in line_three/p'
Now you would like to transform your regexp file into such commands:
sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file
Store these commands in a virtual command file and you will have
echo "8" | sed -nf <(sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file)

Search for substring matches in a file bash

The premise is to store a database file of colon separated values representing items.
var1:var2:var3:var4
I need to sort through this file and extract the lines where any of the values match a search string.
For example
Search for "Help"
Hey:There:You:Friends
I:Kinda:Need:Help (this line would be extracted)
I'm using a function to pass in the search string, and then passing the found lines to another function to format the output. However, I can't seem to get the format right when passing. Here is sample code for the different approaches I've tried from this site, but they don't seem to be working for me:
#Option 1, it doesn't ever find matches
function retrieveMatch {
if [ -n "$1" ]; then
while read line; do
if [[ *"$1"* =~ "$line" ]]; then
formatPrint "$line"
fi
done
fi
}
#Option 2, it gets all the matches, but then passes the value in a
#format different than a file? At least it seems to...
function retrieveMatch {
if [ -n "$1" ]; then
formatPrint `cat database.txt | grep "$1"`
fi
}
function formatPrint {
list="database.txt" #default file for printing all info
if [ -n "$1" ]; then
list="$1"
fi
IFS=':'
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
}
I can't seem to get the first one to find any matches.
The second option gets the right values, but when I try to formatPrint, it throws an error saying that the list of values passed in is not a directory.
Honestly, I'd replace the whole thing with
function retrieveMatch {
grep "$1" | tr ':' '\n'
}
To be called as
retrieveMatch Help < filename
...like the original function (Option 1) appeared to be designed. To do more complicated things with matching lines, have a look at awk:
# in the awk script, the fields in the line will be $1, $2 etc.
awk -v pattern="$1" -F : '$0 ~ pattern { for(i = 1; i <= NF; ++i) print $i }'
Awk is made to process exactly this sort of data, so if you plan to do complex things with it, it is definitely worth a look.
Answering the question more directly, there are two/three problems in your code. One is, as was pointed out in the comments to the question, that the line
if [[ *"$1"* =~ "$line" ]]; then
Will try to use "$line" as a regular expression to find a match in *"$1"*, assuming that *"$1"* does not become more than one token after pathname expansion because the * are not quoted. Assuming that the * are supposed to match anything the way they would in glob expressions (but not in regular expressions), this could be replaced with
if [[ "$line" =~ "$1" ]]; then
because =~ will report a match if the regex matches any part of the string.
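One detail worth keeping in mind: when the right-hand side of =~ is quoted, bash (3.2 and later) treats it as a literal string, so "$line" =~ "$1" is effectively a substring test; left unquoted, $1 is interpreted as a regex. A small sketch with made-up function names that prints yes or no for each flavour:

```shell
#!/bin/bash
# Literal substring test: the quoted part of the pattern is matched as-is.
contains_literal() {
  if [[ $1 == *"$2"* ]]; then echo yes; else echo no; fi
}

# Regex test: the unquoted right-hand side is an extended regular expression.
matches_regex() {
  if [[ $1 =~ $2 ]]; then echo yes; else echo no; fi
}
```

For a search string such as a.c the two differ: the literal test only accepts lines containing a.c itself, while the regex test also accepts abc.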
The second problem is that you're divided on whether you want "$list" in formatPrint to be a file or a line. You say in retrieveMatch that it should be a line:
formatPrint "$line"
But you set it to a filename default in formatPrint:
list="database.txt" #default file for printing all info
You'll have to decide on one. If you decide that formatPrint should format lines, then the third problem is that the redirection in
while read var1 var2 var3 var4; do
echo "$var1"
echo "$var2"
echo "$var3"
echo "$var4"
done < "$list"
tries to use "$list" as a filename. This could be fixed by replacing the last line with
done <<< "$list" # using a here-string (bash-specific)
Or
done <<EOF
$list
EOF
(note: in the latter case, do not indent the code; it's a here-document that's taken verbatim). And, of course, read will only split four fields the way you wrote it.
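If formatPrint is meant to take a single line, a sketch that is not limited to four fields could read the line into an array with read -a. The name format_line is made up, and this assumes bash:

```shell
#!/bin/bash
# format_line LINE: print each colon-separated field of LINE on its own line.
# Hypothetical replacement for the line-based variant of formatPrint.
format_line() {
  local -a fields
  # IFS=: applies to this read only; the here-string supplies the line
  IFS=: read -r -a fields <<< "$1"
  printf '%s\n' "${fields[@]}"
}
```

Setting IFS only for the read command also avoids changing IFS for the rest of the script, which the original formatPrint does.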
I feel I must be missing something, but..
cat > foo.txt
Hey:There:You:Friends I:Kinda:Need:Help
Foo:Bar
[Give control-D]
grep -i help foo.txt
Hey:There:You:Friends I:Kinda:Need:Help
Does it fit the bill?
EDIT: To expand a little further on this thought..
cat > foo.bsh
#!/bin/bash
hits="$(grep -i help foo.txt)"
while read -r line; do
echo "${line}"
done <<< "$hits"
[Give control-D]

Get all variables in bash from text line

Suppose I have a text line like
echo -e "$text is now set for ${items[$i]} and count is ${#items[@]} and this number is $((i+1))"
I need to get all variables (for example, using sed) so that in the end I have a list containing: $text, ${items[$i]}, $i, ${#items[@]}, $((i+1)).
I am writing a script that runs some complex commands, and before executing each command it shows the command to the user. So when my script shows a command like "pacman -S ${softtitles[$i]}", you can't guess what this command actually does. I just want to add, below the command, a list of the variables used in it and their values. So I decided to do it via regex and sed, but I can't do it properly :/
UPD: It can be just a string like echo "$test is 'ololo', ${items[$i]} is 'today', $i is 3"; it doesn't need to be a list at all, and the line can include any temporary variables and multiple lines of code. Also, it doesn't have to be sed :)
SOLUTION:
echo $m | grep -oP '(?<!\[)\$[{(]?[^"\s\/\047.\\]+[})]?' | uniq > vars
$m - our line of code with several bash variables, like "This is $string with ${some[$i]} variables"
uniq - if the string contains the same variable several times in a row, this removes the duplicates (uniq only drops adjacent duplicates; sort -u would drop all of them)
vars - temporary file to hold all variables found in text string
Next piece of code will show all variables and its values in fancy style:
if [ ! "`cat vars`" == "" ]; then
while read -r p; do
value=`eval echo $p`
Style=`echo -e "$Style\n\t$Green$p = $value$Def"`
done < vars
fi
$Style - predefined variable with some text (title of the command)
$Green, $Def - just tput settings of color (green -> text -> default)
Green=`tput setaf 2`
Def=`tput sgr0`
$p - each line of vars file (all variables one by one) looped by while read -r p loop.
You could simply use the below grep command,
$ grep -oP '(?<!\[)(\$[^"\s]+)' file
$text
${items[$i]}
${#items[@]}
$((i+1))
I'm not sure it's perfect, but it should help you:
sed -r 's/(\$[^ "]+)/\n\1\n/g' filename | sed -n '/^\$/p'
Explanation :
(\$[^ "]+) - matches the character $ followed by any characters up to a space or double quote.
\n\1\n - puts a newline before and after the matched word (so each variable ends up on its own line).
/^\$/p - prints only the lines that start with $, that is, the variables.
A few approaches, I tested each of them on file which contains
echo -e "$text is now set for ${items[$i]} and count is ${#items[@]} and this number is $((i+1))"
grep
$ grep -oP '\$[^ "]*' file
$text
${items[$i]}
${#items[@]}
$((i+1))
perl
$ perl -ne '@f=(/\$[^ "]*/g); print "@f"' file
$text ${items[$i]} ${#items[@]} $((i+1))
or
$ perl -ne '@f=(/\$[^ "]*/g); print join "\n",@f' file
$text
${items[$i]}
${#items[@]}
$((i+1))
The idea is the same in all of them. They will collect the list of strings that start with a $ and as many subsequent characters as possible that are neither spaces nor ".
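For the original goal of showing each variable with its value, here is a rough pure-bash sketch that avoids eval by using indirect expansion. It is deliberately limited: it only recognises plain $name and ${name} references (array subscripts, ${#...} and $((...)) are not evaluated), and the function name and the double-underscore local names are made up so they are unlikely to clash with the variables being inspected.

```shell
#!/bin/bash
# list_simple_vars TEXT: print name=value for each plain $name or ${name}
# reference found in TEXT. Hypothetical helper, scalars only.
list_simple_vars() {
  local __text=$1 __name
  local __re='\$[{]?([A-Za-z_][A-Za-z0-9_]*)[}]?'
  while [[ $__text =~ $__re ]]; do
    __name=${BASH_REMATCH[1]}
    printf '%s=%s\n' "$__name" "${!__name}"    # indirect expansion, no eval
    __text=${__text#*"${BASH_REMATCH[0]}"}     # continue after this match
  done
}
```

The text has to be passed in single quotes (or read from a file) so that the shell does not expand the variables before the function sees them.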

Bash Script sed command not working correctly with file passed through command line

Problem
I am trying to write a script that renames a large number of files according to a regex. The command works fine when I type it in iTerm2, but the same command fails inside the script.
Plus some of my file names include Chinese and Korean characters (I don't know whether that is the problem or not).
code
My script takes three inputs: the old regex, the new regex, and the files that need to be renamed.
Here is my code:
#!/bin/bash
# we have less than 3 arguments. Print the help text:
if [ $# -lt 3 ] ; then
cat << HELP
ren -- renames a number of files using sed regular expressions
USAGE: ren 'regexp' 'replacement' files...
EXAMPLE: rename all *.HTM files into *.html:
ren 'HTM' 'html' *.HTM
HELP
exit 0
fi
OLD="$1"
NEW="$2"
# The shift command removes one argument from the list of
# command line arguments.
shift
shift
# $@ contains now all the files:
for file in "$@"; do
if [ -f "$file" ] ; then
newfile=`echo "$file" | sed "s/${OLD}/${NEW}/g"`
if [ -f "$newfile" ]; then
echo "ERROR: $newfile exists already"
else
echo "renaming $file to $newfile ..."
mv "$file" "$newfile"
fi
fi
done
I register the bash command in the .profile as:
alias ren="bash /pathtothefile/ren.sh"
Test
The original file name is "제01과.mp3" and I want it to become "第01课.mp3".
So with my script I use:
$ ren "제\([0-9]*\)과" "第\1课" *.mp3
And it seems that the sed in the script has not worked.
But the following, which as far as I can tell is the same, does replace the name:
$ echo "제01과.mp3" | sed s/"제\([0-9]*\)과\.mp3"/"第\1课\.mp3"/g
Any thoughts? Thx
Print the result
I have made the following change in the script so that it prints what it is doing:
newfile=`echo "$file" | sed "s/${OLD}/${NEW}/g"`
echo "The ${file} is changed to ${newfile}"
And the result for my test is:
The 제01과.mp3 is changed to 제01과.mp3
ERROR: 제01과.mp3 exists already
So there is no format problem.
Updating(all done under bash 4.2.45(2), Mac OS 10.9)
Testing
Something interesting happens when I run the commands directly in bash, I mean with the for loop. I first stored all the names in a files.txt file using:
$ ls | grep mp3 > files.txt
and then ran sed on them. A single command in bash interactive mode like:
$ file="제01과.mp3"
$ echo $file | sed s/"제\([0-9]*\)과\.mp3"/"第\1课\.mp3"/g
gives
第01课.mp3
while the following, also in interactive mode:
files=`cat files.txt`
for file in $files
do
echo $file | sed s/"제\([0-9]*\)과\.mp3"/"第\1课\.mp3"/g
done
gives no changes!
And by now:
echo $file
gives:
$ 제30과.mp3
(There are only 30 files)
Problem Part
And I tried the first command which worked before:
$ echo $file | sed s/"제\([0-9]*\)과\.mp3"/"第\1课\.mp3"/g
It gives no changes as:
$ 제30과.mp3
So I create a new newfile and tried again as:
$ newfile="제30과.mp3"
$ echo $newfile | sed s/"제\([0-9]*\)과\.mp3"/"第\1课\.mp3"/g
And it gives correctly:
$第30课.mp3
WOW ORZ... Why! Why ! Why! And I try to see whether file and newfile are the same, and of course, they are not:
if [[ $file == $new ]]; then
echo True
else
echo False
fi
gives:
False
My guess
I guess there is some encoding problem, but I have found no reference. Could anyone help? Thx again.
Update 2
I seem to understand that there is a huge difference between a string and a file name. To be specific, if I directly use a variable like:
file="제30과.mp3"
in the script, the sed works fine. However, if the variable was passed from the $@ or set like:
file=./*mp3
Then the sed fails to work. I don't know why. And btw, mac sed has no -r option and in ubuntu -r does not solve the question I mention above.
Some errors combined:
In order to write groups as plain ( ) instead of \( \), you need extended regex: -r in GNU sed (-E in BSD/macOS sed and in grep)
escaping correctly is a beast :)
Example
files="제2과.mp3 제30과.mp3"
for file in $files
do
echo $file | sed -r 's/제([0-9]*)과\.mp3/第\1课.mp3/g'
done
outputs
第2课.mp3
第30课.mp3
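For the rename script itself, the same idea can be wrapped in a small helper that only computes the new name, which makes it easy to do a dry run before calling mv. This is a sketch: new_name is a made-up function, it uses sed -E (accepted by both GNU and BSD sed) instead of -r, and it assumes the regex and the replacement contain no unescaped /.

```shell
#!/bin/bash
# new_name OLD NEW FILE: print FILE with every match of the extended regex OLD
# replaced by NEW. Hypothetical helper; nothing is renamed here.
new_name() {
  printf '%s\n' "$3" | sed -E "s/$1/$2/g"
}
```

In the loop of the original script, newfile=$(new_name "$OLD" "$NEW" "$file") would replace the backtick line, with the groups written as ( ) rather than \( \).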
If you are not doing this as a programming project, but want to skip ahead to the part where it just works, I found these resources listed at http://www.tldp.org/LDP/GNU-Linux-Tools-Summary/html/x4055.htm:
MMV (and MCP, MLN, ...) utilities use a specialized syntax to perform bulk file operations on paths. (http://linux.maruhn.com/sec/mmv.html)
mmv before\*after.mp3 Before\#1After.mp3
Esomaniac, a Java alternative that also works on Windows, is apparently dead (home page is parked).
rename is a perl script you can download from CPAN: https://metacpan.org/release/File-Rename
rename 's/\.JPG$/.jpg/' *.JPG