Extract section of file to a variable, from within a shell script - regex

I'm writing a script intended to run under bash in OS X. I have Markdown files that look like this:
# File name
## Heading 1
Some text
## Heading 2
* List item 1
* List item 2
## Some other section
...
I'm trying to read everything under Heading 2 into a variable, and I've tried sed, grep, and perl, but can't get a working solution. Of those tools, Perl looked (in theory) to be the easiest fit, especially given that I need to match multiple lines. It looks like this regex works (at least with JavaScript syntax):
## Heading 2\s+(.+)\s+
I'd like to keep it a one-liner, and stick to only tools available on stock OS X (El Capitan, 10.11). Assume I only know "Heading 2" up front, not the following header caption.

With sed:
$ myvar=$(sed "/^## $1$/,/^## /!d;//d;/^$/d" file)
$ echo "$myvar"
* List item 1
* List item 2
You can remove /^$/d if you want to preserve blank lines.
Update:
I've replaced single quotes with double quotes to allow shell expansion.
You can call it with ./scriptname.sh "Heading 2".
Some explanations:
/^## $1$/,/^## / applies the subsequent commands to the range of lines from the first matching heading up to the next line starting with "## " (so you don't need to know the next heading's caption).
!d deletes all lines except those in the range.
//d re-uses the pattern that last matched as its address, so it deletes the two boundary heading lines themselves.
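For reference, here is the one-liner wrapped in a minimal script (the file name section.md is a placeholder of our own):
#!/bin/bash
# scriptname.sh - print the body of the "## $1" section,
# up to (but not including) the next "## " heading.
# Usage: ./scriptname.sh "Heading 2"
myvar=$(sed "/^## $1$/,/^## /!d;//d;/^$/d" section.md)
echo "$myvar"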

Using sed:
head2="$(sed -n '/## Heading 2/,/## Heading 3/{s/^## Heading .*//;p;}' file)"
echo "$head2"
* List item 1
* List item 2
Using perl:
head2="$(perl -0pe 's/(?s).*## Heading 2\s*(.*)\s*## Heading 3.*/\1/' file)"
echo "$head2"
* List item 1
* List item 2
You can also install GNU grep with Homebrew and use this regex (-z treats the input as NUL-delimited, so the whole file is read as one record and the match can span lines; \K drops everything matched before it from the output):
head2="$(grep -zoP '## Heading 2\s*\K[\s\S]*(?=\s*## Heading 3)' file)"

You can use awk ranges:
awk '/^## Heading 2/,/^## Heading [^2]/ {if (!/^## Heading 2/ && !/^## Heading [^2]/) print}' file
or with variables
awk '$0 ~ s, $0 ~ e {if ($0 !~ s && $0 !~ e) print}' s='^## Heading 2' e='^## Heading [^2]' file
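Note that a range misfires if the end pattern never matches (e.g. when the next section is "## Some other section", as in the question, the range runs to end of file). A hedged flag-based variation that only needs the one heading you know (file.md is our own placeholder):
body=$(awk -v h="Heading 2" '
    $0 == "## " h { inside=1; next }   # the wanted section starts here
    /^## / && inside { exit }          # any following heading ends it
    inside { print }
' file.md)
echo "$body"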

Related

How to use sed to search and replace a pattern who appears multiple times in the same line?

Because the question can be misleading, here is a little example. I have this kind of file:
some text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##
In this example, I want to replace each occurrence of KEY- inside a pair of ## by VALUE-. I started with this sed command:
sed -i 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'
Here is how it works:
\(##[^#]*\): create a first group composed of two # and any characters except # ...
KEY-: ... until the last occurrence of KEY- on that line
\([^#]*##\): and create a second group with all the characters except # until the next pair of #.
The problem is that my command can't correctly handle the following line, because there are multiple KEY- occurrences inside one pair of ##:
again ##some-text-KEY-some-other-text-KEY-text##
Indeed, I get this result:
again ##some-text-KEY-some-other-text-VALUE-text##
If I want to replace all the occurrences of KEY- in that line, I have to run my command multiple times, which I'd prefer to avoid. I also tried lazy quantifiers, but the problem is the same.
How can I write a regex and a sed command that correctly handle my whole file?
The problem is rather complex: you need to replace all occurrences of some multicharacter text inside blocks of text between identical multicharacter delimiters.
The easiest and safest way to solve the task is using Perl:
perl -i -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' file
The (##)(.*?)(##) pattern matches strings between two adjacent ## substrings, capturing the start delimiter into Group 1, the end delimiter into Group 3, and all the text in between into Group 2. Since the inner regex substitution resets all capture groups, a temporary variable keeps the value of the end delimiter ($end_delim=$3). Then "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim" rebuilds the match: the Group 1 value (the first ##), then the Group 2 value with every KEY- replaced by VALUE-, and then the end delimiter.
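A quick check on the trickiest sample line, using a here-string and no -i so nothing is written (the s///r modifier requires Perl 5.14+):
$ perl -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' <<< 'again ##some-text-KEY-some-other-text-KEY-text##'
again ##some-text-VALUE-some-other-text-VALUE-text##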
Provided no KEY- ever sits between two ##...## blocks on the same line (the pattern could then pair the wrong delimiters), you can also use a sed loop by enclosing your command between a :A label and a tA branch:
sed -i ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' file
Demo:
s="some KEY- text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##"
sed ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' <<< "$s"
Output:
some KEY- text
some text ##some-text-VALUE-some-other-text##
text again ##some-text-VALUE-some-other-text## ##some-text-VALUE-some-other-text##
again ##some-text-VALUE-some-other-text-VALUE-text##
some text with KEY ##VALUE-some-text##
blabla ##KEY##
More details:
sed allows the use of loops and branches. The :A in the code above is a label, a special location marker that can be jumped to with the appropriate operator. t creates a conditional branch: it jumps to the label only if the previous substitute command was successful. So, once the pattern has matched and the replacement occurred, sed goes back to the marked location and re-tries the match; if it fails, sed moves on through the string. In short, tA means: go back to the location marked A if there was a successful search-and-replace operation.
This might work for you (GNU sed):
sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' file
Convert the ##s to newlines. Then, in a loop, replace KEY- with VALUE- only where it is preceded by an odd number of newlines, i.e. inside what was a ##...## pair. When all are done, convert the newlines back to ##s.
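The same check as above on the multi-KEY sample line (GNU sed, for the \n handling):
$ sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' <<< 'again ##some-text-KEY-some-other-text-KEY-text##'
again ##some-text-VALUE-some-other-text-VALUE-text##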

Insert text into line if that line doesn't contain another string using sed

I am merging a number of text files on a linux server but the lines in some differ slightly and I need to unify them.
For example, some files will have a line like:
id='1244' group='american' name='fred',american
Other files will be like
id='2345' name='frank', english
Finally, others will be like:
id='7897' group='' name='maria',scottish
What I need to do is: if group='' appears, or group is not in the string at all, I need to add it somewhere before the comma, setting it to the text after the comma. So in the 2nd example above, the line would become:
id='2345' name='frank' group='english',english
and the same in the last example, which would become:
id='7897' name='maria' group='scottish',scottish
This is going into a bash script. I can't actually delete the line and add to the end of the file as it relates to the following line.
I've used the following:
sed -i.bak 's#group=""##' file
which deletes the group="" string, so the lines will either contain group='something' or won't contain it at all, and that works.
Then I tried to add the group if it doesn't exist using the following:
sed -i.bak '/group/! s#,(.*$)#group="\1",\1#' file
but that throws up the error
sed: -e expression #1, char 38: invalid reference \1 on `s' command's RHS
EDIT by Ed Morton to create a single sample input file and expected output:
Sample Input:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank', english
bar
id='7897' group='' name='maria',scottish
Expected Output:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
sed -r "
/group='' */ s/// # group is empty, remove it (and any trailing blanks)
/group=/! s/,[[:blank:]]*(.+)/ group='\\1',\\1/ # group is missing, add it
" file
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
The foo and bar lines are untouched because the s/// command did not match a comma followed by characters.
Something like:
sed '
    /^[^,]*group[^,]*,/ ! {
        s/, *\(.*\)/ group='\''\1'\'', \1/
    }
    /^[^,]*group='\'\''/ {
        s/group='\'\''\([^,]*\), *\(.*\)/group='\''\2'\''\1, \2/
    }
'
This GNU awk script may help:
awk -v sq="'" '
    BEGIN{ RS="[ ,\n]+"; FS="="; found=0 }
    $1=="group"{
        if($2==sq sq)
            next                # empty group: drop this record
        else
            found=1             # a non-empty group exists on this line
    }
    NF>1{
        printf "%s=%s ",$1,$2   # re-emit the key=value pairs
    }
    NF==1{
        if(!found)
            printf "group=%s%s%s",sq,$1,sq
        print ","$1
        found=0
    }
' file
The script relies on the record separator RS, set so that each key='value' pair (and the trailing word after the comma) becomes its own record.
If the group key isn't found or is empty, it is printed upon reaching a record with only one field.
Note that the variable sq holds the single-quote character; it is used to detect an empty group field and to quote the added value.
Sed can be pretty ugly. And your data format appears to be somewhat inconsistent. This MIGHT work for you:
$ sed -e "/group='[a-z]/b e" -e "s/group='' *//" -e "s/, *\([a-z]*\)$/ group='\1',\1/" -e ':e' input.txt
Broken out for easier reading, here's what we're doing:
/group='[a-z]/b e - If the line contains a valid group, branch to the end.
s/group='' *// - Remove any empty group,
s/, *\([a-z]*\)$/ group='\1',\1/ - add a new group from the word after the comma (note \1 is re-appended so the ,word suffix is preserved)
:e - branch label for the first command.
And then the default action is to print the line.
I really don't like manipulating data this way. It's prone to error, and you'll be further ahead reading this data into something that accurately stores its data structure, then prints the data according to a new structure. A more robust solution would likely be tied directly to whatever is producing or consuming this data, and would not sit in the middle like this.
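With the replacement fixed as above, a run over the sample input from the edit (saved as input.txt) produces the expected output:
$ sed -e "/group='[a-z]/b e" -e "s/group='' *//" -e "s/, *\([a-z]*\)$/ group='\1',\1/" -e ':e' input.txt
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish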

Linux CLI change price (awk or sed?)

I have price strings formatted as
$25.00
in various html files. I would like to use the Linux command line (BASH, presumably with awk or sed) to increase each price by a certain dollar amount ($3 in this case).
In short, I need to find $nn.00 and replace it with $(n+3)n.00
I started to put it together, but I don't know how to add the 3: sed -r 's/([^$][0-9][0-9][.]00). ????' file.html
Thanks!
Sample data:
$ cat prices_file.html
<p>$25.00</p><p>$78.00</p>
<p>$2.00</p>
<p>$101.00</p>
Solution with Perl:
$ perl -pi.bak -e 's/\$(\d+\.\d+)/sprintf("\$%.2f", $1 + 3)/eg' prices_file.html
After:
$ cat prices_file.html
<p>$28.00</p><p>$81.00</p>
<p>$5.00</p>
<p>$104.00</p>
The above example is one of the most common Perl substitution use cases.
It also backs up your original file (as prices_file.html.bak) in case you do something unwanted to it.
What is maybe less common is the evaluation modifier (s///e), which allows you to execute arbitrary Perl code in the replacement part of the substitution.
The global modifier (s///g) tells Perl to replace all occurrences (here in the context of each line; if you removed the g modifier, it would only replace the first price on the first line of the given sample data).
In the sprintf("\$%.2f", $1 + 3) replacement, $1 refers to the matched group (\d+\.\d+).
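If the increment needs to vary, a hedged variation passes it in through the environment instead of hard-coding 3 (AMOUNT is our own name):
$ AMOUNT=3 perl -pi.bak -e 's/\$(\d+\.\d+)/sprintf("\$%.2f", $1 + $ENV{AMOUNT})/eg' prices_file.html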

Grep Regex: List all lines except

I'm trying to automagically remove all lines from a text file that contains a letter "T" that is not immediately followed by a "H". I've been using grep and sending the output to another file, but I can't come up with the magic regex that will help me do this.
I don't mind using awk, sed, or some other linux tool if grep isn't the right tool to be using.
That should do it:
grep -v 'T[^H]'
-v: print lines not matching
[^H]: matches any character but H
You can do:
grep -v 'T[^H]' input
-v is the inverse-match option of grep: it does not list the lines that match the pattern.
The regex used is T[^H], which matches any line that has a T followed by any character other than an H.
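One caveat for both answers: T[^H] requires a character after the T, so a line ending in a bare T (a T not followed by H) never matches and survives the -v filter. If that edge case matters, use grep -vE 'T([^H]|$)'.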
To read lines from a file while excluding empty lines and lines starting with #:
grep -v '^$\|^#' folderlist.txt
folderlist.txt
# This is list of folders
folder1/test
folder2
# This is comment
folder3
folder4/backup
folder5/backup
Results will be:
folder1/test
folder2
folder3
folder4/backup
folder5/backup
Adding 2 awk solutions to the mix here.
1st solution (simpler): works with any version of awk.
awk '!/T/ || /TH/' Input_file
Checking 2 conditions:
If a line doesn't contain T, OR
if a line contains TH.
If either of the above conditions is TRUE, the line is simply printed.
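A quick check of the 1st solution (the sample lines are our own; awk regexes are case-sensitive, so math passes the !/T/ test):
$ printf 'THE\nTAP\nHAT\nmath\n' | awk '!/T/ || /TH/'
THE
math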
2nd solution (GNU awk specific): using GNU awk's match function with the regex (T)(.|$) and its array-creation capability.
awk '
!/T/{
print
next
}
match($0,/(T)(.|$)/,arr) && arr[1]=="T" && arr[2]=="H"
' Input_file
Explanation: first, if a line doesn't contain T at all, simply print it. Otherwise, use awk's match function to match a T followed by any character OR the end of the line. The two capturing groups are stored in the array arr, so the line is printed only if arr's 1st element is T and its 2nd element is H.

Multiple multi-line regex matches in Bash

I'm trying to do some fairly simple string parsing in a bash script.
Basically, I have a file that is comprised of multiple multi-line fields. Each field is surrounded by a known header and footer.
I want to extract each field separately into an array or similar, like this:
FILE=$(cat file)
REGEX="######[\s\S]+?#####"

if [[ $FILE =~ $REGEX ]]; then
    echo "$BASH_REMATCH"
fi
FILE:
######################################
this is field one
######
######################################
this is field two
they can be any number of lines
######
Now I'm pretty sure the problem is that bash doesn't match newlines with the "."
I can match this with "pcregrep -M", but of course the whole file is going to match. Can I get one match at a time from pcregrep?
I'm not opposed to using some inline perl or similar.
If you have gawk:
awk 'BEGIN{ RS="##*#" }
    NF{
        gsub("\n"," ")    # remove this if you want to retain newlines
        print "-->"$0
        arr[++d]=$0       # put into array
    }' file
Output:
$ ./shell.sh
--> this is field one
--> this is field two they can be any number of lines
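Building on that, a hedged sketch that loads the records into a bash array via process substitution (the second gsub, our own addition, trims the blanks left around each record):
fields=()
while IFS= read -r rec; do
    fields+=("$rec")
done < <(awk 'BEGIN{ RS="##*#" } NF{ gsub("\n"," "); gsub(/^ +| +$/,""); print }' file)
printf '%s\n' "${fields[1]}"   # -> this is field two they can be any number of lines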
The TXR language performs whole-document multi-line matching, binds variables, and (with the -B "dump bindings" option) emits properly escaped shell variable assignments that can be eval-ed. Arrays are supported.
The # character is special so it has to be doubled up to match literally.
$ cat fields.txr
#(collect)
#########################################
# (collect)
#field
# (until)
#########
# (end)
# (cat field)## <- catenate the fields together with a space separator by default
#(end)
$ txr -B fields.txr data
field[0]="this is field one"
field[1]="this is field two they can be any number of lines"
$ eval $(txr -B fields.txr data)
$ echo ${field[0]}
this is field one
$ echo ${field[1]}
this is field two they can be any number of lines
The #field syntax matches an entire line. These are collected into a list since it is inside a #(collect), and the lists are collected into lists-of-lists because that is nested inside another #(collect). The inner #(cat field) however, reduces the inner lists to a single string, so we end up with a list of strings.
This is "classic TXR": how it was originally designed and used, sparked by the idea:
Why don't we make here-documents work backwards and do parsing from reams of text into variables?
This implicit emission of matched variables, in shell syntax by default, continues to be a supported behavior even though the language has grown much more powerful, so there is less need to integrate with shell scripts.
I would build something around awk. Here is a first proof of concept:
awk '
    BEGIN{ f=0; fi="" }
    /^######################################$/{ f=1 }           # field header: start collecting
    /^######$/{ f=0; print "Field:" fi; fi="" }                  # field footer: emit what we collected
    { if(f==2) fi=fi"-"$0; if(f==1) f++ }                        # skip the header line itself
' file
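On the FILE shown in the question, it prints the following (the leading - comes from the fi=fi"-"$0 concatenation):
Field:-this is field one
Field:-this is field two-they can be any number of lines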
begin="######################################"
end="######"
i=0
flag=0
while read -r line
do
    case $line in
        $begin)
            flag=1;;
        $end)
            ((i++))
            flag=0;;
        *)
            if [[ $flag == 1 ]]
            then
                array[i]+="$line"$'\n' # retain the newline
            fi;;
    esac
done < datafile
If you want to keep the marker lines in the array elements, move the assignment statement (with its flag test) to the top of the while loop before the case.
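Afterwards, a hedged way to inspect what was captured:
for idx in "${!array[@]}"; do
    printf 'Field %s:\n%s' "$idx" "${array[idx]}"
done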