Using sed to delete all lines between two matching patterns - regex

I have a file something like:
# ID 1
blah blah
blah blah
$ description 1
blah blah
# ID 2
blah
$ description 2
blah blah
blah blah
How can I use a sed command to delete all lines between the # and $ line? So the result will become:
# ID 1
$ description 1
blah blah
# ID 2
$ description 2
blah blah
blah blah
Can you please kindly give an explanation as well?

Use this sed command to achieve that:
sed '/^#/,/^\$/{/^#/!{/^\$/!d}}' file.txt
On macOS (BSD sed), add a semicolon before each closing brace to avoid the "extra characters at the end of d command" error:
sed '/^#/,/^\$/{/^#/!{/^\$/!d;};}' file.txt
OUTPUT
# ID 1
$ description 1
blah blah
# ID 2
$ description 2
blah blah
blah blah
Explanation:
/^#/,/^\$/ selects every range of lines that runs from a line starting with # to the next line starting with $. ^ anchors the match at the start of the line. $ is a special character, so it needs to be escaped.
/^#/! means: do the following if the line does not start with #
/^\$/! means: do the following if the line does not start with $
d means delete
So overall, it first matches all the lines from ^# to ^\$, then, among those matched lines, finds the lines that match neither ^# nor ^\$ and deletes them using d.

$ cat test
1
start
2
end
3
$ sed -n '1,/start/p;/end/,$p' test
1
start
end
3
$ sed '/start/,/end/d' test
1
3

In general form, if you have a file with contents of form abcde, where section a precedes pattern b, then section c precedes pattern d, then section e follows, and you apply the following sed commands, you get the following results.
In this demonstration, the output is represented by => abcde, where the letters show which sections would be in the output. Thus, ae shows an output of only sections a and e, ace would be sections a, c, and e, etc.
Note that if b or d appear in the output, those are the patterns appearing (i.e., they're treated as if they're sections in the output).
Also don't confuse the /d/ pattern with the command d. The command is always at the end in these demonstrations. The pattern is always between the //.
sed -n -e '/b/,/d/!p' abcde => ae
sed -n -e '/b/,/d/p' abcde => bcd
sed -n -e '/b/,/d/{//!p}' abcde => c
sed -n -e '/b/,/d/{//p}' abcde => bd
sed -e '/b/,/d/!d' abcde => bcd
sed -e '/b/,/d/d' abcde => ae
sed -e '/b/,/d/{//!d}' abcde => abde
sed -e '/b/,/d/{//d}' abcde => ace
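The table above can be checked with a small harness. This is only a sketch: demo is a hypothetical helper name, and the five input lines a to e stand in for the sections and patterns described above. A semicolon is added before each closing brace so that BSD sed should accept the commands as well.

```shell
# demo: feed the five lines a..e through sed with the given arguments
# and join the surviving lines into a single word for easy comparison.
demo() {
    printf '%s\n' a b c d e | sed "$@" | tr -d '\n'
    echo
}

demo -n -e '/b/,/d/!p'       # prints only what lies outside the range
demo -n -e '/b/,/d/{//!p;}'  # prints the range without its two boundary lines
demo -e '/b/,/d/{//!d;}'     # deletes the interior, keeps the boundaries
demo -e '/b/,/d/{//d;}'      # deletes the boundaries, keeps the interior
```

Run as written, the four calls should print ae, c, abde and ace, in that order.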

Another approach with sed:
sed '/^#/,/^\$/{//!d;};' file
/^#/,/^\$/: from line starting with # up to next line starting with $
//!d: delete all lines except those matching the address patterns

I did something like this a long time ago, and it was something like:
sed -n -e "1,/# ID 1/ p" -e "/\$ description 1/,$ p"
Which is something like:
-n suppress all output
-e "1,/# ID 1/ p" execute from the first line until your pattern and p (print)
-e "/\$ description 1/,$ p" execute from the second pattern until the end and p (print).
I might be wrong with some of the escaping on the strings, so please double check.

The example below removes the lines between "if" and "end if".
All files under the current directory are scanned, and the lines between the two matching patterns are removed (including the matching lines themselves).
IFS='
'
PATTERN_1="^if"
PATTERN_2="end if"
# Search for the 1st pattern in all files under the current directory.
GREP_RESULTS=(`grep -nRi "$PATTERN_1" .`)
# Go through each result
for line in "${GREP_RESULTS[@]}"; do
# Save the file and line number where the match was found.
FILE=${line%%:*}
START_LINE=`echo "$line" | cut -f2 -d:`
# Search on the same file for a match of the 2nd pattern. The search
# starts from the line where the 1st pattern was matched.
GREP_RESULT=(`tail -n +${START_LINE} $FILE | grep -in "$PATTERN_2" | head -n1`)
END_LINE="$(( $START_LINE + `echo "$GREP_RESULT" | cut -f1 -d:` - 1 ))"
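# Remove lines between first and second match from file. Write to a
# temporary file first: redirecting straight to $FILE would truncate it
# before sed reads it.
sed -e "${START_LINE},${END_LINE}d" "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"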
done
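When the two boundary lines should go as well, sed's range address can do the whole job per file. A minimal sketch, assuming a POSIX shell and patterns that contain no / character; delete_block and its arguments are hypothetical names, and unlike the grep -i calls above the match is case-sensitive. The result goes to a temporary file first, because redirecting sed's output straight back onto its input empties the file before it is read.

```shell
# delete_block START END FILE
# Remove every block that runs from a line matching START through the next
# line matching END (both included) from FILE.
delete_block() {
    start=$1 end=$2 file=$3
    tmp=$(mktemp) || return 1
    # Write to a temporary file; "sed ... file > file" would empty the input.
    if sed "/$start/,/$end/d" "$file" > "$tmp"; then
        cat "$tmp" > "$file"   # copy back, keeping the original permissions
    fi
    rm -f "$tmp"
}

# Example call (hypothetical file name):
# delete_block '^if' 'end if' script.bas
```

Because the range address handles every block in a file in one pass, there are no per-match line numbers to keep in sync.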


How to replace regex multiline dotall using perl in command line

Hi, I am new to Perl and I would like to know how to use it to replace a regex in multiline mode and, if possible, also make "." match line breaks.
I am using the following expression:
perl -pe 's/text.*end/textChanged/g' myFile.txt
The expression above replaces in single-line mode. It does not match across line breaks.
Using:
Windows
Strawberry Perl
Note that a Perl one-liner with the -p or -n switch wraps your code in a while loop behind the scenes.
That loop reads the input one line at a time, so you will not see any changes in your output unless text.*end appears on a single line.
Here is a sample
$ cat a.txt
abc
text 1 2
2 3 4
ab end
hello
here
$ perl -pe 's/text.*end/textChanged/g' a.txt # Nothing happens - while reads line by line
abc
text 1 2
2 3 4
ab end
hello
here
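Now, you can set the record separator variable $/ to undef, so that the whole file is read as one record. Without the /s modifier, however, the dot still does not match newlines: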
$ perl -pe ' BEGIN { $/=undef } s/text.*end/textChanged/g' a.txt # Nothing happens
abc
text 1 2
2 3 4
ab end
hello
here
But, when you add the /s modifier, the substitution takes place.
$ perl -pe ' BEGIN { $/=undef } s/text.*end/textChanged/gs ' a.txt
abc
textChanged
hello
here
$
Reading the entire file in slurp mode (-0777) has the same effect, and again nothing happens with your substitution.
$ perl -0777 -pe ' s/text.*end/textChanged/g ' a.txt
abc
text 1 2
2 3 4
ab end
hello
here
$
Now you use the /s flag so that dot can match the newline as well and the substitution takes place.
$ perl -0777 -pe ' s/text.*end/textChanged/gs ' a.txt
abc
textChanged
hello
here
$
Thanks @ikegami... for the bundled options, like below
$ perl -0777pe ' s/text.*end/textChanged/gs ' a.txt
So when you want the dot to match newlines, you need to add the /s modifier in the regex.

Regex pattern for quoted numbers and commas

I'm trying to find the correct regex to search a file for double-quoted numbers whose digits are grouped by commas. For example, I'm trying to find "27,422,734" and then replace it in a text editor so that the comma falls after every 4 digits, so the end result would be "2742,2734"
I've tried a few examples I found on SO but none are helping me with this scenario like
"[^"]+"
'\d+'
While the above do find matches, I don't know how to deal with the commas or what to replace them with.
Thanks for any help!
I found an even shorter solution (works with gnu-sed):
colonmv () {
echo "$@" | sed 's/,//g' | sed -r ':a;s/\B[0-9]{4}\>/,&/;ta'
}
But beware: the first sed command eats every comma, not just those between digits, so improve it or filter your input beforehand.
The second command uses the :a trick.
It matches 4 digits that end at a word boundary (\>) and are preceded by another word character (\B), and puts a comma in front of them; whenever a replacement took place, ta jumps back to :a and repeats.
Now, let's see colonmv in the wild:
colonmv '"A 3-grouped, pretty long number: 5,127,422,734 and an ungrouped one 5678905567789065778"'
"A 3-grouped pretty long number: 51,2742,2734 and an ungrouped one 567,8905,5677,8906,5778"
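One way to act on the warning about stray commas is to strip a comma only when it sits between two digits. The following is a sketch rather than a drop-in replacement: colonmv_digits is a hypothetical name, -E is assumed to be accepted by the installed sed, and \B and \> are GNU sed extensions, just as in the function above.

```shell
colonmv_digits() {
    # Loop c: remove one comma flanked by digits, repeat until none is left.
    # Loop a: put a comma in front of the last 4 digits of each digit run,
    #         repeat until no run of 5 or more digits remains.
    printf '%s\n' "$*" | sed -E \
        -e ':c' -e 's/([0-9]),([0-9])/\1\2/' -e 'tc' \
        -e ':a' -e 's/\B[0-9]{4}\>/,&/' -e 'ta'
}

colonmv_digits 'A 3-grouped, pretty long number: 5,127,422,734'
```

With GNU sed the comma after "3-grouped" should survive, while the number is regrouped to 51,2742,2734.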
There might be a better way of doing it, but I propose the following approach:
INPUT:
$ cat to_transform.txt
abc "27,422,734" def"27,422,734" def
ltu "123,734" abc "345,678,123,734" vtu
xtz "345,678,123,734" vtu "345,678,123,734"
u "1" a
"123"
iu"abc"a "123,734"
CMD:
$ paste -d' ' <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt | sed -e 's/,//g;:loop s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g; s/,,/,/g; /\([0-9]\{5\}\)/b loop') | awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'
OUTPUT:
$ cat to_transform.txt
abc "2742,2734" def"2742,2734" def
ltu "12,3734" abc "3456,7812,3734" vtu
xtz "3456,7812,3734" vtu "3456,7812,3734"
u "1" a
"123"
iu"abc"a "12,3734"
CODE DETAILS AND EXPLANATIONS:
<(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) extracts each number to be processed from the input file. The regex uses lookbehind/lookahead to enforce the surrounded-by-quotes condition, and (:?\d+,\d+)+ matches numbers like 27,422,734.
The sed command then takes the output of the grep command and performs the following operations:
SED DETAILS:
s/,//g #remove all , in the number
:loop #create a label to loop
s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g #add a comma before every group of 4 digits, working back from the end of the string or from the last comma added
s/,,/,/g #remove duplicate commas added by the previous step, if any
/\([0-9]\{5\}\)/b loop #if the string still contains at least 5 consecutive digits, loop and continue the processing.
Temporary output after the paste operation:
27,422,734 2742,2734
27,422,734 2742,2734
123,734 12,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
123,734 12,3734
Last but not least, the awk command reads these lines and runs a sed command for each one to replace the value in the first column with the corresponding value in the second column: awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'.
Precondition: Your input conforms to "[0-9,]*" and is a "#,###"-format correct number.
#!/bin/bash
colonmv () {
echo $1 | sed -r 's/,([0-9]{3})+/\1/g;' | \
rev | sed -r 's/[^0-9]?([0-9]{4})/\1,/g;s/,"$/"/;s/.*/"&/' | rev
}
colonmv '"734"'
colonmv '"2,734"'
colonmv '"22,734"'
colonmv '"422,734"'
colonmv '"7,422,734"'
colonmv '"27,422,734"'
colonmv '"127,422,734"'
colonmv '"5,127,422,734"'
Test:
colonmv.sh
"734""
"2734"
"2,2734"
"42,2734"
"742,2734"
"2742,2734"
"1,2742,2734"
"51,2742,2734"

sed not replacing some spaces

I am having some trouble getting SED to work right.
Input file:
$ cat txt
# nasty comment
blah blah blah this line is invalid
; this also isn't right
foo = 23 # comment here
blah=76876.8768 -- fubar
yoyo=76
tab_moo = -45.99
// comment
fubar = baz
#dfgpo=sf
####
Now how I parse it:
$ cat txt | sed -r 's/(#|--|;|\/\/).*//' | grep '=' | sed -r 's/[[:blank:]]+//'
foo= 23
blah=76876.8768
yoyo=76
tab_moo = -45.99
fubar= baz
The goal is to remove all comments and all inline whitespace.
I don't get why some spaces are left in the output. What am I doing wrong?
In sed, s/// only replaces the first occurrence on any given line. You need to add /g on the end:
sed -r 's/[[:blank:]]+//g'
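The effect of the g flag is easy to see on a single line. A small sketch, assuming any POSIX sed (it uses \{1,\} instead of the -r form of +); the variable names are only for illustration.

```shell
line='foo = 23'

# Without g, only the first run of blanks on the line is removed.
first_only=$(printf '%s\n' "$line" | sed 's/[[:blank:]]\{1,\}//')

# With g, every run of blanks on the line is removed.
all=$(printf '%s\n' "$line" | sed 's/[[:blank:]]\{1,\}//g')

printf '%s\n' "$first_only" "$all"
```

The first result still contains the space after the equals sign, which is the leftover seen in the question's output.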

How to match and keep the first number in a line using sed?

Question
Let's say I have one line of text with a number placed somewhere (it could be at the beginning, in the middle or at the end of the line).
How to match and keep the first number found in a line using sed?
Minimal example
Here is my attempt (following this page of a tutorial on regular expressions) and the output for different positions of the number:
$ echo "SomeText 123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$ echo "123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$ echo "SomeText 123" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
As you can see, only the last digit is kept in the process, whereas the desired output is 123...
Using sed:
echo "SomeText 123SomeText 456" | sed -r 's/^[^0-9]*([0-9]+).*$/\1/'
123
You can also do this in gnu awk:
echo "SomeText 123SomeText 456" | awk '{print gensub(/^[^0-9]*([0-9]+).*$/, "\\1", $0)}'
123
To complement the sed solutions, here's an awk alternative (assuming that the goal is to extract the 1st number on each line, if any (i.e., ignore lines without any numbers)):
awk -F'[^0-9]*' '/[0-9]/ { print ($1 != "" ? $1 : $2) }'
-F'[^0-9]*' defines any sequence of non-digit chars. (including the empty string) as the field separator; awk automatically breaks each input line into fields based on that separator, with $1 representing the first field, $2 the second, and so on.
/[0-9]/ is a pattern (condition) that ensures that output is only produced for lines that contain at least one digit, via its associated action (the {...} block) - in other words: lines containing NO number at all are ignored.
{ print ($1!="" ? $1 : $2) } prints the 1st field if it is nonempty, otherwise the 2nd one. Rationale: if the line starts with a number, the 1st field contains the 1st number on the line (because the line starts with a field rather than a separator); otherwise, it is the 2nd field that contains the 1st number (because the line starts with a separator).
You can also use grep, which is ideally suited to this task. sed is a Stream EDitor, which is only going to indirectly give you what you want. With grep, you only have to specify the part of the line you want.
$ cat file.txt
SomeText 123SomeText
123SomeText
SomeText 123
$ grep -o '[0-9]\+' file.txt
123
123
123
grep -o prints only the matching parts of a line, each on a separate line. The pattern is simple: one or more digits.
If your version of grep is compatible with the -P switch, you can use Perl-style regular expressions and make the command even shorter:
$ grep -Po '\d+' file.txt
123
123
123
Again, this matches one or more digits.
Using grep is a lot simpler and has the advantage that if the line doesn't match, nothing is printed:
$ echo "no number" | grep -Po '\d+' # no output
$ echo "yes 123number" | grep -Po '\d+'
123
edit
As pointed out in the comments, one possible problem is that this won't only print the first matching number on the line. If the line contains more than one number, they will all be printed. As far as I'm aware, this can't be done using grep -o.
In that case, I'd go with perl:
perl -lne 'print $1 if /.*?(\d+).*/'
This uses lazy matching (the question mark), so only non-digit characters are consumed by the .* at the start of the pattern. The $1 is a back reference, like \1 in sed. If there is more than one number on the line, this only prints the first. If there aren't any at all, it doesn't print anything:
$ echo "no number" | perl -ne 'print "$1\n" if /.*?(\d+).*/'
$ echo "yes123number456" | perl -lne 'print $1 if /.*?(\d+).*/'
123
If for some reason you still really want to use sed, you can do this:
sed -n 's/^[^0-9]*\([0-9]\{1,\}\).*$/\1/p'
Unlike the other answers, this uses only basic regular expressions, so it is compatible with all versions of sed, and it will only print lines that contain a match.
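The question's attempt and the portable fix can be compared side by side. A sketch assuming any POSIX sed; first_number is a hypothetical helper name.

```shell
# Greedy: the leading .* swallows as much as it can, so the group is left
# with a single digit.
greedy=$(printf '%s\n' 'SomeText 123SomeText' | sed 's:.*\([0-9][0-9]*\).*:\1:')

# Anchored: [^0-9]* can only skip non-digits, so the group starts at the
# first digit on the line. -n plus the p flag drops lines with no number.
first_number() {
    sed -n 's/^[^0-9]*\([0-9]\{1,\}\).*$/\1/p'
}
firsts=$(printf '%s\n' 'SomeText 123SomeText 456' 'no number' '789 first' | first_number)

printf 'greedy: %s\n' "$greedy"
printf '%s\n' "$firsts"
```

The greedy form yields 3, while first_number yields 123 and 789 and skips the line without digits.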
Try this sed command,
$ echo "SomeText 123SomeText" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123
Another example,
$ echo "SomeText 123SomeText 456" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123 456
It prints all the numbers found on each line of the input, separated by spaces.

Using regex to extract a substring while excluding a certain phrase

Say for the string:
test.1234.mp4
I would like to extract the numbers
1234
without extracting the 4 in mp4
What would the regex be for this?
The numbers aren't always in the second position and can be in different positions and might not always be four digits. I would like to extract the number without extracting the 4 in mp4 essentially.
More examples:
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
Essentially, only the standalone numbers would be extracted. Hence, for the last example, 666 from e666 would not be extracted, only 123.
To extract I have been using
echo "example.123.mp4" | grep -o "REGEX"
Edit: test456 was meant to be test.456
The accepted answer will fail on "test.e666.123.mp4" (print 666).
This should work
$ cat | perl -ne '/\.(\d+)\./; print "$1\n"'
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
1234
456
111
123
Note that this will only print the first group of numbers, if we have test.123.456.mp4 only 123 will be printed.
The idea is to match a dot followed by numbers which we are interested in (parentheses to save the match), followed by another dot. This means that it will fail on 123.mp4.
To fix this you could have:
$ cat | perl -ne '/(^|\.)(\d+)\./; print "$2\n"'
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
781.test.mp4
1234
456
111
123
781
First match is either beginning of line (^) or a dot, followed by numbers and a dot. We use $2 here since $1 is either beginning of a line or a dot.
cut can make it:
$ echo "test.1234.mp4" | cut -d. -f2
1234
where, in cut -d'.' -f2, -d'.' sets the delimiter to a dot and -f2 selects the 2nd field.
If you provide more examples, we can improve the output. With the current code you would extract whatever sits in the "something" position of blablabla.something.blablabla.
Update: from your question update we can do this:
grep -o '\.[0-9]*\.' | sed 's/\.//g'
test:
$ echo "test.abc.1234.mp4
test456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4" | grep -o '\.[0-9]*\.' | sed 's/\.//g'
1234
111
123
grep -Po "(?<=\.)\d+(?=\.)"
echo "test.1234.mp4" | perl -lpe 's/[^.\d]+\d*//g;s/\D*(\d+).*/$1/'
or:
echo "1321.test.mp4" | perl -lpe 's/.*(?:^|\.)(\d+)\..*/$1/'
p prints each line by default, so that we don't need an explicit print.
e says we have an expression on the command line, not a script file.
l handles the line endings: it strips the newline on input and adds it back on output.
These will also work if you have a number at the first part of the name.
perl -F'\.' -lane 'print "$F[scalar(@F)-2]" if(/\d+\.mp4$/)' your_file
tested:
> perl -F'\.' -lane 'print "$F[scalar(@F)-2]" if(/\d+\.mp4$/)' temp
1234
111
123
$ cat file
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
$ sed 's/.*\.\([0-9][0-9]*\)\..*/\1/' file
1234
456
111
123
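Two edge cases of this sed approach are worth checking before relying on it. A sketch assuming any POSIX sed; extract_id is a hypothetical helper name, and -n with the p flag is added here so that lines without a match print nothing instead of passing through unchanged.

```shell
# Print the all-digit component that is enclosed by dots, if there is one.
extract_id() {
    sed -n 's/.*\.\([0-9][0-9]*\)\..*/\1/p'
}

printf '%s\n' test.e666.123.mp4 test.123.456.mp4 781.test.mp4 | extract_id
```

Because the leading .* is greedy, a name with several numeric components yields the last one (456 for test.123.456.mp4), and a number at the very start of the name is not found because no dot precedes it.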