Why does this working regex not work with sed?

I have this type of text:
Song of Solomon 1:1: The song of songs, which is Solomon’s.
John 3:16:For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
III John 1:8: We therefore ought to receive such, that we might be fellowhelpers to the truth.
I am trying to remove the verse reference (or metadata, if you will) and keep just the plain text content. The example text shows three different types of reference (multi-word, single-word, and Roman numeral + word). I thought it would be easiest to match, from the beginning of each line, everything up to and including "number:number:", and then substitute it with "" (the empty string).
I tested a regex that seems to work (as I described):
First find until "number:number:" excluding it [or: .+?(?=(\s+)(\d+)(:)(\d+)(:))],
And then include the "number:number:" pattern [or: (\s+)(\d+)(:)(\d+)(:)]
Which leads to the following regex:
.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)
The regex seems to work fine, you can try it here, the problem is that when I try to use the regex with sed it just does not work:
$ sed 's/.+?(?=(\s+)(\d+)(:)(\d+)(:))(\s+)(\d+)(:)(\d+)(:)//g' testcase.txt
It will produce the same text as the input, when it should produce:
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
Any help please?
Thank you very much!

This awk should do:
awk -F": *" '{print $3}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
To anchor the split more robustly on the number:number: pattern, use this:
awk -F"[0-9]+:[0-9]+: *" '{print $2}' file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
This will also prevent problems with : within the text.
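A small sketch of why the digit-anchored separator is safer. The sample line is made up for illustration (it is not from the question) and deliberately has a colon inside the verse text:

```shell
# Hypothetical input line whose verse text contains a colon.
line='John 1:1: He said: go'

# Splitting on every ": *" cuts the text at the inner colon.
naive=$(printf '%s\n' "$line" | awk -F": *" '{print $3}')

# Splitting on "number:number:" keeps the text intact.
anchored=$(printf '%s\n' "$line" | awk -F"[0-9]+:[0-9]+: *" '{print $2}')

printf '%s\n' "$naive" "$anchored"
```

The first command prints only "He said", while the second prints "He said: go".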
Using Adam's regex, we can shorten it a bit.
awk -F"([0-9]+:){2} ?" '{print $2}' file
or
awk -F"([0-9]+:){2} ?" '{$0=$2}1' file

You can use the following sed command:
sed 's/.*[0-9]\+:[0-9]\+: *//' file.txt
If you have only basic posix regexes available, you need to use the following command:
sed 's/.*[0-9]\{1,\}:[0-9]\{1,\}: \{0,\}//' file.txt
I use \{1,\} because the \+ operator is not part of the basic POSIX regex specification. (The * operator is, so ' *' would work just as well as ' \{0,\}'.)
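As for why the original pattern fails: as far as I know, sed only understands POSIX basic regexes (or extended ones with -E), and neither has lazy quantifiers like +?, lookaheads like (?=...), or the \d shorthand. A minimal sketch of both sed flavours, run on sample lines shortened from the question:

```shell
# Sample lines shortened from the question.
input='Song of Solomon 1:1: The song of songs.
John 3:16:For God so loved the world.
III John 1:8: We therefore ought to receive such.'

# BRE form: [0-9] instead of \d, \{1,\} instead of +.
bre=$(printf '%s\n' "$input" | sed 's/.*[0-9]\{1,\}:[0-9]\{1,\}: *//')

# ERE form (-E is accepted by both GNU and BSD sed).
ere=$(printf '%s\n' "$input" | sed -E 's/.*[0-9]+:[0-9]+: *//')

printf '%s\n' "$bre"
```

Both variables end up holding the three lines with the book, chapter and verse prefix removed.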
Btw, if you have GNU goodies, you can also use grep:
grep -oP '.*([0-9]+:){2} *\K.*' file.txt
I'm using the \K escape here. \K discards everything matched up to that point from the reported match, so it can be used like a lookbehind assertion, but with variable length.

This:
sed -r 's/.*([0-9]+:){2} ?//' testcase.txt

This is the job cut was invented to do:
$ cut -d: -f3- file
The song of songs, which is Solomon’s.
For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
We therefore ought to receive such, that we might be fellowhelpers to the truth.
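One caveat worth checking: cut splits on every colon and does not touch whitespace, so where a space follows the second colon the result starts with that space, and a colon inside the verse text is simply kept as part of field 3 onwards. A sketch that trims the leading blank (the sample lines are shortened from the question):

```shell
input='Song of Solomon 1:1: The song of songs.
John 3:16:For God so loved the world.'

# Raw cut output: the first line keeps the space after "1:1:".
raw=$(printf '%s\n' "$input" | cut -d: -f3-)

# Strip any leading blanks afterwards.
trimmed=$(printf '%s\n' "$input" | cut -d: -f3- | sed 's/^ *//')

printf '%s\n' "$trimmed"
```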

Related

Find the first name that starts with any letter than S using regex

I am new to regex and I am trying to find, in a text file, the lines where the last name starts with S, followed by a comma and a space, and the first name doesn't start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
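The effect of leaving the regex unquoted can be seen directly. This is a sketch assuming a POSIX-style shell such as sh or bash; set -f switches off filename globbing so that only quote removal is at work:

```shell
set -f   # disable globbing so only quote removal is visible

# Unquoted: the shell strips the backslash and the quotes around the blank.
unquoted=$(printf '%s\n' ^[S\w][,]?[' ']?[A-RT-Z]?)

# Single-quoted: the regex reaches the command untouched.
quoted=$(printf '%s\n' '^[S\w][,]?[ ]?[A-RT-Z]?')

set +f
printf '%s\n' "$unquoted" "$quoted"
```

The first line printed contains [Sw], the second still contains [S\w].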
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
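For anyone who wants to reproduce this without creating People.txt, here is a self-contained sketch that feeds a subset of the sample names to the second command on standard input:

```shell
people='Randall, Steven
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S'

# Last name starts with S, then comma and blank, then a first name not starting with S.
matches=$(printf '%s\n' "$people" | grep -E -e '^S[a-zA-Z]*, [^S]')
printf '%s\n' "$matches"
```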
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

Execute command defined by backreference in sed

I am creating a primitive experimental templating engine completely based on sed (merely for my private enjoyment). One thing I have been trying to achieve for several hours now is to replace certain text patterns with the output of a command they contain.
To clarify, if an input line looks like this
Lorem {{echo ipsum}}
I would like the sed output to look like this:
Lorem ipsum
The closest I have come is this:
echo 'Lorem {{echo ipsum}}' | sed 's/{{\(.*\)}}/'"$(\\1)"'/g'
which does not work.
However,
echo 'Lorem {{echo ipsum}}' | sed 's/{{\(.*\)}}/'"$(echo \\1)"'/g'
gives me
Lorem echo ipsum
I don't quite understand what is happening here. Why can I give the backreference to the echo command, but cannot evaluate the entire backreference in $()? When is \\1 getting evaluated? Is the thing I am trying to achieve even possible with pure sed?
Keep in mind that it is entirely clear to me that what I am trying to achieve is easily possible with other tools. However, I am highly interested in whether this is possible with pure sed.
Thanks!
The reason your attempt doesn't work is that $() is expanded by the shell before sed is even called. For this reason it can't use the backreferences sed is eventually going to capture.
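The order of events can be made visible by building the sed program in a variable first. This is only a sketch: printf '%s' stands in for the echo from the question, because some shells' echo interprets backslash sequences itself:

```shell
# The shell runs $(...) first; printf stands in for the question's echo.
replacement=$(printf '%s' \\1)

# This is the complete program that sed receives.
script='s/{{\(.*\)}}/'"$replacement"'/g'

result=$(printf '%s\n' 'Lorem {{echo ipsum}}' | sed "$script")
printf '%s\n' "$script" "$result"
```

By the time sed starts, the command substitution is already gone and the replacement is a plain \1, which is why the second attempt simply prints the captured text "echo ipsum" back.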
It is possible to do this sort of thing with GNU sed (not with POSIX sed). The main trick is that GNU sed has an e flag to the s command that makes it replace the pattern space (the whole space) with the result of the pattern space executed as a shell command. What this means is that
echo 'echo foo' | sed 's/f/g/e'
prints goo.
This can be used for your use case as follows:
echo 'Lorem {{echo ipsum}}' | sed ':a /\(.*\){{\(.*\)}}\(.*\)/ { h; s//\1\n\3/; x; s//\2/e; G; s/\(.*\)\n\(.*\)\n\(.*\)/\2\1\3/; ba }'
The sed code works as follows:
:a                                   # jump label for looping, in case there are
                                     # several {{}} expressions in a line
/\(.*\){{\(.*\)}}\(.*\)/ {           # if there is a {{}} expression,
  h                                  # make a copy of the line
  s//\1\n\3/                         # isolate the surrounding parts
  x                                  # swap the original back in
  s//\2/e                            # isolate the command, execute, get output
  G                                  # get the outer parts we put into the hold
                                     # buffer
  s/\(.*\)\n\(.*\)\n\(.*\)/\2\1\3/   # rearrange the parts to put the command
                                     # output into the right place
  ba                                 # rinse, repeat until all {{}} are covered
}
This makes use of sed's greedy matching in the regexes to always capture the last {{}} expression in a line. Note that it will have difficulties if there are several commands in a line and one of the later ones has multi-line output. Handling this case will require the definition of a marker that the commands embedded in the data are not allowed to have as part of their output and that the templates are not allowed to contain. I would suggest something like {{{}}}, which would lead to
sed ':a /\(.*\){{\(.*\)}}\(.*\)/ { h; s//{{{}}}\1{{{}}}\3/; x; s//\2/e; G; s/\(.*\)\n{{{}}}\(.*\){{{}}}\(.*\)/\2\1\3/; ba }'
The reasoning behind this is that the template engine would run into trouble anyway if the embedded commands printed further {{}} terms. This convention is impossible to enforce, but then any code you pass into this template engine had better come from a trusted source, anyway.
Mind you, I am not sure that this whole thing is a sane idea[1]. You're not planning to use it in any sort of production code, are you?
[1] I am, however, quite sure whether it is a sane idea.

Sed remove only first occurrence of a string

I have several strings in my text file which look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace one character with another. This time I want to replace the comma-space sequence with a pipe, but only for the first match, so that the country name is not affected.
I need to convert it to something like that:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, so the country name "Queensland, Australia" stays complete. Can someone help me achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
Running sed 's/, /|/' file.txt doesn't work.
The output should be like that:
Brisbane|Queensland, Australia|BNE
Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
By default, the s command replaces only the first occurrence of a match in the pattern space, unless you pass the g option.
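A quick sketch of the difference on the sample line, plus the numeric flag, which POSIX sed also accepts for picking a specific occurrence:

```shell
line='Brisbane, Queensland, Australia|BNE'

first=$(printf '%s\n' "$line" | sed 's/, /|/')     # first match only
all=$(printf '%s\n' "$line" | sed 's/, /|/g')      # every match
second=$(printf '%s\n' "$line" | sed 's/, /|/2')   # second match only

printf '%s\n' "$first" "$all" "$second"
```

Only the first form leaves "Queensland, Australia" intact.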
Since you have not posted the output for your test file, we can only guess what you need. And here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, it reconstructs the line.

Regex to replace a string in context but not the context

I am new to regex and want to do the following task:
I have a string, say JOHN.S, and I want to replace the period with a tab. However, the replacement should only occur if the period is between two letters. What I don't want is for the period in John, S. to be replaced with a tab. There I will just replace , with a tab, which I know how to do.
If I try to replace /[a-zA-Z]\.[a-zA-Z]/, then the surrounding letters will be removed but obviously I want to keep them. They should just be used to identify the context.
I have searched for a long time but have not come up with a solution. More specifically, I am working with bash, so maybe sed is what I am going to use.
Thank you.
It is just a matter of catching the surrounding information with () and printing them back with \1, \2, etc:
sed -r 's/(\w)\.(\w)/\1\t\2/g' file
Using your syntax:
sed -r 's/([a-zA-Z])\.([a-zA-Z])/\1\t\2/g' file
Test
$ cat file
John, S.
JOHN.S
blabla
$ sed -r 's/(\w)\.(\w)/\1\t\2/g' file
John, S.
JOHN S
blabla
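If your sed does not understand \t in the replacement (I believe that is a GNU extension), a literal tab can be spliced in from the shell instead. A minimal sketch using the same capture groups, where tab is just a helper variable introduced for this example:

```shell
tab=$(printf '\t')   # literal tab, avoids relying on \t in the replacement

input='John, S.
JOHN.S
blabla'

# \1 and \2 put the surrounding letters back; only the period is replaced.
result=$(printf '%s\n' "$input" |
  sed -E "s/([a-zA-Z])\.([a-zA-Z])/\1${tab}\2/g")
printf '%s\n' "$result"
```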

Grep Regex - Words in brackets?

I want to know the regex in grep to match everything that isn't a specific word. I know how to match everything that isn't a single character,
gibberish blah[^.]*jack
That would match blah, jack and everything in between as long as the in between didn't contain a period. But is it possible to do something like this?
gibberish blah[^joe]*jack
Match blah, jack and everything in between as long as the in between didn't contain the word "joe"?
UPDATE:
I can also use AWK if that would better suit this purpose.
So basically, I just want to get the sentence "gibberish blah other words jack", as long as "joe" isn't in the other words.
Update 2 (The Answer, to a different question):
Sorry, I am tired. The sentence actually can contain the word "joe", but not two of them. So "gibberish blah jill joe moo jack" would be accepted, but "gibberish blah jill joe moo joe jack" wouldn't.
Anyway, I figured out the solution to my problem. Just grep for "gibberish.*jack" and then do a word count (wc) to see how many "joes" are in that sentence. If wc comes back with 1, then it's ok, but if it comes back with 2 or more, the sentence is wrong.
So, sorry for asking a question that wouldn't even solve my problem. I will mark sputnick's answer as the right one, since his answer looks like it would solve the original post's problem.
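The counting approach from Update 2 could be sketched like this. check is a hypothetical helper name, and the sketch counts every occurrence of the substring "joe", so a word like "joey" would count too:

```shell
# Hypothetical helper: accept a sentence containing at most one "joe".
check() {
  # Extract the gibberish...jack part, or give up if it is absent.
  match=$(printf '%s\n' "$1" | grep -o 'gibberish.*jack') || { echo 'no match'; return; }
  # grep -o prints each "joe" on its own line; wc -l counts them.
  count=$(( $(printf '%s\n' "$match" | grep -o 'joe' | wc -l) ))
  if [ "$count" -le 1 ]; then echo accepted; else echo rejected; fi
}

check 'gibberish blah jill joe moo jack'
check 'gibberish blah jill joe moo joe jack'
```

The first call prints accepted and the second prints rejected.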
What you're looking for is called lookaround; it's an advanced regex technique from PCRE and Perl that most modern languages support. grep can handle these expressions if it has the -P switch. If you don't have -P, try pcregrep (or any modern language) instead.
See
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
NOTE
If you just want to negate a regex, maybe a simple grep -v "regex" will be sufficient (it depends on your needs):
$ echo 'gibberish blah other words jack' | grep -v 'joe'
gibberish blah other words jack
$ echo 'gibberish blah joe other words jack' | grep -v 'joe'
$
See
man grep | less +/invert-match
Try a negative lookahead, checked at each character between the two words:
gibberish blah((?!joe).)*jack
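For the original question (no "joe" anywhere between blah and jack), a lookaround pattern could be sketched as below. It assumes a GNU grep built with PCRE support (the -P switch) and uses a negative lookahead that is checked before every character consumed:

```shell
input='gibberish blah other words jack
gibberish blah joe other words jack'

# ((?!joe).)* consumes one character at a time and refuses
# any position where "joe" starts.
kept=$(printf '%s\n' "$input" | grep -P 'gibberish blah((?!joe).)*jack')
printf '%s\n' "$kept"
```

Only the line without "joe" is printed.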