Regex to replace a string in context but not the context - regex

I am new to regex and want to do the following task:
I have a string, say JOHN.S, and I want to replace the period with a tab. However, the replacement should occur only if the period is between two letters. For example, I don't want the period in John, S. to be replaced with a tab; there I will just replace the comma with a tab, which I know how to do.
If I try to replace /[a-zA-Z]\.[a-zA-Z]/, then the surrounding letters will be removed but obviously I want to keep them. They should just be used to identify the context.
I have searched for a long time but have not come up with a solution. More specifically, I am working with bash, so maybe sed is what I am going to use.
Thank you.

It is just a matter of capturing the surrounding characters with () and putting them back with \1, \2, etc.:
sed -r 's/(\w)\.(\w)/\1\t\2/g' file
Using your syntax:
sed -r 's/([a-zA-Z])\.([a-zA-Z])/\1\t\2/g' file
Test
$ cat file
John, S.
JOHN.S
blabla
$ sed -r 's/(\w)\.(\w)/\1\t\2/g' file
John, S.
JOHN S
blabla
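The same capture-group idea can be sketched without relying on \t in the replacement, which not every sed accepts. The helper name dot_to_tab is made up for illustration, and -E is assumed here in place of -r (GNU sed accepts both):

```shell
# Hypothetical helper: turn a period between two ASCII letters into a tab.
dot_to_tab() {
  tab=$(printf '\t')   # a literal tab character
  sed -E "s/([a-zA-Z])\.([a-zA-Z])/\1${tab}\2/g"
}

printf 'John, S.\nJOHN.S\nblabla\n' | dot_to_tab
```

Note that with the g flag a letter consumed by one match cannot start the next one, so in A.B.C only the first period is replaced in a single pass.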

Related

Find the first name that starts with any letter other than S using regex

I am new to regex and I am trying to find, in a text file, the lines where the last name starts with S, followed by a comma and a space, and the first name does not start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
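To see what grep actually receives, the argument can be printed with and without quotes. This is a small sketch assuming a POSIX-style shell such as bash; pathname expansion is switched off with set -f so that the brackets are not matched against file names:

```shell
set -f                                   # disable pathname expansion for the demo
unquoted=$(printf '%s\n' ^[S\w])         # the shell strips the backslash
quoted=$(printf '%s\n' '^[S\w]')         # single quotes keep it intact
set +f

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```

The unquoted form arrives as ^[Sw], which is why the pattern ends up matching a literal w.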
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
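The difference between * and + mentioned above can be checked with a short sketch. The line S, Paul is a made-up entry added to a subset of the sample data, and find_star and find_plus are hypothetical helper names:

```shell
people='Schmidt, Paul
Sells, Simon
Smith, Peter
S, Paul
Someone'

# * allows a bare "S" as the last name; + requires at least one more letter.
find_star() { printf '%s\n' "$people" | grep -E -e '^S[[:alpha:]]*, [^S]'; }
find_plus() { printf '%s\n' "$people" | grep -E -e '^S[[:alpha:]]+, [^S]'; }

echo 'with *:'; find_star
echo 'with +:'; find_plus
```

Only the * variant reports the invented S, Paul line.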
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

sed -e "s/^[^#][A-Z]/export /g" will remove first letters... how to prevent this matter?

Say I have this file:
#!/bin/bash
#t1.sh
#There are so many comments here
#I HAVE A PEN
VAR=0
#I GOT AN APPLE
mixed
#APPLE PEN HERE
VAR1=$APPLE_PEN
Now I want to add "export" before every Variable and print result on the screen. I wrote these variable names in upper case then called this sed command:
cat t1.sh|sed -e "s/^[^#][A-Z]/export /g"
But the result removed first letters like:
#I HAVE A PEN
export R=0
#I GOT AN APPLE
mixed
#APPLE PEN HERE
export R1=$APPLE_PEN
why is that? how to solve it?
BTW,
I can easily do this kind of regex replacement in Notepad++.
Does sed support such a replacement too? If so, how do I do it?
You need to add a capturing group around the pattern and use a backreference in the RHS part to restore that captured text in the result:
"s/^\([^#][A-Z]\)/export \1/"
    ^^         ^^        ^^
You need to escape the parentheses since you are using the BRE POSIX regex flavor where an unescaped ( matches a literal open parenthesis.
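A short sketch of why the letters vanished and how the group restores them: the original pattern matches two characters (V and A), and the replacement overwrites both. The sample lines are taken from the question; LC_ALL=C is an added assumption so that [A-Z] means only ASCII capitals, and the -E variant is included to show the extended syntax, where the parentheses need no backslashes:

```shell
sample='#I HAVE A PEN
VAR=0
mixed'

# Without a group, the two matched characters are simply replaced.
broken=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/^[^#][A-Z]/export /')
# BRE: escaped parentheses form the group, \1 puts the text back.
bre=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/^\([^#][A-Z]\)/export \1/')
# ERE (-E): the same group with plain parentheses.
ere=$(printf '%s\n' "$sample" | LC_ALL=C sed -E 's/^([^#][A-Z])/export \1/')

printf '%s\n' "$broken" '---' "$bre"
```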
This might work for you (GNU sed):
sed '/^[^#][A-Z]/s/^/export /' file
Which would read: if the line's first character is not a hash and the second character is a capital letter, then insert export followed by a space at the beginning of the line.
or, if you prefer to reuse the matched text with &:
sed 's/^[^#][A-Z]/export &/' file
N.B. There is no need for the g flag on the substitution: sed reads one line at a time, and each line has only one start.
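The two variants above can be compared side by side. In this sketch the sample text comes from the question, and LC_ALL=C is an added assumption to keep [A-Z] limited to ASCII capitals:

```shell
sample='#I GOT AN APPLE
VAR=0
mixed
VAR1=$APPLE_PEN'

# Address form: select matching lines, then insert at the start of the line.
with_address=$(printf '%s\n' "$sample" | LC_ALL=C sed '/^[^#][A-Z]/s/^/export /')
# Ampersand form: & stands for the whole matched text.
with_amp=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/^[^#][A-Z]/export &/')

printf '%s\n' "$with_address"
```

Both commands should give the same result; the address form leaves the match untouched, while the ampersand form rewrites it with itself.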

Using ampersand in sed

I have a csv file full of lines like the following:
Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam
I want to create a sed pattern, for use in an automated batch script, that changes lines in this format into the following format:
Aity Chel Jenni,Hendaland 229,2591 TE, Amsterdam
With a bit of research, I found out that I had to create a regex, then use an ampersand (&) character to have it change things around using the & to define the location of the regex.
I have tried the following:
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
And have been trying variants of that trying to get the regexes down, but it doesn't seem to change anything.
Am I making a mistake in the usage of the ampersand or is my regex wrong?
Reading through the internet I can't seem to wrap my head around this function, can someone give me any examples/explain to me how to properly do this?
You are saying
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
       ^
But you don't have to capture with () to use &. Instead, just say:
sed 's/[1-9] [A-Z]\{2\}/&,/' file
Note that you need to escape the braces of the { } quantifier with backslashes, unless you use -r:
sed -r 's/[1-9] [A-Z]{2}/&,/' file
Try the following:
sed -r 's:[0-9] [A-Z]{2}\b:&,:' file > out
About your own pattern, you're missing the closing parenthesis. And, iirc, you need to escape ( inside sed patterns to not match them literally.
The -r option enables extended regular expressions in sed, so the {2} quantifier can be written without backslashes.
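As a quick check of the forms above, the following sketch applies them to the sample line from the question. The second input, with a postcode ending in 0, is invented here to show why [0-9] is the safer class; add_comma is a hypothetical helper name, and -E is used in place of -r (GNU sed accepts both):

```shell
line='Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam'

# Basic syntax: the braces of the quantifier are escaped.
bre=$(printf '%s\n' "$line" | sed 's/[1-9] [A-Z]\{2\}/&,/')
# Extended syntax: plain braces.
ere=$(printf '%s\n' "$line" | sed -E 's/[1-9] [A-Z]{2}/&,/')

# Hypothetical helper using [0-9] so a postcode ending in 0 also matches.
add_comma() { sed -E 's/[0-9] [A-Z]{2}/&,/'; }
zero=$(printf '%s\n' 'Some Name,Street 1,2590 TE Amsterdam' | add_comma)

printf '%s\n' "$bre" "$ere" "$zero"
```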

Using sed to replace one line (that might change) with another

I want to run a script that changes a line in the HTML code, indicating when the page was last updated. So for instance, I have the line
<d>This page was last updated on 29.04.2013 at 00:34 UTC</d>
and I am updating it now, so I want to replace that line with
<d>This page was last updated on 15.05.2013 at 15:50 UTC</d>
This is the only line in my source code that has the <d> tag, so hopefully that helps. I already have some code that generates the new string with the current date and time, but I can't figure out a way to replace the old one (which changes, so I don't know exactly what it is).
I've tried putting in a comment <!--date--> in the previous line, deleting the whole line that has <d> (with grep), and then putting in a new line after the comment that is the new string, but that fails. For example, if I want to just insert the string text after the comment, and use
sed -i 's/<!--date-->/<!--date-->text/' file.html
I get invalid command code j. I think it might be because there are some special characters like <,!, and > in the strings, but if I want to put in the date string above, I will have even more, like : and /. Thanks for any ideas on how to fix this.
This will change the text only on lines that contain <d>:
sed -i.bak "/<d>/s/on .* at [^<]*/on newdate at newtime/" file.html
I've tested this with the BSD sed that ships with MacOS X 10.8.3
You don't need your <!--date--> hack. You can use regular expressions and another delimiter besides "/" in your sed command:
sed -i.bak 's#<d>This page was last updated on.*</d>#<d>This page was last updated on 12.05.2013 at 00:38 UTC</d>#' whatever.html
Or, if you have your update in a variable called $replacement:
sed -i.bak "s#<d>This page was last updated on.*</d>#$replacement#" whatever.html
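Putting the pieces together, a sketch of the whole update might look like this. The temporary file, the stamp variable, and the date format string are assumptions made for illustration. The replacement text must not contain the delimiter #, an ampersand, or a backslash, since sed treats those specially:

```shell
# Build a throwaway page that contains the line from the question.
page=$(mktemp)
cat > "$page" <<'EOF'
<html>
<d>This page was last updated on 29.04.2013 at 00:34 UTC</d>
</html>
EOF

# Assumed format, modeled on the line in the question: DD.MM.YYYY at HH:MM UTC
stamp=$(date -u '+%d.%m.%Y at %H:%M UTC')
replacement="<d>This page was last updated on $stamp</d>"

# -i.bak edits in place and keeps the old file as a .bak copy.
sed -i.bak "s#<d>This page was last updated on.*</d>#$replacement#" "$page"

updated=$(cat "$page")
printf '%s\n' "$updated"
rm -f "$page" "$page.bak"
```

Writing the suffix directly after -i, as in -i.bak, is accepted by both GNU sed and BSD sed.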
When using the command line, try escaping special characters like this:
! ===> \!

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn Jackman's response is OK, but it also doubles the rows that do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p flag in s/\.png//p prints the line only if the substitution was made.
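The difference between the two commands shows up as soon as the input has a line without .png. In this sketch the Notes.txt line is invented for illustration:

```shell
sample='#"Afghanistan.png",
#"Notes.txt",'

# p prints every line, then the automatic print emits it again (modified or not).
dup_all=$(printf '%s\n' "$sample" | sed 'p; s/\.png//')
# With -n, the second copy appears only when the substitution succeeded.
dup_matching=$(printf '%s\n' "$sample" | sed -n 'p; s/\.png//p')

printf '%s\n' "$dup_all" '---' "$dup_matching"
```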
That is pretty easy to do with sed, and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution (s///). It matches #" followed by a run of non-period characters ([^.]*) and then .png",. The run of non-period characters is wrapped in the group brackets \( and \), so that the text it matched can be reused later. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & inserts everything that was matched by #"\([^.]*\)\.png", into the output. If & were the only element of the replacement, the line would come out unchanged. However, the & is followed by a newline, written as a backslash \ followed by an actual newline, and on the new line we add the #" string, then the content of the first group (\1), then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input
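A sketch of how the anchored form behaves on a line that merely mentions a .png file. The extra line and the helper name dup_png are invented for illustration:

```shell
# Hypothetical wrapper around the combined command.
dup_png() { sed -e '/^#".*\.png",/{p;s/\.png//;}'; }

sample='#"Albania.png",
see Albania.png for the flag'

result=$(printf '%s\n' "$sample" | dup_png)
printf '%s\n' "$result"
```

Only the line that starts with #" and ends its name with .png", is duplicated; the other line passes through once and keeps its .png.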