Use "sed" to Remove Capture Group 1 From All Lines In a File


I currently have a file with lines like the below:
ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z
My goal is to remove everything from the "#" to the next comma, such that it instead looks like the below:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
I'm not that experienced with utilizing sed and RegEx expressions. In playing around on a testing website, I came up with the below RegEx string, in which capture group 1 is perfectly matching to what I want to remove:
regex101.com Test
How would I go about putting this in a "sed" command that runs against a given input file and writes the results to a new output file? I had tried the below most recently:
sed 's/(#.+?),//' input.csv > input_Corrected.csv
Just as another note, I'm doing this in a bash script in which I have an API call generating the "input.csv" file, and then want to run this sed command to clean up the data format to match my needs.

You can use
sed 's/#[^,]*,/,/' input.csv > input_Corrected.csv
sed 's/#[^,]*//' input.csv > input_Corrected.csv
The #[^,]*, POSIX BRE pattern matches a #, then zero or more characters other than a comma, then a comma. The first command replaces the match with a comma; use it if a comma MUST follow the match. The second command leaves the comma out of the pattern and keeps the replacement empty, so it also works when the match runs to the end of the line.
See the online demo:
s='ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z'
sed 's/#[^,]*,/,/' <<< "$s"
Output:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
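As a side note on why the attempt in the question changes nothing: sed does not use PCRE syntax. Without -E the parentheses are literal characters, and sed has no lazy quantifiers such as +?. A small sketch, using one line from the question, that compares the original attempt with the bracket-expression version (the variable names are only for illustration):

```shell
line='ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z'

# Original attempt: in BRE mode "(" and ")" are literal, so the pattern
# never matches and the line comes back unchanged.
attempt=$(printf '%s\n' "$line" | sed 's/(#.+?),//')

# Negated bracket expression: "everything up to the next comma",
# with no need for a lazy quantifier.
fixed=$(printf '%s\n' "$line" | sed 's/#[^,]*//')

printf '%s\n%s\n' "$attempt" "$fixed"
```

The first printed line is identical to the input; the second has the #gmail.com part removed.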

You can use the regular expression below to strip the domain part only when it looks like a valid email domain (note the -E flag, which the +, {2,5} and unescaped parentheses need):
sed -E "s/#([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})//g" input.csv > input_Corrected.csv
For your requirement, however, use the command below. It strips the part after # on every line, including entries such as "calvin_hobbes2#netnet" that are not valid email addresses.
sed "s/#[^,]*//g" input.csv > input_Corrected.csv
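To see the difference between the two approaches, here is a rough sketch on two rows from the question. The strict variant is written with -E and the bracket expression [a-zA-Z0-9_.-], which is my reading of what the pattern is meant to do; the variable names are made up for the demo.

```shell
rows='ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z'

# Strict: removes "#domain.tld" only; "#netnet" has no dot, so it stays.
strict=$(printf '%s\n' "$rows" | sed -E 's/#[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}//g')

# Loose: removes everything from "#" up to (not including) the next comma.
loose=$(printf '%s\n' "$rows" | sed 's/#[^,]*//g')

printf 'strict:\n%s\nloose:\n%s\n' "$strict" "$loose"
```

Only the loose version cleans the calvin_hobbes2#netnet row, which is what the question asks for.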

Related

How to use sed to search and replace a pattern who appears multiple times in the same line?

Because the question can be misleading, here is a little example. I have this kind of file:
some text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##
In this example, I want to replace each occurrence of KEY- inside a pair of ## by VALUE-. I started with this sed command:
sed -i 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'
Here is how it works:
\(##[^#]*\): create a first group composed of two # and any characters except # ...
KEY-: ... until the last occurrence of KEY- on that line
\([^#]*##\): and create a second group with all the characters except # until the next pair of #.
The problem is my command can't handle correctly the following line because there are multiple KEY- inside my pair of ##:
again ##some-text-KEY-some-other-text-KEY-text##
Indeed, I get this result:
again ##some-text-KEY-some-other-text-VALUE-text##
If I want to replace all the occurrences of KEY- in that line, I have to run my command multiple times and I prefer to avoid that. I also tried with lazy operators but the problem is the same.
How can I create a regex and a sed command that correctly handle my whole file?

The problem is rather complex: you need to replace all occurrences of some multicharacter text inside blocks of text between identical multicharacter delimiters.
The easiest and safest way to solve the task is using Perl:
perl -i -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' file
See the online demo.
The (##)(.*?)(##) pattern will match strings between two adjacent ## substrings capturing the start delimiter into Group 1, end delimiter in Group 3, and all text in between into Group 2. Since the regex substitution re-sets all placeholders, the temporary variable is used to keep the value of the end delimiter ($end_delim=$3), then, "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim" replaces the match with the value in the Group 1 of the first match (the first ##), then the Group 2 value with all KEY- replaced with VALUE-, and then the end delimiter.
If no KEY- occurs between two ##...## blocks on the same line, you may instead loop in sed by wrapping your command between a label :A and a conditional branch tA:
sed -i ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' file
Note you missed the first placeholder in \VALUE-\2, it should be \1VALUE-\2.
See the online demo:
s="some KEY- text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##"
sed ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' <<< "$s"
Output:
some KEY- text
some text ##some-text-VALUE-some-other-text##
text again ##some-text-VALUE-some-other-text## ##some-text-VALUE-some-other-text##
again ##some-text-VALUE-some-other-text-VALUE-text##
some text with KEY ##VALUE-some-text##
blabla ##KEY##
More details:
sed supports loops and branches. The :A in the code above is a label, a location marker that a branch command can jump to. t is a conditional branch: it jumps to the given label only if a substitution has succeeded since the current line was read (or since the last t branch was taken). So tA means: if the preceding s command replaced something, go back to :A and run the substitution again. Once a pass replaces nothing, t does not jump, the script ends for that line, and sed prints the line and moves on to the next one.
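The effect of the loop is easiest to see on the one problematic line from the question. A minimal sketch that runs the same substitution once and then inside the :A ... tA loop (the variable names are just for this demo):

```shell
line='again ##some-text-KEY-some-other-text-KEY-text##'
cmd='s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'

# Single pass: the greedy [^#]* swallows the first KEY-, so only the
# last KEY- inside the block is replaced.
once=$(printf '%s\n' "$line" | sed "$cmd")

# Looped: t jumps back to :A after every successful substitution,
# so the command is retried until nothing is left to replace.
looped=$(printf '%s\n' "$line" | sed -e ':A' -e "$cmd" -e 'tA')

printf '%s\n%s\n' "$once" "$looped"
```

The label and the branch are passed as separate -e expressions here only because that form is also accepted by non-GNU sed implementations.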

This might work for you (GNU sed):
sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' file
Convert each ## to a newline. Using a loop, replace each KEY- that sits between a pair of newlines with VALUE-. When done, convert the newlines back to ##.
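A quick sketch of that approach on a made-up line that also has KEY- outside the ## blocks. It assumes GNU sed, since it relies on \n being understood both in the replacement and inside bracket expressions:

```shell
# Hypothetical test line: KEY-a and KEY-d are outside any ##...## block.
line='KEY-a ##KEY-b-KEY-c## KEY-d ##e-KEY-f##'

# Only a KEY- preceded by an odd number of delimiters (i.e. inside a
# block) is rewritten; the ones outside are left alone.
out=$(printf '%s\n' "$line" |
  sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g')

printf '%s\n' "$out"
```

Unlike the simpler label-and-branch loop, this version should not touch a KEY- that sits between two blocks.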

sed regex match and replace any last digit

I have lots of files containing the following IP address settings, and I want to replace the last digit of the IP address, but I am struggling to come up with the correct regex.
file1
IPADDR=10.30.2.26
NETMASK=255.255.0.0
GATEWAY=10.30.0.1
I want to change 10.30.2.26 to 10.30.2.27 using sed, but I am missing something.
I have many files to update, and the last digit could be anything.
I have tried sed 's/[^IPADDR].$/7/g' file1
How do I match anything between ^IPADDR{anything}$ ?

In your regex, [^IPADDR] is a bracket expression that matches any single character except those listed between the brackets (I, P, A, D, R). I'm not sure that's what you want.
You can use an address instead to select the lines starting with IPADDR (/^IPADDR/) and apply the substitution command only to them:
sed '/^IPADDR/s/[0-9]$/7/' file
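A small sketch on the sample file content from the question, comparing the substitution with and without the /^IPADDR/ address (the variable names are made up):

```shell
cfg='IPADDR=10.30.2.26
NETMASK=255.255.0.0
GATEWAY=10.30.0.1'

# No address: the last digit of every line is replaced.
everywhere=$(printf '%s\n' "$cfg" | sed 's/[0-9]$/7/')

# With the address, only the line starting with IPADDR is edited.
addressed=$(printf '%s\n' "$cfg" | sed '/^IPADDR/s/[0-9]$/7/')

printf '%s\n---\n%s\n' "$everywhere" "$addressed"
```

Without the address, the NETMASK and GATEWAY lines would also end in 7, which is presumably not wanted.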

You may use the following command:
sed -r 's/(^IPADDR=[0-9.]+)([0-9]$)/\17/g' file
Prints:
IPADDR=10.30.2.27
NETMASK=255.255.0.0
GATEWAY=10.30.0.1

Property File with Sed regex - Ignore first character for match

I have a test property file with this in it:
-config.test=false
config.test=false
I'm trying to, using sed, update the values of these properties whether they have the - in front of them or not. Originally I was using this, which worked:
sed -i -e "s/#*\(config.test\)\s*=\s*\(.*\)/\1=$(echo "true" | sed -e 's/[\/&]/\\&/g')/" $FILE_NAME
However, since I was basically ignoring all characters before the match, I found that when I had properties with keys that ended in the same value, it'd give me problems. Such as:
# The regex matches both of these
config.test=true
not.config.test=true
Is there a way to either ignore the first character for a match or ignore the initial - specifically?
EDIT:
Adding a little clarification in terms of what I'd want the regex to match:
config.test=false # Should match
-config.test=false # Should match
not.config.test=false # Should NOT match

sed -E 's/^(-?config\.test=).*/\1true/' file
? means zero or one repetition of the preceding item, so the - may or may not be present when the regexp matches. The ^ anchor keeps not.config.test from matching.
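Run against the three cases listed in the question, this should behave roughly as follows (a sketch fed from a variable instead of a file):

```shell
props='config.test=false
-config.test=false
not.config.test=false'

# ^ anchors the match at the start of the line and -? allows one
# optional leading dash, so "not.config.test" cannot match.
out=$(printf '%s\n' "$props" | sed -E 's/^(-?config\.test=).*/\1true/')

printf '%s\n' "$out"
```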

I found a solution with sed and awk that matches a key of a specific length instead of ignoring the first character. Sometimes approaching the problem from the opposite side is easier.
If you can only use sed, I have two workarounds, depending on your file.
If your file looks like this
$ cat file
config.test=false
-config.test=false
not.config.test=false
you can use this one-liner
sed 's/^\(.\{11,12\}=\)\(.*$\)/\1true/' file
sed starts at the beginning of each line (^) and captures, in a group \( ... \) for later back-referencing, any character . occurring 11 or 12 times \{11,12\}, followed by a =.
This first group is written back through the back reference \1.
The second group, which matches every character after the = up to the end of the line \(.*$\), is dropped; sed writes your desired string true in its place.
This also means that anything following the old value on the line, such as a trailing comment, is removed.
If you want to avoid this and your file looks like
$ cat file
config.test=false # Should match
-config.test=false # Should match
not.config.test=false # Should NOT match
you can use this one-liner
sed 's/^\(.\{11,12\}=\)\(false\)\(.*$\)/\1true\3/' file
This is like the example before but works with three groups for back referencing.
The content of the former group 2 is now in group 3. So no content after a change from false to true will be chopped.
The new second group \(false\) will be dropped and replaced by the string true.
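As a rough check of the three-group version, here is a sketch on input that still contains false and carries trailing comments (the data mirrors the cases from the question):

```shell
props='config.test=false # Should match
-config.test=false # Should match
not.config.test=false # Should NOT match'

# Group 1: an 11- or 12-character key plus "="; group 2: the old value;
# group 3: the rest of the line, which is put back via \3.
out=$(printf '%s\n' "$props" |
  sed 's/^\(.\{11,12\}=\)\(false\)\(.*$\)/\1true\3/')

printf '%s\n' "$out"
```

Keep in mind that the length test is only a stand-in for the key name: any other key of 11 or 12 characters whose value starts with false would be changed as well.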
If your file looks like in the example before and you are allowed to use awk, you can try this
awk -F'=' 'length($1)<=12 {sub(/false/,"true")};{print}'
To me this looks much more self-explanatory, but the choice is yours.
Both sed examples invoke sed only once, which is always good.
The first sed command takes 39 characters to type and the second 50.
The awk command takes 52 characters to type.
Please tell me if this works for you or if you need another solution.

Understanding a sed example

I found a solution for extracting the password from a Mac OS X Keychain item. It uses sed to get the password from the security command:
security 2>&1 >/dev/null find-generic-password -ga $USER | \
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The code is here in a comment by 'sr105'. The part before the | evaluates to password: "secret". I'm trying to figure out exactly how the sed command works. Here are some thoughts:
I understand the flags -En, but what are the commas doing in this example? In the sed docs it says a comma separates an address range, but there's 3 commas.
The first 'address' /^password: / has a trailing s; in the docs s is only mentioned as the replace command like s/pattern/replacement/. Not the case here.
The ^password: "(.*)"$ part looks like the Regex for isolating secret, but it's not delimited.
I can understand the end part where the back-reference \1 is printed out, but again, what are the commas doing there??
Note that I'm not interested in an easier alternative to this sed example. This will only be part of a larger bash script which will include some more sed parsing in an .htaccess file, so I'd really like to learn the syntax even if it is obscure.
Thanks for your help!

Here is sed command:
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
Commas are used here as the delimiter of the s command; any other delimiter, such as #, works just as well:
sed -En '/^password: / s#^password: "(.*)"$#\1#p'
/^password: / finds an input line that starts with password:
s#^password: "(.*)"$#\1#p finds and captures double-quoted string after password: and replaces the entire line with the captured string \1 ( so all that remains is the password )

First, the command extracts passwords from a file (or stream) and prints them to stdout.
While a sed command normally runs on every line of a file, sed lets you put a regex address in front of a command to restrict it to the lines that match.
In your case
/^password: /
is a regex, saying that the command:
s,^password: "(.*)"$,\1,p
should be executed only on lines that start with password: , such as password: "secret". The command replaces each such line with the password itself and prints it (the p flag), while -n suppresses all other lines.
The substitute command might look uncommon but you can choose the delimiter in an sed command, it is not limited to /. In this case , was chosen.
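To experiment without touching the keychain, the sed part can be fed a stand-in for the output of security. The sample text below is an assumption about the shape of that output (one unrelated line plus the password line), not real output:

```shell
# Hypothetical stand-in for what the security command prints.
sample='some unrelated line
password: "secret"'

# Comma as the delimiter of s, exactly as in the question.
with_commas=$(printf '%s\n' "$sample" | sed -En '/^password: / s,^password: "(.*)"$,\1,p')

# Same command with the more familiar slash delimiter.
with_slashes=$(printf '%s\n' "$sample" | sed -En '/^password: / s/^password: "(.*)"$/\1/p')

printf '%s\n%s\n' "$with_commas" "$with_slashes"
```

Both variants should print only secret: -n suppresses the unrelated line, and the p flag prints the line on which the substitution happened.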

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks

You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'

Glenn jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p flag in s/\.png//p prints the line again only if a substitution was made.
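The difference only shows up on lines without .png. A small sketch with one matching line from the question and one made-up line that does not match:

```shell
# The second line is a hypothetical entry without a .png extension.
input='#"Afghanistan.png",
#"No_extension",'

# Plain p: every line is printed twice, matching or not.
plain=$(printf '%s\n' "$input" | sed 'p; s/\.png//')

# -n with the p flag on s: the second copy appears only after a
# successful substitution.
selective=$(printf '%s\n' "$input" | sed -n 'p; s/\.png//p')

printf '%s\n---\n%s\n' "$plain" "$selective"
```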

That is pretty easy to do with sed, and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches #" followed by non-period chars ([^.]*) and then by .png",. The non-period chars before .png", are captured with the group brackets \( and \), so we can reuse what this group matched. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would change in the output. However, following the & there is a newline character, represented by the backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input

I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.

or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input
