How to replace text with comma [Linux on Windows] - regex

We get these automated emails from our client that have this rough format:
VP##0-X1-#####-#[Revision #:Document title]
VP##0-X2-#####-#[Revision #:Document title]
VP##0-X3-#####-#[Revision #:Document title]
What I want to do:
replace [Revision with a comma
replace : with a comma
delete ]
So that I can convert this into a CSV and then use some excel magic to fill in our tracking sheet.
I've tried to use sed with this general format:
sed -i 's,[Revision ,\,,g' <FILE>
but I don't know how to get a comma in for this case.
This is what I want to get in the end:
VP##0-X1-#####-#,#,Document title
VP##0-X2-#####-#,#,Document title
VP##0-X3-#####-#,#,Document title
Any and all insight is appreciated.
I'm using Ubuntu on Windows.

sed 's/\[Revision /,/;s/:/,/;s/]//' inputfile
VP##0-X1-#####-#,#,Document title
VP##0-X2-#####-#,#,Document title
VP##0-X3-#####-#,#,Document title
There is no need for back-references or for several separate sed invocations. You can issue multiple substitution commands within a single sed command:
Syntax:
sed 's/a/A/' file
sed 's/b/B/' file
sed 's/c/C/' file
Can be combined into one command:
sed 's/a/A/;s/b/B/;s/c/C/' file #note the semicolon separating multiple replace operations.
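As a minimal sketch of the combined form applied to the format from the question (the sample part numbers and document titles below are made up for illustration):

```shell
# Made-up sample lines in the format described in the question.
input='VP110-X1-00001-A[Revision 2:Pump datasheet]
VP110-X2-00002-B[Revision 0:Valve list]'

# One sed invocation, three substitutions separated by semicolons:
#   1. "[Revision " (including the trailing space) becomes a comma
#   2. the first ":" on the line becomes a comma
#   3. a "]" at the end of the line is deleted
csv=$(printf '%s\n' "$input" | sed 's/\[Revision /,/; s/:/,/; s/]$//')

printf '%s\n' "$csv"
# VP110-X1-00001-A,2,Pump datasheet
# VP110-X2-00002-B,0,Valve list
```

Leaving the g flag off s/:/,/ means only the first colon is replaced, so a colon inside a document title survives. A title that itself contains a comma would still need quoting before it is valid CSV.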

You can use:
sed -Ei 's/(.*)(\[Revision)(.*)(:)(.*)(])/\1,\3,\5/' <FILE>
Testing it with one line and an echo:
$ echo "[VP##0-X1-#####-#[Revision #:Document title]" | sed -E 's/(.*)(\[Revision)(.*)(:)(.*)(])/\1,\3,\5/'
[VP##0-X1-#####-#, #,Document title
Explanation:
'(.*)(\[Revision)(.*)(:)(.*)(])
The regular expression in the first half of the sed command is divided into 6 groups defined by ().
Group 2 (\[Revision) will match "[Revision" and group 4 (:) will match ":", the parts of the string you want to replace.
/\1,\3,\5/'
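In the second part of the command, the same groups can be used in the replacement text. I used group 1 (\1) to preserve everything before "[Revision", then a comma, then group 3 (\3) (everything between "[Revision" and ":"), another comma, and finally group 5 (\5). Group 6 matches the final "]"; it is left out of the replacement because you wanted to remove it. Note that group 3 starts with the space that follows "[Revision", which is why the output has a space after the first comma; moving that space into group 2, as in (\[Revision ), drops it.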
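A possible tightening of the same idea, as a sketch: capture only the three parts that are kept and match the rest literally. The sample line is made up, and the pattern assumes every line ends with "]".

```shell
# Made-up sample line in the format from the question.
line='VP110-X1-00001-A[Revision 2:Pump datasheet]'

# \1 = everything before "[Revision "
# \2 = the revision (everything up to the first ":")
# \3 = the title (everything up to the closing "]" at end of line)
csv=$(printf '%s\n' "$line" | sed -E 's/^(.*)\[Revision ([^:]*):(.*)]$/\1,\2,\3/')

printf '%s\n' "$csv"
# VP110-X1-00001-A,2,Pump datasheet
```

Using [^:]* for the revision keeps that group from running past the first colon, so a colon in the title does not shift the fields.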

The [ must be escaped since it is a special character for regular expressions. Also, it may be better to use another character than , as separator in the sed command. This should do the trick:
sed -i 's/\[Revision /,/g' <FILE>

With sed, / is a pretty common separator. Also, square brackets are special characters and need to be escaped.
replace [Revision with a comma
sed -i 's/\[Revision /,/g' <FILE>
replace : with a comma
sed -i 's/:/,/g' <FILE>
delete ]
sed -i 's/\]//g' <FILE>

Related

Replace spaces with new lines if part of a specific pattern using sed and regex with extended syntax

so I have a text file with multiple instances looking like this:
word.  word or words [something:'else]
I need to replace with a new line the double space after every period followed by a sequence of words and then a "[", like so:
word.\nword or words [something:'else]
I thought about using the sed command in bash with extended regex syntax, but nothing has worked so far... I've tried different variations of this:
sed -E 's/(\.)(  )(.*)(.\[)/\1\n\3\4/g' old.txt > new.txt
I'm an absolute beginner at this, so I'm not sure at all about what I'm doing 😳
This might work for you (GNU sed):
sed -E 's/\.  ((\w+ )+\[)/.\n\1/g' file
Globally replace a period followed by two spaces, one or more space-separated words and an opening square bracket with a period, a newline, and the text captured by the first group (the words and the bracket).
Your sed command is almost correct (but contains some redundancies):
sed -E 's/(\.)(  )(.*)(.\[)/\1\n\3\4/' old.txt > new.txt
#                                   ^
# You forgot to terminate the s command with this closing /
But you don't need to capture everything. A simpler one could be:
sed -E 's/\.  (.*\[)/.\n\1/' old.txt > new.txt
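A small sketch for checking the substitution without touching a file, assuming GNU sed (the \n in the replacement and \w are GNU extensions). The sample text is made up and contains two occurrences on one line to show the effect of the g flag.

```shell
# Made-up sample: two ".  words [" occurrences on a single line.
text='First sentence.  second part [tag:a]  Other one.  third bit [tag:b]'

# Period + two spaces + one or more "word " groups + "[" becomes
# period + newline + the captured words and bracket.
result=$(printf '%s\n' "$text" | sed -E 's/\.  ((\w+ )+\[)/.\n\1/g')

printf '%s\n' "$result"
# First sentence.
# second part [tag:a]  Other one.
# third bit [tag:b]
```

The two spaces before "Other" are left alone because they follow "]" rather than a period.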

Sed matches unwanted extra characters

I want to replace parts of file paths in a configuration file using sed in Cygwin. The file paths have the form \\\\some\\constant\\path\\2018-03-20_2030.1\\Release\\base\\some_dll.dll (yes, double backslashes in the file), and the leading part, up to and including the date, should be replaced.
For matching I've written the following regex: \\\\\\\\some\\\\constant\\\\path\\\\[0-9_\.-]* with a character set that is supposed to match only the date, consisting of digits and the "-", "_" and "." symbols. This results in the following replacement command: sed 's/\\\\\\\\some\\\\constant\\\\path\\\\[0-9_\.-]*/bla/g' file.txt
The problem is that, after replacement, I get blaRelease\\base\\some_dll.dll instead of bla\\Release\\base\\some_dll.dll, which is what I got when testing the regex in Regexr.
Why does sed behave this way and how can I fix it?
The problem is that the character class [0-9_\.-] also matches backslashes: inside a bracket expression, sed treats \ as a literal character, not as an escape. If you replace the class with [0-9_.-], it will do what you expect.
Note that in a character class, . isn't special and doesn't need escaping. For example, from my Cygwin command line:
$ echo '\.' | sed 's/[\.]/x/g'
xx
$ echo '\.' | sed 's/[.]/x/g'
\x
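The same effect can be reproduced on the path from the question. This is a sketch that feeds the line through a pipe instead of reading file.txt; the variable names are only for illustration.

```shell
# The path as it appears in the file (single quotes keep every backslash).
line='\\\\some\\constant\\path\\2018-03-20_2030.1\\Release\\base\\some_dll.dll'

# With \. inside the brackets, the class also contains a literal backslash,
# so the match runs on through the "\\" that follows the date.
broken=$(printf '%s\n' "$line" | sed 's/\\\\\\\\some\\\\constant\\\\path\\\\[0-9_\.-]*/bla/')

# Without the backslash, the class stops at the end of the date.
fixed=$(printf '%s\n' "$line" | sed 's/\\\\\\\\some\\\\constant\\\\path\\\\[0-9_.-]*/bla/')

printf '%s\n' "$broken"   # blaRelease\\base\\some_dll.dll
printf '%s\n' "$fixed"    # bla\\Release\\base\\some_dll.dll
```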
A simpler sed command may also help:
sed 's/.*Release/bla\\\\Release/' Input_file
If you want to save the output back into Input_file and keep a backup of the original, use:
sed -i.bak 's/.*Release/bla\\\\Release/' Input_file
If you want to save the output into Input_file without creating a backup, use:
sed -i 's/.*Release/bla\\\\Release/' Input_file
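To see what -i.bak leaves behind, here is a sketch that runs the command on a throwaway copy. The temporary directory and the file name config.txt are assumptions made for the demonstration, not part of the original setup.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cfg="$workdir/config.txt"
printf '%s\n' '\\\\some\\constant\\path\\2018-03-20_2030.1\\Release\\base\\some_dll.dll' > "$cfg"

# Edit in place; the original content is kept in config.txt.bak.
sed -i.bak 's/.*Release/bla\\\\Release/' "$cfg"

edited=$(cat "$cfg")
backup=$(cat "$cfg.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

rm -r "$workdir"
```

Because .* is greedy, everything up to the last "Release" on the line is replaced, so this form only fits when "Release" appears once per path.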

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and i want to extract the MSISDN value only, so expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command but i can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
A grep solution (GNU grep), built from these pieces:
-o: print only the matching part, not the whole line.
-P: enable PCRE (Perl-compatible) regexes.
\K: drop everything matched so far from the reported match; that text must still be present in the input.
\d+: \d means a digit, + means one or more of them.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed command may also help:
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with nothing (an empty replacement).
;: the semicolon separates this command from the next one.
s/&.*//: substitutes everything from the first remaining & to the end of the line with nothing.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds don't consume any characters while matching, unlike the rest of the regex. Hence the only matched output you get is \d+, which is one or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
You can also pipe one cut into another, as long as MSISDN is always the third parameter:
cut -d '&' -f3 Input_file |cut -d '=' -f2
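A possible variant of the sed approaches above, sketched for logs where the parameter order varies or where some lines have no MSISDN at all. The second and third sample lines are made up for illustration.

```shell
# First line is from the question; the other two are hypothetical.
log='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?MSISDN=647930229656&Ported=No
/other/page?foo=bar'

# -n plus the p flag prints only lines where the substitution succeeded.
# [?&] accepts MSISDN as the first or a later parameter.
# [^& ]* stops at the next parameter or at a space after the URL.
numbers=$(printf '%s\n' "$log" | sed -n 's/.*[?&]MSISDN=\([^& ]*\).*/\1/p')

printf '%s\n' "$numbers"
# 647930229655
# 647930229656
```

Without -n and p, lines that do not contain MSISDN would be printed unchanged.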

Capture text between two tokens

I'm trying to get the text between two tokens.
For example, let's say the text is:
arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end
The output should be: CaptureThis
And the two tokens are: :start: and /end
The closest I could get was using this regex:
INPUT="arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end"
VALUE=$(echo "${INPUT}" | sed -e 's/:start:\(.*\)\/end/\1/')
... but this returns most of the string: arn:aws:dfasdfasdf/asdfaCaptureThis
How do I get all of the other text out of the way?
You could use (GNU) grep with Perl regular expressions (look-arounds) and the -o option to only return the match:
$ grep -Po '(?<=:start:).*(?=/end)' <<< 'arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end'
CaptureThis
Try this:
$ sed 's/^.*:start:\(.*\)\/end.*$/\1/' <<<'arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end'
CaptureThis
The problem with your approach was that you only replaced part of the input line, because your regex didn't capture the entire line.
Note how the command above anchors the regex both at the beginning of the line (^.*) and at the end (.*$) so as to ensure that the entire line is matched and thus replaced.
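The difference is easy to see side by side. A short sketch using the input from the question (the variable names partial and whole are just for illustration):

```shell
input='arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end'

# The regex covers only ":start:CaptureThis/end", so only that part is
# replaced and the text in front of it is left in place.
partial=$(printf '%s\n' "$input" | sed 's/:start:\(.*\)\/end/\1/')

# The regex covers the whole line, so the whole line is replaced by \1.
whole=$(printf '%s\n' "$input" | sed 's/^.*:start:\(.*\)\/end.*$/\1/')

printf '%s\n' "$partial"   # arn:aws:dfasdfasdf/asdfaCaptureThis
printf '%s\n' "$whole"     # CaptureThis
```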
You could use :
VALUE=$(echo "${INPUT}" | sed -e 's/.*:start:\(.*\)\/end.*/\1/')
If the tokens are liable to change, you could use variables - but since "/end" has a "/", that could lead to sed getting confused, so you'd probably want to change its delimiter to some non-conflicting character (like a "?"), so :
TOKEN1=":start:"
TOKEN2="/end"
VALUE=$(echo "${INPUT}" | sed -e "s?.*$TOKEN1\(.*\)$TOKEN2.*?\1?")
There is no need for any external utilities, bash parameter-expansion will handle it all for you:
INPUT="arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end"
token=${INPUT##*:}
echo ${token%/*}
Output
CaptureThis
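The expansion above splits on the last ":" and the last "/", so it relies on the captured text containing neither. A sketch of the same technique keyed on the actual tokens instead (token1, token2 and rest are illustrative names):

```shell
input='arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end'
token1=':start:'
token2='/end'

# Remove the shortest prefix that ends with the first token.
rest=${input#*"$token1"}
# Remove the longest suffix that starts with the second token.
value=${rest%%"$token2"*}

printf '%s\n' "$value"
# CaptureThis
```

Quoting the token variables inside the expansion makes the shell treat them as literal text rather than as glob patterns.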

Extracting Substring from String with Multiple Special Characters Using Sed

I have a text file with a line that reads:
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
And I'm wanting to use sed or awk to extract the substring above between the single quotes so it just prints ...
Any phrase's characters can go here!
I want the phrase to be delimited as I have above, starting after the single quote and ending at the single-quote immediately followed by a parenthesis and then semicolon. The following sed command with a capture group doesn't seem to be working for me. Suggestions?
sed '/^<div id="page_footer"><div><? print(\'\(.\+\)\');/ s//\1/p' /home/foobar/testfile.txt
Using cut like this would be incorrect:
grep "page_footer" /home/foobar/testfile.txt | cut -d "'" -f2
It goes wrong when the string itself contains single quotes. Counting the single quotes first would turn a simple solution into an over-complicated one.
A solution with sed is better: remove everything up to the first single quote and everything after the last one. Putting a single quote in the sed script gets messy, because you have to close the single-quoted shell string, add an escaped single quote, and then open a new single-quoted string:
grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*//' -e 's/[^'\'']*$//'
And this is not the full solution, you want to remove the first/last quotes as well:
grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*'\''//' -e 's/'\''[^'\'']*$//'
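Writing the sed parameters in double-quoted strings and using the . wildcard to match the single quote makes the line shorter (inside a bracket expression the quote needs no backslash, and a backslash placed there would itself become part of the excluded set):
grep page_footer /home/foobar/testfile.txt | sed -e "s/^[^']*.//" -e "s/.[^']*$//"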
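Another way to write it, as a sketch: a single capture group anchored on the delimiters named in the question, print(' in front and '); behind. The line is read from a here-document here instead of /home/foobar/testfile.txt so the example is self-contained.

```shell
# The line from the question, read verbatim (quoted EOF disables expansion).
IFS= read -r line <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF

# The double-quoted script lets single quotes appear unescaped.
# -n plus the p flag prints the line only if the pattern matched.
phrase=$(printf '%s\n' "$line" | sed -n "s/.*print('\(.*\)');.*/\1/p")

printf '%s\n' "$phrase"
# Any phrase's characters can go here!
```

Because the capture group is greedy, it extends to the last '); on the line, which is what allows single quotes inside the phrase.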
With GNU grep (as found on Linux), which supports Perl-compatible regexes via -P, this might be what you are looking for:
grep -Po "(?<=').*?(?='\);)"