Sed handle patterns over multiple lines - regex

For example if I have a file like
yo#gmail.com, yo#
gmail.com yo#gmail
.com
And I want to replace the string yo#gmail.com.
If the file had the target string in a single line then we could've just used
sed "s/yo#gmail.com/e#email.com/g" file
So, is it possible for me to catch patterns that are spread across multiple lines without replacing the \n?
Something like this
e#email.com, e#
email.com e#email
.com
Thank you.

You can do this:
tr -d '\n' < file | sed 's/yo#gmail.com/e#email.com/g'
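To see what this pipeline does to the sample from the question, here is a small sketch. The temporary file stands in for the asker's file, and the dot is escaped so it only matches a literal dot. Note that tr deletes every newline, so the three input lines come back as one line, which is not quite what the question asks for.

```shell
# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'yo#gmail.com, yo#' 'gmail.com yo#gmail' '.com' > "$sample"

# Join all lines into one, then substitute on the joined text.
joined=$(tr -d '\n' < "$sample" | sed 's/yo#gmail\.com/e#email.com/g')

printf '%s\n' "$joined"
rm -f "$sample"
```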

This might work for you (GNU sed):
sed -E 'N;s/yo#gmail\.com/e#email.com/g
h;s/(\S+)\n(\S+)/\1\2\n\1/;/yo#gmail\.com/!{g;P;D}
s//\ne#email.com/;:a;s/\n(\S)(\S+)\n\S/\1\n\2\n/;ta
s/(.*\n.*)\n/\1/;P;D' file
Append the following line to the pattern space.
Replace all occurrences of matching email address in both the first and second lines.
Make a copy of the pattern space.
Concatenate the last word of the first line with the first word of the second and keep the second line as is. If there is no match with the email address, revert the line, print/delete the first line and repeat.
Otherwise, replace the match and re-insert the newline as of the length of the first word of the second line (deleting the first word of the second line too).
Remove the newline used for scaffolding, print/delete the first line and repeat.
N.B. The lines will not be of the same length as the originals if the replacement string length does not match the matching string length. Also there has been no attempt to break the replacement string in the same relative split if the match and replacement strings are not the same length.
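As a rough check of the steps above, the script can be run against the sample from the question. This is only a sketch assuming GNU sed (it relies on \S and on N printing the pattern space at end of input); the temporary file is a stand-in for the real one. The output keeps three lines, but as the N.B. says, the line breaks may land at different offsets than in the question because the replacement is shorter than the match.

```shell
# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'yo#gmail.com, yo#' 'gmail.com yo#gmail' '.com' > "$sample"

# Same script as above, unchanged (GNU sed).
result=$(sed -E 'N;s/yo#gmail\.com/e#email.com/g
h;s/(\S+)\n(\S+)/\1\2\n\1/;/yo#gmail\.com/!{g;P;D}
s//\ne#email.com/;:a;s/\n(\S)(\S+)\n\S/\1\n\2\n/;ta
s/(.*\n.*)\n/\1/;P;D' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```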
Alternative:
echo $'yo#gmail.com\ne#email.com' |
sed -E 'N;s#(.*)\n(.*)#s/\\n\1/\\n\2/g#
:a;\#\\n([^/])(.*)\\n(.)?(.*/g)#{s//\1\\n\2\3\\n\4/;H;ba}
x;s/.//;s#\\n/g$#/g#gm;s#\\n/#/#;s/\./\\./g' |
sed -e 'N' -f - -e 'P;D' file
or:
echo 's/yo#gmail.com/e#email.com/' |
sed -E 'h;s#/#/\\n#g;:a;H;s/\\n([^/])/\1\\n/g;ta;x;s/\\n$//mg;s/\./\\./g' |
sed -zf - file
N.B. With the last alternative solution, the last sed invocation can be swapped for the first alternative solution's last sed invocation.

Related

replace last n parts after splitting on delimiter using sed or regex

I need to replace last 2 parts of the string separated by delimiter with empty space to clean up the name.
Example:
something-useful-a12356-78929
=>
something-useful
something-more-useful-v35f62-2728902
=>
something-more-useful
I tried the following:
echo "something-useful-12345-67890" | sed -re 's/(-([0-9])+)//g'
This works if my last 2 elements of delimiter are numbers only, but wouldn't work for the example above. I need to remove the last 2 parts after splitting it on "-"
I can only use sed or regex to solve this.
Does sed 's/\(-[^-]*\)\{2\}$//' file do what you want?
Use [^-] to match anything other than -. Use $ to match the end of the string. Match hyphen followed by non-hyphens twice at the end.
sed -r 's/(-[^-]+){2}$//'
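A minimal sketch of the end-anchored approach applied to both names from the question. The function name strip_last_two is made up for illustration; the expression is written in basic regex syntax so that it does not need -r.

```shell
# Hypothetical helper: drop the last two "-part" groups from each input line.
strip_last_two() {
  sed 's/\(-[^-]*\)\{2\}$//'
}

printf '%s\n' \
  'something-useful-a12356-78929' \
  'something-more-useful-v35f62-2728902' | strip_last_two
```

Because the match is anchored with $, the number of leading parts does not matter; only the final two are removed.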
This might work for you (GNU sed):
sed -re 's/-[^-]*//2g' file
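Removes every occurrence of - followed by non - characters, starting from the second occurrence. Note that this always keeps only the first two parts, so something-more-useful-v35f62-2728902 becomes something-more.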

sed retrieve part of line

I have lines of code that look like this:
hi:12345:234 (second line)
How do I write a line of code using the sed command that only prints out the 2nd item in the second line?
My current command looks like this:
sed -n '2p' file which gets the second line, but I don't know what regex to use to match only the 2nd item '12345' and combine with my current command
Could you please try following, written and tested with shown samples in GNU sed.
sed -n '2s/\([^:]*\):\([^:]*\).*/\2/p' Input_file
Explanation: The -n option stops sed from printing every line, so output happens only where the p flag is given explicitly. The 2s address applies the substitution to the 2nd line only. The regex uses sed's capture groups, whose matched text can be retrieved later as \1, \2 and so on: the first group catches the part before the first :, and the second group catches the part between the first and second :, as per OP's request. Using \2 in the replacement then replaces the whole line with the second captured value, and the p flag prints the result for the 2nd line only.
A couple of solutions:
sed -n '2s/.*:\([0-9]*\):.*/\1/p' file
sed -n '2{s/^[^:]*://; s/:.*//p; q}' file
awk -F':' 'FNR==2{print $2}' file
All display 12345.
sed -n '2s/.*:\([0-9]*\):.*/\1/p' only displays the captured value thanks to -n and p option/flag. It matches a whole string capturing digits between two colons, \1 only keeps the capture.
The sed -n '2{s/^[^:]*://;s/:.*//p;q}' command removes everything from the start up to the first :, then everything from the next : to the end, and then quits (q), so a big file will be processed more quickly.
awk -F':' 'FNR==2{print $2}' splits the second line with a colon and fetches the second item.
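A small sketch comparing the sed and awk approaches on a two-line input. The first line (first:000:111) is invented for illustration, since the question only shows the second line; the function names are hypothetical too.

```shell
# Two-line sample; only the second line comes from the question.
sample=$(printf '%s\n' 'first:000:111' 'hi:12345:234')

# Print the 2nd colon-separated field of line 2 using a capture group.
second_field_sed() {
  sed -n '2s/\([^:]*\):\([^:]*\).*/\2/p'
}

# Same result with awk: split on ":" and print field 2 of line 2.
second_field_awk() {
  awk -F: 'FNR==2{print $2}'
}

printf '%s\n' "$sample" | second_field_sed
printf '%s\n' "$sample" | second_field_awk
```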

replace n occurrences of a character in a string from the end

I am struggling to come up with a solution to replace n occurrences of a character with another character in a string starting from the end of the string. For example, if I want to replace last 5 occurrences of "," with "|" in a string like
abc, def,,{"data":{"xyz":null,"uan":"5643df"},{"path":"/abc/def/xyz"}},546,453,,,
to get a result like
abc, def,,{"data":{"xyz":null,"uan":"5643df"},{"path":"/abc/def/xyz"}}|546|453|||
I have looked at multiple solutions which help you find the last occurrence, all occurrences, or 5 occurrences from the beginning, but nothing which helps me do it from the end of the string. Reversing the string, doing it from the beginning, and then reversing the string again is not an option because of the sheer size of the file.
With GNU sed. Replace the last comma and the rest of the row with a pipe and the rest of the row (s/,([^,]*)$/|\1/), five times:
echo 'a,b,c,d,e,f,g,h' | sed -r 's/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/;'
Output:
a,b,c|d|e|f|g|h
An awk version:
echo 'a,b,c,d,e,f,g,h' | awk -F, '{printf "%s",$1;for(i=2;i<=NF;i++) printf (NF-5<i?"|%s":",%s"),$i;print ""}'
a,b,c|d|e|f|g|h
It uses a loop to print each field, counting up to decide whether to put , or | in front of it. The number can be changed to get a different result.
Example for the last two fields:
echo 'a,b,c,d,e,f,g,h' | awk -F, '{printf "%s",$1;for(i=2;i<=NF;i++) printf (NF-2<i?"|%s":",%s"),$i;print ""}'
a,b,c,d,e,f|g|h
This might work for you (GNU sed):
sed -E '/(,[^,]*){5}$/{s//\n&/;h;y/,/|/;H;g;s/\n.*\n//}' file
Insert a newline just before the fifth comma from the end of a line, make a copy, replace all ,'s by |'s, append the current line to the copy and remove everything between the first and last newlines.
An alternative using GNU parallel and sed:
parallel -n0 -q echo 's/\(.*\),/\1|/' ::: {1..5} | sed -f - file
N.B. The first solution only amends a line if there are at least 5 commas, whereas the second solution amends a line regardless of how many commas there are.
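If GNU parallel is not available, the same generated script can be built with a plain shell loop. This is only a sketch: replace_last_n is a made-up name, and like the parallel version it changes however many commas exist when a line has fewer than n.

```shell
# Hypothetical helper: replace the last N commas on each line with "|".
# Usage: replace_last_n N < input
replace_last_n() {
  n=$1
  script=''
  i=0
  # Each command replaces the last remaining comma (.* is greedy).
  while [ "$i" -lt "$n" ]; do
    script="${script}s/\\(.*\\),/\\1|/;"
    i=$((i + 1))
  done
  sed "$script"
}

echo 'a,b,c,d,e,f,g,h' | replace_last_n 5
```

sed reads the input once, line by line, so nothing has to be reversed or held in memory beyond the current line.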

Problem to change an occurrence in a file with sed

I have a file with several lines :
OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
OTU0125 UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
I want to remove all OTUXXXX occurrences (there are always 4 numbers after the "OTU") which appear in the file. I used sed but it didn't work. The OTUXXXX always appear at the beginning of the lines.
sed 's/OTU[0-9]{4} //g' my_file.txt
I put a space after OTU[0-9]{4} because I want the UniRef90 IDs to be at the beginning of each line.
Edit :
sed -r 's/OTU[0-9]{4} //g' my_file.txt works. But I get another problem,
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
Some lines still begin with a white space. I tried sed 's/^ *//' my_file.txt and it does not work. I want the second line of my file to start like the two other lines, without any space.
You may use
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b[[:space:]]*//g' file > newfile
Or, if the matches can be found anywhere, not only at the string start:
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b//g' file | sed 's/[[:space:]]*$//' > newfile
The whitespaces after the OTU<digits> won't get matched with the second snippet, so a piped sed command is necessary.
Details
[[:space:]]* - 0+ whitespace chars
\b a word boundary
OTU[0-9]{4,} - OTU and 4 or more digits
\b - a word boundary
[[:space:]]* - 0+ whitespace chars.
There is no explanation for your posted actual output given your posted input and the command you ran. But if you want to match on 4 or more digits, and the space after the OTU* strings could be a tab or some other white space that's not a blank char, then this is what you need, using GNU or OSX/BSD sed for -E:
$ sed -E 's/(OTU[0-9]{4,}[[:space:]]+)+//' file
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
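For what it's worth, the very first attempt in the question fails because without -r or -E sed uses basic regex syntax, where the interval has to be written \{4\}. Below is a sketch of the same idea in both syntaxes; the function names are made up, and a leading ^[[:space:]]* is added on the assumption that some lines start with stray whitespace.

```shell
# Basic regex (no -E): intervals and groups need backslashes.
strip_otu_bre() {
  sed 's/^[[:space:]]*\(OTU[0-9]\{4,\}[[:space:]]\{1,\}\)\{1,\}//'
}

# Extended regex (-E): the same pattern without the backslashes.
strip_otu_ere() {
  sed -E 's/^[[:space:]]*(OTU[0-9]{4,}[[:space:]]+)+//'
}

printf '%s\n' 'OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3' | strip_otu_bre
printf '%s\n' '  OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2' | strip_otu_ere
```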

how to trim trailing spaces after all delimiter in a text file

I need help removing the spaces after every delimiter in a text file.
I have a text file with the data below.
eg.
ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
I want to remove the spaces between the delimiter and the first letter of the word.
Is there any regex or Unix script that can do this? I am looking for output like the below:
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU||||||2013-09-19 14:48:49.609000|
Any help will be appreciated.
awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)} 1' file
Using a perl one-liner to remove the spacing around every field. Assumes no embedded delimiters:
perl -i -lpe 's/\s*([^|]*?)\s*(?=\||$)/$1/g' file.txt
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-l: Enable line ending processing
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
The Perl code below removes the spaces at the start of a line and the spaces right after the delimiter |:
$ perl -pe 's/(?<=\|) +|^ +//g' file
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
To save the changes made to that file,
perl -i -pe 's/(?<=\|) +|^ +//g' file
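sed 's/\ //g' input.txt > output.txt
Note that this removes every space in the file, including the ones inside fields such as 169 Park lane.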
With sed:
sed -r -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//' filename
In the first expression:
(^|\|) matches the beginning of the line or a | character, and saves this in capture group 1.
\s+ matches a sequence of whitespace characters after that.
The replacement \1 substitutes capture group 1, so this deletes the whitespace at the beginning of the line and after the delimiter.
The g modifier makes it operate on all the matches in the line.
In the second expression:
\s+ again matches a sequence of whitespace
$ matches the end of the line
The replacement replaces the whole thing with an empty string, thus removing trailing spaces.
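A quick sketch of the two expressions at work on lines modelled on the question's data. It assumes GNU sed, since \s is a GNU extension, and trim_fields is a made-up name.

```shell
# Hypothetical helper: drop whitespace at line start, after each "|",
# and at line end; spaces inside a field are left alone.
trim_fields() {
  sed -E -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//'
}

printf '%s\n' \
  'ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD' \
  ' 885637959852960985.0| 76.0|||169 Park lane||  ' | trim_fields
```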
For POSIX sed (for GNU sed add --posix):
sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g' YourFile
Use 2 substitutions (there is no OR (|) in the POSIX version of sed regex):
Remove leading whitespace by replacing any whitespace at the start (^[[:space:]]*) with nothing.
Replace any sequence of a pipe followed by whitespace (|[[:space:]]*) with a pipe.
[[:space:]] could be replaced by a single space char if the text only contains space (ASCII 32) chars.