how to trim trailing spaces after all delimiter in a text file - regex

Need help to remove trailing spaces after all delimiter in a text file
I have Text file with below data.
eg.
ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
I want to remove spaces after the delimiter and the first letter of the word.
Any regex or unix script that can do the same. Looking for output as below:
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU||||||2013-09-19 14:48:49.609000|
Any help will be appreciated.

awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)} 1' file

Using a perl one-liner to remove the spacing around every field. Assumes no embedded delimiters:
perl -i -lpe 's/\s*([^|]*?)\s*/$1/g' file.txt
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-l: Enable line ending processing
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.

The below perl code would remove the spaces which are present at the start of a line or the spaces after to the delimiter | ,
$ perl -pe 's/(?<=\|) +|^ +//g' file
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
To save the changes made to that file,
perl -i -pe 's/(?<=\|) +|^ +//g' file

sed 's/\ //g' input.txt > output.txt

With sed:
sed -r -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//' filename
In the first expression:
(^|\|) matches the beginning of the line or a | character, and saves this in capture group 1.
\s+ matches a sequence of whitespace characters after that.
The replacement \1 substitutes capture group 1, so this deletes the whitespace at the beginning of the line and after the delimiter.
The g modifier makes it operate on all the matches in the line.
In the second expression:
\s+ again matches a sequence of whitespace
$ matches the end of the line
The replacement replaces the whole thing with an empty string, this removing trailing spaces.

for posix sed (for GNU sed add --posix)
sed 's/^[[:space:]]//;s/|[[:space:]]/|/g' YourFile
use 2 substitution (there are no OR (|) in sed regex posix version)
Remove starting space by replacing space at start( ^[[:space:]]*) by nothing
Replace any sequence pipe than any space (|[[:space:]]*) by pipe
[[:space:]] could be replace by a single space char if text only have space (ASCII 32) char

Related

Sed handle patterns over multiple lines

For example if I have a file like
yo#gmail.com, yo#
gmail.com yo#gmail
.com
And I want to replace the string yo#gmail.com.
If the file had the target string in a single line then we could've just used
sed "s/yo#gmail.com/e#email.com/g" file
So, is it possible for me to catch patters that are spread between multiple line without replacing the \n?
Something like this
e#email.com, e#
email.com e#email
.com
Thank you.
You can do this:
tr -d '\n' < file | sed 's/yo#gmail.com/e#email.com/g'
This might work for you (GNU sed):
sed -E 'N;s/yo#gmail\.com/e#email.com/g
h;s/(\S+)\n(\S+)/\1\2\n\1/;/yo#gmail\.com/!{g;P;D}
s//\ne#email.com/;:a;s/\n(\S)(\S+)\n\S/\1\n\2\n/;ta
s/(.*\n.*)\n/\1/;P;D' file
Append the following line to the pattern space.
Replace all occurrences of matching email address in both the first and second lines.
Make a copy of the pattern space.
Concatenate the last word of the first line with the first word of the second and keep the second line as is. If there is no match with the email address, revert the line, print/delete the first line and repeat.
Otherwise, replace the match and re-insert the newline as of the length of the first word of the second line (deleting the first word of the second line too).
Remove the newline used for scaffolding, print/delete the first line and repeat.
N.B. The lines will not be of the same length as the originals if the replacement string length does not match the matching string length. Also there has been no attempt to break the replacement string in the same relative split if the match and replacement strings are not the same length.
Alternative:
echo $'yo#gmail.com\ne#email.com' |
sed -E 'N;s#(.*)\n(.*)#s/\\n\1/\\n\2/g#
:a;\#\\n([^/])(.*)\\n(.)?(.*/g)#{s//\1\\n\2\3\\n\4/;H;ba}
x;s/.//;s#\\n/g$#/g#gm;s#\\n/#/#;s/\./\\./g' |
sed -e 'N' -f - -e 'P;D' file
or:
echo 's/yo#gmail.com/e#email.com/' |
sed -E 'h;s#/#/\\n#g;:a;H;s/\\n([^/])/\1\\n/g;ta;x;s/\\n$//mg;s/\./\\./g' |
sed -zf - /file
N.B. With the last alternative solution, the last sed invocation can be swapped for the first alternative solutions last sed invocation.

bash sed to remove all whitespace and blank lines from end of file

I found this to remove whitespace from the end of a script How to remove trailing whitespaces with sed? but it doesn't quite do what I was hoping. What I would like to do when I think of remove all white space is to remove also any empty lines - I think that this sed just removes spaces and tabs, but can it be expanded to also trim out any empty lines from the end of the file? Maybe it's not possible to do this with one line, and maybe there are better ways to achieve this, any options are great.
Also, am I right in thinking that this should replace the file in place with the changes? I'm just not sure that's happening in my testing.
sed -i 's/[ \t]*$//' ~/.bashrc
# -i is in place, [ \t] applies to any number of spaces and tabs before the end of the file "*$"
To remove all whitespace at the end of the file:
perl -0777 -pe 's{\s+\z}{}m' foo > bar
To change the file in-place:
perl -i.bak -0777 -pe 's{\s+\z}{}m' foo
To replace all whitespace at the end of the file with a single newline:
perl -0777 -pe 's{\s+\z}{\n}m' foo > bar
To change the file in-place:
perl -i.bak -0777 -pe 's{\s+\z}{\n}m' foo
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
-0777 : Slurp files whole.
\s+\z : one or more whitespace characters (including newline) at the end of the string (which happens to be the entire file).
The regex uses this modifier:
/m : Allow multiline matches.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
This might work for you (GNU sed):
sed ':a;/\S/!{$d;N;ba}' file
Append empty lines to the previous line.
If the empty line is the last, delete the current pattern space.
Otherwise print the pattern space.
To remove spaces from the end of all lines too:
sed ':a;/\S/!{$d;N;ba};s/ *$//mg' file
or:
sed 'H;$!d;x;s/.//;s/ *$//mg;s/\n*$//' file

Problem to change an occurence in a file with sed

I have a file with several lines :
OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
OTU0125 UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
I want to remove all OTUXXXX occurences (there are always 4 numbers after the "OTU") which appears in the file. I used sed but it didn't work. The OTUXXXX always appearat the beginning of the lines.
sed 's/OTU[0-9]{4} //g' my_file.txt
I put a space after OTU[0-9]{4} because I want the Uniref90 IDs are at the beginning of eacg line.
Edit :
sed -r 's/OTU[0-9]{4} //g' my_file.txt works. But I get another problem,
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
Some lines still begin with a white space. I tried sed 's/^ *//' my_file.txt and it does not work. I want the second line of my file starts like the two other lines, without any space.
You may use
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b[[:space:]]*//g' file > newfile
Or, if the matches can be found anywhere, not only at the string start:
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b//g' file | sed 's/[[:space:]]*$//' > newfile
The whitespaces after the OTU<digits> won't get matched with the second snippet, so a piped sed command is necessary.
See the online demo.
Details
[[:space:]]* - 0+ whitespace chars
\b a word boundary
OTU[0-9]{4,} - OTU and 4 or more digits
\b - a word boundary
[[:space:]]* - 0+ whitespace chars.
There is no explanation for your posted actual output given your posted input and the command you ran but if you want to match on 4 or more digits and the space after the OTU* strings could be a tab or some other white space that's not a blank char then this is what you need using GNU or OSX/BSD awk for -E:
$ sed -E 's/(OTU[0-9]{4,}[[:space:]]+)+//' file
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2

Remove new line only if after a number

I've collected some CSV data from terminal but every line is only 80 characters long so it's not importing properly.
Here's two lines of data:
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691
,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5
267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9
746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,1230
0,11970,12240,12867,13475,14310,15962,17624,19105,21075,
I wanna remove the newline char only if it's after any number or comma, but not if it's only on it's own, since that means it's a new line of CSV data.
I couldn't figure out how to do this on shell with sed. If any other program like awk or perl is better for this scenario then feel free to show me a solution for that.
Expected output:
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
Just remove the newline if it's preceded by a digit or comma:
perl -pe 'chomp if /[\d,]$/' input-file > output-file
-p reads the input line by line and prints the result
chomp removes newline if present at the end
\d matches a digit
$ matches the end of line
With awk by reading in paragraph mode and replacing all \n
$ awk -v RS= '{gsub("\n","")} 1' ip.txt
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
To leave the blanks, set ORS to double newlines, however this will add an extra newline at end
$ awk -v RS= -v ORS='\n\n' '{gsub("\n","")} 1' ip.txt
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
You can use this regex:
(?<!\n)\n(?!\n)
and replace with empty string.
perl -0pe 's/([\d,])\n([\d,])/$1$2/sg' (file)
should do it.
That is, read the file without line delimiters, treat the whole thing as one string and remove the newlines that are preceded and followed by a digit or comma.

How to add a line break before and after a regex in a text file?

This is an excerpt from the file I want to edit:
>chr1|-|9|S|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG >chr1|+|9|Y|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
I would a new text file in which I add a line break before ">" and after "somatic" or after "germline", how can I do in R or Unix?
Expected output:
>chr1|-|9|S|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
>chr1|+|9|Y|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
By the looks of your input, you could simply replace spaces with newlines:
tr -s ' ' '\n' <infile >outfile
(Some tr dialects don't like \n. Try '\012' or a literal newline: opening quote, newline, closing quote.)
If that won't work, you can easily do this in sed. If somatic is static, just hard-code it:
sed -e 's/somatic */&\n/g' -e 's/ >/\n>/g' file >newfile
The usual caveats about different sed dialects apply. Some versions don't like \n for newline, some want a newline or a semicolon instead of multiple -e arguments.
On Linux, you can modify the file in-place:
sed -i 's/somatic */&\
/g
s/ >/\
/g' file
(For variation, I'm showing how to do this if your sed doesn't recognize \n but allows literal newlines, and how to put the script in a single multi-line string.)
On *BSD (including MacOS) you need to add an argument to -i always; sed -i '' ...
If somatic is variable, but you always want to replace the first space after a wedge, try something like
sed 's/\(>[^ ]*\) /\1\n/g'
>[^ ] matches a wedge followed by zero or more non-space characters. The parentheses capture the matched string into \1. Again, some sed variants don't want backslashes in front of the parentheses, or are otherwise just ... different.
If you have very long lines, you might bump into a sed which has problems with that. Maybe try Perl instead. (Luckily, no dialects to worry about!)
perl -i -pe 's/(>[^ ]*) /$1\n/g;s/ >/\n>/g' file
(Skip the -i option if you don't want to modify the input file. Then output will be to standard output.)
(\bsomatic\b|\bgermline\b)|(?=>)
Try this.See demo.Replace by $1\n
http://regex101.com/r/tF5fT5/53
If there's no support for lookahead then try
(\bsomatic\b|\bgermline\b)
Try this.Replace by $1\n.See demo.
http://regex101.com/r/tF5fT5/50
and
(>)
Replace by \n$1.See demo.
http://regex101.com/r/tF5fT5/51
Thank you everyone!
I used:
tr -s ' ' '\n' <infile >outfile
as suggested by tripleee and it worked perfectly!