sed to replace blanks with newlines in OS X - regex

I want to replace blanks with newline characters in a file. Several things I tried from the answers to other questions here didn't work:
sed -e 's/\s\+/\n/g' file
sed -e 's/[[:blank:]]\+/\n/g' file
These both return the file as it is. I tried the following:
sed -e 's/[[:blank:]]/\n/g' file
which replaces the blanks with the letter n.
I assume the difference is due to the difference between gnu sed and the one in OS X. How can I achieve this in OS X?

The trick is to insert a literal newline (an actual line break, escaped with a backslash).
$ echo 'this will replace blanks with new lines' | sed 's/ /\
/g'

sed on OS X doesn't recognize \n in the replacement; you need to use a literal newline, and you have to escape it with a backslash to prevent it from ending the command. It also doesn't understand \s or \+, so use [[:blank:]]\{1,\} to match one or more blanks.
sed -e 's/[[:blank:]]\{1,\}/\
/g' file
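As a minimal sketch of the portable form, the function below wraps the escaped-newline substitution. The name blanks_to_newlines is made up for illustration. Since it avoids \n, \s and \+ entirely, it should behave the same with both BSD and GNU sed.

```shell
# Hypothetical helper: turn every run of blanks (spaces or tabs) into one newline.
# The backslash at the end of the first script line escapes the literal
# newline that forms the replacement text.
blanks_to_newlines() {
  sed 's/[[:blank:]]\{1,\}/\
/g'
}

printf 'this  will\treplace blanks\n' | blanks_to_newlines
```

Each word ends up on its own line, and the double space and the tab each produce a single newline.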

The tr command is easier/more-suitable IMO:
tr ' ' '\n' < "$FILE_PATH"
or:
echo 'this will replace blanks with new lines' | tr ' ' '\n'
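One difference from the sed version is worth showing: plain tr translates every single space, so consecutive blanks leave empty lines behind. A small sketch, using a made-up sample string, of how the -s (squeeze) option changes that:

```shell
# Hypothetical sample with two consecutive spaces between the first two words.
sample='one  two three'

# Plain translation: each space becomes its own newline, leaving an empty line.
plain=$(printf '%s\n' "$sample" | tr ' ' '\n')

# With -s, repeated newlines produced by the translation are squeezed into one.
squeezed=$(printf '%s\n' "$sample" | tr -s ' ' '\n')

printf '%s\n' "$plain" | awk 'END { print NR " lines without -s" }'
printf '%s\n' "$squeezed" | awk 'END { print NR " lines with -s" }'
```

Note that -s also squeezes blank lines that were already in the input, which may or may not be what you want.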

Related

Negate Single Quote In Character Class

I have the following code where I am trying to replace assign with always #(*) using SED.
Essentially I am trying to exclude the ' with a character class, but sed still seems to match it. I don't want sed to catch the line if it contains a ' (my regex is much more complicated than this, but I want sed to catch lines which match my regex and ignore the lines which match my regex but contain a ').
echo "assign sample_signal = '0;" | sed '/[^\x27]/ {s/assign/always #(*)/g}'
Result: always #(*) sample_signal = '0;
Try this please (GNU sed):
$ echo "assign sample_signal = '0;" | sed -n '/[\x27]/!{s/assign/always #(*)/g; p}'
$ echo "assign sample_signal = 0;" | sed -n '/[\x27]/!{s/assign/always #(*)/g; p}'
always #(*) sample_signal = 0;
Two mistakes you've made:
1. /[^\x27]/ matches any character that is not a ', and almost every line contains such a character, so the address matches anyway.
2. You didn't use -n, which suppresses the automatic output, so the line is printed whether or not it matched or was substituted.
So I changed it to /[\x27]/!{}, which means: when \x27 matches, do not execute the block {}.
(In sed's terms, the block is executed only when the address does not match.)
With the -n switch and p inside the block, lines containing ' are not printed.
Just use '\'' anywhere you need a ':
$ echo "f'oo" | sed 's/'\''o/X/'
fXo
$ echo "f'oo" | sed 's/[^'\'']o/X/'
f'X
You can enclose your sed command in double quotes and simply use /'/! to apply your command to lines not containing quotes:
echo "assign sample_signal = '0;" | sed "/'/! {s/assign/always #(*)/g;}"
If there is just one s command to apply, you can also omit the braces:
echo "assign sample_signal = '0;" | sed "/'/! s/assign/always #(*)/g"
As @EdMorton points out in the comments, enclosing the command in double quotes may have unwanted effects: you may need to escape dollar signs (\$) to avoid unwanted variable expansion in your pattern, and double any backslashes (\\).
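To make the difference between the answers concrete, here is a small sketch on a made-up two-line input. The variable names are only for illustration. The first variant leaves quoted lines untouched but still prints them; the second drops them, as the -n answer does.

```shell
# Hypothetical two-line input: one line containing a quote, one without.
input="assign a = '0;
assign b = 0;"

# Variant 1: print every line, but substitute only on lines without a quote.
keep_all=$(printf '%s\n' "$input" | sed "/'/!s/assign/always #(*)/")

# Variant 2: suppress default output (-n) and print only the unquoted lines.
drop_quoted=$(printf '%s\n' "$input" | sed -n "/'/!{s/assign/always #(*)/;p;}")

printf '%s\n' "--- keep_all ---" "$keep_all" "--- drop_quoted ---" "$drop_quoted"
```

In an interactive bash session the ! inside double quotes can trigger history expansion, so this form is safest inside a script.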

How can I append spaces to the end of specific lines using regex

I have a text file that has lines of different lengths. I need to make these uniform so that the PLSQL Developers text import function reads them correctly. Lines that are 89 characters long need to be padded with 4 spaces on the end. For some reason the -i argument to sed isn't accepted either.
The file can be found here
I have tried a number of different regex commands found from various sources through Google but none of them are working, either because the 'function cannot be parsed' or it doesn't add the spaces needed.
The code that I wrote that worked using Notepad++ was
Find: (^.{89})($)
Replace: \1    \2
I've tried a number of unix sed commands such as
sed -e "s/(^.{89})($)/\1 \2/" file.txt
sed -e "s/(^.{89})($)/\1\s\s\s\s\2/" file.txt
sed -e "s/(^.{89})($)/\1\ \ \ \ \2/" file.txt
sed -e "s/\(^.\{89\}\)\($\)/\1\ \ \ \ \2" file.txt
sed -e 's/\(^.\{89\}\)\($\)/\1[[:space:]]\2/g' file.txt
sed -e 's/\(^.\{89\}\)\($\)/\1[[:space:]]\{4\}\2/g' file.txt
sed -e 's/(^.{89})($)/\1[[:space:]]{4}\2/g' file.txt
The main issue here is that sed uses POSIX BRE syntax by default, where unescaped ( and ) are treated as literal parentheses and unescaped { and } are treated as literal braces.
This should work:
sed 's/^.\{89\}$/&    /' file.txt > newfile.txt
Here:
^ - matches the start of a line
.\{89\} - matches any 89 chars
$ - asserts the end of line position.
The & in the replacement refers to the whole match.
If you need to use -i option, see sed edit file in place.
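A quick way to check the substitution without the real file is to generate test lines of known length. This is only a sketch: the two lines are built with printf and are not from the asker's data.

```shell
# Build a hypothetical 89-character line and a shorter 10-character line.
line89=$(printf '%089d' 0)
line10=$(printf '%010d' 0)

# Only the line that is exactly 89 characters long gets four spaces appended.
padded=$(printf '%s\n%s\n' "$line89" "$line10" | sed 's/^.\{89\}$/&    /')

# Report the resulting length of each line.
lengths=$(printf '%s\n' "$padded" | awk '{ print length($0) }')
printf '%s\n' "$lengths"
```

This prints 93 for the first line and 10 for the second, since the shorter line does not match the anchored pattern.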
A software tools kludge for HP-UX, based on its manuals:
sed 's/.*/    /' file | paste -d ' ' file - | cut -c 1-93
How it works:
Use sed to create a dummy stream of the same number of lines as file, but with four spaces on each line.
Use paste to append the dummy stream to what's in file. (Because of the -d ' ' it adds five spaces, but it doesn't much matter.)
Use cut to chop off anything over 93 bytes.
If HP-UX sed is even less like GNU sed than I've supposed, it could be replaced with some equivalent like:
yes '    ' | head -n $(wc -l < file) | paste -d ' ' file - | cut -c 1-93
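For what it's worth, here is a sketch of the first pipeline run against a throwaway file, with the four-space dummy line described in the explanation. The temporary file and its contents are made up for the test.

```shell
# Hypothetical input file: one 89-character line and one 10-character line.
tmp=$(mktemp)
{ printf '%089d\n' 0; printf '%010d\n' 0; } > "$tmp"

# sed emits four spaces per input line, paste joins them on with one more
# space as the delimiter, and cut trims anything past column 93.
out_lengths=$(sed 's/.*/    /' "$tmp" | paste -d ' ' "$tmp" - | cut -c 1-93 | awk '{ print length($0) }')
rm -f "$tmp"

printf '%s\n' "$out_lengths"
```

The 89-character line comes out at 93 characters, while the 10-character line comes out at 15, so unlike the anchored sed substitution this approach also appends five spaces to shorter lines.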
Try this:
awk '{
    # Number of spaces needed to bring this line up to 93 characters (89 + 4).
    diff = 93 - length($0)
    for (i = 1; i <= diff; i++) {
        $0 = $0 " "
    }
    print $0
}' FileName.txt
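If padding every short line to the same width is acceptable, awk's printf can do the padding without a loop. A sketch, assuming a target width of 93 (89 plus the 4 spaces from the question); the function name pad_to_93 is made up:

```shell
# Hypothetical helper: left-justify each line in a 93-character field.
# Lines already 93 characters or longer pass through unchanged.
pad_to_93() {
  awk '{ printf "%-93s\n", $0 }'
}

printf '%089d\n%010d\n' 0 0 | pad_to_93 | awk '{ print length($0) }'
```

Both test lines come out 93 characters long, so this differs from the sed answer, which only touches lines that are exactly 89 characters.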

How to add a line break before and after a regex in a text file?

This is an excerpt from the file I want to edit:
>chr1|-|9|S|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG >chr1|+|9|Y|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
I would like a new text file in which a line break is added before ">" and after "somatic" or "germline". How can I do this in R or Unix?
Expected output:
>chr1|-|9|S|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
>chr1|+|9|Y|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
By the looks of your input, you could simply replace spaces with newlines:
tr -s ' ' '\n' <infile >outfile
(Some tr dialects don't like \n. Try '\012' or a literal newline: opening quote, newline, closing quote.)
If that won't work, you can easily do this in sed. If somatic is static, just hard-code it:
sed -e 's/somatic */&\n/g' -e 's/ >/\n>/g' file >newfile
The usual caveats about different sed dialects apply. Some versions don't like \n for newline, some want a newline or a semicolon instead of multiple -e arguments.
On Linux, you can modify the file in-place:
sed -i 's/somatic */&\
/g
s/ >/\
>/g' file
(For variation, I'm showing how to do this if your sed doesn't recognize \n but allows literal newlines, and how to put the script in a single multi-line string.)
On *BSD (including MacOS) you need to add an argument to -i always; sed -i '' ...
If somatic is variable, but you always want to replace the first space after a wedge, try something like
sed 's/\(>[^ ]*\) /\1\n/g'
>[^ ]* matches a wedge followed by zero or more non-space characters. The parentheses capture the matched string into \1. Again, some sed variants don't want backslashes in front of the parentheses, or are otherwise just ... different.
If you have very long lines, you might bump into a sed which has problems with that. Maybe try Perl instead. (Luckily, no dialects to worry about!)
perl -i -pe 's/(>[^ ]*) /$1\n/g;s/ >/\n>/g' file
(Skip the -i option if you don't want to modify the input file. Then output will be to standard output.)
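Putting the two substitutions together with literal newlines (so it does not depend on \n support in the replacement), a sketch might look like this. The function name split_records and the shortened sequences are made up for illustration.

```shell
# Hypothetical helper: break after each ">header" word and before each ">".
# Both replacements use an escaped literal newline instead of \n.
split_records() {
  sed 's/\(>[^ ]*\) /\1\
/g
s/ >/\
>/g'
}

printf '%s\n' '>chr1|-|9|S|somatic ACGT >chr1|+|9|Y|germline TTGA' | split_records
```

The output is four lines: each header followed by its sequence on the next line.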
(\bsomatic\b|\bgermline\b)|(?=>)
Try this. See demo. Replace by $1\n
http://regex101.com/r/tF5fT5/53
If there's no support for lookahead then try
(\bsomatic\b|\bgermline\b)
Try this. Replace by $1\n. See demo.
http://regex101.com/r/tF5fT5/50
and
(>)
Replace by \n$1. See demo.
http://regex101.com/r/tF5fT5/51
Thank you everyone!
I used:
tr -s ' ' '\n' <infile >outfile
as suggested by tripleee and it worked perfectly!

Replace newline with string in sed on mac

In a file, I need to replace all newlines (not the escape sequence '\n', but the actual newline) with a string. All the questions I've found on SO have been the other way around; i.e. replacing a string with a literal newline. This is on a Mac.
I've tried the following
sed -i '' 's/\
/STOP/g' file.txt
But it gives me an "unterminated substitute pattern" error.
Although this can also be done with sed, doing it with awk is much simpler:
awk -v ORS='STOP' '1' file
This changes output record separator to STOP instead of default \n.
Update: Here is a sed version to do same on OSX:
sed -i.bak -n -e 'H;${x;s/\n/STOP/g;p;}' file
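The two commands do not place STOP in quite the same positions, which is easy to see on a small made-up input. The third command is a variation that is not from the answer above: it uses h on the first line so the hold space does not begin with a newline.

```shell
# Hypothetical three-line input.
input=$(printf 'a\nb\nc')

# awk: every record, including the last one, is followed by ORS.
with_awk=$(printf '%s\n' "$input" | awk -v ORS='STOP' '1')

# sed as posted: H also runs on line 1, so the hold space starts with a newline.
with_sed=$(printf '%s\n' "$input" | sed -n -e 'H;${x;s/\n/STOP/g;p;}')

# Variation (an assumption, not part of the answer): copy line 1 with h instead.
with_sed_trimmed=$(printf '%s\n' "$input" | sed -n -e '1h;1!H;${x;s/\n/STOP/g;p;}')

printf '%s\n' "$with_awk" "$with_sed" "$with_sed_trimmed"
```

This prints aSTOPbSTOPcSTOP, then STOPaSTOPbSTOPc, then aSTOPbSTOPc, so pick whichever placement of the separator you actually need.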

How to remove invalid characters from an xml file using sed or Perl

I want to get rid of all invalid characters, for example the character with hexadecimal value 0x1A, from an XML file using sed.
What is the regex and the command line?
EDIT
Added Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT
These are the valid XML characters
x9 | xA | xD | [x20-xD7FF] | [xE000-xFFFD] | [x10000-x10FFFF]
Assuming UTF-8 XML documents:
perl -CSDA -pe'
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml
If you want to encode the bad bytes instead,
perl -CSDA -pe'
s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
"&#".ord($1).";"
/xeg;
' file.xml > file_fixed.xml
You can call it a few different ways:
perl -CSDA -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml # Inplace with backup
perl -CSDA -i -pe'...' file.xml # Inplace without backup
The tr command would be simpler. So, try something like:
tr -d '\032' < filename > newfilename
Note that ascii character '0x1a' has the octal value '032', so we use that instead with tr. Not sure if tr likes hex.
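The same idea can be stretched to all ASCII control characters that the list in the question excludes, that is everything below 0x20 except tab (x9), line feed (xA) and carriage return (xD). This is only a sketch: it works byte by byte, so it does not handle the non-ASCII exclusions (the surrogate range, xFFFE, xFFFF), and the sample string is made up.

```shell
# Octal ranges: \000-\010 (NUL..BS), \013 (VT), \014 (FF), \016-\037 (SO..US).
# Tab (\011), LF (\012) and CR (\015) are deliberately left out of the set.
strip_controls() {
  tr -d '\000-\010\013\014\016-\037'
}

# Hypothetical sample containing SOH (\001), a tab, and SUB (\032).
cleaned=$(printf 'a\001b\tc\032d\n' | strip_controls)
printf '%s\n' "$cleaned"
```

The tab survives and the two control characters are removed. Bytes of multi-byte UTF-8 sequences are all 0x80 or above, so they pass through untouched.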
Try:
perl -pi -e 's/[^\x9\xA\xD\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}]//g' file.xml
There is actually a way to do this with sed, like so:
cat input_file | LANG=C sed -E \
-e 's/.*/& /g' \
-e 's/(('\
'[\x9\xa\xd\x20-\x7f]|'\
'[\xc0-\xdf][\x80-\xbf]|'\
'[\xe0-\xec][\x80-\xbf][\x80-\xbf]|'\
'[\xed][\x80-\x9f][\x80-\xbf]|'\
'[\xee-\xef][\x80-\xbf][\x80-\xbf]|'\
'[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]'\
')*)./\1?/g' \
-e 's/(.*)\?/\1/g' \
-e 's|]]>|]]>]]<![CDATA[>|g' > output_file
This works in four steps:
1. Add a single whitespace character to the end of every line.
2. Replace every sequence of legal characters followed by any character with the same sequence of legal characters followed by a question mark character (instead of the arbitrary character). Note that in a line of only legal characters, the '.' matches the last character in the line, which is why we added a space in step 1.
3. Remove the last character in the line, which we expect to be a question mark.
4. Replace the string ']]>' with ']]>]]<![CDATA[>'.
The LANG=C env variable is set to prevent sed from doing charset conversion itself - it should treat every character as 8-bit ascii.