How can I replace multiple newlines(\n) between characters using sed? - regex

I have this sample file
{
{
doSomething();
}
}
What I am trying to achieve is:
{
{
doSomething();
}
}
I tried doing this
sed -r -i -e 's/}\n+}/}\n}/g' file.txt
but to no avail.
I want to remove the newlines between the closing parenthesis.
Note: I already read this How can I replace a newline (\n) using sed? and I am also aware of the N command but I can't create a working sed expression to achieve what I am trying to achieve.

Using sed:
$ sed 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file
{
{
doSomething();
}
}
If you want to change the file in place, use the -i switch:
sed -i 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file # GNU sed
sed -i '' -e 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file # BSD sed
How it works
H;1h;$!d;x;
This is a sed idiom which reads the whole file in to the pattern space. (If your file is huge, we would want to look at other solutions.)
s/}\n\n*}/}\n}/g
This is just your substitute command with one change: \n+ is changed to \n\n*. This is because + is not an active character under basic regular expressions. If we use extended regex, then braces, { and }, become active characters which has the potential to lead to other issues for this input.
Note that this removes the extra newlines only if the second closing brace, }, occurs at the beginning of a line.

Using awk:
awk -v RS='}' -v 'ORS=}' '!/{/{gsub(/\n+/,"\n")};1' file-name | head -n -1

It is not obvious what you exactly intend to do. However, if you want to work with multiline regexp you should be aware that in its classical form, sed isn't intended to work on more than one line at a time (maybe some multiline versions of sed have occured however). You can use other tools like perl. What I usually do when I need it is the following workflow:
tr '\n' 'µ' < myfile.txt | sed -e 's/µ}µ/}/' | tr 'µ' '\n'
(because the µ character is available on my keyboard and is most unlikely to appear in the file I am working with). The initial tr commands make one huge line from my file and the second tr command put the newline again at their initial locations.

Related

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially because that solution only works if I query pretty much exactly like the answer person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only a line starting with the word "maria" in it and not maria if it's not the first word for example.
I want to chop out a specific url in a file, example: "cnn.com" - but, I also have a bunch of local host addressses, 0.0.0.0 and both have some with a single space in front. I also don't want to chop out sub domains like ads.cnn.com so that code "should" work but doesn't when I string in more commands with the -e option. My code below seems to clean things up well except that I can't get it to whack out the cnn.com! My file is called raw.txt
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is metacharacter in regex which means "Match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says a line that starts with any number of repeating spaces ^[ ]* OR | starts with ^ which is then followed by your match for 127.0.0.1.
And then for stringing these together you can use the | OR operator inside of parantheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does, see Is it possible to escape regex metacharacters reliably with sed. To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words then the approach is to list them all and create a hash table indexed by them then test for the first word of your line being an index of the hash table:
awk 'BEGIN{split("foo bar other word",badWords)} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.

removing unmatched lines with SED

I'm trying to remove everything but 3 separate lines with specific matching pattern and leave just the 3 lines I want
Here is my code;
sed -n '/matching pattern/matching pattern/matching pattern/p' > file.txt
If you have multiple commands on the same line, you need to separate the commands by a ;:
sed -n '/matching pattern/p;/matching pattern2/p;/matching pattern3/p' file
Alternatively you can put them onto separate lines:
sed -n '/matching pattern/p
/matching pattern2/p
/matching pattern3/p' file
Beside that, you can also use regex alternation:
sed -rn '/(pattern|pattern2|pattern3)/p' file
or (better) use grep:
grep -E '(pattern|pattern2|pattern3)' file
However, this might get messy if the patterns getting longer and more complicated.
awk to the rescue!
awk '/pattern1/ || /pattern2/ || /pattern3/' filename
I think it's cleaner than alternatives.
Sed with Deletion
There's always more than one way to do this sort of thing, but one useful sed programming pattern is using alternation with deletion. For example:
# BSD sed
sed -E '/root|daemon|nobody/!d' /etc/passwd
# GNU sed
sed -r '/root|daemon|nobody/!d' /etc/passwd
This makes it possible to express ideas like "delete everything except for the listed terms." Even when expressions are functionally equivalent, it can be helpful to use a construct that most closely matches the idea you're trying to convey.
This might work for you (GNU sed):
sed '/pattern1/b;/pattern2/b;/pattern3/b;d' file
The normal flow of sed is to print what remains in the pattern space after processing. Therefore if the required pattern is in the pattern space let sed do its thing otherwise delete the line.
N.B. the b command is like a goto and if it has no following identifier, it means break out of any further sed commands and print (or not print if the -n option is in action) the contents of the pattern space.
If I understood you correctly:
sed -n '/\(pattern1\|pattern2\|pattern3\)/p' file > newfile

replace \n\t pattern in a file

ok I have a recordset that is pipe delimited
I am checking the number of delimiters on each line as they have started including | in the data (and we cannot change the incoming file)
while using a great awk to parse out the bad records into a bad file for processing we discovered that some data has a new line character (\n) (followed by a tab (\t) )
I have tried sed to replace \n\t with just \t but it always either changes the \n\t with \r\n or replaces all the \n (file is \r\n for line end)
yes to answer some quetions below...
files can be large 200+ mb
the line feed is in the data spuriously (not every row.. but enought to be a pain)
I have tried
sed ':a;N;$!ba;s/\n\t/\t/g' Clicks.txt >test2.txt
sed 's/\n\t/\t/g' Clicks.txt >test1.txt
sample record
12345|876|testdata\n
\t\t\t\tsome text|6209\r\n
would like
12345|876|testdata\t\t\t\tsome text|6209\r\n
please help!!!
NOTE must be in KSH (MKS KSH to be specific)
i don't care if it is sed or not.. just need to correct the issue...
several of the solutions below woke on small data or do part of the job...
as an aside i have started playing with removing all linefeeds and then replacing the caraige return with carrige return linefeed.. but can't quite get that to work either
I have tried TR but since it is single char it only does part of the issue
tr -d '\n' test.txt
leave me with a \r ended file....
need to get it to \r\n (and no-no dos2unix or unix2dos exists on this system)
If the input file is small (and you therefore don't mind processing it twice), you can use
cat input.txt | tr -d "\n" | sed 's/\r/\r\n/g'
Edit:
As I should have known by now, you can avoid using cat about everywhere.
I had reviewed my old answers in SO for UUOC, and carefully checked for a possible filename in the tr usage. As Ed pointed out in his comment, cat can be avoided here as well:
The command above can be improved by
tr -d "\n" < input.txt | sed 's/\r/\r\n/g'
It's unclear what you are trying to do but given this input file:
$ cat -v file
12345|876|testdata
some text|6209^M
Is this what you're trying to do:
$ gawk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/,"")} 1' file | cat -v
12345|876|testdata some text|6209^M
The above uses GNU awk for multi-char RS. Alternatively with any awk:
$ awk '{rec = rec $0} /\r$/{print rec; rec=""}' file | cat -v
12345|876|testdata some text|6209^M
The cat -vs above are just there to show where the \rs (^Ms) are.
Note that the solution below reads the input file as a whole into memory, which won't work for large files.
Generally, Ed Morton's awk solution is better.
Here's a POSIX-compliant sed solution:
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
Keys to making this POSIX-compliant:
POSIX sed doesn't recognize \t as an escape sequence, so a literal tab - via variable $tab, created with tab=$(printf '\t') - must be used in the script.
POSIX sed - or at least BSD sed - requires label names (such as :a and the a in ba above) - whether implied or explicit - to be terminated with an actual newline, or, alternatively, terminated implicitly by continuing the script in the next -e option, which is the approach chosen here.
-e ':a' -e '$!{N;ba' -e '}' is an established Sed idiom that simply "slurps" the entire input file (uses a loop to read all lines into its buffer first). This is the prerequisite for enabling subsequent string substitution across input lines.
Note how the option-argument for the last -e option is a double-quoted string so that the references to shell variable $tab are expanded to actual tabs before Sed sees them. By contrast, \n is the one escape sequence recognized by POSIX sed itself (in the regex part, not the replacement-string part).
Alternatively, if your shell supports ANSI C-quoted strings ($'...'), you can use them directly to produce the desired control characters:
sed -e ':a' -e '$!{N;ba' -e '}' -e $'s/\\n\t/\\t/g' Clicks.txt
Note how the option-argument for the last -e option is an ANSI C-quoted string, and how literal \n (which is the one escape sequence that is recognized by POSIX Sed) must then be represented as \\n. By contrast, $'...' expands \t to an actual tab before Sed sees it.
Thanks everyone for all your suggestions.. After looking at all the answers.. None quite did the trick... After some thought... I came up with
tr -d '\n' <Clicks.txt | tr '\r' '\n' | sed 's/\n/\r\n/g' >test.txt
Delete all newlines
translate all Carriage return to newline
Sed replace all newline with Carriage return line feed
This works in seconds on a 32mb file.

how to select lines containing several words using sed?

I am learning using sed in unix.
I have a file with many lines and I wanna delete all lines except lines containing strings(e.g) alex, eva and tom.
I think I can use
sed '/alex|eva|tom/!d' filename
However I find it doesn't work, it cannot match the line. It just match "alex|eva|tom"...
Only
sed '/alex/!d' filename
works.
Anyone know how to select lines containing more than 1 words using sed?
plus, with parenthesis like "sed '/(alex)|(eva)|(tom)/!d' file" doesn't work, and I wanna the line containing all three words.
sed is an excellent tool for simple substitutions on a single line, for anything else just use awk:
awk '/alex/ && /eva/ && /tom/' file
delete all lines except lines containing strings(e.g) alex, eva and tom
As worded you're asking to preserve lines containing all those words but your samples preserve lines containing any. Just in case "all" wasn't a misspeak: Regular expressions can't express any-order searches, fortunately sed lets you run multiple matches:
sed -n '/alex/{/eva/{/tom/p}}'
or you could just delete them serially:
sed '/alex/!d; /eva/!d; /tom/!d'
The above works on GNU/anything systems, with BSD-based userlands you'll have to insert a bunch of newlines or pass them as separate expressions:
sed -n '/alex/ {
/eva/ {
/tom/ p
}
}'
or
sed -e '/alex/!d' -e '/eva/!d' -e '/tom/!d'
You can use:
sed -r '/alex|eva|tom/!d' filename
OR on Mac:
sed -E '/alex|eva|tom/!d' filename
Use -i.bak for inline editing so:
sed -i.bak -r '/alex|eva|tom/!d' filename
You should be using \| instead of |.
Edit: Looks like this is true for some variants of sed but not others.
This might work for you (GNU sed):
sed -nr '/alex/G;/eva/G;/tom/G;s/\n{3}//p' file
This method would allow a range of values to be present i.e. you wanted 2 or more of the list then use:
sed -nr '/alex/G;/eva/G;/tom/G;s/\n{2,3}//p' file

Replace string that contains CRLF?

I'm reformatting a file, and I want to perform the following steps:
Replace double CRLF's with a temporary character sequence ($CRLF$ or something)
Remove all CRLF's in the whole file
Go back and replace the double CRLF's.
So input like this:
This is a paragraph
of text that has
been manually fitted
into a certain colum
width.
This is another
paragraph of text
that is the same.
Will become
This is a paragraph of text that has been manually fitted into a certain colum width.
This is another paragraph of text that is the same.
It seems this should be possible by piping the input through a few simple sed programs, but I'm not sure how to refer to CRLF in sed (to use in sed 's/<CRLF><CRLF>/$CRLF$/'). Or maybe there's a better way of doing this?
You can use sed to decorate all rows with a {CRLF} at end:
sed 's/$/<CRLF>/'
then remove all \r\n with tr
| tr -d "\r\n"
and then replace double CRLF's with \n
| sed 's/<CRLF><CRLF>/\n/g'
and remove leftover CRLF's.
There was an one-liner sed which did all this in a single cycle, but I can't seem to find it now.
Try the below:
cat file.txt | sed 's/$/ /;s/^ *$/CRLF/' | tr -d '\r\n' | sed 's/CRLF/\r\n'/
That's not quite the method you've given; what this does is the below:
Add a space to the end of each line.
Replace any line that contains only whitespace (ie blank lines) with "CRLF".
Deletes any line-breaking characters (both CR and LF).
Replaces any occurrences of the string "CRLF" with a Windows-style line break.
This works on Cygwin bash for me.
Redefine the Problem
It looks like what you're really trying to do is reflow your paragraphs and single-space your lines. There are a number of ways you can do this.
A Non-Sed Solution
If you don't mind using some packages outside coreutils, you could use some additional shell utilities to make this as easy as:
dos2unix /tmp/foo
fmt -w0 /tmp/foo | cat --squeeze-blank | sponge /tmp/foo
unix2dos /tmp/foo
Sponge is from the moreutils package, and will allow you to write the same file you're reading. The dos2unix (or alternatively the tofrodos) package will allow to convert your line endings back and forth for easier integration with tools that expect Unix-style line endings.
This might work for you (GNU sed):
sed ':a;$!{N;/\n$/{p;d};s/\r\?\n/ /;ba}' file
Am I missing why this is not easier?
Add CRLF:
sed -e s/\s+$/$'\r\n'/ < index.html > index_CRLF.html
remove CRLF... go unix:
sed -e s/\s+$/$'\n'/ < index_CRLF.html > index.html