Replace string that contains CRLF? - regex

I'm reformatting a file, and I want to perform the following steps:
Replace double CRLF's with a temporary character sequence ($CRLF$ or something)
Remove all CRLF's in the whole file
Go back and replace the double CRLF's.
So input like this:
This is a paragraph
of text that has
been manually fitted
into a certain colum
width.
This is another
paragraph of text
that is the same.
Will become
This is a paragraph of text that has been manually fitted into a certain colum width.
This is another paragraph of text that is the same.
It seems this should be possible by piping the input through a few simple sed programs, but I'm not sure how to refer to CRLF in sed (to use in sed 's/<CRLF><CRLF>/$CRLF$/'). Or maybe there's a better way of doing this?

You can use sed to append a <CRLF> marker to the end of every line:
sed 's/$/<CRLF>/'
then remove all \r\n with tr
| tr -d "\r\n"
and then replace double CRLF's with \n
| sed 's/<CRLF><CRLF>/\n/g'
and finally remove the leftover <CRLF> markers.
There was a one-liner sed script that did all this in a single pass, but I can't seem to find it now.

Try the below:
cat file.txt | sed 's/$/ /;s/^ *$/CRLF/' | tr -d '\r\n' | sed 's/CRLF/\r\n/g'
That's not quite the method you've given; what this does is the below:
Add a space to the end of each line.
Replace any line that contains only whitespace (ie blank lines) with "CRLF".
Delete any line-breaking characters (both CR and LF).
Replace every occurrence of the string "CRLF" with a Windows-style line break.
This works on Cygwin bash for me.

Redefine the Problem
It looks like what you're really trying to do is reflow your paragraphs and single-space your lines. There are a number of ways you can do this.
A Non-Sed Solution
If you don't mind using some packages outside coreutils, you could use some additional shell utilities to make this as easy as:
dos2unix /tmp/foo
fmt -w0 /tmp/foo | cat --squeeze-blank | sponge /tmp/foo
unix2dos /tmp/foo
Sponge is from the moreutils package, and lets you write to the same file you're reading. The dos2unix (or alternatively the tofrodos) package lets you convert your line endings back and forth for easier integration with tools that expect Unix-style line endings.
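As a further sketch (not part of the answer above), the same reflow can be done with tr and awk's paragraph mode. The function name reflow is hypothetical, and the sketch assumes paragraphs are separated by lines that are truly empty once the carriage returns are stripped. It emits LF line endings, so a conversion step would still be needed to get back to CRLF.

```shell
# Hypothetical helper: join hard-wrapped lines, one output line per paragraph.
reflow() {
    # Strip carriage returns so blank lines are really empty,
    # then let awk read blank-line-separated paragraphs (RS = "").
    tr -d '\r' |
    awk 'BEGIN { RS = "" }
         { gsub(/\n/, " "); print }'
}

printf 'This is a\r\nparagraph.\r\n\r\nThis is\r\nanother.\r\n' | reflow
```

Unlike the fmt pipeline, this prints each paragraph on its own line with no blank line in between, which matches the output shown in the question.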

This might work for you (GNU sed):
sed ':a;$!{N;/\n$/{p;d};s/\r\?\n/ /;ba}' file

Am I missing why this is not easier?
Add CRLF (GNU sed; \r in the replacement is a GNU extension):
sed -e 's/[[:space:]]*$/\r/' < index.html > index_CRLF.html
remove CRLF... go unix:
sed -e 's/[[:space:]]*$//' < index_CRLF.html > index.html

sed -i '/$(command 1)/$(command 2)/' myHtmlFile ? Inline editing with sed and awk

I'm writing a shell script that builds and edits an html file whose main content is basically clamscan's (ClamAV) output.
So, the script's mission is : translating the output, removing unhelpful stuff, adding html tags and so on.
Though, i'm stuck with the last modification i want.
One part of the edited output from clamscan looks like this :
/path/to/infected-file: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
/path/to/infected-zipfile!(1)ZIP:eicar.com: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
/path/to/infected-zipfilewithinzipfile!ZIP:eicar_com.zip!(2)ZIP:eicar.com: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
I want to shrink those long lines. Something like this would be the best :
infected-file: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfilewithinzipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
But i'd already be happy to just remove the path to the infected file.
Since it seemed easy to get some results with awk and i used sed for all previous editing, I thought my best option was going with something like :
sed -i 's/<awk command 1>/<awk command 2>/' myHtmlFile
Unfortunately i spent a few hours turning this in various way with no luck.
There seems to be syntax issues with things like :
sed "s#$(awk -F': ' '{print $1}' testfile)#$(awk -F': ' '{print $1}' testfile | awk -F'\' '{print $NF}')#" testfile
whether i use single or double quotes, whether i try to concatenate sed strings or try to escape various chars depending on the chosen syntax.
I also thought about loops (so that I could make sed work with vars containing awk results) but I'm unsure how to manage a loop for this kind of inline editing.
It could probably be solved with a powerful regex, but i'm quite bad at it ^^
$ sed -E 's#[^:]+/([^:!]+).*: #\1: #' file
infected-file: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
infected-zipfilewithinzipfile: Eicar-Test-Signature<span class="mep-subhead-warning"> FOUND</span>
The above regexp does this:
[^:]+/ - consumes a leading string that contains no colons and ends in /, e.g. /path/to/
([^:!]+) - saves the subsequent string that contains no colons or exclamation marks in a capture group, e.g. infected-zipfile
.*: - consumes the subsequent string leading up to a colon followed by a blank char, e.g. !(1)ZIP:eicar.com:.
and then the replacement does this:
\1 - print the string saved into capture group 1 in step 2 above
: - print a colon followed by a blank char (I could have used a capture group for this too)
Ed Morton has already explained how to do this with a single regex substitution (i.e. the right way); I'll explain what's wrong with the original approach, and show how to do it with a shell loop (i.e. the wrong way).
The problem with the combined sed+awk+awk approach is that you need them to operate line-by-line in lockstep. That is, when sed processes line N of the file, it needs to replace the Nth line of output from the first awk command with the Nth line of output from the second awk pipeline. But the commands don't interrelate that way; the shell runs all of the awk commands, collects their entire output, then feeds that to sed as a single huge (and malformed) substitute expression. Given your sample data (and assuming the last awk command should have -F'/' instead of -F'\'), it would do essentially this:
sed 's#/path/to/infected-file
/path/to/infected-zipfile!(1)ZIP:eicar.com
/path/to/infected-zipfilewithinzipfile!ZIP:eicar_com.zip!(2)ZIP:eicar.com#infected-file
infected-zipfile!(1)ZIP:eicar.com
infected-zipfilewithinzipfile!ZIP:eicar_com.zip!(2)ZIP:eicar.com#' testfile
sed will reject this because of the newlines in the pattern (and replacement string as well). If it weren't for it being rejected, sed would go ahead and try to apply the whole mess to every line separately, but since it's not actually what you wanted that wouldn't work either.
In order to get all of those commands to work line-by-line in lockstep, you'd have to use a shell loop to read each line and run it through each of the commands (and pipelines) individually, like this:
while read -r line; do
    fullpath=$(echo "$line" | awk -F': ' '{print $1}')
    trimmedpath=$(echo "$line" | awk -F': ' '{print $1}' | awk -F'/' '{print $NF}')
    echo "$line" | sed "s#$fullpath#$trimmedpath#"
done < testfile
You could actually skip the fullpath and trimmedpath variables, and use the two $(echo "$line" | awk...) substitutions directly in the sed command if you wanted. But really, you shouldn't do it this way at all; use Ed's single-regex solution.
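For comparison, here is a sketch of the same per-line idea using only shell parameter expansion, so no awk or sed process is started per line. The function name shorten is hypothetical, and the sketch assumes every line has the form "path: rest", with archive members marked by a "!" as in the sample data.

```shell
# Hypothetical helper: reduce "/dir/name!member: rest" to "name: rest".
shorten() {
    while IFS= read -r line; do
        path=${line%%: *}   # everything before the first ": "
        rest=${line#*: }    # everything after the first ": "
        name=${path##*/}    # drop leading directories
        name=${name%%!*}    # drop any "!..." archive-member suffix
        printf '%s: %s\n' "$name" "$rest"
    done
}

printf '%s\n' '/path/to/infected-zipfile!(1)ZIP:eicar.com: Eicar-Test-Signature FOUND' | shorten
```

This is still slower than a single sed call on large inputs, since the shell loop handles one line at a time.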
This might work for you (GNU sed):
sed -r 's#^([^/]*/)*([[:alpha:]-]*)([^:]*:)* #\2: #' file
This removes any directories, keeps the file name and removes any superfluous characters up to a : followed by a space.

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially because that solution only works if I query pretty much exactly like the answer person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only a line starting with the word "maria" in it and not maria if it's not the first word for example.
I want to chop out a specific URL in a file, for example "cnn.com", but I also have a bunch of localhost addresses and 0.0.0.0 entries, and both have some lines with a single space in front. I also don't want to chop out subdomains like ads.cnn.com, so that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well except that I can't get it to whack out the cnn.com! My file is called raw.txt
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is a metacharacter in regex which means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says a line that starts with any number of repeating spaces ^[ ]* OR | starts with ^ which is then followed by your match for 127.0.0.1.
And then for stringing these together you can use the | OR operator inside of parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
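One possible reason the chained -e version in the question keeps cnn.com can be sketched as follows. It assumes raw.txt holds hosts-style lines such as "0.0.0.0 cnn.com" (the file's contents are not shown in the question, so this input is hypothetical). After the address is stripped, a blank remains in front of the name, and the blanks are only removed later by tr, so a pattern anchored at ^cnn never sees a match.

```shell
# Assumed hosts-style input line; the real contents of raw.txt are not shown.
line='0.0.0.0 cnn.com'

# Stripping the address leaves " cnn.com", so a pattern anchored at ^cnn cannot match.
kept=$(printf '%s\n' "$line" | sed -e 's/^0\.0\.0\.0//' -e '/^cnn\.com$/d')

# Allowing optional blanks before the name lets the delete fire.
gone=$(printf '%s\n' "$line" | sed -e 's/^0\.0\.0\.0//' -e '/^[[:blank:]]*cnn\.com$/d')

printf 'anchored: [%s]  blanks allowed: [%s]\n' "$kept" "$gone"
```

If that assumption holds, either strip the blanks before the delete or allow for them in the pattern, as the (^[ ]*|^) prefix above does.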
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does, see Is it possible to escape regex metacharacters reliably with sed. To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words then the approach is to list them all and create a hash table indexed by them then test for the first word of your line being an index of the hash table:
awk 'BEGIN{split("foo bar other word",tmp); for (i in tmp) badWords[tmp[i]]} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.

How can I replace multiple newlines(\n) between characters using sed?

I have this sample file
{
{
doSomething();
}
}
What I am trying to achieve is:
{
{
doSomething();
}
}
I tried doing this
sed -r -i -e 's/}\n+}/}\n}/g' file.txt
but to no avail.
I want to remove the newlines between the closing braces.
Note: I already read this How can I replace a newline (\n) using sed? and I am also aware of the N command but I can't create a working sed expression to achieve what I am trying to achieve.
Using sed:
$ sed 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file
{
{
doSomething();
}
}
If you want to change the file in place, use the -i switch:
sed -i 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file # GNU sed
sed -i '' -e 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file # BSD sed
How it works
H;1h;$!d;x;
This is a sed idiom which reads the whole file in to the pattern space. (If your file is huge, we would want to look at other solutions.)
s/}\n\n*}/}\n}/g
This is just your substitute command with one change: \n+ is changed to \n\n*. This is because + is not an active character under basic regular expressions. If we use extended regex, then braces, { and }, become active characters which has the potential to lead to other issues for this input.
Note that this removes the extra newlines only if the second closing brace, }, occurs at the beginning of a line.
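For a file too large to hold in the pattern space, a streaming alternative can be sketched in awk. This is an illustration rather than the answer's method: the function name squeeze_braces is hypothetical, and it assumes "blank" means a line containing only spaces, tabs or carriage returns. Blank lines are held back and dropped only when they sit between a line ending in } and a line starting with }.

```shell
# Hypothetical helper: drop blank lines that separate two closing braces.
squeeze_braces() {
    awk '
        # Hold blank lines back until the next non-blank line is known.
        /^[ \t\r]*$/ { held = held $0 "\n"; next }
        {
            # True when the held blanks sit between "}" and "}".
            between = (prev ~ /[}][ \t\r]*$/ && $0 ~ /^[ \t\r]*[}]/)
            if (!between) printf "%s", held
            held = ""
            prev = $0
            print
        }
        # Blank lines at the very end of the file are kept as they were.
        END { printf "%s", held }
    '
}

printf '{\n{\ndoSomething();\n}\n\n\n}\n' | squeeze_braces
```

Only one line plus the pending blank lines are kept in memory at a time, and blank lines elsewhere in the file pass through unchanged.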
Using awk:
awk -v RS='}' -v 'ORS=}' '!/{/{gsub(/\n+/,"\n")};1' file-name | head -n -1
It is not obvious exactly what you intend to do. However, if you want to work with multiline regexps you should be aware that, in its classical form, sed isn't intended to work on more than one line at a time (though some multiline-capable versions of sed may have appeared). You can use other tools like perl. What I usually do when I need it is the following workflow:
tr '\n' 'µ' < myfile.txt | sed -e 's/µ}µ/}/' | tr 'µ' '\n'
(because the µ character is available on my keyboard and is most unlikely to appear in the file I am working with). The first tr command makes one huge line from my file and the second tr command puts the newlines back at their original locations.

replace \n\t pattern in a file

ok I have a recordset that is pipe delimited
I am checking the number of delimiters on each line as they have started including | in the data (and we cannot change the incoming file)
while using a great awk to parse out the bad records into a bad file for processing we discovered that some data has a new line character (\n) (followed by a tab (\t) )
I have tried sed to replace \n\t with just \t but it always either changes the \n\t with \r\n or replaces all the \n (file is \r\n for line end)
yes, to answer some questions below...
files can be large, 200+ MB
the line feed is in the data spuriously (not every row, but enough to be a pain)
I have tried
sed ':a;N;$!ba;s/\n\t/\t/g' Clicks.txt >test2.txt
sed 's/\n\t/\t/g' Clicks.txt >test1.txt
sample record
12345|876|testdata\n
\t\t\t\tsome text|6209\r\n
would like
12345|876|testdata\t\t\t\tsome text|6209\r\n
please help!!!
NOTE must be in KSH (MKS KSH to be specific)
i don't care if it is sed or not.. just need to correct the issue...
several of the solutions below work on small data or do part of the job...
as an aside I have started playing with removing all line feeds and then replacing the carriage return with carriage return + line feed, but can't quite get that to work either
I have tried tr, but since it works on single characters it only solves part of the issue
tr -d '\n' test.txt
leave me with a \r ended file....
need to get it to \r\n (and no, neither dos2unix nor unix2dos exists on this system)
If the input file is small (and you therefore don't mind processing it twice), you can use
cat input.txt | tr -d "\n" | sed 's/\r/\r\n/g'
Edit:
As I should have known by now, you can avoid using cat almost everywhere.
I had reviewed my old answers in SO for UUOC, and carefully checked for a possible filename in the tr usage. As Ed pointed out in his comment, cat can be avoided here as well:
The command above can be improved by
tr -d "\n" < input.txt | sed 's/\r/\r\n/g'
It's unclear what you are trying to do but given this input file:
$ cat -v file
12345|876|testdata
some text|6209^M
Is this what you're trying to do:
$ gawk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/,"")} 1' file | cat -v
12345|876|testdata some text|6209^M
The above uses GNU awk for multi-char RS. Alternatively with any awk:
$ awk '{rec = rec $0} /\r$/{print rec; rec=""}' file | cat -v
12345|876|testdata some text|6209^M
The cat -vs above are just there to show where the \rs (^Ms) are.
Note that the solution below reads the input file as a whole into memory, which won't work for large files.
Generally, Ed Morton's awk solution is better.
Here's a POSIX-compliant sed solution:
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
Keys to making this POSIX-compliant:
POSIX sed doesn't recognize \t as an escape sequence, so a literal tab - via variable $tab, created with tab=$(printf '\t') - must be used in the script.
POSIX sed - or at least BSD sed - requires label names (such as :a and the a in ba above) - whether implied or explicit - to be terminated with an actual newline, or, alternatively, terminated implicitly by continuing the script in the next -e option, which is the approach chosen here.
-e ':a' -e '$!{N;ba' -e '}' is an established Sed idiom that simply "slurps" the entire input file (uses a loop to read all lines into its buffer first). This is the prerequisite for enabling subsequent string substitution across input lines.
Note how the option-argument for the last -e option is a double-quoted string so that the references to shell variable $tab are expanded to actual tabs before Sed sees them. By contrast, \n is the one escape sequence recognized by POSIX sed itself (in the regex part, not the replacement-string part).
Alternatively, if your shell supports ANSI C-quoted strings ($'...'), you can use them directly to produce the desired control characters:
sed -e ':a' -e '$!{N;ba' -e '}' -e $'s/\\n\t/\\t/g' Clicks.txt
Note how the option-argument for the last -e option is an ANSI C-quoted string, and how literal \n (which is the one escape sequence that is recognized by POSIX Sed) must then be represented as \\n. By contrast, $'...' expands \t to an actual tab before Sed sees it.
Thanks everyone for all your suggestions.. After looking at all the answers.. None quite did the trick... After some thought... I came up with
tr -d '\n' <Clicks.txt | tr '\r' '\n' | sed 's/\n/\r\n/g' >test.txt
Delete all newlines
translate all Carriage return to newline
Sed replace all newline with Carriage return line feed
This works in seconds on a 32mb file.
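The three steps above can be sketched as a filter. One change is an assumption on my part: the final step uses awk instead of sed, because a sed that reads input line by line never has the trailing newline in its pattern space, so s/\n/\r\n/g does not match with GNU sed (MKS sed may behave differently). The function name fix_records is hypothetical.

```shell
# Hypothetical helper: rejoin records broken by stray LFs, keep CRLF record ends.
fix_records() {
    tr -d '\n' |                      # 1. delete every line feed
    tr '\r' '\n' |                    # 2. turn each carriage return into a line feed
    awk '{ printf "%s\r\n", $0 }'     # 3. re-terminate each record with CR LF
}

printf '12345|876|testdata\n\t\t\t\tsome text|6209\r\n' | fix_records | od -c
```

The od -c at the end is only there to make the \t, \r and \n bytes visible. The sketch streams the data, so file size should not matter much.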

how to select lines containing several words using sed?

I am learning using sed in unix.
I have a file with many lines and I wanna delete all lines except lines containing strings(e.g) alex, eva and tom.
I think I can use
sed '/alex|eva|tom/!d' filename
However I find it doesn't work, it cannot match the line. It just match "alex|eva|tom"...
Only
sed '/alex/!d' filename
works.
Anyone know how to select lines containing more than 1 words using sed?
plus, with parenthesis like "sed '/(alex)|(eva)|(tom)/!d' file" doesn't work, and I wanna the line containing all three words.
sed is an excellent tool for simple substitutions on a single line, for anything else just use awk:
awk '/alex/ && /eva/ && /tom/' file
delete all lines except lines containing strings(e.g) alex, eva and tom
As worded you're asking to preserve lines containing all those words but your samples preserve lines containing any. Just in case "all" wasn't a misspeak: regular expressions can't conveniently express any-order searches, but fortunately sed lets you run multiple matches:
sed -n '/alex/{/eva/{/tom/p}}'
or you could just delete them serially:
sed '/alex/!d; /eva/!d; /tom/!d'
The above works on GNU/anything systems, with BSD-based userlands you'll have to insert a bunch of newlines or pass them as separate expressions:
sed -n '/alex/ {
/eva/ {
/tom/ p
}
}'
or
sed -e '/alex/!d' -e '/eva/!d' -e '/tom/!d'
You can use:
sed -r '/alex|eva|tom/!d' filename
OR on Mac:
sed -E '/alex|eva|tom/!d' filename
Use -i.bak for inline editing so:
sed -i.bak -r '/alex|eva|tom/!d' filename
You should be using \| instead of |.
Edit: Looks like this is true for some variants of sed but not others.
This might work for you (GNU sed):
sed -nr '/alex/G;/eva/G;/tom/G;s/\n{3}//p' file
This method also allows a range of counts; e.g. if you want lines with 2 or more of the listed words, use:
sed -nr '/alex/G;/eva/G;/tom/G;s/\n{2,3}//p' file
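The same counting idea can be sketched in awk, which avoids the GNU-only -r flag. The function name match_at_least is hypothetical; it prints every line that contains at least the given number of the three words from the question.

```shell
# Hypothetical helper: print lines containing at least $1 of alex, eva, tom.
match_at_least() {
    awk -v min="$1" '{
        n = 0
        if (/alex/) n++
        if (/eva/)  n++
        if (/tom/)  n++
        if (n >= min) print
    }'
}

printf 'alex eva tom\nalex eva\ntom\nnobody\n' | match_at_least 2
```

Calling it with 3 keeps only lines containing all three words, and calling it with 1 behaves like the alex|eva|tom alternation.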