remove sequence of line feeds and spaces in file with sed

I have a file which contains an undesired sequence of line feeds and spaces that I want to remove. The actual file is about 1 million rows; this is just to provide a reproducible example.
I can grep the offending lines like this:
grep -ciP "\n\n {6,}" problem.rpt
And it correctly returns
## 3
So I tried with sed to replace the string:
sed "s/\n\n {6,}//g" problem.rpt > prob2.rpt
but instead of deleting the sequence "\n\n {6,}" I now have "\r\n\r\n {6,}" (it introduced a CR before each LF, without removing it or the 6+ spaces).
I'm working with GNU sed and grep in a windows 8.1 cmd.
What am I doing wrong, and what's the right way to approach this job?

Does one of the following help you? Very likely the 2nd one is what you are looking for:
awk -v RS="\n\n {6,}" '7' problem.rpt
awk -v RS="\n\n {6,}" -v ORS="" '7' problem.rpt
I think you have gawk too, right?
I don't have windows to test for you....

From a list of sed one-liners I found one command that solved my problem:
sed -e :a -e "$!N; s/\n //;ta" -e "P;D" problem.rpt > prob2.rpt
Then, trying to decipher the command, this is what I found here (copied verbatim):
sed ':a;  $!N;  s/\n/string/;  ta'
     ---  ----  -------------  --
      |    |          |        |--> go back (`t`) to `a`
      |    |          |-----------> substitute newlines with `string`
      |    |----------------------> If this is not the last line (`$!`), append the
      |                             next line to the pattern space.
      |---------------------------> Create the label `a`.
I still don't know what the P;D part does, I'd appreciate if someone with the knowledge edits this answer to add it.
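For anyone puzzling over the same thing, here is a small sketch of what the trailing P;D does, run on made-up three-line input (the function name join_indented and the sample text are illustrative, not taken from the original file). P prints the pattern space up to the first embedded newline, and D deletes that same part and restarts the script on whatever is left without reading a new line. Single quotes are used because this assumes a POSIX shell rather than Windows cmd.

```shell
# Hypothetical helper: join any line that starts with a space onto the
# previous line. Reads stdin, writes stdout.
join_indented() {
    # :a      label to loop back to
    # $!N     unless on the last line, append the next line to the pattern space
    # s/\n // try to remove "newline + space" (i.e. join a continuation line)
    # ta      if that substitution succeeded, loop and try the next line too
    # P;D     print the first (finished) line, drop it, rerun on the remainder
    sed -e :a -e '$!N; s/\n //; ta' -e 'P;D'
}

# "a" is followed by an indented line, "c" is not.
printf 'a\n b\nc\n' | join_indented
# prints:
# ab
# c
```

Because only the finished line is printed and dropped each time, the pattern space never holds more than the lines currently being joined, which matters on a file with a million rows.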

Related

how to regex replace before colon?

this is my original string:
NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
I want to add a backslash before every space that comes before the ':'.
So this is what I want in the end:
NetworkManager/system\ connections/Wired\ 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
I need to do this in bash, so, sed, awk, grep are all ok for me.
I have tried the following sed commands, but none of them work:
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/ .*\(:.*$\)/\\ .*\1/g'
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/\( \).*\(:.*$\)/\\ \1.*\2/g'
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/ .*\(:.*$\)/\\ \1/g'
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/\( \).*\(:.*$\)/\\ \1\2/g'
Thanks for answering my question.
I am still new to Stack Overflow and don't know how to control the formatting in comments,
so I am editing my original question instead.
The real story is:
when I use grep or cscope to search for a keyword, for example "address1", under the /etc folder,
the result looks like this:
./NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
If I use vim to open the file under the cursor, with the cursor on the word "NetworkManager",
vim takes the file name to be
"./NetworkManager/system"
That's why I want to add "\" before each space, so the search result is more vim friendly :)
I did try changing cscope's source code, but it is very difficult to fully achieve this there, so I have to do a post-processing replacement :(

If you only want to do the replacements if there is a : present in the string, you can check if there are at least 2 columns, setting the (output)field separator to a colon.
Data:
$ cat file
NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
NetworkManager/system connections/Wired 1.nmconnection 14 address1=10.1.10.71/24,10.1.10.1
Example in awk:
awk 'BEGIN {FS=OFS=":"}{if(NF>1)gsub(" ","\\ ",$1)}1' file
Output
NetworkManager/system\ connections/Wired\ 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
NetworkManager/system connections/Wired 1.nmconnection 14 address1=10.1.10.71/24,10.1.10.1

This could be simply done in awk program, with your shown samples, please try following.
awk 'BEGIN{FS=OFS=":"} {gsub(/ /,"\\\\&",$1)} 1' Input_file
Explanation: set both the field separator and the output field separator to : for this program. Then, in the main block, use awk's gsub (global substitution) function to replace each space with a backslash followed by that space, in the 1st field only (per the OP's remarks, only the part before the : should change), and finally print the line.
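As a quick check of the field-separator idea, here is a minimal sketch on made-up input. The function name escape_before_colon and the sample lines are hypothetical; the NF > 1 guard is borrowed from the other awk answer so that lines without a colon pass through untouched.

```shell
# Hypothetical helper: backslash-escape spaces in the part before the
# first ":" only. Reads stdin, writes stdout.
escape_before_colon() {
    # "\\\\&" reaches gsub as \\& : a literal backslash, then the matched space.
    awk 'BEGIN { FS = OFS = ":" } NF > 1 { gsub(/ /, "\\\\&", $1) } 1'
}

printf '%s\n' 'a b:c d:e f' 'no colon here' | escape_before_colon
# prints:
# a\ b:c d:e f
# no colon here
```

Only $1 is modified, so spaces after the first colon (and any later colons) are left alone.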

An idea for a perl one-liner in bash that uses \G and \K (similar to @CarySwoveland's comment).
perl -pe 's/\G[^ :]*\K /\\ /g' myfile
See this demo at tio.run or a pattern demo at regex101.

This might work for you (GNU sed):
sed -E ':a;s/^([^: ]*) /\1\n/;ta;s/\n/\\ /g' file
Replace spaces before : by newlines then replace newlines by \ 's.
Alternative using the hold space:
sed -E 's/:/\n:/;h;s/ /\\ /g;G;s/\n.*\n//' file
Split the line on the first :.
Amend the front section, remove the middle and append the unadulterated back section.
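A small sketch of the hold-space variant on made-up input, assuming GNU sed (it relies on \n in the replacement). The function name and the /:/!b guard are my additions, not part of the answer: without a guard, a line that has no colon would come out twice (the escaped copy followed by the original), because the final s/\n.*\n// finds only one newline and does nothing.

```shell
# Hypothetical helper, GNU sed assumed. Reads stdin, writes stdout.
escape_with_hold_space() {
    # /:/!b        no colon on this line: skip the rest and print it as-is
    # s/:/\n:/     mark the first colon with a newline
    # h            save that marked copy in the hold space
    # s/ /\\ /g    escape every space in the pattern space
    # G            append the saved copy: escaped \n saved-front \n saved-back
    # s/\n.*\n//   drop escaped-back and saved-front, keeping escaped-front + back
    sed -E '/:/!b; s/:/\n:/; h; s/ /\\ /g; G; s/\n.*\n//'
}

printf '%s\n' 'a b:c d' 'e f' | escape_with_hold_space
# prints:
# a\ b:c d
# e f
```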

My answer is ugly and I think RavinderSingh13's answer is THE ONE, but I already took the time to write mine and it works (It's written step by step, but it's a one line command):
I got inspired by HatLess answer:
first get the text before the : with cut (I put the string in a file to make it easy to read, but this works on echo):
cut -d':' -f1 infile
Then replace spaces using sed:
cut -d':' -f1 infile | sed 's/\([a-z]\) /\1\\ /g'
Then echo the output with no new line:
echo -n "$(cut -d':' -f1 infile | sed -e 's/\([a-z]\) /\1\\ /g')"
Add the missing : and what comes after it:
echo -n "$(cut -d':' -f1 infile | sed -e 's/\([a-z]\) /\1\\ /g')" | cat - <(echo -n :) | cat - <(cut -d':' -f2- infile)

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially because that solution only works if I query pretty much exactly like the answer person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only a line starting with the word "maria" in it and not maria if it's not the first word for example.
I want to chop out a specific URL in a file, for example "cnn.com", but I also have a bunch of localhost addresses and 0.0.0.0 entries, and both have some with a single space in front. I also don't want to chop out subdomains like ads.cnn.com. So that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well, except that I can't get it to whack out the cnn.com! My file is called raw.txt.
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2

The . is a metacharacter in regex which means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
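The difference is easy to see on a made-up three-line sample (the sample variable and its contents are illustrative; \b is a GNU sed extension, matching the sed version named in the question):

```shell
# Illustrative sample, not the asker's raw.txt.
sample='cnn.com
cnnXcom
ads.cnn.com'

# Unescaped dot: "." matches any character, so "cnnXcom" is deleted too.
printf '%s\n' "$sample" | sed -E '/^cnn.com\b/d'
# prints:
# ads.cnn.com

# Escaped dot: only the literal "cnn.com" line is deleted.
printf '%s\n' "$sample" | sed -E '/^cnn\.com\b/d'
# prints:
# cnnXcom
# ads.cnn.com
```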
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says a line that starts with any number of repeating spaces ^[ ]* OR | starts with ^ which is then followed by your match for 127.0.0.1.
And then for stringing these together you can use the | OR operator inside of parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt

sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does, see Is it possible to escape regex metacharacters reliably with sed. To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words then the approach is to list them all and create a hash table indexed by them then test for the first word of your line being an index of the hash table:
awk 'BEGIN{split("foo bar other word",tmp); for (i in tmp) badWords[tmp[i]]} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
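A minimal sketch of the lookup-table idea on made-up input. awk's in operator tests array indices, not values, so the words have to be stored as keys; the helper name and the word list here are hypothetical.

```shell
# Hypothetical helper: drop lines whose first word is exactly "foo" or "bar".
drop_first_word_matches() {
    awk 'BEGIN {
             n = split("foo bar", tmp)       # tmp[1]="foo", tmp[2]="bar"
             for (i = 1; i <= n; i++)
                 bad[tmp[i]]                 # referencing creates the key
         }
         !($1 in bad)'
}

printf '%s\n' 'foo 1' 'food 2' 'bar 3' 'x foo' | drop_first_word_matches
# prints:
# food 2
# x foo
```

Because this is a string comparison on $1, "food" and a "foo" later in the line are both left alone, with no regex escaping needed.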

Sed : print all lines after match

I got my search results using sed:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match?
I'm using zcat so I cannot use awk.
Thanks.
Edited :
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern") then print all the lines to get all the information about each spam. The problem is there may be matches in INFO or id, so I use sed to get the text first.

Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
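The two range forms above behave like this on a made-up three-line input (a sketch; the input is illustrative):

```shell
# From the first matching line through the end of input.
printf 'a\nb\nc\n' | sed -n '/b/,$p'
# prints:
# b
# c

# Delete everything up to and including the first match.
printf 'a\nb\nc\n' | sed '1,/b/d'
# prints:
# c
```

One caveat with the second form: the end address of a range is only looked for from the line after the range starts, so if the pattern occurs on line 1 that occurrence is not treated as the end of the range. GNU sed accepts 0,/pattern/ for that case.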

You can use awk
awk '/pattern/,EOF'
n.b. don't be fooled: EOF is just an uninitialized variable, and by default 0 (false). So that condition cannot be satisfied until the end of file.
Perhaps this could be combined with all the previous answers using awk as well.
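To make the "EOF is just a false value" point concrete, here is a sketch with a literal 0 as the end of the range, plus an equivalent flag-based form, both on made-up input:

```shell
# Range whose end condition is never true: prints from the match to the end.
printf 'a\nb\nc\n' | awk '/b/,0'
# prints:
# b
# c

# Same result with an explicit flag instead of a range.
printf 'a\nb\nc\n' | awk '/b/ { found = 1 } found'
# prints:
# b
# c
```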

Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
Your question title says "all line after a match" so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace up through text= with nothing, and keep the rest. (I trust you can figure out how to change the surrounding script into zcat file* | sed '/pattern/s/.*text=//' ... oops, maybe my trust failed.)

The seldom used branch command will do this for you. Until you match, use n for next then branch to beginning. After match, use n to skip the matching line, then a loop copying the remaining lines.
cat file | sed -n -e ':start; /pattern/b match;n; b start; :match n; :copy; p; n ; b copy'

zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
Instead, change the last 2 segments of your pipeline so that it reads:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

replace \n\t pattern in a file

OK, I have a recordset that is pipe-delimited.
I am checking the number of delimiters on each line, as they have started including | in the data (and we cannot change the incoming file).
While using a great awk script to split the bad records out into a bad file for processing, we discovered that some data has a newline character (\n) followed by a tab (\t).
I have tried sed to replace \n\t with just \t, but it always either changes the \n\t to \r\n or replaces all the \n (the file uses \r\n for line endings).
To answer some questions below:
files can be large, 200+ MB
the line feed is in the data spuriously (not in every row, but often enough to be a pain)
I have tried
sed ':a;N;$!ba;s/\n\t/\t/g' Clicks.txt >test2.txt
sed 's/\n\t/\t/g' Clicks.txt >test1.txt
sample record
12345|876|testdata\n
\t\t\t\tsome text|6209\r\n
would like
12345|876|testdata\t\t\t\tsome text|6209\r\n
please help!!!
NOTE must be in KSH (MKS KSH to be specific)
i don't care if it is sed or not.. just need to correct the issue...
Several of the solutions below work on small data or do part of the job.
As an aside, I have started playing with removing all line feeds and then replacing each carriage return with carriage return + line feed, but can't quite get that to work either.
I have tried tr, but since it works on single characters it only does part of the job:
tr -d '\n' < test.txt
leaves me with a file that has only \r line endings.
I need to get it to \r\n (and no, neither dos2unix nor unix2dos exists on this system).

If the input file is small (and you therefore don't mind processing it twice), you can use
cat input.txt | tr -d "\n" | sed 's/\r/\r\n/g'
Edit:
As I should have known by now, you can avoid using cat about everywhere.
I had reviewed my old answers in SO for UUOC, and carefully checked for a possible filename in the tr usage. As Ed pointed out in his comment, cat can be avoided here as well:
The command above can be improved by
tr -d "\n" < input.txt | sed 's/\r/\r\n/g'

It's unclear what you are trying to do but given this input file:
$ cat -v file
12345|876|testdata
some text|6209^M
Is this what you're trying to do:
$ gawk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/,"")} 1' file | cat -v
12345|876|testdata some text|6209^M
The above uses GNU awk for multi-char RS. Alternatively with any awk:
$ awk '{rec = rec $0} /\r$/{print rec; rec=""}' file | cat -v
12345|876|testdata some text|6209^M
The cat -vs above are just there to show where the \rs (^Ms) are.
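To try the any-awk version without a test file, here is a sketch that feeds it a made-up record with real tabs and a second, unbroken record; od -c is only there to make the control characters visible. The helper name join_until_cr is hypothetical.

```shell
# Hypothetical helper: glue physical lines together until one ends in \r,
# then print the accumulated record. Reads stdin, writes stdout.
join_until_cr() {
    awk '{ rec = rec $0 } /\r$/ { print rec; rec = "" }'
}

printf '12345|876|testdata\n\t\tsome text|6209\r\nnext|row\r\n' |
    join_until_cr | od -c
```

This processes one line at a time, so memory use should stay small even for a 200+ MB file. Note that a trailing fragment with no closing \r would be dropped silently.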

Note that the solution below reads the input file as a whole into memory, which won't work for large files.
Generally, Ed Morton's awk solution is better.
Here's a POSIX-compliant sed solution:
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
Keys to making this POSIX-compliant:
POSIX sed doesn't recognize \t as an escape sequence, so a literal tab - via variable $tab, created with tab=$(printf '\t') - must be used in the script.
POSIX sed - or at least BSD sed - requires label names (such as :a and the a in ba above) - whether implied or explicit - to be terminated with an actual newline, or, alternatively, terminated implicitly by continuing the script in the next -e option, which is the approach chosen here.
-e ':a' -e '$!{N;ba' -e '}' is an established Sed idiom that simply "slurps" the entire input file (uses a loop to read all lines into its buffer first). This is the prerequisite for enabling subsequent string substitution across input lines.
Note how the option-argument for the last -e option is a double-quoted string so that the references to shell variable $tab are expanded to actual tabs before Sed sees them. By contrast, \n is the one escape sequence recognized by POSIX sed itself (in the regex part, not the replacement-string part).
Alternatively, if your shell supports ANSI C-quoted strings ($'...'), you can use them directly to produce the desired control characters:
sed -e ':a' -e '$!{N;ba' -e '}' -e $'s/\\n\t/\t/g' Clicks.txt
Note how the option-argument for the last -e option is an ANSI C-quoted string, and how literal \n (which is the one escape sequence that is recognized by POSIX Sed) must then be represented as \\n. By contrast, $'...' expands \t to an actual tab before Sed sees it.
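If reading the whole file into memory is a concern, a possible streaming variant keeps only the record currently being joined in the pattern space. This is a sketch rather than part of the answer above: it reuses the literal-tab variable and splits the script across -e options so the label and the branch each end with their option, and the helper name join_newline_tab is made up.

```shell
tab=$(printf '\t')

# Hypothetical helper: join a line onto the previous one when it starts
# with a tab. Reads stdin, writes stdout.
join_newline_tab() {
    # $!N    append the next line unless this is the last one
    # s      remove a newline that is directly followed by a tab
    # ta     if that worked, loop to pick up further continuation lines
    # P;D    print the finished line, drop it, rerun on the remainder
    sed -e ':a' -e '$!N' -e "s/\n${tab}/${tab}/" -e 'ta' -e 'P;D'
}

printf 'x|y|data\n\tmore\n\tagain|1\r\nz|2\r\n' | join_newline_tab | od -c
```

The \r\n line endings are left as they are; only a newline that is directly followed by a tab is removed.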

Thanks everyone for all your suggestions.. After looking at all the answers.. None quite did the trick... After some thought... I came up with
tr -d '\n' <Clicks.txt | tr '\r' '\n' | sed 's/\n/\r\n/g' >test.txt
Delete all newlines
translate all Carriage return to newline
Sed replace all newline with Carriage return line feed
This works in seconds on a 32mb file.
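A note for anyone reusing this on a Unix-like system: there sed works line by line and never sees the trailing newline, so s/\n/\r\n/g leaves the text unchanged and the output ends up with plain \n endings. The \r\n result reported above may come from text-mode output on the MKS/Windows side; that is an assumption on my part. A sketch that appends the carriage return explicitly instead (the helper name and sample data are made up):

```shell
cr=$(printf '\r')

# Hypothetical helper: same two tr steps as above, then add a CR at the
# end of every line so the output is \r\n terminated.
rejoin_crlf() {
    tr -d '\n' | tr '\r' '\n' | sed "s/\$/${cr}/"
}

printf 'a|b\n\tc|1\r\nd|2\r\n' | rejoin_crlf | od -c
```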

having a regex replacing across lines, retain the newlines?

I'd like to have a substitute or print style command with a regex working across lines. And lines retained.
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | tr -d '\n' | grep -o 'b.*f'
bcdef
or
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | tr -d '\n' | sed -r 's|b(.*)f|y\1z|'
aycdezg
i'd like to use grep or sed because i'd like to know what people would've done before awk or perl ..
would they not have? was .* not available? had they no other equivalent?
to possibly modify some input with a regex that spans across lines, and print it to stdout or output to a file, retaining the lines.

This should do what you're looking for:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;$s/b\([^f]*\)f/y\1z/;N;ba'
a
y
c
d
e
z
g
It accumulates all the lines then does the replacement. It looks for the first "f". If you want it to look for the last "f", change [^f] to . (a single dot).
Note that this may make use of features added to sed after AWK or Perl became available (AWK has been around a looong time).
Edit:
To do a multi-line grep requires only a little modification:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;$s/^[^b]*\(b[^f]*f\)[^f]*$/\1/;N;ba'
b
c
d
e
f

sed can match across newlines through the use of its N command. For example, the following sed command will replace bar followed by a newline followed by baz with ###:
$ echo -e "foo\nbar\nbaz\nqux" | sed 'N;s/bar\nbaz/###/;P;D'
foo
###
qux
The N command will append the next input line to the current pattern space separated by an embedded newline (\n)
The P command will print the current pattern space up to and including the first embedded newline.
The D command will delete up to and including the first embedded newline in the pattern space. It will also start next cycle but skip reading from the input if there is still data in the pattern space.
Through the use of these 3 commands, you can essentially do any sort of s command replacement looking across N-lines.
Edit
If your question is how can I remove the need for tr in the two examples above and just use sed then here you go:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;N;$!ba;s/\n//g;y/ag/yz/'
ybcdefz

Proven tools to the rescue.
echo -e "foo\nbar\nbaz\nqux" | perl -lpe 'BEGIN{$/=""}s/foo\nbar/###/'
