ok I have a recordset that is pipe delimited
I am checking the number of delimiters on each line as they have started including | in the data (and we cannot change the incoming file)
while using a great awk to parse out the bad records into a bad file for processing we discovered that some data has a new line character (\n) (followed by a tab (\t) )
I have tried sed to replace \n\t with just \t but it always either changes the \n\t with \r\n or replaces all the \n (file is \r\n for line end)
yes to answer some quetions below...
files can be large 200+ mb
the line feed is in the data spuriously (not every row.. but enought to be a pain)
I have tried
sed ':a;N;$!ba;s/\n\t/\t/g' Clicks.txt >test2.txt
sed 's/\n\t/\t/g' Clicks.txt >test1.txt
sample record
12345|876|testdata\n
\t\t\t\tsome text|6209\r\n
would like
12345|876|testdata\t\t\t\tsome text|6209\r\n
please help!!!
NOTE must be in KSH (MKS KSH to be specific)
i don't care if it is sed or not.. just need to correct the issue...
several of the solutions below woke on small data or do part of the job...
as an aside i have started playing with removing all linefeeds and then replacing the caraige return with carrige return linefeed.. but can't quite get that to work either
I have tried TR but since it is single char it only does part of the issue
tr -d '\n' test.txt
leave me with a \r ended file....
need to get it to \r\n (and no-no dos2unix or unix2dos exists on this system)
If the input file is small (and you therefore don't mind processing it twice), you can use
cat input.txt | tr -d "\n" | sed 's/\r/\r\n/g'
Edit:
As I should have known by now, you can avoid using cat about everywhere.
I had reviewed my old answers in SO for UUOC, and carefully checked for a possible filename in the tr usage. As Ed pointed out in his comment, cat can be avoided here as well:
The command above can be improved by
tr -d "\n" < input.txt | sed 's/\r/\r\n/g'
It's unclear what you are trying to do but given this input file:
$ cat -v file
12345|876|testdata
some text|6209^M
Is this what you're trying to do:
$ gawk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/,"")} 1' file | cat -v
12345|876|testdata some text|6209^M
The above uses GNU awk for multi-char RS. Alternatively with any awk:
$ awk '{rec = rec $0} /\r$/{print rec; rec=""}' file | cat -v
12345|876|testdata some text|6209^M
The cat -vs above are just there to show where the \rs (^Ms) are.
Note that the solution below reads the input file as a whole into memory, which won't work for large files.
Generally, Ed Morton's awk solution is better.
Here's a POSIX-compliant sed solution:
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
Keys to making this POSIX-compliant:
POSIX sed doesn't recognize \t as an escape sequence, so a literal tab - via variable $tab, created with tab=$(printf '\t') - must be used in the script.
POSIX sed - or at least BSD sed - requires label names (such as :a and the a in ba above) - whether implied or explicit - to be terminated with an actual newline, or, alternatively, terminated implicitly by continuing the script in the next -e option, which is the approach chosen here.
-e ':a' -e '$!{N;ba' -e '}' is an established Sed idiom that simply "slurps" the entire input file (uses a loop to read all lines into its buffer first). This is the prerequisite for enabling subsequent string substitution across input lines.
Note how the option-argument for the last -e option is a double-quoted string so that the references to shell variable $tab are expanded to actual tabs before Sed sees them. By contrast, \n is the one escape sequence recognized by POSIX sed itself (in the regex part, not the replacement-string part).
Alternatively, if your shell supports ANSI C-quoted strings ($'...'), you can use them directly to produce the desired control characters:
sed -e ':a' -e '$!{N;ba' -e '}' -e $'s/\\n\t/\\t/g' Clicks.txt
Note how the option-argument for the last -e option is an ANSI C-quoted string, and how literal \n (which is the one escape sequence that is recognized by POSIX Sed) must then be represented as \\n. By contrast, $'...' expands \t to an actual tab before Sed sees it.
Thanks everyone for all your suggestions.. After looking at all the answers.. None quite did the trick... After some thought... I came up with
tr -d '\n' <Clicks.txt | tr '\r' '\n' | sed 's/\n/\r\n/g' >test.txt
Delete all newlines
translate all Carriage return to newline
Sed replace all newline with Carriage return line feed
This works in seconds on a 32mb file.
Related
I need a sed regex command that will output every line in a while that ends with 'html', and does NOT start with 'a'.
Would my current code work?
sed 's/\[^a]\*\.\(html)\/p' text.txt
The sed command would be
sed -n '/^[^a].*html$/p'
But the canonical command to print matching lines is grep:
grep '^[^a].*html$'
Sed just over complicates things...you can use grep to handle that easily!
egrep "^[^a].+\.html$" -f sourcefile > text.txt
//loads file from within the program egrep
egrep "^[^a].+\.html$" < sourcefile > text.txt
//Converts stdin file descriptor with the input redirect
//to sourceFile for this stage of the` pipeline
are equivalent functionally.
or
pipe input | xargs -n1 egrep "^[^a].+\.html$" > text.txt
//xargs -n1 means take the stdin from the pipe and read it one line at a time in conjunction with the single command specified after any other xargs arguments
// ^ means from start of line,
//. means any one character
//+ means the previous matched expression(which can be a
//(pattern group)\1 or [r-ange], etc) one or more times
//\. means escape the single character match and match the period character
//$ means end of line(new line character)
//egrep is short for extended regular expression matches which are really
nice
(assuming you aren't using a pipe or cat, etc)
You can convert a newline delimited file into a single input line with this command:
cat file | tr -d '\n' ' '
//It converts all newlines into a space!
Anyway, get creative with simple utilities and you can do a lot:
xargs, grep, tr are a good combo that are easy to learn. Without the sedness of it all.
Don't do this with sed. Do it with two different calls to grep
grep -v ^a file.txt | grep 'html$'
The first grep gets all the lines that do not start with "a", and sends the output from that into the second grep, which pulls out all the lines that end with "html".
I have this sample file
{
{
doSomething();
}
}
What I am trying to achieve is:
{
{
doSomething();
}
}
I tried doing this
sed -r -i -e 's/}\n+}/}\n}/g' file.txt
but to no avail.
I want to remove the newlines between the closing parenthesis.
Note: I already read this How can I replace a newline (\n) using sed? and I am also aware of the N command but I can't create a working sed expression to achieve what I am trying to achieve.
Using sed:
$ sed 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file
{
{
doSomething();
}
}
If you want to change the file in place, use the -i switch:
sed -i 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file # GNU sed
sed -i '' -e 'H;1h;$!d;x; s/}\n\n*}/}\n}/g' file # BSD sed
How it works
H;1h;$!d;x;
This is a sed idiom which reads the whole file in to the pattern space. (If your file is huge, we would want to look at other solutions.)
s/}\n\n*}/}\n}/g
This is just your substitute command with one change: \n+ is changed to \n\n*. This is because + is not an active character under basic regular expressions. If we use extended regex, then braces, { and }, become active characters which has the potential to lead to other issues for this input.
Note that this removes the extra newlines only if the second closing brace, }, occurs at the beginning of a line.
Using awk:
awk -v RS='}' -v 'ORS=}' '!/{/{gsub(/\n+/,"\n")};1' file-name | head -n -1
It is not obvious what you exactly intend to do. However, if you want to work with multiline regexp you should be aware that in its classical form, sed isn't intended to work on more than one line at a time (maybe some multiline versions of sed have occured however). You can use other tools like perl. What I usually do when I need it is the following workflow:
tr '\n' 'µ' < myfile.txt | sed -e 's/µ}µ/}/' | tr 'µ' '\n'
(because the µ character is available on my keyboard and is most unlikely to appear in the file I am working with). The initial tr commands make one huge line from my file and the second tr command put the newline again at their initial locations.
I need to get X to Y in the file with multiple occurrences, each time it matches an occurrence it will save to a file.
Here is an example file (demo.txt):
\x00START how are you? END\x00
\x00START good thanks END\x00
sometimes random things\x00\x00 inbetween it (ignore this text)
\x00START thats nice END\x00
And now after running a command each file (/folder/demo1.txt, /folder/demo2.txt, etc) should have the contents between \x00START and END\x00 (\x00 is null) in addition to 'START' but not 'END'.
/folder/demo1.txt should say "START how are you? ", /folder/demo2.txt should say "START good thanks".
So basicly it should pipe "how are you?" and using 'echo' I can prepend the 'START'.
It's worth keeping in mind that I am dealing with a very large binary file.
I am currently using
sed -n -e '/\x00START/,/END\x00/ p' demo.txt > demo1.txt
but that's not working as expected (it's getting lines before the '\x00START' and doesn't stop at the first 'END\x00').
If you have GNU awk, try:
awk -v RS='\0START|END\0' '
length($0) {printf "START%s\n", $0 > ("folder/demo"++i".txt")}
' demo.txt
RS='\0START|END\0' defines a regular expression acting as the [input] Record Separator which breaks the input file into records by strings (byte sequences) between \0START and END\0 (\0 represents NUL (null char.) here).
Using a multi-character, regex-based record separate is NOT POSIX-compliant; GNU awk supports it (as does mawk in general, but seemingly not with NUL chars.).
Pattern length($0) ensures that the associated action ({...}) is only executed if the records is nonempty.
{printf "START%s\n", $0 > ("folder/demo"++i)} outputs each nonempty record preceded by "START", into file folder/demo{n}.txt", where {n} represent a sequence number starting with 1.
You can use grep for that:
grep -Po "START\s+\K.*?(?=END)" file
how are you?
good thanks
thats nice
Explanation:
-P To allow Perl regex
-o To extract only matched pattern
-K Positive lookbehind
(?=something) Positive lookahead
EDIT: To match \00 as START and END may appear in between:
echo -e '\00START hi how are you END\00' | grep -aPo '\00START\K.*?(?=END\00)'
hi how are you
EDIT2: The solution using grep would only match single line, for multi-line it's better use perl instead. The syntax will be very similar:
echo -e '\00START hi \n how\n are\n you END\00' | perl -ne 'BEGIN{undef $/ } /\A.*?\00START\K((.|\n)*?)(?=END)/gm; print $1'
hi
how
are
you
What's new here:
undef $/ Undefine INPUT separator $/ which defaults to '\n'
(.|\n)* Dot matches almost any character, but it does not match
\n so we need to add it here.
/gm Modifiers, g for global m for multi-line
I would translate the nulls into newlines so that grep can find your wanted text on a clean line by itself:
tr '\000' '\n' < yourfile.bin | grep "^START"
from there you can take it into sed as before.
I'm reformatting a file, and I want to perform the following steps:
Replace double CRLF's with a temporary character sequence ($CRLF$ or something)
Remove all CRLF's in the whole file
Go back and replace the double CRLF's.
So input like this:
This is a paragraph
of text that has
been manually fitted
into a certain colum
width.
This is another
paragraph of text
that is the same.
Will become
This is a paragraph of text that has been manually fitted into a certain colum width.
This is another paragraph of text that is the same.
It seems this should be possible by piping the input through a few simple sed programs, but I'm not sure how to refer to CRLF in sed (to use in sed 's/<CRLF><CRLF>/$CRLF$/'). Or maybe there's a better way of doing this?
You can use sed to decorate all rows with a {CRLF} at end:
sed 's/$/<CRLF>/'
then remove all \r\n with tr
| tr -d "\r\n"
and then replace double CRLF's with \n
| sed 's/<CRLF><CRLF>/\n/g'
and remove leftover CRLF's.
There was an one-liner sed which did all this in a single cycle, but I can't seem to find it now.
Try the below:
cat file.txt | sed 's/$/ /;s/^ *$/CRLF/' | tr -d '\r\n' | sed 's/CRLF/\r\n'/
That's not quite the method you've given; what this does is the below:
Add a space to the end of each line.
Replace any line that contains only whitespace (ie blank lines) with "CRLF".
Deletes any line-breaking characters (both CR and LF).
Replaces any occurrences of the string "CRLF" with a Windows-style line break.
This works on Cygwin bash for me.
Redefine the Problem
It looks like what you're really trying to do is reflow your paragraphs and single-space your lines. There are a number of ways you can do this.
A Non-Sed Solution
If you don't mind using some packages outside coreutils, you could use some additional shell utilities to make this as easy as:
dos2unix /tmp/foo
fmt -w0 /tmp/foo | cat --squeeze-blank | sponge /tmp/foo
unix2dos /tmp/foo
Sponge is from the moreutils package, and will allow you to write the same file you're reading. The dos2unix (or alternatively the tofrodos) package will allow to convert your line endings back and forth for easier integration with tools that expect Unix-style line endings.
This might work for you (GNU sed):
sed ':a;$!{N;/\n$/{p;d};s/\r\?\n/ /;ba}' file
Am I missing why this is not easier?
Add CRLF:
sed -e s/\s+$/$'\r\n'/ < index.html > index_CRLF.html
remove CRLF... go unix:
sed -e s/\s+$/$'\n'/ < index_CRLF.html > index.html
I am trying to vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.
For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or
a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
Perl Compatible Regexes
For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ and the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.
The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.
All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
option 1
echo $(cat testfile)
Option 2
tr ' ' '\n' < testfile
This should do the work:
sed -e 's/[ \t]+/\n/g'
[ \t] means a space OR an tab. If you want any kind of space, you could also use \s.
[ \t]+ means as many spaces OR tabs as you want (but at least one)
s/x/y/ means replace the pattern x by y (here \n is a new line)
The g at the end means that you have to repeat as many times it occurs in every line.
You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence
You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new
Using gawk:
gawk '{$1=$1}1' OFS="\n" file