file cleanup using sed and regex (remove some but not all newlines) - regex
I have a text file that I would like to load into Hive. It has line breaks within a string column, so it won't load properly. From what I found online, the file needs to be preprocessed and all those line breaks removed. I have tried many regexes so far, but to no avail.
This is the file:
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
the desired output should be this:
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
I could achieve this in Notepad++ by using this as a regex: (\r\n^(?!\/biz\/))+
However, when I run that regex with sed like so, it doesn't work:
sed -e 's/(\r\n^(?!\/biz\/))+//g' original.csv > clean.csv
As stated, sed doesn't support lookaround assertions such as (?!\/biz\/).
Since your input is essentially record-oriented, awk offers a convenient solution.
With GNU awk or Mawk (required to support multi-character input record separators):
awk -v RS='/biz/' '$1=$1 { print RS $0 }' file
RS='/biz/' splits the input into records by /biz/ (reserved variable RS is the input-record separator, \n by default).
$1=$1 looks like a no-op, but actually rebuilds the input record at hand ($0) by normalizing any record-internal runs of whitespace - including newlines - to a single space each, relying on awk's default field-splitting and output behavior.
Additionally, since $1=$1 serves as a pattern (conditional), the outcome of the assignment decides whether the associated action ({ ... }) is executed for the record at hand.
For an empty record - such as the implied one before the very first /biz/ - the assignment returns '', which in a Boolean context evaluates to false and therefore skips the associated block.
{ print RS $0 } prints the rebuilt input record, prefixed by the input record separator; print automatically appends the output record separator, ORS, which defaults to \n.
Note: Your code references \r\n, i.e., Windows-style CRLF line breaks. Since you're trying to use sed, I trust that the versions of the Unix utilities available to you on Windows transparently handle CRLF.
If you're actually on a Unix platform and only happen to be dealing with a Windows-originated file, a little more work is needed.
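One possible shape for that extra work, as a minimal sketch rather than a definitive solution: strip the trailing CR from every line, and treat any line that does not start with /biz/ as a continuation of the previous record. The function name join_records is hypothetical, and the sketch assumes that every record starts with /biz/ at the beginning of a line. It avoids a multi-character RS, so it should also run on awk implementations other than GNU awk and Mawk.

```shell
# Hypothetical helper (not from the answer above): reads the raw file on stdin.
join_records() {
    awk '
        { sub(/\r$/, "") }              # strip a trailing CR left over from CRLF
        /^\/biz\// {                    # a line starting with /biz/ opens a new record
            if (NR > 1) print rec       # flush the previous record first
            rec = $0
            next
        }
        { rec = rec " " $0 }            # continuation line: append after one space
        END { if (NR > 0) print rec }   # flush the last record
    '
}

# Demo on a CRLF-terminated sample modeled on the question's data.
printf '/biz/1-or-8;2.0;"more\r\ntext\r\nhere.";2016-10-18\r\n' | join_records
```

For the real file, this would be called as join_records < original.csv > clean.csv; note that the output uses plain LF line endings.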
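Maybe this can help you. Note that [^\/biz] is a bracket expression that matches any single character other than /, b, i, or z, so a continuation line that starts with one of those characters is not joined: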
sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}'
Test:
$ sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}' test
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
awk to the rescue! (with multi-char RS support)
$ awk -v RS='\n?^/' 'NF{$1=$1; print "/" $0}' file
or
$ awk -v RS='\n?^/' 'NF{$1="/"$1}NF' file
Create files
$ cat biz.awk
{ # read entire input to a string `f', joining lines with a single space
f = (NR == 1 ? "" : f " ") $0
}
END {
gsub(/ \/biz\//, "\n/biz/", f) # restore the newline before all but the
# first /biz/
print f
}
and
$ cat file
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
Usage:
awk -f biz.awk file
sed doesn't support lookarounds, perl does
$ perl -0777 -pe 's/(\n^(?!\/biz\/))+//mg' original.csv
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
The -0777 option slurps the entire file as a single string
The m modifier lets the ^ and $ anchors match at line boundaries within a multiline string
Note that line endings on Unix-like systems do not use \r, but if your input does have them, use \r\n as in the OP's regex.
Use a different delimiter to avoid having to escape /
perl -0777 -pe 's|(\n^(?!/biz/))+||mg' original.csv
Another way to do it is to delete all \n characters between a pair of double quotes
$ perl -0777 -pe 's|".*?"|$&=~s/\n//gr|gse' ip.txt
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
The s modifier allows .* to match across multiple lines, and the e modifier allows an expression to be used instead of a string in the replacement
$&=~s/\n//gr performs a substitution on the matched text ".*?" and returns the result
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk. With GNU awk for multi-char RS and RT:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
Related
Remove new line only if after a number
I've collected some CSV data from a terminal, but every line is only 80 characters long, so it's not importing properly. Here are two lines of data:
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691
,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5
267,4924,4581,4246,4025,3857,

3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9
746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,1230
0,11970,12240,12867,13475,14310,15962,17624,19105,21075,
I want to remove the newline character only if it comes after a number or comma, but not if it's on its own, since that means a new line of CSV data. I couldn't figure out how to do this in the shell with sed. If another program such as awk or perl is better for this scenario, feel free to show me a solution with that.
Expected output:
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,

3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
Just remove the newline if it's preceded by a digit or comma:
perl -pe 'chomp if /[\d,]$/' input-file > output-file
-p reads the input line by line and prints the result
chomp removes the newline if present at the end
\d matches a digit
$ matches the end of line
With awk, by reading in paragraph mode and removing all \n
$ awk -v RS= '{gsub("\n","")} 1' ip.txt
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
To keep the blank lines, set ORS to a double newline; however, this will add an extra newline at the end
$ awk -v RS= -v ORS='\n\n' '{gsub("\n","")} 1' ip.txt
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,

3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
You can use this regex: (?<!\n)\n(?!\n) and replace with empty string.
perl -0pe 's/([\d,])\n([\d,])/$1$2/sg' (file)
should do it. That is, read the file without line delimiters, treat the whole thing as one string, and remove the newlines that are preceded and followed by a digit or comma.
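The "drop the newline only after a digit or comma" rule can also be written with plain awk. This is a minimal sketch under the same assumption as the answers above (wrapped lines end in a digit or comma, and a blank line marks the record boundary); the function name join_wrapped is hypothetical.

```shell
join_wrapped() {
    # Print each line; append a newline only when the line does NOT end in a
    # digit or comma (in this data, that is the blank separator line).
    awk '{ printf "%s%s", $0, ($0 ~ /[0-9,]$/ ? "" : "\n") }'
}

# Demo on a shortened version of the data from the question.
printf '28,26166,25180\n,11971,5\n267,3857,\n\n3423,9\n746,21075,\n' | join_wrapped
```

As with the chomp one-liner, the blank line collapses into the single newline that separates the records, and the final record is written without a trailing newline.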
replace \n\t pattern in a file
OK, I have a recordset that is pipe-delimited. I am checking the number of delimiters on each line, as they have started including | in the data (and we cannot change the incoming file). While using a great awk script to parse the bad records out into a bad file for processing, we discovered that some data has a newline character (\n) followed by a tab (\t). I have tried sed to replace \n\t with just \t, but it always either changes the \n\t to \r\n or replaces all the \n (the file uses \r\n for line endings).
To answer some questions below: files can be large (200+ MB), and the line feed is in the data spuriously (not in every row, but often enough to be a pain).
I have tried
sed ':a;N;$!ba;s/\n\t/\t/g' Clicks.txt >test2.txt
sed 's/\n\t/\t/g' Clicks.txt >test1.txt
Sample record:
12345|876|testdata\n \t\t\t\tsome text|6209\r\n
Would like:
12345|876|testdata\t\t\t\tsome text|6209\r\n
Please help!
NOTE: this must be in KSH (MKS KSH to be specific). I don't care if it is sed or not; I just need to correct the issue. Several of the solutions below work on small data or do part of the job.
As an aside, I have started playing with removing all line feeds and then replacing each carriage return with carriage return + line feed, but I can't quite get that to work either. I have tried tr, but since it works on single characters it only solves part of the issue:
tr -d '\n' < test.txt
leaves me with a \r-terminated file; I need to get it to \r\n (and neither dos2unix nor unix2dos exists on this system).
If the input file is small (and you therefore don't mind processing it twice), you can use
cat input.txt | tr -d "\n" | sed 's/\r/\r\n/g'
Edit: As I should have known by now, you can avoid using cat almost everywhere. I had reviewed my old answers on SO for UUOC, and carefully checked for a possible filename in the tr usage. As Ed pointed out in his comment, cat can be avoided here as well. The command above can be improved to
tr -d "\n" < input.txt | sed 's/\r/\r\n/g'
It's unclear what you are trying to do, but given this input file:
$ cat -v file
12345|876|testdata
 some text|6209^M
Is this what you're trying to do:
$ gawk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/,"")} 1' file | cat -v
12345|876|testdata some text|6209^M
The above uses GNU awk for multi-char RS. Alternatively, with any awk:
$ awk '{rec = rec $0} /\r$/{print rec; rec=""}' file | cat -v
12345|876|testdata some text|6209^M
The cat -vs above are just there to show where the \rs (^Ms) are.
Note that the solution below reads the input file as a whole into memory, which won't work for large files. Generally, Ed Morton's awk solution is better.
Here's a POSIX-compliant sed solution:
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
Keys to making this POSIX-compliant:
POSIX sed doesn't recognize \t as an escape sequence, so a literal tab - via variable $tab, created with tab=$(printf '\t') - must be used in the script.
POSIX sed - or at least BSD sed - requires label names (such as :a and the a in ba above) - whether implied or explicit - to be terminated with an actual newline, or, alternatively, terminated implicitly by continuing the script in the next -e option, which is the approach chosen here.
-e ':a' -e '$!{N;ba' -e '}' is an established sed idiom that simply "slurps" the entire input file (it uses a loop to read all lines into its buffer first). This is the prerequisite for enabling subsequent string substitution across input lines.
Note how the option-argument for the last -e option is a double-quoted string, so that the references to shell variable $tab are expanded to actual tabs before sed sees them. By contrast, \n is the one escape sequence recognized by POSIX sed itself (in the regex part, not the replacement-string part).
Alternatively, if your shell supports ANSI C-quoted strings ($'...'), you can use them directly to produce the desired control characters:
sed -e ':a' -e '$!{N;ba' -e '}' -e $'s/\\n\t/\t/g' Clicks.txt
Note how the option-argument for the last -e option is an ANSI C-quoted string, and how the literal \n (which is the one escape sequence that is recognized by POSIX sed) must then be represented as \\n. By contrast, $'...' expands \t to an actual tab before sed sees it.
Thanks, everyone, for all your suggestions. After looking at all the answers, none quite did the trick. After some thought, I came up with
tr -d '\n' <Clicks.txt | tr '\r' '\n' | sed 's/\n/\r\n/g' >test.txt
Delete all newlines
Translate all carriage returns to newlines
Have sed replace all newlines with carriage return + line feed
This works in seconds on a 32 MB file.
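Whether the final sed step matches anything depends on the sed implementation: a line-oriented sed normally holds each line without its trailing newline, so s/\n/.../ may find nothing to replace. The following is a minimal sketch of the same three steps with awk doing the last one. It is an assumption-laden variant, not the poster's tested command: it assumes that every genuine record ends in \r\n and that \r never occurs inside the data, and the function name fix_line_ends is hypothetical.

```shell
fix_line_ends() {
    tr -d '\n' |                      # 1. delete every LF, spurious or not
    tr '\r' '\n' |                    # 2. each CR becomes a temporary line end
    awk '{ printf "%s\r\n", $0 }'     # 3. terminate every record with CR LF
}

# Usage with the file names from the question:
#   fix_line_ends < Clicks.txt > test.txt
```

All three stages stream their input, so the file never has to fit into memory at once.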
Trying to replace \r\n\n but not \r\n in a file
This is using GNU sed version 4.2.1, but I've also tried awk and Perl without any success so far.
I have a file that is produced by a COBOL program (on Linux), and it has what can be considered nonstandard CRLF instead of LF (CRLF of course being the Windows line terminator), but that's what I need to retain - anything CRLF stays. So \r\n sequences stay.
What I need to do is replace occasional \r\n\n sequences with \r\n\r\n without disturbing anything else. I have to match the file I produce, using diff, with the original file produced on BSD or SCO or something.
This doesn't work, and I expect the first \n is getting stripped by sed as the line terminator:
sed -e 's/\r\n\n/\r\n\r\n/g' infile > outfile
I tried hex 0x and also double escaping.
Thanks for any suggestions.
I suggest you just add a CR before any LF that isn't already preceded by one.
s/ (?<!\r) (?=\n) /\r/xg
In a program that alters the data in a file, it would look something like this:
use strict;
use warnings;
use open IO => ':raw';

my $data = do {
    local $/;
    <>;
};
$data =~ s/ (?<!\r) (?=\n) /\r/xg;
print $data;
and you would run it like
perl add_cr.pl myfile > newfile
or, if you wanted to modify your file in place (after testing it), you could use just
perl -i add_cr.pl myfile
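The same idea (add a CR before any LF that lacks one) can be sketched as a streaming awk filter that never loads the whole file. This is a minimal sketch, not a drop-in replacement for the Perl program: it assumes that the input ends with a newline and that a stray CR only ever appears directly before an LF. The function name add_missing_cr is hypothetical.

```shell
add_missing_cr() {
    # Strip an existing trailing CR (if any), then always write CR LF: lines that
    # already end in CRLF come out unchanged, and bare-LF lines gain the CR.
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}

# Usage with the file names from the question:
#   add_missing_cr < infile > outfile
```

With this filter a \r\n\n sequence comes out as \r\n\r\n, which is the change the question asks for.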
sed being a line-oriented tool, blah\r\n\n will be a line blah\r followed by an empty line. So, add a \r to any empty line:
sed 's/^$/\r/' infile > outfile
Just use this Perl one-liner:
perl -pe "s/\R/\r\n/g" <input.txt >output.txt
The magic here is \R, which matches any newline combination accepted by Perl: \n, \r\n, or \r alone. As far as I know, \R is Perl-only - not supported by sed or awk.
With GNU awk for multi-char RS:
awk -v RS='\r\n\n' -v ORS='\r\n\r\n' '1' file
Try the unix2dos utility: it handles all Unix, DOS, and mixed Unix/DOS cases. Note: dos2unix is also a good utility.
Overwrite:
unix2dos your-file
Create a new file:
unix2dos < your-file > your-new-file
how to trim trailing spaces after all delimiter in a text file
I need help removing the spaces after every delimiter in a text file. I have a text file with the data below, e.g.:
ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
I want to remove the spaces between the delimiter and the first letter of the word. Is there any regex or Unix script that can do this? I am looking for output as below:
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU||||||2013-09-19 14:48:49.609000|
Any help will be appreciated.
awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)} 1' file
Using a perl one-liner to remove the spacing around every field. Assumes no embedded delimiters:
perl -i -lpe 's/\s*([^|]*?)\s*/$1/g' file.txt
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-l: Enable line ending processing
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
The perl code below removes the spaces that are present at the start of a line or after the delimiter |:
$ perl -pe 's/(?<=\|) +|^ +//g' file
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
To save the changes made to that file:
perl -i -pe 's/(?<=\|) +|^ +//g' file
sed 's/\ //g' input.txt > output.txt
With sed:
sed -r -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//' filename
In the first expression:
(^|\|) matches the beginning of the line or a | character, and saves this in capture group 1.
\s+ matches a sequence of whitespace characters after that.
The replacement \1 substitutes capture group 1, so this deletes the whitespace at the beginning of the line and after the delimiter.
The g modifier makes it operate on all the matches in the line.
In the second expression:
\s+ again matches a sequence of whitespace
$ matches the end of the line
The replacement replaces the whole thing with an empty string, thus removing trailing spaces.
For POSIX sed (for GNU sed, add --posix):
sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g' YourFile
This uses two substitutions (there is no OR (|) in POSIX sed regexes):
Remove leading whitespace by replacing whitespace at the start (^[[:space:]]*) with nothing
Replace any pipe followed by whitespace (|[[:space:]]*) with a pipe
[[:space:]] could be replaced by a single space character if the text contains only spaces (ASCII 32)
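The leading, after-delimiter, and trailing cases can be combined into one sed call that uses only POSIX bracket classes. This is a minimal sketch, assuming that | never occurs inside a field; the function name trim_fields is hypothetical.

```shell
trim_fields() {
    # 1st expression: whitespace at the start of the line
    # 2nd expression: whitespace directly after every | delimiter
    # 3rd expression: whitespace at the end of the line
    sed -e 's/^[[:space:]]*//' \
        -e 's/|[[:space:]]*/|/g' \
        -e 's/[[:space:]]*$//'
}

# Demo on a shortened version of the sample from the question.
printf 'ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD|ADDR_LINE_ONE\n885637959852960985.0| 76.0|||169 Park lane||\n' | trim_fields
```

Spaces inside a field, such as those in "169 Park lane", are left alone because only whitespace adjacent to a delimiter or a line boundary is matched.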
pipe sed command to create multiple files
I need to get X to Y in a file with multiple occurrences; each time it matches an occurrence, it should save it to a file. Here is an example file (demo.txt):
\x00START how are you? END\x00
\x00START good thanks END\x00
sometimes random things\x00\x00 inbetween it (ignore this text)
\x00START thats nice END\x00
After running a command, each file (/folder/demo1.txt, /folder/demo2.txt, etc.) should have the contents between \x00START and END\x00 (\x00 is null), in addition to 'START' but not 'END'. /folder/demo1.txt should say "START how are you? ", /folder/demo2.txt should say "START good thanks". So basically it should pipe "how are you?", and using 'echo' I can prepend the 'START'.
It's worth keeping in mind that I am dealing with a very large binary file.
I am currently using
sed -n -e '/\x00START/,/END\x00/ p' demo.txt > demo1.txt
but that's not working as expected (it's getting lines before the '\x00START' and doesn't stop at the first 'END\x00').
If you have GNU awk, try:
awk -v RS='\0START|END\0' '
  length($0) {printf "START%s\n", $0 > ("folder/demo"++i".txt")}
' demo.txt
RS='\0START|END\0' defines a regular expression acting as the [input] Record Separator, which breaks the input file into records by strings (byte sequences) between \0START and END\0 (\0 represents NUL (null char.) here). Using a multi-character, regex-based record separator is NOT POSIX-compliant; GNU awk supports it (as does mawk in general, but seemingly not with NUL chars.).
Pattern length($0) ensures that the associated action ({...}) is only executed if the record is nonempty.
{printf "START%s\n", $0 > ("folder/demo"++i".txt")} outputs each nonempty record, preceded by "START", into file folder/demo{n}.txt, where {n} represents a sequence number starting with 1.
You can use grep for that:
grep -Po "START\s+\K.*?(?=END)" file
how are you?
good thanks
thats nice
Explanation:
-P To allow Perl regex
-o To extract only the matched pattern
\K Drops everything matched so far from the reported match (used here like a positive lookbehind)
(?=something) Positive lookahead
EDIT: To match \00 as well, since START and END may appear in between:
echo -e '\00START hi how are you END\00' | grep -aPo '\00START\K.*?(?=END\00)'
 hi how are you
EDIT2: The solution using grep only matches within a single line; for multi-line matches it's better to use perl instead. The syntax is very similar:
echo -e '\00START hi \n how\n are\n you END\00' | perl -ne 'BEGIN{undef $/ } /\A.*?\00START\K((.|\n)*?)(?=END)/gm; print $1'
 hi
 how
 are
 you
What's new here:
undef $/ Undefines the input record separator $/, which defaults to '\n'
(.|\n)* Dot matches almost any character, but it does not match \n, so we need to add it here.
/gm Modifiers: g for global, m for multi-line
I would translate the nulls into newlines so that grep can find your wanted text on a clean line by itself:
tr '\000' '\n' < yourfile.bin | grep "^START"
From there you can take it into sed as before.
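Building on the tr idea, the translated stream can be fed to awk, which writes one numbered file per match. This is a minimal sketch under a strong assumption: no START ... END block contains a newline of its own, so each block lands on a single line after the translation. The function name extract_blocks and both of its arguments are hypothetical.

```shell
# extract_blocks INPUT_FILE OUTPUT_DIR   (both arguments are hypothetical names)
extract_blocks() {
    mkdir -p "$2" &&
    tr '\000' '\n' < "$1" |                     # NUL bytes become line breaks
    awk -v dir="$2" '
        /^START/ {
            sub(/END$/, "")                     # drop the closing END marker
            out = dir "/demo" (++n) ".txt"      # demo1.txt, demo2.txt, ...
            print > out
            close(out)                          # avoid running out of open files
        }'
}

# Demo on the sample from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '\000START how are you? END\000\n\000START good thanks END\000\n' > "$tmp/demo.txt"
extract_blocks "$tmp/demo.txt" "$tmp/folder"
cat "$tmp/folder/demo1.txt"
```

Each output file keeps the leading START and the text up to, but not including, END, which matches the "START how are you? " content described in the question.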