Remove new line only if after a number - regex

I've collected some CSV data from a terminal, but every line is wrapped at 80 characters, so it's not importing properly.
Here are two lines of data:
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691
,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5
267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9
746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,1230
0,11970,12240,12867,13475,14310,15962,17624,19105,21075,
I want to remove the newline character only if it comes after a digit or comma, but not if it's on a line of its own, since that marks a new line of CSV data.
I couldn't figure out how to do this in the shell with sed. If another program such as awk or perl is better suited to this scenario, feel free to show me a solution using that.
Expected output:
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,

Just remove the newline if it's preceded by a digit or comma:
perl -pe 'chomp if /[\d,]$/' input-file > output-file
-p reads the input line by line and prints the result
chomp removes newline if present at the end
\d matches a digit
$ matches the end of line
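For readers who prefer to see the same rule outside perl, here is a minimal Python sketch of it. The function name join_wrapped is made up for illustration, and the sketch assumes, as the question does, that a blank line separates the CSV records:

```python
def join_wrapped(lines):
    """Drop the newline of every line that ends in a digit or a comma.

    `lines` is an iterable of lines that still carry their trailing
    newline, e.g. an open file or text.splitlines(keepends=True).
    """
    out = []
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped and stripped[-1] in "0123456789,":
            out.append(stripped)   # same effect as perl's chomp
        else:
            out.append(line)       # blank separator lines keep their newline
    return "".join(out)
```

As with the perl one-liner, the blank separator line supplies the newline that ends each joined record.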

With awk by reading in paragraph mode and replacing all \n
$ awk -v RS= '{gsub("\n","")} 1' ip.txt
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
To keep the blank lines between records, set ORS to two newlines; note that this also adds an extra newline at the end of the output:
$ awk -v RS= -v ORS='\n\n' '{gsub("\n","")} 1' ip.txt
28,26166,25180,23645,22824,21257,20080,18921,17893,16702,15650,14647,13667,12691,11971,11179,10393,9885,9294,8930,8390,8079,7660,7341,6907,6425,6120,5789,5588,5267,4924,4581,4246,4025,3857,
3423,3567,3636,3633,3714,3844,4543,5887,7287,8499,9746,10704,11658,12591,13379,13950,14679,14954,14756,14224,13921,13494,12849,12300,11970,12240,12867,13475,14310,15962,17624,19105,21075,
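The paragraph-mode idea can be sketched in Python as well. This is only an illustration: join_paragraphs is a hypothetical name, and it assumes records are separated by one or more completely empty lines:

```python
import re

def join_paragraphs(text):
    """Treat blank-line-separated blocks as records and unwrap each one."""
    # Leading/trailing newlines are ignored, as in awk's paragraph mode.
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    # Remove the newlines inside each record, then end it with one newline.
    return "".join(p.replace("\n", "") + "\n" for p in paragraphs if p)
```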

You can use this regex:
(?<!\n)\n(?!\n)
and replace with empty string.
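The regex needs an engine with lookaround support. As a rough sketch, this is how it could be applied with Python's re module (the function name remove_single_newlines is an assumption for illustration):

```python
import re

def remove_single_newlines(text):
    """Delete each newline that is neither preceded nor followed by a newline."""
    return re.sub(r"(?<!\n)\n(?!\n)", "", text)
```

Note that runs of two or more newlines are left untouched, so the blank line between records survives as-is, and a single newline at the very end of the input is removed.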

perl -0pe 's/([\d,])\n([\d,])/$1$2/sg' (file)
should do it.
That is, read the file without line delimiters, treat the whole thing as one string and remove the newlines that are preceded and followed by a digit or comma.

Related

file cleanup using sed and regex (remove some but not all newlines)

I have a text file that I would like to load into Hive. It has line breaks within a string column, so it won't load properly. From what I found online, the file needs to be preprocessed to remove all those line breaks. I have tried many regexes so far, but to no avail.
This is the file:
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
the desired output should be this:
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
I could achieve this in Notepad++ by using this regex: (\r\n^(?!\/biz\/))+
However, when I run that regex with sed like so, it doesn't work:
sed -e 's/(\r\n^(?!\/biz\/))+//g' original.csv > clean.csv
As stated, sed doesn't support lookaround assertions such as (?!\/biz\/).
Since your input is essentially record-oriented, awk offers a convenient solution.
With GNU awk or Mawk (required to support multi-character input record separators):
awk -v RS='/biz/' '$1=$1 { print RS $0 }' file
RS='/biz/' splits the input into records by /biz/ (reserved variable RS is the input-record separator, \n by default).
$1=$1 looks like a no-op, but actually rebuilds the input record at hand ($0) by normalizing any record-internal runs of whitespace - including newlines - to a single space each, relying on awk's default field-splitting and output behavior.
Additionally, since $1=$1 serves as a pattern (conditional), the outcome of the assignment decides whether the associated action ({ ... }) is executed for the record at hand.
For an empty record - such as the implied one before the very first /biz/ - the assignment returns '', which in a Boolean context evaluates to false and therefore skips the associated block.
{ print RS $0 } prints the rebuilt input record, prefixed by the input record separator; print automatically appends the output record separator, ORS, which defaults to \n.
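To make the record-splitting logic explicit, here is a small Python sketch of the same steps. It is an approximation rather than a port of awk: rebuild_records and its sep parameter are names invented for this example.

```python
def rebuild_records(text, sep="/biz/"):
    """Split on `sep`, squeeze whitespace runs in each record, re-add `sep`."""
    out = []
    for chunk in text.split(sep):
        # str.split() with no argument splits on runs of whitespace,
        # newlines included, much like awk's default field splitting.
        fields = chunk.split()
        if fields:  # skip the empty record before the first separator
            out.append(sep + " ".join(fields) + "\n")
    return "".join(out)
```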
Note: Your code references \r\n, i.e., Windows-style CRLF line breaks. Since you're trying to use sed, I trust that the versions of the Unix utilities available to you on Windows transparently handle CRLF.
If you're actually on a Unix platform and only happen to be dealing with a Windows-originated file, a little more work is needed.
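Maybe this can help you (GNU sed):
sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^/]\)| \1|g;p}'
Note that [^/] is a bracket expression matching any single character other than /, so a newline is turned into a space only when the next line does not start with /. A bracket expression such as [^\/biz] does not mean "not followed by /biz/"; it excludes each listed character individually, so continuation lines starting with b, i or z would not be joined.
Test:
$ sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^/]\)| \1|g;p}' test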
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
awk to the rescue! (with multi-char RS support)
$ awk -v RS='\n?^/' 'NF{$1=$1; print "/" $0}' file
or
$ awk -v RS='\n?^/' 'NF{$1="/"$1}NF' file
Create files
$ cat biz.awk
{                                   # read the entire input into string `f',
    f = f (NR > 1 ? " " : "") $0    # joining the lines with a single space
}
END {
    gsub(/ \/biz\//, "\n/biz/", f)  # start a new line before every /biz/
                                    # except the first one
    print f
}
and
$ cat file
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
Usage:
awk -f biz.awk file
sed doesn't support lookarounds, perl does
$ perl -0777 -pe 's/(\n^(?!\/biz\/))+//mg' original.csv
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
The -0777 option slurps the entire file as a single string.
The m modifier lets the ^ and $ anchors match at line boundaries inside a multiline string.
Note: line endings on Unix-like systems do not use \r, but if your input does have them, use \r\n as in the question.
Use a different delimiter to avoid having to escape /:
perl -0777 -pe 's|(\n^(?!/biz/))+||mg' original.csv
Another way to do it is to delete all \n characters between a pair of double quotes:
$ perl -0777 -pe 's|".*?"|$&=~s/\n//gr|gse' ip.txt
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
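The s modifier lets . match newlines, so .*? can span multiple lines, and the e modifier evaluates the replacement as an expression instead of a string.
$&=~s/\n//gr performs a nested substitution on the matched text ".*?"; the r modifier returns the result instead of modifying $& in place.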
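The same "substitute inside each match" idea can be sketched in Python with a replacement callback. The names here are hypothetical, and the replacement parameter is an addition of this sketch (pass " " to get the spaced output the question asks for):

```python
import re

def strip_newlines_in_quotes(text, replacement=""):
    """Replace newlines only inside double-quoted stretches of `text`."""
    return re.sub(
        r'".*?"',                                   # shortest quoted stretch
        lambda m: m.group(0).replace("\n", replacement),
        text,
        flags=re.DOTALL,                            # let . match newlines
    )
```

Like the perl version, this assumes quotes are balanced and that fields contain no escaped double quotes.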
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk. With GNU awk for multi-char RS and RT:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18

Regexp to catch string between first and second comma, where there's alphabetical character in number

First, I should mention that my native language is French, so I may make mistakes in English!
I'm trying to use sed to find and delete the lines where the second item in a CSV file contains characters other than digits.
Here is an example of an OK line:
2323421,9781550431209,,2012-07-24 13:30:57,False,2012-07-01 00:00:00,False,118,,1,246501
A line that must be deleted :
1901461,3002CAN,,2010-09-29 13:46:59,True,,True,,,,
or
2977837,9782/76132396,,2015-04-27 10:14:47,True,2015-04-26 00:00:00,True,,,,
etc...
I'm not sure this is possible to be honest!
Thank you !
Here it is using sed
sed -e '/^[^,]*,[^,]*[^0-9,]/d'
A breakdown of the pattern:
^ Start of line
[^,]*, Everything up to the first comma inclusive
[^,]* Everything which isn't a comma
[^0-9,] One character which is neither a digit nor a comma
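The pattern from the breakdown above works unchanged in other regex engines. A short Python sketch for checking it against the sample lines (the names are illustrative, and lines are assumed to have their trailing newline already removed, as sed does):

```python
import re

# Same pattern as the sed command: a non-digit, non-comma character
# somewhere in the second comma-separated field.
BAD_SECOND_FIELD = re.compile(r"^[^,]*,[^,]*[^0-9,]")

def keep_numeric_second_field(lines):
    """Return the lines that the sed command would NOT delete."""
    return [line for line in lines if not BAD_SECOND_FIELD.search(line)]
```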
Using awk you can do this:
awk -F, '$2 ~ /^[[:digit:]]+$/' file
Or (thanks to @ghoti):
awk -F, '$2 !~ /[^[:digit:]]/' file
to get only those lines where the 2nd column is an integer. Note that this second variant also keeps lines whose 2nd column is empty.
Or using sed you can do:
sed -i.bak '/^[^,]*,[[:digit:]]*[^,[:digit:]]/d' file
Perl:
perl -F, -lane 'print if $F[1] =~ /^\d+$/' file
-a autosplits each line into the array @F; fields start at index 0
-F, splits the line on commas
The line is printed only if field 1 contains nothing but digits: /^\d+$/

How to delete newline if the line doesn't end with " with trailing white-spaces

This is a continuation of a similar question that I posted, but with the additional complication of trailing white space, as pointed out by @Jubobs.
Sample data:
"data","123" <-spaces
"data2","qwer" <-space
"false","234 <-spaces
And i'm the culprit" <-- spaces at the start of line and end of line
"data5","234567"
Output text should be
"data","123"
"data2","qwer"
"false","234 And i'm the culprit"
"data5","234567"
In essence, I want to fix my csv file (which is very large)
I'm using sed so an answer in sed would help a lot :)
EDIT: Added spaces to sample text
I've added a line at the end of your sample input that includes a field that starts with white space as it's important to test that that will work with any proposed solution you get:
$ cat file
"data","123"
"data2","qwer"
"false","234
And i'm the culprit"
"data5","234567"
"stuff","
foo"
So you can see the newlines and white space:
$ sed 's/$/\$/' file
"data","123" $
"data2","qwer" $
"false","234 $
And i'm the culprit"$
"data5","234567"$
"stuff"," $
foo"$
If you just want to remove the newlines but leave the trailing white space then this awk command is all you need (only piped to sed to show newlines)
$ awk '{q+=gsub(/"/,"&"); printf "%s%s",$0,(q%2?"":RS)}' file | sed 's/$/\$/'
"data","123" $
"data2","qwer" $
"false","234 And i'm the culprit"$
"data5","234567"$
"stuff"," foo"$
If you want to remove the trailing white space when it's within the fields too:
$ awk '{q+=gsub(/"/,"&"); if (q%2) sub(/[[:blank:]]+$/,""); printf "%s%s",$0,(q%2?"":RS)}' file | sed 's/$/\$/'
"data","123" $
"data2","qwer" $
"false","234And i'm the culprit"$
"data5","234567"$
"stuff","foo"$
In all cases above, the sed command is just to stick a $ at the end of the line to make the trailing white space visible for this example, the awk command is all you need.
All it's doing is counting how many "s you've seen so far (q+=gsub(/"/,"&")). If it's an odd number (q%2 is 1) then you're in the middle of a field so do not print a newline at the end of the line, otherwise just print the usual Record Separator which is a newline.
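The quote-counting logic described above is easy to restate in Python. This is a sketch under the same assumption the awk command makes, namely that fields never contain escaped double quotes; join_on_open_quotes is a made-up name:

```python
def join_on_open_quotes(lines):
    """Join physical lines until the running count of double quotes is even.

    `lines` are lines without their trailing newline.
    """
    out = []
    quotes = 0
    for line in lines:
        quotes += line.count('"')
        out.append(line)
        if quotes % 2 == 0:   # all quotes closed: the record is complete
            out.append("\n")
    return "".join(out)
```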
You could try something like
awk '/[a-zA-Z0-9][^"]*$/{ORS=""} /[a-zA-Z0-9]"[^"]*$/{ORS="\n"} 1 '
Test
$ awk '/[a-zA-Z0-9][^"]*$/{ORS=""} /[a-zA-Z0-9]"[^"]*$/{ORS="\n"} 1 ' input
"data","123"
"data2","qwer"
"false","234And i'm the culprit"
"data5","234567"
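What does it do?
[a-zA-Z0-9][^"]*$ matches lines that have an alphanumeric character with no " anywhere after it, i.e. lines that end in the middle of a field.
{ORS=""} sets the output record separator to the empty string.
[a-zA-Z0-9]"[^"]*$ matches lines whose last " directly follows an alphanumeric character.
{ORS="\n"} sets the output record separator back to \n.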
This might work for you (GNU sed):
sed -r ':a;s/^(".*",".*").*/\1/;t;N;s/\n//;ta' file
If the line contains two double quoted fields separated by a comma remove anything following the last double quote and you are done. Otherwise append the next line and remove its newline and try again.

how to trim trailing spaces after all delimiter in a text file

I need help removing the spaces that follow each delimiter in a text file.
I have a text file with the data below.
eg.
ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
I want to remove the spaces between each delimiter and the first character of the field that follows it.
Is there a regex or Unix script that can do this? I'm looking for output like the following:
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU||||||2013-09-19 14:48:49.609000|
Any help will be appreciated.
awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)} 1' file
Using a perl one-liner to remove the spacing around every delimiter and at both ends of the line. Assumes no embedded delimiters:
perl -i -lpe 's/\s*\|\s*/|/g; s/^\s+|\s+$//g' file.txt
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-l: Enable line ending processing
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
The perl code below removes the spaces at the start of a line and the spaces that directly follow the delimiter |:
$ perl -pe 's/(?<=\|) +|^ +//g' file
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
To save the changes made to that file,
perl -i -pe 's/(?<=\|) +|^ +//g' file
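The following removes every space, including those inside field values such as 169 Park lane, so it only suits data whose fields contain no spaces:
sed 's/\ //g' input.txt > output.txt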
With sed:
sed -r -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//' filename
In the first expression:
(^|\|) matches the beginning of the line or a | character, and saves this in capture group 1.
\s+ matches a sequence of whitespace characters after that.
The replacement \1 substitutes capture group 1, so this deletes the whitespace at the beginning of the line and after the delimiter.
The g modifier makes it operate on all the matches in the line.
In the second expression:
\s+ again matches a sequence of whitespace
$ matches the end of the line
The replacement is an empty string, thus removing the trailing whitespace.
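For comparison, the two expressions translate directly to Python's re module. The sketch below is illustrative only and trim_fields is a hypothetical name:

```python
import re

def trim_fields(line):
    """Strip whitespace at line start, after each '|', and at line end."""
    # (^|\|) captures either the start of the line or a pipe; \1 puts it back.
    line = re.sub(r"(^|\|)\s+", r"\1", line)
    # Remove whatever whitespace is left at the end of the line.
    return re.sub(r"\s+$", "", line)
```

Whitespace inside a field value, as in 169 Park lane, is left alone because it is neither at the start of the line nor directly after a pipe.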
For POSIX sed (with GNU sed, add --posix):
sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g' YourFile
This uses two substitutions (POSIX basic regular expressions have no alternation operator |).
The first removes leading whitespace by replacing ^[[:space:]]* with nothing.
The second replaces a pipe followed by any whitespace (|[[:space:]]*) with a bare pipe.
[[:space:]] could be replaced by a single space character if the text contains only plain spaces (ASCII 32).

sed not substituting as expected

I need to substitute each \n in a line with "\n (double quote followed by newline).
This should work. But it does nothing. Reports no error either. Any clues anyone?
sed -i 's/\n/\"\n/' filename
before, the file contains:
line 1
line 2
after, it contains the exact same.
Thanks
Balt
A line can't contain \n, because \n is the delimiter between lines. sed operates on a single line at a time, and the newline is not included in it.
If you want to put a character before the end of each line, use the $ regexp:
sed -i 's/$/"/' filename
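The point that the newline is a line terminator rather than part of the line can be illustrated with a few lines of Python. This is only a sketch of the idea, with append_quote as a made-up name:

```python
def append_quote(text):
    """Insert a double quote just before the newline that ends each line."""
    out = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\n")      # the part sed would see
        ending = line[len(body):]     # the terminator sed strips and re-adds
        out.append(body + '"' + ending)
    return "".join(out)
```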
Try the following:
sed -i 's/$/"/' filename
Here $ denotes the end of the line.
Using awk
awk '{$0=$0"\""}1' filename