Trouble with a regexp to remove some \n - regex

I'm trying to define a regexp to remove some carriage returns in a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used in https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
This should capture two groups, one before and one after the newlines.
Looking at regex101, it shows that the groups are correctly captured.
But the result is wrong, because the replacement still introduces an invisible newline.
I also tried sed, but the result is exactly the same.
So, the question is: Where am I wrong?

sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the next line of input);
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
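The quote-parity idea generalizes beyond Perl. Here is a minimal sketch of the same logic in Python (an illustration of the technique, not the one-liner itself): a physical line with an odd running count of double quotes must still be inside a quoted field, so its trailing newline is dropped and the next line is glued on.

```python
def join_unbalanced(lines):
    """Join physical lines whose running double-quote count is odd onto the next line."""
    out, buf = [], ""
    for line in lines:
        buf += line
        if buf.count('"') % 2 == 1:
            # An odd number of quotes means a quoted field is still open:
            # drop the newline that falls inside the field.
            buf = buf.rstrip("\n")
        else:
            out.append(buf)
            buf = ""
    if buf:
        out.append(buf)  # unterminated field at EOF: emit what we have
    return out

print(join_unbalanced(['1122;"BP JET WASH IP2 9RP\n', '";"";Hamilton\n']))
```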
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
on lines containing a non-paired double quote, read the next line to the pattern space (N) and remove the newline.
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then the previous line is printed and the current line is remembered. The last line needs to be printed separately in the END block, as there is no following line to trigger printing it.

perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for @{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for @{$_[1]}}),sep=>";")' file.csv

It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
csv
in => *ARGV,
sep => ";",
blank_is_undef => 1,
quote_empty => 1,
on_in => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.
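For comparison, Python's standard csv module can express the same cleanup (a sketch of the technique, not a drop-in for the Perl program above; unlike quote_empty, it re-quotes only where needed, so empty fields come out unquoted):

```python
import csv
import io

raw = '200;GBP;"BP JET WASH IP2 9RP\n";Hamilton\n'

# The reader transparently handles the newline embedded in the quoted field.
out = io.StringIO()
writer = csv.writer(out, delimiter=';', lineterminator='\n')
for row in csv.reader(io.StringIO(raw), delimiter=';'):
    writer.writerow([field.replace('\n', '') for field in row])
print(out.getvalue(), end='')
```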

Related

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but the newlines cause errors, so I need to replace them with a space.
I already run some other Perl commands on these files for other fixes, so I would like a solution in a one-line Perl command.
I've found some similar questions on Stack Overflow, but they don't do quite the same thing and I couldn't adapt their solutions to my problem.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
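The same record-oriented cleanup can be sketched with Python's csv module as well (an illustration of the technique; note the writer re-quotes only fields that need it, so the cosmetic quoting differs from the sample):

```python
import csv
import io

raw = '4455|"test newline\nin string"||"test another 2nd string"\n'

out = io.StringIO()
writer = csv.writer(out, delimiter='|', lineterminator='\n')
for row in csv.reader(io.StringIO(raw), delimiter='|'):
    # Replace any newline inside a field with a space, as tr/\n/ / does.
    writer.writerow([field.replace('\n', ' ') for field in row])
print(out.getvalue(), end='')
```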
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
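The record-start test can be sketched in plain Python too (joining continuation lines with a space, like the desired output; the line-start pattern ^[0-9]{4,}\| comes from the awk command above):

```python
import re

RECORD_START = re.compile(r'^[0-9]{4,}\|')

def merge_records(lines):
    """Glue continuation lines onto the record that started them."""
    records = []
    for line in lines:
        line = line.rstrip('\n')
        if records and not RECORD_START.match(line):
            # Not a record start: it continues a quoted field on the previous line.
            records[-1] += ' ' + line
        else:
            records.append(line)
    return records

print(merge_records(['4455|"test newline\n', 'in string"||"x"\n']))
```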
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and when the second line does not start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of the file is reached.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the file and the \R++(?!\d{4,}\|) pattern matches any one or more line breaks (\R++) not followed with four or more digits followed with a | char. The ++ possessive quantifier is required to make (?!...) negative lookahead to disallow backtracking into line break matching pattern.
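Python's re engine has no possessive quantifiers, but the same effect can be sketched by adding \n to the lookahead, which forbids \n+ from backtracking to a shorter match (assuming, as above, that every record starts with four or more digits and a |):

```python
import re

s = '4455|"a\nb"\n4456|"ok"'

# (?!\n|...) plays the role of the possessive \R++: if \n+ gave back a
# newline, the next character after the match would be \n and the
# lookahead would fail, so only full runs of newlines are replaced.
cleaned = re.sub(r'\n+(?!\n|\d{4,}\|)', ' ', s)
print(cleaned)
```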
With your shown samples, this can be done simply with an awk program. Written and tested in GNU awk; it should work in any awk. It should be fast even on huge files, better than slurping the whole file into memory (the OP mentioned the files may be huge).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##If the number of " on this line is odd, a quoted field is still open (not closed properly).
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val $0;val=""} ##Else(if val NOT NULL) then print val current line and nullify val here.
next ##next will skip further statements from here.
}
1 ##If the number of " on a line is even, the above (gsub) condition is skipped and the line is simply printed.
' Input_file ##Mentioning Input_file name here.

perl regex negative-lookbehind detect file lacking final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).
> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb' | test -n "$(tail -c1)" && echo pathological last line
pathological last line
>
One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.
(Recall that the -n0 flag causes perl to "slurp" the entire file as a single record. Thus, there is only one $, the end of the file.)
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
>
Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
>
Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.
/(?<=\n)$/ is a weird and expensive way of doing /\n$/.
/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.
A few approaches:
perl -n0777e'print "pathological last line\n" if !/\n\z/'
perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'
perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'
perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'
The last solution avoids slurping the entire file.
Why does my regex always match, even when the end-of-file is preceded by newline?
Because you mistakenly think that $ only matches at the end of the string. Use \z for that.
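Python draws exactly the same distinction, with \Z playing the role of Perl's \z, which makes for a quick way to convince yourself of the difference:

```python
import re

with_nl, without_nl = 'aaa\nbbb\n', 'aaa\nbbb'

# $ matches at the very end OR just before a final newline...
print(bool(re.search(r'b$', with_nl)))       # True
# ...while \Z matches only at the very end of the string.
print(bool(re.search(r'b\Z', with_nl)))      # False
print(bool(re.search(r'b\Z', without_nl)))   # True
```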
Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job:
perl -le'while (!eof()) { $previous = getc(\*ARGV) }
if ($previous ne "\n") { print "pathological last line!" }'
PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:
perl -le 'use File::ReadBackwards;
my $bw = File::ReadBackwards->new(shift @ARGV);
print "pathological last line" if substr($bw->readline, -1) ne "\n"'
That should work efficiently, even on very large files. And when I come back to read it a year later, I will be more likely to understand it than I would the regular-expression approach.
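The same constant-time check can be sketched in Python without any extra module (assuming an ordinary seekable file): seek to the last byte and compare it with \n, so the file size never matters.

```python
import os

def lacks_final_newline(path):
    """True if the file is non-empty and its last byte is not a newline."""
    with open(path, 'rb') as f:
        if f.seek(0, os.SEEK_END) == 0:
            return False  # empty file: nothing pathological to report
        f.seek(-1, os.SEEK_END)
        return f.read(1) != b'\n'
```

seek(0, os.SEEK_END) returns the file size, so the empty-file case is caught before seeking back one byte.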
The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp.
(I mistakenly thought that "laser focus" on a problem, recommended by stackoverflow, meant editing out the context.)
Thanks to the responses, here is an improved draft of the script:
#!/usr/bin/perl
use strict; use warnings; use 5.18.2;
# Loop slurp:
$/ = undef; # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
s/^\s*$/\n/mg; # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
s/[\n][\n]+/\n\n/g; # collapse runs of newlines to exactly two, i.e. one blank line between paragraphs. Like cat -s
s/^[\n]+//; # first line is visible or "nontrivial."
s/[\n]+\z/\n/; # last line is visible or "nontrivial."
print STDOUT;
print "\n" unless m/\n\z/; # IF detect pathological last line, i.e., not ending in LF, THEN append LF.
}
And here is how it works, when named zz.pl. First a messy file, then how it looks after zz.pl gets through with it:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t'
aaa
bb
bash:
bash:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t' | zz.pl
aaa
bb
bash:

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around Stack Overflow and other websites, but all the answers I found were specific to different scenarios and I wasn't able to adapt them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
@derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z\x27-]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
@derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z\x27-]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z\x27-]+ one or more lowercase or capital letters, dashes, or single quotes (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
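Here is the same filter sketched in Python, with an equivalent pattern (letters, dashes, or apostrophes with optional trailing punctuation, or a plain number):

```python
import re

WORD_OK = re.compile(r"^([a-zA-Z'-]+[?!;:,.]?|[\d.]+)$")

def keep_clean_words(line):
    """Keep only whitespace-delimited words made of letters (plus optional
    trailing punctuation) or of digits and dots."""
    return ' '.join(w for w in line.split() if WORD_OK.match(w))

print(keep_clean_words(
    "@derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag"))
```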
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation, which gets you halfway there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Run it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line.

Regex to move second line to end of first line

I have several lines with certain values, and I want to merge every second line, i.e. every line beginning with <name>, onto the end of the preceding line ending with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so I end up with something like
<id>rd://data1/8b</id><name>DM_test1</name>
The reason it came out like this is that I used two piped xpath queries.
Regex
Simply remove the newline at the end of each line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (the latter is Perl syntax). On Linux, search for (<\/id>)\n and replace it with the same thing.
awk
The ideal solution uses awk. The idea is simple: when the line number is odd, we print the line without a newline; otherwise we print it with one.
awk '{ if(NR % 2) { printf $0 } else { print $0 } }' file
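The odd/even pairing is easy to sketch in Python as well (assuming, like the awk version, that <id> and <name> lines strictly alternate):

```python
lines = ['<id>rd://data1/8b</id>\n', '<name>DM_test1</name>\n',
         '<id>rd://data2/76f</id>\n', '<name>DM_test_P</name>\n']

# Join each odd line (the <id>) with the even line that follows it (the <name>).
merged = [lines[i].rstrip('\n') + lines[i + 1] for i in range(0, len(lines), 2)]
print(''.join(merged), end='')
```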
sed
Using sed, we place a line in the hold space when it contains <id> and append the next line to it when it contains <name>. Then we remove the newline and print the hold buffer by exchanging it with the pattern space.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
Using pr we can achieve a similar goal (the two columns will be tab-separated):
pr -ats --columns 2 file

Using Perl Regex Multiline to reformat file

I have a file with the following format:
(Type 1 data:1)
B
B
(Type 1 data:2)
B
B
B
(Type 1 data:3)
B
..
Now I want to reformat this file so that it looks like:
(Type 1 data:1) B B
(Type 1 data:2) B B B
(Type 1 data:3) B
...
My approach was to use perl regex in command line,
cat file | perl -pe 's/\n(B)/ $1/smg'
My reasoning was to replace each newline character that precedes a B with a space,
but it doesn't seem to work. Can you please help me? Thanks
The -p reads a line at a time, so there is nothing after the "\n" to match with.
perl -pe 'chomp; $_ = ($_ =~ /Type/) ? "\n".$_ : " ".$_'
this does almost what you want but puts one extra newline at the beginning and loses the final newline.
If the only place that ( shows up is at the beginning of where you want your lines to start, then you could use this command.
perl -l -0x28 -ne's/\n/ /g;print"($_"if$_' < file
-l causes print to add \n on the end of each line it prints.
-0x28 causes it to split on ( instead of on \n.
-n causes it to loop on the input. Basically it adds while(<>){chomp $_; to the beginning, and } at the end of what ever is in -e.
s/\n/ /g replaces the newlines inside each record with spaces.
print "($_" if $_ re-adds the ( that was consumed as the separator; the if $_ part just stops it from printing an extra line at the beginning.
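The split-on-( idea is compact enough to sketch in Python too (assuming, like the -0x28 version, that ( only ever appears at the start of a logical record):

```python
text = '(Type 1 data:1)\nB\nB\n(Type 1 data:2)\nB\nB\nB\n'

# Splitting on '(' consumes the separator, so it is re-added on output,
# just as the one-liner prints "($_".
records = [r for r in text.split('(') if r]
merged = ['(' + r.rstrip('\n').replace('\n', ' ') for r in records]
print('\n'.join(merged))
```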
It's a little more involved as -n and -p fit best for processing one line at a time while your requirement is to combine several lines, which means you'd have to maintain state for a while.
So just read the entire file in memory and apply the regex like this:
perl -lwe ^
"local $/; local $_ = <>; print join q( ), split /\n/ for m/^\(Type [^(]*/gsm"
Feed your file to this prog on STDIN using input redirection (<).
Note this syntax is for the Windows command line. For Bash, use single quotes to quote the script.