perl regex negative-lookbehind detect file lacking final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).
> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb' | test -n "$(tail -c1)" && echo pathological last line
pathological last line
>
One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.
(Recall that -0 with no digits sets the input record separator to the NUL character; since a text file contains no NULs, -n0 causes perl to "slurp" the entire file as a single record. Thus, there is only one $, at the end of the file.)
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
>
Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
>
Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.

/(?<=\n)$/ is a weird and expensive way of doing /\n$/.
/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.
A few approaches:
perl -n0777e'print "pathological last line\n" if !/\n\z/'
perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'
perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'
perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'
The last solution avoids slurping the entire file.
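If the input is a plain (seekable) file rather than a pipe, one can go further and examine only the final byte, much as tail -c1 does. A minimal sketch, not from the original answer, with the error handling as an assumption:
use strict;
use warnings;
use Fcntl qw(SEEK_END);

open my $fh, '<:raw', $ARGV[0] or die "open $ARGV[0]: $!";
if (-s $fh) {                              # skip empty files
    seek $fh, -1, SEEK_END or die "seek: $!";
    read $fh, my $last, 1  or die "read: $!";
    print "pathological last line\n" if $last ne "\n";
}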
Why does my regex always match, even when the end-of-file is preceded by newline?
Because you mistakenly think that $ only matches at the end of the string. It also matches just before a string-final newline, and the negative lookbehind succeeds at that earlier position. Use \z to match only at the very end.
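For example, in a string that ends with a newline, $ can also match just before that final newline, where the preceding character is b, so the negative lookbehind succeeds:
> perl -e 'print "matched\n" if "aaa\nbbb\n" =~ /(?<!\n)$/;'
matched
> perl -e 'print "matched\n" if "aaa\nbbb\n" =~ /(?<!\n)\z/;'
>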

Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job.
perl -le'while (!eof()) { $previous = getc(\*ARGV) }
if ($previous ne "\n") { print "pathological last line!" }'
PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:
perl -le 'use File::ReadBackwards;
my $bw = File::ReadBackwards->new(shift @ARGV);
print "pathological last line" if substr($bw->readline, -1) ne "\n"'
That should work efficiently, even on very large files. And when I come back to read it a year later, I will be more likely to understand it than I would the regular-expression approach.

The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp.
(I mistakenly thought that "laser focus" on a problem, recommended by stackoverflow, meant editing out the context.)
Thanks to the responses, here is an improved draft of the script:
#!/usr/bin/perl
use strict; use warnings; use 5.18.2;
# Loop slurp:
$/ = undef; # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
    s/^\s*$/\n/mg;      # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
    s/[\n][\n]+/\n\n/g; # squeeze runs of newlines to exactly two, i.e. one blank line between paragraphs. Like cat -s.
    s/^[\n]+//;         # first line is visible or "nontrivial."
    s/[\n]+\z/\n/;      # last line is visible or "nontrivial."
    print STDOUT;
    print "\n" unless m/\n\z/; # IF pathological last line detected, i.e., not ending in LF, THEN append LF.
}
And here is how it works, when named zz.pl. First a messy file, then how it looks after zz.pl gets through with it:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t'
aaa
bb
bash:
bash:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t' | zz.pl
aaa
bb
bash:

Related

I am in trouble with a regexp to remove some \n

I'm trying to define a regexp to remove some carriage returns in a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used in https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
Which should get two groups, one before and one after some newline.
Looking at regex101, it reports that the groups are correctly captured.
But the result is wrong, because it still introduces an invisible newline, as follows.
I also tried to use sed, but the result is exactly the same.
So, the question is: Where am I wrong?
sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input)
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
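A quick check with a made-up two-line fragment (one unbalanced quote on the first line):
$ printf '200;GBP;"BP JET WASH\n";Hamilton\n' | perl -pe 's/\n/<>/e if tr/"// % 2 == 1'
200;GBP;"BP JET WASH";Hamilton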
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
on lines containing a non-paired double quote, read the next line to the pattern space (N) and remove the newline.
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then the previous line is printed and the current line is remembered. The last line needs to be printed separately in the END block, as there's no following line to trigger its printing.
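Assuming file.csv contains exactly the two-line fragment from the question, the one-liner stitches it back together:
$ perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";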
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for @{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for @{$_[1]}}),sep=>";")' file.csv
It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
    csv
        in             => *ARGV,
        sep            => ";",
        blank_is_undef => 1,
        quote_empty    => 1,
        on_in          => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but the newlines give me errors, so I need to replace them with a space.
I do some other perl commands on these files for other errors, so I would like to have a solution in a one-line perl command.
I've found some similar questions on stackoverflow, but they don't quite do the same, and I can't find a solution for my problem there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
    my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
    while (my $row = $csv->getline(*ARGV)) {
        $csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
    }' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and as long as the second line doesn't start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE pattern), the line break between the two is replaced with a space. The search and replace repeats (via the ba loop) until there is no match or the end of the file is reached.
With perl, if the file is huge but can still fit into memory, you can use a short one-liner:
perl -0777 -pe 's/\R++(?!\d{4,}\|)/ /g' <<< "$s"
With -0777, you slurp the whole input, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot be satisfied by backtracking into the line-break matches.
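A minimal sketch of why the possessive quantifier matters; the test string is made up for illustration:
use strict;
use warnings;

my $s = "text\n\n4455|next";

# Backtracking \R+ can give back one newline, so the lookahead is tested
# in front of the second newline (not in front of "4455|") and succeeds:
print "\\R+ matches\n"  if $s =~ /\R+(?!\d{4,}\|)/;   # prints

# Possessive \R++ keeps both newlines; the lookahead then sits right
# before "4455|", fails, and backtracking is not allowed:
print "\\R++ matches\n" if $s =~ /\R++(?!\d{4,}\|)/;  # does not print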
With your shown samples, this can be done simply with an awk program. Written and tested in GNU awk; it should work in any awk. It should work fast even on huge files (better than slurping the whole file into memory, given that the OP may use it on huge files).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk '                         ##Starting awk program from here.
gsub(/"/,"&")%2!=0{           ##Checking condition: if the number of " on the line is NOT even, a quote is NOT closed properly.
  if(val==""){ val=$0 }       ##If val is NULL then set val to the current line.
  else{print val $0; val=""}  ##Else (val NOT NULL) print val followed by the current line, then nullify val.
  next                        ##next will skip further statements from here.
}
1                             ##If the number of " on a line is EVEN, the condition above (the gsub one) is skipped and the line is simply printed.
' Input_file                  ##Mentioning Input_file name here.

Can not replace multiple empty lines with one

Why does the following not replace multiple empty lines with one?
$ cat some_random_text.txt
foo
bar
test
and this does not work:
$ cat some_random_text.txt | perl -pe "s/\n+/\n/g"
foo
bar
test
I am trying to replace the multiple new lines (i.e. empty lines) to a single empty new line but the regex I use for that does not work as you can see in the example snippet.
What am I messing up?
Expected outcome is:
foo
bar
test
The reason it doesn't work is that -p tells perl to process the input line by line, and there's never more than one \n in a single line.
Better idea:
perl -00 -lpe 1
-00: Enable paragraph mode (input records are terminated by any sequence of 2+ newlines).
-l: Enable autochomp mode (the input record separators are trimmed automatically, so since we're in paragraph mode, all trailing newlines are removed, and output records get "\n\n" added).
-p: Enable automatic input/output (the main code is executed for each input record; anything left in $_ is printed automatically).
-e 1: Use a dummy main program that does nothing.
Taken all together this does nothing except normalize paragraph terminators to exactly two newlines.
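For example, with made-up input (note the trailing blank line: the final paragraph also gets "\n\n" appended, a point raised again at the end of this section):
$ printf 'foo\n\n\n\nbar\ntest\n' | perl -00 -lpe 1
foo

bar
test
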
You are executing the following program:
LINE: while (<>) {
s/\n+/\n/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Since you are reading one line at a time, and since a line is a sequence of characters that aren't line feeds terminated by a line feed, your pattern will never match more than one newline.
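You can verify this by counting the newlines in each record (made-up input):
$ printf 'a\n\n\nb\n' | perl -ne 'print tr/\n//, "\n"'
1
1
1
1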
The simple fix is to tell Perl to treat the entire file as one line. Also, you don't want to replace every line feed, but just those found in sequence of two or more, and you want to replace the sequence with two line feeds.
perl -0777pe's/\n\n\K\n+//g; s/^\n+//; s/\n\K\n\z//' some_random_text.txt
The second and third substitutions ensure there are no blank lines at the start and end of the file.
While reading the entire file into memory is easy, it's not necessary. The desired output can also be achieved by maintaining a flag that indicates whether the previous line was blank or not.
perl -ne'if (/\S/) { print "\n" if $f; print; $f=0 } else { $f=1 if defined $f }' some_random_text.txt
This solution also removes blank lines from the start and end of the file.
Given:
$ echo "$txt"
foo
bar
test
You can use sed to reduce the runs of blank lines to a single \n:
$ echo "$txt" | sed '/^$/N;/^\n$/D'
foo
bar
test
Even easier, you can use cat -s:
$ echo "$txt" | cat -s # same output
In Perl the idiomatic one-liner is to use -00 for paragraph mode:
$ echo "$txt" | perl -00pe0 # same output
And in awk you have the flexibility of using paragraph mode by setting RS= and then set ORS= to what you want the replacement for runs of \n to be:
$ echo "$txt" | awk '1' RS= ORS="\n\n" # same output
ikegami correctly states that printf 'a\n\n' | ... will produce two trailing newlines with these solutions. That may or may not be an issue.

Perl one-liner to remove multiple lines

input file is
<section_begin> mxsqlc
*** WARNING[13052] Cursor C is not fetched.
<section_end>
<section_begin> b2.lst
*
*** WARNING[13052] Cursor C is not fetched.
0 errors, 1 warnings in SQL C file "b2.ppp".
<section_end>
<section_begin> b2s0
SQLCODE=0
SQLSTATE=00000
a=10, b=abc, c=20
SQLCODE=0
SQLSTATE=00000
a=10, b=abc , c=10, d=xyz
<section_end>
I'm expecting output without the lines below.
<section_end>
<section_begin> b2s0
my code is
perl -ne 'print unless /^\<section_end\>(\s*|.*lst)?\s*$/' b2exp
It removes all <section_end> lines and doesn't remove this line <section_begin> *.lst
keep it simple
perl -ne 'print unless /^\<section_/' b2exp
bit more complicated
perl -ne 'print unless /^\<section_(end|begin)\>/' b2exp
Ah, your question isn't clear (to me; perhaps it really is).
I now read it as:
"I have some sections marked out with <section_begin> tagname
at the start and <section_end> at the end.
I wish to exclude the sections with a particular tagname, b2s0 in the example. I wish to keep all other lines.
"
perl -ne 'BEGIN {$p=1} $p=0 if /section_begin.*b2s0/; print if $p; $p=1 if /<section_end>/;' ex.txt
If the intention is to merge the section with lst with the next section (and remove the stuff on the same line after the next section's begin tag), I would go with Awk instead.
awk '/<section_end>/ && lst { next }
/<section_begin>/ && lst { lst=0; next }
/<section_begin>.*lst/ {lst=1}
1' b2exp
The same thing can be done in Perl, of course; the simplest approach, a one-liner of the form perl -0777 -pe 's/.../.../s' file, will be a lot less memory-efficient because it reads the whole file into memory.
perl -0777 -pe 's%(<section_begin>[^\n]*lst.*?)\n<section_end>\n<section_begin>[^\n]*%$1%s' b2exp
This will read the whole file into memory (-0777) and replace the multi-line regexp. The non-greedy match .*? makes the match as short as possible, i.e. it will not span past a match on the rest of the pattern (newline, end tag, newline, begin tag optionally followed by non-newline data). We also take care to use [^\n] where we want to keep matches on the same line, since the /s flag turns . into a wildcard which can also match newlines.

How do I remove duplicate characters and keep the unique one only in Perl?

How do I remove duplicate characters and keep the unique one only.
For example, my input is:
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
Expected output is:
EFUAH
UEH
UJHACDEF
I came across perl -pe 's/$1//g while /(.).*\1/' which is wonderful, but it removes even the single occurrence of the character in the output.
This can be done using positive lookahead:
perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME
The regex used is: (.)(?=.*?\1)
. : match any char.
first () : remember the matched single char.
(?=...) : positive lookahead.
.*? : match anything in between.
\1 : the remembered match.
(.)(?=.*?\1) : match and remember any char only if it appears again later in the string.
s/// : the Perl way of doing substitution.
g : do the substitution globally, i.e. don't stop after the first substitution.
s/(.)(?=.*?\1)//g : delete a char from the input string only if that char appears again later in the string.
This will not maintain the order of the char in the input because for every unique char in the input string, we retain its last occurrence and not the first.
To keep the relative order intact we can do what KennyTM tells in one of the comments:
reverse the input line
do the substitution as before
reverse the result before printing
The Perl one line for this is:
perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME
Since we are doing print manually after reversal, we don't use the -p flag but use the -n flag.
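For example, with the sample input from the question:
$ printf 'EFUAHUU\nUUUEUUUUH\nUJUJHHACDEFUCU\n' | perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;'
EFUAH
UEH
UJHACDEF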
I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.
If Perl is not a must, you can also use awk. Here's a fun benchmark of the Perl one-liners posted here against awk; awk is 10+ seconds faster for a file with 3 million+ lines.
$ wc -l <file2
3210220
$ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null
real 1m1.761s
user 0m58.565s
sys 0m1.568s
$ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}' file2 > /dev/null
real 1m32.123s
user 1m23.623s
sys 0m3.450s
$ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null
real 1m17.818s
user 1m10.611s
sys 0m2.557s
$ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null
real 1m20.347s
user 1m13.069s
sys 0m2.896s
perl -ne'my%s;print grep!$s{$_}++,split//'
Here is a solution that I think should work faster than the lookahead one; it is not regexp-based and uses a hash table.
perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'
It splits every line into characters and prints only the first appearance of each, by counting appearances inside the %seen hash table.
Tie::IxHash is a good module for preserving hash insertion order (but it may be slow; you will need to benchmark if speed is important). Example with tests:
use Test::More 0.88;
use Tie::IxHash;

sub dedupe {
    my $str  = shift;
    my $hash = Tie::IxHash->new(map { $_ => 1 } split //, $str);
    return join('', $hash->Keys);
}

{
    my $str = 'EFUAHUU';
    is(dedupe($str), 'EFUAH');
}
{
    my $str = 'EFUAHHUU';
    is(dedupe($str), 'EFUAH');
}
{
    my $str = 'UJUJHHACDEFUCU';
    is(dedupe($str), 'UJHACDEF');
}

done_testing();
Use uniq from List::MoreUtils:
perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
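For example, with the question's input (uniq keeps the first occurrence of each character, including the trailing newline):
$ printf 'EFUAHUU\nUUUEUUUUH\nUJUJHHACDEFUCU\n' | perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
EFUAH
UEH
UJHACDEF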
If the set of characters that can be encountered is restricted, e.g. only letters, then the easiest solution will be with tr
perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
It will replace all the letters by themselves, leaving other characters unaffected and /s modifier will squeeze repeated occurrences of the same character (after replacement), thus removing duplicates
Me bad - it removes only adjoining appearances. Disregard
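For instance, illustrating the limitation:
$ printf 'EFUAHUU\n' | perl -pe 'tr/a-zA-Z/a-zA-Z/s'
EFUAHU
Only the adjacent UU is squeezed; the earlier, non-adjacent U survives.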
This looks like a classic application of positive lookbehind, but unfortunately perl doesn't support variable-length lookbehind, which is what this would need (matching the text preceding a character with a full regex whose length is indeterminable). In fact, I think that can only be done with the .NET regex classes.
However, positive lookahead supports full regexes, so all you need to do is reverse the string and apply positive lookahead (like unicornaddict said):
perl -pe 's/(.)(?=.*?\1)//g'
And reverse it back, because without the reverse that'll only keep the duplicate character at the last place in a line.
MASSIVE EDIT
I've been spending the last half an hour on this, and it looks like this works, without the reversing.
perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME
I don't know whether to be proud or horrified. I'm basically doing the positive lookahead, then substituting on the string with \G specified - which makes the regex engine start its matching from the last place matched (internally tracked by pos()).
With test input like this:
aabbbcbbccbabb
EFAUUUUH
ABCBBBBD
DEEEFEGGH
AABBCC
The output is like this:
abc
EFAUH
ABCD
DEFGH
ABC
I think it's working...
Explanation - okay, in case my explanation last time wasn't clear enough: the lookahead will go and stop at the last match of a duplicate [in the code you can do a print pos(); inside the loop to check], and the s/\G$1//g will remove it [you don't really need the /g]. So within the loop, the substitution will continue removing until all such duplicates are zapped. Of course, this might be a little too processor-intensive for your tastes... but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.
From the shell, this works:
sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'
In words: mark every linebreak with a <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.
I found the -e :a -e '$!N; s/\n//; ta' part in a forum post and I don't understand the separate -e :a part, or the $!N part, so if anyone can explain those, I'd be grateful.
Hmm, that one does only consecutive duplicates; to eliminate all duplicates you could do this:
cat test.txt | while read line ; do echo $line | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done
That puts the characters in each line in alphabetical order though.
use strict;
use warnings;

my ($uniq, $seq, @result);
$uniq = '';

sub uniq {
    $seq = shift;
    for (split '', $seq) {
        # \Q...\E so regex metacharacters in the data are matched literally
        $uniq .= $_ unless $uniq =~ /\Q$_\E/;
    }
    push @result, $uniq;
    $uniq = '';
}

while (<DATA>) {
    uniq($_);
}

print @result;
__DATA__
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
The output:
EFUAH
UEH
UJHACDEF
For a file named foo.txt containing the data you list:
python -c "print set(open('foo.txt').read())"