Using Perl Regex Multiline to reformat file

I have a file with the following format:
(Type 1 data:1) B B (Type 1 data:2) B B B
(Type 1 data:3) B ..
Now I want to reformat this file so that it looks like:
(Type 1 data:1) B B (Type 1 data:2) B B B (Type 1 data:3) B
...
My approach was to use a Perl regex on the command line:
cat file | perl -pe 's/\n(B)/ $1/smg'
My reasoning was to replace the newline character with a space, but it doesn't seem to work. Can you please help me? Thanks.

The -p reads a line at a time, so there is nothing after the "\n" to match with.
perl -pe 'chomp; $_ = ($_ =~ /Type/) ? "\n".$_ : " ".$_'
This does almost what you want, but it puts one extra newline at the beginning and loses the final newline.

If the only place that ( shows up is at the beginning of where you want your lines to start, then you could use this command.
perl -l -0x28 -ne's/\n/ /g;print"($_"if$_' < file
-l causes print to add \n on the end of each line it prints.
-0x28 causes it to split on ( instead of on \n.
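-n causes it to loop over the input. Basically it adds while(<>){ to the beginning and } to the end of whatever is in -e; because -l is also given, each record is chomped as well.
s/\n/ /g replaces every newline inside the record with a space.
print "($_" if $_ prints the record with the ( that chomp removed put back. The if $_ part just stops it from printing an extra line at the beginning.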

It's a little more involved as -n and -p fit best for processing one line at a time while your requirement is to combine several lines, which means you'd have to maintain state for a while.
So just read the entire file in memory and apply the regex like this:
perl -lwe ^
"local $/; local $_ = <>; print join q( ), split /\n/ for m/^\(Type [^(]*/gsm"
Feed your file to this prog on STDIN using input redirection (<).
Note this syntax is for the Windows command line. For Bash, use single quotes to quote the script.


I am having trouble with a regexp to remove some \n

I'm trying to define a regexp to remove some carriage returns in a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used in https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
It should capture two groups, one before and one after the newline.
regex101 reports that the groups are captured correctly.
But the result is wrong, because it still contains an invisible newline.
I also tried sed, but the result is exactly the same.
So, the question is: where am I wrong?
sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input);
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
On lines containing an unpaired double quote, this reads the next line into the pattern space (N) and removes the newline.
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then the previous line is printed and the current line is remembered. The last line needs to be printed separately in the END block, as there's no following line to trigger its printing.
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}),sep=>";")' file.csv
It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
   csv
      in             => *ARGV,
      sep            => ";",
      blank_is_undef => 1,
      quote_empty    => 1,
      on_in          => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

How can I delete the lines starting with "//" (e.g., file header) which are at the beginning of a file?

I want to delete the header from all the files, and the header has the lines starting with //.
If I wanted to delete all the lines that start with //, I could do the following:
sed '/^\/\//d'
But that is not what I need. I just need to delete the lines at the beginning of the file that start with //.
Sample file:
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Update:
If there is an empty line at the beginning or in between, it doesn't work. Is there any way to handle that scenario?
Sample file:
< new empty line >
// This is the header
< new empty line >
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Can someone suggest a way to do this? Thanks in advance!
Update: The accepted answer works well for white space in the beginning or in-between.
Could you please try the following? It also handles the empty-line scenario; written and tested at https://ideone.com/IKN3QR
awk '
(NF == 0 || /^[[:blank:]]*\/\//) && !found{
  next
}
NF{
  found=1
}
1
' Input_file
Explanation: if a line is empty or starts with // (optionally preceded by blanks) and the variable found is not yet set, the line is skipped. Once a non-empty line that is not such a comment is seen, found is set, so that line and every following line are printed through the end of Input_file.
With sed:
sed -n '1{:a; /^[[:space:]]*\/\/\|^$/ {n; ba}};p' file
print "Hi"
// This should not be deleted
print "Hello"
Slightly shorter version with GNU sed:
sed -nE '1{:a; /^\s*\/\/|^$/ {n; ba}};p' file
Explanation:
1 { # execute this block on the first line only
:a; # this is a label
/^\s*\/\/|^$/ { n; # on lines matching `^\s*\/\/` or `^$`, do: read the next line
ba } # and go to label :a
}; # end block
p # print line unchanged:
# we only get here after the header or when it's not found
sed -n makes sed not print any lines without the p command.
Edit: updated the pattern to also skip empty lines.
It sounds like you just want to start printing from the first line that's neither blank nor just a comment:
$ awk 'NF && ($1 !~ "^//"){f=1} f' file
print "Hi"
// This should not be deleted
print "Hello"
The above simply sets a flag f when it finds such a line and prints every line from then on. It will work using any awk in any shell on every UNIX box.
Note that, unlike some of the potential solutions posted, it doesn't store more than 1 line at a time in memory and so will work no matter how large your input file is.
It was tested against this input:
$ cat file
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
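The same command can be checked against the question's second sample, where empty lines appear before and between the header comments. The sketch below wraps it in a function; the name strip_header is made up for illustration.

```shell
# Hypothetical wrapper around the flag-based awk command.
strip_header() {
    # f stays unset while lines are blank (NF == 0) or begin with //;
    # the first other line sets f, and from then on every line is printed.
    awk 'NF && ($1 !~ "^//") { f = 1 } f' "$@"
}

printf '\n// This is the header\n\n// This should be deleted\nprint "Hi"\n// This should not be deleted\nprint "Hello"\n' | strip_header
# print "Hi"
# // This should not be deleted
# print "Hello"
```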
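To run the above on many files at once, modifying each file in place, use this with GNU awk (FNR==1{f=0} resets the flag at the start of each file, since awk variables otherwise persist from one file to the next):
awk -i inplace 'FNR==1{f=0} NF && ($1 !~ "^//"){f=1} f' *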
and this with any awk:
ip_awk() { local f t=$(mktemp) && for f in "${@:2}"; do awk "$1" "$f" > "$t" && mv -- "$t" "$f"; done; }
ip_awk 'NF && ($1 !~ "^//"){f=1} f' *
In case perl is available then this may also work in slurp mode:
perl -0777 -pe 's~\A(?:\h*(?://.*)?\R+)+~~' file
\A will only match start of the file and (?:\h*(?://.*)?\R+)+ will match 1 or more lines that are blank or have // with optional leading spaces.
With GNU sed:
sed -i -Ez 's/^((\/\/[^\n]*|\s*)\n)+//' file
The ^((\/\/[^\n]*|\s*)\n)+ expression will match one or more lines starting with //, also matching blank lines, only at the start of the file.
Using ed (the file editor that the stream editor sed is based on),
printf '1,/^[^/]/ g|^\(//.*\)\{0,1\}$| d\nw\n' | ed tmp.txt
Some explanations are probably in order.
ed takes the name of the file to edit as an argument, and reads commands from standard input. Each command is terminated by a newline. (You could also read commands from a here document, rather than from printf via a pipe.)
1,/^[^/]/ addresses the first lines in the file, up to and including the first one that does not start with /. (All the lines you want to delete will be included in this set.)
g|^\(//.*\)\{0,1\}$|d deletes all the addressed lines that are either empty or do start with //.
w saves the changes.
The g command is a bit ugly; unfortunately, ed does not support regular expression operators you may take for granted, like ? or |. Breaking the regular expression down a bit:
^ matches the start of the line.
//.* matches // followed by zero or more characters.
\(//.*\)\{0,1\} matches the preceding regular expression 0 or 1 times (i.e., optionally)
$ matches the end of the line.

perl regex negative-lookbehind detect file lacking final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).
> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb' | test -n "$(tail -c1)" && echo pathological last line
pathological last line
>
One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.
(Recall that the -n0 flag causes perl to "slurp" the entire file as a single record. Thus, there is only one $, the end of the file.)
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
>
Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
>
Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.
/(?<=\n)$/ is a weird and expensive way of doing /\n$/.
/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.
A few approaches:
perl -n0777e'print "pathological last line\n" if !/\n\z/'
perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'
perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'
perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'
The last solution avoids slurping the entire file.
Why does my regex always match, even when the end-of-file is preceded by newline?
Because you mistakenly think that $ only matches at the end of the string. Use \z for that.
Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job.
perl -le'while (!eof()) { $previous = getc(\*ARGV) }
if ($previous ne "\n") { print "pathological last line!" }'
PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:
perl -le 'use File::ReadBackwards;
my $bw = File::ReadBackwards->new(shift @ARGV);
print "pathological last line" if substr($bw->readline, -1) ne "\n"'
That should work efficiently, even on very large files. And when I come back to read it a year later, I will more likely understand it than I would with the regular-expression approach.
The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp.
(I mistakenly thought that "laser focus" on a problem, recommended by stackoverflow, meant editing out the context.)
Thanks to the responses, here is an improved draft of the script:
#!/usr/bin/perl
use strict; use warnings; use 5.18.2;
# Loop slurp:
$/ = undef; # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
s/^\s*$/\n/mg; # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
s/[\n][\n]+/\n\n/g; # exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
s/^[\n]+//; # first line is visible or "nontrivial."
s/[\n]+\z/\n/; # last line is visible or "nontrivial."
print STDOUT;
print "\n" unless m/\n\z/; # IF detect pathological last line, i.e., not ending in LF, THEN append LF.
}
And here is how it works, when named zz.pl. First a messy file, then how it looks after zz.pl gets through with it:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t'
aaa
bb
bash:
bash:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t' | zz.pl
aaa
bb
bash:

Can not replace multiple empty lines with one

Why does the following not replace multiple empty lines with one?
$ cat some_random_text.txt
foo
bar
test
and this does not work:
$ cat some_random_text.txt | perl -pe "s/\n+/\n/g"
foo
bar
test
I am trying to replace the multiple new lines (i.e. empty lines) to a single empty new line but the regex I use for that does not work as you can see in the example snippet.
What am I messing up?
Expected outcome is:
foo
bar
test
The reason it doesn't work is that -p tells perl to process the input line by line, and there's never more than one \n in a single line.
Better idea:
perl -00 -lpe 1
-00: Enable paragraph mode (input records are terminated by any sequence of 2+ newlines).
-l: Enable autochomp mode (the input record separators are trimmed automatically, so since we're in paragraph mode, all trailing newlines are removed, and output records get "\n\n" added).
-p: Enable automatic input/output (the main code is executed for each input record; anything left in $_ is printed automatically).
-e 1: Use a dummy main program that does nothing.
Taken all together this does nothing except normalize paragraph terminators to exactly two newlines.
You are executing the following program:
LINE: while (<>) {
s/\n+/\n/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Since you are reading one line at a time, and since a line is a sequence of characters that aren't line feeds terminated by a line feed, your pattern will never match more than one newline.
The simple fix is to tell Perl to treat the entire file as one line. Also, you don't want to replace every line feed, but just those found in sequence of two or more, and you want to replace the sequence with two line feeds.
perl -0777pe's/\n\n\K\n+//g; s/^\n+//; s/\n\K\n\z//' some_random_text.txt
The second and third substitutions ensure there are no blank lines at the start and end of the file.
While reading the entire file into memory is easy, it's not necessary. The desired output can also be achieved by maintaining a flag that indicates whether the previous line was blank or not.
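perl -ne'if (/\S/) { print "\n" if $f; print; $f=0; $s=1 } else { $f=$s }' some_random_text.txt
This solution also removes blank lines from the start and end of the file ($s stays false until the first non-blank line has been printed, so leading blank lines never set the flag).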
Given:
$ echo "$txt"
foo
bar
test
You can use sed to reduce the runs of blank lines to a single \n:
$ echo "$txt" | sed '/^$/N;/^\n$/D'
foo
bar
test
Even easier, you can use cat -s:
$ echo "$txt" | cat -s # same output
In Perl the idiomatic one-liner uses -00 for paragraph mode:
$ echo "$txt" | perl -00pe0 # same output
And in awk you have the flexibility of using paragraph mode by setting RS= and then set ORS= to what you want the replacement for runs of \n to be:
$ echo "$txt" | awk '1' RS= ORS="\n\n" # same output
ikegami correctly states that printf 'a\n\n' | ... will produce two trailing newlines with these solutions. That may or may not be an issue.

Perl one liner to remove multiple line

input file is
<section_begin> mxsqlc
*** WARNING[13052] Cursor C is not fetched.
<section_end>
<section_begin> b2.lst
*
*** WARNING[13052] Cursor C is not fetched.
0 errors, 1 warnings in SQL C file "b2.ppp".
<section_end>
<section_begin> b2s0
SQLCODE=0
SQLSTATE=00000
a=10, b=abc, c=20
SQLCODE=0
SQLSTATE=00000
a=10, b=abc , c=10, d=xyz
<section_end>
The expected output is the same file without the lines below:
<section_end>
<section_begin> b2s0
my code is
perl -ne 'print unless /^\<section_end\>(\s*|.*lst)?\s*$/' b2exp
It removes all <section_end> lines and doesn't remove this line <section_begin> *.lst
keep it simple
perl -ne 'print unless /^\<section_/' b2exp
bit more complicated
perl -ne 'print unless /^\<section_(end|begin)\>/' b2exp
Ah, your question isn't clear (to me; perhaps it really is).
I now read it as:
"I have some sections marked out with <section_begin> tagname
at the start and <section_end> at the end.
I wish to exclude the sections with a particular tagname, b2s0 in the example. I wish to keep all other lines."
perl -ne 'BEGIN {$p=1} $p=0 if /section_begin.*b2s0/; print if $p; $p=1 if /<section_end>/;' ex.txt
If the intention is to merge the section with lst with the next section (and remove the stuff on the same line after the next section's begin tag), I would go with Awk instead.
awk '/<section_end>/ && lst { next }
/<section_begin>/ && lst { lst=0; next }
/<section_begin>.*lst/ {lst=1}
1' b2exp
The same thing can be done in Perl, of course; the simplest one-liner with perl -0777 -pe 's/.../.../s' file will be a lot less memory-efficient because it reads the whole file into memory.
perl -0777 -pe 's%(<section_begin>[^\n]*lst.*?)\n<section_end>\n<section_begin>[^\n]*%$1%s' b2exp
This will read the whole file into memory (-0777) and replace the multi-line regexp. The non-greedy match .*? will make the match as short as possible, i.e. not span past a match on the rest of the pattern (newline, end tag, newline, begin tag optionally followed by non-newline data). We also take care to use [^\n] where we want to keep matches on the same line, since the /s flag turns . into a wildcard which can also match newlines.