Need to join certain lines based on multiple-line pattern match

Need to join certain lines based on multiple-line pattern match - regex

I have a file that looks like this:
2014-05-01 00:30:45,511
ZZZ|1|CE|web1||etc|etc
ZZZ|1|CE|web2||etc|etc
ZZZ|1|CE|web3|asd|SDAF
2014-05-01 00:30:45,511
ZZZ|1|CE|web1||etc|etc
ZZZ|1|CE|web2||etc|etc
ZZZ|1|CE|web3|asd|SDAF
I want to convert this into 2 lines by replacing the newlines followed by certain patterns with pipes. I want:
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF
I am trying multiline match with perl:
cat file | perl -pe 's/\nZZZ/\|ZZZ/m'
but this does not match.
I can do perl -pe 's/\n//m' but that is too much; I need to match '\nZZZ' so that only lines beginning with ZZZ are joined to the previous line.

You just need to indicate slurp mode using the -0777 switch because you're using a regular expression that's trying to match across multiple lines.
The full solution:
perl -0777 -pe 's/\n(?=ZZZ)/|/g' file
Explanation:
Switches:
-0777: slurp files whole
-p: Creates a while(<>){...; print} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
Code:
s/\n(?=ZZZ)/|/g: Replace any newline that is followed by ZZZ with a |

Try this if you want to avoid slurp mode:
perl -pe 'chomp unless eof; /\|/ and s/^/|/ or $.>1 and s/^/\n/' filename.txt
Add a record separator to the beginning of the line if it contains record separators.
Otherwise start a new line if we are past the first line.
Keep the new line at the end of the file.

I would suggest using a Lookahead, which does not kill your ZZZ Part
cat file | perl -pe 's/(\n(?=ZZZ))/|/gm'
EDIT: Online Demo

This is a pretty standard pattern. It looks like this. The path to the input file is expected as a parameter on the command line
use strict;
use warnings;
my $line;
while (<>) {
chomp;
if ( /^ZZZ/ ) {
$line .= '|' . $_;
}
else {
print $line, "\n" if $line;
$line = $_;
}
}
print $line, "\n" if $line;
output
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF

Related

Adding blank line spaces before and after pattern 'string' match

I am trying to add 5 blank line spaces in a text file (text.txt) before and after string pattern matches. I used the following to get spaces after the 'string' match which worked for me-
sed '/string/{G;G;G;G;G;}' text.txt
I want to apply the same sed command to obtain 5 blank lines before the 'string' Here I don't want spaces, but rather blank lines before and after them. Any suggestions?

sed -r 's/(^.*)(string)(.*$)/\1\n\n\n\n\n\2\n\n\n\n\n\3/' text.txt
Use -r or -E to allow regular expressions, split likes into three sections and then substitute the line for the first section, 5 new lines, the second section, 5 new lines and then finally the third section.

Use this Perl one-liner:
perl -pe 's/string/\n\n\n\n\n$&\n\n\n\n\n/' text.txt
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
s/PATTERN/REPLACEMENT/ : change PATTERN to REPLACEMENT.
$& : matched pattern.
\n : newline character.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

For a single string match:
$ sed -e '/string/{ s/^/\n\n\n\n\n/; s/$/\n\n\n\n\n/ }' text.txt
For multiple strings, assuming same requirements:
$ sed -E '/(string1|string2|string3)/{ s/^/\n\n\n\n\n/; s/$/\n\n\n\n\n/ }' text.txt

This might work for you:
sed '/string/{G;s/\(string\)\(.*\)\(.\)/\3\3\3\3\3\1\3\3\3\3\3\2/}' file
Match on string, append an empty line, pattern match using the newline to separate the match by 5 lines either side.

And an awk version:
awk '{if(/string1|string2|.../){printf "\n\n\n\n\n%s\n\n\n\n\n",$0}else{print}}' file

using sed to delete lines containing slashes /

I know in some circumstances, other characters besides / can be used in a sed expression:
sed -e 's.//..g' file replaces // with the empty string in file since we're using . as the separator.
But what if you want to delete lines matching //comment in file?
sed -e './/comment.d' file returns
sed: -e expression #1, char 1: unknown command: `.'

You can use still use alternate delimiter:
sed '\~//~d' file
Just escape the start of delimeter once.

To delete lines with comments, select from these Perl one-liners below. They all use m{} form of regex delimiters instead of the more commonly used //. This way, you do not have to escape slashes like so: \/, which makes a double slash look less readable: /\/\//.
Create an example input file:
echo > in_file \
'no comment
// starts with comment
// starts with whitespace, then has comment
foo // comment is anywhere in the line'
Remove the lines that start start with comment:
perl -ne 'print unless m{^//}' in_file > out_file
Output:
no comment
// starts with whitespace, then has comment
foo // comment is anywhere in the line
Remove the lines that start with optional whitespace, followed by comment:
perl -ne 'print unless m{^\s*//}' in_file > out_file
Output:
no comment
foo // comment is anywhere in the line
Remove the lines that have a comment anywhere:
perl -ne 'print unless m{//}' in_file > out_file
Output:
no comment
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlrequick: Perl regular expressions quick start

Omit words in a text

Let's say I have this file (file.txt):
Hello my name is Giorgio,
I would like to go with you
to the cinema my friend
I want to exclude from the text the words: my, is and I (not the whole line).
The words are in a file (words.txt) like this:
my
is
I
So the output must be:
Hello name Giorgio,
would like to go with you
to the cinema friend
How can this be performed?

You can use sed to turn words.txt into a sed script:
sed 's=^=s/=;s=$=//g=' words.txt | sed -f- file.txt
The difference to the expected output is the whitespace: removing a word doesn't squeeze the surrounding whitespace.
To match only whole words, add the word boundaries \b:
s=^=s/\\b=;s=$=\\b//g=
Perl solution that also squeezes the spaces (and doesn't care about meta characters):
#!/usr/bin/perl
use warnings;
use strict;
open my $WORDS, '<', 'words.txt' or die $!;
my %words;
chomp, $words{$_} = q() while <$WORDS>;
open my $TEXT, '<', 'file.txt' or die $!;
while (<$TEXT>) {
s=( ?\b(\S+)\b ?)=$words{$2} // $1=ge;
print;
}

Pretty scruffy version in awk. If the list of words contains meta characters then this will die.It does take into account word boundaries though, so won't match in the middle of words.
awk 'FNR==NR{a[$1];next}
{for(i in a)gsub("(^|[^[:alpha:]])"i"([^[:alpha:]]|$)"," ")}1' {words,file}.txt
Hello name Giorgio,
would like to go with you
to the cinema friend
It saves the words from the first file into array a.
In the next file for each word saved it simply removes that word from the line using alpha(All alphabetic characters) and the line beginning and end to ensure the word is a complete word. 1 prints the line.

This should do it:
#!/bin/bash
cp file.txt newfile.txt # we will change newfile.txt in place
while IFS= read -r line;do
[[ $line != "" ]] && sed -i "s/\b$line[[:space:]]*//g" newfile.txt
done <words.txt
cat newfile.txt
Or modifying #choroba's sed solution:
sed 's=^=s/\\b=;s=$=[[:space:]]*//g=' words.txt | sed -f- file.txt
Both of the above will strip spaces (if any) from the end of matching string.
Output:
Hello name Giorgio,
would like to go with you
to the cinema friend #There's a space here (after friend)

Problem with perl multiline matching

I'm trying to use a perl one-liner to update some code that spans multiple lines and am seeing some strange behavior. Here's a simple text file that shows the problem I'm seeing:
ABCD START
STOP EFGH
I expected the following to work but it doesn't end up replacing anything:
perl -pi -e 's/START\s+STOP/REPLACE/s' input.txt
After doing some experimenting I found that the \s+ in the original regex will match the newline but not any of the whitespace on the 2nd line, and adding a second \s+ doesn't work either. So for now I'm doing the following workaround, which is to add an intermediate regex that only removes the newline:
perl -pi -e 's/START\s+/START/s' input.txt
This creates the following intermediate file:
ABCD START STOP EFGH
Then I can run the original regex (although the /s is no longer needed):
perl -pi -e 's/START\s+STOP/REPLACE/s' input.txt
This creates the final, desired file:
ABCD REPLACE EFGH
It seems like the intermediate step should not be necessary. Am I missing something?

You were close. You need either -00 or -0777:
perl -0777 -pi -e 's/START\s+/START/' input.txt

perl -p processes the file one line at a time. The regex you have is correct, but it is never matched against the multi-line string.
A simple strategy, assuming the file will fit in memory, is to read the whole thing (do this without -p):
$/ = undef;
$file = <>;
$file =~ s/START\s+STOP/REPLACE/sg;
print $file;
Note, I have added the /g modifier to specify global replacement.
As a shortcut for all that extra boilerplate, you can use your existing script with the -0777 option: perl -0777pi -e 's/START\s+STOP/REPLACE/sg'. Adding /g is still needed if you may need to make multiple replacements within the file.
A hiccup that you might run into, although not with this regex: if the regex were START.+STOP, and a file contains multiple START/STOP pairs, greedy matching of .+ will eat everything from the first START to the last STOP. You can use non-greedy matching (match as little as possible) with .+?.
If you want to use the ^ and $ anchors for line boundaries anywhere in the string, then you also need the /m regex modifier.

A relatively simple one-liner (reading the file in memory):
perl -pi -e 'BEGIN{undef $/;} s/START\s+STOP/REPLACE/sg;' input.txt
Another alternative (not so simple), not reading the file in memory:
perl -ni -e '$a.=$_; \
if ( $a =~ s/START\s+STOP/REPLACE/s ) { print $a; $a=""; } \
END{$a && print $a}' input.txt

perl -MFile::Slurp -e '$content = read_file(shift); $content =~ s/START\s+STOP/REPLACE/s; print $content' input.txt

Here's a one-liner that doesn't read the entire file into memory at once:
perl -i -ne 'if (($x = $last . $_) =~ s/START\n\s*STOP/REPLACE/) \
{ print $x; $last = ""; } else { print $last; $last = $_; } \
print $last if eof ARGV' input.txt

How do I display data from the beginning of a file until the first occurrence of a regular expression?

How do I display data from the beginning of a file until the first occurrence of a regular expression?
For example, if I have a file that contains:
One
Two
Three
Bravo
Four
Five
I want to start displaying the contents of the file starting at line 1 and stopping when I find the string "B*". So the output should look like this:
One
Two
Three

perl -pe 'last if /^B/' source.txt
An explanation: the -p switch adds a loop around the code, turning it into this:
while ( <> ) {
last if /^B.*/; # The bit we provide
print;
}
The last keyword exits the surrounding loop immediately if the condition holds - in this case, /^B/, which indicates that the line begins with a B.

if its from the start of the file
awk '/^B/{exit}1' file
if you want to start from specific line number
awk '/^B/{exit}NR>=10' file # start from line 10

sed -n '1,/^B/p'
Print from line 1 to /^B/ (inclusive). -n suppresses default echo.
Update: Opps.... didn't want "Bravo", so instead the reverse action is needed ;-)
sed -n '/^B/,$!p'
/I3az/

sed '/^B/,$d'
Read that as follows: Delete (d) all lines beginning with the first line that starts with a "B" (/^B/), up and until the last line ($).

Some of the sed commands given by others will continue to unnecessarily process the input after the regex is found which could be quite slow for large input. This quits when the regex is found:
sed -n '/^Bravo/q;p'

in Perl:
perl -nle '/B.*/ && last; print; ' source.txt

Just sharing some answers I've received:
Print data starting at the first line, and continue until we find a match to the regex, then stop:
<command> | perl -n -e 'print "$_" if 1 ... /<regex>/;'
Print data starting at the first line, and continue until we find a match to the regex, BUT don't display the line that matches the regular expression:
<command> | perl -pe '/<regex>/ && exit;'
Doing it in sed:
<command> | sed -n '1,/<regex>/p'

Your problem is a variation on an answer in perlfaq6: How can I pull out lines between two patterns that are themselves on different lines?.
You can use Perl's somewhat exotic .. operator (documented in perlop):
perl -ne 'print if /START/ .. /END/' file1 file2 ...
If you wanted text and not lines, you would use
perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
But if you want nested occurrences of START through END, you'll run up against the problem described in the question in this section on matching balanced text.
Here's another example of using ..:
while (<>) {
$in_header = 1 .. /^$/;
$in_body = /^$/ .. eof;
# now choose between them
} continue {
$. = 0 if eof; # fix $.
}

Here is a perl one-liner:
perl -pe 'last if /B/' file

If Perl is a possibilty, you could do something like this:
% perl -0ne 'if (/B.*/) { print $`; last }' INPUT_FILE

one liner with basic shell commands:
head -`grep -n B file|head -1|cut -f1 -d":"` file

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Need to join certain lines based on multiple-line pattern match - regex

I would suggest using a Lookahead, which does not kill your ZZZ Part cat file | perl -pe 's/(\n(?=ZZZ))/|/gm' EDIT: Online Demo

Related

Adding blank line spaces before and after pattern 'string' match

using sed to delete lines containing slashes /

Omit words in a text

Problem with perl multiline matching

How do I display data from the beginning of a file until the first occurrence of a regular expression?

Categories

Resources