Split a string on a pattern using bash and/or awk - regex

I have a file that is formatted like
file header string(s)
"section title" : [status]
unknown
text
"next section" : [different_status]
different
amount of
strings
I want to break this into sections such as
file header string(s)
and
"section title" : [status]
unknown
text
and
"next section" : [different_status]
different
amount of
strings
though it isn't critical to capture that header string.
As you can see, the pattern I can depend on for splitting is
"string in quotes" : [string in square brackets]
This delimiting string needs to also be captured.
What is a simple way to do this within a bash script? I predict something in awk will do it, but my awk-fu is weak.

Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
my $output = 0;
open my $OUT, '>', "section-$output" or die $!;
while (<>) {
if (/"[^"]*" : \[[^\]]*\]/) {
$output++;
open $OUT, '>', "section-$output" or die $!;
}
print {$OUT} $_;
}

This does the trick in pure Bash:
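#!/bin/bash
# Keep the regex in a variable: in bash 3.2+ a quoted right-hand side of =~ is matched literally.
re='^"[^"]*" : \[[^]]*\]'
i=0
while IFS= read -r line; do
[[ $line =~ $re ]] && (( ++i ))
(( i > 0 )) && echo "SECTION_$i: $line"
done < "$1"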
Update: improved regex.

Should be a one-liner in awk. Assuming I'm interpreting your dividing lines correctly, what about this?
awk '/^"[^"]+" : \[[^]]+\]$/ { printf("\n"); } 1' inputfile > outputfile
The "1" at the end is a shortcut that says "print the current line". The condition and expression pair before it will insert the blank if the current line matches the pattern.
You could alternately do the same thing in a sed one-liner:
sed -r '/^"[^"]+" : \[[^]]+\]$/{x;p;x;}' inputfile > outputfile
This uses the magic of sed's "hold space". You can man sed for details of how x works.
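The awk one-liner above only inserts a blank line before each delimiter. If the sections should end up in separate files instead, a minimal sketch could look like the following. The function name split_sections and the section-N file names are my own assumptions, not something from the question; any header lines before the first delimiter go to section-0.

```shell
#!/bin/bash
# Hypothetical helper: split_sections INPUT OUTDIR
# Header lines go to OUTDIR/section-0, each delimited section to section-1, section-2, ...
split_sections() {
    awk -v dir="$2" '
        # A delimiter line closes the current file and starts the next section.
        /^"[^"]+" : \[[^]]+\]$/ { close(out); n++ }
        # Every line, including the delimiter itself, is written to the current file.
        { out = dir "/section-" (n + 0); print > out }
    ' "$1"
}
```

Usage would be something like split_sections inputfile /tmp/sections, with the output directory already existing.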


Find-Replace Multiple Occurrences of a string and append iterating number

How can I iterate over the code of an HTML file, find certain recurring text, and append a word and an incrementing number to it?
So:
<!-- TemplateBeginEditable -->
<!-- TemplateBeginEditable -->
<!-- TemplateBeginEditable -->
etc...
Becomes :
<!-- TemplateBeginEditable Event=1 -->
<!-- TemplateBeginEditable Event=2 -->
<!-- TemplateBeginEditable Event=3 -->
etc...
I tried Perl, thinking it would be the easiest/fastest, then went to jQuery and back to Perl.
It seems simple enough to find/replace many ways with REGEX and return an array of the occurrences, but getting the iterating variable tacked on proves to be more of a challenge.
Latest Example of what I have tried:
#!/usr/bin/perl -w
# Open input file
open INPUTFILE, "<", $ARGV[0] or die $!;
# Open output file in write mode
open OUTPUTFILE, ">", $ARGV[1] or die $!;
# Read the input file line by line
while (<INPUTFILE>) {
my @matches = ($_ =~ m/TemplateBeginEditable/g);
### what do I do with the matches array? ###
$_ =~ s/TemplateBeginEditable/TemplateBeginEditable Event=/g;
print OUTPUTFILE $_;
}
close INPUTFILE;
close OUTPUTFILE;
To perform a replacement, you don't need to match the pattern before, you can directly perform the replacement. Example with your code:
while (<INPUTFILE>) {
s/TemplateBeginEditable/TemplateBeginEditable Event=/g;
print OUTPUTFILE $_;
}
Now to add a counter incremented at each replacement, you can put a piece of code in the pattern itself using this syntax:
my $i;
while (<INPUTFILE>) {
s/TemplateBeginEditable(?{ ++$i })/TemplateBeginEditable Event=$i/g;
print OUTPUTFILE $_;
}
To make it shorter you can use the \K feature to change the start of the match result:
while (<INPUTFILE>) {
s/TemplateBeginEditable\K(?{ ++$i })/ Event=$i/g;
print OUTPUTFILE $_;
}
Or with a one-liner:
perl -pe 's/TemplateBeginEditable\K(?{++$i})/ Event=$i/g' file > output
If you have awk available, and the target text only occurs at most once per line, then Perl is overkill I think:
awk 'BEGIN{n=1}{n+=sub("TemplateBeginEditable","& Event="n)}1'
Some explanation: The sub function returns the number of substitutions performed (0 or 1); the & means "whatever matched"; "..."n is string concatenation (no operator in awk); the 1 is a "true" condition that invokes the default "action" of {print}.
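If the target text can occur more than once on a line, sub alone is not enough, since it only replaces the first occurrence. A possible sketch (the function name number_events is made up for illustration) walks each line with match and substr, numbering every occurrence as it goes:

```shell
#!/bin/bash
# Hypothetical helper: reads stdin, appends " Event=N" after every occurrence.
number_events() {
    awk '
    {
        out = ""
        # match() sets RSTART and RLENGTH for the leftmost occurrence in what is left of the line.
        while (match($0, /TemplateBeginEditable/)) {
            out = out substr($0, 1, RSTART + RLENGTH - 1) " Event=" (++n)
            $0 = substr($0, RSTART + RLENGTH)   # continue after the occurrence just handled
        }
        print out $0
    }'
}
```

It would be called as number_events < file > output.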
Expanding on my one-liner in the comments:
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift or die "Usage: $0 <filename>\n";
open my $fh, '<', $file or die "Cannot open $file: $!\n";
open my $ofh, '>', "$file.modified" or die "Cannot open $file.modified: $!\n";
my $i = 1;
while (my $line = <$fh>) {
if ($line =~ s/TemplateBeginEditable/$& Event=$i/) {
$i++;
}
print $ofh $line;
}
__END__
Note that this assumes you will never have more than one instance of your desired text on a single line, as shown in your sample input.
I'd just do:
local $/=undef;
my $content = <FH>;
my $x = 0;
$content =~ s/(My expected pattern)/$1 . " time=" . (++$x)/ge;

How to extract all text of the form "<key>=<value>" from a log file

Hi, I have a requirement where I need to pull text of the form <key>=<value> from a large log file.
The log file consists of data like this:
[accountNumber=0, email=tom.cruise@gmail.com, firstName=Tom, lastName= , message=Hello How are you doing today ?
The output I expect is:
accountNumber=0
email=tom.cruise@gmail.com
firstName=Tom
etc.
Can anyone please help ? Also please explain the solution so that I can extend it to cater to my similar needs.
I wrote a one-liner for this:
perl -nle 's/^\[//; for (split(/,/)){s/(?:^\s+|\s+$)//g; print}' input.txt
I also made another line of input to test with:
Matt@MattPC ~/perl/testing/13
$ cat input.txt
[accountNumber=0, email=tom.cruise@gmail.com, firstName=Tom, lastName= , message=Hello How are you doing today ?
[accountNumber=2, email=john.smith@gmail.com, firstName=John, lastName= , message=What is up with you?
Here is the output:
Matt@MattPC ~/perl/testing/13
$ perl -nle 's/^\[//; for (split(/,/)){s/(?:^\s+|\s+$)//g; print}' input.txt
accountNumber=0
email=tom.cruise@gmail.com
firstName=Tom
lastName=
message=Hello How are you doing today ?
accountNumber=2
email=john.smith@gmail.com
firstName=John
lastName=
message=What is up with you?
Explanation:
Expanded code:
perl -nle '
s/^\[//;
for (split(/,/)){
s/(?:^\s+|\s+$)//g;
print
}'
input.txt
Line by line explanation:
perl -nle calls perl with the command line options -n, -l, and -e. The -n adds a while loop around the program like this:
LINE:
while (<>) {
... # your program goes here
}
The -l adds a newline at the end of every print. And the -e specifies my code which will be in single quotes (').
s/^\[//; removes the first [ if there is one. This searches and replaces on $_ which is equal to the line.
for (split(/,/)){ begins the for loop, which iterates over the list returned by split(/,/). Called with just one argument, split operates on $_ and splits on commas. $_ was the whole line, but inside the for loop $_ is set to the element currently being processed.
s/(?:^\s+|\s+$)//g; this line removes leading and trailing white space.
print will print $_ followed by a newline, which is our string=value.
}' close the for loop and finish the '.
input.txt provide input to the program.
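For comparison, the same three steps (drop the leading bracket, split on commas, trim whitespace) can be sketched without Perl using sed and tr. The function name extract_pairs is hypothetical, and like the one-liner above it assumes values never contain commas:

```shell
#!/bin/bash
# Hypothetical helper: extract_pairs [FILE...]  (reads stdin when no file is given)
extract_pairs() {
    # 1. remove a leading "[", 2. put each comma-separated field on its own line,
    # 3. strip leading and trailing whitespace from every field.
    sed 's/^\[//' "$@" | tr ',' '\n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}
```

Usage: extract_pairs input.txt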
Going off your specific data and desired output, you could try the following:
use strict;
use warnings;
open my $fh, '<', 'file.txt' or die "Can't open file $!";
my $data = do { local $/; <$fh> };
my @matches = $data =~ /(\w+=\S+),/g;
print join "\n", @matches;
Perl One-Liner
Use this:
perl -0777 -ne 'while(m/[^ ,=]+=[^,]*/g){print "$&\n";}' yourfile
Assuming that each line of the log ends with a closing square bracket, you can use this:
#!/usr/bin/perl
use strict;
use warnings;
my $line = '[accountNumber=0, email=tom.cruise@gmail.com, firstName=Tom, lastName= , message=Hello How are you doing today ?]';
while($line =~ /([^][,\s][^],]*?)\s*[],]/g) {
print $1 . "\n";
}

sed delete 1st line and remove leading/trailing white spaces

I am trying to delete the first line and remove leading and trailing whitespace from the subsequent lines using sed.
If I have something like
line1
line2
line3
It should print
line2
line3
So I tried this command on unix shell:
sed '1d;s/^ [ \t]*//;s/[ \t]*$//' file.txt
and it works as expected.
When I try the same in my perl script:
my @templates = `sed '1d;s/^ [ \t]*//;s/[ \t]*$//' $MY_FILE`;
It gives me this message "sed: -e expression #1, char 10: unterminated `s' command" and doesn't print anything. Can someone tell me where I am going wrong
Why would you invoke Sed from Perl anyway? Replacing the sed with the equivalent Perl code is just a few well-planned keystrokes.
my @templates;
if (open (M, '<', $MY_FILE)) {
@templates = map { s/(?:^\s*|\s*$)//g; $_ } <M>;
shift @templates; # drop the first line
close M;
} else {
# die horribly?
}
The backticks work like double-quotes. Perl interpolates variables inside them, as you already know due to your use of $MY_FILE. What you may not know is that $/ is actually a variable, the input record separator (by default a newline character). The same is true for the backslashes before the tab character. Here Perl will interpret \t for you and replace it with the tab character. You'll need a second backslash so that sed sees \t instead of an actual tab character. The latter might work as well, though.
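If the sed command has to stay, one way to sidestep the \t question entirely is to use a POSIX character class instead of an escape sequence. This is only a sketch of the shell side (the function name strip_lines is made up); inside Perl backticks the $ before the closing slash would still need escaping, as described above.

```shell
#!/bin/bash
# Hypothetical helper: strip_lines [FILE...]  (reads stdin when no file is given)
strip_lines() {
    # 1d drops the first line; the two s commands trim leading and trailing blanks.
    # [[:blank:]] matches space and tab, so no backslash escapes are involved.
    sed -e '1d' -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//' "$@"
}
```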
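Consider using a safe pipe open instead of backticks to avoid problems with escaping. For example:
my @templates = do {
# "-|" reads from the command; the list form bypasses the shell, so nothing needs escaping
open my $fh, "-|", 'sed', '1d;s/^ [ \t]*//;s/[ \t]*$//', $MY_FILE
or die $!;
<$fh>; # list context: one element per output line, like the backticks
};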
You have a typo in your expression. You need a semicolon between the 2 substitution statements. You should use the following instead:
my @templates = `sed '1d;s/^ [ \\t]*//;s/[ \\t]*\$//' $MY_FILE`;
escaping $ and \ as suggested in the other answer. I should note that it also worked for me without escaping \ as it was replaced by a literal tab.
As others have mentioned, I would recommend you do this only in Perl, or only in Sed, because there's really no reason to use both for this task. Using Sed in Perl will mean you have to worry about escaping, quoting and capturing the output (unless reading from a pipe) somehow. Obviously, all that complicates things and it also makes the code very ugly.
Here is a Perl one-liner that will handle your reformatting:
perl -le 'my $line = <>; while (<>) { chomp; s/^\s+|\s+$//g; print $_; }' file.txt
Basically, you just take the first line and store in a variable that won't be used, then process the rest of the lines. Below is a small script version that you can add to your existing script.
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "$0 infile";
my $infile = shift or die $usage;
open my $in, '<', $infile or die "Could not open file: $infile";
my $first = <$in>;
while (<$in>) {
chomp;
s/^\s+|\s+$//g; # /g is needed so that both the leading and the trailing match are removed
# process your data here, or just print...
print $_, "\n";
}
close $in;
This can also be done with awk:
awk 'NR>1 {$1=$1;print}' file
line2
line3

change number to english Perl

Hi, can you check my script and tell me where the problem is? Sorry, I'm new to Perl. I want to convert numbers to English words, for example 1400 -> one thousand four hundred. I am already using
Lingua::EN::Numbers qw(num2en num2en_ordinal);
this is my input file.txt
I have us dollar 1200
and the output should be. "I have us dollar one thousand two hundred"
this is my script
#!/usr/bin/perl
use utf8;
use Lingua::EN::Numbers qw(num2en num2en_ordinal);
if(! open(INPUT, '< snuker.txt'))
{
die "cannot opent input file: $!";
}
select OUTPUT;
while($lines = <INPUT>){
$lines =~ s/usd|USD|Usd|uSd|UsD/us dollar/g;
$lines =~ s/\$/dollar /g;
$lines =~ s/rm|RM|Rm|rM/ringgit malaysia /g;
$lines =~ s/\n/ /g;
$lines =~ s/[[:punct:]]//g;
$lines =~ s/(\d+)/num2en($lines)/g; #this is where it should convert to english words
print lc($lines); #print lower case
}
close INPUT;
close OUTPUT;
close STDOUT;
the output i got is "i have us dollar num2en(i have us dollar 1200 )"
thank you
You need to refer to the capture using $1 instead of passing the $lines in your last regex where you also need an e flag at the end so that it is evaluated as an expression. You can use i flag to avoid writing all combinations of [Uu][Ss][Dd]...:
while($lines = <INPUT>){
$lines =~ s/usd/us dollar/ig;
$lines =~ s/\$/dollar /g;
$lines =~ s/rm/ringgit malaysia /ig;
$lines =~ s/\n/ /g;
$lines =~ s/[[:punct:]]//g;
$lines =~ s/(\d+)/num2en($1)/ge; #this is where it should convert to english words
print lc($lines), "\n"; #print lower case
}
You’re missing the e modifier on the regex substitution:
$ echo foo 42 | perl -pe "s/(\d+)/\$1+1/g"
foo 42+1
$ echo foo 42 | perl -pe "s/(\d+)/\$1+1/ge"
foo 43
See man perlop:
Options are as with m// with the addition of the following replacement
specific options:
        e    Evaluate the right side as an expression.
Plus you have to refer to the captured number ($1), not the whole string ($lines), but I guess you have already caught that.
The problem here is that you are confusing regexps with functions. In the line where you try to do the conversion, you're not calling the function num2en; instead, you're replacing the number with the literal text num2en($lines). Here's a suggestion for you:
my ($text, $number) = $lines =~ /^(\D*)(\d+)/; # split the line into a text part and a number part
print lc($text . num2en($number)); # print first the text, then the converted number

Using Perl, how can I replace newlines with commas?

I gave up on sed and I've heard it is better in Perl.
I would like a script that can be called from the 'unix' command line and converts DOS line endings CRLF from the input file and replaces them with commas in the output file:
like
myconvert infile > outfile
where infile was:
1
2
3
and would result in outfile:
1,2,3
I would prefer more explicit code with some minimal comments over "the shortest possible solution", so I can learn from it, I have no perl experience.
In shell, you can do it in many ways:
cat input | xargs echo | tr ' ' ,
or
perl -pe 's/\r?\n/,/' input > output
I know you wanted this to be longer, but I don't really see the point of writing multi line script to solve such simple task - simple regexp (in case of perl solution) is fully workable, and it's not something artificially shortened - it's the type of code that I would use on daily basis to solve the issue at hand.
#!/bin/perl
while(<>) { # Read from stdin one line at a time
s:\r\n:,:g; # Replace CRLF in current line with comma
print; # Write out the new line
}
use strict;
use warnings;
my $infile = $ARGV[0] or die "$0 Usage:\n\t$0 <input file>\n\n";
open(my $in_fh , '<' , $infile) or die "$0 Error: Couldn't open $infile for reading: $!\n";
my $file_contents;
{
local $/; # slurp in the entire file. Limit change to $/ to enclosing block.
$file_contents = <$in_fh>
}
close($in_fh) or die "$0 Error: Couldn't close $infile after reading: $!\n";
# change DOS line endings to commas
$file_contents =~ s/\r\n/,/g;
$file_contents =~ s/,$//; # get rid of last comma
# finally output the resulting string to STDOUT
print $file_contents . "\n";
Your question text and example output were not consistent. If you're converting all line endings to commas, you will end up with an extra comma at the end, from the last line ending. But your example shows only commas between the numbers. I assumed you wanted the code output to match your example and that the question text was incorrect. However, if you want the last comma, just remove the line with the comment "get rid of last comma".
If any command is not clear, http://perldoc.perl.org/ is your friend (there is a search box at the top right corner).
It's as simple as:
tr '\n' , <infile >outfile
Avoid slurping, don't tack on a trailing comma and print out a well-formed text file (all lines must end in newlines):
#!/usr/bin/perl
use strict;
use warnings;
my $line = <>;
while ( 1 ) {
my $next = <>;
s{(?:\015\012?|\012)+$}{} for $line, $next;
if ( length $next ) {
print $line, q{,};
$line = $next;
}
else {
print $line, "\n";
last;
}
}
__END__
Personally I would avoid having to look a line ahead (as in Sinar's answer). Sometimes you need to but I have sometimes done things wrong in processing the last line.
use strict;
use warnings;
my $outputcomma = 0; # No comma before first line
while ( <> )
{
print ',' if $outputcomma ;
$outputcomma = 1 ; # output commas from now on
s/\r?\n$// ;
print ;
}
print "\n" ;
BTW: In sed, it would be:
sed ':a;{N;s/\r\n/,/;ba}' infile > outfile
with Perl
$\ = "\n"; # set output record separator
$, = ',';
$/ = "\n\n";
while (<>) {
chomp;
@f = split('\s+', $_);
print join($,, @f);
}
in unix, you can also use tools such as awk or tr
awk 'BEGIN{OFS=",";RS=""}{$1=$1}1' file
or
tr "\n" "," < file
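Note that the plain tr variants leave a trailing comma and keep the carriage returns of a DOS file. A small sketch that handles both, assuming the standard tr and paste utilities are available (the function name lines_to_csv is invented here):

```shell
#!/bin/bash
# Hypothetical helper: lines_to_csv FILE
lines_to_csv() {
    # tr -d '\r' drops the CR of every CRLF pair;
    # paste -s joins all lines into one, using the -d delimiter between them only.
    tr -d '\r' < "$1" | paste -sd, -
}
```

Usage: lines_to_csv infile > outfile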