Regular expression to extract everything between braces in unix - regex

I am trying to extract everything between braces from a text file and write the output to another text file. I was able to construct a regular expression to match everything between {} and it works fine (I wrote a simple Java program to test it), but I am not very strong in Unix, so I am not sure how to use this regular expression there.
The below regular expression matches everything between {} (works in jedit too)
\{([^}]+)\}
I tried the below sed command,
cat samplefile | sed -e 's/.*\{\([^}]+\)\}.*/\1/g'
I am getting the below error.
sed: -e expression #1, char 24: Invalid preceding regular expression
By the way, I found a regex to match everything between [] and it works great. I am not sure where I am going wrong. Can someone help me fix my regex issue?
cat file |sed -e 's/.*\[\([^]]*\)\].*/\1/g'
Edit 1:
Solution:
cat file | sed -e 's/.*{\([^}]\+\)}.*/\1/g' --> works

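You must escape the + quantifier: in sed's basic regular expressions a bare + is a literal plus sign, and GNU sed accepts \+ for "one or more". The braces, by contrast, must not be escaped, because \{ begins an interval expression in basic regular expressions, which is what triggers the "Invalid preceding regular expression" error.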
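For comparison, here is a minimal sketch of the same extraction in Perl, where + is a quantifier as written and only the literal braces are escaped. The helper name extract_braced and the sample string are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every non-empty {...} group.
# [^}]+ needs at least one character, so an empty {} pair is skipped.
sub extract_braced {
    my ($text) = @_;
    my @found = $text =~ /\{([^}]+)\}/g;
    return @found;
}

my $sample = "a {first one} b {} c {second\none} d";
for my $item (extract_braced($sample)) {
    (my $flat = $item) =~ s/\n/ /g;    # join groups that span lines
    print "$flat\n";
}
```

Unlike the sed command, which keeps one group per line, the /g match in list context returns every group in the string.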

I know that you already solved the question, but most Unix machines have a Perl interpreter, and that language has a built-in module that does the hard job of parsing text with this kind of delimiter: Text::Balanced. Here is a test:
Assuming this random text (taken from the question :-), added some text between curly braces and saved as infile file:
I am trying to extract {everything between braces} from a text file and
write the output to another text file. I was able to {construct a regular
expression} to match everything between {} and it works fine (I wrote a
simple {java program} to test it) but I not {very strong} in unix hence not
sure how to use this regular expression in unix.
The program script.pl:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::Balanced qw<extract_bracketed>;
my $str = do { undef $/; <> };
while ( my @result = extract_bracketed( $str, '{}', '[^{]*' ) ) {
last unless defined $result[0];
$result[0] =~ s/\n//g;
$result[0] = substr $result[0], 1, length( $result[0] ) - 2;
printf qq|%s\n|, $result[0];
}
The script reads the whole file into a variable and parses it looking for pairs of curly braces. On each iteration the first element of the array @result holds the bracketed text, from which I remove any newline characters and the leading and trailing curly braces before printing it.
Run it like:
perl script.pl infile
That yields:
everything between braces
construct a regular expression
java program
very strong
Note that it correctly parses the empty pair on the third line, the pair with a newline inside it (starting on the second line), and several pairs on the same line, as on the fourth.

Related

Perl regex can't find match

So I'm working on my own little formatting correction script that uses Perl regex for substitution, but I can't get this one to match. I've used similar matching for other fixes but this one doesn't work and I can't figure out why.
# basically takes in a string to modify and the match and substitution strings
perlRegex(){
PERL_BADLANG=0 perl -le '
$string = '"'"''"${1}"''"'"';
$string =~ s/'"${2}"'/'"${3}"'/gm;
print "$string\n";
exit';
}
LINE_BREAK='\n'
# contents is the example below
EDITED=$(cat file.txt);
EDITED=$(perlRegex "${EDITED}" '(?<='"${LINE_BREAK}"')( +)([^{]+{$'"${LINE_BREAK}"')([^\s][^;]+;$)' '$1$2$1$1$3')
My current attempt is https://regex101.com/r/vgatOd/1 which gives me the output I want.
(?<=\n)( +)([^{]+{$\n)([^\s][^;]+;$)
to
$1$2$1$1$3
(?<=\n)( +) $1: copies the spaces at the beginning of the line
([^{]+{$\n) $2: captures the remaining content of the line with ending {
([^\s][^;]+;$) $3: captures the next line without leading spaces, with ending ;
The substitution will add the spaces twice before the second line.
Example input:
    if (debug) {
Tools.DebugLine("Log");
    }
Aim is to pad the Tools line to be at the correct column:
    if (debug) {
        Tools.DebugLine("Log");
    }
Given the regex101 does what I would like it to do, I'm perplexed as to what part of it does not work in Perl regex.
The approach you have taken to format Java files is error prone; you should consider a better way to achieve the desired effect.
The chosen regular expression is more complicated than it needs to be and mixes \n, a literal space, and \s, all of which fall under the \s class.
The following demo code uses a simplified regular expression.
use strict;
use warnings;
my $data = do { local $/; <DATA> };
my $re = qr/([^\n]+?\{\s+)(\S+?;)(\s+\})/;
my $indent = ' ' x 8;
$data =~ s/$re/$1${indent}$2$3/gsm;
print $data;
__DATA__
    if (debug) {
Tools.DebugLine("Log");
    }
Output
    if (debug) {
        Tools.DebugLine("Log");
    }
Please see Perl regular expressions
Solved by using just \n instead of $ followed by \n (and escaping the literal {):
https://regex101.com/r/poQlwm/1
(?<=\n)([ ]+)([^{]+\{\n)([^\s][^;]+;\n)
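As a rough sketch, the corrected pattern can be used from a plain Perl script rather than through shell quoting. Note the assumptions: this version anchors with ^ under /m instead of the (?<=\n) lookbehind, and the sub name indent_after_brace is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: indent a statement that directly follows a line
# ending in "{" by twice the indentation of that line.
sub indent_after_brace {
    my ($text) = @_;
    # $1: leading spaces, $2: rest of the line up to "{\n",
    # $3: an unindented next line ending in ";\n"
    $text =~ s/^( +)([^{\n]+\{\n)(\S[^;]+;\n)/$1$2$1$1$3/gm;
    return $text;
}

my $src = "    if (debug) {\nTools.DebugLine(\"Log\");\n    }\n";
print indent_after_brace($src);
```

Keeping the pattern in a script file avoids the layers of shell quoting used in the perlRegex function above.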

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where one or more words are enclosed in tags. In those instances where there is more than one word (the words are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
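To see the /e modifier in isolation, here is a small sketch unrelated to the tags themselves: the replacement side is an expression, and its value is what gets substituted. The sub name double_numbers is made up for this illustration.

```perl
use strict;
use warnings;

# With /e the replacement "$1 * 2" is run as Perl code for every match;
# without /e it would be inserted as the literal text "3 * 2".
sub double_numbers {
    my ($text) = @_;
    $text =~ s/(\d+)/$1 * 2/ge;
    return $text;
}

print double_numbers("3 apples and 10 pears"), "\n";
```

The script above uses the same mechanism, only with a multi-statement block whose last expression ($t) becomes the replacement.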
One way to provide input for a script is via command-line arguments, available in the global @ARGV array in the script. For the use indicated in the question's "Edit", replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV, drop the -p switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where the ... in the one-liner stands for the same code as above, and say is now needed since we don't have the -p with which $_ is printed out once it's processed.
The shift takes an element off the front of an array and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).
This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
The s/ at the beginning is a substitution, so the command would be something like:
cat in | perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' > out
No explicit print is needed: -p prints $_ after each line has been processed, and the return value of the substitution (the number of substitutions made) is simply discarded.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.
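A possible way to package that loop, sketched under a few assumptions: the sub name retag_line is hypothetical, the result is returned rather than printed, and pos() is used instead of $' to find the text after the last match.

```perl
use strict;
use warnings;

# Hypothetical helper: give each word inside <gl>...</gl> its own pair
# of tags; text outside the tags is copied through unchanged.
sub retag_line {
    my ($in) = @_;
    my $out  = '';
    my $last = 0;    # offset just past the previous match
    while ($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g) {
        my ($prefix, $inner) = ($1, $2);
        $last = pos($in);
        $inner =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
        $out .= $prefix . $inner;
    }
    return $out . substr($in, $last);    # whatever follows the last match
}

print retag_line("word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>"), "\n";
```

To use it as a filter, a loop such as print retag_line($_) while <STDIN>; would feed it one line at a time.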

Perl regexp substitution - multiple matches

Friends,
need some help with substitution regex.
I have a string
;;;;;;;;;;;;;
and I need to replace it by
;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;
I tried
s/;;/;\\N;/g
but it gives me
;\N;;\N;;\N;;\N;;\N;;\N;;
tried to fiddle with lookahead and lookbehind, but can't get it solved.
I wouldn't use a regex for this, and instead make use of split:
#!/usr/bin/env perl
use strict;
use warnings;
my $str = ';;;;;;;;;;;;;';
print join ( '\N', split ( //, $str ) );
Splitting on the empty pattern yields each character, and join puts the delimiter only between elements (so not before the first and not after the last).
This gives:
;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;
Which I think matches your desired output?
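The same idea wrapped in a small sub, as a sketch (the name interleave and its separator parameter are assumptions added for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: put $sep between every pair of characters in $str.
sub interleave {
    my ($str, $sep) = @_;
    return join $sep, split //, $str;
}

# '\N' is single-quoted, so it stays a backslash followed by N.
print interleave(';' x 13, '\N'), "\n";
```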
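As a one-liner, this would be:
perl -lne 'print join ( q{\N}, split // )'
Note - the \N must be single-quoted rather than double-quoted so it doesn't get interpolated; q{\N} is used here because the one-liner itself sits inside the shell's single quotes. The -l switch strips each line's trailing newline before the split and adds it back on print.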
If you need to handle variable content (e.g. not just ; ) you can add grep or map into the mix - I'd need some sample data to give you a useful answer there though.
I use this for infile edit, the regexp suits me better
Following on from that - perl is quite clever. It allows you to do in place editing (if that's what you're referring to) without needing to stick with regular expressions.
Traditionally you might do
perl -i.bak -p -e 's/something/somethingelse/g' somefile
What this does is expand that into a loop:
LINE: while (defined($_ = <ARGV>)) {
s/something/somethingelse/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
That is, what it's actually doing is:
opening the file
iterating it by lines
transforming the line
printing the new line
And with -i that print is redirected to the new file name.
You don't have to restrict yourself to -p though - anything that generates output will work in this way - although bear in mind if it doesn't 'pass through' any lines that it doesn't modify (as a regular expression transform does) it'll lose data.
But you can definitely do:
perl -i.bak -lne 'print join ( q{\N}, split // )'
And edit in place - but it will trip over lines that aren't just ;;;;; as in your example.
So to avoid those:
perl -i.bak -lne 'if (m/;;;;/) { print join ( q{\N}, split // ) } else { print }'
Or perhaps more succinctly:
perl -i.bak -lpe '$_ = join ( q{\N}, split // ) if m/;;;/'
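If the edit needs to live in a script rather than a one-liner, something like the following sketch may work. It relies on localizing $^I and @ARGV to get the same behaviour as -i.bak; the sub name fill_empty_fields and the choice to touch only lines made up entirely of semicolons are assumptions of mine, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: edit $path in place, keeping a "$path.bak" backup.
sub fill_empty_fields {
    my ($path) = @_;
    local ($^I, @ARGV) = ('.bak', $path);    # same effect as -i.bak
    while (my $line = <>) {
        chomp $line;
        # Assumption: only lines consisting solely of ';' are rewritten.
        $line = join(q{\N}, split(//, $line)) if $line =~ /^;+$/;
        print "$line\n";    # goes to the rewritten file, not the terminal
    }
    return;
}

my @files = @ARGV;
fill_empty_fields($_) for @files;
```

Every line is printed, modified or not, so nothing is lost from the file.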
Since the same character can't be matched twice, your approach doesn't work. To solve the problem, check only for the presence of a following ; with a lookahead (the second ; isn't part of the match):
s/;(?=;)/;\\N/g
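A short sketch contrasting the two substitutions on the string from the question (the sub names consuming and lookahead are made up for this comparison):

```perl
use strict;
use warnings;

# Consumes both semicolons, so the second one cannot start the next match.
sub consuming {
    my ($str) = @_;
    $str =~ s/;;/;\\N;/g;
    return $str;
}

# Consumes only the first semicolon; (?=;) merely checks what follows.
sub lookahead {
    my ($str) = @_;
    $str =~ s/;(?=;)/;\\N/g;
    return $str;
}

my $input = ';' x 13;
print consuming($input), "\n";
print lookahead($input), "\n";
```

The first line printed reproduces the unwanted result from the question; the second has \N between every pair of semicolons.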

Using string variables containing literal escapes in a Perl substitution

I'm new to Perl and I found behaviour which I don't understand and can't solve.
I'm making a small find and replace program and there are some things I need to do. I have a bunch of files that I need to process. Then I have a list of find / replace rules in an external text file. When replacing, I need three special things:
Replacing utf-8 characters (Czech diacritics)
Work with adding/removing lines (so working in a slurp mode)
Use regular expressions
I want a program that works alone, so I wrote it so that it takes three arguments:
The file to work on
What to find
What to replace.
I'm sending parameters in a loop from a bash script which parses the rules list and loads other files.
My problem is when I have a "\n" string in a rules list and I send it to the Perl script. If it's in the first part of replacement (in the find section) it looks for a newline correctly, but when it's in the second part (the replace section) it just prints \n instead of a newline.
I tried hardcoding "\n" to the string right into the variable instead of passing it from the list and then it works fine.
What's the reason Perl doesn't interpret the "\n" string there, and how can I make it work?
This is my code:
list.txt - One line from the external replacement list
1\. ?\\n?NÁZEV PŘÍPRAVKU;\\n<<K1>> NÁZEV PŘÍPRAVKU;
farkapitoly.sh - The bash script for parsing list.txt and cycling through all of the files and calling the Perl script
...
FILE="/home/tmp.txt"
while read LINE
do
FIND=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $1}'`
REPLACE=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $2}'`
perl -CA ./pathtiny.pl "$FILE" "$FIND" "$REPLACE"
done < list.txt
...
pathtiny.pl - The Perl script for find and replace
#!/usr/bin/perl
use strict;
use warnings;
use Modern::Perl;
use utf8; # Enable typing Unicode in Perl strings
use open qw(:std :utf8); # Enable Unicode to STDIN/OUT/ERR and filehandles
use Path::Tiny;
my $file = path("$ARGV[0]");
my $searchStr = "$ARGV[1]";
my $replaceStr = "$ARGV[2]";
# $replaceStr="\n<<K1>> NÁZEV PRÍPRAVKU"; # if I hardcode it here \n is replaced right away
print("Search String:", "$searchStr", "\n");
print("Replace String:", "$replaceStr", "\n\n");
my $guts = $file->slurp_utf8;
$guts =~ s/$searchStr/$replaceStr/gi;
$file->spew_utf8($guts);
If it's important, I'm using Linux Mint 13 64-bit on VirtualBox (under Win 8.1) and I have Perl v5.14.2. Every file is UTF-8 with Linux endings.
Example files can be found on pastebin. this should end up like this.
But the examples vary a lot. I need a universal way to write a newline in a replacement string so that it is replaced correctly.
The problem is that the replacement string is read literally from the file, so if your file contains
xx\nyy
then you will read exactly those six characters. Also, the replacement part of a substitution is evaluated as if it was in double quotes. So your replacement string is "$replaceStr" which interpolates the variable and goes no further, so you will again have xx\nyy in the new string. (By the way, please avoid using capital letters in local Perl identifiers as in practice they are reserved for globals such as Module::Names.)
The answer lies in using eval, or its equivalent - the /e modifier on the substitution.
If I write
my $str = '<b>';
my $r = 'xx\nyy';
$str =~ s/b/$r/;
then the replacement string is interpolated to xx\nyy, as you have experienced.
A single /e modifier evaluates the replacement as an expression instead of just a double-quoted string, but of course $r as an expression is xx\nyy again.
What you need is a second /e modifier, which does the same evaluation as a single /e and then does an additional eval of the result on top. For this it is cleanest if you use qq{ .. } as you need two levels of quotation.
If you write
$str =~ s/b/qq{"$r"}/ee
then perl will evaluate qq{"$r"} as an expression, giving "xx\nyy", which, when evaluated again will give you the string you need - the same as the expression 'xx' . "\n" . 'yy'.
Here's a full program
use strict;
use warnings;
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/qq{"$r"}/ee;
print $s;
output
<xx
yy>
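To make the difference between the evaluation levels visible, here is a sketch that runs the same replacement with no /e, one /e, and /ee. The sub replace_b and its numeric second argument are invented for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the "b" in '<b>' using 0, 1 or 2 evals.
sub replace_b {
    my ($r, $evals) = @_;
    my $s = '<b>';
    if    ($evals == 0) { $s =~ s/b/$r/ }            # plain interpolation
    elsif ($evals == 1) { $s =~ s/b/qq{"$r"}/e }     # builds the text "xx\nyy", quotes included
    else                { $s =~ s/b/qq{"$r"}/ee }    # evaluates that text as Perl code
    return $s;
}

my $r = 'xx\nyy';    # six characters: backslash and n are literal
print replace_b($r, $_), "\n" for 0 .. 2;
```

Only the last call turns the two characters \n into a real newline. Because /ee runs the string as code, it should only be used on replacement text you trust.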
But don't forget that, if your replacement string contains any double quotes, like this
my $r = 'xx\n"yy"'
then they must be escaped before putting the string through the substitution, as the expression itself also uses double quotes.
All of this is quite hard to grasp, so you may prefer the String::Escape module which has an unbackslash function that will change a literal \n (and any other escapes) within a string to its equivalent character "\n". It's not a core module so you probably will need to install it.
The advantage is that you no longer need a double evaluation, as the replacement string can be just unbackslash $r, which gives the right result when it is evaluated as an expression. It also handles double quotes in $r without any problem, as the expression doesn't use double quotes itself.
The code using String::Escape goes like this
use strict;
use warnings;
use String::Escape 'unbackslash';
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/unbackslash $r/e;
print $s;
and the output is identical to that of the previous code.
Update
Here is a refactoring of your original program that uses String::Escape. I have removed Path::Tiny as I believe it is best to use Perl's built-in inplace-edit extension, which is documented under the General Variables section of perlvar.
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use 5.010;
use open qw/ :std :utf8 /;
use String::Escape qw/ unbackslash /;
our @ARGV;
my ($file, $search, $replace) = @ARGV;
print "Search String: $search\n";
print "Replace String: $replace\n\n";
@ARGV = ($file);
$^I = '';
while (<>) {
s/$search/unbackslash $replace/eg;
print;
}
You have \n as the content of a string (as two characters, first \ and then n, not as a single newline).
Perl interprets the \n as newline when it is as literal (e.g. it is in your code).
The quick-fix would be:
my $replaceStr=eval qq("$ARGV[2]"); # evaluating the string makes Perl interpret \n as it would in a literal
or, if you don't like eval, you can use the String::Escape CPAN module (the unbackslash function).
You're wanting a literal string to be treated as if it were a double quoted string. To do that you'll have to translate any backslash followed by another character.
The other experts have shown you how to do that over the entire string (which is risky since it uses eval with unvalidated data). Alternatively, you could use a module, String::Escape, which requires an install (not a high bar, but too high for some).
However, the following does a translation of the return value string itself in a safe way, and then it can be used like a normal value in your other search and replace:
use strict;
use warnings;
my $r = 'xx\nyy';
$r =~ s/(\\.)/qq{"$1"}/eeg; # Translate \. as a double quoted string would
print $r;
Outputs:
xx
yy
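If only a handful of escapes matter, a lookup table avoids eval altogether. This is a sketch under that assumption: it handles just \n, \t, \r and \\, and the names %escape_for and unescape_basic are hypothetical.

```perl
use strict;
use warnings;

# Assumed set of supported escapes; anything else is left untouched.
my %escape_for = (n => "\n", t => "\t", r => "\r", '\\' => '\\');

# Hypothetical helper: translate literal \n, \t, \r and \\ sequences.
sub unescape_basic {
    my ($s) = @_;
    $s =~ s/\\([ntr\\])/$escape_for{$1}/g;
    return $s;
}

print unescape_basic('xx\nyy'), "\n";
```

Nothing from the input is ever executed, at the cost of not supporting the full set of double-quote escapes.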

How do I return all characters that begin and end with certain characters in Perl (Or C++)?

note: I'm running Perl 5 on Linux
I'm currently doing a project where I have to input a few words and then return words that begin with "d" and end with "e". I'm not using a pre-done list, for example I input into the console Done, Dish, Dome, and Death. I want it to return Done and Dome, but not the other words. I hope to receive help how to do this in Perl, but C++ would help if Perl doesn't work out.
perl -ne ' print if /^d/i && /e$/i ' < words
Since you are using Linux, it may be simpler to use grep(1):
grep -i '^d.*e$' < words
That's almost trivial in Perl:
$ perl -nE 'say "ok" if /^d.*e$/i'
Done
ok
Dish
Dome
ok
Death
It reads from STDIN and says ok if the line matched. This is useful while debugging regular expressions. You just want to output matching lines, so you could simply replace say "ok" by say
$ perl -nlE 'say if /^d.*e$/i' words
where words is the filename of your words file; Perl reads its lines automatically. Short explanation of that regular expression match:
^ # start of the line
d # the literal character 'd' (case-insensitive because of the i switch)
.* # everything allowed here
e # the literal character 'e' (also case-insensitive)
$ # end of the line
Not often I answer perl questions, but I think this does the trick.
my @words = ...;
@words = grep(/^d.*e$/i, @words);
grep uses a regular expression to filter the words.
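A runnable version of that idea, as a sketch using the four words from the question (the sub name d_to_e_words is made up):

```perl
use strict;
use warnings;

# Hypothetical helper: keep only words that start with d and end with e,
# ignoring case.
sub d_to_e_words {
    my @words = @_;
    return grep { /^d.*e$/i } @words;
}

print "$_\n" for d_to_e_words(qw(Done Dish Dome Death));
```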
How about:
#!/usr/bin/perl -Tw
use strict;
use warnings;
for my $word (@ARGV) {
if ( $word =~ m{\A d .* e \z}xmsi ) {
print "$word\n";
}
}