I can't seem to get this lex regex working:
%{
#include"y.tab.h"
%}
%option yylineno
/* regular definitions */
angle_bracket_start "<"
%%
angle_bracket_start /*swallow it, do nothing!*/{}
%%
But when I test it with
lex lex.l
gcc lex.yy.c -lfl
I got:
$ ./a.out
<
< <--- If it prints out the "<", it means lex can't parse it, right?
^C
I'm asking because I need a regex that matches the following verbatim:
<script type="text/JavaScript">
But I always get a syntax error, because lex decides it can't parse the input, throws out the < and parses script as an id.
To refer to a definition, use curly braces ({}) around the name:
angle_bracket_start "<"
%%
{angle_bracket_start} /*swallow it, do nothing!*/
Without the curly braces, lex looks for the literal string angle_bracket_start in the input.
I need to use egrep to obtain an entry in an index file.
In order to find the entry, I use the following command:
egrep "^$var_name" index
$var_name is the variable read from a var list file:
while read var_name; do
egrep "^$var_name" index
done < list
One of the possible keys comes usually in this format:
$ERROR['SOME_VAR']
My index file is in the form:
$ERROR['SOME_VAR'] --> n
Where n is the line where the variable is found.
The problem is that $var_name is automatically escaped when read. When I enable the debug mode, I get the following command being executed:
+ egrep '^$ERRORS['\''SELECT_COUNTRY'\'']' index
The command above doesn't work, because egrep will try to interpret the pattern.
If I don't use the extended version, using grep or fgrep, the command will work only if I remove the ^ anchor:
grep -F "$var_name" index # this actually works
The problem is that I need to ensure that the match is made at the beginning of the line.
Ideas?
set -x shows the command being executed in shell notation.
The backslashes you see do not become part of the argument, they're just printed by set -x to show the executed command in a copypastable format.
Your problem is not too much escaping, but too little: $ in a regex means "end of line", so ^$ERROR will never match anything. Similarly, [ ] delimits a bracket expression (a set of characters), and will not match literal square brackets.
The correct regex to match your pattern would be ^\$ERROR\['SOME_VAR'], equivalent to the shell argument in egrep "^\\\$ERROR\['SOME_VAR']".
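As a quick check of the point above, here is a minimal sketch that runs both patterns against a single made-up index line (the line content and the variable names are illustrative assumptions, and grep -E stands in for egrep):

```shell
# Hypothetical one-line index, mirroring the format in the question.
index_line="\$ERROR['SOME_VAR'] --> 12"

# Unescaped: "$" anchors the end of the line and "[...]" is a bracket expression.
unescaped=$(printf '%s\n' "$index_line" | grep -c -E "^\$ERROR['SOME_VAR']" || true)

# Escaped: "\$" and "\[" stand for the literal characters.
escaped=$(printf '%s\n' "$index_line" | grep -c -E "^\\\$ERROR\['SOME_VAR']" || true)

printf 'unescaped=%s escaped=%s\n' "$unescaped" "$escaped"
```

The counts are the number of matching lines, so the unescaped pattern should report 0 and the escaped one 1.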
Your options to fix this are:
If you expect to be able to use regex in your input file, you need to include regex escapes like above, so that your patterns are valid.
If you expect to be able to use arbitrary, literal strings, use a tool that can match flexibly and literally. This requires jumping through some hoops, since UNIX tools for legacy reasons are very sloppy.
Here's one with awk:
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
It passes the string in through the environment (because awk -v interprets backslash escape sequences in the value) and then compares it literally against the start of each input line. Note that awk string positions start at 1.
Here's an example invocation:
$ cat script
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
$ cat list
$ERRORS['SOME_VAR']
\E and \Q
'"'%##%*'
$ cat index
hello world
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
etc
$ bash script
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
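To illustrate why the string goes through the environment rather than -v, the following sketch passes the same value both ways and compares the lengths awk sees. The sample value and the variable names are assumptions made for this illustration only.

```shell
needle='a\tb'   # hypothetical input: a, backslash, t, b (4 characters)

# -v treats the value like an awk string literal, so \t becomes a tab.
via_v=$(awk -v s="$needle" 'BEGIN { print length(s) }')

# ENVIRON hands the value over untouched.
via_env=$(s="$needle" awk 'BEGIN { print length(ENVIRON["s"]) }')

printf 'via -v: %s chars, via ENVIRON: %s chars\n' "$via_v" "$via_env"
```

With -v the backslash sequence collapses into one character, so a key containing backslashes would no longer compare equal to the text in the index.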
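You can use printf "%q" (note that %q quotes for the shell, not for regex syntax, so this helps only where the two sets of escapes agree):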
while read -r var_name; do
egrep "^$(printf "%q\n" "$var_name")" index
done < list
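Update: If your grep supports Perl-compatible regexes (GNU grep -P), you can also do:
while read -r var_name; do
grep -P "^\Q$var_name\E" index
done < list
Here \Q and \E make the string between them literal, removing the special meaning of regex symbols. These are Perl/PCRE escapes that plain egrep does not support, and the quoting ends early if the string itself contains \E.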
I have string
Message <Network=Data Center> All Verified
I need to extract the whole string except the part in angle brackets.
I tried
m/(?![^<]*\\>)/s
It does not give the desired result.
Removing <..> regions
It's easier to remove the <..> parts from the string and then deal with the remaining string.
Try this oneliner:
cat file | perl -pne 's/<[^>]*?>//g;'
For your sample input, this is the output:
Message All Verified
Notice the non-greedy quantifier ? in the regex. Because this is a one-liner, the s/// search-and-replace construct is applied to the implicit variable $_ (which holds a line from standard input). After the substitution has run, $_ no longer contains any <..> regions. The -p switch prints $_ after running the block of code. You can read more about Perl command-line switches in perlrun.
This is one solution. Below there is another one:
Capturing regions outside of <..>
On the other hand, you can (if you want) match the parts outside of the <..> regions.
In order to do that let's build a regex. First, we want a < or > free region. The following regex matches just that
$p = ([^<>]*).
Next, we want to match everything before <, and for that we can write (?:$p<) and everything after >, and that's (?:>$p).
Now if we assemble all those parts together we get (?:>$p)|(?:$p<).
Notice that (?:) is a non-capturing group.
So now there are two capturing groups (the two $p you see above) but only one will match at a time, so some of the captures will be undef. We'll have to filter those out.
Finally, we can assemble all the captures, and we're done.
cat file | perl -ne '$p="([^<>]*)";@x=grep{defined} m{(?:>$p)|(?:$p<)}g; print join(" ",@x)."\n";'
Parse::Yapp parser
You might think that using Parse::Yapp for this particular problem is a bit too much (usually you would use a grammar and a parser generator only if you have something complicated to parse), but why not? :)
Ok, so we need a grammar, here's one right here grammar_file.yp:
#header
%%
#rules
expression:
| exterior '<' interior '>' exterior
| exterior
;
exterior:
| TOK { $_[0]->YYData->{DATA} .= $_[1]; }
| expression
;
interior: TOK;
%%
#footer
sub Error { my ($parser)=shift; }
sub Lexer {
use Data::Dumper;
my($parser)=shift;
$parser->YYData->{INPUT} or return('',undef);
#$parser->YYData->{INPUT}=~s/^\s+//;
for ($parser->YYData->{INPUT}) {
return ('TOK',$1) if(s/^([^<>]+)//);
return ( $1,$1) if(s/^([<>])//);
};
}
You will notice in the grammar above that the interior is completely ignored, and only the terminals from exterior are collected.
Here's a small program, parse.pl, that uses the parser (MyParser.pm, generated from grammar_file.yp):
#!/usr/bin/env perl
use strict;
use warnings;
use MyParser;
my $parser=MyParser->new;
$parser->YYData->{INPUT} = "Message <Network=Data Center> All Verified";
my $value=$parser->YYParse(
yylex => \&MyParser::Lexer,
yyerror => \&MyParser::Error,
#yydebug => 0x1F,
);
my $nberr=$parser->YYNberr();
my $data=$parser->YYData->{DATA};
print "Result=$data"
And now a Makefile and we're done:
generate_parser_module:
yapp -m MyParser grammar_file.yp;
run:
perl parse.pl
all: generate_parser_module
Note
Some more parser generators:
Regexp::Grammars
Parse::RecDescent
Marpa::XS or Marpa::R2
You can do it another way: just remove the string in the angle brackets:
s#<.*>##
Or, if > cannot appear inside the brackets:
s#<[^>]*>##
You can use sed for that:
cat yourfile |sed 's/<.*>//g' > newfile
If you need perl:
perl -i -pe "s/<.*?>//g" yourfile
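The difference between the greedy <.*> and the bounded <[^>]*> only shows up when a line holds more than one bracketed region. A minimal sketch with a made-up input line (the sample text and variable names are assumptions):

```shell
line='a <x> b <y> c'   # hypothetical input with two bracketed regions

# Greedy: .* runs from the first "<" to the last ">" on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//g')

# Bounded: [^>]* stops at the first ">" after each "<".
bounded=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The greedy form also removes the text between the two regions ("b" here), which is usually not what is wanted.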
Here is a compact approach. The following regex will capture your strings into Group 1:
<[^>]+>|([^<>]*)
What we are interested in here is not the overall match, but just the Group 1 matches.
So we need to iterate over Group 1 matches. I don't code in Perl, but following a recipe from the perlretut tutorial, this should do it:
while ($x =~ /<[^>]+>|([^<>]*)/g) {
print "$1","\n";
}
Please give it a try and let me know if it works for you.
I am still a noob at shell scripts but am trying hard. Below is a partially working shell script that is supposed to remove all JS from *.htm documents by matching tags and deleting their enclosed content, e.g. <script src="">, <script></script> and <script type="text/javascript">.
find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done
The problem with this script is that sed reads its input line by line, so it does not work as expected when a script block spans several lines. Running it on:
<script>
//Foo
</script>
will remove the first script tag but leave the "//Foo" line and the closing tag behind, which is not what I want.
Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?
Assuming that you have <script> tags on different lines, e.g. something like:
foo
bar
<script type="text/javascript">
some JS
</script>
foo
the following should work:
sed '/<script/,/<\/script>/d' inputfile
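A small sketch of that range delete on made-up input (the sample lines and the variable name are assumptions). Note that d removes whole lines, so any other text sharing a line with a script tag goes too.

```shell
# The address range covers every line from one matching /<script/
# through the next line matching /<\/script>/; d deletes those lines.
cleaned=$(printf '%s\n' \
    'foo' \
    '<script type="text/javascript">' \
    'some JS' \
    '</script>' \
    'bar' | sed '/<script/,/<\/script>/d')

printf '%s\n' "$cleaned"
```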
This awk script looks for the opening <script*> tag, sets the in_script variable and skips to the next line. When the closing </script*> tag is found, the variable is reset to zero. The final rule prints a line only if in_script is zero. (The variable cannot be named in, because in is a reserved word in awk.)
awk '/<script.*>/ { in_script=1; next }
/<\/script.*>/ { if (in_script) in_script=0; next }
{ if (!in_script) print; } ' "$1"
As you mentioned, the issue is that sed processes input line by line.
The simplest workaround is therefore to make the input a single line, e.g. replacing newlines with a character which you are confident doesn't exist in your input.
One would be tempted to use tr :
… |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n'
However "currently tr fully supports only single-byte characters", and to be safe you probably want to use some improbable character like ˇ, for which tr is of no help.
Fortunately, the same thing can be achieved with sed, using branching.
Back on our <script>…</script> example, this does work and would be (according to the previous link) cross-platform :
… |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'
Or in a more condensed form if you use GNU sed and don't need cross-platform compatibility :
… |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'
Please refer to the linked answer under "using branching" for details about the branching part (:a;N;$!ba;). The remaining part is straightforward :
s/\n/ˇ/g replaces all newlines with ˇ ;
s~<script>.*</script>~~g removes what needs to be removed (beware that it requires some securing for actual use : as is it will delete everything between the first <script> and the last </script> ; also, note that I used ~ instead of / to avoid escaping of the slash in </script> : I could have used just about any single-byte character except a few reserved ones like \) ;
s/ˇ/\n/g readds newlines.
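The branching idiom can be tried in isolation. The sketch below assumes GNU sed, where . also matches the newlines embedded in the pattern space, so it skips the placeholder step and applies the substitution directly to the slurped input. The sample input and variable names are made up for the illustration.

```shell
page=$(printf '%s\n' 'foo' '<script>' '//Foo' '</script>' 'bar')

# :a / N / $!ba keeps appending the next line to the pattern space until the
# last line, so the substitution sees the whole input as one string.
cleaned=$(printf '%s\n' "$page" |
    sed -e ':a' -e 'N' -e '$!ba' -e 's~<script>.*</script>~~g')

printf '%s\n' "$cleaned"
```

The newlines before and after the removed block survive, so an empty line is left where the script block used to be.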
I'm trying to make an automated build and use Perl in it to update some paths in a file.
Specifically, in an html file, I want to take the block shown below
<!-- BEGIN: -->
<script src="js/a.js"></script>
<script src="js/b.js"></script>
<script src="js/c.js"></script>
<!-- END: -->
and replace it with
<script src="js/all.js"></script>
I have tried a few regexes like:
perl -i -pe 's/<--BEGIN:(.|\n|\r)*:END-->/stuff/g' file.html
or just starting with:
perl -i -pe 's/BEGIN:(.|\n|\r)*/stuff/g' file.html
But I can't seem to get past the first line. Any ideas?
perl -i -pe 's/<--BEGIN:(.|\n|\r)*:END-->/stuff/g' file.html
This is so close.
Now just match with the /s modifier, this allows . to match any char, including newlines.
Most importantly, you want to start the match with <!--, note the !.
Also, you want a non-greedy match like .*?, in case you have multiple END markers.
Your example input shows that there may be extra spaces.
This would lead to the following substitution:
s/<!--\s*BEGIN:.*?END:\s*-->/stuff/sg
As @plusplus pointed out, the -p iterates over each line. Let's change Perl's concept of a “line” to “the whole file at once”:
BEGIN { $/ = undef }
or use the -0 command-line switch without a numeric argument (perlrun documents -0777 as the conventional way to slurp whole files).
I need to parse a file and grab certain fields from it, using a regular expression as part of the delimiter. I thought I could use Perl to do this. The problem is I can't get it to work properly. Here's a one-liner which I thought would let me print fields that are separated by one or more whitespace characters (in this case, one or more spaces):
bash_prompt> perl -anF'/ +/' -e 'print "$F[0], $F[-1]\n"' build_outputfile
The output file is from a makefile.
Here, I want to print out the first token, and the last token. So in my case which compiler was used and which file was compiled. Perhaps there's a better way to do it, but now I'm bothered as to why my perl one liner does not work.
Anyway, the regular expression '/ +/' does not appear to work. I get some unexpected output. Perhaps -F does not actually want a regular expression? When I replace -F's argument with '/ /', which contains one space, I still don't get the expected output.
Can anyone help? Thanks.
Here's some test code for you to try. Save it in a file:
g++ -c -g -Wall -I/codedir/src/CanComm/include -I/home/codemonkey/workspace/thirdparty/Boost -Wno-deprecated SCMain.cpp
g++ -c -g -Wall -I./object/include -I./wrapper/include -I./Properties/include -I./Messaging/include -I/codedir/src/Logging/sclog/include ./object/SCObject.cpp ./object/RandNumGenerator.cpp ./object/ScannerConstraints.cpp ./object/ThreadSync.cpp ./object/SCData.cpp ./object/AirScanData.cpp ./object/ClusterData.cpp ./object/WarmupData.cpp ./object/SCCommand.cpp ./object/ScanCommands.cpp ./object/RCCommands.cpp ./object/ReconData.cpp ./object/UICommTool.cpp ./object/UIMsg.cpp ./object/UI2SCConversion.cpp ./object/RCMsg.cpp ./object/RCMessageInfo.cpp ./object/Utils.cpp ./object/ZBackupTable.cpp ./object/ZBackupFactory.cpp
g++ -c -g -Wall -I./Properties/include -I/codedir/src/Logging/sclog/include -I./object/include -I/home/codemonkey/workspace/thirdparty/Boost ./Properties/PropertyMap.cpp
According to perldoc perlrun:
-Fpattern
specifies the pattern to split on if -a is also in effect.
The pattern may be surrounded by "//", "", or '', otherwise it will be put in single quotes. You can't use literal whitespace in the pattern.
I have to admit: What a thoroughly arbitrary restriction!
For your problem you don't actually need to specify a pattern: the default, which splits on whitespace, should be good enough.
perl -anle 'print "$F[0], $F[-1]"' build_outputfile
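For comparison only, the same first-and-last-field extraction can be sketched with awk instead of Perl, since awk also splits on runs of whitespace by default. This is a swapped-in tool, not part of the answer above, and the sample line is a shortened, made-up version of the build output.

```shell
line='g++ -c -g -Wall -I./Properties/include ./Properties/PropertyMap.cpp'

# $1 is the first whitespace-separated field, $NF the last one.
fields=$(printf '%s\n' "$line" | awk '{ print $1 ", " $NF }')

printf '%s\n' "$fields"
```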
Your Regex pattern should be like this:
'/\s+/'
\s matches any whitespace character.