perl script to read content between marks - regex

In Perl, how do I read the contents between two marks? The source data looks like this:
START_HEAD
ddd
END_HEAD
START_DATA
eee|234|ebf
qqq| |ff
END_DATA
--Generate at 2011:23:34
I only want the data between "START_DATA" and "END_DATA". How can I do this?
sub readFile(){
open(FILE, "<datasource.txt") or die "file is not found";
while(<FILE>){
if(/START_DATA/){
record(\*FILE);#start record;
}
}
}
sub record($){
my $fileHandle = $_[0];
while(<fileHandle>){
print $_."\n";
if(/END_DATA/) return ;
}
}
I wrote this code, but it doesn't work. Do you know why?
Thanks

You can use the range operator:
perl -ne 'print if /START_DATA/ .. /END_DATA/'
The output will include the *_DATA lines, too, but it should not be so hard to get rid of them.
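One way to drop the marker lines is sketched below. In scalar context the range operator returns a sequence number: 1 on the line that starts the range, and a value with "E0" appended on the line that ends it, so both can be skipped. The helper name lines_between_markers and the in-memory filehandle are illustrative assumptions, not part of the answer above.

```perl
use strict;
use warnings;

# Sample input mirroring the data file from the question.
my $text = <<'END_TEXT';
START_HEAD
ddd
END_HEAD
START_DATA
eee|234|ebf
qqq| |ff
END_DATA
--Generate at 2011:23:34
END_TEXT

# Return the lines strictly between the START_DATA and END_DATA markers.
sub lines_between_markers {
    my ($input) = @_;
    open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
    my @kept;
    while (my $line = <$fh>) {
        # Scalar-context range operator: false outside the range,
        # 1 on the START line, "<n>E0" on the END line.
        my $seq = ($line =~ /^START_DATA$/) .. ($line =~ /^END_DATA$/);
        next unless $seq;
        next if $seq == 1 or $seq =~ /E0$/;
        push @kept, $line;
    }
    close $fh;
    return @kept;
}

print lines_between_markers($text);
```

For a real file, the same loop would read from a handle opened on the file instead of the in-memory string.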

Besides a few typos, your code is not too far off. Had you used
use strict;
use warnings;
you might have figured it out yourself. Here's what I found:
Don't use prototypes unless you need them and know what they do.
The normal sub declaration is sub my_function (prototype) {, but you can leave out the prototype and just write sub my_function {.
while (<fileHandle>) { is missing the $ sigil that marks fileHandle as a
scalar variable; without it, Perl reads from a bareword filehandle named fileHandle. It should be <$fileHandle>.
print $_."\n"; adds an extra newline, because each line read still ends with its own. A plain print; does
what you expect.
if(/END_DATA/) return; is a syntax error. The braces around the body are not optional
in Perl, unless you reverse the statement into a statement modifier.
Use either:
return if (/END_DATA/);
or
if (/END_DATA/) { return }
Below is the cleaned up version. I commented out your open() while testing, so this would be a functional code example.
use strict;
use warnings;
readFile();
sub readFile {
    #open(FILE, "<datasource.txt") or die "file is not found";
    while(<DATA>) {
        if(/START_DATA/) {
            recordx(\*DATA); #start record;
        }
    }
}
sub recordx {
    my $fileHandle = $_[0];
    while(<$fileHandle>) {
        print;
        if (/END_DATA/) { return }
    }
}
__DATA__
START_HEAD
ddd
END_HEAD
START_DATA
eee|234|ebf
qqq| |ff
END_DATA
--Generate at 2011:23:34

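This is also simple to do with a regular expression once the whole file has been read into a single string. The /s (single-line) flag lets . match newlines, and /m (multi-line) lets ^ and $ match at line boundaries, so you can do /start_data(.+?)end_data/is. The non-greedy .+? stops at the first end marker instead of the last one.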
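A minimal sketch of the slurp-and-match approach, assuming the whole input fits in memory. The helper name extract_block is hypothetical; the pattern anchors the markers to whole lines with /m and uses a non-greedy capture.

```perl
use strict;
use warnings;

# Sample input mirroring the data file from the question.
my $content = <<'END_TEXT';
START_HEAD
ddd
END_HEAD
START_DATA
eee|234|ebf
qqq| |ff
END_DATA
--Generate at 2011:23:34
END_TEXT

# Return the text between the marker lines, or undef if they are absent.
sub extract_block {
    my ($text) = @_;
    # /s: "." also matches newlines; /m: ^ and $ match at line boundaries.
    # ".*?" is non-greedy, so the capture ends at the first END_DATA line.
    return $text =~ /^START_DATA\n(.*?)^END_DATA$/ms ? $1 : undef;
}

my $block = extract_block($content);
print defined $block ? $block : "no data block found\n";
```

To apply this to a file, the content could be slurped first, for example with local $/ set to undef around a single read.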

Related

How to apply negative regex on array in perl?

Having this:
foo.pl:
#!/usr/bin/perl -w
@heds = map { /_h.+/ and s/^(.+)_.+/$1/ and "$_.hpp" } @ARGV;
@fls = map { !/_h.+/ and "$_.cpp" } @ARGV;
print "heds: @heds\nfls: @fls";
I want to separate headers from source files, and when I give input:
$./foo.pl a b c_hpp d_hpp
heds: e.hpp f.hpp
fls: e.cpp f.cpp a.cpp b.cpp
The headers are separated correctly, but every file ends up in the source list. Why? I applied the negative regex !/_h.+/ in the mapping, so files matching *_h* should be skipped, but they are not. How do I fix it?
Even this does not work:
@fls = map { if(!/_h.+/){ "$_.cpp" } } @ARGV;
It still takes every file, despite the condition.
The map { } for @heds performs a substitution on $_, which is aliased to each element of @ARGV, so it modifies @ARGV in place. By the time the second map runs, the _hpp suffixes are already gone and nothing matches /_h.+/ any more. Reorder the mappings so that @fls is built first and you get the desired result. Note that @ARGV is still modified afterwards, so if you need the original arguments later, work on a copy.
#!/usr/bin/perl -w
@fls = map { !/_h.+/ and "$_.cpp" } @ARGV;
@heds = map { /_h.+/ and s/^(.+)_.+/$1/ and "$_.hpp" } @ARGV;
print "heds: @heds\nfls: @fls\n";
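An alternative sketch that avoids both the in-place modification and the false values that map leaves behind for skipped elements: filter with grep first, then transform a copy of each name. The helper name split_names and the sample argument list are assumptions for illustration.

```perl
use strict;
use warnings;

# Split names into header and source file names without touching the input.
sub split_names {
    my @names = @_;
    my @heds = map { (my $base = $_) =~ s/^(.+)_.+/$1/; "$base.hpp" }
               grep { /_h.+/ } @names;
    my @fls  = map { "$_.cpp" } grep { !/_h.+/ } @names;
    return (\@heds, \@fls);
}

my @args = qw(a b c_hpp d_hpp);    # stand-in for @ARGV
my ($heds, $fls) = split_names(@args);
print "heds: @$heds\nfls: @$fls\n";
# prints:
# heds: c.hpp d.hpp
# fls: a.cpp b.cpp
```

Because the substitution runs on the copy $base, the order of the two statements no longer matters.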

Open curly braces should be opened in the same line of if statement, for loops, methods, etc

Suppose i have code snippet like:
if(some condition)
{
some code here;
}
else if (some other condition)
{
some more code;
}
public static some_method(something passed here maybe)
{
some other code;
}
Then it should get formatted to:
if(some condition) {
some code here;
} else if (some other condition) {
some more code;
}
public static some_method(something passed here maybe) {
some other code;
}
This is just example code. I want to run sed script for whole file containing "if statements, for loops, methods, etc." maybe similar or in different formats. So mostly the purpose of the script should be to move these open curly braces one line up. Thanks in advance..
Just repeating the comment you got from Tom Fenech, especially if the requirements get more complicated:
My recommendation would be not to implement this yourself, and to use
an existing prettifier tool for the language you are writing
Still, here is a possible solution, though I don't know how well it will do in the real world. RS="\n[[:space:]]*{" splits the input wherever a newline is followed by optional spaces or tabs and an opening brace, and the script replaces each such separator with ' {'. Saving a record and printing it one step later avoids adding a stray { at the end of the output. (A regular expression in RS is an extension supported by gawk and mawk, not by every awk.)
awk '
BEGIN { RS="\n[[:space:]]*{" }
NR == 1 {
line=$0;
next
}
{
printf "%s", line" {";
line=$0
}
END {
print line
}' _file_with_code_
Alternatively, this can be written up as follows, using more awk features. This does result in a trailing { at the end of the file though, which is removed by the added sed program
awk '
BEGIN {
# break record on `\n{`
RS="\n[[:space:]]*{";
# replace with `{`
ORS=" {"
}
# print for each line
{ print }
' input_file |
sed '
# for last line ($), `substitute` trailing `{` with ``
$ s/{$//
'
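The same record-joining idea can be sketched in Perl with a single substitution over the whole text. This is only an illustration under simplifying assumptions: the helper name join_open_braces is hypothetical, braces inside strings or comments are not treated specially, and a following else is not pulled up onto the closing brace's line.

```perl
use strict;
use warnings;

# Move an opening brace that sits alone at the start of a line
# up to the end of the previous line.
sub join_open_braces {
    my ($code) = @_;
    # Trailing blanks, a newline, optional indentation and "{" become " {".
    $code =~ s/[ \t]*\n[ \t]*\{/ {/g;
    return $code;
}

my $sample = "if (x)\n{\n    y();\n}\nelse\n{\n    z();\n}\n";
print join_open_braces($sample);
```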
#!/bin/bash
# get input filename ...
filename="${1}"
# Do basic checks (etc) ...
if [[ -z "${filename}" ]]
then
echo "[ERR] No file name supplied."
echo "Usage: `basename ${0}` \"the file name\""
exit 1
fi
if [[ ! -f "${filename}" ]]
then
echo "[ERR] No file named \"${filename}\" exists anywhere."
exit 1
fi
# Create "SED crunch script" ...
sedscript="/tmp/sedscript.sed"
cat << EOF > "${sedscript}"
# if "function syntax found, then read in next line
# and print, else just print current line ...
/^.*([^)]*)[ | ]*$/{
N
/^.*([^)]*)[ | ]*\n[ | ]*{.*$/{
# Remove newline from first and appended line ...
s/\n//g
# ... and print what we have to STDOUT ...
p
b skip_default_print
}
# Next line did not start with open curly brace,
# so just print what we currently have and
# skip default print ..
p
b skip_default_print
}
p
:skip_default_print
EOF
# Execute crunch script against code ...
sed -n -f "${sedscript}" "${filename}"
Save above script into a bash script. Then execute it like below:
(example below assumes script is saved as crunch.sh)
./crunch.sh "code.java"
... where "code.java" is the code you wish to "crunch". Result is sent to STDOUT.
Here's input I used:
if(some condition)
{
some code here;
}
else if (some other condition)
{
some more code;
}
public static some_method(something passed here maybe)
{
some other code;
}
And output:
if(some condition){
some code here;
}
else if (some other condition){
some more code;
}
public static some_method(something passed here maybe) {
some other code;
}

validating HTML fields in form using regex using perl

I have a couple of quick questions regarding using regex to validate some fields in a form. But I seem to be having some problems.
so here is the code
$userNameReg = "[a-zA-Z0-9_]+";
$passwordReg = "([a-zA-Z]*)([A-Z]+)([0-9]+)";
$emailReg = "[a-zA-Z0-9_]@[a-zA-Z]\.[a-zA-Z]{2,3}";
if ($onLoad !=1)
{
@controlValue = ($userName, $password, $phoneNumber, $email);
@regex = ($userNameReg, $passwordReg, "phoneNumber", $emailReg);
@validated;
for ($i=0; $i<4; $i++)
{
$retVal= validatecontrols ($controlValue[$i], $regex[$i]);
if ($retVal)
{
$count++;
}
if (!$retVal)
{
$validated[$i]="*"
}
}
sub validatecontrols
{
$ctrlVal = shift();
$regexVal = shift();
if ($ctrlVal =~ /$regexVal/)
{
return 1;
}
if ($ctrlVal !~ /$regexVal/)
{
return 0;
}
}
}
The problem is that it still accepts special characters, and I can't understand why. It does flag the input if I enter a single special character, but if the character is part of a word (at the beginning, middle, or end) the input validates.
Also please disregard the phone number part, because I haven't gotten to that part yet. I still have to create a regex that validates the phone number, digits only, first digit greater than 2.
Thank you all in advance for your help and insight.
Cheers
My guess is that you're missing the start and end anchors. [a-zA-Z0-9_]+ should be ^[a-zA-Z0-9_]+$. Without anchors the pattern succeeds as soon as any part of the string matches; with them it must match the whole string.
I also strongly recommend enabling use strict; and use warnings;. They can save you from a lot of typing mistakes. Just add the following at the beginning of the script:
use strict;
use warnings;
use strict makes Perl reject undeclared variables. In most cases you will need to add my to the first use of each variable (for example my $ctrlVal).
In validatecontrols you don't need the second if statement. You can simply return false like this:
sub validatecontrols
{
my $ctrlVal = shift();
my $regexVal = shift();
if ($ctrlVal =~ /$regexVal/)
{
return 1;
}
return 0;
}
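A small sketch of the anchoring advice, assuming the user name rule from the question. The names $user_name_re and validate_control are illustrative. It uses \A and \z rather than ^ and $, because $ also matches just before a trailing newline.

```perl
use strict;
use warnings;

# Unanchored pattern from the question and an anchored, precompiled variant.
my $unanchored_re = qr/[a-zA-Z0-9_]+/;
my $user_name_re  = qr/\A[a-zA-Z0-9_]+\z/;

# Return 1 if the value matches the pattern, 0 otherwise.
sub validate_control {
    my ($value, $regex) = @_;
    return $value =~ $regex ? 1 : 0;
}

for my $name ('alice_01', 'al!ce', '!') {
    printf "%-10s unanchored=%d anchored=%d\n",
        $name,
        validate_control($name, $unanchored_re),
        validate_control($name, $user_name_re);
}
```

The unanchored pattern accepts 'al!ce' because the substring 'al' matches, which is the behaviour described in the question; the anchored one rejects it.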

Writing a bubble sort using Perl regular expressions

I'm beginning to learn perl and I'm writing a simple bubble sort using regular expressions. However, I can't get it to sort properly (alphabetically, delimiting by whitespace). It just ends up returning the same string. Can someone help? I'm sure it's something really simple. Thanks:
#!/usr/bin/perl
use warnings;
use strict;
my $document=<<EOF;
This is the beginning of my text...#more text here;
EOF
my $continue = 1;
my $swaps = 0;
my $currentWordNumber = 0;
while($continue)
{
$document =~ m#^(\w+\s+){$currentWordNumber}#g;
if($document =~ m#\G(\w+)(\s+)(\w+)#)
{
if($3 lt $1)
{
$document =~ s#\G(\w+)(\s+)(\w+)#$3$2$1#;
$swaps++;
}
else
{
pos($document) = 0;
}
$currentWordNumber++;
}
else
{
$continue = 0 if ($swaps == 0);
$swaps = 0;
$currentWordNumber = 0;
}
}
print $document;
SOLVED: I figured out the problem. I wasn't taking into account punctuation after a word.
If you just want to sort all the words, you don't have to use regular expressions... Simply splitting up the text by newlines and white spaces should be much faster:
sub bsort {
    my @x = @_;
    for my $i (0..$#x) {
        for my $j (0..$i) {
            @x[$i, $j] = @x[$j, $i] if $x[$i] lt $x[$j];
        }
    }
    return @x;
}
print join (" ", bsort(split(/\s+/, $document)));
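Since the asker's fix was about punctuation, here is a hedged, self-contained variant that extracts only runs of word characters before sorting, so "text..." and "here;" are compared as plain words. The sample sentence is an assumption based on the question's text, and extract_words is a hypothetical helper.

```perl
use strict;
use warnings;

# Exchange-based sort from the answer above, using string comparison.
sub bsort {
    my @x = @_;
    for my $i (0 .. $#x) {
        for my $j (0 .. $i) {
            @x[$i, $j] = @x[$j, $i] if $x[$i] lt $x[$j];
        }
    }
    return @x;
}

# In list context, /\w+/g returns every run of word characters,
# which drops punctuation and whitespace in one step.
sub extract_words {
    my ($text) = @_;
    return $text =~ /\w+/g;
}

my $document = "This is the beginning of my text... more text here;";
print join(' ', bsort(extract_words($document))), "\n";
```

The comparison is case-sensitive, so capitalised words sort before lowercase ones; comparing lc($x[$i]) and lc($x[$j]) instead would change that.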

Why isn't my Perl script finding bad indentation with my regex match

My work's coding standard uses this bracket indentation:
some declaration
{
stuff = other stuff;
};
control structure, function, etc()
{
more stuff;
for(some amount of time)
{
do something;
}
more and more stuff;
}
I'm writing a perl script to detect incorrect indentation. Here's what I have in the body of a while(<some-file-handle>):
# $prev holds the previous line in the file
# $current holds the current in the file
if($prev =~ /^(\t*)[^;]+$/ and $current =~ /^(?<=!$1\t)[\{\}].+$/) {
print "$file # line ${.}: Bracket indentation incorrect\n";
}
Here, I'm trying to match:
$prev: A line not ended with a semi-colon, followed by...
$current: A line not having the number of leading tabs+1 of the previous line.
This doesn't seem to match anything, at the moment.
The $prev pattern needs some modification.
It should be something like \t*, then .+, then not ending in a semicolon.
Also, the $current pattern should be something like:
anything ending in ; or { or } that does not have the previous line's number of leading tabs plus one.
EDIT
the perl code to try the $prev
#!/usr/bin/perl -l
open(FP,"example.cpp");
while(<FP>)
{
if($_ =~ /^(\t*)[^;]+$/) {
print "got the line: $_";
}
}
close(FP);
//example.cpp
for(int i = 0;i<10;i++)
{
//not this;
//but this
}
//output
got the line: {
got the line: //but this
got the line: }
it did not detect the line with the for loop ...
am i missing something...
I see a couple of problems...
Your $prev regex matches only lines that have no ; anywhere, which breaks on lines like for (int x = 1; x < 10; x++).
If the indent of the opening { is incorrect, you will not detect that.
Try this instead; it only checks that the line does not end in ; or { (ignoring trailing whitespace).
/^(\s*).*[^{;]\s*$/
Now change your strategy: if you see a line that does not end in { or ;, increment the indent counter.
If you see a line that ends in }; or }, decrement the indent counter.
Compare every line against this:
/^\t{$counter}[^\s]/
so...
$counter = 0;
if (!($curr =~ /^\t{$counter}[^\s]/)) {
    # error detected
}
if ($curr =~ /\};?\s*$/) {
    # line ends in } or };
    $counter--;
} elsif ($curr =~ /^(\s*).*[^{;]\s*$/) {
    # line ends in neither { nor ;
    $counter++;
}
sorry for not styling my code according to your standards... :)
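The counter strategy above can be sketched as a small runnable check. It assumes the style from the question (braces indented one tab deeper than the statement that opens them, body at the same depth as the braces), tabs only, and it ignores comments, strings, and multi-line statements. The name misindented_lines is hypothetical.

```perl
use strict;
use warnings;

# Return the 1-based numbers of lines whose leading whitespace is not
# exactly the expected number of tabs.
sub misindented_lines {
    my @lines = @_;
    my $depth = 0;
    my @bad;
    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        next if $line =~ /^\s*$/;              # ignore blank lines
        my ($indent) = $line =~ /^(\s*)/;
        push @bad, $i + 1 if $indent ne "\t" x $depth;
        if ($line =~ /\};?\s*$/) {             # ends in } or };
            $depth-- if $depth > 0;
        }
        elsif ($line !~ /[{;]\s*$/) {          # ends in neither { nor ;
            $depth++;
        }
    }
    return @bad;
}

my @good  = ("for(;;)", "\t{", "\tdo_something();", "\t}");
my @wrong = ("for(;;)", "{",   "\tdo_something();", "\t}");
print "good:  @{[ misindented_lines(@good) ]}\n";
print "wrong: @{[ misindented_lines(@wrong) ]}\n";
```

In the second sample only line 2 is reported, because the opening brace is not indented under the for statement.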
And you intend to only count tabs (not spaces) for indentation?
Writing this kind of checker is complicated. Just think about all the possible constructs that uses braces that should not change indentation:
s{some}{thing}g
qw{ a b c }
grep { defined } @a
print "This is just a { provided to confuse";
print <<END;
This {
$is = not $code
}
END
But anyway, if the issues above aren't important to you, consider whether the semicolon is important at all in your regex. After all, writing
while($ok)
{
sort { some_op($_) }
grep { check($_) }
my_func(
map { $_->[0] } @list
);
}
Should be possible.
Have you considered looking at Perltidy?
Perltidy is a Perl script that reformats Perl code into set standards. Granted, what you have isn't part of the Perl standard, but you can probably tweak the curly braces via the configuration file Perltidy uses. If all else fails, you can hack through the code. After all, Perltidy is just a Perl script.
I haven't really used it, but it might be worth looking into. Your problem is locating all the various edge cases and making sure you handle them correctly. You can parse 100 programs only to find that the 101st reveals problems in your checker. Perltidy has been used by thousands of people on millions of lines of code. If there is an issue, it has probably already been found.