perl regex replace single array element - regex

I want to read a number of lines in a for loop and split them.
After that I want to do a replace on a single array element.
my @fs = split(';', $line);
$fs[0] =~ s/\"//g;
This doesn't work, however. The line
$fs[0] =~ s/\"//g;
returns a compiler error.
Is there a better way to do this?

Change the line with split to
my @fs = split(/;/, $line);
because split takes a regex as its first operand.
I suspect the parse error you are seeing is due to an error somewhere else because the syntax of the code in your question is correct.
In general, always fix the first error diagnosed by the parser. Good parsers try to recover so as to report as many errors as possible, but this process is not always reliable. What is the exact text of the error you are seeing?
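As a quick check that the two statements are valid on their own, here is a minimal self-contained sketch. The sample value of $line is an assumption made up for illustration; it does not come from the question.

```perl
use strict;
use warnings;

# Hypothetical input line; the real data comes from the asker's loop.
my $line = '"alpha";"beta";"gamma"';

# split takes a pattern as its first operand.
my @fs = split /;/, $line;

# Remove every double quote from the first element only.
$fs[0] =~ s/"//g;

print join('|', @fs), "\n";    # prints: alpha|"beta"|"gamma"
```

If this runs cleanly on your Perl, the reported error most likely comes from a line before these two.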

Related

Multiple grep (with regex's) functions not working in Perl script

Having trouble with a script right now.
Trying to filter out portions of a file and put them into a scalar. Here is the code.
@value = (grep {m/(III[ABC])/g and m//g }<$fh>);
print @value;
@value = (grep { m/[012]iii/g}(<$fh>));
print @value;
When I run the first grep, the values appear in the print statement. But when I run the second grep, the second print statement doesn't print anything. Does adding a second grep cancel out the effectiveness of the first grep?
I know that the first and second greps work, because when I commented out the first grep, the second grep worked.
All I really want to do is filter out information into multiple individual arrays. I am really confused as to how to fix this problem, since I am planning on adding more greps to the script.
The first read on <$fh> gets to the end of the file. Then the second invocation has nothing to read. Thus if you comment out the first one this doesn't happen and the second one works.
The code below adds to the same array. Change to the commented out code if needed. The regex is simplified: the original needs commentary of its own but doesn't affect the actual question. Please put it back the way it was, if that was what you really meant.
You can either rewind the filehandle after all lines have been read
my @vals = grep { /III[ABC]/ } <$fh>;
seek $fh, 0, 0;
# ready for reading again from the beginning
push @vals, grep { /[012]iii/ } <$fh>;
# or: my @vals_2 = grep { /[012]iii/ } <$fh>;
Or you can read all lines into an array that you can then process repeatedly.
my @original = <$fh>;
my @vals = grep { /III[ABC]/ } @original;
push @vals, grep { m/[012]iii/ } @original;
# or assign to a different array
If you don't need the results stored in that order, it would be far more efficient to read the file line by line, processing and adding as you go.
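A minimal sketch of that line-by-line approach follows. The in-memory filehandle, the sample data, and the array name @vals_2 are assumptions for illustration; in the real script $fh would be the already opened file.

```perl
use strict;
use warnings;

# Hypothetical sample data read through an in-memory filehandle.
my $data = "xIIIA\n0iii\nnothing\n2iii IIIB\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my (@vals, @vals_2);
while (my $line = <$fh>) {
    # Each line is read once and tested against every condition.
    push @vals,   $line if $line =~ /III[ABC]/;
    push @vals_2, $line if $line =~ /[012]iii/;
}
close $fh;

print scalar(@vals), " ", scalar(@vals_2), "\n";    # prints: 2 2
```

Because the file is read only once, there is no need for seek, and adding more filters means adding one more line inside the loop.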
Update
I simplified the originally posted regex in order to focus on the question at hand, since the exact condition inside the block has no bearing on it. See the Note below. Thanks to ikegami for bringing it up and for explaining that // "repeats the last successful query".
The m//g is tricky and I removed it.
grep checks a condition and passes a line through if the condition evaluates true. In that scalar context the /g modifier has quite different effects (see the Note below), so I removed it.
For the same reason, the capturing () is unneeded, so I removed it too.
Cleaning up the syntax helps readability here, so I removed the leading m from m/.../.
Note on regex
In scalar context /.../g modifier does the following, per perlrequick:
successive matches against a string will have //g jump from match to match
The empty string pattern m//g also has effects which are far from obvious, stated above.
Taken together these produce non-trivial results in my tests and need mental tracing to understand. I removed them from the code here since leaving them begs a question on whether they are intended trickery or subtle bugs, thus distracting from the actual question -- which they do not affect at all.
I don't know what you think the g modifier does, but it makes no sense here.
I don't know what you think // (a match with an empty pattern) does, but it makes no sense here.
In list context, <$fh> returns all remaining lines in the file. It returns nothing the second time you evaluate it, since you already reached the end of the file the first time you evaluated it.
Fix:
my @lines = <$fh>;
my @values1 = grep { /III[ABC]/ && /.../ } @lines;
my @values2 = grep { /[012]iii/ } @lines;
Of course, replace ... with what you meant to use there.

If Pattern matched in 1st line i need to remove the 4th line

Friends,
I need some help with regex pattern matching and replacing.
I usually use %s/findstring/replacestring/g to match and replace within a single line.
But if my file is something like this
<tracker xid="tracker4795">
<title>MIC-DMI Change Requests</title>
<description>New tracker created </description>
<dateCreated>2010-05-03 15:18:10 EST</dateCreated>
<displayLines>1</displayLines>
<isRequired>false</isRequired>
I need to match <tracker xid.*> and skip all the lines until <displayLine.*> matches; if both patterns match, I need to remove the line
<isRequired>.*
Something like: if the pattern matches in both the 4th and 6th lines, remove the 7th line.
Kindly shed some light on how to achieve this.
You have to match the entire set of lines. For that, note that . does not match a newline character; this must be explicitly specified via \n. With that, you have multiple options:
Match the entire block, use capture groups to excise the line
The pattern is more complex, but this is the general approach:
:%s/\(<tracker xid=.*\n\%(.*\n\)\{3}<displayLines>.*\n\)<isRequired.*\n/\1/g
Match the minimal block, delete separately
This just establishes a match via :global, then uses relative addressing to remove the line.
:g/<tracker xid=.*\n\%(.*\n\)\{3}<displayLines>.*/+5delete
Caveats
Only do this if you are absolutely sure that the XML source is in a consistent, well-known format. Text editors and regular expressions are a quick and ready tool for this, but fundamentally the wrong one. Be aware of this, and don't blame the tool when something goes wrong. For production-grade reliability and automation, please use an XML tool (like XSL transformations).
When you say 'something like this' it looks like what you've got there is XML. I can't say for sure, because 'something like this' covers a lot of defects.
However if it is XML, it's a really bad idea to try and parse it with a regular expression. The reason being that XML is a defined data format with a quite strict specification. If everyone sticks to that spec, then all is fine and dandy.
However, if someone is assuming you will handle their XML as XML, and you're not (because you're using a regular expression), what you will be creating is a brittle piece of code that at some point in the future will just randomly break for no apparent reason - because they stuck to the XML spec, but changed something in an entirely valid way.
So assuming that it is XML, and looks 'something like' the example below - I would suggest using Perl and XML::Twig to parse your data.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $xml;
{ local $/; $xml = <DATA> };
my $data = XML::Twig->new( pretty_print => 'indented' )->parse($xml);
foreach my $element ( $data->root->children('tracker') ) {
    my $xid = $element->att('xid');
    print $xid, "\n";
    foreach my $subelement ( $element->children ) {
        if ( $subelement->name eq 'isRequired' ) {
            # delete the 'isRequired' element
            $subelement->delete;
        }
    }
}
$data->print;
__DATA__
<xml>
<tracker xid="tracker4795">
<title>MIC-DMI Change Requests</title>
<description>New tracker created </description>
<dateCreated>2010-05-03 15:18:10 EST</dateCreated>
<displayLines>1</displayLines>
<isRequired>false</isRequired>
</tracker>
</xml>
If you know the input is in the example format (with only one open-tag per line, and all tracker tags contain a displaylines and isrequired tag), or you can force it to that format, then I think a search-and-replace is too unwieldy, and full XML parsing is "correct" but way more complicated than you need, and you should try a simpler method with the :g command:
:g#<tracker xid#/<displayLine/d
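This searches for each line matching "<tracker xid", then deletes the next line after it that matches "<displayLine". To delete the line that follows the "<displayLine" match instead (the <isRequired> line in the example), add an offset to the address: /<displayLine/+1d.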
Thus you don't need a specific number of lines in between "<tracker" and "<displayLine" so it is more robust to variances in line offsets, but it is still quite fragile to format changes.
However, I repeat the warnings from others: if the format is not easily and consistently predictable then I'd suggest parsing the file line by line in a loop, or using a real XML parser (possibly using Vim's Perl or Python integration), rather than using an :s or :g command.

Perl replace every occurrence differently

In a perl script, I need to replace several strings. At the moment, I use:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/\>$1/g;
The aim is to format in a FASTA file every sequence name. It works well in my case so I don't need to touch this part. However, it happens that a sequence name appears several times in the file. I must not have at the end twice - or more - the same sequence name. I thus need to have for instance:
seqName1
seqName2
etc.
(instead of seqName, seqName, etc.)
Is this possible to somehow process differently every occurrence automatically? I don't know how many sequence there are, if there are similar names, etc. An idea would be to concatenate a random string at every occurrence for instance, hence my question.
Many thanks.
John perfectly solved it and chepner helped with the smart idea to avoid conflicts, here is the final result:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
sub {
return '>'.$1.$i++;
}->();
/eg;
Many many thanks.
I was actually trying to do something like this the other day, here's what I came up with
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
sub {
# return random string
}->();
/eg;
The /e modifier evaluates the replacement as code, not text. I use an anonymous code ref so that I can return at any point.
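As a variation on the same /e technique, here is a minimal sketch that numbers each name separately instead of using one global counter. The %seen hash and the sample FASTA text are assumptions for illustration, not part of the original answers.

```perl
use strict;
use warnings;

# Hypothetical FASTA text with a repeated sequence name.
my $fasta = ">gi_seqName/1 first\nACGT\n>gi_seqName/2 second\nTTGA\n>gi_other/1 third\nGGCC\n";

my %seen;    # hypothetical helper: how often each name has occurred so far

# /e evaluates the replacement as Perl code for every match.
$fasta =~ s/>[^_]+_([^\/]+)[^\n]+/'>' . $1 . ++$seen{$1}/eg;

print $fasta;
```

With this input the headers become >seqName1, >seqName2, and >other1, so a repeated name never produces the same header twice.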

Why regular expression doesn't work with Global identification in Perl?

It is very weird, and I don't have any idea what is the issue!
I have a very big string (length=648745), and I don't know if its length can make this issue, but I'm trying to find some parameters inside it, and push them to an array, like this:
push(@items_ids, [$2, $3]) while ($all_items_list =~ /itemID&(id|num)=([\d]*)\">\#([\d]*)/g);
It doesn't work, it returns an empty array at the end. I thought may be my RegEx is not right, but when I run this code:
while ($all_items_list =~ /itemID&(id|num)=([\d]*)\">\#([\d]*)/){
print "\nItemID=$2 Identity=$3\n";die;
}
it finds the first occurrence, but when I put "g" at the end of the RegEx it can't find it any more...
I know I'm missing something here, Please help me, this is not a hard part of my script and I'm stuck, :( ...
Thanks in advance for your help.
In scalar context, m/.../g starts looking after where a previous successful m/.../g left off. I would suggest resetting the search-position right before the loop:
pos($all_items_list) = undef;
push(@items_ids, [$2, $3]) while ($all_items_list =~ /itemID&(id|num)=([\d]*)\">\#([\d]*)/g);
and seeing if that helps. (See http://perldoc.perl.org/functions/pos.html.)
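To illustrate the effect, here is a minimal sketch of how a leftover search position changes what a /g loop finds. The short string and the earlier match on "id=2" are assumptions for illustration; whether something similar happens in the asker's script is not known.

```perl
use strict;
use warnings;

my $text = 'id=1 id=2 id=3';    # hypothetical stand-in for the large string

# An earlier scalar-context /g match leaves pos($text) just after "id=2".
my $found = ($text =~ /id=2/g);

my @without_reset;
push @without_reset, $1 while $text =~ /id=(\d+)/g;    # resumes after "id=2"

# Recreate the same leftover position, then clear it before looping.
$found = ($text =~ /id=2/g);
pos($text) = undef;

my @with_reset;
push @with_reset, $1 while $text =~ /id=(\d+)/g;

print "without reset: @without_reset\n";    # without reset: 3
print "with reset:    @with_reset\n";       # with reset:    1 2 3
```

A failed /g match also resets the position, which is why the leftover state has to be recreated before the second loop.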

What's a good Perl regex to untaint an absolute path?

Well, I tried and failed so, here I am again.
I need to match my abs path pattern.
/public_html/mystuff/10000001/001/10/01.cnt
I am in taint mode etc..
#!/usr/bin/perl -Tw
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
$ENV{PATH} = "bin:/usr/bin";
delete ($ENV{qw(IFS CDPATH BASH_ENV ENV)});
I need to open the same file a couple times or more and taint forces me to untaint the file name every time. Although I may be doing something else wrong, I still need help constructing this pattern for future reference.
my $file = "$var[5]";
if ($file =~ /(\w{1}[\w-\/]*)/) {
$under = "/$1\.cnt";
} else {
ErroR();
}
You can see by my beginner attempt that I am close to clueless.
I had to add the forward slash and extension to $1 due to my poorly constructed, but working, regex.
So, I need help learning how to fix my expression so $1 represents /public_html/mystuff/10000001/001/10/01.cnt
Could someone hold my hand here and show me how to make:
$file =~ /(\w{1}[\w-\/]*)/ match my absolute path /public_html/mystuff/10000001/001/10/01.cnt ?
Thanks for any assistance.
Edit: Using $ in the pattern (as I did before) is not advisable here because it can match \n at the end of the filename. Use \z instead because it unambiguously matches the end of the string.
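The difference between $ and \z can be seen in a minimal sketch like this one. The value of $name is a made-up example with a trailing newline.

```perl
use strict;
use warnings;

my $name = "01.cnt\n";    # hypothetical value with a trailing newline

my $dollar_ok = $name =~ /^[0-9]{2}\.cnt$/  ? 1 : 0;    # $ allows one final newline
my $z_ok      = $name =~ /^[0-9]{2}\.cnt\z/ ? 1 : 0;    # \z matches only at the very end

print "with \$: $dollar_ok, with \\z: $z_ok\n";    # with $: 1, with \z: 0
```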
Be as specific as possible in what you are matching:
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
if ( $fn =~ m!
    ^(
        /public_html
        /mystuff
        /[0-9]{8}
        /[0-9]{3}
        /[0-9]{2}
        /[0-9]{2}\.cnt
    )\z!x ) {
    print $1, "\n";
}
Alternatively, you can reduce the vertical space taken by the code by putting what I assume to be a common prefix, '/public_html/mystuff', in a variable, combining the various components in a qr// construct (see perldoc perlop), and then using the conditional operator ?::
#!/usr/bin/perl
use strict;
use warnings;
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
my $prefix = '/public_html/mystuff';
my $re = qr!^($prefix/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!;
$fn = $fn =~ $re ? $1 : undef;
die "Filename did not match the requirements" unless defined $fn;
print $fn, "\n";
Also, I cannot reconcile using a relative path as you do in
$ENV{PATH} = "bin:/usr/bin";
with using taint mode. Did you mean the following?
$ENV{PATH} = "/bin:/usr/bin";
You talk about untainting the file path every time. That's probably because you aren't compartmentalizing your program steps.
In general, I break up this sort of program into stages. One of the earlier stages is data validation. Before I let the program continue, I validate all the data that I can. If any of it doesn't fit what I expect, I don't let the program continue. I don't want to get half-way through something important (like inserting stuff into a database) only to discover something is wrong.
So, when you get the data, untaint all of it and store the values in a new data structure. Don't use the original data or the CGI functions after that. The CGI module is just there to hand data to your program. After that, the rest of the program should know as little about CGI as possible.
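One way such a validation stage could look is sketched below. The names %rules, validate_all, and %raw are hypothetical, and the single pattern is borrowed from the other answer; a real program would have one rule per input field.

```perl
use strict;
use warnings;

# Hypothetical rule table: one anchored pattern per expected field.
my %rules = (
    file => qr!\A(/public_html/mystuff/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!,
);

# Validate every field up front and die before any real work starts.
sub validate_all {
    my ($raw) = @_;
    my %clean;
    for my $field (sort keys %rules) {
        my $value = $raw->{$field};
        if (defined $value && $value =~ $rules{$field}) {
            $clean{$field} = $1;    # keep the captured copy, not the original
        }
        else {
            die "Invalid or missing value for '$field'\n";
        }
    }
    return \%clean;
}

# Hypothetical raw input, e.g. collected once from the CGI parameters.
my %raw   = ( file => '/public_html/mystuff/10000001/001/10/01.cnt' );
my $clean = validate_all(\%raw);
print $clean->{file}, "\n";
```

The rest of the program would then read only from $clean, so each value is untainted once rather than at every use.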
I don't know what you are doing, but it's almost always a design smell to take actual filenames as input.