I have a file of the following:
Question:What color is the sky?
Explanation:The sky reflects the ocean.
Question:Why did the chicken cross the road?
Explanation:He was hungry.
What I'm trying to obtain is a list of ("What color is the sky?", "Why did the chicken cross the road")
I'm trying to use perl regex to parse this file, but with no luck.
I have the entire contents of my file in a string called $file, and this is what I'm trying
my @questions = ($file =~ /Question:(.*)\n/g);
But this always just returns the entire $file string to me.
Your (.*) is greedily matching the whole line until it gets to the \n, which is probably a result of how you are getting the string.
You can add a ? to make the match not greedy.
So try
my @questions = ($file =~ /Question:(.*?\?)/g);
Notice that I escaped the ? as \?, so the regex matches up to and including the question mark.
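As a minimal sketch of that non-greedy match in list context (the sample text is hard-coded into $file here, which is an assumption standing in for however the file was actually read):

```perl
use strict;
use warnings;

# Assumption: the file content is already in $file, one entry per line.
my $file = "Question:What color is the sky?\n"
         . "Explanation:The sky reflects the ocean.\n"
         . "Question:Why did the chicken cross the road?\n"
         . "Explanation:He was hungry.\n";

# In list context, m//g returns the capture from every match.
# .*? stops at the first literal question mark after "Question:".
my @questions = ($file =~ /Question:(.*?\?)/g);

print "$_\n" for @questions;
```

Note that this relies on every question ending in a question mark; a question without one would run on into the following lines only if . were allowed to match newlines, which it is not by default.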
Putting the whole file into one variable will use too much memory if the file is large; a better way is to process the file line by line.
For example you could do something like
my @questions;
while (<>) {
    chomp;
    if (m/Question:(.*)/) {
        push @questions, $1;
    }
}
Some explanations:
I/O Operators of perlop:
Input from <> comes either from standard input, or from each file listed on the command line.
I assume some sort of regex would be used to accomplish this?
I need to get it where each word consists of 2 or more characters, start with a letter, and the remaining characters consist of letters, digits, and underscores.
This is the code I currently have, although it isn't very close to my desired output:
while (my $line=<>) {
# remove leading and trailing whitespace
$line =~ s/^\s+|\s+$//g;
$line = lc $line;
@array = split / /, $line;
foreach my $a (@array){
$a =~ s/[\$##~!&*()\[\];.,:?^ `\\\/]+//g;
push(@list, "$a");
}
}
A sample input would be:
#!/usr/bin/perl -w
use strict;
# This line will print a hello world line.
print "Hello world!\n";
exit 0;
And the desired output would be (alphabetical order):
bin
exit
hello
hello
line
perl
print
print
strict
this
use
usr
will
world
my @matches = $string =~ /\b([a-z][a-z0-9_]+)/ig;
If case-insensitive matching needs to apply only to a subpattern, the modifier can be embedded in it
/... \b((?i)[a-z][a-z0-9_]+) .../
(or, it can be turned off after the subpattern, (?i)pattern(?-i))
That [a-zA-Z0-9_] goes as \w, a "word character", if that's indeed exactly what is needed.
The above regex picks words as required without a need to first split the line on space, done in the shown program. Can apply it on the whole line (or on the whole text for that matter), perhaps after the shown stripping of the various special characters.†
There is a question of some other cases -- how about hyphens? Apostrophes? Tilde? Those aren't found in identifiers, while this appears to be intended to process programming text, but comments are included; what other legitimate characters may there be?
Note on split-ing on whitespace
The shown split / /, $line splits on exactly that one space. Better is split /\s+/, $line -- or, better yet is to use split's special pattern split ' ', $line: split on any number of any consecutive whitespace, and where leading and trailing spaces are discarded.
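A small sketch of the difference, using a made-up input line with leading, trailing, and repeated spaces (the string and variable names are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical line: two leading spaces, three in the middle, two trailing.
my $line = "  print   \"Hello\"  ";

# A single-space pattern yields an empty field for every extra space
# (trailing empty fields are dropped, leading ones are kept).
my @single = split / /, $line;

# The special ' ' pattern skips leading whitespace and splits on runs.
my @special = split ' ', $line;

printf "split / /: %d fields\n", scalar @single;
printf "split ' ': %d fields\n", scalar @special;
```

With this input the first form returns six fields, four of them empty, while the second returns just the two words.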
† The shown example is correctly processed as desired by the given regex alone
use strict;
use warnings;
use feature 'say';
use Path::Tiny qw(path); # convenience, to slurp the file
my $fn = shift // die "Usage: $0 filename\n";
my @matches = sort map { lc }
    path($fn)->slurp =~ /\b([a-z][a-z0-9_]+)/ig;
say for @matches;
I threw in sorting and lower-casing to match the sample code in the question but all processing is done with the shown regex on the file's content in a string.
Output is as desired (except that line and world appear twice here, which is correct).
Note that lc can instead be applied to the string holding the file content, which is then processed with the regex; this is more efficient. While in principle this is not the same thing, in this case it may be acceptable
perl -MPath::Tiny -wE'$f = shift // die "Need filename\n";
@m = sort lc(path($f)->slurp) =~ /\b([a-z]\w+)/ig;
say for @m'
Here I actually used \w. Adjust to the actual character to match, if different.
Curiously, this can be done with one of those long, typical Perl one-liners
$ perl -lwe'print for sort grep /^\pL/ && length > 1, map { split /\W+/ } map lc, <>' a.txt
bin
exit
hello
hello
line
line
perl
print
print
strict
this
use
usr
will
world
world
Let's go through that and see what we can learn. This line reads from right to left.
a.txt is the argument file to read
<> is the diamond operator, reading the lines from the file. Since this is list context, it will exhaust the file handle and return all the lines.
map lc, short for map { lc($_) } will apply the lc function on all the lines and return the result.
map { split /\W+/ } is a multi-purpose operation. It will remove the unwanted characters (the non-word characters), and also split the line there, and return a list of all those words.
grep /^\pL/ && length > 1 keeps only the strings that begin with a letter (\pL) and are longer than one character, and returns them.
sort sorts alphabetically the list coming in from the right and returns it left
for is a for-loop, applied to the incoming list, in the post-fix style.
print is short for print $_, and it will print once for each list item in the for loop.
The -l switch in the perl command will "fix" line endings for us (remove them from input, add them in output). This will make the print pretty at the end.
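For reference, here is a rough sketch of the same pipeline unrolled into separate steps. The @lines array is a hard-coded stand-in for what <> would read from a.txt, and the intermediate variable names are made up for illustration:

```perl
use strict;
use warnings;

# Assumption: a few sample lines instead of reading a.txt through <>.
my @lines = (
    "#!/usr/bin/perl -w\n",
    "use strict;\n",
    "print \"Hello world!\\n\";\n",
);

my @lower  = map { lc } @lines;                            # lower-case each line
my @words  = map { split /\W+/ } @lower;                   # split on runs of non-word characters
my @kept   = grep { /^\pL/ && length($_) > 1 } @words;     # starts with a letter, 2+ characters
my @sorted = sort @kept;                                   # alphabetical order

print "$_\n" for @sorted;
```

The empty leading field that split produces for the shebang line, and one-character leftovers such as the w from -w, are both discarded by the grep step.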
I won't say this will produce a perfect result, but you should be able to pick up some techniques to finish your own program.
I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where one or more words are enclosed in tags. In those instances where there is more than one word (usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as it is easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in the @ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV, drop the -p switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).
This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
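To see why the lookahead matters, here is a small sketch comparing it with a version that consumes the second word. The variable names $with and $without are only for illustration:

```perl
use strict;
use warnings;

my $with    = "<gl>aaa-bbb=ccc</gl>";
my $without = $with;

# Lookahead: the second word is checked but not consumed,
# so the next match can start on it.
$with =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;

# No lookahead: "bbb" is consumed by the first match,
# so "bbb=ccc" is never seen as a word-separator-word sequence.
$without =~ s/(\w+)([\-=])(\w+)/$1<\/gl>$2<gl>$3/g;

print "$with\n";
print "$without\n";
```

The first line printed has every word wrapped; the second still contains bbb=ccc inside a single pair of tags.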
The s/ at the beginning is a substitution, so the command would be something like:
cat in | perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' > out
The -p switch prints $_ after each line is processed, so nothing needs to follow the substitution; its return value (the number of substitutions made) is simply discarded.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.
Say I have a fixed variable:
$f_variable = "hello.exe";
and I want to search through a file line by line and find the path that contains this word, for example:
/Desktop/Downloads/hello.exe
or
/Desktop/Downloads/hello_qwdqd.exe
Let's say that the file extension can be either exe or ex,
and I wrote this line of code:
if ($line =~ m/(\/$f_variable.*\.[exe]+)
Obviously it won't work, because the line of code above is actually:
if ($line =~ m/(\/hello.exe.*\.[exe]+)
which will not match anything.
So my question is what changes should I make in order to match and capture the whole path properly without changing the value of $f_variable?
What you wrote isn't even a complete regex. Though you've not said clearly what's in the file (Are the lines just paths?), you probably want something like:
my $pattern = $f_variable;
$pattern =~ s/\.exe?$//;
if ($line =~ m{(/\S*\Q$pattern\E[^/]*\.exe?)}) {
print "$1\n";
}
This removes the .ex or .exe suffix from the file name to get the base name, then matches the first string that starts with a slash and contains the base name, including any leading non-space characters and any trailing non-slash characters, and ends in .ex or .exe.
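A quick sketch of how that behaves on the two paths from the question, plus one made-up path (other.exe) that should not match:

```perl
use strict;
use warnings;

my $f_variable = "hello.exe";

# Strip the .ex/.exe suffix to get the base name ("hello").
(my $pattern = $f_variable) =~ s/\.exe?$//;

my @found;
for my $line ("/Desktop/Downloads/hello.exe",
              "/Desktop/Downloads/hello_qwdqd.exe",
              "/Desktop/Downloads/other.exe") {      # hypothetical non-matching path
    if ($line =~ m{(/\S*\Q$pattern\E[^/]*\.exe?)}) {
        push @found, $1;
        print "match: $1\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

$f_variable itself is left untouched, since the suffix is removed from a copy.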
I am trying to search for a substring and replace the whole string if the substring is found. In the example below, someVal could be any value that is unknown to me.
How can I search for someServer.com and replace the whole string $oldUrl with $newUrl?
I can do it on the whole string just fine:
$directory = "/var/tftpboot";
my $oldUrl = "someVal.someServer.com";
my $newUrl = "someNewVal.someNewServer.com";
opendir( DIR, $directory ) or die $!;
while ( my $files = readdir(DIR) ) {
next unless ( $files =~ m/\.cfg$/ );
open my $in, "<", "$directory/$files";
open my $out, ">", "$directory/temp.txt";
while (<$in>) {
s/.*$oldUrl.*/$newUrl/;
print $out $_;
}
rename "$directory/temp.txt", "$directory/$files";
}
Your script will delete much of your content because you are surrounding the match with .*. This will match any character except newline, as many times as it can, from start to end of each line, and replace it.
The functionality that you are after already exists in Perl, the use of the -pi command line switches, so it would be a good idea to make use of it rather than trying to make your own, which works exactly the same way. You do not need a one-liner to use the in-place edit. You can do this:
perl -pi script.pl *.cfg
The script should contain the name definitions and substitutions, and any error checking you need.
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/\Q$old\E/$new/g;
This is the simplest possible solution, when running with the -pi switches, as I showed above. The \Q ... \E is the quotemeta escape, which escapes meta characters in your string (highly recommended).
You might want to prevent partial matches. If you are matching foo.bar, you may not want to match foo.bar.baz, or snafoo.bar. To prevent partial matching, you can put in anchors of different kinds.
(?<!\S) -- do not allow any non-whitespace before match
\b -- match word boundary
Word boundary would be suitable if you want to replace server1.foo.bar in the above example, but not snafoo.bar. Otherwise use whitespace boundary. The reason we do a double negation with a negative lookaround assertion and negated character class is to allow beginning and end of line matches.
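A minimal sketch comparing the two kinds of anchors on the example names above (the %ws and %wb hashes are just scratch variables for this illustration):

```perl
use strict;
use warnings;

my $old = "foo.bar";
my (%ws, %wb);

for my $text ("foo.bar", "snafoo.bar", "server1.foo.bar", "foo.bar.baz") {
    # Whitespace boundary: nothing but whitespace (or the string edge) on either side.
    $ws{$text} = $text =~ /(?<!\S)\Q$old\E(?!\S)/ ? "yes" : "no";

    # Word boundary: a dot next to the match still counts as a boundary.
    $wb{$text} = $text =~ /\b\Q$old\E\b/ ? "yes" : "no";

    printf "%-16s whitespace=%-3s word=%s\n", $text, $ws{$text}, $wb{$text};
}
```

Only the bare foo.bar passes the whitespace test, while the word-boundary test also accepts server1.foo.bar and foo.bar.baz and rejects only snafoo.bar.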
So, to sum up, I would do:
use strict;
use warnings;
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/(?<!\S)\Q$old\E(?!\S)/$new/g;
And run it with
perl -pi script.pl *.cfg
If you want to try it out beforehand (highly recommended!), just remove the -i switch, which will make the script print to standard output (your terminal) instead. You can then run a diff on the files to inspect the difference. E.g.:
$ perl -p script.pl test.cfg > test_replaced.cfg
$ diff test.cfg test_replaced.cfg
You will have to decide whether word boundary is more desirable, in which case you replace the lookaround assertions with \b.
Always use
use strict;
use warnings;
Even in small scripts like this. It will save you time and headaches.
If you want to match and replace any subdomain, then you should devise a specific regular expression to match them.
\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com
The following is a rewrite of your script using more Modern Perl techniques, including Path::Class to handle file and directory operations in a cross platform way and $INPLACE_EDIT to automatically handle the editing of a file.
use strict;
use warnings;
use autodie;
use Path::Class;
my $dir = dir("/var/tftpboot");
while (my $file = $dir->next) {
next unless $file =~ m/\.cfg$/;
local @ARGV = "$file";
local $^I = '.bak';
while (<>) {
s/\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com\b/someNewVal.someNewServer.com/;
print;
}
#unlink "$file$^I"; # Optionally delete backup
}
Watch for the Dot-Star: it matches everything that surrounds the old URL, so the only thing remaining on the line will be the new URL:
s/.*$oldUrl.*/$newUrl/;
Better:
s/$oldUrl/$newUrl/;
Also, you might need to close the output file before you try to rename it.
If the old URL contains special characters (dots, asterisks, dollar signs...) you might need to use \Q$oldUrl to suppress their special meaning in the regex pattern.
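A short sketch of what goes wrong without the quoting, using a made-up line that contains no dots at all:

```perl
use strict;
use warnings;

my $oldUrl = "someVal.someServer.com";
my $line   = "someValXsomeServerYcom";   # hypothetical line, no dots in it

# Unquoted: each "." in $oldUrl is a regex wildcard and matches X and Y.
my $unquoted = $line =~ /$oldUrl/ ? 1 : 0;

# Quoted: \Q...\E makes each "." match only a literal dot.
my $quoted = $line =~ /\Q$oldUrl\E/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```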
I'm not too familiar with regex but I know what I need to find-
I have a long list of data separated by newlines, and I need to delete all the lines of data that contain a string "(V)". The lines are of variable length, so I guess something to do with selecting everything between two newline characters if there's a (V) inside?
Try searching for this regular expression:
^.*\(V\).*$
Explanation:
^ start of line
.* any characters apart from new line
\( open parenthesis (escaped to avoid special behaviour)
V V
\) close parenthesis (escaped to avoid special behaviour)
.* any characters apart from new line
$ end of line (not strictly needed here, included only for clarity)
Depending on your language you may need to add delimiters such as / and/or quotes " around the regular expression and you may need to enable multiline mode.
Here's an online example showing it working: Rubular
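In Perl, for instance, a sketch of this could look like the following. It assumes the data is already in a string ($data is a made-up sample), and it swaps the trailing $ for \n so that the line terminator is deleted along with the line:

```perl
use strict;
use warnings;

# Hypothetical sample data, one record per line.
my $data = "alpha\nbeta (V)\ngamma\n(V) delta\n";

# /m lets ^ match at the start of every line; /g repeats for all lines.
# Matching the \n as well removes the whole line, not just its content.
(my $kept = $data) =~ s/^.*\(V\).*\n//mg;

print $kept;
```

A final line that contains (V) but has no trailing newline would not be removed by this variant.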
If the data is indeed rather large, then running a single regex against the whole string would be a bad idea. Instead, a simple solution like this Perl script could work for you:
open my $fh, '<', 'data.txt' or die $!;
while (my $line = <$fh>) {
if ($line =~ m/\(V\)/) {
next;
}
print $line;
}
close $fh;
This script reads the data file one line at a time and prints the lines that do not contain "(V)" to stdout. (You obviously could replace the "print" with a different data processing task)
Use the UNIX command grep, if you have access to such a system.
$ grep -v '(V)' data.txt
Grep matches all lines containing "(V)" in data.txt, and shows only the lines not matching (-v).