OK, so I need to gather people's names from a text file using Perl. I can find the name lines and print them out with NAME: in front using a regular expression, but I need to capture just the person's name. I want to do this using regular expressions, because there are multiple names to gather in each file.
Example of file input:
NAME: Bigelow, Patrick R DATE: 28 Apr 2014
code so far:
if (/NAME:/){
my @arr = /NAME:\s\S*\s\S*\s\S*/g or next;
print "$_\n" for @arr;
}
Better to use two lines of code: one for stripping out the name (the part between NAME: and DATE:), and one for splitting that name on the comma.
my ($tmp) = /(?<=NAME: )(.*?)(?=\s+DATE:)/g or next;
my @arr = split(/,\s+/, $tmp);
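For example, putting those two lines inside a read loop looks something like this (a sketch; the filehandle name is an assumption):

while (<$fh>) {
    my ($tmp) = /(?<=NAME: )(.*?)(?=\s+DATE:)/ or next;
    my @arr = split /,\s+/, $tmp;   # ("Bigelow", "Patrick R") for the sample line
    print "$_\n" for @arr;
}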
You can use s/regex/replace/ to do what you want. I avoid postfix loops (they are easy to miss and a pain to fix when you add more lines) and the use of $_ (it does not improve readability, is inconsistently implemented, and is global in scope):
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
TEXT_FILE => 'file.txt',
NAME_LINE => qr/^NAME:\s+/,
};
open my $name_fh, "<", TEXT_FILE; # Autodie eliminates need for testing if this worked
while ( my $line = <$name_fh> ) {
chomp $line;
next unless $line =~ NAME_LINE; # Skip lines if they don't contain name
( my $name = $line ) =~ s/NAME:\s+(.+?)\s+DATE.*/$1/;
say qq("$name");
}
close $name_fh;
The regular expression s/NAME:\s+(.+?)\s+DATE.*/$1/ needs a bit of explaining:
The parentheses (...) are a capture group. I can refer to them by the $1 on the substitution side.
The +? is a non-greedy quantifier: it matches the smallest string that still lets the whole regular expression match. The trick is that \s+DATE.* then matches as much as possible, so between the two, all of the spaces between the end of the name and the DATE string are eliminated. It's not the most efficient approach, but it works well.
The qq(...) is just a fancy way of doing double quotes. It makes it easy to put quotes in your quotes this way.
I use say instead of print because say automatically adds the \n on the end. This becomes important in situations where appending . "\n" to something can cause problems.
For example:
print join ": ", @foo . "\n"; # Doesn't work: "." puts @foo in scalar context
say join ": ", @foo; # This works fine.
I have a string that looks like this ("key":["value","value","value"]):
"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]
and I use the following regex to select from the string (the regex is set up in a way where it won't select a string that looks like this: "key":[{"key":"value","key":"value"}]):
(?<=:\[").*?(?="])
Resulting Selection:
google.co.uk","google.com","google.com","google.com","google.co.uk
I want to remove the " characters in that selected string, and I was wondering if there was an easy way to do this using the replace command. Desired result...
"emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]
How do I solve this problem?
If your string indeed has the form "key":["v1", "v2", ... "vN"], you can split off the part that needs to be changed, replace "," by a space in it, and re-assemble:
my @parts = split / (\["\s* | \s*\"]) /x, $string;
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', @parts;
The regex pattern for the separator in split is captured; in that case the separators are also returned in the list, which is helpful here for putting the string back together. Then we need to change the third element of the array.
In this approach we have to change a specific element of the array, so if your format varies even a little, this may not (or still may) be suitable.
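Here is a usage sketch of the above (my own framing; note that this variant ends up joining the values with single spaces, since the answer replaces each "," with a space):

my $string = '"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]';

my @parts = split / (\["\s* | \s*\"]) /x, $string;
$parts[2] =~ s/",\s*"/ /g;
my $processed = join '', @parts;

print "$processed\n";
# "emailDomains":["google.co.uk google.com google.com google.com google.co.uk"]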
This should of course be processed as JSON, using a module. If the format isn't certain, as indicated in a comment, it would be best to try to ensure that you have JSON. Picking out bits and pieces like the above (or below) is a road to madness once requirements slowly start evolving.
The same approach can be used in a regex, and this may in fact have the advantage of being able to scoop up and ignore everything preceding the : (with split, that part may end up as multiple elements if the format isn't exactly as shown, which then affects everything):
$string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
my $e = $2;
my $n = $1 =~ s/",\s*"/ /gr;
$n.$e
}ex;
Here the /e modifier causes the replacement side to be evaluated as code, where we do the same as with the split above. Notes on the regex:
We have to save away $2 first, since it gets reset by the inner regex
The /r modifier†, which doesn't change its target but rather returns the changed string, is what allows us to use the substitution operator on the read-only $1
If nothing gets captured for $2 (and perhaps $1), that means there was no match, and the outcome is simply that $string quietly doesn't change. So if this substitution should always work, you may want to add handling of such unexpected data (see the sketch after these notes)
Don't need a $n above, but can return ($1 =~ s/",\s*"/ /gr) . $e
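For instance, one way to notice that quiet no-match case (my own addition, not part of the original answer) is to check the substitution's return value:

my $changed = $string =~ s{ :\["\s*\K (.*?) ( "\] ) }{
    my $e = $2;                      # save $2 before the inner match clobbers it
    ( $1 =~ s/",\s*"/ /gr ) . $e
}ex;
warn "input not in the expected format; \$string left unchanged\n" unless $changed;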
Or, using lookarounds as attempted
$string =~ s{ (?<=:\[") (.+?) (?="\]) }{ $1 =~ s/",\s*"/ /gr }egx;
which does reduce the amount of code, but may be trickier to work with later.
While this is a direct answer to the question, I think it's the least maintainable.
† This useful modifier, for "non-destructive substitution," appeared in v5.14. In earlier Perl versions we would copy the string and run the regex on that, using the idiom
(my $n = $1) =~ s/",\s*"/ /g;
In the lookarounds example we then need a little more:
$string =~ s{...}{ (my $n = $1) =~ s/",\s*"/ /g; $n }egx;
since the s/// operator returns the number of substitutions made, while we need $n to be what is returned from that whole piece of code in {} (the replacement side), so that it can be used as the replacement.
You can use the following \G-based regex. It starts the match at :[" and then repeatedly captures the values, so the matched text can be replaced in a way that retains only the comma and drops the double quotes.
(:\[")|(?!^)\G([^"]+)"(,)"
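Here is a sketch of how that pattern might be applied with s///ge in Perl; the pattern is from the answer, but the replacement expression is my own guess at what the demo used:

$string =~ s{(:\[")|(?!^)\G([^"]+)"(,)"}{ defined $1 ? $1 : "$2$3 " }ge;
# "emailDomains":["google.co.uk, google.com, google.com, google.com, google.co.uk"]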
Your text is almost proper JSON, so it's really easy to go the final inch and make it so, and then process that:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say postderef/;
no warnings qw/experimental::postderef/;
use JSON::XS; # Install through your OS package manager or a CPAN client
my $str = q/"emailDomains":["google.co.uk","google.com","google.com","google.com","google.co.uk"]/;
my $json = JSON::XS->new();
my $obj = $json->decode("{$str}");
my $fixed = $json->ascii->encode({emailDomains =>
join(', ', $obj->{'emailDomains'}->@*)});
$fixed =~ s/^\{|\}$//g;
say $fixed;
Try Regex: " *, *"
Replace with: ,
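In Perl that would be something like the following (a sketch; note that this answer replaces with a bare comma rather than a comma and a space):

$string =~ s/" *, *"/,/g;
# "emailDomains":["google.co.uk,google.com,google.com,google.com,google.co.uk"]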
I have to parse a line like "abc\",","\"," with a regex in Perl,
and get this result "abc\"," and "\","
I do this
while (/(\s*)/gc) {
if (m{\G(["])([^\1]+)\1,}gc){
say $2;
}
}
but it is wrong, because this regexp goes all the way to the last ",
My question is: how can I jump over the \" and stop at the first ", ?
The following program performs matches according to your specification:
while (<>) {
@arr = ();
while (/("(?:\\"|[^"])*")/) {
push @arr, $1;
$_ = $';
}
print join(' ', @arr), "\n";
}
Input file input.txt:
"abc", "def"
"abc\",","\","
Output:
$ ./test.pl < input.txt
"abc" "def"
"abc\"," "\","
It can be improved to match more strictly, because in this form it also accepts a lot of input that is probably not desirable, but it serves as a first pointer. Additionally, it is better to match a CSV file with the corresponding module rather than with regular expressions, but you have not stated whether your input really is a CSV file.
Don't reinvent the wheel. If you have CSV, use a CSV parser.
use Text::CSV_XS qw( );
my $string = '"abc\",","\","';
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
$csv->parse($string);
my @fields = $csv->fields();
Regexes aren't the best tool for this task. The standard Text::ParseWords module does this easily.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::ParseWords;
my $line = '"abc\",","\","';
my @fields = parse_line(',', 1, $line);
for (0 .. $#fields) {
say "$_: $fields[$_]"
}
The output is:
0: "abc\","
1: "\","
split /(?<!\\)",(?<!\\)"/, $_
(preceded by cleaning the boundary of $_ with s/^"// && s/"$//, since the enclosing quotes didn't need to be part of the input string definition, but you have them)
returns directly the array you want (there is no need for an external loop, since the looping happens inside the core Perl function split; you might add \s* around the comma depending on how the string is provided).
But (just a note, since you didn't mention it) there could be a deeper case.
If \" means ", you possibly also have \\ meaning \, so you might see \\\" and \\". The latter (more generally, an even number of \ preceding a ") is complicated to handle with a one-line regexp, because lookbehind is only implemented for fixed lengths: the unsupported form (?<!\\(?:\\\\)*)", which would also correctly recognize a string delimiter that follows an escaped backslash (the \\" sequence) rather than an escaped quote \", cannot be used here, so less efficient code than mine would be required. Again, this is a marginal consideration that only matters if \\ hypothetically has to be interpreted too.
I'm trying to create a regex as following :
print $time . "\n"; --> match only print because time is a variable ($ before)
$epoc = time(); --> match only time
My regex for the moment is /(?-xism:\b(print|time)\b)/g, but it matches time in $time in the first example.
I tried things like [^\$] but then it doesn't match print anymore.
(I will have more keywords, like print|time|...|...)
Thanks
Parsing Perl code is a common and useful teaching exercise, since the student must understand both the parsing techniques and the code that they're trying to parse.
However, to do this properly, the best advice is to use PPI.
The following script parses itself and outputs all of the barewords. If you wanted to, you could compare the list of barewords to the ones that you're trying to match. Note that this will avoid matching things within strings, comments, etc.
use strict;
use warnings;
use PPI;
#my $src = do {local $/; <DATA>}; # Could analyze the smaller code in __DATA__ instead
my $src = do {
local @ARGV = $0;
local $/;
<>;
};
# Load a document
my $doc = PPI::Document->new( \$src );
# Find all the barewords within the doc
my $barewords = $doc->find( 'PPI::Token::Word' );
for (@$barewords) {
print $_->content, "\n";
}
__DATA__
use strict;
use warnings;
my $time = time;
print $time . "\n";
Outputs:
use
strict
use
warnings
use
PPI
my
do
local
local
my
PPI::Document
new
my
find
for
print
content
__DATA__
What you need is a negative lookbehind (?<!\$), it's zero-width so it doesn't "consume" characters.
(?<!\$)a means match a if not preceded with a literal $. Note that we escaped $ since it means end of string (or line depending on the m modifier).
Your regex will look like (?-xism:\b(?<!\$)(print|time)\b).
I'm wondering why you are turning off the xism modifiers. They are off by default. So just use /\b(?<!\$)(?:print|time)\b/g as the pattern.
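A quick sketch of that pattern run against the question's two sample lines:

my @lines = ('print $time . "\n";', '$epoc = time();');
for my $line (@lines) {
    my @keywords = $line =~ /\b(?<!\$)(?:print|time)\b/g;
    print "$line  =>  @keywords\n";   # "print" for the first line, "time" for the second
}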
I am studying regular expressions in Perl.
I want to write a script that accepts a C source code file and finds strings.
This is my code:
my ($file1) = @ARGV;
open my $fh1, '<', $file1;
while(<>)
{
@words = split(/\s/, $_);
$newMsg = join '', @words;
push @strings, ($newMsg =~ m/"(.*\\*.*\\*.*\\*.*)"/) if ($newMsg =~ /".*\\*.*\\*.*\\*.*"/);
print Dumper(\@strings);
foreach (@strings)
{
print"strings: $_\n";
}
}
but I have a problem matching multi-line strings like this:
const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";
What should I do?
Here is a fun solution. It uses MarpaX::Languages::C::AST, an experimental C parser. We can use the c2ast.pl program that ships with the module to convert a piece of C source file to an abstract syntax tree, which we dump to some file (using Data::Dumper). We can then extract all strings with a bit of magic.
Unfortunately, the AST objects have no methods, but as they are autogenerated, we know how they look on the inside.
They are blessed arrayrefs.
Some contain a single unblessed arrayref of items,
Others contain zero or more items (lexemes or objects)
“Lexemes” are an arrayref with two fields of location information, and the string contents at index 2.
This information can be extracted from the grammar.
The code:
use strict; use warnings;
use Scalar::Util 'blessed';
use feature 'say';
our $VAR1;
require "test.dump"; # populates $VAR1
my @strings = map extract_value($_), find_strings($$VAR1);
say for @strings;
sub find_strings {
my $ast = shift;
return $ast if $ast->isa("C::AST::string");
return map find_strings($_), map flatten($_), @$ast;
}
sub flatten {
my $thing = shift;
return $thing if blessed($thing);
return map flatten($_), @$thing if ref($thing) eq "ARRAY";
return (); # we are not interested in other references, or unblessed data
}
sub extract_value {
my $string = shift;
return unless blessed($string->[0]);
return unless $string->[0]->isa("C::AST::stringLiteral");
return $string->[0][0][2];
}
A rewrite of find_strings from recursion to iteration:
sub find_strings {
my @unvisited = @_;
my @found;
while (my $ast = shift @unvisited) {
if ($ast->isa("C::AST::string")) {
push @found, $ast;
} else {
push @unvisited, map flatten($_), @$ast;
}
}
return @found;
}
The test C code:
/* A "comment" */
#include <stdio.h>
static const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";
int main() {
printf("Hello %s:\n%s\n", "World", text2);
return 0;
}
I ran the commands
$ perl $(which c2ast.pl) test.c -dump >test.dump;
$ perl find-strings.pl
Which produced the output
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"World"
"Hello %s\n"
""
""
""
""
""
""
Notice how there are some empty strings that are not from our source code; they come from somewhere in the included files. Filtering those out would probably not be impossible, but it is a bit impractical.
It appears you're trying to use the following regular expression to capture multiple lines in a string:
my $your_regexp = qr{
(
.* # anything
\\* # any number of backslashes
.* # anything
\\* # any number of backslashes
.* # anything
\\* # any number of backslashes
.* # anything
)
}x;
But it looks more like a grasp of desperation than a deliberately thought-out plan.
So you've got two problems:
find everything between double quotes (")
handle the situation where there might be multiple lines between those quotes
Regular expressions can match across multiple lines. The /s modifier does this. So try:
my $your_new_regexp = qr{
\" # opening quote mark
([^\"]+) # anything that's not a quote mark, capture
\" # closing quote mark
}xs;
You might actually have a 3rd problem:
remove trailing backslash/newline pairs from strings
You could handle this by doing a search-replace:
foreach ( @strings ) {
$_ =~ s/\\\n//g;
}
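Putting the two pieces together, a rough sketch of the whole thing (my own consolidation; $fh and the variable names are assumptions):

# Slurp the whole file so quoted strings can span lines, grab every
# double-quoted run, then strip the backslash/newline continuations.
my $code = do { local $/; <$fh> };

my @strings = $code =~ m{ " ([^"]+) " }gx;   # [^"]+ already matches newlines
s/\\\n//g for @strings;

print "strings: $_\n" for @strings;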
Here is a simple way of extracting all strings in a source file. There is an important decision we can make: Do we preprocess the code? If not, we may miss some strings if they are generated via macros. We would also have to treat the # as a comment character.
As this is a quick-and-dirty solution, syntactic correctness of the C code is not an issue. We will however honour comments.
Now if the source was pre-processed (with gcc -E source.c), then multiline strings are already folded into one line! Also, comments are already removed. Sweet. The only comments that remain mention line numbers and source files for debugging purposes. Basically, all that we have to do is
$ gcc -E source.c | perl -nE'
next if /^#/; # skip line directives etc.
say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'
Output (with the test file from my other answer as input):
""
"__isoc99_fscanf"
""
"__isoc99_scanf"
""
"__isoc99_sscanf"
""
"__isoc99_vfscanf"
""
"__isoc99_vscanf"
""
"__isoc99_vsscanf"
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
So yes, there is a lot of garbage here (they seem to come from __asm__ blocks), but this works astonishingly well.
Note the regex I used: /(" (?:[^"\\]+ | \\.)* ")/x. The pattern inside the capture can be explained as
" # a literal '"'
(?: # the begin of a non-capturing group
[^"\\]+ # a character class that matches anything but '"' or '\', repeated once or more
|
\\. # an escape sequence like '\n', '\"', '\\' ...
)* # zero or more times
" # closing '"'
What are the limitations of this solution?
We need a preprocessor
This code was tested with gcc
clang also supports the -E option, but I have no idea how the output is formatted.
Character literals are a failure mode, e.g. myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
We also extract strings from other source files. (false positives)
Oh wait, we can fix the last bit by parsing the source file comments which the preprocessor inserted. They look like
# 29 "/usr/include/stdio.h" 2 3 4
So if we remember the current filename, and compare it to the filename we want, we can skip unwanted strings. This time, I'll write it as a full script instead of a one-liner.
use strict; use warnings;
use autodie; # automatic error handling
use feature 'say';
my $source = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;
# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;
my $file;
while (<$preprocessed>) {
$file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
next if /^#/;
next if $file ne qq("$source");
say $1 while /($string_re)/xg;
}
Usage: $ perl extract-strings.pl source.c
This now produces the output:
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
If you cannot use the convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. Basically, you want to slurp in the whole file at once, not iterate it line by line. Then you skip over any comments. Do not forget to ignore preprocessor directives as well. After that, we can extract the strings as usual. Essentially, you have to rewrite the grammar
Start → Comment Start
Start → String Start
Start → Whatever Start
Start → End
to a regex. As the above is a regular language, this isn't too hard.
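For what it's worth, here is a rough sketch of that grammar turned into a regex (my own illustration, and certainly not bulletproof; character literals like '"' will still confuse it):

# Sketch only: $code is assumed to hold the slurped source file.
my @strings;
while (
    $code =~ m{
          /\* .*? \*/                     # block comment       -> skip
        | // [^\n]*                       # line comment        -> skip
        | ^ [ \t]* \# [^\n]*              # preprocessor line   -> skip
        | ( " (?: [^"\\]+ | \\. )* " )    # string literal      -> capture
    }gmsx
) {
    push @strings, $1 if defined $1;
}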
I'm trying to find a way to replace spaces and double quotes with pipes (||) while leaving the spaces within the double quotes untouched.
For example, it would make something like 'word "word word" word' into 'word||word word||word' and another like 'word word word' into 'word||word||word'.
Right now I have this to work off of:
[%- MACRO typestrip(value) PERL -%]
my $htmlVal = $stash->get('value');
$htmlVal =~ s/"/||/g;
print $htmlVal
[%- END -%]
Which handles replacing double quotes with pipes just fine.
I don't know how simple or complex this should be or if it can even be done, since I have no actual background in programming and, while I have worked with some Perl, it's never been this kind before, so I apologize if I'm not doing a good job of explaining this.
I think it might be easier to use the core module Text::ParseWords to split on non-quoted whitespace, then rejoin the "words" with pipes.
#!/usr/bin/env perl
use warnings;
use strict;
use Text::ParseWords;
while (my $line = <DATA>) {
print space2pipes($line);
print "\n";
}
sub space2pipes {
my $line = shift;
chomp $line;
my @words = parse_line( qr/\s+/, 0, $line );
return join '||', @words;
}
__DATA__
word "word word" word
word word word
Putting this into your templating engine is left as an exercise for the reader :-)
This is related to a frequently-asked question, answered in section 4 of the Perl FAQ.
How can I split a [character]-delimited string except when inside [character]?
Several modules can handle this sort of parsing—Text::Balanced, Text::CSV, Text::CSV_XS, and Text::ParseWords, among others.
Take the example case of trying to split a string that is comma-separated into its different fields. You can’t use split(/,/) because you shouldn’t split if the comma is inside quotes. For example, take a data line like this:
SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"
Due to the restriction of the quotes, this is a fairly complex problem. Thankfully, we have Jeffrey Friedl, author of Mastering Regular Expressions, to handle these for us. He suggests (assuming your string is contained in $text):
my @new = ();
push(@new, $+) while $text =~ m{
# groups the phrase inside the quotes
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text,-1,1) eq ',';
If you want to represent quotation marks inside a quotation-mark-delimited field, escape them with backslashes (e.g., "like \"this\"").
Alternatively, the Text::ParseWords module (part of the standard Perl distribution) lets you say:
use Text::ParseWords;
@new = quotewords(",", 0, $text);
For parsing or generating CSV, though, using Text::CSV rather than implementing it yourself is highly recommended; you’ll save yourself odd bugs popping up later by just using code which has already been tried and tested in production for years.
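As an aside, a minimal sketch of that Text::CSV route on the sample line above (my own illustration, not part of the FAQ text):

use Text::CSV;

my $text = 'SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"';
my $csv  = Text::CSV->new({ binary => 1 });
$csv->parse($text) or die "CSV parse failed\n";
print "$_\n" for $csv->fields;   # commas inside quoted fields stay intact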
Adapting Friedl's m//g technique to your situation gives
my $htmlVal = 'word "word word" word';
my @chunks;
push @chunks, $+ while $htmlVal =~ m{
"([^\"\\]*(?:\\.[^\"\\]*)*)"
| (\S+)
}gx;
$htmlVal = join "||", @chunks;
print $htmlVal, "\n";
Output:
word||word word||word
Looking back, it turns out that this is an application of Randal’s Rule, as dubbed in Regular Expression Mastery by Mark Dominus:
Randal's Rule
Randal Schwartz (author of Learning Perl [and also a Stack Overflow user]) says:
Use capturing or m//g when you know what you want to keep.
Use split when you know what you want to throw away.
In your situation, you know what you want to keep, so use m//g to hang on to the text within quotes or otherwise separated by whitespace.
While Joel's answer is fine, things can be simplified a bit by specifically using shellwords to tokenize lines:
#!/usr/bin/env perl
use strict; use warnings;
use Text::ParseWords qw( shellwords );
my @strings = (
'word "word word" word',
'word "word word" "word word"',
);
@strings = map join('||', shellwords($_)), @strings;
use YAML;
print Dump \@strings;
Isn't that more readable than a bunch of regex-gobbledygook?
Seems possible and might be useful if only a regex is applicable:
$htmlVal =~ s/(?:"([^"]+)"(\s*))|(?:(\S+)(\s*))/($1||$3).($2||$4?'||':'')/eg;
(Might be beautified a bit after closer introspection.)
input:
my $htmlVal ='word "word word" word';
output:
word||word word||word
Original code has been modified after failing this case:
my $htmlVal ='word "word word" "word word"';
will now work too:
word||word word||word word
Explanation:
$htmlVal =~ s/
(?: " ([^"]+) " (\s*)) # search "abc abc" ($1), End ($2)
| # OR
(?: (\S+) (\s*)) # abcd ($3), End ($4)
/
($1||$3) . ($2||$4 ? '||' : '') # decide on $1/$2 or $3/$4
/exg;
Regards
rbo