Regex to get just everything in CAPS - regex

I am looking for a regex to get just about the words in CAPS
for eg : I have an array that is storing the file paths and these could be in any following pattern
images/p/n/ct/XYZ-WW_V1.jpg
images/p/c/ABC-TY_V2.jpg
So basically I want just "XYZ-WW" and "ABC-TY" .
Any suggestions what regex to use in my split code . I am using the following
foreach (#filefound){
my #result = split('_',$_);
push #split1, $result[0];
}
This is just splitting at the _ and I am accessing the [0] the value but now I want to get just the part that is in CAPS .
Any Suggestions please !!

No reason to use split at all. Just grab the bits you want via a regular expression. From your example, it looks like you want everything which is made of capital ASCII letters and dashes:
my #bignames;
foreach (#filefound){
if ( /([A-Z-]+)/ ) {
push #bignames, $1;
}
}

I'm thinking this should work:
[A-Z]+-[A-Z]+

foreach (#filefound) {
if ($_ ~= /.*([A-Z]+-[A-Z]+)_[A-Z]\d\..{3}$ ) {
push #split1, $1;
}

You can try this:
if ($_ =~ /[\-A-Z]+/)
push #split1, $&;
that will match any combination of uppercase letters and -; or, if you want a stricter control, this:
if ($_ =~ /\/([A-Z]{3}-[A-Z]{2})_/)
push #split1, $1;
which will match only a sequence of uppercase letters followed by - and by a sequence of 2 uppercase letters; starting with a / and ending in _ (those are excluded).
From these example you can build the exact regex that you need.

Keep in mind that a match in list context will return captured strings.
#!/usr/bin/perl
use warnings; use strict;
use File::Basename qw(basename);
my #files = qw(
images/p/n/ct/XYZ-WW_V1.jpg
images/p/c/ABC-TY_V2.jpg
);
my #prefixes = map { (basename $_) =~ /^( [A-Z]+ - [A-Z]+ )/x } #files;
print "$_\n" for #prefixes;

Related

How to find all the words that begin with a|b and end with a|b. (Ex: “adverb” and “balalaika”)

The following perl program has a regex written to serve my purpose. But, this captures results present within a string too. How can I only get strings separated by spaces/newlines/tabs?
The test data I used is present below:
http://sainikhil.me/stackoverflow/dictionaryWords.txt
use strict;
use warnings;
sub print_a_b {
my $file = shift;
$pattern = qr/(a|b|A|B)\S*(a|b|A|B)/;
open my $fp, $file;
my $cnt = 0;
while(my $line = <$fp>) {
if($line =~ $pattern) {
print $line;
$cnt = $cnt+1;
}
}
print $cnt;
}
print_a_b #ARGV;
You could consider using an anchor like \b: word boundary
That would help apply the regexp only after and before a word.
\b(a|b|A|B)\S*(a|b|A|B)\b
Simpler, as Avinash Raj adds in the comments:
(?i)\b[ab]\S*[ab]\b
(using the case insensitive flag or modifier)
If you have multiple words in the same line then you can use word boundaries in a regex like this:
(?i)\b[ab][a-z]*[ab]\b
The pattern code is:
$pattern = /\b[ab][a-z]*[ab]\b/i;
However, if you want to check for lines with only has a word, then you can use:
(?i)$[ab][a-z]*[ab]$
Update: for your comment * lines that begin and end with the same character*, you can use this regex:
(?i)\b([a-z])[a-z]*\1\b
But if you want any character and not letters only like above you can use:
(?i)\b(.)[a-z]*\1\b

regex for multiple strings

I have two strings that I'm trying to match in a file and return just that line.
The first string will match what I'm looking for always but not completely.
Example:
I might be looking for Matcht in Matchthisstring so I match but not the entire string.
String 1 might come before or after string 2 and it might start with an uppercase or lower case letter.
Example:
I might be looking for actually but have Actually.
But it together and I have things like Matchthisstring.Actually or actually.Matchthisstring or Matchthisstring.someotherjunkIdon'tcareabout.actually any other combinations like that.
I'm having problems using the two search strings together and getting the reggae to work.
Here's an example of some code that works:
my #matches;
while (<$in_fh>) {
#push #matches, $_ if / \Q$wanted\E .* \Q$prefix\E /x;
#push #matches, $_ if / \Q^*(.*)$wanted\s*(.*)\E .* \Q^*(.*)$prefix\s*(.*)\E /x;
push #matches, $_ if / \Q$wanted\E /x;
}
what I really want to work is one of the other options that's commented out. I don't think I'm joining the two searches into one string together properly.
Thanks in advance for the assistance.
Sounds like you want
push #matches, $_ if /\Q$wanted\E/ and /\Q$prefix\E/;
Demo:
$ perl -ne 'use strict; use warnings; use vars qw($wanted $prefix #matches);
> BEGIN { $wanted = "/donate"; $prefix = "lghfound"; }
> push #matches, $_ if /\Q$wanted\E/ and /\Q$prefix\E/; print #matches' <<'HERE'
> http://www.google.com/donate
> http://lghfoundation.com/pages/donate.html
> http://example.com
> HERE
http://lghfoundation.com/pages/donate.html
http://lghfoundation.com/pages/donate.html

Print line of pattern match in Perl regex

I am looking for a keyword in a multiline input using a regex like this,
if($input =~ /line/mi)
{
# further processing
}
The data in the input variable could be like this,
this is
multi line text
to be matched
using perl
The code works and matches the keyword line correctly. However, I would also like to obtain the line where the pattern was matched - "multi line text" - and store it into a variable for further processing. How do I go about this?
Thanks for the help.
You can grep out the lines into an array, which will then also serve as your conditional:
my #match = grep /line/mi, split /\n/, $input;
if (#match) {
# ... processing
}
TLP's answer is better but you can do:
if ($input =~ /([^\n]+line[^\n]+)/i) {
$line = $1;
}
I'd look if the match is in the multiline-String and in case it is, split it into lines and then look for the correct index number (starting with 0!):
#!/usr/bin/perl
use strict;
use warnings;
my $data=<<END;
this is line
multi line text
to be matched
using perl
END
if ($data =~ /line/mi){
my #lines = split(/\r?\n/,$data);
for (0..$#lines){
if ($lines[$_] =~ /line/){
print "LineNr of Match: " . $_ . "\n";
}
}
}
Did you try his?
This works for me. $1 represents the capture of regex inside ( and )
Provided there is only one match in one of the lines.If there are matches in multiple lines, then only the first one will be captured.
if($var=~/(.*line.*)/)
{
print $1
}
If you want to capture all the lines which has the string line then use below:
my #a;
push #a,$var=~m/(.*line.*)/g;
print "#a";

How do I tokenise a word given tokens that are subsumed incompletely in the word?

I understand how to use regex in Perl in the following way:
$str =~ s/expression/replacement/g;
I understand that if any part of the expression is enclosed in parentheses, it can be used and captured in the replacement part, like this:
$str =~ s/(a)/($1)dosomething/;
But is there a way to capture the ($1) above outside of the regex expression?
I have a full word which is a string of consonants, e.g. bEdmA, its vowelized version baEodamaA (where a and o are vowels), as well its split up form of two tokens, separated by space, bEd maA. I want to just pick up the vowelized form of the tokens from the full word, like so: beEoda, maA. I'm trying to capture the token within the full word expression, so I have:
$unvowelizedword = "bEdmA";
$tokens[0] = "bEd", $tokens[1] = "mA";
$vowelizedword = "baEodamA";
foreach $t(#tokens) {
#find the token within the full word, and capture its vowels
}
I'm trying to do something like this:
$vowelizedword = m/($t)/;
This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?
The real question is: how can I capture the vowelized sequences baEoda and maA, given the tokens bEd, mA from the full word beEodamaA?
Edit
I realized from all the answers that I missed out two important details.
Vowels are optional. So if the tokens are : "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:
#vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
The following seems to do what you want:
#!/usr/bin/env perl
use warnings;
use strict;
my #tokens = ('bEd', 'mA');
my $vowelizedword = "beEodamaA";
my #regex = map { join('.?', split //) . '.?' } #tokens;
my $regex = join('|', #regex);
$regex = qr/($regex)/;
while (my ($matched) = $vowelizedword =~ $regex) {
$vowelizedword =~ s{$regex}{};
print "matched $matched\n";
}
Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:
#!/usr/bin/env perl
use warnings;
use strict;
my #tokens = ('bEd', 'mA', 'Al', 'ywm');
my $vowelizedword = "beEodamaA Alyawmi"; # Caveat: Without the space it won't work.
my #regex = map { join('.?', split //) . '.?$' } #tokens;
my $regex = join('|', #regex);
$regex = qr/($regex)/;
while (my ($matched) = $vowelizedword =~ $regex) {
$vowelizedword =~ s{$regex}{};
print "matched $matched\n";
}
Use the m// operator in so-called "list context", as this:
my #tokens = ($input =~ m/capturing_regex_here/modifiershere);
ETA: From what I understand now, what you were trying to say is that you want to match an optional vowel after each character of the tokens.
With this, you can tweak the $vowels variable to only contain the letters you seek. Optionally, you may also just use . to capture any character.
use strict;
use warnings;
use Data::Dumper;
my #tokens = ("bEd", "mA");
my $full = "baEodamA";
my $vowels = "[aeiouy]";
my #matches;
for my $rx (#tokens) {
$rx =~ s/.\K/$vowels?/g;
if ($full =~ /$rx/) {
push #matches, $full =~ /$rx/g;
}
}
print Dumper \#matches;
Output:
$VAR1 = [
'baEoda',
'mA'
];
Note that
... $full =~ /$rx/g;
does not require capturing groups in the regex.
I suspect that there is an easier way to do whatever you're trying to accomplish. The trick is not to make the regex generation code so tricky that you forget what it's actually doing.
I can only begin to guess at your task, but from your single example, it looks like you want to check that the two subtokens are in the larger token, ignoring certain characters. I'm going to guess that those sub tokens have to be in order and can't have anything else between them besides those vowel characters.
To match the tokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the character one after the end of the last match for the same scalar. This way allows me to have separate patterns for each sub token. This is much easier to manage since I only need to change the list of values in #subtokens.
Once you go through each of the pairs and find which ones match all the patterns, I can extract the original string from the pair.
use v5.14;
my $vowels = '[ao]*';
my #subtokens = qw(bEd mA);
# prepare the subtoken regular expressions
my #patterns = map {
my $s = join "$vowels", map quotemeta, (split( // ), '');
qr/$s/;
} #subtokens;
my #tokens = qw( baEodamA mAabaEod baEoda mAbaEoda );
my #grand_matches;
TOKEN: foreach my $token ( #tokens ) {
say "-------\nMatching $token..........";
my #matches;
PATTERN: foreach my $pattern ( #patterns ) {
say "Position is ", pos($token) // 0;
# scalar context /g and \G
next TOKEN unless $token =~ /\G($pattern)/g;
push #matches, $1;
say "Matched with $pattern";
}
push #grand_matches, [ $token, \#matches ];
}
# Now report the original
foreach my $tuple ( #grand_matches ) {
say "$tuple->[0] has both fragments: #{$tuple->[1]}";
}
Now, here's the nice thing about this structure. I've probably guessed wrong about your task. If I have, it's easy to fix without changing the setup. Let's say that the subtokens don't have to be in order. That's an easy change to the pattern I created. I just get rid of the
\G anchor and the /g flag;
next TOKEN unless $token =~ /($pattern)/;
Or, suppose that the tokens have to be in order, but other things may be between them. I can insert a .*? to match that stuff, effectively skipping over it:
next TOKEN unless $token =~ /\G.*?($pattern)/g;
It would be much nicer if I could manage all of this from the map where I create the patterns, but the /g flag isn't a pattern flag. It has to go with the operator.
I find it much easier to manage changing requirements when I don't wrap everything in a single regular expression.
Assuming the tokens need to appear in order and without anything (other than a vowel) between them:
my #tokens = ( "bEd", "mA" );
my $vowelizedword = "baEodamaA";
my $vowels = '[ao]';
my (#vowelized_sequences) = $vowelizedword =~ ( '^' . join( '', map "(" . join( $vowels, split( //, $_ ) ) . "(?:$vowels)?)", #tokens ) . '\\z' );
print for #vowelized_sequences;

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as #ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.