Perl - Parse last occurrence of whitespace - regex

I am trying to parse the last chunk of the following line:
77 0 wl1271/wpa_supplicant_lib/driver_ti.h
Unfortunately, the spaces are not always the same length. I assume that printf or something similar was used to output the data so it lined up in columns. This means that sometimes I have spaces, and sometimes I have tab characters.
I have successfully gotten the first two numbers through the use of regex in Perl. The way I thought I would get the last bit would be to search for the last occurrence of any whitespace character and then grab the rest of the string starting there. I tried using rindex, but that only accepts a literal substring for the search parameter and not a regex (I thought that \s would do the trick).
Can anyone solve the issue I'm having here, either by walking me through how to get the last whitespace character or by helping me with a solution to grab that string some other way?

Why not split?
use strict;
use warnings;
my $string = '77 0 wl1271/wpa_supplicant_lib/driver_ti.h';
my ( $num1, $num2, $lastPart ) = split ' ', $string;
print "$num1\n$num2\n$lastPart";
Output:
77
0
wl1271/wpa_supplicant_lib/driver_ti.h

Why not just match the regex \S+$ - namely, the last set of non-whitespace characters in the string?
\S = non-whitespace character
$ = end of line
Edit: You really should use split though, as suggested by Kenosis.
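A minimal sketch of both suggestions side by side. The sample line is the one from the question, with a tab and extra spaces added as an assumption about how the real data is padded; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Line from the question; the tab and extra spaces are added for illustration.
my $line = "77   0\twl1271/wpa_supplicant_lib/driver_ti.h";

# Option 1: capture the last run of non-whitespace characters.
my ($last_by_regex) = $line =~ /(\S+)\s*$/;

# Option 2: split on any whitespace and take the final field.
my $last_by_split = ( split ' ', $line )[-1];

print "$last_by_regex\n";
print "$last_by_split\n";
```

Both approaches treat spaces and tabs the same way, so the column padding does not matter.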

Related

Regex to match and capture a word

I wrote a regex in a Perl script to find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð. The word may be at the beginning or end of the sentence. When I tested this regex on regex101.com, it matches even when the input is empty.
The way I interpret this regex is: match one of the patterns "fp", "fd", "sp" or "sd" and capture everything around it until either a whitespace or the beginning of the line on the left side and a whitespace or end of the line on the right side.
This is the regex: ^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$
I also tried using the ? quantifier to make the .* pattern lazy, but it still shows a match when the input is empty.
Here are some examples of what I need it to capture in parentheses:
(fpgθ) tig <br/>
tig (gfpθ) tig<br/>
tig (gθfp)<br/>
Edit: I forgot to explain the middle part. The [ˈˌ]? part (I made a mistake, I don't need the |) just allows for those characters to be between the [fs] and [pd]. I wouldn't want it to match things like tigf pg. I want it to match any word (defined by the space around it - so in a sentence like tig you rθð the words it contains are tig, you, and rθð). This "word" could be at the end, at the beginning, or in the middle of the sentence. Is there a way to assert the position at the beginning of the string within a bracket? I think that would solve my problem. Also, I tried using \w, but because I have things like θ or ð it doesn't match those.
There is still a little openness in the description, but this works with the data shown:
use warnings;
use strict;
use feature 'say';
use utf8;
use open ":std", ":encoding(UTF-8)";
my @strs = (
    '(fpgθ) tig <br/>',
    'tig (gfpθ) tig<br/>',
    'tig (gθfp)<br/>',
);
for (@strs)
{
    my @m = /\b( \S*? [fs][pd] \S*? )\b/gx;
    say "@m" if @m;  # no space allowed by the pattern
}
Depending on clarifications you may want to tweak the \S and \b that are used. I capture into an array, with /g, for strings with more than one match. I left parentheses in for an additional test.
The use utf8 allows UTF-8 in the source, so it's for my @strs array only.
The use open pragma, however, is essential as it sets default (PerlIO) input and output layers, in this case standard streams for UTF-8. So you can read from a file and print to a file or console.
find and capture a word that contains the sequence "fp", "fd", "sp" or
"sd" in a sentence. However, the word may contain some non-word
characters like θ or ð.
You should match Unicode letters \p{L} instead of regular word characters \w:
\p{L}*[fs][pd]\p{L}*
I have simplified the pattern according to your latest edits.
use warnings;
use strict;
use utf8;
use open ":std", ":encoding(UTF-8)";
my $regex = qr/\p{L}*[fs][pd]\p{L}*/mp;
my @strs = 'fpgθ tig <br/>
tig gfpθ tig<br/>
tig gθfp<br/>
fptig gfpθ tig<br/>
sddgsdθ(θ#) tig gθfp<br/>';
for (@strs)
{
    my @m = /$regex/gm;
    print "@m" if @m;  # no space allowed by the pattern
}
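Since the question defines a word by the whitespace around it, another way to sketch this is to split each sentence on whitespace and keep the fields that contain the sequence. This is not either answer's method, just an illustration; the sample sentences (without the parentheses and <br/>) and the variable names are assumptions.

```perl
use strict;
use warnings;
use utf8;                               # the source below contains UTF-8 literals
use open ':std', ':encoding(UTF-8)';    # encode output on the standard streams

# Hypothetical sample sentences based on the question's examples.
my @sentences = ( 'fpgθ tig', 'tig gfpθ tig', 'tig gθfp', 'tigf pg' );

for my $sentence (@sentences) {
    # A "word" is any whitespace-delimited field; keep those containing
    # f or s, an optional stress mark, then p or d.
    my @words = grep { /[fs][ˈˌ]?[pd]/ } split ' ', $sentence;
    print "$sentence => ", ( @words ? "@words" : '(none)' ), "\n";
}
```

The last sentence yields no match because the f and the p sit in different whitespace-delimited fields.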

Perl Split using "*"

If I use split like this:
my @split = split(/\s*/, $line);
print "$split[1]\n";
with input:
cat dog
I get:
a
However if I use \s+ in split, I get:
dog
I'm curious as to why they don't produce the same result? Also, what is the proper way to split a string by character?
Thanks for your help.
\s* effectively means zero or more whitespace characters. Between c and a in cat are zero spaces, yielding the result you're seeing.
To the regex engine, your string looks as follows:
c
zero spaces
a
zero spaces
t
multiple spaces
d
zero spaces
o
zero spaces
g
Following this logic, if you use \s+ as a separator, it will only match the multiple spaces between cat and dog.
* matches 0 or more times. Which means it can match the empty string between characters. + matches 1 or more times, which means it must match at least one character.
This is described in the documentation for split:
If PATTERN matches the empty string, the EXPR is split at the match position (between characters).
Additionally, when you split on whitespace, most of the time you really want to use a literal space:
.. split ' ', $line;
As described here:
As another special case, "split" emulates the default behavior of the
command line tool awk when the PATTERN is either omitted or a literal
string composed of a single space character (such as ' ' or "\x20",
but not e.g. "/ /"). In this case, any leading whitespace in EXPR is
removed before splitting occurs, and the PATTERN is instead treated as
if it were "/\s+/"; in particular, this means that any contiguous
whitespace (not just a single space character) is used as a separator.
However, this special treatment can be avoided by specifying the
pattern "/ /" instead of the string " ", thereby allowing only a
single space character to be a separator.
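The difference between the separators can be sketched as follows. The input string assumes several spaces between the two words, as in the breakdown above; the array names are illustrative.

```perl
use strict;
use warnings;

my $line = "cat   dog";

my @by_star  = split /\s*/, $line;      # empty matches split between every character
my @by_plus  = split /\s+/, $line;      # only the run of spaces separates fields
my @lead     = split /\s+/, " $line";   # leading whitespace produces an empty first field
my @by_space = split ' ',   " $line";   # awk mode: leading whitespace is skipped

print scalar(@by_star),  " fields: @by_star\n";    # 6 fields: c a t d o g
print scalar(@by_plus),  " fields: @by_plus\n";    # 2 fields: cat dog
print scalar(@lead),     " fields\n";              # 3 fields, the first one empty
print scalar(@by_space), " fields: @by_space\n";   # 2 fields: cat dog
```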
If you want to split a string into a list of individual characters then you should use an empty regex pattern for split, like this
my $line = 'cat';
my @split = split //, $line;
print "$_\n" for @split;
output
c
a
t
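Some people prefer unpack, like this
my @split = unpack '(A1)*', $line;
which gives the same result for this input. Note that the A template strips trailing whitespace, so a space character in the string would come back as an empty string; use '(a1)*' if you need to keep it.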

Notepad++ Regex Ending With

I want to remove all the lines until some terminating string. So something like this :
These
lines
|| that can contain numbers 123 or characters like |*{
will be removed
But the following lines
will
remain.
I want to obtain :
But the following lines
will
remain.
I tried searching the regexp /removed$/ and replacing with an empty string, but it says 0 matches.
How can I do this?
Thanks for any help !
Make sure you check ". matches newline", and use the pattern .*removed\.$
Notice that you need .* at the start to match everything up until the terminating string, and you have to escape the literal . at the end.
Also, Notepad++ doesn't want slashes around its regexes.
If you don't have the . at the end of your terminator, just remove the \. from the regex.
To remove the trailing newline as well, you could use .*removed\r?\n?.

Add an optional white space between characters in a string

I have a string, for example:
string SampleString = "F456-G12345-9090-GHI"
I need to add an optional white space between all characters in the above string.
The above string needs to match the same string, which may or may not have white space between each character. The other string will be like
string samplestring1 = "F456-G12345- 9090 -GHI"
Thanks
Padma
I'm not positive that I'm understanding what you'll be matching. If you're looking for a specific string, then the easiest way is probably to substitute all white space for '' across the string and then do the match.
In perl I'd do:
$string =~ s/\s//g;
while ($string =~ m/F456-G12345-9090-GHI/g) {
# Do something
}
If you're looking for multiple strings, and not just a specific one, you might just want to add \s as a potential match [\w\s-]+
However, if you're going to be matching against a specific string, I'd just toss the whitespace whole cloth first rather than performing an expensive regex checking for (and discarding) any whitespace found before checking the string.
You will probably have to add \s* between each character (or other control characters for whitespace):
\s*F\s*4\s*5\s*6\s*-\s*G\s*1\s*2\s*3\s*4\s*5\s*-\s*9\s*0\s*9\s*0\s*-\s*G\s*H\s*I\s*
Or, depending on your regex dialect, you might be able to pass an option to ignore whitespace in the source text, but it would depend on which regex library you're using.
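Writing the \s* pattern by hand is error-prone, so it can be generated instead. A minimal sketch in Perl, assuming the expected string is known in advance; the variable names and the third, non-matching candidate are made up for illustration.

```perl
use strict;
use warnings;

my $expected = 'F456-G12345-9090-GHI';

# Escape each character, then allow optional whitespace between them.
my $pattern = join '\s*', map { quotemeta($_) } split //, $expected;
my $regex   = qr/^\s*$pattern\s*$/;

for my $candidate (
    'F456-G12345-9090-GHI',
    'F456-G12345- 9090 -GHI',
    'F456-G12345-9091-GHI',    # hypothetical non-matching value
) {
    printf "%-26s %s\n", "'$candidate'",
        $candidate =~ $regex ? 'matches' : 'no match';
}
```

If only one fixed string is being checked, stripping the whitespace first and comparing, as suggested in the earlier answer, is simpler.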

Perl - Substitute any non-alphanumeric character with "_"

In perl I want to substitute any character not [A-Z]i or [0-9] and replace it with "_" but only if this non alphanumerical character occurs between two alphanumerical characters. I do not want to touch non-alphanumericals at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course, that would hurt your chances with "#A#B%C", because each match consumes the alphanumeric characters on both sides, so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
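The difference between the three forms shows up on overlapping candidates. A short sketch; the input string is the example above with a trailing "-" added as an assumption to show that characters at the edges stay untouched.

```perl
use strict;
use warnings;

my $input = '#A#B%C-';

# Capturing version: each match consumes the alphanumerics on both sides,
# so the separator right after a match is skipped.
( my $captured = $input ) =~ s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;

# Look-around version: only the separator itself is consumed.
( my $lookaround = $input ) =~ s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;

# \K version: behaves like the look-behind here.
( my $keep = $input ) =~ s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;

print "$captured\n";      # #A_B%C-
print "$lookaround\n";    # #A_B_C-
print "$keep\n";          # #A_B_C-
```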
Try this:
my $str = 'a-2=c+a()_';
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;  # look-arounds capture nothing, so the replacement is just "_"