Capture word between optional hyphens regex - regex

I have the following types of strings:
abc - xyz
abc - pqr - xyz
abc - - xyz
abc - pqr uvw - xyz
I want to retrieve xyz from the 1st string, pqr from the 2nd, an empty string from the 3rd, and pqr uvw from the 4th. The 2nd hyphen is optional. abc is a static string that is always present. I've tried the following regex:
/^(?:abc) - (.*)[^ -]?/
But it gives me following output,
xyz
pqr - xyz
- xyz
pqr uvw - xyz
I don't need the last part in the second string. I'm using perl for scripting. Can it be done via regex?

Note that the (.*) part is a greedily quantified dot: it grabs any 0+ chars other than line break chars, as many as possible, up to the end of the line. The [^ -]? that follows can match an empty string because of the ? quantifier (1 or 0 repetitions), so it matches the empty string at the end of the line. Thus, the pqr - xyz output for abc - pqr - xyz is only logical for the regex engine.
You need to use a more restrictive pattern here. E.g.
/^abc\h*-\h*((?:[^\s-]+(?:\h+[^\s-]+)*)?)/
See the regex demo.
Details
^ - start of a string
abc - an abc
\h*-\h* - a hyphen enclosed with 0+ horizontal whitespaces
((?:[^\s-]+(?:\h+[^\s-]+)*)?) - Group 1 capturing an optional occurrence of
[^\s-]+ - 1 or more chars other than whitespace and -
(?:\h+[^\s-]+)* - zero or more repetitions of
\h+ - 1+ horizontal whitespaces
[^\s-]+ - 1 or more chars other than whitespace and -
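As a quick cross-check, the pattern above can be run against the four sample strings. The sketch below uses Python rather than Perl, purely for illustration. Python's re module has no \h, so [ \t] is substituted for it (an assumption), and the name capture_middle is hypothetical.

```python
import re

# Python's re has no \h, so [ \t] stands in for "horizontal whitespace" (an assumption).
PATTERN = re.compile(r"^abc[ \t]*-[ \t]*((?:[^\s-]+(?:[ \t]+[^\s-]+)*)?)")

def capture_middle(line):
    """Return the words after the first hyphen, stopping before an optional second hyphen."""
    m = PATTERN.match(line)
    return m.group(1) if m else None

samples = ["abc - xyz", "abc - pqr - xyz", "abc - - xyz", "abc - pqr uvw - xyz"]
results = [capture_middle(s) for s in samples]
print(results)
```

A string that does not start with abc yields None here, which mirrors a failed match in Perl.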

You could use ^[^-]*-\s*\K[^\s-]*.
Here's how it works:
^ # Matches at the start of the string (or of each line in multiline mode)
[^-]* # Matches any run of characters other than -
- # Followed by -
\s* # Matches any run of whitespace characters
\K # Resets the match start to the current position
[^\s-]* # Matches any run of characters that are neither whitespace nor -
Demo.
Update for multiple enclosed words: ^[^-]*-\s*\K[^\s-]*(?:\s*[^\s-]+)*
Last part (?:\s*[^\s-]+)* checks for existence of any other word preceded by space(s).
Demo
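Where \K is not available, the same idea can be expressed with a capturing group. The following is a minimal sketch in Python, whose standard re module lacks \K. The group placement and the name after_first_hyphen are assumptions for illustration, not part of the answer above.

```python
import re

# No \K in Python's re: wrap the part that \K would have kept in a capturing group.
KEEP = re.compile(r"^[^-]*-\s*([^\s-]*(?:\s*[^\s-]+)*)")

def after_first_hyphen(line):
    """Return the word(s) between the first hyphen and the optional second one."""
    m = KEEP.match(line)
    return m.group(1) if m else None

kept = [after_first_hyphen(s) for s in
        ["abc - xyz", "abc - pqr - xyz", "abc - - xyz", "abc - pqr uvw - xyz"]]
print(kept)
```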

You could use split:
$answer = (split / \- /, $t)[1];
Here $t is the text string and [1] selects the 2nd field (indexes start from 0). This works for everything except abc - - xyz: if the separator is " - ", there would have to be 2 spaces between the hyphens for the middle field to come back empty. If abc - - xyz is valid input, run this before the split so that all cases work:
$t =~ s/- -/-  -/;
It simply inserts an extra space so that " - " matches twice with nothing in between.
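The split-based approach, including the extra-space workaround, might look like this in Python. It is a sketch for illustration only, and the function name middle_field is hypothetical.

```python
def middle_field(text):
    """Return the second field of a ' - ' separated string."""
    # Widen "- -" to "-  -" so the separator " - " occurs twice around an empty field.
    normalized = text.replace("- -", "-  -", 1)
    return normalized.split(" - ")[1]

fields = [middle_field(s) for s in
          ["abc - xyz", "abc - pqr - xyz", "abc - - xyz", "abc - pqr uvw - xyz"]]
print(fields)
```

Like the Perl one-liner, this assumes the input always contains at least one " - " separator; otherwise the [1] index fails.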

Can it be done via regex?
Yes, with three simple regexes: - and ^\s+ and \s+$.
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
open my $INFILE, '<', 'data.txt';
my @results = map {
    (undef, my $target) = split /-/, $_, 3;
    $target =~ s/^\s+//; #remove leading spaces
    $target =~ s/\s+$//; #remove trailing spaces
    $target;
} <$INFILE>;
close $INFILE;
say Dumper \@results;
--output:--
$VAR1 = [
'xyz',
'pqr',
'',
'pqr uvw'
];

Related

Exclude a substring after a pattern is matched using regex

I want to write a regex that splits a string such as only few elements are selected. For example:
M:\Shares\Profiles\Server\Profiles\abcd.contoso.V2.01
the result I am aiming for is:
abcd.V2.01, so that the domain name i.e. 'contoso' is dropped
However, I am unable to exclude a part of the string after a match is found. I tried
$original = 'M:\Shares\Profiles\Server\Profiles\abcd.contoso.V2.01'
$modified = $original -replace '.*\\([^\\.]+.contoso.V2)[^\\]*$', '$1'
that returns
$modified as 'abcd.contoso.V2'
You can use two capturing groups:
$original = 'M:\Shares\Profiles\Server\Profiles\abcd.contoso.V2.01'
$original -replace '.*\\([^\\.]*)\.contoso(\.V2[^\\]*)$', '$1$2'
# => abcd.V2.01
Do not forget to escape literal dots in the regex pattern. Here is a demo of the above regex. Details:
.* - any zero or more chars other than LF chars
\\ - a \ char
([^\\.]*) - Group 1 ($1): any zero or more chars other than \ and .
\.contoso - a .contoso string
(\.V2[^\\]*) - Group 2 ($2): .V2 string and then any zero or more chars other than \
$ - end of string.
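For comparison, the same two-group replacement can be sketched in Python. This is an illustrative translation rather than PowerShell, and Python writes the group references as \1\2 instead of $1$2.

```python
import re

original = r"M:\Shares\Profiles\Server\Profiles\abcd.contoso.V2.01"

# Group 1 keeps the profile name, group 2 keeps the ".V2..." suffix;
# the literal ".contoso" between them is dropped by the replacement.
modified = re.sub(r".*\\([^\\.]*)\.contoso(\.V2[^\\]*)$", r"\1\2", original)
print(modified)
```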

String split in windows powershell

Can you please help me get the desired output? SIT is the environment and properties is the file type; I need to remove the environment and the extension from the string.
# $string = "<ENV>.<can have multiple periods>.properties"
$string = "SIT.com.local.test.stack.properties"
$b = $string.split('.')
$b[0].Substring(1)
Required output: com.local.test.stack (can have multiple periods)
This should do.
$string = "SIT.com.local.test.stack.properties"
# capture anything up to the first period, and in between first and last period
if($string -match '^(.+?)\.(.+)\.properties$') {
$environment = $Matches[1]
$properties = $Matches[2]
# ...
}
You may use
$string -replace '^[^.]+\.|\.[^.]+$'
This will remove the first 1+ chars other than a dot and then a dot, and the last dot followed with any 1+ non-dot chars.
See the regex demo.
Details
^ - start of string
[^.]+ - 1+ chars other than .
\. - a dot
| - or
\. - a dot
[^.]+ - 1+ chars other than .
$ - end of string.
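The alternation above removes both ends in a single pass. Here is a minimal Python sketch of the same pattern, given as an illustration only; the helper name strip_env_and_ext is hypothetical.

```python
import re

def strip_env_and_ext(name):
    """Drop the leading '<env>.' prefix and the trailing '.<extension>' suffix."""
    # First alternative matches only at the start, second only at the end.
    return re.sub(r"^[^.]+\.|\.[^.]+$", "", name)

stripped = strip_env_and_ext("SIT.com.local.test.stack.properties")
print(stripped)
```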
You can use -match to capture your desired output using regex
$string ="SIT.com.local.test.stack.properties"
$string -match "^.*?\.(.+)\.[^.]+$"
$Matches.1
You can do this with the Split operator also.
($string -split "\.",2)[1]
Explanation:
You split on the literal . character with the regex \.. The ,2 syntax tells PowerShell to return at most 2 substrings. The [1] index selects the second element of the returned array; [0] is the first substring (SIT in this case). Note that this removes only the environment prefix; the .properties extension is still part of the result.

How to write nested regex to find words below some string?

I am converting a PDF to text with xpdf and then finding some words
with the help of regex and preg_match_all.
I am separating my words with a colon in the pdftotext output.
Below is my pdftotext output:
In respect of Shareholders
Name: xyx
Residential address: dublin
No of Shares: 2
Name: abc
Residential address: canada
No of Shares: 2
So I wrote a regex that shows the words after the colon in the text:
$regex = '/(?<=: ).+/';
preg_match_all($regex, $string, $matches);
But now I want a regex that will display all the data after In respect of Shareholders.
So I wrote $regex = '/(?<=In respect of Shareholders).*?(?=\s)';
But it shows me only:
Name: xyx
I want to first find all the data after In respect of Shareholders and then use another regex to find the words after each colon.
You may use
if (preg_match_all('~(?:\G(?!\A)|In respect of Shareholders)\s*[^:\r\n]+:\h*\K.*~', $string, $matches)) {
print_r($matches[0]);
}
See the regex demo
Details
(?:\G(?!\A)|In respect of Shareholders) - either the end of the previous successful match or In respect of Shareholders text
\s* - 0+ whitespaces
[^:\n\r]+ - 1 or more chars other than :, CR and LF
: - a colon
\h* - 0+ horizontal whitespaces
\K - match reset operator that discards all text matched so far
.* - the rest of the line (0 or more chars other than line break chars).
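Engines without \G and \K can approximate this in two steps, which is also what the question literally asks for: first isolate the text after the heading, then collect the value after each colon. The Python sketch below is an illustration under that assumption. Unlike the \G version, it does not stop at the first line that lacks a colon. The name values_after is hypothetical.

```python
import re

def values_after(text, marker):
    """Collect the value after the colon on every 'label: value' line following marker."""
    _, found, tail = text.partition(marker)
    if not found:
        return []
    # One match per line: a label without colons, a colon, optional spaces, then the value.
    return re.findall(r"^[^:\r\n]+:[ \t]*(.*)$", tail, flags=re.M)

sample = """In respect of Shareholders
Name: xyx
Residential address: dublin
No of Shares: 2
Name: abc
Residential address: canada
No of Shares: 2
"""
values = values_after(sample, "In respect of Shareholders")
print(values)
```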
In your regex (?<=: ).+ you will match any character 1+ times after a colon and a space. To capture all that follows the spaces or tabs in a group, you could use (?<=: )[\t ](.+)
Another way to match the texts using a capturing group could be:
^.*?:[ \t]+(\w+)
Explanation
^ Assert start of the string
.*?: Match any character non greedy followed by a :
[ \t]+ Match 1+ times a space or a tab
(\w+) Capture in a group 1+ word characters
Regex demo | Php demo
Or use \K to forget what was matched if that is supported:
^.*?:\h*\K\w+
Regex demo

How to Regex and extract even new line until a match

I have used regex to successfully extract anything right after "Abc 123", but it doesn't extract anything from the following lines.
Is there any way I can use regex to extract the following:
"Abc 123 def
ghi
jkl"
"Abc 123 def ghi jkl mno"
"Abc 123 def ghi jkl
mno"
I am using Regex in Talend.
I think you want to extract substrings that start at the beginning of a line with 1+ word chars, then a whitespace, then 1 or more digits, and that span multiple lines up to the next occurrence of the same pattern.
You may use the following regex (note the flags and notation may differ depending on the language you are using):
/^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)/gm
See the regex demo.
Details:
^ - start of a line
(\w+) - Group 1: one or more word chars
\s - 1 whitespace
(\d+) - Group 2: one or more digits
(.*(?:\r?\n(?!\w+\s\d).*)*) - Group 3:
.* - any 0+ chars other than line break chars
(?:\r?\n(?!\w+\s\d).*)* - zero or more sequences of:
\r?\n - a line break...
(?!\w+\s\d) - that is not followed with 1+ word chars, a whitespace and a digit
.* - any 0+ chars other than line break chars
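To see how the negative lookahead splits records, here is a minimal sketch in Python (the question uses Talend, so this is only an illustration of the pattern). The sample text is assembled from the three quoted examples in the question.

```python
import re

# re.M makes ^ match at the start of every line, like the /m flag above.
RECORD = re.compile(r"^(\w+)\s(\d+)(.*(?:\r?\n(?!\w+\s\d).*)*)", re.M)

text = (
    "Abc 123 def\nghi\njkl\n"
    "Abc 123 def ghi jkl mno\n"
    "Abc 123 def ghi jkl\nmno"
)

# findall returns one (group 1, group 2, group 3) tuple per record.
records = RECORD.findall(text)
for word, number, rest in records:
    print(repr(word), repr(number), repr(rest))
```

Group 3 keeps its embedded line breaks, so each record can be normalized afterwards if a single-line value is needed.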
(\w)+\s(\d+)((.|\R)+) is what you want, so after escaping it for a Java string literal it'll be:
(\\w)+\\s(\\d+)((.|\\R)+).
\R is a linebreak matcher available in Java regex since Java 8. It matches any line break sequence, including both \r\n and \n.
If you only allow a single line break:
(\w)+\s(\d+)((.+)(\R.+){0,1})
You should specify your desired output more precisely, but this answer shows how to include multiple lines or up to 2 lines.

Matching first letter of word

I want to match the first letter of a word in one string against another string that has the same letter. In this example the letter is H:
25HB matches to HC
I am using the match operator shown below:
my ($match) = ( $value =~ m/^d(\w)/ );
The intent is to skip the digits and capture the first word character after them. How can I correct this?
That regex doesn't do what you think it does:
m/^d(\w)/
This matches the start of the line, then a literal letter d, then a single word character.
You may want:
m/^\d+(\w)/
Which will then match one or more digits from the start of line, and grab the first word character after that.
E.g.:
my $string = '25HC';
my ( $match ) =( $string =~ m/^\d+(\w)/ );
print $match,"\n";
Prints H
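The difference between d and \d is easy to verify. The sketch below is in Python for illustration only, and the function name first_char_after_digits is hypothetical. It also shows one caveat: \w matches digits too, so an all-digit string still yields a capture.

```python
import re

def first_char_after_digits(value):
    """Capture the first word character that follows a leading run of digits."""
    m = re.match(r"\d+(\w)", value)  # re.match anchors at the start, like ^
    return m.group(1) if m else None

print(first_char_after_digits("25HC"))
# The original pattern looks for a literal "d", so it finds nothing in "25HC".
print(re.match(r"d(\w)", "25HC"))
# \w also matches digits: \d+ backtracks one digit and the last "3" is captured.
print(first_char_after_digits("3333"))
```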
You are not clear about what you want. If you want to match the first letter in a string to the same letter later in the string:
m{
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\1 # match the captured letter
}msx; # /m: ^ and $ match at line boundaries, /s: . also matches newlines, /x: whitespace and comments in the pattern are ignored
See perldoc perlre for more details.
Addendum:
If by word, you mean any alphanumeric sequence, this may be closer to what you want:
m{
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
\1 # match the captured letter
}msx; # /m: ^ and $ match at line boundaries, /s: . also matches newlines, /x: whitespace and comments in the pattern are ignored
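The backreference idea from the first pattern can be tried out as follows. This is a Python sketch for illustration only. Python's re does not support the POSIX class [[:alpha:]], so [A-Za-z] is used instead, which is an ASCII-only assumption. The name first_repeated_letter is hypothetical.

```python
import re

# [A-Za-z] replaces [[:alpha:]] (ASCII-only assumption); re.S lets . match newlines like /s.
REPEATED = re.compile(r"([A-Za-z]).*?\1", re.S)

def first_repeated_letter(text):
    """Return the first letter that occurs again later in the text, or None."""
    m = REPEATED.search(text)
    return m.group(1) if m else None

print(first_repeated_letter("25HB HC"))
print(first_repeated_letter("25HB\nHC"))
print(first_repeated_letter("abc"))
```

The comparison is case-sensitive, so h and H do not count as the same letter.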
You could try ^.*?([A-Za-z]).
The following code returns:
ITEM: 22hb
MATCH: h
ITEM: 33HB
MATCH: H
ITEM: 3333
MATCH:
ITEM: 43 H
MATCH: H
ITEM: HB33
MATCH: H
Script.
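#!/usr/bin/perl
use strict;
use warnings;
my @array = ('22hb','33HB','3333','43 H','HB33');
for my $item (@array) {
    # In list context the match returns the capture, or an empty list on failure.
    # This avoids the unreliable "my $x = ... if ..." construct.
    my ($match) = $item =~ /^.*?([A-Za-z])/;
    $match //= ''; # no letter found, e.g. '3333'
    print "ITEM: $item \nMATCH: $match\n\n";
}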
I believe this is what you are looking for:
(If you can provide more clear example of what you are looking for we may be able to help you better)
The following code takes two strings and finds the first non-digit character common in both the strings:
my $string1 = '25HB';
my $string2 = 'HC';
#strip all digits
$string1 =~ s/\d//g;
foreach my $alpha (split //, $string1) {
# for each non-digit check if we find a match
if ($string2 =~ /\Q$alpha\E/) { # \Q...\E treats the character literally, even if it is a regex metacharacter
print "First matching non-numeric character: $alpha\n";
exit;
}
}
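The same loop can be sketched in Python with a plain substring test, which sidesteps regex metacharacters entirely. This is an illustration only, and the function name first_common_non_digit is hypothetical.

```python
def first_common_non_digit(first, second):
    """Return the first non-digit character of `first` that also occurs in `second`."""
    for ch in first:
        # Skip digits, then use a literal membership test instead of a regex.
        if not ch.isdigit() and ch in second:
            return ch
    return None

common = first_common_non_digit("25HB", "HC")
print(f"First matching non-numeric character: {common}")
```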