Perl Regular Expression extracting sub-string? - regex

I have a String variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion, i.e. everything from the beginning up to the first appearance of a period (.). The part before the period can be alphanumeric characters of any length. So I created this regexp.
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However, it is giving me a blank string. The logic is that:
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?

You can instead use a negated character class like this
^[^.]+
[^.] matches any character except .
[^.]+ matches one or more characters (except .)
^ anchors the match at the start of the string
OR
^.+?(?=\.|$)
(?=) is a lookahead, which checks for a particular pattern after the current position without consuming it. So for the text abcdad, the regex a(?=b) matches only the first a (the one followed by b).
$ matches at the end of the string (or just before a final newline); with the multiline option it also matches at the end of each line.

\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is .:
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains ..
If the first character of the string isn't .:
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
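To see the difference outside Perl, here is a minimal Python sketch of the same two patterns. The helper name prefix_before_dot is made up for illustration; the sample string is the one from the question.

```python
import re

def prefix_before_dot(text):
    """Return the leading run of non-period characters (may be empty)."""
    # [^.]* can match zero characters, so re.match always succeeds here.
    return re.match(r"[^.]*", text).group(0)

arg = "ABCD.asd.qwe.com:/dir1"

# The pattern from the question: both parts may match nothing,
# so the match at position 0 is the empty string.
original = re.match(r"(.*?\.?)", arg).group(1)

print(repr(original))          # the empty capture from the question
print(prefix_before_dot(arg))  # the part before the first period
```

The negated character class needs no backtracking and no lookahead, which is why it tends to be the more robust choice.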

use strict;
use warnings;
my $string = "ABCD.asd.qwe.com:/dir1";
# Capture the leading run of characters that are not a period.
if ( $string =~ /^([^.]+)/ ) {
    my $capture = $1;
    print "$capture\n";
}
Or you can use the split function, like this:
my $sub_string = ( split /\./, $string )[0];
print"$sub_string\n";
Note in general: for an explanation of a regex (to understand a complex one), take a look at the YAPE::Regex::Explain module.

This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first . .
You could change the .+ to .* if your input may end after the first ..
You could change the first .*? to .+? if you are sure that there is always at least one character before the first ..

Related

Perl Regular expression | how to exclude words from a file

I am looking for Perl regular expression syntax to meet some requirements I have in a project.
First, I want to exclude strings that contain words from a txt file (dictionary).
For example, if my file contains these strings:
path.../Document.txt |
tree
car
ship
then with a regular expression I expect:
a1testtre -- match
orangesh1 -- match
apleship3 -- not match [contains word from file ]
I also have one more requirement that I couldn't solve. I have to create a regex that does not allow a string to contain a doubled character (the same character twice in a row) three or more times.
For example:
adminnisstrator21 -- match (has 2 repetitions of chars)
kkeeykloakk -- not match (has 3 repetitions of chars)
stack22ooverflow -- match (has 2 repetitions of chars)
For this I have tried
\b(?:([a-z])(?!\1))+\b
but it works only for the first repeated char.
Any idea how to solve these two?
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
^(?!.*(?:tree|car|ship)).*$
^ Assert start of string
(?! negative lookahead, assert what is on the right is not
.*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
) Close negative lookahead
.* Match any char except a newline
$ Assert end of string
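As a rough illustration of the lookahead approach, here is a Python sketch that builds the same kind of pattern from a word list. The function names are hypothetical, and the list is assumed to be non-empty (an empty alternation would reject every string).

```python
import re

def build_exclusion_pattern(words):
    """Build ^(?!.*(?:w1|w2|...)).*$ from a non-empty list of literal words."""
    # re.escape keeps regex metacharacters in the words literal.
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"^(?!.*(?:" + alternation + r")).*$")

def is_allowed(pattern, candidate):
    """True when the candidate contains none of the excluded words."""
    return pattern.match(candidate) is not None

pattern = build_exclusion_pattern(["tree", "car", "ship"])

for candidate in ["a1testtre", "orangesh1", "apleship3"]:
    print(candidate, is_allowed(pattern, candidate))
```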
To reject a string that contains three or more doubled characters, you could use:
\b(?!(?:\w*(\w)\1){3})\w+\b
\b Word boundary
(?! Negative lookahead, assert what is on the right is not
(?: Non-capturing group
\w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
){3} Close non capturing group and repeat 3 times
) close negative lookahead
\w+ Match 1+ word characters
\b word boundary
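The same lookahead can be tried in Python. This is only a sketch: the function name is invented, and fullmatch is used in place of the \b anchors so that the whole string is tested as a single word.

```python
import re

# Negative lookahead: fail if three doubled characters can be found
# anywhere in the word; otherwise accept one or more word characters.
FEWER_THAN_THREE_PAIRS = re.compile(r"(?!(?:\w*(\w)\1){3})\w+")

def has_fewer_than_three_pairs(word):
    """True when the word contains at most two doubled characters."""
    return FEWER_THAN_THREE_PAIRS.fullmatch(word) is not None

for word in ["adminnisstrator21", "kkeeykloakk", "stack22ooverflow"]:
    print(word, has_fewer_than_three_pairs(word))
```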
Update
According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
                                               ^
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n";
my @words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } @words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
if ($string !~ /$exclude/) {
say "OK: $string";
}
}
I use Path::Tiny to read the file into a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any happen in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This may be possible to tweak and improve, depending on your use cases, for one in regards to the order of patterns with common parts in alternation.†
The check that successive duplicate characters occur fewer than three times:
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
my @chars_that_repeat = $string =~ /(.)\1+/g;
if (@chars_that_repeat < 3) {
say "OK: $string";
}
}
A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
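For comparison, the counting approach looks roughly like this in Python. The two function names are made up here; they differ only in whether the + quantifier is present.

```python
import re

def count_runs(text):
    """Count runs of a repeated character; 'aaaa' is one run."""
    return len(re.findall(r"(.)\1+", text))

def count_pairs(text):
    """Count non-overlapping doubled characters; 'aaaa' is two pairs."""
    return len(re.findall(r"(.)\1", text))

for text in ["adminnisstrator21", "kkeeykloakk", "stack22ooverflow"]:
    # Accept the string when it has fewer than three runs.
    print(text, count_runs(text) < 3)
```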
†  Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==> so|sole|solely
for a quicker match (so matches all three). This, by all means, appears to be the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round,
sort { length $b <=> length $a }
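The effect of the ordering is easy to check. Here is a small Python sketch using the so/sole/solely example from the footnote; the variable names are only for illustration.

```python
import re

words = ["sole", "solely", "so"]

# Shortest first: the first alternative that matches wins, so "so" is found.
shortest_first = "|".join(re.escape(w) for w in sorted(words, key=len))

# Longest first: the full word is tried before its shorter prefixes.
longest_first = "|".join(
    re.escape(w) for w in sorted(words, key=len, reverse=True)
)

print(shortest_first, "->", re.search(shortest_first, "solely").group(0))
print(longest_first, "->", re.search(longest_first, "solely").group(0))
```

Both patterns report that a match exists; they differ only in which alternative is reported as the match.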
I hope someone else will come with a better solution, but this seems to do what you want:
\b Match word boundary
(?: Start non-capturing group
(?:([a-z0-9])(?!\1))* Match all characters until it encounters a double
(?:([a-z0-9])\2)+ Match all repeated characters until a different one is reached
){0,2} Repeat the group 0 to 2 times
(?:([a-z0-9])(?!\3))+ Match all characters until it encounters a double
\b Match end of word
I changed the [a-z] to also match numbers, since the examples you gave seem to also include numbers. Perl regex also has the \w shorthand, which for ASCII text is equivalent to [A-Za-z0-9_] (by default it also matches other Unicode word characters), which could be handy if you want to match any character in a word.
My problem is that I have 2 regexes that work:
Do not allow 3 or more pairs of chars:
(?=^(?!(?:\w*(.)\1){3}).+$)
Do not allow a char to appear more than 4 times:
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
Now I want to combine them into one line like:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
but only the first regex works, not both of them.
As mentioned in a comment to @zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you up. If the words in the file are not very carefully ordered to start with, I use a subroutine like this when building the match string:
# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST
# TRUSTEES|TRUSTEE|TRUST
sub tight_match_order {
    return @_ unless @_ > 1;
    my (@alts, @ordered_alts, %alts_seen);
    @alts = map { $alts_seen{$_}++ ? () : $_ } @_;
    TEST: {
        my $alt = shift @alts;
        if (grep m#$alt#, @alts) {
            push @alts => $alt;
        } else {
            push @ordered_alts => $alt;
        }
        redo TEST if @alts;
    }
    @ordered_alts
}
So following @zdim's answer:
...
my @words = split ' ', path($file)->slurp;
@words = tight_match_order(@words); # add this line
my $exclude = join '|', map { quotemeta } @words;
...
HTH

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? should I use again preg_match and use offset = 1 in the arguments? but that offset is in characters or bytes right?
Anyway, after I have matched the words following the first, if they exist, should I do for each of them something like:
$second_word = substr( $following_word, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could keep the first two words whole when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that you cannot have a variable-length lookbehind in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches all other words and replaces it with first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) Captures an uppercase letter that is preceded by a horizontal space character.
\w+ matches one or more word characters.
Replacing the match with the contents of group 1 (\1) plus a dot gives the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
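The look-ahead plus word-boundary pattern is not PHP-specific. As a sanity check, here is the same substitution sketched in Python; the function name abbreviate is made up, and the names are the samples from the question.

```python
import re

def abbreviate(name):
    """Shorten every word except the first to its initial plus a dot."""
    # (?!^) refuses a match at the very start of the string, \b pins the
    # match to the start of a word, and (\w)\w+ needs at least two word
    # characters, so an existing initial such as "X." is left alone.
    return re.sub(r"(?!^)\b(\w)\w+", r"\1.", name)

for name in ["John Smith", "John Doe", "David X. Cohen", "Kim Jong Un", "Bob"]:
    print(abbreviate(name))
```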

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Regex atoms will match as much as they can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------- Matches "test12"
 |     +------------- Matches "3"
 |     |     +------- Matches ""
 |     |     |
---  -----  ---
\S*  (\d+)  \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, ... until it finds a digit, then it will match as many digits as it can.
The * quantifiers in your regex are greedy, which is why they also "eat" digits. Exactly what @Marc said, you don't need them.
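The backtracking described above can be reproduced in Python, whose quantifiers are greedy in the same way. A minimal sketch, with variable names chosen only for illustration:

```python
import re

text = "test123"

# Greedy \S* first swallows everything, then gives back just one
# character so that \d+ can match: the capture is only the last digit.
greedy = re.match(r"\S*(\d+)\S*", text).group(1)

# Without the surrounding \S*, the engine scans forward to the first
# digit and \d+ takes the whole run.
simple = re.search(r"(\d+)", text).group(1)

# A lazy \S*? gives up as little as possible, so \d+ also gets the full run.
lazy = re.match(r"\S*?(\d+)", text).group(1)

print(greedy, simple, lazy)
```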
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
Were it a case where a non-digit was required (say before, per your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression satisfies any condition with one or more digits, and that may be what you want. It could be a string of digits surrounded by spaces (" 123 "), because the border between the last space and the first digit satisfies zero-or-more non-space, same thing is true about the border between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
I think parentheses signify capture groups, which is exactly what you don't want. Remove them. You're looking for /\d+/ or /[0-9]+/

Trying to understand this perl regex bracketed character class?

Below is a script that I was playing with. The script below prints a:
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.])/ ) {
print $1."\n";
}
BUT if I change it to:
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.]+)/ ) {
print $1."\n";
}
then it prints: abc/test/.
From my understanding the + matches one or more of the matching sequence; correct me if I am wrong, please. But why does it only match a in the first case? I thought it should match nothing!
Thank you.
You are correct. In the first case you match a single character from that character class, while in the second you match at least one, with as many as possible after the first one.
First one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
)
"
Second one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
In regexes, characters in brackets only count for a match of one character within the given bracket. In other words, [\w\/\.] matches exactly one of the following characters:
An alphanumeric character or "_" (the \w).
A forward slash (the \/--notice that the forward slash needs to be escaped, since it is used as the default marker for the beginning and end of a regex)
A period (the \.--again, escaped since . denotes any character except the newline character).
Because /cd ([\w\/\.])/ only captures one character into $1, it grabs the first character, which in this case is "a".
You are correct in that the + allows for a match of one or more such characters. Since regexes match greedily by default, you should get all of "abc/test/." for $1 in the second match.
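The two cases behave the same way in Python, which makes for a quick check. In this sketch the slash and period need no escaping inside the character class, and the variable names are made up.

```python
import re

tmp = "cd abc/test/."

# Without a quantifier the class matches exactly one character.
single = re.search(r"cd ([\w/.])", tmp).group(1)

# With + it matches as many class characters in a row as it can.
many = re.search(r"cd ([\w/.]+)", tmp).group(1)

print(single)
print(many)
```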
If you haven't already done so, you might want to peruse perldoc perlretut.

regular expressions: find every word that appears exactly one time in my document

Trying to learn regular expressions. As practice, I'm trying to find every word that appears exactly one time in my document -- in linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon)
So I thought the following expression would give me the desired result:
\w{1}
But this doesn't work. The \w returns a character, not a whole word. Also, it does not appear to be giving me characters that appear only once (it actually returns 25873 matches -- which I assume are all alphanumeric characters). Can someone give me an example of how to find a "hapax legomenon" with a regular expression?
If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
Console.WriteLine(m.Value);
Output:
bird
horse
It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
@"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
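The parse-then-count idea needs very little code when an in-memory counter is enough. A minimal Python sketch, assuming words are plain runs of ASCII letters and that case matters; the function name is made up:

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words that occur exactly once, in order of first appearance."""
    # Tokenize with the [a-zA-Z]+ pattern, then count each token.
    counts = Counter(re.findall(r"[a-zA-Z]+", text))
    return [word for word, count in counts.items() if count == 1]

sample = "goat goat leopard bird leopard horse"
print(hapax_legomena(sample))
```

For a case-insensitive count, the text could be lowercased before tokenizing.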
(\w+){1} will match each word.
After that you could always perform the count on the matches....
Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
Low level but does what you want:
Split your text into an array using preg_split (the old split() function was removed in PHP 7):
$array = preg_split('/\s+/', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Skip anything that contains a non-letter character:
if (preg_match('/[^a-zA-Z]/', $word)) continue;
Add the word to a temporary array as key:
if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop. Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}