Trying to understand this perl regex bracketed character class?

Below is a script that I was playing with. As written, it prints a:
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.])/ ) {
print $1."\n";
}
BUT if I change it to:
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.]+)/ ) {
print $1."\n";
}
then it prints: abc/test/.
From my understanding, + matches one or more of the preceding element; please correct me if I am wrong. But why does the first case match only a? I thought it should match nothing!
Thank you.

You are correct. In the first case you match a single character from that character class, while in the second you match at least one, with as many as possible after the first one.
First one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
)
"
Second one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"

In a regex, a bracketed character class matches exactly one character from the set listed inside the brackets. In other words, [\w\/\.] matches exactly one of the following characters:
An alphanumeric character or "_" (the \w).
A forward slash (the \/). The forward slash needs to be escaped because it is the default delimiter that marks the beginning and end of a regex.
A period (the \.). Outside a character class an unescaped . matches any character except a newline; inside brackets it is already literal, so the backslash here is harmless but not required.
Because /cd ([\w\/\.])/ only captures one character into $1, it grabs the first character after "cd ", which in this case is "a".
You are correct in that the + allows for a match of one or more such characters. Since regexes match greedily by default, you should get all of "abc/test/." for $1 in the second match.
If you haven't already done so, you might want to peruse perldoc perlretut.
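For comparison, the same behaviour can be reproduced outside Perl. The following is a minimal sketch using Python's re module (my choice for illustration, not something from the question). In Python neither / nor . needs escaping inside the brackets, so the class is written [\w/.].

```python
import re

tmp = "cd abc/test/."

# Without a quantifier the class consumes exactly one character.
single = re.search(r"cd ([\w/.])", tmp)

# With + the class repeats greedily, taking as many characters as it can.
greedy = re.search(r"cd ([\w/.]+)", tmp)

print(single.group(1))  # a
print(greedy.group(1))  # abc/test/.
```

In both cases the regex succeeds as soon as it finds a match, which is why the first pattern yields "a" rather than nothing.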

Related

Regex POSIX - How can i find if the start of a line contains a word from a word that appears later in line

I have a UNIX passwd file and I need to find, using egrep, whether the first 7 characters of the GECOS field appear inside the username. For example, I want to check whether the username (jkennedy) contains the word "kennedy" from the GECOS.
I was planning to use back-references, but the username comes before the GECOS, so I don't know how to implement it.
For example the passwd file contains this line:
jkennedy:x:2473:1067:kennedy john:/root:/bin/bash

As per my original comment, the regex below works for me.
The online demo of it uses a slightly different regex that is better suited for display. The regex below is the POSIX version: it drops the non-capture groups and the unneeded capture group around the backreference.
^[^:]*([^:]{7})([^:]*:){4}\1.*$
^ assert position at the start of the line
[^:]* match any character except : any number of times
([^:]{7}) capture exactly seven of any character except :
([^:]*:){4} match the following exactly four times
[^:]*: match any character except : any number of times, followed by : literally
\1 match the backreference; matches what was previously matched by the first capture group
.* match any character (except newline characters) any number of times
$ assert position at the end of the line
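To see the backreference at work, here is a small sketch that runs the same pattern through Python's re module instead of egrep (Python is only used for illustration). The second sample line is a made-up entry added to show a non-match.

```python
import re

# Same pattern as above; Python's re module supports the \1 backreference.
PATTERN = re.compile(r"^[^:]*([^:]{7})([^:]*:){4}\1.*$")

lines = [
    "jkennedy:x:2473:1067:kennedy john:/root:/bin/bash",
    "bwilliams:x:1001:1001:Robert Williams:/home/bw:/bin/sh",  # made-up entry
]

# Keep only the lines where 7 characters of the username reappear
# at the start of the fifth (GECOS) field.
matches = [line for line in lines if PATTERN.search(line)]
print(matches)
```

Note that the comparison is case-sensitive, so a GECOS of "Kennedy john" would not match the username jkennedy with this pattern.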

Assuming you do NOT want case sensitivity to foul your matching -
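declare -l tmpUsr tmpName   # -l lowercases values on assignment (bash 4+)
while IFS=: read -r usr x x x name x
do
  tmpUsr="$usr"; tmpName="$name"
  # skip empty GECOS fields; the first 7 characters are used as a regex
  (( ${#name} )) && [[ "$tmpUsr" =~ ${tmpName:0:7} ]] &&
    printf '%s (%s<%s>)\n' "$usr" "$name" "${tmpName:0:7}"
done </etc/passwd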
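The same case-insensitive check can be sketched in Python. The function name gecos_prefix_in_username and its width parameter are hypothetical, chosen for this example. Unlike the shell loop above, it does a plain substring test rather than treating the GECOS prefix as a regex.

```python
def gecos_prefix_in_username(line, width=7):
    """Return True if the first `width` characters of the GECOS field
    occur in the username, ignoring case."""
    fields = line.rstrip("\n").split(":")
    # A passwd entry needs at least five fields and a non-empty GECOS.
    if len(fields) < 5 or not fields[4]:
        return False
    username, gecos = fields[0].lower(), fields[4].lower()
    return gecos[:width] in username


print(gecos_prefix_in_username("jkennedy:x:2473:1067:Kennedy John:/root:/bin/bash"))  # True
print(gecos_prefix_in_username("jkennedy:x:2473:1067:John Kennedy:/root:/bin/bash"))  # False
```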

Regex perl with letters and numbers

I need to extract a string from a text file that contains both letters and numbers. The lines start like this
Report filename: ABCL00-67900010079415.rpt ______________________
All I need is the last 8 numbers so in this example that would be 10079415
while (<DATA>) {
    if (/Report filename/) {
        my ($bagID) = ( m/(\d{8}+)./ );
        print $bagID;
    }
}
Right now this prints out the first 8 but I want the last 8.

You just need to escape the dot, so that the pattern matches the 8 digits that come immediately before the dot character.
my ($bagID) = ( m/(\d{8}+)\./ );
. is a special character in regex that matches any character. To match a literal dot, you must escape it.

To match the last of anything, just precede it with a wildcard that will match as many characters as possible:
my ($bag_id) = / .* (\d{8}) /x;
Note that I have also used the /x modifier so that the regex can contain insignificant whitespace for readability. Also, your \d{8}+ is what is called a possessive quantifier; it is used for optimising some regex constructions and makes no difference here.
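The three variants discussed in this thread can be compared side by side. This is a sketch in Python rather than Perl (my choice for illustration), and it uses a plain \d{8} because the possessive form is not needed to show the difference.

```python
import re

line = "Report filename: ABCL00-67900010079415.rpt ______________________"

first_eight = re.search(r"(\d{8}).", line).group(1)   # unescaped dot: any character
before_dot = re.search(r"(\d{8})\.", line).group(1)   # escaped dot: a literal "."
last_eight = re.search(r".*(\d{8})", line).group(1)   # greedy prefix pushes the match right

print(first_eight)  # 67900010
print(before_dot)   # 10079415
print(last_eight)   # 10079415
```

The escaped-dot version relies on the digits being followed by ".rpt", while the greedy-prefix version picks the last run of 8 digits anywhere on the line.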

Perl Regular Expression extracting sub-string?

I have a String variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion, i.e. the portion from the beginning up to the first appearance of a dot (.). The problem is that there can be any number of characters (alphanumeric only) before the dot. So I created this regexp.
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However, it gives me a blank string. My logic is:
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?

You can instead use a negated character class like this
^[^.]+
[^.] would match any character except .
[^.]+ would match 1 to many characters(except .)
^ depicts the start of string
OR
^.+?(?=\.|$)
(?=...) is a lookahead, which checks for a particular pattern after the current position without consuming it. So for the text abcdad, the regex a(?=b) matches only the first a, the one that is followed by b.
$ depicts the end of a line (if used with the multiline option) or the end of the string (the default, without the multiline option)
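A short sketch of the lookahead behaviour, using Python's re module for illustration (the pattern syntax is the same here as in Perl):

```python
import re

# a(?=b) matches an "a" only where a "b" follows; the "b" is not consumed.
hits = [m.span() for m in re.finditer(r"a(?=b)", "abcdad")]
print(hits)  # [(0, 1)]

# ^.+?(?=\.|$) stops before the first dot, or at the end if there is no dot.
with_dot = re.search(r"^.+?(?=\.|$)", "ABCD.asd.qwe.com:/dir1").group()
without_dot = re.search(r"^.+?(?=\.|$)", "ABCD").group()
print(with_dot)     # ABCD
print(without_dot)  # ABCD
```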

\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is .:
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains ".".
If the first character of the string isn't .:
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
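The difference between the original pattern and the two fixes can be checked quickly. This sketch uses Python's re module purely for illustration; the patterns themselves are the ones from this answer.

```python
import re

arg = "ABCD.asd.qwe.com:/dir1"

# Both parts of (.*?\.?) may match nothing, so the match at position 0 is empty.
empty = re.search(r"(.*?\.?)", arg).group(1)
print(repr(empty))  # ''

# Anchoring and requiring the literal dot forces .*? to grow up to it.
lazy = re.search(r"^(.*?)\.", arg).group(1)
print(lazy)  # ABCD

# The negated class never crosses a dot, so no lazy quantifier is needed.
negated = re.search(r"^([^.]*)", arg).group(1)
print(negated)  # ABCD
```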

use strict;
use warnings;
my $string = "ABCD.asd.qwe.com:/dir1";
$string =~ /([^.]+)/;
my $capture = $1;
print "$capture\n";
Or you can use the split function:
my $sub_string = ( split /\./, $string )[0];
print "$sub_string\n";
A general note: for an explanation of a complex regex, take a look at the YAPE::Regex::Explain module.

This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first dot.
You could change the .+ to .* if your input may end right after the first dot.
You could change the first .*? to .+? if you are sure that there is always at least one character before the first dot.

What does this regular expression try to match?

These days I am learning regular expressions, but they seem a little hard to me. I am reading some code in TCL; what is this expression trying to match?
regexp ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]" $input

If you un-escape the characters, you get the following:
.* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]
The term [\d]{x} would match x number of consecutive digits. Therefore, the portion inside the parentheses would match something of the form ###:###:###?##### (where # can be any digit and ? can be any character). The parentheses themselves aren't matched, they're just used for specifying what part of the input to "capture" and return to the caller. Following this sequence is a single dot ., which matches a single character (which can be anything). The trailing [^\n] will match a single character that is anything except a newline (a ^ at the start of a bracketed expression inverts the match). The .* term at the very beginning matches a sequence of characters of any length (even zero), followed by a space.
With all of this taken into account, it appears that this regular expression extracts a series of digits from the middle of a line. Given the format of the numbers, it may be looking for a timestamp in the hours:minutes:seconds.milliseconds format (although if that is the case, {1,3} and {1,5} should be used instead). The trailing .[^\n] term looks like it could be trying to exclude timestamps that are at or near the end of a line. Timestamped logs often have a timestamp followed by some sort of delimiting character (:, >, a space, etc). A regular expression like this might be used to extract timestamps from the log while ignoring "blank" lines that have a timestamp but no message.
Update:
Here's an example using TCL 8.4:
% set re ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]"
% regexp $re "TEST: 123:456:789:12345> sample log line"
1
% regexp $re " 111:222:333.44444 foo"
1
% regexp $re "111:222:333.44444 foo"
0
% regexp $re " 111:222:333.44444 "
0
% regexp $re " 10:44:56.12344: "
0
%
% regexp $re "TEST: 123:456:789:12345> sample log line" match data
1
% puts $match
TEST: 123:456:789:12345>
% puts $data
123:456:789:12345
The first two examples match the expression. The third fails because it lacks the space character before the first number sequence. The fourth fails because it doesn't have a non-newline character at the end after the trailing space. The fifth fails because the numerical sequences don't have enough digits. By passing parameters after the input, you can store the part of the input that matched the expression as well as the data that was "captured" by using parentheses. See the TCL wiki for details on the regexp command.
The interesting part with TCL is that you have to escape the [ character but not the ], while both the { and } need escaping.

.* ==> match junk part of the input
( ==> start capture
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}. ==> match 3 digits followed by any character
\[\\d]\{5\} ==> match 5 digits
). ==> close capture and match any character
\[^\\n] ==> match a character that is not a newline
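For readers without a Tcl shell, the un-escaped pattern can be tried in Python as a rough cross-check. This is a sketch under the assumption that the two engines treat this pattern the same way on single-line input; the sample strings are the ones from the Tcl session above.

```python
import re

# The Tcl pattern with the Tcl-level escaping removed.
pattern = re.compile(r".* (\d{3}:\d{3}:\d{3}.\d{5}).[^\n]")

samples = [
    "TEST: 123:456:789:12345> sample log line",
    " 111:222:333.44444 foo",
    "111:222:333.44444 foo",
    " 111:222:333.44444 ",
    " 10:44:56.12344: ",
]

captured = []
for s in samples:
    # search() is unanchored, like Tcl's regexp command.
    m = pattern.search(s)
    captured.append(m.group(1) if m else None)
    print(repr(s), "->", captured[-1])
```

Only the first two samples produce a capture, which agrees with the 1/1/0/0/0 results shown in the Tcl session.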

regular expressions: find every word that appears exactly one time in my document

Trying to learn regular expressions. As a practice, I'm trying to find every word that appears exactly one time in my document -- in linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon)
So I thought the following expression would give me the desired result:
\w{1}
But this doesn't work. The \w returns a character, not a whole word. Also, it does not appear to be giving me characters that appear only once (it actually returns 25873 matches -- which I assume are all alphanumeric characters). Can someone give me an example of how to find a "hapax legomenon" with a regular expression?

If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
    Console.WriteLine(m.Value);
Output:
bird
horse

It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
@"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}

If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
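A minimal sketch of that count-then-filter approach, written in Python with an in-memory Counter standing in for the "memory structure". The function name hapax_legomena is made up for this example, and the matching is case-sensitive.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return the words that occur exactly once, in order of appearance."""
    words = re.findall(r"[a-zA-Z]+", text)
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


print(hapax_legomena("goat goat leopard bird leopard horse"))  # ['bird', 'horse']
```

Lowercasing the text first would be one way to treat "Goat" and "goat" as the same word.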

(\w+){1} will match each word.
After that you could always perform the count on the matches.

Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
    // a count of 1 means the word appears exactly once
    if ($count === 1) echo $word . "\n";
}

Low level but does what you want:
Split your text into an array (split() was removed in PHP 7, so use preg_split):
$array = preg_split('/\s+/', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Check that each entry is a word, skipping it if it contains anything other than letters:
if (preg_match('/[^a-zA-Z]/', $word)) continue;
Add the word to a temporary array as key:
if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop, iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
    // a count of 1 means the word appears exactly once
    if ($count === 1) echo $word . "\n";
}
