Why does one extra word break the correct output of a regex (Perl)?

I want to understand a situation with regular expressions in Perl.
$str = "123-abc 23-rr";
I need to show both words on either side of the minus sign.
The regular expression is:
@mas = $str =~ /(?:([\d\w]+)\-([\d\w]+))/gx;
And it shows the right output: 123, abc, 23, rr.
But if I change the string a little and put one word at the start:
$str = "word 123-abc 23-rr";
And I want to take this first word into account, so I change my regexp:
@mas = $str =~ /\w+\s(?:\s*([\d\w]+)\-([\d\w]+))*/gx;
My output should be the same, but I get: 23, rr. If I remove \s* or * the output is 123, abc. But that's still not right. Does anyone know why?

Rather than making an ever more specific regex for an ever more specific string, consider taking advantage of the overall pattern.
Each piece is separated by whitespace.
The first piece is a word.
The rest are pairs separated by dashes.
First split the pieces on whitespace.
my @pieces = split /\s+/, $str;
Then remove the first piece; it doesn't have to be split.
my $word = shift @pieces;
Then split each piece on - into pairs.
my %pairs = map { split /-/, $_ } @pieces;
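For comparison, here is a minimal Python sketch of the same split-based steps. The variable names are my own, and it assumes every piece after the first word contains exactly one dash.

```python
# Sketch of the split-based approach; names are illustrative only.
text = "word 123-abc 23-rr"

# Split on runs of whitespace, then peel off the leading word.
word, *pieces = text.split()

# Each remaining piece is "left-right"; split once on the dash.
pairs = dict(piece.split("-", 1) for piece in pieces)

print(word)   # word
print(pairs)  # {'123': 'abc', '23': 'rr'}
```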

For each match, each capture is returned.
In the first snippet, the pattern matches twice.
123-abc 23-rr
\_____/ \___/
There are two captures, so four (2*2=4) values are returned.
In the second snippet, the pattern matches once.
word 123-abc 23-rr
\________________/
There are two captures, so two (2*1=2) values are returned.
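The same effect can be reproduced in Python, whose `re.findall` also returns the captures once per match, with a repeated group keeping only its last iteration. This is only an illustration of the counting argument above, not Perl code; the last pattern is one possible fix that simply ignores the leading word.

```python
import re

# Two matches, two captures each: four values.
two_matches = re.findall(r"([\d\w]+)-([\d\w]+)", "123-abc 23-rr")

# One match spanning the whole string; the repeated group keeps
# only the captures from its final iteration: two values.
one_match = re.findall(r"\w+\s(?:\s*([\d\w]+)-([\d\w]+))*", "word 123-abc 23-rr")

# Possible fix: match each pair separately and let the scan skip "word".
fixed = re.findall(r"(\w+)-(\w+)", "word 123-abc 23-rr")

print(two_matches)  # [('123', 'abc'), ('23', 'rr')]
print(one_match)    # [('23', 'rr')]
print(fixed)        # [('123', 'abc'), ('23', 'rr')]
```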

Related

Regex for text (and numbers and special characters) between multiple commas [duplicate]

I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
As per your question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence does not consume, the delimiting commas.
This regexp would match " and hence does not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas is why your attempt only matched every other candidate.
Also, if the whole input is a single string, you will want to stop matches from spanning line breaks. In that case, use:
(?<=,)[^,\n]+(?=,)
http://www.phpliveregex.com/p/1DJ
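To see the "every other candidate" effect concretely, here is a small Python sketch comparing a pattern that consumes the commas with the lookaround version. The sample string is made up for the illustration.

```python
import re

sample = "a,b,c,d,e"

# Consuming the delimiters: the closing comma of one match cannot
# also serve as the opening comma of the next, so "c" is skipped.
consuming = re.findall(r",[^,]+,", sample)

# Lookarounds only assert the commas, so every field between
# two commas is found.
lookaround = re.findall(r"(?<=,)[^,]+(?=,)", sample)

print(consuming)   # [',b,', ',d,']
print(lookaround)  # ['b', 'c', 'd']
```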
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that keywords will be on separate lines to the rest of the input and separated by commas you can match the lines with keywords in like:
Regex
#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8
Output
// preg_match_all('#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
1. # => Starting delimiter
2. (?:^) => Matches start of line in a non-capturing group (you could just use ^; I was using |\n originally and didn't update)
3. ( => Start a capturing group
4. (?: => Start a non-capturing group
5. (?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (using a character class so that you can add to it if you need to)
6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
8. ) => Close the capture group (3)
9. # => Ending delimiter
10. m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now returned an array of lines only containing key words you can count the number of commas and thus get the number of keywords
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
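If the trailing-comma off-by-one is a concern, one alternative is to count the non-empty pieces rather than the commas. Below is a rough Python sketch of that idea. It reuses the line pattern from above with `.` allowed in a keyword; the function name and sample text are mine, and it still assumes keyword lines contain nothing but keywords and commas.

```python
import re

text = (
    "[random text, with commas, dots and all]\n"
    "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
    "Keyword6, keyword7, keyword8..."
)

# A line made up only of keyword-like tokens separated by commas.
KEYWORD_LINE = re.compile(r"^(?:[\w.]+(?:, ?|$))+", re.MULTILINE)

def count_keywords(s):
    """Count non-empty comma-separated pieces on keyword-only lines."""
    total = 0
    for match in KEYWORD_LINE.finditer(s):
        total += sum(1 for part in match.group(0).split(",") if part.strip())
    return total

keyword_count = count_keywords(text)
print(keyword_count)  # 8
```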
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in them are spam, then you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword, though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
How to match on the String of text between the commas?
This SO post was marked as a duplicate of my question. However, it is NOT a duplicate, and none of the answers in THIS SO post answered my question of how to also match on the strings between the commas, so see below for how to take this a step further.
How to Match on single digit values in a CSV String
For example if the task is to search the string within the commas for a single 7, 8 or a single 9 but not match on combinations such as 17 or 77 or 78 but only the single 7s, 8s, or 9s see below...
The answer is to Use look arounds and place your search pattern within the look arounds:
(?<=^|,)[789](?=,|$)
See live demo.
The above pattern is more concise; however, I've pasted below the two patterns provided as solutions to the question of matching on strings within the commas:
(?<=^|,)[789](?=,|$) Provided by @Bohemian and chosen as the correct answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by @Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
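Here is a short Python sketch of the negative-lookaround form, applied both to whole fields and to the single-digit case discussed earlier. I use this form because, as far as I know, Python's `re` module rejects lookbehinds whose alternatives differ in width, such as `(?<=^|,)`. The sample strings are made up.

```python
import re

# "Not preceded by a non-comma" is true at the start of the string
# and right after a comma; the lookahead mirrors that at the end.
BETWEEN_COMMAS = re.compile(r"(?<![^,])[^,]*(?![^,])")
SINGLE_DIGIT = re.compile(r"(?<![^,])[789](?![^,])")

fields = BETWEEN_COMMAS.findall("alpha,beta,gamma")
digits = SINGLE_DIGIT.findall("7,17,8,78,9")

print(fields)  # ['alpha', 'beta', 'gamma']
print(digits)  # ['7', '8', '9']
```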
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
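$trimmed = trim($string); // drop surrounding whitespace so the keyword line comes last
$lastEOL = strrpos($trimmed, PHP_EOL); // offset of the last line break
$keywordLine = substr($trimmed, $lastEOL + strlen(PHP_EOL)); // everything after that break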
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
I know it is not a regex, but I hope it helps nevertheless.
The only way to find a solution is to find something that separates the random text from the keywords and that does not occur within the keywords. If the keywords can contain a new line, you cannot use that. But what about 2 consecutive new lines, or some other sequence of characters?
$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
$trimmed = trim($string); // search and cut the same string so the offsets agree
$lastEOL = strrpos($trimmed, PHP_EOL . PHP_EOL); // 2 end of lines after random text
$keywordLine = substr($trimmed, $lastEOL);
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
(edit: added example for more new lines - long shot)
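The same blank-line heuristic, sketched in Python for comparison. The function name is hypothetical, and the sketch assumes the keyword block is whatever follows the last blank line; if there is no blank line it reports zero rather than guessing.

```python
def count_trailing_keywords(text):
    """Count comma-separated keywords after the last blank line."""
    body = text.strip()
    # rpartition splits on the LAST occurrence of the separator.
    _, separator, tail = body.rpartition("\n\n")
    if not separator:
        return 0  # no blank line found, so no keyword block by this heuristic
    return sum(1 for part in tail.split(",") if part.strip())

sample = (
    " some gibberish, some more gibberish, and random text\n"
    "\n"
    "keyword1, keyword2, keyword3,\n"
    "keyword4, keyword5, keyword6,\n"
    "keyword7, keyword8, keyword9\n"
)
print(count_trailing_keywords(sample))  # 9
```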

Can Regex find 3 words in exact order without repeating in between? Without lookahead?

This is fairly easy if you just want to find out whether a sentence contains 3 words. But if you want to make sure it contains the 3 words in order, without any of them repeating in between, that’s what I’m having a hard time with. So, a sample sentence:
This is a sample sentence to find a sample which will work for finding words
So in the above sentence, if we want to find if the sentence contains the 3 words:
Sample
Sentence
Work
That is straightforward. Something like:
sample.*?sentence.*?work
But if we want to find out whether they appear in the exact order without repeating, it gets tricky. As you can see in the sample sentence, the word “sample” is repeated before the word “work”, so that line would not be valid. Using a lookahead I was able to get something that almost works, but the kicker is that my engine cannot use lookaheads or lookbehinds.
Is this possible to do without a lookahead?
Yes, it is possible, but the regex is long, and tedious and error-prone to write.
/\bsample\b
(?:[^sw]|\B[sw]|[sw]\b|
s(?:[^ae]|[ae]\b|
a(?:[^m]|m\b|
m(?:[^p]|p\b|
p(?:[^l]|l\b|
l(?:[^e]|e\B))))|
e(?:[^n]|n\b|
n(?:[^t]|t\b|
t(?:[^e]|e\b|
e(?:[^n]|n\b|
n(?:[^c]|c\b|
c(?:[^e]|e\B))))))
)|
w(?:[^o]|o\b|
o(?:[^r]|r\b|
r(?:[^k]|k\B))
)
)*
\bsentence\b
(?:[^sw]|\B[sw]|[sw]\b|
s(?:[^ae]|[ae]\b|
a(?:[^m]|m\b|
m(?:[^p]|p\b|
p(?:[^l]|l\b|
l(?:[^e]|e\B))))|
e(?:[^n]|n\b|
n(?:[^t]|t\b|
t(?:[^e]|e\b|
e(?:[^n]|n\b|
n(?:[^c]|c\b|
c(?:[^e]|e\B))))))
)|
w(?:[^o]|o\b|
o(?:[^r]|r\b|
r(?:[^k]|k\B))
)
)*
\bwork\b/x
The regex will not match part of a sentence with the words 'sample', 'sentence', or 'work' in that order, if the words 'sample', 'sentence', or 'work' appear in between the words 'sample' and 'sentence' or in between the words 'sentence' and 'work'.
Heavy use is made of the word boundary zero-width assertion \b and its opposite \B.
Perl example:
my $s = 'This is a sample sentence which will work for finding words'; # Should match
# my $s = 'This is a sample sentence to find a sample which will work for finding words'; # Shouldn't match
my $re = '\bsample\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bsentence\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bwork\b';
if ($s =~ $re) {
print "$&";
}
JavaScript example:
const sentences = [
// Should match:
'This is a sample sentence which will work for finding words',
'This is a sampled sample or sampled sentence which works and will work for finding words',
'This work is a sentence sample and it is a sentence which will work as a sample sentence',
// Shouldn't match:
'This is a sample sentence to find a sample which will work for finding words',
'This is a sample to work to find a sentence which will work for finding words',
'This is a sentence to find a work sample which will work for finding words',
'This is a sentence to find which sample will work for finding words',
'This is a sample sentence to find a sample which will work for finding words',
'This is a sentence to find a sample which will work for finding words',
];
const re = /\bsample\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bsentence\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bwork\b/;
for (const s of sentences) {
const m = s.match(re);
if (m != null) console.log(m[0]);
}
No, you can't do it without lookahead. Think about it.
I'll give you some analytic/algorithmic help so you can figure out the regex solution:
Write the regex that matches the first word.
Then append to it a regex that has a negative lookahead to prevent matching the first word at every position as it scans for the second word.
Then append another regex that has a negative lookahead to prevent matching the first and second words at every position as it scans for the third word.
This is totally doable. But as you can see, a sequence of four words means a yet longer regex, and so forth.
It would be easier with a loop construct in procedural code, especially if you need to generalize for N words.
You can’t take advantage of a Regex that supports recursion, because successive calls are not the same.
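As a sketch of the procedural route, here is one way it could look in Python. The idea (my own framing, not from the answers above) is to keep only the occurrences of the target words, in order, and then check whether the targets appear as a contiguous run in that filtered list. The function name is hypothetical, and words are compared case-insensitively as whole `\w+` tokens.

```python
import re

def in_order_without_repeats(sentence, words):
    """True if the words occur in order with no target word in between."""
    targets = [w.lower() for w in words]
    # Keep only the tokens that are target words, preserving order.
    seen = [t for t in re.findall(r"\w+", sentence.lower()) if t in targets]
    n = len(targets)
    # Look for the targets as a contiguous run in the filtered list.
    return any(seen[i:i + n] == targets for i in range(len(seen) - n + 1))

words = ["sample", "sentence", "work"]
print(in_order_without_repeats(
    "This is a sample sentence which will work for finding words", words))  # True
print(in_order_without_repeats(
    "This is a sample sentence to find a sample which will work for finding words",
    words))  # False
```

This generalizes to N words without the pattern growing, which is the main appeal over the regex.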

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the script's regexp, if I specify "number n1", it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Right now it always prints the last one (the final line, numbers 56,4,5,5:) as output. How can I resolve this issue?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
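The greedy versus non-greedy difference is not specific to Tcl. Here is a small Python illustration using the same data. Note two assumptions: `[\s\S]` stands in for Tcl's newline-matching dot, and Python's `re` does allow mixing greedy and non-greedy quantifiers in one pattern, so the trailing `.*` can simply stop at the end of the line.

```python
import re

test = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""

# Greedy: runs to the end, then backtracks to the LAST "numbers ".
greedy = re.search(r"number n1[\s\S]*(numbers .*)", test).group(1)

# Non-greedy: stops at the FIRST "numbers " after "number n1".
lazy = re.search(r"number n1[\s\S]*?(numbers .*)", test).group(1)

print(greedy)  # numbers 56,4,5,5
print(lazy)    # numbers 2,3,4,5,6
```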
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
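For comparison, the "use the variable as a literal" problem looks like this in Python, where `re.escape` does the tidying that a `regsub` call would do in Tcl. This is only a sketch: the function name and data layout are my own, and it works on the newline-separated text rather than the colon-joined version.

```python
import re

def numbers_for(label, text):
    """Return the 'numbers ...' line that follows 'number <label>', or None."""
    # re.escape makes the label match literally even if it contains
    # regex metacharacters such as '.' or '*'.
    pattern = rf"number {re.escape(label)}[^\n]*\n(numbers [^\n]*)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

data = "c for cat\nnumber n1\nnumbers 2,3,4,5,6\nd for doctor\nnumber n2\nnumbers 56,4,5,5\n"
print(numbers_for("n1", data))  # numbers 2,3,4,5,6
print(numbers_for("n2", data))  # numbers 56,4,5,5
```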

Regular expression that finds and replaces a long string of words

I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.
I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild the string as you go based on your input.
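string retVal = "";
foreach (char c in myInput) {
    if (char.IsUpper(c)) {
        retVal += " " + c; // start a new word before each capital letter
    } else {
        retVal += c; // copy every other character unchanged
    }
    // insert the rest of your conditions
}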
Try this regular expression: \d+|[A-Z][a-z]*
It will collect all the matches, and you must then join them with spaces.
This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove trailing spaces and the extension.
The first expression can be broken into parts: (?<![\d])\d finds a digit not preceded by another digit, and ((?<![A-Z])[A-Z](?![A-Z])) finds an uppercase letter not preceded or followed by an uppercase letter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.
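Putting the match-and-join idea together with the separate extension removal, a minimal Python sketch might look like this. The function name is made up, and it only handles the simple CamelCase-plus-digits case from the question; a name like 'BackInTheUSSR.jpg' would need the extra rules discussed above.

```python
import re

def spaced_name(filename):
    """Split a run-together file name into space-separated words."""
    stem = re.sub(r"\.jpg$", "", filename)          # drop the extension
    parts = re.findall(r"[0-9]+|[A-Z][a-z]*", stem)  # digit runs and capitalized words
    return " ".join(parts)

print(spaced_name("013TheBlueCowJumpedOverTheFence1984.jpg"))
# 013 The Blue Cow Jumped Over The Fence 1984
```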

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:
which matches the text up to the : but I want just the part after the email, up to the :
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last@site.domain.com> - Email address in format FirstName.LastName@sub.domain.com
1234, 4657 - database primary keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in the example: 1234, 4657, two separate IDs), and query the tables.
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
You can fudge the stuff you don't care about to make the expression easier: just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two capture groups, the (1234) and the (1234, 4657). I can only assume from your pattern that the second one means: "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return }    # bare return: empty list in list context, undef in scalar context
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
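For readers more comfortable with Python, here is the same loose approach as a sketch. The pattern is carried over unchanged from the Perl version; the sample lines are taken from the question, and an empty list stands in for the no-match case.

```python
import re

def extract_nums(line):
    """Return the IDs found between the <email> part and the colon."""
    match = re.search(r"<[^>]*> *([\d, ]+):", line)
    if match is None:
        return []
    # Pull the individual digit strings out of e.g. "1234, 4657".
    return re.findall(r"\d+", match.group(1))

line1 = "(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf"
line2 = "(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf"
print(extract_nums(line1))  # ['1224']
print(extract_nums(line2))  # ['1234', '4657']
```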
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar@domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar@domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.