Regex: match when string has repeated letter pattern - regex

I'm using the Regex interpreter found in XYplorer file browser. I want to match any string (in this case a filename) that has repeated groups of 'several' characters. More specifically, I want a match on the string:
jack johnny - mary joe ken johnny bill
because it has 'johnny' at least twice. Note that it has spaces and a dash too.
It would be nice to be able to specify the length of the group to match, but in general 4, 5 or 6 will do.
I have looked at several previous questions here, but either they are for specific patterns or involve some language as well. The one that almost worked is:
RegEx: words with two letters repeated twice (eg. ABpoiuyAB, xnvXYlsdjsdXYmsd)
where the answer was:
\b\w*(\w{2})\w*\1
However, this fails when there are spaces in the strings.
I'd also like to limit my searches to .jpg files, but XYplorer has a built-in filter to only look at image files so that isn't so important to me here.
Any help will be appreciated, thanks.
EDIT -
The regex by OnlineCop below answered my original question, thanks very much:
(\b\w+.*\b).*(\1)
I see that it matches words, not arbitrary string chunks, but that works for my present need. And I am not interested in capturing anything, just in detecting a match.
As a refinement, I wonder if it can be changed or extended to allow me to specify the length of words (or string chunks) that must be the same in order to declare a match. So, if I specified a match length of 5 and my filenames are:
1) jack john peter paul mary johnnie.jpg
2) jack johnnie peter paul mary johnnie.jpg
the first one would not match since no substring of five characters or more is repeated. The second one would match since 'johnnie' is repeated and is more than 5 chars long.

Do you wish to capture the word 'johnny' or the stuff between them (or both)?
This example shows that it selects everything from the first 'johnny' to the last, but it does not capture the stuff between:
Re: (\b\w+\b).*(\1)
Result: jack [johnny - mary joe ken johnny] bill (brackets mark the matched text)
This example allows some whitespace between names/words:
Re: (\b\w+.*\b).*(\1)
String: Jackie Chan fought The Dragon who was fighting Jackie Chan
Result: Jackie Chan Jackie Chan
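For the follow-up about a minimum length, one option is to put a quantifier on the captured word. Below is a minimal sketch in Python rather than in XYplorer's own regex flavor (which I have not tested); the function name has_repeated_chunk and its parameters are made up for illustration. With whole_words=True the pattern becomes \b(\w{5,})\b.*\b\1\b, so only whole words of at least min_len characters count.

```python
import re


def has_repeated_chunk(name: str, min_len: int = 5, whole_words: bool = True) -> bool:
    """Return True if a run of at least min_len word characters occurs twice in name."""
    # Builds e.g. (\w{5,}) for min_len=5; the doubled braces are f-string escapes.
    chunk = rf"(\w{{{min_len},}})"
    if whole_words:
        # \b on both sides forces the repeated text to be a complete word.
        pattern = rf"\b{chunk}\b.*\b\1\b"
    else:
        # Without \b, any run of word characters may repeat (e.g. ABpoiuyAB).
        pattern = rf"{chunk}.*\1"
    return re.search(pattern, name) is not None


print(has_repeated_chunk("jack john peter paul mary johnnie.jpg"))     # False
print(has_repeated_chunk("jack johnnie peter paul mary johnnie.jpg"))  # True
```

If you drop the \w restriction and use (.{5,}).*\1 instead, spaces count as part of the chunk, so the first filename would match as well (" john" occurs twice). That is probably not what you want.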

Use perl:
#!/usr/bin/perl
use strict;
use warnings;
while ( my $line = <STDIN> ) {
    chomp $line;
    # Split the line into whitespace-separated words
    my @words = split ( /\s+/, $line );
    my %seen;
    foreach my $word ( @words ) {
        # Report the line as soon as any word shows up a second time
        if ( $seen{$word} ) { print "Match: $line\n"; last }
        $seen{$word}++;
    }
}
And yes, it's not as neat as a one line regexp, but it's also hopefully a bit clearer what's going on.

Related

Using REGEX to remove duplicates when entire line is not a duplicate

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using a regex, but it requires the entire line to be a duplicate.
However, what would I use if I want to detect and remove dups when the line as a whole is not a dup, but just the first X characters are?
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group for checking eg the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Here {n} specifies the number of characters from the start of the line that should be checked for duplicates.
The whole line is captured in $1, which is also used as the replacement.
The second group is used to check whether the following lines start with the same characters.
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line, if captured $1 is ahead in the next line.
To also remove duplicate lines that are shorter than 10 characters, here is a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with an empty string.
If your regex flavor supports possessive quantifiers, using .*+ will improve performance.
Be aware that all these patterns (and your current regex) only target consecutive duplicate lines.
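To make the "consecutive only" caveat concrete, here is a rough Python sketch. The first function applies the two-group pattern from above with the multiline flag; the second is a non-regex alternative that also catches duplicates that are not adjacent. Both function names are hypothetical, and Python's re module stands in for whatever editor or flavor you are actually using.

```python
import re


def drop_consecutive_prefix_dups(text: str, n: int = 10) -> str:
    """Keep the first line of each run of consecutive lines sharing their first n characters."""
    # Same idea as ^((.{10}).*)(?:\r?\n\2.*)+ with n substituted for 10.
    pattern = re.compile(rf"^((.{{{n}}}).*)(?:\r?\n\2.*)+", re.MULTILINE)
    return pattern.sub(r"\1", text)


def drop_all_prefix_dups(text: str, n: int = 10) -> str:
    """Non-regex variant: keeps the first line per prefix, even if duplicates are far apart."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line[:n]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    "12345 Dennis Yancey University of Miami",
    "12345 Dennis Yancey University of Milan",
    "12345 Dennis Yancey University of Rome",
    "12344 Ryan Gardner University of Spain",
    "12347 Smith John University of Canada",
])
print(drop_consecutive_prefix_dups(sample))
```

On the sample from the question both functions give the same three lines; they only differ when a duplicate prefix shows up again after some other line.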

Regex - get string after full date and before standard text

I'm stuck on another regex. I'm extracting email data. In the below example, only the time, date and message in quotes changes.
Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.
Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.
How can I get the message only if I need to start with the positive lookbehind, (?<=Message Received ) to begin searching at this particular point of the data? The message will always start and end with quotes but the user is able to insert their own quotes as in the second example.
You can just use a negated character class in a capturing group:
/Message Received.*?"([^\n]+)"/
Snippet:
$input = 'Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.
Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.';
// Group 1 holds everything between the first and the last double quote on each line
preg_match_all('/Message Received.*?"([^\n]+)"/', $input, $matches);
foreach ($matches[1] as $match) {
    echo $match . "\r\n";
}
Output:
> Hello. My name is John Smith
> Hello. My name is "John Smith"
For extracting message in between double quotes.
(?=Message Received)[^\"]+\K\"[\w\s\"\.]+\"
You can capture the message in a group:
(?<=Message Received)[^"]*(.*)(?=\s+Some standard text)
Two of the other three answers posted on this page give an incorrect result, and none of them is as efficient as it could be.
To correctly extract the substring between the outer double quotes, use one of the following patterns:
/Message Received[^"]+"\K[^\n]+(?=")/ (no capture group, takes 132 steps)
/Message Received[^"]+"([^\n]+)"/ (capture group, takes 130 steps)
Both patterns provide maximum accuracy and efficiency using negated character classes leading up to and including the targeted substring. The first pattern reduces preg_match_all()'s output array bloat by 50% by using \K instead of a capture group. For these reasons, one of these patterns should be used in your project. As your input string increases in size, my patterns provide increasingly better performance versus the other posted patterns.
PHP Implementation:
$in represents your input string.
Pattern #1 Method:
var_export(preg_match_all('/Message Received[^"]+"\K[^\n]+(?=")/',$in,$out)?$out[0]:[]);
// notice the output array only has elements in the fullstring subarray [0]
Output:
array (
0 => 'Hello. My name is John Smith',
1 => 'Hello. My name is "John Smith"',
)
Pattern #2 Method:
var_export(preg_match_all('/Message Received[^"]+"([^\n]+)"/',$in,$out)?$out[1]:[]);
// notice because a capture group is used, [0] subarray is ignored, [1] is used
Output:
array (
0 => 'Hello. My name is John Smith',
1 => 'Hello. My name is "John Smith"',
)
Both methods provide the desired output.
Anirudha's incorrect pattern: /(?<=Message Received)[^"]*(.*)(?=\s+Some standard text)/ (345 steps + a capture group + includes the unwanted outer double quotes)
Josh Crozier's pattern: /Message Received.*?"([^\n]+)"/ (174 steps + a capture group)
Sahil Gulati's incorrect pattern: /(?=Message Received)[^\"]+\K\"[\w\s\"\.]+\"/ (109 steps + includes the unwanted outer double quotes + unnecessarily escapes characters in the pattern)
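If you are not in PHP, the capture-group variant ports most easily. As an illustration only, here is the same pattern in Python; the standard re module has no \K, so the \K variant is left out. The name extract_messages is made up for this sketch.

```python
import re

# [^"]+ skips the time and date, then group 1 greedily takes the rest of the
# line and backtracks to the last double quote, so inner quotes are kept.
MESSAGE_RE = re.compile(r'Message Received[^"]+"([^\n]+)"')


def extract_messages(text: str) -> list[str]:
    """Return the text between the outer double quotes of each message line."""
    return MESSAGE_RE.findall(text)


data = (
    'Message Received 6:06pm 21st February "Hello. My name is John Smith" Some standard text.\n'
    'Message Received 8:08pm 22nd February "Hello. My name is "John Smith"" Some standard text.'
)
for message in extract_messages(data):
    print(message)
```

This relies on the closing quote being the last double quote on the line, which holds as long as the standard trailing text never contains one.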

Perl matching multiple capitalized words

I'm doing a perl program (script?) that reads through a text file and identifies all names and categorizes them as either person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp. where there are multiple capitalized words in a row. I've been using:
/([A-Z][a-z]+)+/
to capture as many capitalized words in a row as there are on a given line. From what I understand the + will match 1 or more instances of such pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeat the [A-Z][a-z]+ twice, but then it doesn't find patterns with more than 2 capitalized words in a row. What am I doing wrong?
PS Sorry if my use of vocabulary is off I'm always so bad with that.
You were just missing the spacing between words.
The following matches whitespace before each word, except the first, so covers the cases you've described:
use strict;
use warnings;
while (<DATA>) {
    while (/(?=\w)((?:\s*[A-Z][a-z]+)+)/g) {
        print "$1\n";
    }
}
__DATA__
I'm doing a perl program (script?) that reads through a text file and identifies all names and categorizes them as either person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp. where there are multiple capitalized words in a row. I've been using:
to capture as many capitalized words in a row as there are on a given line. From what I understand the + will match 1 or more instances of such pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeate the [A-Z][a-z]+ twice but it doesn't find patterns with more than 2 capitalized words in a row. What am I doing wrong?
PS Sorry if my use of vocabulary is off I'm always so bad with that.
Outputs:
New York
Pacific First Financial Corp
From
New
New York
For New York
What
Sorry
There's a CPAN module called Lingua::EN::NamedEntity which seems to do what you want. Might be worth taking a quick look at it.
The How
The pattern in your question, /([A-Z][a-z]+)+/, matches one or more capitalised words written with nothing between them, like this
This
ThisAndThat
but it won't match this
Not This
It actually matches each of these individually
Not
This
So let's modify the regex to /(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/. That is a bit of a mouthful, so let's break it down a piece at a time:
(?: ... )         Groups like this don't capture, which is more efficient
[A-Z][a-z]+       Matches a capitalised word
\s*[A-Z][a-z]+    Matches a subsequent capitalised word, optionally preceded by whitespace
The What - TL;DR
Put this all together and we now have a regex that matches a capitalised word, then any subsequent ones with or without whitespace separation. So it matches
This
ThisAndThat
Not This
We can now abstract this regex a bit to avoid repetition and use it in code like so
my $CAPS_WORD = qr/[A-Z][a-z]+/;
my $FULL_RE = qr/(?:$CAPS_WORD)(?:\s*$CAPS_WORD)*/;
$string =~ /$FULL_RE/;
say $&;
The Why
This answer gives an alternative to the already great one given by @Miller. Both will work fine, but this solution is quite a bit faster since it doesn't use a lookahead: in the benchmark below, the simple version is faster than the lookahead version by a factor of about 7.
$ time ./bench-simple.pl
Running 100000 runs
800000 matches
real 0m2.869s
user 0m2.860s
sys 0m0.008s
$ time ./bench-lookahead.pl
Running 100000 runs
800000 matches
real 0m19.845s
user 0m19.831s
sys 0m0.012s
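For comparison, the same two-part pattern can be sketched in Python. This is not the benchmarked Perl code, just an illustration of how the regex is composed from one reusable piece; capitalised_runs is a hypothetical helper name.

```python
import re

# One capitalised word, reused twice in the full pattern below.
CAPS_WORD = r"[A-Z][a-z]+"
# A capitalised word, then any number of further ones with optional whitespace.
FULL_RE = re.compile(rf"(?:{CAPS_WORD})(?:\s*{CAPS_WORD})*")


def capitalised_runs(text: str) -> list[str]:
    """Return every run of consecutive capitalised words found in text."""
    # findall returns whole matches here because all groups are non-capturing.
    return FULL_RE.findall(text)


print(capitalised_runs("things like New York or Pacific First Financial Corp. where"))
```

Note that \s* also matches newlines, so when the whole file is searched at once a run can continue across a line break; matching line by line, as the Perl loop does, avoids that.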

Regex for single space

I'm trying to match a file which is delimited by multiple spaces. The problem I have is that the first field can contain a single space. How can I match this with a regex?
Eg:
Name           Other Data    Other Data 2
Bob Smith      XX1           0101010101
John Doe       XX2           0101010101
Bob Doe        XX3           0101010101
John Smith     XX4           0101010101
Can I split these lines into three fields with a regex, splitting by a space but allowing for the single space in the first field?
Hi, the following regex should work:
(\w*\s\w*)\s+\w{2}\d\s+\d*
This would work:
Pattern:
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Replacement:
+$1+ -$2- *$3*
$1 contains the first column, $2 the second and $3 the third one.
Example:
http://regexr.com?32tbt
You could split at two or more spaces:
[ ]{2,}
But you are probably better off determining the lengths of the captures of this regular expression:
(Name[ ]+)(Other Data[ ]+)
Then use a simple substring method that slices your lines into portions of those lengths.
So in your case the first capture would be 15 characters long, the second 14, and the third column would have 13 (but the last one doesn't really matter, which is why it isn't actually captured). Then you take the first 15, the next 14 and the remaining characters of every line and trim each one (remove trailing whitespace).
I think the simplest is to use a regex that matches two or more spaces.
/  +/
Which breaks down as: a delimiter (/), followed by a space, followed by another space one or more times ( +), followed by the end delimiter (/ in my example, but this is language specific).
So simply put, use regex to match space, then one or more spaces as a means to split your string.
Usually, with this kind of file, the best approach is to get a substring based on where your required information is and then trim it. I see your file contains 16 chars before the second field, you can get a substring of length 16 from the beginning which will contain your desired text. You should trim it to get only the text you need without the spaces.
If the spacing pattern you posted is consistent (if it won't change among different files of this kind) you have also another problem: what happens to longer names?
Name           Other Data
Johnny AppleseeXX1
TutankamonfirstXX2
If you really want to use a regex, be sure to account for those corner cases.
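To see the difference between the two approaches, here is a small Python sketch. It assumes the column widths of 15 and 14 characters mentioned in one of the answers; the constants and function names are made up, so check the widths against your real file.

```python
import re

NAME_WIDTH = 15   # assumed width of the first column
OTHER_WIDTH = 14  # assumed width of the second column


def split_on_gaps(line: str) -> list[str]:
    """Split on runs of two or more spaces, so a single space inside a name survives."""
    return re.split(r" {2,}", line.rstrip())


def split_fixed_width(line: str) -> list[str]:
    """Slice by column position and trim the padding from each field."""
    second_end = NAME_WIDTH + OTHER_WIDTH
    return [
        line[:NAME_WIDTH].strip(),
        line[NAME_WIDTH:second_end].strip(),
        line[second_end:].strip(),
    ]


# Rows are built with ljust so the padding is exact.
normal = "Bob Smith".ljust(NAME_WIDTH) + "XX1".ljust(OTHER_WIDTH) + "0101010101"
tight = "Johnny Applesee" + "XX1".ljust(OTHER_WIDTH) + "0101010101"

print(split_on_gaps(normal))
print(split_fixed_width(normal))
print(split_on_gaps(tight))
print(split_fixed_width(tight))
```

For the normal row both functions return the same three fields. For the row where the name fills its whole column, splitting on gaps glues the name to XX1, while the fixed-width slices still separate them.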

How can I access capture buffers in brackets with quantifiers?

How can I access capture buffers in brackets with quantifiers?
#!/usr/local/bin/perl
use warnings;
use 5.014;
my $string = '12 34 56 78 90';
say $string =~ s/(?:(\S+)\s){2}/$1,$2,/r;
# Use of uninitialized value $2 in concatenation (.) or string at ./so.pl line 7.
# 34,,56 78 90
With @LAST_MATCH_START and @LAST_MATCH_END it works*, but the line gets too long.
Doesn't work, look at TLP's answer.
*The proof of the pudding is in the eating isn't always right.
say $string =~ s/(?:(\S+)\s){2}/substr( $string, $-[0], length($-[0]-$+[0]) ) . ',' . substr( $string, $-[1], length($-[1]-$+[1]) ) . ','/re;
# 12,34,56 78 90
You can't access all previous values of the first capturing group; only the last value (the one that is current when the match ends) is saved in $1, unless you want to use a (?{ code }) hack.
For your example you could use something like:
s/(\S+)\s+(\S+)\s+/$1,$2,/
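The "only the last value" behaviour is not specific to Perl. As a quick illustration, here is the same experiment in Python's re module; the helper names are invented for this sketch.

```python
import re


def last_capture(text: str) -> str:
    """Show what a quantified group holds after two iterations."""
    match = re.match(r"(?:(\S+)\s){2}", text)
    # There is only one capture group, and it keeps the value from the final iteration.
    return match.group(1)


def join_first_two(text: str) -> str:
    """Write out both fields as separate groups so each gets its own number."""
    return re.sub(r"(\S+)\s+(\S+)\s+", r"\1,\2,", text, count=1)


string = "12 34 56 78 90"
print(last_capture(string))    # 34
print(join_first_two(string))  # 12,34,56 78 90
```

So if you need every value, spell the groups out one by one as in the substitution above, or split the string instead of relying on a repeated group.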
The statement that you say "works" has a bug in it.
length($-[0]-$+[0])
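This always returns the string length of a negative number, not the length of your match. The numbers $-[0] and $+[0] are the offsets of the start and end of the whole match in the string, respectively ($-[1] and $+[1] are the same for the first capture group). Since the start offset is smaller than the end offset, start minus end is negative: here it is -6 for the whole match "12 34 " and -2 for the group "34". length() of either number is 2, because the minus sign counts as a character.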
So, what you are doing is taking the first two characters of the match 12 34, and the first two characters of the match 34 and concatenating them with a comma in the middle. It works by coincidence, not because of capture groups.
It sounds as though you are asking us to solve the problems you have with your solution, rather than asking us about the main problem.