regular expression - match word only once in line - regex

Case:
ehello goodbye hellot hello goodbye
ehello goodbye hello hello goodbye
I want to match line 1 (only has 'hello' once!)
DO NOT want to match line 2 (contains 'hello' more than once)
I tried negative lookahead, lookbehind and whatnot, without any real success.

A simple option is this (using the multiline flag and not dot-all):
^(?!.*\bhello\b.*\bhello\b).*\bhello\b.*$
First, check you don't have 'hello' twice, and then check you have it at least once.
There are other ways to check for the same thing, but I think this one is pretty simple.
Of course, you can simply match \bhello\b and count the number of matches.
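For what it's worth, here is a minimal sketch of both ideas in Python (my choice of language for the illustration; the function names are made up). It uses the two sample lines from the question.

```python
import re

# Pattern from the answer: reject lines with two whole-word "hello"s,
# then require at least one. re.MULTILINE makes ^ and $ work per line.
ONCE = re.compile(r"^(?!.*\bhello\b.*\bhello\b).*\bhello\b.*$", re.MULTILINE)

def lines_with_single_hello(text):
    """Return the lines of text that contain the word 'hello' exactly once."""
    return [m.group(0) for m in ONCE.finditer(text)]

def count_hello(line):
    """Alternative: count whole-word matches and compare the count to 1."""
    return len(re.findall(r"\bhello\b", line))

text = (
    "ehello goodbye hellot hello goodbye\n"
    "ehello goodbye hello hello goodbye"
)
print(lines_with_single_hello(text))
print([count_hello(line) for line in text.splitlines()])
```

Only the first line should come back from the regex, and the counts should be 1 and 2, because "ehello" and "hellot" are not whole-word matches.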

A generic regex would be:
^(?:\b(\w+)\b\W*(?!.*?\b\1\b))*\z
Although it could be cleaner to invert the result of this match:
\b(\w+)\b(?=.*?\b\1\b)
This works by matching a word and capturing it, then using a lookahead with a backreference to check that the same word does (or does not) appear again later in the string.
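A rough Python sketch of the inverted check, assuming the second pattern above is used as-is (the helper names are hypothetical). Note that the generic pattern flags any repeated word, not just "hello".

```python
import re

# A word that is followed, somewhere later in the string, by the same word.
REPEATED = re.compile(r"\b(\w+)\b(?=.*?\b\1\b)")

def first_repeated_word(s):
    """Return the first word that occurs again later in s, or None."""
    m = REPEATED.search(s)
    return m.group(1) if m else None

def all_words_unique(s):
    """Inverted result: True when no word appears more than once."""
    return first_repeated_word(s) is None

print(first_repeated_word("ehello goodbye hello hello goodbye"))
print(all_words_unique("one two three"))
```

On the question's second sample line the first repeated word found is "goodbye", since it is checked before "hello" is reached.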

Since you're only worried about words (i.e. tokens separated by whitespace), you can just split on whitespace and see how often "hello" appears. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $a1="ehello goodbye hellot hello goodbye";
my $a2="ehello goodbye hello hello goodbye";
my @arr1=split(/\s+/,$a1);
my @arr2=split(/\s+/,$a2);
#grab the number of times that "hello" appears
my $num_hello1=scalar(grep{$_ eq "hello"}@arr1);
my $num_hello2=scalar(grep{$_ eq "hello"}@arr2);
print "$num_hello1, $num_hello2\n";
The output is
1, 2

Related

Matching Uppercase words

How can I easily match the following words using a regular expression in Perl?
Example
AFSAS245F gdsgasdg (agadsg,asdgasdg, .ASFH(gasdgsadg) )
ASG23XLG hasdg (dagad, SgAdsga, .FG(haha))
Expected output :-
[Match First uppercase words only]
AFSAS245F
ASG23XLG
print "$1\n" if /^([A-Z0-9]+)\s+.*\(/;
This prints only the first word (followed by a newline character) if the line starts with that word, followed by whitespace and a ( somewhere after it.
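The same pattern can be tried outside Perl. Below is a small Python sketch of it; the function name and the third test line (one without a parenthesis) are my own additions for illustration.

```python
import re

# Same idea as the Perl one-liner: a leading run of uppercase letters and
# digits, then whitespace, then a "(" somewhere later on the line.
FIRST_WORD = re.compile(r"^([A-Z0-9]+)\s+.*\(")

def first_upper_word(line):
    """Return the leading uppercase/digit word, or None if the line is skipped."""
    m = FIRST_WORD.match(line)
    return m.group(1) if m else None

lines = [
    "AFSAS245F gdsgasdg (agadsg,asdgasdg, .ASFH(gasdgsadg) )",
    "ASG23XLG hasdg (dagad, SgAdsga, .FG(haha))",
    "NOPAREN hasdg dagad",  # made-up line without "(": should be skipped
]
for line in lines:
    word = first_upper_word(line)
    if word is not None:
        print(word)
```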
This isn't an answer to your question (so I'm fine with it being deleted if people think that's appropriate), but I thought it would be useful to show you how this question should have been asked.
I'm trying to filter data out of the input below. I'm trying to extract the first whitespace-delimited word that consists solely of uppercase letters and digits. In addition, I need to ignore lines that don't contain (.
Here's a test program.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
# This is where I need help. This regex obviously doesn't work
print if /[A-Z]\s+/;
}
__DATA__
AFSAS245F gdsgasdg (agadsg,asdgasdg, .ASFH(gasdgsadg) )
ASG23XLG hasdg (dagad, SgAdsga, .FG(haha))
The output I'm expecting from this is:
AFSAS245F
ASG23XLG
(It's also worth pointing out that this isn't particularly good test data. You should include a line that doesn't have ( in it - as that tests an important part of the requirements.)

Getting multiple matches within a string using regex in Perl

After having read this similar question and having tried my code several times, I keep on getting the same undesired output.
Let's assume the string I'm searching is "I saw wilma yesterday".
The regex should capture each word followed by an 'a' and its optional 5 following characters or spaces.
The code I wrote is the following:
$_ = "I saw wilma yesterday";
if (@m = /(\w+)a(.{5,})?/g){
print "found " . @m . " matches\n";
foreach(@m){
print "\t\"$_\"\n";
}
}
However, I kept on getting the following output:
found 2 matches
"s"
"w wilma yesterday"
while I expected to get the following one:
found 3 matches:
"saw wil"
"wilma yest"
"yesterday"
until I realized that the values returned in @m were $1 and $2, as you can see.
Now, since the /g flag is on, and I don't think the problem is about the regex, how could I get the desired output?
You can try this pattern that allows overlapped results:
(?=\b(\w+a.{1,5}))
or
(?=(?i)\b([a-z]+a.{0,5}))
example:
use strict;
my $str = "I saw wilma yesterday";
my @matches = ($str =~ /(?=\b([a-z]+a.{0,5}))/gi);
print join("\n", @matches),"\n";
more explanations:
You can't get overlapping results from a regex directly: once a character has been consumed ("eaten") by the regex engine, it can't be consumed a second time. The trick to get around this constraint is to use a lookahead (a construct that only checks and does not consume characters), which can run over the same part of the string several times, and to put a capturing group inside it.
For another example of this behaviour, you can try the example code without the word boundary (\b) to see the result.
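The lookahead trick is not specific to Perl. As a quick sketch, here is the second pattern run through Python's re.findall, which returns the contents of the capturing group for each zero-width match:

```python
import re

s = "I saw wilma yesterday"
# The lookahead consumes nothing, so the scan can restart one character later
# and the captured text of different matches may overlap.
matches = re.findall(r"(?=\b([a-z]+a.{0,5}))", s, flags=re.IGNORECASE)
print(matches)
```

This should print the three overlapping pieces the asker expected: "saw wil", "wilma yest" and "yesterday".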
Firstly you want to capture everything inside the expression, i.e.:
/(\w+a(?:.{5,})?)/
Next you want to start your search from one character past where the last expression's first character matched.
The pos() function allows you to specify where a /g regex starts its search from.
$s = "I saw wilma yesterday";
while ($s =~ /(\w+a(.{0,5}))/g){
print "\t\"$1\"\n";
pos($s) = pos($s) - length($2);
}
Gives you:
"saw wil"
"wilma yest"
"yesterday"
But I don't know why you should get day and not yesterday.

regex for n characters or at least m characters

This should be a pretty simple regex question but I couldn't find any answers anywhere. How would one make a regex that matches either exactly 2 characters or at least 4 characters? Here is my current method of doing it (ignore the regex itself, that's beside the point):
[A-Za-z0_9_]{2}|[A-Za-z0_9_]{4,}
However, this method takes twice the time (and is approximately 0.3s slower for me on a 400 line file), so I was wondering if there was a better way to do it?
Optimize the beginning, and anchor it.
^[A-Za-z0-9_]{2}(?:|[A-Za-z0-9_]{2,})$
(Also, you did say to ignore the regex itself, but I guessed you probably wanted 0-9, not 0_9)
EDIT Hm, I was sure I read that you want to match lines. Remove the anchors (^$) if you want to match inside the line as well. If you do match full lines only, anchors will speed you up (well, the front anchor ^ will, at least).
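To check the anchored pattern against a few inputs, a small Python sketch (the names are made up, and it assumes each input is a single line with no trailing newline):

```python
import re

# Two characters, then either nothing or at least two more.
WORD_2_OR_4PLUS = re.compile(r"^[A-Za-z0-9_]{2}(?:|[A-Za-z0-9_]{2,})$")

def is_2_or_at_least_4(s):
    """True for strings of exactly 2, or 4 or more, word characters."""
    return WORD_2_OR_4PLUS.match(s) is not None

for s in ["a", "ab", "abc", "abcd", "abcdefg"]:
    print(s, is_2_or_at_least_4(s))
```

Lengths 1 and 3 are rejected; lengths 2, 4 and 7 are accepted.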
Your solution looks pretty good. As an alternative, you can try something like this:
[A-Za-z0-9_]{2}(?:[A-Za-z0-9_]{2,})?
Btw, I think you want hyphen instead of underscore between 0 and 9, don't you?
The solution you present is correct.
If you're trying to optimize the routine, and strings of 2 or more characters are far less common than strings that are shorter, consider accepting all strings of length 2 or greater, then discarding those of length 3. This may boost performance because the regex is checked only once, and the second check need not even be a regular expression; checking a string's length is usually an extremely fast operation.
As always, you really need to run tests on real-world data to verify if this would give you a speed increase.
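A minimal sketch of that idea in Python, assuming the character class with 0-9 is what was intended (the function name is hypothetical). Whether it is actually faster would have to be measured on real data.

```python
import re

# One pass with a single quantifier; no alternation needed.
TWO_OR_MORE = re.compile(r"[A-Za-z0-9_]{2,}")

def matches_2_or_4plus(text):
    """Find runs of 2+ characters with one regex pass, then drop length 3."""
    return [w for w in TWO_OR_MORE.findall(text) if len(w) != 3]

print(matches_2_or_4plus("ab abc bcad a as asdfa"))
```

Here "a" never matches the regex and "abc" is thrown away by the length check.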
So basically you want to match words of length either 2 or 2+2+N, N>=0:
([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)
working example:
#!/usr/bin/perl
while (<STDIN>)
{
chomp;
my @matches = ($_=~/([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)/g);
for my $m (@matches) {
print "match: $m\n";
}
}
input file:
cat in.txt
ab abc bcad a as asdfa
aboioioi i i abc bcad a as asdfa
output:
perl t.pl <in.txt
match: ab
match: ab
match: bcad
match: as
match: asdf
match: aboioioi
match: ab
match: bcad
match: as
match: asdf

Using a substring twice in regex

First, this question may have been asked before, but I'm not sure what phrase to search on.
I have a string:
Maaaa
I have a pattern:
aaa
I would like to match twice, giving me starting indices of 1 and 2. But of course I only get a single match (start index 1), because the regex engine gobbles up all 3 "a"s and can't use them again, leaving me with 1 "a" which doesn't match.
How do I solve this?
Thanks!
You could use a lookahead assertion to find an a followed by two more a's:
a(?=aa)
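As a quick check of this pattern, here is a short Python sketch that collects the start index of every match in the string from the question:

```python
import re

text = "Maaaa"
# a(?=aa) consumes only one "a"; the other two are merely checked,
# so they stay available for the next match.
starts = [m.start() for m in re.finditer(r"a(?=aa)", text)]
print(starts)
```

This gives the start indices 1 and 2 that the question asks for.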
The perlre manpage suggests:
my @a;
"Maaaa" =~ /aaa(?{push @a,$&})(*FAIL)/;
print join "\n",@a;
print "\n";
which yields
aaa
aaa

Regex line by line: How to match triple quotes but not double quotes

I need to check to see if a string of many words / letters / etc, contains only 1 set of triple double-quotes (i.e. """), but can also contain single double-quotes (") and double double-quotes (""), using a regex. Haven't had much success thus far.
A regex with negative lookahead can do it:
(?!.*"{3}.*"{3}).*"{3}.*
I tried it with these lines of Java code:
String good = "hello \"\"\" hello \"\" hello ";
String bad = "hello \"\"\" hello \"\"\" hello ";
String regex = "(?!.*\"{3}.*\"{3}).*\"{3}.*";
System.out.println( good.matches( regex ) );
System.out.println( bad.matches( regex ) );
...with output:
true
false
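For comparison, a rough Python version of the same test (the function name is made up). Java's String.matches has to match the whole string, so the sketch uses re.fullmatch to get the same behaviour.

```python
import re

# No two separate triple quotes anywhere, but at least one triple quote.
ONE_TRIPLE = re.compile(r'(?!.*"{3}.*"{3}).*"{3}.*')

def has_exactly_one_triple_quote(s):
    """True if s contains one set of triple double-quotes and not two."""
    return ONE_TRIPLE.fullmatch(s) is not None

good = 'hello """ hello "" hello '
bad = 'hello """ hello """ hello '
print(has_exactly_one_triple_quote(good))
print(has_exactly_one_triple_quote(bad))
```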
Try using the number of occurrences operator to match exactly three double-quotes.
\"{3}
["]{3}
[\"]{3}
I've quickly checked using http://www.regextester.com/, seems to work fine.
How you correctly compile the regex in your language of choice may vary, though!
Depends on your language, but you should only need to match for three double quotes (e.g., /\"{3}/) and then count the matches to see if there is exactly one.
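A small sketch of the counting approach in Python (the function name is hypothetical). re.findall returns non-overlapping matches, so each run of three quotes is counted once.

```python
import re

def count_triple_quotes(s):
    """Count non-overlapping occurrences of three double quotes in a row."""
    return len(re.findall(r'"{3}', s))

print(count_triple_quotes('hello """ hello "" hello') == 1)
print(count_triple_quotes('hello """ hello """ hello') == 1)
```

The first string passes the "exactly one" check and the second does not.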
There are probably plenty of ways to do this, but a simple one is to merely look for multiple occurrences of triple quotes then invert the regular expression. Here's an example from Perl:
use strict;
use warnings;
my $match = 'hello """ hello "" hello';
my $no_match = 'hello """ hello """ hello';
my $regex = '[\"]{3}.*?[\"]{3}';
if ($match !~ /$regex/) {
print "Matched as it should!\n";
}
if ($no_match !~ /$regex/) {
print "You shouldn't see this!\n";
}
Which outputs:
Matched as it should!
Basically, you are telling it to find the thing you DON'T want, then inverting the truth. Hope that makes sense. I can help you convert the example to another language if you need it.
This may be a good start for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
See it in action at regex101.com.