Perl - Find same sequence of characters in strings - regex

I have a text file with one sentence per line, and for each sentence I need to find the sequences of characters that occur more than once. For instance, one of the sentences is
no pain no gain
and I want to be able to determine that the repeated sequence of characters in this string is ain.
I tried regular expressions (found on Stack Overflow, by the way), but they only find runs of the same consecutive character, which is not what I'm looking for. As a beginner in Perl, I don't know how to implement this.
Thank you in advance for your time and attention.
Edit: here is what I've tried, which does not do what I want:
#!/usr/bin/perl
use utf8;
open $file, "<:encoding(utf8)", "text.txt";
while($ligne=<$file>)
{
while($ligne =~ /(.)\1+/g)
{
$gram = $1;
print "$ligne\n";
print "$gram\n";
}
}

This is a simple proof of concept that matches the ain of "pain" and then looks for that same text later on in the string, which it then finds in "gain". I'm using a named capture group, 'Match', together with the backreference \k<Match>, which is how the regex matches ain (or no).
#!/usr/bin/perl
use strict;
use warnings;
my $string = "no pain no gain";
if ($string =~ m/(?<Match>[a-zA-Z]{3}).*\k<Match>/g) {
print "Match: $+{Match}\n";
}
Output:
Match: ain
Note that if you change the length specifier to 2, the match becomes "no", rather than "ain".
Implement a more robust regex for whatever your actual needs are, then iterate over every line you have and test for a match.
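One way to avoid hard-coding the length is sketched below. It is only an illustration: the helper name longest_repeat and the minimum length of 2 are assumptions, not part of the answer above. A lookahead wraps the capture so that every start position in the string is tried, and the longest captured text that occurs again later is kept.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the longest run of word characters
# (at least 2 long) that occurs again later in the same string,
# or the empty string if there is none.
sub longest_repeat {
    my ($s) = @_;
    my $best = '';
    # The lookahead consumes nothing, so every start position is tried.
    while ($s =~ /(?=(\w{2,}).*\1)/g) {
        $best = $1 if length($1) > length($best);
    }
    return $best;
}

my $line = 'no pain no gain';
print "Longest repeat: ", longest_repeat($line), "\n";   # prints "Longest repeat: ain"
```

To process a file, the same sub could be called on each line inside a while (my $line = <$fh>) loop after chomp.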
By the way, regex101.com is an amazing resource for learning and practicing regular expressions. I recommend it 10000%.

Related

Regex - skip over expressions and parse the rest

I use regular expressions for sorting data into groups. The lines look somewhat like:
testword test
test testword
tes.w. tes.
tes tes.w.
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
The word test is to be found as well as othertexttobefound and sometexttobefound.
Now I am trying to tell my parser that it is supposed to plainly ignore testword and its derivatives while searching and focus on the rest of my data entries. The "good words" and the "bad words" can be anywhere in each line.
I have tried [^w] which is fine for the beginning of strings, but in my versions not for the other cases. Also (?:w) didn't do the trick. I cannot use lookarounds as these would keep the whole line from being detected.
After long searches on the internet I am hoping for help here!
After much appreciated help from Naxos84, I am adding some German real life examples:
sozialabgabe sozialarbeiter
soz.abg. sozialarbeiter
sozarbeiter soz.abg.
sozialarbeiter otherirrelevantstuff
otherirrelevantstuff soz abg
otherirrelevantstuff sozabg
otherirrelevantstuff sozialabgabe
If I search with:
sozial["^\ab"]|soz["^\ab"]|sometexttobefound|othertexttobefound
Lines 6 and 7 get marked as well, but I don't want those.
What am I doing wrong?
A link:
regexr
To find all the matches you want (any occurrence of "test", "sometexttobefound" and "othertexttobefound"), you can try the following regex:
test[^\w]|sometexttobefound|othertexttobefound
This regex means:
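Find every "test" that is followed by a non-word character, OR sometexttobefound, OR othertexttobefound. (Because [^\w] has to match a character, a "test" at the very end of the input is not found.)
I tried this regex with the following text (I added a few 'test's)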
testword test
test testword
tes.w. testtes.
tes tes.w. test
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
at regexr (when using the global flag)
If you also want to find things like "tes", I guess you should add it (I'm not a regex expert), like:
test[^\w]|tes[^\w]|sometexttobefound|othertexttobefound
If you want to get all words from the text except from some special words, you could use:
@words = grep { $_ ne 'testword' } split /\P{L}+/, $str;
(if $str is your complete string)
See the Perl docs for \P{...}. Instead of \P{L}, you could also use \W, but that is locale-dependent.
But if you need to use regexps only, then you could use
@words = $str =~ /\b(?!testword)\p{L}+\b/g;
But \b is locale-dependent as well, so you might want to use \b{...} or rebuild the word boundary matches with \p{L}:
@words = $str =~ /
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
(?!testword)\p{L}+
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
/gx;
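A runnable sketch of the split-and-grep variant follows. The helper name words_except and the skip list are assumptions made for illustration; the sample line is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of non-letters and drop
# empty fields and every word listed in @skip.
sub words_except {
    my ($str, @skip) = @_;
    my %skip = map { $_ => 1 } @skip;
    return grep { length($_) && !$skip{$_} } split /\P{L}+/, $str;
}

my @words = words_except(
    'sometexttobefound testword somemoretextwhichdoesnotmatter',
    'testword',
);
print "@words\n";   # prints "sometexttobefound somemoretextwhichdoesnotmatter"
```

A hash lookup is used instead of a single ne comparison so that several "bad words" can be excluded at once.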

Pre-compiled regex with special characters matching

I'm trying to match if a word such as *FOO (* as a normal character) is in a line. My input is a C++ source code. I need to use a pre-compiled regex for this due to program flow requirements, so I tried the following:
$pattern = qr/[^a-zA-Z](\*FOO)[^a-zA-Z]|^\s*(\*FOO)[^a-zA-Z]/;
And I use it like this:
if ($line =~ m/$pattern/) { ... }
It works and catches lines containing *FOO such as hey *FOO.BAR but also matches lines such as:
//FOO programming using stuff and things
which I want to ignore. What am I missing? Is \* not the right way to escape * in a pre-compiled regex in perl? If *FOO is stored in $word and the pattern looks like this:
$pattern = qr/[^a-zA-Z](\\$word)[^a-zA-Z]|^\s*(\\$word)[^a-zA-Z]/;
Is that different from the previous pattern? Because I tried both and the result seems to be the same.
I found a way to bypass this problem by removing the first char of $word and escaping * in the pattern, but if $word = "**.?FOO" for example, how do I create a qr// with $word so that all the meta-characters are escaped?
You do need to escape the *. One way to do it is by the quotemeta \Q operator:
use warnings;
use strict;
my $qr = qr/\Q*FOO/;
while (<DATA>) { print if /$qr/ }
__DATA__
//FOO programming using stuff and things
hey *FOO.BAR
Note that this escapes all ASCII non-"word" characters through the rest of the pattern. If you need to limit its action to only a part of the pattern then stop it using \E. Please see linked docs.
The above determines whether *FOO is in the line, regardless of whether it is a word or a part of one. It is not clear to me which is needed. Once that is specified the pattern can be adjusted.
Note that /\*FOO/ works, too. What you tried probably failed because of everything else you are trying to match, the purpose of which I do not understand. If you only need to detect whether the pattern is present, the above does it. If there is a more specific requirement, please clarify.
As for the examples: for me the string //FOO... is not matched by the main (first) $pattern you show. In the second one, \\ is an escaped backslash and $word is interpolated as is, so the * acts as a quantifier rather than a literal character. Either way it is much too convoluted. Regexes can really tie one in nasty knots when pushed; I suggest keeping it as simple as possible.
Question 1:
my $word = '*FOO';
my $pattern = qr/\\$word/;
is equivalent to
my $pattern = qr/\\*FOO/; # zero or more '\' followed by 'FOO'
The $word is simply interpolated as is.
To get something equivalent to
my $pattern = qr/\*FOO/;
you should use
my $word = '*FOO';
my $pattern = qr/\Q$word\E/;
By default, an interpolated variable is treated as a mini regular expression: metacharacters in the variable such as *, +, ? are still interpreted as metacharacters. \Q...\E adds a backslash before every character not matching /[A-Za-z_0-9]/, so any metacharacters in the interpolated variable are interpreted as literal characters. Refer to perldoc.
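Putting \Q...\E into the pre-compiled pattern from the question might look like the sketch below. The helper name word_pattern is an assumption, and the two alternatives of the original pattern are folded into (?:^|[^a-zA-Z]), which is a simplification rather than the asker's exact regex.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pre-compiled regex that finds $word
# literally, with no ASCII letter directly before or after it.
sub word_pattern {
    my ($word) = @_;
    return qr/(?:^|[^a-zA-Z])(\Q$word\E)[^a-zA-Z]/;
}

my $pattern = word_pattern('*FOO');
for my $line ('hey *FOO.BAR', '//FOO programming using stuff and things') {
    print "$line: ", ($line =~ $pattern ? 'match' : 'no match'), "\n";
}
```

The first line is reported as a match and the second as no match, because the second line contains no literal * at all. The same sub also accepts a word such as '**.?FOO', since every metacharacter in it is escaped.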
Question 2
I tried
my $pattern = qr/[^a-zA-Z](\*FOO)[^a-zA-Z]|^\s*(\*FOO)[^a-zA-Z]/;
my $line = '//FOO programming using stuff and things';
if($line =~ m/$pattern/){
print "$&\n";
}
else{
print "No match!";
}
and it printed "No match!". I can't explain how you get it matched.

How to limit match length before a certain character?

I am using the following regular expression to scan input text files for valid emails.
[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
Now I also need to limit the matches to at most 20 characters before the '@' sign in the email address, but I am not sure how to do it.
PS. I am using the Perl regular expression library (TPerlRegex) found in Delphi XE2.
Please can you help me?
Since your library is supposed to be Perl-compatible, it should support lookaheads. These are convenient to ensure several "orthogonal" restrictions in the pattern:
(?=[^@]{1,20}@)[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
The lookahead will only match if there is an @ after no more than 20 non-@ characters. However, the lookahead does not actually advance the position of the regex engine in your subject string, so after the condition has been checked, the engine is still at the beginning of the email (or whichever position it is checking at the moment) and will continue with your pattern as previously.
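One caveat worth checking: when scanning free text, the engine may start a match in the middle of a long local part, so the last 20 characters of an over-long address could still match. The sketch below guards against that with a negative lookbehind. It uses a deliberately simplified address pattern, and both the pattern and the helper name short_local_addresses are assumptions for illustration; it also assumes the library supports lookbehind.

```perl
use strict;
use warnings;

# Simplified address pattern, for illustration only.
my $email = qr/
    (?<![\w.+-])             # not preceded by another local-part character
    (?=[^\@\s]{1,20}\@)      # 1 to 20 characters, then the at sign
    [\w.+-]+ \@ [A-Za-z0-9-]+ (?: \. [A-Za-z0-9-]+ )+
/x;

# Hypothetical helper: return every address whose local part
# is at most 20 characters long.
sub short_local_addresses {
    my ($text) = @_;
    return $text =~ /($email)/g;
}

print "$_\n" for short_local_addresses(
    'ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net'
);
# prints "frank@email.net"
```

Without the lookbehind, the first address would be matched starting part-way through its local part instead of being rejected.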
Consider using Email::Address to capture email addresses, and then grepping the results for those having 20 or fewer characters before the @:
use strict;
use warnings;
use Email::Address;
my @addresses;
while ( my $line = <DATA> ) {
    push @addresses, $_
        for grep { /([^@]+)/ and length $1 < 21 }
        Email::Address->parse($line);
}
print "$_\n" for @addresses;
__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net Line noise. test@host.com
Some stuff here... help@perl.org And even more here!
Nothing to see here. 01234567890123456789@numbers.com Nothing to see.
Output:
frank@email.net
test@host.com
help@perl.org
01234567890123456789@numbers.com

Search html file for random string using regex

I am trying to use Perl to search through an html file, looking for a semi-random string and store the match in a variable or print it out.
The string is the name of a jpg image and always follows the pattern of 9 digits followed by 6 lower case letters, i.e.
140005917smpxgj.jpg
But it is random every time. I am sure Perl can do this, but I will admit I am getting a bit confused.
Not too complicated. You may want to watch out for varying caps in the extension, e.g. JPG. If that is a concern, you may add (?i) before the extension.
You may also wish to prevent partial names, e.g. discard a match that has more than 9 digits. That is the (?<!\d) part: Make sure no digit characters precede the match.
ETA: Now extracts multiple matches too, thanks to ikegami.
use feature 'say';   # needed for say
my @match;
while (<>) {
    # each captured file name is aliased to $_ in turn
    for (/(?<!\d)([0-9]{9}[a-z]{6}\.(?i)jpg)/g) {
        say;
        push @match, $_;
    }
}
Try this regex:
/\b\d{9}[a-z]{6}\.jpg/
perldoc perlre
use warnings;
use strict;
while (<DATA>) {
if (/ ( [0-9]{9} [a-z]{6} [.] jpg ) /x) {
print "$1\n";
}
}
__DATA__
foo 140005917smpxgj.jpg bar
sdfads 777666999abcdef.jpg dfgffgh
Prints:
140005917smpxgj.jpg
777666999abcdef.jpg
the solution regex is \d{9}[a-z]{6}\.jpg

Perl search and replace the last character occurrence

I have what I thought would be an easy problem to solve but I am not able to find the answer to this.
How can I find and replace the last occurrence of a character in a string?
I have a string: GE1/0/1 and I would like it to be: GE1/0:1 <- This can be variable length so no substrings please.
Clarification:
I am looking to replace the last / with a : no matter what comes before or after it.
use strict;
use warnings;
my $a = 'GE1/0/1';
(my $b = $a) =~ s{(.*)/}{$1:}xms;
print "$b\n";
I use the greedy behaviour of .*
Perhaps I have not understood the problem with variable length, but I would do the following:
You can match what you want with the regex:
(.+)/
So, this Perl script
my $text = 'GE1/0/1';
$text =~ s|(.+)/|$1:|;
print 'Result : '.$text;
will output :
Result : GE1/0:1
Because the '+' quantifier is greedy by default, (.+) consumes as much as it can, so the / in the pattern ends up matching the last slash character.
Hope this is what you were asking.
This finds a slash and looks ahead to make sure there are no more slashes after it:
Raw regex:
/(?=[^/]*$)
I think the code would look something like this, but perl isn't my language:
$string =~ s!/(?=[^/]*$)!\:!g;
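The greedy and the lookahead approaches can be compared side by side with a small sketch. The sub names and the extra sample strings are assumptions for illustration; only GE1/0/1 comes from the question.

```perl
use strict;
use warnings;

# Greedy variant: (.*) swallows everything up to the last slash.
sub replace_last_greedy {
    my ($s) = @_;
    $s =~ s{(.*)/}{$1:}s;
    return $s;
}

# Lookahead variant: match a slash with no further slash after it.
sub replace_last_lookahead {
    my ($s) = @_;
    $s =~ s{/(?=[^/]*\z)}{:};
    return $s;
}

for my $in ('GE1/0/1', 'GE1/0/12/3', 'no-slash') {
    printf "%-12s %-12s %s\n", $in,
        replace_last_greedy($in), replace_last_lookahead($in);
}
```

Both subs turn GE1/0/1 into GE1/0:1 and leave a string without a slash unchanged. The lookahead version needs no capture group, while the greedy version is arguably easier to read.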
"last occurrence in a string" is slightly ambiguous. The way I see it, you can mean either:
"Foo: 123, yada: GE1/0/1, Bar: null"
Meaning the last occurrence in the "word" GE1/0/1, or:
"GE1/0/1"
As a complete string.
In the latter case, it is a rather simple matter, you only have to decide how specific you can be in your regex.
$str =~ s{/(\d+)$}{:$1};
Is perfectly fine, assuming the last character(s) can only be digits.
In the former case, which I don't think you are referring to, but I'll include anyway, you'd need to be much more specific:
$str =~ s{(\byada:\s+\w+/\w+)/(\w+\b)}{$1:$2};