Perl: string manipulation - surrounding a word with a character '@' - regex

I am trying to extract email addresses from a txt file. I've thought about matching the words that surround an '@' character. Does anybody know an expression to do that?

Whenever you need a reasonably common matching problem solved in Perl, you should always check the Regexp::Common family on CPAN first. In this case: Regexp::Common::Email::Address. From the POD synopsis:
use Regexp::Common qw[Email::Address];
use Email::Address;
while (<>) {
    my (@found) = /($RE{Email}{Address})/g;
    my (@addrs) = map $_->address, Email::Address->parse("@found");
    print "X-Addresses: ", join(", ", @addrs), "\n";
}

Here's a very quick and dirty regex which will match non-whitespace characters on either side of an @:
/\S+@\S+/
This will match john.smith@example.com in
some rubbish text john.smith@example.com more rubbish text
Hope this helps.
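To make the quick-and-dirty approach concrete, here is a minimal sketch that pulls every whitespace-delimited token containing an '@' out of a line. The helper name find_email_like is made up for illustration, and the pattern is only a rough filter: it will happily keep trailing punctuation and accept things that are not valid addresses.

```perl
use strict;
use warnings;

# Hypothetical helper: return every whitespace-delimited token that
# contains an '@'. The '@' is escaped so Perl never tries to
# interpolate an array inside the pattern.
sub find_email_like {
    my ($text) = @_;
    return $text =~ /(\S+\@\S+)/g;
}

my $line = 'some rubbish text john.smith@example.com more rubbish text';
print "$_\n" for find_email_like($line);
```

For anything stricter than a first pass, the Regexp::Common::Email::Address route shown earlier is probably the safer choice.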

Related

Perl Regex, get strings between two strings

I am new to Perl and trying to use Regex to get a piece of string between two tags that I know will be there in that string. I already tried various answers from stackoverflow but none of them seems to be working for me. Here's my example...
The required data is in $info variable out of which I want to get the useful data
my $info = "random text i do not want\n|BIRTH PLACE=Boston, MA\n|more unwanted random text";
The Useful Data in the above string is Boston, MA. I removed the newlines from the string by $info =~ s/\n//g;. Now $info has this string "random text i do not want|BIRTH PLACE=Boston, MA|more unwanted random text". I thought doing this will help me capture the required data easily.
Please help me get the required data. I am sure that the data will always be preceded by |BIRTH PLACE= and followed by |. Everything before and after that is unwanted text. If a question like this has already been answered, please point me to it as well. Thanks.
Instead of removing the newlines first, you could search the original string for /\|BIRTH PLACE=([^\|]+)\n\|/, where [^\|]+ matches one or more characters that are not a pipe.
$info =~ m{\|BIRTH PLACE=(.*?)\|} or die "There is no data in \$info?!";
my $birth_place = $1;
That should do the trick.
You know, actually, those newlines might have helped you. I would have gone for an initial regular expression of:
/^\|BIRTH PLACE=(.*)$/m
Using the multiline modifier (m) to match ^ at the beginning of a line and $ at the end of it, instead of just matching at the beginning and end of the string. Heck, you can even get really crazy and match:
/(?<=^\|BIRTH PLACE=).+$/m
To capture only the information you want, using lookbehind ((?<= ... )) to assert that it's the birth place information.
Why curse the string twice when you can do it once?
So, in perl:
if ($info =~ m/(?<=^\|BIRTH PLACE=).+$/m) {
    print "Born in $&.\n";
} else {
    print "From parts unknown";
}
You have presumably read this data from a file and joined it into one string, which is a bad start. Your program should look like this:
use strict;
use warnings;
use autodie;
open my $fh, '<', 'myfile';
my $pob;
while (<$fh>) {
    if (/BIRTH PLACE=(.+)/) {
        $pob = $1;
        last;
    }
}
print $pob;
output
Boston, MA
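The answers in this thread share one idea: anchor on the known prefix and capture lazily up to the known terminator. A generic sketch of that idea follows. The helper name between is hypothetical, and \Q...\E is used so that markers containing regex metacharacters such as | are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between $start and the next $end,
# or undef when the markers are not found. \Q...\E quotes metacharacters
# in the markers; /s lets the captured text span newlines.
sub between {
    my ($text, $start, $end) = @_;
    return $text =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

my $info = "random text i do not want\n|BIRTH PLACE=Boston, MA\n|more unwanted random text";
my $place = between($info, '|BIRTH PLACE=', "\n|");
print defined $place ? "$place\n" : "not found\n";
```

Because the capture is lazy, the helper stops at the first terminator after the prefix, which is the behaviour the question asks for.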

How to match the pattern between a specified characters?

I have a fixed message delimited by "|"... tag=value is the pair between the delimiter;
(8=FIX.4.2|9=0360|35=8|49=BLPFT|56=ESP|34=8415|52=20110201-15:59:59|50=MBA|143=LN|115=MSET|57=2457172|30=CHIX|60=20110201-15:59:59.121|150=1|31=56.3100|151=71785|32=137|6=56.4058|37=9D9ZIhgu4BGU9sBtfHcYeQA|38=97370|39=1|40=1|11=20110201-05529|12=0.0012|13=2|14=25585|15=EUR|76=CHIXCCP|17=272674|47=A|167=CS|18=1|48=FR0000131104|20=0|21=1|22=4|113=N|54=1|55=BNP|207=FP|29=1|59=0|10=205|)
How do I extract the data between "11=" and the first occurrence of "|" after the match?
For example, I want the data
20110201-05529
which is between "|11=" and "|".
Can you please tell me the regular expression?
The best approach will depend on how much you know about the data you are trying to match. If you know it will be comprised of numbers and dashes only:
m/11=([0-9\-]+)/
Conversely, if the data could contain any kind of characters, use:
m/11=([^|]+)/
Which matches anything that isn't a pipe character. This is probably the most reliable expression.
In both cases, the data you want is captured into the $1 special variable.
If you don't always want to match the value for the key 11, you can use variables in the pattern, so:
my $key = 11; # or any other tag number
if ($text =~ m/$key=([^|]+)/) {
    print "I found $1"; # prints "I found 20110201-05529"
}
As always, there is more than 1 way to solve the problem. Therefore, there is no such thing as "the regular expression".
But you will definitely want to perldoc split.
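As a sketch of the split route, the message can be turned into a hash keyed by tag, after which any value is a plain lookup. The helper name parse_pairs and the shortened sample message are illustrative only. If a tag repeats, the last value wins in this version.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "tag=value|tag=value|..." into a hash.
sub parse_pairs {
    my ($message) = @_;
    my %field;
    for my $pair (split /\|/, $message) {
        # A limit of 2 keeps any '=' inside the value intact.
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $value;    # skip empty or malformed chunks
        $field{$tag} = $value;
    }
    return %field;
}

my %field = parse_pairs('8=FIX.4.2|9=0360|35=8|11=20110201-05529|12=0.0012|');
print "$field{11}\n";
```

This also sidesteps the risk of a pattern like 11= accidentally matching the tail of a longer tag such as 111=.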
Something like this will match everything other than = and then capture everything other than |:
[^=]+=([^|]+)

Regex line by line: How to match triple quotes but not double quotes

I need to check to see if a string of many words / letters / etc, contains only 1 set of triple double-quotes (i.e. """), but can also contain single double-quotes (") and double double-quotes (""), using a regex. Haven't had much success thus far.
A regex with negative lookahead can do it:
(?!.*"{3}.*"{3}).*"{3}.*
I tried it with these lines of java code:
String good = "hello \"\"\" hello \"\" hello ";
String bad = "hello \"\"\" hello \"\"\" hello ";
String regex = "(?!.*\"{3}.*\"{3}).*\"{3}.*";
System.out.println( good.matches( regex ) );
System.out.println( bad.matches( regex ) );
...with output:
true
false
Try using the number of occurrences operator to match exactly three double-quotes.
\"{3}
["]{3}
[\"]{3}
I've quickly checked using http://www.regextester.com/, seems to work fine.
How you correctly compile the regex in your language of choice may vary, though!
Depends on your language, but you should only need to match for three double quotes (e.g., /\"{3}/) and then count the matches to see if there is exactly one.
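A minimal Perl sketch of the count-the-matches idea follows. The helper name has_one_triple_quote is made up, and it assumes that a longer run of quotes is counted greedily from the left, so four or five quotes in a row count as one triple and six count as two.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the string holds exactly one """.
sub has_one_triple_quote {
    my ($text) = @_;
    # The "= () =" idiom forces list context, so we get the match count.
    my $count = () = $text =~ /"{3}/g;
    return $count == 1;
}

for my $s ('hello """ hello "" hello', 'hello """ hello """ hello') {
    my $verdict = has_one_triple_quote($s) ? 'ok' : 'rejected';
    print "$verdict: $s\n";
}
```

Counting keeps the regex trivial and moves the "exactly one" rule into ordinary code, which is usually easier to adjust later.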
There are probably plenty of ways to do this, but a simple one is to merely look for multiple occurrences of triple quotes then invert the regular expression. Here's an example from Perl:
use strict;
use warnings;
my $match = 'hello """ hello "" hello';
my $no_match = 'hello """ hello """ hello';
my $regex = '[\"]{3}.*?[\"]{3}';
if ($match !~ /$regex/) {
print "Matched as it should!\n";
}
if ($no_match !~ /$regex/) {
print "You shouldn't see this!\n";
}
Which outputs:
Matched as it should!
Basically, you are telling it to find the thing you DON'T want, then inverting the truth. Hope that makes sense. Can help you convert the example to another language if you need help.
This may be a good start for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
See it in action at regex101.com.

Is there a way, using regular expressions, to match a pattern for text outside of quotes?

As stated in the title, is there a way, using regular expressions, to match a text pattern for text that appears outside of quotes. Ideally, given the following examples, I would want to be able to match the comma that is outside of the quotes, but not the one in the quotes.
This is some text, followed by "text, in quotes!"
or
This is some text, followed by "text, in quotes" with more "text, in quotes!"
Additionally, it would be nice if the expression would respect nested quotes as in the following example. However, if this is technically not feasible with regular expressions then it would simply be nice to know if that is the case.
The programmer looked up from his desk, "This can't be good," he exclaimed, "the system is saying 'File not found!'"
I have found some expressions for matching something that would be in the quotes, but nothing quite for something outside of the quotes.
Easiest is matching both commas and quoted strings, and then filtering out the quoted strings.
/"[^"]*"|,/g
If you really can't have the quotes matching, you could do something like this:
/,(?=[^"]*(?:"[^"]*"[^"]*)*\Z)/g
This can become slow because, for each comma, it has to scan the remaining characters and count the quotes. \Z matches at the end of the string (in Perl, also just before a final trailing newline). It is similar to $, but it never matches at internal line ends, even in multiline mode.
If you don't mind an extra capture group, it could be done like this instead:
/\G((?:[^"]*"[^"]*")*?[^"]*?)(,)/g
This will only scan the string once. It counts the quotes from the beginning of the string instead. \G will match the position where last match ended.
The last pattern could use an example.
Input String: 'This is, some text, followed by "text, in quotes!" and more ,-as'
Matches:
1. ['This is', ',']
2. [' some text', ',']
3. [' followed by "text, in quotes!" and more ', ',']
It matches the string leading up to the comma, as well as the comma.
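For comparison, here is a small sketch that applies the lookahead pattern from this answer and reports where the unquoted commas are. The helper name commas_outside_quotes is hypothetical, \z is used as the end-of-string anchor, and the code assumes the quotes are balanced and never escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets of commas that are not inside double quotes.
# Assumes quotes are balanced and never escaped.
sub commas_outside_quotes {
    my ($text) = @_;
    my @offsets;
    # A comma qualifies when an even number of quotes follows it.
    while ($text =~ /,(?=[^"]*(?:"[^"]*"[^"]*)*\z)/g) {
        push @offsets, $-[0];    # $-[0] is the start offset of the match
    }
    return @offsets;
}

my $text = 'This is some text, followed by "text, in quotes!"';
print join(", ", commas_outside_quotes($text)), "\n";
```

Only the comma before "followed" is reported here, because an odd number of quotes remains after the comma inside the quoted text.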
This can be done with modern regexes due to the massive number of hacks to regex engines that exist, but let me be the one to post the "Don't Do This With Regular Expressions" answer.
This is not a job for regular expressions. This is a job for a full-blown parser. As an example of something you can't do with (classical) regular expressions, consider this:
()(())(()())
No (classical) regex can determine whether those parentheses are matched properly, but doing so without a regex is trivial:
/* C code */
#include <stdio.h>

int main(void)
{
    char string[] = "()(())(()())";
    int parens = 0;
    /* stop at the terminating NUL, or as soon as a ')' has no matching '(' */
    for (char *tmp = string; *tmp && parens >= 0; tmp++)
    {
        if (*tmp == '(') parens++;
        if (*tmp == ')') parens--;
    }
    if (parens > 0)
    {
        printf("%d too many open parentheses.\n", parens);
    }
    else if (parens < 0)
    {
        printf("Unmatched closing parenthesis.\n");
    }
    else
    {
        printf("Parentheses match!\n");
    }
    return 0;
}
# Perl code
my $string = "()(())(()())";
my $parens = 0;
for (split(//, $string)) {
    $parens++ if $_ eq "(";
    $parens-- if $_ eq ")";
    last if $parens < 0;    # a ")" with no matching "(" can never be fixed later
}
die "Too many open parentheses.\n" if $parens > 0;
die "Unmatched closing parenthesis.\n" if $parens < 0;
print "Parentheses match!";
See how simple it was to write some non-regex code to do the job for you?
EDIT: Okay, back from seeing Adventureland. :) Try this (written in Perl, commented to help you understand what I'm doing if you don't know Perl):
# split $string into a list, split on the double quote character
my @temp = split(/"/, $string);
# iterate over the indices of our list
for (0 .. $#temp) {
    # skip odd-numbered elements - only process $temp[0], $temp[2], etc.
    # the reason is that, if we split on "s, every other element is a quoted string
    next if $_ & 1;
    if ($temp[$_] =~ /regex/) {
        # do stuff
    }
}
Another way to do it:
my $bool = 0;     # true while we are inside a quoted string
my $str = "";
my $match = 0;
# loop through the characters of a string
for (split(//, $string)) {
    if ($_ eq '"') {
        $bool = !$bool;
        if ($bool) {
            # regex time! test the unquoted text collected so far
            $match += $str =~ /regex/;
            $str = "";
        }
        next;     # never add the quote character itself to the test string
    }
    if (!$bool) {
        # add the current character to our test string
        $str .= $_;
    }
}
# get trailing string match
$match += $str =~ /regex/;
(I give two because, in another language, one solution may be easier to implement than the other, not just because There's More Than One Way To Do It™.)
Of course, as your problems grow in complexity, there will arise certain benefits of constructing a full-blown parser, but that's a different horse. For now, this will suffice.
As mentioned before, a regexp cannot match arbitrarily nested patterns: nesting requires a context-free grammar, and classical regular expressions only describe regular languages.
So if you have any nested quotes, you are not going to solve this with a regex.
(Except with the "balancing group" feature of the .NET regex engine, as mentioned by Daniel L in the comments, but I am not making any assumption about the regex flavor here.)
Except if you add a further specification, like requiring that a quote within a quote be escaped.
In that case, the following:
text before string "string with \escape quote \" still
within quote" text outside quote "within quote \" still inside" outside "
inside" final outside text
would be matched successfully with:
(?ms)((?:\\(?=")|[^"])+)(?:"((?:[^"]|(?<=\\)")+)(?<!\\)")?
group1: text preceding a quoted text
group2: text within double quotes, even if \" are present in it.
Here is an expression that gets the match, but it isn't perfect, as the first match it gets is the whole string, removing the final ".
[^"].*(,).*[^"]
I have been using my Free RegEx tester to see what works.
Test Results
Group Match Collection # 1
Match # 1
Value: This is some text, followed by "text, in quotes!
Captures: 1
Match # 2
Value: ,
Captures: 1
You would be better off building yourself a simple parser (pseudo-code):
quoted := False
FOR char IN string DO
    IF char = '"'
        quoted := !quoted
    ELSE
        IF char = "," AND !quoted
            // not quoted comma found
        ENDIF
    ENDIF
ENDFOR
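The pseudo-code above translates almost line for line into Perl. The sketch below is one possible rendering, with a made-up helper name (unquoted_commas) and the same assumption as the pseudo-code: quotes are never nested or escaped, so each double quote simply toggles the state.

```perl
use strict;
use warnings;

# Sketch of the toggle-on-quote scan: returns offsets of unquoted commas.
sub unquoted_commas {
    my ($text) = @_;
    my $quoted = 0;
    my $pos    = 0;
    my @offsets;
    for my $char (split //, $text) {
        if ($char eq '"') {
            $quoted = !$quoted;       # entering or leaving a quoted run
        }
        elsif ($char eq ',' && !$quoted) {
            push @offsets, $pos;      # comma outside quotes
        }
        $pos++;
    }
    return @offsets;
}

my $text = 'This is some text, followed by "text, in quotes" with more "text, in quotes!"';
print join(", ", unquoted_commas($text)), "\n";
```

A single pass with one flag scans the string once and is easy to extend, for example to handle escaped quotes.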
This really depends on if you allow nested quotes or not.
In theory, with nested quotes you cannot do this (regular languages can't count)
In practice, you might manage if you can constrain the depth. It will get increasingly ugly as you add complexity. This is often how people get into grief with regular expressions (trying to match something that isn't actually regular in general).
Note that some "regex" libraries/languages have added non-regular features.
If this sort of thing gets complicated enough, you'll really have to write/generate a parser for it.
You need more in your description. Do you want any set of possible quoted strings and non-quoted strings like this ...
Lorem ipsum "dolor sit" amet, "consectetur adipiscing" elit.
... or simply the pattern you asked for? This is pretty close I think ...
(?<outside>.*?)(?<inside>(?=\"))
It does capture the "'s however.
Maybe you could do it in two steps?
First you replace the quoted text:
("[^"]*")
and then you extract what you want from the remaining string
,(?=(?:[^"]*"[^"]*")*[^"]*\z)
Regexes may not be able to count, but they can determine whether there's an odd or even number of something. After finding a comma, the lookahead asserts that, if there are any quotation marks ahead, there's an even number of them, meaning the comma is not inside a set of quotes.
This can be tweaked to handle escaped quotes if needed, though the original question didn't mention that. Also, if your regex flavor supports them, I would add atomic groups or possessive quantifiers to keep backtracking in check.

How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:
booooooot or abbott
I won't know the letter I am looking for ahead of time.
This is a question I was asked in interviews and have since asked in interviews myself. Not many people get it right.
You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.
my $str = "Foooooobar";
$str =~ /(\w)(\1+)/;
print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'
I think you actually want this rather than the "\w" as that includes numbers and the underscore.
([a-zA-Z])\1+
Ok, ok, I can take a hint Leon. Use this for the unicode-world or for posix stuff.
([[:alpha:]])\1+
I think using a backreference would work:
(\w)\1+
\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.
(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)
Use \N (where N is the group number, such as \1) to refer to previous groups:
/(\w)\1+/g
You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.
Also note that \w includes digits and the underscore character along with all the letters. To get just the letters, negate a character class that contains \W (the non-word characters), the digits and the underscore. This leaves only letters.
That might be easier to understand by framing it as the question:
"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/
#! /usr/local/bin/perl
use strict;
use warnings;
# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');
while (<DATA>) {
    chomp;
    if (/([^\W_0-9])\1+/) {
        print "$_: dup [$1]\n";
    }
    else {
        print "$_: nope\n";
    }
}
__DATA__
100
food
créé
a::b
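The double-negative character classes from this answer can be checked in isolation. The following sketch runs both classes over a few ASCII-only sample characters chosen for illustration, so the locale question does not come into play.

```perl
use strict;
use warnings;

# [^\D3]     reads "not (a non-digit or 3)", i.e. any digit except 3.
# [^\W_0-9]  reads "not (a non-word char, _ or a digit)", i.e. a letter.
my @digits  = grep { /^[^\D3]$/ }    0 .. 9;
my @letters = grep { /^[^\W_0-9]$/ } qw(a Z 7 _ -);

print "digits except 3: @digits\n";
print "letters only:    @letters\n";
```

The same negate-the-negation trick works for any shorthand class where you want "everything in the class except a few members".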
The following code will return all the characters that repeat two or more times:
my $str = "SSSannnkaaarsss";
print $str =~ /(\w)\1+/g;
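If the length of each run matters as well as the letter, an outer capture group around the whole run gives both. This is a sketch with a hypothetical helper name (repeated_runs), restricted to ASCII letters via [a-zA-Z] as suggested elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: describe each run of a repeated ASCII letter
# as "letter x length". $1 is the whole run, $2 the repeated letter.
sub repeated_runs {
    my ($text) = @_;
    my @runs;
    while ($text =~ /(([a-zA-Z])\2+)/g) {
        push @runs, "$2 x " . length($1);
    }
    return @runs;
}

print join(", ", repeated_runs("SSSannnkaaarsss")), "\n";
```

Note that the backreference becomes \2 here because the letter is now the second group, counted by its opening parenthesis.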
Just for kicks, a completely different approach:
if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
FYI, aside from RegExBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. Handles ([[:alpha:]])(\1+) nicely.
How about:
(\w)\1+
The first part makes an unnamed group around a character, then the back-reference looks for that same character.
I think this should also work:
((\w)(?=\2))+\2
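/(.)\1{2,}+/u
The 'u' modifier enables Unicode matching. Note that \1{2,} requires the character to appear at least three times in a row; use \1+ to catch doubled letters as well.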