Finding and Concatenating strings - regex

I want to find some punctuation characters and concatenate them with spaces.
For example:
If any punctuation character is found, I want to add a space before and after it.
$line =~ s/[?%&!,.%*\[◦\]\\;#<>{}#^=\+()\$]/" $1 "/g ;
I tried using $1 the way it is used in PHP, but it didn't work.
I searched the web and couldn't find the Perl syntax.
Additionally, how can I preserve ... as a single token?
What is the correct syntax for my problem?

Use this:
#!/usr/bin/perl -w
use strict;
my $string = "For example; If i found any puncs. above list, i want to add spaces to front and end of token.";
$string =~ s/([[:punct:]])/ $1 /g;
print "$string\n";
Outputs:
For example ; If i found any puncs . above list , i want to add spaces to front and end of token .
If you want different output, change the replacement part of the substitution (the text between the second and third /). Here each punctuation character is replaced with itself surrounded by spaces.

You need to surround the match pattern with () to capture it into $1:
$line =~ s/([?%&!,.%*\[◦\]\\;#<>{}#^=\+\(\)\$])/ $1 /g;
EDIT (as per OP's comment)
how can i preserve '...' a single token ?
One way would be to revert back the changes for that token.
$line =~ s/ \.  \.  \. /.../g;  # each "." became " . ", so adjacent dots end up two spaces apart
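As an alternative to reverting the change afterwards, the ellipsis can be protected in the same pass by listing it first in an alternation. This is only a sketch, assuming an ellipsis is always exactly three dots; pad_punctuation is a hypothetical helper name, and the whitespace clean-up at the end is an extra step that the answers above do not perform.

```perl
use strict;
use warnings;

# Hypothetical helper: pad punctuation with spaces, keeping "..." together.
sub pad_punctuation {
    my ($text) = @_;
    # Try the three-dot ellipsis first, then fall back to any single
    # punctuation character. Either way the match lands in $1.
    $text =~ s/(\.{3}|[[:punct:]])/ $1 /g;
    $text =~ s/\s+/ /g;         # collapse the doubled spaces this creates
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    return $text;
}

print pad_punctuation("Wait... what?"), "\n";
```

Because alternation is tried left to right, the three dots are consumed as one match before the single-character class gets a chance to split them.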

regular expression that matches any word that starts with pre and ends in al

The following regular expression gives me proper results when tried in Notepad++ editor but when tried with the below perl program I get wrong results. Right answer and explanation please.
The link to file I used for testing my pattern is as follows:
(http://sainikhil.me/stackoverflow/dictionaryWords.txt)
Regular expression: ^Pre(.*)al(\s*)$
Perl program:
use strict;
use warnings;
sub print_matches {
my $pattern = "^Pre(.*)al(\s*)\$";
my $file = shift;
open my $fp, $file;
while(my $line = <$fp>) {
if($line =~ m/$pattern/) {
print $line;
}
}
}
print_matches @ARGV;
A few thoughts:
You should not escape the dollar sign
The capturing group around the whitespaces is useless
Same for the capturing group around the dot .
which leads to:
^Pre.*al\s*$
If you don't want words like precious final to match (because of the whitespace in the middle), change the regex to:
^Pre\S*al\s*$
Included in your code:
while(my $line = <$fp>) {
if($line =~ /^Pre\S*al\s*$/m) {
print $line;
}
}
You're getting messed up by assigning the pattern to a variable before using it as a regex and putting it in a double-quoted string when you do so.
This is why you need to escape the $, because, in a double-quoted string, a bare $ indicates that you want to interpolate the value of a variable. (e.g., my $str = "foo$bar";)
The reason this is causing you a problem is because the backslash in \s is treated as escaping the s - which gives you just plain s:
$ perl -E 'say "^Pre(.*)al(\s*)\$";'
^Pre(.*)al(s*)$
As a result, when you go to execute the regex, it's looking for zero or more ses rather than zero or more whitespace characters.
The most direct fix for this would be to escape the backslash:
$ perl -E 'say "^Pre(.*)al(\\s*)\$";'
^Pre(.*)al(\s*)$
A better fix would be to use single quotes instead of double quotes and don't escape the $:
$ perl -E "say '^Pre(.*)al(\s*)$';"
^Pre(.*)al(\s*)$
The best fix would be to use the qr (quote regex) operator instead of single or double quotes, although that makes it a little less human-readable if you print it out later to verify the content of the regex (which I assume to be why you're putting it into a variable in the first place):
$ perl -E "say qr/^Pre(.*)al(\s*)$/;"
(?^u:^Pre(.*)al(\s*)$)
Or, of course, just don't put it into a variable at all and do your matching with
if($line =~ m/^Pre(.*)al(\s*)$/) ...
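The difference between the double-quoted string and qr can be seen side by side in a short script. This is a sketch only: matching_lines and the sample words are made up for illustration, and the misc warning is switched off locally just so the broken string can be built quietly.

```perl
use strict;
use warnings;

# What the double-quoted string really holds: "\s" collapses to a plain "s".
my $as_string = do { no warnings 'misc'; "^Pre(.*)al(\s*)\$" };

# qr// keeps \s as the whitespace class and $ as the end-of-line anchor.
my $as_regex = qr/^Pre\S*al\s*$/;

# Hypothetical helper: return the lines that match a compiled pattern.
sub matching_lines {
    my ($pattern, @lines) = @_;
    return grep { $_ =~ $pattern } @lines;
}

my @sample = ("Prenatal\n", "Precious\n", "Presidential\r\n", "apple\n");
print "string form: $as_string\n";
print matching_lines($as_regex, @sample);
```

With warnings left on, Perl reports the unrecognized \s escape when it compiles the double-quoted string, which is a useful hint that the pattern is being mangled.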
Try removing trailing newline character(s):
while(my $line = <$fp>) {
$line =~ s/[\r\n]+$//s;
And, to match only words that begin with Pre and end with al, try this regular expression:
/^Pre\w*al$/
(\w matches any word character, that is a letter, digit, or underscore, rather than just any character)
And, if you want to match both Pre and pre, do a case-insensitive match:
/^Pre\w*al$/i

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
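The \s+ variant can be generated from the phrase instead of being typed by hand. A minimal sketch, assuming newlines only ever replace the spaces between words (so it handles the first and third examples but not the second); phrase_to_regex is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal phrase into a regex in which any run
# of whitespace (including newlines) may separate the words.
sub phrase_to_regex {
    my ($phrase) = @_;
    # quotemeta keeps the words literal; \s+ joins them.
    my $body = join '\s+', map { quotemeta } split ' ', $phrase;
    return qr/$body/;
}

my $re    = phrase_to_regex("Match this exact phrase across newlines");
my $text1 = "Match\nthis exact\nphrase across newlines";
my $text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";

s/$re/replacement text/ for $text1, $text3;
print "$text1\n---\n$text3\n";
```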
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
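Wrapped in a subroutine and run against the three sample texts from the question, the same construction looks like this. It is a sketch; newline_tolerant_regex is a hypothetical name, and the body is the substitution shown above with the result compiled by qr.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex from literal text that lets a newline
# stand in for any space and also appear between any two characters.
sub newline_tolerant_regex {
    my ($literal) = @_;
    (my $pattern = $literal)
        =~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
    $pattern =~ s/^\\n\?//;    # no optional newline before the first character
    return qr/$pattern/;
}

my $re = newline_tolerant_regex("Match this exact phrase across newlines");

my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);
s/$re/replacement text/ for @texts;
print join("\n---\n", @texts), "\n";
```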
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of adjacent letters inside a word, allow a newline between them with \n?. And replace each space in your regex with something that also accepts a newline ([ \n] in the code above, written as \s in the expanded pattern).
May not be usable, but it gets the job done ;)
Check it out at regex101.

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may not be supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
my @row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", @row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93@gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.
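A middle ground between the brute-force code and the single long pattern is a substitution with the /e modifier, which runs code on each quoted span. This is a sketch that assumes quotes are balanced and never escaped inside a field; strip_quoted_commas is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: delete commas that sit between double quotes.
sub strip_quoted_commas {
    my ($s) = @_;
    # For each "..." span, copy the inside, drop its commas with tr,
    # and put the cleaned text back between the quotes.
    $s =~ s{"([^"]*)"}{ my $inner = $1; $inner =~ tr/,//d; qq("$inner") }ge;
    return $s;
}

my $string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
print strip_quoted_commas($string), "\n";
```

Unlike the Text::ParseWords version, this touches every quoted span in the line rather than one chosen field.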

perl regex partial word match

I am trying to remove all words that contain two keys (in Perl).
For example, the string
garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2
Should become
garble variable10 variable1
To just remove the normal, unappended/prepended keys is easy:
$var =~ s/(vssx|vddx)/ /g;
However I cannot figure out how to get it to remove the entire xi_21_vssx part. I tried:
$var =~ s/\s.*(vssx|vddx).*\s/ /g
Which does not work correctly. I do not understand why... it seems like \s should match the space, then .* matches anything up to one of the patterns, then the pattern, then .* matches anything following the pattern until the next space.
I also tried replacing \s (whitespace) with \b (word boundary) but it also did not work. Another attempt:
$var =~ s/ .*(vssx|vddx).* / /g
$var =~ s/(\s.*vssx.*\s|\s.*vddx.*\s)/ /g
As well as a few other mungings.
Any pointers/help would be greatly appreciated.
-John
I think the regex will just be
$var =~ s/\S*(vssx|vddx)\S*/ /g;
You can use
\s*\S*(?:vssx|vddx)\S*\s*
The problems with your regex were:
The .* should have been non-greedy.
The .* in front of (vssx|vddx) mustn't match whitespace characters, so you have to use \S*.
Note that there's no way to properly preserve the space between words - i.e. a vssx b will become ab.
regex101 demo.
I am trying to remove all words that [...]
This type of problem lends itself well to grep, which can be used to find the elements in a list that match a condition. You can use split to convert your string to a list of words and then filter it like this:
use strict;
use warnings;
use 5.010;
my $string = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
my @words = split ' ', $string;
my @filtered = grep { $_ !~ /(?:vssx|vddx)/ } @words;
say "@filtered";
Output:
garble variable10 variable1
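The same split and grep idea can be generalised to an arbitrary list of keys. A sketch, with remove_keyed_words as a hypothetical helper name; it assumes words are separated by whitespace and that the output should be joined with single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every whitespace-separated word that contains
# any of the given keys, then rejoin the survivors with single spaces.
sub remove_keyed_words {
    my ($text, @keys) = @_;
    my $alternation = join '|', map { quotemeta } @keys;
    return join ' ', grep { !/$alternation/ } split ' ', $text;
}

my $var = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
print remove_keyed_words($var, qw(vssx vddx)), "\n";
```

Because the words are rejoined explicitly, the spacing problem of the pure substitution approach (a vssx b collapsing to ab) does not arise.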
Try this as the regex:
\b[\w]*(vssx|vddx)[\w]*\b

In Perl, how can I correctly extract URLs that are enclosed in parentheses?

I've got two questions about Regexp::Common qw/URI/ and regexes in Perl.
I use Regexp::Common qw/URI/ to find URIs in strings and delete them. But I get an error when a URI is between parentheses.
For example: (http://www.example.com)
The error is caused by ')': when it tries to parse the URI, the app crashes. So I've thought of two fixes:
Use a simple (or so I thought) regex that writes a whitespace between the URI and the ) character
Find a function in Regexp::Common qw/URI/ that implements a fix.
In my code I've tried to implement the regex, but the app freezes. The code that I've tried is this:
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.example.com)";
while ($str =~ m/\)/){
$str =~ s/\)/ \)/;
}
my ($uri) = $str =~ /$RE{URI}{-keep}/;
print "$uri\n";
print $str;
The output that I want is: (http://www.example.com )
I'm not sure, but I think that the problem is in $str =~ s/\)/ \)/;
BTW, I've got a question about Regexp::Common qw/URI/. I've got two types of string:
ablalbalblalblalbal http://www.example.com
asfasdfasdf http://www.example.com aasdfasdfasdf
I want to remove the URI if it is the last component (and save it). And, if not, save it without removing it from the text.
You don't have to first test for a match to be able to use the s/// operator correctly: If the string does not match the search pattern, it will not do anything.
#!/usr/bin/perl
use strict; use warnings;
my $str = "Hello!!, I love (GOOGLE)";
$str =~ s/\)/ )/g;
print "$str\n";
The general problem of detecting URLs correctly in text is error-prone. See for example Jeff's thoughts on this.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
$str =~ s/\)/ )/;
}
Your program goes into an infinite loop at this point. To see why, try printing the value of $str each time round the loop.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
$str =~ s/\)/ )/;
print $str, "\n";
}
The first time it prints "Hello!!, I love (GOOGLE )". The while loop condition is then evaluated again. Your string still matches your regular expression (it still contains a closing parenthesis), so the replacement is run again and this time it prints out "Hello!!, I love (GOOGLE  )" with two spaces.
And so it goes on. Each time round the loop another space is added, but each time you still have a closing parenthesis, so another substitution is run.
The simplest solution I can see is to only match the closing parenthesis if it is preceded by a non-whitespace character (using \S).
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\S\)/){
$str =~ s/\)/ )/;
print $str, "\n";
}
In this case the loop is only executed once.
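The loop can also be dropped altogether by moving the "preceded by a non-whitespace character" test into the substitution itself with a lookbehind. A minimal sketch; pad_closing_paren is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: put one space before every ")" that directly
# follows a non-whitespace character.
sub pad_closing_paren {
    my ($str) = @_;
    # (?<=\S) checks the previous character without consuming it, so only
    # the ")" itself is replaced. /g handles every occurrence in one pass.
    $str =~ s/(?<=\S)\)/ )/g;
    return $str;
}

my $once  = pad_closing_paren("Hello!!, I love (GOOGLE)");
my $twice = pad_closing_paren($once);
print "$once\n$twice\n";
```

Running it a second time changes nothing, because the parenthesis is now preceded by a space and the lookbehind fails.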
Why not just include the parentheses in the search? If the URLs will always be bracketed, then something like this:
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.google.com)";
my ($uri) = $str =~ / \( ( $RE{URI} ) \) /x;
print "$uri\n";
The regex from Regexp::Common can be used as part of a longer regex, it doesn't have to be used on its own. Also I've used the 'x' modifier on the regex to allow whitespace so you can see more clearly what is going on - the brackets with the backslashes are treated as characters to match, those without define what is to be matched (presumably like the {-keep} - I've not used that before).
You could also make the brackets optional, with something like:
/ (?: \( ( $RE{URI} ) \) | ( $RE{URI} ) ) /x
although that would result in two match variables, one undefined - so something like following would be needed:
my $uri = $1 || $2 || die "Didn't match a URL!";
There's probably a better way to do this, and also if you're not bothered about matching parentheses then you could simply make the brackets optional (via a '?') in the first regex...
To answer your second question about only matching URLs at the end of the line - have a look at Regex 'anchors' which can force a match against the beginning or end of a line: ^ and $ (or \A and \Z if you prefer). e.g. matching a URL at the end of a line only:
/$RE{URI}\Z/
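For the second question, the anchored match can be combined with a substitution so that a trailing URL is removed and returned, while a URL elsewhere is returned and left in place. This is a sketch only: to stay self-contained it uses a deliberately simplified stand-in pattern ($url_re) instead of Regexp::Common, and take_trailing_url is a hypothetical helper name.

```perl
use strict;
use warnings;

# Simplified stand-in for a URL pattern (an assumption for this sketch);
# in real code the pattern from Regexp::Common would take its place.
my $url_re = qr{https?://[^\s()]+};

# Hypothetical helper: returns (text, url). A URL at the very end is cut
# out of the text; a URL elsewhere is reported but left where it is.
sub take_trailing_url {
    my ($text) = @_;
    if ($text =~ s/\s*($url_re)\s*\z//) {
        return ($text, $1);
    }
    my ($url) = $text =~ /($url_re)/;
    return ($text, $url);
}

for my $line ("ablalbalblalblalbal http://www.example.com",
              "asfasdfasdf http://www.example.com aasdfasdfasdf") {
    my ($rest, $url) = take_trailing_url($line);
    print "text: $rest\nurl:  $url\n";
}
```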