Using a regular expression to replace repeated punctuation?

Here is a sentence like this:
Happy birthday!! I have a good day. :)
I want to know how to process such sentences with a regular expression to get the following format:
Happy birthday! I have a good day.

Here's how to do it in Perl (since you didn't specify a programming language):
my $str = "Happy birthday!! I have a good day. :)";
$str =~ s/([.!?]){2,}/$1/g; # collapse a run of punctuation to its last mark
$str =~ s/[:;()]+//g; # remove emoticon characters (also strips any other : ; ( ) in the text)
print $str;
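
A slightly stricter variant, as a minimal sketch: it collapses only runs of the same mark by using a backreference, and removes an emoticon only when it stands as its own token. The helper name `tidy_punctuation` and the emoticon shape (eyes, optional nose, mouth) are assumptions made for illustration, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse repeated marks and drop simple emoticons.
sub tidy_punctuation {
    my ($text) = @_;

    # "!!" or "..." becomes a single mark; \1 requires the same character to repeat.
    $text =~ s/([.!?])\1+/$1/g;

    # Assumed emoticon shape: eyes [:;], optional nose -, mouth [()DP],
    # followed by whitespace or end of string. Leading whitespace is removed too.
    $text =~ s/\s*[:;]-?[()DP](?=\s|\z)//g;

    return $text;
}

print tidy_punctuation("Happy birthday!! I have a good day. :)"), "\n";
# prints: Happy birthday! I have a good day.
```

Unlike the `{2,}` version, mixed runs such as "?!" are left alone, and ordinary colons or parentheses in the text survive.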

Related

Limit the translation to just one word in a phrase?

I'm coming to Perl from Python and wonder whether there is a simple way to limit a translation or replacement to just one word in a phrase.
In the example below, the first word is correctly translated to gazelle, but the word kind is also changed to lind. Is there a simple way to do the translation without resorting to a loop? Thanks.
my $string = 'gazekke is one kind of antelope';
my $count = ($string =~ tr/k/l/);
print "There are $count changes \n";
print $string; # gazelle is one lind of antelope <-- kind becomes lind too!

I don't know of an option for tr to stop translation after the first word.
But you can use a regex with capture groups for this.
use strict;
my $string = 'gazekke is one kind of antelope';
# Match first word in $1 and rest of sentence in $2.
$string =~ m/(\w+)(.*)/;
# Translate all k's to l's in the first word.
(my $translated = $1) =~ tr/k/l/;
# Concatenate the translated first word with the rest
$string = "$translated$2";
print $string;
Outputs: gazelle is one kind of antelope

Pick the first match (a word in this case), which is precisely what a regex does without /g, and replace all the wanted characters in that word by running code in the replacement side with /e:
$string =~ s{(\w+)}{ $1 =~ s/k/l/gr }e;
In the substitution on the replacement side, the /r modifier returns the changed string instead of modifying the original. That is also what allows a substitution to run on $1, which is read-only and cannot be modified directly.
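
The same idea can be wrapped in a small function, shown here as a sketch. The name `fix_first_word` is hypothetical, and `tr///r` stands in for the inner `s///gr` since only single characters are being swapped. Both `/r` forms need Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: transliterate k -> l in the first word only.
sub fix_first_word {
    my ($text) = @_;
    # No /g, so only the first \w+ run is matched.
    # /e evaluates the replacement as code; tr///r returns a modified copy
    # of the read-only $1; the outer /r returns the new string.
    return $text =~ s{(\w+)}{ $1 =~ tr/k/l/r }er;
}

my $string = 'gazekke is one kind of antelope';
print fix_first_word($string), "\n";   # gazelle is one kind of antelope
print "$string\n";                     # the original is left unchanged
```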

tr is a character-by-character transliterator. For anything else you would use a regex.
$string =~ s/gazekke/gazelle/;
You can put a code block as the second half of s/// (with the /e modifier) to do more complicated replacements or transmogrifications.
$string =~ s{([A-Za-z]+)}{ $should_be_mangled{$1} ? mangler($1) : $1 }ge;
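
When the replacement is code, every branch has to return the text that should end up in the result, including the unchanged word; otherwise unmatched words are replaced with an empty string. A runnable sketch of that pattern follows. The `%correction_for` table and the `correct_words` name are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical correction table; keys and values are invented for the example.
my %correction_for = (
    gazekke  => 'gazelle',
    antekope => 'antelope',
);

sub correct_words {
    my ($text) = @_;
    # /e runs the replacement as code, /g visits every word,
    # /r returns the result instead of modifying $text.
    # The defined-or fallback puts unknown words back unchanged.
    return $text =~ s{([A-Za-z]+)}{ $correction_for{$1} // $1 }ger;
}

print correct_words('gazekke is one kind of antekope'), "\n";
# prints: gazelle is one kind of antelope
```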
Edit:
Here's how you would first locate a phrase and then work on it.
my $phrase_regex = qr/(?|(gazekke) is one kind of antelope|(etc))/;
$string =~ s{($phrase_regex)}{
my $match = $1;
my $word = $2;
$match =~ s{$word}{
my $new = $new_word_map{$word};
&additional_mangling($new);
$new;
}e;
$match;
}ge;
Here's the Perl regex documentation.
https://perldoc.perl.org/perlre

Perl - Removing all special characters except a few

So I came across a Perl regex "term" that removes all punctuation. Here is the code:
$string =~ s/[[:punct:]]//g;
However, this removes all special characters. Can that regular expression be modified so that, for example, it removes all special characters except hyphens? As I stated in my previous question about Perl, I am new to the language, so obvious things aren't obvious to me. Thanks for all the help.

Change your code as shown below to remove all punctuation except hyphens:
$string =~ s/(?!-)[[:punct:]]//g;
Demo:
use strict;
use warnings;
my $string = "foo;\"-bar'.,...*(){}[]----";
$string =~ s/(?!-)[[:punct:]]//g;
print "$string\n";
Output:
foo-bar----

You may also use a Unicode property:
$string =~ s/[^-\PP]+//g;
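
The two classes are not interchangeable. In the ASCII range, [[:punct:]] also matches characters such as + and =, which Unicode classifies as symbols rather than punctuation, so \PP leaves them in place. The sketch below contrasts the two and shows the lookahead extended to keep more than one character. All three helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two character classes.
sub strip_posix_punct {
    my ($text) = @_;
    return $text =~ s/(?!-)[[:punct:]]//gr;      # POSIX class, hyphen kept
}

sub strip_unicode_punct {
    my ($text) = @_;
    return $text =~ s/[^-\PP]+//gr;              # Unicode property P, hyphen kept
}

sub strip_keep_hyphen_apostrophe {
    my ($text) = @_;
    return $text =~ s/(?![-'])[[:punct:]]//gr;   # keep both - and '
}

print strip_posix_punct('a+b=c; x-y!'), "\n";      # abc x-y
print strip_unicode_punct('a+b=c; x-y!'), "\n";    # a+b=c x-y
print strip_keep_hyphen_apostrophe("don't re-use; ok?"), "\n";   # don't re-use ok
```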

Perl regex match from last, skipping the last delimiter

I am trying to write a regex in Perl for the pattern:
""Wagner JS, Adson MA, Van Heerden JA et al (1984) The natural history of hepatic metastases from colorectal cancer. A comparison with resective treatment. Ann Surg 199:502–508""\s
to get the last part: "Ann Surg 199:502–508"
So I wrote
$string =~ m/\.([^\d]*\s\d*\:\d*\–\d*)\"\"\s$/
The match I get in $1 is: "A comparison with resective treatment. Ann Surg 199:502–508" but I expect: "Ann Surg 199:502–508".
It works in some cases but not in others. I tried searching but didn't find a satisfactory answer. Please suggest something.

You only need to add the dot to the character class, so that the capture cannot run across an earlier sentence:
$string =~ m/\.([^\d.]*\s\d*:\d*–\d*)""\\s$/
But a better way is to split the string with the dot as the delimiter and take the last part.

If you want the last part of every string, then all you need is
$string =~ /([^.]+)$/
or, to avoid the space after the full stop
$string =~ /([^.\s][^.]+)$/

Please give this a try:
$string =~ m/\.\s*([^\.\d]*\s*\d*\:\d*\–\d*)""\\s$/;

Another option, taking everything after the last period excluding leading spaces:
$string =~ m/(?!\s)([^.]+)$/
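
The split approach suggested above could look like this minimal sketch. The helper name `last_segment` is hypothetical, and the sample string assumes the surrounding quotes have already been stripped.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after the last full stop, trimmed.
sub last_segment {
    my ($text) = @_;
    my @parts = split /\./, $text;    # trailing empty fields are dropped by split
    return '' unless @parts;
    my $last = $parts[-1];
    $last =~ s/^\s+|\s+$//g;          # trim leading and trailing whitespace
    return $last;
}

my $citation = 'Wagner JS, Adson MA, Van Heerden JA et al (1984) The natural history of hepatic metastases from colorectal cancer. A comparison with resective treatment. Ann Surg 199:502–508';
print last_segment($citation), "\n";   # Ann Surg 199:502–508
```

Like the regex answers, this breaks on journal abbreviations that contain full stops (for example "Ann. Surg."), so it only suits data where the last full stop really is the delimiter.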

How to substitute whitespaces and tabs in a string with _ in Perl?

$string = I am a boy
How do I substitute the whitespace between words with underscores?

You need a regular expression and the substitution operator to do that.
my $string = 'I am a boy';
$string =~ s/\s/_/g;
You can learn more about regex in perlre and perlretut. A nice tool to play around with is Rubular.
Also, your code will not compile. You need to quote your string, and you need to put a semicolon at the end.

$string = 'I am a boy';
$string =~ s/ /_/g;

$string =~ tr( \t)(_); # Double underscore not necessary as per Dave's comment

This is just to show another option in Perl. I think Miguel Prz and imbabque showed smarter ways; personally, I follow the way imbabque showed.
my $str = "This is a test string";
$str =~ s/\p{Space}/_/g;
print $str."\n";
and the output is
This_is_a_test_string
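
The answers above differ in how they treat runs of whitespace. This short sketch puts them side by side; the sample string with a double space and a tab is an assumption chosen to make the difference visible.

```perl
use strict;
use warnings;

my $text = "I  am\ta boy";   # two spaces, then a tab

# One underscore per whitespace character.
(my $each = $text) =~ s/\s/_/g;         # I__am_a_boy

# One underscore per run of whitespace.
(my $runs = $text) =~ s/\s+/_/g;        # I_am_a_boy

# tr only touches the listed characters; /s squeezes repeated results.
(my $squeezed = $text) =~ tr/ \t/_/s;   # I_am_a_boy

print "$each\n$runs\n$squeezed\n";
```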

In Perl, how can I strip spaces from a string except in double-quotes, and replace those quotation marks with ||?

I'm trying to find a way to replace spaces and double quotes with pipes (||) while leaving the spaces within the double quotes untouched.
For example, it would make something like 'word "word word" word' into 'word||word word||word' and another like 'word word word' into 'word||word||word'.
Right now I have this to work off of:
[%- MACRO typestrip(value) PERL -%]
my $htmlVal = $stash->get('value');
$htmlVal =~ s/"/||/g;
print $htmlVal
[%- END -%]
Which handles replacing double quotes with pipes just fine.
I don't know how simple or complex this should be, or whether it can even be done. I have no actual background in programming, and while I have worked with some Perl, it has never been this kind of problem, so I apologize if I'm not explaining this well.

I think it might be easier to use the core module Text::ParseWords to split on non-quoted whitespace, then rejoin the "words" with pipes.
#!/usr/bin/env perl
use warnings;
use strict;
use Text::ParseWords;
while (my $line = <DATA>) {
print space2pipes($line);
print "\n";
}
sub space2pipes {
my $line = shift;
chomp $line;
my @words = parse_line( qr/\s+/, 0, $line );
return join '||', @words;
}
__DATA__
word "word word" word
word word word
Putting this into your templating engine is left as an exercise for the reader :-)

This is related to a frequently-asked question, answered in section 4 of the Perl FAQ.
How can I split a [character]-delimited string except when inside [character]?
Several modules can handle this sort of parsing—Text::Balanced, Text::CSV, Text::CSV_XS, and Text::ParseWords, among others.
Take the example case of trying to split a string that is comma-separated into its different fields. You can’t use split(/,/) because you shouldn’t split if the comma is inside quotes. For example, take a data line like this:
SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"
Due to the restriction of the quotes, this is a fairly complex problem. Thankfully, we have Jeffrey Friedl, author of Mastering Regular Expressions, to handle these for us. He suggests (assuming your string is contained in $text):
my @new = ();
push(@new, $+) while $text =~ m{
# groups the phrase inside the quotes
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text,-1,1) eq ',';
If you want to represent quotation marks inside a quotation-mark-delimited field, escape them with backslashes (e.g., "like \"this\"").
Alternatively, the Text::ParseWords module (part of the standard Perl distribution) lets you say:
use Text::ParseWords;
@new = quotewords(",", 0, $text);
For parsing or generating CSV, though, using Text::CSV rather than implementing it yourself is highly recommended; you’ll save yourself odd bugs popping up later by just using code which has already been tried and tested in production for years.
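
As a quick check of the Text::ParseWords route, here is a minimal sketch that runs quotewords over the sample data line from the FAQ. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Text::ParseWords qw( quotewords );

my $text = 'SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"';

# A false second argument (keep) strips the quotes from each field.
my @fields = quotewords(',', 0, $text);

printf "%d fields\n", scalar @fields;
print "[$_]\n" for @fields;   # commas inside quotes stay within their field
```

Text::ParseWords treats single quotes as quoting characters too, so data containing apostrophes needs extra care with this approach.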
Adapting the technique to your situation gives
my $htmlVal = 'word "word word" word';
my @chunks;
push @chunks, $+ while $htmlVal =~ m{
"([^\"\\]*(?:\\.[^\"\\]*)*)"
| (\S+)
}gx;
$htmlVal = join "||", @chunks;
print $htmlVal, "\n";
Output:
word||word word||word
Looking back, it turns out that this is an application of Randal’s Rule, as dubbed in Regular Expression Mastery by Mark Dominus:
Randal's Rule
Randal Schwartz (author of Learning Perl [and also a Stack Overflow user]) says:
Use capturing or m//g when you know what you want to keep.
Use split when you know what you want to throw away.
In your situation, you know what you want to keep, so use m//g to hang on to the text within quotes or otherwise separated by whitespace.
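
To make the rule concrete, here is a small sketch that tokenizes the same line both ways. The sample line is invented for the example.

```perl
use strict;
use warnings;

my $line = 'alpha, beta ,gamma';

# split: describe the separator you want to throw away.
my @by_split = split /\s*,\s*/, $line;

# m//g in list context: describe the tokens you want to keep.
my @by_match = $line =~ /([^,\s]+)/g;

print join('|', @by_split), "\n";   # alpha|beta|gamma
print join('|', @by_match), "\n";   # alpha|beta|gamma
```

Both give the same list here. The choice matters once one side is much easier to describe than the other, as with the quoted phrases above.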

While Joel's answer is fine, things can be simplified a bit by specifically using shellwords to tokenize lines:
#!/usr/bin/env perl
use strict; use warnings;
use Text::ParseWords qw( shellwords );
my @strings = (
'word "word word" word',
'word "word word" "word word"',
);
@strings = map join('||', shellwords($_)), @strings;
use YAML;
print Dump \@strings;
Isn't that more readable than a bunch of regex-gobbledygook?

Seems possible and might be useful if only a regex is applicable:
$htmlVal =~ s/(?:"([^"]+)"(\s*))|(?:(\S+)(\s*))/($1||$3).($2||$4?'||':'')/eg;
(Might be beautified a bit after closer introspection.)
input:
my $htmlVal ='word "word word" word';
output:
word||word word||word
The original code was modified after it failed on this case:
my $htmlVal ='word "word word" "word word"';
which now works too:
word||word word||word word
Explanation:
$htmlVal =~ s{
(?: " ([^"]+) " (\s*) ) # quoted phrase (group 1), trailing whitespace (group 2)
| # OR
(?: (\S+) (\s*) ) # bare word (group 3), trailing whitespace (group 4)
}{
($1||$3) . ($2||$4 ? '||' : '') # keep the text; append '||' if whitespace followed
}exg;
Regards
rbo
