Removing all variations of quotes and apostrophes using Perl regex

I am trying to remove apostrophes and double quotes from a string, and have noticed that several variants creep into my data depending on how it was created. For instance, Word documents tend to use these:
It’s raining again.
What do you mean by “weird”?
Whereas text editors are like this:
It's raining again.
What do you mean by "weird"?
As I go through the various character charts and data I've noticed that there are other variations of quotes and apostrophes, for example: http://www.fileformat.info/info/unicode/char/0022/index.htm
While I could go through and do a reasonable job of finding them all, is there an existing Perl regex or function that removes all variations of quotes and apostrophes?

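To remove all quotation marks and apostrophes, you can match them with
[\p{Pi}\p{Pf}'"]
and replace each match with an empty string. \p{Pi} and \p{Pf} are the Unicode categories for initial and final quote punctuation; the plain ' and " are listed explicitly because they belong to neither category.
IDEONE demo: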
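#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the source contains non-ASCII literals
binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings on print
my $st = "“Quotes1” «Quotes2» ‘Quotes3’ 'Quotes4' \"Quotes5\"";
print "Before: $st\n";
$st =~ s/[\p{Pi}\p{Pf}'"]//g;          # drop initial/final quote punctuation plus ' and "
print "After: $st\n";
Output: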
Before: “Quotes1” «Quotes2» ‘Quotes3’ 'Quotes4' "Quotes5"
After: Quotes1 Quotes2 Quotes3 Quotes4 Quotes5
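One caveat with the character class above: low-9 marks such as „ (U+201E) and ‚ (U+201A) are classified as Ps (open punctuation), so \p{Pi} and \p{Pf} do not match them. A minimal sketch of an alternative, assuming a reasonably recent Perl, uses the Unicode Quotation_Mark property instead. The sub name strip_quotes is made up for illustration:

```perl
use strict;
use warnings;
use utf8;    # this source file contains non-ASCII literals

# Hypothetical helper: delete every character that Unicode flags as a
# quotation mark (this covers ', ", curly quotes, guillemets and low-9 marks).
sub strip_quotes {
    my ($text) = @_;
    $text =~ s/\p{Quotation_Mark}//g;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');
print strip_quotes("„Quotes1“ ‚Quotes2’ «Quotes3» 'Quotes4' \"Quotes5\""), "\n";
```

Note that this property does not include the backtick or the prime characters, so those would still need to be listed explicitly if they occur in the data.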

Related

How to Extract the string between double quotes having newline embedded in between the string?

I want to extract the text between the " quotation marks and concatenate the pieces, ignoring the newlines embedded in the string.
What I have so far is something like:
$whole_text="\"Ankit Stackoverflow is \n awesome\" \"a\" asd asd \"he\nllo\"\n";
while ($whole_text=~ /(.*?)"(.*?)"(.*?)/m)
{
$whole_text=~ s/(.*?)"(.*?)"(.*?)/$2/m;
}
Expected result:
Ankit Stackoverflow is awesome and hello
You're using the wrong modifier on your regex. /m treats the string as multiple lines (it only changes where ^ and $ match), whereas what you actually want is /s, which makes "." match any character, including \n.
Changing this alone won't get you the output you want, because you're repeatedly applying the substitution, and each pass also deletes the portions that were already extracted. You also want the /g modifier, which finds all matches and applies the substitution in a single pass.
use strict;
my $whole_text="\"Ankit Stackoverflow is \n awesome\" \"a\" asd asd \"he\nllo\"\n";
$whole_text =~ s/(.*?)"(.*?)"(.*?)/$2/sg;
And then, if you want to get rid of the \n, you'd also need:
$whole_text =~ s/\n//sg;
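If the goal is only to collect the quoted pieces, a match in list context with /g avoids rewriting the string at all. A minimal sketch, assuming the quoted text never contains an escaped quote; the sub name quoted_parts is illustrative, and joining the pieces with a space is a guess at the intended output:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every "..." pair, with
# embedded newlines removed.
sub quoted_parts {
    my ($text) = @_;
    # [^"]* matches newlines too, so no /s modifier is needed here.
    my @parts = $text =~ /"([^"]*)"/g;
    tr/\n//d for @parts;    # delete newlines inside each captured piece
    return @parts;
}

my $whole_text = "\"Ankit Stackoverflow is \n awesome\" \"a\" asd asd \"he\nllo\"\n";
print join(' ', quoted_parts($whole_text)), "\n";
```

Because the pieces come back as a list, the caller decides how to append them (with a space, with " and ", or with nothing at all).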

Stop regex selecting first character after match

I have a CSV in the following format:
"12345"|"ABC"|"ABC"[tab delimiter]
"12345"|"ABC"|"ABC"[tab delimiter]
"12345"|"ABC"|"ABC"[tab delimiter]
However, tabs also appear in the text. I need to remove the tabs which are not preceded by a " .
I have the following regex, which highlights the tabs which are not followed by a "
\t[^\"]
but this highlights the character after the tab as well. I would like to select and remove only the tab.
Note: Not sure if this matters, but I am running the command in TextPad before I run it in Perl.
EDIT test data http://pastebin.com/dYfrcSPc
Use this one:
\t(?!")
It means a tab character that is not followed by a " character.
If you cannot download a proper CSV module such as Text::CSV, you can use a lightweight alternative that is part of the core: Text::ParseWords:
use strict;
use warnings;
use Text::ParseWords;
while (<DATA>) {
my @list = quotewords('\t', 1, $_);
tr/\t//d for @list;
print join "\t", @list;
}
__DATA__
"12345"|"ABC "|"ABC" next field
"12345"|"ABC"|" ABC" next field
"123 45"|"ABC"|"ABC" next field
(Note: Tab characters might have been destroyed by stackoverflow formatting)
This will parse the lines and ignore quoted tabs. We can then simply remove them and put the line back together.
Well, the easiest way would be using negative lookbehind...
s/(?<!")\t//g;
... as it'll match only those tab characters not preceded by " character. But if your perl doesn't support it, don't worry - there's another way:
s/([^"])\t/$1/g;
... that is, replacing any non-" symbol followed by \t with that symbol alone.
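The lookahead and lookbehind answers are not interchangeable: the lookahead removes a tab unless a quote follows it, while the lookbehind removes a tab unless a quote precedes it. A small sketch on a made-up record (the sample data and the sub names are assumptions, not the asker's file) shows the difference:

```perl
use strict;
use warnings;

# Hypothetical helpers, one per assertion type.
sub strip_unless_quote_follows {
    my ($s) = @_;
    $s =~ s/\t(?!")//g;     # negative lookahead
    return $s;
}

sub strip_unless_quote_precedes {
    my ($s) = @_;
    $s =~ s/(?<!")\t//g;    # negative lookbehind
    return $s;
}

# Made-up record: a stray tab inside the second field and a real
# delimiter tab after the closing quote of the last field.
my $line = qq{"12345"|"AB\tC"|"ABC"\t};

for my $result (strip_unless_quote_follows($line), strip_unless_quote_precedes($line)) {
    (my $shown = $result) =~ s/\t/[TAB]/g;    # make surviving tabs visible
    print "$shown\n";
}
```

On this record the lookahead version also deletes the delimiter tab at the end, because no quote follows it. The lookbehind version keeps it, which matches the stated goal of removing only tabs not preceded by a quote.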

Change `"` quotation marks to latex style

I'm editing a book in LaTeX and its quotation marks syntax is different from the simple " characters. So I want to convert "quoted text here" to ``quoted text here''.
I have 50 text files with lots of quotations inside. I tried to write a regular expression to substitute the first " with `` and the second " with '', but I failed. I searched the internet and asked some friends, but had no success at all. The closest I got, replacing only the first quotation mark, is
s/"[a-z]/``/g
but this is clearly wrong, since
"quoted text here"
will become
``uoted text here"
How can I solve my problem?
I'm a little confused by your approach. Shouldn't it be the other way round with s/``/"[a-z]/g? But then, I think it'll be better with:
s/``(.*?)''/"\1"/g
(.*?) captures what's between `` and ''.
\1 contains this capture.
If it's the opposite that you're looking for (i.e. I wrongly interpreted your question), then I would suggest this:
s/"(.*?)"/``\1''/g
Which works on the same principles as the previous regex.
Use the following to tackle multiple quotations, replacing all " in one step.
echo '"Quote" she said, "again."' | sed "s/\"\([^\"]*\)\"/\`\`\1''/g"
The [^\"]* avoids the need for ungreedy matching, which does not seem possible in sed.
If you are using the TeXmaker software, you could use a regular expression with the Replace command (CTRL+R), and put the following into the Find field:
"([^}]*)"
and into the Replace field:
``$1''
And then just press the Replace All button. But after that, you still have to check that everything is fine, and maybe you need to do some corrections. This has worked pretty well for me.
Try grouping the word:
sed 's/"\([a-z]\)/``\1/'
On my PC:
abhishekm71@PC:~$ echo \"hello\" | sed 's/"\([a-z]\)/``\1/'
``hello"
It depends a little on your input file (are quotes always paired, or can there be omissions?). I suggest the following robust approach:
sed 's/"\([0-9a-zA-Z]\)/``\1/g'
sed "s/\([0-9a-zA-Z]\)\"/\1\'\'/g"
Assumption: an opening quotation mark is always immediately followed by a letter or digit, and a closing quotation mark is always preceded by one. Quotations can span several words and even several input lines (some of the other solutions don't work when this happens).
Note that I also replace the closing quotation mark: depending on the fonts you use, the double quotation mark can be typeset as a neutral straight quotation mark.
You are looking for text enclosed in straight quotation marks that does not itself contain a quotation mark, so the best regex is "([^"]*?)". Replace it with ``\1''. In Perl this can be written as s/"([^"]*?)"/``\1''/g. Be careful with this approach: it only works if every opening quotation mark has a matching closing one, as in "one" two "three" four. It fails on "one" t"wo "three" four, producing ``one'' t``wo ''three" four.
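The paired-quote substitution discussed above can be sketched in Perl as follows. The sub name to_latex_quotes and the odd-count warning are illustrative additions, and the sketch assumes each file is read into a single string so that quotations spanning several lines are handled:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "text" into ``text'' and warn when the
# number of straight double quotes is odd (a likely unpaired quote).
sub to_latex_quotes {
    my ($text) = @_;
    my $count = () = $text =~ /"/g;    # count the " characters
    warn "odd number of double quotes; check the result by hand\n"
        if $count % 2;
    # [^"]* also matches newlines, so a quotation may span lines.
    $text =~ s/"([^"]*)"/``${1}''/g;
    return $text;
}

print to_latex_quotes(qq{"Quote" she said, "again."}), "\n";
```

The warning only flags an odd count; a stray pair of misplaced quotes, as in the t"wo example, still needs a manual check.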

RegEx Expression to find strings with quotation marks and a backslash

I am using a program that pastes what is in the clipboard in a modified format according to what I specify.
I would like for it to paste paths (i.e. "C:\folder\My File") without the pair of double quotes.
This, which isn't using RegEx, works: Find " (I simply enter that on one line) and replace with nothing. I enter nothing in the second field; I leave it blank.
Now, though that works, it will also remove double quotes in this scenario: Bob said "What are you doing?"
I would like the program to remove the quotes only if the words enclosed in the double quotes include a backslash.
So, once again, just to make sure I am clear, I need the following:
1) RegEx Expression to find strings that have both double quotes and a backslash within those set of quotes.
2) A RegEx Expression that says: replace the backslashes with backslashes (i.e. leave them there).
Thank you for the fast response. This program has two fields. One for what to find and the other for what to replace. So, what would go in the 2nd field?
The program came with the Remove HTML entry, which has
<[^>]*> in the match pattern
and nothing (it's blank) in the Replacement field.
You didn't say which language you use; here's an example in JavaScript:
> s = 'say "hello" and replace "C:\\folder\\My File" thanks'
"say "hello" and replace "C:\folder\My File" thanks"
> s.replace(/"([^"\\]*\\[^"]*)"/g, "$1")
"say "hello" and replace C:\folder\My File thanks"
This should work in .NET:
^".*?\\.*?"$
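For comparison, the JavaScript pattern above can be sketched in Perl, the language of the neighbouring questions. The sub name unquote_paths is illustrative. In a two-field find/replace tool the pattern would go in the find field and a reference to the first capture group in the replace field, though the exact group syntax ($1 or \1) depends on the program, which the question does not name:

```perl
use strict;
use warnings;

# Hypothetical helper: remove the surrounding double quotes only when
# the quoted text contains at least one backslash.
sub unquote_paths {
    my ($text) = @_;
    # "  [^"\\]*  up to the first backslash,  \\  the backslash itself,
    # [^"]*  the rest of the quoted text,  "  the closing quote.
    $text =~ s/"([^"\\]*\\[^"]*)"/$1/g;
    return $text;
}

# Single-quoted Perl strings keep these backslashes literally.
print unquote_paths('say "hello" and replace "C:\folder\My File" thanks'), "\n";
```

Unlike the anchored ^...$ pattern, this version also works when the quoted path sits in the middle of a longer line.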

Regex Replace Cleaning a string from unwanted characters

I'm creating a method to turn page titles into strings suitable for URL rewriting.
Example: "Latest news", would be "latest-news"
The problem is the page titles are out of my control and some are similar to the following:
Football & Rugby News!. Ideally this would become football-rugby-news.
I've done some work to get this to football-&-rugby-news!
Is there a possible regex to identify unwanted characters in there and the extra '-' ?
Basically, I need numbers and letters separated by a single '-'.
I only have basic knowledge of regex, and the best I could come up with was:
[^a-z0-9-]
I'm not sure if I'm being clear enough here.
Try a 'replace all' with something like this.
[^a-zA-Z0-9\\-]+
Replace the matches with a dash.
Alternative regex:
[^a-zA-Z0-9]+
This one will avoid multiple dashes if a dash itself is found near other unwanted characters.
This Perl script also does what you're looking for. Of course you'd have to feed it the string by some other means than just hardcoding it; I merely put it in there for the example.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Football & Rugby News!";
$string = lc($string); # lowercase
my $allowed = 'a-z0-9\s-'; # all permitted characters, written as character-class contents
$string =~ s/[^$allowed]//g; # remove all characters that are NOT in $allowed
$string =~ s/\s+/-/g; # replace all kinds of whitespace with '-'
print "$string\n";
prints
football-rugby-news
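The single-pass alternative regex suggested earlier ([^a-zA-Z0-9]+ replaced with a dash) can be sketched in Perl like this. The sub name slugify and the final trimming step are illustrative additions; the trim is there because a title ending in punctuation would otherwise leave a trailing dash:

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase the title, collapse every run of
# characters other than a-z and 0-9 into one dash, then trim dashes
# left over at either end.
sub slugify {
    my ($title) = @_;
    my $slug = lc $title;
    $slug =~ s/[^a-z0-9]+/-/g;
    $slug =~ s/^-+|-+$//g;
    return $slug;
}

print slugify("Football & Rugby News!"), "\n";
```

Since the string is lowercased first, the character class only needs a-z and 0-9. Non-ASCII letters are treated as separators here, which may or may not suit titles in other languages.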