Stop regex selecting first character after match - regex

I have a CSV in the following format:
"12345"|"ABC"|"ABC"[tab delimiter]
"12345"|"ABC"|"ABC"[tab delimiter]
"12345"|"ABC"|"ABC"[tab delimiter]
However, tabs also appear inside the text. I need to remove the tabs that are not preceded by a " .
I have the following regex, which highlights the tabs that are not followed by a "
\t[^\"]
but this also highlights the character after the tab. I would like to select and remove only the tab.
Note: Not sure if this matters, but I am running the command in TextPad before I run it in Perl.
EDIT test data http://pastebin.com/dYfrcSPc

Use this one:
\t(?!")
It means a tab character that is not followed by a " character.

If you cannot download a proper CSV module such as Text::CSV, you can use a lightweight alternative that is part of the core: Text::ParseWords:
use strict;
use warnings;
use Text::ParseWords;
while (<DATA>) {
my @list = quotewords('\t', 1, $_);
tr/\t//d for @list;
print join "\t", @list;
}
__DATA__
"12345"|"ABC "|"ABC" next field
"12345"|"ABC"|" ABC" next field
"123 45"|"ABC"|"ABC" next field
(Note: Tab characters might have been destroyed by stackoverflow formatting)
This will parse the lines and ignore quoted tabs. We can then simply remove them and put the line back together.

Well, the easiest way would be using negative lookbehind...
s/(?<!")\t//g;
... as it'll match only those tab characters not preceded by " character. But if your perl doesn't support it, don't worry - there's another way:
s/([^"])\t/$1/g;
... that is, replacing any non-" symbol followed by \t with that symbol alone.
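To make the difference between the two answers concrete, here is a minimal sketch that wraps the lookahead and the lookbehind in two small subs. The sub names and the sample record are made up for illustration. Neither pattern consumes the neighbouring character, which is what the original \t[^\"] got wrong.

```perl
use strict;
use warnings;

# Hypothetical helper: delete tabs that are NOT followed by a double quote.
sub strip_tabs_not_before_quote {
    my ($line) = @_;
    $line =~ s/\t(?!")//g;    # the lookahead checks the next character without consuming it
    return $line;
}

# Hypothetical helper: delete tabs that are NOT preceded by a double quote.
sub strip_tabs_not_after_quote {
    my ($line) = @_;
    $line =~ s/(?<!")\t//g;   # the lookbehind checks the previous character without consuming it
    return $line;
}

# Made-up record: two tabs inside fields, one delimiter tab after the last quote.
my $record = qq{"123\t45"|"A\tBC"|"ABC"\t};
print strip_tabs_not_after_quote($record), "\n";
```

Which of the two you want depends on whether the delimiter tab sits after a closing quote (lookbehind) or before an opening quote (lookahead).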

Related

How to Extract the string between double quotes having newline embedded in between the string?

I want to extract the text between the " quotation marks and append them. While ignoring the newlines embedded in the string.
What I have so far is something like:
$whole_text="\"Ankit Stackoverflow is \n awesome\" \"a\" asd asd \"he\nllo\"\n";
while ($whole_text=~ /(.*?)"(.*?)"(.*?)/m)
{
$whole_text=~ s/(.*?)"(.*?)"(.*?)/$2/m;
}
Expected result:
Ankit Stackoverflow is awesome and hello
You're using the wrong modifier on your regex. The /m modifier treats the string as multiple lines, whereas you actually want /s, which changes "." to match any character including \n.
Changing this alone won't get you the output you want, because you're repeatedly applying the transformation and it will delete previously extracted quoted portions too. You also want the /g modifier, which finds all possible matches so that the substitution only has to be applied once.
use strict;
my $whole_text="\"Ankit Stackoverflow is \n awesome\" \"a\" asd asd \"he\nllo\"\n";
$whole_text =~ s/(.*?)"(.*?)"(.*?)/$2/sg;
And then if you want to get rid of the \n you'd also need:
$whole_text =~ s/\n//sg;
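An alternative that avoids substitution altogether is to collect the quoted pieces with a match in list context. This is only a sketch: the @quoted array and the '|' separator are my own choices, not part of the question.

```perl
use strict;
use warnings;

my $whole_text = "\"Ankit Stackoverflow is \n awesome\" \"a\" asd asd \"he\nllo\"\n";

# In list context, /g returns every capture; /s lets "." cross newlines.
my @quoted = $whole_text =~ /"(.*?)"/sg;

# Drop the newlines embedded in each captured piece.
s/\n//g for @quoted;

# Join with a visible separator so the individual pieces are easy to inspect.
print join('|', @quoted), "\n";
```

From here you can join the pieces with whatever text you need between them.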

Removing all variations of quotes and apostrophes using Perl regex

I am trying to remove apostrophes and double quotes from a string, and have noticed that various versions creep into the data I'm using, depending on how it's created. For instance, Word documents tend to use these:
It’s raining again.
What do you mean by “weird”?
Whereas text editors are like this:
It's raining again.
What do you mean by "weird"?
As I go through the various character charts and data I've noticed that there are other variations of quotes and apostrophes, for example: http://www.fileformat.info/info/unicode/char/0022/index.htm
While I could go through and do a reasonable job of finding them all, is there an existing Perl regex or function that removes all variations of quotes and apostrophes?
In order to remove all quotation marks and apostrophes, you can use
[\p{Pi}\p{Pf}'"]
and replace each match with the empty string.
Demo:
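#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # the source below contains literal UTF-8 characters
binmode(STDOUT, ':encoding(UTF-8)');  # avoids "Wide character in print" warnings
my $st = "“Quotes1” «Quotes2» ‘Quotes3’ 'Quotes4' \"Quotes5\"";
print "Before: $st\n";
$st =~ s/[\p{Pi}\p{Pf}'"]//g;         # initial/final quote punctuation plus the ASCII quotes
print "After: $st\n";
Output: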
Before: “Quotes1” «Quotes2» ‘Quotes3’ 'Quotes4' "Quotes5"
After: Quotes1 Quotes2 Quotes3 Quotes4 Quotes5
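One caveat worth checking against your own data: \p{Pi} and \p{Pf} cover only the "initial" and "final" quote categories. The low-9 marks U+201A and U+201E are classified as open punctuation (Ps), so the class above leaves them in place. The sketch below shows that and a variant that lists them explicitly. Both sub names are hypothetical, and the \x{...} escapes are used only so the example does not depend on the file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: the class from the answer above.
sub strip_quotes {
    my ($text) = @_;
    $text =~ s/[\p{Pi}\p{Pf}'"]//g;
    return $text;
}

# Hypothetical variant: also strips the two low-9 quotation marks.
sub strip_quotes_wide {
    my ($text) = @_;
    $text =~ s/[\p{Pi}\p{Pf}\x{201A}\x{201E}'"]//g;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');
my $sample = "\x{201E}low\x{201C} and \x{201C}high\x{201D}";
print strip_quotes($sample), "\n";        # the opening low-9 mark survives
print strip_quotes_wide($sample), "\n";   # all quotation marks removed
```

If other look-alike characters turn up in your data (primes, backticks and so on), they would have to be added to the class in the same way.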

What regex should be used to avoid symbols before text in perl?

I have the following variable. I only want to print yes if the variable contains "imoport/canada/campingplaces/tobermory" on a line without a # or anything else in front of it. What should I put in the regex for this?
my $textfile = "# imoport/canada/campingplaces/tobermory
imoport/canada/campingplaces/tobermory
#imoport/canada/campingplaces/tobermory";
my $textNeeded = "imoport/canada/campingplaces/tobermory";
This is what I am using:
if ($textfile =~ m/$textNeeded/i) {
print "yes working"
}
Note: I am getting data from different text files, so some text files might only contain "#imoport/canada/campingplaces/tobermory". I want to avoid matching those.
Despite the quite vague problem description, I think I have puzzled out what you mean. You mean you may have lines where the text is commented out with #, and you want to avoid matching those.
print "yes" if $textfile =~ /^\s*$textNeeded/im;
This will match when your string appears at the start of any line in $textfile, optionally preceded by whitespace. The /m option makes the regex multiline, meaning that ^ and $ match at the start and end of each line inside a larger string, not just at the start and end of the whole string.
You may wish to be wary of regex meta characters in your search string. If for example your search string is foo[bar].txt, those brackets will be interpreted as a character class instead. In which case you would use
/^\s*\Q$textNeeded\E/im
instead. The \Q ... \E will make the text inside match only literal characters.
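Put together, the check might look like the sketch below. The sub name has_uncommented_line is made up for this example, and the sample text is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $needle starts a line of $text, ignoring leading whitespace.
sub has_uncommented_line {
    my ($text, $needle) = @_;
    # /m: ^ matches at every line start; \Q...\E: treat $needle literally; /i: ignore case.
    return $text =~ /^\s*\Q$needle\E/im ? 1 : 0;
}

my $textfile = "# imoport/canada/campingplaces/tobermory\n"
             . "imoport/canada/campingplaces/tobermory\n"
             . "#imoport/canada/campingplaces/tobermory";

print "yes\n" if has_uncommented_line($textfile, "imoport/canada/campingplaces/tobermory");
```

A file that contains only the commented-out forms makes the sub return 0, which is the case the question wants to avoid.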
I think you need an anchor to say that you only want a match if your target string appears at the BEGINNING of a line. This uses the caret symbol, together with the /m modifier so that ^ matches at the start of every line rather than only at the start of the whole string:
if ($textfile =~ m/^$textNeeded/im) {
print "yes working"
}
This won't report a match if you have spaces or tabs before your textNeeded string.
To simply return the rows having no leading hash, something like this:
my $textfile = "# imoport/canada/campingplaces/tobermory
imoport/canada/campingplaces/tobermory
#imoport/canada/campingplaces/tobermory";
for (split /^/, $textfile) {
print $_ if(m/^\s*[a-zA-Z].*/);
}
Returns:
imoport/canada/campingplaces/tobermory

Regex Replace Cleaning a string from unwanted characters

I'm creating a method to turn page titles into strings suitable for URL rewriting.
Example: "Latest news", would be "latest-news"
The problem is the page titles are out of my control and some are similar to the following:
Football & Rugby News!. Ideally this would become football-rugby-news.
I've done some work to get this to football-&-rugby-news!
Is there a possible regex to identify unwanted characters in there and the extra '-' ?
Basically, I need numbers and letters separated by a single '-'.
I only have basic knowledge of regex, and the best I could come up with was:
[^a-z0-9-]
I'm not sure if I'm being clear enough here.
Try a 'replace all' with something like this.
[^a-zA-Z0-9\\-]+
Replace the matches with a dash.
Alternative regex:
[^a-zA-Z0-9]+
This one will avoid multiple dashes if a dash itself is found near other unwanted characters.
This Perl script also does what you're looking for. Of course you'd have to feed it the string by some other means than just hardcoding it; I merely put it in there for the example.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Football & Rugby News!";
$string = lc($string); # lowercase
my $allowed = 'a-z0-9\s-'; # all permitted characters (a plain string; a qr// object would drag its (?^:...) wrapper into the character class)
$string =~ s/[^$allowed]//g; # remove all characters that are NOT in $allowed
$string =~ s/\s+/-/g; # replace all kinds of whitespace with '-'
print "$string\n";
prints
football-rugby-news
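The "alternative regex" idea from the first answer can also be written as a small sub. This is a sketch under my own assumptions: the name slugify is made up, and trimming dashes at both ends is an extra step the question did not ask for explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a page title into a URL slug.
sub slugify {
    my ($title) = @_;
    my $slug = lc $title;
    $slug =~ s/[^a-z0-9]+/-/g;   # collapse each run of unwanted characters into one dash
    $slug =~ s/^-+|-+$//g;       # trim dashes left over at either end
    return $slug;
}

print slugify("Football & Rugby News!"), "\n";
```

Because every run of unwanted characters collapses into a single dash, a title can never produce two dashes in a row.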

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs the match and replacement one line at a time, you must repeat it as many times as necessary. In some flavors you can use the g (global) flag; in Java, for example, there is the replaceFirst/replaceAll pair of methods instead.
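Both points can be tried out in Perl itself. The sketch below uses a pattern modelled on the one in the question (with \n instead of Eclipse's \R, and a non-capturing outer group) to show that only the last iteration of a repeated group is kept, then applies the per-line replacement with the m and g flags. The variable names are my own.

```perl
use strict;
use warnings;

my $block = "# print \"yes\";\n# print \"no\";\n# print \"blah\";\n";

# A capturing group inside a repetition keeps only its final iteration.
my ($last) = $block =~ /^(?:#(.*\n))+/;
# $last holds the text of the third line only.

# Per-line replacement: /m makes ^ match at every line start, /g repeats the match.
(my $uncommented = $block) =~ s/^#\s*//mg;
print $uncommented;
```

Note that \s also matches a newline, so a line consisting of a bare # would be joined with the line after it. Use [ \t]* instead if that matters for your code.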
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello#stack\n#overflow\n";
$data =~ s/^#//mg;   # /m lets ^ match at the start of every line, so only leading # characters go
print $data;
OUTPUT:-
Hello#stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg;   # remove the leading # from every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the global g switch (and, in Perl, the m switch as well, so that ^ matches at the start of every line).
s/^#(.+)/$1/mg
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]*)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]*) and matches any leading whitespace (spaces and tabs), including none at all, so unindented comment lines are handled too.
Group 2 is (\#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
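For completeness, here is how the same three-part idea could look as a Perl substitution. It is a sketch only: the sub name uncomment_block is made up, and it uses [ \t]* so that lines with no indentation are handled too.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the first # on each comment line, keeping the indentation.
sub uncomment_block {
    my ($code) = @_;
    # $1 = leading spaces/tabs (possibly none), $2 = the rest of the line after the #.
    $code =~ s/^([ \t]*)#(.*)$/$1$2/mg;
    return $code;
}

print uncomment_block("\t# print \"yes\";\n# print \"no\";\n");
```

As noted in another answer, this will also strip the # from a #!/usr/bin/perl line, and it cannot tell a real comment from a # inside a string or a regex.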