perl regex to remove dashes - regex

I have some files I am processing, and I would like to remove the dashes from the non date fields.
I came up with s/([^0-9]+)-([^0-9]+)/$1 $2/g but that only works if there is one dash only in the string, or I should say it will only remove one dash.
So let's say I have:
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
What regex would I use to produce
2014-05-01
this and
this and that
this and that and that too
2015-01-01

Don't do it with one regex. There is no requirement that a single regex must contain all of your code's logic.
Use one regex to see if it's a date, and then a second one to do your transformation. It will be much clearer to the reader (that's you, in the future) if you split it up into two.
#!/usr/bin/perl
use warnings;
use strict;
while ( my $str = <DATA>) {
chomp $str;
my $old = $str;
if ( $str !~ /^\d{4}-\d{2}-\d{2}$/ ) { # First regex to see if it's a date
$str =~ s/-/ /g; # Second regex to do the transformation
}
print "$old\n$str\n\n";
}
__DATA__
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
Running that gives you:
2014-05-01
2014-05-01
this-and
this and
this-and-that
this and that
this-and-that-and-that-too
this and that and that too
2015-01-01
2015-01-01

Using lookaround:
$ perl -pe 's/
(?<!\d) # a negative look-behind with a digit: \d
- # a dash, literal
(?!\d) # a negative look-ahead with a digit: \d
/ /gx' file
OUTPUT
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Lookarounds are assertions that ensure (in this case) that there is no digit on either side of the -. A lookaround doesn't capture anything; it only tests a condition at the current position. It's a good tool to have at hand.
See:
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
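As a minimal sketch of how that lookaround substitution behaves, the one-liner can be wrapped in a small function. The name undash is hypothetical and only used for illustration; the pattern is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a dash with a space unless a digit
# sits directly on either side of it.
sub undash {
    my ($text) = @_;
    (my $copy = $text) =~ s/(?<!\d)-(?!\d)/ /g;
    return $copy;
}

print undash($_), "\n" for qw(2014-05-01 this-and-that-and-that-too item-2);
```

Note the limitation this exposes: a non-date field such as item-2 keeps its dash, because the dash is followed by a digit.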

Lose the + after the character classes: it's catching the string up to the last -, including any previous - characters:
s/([^0-9]|^)-+([^0-9]|$)/$1 $2/g;
Example: https://ideone.com/r2CI7v

As long as your program receives each field separately in the $_ variable, all you need is
tr/-/ / if /[^-\d]/
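A small sketch of that statement applied in a loop, assuming the fields have already been split out. The @fields array is sample data taken from the question, not part of the original answer.

```perl
use strict;
use warnings;

# Sample fields from the question; in a real program they would come
# from wherever each record is split into fields.
my @fields = qw(2014-05-01 this-and this-and-that 2015-01-01);

for (@fields) {
    # Translate dashes only when the field holds something other than
    # digits and dashes, i.e. when it cannot be a date.
    tr/-/ / if /[^-\d]/;
}

print "$_\n" for @fields;
```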

This should do it
$line =~ s/(\D)-/$1 /g;

As I explained in a comment, you really need to use Text::CSV to split each record into fields before you edit the data. That's because data that contain whitespace need to be enclosed in double quotes, so a field like this-and-that will start out without spaces, but needs them added when the hyphens are translated to spaces.
This program shows a simple example that uses your own data.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({eol => $/});
while (my $row = $csv->getline(\*DATA)) {
for (@$row) {
tr/-/ / unless /^\d\d\d\d-\d\d-\d\d$/;
}
$csv->print (\*STDOUT, $row);
}
__DATA__
2014-05-01,this-and-that,this-and-that,this-and-that-and-that-too,2015-01-01
output
2014-05-01,"this and that","this and that","this and that and that too",2015-01-01

Related

How to group string of characters by 4?

I have string 1234567890 and I want to format it as 1234 5678 90
I write this regex:
$str =~ s/(.{4})/$1 /g;
But for this case 12345678 this does not work. I get excess whitespace at the end:
>>1234 5678 <<
I try to rewrite regex with lookahead:
s/((?:.{4})?=.)/$1 /g;
How to rewrite regex to fix that case?
Just use unpack
use strict;
use warnings 'all';
for ( qw/ 12345678 1234567890 / ) {
printf ">>%s<<\n", join ' ', unpack '(A4)*';
}
output
>>1234 5678<<
>>1234 5678 90<<
Context is your friend:
join(' ', $str =~ /(.{1,4})/g)
In list context, the match returns all the four-character chunks (plus anything shorter than that at the end of the string, thanks to greediness). join ensures the chunks are separated by spaces and that there is no trailing space at the end.
If $str is huge and the temporary list increases the memory footprint too much, then you might just want to do the s///g and strip the trailing space.
My preference is for using the simplest possible patterns in regexes. Also, I haven't measured but with long strings, just a single chop might be cheaper than a conditional pattern in the s///g:
$ echo $'12345678\n123456789' | perl -lnE 's/(.{1,4})/$1 /g; chop; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<
You had the syntax almost right. Instead of just ?=., you need (?=.) (parens are part of the lookahead syntax). So:
s/((?:.{4})(?=.))/$1 /g
But you don't need the non-capturing grouping:
s/(.{4}(?=.))/$1 /g
And I think it is more clear if the capture doesn't include the lookahead:
s/(.{4})(?=.)/$1 /g
And given your example data, a non-word-boundary assertion works too:
s/(.{4})\B/$1 /g
Or using \K to automatically Keep the matched part:
s/.{4}\B\K/ /g
To fix the regex I should write:
$str =~ s/(.{4}(?=.))/$1 /g;
I should just add parentheses around ?=.. Without them, the ? is parsed as a quantifier that makes the preceding group optional, followed by a literal = and any character.
So we match four characters and append space after them. Then I look ahead that there are still characters. For example, the regex will not match for string 1234
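The behaviour described above can be checked with a short sketch. The function name group_by_4 is hypothetical; the substitution is the corrected one with the lookahead kept outside the capture.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a space after every four characters,
# but only while at least one more character follows.
sub group_by_4 {
    my ($str) = @_;
    $str =~ s/(.{4})(?=.)/$1 /g;
    return $str;
}

printf ">>%s<<\n", group_by_4($_) for qw(1234 12345678 1234567890);
```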
Just use a look ahead to see that you have at least one character remaining:
$ echo $'12345678\n123456789' | perl -lnE 's/.{4}\K(?=.{1})/ /g; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

I have a perl regex which converts hyphens to spaces eg:-
$string =~ s/-/ /g;
I need to modify this to ignore specific hyphenated phrases and not replace the hyphen e.g. in a string like this:
"use-either-dvi-d-or-dvi-i"
I wish to NOT replace the hyphen in dvi-d and dvi-i so it reads:
"use either dvi-d or dvi-i"
I have tried various negative look ahead matches but failed miserably.
You can use this PCRE regex with verbs (*SKIP)(*F) to skip certain words from your match:
dvi-[id](*SKIP)(*F)|-
This will skip words dvi-i and dvi-d for splitting due to use of (*SKIP)(*F).
For your code:
$string =~ s/dvi-[id](*SKIP)(*F)|-/ /g;
There is an alternate lookarounds based solution as well:
/(?<!dvi)-|-(?![di])/
Which basically means match hyphen if it is not preceded by dvi OR if it is not followed by d or i, thus making sure to not match - when we have dvi on LHS and [di] on RHS.
Perl code:
$string =~ s/(?<!dvi)-|-(?![di])/ /g;
$string =~ s/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/ /g;
While using just (?<!dvi)-(?![id]) you will exclude also dvi-x or x-i, where x can be any character.
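The patterns proposed so far can be compared side by side with a sketch like the following. The hash keys are just labels chosen for illustration; the patterns themselves are copied from the answers above.

```perl
use strict;
use warnings;

my $input = 'use-either-dvi-d-or-dvi-i';

# The three substitutions quoted above, keyed by a short label.
my %pattern = (
    skip_fail    => qr/dvi-[id](*SKIP)(*F)|-/,
    two_branch   => qr/(?<!dvi)-|-(?![di])/,
    three_branch => qr/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/,
);

my %result;
for my $name ( sort keys %pattern ) {
    # Work on a copy so every pattern sees the same input.
    ( my $copy = $input ) =~ s/$pattern{$name}/ /g;
    $result{$name} = $copy;
    print "$name: $copy\n";
}
```

On this input all three should yield the same string; they can differ on inputs where dvi-d or dvi-i is followed by more word characters, which the sample does not cover.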
It is unlikely that you could get a simple and straightforward regex solution to this. However, you could try the following:
#!/usr/bin/env perl
use strict;
use warnings;
my %whitelist = map { $_ => 1 } qw( dvi-d dvi-i );
my $string = 'use-either-dvi-d-or-dvi-i';
while ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }gx ) {
my $segment = substr($string, $-[0], $+[0] - $-[0]);
unless ( $whitelist{ $segment } ) {
substr( $string, $-[2], 1, ' ');
}
pos( $string ) = $-[ 3 ];
}
print $string, "\n";
The @- array contains the starting offsets of matched groups, and the @+ array contains the end offsets. In both cases, element 0 refers to the whole match.
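To make the offsets concrete, here is a small sketch that prints them for a single match. The shortened input string and the @spans array are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $string = 'use-either';
my @spans;    # [start, end] for the whole match and for each group

if ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }x ) {
    # Index 0 is the whole match; 1..3 are the capture groups.
    for my $i ( 0 .. $#- ) {
        push @spans, [ $-[$i], $+[$i] ];
        printf "group %d: start=%d end=%d text=%s\n",
            $i, $-[$i], $+[$i],
            substr( $string, $-[$i], $+[$i] - $-[$i] );
    }
}
```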
I had to resort to something like this because of how \G works:
Note also that s/// will refuse to overwrite part of a substitution that has already been replaced; so for example this will stop after the first iteration, rather than iterating its way backwards through the string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
Maybe @tchrist can figure out how to bend various assertions to his will.
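We can ignore specific words using negative look-ahead and negative look-behind assertions.
Example:
(?!pattern)
is a negative look-ahead assertion, and (?<!pattern) is its look-behind counterpart.
In your case the substitution is
$string =~ s/(?<!dvi)-(?<![id])/ /g;
Note that the trailing (?<![id]) is a look-behind, so it inspects the hyphen that was just matched and always succeeds; the pattern behaves the same as s/(?<!dvi)-/ /g.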
output :
use either dvi-d or dvi-i
Reference : http://www.perlmonks.org/?node_id=518444
Hope this will help you.

How to find and replace a regex in perl

I have a text and want to replace all \w\(, for example myword( to the same with a space, so it should be myword (. How to do that with s///? or is there another way to do that?
Try this
$s = "myword( word2(";
$s =~s/(\w+)(\()/$1 $2/g;
print $s;
Following @ikegami's comment: the \w+ in my regex above backtracks needlessly, and there is no need to capture the (, because it is a known literal. So I changed my regex accordingly.
New regex:
$s =~s/(\w)\(/$1 (/g;
Here is a way:
my $str = "myword(";
$str =~ s/(\w+)(\()/$1 $2/;
print $str, "\n";
Output:
myword (
Use look ahead:
$ perl -pe 's/(\w)(?=\()/$1 /' <<< 'word('
word (
Or look ahead together with look behind:
$ perl -pe 's/(?<=\w)(?=\()/ /' <<< 'word('
word (
s/\w\K\(/ (/g # 5.10+
or
s/(\w)\(/$1 (/g
or
s/(?<=\w)\(/ (/g
The first is much faster than the other two, but all are faster than the other correct solutions provided. (Not sure which is the fastest of the second and third.)
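The speed claim is easy to measure locally. Below is a rough sketch using the core Benchmark module; the test string, the labels, and the BENCH environment variable that switches the timing run on are all assumptions made for this example.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary sample input, repeated to give the regex engine some work.
my $text = 'myword( myword2(' x 100;

my %variant = (
    keep       => sub { ( my $s = $text ) =~ s/\w\K\(/ (/g;    $s },
    capture    => sub { ( my $s = $text ) =~ s/(\w)\(/$1 (/g;  $s },
    lookbehind => sub { ( my $s = $text ) =~ s/(?<=\w)\(/ (/g; $s },
);

# All three variants should produce the same string.
my %out = map { $_ => $variant{$_}->() } keys %variant;

# Timing is opt-in: run with BENCH=1 to compare the variants,
# each for at least one CPU second.
cmpthese( -1, \%variant ) if $ENV{BENCH};
```

Results will vary with the Perl version and the input, so treat any numbers as indicative only.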
Another way will be to use \K i.e forget what you matched before:
#!/usr/bin/perl
use strict;
use warnings;
my $string = q{myword( myword2(};
$string=~s/\w\K\(/ (/g;
print $string,"\n";

perl regex partial word match

I am trying to remove all words that contain two keys (in Perl).
For example, the string
garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2
Should become
garble variable10 variable1
To just remove the normal, unappended/prepended keys is easy:
$var =~ s/(vssx|vddx)/ /g;
However I cannot figure out how to get it to remove the entire xi_21_vssx part. I tried:
$var =~ s/\s.*(vssx|vddx).*\s/ /g
Which does not work correctly. I do not understand why... it seems like \s should match the space, then .* matches anything up to one of the patterns, then the pattern, then .* matches anything preceding the pattern until the next space.
I also tried replacing \s (whitespace) with \b (word boundary), but that did not work either. Another attempt:
$var =~ s/ .*(vssx|vddx).* / /g
$var =~ s/(\s.*vssx.*\s|\s.*vddx.*\s)/ /g
As well as a few other mungings.
Any pointers/help would be greatly appreciated.
-John
I think the regex will just be
$var =~ s/\S*(vssx|vddx)\S*/ /g;
You can use
\s*\S*(?:vssx|vddx)\S*\s*
The problems with your regex were:
The .* should have been non-greedy.
The .* in front of (vssx|vddx) mustn't match whitespace characters, so you have to use \S*.
Note that there's no way to properly preserve the space between words - i.e. a vssx b will become ab.
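One possible workaround for the spacing issue, sketched under the assumption that single spaces between the surviving words are acceptable: remove the matching words without consuming the surrounding whitespace, then normalize the whitespace in a second step. The function name strip_keyed_words is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every word containing vssx or vddx,
# then tidy up the whitespace left behind.
sub strip_keyed_words {
    my ($text) = @_;
    $text =~ s/\S*(?:vssx|vddx)\S*//g;    # remove the matching words
    $text =~ s/\s+/ /g;                   # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;              # trim both ends
    return $text;
}

print strip_keyed_words('a vssx b'), "\n";
```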
I am trying to remove all words that [...]
This type of problem lends itself well to grep, which can be used to find the elements in a list that match a condition. You can use split to convert your string to a list of words and then filter it like this:
use strict;
use warnings;
use 5.010;
my $string = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
my @words = split ' ', $string;
my @filtered = grep { $_ !~ /(?:vssx|vddx)/ } @words;
say "@filtered";
Output:
garble variable10 variable1
Try this as the regex:
\b[\w]*(vssx|vddx)[\w]*\b

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
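The point about \A versus ^ can be seen in a short sketch. Here the /m flag is written explicitly to stand in for a default flag set elsewhere (for example with use re '/m'), and the two-line input string is made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up input: the first line starts with digits.
my $text = "12\nAAAA_BBBB_12";

# With /m in effect, ^ may match right after the embedded newline;
# \A only ever matches at the very start of the string.
my ($with_caret) = $text =~ m/^(\D+)/m;
my ($with_A)     = $text =~ m/\A(\D+)/m;

print defined $with_caret ? "caret: $with_caret\n" : "caret: no match\n";
print defined $with_A     ? "\\A: $with_A\n"       : "\\A: no match\n";
```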
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so that you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
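A minimal sketch of that last form wrapped in a function, so the no-match cases are easy to check. The name prefix_before_digit is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading run of non-digits, but only
# if a digit follows it; otherwise return undef.
sub prefix_before_digit {
    my ($string) = @_;
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;    # undef when the match fails
}

for my $s ( 'AAAA_BBBB_12_13_14', 'no_digits_here', '123' ) {
    my $p = prefix_before_digit($s);
    print "$s => ", ( defined $p ? "'$p'" : 'undef' ), "\n";
}
```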
$str =~ /(\d)/; print $`;
This code prints the part of the string that precedes the match.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_