I have string 1234567890 and I want to format it as 1234 5678 90
I write this regex:
$str =~ s/(.{4})/$1 /g;
But for this case 12345678 this does not work. I get excess whitespace at the end:
>>1234 5678 <<
I tried to rewrite the regex with a lookahead:
s/((?:.{4})?=.)/$1 /g;
How to rewrite regex to fix that case?
Just use unpack
use strict;
use warnings 'all';
for ( qw/ 12345678 1234567890 / ) {
printf ">>%s<<\n", join ' ', unpack '(A4)*';
}
output
>>1234 5678<<
>>1234 5678 90<<
Context is your friend:
join(' ', $str =~ /(.{1,4})/g)
In list context, the match returns all the four-character chunks (and anything shorter than that at the end of the string, thanks to greediness). join ensures the chunks are separated by spaces and that there is no trailing space at the end.
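A minimal sketch of the list-context approach, wrapped in a helper so it can be reused. The name group_by_four is made up for this illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into chunks of up to four
# characters and join the chunks with single spaces.
sub group_by_four {
    my ($str) = @_;
    # In list context, a /g match with one capture returns every capture.
    return join ' ', $str =~ /(.{1,4})/g;
}

printf ">>%s<<\n", group_by_four($_) for qw(12345678 1234567890);
# >>1234 5678<<
# >>1234 5678 90<<
```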
If $str is huge and the temporary list increases the memory footprint too much, then you might just want to do the s///g and strip the trailing space.
My preference is for using the simplest possible patterns in regexes. Also, I haven't measured but with long strings, just a single chop might be cheaper than a conditional pattern in the s///g:
$ echo $'12345678\n123456789' | perl -lnE 's/(.{1,4})/$1 /g; chop; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<
You had the syntax almost right. Instead of just ?=., you need (?=.) (parens are part of the lookahead syntax). So:
s/((?:.{4})(?=.))/$1 /g
But you don't need the non-capturing grouping:
s/(.{4}(?=.))/$1 /g
And I think it is more clear if the capture doesn't include the lookahead:
s/(.{4})(?=.)/$1 /g
And given your example data, a non-word-boundary assertion works too:
s/(.{4})\B/$1 /g
Or using \K to automatically Keep the matched part:
s/.{4}\B\K/ /g
To fix the regex I should write:
$str =~ s/(.{4}(?=.))/$1 /g;
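I just needed to add parentheses around ?=. to turn it into a lookahead. Without them, the ? is parsed as a quantifier that makes the preceding group optional, followed by a literal = and then any character.
So we match four characters and append a space after them, but only if the lookahead confirms that at least one more character follows. For example, the regex does not match the string 1234.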
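The behaviour described above can be checked with a small sketch. The sub name space_every_four is an assumption for the example; the substitution itself is the lookahead form discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a space after every four characters,
# but only when at least one more character follows.
sub space_every_four {
    my ($str) = @_;
    $str =~ s/(.{4})(?=.)/$1 /g;
    return $str;
}

printf ">>%s<<\n", space_every_four($_) for qw(1234 12345678 1234567890);
# >>1234<<
# >>1234 5678<<
# >>1234 5678 90<<
```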
Just use a look ahead to see that you have at least one character remaining:
$ echo $'12345678\n123456789' | perl -lnE 's/.{4}\K(?=.{1})/ /g; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<
Related
I have the following situation:
^ID[ \t]*=[ \t]*('(.*)'|"(.*)")
The group with content
01
when a file contains:
ID = '01'
is the second.
Instead if:
ID = "01"
is the third.
This causes a problem with Perl:
perl -lne "print \$2 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/" test.txt
If the group with single quotes matches, I get the output:
01
Otherwise I get an empty string.
How do I make both the single-quote case and the double-quote case end up in group two of the regex?
You can print both the groups, as they can never match at the same time:
perl -lne "print \$2.\$3 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/"
or remember the quotes in $2 and use $3 for the quoted string, followed by the remembered quote:
perl -lne "print \$3 if /^ID[ \t]*=[ \t]*((['\"])(.*)\2)/"
This looks like it's a good candidate for the branch reset operator, (?|...). Either capture in that alternation is $1, and the branch-reset construct takes care of the grouping without capturing anything:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01');
foreach ( @strings ) {
say $1 if m/^ID \h* = \h* (?|'(\d+)'|"(\d+)") /x
}
You need v5.10, and that allows you to use the \h to match horizontal whitespace.
But, you don't need to repeat the pattern. You can match the quote and match that same quote later. A relative backreference, \g{N}, can do that:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
say $2 if m/^ID \h* = \h* (['"])(\d+)\g{-2} /x
}
I prefer that \g{-2} because I usually don't have to update the numbering if I change the pattern to include more captures before the thing it refers to.
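To illustrate why the relative form is convenient, here is a sketch in which an extra capture for the key name is added in front. The sub name quoted_value and the leading (\w+) capture are assumptions for the example. The \g{-2} reference still points at the quote capture, whereas an absolute \1 would now point at the key.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: return the digits between matching quotes.
sub quoted_value {
    my ($line) = @_;
    # Group 1 is the key, group 2 the opening quote, group 3 the value.
    # \g{-2} counts back two groups from where it appears, so it still
    # refers to the quote even though a capture was added in front.
    return $line =~ m/^(\w+) \h* = \h* (['"])(\d+)\g{-2} /x ? $3 : undef;
}

say quoted_value(q{ID='01'}) // 'no match';   # 01
say quoted_value(q{ID="01'}) // 'no match';   # no match
```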
And, since this is a one-liner, don't type out the literal quotes (as ikegami has already shown):
say $2 if m/^ID \h* = \h* ([\x22\x27])(\d+)\g{-2} /x
Only one of the two will be defined, so simply use the one that's defined.
perl -nle'print $1//$2 if /^ID\h*=\h*(?:\x27(.*)\x27|"(.*)")/' # \x27 is '
You could also use a backreference.
perl -nle'print $2 if /^ID\h*=\h*(["\x27])(.*)\1/'
Note that all the provided solutions including these two fail (leave the escape sequence in) if you have something like ID="abc\"def" or ID="abc\ndef", assuming those are supported.
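If backslash escapes do need to be supported, one possible sketch is below. It assumes that a backslash simply makes the next character literal (so \" becomes " and \n becomes n rather than a newline); the real escape rules of your file format may differ. The sub name quoted_value_unescaped is hypothetical.

```perl
use v5.10;
use strict;
use warnings;

sub quoted_value_unescaped {
    my ($line) = @_;
    # The body is a run of: a backslash plus any character, or a single
    # character that is neither a backslash nor the opening quote.
    $line =~ /^ID\h*=\h*(["\x27])((?:\\.|(?!\1)[^\\])*)\1/s
        or return;
    my $value = $2;
    # Assumption: drop each backslash and keep the character after it.
    $value =~ s/\\(.)/$1/gs;
    return $value;
}

say quoted_value_unescaped(q{ID="abc\"def"});   # abc"def
```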
Thank you @brian_d_foy:
perl -lne "print \$1 if /^ID\h*=\h*(?|'(.*)'|\"(.*)\")/" test.txt
Or better:
perl -lne "print \$2 if /^ID\h*=\h*(['\"])(.*)\1/" test.txt
I have decided to also accept
ID = 01 #Followed by one or more horizontal spaces.
In addition to:
ID = "01" #Followed by one or more horizontal spaces.
And:
ID = '01' #Followed by one or more horizontal spaces.
Therefore I have adopted a rather complex solution:
perl -lne "print \$2 if /^ID\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/" test.txt
I have combined both of your solutions, @brian_d_foy. The doubled parentheses put the second alternative into group two as well; otherwise it would be group one, and without the branch reset operator it would be group 4.
I later wrapped this syntax in a function:
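The group numbering described above can be checked in plain Perl before it goes into the shell function. This is only a sketch: the sub name config_value is made up, and the key is interpolated with \Q...\E on the assumption that it should be matched literally.

```perl
use v5.10;
use strict;
use warnings;

sub config_value {
    my ($key, $line) = @_;
    # Inside (?|...) each alternative restarts its numbering at 1, so the
    # value lands in group 2 whether it was quoted or bare.
    return $line =~ /^\Q$key\E\h*=\h*(?|(['"])(.*)\1|(([^\h'"]*)))\h*(?:#.*)?$/
        ? $2
        : undef;
}

say config_value('ID', $_) // 'no match'
    for q{ID = '01'}, q{ID = "01"   # comment}, q{ID = 01 }, q{ID = "01'};
# 01
# 01
# 01
# no match
```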
function parse-config {
command perl -pe "s/\R/\n/g" "$2" | command perl -lne "print \$2 if /^$1\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/"
return $?
}
parse-config "ID" "test.txt"
In this:
"s/\R/\n/g"
This replaces every CRLF, CR, or LF with LF. \R is a very powerful escape sequence available since Perl v5.10. That version apparently introduced several features that turned out to be fundamental for me: as it happens, I needed all of them (\h, \R and (?|...)). Whoever did the update was brilliant.
I needed this because the dollar sign "$" at the end of the pattern did not match: there was a "\r" before the Linux end-of-line "\n".
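A short sketch of the same normalization done inside a single Perl program rather than in a separate pipeline stage (the sub name normalize_newlines is an assumption). Note that \R also matches other vertical whitespace such as form feed and vertical tab, so it is slightly broader than "CRLF, CR, or LF".

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: turn every kind of line break into a plain "\n".
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\R/\n/g;   # \R treats "\r\n" as one unit, so it yields one "\n"
    return $text;
}

my $fixed = normalize_newlines("ID = '01'\r\nNAME = 'x'\r");
say $fixed eq "ID = '01'\nNAME = 'x'\n" ? 'normalized' : 'unexpected';
# normalized
```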
I am trying to remove all words that contain two keys (in Perl).
For example, the string
garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2
Should become
garble variable10 variable1
To just remove the normal, unappended/prepended keys is easy:
$var =~ s/(vssx|vddx)/ /g;
However I cannot figure out how to get it to remove the entire xi_21_vssx part. I tried:
$var =~ s/\s.*(vssx|vddx).*\s/ /g
Which does not work correctly. I do not understand why... it seems like \s should match the space, then .* matches anything up to one of the patterns, then the pattern, then .* matches anything preceding the pattern until the next space.
I also tried replacing \s (whitespace) with \b (word boundary), but that did not work either. Another attempt:
$var =~ s/ .*(vssx|vddx).* / /g
$var =~ s/(\s.*vssx.*\s|\s.*vddx.*\s)/ /g
As well as a few other mungings.
Any pointers/help would be greatly appreciated.
-John
I think the regex will just be
$var =~ s/\S*(vssx|vddx)\S*/ /g;
You can use
\s*\S*(?:vssx|vddx)\S*\s*
The problems with your regex were:
The .* should have been non-greedy.
The .* in front of (vssx|vddx) mustn't match whitespace characters, so you have to use \S*.
Note that there's no way to properly preserve the space between words - i.e. a vssx b will become ab.
regex101 demo.
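One way around the spacing problem is to remove only the words and clean up the whitespace in a second step. This is a sketch under that assumption, with a made-up sub name; it also collapses any other runs of whitespace in the string, which may or may not be acceptable for your data.

```perl
use strict;
use warnings;

sub remove_keyed_words {
    my ($text) = @_;
    $text =~ s/\S*(?:vssx|vddx)\S*//g;   # drop every word containing a key
    $text =~ s/\s+/ /g;                  # collapse runs of whitespace
    $text =~ s/^ | $//g;                 # trim a leading or trailing space
    return $text;
}

my $var = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
print remove_keyed_words($var), "\n";
# garble variable10 variable1
```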
I am trying to remove all words that [...]
This type of problem lends itself well to grep, which can be used to find the elements in a list that match a condition. You can use split to convert your string to a list of words and then filter it like this:
use strict;
use warnings;
use 5.010;
my $string = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
my @words = split ' ', $string;
my @filtered = grep { $_ !~ /(?:vssx|vddx)/ } @words;
say "@filtered";
Output:
garble variable10 variable1
Try this as the regex:
\b[\w]*(vssx|vddx)[\w]*\b
I have some files I am processing, and I would like to remove the dashes from the non date fields.
I came up with s/([^0-9]+)-([^0-9]+)/$1 $2/g but that only works if there is one dash only in the string, or I should say it will only remove one dash.
So let's say I have:
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
What regex would I use to produce
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Don't do it with one regex. There is no requirement that a single regex must contain all of your code's logic.
Use one regex to see if it's a date, and then a second one to do your transformation. It will be much clearer to the reader (that's you, in the future) if you split it up into two.
#!/usr/bin/perl
use warnings;
use strict;
while ( my $str = <DATA>) {
chomp $str;
my $old = $str;
if ( $str !~ /^\d{4}-\d{2}-\d{2}$/ ) { # First regex to see if it's a date
$str =~ s/-/ /g; # Second regex to do the transformation
}
print "$old\n$str\n\n";
}
__DATA__
2014-05-01
this-and
this-and-that
this-and-that-and-that-too
2015-01-01
Running that gives you:
2014-05-01
2014-05-01
this-and
this and
this-and-that
this and that
this-and-that-and-that-too
this and that and that too
2015-01-01
2015-01-01
Using look around :
$ perl -pe 's/
(?<!\d) # a negative look-behind with a digit: \d
- # a dash, literal
(?!\d) # a negative look-ahead with a digit: \d
/ /gx' file
OUTPUT
2014-05-01
this and
this and that
this and that and that too
2015-01-01
Lookarounds are assertions that ensure there's no digit (in this case) on either side of the -. A lookaround doesn't capture anything; it only tests a condition at the current position. It's a good tool to have at hand.
Check :
http://www.perlmonks.org/?node_id=518444
http://www.regular-expressions.info/lookaround.html
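A small sketch of the lookaround substitution as a function, mainly to show where it stops short. The sub name dashes_to_spaces and the mixed inputs item1-b and a-1 are invented for the illustration: a dash with a digit on either side is left alone even when the field is not a date.

```perl
use strict;
use warnings;

sub dashes_to_spaces {
    my ($field) = @_;
    # Replace a dash only if it has no digit immediately before or after it.
    $field =~ s/(?<!\d)-(?!\d)/ /g;
    return $field;
}

print dashes_to_spaces($_), "\n" for qw(2014-05-01 this-and-that item1-b a-1);
# 2014-05-01
# this and that
# item1-b
# a-1
```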
Lose the + - it's catching the string up until the last -, including any previous - characters:
s/([^0-9]|^)-+([^0-9]|$)/$1 $2/g;
Example: https://ideone.com/r2CI7v
As long as your program receives each field separately in the $_ variable, all you need is
tr/-/ / if /[^-\d]/
This should do it
$line =~ s/(\D)-/$1 /g;
As I explained in a comment, you really need to use Text::CSV to split each record into fields before you edit the data. That's because data that contain whitespace need to be enclosed in double quotes, so a field like this-and-that will start out without spaces, but needs them added when the hyphens are translated to spaces.
This program shows a simple example that uses your own data.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({eol => $/});
while (my $row = $csv->getline(\*DATA)) {
for (@$row) {
tr/-/ / unless /^\d\d\d\d-\d\d-\d\d$/;
}
$csv->print (\*STDOUT, $row);
}
__DATA__
2014-05-01,this-and-that,this-and-that,this-and-that-and-that-too,2015-01-01
output
2014-05-01,"this and that","this and that","this and that and that too",2015-01-01
I'm cleaning a file with Perl and I have one line that is a bit tough to work with.
It looks something like:
^L#$%##$^%^3456 [rest of string]
but I need to get rid of everything before the 3456
the issue is that the 3456 change every single time, so I need to use a sed command that is non specific. I should also add that the stuff before the 3456 will never be numbers
now s/^.*$someString/$someString/ works when i'm working with strings, but the same line doesn't work when it's not a string.
anyway, please help!
This will remove all non-numbers from beginning of the line,
s/^ \D+ //x;
You probably want a regular expression with a lookahead, plus non-greedy matching.
A lookahead is a pattern that would match at the current position, but doesn't consume characters:
my $str = "abc";
$str =~ s/a(?=b)//; # $str eq "bc"
Non-greedy matching modifies the * or + operator by appending a ?. It will now match as few characters as possible.
$str = "abab";
$str =~ s/.*(?=b)//; # $str eq "b"
$str = "abab";
$str =~ s/.*?(?=b)//; # $str eq "bab"
To interpolate a string that should never be treated as a pattern, protect it with \Q...\E:
my $re = "^foo.?";
$str = "abc^foo.?baz";
$str =~ s/^.*?(?=\Q$re\E)//; # $str eq "^foo.?baz" (the lookahead does not consume the match)
I need to get rid of everything before the 3456
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so
s/^(?:(?!3456).)*//s;
It can also be done using the non-greedy modifier (.*?), but I dislike using it.
s/^.*?3456/3456/s;
s/^.*?(3456)/$1/s; # Without duplication.
s/^.*?(?=3456)//s; # Without the performance penalty of captures.
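Since the number changes every time, both styles can take the marker as a variable. The sketch below uses hypothetical sub names and quotes the marker with \Q...\E. It also shows one practical difference: when the marker is absent, the non-greedy form leaves the string alone, while the (?:(?!STRING).)* form matches everything and empties it.

```perl
use strict;
use warnings;

# Non-greedy match up to a lookahead for the marker.
sub strip_before {
    my ($str, $needle) = @_;
    $str =~ s/^.*?(?=\Q$needle\E)//s;
    return $str;
}

# Consume characters one at a time while the marker does not start here.
sub strip_before_tempered {
    my ($str, $needle) = @_;
    $str =~ s/^(?:(?!\Q$needle\E).)*//s;
    return $str;
}

print strip_before('junk##3456 rest', '3456'), "\n";            # 3456 rest
print strip_before_tempered('junk##3456 rest', '3456'), "\n";   # 3456 rest
```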
How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
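The point about anchors can be seen with a two-line string. In this sketch the /m flag is written explicitly to stand in for a default flag set elsewhere in the program; the sample text is invented.

```perl
use strict;
use warnings;

my $text = "12\nabc";

# With /m in effect, ^ also matches just after the newline.
my ($with_caret) = $text =~ /^(\D+)/m;

# \A matches only at the very start of the string, whatever the flags.
my ($with_A) = $text =~ /\A(\D+)/m;

print defined $with_caret ? "caret: $with_caret\n" : "caret: no match\n";   # caret: abc
print defined $with_A     ? "A: $with_A\n"         : "A: no match\n";       # A: no match
```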
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so that you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
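Wrapped in a sub (the name prefix_before_digit is only for illustration), the list-context form with the lookahead behaves as follows. A string with no digits, or one that starts with a digit, yields undef.

```perl
use strict;
use warnings;

sub prefix_before_digit {
    my ($string) = @_;
    # A failed match returns an empty list, which leaves $prefix undefined.
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;
}

for my $s ('AAAA_BBBB_12_13_14', 'no digits here', '12abc') {
    my $p = prefix_before_digit($s);
    print defined $p ? "'$p'\n" : "undef\n";
}
# 'AAAA_BBBB_'
# undef
# undef
```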
$str =~ /(\d)/; print $`;
This code prints the part of the string that precedes the match.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_