Removing delimiters from a date/time string

Removing delimiters from a date/time string - regex

I want to take this
Code:
2010-12-21 20:00:00
and make it look like this:
Code:
20101221200000
This is the last thing I tried
Code:
#!/usr/bin/perl -w
use strict;
my ($teststring) = '2010-12-21 20:00:00';
my $result = " ";
print "$teststring\n";
$teststring =~ "/(d\{4\})(d\{3\})(d\{3\})(d\{3\})(d\{3\})(d\{3\})/$result";
{
print "$_\n";
print "$result\n";
print "$teststring\n";
}
And it produced this:
Code:
nathan#debian:~/Desktop$ ./ptest
2010-12-21 20:00:00
Use of uninitialized value $_ in concatenation (.) or string at ./ptest line 8.
2010-12-21 20:00:00
nathan#debian:~/Desktop$
-Thanks

First, here is the problem with your code:
$teststring =~ "/(d\{4\})(d\{3\})(d\{3\})(d\{3\})(d\{3\})(d\{3\})/$result";
You want to use =~ with the substitution operator s///. That is, the right hand side should not be a plain string, but s/pattern/replacement/.
In the pattern part, \d would denote a digit. However, \d includes all sorts characters that are in the Unicode digit class, so it is safer to use the character class [0-9] if that's what you want to match against. [0-9]{4} would mean match characters 0 through 9 four times. Note that you should not escape the curly brackets { and }.
The parentheses ( and ) define capture groups. In the replacement part, you want to keep the stuff you captured, and ignore the stuff you did not.
In addition, I am assuming these timestamps occur in other input, and you do not want to accidentally replace stuff you did not mean to (by blindly removing all non-digits).
Below, I use the /x modifier for the s/// operator so I can format the pattern more clearly using white-space.
#!/usr/bin/perl
use strict; use warnings;
while ( <DATA> ) {
s{
^
([0-9]{4})-
([0-9]{2})-
([0-9]{2})[ ]
([0-9]{2}):
([0-9]{2}):
([0-9]{2})
}{$1$2$3$4$5$6}x;
print;
}
__DATA__
Code:
2010-12-21 20:00:00
or, using named capture groups introduced in 5.10 can make the whole thing slightly more readable:
#!/usr/bin/perl
use 5.010;
while ( <DATA> ) {
s{
^
( ?<year> [0-9]{4} ) -
( ?<month> [0-9]{2} ) -
( ?<day> [0-9]{2} ) [ ]
( ?<hour> [0-9]{2} ) :
( ?<min> [0-9]{2} ) :
( ?<sec> [0-9]{2} )
}
{
local $";
"#+{qw(year month day hour min sec)}"
}ex;
print;
}
__DATA__
Code:
2010-12-21 20:00:00

Use a regular expression to replace all non-digits ([^\d] or [\D]) with the empty string:
$ perl -e '$_ = "2010-12-21 20:00:00"; s/[\D]//g; print $_;'
20101221200000

Can't you just remove anything that's not a digit?
s/[^\d]//g
in sed format, can't remember the perl.

($result = $teststring) =~ y/0-9//cd;

Related

How can I find sentences nested deeper than one bracket '()' set?

I want to print sentences from text file placed in () brackets deeper than one pair of brackets.
For example for this text file :
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
the output should be :
print me
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
hhhhh
This is what I've done so far:
#!/usr/bin/perl -w
open(FILE, "<", $ARGV[0]) or die "file open error";
if ( #ARGV ) #if there are args
{
if ( -f $ARGV[0] ) #if its regular file
{
while(<FILE>)
{
my #array = split('\)',$_);
foreach(#array)
{
if ($_ =~ /.*\((.*)/)
{
print "$1\n";
}
}
}
close(FILE);
}
else{
print "Arg is not a file\n";}
}
else{
print "no args\n";}
My code can't separate the sentences placed in deeper brackets.

Assuming brackets are balanced:
use strict;
use warnings;
my #a;
while (<DATA>) {
while (/\(([^()]*(?:\(((?1))\)[^()]*(?{push #a, $2}))*+)\)/g){}
}
print join "\n", #a;
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bb(xxxx)b))aa)
blabla (blabla(hhhhh))
It returns:
print me
cccc
xxxx
bbbb(cccc)bb(xxxx)b
eeee(bbbb(cccc)bb(xxxx)b)
hhhhh
The idea is to store the capture group 2 content after each recursion, using the (?{...}) construct to execute code in the pattern.
Note that the order of results isn't ideal since the innermost content appears first. Unfortunately, I didn't find a way to change the order of results.
Pattern details:
\( # opening bracket level 1
( # open capture group 1
[^()]* # all that is not a bracket
(?:
\( # opening bracket for level 2 (or more when a recursion occurs)
( # capture group 2: to store the result
(?1) # recursion
)
\) # closing bracket for level 2 (or more ...)
[^()]* #
(?{push #a, $2}) # store the capture group 2 content in #a
)*+ # repeat when needed
)
\) # closing bracket level 1
EDIT: This pattern assumes that brackets are balanced, but if it isn't the case, this may cause problems of unwanted results for certain strings. The reason is that results are stored before the whole pattern succeeds.
Example with the string 1234 ( 5678 (abcd(efgh)ijkl) where a closing bracket is missing:
1234 ( 5678 (abcd(efgh)ijkl)
# ^ ^---- second attempt succeeds, "efgh" is stored
# '---- first attempt fails, but "efgh", "abcd(efgh)ijkl" are stored
To solve the problem, you can choose between two default behaviours:
the strict behaviour that only accepts balanced brackets. All you need is to store the results in a temporary array and to reset this array in the while loop or when a closing bracket is missing. In this case the result will only be "efgh":
my #a;
my #b;
while (<DATA>) {
while (/\(([^()]*(?:\(((?1))\)[^()]*(?{push #b, $2}))*+)(?:\)|(?{undef #b})(*F))/g) {
push #a, #b;
undef #b;
}
}
a more tolerant behaviour that doesn't make mandatory the closing bracket. To do that you must replace each \) with (?:\)|$). In this case, the first attempt succeeds and consumes characters until the end of the string (in other words, there isn't a second attempt). The results are "efgh" and "abcd(efgh)ijkl"

This is probably easiest, and the most maintainable with a two-pass solution.
The initial pass captures all first level parentheses. The second pass captures all enclosed parenthesis groups, only advancing a single character in order to match every level of embedded paren groups:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
my $data = do { local $/; <DATA> };
my $parens_content_re = qr{
\(
(
(?:
[^()]*+
|
\( (?1) \)
)*
)
\)
}x;
say for map {/(?=$parens_content_re)\(/g} map {/$parens_content_re/g} $data;
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
----(----(aaaa(123)bbbb(456)cccc)----)----
Outputs:
$ perl parens.pl
print me
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
hhhhh
aaaa(123)bbbb(456)cccc
123
456

This code works by capturing levels recursively, using a simple regex for ) and split-ing by ( for the opening paren. It first prepares by peeling off the two starting layers of nesting. It works for shown examples, and a few others. However, there are other ways to nest pairs, for which rules are not specified. Also, this is probably rough around the edges. There is no magic of any kind involved and adjusting code for new cases should be feasible.
use warnings;
use strict;
my ($lev, #el, #res, $rret);
while (my $str = <DATA>)
{
print "\nString: $str\n";
#res = ();
# Drop two layers to start: strip last two ), split by ( and drop 0,1
$str =~ s/ (.*) \) [^)]* \) [^)]* $/$1/x;
#el = split '\(', $str;
#el = #el[2..$#el];
# Edge case: may have one element and be done, but with extra )
if (#el > 1) { $lev = join '(', #el }
else { ($lev = $el[0]) =~ s|\)||g }
push #res, $lev;
# Get next level and join string back, recursively
while ( $rret = nest_one($lev) ) {
$lev = join '(', #$rret;
push #res, $lev;
last if #$rret == 1;
}
print "\t$_\n" for #res;
}
# Strip last ) and past it, split by ( and drop first element
sub nest_one {
(my $lev = $_[0]) =~ s/(.*) \) [^)]* $/$1/x;
my #el = split '\(', $lev;
shift #el;
return (#el) ? \#el : undef;
}
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
It prints
blabla(nothing(print me)) nanana (nanan)
print me
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
blabla (blabla(hhhhh))
hhhhh

Perl Parsing CSV file with embedded commas

I'm parsing a CSV file with embedded commas, and obviously, using split() has a few limitations due to this.
One thing I should note is that the values with embedded commas are surrounded by parentheses, double quotes, or both...
for example:
(Date, Notional),
"Date, Notional",
"(Date, Notional)"
Also, I'm trying to do this without using any modules for certain reasons I don't want to go into right now...
Can anyone help me out with this?

This should do what you need. It works in a very similar way to the code in Text::CSV_PP, but doesn't allow for escaped characters within the field as you say you have none
use strict;
use warnings;
use 5.010;
my $re = qr/(?| "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my #fields = "$line," =~ /$re/g;
say "<$_>" for #fields;
output
<Date, Notional 1>
<Date, Notional 2>
<Date, Notional 3>
Update
Here's a version for older Perls (prior to version 10) that don't have the regex branch reset construct. It produces identical output to the above
use strict;
use warnings;
use 5.010;
my $re = qr/(?: "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my #fields = grep defined, "$line," =~ /$re/g;
say "<$_>" for #fields;

I know you already have a working solution with Borodin's answer, but for the record there is also a simple solution with split (see the results at the bottom of the online demo). This situation sounds very similar to regex match a pattern unless....
#!/usr/bin/perl
$regex = '(?:\([^\)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*';
$subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
#splits = split($regex, $subject);
print "\n*** Splits ***\n";
foreach(#splits) { print "$_\n"; }
How it Works
The left side of the alternation | matches complete (parentheses) and (quotes), then deliberately fails. The right side matches commas, and we know they are the right commas because they were not matched by the expression on the left.
Possible Refinements
If desired, the parenthess-matching portion could be made recursive to match (nested(parens))
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

I know that this is quite old question, but for completeness I would like to add solution from great book "Mastering Regular Expressions" by Jeffrey Friedl (page 271):
sub parse_csv {
my $text = shift; # record containing comma-separated values
my #fields = ( );
my $field;
chomp($text);
while ($text =~ m{\G(?:^|,)(?:"((?>[^"]*)(?:""[^"]*)*)"|([^",]*))}gx) {
if (defined $2) {
$field = $2;
} else {
$field = $1;
$field =~ s/""/"/g;
}
# print "[$field]";
push #fields, $field;
}
return #fields;
}
Try it against test row:
my $line = q(Ten Thousand,10000, 2710 ,,"10,000",,"It's ""10 Grand"", baby",10K);
my #fields = parse_csv($line);
my $i;
for ($i = 0; $i < #fields; $i++) {
print "$fields[$i],";
}
print "\n";

Regex equivalent

What is the regex equivalent of $string=~/[^x]/ if x is replaced by multi-character string say xyz ? i.e string doesn't contain contain xyz
I eventually want to match
$string = 'beginning string xyz remaining string which doesn't contain xyz';
using
$string =~/(<pattern>)xyz(<pattern>)xyz/
so that
$1 = 'beginning string '
$2 = ' remaining string which doesn't contain '

(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.
(Actually, far more than just strings can be used in this fashion. For example, you can use STRING1|STRING2 just as well as for STRING.)
$string =~ /
( (?:(?!xyz).)* )
xyz
( (?:(?!xyz).)* )
xyz
/sx
If that matches, that will always match at position zero, so let's anchor it to prevent needless backtracking on failure.
$string =~ /
^
( (?:(?!xyz).)* )
xyz
( (?:(?!xyz).)* )
xyz
/sx

In your particular case, a non-greedy .* will work. That is:
(.*?)xyz(.*?)xyz
will give you what you're looking for, as shown in http://rubular.com/r/RtaMG6ZvWK
However, as pointed out in the comment from #ikegami below, this is a fragile approach. And it turns out there is a "string" counterpart to the character-based [^...] construct, as shown in #ikegami's answer https://stackoverflow.com/a/20367916/1008891
You can see this in rubular at http://rubular.com/r/zsO1F0nkXu

while (<>) {
if (/(([^x]|x(?!yz))+)xyz(([^x]|x(?!yz))+)xyz/) {
printf("'%s' '%s'\n", $1, $3);
}
}

Matching balanced parenthesis in Perl regex

I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?

Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;

There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;

I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)

To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;

Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);

Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".

How can I match a pipe character followed by whitespace and another pipe?

I am trying to find all matches in a string that begins with | |.
I have tried: if ($line =~ m/^\\\|\s\\\|/) which didn't work.
Any ideas?

You are escaping the pipe one time too many, effectively escaping the backslash instead.
print "YES!" if ($line =~ m/^\|\s\|/);

Pipe character should be escaped with a single backslash in a Perl regex. (Perl regexes are a bit different from POSIX regexes. If you're using this in, say, grep, things would be a bit different.) If you're specifically looking for a space between them, then use an unescaped space. They're perfectly acceptable in a Perl regex. Here's a brief test program:
my #lines = <DATA>;
for (#lines) {
print if /^\| \|/;
}
__DATA__
| | Good - space
|| Bad - no space
| | Bad - tab
| | Bad - beginning space
Bad - no bars

If it's a literal string you're searching for, you don't need a regular expression.
my $search_for = '| |';
my $search_in = whatever();
if ( substr( $search_in, 0, length $search_for ) eq $search_for ) {
print "found '$search_for' at start of string.\n";
}
Or it might be clearer to do this:
my $search_for = '| |';
my $search_in = whatever();
if ( 0 == index( $search_in, $search_for ) ) {
print "found '$search_for' at start of string.\n";
}
You might also want to look at quotemeta when you want to use a literal in a regexp.

Remove the ^ and the double back-slashes. The ^ forces the string to be at the beginning of the string. Since you're looking for all matches in one string, that's probably not what you want.
m/\|\s\|/

What about:
m/^\|\s*\|/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Removing delimiters from a date/time string - regex

Use a regular expression to replace all non-digits ([^\d] or [\D]) with the empty string: $ perl -e '$_ = "2010-12-21 20:00:00"; s/[\D]//g; print $_;' 20101221200000

Can't you just remove anything that's not a digit? s/[^\d]//g in sed format, can't remember the perl.

($result = $teststring) =~ y/0-9//cd;

Related

How can I find sentences nested deeper than one bracket '()' set?

Perl Parsing CSV file with embedded commas

Regex equivalent

Matching balanced parenthesis in Perl regex

How can I match a pipe character followed by whitespace and another pipe?

Categories

Resources