Substitute first character before match - regex

For each line, I need to insert a semicolon immediately before the first alphanumeric character that follows the first semicolon.
Example:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-via-summer_school.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-via-summer_school.7z.001;file;
Could someone please help me create a Perl regex for that? I need it in a program, not as a one-liner.

This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.
s/;\s*\K(?=\S)/;/
If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.
The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
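As a minimal sketch of how this substitution could sit inside a program rather than a one-liner, the snippet below wraps it in a small sub and runs it on two of the sample lines. The sub name add_semicolon is only an illustrative choice, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # \K was introduced in Perl 5.10

# Hypothetical helper: insert one semicolon before the first
# non-whitespace character that follows the first semicolon.
sub add_semicolon {
    my ($line) = @_;
    $line =~ s/;\s*\K(?=\S)/;/;    # no /g, so only the first match is changed
    return $line;
}

for my $line ( '00000001;Root;;', '00000002; Documents;;' ) {
    print add_semicolon($line), "\n";
}
```

Because the substitution has no /g modifier, later semicolons on the same line are left alone.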

First of all, here is a program that seems to match your requirements:
#!/usr/bin/perl -w
while (<>) {
    s/^(.*?;.*?)(\w)/$1;$2/;
    print $_;
}
Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl' and run it on your input data like this:
./program.pl input-data.txt
Here is an explanation of the regular expression:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Based on your sample input, I would use a more specific regular expression:
s/^(\d*; *)(\w)/$1;$2/;
This expression starts at the beginning of the line and skips over digits (\d*), the first semicolon, and any spaces that follow. It then inserts a semicolon before the following word character.
Take what fits best to your needs!
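To make the difference between the two expressions concrete, here is a small sketch that applies both to a sample line and to a made-up line whose name starts with a dot. The '.hidden' entry is a hypothetical input that does not appear in the question, and the sub names are illustrative only.

```perl
use strict;
use warnings;

# The general expression: skip to the first semicolon, then to the next \w.
sub generic {
    my ($line) = @_;
    $line =~ s/^(.*?;.*?)(\w)/$1;$2/;
    return $line;
}

# The more specific expression: digits, semicolon, spaces, then \w.
sub specific {
    my ($line) = @_;
    $line =~ s/^(\d*; *)(\w)/$1;$2/;
    return $line;
}

# '.hidden' is an invented edge case, not part of the original data.
for my $line ( '00000002; Documents;;', '00000011; .hidden;;' ) {
    print "generic:  ", generic($line),  "\n";
    print "specific: ", specific($line), "\n";
}
```

On the sample line both give the same result. On the invented line the general expression skips the dot and inserts the semicolon before the h, while the specific one does not match at all and leaves the line unchanged, so which one fits depends on what the file names can start with.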

First of all thank you for your really great answers!
Actually my code snippet looks like this:
our $seperator = ";";    # at the beginning of the file
#...
sub insert {
    my ( $seperator, $line, @all_lines, $count, @all_out );
    $count = 0;
    @all_lines = read_file($filename);
    foreach $line (@all_lines) {
        $count = sprintf( "%08d", $count );
        chomp $line;
        $line =~ s/\:/$seperator/;    # works
        $line =~ s/\ file/file/;      # works
        #$line=~s/;\s*\K(?=\S)/;/; # doesn't work
        $line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/;    # doesn't work
        say $count . $seperator . $line . $seperator;
        $count++;    # btw, is there maybe a hidden index variable in a foreach-loop I could use instead of a new variable??
        push( @all_out, $count . $seperator . $line . $seperator . "\n" );
    }
    write_file( $csvfile, @all_out );    # using File::Slurp
}
In order to get the input which I presented you, I made already some small substitutions, as you can see in the beginning of the foreach-loop.
I am curious, why the regular expressions presented by TLP and Yaakov do not work in my code. In general they work, but only when written like in the example which Yaakov gave:
while (<>) {
    s/^(.*?;.*?)(\w)/$1;$2/;
    print $_;
}

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
    print "1 is : $1\n";
    push (@matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only match once); when you make this change you'll also need the /g modifier (the /c used in the code further down is optional here)
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parentheses around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
    print "1 is : $1\n";
    push (@matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split that also removes the null_ prefix, with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
    my $text = $1;
    my @text;
    if ($text =~ /,/) {
        @text = split /,(?:null_)?/, $text;
    }
    else {
        @text = $text;
    }
    for (@text) {
        print "1 is : $_\n";
        push (@matching_lines, $_);
    }
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
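One possible way to handle the remaining lone CMD_ID=null line is sketched below: split every value on commas first, strip the null_ prefix from each item, and skip an item that is exactly null. This is only one reading of the desired output. The sub name extract_values and the shortened sample text (with one line indented on purpose to exercise \s*) are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the values of CMD_ID= and
# CMD_USES_ASSET_ENV= lines, without "null_" prefixes or bare "null".
sub extract_values {
    my ($output) = @_;
    my @values;
    # /m lets ^ match at every line start; /g walks through all matches.
    while ( $output =~ /^\s*(?:CMD_ID|CMD_USES_ASSET_ENV)=(.+)/gm ) {
        for my $item ( split /,/, $1 ) {
            $item =~ s/^null_//;          # drop the prefix if present
            next if $item eq 'null';      # skip the lone "null" value
            push @values, $item;
        }
    }
    return @values;
}

# Shortened sample based on the question; the indentation is illustrative.
my $sample = <<'END';
COMPONENT=SecurityJNI
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
 CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
END

print "$_\n" for extract_values($sample);
```

Splitting before stripping the prefix keeps the main pattern simple, at the cost of one extra substitution per item.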

Perl regex to match up to but not including optional end-token

I am trying to efficiently match lines up to but not including an optional end-token.
/(.*)(?:$tok)?/
doesn't work. The end-token is optional, hence the final ?, but then the first group
greedily captures it.
/(.*?)(?:$tok)?/
also doesn't work: the first group matches a zero-length string
The best I can do so far is
my $tok = 'end';
while (<>) {
my ($line) = /
(?| # 'branch reset'
(.*)$tok # either a line terminated with the end token
| # or
(.*) # the whole line
) # end branch reset group
/x;
print $line, "\n";
}
This works, but strikes me as inefficient. The regex engine has to parse the line twice, which is what I was trying to avoid.
I'm aware the problem as stated would be better solved with index():
my $i = index($_, $tok);
$line = $i < 0 ? $_ : substr $_, 0, $i;
but I need to do other processing of the line making a regex desirable - and in any event, I see this as a learning opportunity ;-)
Please take a look at the following example. The lazy (.*?) captures everything up to the word great or, if great does not occur, up to the end of the line ($).
my $str = 'alexander the great alex';
if ($str =~ m/(.*?)(?=great|$)/i) {
print "$1";
}
Replace great in the example above with your $tok.
This should work -
/^(.*?)(?:(?:\b$tok)?$)/gm
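A variation on the same idea, as a hedged sketch: a lazy capture followed by an optional group that swallows the token and the rest of the line, anchored at both ends so the capture cannot stop early. The sub name up_to_token is illustrative, and \Q...\E is added on the assumption that the token should be matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of $line before the first
# occurrence of $tok, or the whole line if $tok does not occur.
sub up_to_token {
    my ( $line, $tok ) = @_;
    # \A ... \z anchor the whole string; the optional group consumes the
    # token plus everything after it, so (.*?) ends right before the token.
    my ($head) = $line =~ /\A(.*?)(?:\Q$tok\E.*)?\z/s;
    return $head;
}

for my $line ( 'keep this end drop this', 'no token here' ) {
    print up_to_token( $line, 'end' ), "\n";
}
```

This avoids the branch-reset alternation, although whether it is actually faster than the index() version would need measuring.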

Finding and Concatenating strings

I want to find some punctuation characters and concatenate them with spaces.
For example:
If any punctuation character is found, I want to add a space before and after it.
$line =~ s/[?%&!,.%*\[◦\]\\;#<>{}#^=\+()\$]/" $1 "/g ;
I tried using $1, as one can in PHP, but it didn't work.
I searched the web but couldn't find the Perl syntax.
Additionally, how can I preserve ... as a single token?
What is the correct syntax for my problem?
Use this:
#!/usr/bin/perl -w
use strict;
my $string = "For example; If i found any puncs. above list, i want to add spaces to front and end of token.";
$string =~ s/([[:punct:]])/ $1 /g;
print "$string\n";
Outputs:
For example ; If i found any puncs . above list , i want to add spaces to front and end of token .
Obviously, if you want different output, just change the replacement between the last two slashes; here every punctuation character is replaced with itself surrounded by spaces.
You need to surround the match pattern with () to capture it into $1:
$line =~ s/([?%&!,.%*\[◦\]\\;#<>{}#^=\+\(\)\$])/ $1 /g;
EDIT (as per OP's comment)
how can i preserve '...' a single token ?
One way would be to undo the change for that token afterwards.
$line =~ s/ \. \. \. /.../g;
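An alternative to undoing the change afterwards, sketched here under the assumption that only a three-dot ellipsis needs protecting, is to put the ellipsis first in an alternation so it is captured as one token before the single-character class gets a chance. The sub name pad_punctuation is illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: surround punctuation with spaces, treating
# "..." as a single token.
sub pad_punctuation {
    my ($text) = @_;
    # Alternatives are tried left to right, so the ellipsis wins over
    # a single dot whenever three dots are available.
    $text =~ s/(\.{3}|[[:punct:]])/ $1 /g;
    return $text;
}

print pad_punctuation('Wait... what?'), "\n";
```

The same ordering trick should work with the explicit character list from the question in place of [[:punct:]].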

Perl - remove first word in a string with regexps

I'm new to both Perl and reg-ex's, and I'm trying to remove the first word in a string (or the first word in a line in a text file) , along with any whitespace that follows it.
For example, if my string is 'one two abd123words', I want to remove 'one '.
The code I was trying is: $line =~/(\S)$/i;
but this only gives me the last word.
If it makes any difference, the word i'm trying to remove is an input, and stored as $arg.
To remove the first word of each line use:
$line =~ s/^\S+\s*//;
EDIT with an explanation:
s/.../.../ # Substitute command.
^ # (Zero-width) Begin of line.
\S+ # Non-space characters.
\s* # Blank-space characters.
// # Substitute with nothing, so remove them.
You mean, like this? :
my $line = 'one two abd123words';
$line =~ s/^\s*\S+\s*//;
# now $line is 'two abd123words'
(That removes any initial whitespace, followed by one or more non-whitespace characters, followed by any newly-initial whitespace.)
In one-liner form:
$ perl -pi.bak -e 's{^\s*\S+\s*}//' file.txt
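Since the question mentions that the word to remove is stored in $arg, here is a small sketch covering both cases: removing whatever the first word is, and removing the first word only when it equals a given string. The sub names are illustrative, and \Q...\E is used on the assumption that $arg should be matched literally rather than as a pattern.

```perl
use strict;
use warnings;

# Remove the first word, whatever it is, plus surrounding whitespace.
sub remove_first_word {
    my ($line) = @_;
    $line =~ s/^\s*\S+\s*//;
    return $line;
}

# Remove the first word only if it is exactly $arg.
sub remove_leading_word {
    my ( $line, $arg ) = @_;
    # (?:\s+|\z) makes sure $arg is a whole word, not a prefix of one.
    $line =~ s/^\s*\Q$arg\E(?:\s+|\z)//;
    return $line;
}

my $line = 'one two abd123words';
print remove_first_word($line), "\n";
print remove_leading_word( $line, 'one' ), "\n";
```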

replace newlines within quoted string with \n

I need to write a quick (by tomorrow) filter script to replace line breaks (LF or CRLF) found within double quoted strings by the escaped newline \n. The content is a (broken) javascript program, so I need to allow for escape sequences like "ab\"cd" and "ab\\"cd"ef" within a string.
I understand that sed is not well suited for the job as it works line by line, so I turn to perl, of which I know nothing :)
I've written this regex: "(((\\.)|[^"\\\n])*\n?)*" and tested it with the http://regex.powertoy.org. It indeed matches quoted strings with line breaks, however, perl -p -e 's/"(((\\.)|[^"\\\n])*(\n)?)*"/TEST/g' does not.
So my questions are:
how to make perl to match line breaks?
how to write the "replace-by" part so that it keeps the original string and only replaces newlines?
There is this similar question with awk solution, but it is not quite what I need.
NOTE: I usually don't ask "please do this for me" questions, but I really don't feel like learning perl/awk by tomorrow... :)
EDIT: sample data
"abc\"def" - matches as one string
"abc\\"def"xy" - match "abc\\" and "xy"
"ab
cd
ef" - is replaced by "ab\ncd\nef"
Here is a simple Perl solution:
s§
\G # match from the beginning of the string or the last match
([^"]*+) # till we get to a quote
"((?:[^"\\]++|\\.)*+)" # match the whole quote
§
$a = $1;
$b = $2;
$b =~ s/\r?\n/\\n/g; # replace what you want inside the quote
"$a\"$b\"";
§gex;
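The § delimiter may not survive every source-file encoding, so here is the same approach sketched with brace delimiters and lexical variables instead of $a and $b, wrapped in a sub so it can be called on a whole file's contents. The sub name escape_newlines_in_quotes is illustrative, and the possessive quantifiers need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: replace raw line breaks inside double-quoted
# strings with a literal backslash-n, leaving everything else untouched.
sub escape_newlines_in_quotes {
    my ($text) = @_;
    $text =~ s{
        \G ([^"]*+)                       # text outside quotes
        " ( (?: [^"\\]++ | \\. )*+ ) "    # one quoted string, escapes allowed
    }{
        my ( $outside, $inside ) = ( $1, $2 );
        $inside =~ s/\r?\n/\\n/g;
        qq{$outside"$inside"};
    }gex;
    return $text;
}

my $code = qq{var s = "ab\ncd";\nvar t = 1;\n};
print escape_newlines_in_quotes($code);
```

The text must be passed in as one string (not line by line) for a quoted string that spans several lines to be seen at all.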
Here is another solution in case you wouldn't want to use /e and just do it with one regex:
use strict;
$_=<<'_quote_';
hai xtest "aa xx aax" baix "xx"
x "axa\"x\\" xa "x\\\\\"x" ax
xbai!x
_quote_
print "Original:\n", $_, "\n";
s/
(
(?:
# at the beginning of the string match till inside the quotes
^(?&outside_quote) "
# or continue from last match which always stops inside quotes
| (?!^)\G
)
(?&inside_quote) # eat things up till we find what we want
)
x # the thing we want to replace
(
(?&inside_quote) # eat more possibly till end of quote
# if going out of quote make sure the match stops inside them
# or at the end of string
(?: " (?&outside_quote) (?:"|\z) )?
)
(?(DEFINE)
(?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
(?<inside_quote> (?:[^"\\x]++|\\.)*+ ) # handle escapes
)
/$1Y$2/xg;
print "Replaced:\n", $_, "\n";
Output:
Original:
hai xtest "aa xx aax" baix "xx"
x "axa\"x\\" xa "x\\\\\"x" ax
xbai!x
Replaced:
hai xtest "aa YY aaY" baix "YY"
x "aYa\"Y\\" xa "Y\\\\\"Y" ax
xbai!x
To work with line breaks instead of x, just replace it in the regex like so:
s/
(
(?:
# at the beginning of the string match till inside the quotes
^(?&outside_quote) "
# or continue from last match which always stops inside quotes
| (?!^)\G
)
(?&inside_quote) # eat things up till we find what we want
)
\r?\n # the thing we want to replace
(
(?&inside_quote) # eat more possibly till end of quote
# if going out of quote make sure the match stops inside them
# or at the end of string
(?: " (?&outside_quote) (?:"|\z) )?
)
(?(DEFINE)
(?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
(?<inside_quote> (?:[^"\\\r\n]++|\\.)*+ ) # handle escapes
)
/$1\\n$2/xg;
Until the OP posts some example content to test by, try adding the "m" (and possibly the "s") flag to the end of your regex; from perldoc perlreref (reference):
m Multiline mode - ^ and $ match internal lines
s match as a Single line - . matches \n
For testing you might also find it useful to add the command-line argument "-i.bak", which keeps a backup of the original file (with the extension ".bak").
Note also that if you want to group without capturing, you can use (?:PATTERN) rather than (PATTERN). Captured content is available as $1, $2, and so on, numbered by opening parenthesis.
For more info, see the link above as well as perldoc perlretut (tutorial) and perldoc perlre (full-ish documentation).
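One further point that may explain why the one-liner in the question never matches: with -p, Perl reads one line at a time, so no single record ever contains both quotes of a multi-line string, whatever flags the regex has. Reading the whole input at once (the -0777 switch on the command line, or undefining $/ in a script) lets the pattern see the line breaks. The sketch below shows the difference using an in-memory file; the sub names and the deliberately simplistic pattern are illustrative only.

```perl
use strict;
use warnings;

# Count quoted strings containing a newline when reading line by line.
sub count_line_mode {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my $hits = 0;
    while ( my $line = <$fh> ) {
        $hits++ while $line =~ /"[^"]*\n[^"]*"/g;
    }
    return $hits;
}

# Same count after slurping everything into one string.
sub count_slurp_mode {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/;              # undefined record separator: read it all at once
    my $all  = <$fh>;
    my $hits = 0;
    $hits++ while $all =~ /"[^"]*\n[^"]*"/g;
    return $hits;
}

my $source = qq{var s = "ab\ncd";\n};
print "line mode:  ", count_line_mode($source),  "\n";
print "slurp mode: ", count_slurp_mode($source), "\n";
```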
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common;
$_ = '"abc\"def"' . '"abc\\\\"def"xy"' . qq("ab\ncd\nef");
print "before: {{$_}}\n";
s{($RE{quoted})}
{ (my $x=$1) =~ s/\n/\\n/g;
$x
}ge;
print "after: {{$_}}\n";
Using Perl 5.14.0 (install with perlbrew) one can do this:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use Regexp::Common qw/delimited/;
my $data = <<'END';
"abc\"def"
"abc\\"def"xy"
"ab
cd
ef"
END
my $output = $data =~ s/$RE{delimited}{-delim=>'"'}{-keep}/$1=~s!\n!\\n!rg/egr;
print $output;
I need 5.14.0 for the /r flag of the internal replace. If someone knows how to avoid this please let me know.
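One way to avoid the /r flag on older Perls, sketched here, is the copy-then-modify idiom: assign to a new variable inside parentheses and bind the substitution to that copy. To keep the example self-contained it uses a simple hand-written quoted-string pattern as a stand-in for $RE{delimited}; that pattern and the sub name are assumptions, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: same effect as the /r version, without /r.
sub escape_quoted_newlines {
    my ($data) = @_;
    # Outer copy replaces the trailing /r; $data itself is not modified.
    ( my $output = $data ) =~ s{("(?:[^"\\]|\\.)*")}{
        ( my $quoted = $1 ) =~ s/\n/\\n/g;    # inner copy replaces s!!!r
        $quoted;
    }ge;
    return $output;
}

my $data = qq{"ab\ncd\nef"\n};
print escape_quoted_newlines($data);
```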