Perl - remove first word in a string with regexps - regex

I'm new to both Perl and regexes, and I'm trying to remove the first word in a string (or the first word of a line in a text file), along with any whitespace that follows it.
For example, if my string is 'one two abd123words', I want to remove 'one '.
The code I was trying is: $line =~/(\S)$/i;
but this only gives me the last word.
If it makes any difference, the word I'm trying to remove is an input, and is stored as $arg.

To remove the first word of each line use:
$line =~ s/^\S+\s*//;
EDIT with an explanation:
s/.../.../ # Substitution operator.
^ # (Zero-width) Beginning of the string, which here is the line.
\S+ # One or more non-whitespace characters.
\s* # Zero or more whitespace characters.
// # Replace with nothing, i.e. remove the match.

You mean, like this? :
my $line = 'one two abd123words';
$line =~ s/^\s*\S+\s*//;
# now $line is 'two abd123words'
(That removes any initial whitespace, followed by one or more non-whitespace characters, followed by any newly-initial whitespace.)

In one-liner form:
$ perl -pi.bak -e 's{^\s*\S+\s*}//' file.txt
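The question mentions that the word to remove is held in $arg, so the pattern can also be built from that variable. The following is only a sketch, assuming $arg should be matched literally and only as a complete first word; the helper name remove_first_word is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip $arg from the start of $line,
# together with the whitespace that follows it.
sub remove_first_word {
    my ($line, $arg) = @_;
    # \Q...\E quotes any regex metacharacters in $arg;
    # (?!\S) requires that the word ends right there.
    $line =~ s/^\s*\Q$arg\E(?!\S)\s*//;
    return $line;
}

print remove_first_word('one two abd123words', 'one'), "\n";  # two abd123words
print remove_first_word('oneness two', 'one'), "\n";          # oneness two (unchanged)
```

Without the (?!\S) check, 'one' would also be stripped from the front of 'oneness'.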

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?
A linefeed is matched by \n
echo "a\n\nb" | perl -pe's/\n/z/'
This prints azzbz, without a trailing newline, so the next prompt appears on the same line. Note that the program is fed one line at a time, so there is no need for the /g modifier. (That is also why \n\n doesn't match.) The /m modifier is unrelated to this example.†
I don't know in what form this is used, but I'd imagine not with echo feeding the input? It is then better to test it with input from a file, or with a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two


last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match at the beginning and end of each line inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will add a z at the end of each line inside the string. However, since $ is zero-width, nothing is removed and all of the newlines stay.
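To see what the footnote's substitution does, here is a small sketch; the helper name mark_line_ends is only for illustration. It counts newlines before and after to show that $ with /m marks line ends without consuming anything.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the footnote's substitution to a copy of the string.
sub mark_line_ends {
    my ($s) = @_;
    $s =~ s/$/z/mg;   # $ is zero-width, so this only inserts text
    return $s;
}

my $before = "a\n\nb\n";
my $after  = mark_line_ends($before);

# tr/\n// with an empty replacement list just counts the newlines
printf "newlines before: %d, after: %d\n",
    ($before =~ tr/\n//), ($after =~ tr/\n//);
print $after, "\n";
```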
You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'
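Inside a script, the equivalent of -0777 is to undefine the input record separator $/ while reading. A minimal sketch, using an in-memory filehandle in place of a real file; the helper name slurp is an assumption, not something from the answers above.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle as one string.
sub slurp {
    my ($fh) = @_;
    local $/;              # undef record separator: readline returns the whole input
    my $content = <$fh>;
    return $content;
}

# An in-memory filehandle stands in for a real file here
my $input = "aaa\n\n\nbbb\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
my $text = slurp($fh);
close $fh;

$text =~ s/\n\n/z/g;       # now \n\n can match, since all lines are in one string
print $text;               # aaaz, then bbb on the next line
```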

What do these Perl regular expressions mean?

chomp(); # Remove the newline.
$_ =~ s/\s*//; # No extra spaces
$_ =~ s/\\//; # Kill any line connector
I am not very familiar with Perl/regex. I am modifying an existing perl script with the above snippet. This chunk of code removes the newline character, spaces and line ending with '\' as connector.
My question is, in the following line, what do the two bold characters mean? I understand that anything between '/ /' is a regular expression. But,
1) What does the s preceding the '/ /' mean?
2) What does the second '/' at the end mean?
In $_ =~ s/\s* //;
s means "do replace".
$_ =~ s/SEARCH/REPLACE/;
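So, s/\s*//; means: find the whitespace at the start of the string, if any, and remove it (replace it with the empty string). Without the /g modifier only the first match is replaced, and \s* can match the empty string, so whitespace later in the line is left alone.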
The "s" is the substitution operator, so it is saying "substitute"/"what matches here"/"with what's here"/
The "/" here is the opening and closing delimiter of the two parts: s/something_to_match/replacement/. '/' is very commonly used as the delimiter character, but Perl allows pretty much any character that isn't used in the regex as the delimiter.
See: http://perldoc.perl.org/functions/s.html
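A short sketch of the points above, using made-up sample strings: the effect of s/// with and without /g, and an alternative delimiter for a pattern that itself contains slashes.

```perl
use strict;
use warnings;

my $without_g = "  a b  c";
$without_g =~ s/\s*//;          # first match only: the leading run of whitespace

my $with_g = "  a b  c";
$with_g =~ s/\s+//g;            # /g: every run of whitespace

my $path = '/usr/local/bin';
$path =~ s{/usr/local}{/opt};   # braces as delimiters, so / needs no escaping

print "[$without_g]\n";   # [a b  c]
print "[$with_g]\n";      # [abc]
print "[$path]\n";        # [/opt/bin]
```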

Regex not working, at least in command line

I have a regex:
($value) = $line =~ /\ABC(.+?)\#/;
For input, e.g.:
(32321213321) ABC 24432.232 #Junk
Which is meant to catch the number between ABC and #.
When I run it through the command line, it returns a space. Through Padre, it returns a space + the number before #.
Is there something wrong with the regex?
In your regex, you have escaped the A. \A is then an escape sequence: an assertion that matches only at the beginning of the string, much like ^. Your string does not start with BC, so the regex cannot match. You have another redundant escape as well, before #. The regex you need is
/ABC(.+?)#/
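The difference can be checked against the sample input from the question. In this sketch the third pattern, which trims the surrounding spaces with \s*, is an addition for illustration and not part of the answer above.

```perl
use strict;
use warnings;

my $line = '(32321213321) ABC 24432.232 #Junk';

# \A anchors at the start of the string, so this looks for "BC" at position 0
my $anchored = ($line =~ /\ABC(.+?)\#/) ? 'match' : 'no match';

# Without the backslash, A is just a literal letter
my ($value) = $line =~ /ABC(.+?)#/;

# Assumed variation: leave the surrounding spaces out of the capture
my ($trimmed) = $line =~ /ABC\s*(.+?)\s*#/;

print "$anchored\n";    # no match
print "[$value]\n";     # [ 24432.232 ]
print "[$trimmed]\n";   # [24432.232]
```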
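If the value consists only of digits and spaces, you can use:
$line =~ /ABC *([0-9 ]+?) *#/;
OR better:
$line =~ /ABC *(\d+(?: \d+)*) *#/;
Neither of these matches the sample value 24432.232, because it contains a decimal point. To allow an optional fractional part:
$line =~ /ABC *(\d+(?:\.\d+)?) *#/;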

Substitute first character before match

For each line I need to add a semicolon exactly one character before the first alphanumeric character, but only counting alphanumeric characters that come after the first semicolon.
Example:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-via-summer_school.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-via-summer_school.7z.001;file;
Could someone please help me create a Perl regex for that? I need it in a program, not as a one-liner.
This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.
s/;\s*\K(?=\S)/;/
If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.
The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
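As a quick check, here is that substitution applied to two of the sample lines. This is a sketch: the helper name add_semicolon is made up, and \K requires Perl 5.10 or newer.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the answer above.
sub add_semicolon {
    my ($line) = @_;
    # Match the first ";" and any whitespace, keep it (\K),
    # then insert ";" just before the next non-whitespace character.
    $line =~ s/;\s*\K(?=\S)/;/;
    return $line;
}

print add_semicolon($_), "\n" for
    '00000001;Root;;',          # becomes 00000001;;Root;;
    '00000002; Documents;;';    # becomes 00000002; ;Documents;;
```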
First of all, here is a program that seems to match your requirements:
#!/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl' and run it on your input data like this:
./program.pl input-data.txt
Here is an explanation of the regular expression:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Based on your sample input, I would use a more specific regular expression:
s/^(\d*; *)(\w)/$1;$2/;
This expression starts at the beginning of the line, skips over numbers (\d*) followed by the first semicolon and space. Before the following word character, it inserts a semicolon.
Take what fits best to your needs!
First of all thank you for your really great answers!
Actually my code snippet looks like this:
our $seperator=";" # at the beginning of the file
#...
sub insert {
my ( $seperator, $line, @all_lines, $count, @all_out );
$count = 0;
@all_lines = read_file($filename);
foreach $line (@all_lines) {
$count = sprintf( "%08d", $count );
chomp $line;
$line =~ s/\:/$seperator/; # works
$line =~ s/\ file/file/; # works
#$line=~s/;\s*\K(?=\S)/;/; # doesn't work
$line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/; # doesn't work
say $count . $seperator . $line . $seperator;
$count++; # btw, is there maybe a hidden index variable in a foreach-loop I could use instead of a new variable??
push( @all_out, $count . $seperator . $line . $seperator . "\n" );
}
write_file( $csvfile, @all_out ); # using File::Slurp
}
To get the input I showed you, I had already made some small substitutions, as you can see at the beginning of the foreach loop.
I am curious, why the regular expressions presented by TLP and Yaakov do not work in my code. In general they work, but only when written like in the example which Yaakov gave:
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}

Regex match string UNTIL string in a comma separated line

All words start with "Passed", but I only want to match those that also end with "Unique".
Input:
PassedShownWeekUnique,PassedShownDayUnique,PassedFailedWeek,PassedFailedDayUnique,Passed1Week,Passed1WeekUnique
Desired output:
PassedShownWeekUnique,PassedShownDayUnique,PassedFailedDayUnique,Passed1WeekUnique
I tried regex Passed.* and it matches everything. Passed.*Unique isn't working, anyone help?
Just use the following, which matches from Passed, through anything, up to Unique:
Passed.*Unique
if [[ $line =~ Passed.*Unique ]]; then echo "line matched: $line"; fi
EDIT: Since the OP revised the question to use a comma-separated line:
line=PassedShownWeekUnique,PassedShownDayUnique,PassedFailedWeek,PassedFailedDayUnique,Passed1Week,Passed1WeekUnique
REGEX=Passed.*Unique
IFS=',';
for word in $line; do
if [[ $word =~ $REGEX ]]; then
echo matched $word
fi
done
Output:
matched PassedShownWeekUnique
matched PassedShownDayUnique
matched PassedFailedDayUnique
matched Passed1WeekUnique
You can either use the regex:
Unique$
to get lines that end with the word "Unique", or:
^Passed.+?Unique$
to get lines that start with "Passed" and end with "Unique". Depending on your specific implementation, you may want to choose one or the other.
And if you have comma-separated input, as you described:
(Passed[^,]*Unique)(?:,|$)
This captures each word that starts with "Passed" and ends with "Unique"; [^,]* keeps a match from running across a comma into the next word. You can check each capture group to print out the item that it matched.
How about you try to use ^ and $
^ Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.
$ Matches the empty string at the end of a line.
So something like this
^Passed.*?Unique$
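The anchors only help once each word is tested on its own. Below is a minimal Perl sketch of that idea using the input from the question; Perl is an assumption here, since the question does not say which language or tool is being used.

```perl
use strict;
use warnings;

my $line = 'PassedShownWeekUnique,PassedShownDayUnique,PassedFailedWeek,'
         . 'PassedFailedDayUnique,Passed1Week,Passed1WeekUnique';

# Split on commas first, then anchor the pattern to each whole field
my @unique = grep { /^Passed.*Unique$/ } split /,/, $line;

print join(',', @unique), "\n";
# PassedShownWeekUnique,PassedShownDayUnique,PassedFailedDayUnique,Passed1WeekUnique
```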