line anchor behavior with perl regex - regex

I recently wrote a little Perl script to trim whitespace from the end of lines and ran into unexpected behavior. I decided that Perl must include line-end characters when breaking up lines, so I tested that theory and got even more unexpected behavior. The line "I do not" should match either \s+$ or t$, not both. Very confused. Can anyone enlighten me?
£ cat example
I have a space after me
I do not
£ perl -ne 'print if /\s+$/' example
I have a space after me
I do not
£ perl -ne 'print if /t$/' example
I do not
£
PCRE tester gives expected results. I've also tried the /m suffix with no change in behavior.
edit. for completeness:
£ perl -ne 'print if /e$/' example
£
Expected behavior from perl -ne 'print if...' was the same as grep -P:
£ grep -P '\s+$' example
I have a space after me
£
Can repro under Ubuntu 16.04 perl v5.22.1 (both 60 and 68 patch version) and MINGW perl v5.26.1.

You see this behavior because the second line of the example file ends with a \n character, and \n is whitespace, so it is matched by \s.
perlretut
no modifiers: Default behavior. ... '$' matches only at the end or before a newline at the end.
In your regex, \s matches a whitespace character, the set [\ \t\v\r\n\f]. In other words, it matches spaces and the \n character. Then $ matches the end of the line (no characters, just the position itself), just as the word anchor \b matches a word boundary and ^ matches the beginning of the line rather than the first character.
You could rewrite your regex like this:
/[\t ]+$/
The content of example would look like this if second line didn't end with a \n character:
£ cat example
I have a space after me
I do not£
NOTICE that the shell prompt £ is not on the next line.
The results are different because grep abstracts out line endings like Perl's -l flag. (grep -P '\n' will return no results on a text file where grep -Pz '\n' will.)

Your problems stem from the -n option and the use of \s. The -n flag feeds the input to Perl line by line into $_, then it calls the print if match statement.
In your match you use the $ anchor to match the end of the line. The anchor is purely positional and does not consume the newline or any other character.
Check it yourself here with \s+: Whether you add a $ or not, the regex matches the same number of characters.
This is because \s is equal to [\r\n\t\f\v ] and matches any whitespace character and you have added the + quantifier. So, it matches between one and unlimited times, as many times as possible (greedy).
If you search just for trailing space characters instead, you are fine: [ ]+$ (the space is written inside a character class here):
£ perl -ne 'print if /[ ]+$/' example
That way it does not match the \n like \s does. Try it yourself here.
Bonus:
Here are some common Perl one-liners to trim spaces:
# Strip leading whitespace (spaces, tabs) from the beginning of each line
perl -ple 's/^[ \t]+//'
perl -ple 's/^\s+//'
# Strip trailing whitespace (space, tabs) from the end of each line
perl -ple 's/[ \t]+$//'
# Strip whitespace from the beginning and end of each line
perl -ple 's/^[ \t]+|[ \t]+$//g'

Related

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?
The (.*sym.*?)lopy / (.*sym.+?)lopy patterns are almost the same, .+? matches one or more chars other than line break chars, but as few as possible, and .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it and then lopy, so when you added g, it just means you want to find more cases of lopy after sym....lopy. And there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).
Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the samples shown, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'
The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between syms and in the replacement side run code, in which a regex replaces all lopys
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between syms I use \K after the first sym, which drops matches prior to it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 can't change, and we want the regex to return anyway. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).
sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind is only supported in Perl 5.30+ and generates a warning (you can turn the warning off with no warnings qw(experimental::vlb);).
Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
The loop stops when the s command has nothing left to match, i.e. there is no lopy substring after sym,

how to trim trailing spaces after all delimiter in a text file

Need help to remove trailing spaces after all delimiter in a text file
I have Text file with below data.
eg.
ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
I want to remove the spaces between the delimiter and the first letter of the word.
Is there any regex or Unix script that can do this? I am looking for output like the following:
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU||||||2013-09-19 14:48:49.609000|
Any help will be appreciated.
awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)} 1' file
Using a perl one-liner to remove the spacing around every field. Assumes no embedded delimiters:
perl -i -lpe 's/\s*([^|]*?)\s*(?=\||$)/$1/g' file.txt
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-l: Enable line ending processing
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
The Perl code below removes the spaces at the start of a line and the spaces after the delimiter |:
$ perl -pe 's/(?<=\|) +|^ +//g' file
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
To save the changes made to that file,
perl -i -pe 's/(?<=\|) +|^ +//g' file
sed 's/\ //g' input.txt > output.txt
With sed:
sed -r -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//' filename
In the first expression:
(^|\|) matches the beginning of the line or a | character, and saves this in capture group 1.
\s+ matches a sequence of whitespace characters after that.
The replacement \1 substitutes capture group 1, so this deletes the whitespace at the beginning of the line and after the delimiter.
The g modifier makes it operate on all the matches in the line.
In the second expression:
\s+ again matches a sequence of whitespace
$ matches the end of the line
The replacement replaces the whole thing with an empty string, thus removing trailing spaces.
For POSIX sed (for GNU sed, add --posix):
sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g' YourFile
This uses two substitutions (there is no OR (|) in POSIX sed regexes):
Remove leading whitespace by replacing whitespace at the start of the line (^[[:space:]]*) with nothing.
Replace every pipe followed by any whitespace (|[[:space:]]*) with a pipe.
[[:space:]] could be replaced by a single space character if the text only contains spaces (ASCII 32).
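As a quick check of the two-substitution approach, here is a sketch on a made-up row (not the OP's data), using the starred patterns from the explanation so that runs of several spaces are removed too:

```shell
# Made-up sample row with padding after the delimiters and a space inside a field.
row='  ID| 76.0|   169 Park lane||KU'

# First expression strips leading whitespace, second strips whitespace after each pipe.
cleaned=$(printf '%s\n' "$row" | sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g')

printf '%s\n' "$cleaned"
```

The space inside "169 Park lane" is kept, because only whitespace directly after a pipe or at the start of the line is touched.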

How to match only items preceded by a-z, A-Z, space, or the start of a line when searching with grep?

I need to display all lines in file.txt containing the character "鱼", but only those where "鱼" is immediately preceded by a-z, A-Z, a space, or a line break.
I tried using grep, like this:
grep "[a-zA-Z\s\n]鱼" file.txt
The regular expression [a-zA-Z\s\n] does not appear to work. How can I search for this character, when appearing after a-z, A-Z, a space, or a line break?
If you want to match a space with grep, use a space:
grep "[a-zA-Z ]鱼" file.txt
If you want to match any whitespace, you can use the Posix standard character class:
grep "[a-zA-Z[:space:]]鱼" file.txt
("Any whitespace" is space, newline, carriage return, form feed, tab and vertical tab. If you just want to match space and tab, you can use [:blank:].)
You might also want to use a standard class for letters. Unless you are in the Posix or "C" locale, the meanings of character ranges like A-Z are unpredictable.
grep "[[:alpha:][:space:]]鱼" file.txt
grep works line by line, so it will never see a newline. But using an "extended" pattern, you can also match at the beginning of the line:
egrep "(^|[[:alpha:][:space:]])鱼" file.txt
(You can use grep -E instead of egrep if you prefer. But you need one or the other for the above regular expression to work.)
Grep does not support this by default
$ man grep | grep '\\s'
But awk does
$ man awk | grep '\\s'
\s Matches any whitespace character.
So perhaps use
awk '/[a-zA-Z\s\n]鱼/' file.txt
Use awk:
awk '/[A-Za-z \t]鱼/ || (NR > 1 && /^鱼/)' file
Which would print line if 鱼 is after [A-Za-z \t] or if it's not on the first line and it's in the beginning of the line: NR > 1 && /^鱼/.
If you just want it to be at the beginning of a line or preceded by [A-Za-z \t], you can simply do this:
awk '/(^|[A-Za-z \t])鱼/' file
Or
grep -E '(^|[A-Za-z[:blank:]])鱼' file
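To see which lines the (^|...) form accepts, here is a small sketch with made-up sample lines. LC_ALL=C is my own addition, used so that A-Za-z means plain ASCII letters regardless of the system locale.

```shell
# Made-up sample lines; only the third one (鱼 preceded by 大) should be rejected.
sample() { printf 'a鱼\n鱼 first\n大鱼\nx 鱼\n'; }

# Match 鱼 at the start of a line or right after an ASCII letter, space, or tab.
matched=$(sample | LC_ALL=C awk '/(^|[A-Za-z \t])鱼/')

printf '%s\n' "$matched"
```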
Try this one:
^[a-zA-Z \n]{1,}鱼
{1,} ensures that 鱼 is preceded by at least one of these characters.
What is more, I suggest using awk in this particular case.

Matching strings with grep and \A regexp

Given the string in some file:
hel string1
hell string2
hello string3
I'd like to capture just hel using cat file | grep 'regexp here'
I tried doing a bunch of regexp but none seem to work. What makes the most sense is: grep -E '\Ahel' but that doesn't seem to work. It works on http://rubular.com/ however. Any ideas why that isn't working with grep?
Also, when pasting the above string with a tab space before each line, the \A does not seem to work on rubular. I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?
ERE (-E) does not support \A for indicating start of match. Try ^ instead.
Use -m 1 to stop grepping after the first match in each file.
If you want grep to print only the matched string (not the entire line), use -o.
Use -h if you want to suppress the printing of filenames in the grep output.
Example:
grep -Eohm 1 "^hel" *.log
If you need to enforce only outputting if the search string is on the first line of the file, you could use head:
head -qn 1 *.log | grep -Eoh "^hel"
ERE doesn't support \A but PCRE does hence grep -P can be used with same regex (if available):
grep -P '\Ahel\b' file
hel string1
It is also important to use the word boundary \b to avoid matching hell and hello.
Alternatively in ERE you can use:
egrep '^hel\b' file
hel string1
I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?
\A matches the very beginning of the text, it doesn't match the start-of-line when you have one or more lines in your text.
Anyway, grep doesn't support \A so you need to use ^ which by the way matches the start of each line in multi-line mode contrary to \A.
Using awk
awk '$1=="hel"' file
PS you do not need to cat file to grep, use grep 'regexp here' file

perl -pe regex problem

I use perl to check some text input for a regex pattern, but one pattern doesn't work with perl -pe.
Following pattern doesn't work with the command call:
s![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)/.*!$1!
I use the linux shell. Following call I use to test my regex:
cat test | perl -pe 's![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)/.*!$1!'
File test:
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
A MaintanceGie?\195?\159mannFlock/System/Comp-Cache/abc.h
Result:
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
Cache
How can I remove the first result?
Thanks for any advice.
That last slash after "Comp-(.*)" may be what's doing it. Your file content in the "Database" doesn't have a slash. Try replacing Comp-(.*)/.* with Comp-(.*)[/.].* so you can match either the subdirectory or the file extension.
$ cat input
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
A MaintanceGie?\195?\159mannFlock/System/Comp-Cache/abc.h
$ perl -ne 'print if s![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)/.*!$1!' input
Cache
The problem is the last slash character in the regex. It is a literal slash, and the first input line has no slash after Comp-Database. Try this:
s![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)[./].*!$1!
Edit: Updated to match new input data and added another option:
On the other hand, your replacement regex might be replaced by something like:
perl -ne 'print "$1\n" if /Comp-(.*?)[.\/]/'
Then there is no need to parse full line with whatever it contains.
\s matches whitespace (spaces, tabs, and line breaks) and '+' means one or more occurrences. In this case '\s+' means: search for one or more whitespace characters.
cat test
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
A MaintanceGie?\195?\159mannFlock/System/Comp-Cache/abc.h
perl -ne 'print "$1\n" if /\w+?\d+?\d+\w+\/\w+\/Comp-(\w+)[\/]/' test