In Perl, how can I use the regex substitution operator to replace non-ASCII characters in a substring?

How can I use this command:
perl -pi -e 's/[^[:ascii:]]/#/g' file
so that it changes only the characters from offset A to offset B of each line?

As an alternative to rubber boots' answer, you can operate on a substring instead of the whole string:
perl -pi -e 'substr($_, 5, 5) =~ s/[^[:ascii:]]/#/g' file
To illustrate:
perl -e 'print "\xff" x 16' | \
perl -p -e 'substr($_, 5, 5) =~ s/[^[:ascii:]]/#/g' | \
hd
will print
ff ff ff ff ff 23 23 23 23 23 ff ff ff ff ff ff
In this code, the offset is 0-based and the third argument is a length, not an end offset. For 1-based positions A through B inclusive, that gives
substr($_, A-1, B-A+1).
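As a rough illustration of the same idea outside Perl, here is a minimal Python sketch. The function name mask_span and its parameters are made up for this example; it takes a 0-based offset and a length, like substr, and it works on characters, whereas the Perl one-liner above works on bytes unless told otherwise.

```python
import re

# Anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def mask_span(line: str, offset: int, length: int, repl: str = "#") -> str:
    """Replace non-ASCII characters only inside line[offset:offset+length].

    Hypothetical helper: offset is 0-based and length is a character count,
    mirroring the arguments of Perl's substr.
    """
    end = offset + length
    return line[:offset] + NON_ASCII.sub(repl, line[offset:end]) + line[end:]

# Same input as the hd demonstration: sixteen 0xff characters.
print(mask_span("\xff" * 16, 5, 5))
```

Only the five characters starting at offset 5 are candidates for replacement; everything before and after the slice is copied through untouched.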

With the caveat that I may not have understood your question correctly: if the offsets A and B are 5 and 10, then it should look like this:
perl -pi -e 's/(?<=.{5})(?<!.{10})[^[:ascii:]]/#/g' file
Explanation:
[^[:ascii:]] <- the character which is looked for
(?<=.{5}) <- if at least 5 chars were before (offset 5)
(?<!.{10}) <- but no more than 10 characters before (offset 10)
The constructs:
(?<= ...) and (?<! ...)
are called positive and negative lookbehinds, which are zero-width assertions.
(You can google them, see section Look-Around Assertions in perlre)
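The two lookbehinds can also be tried outside Perl. The following is a small Python sketch, on the assumption that Python's re module is an acceptable stand-in: it accepts these lookbehinds because .{5} and .{10} have a fixed width. The [:ascii:] class is spelled as an explicit range, since Python's re has no POSIX classes.

```python
import re

# (?<=.{5})   at least 5 characters precede the match position
# (?<!.{10})  but fewer than 10 characters precede it
pattern = re.compile(r"(?<=.{5})(?<!.{10})[^\x00-\x7f]")

line = "\xff" * 16
masked = pattern.sub("#", line)
print(masked)
```

Together the two assertions restrict the match to 0-based offsets 5 through 9, which is the same window that substr($_, 5, 5) selects.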
Addendum 1
You mentioned substr() in your title, which I overlooked at first. This would work too, of course:
perl -pi -e 'substr($_,5,5)=~s/[^[:ascii:]]/#/g' file
The description of substr EXPR,OFFSET,LENGTH can be found in perldoc. Note that the third argument is a length, so offsets 5 to 10 mean a length of 5.
This example nicely illustrates the use of substr() as an lvalue.
Addendum 2
When updating this post, Grrrr added the same solution as an answer, but his came first by a minute! (so he deserves the booty)
Regards
rbo

Related

Capturing multiple regexp patterns on the same line

Here's what I want to do. I have a file with lines delimited in more than one way, and I want to capture several substrings from those lines based on patterns.
So an example line would be something like this:
servername.domain:2017 08 07.SomeText1.otherIrrelevantStuff;SomeText2.MoreStuff
^^^^^^^^^^        ^^^^^^^^^^ ^^^^^^^^^                      ^^^^^^^^^
In other words I want to capture "servername", "2017 08 07", "SomeText1" and "SomeText2" in each line of my file.
I tried doing it with grep -P and positive lookahead/lookbehind, but only the first one works. The results for each line should also be printed on a single line (so piping through several grep -oP's isn't acceptable).
How would you do it?
In awk, add desired regexps to the match:
$ awk '
BEGIN { OFS="," }
{
b=""  # reset the collected matches for each input line
while(match($0,/servername|2017 08 07|SomeText1|SomeText2/)) {
b=b (b==""?"":OFS)substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print b
}' file
servername,2017 08 07,SomeText1,SomeText2
It seems that you want to extract each string that sits directly before a dot, starting after a : or ; (or at the start of the line). If that is the logic you want, you can use grep with Perl-compatible regular expressions (-P):
$ s="servername.domain:2017 08 07.SomeText1.otherIrrelevantStuff;SomeText2.MoreStuff"
$ grep -oP '[0-9a-zA-Z\s]+(?=\.)' <<< "$s"
servername
2017 08 07
SomeText1
SomeText2
Brief explanation:
(?=\.) : a positive lookahead; it requires a dot right after the match without making the dot part of the match
[0-9a-zA-Z\s]+ : the part grep prints, a run of digits, letters, or whitespace.
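To see the lookahead at work and get all four pieces on one line, here is a short Python sketch using the same pattern. The sample string is the one from the question; joining with a comma is just one possible output format.

```python
import re

s = "servername.domain:2017 08 07.SomeText1.otherIrrelevantStuff;SomeText2.MoreStuff"

# A run of letters, digits or whitespace that is immediately followed by a dot.
# The dot is checked by the lookahead but is not part of the match.
parts = re.findall(r"[0-9a-zA-Z\s]+(?=\.)", s)

print(",".join(parts))
```

"domain" and "otherIrrelevantStuff" are skipped because they are followed by : and ; rather than a dot.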

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
And I need to extract the words and numbers that appear 2 to 4 times ({2,4}).
I've tried many regex lines and even regex101.
I can't really put my finger on what's not working.
This is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
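In grep's default basic regular expressions, an unescaped {2,4} is taken literally rather than as a repetition count, and \w is a GNU extension rather than standard syntax. You have to use extended regular expressions.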
Use
-E option as,
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use
-w to match words, so that it matches the entire words instead of partial.
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
Note
You can eliminate the unnecessary cat by giving the file to grep as an argument instead.
You were very close; inside a bracket expression [], the special notation \w is treated literally, so move it outside the []:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It scans the whole file and prints the words whose number of occurrences in the file falls in the range [2,4].
Output is:
uu
ab
88
1
Using AWK, this solution counts the word instances per line, not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
The delete clears the array for each new line. Each field is used as an array index and its value is incremented by one. Finally, every index (field) whose count is between 2 and 4 inclusive is printed.
Output:
ab 1 33
ab 88 33
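The per-line counting can be sketched in Python as well. This is only an illustration of the same logic; the function name repeated_words and its low/high parameters are invented for the example. Unlike awk's for-in loop, whose order is unspecified, this version reports words in the order they first appear on the line.

```python
from collections import Counter

def repeated_words(line: str, low: int = 2, high: int = 4) -> list[str]:
    """Return the whitespace-separated words occurring low..high times in one line."""
    counts = Counter(line.split())  # a fresh counter per line, like "delete array"
    return [word for word, n in counts.items() if low <= n <= high]

lines = [
    "ab 12ab 1cd uu 88 ab 33 33 1 1",
    "ab cd uu 88 88 33 33 33 cw ab",
]
for line in lines:
    print(" ".join(repeated_words(line)))
```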
Perl implementation for a file small enough to process its content as a single string:
$/ = undef;
$_ = <>;
@_ = /(\b\w+\b)/gs;
my %h; $h{$_}++ for @_;
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}
Save it as script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w+ with word boundaries around it) into the @_ array. Then it counts the number of times each word occurs in the file and stores the result in the %h hash. The final block prints only the items that occurred 2, 3, or 4 times, no more and no less.
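The same three steps (slurp, collect words, count and filter) look like this in a Python sketch. The input is inlined as a string here so the example is self-contained; reading it from a file or stdin is left out.

```python
import re
from collections import Counter

# The two sample lines from the question, as one string ("slurp mode").
text = "ab 12ab 1cd uu 88 ab 33 33 1 1\nab cd uu 88 88 33 33 33 cw ab\n"

# Collect every word, then count how often each one occurs in the whole text.
counts = Counter(re.findall(r"\b\w+\b", text))

# Keep only words seen 2, 3 or 4 times.
selected = [word for word, n in counts.items() if 2 <= n <= 4]

for word in selected:
    print(word)
```

Counter keeps first-seen order, so the words come out in the order they first appear in the text, whereas the order of keys %h in Perl is not defined.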
Note that in Perl you should always use strict; and use warnings; in order to detect issues early.

sed command is working in solaris but not in Linux

I am new to shell scripting and the sed command.
The following sed command works on Solaris but gives an error on Linux:
sed -n 's/^[a-zA-z0-9][a-zA-z0-9]*[ ][ ]*\([0-9][0-9]*\).*[/]dir1[/]subdir1\).*/\2:\1/p'
The error is:
sed: -e expression #1, char 79: Invalid range end
I have no clue why it is giving an "Invalid range end" error.
It seems that sed on Linux doesn't like your A-z (which appears twice). That range doesn't really make sense anyway.
Use [A-Z] (upper-case Z) instead.
As blue112 said, A-z as a range makes no sense. Solaris sed is interpreting this as "the ASCII code for A through the ASCII code for z", in which case you could have unintended matches. A-Z occurs before a-z in ASCII, but there are a few characters falling between Z and a.
59 Y
5a Z
----
5b [
5c \
5d ]
5e ^
5f _
60 `
----
61 a
62 b
Here is an example showing Solaris sed (Solaris 8 in this case). Given this range, it substitutes _ and \ as well as the alphabetics you were apparently targeting.
% echo "f3oo_Ba\\r" | /usr/bin/sed 's/[A-z]/./g';echo
.3.....
(Note that the 3 was not substituted as it does not fall into the specified ASCII range.)
GNU sed is protecting you from shooting yourself in the foot by mistake.
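The ASCII gap between Z and a can be checked with a few lines of Python. This is only a sketch of the character arithmetic: Python's re accepts the range [A-z] without complaint, so it is used here to show the unintended matches, not to reproduce either sed's behavior.

```python
import re

# Every character in the code-point range A..z that is not a letter.
between = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(between)

# With a literal backslash in the input, [A-z] also swallows "_" and "\".
masked = re.sub(r"[A-z]", ".", "f3oo_Ba\\r")
print(masked)
```

Writing the class as [A-Za-z] leaves the underscore and the backslash alone.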

Extracting a number before a pattern

I have a file containing a sequence like this one (a PGN file, used for chess notation, in case you wonder):
1. e4 e5 2. Nf3 Nf6 3. Nc3 d6 4. d4 a6 5. Bc4 Be6 6. Bxe6 fxe6 7. Be3 Nc6 8. a3 h6 9. Qd3 Qd7 10. b4 b6 11. d5 exd5 12. Nxd5 Ne7 13. c4 Nexd5 14. exd5 e4 15. Qe2 exf3 16. Qxf3 O-O-O 17. O-O Re8 18. h3 Kb8 19. a4 Be7 20. b5 a5 21. Bd4 Ref8 22. Rfe1 Ne8 23. Qe3 Rf7 24. Qe6 Bd8 25. Re3 Re7 26.
Qxd7 Rxd7 27. Rae1 Nf6 28. g4 g5 29. Re6 Rf7 30. Kg2 h5 31. f3
Notice that it is split across several lines. Now, from this file, which is continually updated, I'd like to extract the number before the last dot, in this case 31.
I have managed to extract the last line only and remove possible blank lines with:
sed '/^ *$/d' thefile.pgn | tail -1
However, I have no clue as to how to capture the last number before the dot. Is there a tool (awk, sed, grep, whathaveyou) that could do the job?
This awk can also work:
awk -F '\.' 'END{split($(NF-1), a, " "); print a[length(a)]}' file
31
If the file consists of just one line, you can use sed:
$ sed -r 's/.* ([0-9]+)\. \w+$/\1/' file
31
This matches the whole line and captures the last block of digits before the end of the line. Then it prints that block back with \1.
If the file contains many lines, let's go for grep:
grep -Po " \K[0-9]+(?=\.)" file
With this, you get each number on its own line. To get the last one, just pipe to tail -1:
$ grep -Po " \K[0-9]+(?=\.)" file | tail -1
31
It works by matching every number that is preceded by a space and followed by a dot. As we use -o, every match is printed on its own line, hence the use of tail -1 to get the last one.
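For comparison, the same "all matches, then take the last" approach can be sketched in Python. Python's re has no \K, so the leading space is expressed as a lookbehind instead; the two-line sample is a shortened, made-up excerpt in the style of the PGN above.

```python
import re

# Shortened sample: a move number may end one line and its moves start the next.
text = "1. e4 e5 2. Nf3 Nf6 26.\nQxd7 Rxd7 27. Rae1 Nf6 30. Kg2 h5 31. f3\n"

# Digits preceded by a space and followed by a dot (neither is part of the match).
numbers = re.findall(r"(?<= )[0-9]+(?=\.)", text)

last = numbers[-1] if numbers else None  # the equivalent of "| tail -1"
print(last)
```

A move number at the very start of a line is not preceded by a space and is therefore skipped, just as with the grep pattern.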
Thanks all! Hard to choose between the answers. This was my version:
sed -e 's/\*//' -e '/^ *$/d' thefile.p | tail -1 | awk '{print $(NF-1)}' FS='[ .]+'
I accept fedorqui's answer because it is more elegant.
Your sed script can easily be extended to do the tail and grep parts too. (With sed -n and a regex to control printing, removing empty lines is no longer even necessary.)
sed -n '$s/^.* \([1-9][0-9]*\)\.[^.]*$/\1/p' thefile.pgn
This is assuming the last line is never empty. It's not hard to adapt to that additional requirement, either. Here is a moderately more complex version which does that:
sed -n '/^.* \([1-9][0-9]*\)\.[^.]*$/{;s//\1/;x;};$!b;x;p' thefile.pgn
Lines matching the pattern are reduced to just the last number and stored. On the last line, retrieve the stored string and print it.
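The hold-space logic may be easier to follow in a procedural form. The Python sketch below walks the lines, remembers the number from the most recent matching line, and prints it at the end; the variable held plays the role of sed's hold space, and the sample lines (including the trailing empty one) are made up for the example.

```python
import re

# Same shape as the sed regex: greedy prefix, a space, the number, a dot,
# then no further dots up to the end of the line.
LAST_NUMBER = re.compile(r"^.* ([1-9][0-9]*)\.[^.]*$")

lines = [
    "1. e4 e5 2. Nf3 Nf6 26.",
    "Qxd7 Rxd7 27. Rae1 Nf6 30. Kg2 h5 31. f3",
    "",  # a trailing empty line must not wipe out the stored number
]

held = None  # stands in for sed's hold space
for line in lines:
    m = LAST_NUMBER.match(line)
    if m:
        held = m.group(1)  # reduce the line to its last number and store it

print(held)
```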

RegEx to extract numeric value immediately before a search character /string found

What would be the best regex to extract the number (digits only) immediately before a search string?
ABC Y C S 1 $ 46CC MAN 25/ 31
I need to extract 25 in this case, but it's not of fixed length. Any help?
'\d+(?=/)'
should work. See this test with grep:
kent$ echo "ABC Y C S 1 $ 46CC MAN 25/ 31 "|grep -Po '\d+(?=/)'
25
Perl regex:
while ($subject =~ m!\d+(?=.*/)!g) {
print "$&\n"; # $& holds the matched text
}
Output:
1
46
25
So basically it keeps matching numbers as long as a / exists somewhere later in the string, not only the number immediately before the /.
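The difference between the two lookaheads in this thread is easy to check side by side. Here is a short Python sketch using the sample string from the question:

```python
import re

subject = "ABC Y C S 1 $ 46CC MAN 25/ 31"

# Digits immediately followed by a slash.
immediate = re.findall(r"\d+(?=/)", subject)

# Digits followed by a slash anywhere later in the string.
anywhere = re.findall(r"\d+(?=.*/)", subject)

print(immediate)
print(anywhere)
```

Only the first pattern isolates 25; the second also returns 1 and 46 because a slash still follows them further along, and it stops at 31 because no slash comes after it.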