Removing `^` from `s/^/1/;` causes my code to fail. Why? - regex

I've been working on this problem over at the code golf exchange which is why my code looks so funny.
Here's a program with use strict and use warnings that recreates the problem:
use strict;
use warnings;
$_ = "";
for my $i (1..33){
s//1/; # Just prepends 1 to the string $_
}
print "$_\n";
for my $i (34..127) {
if( chr(y/1/1/) !~ /[!"'()*+,-.\/12357:;<=>?CEFGHIJKLMNSTUVWXYZ[\\\]^_`cfhijklmnrstuvwxyz{|}~]/ ) {
print chr y/1/1/;
}
s/^/1/; # Prepends 1 to the start of the string.
}
Here is the output:
111111111111111111111111111111111
#$%&04689@ABDOPQRabdegopq
This works as I would expect. However, when I take the ^ out of the second substitution, it no longer prepends anything and the string stops growing.
use strict;
use warnings;
$_ = "";
for my $i (1..33){
s//1/;
}
print "$_\n";
for my $i (34..127) {
if( chr(y/1/1/) !~ /[!"'()*+,-.\/12357:;<=>?CEFGHIJKLMNSTUVWXYZ[\\\]^_`cfhijklmnrstuvwxyz{|}~]/ ) {
print chr y/1/1/;
}
s//1/; # No Longer matches!
}
Why does this happen? s//1/ works in the first loop, so why does changing it in the second one break everything?
For an additional point of confusion, if you put the if block in braces, the regex matches again:
for my $i (34..127) {
{
if( chr(y/1/1/) !~ /[!"'()*+,-.\/12357:;<=>?CEFGHIJKLMNSTUVWXYZ[\\\]^_`cfhijklmnrstuvwxyz{|}~]/ ) {
print chr y/1/1/;
}
}
s//1/; # This prepends 1 to the string $_ again.
}
edit:
I wanted to edit my original code back into the question for reference:
use strict;
use warnings;
$_="";
until( y/1/1/ > 32){
print "test1";
s//1/;
print "test";
}
print "$_\n";
until( y/1/1/ > 125+1 ) {
if( chr(y/1/1/) !~ /[!"'()*+,-.\/12357:;<=>?CEFGHIJKLMNSTUVWXYZ[\\\]^_`cfhijklmnrstuvwxyz{|}~]/ ) {
print chr y/1/1/;
}
s/^/1/; # this is the line we remove ^ from
}
When we remove ^ from the line, the output changes from:
test1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1testtest1test111111111111111111111111111111111
#$%&04689@ABDOPQRabdegopq
to
hanging with no output
So in this case, it seems that changing a line in the second loop changes the behavior of the first one.

s//1/; does not match against the empty string. An empty pattern reuses the last successfully matched regex. So the first loop uses a genuinely empty pattern, because nothing has matched before it, while the second loop reuses the last successful match from the if above.
Quote:
If the PATTERN evaluates to the empty string, the last successfully
matched regular expression is used instead. In this case, only the g
and c flags on the empty pattern are honored
Please, see The empty pattern //
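A rough way to picture the quoted rule is a small Python model that remembers the last pattern that matched and uses it whenever an empty pattern is given. This is only a sketch under stated assumptions: Python's re module has no such behavior, LastMatchModel and its methods are hypothetical names, the character class is shortened to [!12357], and Perl's scoping of the last match is ignored.

```python
import re

class LastMatchModel:
    """Toy model: an empty pattern reuses the last successfully matched one."""

    def __init__(self):
        self.last = None  # last pattern that matched, if any

    def _resolve(self, pattern):
        # An empty pattern falls back to the remembered one when there is one.
        if pattern == "" and self.last is not None:
            return self.last
        return pattern

    def match(self, pattern, text):
        pattern = self._resolve(pattern)
        found = re.search(pattern, text) is not None
        if found:
            self.last = pattern
        return found

    def subst(self, pattern, repl, text):
        pattern = self._resolve(pattern)
        new_text, count = re.subn(pattern, repl, text, count=1)
        if count:
            self.last = pattern
        return new_text

m = LastMatchModel()
s = ""
for _ in range(33):
    s = m.subst("", "1", s)        # truly empty pattern: prepends "1"
len_before = len(s)

m.match(r"[!12357]", chr(len(s)))  # chr(33) is "!", so this match succeeds
s = m.subst("", "1", s)            # now behaves like s/[!12357]/1/
len_after = len(s)

print(len_before, len_after)
```

In this model the string stops growing as soon as the character class has matched once, because the "empty" substitution then replaces an existing 1 with 1.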

To expand on VladimirM's answer:
print "regex have dynamic scope\n";
$_ = 1;
{
m/1/;
s//2/;
print "$_ one becomes two, s//2/ is really s/1/2/\n";
}
$_=1;
{
m/1/;
{
s//2/;
}
print "$_ one still becomes two, s//2/ is really s/1/2/\n";
}
$_=1;
{
{
m/1/;
}
s//2/;
print "$_ one becomes twentyone, s//2/; is really s/(?:)//2;\n";
}
__END__
regex have dynamic scope
2 one becomes two, s//2/ is really s/1/2/
2 one still becomes two, s//2/ is really s/1/2/
21 one becomes twentyone, s//2/; is really s/(?:)//2;
Since regexes have dynamic scope, using the empty pattern // really means using the last successful pattern from the same dynamic scope, so don't do that :)
If you add use re 'debug'; you can see the regex engine use the previous pattern. Focus on the Matching REx statements: NOTHING(2) is the empty pattern with no previous match, and EXACT <1>(3) is the previous pattern.
regex have dynamic scope
Guessing start of match in sv for REx "1" against "1"
Found anchored substr "1" at offset 0...
Guessed: match at offset 0
Guessing start of match in sv for REx "1" against "1"
Found anchored substr "1" at offset 0...
Guessed: match at offset 0
Matching REx "1" against "1"
0 <> <1> | 1:EXACT <1>(3)
1 <1> <> | 3:END(0)
Match successful!
2 one becomes two, s//2/ is really s/1/2/
Guessing start of match in sv for REx "1" against "1"
Found anchored substr "1" at offset 0...
Guessed: match at offset 0
Guessing start of match in sv for REx "1" against "1"
Found anchored substr "1" at offset 0...
Guessed: match at offset 0
Matching REx "1" against "1"
0 <> <1> | 1:EXACT <1>(3)
1 <1> <> | 3:END(0)
Match successful!
2 one still becomes two, s//2/ is really s/1/2/
Guessing start of match in sv for REx "1" against "1"
Found anchored substr "1" at offset 0...
Guessed: match at offset 0
Matching REx "" against "1"
0 <> <1> | 1:NOTHING(2)
0 <> <1> | 2:END(0)
Match successful!
21 one becomes twentyone, s//2/; is really s/(?:)//2;
Update: you have an infinite loop because the last successful pattern always matches a 1, so the substitution is essentially s/1/1/;. That means your string never grows; it is always 33 chars. A version with a loop counter shows this:
$_="";
until( y/1/1/ > 32){
print "test1";
s//1/;
print "test";
}
print "$_\n";
my $max = 126;
my $count = 0;
my $reps = 0;
until( y/1/1/ > 125+1 ) {
if( chr(y/1/1/) !~ /[!"'()*+,-.\/12357:;<=>?CEFGHIJKLMNSTUVWXYZ[\\\]^_`cfhijklmnrstuvwxyz{|}~]/ ) {
print chr y/1/1/;
}
$reps =
#~ s/^/1/; # win
s//1/; # fail
$count++;
last if $count > $max;
}
print "m $max c $count r $reps l @{[ length $_ ]}\n";
__END__
win #$%&04689@ABDOPQRabdegopqm 126 c 94 r 1 l 127
fail m 126 c 127 r 1 l 33
Unless you're obfuscating, append is $_ .= 1; and prepend is $_ = 1 . $_;

To expand a second time on VladimirM's answer that the empty pattern // is the problem, the following is from perldoc:
The empty pattern //
If the PATTERN evaluates to the empty string, the last successfully matched regular expression is used instead. In this case, only the g and c flags on the empty pattern are honored; the other flags are taken from the original pattern. If no match has previously succeeded, this will (silently) act instead as a genuine empty pattern (which will always match).
Basically, if there is another regex within the same scope that matched, then the LHS of the regex with the empty pattern will actually be the LHS of the previous regex.
In the example below, inspired by the OP, I grow the string using the ones digit of the loop counter instead. However, once the other regex matches chr(33), which is an exclamation point, the LHS of the empty regex changes. It then starts matching the first of the digits 1, 2, 3, 5 or 7 in the string and replacing it with the ones digit of the counter. Therefore the string stays the same length from then on.
use strict;
use warnings;
$_ = "";
for my $i (1..127) {
my $chr = chr(length);
if( $chr =~ m'(?![#$%&])[[:punct:]12357CE-NS-Zcfh-nr-z]' ) {
print "'$chr'";
} else {
print " ";
}
s//$i % 10/e;
printf "% 4d %s\n", $i, $_;
}
The following output clearly demonstrates this:
1 1
2 21
3 321
4 4321
5 54321
6 654321
7 7654321
8 87654321
9 987654321
10 0987654321
11 10987654321
12 210987654321
13 3210987654321
14 43210987654321
15 543210987654321
16 6543210987654321
17 76543210987654321
18 876543210987654321
19 9876543210987654321
20 09876543210987654321
21 109876543210987654321
22 2109876543210987654321
23 32109876543210987654321
24 432109876543210987654321
25 5432109876543210987654321
26 65432109876543210987654321
27 765432109876543210987654321
28 8765432109876543210987654321
29 98765432109876543210987654321
30 098765432109876543210987654321
31 1098765432109876543210987654321
32 21098765432109876543210987654321
33 321098765432109876543210987654321
'!' 34 421098765432109876543210987654321
'!' 35 451098765432109876543210987654321
'!' 36 461098765432109876543210987654321
'!' 37 467098765432109876543210987654321
'!' 38 468098765432109876543210987654321
'!' 39 468098965432109876543210987654321
'!' 40 468098960432109876543210987654321
'!' 41 468098960412109876543210987654321
'!' 42 468098960422109876543210987654321
'!' 43 468098960432109876543210987654321
'!' 44 468098960442109876543210987654321
'!' 45 468098960445109876543210987654321
'!' 46 468098960446109876543210987654321
'!' 47 468098960446709876543210987654321
'!' 48 468098960446809876543210987654321
'!' 49 468098960446809896543210987654321
'!' 50 468098960446809896043210987654321
'!' 51 468098960446809896041210987654321
'!' 52 468098960446809896042210987654321
'!' 53 468098960446809896043210987654321
'!' 54 468098960446809896044210987654321
'!' 55 468098960446809896044510987654321
'!' 56 468098960446809896044610987654321
'!' 57 468098960446809896044670987654321
'!' 58 468098960446809896044680987654321
'!' 59 468098960446809896044680989654321
'!' 60 468098960446809896044680989604321
'!' 61 468098960446809896044680989604121
'!' 62 468098960446809896044680989604221
'!' 63 468098960446809896044680989604321
'!' 64 468098960446809896044680989604421
'!' 65 468098960446809896044680989604451
'!' 66 468098960446809896044680989604461
'!' 67 468098960446809896044680989604467
'!' 68 468098960446809896044680989604468
'!' 69 468098960446809896044680989604468
'!' 70 468098960446809896044680989604468
'!' 71 468098960446809896044680989604468
'!' 72 468098960446809896044680989604468
'!' 73 468098960446809896044680989604468
'!' 74 468098960446809896044680989604468
'!' 75 468098960446809896044680989604468
'!' 76 468098960446809896044680989604468
'!' 77 468098960446809896044680989604468
'!' 78 468098960446809896044680989604468
'!' 79 468098960446809896044680989604468
'!' 80 468098960446809896044680989604468
'!' 81 468098960446809896044680989604468
'!' 82 468098960446809896044680989604468
'!' 83 468098960446809896044680989604468
'!' 84 468098960446809896044680989604468
'!' 85 468098960446809896044680989604468
'!' 86 468098960446809896044680989604468
'!' 87 468098960446809896044680989604468
'!' 88 468098960446809896044680989604468
'!' 89 468098960446809896044680989604468
'!' 90 468098960446809896044680989604468
'!' 91 468098960446809896044680989604468
'!' 92 468098960446809896044680989604468
'!' 93 468098960446809896044680989604468
'!' 94 468098960446809896044680989604468
'!' 95 468098960446809896044680989604468
'!' 96 468098960446809896044680989604468
'!' 97 468098960446809896044680989604468
'!' 98 468098960446809896044680989604468
'!' 99 468098960446809896044680989604468
'!' 100 468098960446809896044680989604468
'!' 101 468098960446809896044680989604468
'!' 102 468098960446809896044680989604468
'!' 103 468098960446809896044680989604468
'!' 104 468098960446809896044680989604468
'!' 105 468098960446809896044680989604468
'!' 106 468098960446809896044680989604468
'!' 107 468098960446809896044680989604468
'!' 108 468098960446809896044680989604468
'!' 109 468098960446809896044680989604468
'!' 110 468098960446809896044680989604468
'!' 111 468098960446809896044680989604468
'!' 112 468098960446809896044680989604468
'!' 113 468098960446809896044680989604468
'!' 114 468098960446809896044680989604468
'!' 115 468098960446809896044680989604468
'!' 116 468098960446809896044680989604468
'!' 117 468098960446809896044680989604468
'!' 118 468098960446809896044680989604468
'!' 119 468098960446809896044680989604468
'!' 120 468098960446809896044680989604468
'!' 121 468098960446809896044680989604468
'!' 122 468098960446809896044680989604468
'!' 123 468098960446809896044680989604468
'!' 124 468098960446809896044680989604468
'!' 125 468098960446809896044680989604468
'!' 126 468098960446809896044680989604468
'!' 127 468098960446809896044680989604468

Related

AWK regex split function using multiple delimiters

I'm trying to use Awk's split function to split input into three fields in order to use the values as field[1], field[2], field[3]. The first field should be everything up to and including the colon, the second everything until the first tab (\t) (the hex), and the last field everything else.
I've tried multiple regexes and the closest I've come to solving this is:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{split($0,field,/([:])([ ])|([\t])/); \
print "length of field:" length(field);for (x in field) print field[x]}'
But the result doesn't include the colon, and I'm not sure the regex I've written is a good one:
length of field:3
ffffffff81000000
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Thanks in advance.
Using gnu-awk's RS (for record separator) variable:
s=$'ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf'
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Explanation:
RS='^\\S+|[^\t:]+': Sets RS as 1+ non-whitespace characters at the start OR 1+ of non-tab, non-colon characters
gsub(/^\s*|\s*$/, "", RT) removes whitespace at the start or end of the RT variable, which gets populated because of RS
print RT prints the RT variable
If you want to print length of fields also then use:
awk -v RS='^\\S+|[^\t:]+' '{gsub(/^\s*|\s*$/, "", RT); print RT} END {print "length of field:", NR}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
length of field: 3
If you don't have gnu-awk then here is a POSIX awk solution for the same:
awk '{
while (match($0, /^[^[:blank:]]+|[^\t:]+/)) {
print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART+RLENGTH)
}
}' <<< "$s"
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
Using your awk code with some changes:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" | awk -v OFS='\n' '
{
sub(/: */,":\t")
split($0,field,/[\t]/)
print "length of field:" length(field), field[1], field[2],field[3]
}'
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
As you can see:
added a tab with sub(),
so the separator for split() is only [\t],
and the OFS is \n.
And finally only a print.
Your regex can be simplified as:
split($0,field,/: |\t/)
but the result will be the same, without the colon character,
because the delimiter is not included in the split result.
If you want to use a more complex pattern as the delimiter in the split
function, such as a whitespace preceded by a colon, you will need a lookbehind
as found in PCRE, which awk does not support.
Here is an example with python:
#!/usr/bin/python
import re
s = "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf"
print(re.split(r'(?<=:) |\t', s))
Output:
['ffffffff81000000:', '48 8d 25 51 3f 60 01', 'leaq asdf asdf asdf']
You'll see the colon is included in the result.
You can use sub to replace the colon and space with :\t, then gsub to replace each \t with \n. You will not find \n in a line of awk text unless your programming actions put it there; it is therefore a useful delimiter. You can now split on \n and your code will work as you imagine:
echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" \
| awk '{sub(/: /,":\t"); gsub(/\t/,"\n"); split($0,field,/\n/)
print "length of field:" length(field)
for (x=1; x<=length(field); x++) print field[x]}'
Prints:
length of field:3
ffffffff81000000:
48 8d 25 51 3f 60 01
leaq asdf asdf asdf
IMHO for a job like this you should use GNU awk for the 3rd arg to match() instead of split():
$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
print "length of field:" length(field);for (x in field) print x, field[x]
}
'
length of field:12
0start 1
0length 58
3start 40
1start 1
2start 19
3length 19
2length 20
1length 17
0 ffffffff81000000: 48 8d 25 51 3f 60 01 leaq asdf asdf asdf
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf
Note that the resultant array has a lot more information than just the 3 fields that get populated with the strings that match the regexp segments. Just ignore the extra fields if you don't need them:
$ echo -e "ffffffff81000000: 48 8d 25 51 3f 60 01\tleaq asdf asdf asdf" |
awk '
match($0,/([^:]+:)\s*([^\t]+)\t(.*)/,field) {
for (x=1; x<=3; x++) print x, field[x]
}
'
1 ffffffff81000000:
2 48 8d 25 51 3f 60 01
3 leaq asdf asdf asdf

What does this function ft_isalnum do?

I am reading a program which contains the following function:
int ft_isalnum(int c)
{
return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
|| (c >= '0' && c <= '9'));
}
I don't quite understand what this function is intended to do.
As suggested by its name, the function checks if the given character is alphanumeric.
Assuming ASCII character encoding where A-Z and a-z are stored consecutively, it checks if the character is in either the 'A' to 'Z' range, the 'a' to 'z' range, or the '0' to '9' range and returns true if any of those conditions are satisfied.
Write a program to figure it out:
#include <stdio.h>
#include <ctype.h>
int ft_isalnum(int c)
{
return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'));
}
int main(void)
{
for (int i = 0; i < 128; putchar(++i % 8 ? ' ' : '\n'))
printf("%3d '%c' %c ", i, isprint((char unsigned)i) ? i : '?', ft_isalnum(i) ? 'X' : ' ');
putchar('\n');
}
Output
0 '?' 1 '?' 2 '?' 3 '?' 4 '?' 5 '?' 6 '?' 7 '?'
8 '?' 9 '?' 10 '?' 11 '?' 12 '?' 13 '?' 14 '?' 15 '?'
16 '?' 17 '?' 18 '?' 19 '?' 20 '?' 21 '?' 22 '?' 23 '?'
24 '?' 25 '?' 26 '?' 27 '?' 28 '?' 29 '?' 30 '?' 31 '?'
32 ' ' 33 '!' 34 '"' 35 '#' 36 '$' 37 '%' 38 '&' 39 '''
40 '(' 41 ')' 42 '*' 43 '+' 44 ',' 45 '-' 46 '.' 47 '/'
48 '0' X 49 '1' X 50 '2' X 51 '3' X 52 '4' X 53 '5' X 54 '6' X 55 '7' X
56 '8' X 57 '9' X 58 ':' 59 ';' 60 '<' 61 '=' 62 '>' 63 '?'
64 '@' 65 'A' X 66 'B' X 67 'C' X 68 'D' X 69 'E' X 70 'F' X 71 'G' X
72 'H' X 73 'I' X 74 'J' X 75 'K' X 76 'L' X 77 'M' X 78 'N' X 79 'O' X
80 'P' X 81 'Q' X 82 'R' X 83 'S' X 84 'T' X 85 'U' X 86 'V' X 87 'W' X
88 'X' X 89 'Y' X 90 'Z' X 91 '[' 92 '\' 93 ']' 94 '^' 95 '_'
96 '`' 97 'a' X 98 'b' X 99 'c' X 100 'd' X 101 'e' X 102 'f' X 103 'g' X
104 'h' X 105 'i' X 106 'j' X 107 'k' X 108 'l' X 109 'm' X 110 'n' X 111 'o' X
112 'p' X 113 'q' X 114 'r' X 115 's' X 116 't' X 117 'u' X 118 'v' X 119 'w' X
120 'x' X 121 'y' X 122 'z' X 123 '{' 124 '|' 125 '}' 126 '~' 127 '?'
The output indicates, on my machine, that characters 0 to 9 and letters A to Z and a to z return a 1 while everything else returns a 0.
Note
Not all characters are printable.
Thanks
To @Swordfish for making the output more attractive and readable.

gdb print char* as string characters

char* buf;
...
(gdb) x/s buf
0x7fffef8f5f80: "35=DC\001\064\071=ABCD\001"
(gdb) x/12cb buf
0x7fffef8f5f80: 51 '3' 53 '5' 61 '=' 65 'D' 66 'C' 1 '\001' 52 '4' 57 '9'
0x7fffef8f5f88: 61 '=' 83 'A' 80 'B' 88 'C' 84 'D' 1 '\001'
Question> How can I enable gdb to print the buf as the following:
"35=DC\00149=ABCD\001"?
Thank you
Question> How can I enable gdb to print the buf as the following:
There is no way to do this right now. You could file a gdb bug report if you like.
What is going on here is that gdb's string-printing function has a special case to escape a digit when it follows a character that was emitted as an escape sequence. That is why you see \064 and not 4.
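The escaping rule described above can be imitated in a few lines of Python. This is only an illustration of the described behavior, not gdb's actual code: gdb_like_escape is a hypothetical name, and the rule it applies (a digit that directly follows an escaped character is itself written as an octal escape) is inferred from the output shown in the question.

```python
def gdb_like_escape(data: bytes) -> str:
    """Render bytes roughly the way the x/s output in the question looks."""
    out = []
    prev_escaped = False  # was the previous byte emitted as \ooo?
    for b in data:
        ch = chr(b)
        printable = 32 <= b < 127
        if printable and ch in '"\\':
            # Quote and backslash get a backslash prefix.
            out.append("\\" + ch)
            prev_escaped = False
        elif printable and not (prev_escaped and ch.isdigit()):
            out.append(ch)
            prev_escaped = False
        else:
            # Non-printable byte, or a digit right after an escape.
            out.append("\\%03o" % b)
            prev_escaped = True
    return "".join(out)

sample = b"35=DC\x0149=ABCD\x01"
rendered = gdb_like_escape(sample)
print('"' + rendered + '"')
```

With this rule the digits 4 and 9 after the first \001 come out as \064 and \071, which is the form shown by x/s in the question.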

awk command with if and substring

I have an input file (input.txt) that looks like this:
id01 90 5
id01 80 4
id01 79 3
id13 95 5
id01 77 3
id01 85 4
id15 92 5
id17 99 5
id18 65 2
id19 72 3
And I want to output the file as in output.txt:
1 90 5
1 80 4
1 79 3
13 95 5
1 77 3
1 85 4
15 92 5
17 99 5
18 65 2
19 72 3
I did search and was able to find some code example that worked individually (like just the substring part, or just if part) but when I put the entire thing together I am getting syntax errors. I am doing this in ssh environment and I saw there is a slight difference in syntax between sh and bash. Below is what I was able to come up with but gives me syntax errors:
awk -F $'\t' 'BEGIN {OFS = FS} { num = substr($1, 3, 1) if (num == "0") num2 = substr($1,4,1) else num2= substr($1,3,2) {print num2, $2, $3 } }' input.txt > output.txt
I will appreciate any help on this one.
Thanks!
Something like this awk:
awk '{sub(/id/,"",$1);$1=$1+0}8' OFS="\t" input.txt
1 90 5
1 80 4
1 79 3
13 95 5
1 77 3
1 85 4
15 92 5
17 99 5
18 65 2
19 72 3
Updated to get rid of leading 0
Try this sed,
sed 's/id//g' file.txt
To get rid of the leading zeros,
sed 's/id0*//g' file.txt

Consecutively regex-replace separated values

Reading a raster grid file into @grid containing arbitrary numbers, like
82 8 98 98 42 12 3342 321 34 34 09434 9232
(and many more of those rows).
In it, I would like to replace some numbers, like 34 with 42.
But only single, separated numbers! Eg. I do not want to replace the 34 in 3342.
So for numbers $a (search,eg 34) and $b (replace, eg 42), my approach is
s/(^|\s)$a(\s|$)/$1$b$2/g for @grid;
But this only replaces every second of consecutive occurrences (like 34 34 34 34=>42 34 42 34), because the suffix \s is not taken into account as prefix of the next pattern.
Is there any solution for this problem, other than putting two of those commands back-to-back (which is slow for large arrays)?
You're looking for \b : the boundary between a word char (\w) and something that is not a word char
s/\b$a\b/$b/g
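The difference between consuming the separators and merely asserting them can be seen in a short Python sketch. The function names replace_naive and replace_lookaround are made up for illustration, and the whitespace lookarounds (?<!\S) and (?!\S) are a variant of the \b idea above, not something taken from the question; they treat only whitespace or the string ends as separators.

```python
import re

def replace_naive(row, a, b):
    # Consumes the surrounding whitespace, like s/(^|\s)$a(\s|$)/$1$b$2/g.
    pattern = r"(^|\s)" + re.escape(a) + r"(\s|$)"
    return re.sub(pattern, lambda m: m.group(1) + b + m.group(2), row)

def replace_lookaround(row, a, b):
    # Only asserts that no non-space character is adjacent; consumes nothing extra.
    pattern = r"(?<!\S)" + re.escape(a) + r"(?!\S)"
    return re.sub(pattern, lambda m: b, row)

print(replace_naive("34 34 34 34", "34", "42"))
print(replace_lookaround("34 34 34 34", "34", "42"))
print(replace_lookaround("82 8 98 98 42 12 3342 321 34 34 09434 9232", "34", "42"))
```

Because the lookarounds are zero-width, the space after one 34 is still available as the boundary before the next one, so consecutive occurrences are all replaced in a single pass.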
You can set up a hash that contains your replacement pairs, and then capture each number on a line and do the replacement if that number's a hash key:
use strict;
use warnings;
my %replacements = ( 34 => 42, 8 => 100 );
while (<DATA>) {
s/(\d+)/exists $replacements{$1} ? $replacements{$1} : $1/ge;
print;
}
__DATA__
82 8 98 98 42 12 3342 321 34 34 09434 9232
97 8 8 8 27 37 34 55 19 100 8 34 07932 8
Output:
82 100 98 98 42 12 3342 321 42 42 09434 9232
97 100 100 100 27 37 42 55 19 100 100 42 07932 100
Hope this helps!