I have some large ASCII images that I want to check are symmetrical. Say I have the following file:
***^^^MMM
*^**^^MMM
**^^^^^MMMMM
The first line is what I want: the sections are kept separate and each one has the same number of characters (it doesn't have to be 3 of each every time, though). The next two are not what I want. I want to count the number of *'s in a row, and then make sure the same number of ^'s and M's follow them. I'm trying to get some symmetry on each line, so these would be good:
**^^MM
**********^^^^^^^^^^MMMMMMMMMM
****^^^^MMMM
*^M
etc.
How can I scan through a file and maybe grep the problem lines?
I tried a few loops with cat ASCIIfile | sed 's/\^//g' | sed 's/M//g' | wc -c, assigning the output to a variable and then comparing the count to the counts of the other characters. Obviously this doesn't take order into account, so lines like *^*^*M^MM were passing the check.
Using perl:
perl -ne ' { $l=$_; chomp; ($v)=/^((.)\2*)/; $t=length($v); \
s/M{$t}//;s/\^{$t}//;s/\*{$t}//; \
print $l if length } ' input_file
Using bash/sed:
while read line; do
m=$(echo $line | sed 's/[^M]*\([M][M]*\)[^M]*/\1/' | wc -c)
s=$(echo $line | sed 's/[^*]*\([*][*]*\)[^*]*/\1/' | wc -c)
n=$(echo $line | sed 's/[^\^]*\([\^][\^]*\)[^\^]*/\1/' | wc -c)
if [[ $m -ne $s || $m -ne $n ]]; then
echo "- $line $m::$s::$n"
else
echo "+ $line $m::$s::$n"
fi
done < input_file
Pure Bash:
#!/bin/bash
for string in '***^^^MMM' '**^^MM' '****^^MMMM' '*^*M^'
do
flag=true
sym=true
patt=''
prevlen=${#string}
for c in '*' '^' 'M'
do
patt+="*\\$c"
sub="${string##$patt}"
sublen="${#sub}"
if $flag
then
flag=false
((compare = prevlen - sublen ))
else
if (( prevlen - sublen != compare ))
then
printf '%s\n' "$string is NOT symmetrical"
sym=false
break
fi
fi
prevlen=$sublen
done
if $sym
then
printf '%s\n' "$string IS symmetrical"
fi
done
To read from a file, change the first for loop to while read -r string and add < filename after the last done on the same line.
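For comparison, here is a minimal awk sketch of the same check, assuming every good line consists of exactly one run of *, one run of ^ and one run of M, in that order. The function name check_symmetry is made up for illustration; it marks each line with + or - in the same way as the bash/sed answer above.

```shell
# Hypothetical helper: prints "+ line" for symmetrical lines, "- line" otherwise.
check_symmetry() {
    awk '
    {
        line = $0
        # Shape check: one run of *, then one run of ^, then one run of M.
        ok = (line ~ /^[*]+\^+M+$/)
        if (ok) {
            # gsub returns the number of replacements, so it doubles as a counter.
            stars  = gsub(/[*]/, "", line)
            carets = gsub(/\^/, "", line)
            # Only the M characters are left in line at this point.
            ok = (stars == carets && carets == length(line))
        }
        printf "%s %s\n", (ok ? "+" : "-"), $0
    }' "$@"
}

printf '%s\n' '***^^^MMM' '*^**^^MMM' '**^^^^^MMMMM' '*^M' | check_symmetry
```

Because the shape is tested before counting, a line such as *^*^*M^MM is rejected even though its character counts happen to agree.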
I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If you can consider a gnu-awk solution then using RS and RT may give identical behavior of grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers I think that I have a working, robust, and (maybe) efficient code now:
tested on AIX/Solaris/FreeBSD/macOS/Linux
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by the regex engine of awk though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS = '^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead induced by assigning $0 in the while loop (thanks @EdMorton for pointing that out).
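To illustrate the first note, here is a small sketch of the difference between the two ways of handing a string to awk. The helper names via_assignment and via_argv are hypothetical; the point is only that a -v assignment goes through escape-sequence processing while an element of ARGV arrives verbatim.

```shell
# Hypothetical helpers, for illustration only.
via_assignment() { awk -v s="$1" 'BEGIN { print s }'; }      # -v processes escapes
via_argv()       { awk 'BEGIN { print ARGV[1] }' "$1"; }     # ARGV is left untouched

via_assignment 'a\\b'   # the doubled backslash collapses to one
via_argv       'a\\b'   # both backslashes survive
```

Since the program consists of a BEGIN block only, awk never tries to open the argument as a file.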
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction when the regex can match zero or more characters. Consider the following simple example. Let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. In this case the first part is true but the second is false, because the match that is found is the zero-length one before the first character (RSTART is set to 1 and RLENGTH to 0). The body of the while loop therefore never runs.
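The behaviour described above can be reproduced with the sketch below. extract_original copies the loop from the question; extract_fixed is one possible adjustment, assuming the goal is to skip one character after a zero-length match and keep scanning. Both function names are made up for this example.

```shell
# Loop from the question: gives up at the first zero-length match.
extract_original() {
    awk -v ere="$1" '{
        while (match($0, ere) && RLENGTH > 0) {
            print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + 1)
        }
    }'
}

# Possible adjustment: on a zero-length match, step over one character and retry.
extract_fixed() {
    awk -v ere="$1" '{
        tail = $0
        while (tail != "" && match(tail, ere)) {
            if (RLENGTH > 0) {
                print substr(tail, RSTART, RLENGTH)
                tail = substr(tail, RSTART + RLENGTH)
            } else {
                tail = substr(tail, RSTART + 1)
            }
        }
    }'
}

echo 1A2A3 | extract_original 'A*'   # prints nothing
echo 1A2A3 | extract_fixed 'A*'      # prints A on two separate lines
```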
I have been scratching my head over this issue and can't work out what is wrong with my one-liner below.
Given that
echo "5" | wc -m
2
and that
echo "55" | wc -m
3
I tried to add a zero in front of all numbers below 10 with an awk if-statement as follows:
echo "5" | awk '{ if ( wc -m $0 -eq 2 ) print 0$1 ; else print $1 }'
05
which is "correct", however with 2 digits numbers I get the same zero in front.
echo "55" | awk '{ if ( wc -m $0 -eq 2 ) print 0$1 ; else print $1 }'
055
How come? I assumed this was going to return only 55 instead of 055. I now understand I'm constructing the if-statement wrong.
What, then, is the right way (if one exists) to ask awk to check whether whatever comes from the | has 2 characters, as one would do with wc -m?
I'm not interested in the optimal way to add leading zeros in the command line (there are enough duplicates of that).
Thanks!
I suggest to use printf:
printf "%02d\n" "$(echo 55 | wc -m)"
03
printf "%02d\n" "$(echo 123456789 | wc -m)"
10
Note: printf is available as a bash builtin. It mainly follows the conventions of the C function printf(). Check:
help printf # For the bash builtin in particular
man 3 printf # For the C function
Facts:
In AWK strings or variables are concatenated just by placing them side by side.
For example: awk '{b="v" ; print "a" b}'
In AWK undefined variables are equal to an empty string or 0.
For example: awk '{print a "b", -a}'
In AWK non-empty strings are true inside if.
For example: awk '{ if ("a") print 1 }'
wc -m $0 -eq 2 is parsed as follows (i.e. - has higher precedence than string concatenation):
wc -m $0 -eq 2
( wc - m ) ( $0 - eq ) 2
                       ^ - integer value 2, converted to string "2"
                  ^^     - undefined variable `eq`, converted to integer 0
             ^^          - input line, so string "5" converted to integer 5
                ^        - subtracts 5 - 0 = 5
           ^^^^^^^^^^^   - integer 5, converted to string "5"
       ^                 - undefined variable "m", converted to integer 0
  ^^                     - undefined variable "wc" converted to integer 0
^^^^^^^^^^               - subtracts 0 - 0 = 0, converted to a string "0"
^^^^^^^^^^^^^^^^^^^^^^^^ - string concatenation, results in string "052"
The result of wc -m $0 -eq 2 is string 052 (see awk '{ print wc -m $0 -eq 2 }' <<<'5'). Because the string is not empty, if is always true.
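As a quick check of that parse, the sketch below prints the value of the expression for both inputs from the question. The wrapper name eval_condition is hypothetical. In both cases the result is a non-empty string, which is why the if branch is taken every time.

```shell
# Hypothetical wrapper: show what awk computes for the "condition" itself.
eval_condition() {
    printf '%s\n' "$1" | awk '{ print wc -m $0 -eq 2 }'
}

eval_condition 5    # "0" "5" "2"  concatenated
eval_condition 55   # "0" "55" "2" concatenated
```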
It should return only 55 instead of 055
No, it should not.
Am I constructing the if statement wrong?
No, the if statement has valid AWK syntax. Your expectations to how it works do not match how it really works.
To actually make it work (not that you would want to):
echo 5 | awk '
{
cmd = "echo " $1 " | wc -m"
cmd | getline len
if (len == 2)
print "0"$1
else
print $1
}'
But why when you can use this instead:
echo 5 | awk 'length($1) == 1 { $1 = "0"$1 } 1'
Or even simpler with the various printf solutions seen in the other answers.
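If the aim is specifically to mirror what wc -m reports, here is a minimal sketch, assuming single-line input: awk's length() does not count the trailing newline, whereas wc -m does, so adding 1 gives the same number. The function name pad_if_single is made up for the example.

```shell
# Hypothetical helper: prefix a zero when wc -m would have reported 2.
pad_if_single() {
    printf '%s\n' "$1" | awk '{
        wc_like = length($0) + 1      # wc -m also counts the trailing newline
        if (wc_like == 2)
            print "0" $0
        else
            print $0
    }'
}

pad_if_single 5
pad_if_single 55
```

This keeps the whole test inside awk, with no external command or getline involved.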
Suppose we have a string like
"dir1|file1|dir2|file2"
and would like to turn it into
"-f dir1/file1 -f dir2/file2"
Is there an elegant way to do this with sed or awk for a general case of n > 2?
My attempt was to try
echo "dir1|file1|dir2|file2" | sed 's/\(\([^|]\)|\)*/-f \2\/\4 -f \6\/\8/'
An awk solution:
awk -F'|' '{ for (i=1;i<=NF;i+=2) printf "-f %s/%s%s", $i, $(i+1), ((i==NF-1) ? "\n" : " ") }' \
<<<"dir1|file1|dir2|file2"
-F'|' splits the input into fields by |
for (i=1;i<=NF;i+=2) loops over the field indices in increments of 2
printf "-f %s/%s%s", $i, $(i+1), ((i==NF-1) ? "\n" : " ") prints pairs of consecutive fields joined with / and prefixed with -f<space>
((i==NF-1) ? "\n" : " ") terminates each field-pair either with a space, if more fields follow, or a \n to terminate the overall output.
In a comment, the OP suggests a shorter variation, which may be of interest if you don't need/want the output to be \n-terminated:
awk -F'|' '{ for (i=1;i<=NF;++i) printf "%s", (i%2 ? " -f " $i : "/" $i ) }' \
<<<"dir1|file1|dir2|file2"
This might work for you (GNU sed):
sed 's/\([^|]*\)|\([^|]*\)|\?/-f \1\/\2 /g;s/ $//' file
This will work for dir1|file1|dir2|file2|dirn|filen type strings
The regexp forms two back references (\1,\2 used in the replacement part of the substitution command s/pattern/replacement/), the first is all non-|'s, then a |, the second is all non-|'s then an optional | i.e. for the first application of the substitution (N.B. the g flag is implemented and so the substitutions may be multiple) dir1 becomes \1 and file1 becomes \2. All that remains is to prepend -f and replace the first | by / and the second | by a space. The last space is not needed at the end of the line and is removed in the second substitution command.
$ awk -v RS='|' 'NR%2{p=$0;next} {printf " -f %s/%s", p, $0}' <<< 'dir1|file1|dir2|file2'
-f dir1/file1 -f dir2/file2
A gnu-awk solution:
s="dir1|file1|dir2|file2"
awk 'BEGIN{ FPAT="[^|]+\\|[^|]+" } {
for (i=1; i<=NF; i++) {
sub(/\|/, "/", $i);
if (i>1)
printf " ";
printf "-f %s", $i
};
print ""
}' <<< "$s"
-f dir1/file1 -f dir2/file2
FPAT is used for grabbing dir1|file1 into a single field.
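For completeness, the same pairing can be sketched without sed or awk, using only POSIX shell word splitting. This assumes the input has an even number of |-separated parts (a trailing odd part is silently dropped); the function name pairs_to_flags is hypothetical.

```shell
# Hypothetical helper; the body runs in a subshell so IFS and set -f do not leak.
pairs_to_flags() (
    IFS='|'
    set -f            # no globbing while the string is split
    set -- $1         # split on | into the positional parameters
    out=''
    while [ "$#" -ge 2 ]; do
        out="${out:+$out }-f $1/$2"
        shift 2
    done
    printf '%s\n' "$out"
)

pairs_to_flags 'dir1|file1|dir2|file2'
```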
I have a string with some words in it, for example a=1 b=2 c=3 a=50. Now I want to parse this and create another string, a=50 b=2 c=3, which is essentially the same as above except that if the same phrase before the = is encountered a second time, the first one is overwritten with the latest one, so in the end there are only unique phrases on the left of the =. Here is what I have so far:
a="a=1 b=2 c=3 a=50"
o=()
for i in $a
do
reg=${i%=*}
if [[ ${o[*]} == *"$reg"* ]]
then
o=$(echo ${o[*]} | sed -e "s/\$reg=\S/\$i")
else
o+=( $i )
fi
done
What am I doing wrong here?
I'd take an entirely different approach, not based on regular expressions or string rewriting.
declare -A values=( ) # Initialize an associative array ("hash", "map")
while IFS= read -r -d' ' word; do # iterate over input words, separated by spaces
if [[ $word = *=* ]]; then # ignore any word that doesn't have an "=" in it
values[${word%%=*}]=${word#*=} # add everything before the "=" as a key...
fi # ...with everything after the "=" as a value
done
for key in "${!values[@]}"; do # Then iterate over keys we found
value="${values[$key]}" # ...extract the values for each...
printf '%s=%s ' "$key" "$value" # ...and print the pairs.
done
echo # When done iterating, print a newline.
Because the words are being processed first-to-last through the string, updates take effect before the print loop is reached.
Using awk
$ awk -F= -v RS=" |\n" '{a[$1]=$2} END{for (k in a) printf "%s=%s ",k,a[k]}' <<<"a=1 b=2 c=3 a=50"
a=50 b=2 c=3
How it works:
-F=
Set the field separator to be an equal sign.
-v RS=" |\n"
Set the record separator to be either a space or a newline.
a[$1]=$2
Update associative array a with the latest value.
END{for (k in a) printf "%s=%s ",k,a[k]}
In no particular order, print out the final values.
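If the output order matters, one possible variation (a sketch, not part of the answer above) is to remember the order in which keys are first seen in a second array. The names dedupe_pairs, val and order are made up for the example, and words without an = are skipped.

```shell
# Hypothetical helper: the last value wins, keys are printed in first-seen order.
dedupe_pairs() {
    printf '%s\n' "$1" | awk '{
        n = 0
        split("", val)                      # portable way to clear an array
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq == 0) continue           # ignore words without "="
            key = substr($i, 1, eq - 1)
            if (!(key in val)) order[++n] = key
            val[key] = substr($i, eq + 1)
        }
        for (i = 1; i <= n; i++)
            printf "%s%s=%s", (i > 1 ? " " : ""), order[i], val[order[i]]
        print ""
    }'
}

dedupe_pairs 'a=1 b=2 c=3 a=50'
```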
Using bash
Like Charles Duffy's approach, this uses read -d" " to parse the string. This approach, however, uses IFS="=" to separate names and values.
Two loops are required. The first gathers the values. The second reassembles the updated values in the original order:
a="a=1 b=2 c=3 a=50"
declare -A b
while IFS== read -d" " name value
do
b["$name"]="$value"
done <<<"$a "
declare -A seen
while IFS== read -d" " name value
do
[ "${seen[$name]}" ] || o="$o $name=${b["$name"]}"
seen[$name]=1
done <<<"$a "
echo "$o"
Easily done with perl:
echo "a=1 b=2 c=3 a=50" \
| sed "s/ /\n/g" \
| perl -e '
my %hash = ();
while(<>){
$line = $_;
if($line =~ m/(\S+)=(\S+)/) {
$hash{$1} = $2;
}
}
for $key (sort keys %hash) {
print "$key=$hash{$key}\n";
}'
...or, all on one line:
echo "a=1 b=2 c=3 a=50" | sed "s/ /\n/g" | perl -e 'my %hash = (); while(<>){ $line = $_; if($line =~ m/(\S+)=(\S+)/) { $hash{$1} = $2; } } for $key (sort keys %hash) { print "$key=$hash{$key}\n"; }'
Hi I am trying to match the following string to no avail
echo '[xxAA][xxBxx][C]' | awk -F '/\[.*\]/' '{ for (i = 1; i <= NF; i++) printf "-->%s<--\n", $i }'
I basically want to have each field be an enclosing bracket such that
field 1 = xxAA
field 2 = xxBxx
field 3 = C
but i keep getting the following result
-->[xxAA][xxBxx][C]<--
any pointers where I am going wrong?
You can use a regex in Field Separator. We enclose the [ and ] in character class to have it considered as literal. Both are separated by | which is logical OR. Since we target them as field separator we just iterate over even field numbers to get the output.
$ echo '[xxAA][xxBxx][C]' | awk -v FS="[]]|[[]" '{ for (i=2;i<=NF;i+=2) print $i }'
xxAA
xxBxx
C
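The value given to -F is already treated as a regex, so the surrounding slashes are taken as literal / characters; since the input contains none, nothing is split and you get the whole line back as one field. Even without the slashes, the regex \[.*\] would match the entire input, because the .* matches the ][ inside the input as well as the letters.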
You could split fields on the ']' character instead, then put it back again in the output:
echo '[xxAA][xxBxx][C]' | awk -F ']' '{ for (i = 1; i <= NF; i++) if ($i != "") printf "-->%s]<--\n", $i }'
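The greedy behaviour of .* described above can be seen with match(). The sketch below compares it with a bracket expression that cannot cross a ]; both helper names are hypothetical.

```shell
# Hypothetical helpers: print the text that each regex matches first.
greedy_match() {
    printf '%s\n' "$1" | awk '{ if (match($0, /\[.*\]/)) print substr($0, RSTART, RLENGTH) }'
}
bounded_match() {
    printf '%s\n' "$1" | awk '{ if (match($0, /\[[^]]*\]/)) print substr($0, RSTART, RLENGTH) }'
}

greedy_match  '[xxAA][xxBxx][C]'   # the whole line
bounded_match '[xxAA][xxBxx][C]'   # only the first bracketed group
```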
This is a job for GNU awk's FPAT variable which lets you specify the pattern of the fields rather than the pattern of the field separators:
$ echo '[xxAA][xxBxx][C]' | awk -v FPAT='[^][]+' '{ for (i = 1; i <= NF; i++) printf "-->%s<--\n", $i }'
-->xxAA<--
-->xxBxx<--
-->C<--
With other awks I'd use:
$ echo '[xxAA][xxBxx][C]' | awk -F'\\]\\[' '{ gsub(/^\[|\]$/,""); for (i = 1; i <= NF; i++) printf "-->%s<--\n", $i }'
-->xxAA<--
-->xxBxx<--
-->C<--