Escaping special characters with sed


I have a script to generate char arrays from strings:
#!/bin/bash
while [ -n "$1" ]
do
echo -n "{" && echo -n "$1" | sed -r "s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
shift
done
It works great as is:
$ wchar 'test\n' 'test\\n' 'test\123' 'test\1234' 'test\x12345'
{'t','e','s','t','\n',0}
{'t','e','s','t','\\','n',0}
{'t','e','s','t','\123',0}
{'t','e','s','t','\123','4',0}
{'t','e','s','t','\x12345',0}
But because sed processes its input one line at a time, it doesn't handle actual newlines:
$ wchar 'test
> test'
{'t','e','s','t',
't','e','s','t',0}
How can I replace special characters (tabs, newlines, etc.) with their escaped versions so that the output looks like this:
$ wchar 'test
> test'
{'t','e','s','t','\n','t','e','s','t',0}
Edit: Some ideas that almost work:
echo -n "{" && echo -n "$1" | sed -r ":a;N;;s/\\n/\\\\n/;$!ba;s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
Produces:
$ wchar 'test\n\\n\1234\x1234abg
test
test'
{test\n\\n\1234\x1234abg\ntest\ntest0}
While removing the !:
echo -n "{" && echo -n "$1" | sed -r ":a;N;;s/\\n/\\\\n/;$ba;s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
Produces:
$ wchar 'test\n\\n\1234\x1234abg
test
test'
{'t','e','s','t','\n','\\','n','\123','4','\x1234ab','g','\n','t','e','s','t',
test0}
This is close...
The first isn't performing the final replacement, and the second isn't correctly adding the last line.

You can pre-filter before passing to sed. Perl will do:
$ set -- 'test1
> test2'
$ echo -n "$1" | perl -0777 -pe 's/\n/\\n/g'
test1\ntest2

This is a rather convoluted solution, but it might work for your needs. It requires GNU awk 4.1:
#!/usr/bin/awk -f
@include "join"
@include "ord"
BEGIN {
RS = "\\\\(n|x..)"
FS = ""
}
{
for (z=1; z<=NF; z++)
y[++x] = ord($z)<0x20 ? sprintf("\\x%02x",ord($z)) : $z
y[++x] = RT
}
END {
y[++x] = "\\0"
for (w in y)
y[w] = "'" y[w] "'"
printf "{%s}", join(y, 1, x, ",")
}
Result
$ cat file
a
b\nc\x0a
$ ./foo.awk file
{'a','\x0a','b','\n','c','\x0a','\0'}

Related

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?

Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.

If you can consider a GNU awk solution, then using RS and RT may give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20

Thanks to the various comments and answers, I think I now have working, robust, and (maybe) efficient code, tested on AIX/Solaris/FreeBSD/macOS/Linux:
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by awk's regex engine though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS = '^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead of assigning $0 inside the while loop (thanks @EdMorton for pointing that out).
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t

Your code will malfunction for a regex that can match zero or more characters. Consider the following simple example. Let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives the output
A
A
Your while condition is match($0,ERE) && RLENGTH > 0. Here the first part is true but the second is false, because the match found is the zero-length one before the first character (RSTART is set to 1), so the body of the while loop is never executed.
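The difference is easy to reproduce. In this sketch, loop_original and loop_fixed are hypothetical names: the first wraps the while loop from the question, the second steps past an empty match instead of stopping, which is one way to get the same output as grep for this input.

```shell
#!/bin/sh
# Loop from the question: gives up as soon as a match is empty.
loop_original() {
  awk -v ERE="$1" '{
    while (match($0, ERE) && RLENGTH > 0) {
      print substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + 1)
    }
  }'
}

# Variant: on an empty match, skip one character and keep scanning.
loop_fixed() {
  awk -v ERE="$1" '{
    tail = $0
    while (tail != "" && match(tail, ERE)) {
      if (RLENGTH > 0) {
        print substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART + RLENGTH)
      } else
        tail = substr(tail, RSTART + 1)
    }
  }'
}

printf '1A2A3\n' | loop_original 'A*'   # prints nothing
printf '1A2A3\n' | loop_fixed 'A*'      # prints A twice, one per line
```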

detect string case and apply to another one

How can I detect the case (lowercase, UPPERCASE, CamelCase [, maybe WhATevERcAse]) of a string to apply to another one?
I would like to do it as a one-liner with sed or whatever.
This is used for a spell checker which proposes corrections.
Let's say I get something like string_to_fix:correction:
BEHAVIOUR:behavior => get BEHAVIOUR:BEHAVIOR
Behaviour:behavior => get Behaviour:Behavior
behaviour:behavior => remains behaviour:behavior
Extra case to be handled:
MySpecalCase:myspecialcase => MySpecalCase:MySpecialCase (so character would be the point of reference and not the position in the word)

With awk you can use the POSIX character classes to detect case:
$ cat case.awk
/^[[:lower:]]+$/ { print "lower"; next }
/^[[:upper:]]+$/ { print "upper"; next }
/^[[:upper:]][[:lower:]]+$/ { print "capitalized"; next }
/^[[:alpha:]]+$/ { print "mixed case"; next }
{ print "non alphabetic" }
Jims-MacBook-Air so $ echo chihuahua | awk -f case.awk
lower
Jims-MacBook-Air so $ echo WOLFHOUND | awk -f case.awk
upper
Jims-MacBook-Air so $ echo London | awk -f case.awk
capitalized
Jims-MacBook-Air so $ echo LaTeX | awk -f case.awk
mixed case
Jims-MacBook-Air so $ echo "Jaws 2" | awk -f case.awk
non alphabetic
Here's an example taking two strings and applying the case of the first to the second:
BEGIN { OFS = FS = ":" }
$1 ~ /^[[:lower:]]+$/ { print $1, tolower($2); next }
$1 ~ /^[[:upper:]]+$/ { print $1, toupper($2); next }
$1 ~ /^[[:upper:]][[:lower:]]+$/ { print $1, toupper(substr($2,1,1)) tolower(substr($2,2)); next }
$1 ~ /^[[:alpha:]]+$/ { print $1, $2; next }
{ print $1, $2 }
$ echo BEHAVIOUR:behavior | awk -f case.awk
BEHAVIOUR:BEHAVIOR
$ echo Behaviour:behavior | awk -f case.awk
Behaviour:Behavior
$ echo behaviour:behavior | awk -f case.awk
behaviour:behavior

With GNU sed:
sed -r 's/([A-Z]+):(.*)/\1:\U\2/;s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/' file
Explanations:
s/([A-Z]+):(.*)/\1:\U\2/: search for uppercase letters up to : and using backreference and uppercase modifier \U, change letters after : to uppercase
s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/ : search for words starting with uppercase letter and if found, replace first letter after : to uppercase
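A quick way to check the two substitutions against the three cases from the question is sketched below. It assumes GNU sed, since \U and \L are GNU extensions; the function name apply_case and the LC_ALL=C setting (used so that [A-Z] and [a-z] mean plain ASCII ranges) are additions for illustration.

```shell
#!/bin/sh
# Requires GNU sed: \U and \L in the replacement are GNU extensions.
# LC_ALL=C keeps [A-Z] and [a-z] as plain ASCII ranges.
apply_case() {
  LC_ALL=C sed -E 's/([A-Z]+):(.*)/\1:\U\2/;s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/'
}

printf '%s\n' 'BEHAVIOUR:behavior' 'Behaviour:behavior' 'behaviour:behavior' | apply_case
```

The all-lowercase line matches neither expression, so it passes through unchanged.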

awk -F ':' '
{
# read Pattern to reproduce
Pat = $1
printf("%s:", Pat)
# generic
if ( $1 ~ /^[[:upper:]]*$/) { print toupper( $2); next}
if ( $1 ~ /^[[:lower:]]*$/) { print tolower( $2); next}
# Specific
gsub( /[^[:upper:][:lower:]]/, "~:", Pat)
gsub( /[[:upper:]]/, "U:", Pat)
gsub( /[[:lower:]]/, "l:", Pat)
LengPat = split( Pat, aDir, /:/)
# print with the corresponding pattern
LenSec = length( $2)
for( i = 1; i <= LenSec; i++ ) {
ThisChar = substr( $2, i, 1)
Dir = aDir[ (( i - 1) % LengPat + 1)]
if ( Dir == "U" ) printf( "%s", toupper( ThisChar))
else if ( Dir == "l" ) printf( "%s", tolower( ThisChar))
else printf( "%s", ThisChar)
}
printf( "\n")
}' YourFile
Handles all cases (and uses the same idea as @Jas for the quick all-upper or all-lower patterns).
Works for this structure only (the separator is :).
The second part (text) can be longer than the first part; the pattern is reused cyclically.

This might work for you (GNU sed):
sed -r '/^([^:]*):\1$/Is//\1:\1/' file
This uses the I flag to do a caseless match and then replaces both instances of the match with the first.

Bash command to match n line

I have an index HTML file with file/dir listing. It is just a usual filebrowser like :
...content here...
<td>20120011/</td>
<td>20120111/</td>
<td>20120211/</td>
<td>20120411/</td>
...content here...
I don't understand how to extract the 2nd line from the bottom.
1) I downloaded HTML with curl
content=$(curl -sL "http://path-to-html")
2) then used
dir=$(echo $content | sed '/.*href="\([0-9]*\/\)".*/!d;s//\1/;q')
which gives me the last match : 20120411.
But how to get the previous one ?
I don't know the total count of items.

This awk program will print the penultimate line:
echo "$content" | awk '{ pen = ult; ult = $0 } END { print pen }'
This will print the penultimate matching line:
echo "$content" | awk '/href="([0-9]{8}\/)"/ { pen = ult; ult = $0 } END { print pen }'
If you just want to extract the first capture group (the three-argument match() is a GNU awk extension):
echo "$content" | awk 'match($0, /href="([0-9]{8}\/)"/, a) { pen = ult; ult = a[1] } END { print pen }'
Putting it all together:
bash-4.2$ dir=$(curl -sL http://www.arteetmarte.no/tmp/index.html |
awk 'match($0, /href="([0-9]{8}\/)"/, a) {
pen = ult
ult = a[1]
}
END {
print pen
}
')
bash-4.2$ echo ${dir}
20130918/
Tested with: GNU Awk 4.1.0, API: 1.0
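Where GNU awk is not available, the same "remember the last two matches" idea can be written with the two-argument match() and substr(). This is only a sketch: the function name penultimate_dir is made up, and it assumes at most one href="digits/" link per input line.

```shell
#!/bin/sh
# Portable variant: no gawk-only third argument to match().
penultimate_dir() {
  awk '
    match($0, /href="[0-9]+\/"/) {
      pen = ult
      # Skip the 6 characters of href=" and drop the closing quote.
      ult = substr($0, RSTART + 6, RLENGTH - 7)
    }
    END { print pen }
  '
}

printf '%s\n' \
  '<td><a href="20120211/">20120211/</a></td>' \
  '<td><a href="20120411/">20120411/</a></td>' | penultimate_dir
```

When fewer than two lines match, pen is never assigned and an empty line is printed.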

It may be a bit easier with awk:
dir=$(echo "$content"|awk '/href=/{x=p;p=$0}END{sub(/.*">/,"",x);sub(/<.*/, "",x); print x}')

dir=$(echo "$content" | sed -n '/href="\([0-9]\{1,\}\/\)"/ {s|.*href="\([0-9]\{1,\}/\)".*|-\1-|;H;}
$ {x;l;s|.*-\([0-9]\{1,\}/\)-\(\n-[0-9]\{1,\}/-\)\{1\}$|\1|p;}')
The 1 in \{1\}$ specifies how many lines are skipped from the end.

How can I check the balance of ASCII images using bash?

I have some large ASCII images that I want to check are symmetrical. Say I have the following file:
***^^^MMM
*^**^^MMM
**^^^^^MMMMM
The first line is what I want: the sections are all separated and each has the same number of characters (it doesn't have to be 3 of each every time, though). The next two are not what I want. I want to count the number of *'s in a row, and then make sure the same number of ^'s and M's follow them. I'm trying to get some symmetry on each line, so this would be good:
**^^MM
**********^^^^^^^^^^MMMMMMMMMM
****^^^^MMMM
*^M
etc.
How can I scan through a file and maybe grep the problem lines?
I tried a few loops with cat ASCIIfile | sed 's/\^//g' | sed 's/M//g' | wc -c, assigning the output to a variable and then comparing the count to the other character counts, but obviously this doesn't take order into account, so lines like *^*^*M^MM were passing.

Using perl:
perl -ne ' { $l=$_; chomp; ($v)=/^((.)\2*)/; $t=length($v); \
s/M{$t}//;s/\^{$t}//;s/\*{$t}//; \
print $l if length } ' input_file
Using bash/sed:
while read line; do
m=$(echo $line | sed 's/[^M]*\([M][M]*\)[^M]*/\1/' | wc -c)
s=$(echo $line | sed 's/[^*]*\([*][*]*\)[^*]*/\1/' | wc -c)
n=$(echo $line | sed 's/[^\^]*\([\^][\^]*\)[^\^]*/\1/' | wc -c)
if [[ $m -ne $s || $m -ne $n ]]; then
echo "- $line $m::$s::$n"
else
echo "+ $line $m::$s::$n"
fi
done < input_file

Pure Bash:
#!/bin/bash
for string in '***^^^MMM' '**^^MM' '****^^MMMM' '*^*M^'
do
flag=true
sym=true
patt=''
prevlen=${#string}
for c in '*' '^' 'M'
do
patt+="*\\$c"
sub="${string##$patt}"
sublen="${#sub}"
if $flag
then
flag=false
((compare = prevlen - sublen ))
else
if (( prevlen - sublen != compare ))
then
printf '%s\n' "$string is NOT symmetrical"
sym=false
break
fi
fi
prevlen=$sublen
done
if $sym
then
printf '%s\n' "$string IS symmetrical"
fi
done
To read from a file, change the first for loop to while read -r string and add < filename after the last done on the same line.
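The requirement can also be phrased as two conditions: the line must be a run of *, then a run of ^, then a run of M, and the three runs must have the same length. Below is a minimal awk sketch of that idea, which prints only the problem lines; the function name unbalanced_lines is made up for illustration.

```shell
#!/bin/sh
# Print the lines that are NOT of the form  n stars, n carets, n Ms.
unbalanced_lines() {
  awk '{
    line = $0
    # Order check: stars, then carets, then Ms, and nothing else.
    ordered = (line ~ /^[*]+\^+M+$/)
    # gsub returns the number of replacements, i.e. the character count.
    stars  = gsub(/[*]/, "", line)
    carets = gsub(/\^/, "", line)
    ems    = gsub(/M/, "", line)
    if (!(ordered && stars == carets && carets == ems)) print
  }' "$@"
}

printf '%s\n' '***^^^MMM' '*^**^^MMM' '**^^^^^MMMMM' | unbalanced_lines
```

With no arguments the function reads standard input; with file names it reads those files instead.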

Remove all the text using sed

Format:
[Headword]{}"UC(icl>restriction)"(Attributes);(gloss)
The testme.txt file has 2 lines
[testme] {} "acetify" (V,lnk,CJNCT,AJ-V,VINT,VOO,VOO-CHNG,TMP,Vo) <H,0,0>;
[newtest] {} "acid-fast" (ADJ,DES,QUAL,TTSM) <H,0,0>;
The expected output is this:
testme = acetify
newtest = acid-fast
What I have achieved so far is:
cat testme.txt | sed 's/\[//g' | sed 's/]//g' | sed 's/{}/=/g' | sed 's/\"//'
testme = acetify" (V,lnk,CJNCT,AJ-V,VINT,VOO,VOO-CHNG,TMP,Vo) <H,0,0>;
newtest = acid-fast" (ADJ,DES,QUAL,TTSM) <H,0,0>;
How do I remove all the text from the second " to the end of the line?

Remove everything after the doublequote-space-openparenthesis " (:
sed 's/" (.*//g'

The whole process might be a little quicker with awk:
awk 'NF > 0 { print $1 " = " $3 }' testme.txt | tr -d '[]"'

This is how you do it with awk instead of all those sed commands, which are unnecessary. What you want is field 1 and field 3; use gsub() to remove the quotes and brackets:
$ awk '{gsub(/\"/,"",$3);gsub(/\]|\[/,"",$1);print $1" = "$3}' file
testme = acetify
newtest = acid-fast

Your whole sequence of multiple calls to sed can be replaced by:
sed 's/\[\([^]]*\)][^"]*"\([^"]*\).*/\1 = \2/' inputfile
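For readers less used to basic regular expressions, here is the same command wrapped in a small function with each piece of the pattern annotated. The function name headword_gloss is hypothetical; the sample line is the first one from the question.

```shell
#!/bin/sh
# \[\([^]]*\)]   headword between [ and ], captured as \1
# [^"]*"         skip up to and including the first double quote
# \([^"]*\)      text up to the next double quote, captured as \2
# .*             discard the rest of the line
headword_gloss() {
  sed 's/\[\([^]]*\)][^"]*"\([^"]*\).*/\1 = \2/' "$@"
}

printf '%s\n' '[testme] {} "acetify" (V,lnk,CJNCT,AJ-V,VINT,VOO,VOO-CHNG,TMP,Vo) <H,0,0>;' |
  headword_gloss
```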
