I'm trying to use sed with perl to replace ^[(s3B with an empty string in several files.
s/^[(s3B// isn't working though, so I'm wondering what else I could try.
You need to quote the special characters:
$ echo "^[(s3B AAA ^[(s3B"|sed 's/\^\[[(]s3B//g'
AAA
$ echo "^[(s3B AAA ^[(s3B" >file.txt
$ perl -p -i -e 's/\^\[[(]s3B//g' file.txt
$ cat file.txt
AAA
The problem is that there are several characters that have a special meaning in regular expressions. ^ is a start-of-line anchor, [ opens a character class, and ( opens a capture.
You can escape all non-alphanumerics in a Perl string by preceding it with \Q, so you can safely use
s/\Q^[(s3B//
which is equivalent to, and more readable than
s/\^\[\(s3B//
If you're dealing with ANSI sequences (xterm color sequences, escape sequences), then ^[ is not '^' followed by '[' but rather an unprintable character ESC, ASCII code 0x1B.
To put that character into a sed expression you need to use \x1B in GNU sed, or see http://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/ . You can also insert special characters directly into your command line using ctrl+v in Bash line editing.
In a regex, "^", "[" and "(" (and many others) are special characters reserved for regex features. To match the characters themselves, precede them with "\".
The correct substitution regex would be:
$string =~ s/\^\[\(s3B//g
The trailing g replaces all occurrences.
The SQL application that I'm using isn't properly escaping all of the strings that I have, so I'm trying to use sed to fix these instances. The issue is that I'll have this:
`some string of characters that may include hyphens'
and the quote at the end won't get escaped (yes that's supposed to be a ` not a quote).
My plan was to use this:
sed 's/[^\\]\'[^,]/&\\\'&/g' testfile.txt
Logic: anything that isn't a backslash, followed by a quote, then anything that isn't a comma, will be replaced by the same text with a backslash and a quote.
I would like testfile.txt to have all instances of ' replaced with \', but I just keep getting a > prompt, as if the command line weren't finished.
I tried this using GNU sed:
$ cat d
already escaped quote \' won't be escaped
$ sed -E "s/([^\\]|^)'([^,]|$)/\1\\\'\2/" d
already escaped quote \' won\'t be escaped
What you're looking for are called lookaround assertions, which let you match any ' that is not preceded by a \ and not followed by the end of the line. Unfortunately, sed doesn't support these. But you can use Perl:
perl -pe 's/(?<!\\)'\''(?!$)/\\'\''/g' testfile.txt
In unescaped form, this would look like s/(?<!\\)'(?!$)/\\'/g but we have to make allowances for the shell. No escapes are recognized in single quoted strings, so your original problem was \' not being recognized, and the string terminating early.
See here for example and detailed regexp breakdown: https://regex101.com/r/k8sonu/1
I need to replace each character of a regular expression, once evaluated, with each character plus the # symbol.
For example:
If the regular expression is: POS[AB]
and the input text is: POSA_____POSB
I want to get this result: P#O#S#A#_____P#O#S#B#
Please, using sed or awk.
I have tried this:
$ echo "POSA_____POSB" | sed "s/POS[AB]/&#/g"
POSA#_____POSB#
$ echo "POSA_____POSB" | sed "s/./&#/g"
P#O#S#A#_#_#_#_#_#P#O#S#B#
But what I need is:
P#O#S#A#_____P#O#S#B#
Thank you in advance.
Best regards,
Octavio
Perl to the rescue!
perl -pe 's/(POS[AB])/$1 =~ s:(.):$1#:gr/ge'
The /e interprets the replacement as code, and it contains another substitution which replaces each character with itself plus #.
In ancient Perls before 5.14 (i.e. without the /r modifier), you need to use a bit more complex
perl -pe 's/(POS[AB])/$x = $1; $x =~ s:(.):$1#:g; $x/ge'
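Since the question asked for sed or awk, here is a rough awk sketch of the same idea: find each match, decorate its characters, and leave everything else untouched. The function name mark_matches is hypothetical, and the regex is passed through -v, so backslashes in it would need doubling.

```shell
# mark_matches REGEX: append "#" to every character inside each match of REGEX.
# (Hypothetical helper; reads standard input, writes standard output.)
mark_matches() {
  awk -v re="$1" '{
    out = ""
    # RLENGTH > 0 guards against an endless loop on regexes that match "".
    while (match($0, re) && RLENGTH > 0) {
      m = substr($0, RSTART, RLENGTH)        # the matched text
      gsub(/./, "&#", m)                     # every character plus "#"
      out = out substr($0, 1, RSTART - 1) m  # unmatched prefix, then decorated match
      $0 = substr($0, RSTART + RLENGTH)      # continue after the match
    }
    print out $0
  }'
}

printf '%s\n' 'POSA_____POSB' | mark_matches 'POS[AB]'
```

Unlike the character-class answers, this leaves letters outside a match (for example the OOPS in "OOPS POSA") unchanged.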
echo "POSA_____POSB" | sed "s/[^_]/&#/g"
or
echo "POSA_____POSB" | sed "s/[POSAB]/&#/g"
Try this regex:
echo "POSA_____POSB" | sed "s/[A-Z]/&#/g"
Output:
P#O#S#A#_____P#O#S#B#
You may replace regex pattern using awk with sub (first matching substring, sed "s///") or gsub (substitute matching substrings globally, sed "s///g") commands. The regex themselves will not differ between sed and awk. In your case you want:
Solution 1
EDIT: edited to match the comments
The following awk will limit substitution to a given substring (e.g.'POSA_____POSB'):
echo "OOPS POSA_____POSB" | awk '{str="POSA_____POSB"}; {gsub(/[POSAB]/,"&#",str)}; {gsub(/'POSA_____POSB'/, str); print $0} '
If your input consists only of the matched string, try this:
echo "POSA_____POSB" | awk '{gsub(/[POSAB]/,"&#");}1'
Explanation:
The separate '{}' for each action and the explicit print are there for clarity's sake.
gsub accepts 3 arguments, gsub(pattern, substitution [, target]), where target must be a variable (gsub changes it in place and stores the result there).
We use var named 'str' and initialize it with value (your string) before doing any substitutions.
The second gsub is there to put modified str into $0 (matches the whole record/line).
The expressions are greedy by default --- they will match the longest string possible.
[] introduces a set of characters to be matched: every occurrence of any character in the set will be matched. The expression above tells awk to match each occurrence of any of "POSAB".
Your first regexp does not work as expected for you told sed to match POS ending in any of [AB] (the whole string at once).
In the other expression you told it to match any single character (including "_") when you used: '.' (dot).
If you want to generalize this solution you may use \w (a GNU extension in sed and gawk, written outside brackets), which matches any of [a-zA-Z0-9_], or [a-z], [A-Z], [0-9] to match lowercase letters, uppercase letters and digits respectively.
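To see the third argument of gsub in isolation, a small sketch (the variable names word and n are arbitrary): gsub returns the number of replacements and rewrites only the target variable, not the record.

```shell
# Copy field 2 into a variable, substitute inside that copy,
# then print the count, the modified copy, and the untouched record.
result=$(printf '%s\n' 'OOPS POSA' |
  awk '{ word = $2; n = gsub(/[POSAB]/, "&#", word); print n, word, $0 }')

printf '%s\n' "$result"
```

The O, P and S of OOPS stay as they are because the substitution ran on the copy of the second field only.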
Solution 2
Note that you might negate character sets with [^] so: [^_] would also work in this particular case.
Explanation:
Negation means: match anything but the character between '[]'. The '^' character must come as first char, right after opening '['.
Sidenotes:
Note that a bracket expression such as [POSAB] already matches exactly one character at a time; [POSAB]{1} says the same thing explicitly. Avoid [POSAB]? here, since it also matches the empty string.
Also note that some implementations of sed might need -r switch to use extended (more complicated) regexps.
With the given example you can use
echo "POSA_____POSB" | sed -r 's/POS([AB])/P#O#S#\1#/g'
This will fail for more complicated expressions.
When your input is without \v and \r, you can use
echo "POSA_____POSB" |
sed -r 's/POS([AB])/\v&\r/g; :loop;s/\v([^\r])/\1#\v/;t loop; s/[\v\r]//g'
$ cat file
anna
amma
kklks
ksklaii
$ grep '\`' file
anna
amma
kklks
ksklaii
Why? How is that match working ?
This appears to be a GNU extension for regular expressions. The backtick ('\`') anchor matches the very start of a subject string, which explains why it is matching all lines. OS X apparently doesn't implement the GNU extensions, which would explain why your example doesn't match any lines there. See http://www.regular-expressions.info/gnu.html
If you want to match an actual backtick when the GNU extensions are in effect, this works for me:
grep '[`]' file
twm's answer provides the crucial pointer, but note that it is the sequence \`, not ` by itself that acts as the start-of-input anchor in GNU regexes.
Thus, to match a literal backtick in a regex specified as a single-quoted shell string, you don't need any escaping at all, neither with GNU grep nor with BSD/macOS grep:
$ { echo 'ab'; echo 'c`d'; } | grep '`'
c`d
When using double-quoted shell strings - which you should avoid for regexes, for reasons that will become obvious - things get more complicated, because you then must escape the ` for the shell's sake in order to pass it through as a literal to grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\`"
c`d
Note that, after the shell has parsed the "..." string, grep still only sees `.
To recreate the original command with a double-quoted string with GNU grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\\\`" # !! BOTH \ and ` need \-escaping
ab
c`d
Again, after the shell's string parsing, grep sees just \`, which to GNU grep is the start-of-the-input anchor, so all input lines match.
Also note that since grep processes input line by line, \` has the same effect as ^ the start-of-a-line anchor; with multi-line input, however - such as if you used grep -z to read all lines at once - \` only matches the very start of the whole string.
To BSD/macOS grep, \` simply escapes a literal `, so it only matches input lines that contain that character.
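Two spellings that should behave the same under both GNU and BSD/macOS grep, shown as a sketch with made-up sample lines: a fixed-string search with -F, and the one-character bracket expression from the earlier answer.

```shell
# Two sample lines; only the second contains a backtick.
lines=$(printf '%s\n' 'ab' 'c`d')

# -F treats the pattern as a fixed string, so no regex rules apply at all.
fixed=$(printf '%s\n' "$lines" | grep -F '`')

# A one-character bracket expression always matches that character literally.
bracket=$(printf '%s\n' "$lines" | grep '[`]')

printf '%s\n' "$fixed" "$bracket"
```

Both commands select only the line that contains the backtick.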
I have a text file with characters from different languages like (chinese, latin etc)
I want to remove all lines that contain these non-English characters. I want to keep all English letters (a-z), digits (0-9) and all punctuation.
How can I do it using unix tools like awk or sed.
Perl supports an [:ascii:] character class.
perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
You can use Awk, provided you force the use of the C locale:
LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file
Setting the environment variable LC_CTYPE=C (or LC_ALL=C) forces the use of the C locale for character classification. It changes the meaning of the character classes ([:alnum:], [:space:], etc.) so that they match only ASCII characters.
The /[^[:alnum:][:space:][:punct:]]/ regex matches lines containing any non-ASCII character. The ! before the regex inverts the condition, so only lines without any non-ASCII characters match. Since no action is given, the default action (print) is applied to the matching lines.
EDIT: This can also be done with grep:
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
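A small end-to-end check of the grep variant, as a sketch: ascii_only is a hypothetical wrapper name, LC_ALL=C is used instead of LC_CTYPE=C so that no other locale variable can override it, and the sample data is invented.

```shell
# ascii_only: drop every line that contains a byte outside the ASCII classes.
ascii_only() {
  LC_ALL=C grep -v '[^[:alnum:][:space:][:punct:]]'
}

# The second line contains "é", written here as its two UTF-8 bytes (octal 303 251).
kept=$(printf 'plain text, 123!\ncaf\303\251 au lait\n' | ascii_only)

printf '%s\n' "$kept"
```

Only the first sample line survives the filter.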
With GNU grep, which supports perl compatible regular expressions, you can use:
grep -P '^[[:ascii:]]+$' file
You can use egrep -v to return only lines not matching the pattern and use something like [^ a-zA-Z0-9.,;:'"?!-] as the pattern (include more punctuation as needed). The hyphen goes last so that it is taken literally instead of forming a range.
Hm, thinking about it, a double negation (-v and the inverted character class) is probably not that good. Another way might be ^[ a-zA-Z0-9.,;:'"?!-]*$.
You can also just filter for ASCII:
egrep -v "[^ -~]" foo.txt
I am trying to make a vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I'm still trying to figure sed out.
For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or
a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
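To see what the + quantifier buys you, compare the two substitutions in this sketch. It assumes a sed that understands -E and \n in the replacement (GNU sed does); the sample input puts a space, a tab and another space between two words.

```shell
# "alpha", then space + tab + space, then "beta".
input=$(printf 'alpha \t beta')

# With +, the whole run of blanks collapses into a single newline.
with_plus=$(printf '%s\n' "$input" | sed -E -e 's/[[:blank:]]+/\n/g')

# Without +, every blank becomes its own newline, leaving empty lines behind.
without_plus=$(printf '%s\n' "$input" | sed -E -e 's/[[:blank:]]/\n/g')

printf '%s\n' "$with_plus" | wc -l
printf '%s\n' "$without_plus" | wc -l
```

The first count is the number of words; the second is inflated by the empty lines the extra blanks produce.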
Perl Compatible Regexes
For those familiar with Perl-style shorthand classes: GNU sed (but not every sed) accepts \s, so you can use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ at the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error-prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care when trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.
The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.
All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
Option 1
printf '%s\n' $(cat testfile)
Option 2
tr ' ' '\n' < testfile
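Option 2 as written converts single spaces only. A sketch that also handles tabs and runs of blanks: tr -s translates and then squeezes repeated output characters. The second set is spelled with two newlines so that both sets have the same length, which avoids relying on how a given tr pads a shorter set.

```shell
# Translate space and tab to newline, then squeeze runs of newlines into one.
words=$(printf 'alpha \t beta\tgamma\n' | tr -s ' \t' '\n\n')

printf '%s\n' "$words"
```

Because -s squeezes every run of newlines, blank lines already present in the input disappear as well.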
This should do the work:
sed -E -e 's/[ \t]+/\n/g'
[ \t] means a space OR a tab. If you want any kind of whitespace, you could also use \s.
[ \t]+ means one or more spaces OR tabs; the -E option is what makes + a quantifier.
s/x/y/ means replace the pattern x by y (here \n is a newline).
The g at the end means the replacement is repeated for every occurrence in each line.
You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence
You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new
Using gawk:
gawk '{$1=$1}1' OFS="\n" file
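Why this works, as a sketch using plain awk instead of gawk: assigning to any field, even $1 = $1, makes awk rebuild the record by joining the fields with OFS. With OFS set to a newline, each field lands on its own line, and leading or trailing blanks vanish because they never belonged to a field.

```shell
# Input has leading blanks, a tab, and a run of spaces between words.
words=$(printf '  alpha \t beta  gamma \n' | awk -v OFS='\n' '{ $1 = $1; print }')

printf '%s\n' "$words"
```

An empty input line has no fields, so it comes out as an empty line.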