tr utility - add exceptions to bracket expression [[:punct:]]

tr utility - add exceptions to bracket expression [[:punct:]] - regex

I was wondering if there was a simple way to add exceptions to the [[:punct:]] bracket expression when using tr utility:
cat *.txt | tr '[[:punct:]]' '\012'
for instance: do not do anything if the punctuation characters are - or ).

You can use a negative look-ahead: '(?![-)])[[:punct:]]'
This will first check whether the next character is neither a -nor a ), and then check it for whether it's a punctuation character. Using a negative look-behind is also possible and might or might not be faster: '[[:punct:]](?<![-)])'
edit: since tr apparently doesn't support Regex (only basic POSIX), you should use another utility, e.g. sed: cat *.txt | sed -r 's/(?![\-#\/\\¤%+[&|=^\]$_*#])[[:punct:]]/\012/g'

Related

find recurring pattern with `sed`

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!

With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: Simple explanation would be, using echo to print values and sending it as a standard input to awk program. In awk program using its match function to match regex mentioned in it(/[0-9]+M) running loop to find all matches in each line and printing the collected matched values at last of each line.

This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines (including the last one) to restore the last newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file

The problem is that the .* is greedy. Since only M is obligatory, when the engine finds last M, it satisfies the regex, so all string is matched, M is captured and thus kept after replacing with \1 backreference.
That means, you can't easily do this with sed. You can do that with Perl much easier since it supports matching and skipping pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
See the online demo. The pattern matches
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.

Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 matches whatever \(...\) matches and captures, so your substitution is replacing foo... by foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/'g
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’ – it matches as much as it possibly can, thus including the digits before the M. In the example above, the [A-LN-Z] is required to be at the end of the uncaptured part, so the digits are forced to be matched by the [0-9] inside the capture.
Getting a clear idea of what ‘greedy’ means is a really important idea when writing or interpreting regexps.

If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7

With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces91 or more occurrences) followed by - digits(1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.

This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
If there is replacement, sub() returns true.

$ cut -d'"' -f2 file
TEXT I WANT TO KEEP

You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//'
"TEXT I WANT TO KEEP"

The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.

A simpler one for sed:
sed 's/^[^"]*//' myfile.txt

If the "text you want to keep" always surrounded by the quote like this and only them having the quote in the line starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.

Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r ‘s/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//’
Cheers,
Pedro

sed does not match the regex

I've wrote this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be :
_HELLO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?

sed is not particularily well suited for this task, as it really is good at applying patterns to lines, less so to words, making the regexes overly complicated.
word-oriented solution
anyhow, here's an attempt, using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
the first expression replaces any word that is neither starting nor ending with underscores (and any trailing whitespace) by nought. \< indicates the beginning of a word, and \> the ending; so \<\([^_][^\>]*[^_]\)\> translates to "at the beginning \< there is no underscore [^_], followed by any number of characters not ending the word [^\>]. followed by a character that is not an underscore [^_] right before the word ends \>
the second expression is simpler and replaces any word solely consisting of underscores with nought.
line oriented processing
if you can arrange for your data to be one expression per line you can use something like the following
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed-term is almost the one you gave (though for some reasons sed doesn't like the +, so I use a workaround with * instead.
The basic trick is to use the -n flag to disable the default printing of lines and to use the p command to explicitely print matching lines.

I am still not sure what you are asking, so I answer what I guess you are asking. My guess is, that you want to find strings surrounded by underscores with Sed. The short answer is: no. The longer is: you can not find overlapping string parts with Sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD

There are multiple problem :
your sed command is a condition. It should be an action, as s/pattern/replacement/flags or the condition could be followed by an action, i.e. /_([^_+\n][\w]+)_/p to print the line.
with sed, you either need to escape your parentheses and + or to use the -rregex-extended flag
[\w] : \w is already a character class by itself, no need to encase it in a class
Finally, a shot at what I think you want with GNU grep :
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough and easier if you only need to match.
-o will able you to retrieve only the matched part rather than the whole line
-P uses perl regexes so that you can use shorthand classes as \n and \s
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.

As many answers and the downvotes suggest, sed doesn't look like the right tool to use for this question, so I ended up using Python, which worked out really well, so I will just post it here for anyone in the future who might have same problem.
import re
p = re.compile('_([^_+\n][\w ]+)_')
result = p.findall(text)

Grep Regex Exclusion Special Character

I am having a difficult time trying to search for a phrase but exclude the phrase if it is directly followed by a colon-space.
I am looking for Delet! (i.e. "Delet.*" in regex syntax) but I do not want anything returned that is "Deleted: " (includes a space after the colon). However, I would like anything returned that is "Deleted" followed by anything other than a colon-space.
I have tried the following expressions
grep -ri 'delet.*[^:]'
grep -ri 'delet[a-zA-Z0-9\;\".....]{0,10}'
(including all special characters in the range preceded by escapes)

Using a lookahead expression:
grep -Pi 'Delet(?!ed: )'
Note the modification of the parameters of grep: -P enables the use of lookahead expressions.

Try this. The ? after the * instructs it to select as few non-space characters as possible, followed by any one character that is not a colon, followed by a space.
grep -ri 'delet[^ ]*?[^:] '

If I got you correctly you want anything starting with delet, and not starting with deleted::
grep -Ei '^delet((([^e]|e$)|e([^d]|d$)|ed([^:]|:$)|ed:[^ ]).*)?$'
This basically says:
Match [start]deletX[anything][end] or [start]delete[end] where X is not e
Match [start]deleteX[anything][end] or [start]deleted[end] where X is not d
Match [start]deletedX[anything][end] or [start]deleted:[end] where X is not :
Match [start]deleted:X[anything][end] where X is not space.
It would have been far easier to use pipe and second negative grep if that is applicable:
grep -i ^delet | grep -vi '^deleted: '

It sounds like all you need is:
awk -v IGNORECASE=1 '/delet/ && !/deleted: /' file
The above uses GNU awk for IGNORECASE, other awks would use tolower().
The benefit of awk over grep is that awk tests for conditions, not just regexps, so you can create compound conditions using && and || out of tests for regexps which makes it MUCH simpler and clearer to just code the condition you want to test - that the line contains delet and (&&) not (!) deleted:.

Bash- How to convert non-alphanumerical character to "_"

I am trying to store user input in a variable and clean that variable in order to keep only alphanumerical caract + some others (I mean [a-zA-Z0-9-_]).
I tried using this but it isn't exhaustive :
SERVICE_NAME=$(echo $SERVICE_NAME | tr A-Z a-z | tr ' ' _ | tr \' _ | tr \" _)
Do you have some help for this?

Bash's string substitution is a fine thing: ${var//pat/rep}
val='Foo$%!*#BAR###baZ'
echo ${val//[^a-zA-Z_-]/_}
Foo_____BAR___baZ
A small explanation: The slash introduces a search/replace, a little like in sed (where it just delimits patterns). But you use a single slash for one replacement:
val='Foo$%!*#BAR###baZ'
echo ${val/[^a-zA-Z_-]/_}
Foo_%!*#BAR###baZ
Two slashes // mean replace all. Uncommon, but it has some logic, multiple slashes to mean multiple replace (please excuse my poor English).
And note how the $ is separated from the variable, but it is hard to modify a literal constant this way (which would be nice for testing). Modifying $1 isn't a no-brainer as well, afaik.

$ echo 'asd!#QCW##D' | tr A-Z a-z | sed -e 's/[^a-zA-Z0-9\-]/_/g'
asd__qcw__d
I would use sed for this and use the ^ (not) operator in your set of valid characters and replace everything else with an underscore. The above shows the syntax with the output.
And, as a bonus, if you want to replace a run of invalid characters with one underscore, just add + to your regular expression (and use the -r switch to sed to make it use extended regular expressions:
$ echo 'asd!#QCW##D' | tr A-Z a-z | sed -r 's/[^a-zA-Z0-9\-]+/_/g'
asd_qcw_d

I believe it can all be done in 1 single sed command like this:
echo 'Foo$%!*#BAR###baZ' | sed -e 's/[A-Z]/\L&/g' -e 's/[^a-z0-9\-]/_/g'
OUTPUT
foo_____bar___baz

perl way:
perl -ple 's/[^\w\-]/_/g'
pure bash way
a='foo-BAR_123,.:goo'
echo ${a//[^[:alnum:]-]/_}
produces:
foo-BAR_123___goo

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

tr utility - add exceptions to bracket expression [[:punct:]] - regex

I was wondering if there was a simple way to add exceptions to the [[:punct:]] bracket expression when using tr utility: cat *.txt | tr '[[:punct:]]' '\012' for instance: do not do anything if the punctuation characters are - or ).

Related

find recurring pattern with `sed`

How to use grep/sed/awk, to remove a pattern from beginning of a text file

sed does not match the regex

Grep Regex Exclusion Special Character

Bash- How to convert non-alphanumerical character to "_"

Categories

Resources