Inserting underscores in strings using sed - regex

I'm trying to use sed to insert _ before every uppercase letter of a string of non-whitespace characters, unless it's at its beginning. (I want to convert strings that are in camelcase and occasionally contain several adjacent uppercase letters or even punctuation signs.)
Desired behavior:
Input:
AaAaAa AAA AAA
Output:
Aa_Aa_Aa A_A_A A_A_A
I tried to use the following command:
sed -e "s/\(\S\)\([[:upper:]]\)/\1_\2/g"
But it fails on the last two strings in the above input, yielding this:
Aa_Aa_Aa A_AA A_AA
And I don't really understand why.
I'm using GNU sed 4.2.2.

I am assuming your example is mistyped because Aa Aa Aa given to the substitution you gave does nothing. And it's also not a camel case identifier. It should be AaAaAa, correct?
If so, then you can get sed to do what you need by causing it to loop until no more substitutions occur:
echo "AaAaAa AAA AAA" | sed -e ':x;s/\([^[:space:]_]\)\([[:upper:]]\)/\1_\2/g;tx'
produces
Aa_Aa_Aa A_A_A A_A_A
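The loop-until-stable idea can also be sketched outside sed. The snippet below is only an illustration in Python, under the assumption that [A-Z] is an acceptable stand-in for [[:upper:]]; the function name insert_underscores_loop is made up for this example.

```python
import re

# Same idea as the sed pattern: a non-space, non-underscore character
# followed by an uppercase letter. [A-Z] stands in for [[:upper:]] (assumption).
PAIR = re.compile(r"([^\s_])([A-Z])")

def insert_underscores_loop(text: str) -> str:
    """Repeat the substitution until a pass changes nothing (like ':x; s///g; tx')."""
    while True:
        text, count = PAIR.subn(r"\1_\2", text)
        if count == 0:
            return text

print(insert_underscores_loop("AaAaAa AAA AAA"))  # Aa_Aa_Aa A_A_A A_A_A
```

Excluding _ from the first character class is what makes the loop terminate: once an underscore has been inserted, that position no longer matches.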

This might work for you (GNU sed):
sed -r 'y/_/\n/;s/[[:upper:]]/_&/g;s/\b_//g;y/\n/_/' file
Convert all _'s to a unique alternative. Insert _'s in front of uppercase characters. Remove any leading _'s. Reconvert the original _'s.
If you don't have any leading _'s in the first place, then this suffices:
sed -r 's/[[:upper:]]/_&/g;s/\b_//g' file

The problem is that with a single s///g, regex matches can't overlap (and results of an earlier substitution aren't considered for further matches).
With AAA, the first match is
AAA
^^
| \
\1 \2
After replacement, we have A_AA, with the "current position" between the two rightmost A's:
A _ A A
^
next match attempt starts here
Then we try to match again, but we've run out of characters. \S matches the last A, but that's it: There's no uppercase character after that.
To make this work, we'd have to somehow match the middle A as both \2 of the first substitution and \1 of the second substitution, and I don't know how to do that with sed.
(It would be easy with perl because then you could use look-behind/look-ahead, which don't include the surrounding text in the match: perl -pe 's/(?<=\S)(?=[[:upper:]])/_/g')
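To see the non-overlapping behaviour in isolation, here is a small sketch using Python's re module as a stand-in for sed and perl (an assumption made for illustration; [A-Z] replaces [[:upper:]]). It contrasts the consuming pattern from the question with the zero-width look-around version.

```python
import re

text = "AaAaAa AAA AAA"

# Consuming version: both characters belong to the match, so the middle
# "A" of "AAA" cannot be \2 of one match and \1 of the next.
consuming = re.sub(r"(\S)([A-Z])", r"\1_\2", text)

# Zero-width version: the look-arounds inspect the neighbours without
# consuming them, so every boundary is found in a single pass.
zero_width = re.sub(r"(?<=\S)(?=[A-Z])", "_", text)

print(consuming)   # Aa_Aa_Aa A_AA A_AA
print(zero_width)  # Aa_Aa_Aa A_A_A A_A_A
```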

Related

replace last n parts after spliting on delimiter using sed or regex

I need to replace last 2 parts of the string separated by delimiter with empty space to clean up the name.
Example:
something-useful-a12356-78929
=>
something-useful
something-more-useful-v35f62-2728902
=>
something-more-useful
I tried the following:
echo "something-useful-12345-67890" | sed -re 's/(-([0-9])+)//g'
This works if my last 2 elements of delimiter are numbers only, but wouldn't work for the example above. I need to remove the last 2 parts after splitting it on "-"
I can only use sed or regex to solve this.
Does sed 's/\(-[^-]*\)\{2\}$//' file do what you want?
Use [^-] to match anything other than -. Use $ to match the end of the string. Match hyphen followed by non-hyphens twice at the end.
sed -r 's/(-[^-]+){2}$//'
This might work for you (GNU sed):
sed -re 's/-[^-]*//2g' file
Removes globally from the second occurrence of - followed by non - characters.
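For comparison, the "hyphen followed by non-hyphens, twice, anchored at the end" pattern can be tried out in Python. This is just a sketch; the helper name strip_last_two_parts is hypothetical.

```python
import re

def strip_last_two_parts(name: str) -> str:
    """Drop the last two '-'-separated parts, if the name has them."""
    # (-[^-]+){2}$ : a hyphen plus one or more non-hyphens, twice, at the end
    return re.sub(r"(-[^-]+){2}$", "", name)

print(strip_last_two_parts("something-useful-a12356-78929"))
# something-useful
print(strip_last_two_parts("something-more-useful-v35f62-2728902"))
# something-more-useful
```

A name with fewer than two trailing parts is returned unchanged, because the anchored pattern simply does not match.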

sed replace exact string that include brackets

I'm trying to replace an exact string that includes brackets. Let's say:
a[aa] to bbb, just to give an example.
I had used the following regex:
sed 's|\<a\[aa]\>|bbb|g' testfile
but it doesn't seem to work. This could be something really basic, but I have not been able to make it work, so I would appreciate any help on this.
You need to remove the trailing word boundary that requires a letter, digit or _ to immediately follow the ] char.
sed 's|\<a\[aa]|bbb|g' file
See the online sed demo:
s="say: a[aa] to bbb, not ba[aa]"
sed 's|\<a\[aa]|bbb|g' <<< "$s"
# => say: bbb to bbb, not ba[aa]
You may also require a non-word char with a capturing group and replace with a backreference:
sed -E 's~([^_[:alnum:]]|^)a\[aa]([^_[:alnum:]]|$)~\1bbb\2~g' file
Here, ([^_[:alnum:]]|^) captures any non-word char or start of string into Group 1 and ([^_[:alnum:]]|$) matches and captures into Group 2 any char other than _, digit or letter, or the end of string, and the \1 and \2 placeholders restore these values in the result. This, however, does not allow consecutive matches, so you may still use \< before a to play it safe: sed -E 's~\<a\[aa]([^_[:alnum:]]|$)~bbb\1~g' file.
To enforce whitespace boundaries you may use
sed -E 's~([[:space:]]|^)a\[aa]([[:space:]]|$)~\1bbb\2~g' file
Or, in your case, just a trailing whitespace boundary seems to be enough:
sed -E 's~\<a\[aa]([[:space:]]|$)~bbb\1~g' file
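The effect of a word boundary placed after ] can be reproduced in Python. Note that this is an analogy under an assumption: Python only has the symmetric \b, not sed's \< and \>, but the trailing boundary fails for the same reason, since ] is not a word character.

```python
import re

text = "say: a[aa] to bbb, not ba[aa]"

# Trailing \b needs a word character on one side; "]" followed by a space
# (or end of string) has none, so nothing is replaced.
with_trailing_boundary = re.sub(r"\ba\[aa\]\b", "bbb", text)

# Leading boundary only: matches the standalone a[aa] but not ba[aa].
leading_only = re.sub(r"\ba\[aa\]", "bbb", text)

print(with_trailing_boundary)  # say: a[aa] to bbb, not ba[aa]
print(leading_only)            # say: bbb to bbb, not ba[aa]
```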

Why doesn't grep work in pattern with colon

I know a colon (:) should be literal, so I'm not clear why a grep matches all lines. Here's a file called "test":
cat test
123|4444
4546|4444
666666|5678
7777777|7890675::1
I need to match the line with ::1. Of course, the real case is more complicated, so I can't simply search for "::1". I tried many iterations, like
grep -E '^[0-9]|[0-9]:' test
grep -E '^[0-9]|[0-9]::1' test
But they return all lines:
123|4444
4546|4444
666666|5678
7777777|7890675::1
I am expecting to match just the last line. Any idea why that is?
This is GNU/Linux bash.
The pipe needs to be escaped and you need to allow repeated digits:
grep -E '^[0-9]+\|[0-9]+:' test
Otherwise ^[0-9] is all that needs to match for a line to be retained by the grep.
Given:
$ echo "$txt"
123|4444
4546|4444
666666|5678
7777777|7890675::1
Use repetition (+ means 'one or more') and character classes:
$ echo "$txt" | grep -E '^[[:digit:]]+[|][[:digit:]]+[:]+'
7777777|7890675::1
Since | is a regex meta character, it has to be either escaped (\|) or in a character class.
There are two issues:
The regex [0-9] matches any single digit. Since you have multiple digits, you need to replace those parts with [0-9]+, which matches one or more digits. If you want to allow an empty sequence with no digits, replace the + with a *, which means “zero or more”.
The pipe character | means alternation in regex. What you provided will match either a digit at the start of the line, or a digit followed by a colon. Since every line has at least one of those, you match every line. To get a literal | character, you can use either [|] or \|; the second option is usually preferred in most styles.
Applying both of these, you get ^[0-9]+\|[0-9]+::1.
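Both issues can be checked quickly in Python, whose re syntax treats | and \| the same way grep -E does. This is a sketch that uses the sample lines from the question; the variable names are made up.

```python
import re

lines = ["123|4444", "4546|4444", "666666|5678", "7777777|7890675::1"]

# Unescaped |: "digit at line start" OR "digit followed by ::1".
# The first alternative alone is true for every line.
unescaped = [l for l in lines if re.search(r"^[0-9]|[0-9]::1", l)]

# Escaped \| plus repetition: digits, a literal pipe, digits, then ::1.
escaped = [l for l in lines if re.search(r"^[0-9]+\|[0-9]+::1", l)]

print(len(unescaped))  # 4
print(escaped)         # ['7777777|7890675::1']
```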
Another approach is to use a tool like awk that can process the fields of each line, and match lines where the 2nd field ends with "::1"
awk -F'|' '$2 ~ /::1$/' test

How to replace with one sed command first n letter to uppercase

I would like to convert the first n letters of a word to uppercase with a single sed command.
Example 'madrid' to 'MADrid'. (n=3)
I know how to change first letter to uppercase with this command:
sed -e "s/\b\(.\)/\U\1/g"
but I don't know how to change this command for my problem.
I tried to change
sed -e "s/\b\(.\)/\U\1/g"
to
sed -e "s/\b\(.\)/\U\3/g"
but this didn't work. I also googled and searched on this site, but I couldn't find an exact answer to my problem.
Thank you.
I infer from your use of \U that you're using GNU sed:
n=3
echo 'madrid' | sed -r 's/\<(.{'"$n"'})/\U\1/g' # -> 'MADrid'
I've omitted the unnecessary -e option
I have added -r to enable support for extended regular expressions, which have more familiar syntax and also offer more features.
I'm using a single-quoted sed script with a shell-variable value spliced in so as to avoid confusion between what the shell expands up front and what is interpreted by sed itself.
\< is used instead of \b, because unlike the latter it only matches at the start of a word. (Thanks, Casimir et Hippolyte.)
The above replaces any 3 characters at the start of a word, however.
To limit it to at most $n letters:
sed -r 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
As for what you've tried:
The \3 in your attempt sed -e "s/\b\(.\)/\U\3/g" refers to the 3rd capture group (parenthesized subexpression, (...)) in the regex, which doesn't exist; it does not refer to 3 repetitions.
Instead, you have to make sure that your one and only capture group (which you can reference as \1 in the substitution) itself captures as many characters as desired - which is what the {<n>} quantifier is for; the related {<m>,<n>} construct matches a range of repetitions.
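The "one capture group with a {1,n} quantifier" idea can be sketched in Python as well. Python's replacement strings have no \U, so this example uses a callable instead; the function name upper_first_n is hypothetical, and n is assumed to be at least 1.

```python
import re

def upper_first_n(text: str, n: int) -> str:
    """Uppercase up to n leading ASCII letters of every word (assumes n >= 1)."""
    # One capture group holding 1..n letters at a word boundary.
    pattern = re.compile(r"\b([A-Za-z]{1,%d})" % n)
    return pattern.sub(lambda m: m.group(1).upper(), text)

print(upper_first_n("madrid", 3))            # MADrid
print(upper_first_n("madrid barcelona", 3))  # MADrid BARcelona
```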
This might work for you (GNU sed):
sed -r 's/[a-z]/&\n/'"$n"';s/^([^\n]*)\n/\U\1/' file
Where $n is the first n letters. Putting the question of word boundaries aside this converts n letters of a-z consecutive or non-consecutive to upper case i.e. A-Z
N.B. this is two sed commands not one!

sed does not match the regex

I've written this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be :
_HELLO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?
sed is not particularly well suited for this task, as it really is good at applying patterns to lines, less so to words, which makes the regexes overly complicated.
word-oriented solution
anyhow, here's an attempt, using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
the first expression replaces any word that is neither starting nor ending with underscores (and any trailing whitespace) by nought. \< indicates the beginning of a word, and \> the ending; so \<[^_][^\> ]*[^_]\> translates to "at the beginning \< there is no underscore [^_], followed by any number of characters not ending the word [^\> ]*, followed by a character that is not an underscore [^_] right before the word ends \>".
the second expression is simpler and replaces any word solely consisting of underscores with nought.
line oriented processing
if you can arrange for your data to be one expression per line you can use something like the following
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed-term is almost the one you gave (though sed's default basic regular expressions treat an unescaped + as a literal character, so I use a workaround with * instead).
The basic trick is to use the -n flag to disable the default printing of lines and to use the p command to explicitly print matching lines.
I am still not sure what you are asking, so I answer what I guess you are asking. My guess is, that you want to find strings surrounded by underscores with Sed. The short answer is: no. The longer is: you can not find overlapping string parts with Sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
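The consumed-underscore effect is easy to reproduce. The sketch below uses Python's re module purely as an illustration (an assumption, since the answer itself goes on to use Perl): the first pattern consumes the closing underscore, the second only peeks at it with a lookahead.

```python
import re

text = "_HELLO_WORLD_"

# The closing underscore is part of the match, so it is no longer
# available as the opening underscore of the next match.
consumed = re.findall(r"_[^_]*_", text)

# The lookahead checks for the closing underscore without consuming it.
peeked = re.findall(r"_([^_]+)(?=_)", text)

print(consumed)  # ['_HELLO_']
print(peeked)    # ['HELLO', 'WORLD']
```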
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD
There are multiple problems:
your sed command is only a condition. It should be an action, such as s/pattern/replacement/flags, or the condition should be followed by an action, e.g. /_([^_+\n][\w]+)_/p to print the line.
with sed, you either need to escape your parentheses and +, or use the -r (--regexp-extended) flag (-E on OS X)
[\w] : \w is already a character class by itself, no need to encase it in a class
Finally, a shot at what I think you want with GNU grep :
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough and easier if you only need to match.
-o lets you retrieve only the matched part rather than the whole line
-P uses perl regexes so that you can use shorthand classes such as \n and \s
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
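Why the extra \s matters can be seen by running both variants on the sample input. This is a sketch in Python, whose re module accepts the same pattern as grep -P here; re.findall plays the role of -o, and the variable names are made up.

```python
import re

text = "HELLO ___ _HELO_WORLD_"

# Without \s in the negated class, the space after "___" is accepted and
# \w+ then swallows the following underscore-led word.
without_s = re.findall(r"_[^_+\n]\w+_", text)

# With \s excluded, the match can only start at the underscore before HELO.
with_s = re.findall(r"_[^_+\n\s]\w+_", text)

print(without_s)  # ['_ _HELO_WORLD_']
print(with_s)     # ['_HELO_WORLD_']
```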
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.
As many answers and the downvotes suggest, sed doesn't look like the right tool to use for this question, so I ended up using Python, which worked out really well, so I will just post it here for anyone in the future who might have same problem.
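import re
# Sample input from the question; substitute your own text here.
text = "HELLO ___ _HELO_WORLD_"
# Raw string so that \w reaches the regex engine unchanged.
p = re.compile(r'_([^_+\n][\w ]+)_')
result = p.findall(text)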