Calculate the string length in sed - regex

I was forced to calculate the string length in sed. The string is always a nonempty sequence of a's.
sed -n ':c /a/! be; s/^a/1/; s/0a/1/; s/1a/2/; s/2a/3/; s/3a/4/; s/4a/5/; s/5a/6/; s/6a/7/; s/7a/8/; s/8a/9/; s/9a/a0/; /a/ bc; :e p'
It's quite long :) So now I wonder if it is possible to rewrite this script more concisely using the y or other sed command?
I know that it is better to use awk or another tool. However, this is not a question here.
Note that the sed script basically simulates decadic full adder.

I guess it's cheating but:
sed 's/.//;s/./\n/g'|sed -n '$='
You can certainly shorten your existing version to:
sed -n ':c s/^a/1/; s/0a/1/; s/1a/2/; s/2a/3/; s/3a/4/; s/4a/5/; s/5a/6/; s/6a/7/; s/7a/8/; s/8a/9/; s/9a/a0/; tc; p'
Turns out using y/// is possible but I think it only shaves off a few characters, and \u is not portable:
sed -n '
:c;
s/^a/c/;
s/\([b-j]\)a/\u\1/;
y/BCDEFGHIJ/cdefghijk/;
s/ka/ab/;
tc;
y/bcdefghijk/0123456789/;
p
'

Related

How to swap text based on patterns at once with sed?

Suppose I have 'abbc' string and I want to replace:
ab -> bc
bc -> ab
If I try two replaces the result is not what I want:
echo 'abbc' | sed 's/ab/bc/g;s/bc/ab/g'
abab
So what sed command can I use to replace like below?
echo abbc | sed SED_COMMAND
bcab
EDIT:
Actually the text could have more than 2 patterns and I don't know how many replaces I will need. Since there was a answer saying that sed is a stream editor and its replaces are greedily I think that I will need to use some script language for that.
Maybe something like this:
sed 's/ab/~~/g; s/bc/ab/g; s/~~/bc/g'
Replace ~ with a character that you know won't be in the string.
I always use multiple statements with "-e"
$ sed -e 's:AND:\n&:g' -e 's:GROUP BY:\n&:g' -e 's:UNION:\n&:g' -e 's:FROM:\n&:g' file > readable.sql
This will append a '\n' before all AND's, GROUP BY's, UNION's and FROM's, whereas '&' means the matched string and '\n&' means you want to replace the matched string with an '\n' before the 'matched'
sed is a stream editor. It searches and replaces greedily. The only way to do what you asked for is using an intermediate substitution pattern and changing it back in the end.
echo 'abcd' | sed -e 's/ab/xy/;s/cd/ab/;s/xy/cd/'
Here is a variation on ooga's answer that works for multiple search and replace pairs without having to check how values might be reused:
sed -i '
s/\bAB\b/________BC________/g
s/\bBC\b/________CD________/g
s/________//g
' path_to_your_files/*.txt
Here is an example:
before:
some text AB some more text "BC" and more text.
after:
some text BC some more text "CD" and more text.
Note that \b denotes word boundaries, which is what prevents the ________ from interfering with the search (I'm using GNU sed 4.2.2 on Ubuntu). If you are not using a word boundary search, then this technique may not work.
Also note that this gives the same results as removing the s/________//g and appending && sed -i 's/________//g' path_to_your_files/*.txt to the end of the command, but doesn't require specifying the path twice.
A general variation on this would be to use \x0 or _\x0_ in place of ________ if you know that no nulls appear in your files, as jthill suggested.
Here is an excerpt from the SED manual:
-e script
--expression=script
Add the commands in script to the set of commands to be run while processing the input.
Prepend each substitution with -e option and collect them together. The example that works for me follows:
sed < ../.env-turret.dist \
-e "s/{{ name }}/turret$TURRETS_COUNT_INIT/g" \
-e "s/{{ account }}/$CFW_ACCOUNT_ID/g" > ./.env.dist
This example also shows how to use environment variables in your substitutions.
This might work for you (GNU sed):
sed -r '1{x;s/^/:abbc:bcab/;x};G;s/^/\n/;:a;/\n\n/{P;d};s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/;ta;s/\n(.)/\1\n/;ta' file
This uses a lookup table which is prepared and held in the hold space (HS) and then appended to each line. An unique marker (in this case \n) is prepended to the start of the line and used as a method to bump-along the search throughout the length of the line. Once the marker reaches the end of the line the process is finished and is printed out the lookup table and markers being discarded.
N.B. The lookup table is prepped at the very start and a second unique marker (in this case :) chosen so as not to clash with the substitution strings.
With some comments:
sed -r '
# initialize hold with :abbc:bcab
1 {
x
s/^/:abbc:bcab/
x
}
G # append hold to patt (after a \n)
s/^/\n/ # prepend a \n
:a
/\n\n/ {
P # print patt up to first \n
d # delete patt & start next cycle
}
s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/
ta # goto a if sub occurred
s/\n(.)/\1\n/ # move one char past the first \n
ta # goto a if sub occurred
'
The table works like this:
** ** replacement
:abbc:bcab
** ** pattern
Tcl has a builtin for this
$ tclsh
% string map {ab bc bc ab} abbc
bcab
This works by walking the string a character at a time doing string comparisons starting at the current position.
In perl:
perl -E '
sub string_map {
my ($str, %map) = #_;
my $i = 0;
while ($i < length $str) {
KEYS:
for my $key (keys %map) {
if (substr($str, $i, length $key) eq $key) {
substr($str, $i, length $key) = $map{$key};
$i += length($map{$key}) - 1;
last KEYS;
}
}
$i++;
}
return $str;
}
say string_map("abbc", "ab"=>"bc", "bc"=>"ab");
'
bcab
May be a simpler approach for single pattern occurrence you can try as below:
echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
My output:
~# echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
bcab
For multiple occurrences of pattern:
sed 's/\(ab\)\(bc\)/\2\1/g'
Example
~# cat try.txt
abbc abbc abbc
bcab abbc bcab
abbc abbc bcab
~# sed 's/\(ab\)\(bc\)/\2\1/g' try.txt
bcab bcab bcab
bcab bcab bcab
bcab bcab bcab
Hope this helps !!
echo "C:\Users\San.Tan\My Folder\project1" | sed -e 's/C:\\/mnt\/c\//;s/\\/\//g'
replaces
C:\Users\San.Tan\My Folder\project1
to
mnt/c/Users/San.Tan/My Folder/project1
in case someone needs to replace windows paths to Windows Subsystem for Linux(WSL) paths
If replacing the string by Variable, the solution doesn't work.
The sed command need to be in double quotes instead on single quote.
#sed -e "s/#replacevarServiceName#/$varServiceName/g" -e "s/#replacevarImageTag#/$varImageTag/g" deployment.yaml
Here is an awk based on oogas sed
echo 'abbc' | awk '{gsub(/ab/,"xy");gsub(/bc/,"ab");gsub(/xy/,"bc")}1'
bcab
I believe this should solve your problem. I may be missing a few edge cases, please comment if you notice one.
You need a way to exclude previous substitutions from future patterns, which really means making outputs distinguishable, as well as excluding these outputs from your searches, and finally making outputs indistinguishable again. This is very similar to the quoting/escaping process, so I'll draw from it.
s/\\/\\\\/g escapes all existing backslashes
s/ab/\\b\\c/g substitutes raw ab for escaped bc
s/bc/\\a\\b/g substitutes raw bc for escaped ab
s/\\\(.\)/\1/g substitutes all escaped X for raw X
I have not accounted for backslashes in ab or bc, but intuitively, I would escape the search and replace terms the same way - \ now matches \\, and substituted \\ will appear as \.
Until now I have been using backslashes as the escape character, but it's not necessarily the best choice. Almost any character should work, but be careful with the characters that need escaping in your environment, sed, etc. depending on how you intend to use the results.
Every answer posted thus far seems to agree with the statement by kuriouscoder made in his above post:
The only way to do what you asked for is using an intermediate
substitution pattern and changing it back in the end
If you are going to do this, however, and your usage might involve more than some trivial string (maybe you are filtering data, etc.), the best character to use with sed is a newline. This is because since sed is 100% line-based, a newline is the one-and-only character you are guaranteed to never receive when a new line is fetched (forget about GNU multi-line extensions for this discussion).
To start with, here is a very simple approach to solving your problem using newlines as an intermediate delimiter:
echo "abbc" | sed -E $'s/ab|bc/\\\n&/g; s/\\nab/bc/g; s/\\nbc/ab/g'
With simplicity comes some trade-offs... if you had more than a couple variables, like in your original post, you have to type them all twice. Performance might be able to be improved a little bit, too.
It gets pretty nasty to do much beyond this using sed. Even with some of the more advanced features like branching control and the hold buffer (which is really weak IMO), your options are pretty limited.
Just for fun, I came up with this one alternative, but I don't think I would have any particular reason to recommend it over the one from earlier in this post... You have to essentially make your own "convention" for delimiters if you really want to do anything fancy in sed. This is way-overkill for your original post, but it might spark some ideas for people who come across this post and have more complicated situations.
My convention below was: use multiple newlines to "protect" or "unprotect" the part of the line you're working on. One newline denotes a word boundary. Two newlines denote alternatives for a candidate replacement. I don't replace right away, but rather list the candidate replacement on the next line. Three newlines means that a value is "locked-in", like your original post way trying to do with ab and bc. After that point, further replacements will be undone, because they are protected by the newlines. A little complicated if I don't say so myself... ! sed isn't really meant for much more than the basics.
# Newlines
NL=$'\\\n'
NOT_NL=$'[\x01-\x09\x0B-\x7F]'
# Delimiters
PRE="${NL}${NL}&${NL}"
POST="${NL}${NL}"
# Un-doer (if a request was made to modify a locked-in value)
tidy="s/(\\n\\n\\n${NOT_NL}*)\\n\\n(${NOT_NL}*)\\n(${NOT_NL}*)\\n\\n/\\1\\2/g; "
# Locker-inner (three newlines means "do not touch")
tidy+="s/(\\n\\n)${NOT_NL}*\\n(${NOT_NL}*\\n\\n)/\\1${NL}\\2/g;"
# Finalizer (remove newlines)
final="s/\\n//g"
# Input/Commands
input="abbc"
cmd1="s/(ab)/${PRE}bc${POST}/g"
cmd2="s/(bc)/${PRE}ab${POST}/g"
# Execute
echo ${input} | sed -E "${cmd1}; ${tidy}; ${cmd2}; ${tidy}; ${final}"

1-indexing to 0-indexing in grep/sed/awk

I'm parsing a template using linux command line using pipes which has some 1-indexed pseudo-variables, I need 0-indexing. Bascially:
...{1}...{7}...
to become
...{0}...{6}...
I'd like to use one of grep, sed or awk in this order of preference. I guess grep can't do that but I'm not sure. Is such arithmetic operation even possible using any of these?
numbers used in the file are in range 0-9, so ignore problems like 23 becoming 12
no other numbers in the file, so you can even ignore the {}
I can do that using python, ruby, whatever but I prefer not to so stick to standard command line utils
other commend line utils usable with pipes with regex that I don't know are fine too
EDIT: Reworded first bullet point to clarify
If the input allows it, you may be able to get away with simply:
tr 123456789 012345678
Note that this will replace all instances of any digit, so may not be suitable. (For example, 12 becomes 01. It's not clear from the question if you have to deal with 2 digit values.)
If you do need to handle multi-digit numbers, you could do:
perl -pe 's/\{(\d+)\}/sprintf( "{%d}", $1-1)/ge'
You can use Perl.
perl -pe 's/(?<={)\d+(?=})/$&-1/ge' file.txt
If you are sure you can ignore {...}, then use
perl -pe 's/\d+/$&-1/ge' file.txt
And if index is always just one-digit number, then go with shorter one
perl -pe 's/\d/$&-1/ge' file.txt
With gawk version 4, you can write:
gawk '
{
n = split($0, a, /[0-9]/, seps)
for (i=1; i<n; i++)
printf("%s%d", a[i], seps[i]-1)
print a[n]
}
'
Older awks can use
awk '
{
while (match(substr($0,idx+1), /[0-9]/)) {
idx += RSTART
$0 = substr($0,1, idx-1) (substr($0,idx,1) - 1) substr($0,idx+1)
}
print
}
'
Both less elegant than the Perl one-liners.

Can we really do without lazy quantifiers?

Many people say we can do without lazy quantifiers in regular expressions, but I've just run into a problem that I can't solve without them (I'm using sed here).
The string I want to process is composed of substrings separated by the word rate, for example:
anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate
I want to replace those substrings (apart from the word 'rate') with 3 dashes (---) each; the result should be:
---rate---rate---rate
From what I understand (I don't know Perl), it can be easily done using lazy quantifiers. In vim there are lazy quantifiers too; I did it using this command
:s/.\{-}rate/---rate/g
where \{-} tells vim to match as few as possible.
However, vim is a text editor and I need to run the script on many machines, some of which have no Perl installed. It could also be solved if you can tell the regex to not match an atomic grouping like .*[^(rate)]rate but that did not work.
Any ideas how to achieve this using POSIX regex, or is it impossible?
In a case like this, I would use split():
perl -n -e 'print join ("rate", ("---") x split /rate/)' [input-file]
Are there any characters that are guaranteed not to be in the input? For instance, if '!'
can't occur, you could transform the input to substitute that unique character, and then do a global replace on the transformed input:
sed 's/ rate /!/g' < input | sed -e 's/[^!]*/---/g' -e 's/!/rate/g'
Another alternative is to use awk's split command in an analogous way to
the perl suggestion above, assuming awk is any more reliably available than perl.
awk '
{ ans="---"
n=split($0, x, / rate /);
while ( n-- ) { ans = ans "rate---";}
print ans
}'
It's not easy without using lazy quantifiers or negative lookaheads (neither of which POSIX supports), but this seems to work.
([^r]*((r($|[^a]|a([^t]|$)|at([^e]|$))))?)+rate
I vaguely recall POSIX character classes being a bit persnickety. You may need to alter the character classes in that regex if they're not already POSIX-compliant.
The fact that you don't care about the contents of the substrings opens up a lot of options. For example, to add to Bob Lied's suggestion — even if '!' can occur in the input, you can start by changing it to something else:
sed -e 's/!/./g' -e 's/rate/!/g' -e 's/[^!]\+/---/g' -e 's/!/rate/g' <input >output
With awk:
awk -Frate '{
for (i = 0; ++i <= NF;)
$i = (i == 1 || i == NF) && $i == x ? x : "---"
}1' OFS=rate infile
Or, awk 'BEGIN {OFS=FS="rate"} {for (i=1; i<=NF-1; i++) {$i = "---"}; print}'

How do I remove duplicate characters and keep the unique one only in Perl?

How do I remove duplicate characters and keep the unique one only.
For example, my input is:
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
Expected output is:
EFUAH
UEH
UJHACDEF
I came across perl -pe's/$1//g while/(.).*\/' which is wonderful but it is removing even the single occurrence of the character in output.
This can be done using positive lookahead :
perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME
The regex used is: (.)(?=.*?\1)
. : to match any char.
first () : remember the matched
single char.
(?=...) : +ve lookahead
.*? : to match anything in between
\1 : the remembered match.
(.)(?=.*?\1) : match and remember
any char only if it appears again
later in the string.
s/// : Perl way of doing the
substitution.
g: to do the substitution
globally...that is don't stop after
first substitution.
s/(.)(?=.*?\1)//g : this will
delete a char from the input string
only if that char appears again later
in the string.
This will not maintain the order of the char in the input because for every unique char in the input string, we retain its last occurrence and not the first.
To keep the relative order intact we can do what KennyTM tells in one of the comments:
reverse the input line
do the substitution as before
reverse the result before printing
The Perl one line for this is:
perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME
Since we are doing print manually after reversal, we don't use the -p flag but use the -n flag.
I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.
if Perl is not a must, you can also use awk. here's a fun benchmark on the Perl one liners posted against awk. awk is 10+ seconds faster for a file with 3million++ lines
$ wc -l <file2
3210220
$ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null
real 1m1.761s
user 0m58.565s
sys 0m1.568s
$ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}' file2 > /dev/null
real 1m32.123s
user 1m23.623s
sys 0m3.450s
$ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null
real 1m17.818s
user 1m10.611s
sys 0m2.557s
$ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null
real 1m20.347s
user 1m13.069s
sys 0m2.896s
perl -ne'my%s;print grep!$s{$_}++,split//'
Here is a solution, that I think should work faster than the lookahead one, but is not regexp-based and uses hashtable.
perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'
It splits every line into characters and prints only the first appearance by counting appearances inside %seen hashtable
Tie::IxHash is a good module to store hash order (but may be slow, you will need to benchmark if speed is important). Example with tests:
use Test::More 0.88;
use Tie::IxHash;
sub dedupe {
my $str=shift;
my $hash=Tie::IxHash->new(map { $_ => 1} split //,$str);
return join('',$hash->Keys);
}
{
my $str='EFUAHUU';
is(dedupe($str),'EFUAH');
}
{
my $str='EFUAHHUU';
is(dedupe($str),'EFUAH');
}
{
my $str='UJUJHHACDEFUCU';
is(dedupe($str),'UJHACDEF');
}
done_testing();
Use uniq from List::MoreUtils:
perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
If the set of characters that can be encountered is restricted, e.g. only letters, then the easiest solution will be with tr
perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
It will replace all the letters by themselves, leaving other characters unaffected and /s modifier will squeeze repeated occurrences of the same character (after replacement), thus removing duplicates
Me bad - it removes only adjoining appearances. Disregard
This looks like a classic application of positive lookbehind, but unfortunately perl doesn't support that. In fact, doing this (matching the preceding text of a character in a string with a full regex whose length is indeterminable) can only be done with .NET regex classes, I think.
However, positive lookahead supports full regexes, so all you need to do is reverse the string, apply positive lookahead (like unicornaddict said):
perl -pe 's/(.)(?=.*?\1)//g'
And reverse it back, because without the reverse that'll only keep the duplicate character at the last place in a line.
MASSIVE EDIT
I've been spending the last half an hour on this, and this looks like this works, without the reversing.
perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME
I don't know whether to be proud or horrified. I'm basically doing the positive looakahead, then substituting on the string with \G specified - which makes the regex engine start its matching from the last place matched (internally represented by the pos() variable).
With test input like this:
aabbbcbbccbabb
EFAUUUUH
ABCBBBBD
DEEEFEGGH
AABBCC
The output is like this:
abc
EFAUH
ABCD
DEFGH
ABC
I think it's working...
Explanation - Okay, in case my explanation last time wasn't clear enough - the lookahead will go and stop at the last match of a duplicate variable [in the code you can do a print pos(); inside the loop to check] and the s/\G//g will remove it [you don't need the /g really]. So within the loop, the substitution will continue removing until all such duplicates are zapped. Of course, this might be a little too processor intensive for your tastes... but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.
From the shell, this works:
sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'
In words: mark every linebreak with a <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.
I found the -e :a -e '$!N; s/\n//; ta part in a forum post and I don't understand the seperate -e :a part, or the $!N part, so if anyone can explain those, I'd be grateful.
Hmm, that one does only consecutive duplicates; to eliminate all duplicates you could do this:
cat test.txt | while read line ; do echo $line | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done
That puts the characters in each line in alphabetical order though.
use strict;
use warnings;
my ($uniq, $seq, #result);
$uniq ='';
sub uniq {
$seq = shift;
for (split'',$seq) {
$uniq .=$_ unless $uniq =~ /$_/;
}
push #result,$uniq;
$uniq='';
}
while(<DATA>){
uniq($_);
}
print #result;
__DATA__
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
The output:
EFUAH
UEH
UJHACDEF
for a file containing the data you list named foo.txt
python -c "print set(open('foo.txt').read())"

select part of filename using regex

I got a file that looks like
dcdd62defb908e37ad037820f7 /sp/dir/su1/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/su2/89/sfdh.gz
ee1d443b8a0cc27749f4b31e56 /sp/dir/su3/89/24.gz
33c02e311fd0a894f7f0f8aae4 /sp/dir/su4/89/dfad.gz
43f6cdce067f6794ec378c4e2a /sp/dir/su5/89/adf.gz
2f6c584116c567b0f26dfc8703 /sp/dir/su6/895/895.gz
a864b7e327dac1bb6de59dedce /sp/dir/su7/895/895.gz
How do i use sed to substitue all the su* such that I can replace with a single value like
sed "s/REXEXP/newfolder/g" myfile
thanks in advance
I think you want
sed 's/su./newfolder/g'
If you actually want to keep the number in su1...su7 as a part of newfolder (for example newfolder1...newfolder7), you can do:
sed 's/su\(.\)/newfolder\1/g'
It also depends upon how "strict" do you want your patterns to be. The above will match su followed by any character and do the replacement. On the other hand, a command like s#/su\([0-9]\)/#/newfolder\1/#g will only match /su followed by a digit, followed by /. So you may need to adjust your pattern accordingly.
$ sed -e 's|/su[^/]*|/newfolder|' /tmp/files\
dcdd62defb908e37ad037820f7 /sp/dir/newfolder/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/newfolder/89/sfdh.gz
...
If you want to get rid of the checksums as well:
$ sed -r -e 's|/su[^/]*|/newfolder|' -e 's/^[^ ]+ +//' /tmp/files\
/sp/dir/newfolder/89/asga.gz
/sp/dir/newfolder/89/sfdh.gz
...
su[0-9] will match a single digit.
sed requires a dirty amount of metacharacter escaping, some of it may be slightly off.
sed -i -e 's/\/su[^\/]+\//\/newFolder\//g' myfile
I vote for Wayne Conrad's answer as the most likely to be what the OP wants, but I'd suggest using an alternate character for the sed expression separator, thus:
sed 's|/su[^/]*|/newfolder|' /tmp/files
That makes it a bit cleaner.
Note also that the trailing 'g' is probably not wanted.
use awk. since there is a delimiter you can use , '/'. after that, column 4 is what you want to change. So if you have paths like /sp/su3dir/su2/89/sfdh.gz , su3dir will not be affected.
awk -F"/" '{$4="newfolder";}1' OFS="/" file