Regexp with sed, alphabetical order without duplicates

I need an expression that matches letters only when they appear in alphabetical order without duplicates; whitespace is allowed.
For example:
abc d efg
abcd efg
bcdefg h
I have to use sed, so I can't use lookahead expressions.
sed reads a file, and in each line it must find the substring that matches the examples.
The best I have so far is:
sed -nr 's/^[a-g]*(a?b?c?d?e?f?g?)[a-g]*$/\1/gp' test.txt
It doesn't handle whitespace, and in fact it doesn't work at all.

For letters in the range [a-h], I suggest trying:
sed -nr '/^a? *b? *c? *d? *e? *f? *g? *h? *$/p' test.txt

With GNU sed:
sed -nE '/^a{0,1} *b{0,1} *c{0,1} *d{0,1} *e{0,1} *f{0,1} *g{0,1} *h{0,1} *i{0,1} *j{0,1} *k{0,1} *l{0,1} *m{0,1} *n{0,1} *o{0,1} *p{0,1} *q{0,1} *r{0,1} *s{0,1} *t{0,1} *u{0,1} *v{0,1} *w{0,1} *x{0,1} *y{0,1} *z{0,1} *$/p' file

cat test.txt | sed -e "/\([a-z]\).*\1/d" | grep -E "^ *a* *b* *c* *d* *e* *f* *g* *h* *i* *j* *k* *l* *m* *n* *o* *p* *q* *r* *s* *t* *u* *v* *w* *x* *y* *z* *$"
or
grep -E "^ *a? *b? *c? *d? *e? *f? *g? *h? *i? *j? *k? *l? *m? *n? *o? *p? *q? *r? *s? *t? *u? *v? *w? *x? *y? *z? *$" test.txt
or
sed -nE '/^ *a? *b? *c? *d? *e? *f? *g? *h? *i? *j? *k? *l? *m? *n? *o? *p? *q? *r? *s? *t? *u? *v? *w? *x? *y? *z? *$/p' test.txt
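Typing all 26 letters by hand is error-prone. As a minimal sketch (the function names build_ordered_pattern and ordered_lines are made up for illustration), the same "^ *a? *b? ... z? *$" pattern can be generated in a loop and handed to grep -E:

```shell
# Build the pattern "^ *a? *b? * ... z? *$" instead of typing it out.
build_ordered_pattern() {
    pat='^ *'
    for c in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        pat="${pat}${c}? *"
    done
    printf '%s$\n' "$pat"
}

# Print only the stdin lines whose letters appear in alphabetical order
# with no letter repeated (spaces are allowed anywhere).
ordered_lines() {
    grep -E "$(build_ordered_pattern)"
}

printf '%s\n' 'abc d efg' 'bcdefg h' 'acb' 'aab' | ordered_lines
```

Of the four test lines, only the first two should be printed. Note that an empty or all-space line also matches, because every letter in the pattern is optional.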

This might work for you (GNU sed):
sed -r 'h;s/ //g;/(.).*\1/d;s/.*/&\nzyxwvutsrqponmlkjihgfedcba/;:a;ta;/^\n/!s/^(.)(.*\n.*)\1.*/\2/;ta;/^.+\n/d;x'
Copy the line then remove spaces. If the line contains duplicates delete it. Otherwise starting from the front, remove each character in alphabetical order and if successful reinstate the original line. Otherwise delete the line.

sed -nE 's/[a-z]*(^a{0,1} *b{0,1} *c{0,1} *d{0,1} *e{0,1} *f{0,1} *g{0,1} *h{0,1} *i{0,1} *j{0,1} *k{0,1} *l{0,1} *m{0,1} *n{0,1} *o{0,1} *p{0,1} *q{0,1} *r{0,1} *s{0,1} *t{0,1} *u{0,1} *v{0,1} *w{0,1} *x{0,1} *y{0,1} *z{0,1} *)[a-z]*$/\1/gp' file.txt
This was enough for me. Thanks, everyone. Great answers.

Related

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to @potong's, but I find the regex much more readable: \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that has already been processed), and then it's just a matter of reassembling it in the substitution.
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
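The substr($0,1,2) call assumes the prefix is exactly two characters long. A hedged variation, assuming instead that the prefix is everything before the first digit (the function name add_prefix_awk is hypothetical), could look like this:

```shell
# Prefix = everything before the first digit on the line (an assumption,
# not something stated in the question). Lines with no digit are printed as-is.
add_prefix_awk() {
    awk '{
        if (match($0, /[0-9]/)) {
            p = substr($0, 1, RSTART - 1)   # e.g. "A-" or "AB-"
            gsub(/,/, "&" p)                # "&" keeps the comma, p is appended
        }
        print
    }'
}

echo 'AB-1,20,300' | add_prefix_awk
```

This would break if the prefix itself contained & or a backslash, since gsub treats those specially in the replacement string.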
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma by the first column prefix and that number.
Assuming this is for shell scripting, you can do it with two sed calls (csh/tcsh syntax):
set string = "A-1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A-1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
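The set commands above are csh/tcsh syntax. For a POSIX sh or bash script, a rough equivalent of the same two-step idea (extract the prefix, then substitute) might look like the sketch below. The function name add_prefix_sh is made up, and it assumes the prefix is whatever precedes the first digit:

```shell
# Two-step approach in POSIX sh: extract the prefix, then substitute.
add_prefix_sh() {
    string=$1
    # Strip the longest suffix that starts with a digit, leaving e.g. "A-".
    prefix=${string%%[0-9]*}
    # Put the prefix after every comma. Assumes the prefix contains no
    # characters that are special to sed (such as / or &).
    printf '%s\n' "$string" | sed "s/,/,$prefix/g"
}

add_prefix_sh 'A-1,2,3,4,5'
add_prefix_sh 'B-1,20,300'
```

Using parameter expansion for the prefix avoids the first sed call entirely.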
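Could you please try the following (if you are OK with awk)? Note that the prefix "A-" is hard-coded here.
awk '
BEGIN{
  FS=OFS=","
}
{
  # add the prefix to every field that does not already start with A
  for(i=1;i<=NF;i++){
    if($i !~ /^A/ && $i !~ /\"A/){
      $i="A-"$i
    }
  }
}
1' Input_file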
If your data is in a file named d, this was tried with GNU sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d

Calculate the string length in sed

I was forced to calculate the string length in sed. The string is always a nonempty sequence of a's.
sed -n ':c /a/! be; s/^a/1/; s/0a/1/; s/1a/2/; s/2a/3/; s/3a/4/; s/4a/5/; s/5a/6/; s/6a/7/; s/7a/8/; s/8a/9/; s/9a/a0/; /a/ bc; :e p'
It's quite long :) So now I wonder if it is possible to rewrite this script more concisely using the y or other sed command?
I know that it is better to use awk or another tool. However, this is not a question here.
Note that the sed script basically simulates a decimal full adder.
I guess it's cheating but:
sed 's/.//;s/./\n/g'|sed -n '$='
You can certainly shorten your existing version to:
sed -n ':c s/^a/1/; s/0a/1/; s/1a/2/; s/2a/3/; s/3a/4/; s/4a/5/; s/5a/6/; s/6a/7/; s/7a/8/; s/8a/9/; s/9a/a0/; tc; p'
Turns out using y/// is possible but I think it only shaves off a few characters, and \u is not portable:
sed -n '
:c;
s/^a/c/;
s/\([b-j]\)a/\u\1/;
y/BCDEFGHIJ/cdefghijk/;
s/ka/ab/;
tc;
y/bcdefghijk/0123456789/;
p
'
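To convince yourself that the shortened counter handles carries correctly, a small test harness can help. This is only a sketch for GNU sed; the helper names count_as and make_as are invented here:

```shell
# The shortened counter from above, wrapped so it can be reused.
count_as() {
    sed -n ':c s/^a/1/; s/0a/1/; s/1a/2/; s/2a/3/; s/3a/4/; s/4a/5/; s/5a/6/; s/6a/7/; s/7a/8/; s/8a/9/; s/9a/a0/; tc; p'
}

# make_as N prints one line consisting of N letters "a" (hypothetical helper).
make_as() {
    awk -v n="$1" 'BEGIN { while (n-- > 0) printf "a"; print "" }'
}

# Each output line should show the same number on both sides of the arrow.
for n in 1 9 10 11 99 100 123; do
    printf '%s -> %s\n' "$n" "$(make_as "$n" | count_as)"
done
```

The values 10, 11, 99 and 100 are chosen because they exercise the s/9a/a0/ carry rule, including a carry that propagates through two digits.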

Trimming a file with regular expressions / sed

I've got a file with several lines like this:
*wordX*-Sentence1.;Sentence2.;Sentence3.;Sentence4.
One of these Sentences may or may not contain wordX.
What I want is to trim the file to make it look like this:
*wordX*-Sentence1.;Sentence2.
Where Sentence3 was the first to contain wordX.
How can I do this with sed/awk?
Edit:
Here's a sample file:
*WordA*-This sentence does not contain what i want.%Neither does this one.;Not here either.;Not here.;Here is WordA.;But not here.
*WordB*-WordA here.;WordB here, time to delete everything.;Including this sentece.
*WordC*-WordA, WordB. %Sample sentence one.;Sample Sentence 2.;Sample sentence 3.;Sample sentence 4.;WordC.;Discard this.
And here is the desired output:
*WordA*-This sentence does not contain what i want.%Neither does this one.;Not here either.;Not here.
*WordB*-WordA here.
*WordC*-WordA, WordB. %Sample sentence one.;Sample Sentence 2.;Sample sentence 3.;Sample sentence 4.
This task is more suited to awk. Use following awk command:
awk -F ";" '/^ *\*.*?\*/ {printf("%s;%s\n", $1, $2)}' inFile
This assumes that the words you are trying to match are always wrapped in asterisks (*).
This might work for you (GNU sed):
sed -r 's/-/;/;:a;s/^(\*([^*]+)\*.*);[^;]*\2.*/\1;/;ta;s/;/-/;s/;$//' file
Convert the - following the wordX to a ;. Delete sentences containing wordX (working from the back to the front of the line). Replace the original -. Delete the last ;.
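As a quick check against the sample data, the steps described above can be wrapped in a function. This is a sketch for GNU sed, and trim_after_word is a made-up name. It uses [^;]* (zero or more) before the back-reference, so that a sentence that begins with the word, such as ";WordC." in the sample, is also matched:

```shell
# Cut each line at the first sentence that contains the *word* from its header.
trim_after_word() {
    sed -E 's/-/;/;:a;s/^(\*([^*]+)\*.*);[^;]*\2.*/\1;/;ta;s/;/-/;s/;$//' "$@"
}

printf '%s\n' \
    '*WordB*-WordA here.;WordB here, time to delete everything.;Including this sentece.' \
    '*WordC*-WordA, WordB. %Sample sentence one.;Sample Sentence 2.;Sample sentence 3.;Sample sentence 4.;WordC.;Discard this.' |
    trim_after_word
```

For these two lines the result should match the desired output shown in the question. Note that the first - on the line is assumed to be the one right after the header word.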
sed -r -e 's/\.;/\n/g' \
-e 's/-/\n/' \
-e 's/^(\*([^*]*).*\n)[^\n]*\2.*/\1/' \
-e 's/\n/-/' \
-e 's/\n/.;/g' \
-e 's/;$//'
(edit: added the -:\n swaps to handle a match in the first sentence.)

Using regular expression to extract substring

I want to extract the text from < to the next space in my log files.
$>cat messages.log
2013-03-24 19:32:37.231 <F280 [192.168.178.22]:5000 -- Unknown>, Msg:[Test1]
2013-03-24 19:32:37.547 <F281 [192.168.178.22]:5000 -- Unknown>, Msg:[Test2
Test3
Test4]
2013-03-24 19:32:38.833 <F280 [192.168.178.22]:5000 -- Unknown>, Msg:[Test5]
2013-03-24 19:32:42.222 <F281 [192.168.178.22]:5000 -- Unknown>, Msg:[Test6]
$>sed 's/.*\<\(.*\) \[.*/\1|/g' messages.log
F280|
F281|
Test3
Test4]
F280|
F281|
I almost got what I wanted except for the output with the newlines. So I'd like to have the following result:
F280|F281|F280|F281
What should the regular expression look like?
I wouldn't create an unreadable regexp for this; I'd use awk here:
$ awk -F'[< ]' '/^[0-9]+/{s?s=s"|"$4:s=s$4}END{print s}' file
F280|F281|F280|F281
Try this:
sed -n '/</{s/^.*<\([^ ]\+\) .*$/\1|/g;H;${x;s/\n//g;s/|$//;p}}' messages.log
Try something like that (you'll have nested groups), or turn on multiline option in regex:
(^.+<(\w+) .+$)+
Is it compulsory to use only grep, or are other commands also available?
I'd suggest:
grep "<.* " messages.log | sed 's/.*\<\(.*\) \[.*/\1|/g' | tr -d '\n' | sed 's/.$//'
The first grep removes lines that do not follow your desired pattern; your sed command then runs on what is left.
At that point the output should look like
F280|
F281|
F280|
F281|
The tr command then removes the newline at the end of each line (i.e. it concatenates the results), and the final sed removes the trailing pipe delimiter.
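In GNU sed, \< is a word-boundary anchor rather than a literal <, so the expression above works somewhat by accident. A hedged sketch of the same filter, extract, join idea with a literal < and paste doing the joining (the name extract_ids is invented):

```shell
# Print the token between "<" and the next space for every line that has one,
# then join the results with "|".
extract_ids() {
    sed -n 's/^[^<]*<\([^ ]*\) .*/\1/p' "$@" | paste -sd '|' -
}

printf '%s\n' \
    '2013-03-24 19:32:37.231 <F280 [192.168.178.22]:5000 -- Unknown>, Msg:[Test1]' \
    '2013-03-24 19:32:37.547 <F281 [192.168.178.22]:5000 -- Unknown>, Msg:[Test2' \
    'Test3' \
    'Test4]' |
    extract_ids
```

Because sed -n with the p flag prints only lines where the substitution succeeded, the continuation lines (Test3, Test4]) are dropped without a separate grep, and paste -s leaves no trailing delimiter to clean up.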

trim whitespace inside angle brackets in sed

I actually solved this while composing the question but I think it could be neater than the way I did it.
I wanted to trim whitespace and most punctuation, except URL-legal characters (from rdf/n3 entities), that appears inside <>s.
An example of the source text would be:
<this is a problem> <this_is_fine> "this is ok too" .
<http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .
The output needs to convert spaces to underscores and trim quotes and anything that isn't legal in a url/iri.
<http://This is a "problem"> => <http://This_is_a_problem>
These didn't work.
sed -e 's/\(<[^ ]*\) \(.*>\)/\1_\2/g' badDoc.n3 | head
sed '/</,/>/{s/ /_/g}' badDoc.n3 | head
My eventual solution, that seems to work, is:
sed -e ':a;s/\(<[^> ]*\) \(.*>\)/\1_\2/g;ta' badDoc.n3 | sed -e ':b;s/\(<[:/%_a-zA-Z0-9.\-]*\)[^><:/%_a-zA-Z0-9.\-]\(.*>\)/\1\2/g;tb' > goodDoc.n3
Is there a better way?
First of all, I would say that this is an interesting problem. It looks like a simple substitution problem, but once you get into it, it is not as easy as I thought. While looking for a solution, I really missed vim! :)
I don't know if sed is a must for this question. I would do it with awk (gawk, since the three-argument match() is a GNU extension):
awk '{t=$0;
while (match(t,/<[^>]*>/,a)>0){
m[++i]=a[0];n[i]=a[0];t=substr(t,RSTART+RLENGTH)
}
for(x in n){
gsub(/[\x22\x27]/,"",n[x])
gsub(/ /,"_",n[x])
sub(m[x],n[x])
}}1' file
A quick test with your example:
kent$ cat file
<this is a problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .
kent$ awk '{t=$0;
while (match(t,/<[^>]*>/,a)>0){
m[++i]=a[0];n[i]=a[0];t=substr(t,RSTART+RLENGTH)
}
for(x in n){
gsub(/[\x22\x27]/,"",n[x])
gsub(/ /,"_",n[x])
sub(m[x],n[x])
}}1' file
<this_is_a_problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContainsQuotesThatWillBreakThings> "This should be 'left alone'." .
Well, it is not really a one-liner; let's see if others come up with shorter solutions.