Command line to remove in-line dupes - regex

What is a fast and succinct way to remove dupes from within a line?
I have a file in the following format:
alpha • a | b | c | a | b | c | d
beta • h | i | i | h | i | j | k
gamma •  m | n | o
delta • p | p | q | r | s | q
So there's a headword in column 1, and then various words delimited by pipes, with an unpredictable amount of duplication. The desired output has the dupes removed, as:
alpha • a | b | c | d
beta • h | i | j | k
gamma •  m | n | o
delta • p | q | r | s
My input file is a few thousand lines. The Greek names above correspond to category names (e.g., "baseball"), and the letters correspond to English dictionary words (which might contain spaces or accents), e.g. "ball game | batter | catcher | catcher | designated hitter".
This could be programmed many ways, but I suspect there's a smart way to do it. I encounter variations of this scenario a lot, and wonder if there's a concise and elegant way to do this. I am using MacOS, so a few fancy unix options are not available.
Bonus complexity, I often have a comment at the end which should be retained, e.g.,
zeta • x | y | x | z | z ; comment here
P.S. this input is actually the output of a prior StackOverflow question:
Command line to match lines with matching first field (sed, awk, etc.)

BSD awk does not have sort functions builtin where GNU awk does, but I'm not sure they're necessary. The bullet, • (U+2022), causes some grief with awk.
I suggest pre-processing the bullet to a single-byte character. I chose #, but you could use Control-A or something else if you prefer. I assume your data is in a file named data. I note that there was a double space before m in the gamma line; I'm assuming that isn't significant.
sed 's/•/#/' data |
awk -F ' *[#|] *' '
{
    delete names
    delete comments
    delete fields;
    if ($NF ~ / *;/) { split($NF, comments, / *; */); $NF=comments[1]; }
    j = 1;
    for (i = 2; i <= NF; i++)
    {
        if (names[$i]++ == 0)
            fields[j++] = $i;
    }
    printf("%s", $1);
    delim = "•"
    for (k = 1; k < j; k++)
    {
        printf(" %s %s", delim, fields[k]);
        delim = "|";
    }
    if (comments[2])
        printf(" ; %s", comments[2]);
    printf("\n");
}'
Running this yields:
alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
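If you would rather not pre-process the bullet, one possible variant keeps it out of every regular expression by locating it with index() instead. This is only a sketch: the function name dedupe_line is made up for illustration, and it assumes that the words never contain | or ; and that everything from the first ; onward is the comment.

```shell
# dedupe_line is a hypothetical helper name; it reads lines on stdin.
dedupe_line() {
  awk '
  {
      line = $0
      comment = ""
      p = index(line, ";")                 # assumption: the first ";" starts the comment
      if (p > 0) { comment = " " substr(line, p); line = substr(line, 1, p - 1) }
      bullet = "•"
      p = index(line, bullet)              # plain string search, no regex involved
      if (p == 0) { print; next }          # no bullet: pass the line through
      head = substr(line, 1, p - 1)
      sub(/ +$/, "", head)
      n = split(substr(line, p + length(bullet)), word, /[|]/)
      split("", seen)                      # clear the array portably
      out = ""
      for (i = 1; i <= n; i++) {
          gsub(/^ +| +$/, "", word[i])     # trim surrounding spaces
          if (word[i] != "" && !(word[i] in seen)) {
              seen[word[i]] = 1            # first occurrence wins, order is kept
              out = out (out == "" ? "" : " | ") word[i]
          }
      }
      print head " • " out comment
  }'
}
```

It would be run as dedupe_line < data. Like the script above, it keeps the first occurrence of each word and preserves the original order.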

With bash, sort, xargs, sed:
while IFS='•;' read -r a b c; do
  IFS="|" read -ra array <<< "$b"
  array=( "${array[@]# }" )    # strip one leading space from each element
  array=( "${array[@]% }" )    # strip one trailing space from each element
  readarray -t array < <(printf '%s\0' "${array[@]}" | sort -zu | xargs -0n1)
  SAVE_IFS="$IFS"; IFS="|"
  s="$a• ${array[*]}"
  [[ $c != "" ]] && s="$s ;$c"
  sed 's/|/ | /g' <<< "$s"
  IFS="$SAVE_IFS"
done < file
Output:
alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
I suppose the two spaces before "m" are a typo.

This might work for you (GNU sed):
sed 'h;s/.*• \([^;]*\).*/cat <<\\! | sort -u |\1|!/;s/\s*|\s*/\n/2ge;s/\n/ | /g;G;s/^\(.*\)\n\(.*• \)[^;]*/\2\1/;s/;/ &/' file
The idea is to remove the head and tail of each line, morph the data into a mini file, use standard utilities to sort and remove duplicates, and then put the line back together again.
Here a copy of the line is held in the hold space. The id and comments are removed. The data is munged into a file using cat and the bash here-document syntax and piped through sort (and uniq, if your sort does not come equipped with the -u option). The pattern space is evaluated, and the line is reassembled by appending the original line to the pattern space and using regexp pattern matching.
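The middle step of that idea (turn the data part into a mini file, sort it, and join it back) can be sketched on its own with tools that also exist on macOS. The function name sort_unique_fields is hypothetical, and the sketch assumes the argument is only the pipe-delimited part of a line, without the headword or the comment.

```shell
# sort_unique_fields is a hypothetical name; $1 is the pipe-delimited data part.
sort_unique_fields() {
  printf '%s\n' "$1" |
    tr '|' '\n' |                 # one word per line: the "mini file"
    sed 's/^ *//; s/ *$//' |      # trim the padding around each word
    sort -u |                     # sort and drop duplicates
    paste -s -d '|' - |           # join the lines back with pipes
    sed 's/|/ | /g'               # restore the " | " spacing
}
```

For example, sort_unique_fields 'p | p | q | r | s | q' should print p | q | r | s. Note that sort -u returns the words in sorted order, not in their original order, which only matches the desired output when the input is already sorted.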

Related

Remove certain letters in foma

I am trying to write a rule to remove the non-start [a | e | h | i | o | u | w | y] letters in a string. The rule should keep the first letter, but remove given letters in other locations.
For example,
vave -> vv
aeiou -> a
My code is as below:
?* [ a | e | h | i | o | u | w | y ]+:0 ?* [ a | e | h | i | o | u | w | y ]+:0;
However, when applying the rule on vaavaa, it returns
vaav
vava
vava
vav
vava
vava
vav
vvaa
vva
vva
vv
while vv is what I want.
Please share some advice. Thanks!
You may use this regex for search:
(?!^)[aehiouwy]+
and replace it with an empty string "".
RegEx Details:
(?!^): Negative lookahead to make sure the match does not begin at the start of the string
[aehiouwy]+: Match one or more of these letters inside [...]
You can use a captured group and alternation
^(.)|[aehiouwy]+
replace by \1
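Both patterns above are written for a Perl-compatible engine. If the strings can be post-processed outside foma with sed, a possible adaptation is sketched below. sed -E has no lookahead, and as far as I can tell the alternation version does not carry over either, because POSIX matching prefers the longest alternative and would swallow a leading run such as aeiou whole. The sketch instead keeps the single character that precedes each run. The function name strip_inner is made up for illustration.

```shell
# strip_inner is a hypothetical name; it reads words on stdin, one per line.
strip_inner() {
  # (.) captures the character just before a run of the listed letters and
  # puts it back, so a run at the start of the word keeps its first letter.
  sed -E 's/(.)[aehiouwy]+/\1/g'
}
```

With this, printf 'vaavaa\n' | strip_inner should give vv, and aeiou should give a.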

Removing symbols and making a tab delimited file while keeping all the words after a certain string in one column

I have a file full of such lines:
>Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
>Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
>Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
What I want to get is something like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8]|midbrain (mesencephalon)[3/8]|other[7/8]
Such that all the words after "positive" are in one column of their own separated by a pipe, and all the columns are separated by tab.
This is what I did:
sed -E 's/ *[>\|:-] */\t/g' mouse_genome_vista1.txt > mouse_genome_vista2.txt
sed "s/^[ \t]*//" -i mouse_genome_vista2.txt
My output was like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8] midbrain (mesencephalon)[3/8] other[7/8]
Mouse chr16 90449561 90451327 element 1672 positive forebrain[4/8] heart[6/8]
Mouse chr3 137446183 137449401 element 4 positive heart[3/4]
It works if there is just one word after "positive": it ends up alone in its column. However, if there is more than one, I get multiple columns. For instance, hindbrain, midbrain, and other each end up in their own tab-delimited column, whereas I want them pipe-separated in a single column.
You may try this regex with Perl (it relies on a lookahead and \K, which awk's regular expressions do not support):
[|:-](?=.*positive)|positive\s+\K\|
Sample Perl solution (note that it operates on a string, not on a file):
use strict;
my $str = 'Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
';
my $regex = qr/[|:-](?=.*positive)|positive\s+\K\|/xmp;
my $subst = "\t";  # a real tab; '\\t' would insert a literal backslash followed by t
my $result = $str =~ s/$regex/$subst/rg;
print $result;
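For comparison, the same reshaping can be sketched in awk without any lookaround, by splitting each line on " | " and rebuilding it. This is an illustration under assumptions rather than a drop-in answer: the function name vista_to_tsv is made up, "element 1367" is kept as a single column, and the first three pipe-delimited fields are assumed to always be the location, the element, and the positive/negative label.

```shell
# vista_to_tsv is a hypothetical name; it reads the VISTA-style lines on stdin.
vista_to_tsv() {
  awk -F ' [|] ' '
  {
      loc = $1
      sub(/^>/, "", loc)                 # drop the leading ">"
      gsub(/[|:-]/, "\t", loc)           # Mouse|chr9:1-2 becomes four tab-separated columns
      out = loc "\t" $2 "\t" $3
      for (i = 4; i <= NF; i++)          # everything after the label goes into one column
          out = out (i == 4 ? "\t" : "|") $i
      print out
  }'
}
```

It would be run as vista_to_tsv < mouse_genome_vista1.txt > mouse_genome_vista2.txt.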

Solve a puzzle using bash tools such as grep

I need to solve a puzzle using shell script. I tried to combine grep with rev and saved the output into a temporary text file but still don't know how to solve it entirely.
That's the puzzle to solve :
j s e t f l
a l s f e l
g a a n p l
e p f d p k
r e g e l a
f n e t e n
The file that contains the wordlist to use is in http://pastebin.com/DP4mFZAr
I know how to tell grep where to find the patterns to match as fixed strings extracted from a text file using $ grep -Ff wordlist puzzle and
how to search for mirrored words using $ rev puzzle | grep -Ff wordlist , thus dealing with the horizontal lines, but how do I deal with vertical words too?
I am covering horizontal and vertical matching. The main idea is to remove the spaces and then use grep -f with the given list of words, stored in words file.
With grep -f, the results are shown within the line. If you just want to see the matched text, use grep -of.
Horizontal matching
$ cat puzzle | tr -d ' ' | grep -f words
alsfel
gaanpl
regela
fneten
$ cat puzzle | tr -d ' ' | grep -of words
als
gaan
regel
eten
Vertical matching
For this, we first have to transpose the content of the file. I reuse a function from another answer of mine:
transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
       END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
              }
       }'
}
And let's see:
$ cat puzzle | transpose | tr -d ' ' | grep -f words
jagerf
slapen
esafge
tfndet
lllkan
$ cat puzzle | transpose | tr -d ' ' | grep -of words
jager
slapen
af
ge
de
kan
You can then use rev (as you suggest in your question) for mirrored words. Also tac can be interesting for vertically mirrored words.
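Putting the horizontal, vertical, and mirrored searches together could look roughly like this. It is a sketch, not a tested solver: the function name find_words is hypothetical, the column builder is a simplified stand-in for the transpose function above, and the word list is assumed to hold one word per line.

```shell
# find_words is a hypothetical name.
# $1: puzzle file (space-separated letters), $2: word list (one word per line)
find_words() {
  puzzle=$1 words=$2
  rows=$(tr -d ' ' < "$puzzle")
  # Build one string per column by appending each letter to its column.
  cols=$(awk '{ for (i = 1; i <= NF; i++) col[i] = col[i] $i; if (NF > max) max = NF }
              END { for (i = 1; i <= max; i++) print col[i] }' "$puzzle")
  {
    printf '%s\n' "$rows" "$cols"          # left-to-right and top-to-bottom
    printf '%s\n' "$rows" "$cols" | rev    # right-to-left and bottom-to-top
  } | grep -oFf "$words" | sort -u
}
```

It would be called as find_words puzzle words and prints each word that was found once, in sorted order.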
Diagonal matching
For the diagonal matching, I think that an interesting approach would be to move every single line a little bit to the left/right. This way,
e x x x x
x g x x x
x x g x x
can become
e x x x x
g x x x
g x x
and you can use the vertical/horizontal approaches.
For this, you can use printf as described in Using variables in printf format:
$ cat a
e x x x x
x g x x x
x x g x x
$ awk -v c=20 '{printf "%*s\n", c, $0; c-=2}' a
           e x x x x
         x g x x x
       x x g x x
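A different route to the same goal, swapped in here purely as an illustration, is to collect the diagonals directly by index arithmetic instead of shifting the lines: every cell in row r and column c of a top-left to bottom-right diagonal shares the same value of c - r. The function name diagonals is made up, and the grid is assumed to be space-separated.

```shell
# diagonals is a hypothetical name; it reads a space-separated grid on stdin
# and prints one line per top-left to bottom-right diagonal.
diagonals() {
  awk '{
         for (i = 1; i <= NF; i++) {
             k = i - NR                    # constant along one diagonal
             d[k] = d[k] $i
             if (k < min) min = k
             if (k > max) max = k
         }
       }
       END { for (k = min; k <= max; k++) print d[k] }'
}
```

For the three-line file a above, the main diagonal comes out as egg. The output can then be fed to grep -of words like the rows and columns. Using i + NR as the key instead should give the diagonals running the other way.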

How can I use sed/awk/perl to replace a matched pattern with an equivalent number of dashes?

I would like to search for all occurrences of a pattern in a file and replace each match with an equivalent number of padding characters, such as spaces or dashes. It is important to note that I DO NOT WANT TO ALTER THE FILE! I would like to print the result to standard output, which is why I prefer using sed. The output should be the same length as the file, since each match of the regex should be replaced by as many dashes as the match is long.
Example: Say the file contains the following:
data | more data | "to be dashed"
Desired Output:
data | more data | --------------
I currently have something like this:
sed -e 's/["][^"]*["]/-/g' file
which results in:
data | more data | -
Any Thoughts?
With Perl:
perl -pe 's/(".*?")/ "-" x length($1) /ge' <<END
data | more data | "to be dashed"
data | "more data" | "multi words " "to be dashed"
END
data | more data | --------------
data | ----------- | -------------- --------------
Since you need to find the string length of the matched text, you need to run the substitution part of s/// through a round of evaluation, hence the e flag.
Using GNU awk:
gawk 'BEGIN{ FS = "" }{ while (match($0, /^(.*)(["][^"]*["])(.*)$/, a)){ gsub(/./, "-", a[2]); $0 = a[1] a[2] a[3]; } } 1' file
Examples:
$ echo 'data | more data | "to be dashed"' | gawk 'BEGIN{ FS = "" }{ while (match($0, /^(.*)(["][^"]*["])(.*)$/, a)){ gsub(/./, "-", a[2]); $0 = a[1] a[2] a[3]; } } 1'
data | more data | --------------
$ echo 'data | more data | "to be dashed" x "1234"' | gawk 'BEGIN{ FS = "" }{ while (match($0, /^(.*)(["][^"]*["])(.*)$/, a)){ gsub(/./, "-", a[2]); $0 = a[1] a[2] a[3]; } } 1'
data | more data | -------------- x ------
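Where gawk is not installed (stock macOS, for instance), roughly the same loop can be sketched with the two-argument match() and the RSTART and RLENGTH variables that every POSIX awk provides. The function name dash_quoted is hypothetical. One assumption to be aware of: in an awk that works on bytes rather than characters, non-ASCII text inside the quotes may produce more dashes than visible characters.

```shell
# dash_quoted is a hypothetical name; it reads lines on stdin.
dash_quoted() {
  awk '{
         out = ""
         while (match($0, /"[^"]*"/)) {
             dashes = substr($0, RSTART, RLENGTH)
             gsub(/./, "-", dashes)                    # one dash per matched character
             out = out substr($0, 1, RSTART - 1) dashes
             $0 = substr($0, RSTART + RLENGTH)         # continue after the match
         }
         print out $0
       }'
}
```

It would be used as dash_quoted < file, which leaves the file itself untouched.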
A sed solution:
sed -r '
:loop
h # copy pattspace to holdspace
s/(.*)("[^"]+")(.*)/\1\n\3/ # replace quoted field with newline
T # if no replacement occurred, start next cycle
x # exchange pattspace and holdspace
s/.*("[^"]+").*/\1/ # isolate quoted field
s/./-/g # change all chars to dashes
G # append newline and holdspace to pattspace
s/(-*)\n(.*)\n(.*)/\2\1\3/ # reorder fields using newlines
t loop # repeat (must be conditional for T to work)
' file
OSX/BSD may not have the T command (jump to label (or next cycle) if substitution has not been made since last line read or last conditional jump). In that case, replace T with:
t keeplooping # branch over b if substitution occurred
b # unconditional branch to next cycle
:keeplooping

Replace similar strings in a file in place

I have a file with the following types of pairs of strings:
Call Stack: [UniqueObject1] | [UnOb2] | [SuspectedObject1] | [SuspectedObject2] | [SuspectedObject3] | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
Call Stack: [UniqueObject1] | [UnOb2] | 0x28798765 | 0x18793765 | 0x48792767 | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
There are many such pairs that occur in the file.
The attributes of this pair are that the first part of the pair has "SuspectedObject1","SuspectedObject2" and so on, which in the second part of the pair are replaced by HEX-VALUES of the address of those objects.
What I want to do is, remove all the second part of the pairs.
Please note the pairs do not occur in any specific order and might be separated by many lines in between.
I plan to iterate through each line of this file, if I see a hex-string given as an address instead of a suspected object, I would want to start comparing the following regex
Call Stack: [UniqueObject1] | [UnOb2] | * | * | * | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
in the whole file and if a string does match, I want to remove this specific line from the file.
Can someone suggest a shell way to do this?
If I have understood your question correctly, you may need to use awk. Run like:
awk -f script.awk file file
Contents of script.awk:
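BEGIN {
    FS=" \\| "
}
# First pass: blank out fields 3-5 and count each remaining "shape".
FNR==NR {
    $3=$4=$5=""
    a[$0]++
    next
}
# Second pass: a line with a hex address in field 3 is printed only if
# no other line shares its shape.
$3 ~ /^0x[0-9A-Fa-f]{8}$/ {
    r=$0
    $3=$4=$5=""
    if (a[$0]<2) {
        print r
    }
    next
}1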
Alternatively, here's the one-liner:
awk -F ' \\| ' 'FNR==NR { $3=$4=$5=""; a[$0]++; next } $3 ~ /^0x[0-9A-Fa-f]{8}$/ { r=$0; $3=$4=$5=""; if (a[$0]<2) print r; next }1' file{,}
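To make the two-pass idea easier to try out, here is a hedged, self-contained variant wrapped in a shell function. The name drop_hex_twins is made up, and the sketch keeps the answer's assumption that the suspected objects always sit in fields 3 to 5. It reads the file twice: the first pass counts each line with those fields masked, and the second pass drops a hex-address line only when another line has the same masked form.

```shell
# drop_hex_twins is a hypothetical name; $1 is the file to filter.
drop_hex_twins() {
  awk -F ' [|] ' '
    # key(): the current line with fields 3-5 replaced by "*"
    function key(   k, i) {
        k = ""
        for (i = 1; i <= NF; i++) k = k (i >= 3 && i <= 5 ? "*" : $i) SUBSEP
        return k
    }
    FNR == NR { count[key()]++; next }                       # first pass: count
    $3 ~ /^0x[0-9A-Fa-f]+$/ && count[key()] > 1 { next }     # second pass: drop the twin
    { print }
  ' "$1" "$1"
}
```

The result goes to standard output, so drop_hex_twins file > file.new followed by a rename would be needed to change the file in place.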