Solve a puzzle using bash tools such as grep - regex

I need to solve a puzzle using shell script. I tried to combine grep with rev and saved the output into a temporary text file but still don't know how to solve it entirely.
This is the puzzle to solve:
j s e t f l
a l s f e l
g a a n p l
e p f d p k
r e g e l a
f n e t e n
The word list to use is at http://pastebin.com/DP4mFZAr
I know how to tell grep where to find the patterns to match as fixed strings extracted from a text file, using $ grep -Ff wordlist puzzle, and how to search for mirrored words using $ rev puzzle | grep -Ff wordlist puzzle, which deals with the horizontal lines. But how do I deal with vertical words too?

I am covering horizontal and vertical matching. The main idea is to remove the spaces and then use grep -f with the given list of words, stored in the file words.
With grep -f, the whole matching line is shown. If you just want to see the matched text, use grep -of.
Horizontal matching
$ cat puzzle | tr -d ' ' | grep -f words
alsfel
gaanpl
regela
fneten
$ cat puzzle | tr -d ' ' | grep -of words
als
gaan
regel
eten
Vertical matching
For this, we first have to transpose the content of the file. I reuse a function from another answer of mine:
transpose () {
    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
         END {for (i=1; i<=max; i++)
                  {for (j=1; j<=NR; j++)
                       printf "%s%s", a[i,j], (j<NR?OFS:ORS)
                  }
             }'
}
And let's see:
$ cat puzzle | transpose | tr -d ' ' | grep -f words
jagerf
slapen
esafge
tfndet
lllkan
$ cat puzzle | transpose | tr -d ' ' | grep -of words
jager
slapen
af
ge
de
kan
You can then use rev (as you suggest in your question) for mirrored words. Also tac can be interesting for vertically mirrored words.
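As a rough sketch of how the four straight directions could be combined into one pass: the script below writes the puzzle to a scratch file, uses a short made-up word list (the real list lives at the pastebin link, so jager, slapen, regel, eten, leg, and nep are only assumptions for illustration), and feeds each orientation of the grid to grep. Reversing the transposed rows with rev gives the same strings as running tac before transposing.

```shell
#!/bin/sh
# Scratch directory so the sample files do not clutter anything.
work=$(mktemp -d)
cat > "$work/puzzle" <<'EOF'
j s e t f l
a l s f e l
g a a n p l
e p f d p k
r e g e l a
f n e t e n
EOF
# Hypothetical word list; substitute the real one.
printf '%s\n' jager slapen regel eten leg nep > "$work/words"

# Same transpose helper as above: rows become columns.
transpose() {
  awk '{ for (i = 1; i <= NF; i++) a[i,NR] = $i; max = (max < NF ? NF : max) }
       END { for (i = 1; i <= max; i++)
               for (j = 1; j <= NR; j++)
                 printf "%s%s", a[i,j], (j < NR ? OFS : ORS) }'
}

rows=$(tr -d ' ' < "$work/puzzle")               # left to right
cols=$(transpose < "$work/puzzle" | tr -d ' ')   # top to bottom

# Search all four orientations and list each word once.
found=$(
  { printf '%s\n' "$rows" "$cols"
    printf '%s\n' "$rows" "$cols" | rev          # right to left, bottom to top
  } | grep -oFf "$work/words" | sort -u
)
printf '%s\n' "$found"
rm -rf "$work"
```

One caveat with grep -o and a word list: when two words overlap on the same line, only the leftmost match is reported, so overlapping words may need separate passes.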
Diagonal matching
For the diagonal matching, I think that an interesting approach would be to move every single line a little bit to the left/right. This way,
e x x x x
x g x x x
x x g x x
can become
e x x x x
g x x x
g x x
and you can use the vertical/horizontal approaches.
For this, you can use printf as described in Using variables in printf format:
$ cat a
e x x x x
x g x x x
x x g x x
$ awk -v c=20 '{printf "%*s\n", c, $0; c-=2}' a
           e x x x x
         x g x x x
       x x g x x
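One way to turn the shifting idea into something grep can consume is sketched below. Rather than physically padding each row, the hypothetical diagonals function groups every letter by the column it would land in after the shift (column minus row for one diagonal direction, column plus row for the other) and prints each group as one line. It assumes the same space-separated grid as above.

```shell
diagonals() {
  tr -d ' ' | awk '
    {
      n = length($0)
      if (n > width) width = n
      for (c = 1; c <= n; c++) {
        ch = substr($0, c, 1)
        down[c - NR] = down[c - NR] ch   # same key: same top-left to bottom-right diagonal
        up[c + NR]   = up[c + NR] ch     # same key: same top-right to bottom-left diagonal
      }
    }
    END {
      for (k = 1 - NR; k <= width - 1; k++) print down[k]
      for (k = 2; k <= width + NR; k++) print up[k]
    }'
}

# The three-row sample from above: the e, g, g diagonal becomes the line "egg".
diag=$(printf '%s\n' 'e x x x x' 'x g x x x' 'x x g x x' | diagonals)
printf '%s\n' "$diag"
```

Each output line can then be piped through grep -of words, and through rev first for the opposite reading direction.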

egrep the line ended with

$ cat file
c f t e, u y r s p I y
p A w p d. R i
G e w o a l n o v s.
P G e a o c f s p
k e i c w a p p e.
$ od -c file
0000000 c f t e , u y r s
0000020 p I y \r \n p A w p
0000040 d . R i \r \n G e w o
0000060 a l n o v s . \r \n P
0000100 G e a o c f s p
0000120 \r \n k e i c w a p
0000140 p e . \r \n
0000146
I tried to use the egrep command to find all lines that end with a period (.).
However, I was not able to do it.
For example:
$ egrep '.*\.' file
p A w p d. R i
G e w o a l n o v s.
k e i c w a p p e.
It did not give me the correct output.
I also tried using $ to anchor the dot, and adding \r and \n, but none of these worked.
Any suggestions will help.
You should begin by converting your file to Unix format:
dos2unix file
Then you can simply use this instruction:
egrep "[.]$" file
Your file is in DOS format (with carriage return / line feed endings). Either convert it to unix format first and use
egrep '\.$'
or leave the file unchanged and search for a literal carriage return
egrep $'\\.\r$'
(using bash's $'...' ANSI-C quoting, because grep itself doesn't understand \r).
egrep '.*\.' just finds all lines that contain a . anywhere.
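To make the effect of the carriage return visible, here is a small sketch on a made-up CRLF sample (the file contents and variable names are assumptions, not the original data). It counts matches for the plain pattern, for a pattern that includes the carriage return, and for input with the carriage returns stripped on the fly. It uses plain grep, since none of these patterns need extended syntax.

```shell
# Build a small CRLF sample (hypothetical content).
sample=$(mktemp)
printf 'ends with a dot.\r\nno dot here\r\na dot. in the middle\r\n' > "$sample"

# '\.$' finds nothing: a carriage return sits between the dot and the line end.
plain=$(grep -c '\.$' "$sample")

# Matching the carriage return explicitly works on the unmodified file.
cr=$(printf '\r')
with_cr=$(grep -c "\\.${cr}\$" "$sample")

# Stripping carriage returns in a pipeline avoids rewriting the file.
stripped=$(tr -d '\r' < "$sample" | grep -c '\.$')

echo "$plain $with_cr $stripped"
rm -f "$sample"
```

The tr variant is handy when dos2unix is not installed or the file must stay untouched.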

Command line to remove in-line dupes

What is a fast and succinct way to remove dupes from within a line?
I have a file in the following format:
alpha • a | b | c | a | b | c | d
beta • h | i | i | h | i | j | k
gamma •  m | n | o
delta • p | p | q | r | s | q
So there's a headword in column 1, and then various words delimited by pipes, with an unpredictable amount of duplication. The desired output has the dupes removed, as:
alpha • a | b | c | d
beta • h | i | j | k
gamma •  m | n | o
delta • p | q | r | s
My input file is a few thousand lines. The Greek names above correspond to category names (e.g., "baseball"), and the letters correspond to English dictionary words (which might contain spaces or accents), e.g. "ball game | batter | catcher | catcher | designated hitter".
This could be programmed many ways, but I suspect there's a smart way to do it. I encounter variations of this scenario a lot, and wonder if there's a concise and elegant way to do this. I am using MacOS, so a few fancy unix options are not available.
Bonus complexity, I often have a comment at the end which should be retained, e.g.,
zeta • x | y | x | z | z ; comment here
P.S. this input is actually the output of a prior StackOverflow question:
Command line to match lines with matching first field (sed, awk, etc.)
BSD awk does not have sort functions builtin where GNU awk does, but I'm not sure they're necessary. The bullet, • (U+2022), causes some grief with awk.
I suggest pre-processing the bullet to a single-byte character. I chose #, but you could use Control-A or something else if you prefer. Your data was in a file data. I note that there was a double space before m in the gamma line; I'm assuming that isn't significant.
sed 's/•/#/' data |
awk -F ' *[#|] *' '
{
    # Reset the per-line state.
    delete names
    delete comments
    delete fields;
    # Split a trailing "; comment" off the last field.
    if ($NF ~ / *;/) { split($NF, comments, / *; */); $NF=comments[1]; }
    j = 1;
    # Keep each word the first time it is seen, preserving order.
    for (i = 2; i <= NF; i++)
    {
        if (names[$i]++ == 0)
            fields[j++] = $i;
    }
    printf("%s", $1);
    delim = "•"
    for (k = 1; k < j; k++)
    {
        printf(" %s %s", delim, fields[k]);
        delim = "|";
    }
    if (comments[2])
        printf(" ; %s", comments[2]);
    printf("\n");
}'
Running this yields:
alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
With bash, sort, xargs, sed:
while IFS='•;' read -r a b c; do
  IFS="|" read -ra array <<< "$b"
  array=( "${array[@]# }" )
  array=( "${array[@]% }" )
  readarray -t array < <(printf '%s\0' "${array[@]}" | sort -zu | xargs -0n1)
  SAVE_IFS="$IFS"; IFS="|"
  s="$a• ${array[*]}"
  [[ $c != "" ]] && s="$s ;$c"
  sed 's/|/ | /g' <<< "$s"
  IFS="$SAVE_IFS"
done < file
Output:
alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
I suppose the two spaces before "m" are a typo.
This might work for you (GNU sed):
sed 'h;s/.*• \([^;]*\).*/cat <<\\! | sort -u |\1|!/;s/\s*|\s*/\n/2ge;s/\n/ | /g;G;s/^\(.*\)\n\(.*• \)[^;]*/\2\1/;s/;/ &/' file
The sketch of this idea is: to remove the head and tail of each line, morph the data into a mini file, use standard utilities to sort and remove duplicates, then put the line back together again.
Here a copy of the line is held in the hold space. The id and comments are removed. The data is munged into a file using cat and the bash here-document syntax and piped through sort (and uniq if your sort does not come equipped with the -u option). The pattern space is evaluated and the line is reassembled by appending the original line to the pattern space and using regexp pattern matching.
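The same take-apart, sort, reassemble idea can be sketched in plain shell for a single line, which may be easier to follow than the sed one-liner. This is only an illustration under a few assumptions: the line contains exactly one bullet, semicolons appear only before the comment, and alphabetical order of the entries is acceptable (sort -u does not keep the original order). The variable names are made up.

```shell
line='zeta • x | y | x | z | z ; comment here'

head=${line%%•*}      # everything before the bullet: "zeta "
rest=${line#*•}       # everything after the bullet
body=${rest%%;*}      # the pipe-delimited words, without the comment

# Keep the comment (if any) so it can be re-attached at the end.
case $rest in
  *\;*) tail=" ;${rest#*;}" ;;
  *)    tail= ;;
esac

# One word per line, trim blanks, sort and de-duplicate, then join again.
items=$(printf '%s\n' "$body" | tr '|' '\n' |
  sed 's/^ *//; s/ *$//' | sort -u | paste -sd'|' - | sed 's/|/ | /g')

result="${head}• ${items}${tail}"
printf '%s\n' "$result"
```

Wrapping this in a while read loop would process a whole file, at the cost of several processes per line.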

How do I properly match unicode characters with awk's regex?

I have the following statement in a script, to retrieve the domain portion of an email address from a variety of email logs with a reliably formatted To: line:
awk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
This matches lines such as To: doc@bequerelint.net (Omer). However, it does not match the lines To: andy.vitrella@uol.com.br (André) or To: boggers@operamail.com (Pål), nor any other line with a non-ASCII character within the trailing parentheses after the email address.
Incidentally, od -c for the first non-matching example gives:
0000000 T o : a n d y . v i t r e l l
0000020 a @ u o l . c o m . b r ( A n
0000040 d r 351 ) \n
0000045
I surmise there is something going on with awk's regex's . not matching the non-ASCII character in (André). What is the correct regex statement to match such a line?
I give my comment as an answer to have the code formatted correctly,
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' | gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
uol.com.br
operamail.com
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' > fileee12
$ gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}' fileee12
uol.com.br
operamail.com
$ env | grep -e '\(LOC\)\|\(LAN\)'
LANG=C
XTERM_LOCALE=C
$
As you can see, your command works both when reading from stdin and when reading from a file, using a C locale. So on my computer I can rule out both the locale and the difference between stdin and a file as the cause.
My computer runs Linux and my gawk is 4.1.1. What are your circumstances?
Simplifying further, here is a version where the locale setting simply doesn't matter (it should work with mawk, mawk2, or gawk, with or without -b):
awk 'BEGIN { FS = "\100" }    # "\100" is octal for the at sign (@)
/^To: / && NF > 1 {           # play it safe in case there is no at sign
    # if a space follows the domain (e.g. before "(Omer)"), cut there
    print (($2 !~ / /) ? $2 : substr($2, 1, index($2, " ") - 1))
}'
Since spaces aren't valid in an email address (unless URI-encoded, perhaps), and you are forcing the field delimiter to be @, this substr alone does the job without gensub or any Unicode handling.
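For what it's worth, byte 351 in the od output is é in Latin-1, which is not valid UTF-8 on its own, and in a UTF-8 locale a regex dot may refuse to match such a byte. A minimal sketch, assuming that is the cause: forcing the C locale makes awk treat the input as plain bytes. The sample line is rebuilt with printf, and sub is used instead of gensub so that it also runs on non-GNU awks.

```shell
# Rebuild the problem line; \351 is the Latin-1 byte for "é" (as in the od -c dump).
domain=$(printf 'To: andy.vitrella@uol.com.br (Andr\351)\n' |
  LC_ALL=C awk '/^To: / {
    sub(/^To: .+@/, "")   # drop everything up to and including the at sign
    sub(/ .*$/, "")       # drop the trailing " (name)" part
    print
  }')
echo "$domain"
```

Converting the logs to UTF-8 first (for example with iconv) would be the other way around the problem.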

How do I get GNU grep to match exactly "H" and not things that just start with "H"?

I have a file, and I would like to find the total number of occurrences of a passed-in word, while supporting regex:
grep -e "Hello*" filename | wc -w
But there are a few bugs. Let's say I do something like
grep -e "H" filename | wc -w
It should match EXACTLY H only, and not count things that start with H, the way grep does right now.
Does anyone know how?
try this:
grep '\bH\b'
e.g.:
kent$ echo "Hello
IamH
we need this H
and this H too"|grep '\bH\b'
we need this H
and this H too
Note that if you want to count only the matched words, you need to use the -o option of grep. (thx fotanus)
EDIT
You can get all matched words with grep -o. In this case -c doesn't help, because it counts matching lines. Instead, you can pipe the output of grep -o to wc -l.
For example:
kent$ echo "No Hfoo will be counted this line
this line has many: H H H H H H H (7)
H (8 starting)
foo bar (9 ending) H
H"|grep -o '\bH\b'|wc -l
10
Or a simpler, single-process solution with awk (the \< and \> word-boundary operators are a GNU awk extension):
awk '{s+=gsub(/\<H\>/,"")}END{print s}' file
same example:
kent$ echo "No Hfoo will be counted this line
this line has many: H H H H H H H (7)
H (8 starting)
foo bar (9 ending) H
H"|awk '{s+=gsub(/\<H\>/,"")}END{print s}'
10
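As a possible alternative to spelling out \b by hand, grep also has a -w option that only accepts matches forming whole words. A small sketch reusing the sample text from above (the count variable is just for illustration):

```shell
# -o prints each match on its own line, -w keeps only whole-word matches.
count=$(printf '%s\n' \
  'No Hfoo will be counted this line' \
  'this line has many: H H H H H H H (7)' \
  'H (8 starting)' \
  'foo bar (9 ending) H' \
  'H' | grep -ow 'H' | wc -l)
echo "$count"
```

Because -w wraps the whole pattern, it also works when the pattern is a longer regex passed in from a variable.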

How do I match across newlines in a perl regex?

I'm trying to work out how to match across newlines with perl (from the shell). With the following:
(echo a b c d e; echo f g h i j; echo l m n o p) | perl -pe 's/(c.*)/[$1]/'
I get this:
a b [c d e]
f g h i j
l m n o p
Which is what I expect. But when I place an /s at the end of my regex, I get this:
a b [c d e
]f g h i j
l m n o p
What I expect and want it to print is this:
a b [c d e
f g h i j
l m n o p
]
Is the problem with my regex somehow, or my perl invocation flags?
-p loops over input line-by-line, where "lines" are separated by $/, the input record separator, which is a newline by default. If you want to slurp all of STDIN into $_ for matching, use -0777.
$ printf 'a b c d e\nf g h i j\nl m n o p\n' | perl -pe 's/(c.*)/[$1]/s'
a b [c d e
]f g h i j
l m n o p
$ printf 'a b c d e\nf g h i j\nl m n o p\n' | perl -0777pe 's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]
See Command Switches in perlrun for information on both those flags. -l (dash-ell) will also be useful.
The problem is that your one-liner works one line at a time. Your regex is fine:
use strict;
use warnings;
use 5.014;
my $s = qq|a b c d e
f g h i j
l m n o p|;
$s =~ s/(c.*)/[$1]/s;
say $s;
There's More Than One Way To Do It: since you're reading "the entire file at a time" anyway, I'd personally drop the -p modifier, slurp the entire input explicitly, and go from there:
echo -e "a b c d e\nf g h i j\nl m n o p" | perl -e '$/ = undef; $_ = <>; s/(c.*)/[$1]/s; print;'
This solution does have more characters, but may be a bit more understandable for other readers (which will be you in three months' time ;-D )
Actually, your one-liner is equivalent to this:
while (<>) {
    $_ =~ s/(c.*)/[$1]/s;
} continue {
    print;
}
This means the regex is applied to one line of your input at a time, so a match can never extend past the line it starts on.
You're reading a line at a time, so how do you think it can possibly match something that spans more than one line?
Add -0777 to redefine "line" to "file" (and don't forget to add /s to make . match newlines).
$ (echo a b c d e; echo f g h i j; echo l m n o p) | perl -0777pe's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]