Grep ascending order of cards. Why does it work? - regex

The collection of cards I need to grep is defined as:
{h ∈ H | h contains only cards in ascending order regardless of their suit}
Example:
h = Ah2c2d3s5h6d8s8d9h9cTdTcKh
h != 3d4dQc3sKcAh2sAc7hKdKsKh4h62 (Q is followed by lower rank 3)
The ascending ranks of cards are:
A(ace) 2 3 4 5 6 7 8 9 T(ten) J Q K
The suits are defined as such:
c(clubs) s(spades) h(hearts) d(diamonds)
I have tried the following grep and it is correct but I still don't
understand why it works.
Edit: added the -P flag (I forgot about it); as tripleee pointed out, plain grep -v with this pattern is indeed invalid.
grep -Pv "[KQJT].*[2-9A].* |[KQ].*[JT].* |[6-9].*[2-5A].* "
What baffles me is how K followed by Q got matched with this pattern or even 5 followed by [A2-4]
The solution has a total of 31027 lines
The text file provided for the exercise can be found here:
http://computergebruik.ugent.be/oefeningenreeks1/kaarten1.txt

Your regex is not at all valid, so I don't understand why you say it works.
Plain grep does not understand | to mean alternation. You can add an -E option to specify ERE (traditionally, egrep) regex semantics, or with GNU grep backslash the | (\| in a basic regex is an extension, not part of POSIX); or you can specify multiple -e options. (See e.g. https://en.wikipedia.org/wiki/Regular_expression#Standards for some background about the various regex dialects in common use.)
grep -Ev "[KQJT].*[2-9A].* |[KQ].*[JT].* |[6-9].*[2-5A].* "
grep -v "[KQJT].*[2-9A].* \|[KQ].*[JT].* \|[6-9].*[2-5A].* "
grep -ve "[KQJT].*[2-9A].* " -e "[KQ].*[JT].* " -e "[6-9].*[2-5A].* "
Even with this fix, the regex is obviously insufficient for removing matches where e.g. 3 is followed by 2. The only way to make it cover all cases is to enumerate every possibility. (Disallow 2 followed by any lower rank, 3 followed by any lower rank, 4 followed by any lower rank, etc.) An altogether better approach would be to use a scripting language of some sort, and basically just map the symbols to ones with the desired sort order, then check if the input is sorted.
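The map-then-check-sorted idea can be sketched in the shell itself. This is only a minimal sketch, assuming hands are written as rank/suit pairs as in the question; the helper name is_ascending is made up for illustration.

```shell
# is_ascending (hypothetical helper): succeeds if the ranks in a hand never decrease.
is_ascending() (
    hand=$1
    # Drop the suit letters, then map A23456789TJQK to a..m so that
    # plain alphabetical order is the same as rank order.
    ranks=$(printf '%s' "$hand" | tr -d 'cshd' | tr 'A23456789TJQK' 'abcdefghijklm')
    # Sort the mapped ranks one per line and glue them back together.
    sorted=$(printf '%s' "$ranks" | fold -w 1 | LC_ALL=C sort | tr -d '\n')
    # The hand is ascending exactly when sorting changes nothing.
    [ "$ranks" = "$sorted" ]
)

is_ascending 'Ah2c2d3s5h6d8s8d9h9cTdTcKh' && echo 'ascending'
is_ascending '3d4dQc3sKcAh2sAc7hKdKsKh4h62' || echo 'not ascending'
```

To filter a whole file you could loop over its lines and print those for which the helper succeeds; that will be much slower than a single grep, so it is more of an illustration than a replacement.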
If that is not an option, maybe try
grep -E '^(A.)*(2.)*(3.)*(4.)*(5.)*(6.)*(7.)*(8.)*(9.)*(T.)*(J.)*(Q.)*(K.)* '
which looks for zero or more aces, followed by zero or more twos, followed by zero or more threes, etc.
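To see the anchored pattern at work on the two hands from the question, here is a small sketch. It assumes one hand per line with nothing after the last card, so it ends the pattern with $ instead of the trailing space; filter_ascending is a made-up name.

```shell
# Assumption: each line holds exactly one hand, so "$" replaces the
# trailing space used in the pattern above.
ascending_re='^(A.)*(2.)*(3.)*(4.)*(5.)*(6.)*(7.)*(8.)*(9.)*(T.)*(J.)*(Q.)*(K.)*$'

# filter_ascending (hypothetical helper): keep only hands whose ranks never decrease.
filter_ascending() {
    grep -E "$ascending_re" "$@"
}

# Only the first hand is printed; the second has Q followed by 3.
printf '%s\n' 'Ah2c2d3s5h6d8s8d9h9cTdTcKh' '3d4dQc3sKcAh2sAc7hKdKsKh4h62' | filter_ascending
```

Each group consumes every card of one rank (the dot eats the suit), and the groups appear in rank order, so a lower rank after a higher one leaves text that no remaining group can match.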

Related

How to determine all sequences of genes, within an RNA, where a combination of G and T is repeated at least one single time

Suppose we have this file RNA.txt:
GGGT
CCAAA
AAAACCGGTT
CCCCT
AAAAAG
I would like to find all sequences that contain both G and T at least once, such as AGTTG, GGGGGT, or TAACGG, but not AAAAT or CCCCT.
I tried the command :
grep -e "G\T+" RNA.txt
and I got the following output:
GGGT
AAAACCGGTT
AAAAAG
The first two sequences retrieved are correct, but AAAAAG is wrong: a line should contain at least one G and one T, in any order, to be displayed.
Let's say you have:
cat file
GGGT
CCAAA
AAAACCGGTT
CCCCT
AAAAAG
AGTTG
GGGGGT
TAACGG
AAAAT
Then you can use this grep with an alternation regex:
grep -E 'G.*T|T.*G' file
GGGT
AAAACCGGTT
AGTTG
GGGGGT
TAACGG
-E enables extended regex mode in grep. With GNU grep we may also use grep 'G.*T\|T.*G' file
G.*T|T.*G will match a line with G and T in any order.
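The same test can also be written as two chained filters, which some people find easier to read. A small sketch; both helper names are made up for illustration.

```shell
# has_g_and_t (hypothetical helper): print lines that contain both a G and a T.
has_g_and_t() {
    grep -E 'G.*T|T.*G' "$@"
}

# Same result as two filters: keep lines with a G, then keep those that also have a T.
has_g_and_t_chained() {
    grep 'G' "$@" | grep 'T'
}

# Prints GGGT, AAAACCGGTT and TAACGG.
printf '%s\n' GGGT CCAAA AAAACCGGTT CCCCT AAAAAG TAACGG | has_g_and_t
```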
If I understand correctly you want to grep all lines that have at least two G or two T:
grep -e "GG\|TT" RNA.txt
Fairly simple using the alternation operator |. The only gotcha: The operator needs to be escaped when grepping like that.

Using grep to extract very specific strings from binary file

I have a large binary file. I want to extract certain strings from it and copy them to a new text file.
For example, in:
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZM-^G
I want to take the number '7' (after the #^#^#E) and every character after it, stopping at the Z (ignoring the M-^G).
I want to copy this 7cacscKLrrok9bwC3Z64NTnZ to a new file.
There will be multiple such strings in one file. The end will always be denoted by the M- (which I don't want copied). The start will always be denoted by a 7 (which I do want copied).
Unfortunately, my knowledge of grep, sed, etc, does not extend to this level. Can someone please suggest a viable way to achieve this?
cat -v filename | grep [7][A-Z,a-z] will show all strings with a '7' followed by a letter but that's not much.
Thank you.
I've noticed that my requirements are rather more complicated.
(I've performed the correct - I hope - formatting this time). Thanks to 'tshiono' for his (?) answer to the earlier submission.
I want to check the ending of a string and, if it ends in M-, grep another string that follows it (with junk in between). If the string does not end in M-, then I don't want it copied (let alone any other strings).
So what I would like is:
grep -a -Po "7[[:alnum:]]+(?=M-)" file_name and if the ending is M- then grep -a -Po "5x[[:alnum:]]+(?=\^)" file_name to copy the string that starts with 5x and ends with a ^.
In this example:
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
The outcome would be:
7cacscKLrrok9bwC3Z64NTnZ
5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
However, if the ending is not M- (more precisely, if the ending is ^S), then do not try the second grep and do not record anything at all.
In this example:
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZ^SGwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
The outcome would be null (nothing copied) as the 7cacs... string ends in ^S.
Is grep the correct tool? Grep a file and if the condition in the grep command is 'yes' then issue a different grep command but if the condition is 'no' then do nothing.
Thanks again.
I have noticed one additional modification.
Can one add an OR command to the second part? Grep if the second string starts with 5x OR 6x?
In the example below, grep -aPo "7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^" filename | grep -aPo "7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)" will extract the strings starting with 7 and the strings starting with 5x.
How can one change the 5x to 5x or 6x?
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7AAAAAscKLrrok9bwC3Z64NTnZM-^GwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM6x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
In this example, the desired outcome would be:
7cacscKLrrok9bwC3Z64NTnZ
5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
7AAAAAscKLrrok9bwC3Z64NTnZ
6x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
UPDATE MARCH 09:
I need to create a series of complex grep (or perl) commands to extract strings from a series of binary files.
I need two strings from the binary file.
The first string will always start with a 1.
The first string will end with a letter or number. The next letter will always be a lower case k. I do not want this k character.
The difficulty is that the ending k will not always be the first k in the string. It might be the first k but it might not.
After the k, there is a second string. The second string will always start with an A or a B.
The ending of the second string will be in one of two forms:
a) it will end with a space then display the first three characters from the first string in lower case followed by a )
b) it will end with a ^K then display the first three characters from the first string in lower case.
For example:
1pppsx9YPar8Rvs75tJYWZq3eo8PgwbckB4m4zT7Yg042KIDYUE82e893hY ppp)
Should be:
1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc and B4m4zT7Yg042KIDYUE82e893hY - delete the k and the space then ppp.
For example:
1zzzsx9YPkr8Rvs75tJYWZq3eo8PgwbckA2m4zT7Yg042KIDYUE82e893hY^Kzzz
Should be:
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc and A2m4zT7Yg042KIDYUE82e893hY - delete the second k and the ^Kzzz.
In the second example, we see that the first k is part of the first string. It is the k before the A that breaks up the first and second strings.
I hope there is a super grep expert who can help! Many thanks!
If your grep supports -P option, would you please try:
grep -a -Po "7[[:alnum:]]+(?=M-)" file
The -a option forces grep to read the input as a text file.
The -P option enables the perl-compatible regex.
The -o option tells grep to print only the matched substring(s).
The pattern (?=M-) is a zero-width lookahead assertion (introduced in
Perl): it requires M- to follow the match without including it in the result.
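As a quick check of the lookahead, here is a sketch run on the visible (cat -v style) text from the question rather than on raw bytes. The helper name extract_7 is made up, and -P needs a grep built with PCRE support.

```shell
# extract_7 (hypothetical helper): print every run that starts with 7 and is
# immediately followed by the literal text "M-", without printing the "M-".
extract_7() {
    grep -a -Po '7[[:alnum:]]+(?=M-)' "$@"
}

# Sample: the tail of the question's example line, as plain text.
printf '%s\n' '^#^#E7cacscKLrrok9bwC3Z64NTnZM-^G' | extract_7 \
    || echo 'no match, or this grep has no -P support' >&2
```

On the sample this should print 7cacscKLrrok9bwC3Z64NTnZ: the greedy [[:alnum:]]+ first swallows the M as well, then backs off one character so that the lookahead can see M-.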
Alternatively you can also say with sed:
sed 's/M-/\n/g' file | sed -n 's/.*\(7[[:alnum:]]\+\).*/\1/p'
The first sed command splits the input file into multiple lines by
replacing the substring M- with a newline.
This has two benefits: it breaks up the lines so that sed can find multiple matches,
and it removes the unwanted M- from the input.
The next sed command extracts the desired pattern from the input.
It assumes your sed accepts \n in the replacement, which is
a GNU extension (not POSIX compliant). Otherwise please try (in case you are working on bash):
sed 's/M-/\'$'\n''/g' file | sed -n 's/.*\(7[[:alnum:]]\+\).*/\1/p'
[UPDATE]
(The OP has updated the requirement; the following solutions address the update.)
Let me assume the string which starts with 7 and ends with M- is always followed
by exactly one string which starts with 5x and ends
with ^ (the ASCII caret character), with junk in between.
Then would you please try the following:
grep -aPo "7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)"
It executes the task in two steps (two cascaded greps).
The 1st grep narrows the input down to the candidate substring,
which includes the two desired sequences and the junk in between.
The regex .*? in between matches any (ASCII or binary) characters
except for a newline character.
The trailing ? makes the match as short as possible,
which avoids the overrun caused by the greedy nature of regex. It is intended to match the junk in between.
The 2nd grep combines two regexes with a pipe |, meaning logical OR,
and extracts the two desired sequences.
A potential problem with the grep solution is that grep is a line-oriented command
and cannot include a newline character in the matched string.
If the junk in between contains a newline character (I'm not sure whether that can happen), the above solution will fail.
As a workaround, perl allows flexible manipulation of binary data.
perl -0777 -ne '
while (/(7[[:alnum:]]+)M-.*?(5x[[:alnum:]]+)\^/sg) {
printf("%s\n%s\n", $1, $2);
}
' file
The regex is mostly the same as that of grep because the -P option of grep means
perl-compatible.
It can capture multiple patterns at once in variables $1 and $2, hence just one regex is enough.
The -0777 option to the perl command tells perl to slurp all data
at once.
The s option at the end of the regex makes a dot match a newline character.
The g option enables the global (multiple) match.
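To illustrate the newline case, here is a sketch with a made-up sample in which the junk between the two strings contains a newline. The wrapper name extract_pairs is hypothetical; the regex is the one from the perl command above.

```shell
# extract_pairs (hypothetical helper): the perl one-liner above, wrapped for reuse.
extract_pairs() {
    perl -0777 -ne 'while (/(7[[:alnum:]]+)M-.*?(5x[[:alnum:]]+)\^/sg) { print "$1\n$2\n" }' "$@"
}

# Made-up sample: "junk", a newline and "more" sit between the two strings.
# Prints 7abc and 5xdef on separate lines.
printf '7abcM-junk\nmore5xdef^rest\n' | extract_pairs
```

A line-oriented grep would see the 7 string and the 5x string on different lines here, so the cascaded grep version would print nothing for this input.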
[UPDATE2]
In order to make the regex match either 5x or 6x, replace 5x with (5|6)x.
Namely:
grep -aPo "7[[:alnum:]]+M-.*?(5|6)x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|(5|6)x[[:alnum:]]+(?=\^)"
As mentioned before, the pipe | means OR. The OR operator has the lowest priority in the evaluation, hence you need to enclose them with parens in this case.
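The effect of the parentheses is easy to see on a toy input. A small sketch with plain grep -E; the sample lines and helper names are made up for illustration.

```shell
# Without parentheses the alternation splits the whole pattern:
# "^5" OR "6x" (a line starting with 5, or containing 6x anywhere).
match_ungrouped() {
    grep -E '^5|6x' "$@"
}

# With parentheses only the digit alternates: a line starting with 5x or 6x.
match_grouped() {
    grep -E '^(5|6)x' "$@"
}

printf '%s\n' 5x1 6x2 7x3 59 a6x | match_ungrouped   # 5x1, 6x2, 59, a6x
printf '%s\n' 5x1 6x2 7x3 59 a6x | match_grouped     # 5x1, 6x2
```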
If there is a possibility that a digit other than 5 or 6 may appear, it will be safer to put [[:digit:]] instead, which matches any one digit between 0 and 9:
grep -aPo "7[[:alnum:]]+M-.*?[[:digit:]]x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|[[:digit:]]x[[:alnum:]]+(?=\^)"
[UPDATE3]
(Answering the OP's requirement on March 9th)
Let me start with Perl code, whose regex is relatively easy
to explain.
perl -0777 -ne 'while (/(1(.{3}).+)k([AB].*)[\013 ]\2/g){print "$1 $3\n"}' file
Output:
1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc B4m4zT7Yg042KIDYUE82e893hY
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc A2m4zT7Yg042KIDYUE82e893hY
[Explanation of regex]
(1(.{3}).+)k([AB].*)[\013 ]\2
(         start of the 1st capture group, referred to by $1 later
1         literal "1"
(         start of the 2nd capture group, referred to by \2 later
.{3}      any three characters, such as ppp or zzz
)         end of the 2nd capture group
.+        followed by any characters with a "greedy" match, which may include the 1st "k"
)         end of the 1st capture group
k         literal "k"
(         start of the 3rd capture group, referred to by $3 later
[AB].*    the character "A" or "B" followed by any characters
)         end of the 3rd capture group
[\013 ]   followed by ^K or a space
\2        followed by the text captured by group 2
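The breakdown above can be tried out without a binary file. A sketch that feeds the two example lines from the question through the same perl command, with printf producing a real ^K (octal 013); the wrapper name split_pairs is made up.

```shell
# split_pairs (hypothetical helper): the perl one-liner above, wrapped for reuse.
split_pairs() {
    perl -0777 -ne 'while (/(1(.{3}).+)k([AB].*)[\013 ]\2/g) { print "$1 $3\n" }' "$@"
}

# First line ends with " ppp)", second with a real ^K followed by "zzz".
printf '%s\n' '1pppsx9YPar8Rvs75tJYWZq3eo8PgwbckB4m4zT7Yg042KIDYUE82e893hY ppp)' | split_pairs
printf '1zzzsx9YPkr8Rvs75tJYWZq3eo8PgwbckA2m4zT7Yg042KIDYUE82e893hY\013zzz\n' | split_pairs
```

In the second line the greedy .+ runs past the first k and backs off only as far as the last k that is followed by A or B, which is why the early k stays inside the first string.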
When implementing it with grep, we will encounter a limitation of grep.
Although we want to extract multiple patterns from the input file,
the -e option (which can specify multiple search patterns) does not
work with -P option. Then we need to split the regex into two patterns
such as:
grep -Po "(1(.{3}).+)(?=k([AB].*)[\013 ]\2)" file
grep -Po "(1(.{3}).+)k\K([AB].*)(?=[\013 ]\2)" file
And the result will be:
1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc
B4m4zT7Yg042KIDYUE82e893hY
A2m4zT7Yg042KIDYUE82e893hY
Please note that the order of output is not the same as the order of appearance in the original file.
Another option is to use ripgrep (rg), which is a fast
and versatile alternative to grep. You may need to install ripgrep with
sudo apt install ripgrep or with another package manager.
An advantage of ripgrep is it supports -r (replace) option in which
you can make use of the backreferences:
rg -N -Po "(1(.{3}).+)k([AB].*)[\013 ]\2" -r '$1 $3' file
The -r '$1 $3' option prints the 1st and the 3rd capture groups and the result will be the same as perl.
In the general case, you can use the strings utility to pluck out ASCII from binary files; then of course you can try to grep that output for patterns that you find interesting.
Many traditional Unix utilities like grep have internal special markers which might get messed up by binary input. For example, the character \xFF was used for internal purposes by some versions of GNU grep so you can't grep for that character even if you can figure out a way to represent it in the shell (Bash supports $'\xff' for example).
A traditional approach would be to run hexdump or a similar utility, and then grep that for patterns. However, more modern scripting languages like Perl and Python make it easy to manipulate arbitrary binary data.
perl -ne 'print if m/\xff\xff/' </dev/urandom
This might work for you (GNU sed):
sed -En '/\n/!{s/M-\^G/\n/;s/7[^\n]*\n/\n&/};/^7[^\n]*/P;D' file
Split each line into zero or more lines that begin with 7 and end just before M-^G and only print such lines.

Bash - count a pattern and print the line containing the pattern

Hi everyone! While I was reading this discussion, "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could add the line containing the pattern next to the count values.
Somehow I wasn't able to add a comment to the discussion, so I'm posting a new question. Can somebody enlighten me?
There must be some misunderstanding here, so I put an example.
Let's say, I have a DNA sequence like below and want to find out how many 'CG' are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I hope the example above helps to make the question clearer.
Thank you!
You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
This uses tee and process substitution. Note that grep -c counts matching lines, not individual occurrences within a line.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2
cat fileName | grep pattern | uniq -c
I just found a really simple and elegant solution using EXCEL.
The formula goes like below...
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula basically does is take the total number of characters in a cell and the number left after removing the pattern ("CG" in this case), then subtract the two. Since each "CG" is replaced by nothing, two characters go missing per occurrence, so dividing the difference by the length of your pattern (2 in this case) gives the number of occurrences.
For example, the following sequence contains 50 characters and 13 CGs.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After replacing every "CG" with nothing, 24 characters remain.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurrences,
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
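The same length arithmetic also works on the command line and gives the per-line output asked for in the question. This is a sketch with awk, not part of the Excel answer; count_per_line is a made-up name, and the pattern is treated as a regular expression by gsub, which makes no difference for plain letters such as CG.

```shell
# count_per_line PATTERN [FILE...]  (hypothetical helper)
# Prints "<count> <line>" using (length before - length after removal) / pattern length.
count_per_line() (
    pattern=$1
    shift
    awk -v pat="$pattern" '{
        stripped = $0
        gsub(pat, "", stripped)    # delete every occurrence of the pattern
        print (length($0) - length(stripped)) / length(pat), $0
    }' "$@"
)

# Prints "1 AATTCGG", "3 ACGTCGCG" and "0 AAAA".
printf '%s\n' AATTCGG ACGTCGCG AAAA | count_per_line CG
```

Called as count_per_line CG file.txt, it reads the file instead of standard input.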
How simple is that!
You can see the original post in the following link.
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel#sthash.H4VfOkGB.dpbs
English is not my primary language, so please understand errors in my writing.
People are geniuses!

How do I make Grep less Greedy with a shell variable?

I have been polishing up my grep skills with a particular problem I have found. Basically it goes like this. I have a local file with words from a dictionary. The user will pass in a word and the script will find all words that can be made with letters from that word. The catch is, the words must be at least 4 characters long and you can only use as many letters as the user passes in. So if I passed in a word like "college", clee and cell would be acceptable, but cocco would not: it uses only letters from the word, but college has just one o and one c. Here is my regular expression thus far.
egrep -i "^[("$text")]{4,}$" /usr/dict/words
This will find strings that contain these letters that are at least four characters long however grep is being greedy and grabbing more characters than those in the variable. How would I specify to only use the amount of characters in the variable? I've been stuck on this for a few days now to no avail. Thank you for your help and time community!
To expand on what @chepner said in the comments, regular expressions won't test for the exact number of characters listed in a bracket expression. In other words, [ee] will not match two e's; it matches a single e, so [ee] is just a redundant way of writing [e]. A quantifier such as + matches one or more of the preceding expression: [e]+ would match at least one e, up to the length of the string. To match a specific number of e's you'd have to know that beforehand and do something like [e]{2,5}, which would match at least 2 but no more than 5 e's.
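The difference between listing a letter twice and counting it is easy to check on a few test strings. A small sketch with grep -E; the helper names are made up for illustration.

```shell
# A bracket expression matches exactly one character, however often a letter is listed.
one_from_set() {
    grep -E '^[ee]$' "$@"
}

# An interval counts repetitions: here two to five e's.
two_to_five() {
    grep -E '^e{2,5}$' "$@"
}

printf '%s\n' e ee eee eeeeee | one_from_set   # prints only: e
printf '%s\n' e ee eee eeeeee | two_to_five    # prints: ee, eee
```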
Even if you set up a pre-processor to calculate how often each letter occurs in the input, you'd have a hard time getting the regular expression to match the way you expect. To go with your example of "college", the preprocessed result would look like c=1, o=1, l=2, e=2, g=1. If you were to put it in a regular expression such as ^c?o?l{0,2}e{0,2}g?$ [note that "?" in this context is shorthand for {0,1}], it would not even match "college", because the match is positional; it would match "colleg", "colleeg", "coleg", etc.
What you have only verifies that there are at least four letters from the bracket expression []. To check whether the entire line is at least 4 characters long, you may want to use grep -E "^.{4,}$".
If you aren't limited to only using grep, but are limited to bash, you may be able to use the script below to solve your problem:
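read -r input
while read -r line
do
    # Skip words shorter than four characters.
    [ "${#line}" -ge 4 ] || continue
    testVal=$line
    # Use up each input letter at most once by replacing one occurrence with "_".
    for i in $(echo "$input" | sed -e 's/\(.\)/\1 /g')
    do
        testVal=$(echo "$testVal" | sed -e "s/$i/_/i")
    done
    # The word qualifies only if every one of its letters was used up.
    if echo "$testVal" | grep -q '^_\+$'
    then
        echo "$line"
    fi
done < /usr/dict/words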

Line-insensitive pattern-matching – How can some context be displayed?

I'm looking for a technique to search a file for a pattern (typically a phrase) that may span multiple lines, and print the match with some surrounding context on one line. The file's lines may be too long or too short for a sensible amount of context; I'm not concerned to print a single line of the file, as you might do with grep, but rather to print onto a single line of my terminal.
Basic requirements
Show a specified number of characters before and after the match, even if it straddles lines.
Show newlines as ‘\n’ to prevent flooding the terminal with whitespace if there are many short lines.
Prefix output line with line and column number of the start of the match.
Preferably a sed oneliner.
So far, I'm assuming that the pattern has a constant length shorter than the width of the terminal, which is okay and very useful for most phrases I might want to search for.
Further considerations
I would be interested to see how the following could also be achieved using sed or the likes:
Prefix output line with line and column number range of the match.
Generalise for variable length patterns, truncating the middle of the match to ‘[…]’ if too long.
Can I avoid using something like ‘[ \n]’ between words in a phrase regex on a file that has been ‘hard-wrapped’ using newlines, without altering what's printed?
Using the output of stty size to dynamically determine the terminal width may be useful, though I'd probably prefer to leave it static in case I want to resize the terminal or use it from screen attached from terminals of different sizes.
Examples
The basic idea for 10 characters of context would be something like:
‘excessively long line with match in the middle\n’ → ‘line with match in the mi’
‘short\nlines\n\nmatch\nlots\nof\nshort\nlines\n’ → ‘rt\nlines\n\nmatch\nlots\nof\ns’
Here's a command to return the 20 characters surrounding a pattern, spanning newlines and including them as a character:
$ input="test.txt"
$ pattern="match"
$ tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g'
line with match in the mi
rt\nlines\n\nmatch\nlots\nof\ns
With row number of the match as well:
$ paste <(grep -n "${pattern}" "$input" | cut -d: -f1) \
<(tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g')
1 line with match in the mi
5 rt\nlines\n\nmatch\nlots\nof\ns
I realise this doesn't quite fulfill all of your basic requirements, but am not good enough with awk to do better (guess this is technically possible in sed, but I don't want to think about what it would look like).
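For the line and column prefix, one possible direction is to let perl slurp the file and work with character offsets. This is only a sketch under a few assumptions: the pattern is treated as a literal string, the width is the same on both sides, and the helper name context_search and its argument order are made up for illustration.

```shell
# context_search PATTERN WIDTH [FILE...]  (hypothetical helper)
# Prints "line:column context" for every literal match, with newlines shown as \n.
context_search() {
    perl -0777 -ne '
        BEGIN { $pat = shift @ARGV; $w = shift @ARGV }
        while (/\Q$pat\E/g) {
            my ($s, $e) = ($-[0], $+[0]);        # start and end offsets of the match
            my $from = $s > $w ? $s - $w : 0;    # clamp at the start of the input
            my $ctx = substr($_, $from, $e - $from + $w);
            $ctx =~ s/\n/\\n/g;                  # show newlines as a visible \n
            my $before = substr($_, 0, $s);
            my $line = 1 + ($before =~ tr/\n//); # newlines before the match, plus one
            my $col = $s - rindex($before, "\n");
            print "$line:$col $ctx\n";
        }
    ' "$@"
}

printf 'excessively long line with match in the middle\nshort\nlines\n\nmatch\nlots\nof\nshort\nlines\n' |
    context_search match 10
```

Unlike the grep -o version, a match closer than WIDTH characters to the start or end of the input is still reported, just with less context.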