Compare or match 2 strings and display matched word - python-2.7

I would like to compare 2 strings and display any matched words.
For example -
string1 = "cat feet"
string2 = "cat shoes"
The result should be "cat".
How can I do this with regular expressions? Or is there a better way to do this?

Split each string on whitespace, and convert both to sets. Their intersection will contain all of the words they have in common.
>>> set("cat feet".split()).intersection(set("cat shoes".split()))
set(['cat'])
This method does not care about ordering of words. "feet cat" and "cat shoes" will have output "cat", even though "cat" does not appear in the same position in both strings. If you want to find words that exist in the same position in both strings, you can zip the split strings together, and display only the words that exist in the same place in both:
>>> [a for a,b in zip("cat feet".split(), "cat shoes".split()) if a == b]
['cat']
>>> [a for a,b in zip("feet cat".split(), "cat shoes".split()) if a == b]
[]
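A set does not preserve word order, so if the matches should be displayed in the order they appear in the first string, a small variation may help. This is only a sketch; the function name common_words is made up for illustration.

```python
def common_words(first, second):
    """Return the words of `first` that also occur in `second`, in order, without repeats."""
    second_words = set(second.split())
    seen = set()
    result = []
    for word in first.split():
        # Keep a word only the first time it is seen in `first`.
        if word in second_words and word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(common_words("cat feet", "cat shoes"))
print(common_words("feet cat", "cat shoes"))
```

Both calls print ['cat'], because position is ignored just as in the set version.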

Just regarding the use of regular expressions:
Regular expressions are equivalent to finite automata, which have only a finite set of states and therefore only a bounded amount of memory. As a result, a single fixed pattern cannot remember a word of unknown, arbitrary length from one string in order to look for it in another.
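In practice the words of the first string are known at run time, so one workaround is to generate the pattern from them instead of writing a fixed one. The sketch below assumes whitespace-separated words; the helper name shared_words_regex is hypothetical.

```python
import re


def shared_words_regex(first, second):
    """Find words of `first` inside `second` using a pattern built from `first`."""
    words = first.split()
    if not words:
        return []
    # re.escape guards against words that contain regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    # The lookarounds on whitespace make sure only whole words match.
    pattern = re.compile(r"(?<!\S)(?:%s)(?!\S)" % alternation)
    return pattern.findall(second)


print(shared_words_regex("cat feet", "cat shoes"))
```

For plain word matching the set approach is simpler; a generated pattern like this is mainly useful when the match already has to happen inside a larger regex.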

Related

How can I tell if there are three or more characters between matches in a regex?

I'm using Ruby 2.1. I have this logic that looks for consecutive pairs of strings in a bigger string
results = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
My question is, how do I iterate over the list of results and print out whether there are three or more characters between the two strings? For instance if my string were
"abc def"
The above would produce
[["abc def", "abc", "def"]]
and I'd like to know whether there are three or more characters between "abc" and "def."
Use a quantifier for the spaces in between: \b((\S+?)\b\s{3,}\b(\S+?))\b
Also, the inner boundaries are not really needed:
\b((\S+?)\s{3,}(\S+?))\b
A straightforward way to check this is to run a separate regex on each full match (the first element of every result):
results.each { |x| p x[0][/\S+?\b(.*?)\b\S+?/, 1].size }
This prints the size of the gap for each result.
Another way is to take the size of the captured groups and subtract them:
results = []
line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/) do |s, group1, group2|
results << $~ if s.size - group1.size - group2.size >= 3
end
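The subtraction idea can also be expressed with match offsets. The following Python sketch is an illustration rather than a translation of the Ruby above: it looks at every pair of neighbouring tokens (not scan's non-overlapping pairs), and the function name gaps_between_tokens is made up.

```python
import re


def gaps_between_tokens(line):
    """Yield (left, right, gap_length) for each pair of neighbouring tokens."""
    tokens = list(re.finditer(r"\S+", line))
    for left, right in zip(tokens, tokens[1:]):
        # The gap is whatever lies between the end of one token and the start of the next.
        yield left.group(), right.group(), right.start() - left.end()


for left, right, gap in gaps_between_tokens("abc def   ghi"):
    print(left, right, gap >= 3)
```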

Using regexpr in R with multiple strings affirmation and negation

I am grepping a column of notes, looking for the presence of some strings and the absence of others. The expression looks like
toMatch <- c("words", "i", "want", "to")
notToMatch <- c("not", "in", "my", "res")
insert <- paste(paste(toMatch, collapse="|"), "!", paste(notToMatch, collapse="!"), sep="")
regexpr(insert, df$notes, ignore.case=T)
It seems to me that regexpr will count
printNotes = +1 presence and -1 absence
and if that expression evaluates to printNotes > 0, it returns a value other than -1 (which in regexpr indicates not found).
Any suggested syntax for regexpr to return -1 if any of the notToMatch "!" arguments return TRUE?
Thanks much!
You can use grepl() to get a logical vector of where the strings have matched and then sum() that vector to see the number which are matches. You can do the same thing (roughly) with grep() and counting the length of the resultant vector but grepl() behaves a bit more consistently.
If you want to get the inverse of any match you can do !grepl("match", x) and it will show the logical inverse.
If you specifically want it to return TRUE or "!", you can do something like ifelse(grepl("m", letters), TRUE, "!"), which searches the letters constant (all 26 lower-case English letters) for "m" and returns TRUE on a match and "!" on a failure to match.
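The underlying logic is two independent tests combined with "and not": one pattern for the wanted terms, one for the unwanted ones. A minimal Python sketch of that idea, using the term lists from the question (the helper names any_of and keep are made up):

```python
import re

to_match = ["words", "i", "want", "to"]
not_to_match = ["not", "in", "my", "res"]


def any_of(terms):
    """Compile a case-insensitive pattern matching any one of `terms`."""
    return re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


wanted = any_of(to_match)
unwanted = any_of(not_to_match)


def keep(note):
    """True when the note contains a wanted term and no unwanted term."""
    return bool(wanted.search(note)) and not unwanted.search(note)


print([keep(n) for n in ["I want tea", "not wanted", "xyz"]])
```

As in the question, the terms match as substrings, so a short term such as "i" or "in" also matches inside longer words.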

How to extract line numbers from a multi-line string in Vim?

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)
You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim variable is incorrectly quoted. To write a single backslash, put the string in single quotes; inside double quotes you would need two backslashes.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")
One can take advantage of the substitute with an expression feature (see
:help sub-replace-\=) to run over all of the target matches, appending them
to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')
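For comparison outside Vim, the same "capture every match into a list" idea is a single call in Python. This sketch assumes the single-quoted form of the string, where the backslashes are literal.

```python
import re

# Raw string: the backslashes are literal, as in Vim's single-quoted form.
a = r"\%8l............\|\%11l..........\|\%17l.........\|\%20l...."

# Capture the digits that sit between a literal "\%" and "l".
line_numbers = re.findall(r"\\%(\d+)l", a)
print(line_numbers)
```

This prints ['8', '11', '17', '20'].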

Union in regular expression in R

I'm trying to use regular expressions in R to find one or more phrases within a vector of long sentences (which I'll call x).
So, for example, this works fine for one phrase:
grep("(phrase 1)",x)
But this doesn't work for two (or more) phrases:
grep("(phrase 1)+(phrase 2)+",x)
At least, not as I would expect. As I read it, this last one should give me all matches in x for one or more occurrences of phrase 1 and one or more occurrences of phrase 2. But it returns nothing.
Another way
which(grepl("(phrase 1)+",x) & grepl("(phrase 2)+",x))
You have to tell it to skip over any intervening characters:
grep("(phrase 1)+.*(phrase 2)+",x)
Also note that it will not reverse the order, so you might have to add that explicitly. Overall, it might be simpler to search each phrase separately (especially if there are more than two phrases), and then combine with intersect and union as you want to get overall results.
Full examples (e.g. with, you know, data ...) are always good.
The main key for regexps in R is to remember that there are three (!!) different engines. I tend to like the Perl regexps.
Next, it is important to remember that there are metacharacters, so if you want literal parentheses, you need to escape them.
With that, here is an example:
> txt <- c("The grey fox jumped", "The blue cat slept", "The sky was falling")
> grep("blue", txt) # finds sentence two
[1] 2
> grep("(grey|blue)", txt, perl=TRUE) # finds one and two
[1] 1 2
> grep("(red|blue)", txt, perl=TRUE) # finds only two (as it should)
[1] 2
So with Perl regexps, you list alternatives inside parentheses, separated by a pipe symbol.
There's a way to do it with a single regex using lookaheads, though most regex engines will execute it pretty slowly:
> txt <- c("The grey fox jumped", "The blue cat slept", "The fox is grey", "The cat is grey")
> grep("(?=.*fox)(?=.*grey)", txt, perl=TRUE)
[1] 1 3
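The difference between "any of the phrases" (union) and "all of the phrases" (intersection) can be made explicit by testing each phrase separately and combining the results. A Python sketch of that approach, reusing the sample sentences above and 1-based indices to mirror grep's output:

```python
import re

txt = ["The grey fox jumped", "The blue cat slept", "The fox is grey", "The cat is grey"]
phrases = ["fox", "grey"]

# Union: sentences containing at least one phrase.
either = [i for i, s in enumerate(txt, 1)
          if any(re.search(re.escape(p), s) for p in phrases)]

# Intersection: sentences containing every phrase, in any order.
both = [i for i, s in enumerate(txt, 1)
        if all(re.search(re.escape(p), s) for p in phrases)]

print(either)
print(both)
```

This prints [1, 3, 4] and then [1, 3]; the second result agrees with the lookahead version.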

How do I assign many values to a particular Perl variable?

I am writing a script in Perl which searches for a motif(substring) in protein sequence(string). The motif sequence to be searched (or substring) is hhhDDDssEExD, where:
h is any hydrophobic amino acid
s is any small amino acid
x is any amino acid
h,s,x can have more than one value separately
Can more than one value be assigned to one variable? If yes, how should I do that? I want to assign a list of multiple values to a variable.
It seems like you want some kind of pattern matching. This can be done with strings using regular expressions.
You can use character classes in your regular expression. The classes you mentioned would be:
h -> [VLIM]
s -> [AG]
x -> [A-IK-NP-TV-Z]
The last one means "A to I, K to N, P to T, V to Z".
The regular expression for your example would be:
/[VLIM]{3}D{3}[AG]{2}E{2}[A-IK-NP-TV-Z]D/
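Rather than writing the expression by hand, it can be generated from the motif string with a lookup table. The Python sketch below assumes the character classes listed above; the function name motif_to_regex and the test sequence are invented for illustration.

```python
import re

# Character classes taken from the answer above; adjust to your own definitions.
CLASSES = {"h": "[VLIM]", "s": "[AG]", "x": "[A-IK-NP-TV-Z]"}


def motif_to_regex(motif):
    """Translate a motif such as 'hhhDDDssEExD' into a compiled regex."""
    # Symbols without a class (here D and E) stand for themselves.
    return re.compile("".join(CLASSES.get(ch, re.escape(ch)) for ch in motif))


pattern = motif_to_regex("hhhDDDssEExD")
match = pattern.search("MKTVLIDDDAGEEKDLLQ")  # made-up sequence
print(match.group() if match else "no match")
```

On this sequence the match is VLIDDDAGEEKD.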
I am no great expert in Perl, so there is quite possibly a quicker way to do this, but it seems like the match operator "//" in list context is what you need. When you assign the result of a match operation to a list, the match operator takes on list context and returns a list with each of the parenthesis-delimited sub-expressions. If you specify global matches with the "g" flag, it will return a list of all the matches of each sub-expression. Example:
# print a list of each match for "x" in "xxx"
@aList = ("xxx" =~ /(x)/g);
print(join(".", @aList));
Will print out
x.x.x
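For readers more familiar with Python, the closest counterpart to a global match in list context is re.findall, which also returns every match of the group as a list. A brief sketch:

```python
import re

# findall returns every match of the single group, much like Perl's //g in list context.
a_list = re.findall(r"(x)", "xxx")
print(".".join(a_list))
```

This also prints x.x.x.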
I'm assuming you have a regular expression for each of those 5 types h, D, s, E, and x. You didn't say whether each of these parts is a single character or multiple, so I'm going to assume they can be multiple characters. If so, your solution might be something like this:
$h = ""; # Insert regex to match "h"
$D = ""; # Insert regex to match "D"
$s = ""; # Insert regex to match "s"
$E = ""; # Insert regex to match "E"
$x = ""; # Insert regex to match "x"
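# Wrap each repeated part in one capturing group so that $1, $3 and $5
# hold the whole run rather than only its last repetition.
$sequenceRE = "((?:$h){3})((?:$D){3})((?:$s){2})((?:$E){2})($x)($D)";
if ($line =~ /$sequenceRE/) {
    $hPart = $1;
    $sPart = $3;
    $xPart = $5;
    # A global match in list context splits each part into its individual values.
    @hValues = ($hPart =~ /($h)/g);
    @sValues = ($sPart =~ /($s)/g);
    @xValues = ($xPart =~ /($x)/g);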
}
I'm sure there is something I've missed, and there are some subtleties of perl that I have overlooked, but this should get you most of the way there. For more information, read up on perl's match operator, and regular expressions.
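The two-step idea (match the whole motif, then split each captured part into its individual values) can be sketched in Python with named groups. The character classes are borrowed from another answer in this thread and the test sequence is made up, so treat both as assumptions.

```python
import re

# Assumed definitions for h, s and x; D and E are literal residues.
h, s, x = "[VLIM]", "[AG]", "[A-IK-NP-TV-Z]"

# Each named group captures a whole run; (?:...) keeps the repeat inside it.
sequence_re = re.compile(
    r"(?P<h>(?:%s){3})D{3}(?P<s>(?:%s){2})E{2}(?P<x>%s)D" % (h, s, x)
)

m = sequence_re.search("VLIDDDAGEEKD")
if m:
    h_values = re.findall(h, m.group("h"))
    s_values = re.findall(s, m.group("s"))
    x_values = re.findall(x, m.group("x"))
    print(h_values, s_values, x_values)
```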
I could be way off, but it sounds like you want an object with a built in method to output as a string.
If you start with a string, like the one you mentioned, you could pass the string to the class as a new object, use regular expressions like everyone has already suggested to parse out the chunks that you would then assign as variables to that object. Finally, you could have it output a string based on the variables of that object, for instance:
$string = "COHOCOHOCOHOCOHOCOHOC";
class Organic {
    public $hydro;
    public $carb;
    function __construct($chem) {
        // preg_match_all returns the number of matches found
        $this->hydro = preg_match_all("/OHO/", $chem);
        $this->carb = preg_match_all("/C/", $chem);
    }
    function __toString() {
        return $this->carb . "=" . $this->hydro;
    }
}
$sugar = new Organic($string);
echo $sugar;
Okay, that came out as PHP, not Perl. But if I understand your question correctly, you are looking for a way to get all of the info from the string but keep it tied to that string. That would be objects and classes.
You probably want an array (or arrayref) or a pattern (qr//).
Or maybe Quantum::Superpositions.