compare list items against another list - python-2.7

So let's say I have a 3-item list:
myString = "prop zebra cool"
items = myString.split(" ")
#items = ["prop", "zebra", "cool"]
And another list, content, containing hundreds of string items. It's actually a list of files.
Now I want to get only the items of content that contain all of the strings in items.
So I started this way:
assets = []
for c in content:
    for item in items:
        if item in c:
            assets.append(c)
And then somehow isolate only the items that are duplicated in the assets list.
This would work fine, but I don't like it; it's not elegant. I'm sure there is some other way to deal with this in Python.

If I interpret your question correctly, you can use the built-in all().
In your case, assuming:
content = [
    "z:/prop/zebra/rig/cool_v001.ma",
    "sjasdjaskkk",
    "thisIsNoGood",
    "shakalaka",
    "z:/prop/zebra/rig/cool_v999.ma"
]
string = "prop zebra cool"
You can do the following:
assets = []
matchlist = string.split(' ')
for c in content:
    if all(s in c for s in matchlist):
        assets.append(c)
print assets
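The same check can also be written as a list comprehension. The following is only a sketch: the helper name filter_containing_all is made up for illustration, and print is called with parentheses so the snippet runs on both Python 2.7 and Python 3.

```python
def filter_containing_all(content, words):
    """Return the entries of content that contain every word as a substring.

    Hypothetical helper name; it wraps the all() check shown above.
    """
    return [c for c in content if all(w in c for w in words)]


content = [
    "z:/prop/zebra/rig/cool_v001.ma",
    "sjasdjaskkk",
    "thisIsNoGood",
    "shakalaka",
    "z:/prop/zebra/rig/cool_v999.ma",
]

assets = filter_containing_all(content, "prop zebra cool".split(" "))
print(assets)
```

Note that this is a plain substring test, so the order in which the words appear in each entry does not matter.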
Alternative Method
If you want to have more control (i.e. you want to make sure that you only match strings where your words appear in the specified order), then you could go with regular expressions:
import re
# convert content to a single, tab-separated, string
contentstring = '\t'.join(content)
# generate a regex string to match
matchlist = [r'(?:{0})[^\t]+'.format(s) for s in string.split(' ')]
matchstring = r'([^\t]+{0})'.format(''.join(matchlist))
assets = re.findall(matchstring, contentstring)
print assets
Assuming \t does not appear in the strings of content, you can use it as a separator and join the list into a single string (obviously, you can pick any other separator that better suits you).
Then you can build your regex so that it matches any substring containing your words and any other character, except \t.
In this case, matchstring results in:
([^\t]+(?:prop)[^\t]+(?:zebra)[^\t]+(?:cool)[^\t]+)
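where:
(?:word) is a non-capturing group: word must be matched, but it is not returned as a separate group
[^\t]+ matches one or more characters other than \t, so at least one character must appear before, between and after the words
the outer () captures and returns each whole string matching your rule (in this case z:/prop/zebra/rig/cool_v001.ma and z:/prop/zebra/rig/cool_v999.ma)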
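If joining everything into one string is not required, the ordered match can also be tested entry by entry. This is a variation on the approach above, not the same regex: the helper name filter_in_order is hypothetical, the words are passed through re.escape so they are treated literally, and '.*' allows zero or more characters between them (so a word at the very start of an entry also matches).

```python
import re


def filter_in_order(content, words):
    """Return entries in which all words occur in the given order.

    Hypothetical helper; each word is escaped and the words are joined
    with '.*', so any characters (or none) may separate them.
    """
    pattern = re.compile(".*".join(re.escape(w) for w in words))
    return [c for c in content if pattern.search(c)]


candidates = [
    "z:/prop/zebra/rig/cool_v001.ma",  # words in order
    "cool/zebra/prop",                 # same words, wrong order
    "propzebracool",                   # in order, nothing in between
]

ordered = filter_in_order(candidates, ["prop", "zebra", "cool"])
print(ordered)
```

Because each entry is searched separately, no separator character has to be reserved.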

Related

How can I do a regex replace using a List as the possible match entries?

I have a list of terms which I want to match as follows:
final List _emotions = [
  '~~wink~~',
  '~~bigsmile~~',
  '~~sigh~~',
];
And a second list of replacements:
final List _replacements = [
  '0.gif',
  '1.gif',
  '2.gif',
];
So that if I have text:
var text = "I went to the store and got a ~~bigsmile~~";
I could have it replace the text as
I went to the store and got a <img src="1.gif" />
So essentially, I was thinking of running a regex replace on my text variable, but the search pattern would be based on my _emotions List.
Forming the replacement text should be easy, but I'm not sure how I could use the list as the basis for the search terms
How is this possible in dart?
You need to merge the two string lists into a single Map<String, String> that will serve as a dictionary (make sure the _emotions strings are in lower case since you want a case insensitive matching), and then join the _emotions strings into a single alternation based pattern.
After getting a match, use String#replaceAllMapped to find the right replacement for the found emotion.
Note you can shorten the pattern if you factor in the ~~ delimiters (see code snippet below). You might also apply more advanced techniques for the vocabulary, like regex tries (see my YT video on this topic).
final List<String> _emotions = [
  'wink',
  'bigsmile',
  'sigh',
];
final List<String> _replacements = [
  '0.gif',
  '1.gif',
  '2.gif',
];
Map<String, String> map = Map.fromIterables(_emotions, _replacements);
String text = "I went to the store and got a ~~bigsmile~~";
RegExp regex = RegExp("~~(${_emotions.join('|')})~~", caseSensitive: false);
print(text.replaceAllMapped(regex, (m) => '<img src="${map[m[1]?.toLowerCase()]}" />'));
Output:
I went to the store and got a <img src="1.gif" />

How to detect text inside specific character on Dart

I am trying to create a function to detect whether some text is enclosed by a special symbol/character, so if I have this String:
String text = 'This is normal String, <b> This is Bold String <b>';
I would get a List of Maps like this:
[ {'This is Normal String' : 'normal'} ,
{'This is Bold String' : 'bold'} ]
Then I can render it with RichText.
What I have tried is splitting the text like this: List<String> list = text.split('<b>');
and making the even indexes of the list bold, but it does not behave the way I want if the bold tag is at the front of the text. And what if I need to detect another tag such as <i>? Is there any way to do this?
Thank you
var string = 'I am here';
string.contains('h');// true
//You can use Regex to find patterns inside a string
string.contains(new RegExp(r'[A-Z]')); // true

Tokenize a sentence where each word contains only letters using RegexTokenizer Scala

I am using spark with scala and trying to tokenize a sentence where each word should only contain letters. Here is my code
def tokenization(extractedText: String): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  val textDataFrame = existingSparkSession.createDataFrame(Seq(
    (0, extractedText))).toDF("id", "sentence")
  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  val regexTokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W")
  val regexTokenized = regexTokenizer.transform(textDataFrame)
  regexTokenized.select("sentence", "words").show(false)
  return regexTokenized;
}
If I provide the sentence "I am going to school5", then after tokenization the result should contain only [i, am, going, to] and should drop school5. But my current pattern doesn't ignore words that contain digits. How am I supposed to drop words with digits?
You can use the settings below to get your desired tokenization. Essentially you extract words which only contain letters using an appropriate regex pattern.
val regexTokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words").setGaps(false).setPattern("\\b[a-zA-Z]+\\b")
val regexTokenized = regexTokenizer.transform(textDataFrame)
regexTokenized.show(false)
+---+---------------------+------------------+
|id |sentence             |words             |
+---+---------------------+------------------+
|0  |I am going to school5|[i, am, going, to]|
+---+---------------------+------------------+
For the reason why I set gaps to false, see the docs:
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
You want to repeatedly match the regex, rather than splitting the text by a given regex.
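The difference between splitting on a pattern and repeatedly matching a pattern can be illustrated outside Spark. The sketch below uses Python's re module rather than RegexTokenizer, so it is only an analogy; the input is lower-cased by hand here to mirror the lower-cased tokens in the output above.

```python
import re

sentence = "I am going to school5".lower()

# Splitting on non-word characters (analogous to the default, gaps = true):
# "school5" survives because the digit is a word character.
split_tokens = re.split(r"\W", sentence)

# Repeatedly matching letter-only words (analogous to gaps = false):
# "school5" yields no match because there is no word boundary
# between "school" and "5".
match_tokens = re.findall(r"\b[a-zA-Z]+\b", sentence)

print(split_tokens)  # ['i', 'am', 'going', 'to', 'school5']
print(match_tokens)  # ['i', 'am', 'going', 'to']
```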

Regex pattern matching with sentence which has words starting with entered words

Suppose I have two words:
words = ['create', 'mult']
and a list:
list = ['can we create malfunction channels in teams', 'i want to create multiple teams in microsoft teams']
I want to keep a sentence from list only if, for every word in words, the sentence contains that word in full or contains a word that starts with it.
desired output = ['i want to create multiple teams in microsoft teams']
Here the 1st sentence gets filtered out because it has no word starting with mult, although it has create.
Here is what you can do:
import re
words = ['create', 'nn']
sentences = ['can we create malfunction channels in teams', 'i want to create multiple teams in microsoft teams']
# Note: the joined pattern expects the words to appear in the same order as in `words`
pattern = re.compile(''.join([r'\b{word}\w*\b.*'.format(word=word) for word in words]))
result = [s for s in sentences if pattern.findall(s)]
print(result) # []
words = ['create', 'mult']
pattern = re.compile(''.join([r'\b{word}\w*\b.*'.format(word=word) for word in words]))
result = [s for s in sentences if pattern.findall(s)]
print(result) # ['i want to create multiple teams in microsoft teams']
You do not need a regex pattern for this. The only comparison needed is to test whether one string starts with another string: startswith.
You want to test:
for each sentence in list ...
where ANY word starts with one of the phrases in words ...
for ALL of the phrases in words.
Then
words = ['create', 'mult']
list = ['can we create malfunction channels in teams',
        'i want to create multiple teams in microsoft teams']
result = [sentence for sentence in list
          if all(
              any(
                  word.startswith(phrase)
                  for word in sentence.split()
              )
              for phrase in words
          )]
leads to
['i want to create multiple teams in microsoft teams']
You can run it with different words to verify it's really working.
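One way to try different inputs is to wrap the comprehension in a function. This is just a sketch: the function name filter_sentences and the parameter names are made up here, and the list is called sentences to avoid shadowing the built-in list.

```python
def filter_sentences(sentences, prefixes):
    """Keep sentences in which, for every prefix, some word starts with it.

    Hypothetical helper wrapping the all()/any() test shown above.
    """
    return [
        sentence for sentence in sentences
        if all(
            any(word.startswith(prefix) for word in sentence.split())
            for prefix in prefixes
        )
    ]


sentences = [
    'can we create malfunction channels in teams',
    'i want to create multiple teams in microsoft teams',
]

print(filter_sentences(sentences, ['create', 'mult']))
print(filter_sentences(sentences, ['create']))
print(filter_sentences(sentences, ['create', 'nn']))
```

The first call keeps only the second sentence, the second call keeps both, and the third keeps none, since no word starts with nn.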

Dart List count showing one when splitting an empty string

I'm trying to do a basic grab from a Text Column in sqlite and processing it to a field in a model that is a List<String>. If data is empty then I want to set it as an empty list []. However, when I do this for some reason I get a list that looks empty but really with a length of 1. To recreate this issue I simplified the issue with the following.
String stringList = '';
List<String> aList = [];
aList = stringList.split(',');
print(aList.length);
Why does this print 1? Shouldn't it return 0 since there are no values with a comma in it?
This should print 1.
When you split a string on commas, you are finding all the positions of commas in the string, then returning a list of the strings around those. That includes the strings before the first comma and after the last comma.
In the case where the input contains no commas, you still find the initial string.
If your example had been:
String input = "451";
List<String> parts = input.split(",");
print(parts.length);
you would probably expect parts to be the list ["451"]. That is also what happens here because the split function doesn't distinguish empty parts from non-empty.
If a string does contain a comma, say the string ",", you get two parts when splitting, in this case two empty parts. In general, you get n+1 parts for a string containing n matches of the split pattern.
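For comparison, Python's str.split with an explicit separator follows the same n+1 rule, which makes it easy to experiment with. The snippet below is Python, not Dart, and split_or_empty is a hypothetical helper showing one way to map an empty string to an empty list, assuming that is the behaviour wanted for the empty column.

```python
def split_or_empty(text, sep=","):
    """Hypothetical helper: return [] for an empty string instead of ['']."""
    return text.split(sep) if text else []


print("".split(","))        # ['']        0 separators -> 1 part
print("451".split(","))     # ['451']     0 separators -> 1 part
print(",".split(","))       # ['', '']    1 separator  -> 2 parts
print(split_or_empty(""))   # []
print(split_or_empty("a,b"))  # ['a', 'b']
```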