To make things easier (but also a bit more complicated), I tried to implement a concept of "combined / concise tags" that expand into multiple basic tag forms.
In this case the tags consist of one or more "sub-tags", delimited by colons:
food:fruit:apple:sour/sweet
drink:coffee/tea:hot/cold
wall/bike:painted:red/blue
Slashes indicate "sub-tag" interchangeability.
Therefore the interpreter translates them to this:
food:fruit:apple:sour
food:fruit:apple:sweet
drink:coffee:hot
drink:coffee:cold
drink:tea:hot
drink:tea:cold
wall:painted:red
wall:painted:blue
bike:painted:red
bike:painted:blue
The code used (not perfect, but works):
import itertools

def slash_split_tag(tag):
    if '/' not in tag:
        return [tag]  # wrap in a list so callers can always iterate over the result
    subtags = tag.split(':')
    pattern, v_pattern = (), ()
    for subtag in subtags:
        if '/' in subtag:
            pattern += (None,)
            v_pattern += (tuple(subtag.split('/')),)
        else:
            pattern += (subtag,)

    def merge_pattern_and_product(pattern, product):
        ret = list(pattern)
        for e in product:
            ret[ret.index(None)] = e
        return ret

    CartesianProduct = tuple(itertools.product(*v_pattern))  # http://stackoverflow.com/a/170248
    return [':'.join(merge_pattern_and_product(pattern, product)) for product in CartesianProduct]
#===============================================================================
# T E S T
#===============================================================================
for tag in slash_split_tag('drink:coffee/tea:hot/cold'):
    print tag
print

for tag in slash_split_tag('A1/A2:B1/B2/B3:C1/C2:D1/D2/D3/D4/EE'):
    print tag
Question: How can I reverse this process? I need this for readability reasons.
Here's a simple, first-pass attempt at such a function:
def compress_list(alist):
    """Compress a list of colon-separated strings into a more compact
    representation.
    """
    components = [ss.split(':') for ss in alist]
    # Check that every string in the supplied list has the same number of tags
    tag_counts = [len(cc) for cc in components]
    if len(set(tag_counts)) != 1:
        raise ValueError("Not all of the strings have the same number of tags")
    # For each component, gather a list of all the applicable tags. The set
    # at index k of tag_possibilities is all the possibilities for the
    # kth tag
    tag_possibilities = list()
    for tag_idx in range(tag_counts[0]):
        tag_possibilities.append(set(cc[tag_idx] for cc in components))
    # Now take the list of tags, and turn them into slash-separated strings
    tag_possibilities_strs = ['/'.join(tt) for tt in tag_possibilities]
    # Finally, stitch this together with colons
    return ':'.join(tag_possibilities_strs)
Hopefully the comments are sufficient in explaining how it works. A few caveats, however:
It doesn't do anything sensible such as escaping slashes if it finds them inside the tags.
This doesn't recognise if there's a more subtle division going on, or if it gets an incomplete list of tags. Consider this example:
fish:cheese:red
chips:cheese:red
fish:chalk:red
It won't realise that only cheese goes with both fish and chips, and will instead collapse this to fish/chips:cheese/chalk:red, which expands back to more combinations than were supplied.
The order of the tags in the finished string is arbitrary (the possibilities are collected in a set, so it has nothing to do with the order of the strings in the given list). You could sort tt before you join it with slashes if that's important, as shown below.
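That would be a one-line change inside compress_list, assuming plain alphabetical order is acceptable:
tag_possibilities_strs = ['/'.join(sorted(tt)) for tt in tag_possibilities]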
Testing with the three lists given in the question seems to work, although as I said, the order may be different to the initial strings:
food:fruit:apple:sweet/sour
drink:tea/coffee:hot/cold
wall/bike:painted:blue/red
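As a quick sanity check (assuming both slash_split_tag from the question and compress_list above are in scope), a round trip should give back an equivalent combined tag, give or take sub-tag order:
expanded = slash_split_tag('drink:coffee/tea:hot/cold')
print(expanded)                 # ['drink:coffee:hot', 'drink:coffee:cold', 'drink:tea:hot', 'drink:tea:cold']
print(compress_list(expanded))  # e.g. 'drink:tea/coffee:hot/cold' (sub-tag order may differ)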
I'm looking for help with a tricky QRegExp that I'd like to pass to my QSortFilterProxyModel.setFilterRegExp. I've been struggling to find a solution that handles my use case.
From the sample code below, I need to capture items with exactly two underscores (_), but ONLY if they contain george or brian. I do not want items that have more or fewer than two underscores.
string_list = [
    'john', 'paul', 'george', 'ringo', 'brian', 'carl', 'al', 'mike',
    'john_paul', 'paul_george', 'john_ringo', 'george_ringo',
    'john_paul_george', 'john_paul_brian', 'john_paul_ringo',
    'john_paul_carl', 'paul_mike_brian', 'john_george_brian',
    'george_ringo_brian', 'paul_george_ringo', 'john_george_ringo',
    'john_paul_george_ringo', 'john_paul_george_ringo_brian', 'john_paul_george_ringo_brian_carl',
]
view = QListView()
model = QStringListModel(string_list)
proxy_model = QSortFilterProxyModel()
proxy_model.setSourceModel(model)
view.setModel(proxy_model)
view.show()
The first part (matching two underscores) can be accomplished with the following line (simplified here; really each token can be composed of any alphanumeric character, so [a-zA-Z0-9]*):
proxy_model.setFilterRegExp('^[a-z]*_[a-z]*_[a-z]*$')
The second part can be accomplished (independently) with:
proxy_model.setFilterRegExp('george|brian')
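For what it's worth, both conditions can be prototyped together against string_list with Python's plain re module before worrying about QRegExp's more limited dialect; this is only a sketch of the base requirement, not a Qt solution:
import re

# Exactly two underscores (i.e. three alphanumeric tokens), and at least one
# token starting with george or brian.
pattern = re.compile(r'^(?=(?:[a-zA-Z0-9]+_){2}[a-zA-Z0-9]+$)(?=(?:.*_)?(?:george|brian)).*$')
print([s for s in string_list if pattern.match(s)])
# e.g. ['john_paul_george', 'john_paul_brian', 'paul_mike_brian', ...]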
To complicate matters, these additional criteria apply:
This list may grow to the realm of several thousand items,
The tokenization may reach up to 10 or so tokens
The tokenization can be in any order (so george could occur at the beginning, middle, end)
We may also want to capture items like georgeH and brianW35 when they occur, so long as the token begins with george or brian.
We may have N names we're searching for (i.e. george|brian|jim|al), but only when they're in strings with two underscores.
To simplify them:
Lines will never begin or end with "_", and should only ever begin/end with [a-zA-Z0-9]
Do the QRegExp and QSortFilterProxyModel even have the capabilities I'm looking for, or will I need to resort to some other approach?
For very complex conditions, a regex is not very useful; in that case it is better to override the filterAcceptsRow method, where you can implement the filter function as shown in the following trivial example:
class FilterProxyModel(QSortFilterProxyModel):
    _words = None
    _number_of_underscore = -1

    def filterAcceptsRow(self, source_row, source_parent):
        text = self.sourceModel().index(source_row, 0, source_parent).data()
        if not self._words or self._number_of_underscore < 0:
            return True
        return (
            any([word in text for word in self._words])
            and text.count("_") == self._number_of_underscore
        )

    @property
    def words(self):
        return self._words

    @words.setter
    def words(self, words):
        self._words = words
        self.invalidateFilter()

    @property
    def number_of_underscore(self):
        return self._number_of_underscore

    @number_of_underscore.setter
    def number_of_underscore(self, number):
        self._number_of_underscore = number
        self.invalidateFilter()
view = QListView()
model = QStringListModel(string_list)
proxy_model = FilterProxyModel()
proxy_model.setSourceModel(model)
view.setModel(proxy_model)
view.show()
proxy_model.number_of_underscore = 2
proxy_model.words = (
    "george",
    "brian",
)
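The filterAcceptsRow above matches a word anywhere in the string. If the additional requirement that a token merely begins with george or brian (georgeH, brianW35) matters, a subclass along these lines could work; this is only a sketch, and PrefixFilterProxyModel is a name introduced here for illustration:
class PrefixFilterProxyModel(FilterProxyModel):
    def filterAcceptsRow(self, source_row, source_parent):
        text = self.sourceModel().index(source_row, 0, source_parent).data()
        if not self._words or self._number_of_underscore < 0:
            return True
        tokens = text.split("_")
        # Exactly N underscores means N + 1 tokens, and at least one token
        # must start with one of the requested names.
        return (
            len(tokens) == self._number_of_underscore + 1
            and any(tok.startswith(word) for tok in tokens for word in self._words)
        )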
I'm having trouble converting my working code from lists to dictionaries. The basic idea of the code is to check a file name for any keywords within the list.
But I'm having a tough time understanding dictionaries well enough to convert it. I am trying to pull the name of each key and compare it to the file name, like I did with lists and tuples. Here is a mock version of what I was doing.
fname = "../crazyfdsfd/fds/ss/rabbit.txt"
hollow = "SFV"
blank = "2008"
empty = "bender"
# things is list
things = ["sheep", "goat", "rabbit"]
# other is tuple
other = ("sheep", "goat", "rabbit")
#stuff is dictionary
stuff = {"sheep": 2, "goat": 5, "rabbit": 6}
try:
    print(type(things), "things")
    for i in things:
        if i in fname:
            hollow = str(i)
            print(hollow)
            if hollow == things[2]:
                print("PERFECT")
except:
    print("c-c-c-combo breaker")

print("\n \n")

try:
    print(type(other), "other")
    for i in other:
        if i in fname:
            blank = str(i)
            print(blank)
            if blank == other[2]:
                print("Yes. You. Can.")
except:
    print("THANKS OBAMA")

print("\n \n")

try:
    print(type(stuff), "stuff")
    for i in stuff:  # problem loop
        if i in fname:
            empty = str(i)
            print(empty)
            if empty == stuff[2]:  # problem line
                print("Shut up and take my money!")
except:
    print("CURSE YOU ZOIDBERG!")
I am able to get a full run through the first two examples, but I cannot get the dictionary to run without hitting its exception. The loop is not converting empty into stuff[2]'s value, leaving the money regrettably in Fry's pocket. Let me know if my example isn't clear enough for what I am asking. The dictionary is just a shortcut for counting lists and adding files to other variables.
A dictionary is an unordered collection that maps keys to values. If you define stuff to be:
stuff = {"sheep": 2, "goat": 5, "rabbit": 6}
You can refer to its elements with:
stuff['sheep'], stuff['goat'], stuff['rabbit']
stuff[2] will result in a KeyError, because the key 2 is not found in your dictionary. You can't compare a string with the "last" or "3rd" value of a dictionary, because dictionary lookups go by key, not by position (and in older Python versions the internal ordering was based on hashing anyway). Use a list or tuple if you need an ordered sequence to compare against the last item.
If you want to traverse a dictionary, you can use this as a template:
for k, v in stuff.items():
    if k == 'rabbit':
        # do something - k will be 'rabbit' and v will be 6
        print(k, v)
If you want to check the keys in a dictionary to see if they match part of a string:
for k in stuff.keys():
    if k in fname:
        print('found', k)
Some other notes:
The KeyError would be much easier to notice... if you took out your try/except blocks. Hiding Python errors from end users can be useful. Hiding that information from YOU is a bad idea - especially when you're debugging an initial pass at code.
You can compare to the last item in a list or tuple with:
if hollow == things[-1]:
if that is what you're trying to do.
In your last loop, make sure it's empty = str(i) (assignment), not empty == str(i) (comparison).
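Putting those pieces together, one possible rewrite of the problem loop from the question might look like this (a sketch; the 'rabbit' comparison stands in for whatever check you actually need):
for k in stuff:  # iterating a dict yields its keys
    if k in fname:
        empty = str(k)
        print(empty)
        if empty == "rabbit":  # compare against the key itself, not stuff[2]
            print("Shut up and take my money!")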
I am trying to populate a list in Python 3 with 3 random items read from a file using a regex; however, I keep getting duplicate items in the list.
Here is an example.
import re
import random as rn

data = '/root/Desktop/Selenium[FILTERED].log'

with open(data, 'r') as inFile:
    index = inFile.read()

URLS = re.findall(r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', index)

list_0 = []
for i in range(3):
    list_0.append(URLS[rn.randint(1, 30)])

inFile.close()

for i in range(len(list_0)):
    print(list_0[i])
What would be the cleanest way to prevent duplicate items being appended to the list?
(EDIT)
This is the code that I think has done the job quite well.
def random_sample(data):
    r_e = ['https://www\.\w{1,10}\.com/view\?i=\w{1,20}', '..']
    with open(data, 'r') as inFile:
        urls = re.findall(r'%s' % r_e[0], inFile.read())
    x = list(set(urls))
    inFile.close()
    return x

data = '/root/Desktop/[TEMP].log'
sample = random_sample(data)

for i in range(3):
    print(sample[i])
A set is an unordered collection with no duplicate entries.
Use the builtin random.sample.
random.sample(population, k)
Return a k length list of unique elements chosen from the population sequence or set.
Used for random sampling without replacement.
Addendum
After seeing your edit, it looks like you've made things much harder than they have to be. I've wired a list of URLS in the following, but the source doesn't matter. Selecting the (guaranteed unique) subset is essentially a one-liner with random.sample:
import random
# the following two lines are easily replaced
URLS = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6', 'url7', 'url8']
SUBSET_SIZE = 3
# the following one-liner yields the randomized subset as a list
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
print(urlList) # produces, e.g., => ['url7', 'url3', 'url4']
Note that by using len(URLS) and SUBSET_SIZE, the one-liner that does the work is not hardwired to the size of the set nor the desired subset size.
Addendum 2
If the original list of inputs contains duplicate values, the following slight modification will fix things for you:
URLS = list(set(URLS)) # this converts to a set for uniqueness, then back for indexing
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
Or even better, because it avoids the second conversion (note that passing a set directly to random.sample only works on older Python versions; since 3.11 it requires a sequence, so the first form is the more portable one):
URLS = set(URLS)
urlList = [u for u in random.sample(URLS, SUBSET_SIZE)]
seen = set(list_0)
randValue = URLS[rn.randint(1, 30)]
# [...]
if randValue not in seen:
    seen.add(randValue)
    list_0.append(randValue)
Now you just need to check that the size of list_0 has reached 3 to stop the loop.
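Putting that together, a sketch of the full loop (it indexes over the whole list rather than a hard-coded 1..30 range, and assumes URLS holds at least 3 distinct entries, otherwise it would never finish):
seen = set()
list_0 = []
while len(list_0) < 3:
    randValue = URLS[rn.randint(0, len(URLS) - 1)]
    if randValue not in seen:
        seen.add(randValue)
        list_0.append(randValue)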
Having the following HTML code:
<span class="warning" id ="warning">WARNING:</span>
For an object accessible by XPath:
.//*[@id='unlink']/table/tbody/tr[1]/td/span
How can one count its attributes (class, id) by means of Selenium WebDriver + Python 2.7, without actually knowing their names?
I'm expecting something like count = 2.
Got it! This should work for div, span, img, p and many other basic elements.
element = driver.find_element_by_xpath(xpath)   # Locate the element.
outerHTML = element.get_attribute("outerHTML")  # Get its HTML
innerHTML = element.get_attribute("innerHTML")  # See where its inner content starts

if len(innerHTML) > 0:  # Let's make this work for input as well
    innerHTML = innerHTML.strip()        # Strip whitespace around inner content
    toTrim = outerHTML.index(innerHTML)  # Get the index of the first part, before the inner content
    # In case of most elements, this is what we care about
    rightString = outerHTML[:toTrim]
else:
    # We seem to have something like <input class="bla" name="blabla"> which is good
    rightString = outerHTML

# I.e.: <span class="something" id="somethingelse">
strippedString = rightString.strip()                   # Remove whitespace, if any
rightTrimmedString = strippedString.rstrip('<>')
leftTrimmedString = rightTrimmedString.lstrip('</>')   # Remove the <, >, / chars.
rawAttributeArray = leftTrimmedString.split(' ')       # Create an array of:
                                                       # [span, id="something", class="somethingelse"]
curatedAttributeArray = []  # This is where we put the good values
iterations = len(rawAttributeArray)
for x in range(iterations):
    if "=" in rawAttributeArray[x]:  # We want the attribute="..." pairs
        curatedAttributeArray.append(rawAttributeArray[x])  # and add them to a list
numberOfAttributes = len(curatedAttributeArray)  # Let's see what we got
print numberOfAttributes  # There we go
I hope this helps.
Thanks,
R.
P.S. This could be further enhanced, like stripping whitespace together with <, > or /.
It's not going to be easy.
Every element has a series of implicit attributes as well as the ones explicitly defined (for example selected, disabled, etc.). As a result, the only way I can think of to do it would be to get a reference to the parent and then use a JavaScript executor to get the innerHTML:
document.getElementById('{ID of element}').innerHTML
You would then have to parse what is returned by innerHTML to extract out individual elements and then once you have isolated the element that you are interested in you would again have to parse that element to extract out a list of attributes.
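If running JavaScript is acceptable anyway, a shorter route is to ask the DOM for the element's own attributes collection rather than parsing HTML by hand; a sketch of that different approach:
element = driver.find_element_by_xpath(".//*[@id='unlink']/table/tbody/tr[1]/td/span")
count = driver.execute_script("return arguments[0].attributes.length;", element)
print count  # e.g. 2 for <span class="warning" id="warning">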
The truncatewords filter inserts a space before the ellipsis. As in,
'A fine holiday recipe book of ...'
vs. the desired
'A fine holiday recipe book of...'
Is there an easy way to get this filter to not put a space there? I could take care of this in the view pretty easily, but would prefer to do it in the template - ideally without creating a custom filter. Any suggestions are welcome.
There are a bunch of template filters at Djangosnippets, and this one looks pretty neat:
# From http://djangosnippets.org/snippets/1259/
from django import template

register = template.Library()

@register.filter
def truncatesmart(value, limit=80):
    """
    Truncates a string after a given number of chars keeping whole words.

    Usage:
        {{ string|truncatesmart }}
        {{ string|truncatesmart:50 }}
    """
    try:
        limit = int(limit)
    # invalid literal for int()
    except ValueError:
        # Fail silently.
        return value

    # Make sure it's unicode
    value = unicode(value)

    # Return the string itself if length is smaller or equal to the limit
    if len(value) <= limit:
        return value

    # Cut the string
    value = value[:limit]

    # Break into words and remove the last
    words = value.split(' ')[:-1]

    # Join the words and return
    return ' '.join(words) + '...'
This will also work:
{{ value|truncatewords:3|slice:"-4" }}...
Basically, just slice off the last 4 characters (the space plus the ellipsis), and then add the ellipsis back without the space!
The neat thing is, with this method you can also end your, uh, truncation with anything you'd like.
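For example (story.summary is just a placeholder variable here), ending with an HTML ellipsis entity instead of three dots:
{{ story.summary|truncatewords:3|slice:"-4" }}&hellip;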