I am trying to get the number of reviews for a particular product.
The code is:
total_reviews = soup.find("div", {"class": "feature"}).findNext(
    "span", {"id": "acrCustomerReviewText"}).string
x = ''
for number in total_reviews:
    if number == ' ':
        break
    else:
        x = x + number
num_reviews = int(x)
Replace the comma with an empty string using str.replace():
num_reviews = int(x.replace(",", ""))
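A minimal sketch of the same idea without the character loop, assuming the span text looks like "1,234 customer reviews" (the sample strings and the helper name parse_review_count are illustrative, not taken from the page being scraped):

```python
def parse_review_count(text):
    """Return the leading integer of a string such as '1,234 customer reviews'."""
    # Take everything before the first whitespace, then drop thousands separators.
    first_token = text.strip().split()[0]
    return int(first_token.replace(",", ""))


print(parse_review_count("1,234 customer reviews"))
```

Splitting on whitespace replaces the manual loop; the replace() call is still what makes int() accept the value.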
I am trying to check for a fuzzy match between a string column and a reference list. The string series contains over 1 million rows and the reference list contains over 10,000 entries.
For example:
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows
###Output should look like
df['MATCH'] = pd.Series([Nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', Nan, 'PARIS', Nan])
It should generate a match if the word appears as a separate word in the string (and, within that, up to 1 character substitution is allowed).
For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.
Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' (MANDER being 1 character different from XANDER), but does not generate a match against 'ALEXANDERS'.
As of now, we have written the logic for this (shown below), but the match takes over 4 hours to run. We need to get this to under 30 minutes.
Current code:
tags_regex = ref_df['REF_NAMES'].tolist()
tags_ptn_regex = '|'.join([rf'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])
def search_it(partyname):
    m = regex.search("(" + tags_ptn_regex + ")" + "{s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None
df['MATCH'] = df['NAMES'].apply(search_it)
Also, will multiprocessing help with regex? Many thanks in advance!
Your pattern is rather inefficient, as you repeat the tag pattern three times in the regex. You just need to create a pattern with the so-called whitespace boundaries, (?<!\S) and (?!\S), and then you will only need one tag pattern.
Next, if you have several thousand alternatives, even the single tag pattern regex will be extremely slow, because some alternatives can match at the same location in the string, and thus there will be too much backtracking.
To reduce this backtracking, you will need a regex trie.
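The whitespace boundaries can be tried in isolation first. The sketch below uses the standard re module and a single hard-coded tag, without fuzzy matching, only to show what (?<!\S) and (?!\S) accept and reject; the sample strings are illustrative:

```python
import re

# (?<!\S): not preceded by a non-whitespace character.
# (?!\S):  not followed by a non-whitespace character.
pattern = re.compile(r"(?<!\S)PARIS(?!\S)")

samples = ["PARIS HILTON", "APARISIAN", "PARIS", "IN PARIS NOW"]
hits = [s for s in samples if pattern.search(s)]
print(hits)
```

One pair of lookarounds covers the start-of-string, end-of-string and between-spaces cases that the original pattern spelled out as three alternatives.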
Here is a working snippet:
import regex
import pandas as pd
## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}
    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1
    def dump(self):
        return self.data
    def quote(self, char):
        return regex.escape(char)
    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result
    def pattern(self):
        return self._pattern(self.dump())
## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k row
trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)
tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)
def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None
df['MATCH'] = df['NAMES'].apply(search_it)
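For spot-checking the regex output on a small sample, the matching rule from the question (a whole whitespace-separated word, at most one character substituted) can also be written in plain Python. This is only a sketch: find_match and within_one_substitution are hypothetical helper names, it ignores the [A-Z] restriction on the substituted character, and it compares every token with every reference, so it is far too slow for 1 million rows against 10,000 references.

```python
def within_one_substitution(token, ref):
    """True if token has the same length as ref and differs in at most one position."""
    if len(token) != len(ref):
        return False
    return sum(a != b for a, b in zip(token, ref)) <= 1


def find_match(name, refs):
    """Return the first whitespace-separated token that matches a reference, else None."""
    for token in name.split():
        if any(within_one_substitution(token, ref) for ref in refs):
            return token
    return None


names = ['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON',
         'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']
refs = ['XANDER', 'PARIS']
matches = [find_match(n, refs) for n in names]
print(matches)
```

Comparing this list with df['MATCH'] on a few hundred rows is a cheap way to confirm that the compiled trie pattern behaves as intended before running it on the full data.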
The code below gives me a ValueError when the input is left blank:
U = find_lastuid() # Return variable from other function
uidnum = int(raw_input("What is uid number? (default is: %s)" % U))
if not uidnum:
    print("default uid is used: %s" % uidnum)
else:
    print("UID is %s" % uidnum)
uidnum = int(raw_input("What is uid number? (default is: %s)" % U))
ValueError: invalid literal for int() with base 10: ''
Can someone tell me what is wrong with this code?
I can see that similar code works in the REPL:
>>> id = 234
>>> a = raw_input("Enter id")
Enter id
>>> if not id:
...     print("id is blank")
... else:
...     print(id)
...
234
This is because you cannot convert an empty string to an int. What you can do, however, is use a try/except block to handle this:
try:
    uidnum = int(raw_input("Enter uid:"))
except ValueError:
    print("Please enter a number!")
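If a blank answer is supposed to mean "use the default", the check can happen before int() is called. A minimal sketch, where parse_uid is a hypothetical helper and the default would be the value returned by find_lastuid() in the question:

```python
def parse_uid(raw, default):
    """Return `default` for blank input, otherwise the integer value of `raw`."""
    raw = raw.strip()
    if not raw:
        # Blank input: fall back to the default before int() is ever called.
        return default
    return int(raw)  # still raises ValueError for non-numeric text


print(parse_uid("", 1001))
print(parse_uid("42", 1001))
```

In the question's code this would be used as uidnum = parse_uid(raw_input("What is uid number? (default is: %s)" % U), U), optionally wrapped in the try/except shown above to catch non-numeric input.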
Here is my function. I need to lowercase whatever ends up in newfilename, i.e. newfilename.lower().
def my_function(start, end):
    sheetname = 'my-sheet'
    filepath = "/myxl.xlsx"
    try:
        work_book = xlrd.open_workbook(filepath)
    except:
        print 'error'
    try:
        worksheet = work_book.sheet_by_name(sheetname)
    except:
        print 'error'
    rows = worksheet.nrows
    cols = worksheet.ncols
    success = []
    fail = []
    for row in xrange(start, end):
        print "row no. : ", row
        state = '/home/myfolder/'
        if os.path.exists(state):
            print "state folder exists"
        else:
            os.makedirs(state)
        district = state + worksheet.cell_value(row, 0) + '/'
        if os.path.exists(district):
            print "district folder exists"
        else:
            os.makedirs(district)
        city = district + worksheet.cell_value(row, 2) + '/'
        if os.path.exists(city):
            print "city folder exists"
        else:
            os.makedirs(city)
        newfilename = city + worksheet.cell_value(row, 4).replace(" ", "-") + '.png'
        if worksheet.cell_value(row, 5) != "":
            oldfilename = worksheet.cell_value(row, 5)
        else:
            oldfilename = "no-image"
        newfullpath = newfilename
        oldfullpath = '/home/old/folder/' + oldfilename
        try:
            os.rename(oldfullpath, newfullpath)
            success.append(row)
        except Exception as e:
            fail.append(row)
            print "Error", e
        print 'renaming done for row #', row, ' file ', oldfilename, ' to ', newfilename
    print 'SUCCESS ', success
    print 'FAIL ', fail
Calling newfilename.lower() is not working.
When I try to use unicode, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 72: ordinal not in range(128)
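Two things are worth checking here, sketched below with a made-up file name (the value is hypothetical; only the u'\u2019' character comes from the error message). First, lower() returns a new string and leaves the original untouched, so the result has to be assigned. Second, u'\u2019' cannot be represented in ASCII, so any step that encodes the name as ASCII will fail, whereas an explicit UTF-8 encode succeeds. Whether that is the exact cause in the function above is an assumption.

```python
name = u"City/Joe\u2019s Cafe.PNG"  # hypothetical value with a curly apostrophe

name.lower()            # result is discarded; `name` itself is unchanged
lowered = name.lower()  # keep the returned string instead

# Encoding to ASCII fails on u'\u2019'; UTF-8 can represent it.
try:
    lowered.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = lowered.encode("utf-8")
print(ascii_ok)
```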
I am working with the textwrap module and need the output to be formatted like this:
[picture of expected output formatting]
However, my output looks like this:
[picture of my output]
This is the code I am using for the text wrap:
wrap = textwrap.TextWrapper(initial_indent = ' '*4, subsequent_indent = ' '*4)
if name not in bus_name:
    print 'This business is not found'
else:
    count = 0
    for id in bus_name[name]:
        if id not in bus_review:
            print 'No reviews for this business are found'
        else:
            for rev in bus_review[id]:
                print 'Review %d' %(count+1)
                string = textwrap.fill(rev)
                string = string.replace(' ', '$')
                string = wrap.fill(string)
                string = string.replace('$', '\n'+' '*4)
                print '%s' %string
                print
                count += 1
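Since the expected output is only available as a picture, the following is a guess at the intent: every line of a review indented by four spaces, with paragraph breaks inside a review preserved. Under that assumption, wrapping each paragraph separately avoids the replace-with-'$' round trip. The helper name indent_wrap and the sample review are illustrative.

```python
import textwrap


def indent_wrap(text, indent="    ", width=70):
    """Wrap each blank-line-separated paragraph, indenting every output line."""
    wrapper = textwrap.TextWrapper(width=width,
                                   initial_indent=indent,
                                   subsequent_indent=indent)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Wrap paragraphs one at a time so paragraph breaks survive the re-flow.
    return "\n\n".join(wrapper.fill(p) for p in paragraphs)


review = "Great food and friendly staff. " * 5 + "\n\n" + "Would visit again."
wrapped = indent_wrap(review)
print(wrapped)
```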
I'm new to python (2.7.3) and I am experimenting with lists. Say I have a list that is defined as:
my_list = ['name1', 'name2', 'name3']
I can print it with:
print 'the names in your list are: ' + ', '.join(my_list) + '.'
Which would print:
the names in your list are: name1, name2, name3.
How do I print:
the names in your list are: name1, name2 and name3.
Thank you.
Update:
I am trying logic suggested below but the following is throwing errors:
my_list = ['name1', 'name2', 'name3']
if len(my_list) > 1:
    # keep the last value as is
    my_list[-1] = my_list[-1]
    # change the second last value to be appended with 'and '
    my_list[-2] = my_list[-2] + 'and '
    # make all values until the second last value (exclusive) be appended with a comma
    my_list[0:-3] = my_list[0:-3] + ', '
print 'The names in your list are:' .join(my_list) + '.'
Try this:
my_list = ['name1', 'name2', 'name3']
print 'The names in your list are: %s, %s and %s.' % (my_list[0], my_list[1], my_list[2])
The result is:
The names in your list are: name1, name2 and name3.
The %s is a string-formatting placeholder.
If the length of my_list was unknown:
my_list = ['name1', 'name2', 'name3']
if len(my_list) > 1: # If it was one, then the print statement would come out odd
    my_list[-1] = 'and ' + my_list[-1]
print 'The names in your list are:', ', '.join(my_list[:-1]), my_list[-1] + '.'
My two cents:
def comma_and(a_list):
    return ' and '.join([', '.join(a_list[:-1]), a_list[-1]] if len(a_list) > 1 else a_list)
Seems to work in all cases:
>>> comma_and(["11", "22", "33", "44"])
'11, 22, 33 and 44'
>>> comma_and(["11", "22", "33"])
'11, 22 and 33'
>>> comma_and(["11", "22"])
'11 and 22'
>>> comma_and(["11"])
'11'
>>> comma_and([])
''