I'm taking user input into a Textarea widget, then looping over it by line and trying to split the three "words" (first name, last name, email) from each line into a list, which I'll deal with later. When I use split() on a line, though, it always splits into characters, which I assume is down to the CharField definition of the field, meaning it's not a plain string and the split() method won't behave as I want it to. Edit: even the for construct is failing - it's iterating over each character instead of each line.
What's the workaround for that?
class UserImportForm(forms.Form):
importtext = forms.CharField(required=True,widget=forms.Textarea(attrs={'cols': 40, 'rows': 15}))
elif "UserImport" in request.POST:
g = UserImportForm(request.POST, prefix='usrimp')
rawtext = g['importtext'].value()
if g.is_valid():
newusers = []
for lines in rawtext:
row = lines.split(" ")
if len(row) == 3 and validate_email(row[2]):
newusers.append(row)
While this is likely not the best way to do it, here's what I ended up doing. Still welcoming better answers!
elif "UserImport" in request.POST:
g = UserImportForm(request.POST, prefix='usrimp')
if g.is_valid():
rawtext = g.cleaned_data['importtext'].encode('utf8')
rawtext = "".join(rawtext)
rawtext = rawtext.split("\n")
newusers = []
for lines in rawtext:
row = lines.split()
if len(row) == 3:
try:
validate_email(row[2])
newusers.append([row[0],row[1],row[2],"processmore"])
except:
newusers.append([row[0],row[1],row[2],"Invalid email address"])
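For the record, a somewhat tidier sketch of the same idea (untested, and assuming validate_email here is django.core.validators.validate_email and that ValidationError is imported from django.core.exceptions):

# from django.core.exceptions import ValidationError  # assumed import

elif "UserImport" in request.POST:
    g = UserImportForm(request.POST, prefix='usrimp')
    if g.is_valid():
        newusers = []
        # cleaned_data already gives a plain string, so just split it into lines
        for line in g.cleaned_data['importtext'].splitlines():
            row = line.split()
            if len(row) != 3:
                continue
            try:
                validate_email(row[2])              # raises ValidationError on a bad address
                newusers.append(row + ["processmore"])
            except ValidationError:
                newusers.append(row + ["Invalid email address"])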
I am trying to check for a fuzzy match between a string column and a reference list. The string series contains over 1 million rows and the reference list contains over 10k entries.
For example:
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows
### Output should look like
df['MATCH'] = pd.Series([np.nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', np.nan, 'PARIS', np.nan])
It should generate a match if the word appears as a separate word in the string (and, within that, up to one character substitution is allowed).
For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.
Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' (MANDER being one character different from XANDER), but does not generate a match against 'ALEXANDERS'.
As of now, we have written the logic for this (shown below), but the match takes over 4 hours to run. I need to get it down to under 30 minutes.
Current code -
import regex

tags_regex = ref_df['REF_NAMES'].tolist()
tags_ptn_regex = '|'.join([rf'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])

def search_it(partyname):
    m = regex.search("(" + tags_ptn_regex + ")" + "{s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)
Also, will multiprocessing help with regex? Many thanks in advance!
Your pattern is rather inefficient, as you repeat the tag pattern three times in the regex. You just need to create a pattern with so-called whitespace boundaries, (?<!\S) and (?!\S), and then you will only need one tag pattern.
Next, if you have several thousand alternatives, even the single-tag-pattern regex will be extremely slow, because alternatives can start matching at the same location in the string, and thus there will be too much backtracking.
To reduce this backtracking, you will need a regex trie.
Here is a working snippet:
import regex
import pandas as pd

## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return regex.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows

trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)

tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)

def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)
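As a quick sanity check (my own note, not part of the original answer), with the two sample reference names the trie collapses into a plain alternation, so the generated pattern should look roughly like this:

>>> trie.pattern()
'(?:PARIS|XANDER)'
>>> tags_ptn_regex.pattern
'(?:(?<!\\S)(?:(?:PARIS|XANDER))(?!\\S)){s<=1:[A-Z]}'

The gain only becomes visible with the full 10k reference names, where shared prefixes are merged into a single branch instead of thousands of independent alternatives, which is what keeps the backtracking down. Multiprocessing could still be layered on top by splitting the Series into chunks, but the trie rewrite is typically where most of the speedup comes from.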
First of all, I am sorry about the weird question heading. Couldn't express it in one line.
So, the problem statement is,
If I am given the following string --
"('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
I have to parse it as
list1 = ["'James Gosling'", 'jamesgosling', 'jame gosling']
list2 = ["'SUN Microsystem'", 'sunmicrosystem']
list3 = [list1, list2, "keyword"]
So that, if I enter "James Gosling Sun Microsystem keyword", it should tell me that what I have entered is 100% correct.
And if I enter "J Gosling Sun Microsystem keyword", it should say I am only 66.66% correct.
This is what I have tried so far.
import re


def main():
    print("starting")
    sentence = "('James Gosling'/jamesgosling/jame gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
    splited = sentence.split(",")
    number_of_primary_keywords = len(splited)
    #print(number_of_primary_keywords, "primary keywords length")
    number_of_brackets = 0
    inside_quotes = ''
    inside_quotes_1 = ''
    inside_brackets = ''
    for n in range(len(splited)):
        #print(len(re.findall('\w+', splited[n])), "length of splitted")
        inside_brackets = splited[n][splited[n].find("(") + 1: splited[n].find(")")]
        synonyms = inside_brackets.split("/")
        for x in range(len(synonyms)):
            try:
                inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
                print(inside_quotes_1)
            except:
                pass
            try:
                inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
                print(inside_quotes)
            except:
                pass
            #print(synonyms[x])
        number_of_brackets += 1
    print(number_of_brackets)


if __name__ == '__main__':
    main()
Output is as follows
'James Gosling
jamesgoslin
jame goslin
'SUN Microsystem
SUN Microsystem
sunmicrosyste
sunmicrosyste
3
As you can see, the last letters of some words are missing.
So, if you read this far, I hope you can help me in getting the expected output
Unfortunately, your code has a logic issue that I could not quite pin down; it is most likely in these lines:
inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
which, by the way, you could also write as:
inside_quotes_1 = synonyms[x][synonyms[x].find("\x22") + 1: synonyms[n].find("\x22")]
inside_quotes = synonyms[x][synonyms[x].find("\x27") + 1: synonyms[n].find("\x27")]
Other than that, you seem to want to extract the words together with their indices, which you can do with a basic expression:
(\w+)
Then you might want to find a simple way to locate the indices where the words are, and associate each word with the desired indices.
Example Test
# -*- coding: UTF-8 -*-
import re

string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
expression = r'(\w+)'

match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
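Going one step further toward what the question actually asks for, here is a minimal sketch of my own (not part of the original answer) that pulls the parenthesized groups apart with re.findall and then scores an input string against the synonym groups:

import re

sentence = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"

# Each "(...)" becomes a list of its "/"-separated synonyms; anything left over
# (here just "keyword") is kept as a single-element group.
groups = [grp.split("/") for grp in re.findall(r"\(([^)]*)\)", sentence)]
leftovers = [part.strip() for part in re.sub(r"\([^)]*\)", "", sentence).split(",") if part.strip()]
groups += [[word] for word in leftovers]

def score(user_input, groups):
    """Percentage of groups for which at least one synonym appears in the input."""
    text = user_input.lower()
    hits = sum(any(syn.strip("'").lower() in text for syn in grp) for grp in groups)
    return 100.0 * hits / len(groups)

print(score("James Gosling Sun Microsystem keyword", groups))  # 100.0
print(score("J Gosling Sun Microsystem keyword", groups))      # ~66.67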
I am writing a program that produces a frequency plot of the letters in a body of text. However, there is an error in my code that I cannot spot. Any ideas?
import string

def letter_count(word,freqs,pmarks):
    for char in word:
        freqs[char]+=1

def letter_freq(fname):
    fhand = open(fname)
    freqs = dict()
    alpha = list(string.uppercase[:26])
    for let in alpha: freqs[let] = freqs.get(let,0)
    for line in fhand:
        line = line.rstrip()
        words = line.split()
        pmarks = list(string.punctuation)
        words = [word.upper() for word in words]
        for word in words:
            letter_count(word,freqs,pmarks)
    fhand.close()
    return freqs.values
You are calling
freqs[char]+=1
with char = '.' without having initialized a value freqs['.']=0
You should check, before line 3, whether the key already exists, since you can only do the += 1 operation on keys that are already present in the dictionary.
So something like:
for char in word:
    if freqs.has_key(char):
        freqs[char]+=1
Python: how can I check if the key of an dictionary exists?
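As an aside (my own note, not from the original answer), you can avoid the explicit key check altogether with dict.get, or count with collections.Counter and keep only the alphabetic characters:

# Option 1: dict.get supplies a default, so no pre-initialization is needed
for char in word:
    freqs[char] = freqs.get(char, 0) + 1

# Option 2: let Counter do the counting, keeping only letters
from collections import Counter
freqs = Counter()
for word in words:                    # `words` as built inside letter_freq()
    freqs.update(c for c in word if c.isalpha())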
I am taking a text file as input and creating a function that counts which word occurs most frequently. If two or more words are tied for most frequent, I will print all of those words.
def wordOccurance(userFile):
    userFile.seek(0)
    line = userFile.readline()
    lines = []
    while line != "":
        if line != "\n":
            line = line.lower() # making lower case
            line = line.rstrip("\n") # cleaning
            line = line.rstrip("?") #cleans the whole docoument by removing "?"
            line = line.rstrip("!") #cleans the whole docoument by removing "!"
            line = line.rstrip(".") #cleans the whole docoument by removing "."
            line = line.split(" ") #splits the texts into space
            lines.append(line)
        line = userFile.readline() # keep reading lines from document.

    words = lines
    wordDict = {} #creates the clean word Dic, from above
    for word in words: #
        if word in wordDict.keys():
            wordDict[word] = wordDict[word] + 1
        else:
            wordDict[word] = 1

    largest_value = max(wordDict.values())
    for k in wordDict.keys():
        if wordDict[k] == largest_value:
            print(k)

    return wordDict
Please help me with this function.
In this line you are creating a list of strings:
line = line.split(" ") #splits the texts into space
Then you append it to a list, so you have a list of lists:
lines.append(line)
Later you loop through that list of lists, and try to use a sublist as a key:
for word in words: #
    if word in wordDict.keys():
        wordDict[word] = wordDict[word] + 1
    else:
        wordDict[word] = 1 # Here you will try to assign a list (`word`) as a key, which is not allowed
One easy fix would be to flatten the list of lists first:
words = [item for sublist in lines for item in sublist]
for word in words: #
    if word in wordDict.keys():
        wordDict[word] = wordDict[word] + 1
    else:
        wordDict[word] = 1
The list comprehension [item for sublist in lines for item in sublist] will loop through lines, then loop through the sublists created by line.split(" ") and return a new list consisting of the items in each sublist. For you, lines probably looks something like this:
[['words', 'on', 'line', 'one'], ['words', 'on', 'line', 'two']]
The list comprehension will turn it into this:
['words', 'on', 'line', 'one', 'words', 'on', 'line', 'two']
If you would like to use something a little less complicated, you could just use nested loops:
# words = lines
# just use `lines` in your for loop instead of creating an identical list
wordDict = {} #creates the clean word Dic, from above
for line in lines:
    for word in line:
        if word in wordDict.keys():
            wordDict[word] = wordDict[word] + 1
        else:
            wordDict[word] = 1

largest_value = max(wordDict.values())
This will probably be a little less efficient and/or "Pythonic", but it will probably be easier to wrap your head around.
Also, you may want to consider splitting each line into words before cleaning the data, because if you clean the lines first, you will only remove punctuation at the end of lines rather than at the end of words. However, this might not be necessary depending on the nature of your data.
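To illustrate that last point with a rough sketch of my own (not part of the original answer), splitting first and then stripping punctuation from each individual word would look like this:

import string

wordDict = {}
for line in lines:                                  # `lines` assumed to be a list of lists of raw words
    for word in line:
        word = word.strip(string.punctuation)       # "dog." and "dog" now count as the same word
        if word:
            wordDict[word] = wordDict.get(word, 0) + 1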
If there is a character that is not in my key list, such as "X", how do I skip it and continue without doing anything to it? I am getting KeyError: 'X', because there is an X in the sequence that I am looking at.
keys = ["A", "C", "D", "E"]
for char in keys:
counts[char] = 0
for line in gpcr:
if line.startswith(">"):
line = line.replace(' ','')
header = line.split()
number = header[0].split('|')
print "Id:",number[2]
continue
fo.write(number[2])
fo.write('\n')
for char in line.strip():
if char
counts[char] += 1
total = float(sum(counts.values()))
toReturn = ''
for key in keys:
aa_per = (counts[key]/total)*100
toReturn = toReturn + '%.2f'%aa_per + '%'+ '\t'
fo.write(number[1])
fo.write('\n')
fo.write(''.join(str(x) for x in toReturn))
fo.write('\n')
print toReturn
fo.close()
I am slightly confused by your question. I guess the problematic line is
aa_per = (counts[key]/total)*100
You can check for a KeyError by using a try block:
try:
    aa_per = (counts[key]/total)*100
except KeyError:
    aa_per = 0
I guess if the key doesn't occur the percentage should be 0 here.
In general, try blocks are a powerful tool for handling exceptions and warnings. See also the tutorial: https://docs.python.org/3/tutorial/errors.html
If you still want to count the number of occurrences of this X character, you can use defaultdict
from collections import defaultdict
counts = defaultdict(int)
counts will be a dictionary that, instead of raising a KeyError, returns 0 if the key does not exist. This way you will be able to avoid the dictionary initialization altogether.
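A quick demonstration of that behaviour (my own example, not from the original answer):

from collections import defaultdict

counts = defaultdict(int)
counts['A'] += 1       # no initialization needed
print(counts['X'])     # prints 0 instead of raising KeyError
print(dict(counts))    # {'A': 1, 'X': 0} -- note the lookup above created the 'X' entry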
Update:
If you want to use if-else, I think it should be enough to do:
for char in line.strip():
    if char in keys:
        counts[char] += 1