When I use RegexTokenizer from pyspark.ml.feature to tokenize the Sentence column of my DataFrame with a pattern that matches word characters, I get the opposite of what Python's re module returns for the same sentence and pattern. Here is the sample code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer
spark = SparkSession.builder \
    .master("local") \
    .appName("Word list") \
    .getOrCreate()
df = spark.createDataFrame(
    data=[["Hi there, I have a question about RegexTokenizer, Could you please help me..."]],
    schema=["Sentence"])
regexTokenizer = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w")
df = regexTokenizer.transform(df)
df.first()['letters']
This gives the following output:
[' ', ', ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' ', '...']
On the other hand if I use the re module on the same sentence and use the same pattern to match the letters, using this code here:
import re
sentence = "Hi there, I have a question about RegexTokenizer, could you please help me..."
letters_list = re.findall("\\w", sentence)
print(letters_list)
I get the desired output as per the regular expression pattern as:
['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', 'h', 'a', 'v', 'e', 'a',
'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'a', 'b', 'o', 'u', 't',
'R', 'e', 'g', 'e', 'x', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'e',
'r', 'c', 'o', 'u', 'l', 'd', 'y', 'o', 'u', 'p', 'l', 'e', 'a',
's', 'e', 'h', 'e', 'l', 'p', 'm', 'e']
I also found that using \W instead of \w in PySpark gives me the result I want. Why is there this difference? Or have I misunderstood how the pattern argument of RegexTokenizer is used?
According to the RegexTokenizer documentation, the tokenizer has a parameter called gaps. When gaps is true (the default), the regex matches the gaps between tokens, so the text is split on every match. When gaps is false, the regex matches the tokens themselves.
Set it explicitly to the value you need: in your case, gaps=False in PySpark.
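The two modes behave much like re.split and re.findall in Python's re module. The following is only a rough analogy in plain Python, not Spark code; the shortened sentence and the variable names are chosen for illustration:

```python
import re

sentence = "Hi there, I"

# gaps=True analogue: the pattern marks separators, so split on it.
# Empty strings are dropped here to mimic the tokenizer discarding empty tokens.
gap_tokens = [t for t in re.split(r"\w", sentence) if t]

# gaps=False analogue: the pattern marks the tokens themselves.
match_tokens = re.findall(r"\w", sentence)

print(gap_tokens)    # [' ', ', ']
print(match_tokens)  # ['H', 'i', 't', 'h', 'e', 'r', 'e', 'I']
```

Note that RegexTokenizer also lowercases its input by default (toLowercase=True), which this analogy does not reproduce.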
I'm new to Python (2.7) and Stack Overflow. I'm trying to learn how to use the sorted function. When I call sorted on a sentence, it splits the sentence into individual letters and sorts those letters in ascending order. That is not what I want: I want to sort the words in ascending order. I'm trying to run this code:
peace = "This is one of the most useful sentences in the whole wide world."
def pinkan(one):
    return sorted(one)
print pinkan(peace)
But the output I get is something of this sort:
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'T', 'c', 'd',
'd', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'f', 'f', 'h', 'h', 'h',
'h', 'i', 'i', 'i', 'i', 'l', 'l', 'l', 'm', 'n', 'n', 'n', 'n', 'o', 'o',
'o', 'o', 'o', 'r', 's', 's', 's', 's', 's', 's', 't', 't', 't', 't', 'u',
'u', 'w', 'w', 'w']
I would appreciate any help/suggestion. Thanks :-)
You should first use split() to generate a list of words, then sort() to sort that list alphabetically in ascending order:
peace = "This is one of the most useful sentences in the whole wide world."
terms = peace.split()
terms.sort(key=str.lower)  # case-insensitive, so 'This' sorts among the t's
print(terms)
Output:
['in', 'is', 'most', 'of', 'one', 'sentences', 'the', 'the', 'This', 'useful',
'whole', 'wide', 'world.']
If you want a single string again, join the words with " ".join(terms).
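The reason the original call returned letters is that iterating over a string yields single characters. A minimal Python 3 sketch contrasting the two cases; it uses sorted(), which returns a new list, rather than the in-place sort(), and the short string "to be" is only an illustration:

```python
peace = "This is one of the most useful sentences in the whole wide world."

# Iterating over a string yields characters, so sorted() sorts characters.
chars = sorted("to be")

# Splitting first yields words; sorted() returns a new list and
# leaves the list produced by split() untouched.
words = sorted(peace.split(), key=str.lower)

print(chars)  # [' ', 'b', 'e', 'o', 't']
print(words)
```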
This is the list.
list1 =['F', 'L', 'Y', 'W', 'B', 'E', 'G', 'A', 'L', 'K', 'R', 'U', 'B', 'E', 'T', 'L', 'H', 'G', 'E', 'C', 'K', 'Y', 'U', 'B', 'H', 'L', 'U', 'G', 'A', 'F', 'K', 'Y', 'F', 'M', 'P', 'U', 'B', 'K', 'F', 'G', 'I', 'O', 'N', 'S', 'Y']
I want to delete the letters that occur n times in the list. In this problem n is 4.
This is what I have tried so far:
n = 4
alphabet = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
i = 0
for x in range(len(alphabet)-1):
    print(alphabet[i])
    h = list1.count(alphabet[x])
    print("h: ",h)
    if h == n:
        while alphabet[x] in alphabet:
            alphabet.remove(alphabet[x])
print(alphabet)
I am getting an error saying: list.remove(x): x not in list
To delete the letters that occur n times, here is a solution using collections.Counter (a dict subclass that counts items):
import collections
n = 4
list1 =['F', 'L', 'Y', 'W', 'B', 'E', 'G', 'A', 'L', 'K', 'R', 'U', 'B', 'E', 'T', 'L', 'H', 'G', 'E', 'C', 'K', 'Y', 'U', 'B', 'H', 'L', 'U', 'G', 'A', 'F', 'K', 'Y', 'F', 'M', 'P', 'U', 'B', 'K', 'F', 'G', 'I', 'O', 'N', 'S', 'Y']
counts = collections.Counter(list1)
list1 = [letter for letter in list1 if counts[letter] != n]  # keep letters that do not occur exactly n times
print(list1)
The output:
['W', 'E', 'A', 'R', 'E', 'T', 'H', 'E', 'C', 'H', 'A', 'M', 'P', 'I', 'O', 'N', 'S']
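The same idea can be wrapped in a small reusable function. This is a sketch only; the function name drop_items_occurring and the sample list are hypothetical and not part of the original answer:

```python
from collections import Counter

def drop_items_occurring(items, n):
    """Return a new list without the items that occur exactly n times."""
    counts = Counter(items)
    return [item for item in items if counts[item] != n]

sample = ["a", "b", "a", "c", "b", "a"]
result = drop_items_occurring(sample, 2)
print(result)  # ['a', 'a', 'c', 'a']  ('b' occurs exactly twice and is dropped)
```

Because the function builds a new list, the input list is left unchanged.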
Rather than proposing a whole new solution, you can modify your code to something like this:
for x in list(alphabet):  # iterate over a copy, since the loop removes from alphabet
    print(x)
    h = list1.count(x)
    print("h: ",h)
    if h == n:
        while x in alphabet:
            alphabet.remove(x)
print(alphabet)
The problem with your code is the while loop: as soon as a single letter has h == 4, it removes not only that letter but also every letter after it from alphabet. This is caused by:
while alphabet[x] in alphabet:
    alphabet.remove(alphabet[x])
When you remove alphabet[x], the next element shifts into position x (list indexing stays contiguous), so alphabet[x] now refers to the following letter. The while loop therefore removes the matching letter and all the letters after it.
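The index shift can be reproduced on a small list. In this minimal sketch the five-letter list is hypothetical, and the bounds check on x is added only so the loop stops cleanly; without it the loop would end with an IndexError once x runs past the end of the list:

```python
letters = ["A", "B", "C", "D", "E"]
x = 1  # index of "B"

removed = []
# letters[x] is re-evaluated on every pass, so after "B" is removed
# the same index points at "C", then "D", and so on.
while x < len(letters) and letters[x] in letters:
    removed.append(letters[x])
    letters.remove(letters[x])

print(removed)  # ['B', 'C', 'D', 'E']
print(letters)  # ['A']
```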
But since you want to remove letters from list1, not from alphabet, you should change the code to:
list1 =['F', 'L', 'Y', 'W', 'B', 'E', 'G', 'A', 'L', 'K', 'R', 'U', 'B', 'E', 'T', 'L', 'H', 'G', 'E', 'C', 'K', 'Y', 'U', 'B', 'H', 'L', 'U', 'G', 'A', 'F', 'K', 'Y', 'F', 'M', 'P', 'U', 'B', 'K', 'F', 'G', 'I', 'O', 'N', 'S', 'Y']
n = 4
alphabet = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
for x in alphabet:
    print(x)
    h = list1.count(x)
    print("h: ",h)
    if h == n:
        # remove every occurrence of this letter from list1
        while x in list1:
            list1.remove(x)
print(''.join(list1))
My name is Edwin. I am new to programming, but I love to learn. I have an assignment for school: I must build a program that rates passwords, and I have a little problem. As you can see, I made separate lists containing every character of each kind. When I run the program, it does not print "uw wachtwoord is Sterk" ("your password is strong") even when the conditions for klein, groot, nummers and symbolen should all be true. How do I fix this?
By the way, I can't use isdigit, isnumeric and such.
Thank you in advance!
# Dutch UI text: "wachtwoord" = password, "te kort" = too short, "te lang" = too long, "Sterk" = strong
print("Check of uw wachtwoord veilig genoeg is in dit programma.")
print("Uw wachtwoord moet tussen minimaal 6 en maximaal 12 karakters bestaan")
print("U kunt gebruik maken van hoofdletters,getallen en symbolen (#,#,$,%)")
ww = input("Voer uw wachtwoord in: ")
klein = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
         'm','n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
groot = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
         'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nummers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
symbolen = [' ', '!', '#', '$', '%', '&', '"', '(', ')', '*', '+', ',', '-',
            '.', '/', ':', ';', '<', '=', '>', '?', '#', '[', '\\', ']', '^', '_', '`',
            '{', '|', '}', '~',"'"]
if len(ww) < 6:
    print("uw wachtwoord is te kort, uw wachtwoord moet uit minimaal 6 en maximaal 12 karakters bestaan!")
elif len(ww) > 12:
    print("uw wachtwoord is te lang, uw wachtwoord moet uit minimaal 6 en maximaal 12 karakters bestaan!")
elif len(ww) >= 6 and len(ww) <= 12:
    if ww == klein and ww == groot and ww == nummers and ww == symbolen:
        print("uw wachtwoord is Sterk")
Your test cannot work because you are comparing your password (a str) against a list. A string never equals a list, so each comparison is simply False. Even if the types matched, equality is not what you want: you need to check that the password and each list have at least one character in common.
So, for each list, check that at least one member of the list occurs in the password.
Define a helper function that checks whether any character of a list is in the password, using any (a lambda would probably be overkill):
def ok(passwd, l):
    return any(k in passwd for k in l)
Then test all your four lists against this condition using all:
elif len(ww) >= 6 and len(ww) <= 12:
    sww = set(ww)
    if all(ok(sww, l) for l in [klein, groot, nummers, symbolen]):
        print("uw wachtwoord is Sterk")
Note the slight optimization of converting the password to a set of characters. The in operator on a string is a linear scan, O(n), while a membership test on a set takes constant time on average. A set also drops duplicate characters, which suits this example because only membership matters.
A more compact version drops the helper function and nests any() directly inside all(), which is not so difficult to understand after all:
elif len(ww) >= 6 and len(ww) <= 12:
    sww = set(ww)
    if all(any(k in sww for k in l) for l in [klein, groot, nummers, symbolen]):
        print("uw wachtwoord is Sterk")
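Putting the length check and the character-class check together gives a function that can be tested without typing input. This is a sketch under assumptions: the name is_strong is hypothetical, and the constants from the string module stand in for the hand-written lists (string.punctuation contains more symbols than the symbolen list above), which the assignment may or may not allow:

```python
import string

# Stand-ins for klein, groot, nummers and symbolen (assumption: the
# string module constants are acceptable in place of the manual lists).
CLASSES = [
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.digits,
    string.punctuation + " ",
]

def is_strong(password):
    """True if the length is 6 to 12 and every character class is represented."""
    if not 6 <= len(password) <= 12:
        return False
    chars = set(password)
    return all(any(c in chars for c in group) for group in CLASSES)

print(is_strong("Abc123!x"))  # True
print(is_strong("abcdefgh"))  # False: no uppercase letter, digit or symbol
```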
I have a question about this error:
TypeError: rot13() takes exactly 1 argument (2 given)
It occurs in this code:
def get(self):  # called on every GET request
    ch = self.rot13("abc")
def rot13(input):  # fairly untested rot13 ;)
    alpha = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
             'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    escaped = escape(input)
    ciphered = ""
    for char in escaped:
        originalIndex = alpha.index(char)
        newIndex = (originalIndex + 13) % 26
        ciphered = ciphered + alpha[newIndex]
I do not know why this error occurs. I am only passing one argument there.
It seems that you're missing this:
def rot13(self, input):
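That's because rot13() appears to be a method inside a class, not a stand-alone function, so it needs to receive self as its first parameter. Calling self.rot13("abc") passes self implicitly, which is why Python counts two arguments.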
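A minimal runnable sketch of the corrected method. The class name Handler is hypothetical, the escape() call from the question is left out because its origin is not shown, and the pass-through for characters outside a-z and the return statement are additions made for this sketch:

```python
import string

class Handler:  # hypothetical class name
    def get(self):
        # self is passed implicitly as the first argument of rot13
        return self.rot13("abc")

    def rot13(self, text):
        alpha = string.ascii_lowercase
        ciphered = ""
        for char in text:
            if char in alpha:
                ciphered += alpha[(alpha.index(char) + 13) % 26]
            else:
                ciphered += char  # leave characters outside a-z unchanged
        return ciphered

print(Handler().get())  # nop
```

Applying rot13 twice returns the original text, which makes for a quick sanity check.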