I need a regex that works as follows. I've been trying for a day and can't figure it out.
(IIILjava/lang/String;Ljava/lang/String;II)V = ['I', 'I', 'I', 'I',
'Ljava/lang/String;', 'Ljava/lang/String;', 'I', 'I'] (ignoring what's after the ')')
(IIJ)J = ['I', 'I', 'J']
(IBZS)Z = ['I', 'B', 'Z', 'S']
I've gotten (I|D|F|Z|B|S|L.+?;) so far, but I can't get it to ignore the character that comes after ')'.
(?<=\([^()]{0,10000})(?:L[^;]+;|[A-Z])(?=[^()]*\))
(?<=\([^()]{0,10000}) Positive lookbehind ensuring what precedes is (, followed by any character except ( or ) between 0 and 10000 times. The upper limit may be adjusted as needed, but must not be infinite.
L[^;]+; Match an object type: L, then one or more characters other than ;, then ;
[A-Z] Otherwise, match any single uppercase ASCII letter
(?=[^()]*\)) Positive lookahead ensuring what follows is any character except ( or ) any number of times, followed by )
Results:
['I', 'I', 'I', 'I', 'Ljava/lang/String;', 'Ljava/lang/String;', 'I', 'I']
['I', 'I', 'J']
['I', 'B', 'Z', 'S']
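Python's built-in re module rejects the lookbehind above because it is not fixed-width, so that pattern needs an engine that accepts bounded-length lookbehind. A minimal Python sketch of an alternative, assuming the input is always a single descriptor of the form (params)return: cut out the text between the parentheses first, then tokenize it. The function name descriptor_params is hypothetical, and array types are not handled.

```python
import re

def descriptor_params(descriptor):
    """Return the parameter type tokens of a descriptor such as '(IIJ)J'."""
    # Keep only the text between '(' and the first ')'; the return type is dropped.
    start = descriptor.index("(") + 1
    end = descriptor.index(")", start)
    inner = descriptor[start:end]
    # An object type runs from 'L' to the next ';'; anything else is one letter.
    return re.findall(r"L[^;]+;|[A-Z]", inner)

print(descriptor_params("(IIJ)J"))                   # ['I', 'I', 'J']
print(descriptor_params("(IBZS)Z"))                  # ['I', 'B', 'Z', 'S']
print(descriptor_params("(ILjava/lang/String;I)V"))  # ['I', 'Ljava/lang/String;', 'I']
```

Slicing first keeps the regex itself trivial, at the cost of assuming exactly one parenthesized parameter list per input string.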
Related
If my string is
'Felix Underhalm' and I want to turn it into
list = ['F', 'E', 'L', 'I', 'X']
how should I do it?
When I use the RegexTokenizer from pyspark.ml.feature to tokenize sentences column in my dataframe to find all the word characters, I get the opposite of what I would get when the python re package is used for the same sentence. Here is the sample code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer
spark = SparkSession.builder \
.master("local") \
.appName("Word list") \
.getOrCreate()
df = spark.createDataFrame(data = [["Hi there, I have a question about RegexTokenizer, Could you please help me..."]], schema = ["Sentence"])
regexTokenizer = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w")
df = regexTokenizer.transform(df)
df.first()['letters']
This gives the following output:
[' ', ', ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' ', '...']
On the other hand if I use the re module on the same sentence and use the same pattern to match the letters, using this code here:
import re
sentence = "Hi there, I have a question about RegexTokenizer, could you please help me..."
letters_list = re.findall("\\w", sentence)
print(letters_list)
I get the desired output as per the regular expression pattern as:
['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', 'h', 'a', 'v', 'e', 'a',
'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'a', 'b', 'o', 'u', 't',
'R', 'e', 'g', 'e', 'x', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'e',
'r', 'c', 'o', 'u', 'l', 'd', 'y', 'o', 'u', 'p', 'l', 'e', 'a',
's', 'e', 'h', 'e', 'l', 'p', 'm', 'e']
I also found that I need to use \W instead of \w in PySpark to solve this problem. Why is there a difference? Or have I misunderstood the usage of the pattern argument in RegexTokenizer?
According to the documentation, RegexTokenizer takes a parameter called gaps. When gaps is true (the default), the regex matches the gaps between tokens; when it is false, the regex matches the tokens themselves.
Try setting it explicitly to the value you need: in your case, gaps=False.
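The two modes can be mimicked with the standard re module alone, which may make the difference easier to see. This is only an analogy in plain Python, not PySpark code, and the variable names are made up for illustration. In PySpark itself the corresponding change would be passing gaps=False when constructing the RegexTokenizer.

```python
import re

sentence = "Hi there, I"

# gaps=True analogue: the pattern marks separators, so the tokens are whatever
# text lies between matches (empty pieces are dropped).
gap_mode = [piece for piece in re.split(r"\w", sentence) if piece]

# gaps=False analogue: the pattern describes the tokens themselves.
token_mode = re.findall(r"\w", sentence)

print(gap_mode)    # [' ', ', ']
print(token_mode)  # ['H', 'i', 't', 'h', 'e', 'r', 'e', 'I']
```

With \w as the pattern, the first mode returns the spaces and punctuation, which matches the unexpected output in the question.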
I am trying to split a string as follows:
Zero or more consonants followed by zero or more vowels are taken as one token.
Every other character is taken as a token of its own.
For example, 'yes, oat is good' is split as ['ye', 's', ',', ' ', 'oa', 't', ' ', 'i', 's', ' ', 'goo', 'd'].
Trying regex re.compile(r'[bcdefghjklmnpqrstuvwxyz]*[aeiou]*').findall('yes, oat is good') gives me ['yes', '', '', 'oa', 't', '', 'i', 's', '', 'goo', 'd', '']. Why is 'yes' not split into 'ye' and 's'?
Then, trying re.compile(r'[bcdefghjklmnpqrstuvwxyz]*[aeiou]*|.').findall('yes, oat is good') gives me the same result. Why does it not capture ',' and ' '?
Finally, is there a way to avoid getting empty strings?
You should not include the letters e and u among the consonants. Aside from that, you should use an alternation pattern to match every other character as a token. Also use a positive lookahead to ensure that the pattern matching zero or more consonants followed by zero or more vowels matches at least one letter:
re.findall(r'[^a-z]|(?=[a-z])[bcdfghjklmnpqrstvwxyz]*[aeiou]*', 'yes, oat is good', re.I)
This returns:
['ye', 's', ',', ' ', 'oa', 't', ' ', 'i', 's', ' ', 'goo', 'd']
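As for the empty strings in the original attempts: both halves of that pattern are starred, so the first alternative can always succeed with an empty match, and the `|.` branch is never tried. A lookahead-free variant, as a sketch, makes every branch consume at least one character instead. The function name tokenize is hypothetical.

```python
import re

# Each alternative must consume at least one character, so no empty matches occur:
#   consonants followed by optional vowels | vowels alone | any single non-letter
TOKEN = re.compile(r"[bcdfghjklmnpqrstvwxyz]+[aeiou]*|[aeiou]+|[^a-z]", re.I)

def tokenize(text):
    """Split text into consonant+vowel runs and single non-letter characters."""
    return TOKEN.findall(text)

print(tokenize("yes, oat is good"))
# ['ye', 's', ',', ' ', 'oa', 't', ' ', 'i', 's', ' ', 'goo', 'd']
```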
My name is Edwin. I am new to programming, but I love to learn. I have an assignment for school and I must build a program that rates passwords, but I have a little problem. As you can see, I made four lists that together contain every character. When I run this program, it does not show "uw wachtwoord is Sterk" even when the password contains characters from klein, groot, nummers and symbolen. How do I fix this?
By the way, I can't make use of isdigit, isnumeric and such.
Thank you in advance!
print ("Check of uw wachtwoord veilig genoeg is in dit programma.")
print ("Uw wachtwoord moet tussen minimaal 6 en maximaal 12 karakters bestaan")
print ("U kunt gebruik maken van hoofdletters,getallen en symbolen (@,#,$,%)")
ww = input("Voer uw wachtwoord in: ")
klein = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
'm','n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
groot = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
nummers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
symbolen = [' ', '!', '#', '$', '%', '&', '"', '(', ')', '*', '+', ',', '-',
'.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`',
'{', '|', '}', '~', "'"]
if len(ww) < 6:
    print ("uw wachtwoord is te kort, uw wachtwoord moet uit minimaal 6 en maximaal 12 karakters bestaan!")
elif len(ww) > 12:
    print ("uw wachtwoord is te lang, uw wachtwoord moet uit minimaal 6 en maximaal 12 karakters bestaan!")
elif len(ww) >= 6 and len(ww)<= 12:
    if ww == klein and ww == groot and ww == nummers and ww == symbolen:
        print ("uw wachtwoord is Sterk")
Your test cannot work because you're comparing your password (of type str: string) against a list. A string is never equal to a list, so each comparison is just False. Even if the types matched, equality is not what you want here: you want to check for a non-empty intersection.
You need to check, for each list, that at least one member of the list is in the password.
Define an auxiliary function which checks whether any character of a list is in the password (a lambda would be too much maybe), using any:
def ok(passwd, l):
    return any(k in passwd for k in l)
Then test all your four lists against this condition using all:
elif len(ww) >= 6 and len(ww) <= 12:
    sww = set(ww)
    if all(ok(sww, l) for l in [klein, groot, nummers, symbolen]):
        print ("uw wachtwoord is Sterk")
Note the slight optimization of converting the password (a string, so the in operator is O(n)) to a set of characters (where the in operator is much faster on average). Besides, a set removes duplicate characters, which is perfect for this example.
More compact version without the aux function, using a nested generator expression, which is not so difficult to understand after all:
elif len(ww) >= 6 and len(ww) <= 12:
    sww = set(ww)
    if all(any(k in sww for k in l) for l in [klein, groot, nummers, symbolen]):
        print ("uw wachtwoord is Sterk")
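Putting the pieces together, here is a self-contained sketch of the whole check as one function. The names LOWER, UPPER, DIGITS, SYMBOLS and is_strong are hypothetical stand-ins for the lists in the question, written as plain strings for brevity; no isdigit-style methods are used.

```python
# Character groups; a password must contain at least one character from each.
LOWER = "abcdefghijklmnopqrstuvwxyz"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"
SYMBOLS = " !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def is_strong(password):
    """True if the length is 6..12 and every character group is represented."""
    if not 6 <= len(password) <= 12:
        return False
    chars = set(password)  # set membership tests are fast and ignore duplicates
    groups = [LOWER, UPPER, DIGITS, SYMBOLS]
    return all(any(c in chars for c in group) for group in groups)

print(is_strong("Abc1!x"))   # True
print(is_strong("abcdef"))   # False: no uppercase letter, digit or symbol
```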
I have been given the following pattern for a UK Postcode:
([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)
Can anyone break this down for me?
Here it is:
NODE                     EXPLANATION
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  [A-PR-UWYZ0-9]           any character of: 'A' to 'P', 'R' to
                           'U', 'W', 'Y', 'Z', '0' to '9'
----------------------------------------------------------------------
  [A-HK-Y0-9]              any character of: 'A' to 'H', 'K' to
                           'Y', '0' to '9'
----------------------------------------------------------------------
  [AEHMNPRTVXY0-9]?        any character of: 'A', 'E', 'H', 'M',
                           'N', 'P', 'R', 'T', 'V', 'X', 'Y', '0'
                           to '9' (optional (matching the most
                           amount possible))
----------------------------------------------------------------------
  [ABEHMNPRVWXY0-9]?       any character of: 'A', 'B', 'E', 'H',
                           'M', 'N', 'P', 'R', 'V', 'W', 'X', 'Y',
                           '0' to '9' (optional (matching the most
                           amount possible))
----------------------------------------------------------------------
   {1,2}                   ' ' (between 1 and 2 times (matching the
                           most amount possible))
----------------------------------------------------------------------
  [0-9]                    any character of: '0' to '9'
----------------------------------------------------------------------
  [ABD-HJLN-UW-Z]{2}       any character of: 'A', 'B', 'D' to 'H',
                           'J', 'L', 'N' to 'U', 'W' to 'Z' (2
                           times)
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  GIR 0AA                  'GIR 0AA'
----------------------------------------------------------------------
)                        end of \1
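To see the breakdown in action, here is a small Python sketch that applies the same pattern with re.fullmatch. The helper name looks_like_postcode is hypothetical, and the pattern only checks the shape of the string, not whether the postcode actually exists.

```python
import re

# The pattern from the question, split over two raw strings for readability.
POSTCODE = re.compile(
    r"([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}"
    r"[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)"
)

def looks_like_postcode(text):
    """True if the whole string has the shape described by the pattern."""
    return POSTCODE.fullmatch(text) is not None

print(looks_like_postcode("SW1A 1AA"))  # True
print(looks_like_postcode("GIR 0AA"))   # True: the special-case alternative
print(looks_like_postcode("QA1 1AA"))   # False: 'Q' is not allowed first
print(looks_like_postcode("M1 1AC"))    # False: 'C' is not allowed in the last two
```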
debuggex.com is a really useful resource for debugging regular expressions.
I'm actually a bigger fan of either of the other two answers than the ones I'm about to list, but the more the merrier:
http://regex101.com/ - will give a good breakdown/explanation
http://www.regexper.com/ - will produce a lovely railroad diagram
The following answers are also worth a read for slight alternatives/explanations:
UK Postcode Regex (Comprehensive)
UK Postcode Regex