Splitting a string into consonant-vowel sequences

I am trying to split a string as follows:
Zero or more consonants followed by zero or more vowels are taken as a token.
All other characters are taken as a token.
For example, 'yes, oat is good' is split as ['ye', 's', ',', ' ', 'oa', 't', ' ', 'i', 's', ' ', 'goo', 'd'].
Trying regex re.compile(r'[bcdefghjklmnpqrstuvwxyz]*[aeiou]*').findall('yes, oat is good') gives me ['yes', '', '', 'oa', 't', '', 'i', 's', '', 'goo', 'd', '']. Why is 'yes' not split into 'ye' and 's'?
Then, trying re.compile(r'[bcdefghjklmnpqrstuvwxyz]*[aeiou]*|.').findall('yes, oat is good') gives me the same result. Why does it not capture ',' and ' '?
Finally, is there a way to avoid getting empty strings?

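Do not include the vowels e and u in the consonant class; with e in it, 'yes' is consumed as a single run of consonants. Adding |. as a second alternative changes nothing, because the first alternative can match the empty string and therefore always succeeds before . is tried. Instead, use an alternation whose other branch matches any single non-letter character, and guard the consonant-vowel branch with a positive lookahead so that it only matches where at least one letter follows: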
re.findall(r'[^a-z]|(?=[a-z])[bcdfghjklmnpqrstvwxyz]*[aeiou]*', 'yes, oat is good', re.I)
This returns:
['ye', 's', ',', ' ', 'oa', 't', ' ', 'i', 's', ' ', 'goo', 'd']
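As a quick check of the explanation above, here is a small sketch that runs both the original pattern and the corrected one on the sample sentence. The variable names are only illustrative.

```python
import re

text = "yes, oat is good"

# Original attempt: 'e' and 'u' are in the consonant class, so "yes" is one run,
# and the pattern matches the empty string at every non-letter position.
broken = re.findall(r"[bcdefghjklmnpqrstuvwxyz]*[aeiou]*", text)

# Corrected pattern: one branch takes a single non-letter character, the other
# takes consonants then vowels, but only where a letter actually follows.
token_re = re.compile(r"[^a-z]|(?=[a-z])[bcdfghjklmnpqrstvwxyz]*[aeiou]*", re.I)
tokens = token_re.findall(text)

print(broken[0])  # yes
print(tokens)     # ['ye', 's', ',', ' ', 'oa', 't', ' ', 'i', 's', ' ', 'goo', 'd']
```

Because neither branch can match an empty string, joining the tokens gives back the original sentence with nothing lost or added.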

Related

Why my RegexTokenizer transformation in PySpark gives me the opposite of the required pattern?

When I use the RegexTokenizer from pyspark.ml.feature to tokenize sentences column in my dataframe to find all the word characters, I get the opposite of what I would get when the python re package is used for the same sentence. Here is the sample code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder \
    .master("local") \
    .appName("Word list") \
    .getOrCreate()
df = spark.createDataFrame(
    data=[["Hi there, I have a question about RegexTokenizer, Could you please help me..."]],
    schema=["Sentence"])
regexTokenizer = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w")
df = regexTokenizer.transform(df)
df.first()['letters']
This gives the following output:
[' ', ', ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' ', '...']
On the other hand if I use the re module on the same sentence and use the same pattern to match the letters, using this code here:
import re
sentence = "Hi there, I have a question about RegexTokenizer, could you please help me..."
letters_list = re.findall("\\w", sentence)
print(letters_list)
I get the desired output as per the regular expression pattern as:
['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', 'h', 'a', 'v', 'e', 'a',
'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'a', 'b', 'o', 'u', 't',
'R', 'e', 'g', 'e', 'x', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'e',
'r', 'c', 'o', 'u', 'l', 'd', 'y', 'o', 'u', 'p', 'l', 'e', 'a',
's', 'e', 'h', 'e', 'l', 'p', 'm', 'e']
I also found that I need to use \W instead of \w in PySpark to solve this problem. Why is there this difference? Or have I misunderstood the usage of the pattern argument in RegexTokenizer?

According to the documentation, RegexTokenizer has a parameter called gaps. When gaps is true (the default), the regex matches the gaps between tokens, so the text is split on it; when gaps is false, the regex matches the tokens themselves.
Set it explicitly to the value you need: in your case, gaps=False.
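The two modes can be pictured with the plain re module. This is only an analogy and not how RegexTokenizer is implemented: splitting on the pattern stands in for gaps=True, and collecting the matches stands in for gaps=False. The short sentence and the variable names are made up for illustration.

```python
import re

sentence = "Hi there, I"

# gaps=True (the default): the pattern describes the separators, much like re.split.
# Empty pieces are dropped to mimic the tokenizer's default minimum token length of 1.
gap_tokens = [piece for piece in re.split(r"\w", sentence) if piece]

# gaps=False: the pattern describes the tokens themselves, much like re.findall.
match_tokens = re.findall(r"\w", sentence)

# Whole words with the default gaps=True: describe the separators instead.
words = re.split(r"\W+", sentence)

print(gap_tokens)    # [' ', ', ']
print(match_tokens)  # ['H', 'i', 't', 'h', 'e', 'r', 'e', 'I']
print(words)         # ['Hi', 'there', 'I']
```

In PySpark itself the equivalent change would be RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w", gaps=False). Note that RegexTokenizer also lowercases its input unless toLowercase is set to False, so the real tokens would differ in case from the re output.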

Using 'sorted' function for words gives an output with letters being split and sorted

I'm new to python (2.7) and stackoverflow. I'm trying to learn how to use the 'sorted' function. When I use the 'sorted' function, the sentence splits into individual letters and sorts those letters in ascending order. But that is not what I want. I want to sort my words in ascending order. I'm trying to run this code
peace = "This is one of the most useful sentences in the whole wide world."
def pinkan(one):
    return sorted(one)
print(pinkan(peace))
But the output I get is something of this sort:
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'T', 'c', 'd',
'd', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'f', 'f',
'h', 'h', 'h', 'h', 'i', 'i', 'i', 'i', 'l', 'l', 'l', 'm', 'n', 'n', 'n',
'n', 'o', 'o', 'o', 'o', 'o', 'r', 's', 's', 's', 's', 's',
's', 't', 't', 't', 't', 'u', 'u', 'w', 'w', 'w']
I would appreciate any help/suggestion. Thanks :-)

First call split() to turn the sentence into a list of words, then call sort() on that list. Passing key=str.lower makes the sort case-insensitive; without it, 'This' would come before every lowercase word:
peace = "This is one of the most useful sentences in the whole wide world."
terms = peace.split()
terms.sort(key=str.lower)
print(terms)
This prints:
['in', 'is', 'most', 'of', 'one', 'sentences', 'the', 'the', 'This', 'useful',
'whole', 'wide', 'world.']
To turn the sorted words back into a single string, use " ".join(terms).
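The reason the original function returned letters is that iterating over a string yields its characters, and sorted() sorts whatever it iterates over. A minimal sketch of the difference, using a hypothetical helper name sort_words:

```python
def sort_words(sentence):
    # Split on whitespace first so that sorted() sees words, not characters.
    return sorted(sentence.split(), key=str.lower)

peace = "This is one of the most useful sentences in the whole wide world."

print(sorted("b a"))       # [' ', 'a', 'b']  (characters, including the space)
print(sort_words("b a"))   # ['a', 'b']       (words)
print(sort_words(peace))
```

Unlike list.sort(), sorted() returns a new list and leaves its input untouched, which suits a function that should not modify its argument.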

Regex for individual characters between () but excluding what's outside

I need a regex that works as follows, I've been trying for a day and can't figure it out.
(IIILjava/lang/String;Ljava/lang/String;II)V = ['I', 'I', 'I', 'I',
'Ljava/lang/String;', 'Ljava/lang/String;', 'I', 'I'] (ignoring what's after the closing parenthesis)
(IIJ)J = ['I', 'I', 'J']
(IBZS)Z = ['I', 'B', 'Z', 'S']
I've gotten (I|D|F|Z|B|S|L.+?;) so far but I can't get it to ignore that character that's after ')'.

(?<=\([^()]{0,10000})(?:L[^;()]*;|[A-Z])(?=[^()]*\))
(?<=\([^()]{0,10000}) Positive lookbehind ensuring what precedes is (, followed by any character except ( or ) between 0 and 10000 times. The upper limit may be adjusted as needed, but must not be infinite.
(?:L[^;()]*;|[A-Z]) Match either an object type (L, then any characters except ;, ( or ), up to and including the closing ;) or a single uppercase ASCII letter. The object-type alternative comes first so that the uppercase S in Ljava/lang/String; does not start a new token.
(?=[^()]*\)) Positive lookahead ensuring what follows is any character except ( or ) any number of times, followed by )
Results:
['I', 'I', 'I', 'I', 'Ljava/lang/String;', 'Ljava/lang/String;', 'I', 'I']
['I', 'I', 'J']
['I', 'B', 'Z', 'S']
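The lookbehind in the pattern above has a variable width, which Python's re module rejects ("look-behind requires fixed-width pattern"). If the target is Python, one possible alternative is a two-step approach: first isolate the text between the parentheses, then tokenize it. This is a sketch of a different technique, not the pattern from the answer; the function name parameter_types is hypothetical, and array types are not handled.

```python
import re

# An object type (L...;) is tried first so that uppercase letters inside a
# class name such as String are not mistaken for separate tokens.
TOKEN = re.compile(r"L[^;]*;|[A-Z]")

def parameter_types(descriptor):
    # Step 1: keep only what sits between the parentheses.
    inside = re.search(r"\(([^()]*)\)", descriptor)
    if inside is None:
        return []
    # Step 2: split that part into type tokens.
    return TOKEN.findall(inside.group(1))

print(parameter_types("(IIILjava/lang/String;Ljava/lang/String;II)V"))
print(parameter_types("(IIJ)J"))   # ['I', 'I', 'J']
print(parameter_types("(IBZS)Z"))  # ['I', 'B', 'Z', 'S']
```

Everything after the closing parenthesis is ignored simply because it never reaches the tokenizer.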

How to fetch a particular pattern using regular expression in Robot Framework?

I have a scenario where I need to fetch a particular pattern from the string using regular expression.
The string looks like below:
${text} = Slot 0 l 5 3 24+6
Slot 1 l 3 16 10
Slot 3 l 4 3 32
Slot 8 l 2 3
Slot 9 l 1 3
Here, I need to fetch only
Slot 0
Slot 1
Slot 3
Slot 8
Slot 9
How do I do this?
I have tried using the keywords 'Replace String Using Regexp' and 'Get Regexp Matches' for the same.
${text}= String.Replace String Using Regexp ${response} [^Slot\\s+\\d], ${EMPTY}
The result was:
${text} = Slot 0 l 5 3 24+6 Slot 1 l 3 16 10 Slot 3 l 4 3 32 Slot 8 l 2 3 Slot 9 l 1 3
And, Get Regexp Matches gives the below result:
${matches}= String.Get Regexp Matches ${response} [Slot\\s+\\d]
The result:
${matches}= ['S', 'l', 'o', 't', ' ', '0', ' ', ' ', ' ', 'l', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '5', ' ', '3', ' ', ' ', '2', '4', '+', '6', '\r', '\n', 'S', 'l', 'o', 't', ' ', '1', ' ', ' ', ' ', 'l', ' ', ' ...

The solution is to remove the square brackets from the regular expression passed to the 'Get Regexp Matches' keyword, i.e. use Slot\s+\d+ instead of [Slot\s+\d+] (written as Slot\\s+\\d+ in Robot Framework test data, where the backslash has to be escaped). Square brackets define a character class, which matches a single character from the listed set, whereas my requirement was to fetch the whole substring. Thanks #Todor
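Robot Framework's String library is built on Python's re module, so the difference between the two patterns can be reproduced in plain Python. The sample text below is a shortened, assumed copy of the response from the question.

```python
import re

text = "Slot 0 l 5 3 24+6\r\nSlot 1 l 3 16 10\r\nSlot 3 l 4 3 32"

# Character class: each match is ONE character out of S, l, o, t, whitespace, + or a digit.
as_class = re.findall(r"[Slot\s+\d]", text)

# Plain sequence: the literal word Slot, then whitespace, then one or more digits.
as_sequence = re.findall(r"Slot\s+\d+", text)

print(as_class[:6])  # ['S', 'l', 'o', 't', ' ', '0']
print(as_sequence)   # ['Slot 0', 'Slot 1', 'Slot 3']
```

Inside the brackets the + is just a literal plus sign, which is why '+' also showed up among the single-character matches in the question.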

Regex to split by square brackets and dots with python and re module

I want to build a regex expression to split by '.' and '[]', but here, I would want to keep the result between square brackets.
I mean:
import re
pattern = re.compile("\.|[\[-\]]")
my_string = "a.b.c[0].d.e[12]"
pattern.split(my_string)
# >>> ['a', 'b', 'c', '0', '', 'd', 'e', '12', '']
But I would like to get the following output (without any empty strings):
# >>> ['a', 'b', 'c', '0', 'd', 'e', '12']
Would it be possible? I've tested a lot of regex patterns and this is the best I've found, but it's not perfect.

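You can add a quantifier so that a run of consecutive delimiters counts as one separator, then filter out the empty string left after the trailing ]. Wrapping filter in list() gives a list on Python 3 as well, where filter returns an iterator:
>>> pattern = re.compile(r'[.\[\]]+')
>>> my_string = "a.b.c[0].d.e[12]"
>>> list(filter(None, pattern.split(my_string)))
['a', 'b', 'c', '0', 'd', 'e', '12']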
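Another option, offered here as a sketch rather than as part of the answer above, is to match the pieces instead of splitting on the delimiters. Since each match needs at least one non-delimiter character, no empty strings can appear and no filtering is needed.

```python
import re

my_string = "a.b.c[0].d.e[12]"

# Splitting leaves an empty string after the final ']'.
split_parts = re.split(r"[.\[\]]+", my_string)

# Matching runs of non-delimiter characters never yields an empty string.
parts = re.findall(r"[^.\[\]]+", my_string)

print(split_parts)  # ['a', 'b', 'c', '0', 'd', 'e', '12', '']
print(parts)        # ['a', 'b', 'c', '0', 'd', 'e', '12']
```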
