Python splitting a string with accented letters

Python splitting a string with accented letters - python-2.7

I would like to split a string which contains accented characters into characters without breaking the accent and the letter apart.
A simple example is
>>> o = u"šnjiwgetit"
>>> print u" ".join(o)
s ̌ n j i w g e t i t
or
>>> print list(o)
[u's', u'\u030c', u'n', u'j', u'i', u'w', u'g', u'e', u't', u'i', u't']
Whereas I would like the result to be š n j i w g e t i t so that the accent stays on top of the consonant.
The solution should work even with more difficult characters such as h̭ɛ̮ŋkkɐᴅ

You can use a regex to group each letter with the accents that follow it. Here is some sample code for doing so:
import re
pattern = re.compile(r'(\w[\u02F3\u1D53\u0300\u2013\u032E\u208D\u203F\u0311\u0323\u035E\u031C\u02FC\u030C\u02F9\u0328\u032D:\u02F4\u032F\u0330\u035C\u0302\u0327\u03572\u0308\u0351\u0304\u02F2\u0352\u0355\u00B7\u032C\u030B\u2019\u0339\u00B4\u0301\u02F1\u0303\u0306\u030A7\u0325\u0307\u0354`\u02F0]+|\w|\W)', re.UNICODE | re.IGNORECASE)
If some of your accents are missing, add them to the pattern.
Then, you can split words into characters as follows.
print(list(pattern.findall('šnjiwgetit')))
['š', 'n', 'j', 'i', 'w', 'g', 'e', 't', 'i', 't']
print(list(pattern.findall('h̭ɛ̮ŋkkɐᴅ')))
['h̭', 'ɛ̮', 'ŋ', 'k', 'k', 'ɐ', 'ᴅ']
If you are using Python2, add from __future__ import unicode_literals at the beginning of the file.
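As an alternative to listing every accent in a character class, a minimal sketch using the standard unicodedata module can attach each combining mark to the character before it. The function name split_clusters is hypothetical, and the code is written for Python 3 (on Python 2 the literals would need a u prefix). It assumes the accents are combining marks with a non-zero canonical combining class, which holds for the examples in the question but not for every mark in Unicode.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for marks such as U+030C
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "s" followed by U+030C (combining caron), as in the question
word = "s\u030cnjiwgetit"
print(" ".join(split_clusters(word)))
```

Because the check relies on the Unicode database rather than a hand-written list, new accents do not have to be added one by one.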

Related

How to remove all consonants and print vowels in a list

Here is my code:
#Alphabet class
class Alphabet(object):
    def __init__(self, s):
        self.s = s
    def __str__(self):
        return "Before: " + str(self.s)
#Define your Vowels class here
class Vowels:
    def __init__(self,vowelList):
        self.vowelList = vowelList
    def __str__(self):
        return "Invoking the method in Vowels by passing the Alphabet object\nAfter: " + str(vowelList)
    def addVowels(self,a_obj):
        for letter in a_obj:
            if letter in 'aeiou':
                vowelList.append(letter)
        l = ','.join(vowelList)
a1 = Alphabet('A,B,C,E,I')
print a1
b = Vowels(a1)
b.addVowels(a1)
print (a2)
Right now, all it prints is "Before: A,B,C,E,I", but I am trying to take a string of letters separated by commas (i.e. a_obj), extract the vowels from the string, then append the result to a list. I have looked at other answers about finding and printing only the vowels, which is why I have the for loop and if statement in addVowels, but no luck. Just to note, Vowels is supposed to be a container class for Alphabet.
When trying to get the output...the below code gives me
a1 = Alphabet('A,B,C,E,I')
print a1
a2 = Vowels(a1)
print a2
output:
Before: A,B,C,E,I
Invoking the method in Vowels by passing the Alphabet object
After: []
it seems like it isn't passing the letters from Alphabet...

You can create the list and get rid of the commas in one line by using split.
>>> "a,b,c,d,e,f,g,h,i,j".split(",")
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
From there you can remove the consonants by only keeping the vowels.
You can use a for loop:
letterList = ['a', 'b', 'c', 'd']
vowelList = []
for letter in letterList:
    if letter in 'aeiou':
        vowelList.append(letter)
Or you can use list comprehension:
letterList = ['a', 'b', 'c', 'd']
vowelList = [letter for letter in letterList if letter in 'aeiou']
Example of how this would work for your code:
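class Vowels(object):
    def __init__(self, vowelList):
        # vowelList is initially the Alphabet object; split its string on commas
        lettersList = vowelList.s.split(",")
        # Lowercase each letter before testing so that 'A', 'E', 'I' count as vowels
        self.vowelList = [letter for letter in lettersList if letter.lower() in 'aeiou']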
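Putting the pieces together, here is a minimal sketch of how the two classes from the question could cooperate. It assumes that Vowels should hold a reference to the Alphabet object and fill its list when addVowels is called; the attribute name alphabet and the constant VOWELS are illustrative and not part of the original code.

```python
VOWELS = set("aeiou")  # a set avoids accidental substring matches


class Alphabet(object):
    def __init__(self, s):
        self.s = s

    def __str__(self):
        return "Before: " + str(self.s)


class Vowels(object):
    def __init__(self, alphabet):
        # Keep the Alphabet object and start with an empty list of vowels
        self.alphabet = alphabet
        self.vowelList = []

    def addVowels(self):
        # Split on commas, then keep only letters whose lowercase form is a vowel
        letters = self.alphabet.s.split(",")
        self.vowelList = [x for x in letters if x.lower() in VOWELS]

    def __str__(self):
        return "After: " + ",".join(self.vowelList)


a1 = Alphabet("A,B,C,E,I")
a2 = Vowels(a1)
a2.addVowels()
print(a1)
print(a2)
```

The important differences from the question's code are that the list is always accessed through self, and that the comparison is case-insensitive, since the input letters are uppercase.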

I'm using this code, and it works for me.
def getVowels(text):
    vowel_letters = []
    vowel_list = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U',]
    for vowels in text:
        if vowels in vowel_list:
            vowel_letters.append(vowels)
    return vowel_letters
print(getVowels('Hi, How are you today!'))
## Output: ['i', 'o', 'a', 'e', 'o', 'u', 'o', 'a']

Regex for individual characters between () but excluding what's outside

I need a regex that works as follows, I've been trying for a day and can't figure it out.
(IIILjava/lang/String;Ljava/lang/String;II)V = ['I', 'I', 'I', 'I', 'Ljava/lang/String;', 'Ljava/lang/String;', 'I', 'I'] (ignoring what's after the ')')
(IIJ)J = ['I', 'I', 'J']
(IBZS)Z = ['I', 'B', 'Z', 'S']
I've gotten (I|D|F|Z|B|S|L.+?;) so far but I can't get it to ignore that character that's after ')'.

(?<=\([^()]{0,10000})[A-Z][^A-Z()]*(?=[^()]*\))
(?<=\([^()]{0,10000}) Positive lookbehind ensuring what precedes is (, followed by any character except ( or ) between 0 and 10000 times. The upper limit may be adjusted as needed, but must not be infinite.
[A-Z] Match any uppercase ASCII letter
[^A-Z()]* Match any character except an uppercase ASCII letter, ( or ) any number of times
(?=[^()]*\)) Positive lookahead ensuring what follows is any character except ( or ) any number of times, followed by )
Results:
['I', 'I', 'I', 'I', 'Ljava/lang/String;', 'Ljava/lang/String;', 'I', 'I']
['I', 'I', 'J']
['I', 'B', 'Z', 'S']
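The pattern above depends on a variable-length lookbehind, which Python's built-in re module rejects (it requires fixed-width lookbehind). As a hedged alternative for Python, the sketch below first isolates the text between the parentheses and then tokenizes it. The names TOKEN and descriptor_params are hypothetical, and the token pattern assumes each item is either a single uppercase letter or an 'L...;' type, as in the question's examples.

```python
import re

# One token: an object type such as 'Ljava/lang/String;' or a single uppercase letter
TOKEN = re.compile(r"L[^;]+;|[A-Z]")


def descriptor_params(descriptor):
    """Return the tokens found between the first pair of parentheses."""
    inner = re.search(r"\(([^()]*)\)", descriptor)
    if inner is None:
        return []
    # Anything after ')' is never seen, because only group(1) is tokenized
    return TOKEN.findall(inner.group(1))


print(descriptor_params("(IIJ)J"))
print(descriptor_params("(ILjava/lang/String;I)V"))
```

Splitting the work into two steps keeps each pattern simple and avoids lookarounds altogether.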

String value decode utf-8

I want to decode string values to utf-8. But it doesn't change.
So, here is my code:
self.textEdit_3.append(str(self.new_header).decode("utf-8") + "\n")
The original output value is:
['matchkey', 'a', 'b', 'd', '안녕'] # 안녕 is Korean Language
I changed the default encoding for encoding / decoding with unicode to utf-8 instead of ascii. On the first line I added this code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Why doesn't the string value change?

You can fix your code like this:
header = str(self.new_header).decode('string-escape').decode("utf-8")
self.textEdit_3.append(header + "\n")
You do not need the setdefaultencoding lines.
Explanation:
The original value is a list containing byte-strings:
>>> value = ['matchkey', 'a', 'b', 'd', '안녕']
>>> value
['matchkey', 'a', 'b', 'd', '\xec\x95\x88\xeb\x85\x95']
If you convert this list with str, it will use repr on all the list elements:
>>> strvalue = str(value)
>>> strvalue
"['matchkey', 'a', 'b', 'd', '\\xec\\x95\\x88\\xeb\\x85\\x95']"
The repr parts can be decoded like this:
>>> strvalue = strvalue.decode('string-escape')
>>> strvalue
"['matchkey', 'a', 'b', 'd', '\xec\x95\x88\xeb\x85\x95']"
and this can now be decoded to unicode like this:
>>> univalue = strvalue.decode('utf-8')
>>> univalue
u"['matchkey', 'a', 'b', 'd', '\uc548\ub155']"
>>> print univalue
['matchkey', 'a', 'b', 'd', '안녕']
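A possible way to avoid the repr round trip altogether is to decode each element and join the results, instead of calling str on the whole list. This is only a sketch: it assumes the header is a list of UTF-8 encoded byte strings, and the variable names raw_header, decoded and line are illustrative.

```python
# -*- coding: utf-8 -*-

# Hypothetical stand-in for self.new_header: a list of UTF-8 byte strings
raw_header = [b'matchkey', b'a', b'b', b'd', b'\xec\x95\x88\xeb\x85\x95']

# Decode every element on its own, so no escaped repr text is ever produced
decoded = [item.decode('utf-8') for item in raw_header]

# Build the display text from the decoded pieces
line = u", ".join(decoded)
print(line)
```

The result is plain text without the list brackets and quotes, which may or may not be what the text widget should show.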
PS:
Regarding the problems reading files with a utf-8 bom, please test this script:
# -*- coding: utf-8 -*-
import os, codecs, tempfile
text = u'a,b,d,안녕'
data = text.encode('utf-8-sig')
print 'text:', repr(text), len(text)
print 'data:', repr(data), len(data)
f, path = tempfile.mkstemp()
print 'write:', os.write(f, data)
os.close(f)
with codecs.open(path, 'r', encoding='utf-8-sig') as f:
    string = f.read()
print 'read:', repr(string), len(string), string == text

Regex to split by square brackets and dots with python and re module

I want to build a regex expression to split by '.' and '[]', but here, I would want to keep the result between square brackets.
I mean:
import re
pattern = re.compile("\.|[\[-\]]")
my_string = "a.b.c[0].d.e[12]"
pattern.split(my_string)
# >>> ['a', 'b', 'c', '0', '', 'd', 'e', '12', '']
But I would wish to get the following output (without any empty string):
# >>> ['a', 'b', 'c', '0', 'd', 'e', '12']
Would that be possible? I've tested a lot of regex patterns and this is the best one I've found, but it's not perfect.

You can use a quantifier in your regex and filter:
>>> pattern = re.compile(r'[.\[\]]+')
>>> my_string = "a.b.c[0].d.e[12]"
>>> filter(None, pattern.split(my_string))
['a', 'b', 'c', '0', 'd', 'e', '12']
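Note that filter returns a list only on Python 2; on Python 3 it returns an iterator. A small sketch that behaves the same on both versions is shown below, together with a findall variant that matches the wanted pieces directly instead of splitting on the separators. The variable names are illustrative.

```python
import re

my_string = "a.b.c[0].d.e[12]"

# Split on runs of separators, then drop the empty strings explicitly
separators = re.compile(r'[.\[\]]+')
parts = [piece for piece in separators.split(my_string) if piece]
print(parts)

# Alternatively, match every run of characters that is not a separator
tokens = re.findall(r'[^.\[\]]+', my_string)
print(tokens)
```

The findall form never produces empty strings, so no filtering step is needed.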

How to use the regular expression to make the Pig Latin game?

I am trying to get a single match for the first consonant or consonant cluster in an input. Then the program should move that consonant or cluster to the end of the word and add "ay" after it.
Here is my code
import re
consonants = [ 'bl', 'cl', 'fl', 'gl', 'pl', 'sl', 'br', 'cr', 'dr', 'fr', 'gr','pr', 'tr', 'sc', 'sk', 'sm', 'sn', 'sp', 'st', 'sw', 'tw','b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z']
word1 = str(input("Type something"))
word2 = re.split(r'[b-df-hj-np-tv-z]' or '[bl]''[cl]''[fl]', word1)
if any(consonants in word2 for consonants in consonants):
    print(word2[1] + word2[0] + word2[2] + "ay")
The output does not appear in the interactive console.

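Right, Python does not do "magic"; or is a well-defined operator which evaluates its left operand and returns it if it is truthy, and otherwise returns its right operand. It is not something which magically combines two regular expression strings into a new regular expression string. Since r'[b-df-hj-np-tv-z]' is a non-empty string, the whole expression is just that first pattern and the second string is never used. (You have to remember that you're talking to a computer, and computers are very stupid!)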
To do the pig latin game you'll probably want to just gather a substring of non-vowels and then check whether it's 0-length (starts with a vowel) or not.
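That suggestion could be sketched as follows. The function name pig_latin is hypothetical, only a, e, i, o and u are treated as vowels, and appending "ay" unchanged to words that start with a vowel is an assumption, since Pig Latin variants differ on that rule.

```python
import re


def pig_latin(word):
    """Move the leading consonant cluster to the end and append 'ay'."""
    # Leading run of non-vowel characters; this may be empty
    cluster_length = len(re.match(r"[^aeiou]*", word.lower()).group(0))
    if cluster_length == 0:
        # The word starts with a vowel (assumed rule: just append "ay")
        return word + "ay"
    return word[cluster_length:] + word[:cluster_length] + "ay"


for sample in ["string", "glove", "apple"]:
    print(pig_latin(sample))
```

Because the character class matches any run of non-vowels, there is no need to enumerate clusters such as 'bl' or 'str' by hand.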

Just solved the program.
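import re
words1 = input("Input Sentence:")
# Groups alternate between runs of non-vowels and runs of vowels
b1 = re.search(r"([^aeoiu]*)([aeoiu]*)([^aeoiu]*)([aeoiu]*)([^aeoiu]*)", words1)
b2 = b1.group(1)
b3 = b1.group(2)
b4 = b1.group(3)
b5 = b1.group(4)
b6 = b1.group(5)
# Note: b5 is a string, so comparing it with the integer 5 is always true
if b5 != 5:
    # Move the leading consonant cluster (b2) to the end and add "ay"
    print(b3 + b4 + b5 + b6 + b2 + "ay")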
