Getting proper length of emojis - python-2.7

I noticed that while typing emojis in a phone message, some of them take 1 character and some take 2. For example, "♊" takes 1 character but "😁" takes 2. In Python, I'm trying to get the length of emojis and I'm getting:
len("♊") # 3
len("😁") # 4
len(unicode("♊", "utf-8")) # 1 OH IT WORKS!
len(unicode("😁", "utf-8")) # 1 Oh wait, no it doesn't.
Any ideas?
This site lists the emoji's length in its Character.charCount() row: http://www.fileformat.info/info/unicode/char/1F601/index.htm

Read sys.maxunicode:
An integer giving the value of the largest Unicode code point, i.e.
1114111 (0x10FFFF in hexadecimal).
Changed in version 3.3: Before PEP 393, sys.maxunicode used to
be either 0xFFFF or 0x10FFFF, depending on the configuration
option that specified whether Unicode characters were stored as
UCS-2 or UCS-4.
The following script should work in both Python 2 and 3:
# coding=utf-8
from __future__ import print_function
import sys, platform, unicodedata
print( platform.python_version(), 'maxunicode', hex(sys.maxunicode))
tab = '\t'
unistr = u'\u264a \U0001f601' ### unistr = u'♊ 😁'
print ( len(unistr), tab, unistr, tab, repr( unistr))
for char in unistr:
    print (len(char), tab, char, tab, repr(char), tab,
           unicodedata.category(char), tab, unicodedata.name(char,'private use'))
The output shows the consequence of the different sys.maxunicode values. For instance, the 😁 character (Unicode code point 0x1f601, above the Basic Multilingual Plane) is converted to the corresponding surrogate pair (code points u'\ud83d' and u'\ude01') if sys.maxunicode is 0xFFFF:
PS D:\PShell> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
PS D:\PShell> . py -3 D:\test\Python\Py\42783173.py
3.5.1 maxunicode 0x10ffff
3 ♊ 😁 '♊ 😁'
1 ♊ '♊' So GEMINI
1 ' ' Zs SPACE
1 😁 '😁' So GRINNING FACE WITH SMILING EYES
PS D:\PShell> . py -2 D:\test\Python\Py\42783173.py
2.7.12 maxunicode 0xffff
4 ♊ 😁 u'\u264a \U0001f601'
1 ♊ u'\u264a' So GEMINI
1 u' ' Zs SPACE
1 �� u'\ud83d' Cs private use
1 �� u'\ude01' Cs private use
Note: the above output examples were taken from the Unicode-aware PowerShell ISE console pane.
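If the goal is simply to count code points regardless of the Python version or build, one workaround is to encode to a fixed-width encoding and divide by the unit size. A minimal sketch (the helper name codepoint_len is mine, not part of the question):
# -*- coding: utf-8 -*-
def codepoint_len(u):
    """Count Unicode code points, even on a narrow Python 2 build where
    characters above U+FFFF are stored as surrogate pairs."""
    # UTF-32-LE uses exactly 4 bytes per code point and writes no BOM.
    return len(u.encode('utf-32-le')) // 4

print(codepoint_len(u'\u264a'))      # 1 (GEMINI)
print(codepoint_len(u'\U0001f601'))  # 1 (GRINNING FACE WITH SMILING EYES), even where len() reports 2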

Related

how to remove everything but letters, numbers and ! ? . ; , # ' using regex in python pandas df?

I am trying to remove everything but letters, numbers and ! ? . ; , # ' from my pandas text column.
I have already read some other questions on the topic, but still can not make mine work.
Here is an example of what I am doing:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['hey+ guys! wuzup',
                            'hello p3ople!What\'s up?',
                            'hey, how- thing == do##n',
                            'my name is bond, james b0nd']}
                  )
Then we have the following table:
id text
1 hey+ guys! wuzup
2 hello p3ople!What\'s up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Now, trying to remove everything but letters, numbers and ! ? . ; , # '
First try:
df.loc[:,'text'] = df['text'].str.replace(r"^(?!(([a-zA-z]|[\!\?\.\;\,\#\'\"]|\d))+)$",' ',regex=True)
output
id text
1 hey+ guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Second try
df.loc[:,'text'] = df['text'].str.replace(r"(?i)\b(?:(([a-zA-Z\!\?\.\;\,\#\'\"\:\d])))",' ',regex=True)
output
id text
1 ey+ uys uzup
2 ello 3ople hat p
3 ey ow- hing == o##
4 y ame s ond ames 0nd
Third try
df.loc[:,'text'] = df['text'].str.replace(r'(?i)(?<!\w)(?:[a-zA-Z\!\?\.\;\,\#\'\"\:\d])',' ',regex=True)
output
id text
1 ey+ uys! uzup
2 ello 3ople! hat' p?
3 ey, ow- hing == o##
4 y ame s ond, ames 0nd
Afterwards, I also tried the re.sub() function with the same regex patterns, but still did not manage to get the expected result, which is as follows:
id text
1 hey guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing don
4 my name is bond, james b0nd
Can anyone help me with that?
Links that I have seen on the topic:
Is there a way to remove everything except characters, numbers and '-' from a string
How do check if a text column in my dataframe, contains a list of possible patterns, allowing mistyping?
removing newlines from messy strings in pandas dataframe cells?
https://stackabuse.com/using-regex-for-text-manipulation-in-python/
Is this what you are looking for?
df.text.str.replace("(?i)[^0-9a-z!?.;,#' -]",'')
Out:
0 hey guys! wuzup
1 hello p3ople!What's up?
2 hey, how- thing don
3 my name is bond, james b0nd
Name: text, dtype: object
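The same negated character class also works with plain re.sub, which the question mentions trying. A small sketch, assuming plain Python strings; the idea is to match everything outside the allowed set and delete it, rather than trying to describe what to keep:
import re

# Remove every character that is NOT a letter, digit, space, or one of ! ? . ; , # ' -
pattern = re.compile(r"(?i)[^0-9a-z!?.;,#' -]")
print(pattern.sub('', "hey+ guys! wuzup"))  # hey guys! wuzup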

Spacy to Conll format without using Spacy's sentence splitter

This post shows how to get the dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy
nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s"%(
            i+1, # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_, # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_ # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.
How can I get the POS, NER, and dependencies of one sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your sentences that are split, then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - doc[0].i + 1  # doc, not sent: each Doc holds one pre-split sentence
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s"%(
            i+1, # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_, # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_ # Relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
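For completeness, a small usage sketch; the splitter below is just a stand-in for whatever sentence splitter you already have, and nlp (e.g. spacy.load('en')) and printConll are assumed to be defined as above:
raw_text = u'Bob bought the pizza to Alice. She shared it with Carol.'

# Stand-in splitter: replace this with your own sentence splitter.
def my_sentence_splitter(text):
    return [s.strip() + '.' for s in text.split('.') if s.strip()]

for sentence in my_sentence_splitter(raw_text):
    printConll(sentence)
    print("")  # blank line between sentences, as the CoNLL format expects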
This post is about a user who got unexpected sentence breaks from spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (as on that post) is to add the flexibility to define one's own sentence boundary detection rules. The catch is that sentence boundary detection is solved in conjunction with dependency parsing in spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
#ashu 's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. Though there is a simple sentencizer.
https://spacy.io/api/sentencizer
The sentencizer only uses punctuation (not the most sophisticated approach). But since such a sentencizer exists, you can create a custom one using your own rules, and it will definitely affect the sentence boundaries.
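To illustrate, a minimal sketch of a custom boundary-setting component placed before the parser. It assumes the spaCy 2.x pipeline API (in spaCy 3.x the component would be registered with @Language.component and added by name), and the semicolon rule is only a placeholder for your own splitting logic:
import spacy

nlp = spacy.load('en_core_web_sm')

def custom_boundaries(doc):
    # Placeholder rule: start a new sentence after every semicolon.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# The component must run before the parser so the parser respects these boundaries.
nlp.add_pipe(custom_boundaries, before='parser')

doc = nlp(u'Bob bought the pizza; Alice ate it.')
print([sent.text for sent in doc.sents])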

Lemmatizing Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.
I prefer lemmatizing over stemming because I could extract the word meaning from the context of the sentence (e.g. distinguish between a verb and a noun) and obtain words that exist in the language, rather than roots of those words that don't usually have a meaning.
I found a library called pattern (pip2 install pattern) that should complement nltk in order to perform lemmatization of Italian; however, I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of a sentence.
Probably I should give pattern the responsibility of tokenizing a sentence (and so also annotating each word with metadata about verbs/nouns/adjectives etc.), then retrieving the lemmatized word, but I have not been able to do this and I am not even sure it is possible at the moment.
Also: in Italian some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I have not found a way to split these 2 words with a combination of nltk and pattern, so I am not able to count word frequencies correctly.
import nltk
import string
import pattern.it   # pattern's Italian module (needed for pattern.it.parse below)

# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# the word is coming from, i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))
# 2nd remove punctuation and lower-case everything
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))
# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))
# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))
# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))
# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)
Gives this output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
How can I effectively lemmatize sentences with pattern, using its own tokenizer? (assuming lemmas are recognized as nouns/verbs/adjectives etc.)
Is there a Python alternative to pattern for Italian lemmatization with nltk?
How can I split articles that are bound to the next word by an apostrophe?
I'll try to answer your question, knowing that I don't know a lot about Italian!
1) As far as I know, the main responsibility for handling apostrophes lies with the tokenizer, and as such the nltk Italian tokenizer seems to have failed here.
3) A simple thing you can do about it is to split on the apostrophe (although you will probably have to use the re package for more complicated patterns), for example:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
It yields:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
2) An alternative to pattern would be treetagger; granted, it is not the easiest of installs (you need the Python package and the tool itself), but after that it works on Windows and Linux.
A simple example with your sentence above:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
The pprint yields:
[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
It also nicely tokenized "all'ippodromo" into al and ippodromo (which is hopefully correct) under the hood, before lemmatizing. Now we just need to remove stop words and punctuation and it will be fine.
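As a sketch of that last step (dropping stop words and punctuation, then counting lemma frequencies), assuming the Tag tuples look exactly like the pprint output above:
import nltk
import treetaggerwrapper

it_stop_words = set(nltk.corpus.stopwords.words('italian'))

it_string = ("Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. "
             "Stasera mangio la pizza con le verdure.")
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = treetaggerwrapper.make_tags(tagger.tag_text(it_string))

# Keep lemmas whose token is neither sentence punctuation nor an Italian stop word.
lemmas = [t.lemma for t in tags
          if t.pos != 'SENT' and t.word.lower() not in it_stop_words]
print(nltk.FreqDist(lemmas).most_common())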
The documentation for installing the TreeTaggerWrapper library for Python
I know this issue was solved a few years ago, but I am facing the same problem with nltk tokenization and Python 3 with regard to parsing words like all'ippodromo or dall'Italia. So I want to share my experience and give a partial, although late, answer.
The first action/rule an NLP pipeline must take into account is preparing the corpus. I discovered that by replacing the ' character with a proper typographic apostrophe ’ (using an accurate regex replacement during text parsing, or just a preparatory replace-all in a basic text editor), the tokenization works correctly and I get the proper splitting with just nltk.tokenize.word_tokenize(text).
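A minimal sketch of that preparation step, assuming Python 3 and that the NLTK punkt tokenizer data is available:
import nltk

it_string = "Ieri sono andato all'ippodromo."

# Replace the straight apostrophe with the typographic one before tokenizing,
# as described above, so the article and the noun are split apart.
prepared = it_string.replace("'", "\u2019")
print(nltk.tokenize.word_tokenize(prepared, language='italian'))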

Python 2 str.decode('hex') in Python 3?

I want to send hex-encoded data to another client via sockets in Python. I managed to do everything some time ago in Python 2. Now I want to port it to Python 3.
Data looks like this:
""" 16 03 02 """
Then I used this function to get it into a string:
x.replace(' ', '').replace('\n', '').decode('hex')
It then looks like this (which is of type str, by the way):
'\x16\x03\x02'
Now I managed to find this in python 3:
codecs.decode('160302', 'hex')
but it returns another type:
b'\x16\x03\x02'
And since everything I encode is not proper text, I cannot use utf-8 or other codecs, as there are invalid bytes in it (e.g. \x00, \xFF). Any ideas on how I can get the escaped string again, just like in Python 2?
Thanks
'str' objects in Python 3 are not sequences of bytes but sequences of Unicode code points.
If by "send data" you mean calling send then bytes is the right type to use.
If you really want the string (not 3 bytes but 12 unicode code points):
>>> import codecs
>>> s = str(codecs.decode('16ff00', 'hex'))[2:-1]
>>> s
'\\x16\\xff\\x00'
>>> print(s)
\x16\xff\x00
Note that you need to double backslashes in order to represent them in code.
There is a standard solution for Python 2 and Python 3. No imports needed:
hex_string = """ 16 03 02 """
some_bytes = bytearray.fromhex(hex_string)
In Python 3 you can treat it much like a str (slice it, iterate over it, etc.), and you can also concatenate byte strings: b'\x00', b'text' or bytes('text', 'utf8').
You also mentioned something about encoding to "utf-8". If you need a text string out of those bytes, you can get one with some_bytes.decode().
As you can see, you don't need to clean the input: fromhex() ignores the whitespace. If you want to go back to a hexadecimal string, some_bytes.hex() (Python 3.5+) will do it for you.
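A quick round trip with the whitespace-laden input from the question, as a sketch (the socket object sock is hypothetical); this behaves the same on Python 2.7 and Python 3:
hex_string = """ 16 03 02 """
some_bytes = bytearray.fromhex(hex_string)   # the spaces are ignored
print(repr(some_bytes))                      # bytearray(b'\x16\x03\x02')
# sock.send(bytes(some_bytes))               # hypothetical socket named sock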
a = """ 16 03 02 """.encode("utf-8")
#Send things over socket
print(a.decode("utf-8"))
Why not encode with UTF-8, send it over the socket, and decode with UTF-8 again?

'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128) SQLAlchemy (Django)

I use an SQLAlchemy query with utf-8 encoding. When I run the query directly on MySQL I get output, but when I run the code in Python I get this error:
'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128)
Query:
query = """SELECT * FROM (SELECT p.ID AS 'persons_ID', p.FirstName AS 'persons_FirstName', p.LastName AS 'persons_LastName',p.NationalCode AS 'persons_NationalCode', p.CityID AS 'persons_CityID', p.Mobile AS 'persons_Mobile',p.Address AS 'persons_Address', cities_1.ID AS 'cities_1_ID', cities_1.Name AS 'cities_1_Name',cities_1.ParentID AS 'cities_1_ParentID', cities_2.ID AS 'cities_2_ID', cities_2.Name AS 'cities_2_Name',cities_2.ParentID AS 'cities_2_ParentID' , cast(#row := #row + 1 as unsigned) as 'persons_row_number' FROM Persons p LEFT OUTER JOIN cities AS cities_2 ON cities_2.ID = p.CityID LEFT OUTER JOIN cities AS cities_1 ON cities_1.ID = cities_2.ParentID , (select #row := 0) as init WHERE 1=1 AND p.FirstName LIKE N'{}%'""".format('رامین')
MySQL connector charset:
e = create_engine("mysql+pymysql://@localhost/test?charset=utf8")
Do you have any idea how to resolve this?
Thanks,
Python 2 uses bytestrings (ASCII strings) by default, which support only Latin characters. Python 3 uses Unicode strings by default.
As I can see, you use some Arabic script in your query, and therefore you probably get some back in the response. The error says that, obviously, Python can't decode the Arabic characters to ASCII. To handle Arabic (or any other non-Latin) characters you have to use unicode strings in Python. Note: this has nothing to do with the charset setting you provide, which affects only the database connection.
So your options are:
Switch to Python 3.
Stay as you are, but add from __future__ import unicode_literals at the start of every module to make string literals unicode by default (see the sketch after this list).
Use encode/decode every time you need to convert between unicode and bytestrings, but this is the worst solution.
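A minimal sketch of option 2, assuming a UTF-8 source file; with unicode_literals the formatted query, including the Arabic name interpolated into it, is a unicode object rather than a bytestring (the shortened query here is just for illustration):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

name = 'رامین'                      # a unicode object now, not raw bytes
query = "SELECT * FROM Persons p WHERE p.FirstName LIKE N'{}%'".format(name)
print(type(query))                  # <type 'unicode'> on Python 2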