I am reading lines from a CSV file and applying the LDA algorithm to find the most common topic. After preprocessing the data into doc_processed, I am getting a 'u' in front of every word, but why? Please suggest how to remove the 'u' from doc_processed. My code, in Python 2.7, is:
import re
import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

data = [line.strip() for line in open("/home/dnandini/test/h.csv", 'r')]
stop = set(stopwords.words('english'))  # stop words
exclude = set(string.punctuation)  # to remove the punctuation
lemma = WordNetLemmatizer()  # lemmatizer (maps words to their base forms)

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    output = shortword.sub('', normalized)
    return output

doc_processed = [clean(doc) for doc in data]
The output in doc_processed:
[u'amount', u'ze69heneicosadien11one', u'trap', u'containing', u'little', u'microg', u'zz69ket', u'attracted', u'male', u'low', u'population', u'level', u'indicating', u'potent', u'sex', u'attractant', u'trap', u'baited', u'z6ket', u'attracted', u'male', u'windtunnel', u'bioassay', u'least', u'100fold', u'attractive', u'male', u'zz69ket', u'improvement', u'trap', u'catch', u'occurred', u'addition', u'z6ket', u'various', u'binary', u'mixture', u'zz69ket', u'including', u'female', u'ratio', u'ternary', u'mixture', u'zz69ket']
The u'some string' format means it is a unicode string. See this question for more details on unicode strings themselves, but the easiest way to fix this is likely to str.encode the result before returning it from clean:
def clean(doc):
    # all as before until
    output = shortword.sub('', normalized).encode()
    return output
Note that attempting to encode Unicode code points that don't translate directly to the default encoding (which appears to be ASCII here; check sys.getdefaultencoding() on your system) will raise an error. You can handle the error in various ways by defining the errors kwarg to encode:
s.encode(errors="ignore")             # omit the code point that fails to encode
s.encode(errors="replace")            # replace the failing code point with '?'
s.encode(errors="xmlcharrefreplace")  # replace the failing code point with an
                                      # XML character reference such as '&#352;'
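As a quick illustration, here is a minimal sketch (Python 2.7, using a made-up sample string) of how the three handlers differ when encoding to ASCII:

# -*- coding: utf-8 -*-
s = u'\u0160im\xe1nek'  # u'Šimánek', a string containing non-ASCII code points

print s.encode('ascii', errors='ignore')             # 'imnek' -- failing code points dropped
print s.encode('ascii', errors='replace')            # '?im?nek' -- replaced with '?'
print s.encode('ascii', errors='xmlcharrefreplace')  # '&#352;im&#225;nek'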
I have a function that is designed to make some text safe for filenames or URLs. I'm trying to change it so that it works in both Python 2 and Python 3. In my attempt, I've confused myself with byte strings and would welcome some guidance. I'm encountering errors like sequence item 1: expected a bytes-like object, str found.
import re
import sys
import unicodedata

def slugify(
    text = None,
    filename = True,
    URL = False,
    return_str = True
):
    if sys.version_info >= (3, 0):
        pass  # insert magic here
    else:
        if type(text) is not unicode:
            text = unicode(text, "utf-8")
        if filename and not URL:
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
            text = unicode(re.sub(r"[^\w\s-]", "", text).strip())
            text = unicode(re.sub(r"[\s]+", "_", text))
        elif URL:
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
            text = unicode(re.sub(r"[^\w\s-]", "", text).strip().lower())
            text = unicode(re.sub(r"[-\s]+", "-", text))
    if return_str:
        text = str(text)
    return text
It seems like your main problem is figuring out how to convert the text to unicode and back to bytes when you aren't sure what the original type was. In fact, you can do this without any version checks if you're careful:
if isinstance(s, bytes):
s = s.decode('utf8')
This should be sufficient to convert something to unicode in either Python 2 or 3 (assuming 2.6+ and 3.2+, as is usual), because bytes exists as an alias for str in Python 2. The explicit 'utf8' argument is only required in Python 2, but there's no harm in providing it in Python 3 as well. Then, to convert back to a byte string, you just do the reverse:
if not isinstance(s, bytes):
s = s.encode('utf8')
Of course, I would recommend that you think hard about why you are unsure what types your strings have in the first place. It is better to keep the distinction clear rather than write "weak" APIs that accept either; Python 3 just encourages you to maintain the separation.
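Putting the two checks together, here's a minimal sketch (the helper names to_unicode and to_bytes are my own, purely illustrative) that runs unchanged on Python 2.6+ and 3.2+:

def to_unicode(s):
    # bytes is an alias for str on Python 2 and the real bytes type on Python 3
    if isinstance(s, bytes):
        s = s.decode('utf8')
    return s

def to_bytes(s):
    if not isinstance(s, bytes):
        s = s.encode('utf8')
    return s

# Usage: convert at the boundaries, work with Unicode internally
text = to_unicode(b'Josef \xc5\xa0im\xc3\xa1nek')
raw = to_bytes(text)

Inside slugify, you could call to_unicode(text) once at the top and drop the version check entirely.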
I'm writing a bot in Python 2.7 using tweepy, and I'm stumped on how to approach what I'm looking to do. Currently the bot finds the tweet id and appends it to a text file. On later runs I want to search that file for a match and only write the id if there is no match already in the file. The intent is to avoid adding duplicate tweet ids to my text file, which could grow to a large number of ids, each followed by a newline.
Any help is appreciated!
Edit: when I try the code below, the IDE says match can't be seen, and I get a syntax error as a result.
import re, codecs, tweepy

qName = "Queue.txt"
tweets = api.search(q=searchQuery, count=tweet_count, result_type="recent")
with codecs.open(qName, 'a', encoding='utf-8') as f:
    for tweet in tweets:
        tweetId = tweet.id_str
        match = re.findall(tweetId, qName)
        # if match is false then do write, else discard and move on
        f.write(tweetId + '\n')
If I understand you correctly, you need not bother with regex etc.; let the special containers do the work for you. I would use a non-duplicating container like a dict or a set: read all the data from the file into a dict or set, add the new ids to it, and afterwards write the dict or set back to the file.
e.g.
>>> data = set()
>>> for i in list('asddddddddddddfgggggg'):
...     data.add(i)
...
>>> data
set(['a', 's', 'd', 'g', 'f'])  # note just one 'd' and one 'g'
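Applied to your tweet-id file, a minimal sketch (assuming api, searchQuery, and tweet_count are set up as in your code) could look like this:

import codecs

qName = "Queue.txt"

# Read the already-seen ids into a set (empty if the file doesn't exist yet)
try:
    with codecs.open(qName, 'r', encoding='utf-8') as f:
        seen = set(line.strip() for line in f)
except IOError:
    seen = set()

tweets = api.search(q=searchQuery, count=tweet_count, result_type="recent")

# Append only the ids we haven't seen before
with codecs.open(qName, 'a', encoding='utf-8') as f:
    for tweet in tweets:
        tweetId = tweet.id_str
        if tweetId not in seen:
            f.write(tweetId + '\n')
            seen.add(tweetId)

Membership tests on a set are constant time, so this stays fast even when the file grows large.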
Hi guys, I am having a problem with inserting a UTF-8 encoded Unicode character into my database.
The unicode that I get from my form is u'AJDUK MARKO\u010d'. The next step is to encode it to UTF-8 with value.encode('utf-8'), which gives me the string 'AJDUK MARKO\xc4\x8d'.
When I try to update the database (the same happens for insert, btw):
cur.execute( "UPDATE res_partner set %s = '%s' where id = %s;"%(columns, value, remote_partner_id))
The value gets inserted or updated in the database, but the problem is that it comes back in exactly the same format, AJDUK MARKO\xc4\x8d, and of course I want AJDUK MARKOČ. The database has UTF-8 encoding, so it is not that.
What am I doing wrong? Surprisingly, I couldn't really find anything useful on the forums.
\xc4\x8d is the UTF-8 encoded representation of č. It looks like the insert has worked, but you're not printing the result correctly, probably by printing the whole row as a list, e.g.:
>>> print "č"
č
>>> print ["č"]  # a list with one string
['\xc4\x8d']
We need to see more code to verify (it's always a good idea to give as much reproducible code as possible).
You could decode the result (result.decode("utf-8")), but you should avoid manually encoding or decoding. Psycopg2 already lets you send Unicode, so you can do the following without encoding first:
cur.execute( u"UPDATE res_partner set %s = '%s' where id = %s;" % (columns, value, remote_partner_id))
- note the leading u
Psycopg2 can return Unicodes too by having strings automatically decoded:
import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
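With those two register_type calls in place, text columns come back as unicode objects directly. A minimal sketch (the connection string is a placeholder, and I'm assuming a text column called name; remote_partner_id as in your code):

import psycopg2
import psycopg2.extensions

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()
cur.execute("SELECT name FROM res_partner WHERE id = %s", (remote_partner_id,))
print cur.fetchone()[0]  # already a unicode object, e.g. u'AJDUK MARKO\u010d'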
Edit:
SQL values should be passed as an argument to .execute(). See the big red box at: http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters
Instead, do e.g.:
# Replace the columns field first.
# Strictly we should use http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
sql = u"UPDATE res_partner set {} = %s where id = %s;".format(columns)
cur.execute(sql, (value, remote_partner_id))
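As a sketch of the stricter approach mentioned in the comment above, psycopg2.sql (available in psycopg2 2.7+) can compose the column name safely as an identifier:

from psycopg2 import sql

query = sql.SQL("UPDATE res_partner SET {} = %s WHERE id = %s").format(
    sql.Identifier(columns)  # quoted as an identifier, not interpolated as text
)
cur.execute(query, (value, remote_partner_id))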
I'm running a parser in Python 2.7 that takes a text field of XML from a database and uses Beautiful Soup to find and pull different tags in the XML. When I pull the name tags out of an author tag and get to the given text, it is returning
<author>
<name>Josef Šimánek</name>
</author>
Josef \xc5\xa0im\xc3\xa1nek
when what it should look like is
Josef Šimánek
My relevant code is as follows:
rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    if author != None:
        for name in author.findAll("name"):
            if checkNull(name).find(",") != -1:
                name = checkNull(name).split(",", 1)
                for s in name:
                    print s
            else:
                print name
As you can see, the code pulls out and cycles through the different tags; if the name tag contains a comma-separated list of names, it splits the list and prints each name individually.
def checkNull(item):
    if item != None:
        return item.text.rstrip()
    return " "
Also, the checkNull function is just a helper to check whether the returned tag contains any text at all, as seen above.
I have tried the encode, decode, and unicode functions to try to resolve the issue, but none have succeeded. Are there any other methods I could try to fix this?
name is a BeautifulSoup Tag, not a string, so you're probably getting a __repr__ of the object that's suitable for a terminal that doesn't support UTF-8 (\xc5\xa0 is the Python byte sequence for the UTF-8 encoding of Š). name.text is probably the value you actually want; it should be a Unicode string.
If you're using Windows, it's best to avoid printing to the console as its console doesn't easily support UTF-8. You could use https://pypi.python.org/pypi/win_unicode_console, but it's easier to just write your output to a file instead.
I've cleaned up your code a little to simplify it (quicker null checks) and to write the output to a UTF-8 encoded file:
# io provides better access to files with working universal newline support
import io

# open a file in text mode, encoding all output to utf-8
output_file = io.open("output.txt", "w", encoding="utf-8")

rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    # if not None and not empty
    if author:
        for name in author.findAll("name"):
            # .text contains the actual Unicode string value
            if name.text:
                names = name.text.split(",", 1)
                # if the string contained a comma, the list has two elements,
                # else it's just a 1-length list
                for flname in names:
                    # remove any whitespace on either side
                    output_file.write(flname.strip() + "\n")

output_file.close()
I am writing a BitTorrent client. One of the steps involved requires that the program send an HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
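A minimal sketch of that left-to-right walk in Python (the function name percent_encoded_to_hex is mine, purely illustrative):

def percent_encoded_to_hex(esc):
    # Convert a percent-encoded string to a lower-case hex string.
    out = []
    i = 0
    while i < len(esc):
        if esc[i] == '%':
            # the next two characters are already hex digits; lower-case them
            out.append(esc[i + 1:i + 3].lower())
            i += 3
        else:
            # a literal character: output its ASCII code in hex
            out.append(format(ord(esc[i]), '02x'))
            i += 1
    return ''.join(out)

print(percent_encoded_to_hex('%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'))
# d90c3ce39418f0c5d98358e03349262b608cbf52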
Even though the original question was about C++, it might be useful to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse
def hex_str_to_esc_str(hex_str: str, *, encoding: str = 'Windows-1252') -> str:
    # decode the hex string as a Windows-1252 string
    win1252_str = binascii.unhexlify(hex_str).decode(encoding)
    # escape the string and return
    return urllib.parse.quote(win1252_str, encoding=encoding)

def esc_str_to_hex_str(esc_str: str, *, encoding: str = 'Windows-1252') -> str:
    # unescape the escaped string as a Windows-1252 string
    win1252_str = urllib.parse.unquote(esc_str, encoding=encoding)
    # encode the string, hexlify, and return
    return win1252_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
chardet.detect(
binascii.unhexlify(hex_str)
)
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}