Spark SQL Character Encoding Issues - python-2.7

I'm reading in an RDD/DF like so:
testRDD = sc.textFile('s3n://sample.txt') \
    .map(lambda x: x.split('|')) \
    .map(lambda x: Row(event_date=x[0])) \
    .cache()
testDF = sqlContext.createDataFrame(testRDD)
testDF.registerTempTable("testDF")
The RDD returns data that looks fine:
for i in testRDD.take(1):
    print(i)
Row(event_date=u'2016-04-01 00:00:17')
But the DF comes back with encoding issues: the first several characters are missing and the string ends in a run of raw bytes:
for i in testDF.take(1):
    print(i)
Row(event_date=u'01 00:00:17\x00\x00\x00\x00\x00\x05\x00\x00')
Any ideas where I'm going wrong? I've tried using decode('utf-8') on the incoming string with no luck.
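One thing worth trying, based on the use_unicode behavior discussed in the PySpark question further down, is to read the file as plain str and decode explicitly. A minimal sketch, assuming the file really is pipe-delimited UTF-8 text:
from pyspark.sql import Row

# Sketch: read raw str instead of unicode (use_unicode=False), then decode explicitly
testRDD = sc.textFile('s3n://sample.txt', use_unicode=False) \
    .map(lambda x: x.decode('utf-8').split('|')) \
    .map(lambda x: Row(event_date=x[0])) \
    .cache()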

Related

Extract the second instance of the website pattern in a string using pandas str.contains

I am trying to extract the 2nd instance of a www website from the string below. This is in a pandas dataframe.
https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
So I want to extract the following string and store it in a separate column.
https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD
Final Dataframe:
sr.no  link_orig            link_extracted
1      <the above string>   <the extracted string that starts from https://www.accenture.com>
Below is the code snippet:
df['link_extracted'] = df['link_orig'].str.contains('www.accenture.com', regex=False, na=np.NaN)
I am getting the following error:
ValueError: Cannot mask with non-boolean array containing NA / NaN values
What am I missing here? If I have to use regex, then what should be the approach?
The error message means you probably have NaNs in the link_orig column. That can be fixed by adding a fillna('') to your code.
Something like
df['link_extracted'] = df['link_orig'].fillna('').str.contains ...
That said, I'm not sure the rest of your code will do what you want. str.contains will just return True if www.accenture.com appears anywhere in the link_orig string.
If the link you are trying to extract always contains www.accenture.com then you can do this
df['link_extracted'] = df['link_orig'].fillna('').str.extract(r'(www\.accenture\.com.*)')
Personally, I'd use Series.str.extract() for this. E.g.:
df['link_extracted'] = df['link_orig'].str.extract('http.*(http.*)')
This matches http, followed by anything (greedily), then captures the last http followed by everything after it.
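A quick sketch of that on the string from the question (the fillna('') guard covers NaN rows; expand=False keeps the result a Series):
import pandas as pd

url = ('https://google.com/url?q=https://www.accenture.com/in-en/insights/'
       'software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1'
       'AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD')
df = pd.DataFrame({'link_orig': [url, None]})

# The greedy '.*' pushes the capture group to the LAST 'http' in the string
df['link_extracted'] = df['link_orig'].fillna('').str.extract(
    r'http.*(http.*)', expand=False)
print(df['link_extracted'][0])  # https://www.accenture.com/in-en/insights/...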
An alternate approach would be to use urlparse.
You can use urllib.parse module:
import pandas as pd
from urllib.parse import urlparse, parse_qs
url = 'https://google.com/url?q=https://www.accenture.com/in-en/insights/software-platforms/core-banking-on-cloud&sa=U&ved=2ahUKEwiQ75fwvYD1AhXOYMAKHXofCeoQFnoECAgQAg&usg=AOvVaw02sP402HcesId4vbgOaspD'
df = pd.DataFrame({'sr.no':[1], 'link_orig':[url]})
extract_q = lambda url: parse_qs(urlparse(url).query)['q'][0]
df['link_extracted'] = df['link_orig'].apply(extract_q)
Output:
>>> df
sr.no link_orig link_extracted
0 1 https://google.com/url?q=https://www.accenture... https://www.accenture.com/in-en/insights/softw...

python file reading and splitting the words

I am reading a file in Python and splitting it on '\n'. When I print the split list, it gives 'Magni\xef\xac\x81cent Mary' instead of 'Magnificent Mary'.
Here is my code...
with open('/home/naveen/Desktop/answer.txt') as ans:
    content = ans.read()
content = content.split('\n')
print content
note: answer.txt contains following lines
Magnificent Mary
Flying Sikh
Payyoli Express
The output of my program shows the garbled 'Magni\xef\xac\x81cent Mary' described above.
The problem is in your text file: there are some Unicode characters in "Magnificent Mary". If you fix that, your program should work. If you want to read it with the Unicode characters intact, you have to properly decode the text as UTF-8.
Have a look at this one (assuming you want to use Python 2): Backporting Python 3 open(encoding="utf-8") to Python 2
python2
import codecs

with codecs.open(filename='/Users/emily/Desktop/answers.txt', mode='rb', encoding='UTF-8') as ans:
    content = ans.read().splitlines()
for i in content:
    print i
If you can use python3, you can actually do this:
with open('/home/naveen/Desktop/answer.txt', encoding='UTF-8') as ans:
    content = ans.read().splitlines()
print(content)
There is a problem with your 'f' in Magnificent Mary. It is not the normal f; it is the LATIN SMALL LIGATURE FI (U+FB01, encoded in UTF-8 as \xef\xac\x81). You can simply delete the ligature and retype 'fi' in gedit.
To verify the difference, simply include
print [(ord(a),a) for a in (file.split("\n"))[0]]
at the end of your code, for both versions of the 'f'.
If there is no way to edit the file, you could first decode the string to unicode and then use Python's unicodedata module.
import unicodedata

file = open("answer.txt")
file = file.read().decode('utf-8')
# NFKD decomposes the ligature into plain 'f' + 'i'; the ascii step drops anything left over
print unicodedata.normalize('NFKD', file).encode('ascii', 'ignore').split("\n")
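For reference, a minimal Python 3 sketch of the same NFKD trick, assuming the same answer.txt:
import unicodedata

with open('/home/naveen/Desktop/answer.txt', encoding='utf-8') as f:
    text = f.read()

# NFKD replaces the U+FB01 ligature with plain 'f' + 'i'
print(unicodedata.normalize('NFKD', text).split('\n'))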

How to parse space separated data in pyspark?

I have the kind of data shown below, which is space separated. I want to parse it on spaces, but I get an issue when a particular element has a “space” in it.
2018-02-13 17:21:52.809 “EWQRR.OOM” “ERW WERT11”
I have used the code below:
import shlex
rdd = line.map(lambda x: shlex.split(x))
but it's returning results containing raw bytes like \x00\x00\x00.
Use re.findall() with the regex “.+?”|\S+, or you can use “[^”]*”|\S+ (suggested by @ctwheels; it performs better).
import re
rdd = line.map(lambda x: re.findall(r'“.+?”|\S+', x))
Input:
"1234" "ewer" "IIR RT" "OOO"
Getting Output:
1234, ewer, IIR, RT, OOO
Desired Output:
1234, ewer, IIR RT, OOO
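A quick local check of the pattern outside Spark (note the curly quotes stay in the matched fields unless you strip them afterwards):
import re

line = '2018-02-13 17:21:52.809 “EWQRR.OOM” “ERW WERT11”'
print(re.findall(r'“.+?”|\S+', line))
# ['2018-02-13', '17:21:52.809', '“EWQRR.OOM”', '“ERW WERT11”']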
By default, all text lines are encoded as unicode if you are using sparkContext's textFile api, as the api documentation of textFile says:
Read a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.
If use_unicode is False, the strings will be kept as str (encoding
as utf-8), which is faster and smaller than unicode. (Added in
Spark 1.2)
And by default this option is true
@ignore_unicode_prefix
def textFile(self, name, minPartitions=None, use_unicode=True):
And that's the reason you are getting characters like \x00\x00\x00 in the results.
You should include use_unicode option while reading the data file to rdd as
import shlex
rdd = sc.textFile("path to data file", use_unicode=False).map(lambda x: shlex.split(x))
Your results should be
['2018-02-13', '17:21:52.809', 'EWQRR.OOM', 'ERW WERT11']
You can even include utf-8 encoding in the map function, as:
import shlex
rdd = sc.textFile("path to the file").map(lambda x: shlex.split(x.encode('utf-8')))
I hope the answer is helpful

Python 2 str.decode('hex') in Python 3?

I want to send hex encoded data to another client via sockets in python. I managed to do everything some time ago in python 2. Now I want to port it to python 3.
Data looks like this:
""" 16 03 02 """
Then I used this function to get it into a string:
x.replace(' ', '').replace('\n', '').decode('hex')
It then looks like this (which is a type str by the way):
'\x16\x03\x02'
Now I managed to find this in python 3:
codecs.decode('160302', 'hex')
but it returns another type:
b'\x16\x03\x02'
And since everything I encode is not proper text, I cannot use utf-8 or other decoders, as there are invalid bytes in it (e.g. \x00, \xFF). Any ideas on how I can get the escaped string again, just like in Python 2?
Thanks
'str' objects in python 3 are not sequences of bytes but sequences of unicode code points.
If by "send data" you mean calling send then bytes is the right type to use.
If you really want the string (not 3 bytes but 12 unicode code points):
>>> import codecs
>>> s = str(codecs.decode('16ff00', 'hex'))[2:-1]
>>> s
'\\x16\\xff\\x00'
>>> print(s)
\x16\xff\x00
Note that you need to double backslashes in order to represent them in code.
There is a standard solution for Python 2 and Python 3. No imports needed:
hex_string = """ 16 03 02 """
some_bytes = bytearray.fromhex(hex_string)
In Python 3 you can treat it like a str (slice it, iterate over it, etc.), and you can also add byte strings: b'\x00', b'text' or bytes('text', 'utf8').
You also mentioned something about "utf-8". If you need a str back from the bytes, you can decode it easily with:
some_bytes.decode('utf-8')
As you can see, you don't need to clean the input first: fromhex() skips the whitespace. If you want to get back to a hexadecimal string, some_bytes.hex() will do it for you.
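A minimal round-trip sketch of that approach (the socket call is illustrative only):
hex_string = """ 16 03 02 """
payload = bytearray.fromhex(hex_string)  # fromhex() skips the whitespace

print(payload)        # bytearray(b'\x16\x03\x02')
print(payload.hex())  # '160302'

# bytes is the right type to pass to socket.send() in Python 3:
# sock.send(bytes(payload))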
a = """ 16 03 02 """.encode("utf-8")
#Send things over socket
print(a.decode("utf-8"))
Why not encode with UTF-8, send it over the socket, and decode with UTF-8 again?

Dataframe encoding

Is there a way to encode the index of my dataframe? I have a dataframe where the index is the name of international conferences.
df2= pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991','Citation1992'])
I keep getting:
KeyError: 'Leitf\xc3\xa4den der angewandten Informatik'
whenever my code references a foreign conference name containing non-ASCII letters.
I tried:
df.at[x.encode("utf-8"), 'col1']
df.at[x.encode('ascii', 'ignore'), 'col']
Is there a way around it? I tried to see if I could encode the dataframe itself when creating it, but it doesn't seem I can do that either.
If you're not using csv, and you want to encode your string index, this is what worked for me:
df.index = df.index.str.encode('utf-8')
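A small sketch of that fix (Python 3, with made-up citation counts; the index entry is the conference name from the KeyError, 'Leitf\xc3\xa4den' being the UTF-8 bytes of 'Leitfäden'):
import pandas as pd

df = pd.DataFrame({'Citation1991': [10], 'Citation1992': [12]},
                  index=['Leitfäden der angewandten Informatik'])
df.index = df.index.str.encode('utf-8')

# Lookups now use the byte-string form of the key
key = 'Leitfäden der angewandten Informatik'.encode('utf-8')
print(df.at[key, 'Citation1991'])  # 10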
Encoding should be handled when reading the input file, using the encoding option:
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8")
or if the file uses BOM,
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8-sig")
Just put "u" in front of utf8 strings such that
df2= pd.DataFrame(index=df_conf[u'Conference'], columns=[u'Citation1991',u'Citation1992'])
It will work.