CSV import produces encoding problem in Stata

I have a problem regarding the import of a CSV file.
The following code produces an encoding problem with the "*" sign (asterisk), even though the variable looks fine in the provided data sample.
import delimited using `file', case(preserve) stringcols(_all) encoding(utf8) clear
I first tried it without the encoding(utf8) part, thinking that with Stata 16 it is no longer necessary. However, in both cases, with or without the option, I get the little question marks instead of the asterisk.
Does anybody have an idea what could cause the problem and how I can fix it?
Tools I use in the workflow:
Stata 16
UltraEdit (standard encoding set to ANSI Latin I)
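Not a full answer, but the UltraEdit setting suggests the file is actually saved as ANSI Latin 1 rather than UTF-8, so one hedged suggestion is to either tell import delimited the file's actual encoding instead of utf8, or to re-encode the file to UTF-8 once before importing. A minimal Python sketch of the latter, with placeholder file names:

# A minimal sketch, assuming the CSV really is saved as ANSI Latin 1
# (UltraEdit's "ANSI Latin I" usually corresponds to Windows-1252).
# File names are placeholders.
with open("data_latin1.csv", "r", encoding="cp1252") as src:
    content = src.read()
with open("data_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(content)

After a conversion like this, the original import delimited line with encoding(utf8) should read the asterisk correctly, assuming the character was a plain Latin-1 byte and not something already mangled upstream.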

Related

After switching from libdb2.so to libdb2o.so, program throws a DB2/LUW SQL "out of range" error (-302) for the same size/length of std::string

I have a strange question about the use of libdb2.so and libdb2o.so. My C++ program has a string with length/size 33. I'm trying to write that string into a database character column which has a size of 35 characters. Important: the string contains a German umlaut (ü) and several special characters. It looks like this:
Xüxxxxx / Xxxxxxxxxxxx-Xxxxxxxxxxx
When my program ran with the hard-linked libdb2.so it worked perfectly, and I had no SQL error for many months. After a few months I thought it would be a great idea to use libdb2o.so, which I did not hard-link inside my C++ program. All other SQL statements went well, but my INSERT statements got an error like this:
[IBM][CLI Driver][DB2/AIX64] SQL0302N The value of a host variable in the EXECUTE or OPEN statement is out of range for its corresponding use. SQLSTATE=22001
After some analysis I recognized that I could have an encoding problem, but not inside my std::string. The size of the std::string did not change; it is still 33. If I replace the umlaut with a plain character (ü => u) it works fine, but that is not what I want.
I thought that by using libdb2o.so I would get UTF-8 as the standard encoding, but it looks like that is not the case, or is it? When I tried to set UTF-8 inside a connection string like the one below, it did not work and I got an "unknown parameter inside connection string" error.
CONNECTION_STRING=DRIVER=libdb2o.so;Database=XXXX;Protocol=tcpip;Hostname=XXXX;Servicename=XXXX;UID=XXXX;PWD=XXXX;
Well, I did not find a simple solution (okay, I found NO solutions or explanations), so I would be grateful if someone knows how I could fix this problem. Any ideas on how to simply use UTF-8 in my INSERT without changing the content of my std::string?
Am I on the wrong track in thinking that the problem could be UTF-8 encoding? Any other ideas?
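No DB2-side answer here, but a small illustration of why a value with umlauts can suddenly be "out of range" even though its character count is unchanged: the byte length of the same text differs between encodings, which matters when the column length is counted in bytes. A Python sketch using the sample value from the question:

# Same characters, different byte counts depending on the encoding;
# 'ü' occupies 1 byte in Latin-1 but 2 bytes in UTF-8.
text = u"Xüxxxxx / Xxxxxxxxxxxx-Xxxxxxxxxxx"
print(len(text))                    # number of characters
print(len(text.encode("latin-1")))  # byte count when encoded as Latin-1
print(len(text.encode("utf-8")))    # byte count when encoded as UTF-8 (larger)

Whether the CLI driver performs such a conversion internally when switching to libdb2o.so is an assumption, not something the error message alone confirms.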

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically but have run into problems with a special character strewn throughout the file. The file is way too large to scan manually looking for specific characters. Most of the other unwanted special characters I've been able to get rid of with some regex pattern. But there is a box character, similar to "□". When I try to copy the character from the actual text file and paste it here I get "�", so the example of the box is from the Windows character map, which lists the code 'U+25A1'; I'm not sure how to interpret that or whether it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
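For reference, the same search can be scripted in Python, where the character can be written as \u25A1 (WHITE SQUARE); the file name below is a placeholder:

import re

# Read the file as UTF-8 and look for WHITE SQUARE (U+25A1).
with open("input.txt", encoding="utf-8") as fh:
    text = fh.read()

print(len(re.findall("\u25A1", text)))  # how many literal white squares there are
cleaned = re.sub("\u25A1", "", text)    # or remove them outright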
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Putting <meta charset=UTF-8> in the header of the page is usually enough.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square originally comes from, but when I paste it into the query field of the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table), it gives the hex character code for the "Replacement Character", which is the diamond with the question mark.
Using that code in a regex expression, \x{FFFD}, in a Notepad++ search found all the squares, although it recognizes them as the Replacement Character.
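If the squares really are stored as U+FFFD (the replacement character), the same replacement can also be scripted outside Notepad++; a Python sketch with placeholder file names:

# Undecodable bytes are turned into U+FFFD by errors="replace", so both
# genuine replacement characters and broken bytes end up being removed.
with open("input.txt", encoding="utf-8", errors="replace") as fh:
    text = fh.read()

cleaned = text.replace("\ufffd", "")

with open("cleaned.txt", "w", encoding="utf-8") as fh:
    fh.write(cleaned)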

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is Thai text) to a readable format. I get this value from a smart card, and it basically was working on Windows but not on Linux.
If I print in my Python console, I get:
����ô
I tried to follow some google hints but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be Unicode text. Instead, it looks like it is in one of the Thai encodings. Hence, you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with non-Unicode strings in Python 2, you may try myString.decode("tis-620"), or even sys.setdefaultencoding("tis-620").
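For what it's worth, a quick check of that suggestion, assuming (as above) that the last byte is \xd2 rather than the truncated \xd:

data = b"\xc3\xd8\xe8\xa7\xc3\xb4\xd2"  # bytes as read from the smart card
print(data.decode("tis-620"))           # -> รุ่งรดา, matching the answer above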

Python 2.7 "latin-1" encoding used instead of "UTF-8"

I am aware that there are plenty of discussions on the "UTF-8" encoding issue on Python 2 but I was unable to find a solution to my problem so far. I am currently creating a script to get the name of a file and hyperlink it in xlwt, so that the file can be accessed by clicks in the spreadsheet. Problem is, some of the names of these files include non-ASCII characters.
Question 1
I used the following line to retrieve the name of the file. There is only one file in the folder by the way.
>>f = filter(os.path.isfile, os.listdir(tmp_path))[0]
And then
>>print f
'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
>>print sys.stdout.encoding
'UTF-8'
>>f.decode("UTF-8")
*** UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 76: invalid continuation byte
From browsing the discussions here, I realized that "\xe7\xe3o" is not a "UTF-8" encoding. Running the following line seems to back this point.
>>f.decode("latin-1")
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
My question is then, why is the variable f being encoded in "latin-1" when the system encoding is set to "UTF-8"?
Question 2
While f.decode("latin-1") gives me the output that I want, I am still unable to supply the variable to the hyperlink function in the spreadsheet.
>>data.append(["File", xlwt.Formula('HYPERLINK("%s";"%s")' % (os.path.join(dl_path,f.decode("latin-1")),f.decode("latin-1")))])
*** FormulaParseException: can't parse formula HYPERLINK("u'H:\\Mad Lab\\SE Doc Crawler\\bovespa\\download\\521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's;"u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's)
Apparently, the closing double quote got eaten up and was replaced by a " 's" suffix. Can somebody help to explain what's going on here? 0.0
Oh and if someone can suggest a solution to Question 2 above then I will be very grateful - for you would have saved my weekend from misery!
Thanks in advance all!
Welcome to the confusing world of encoding! There's at least file encoding, terminal encoding and filename encoding to deal with, and all three could be different.
In Python 2.x, the goal is to get a Unicode string (different from str) from an encoded str. The problem is that you don't always know the encoding used for the str so it's difficult to decode it.
When using listdir() to get filenames, there's a documented but often overlooked quirk - if you pass a str to listdir() you get encoded strs back. These will be encoded according to your locale. On Windows these will be an 8bit character set, like windows-1252.
Alternatively, pass listdir() a Unicode string instead.
E.g.
os.listdir(u'C:\\mydir')
Note the u prefix
This will return properly decoded Unicode filenames. On Windows and OS X, this is pretty reliable as long as your environment locale hasn't been messed with.
In your case, listdir() would return:
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
Again, note the u prefix. You can now print this to your PyCharm console with no modification.
E.g.
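# Note: tmp_path itself has to be a unicode string here (e.g. u'C:\\mydir');
# passing a plain str would again return encoded str filenames.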
f = filter(os.path.isfile, os.listdir(tmp_path))[0]
print f
As for Question 2, I did not investigate further but just printed the output as Unicode strings, rather than xlwt objects, due to time constraints. I am able to continue with the project, though without understanding what went wrong there. In that sense, the two questions above have been answered.

Importing Chinese Characters from Excel to Stata

I am trying to import Excel sheets that contain Chinese characters into Stata 13.1. Following the guidelines at the following link: "Chinese Characters in Stata", I am able to get Chinese characters read in Stata. For example, I have .dta files which contain Chinese characters and these are displayed correctly. The issue is that when I try to import Excel sheets that contain Chinese characters, they are imported as "????", a string of question marks of varying lengths. Is there a way to solve this issue?
Note: I am using Windows 8.1 but I think the method in the link above still applies.
It sounds like an issue with your file and not so much with Stata. Chinese characters are often (if not always) encoded as UTF-8. It's possible that your Excel sheet didn't do this correctly. If you're not required to import from Excel directly, maybe try opening the file in Excel, saving the sheet as a "*.csv" (Comma Separated Values) file, and make sure to select the option which asks for UTF-8 encoding. Then use insheet using "file.csv", names to get the file into Stata with the first row made into variable names.
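As a quick sanity check you could also look at the exported CSV outside both Excel and Stata; if the question marks are already there, the characters were lost during Excel's export rather than during the Stata import. A Python sketch with a placeholder file name:

with open("file.csv", "rb") as fh:
    raw = fh.read()

print(raw[:200])       # inspect the raw bytes at the start of the file
print(b"????" in raw)  # are literal question marks already in the export?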