Importing Chinese Characters from Excel to Stata

I am trying to import Excel sheets that contain Chinese characters into Stata 13.1. Following the guidelines at the following link: "Chinese Characters in Stata", I am able to get Chinese characters to display in Stata. For example, I have .dta files which contain Chinese characters, and these are displayed correctly. The issue is that when I try to import Excel sheets that contain Chinese characters, these are imported as "????" - strings of question marks of varying lengths. Is there a way to solve this issue?
Note: I am using Windows 8.1, but I think the method in the link above still applies.

It sounds like an issue with your file rather than with Stata. Chinese characters are often (if not always) encoded as UTF-8, and it's possible that your Excel sheet wasn't saved that way. If you're not required to import from Excel directly, try opening the file in Excel, saving the sheet as a .csv (comma-separated values) file, and making sure to select the option for UTF-8 encoding. Then use insheet using "file.csv", names to get the file into Stata with the first row made into variable names.
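If your version of Excel doesn't offer a UTF-8 option when saving the CSV, you can re-encode the file yourself before importing. A minimal Python sketch, assuming the export came out as GB18030 (a common Chinese-locale default; change SOURCE_ENCODING if your file uses Big5 or something else, and note the filenames are placeholders):

# Re-encode an Excel CSV export to UTF-8 so Stata can read the Chinese text.
SOURCE_ENCODING = "gb18030"  # assumption: adjust to your file's actual encoding

with open("file.csv", "r", encoding=SOURCE_ENCODING) as src:
    content = src.read()

with open("file_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(content)

Then insheet using "file_utf8.csv", names as above.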

Related

CSV import produces encoding problem in Stata

I have a problem regarding the import of a CSV file.
The following code produces an encoding problem with the "*" sign (asterisk), even though the variable looks fine in the provided data sample.
import delimited using `file', case(preserve) stringcols(_all) encoding(utf8) clear
I first tried it without the encoding(utf8) part, thinking that with Stata 16 it is no longer necessary. However, in both cases, with or without it, I get the little question marks instead of the asterisk.
Does anybody have an idea what could cause the problem and how I can fix it?
Tools I use in the workflow:
Stata 16
UltraEdit (standard encoding set to ANSI Latin I)
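One thing worth checking is whether the file on disk is actually UTF-8 at all, given the editor's ANSI Latin I default; telling Stata encoding(utf8) only helps if that is true. A quick Python sketch to test the raw bytes (the filename is a placeholder):

# Check whether the CSV is valid UTF-8 before declaring it so in Stata.
with open("file.csv", "rb") as f:
    raw = f.read()

try:
    raw.decode("utf-8")
    print("File decodes cleanly as UTF-8.")
except UnicodeDecodeError as err:
    print("Not valid UTF-8 at byte offset %d - probably saved as ANSI Latin I." % err.start)

If the decode fails, re-save the file as UTF-8 in the editor before importing.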

How to find string matches between a text file and CSV?

I am currently working with Python 2.7 on a standalone system. My preferred method for solving this problem would be to use pandas DataFrames for the comparison. However, I do not have access to install the library on the system I'm working with. So my question is: how else could I take a text file and look for matches of its strings in a CSV?
I have a main CSV file with many fields (for relevance, the first one is timestamps) and several other text files that each contain a list of timestamps. How can I compare each of the .txt files with the main CSV and, if a match is found on the specific field, grab the entire row from the CSV and output that result to another CSV?
Example:
example.csv
timestamp,otherfield,otherfield2
1.2345,,
2.3456,,
3.4567,,
5.7867,,
8.3654,,
12.3434,,
32.4355,,
example1.txt
2.3456
3.4565
example2.txt
12.3434
32.4355
If there are any questions I'm happy to answer them.
You can load all the files into lists, then search the lists:
with open("example.csv", "r") as file_handle:
    example_file_content = file_handle.read().splitlines()
with open("example1.txt", "r") as file_handle:
    example1_file_content = file_handle.read().splitlines()
# Compare the first field of each CSV row (the timestamp)
# against the list of timestamps from the text file.
for line in example_file_content:
    if line.split(",")[0] in example1_file_content:
        print("match found; line is " + line)
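To also write each matching row out to another CSV, as the question asks, the standard-library csv module works without pandas and is available on Python 2.7. A sketch (the output filename matches.csv is illustrative):

import csv

# Timestamps to match, read from the text file.
with open("example1.txt", "r") as f:
    wanted = set(f.read().splitlines())

# Copy every row of the main CSV whose first field is a wanted timestamp.
# Binary mode is what the csv module expects on Python 2.7.
with open("example.csv", "rb") as src, open("matches.csv", "wb") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] in wanted:
            writer.writerow(row)

Repeat with example2.txt (or loop over the text files) to handle the other lists.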

Trim unprintable characters using Python

I have text content coming in different languages like Chinese, Hebrew, and so on. Using the Google Translate API, I am converting the text into 'en'. The problem is that the translator fails when it encounters special characters like \x11 or \x01 (I am unable to display those characters here) and drops that set of records. Please suggest a safe way to do this conversion without dropping records.
You can strip out the characters Python considers unprintable before sending the text for translation:
data = ''.join(c for c in data if c.isprintable())
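Note that str.isprintable() is Python 3 and also rejects newlines and tabs, so the line above will flatten any line breaks in the text. If those should survive, a slightly gentler sketch drops only Unicode control characters (category "Cc"), which covers \x11 and \x01:

import unicodedata

def strip_control_chars(text):
    # Remove control characters (Unicode category "Cc") but keep
    # ordinary whitespace such as newline and tab.
    return "".join(c for c in text if unicodedata.category(c) != "Cc" or c in "\n\t")

print(strip_control_chars(u"hello\x11 world\x01\n"))  # prints "hello world" plus the kept newline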

Importing Polish character file in SAS

I have a CSV with Polish characters in it, but when I import it in SAS, certain Polish characters are replaced by "?" or other random characters. How do I handle this?
I have a list of all the possible Polish characters, and I don't mind them being replaced by their English counterparts.
You need to set the appropriate file encoding on your infile statement, e.g. encoding="UTF-8".
SAS documentation:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146932.htm
http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002610945.htm
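If fixing the encoding isn't enough, the fallback the question mentions - replacing Polish characters with their English counterparts - can be done as a preprocessing step before SAS reads the file. A Python sketch, assuming the CSV is UTF-8 on disk (filenames are placeholders):

# Replace Polish diacritics with ASCII counterparts before the SAS import.
POLISH_TO_ASCII = {
    u"ą": "a", u"ć": "c", u"ę": "e", u"ł": "l", u"ń": "n",
    u"ó": "o", u"ś": "s", u"ź": "z", u"ż": "z",
    u"Ą": "A", u"Ć": "C", u"Ę": "E", u"Ł": "L", u"Ń": "N",
    u"Ó": "O", u"Ś": "S", u"Ź": "Z", u"Ż": "Z",
}

with open("polish.csv", "r", encoding="utf-8") as src:
    text = src.read()
for polish, replacement in POLISH_TO_ASCII.items():
    text = text.replace(polish, replacement)
with open("polish_ascii.csv", "w", encoding="utf-8") as dst:
    dst.write(text)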

Read Chinese Characters in DICOM Files

I have just started to get a feel for the DICOM standard. I am trying to write a small program that reads a DICOM file and dumps the information to a text file. I have a dataset that has the patient names in Chinese. How can I read and store these names?
Currently, I am reading the names as char* from the DICOM file, converting the char* to wchar_t* using code page "950" for Chinese, and writing to a text file. Instead of seeing Chinese characters I see *, ?, and % in my text file. What am I missing?
I am working in C++ on Windows.
If the text file contains UTF-16, have you included a BOM?
There may be multiple issues at hand.
First, do you know the character encoding of the Chinese name, e.g. Big5 or GB*? See http://en.wikipedia.org/wiki/Chinese_character_encoding
Second, do you know the encoding of your output text file? If it is ASCII, then you probably won't ever be able to view the Chinese characters, in which case I would suggest changing it to a Unicode encoding such as UTF-8.
Then, when you read the Chinese name, convert the raw bytes and write out the result. For example, if the DICOM file stores it in Big5 and your text file is UTF-8, you will need a Big5-to-UTF-8 converter.
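The byte-level conversion itself is just a decode/encode pair. A minimal Python sketch of the idea (the sample name is hypothetical; in a real file, check the Specific Character Set element (0008,0005) to learn the actual encoding rather than assuming Big5):

# Pretend these are the raw bytes read from the DICOM patient-name element.
raw_name = u"陳大文".encode("big5")  # hypothetical Big5-encoded name

# Big5 -> UTF-8: decode the source encoding, re-encode as UTF-8.
utf8_name = raw_name.decode("big5").encode("utf-8")

with open("names.txt", "wb") as out:
    out.write(utf8_name + b"\n")  # plain UTF-8 output, no BOM required

The same two-step pattern applies in C++ with MultiByteToWideChar (code page 950) followed by WideCharToMultiByte (CP_UTF8).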