I have a web service which serves data to various applications in an XML format. Within that XML is a text file which is Base64 encoded.
This web service receives a string and saves it as a text file using the windows-1252 (Western European) character set. It then feeds the filename to a third-party DLL, which opens the file, converts it to ISO-5426, and saves the result to a new file. I should state that I have no control over this DLL.
The web service must then open the results file, convert it to Base64, and return it as part of the XML results.
The problem I'm having is that it seems you can't open a file in .NET with the ISO-5426 character set, and if the file is opened any other way, characters get changed or removed.
Is it possible to import the ISO-5426 standard into .NET or is there some other way to fix this?
I'm not an expert on character sets or ISO standards so please be gentle.
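Worth noting: Base64 operates on raw bytes, not characters, so the results file can be read as binary and encoded without ever being decoded as ISO-5426. A minimal sketch of the idea (in Python for brevity, with a hypothetical filename; the .NET equivalent would be File.ReadAllBytes plus Convert.ToBase64String):

import base64

# Read the DLL's results file as raw bytes -- no character decoding
# is needed, because Base64 encodes bytes, not text.
with open('results.dat', 'rb') as f:  # hypothetical filename
    encoded = base64.b64encode(f.read()).decode('ascii')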
We have users who need to be able to export data containing UTF-8 characters to a CSV, which they open in Excel on Mac machines.
NOTE: We don't want our users to have to go to the Data tab, click Import From Text, then... We want them to be able to open the file immediately after downloading and have it display the correct info.
At first, I thought this was just an encoding/decoding problem, since we are using Python 2.7 (we're actively working on upgrading to Python 3.6), but after that was fixed, I discovered Excel was the cause of the problem (the CSV works fine when opened in a text editor or even Numbers). The solution I am trying involves adding the UTF-8 BOM to the beginning of the file, as I read somewhere that this would let Excel know the file requires UTF-8.
# Here response is a file-like object that works when used like this;
# we can export CSVs fine when they don't need UTF-8
writer = csv.writer(response)
writer.writerow("0xEF0xBB0xBF")
I was hoping that just adding the UTF-8 BOM to the beginning of the CSV file like this would make Excel realize it needed to use UTF-8 encoding when opening the file, but alas it does not work. I am not sure if this is because Excel for Mac doesn't support this or if I simply added the BOM incorrectly.
Edit: I'm not sure why I didn't mention it, as it was critical to the solution, but we are using Django. I found a Stack Overflow post that gave the solution (which I've included below).
Because we are using Django, we were able to just include:
response.write('\xEF\xBB\xBF')
before creating a csv writer and adding the content to the csv.
Another idea that probably would have led to a solution is opening the file normally, adding the BOM, and then creating a csv writer. (Note: I did not test this idea, but if the above solution doesn't work for someone, or they aren't using Django, it is worth a try.)
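For anyone stitching this together, here is roughly what the working Django view looks like end to end (a minimal sketch assuming Python 2.7 as described above; the view name, filename, and row values are made up):

import csv
from django.http import HttpResponse

def export_csv(request):
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="export.csv"'
    # Write the UTF-8 BOM bytes first so Excel detects the encoding.
    response.write('\xEF\xBB\xBF')
    writer = csv.writer(response)
    # On Python 2.7 the csv module expects byte strings, so encode any
    # unicode values as UTF-8 before handing them to the writer.
    writer.writerow([u'Jos\xe9'.encode('utf-8'), u'Z\xfcrich'.encode('utf-8')])
    return response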
While writing records to a flat file using an Informatica ETL job, Greek characters come out as boxes. We can see the original characters in the database. At the session level, we are using UTF-8 encoding. We have a multi-language application and need to process Chinese, Russian, Greek, Polish, Japanese, etc. characters. Please suggest a fix.
Try changing your code page. I also faced this kind of issue: we were using ANSI encoding, so we created a separate integration service with a different encoding and the file ran successfully.
There is an easy option. In session properties, select the target flat file, then click Set File Properties. There you can change the code page and choose UTF-8. By default it is ANSI, which is why you are facing this issue.
I have a .NET-based Excel add-in that uses a C++/CLI library to read/write proprietary files. The C++/CLI library links to some core C++ libraries that provide classes to read and write these files. The core classes use std::string and std::ifstream/std::ofstream to read/write data in the proprietary files.
So when saving data, it goes from:
Excel >> .NET AddIn (string) >> C++/CLI Lib (System::String) >> C++ Core Lib (std::string)
All works fine with simple text (ASCII) files. Now I have a text file (ANSI encoding) with some Japanese characters in it, saved on a Japanese machine; I think it uses the SHIFT-JIS encoding by default. This file LOADS fine (I see the same characters in Excel as I see in Notepad), but if I save it back unmodified, the characters change to ??. I think it's because the std::string and std::ofstream classes are writing the data incorrectly as a plain ASCII stream.
I use the following syntax while reading the file to convert them to .NET strings:
%String(mystring.c_str());
and the following while converting them from .NET strings to std::strings while writing:
msclr::interop::marshal_as<std::string>(mydotnetstring)
The problem seems to me to be with encoding, but I am not crystal clear on what exactly is happening. I want to understand WHY the file is READ correctly but not WRITTEN correctly.
I have modified my application to read/write UTF-8, and that solves the problem, but I still want to know the underlying cause.
Okay, I think I have found the underlying problem: the msclr::interop::marshal_as<std::string> method internally calls the WideCharToMultiByte API with the CP_THREAD_ACP option, which means the code page of the active THREAD is used. This .NET add-in runs inside the Excel process, and the current thread has a different code page (952 on a Japanese system) than the default code page (1252). I verified this by checking the return value of the marshal_as call in a sample application versus the .NET add-in on a Japanese machine. The sample application converted a two-character Japanese string to 4 bytes, whereas the add-in converted it to just 2 unknown '?' bytes.
SOLUTION
marshal_as does not provide a way to change this code page, so the solution is to marshal .NET strings by calling the WideCharToMultiByte API directly with the CP_ACP option. It worked for me.
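To make the difference concrete, here is a rough illustration of the two code pages (sketched with Python's ctypes on Windows purely to demonstrate the API call; the real fix belongs in the C++/CLI layer, and the sample string is arbitrary):

import ctypes

CP_ACP = 0         # system default ANSI code page
CP_THREAD_ACP = 3  # ANSI code page of the calling thread

kernel32 = ctypes.windll.kernel32

def narrow(text, codepage):
    # First call with a NULL buffer asks for the required byte count
    # (cchWideChar = -1 means the input is NUL-terminated).
    size = kernel32.WideCharToMultiByte(codepage, 0, text, -1, None, 0, None, None)
    buf = ctypes.create_string_buffer(size)
    kernel32.WideCharToMultiByte(codepage, 0, text, -1, buf, size, None, None)
    return buf.raw[:size - 1]  # drop the trailing NUL

# Two Japanese characters narrow to 4 bytes on cp932 but collapse to
# b'??' (2 bytes) on a code page that cannot represent them.
print(narrow(u'\u65e5\u672c', CP_ACP))
print(narrow(u'\u65e5\u672c', CP_THREAD_ACP))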
I am writing a Word add-in which is supposed to store some of its own XML data per document using the Word object model and its CustomXMLPart. The problem I am now facing is the lack of IStream-like functionality for reading/writing XML to/from a CustomXMLPart: it only provides a BSTR interface, and I am puzzled how to handle UTF-8 XML with BSTRs. To my understanding, a UTF-8 XML file should really never have to undergo this sort of Unicode conversion. I am not sure what to expect as a result here.
Is there another way of using Word automation interfaces to store arbitrary custom information inside a DOCX file?
The "package" is an OPC document (Open Packaging Convention), which is basically a structured zip folder with a different extension (e.g. .pptx, .docx, .xps, etc.). You can get that file in stream and manipulate it any which way you like - but not artibitrarily. It will not be recognized as valid docx if you put things in the wrong places (not just xml elements, but also files in the folders inside the zip file). But if you're just talking "artibitrary" meaning CustomXMLPart, then that's okay.
This is a good kicker page to learn more about the Open XML SDK, which (if you're up to it) allows somewhat easier access to the file formats than using (.NET) System.IO.Packaging or a third-party zip library. To go deeper, grab the free eBook Open XML Explained.
With the Open XML SDK (again, this can all be done without the SDK) in .NET, this is what you'll want to do: How to: Insert Custom XML to an Office Open XML Package by Using the Open XML API.
I'm trying to consume a SharePoint webservice from ColdFusion via cfinvoke ('cause I don't want to deal with (read: parse) the SOAP response itself).
The SOAP response includes a byte-order-mark character (BOM), which produces the following exception in CF:
"Cannot perform web service invocation GetList.
The fault returned when invoking the web service operation is:
'AxisFault
faultCode: {http://www.w3.org/2003/05/soap-envelope}Server.userException
faultSubcode:
faultString: org.xml.sax.SAXParseException: Content is not allowed in prolog."
The standard for UTF-8 encoding optionally includes the BOM character (http://unicode.org/faq/utf_bom.html#29). Microsoft almost universally includes the BOM character with UTF-8 encoded streams. From what I can tell there’s no way to change that in IIS. The XML parser that JRun (ColdFusion) uses by default doesn’t handle the BOM character for UTF-8 encoded XML streams. So, it appears that the way to fix this is to change the XML parser used by JRun (http://www.bpurcell.org/blog/index.cfm?mode=entry&entry=942).
Adobe says that it doesn't handle the BOM character (see the comments from anonymous and halL on May 2nd and 5th).
http://livedocs.adobe.com/coldfusion/8/htmldocs/Tags_g-h_09.html#comments
I'm going to say that the answer to your question (is it possible?) is no. I don't know that definitively, but the poster who commented just above halL (in the comments on that page) gave a workaround for the problem -- so I assume it is possible to deal with when parsing manually.
You say that you're using CFInvoke because you don't want to deal with the SOAP response yourself. It looks like you don't have any choice.
As Adam Tuttle said already, the workaround is on the page that you linked to:
<!--- Remove BOM from the start of the string, if it exists --->
<cfif Left(responseText, 1) EQ chr(65279)>
<cfset responseText = mid(responseText, 2, len(responseText))>
</cfif>
It sounds like ColdFusion is using Apache Axis under the covers.
This doesn't apply exactly to your situation, but I've had to deal with this issue once before when consuming a .NET web service with Apache Axis/Java. The only solution I was able to find (since the owner of the web service was unwilling to change anything on his end) was to write a Handler class that Axis would plug into the pipeline, which would delete the BOM from the message if it existed.
So perhaps it's possible to configure Axis through ColdFusion? If so, you can add additional Handlers to the message-handling flow.