Microsoft Speech to Text with non-ASCII file names - c++

I'm playing with the Microsoft Speech API and the sample code at https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/cpp/windows/console. The SpeechContinuousRecognitionWithFile() function in speech_recognition_samples.cpp is almost completely what I need.
Modifying this function, I replaced the name of the input file by that of another file which happened to be on my disk, "Blutgefäße.wav". This resulted in a std::range_error exception in the FromWavFileInput call, which could only be fixed by renaming the file to "Blutgefaesse.wav".
Is it expected behavior of this function to crash with international file names? Do I have to use a Unicode version of the API? If so, where do I find it?

The solution is to supply non-ASCII file names in UTF-8 coding:
u8"Blutgefäße.wav"
The transcript, by the way, is also UTF-8 encoded.

Related

c++ visual studio code appears weird on windows command line?

opening on windows
opening on powershell
I had the problem of exporting my c++ files from visual studio to my school server/folder, where I would use powershell to open and run the files on the command line. The code is all spaced out and weird font when I open them on file, and it appears as strange characters when I open them on the command line. This causes the code to not run at all.
How do I fix this issue?
edit: I have added some pictures for better reference
This may be because the file is encoded UTF-8 but being read as ANSI or vice-versa (or some other mismatch of encodings). Try navigating to the files directly in powershell, i.e.
cd C:\Users\username\source\repos\projectname\projectname
if you are using the default path, and open a file with notepad then click 'Save as' and check the encoding (left of save button). The default indicates what encoding is being used, try changing to UTF-8 or ANSI - whichever the default is not. If that doesn't work you can also try UTF-16 and UTF-32 (which I believe are listed as Unicode and Unicode Big Endian in notepad, but I haven't verified that).
In visual studio, per this article, you can do this from the save dialog by going to File > Save As and in the Save As dialog you click the down arrow next to Save and select Save with encoding... The default appears to be code 1252, I would recommend trying UTF-8 first and see if that works.
What you have is an encoding problem. The first file starts with Unicode byte order mark ÿþ. That is, UTF-16 little endian. Because UTF-16 uses two bytes for each character and your characters are from ASCII subset, each other byte is 00 - which is rendered as extra spaces.
The second file is harder to dechipher, as Nano doesn't render the characters properly. I'd guess it has exactly the same problem - UTF-16.
It seems that some version of Visual Studio ninja-changed default file encoding as UTF-16.
As how to fix the situation, save the files in ASCII or UTF8 encoding on your Windows system, then upload those just like #Ghost adviced.

How do you print unicode text to an output file?

I'm writing a C++ program in Visual Studio for class. I am using certain Unicode characters within my program like:
╚, █, ╗, ╝, & ║
I have figured out how to print these characters onto the console properly but I have yet to find a way to output it to a file properly.
In Visual Studio, choosing [OEM United States - Codepage 437] encoding when saving the .cpp file allows it to display properly onto the console.
Now I just need a way to output these characters to a file without errors.
Hopefully someone knows how. Thank You!
Create the file using a wofstream, which uses wide (wchar_t) characters instead of an ofstream (which uses char).

UTF-8 while compiling C++ for Geant 4

I am trying to run a program called Geant4, and I have this make file containing numerous .cc files involved in the program, but when I run it I get this error:
/Volumes/Silviu/Geant4/geant4.10.02.p02/examples/basic/B1/src/._B1PrimaryGeneratorAction.cc:1:4096: error:
source file is not valid UTF-8
I am not sure how to provide more details, but the point is that I have a file called B1PrimaryGeneratorAction.cc, but I am not sure what the error means, or what ._B1PrimaryGeneratorAction.cc actually represent. And what is a not valit UTF-8? Can anyone help please?
Inspect the "._B1PrimaryGeneratorAction.cc" file reported there. The name seems suspicious, maybe you extracted an archive incorrectly thus having a broken source file.
Also the position info ":1:4096" is kind of suspicious - it says that the invalid character is the 4096th character on the first line. Sounds like a corrupted file to me, just check that file contents (C++ source files usually do not have 4096 characters on a single line, although there can be exceptions indeed).
As far as I can tell, this might be the correct contents of the file:
http://geant4.web.cern.ch/geant4/UserDocumentation/Doxygen/examples_doc/html/B1PrimaryGeneratorAction_8cc_source.html
(and that file does not seem to contain any "strange" non-ASCII characters)

Python 2.7: Handeling Unicode Objects

I have an application that needs to be able to handle non-ASCII characters of unknown encoding. The program may delete or replace these characters (if they are discovered in a user dictionary file), otherwise they need to pass cleanly through unaltered. What's mind-boggling is, it works one minute, then I make some seemingly trivial change, and now it fails with UnicodeDecode, UnicodeEncode, or kindred errors. Addressing this has led me down the road of cargo cult programing--making random tweaks that get it working again, but I have no idea why. Is there a general-purpose solution for dealing this, perhaps even the creation of class that modifies the normal way Python deals with strings?
I'm not sure what code to include as about five separate modules are involved. Here is what I am doing in abstract terms:
Taking a text from one of two sources: text that the user has pasted directly into a Tkinter toplevel window; text captured from the Win32 clipboard via a hotkey command.
The text is processed, including the removal of whitespace charters, then certain characters/words are replaced or simply deleted based on a customizable user dictionary.
The result is then returned to the Tkinter GUI or the Win32 clipboard, depending on whether or not the keyboard shortcut was used.
Some details that may be relevant:
All modules use
# -*- coding: utf-8 -*-
The user dictionary is saved in UTF-16 LE with BOM (a function removes BOM characters when parsing the file). The file object is instantiated with
self.pf = codecs.open(self.pattern_fn, 'r', 'utf-16')
The text entry points for text are via a Tkinter GUI Text widget:
text = self.paste_to_field.get(1.0, Tkinter.END)
Or from the clipboard:
text = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
And example error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201d' in position
2: character maps to <undefined>
Furthermore, the same text might work when tested on OS X (where I do development work) but cause an error on Windows.
Regular expressions are used, however in this case no non-ASCIIs are included in the pattern. For non-ASCIIs I simply
text = text.replace(old, new)
Another thing to consider: for c in text type iterations are no good because a non-ASCII may look like several characters to Python. The normal word/character distinction no longer holds. Also, using bad_letter = repr(non_ASCII) doesn't help since str(bad_letter) merely returns a string of the escape sequence--it can't restore the original character.
Sorry if this is extremely vague. Please let me know what info I can provide to help clarify. Thanks in advance for reading this.

Read Chinese Characters in Dicom Files

I have just started to get a feel of Dicom standard. I am trying to write a small program, that would read a dicom file and dump the information to a text file. I have a dataset that has the patient names in Chinese. How can I read and store these names?
Currently, I am reading the names as Char* from the dicom file, converting this char* to wchar* using code page "950" for Chinese and writing to a text file. Instead of seeing Chinese characters I see * ? and % in my text file. What am I missing?
I am working in C++ on Windows.
If the text file contains UTF-16, have you included a BOM?
There may be multiple issues at hand.
First, do you know the character encoding of the Chinese name, e.g. Big5 or GB*? See http://en.wikipedia.org/wiki/Chinese_character_encoding
Second, do you know the encoding of your output text file? If it is ascii, then you probably won't ever be able to view the Chinese characters. In which case, I would suggest changing it to unicode (i.e. UTF-8).
Then, when you read the Chinese name, convert the raw bytes and write out the result. For example, if the DICOM stores it in Big5, and your text file is UTF-8, you will need a Big5->UTF-8 converter.