We are loading a fixed-width text file into a SAS dataset.
The character we are using to delimit multi-valued field values is being interpreted as 2 characters by SAS. This breaks things, because the fields are of a fixed width.
We can use characters that appear on the keyboard, but obviously this isn't as safe, because our data could actually contain those characters.
The character we would like to use is '§'.
I'm guessing this may be an encoding issue, but don't know what to do about it.
Could you use the keycode for the character like DLM='09'x and change 09 to the right keycode?
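If it is an encoding issue, a quick check outside SAS can confirm it. The sketch below (Python, not SAS) shows that '§' is one byte in Latin-1 but two bytes in UTF-8, which would explain a fixed-width layout shifting by one position per delimiter. That the file or the SAS session is actually using UTF-8 is an assumption here; the helper name byte_width is made up for the illustration.

```python
# Hypothetical check: how many bytes does the delimiter occupy per encoding?
DELIM = "§"  # U+00A7 SECTION SIGN


def byte_width(ch: str, encoding: str) -> int:
    """Return the number of bytes `ch` takes when encoded with `encoding`."""
    return len(ch.encode(encoding))


latin1_width = byte_width(DELIM, "latin-1")  # single byte 0xA7
utf8_width = byte_width(DELIM, "utf-8")      # two bytes 0xC2 0xA7

print("latin-1:", latin1_width, "byte(s)")
print("utf-8:  ", utf8_width, "byte(s)")
```

If the widths differ like this for your file, the hex code you give SAS has to match the encoding the file was actually written in.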
I'm a new mainframe user. I have a file and I need to replace all the dots '.' in it with spaces. I tried this statement on the command line
change X'05' X'40' all
after converting the file display to hexadecimal, but it doesn't work.
How can I change all the dots to spaces in the file, in a simple way please?
The dots are non-displayable characters. You can match them using picture strings in the ISPF editor (which is what I assume you're trying to use to edit the file?)
Try the command
change p'.' ' ' all
The "p'.'" part will match any non-displayable character and change it to a blank.
Hans's answer above will certainly change every non-displayable character to a space. However, make sure you really want to change all of them. Turn HEX ON to look at the actual data; you can then do F P'.' to find the non-displayable character(s) before changing anything. Browse shows non-displayable characters as a dot. Edit, however, replaces the value with an attribute for display purposes, which keeps you from typing over the data. You have to turn on HEX mode to modify a non-displayable value manually, or use the Change command as you were trying to. Typically any hex value from x'00' to x'3F' is non-displayable, so a
C P'.' X'40' ALL
would modify every one of those values to a space. This may or may not be desirable depending on the file.
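For anyone who wants to preview the effect outside ISPF, here is a rough Python sketch of the same idea: every byte in the x'00' to x'3F' range is replaced with x'40', the EBCDIC space. It assumes the range quoted above is the right definition of "non-displayable" for your data and that the file is EBCDIC (code page 037 is used for the demo decode); the function name is hypothetical.

```python
# 256-entry translation table: bytes 0x00-0x3F map to 0x40 (EBCDIC space),
# every other byte maps to itself.
TABLE = bytes(0x40 if b <= 0x3F else b for b in range(256))


def blank_nondisplayable(record: bytes) -> bytes:
    """Replace bytes in the range x'00'-x'3F' with x'40'."""
    return record.translate(TABLE)


# EBCDIC 'A', x'05', 'B', x'00', 'C'
sample = bytes([0xC1, 0x05, 0xC2, 0x00, 0xC3])
cleaned = blank_nondisplayable(sample)

print(cleaned.hex())            # hex view, similar in spirit to HEX ON
print(cleaned.decode("cp037"))  # decoded with EBCDIC code page 037
```

Like the C P'.' X'40' ALL command, this touches every byte in the range, so it is only safe when none of those bytes carry meaning (packed decimal or binary fields would be damaged).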
I'm working with a huge DB2 table (hundreds of millions of rows), trying to select only the rows that are matched by this regular expression:
\b\d([- \/\\]?\d){12,15}(\D|$)
(That is, a word boundary, followed by 13 to 16 digits separated by nothing or a single dash, space, slash, or backslash, followed by either a non-digit or the end of the line.)
After much Googling, I've managed to create the following SQL:
SELECT idx, comment FROM tblComment
WHERE xmlcast(xmlquery('fn:matches($c,"\b\d([- \/\\]?\d){12,15}(\D|$)")' PASSING comment AS "c") AS INTEGER)=1
Which works perfectly, as far as I can tell... unless it finds a row with an illegal character:
An illegal XML character "#x3" was found in an SQL/XML expression or function argument that begins with string [...]
The data contains many illegal XML characters, and changing the data is not an option (I have limited read-only access, and there are far too many rows that would need to be fixed). Is there a way to strip out or ignore illegal characters, without first modifying the database? Or, is there a different way for me to write my query that has the same effect?
You will have to identify what are all the illegal XML characters that occur in your data. Once you know them, you can use the TRANSLATE() function to eliminate them during the pattern matching.
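Say you determine that ASCII control characters may be present in the COMMENT column. (The full set of ASCII control characters is 0x00 through 0x1F plus 0x7F; the list in the query covers only some of them, so extend it to match what actually occurs in your data.) Your query might then look like: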
SELECT idx, comment FROM tblComment
WHERE xmlcast(xmlquery(
'fn:matches($c,"\b\d([- \/\\]?\d){12,15}(\D|$)")'
PASSING TRANSLATE(comment, ' ', x'01020304050607080B0C0F7F') AS "c")
AS INTEGER)=1
All legal XML characters are listed in the manual. 0x09, 0x0A and 0x0D are legal, so you don't need to TRANSLATE() them, for example.
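To work out which characters the TRANSLATE() list has to cover, it can help to test sample comments outside DB2. The sketch below is Python, not DB2: it replaces everything outside the XML 1.0 Char production with a space and then applies the same pattern. The function names are made up for the illustration, and Python's regex dialect is assumed to be close enough to the XQuery one for this pattern.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# The pattern from the question, written for Python's re module.
CARD_LIKE = re.compile(r"\b\d([- /\\]?\d){12,15}(\D|$)")


def sanitize(text: str) -> str:
    """Replace each illegal XML character with a space, like TRANSLATE()."""
    return ILLEGAL_XML.sub(" ", text)


def looks_like_card(comment: str) -> bool:
    """Apply the digit-run pattern to the sanitized comment."""
    return CARD_LIKE.search(sanitize(comment)) is not None


print(looks_like_card("paid \x03 1234-5678-9012-3456 today"))
print(looks_like_card("nothing \x03 to see"))
```

One thing to keep in mind: replacing an illegal character with a space can turn two digit runs into one "space-separated" run, so a few extra rows may match. Translating to a character that the pattern does not accept as a separator avoids that.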
I need to escape all special characters and replace national characters and get "plain text" for a tablename.
string getTableName(string name)
My string could be "šárka65_%&." and I want to get string I can use in my database as a tablename.
Which DBMS?
In standard SQL, a name enclosed in double quotes is a delimited identifier and may contain any characters.
In MS SQL Server, a name enclosed in square brackets is a delimited identifier.
In MySQL, a name enclosed in back-ticks is a delimited identifier.
You could simply choose to enclose the name in the appropriate markers.
I had a feeling that wasn't what you wanted...
What codeset is your string in? It seems to be UTF-8 by the time it gets to my browser. Do you need to be able to invert the mapping unambiguously? That is harder.
You can use many schemes to map the information:
One simple minded one is simply to hex-encode everything, using a marker (X) to protect against leading digits:
XC5A1C3A1726B6136355F25262E
One slightly less simple minded one is hex-encode anything that is not already an ASCII alphanumeric or underscore.
XC5A1C3A1rka65_25262E
Or, as a comment suggests, you can devise a mapping table for accented Latin letters - indeed, a mapping table appropriately initialized will be the fastest approach. The input is the character in the source string; the output is the desired mapped character or characters. If you use an 8-bit character set, this is entirely manageable. If you use full Unicode, it is a lot less manageable (not least, how do you map all the Han syllabary to ASCII?).
Or ...
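The two hex-encoding schemes above can be sketched as follows. This is Python rather than whatever language getTableName lives in, it assumes the input is encoded as UTF-8 before hex-encoding, and the function names are invented for the example.

```python
def hex_encode_all(name: str, marker: str = "X") -> str:
    """Scheme 1: hex-encode every UTF-8 byte, prefixed with a marker."""
    return marker + name.encode("utf-8").hex().upper()


def hex_encode_unsafe(name: str, marker: str = "X") -> str:
    """Scheme 2: keep ASCII letters, digits and '_'; hex-encode the rest."""
    parts = []
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            parts.append(ch)
        else:
            parts.append(ch.encode("utf-8").hex().upper())
    return marker + "".join(parts)


name = "šárka65_%&."
print(hex_encode_all(name))
print(hex_encode_unsafe(name))
```

Scheme 1 can be inverted by dropping the marker and decoding the hex. Scheme 2 as written cannot be inverted unambiguously, because kept characters such as 'a' or '6' look just like hex digits; an escape prefix in front of each encoded byte would be needed for that.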
Problem is categorized in two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool that converts the XML strings from UTF-8 to Unicode separately. To convert a UTF-8 string to Unicode, I used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is an array declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But with Greek and Chinese characters it is choking.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access-97), the resulting string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access-97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db text strings: the application GUI takes Unicode input and saves it in Access-97.
First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding is required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters, which are OK without conversion, are either lost or converted to regular ASCII letters.
The Greek and Chinese characters show similar behaviour to that described in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with '?'. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable you have to know the codepage they are stored in. Using this you can get the actual bytes and then convert them to Unicode.
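As a rough illustration of that recovery step (in Python, not Delphi), the sketch below first reproduces the damage by reading UTF-8 bytes as single-byte characters, then reverses it by re-encoding with the same codepage and decoding as UTF-8. Latin-1 is only an assumed codepage for the demo; the real one has to be determined from the database. The function names are hypothetical.

```python
def misread(text: str, codepage: str = "latin-1") -> str:
    """Simulate the damage: UTF-8 bytes interpreted one byte per character."""
    return text.encode("utf-8").decode(codepage)


def recover(stored: str, codepage: str = "latin-1") -> str:
    """Undo it: get the original bytes back, then decode them as UTF-8."""
    return stored.encode(codepage).decode("utf-8")


original = "Ευγ"
stored = misread(original)   # six characters, starting with 'Î'
restored = recover(stored)

print(len(original), len(stored))
print(restored == original)
```

This only works while every original byte survived. Where a byte was already replaced with '?', the information is gone and no conversion will bring it back.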
Is there an ascii value I can put into a char in C++, that represents nothing? I tried 0 but it ends up screwing up my file so I can't read it.
ASCII 0 is null. Other than that, there are no "nothing" characters in traditional ASCII. If appropriate, you could use a control character like SOH (start of heading), STX (start of text), or ETX (end of text). Their ASCII values are 1, 2, and 3 respectively.
For the full list of ASCII codes that I used for this explanation, see this site
Sure. Use any character value that won't appear in your regular data. This is commonly referred to as a delimited text file. Popular choices for delimiters include spaces, tabs, commas, semi-colons, vertical-bar characters, and tilde.
In a C++ source file, '\0' represents a 0 byte. However, C-style strings are null-terminated, which means that '\0' marks the end of the string, and that may be what is messing up your file.
If you really want to store a 0 byte in a data file, you need to use some other encoding. A simplistic one would use some other character - 0xFF, for example - that doesn't appear in your data, or some length/data format or something similar.
Whatever encoding you choose to use, the application writing the file and the one reading it need to agree on what the encoding is. And that is a whole new nightmare.
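One possible length/data format, sketched in Python rather than C++ to keep it short: each record is written as a 4-byte little-endian length followed by the raw bytes, so a 0 byte inside the data needs no special treatment. The 4-byte little-endian header and the function names are choices made for this example, not a standard.

```python
import io
import struct


def write_record(stream, payload: bytes) -> None:
    """Write a 4-byte little-endian length, then the payload bytes."""
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def read_record(stream):
    """Read one record; return None when the stream is exhausted."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("<I", header)
    return stream.read(length)


buffer = io.BytesIO()           # stands in for a file opened in binary mode
write_record(buffer, b"ab\x00cd")
write_record(buffer, b"")
buffer.seek(0)

print(read_record(buffer))      # the 0 byte survives the round trip
print(read_record(buffer))      # empty record
print(read_record(buffer))      # None: end of data
```

The reader and the writer still have to agree on the header size and byte order, which is the same agreement problem described above, just made explicit.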
The null character '\0' still takes up a byte.
Does your software recognize the null character as an end-of-file character?
If your software is reading in this file, you can define a place holder character (one that isn't the same as data) but you'll also need to handle that character. As in, say '*' is your place-holder. You will read in the character but not add it to the structure that stores your data. It will still take up space in your file, but it won't take up space in your data structure.
Am I answering your question or missing it?
Do you mean a value you can write which won't actually change the file? The answer is no.
Maybe post a little more about what you're trying to accomplish.
It would depend on what kind of file it is and who is parsing it.