How to prevent illegal characters error in DB2 SQL query? - regex

I'm working with a huge DB2 table (hundreds of millions of rows), trying to select only the rows that are matched by this regular expression:
\b\d([- \/\\]?\d){12,15}(\D|$)
(That is, a word boundary, followed by 13 to 16 digits separated by nothing or a single dash, space, slash, or backslash, followed by either a non-digit or the end of the line.)
After much Googling, I've managed to create the following SQL:
SELECT idx, comment FROM tblComment
WHERE xmlcast(xmlquery('fn:matches($c,"\b\d([- \/\\]?\d){12,15}(\D|$)")' PASSING comment AS "c") AS INTEGER)=1
Which works perfectly, as far as I can tell... unless it finds a row with an illegal character:
An illegal XML character "#x3" was found in an SQL/XML expression or function argument that begins with string [...]
The data contains many illegal XML characters, and changing the data is not an option (I have limited read-only access, and there are far too many rows that would need to be fixed). Is there a way to strip out or ignore illegal characters, without first modifying the database? Or, is there a different way for me to write my query that has the same effect?

You will have to identify all the illegal XML characters that occur in your data. Once you know them, you can use the TRANSLATE() function to replace them with blanks during the pattern matching.
Say you determine that ASCII control characters (0x01 through 0x1F, plus 0x7F) may be present in the COMMENT column. Your query might then look like:
SELECT idx, comment FROM tblComment
WHERE xmlcast(xmlquery(
'fn:matches($c,"\b\d([- \/\\]?\d){12,15}(\D|$)")'
PASSING TRANSLATE(comment, ' ', x'01020304050607080B0C0E0F101112131415161718191A1B1C1D1E1F7F') AS "c")
AS INTEGER)=1
All legal XML characters are listed in the manual. 0x09, 0x0A and 0x0D are legal, so you don't need to TRANSLATE() them, for example.
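If you can pull a sample of the comments to the client, a short script can tell you which illegal characters actually occur and build the hex digits for the TRANSLATE() literal. The following is a minimal Python sketch, not part of the original answer; the function names are made up for illustration, and it assumes the XML 1.0 Char production and single-byte characters only. NUL (0x00) is left out of the generated list to mirror the literal in the query above.

```python
def is_legal_xml_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_code_points(text):
    """Sorted list of the distinct illegal code points found in text."""
    return sorted({ord(ch) for ch in text if not is_legal_xml_char(ord(ch))})


def translate_from_hex(code_points):
    """Hex digits for an x'...' literal (single-byte code points only)."""
    return "".join(f"{cp:02X}" for cp in sorted(set(code_points)) if cp < 0x80)


# Every illegal character in the 7-bit range, excluding NUL (assumption: NUL
# is handled separately or does not occur in the column).
ALL_ILLEGAL_ASCII = [cp for cp in range(0x01, 0x80) if not is_legal_xml_char(cp)]

if __name__ == "__main__":
    print(illegal_code_points("card 1234\x03 and \x1b\n"))
    print("x'" + translate_from_hex(ALL_ILLEGAL_ASCII) + "'")
```

Note that 0x7F falls inside the legal range 0x20-0xD7FF of that production, so it does not show up in the generated list; translating it away as well is harmless.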

SQLite regex to find rows with strings not in a predefined set

I have a SQLite 3 column which can contain some different strings concatenated by +. Only some strings are valid, and I want to find rows where this column contains anything that is not part of the valid set of strings.
For example, these are valid strings
1 2.4 5X 0A
So if the column contains 1+2.4+0A it is valid, but if it contains 1+2.4+6X it is invalid (because 6X is not part of the valid set). Likewise, 1 2.4 would be invalid since 1 and 2.4 must be separated by a +.
I tried experimenting with regexes, but I can't seem to construct one that only allows the strings in the valid list. Any help would be appreciated, and as previously stated I need this to work with SQLite 3.
You can use (?:1|2\.4|5X|0A) as a group that matches any one of the valid strings. To check a whole value, require a valid string at the start, followed by any number of further valid strings, each preceded by a +, up to the end: ^(?:1|2\.4|5X|0A)(?:\+(?:1|2\.4|5X|0A))*$.
That'll lead to something like
SELECT *
FROM elbat
WHERE nmuloc REGEXP '^(?:1|2\.4|5X|0A)(?:\+(?:1|2\.4|5X|0A))*$';
to get the records with valid strings. (Negate to get the invalid ones.)
If empty strings should also count as valid (your description doesn't say whether they are OK), extend it to ^(?:1|2\.4|5X|0A)(?:\+(?:1|2\.4|5X|0A))*$|^$.
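SQLite only supports the REGEXP operator when the application registers a regexp() function, so the query above needs one to be available. Here is a minimal Python sketch of how that could look with the standard sqlite3 module. The table and column names are the placeholders from the answer, the helper names are made up, and the regex flavour is Python's re module rather than whatever extension your SQLite build might use.

```python
import re
import sqlite3

VALID = r"^(?:1|2\.4|5X|0A)(?:\+(?:1|2\.4|5X|0A))*$"


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)


def find_invalid(values):
    """Load values into an in-memory table and return those that fail VALID."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, regexp)
    conn.execute("CREATE TABLE elbat (nmuloc TEXT)")
    conn.executemany("INSERT INTO elbat VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        "SELECT nmuloc FROM elbat WHERE NOT (nmuloc REGEXP ?) ORDER BY rowid",
        (VALID,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(find_invalid(["1+2.4+0A", "1+2.4+6X", "1 2.4", "5X"]))
```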

Remove or replace '�' character in Informatica

We have a requirement to replace or remove the '�' character (an unrecognizable, undefined character) present in our source. My workflow runs successfully, but when I check the records in the target they are not committed. I get the following error in Informatica:
Error executing query for record 37: 6706: The string contains an untranslatable character.
I tried functions like replace_chr, reg_replace, replace_str etc., but none seems to be working. Kindly advise on how to get rid of this. Any reply is greatly appreciated.
Your schema definitions need to use a Unicode character set and collation, such as utf8_unicode_ci,
but for now you can do:
UPDATE tablename
SET columnToCheck = REPLACE(CONVERT(columnToCheck USING ascii), '?', '')
WHERE ...
or
update tablename
set columnToCheck = replace(columnToCheck , char(146), '');
Replace NonASCII Characters in MYSQL
You can replace the special characters in an Expression transformation:
REPLACESTR(1,Column_Name,'?',NULL)
REPLACESTR - Function
1 - CaseFlag (a non-zero value makes the match case sensitive)
Column_Name - Column that contains the special character
? - Special character to replace
NULL - Replacement string (NULL removes the character)
You need to fetch rows with the appropriate character set defined on your connection. What is the connection you're using, ODBC or native? What's the DB?
Special characters are a challenge. Having checked the Informatica Network, I can see there is a kludge involving replace_str: first set a variable to the string with all non-special characters removed, then use that variable in a second replace_str so that the final value contains only the allowed characters: https://network.informatica.com/thread/20642 (an awesome workaround by nico, so long as you can positively identify every character that should be allowed).
As an alternative kludge, I would also try an XML transformation somewhere within the mapping, since Informatica conveniently converts special characters to encoded values (decimal or hex, I can't remember). So long as you can live with these encoded values appearing in your target text you should be fine (and build some extra space into your strings to accommodate any bloat from the extra characters).
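To make the two workarounds concrete, here is a rough Python sketch of the logic only; it is not Informatica expression code, and the allowed set and function names are assumptions chosen for illustration. The first function follows the two-step idea as I read it (collect whatever is left once the allowed characters are removed, then remove exactly those characters from the original). The second mimics the effect of encoding special characters as decimal character references.

```python
import string

# Assumption: printable ASCII letters, digits, punctuation and space are allowed.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")


def strip_disallowed(text, allowed=ALLOWED):
    # Step 1: what remains after dropping allowed characters is the unwanted set.
    unwanted = "".join(ch for ch in text if ch not in allowed)
    # Step 2: remove exactly those characters from the original string.
    return "".join(ch for ch in text if ch not in unwanted)


def to_char_refs(text):
    # Non-ASCII characters become decimal references such as &#65533;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "caf\ufffd 42"
    print(strip_disallowed(sample))
    print(to_char_refs(sample))
```

The second form makes the string longer, which is why the target column needs the extra space mentioned above.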

How to split string into chunks using regular expressions while keeping URI coded special characters together

Let's assume you have a string that you want to split into chunks having a maximum size of x characters. If you ignore new lines, a suitable regular expression would be .{1,x}
The problem I have is that I want to keep URI coded special characters like %20 together.
Example:
Hello%20world%20how%20are%20you%20today
Doing a "dumb" chunking with 5 character chunks, you end up with:
Hello
%20wo
rld%2
0how%
20are
%20yo
u%20t
oday
What I want to achieve is this:
Hello
%20wo
rld
%20ho
w%20a
re%20
you
%20to
day
Is this even possible with only regular expressions? I currently have a working solution with a loop that goes through each character and fills a bucket. If the bucket is full, it adds its content to an array of chunks and empties it. However, it also checks if the current character is a % and if the bucket would be able to hold 3 more characters (% plus the two hex digits). If it can, OK, otherwise it would push the content of the bucket in the chunks array and start with a fresh bucket.
Keep it simple and stay with your working loop solution; it's probably faster and ten times more readable: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
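For reference, the bucket loop described in the question might look roughly like this in Python. This is a sketch of the described idea, not the asker's actual code; the function name and the handling of a truncated trailing escape are assumptions.

```python
def chunk_uri(text, size=5):
    """Split text into chunks of at most size characters, never splitting %XX."""
    if size < 3:
        raise ValueError("size must be at least 3 to hold one %XX escape")
    chunks = []
    bucket = ""
    i = 0
    while i < len(text):
        # Treat a percent sign plus the next two characters as one token.
        # A truncated escape at the very end simply yields a shorter token.
        token = text[i:i + 3] if text[i] == "%" else text[i]
        if len(bucket) + len(token) > size:
            chunks.append(bucket)
            bucket = ""
        bucket += token
        i += len(token)
    if bucket:
        chunks.append(bucket)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_uri("Hello%20world%20how%20are%20you%20today"):
        print(chunk)
```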
Try this regular expression to match all parts:
/(%[0-9A-F]{2}[^%]?[^%]?|[^%]%[0-9A-F]{2}[^%]?|[^%][^%]%[0-9A-F]{2}|[^%]{1,5})/
This basically lists all possible options to get at most five characters:
%[0-9A-F]{2}[^%]?[^%]? – a percent-encoded octet followed by at most two non-% characters
[^%]%[0-9A-F]{2}[^%]? – one non-% character, followed by a percent-encoded octet, followed by at most one non-% character
[^%][^%]%[0-9A-F]{2} – two non-% characters followed by a percent-encoded octet
[^%]{1,5} – one to five non-% characters
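A quick way to try the expression is Python's re.findall, shown here as a small sketch (the outer capturing group is dropped because findall already returns whole matches). Keep in mind that the pattern is hard-wired to chunks of five, accepts only uppercase hex digits, and silently skips any stray % that is not followed by two hex digits.

```python
import re

CHUNK_RE = re.compile(
    r"%[0-9A-F]{2}[^%]?[^%]?"      # escape + up to two plain characters
    r"|[^%]%[0-9A-F]{2}[^%]?"      # one plain, escape, up to one plain
    r"|[^%][^%]%[0-9A-F]{2}"       # two plain, escape
    r"|[^%]{1,5}"                  # one to five plain characters
)


def chunk_with_regex(text):
    """Return the successive non-overlapping matches of CHUNK_RE in text."""
    return CHUNK_RE.findall(text)


if __name__ == "__main__":
    for chunk in chunk_with_regex("Hello%20world%20how%20are%20you%20today"):
        print(chunk)
```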

c++ - escape special characters

I need to escape all special characters and replace national characters and get "plain text" for a tablename.
string getTableName(string name)
My string could be "šárka65_%&." and I want to get a string I can use in my database as a tablename.
Which DBMS?
In standard SQL, a name enclosed in double quotes is a delimited identifier and may contain any characters.
In MS SQL Server, a name enclosed in square brackets is a delimited identifier.
In MySQL, a name enclosed in back-ticks is a delimited identifier.
You could simply choose to enclose the name in the appropriate markers.
I had a feeling that wasn't what you wanted...
What codeset is your string in? It seems to be UTF-8 by the time it gets to my browser. Do you need to be able to invert the mapping unambiguously? That is harder.
You can use many schemes to map the information:
One simple minded one is simply to hex-encode everything, using a marker (X) to protect against leading digits:
XC5A1C3A1726B6136355F25262E
One slightly less simple-minded one is to hex-encode anything that is not already an ASCII alphanumeric or underscore.
XC5A1C3A1rka65_25262E
Or, as a comment suggests, you can devise a mapping table for accented Latin letters - indeed, a mapping table appropriately initialized will be the fastest approach. The input is the character in the source string; the output is the desired mapped character or characters. If you use an 8-bit character set, this is entirely manageable. If you use full Unicode, it is a lot less manageable (not least, how do you map all the Han characters to ASCII?).
Or ...
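The two hex-encoding schemes above can be sketched as follows. The question asks for C++, but this illustration is in Python to keep it short; the function name and parameters are made up, and it assumes the input is encoded as UTF-8. The same byte-by-byte loop carries over to a C++ std::string directly.

```python
def hex_encode_name(name, keep_safe=True, marker="X"):
    """Map an arbitrary string to ASCII letters, digits and underscores.

    With keep_safe=False every byte is hex-encoded; with keep_safe=True,
    ASCII alphanumerics and underscore are copied through unchanged.
    The marker protects against a result that starts with a digit.
    """
    out = [marker]
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        if keep_safe and byte < 0x80 and (ch.isalnum() or ch == "_"):
            out.append(ch)
        else:
            out.append(f"{byte:02X}")
    return "".join(out)


if __name__ == "__main__":
    print(hex_encode_name("šárka65_%&.", keep_safe=False))
    print(hex_encode_name("šárka65_%&."))
```

Only the first scheme can be inverted unambiguously; in the second, a literal "25" and an encoded "%" look the same.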

Delimiting Character

We are loading a fixed-width text file into a SAS dataset.
The character we are using to delimit multi-valued field values is being interpreted as 2 characters by SAS. This breaks things, because the fields are of a fixed width.
We can use characters that appear on the keyboard, but obviously this isn't as safe, because our data could actually contain those characters.
The character we would like to use is '§'.
I'm guessing this may be an encoding issue, but don't know what to do about it.
Could you use the keycode for the character like DLM='09'x and change 09 to the right keycode?
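On the encoding guess: one plausible explanation (an assumption, since the question does not say how the file is encoded) is that the file is UTF-8 while it is being read with a single-byte encoding. The small Python sketch below only shows how many bytes '§' occupies in each case, which would also determine the hex value to try in place of 09.

```python
DELIMITER = "\u00a7"  # the section sign


def delimiter_bytes(encoding):
    """Return the delimiter's bytes in the given encoding as uppercase hex."""
    return DELIMITER.encode(encoding).hex().upper()


if __name__ == "__main__":
    for encoding in ("latin-1", "utf-8"):
        encoded = delimiter_bytes(encoding)
        print(encoding, encoded, len(encoded) // 2, "byte(s)")
```

In Latin-1 the character is the single byte A7, while in UTF-8 it is the two bytes C2 A7, which a single-byte reader would see as two characters.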