Special Chars to ASCII - regex

Using VBA in MS Access, is there a way to have all special chars in a string replaced with the ASCII equivalent? In other words, I want the ampersands gone and replaced with &amp;, along with every other special character.
A PHP equivalent is HTMLSpecialChars. I have semi-colons in my inserts that are probably blowing up my query. I need semi-colons converted to clean my text for an insert.

Starting with Access 2000 the Replace Function is available in Access VBA.
? Replace("a&b", "&", "&amp;")
a&amp;b
You would need to repeat that function pattern for any other characters you want to replace.
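As a rough illustration of repeating the replacement for several characters, here is a minimal sketch in Python rather than VBA. The list `ENTITIES` and the helper `escape_special` are hypothetical names, and the set of characters is an assumption. The ampersand is handled first, because otherwise the ampersands introduced by the later replacements would themselves be escaped a second time.

```python
# Hypothetical list of (character, entity) pairs; the chosen set is an assumption.
ENTITIES = [
    ("&", "&amp;"),   # must come first to avoid double-escaping
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
]


def escape_special(text):
    """Apply one replace call per special character, in list order."""
    for char, entity in ENTITIES:
        text = text.replace(char, entity)
    return text


print(escape_special("a&b"))
```

The same loop-over-pairs idea should carry over to VBA by calling Replace once per pair.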
However, if this is intended to prevent blowing up an INSERT statement, it may be a red herring. You should be able to insert text which contains semi-colons or ampersands into a text field as long as the text you insert is properly quoted or is supplied as a parameter to a parameter query. Both these statements execute successfully for me.
CurrentDb.Execute "INSERT INTO MyTable (MyText) " & _
    "VALUES ('a&b')"
CurrentDb.Execute "INSERT INTO MyTable (MyText) " & _
    "VALUES ('a;b')"
It may help to show us the SQL for your failing INSERT statement with a simple example of the text which causes it to blow up. Also tell us the error message, if any. Please paste the SQL into your question rather than into a comment.
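To illustrate the parameter-query point, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Access. The table and column names are borrowed from the statements above; the in-memory database and the sample strings are assumptions made only to keep the example self-contained. With a placeholder, the text is passed as data, so semi-colons, ampersands and even single quotes need no escaping.

```python
import sqlite3

# In-memory database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyText TEXT)")

samples = ["a&b", "a;b", "it's; fine & dandy"]
for text in samples:
    # The ? placeholder sends the value separately from the SQL text.
    conn.execute("INSERT INTO MyTable (MyText) VALUES (?)", (text,))

stored = [row[0] for row in conn.execute("SELECT MyText FROM MyTable ORDER BY rowid")]
print(stored)
conn.close()
```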

http://www.renownedmedia.com/blog/convert-ascii-to-utf-8-using-vba/

Related

BigQuery remove <0x00> hidden characters from a column

I have a table with unwanted hidden characters, such as my_table:
id  fruits
1   STuff1 stuff_2 ����������������������
2   Blahblah-blahblah �������������
3   nothing
How do I remove ���������������������� when selecting this column?
Current query:
SELECT fruits, TRIM(REGEXP_REPLACE(fruits, r'[^a-zA-Z,0-9,-]', ' ')) AS new_fruits
FROM `project-id.MYDATASET.my_table`
This query is too broad: I'm worried that I might accidentally exclude or replace important data. I want to target only these weird characters.
When I open the data as CSV, the weird characters show as <0x00>. How do I solve this?
First you have to identify which character this is; because it is non-printable, the symbol you see is just an arbitrary representation. To replace it without removing any other important information, do the following:
Identify the hexadecimal code of the character. Copy it from the CSV and paste it on this site:
Use the REPLACE function in BigQuery to replace the character with that hex code, as follows:
SELECT trim(replace(string_field_1,chr(0xfffd)," ")) FROM `<project>.<dataset>.<table>`;
If your character's code is different from fffd, put your value in the chr() function.
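The two steps above can also be tried outside BigQuery. The following is a minimal Python sketch, assuming the column value is already available as a string; `odd_code_points` is a hypothetical helper and `sample` is made-up data. It first lists the hex code of every character outside printable ASCII, then replaces only that character.

```python
def odd_code_points(text):
    """Return the distinct code points outside printable ASCII as hex strings."""
    return sorted({f"{ord(ch):04x}" for ch in text if not 32 <= ord(ch) < 127})


# Made-up sample value containing three U+FFFD replacement characters.
sample = "STuff1 stuff_2 \ufffd\ufffd\ufffd"

# Step 1: identify the hex code of the unwanted character.
print(odd_code_points(sample))

# Step 2: replace only that character, then trim the leftover spaces.
cleaned = sample.replace(chr(0xFFFD), " ").strip()
print(cleaned)
```

If the first step reports a different code (for example 0000 for a NUL byte), that value is the one to pass to chr().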

regular expression replace for SQL

I have to replace a string pattern in SQL with an empty string. Could anyone please suggest how?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002'
The codes always have the same length: the first 2 characters are letters and the remaining 3 are digits. I have to remove all codes except the ones starting with "AE".
I can have 0 or more instances of "AE" codes in the string. The final output should be a comma-separated string of the "AE" codes, as shown above.
Here is one option: call regexp_replace multiple times, eliminating the unwanted strings little by little in each iteration to arrive at the required output.
SELECT regexp_replace(
regexp_replace(
regexp_replace(
'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X','g'
),'..X','','g'
),',$','','g'
)
I would convert the list to an array, unnest that to rows then filter out those that should be kept and aggregate it back to a string:
select string_agg(t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001',',')) as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example of why storing comma-separated values in a single column is not such a good idea to begin with.
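The split, filter and re-aggregate idea is not specific to SQL. As a minimal sketch of the same three steps in Python (the function name `keep_codes` and its `prefix` parameter are hypothetical):

```python
def keep_codes(csv_codes, prefix="AE"):
    """Split on commas, keep codes starting with prefix, join them back."""
    codes = csv_codes.split(",")                              # list -> elements
    kept = [code for code in codes if code.startswith(prefix)]  # filter
    return ",".join(kept)                                     # aggregate back


result = keep_codes("AC001,AD001,AE001,SA001,AE002,SD001")
print(result)
```

In this sketch an input without any matching code yields an empty string.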

How can I replace multiple words "globally" using regexp_replace in Oracle?

I need to replace multiple words such as (dog|cat|bird) with nothing in a string where there may be multiple consecutive occurrences of a word. The actual code is to remove salutations and suffixes from a name. Unfortunately the garbage data I get sometimes contains "SNERD JR JR."
I was able to create a regular expression pattern that accomplishes my goal but only for the first occurrence. I implemented a stupid hack to get rid of the second occurrence, but I believe there has to be a better way. I just can't figure it out.
Here is my "hacked" code;
FUNCTION REMOVE_SALUTATIONS(IN_STRING VARCHAR2) RETURN VARCHAR2 DETERMINISTIC
AS
REGEX_SALUTATIONS VARCHAR2(4000) := '(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)';
BEGIN
RETURN TRIM(REGEXP_REPLACE(REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' '),REGEX_SALUTATIONS,''));
END REMOVE_SALUTATIONS;
I was actually proud that I was able to get this far, as regular expressions are not very regular to me. All help is appreciated.
EDIT:
My understanding is that regexp_replace does a global replace by default. But on the off chance that my DB is configured differently, I also tried:
select REGEXP_REPLACE('SNERD JR JR','(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)',' ',1,0) from dual;
and the result is:
SNERD JR
Use the occurrence parameter of the REGEXP_REPLACE function. The docs say:
occurrence is a nonnegative integer indicating the occurrence of the replace operation:
If you specify 0, then Oracle replaces all occurrences of the match.
If you specify a positive integer n, then Oracle replaces the nth occurrence.
https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions137.htm#SQLRF06302
It should look like:
...
REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' ', 1,0 )
...
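One possible reason a second "JR" can survive even a global replace is that matches cannot overlap: the pattern consumes the whitespace after the first "JR", so the second "JR" no longer has a leading whitespace character to match against. The sketch below reproduces this with Python's re module, not Oracle. The lookahead variant is an assumption about what the regex engine supports and may not be available in Oracle's regex flavor; `remove_salutations` is a hypothetical helper.

```python
import re

TITLES = r"MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR"

# Pattern from the question: the trailing (\s|$) consumes the separator.
CONSUMING = re.compile(r"(^|\s)(" + TITLES + r")(\.?)(\s|$)")

# Variant with a lookahead: the separator is checked but not consumed,
# so it is still available as the leading \s of the next match.
LOOKAHEAD = re.compile(r"(^|\s)(" + TITLES + r")\.?(?=\s|$)")


def remove_salutations(name, pattern):
    """Replace every match with a space and trim the result."""
    return pattern.sub(" ", name).strip()


print(remove_salutations("SNERD JR JR", CONSUMING))
print(remove_salutations("SNERD JR JR", LOOKAHEAD))
```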

Is it possible to only get non alphabetic, English alphabet, and non numeric from a column DB2

Sometimes I get Ÿ (hex C5B8: 2 bytes, 1 character) in my database. I have a script that processes the data, and it can't read that character: it doesn't know what to do with it, so it stops the whole process, and I have to go through my logs to find the error before I can restart everything.
I want to execute a query that only gives me characters that are not in the English alphabet, so that I can see whether they should be changed.
I tried to only look for UTF-8 characters, but Ÿ is a UTF-8 character, so I need another approach.
words containing other than:
A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z
and numbers
0-1-2-3-4-5-6-7-8-9
excluding alphanumeric (in case someone writes an address like this)
h3ll0
I was thinking of something like this:
SELECT * FROM myTable WHERE myCol != (/^[A-Za-z]+$/)
Something like that, where I only get rows whose value contains characters that do not belong to the English alphabet or the digits 0-9.
I'm not sure if I understood you correctly. Basically you want to find all columns that have words with characters that are not in the English alphabet? If so this might work:
SELECT * FROM `myTable` WHERE `myCol` REGEXP '[^A-Za-z0-9]'
EDIT: This answer was written for the question's old tag, which was "mySQL"; you've since changed it to db2. I've tried modifying it for db2 11, but it's at best an educated guess:
SELECT * FROM myTable WHERE REGEXP_LIKE(myCol, '[^A-Za-z0-9]')
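To check what such a filter would flag before running it against the database, the same idea can be sketched in Python. The list `values` is made-up sample data, and `NON_ALNUM` is a hypothetical name. A value is flagged as soon as it contains at least one character outside A-Z, a-z and 0-9; note that this includes spaces and punctuation.

```python
import re

# Matches any single character that is not an ASCII letter or digit.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")

# Made-up sample values standing in for the column contents.
values = ["h3ll0", "abc", "cafŸ", "12 34"]

flagged = [value for value in values if NON_ALNUM.search(value)]
print(flagged)
```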
Check out the TRANSLATE function (see the documentation).
Translate all regular characters and numbers to an empty string, like:
select translate(mycol, '', 'ABCDEFGabcdefghi1234567890')
from mytable
This is not the complete solution, but you should get the idea. This works in DB2 LUW and is available for i series.
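The underlying idea, stripping every expected character so that only the unexpected ones remain, can be sketched in Python with str.translate. This is an illustration of the approach, not of the DB2 function itself; `STRIP_ALNUM` and `leftovers` are hypothetical names.

```python
import string

# Translation table that deletes all ASCII letters and digits.
STRIP_ALNUM = str.maketrans("", "", string.ascii_letters + string.digits)


def leftovers(value):
    """Return only the characters that are not ASCII letters or digits."""
    return value.translate(STRIP_ALNUM)


print(repr(leftovers("h3ll0")))
print(repr(leftovers("abcŸ123")))
```

A value is clean when nothing is left over; anything that remains is a candidate for review.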

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split each line on commas (,). However, I don't want to split the words between double quotes (" "). If I split on commas I will get an array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If your data really always has such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses your input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) option if you already know them, as they are basically equivalent to a state machine, just in another form.
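A minimal sketch of such a two-state parser in Python follows. The function name `split_csv_line` is hypothetical, and the sketch assumes that quotes are never escaped inside a field (no doubled "" inside quoted text).

```python
def split_csv_line(line):
    """Split on commas that are outside double quotes; quotes are dropped."""
    fields = []
    current = []
    in_quotes = False              # the state: inside or outside quotes
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes   # switch state, do not keep the quote
        elif ch == "," and not in_quotes:
            fields.append("".join(current))   # field ends at an outside comma
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))           # last field has no trailing comma
    return fields


parts = split_csv_line('10,"nikhil,khandare","sachin","rahul",viru')
print(parts)
```

For real-world files with escaped quotes or embedded newlines, Python's standard csv module is likely the safer choice.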