I have a column which contains free-form text, i.e alphabets, digits and certain special characters and non-printable non-ascii control characters. How can I clean this text string by suppressing the non-printable characters using REGEX in Spark SQL 2.4 ?
Just to clarify further, besides ascii alphabets and digits, I also need to retain characters like %-()|,<;:">?/[]#+=#!&.. etc. Only the non-printable non-ascii characters need to be removed using regex.
Example - something similar to:
select regexp_replace(col, "[^:print:][^:ctrl:]", '')
OR
select regexp_replace(col, "[^:alphanum:]", "")
But I can't get it to work in Spark SQL (with the SQL API). Can anyone please advise with a working example.
Any help is appreciated.
Thanks
Related
I want to select the records containing non-alphanumeric and remove those symbols from strings. The result I expecting is strings with only numbers and letters.
I'm not really familiar with regular expression and sometime it's really confusing. The code below is from answers to similar questions. But it also returns records having only letters and space. I also tried to use /s in case some spaces are not spaces but tabs instead. But I got the same result.
Also, I want to remove all symbols, characters excepting letters, numbers and spaces. I found a function named removesymbols from google could reference. But it seems this function does not exist at all. The website introduces removesymbols is https://cloud.google.com/dataprep/docs/html/REMOVESYMBOLS-Function_57344727. How can I remove all symbols? I don't want to use replace because there are a lot of symbols and I don't know all kinds of non-alphanumeric they have.
-- the code here only shows I want to select all records with non-alphanumeric
SELECT EMPLOYER
FROM fec.work
WHERE EMPLOYER NOT LIKE '[^a-zA-Z0-9/s]+'
GROUP BY 1;
Below is for BigQuery Standard SQL
SELECT
REGEXP_REPLACE(EMPLOYER, '[^a-zA-Z\\d\\s\\t]', ''), -- option 1
REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s\t]', ''), -- option 2
REGEXP_REPLACE(EMPLOYER, r'[^\w]', ''), -- option 3
REGEXP_REPLACE(EMPLOYER, r'\W', '') -- option 4
FROM fec.work
As you can see - option 1 is most verbose and you can avoid double escaping by using r in front of string regular expression as it is in option 2
To further simplify - you can use \w or directly \W as in options 3 and 4
Note: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
I suggest using REGEXP_REPLACE for select, to remove the characters, and using REGEXP_CONTAINS to get only the one you want.
SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '')
FROM fec.work
WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')
You say you don't want to use replace because you don't know how many alphanumerical there is. But instead of listing all non-alphanumerical, why not use ^ to get all but alphanumerical ?
EDIT :
To complete with what Mikhail answered, you have multiple choices for your regex :
'[^a-zA-Z\\d\\s]' // Basic regex
r'[^a-zA-Z\d\s]' // Uses r to avoid escaping
r'[^\w\s]' // \w = [a-zA-Z0-9_] (! underscore as alphanumerical !)
If you don't consider underscores to be alphanumerical, you should not use \w
In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turn your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search an replace will fail if the input varies from that pattern described as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familar with Notpad++ so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)
I searched a lot but couldn't find any exact soluion.
I have a CSV which contains some values that contains a comma in between the values.
Following is a sample row
"BEIAAGJIPAMBPJIF",2757,08042010,"13:53.59",09042010,"01:55.39","SIHAM","BEIAIGHEIPLGPJIF",20,"A",20,"S",0.00,0.00,0.00,"OLY
SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",
Now it you look at the value "IN STOCK , DESIGNER", it containts a comma in between. due to which while reading the csv in my .net application and in MS Dynamics CRM import file wizard, it breaks it into two seprate values instead of one single value.
I need a regex that can match such strings and replace the comma with a hyphen "-" that I can use in Notepad ++.
Kindly help.
Thanks.
This solution worked for me, although it is a bit indirect:
by searching, detect character which is unused in the file, e.g. #
use the following regex replace to replace all delimiters: find: (".*?"|.*?), replace: \1# (note the character from step 1)
now, all leftover commas are only those which are inside the quotes. Mass replace them for -
replace back all #'s for commas
I searched a lot, but nowhere is it written how to remove non-ASCII characters from Notepad++.
I need to know what command to write in find and replace (with picture it would be great).
If I want to make a white-list and bookmark all the ASCII words/lines so non-ASCII lines would be unmarked
If the file is quite large and can't select all the ASCII lines and just want to select the lines containing non-ASCII characters...
This expression will search for non-ASCII values:
[^\x00-\x7F]+
Tick off 'Search Mode = Regular expression', and click Find Next.
Source: Regex any ASCII character
In Notepad++, if you go to menu Search → Find characters in range → Non-ASCII Characters (128-255) you can then step through the document to each non-ASCII character.
Be sure to tick off "Wrap around" if you want to loop in the document for all non-ASCII characters.
In addition to the answer by ProGM, in case you see characters in boxes like NUL or ACK and want to get rid of them, those are ASCII control characters (0 to 31), you can find them with the following expression and remove them:
[\x00-\x1F]+
In order to remove all non-ASCII AND ASCII control characters, you should remove all characters matching this regex:
[^\x1F-\x7F]+
To remove all non-ASCII characters, you can use following replacement: [^\x00-\x7F]+
To highlight characters, I recommend using the Mark function in the search window: this highlights non-ASCII characters and put a bookmark in the lines containing one of them
If you want to highlight and put a bookmark on the ASCII characters instead, you can use the regex [\x00-\x7F] to do so.
Cheers
To keep new lines:
First select a character for new line... I used #.
Select replace option, extended.
input \n replace with #
Hit Replace All
Next:
Select Replace option Regular Expression.
Input this : [^\x20-\x7E]+
Keep Replace With Empty
Hit Replace All
Now, Select Replace option Extended and Replace # with \n
:) now, you have a clean ASCII file ;)
Another good trick is to go into UTF8 mode in your editor so that you can actually see these funny characters and delete them yourself.
Another way...
Install the Text FX plugin if you don't have it already
Go to the TextFX menu option -> zap all non printable characters to #. It will replace all invalid chars with 3 # symbols
Go to Find/Replace and look for ###. Replace it with a space.
This is nice if you can't remember the regex or don't care to look it up. But the regex mentioned by others is a nice solution as well.
Click on View/Show Symbol/Show All Character - to show the [SOH] characters in the file
Click on the [SOH] symbol in the file
CTRL=H to bring up the replace
Leave the 'Find What:' as is
Change the 'Replace with:' to the character of your choosing (comma,semicolon, other...)
Click 'Replace All'
Done and done!
In addition to Steffen Winkler:
[\x00-\x08\x0B-\x0C\x0E-\x1F]+
Ignores \r \n AND \t (carriage return, linefeed, tab)
I've a free text field on my form where the users can type in anything. Some users are pasting text into this field from Word documents with some weird characters that I don't want to go in my DB. (e.g. webding font characters) I'm trying to get a regular expression that would give me only the alphanum and the punctuation characters.
But when I try the following, the output is still all the characters. How can I leave them out?
<html><body><script type="text/javascript">var str="";document.write(str.replace(/[^a-zA-Z 0-9 [:punct]]+/g, " "));</script></body></html>
If you want only ascii, use /[^ -~]+/ as regex. The problem is your [:punct:] statement. Perhaps javascript does not support [:punct:]?