Vertica new-line CR LF replace - replace

I have a column in Vertica which I wish to export to .csv.
The problem is that this column has a CRLF in the middle, so the export reads each row as two lines. Example of input (the end-of-line delimiter was copy-pasted from Vertica):
First part
Second part
I tried the REPLACE option but it does not replace the sequence.
select TABLE, REPLACE(column_name, '\r\n', 'FUFU') from DB;
The command does replace random letters.
Hence I start to question if there is a CRLF (Notepad++ found it) or if there is some other character hidden there which I fail to replace...
Any help on what are other possible causes for the new line (I tried \n, \c, \r and any possible combinations...) or how to see it other than in Notepad (directly in Vertica?) will be greatly appreciated...
Alternatively, I found no way to explicitly define the end-of-line characters on export in Vertica. Does something like this exist?
Thanks

You might want to check how to use Extended String Literals in Vertica's SQL Reference Manual.
Example:
create table a ( id integer , txt varchar(20) ) ;
insert into a values ( 1 , 'abc' ) ;
insert into a values ( 2 , e'def\r\nrghi' ) ;
insert into a values ( 3 , e'ij\r\nklm' ) ;
insert into a values ( 4 , 'poq' ) ;
Then, to replace \r\n sequences - for example - with a space:
SQL=> select id, replace(txt, e'\r\n', ' ' ) from a order by id ;
id | replace
----+----------
1 | abc
2 | def rghi
3 | ij klm
4 | poq
(4 rows)
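The difference between an escaped and an unescaped literal can also be reproduced outside the database. Below is a minimal Python sketch (not Vertica code; the name `replace_crlf` is made up for illustration). It shows that a pattern written without escape processing is the four characters backslash, r, backslash, n, so it does not match a real CR LF pair:

```python
def replace_crlf(text: str, replacement: str = " ") -> str:
    """Replace each real CR LF pair (two characters) with `replacement`."""
    return text.replace("\r\n", replacement)


# A raw string plays the role of a plain SQL literal without the e'' prefix:
# it holds backslash, r, backslash, n (four characters, no line break).
unescaped_pattern = r"\r\n"

row = "def\r\nrghi"
print(len(unescaped_pattern))                      # 4
print(row.replace(unescaped_pattern, " ") == row)  # True: nothing was replaced
print(replace_crlf(row))                           # def rghi
```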

REGEXP_REPLACE(text, '(?>\r\n|\n|\r)', ' ')

Related

Oracle SQL String Manipulation

My field contains short codes that I want to access, such as C-COR3.
The issue is some records have additional information (F and H with numbers). An example is C-COR3 F1.54H19, I only care about C-COR3. Anything after "F" I want to ignore.
Code below works, but only if I hard-code the full F1.54H19. I want to use wildcards to abstract this for other occurrences that have F and H info in the field. (Ex C-R3 F0.18H18 -> C-R3 or C-COR3 F0.23H8.5 -> C-COR3), note varying short code string lengths.
/* Translates C-COR3 F1.54H19 to C-COR3. */
select distinct SUBSTR(lud_code_short,1,INSTR(lud_code_short, 'F1.54H19')-2)
from rep_dba.mytable
I've read that SUBSTR does not allow wildcards, but have had no luck trying my hand at REGEXP_INSTR and REGEXP_SUBSTR instead. Any help appreciated.
Assuming that the "code" is always the first continuous sequence of non-space characters (and that there are no leading spaces - if there are, that's easy to handle), you could do something like this. Note the str || ' ' in the call to instr() - that takes care of the case when the input string has no spaces in it to begin with. Also notice the last input - since there are no spaces anywhere, the output is the same as the input. (Showing that if the "code" is not always separated from the "additional information" by at least one space, the solution would not work.)
with
test_data (str) as (
select 'C-COR3 F14H2.5' from dual union all
select 'C-AB3' from dual union all
select null from dual union all
select 'C-AB2F14H2.5' from dual
)
select str, substr(str, 1, instr(str || ' ', ' ') - 1) as code
from test_data
;
STR CODE
-------------- --------------
C-COR3 F14H2.5 C-COR3
C-AB3 C-AB3
C-AB2F14H2.5 C-AB2F14H2.5
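For readers who want to check the logic outside Oracle, here is a small Python sketch of the same idea, under the same assumption that the code is everything before the first space. The function name `leading_code` is hypothetical; appending a space before searching mirrors the `str || ' '` trick:

```python
from typing import Optional


def leading_code(value: Optional[str]) -> Optional[str]:
    """Return everything before the first space, or the whole string if none."""
    if value is None:
        return None
    padded = value + " "  # guarantees at least one space, like str || ' '
    return value[: padded.index(" ")]


for sample in ["C-COR3 F14H2.5", "C-AB3", None, "C-AB2F14H2.5"]:
    print(repr(sample), "->", repr(leading_code(sample)))
```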
Try using regexp_replace within your query like below
SELECT
regexp_replace('C-COR3 F14H2.5', '(C-[[:alnum:]]+) [FH].*', '\1')
FROM dual;

Locate where is the nth occurrence of a token in a string separated by pipes

I am a newbie with regex and would like to know if it is possible to do this.
Is it possible to locate the token position of a substring in a string like the sample text below?
AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF
Requesting the position of the 1st occurrence of 'XXXX' I must get '3', requesting the 2nd occurrence I must get '5', and requesting the 3rd occurrence I must get '0' because there is no 3rd occurrence.
Can this be done using just regex?
Thanks in advance.
PS: If it is possible, I will implement this solution on DB2 v7r2 using REGEX functions, to replace a UDF I wrote a long time ago in PLSQL to do this job.
This isn't how I'd normally use regex...
But it can get the job done...
create variable mysource varchar(50)
default('AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF');
select
regexp_count(
substring(mysource
, 1
,regexp_instr(mysource
,'XXXX'
,1
,2 --occurrence
,1)
)
,'\|')
from sysibm.sysdummy1;
REGEXP_COUNT
5
Might need to concat a '|' to the end of the source if it's possible for the pattern to fall in the last position.
EDIT
Ok, here's a completely different way...using a recursive common table expression (RCTE)
Note that the solution is easiest if you ensure that the text ends with a delimiter...
create variable mysource varchar(50)
default('AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF|');
And the code..
with splitstring (pos, data, remain) as (
select 1
, substring(mysource,1,locate('|', mysource) -1 )
, substring(mysource,locate('|', mysource) + 1 )
from sysibm.sysdummy1
union all
select pos + 1
, substring(remain,1,locate('|', remain) -1 )
, substring(remain,locate('|', remain) + 1 )
from splitstring
where length(remain) > 0
)
, matches as (
select row_number() over (order by pos) as occur
,pos
from splitString
where data = 'XXXX'
)
select coalesce(pos,0) as pos
from sysibm.sysdummy1
left join matches
on occur = 2 ;
Results
POS
5
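As a way to sanity-check the expected answers (3, 5 and 0), here is a short Python sketch of the split-and-count logic. It is not DB2 code, and `token_position` is a name invented for this example:

```python
def token_position(source: str, token: str, occurrence: int, sep: str = "|") -> int:
    """Return the 1-based field index of the nth `token`, or 0 if absent."""
    seen = 0
    for index, field in enumerate(source.split(sep), start=1):
        if field == token:
            seen += 1
            if seen == occurrence:
                return index
    return 0


source = "AA|BBBBBBBBBB|XXXX||XXXX||FFFFFFFFFFF"
for n in (1, 2, 3):
    print(n, token_position(source, "XXXX", n))
```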

Hadoop process file with different field delimiters

What are the options to process a text file that has different field delimiters in the same file and a row delimiter other than newline?
Some fields in the file can be fixed length and some can be separated by a character.
Example:
100 xyz |abc#hello#200 xyz1 |abc1#world
In this example, 100 is the first field value, xyz is the second field value, abc is the 3rd field value, hello is the fourth field value. | and # are the delimiters for the 3rd and the 4th fields. The lines are separated by #.
A MapReduce, Pig or Hive solution would all be fine.
One option may be an MR job configured with a custom row delimiter that reads each entire row and processes it. But does any InputFormat accept a custom delimiter?
You can override the record delimiter and set it to #. After that, load each record as a line and replace the '|' and '#' characters with a space. You will then have all the fields separated by ' '. Use STRSPLIT to get the individual fields.
SET textinputformat.record.delimiter '#'
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,'\\|',' '),'#',' ') AS line; -- '|' is escaped because REPLACE takes a regular expression
C = FOREACH B GENERATE STRSPLIT(line,' ',4);
DUMP C;
You could try Hive with RegexSerDe
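To illustrate the regex idea behind a RegexSerDe-style approach, here is a rough Python sketch (not Hive configuration). The pattern is an assumption about the layout inferred from the single example row: two space-separated fields, optional padding, '|', a third field, '#', a fourth field, then '#' or end of input closing the row. `ROW_PATTERN` and `parse_rows` are hypothetical names.

```python
import re

# Assumed layout: field1, space, field2, optional padding, '|', field3, '#',
# field4, then '#' (or end of input) closing the row.
ROW_PATTERN = re.compile(r"(\S+) (\S+) *\|([^|#]+)#([^|#]+)(?:#|$)")


def parse_rows(data: str) -> list:
    """Return one 4-tuple of field values per row found in `data`."""
    return ROW_PATTERN.findall(data)


print(parse_rows("100 xyz |abc#hello#200 xyz1 |abc1#world"))
```

Real data with truly fixed-width fields would probably need explicit widths in the pattern rather than `\S+`.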

Escape function for regular expression or LIKE patterns

To forgo reading the entire problem, my basic question is:
Is there a function in PostgreSQL to escape regular expression characters in a string?
I've probed the documentation but was unable to find such a function.
Here is the full problem:
In a PostgreSQL database, I have a column with unique names in it. I also have a process which periodically inserts names into this field, and, to prevent duplicates, if it needs to enter a name that already exists, it appends a space and parentheses with a count to the end.
i.e. Name, Name (1), Name (2), Name (3), etc.
As it stands, I use the following code to find the next number to add in the series (written in plpgsql):
var_name_id := 1;
SELECT CAST(substring(a.name from E'\\((\\d+)\\)$') AS int)
INTO var_last_name_id
FROM my_table.names a
WHERE a.name LIKE var_name || ' (%)'
ORDER BY CAST(substring(a.name from E'\\((\\d+)\\)$') AS int) DESC
LIMIT 1;
IF var_last_name_id IS NOT NULL THEN
var_name_id = var_last_name_id + 1;
END IF;
var_new_name := var_name || ' (' || var_name_id || ')';
(var_name contains the name I'm trying to insert.)
This works for now, but the problem lies in the WHERE statement:
WHERE a.name LIKE var_name || ' (%)'
This check doesn't verify that the % in question is a number, and it doesn't account for multiple parentheses, as in something like "Name ((1))", and if either case existed a cast exception would be thrown.
The WHERE statement really needs to be something more like:
WHERE a.r1_name ~* var_name || E' \\(\\d+\\)'
But var_name could contain regular expression characters, which leads to the question above: Is there a function in PostgreSQL that escapes regular expression characters in a string, so I could do something like:
WHERE a.r1_name ~* regex_escape(var_name) || E' \\(\\d+\\)'
Any suggestions are much appreciated, including a possible reworking of my duplicate name solution.
To address the question at the top:
Assuming standard_conforming_strings = on, which has been the default since Postgres 9.1.
Regular expression escape function
Let's start with a complete list of characters with special meaning in regular expression patterns:
!$()*+.:<=>?[\]^{|}-
Wrapped in a bracket expression most of them lose their special meaning - with a few exceptions:
- needs to be first or last or it signifies a range of characters.
] and \ have to be escaped with \ (in the replacement, too).
After adding capturing parentheses for the back reference below we get this regexp pattern:
([!$()*+.:<=>?[\\\]^{|}-])
Using it, this function escapes all special characters with a backslash (\) - thereby removing the special meaning:
CREATE OR REPLACE FUNCTION f_regexp_escape(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT regexp_replace($1, '([!$()*+.:<=>?[\\\]^{|}-])', '\\\1', 'g')
$func$;
Add PARALLEL SAFE (because it is) in Postgres 10 or later to allow parallelism for queries using it.
Demo
SELECT f_regexp_escape('test(1) > Foo*');
Returns:
test\(1\) \> Foo\*
And while:
SELECT 'test(1) > Foo*' ~ 'test(1) > Foo*';
returns FALSE, which may come as a surprise to naive users,
SELECT 'test(1) > Foo*' ~ f_regexp_escape('test(1) > Foo*');
Returns TRUE as it should now.
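The same effect can be observed in other regex engines. As a rough comparison only (this is Python, not Postgres), the standard library's `re.escape` plays the role of `f_regexp_escape`, and the last lines build the numbered-duplicate pattern from the question:

```python
import re

name = "test(1) > Foo*"

# Unescaped, the parentheses form a group and '*' is a quantifier,
# so the pattern does not match the literal text.
print(re.fullmatch(name, name) is None)                 # True

# Escaped, every character is matched literally.
print(re.fullmatch(re.escape(name), name) is not None)  # True

# Pattern for "name (n)": escaped name, a space, digits in parentheses.
pattern = re.compile(re.escape(name) + r" \((\d+)\)")
print(pattern.fullmatch("test(1) > Foo* (7)").group(1))  # 7
```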
LIKE escape function
For completeness, the counterpart for LIKE patterns, where only three characters are special:
\%_
The manual:
The default escape character is the backslash but a different one can be selected by using the ESCAPE clause.
This function assumes the default:
CREATE OR REPLACE FUNCTION f_like_escape(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
SELECT replace(replace(replace($1
, '\', '\\') -- must come 1st
, '%', '\%')
, '_', '\_');
$func$;
We could use the more elegant regexp_replace() here, too, but for the few characters, a cascade of replace() functions is faster.
Again, PARALLEL SAFE in Postgres 10 or later.
Demo
SELECT f_like_escape('20% \ 50% low_prices');
Returns:
20\% \\ 50\% low\_prices
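The order of the replacements matters, which a small Python sketch of the same cascade makes easy to verify (the name `like_escape` is chosen here only to mirror the SQL function; backslash is assumed to be the escape character):

```python
def like_escape(value: str) -> str:
    """Escape the three LIKE-special characters with a backslash."""
    return (
        value.replace("\\", "\\\\")  # backslash first, or later escapes get doubled
             .replace("%", "\\%")
             .replace("_", "\\_")
    )


print(like_escape("20% \\ 50% low_prices"))  # 20\% \\ 50\% low\_prices
```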
How about trying something like this, substituting var_name for my hard-coded 'John Bernard':
create table my_table(name text primary key);
insert into my_table(name) values ('John Bernard'),
('John Bernard (1)'),
('John Bernard (2)'),
('John Bernard (3)');
select max(regexp_replace(substring(name, 13), ' |\(|\)', '', 'g')::integer+1)
from my_table
where substring(name, 1, 12)='John Bernard'
and substring(name, 13)~'^ \([1-9][0-9]*\)$';
max
-----
4
(1 row)
One caveat: I am assuming single-user access to the database while this process is running (and so are you in your approach). If that is not the case then the max(n)+1 approach will not be a good one.
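The max(n)+1 logic can be sketched in Python as follows. This only illustrates the matching rule, with a hypothetical helper `next_suffix`; the single-writer caveat applies to it just as much:

```python
import re
from typing import Iterable


def next_suffix(existing: Iterable[str], base: str) -> int:
    """Return 1 + the highest 'base (n)' suffix found, or 1 if there is none."""
    pattern = re.compile(re.escape(base) + r" \(([1-9][0-9]*)\)")
    numbers = [int(m.group(1)) for m in map(pattern.fullmatch, existing) if m]
    return max(numbers, default=0) + 1


names = ["John Bernard", "John Bernard (1)", "John Bernard (2)",
         "John Bernard (3)", "John Bernard ((9))"]
print(next_suffix(names, "John Bernard"))  # 4
```

Names such as "John Bernard ((9))" are ignored because the whole string has to match the pattern.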
Are you at liberty to change the schema? I think the problem would go away if you could use a composite primary key:
name text not null,
number integer not null,
primary key (name, number)
It then becomes the duty of the display layer to display Fred #0 as "Fred", Fred #1 as "Fred (1)", &c.
If you like, you can create a view for this duty. Here's the data:
=> select * from foo;
name | number
--------+--------
Fred | 0
Fred | 1
Barney | 0
Betty | 0
Betty | 1
Betty | 2
(6 rows)
The view:
create or replace view foo_view as
select *,
case
when number = 0 then
name
else
name || ' (' || number || ')'
end as name_and_number
from foo;
And the result:
=> select * from foo_view;
name | number | name_and_number
--------+--------+-----------------
Fred | 0 | Fred
Fred | 1 | Fred (1)
Barney | 0 | Barney
Betty | 0 | Betty
Betty | 1 | Betty (1)
Betty | 2 | Betty (2)
(6 rows)

Finding and removing Non-ASCII characters from an Oracle Varchar2

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit.
When we try to migrate these records they fail, as they contain characters that become multibyte UTF8 characters.
What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.
I would like to do :
SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]]','')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way doing what I want to do?
I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.
I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;
In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.
There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PLSQL function to receive your input string and return a varchar2.
In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.
That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.
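To make the "locate, then remove" steps concrete, here is a Python sketch of the same two operations done outside the database. It is only an illustration: `report_non_ascii` and `strip_non_ascii` are made-up names, and the byte counts assume the target encoding is UTF-8.

```python
import re

# Anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def report_non_ascii(text: str) -> list:
    """List (position, character, UTF-8 byte count) for each non-ASCII character."""
    return [(m.start(), m.group(), len(m.group().encode("utf-8")))
            for m in NON_ASCII.finditer(text)]


def strip_non_ascii(text: str) -> str:
    """Remove every character outside the 7-bit ASCII range."""
    return NON_ASCII.sub("", text)


sample = "number\u00b2 caf\u00e9"  # superscript two and e-acute
print(report_non_ascii(sample))
print(strip_non_ascii(sample))
```

The report shows which characters would grow beyond one byte after conversion, which is the information needed to decide whether to replace or drop them.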
The following also works:
select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
CONVERT (
'3735844533120%$03  ',
'US7ASCII',
'WE8ISO8859P1'),
'[^!#/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
FROM DUAL);
I had a similar issue and blogged about it here.
I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to give out the hex characters I wanted to replace, which I could then use in a utl_raw.cast_to_varchar2.
I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:='^'||TO_CHAR(INPUT_STR)||'^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :='^'||REPLACE(str, CHR(askey),'');
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),'^'),'^'));
RETURN (OUTPUT_STR);
end;
/
Then run this to update your data
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
You can try something like following to search for the column containing non-ascii character :
select * from your_table where your_col <> asciistr(your_col);
I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.
I tried this from an excellent comment
'[^ -~|[:space:]]'
but got this ORA-12728: invalid range in regular expression .
but it led me to my solution:
select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from
(select '- <- strangest thing here, and I want to keep line break after
-' deta from dual ) t
displays (in my TOAD tool) as
replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)
Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.
REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))
I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.
Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.
Here is the simple test in Oracle 11.2.03
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e for some reason this version of Oracle does not replace char(226) and above.
Using '['||chr(127)||'-'||chr(225)||']' gives the desired result.
If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different than '' (null string).
Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.
Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like
å => a
ä => a
ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:
SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;
This will replace anything outside that printable range with a question mark.
This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.
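For comparison, the same range-based rule can be tried in Python. This is a sketch rather than Oracle code: `mask_unprintable` is a hypothetical name, and the `re.ASCII` flag is used so that `\s` covers only ASCII whitespace.

```python
import re

# Keep printable ASCII (space through tilde) and ASCII whitespace such as
# tabs and line breaks; everything else is masked.
_UNWANTED = re.compile(r"[^ -~\s]", re.ASCII)


def mask_unprintable(text: str, mask: str = "?") -> str:
    """Replace characters that are neither printable ASCII nor whitespace."""
    return _UNWANTED.sub(mask, text)


print(mask_unprintable("line1\nna\u00efve\x00"))
```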
Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))
I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is).
The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est
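The whitelist idea is perhaps easier to follow in a general-purpose language. Here is a minimal Python sketch with the same allowed characters (the names `ALLOWED` and `keep_allowed` are invented for this example). One difference, as far as I can tell: the SQL version leaves '~' in place because it serves as the placeholder, whereas this sketch drops it.

```python
import string

# Same whitelist as the TRANSLATE call: letters, digits and listed punctuation.
# Note that the space character is not in the list, so spaces are removed too.
ALLOWED = set(string.ascii_letters + string.digits + '#$%^&*()_+-={}|[]:";<>?,./')


def keep_allowed(text: str) -> str:
    """Drop every character that is not in the whitelist."""
    return "".join(ch for ch in text if ch in ALLOWED)


print(keep_allowed("AB\u20ac12 #x"))  # AB12#x
```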