update where symbol length less than 3 characters KDB+/Q - sql-update

I have a table like this:
test:([]column1:`A`B`C`D`E;column2:`Consumer`RealEstate`27`85`Technology)
I need to write an update query where the character count of column2 is 2 or less, but I couldn't find any way to reference the character length of a symbol to use in a where clause. How would I write a query like that so my result is the following?
test:([]column1:`A`B`C`D`E;column2:`Consumer`RealEstate`NewCategory`NewCategory`Technology)

One way to accomplish what you want is to convert the symbols into strings (i.e. lists of chars) and apply the where clause to the count of each of those strings:
update column2:`NewCategory from `test where 3>count each string column2
(using a backtick on the table name here, assuming you want to apply the change in place)

Another option is to use a vector conditional
update column2:?[3>count each string column2;`NewCatergory;column2] from test

Related

BigQuery remove <0x00> hidden characters from a column

I have a table with unwanted hidden characters such as my_table:
id
fruits
1
STuff1 stuff_2 ����������������������
2
Blahblah-blahblah �������������
3
nothing
How do I remove ���������������������� when selecting this column?
Current query:
SELECT fruits, TRIM(REGEXP_REPLACE(fruits, r'[^a-zA-Z,0-9,-]', ' ')) AS new_fruits
FROM `project-id.MYDATASET.my_table`
This query is too flaw because I'm worried if I accidentally exclude/replace important data. I only want to be specific on this weird characters.
Upon opening the data as csv, the weird characters shows as <0x00>. How do I solve this?
First you have to identify which is this character, because as it is a non printable this sign is just a random representation. For replace it without remove any other important information, do the following:
identify the hexadecimal of the character. Copy from csv and past on this site:
Use the replace function in bigquery replacing the char of this hex, as following:
SELECT trim(replace(string_field_1,chr(0xfffd)," ")) FROM `<project>.<dataset>.<table>`;
if your character result is different than fffd, put you value on the chr() function

Extract string after last match strings [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I'm using BigQuery and I want to extract string after the specific match strings, in my case, the strings is sc
I have a string like this :
www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)
My expected result is:
titis)
Is this possible?
In general, across all RDBMS finding the index of the last instance of a match in a string is easy to compute by first reversing the string. Then we are only looking for the first match.
Update: BigQuery
Follow the documentation for REGEXP_EXTRACT in the String Functions documentation for BigQuery
NOTE: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
However, this problem can be solved without RegEx.
BigQuery supports array processing and has a SPLIT function, so you could split by the lookup variable and capture only the last result:
SELECT ARRAY_REVERSE(SPLIT( !YOUR COLUMN HERE! , "sc"))[OFFSET(1)]
The following adaptation from my original submission may still work:
SELECT REVERSE(SUBSTR(REVERSE(#text), 1, STRPOS(REVERSE(#text), "cs") -1))
For those who have a similar requirement in MS SQL Server the following syntax can be used.
other RDBMS can use a similar query, you will have to use the appropriate platform functions to acheive the result.
DECLARE #text varchar(200) = 'www.xxss.com?psct=T-EST2%20.coms&.com/u[sc''sc(mascscin'', sc''.c(scscossccnfiscg.scjs'']-/ci=1(sctitis)'
SELECT REVERSE(LEFT(REVERSE(#text), CharIndex('cs', REVERSE(#text),1) -1))
Produces: titis)
You could achieve a similar result by obtaining the last index of 'sc' as above and using that value in a SUBSTRING however for that to work you need to re-compute the Length, this solution instead uses the LEFT function and then REVERSE's the result , reducing the functional complexity of the query by 1 (1 less function call)
Step this through:
Reverse the value:
SELECT REVERSE(#text)
Results in:
)sititcs(1=ic/-]'sjcs.gcsifnccssocscs(c.'cs ,'nicscsam(cs'cs[u/moc.&smoc.02%2TSE-T=tcsp?moc.ssxx.www
Now we find the first Index of 'cs'
Note: we have to reverse the sequece of the lookup string as well!
SELECT CharIndex('cs', REVERSE(#text),1)
Result: 7
Select the characters before this index:
Note: we must use -1 here because SQL uses 1-based index result from CharIndex so we must reduce it by 1
SELECT LEFT(REVERSE(#text), CharIndex('cs', REVERSE(#text),1) -1)
Finally, we reverse the result:
SELECT REVERSE(LEFT(REVERSE(#text), CharIndex('cs', REVERSE(#text),1) -1))
Guess you could use 'sc' as seperator, define (if constant string length) string length in your query (wildcard),
STRING_SPLIT ( string , separator )

regular expression replace for SQL

I have to replace a string pattern in SQL with empty string, could anyone please suggest me?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002
There are the 4 digit codes with first 2 characters "alphabets" and last two are digits. This is always a 4 digit code. And I have to replace all codes except the codes starting with "AE".
I can have 0 or more instances of "AE" codes in the string. The final output should be a formatted string "separated by commas" for multiple "AE" codes as mentioned above.
Here is one option calling regex_replace multiple times, eliminating the "not required" strings little by little in each iteration to arrive at the required output.
SELECT regexp_replace(
regexp_replace(
regexp_replace(
'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X','g'
),'..X','','g'
),',$','','g'
)
See Demo here
I would convert the list to an array, unnest that to rows then filter out those that should be kept and aggregate it back to a string:
select string_agg(t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001',',') as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example why storing comma separated values in a single column is not such a good idea to begin with.

Partially match integers in PostgreSQL queries

So in my PostgreSQL 10 I have a column of type integer. This column represents a code of products and it should be searched against another code or part of the code. The values of the column are made of three parts, a five-digit part and two two-digit parts. Users can search for only the first part, the first-second or first-second-third.
So, in my column I have , say 123451233 the user searches for 12345 (the first part). I want to be able to return the 123451233. Same goes if the users also searches for 1234512 or 123451233.
Unfortunately I cannot change the type of column or break the one column into three (one for every part). How can I do this? I cannot use LIKE. Maybe something like a regex for integers?
Thanks
Consider to use simple arithmetic.
log(value)::int + 1 returns the number of digits in integer part of the value and using this:
value/(10^(log(value)::int-log(search_input)::int))::int
returns value truncated to the same digits number as search_input so, finally
search_input = value/(10^(log(value)::int-log(search_input)::int))::int
will make the trick.
It is more complex literally but also could be more efficient then strings manipulations.
PS: But having index like create index idx on your_table(cast(your_column as text)); search like
select * from your_table
where cast(your_column as text) like search_input || '%';
is the best case IMO.
You do not need regex functions. Cast the integer to text and use the function left(), example:
create table my_table(code int); -- or bigint
insert into my_table values (123451233);
with input_data(input_code) as (
values('1234512')
)
select t.*
from my_table t
cross join input_data
where left(code::text, length(input_code)) = input_code;
code
-----------
123451233
(1 row)

R: replacing special character in multiple columns of a data frame

I try to replace the german special character "ö" in a dataframe by "oe". The charcter occurs in multiple columns so I would like to be able to do this all in one by not having to specify individual columns.
Here is a small example of the data frame
data <- data.frame(a=c("aö","ab","ac"),b=c("bö","bb","ab"),c=c("öc","öb","acö"))
I tried :
data[data=="ö"]<-"oe"
but this did not work since I would need to work with regular expressions here. However when I try :
data[grepl("ö",data)]<-"oe"
I do not get what I want.
The dataframe at the end should look like:
> data
a b c
1 aoe boe oec
2 ab bb oeb
3 ac ab acoe
>
The file is a csv import that I import by read.csv. However, there seems to be no option to change to fix this with the import statement.
How do I get the desired outcome?
Here's one way to do it:
data <- apply(data,2,function(x) gsub("ö",'oe',x))
Explanation:
Your grepl doesn't work because grepl just returns a boolean matrix (TRUE/FALSE) corresponding to the elements in your data frame for which the regex matches. What the assignment then does is replace not just the character you want replaced but the entire string. To replace part of a string, you need sub (if you want to replace just once in each string) or gsub (if you want all occurrences replaces). To apply that to every column you loop over the columns using apply.
If you want to return a data frame, you can use:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))