Trim Results After Certain Character - amazon-web-services

I have a table that lists all of the available product IDs.
For example, 1020, 1020A, 1020B.
I am looking to group these product IDs together.
Is it possible to do this via SQL?

To group rows with 1020, 1020A, 1020B into a group called 1020, you just need to use a substring expression in the GROUP BY clause:
select substring(your_column from 1 for 4), ...
from ...
group by substring(your_column from 1 for 4)
If you have IDs of a different length, such as 102A and 102B turning into 102, you'll need a regular expression instead. The general idea is that you can use any expression, not just a column name, in the GROUP BY clause.
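To illustrate the regular-expression variant, here is a minimal Python sketch (not SQL) that groups IDs by their leading run of digits. The sample IDs and the helper names `group_key` and `group_ids` are made up for this example; in a database the equivalent expression would go into GROUP BY, using whatever regex function your engine provides.

```python
import re
from collections import defaultdict


def group_key(product_id: str) -> str:
    """Return the leading run of digits, or the whole ID if it has none."""
    match = re.match(r"\d+", product_id)
    return match.group(0) if match else product_id


def group_ids(product_ids):
    """Bucket product IDs by their numeric prefix, keeping input order."""
    groups = defaultdict(list)
    for pid in product_ids:
        groups[group_key(pid)].append(pid)
    return dict(groups)


if __name__ == "__main__":
    print(group_ids(["1020", "1020A", "1020B", "102A", "102B"]))
```

Unlike a fixed-length substring, the digit prefix handles 1020A and 102A with the same expression.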

Related

Select group > 10 in REGEXP_REPLACE in teradata

I have a regex with around 19 groups, and I need to select every single one. However, when I try to select a group > 10, it selects group 1 and appends the remaining digit after it.
My regex looks like this:
pattern ='GPU = (\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s\sHDD = (\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)\s(\d*)'
And I want to replace all of it with a single item from the groups, using this code:
regexp_replace(text,pattern,'\19') as GPU_19 //OUTPUT--> AB9
I am using Teradata format.

How to limit characters when using regexp_extract in hive?

I have a fixed length string in which I need to extract portions as fields. First 5 characters to ACCOUNT1, next 2 characters to ACCOUNT2 and so on.
I would like to use regexp_extract (not substring), but I am missing something: the calls below return nothing.
select regexp_extract('47t7916A2088M040323','(.*){0,5}',1) as ACCOUNT1,
regexp_extract('47t7916A2088M040323','(.*){6,8}',1) as ACCOUNT2 --and so on
If you want to use a regexp, do it as in this example. For Account1, the expression '^(.{5})' means: ^ anchors the beginning of the string, followed by capturing group 1, which consists of any 5 characters (a group is delimited by parentheses). {5} is a quantifier meaning exactly 5 times. For Account2, capturing group 2 comes right after group 1: (.{2}) means any two characters.
In this example, the second regexp contains two groups (for the first and second columns), and we extract the second group.
hive> select regexp_extract('47t7916A2088M040323','^(.{5})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2;
OK
47t79 16
Time taken: 0.064 seconds, Fetched: 1 row(s)
Actually, you can use the same regexp, containing all the groups, for every column and extract a different capturing group each time.
Example using the same regexp and extracting different groups:
hive> select regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2
> ;
OK
47t79 16
Time taken: 1.043 seconds, Fetched: 1 row(s)
Add more groups, one for each column. This approach works only for fixed-length columns. If you want to parse a delimited string, put the delimiter characters between the groups, change each group to match everything except the delimiter, and remove or modify the quantifiers. For a simple case like this one, substring (or split, for a delimited string) is simpler and cleaner, whereas a regexp lets you parse very complex patterns. Hope you get the idea.
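The same pattern can be tried outside Hive. The following is a small Python sketch, assuming Python's `re` module treats this particular pattern the same way as Hive's Java regex engine (true for these simple constructs); the names `FIXED_WIDTH` and `split_account` are invented for the example.

```python
import re

# One anchored pattern with one capturing group per fixed-width field.
FIXED_WIDTH = re.compile(r"^(.{5})(.{2})")


def split_account(value: str):
    """Return (ACCOUNT1, ACCOUNT2), or None if the string is too short."""
    match = FIXED_WIDTH.match(value)
    return match.groups() if match else None


if __name__ == "__main__":
    print(split_account("47t7916A2088M040323"))
```

Adding a third field means appending another group such as `(.{4})` to the pattern and reading one more element from the result.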

PostgreSQL - finding string using regular expression

Within Postgres, I want to search a column for a string (an account number). I have a log table with a parameters column that stores the parameters passed in from the application. It is a paragraph of text, and one of the parameters stored in the column is the account number.
The position of the account number is not consistent in the text and some rows in this table have nothing in the column (since no parameters are passed on certain screens). The account number has the following format: L1234567899. So for the account number, the first character is a letter and then it is followed by ten digits.
I am looking for a way to extract the account number alone from this column so I can use it in a view for a report.
So far what I have tried is getting it into an array, but since the position changes, I cannot count on it being in the same place.
select foo from regexp_split_to_array(
(select param from log_table where id = 9088), E'\\s+') as foo
You can use regexp_match() to achieve that result.
(regexp_match(foo,'[A-Z][0-9]{10}'))[1]
Use substring to pull out the match group.
select substring ('column text' from '[A-Z]\d{10}')
Reference: PostgreSQL regular expression capture group in select
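To check the pattern against sample text before putting it in a view, a quick Python sketch can help. The sample parameter strings and the names `ACCOUNT_RE` and `extract_account` are hypothetical; the pattern itself is the one from the answers above (one uppercase letter followed by ten digits).

```python
import re
from typing import Optional

# One uppercase letter followed by exactly ten digits, anywhere in the text.
ACCOUNT_RE = re.compile(r"[A-Z][0-9]{10}")


def extract_account(params: Optional[str]) -> Optional[str]:
    """Return the first account number found, or None for empty/no-match rows."""
    if not params:
        return None
    match = ACCOUNT_RE.search(params)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(extract_account("screen=7 acct=L1234567899 user=bob"))
    print(extract_account(None))
```

Rows with an empty parameters column simply yield no account number, which mirrors the NULL you would get from the SQL expressions.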

POSIX ERE Regular expression to find repeated substring

I have a set of strings containing a minimum of 1 and a maximum of 3 values in a format like this:
123;456;789
123;123;456
123;123;123
123;456;456
123;456;123
I'm trying to write a regular expression that finds values repeated within the same string: for 123;456;789 it would return null, for 123;456;456 it would return 456, and for 123;456;123 it would return 123.
I managed to write this expression:
(.*?);?([0-9]+);?(.*?)\2
It works in the sense that it returns null when there are no duplicate values, but it doesn't return exactly the value I need. For example, for the string 123;456;456 it returns 123;456;456, and for the string 123;123;123 it returns 123;123.
What I need is to return only the value matched by the ([0-9]+) portion of the expression. From what I've read, this would normally be done using non-capturing groups, but either I'm doing it wrong or Oracle SQL doesn't support them: if I try the ?: syntax, the result is not what I expect.
Any suggestions on how you would go about this in Oracle SQL? The purpose of this expression is to use it in a query.
SELECT REGEXP_SUBSTR(column, "expression") FROM DUAL;
EDIT:
Actually according to https://docs.oracle.com/cd/B12037_01/appdev.101/b10795/adfns_re.htm
Oracle Database implements regular expression support compliant with the POSIX Extended Regular Expression (ERE) specification.
Which according to https://www.regular-expressions.info/refcapture.html
Non-capturing group is not supported by POSIX ERE
This answer describes how to select a matching group from a regex. So using that,
SELECT regexp_substr(column, '(\d{3}).*\1', 1, 1, NULL, 1) from dual;
-- The final argument (1) selects capture group 1.
Working demo of the regex (courtesy: OP).
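The backreference pattern can be checked against the sample strings with a short Python sketch. This assumes values are always three digits, as in the question's data, and that Python's `re` handles `\1` the same way for this pattern; `REPEATED` and `repeated_value` are names made up for the example.

```python
import re

# Capture three digits, then require the same three digits again later on.
REPEATED = re.compile(r"(\d{3}).*\1")


def repeated_value(s: str):
    """Return the repeated value (capture group 1), or None if nothing repeats."""
    match = REPEATED.search(s)
    return match.group(1) if match else None


if __name__ == "__main__":
    for sample in ["123;456;789", "123;123;456", "123;123;123",
                   "123;456;456", "123;456;123"]:
        print(sample, "->", repeated_value(sample))
```

Returning group 1 rather than the whole match is what trims the result down to the repeated value alone.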
If you only have three substrings, then you can use a brute force method. It is not particularly pretty, but it should do the job:
select (case when val1 in (val2, val3) then val1
when val2 = val3 then val2
end) as repeated
from (select t.*,
regexp_substr(col, '[^;]+', 1, 1) as val1,
regexp_substr(col, '[^;]+', 1, 2) as val2,
regexp_substr(col, '[^;]+', 1, 3) as val3
from t
) t
where val1 in (val2, val3) or val2 = val3;
Please bear with me and consider a different approach. Look at the problem a little differently and break it down in a way that gives you more flexibility in how you are able to look at the data. It may or may not apply to your situation, but it is worth keeping in mind that there are always different ways to approach a problem.
What if you turned the strings into rows so you could do standard SQL against them? That way you could not only count elements that repeat but perhaps apply aggregate functions to look for patterns across sets or something.
Consider this then. The first Common Table Expression (CTE) builds the original data set. The second one, tbl_split, turns that data into a row for each element in the list. Uncomment the select that immediately follows to see. The last query selects from the split data, showing the count of how often the element occurs in the id's data. Uncomment the HAVING line to restrict the output to those elements that appear more than one time for the data you are after.
With the data in rows you can see how other aggregate functions could be applied to slice and dice to reveal patterns, etc.
SQL> with tbl_orig(id, str) as (
select 1, '123;456;789' from dual union all
select 2, '123;123;456' from dual union all
select 3, '123;123;123' from dual union all
select 4, '123;456;456' from dual union all
select 5, '123;456;123' from dual
),
tbl_split(id, element) as (
select id,
regexp_substr(str, '(.*?)(;|$)', 1, level, NULL, 1) element
from tbl_orig
connect by level <= regexp_count(str, ';')+1
and prior id = id
and prior sys_guid() is not null
)
--select * from tbl_split;
select distinct id, element, count(element)
from tbl_split
group by id, element
--having count(element) > 1
order by id;
        ID ELEMENT     COUNT(ELEMENT)
---------- ----------- --------------
         1 123                      1
         1 456                      1
         1 789                      1
         2 123                      2
         2 456                      1
         3 123                      3
         4 123                      1
         4 456                      2
         5 123                      2
         5 456                      1
10 rows selected.
SQL>
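The split-then-count idea is independent of Oracle. As a rough Python sketch of the same logic (split each string into elements, count them, keep those seen more than once), with `repeated_elements` being a name invented for the example:

```python
from collections import Counter


def repeated_elements(s: str, sep: str = ";"):
    """Split the string on sep and return {element: count} for count > 1."""
    counts = Counter(s.split(sep))
    return {element: n for element, n in counts.items() if n > 1}


if __name__ == "__main__":
    for sample in ["123;456;789", "123;123;456", "123;123;123",
                   "123;456;456", "123;456;123"]:
        print(sample, "->", repeated_elements(sample))
```

The `if n > 1` filter plays the role of the commented-out HAVING line in the query above.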

Regex in Hive QL (RLIKE) - performance?

I'm wondering how, or whether, I can improve the regex I'm using in a query. I have a set of identifiers for certain user groups. They can be in two main formats:
X123 or XY12, (type 1)
any two letter combo, excluding XY (type 2)
Type 1 groups are always of length 4. Each is either the letter X followed by a number between 100 and 999 (inclusive) or XY followed by a number between 0 and 99 (zero-padded to length 2).
Type 2 groups are 2-letter strings, with any letter allowed, excluding XY (although my query doesn't specify this).
A user can belong to multiple groups, in which case the groups are separated by a pound symbol (#). Here's an example:
groups      user     age
X124        john     23
XY22#AB     mike     33
AB          peter    21
X122#XY01   francis  43
I want to count rows in which at least one group in the second format appears, i.e. where the user is not exclusively a member of groups in the first format.
I need to catch all rows (i.e. users) that don't belong exclusively to the first type of group. In the example above, I want to exclude users john and francis because they are members only of type 1 groups.
On the other hand, mike is OK because he's a member of the AB group (i.e. a group of type 2).
I'm currently doing it like this:
select
count(*)
from
users
where
groups not rlike '^(X[Y1-9][0-9]{2,2})(#X[Y1-9][0-9]{2,2})*$'
Is this bad performance-wise? And how should I approach fixing it?
I want to count rows in which at least one group in the second format appears.
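It seems a bit simpler, then, to select the rows where groups matches (with rlike):
\b(?:(?!XY)[A-Z]{2})\b
\b is a word boundary. It doesn't consume a character; instead, it asserts that the position lies between a word character (letter, digit, or underscore) and a non-word character, or at the start or end of the string.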
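As a sanity check, the positive pattern and the original negated pattern can be compared on the sample rows. The following is a Python sketch, assuming Python's `re` treats `\b` and the negative lookahead the same way as Hive's Java regex engine for these patterns; the `rows` list copies the example table, and the function names are made up. It says nothing about relative performance in Hive.

```python
import re

# Positive test: some standalone two-letter group other than XY.
TYPE2 = re.compile(r"\b(?!XY)[A-Z]{2}\b")
# Original test, negated: the whole string consists only of type 1 groups.
ONLY_TYPE1 = re.compile(r"^(X[Y1-9][0-9]{2})(#X[Y1-9][0-9]{2})*$")

rows = [
    ("X124", "john", 23),
    ("XY22#AB", "mike", 33),
    ("AB", "peter", 21),
    ("X122#XY01", "francis", 43),
]


def has_type2(groups: str) -> bool:
    return TYPE2.search(groups) is not None


def not_only_type1(groups: str) -> bool:
    return ONLY_TYPE1.match(groups) is None


if __name__ == "__main__":
    print([user for groups, user, _ in rows if has_type2(groups)])
```

On these four rows both predicates select the same users, so either form gives the same count.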