Hive query to extract a column which has alphanumeric characters

I need to filter a column so that only alphanumeric values are returned, meaning each value must contain at least one letter and at least one digit.
For example, given the five values 333, abc, ab333, +33 and +ab33, the output should contain only ab333 and +ab33.
I tried the rlike function with the query below, but it returns all the records in the table.
select column_name from table_name where column_name rlike '^[a-zA-Z0-9]+$';
I also tried the query below, but it gives the wrong result for values with special characters such as +.
select column_name from table_name where column_name not rlike '^[0-9]+$';
Could anybody point out my mistake, or suggest a different approach?

You can use
RLIKE '^\\+?(?:[0-9]+[a-zA-Z]|[a-zA-Z]+[0-9])[0-9a-zA-Z]*$'
Details:
^ - start of string
\+? - an optional + symbol (written as \\+? inside the Hive string literal)
(?:[0-9]+[a-zA-Z]|[a-zA-Z]+[0-9]) - one or more digits followed by a letter, or one or more letters followed by a digit, and then
[0-9a-zA-Z]* - zero or more alphanumeric chars
$ - end of string.
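The pattern can be checked against the five sample values from the question. The sketch below uses Python's re module as a stand-in, not Hive's Java regex engine, so it only illustrates the pattern logic; the names ALNUM_MIXED and keep_mixed are made up for this example.

```python
import re

# Same pattern as the RLIKE answer. A Python raw string needs a single
# backslash where the Hive string literal needs two.
ALNUM_MIXED = re.compile(r'^\+?(?:[0-9]+[a-zA-Z]|[a-zA-Z]+[0-9])[0-9a-zA-Z]*$')

def keep_mixed(values):
    """Return the values that mix letters and digits (optional leading +)."""
    return [v for v in values if ALNUM_MIXED.search(v)]

samples = ['333', 'abc', 'ab333', '+33', '+ab33']
print(keep_mixed(samples))  # ['ab333', '+ab33']
```

Only ab333 and +ab33 survive, because each alternative inside the group requires a letter next to a digit.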

Related

Tableau Regex Extract date from file name

I have a field that holds the name of the text file used as the data source. The file name is formatted like "file_name_example_2022-11-17_14.45.56.txt", where "2022-11-17_14.45.56" is the date and time. I know I can chain RIGHT and LEFT calls to extract the date time as a separate field, but I wanted to see if REGEXP_EXTRACT would provide a cleaner way to do it. I've been looking at regular expression documentation and can't figure it out. I am trying to end up with a full date time field.
So far I have tried
REGEXP_EXTRACT([File Paths], '\d(.+)')
and that results in "022-11-17_14.45.56.txt"

You can use
REGEXP_EXTRACT([File Paths], '\d{4}-\d{1,2}-\d{1,2}_\d{1,2}\.\d{1,2}\.\d{1,2}')
Details:
\d{4}-\d{1,2}-\d{1,2} - four digits, -, one or two digits, -, one or two digits
_ - a _ char
\d{1,2}\.\d{1,2}\.\d{1,2} - one or two digits, ., one or two digits, ., one or two digits.
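To see what the pattern pulls out of the sample file name, and how the extracted text could become a full date time value, here is a small sketch in Python rather than Tableau. The function name extract_timestamp and the strptime format string are assumptions made for this illustration.

```python
import re
from datetime import datetime

# Same pattern as in the answer: date, underscore, dot-separated time.
STAMP = re.compile(r'\d{4}-\d{1,2}-\d{1,2}_\d{1,2}\.\d{1,2}\.\d{1,2}')

def extract_timestamp(file_name):
    """Return the embedded date time as a datetime, or None if absent."""
    m = STAMP.search(file_name)
    if m is None:
        return None
    # The format string mirrors the layout of the matched text.
    return datetime.strptime(m.group(0), '%Y-%m-%d_%H.%M.%S')

print(extract_timestamp('file_name_example_2022-11-17_14.45.56.txt'))
```

Matching the whole stamp first and parsing it second keeps the regex simple and leaves range validation (month, hour and so on) to the date parser.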

REGEXP_EXTRACT with String Value in Bigquery

I want to extract the words from a column whose values look like this: 'p-fr-youtube-car'. Each word should be extracted into its own column.
INPUT:
p-fr-youtube-car
DESIRED OUTPUT:
Country = fr
Channel = youtube
Item = car
I've tried the expression below to extract the first word, but can't figure out the rest. What regex will achieve my desired output from this input? And how can I make it case-insensitive, so that fr and FR are treated the same?
REGEXP_EXTRACT_ALL(CampaignName, r"^p-([a-z]*)") AS Country

You can use [^-]+ to match parts between hyphens and only capture what you need to fetch.
To get strings like youtube, you can use
REGEXP_EXTRACT_ALL(CampaignName, r'^p-[^-]+-([^-]+)')
To get strings like car, you can use
REGEXP_EXTRACT_ALL(CampaignName, r'^p-[^-]+-[^-]+-([^-]+)')
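So, [^-]+ matches one or more chars other than -, and ([^-]+) is the same pattern wrapped in a capturing group whose contents the function returns. REGEXP_EXTRACT returns that group as a single string, while REGEXP_EXTRACT_ALL returns an array of all matches. Since [^-]+ does not depend on letter case, fr and FR are both captured.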
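The "skip N fields, capture the next one" idea can be sketched outside BigQuery. The Python helper below (the name field and its 1-based index are assumptions for this example) builds the same patterns as above by repeating the uncaptured [^-]+- prefix.

```python
import re

def field(campaign, index):
    """Return the index-th hyphen-separated field after the leading 'p' (1-based)."""
    # index=1 -> ^p-([^-]+), index=2 -> ^p-[^-]+-([^-]+), and so on.
    pattern = r'^p-' + r'[^-]+-' * (index - 1) + r'([^-]+)'
    m = re.search(pattern, campaign)
    return m.group(1) if m else None

name = 'p-fr-youtube-car'
print(field(name, 1), field(name, 2), field(name, 3))
```

Lower-casing the result, for example field(name, 1).lower(), is one way to treat fr and FR as the same value.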

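You can use named groups in a regex engine that allows several capturing groups (BigQuery's REGEXP_EXTRACT accepts at most one capturing group per pattern).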
Example Regex:
p-(?P<Country>[a-z]*)\-(?P<Channel>[a-z]*)\-(?P<Item>[a-z]*)$
https://regex101.com/r/fKoBIn/3
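As an illustration of the named-group approach in an engine that supports it, here is a sketch using Python's re module. The function name parse_campaign is hypothetical, and the IGNORECASE flag is added here to address the case-insensitivity part of the question.

```python
import re

# (?P<name>...) is Python's named-group syntax.
# IGNORECASE lets fr and FR both match the [a-z] classes.
CAMPAIGN = re.compile(
    r'p-(?P<Country>[a-z]*)-(?P<Channel>[a-z]*)-(?P<Item>[a-z]*)$',
    re.IGNORECASE,
)

def parse_campaign(name):
    """Return a dict of lower-cased fields, or None if the name does not fit."""
    m = CAMPAIGN.search(name)
    if m is None:
        return None
    return {key: value.lower() for key, value in m.groupdict().items()}

print(parse_campaign('p-fr-youtube-car'))
```

Lower-casing the captured values means 'P-FR-YouTube-Car' and 'p-fr-youtube-car' produce the same dictionary.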

The following is for BigQuery Standard SQL.
I would recommend using SPLIT in cases like yours:
#standardSQL
SELECT CampaignName,
parts[SAFE_OFFSET(1)] AS Country,
parts[SAFE_OFFSET(2)] AS Channel,
parts[SAFE_OFFSET(3)] AS Item
FROM `project.dataset.table`,
UNNEST([STRUCT(SPLIT(CampaignName, '-') AS parts)])
Applied to the sample data from your question, the output is
Row  CampaignName      Country  Channel  Item
1    p-fr-youtube-car  fr       youtube  car
Meanwhile, if for some reason you are required to use a regexp, you can use the query below:
#standardSQL
SELECT CampaignName,
parts[SAFE_OFFSET(1)] AS Country,
parts[SAFE_OFFSET(2)] AS Channel,
parts[SAFE_OFFSET(3)] AS Item
FROM `project.dataset.table`,
UNNEST([STRUCT(REGEXP_EXTRACT_ALL(CampaignName, r'(?:^|-)([^-]*)') AS parts)])
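Both variants produce the same array of parts and then index into it. A rough Python analogue is sketched below; safe_offset is a hypothetical helper that imitates SAFE_OFFSET by returning None for a missing position instead of failing.

```python
import re

def safe_offset(parts, index):
    """Imitate SAFE_OFFSET: None instead of an error when index is out of range."""
    return parts[index] if 0 <= index < len(parts) else None

def split_campaign(name):
    """Split on hyphens and pick fields 1 to 3, like the SPLIT query."""
    parts = name.split('-')
    return {
        'Country': safe_offset(parts, 1),
        'Channel': safe_offset(parts, 2),
        'Item': safe_offset(parts, 3),
    }

# The regex variant yields the same list of parts as the split.
regex_parts = re.findall(r'(?:^|-)([^-]*)', 'p-fr-youtube-car')

print(split_campaign('p-fr-youtube-car'))
print(regex_parts)
```

A short name such as 'p-fr' still returns a result, with None for the missing Channel and Item fields.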

SQL Regex Pattern, How to match only a specific variable between two characters? (see Sample Output)

I have these inputs:
John/Bean/4000-M100
John/4000-M100
John/4000
How can I get just the 4000? Note that this value changes from time to time (it can be 3000 or 2000), so how can I handle that with a regex pattern?
Here's my attempt so far. It satisfies John/4000-M100 and John/4000, but the input with two slashes does not match the regex I have:
REGEXP_REPLACE(REGEXP_SUBSTR(a.demand,'/(.*)-|/(.*)',1,1),'-|/','')

You can use this query to get the results you want:
select regexp_replace(data, '^.*/(\d{4})[^/]*$', '\1')
from test
The regex looks for a set of 4 digits following a / and then not followed by another / before the end of the line and replaces the entire content of the string with those 4 digits.
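The replace-everything-with-the-captured-group technique can be tried on the three sample inputs. This is a sketch in Python, not Oracle, and the function name last_four_digit_code is made up for the example.

```python
import re

def last_four_digit_code(value):
    """Replace the whole string with the 4 digits that follow the last slash."""
    # The greedy ^.*/ runs up to the last slash; [^/]*$ consumes the tail.
    return re.sub(r'^.*/(\d{4})[^/]*$', r'\1', value)

for sample in ['John/Bean/4000-M100', 'John/4000-M100', 'John/4000']:
    print(sample, '->', last_four_digit_code(sample))
```

As with regexp_replace, a string that does not match the pattern is returned unchanged.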

This would also work, unless you specifically need a digit followed by three zeros. See it in action here, for as long as the link lives: http://sqlfiddle.com/#!4/23656/5
create table test_table
( data varchar2(200));
insert into test_table values('John/Bean/4000-M100');
insert into test_table values('John/4000-M100');
insert into test_table values('John/4000');
select a.*,
replace(REGEXP_SUBSTR(a.data,'/\d{4}'), '/', '')
from test_table a;

The following will match any multiple of 1000 less than 10000 when it is preceded by a slash:
\/[1-9]0{3}
To match any four-digit number preceded by a slash and not followed by another digit, such as 4031 in
Sal_AS_180763852/4200009751_S5_154552/4031
try:
\/\d{3}(?:(?:\d[^\d])|(?:\d$))
https://regex101.com/r/Am34WO/1
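Note that the second pattern includes the slash, and possibly one trailing non-digit, in the match. The Python sketch below rewrites it as /(\d{4})(?:\D|$), an equivalent form chosen for this example, with a capturing group so that only the four digits are returned. The names FOUR_DIGITS and four_digit_after_slash are hypothetical.

```python
import re

# Slash, exactly four digits (captured), then a non-digit or end of string.
FOUR_DIGITS = re.compile(r'/(\d{4})(?:\D|$)')

def four_digit_after_slash(value):
    """Return the first four-digit number that follows a slash, or None."""
    m = FOUR_DIGITS.search(value)
    return m.group(1) if m else None

print(four_digit_after_slash('Sal_AS_180763852/4200009751_S5_154552/4031'))
```

The longer number after the first slash (4200009751) is skipped because its fourth digit is followed by another digit.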

POSTGRESQL at least 8 characters in name with LIKE or REGEX

SELECT name
FROM players
WHERE name ~ '(.*){8,}'
It is really simple but I cannot seem to get it.
I have a list with names and I have to filter out the ones with at least 8 characters... But I still get the full list.
What am I doing wrong?
Thanks! :)

A (.*){8,} regex means "match any zero or more chars, 8 or more times". Since .* can match an empty string, every non-null name satisfies it, which is why you still get the full list.
If you want to match any 8 or more chars, you would use .{8,}.
However, using character_length is more appropriate for this task:
char_length(string) or character_length(string) - returns int, the number of characters in string
CREATE TABLE table1
(s character varying)
;
INSERT INTO table1
(s)
VALUES
('abc'),
('abc45678'),
('abc45678910')
;
SELECT * from table1 WHERE character_length(s) >= 8;
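The difference between the three filters can be illustrated on the same sample rows. The sketch below uses Python's re module, not PostgreSQL's regex engine, and the variable names are made up for the example.

```python
import re

names = ['abc', 'abc45678', 'abc45678910']

# (.*){8,}: each of the 8 repetitions may match the empty string,
# so every name passes the filter.
loose = [n for n in names if re.search(r'(.*){8,}', n)]

# .{8,}: at least 8 actual characters are required.
strict = [n for n in names if re.search(r'.{8,}', n)]

# Plain length check, the counterpart of character_length(s) >= 8.
by_length = [n for n in names if len(n) >= 8]

print(loose)
print(strict)
print(by_length)
```

The first list contains all three names, while the other two keep only the names with 8 or more characters.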

Validate column using regular expression in postgre SQL

I need to check whether a column in a table holds a numeric value followed by a decimal point and exactly 3 digits after the decimal point. Kindly suggest how to do this using a regular expression in PostgreSQL, or any alternative method.
Thanks

The basic regex for one or more digits, a period and exactly three digits is \d+\.\d{3}
You can use it for several things, for instance:
1. Add a Constraint to your Column Definition
ALTER TABLE mytable ADD CONSTRAINT mycolumn_regexp CHECK (mycolumn ~ $$^\d+\.\d{3}\Z$$);
2. Find Rows that Don't Match
SELECT * FROM mytable WHERE mycolumn !~ $$^\d+\.\d{3}\Z$$;
3. Find Rows that Match
SELECT * FROM mytable WHERE mycolumn ~ $$^\d+\.\d{3}\Z$$;
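To check which values the anchored pattern accepts, here is a small sketch in Python rather than PostgreSQL. fullmatch plays the role of the ^ and \Z anchors, and the name has_three_decimals is hypothetical.

```python
import re

# [0-9] rather than \d keeps the check to ASCII digits in Python.
THREE_DECIMALS = re.compile(r'[0-9]+\.[0-9]{3}')

def has_three_decimals(text):
    """True if text is digits, a period, then exactly three digits."""
    # fullmatch requires the whole string to match, like ^...\Z.
    return THREE_DECIMALS.fullmatch(text) is not None

for value in ['12.345', '12.34', '12.3456', '.123', '12']:
    print(value, has_three_decimals(value))
```

Only 12.345 passes. The other values fail because they have too few or too many decimals, no leading digit, or no decimal point at all.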
