Extract rows from a table with a regular expression in Hive SQL

Please check the link for the result and table info. I need to query rows with the value '343' in column B using a regular expression. All columns are strings. Also, please point me to any good learning materials on how to write good regular expressions in Hive. Thank you.

For Hive use this:
select * from tablename where B rlike '343';
Checking it works:
hive> select '123435' rlike '343';
OK
_c0
true
Negative test:
hive> select '12345' rlike '343';
OK
_c0
false
Time taken: 1.675 seconds, Fetched: 1 row(s)
Hive uses Java-flavored regular expressions. You can find a good reference and practice tools at https://regexr.com/ and, of course, regex101.

This will work:
select * from tablename where regexp_like(B,'(.*)(343)(.*)');
The Hive equivalent is:
select * from tablename where rlike(B,'(.*)(343)(.*)');
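The test output above shows that rlike returns true when the pattern matches anywhere in the value, so the surrounding (.*) groups should not change which rows are selected. Here is a minimal sketch of that behaviour, using Python's re.search as a stand-in for Hive's Java regex engine. The rlike helper and the sample values are assumptions for illustration, not Hive code.

```python
import re

def rlike(value: str, pattern: str) -> bool:
    """Rough stand-in for Hive's RLIKE: true if pattern matches anywhere in value."""
    return re.search(pattern, value) is not None

# Hypothetical values for column B
rows = ["123435", "12345", "343", "a343b"]

plain = [v for v in rows if rlike(v, "343")]
wrapped = [v for v in rows if rlike(v, "(.*)(343)(.*)")]

print(plain)             # rows that contain '343'
print(plain == wrapped)  # both patterns select the same rows
```

Since both patterns select the same rows, the shorter '343' is usually the easier one to read and maintain.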

Related

SQL RLIKE function Postcode Search

I am trying to understand why the following query pulls through postcodes that I wouldn't expect.
Select distinct Postcode from tableA where Postcode like 'NE1%';
Shows 2 postcodes, both beginning with NE1.
I've tried:
Select distinct Postcode from tableA where Postcode rlike '^NE[0-1]%'
This shows many postcodes, including the 2 from above, such as NE27 0EZ. I'm assuming that one appears because it has a zero in the second part of the postcode, but I have no idea why NE2 2NE appears!
My goal is to filter all postcodes that begin with an N (not NE) and have a digit as the next character. It has to be SQL only, not Python or Scala, because this filter is one of many postcode filters (a large OR clause).
I would have thought the following would work for all postcodes beginning with an N followed by a digit:
Select distinct Postcode from tableA where Postcode rlike 'N[0-9] %' or Postcode rlike 'N[0-9][0-9] %'
select distinct 'rlike' as Func , postcode from npex.npex where postcode rlike '^NE[0-1]*'
union
select distinct 'like', postcode from npex.npex where postcode like 'NE1%'
order by 1;
RESULTS
Func    postcode
like    NE1 3BB
like    NE12 1AB
rlike   NE27 0EZ
rlike   NE6 2UT
rlike   NE27 0LT
rlike   NE12 1AB
rlike   NE2 2NE
rlike   NE3 4DT
rlike   NE1 3BB
The * is not needed. With it, [0-1]* matches zero or more zeroes or ones, so every postcode beginning with NE matches, including NE2 2NE.
select distinct postcode from npex.npex where postcode rlike '^NE[0-1]'
If you want to get those beginning with an N followed by a digit, you can use
select distinct postcode from npex.npex where postcode rlike '^N[0-9]'
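To see why the quantifier matters, here is a small sketch that runs the patterns discussed above against a handful of postcodes. It uses Python's re.search as an approximation of rlike. The values N1 9GU and N16 5AA are made-up samples and the matching helper is hypothetical. It also shows that % is an ordinary character in a regex, not a wildcard as in LIKE.

```python
import re

# NE* values come from the results above; the two N* values are invented samples
postcodes = ["NE1 3BB", "NE12 1AB", "NE2 2NE", "NE27 0EZ", "N1 9GU", "N16 5AA"]

def matching(pattern: str) -> list[str]:
    """Return the postcodes in which the pattern is found (rlike-style search)."""
    return [p for p in postcodes if re.search(pattern, p)]

star = matching(r"^NE[0-1]*")      # '*' allows zero digits, so any NE... matches
no_star = matching(r"^NE[0-1]")    # exactly one 0 or 1 must follow NE
n_digit = matching(r"^N[0-9]")     # N followed directly by a digit
percent = matching(r"^NE[0-1]%")   # '%' is a literal percent sign in a regex

print(star)
print(no_star)
print(n_digit)
print(percent)
```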

How to put a regex filter in a where clause in bigquery

How do I add a condition to the where clause of a SQL select statement so that it matches a particular regex?
I have a table with phone numbers. The phone numbers are 10 digits long. The data is dirty, so I want to exclude records that are not in this format, like this:
select * from Phones where Phones like `RegExp("^\\d{9}$")`; <-- this doesn't work
Thanks
For BigQuery Standard SQL, use the following (assuming your regexp itself is correct):
WHERE REGEXP_CONTAINS(Phones, r'^\d{10}$')
This keeps only the rows where Phones is a string of exactly 10 digits and filters out everything else.
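As a quick check of the pattern itself, the sketch below applies ^\d{10}$ to a few made-up phone values. It uses Python's re module rather than BigQuery's RE2 engine, with re.ASCII set so that \d only matches 0-9. The names TEN_DIGITS, phones and valid are assumptions for illustration.

```python
import re

# Anchored pattern: the whole value must be exactly ten ASCII digits
TEN_DIGITS = re.compile(r"^\d{10}$", re.ASCII)

# Hypothetical dirty data
phones = ["4155550123", "415555012", "415-555-0123", "41555501234", ""]

valid = [p for p in phones if TEN_DIGITS.search(p)]
print(valid)
```

Without the ^ and $ anchors, a search-style function would also accept longer values that merely contain ten consecutive digits.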

Create a new column by executing regular expression on existing column

I have column with data as follows:
p=Chicago, IL|q=rental houses
My goal is to obtain
Chicago IL rental houses as the outcome by running regular expression on the column via a select query.
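Use the regex below on the string. The | is escaped so that it matches the literal pipe between the two parts instead of acting as alternation:
/p=(.*)\|q=(.*)/
Then join the 2 captured substrings with a space.
If you want to get the result from a select query, you can use the concat or concat_ws function in the select instead.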
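A minimal sketch of the capture-and-join approach, written in Python since the question does not say which database is in use. The flatten helper is hypothetical, and dropping the comma is an assumption based on the desired output "Chicago IL rental houses". The pipe is escaped here so that it is treated as a literal separator.

```python
import re
from typing import Optional

# Group 1 is the text after 'p=', group 2 is the text after 'q='
PATTERN = re.compile(r"p=(.*)\|q=(.*)")

def flatten(value: str) -> Optional[str]:
    """Return 'place query' with commas removed, or None if the value does not fit."""
    m = PATTERN.fullmatch(value)
    if m is None:
        return None
    place, query = m.groups()
    return f"{place.replace(',', '')} {query}"

result = flatten("p=Chicago, IL|q=rental houses")
print(result)
```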

Validate column using regular expression in PostgreSQL

I need to check whether a column in a table holds a numeric value followed by a decimal point and exactly 3 digits after the decimal point. Kindly suggest how to do this using a regular expression in PostgreSQL, or any alternative method.
Thanks
The basic regex for digits, a period and digits is \d+\.\d{3}
You can use it for several things, for instance:
1. Add a Constraint to Your Column Definition
ALTER TABLE mytable ADD CONSTRAINT mycolumn_regexp CHECK (mycolumn ~ $$^\d+\.\d{3}\Z$$);
2. Find Rows that Don't Match
SELECT * FROM mytable WHERE mycolumn !~ $$^\d+\.\d{3}\Z$$;
3. Find Rows that Match
SELECT * FROM mytable WHERE mycolumn ~ $$^\d+\.\d{3}\Z$$;
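To sanity-check the pattern before putting it in a constraint, it can be tried against a few values. This sketch uses Python's re module, which accepts the same ^\d+\.\d{3}\Z pattern. The sample values and the names DECIMAL_3 and checks are assumptions for illustration, not PostgreSQL output.

```python
import re

# One or more digits, a literal dot, exactly three digits, then end of string
DECIMAL_3 = re.compile(r"^\d+\.\d{3}\Z", re.ASCII)

# Hypothetical column values
samples = ["12.345", "0.001", "12.34", "12.3456", ".123", "12.345\n"]

checks = {s: DECIMAL_3.search(s) is not None for s in samples}
for value, ok in checks.items():
    print(repr(value), ok)
```

The \Z anchor is stricter than $ in this sketch: it rejects a value with a trailing newline, which $ would accept in Python.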

Using regexp_extract in Hive

I am trying to find the rows in a Hive table where a particular column contains something other than null values, \N values, or the STX character '\002'. The objective is to find which rows contain characters other than these three.
I tried this hive query:
select column1,length(regexp_replace(column1,'\N|\002|NULL','')) as value
FROM table1 LIMIT 10;
I was expecting zero in the following cases but I am getting the following:
column1 value
NULL NULL
0
NULL NULL
0
\N\N\N\N\N\N\N\N 8
NULL NULL
\N\N\N\N\N\N\N\N 8
NULL NULL
NULL NULL
\N\N\N 3
Could someone please help me on the correct regex for the above case?
Thank you.
Ravi
It looks like Hive uses Java's regular expression engine, so the problem seems to be with the regex itself, more specifically with the escape sequences.
Try the following, and if it doesn't work, please let me know:
(?:(?:\\\\N)+|\002|NULL)
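The effect of the escaping can be sketched outside Hive. In the example below, the regex engine receives \\N, which matches a literal backslash followed by N. Inside a Hive string literal, each of those backslashes presumably has to be doubled again, which gives the \\\\N in the pattern above. This is a Python approximation, not Hive code, and the sample values and names are assumptions. It may also explain the 8 in the question's output: if the original pattern only removed the N characters, the 8 backslashes would remain.

```python
import re

# Regex-level pattern: runs of literal '\N', the STX character, or the text 'NULL'
CLEAN = re.compile(r"(?:\\N)+|\x02|NULL")

# Hypothetical column values; "\\N" in Python source is the two characters \ and N
samples = ["\\N\\N\\N", "\x02", "NULL", "abc\\N", ""]

remaining = [len(CLEAN.sub("", s)) for s in samples]
print(remaining)  # a non-zero entry means other characters are present

# Removing only the letter N leaves the backslashes behind
only_n = len(re.sub("N", "", "\\N" * 8))
print(only_n)
```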