Extract text before a specific word with hive

I have data in a column that looks like this:
Avenue 1 HE1 345 HOUSE 123.
FLAT 202 HRE2 D34 HOUSE 345.
DOOR 324 HA1 345 HOUSE 675.
I need to extract the postcode, which always comes immediately before HOUSE and is 6-7 characters long in every case. There is always a single space before HOUSE, one inside the postcode, and one before the postcode.
Desired output :
HE1 345
HRE2 D34
HA1 345
I tried using substring_index twice, only to find that Hive doesn't support the function. I'm fairly new to Hive, so any help or pointers to reference material would be much appreciated.
Thanks in advance.

You can use this regex pattern: ' (\\w+ \\w+) HOUSE'. It means one space, one or more word characters, one space, one or more word characters, one space, then HOUSE. The parentheses mark the group to be extracted; its group index is 1.
Demo:
select regexp_extract(s,' (\\w+ \\w+) HOUSE',1)
from
(select 'Avenue 1 HE1 345 HOUSE 123.' s union all
select 'FLAT 202 HRE2 D34 HOUSE 345.' s union all
select 'DOOR 324 HA1 345 HOUSE 67' s) s;
OK
HE1 345
HRE2 D34
HA1 345
Time taken: 26.472 seconds, Fetched: 3 row(s)
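To experiment with the pattern outside Hive, here is a minimal Python sketch. It assumes Python's `re` module treats this particular pattern the same way as the Java regex engine behind regexp_extract. The helper name `extract_postcode` and the choice to return an empty string when nothing matches are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as the Hive demo. In a Python raw string a single backslash
# is enough, whereas the Hive string literal needs it doubled.
POSTCODE_BEFORE_HOUSE = re.compile(r" (\w+ \w+) HOUSE")

rows = [
    "Avenue 1 HE1 345 HOUSE 123.",
    "FLAT 202 HRE2 D34 HOUSE 345.",
    "DOOR 324 HA1 345 HOUSE 67",
]


def extract_postcode(text):
    """Return capture group 1, or "" when the pattern does not match.

    The empty-string fallback is a choice made for this sketch.
    """
    match = POSTCODE_BEFORE_HOUSE.search(text)
    return match.group(1) if match else ""


postcodes = [extract_postcode(row) for row in rows]
for postcode in postcodes:
    print(postcode)
```

The leading space in the pattern matters: it forces the first token of the group to start right after a space, so the two tokens captured are always the two directly before HOUSE.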
For case-insensitive matching, use the (?i) modifier:
hive>
>
> select regexp_extract(s,' (\\w+ \\w+) (?i)HOUSE',1)
> from
> (select 'Avenue 1 HE1 345 HOUSe 123.' s union all
> select 'FLAT 202 HRE2 D34 HOUsE 345.' s union all
> select 'DOOR 324 HA1 345 HOuSE 67' s) s;
OK
HE1 345
HRE2 D34
HA1 345
See regex docs here: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
About case insensitive: http://www.regular-expressions.info/modifiers.html
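The case-insensitive variant can be tried in Python as well, with one difference worth flagging: recent Python versions reject a bare (?i) placed in the middle of a pattern, so this sketch uses the scoped form (?i:...) instead. That substitution is an adaptation for Python, not what the Hive query above does.

```python
import re

# (?i:HOUSE) makes only the keyword case-insensitive. The scoped form is used
# because Python does not accept a global (?i) flag mid-pattern.
pattern = re.compile(r" (\w+ \w+) (?i:HOUSE)")

samples = [
    "Avenue 1 HE1 345 HOUSe 123.",
    "FLAT 202 HRE2 D34 HOUsE 345.",
    "DOOR 324 HA1 345 HOuSE 67",
]

extracted = []
for sample in samples:
    match = pattern.search(sample)
    # Fall back to an empty string when the keyword is absent (sketch choice).
    extracted.append(match.group(1) if match else "")

print(extracted)
```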

You can save the data as a CSV file (copy the contents into Notepad and save the file with a .csv extension).
You can then create a table in Hive and load the data from the CSV file into it.
hive> create table text(column1 string,column2 string,column3 string,column4 string,column5 string, column6 string) Row format delimited fields terminated by ' ' ;
OK
Time taken: 0.137 seconds
To load data into the table, use:
hive> load data LOCAL inpath 'location of your file' overwrite into table text;
For example:
hive> load data LOCAL inpath '/home/cloudera/FinalProjects/text.csv' overwrite into table text;
Loading data to table default.text
Table default.text stats: [numFiles=1, numRows=0, totalSize=84, rawDataSize=0]
OK
Time taken: 0.59 seconds
hive> select column3, column4 from text;
OK
HE1 345
HRE2 D34
HA1 345
Time taken: 0.145 seconds, Fetched: 3 row(s)
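The same split-on-space idea can be sketched in a few lines of Python to make its main assumption visible: it only works when the postcode tokens are always the third and fourth space-separated fields. The function name `columns_3_and_4` is made up for this illustration.

```python
lines = [
    "Avenue 1 HE1 345 HOUSE 123.",
    "FLAT 202 HRE2 D34 HOUSE 345.",
    "DOOR 324 HA1 345 HOUSE 67",
]


def columns_3_and_4(line):
    """Split on single spaces and return the 3rd and 4th fields,
    mirroring 'select column3, column4' on the space-delimited table."""
    fields = line.split(" ")
    return fields[2], fields[3]


pairs = [columns_3_and_4(line) for line in lines]
for first, second in pairs:
    print(first, second)
```

If the text before the postcode ever has more or fewer than two tokens, the fixed column positions shift, whereas a pattern anchored on HOUSE does not depend on them.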

Related

How to select data from a table based on multiple expressions?

I have a view named info; its structure and sample data are as follows:
id  name  contacts
1   ali   1234
1   ali   122
2   john  133
2   john  144
2   john  122
3   mike  111
4   khan  444
5   jan   122
5   jan   155
I am using this view in an Oracle APEX report and want to search the data by id. For example, if I search for id = 1, that id has two values in the contacts column. One of them, 122, also appears in other records, so the result should also include all records of every id that has 122 in its contacts column.
The expected result which I want is:
id  name  contacts
1   ali   1234
1   ali   122
2   john  133
2   john  144
2   john  122
5   jan   122
5   jan   155

We can phrase your requirement as wanting to return any record with id = 1 or any record whose contacts overlap with the contacts of id = 1.
SELECT id, name, contacts
FROM yourTable
WHERE id = 1 OR
id IN (
SELECT id
FROM yourTable
WHERE contacts IN (SELECT contacts FROM yourTable WHERE id = 1)
)
ORDER BY id;
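As a quick way to check the logic, the query can be run against the sample rows with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for Oracle here, the table is named info after the view in the question, and the contacts values are stored as integers for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE info (id INTEGER, name TEXT, contacts INTEGER)")
conn.executemany(
    "INSERT INTO info VALUES (?, ?, ?)",
    [
        (1, "ali", 1234), (1, "ali", 122),
        (2, "john", 133), (2, "john", 144), (2, "john", 122),
        (3, "mike", 111),
        (4, "khan", 444),
        (5, "jan", 122), (5, "jan", 155),
    ],
)

# Same shape as the answer's query: rows for the searched id, plus all rows
# of any id that shares at least one contact with the searched id.
query = """
    SELECT id, name, contacts
    FROM info
    WHERE id = ?
       OR id IN (
           SELECT id
           FROM info
           WHERE contacts IN (SELECT contacts FROM info WHERE id = ?)
       )
    ORDER BY id
"""
result = conn.execute(query, (1, 1)).fetchall()
for row in result:
    print(row)
```

With the sample data, ids 1, 2 and 5 are returned because all three share contact 122, while ids 3 and 4 are left out.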

How to sum by group in Power Query Editor?

My table looks like this:
Serial WO# Value Indicator
A 333 10 333-1
A 333 4 333-2
B 456 5 456-1
A 334 1 334-1
A 334 5 334-2
I want to create a new column that sums up the Values based on WO#. It should look like this:
Serial WO# Value Indicator SumValue
A 333 10 333-1 14
A 333 4 333-2 14
B 456 5 456-1 5
A 334 1 334-1 6
A 334 5 334-2 6
Eventually I will remove duplicates on the WO# and remove the Value and Indicator Columns from the data. I can't seem to find a function in M that allows for sum by group. Thanks in advance!

If you load the data with Power Query, there is a Group command on the ribbon that will do just that.
Make sure to use the Advanced option and add all columns you want to retain to the grouping section. The command is available in both Excel and Power BI.
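To make the intended result concrete, here is a small Python sketch of the same sum-by-group logic. It is not M code and not what the Group command generates; it only illustrates how SumValue is derived for each WO#.

```python
from collections import defaultdict

# (Serial, WO#, Value, Indicator) rows from the question.
rows = [
    ("A", 333, 10, "333-1"),
    ("A", 333, 4, "333-2"),
    ("B", 456, 5, "456-1"),
    ("A", 334, 1, "334-1"),
    ("A", 334, 5, "334-2"),
]

# First pass: total Value per WO#.
totals = defaultdict(int)
for _serial, wo, value, _indicator in rows:
    totals[wo] += value

# Second pass: append the group total to every original row.
with_sum = [row + (totals[row[1]],) for row in rows]
for row in with_sum:
    print(row)
```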

Combine Toad sql queries with decreasing output results into one list

I've been trying to combine multiple queries that return progressively more restrictive results. How can I see the full list as well as those records that meet the more restrictive conditions? Query 1 returns 538 records of sites in the given counties.
SELECT E_SITES.ID "SITE ID",
E_SITES.NAME "SITE NAME",
E_SITES.ADDR_1 "SITE ADDRESS",
E_SITES.CITY_NAME || ', ' || E_SITES.STATE_CODE || ' ' || E_SITES.POSTAL_CODE,
E_SITES.COUNTY_NAME
FROM E_SITES
WHERE E_SITES.COUNTY_NAME IN ('ALLAMAKEE', 'BENTON', 'BLACK HAWK', 'BREMER', 'BUCHANAN', 'CHICKASAW', 'CLAYTON', 'DELAWARE', 'DUBUQUE')
ORDER BY E_SITES.ID
Query 2 returns the number of sites that have a contact person identified. This is 503 records.
SELECT E_SITES.ID "SITE ID",
E_SITES.NAME "SITE NAME",
E_SITES.ADDR_1 "SITE ADDRESS",
E_SITES.CITY_NAME || ', ' || E_SITES.STATE_CODE || ' ' || E_SITES.POSTAL_CODE,
E_SITES.COUNTY_NAME,
E_INDIVIDUALS.FIRST_NAME || ' ' || E_INDIVIDUALS.LAST_NAME
FROM E_SITES, E_AFFILIATIONS, E_INDIVIDUALS
WHERE E_SITES.SITE_ID = E_AFFILIATIONS.SITE_ID
AND E_AFFILIATIONS.INDIVIDUAL_RID = E_INDIVIDUALS.RID
AND E_AFFILIATIONS.AFFILIATION_TYPE = ('SITE_CONTACT')
AND E_SITES.COUNTY_NAME IN ('ALLAMAKEE', 'BENTON', 'BLACK HAWK', 'BREMER', 'BUCHANAN', 'CHICKASAW', 'CLAYTON', 'DELAWARE', 'DUBUQUE')
ORDER BY E_SITES.ID
A further query would return those sites with a mailing address, which reduces the results down to 486 records. I need to get all 538 records, whether or not they have a contact or mailing address, and for those that do, have one row for each site.
Additional Information
My current results can look like this for Query 1 (including column headers for clarity, quotes to distinguish data elements):
"SITE ID" "SITE NAME" "SITE ADDRESS" "CITY, STATE ZIP" "COUNTY_NAME"
"09698" "BODINE ELECTRIC" "18114 KAPP DR" "PEOSTA, IA 52067" "BREMER"
"16895" "BRUGGEMAN LUMBER" "3003 WILLOW RD" "HOPKINTON, IA 52237" "DELAWARE"
"40047" "GENEVIEVE, LLC" "707 LINCOLN ST" "GARNAVILLOR, IA 52052" "CLAYTON"
Query 2 which requires a contact person currently only returns records that meet the requirement, even though I use the (+) operator.
"SITE ID" "SITE NAME" "SITE ADDRESS" "CITY, STATE ZIP" "COUNTY_NAME" "FIRST NAME LAST NAME"
"40047" "GENEVIEVE, LLC" "707 LINCOLN ST" "GARNAVILLOR, IA 52052" "CLAYTON" "DALE KARTMAN"
I get 1 record rather than 3 records (2 with no contact person and 1 with a contact person). This is my dilemma. I have to run each of these queries separately, get the results, and copy them to a spreadsheet. Then I have to align the records that have contact names with the first query's list of all facilities, which is very labor intensive. I hope this clarifies my needs.

If I understood you correctly, it is the OUTER JOIN you're looking for.
Here's a simple example (based on Scott's EMP and DEPT tables) which shows what it is.
There are 4 departments in the DEPT table:
SQL> select deptno from dept order by deptno;
DEPTNO
----------
10
20
30
40
However, no employee works in department 40:
SQL> select deptno, ename from emp order by deptno;
DEPTNO ENAME
---------- ----------
10 KING
10 CLARK
10 MILLER
20 FORD
20 SMITH
20 JONES
30 JAMES
30 TURNER
30 MARTIN
30 WARD
30 ALLEN
30 BLAKE
12 rows selected.
SQL>
If you want to display information collected from both of those tables (department name from the DEPT table and employee name from the EMP table), you'd join those tables - just like you did (I'll use ANSI syntax which actually JOINS tables, instead of enumerating them and putting join conditions into the WHERE clause):
SQL> select d.deptno, d.dname, e.ename
2 from dept d join emp e on e.deptno = d.deptno
3 order by d.deptno;
DEPTNO DNAME ENAME
---------- -------------- ----------
10 ACCOUNTING KING
10 ACCOUNTING CLARK
10 ACCOUNTING MILLER
20 RESEARCH FORD
20 RESEARCH SMITH
20 RESEARCH JONES
30 SALES JAMES
30 SALES TURNER
30 SALES MARTIN
30 SALES WARD
30 SALES ALLEN
30 SALES BLAKE
12 rows selected.
SQL>
Looks OK, but - I'd like to get information about DEPTNO = 40, although nobody works in it. So, use outer join:
SQL> select d.deptno, d.dname, e.ename
2 from dept d left join emp e on e.deptno = d.deptno
3 order by d.deptno;
DEPTNO DNAME ENAME
---------- -------------- ----------
10 ACCOUNTING KING
10 ACCOUNTING CLARK
10 ACCOUNTING MILLER
20 RESEARCH FORD
20 RESEARCH SMITH
20 RESEARCH JONES
30 SALES JAMES
30 SALES TURNER
30 SALES MARTIN
30 SALES WARD
30 SALES ALLEN
30 SALES BLAKE
40 OPERATIONS
13 rows selected.
SQL>
Right! Here it is! (Note that LEFT JOIN produces the same result as LEFT OUTER JOIN; there is no need to specify "outer", although it makes things somewhat more obvious.)
Also, there's the "old" Oracle outer join operator, (+) (literally, a + sign enclosed into round brackets). The above query would work as well if we put it like this:
select d.deptno, d.dname, e.ename
from dept d, emp e
where d.deptno = e.deptno (+);
I'd suggest you do the same (outer join) with your query. Once again:
join tables in the JOIN clause
put filters into the WHERE clause
The query will be easier to read and maintain, and you'll know what is what. There is also a practical reason, which might even apply to you: with the "old" (+) operator, you can't outer join one table to more than one other table. As your queries go deeper, you might need to outer join a table to several others, and that's where the ANSI join comes in.
Good luck!
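Applied to the sites question, the idea can be sketched with Python's sqlite3 module. The two tables below are simplified, hypothetical stand-ins for E_SITES and the affiliation/individual tables (names and columns are assumptions), populated with the three sample sites from the question. Note that the condition on the outer-joined table is placed in the ON clause: if it were moved to WHERE, sites without a contact would be filtered out again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema for illustration only.
conn.executescript("""
    CREATE TABLE sites (site_id TEXT, name TEXT, county TEXT);
    CREATE TABLE contacts (site_id TEXT, affiliation_type TEXT, full_name TEXT);
""")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [
        ("09698", "BODINE ELECTRIC", "BREMER"),
        ("16895", "BRUGGEMAN LUMBER", "DELAWARE"),
        ("40047", "GENEVIEVE, LLC", "CLAYTON"),
    ],
)
conn.execute(
    "INSERT INTO contacts VALUES ('40047', 'SITE_CONTACT', 'DALE KARTMAN')"
)

# Every site is kept; full_name is NULL (None) where no contact exists.
site_rows = conn.execute("""
    SELECT s.site_id, s.name, c.full_name
    FROM sites s
    LEFT JOIN contacts c
           ON c.site_id = s.site_id
          AND c.affiliation_type = 'SITE_CONTACT'  -- filter on the outer side stays in ON
    WHERE s.county IN ('BREMER', 'DELAWARE', 'CLAYTON')
    ORDER BY s.site_id
""").fetchall()
for row in site_rows:
    print(row)
```

All three sites come back, and only the third carries a contact name.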

Select most recent rows in Django ORM with grouping

We have a system written in Django to track patients recruited to clinical trials.
Spreadsheets are used to record the number of patients recruited each month throughout a financial year, so a sheet only contains 12 months of data even though a study may run for years.
There is a table in a Django database into which the spreadsheets are imported each month. The data includes the month/year, a count of patients, and some other fields. Each import includes all the previous months' data; we need this to make sure no data has been changed on the import sheet since the last import.
For example, the import table containing two imports (the first up to January and the second up to February) would look like this:
id | study_id | data_date | patient_count | [other fields] -->
100 5456 2016-04-01 10 ...
101 5456 2016-05-01 8 ...
102 5456 2016-06-01 5 ...
... all months in between ...
109 5456 2016-01-01 12 ...
110 5456 2016-02-01 NULL ...
111 5456 2016-03-01 NULL ...
112 5456 2016-04-01 10 ...
113 5456 2016-05-01 8 ...
114 5456 2016-06-01 5 ...
... all months in between ...
121 5456 2016-01-01 12 ...
122 5456 2016-02-01 6 ...
123 5456 2016-03-01 NULL ...
The other fields include a foreign key to another table containing the actual study identification number (iras_number), so I have to join to that to select the rows for a particular study.
I want the most recent values of data_date and patient_count for a study, which may span more than one financial year, so I tried this query (iras_number is passed to the function performing this query):
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(import_study__iras_number=iras_number) \
.annotate(max_id=Max('id')).order_by()
However, this produces a SQL query which includes patient_count in the GROUP BY, resulting in duplicate rows:
data_date | patient_count | max_id
2016-04-01 10 100
2016-04-01 10 112
2016-05-01 8 101
2016-05-01 8 113
...
2016-01-01 12 109
2016-01-01 12 121
2016-02-01 NULL 110
2016-02-01 6 122
How do I select the most recent data_date and patient_count from the table using the ORM?
If I were writing the SQL I would do an inner select of the max(id) grouped by data_date and then use that to join, or use an IN query, to select the fields I require from the table; such as:
SELECT data_date, patient_count
FROM importstudydata
WHERE id IN (
SELECT MAX(id) AS "max_id"
FROM importstudydata INNER JOIN importstudy
ON importstudydata.import_study_id = importstudy.id
WHERE importstudy.iras_number = 5456
GROUP BY importstudydata.data_date
)
ORDER BY data_date ASC
I've tried to create an inner select to replicate the SQL query; however, the inner select returns more than one field (column), which causes the query to fail:
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=ImportStudyData.objects.values('data_date') \
.filter(import_study__iras_number=iras_number) \
.annotate(max_data_id=Max('id')))
I can't get the inner select to return only the max(id) grouped by `data_date` while still running as a single SQL query.

For now I'm splitting the query into a number of steps to get the result I want.
First I query for the most recent id of all rows related to the study:
id_qry = ImportStudyData.objects.values('data_date')\
.filter(import_study__iras_number=iras_number)\
.annotate(max_id=Max('id'))
To get a list of just the ids, stripping out the date, I use a list comprehension:
id_list = [x['max_id'] for x in id_qry]
This list is then used as a filter for the final query to get the number of patients:
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=id_list)
It hits the database twice, and is computationally more expensive, but for now it works and I need to move on.
I'll come back to this problem at a later date.
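For reference, the single-statement SQL from the question can be checked outside Django with Python's sqlite3 module. This is a sketch under simplifying assumptions: the join to importstudy is replaced by a study_id column on one table, and only the January to March rows of the two sample imports are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE importstudydata ("
    "id INTEGER PRIMARY KEY, study_id INTEGER, "
    "data_date TEXT, patient_count INTEGER)"
)
conn.executemany(
    "INSERT INTO importstudydata VALUES (?, ?, ?, ?)",
    [
        # First import
        (109, 5456, "2016-01-01", 12),
        (110, 5456, "2016-02-01", None),
        (111, 5456, "2016-03-01", None),
        # Second import: same months, February now filled in
        (121, 5456, "2016-01-01", 12),
        (122, 5456, "2016-02-01", 6),
        (123, 5456, "2016-03-01", None),
    ],
)

# The inner query picks the highest id per month; the outer query fetches
# the fields of exactly those rows, so each month appears once.
latest = conn.execute(
    """
    SELECT data_date, patient_count
    FROM importstudydata
    WHERE id IN (
        SELECT MAX(id)
        FROM importstudydata
        WHERE study_id = ?
        GROUP BY data_date
    )
    ORDER BY data_date ASC
    """,
    (5456,),
).fetchall()
for row in latest:
    print(row)
```

February resolves to the value from the later import (6) instead of appearing twice.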

Use distinct():
totals = ImportStudyData.objects.values('data_date', 'patient_count').filter(import_study__iras_number=iras_number).annotate(max_id=Max('id')).order_by('data_date').distinct()

How to return all results using REGEXP_SUBSTR?

I need four results to be returned, but only three are shown. How can I do this?
Query:
SELECT REGEXP_SUBSTR(lines, '[0-9]{1,3}', 1, 1,'m')
from
(
SELECT '111 - first line' lines FROM dual
UNION
SELECT '222 - second line' FROM dual
UNION
SELECT '333 - third line
444 - fourth line' FROM dual
)
It's returning this:
111
222
333
I want this:
111
222
333
444
Is this possible?

First, let's create some data:
CREATE TABLE some_table (some_data VARCHAR2(200))
/
INSERT INTO some_table VALUES ('111 - first line')
/
INSERT INTO some_table VALUES ('222 - second line')
/
INSERT INTO some_table VALUES ('333 - third line
444 - fourth line')
/
INSERT INTO some_table VALUES ('555 - fifth line
666 - some ugly line')
/
INSERT INTO some_table VALUES ('123 - meh meh
321 - one more
678 - and more
986 - and more :)')
/
Then let's make a query:
SELECT DISTINCT TRIM(REGEXP_SUBSTR(some_data,'[^'||CHR(10)||']+', 1, level))
FROM some_table
CONNECT BY REGEXP_SUBSTR(some_data, '[^'||CHR(10)||']+', 1, level) IS NOT NULL;
You can easily adapt your query to this. I took a shortcut and added DISTINCT to remove the repeated rows; in proper code you should not rely on it.
Your data will look like this:
333 - third line
678 - and more
986 - and more :)
321 - one more
555 - fifth line
444 - fourth line
666 - some ugly line
222 - second line
123 - meh meh
111 - first line
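The underlying idea, treating each embedded line as its own record before extracting the number, can be sketched in Python. This is an illustration of the logic rather than Oracle code; it uses re.MULTILINE so that ^ matches at the start of every line inside a value.

```python
import re

values = [
    "111 - first line",
    "222 - second line",
    "333 - third line\n444 - fourth line",
]

# With re.MULTILINE, ^ anchors at the start of each line within a value,
# so a value containing a newline can contribute more than one number.
leading_number = re.compile(r"^[0-9]{1,3}", re.MULTILINE)

numbers = []
for value in values:
    numbers.extend(leading_number.findall(value))

print(numbers)
```

Each stored value can yield several results here, which is what the single REGEXP_SUBSTR call in the question cannot do, since it returns one occurrence per row.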
