Hive: Usage of Concat in Column Name - regex

I am trying to get data from a table that has column names like year_2016, year_2017, year_2018, etc.
I am not sure how to query this table.
The data looks like:
| count_of_accidents | year_2016 | year_2017 | year_2018 |
|--------------------|-----------|-----------|-----------|
| 15                 | 12        | 5         | 1         |
| 5                  | 10        | 6         | 18        |
I have tried the concat function, but it doesn't really work. This is what I tried:
select SUM( count_of_accidents * concat('year_',year(regexp_replace('2018_1_1','_','-'))))
from table_name;
The column name (year_2017, year_2018, etc.) will be passed as a parameter, so I am not really able to hardcode the column name like this:
select SUM( count_of_accidents * year_2018) from table_name;
Is there any way I can do this?

You can do it using regular expressions. Like this:
--create test table
create table test_col(year_2018 string, year_2019 string);
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;
--test select using hard-coded pattern
select year_2018, `(year_)2019` from test_col;
OK
year_2018 year_2019
Time taken: 0.862 seconds
--test pattern parameter
set hivevar:year_param=2019;
select year_2018, `(year_)${year_param}` from test_col;
OK
year_2018 year_2019
Time taken: 0.945 seconds
--two parameters
set hivevar:year_param1=2018;
set hivevar:year_param2=2019;
select `(year_)${year_param1}`, `(year_)${year_param2}` from test_col t;
OK
year_2018 year_2019
Time taken: 0.159 seconds
--parameter contains full column_name and using more strict regexp pattern
set hivevar:year_param2=year_2019;
select `^${year_param2}$` from test_col t;
OK
year_2019
Time taken: 0.053 seconds
--select all columns using single pattern year_ and four digits
select `^year_[0-9]{4}$` from test_col t;
OK
year_2018 year_2019
The parameter should be calculated and passed to the Hive script; functions like concat() or regexp_replace() are not supported in column names.
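Such a hivevar can also be supplied from outside the script when it is executed. A minimal sketch, assuming the statements above are saved in a script file (the file name here is hypothetical):
hive --hivevar year_param=2019 -f year_report.hql
Inside the script, ${year_param} is then substituted as plain text before execution, exactly as with set hivevar:year_param=2019;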
Also column aliasing does not work for columns extracted using regular expressions:
select t.number_of_incidents, `^${year_param}$` as year1 from test_t t;
throws exception:
FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias
or column reference '^year_2018$': (possible column names are:
number_of_incidents, year_2016, year_2017, year_2018)
I found a workaround to alias a column using UNION ALL with an empty dataset; see this test:
create table test_t(number_of_incidents int, year_2016 int, year_2017 int, year_2018 int);
insert into table test_t values(15, 12, 5, 1); --insert test data
insert into table test_t values(5,10,6,18);
--parameter, can be passed from outside the script from command line
set hivevar:year_param=year_2018;
--enable regex columns and print column names
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;
--Alias column using UNION ALL with empty dataset
select sum(number_of_incidents*year1) incidents_year1
from
(--UNION ALL with empty dataset to alias columns extracted
select 0 number_of_incidents, 0 year1 where false --returns no rows because of false condition
union all
select t.number_of_incidents, `^${year_param}$` from test_t t
)s;
Result:
OK
incidents_year1
105
Time taken: 38.003 seconds, Fetched: 1 row(s)
The first query in the UNION ALL does not affect the data because it returns no rows, but its column names become the column names of the whole UNION ALL dataset and can be used in the upper query. This trick works. If you find a better workaround to alias columns extracted using a regexp, please add your solution as well.
Update:
There is no need for regular expressions if you can pass the full column name as a parameter. Hive substitutes variables as-is (it does not evaluate them) before query execution. Use a regexp only if you cannot pass the full column name for some reason and, as in the original query, some pattern concatenation is needed. See this test:
--parameter, can be passed from outside the script from command line
set hivevar:year_param=year_2018;
select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t;
Result:
OK
incidents_year1
105
Time taken: 63.339 seconds, Fetched: 1 row(s)

Related

Redshift union with date column filter error

I have a table t with date column d in Redshift.
I can run a simple select query with a filter on d such as:
select *
from t
where d = '2022-02-01'::date
The above also runs for 2022-01-01 and other dates.
However, if I try to run a query with a union of filtered dates like:
select *
from t
where d = '2022-02-01'::date
union
select *
from t
where d = '2022-01-01'::date
I get the error:
[42846] ERROR: could not convert type "unknown" to numeric because of modifier
The union above actually runs without problems if I remove the where filters. Column d values are all valid dates with no nulls.
Am I missing something here?

Extract values from column to use in calculated measure formula

I have TABLE A
In that table I have a measure with values like so:
| Targets |
|---------|
| 4       |
| 5       |
| 6       |
In the same table I have a calculated column (summed totals) like so:
Totals |
--------
10 |
11 |
12 |
Because this is a direct query data source, query editor is disabled and manipulation must be done through DAX formulas.
I would like to do a simple operation of Targets-Totals
Code I've tried for a calculated column:
test = TableA[targets] - TableA[totals]
However this results in an error:
The column TableA[test] cannot be pushed to the remote data source and cannot be used in this scenario.
How can I create a new column with the above operation, considering that one column is a 'measure' and the other a 'calculated column'?
In this case, you will need a measure that does a row-by-row calculation, not a calculated column. For this you can use SUMX, which iterates over the table.
Your measure should be:
New Measure = SUMX(TableA, [targets] - [totals])

How can I transform a table with only ONE cell into a simple value?

For example: if I just have this table
| power |
|-------|
| 100   |
I just want the value 100 (a number in this case). And if, for example, the value is a string, I want the value of the string too.
I need this so I can incorporate this subquery into another query, more precisely in a condition function.
PS: I use M language (Power Query in POWER BI Desktop)
After you load the table in the query editor, you can right-click on the value 100 and select Drill Down. This will give you a single value instead of a table.
It does this using the M code Source{0}[Power] where 0 is the row index and Power is the column.
let
    Source = <data source for your table>,
    Power = Source{0}[Power]
in
    Power
There are other ways you can do this too. For example, the first element in the table's column:
List.First(Source[Power])

How can I call a variable from a table that is a macro variable?

Apologies if my title was confusing to read, but I am not aware of how else to describe it briefly.
I am trying to reference a variable from a table whose name is a macro variable (the table is a macro variable).
My macro looks like this:
%macro genre_analysis(table1=,table2=,genre=,genre1=);
proc sql;
create table &table1 as
select id, original_title, genres, revenue
from genres_revenue
where genres_revenue.genres like &genre
and revenue is not null
group by id
having revenue ne 0
;
quit;
proc sql;
create table &table2 as
select avg(revenue) as Average format=dollar16.2, median(revenue) as Median format=dollar16.2, std(revenue) as std format=dollar16.2
from &table1;
quit;
Everything works fine until I get to this part of the macro:
proc sql;
title "Revenue Stats by Genre";
insert into genre_summary
set Genre=&genre1,
average=&table2.average,
median=&table2.median,
std=&table2.std;
%mend genre_analysis;
I am trying to insert a row into a table I create outside of the macro. But using "&table2.average" and the two others that start with "&table2" does not pull the variables from the table I created in the macro.
For example:
%genre_analysis(table1=horror_revenue,table2=horror_revenue_stats,genre='%Horror%',genre1='Horror')
Returns:
NOTE: Table WORK.HORROR_REVENUE created, with 725 rows and 4 columns.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
NOTE: Table WORK.HORROR_REVENUE_STATS created, with 1 rows and 3 columns.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
ERROR: Character expression requires a character format.
ERROR: Character expression requires a character format.
ERROR: Character expression requires a character format.
ERROR: It is invalid to assign a character expression to a numeric value using the SET clause.
ERROR: It is invalid to assign a character expression to a numeric value using the SET clause.
ERROR: It is invalid to assign a character expression to a numeric value using the SET clause.
**ERROR: The following columns were not found in the contributing tables:
horror_revenue_statsaverage, horror_revenue_statsmedian, horror_revenue_statsstd.**
I have been focusing on the Error that I starred, as I believe that is where the issue is.
I tried using a "from" clause but that does not seem to work either.
Any assistance or suggestions would be appreciated!
You don't need to store a selection of summary statistics in an intermediate table for later use in an INSERT statement. Instead, you can insert the selection directly into the table.
For example:
* table to receive summary computations;
proc sql;
create table stats
(age num
, average num
, median num
, std num
, N num
);
* insert summary computations directly;
proc sql;
insert into stats
select
age
, mean (height)
, median (height)
, std (height)
, count (height)
from sashelp.class
group by age
;
* insert more summary computations directly;
For the case where the new results are in one table and need to be added to a collecting table, you can do
proc append base=<collecting-table> data=<new_results>;
OR
insert into <collecting-table> select * from <new-results>;
Finally, for the case where some new result values are in macro variables themselves, you can insert those values with a VALUES clause. You have to quote the resolution of any macro variable that will be mapped to a character column.
insert into <collecting-table> values ("&genre", &genre.mean, ...);
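Applied to the macro in the question, the same idea would look roughly like this; this is only a sketch, assuming genre_summary has the columns Genre, average, median and std that the original SET clause assigns:
proc sql;
insert into genre_summary (Genre, average, median, std)
select &genre1
     , avg(revenue)
     , median(revenue)
     , std(revenue)
from &table1;
quit;
This computes the statistics straight from &table1 and inserts the row, so the intermediate &table2 and the &table2.average style references are not needed at all.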

SAS where condition exact matching

Thanks for the feedback guys, but I have to rewrite the question to make it more clear.
Say we have a table (posted as an image in the original question) with a Number column and FP_NDT date values.
What I am trying to get from this table is a list of Numbers whose FP_NDT dates match my condition. For example, I want the Numbers that have a non-missing FP_NDT for 2014 and 2015 and missing values for 2011, 2012 and 2013 (regardless of the month). With this condition I should get only Number 4. Is it possible to do this from this table?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
Now it is easy to make your tests.
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
You could use the first query as a sub-query instead of creating a physical table.
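A minimal sketch of that sub-query form, reusing the summary query above:
proc sql;
select number
from (select number
           , max(year(fp_ndt)=2011) as yr2011
           , max(year(fp_ndt)=2012) as yr2012
           , max(year(fp_ndt)=2013) as yr2013
           , max(year(fp_ndt)=2014) as yr2014
           , max(year(fp_ndt)=2015) as yr2015
      from table1
      group by number)
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
quit;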
So you want names associated with 1, 2 and 3, but in different rows.
You can group rows by names and count the associated numbers as this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
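For example, a short sketch of that filter on the summarized table:
PROC SQL;
SELECT name
FROM xxx
WHERE count1 > 0 AND count2 > 0 AND count3 > 0;
QUIT;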
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one workaround, which is to create separate data sets for each year and then inner join them with where conditions for missing and non-missing values for the needed years. However, it becomes a bit cumbersome when it comes to 60 months, for instance...
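For illustration, a rough sketch of that workaround for just the years above, assuming the same table1 and fp_ndt names as in the first answer; it already shows why 60 months would get unwieldy:
proc sql;
/* one data set per year, holding the Numbers that have a non-missing FP_NDT in that year */
create table y2014 as select distinct number from table1 where year(fp_ndt)=2014;
create table y2015 as select distinct number from table1 where year(fp_ndt)=2015;
create table y2011_2013 as select distinct number from table1 where year(fp_ndt) in (2011,2012,2013);
/* Numbers present in both 2014 and 2015 and absent in 2011-2013 */
select a.number
from y2014 a
inner join y2015 b on a.number = b.number
where a.number not in (select number from y2011_2013);
quit;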