Upgrade from Hive 0.10 to Hive 0.12 - MapReduce

I have upgraded my Hive from 0.10 to 0.12. Hive 0.10 worked fine before, but with Hive 0.12 I run into an exception when I execute a query like
select count(*) from table1
Any help will be really appreciated, thanks!
I have Hadoop 1.0.3, HBase 0.92.1, and Hive 0.12.
java.lang.NegativeArraySizeException: -1
at org.apache.hadoop.hbase.util.Bytes.readByteArray(Bytes.java:147)
at org.apache.hadoop.hbase.mapreduce.TableSplit.readFields(TableSplit.java:133)
at org.apache.hadoop.hive.hbase.HBaseSplit.readFields(HBaseSplit.java:53)
at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:151)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:396)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
hive> describe tct;
OK
deviceid string from deserializer
level int from deserializer
stage int from deserializer
ispayer string from deserializer

Related

Get date without time from Athena table using S3 bucket

I have a table in Athena with multiple columns. One of the columns, named date_col, has values in the format below:
date_col
1/13/2022 3:00:16 PM
1/13/2022 3:00:13 PM
1/13/2022 2:00:16 PM
1/13/2022 2:15:16 PM
From the date_col records above, I want to get only the date, without the time part.
Here is the query I am using:
select date_col, date_format(date_col, '%m/%d/%Y') from "test".sample_table
But I get the following error:
SYNTAX_ERROR: line 1:25: Unexpected parameters (varchar, varchar(8)) for function date_format. Expected: date_format(timestamp with time zone, varchar(x)) , date_format(timestamp, varchar(x))
The required format should be:
date_col
1/13/2022
1/13/2022
1/13/2022
1/13/2022
I tried different ways to get that result, but I couldn't get the required format. Can you please help with that? Thanks in advance.
Please try with the date_format function:
select date_col, date_format(date_col, '%m/%d/%Y') FROM <table_name>;
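Note that the error message says date_col is a varchar, so it likely has to be parsed into a timestamp before date_format can be applied. A sketch using Presto's date_parse (table name taken from the question; the format string is an assumption based on the sample values and may need adjusting):
-- Parse the varchar into a timestamp, then format only the date part
SELECT date_col,
       date_format(date_parse(date_col, '%c/%e/%Y %l:%i:%s %p'), '%m/%d/%Y') AS date_only
FROM "test".sample_table;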

Power BI: create a measure grouping by Max(Date) that selects only the MaxDate row

First, I am new to Power BI. I can do this with T-SQL as DirectQuery, but I want to see how it is done in Power BI using Import.
I have the following table structure, with a one-to-many relationship from device to metric:
device: id, devicename, column3, ...
metric: id, Device_ID, metric1, metric2, createddate
DeviceName      | metric1 | Metric2 | createdDate
----------------+---------+---------+------------------------
netsclprdelr09  | 44.70   | 14.73   | 2021-11-15 20:00:01.343
netsclrprdctc13 | 8.90    | 6.66    | 2021-11-15 20:00:01.343
netsclrprdelr16 | 7.40    | 6.58    | 2021-11-15 20:00:01.343
netsclprdelr09  | 40.50   | 14.95   | 2021-11-15 19:00:01.567
netsclrprdctc13 | 9.20    | 6.64    | 2021-11-15 19:00:01.567
netsclrprdelr16 | 8.00    | 6.59    | 2021-11-15 19:00:01.567
netsclprdelr09  | 44.70   | 14.93   | 2021-11-15 18:00:01.120
netsclrprdctc13 | 9.20    | 6.66    | 2021-11-15 18:00:01.120
netsclrprdelr16 | 9.00    | 6.62    | 2021-11-15 18:00:01.120
I am only looking to return the first 3 rows (those with max(createddate)), eliminating all others.
I suspect this could be done with the CALCULATE function and some filters, but I am stuck. Thanks.
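Since the question says the T-SQL DirectQuery version already works, here is a rough sketch of what that equivalent might look like, just to pin down the intended result (table and column names taken from the structure above; this is not the Power BI measure itself):
-- For each device, keep only the row with the latest createddate
SELECT d.devicename, m.metric1, m.metric2, m.createddate
FROM device AS d
JOIN (
    SELECT Device_ID, metric1, metric2, createddate,
           ROW_NUMBER() OVER (PARTITION BY Device_ID ORDER BY createddate DESC) AS rn
    FROM metric
) AS m ON m.Device_ID = d.id
WHERE m.rn = 1;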

Spark timestamp timezone in JDBC read/write

I am creating a Parquet file by reading data from Oracle.
Oracle is running in UTC. I confirmed it using:
SELECT DBTIMEZONE FROM DUAL;
Output:
DBTIMEZONE|
----------|
+00:00 |
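(DBTIMEZONE and the session time zone are separate settings in Oracle; as an additional sanity check, the session value can be queried the same way:)
-- The session time zone may differ from DBTIMEZONE
SELECT SESSIONTIMEZONE FROM DUAL;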
Reading from JDBC and writing to S3 as parquet:
df = spark.read.format('jdbc').options(url=url,
                                       dbtable=query,
                                       user=user,
                                       password=password,
                                       fetchsize=2000).load()
df.write.parquet(s3_loc, mode="overwrite")
Now, I checked the value of spark.sql.session.timeZone:
print(spark.conf.get("spark.sql.session.timeZone"))
Output:
UTC
Now, I read the data back from the S3 location:
df1 = spark.read.parquet(s3_loc)
df1.show()
Output:
+-------------------+
| col1 |
+-------------------+
|2012-11-11 05:00:00|
|2013-11-25 05:00:00|
|2013-11-11 05:00:00|
|2014-12-25 05:00:00|
+-------------------+
col1 is a DATE in Oracle and is converted to a timestamp in the Spark DataFrame.
Why are 5 hours added in the output? The database is running in UTC and spark.sql.session.timeZone is UTC.
Note:
Both RDS and EMR are running in AWS US-EAST-1.
On all the Spark nodes, I set TZ=UTC.
The timezone is determined by the JDBC driver, which does not know about Spark's timezone setting but relies on the JVM's default timezone. Moreover, it ignores the remote database session's timezone settings. You said you set TZ=UTC - I'm not sure, but it probably didn't take effect. Check what TimeZone.getDefault tells you.
If, as I suspect, your JVM timezone is EDT (US-EAST-1 is Virginia), then 2012-11-11 00:00:00 read from Oracle over JDBC is interpreted as being in EDT. Displayed in Spark it becomes 2012-11-11 05:00:00 UTC, which is the result you got.
To fix it, override the JVM default timezone when running spark-submit:
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
--conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC" \
...

Query to calculate cost by month using AWS Athena

I have a table like below.
item_id | bill_start_date         | bill_end_date         | usage_amount | user_project
--------+-------------------------+-----------------------+--------------+-------------
635212  | 2019-02-01 00:00:00.000 | 3/1/2019 00:00:00.000 | 13.345       | IBM
I am trying to find the usage_amount per month and per project. The Amazon Athena query engine is based on Presto 0.172. Due to Athena's limitations, it does not recognize queries like select sysdate from dual;.
I tried to convert bill_start_date and bill_end_date from timestamp to date but failed; even current_date() didn't work in my case. I am able to calculate the total cost by hard-coding the values, but my end goal is to perform the calculation on the columns.
SELECT (FLOOR(SUM(usage_amount)*100)/100) AS total,
user_project
FROM test_table
WHERE bill_start_date
BETWEEN date '2019-02-01'
AND date '2019-03-01'
GROUP BY user_project;
In Presto, current_timestamp is a SQL standard function which does not use parentheses.
To group by month, I'd use date_trunc('month', bill_start_date).
All of these functions are documented here
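Putting those two points together, a sketch of the monthly grouping (table and column names taken from the question; this assumes bill_start_date is already a timestamp):
-- Monthly total per project, truncating each bill_start_date to its month
SELECT date_trunc('month', bill_start_date) AS bill_month,
       user_project,
       floor(sum(usage_amount) * 100) / 100 AS total
FROM test_table
GROUP BY date_trunc('month', bill_start_date), user_project
ORDER BY bill_month, user_project;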

SQL Server exception: Received an invalid column length from the bcp client for colid modify_time

I am getting the error below while using the SQL Server bcp client to load data into Azure SQL Data Warehouse.
Exact exception:
com.microsoft.sqlserver.jdbc.SQLServerException: 107096;Received an invalid column length from the bcp client for colid modify_time.
I am able to load the data correctly into an Azure SQL database, but this issue happens while loading the data into Azure SQL Data Warehouse.
It happens only for timestamp columns.
When I created the table in Azure SQL Data Warehouse, it was created like this:
name | type | warehouse type | precision | length | java sql type
------------+------------+----------------+-----------+--------+-----------
modify_time | datetime2 | -9 | 27 | 54 | -9*
Bulk load operation is done by the following sample code:
SQLServerBulkCopy copy = new SQLServerBulkCopy(conn);
copy.setDestinationTableName("my_table");
copy.writeToServer(new ISQLServerBulkRecord() {
    // Overridden methods
});
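Not an answer, but a quick way to double-check how the target column was actually declared in the warehouse (standard catalog views; 'my_table' and 'modify_time' are the placeholder names from the snippet above):
-- Inspect the declared type, length, precision and scale of the target column
SELECT c.name, t.name AS type_name, c.max_length, c.precision, c.scale
FROM sys.columns AS c
JOIN sys.types AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('my_table')
  AND c.name = 'modify_time';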