Hive null issue - mapreduce

I have created a Hive table and started loading the data using the statement: load data inpath '<hdfs path>' into table <hive_table_name>
When I looked at the loaded data, I found two problems:
1) At the end of the last column there is a run of NULL values appended that is not present in the source file.
2) When I run count(*) from Hive, the map phase reaches 100% but the reduce phase stays at 0% and the job keeps running without returning a result.
An example row of the csv data is given below:
xxxxx,2xxxx 08:15:00.0,19 ,Wxxxxx 2 IST 2015,0,2015- 100.0,1A,gggg,null,null,null,null,null,null,null,null,null,RP,AAGhghjgS,DELVS3885,1ghhh63,Djhkj85,null,AGY,jkjk85,1122JK,55666,null,1,BjhkhkjDC,null,006hhgjgAGS,null,null,null,/DCS-SYNCUS,null,null,kljlkl,null,null,null,null,null,null,null,null,null,null,null,null,14jkjhj63,DELVS3885,T,null,1A,hgfd,IN,null,null,null,null,null,null,14300963,DELbhjhhjkhk,T,null,1A,DEL,IN,null,null,null,null,null,null,null,hgjhhjj,A,null,UK,ghj,IN,null,null,null,null,null,null,Wed Jan 20 13:36:28 IST 2016
Please help me with this.

Related

How to Append SQL Output into a New Table in Power BI?

I am working on creating a trend line for a daily count I am getting from a SQL query. Each time the query runs, I get a count and the current date.
I need a way to record the output of the query into a new table and continue to append to it each time the query runs. The appended table would look like this:
Count    Date
250      10/12/2022
257      10/13/2022
220      10/14/2022
This table would allow me to create a trend line. I am also open to a different approach if there is a better way.
Have you tried using the Power Query Append option?
It should solve your issue.

Limit transformation to top 1 row for each day in dataset

Background:
I have a monitoring script that runs 3 times a day and outputs a .csv file to a SharePoint folder. Each time the script runs, the new csv contains an update on the various processes that were run. I am currently able to get all of the csv files back as a series of rows in the transformation.
Question:
Is there a way to limit the number of rows for each day to just the top 1 row, so that the dashboard being created shows the most up-to-date information for each particular day? I would like to do this at the Transform stage so I don't have to load any unnecessary data.
E.g. example data in the transformation:
Filename    Extension    Date created           Keep in Transformation?
file9       .csv         29/04/2021 07:52:41    KEEP
file8       .csv         28/04/2021 16:52:14    KEEP
file7       .csv         28/04/2021 11:52:20    [redundant]
file6       .csv         28/04/2021 07:52:49    [redundant]
file5       .csv         27/04/2021 16:51:41    KEEP
file4       .csv         27/04/2021 11:52:21    [redundant]
file3       .csv         27/04/2021 07:52:03    [redundant]
file2       .csv         26/04/2021 16:52:43    KEEP
file1       .csv         26/04/2021 11:52:20    [redundant]
Feels weird to answer my own question, but thought I would post, just in case someone has the same question...
The steps to get the latest row for each day are:
1) Ensure that the dataset is ordered by the Date created column in descending order.
2) Duplicate the Date created column to perform transformations on. This creates a new column called Date created - Copy.
3) Highlight the Date created - Copy column and select Split Column by Delimiter. As it's a Date/Time column, split by the Space delimiter. This creates 2 new columns, Date created - Copy.1 and Date created - Copy.2.
4) Highlight the new date column, Date created - Copy.1, and select Remove Rows - Remove Duplicates. At this point you should only see the latest row of data for each day.
5) Remove the 2 split columns to tidy up the dataset.
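For anyone who needs the same result outside Power Query, the sort-newest-first-then-remove-duplicates idea above can be sketched in pandas; the column names are taken from the example table and the small frame is made up purely for illustration:
import pandas as pd

# Made-up rows with the columns from the example table above.
df = pd.DataFrame({
    "Filename": ["file9", "file8", "file7"],
    "Extension": [".csv", ".csv", ".csv"],
    "Date created": pd.to_datetime(
        ["29/04/2021 07:52:41", "28/04/2021 16:52:14", "28/04/2021 11:52:20"],
        dayfirst=True,
    ),
})

# Same logic as the steps: sort newest first, derive the date part,
# then keep only the first (i.e. latest) row for each calendar day.
df["Date"] = df["Date created"].dt.date
latest_per_day = (
    df.sort_values("Date created", ascending=False)
      .drop_duplicates(subset="Date")
)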

How can I speed up this Athena Query?

I am running a query through the Athena Query Editor on a table in the Glue Data Catalog and would like to understand why it takes so long to do a simple select * from this data.
Our data is stored in an S3 bucket that is partitioned by year/month/day/hour, with 80 snappy Parquet files per partition that are anywhere between 1 - 10 MB in size each. When I run the following query:
select stringA, stringB, timestampA, timestampB, bigintA, bigintB
from tableA
where year='2021' and month='2' and day = '2'
It scans 700MB but takes over 3 minutes to display the Athena results. I feel that we have already optimized the file format and partitioning for this data, and so I am unsure how else we can improve the performance if we're just trying to select this data out and display it in a tool like QuickSight.
The select * performance was impacted by the number of files that needed to be scanned, all of which were relatively small. Repartitioning the data and removing the hour partition improved both runtime (a 14% reduction) and data scanned (a 26% reduction), because Snappy compression gets better gains on larger files.
Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

reading an xls in python

I am currently trying to read an xls file with multiple worksheets into pandas (each month has one worksheet per day). I don't need all of the worksheets, just the sheets named 1 to 31, depending on the month. How would I go about joining only those dataframes into one dataframe?
I tried to hard-code the sheet names but got errors for months with only 28 or 30 days.
Is there a way to read a sheet only if its name is an integer, since the sheets I don't need are usually named 'Sheet1', etc.?
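A sketch of one way to do this in pandas, assuming the day sheets are literally named '1' to '31' (the file name below is hypothetical):
import pandas as pd

path = "monthly_workbook.xls"  # hypothetical file name

# List every sheet name, then keep only the ones that are plain integers
# (the day sheets '1'..'31'), skipping 'Sheet1' and similar.
workbook = pd.ExcelFile(path)
day_sheets = [name for name in workbook.sheet_names if str(name).isdigit()]

# Read just those sheets (a dict of {sheet_name: DataFrame}) and stack them
# into one frame, tagging each row with the day it came from.
frames = pd.read_excel(path, sheet_name=day_sheets)
combined = pd.concat(
    [frame.assign(day=int(sheet)) for sheet, frame in frames.items()],
    ignore_index=True,
)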

First time Updating a table

I was recently given permission to update a single table in our database, but this is not something I have done before and I do not want to mess anything up. I have tried searching online for something similar to what I am trying to do, with no success.
The table name is dbo.Player_Miles and it only has two columns of data, Player_ID and Miles, both of which are set as (int, null).
Currently there are about 300K records in this table, and I have a csv file I need to use to update it. The file has about 500K records, so I need to be able to:
INSERT the new records (~250K records)
UPDATE the records that have new information (~200K records)
Leave untouched any record that has the same information (~50K records); updating those to the same values would not hurt, but I would guess it would be a resource hog
Also leave untouched any records currently in the table that are not in the update file (~50K records)
I am using SSMS 2008 but the Server is 2000.
You could approach this in stages...
1) Back up the database
2) Create a temporary SQL table to hold your update records
create table Player_Miles_Updates (
PlayerId int not null,
Miles int null)
3) Load the records from your text file into your temporary table
bulk insert Player_Miles_Updates
from 'c:\temp\myTextRecords.csv'
with
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
4) Begin a transaction
begin transaction
5) Insert your new data
insert into Player_Miles
select PlayerId, Miles
from Player_Miles_Updates
where PlayerId not in (select Player_ID from Player_Miles)
6) Update your existing data
update pm
set pm.Miles = pmu.Miles
from Player_Miles pm join Player_Miles_Updates pmu on pm.Player_ID = pmu.PlayerId
7) Select a few rows to make sure what you wanted to happen, happened
select *
from Player_Miles
where Player_Id in (1,45,86,14,83) -- use id's that you have seen in the csv file
8a) If all went well
commit transaction
8b) If all didn't go well
rollback transaction
9) Delete the temporary table
drop table Player_Miles_Updates
You should use SSIS (or DTS, which was replaced by SSIS in SQL Server 2005).
Use the CSV as your source and "upsert" the data to your destination table.
In SSIS there are different ways to get this task done.
An easy way would be to use a lookup task on Player_ID.
If there's a match, update the value; if there's no match, insert the new value.
See this link for more information on lookup-pattern-upsert.
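If it helps to see the lookup's match/no-match split outside SSIS, the same logic can be sketched in pandas; the frames and values below are made up purely for illustration:
import pandas as pd

# Made-up stand-ins for the existing table and the incoming csv.
existing = pd.DataFrame({"Player_ID": [1, 2, 3], "Miles": [100, 200, 300]})
incoming = pd.DataFrame({"Player_ID": [2, 3, 4], "Miles": [250, 300, 50]})

# "Lookup" on Player_ID: flag whether each incoming row already exists.
merged = incoming.merge(
    existing[["Player_ID"]], on="Player_ID", how="left", indicator=True
)

# Rows with a match become updates; rows with no match become inserts.
to_update = merged[merged["_merge"] == "both"].drop(columns="_merge")
to_insert = merged[merged["_merge"] == "left_only"].drop(columns="_merge")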