I have two text files with the same structure that I extracted from SQL Server. One file is 1.5 GB and the other is 7.5 GB. I created a table in Hive and then copied these files to the corresponding GCS buckets. Now, when I try to load the data into the tables, it fails for the 7.5 GB file, and after running the LOAD DATA INPATH command the 7.5 GB file is deleted from the bucket. For the 1.5 GB file it works perfectly fine. What alternative should I try to fix this issue?
My HiveQL is below.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
( v_nbr int,
v_nm varchar(80),
p_nbr int,
r_nbr int,
a_account varchar(80),
a_amount decimal(13,4),
c_store int,
c_account int,
c_amount decimal(13,4),
rec_date date)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable1.5/file1.5gb.txt' OVERWRITE INTO TABLE myschema.table1.5;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable7.5/file7.5gb.txt' OVERWRITE INTO TABLE myschema.table7.5;
LOAD DATA INPATH does not copy the file; it moves it from the source path into the table's storage location, which is why the file disappears from the bucket. Since the table is external, you can skip the load entirely and point the table's LOCATION at the directory that already contains the file:
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
( v_nbr int,
v_nm varchar(80),
p_nbr int,
r_nbr int,
a_account varchar(80),
a_amount decimal(13,4),
c_store int,
c_account int,
c_amount decimal(13,4),
rec_date date)
row format delimited
fields terminated by ','
stored as textfile
LOCATION 'gs://mybucket/myschema.db/mytable1.5/';
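With the external table defined over the directory, the data stays in place in GCS and no LOAD DATA (and therefore no file move or delete) is needed. A quick sanity check follows; the second-table name mentioned in the comments is hypothetical:
-- Verify the external table can read the file without moving it
SELECT COUNT(*) FROM myschema.mytable;
-- The 7.5 GB file can be handled the same way: repeat the CREATE EXTERNAL TABLE above
-- for a second table (e.g. a hypothetical myschema.mytable7_5) with
-- LOCATION 'gs://mybucket/myschema.db/mytable7.5/'.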
I have two Hive queries: one that creates a table, and another that reads Parquet data from a different table and inserts the relevant columns into the new table. I would like this new Hive table to export its data to an S3 location in csv.gz format. My Hive queries running on EMR currently output files named 00000_0.gz, and I have to rename them to csv.gz with a bash script. This is quite hacky, since I have to mount my S3 directory in my terminal, and it would be ideal if my queries could produce the right file names directly. Could someone please review my queries to see if there's anything at fault? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';
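For reference, a closely related Hive feature is writing query results straight to an S3 prefix with INSERT OVERWRITE DIRECTORY. A minimal sketch is below (the prefix s3://test_unload/csv_export/ is hypothetical); note that with output compression enabled Hive still names the files 000000_0.gz rather than .csv.gz, so this does not by itself remove the renaming step described above:
-- Sketch only: writes comma-delimited text under a hypothetical prefix;
-- the compression codec comes from the cluster's output-compression settings.
set hive.exec.compress.output=true;
INSERT OVERWRITE DIRECTORY 's3://test_unload/csv_export/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
select
    '' as app_id,
    '288' as app_account_id,
    from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
    device_id
FROM processed_events.test
where load_date = '20221106';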
I used this article to read my VPC flow logs and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
But my question is: when I follow the documentation and run the CREATE TABLE statement below, querying the table does not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This CREATE TABLE statement should work after changing placeholders like DOC-EXAMPLE-BUCKET/prefix, account_id, and region_code. Why am I getting 0 rows back from a SELECT * query?
You need to load the partitions first before you can query them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive-compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
Otherwise, you'll have to load them manually using ADD PARTITION:
ALTER TABLE vpc_flow_logs ADD PARTITION (`aws-account-id`='1', `aws-service`='2' ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because manually adding partitions is tedious when your data structure is not Hive-compatible, I recommend you use partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
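As an illustration only (a sketch, not taken from the question: the year range, zero-padding, and the location template are assumptions you would adapt to your actual prefix layout), partition projection for this table can be configured entirely through table properties, with the three aws-* partition keys declared as injected so each query supplies them in its WHERE clause:
ALTER TABLE vpc_flow_logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- injected keys must be pinned with equality predicates in every query
  'projection.aws-account-id.type' = 'injected',
  'projection.aws-service.type' = 'injected',
  'projection.aws-region.type' = 'injected',
  -- assumed ranges; adjust to the span of your logs
  'projection.year.type' = 'integer',
  'projection.year.range' = '2021,2025',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  -- assumed non-Hive-compatible layout: .../AWSLogs/<account>/vpcflowlogs/<region>/<year>/<month>/<day>/<hour>/
  'storage.location.template' = 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/${aws-account-id}/${aws-service}/${aws-region}/${year}/${month}/${day}/${hour}/'
);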
create external table reason ( reason_id int,
retailer_id int,
reason_code string,
reason_text string,
ordering int,
creation_date date,
is_active tinyint,
last_updated_by int,
update_date date
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
location 's3://bucket_name/athena-workspace/athena-input/'
TBLPROPERTIES ("skip.header.line.count"="1");
The query above executes successfully; however, there are no files in the provided location.
Upon successful execution, the table is created but is empty. How is this possible?
Even if I upload a file to the provided location, the created table is still empty.
Athena is not a data store; it is simply a serverless tool for reading data in S3 using SQL-like expressions.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
This query creates the table's metadata; it doesn't write to that location, it reads from it.
If you put a CSV into the location and ran select * from reason, Athena would attempt to map every file under the prefix athena-workspace/athena-input/ in the bucket bucket_name to your data format, using the ROW FORMAT and SERDEPROPERTIES to parse the files. It would also skip the first line of each file, assuming it's a header.
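For example (a minimal sketch, assuming a tab-separated file such as a hypothetical s3://bucket_name/athena-workspace/athena-input/reasons.tsv has been uploaded under that prefix), the table can then be queried directly:
-- Athena parses the uploaded file at query time; this table definition never writes anything back to S3.
SELECT reason_id, reason_code, is_active
FROM reason
LIMIT 10;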
Has anyone tried creating an AWS Athena table on top of sequence files? As per the documentation, it looks like it is possible. I was able to execute the CREATE TABLE statement below.
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
STORED AS sequencefile
location 's3://bucket/sequencefile/';
The statement executed successfully, but when I try to read data from the table it throws the error below:
Your query has the following error(s):
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://viewershipforneo4j/2017-09-26/000030_0 (offset=372128055, length=62021342) using org.apache.hadoop.mapred.SequenceFileInputFormat: s3://viewershipforneo4j/2017-09-26/000030_0 not a SequenceFile
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 9f0983b0-33da-4686-84a3-91b14a39cd09.
The sequence files are valid ones. The issue here is that no delimiter is defined, i.e. row format delimited fields terminated by is missing.
If, in your case, tab is the column delimiter and each record is on its own row, it would be:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
row format delimited fields terminated by '\t'
STORED AS sequencefile
location 's3://bucket/sequencefile/';
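After recreating the table with the delimiter, a quick check (a minimal sketch, assuming the files under s3://bucket/sequencefile/ are in fact readable SequenceFiles with tab-separated text values):
-- Simple read test against the recreated table
SELECT account_id, receiver_id, session_index, start_epoch
FROM sample_sequence
LIMIT 10;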
I have a .csv file containing data about crime incidents in Philadelphia.
I am using a Hive script in Amazon EMR to convert this data into a Hive table.
I am using the following Hive script:
CREATE EXTERNAL TABLE IF NOT EXISTS Crime(
Dc_Dist INT,
PSA INT,
Dispatch_Date_Time TIMESTAMP,
Dispatch_Date date,
Dispatch_Time STRING,
Hour INT,
Dc_Key BIGINT,
Location_Block STRING,
UCR_General INT,
Text_General_Code STRING,
Police_Districts INT,
Month STRING,
Lon STRING,
Lat STRING)
COMMENT 'Data about crime from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location 's3://dsabucket/crimeData/crime';
I run this script, but I do not get a file or any data in my output folder. I am not sure whether the table was created properly. As I understand it, the 'STORED AS TEXTFILE' line should store this table as a text file.
To check whether the table was created, use DESCRIBE,
i.e. DESCRIBE tableName;
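For example (a minimal check; the table name and LOCATION come from the script above, and note that an external table only registers metadata over files already sitting under the LOCATION, it does not write new files there):
-- Confirm the table definition was registered
DESCRIBE Crime;
-- Confirm Hive can read the CSV data under s3://dsabucket/crimeData/crime
SELECT Dc_Dist, Dispatch_Date, Text_General_Code
FROM Crime
LIMIT 10;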