I'm running a Hive script on EMR that's pulling data out of S3 keys. I can get all the data and put it in a table just fine. The problem is, some of the data I need is in the key name. How do I get the key name from within Hive and put it into the Hive table?
I faced a similar problem recently. From what I researched, it depends: you can get the data out of the "directory" part of an S3 key, but not the "filename" part.
You can use partitions if the S3 keys are formatted properly. Partitions can be queried the same way as columns. Here is a link with some examples: Loading data with Hive, S3, EMR, and Recover Partitions
You can also specify the partitions yourself if the S3 files are already grouped properly. For example, I needed the date information, so my script looked like this:
create external table Example(Id string, PostalCode string, State string)
partitioned by (year int, month int, day int)
row format delimited fields terminated by ','
tblproperties ("skip.header.line.count"="1");
alter table Example add partition(year=2014,month=8,day=1) location 's3n://{BucketName}/myExampledata/2014/08/01/';
alter table Example add partition(year=2014,month=8,day=2) location 's3n://{BucketName}/myExampledata/2014/08/02/';
...keep going
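Once the partitions are added, the partition columns can be queried like ordinary columns. A minimal sketch against the table above:
-- year/month/day come from the directory names, not the file contents
SELECT Id, PostalCode, State, year, month, day
FROM Example
WHERE year = 2014 AND month = 8 AND day = 1;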
The partition data must be part of the "directory name" and not the "filename" because Hive loads data from a directory.
If you need to read some text out of the file name itself, I think you have to write a custom program to rename (or copy) the objects so that the text you need ends up in the "directory name".
Good luck!
I'm still getting to grips with Athena, so apologies if this question doesn't make sense. I'm trying to create a table in Athena from a csv file I have uploaded to my Amazon S3 bucket. During the table creation process on Athena, I need to define field names and datatypes. How are these fields matched with those stored in the Amazon S3 bucket? Is it just the order in which they appear in the csv?
Yes, indeed, they are matched by the order in which they appear in the CSV.
https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
This depends on the data format and the serde used. For CSV data, columns are mapped by index. The same goes for the regex and grok serdes, as well as ORC. JSON, Avro, and Parquet are instead mapped by name.
Grok, ORC, and Parquet can be configured to map by either index or name, but the above are the defaults.
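For a CSV-backed table, that means the column declarations must follow the same order as the fields in each file. A minimal sketch (bucket, path, and column names are placeholders):
CREATE EXTERNAL TABLE example_csv (
  id string,        -- matched to the 1st field of each CSV line
  name string,      -- 2nd field
  amount double     -- 3rd field
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/example-csv/';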
I'm trying to move a partitioned table over from the US to the EU region, but whenever I manage to do so, it doesn't partition the table on the correct column.
The current process that I'm taking is:
Create a Storage bucket in the region that I want the partitioned table to be in
Export the partitioned table as CSV to the original bucket (within the old region)
Transfer the table across buckets (from the original bucket to the new one)
Create a new table using the CSV from the new bucket (auto-detect schema is on)
bq --location=eu load --autodetect --source_format=CSV table_test_set.test_table [project ID/test_table]
I expect the table to be partitioned on the DATE column, but instead it's partitioned on the _PARTITIONTIME column.
Also, note that I'm currently doing this with CLI commands. This will need to be redone multiple times, so having reusable code is a must.
When I migrate data from one table to another, I follow this process:
I extract the data to GCS (CSV or other format)
I extract the schema of the source table with this command: bq show --schema <dataset>.<table>
I create the destination table via the GUI, using the "Edit as text" schema option, and paste in the extracted schema. I manually define the partition field that I want to use from the schema;
I load the data from GCS to the destination table.
This process has 2 advantages:
When you import in CSV format, you define the REAL types that you want. Remember, with schema autodetect, BigQuery looks at only about 10 or 20 lines and deduces the schema from them. Often, string fields end up set as INTEGER because the first lines of the file contain no letters, only digits (serial numbers, for example).
You can define your partition fields properly
The process is quite easy to script. I use the GUI for creating the destination table, but the bq command line is great for doing the same thing.
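If you prefer to avoid the GUI entirely, roughly the same idea can be sketched in BigQuery SQL DDL, with explicit column types and an explicit partition column (the dataset, table, and column names below are placeholders):
CREATE TABLE my_dataset.test_table (
  id STRING,
  serial_number STRING,   -- declared explicitly so autodetect can't guess INTEGER
  event_date DATE
)
PARTITION BY event_date;   -- the partition column you choose, not _PARTITIONTIME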
After some more digging I managed to find the solution. By using "--time_partitioning_field [column name]" you are able to partition by a specific column, so the command would look like this:
bq --location=eu --schema [where your JSON schema file is] load --time_partitioning_field [column name] --source_format=NEWLINE_DELIMITED_JSON table_test_set.test_table [project ID/test_table]
I also found that using JSON schema files makes things easier.
Quite a beginner's question -
I have log data stored in S3 files, in zipped JSON format.
The files reside in a directory hierarchy which reflects partitioning, in the following way: s3://bucket_name/year=2018/month=201805/day=201805/some_more_partitions/file.json.gz
I recently changed the schema of the logging to a slightly different directory structure: I added some more partition levels for fields that currently reside inside the JSON, and I want to move them into the folder hierarchy. I also changed the inner JSON schema slightly, and the new logs reside in a different S3 bucket.
I wish to convert the old logs to the new format, because I have Athena mapping over the new schema structure.
Is AWS EMR the tool for this? If so, what's the simplest way to achieve it? I thought I needed an EMR cluster of type "step execution", but doesn't that produce just one output file?
Thanks
Yes, Amazon EMR is an appropriate tool to use.
You could use Hive, which has similar-ish syntax to Athena:
Create an External Table pointing to your existing data, using your old schema
Create an External Table pointing to where you wish to store the data, using your new schema
INSERT INTO new_table SELECT * FROM old_table
If your intention is to query the data with Amazon Athena, you can use Amazon EMR to convert the data into Parquet format, which will give even better query performance.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
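A minimal sketch of what the new-schema, Parquet-backed target table and the copy might look like (the bucket, table, and column names are placeholders; adjust the partition columns to your new hierarchy, and old_logs stands for the external table over the old-format data):
CREATE EXTERNAL TABLE new_logs (
  request_id string,
  message string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://new-bucket/converted-logs/';

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_logs PARTITION (year, month, day)
SELECT request_id, message, year, month, day   -- partition columns last
FROM old_logs;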
Yes, EMR can be used for such a conversion.
Here's sample code that converts data arriving as CSV (the stg folder, aka the source folder) into the ORC file format. You may want to use INSERT OVERWRITE in case you have overlapping partitions between your staging (source) files and target files.
DROP TABLE IF EXISTS db_stg.stg_table;
CREATE EXTERNAL TABLE db_stg.stg_table(
GEO_KEY string,
WK_BEG_DT string,
FIS_WK_NUM Double,
AMOUNT1 Double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket.name/stg_folder_name/'
TBLPROPERTIES ('has_encrypted_data'='false');
drop table db_tgt.target_table;
CREATE EXTERNAL TABLE db_tgt.target_table(
GEO_KEY string,
WK_BEG_DT string,
AMOUNT1 Double
)
-- the partition column is declared only in PARTITIONED BY (with its type), not in the column list
PARTITIONED BY (FIS_WK_NUM Double)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
location 's3://bucket.name/tgt_folder_name/'
TBLPROPERTIES (
'orc.compress'='SNAPPY');
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table db_tgt.target_table partition(FIS_WK_NUM)
select
GEO_KEY,
WK_BEG_DT,
AMOUNT1,
FIS_WK_NUM   -- the dynamic partition column must come last in the select list
from db_stg.stg_table;
Agree with John that converting to a columnar file format like Parquet or ORC (along with compression like SNAPPY) will give you the best performance with AWS Athena.
Remember, the key to using Athena is to optimize the amount of data you scan and read. Hence, if the data is in columnar format and you are reading only certain partitions, your AWS Athena cost will go down significantly. All you need to do is make sure you are using a filter condition in your Athena queries that selects only the required partitions.
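For example, against a table partitioned like db_tgt.target_table above, a query of this shape (a sketch) scans only the partition it names:
SELECT GEO_KEY, AMOUNT1
FROM db_tgt.target_table
WHERE FIS_WK_NUM = 32;   -- partition filter: only this partition's files are read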
Our S3 buckets generally have a number of sub-directories, so that the path to an object is something like s3://top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data
We're looking into Athena to be able to query against data on that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
Is it possible to create an Athena table that queries against this structure? And what would the S3 location be in the create table wizard? I tried it with s3://top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 KB data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
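A query along these lines (a sketch) would then read only that tenant's data, with the org-tenant-company-id supplied as an ordinary predicate:
SELECT id, stuff
FROM mydb.mytable
WHERE orgtenantcompanyid = 'org1'
LIMIT 10;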
Yes, it is possible to create tables that use only the contents of a specific subdirectory.
It's normal that after creating your table you see 0 KB read: no data is read when you CREATE a table.
To check whether you can actually query the data, do something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, this doesn't happen automatically if the paths are not in the /key=value/ format. You can still use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables
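One way to read the path in Athena is its "$path" pseudo-column, which returns the S3 object each row was read from. A minimal sketch (my_table and id are placeholders):
SELECT "$path" AS source_object, id
FROM my_table
LIMIT 10;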
I'm new to AWS and Hive, and I'm trying to use Hive to analyze Google Ngrams data. I tried to save a table as tab-delimited CSV in an S3 bucket, but now I don't know how to view it or download it to see if my job executed correctly.
The query I used to create the table was
CREATE EXTERNAL TABLE test_table2 (
gram string,
year int,
occurrences bigint,
pages bigint,
books bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://mybucket/sub-bucket/test-table2.txt';
I then filled the table with data:
INSERT OVERWRITE TABLE test_table2
SELECT
gram,
year,
occurrences,
pages,
books
FROM
eng1m_5grams_normed
WHERE
gram = 'early bird gets the worm';
The query ran fine, and I think everything worked correctly. However, when I navigate to my bucket in the S3 Management Console online, the text file appears as a folder containing a bunch of files. These files have long hexadecimal names and are 0 bytes in size.
Is this just the text file represented as a directory? Is there a way I can view or download the file to see if my query worked? I tried to make the directory public so I could download it, but the download button in the "Actions" dropdown menu is still greyed out.
In Hive/S3, think of S3 directories as tables. The files contained in those directories are the contents of those tables (i.e. the rows). The reason you have multiple files in the directory is that multiple reducers are writing the "table".
S3 Browser is a very nice tool for working with S3.
What happened is that very few rows may have qualified against the predicate in the WHERE clause, so very few (or no) rows were selected and emitted into the output (hence the zero-sized files). EMR doesn't give you a simple way to download the result of a query.
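A quick sanity check (a sketch, reusing the table and predicate from the question) is to count how many rows actually match before writing them out:
SELECT COUNT(*)
FROM eng1m_5grams_normed
WHERE gram = 'early bird gets the worm';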