I am trying to upload a CSV file from HDFS to HBase. I am able to dump the CSV file from HDFS into an HBase table, but I am not getting the full file: the CSV has 53,292 records, but when I run count on my table it returns only 999 entries.
Actual record count:
# hdfs dfs -cat /user/card_transactions.csv | wc -l
53292
Record count after dumping the data from HDFS into the HBase table:
=> ["card_transaction"]
hbase(main):002:0> count 'card_transaction'
999 row(s) in 0.3730 seconds
=> 999
hbase(main):003:0>
The command I am using to dump data from HDFS into the HBase table is below.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns='HBASE_ROW_KEY,cf:card_id,cf:member_id,cf:amount,cf:postcode,cf:pos_id,cf:transaction_dt,cf:status' card_transaction /user/card_transactions.csv
Could someone please help me out?
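One thing worth checking (a hedged suggestion, not a confirmed cause): HBase keeps exactly one row per row key, so if the first CSV column that ImportTsv maps to HBASE_ROW_KEY contains duplicate values, those rows silently merge and the table count comes out lower than the file's line count. A quick pandas sketch to check that, with the local file path as a placeholder:

import pandas as pd

# Read the same CSV that was pushed to HDFS (a local copy; header=None
# because ImportTsv maps the first column directly to HBASE_ROW_KEY).
df = pd.read_csv("card_transactions.csv", header=None)

total_lines = len(df)
unique_keys = df[0].nunique()  # column 0 becomes the HBase row key

print("lines in file:  ", total_lines)
print("unique row keys:", unique_keys)
# If unique_keys is much smaller than total_lines, HBase is merging rows
# that share a key, which would explain the low count.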
Related
I am new to AWS and want to do some data pipelining in AWS.
I have a bunch of CSV files stored in S3.
Things I want to achieve:
Union all the CSV files and add the filename to each line; the first line (header) needs to be removed from each file before unioning the CSVs;
Split the filename column by the _ delimiter;
Store this all in a DB after processing.
What is the best/fastest way to achieve this?
Thanks
You can create a Glue job using PySpark which will read the CSV files into a DataFrame, and then you can transform it however you like.
After that you can convert that DataFrame to Parquet and save it in S3.
Then you can run a Glue crawler, which will register the Parquet data as a table you can query.
Basically you are doing ETL using AWS Glue.
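A minimal PySpark sketch of that approach (plain Spark rather than the Glue DynamicFrame API; the bucket names, prefixes, and the filename split are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, split

spark = SparkSession.builder.appName("union-csv-with-filename").getOrCreate()

# Read every CSV under the prefix at once; header=True drops the header
# line of each file, and Spark unions the files into a single DataFrame.
df = spark.read.csv("s3://your-bucket/input/*.csv", header=True)

# Attach the source file name to every row, keep just the base name,
# and split it on "_" as requested.
df = (df.withColumn("source_file", input_file_name())
        .withColumn("file_name", regexp_extract("source_file", r"([^/]+)\.csv$", 1))
        .withColumn("file_parts", split("file_name", "_")))

# Write the result as Parquet so a Glue crawler can register it as a table.
df.write.mode("overwrite").parquet("s3://your-bucket/output/")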
I'm trying to load data into a Kudu table but I am getting a strange result.
In the Impala console I created an external table from the four HDFS files imported by Sqoop:
drop table if exists hdfs_datedim;
create external table hdfs_datedim
( ... )
row format
delimited fields terminated by ','
location '/user/me/DATEDIM';
A SELECT COUNT(*) tells me there are lots of rows present. The data looks good when queried.
I use a standard INSERT INTO ... SELECT to copy the results:
INSERT INTO impala_kudu.DATEDIM
SELECT * FROM hdfs_datedim;
A SELECT COUNT(*) tells me impala_kudu.DATEDIM has four rows (the number of files in HDFS, not the number of rows in the table).
Any ideas?
Currently Sqoop does not support Kudu. You can import to HDFS and then use Impala to write the data to Kudu.
Under the covers, the data created by Sqoop was a sequence of poorly formatted CSV files. The import failed without an error because of data issues in the flat files. Watch out for date formats and for text strings with delimiters embedded in the string.
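If it helps anyone hitting the same thing, here is a hedged PySpark sketch for spotting malformed lines in the Sqoop output before loading it into Kudu (the HDFS path and the expected column count are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size, split

spark = SparkSession.builder.appName("csv-sanity-check").getOrCreate()

EXPECTED_COLS = 8  # placeholder: set to the real column count of DATEDIM

# Read the Sqoop part files as raw text so nothing is silently dropped.
lines = spark.read.text("/user/me/DATEDIM/part-m-*")

# Flag lines whose naive comma split does not yield the expected number of
# fields -- typically odd date formats or embedded delimiters in strings.
bad = lines.filter(size(split(col("value"), ",")) != EXPECTED_COLS)

print("total lines:  ", lines.count())
print("suspect lines:", bad.count())
bad.show(20, truncate=False)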
If you have the data in HDFS in CSV/Avro/Parquet format, then you can use the command below to import the files into a Kudu table.
Prerequisites:
Kudu jar with compatible version (1.6 or higher)
spark2-submit --master yarn/local --class org.apache.kudu.spark.tools.ImportExportFiles <path of kudu jar>/kudu-spark2-tools_2.11-1.6.0.jar --operation=import --format=<parquet/avro/csv> --master-addrs=<kudu master host>:<port number> --path=<hdfs path for data> --table-name=impala::<table name>
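Alternatively, if you already run Spark with the kudu-spark jar on the classpath, something along these lines should also work from PySpark (the master address, table name, and paths below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-kudu").getOrCreate()

# Read the source data from HDFS (CSV shown here; Avro/Parquet work the same way).
df = spark.read.csv("hdfs:///user/me/data/*.csv", header=True)

# Append into an existing Kudu table via the kudu-spark data source.
(df.write
   .format("org.apache.kudu.spark.kudu")
   .option("kudu.master", "kudu-master-host:7051")
   .option("kudu.table", "impala::default.my_table")
   .mode("append")
   .save())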
There's an Excel file testFile.xlsx; it looks like this:
ID ENTITY STATE
1 Montgomery County Muni Utility Dist No.39 TX
2 State of Washington WA
3 Waterloo CUSD 5 IL
4 Staunton CUSD 6 IL
5 Berea City SD OH
6 City of Coshocton OH
Now I want to import the data into the AWS Glue database. A crawler has been created in AWS Glue, but there's nothing in the table in the AWS Glue database after running the crawler. I guess it is an issue with the classifier in AWS Glue, but I have no idea how to create a proper classifier to successfully import the data in the Excel file into the AWS Glue database. Thanks for any answers or advice.
I'm afraid Glue crawlers have no classifier for MS Excel files (.xlsx or .xls). Here you can find the list of supported formats and built-in classifiers. It would probably be better to convert the files to CSV or some other supported format before bringing them into the AWS Glue Catalog.
Glue crawlers don't support MS Excel files.
If you want to create a table for the Excel file, you have to convert it first from Excel to CSV/JSON/Parquet and then run the crawler on the newly created file.
You can convert it easily using pandas.
Create a normal Python job and read the Excel file.
import pandas as pd
df = pd.read_excel('yourFile.xlsx', 'SheetName', dtype=str, index_col=None)
df.to_csv('yourFile.csv', encoding='utf-8', index=False)
This will convert your file to CSV; then run the crawler over the new file and your table will be created.
Hope it helps.
When you say that "there's nothing in the table in AWS Glue database after running the crawler" are you saying that in the Glue UI, you are clicking on Databases, then the database name, then on "Tables in xxx", and nothing is showing up?
The second part of your question seems to indicate that you are looking for Glue to import the actual data rows of your file into the Glue database. Is that correct? The Glue database does not store data rows, just the schema information about the files. You will need to use a Glue ETL job, or Athena, or Hive to actually move the data from the data file into something like MySQL.
You should write a script (most likely a Python shell job in Glue) to convert the Excel file to CSV and then run the crawler over it.
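A rough sketch of such a Python shell job, assuming pandas plus an Excel engine (xlrd/openpyxl) are available to the job; the bucket and key names are placeholders:

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Placeholders -- substitute your own bucket and keys.
bucket = "your-bucket"
xlsx_key = "incoming/testFile.xlsx"
csv_key = "converted/testFile.csv"

# Download the Excel file, convert it with pandas, and upload the CSV back to S3.
s3.download_file(bucket, xlsx_key, "/tmp/testFile.xlsx")

df = pd.read_excel("/tmp/testFile.xlsx", dtype=str, index_col=None)
df.to_csv("/tmp/testFile.csv", encoding="utf-8", index=False)

s3.upload_file("/tmp/testFile.csv", bucket, csv_key)
# Point the crawler at s3://your-bucket/converted/ afterwards.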
In my S3 bucket I have an .xls file (this file is a grouped file; I mean the first 20 rows contain an image and some extra details about the client).
So first I want to convert the .xls into .csv, then load the Redshift table through COPY commands, ignoring the first 20 rows as well.
Note: when I manually save the .xls as .csv and then try to load the Redshift table through COPY commands, it loads successfully. Now my problem is how to convert .xls into .csv through Pentaho jobs.
You can convert Excel to CSV by using a transformation with just two steps inside:
Microsoft Excel input - it reads the rows from your Excel file
Text file output - it saves the rows from step 1 to a CSV file
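If a scripted alternative to the Pentaho transformation is also acceptable, the 20 header rows can be dropped during the conversion itself with pandas (this is not Pentaho, and the file names are placeholders):

import pandas as pd

# Skip the first 20 rows of images/client details so the CSV starts at the data.
df = pd.read_excel("input.xls", skiprows=20, dtype=str, index_col=None)
df.to_csv("output.csv", encoding="utf-8", index=False)
# The resulting CSV can then be loaded into Redshift with a plain COPY.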
Using Sqoop I've successfully imported a few rows from a table that has a BLOB column. Now the part-m-00000 file contains all the records, along with the BLOB field, as CSV.
Questions:
1) As per the docs, knowledge of the Sqoop-specific format can help in reading those BLOB records.
So, what does the Sqoop-specific format mean?
2) Basically, the BLOB field is a .gz file of a text file containing some float data. These .gz files are stored in the Oracle DB as BLOBs and imported into HDFS using Sqoop. So how can I get back that float data from the HDFS file?
Any sample code would be of very great use.
I see these options.
Sqoop import from Oracle directly to a Hive table with a binary data type. This option may limit the processing capabilities outside Hive, like MR, Pig, etc.; i.e. you may need to know how the BLOB gets stored in Hive as binary, etc. This is the same limitation that you described in your question 1.
Sqoop import from Oracle to Avro, sequence file, or ORC file formats, which can hold binary data. You should be able to read this by creating a Hive external table on top of it, and you can write a Hive UDF to decompress the binary data. This option is more flexible, as the data can also be processed easily with MR, especially with the Avro and sequence file formats.
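For option 2, here is a hedged PySpark sketch of reading the Sqoop Avro output and decompressing the gzipped BLOB column (the spark-avro package must be on the classpath, and the path and column names are placeholders):

import gzip

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("decompress-blob").getOrCreate()

# Requires spark-avro (in older Spark versions the format string is
# "com.databricks.spark.avro" instead of "avro").
df = spark.read.format("avro").load("/user/me/sqoop_avro_output")

@udf(StringType())
def gunzip(blob):
    # The BLOB column holds a gzipped text file; decompress it back to text.
    if blob is None:
        return None
    return gzip.decompress(bytes(blob)).decode("utf-8")

# "blob_col" is a placeholder for the actual BLOB column name.
result = df.withColumn("blob_text", gunzip("blob_col"))
result.select("blob_text").show(5, truncate=False)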
Hope this helps. How did you resolve it?