I'm trying to load data to a Kudu table but getting a strange result.
In the Impala console I created an external table from the four HDFS files imported by Sqoop:
drop table if exists hdfs_datedim;
create external table hdfs_datedim
( ... )
row format
delimited fields terminated by ','
location '/user/me/DATEDIM';
A SELECT COUNT(*) tells me there are lots of rows present. The data looks good when queried.
I use a standard INSERT INTO ... SELECT to copy the results:
INSERT INTO impala_kudu.DATEDIM
SELECT * FROM hdfs_datedim;
A SELECT COUNT(*) tells me impala_kudu.DATEDIM has four rows (the number of files in HDFS, not the number of rows in the source table).
Any ideas?
Sqoop does not support Kudu yet. You can import to HDFS and then use Impala to write the data to Kudu.
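For reference, a minimal sketch of that pattern in Impala (the column list, primary key and partitioning below are placeholders, since the real schema isn't shown in the question):
-- Placeholder schema; adjust columns, primary key and partitioning to your table.
CREATE TABLE impala_kudu.DATEDIM (
  date_key BIGINT,
  calendar_date STRING,
  PRIMARY KEY (date_key)
)
PARTITION BY HASH (date_key) PARTITIONS 4
STORED AS KUDU;
INSERT INTO impala_kudu.DATEDIM
SELECT * FROM hdfs_datedim;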
Under the covers, the data created by Sqoop was a sequence of poorly formatted CSV files. The import failed without an error because of bad data in the flat files. Watch out for date formats and for text fields that have the delimiter embedded in the string.
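A quick sanity check on the external table can flag rows that were split badly; a sketch, with placeholder column names since the real schema isn't shown:
-- Replace date_key / calendar_date with columns that should never be NULL.
SELECT COUNT(*) AS suspect_rows
FROM hdfs_datedim
WHERE date_key IS NULL
   OR calendar_date IS NULL;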
If you have the data in HDFS in CSV/Avro/Parquet format, then you can use the command below to import the files into a Kudu table.
Prerequisites:
Kudu jar with compatible version (1.6 or higher)
spark2-submit --master yarn/local \
  --class org.apache.kudu.spark.tools.ImportExportFiles \
  <path of kudu jar>/kudu-spark2-tools_2.11-1.6.0.jar \
  --operation=import --format=<parquet/avro/csv> \
  --master-addrs=<kudu master host>:<port number> \
  --path=<hdfs path for data> \
  --table-name=impala::<table name>
I'm trying to move a partitioned table from the US to the EU region, but whenever I manage to do so, it doesn't partition the table on the correct column.
The current process that I'm taking is:
Create a Storage bucket in the region that I want the partitioned table to be in
Export the partitioned table as CSV to the original bucket (within the old region)
Transfer the table across buckets (from the original bucket to the new one)
Create a new table using the CSV from the new bucket (auto-detect schema is on)
bq --location=eu load --autodetect --source_format=CSV table_test_set.test_table [project ID/test_table]
I expect the table to be partitioned on the DATE column, but instead it's partitioned on the ingestion-time pseudo-column _PARTITIONTIME.
Also note that I'm currently doing this with CLI commands. This will need to be redone multiple times, so having reusable code is a must.
When I migrate data from one table to another, I follow this process:
I extract the data to GCS (CSV or other format)
I extract the schema of the source table with this command: bq show --schema <dataset>.<table>
I create the destination table via the GUI, using the "Edit as text" schema option and pasting in the schema. I manually define the partition field that I want to use from the schema;
I load the data from GCS to the destination table.
This process has 2 advantages:
When you import a CSV, you define the REAL types that you want. Remember that with schema auto-detect, BigQuery looks at about 10 or 20 lines and deduces the schema. String fields are often set as INTEGER because the first lines of the file contain no letters, only digits (serial numbers, for example).
You can define your partition fields properly
The process is quite easy to script. I use the GUI for creating the destination table, but the bq command line is great for doing the same thing.
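If you want to avoid the GUI entirely, the destination table can also be created with a standard SQL DDL statement before loading. A minimal sketch, with placeholder project, dataset and column names:
-- Placeholder names; PARTITION BY must reference the real DATE column.
CREATE TABLE `my_project.table_test_set.test_table`
(
  id INT64,
  event_date DATE,   -- the column you actually want to partition on
  amount NUMERIC
)
PARTITION BY event_date;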
After some more digging I managed to find the solution. By using "--time_partitioning_field [column name]" you are able to partition by a specific column. So the command would look like this:
bq --location=eu --schema [where your JSON schema file is] load \
  --time_partitioning_field [column name] \
  --source_format=NEWLINE_DELIMITED_JSON \
  table_test_set.test_table [project ID/test_table]
I also found that using JSON files makes things easier.
I am trying to insert a data set in Redshift with values as :
"2015-04-12T00:00:00.000+05:30"
"2015-04-18T00:00:00.000+05:30"
"2015-05-09T00:00:00.000+05:30"
"2015-05-24T00:00:00.000+05:30"
"2015-07-19T00:00:00.000+05:30"
"2015-08-02T00:00:00.000+05:30"
"2015-09-05T00:00:00.000+05:30"
The crawler that I ran over the S3 data is unable to identify the columns or the datatype of the values. I have been tweaking the table settings to get the job to push the data into Redshift, but to no avail. Here is what I have tried so far:
Manually added the column in the table definition in Glue Catalog. There is only 1 column which is mentioned above.
Changed the Serde serialization lib from LazySimpleSerde to org.apache.hadoop.hive.serde2.lazy.OpenCSVSerDe
Added the following Serde parameters - quoteChar ", line.delim \n, field.delim \n
I have already tried different combinations of line.delim and field.delim properties. Including one, omitting another and taking both at the same time as well.
Changed the classification from UNKNOWN to text in the table properties.
Changed the recordCount property to 469 to match the raw data row counts.
The job runs are always successful. After a job run, when I do a select * from table_name, I always get the correct count of rows in the Redshift table as per the raw data, but all the rows are NULL. How do I populate the rows in Redshift?
The table properties have been uploaded in image album here : Imgur Album
I was unable to push the data into Redshift using Glue, so I turned to the COPY command of Redshift. Here is the command that I executed in case anyone else needs it or faces the same situation:
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
FIXEDWIDTH 'Column_Name:31'
region 'us-east-1';
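If the target column is defined as TIMESTAMPTZ, a variant along these lines should also work, since COPY can parse the quoted ISO 8601 values directly (path and names kept as placeholders, and I haven't tested this against the exact file):
-- Sketch: CSV handles the surrounding double quotes, TIMEFORMAT 'auto'
-- lets COPY recognize the ISO 8601 timestamps with timezone offsets.
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
CSV
TIMEFORMAT 'auto'
region 'us-east-1';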
I'm working on building the company's new data lake and am trying to find the best and most up-to-date option to work with here.
So, I found a pretty nice solution to work with EMR + S3 + Athena + Glue.
The process that I did was:
1 - Ran an Apache Spark script to generate 30 million rows partitioned by date on S3, stored as ORC.
2 - Ran an Athena query to create the external table.
3 - Checked the table from EMR connected to the Glue Data Catalog and it worked perfectly. Both Spark and Hive were able to access it.
4 - Generated another 30 million rows in another folder, partitioned by date, in ORC format.
5 - Ran the Glue Crawler, which identified the new table and added it to the Data Catalog. Athena was able to query it, but Spark and Hive were not. See the exceptions below:
Spark
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct
Hive
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)
I checked whether there was a serialization problem and I found this:
Table created manually (configuration):
Input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
orc.compress: SNAPPY
Table created with Glue Crawler:
Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
So the crawler-created table cannot be read from Hive or Spark, but it works for Athena. I already changed the configuration, but with no effect in Hive or Spark.
Has anyone faced this problem?
Well, a few weeks after I posted this question, AWS fixed the problem. As I showed above, the problem was real and it was a bug in Glue. It's a new product and it still has some issues from time to time, but this was solved properly. See the properties of the table now:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
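If anyone hits the same mismatch on an older crawler-created table, one thing worth trying before a re-crawl is to force the ORC file format in the metastore from Hive. A sketch only: the table name and partition spec are placeholders, and existing partitions need the same change individually:
-- Placeholder table/partition names; overrides the input/output format
-- and SerDe that the crawler recorded.
ALTER TABLE my_orc_table SET FILEFORMAT ORC;
ALTER TABLE my_orc_table PARTITION (dt='2018-01-01') SET FILEFORMAT ORC;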
I need to copy some tables from one dashDB database over to a separate dashDB database. Normally I would export the CSV file from one and load it into the other using the web console; however, one table in particular has a CLOB column, so we will need to export to an IXF file plus LOB files and then import it. Unfortunately I can't see any easy way to do this, as it looks like clpplus can only export to the server that the database is on (which I don't have access to), and I can't see any way to get it to export the LOB files. Does anyone know how best to accomplish this?
If the CLOB values are in reality smaller than 32K you can try to transform them into a VARCHAR value as part of the SELECT statement that you provide to EXPORT.
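For example, something along these lines from the CLP (the table and column names are placeholders, and the cast assumes every CLOB value fits in 32K):
export to mytable.csv of del
  select id, cast(my_clob_col as varchar(32000)) as my_clob_col
  from mytable;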
If you really need to export LOB files you can write them to your users home dir inside the dashDB instance and then use the /home REST API to download the files e.g. with curl: https://developer.ibm.com/static/site-id/85/api/dashdb-analytics/
Another option is to export the table with the LOBs to a local machine and then import into another dashDB.
One way to export a dashDB table to a local client is to run the EXPORT command in a DB2 Command Line Processor (CLP) on your client machine. To do so, you need to install the IBM Data Server Runtime Client and then catalog your dashDB databases in the client, like this:
CATALOG TCPIP NODE mydash REMOTE dashdb-txn-small-yp-lon02-99.services.eu-gb.bluemix.net SERVER 50000;
CATALOG DATABASE bludb AS dash1 AT NODE mydash;
CONNECT TO dash1 USER <username> USING <password>;
Now, let's export the table called "mytable" so that the LOB column is written to a separate file:
export to mytable.del of del
lobfile mylobs
modified by lobsinfile
select * from mytable;
This export command produces the files mytable.del and mylobs.001.lob. The file mytable.del contains pointers into the file mylobs.001.lob that specify the offset and length of each value.
If the LOB data is too large to fit into a single file, then additional files mylobs.002.lob, mylobs.003.lob, etc. will be created.
Note the exported data will be sent from dashDB to your local client in uncompressed form, which may take some time depending on the data volume.
If the .DEL and .LOB files reside on a client machine, such as your laptop or a local server, you can use the IMPORT command to ingest these files into a table with a LOB column. In the CLP you would first connect to the dashDB database that you want to load into.
Let's assume the original table has been exported to the files mytable.del and mylobs.001.lob, and that these files are now located on your client machine in the directory /mydata. Then this command will load the data and LOBs into the target table:
IMPORT FROM /mydata/mytable.del OF DEL
LOBS FROM /mydata
MODIFIED BY LOBSINFILE
INSERT INTO mytable2;
This IMPORT command can be run in a DB2 Command Line Processor on your client machine.
Using Sqoop I’ve successfully imported a few rows from a table that has a BLOB column. Now the part-m-00000 file contains all the records, along with the BLOB field, as CSV.
Questions:
1) As per the doc, knowledge about the Sqoop-specific format can help to read those BLOB records.
So, what does the Sqoop-specific format mean?
2) Basically the BLOB is a .gz file of a text file containing some float data. These .gz files are stored in the Oracle DB as BLOBs and imported into HDFS using Sqoop. So how can I get that float data back from the HDFS file?
Any sample code would be of great use.
I see these options.
Sqoop import from Oracle directly to a Hive table with a binary data type. This option may limit the processing capabilities outside Hive, such as MR, Pig, etc.; i.e. you may need to know how the BLOB gets stored in Hive as binary, etc. This is the same limitation that you described in your question 1.
Sqoop import from Oracle to Avro, sequence file or ORC formats, which can hold binary data. You should be able to read this by creating a Hive external table on top of it, and you can write a Hive UDF to decompress the binary data. This option is more flexible, as the data can also be processed easily with MR, especially with the Avro and sequence file formats. A rough sketch of this approach is shown below.
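Everything in this sketch is a placeholder: the HDFS path, the column names, and gunzip_to_string, which is a hypothetical UDF you would have to write and register yourself (e.g. wrapping java.util.zip.GZIPInputStream):
-- External table over the Avro files that Sqoop produced (placeholder path/columns).
CREATE EXTERNAL TABLE blob_import (
  id BIGINT,
  payload BINARY   -- the gzipped text that was a BLOB in Oracle
)
STORED AS AVRO
LOCATION '/user/me/blob_import';
-- After ADD JAR / CREATE FUNCTION for the hypothetical UDF:
SELECT id, gunzip_to_string(payload) AS float_text
FROM blob_import;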
Hope this helps. How did you resolve it?