Re-parsing Blob data stored in HDFS imported from Oracle by Sqoop - mapreduce

Using Sqoop I’ve successfully imported a few rows from a table that has a BLOB column.Now the part-m-00000 file contains all the records along with BLOB field as CSV.
Questions:
1) As per doc, knowledge about the Sqoop-specific format can help to read those blob records.
So , What does the Sqoop-specific format means ?
2) Basically the blob file is .gz file of a text file containing some float data in it. These .gz file is stored in Oracle DB as blob and imported into HDFS using Sqoop. So how could I be able to get back those float data from HDFS file.
Any sample code will of very great use.

I see these options.
Sqoop Import from Oracle directly to hive table with a binary data type. This option may limit the processing capabilities outside hive like MR, pig etc. i.e. you may need to know the knowledge of how the blob gets stored in hive as binary etc. The same limitation that you described in your question 1.
Sqoop import from oracle to avro, sequence or orc file formats which can hold binary. And you should be able to read this by creating a hive external table on top of it. You can write a hive UDF to decompress the binary data. This option is more flexible as the data can be processed easily with MR as well especially the avro, sequence file formats.
Hope this helps. How did you resolve?

Related

create table and load data in redshift from Parquet file in S3

I have my Parquet file in S3. I want to load this to the redshift table. I don't know the schema of the Parquet file.
Is there any command to create a table and then copy parquet data to it?
Also, I want to add the default time column date timestamp DEFAULT to_char(CURRDATE, 'YYYY-MM-DD').
You need first to create an external schema. Normal colunar schemas do not support parquet files.
Them you need to create a external table to support the columns on the s3 file. So look for a good relation between the file and datatypes. I usually avoid smalls ints for example. To floats create as REAL data type.
Then, do the copy command bellow:
COPY schema_name.table_name
FROM bucket_object iam_role user_arn
FORMAT PARQUET;

Big query table update from reference CSV file

I have a BigQuery table and I want to update the content of few rows it from a reference CSV file. This CSV file is uploaded to Google cloud storage bucket.
When you use external table from storage, you can only read the CSV, not update them.
However, you can load you CSV into a BigQuery native table, perform the update with DML, and then export the table to CSV. That only works if you have only one CSV.
If you have several CSV files, you can, at least, print the pseudo column _FILE_NAME to identify the files where you need to perform the change. But the change will have to be performed manually or with the previous solution (native table)

Export athena table to S3 as one readable file

I am baffled: I cannot figure out how to export a sucessfully run CREATE TABLE statement to a single CSV.
The query "saves" the result of my Create Table command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?
If you want a result of CTAS query statement being written into a single file, then you would need to use bucketing by one of the columns you have in your resulting table. In order to get resulting files in csv format, you would need to specify tables' format and field delimiter properties.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://my_athena_results/ctas_query_result_bucketed/',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1)
AS (
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
);
Athena is a distributed system, and it will scale the execution on your query by some unobservable mechanism. Note, that even explicitly specifying a bucket size of one, might still get multiple files [1].
See Athena documentation for more information on its syntax and what can be specified within WITH directive. Also, don't forget about
considerations and limitations for CTAS Queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty etc.
Update 2019-08-13
Apparently, the result of CTAS statements are compressed with GZIP algorithm by default. I couldn't find in documentation how to change this behavior. So, all you would need is to uncompress it after you had downloaded it locally. NOTE: uncompressed files won't have .csv file extension, but you still will be able to open them with text editors.
Update 2019-08-14
You wont' be able to preserve column names inside files if you save them in csv format. Instead, they would be specified in AWS Glue meta-data catalog, together with other information about a newly created table.
If you want to preserve column names in the output files after executing CTAS queries, then you should consider file formats which inherently do that, e.g. JSON, Parquet etc. You can do that by using format property within WITH clause. Choice of file format really depends on a use case and size of data. Go with JSON if your files are relatively small and you want to download and be able to read their content virtually from anywhere. If files are big and you are planning to keep them on S3 and query with Athena, then go with Parquet.
Athena stores query results in Amazon S3.
A results file stored automatically in a CSV format (*.csv) .So results can be exported into a csv file without CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html)
Execute athena query using StartQueryExecution API and results .csv can be found at the output location specified in api call.
(https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html)

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent Big Query table that reads some csv files from a Cloud Storage Bucket sharing the same prefix name (filename*.csv) and the same schema.
There are some csvs anyway that make fail BigQuery queries with a message like the following one: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns.
Moving all the csvs one-by-one from the bucket I devised the one responsible for that.
This csv file doesn't have 10 lines...
I found this ticket BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was having an empty line at the end. But also others csvs in my bucket do, so this can't be the reason.
On the other hand this csv is the only one with content type text/csv; charset=utf-8, all the others being text/csv,application/vnd.ms-excel,application/octet-stream.
Furthermore downloading this csv to my local Windows machine and uploading it againt to Cloud Storage, content type is automatically converted to application/vnd.ms-excel.
Then even with the missing line Big Query can then query the permanent table based on filename*.csvs.
Is it possible that BigQuery had issues querying csvs with UTF-8 encoding, or is it just coincidence?
Use Google Cloud Dataprep to load your csv file. Once the file is loaded, analyze the data and clean it if requires.
Once all the rows are cleaned, you can then sink that data in BQ.
Dataprep is GUI based ETL tool and it runs a dataflow job internally.
Do let me know if any more clarification is required.
Just to remark the issue, the CSV file had gzip as encoding which was the reason that BigQuery doesn't interpret as a CSV file.
According to documentation BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is relate to the metadata of the files in GCS you can edit the metadata directly from the Console.

Load Parquet files into Redshift

I have a bunch of Parquet files on S3, i want to load them into redshift in most optimal way.
Each file is split into multiple chunks......what is the most optimal way to load data from S3 into Redshift?
Also, how do you create the target table definition in Redshift? Is there a way to infer schema from Parquet and create table programatically? I believe there is a way to do this using Redshift spectrum, but i want to know if this can be done in scripting.
Appreciate your help!
I am considering all AWS tools such as Glue, Lambda etc to do this the most optimal way(in terms of performance, security and cost).
The Amazon Redshift COPY command can natively load Parquet files by using the parameter:
FORMAT AS PARQUET
See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
The table must be pre-created; it cannot be created automatically.
Also note from COPY from Columnar Data Formats - Amazon Redshift:
COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
use parquet-tools from GitHub to dissect the file :
parquet-tool schema <filename> #will dump the schema w/datatypes
parquet-tool head <filename> #will dump the first 5 data structures
Use the jsonpaths file to specify mappings