How to delete a row from a CSV file on Data Lake Store without using U-SQL? - unit-testing

I am writing a unit test for appending data to a CSV file on a data lake. I want to verify the append by finding my test data in the same file, and once I find it, delete the row I inserted. The test passes once the test data is found, but because the tests run in production, I have to locate the row I inserted in the file and delete it after the test is run.
I want to do this without using U-SQL, to avoid the cost involved in running U-SQL jobs. What other ways are there to do it?

You cannot delete a row (or any part) from a file. Azure Data Lake Store is an append-only file system: data, once committed, cannot be erased or updated. If you're testing in production, your application needs to be aware of test rows and ignore them appropriately.
The other choice is to read all the rows in U-SQL and then write an output excluding the test rows.
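The same read-filter-rewrite idea can be sketched locally in Python. This is only a sketch: the file content is inlined here, the actual download and upload would go through the Data Lake Store SDK, and the TEST_ROW marker is a hypothetical stand-in for however your test data is tagged:

```python
import csv
import io

# Since the store is append-only, the only way to "delete" a row is to
# read everything, drop the unwanted rows, and write the file back out.
original = "id,value\n1,real\n2,TEST_ROW\n3,real\n"  # e.g. downloaded file content

rows = list(csv.reader(io.StringIO(original)))
header, body = rows[0], rows[1:]

# Drop any row carrying the (hypothetical) test marker.
kept = [r for r in body if r[1] != "TEST_ROW"]

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows([header] + kept)
print(out.getvalue())  # this is what you would upload back
```
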

Like other big data analytics platforms, ADLA / U-SQL does not support appending to files per se. What you can do is take an input file, append some content to it (e.g. via U-SQL), and write it out as another file. A simple example:
DECLARE @inputFilepath string = "input/input79.txt";
DECLARE @outputFilepath string = "output/output.txt";

@input =
    EXTRACT col1 int,
            col2 DateTime,
            col3 string
    FROM @inputFilepath
    USING Extractors.Csv(skipFirstNRows : 1);

@output =
    SELECT *
    FROM @input
    UNION ALL
    SELECT *
    FROM (
        VALUES
        (
            2,
            DateTime.Now,
            "some string"
        ) ) AS x (col1, col2, col3);

OUTPUT @output
TO @outputFilepath
USING Outputters.Csv(quoting : false, outputHeader : true);
If you want further control, you can do some things via the PowerShell SDK, e.g. test whether an item exists:
Test-AdlStoreItem -Account $adls -Path "/data.csv"
or move an item with Move-AzureRmDataLakeStoreItem. More details here:
Manage Azure Data Lake Analytics using Azure PowerShell

Related

GCP - load table from file and variable in BQ routine

I want to load a table from a file and a variable. The file schema is not the same as the table to be loaded, so the extra columns need to be filled from a variable inside the stored procedure.
In the example below, pty is not part of the CSV file, while the other two columns, mt and de, are part of the file.
set pty = 'sss';
LOAD DATA INTO `###.Tablename`
(
    pty STRING,
    mt INTEGER,
    de INTEGER
)
FROM FILES
(
    format = 'CSV',
    skip_leading_rows = 1,
    uris = ['gs://###.csv']
);
I think you can do that in 2 steps and 2 queries:
LOAD DATA INTO `###.Tablename`
FROM FILES
(
    format = 'CSV',
    skip_leading_rows = 1,
    uris = ['gs://###.csv']
);

UPDATE `###.Tablename`
SET pty = 'sss'
WHERE pty IS NULL;
If it's complicated for you to apply your logic with BigQuery and SQL, you can also create a Python script with the Google BigQuery client and Google Cloud Storage client:
Your script loads the CSV file;
transforms the results to a list of dicts;
adds the extra fields to each dict with your code logic;
loads the resulting dicts to BigQuery.
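Those steps can be sketched as follows. The CSV content is inlined here for illustration (in practice it would come from the Cloud Storage client), and the table name is hypothetical; the actual BigQuery call is indicated in a comment:

```python
import csv
import io

# 1. Load the CSV file (e.g. blob.download_as_text() via the storage client).
csv_text = "mt,de\n1,10\n2,20\n"

# 2. Transform the results into a list of dicts.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# 3. Add the extra field that is not present in the file.
for row in rows:
    row["pty"] = "sss"

# 4. Load the resulting dicts into BigQuery, e.g.:
#    from google.cloud import bigquery
#    bigquery.Client().insert_rows_json("project.dataset.Tablename", rows)
print(rows)
```
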

Athena SQL create table with text data

Below is how the data looks
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena SQL create table statement to produce the format below:
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
This is new to me; any direction towards a solution will help.
You may be able to use a regex SerDe to parse your files. It depends on the shape of your data; you only provided a single line, so this assumes that every line in your data files looks the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
    flight_number STRING,
    `date` STRING,
    pages_printed INT,
    document_name STRING,
    print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(\\S+)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
)
LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
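One way to sanity-check the regex before creating the table is to run it against the sample line locally. Doing so suggests one adjustment: the Date capture group likely needs to be a lazy (.+?) rather than (\S+), because the timestamp itself contains spaces ("2/8/2020 1:04:40 PM"):

```python
import re

sample = ("Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM "
          "Page[s] Printed: 1 Document Name: DownloadAttachment "
          "Print Driver: printermodel (printer driver)")

# Same pattern as the table definition, except the Date group is a
# lazy (.+?) so it can span the spaces inside the timestamp.
pattern = re.compile(
    r"^Flight Number:\s+(\S+)\s+Date:\s+(.+?)\s+"
    r"Page\[s\] Printed:\s+(\S+)\s+Document Name:\s+(\S+)\s+"
    r"Print Driver:\s+(\S+)\s+\(printer driver\)$"
)

m = pattern.match(sample)
print(m.groups())
# -> ('SSSVAD123X', '2/8/2020 1:04:40 PM', '1', 'DownloadAttachment', 'printermodel')
```

Remember that backslashes must be doubled when the adjusted pattern is pasted back into the SERDEPROPERTIES string.
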
First, make sure your data format uses tab spacing between columns, because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per AWS documentation, use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files if your data does not include values enclosed in quotes. You don't need to make it complicated using Regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
As LazySimpleSerDe is the default used by AWS Athena, you don't even need to declare it, see the create table statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
    `Flight Number` STRING,
    `Date` STRING,
    `Pages Printed` INT,
    `Document Name` STRING,
    `Print Driver` STRING)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
LOCATION
    's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."

Issue querying Athena with select having special characters

Below is the select query I am trying:
SELECT * from test WHERE doc = '/folder1/folder2-path/testfile.txt';
This query returns zero results.
If I change the query to use LIKE, omitting the special characters / - ., it works:
SELECT * from test WHERE doc LIKE '%folder1%folder2%path%testfile%txt';
How can I fix this query to use the = or IN operator, as I am interested in running a batch select?
To test your situation, I created a text file containing:
hello
there
/folder1/folder2-path/testfile.txt
this/that
here.there
I uploaded the file to a directory on S3, then created an external table in Athena:
CREATE EXTERNAL TABLE stack (doc string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://my-bucket/my-folder/'
I then ran the command:
select * from stack WHERE doc = '/folder1/folder2-path/testfile.txt'
It returned:
1 /folder1/folder2-path/testfile.txt
So, it worked for me. Therefore, your problem would be caused either by the contents of the file or by the way the external table is defined (e.g. using a different SerDe).
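A common cause of such exact-match failures is invisible characters in the source data (a BOM, trailing whitespace, or \r from Windows line endings) rather than the visible special characters. A quick local check with Python's repr makes them visible; the trailing \r below is a hypothetical example of what might be hiding in the file:

```python
# Hypothetical raw line read from the file, with a Windows \r left over.
line = "/folder1/folder2-path/testfile.txt\r"

print(repr(line))   # repr shows the \r explicitly, unlike print(line)
clean = line.strip()
print(repr(clean))  # after stripping, the value matches the = comparison
```

If this turns out to be the problem, comparing on trim(doc) in the query, or fixing the line endings in the source file, restores the exact match.
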

Can Kettle export BLOB data from an Oracle table?

I have an Oracle table with columns like Document (type BLOB), Extension (VARCHAR2(10), with values like .pdf, .doc) and Document Description (VARCHAR2(100)). I want to export this data and provide it to my customer.
Can this be done in Kettle?
Thanks
I have an MSSQL database that stores images in a BLOB column, and found a way to export these to disk using a dynamic SQL step.
First, I select only the columns necessary to build a file name and SQL statement (id, username, record date, etc.). Then, I use a Modified Java Script Value step to create both the output filename (minus the file extension):
outputPath = '/var/output/';
var filename = outputPath + username + '_' + record_date;
// --> '/var/output/joe_20181121'
and the dynamic SQL statement:
var blob_query = "SELECT blob_column FROM dbo.table WHERE id = '" + id + "'";
Then, after using a select to reduce the field count to just the filename and blob_query, I use a Dynamic SQL row step (with "Outer Join" selected) to retrieve the blob from the database.
The last step is to output to a file using Text file output step. It allows you to supply a file name from a field and give it a file extension to append. On the Content tab, all boxes are unchecked, the Format is "no new-line term" and the Compression is "None". The only field exported is the "blob_column" returned from the dynamic SQL step, and the type should be "binary".
Obviously, this is MUCH slower than other table/SQL operations due to the dynamic SQL step making individual database connections for each row... but it works.
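Outside Kettle, the same per-row pattern (fetch each BLOB, write it to a file named from the other columns, in binary mode with no newline handling) can be sketched in a few lines of Python. sqlite3 stands in here for the real Oracle/MSSQL connection, and the table and column names are hypothetical:

```python
import os
import sqlite3
import tempfile

# In-memory stand-in for the real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, name TEXT, ext TEXT, body BLOB)")
conn.execute("INSERT INTO docs VALUES (1, 'joe_20181121', '.pdf', ?)",
             (b"%PDF-1.4 fake content",))

outdir = tempfile.mkdtemp()
for name, ext, body in conn.execute("SELECT name, ext, body FROM docs"):
    path = os.path.join(outdir, name + ext)
    # Binary mode is essential: no encoding, no newline translation.
    with open(path, "wb") as f:
        f.write(body)
```
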
Good luck!

Import data from HBase and insert into another table using a Pig UDF

I'm using Pig UDFs in Python to read data from an HBase table, then process and parse it, and finally insert it into another HBase table. But I'm facing some issues.
Pig's map is equivalent to Python's dictionary.
My Python script takes (rowkey, some_mapped_values) as input, returns two strings "key" and "content", and declares @outputSchema('tuple:(rowkey:chararray, some_values:chararray)').
The core of my Python code takes a rowkey, parses it and transforms it into another rowkey, and transforms the mapped data into another string, returning the variables (key, content).
But when I try to insert those new values into another HBase table, I face two problems:
Processing is done correctly, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty.
Second, how do I specify the new column family to insert into?
Here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;

data = LOAD 'hbase://source'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true')
       AS (rowkey:chararray, curves:map[]);

data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey, curves);

STORE data_1 INTO 'hbase://destination'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
And my Python script takes (rowkey, curves) as input:
@outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey, curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key, content)