Pass a list as a command line argument in Hive

Basically I want to automate a Hive-query-based job, which will take a list as input from a file; the query will generate reports by applying filters using this input. The input is a list which I want to pass as a WHERE ... IN condition in the Hive query. My query looks like this:
// temp.sql //
INSERT INTO TABLE Table1
SELECT * FROM Table2 where pixel_id in ($PIXEL);
I am trying to pass the input on the command line like this:
hive -f temp.sql -d PIXEL= '('608207','608206','608205','608204','608203','608201','608184','608198','608189')' > temp.log 2>&1 &
I am not sure whether this is the correct way or not.
Does anyone have an idea how to make this work? Please suggest a way around it.

If pixel_id is numeric, you can use this simple approach:
Create a shell script script.sh containing hive -e "select * from orders where order_id in (${1})"
Save it and make it executable by running chmod +x script.sh
Run the shell script, passing the values as a parameter, for example ./script.sh "1, 2, 3, 4"
For string columns you can do the same by escaping the quotes, like this: "\"1\", \"2\""
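Applied to the question's temp.sql, a minimal sketch (the table and column names are taken from the question; the wrapper name and the quoting of the values are illustrative):
#!/bin/bash
# pixel_report.sh -- hypothetical wrapper that inlines the query from the question;
# the comma-separated list of quoted ids is passed as the single script argument.
# Usage: ./pixel_report.sh "'608207','608206','608205'"
hive -e "INSERT INTO TABLE Table1 SELECT * FROM Table2 WHERE pixel_id IN (${1})" > temp.log 2>&1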

Try passing the list as a single string of ','-separated values.
Run this on bash: hive -hiveconf myargs="1','2','3','4"
This assumes your script looks like this: SELECT * from mytable where my_id in ('${hiveconf:myargs}');
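Applied to the question's temp.sql, a minimal sketch (it assumes Hive variable substitution is enabled, which is the default, and that temp.sql references the variable as shown in the comments):
# temp.sql would contain:
#   INSERT INTO TABLE Table1
#   SELECT * FROM Table2 WHERE pixel_id IN ('${hiveconf:PIXEL}');
hive -hiveconf PIXEL="608207','608206','608205','608204','608203','608201','608184','608198','608189" \
  -f temp.sql > temp.log 2>&1 &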

Related

Issue querying Athena with select having special characters

Below is the select query I am trying:
SELECT * from test WHERE doc = '/folder1/folder2-path/testfile.txt';
This query returns zero results.
If I change the query to use LIKE, omitting the special characters / - ., it works:
SELECT * from test WHERE doc LIKE '%folder1%folder2%path%testfile%txt';
How can I fix this query to use the = (eq) or IN operator, since I want to run a batch SELECT?
To test your situation, I created a text file containing:
hello
there
/folder1/folder2-path/testfile.txt
this/that
here.there
I uploaded the file to a directory on S3, then created an external table in Athena:
CREATE EXTERNAL TABLE stack (doc string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://my-bucket/my-folder/'
I then ran the command:
select * from stack WHERE doc = '/folder1/folder2-path/testfile.txt'
It returned:
1 /folder1/folder2-path/testfile.txt
So, it worked for me. Your problem is therefore likely caused either by the contents of the file or by the way that the external table is defined (e.g. using a different SerDe).

Trying to remove database name from sql file

I'm trying to import a postgres dump into a sqlite3 database.
Now, pg_dump adds the database name to the expressions, and this does not work for sqlite3:
CREATE TABLE dbname.table
Is it possible to tell sqlite3 to ignore the database name?
My fallback is to write a regexp that modifies the SQL file, but I'm not a regexp magician. I've come up with something along these lines:
printf "\nCREATE TABLE dbname.data;\nINSERT INTO dbname.data VALUES (\"the data in dbname.\")\n" | sed -e '/^CREATE TABLE/s/dbname.//g' -e '/^INSERT INTO/s/dbname.//g'
But this is incorrect, because I want to substitute only the first occurrence...
Can you give me a suggestion?
You actually don't have to change your file of SQL statements:
$ sqlite3
sqlite> ATTACH 'your_database.db' AS dbname;
sqlite> .read dump_file.sql
ATTACH will open a database using the schema name dbname so that dbname.tablename will refer to it.
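If you do prefer to rewrite the dump with sed, dropping the g flag makes sed replace only the first occurrence on each matching line, and escaping the dot keeps the match literal. A sketch against the question's example (dump_nodb.sql is just an illustrative output name):
sed -e '/^CREATE TABLE/s/dbname\.//' -e '/^INSERT INTO/s/dbname\.//' dump_file.sql > dump_nodb.sql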

How to export Google spanner query results to .csv or google sheets?

I am new to Google Spanner. I have run a query that returned about 50k rows of data, and I want to export that result set to my local machine as a .csv or into a Google Sheet. Previously I used TOAD, which has an export button, but here I do not see any such option. Any suggestions, please?
The gcloud spanner databases execute-sql command allows you to run SQL statements on the command line and redirect output to a file.
The --format=csv global argument should output in CSV.
https://cloud.google.com/spanner/docs/gcloud-spanner
https://cloud.google.com/sdk/gcloud/reference/
Unfortunately, gcloud spanner databases execute-sql is not quite compatible with --format=csv because of the way the data is laid out under the hood (an array instead of a map). It's much less pretty, but this works:
SQL_STRING='select * from your_table'
gcloud spanner databases execute-sql [YOURDB] --instance [YOURINSTANCE] \
  --sql="$SQL_STRING" --format json > data.json
jq -r '.metadata.rowType.fields[].name' data.json | tr '\n' ',' > data.csv
echo "" >> data.csv
jq -r '.rows[] | @csv' data.json >> data.csv
This dumps the query results in JSON form to data.json, then writes the column names to the CSV, followed by a line feed, and finally the row contents. As a bonus, jq is installed by default on Cloud Shell, so this shouldn't carry any extra dependencies there.
As @redpandacurios stated, you can use the gcloud spanner databases execute-sql CLI command to achieve this, though without the --format csv option, as it causes a Format [csv] requires a non-empty projection. error on gcloud v286.0.0.
This does not produce the projection error:
gcloud spanner databases execute-sql \
[DATABASE] \
--instance [INSTANCE] \
--sql "<YOUR SQL>" \
>output.csv
But you get an output formatted as:
<column1> <column2>
<data1> <data1>
<data2> <data2>
...
<dataN> <dataN>
So it is not quite CSV, but whitespace-separated. If you want JSON, use --format json > output.json in place of the last line.
To get CSV, it seems you need to convert from JSON to CSV as described in one of the other answers.
You could use a number of standard database tools with Google Cloud Spanner using a JDBC driver.
Have a look at this article: https://www.googlecloudspanner.com/2017/10/using-standard-database-tools-with.html
Toad is not included as an example, and I don't know if Toad supports dynamic loading of JDBC drivers and connecting to any generic JDBC database. If not, you could try one of the other tools listed in the article. Most of them would also include an export function.
Others have mentioned using --format "csv" but getting the error Format [csv] requires a non-empty projection.
I believe I discovered how to specify projections that will get --format csv to work as expected. An example:
gcloud spanner databases execute-sql [DATABASE] --instance [INSTANCE] \
--sql "select c1, c2 from t1" \
--flatten "rows" \
--format "csv(rows[0]:label=c1, rows[1]:label=c2)"
rows is the actual field name returned by execute-sql, and it is what needs to be transformed in the projection.
I made it with awk only. My gcloud produces "text" output by default, where values contain no whitespace and are separated by tabs:
gcloud spanner databases execute-sql \
[DATABASE] \
--instance [INSTANCE] \
--sql "<YOUR SQL>" \
| awk '{print gensub(/[[:space:]]{1,}/,",","g",$0)}' \
> output.csv
For key=value format (useful where there are many columns) I use this awk filter instead; it captures the column names from the first row, then combines them with the values:
awk 'NR==1 {split($0,columns); next} {split ($0,values); for (i=1; i<=NF; i++) printf ("row %d: %s=%s\n", NR-1, columns[i], values[i]); }'
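Piped together with the same gcloud call as above, a sketch (it assumes the same default text output, with the database and instance placeholders as before):
gcloud spanner databases execute-sql \
  [DATABASE] \
  --instance [INSTANCE] \
  --sql "<YOUR SQL>" \
  | awk 'NR==1 {split($0,columns); next} {split ($0,values); for (i=1; i<=NF; i++) printf ("row %d: %s=%s\n", NR-1, columns[i], values[i]); }'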

Export Hive database, All tables and columns names, to text or csv

There are about 240 tables in a Hive database on AWS. I want to export all tables with column names and their data types to a csv. How can I do that?
Use:
hive -e 'set hive.cli.print.header=true; SELECT * FROM db1.Table1' | sed 's/[\t]/,/g' > /home/table1.csv
set hive.cli.print.header=true: adds the column names to the CSV file.
SELECT * FROM db1.Table1: here you provide your query.
/home/table1.csv: the path where you want to save the file (here as table1.csv).
Hope this solves your problem!
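The above exports the data of a single table. To export the column names and data types of every table in the database, here is a rough sketch (it assumes the database is called db1, the Hive CLI is available, and plain DESCRIBE output is tab-separated, which is the default):
# list all tables, then DESCRIBE each one and emit table,column,type rows
for t in $(hive -S -e 'SHOW TABLES IN db1;'); do
  hive -S -e "DESCRIBE db1.$t;" \
    | awk -F'\t' -v tbl="$t" '$1 !~ /^#/ && NF >= 2 {gsub(/[[:space:]]+$/,"",$1); gsub(/[[:space:]]+$/,"",$2); print tbl "," $1 "," $2}'
done > /home/all_columns.csv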

How to delete a row from csv file on datalake store without using usql?

I am writing a unit test for appending data to a CSV file on a Data Lake Store. I want to test it by finding my test data appended to the same file, and once I have found it I want to delete the row I inserted. Once I find the test data my test will pass, but because the tests run in production I have to search for the test data, i.e. find the row I inserted in the file, and delete it after the test has run.
I want to do this without using U-SQL, in order to avoid the cost involved. What are the other possible ways to do it?
You cannot delete a row (or any part) from a file. Azure Data Lake Store is an append-only file system; data, once committed, cannot be erased or updated. If you're testing in production, your application needs to be aware of the test rows and ignore them appropriately.
The other choice is to read all the rows in U-SQL and then write an output that excludes the test rows.
Like other big data analytics platforms, ADLA / U-SQL does not support appending to files per se. What you can do is take an input file, append some content to it (e.g. via U-SQL) and write it out as another file. A simple example:
DECLARE @inputFilepath string = "input/input79.txt";
DECLARE @outputFilepath string = "output/output.txt";
@input =
    EXTRACT col1 int,
            col2 DateTime,
            col3 string
    FROM @inputFilepath
    USING Extractors.Csv(skipFirstNRows : 1);
@output =
    SELECT *
    FROM @input
    UNION ALL
    SELECT *
    FROM(
        VALUES
        (
            2,
            DateTime.Now,
            "some string"
        ) ) AS x (col1, col2, col3);
OUTPUT @output
TO @outputFilepath
USING Outputters.Csv(quoting : false, outputHeader : true);
If you want further control, you can do some things via the PowerShell SDK, e.g. test whether an item exists:
Test-AdlStoreItem -Account $adls -Path "/data.csv"
or move an item with Move-AzureRmDataLakeStoreItem. More details here:
Manage Azure Data Lake Analytics using Azure PowerShell