I have a file in which essentially every line contains tags.
Example:
<stock_name>Abc Inc.</stock_name>
<stock_value>123.456</stock_value>
........
I have a database table whose records give the (new) stock value for each stock.
The database table has, say, 2 columns: stock_name and stock_value.
I need to scan this database table and, for each stock_name in the table, replace <stock_value>xxx</stock_value> with <stock_value>yyy</stock_value> against the appropriate <stock_name> in the file (xxx = the existing stock_value in the file; yyy = the new stock_value retrieved from the database for that particular stock). Could anyone help me with this, please?
PS: This is not homework. I am in the middle of writing a Perl script to modify the file.
Appreciate your help.
If your markup is guaranteed to be <tag>value</tag>, then this will replace the values using a hash keyed off of the tag name:
my %table = (
    stock_name  => 'xxx',
    stock_value => 'yyy',
);

# For every <tag>, replace whatever follows it (up to the next '<') with the hash value for that tag
$file =~ s/<([^>\/]*)>[^<]*/<$1>$table{$1}/g;
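If the new values come from the database rather than a hard-coded hash, the surrounding script could look roughly like this. This is only a sketch: the DSN, credentials, table name, and file names are placeholders, and it assumes each <stock_name> line appears before its <stock_value> line, as in the example above.

use strict;
use warnings;
use DBI;

# Load stock_name => stock_value pairs from the database (DSN/credentials are placeholders)
my $dbh = DBI->connect('dbi:mysql:stocksdb', 'user', 'password', { RaiseError => 1 });
my %new_value = map { $_->[0] => $_->[1] }
    @{ $dbh->selectall_arrayref('SELECT stock_name, stock_value FROM stock_table') };
$dbh->disconnect;

# Rewrite the file, updating each <stock_value> for the most recently seen <stock_name>
my $current_name;
open my $in,  '<', 'stocks.txt'     or die "open: $!";
open my $out, '>', 'stocks.txt.new' or die "open: $!";
while (my $line = <$in>) {
    if ($line =~ /<stock_name>([^<]*)<\/stock_name>/) {
        $current_name = $1;
    }
    elsif (defined $current_name && exists $new_value{$current_name}) {
        $line =~ s/<stock_value>[^<]*<\/stock_value>/<stock_value>$new_value{$current_name}<\/stock_value>/;
    }
    print {$out} $line;
}
close $in;
close $out;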
Below is how the data looks
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena SQL CREATE TABLE statement in the format below:
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
This is new to me; any direction toward a solution will help.
You may be able to use a regex SerDe to parse your files. It depends on the shape of your data. You only provide a single line, so this assumes that every line in your data files looks the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
  flight_number STRING,
  `date` STRING,
  pages_printed INT,
  document_name STRING,
  print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(.+?)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
)
LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
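Once the table is created, you can sanity-check the column mapping with an ordinary query (table and column names as defined in the statement above):

SELECT flight_number, pages_printed, document_name
FROM flights
LIMIT 10;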
First, make sure your data format uses tab spacing between columns because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per the AWS documentation, use the LazySimpleSerDe for CSV, TSV, and custom-delimited files if your data does not include values enclosed in quotes. You don't need to complicate things with a regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
As LazySimpleSerDe is the default used by AWS Athena, you don't even need to declare it; see the CREATE TABLE statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
`Flight Number` STRING,
`Date` STRING,
`Pages Printed` INT,
`Document Name` STRING,
`Print Driver` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."
I need to replace parts of text using the following query:
UPDATE table_name SET field = REPLACE(field, 'old', 'new')
The problem is that the query must not replace "old.jpg" or "oldcar" and so on, but ONLY "old".
Can we use something like the following?
WHERE field LIKE 'old'
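One way to restrict the replacement to the standalone word is to pad the value with spaces so a plain REPLACE can only match ' old ' as a whole word. This is just a sketch, assuming a MySQL-style engine and that words in the column are separated by single spaces; table_name and field are the names from your query:

-- Assumes words in field are space-separated; note that the outer TRIM also
-- strips any leading/trailing spaces the value already had.
UPDATE table_name
SET field = TRIM(REPLACE(CONCAT(' ', field, ' '), ' old ', ' new '))
WHERE CONCAT(' ', field, ' ') LIKE '% old %';

Because the match requires a space on both sides, values like "old.jpg" or "oldcar" are left untouched.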
I have an Oracle table with columns like Document (type BLOB), Extension (VARCHAR2(10), with values like .pdf, .doc) and Document Description (VARCHAR2(100)). I want to export this data and provide it to my customer.
Can this be done in Kettle?
Thanks
I have an MSSQL database that stores images in a BLOB column, and found a way to export these to disk using a dynamic SQL step.
First, select only the columns necessary to build a file name and SQL statement (id, username, record date, etc.). Then, I use a Modified Javascript Value step to create both the output filename (minus the file extension):
outputPath = '/var/output/';
var filename = outputPath + username + '_' + record_date;
// --> '/var/output/joe_20181121'
and the dynamic SQL statement:
var blob_query = "SELECT blob_column FROM dbo.table WHERE id = '" + id + "'";
Then, after using a select to reduce the field count to just the filename and blob_query, I use a Dynamic SQL row step (with "Outer Join" selected) to retrieve the blob from the database.
The last step is to output to a file using Text file output step. It allows you to supply a file name from a field and give it a file extension to append. On the Content tab, all boxes are unchecked, the Format is "no new-line term" and the Compression is "None". The only field exported is the "blob_column" returned from the dynamic SQL step, and the type should be "binary".
Obviously, this is MUCH slower than other table/SQL operations due to the dynamic SQL step making individual database connections for each row... but it works.
Good luck!
There are about 240 tables in a Hive database on AWS. I want to export all tables with column names and their data types to a csv. How can I do that?
Use:
hive -e 'set hive.cli.print.header=true; SELECT * FROM db1.Table1' | sed 's/[\t]/,/g' > /home/table1.csv
set hive.cli.print.header=true: This adds the column names to the csv file.
SELECT * FROM db1.Table1: Here you provide your query.
/home/table1.csv: The path where you want to save the file (here, table1.csv).
Hope this solves your problem!
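That command exports one table's data; if what you actually need is the column names and data types of all 240 tables in a single CSV, a rough sketch along the same lines (the database name db1 and the output path are assumptions, and the exact DESCRIBE output layout can vary between Hive versions):

#!/bin/bash
# Writes one line per column in the form: table,column,data_type
db=db1
out=/home/all_tables_schema.csv
> "$out"
for t in $(hive -e "SHOW TABLES IN $db"); do
    # DESCRIBE prints "col_name  data_type  comment"; keep the first two fields
    # and skip section headers such as "# Partition Information"
    hive -e "DESCRIBE $db.$t" | awk -v t="$t" '$1 !~ /^#/ && NF >= 2 {print t","$1","$2}' >> "$out"
done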
I have imported a csv file into Python and I'm using pandas. I need to output a new csv file containing only some of the data, in a different order, and with blank columns inserted. The new csv file will be used to import the data into another system, and the columns need to line up.
so if the original csv file had the following columns
"date" "department" "name" "title" "employee id"
I need the rows of the csv file to read
"name",,,,,"department",,,,"date",,
I have deleted the columns that I don't need:
del df["title"], df["employee id"]
I wrote a bunch of blank columns:
df["a"] = ""
df["b"] = ""
df["c"] = ""
When I write them to csv in the order I want
df.to_csv('outfile.csv', cols=["name","a","b","c","department","d","e","f","date","g","h"], index=False,header=False)
It comes out
date,department,,,,,,,,,,,name,,
Should I be working with the csv module for this particular type of project? I'm scouring the documentation, but having trouble figuring out how what I'm reading applies to my task.
It'll be easier, in my opinion, to reindex your df; this will put the columns in the order you want and fill the columns that don't exist with NaN values:
df.reindex(columns=["name","a","b","c","department","d","e","f","date","g","h"]).to_csv('outfile.csv', index=False,header=False)
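For example, here is a self-contained sketch on a tiny made-up frame (the data below is invented purely for illustration); the columns that don't exist come out as empty fields in the CSV:

import pandas as pd

# Hypothetical input resembling the question's columns
df = pd.DataFrame({
    "date": ["2020-02-08"],
    "department": ["Sales"],
    "name": ["Joe"],
    "title": ["Rep"],
    "employee id": [123],
})

cols = ["name", "a", "b", "c", "department", "d", "e", "f", "date", "g", "h"]
df.reindex(columns=cols).to_csv("outfile.csv", index=False, header=False)
# outfile.csv now contains: Joe,,,,Sales,,,,2020-02-08,,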