Extract text from NiFi attribute - regex

I'm listing out all the keys in an S3 bucket. Below is the flow.
In the filename attribute (one of the FetchS3Object attributes) I have the complete path of each key, from which I want to extract a specific segment.
e.g.
If the complete path of the key is
/buckname/root1/subobject/subsubobject/path1/path2/path3/text.csv
then in the filename attribute I have root1/subobject/subsubobject/path1/path2/path3/text.csv, out of which I want to extract the path2 text.
Any suggestions on how to extract this text from the attribute?

You should be able to use the getDelimitedField expression language function:
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#getdelimitedfield
mypath = ${filename:getDelimitedField(5, '/')}
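With the example value, field 5 is path2, since getDelimitedField counts fields from 1. As a quick sanity check outside NiFi, here is the equivalent split in Python, using the filename from the question:

# Mirrors getDelimitedField(5, '/'): split on '/' and take the 5th field (1-indexed).
filename = "root1/subobject/subsubobject/path1/path2/path3/text.csv"
print(filename.split("/")[5 - 1])  # -> path2

If the key depth varies, a regex alternative via replaceAll, e.g. ${filename:replaceAll('^.*/([^/]+)/[^/]+/[^/]+$', '$1')}, should pick out the third-from-last segment regardless of depth; treat that expression as an untested sketch.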

Related

Import CSV file in S3 bucket with semicolon-separated fields

I am using AWS Data Pipelines to copy SQL data to a CSV file in AWS S3. Some of the data has a comma between string quotes, e.g.:
{"id":123455,"user": "some,user" .... }
While importing this CSV data into DynamoDB, the comma is taken as the end of the field value, which results in errors because the data no longer matches the mapping schema we have provided.
My solution for this is - while copying the data from SQL to an S3 bucket - to separate the CSV fields with a ; (semicolon). That way, values within the quotes will be treated as one field. The data would then look like this (note the blank space within the quoted string after the comma):
{"id" : 12345; "user": "some, user";....}
My stack looks like this:
- database_to_s3:
    name: data-to-s3
    description: Dumps data to s3.
    dbRef: xxx
    selectQuery: >
      select * FROM USER;
    s3Url: '#{myS3Bucket}/xxxx-xxx/'
    format: csv
Is there any way I can use a delimiter to separate fields with a ; (semicolon)?
Thank you!
Give AWS Glue a try; it lets you marshal your data before inserting it into DynamoDB.
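For illustration, here is a small Python sketch of why the delimiter choice matters: a reader that honors CSV quoting keeps the embedded comma in one field, while a semicolon delimiter sidesteps the issue for consumers that ignore quotes (in-memory strings stand in for the real files):

import csv, io

# A reader that honors CSV quoting keeps the embedded comma inside one field.
row = next(csv.reader(io.StringIO('123455,"some,user"\n')))
print(row)  # -> ['123455', 'some,user']

# Writing with a semicolon delimiter, for consumers that do not honor quotes:
buf = io.StringIO()
csv.writer(buf, delimiter=";").writerow(["123455", "some, user"])
print(buf.getvalue())  # -> 123455;some, user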

How to filter GCP Data Catalog entries with a specific tag value via console?

I ran a DLP job saving results to Data Catalog and would like to filter the entries in Data Catalog where the standard tag template (Data Loss Prevention Tags) has a value Contains DLP findings: true. I know how to do it with the API. But is there a way to filter out by tag values via Console UI?
Yes, there is a way to filter by tag values in the Data Catalog console using the search box. The following syntax can be used:
tag:<project_name>.tag_template_name.key:value
For your requirement, the query below should work:
tag:<project_name>.data_loss_prevention.has_findings=true
The above query can be broken down as follows:
<project_name> (optional)
data_loss_prevention (the tag_template_name of "Data Loss Prevention Tags")
has_findings (the ID of "Contains DLP findings"). Every display name has an ID; use the ID shown on the "View tag template" page, not the display name, as the key in the search.
= (the operator; choose it based on the key's data type):
string: ":" Note: the colon in a string search denotes an exact token match, not a substring match.
boolean and enum: "="
double: "=", "<", ">", "<=", ">="
timestamp: ":", "=", "<", ">", "<=", ">="
true (the value of "Contains DLP findings")
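For completeness, the same filter can also be issued through the search API; here is a minimal sketch with the google-cloud-datacatalog Python client (the project ID is a placeholder):

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")  # placeholder project ID

# Same query string as in the console search box.
results = client.search_catalog(
    scope=scope, query="tag:data_loss_prevention.has_findings=true"
)
for result in results:
    print(result.relative_resource_name)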
tag:data_loss_prevention.has_findings
That'll work.

Athena SQL create table with text data

Below is how the data looks
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena CREATE TABLE statement that yields the format below:
Flight Number | Date | Pages Printed | Document Name | Print Driver
SSSVAD123X | 2/8/2020 1:04:40 PM | 1 | DownloadAttachment | printermodel
This is new to me; any direction toward a solution will help.
You may be able to use a regex serde to parse your files. It depends on the shape of your data; you only provide a single line, so this assumes that every line in your data files looks the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
  flight_number STRING,
  `date` STRING,
  pages_printed INT,
  document_name STRING,
  print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(.+?)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
)
LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
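One way to reduce that risk is to test the pattern against the sample line locally before creating the table, for example in Python (same regex, with the SQL string escaping collapsed):

import re

pattern = (r"^Flight Number:\s+(\S+)\s+Date:\s+(.+?)\s+Page\[s\] Printed:\s+(\S+)"
           r"\s+Document Name:\s+(\S+)\s+Print Driver:\s+(\S+)\s+\(printer driver\)$")
line = ("Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 "
        "Document Name: DownloadAttachment Print Driver: printermodel (printer driver)")
print(re.match(pattern, line).groups())
# -> ('SSSVAD123X', '2/8/2020 1:04:40 PM', '1', 'DownloadAttachment', 'printermodel')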
First, make sure your data format uses tab spacing between columns because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per the AWS documentation, use LazySimpleSerDe for CSV, TSV, and custom-delimited files if your data does not include values enclosed in quotes; there's no need to complicate this with a regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
Since LazySimpleSerDe is the default SerDe in Athena, you don't even need to declare it; see the CREATE TABLE statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
  `Flight Number` STRING,
  `Date` STRING,
  `Pages Printed` INT,
  `Document Name` STRING,
  `Print Driver` STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION
  's3://awsexamplebucket1-logs/AWSLogs/'
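If the raw lines still look like the labeled sample from the question rather than being tab-separated, a one-off conversion to TSV could be run before pointing Athena at the bucket. A hedged Python sketch (the file names are placeholders; the pattern reuses the regex from the answer above):

import re

pattern = re.compile(
    r"^Flight Number:\s+(\S+)\s+Date:\s+(.+?)\s+Page\[s\] Printed:\s+(\S+)"
    r"\s+Document Name:\s+(\S+)\s+Print Driver:\s+(\S+)\s+\(printer driver\)$")

# Rewrite each labeled line as one tab-separated row (placeholder file names).
with open("raw_log.txt") as src, open("flights.tsv", "w") as dst:
    for line in src:
        m = pattern.match(line.strip())
        if m:
            dst.write("\t".join(m.groups()) + "\n")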
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."

U-SQL extract a column based on its ordinal position

I'm experimenting with Azure Data Lake, and trying to consume a bunch of data files.
The files are CSVs. The folder structure looks like:
/jobhistory/(AccountId)/(JobId)/*.csv
In the CSV files, the 6th column is the username.
What I'd like to do is extract out the account id, the job id, and the username (then, in the interests of experimentation, do some aggregates on this data).
Following an online tutorial, I wrote something like this:
DECLARE @file_set_path string = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";

@metadata =
    EXTRACT AccountId int,
            JobId string,
            FileName string,
            UserName string
    FROM @file_set_path
    USING Extractors.Csv();
Now, the problem (I think) I have is that the UserName field is the 6th column in the csv files, but there are no header rows in the files.
How would I assign UserName to the 6th column in the files?
Also, please let me know if I'm totally going down a wrong path here; this is very different from what I'm used to.
The built-in CSV extractor is positional. That means you have to specify all columns (even those you are not interested in) in the EXTRACT schema.
So you would write something like this (assuming UserName is the 6th of nine columns):
DECLARE @file_set_path string = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";

@metadata =
    EXTRACT AccountId int,
            JobId string,
            FileName string,
            c1 string, c2 string, c3 string, c4 string, c5 string,
            UserName string,
            c7 string, c8 string, c9 string
    FROM @file_set_path
    USING Extractors.Csv();

@result =
    SELECT AccountId,
           JobId,
           FileName,
           UserName
    FROM @metadata;
Note that the SELECT projection will be pushed into the EXTRACT, so the full column processing is not done for the columns that you do not select.
If you know that the 6th column is the one you are interested in, you could also write a custom extractor that skips the other columns (the positional idea is sketched in Python below). However, the trade-off of running a custom extractor compared to a built-in one may not be worth it.
(Also note that you can use the ADL tooling to generate the EXTRACT expression (without the virtual columns), so you do not need to write it by hand:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_Summer/USQL_Release_Notes_2017_Summer.md#adl-tools-for-visualstudio-now-helps-you-generate-the-u-sql-extract-statement)
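For intuition, the "skip the columns you don't need" idea looks like this in plain Python, with a hypothetical local file standing in for the job-history CSVs:

import csv

with open("jobhistory.csv", newline="") as f:  # hypothetical local file
    for row in csv.reader(f):
        user_name = row[5]  # the 6th column, 0-indexed, per the question
        print(user_name)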

Filter out RSS items with description less than 3 characters using Yahoo Pipes

I am trying to filter out items that have an empty description or a description shorter than 3 characters using this Yahoo Pipe:
http://pipes.yahoo.com/pipes/pipe.edit?_id=966d5a5006cad6b2825d4f744b1ebb50#eefd469cf1c28d4d6cb6bd6c6c1ab6b8
Here is the workflow of the Pipe:
"Fetch Feed" module - fetch the feeds
"Create RSS" module - create new Feed and use item description as item title for new feed
"Regex" module – remove html tags from title
"Filter" module – I want to block items that have either empty descriptions or descriptions shorter than 3 characters, I am not sure what to put there – "null", "*"…?
Try changing your filter to permit items that match rather than block, and change your regex to the following:
^.{3,}$
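To see what that pattern permits, here is a quick Python check of the same regex (Pipes' regex engine may differ slightly; the DOTALL flag lets '.' span newlines in multi-line descriptions):

import re

pattern = re.compile(r"^.{3,}$", re.DOTALL)  # permit descriptions of 3+ characters
for description in ["", "ab", "abc", "a longer description"]:
    print(repr(description), bool(pattern.match(description)))
# -> '' False, 'ab' False, 'abc' True, 'a longer description' True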