U-SQL extract a column based on its ordinal position - hdfs

I'm experimenting with Azure Data Lake, and trying to consume a bunch of data files.
The files are CSVs. The folder structure looks like:
/jobhistory/(AccountId)/(JobId)/*.csv
In the CSV files, the 6th column is the username.
What I'd like to do is extract out the account id, the job id, and the username (then, in the interests of experimentation, do some aggregates on this data).
Following an online tutorial, I wrote something like this:
DECLARE @file_set_path string = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";

@metadata =
    EXTRACT AccountId int,
            JobId string,
            FileName string,
            UserName string
    FROM @file_set_path
    USING Extractors.Csv();
Now, the problem (I think) I have is that the UserName field is the 6th column in the csv files, but there are no header rows in the files.
How would I assign UserName to the 6th column in the files?
Also, please let me know if I'm totally going down a wrong path here; this is very different from what I'm used to.

The built-in CSV extractor is a positional extractor. That means you have to specify all columns (even those you are not interested in) in the extract schema.
So you would write something like (assuming username is the 6th col and you have 10 cols):
DECLARE @file_set_path string = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";

@metadata =
    EXTRACT AccountId int,
            JobId string,
            FileName string,
            c1 string, c2 string, c3 string, c4 string, c5 string,
            UserName string,
            c7 string, c8 string, c9 string, c10 string
    FROM @file_set_path
    USING Extractors.Csv();

@metadata =
    SELECT AccountId,
           JobId,
           FileName,
           UserName
    FROM @metadata;
Note that the SELECT projection will be pushed into the EXTRACT so it will not do the full column processing for the columns that you do not select.
If you know that the 6th column is the column you are interested in, you could also write a custom extractor to skip the other columns. However the trade-off of running a custom extractor compared to a built-in one may not be worth it.
Also note that you can use the ADL Tooling to create the EXTRACT expression (without the virtual columns), so you do not need to do it by hand:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_Summer/USQL_Release_Notes_2017_Summer.md#adl-tools-for-visualstudio-now-helps-you-generate-the-u-sql-extract-statement
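Finally, since the question mentions running some aggregates over this data, a follow-on step over the projected rowset could look like this (a minimal sketch; the particular aggregate and the output path are just illustrations):
// count jobs per account and user over the projected rowset
@per_user =
    SELECT AccountId,
           UserName,
           COUNT(*) AS JobCount
    FROM @metadata
    GROUP BY AccountId, UserName;

OUTPUT @per_user
TO "/output/jobs_per_user.csv"
USING Outputters.Csv();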


How to replace a column value using PowerQuery editor

I'm relatively new to Power Query. I'm looking for the best way to replace a column's value as below.
The column has date values in a mixed format such as below.
09/16/2022
09/20/2022
09/26/2022
09/30/2022
10-01-2022
10-03-2022
10-05-2022
I'm looking to standardize and make the format generic as below.
09-16-2022
09-20-2022
09-26-2022
09-30-2022
10-01-2022
10-03-2022
10-05-2022
It seems one of the ways to implement this is to use the Advanced Editor and build an M query for the replacement, using functions like Table.TransformColumns and Text.Replace.
I can't figure out the exact code to use, or whether there is a better way.
Looking for suggestions. Thanks.
If you're a beginner, let the UI write this code for you. Highlight the column by clicking the column header, go to Transform on the ribbon and click Replace Values. Replace "-" with "/" and click OK.
Finally right click the column header again, click Change Type and then select Date.
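If you go that route, the two steps the UI generates will look roughly like this in the Advanced Editor (a sketch; the step names and the column name "dates" are placeholders for whatever your query actually uses):
// generated by Replace Values ("-" with "/") followed by Change Type on the dates column
#"Replaced Value" = Table.ReplaceValue(Source, "-", "/", Replacer.ReplaceText, {"dates"}),
#"Changed Type" = Table.TransformColumnTypes(#"Replaced Value", {{"dates", type date}})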
In M code you could use:
Table.TransformColumns(#"Previous Step",{{"Column Name", each Text.Replace(_,"/","-")}})
As an example:
let
    //create sample table
    Source = Table.FromColumns(
        {{"09/16/2022",
          "09/20/2022",
          "09/26/2022",
          "09/30/2022",
          "10-01-2022",
          "10-03-2022",
          "10-05-2022"}},
        type table[dates=text]),
    //replace "/" with "-"
    Normalize = Table.TransformColumns(Source,{{"dates", each Text.Replace(_,"/","-"), type text}})
in
    Normalize
Notes:
Original data is dates as text strings
Final data is also dates as text strings
To convert the strings to dates instead, you could use something like: Table.TransformColumns(Source,{{"dates", each Date.From(_, "en-US"), type date}}), but how the dates then display would follow your Windows regional date settings.
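If you do want real date values but still need the MM-dd-yyyy text regardless of regional settings, one possible follow-on to the Normalize step above (a sketch; the column name and format string are just illustrations):
// parse the text as dates with a fixed culture, then render back to text in a fixed format
Typed = Table.TransformColumns(Normalize, {{"dates", each Date.From(_, "en-US"), type date}}),
Formatted = Table.TransformColumns(Typed, {{"dates", each Date.ToText(_, "MM-dd-yyyy"), type text}})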

How can I correct AWS Glue Crawler/Data Catalog inferring all fields in CSV as strings when they're clearly not?

I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files are all the same, the formatting is all the same, the naming conventions are all the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:
id,date,string,int_values,double_values
"6F87U",2021-03-21,"Text",0,1.1483
"8DU87",2021-03-22,"More text, oh yes",1,2.525
"79LO2",2021-03-23,"Moar, give me moar, text",2,3.485489
When I run a Crawler with everything default, querying with Athena like so:
select * from tb_csv_data
...the results in Athena are thus:
id      | date       | string      | int_values   | double_values
"6F87U" | 2021-03-21 | "Text"      | 0            | 1.1483
"8DU87" | 2021-03-22 | "More text  | oh yes"      | 1
"79LO2" | 2021-03-23 | "Moar       | give me moar | text
The problem at this level seems to be with proper detection (read: ignoring) of commas as delimiters within quotation marks. So I attach a CSV classifier with the following characteristics to the Crawler, run the Crawler again with the classifier attached, and the resulting table properties are as follows:
Input format org.apache.hadoop.mapred.TextInputFormat
Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters
quoteChar "
separatorChar ,
Table properties
sizeKey 4356512114
objectCount 3
UPDATED_BY_CRAWLER crawler-name
CrawlerSchemaSerializerVersion 1.0
recordCount 3145398
averageRecordSize 1384
CrawlerSchemaDeserializerVersion 1.0
compressionType none
columnsOrdered true
areColumnsQuoted true
delimiter ,
typeOfData file
The resulting table with the same simple Athena query as above seems to be correct:
id    | date       | string                   | int_values | double_values
6F87U | 2021-03-21 | Text, yes                | 0          | 1.1483
8DU87 | 2021-03-22 | More text, oh yes        | 1          | 2.525
79LO2 | 2021-03-23 | Moar, give me moar, text | 2          | 3.485489
The expected automatic inference of data types is supposed to be this (let's simplify and presume the date is correct as a string):
Column name   | Data type
id            | string
date          | string
string        | string
int_values    | bigint (or long)
double_values | double
...but instead they're all strings!
Column name   | Data type
id            | string
date          | string
string        | string
int_values    | string
double_values | string
I need this data to be accurately queryable from Athena as it is, where it is, so what can I do without further processing of the raw data? I suppose I could manually adjust the table properties in the Console but is that really correct when I need the entire pipeline to be automated? I also want to avoid having to cast types in queries 80+ times for each field as most of these columns are numeric. What can I do?
Thank you!
The limitation arises from the SerDe you are using. Refer to the note section in this doc, which has the following explanation:
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.
For a date type to be detected, it has to be in UNIX numeric format, such as 1562112000, according to the doc.
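In practice that means you can keep OpenCSVSerDe and declare the real column types yourself, either by editing the schema in the Glue console or by creating the table with DDL; Athena will then parse the string values into those types at query time. A minimal sketch for the sample data (table name, bucket and prefix are placeholders):
CREATE EXTERNAL TABLE tb_csv_data (
  id            string,
  `date`        string,
  `string`      string,
  int_values    bigint,
  double_values double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://example-bucket/path/to/weekly-csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');
Note that a later Crawler run may overwrite hand-edited types unless you configure the Crawler not to update the table schema.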

Athena SQL create table with text data

Below is how the data looks
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena SQL CREATE TABLE statement to get it into the format below:
Flight Number | Date                | Pages Printed | Document Name      | Print Driver
SSSVAD123X    | 2/8/2020 1:04:40 PM | 1             | DownloadAttachment | printermodel
This is new to me; any direction towards a solution will work.
You may be able to use a regex serde to parse your files; it depends on the shape of your data. You only provided a single line, so this assumes that every line in your data files looks the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
  flight_number STRING,
  `date` STRING,
  pages_printed INT,
  document_name STRING,
  print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(.+?)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
) LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
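Once the table exists, each capture group populates the corresponding column in order, so a quick sanity check could be something like this (column names taken from the DDL above):
SELECT flight_number, `date`, pages_printed, document_name, print_driver
FROM flights
LIMIT 10;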
First, make sure your data format uses tab spacing between columns because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per AWS documentation, use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files if your data does not include values enclosed in quotes. You don't need to make it complicated using Regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
As LazySimpleSerDe is the default used by AWS Athena, you don't even need to declare it; see the CREATE TABLE statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
  `Flight Number` STRING,
  `Date` STRING,
  `Pages Printed` INT,
  `Document Name` STRING,
  `Print Driver` STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION
  's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."

Extract text from Nifi attribute

I'm listing out all the keys in an S3 bucket. Below is the flow.
Here, in the keys, as part of the filename attribute (FetchS3Object attributes) I have the complete path of each key, out of which I want to extract the last-but-one directory name.
e.g.
If the complete path of the key is
/buckname/root1/subobject/subsubobject/path1/path2/path3/text.csv
then in the filename attribute I have root1/subobject/subsubobject/path1/path2/path3/text.csv, out of which I want to extract the path2 text.
Any suggestions on extracting this text from the attribute, please?
You should be able to use the getDelimitedField expression language function:
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#getdelimitedfield
mypath = ${filename:getDelimitedField(5, '/')}
getDelimitedField is 1-based, so with '/' as the delimiter, field 5 of root1/subobject/subsubobject/path1/path2/path3/text.csv is path2. You can set this expression as a new attribute in an UpdateAttribute processor.

Regex QueryString Parsing for a specific parameter in BigQuery

So last week I was able to begin to stream my Appengine logs into BigQuery and am now attempting to pull some data out of the log entries into a table.
The data in protoPayload.resource is the page requested with the querystring paramters included.
The contents of protoPayload.resource looks like the following examples:
/service.html?device_ID=123456
/service.html?v=2&device_ID=78ec9b4a56
I am getting close, but when there is another parameter before device_ID, I am not getting it. As you can see, I am not great with regex, but it is the only way I think I can parse the data in the query. To get just the device ID from the first example, I was able to use the following query, which works great. My next challenge is to get the value when a second parameter exists. The device IDs can vary in length from about 10 to 26 characters.
SELECT
  RIGHT(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'),
        length(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'))-10) as Device_ID
FROM logs
What I would like is just the values from the querystring device_ID such as:
123456
78ec9b4a56
Assuming you have just 1 query string per record then you can do this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=(.*)$') as device_id FROM mytable
The part within the parentheses will be captured and returned in the result.
If device_ID isn't guaranteed to be the last parameter in the string, then use something like this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=([^\&]*)') as device_id FROM mytable
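If you want to sanity-check the pattern without touching the logs table, something like this works in BigQuery Standard SQL (the inlined values are just the two samples from the question):
SELECT REGEXP_EXTRACT(resource, r'device_ID=([^&]*)') AS device_id
FROM UNNEST(['/service.html?device_ID=123456',
             '/service.html?v=2&device_ID=78ec9b4a56']) AS resource
-- returns 123456 and 78ec9b4a56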
One approach is to split protoPayload.resource into multiple service entries and then apply the regexp; this way it supports an arbitrary number of device_ID values, i.e.
select regexp_extract(service_entry, r'device_ID=(.*$)') from
  (select split(protoPayload.resource, ' ') service_entry from
    (select
       '/service.html?device_ID=123456 /service.html?v=2&device_ID=78ec9b4a56'
       as protoPayload.resource))