Regex QueryString Parsing for a specific in BigQuery - regex

So last week I was able to begin to stream my Appengine logs into BigQuery and am now attempting to pull some data out of the log entries into a table.
The data in protoPayload.resource is the page requested with the querystring paramters included.
The contents of protoPayload.resource looks like the following examples:
/service.html?device_ID=123456
/service.html?v=2&device_ID=78ec9b4a56
I am getting close, but when there is another entry before device_ID, I am not getting it. As you can see I am not great with Regex, but it is the only way I think I can parse the data in the query. To get just the device ID from the first example, I was able to use the following example. Works great. My next challenge is to the data when the second parameter exists. The device IDs can vary in length from about 10 to 26 characters.
SELECT
RIGHT(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'),
length(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'))-10) as Device_ID
FROM logs
What I would like is just the values from the querystring device_ID such as:
123456
78ec9b4a56

Assuming you have just 1 query string per record then you can do this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=(.*)$') as device_id FROM mytable
The part within the parentheses will be captured and returned in the result.
If device_ID isn't guaranteed to be the last parameter in the string, then use something like this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=([^\&]*)') as device_id FROM mytable

One approach is to split protoPayload.resource into multiple service entries, and then apply regexp - this way it will support arbitrary number of device_id, i.e.
select regexp_extract(service_entry, r'device_ID=(.*$)') from
(select split(protoPayload.resource, ' ') service_entry from
(select
'/service.html?device_ID=123456 /service.html?v=2&device_ID=78ec9b4a56'
as protoPayload.resource))

Related

Column does not exist AWS Timestream Query error

I am trying to apply WHERE clause on DIMENSION of the AWS Timestream records. However, I got the error: Column does not exist
Here is my table schema:
The table schema
The table measure
First, I will show all the sample data I put in the table
SELECT username, time, manual_usage
FROM "meter-reading"."meter-metrics"
ORDER BY time DESC
LIMIT 4
The result:
Result
What I wanted to do is to query and filter the records by the Dimension ("username" specifically).
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE measure_name = "OnceADay"
ORDER BY time DESC LIMIT 10
Then I got the Error: Column 'OnceADay' does not exist
I tried to search for any quotas for Dimensions name and check for error in my schema:
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.naming
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.system_identifier
But I didn't find that my "username" for the dimension violate any of the above rules.
I checked for some other queries by AWS Blog, the author used the WHERE clause for the Dimension filter normally:
https://aws.amazon.com/blogs/database/effective-queries-for-common-query-patterns-in-amazon-timestream/
I figured it out after I tried with the sample code. Turn out it was a silly mistake I believe.
Using apostrophe (') instead of single quotation marks ("") solved my problem.
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = 'OnceADay'
ORDER BY time DESC LIMIT 10

Athena SQL create table with text data

Below is how the data looks
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena SQL create table with in below format
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
this is new to me, any direction towards solution will work
You may be able to use a regex serde to parse your files. It depends on the shape of your data. You only provide a single line so this assumes that every line in your data files look the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
flight_number STRING,
`date` STRING,
pages_printed INT,
document_name STRING,
print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(\\S+)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
) LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
First, make sure your data format uses tab spacing between columns because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per AWS documentation, use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files if your data does not include values enclosed in quotes. You don't need to make it complicated using Regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
As LazySimpleSerDe is the default used by AWS Athena, you don't even need to declare it, see the create table statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
`Flight Number` STRING,
`Date` STRING,
`Pages Printed` INT,
`Document Name` STRING,
`Print Driver` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."

Get parameters from Rest Http url for get method using streamsets microservice pipeline

I have created a microservice pipeline in streamsets. Upon making a get callout, i have to retrieve data from mysql depending on the parameters sent in the http get url using expression evaluator?
My url is supposed to be like this: http://my.url.com:0191?param1=xyz&param2=abc
I have to retrieve data based on param1 value and param2 value.
Also, how do I handle cases when the params send will be null?
The URL parameters appear in the queryString record header attribute; you can parse them out in an Expression Evaluator with the Field Expression:
${str:splitKV(record:attribute('queryString'), '&', '=')}
If you set the Output Field to /, then your parameters will now be in the fields /param1 and /param2. You can use these in a MySQL query in the JDBC Lookup processor like this:
-- Assuming col1 is an integer (doesn't need quotes) and col2 is a string (needs quotes)
SELECT * FROM tablename
WHERE col1 = ${record:value('/param1') AND col2 = '${record:value('/param2')'
You can handle nulls using the record:attributeOrDefault() function to set a default, or by using a Stream Selector to send the record along a different path.

Athena Schema creation when log format has missing fields

I have a custom log format where the log entries vary by the request type. So certain rows have more fields.
Can we specify certain fields as optional so that in rows that they are missing, the values will be set to certain default (null, 0)?
Here are some hypothetical log entries:
{"data":"[2017-09-10 10:44:54.448998 -0000] info ip=773.555.557.445 cluster=\"production\" query=uris type=TXT class=IN rcode=NXDOMAIN cnt=0 offset=74","header":{"recvtime":"2017-09-10 10:45:02","server":"m0107481","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=991.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=NOERROR cnt=1 offset=90 score=400","header":{"recvtime":"2017-09-10 10:45:02","server":"m010748","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=971.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=REFUSED cnt=1","header":{"recvtime":"2017-09-10 10:45:02","server":"m010574","refid":"ABC-123"}}
Note that each row of the log data is in json format, and the header part is fixed. If query in data is dnsbl, then sometimes the row has a score field, but other times it is missing. And I am planning to use Athena to parse this type of data from S3 and query for some stats in the line of: what % of data are dns queries and what % have score above 300.
It looks like your data is JSON with embedded structured logging in the data field. As long as the data is well formed JSON with one object per line you should be able to create a JSON table and then use functions to extract the other pieces out of the data field. You can create a view that does the extraction so that you don't have to do that in every query.
I'm thinking something like this:
CREATE EXTERNAL TABLE raw_log_entries (
data string,
header struct<recvtime: string, server: string, refid: string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://some-bucket/and/path/';
CREATE VIEW log_entries AS
SELECT
header.recvtime,
header.server,
header.refid,
regexp_extract(data, 'query=(\S+)', 1) AS query,
regexp_extract(data, 'type=(\S+)', 1) AS type,
regexp_extract(data, 'score=(\S+)', 1) AS score,
-- and so on
FROM raw_log_entries
You'll have to experiment with the regexes, since I don't have your data I can't be sure if they will work for all cases, but I hope you get the idea.

How to tweak LISTAGG to support more than 4000 character in select query?

Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production.
I have a table in the below format.
Name Department
Johny Dep1
Jacky Dep2
Ramu Dep1
I need an output in the below format.
Dep1 - Johny,Ramu
Dep2 - Jacky
I have tried the 'LISTAGG' function, but there is a hard limit of 4000 characters. Since my db table is huge, this cannot be used in the app. The other option is to use the
SELECT CAST(COLLECT(Name)
But my framework allows me to execute only select queries and no PL/SQL scripts.Hence i dont find any way to create a type using "CREATE TYPE" command which is required for the COLLECT command.
Is there any alternate way to achieve the above result using select query ?
You should add GetClobVal and also need to rtrim as it will return delimiter in the end of the results.
SELECT RTRIM(XMLAGG(XMLELEMENT(E,colname,',').EXTRACT('//text()')
ORDER BY colname).GetClobVal(),',') from tablename;
if you cant create types (you can't just use sql*plus to create on as a one off?), but you're OK with COLLECT, then use a built-in array. There's several knocking around in the RDBMS. run this query:
select owner, type_name, coll_type, elem_type_name, upper_bound, length
from all_coll_types
where elem_type_name = 'VARCHAR2';
e.g. on my db, I can use sys.DBMSOUTPUT_LINESARRAY which is a varray of considerable size.
select department,
cast(collect(name) as sys.DBMSOUTPUT_LINESARRAY)
from emp
group by department;
A derivative of #anuu_online but handle unescaping the XML in the result.
dbms_xmlgen.convert(xmlagg(xmlelement(E, name||',')).extract('//text()').getclobval(),1)
For IBM DB2, Casting the result to a varchar(10000) will give more than 4000.
select column1, listagg(CAST(column2 AS VARCHAR(10000)), x'0A') AS "Concat column"...
I end up in another approach using the XMLAGG function which doesn't have the hard limit of 4000.
select department,
XMLAGG(XMLELEMENT(E,name||',')).EXTRACT('//text()')
from emp
group by department;
You can use:
SELECT department
, REGEXP_REPLACE(XMLCAST(XMLAGG(XMLELEMENT(x, name, ',')) AS CLOB), ',$')
FROM emp
GROUP BY department
it will return CLOB that has no size limit, handles correctly XML entity escapes and separators.
Instead of REGEXP_REPLACE(..., ',$')) you can use RTRIM(..., ','), which should be faster, but will remove all separators from the end of the result (including those that can appear in name at the end, or previous ones if last names are empty).