Regular expression for AQL - regex

2011-12-01T00:43:51.251871+05:18 Dec 01 2011 00:41:32 KOC-TEJ-AMEX-ASA-5510-6 : %ASA-4-106023: Deny icmp src TCS:172.26.40.1 dst AMEX:172.26.40.187 (type 5, code 0) by access-group "TCS_access_in" [0x953d065b, 0x0]
Need to extract 2011-12-01T00:43:51.251871+05:18
My code:
-- First attempt, commented out so the view name is not defined twice:
-- create view standardLogTime as
-- extract regex /(\d{4}\-\d{2}\-\d+\w+\:\d{2}\:\d+\.\d+\+\d+\:\d+)/ on D.text as testValue
-- from Document D;
-- Extracting standard log generation time.
create view standardLogTime as
extract regex /\d{4}(-\d{2}){2}T(\d{2}:){2}\d{2}\.\d+?\+\d{2}:\d{2}/ on D.text as testValue
from Document D;
output view standardLogTime;
-- Extracting incoming request Date.
create view dateView as
extract regex /(\s+\w+\s\d+\s\d{4})/ on Date.text as testDate from Document Date;
--output view dateView;
-- Extracting incoming request Time.
create view timeView as
extract regex /\s+(\d{1,2}\:\d{1,2}\:\d{1,2})/ on Time.text
as requestTime from Document Time;
--output view timeView;
-- Extracting the firewall device name.
create view deviceName as
extract regex /(\w+\-\w+\-\w+\-\w+\-\d+\-\d+)/ on Device.text
as deviceName from Document Device;
--output view deviceName;
create view combinedView as
extract pattern (<S.testValue>) (<D.testDate>) (<T.requestTime>) (<Div.deviceName>)
return group 1 as logTime and
       group 2 as date and
       group 3 as time and
       group 4 as deviceName
from standardLogTime S, dateView D, timeView T, deviceName Div;
output view combinedView;

I don't know what language that is, but in Python I would do
date = line.split()[0]
or, if I were forced to use an RE, it'd be
^(\S+)\s

\d{4}(-\d{2}){2}T(\d{2}:){2}\d{2}\.\d+?\+\d{2}:\d{2}
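To sanity-check both suggestions against the sample line, here is a minimal Python sketch (the log line is truncated for brevity):

import re

# Sample line from the question, truncated.
line = ('2011-12-01T00:43:51.251871+05:18 Dec 01 2011 00:41:32 '
        'KOC-TEJ-AMEX-ASA-5510-6 : %ASA-4-106023: Deny icmp ...')

# Approach 1: the timestamp is simply the first whitespace-delimited token.
print(line.split()[0])  # 2011-12-01T00:43:51.251871+05:18

# Approach 2: match the timestamp explicitly with the regex above.
m = re.search(r'\d{4}(-\d{2}){2}T(\d{2}:){2}\d{2}\.\d+?\+\d{2}:\d{2}', line)
if m:
    print(m.group(0))   # 2011-12-01T00:43:51.251871+05:18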


Telegraf: How to extract from field using regex processor?

I would like to extract the values for connections, upstream and downstream using the Telegraf regex processor plugin from this input:
2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. Traffic Relayed ↑ 60 MB, ↓ 4 MB.
Using this configuration, the result key "upstream" ends up as a copy of the initial message with only the matched part replaced, instead of just the extracted value.
[[processors.regex]]
  tagpass = ["snowflake-proxy"]

  [[processors.regex.fields]]
    ## Field to change
    key = "message"
    ## All the power of the Go regular expressions available here
    ## For example, named subgroups
    pattern = 'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),'
    replacement = "${UPSTREAM}"
    ## If result_key is present, a new field will be created
    ## instead of changing existing field
    result_key = "upstream"
Current output:
2022/11/17 10:38:48 In the last 1h0m0s, there were 1 connections. Traffic 3 MB ↓ 5 MB.
How do I get the decimals?
I'm quite confused about how to use the regex here, because several examples on the web suggest it should work like this. See for example: http://wiki.webperfect.ch/index.php?title=Telegraf:_Processor_Plugins
The replacement option specifies what to substitute in for any matches.
I think you want something closer to this:
[[processors.regex.fields]]
  key = "message"
  pattern = '.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$'
  replacement = "${1}"
  result_key = "upstream"
to get:
upstream="60 MB"
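Telegraf compiles Go regexp syntax, which is close enough to Python's re module that the difference between the two patterns can be sketched there (the message is the sample from the question):

import re

message = ('2022/11/16 22:38:48 In the last 1h0m0s, there were 10 connections. '
           'Traffic Relayed ↑ 60 MB, ↓ 4 MB.')

# Unanchored: only the matched span is replaced, so the rest of the
# message survives around the substitution (the behaviour the asker saw).
print(re.sub(r'Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),', r'\g<UPSTREAM>', message))

# Anchored with .* ... .*$: the match covers the whole line, so the
# result is just the captured value, as in the answer above.
print(re.sub(r'.*Relayed.{3}(?P<UPSTREAM>\d{1,4}\W.B),.*$', r'\g<UPSTREAM>', message))
# -> 60 MB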

How to extract date from String column using regex in Spark

I have a dataframe which consists of file name, email and other details, and I need to extract the dates from one of the columns, the file name.
Ex: File name: Test_04_21_2019_34600.csv
Need to extract the date: 04_21_2019
Dataframe:
val df1 = Seq(
  ("Test_04_21_2018_1200.csv", "abc#gmail.com", 200),
  ("home/server2_04_15_2020_34610.csv", "abc1#gmail.com", 300),
  ("/server1/Test3_01_2_2019_54680.csv", "abc2#gmail.com", 800)
).toDF("file_name", "email", "points")
Expected output:
date         email            points
04_21_2018   abc#gmail.com    200
04_15_2020   abc1#gmail.com   300
01_2_2019    abc2#gmail.com   800
Can we use regex on a Spark dataframe to achieve this, or is there another way? Any help will be appreciated.
You can use the regexp_extract function to extract the date as below:
import org.apache.spark.sql.functions.regexp_extract

val resultDF = df1.withColumn("date",
  regexp_extract($"file_name", "\\d{1,2}_\\d{1,2}_\\d{4}", 0)
)
Output:
+--------------------+--------------+------+----------+
| file_name| email|points| date|
+--------------------+--------------+------+----------+
|Test_04_21_2018_1...| abc#gmail.com| 200|04_21_2018|
|home/server2_04_1...|abc1#gmail.com| 300|04_15_2020|
|/server1/Test3_01...|abc2#gmail.com| 800| 01_2_2019|
+--------------------+--------------+------+----------+
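If you are on PySpark rather than Scala, the same regexp_extract function is available there; a minimal equivalent sketch (recreating the question's dataframe so it runs standalone):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

# Recreate the question's dataframe.
df1 = spark.createDataFrame(
    [("Test_04_21_2018_1200.csv", "abc#gmail.com", 200),
     ("home/server2_04_15_2020_34610.csv", "abc1#gmail.com", 300),
     ("/server1/Test3_01_2_2019_54680.csv", "abc2#gmail.com", 800)],
    ["file_name", "email", "points"])

# Extract the first d{1,2}_d{1,2}_d{4} token from the file name.
result_df = df1.withColumn(
    "date", regexp_extract("file_name", r"\d{1,2}_\d{1,2}_\d{4}", 0))
result_df.show(truncate=False)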

Fetching name and age from a text file

I have a .txt file from which I have to fetch name and age.
The .txt file has data in the format like:
Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old.
Output 1: John, Sam, Kenner
Output 2: 47, 29, 36
I am using regular expressions to extract the data. For example, for age, I am using the below regular expressions:
re.compile(r'age:\s*\d{1,3}',re.I)
re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)
re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)
re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)
I will apply another regular expression to the output of these regular expressions to extract the numbers. The problem is that with these regular expressions I am also getting data which I do not want. For example:
This medical record is 10 years old.
I am getting '10' from the above sentence, which I do not want.
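A minimal reproduction of that false positive, using the fourth pattern from the list above:

import re

text = "This medical record is 10 years old."
pattern = re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*', re.I)
print(pattern.findall(text))  # ['This medical record is 10 years '] -- an unwanted hit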
I only want to extract the names of people and their ages. What should my approach be? I would appreciate any kind of help.
Please take a look at the Cloud Data Loss Prevention API. Here is a GitHub repo with examples. This is what you'll likely want.
def inspect_string(project, content_string, info_types,
                   min_likelihood=None, max_findings=None, include_quote=True):
    """Uses the Data Loss Prevention API to analyze strings for protected data.

    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: A string representing the minimum likelihood threshold
            that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
            'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
        max_findings: The maximum number of findings to report; 0 = no maximum.
        include_quote: Boolean for whether to display a quote of the detected
            information in the results.

    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library.
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp.DlpServiceClient()

    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{'name': info_type} for info_type in info_types]

    # Construct the configuration dictionary. Keys which are None may
    # optionally be omitted entirely.
    inspect_config = {
        'info_types': info_types,
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
        'limits': {'max_findings_per_request': max_findings},
    }

    # Construct the `item`.
    item = {'value': content_string}

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print('Quote: {}'.format(finding.quote))
            except AttributeError:
                pass
            print('Info type: {}'.format(finding.info_type.name))
            print('Likelihood: {}'.format(finding.likelihood))
    else:
        print('No findings.')
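For example, a hypothetical call over one of the sample sentences (the project id is illustrative; 'PERSON_NAME' and 'AGE' are built-in DLP info types):

# 'my-gcp-project' is a placeholder; substitute your own project id.
inspect_string('my-gcp-project', 'John is 47 years old.',
               ['PERSON_NAME', 'AGE'])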

How to get this output from Pig Latin in MapReduce

I want to get the following output from Pig Latin / Hadoop
((39,50,60,42,15,Bachelor,Male),5)
((40,35,HS-grad,Male),2)
((39,45,15,30,12,7,HS-grad,Female),6)
from the following data sample
[image in original post: sample rows of the adult data set]
I have written the following Pig Latin script:
sensitive = LOAD '/mdsba/sample2.csv' USING PigStorage(',') AS (AGE, EDU, SEX, SALARY);
BV = GROUP sensitive BY (EDU, SEX);
BVA = FOREACH BV GENERATE group AS EDU, COUNT(sensitive) AS dd:long;
DUMP BVA;
Unfortunately, the results come out like this:
((Bachelor,Male),5)
((HS-grad,Male),2)
Then try to project the AGE data too.
Something like this:
BVA = FOREACH BV GENERATE
        sensitive.AGE AS AGE,
        FLATTEN(group) AS (EDU, SEX),
        COUNT(sensitive) AS dd:long;
Another suggestion is to specify the data types when you load the data:
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE:int,EDU:chararray,SEX:chararray,SALARY:chararray);

How do you find anomalies in format of a column in a spark dataframe?

As the question says, I want to find anomalies in the format of the value in a column in a large dataset.
For example: if I have a date column within a dataset of say 500 million rows, I want to make sure that the date format for all rows in the column is MM-DD-YYYY. I want to find the count and the values where there is an anomaly in this format.
How do I do this? Can I use regex? Can someone give an example? I want to do this using a Spark DataFrame.
Proper date format validation using regex can be tricky (See: Regex to validate date format dd/mm/yyyy), but you can use Joda-Time as below:
import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf

object FormatChecker extends java.io.Serializable {
  val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"

  def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
    case Failure(_) => true
    case _ => false
  }
}

val df = sc.parallelize(Seq(
  "01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001"
)).toDF("date")

val invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()
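For comparison, here is a minimal PySpark sketch of the same idea, with Python's datetime.strptime standing in for Joda-Time (the helper names are illustrative):

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

def invalid_format(s):
    # True when the value does not parse strictly as MM-dd-yyyy.
    try:
        datetime.strptime(s, "%m-%d-%Y")
        return False
    except (ValueError, TypeError):
        return True

invalid_format_udf = udf(invalid_format, BooleanType())

df = spark.createDataFrame(
    [("01-02-2015",), ("99-03-2010",), ("---",), ("2015-01-01",), ("03-30-2001",)],
    ["date"])

# Count and inspect the anomalous rows.
bad = df.where(invalid_format_udf(col("date")))
print(bad.count())  # expect 3: "99-03-2010", "---" and "2015-01-01"
bad.show()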