NiFi: ReplaceTextWithMapping processor - regex

I have the following insert statements:
insert into temp1 values (test1, test2)
insert into temp2 values (test3)
Expected results:
insert into temp1 values (100, 200)
insert into temp2 values (300)
Essentially, I want to replace the literals test1 and test2 in the first query with the values 100 and 200 respectively, and test3 in the second query with the value 300. Can someone help with the mapping file for the above use case?
I tried with the following, but it doesn't have any effect.
Search Value (RegEx) Replacement values
(1)(.*values.*)(.*test1)(.*,)(.*test2) -> $2 val1 $4 val2
(2)(.*values.*)(.*test1) -> $2 val3

If this is literally the extent of the mapping you need to perform, a regular ReplaceText processor is enough. Configured with a regex that detects every instance of test followed by a single digit and replaces it with that digit and 00, it produces the desired output.
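The processor settings themselves are not reproduced here, so as a rough illustration only, the same search-and-replace idea can be tried in Python. The pattern `test(\d)` and the replacement are assumptions inferred from the description above, and Python writes the group reference as `\g<1>`, which is not the syntax NiFi uses.

```python
import re

statements = (
    "insert into temp1 values (test1, test2)\n"
    "insert into temp2 values (test3)"
)

# "test" followed by one digit becomes that digit followed by "00".
replaced = re.sub(r"test(\d)", r"\g<1>00", statements)
print(replaced)
```

This only works because every replacement value happens to be the digit plus 00; anything less regular needs a real lookup.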
If you need to use ReplaceTextWithMapping for more complex lookups, the mapping file must be of the format:
search_value_1 replacement_value_1
search_value_2 replacement_value_2
etc.
The delimiter between the search and replacement values is \t.
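To make the file format concrete, here is a minimal Python sketch of a tab-delimited mapping for the question's three literals, followed by a plain dictionary lookup that mimics the idea. The lookup loop is purely illustrative and is not how ReplaceTextWithMapping is implemented; the variable names are made up for this example.

```python
import re

# One "search<TAB>replacement" pair per line, as described above.
mapping_text = "test1\t100\ntest2\t200\ntest3\t300\n"

# Parse the mapping into a dictionary.
mapping = {}
for line in mapping_text.splitlines():
    search, replacement = line.split("\t", 1)
    mapping[search] = replacement

statements = (
    "insert into temp1 values (test1, test2)\n"
    "insert into temp2 values (test3)"
)

# Replace every occurrence of a known search value with its mapped value.
pattern = re.compile("|".join(re.escape(key) for key in mapping))
result = pattern.sub(lambda m: mapping[m.group(0)], statements)
print(result)
```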
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Wed Dec 07 10:48:24 PST 2016'
Key: 'lineageStartDate'
Value: 'Wed Dec 07 10:48:24 PST 2016'
Key: 'fileSize'
Value: '66'
FlowFile Attribute Map Content
Key: 'filename'
Value: '56196144045589'
Key: 'path'
Value: './'
Key: 'uuid'
Value: 'f6b28eb0-73b5-4d94-86c2-b7a5d4cc991e'
--------------------------------------------------
insert into temp1 values (100, 200)
insert into temp2 values (300)


How to extract where the reg expression doesn't match in data frame column?

I have two dataframes:
OrderedDict([('page1', name dob
0 John 07-20200
1 Lilly 05-1999
2 James 02-2002), ('page2', name dob
0 Chris 07-2020
1 Robert 05-1999
2 barb 02-20022)])
I want to run my regular expression against each date in both dataframes. If every date matches, I want to continue with my program; if one does not, I want to print a message that shows the dataframe name, the index, and the invalid date, like this:
INVALID DATE: page1: index 0: dob: 07-20200
INVALID DATE: page2: index 2: dob: 02-20022
I got to this point:
date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'
for df_name, df in employee_dict.items():
    x = df[df.dob.str.contains(date_pattern, regex=True)]
    print(x)
That prints the rows that do match in a table format, but I want to print the rows that don't match, each in its own print statement.
Any ideas?
You may iterate over all the rows of the dataframes and if the entry does not match your pattern, you may generate the message of your choice:
import re

for df_name, df in employee_dict.items():  # Iterate over your DFs
    for index, row in df.iterrows():  # Iterate over DF rows
        if not re.search(date_pattern, row['dob']):  # If the dob column value has no match
            print("INVALID DATE: {}: index {}: dob: {}".format(df_name, index, row['dob']))  # Print error message
If your df is pd.DataFrame({'dob': ['05-2020','4-2020','07-1999','2-2001','1-20202020','112-2020']}), the results will be
INVALID DATE: page1: index 4: dob: 1-20202020
INVALID DATE: page1: index 5: dob: 112-2020
You're looking for Series.str.match.
Essentially, you need to extract the dob series, which I assume is what you're doing with df['dob'], and do result = df['dob'].str.match(date_pattern). The result will be a series of True and False values, corresponding to their respective df['dob'] values.
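One detail worth checking before switching methods: `Series.str.match` only looks for a match at the start of each string (like `re.match`), whereas `str.contains` searches anywhere (like `re.search`). The sketch below uses only the standard `re` module to show how the question's pattern behaves under each mode; the sample values are made up for illustration.

```python
import re

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'

# Hypothetical sample values, chosen to expose the differences.
samples = ["07-2020", "07-20200", "dob 07-2020", "07-2020 (approx)"]

results = {}
for value in samples:
    results[value] = (
        bool(re.search(date_pattern, value)),     # match anywhere in the string
        bool(re.match(date_pattern, value)),      # match at the start only
        bool(re.fullmatch(date_pattern, value)),  # the whole string must match
    )
    print(value, results[value])
```

If the column should contain nothing but the date, the `fullmatch` behaviour is probably the one you want. In pandas, the boolean series returned by `str.match` can be inverted with `~` to select the non-matching rows, for example `df[~df['dob'].str.match(date_pattern)]`.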

AWS Glue pushdown predicate not working properly

I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
    Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior". Is something wrong?
I wonder if the problem is that BETWEEN or IN is not supported inside the predicate.
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One issue I can see in the code is that you are using "today" instead of "end" in the BETWEEN clause. I don't see the today variable declared anywhere in your code, so I am assuming it has been initialized with today's date.
In that case the range is wider than you intended, and the partitions being read by Glue are correct for it.
To push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of the table definition.
A condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting.
Let me know if it helps.
Pushdown predicates in a Glue DynamicFrame work fine with BETWEEN as well as IN clauses, as long as the partition columns appear in the correct sequence in the table definition and in the query.
I have a table with three levels of partitions:
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read data in dynamicFrame
ds = glueContext.create_dynamic_frame.from_catalog(
    database="abc", table_name="pqr", transformation_ctx="flights",
    push_down_predicate="(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is specified correctly. Also note that each value in the IN clause must be wrapped in single quotes, as in IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database="xxx", table_name="flights", transformation_ctx="flights",
    push_down_predicate="""querydestinationplace IN ('%s') AND
    querydatetime BETWEEN '%s' AND '%s'
    """ % (
        # Join with "','" so that every value ends up in its own pair of quotes
        "','".join(map(str, arr)),
        start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
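Because the predicate is just a string, it can be built and inspected without a Glue context. The sketch below is a minimal example under a few assumptions: `build_predicate` is a hypothetical helper, the column names come from the question's table, 12191 appears in the question's logs, and 12345 is a made-up second value. It uses `end` rather than `today`, as the first answer suggests.

```python
from datetime import date


def build_predicate(places, start, end):
    """Return a partition predicate string with individually quoted IN values."""
    quoted = ", ".join("'%s'" % place for place in places)
    return (
        "querydestinationplace IN (%s) AND querydatetime BETWEEN '%s' AND '%s'"
        % (quoted, start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d"))
    )


predicate = build_predicate([12191, 12345], date(2019, 2, 13), date(2019, 2, 27))
print(predicate)
```

Printing the string before handing it to `push_down_predicate` makes mistakes such as an unquoted value or the wrong end date easy to spot.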
Try building the start and end dates like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Extract value from a string occurring after a particular word

A json script is passed as a string and I need to extract the numeric value after the content_id for further mapping. Sample Data below:
{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}
The parameters are dynamic so I can't extract using a substr function or count to extract after certain number of occurrences of a special character.
The JSON in your example is malformed: it contains an extra ] and some trailing content after the closing }. For correct JSON you can use get_json_object, for example:
select get_json_object(src_json,'$.url.content_id') from
(
select '{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36], "packager_path": "/opt/bento4"}}' as src_json
)s
;
Result:
OK
1000231205
Time taken: 21.606 seconds, Fetched: 1 row(s)
You can use the regexp_extract function in Hive with a matching regex to extract only the digits from content_id.
Example:
select regexp_extract(col1,'"content_id":\\s"(\\d+)"',1) from (
select string('{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}')col1
)t;
+-------------+--+
| _c0 |
+-------------+--+
| 1000231205 |
+-------------+--+
Regex description:
"content_id":\\s"(\\d+)" //match the literal "content_id": + one whitespace character + the digits inside quotes, captured as group 1
Found an expensive way to do it through a combination of regex and substring functions:
substr(split(regexp_extract(message,'content_id([^&]*)'), '"')[3],1) as content_id
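For comparison outside Hive, the two main ideas above (a JSON path lookup when the input is well formed, a regex on the raw string when it is not) can be combined in a short Python sketch. The sample strings are abbreviated from the question's data, `extract_content_id` is a hypothetical helper, and the regex uses `\s*` instead of `\s` so that it tolerates zero or several spaces after the colon.

```python
import json
import re

# Abbreviated samples based on the question's data.
valid = '{"url": {"subsample": 25, "content_id": "1000231205", "encryption_enabled": false}}'
malformed = (
    '{"url": {"content_id": "1000231205", "non_ad_time_intervals": [2330.68, 2898.36]], '
    '"packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}'
)


def extract_content_id(text):
    """Try a real JSON parse first; fall back to a regex on the raw string."""
    try:
        return json.loads(text)["url"]["content_id"]
    except (ValueError, KeyError, TypeError):
        match = re.search(r'"content_id":\s*"(\d+)"', text)
        return match.group(1) if match else None


print(extract_content_id(valid))
print(extract_content_id(malformed))
```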

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closer there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, that is, if you can expect the other_tags field to always look like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
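The query can be checked against an in-memory database with Python's built-in sqlite3 module. This is a minimal sketch: the `lines` table here is a stand-in with only the two columns the query touches, populated with the single row quoted in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (highway TEXT, other_tags TEXT)")

# The other_tags value quoted in the question.
tags = (
    '"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none",'
    '"oneway"=>"yes","ref"=>"A 73","width"=>"7"'
)
conn.execute("INSERT INTO lines VALUES (?, ?)", ("motorway", tags))

query = """
SELECT substr(other_tags,
              instr(other_tags, '"ref"=>"') + 8,
              instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) AS name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
  AND highway = 'motorway'
"""
names = [row[0] for row in conn.execute(query)]
print(names)
conn.close()
```

Note that the length calculation depends on "width" directly following "ref" in the tag string, so rows with a different tag order would need a different end marker.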
Check with the following conditions:
other_tags LIKE 'A%' -- begins with 'A'
abs(substr(other_tags, 3, 2)) <> 0.0 -- the two characters starting at the 3rd position are a number
length(other_tags) = 4 -- the length of other_tags is 4
So here is how your query should look:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'

Sorting logs using regex?

I'm trying to figure out how to sort logs for example...
User: test
Level: user
Domain: localhost
Time: 12pm
Blah: INFO
Date: 07-12-2016
Ip: 127.0.0.1
I would like the output text to look like this (there are also tab spaces):
User:Level:Domain:Time:Blah:Date:IP
If I understand your question correctly, you're talking not about sorting but about parsing. You have log strings that you want to convert to another format. The regex to match your log string would be
(?P<User>[^:]+):(?P<Level>[^:]+):(?P<Domain>[^:]+):(?P<Time>[^:]+):(?P<Blah>[^:]+):(?P<Date>[^:]+):(?P<IP>[^:]+)
However, since you have so many groups, the pattern can be built much more concisely. Here's an example in Python:
import re

logString = "User:Level:Domain:Time:Blah:Date:IP"
logGroups = ["User", "Level", "Domain", "Time", "Blah", "Date", "IP"]
# Build one named group per field, separated by literal colons
reLogGroups = "(?P<" + ">[^:]+):(?P<".join(logGroups) + ">[^:]+)"
matchLogGroups = re.search(reLogGroups, logString)
if matchLogGroups:
    # Number the fields starting from 1
    for counter, logGroup in enumerate(logGroups, start=1):
        print(str(counter) + ". " + logGroup + ": " + matchLogGroups.group(logGroup))
The output is
1. User: User
2. Level: Level
3. Domain: Domain
4. Time: Time
5. Blah: Blah
6. Date: Date
7. IP: IP
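Going the other way, from the multi-line "Key: value" record in the question to a single colon-joined line, may be closer to what was asked. The sketch below assumes every line has exactly that shape and that the values themselves contain no colon; both are assumptions about the real logs.

```python
log_record = """User: test
Level: user
Domain: localhost
Time: 12pm
Blah: INFO
Date: 07-12-2016
Ip: 127.0.0.1"""

# Split each line at the first colon into a key and a value.
fields = {}
for line in log_record.splitlines():
    key, _, value = line.partition(":")
    fields[key.strip()] = value.strip()

header = ":".join(fields)            # the field names, in input order
values = ":".join(fields.values())   # the matching values, in the same order
print(header)
print(values)
```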