I'm trying to use a Kafka Connect transform, aliased as transforms.RemoveString, to modify the name of my topic before passing it into my connector. My topic name looks like this:
foo.bar_1.baz
I want to extract bar_1 and pass that in as the topic name. From what I can tell my regex is correct, but the Kafka transform doesn't seem to like it:
transforms=ReplaceField,RenameField,RemoveString
transforms.RemoveString.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.RemoveString.regex=(\w*.)(\w*\d+)(.*)
transforms.RemoveString.replacement=$2
I can tell the RemoveString transform is being applied, because when I change the regex to the following I get my desired result, but this is rather restrictive for my use case:
transforms.RemoveString.regex=(foo.)(.*)(.baz)
transforms.RemoveString.replacement=$2
Is there some sort of limitation to the regex usage within Kafka transforms?
Found the issue: backslashes had to be escaped. With that fix, my improved regex now looks like this:
(\\w*)\\.(\\w+)\\.(.*)
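For anyone else hitting this: the doubling is a quirk of how Java .properties files are parsed, not of RegexRouter itself. A single backslash acts as an escape character and is silently dropped on load, so \w reaches the transform as a plain w. A minimal sketch of the effect, assuming the connector config is loaded as a standard Java properties file:

import java.io.StringReader;
import java.util.Properties;

public class PropertiesEscapeDemo {
    public static void main(String[] args) throws Exception {
        // .properties parsing treats '\' as an escape, so unknown escapes
        // like \w lose their backslash when the file is loaded.
        Properties p = new Properties();
        p.load(new StringReader("single=(\\w*)\\.(\\w+)\nescaped=(\\\\w*)\\\\.(\\\\w+)"));
        System.out.println(p.getProperty("single"));  // (w*).(w+)    -- regex is mangled
        System.out.println(p.getProperty("escaped")); // (\w*)\.(\w+) -- what RegexRouter sees
    }
}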
You have a typo in RegexRouter; you missed the 'R'. The configuration should be:
transforms=ReplaceField,RenameField,RemoveString
transforms.RemoveString.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.RemoveString.regex=(\w*.)(\w*\d+)(.*)
transforms.RemoveString.replacement=$2
Related
I know that Flink is able to consume from multiple topics matching a regular expression, as described in the Flink documentation.
I have topic names such as:
sclee-10343434
sclee-10342432
sclee-34234
sclee-3343423432424
....
In this case, when I set the value as below, using the regular expression sclee-[\\d+], it gave me an exception.
Is my regular expression correct for this case?
And does Flink actually support this?
val source = new FlinkKafkaConsumer[T](
  java.util.regex.Pattern.compile("sclee-[\\d+]"),
  deserializer,
  consumerProps
)
The error is below.
Caused by: java.lang.RuntimeException: Unable to retrieve any partitions with KafkaTopicsDescriptor: Topic Regex Pattern (dev-plexer-10507689[\d+])
at org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:153)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:553)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:291)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:473)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:469)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:522)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
at java.lang.Thread.run(Thread.java:750)
I think the problem here is that the regex does not match the topics provided. The regex sclee-[\\d+] is a character class, so it matches sclee- followed by a single character that is either a digit or a literal + sign.
In this case you most probably need:
sclee-[\\d]+
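A quick check with java.util.regex (the same Pattern class the consumer is given) shows the difference:

import java.util.regex.Pattern;

public class TopicRegexCheck {
    public static void main(String[] args) {
        // [\d+] is a character class: exactly one digit OR a literal '+'.
        System.out.println(Pattern.matches("sclee-[\\d+]", "sclee-10343434")); // false
        System.out.println(Pattern.matches("sclee-[\\d+]", "sclee-3"));        // true
        // [\d]+ (equivalently \d+) matches one or more digits.
        System.out.println(Pattern.matches("sclee-[\\d]+", "sclee-10343434")); // true
    }
}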
I would also recommend using the latest stable Flink version (currently Flink 1.16), since the documentation link you've included points to 1.4, which is no longer supported, and the Kafka consumer has changed since then. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/#topic-partition-subscription for more details.
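If you do upgrade, the newer KafkaSource API accepts a topic pattern directly. A minimal sketch (the broker address, group id and offset choice are placeholders, not taken from the question):

import java.util.regex.Pattern;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

public class TopicPatternSource {
    public static void main(String[] args) {
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")           // placeholder broker
                .setTopicPattern(Pattern.compile("sclee-\\d+"))  // corrected regex
                .setGroupId("sclee-consumer")                    // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        // Hand 'source' to env.fromSource(...) as usual.
    }
}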
I have a problem with NagVis. I created several maps with the locations of hosts and used service lines to display the bandwidth and utilization of individual interfaces. It all worked well until we eventually switched to CheckMK 2.0. We have renamed the interfaces, and in theory it would not be a problem to simply transfer the new names to NagVis.
However, the regex error quoted below occurs. I checked the new label against the regex using regex101 and found that the label format has changed. It is now structured according to the pattern 'Interface_Name "Interface description"'. NagVis's regex doesn't allow quotes, and therefore doesn't accept the new interface names.
I'm relatively new to this and haven't had much to do with it before. One solution would be to escape the quotation marks, but I don't know where to do that. If you have any suggestions for a solution, I would be very grateful.
If you have any questions, just ask.
CMK version: 2.0.0p26
OS version: Windows 10
Error message: The attribute has the wrong format (Regex: /^[0-9a-zа-яё\p{L}\s:+_.,'-*?!##=/]+ $/u).
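To illustrate the rejection, here is a rough Java approximation of the character class from the error message (the actual check runs under PHP's /u flag; the duplicated # and the stray space before the $ are dropped). Note that the double quote appears nowhere in the class:

import java.util.regex.Pattern;

public class NagvisLabelCheck {
    public static void main(String[] args) {
        // Approximation of the allowed-characters class from the error message.
        Pattern allowed = Pattern.compile("^[0-9a-zа-яё\\p{L}\\s:+_.,'-*?!#=/]+$");
        System.out.println(allowed.matcher("Interface_Name").matches());                 // true
        System.out.println(allowed.matcher("Interface_Name \"description\"").matches()); // false
    }
}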
I need to create a regex to help determine the number of times an API is called. We have multiple APIs, and this API has the following format:
/foo/bar/{barId}/id/{id}
The above endpoint also supports query parameters, so the following request would be valid:
/foo/bar/{barId}/id/{id}?start=0&limit=10
The following requests are also valid:
/foo/bar/{barId}/id/{id}/
/foo/bar/{barId}/id/{id}
We also have the following endpoints:
/foo/bar/{barId}/id/type/
/foo/bar/{barId}/id/name/
/foo/bar/{barId}/id/{id}/price
My current regex to extract calls made only to /foo/bar/{barId}/id/{id} looks something like this:
\/foo\/bar\/(.+)\/id\/(?!type|name)(.+)
But the above regex also includes calls made to the /foo/bar/{barId}/id/{id}/price endpoint.
I could check that the string after {id}/ isn't price and exclude those calls, but that isn't a long-term solution, since we would need to update the regex every time we add another endpoint.
Is there a way to filter calls made only to:
/foo/bar/{barId}/id/{id}
/foo/bar/{barId}/id/{id}/
/foo/bar/{barId}/id/{id}?start=0&limit=10
Such that /foo/bar/{barId}/id/{id}/price isn't also pulled in?
\/foo\/bar\/(.+)\/id\/(?!type|name)(.+)
There is something in your regex that is the cause of your problem: (.+) matches every character after it, including slashes. Replace it with [^/]+ so each group stays within a single path segment, and append /?(?!.+) to require that nothing follows apart from an optional trailing slash. This is working for me:
/foo/bar/([^/]+)/id/(?!type|name)([^/]+)/?(?!.+)
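To sanity-check it, a small Java sketch with a few hypothetical calls (the ids 42 and 7 are made up):

import java.util.regex.Pattern;

public class EndpointFilterCheck {
    public static void main(String[] args) {
        // [^/]+ keeps each group inside one path segment; /?(?!.+) requires that
        // nothing follows except an optional trailing slash. A query string has
        // no '/', so it is simply absorbed into the second group.
        Pattern p = Pattern.compile("/foo/bar/([^/]+)/id/(?!type|name)([^/]+)/?(?!.+)");
        String[] calls = {
                "/foo/bar/42/id/7",                   // true
                "/foo/bar/42/id/7/",                  // true
                "/foo/bar/42/id/7?start=0&limit=10",  // true
                "/foo/bar/42/id/7/price",             // false
                "/foo/bar/42/id/type/",               // false
        };
        for (String c : calls) {
            System.out.println(c + " -> " + p.matcher(c).matches());
        }
    }
}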
I have the following hl7 message:
MSH|^~\&|EPIC|SMHRMC|JCAPS|QHN|20170626165726|EDILABIH|ORU^R01^LAB|00004841|P|2.3|||||||||
PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F||1|2601 somestreet AVE NO 8^^City^ST^zip^USA^^^county|MESA|(970)xxx-xxxx^P^PH|||Single||175375903|xxxxxxx||last^first^^|NON-HISPANIC||||||||||
PV1|1|I|MNEU^908^A^^R^^^^^^||||9999999^pcp^pcp^LYNNE^^^^^NPI^^^^NPI~999999999^last^first^LEE^^^^^NPI^^^^NPI||||||||||00000000^last^first^LYNNE^^^^^NPI^^^^NPI||000000603|CAID||||||||||||||||||||||||20170626000000
HL7 is hard to parse with regex; however, I have a field that is always in the same location, and I feel that might be easier. I need to pull the encounter number, which is the 'W00xxxxx' in the stream above. It is always in the 3rd pipe-delimited section of the PID segment and stops at the '^'.
Currently I have select substring(column from 'PID\|[1]\|\|(.)\^'), but this is not working. However, when I use select substring(column from 'PV1\|[1]\|(.)\|') it pulls the 'I'. I can't see a big enough difference between the two regexes to know why the first isn't working. Thanks.
How about this:
PID\|[1]\|\|(.+?)\^
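The lazy (.+?) expands only until the first ^ can match, whereas your original (.) matched exactly one character, leaving \^ to fail against the '0' in 'W00xxxxx'. The same pattern can be checked outside Postgres with Java's regex engine (a sketch; the PID segment is shortened):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Hl7PidCheck {
    public static void main(String[] args) {
        String pid = "PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F";
        // Lazy (.+?) stops at the first '^' instead of the last one.
        Matcher m = Pattern.compile("PID\\|1\\|\\|(.+?)\\^").matcher(pid);
        if (m.find()) {
            System.out.println(m.group(1)); // W00xxxxx
        }
    }
}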
You can't reliably parse HL7 v2.x messages using regex, because the encoding characters may change in MSH-1 and MSH-2. Whatever language you're using, there's probably already an HL7 parsing library you can use instead.
I am using fluentd, Elasticsearch and Kibana to organize logs. Unfortunately, these logs are not written using any standard format like Apache's, so I had to come up with the regex for the format myself. I used this site to verify that it is working: http://fluentular.herokuapp.com/.
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website, which is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead, though, I get this "bug" where pri is always shortened to DEBU. The same happens to ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions, and I find it hard to believe that this is a bug, but it confuses me, and any help is greatly appreciated.
I'm not sure I can post the complete config file, because I don't personally own these log files and I am trying to keep this at a level where my boss won't get mad at me for posting sensitive information. Should it definitely be needed, I will post it later on, after having asked him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO; next the date; next what we call the subject, which is always written in [ ]; and finally just a message.
Here is a link to Fluentular with the format I am using and a test string that produces the right result in Fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?
Here is how the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string, you were using . and ... where your colon and spaces should be. Because ([INFO]|[DEBUG]|[ERROR])+ is an alternation of character classes rather than literal words, the engine is free to stop one letter early and let the ... absorb the final letter plus the colon and space so that the date still lines up; that is why DEBUG comes out as DEBU. You should match the \: explicitly, along with each space between the values.
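You can watch that happen with a quick java.util.regex sketch (fluentd's Ruby engine backtracks the same way here):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriTruncationCheck {
    public static void main(String[] args) {
        String line = "DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.";
        // Original style: character classes plus dots. To make the date fit,
        // the engine gives back the final G so '...' can absorb "G: ".
        Matcher bad = Pattern
                .compile("(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>\\d{2}\\.\\d{2}\\.\\d{4})")
                .matcher(line);
        if (bad.find()) System.out.println(bad.group("pri")); // DEBU
        // Literal alternation plus an explicit colon and spaces: no truncation.
        Matcher good = Pattern
                .compile("(?<pri>INFO|DEBUG|ERROR):\\s+(?<date>\\d{2}\\.\\d{2}\\.\\d{4})")
                .matcher(line);
        if (good.find()) System.out.println(good.group("pri")); // DEBUG
    }
}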
So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
I would also take a look at comparing Logstash vs. Fluentd. I like Logstash far more, because you create Grok filters to match the type of data you want, which makes formatting your fields much easier by providing an abstraction layer; you will essentially end up with the same data, though.
And I would watch out when using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like RegExr, which gives immediate feedback and lets you set global and multiline flags as well.