Using regular expression to consume multiple topic in Flink

Using regular expression to consume multiple topic in Flink - regex

I know that flink is able to consume multiple topics using regular expression as enter link description here.
I have the following topic name such as
sclee-10343434
sclee-10342432
sclee-34234
sclee-3343423432424
....
In this case, when I set the value as below using the regular expression as sclee-[\\d+],
then it gave me the exception.
Is it correct for the case of the regular expression in my case?
And also, did the Flink really support it?
val source = new FlinkKafkaConsumer[T](
java.util.regex.Pattern.compile("sclee-[\\d+]"),
deserializer,
consumerProps
)
The error is below.
Caused by: java.lang.RuntimeException: Unable to retrieve any partitions with KafkaTopicsDescriptor: Topic Regex Pattern (dev-plexer-10507689[\d+])
at org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:153)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:553)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:291)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:473)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:469)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:522)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
at java.lang.Thread.run(Thread.java:750)

I think the problem here is that the regex will not match the topics provided. The provided regex sclee-[\\d+] matches sclee- followed by a single digit or + sign.
In this case You most probably need:
sclee-[\\d]+.

I would recommend to use the latest Flink stable version (currently Flink 1.16), since the documentation link you've included points to 1.4 which is no longer supported plus the Kafka Consumer has changed. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/#topic-partition-subscription for more details

Related

Can I change the regex on NagVis and if yes, how can I do this?

I have a problem with Nagvis. There I created several maps with the locations of hosts and used the service lines to display the bandwidth and utilization of individual interfaces. It all worked well until we eventually switched to CheckMK 2.0. We have renamed the interfaces and theoretically it would not be a problem to simply transfer the new names to NagVis.
However, the regex error mentioned below occurs. I also checked the new label with the regex using regex101 and found that the label has changed. It is structured according to the pattern: 'Interface_Name "Interface description"'. Nagvis's regex doesn't allow quotes, and thus neither does the name of the interface.
I'm relatively new to this and haven't had much to do with it before. One solution would be to escape the quotation marks, but I don't know where to do that. If you have any suggestions for a solution, I would be very grateful.
If you have any questions, just ask.
CMK version: 2.0.0p26
OS version: Windows 10
Error message: The attribute has the wrong format (Regex: /^[0-9a-zа-яё\p{L}\s:+_.,'-*?!##=/]+ $/u).

Kafka transforms ignoring regex

I'm trying to use kafka transforms.RemoveString to modify the name of my topic before passing it into my connector. My topic name looks like this
foo.bar_1.baz
I want to extract bar_1 and pass that in as the topic name. From what I can tell my regex is correct but the kafka transform doesn't seem to like it -
transforms=ReplaceField,RenameField,RemoveString
transforms.RemoveString.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.RemoveString.regex=(\w*.)(\w*\d+)(.*)
transforms.RemoveString.replacement=$2
I can tell the RemoveString is being used as when I change the regex to the following I get my desired results but this is rather restrictive for my use case -
transforms.RemoveString.regex=(foo.)(.*)(.baz)
transforms.RemoveString.replacement=$2
Is there some sort of limitation to the regex usage within Kafka transforms?

Found the issue, backslashes had to be escaped, with my improved regex it now looks like this -
(\\w*)\\.(\\w+)\\.(.*)

You have typo in RegexRouter, You missed the R
transforms=ReplaceField,RenameField,RemoveString
transforms.RemoveString.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.RemoveString.regex=(\w*.)(\w*\d+)(.*)
transforms.RemoveString.replacement=$2

Custom character filter in analyzer not working

i am using regexp character filter in couchbase for my analyzer. desirable result following
phuong 1 -> phuong_1
phuong 12 -> phuong_12
Configuration character filter in Couchbase Web Console following
Regular expression : ([a-z])\s+(\\d)
Replacement: $1_
Result of above configuration is produce term [phuong,1, 12 ]
Desirable result is [phuong_1 , phuong_12]
I have aligned this code many times But it still not working correct
Can you help me this problem ?

Couchbase's Full text search is implemented in golang. Here's a playground illustration of how your regular expression works ..
https://play.golang.org/p/Jray7DTYZam
As you can see in the illustration above, $1x is equivalent to ${1x}, not ${1}x. So your replacement needs to be updated to ${1}_.
Now this said, we have a limitation that variables ($1, ${2} etc.) aren't supported at the moment. I've created an internal ticket to extend support for this.

Regex expreesion for last update file

Please help me out with a regex to find the last modified file/latest updated file in the folder.
The files are in this manner:
Test.2014_02_20 updated 13:00:23
Test.2014_02_21 updated 15:23:23
Test.2014_02_25 updated 21:24:23
Using regex we need to pick up the file Test.2014_02_25 updated 21:24:23
Thanks.

Through the comments from #Theox and #piet.t, OP has concluded that regular expressions were not the best tool to accomplish the task here.
"You can use regexs to validate the format of the line (...)" - Theox
"(...) the concept of ordering things by some criteria is way out of the scope of regex" - piet.t

How to create Gmail filter searching for text only at start of subject line?

We receive regular automated build messages from Jenkins build servers at work.
It'd be nice to ferret these away into a label, skipping the inbox.
Using a filter is of course the right choice.
The desired identifier is the string [RELEASE] at the beginning of a subject line.
Attempting to specify any of the following regexes causes emails with the string release in any case anywhere in the subject line to be matched:
\[RELEASE\]*
^\[RELEASE\]
^\[RELEASE\]*
^\[RELEASE\].*
From what I've read subsequently, Gmail doesn't have standard regex support, and from experimentation it seems, as with google search, special characters are simply ignored.
I'm therefore looking for a search parameter which can be used, maybe something like atstart:mystring in keeping with their has:, in: notations.
Is there a way to force the match only if it occurs at the start of the line, and only in the case where square brackets are included?
Sincere thanks.

Regex is not on the list of search features, and it was on (more or less, as Better message search functionality (i.e. Wildcard and partial word search)) the list of pre-canned feature requests, so the answer is "you cannot do this via the Gmail web UI" :-(
There are no current Labs features which offer this. SIEVE filters would be another way to do this, that too was not supported, there seems to no longer be any definitive statement on SIEVE support in the Gmail help.
Updated for link rot The pre-canned list of feature requests was, er canned, the original is on archive.org dated 2012, now you just get redirected to a dumbed down page telling you how to give feedback. Lack of SIEVE support was covered in answer 78761 Does Gmail support all IMAP features?, since some time in 2015 that answer silently redirects to the answer about IMAP client configuration, archive.org has a copy dated 2014.
With the current search facility brackets of any form () {} [] are used for grouping, they have no observable effect if there's just one term within. Using (aaa|bbb) and [aaa|bbb] are equivalent and will both find words aaa or bbb. Most other punctuation characters, including \, are treated as a space or a word-separator, + - : and " do have special meaning though, see the help.
As of 2016, only the form "{term1 term2}" is documented for this, and is equivalent to the search "term1 OR term2".
You can do regex searches on your mailbox (within limits) programmatically via Google docs: http://www.labnol.org/internet/advanced-gmail-search/21623/ has source showing how it can be done (copy the document, then Tools > Script Editor to get the complete source).
You could also do this via IMAP as described here:
Python IMAP search for partial subject
and script something to move messages to different folder. The IMAP SEARCH verb only supports substrings, not regex (Gmail search is further limited to complete words, not substrings), further processing of the matches to apply a regex would be needed.
For completeness, one last workaround is: Gmail supports plus addressing, if you can change the destination address to youraddress+jenkinsrelease#gmail.com it will still be sent to your mailbox where you can filter by recipient address. Make sure to filter using the full email address to:youraddress+jenkinsrelease#gmail.com. This is of course more or less the same thing as setting up a dedicated Gmail address for this purpose :-)

Using Google Apps Script, you can use this function to filter email threads by a given regex:
function processInboxEmailSubjects() {
var threads = GmailApp.getInboxThreads();
for (var i = 0; i < threads.length; i++) {
var subject = threads[i].getFirstMessageSubject();
const regex = /^\[RELEASE\]/; //change this to whatever regex you want, this one should cover OP's scenario
let isAtLeast40 = regex.test(subject)
if (isAtLeast40) {
Logger.log(subject);
// Now do what you want to do with the email thread. For example, skip inbox and add an already existing label, like so:
threads[i].moveToArchive().addLabel("customLabel")
}
}
}
As far as I know, unfortunately there isn't a way to trigger this with every new incoming email, so you have to create a time trigger like so (feel free to change it to whatever interval you think best):
function createTrigger(){ //you only need to run this once, then the trigger executes the function every hour in perpetuity
ScriptApp.newTrigger('processInboxEmailSubjects').timeBased().everyHours(1).create();
}

The only option I have found to do this is find some exact wording and put that under the "Has the words" option. Its not the best option, but it works.

I was wondering how to do this myself; it seems Gmail has since silently implemented this feature. I created the following filter:
Matches: subject:([test])
Do this: Skip Inbox
And then I sent a message with the subject
[test] foo
And the message was archived! So it seems all that is necessary is to create a filter for the subject prefix you wish to handle.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js