I'm trying to remove the last 16 of 18 columns from a FlowFile containing CSV-formatted text. I thought my regex pattern would work, but the output is exactly the same as the original data. The log doesn't show anything, because the processor thinks it applied the rule correctly, so something must be wrong with my regex. I've included two images below of my flow and the ReplaceText processor attributes I have set.
Figured it out: I'm not sure if it was my grouping pattern not working or what, but I changed the .* to [^,]* and made a separate group for each of the first two columns, then a group of (.*) for the rest of the columns:
^((?:[^,]*,))((?:[^,]*))((.*))
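As a quick sanity check outside NiFi (my own sketch, not part of the original post), the same pattern can be exercised with Python's re module; the \1\2 replacement below plays the role of a $1$2 replacement value in ReplaceText:

import re

# Group 1 = first column plus its comma, group 2 = second column,
# group 3 = everything else (which we drop).
pattern = re.compile(r'^((?:[^,]*,))((?:[^,]*))((.*))')
line = ','.join(f'col{i}' for i in range(1, 19))  # 18 columns
print(pattern.sub(r'\1\2', line))  # -> col1,col2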
If you already use record-based processing, I would suggest it's better to use QueryRecord and select only the needed columns with that processor. Doing complex regex is painful for maintenance, IMO.
The SQL for the QueryRecord processor would be:
SELECT column_header1, column_header2 FROM FLOWFILE
I want to parse a timestamp from logs to be used by Loki as the timestamp.
I'm a total noob when it comes to regex.
The log file is from "endlessh", which is essentially a tarpit/honeypot for SSH attackers.
It looks like this:
2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26
2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096
What I want to match, using regex, is the second timestamp present there, since it's a UTC timestamp and should be parseable by promtail.
I've tried different approaches, but just couldn't get it right at all.
So first of all, I need a regex that matches the timestamp I want.
But secondly, I somehow need to form it into a regex that exposes the value in some way?
The docs offer this example:
.*level=(?P<level>[a-zA-Z]+).*ts=(?P<timestamp>[T\d-:.Z]*).*component=(?P<component>[a-zA-Z]+)
AFAIK, those are named groups, and that is all it takes to expose the value for me to use in the config?
Would be nice if someone can provide a solution for the regex, and an explanation of what it does :)
You could for example create a specific pattern to match the first part, and capture the second part:
^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b
Or use a very broad pattern if the format is always the same, repeating an exact number of non-whitespace parts and capturing the part that you want to keep.
^(?:\S+\s+){2}(?P<timestamp>\S+)
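As a quick check (my own sketch, not part of the answer), both patterns can be tried in Python against the sample lines. Note that Go's regexp package, which promtail uses, requires the (?P<name>...) named-group syntax:

import re

lines = [
    "2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26",
    "2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096",
]

# Specific: match the first timestamp literally, capture the second.
specific = re.compile(
    r'^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+'
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b'
)
# Broad: skip two whitespace-separated fields, capture the third.
broad = re.compile(r'^(?:\S+\s+){2}(?P<timestamp>\S+)')

for line in lines:
    print(specific.match(line).group('timestamp'))
    print(broad.match(line).group('timestamp'))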
I have an Azure Storage Table set up that contains lots of values with hyphens, apostrophes, and other bits of punctuation that the Azure indexers don't like. Hyphenated-Word gets broken into two tokens, Hyphenated and Word, upon indexing. Accordingly, searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search does support Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works, but isn't exactly optimal (and, though I have no way to prove it, probably isn't performant), is to break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query with an optional match between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?, which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
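A pattern like that is at least easy to generate rather than write by hand; here is a rough sketch of the idea (the helper name and separator set are mine):

import re

# Characters assumed to cause token breaks in the index.
SEPARATORS = "-'.! "

def build_query_pattern(query: str) -> str:
    # Interleave an optional separator class between every character.
    sep = '[' + re.escape(SEPARATORS) + ']?'
    return sep + sep.join(re.escape(ch) for ch in query) + sep

print(build_query_pattern('homework'))
# an optional separator before, between, and after every character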
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G, I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G
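A minimal sketch of both steps might look like this, assuming the field is called OrderingCodeVariation as above (the code is mine, not from the answer; note that a plain strip keeps every alphanumeric character):

import re

SPECIAL = re.compile(r'[^A-Za-z0-9]+')

def content_process(ordering_code: str) -> str:
    # Content processing: the variant stored alongside the original.
    return SPECIAL.sub('', ordering_code)

def query_process(user_input: str) -> str:
    # Query processing: expand the cleaned input into an OR query.
    cleaned = user_input.strip()
    return f'{cleaned} OR OrderingCodeVariation:{SPECIAL.sub("", cleaned)}'

print(content_process('ABC-123-DE_F-4.56G'))  # ABC123DEF456G
print(query_process('ABC.123.DE.F.4.56G'))
# ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G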
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired character, or with an empty string.
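For illustration only, the relevant fragment of an index definition might look roughly like this; the analyzer name, filter name, and field are invented for the example (shown as the Python dict you would serialize to JSON for the REST API):

# Hypothetical index fragment: a custom analyzer that strips hyphens.
index_definition = {
    "fields": [
        {"name": "OrderingCode", "type": "Edm.String",
         "searchable": True, "analyzer": "code_analyzer"},
    ],
    "analyzers": [
        {"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
         "name": "code_analyzer",
         "tokenizer": "keyword_v2",
         "tokenFilters": ["lowercase", "remove_hyphens"]},
    ],
    "tokenFilters": [
        {"@odata.type": "#Microsoft.Azure.Search.PatternReplaceTokenFilter",
         "name": "remove_hyphens",
         "pattern": "-",
         "replacement": ""},
    ],
}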
My very basic regex skills are not allowing me to successfully extract an id number within a tag.
I think it would be fairly straightforward. I would like to extract the id from the following extract.
<id>53222132</id>
The id number is not a specific length but I just need to be able to find the id number which is numeric only.
More specifically this is the only instance of the tag id so it's unique and should be used within the regex.
Finally, is there a way that this can be saved within a variable?
I'm using regex as part of a Splunk query, where I will use the variable to make it distinct.
I have got as far as the following, which captures everything including the tag:
<\s*id[^>]*>(.*?)<\s*\/\s*id>
Thanks in advance
(?<=<id>)\d+(?=<\/id>)
This would be my first thought. It uses a positive lookbehind and a positive lookahead, and will only match a string of digit characters in the middle. Another alternative is:
\d+(?=<\/id>)
This will only use the lookahead, as lookbehind is not supported in every regex flavor. One other option:
\d+(?=\s*<\s*\/\s*id\s*>)
This will ignore any spaces that might be present in that ending tag, and still find the id regardless. One of these should work for your scenario.
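All three can be sanity-checked in Python, which supports the same lookarounds (this sketch is mine, not from the answer):

import re

sample = '<id>53222132</id>'
patterns = [
    r'(?<=<id>)\d+(?=<\/id>)',      # lookbehind + lookahead
    r'\d+(?=<\/id>)',               # lookahead only
    r'\d+(?=\s*<\s*\/\s*id\s*>)',   # tolerant of spaces in the closing tag
]
for p in patterns:
    print(re.search(p, sample).group())  # 53222132 each time

In Splunk itself, wrapping the digits in a named group via the rex command, e.g. rex "(?<id>\d+)(?=</id>)", should expose the match as a field named id that you can then reuse.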
Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s]
which will also match correctly, in groups, on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<aaa>)([\w]*:[\w]*[\s]*)+(?=<\/aaa>) - there are two groups: the whole match, which covers everything inside the tags, and the capturing group, which matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence because it does not work that way; a repeated group only retains its last repetition.
If you think in Python, on the match object obtained after applying your regex, you will have access to match.group(0), the whole match, and match.group(1), the capturing group.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
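In Python, that two-step approach might look like this (a sketch using the sample input above):

import re

doc = '<aaa>key0:val0 key1:val1 key2:val2</aaa>'
# Step 1: grab everything between the tags.
inner = re.search(r'(?<=<aaa>).*?(?=<\/aaa>)', doc).group()
# Step 2: run the simpler pattern over the inner text.
print(re.findall(r'\w+:\w+', inner))
# ['key0:val0', 'key1:val1', 'key2:val2']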
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<aaa>(\w+:\w+\s*)(\w+:\w+\s*)(\w+:\w+\s*)<\/aaa>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
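For completeness, a parser-based version of that loop in Python, using the standard library rather than BeautifulSoup, could be as short as this:

import re
import xml.etree.ElementTree as ET

doc = '<aaa>key0:val0 key1:val1 key2:val2</aaa>'
# Let the XML parser handle the tags, then regex the text content.
text = ET.fromstring(doc).text
print(re.findall(r'\w+:\w+', text))
# ['key0:val0', 'key1:val1', 'key2:val2']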
Typically, you have a regular expression and lots of strings to process.
I have the opposite. I have one string, and I want to find all the regular expressions that match it. Let's say I have 10 million regular expressions. I'm not trying to do any replacement or rewriting of the string, I just want to find things that match.
I'd like to store these in a database. A crude way to do this would be to select all ten million rows and iterate through them. For each iteration, apply the regex and somehow (I'm a little unclear on this piece too) see if it matches. Perhaps my regex library has a function to which I give a string and a regex, and it tells me whether they match. If it does, then I print out the regex.
This would be slow. I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string, out of its table of 10 million.
I'm agnostic on the database used, I'd just like it to be fast. I don't need it to be "custom assembler" fast but just "let the database figure it out so I don't have to iterate on 10 million lines" fast.
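To make the crude approach concrete, here is roughly what that iteration looks like in Python, with an in-memory SQLite table standing in for the real database (all names are made up):

import re
import sqlite3

# Stand-in for the real table of ten million patterns.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE table_with_regexes (regex TEXT)')
conn.executemany('INSERT INTO table_with_regexes VALUES (?)',
                 [(r'^the\b',), (r'\d+',), (r'test$',), (r'^xyz',)])

subject = 'the string to test'
# The crude approach: fetch every regex and try each one in turn.
matching = [pattern
            for (pattern,) in conn.execute('SELECT regex FROM table_with_regexes')
            if re.search(pattern, subject)]
print(matching)  # ['^the\\b', 'test$']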
I'm wondering if I can somehow hand this off to a database, so that it just returns me a table of the regular expressions that match a given string
At least MySQL can do this:
SELECT regex FROM table_with_regexes
WHERE 'someString' REGEXP regex;
Also, it would be helpful if you told us more about your actual problem. I don't think you wrote ten million regexes by hand; they must have been automatically generated. Tell us how.
In your case, I would process in three steps:
Step 1: Write an initial SQL query
Build an SQL query that searches for the regexes matching my string.
I would start with a small regex set while building the SQL query.
Step 2: Refine it if necessary
Add more regexes and see how the SQL query performs.
I would optimize and rewrite it here if necessary.
Step 3: Use the chosen database's optimization tools
I would fine-tune my SQL query to respond as quickly as possible.
I would use hints for the SQL engine, indices, parallel execution, etc.
Handing all the hard work off to the database is, IMO, an elegant and clear approach.