KNIME regex expression to return 6th line - regex

I have a column with string values present in several lines. I would like to only have the values in the 6th line, all the lines have varying lengths, but all the cells in the column have the information I need in the 6th line.
I am honestly absolutely new and have no background in Java nor KNIME - I have scoured this forum and other internet sources, and none seem to tackle what I need in KNIME specifically - I found something similar but it doesn't work in KNIME:
Regex for nth line in a text file

Your answer will probably need to be broken into two parts
How to do a regex search in KNIME
How to do a regex search for the 6th line
I can help with the regex search, but I don't know KNIME
To start with, you want to know how to search for a single line which is
([^\n]*\n)
This looks for
*: 0 or more of
[^\n]: anything that isn't a new line
followed by \n: a new line
and (): groups them together into a single match
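We can then expand this into: ([^\n]*\n){5}([^\n]*\n){1} This has 2 capture groups: the first is repeated to consume the first 5 lines (a repeated group only retains its last repetition, so it ends up holding the 5th line), and the second captures the 6th line.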
If KNIME supports Non-Capturing groups you can then expand that into the following so that you only have one matching capture group. You can decide for yourself which you like best.
(?:[^\n]*\n){5}([^\n]*\n){1}
I've created an example you can test on RegExr
Regardless of which way you go, make sure to document the regex with comments or store it in a variable with a very clear name, since regexes aren't particularly human readable.
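As a way to try the pattern outside KNIME, here is a minimal Python sketch of the non-capturing variant. The function name and the sample cell are made up for illustration, and the trailing \n is left out of the capture group so the result carries no line break. How KNIME itself applies a regex is not covered here.

```python
import re

# Skip five complete lines without capturing them, then capture the sixth.
# The trailing \n is deliberately outside the group.
SIXTH_LINE = re.compile(r"(?:[^\n]*\n){5}([^\n]*)")


def sixth_line(cell):
    """Return the 6th line of a multi-line string, or None if it has fewer than 6 lines."""
    match = SIXTH_LINE.match(cell)
    return match.group(1) if match else None


# Hypothetical cell content with lines of varying length.
cell = "line 1\nsecond\n3\n\nfifth line\nthe value I need\nline 7"
print(sixth_line(cell))
```

Using match() anchors the search at the start of the cell, so the count always begins at line 1.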

Related

Regex: Replace double double quotes (solved), but only in lines that contain a special string (subcondition unsolved)

1. Summary of the problem
I have a csv file where I want to replace normal quotes in text with typographic ones.
It was hard (because HTML is also included), but I have meanwhile created a good regex expression that does just the right thing: in three "capturing groups" I find the left and right quotation marks and the text inside. Replacing then is a piece of cake.
2. Regex engine
I can use the regex engine of Notepad++ (Boost) or anything PCRE2 compatible; for developing and testing purposes I have used https://regex101.com.
3. What I'm having a hard time with and just can't get right, where I need your help is here:
I want to add a sub condition, in order to find the text in quotes only in certain lines. I want to identify these lines by the language, e.g. ENGLISH or FRENCH (see also the example in the screenshot).
Screenshot of a sample
The string indicating the language is always in the same line before the text to be found, BUT only the text in quotes (main condition) should be marked after matching the sub condition, so that I will be able to replace them.
It is about a few thousand records in the csv file, in the worst case I could also replace it manually. But I'm pretty sure that this should also work via regex.
4. What I have tried
Different approaches with look arounds and non-capturing groups didn't lead me to the desired result - possibly because I didn't really understand how they work.
An example can be found here: https://regex101.com/r/ketwwm/1
It only contains the regex expression to match and mark the (three) groups WITHOUT the sub condition I am looking for:
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Hopefully anyone in the community could help? (Hopefully I have not missed anything, it's my first post here )
5. Update 03/18/2022: Almost resolved with two slightly different approaches (thank you all!) What is still unsolved ..
Solution of #Thefourthbird (see answer 1)
^(?!.*?"ENGLISH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Nearly perfect, just missing matches in an HTML section. HTML sections in the csv file are always enclosed by double quotes and may have line feeds (LF). https://regex101.com/r/x5shnx/1
Solution of #Wiktor Stribiżew (see in comments below)
^.*?"ENGLISH".*?\K("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The same with matches in HTML sections, see above. Plus: Doesn't match text in double double quotes if more than one such entry occurs within a text. https://regex101.com/r/I4NTdb/1
Screenshot (only to illustrate)
If you want to match multiple occurrences, you can use (*SKIP)(*F) to skip all lines that do not start with FRENCH:
^"(?!FRENCH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The pattern matches:
^ Start of string
" Match literally
(?!FRENCH") Negative lookahead, assert not FRENCH" directly to the right
[^"]*" Match any char except " and match "
.*(*SKIP)(*F) Match the rest of the line and skip it
| Or
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$))) Your current pattern
Regex demo
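Engines without (*SKIP)(*F), such as Python's built-in re module, can get a similar effect by filtering lines first and applying the quote pattern only to those that qualify. The sketch below is that swapped-in technique, not the PCRE pattern itself. The function name and sample rows are hypothetical, and it assumes each record sits on a single line, so HTML fields containing line feeds would need extra handling.

```python
import re

# The questioner's pattern: doubled quotes around text that is not inside an HTML tag.
QUOTED = re.compile(r'("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))')


def replace_quotes(csv_text, language="FRENCH"):
    """Swap ""text"" for typographic quotes, only in lines that start with "<language>"."""
    prefix = '"' + language + '"'
    out = []
    for line in csv_text.split("\n"):
        if line.startswith(prefix):
            # Group 2 is the text between the doubled quotes.
            line = QUOTED.sub(lambda m: "\u201c" + m.group(2) + "\u201d", line)
        out.append(line)
    return "\n".join(out)


# Hypothetical two-record sample; only the FRENCH row should change.
sample = (
    '"ENGLISH","He said ""hello"" today"\n'
    '"FRENCH","Il a dit ""bonjour"" hier"'
)
print(replace_quotes(sample))
```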

RegEx for underlining text

How can I match one line of text with a regex and follow it up with a line of dashes exactly as many as characters in the initial match to achieve text-only underlining. I intend to use this with the search and replace function (likely in the scope of a macro) inside an editor. Probably, but not necessarily, Visual Studio Code.
This is a heading
should turn into
This is a heading
-----------------
I believe I read an example of this years ago but can't find it; nor can I formulate a search query that gets anything useful out of Google (including variations of the question's title). If you can, I'd be interested in that, too.
The best I can come up with is this:
^(.)(?=(.*\n?))|.
Substitution
$1$2-
syntax: note
^(.): match the first character of a line, capture it in group 1
(?=(.*\n?)): then look ahead for the rest of this line and capture it in group 2, including a line break if there's any
|.: or a normal character
But the text must have a line break after it, or the underline stays on the same line.
Not sure if it is any useful but here are the test cases.
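To see the pattern and substitution at work outside an editor, here is a small Python sketch. It assumes Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string; Python writes $1$2 as \1\2, and the MULTILINE flag lets ^ match at every line start. The function name is made up.

```python
import re

# The answer's pattern; MULTILINE makes ^ match at the start of each line.
PATTERN = re.compile(r"^(.)(?=(.*\n?))|.", re.MULTILINE)


def underline(text):
    # First character: re-emit the whole line (groups 1 and 2) plus one dash.
    # Every other character: groups 1 and 2 are empty, so it becomes one dash.
    return PATTERN.sub(r"\1\2-", text)


print(underline("This is a heading\n"))
```

Without the trailing line break the dashes are appended to the heading line itself, which is the limitation described above.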

Tweak RegEx code to continue to first period - even if on next line

I spent half of yesterday trying various approaches found in threads here, but was unable to put together something that worked. I'm using UiPath to read a PDF document and RegEx to grab patterns out of the resulting string. I have a pattern that works for 22 of my 23 cases. I have been unable to tweak the RegEx to pick up the last item while still matching only the other cases.
The problem stems from 20-02-004 - Test #4 in the link below. It contains a line break as the sentence runs into the next line in the PDF. I essentially want the RegEx to continue until the period on the next line since it hasn't bumped into it yet, but without messing up the prior matches/adding others. The 4 Test cases are the only 4 items I want the RegEx to match against from that sample.
Link below contains the input string, a sample of the string on the text tab, 4 specific test cases of what I want to match in the string on the test tab, the current RegEx I have, and the engine (JavaScript).
Sample
Adding (?:\n.*)? to the regex group allows the search to run onto one optional, additional line before matching the [.] at the end.
/(?<=\d{2}[-]\d{2}[-]\d{3}\s)(.*(?:\n.*)?)([.])/g allows the search to wrap only one additional line to find the period.
The tests pass and the text tab appears to only capture smaller more relevant matches.
If you set the expression to enable single line matching the .* matches newline characters and the tests appear to pass:
/(?<=\d{2}[-]\d{2}[-]\d{3}\s)(.*)([.])/gs
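The first variant can be tried in Python as well, since its lookbehind has a fixed width. This is only a sketch: the sample strings are invented rather than taken from the linked PDF text, and the question's engine is JavaScript, so behaviour there should be checked separately.

```python
import re

# ID such as 20-02-004 followed by whitespace, then text that may wrap onto one more line.
ITEM = re.compile(r"(?<=\d{2}[-]\d{2}[-]\d{3}\s)(.*(?:\n.*)?)([.])")


def sentence_after_id(text):
    """Return the text after the ID up to and including the closing period."""
    match = ITEM.search(text)
    return match.group(1) + match.group(2) if match else None


# Hypothetical inputs: one sentence wraps onto a second line, one does not.
wrapped = "20-02-004 Replace the valve that is\nleaking near the pump. Page 3"
single = "20-02-001 Check the filter. Page 1"
print(sentence_after_id(wrapped))
print(sentence_after_id(single))
```

Because .* is greedy, the match runs to the last period available within the allowed one or two lines, so it is worth testing against the full input before relying on it.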

Regular expression for rest of line after first x characters

I have a bunch of lines with IDs as the first six characters, and data I don't need after. Is there a way to identify everything after the ID section so Find and Replace can replace it with whitespace?
/.{6}\K.*//
If you want something more specific, please be more specific in your question.
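\K is not available in every engine; Python's built-in re module, for one, does not support it. A capture group gives the same result there: keep the first six characters and drop the rest of each line. A minimal sketch with a made-up function name and sample data:

```python
import re


def keep_id(text):
    # Capture the first six characters of each line and discard what follows.
    return re.sub(r"^(.{6}).*", r"\1", text, flags=re.MULTILINE)


# Hypothetical lines: a six-character ID followed by data that is not needed.
print(keep_id("ABC123 some data\nXYZ789,more data"))
```

Lines shorter than six characters do not match and are left unchanged.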

How to extract file location using Regular Expressions(VB.NET)

I am facing a problem whereby I am given a string that contains a path to a file and the file's name and I only want to extract the path (without the file's name)
For example, I will receive something like
C:\Users\OopsD\Projects\test.acdbd
and from that string I want to extract only
C:\Users\OopsD\Projects
I was trying to create a RegEx to match a backslash followed by a word, followed by a dot followed by another word - this is to match the
\test.acdbd
part and replace it with empty string so that the final result is
C:\Users\OopsD\Projects
Can anyone, familiar with RegEx, help me on this one? Also, I will be using regular expressions quite a lot in the future. Is there a (free) program I can download to create regular expressions?
Are you really sure you need to be using Regex for such a simple task? How about this:
' FileInfo parses the path; Directory.FullName is the folder without the file name
Dim file As New IO.FileInfo("C:\Users\OopsD\Projects\test.acdbd")
MsgBox(file.Directory.FullName)
Regarding the free program on Regex, I would definitely recommend http://www.gskinner.com/RegExr/ - using it all the time. But you always have to consider alternatives, before going the Regex way.
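The regex that you are looking for is as below. The path in the question uses backslashes, so the character class has to exclude \ (escaped as \\) rather than /; the leading \\ matches the final backslash so that it is removed together with the file name:
\\[^\\]+$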
where,
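^ (caret): Inside square brackets, as in the pattern above, a leading caret negates the character class, so the class matches any character that is not listed. Outside a class it matches at the start of the string the regex pattern is applied to; it matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.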
$ (dollar):Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
+ (plus):Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only once.
More reference material can be found at this link.
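For comparison, here is a small Python sketch of both routes on the path from the question: a regex that removes the last backslash and the file name, and the standard library's ntpath module, which applies Windows path rules on any platform. The function name is made up, and the pattern assumes a backslash-separated path.

```python
import ntpath
import re

path = r"C:\Users\OopsD\Projects\test.acdbd"


def strip_file_name(path):
    # Remove the last backslash and everything after it (the file name).
    return re.sub(r"\\[^\\]+$", "", path)


print(strip_file_name(path))
# Non-regex alternative: ntpath handles Windows-style paths.
print(ntpath.dirname(path))
```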
Many regex tools are out there. Some of them are:
www.gskinner.com/RegExr/
www.txt2re.com
Rubular- It is not just for Ruby.