Reading a text config file: using regex to parse

Looking for a way to read the following config file sample using a multi line regex matcher. I could just read in the file by line, but I want to get decent with the specifics of flexible regular expression matching.
So the config file is filled with blocks of code as follows:
blockName BLOCK
IDENTIFIER value
IDENTIFIER value
IDENTIFIER
"string literal value that
could span multiple lines"
The number of identifiers could be from 1..infinity. IDENTIFIER could be NAME, DESCRIPTION, TYPE, or the like.
I have never worked with multi line regular expressions before. I'm not very familiar with the process. I essentially want to use a findAll function using this regular expression to put all of the parsed block data into a data structure for processing.
EDIT: clarification: I'm only looking to read this file once. I do not care about efficiency or elegance. I want to read the information into a data structure and then spit it out in a different format. It is a large file (3000 lines) and I don't want to do this by hand.

I don't think regex is the best tool for this.

Try this, which should work in perl regular expressions:
([\w\d]*)\s+BLOCK\s*\n(\s*(NAME|DESCRIPTION|TYPE|...)\s*([\w\d]*|"(.*)")\s*\n)+
I verified it at REGex TESTER using the following test text:
blockName BLOCK
NAME value
NAME value
DESCRIPTION
"string literal value that
could span multiple lines"
otherName BLOCK
NAME value
TYPE value
DESCRIPTION
"string literal value that
could span multiple lines"
Note that it will only find the last block/identifier if the file ends in a newline.
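For anyone who wants the findAll-into-a-data-structure part spelled out, here is a minimal Python sketch. It is not the pattern above: Python keeps only the last repetition of a repeated group, so this version uses two passes (one regex to cut the file into blocks, a second to pull identifier/value pairs out of each block). The names BLOCK_RE, FIELD_RE and parse_blocks are made up for illustration, and it assumes identifiers are upper-case words and that quoted values contain no embedded double quotes.

```python
import re

SAMPLE = '''blockName BLOCK
NAME value
NAME value
DESCRIPTION
"string literal value that
could span multiple lines"
otherName BLOCK
NAME value
TYPE value
DESCRIPTION
"string literal value that
could span multiple lines"
'''

# A block header is "<name> BLOCK" on its own line; the body runs lazily
# until the next header line or the end of the input.
BLOCK_RE = re.compile(
    r'^(\w+)[ \t]+BLOCK[ \t]*\n(.*?)(?=^\w+[ \t]+BLOCK[ \t]*$|\Z)',
    re.MULTILINE | re.DOTALL,
)

# A field is an upper-case identifier followed by either a quoted string
# (which may contain newlines) or a bare word. Upper-case is an assumption.
FIELD_RE = re.compile(r'^[ \t]*([A-Z]+)\s+(?:"([^"]*)"|(\w+))', re.MULTILINE)


def parse_blocks(text):
    """Return a list of (block_name, [(identifier, value), ...]) tuples."""
    blocks = []
    for name, body in BLOCK_RE.findall(text):
        fields = []
        for m in FIELD_RE.finditer(body):
            # Group 2 is the quoted form, group 3 the bare-word form.
            value = m.group(2) if m.group(2) is not None else m.group(3)
            fields.append((m.group(1), value))
        blocks.append((name, fields))
    return blocks


if __name__ == '__main__':
    for name, fields in parse_blocks(SAMPLE):
        print(name, fields)
```

Because the body match stops at the end of the input as well as at the next header, this version does not depend on a trailing newline.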

Related

RegEx to get unique values from large file with duplicates

I have a large XML-file that I want to extract unique values from. The values I'm looking for are placed in the XML-tag: ns3:order_id
To make it more complex, the file contains duplicates of order_id, and I'm only interested in getting the unique order_id values.
I've been using RegEx to extract the values, this is the expression:
(?sm)(\<ns3:order_id>\d+\b)(?!.*\1\b)
The expression gives me what I need, BUT only if the file is much smaller. When I try this expression on the "big" file I receive: "Catastrophic backtracking has been detected and the execution of your expression has been halted." I guess it has to do with the *, and I have tried different ways of replacing it without success.
Is there any way to correct my expression so that I can collect the values?
As seen in the text above, I've tried several different RegEx approaches. The expression above works, but not on bigger files.
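One likely reason the pattern struggles on a big file is that the (?!.*\1\b) lookahead has to scan the rest of the file for every candidate match. If a scripting language is available (an assumption, since the question does not say which tool is in use), a simple alternative is to match every order_id with a cheap pattern and do the de-duplication outside the regex. A minimal Python sketch, with unique_order_ids as a made-up helper name:

```python
import re


def unique_order_ids(xml_text):
    """Return order_id values in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    # The pattern only looks at the opening tag and the digits after it,
    # so there is no lookahead over the rest of the file.
    for order_id in re.findall(r'<ns3:order_id>(\d+)', xml_text):
        if order_id not in seen:
            seen.add(order_id)
            ordered.append(order_id)
    return ordered
```

Note that this keeps the first occurrence of each value, whereas the lookahead version keeps the last one; the set of values is the same.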

Regular expression to extract either integer or string from JSON

I am working in an environment without a JSON parser, so I am using regular expressions to parse some JSON. The value I'm looking to isolate may be either a string or an integer.
For instance
Entry1
{"Product_ID":455233, "Product_Name":"Entry One"}
Entry2
{"Product_ID":"455233-5", "Product_Name":"Entry One"}
I have been attempting to create a single regex pattern to extract the Product_ID whether it is a string or an integer.
I can successfully extract both results with separate patterns using look around with either (?<=Product_ID":")(.*?)(?=") or (?<=Product_ID":)(.*?)(?=,)
however since I don't know which one I will need ahead of time I would like a one size fits all.
I have tried to use [^"] in the pattern; however, I just can't seem to piece it together.
I expect to receive 455233-5 and 455233, but currently I receive "455233-5".
(?<="Product_ID"\s*:\s*"?)[^"]+(?="?\s*,)
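The lookbehind in that pattern contains \s* and "?, so it needs an engine that allows variable-length lookbehind; Python's re module, for one, only accepts fixed-width lookbehind. Where that is a problem, a capture group gives the same result. A rough Python sketch (PRODUCT_ID_RE and product_id are illustrative names, and it assumes the value itself contains no comma, quote or closing brace):

```python
import re

# Optional quotes sit outside the capture group, so the captured text is
# the bare value whether it was written as a number or as a string.
PRODUCT_ID_RE = re.compile(r'"Product_ID"\s*:\s*"?([^",}]+)"?')


def product_id(entry):
    """Return the Product_ID value as text, or None if it is absent."""
    m = PRODUCT_ID_RE.search(entry)
    return m.group(1).strip() if m else None
```

With the two entries from the question this returns 455233 and 455233-5 respectively.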

nifi routeText processor usage issue

I am facing an issue configuring the RouteText processor correctly. I have to filter out the lines which have particular string values at a particular index. Let's say I want all the lines which have 'BT' or 'PV7' and 'PV30' values at index 19. My file is a CSV.
I tried using the configuration below, but all of my lines are routed to the unmatched relationship. However, the data contains other lines too.
You need to change the Matching Strategy to "Satisfies Expression" since you are not using regular expressions here.
The docs for Satisfies Expression say:
"Match lines based on whether or not the text satisfies the given Expression Language expression. I.e., the line will match if the property value, evaluated as an Expression, returns true. The expression is able to reference FlowFile Attributes, as well as the variables 'line' (which is the text of the line to evaluate) and 'lineNo' (which is the line number being evaluated. This will be 1 for the first line, 2 for the second and so on)."

Flat file schema validation using regular expression - not allow new line and delimiter char

I know this must be primitive question but I am still not able to find a solution to my simple problem.
In a BizTalk solution, I want to validate an inbound flat file against a flat file schema (the delimiter char is pipe '|'). The rule is that there must be exactly the same number of fields in every record (every line). So after disassembling, no field may contain a new line char (CR LF or \r\n) or a pipe '|' char.
Every line in the flat file is a single record and there are 10 fields in every record, so there must be exactly 9 '|' pipe chars in every line.
I tried to solve it using XSD regular expression validation, but since regex is not my area of expertise, I am not able to create a final regex. Currently I am testing with .*(?!([^\r\n\|])).* but it doesn't work when there are more than 9 '|' chars; however, it works when there are fewer than 9.
Finally, I want an XSD regex which must not allow a new line char or '|' in the string but can have an empty '' value.
I have referred to the links below to create my regex:
XML Schema Regular Expressions
XML Schema - Regular Expressions
I think you're trying to solve the wrong problem.
First, do you really need to do this? I don't recall ever needing or even considering what you're describing.
Second, you can just Validate the parsed Xml. If the field count is wrong, it will fail there. If you really need to check for extra '|', you can put that in the Schema to test for it in a Map.
IBM Integration Bus solves this problem by allowing you to describe the non-XML data format using an XSD. The technology is called Data Format Description Language (DFDL).
https://en.wikipedia.org/wiki/Data_Format_Description_Language
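For reference, the per-line rule described in the question (ten fields, so exactly nine pipes, with no pipe or line break inside a field) can be written as a pattern. The sketch below checks it in Python rather than in BizTalk, so treat it as an illustration of the regex only; RECORD_RE and is_valid_record are made-up names and nothing here has been tried against an actual flat file schema.

```python
import re

# One field with no pipe or line break, then exactly nine
# "pipe + field" repetitions. Fields may be empty.
RECORD_RE = re.compile(r'[^|\r\n]*(?:\|[^|\r\n]*){9}')


def is_valid_record(line):
    """True if the line has exactly 10 pipe-delimited fields."""
    # fullmatch anchors the pattern at both ends of the line.
    return RECORD_RE.fullmatch(line) is not None
```

The original attempt probably fails on lines with too many pipes because .* on both sides can absorb any extra characters; the pattern here has no such wildcard.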

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
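To make a few of these concrete, here is roughly what three of them (the "Line NNN :" prefix, the 8th CSV column and the GDB traces) look like as regex replacements. This is a Python sketch rather than any particular editor's syntax, and the function names are invented for illustration.

```python
import re


def strip_line_prefix(text):
    """Remove a leading 'Line <digits> :' label from every line."""
    return re.sub(r'^Line \d+ :[ \t]*', '', text, flags=re.MULTILINE)


def capitalize_eighth_column(text):
    """Replace the first character of the 8th CSV column with 'A'.

    Assumes no escaped commas: skip seven comma-terminated fields,
    then swap the next character.
    """
    return re.sub(r'^((?:[^,\n]*,){7})[^,\n]', r'\1A', text,
                  flags=re.MULTILINE)


def gdb_function_names(trace):
    """Keep only the function name from frames like '#3 0x... in func (...)'."""
    return re.findall(r'^#\d+\s+0x[0-9a-fA-F]+ in (\w+)', trace,
                      flags=re.MULTILINE)
```

In an editor the same patterns would go in the find box, with the backreference (\1 or $1, depending on the editor) in the replace box.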
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
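Something along these lines, sketched in Python for a three-column file. The tag names (row, name, city, zip) are placeholders and not from my actual file, and there is no XML escaping, so it only works if the fields contain no <, > or & characters.

```python
import re


def csv_to_xml(text):
    """Wrap each field of a three-column CSV line in a tag."""
    return re.sub(
        r'^([^,\n]*),([^,\n]*),([^,\n]*)$',
        r'<row><name>\1</name><city>\2</city><zip>\3</zip></row>',
        text,
        flags=re.MULTILINE,
    )
```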
Regex makes it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe.
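A small Python sketch of the idea, with replace_word as a made-up helper: the \b on each side means the word only matches when it stands alone, not when it is part of a longer word.

```python
import re


def replace_word(text, word, replacement):
    """Replace `word` only where it appears as a whole word."""
    pattern = r'\b' + re.escape(word) + r'\b'
    # A function is used as the replacement so that backslashes in
    # `replacement` are inserted literally.
    return re.sub(pattern, lambda m: replacement, text)
```

For example, replacing the word thorpe in "Scunthorpe is a thorpe" only touches the last word.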
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_0-9]+) .*/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
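If the conversion is done in a script instead of an editor, the position number can be filled in at the same time. A rough Python sketch, assuming one column definition per line as in the snippet above (to_setters is an invented name, and every call is still setString, so the setInt() changes remain manual):

```python
import re
from itertools import count

DDL = '''field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,'''


def to_setters(ddl):
    """Turn each 'name TYPE ...' line into a numbered setString() call."""
    position = count(1)

    def repl(match):
        # next(position) hands out 1, 2, 3, ... in match order.
        return f'pstmt.setString({next(position)}, {match.group(1)});'

    return re.sub(r'^[ \t]*([a-z_0-9]+) .*$', repl, ddl, flags=re.MULTILINE)


if __name__ == '__main__':
    print(to_setters(DDL))
```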
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
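As a sketch of the replacement involved (written in Python here; in an editor the groups would be \1/\2 or $1/$2 depending on the tool): capture the type and the name, then reuse both in the output. The helper name to_method_stubs is just for illustration.

```python
import re


def to_method_stubs(decls):
    """Turn 'type name' lines into empty method stubs."""
    return re.sub(
        r'^(\w+)[ \t]+(\w+)$',
        r'public void \2(\1 \2){\n}',
        decls,
        flags=re.MULTILINE,
    )
```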
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a correct email address (syntactically speaking) into a web form, this is about the only way of checking it thoroughly.
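The preview case might look something like this in Python. It is a deliberately naive sketch: the pattern treats anything between < and > as a tag, so it will trip over a > inside an attribute value, comments or script blocks, and on its own it should not be relied on for sanitizing. The function name preview and the default length are arbitrary.

```python
import re


def preview(article_html, length=100):
    """Strip tags, collapse whitespace, and cut to `length` characters."""
    text = re.sub(r'<[^>]+>', '', article_html)
    text = re.sub(r'\s+', ' ', text).strip()
    if len(text) <= length:
        return text
    return text[:length] + '...'
```

Stripping first and truncating second is what avoids cutting a tag in half.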
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
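The same replacement as a small Python sketch, in case someone wants to script it. The four-space indent before the + is an arbitrary choice, and split_literal is a made-up name. The input is the text between the quotation marks.

```python
import re


def split_literal(text):
    """Break `text` after a space, at most 61 characters per piece."""
    # .{20,60} is greedy, so each match is the longest run of 20 to 60
    # characters that is followed by a space; the space stays on the line.
    return re.sub(r'.{20,60} ', lambda m: m.group(0) + '"\n    + "', text)
```

The leftover tail is untouched because it has no space after its 20th character, which is why the last line can be short.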
The first thing I do with any editor is try to figure out its Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex but could not live without anonymous keyboard macros.