Extract values from this string? - regex

I have the following string of text.
LOCATION: -20.443 122.951TEMPERATURE: 54.5CCONFIDENCE: 50%SATELLITE: aquaOBS TIME: 2014-05-06T05:30:30ZGRID: 1km
This is being pulled from a feed, and the fieldnames stay the same, but the values differ.
I have been trying to get my head around regular expressions and find a way to pull:
54.5 (temperature)
50 (confidence)
So I need two separate regular expressions that can pull the above from the original string. Any clues or pointers would be great.
I am doing this within a product that allows me to point to strings and can apply regular expressions to the strings so that values can be extracted and written to new fields.

ArcGIS appears to be using a very limited regex engine. It looks like it doesn't even support capturing groups, let alone lookaround. So I guess you need to try the following:
TEMPERATURE: ([0-9.]+)C
will match the TEMPERATURE entry and
CONFIDENCE: ([0-9]+)%
will match the CONFIDENCE entry.
If you're lucky, you can then access the relevant part of the match via the special variable \1 or $1 (which would then contain "54.5" and "50", respectively).
If that's not possible, you'll have to "manually" trim the matched text: drop the first 13 characters ("TEMPERATURE: ") or 12 characters ("CONFIDENCE: ") from the left side of the string, as well as the rightmost character (the "C" or "%").
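As a rough illustration of both options, here is a minimal Python sketch. Python's re module only stands in for whatever engine the product actually uses (an assumption), and the variable names are made up for the example.

```python
import re

# Sample feed string from the question (field names fixed, values vary).
feed = ("LOCATION: -20.443 122.951TEMPERATURE: 54.5CCONFIDENCE: "
        "50%SATELLITE: aquaOBS TIME: 2014-05-06T05:30:30ZGRID: 1km")

temp_match = re.search(r"TEMPERATURE: ([0-9.]+)C", feed)
conf_match = re.search(r"CONFIDENCE: ([0-9]+)%", feed)

# Option 1: read the captured group (the \1 / $1 equivalent).
temperature = temp_match.group(1) if temp_match else None
confidence = conf_match.group(1) if conf_match else None

# Option 2: no group access, so trim the whole match by hand.
# "TEMPERATURE: " is 13 characters; the trailing "C" is 1 character.
whole = temp_match.group(0) if temp_match else ""
trimmed = whole[13:-1]

print(temperature, confidence, trimmed)
```

If the product only returns the whole match, the manual trim in option 2 is the part that carries over.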

You can split this text using a newline as the delimiter. As a result you get an array. Then you can split the elements of the array using ':' as the delimiter.
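A small Python sketch of that idea follows. It assumes the raw feed really does put each field on its own line (the sample below is a hypothetical reconstruction, not the exact feed), and it splits on the first colon only, because the OBS TIME value contains colons itself.

```python
# Assumption: fields are separated by newlines in the raw feed (hypothetical input).
raw = ("LOCATION: -20.443 122.951\n"
       "TEMPERATURE: 54.5C\n"
       "CONFIDENCE: 50%\n"
       "SATELLITE: aqua\n"
       "OBS TIME: 2014-05-06T05:30:30Z\n"
       "GRID: 1km")

fields = {}
for line in raw.split("\n"):
    # partition() splits on the first ':' only, so timestamps stay intact.
    name, sep, value = line.partition(":")
    if sep:
        fields[name.strip()] = value.strip()

print(fields["TEMPERATURE"], fields["CONFIDENCE"])
```

The values still carry their units ("54.5C", "50%"), so the trailing character has to be trimmed afterwards.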

Related

Blueprism: how to use the replace function in a calculation stage?

I am reading a text from an application using BluePrism. The text has the following structure (the number varies from case to case): "Please take note of your order reference: 525". I need to be able to extract the number from the text. Looking at the calculation stage, there is a replace function: replace(text, pattern, new-text). I want to use this function to replace all alphabetic characters in my text with an empty string to return only whatever is numeric. How can I input that in the pattern?
So I want something like this:
Replace([Order confirmation text ], /^[A-z]+$/, " ")
Also, I tried to look for proper documentation for the VBOs that are shipped with blueprism, but couldn't find any. Does anyone know where we can get documentation for blueprism functions?
The Replace() function in calculate stage is the simplest possible one. It's not a regex one!
So, if the string is always in that format, then you can use:
Replace([Text],"Please take note of your order reference:","")
If the text is not always that standard, then you should use a regular expression instead. To do that, you need to use an object that invokes regex code.
In the standard blueprism objects, you can find:
Object: Utility - Strings C#
Action: Extract Regex Values
I think there is no Regex Replace action by default, so if you want one, you have to implement it yourself. Below is the code that I am using:
Dim R As New Regex(Regex_Pattern, RegexOptions.Singleline)
' The pattern is already held by R, so the instance method takes only the input and the replacement
replacement_result = R.Replace(Text, replacement_string)
Quick answer: if the text before the number is constant, use a Mid statement. This avoids the issue the other answer has with Right(), which needs to know how long the number is. For example:
Mid("Please take note of your order reference: 525",42,6)
If you pass the maximum length the number can have, it will stop at the end of the string anyway.
A few things here:
- Your pattern isn't matching because it's looking for a constant string of letters from start to finish (^ anchors to the beginning of the string and $ anchors to the end).
- You're replacing the pattern with a space, not an empty string, so you'll end up with a bunch of spaces in your result even if you correct the pattern.
- You said you only want to replace alphabetic characters, but it looks like you also want to get rid of spaces and colons.
Try replacing [A-Za-z :]+ with "".
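To make the difference concrete, here is a minimal Python sketch. Python's re.sub is used only as a stand-in for a regex-capable replace (the built-in Replace() in the calculation stage is not regex-based, as noted in another answer), and the variable names are illustrative.

```python
import re

text = "Please take note of your order reference: 525"

# Original attempt: anchored at both ends, so it only matches a string made
# up entirely of [A-z] characters. The spaces, colon and digits prevent that.
original_attempt = re.sub(r"^[A-z]+$", " ", text)

# Suggested fix: strip letters, spaces and colons, replacing with "".
number = re.sub(r"[A-Za-z :]+", "", text)

print(repr(original_attempt))
print(repr(number))
```

The first call leaves the text untouched because nothing matches; the second leaves only the digits.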
If your goal is to retrieve the number from the string, then use Right():
Right("Please take note of your order reference: 525", 3)
This will return only the numeric part, as long as the number is always three digits long.
Regards
Vimal

Regex for removing spaces and random trailing chars

I am successfully validating an ID such as:
ZFA1G2H34J5K6L7P5
using this regex:
([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]){17}$
This ID sometimes arrives corrupted (it comes from an OCR process), and therefore the previous regex does not work. I need to support the most common form of corruption, which is having a space within the ID:
ZFA1G2H34 J5K6L7P5
The regex should remove the space and compose just the allowed 17 chars of the ID.
Please note I cannot use scripting (.replace for example) because the software where this regex is used does not support it.
As a bonus, sometimes the ID contains trailing chars which I would like to remove as well:
ZFA1G2H34 J5K6L7P5...ç
You can use one of the following regular expressions to validate the input:
^(?:(?![iIoO])[ ç0-9a-zA-Z]){17,}$
^([ ça-hA-Hj-nJ-Np-zP-Z0-9]){17,}$
And then, you can use the following regular expression to only match characters you like:
(?:(?![iIoO])[0-9a-zA-Z])
[a-hA-Hj-nJ-Np-zP-Z0-9]
Don't use , in a set like [A-Z,a-z], because commas are actually part of the set and not a separator between the character ranges.
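For readers who can test outside the constrained tool, here is a short Python sketch of the "match only the characters you like" idea, plus a check of the comma pitfall. The helper name clean_id is hypothetical, and taking the first 17 allowed characters is an assumption about how the matches would be reassembled.

```python
import re

# Allowed ID characters: letters except I, O (either case) and digits.
ALLOWED = re.compile(r"[a-hA-Hj-nJ-Np-zP-Z0-9]")

def clean_id(raw):
    """Collect allowed characters and keep the first 17, or None if too few."""
    chars = ALLOWED.findall(raw)
    return "".join(chars[:17]) if len(chars) >= 17 else None

cleaned = clean_id("ZFA1G2H34 J5K6L7P5...ç")

# Inside a character class a comma is just another member of the set.
comma_is_matched = re.fullmatch(r"[A-Z,a-z]", ",") is not None

print(cleaned, comma_is_matched)
```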

regexReplace in String Manipulation KNIME

I'm trying to remove the content of all cells that start with a character that is not a number using KNIME (v3.2.1). I have different ideas but nothing works.
1) String Manipulation Node: regexReplace($column$,"^[^0-9].*","")
The cells contain multiple lines, however only the first line is removed by this approach.
2) String Manipulation Node: regexMatcher($casrn_new$,"^[^0-9].*") followed by Rule Engine Node to remove all columns that are "TRUE".
The regexMatcher gives me "False" even for columns that should be "True" though.
3) String Replacer Node: I inserted the expression ^[^0-9].* into the Pattern column and selected "Replace whole String" but the regex is not recognised by that node so nothing gets replaced.
Does anyone have a solution for any of those approaches or knows another Node that might do the job? Help is much appreciated!
I would go with your first solution, since it already works in part; you just have to expand your regex to include newlines. I would try something like this:
regexReplace($column$,"^[^0-9](.|\n)*","")
This should match any text starting with a character that is not a number, followed by any number of occurrences of any character or a newline. Depending on the line endings, you might need (.|\n|\r) instead of (.|\n).
You should use the following expression:
"(?s)^\D.*$"
So the dot will match even new lines. (Based on this: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#DOTALL)
In case you need to only change the content of the cells that do not start with a number, I do not think you need to filter any columns or rows. (BTW in case you want to remove rows, there are the Rule-based Row Filter/Splitter nodes which also support regular expressions with the MATCHES predicate.)
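The effect of the (?s) flag can be tried outside KNIME. The sketch below uses Python's re module, which also supports the inline (?s) flag; KNIME itself uses Java regular expressions, so treat this as an illustration of the behaviour rather than the node's exact implementation. The cell contents are made up.

```python
import re

cell = "n/a\n50-00-0\nextra"          # starts with a non-digit, spans several lines
numeric_cell = "50-00-0\nnote"        # starts with a digit, should be kept

# Without DOTALL the dot stops at the first newline: only line one is removed.
first_line_only = re.sub(r"^[^0-9].*", "", cell)

# With (?s) the dot also matches newlines, so the whole cell is emptied.
whole_cell = re.sub(r"(?s)^\D.*$", "", cell)

# A cell that starts with a digit does not match and is left unchanged.
kept = re.sub(r"(?s)^\D.*$", "", numeric_cell)

print(repr(first_line_only), repr(whole_cell), repr(kept))
```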

Insertion syntax for regex in Notepad++ or Perl

Shortform: searching:
"{,[0-9][0-9]," inserting Space+00... getting replaced string segment:
"{,SPACE00[0-9][0-9]," or other so-garbaged data for found [0-9][0-9] sequence ... so how do I search with a regex and insert in the middle???
Longform question:
I'm trying to do a series of simple character insertions -- digits actually -- in a series of mixed model CSV profiling data (five files each with different model parameters, several hundred lines each).
I'm visually challenged and want to insert padding characters to columnize the data, so I can focus on tweaking key values rather than on keeping my place from one data file to the next.
The CSV data lines have this format:
*Variable_symbolic-name*,{##,##,* ... ('Set of CSV Numerical Data lists' ...},\n*
an actual data line:
61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
to be morphed to:
61,parameter17,\t\t{, 0070,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Give or take a tab character to align all the { numeric field starts...
I've found searching: "{,[0-9][0-9]," failed but "\{,[0-9][0-9]," succeeds for the find part of the search and replace operation... but have hit a proverbial brick wall in how to do the actual replace (with an insert) of such a short length. (Obviously with so many parameters and files, I'm moving cautiously!)
However, This Perl Help tutorial leaves me in the dark as to how to keep the found ranges and insert padding before (Space, zero, zero to be specific if positive, '-00' if negative) In short, I need to know how to insert 2-3 places in the replace field in Notepad++... and retain the original data without prejudicing it!
Articles herein have cited replacing paragraphs and lines, adding newlines, etc. but this simple insertion alteration seems too simple for you all. But it's been several hours of frustration for me!
Thanks! // Frank
Resolved:
Good news: ({,)([0-9][0-9],) and \1 xx\2 works fine as does ({,)(#[0-9][0-9],) and replacing with \1 xx#\2 ... whether or not tabs are utilized. Obviously the key was ([0-9][0-9],) which included the discrimination of the comma... though I have no idea why that seemed to fail an hour ago with trials made using Sobrinho's help. Must have not tried the sequence. Thanks all!
Try to type this in the search box:
(.+)(\{,[0-9][0-9].*)
And in the replace:
\1\t\t\2
When you have things between parentheses, they are "stored" by Notepad++ and can be reused in the replace box.
The groups are numbered by the order of their opening parentheses, starting with one, and are accessed as \1, \2, ...
You tagged it as Perl, so here is how you do it in Perl ...
I prefer to use lookahead assertions rather than backreferences
s/(?= {,[0-9][0-9], ) /\t\t/x
Alternatively, $& contains the matched string ($0 is something different)
s/ {,[0-9][0-9], /\t\t$&/x
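The lookahead variant can be tried in other engines too. Here is a minimal Python sketch of the same zero-width insertion (the brace is escaped, and the shortened sample line is only for illustration):

```python
import re

line = "61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},,,"

# The lookahead matches the empty position just before "{,NN," without
# consuming anything, so the substitution is a pure insertion.
padded = re.sub(r"(?=\{,[0-9][0-9],)", "\t\t", line)

print(repr(padded))
```

Because nothing is consumed, there is no matched text to put back in the replacement.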
You will need a backreference here, meaning something which, in the replace part, will be equal to what you have matched.
Usually, the whole matched part is stored in the $0 backreference. (You can get $1 with a capture group too, and up to $2 with two capture groups, etc)
Back to your question, you could try this:
Find:
(\{,)([0-9][0-9],)
Replace by:
\t\t$1 00$2
This will insert two tab characters before the part that matched \{,[0-9][0-9], (or in other words, replace the part that matched by 2 tab characters and what you matched), then put the first captured part ({,) and then the space and double 0's and then the second captured part, the two digits and following comma.
regex101 demo
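The same find-and-replace can be checked outside Notepad++. The Python sketch below is only an approximation of it: Python writes group references as \g<1> rather than $1, and the sample line is shortened.

```python
import re

line = "61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},,,"

# Two tabs, group 1 ("{,"), the " 00" padding, then group 2 (two digits + comma).
replacement = "\t\t" + r"\g<1> 00\g<2>"
morphed = re.sub(r"(\{,)([0-9][0-9],)", replacement, line)

print(repr(morphed))
```

Only the first numeric field after the brace is padded, since the pattern requires the literal "{," in front of the two digits.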

Split a string based on each time a Deterministic Finite Automata reaches a final state?

I have a problem that can be solved by iteration, but I'm wondering if there's a more elegant solution using regular expressions and split().
I have a string (which Excel is putting on the clipboard), which is, in essence, comma delimited. The caveat is that when the cell values contain a comma, the whole cell is surrounded with quotation marks (presumably to escape the commas within that string). An example string is as follows:
123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"
Now, I want to elegantly split this string into individual cells, but the catch is I cannot use a normal split expression with comma as a delimiter, because it will divide cells that contain a comma in their value. Another way of looking at this problem, is that I can ONLY split on a comma if there is an EVEN number of quotation marks preceding the comma.
This is easy to solve with a loop, but I'm wondering if there's a regular expression.split function capable of capturing this logic. In an attempt to solve this problem, I constructed the Deterministic Finite Automata (DFA) for the logic.
The question now is reduced to the following: is there a way to split this string such that a new array element (corresponding to /s) is produced each time the final state (state 4 here) is reached in a DFA?
Using regex (unescaped): (?:(?:"[^"]*")|(?:[^,]*))
Use that with Regex.Matches() in .NET, or its analog on other platforms.
You could further expand the above to this: ^(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*))(?:,(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*)))*$
This will parse the whole string in 1 shot, but you need named groups and multi-capture per group for this to work (.NET supports it).
Eligible commas are also followed by an even number of quotes, and VBScript does support lookaheads. Try splitting on this:
",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
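The doubled quotes in that pattern are VBScript string escaping; the regex itself contains single quote characters. A minimal Python sketch of the same lookahead split, using the example row from the question, might look like this (stripping the surrounding quotes afterwards is an extra step added for illustration):

```python
import re

row = '123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"'

# Split on a comma only if an even number of quotes follows it up to the
# end of the string, i.e. the comma is not inside a quoted cell.
SPLIT_ON = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

cells = SPLIT_ON.split(row)
values = [c.strip('"') for c in cells]

print(values)
```

The lookahead rescans the rest of the string for every comma, so this is convenient for short rows rather than a fast general-purpose CSV parser.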