OpenRefine custom text faceting - regex

I have a column of names like:
Quaglia, Pietro Paolo
Bernard, of Clairvaux, Saint, or
.E., Calvin F.
Swingle, M Abate, Agostino, Assereto
Abati, Antonio
10-NA)\u, Ferraro, Giuseppe, ed, Biblioteca comunale ariostea. Mss. (Esteri
I want to make a Custom text facet in OpenRefine that marks names with exactly one comma as "true" and all the others as "false", so that I can work on the latter (".E., Calvin F." is not a problem, I'll deal with that later).
I'm trying to use "Custom text facet" with this expression:
if(value.match(/([^,]+),([^,]+)/), "true", "false")
But the result is all false. What's the wrong part?

The expression you are using:
if(value.match(/([^,]+),([^,]+)/), "true", "false")
will always evaluate to false because the output of the 'match' function is either an array or null. When evaluated by 'if', neither an array nor null evaluates to true.
You can wrap the match function in 'isNonBlank' or similar to get a boolean true/false, which then makes the 'if' function work as you want. However, once you have a boolean true/false result the 'if' becomes redundant: all it does is turn the boolean true/false into the string "true" or "false", which makes no difference to the values shown by the custom text facet.
So:
isNonBlank(value.match(/([^,]+),([^,]+)/))
should give you the desired result using match.

Instead of using 'match' you could use 'split' to split the string into an array, using the comma as the separator. The length of the resulting array is one more than the number of commas in the string (i.e. number of commas = length-1), so a name with exactly one comma yields an array of length 2.
So your custom text facet expression becomes:
value.split(",").length()==2
This will give you true/false
If you want to break down the data based on the number of commas that appear, you could leave off the '==2' to get a facet which just gives you the length of the resulting array.
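The same counting idea can be tried outside OpenRefine. This is a minimal Python sketch for illustration only, not GREL; the helper name comma_parts is made up for this example, and the sample values are taken from the question.

```python
from collections import Counter

names = [
    "Quaglia, Pietro Paolo",
    "Bernard, of Clairvaux, Saint, or",
    ".E., Calvin F.",
    "Abati, Antonio",
]

def comma_parts(value: str) -> int:
    """Number of pieces after splitting on commas (commas + 1)."""
    return len(value.split(","))

# Facet-style true/false: exactly one comma means exactly two pieces.
one_comma = {name: comma_parts(name) == 2 for name in names}

# Facet-style breakdown by piece count, like leaving off the '==2'.
breakdown = Counter(comma_parts(name) for name in names)
```

The breakdown groups values by piece count, which is what the facet shows when the '==2' comparison is dropped.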

I would go with a lookahead assertion that checks whether exactly one "," can be found between the beginning and the end of the line.
^(?=[^\,]+,[^\,]+$).*
https://regex101.com/r/iG4hX6/2
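To try the lookahead pattern away from regex101, here is a small Python sketch. It assumes Python's re module rather than OpenRefine, and the helper name has_one_comma is hypothetical. The backslash before the comma in the original pattern is harmless but not required, so it is dropped here.

```python
import re

# Lookahead: from the start, one run of non-commas, a single comma,
# another run of non-commas, then end of line.
ONE_COMMA = re.compile(r"^(?=[^,]+,[^,]+$).*")

def has_one_comma(value: str) -> bool:
    """True when the value contains exactly one comma with text on both sides."""
    return ONE_COMMA.match(value) is not None
```

Note that the pattern requires at least one character on each side of the comma, so a value that starts or ends with the comma does not match.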

Related

Extract Specific Parameter value using Regex Postgresql

Given input string as
'PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,kWh,800,Gallons,500'
and unit_param = 'Gallons'
I need to extract value of unit_param (Gallons) which is 1000 using postgresql regex functions.
As of now, I have a function that first extracts value for THRESHOLDLIST which is "kWh,2000,Gallons,1000,litre,3000", then splits and loops over the array to get the value.
Can I get this more efficiently using a regex?
SELECT substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,xyz' FROM '%THRESHOLDLIST=#".........#",%' FOR '#')
Use substring() with the target input grouped:
substring(myCol, 'THRESHOLDLIST=[^=]*Gallons,([0-9]+)')
The expression [^=]* means “characters that are not =”, so it won’t match Gallons within another parameter.
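The effect of [^=]* can be checked on the input string from the question. The following is a Python sketch of the same pattern, not PostgreSQL; the function name threshold_for is made up for illustration, and the unit name is escaped and spliced into the pattern as an assumption about how unit_param would be supplied.

```python
import re

settings = (
    "PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,"
    "PARAM_2=TRUE,PARAM_3=abc,123,kWh,800,Gallons,500"
)

def threshold_for(text: str, unit: str):
    """Return the number following `unit` inside THRESHOLDLIST, or None."""
    # [^=]* cannot cross the next '=', so the search stays inside
    # the THRESHOLDLIST value and never reaches PARAM_3.
    pattern = r"THRESHOLDLIST=[^=]*" + re.escape(unit) + r",([0-9]+)"
    found = re.search(pattern, text)
    return found.group(1) if found else None
```

With this input, the Gallons entry under PARAM_3 is never reached, because the match would have to pass over an = sign to get there.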
select
Substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,xyz' from 'Gallons,\d*');
returns Gallons,1000

Create an If statement comparing a custom field MS Word

I'm trying to create an if statement (in MS Word) that looks at a custom field.
The custom field is DocProperty Client_ABV
I want it to print a line of text if client_abv matches a certain value else be completely blank (or delete the empty line if possible)
I believe it needs to look something like this:
{IF DocProperty.Client_ABV="Test" "Print this line if Test",""}
I've very little experience with this function in Word but I have some with conditional programming.
Can anyone shed any light? I've been googling it for the last 45 minutes and have had little success with the example pages I've found.
Use Ctrl+F9 to insert the field code { brackets }. They look like wavy brackets, but these are actually special "escape codes" that tell Word this is a field code.
You need a pair of brackets for both the IF and the DocProperty fields.
When performing a string comparison it's a good idea to put "quotes" around the field code as well as around the literal string.
There is no punctuation in the DocProperty field code (no period). And no comma between the true/false evaluation, only a space between the closing " and opening ".
If a paragraph mark should be part of the true/false evaluation (for example, you want to suppress the paragraph mark if the comparison is false) include it inside the "quotes" for the evaluation result. The field code will look a bit odd, but that does work.
For example:
{ IF "{ DocProperty Client_ABV }"="Test" "Print this line if Test¶
" ""}

Need better regex to test for "a" but not "ax"

I use the following regex in SSRS to test for a particular column name in a parameter:
=IIf(InStr(Join(Parameters!ColumnNames.Value, ","), "x"), False, True)
This will hide a column on a report if it is not one of the chosen columns. This works just fine if there is not another column called "xy". The string being tested may be "z,x,w", in which case the test works fine; but it may also be "z,xy,w", in which case it will find "x" and display both "x" and "xy".
I tried checking for "x," which only works if "x" is not the last character of the string. I need to know the syntax to check for both "x," OR "x as the last piece of the string". Unfortunately "x" can have any length. The basic problem is I do not know how to use an OR in the IIF statement.
I tried the most obvious ways and kept getting errors. Using "\b" also does not work because there are no spaces in the string (so word boundaries are not applicable).
What you can do is add the delimiter to your check, so that way you're checking the exact string only and not any that just include it:
=IIf
(InStr("," & Join(Parameters!ColumnNames.Value, ",") & ",", ",x,") > 0
, False
, True)
So this will catch x but not xy.
One thing to note:
I have added a check to see if InStr > 0, as InStr returns an integer and not a boolean.
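The delimiter trick is independent of SSRS. Here is a minimal Python sketch of the same idea, with a hypothetical helper name is_selected, assuming the column names themselves never contain a comma:

```python
def is_selected(selected, name):
    """Exact-name test by wrapping both sides in the delimiter."""
    # ",z,xy,w," contains ",xy," but not ",x,", so partial names do not match.
    haystack = "," + ",".join(selected) + ","
    return ("," + name + ",") in haystack
```

Because the joined string is padded with a comma at both ends, the first and last names are matched the same way as the ones in the middle.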
You want to match a specific column name in an array of column names but do this on a single line to include in the IIF statement.
Based on the last technique suggested in "How can I quickly determine if a string exists within an array?", your code would need to be:
=IIf((UBound(Filter(Parameters!ColumnNames.Value, "x", True, compare)) > -1), False, True)
It doesn't look like there is an actual Regex anywhere?

subset doesn't recognize regex

I just cannot get this working. I want to subset all the rows containing "mail". I use this:
Email <- subset(Total_Content, source == ".*mail.*")
I have rows like these:
"snt152.mail.live.com",
"mailing.serviciosmovistar.com",
"blu179.mail.live.com"
But when using "View(Email)"
I just get an empty data.frame (I only see the columns). I don't need to "escape" any metacharacter, because I need the "." to mean "any character" and the "*" to mean "0 or more times", right? Thanks.
Well, no, it doesn't - it's not meant to. You're not passing it a regular expression to be evaluated against each row, you're just passing it a character string; it doesn't know that . and * are regex characters because it's not performing a regex search. It's returning all rows where source is the literal string ".*mail.*" - which in this case is 0 rows.
What you probably want to be doing (I'm assuming this is a data.frame, here) is:
Email <- Total_Content[grepl(x = Total_Content$source, pattern = ".*mail.*"),]
grepl produces a set of boolean values of whether each entry in Total_Content$source matched the pattern. Total_Content[boolean_vector,] limits to those rows of Total_Content where the equivalent boolean is TRUE.
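The difference between comparing against a literal string and running a regex search can be seen in a few lines. This is a Python sketch for illustration, not R; the list-based "rows" stand in for the data.frame column, and the last entry (www.example.org) is a made-up non-matching value added for contrast.

```python
import re

sources = [
    "snt152.mail.live.com",
    "mailing.serviciosmovistar.com",
    "blu179.mail.live.com",
    "www.example.org",  # hypothetical row that should not match
]

# Equality compares against the literal text ".*mail.*": nothing matches.
literal = [s for s in sources if s == ".*mail.*"]

# A regex search yields one boolean per entry, like grepl does.
mask = [re.search("mail", s) is not None for s in sources]

# Keep only the entries whose boolean is True.
matched = [s for s, keep in zip(sources, mask) if keep]
```

The pattern here is just "mail": when searching rather than matching the whole string, the surrounding .* adds nothing.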
Why not use subset with a logical regex function?
Email <- subset(Total_Content, grepl(".*mail.*", source) )
The subset function does create a local environment for the evaluation of expressions that are used in either the 'subset' (row targets) or the 'select' (column targets) arguments.

Using Regular Expressions in R to change number formatting

In data frame df column c1 has negative numerical values formatted like this:
(1,000,000)
I would like to remove the parentheses from the negative values in df$c1, so as to return:
-1,000,000
I am using the following command in R: df$c1<-gsub('^\\($','-',gsub(',','',df$c1))
But the output is not returning the desired effect.
How can I adjust the regular expression in this R command to return the proper formatting?
gsub("\\((.+)\\)", "-\\1", "(1,000,000)")
# [1] "-1,000,000"
Wouldn't it instead be:
df$c1<-sub('^\\(', '-' , sub('\\)$','',df$c1))
This removes leading left-parens, replacing them with minus signs, and removes trailing right-parens. Your version was insisting that the 'outer' pattern be exactly (, which I doubt would match any items, and was removing commas using the 'inner' call. I changed to sub since there was only the desire to do this once for each element.
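For comparison, the capture-group approach from the first answer looks like this in Python. It is a sketch only: the function name parens_to_minus is made up, and anchoring the pattern with ^ and $ is an added assumption so that values without surrounding parentheses are left alone.

```python
import re

def parens_to_minus(value: str) -> str:
    """Turn "(1,000,000)" into "-1,000,000"; leave other values unchanged."""
    # Group 1 captures everything between the outer parentheses.
    return re.sub(r"^\((.+)\)$", r"-\1", value)
```

Values that are not wrapped in parentheses pass through untouched, so the function can be applied to the whole column.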