Searching and replacing multiple values in Google Refine

I'd like to search and replace multiple values in a column with a single function, using GREL (or anything else) in Google Refine.
For example:
1. replace(value, "Buch", "bibo:Book")
2. replace(value, "Zeitschrift", "bibo:Journal")
3. replace(value, "Patent", "bibo:Patent")
4. and many more.
Is there a way to do this with one GREL expression?

For your first three, you can do:
value.replace("Buch", "bibo:Book").replace("Zeitschrift", "bibo:Journal").replace("Patent", "bibo:Patent")
Depending on how many your "many more" is, that pattern may suffice. Otherwise you could investigate some type of table lookup, which might be easier in Python than in GREL (just choose Jython as your expression language).
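A table lookup of that kind could look roughly like the Python sketch below. The names TYPE_MAP and map_type are made up for illustration; only the three pairs come from the question. Note one difference from the chained replace: this maps whole cell values rather than substrings. In Refine's Jython box the expression would presumably reduce to the dictionary plus a return line working on value.

```python
# Lookup table: source term -> replacement. Only the three pairs from the
# question are listed; extend the dict for the "many more" cases.
TYPE_MAP = {
    "Buch": "bibo:Book",
    "Zeitschrift": "bibo:Journal",
    "Patent": "bibo:Patent",
}


def map_type(value):
    """Return the mapped term, or the value unchanged if it is not in the table."""
    return TYPE_MAP.get(value, value)


print(map_type("Buch"))
print(map_type("Karte"))  # hypothetical unmapped value, passed through as-is
```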

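To add the prefix in a single GREL line (note that this only prepends "bibo:" to whatever is in the cell; it does not translate the term itself):
replace(value,/(.+)/,"bibo:$1")
I use the same capture-group technique to reformat a column of digit strings with commas: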
1,317
2,000
1,055
The GREL expression
replace(value,/(\d),(\d)/,"$1$2")
returns
1317
2000
1055
which I can then use as numbers.
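For comparison, the same capture-group replacement can be sketched in Python. The function name strip_thousands_commas is hypothetical; the pattern and the sample values are the ones shown above.

```python
import re


def strip_thousands_commas(text):
    # Drop a comma only when it sits between two digits, as in the GREL pattern.
    return re.sub(r"(\d),(\d)", r"\1\2", text)


cells = ["1,317", "2,000", "1,055"]
# After the commas are gone the strings convert cleanly to integers.
numbers = [int(strip_thousands_commas(c)) for c in cells]
print(numbers)
```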

Related

Regexmatch for multiple words in Sheets

I'm trying to write a REGEXMATCH formula for Sheets that will analyze all of the text in a cell and then write a given keyword into another cell.
I've figured out how to do this for a single keyword: for example,
=IF(REGEXMATCH(F3, "czech"),"CZ",IF(REGEXMATCH(F3, "african"),"AF",IF(REGEXMATCH(F3, "mykonos"),"MK")))
What I'm having trouble with though is writing one of these values only if two or more terms are matched in the reference cell.
If I were trying to match one of two words, I realize I could use | as in:
=IF(REGEXMATCH(F3, "czech|coin"),"CZC"
etc
But in this instance I only want to produce CZC if the previous cell contains BOTH czech AND coin.
Can someone help me with this?

Try it like this:
=IF((REGEXMATCH(F3, "czech"))*(REGEXMATCH(F3, "coin")), "CZC", )
Multiplying the two TRUE/FALSE results acts as a logical AND: the product is non-zero only when both patterns match.
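The same "both terms must match" logic, written as a small Python sketch for illustration. The function name label is an assumption; the keywords and the "CZC" result come from the question.

```python
import re


def label(text):
    # Both patterns must be found somewhere in the text (logical AND).
    if re.search("czech", text) and re.search("coin", text):
        return "CZC"
    # Mirrors the empty third argument of the IF formula.
    return ""


print(label("old czech coin"))
print(label("czech glass"))
```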

Attempting to split all 4 digit numbers in spreadsheet cells with regex and formulas

I’m currently running into some difficulties with splitting and regular expressions in a Google spreadsheet. I’m attempting to split the contents of a cell across a row, but only pulling out sequences of four consecutive digits (representing years) and only using cell formulas (not functions). Eventually, this formula would apply to an entire column, but I’ve limited it to a single cell for the time being. For example, given a cell “I2” with the contents:
2009; Library of Congress; 1939-1945; 23rd 1984; 16
I need a result (placed in “J2, K2, L2, M2, etc.”) like:
2009 1939 1945 1984
This sample cell is as representative as I’m aware of for various possibilities that are likely to come up, though the number of entries between semicolons varies from one to many. In my own attempts so far, I’ve ended up with two formulas that are close to what I need, but both fall short.
1) The first formula is:
=ArrayFormula(SPLIT(SUBSTITUTE(REGEXREPLACE(I$2, "[^\d\-\;]", ""),"-", ";"), ";"))
which achieves (in "J2, K2, L2, M2, N2"):
2009 1939 1945 231984 16
2) The second formula is:
=ArrayFormula(SPLIT(SUBSTITUTE(REGEXREPLACE(REGEXREPLACE(I$2, "[^\d]", ";"), "[^\d\-\;]", ""),"-", ";"), ";"))
which gets me (in "J2, K2, L2, M2, N2, O2"):
2009 1939 1945 23 1984 16
I’ve been trying to think of a way to limit the formula’s returns with "\d{4}", for example, but no combination or alterations I’ve made so far have been successful. Does anyone have any insight which would solve this problem?

The following seems to work, although I am no expert in Sheets, and there may be more efficient methods.
Apparently if you use capture groups, REGEXEXTRACT will return an array of values. This method, however, seems to require that you know the exact number of matches to be extracted.
So the following seems to work:
=REGEXEXTRACT($I2,REPT("(\b\d{4}\b).*?",(len($I2)-len(REGEXREPLACE($I2,"\b\d{4}\b","")))/4))
How it works:
First compute the number of matches in the string:
=(len(I2)-len(REGEXREPLACE(I2,"\b\d{4}\b","")))/4
Next, create a regex expression incorporating the regex the correct number of times:
REPT("(\b\d{4}\b).*?", ...Above_formula...)
And finally, we put it all together in our final formula above.
Of course, if you know that the number of matches will always be four (4), there is no need to construct the regex string in this manner; you can just hard-code it.
EDIT: To eliminate unwanted zeros when there are no matches, first test whether there are any matches using REGEXMATCH, e.g.:
=ArrayFormula(if(REGEXMATCH($I2,"\b\d{4}\b"),(value(REGEXEXTRACT($I2,REPT("(\b\d{4}\b).*?",(len($I2)-len(REGEXREPLACE($I2,"\b\d{4}\b","")))/4)))),""))
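The count-then-repeat idea can be checked outside Sheets with a Python sketch like the one below. The names YEAR and extract_years are assumptions made for illustration, and the sample string is the one from the question. In Python itself re.findall would do the job directly; the longer route is kept only to mirror the formula.

```python
import re

YEAR = r"\b\d{4}\b"


def extract_years(text):
    # Removing every 4-digit match shortens the text by 4 characters per match.
    count = (len(text) - len(re.sub(YEAR, "", text))) // 4
    if count == 0:
        return []
    # Repeat one capture group per expected match, lazily skipping text between them.
    pattern = ("(" + YEAR + ").*?") * count
    return list(re.search(pattern, text).groups())


sample = "2009; Library of Congress; 1939-1945; 23rd 1984; 16"
print(extract_years(sample))
```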

Use this formula, perhaps replacing the colon as split character with another character that's not likely to occur in source strings.
=filter(split(regexreplace(I$2, "\D+", ":"), ":"), len(split(regexreplace(I$2, "\D+", ":"), ":"))=4)
Explanation: this is a way around the limitations of Google's RE2 regex engine. Instead of looking for the pattern, we look for the anti-pattern (anything that is not a digit), replace it with the separator, and then split. What remains are only substrings composed of digits, which we filter so that only the 4-character substrings remain.
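A rough Python equivalent of this replace, split and filter chain, for illustration only. The function name four_digit_runs and its sep parameter are hypothetical; the sample string is the one from the question.

```python
import re


def four_digit_runs(text, sep=":"):
    # Replace every run of non-digits with the separator, then split on it.
    pieces = re.sub(r"\D+", sep, text).split(sep)
    # Keep only the digit runs that are exactly four characters long.
    return [p for p in pieces if len(p) == 4]


sample = "2009; Library of Congress; 1939-1945; 23rd 1984; 16"
print(four_digit_runs(sample))
```

Empty pieces, which appear when the text starts or ends with a non-digit, are dropped by the length filter as well.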

Google Sheets Pattern Matching/RegEx for COUNTIF

The documentation for pattern matching for Google Sheets has not been helpful. I've been reading and searching for a while now and can't find this particular issue. Maybe I'm having a hard time finding the correct terms to search for but here is the problem:
I have several numbers (part numbers) that follow this format: ##-####
Categories can be defined by the part numbers, i.e. 50-03## would be one product category, and the remaining 2 digits are specific for a model.
I've been trying to run this:
=countif(E9:E13,"50-03[123][012]*")
(E9:E13 contains the part number formatted as text. If I format it any other way, the values show up screwed up because Google Sheets thinks I'm writing a date or trying to do arithmetic.)
This returns 0 every time, unless I were to change to:
=countif(E9:E13,"50-03*")
So it seems like wildcards work, but pattern matching does not?

As you identified and Wiktor mentioned, COUNTIF only supports wildcards, not regular expressions.
There are many ways to do what you want, though. To name just two:
=ArrayFormula(SUM(--REGEXMATCH(E9:E13, "50-03[123][012]*")))
=COUNTA(FILTER(E9:E13, REGEXMATCH(E9:E13, "50-03[123][012]*")))
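Both formulas boil down to "test each cell against the regex, then count the hits". A Python sketch of that idea follows; the part numbers in the list are invented sample data, and only the pattern comes from the question.

```python
import re

# Hypothetical part numbers standing in for the range E9:E13.
part_numbers = ["50-0310", "50-0322", "50-0410", "51-0311", "50-0301"]

pattern = re.compile(r"50-03[123][012]*")

# re.search finds the pattern anywhere in the string, like REGEXMATCH does.
count = sum(1 for p in part_numbers if pattern.search(p))
print(count)
```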

This is a really big hammer for a problem like yours, but you can use QUERY to do something like this:
=QUERY(E9:E13, "select count(E) where E matches '50-03[123][012]' label count(E) ''")
The label bit is to prevent QUERY from adding an automatic header to the count() column.
The nice thing about this approach is that you can pull in other columns, too. Say that over in column H, you have a number of orders for each part. Then, you can take two cells and show both the count of parts and the sum of orders:
=QUERY(E9:H13, "select count(E), sum(H) where E matches '50-03[123][012]' label count(E) '', sum(H) ''")
I routinely find this question on $searchEngine and fail to notice that I linked another question with a similar problem and other relevant answers.
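To make the count-plus-sum QUERY easier to reason about, here is the same filtering written as a Python sketch. The rows are invented sample data, with the second field standing in for the order counts in column H. It assumes that the matches operator in QUERY requires the whole cell to match, which is why re.fullmatch is used here.

```python
import re

# Hypothetical (part number, orders) rows standing in for E9:H13.
rows = [
    ("50-0310", 4),
    ("50-0322", 1),
    ("50-0410", 7),
    ("50-03100", 2),
]

pattern = re.compile(r"50-03[123][012]")

# Keep only rows whose part number matches the pattern in full.
selected = [(part, orders) for part, orders in rows if pattern.fullmatch(part)]

part_count = len(selected)
order_sum = sum(orders for _, orders in selected)
print(part_count, order_sum)
```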

regular expression to reverse text order

I need to reverse the order of the text in an HTML file's title tag, so that the first text before the : is put at the end, and so on.
original:
<title>text: texttwo: three more: four | site.com</title>
output:
<title>four: three more: texttwo: text | site.com</title>
The text inside the title is divided by : and the order of the parts needs to be reversed. Sometimes there are four parts (separated by three :), sometimes three, or some other number.
I use Notepad++ for replacing, but feel free to suggest any other easy-to-use software for this.
Thanks

I don't believe that this can be done with a standard regular expression - at least not with the requirement of needing to support any number of fields.
Assuming you have a large number of these to process, I'd use your favorite programming or scripting language, split the fields into an array (you can use regular expressions for this) - then read back from the array in reverse.
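As a minimal sketch of the split-and-reverse approach, here it is in Python. It assumes the fields are separated by ": " and that the site name follows " | ", as in the example title; the function name reverse_title is made up for illustration.

```python
import re


def reverse_title(html):
    def flip(match):
        # Separate the colon-delimited fields from the trailing " | site" part.
        fields, sep, site = match.group(1).partition(" | ")
        # Split into an array, then read it back in reverse order.
        flipped = ": ".join(reversed(fields.split(": ")))
        return "<title>" + flipped + sep + site + "</title>"

    return re.sub(r"<title>(.*?)</title>", flip, html)


original = "<title>text: texttwo: three more: four | site.com</title>"
print(reverse_title(original))
```

Because the fields are split into a list, the number of fields does not matter.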

If you really don't want to write code (which I think is not a good idea because it is a really good opportunity to learn something new) you can try this:
http://jsimlo.sk/notepad/manual/wiki/index.php/Reverse_tools (Order of Words on Each Line (Ctrl+Shift+F))
but you need to download this:
http://jsimlo.sk/notepad/

notepad++ regular expressions to convert lines for SPSS syntax editor

I am currently building a syntax document in SPSS and have a column of variable names that consists of approximately 40 lines (there will be many more in the coming week). SPSS has a nice way of creating it (see http://vault.hanover.edu/~altermattw/methods/stats/reliable/reliability-1.html), but it can only be done one variable at a time, which should be possible to automate.
I am a total beginner (I wouldn't mind if you called me a n00b) at search & replace with regular expressions in Notepad++, but I can use the extended search function as a basic user :P
The data contains Likert scale scores (from 1-7) and I would like to reverse them to do some tests.
For example: my variable name on the line is q_4_SQ001 and the line in the syntax editor is COMPUTE q_4_SQ001r=8-q_4_SQ001.
My question so far is thus:
How can I convert a line containing a unique variable name into its reverse formula?
So in this case, how can I replace the following lines:
q_4_SQ001
q_4_SQ002
q_4_SQ003
q_4_SQ004
into the syntax given below:
COMPUTE q_4_SQ001r=8-q_4_SQ001.
COMPUTE q_4_SQ002r=8-q_4_SQ002.
COMPUTE q_4_SQ003r=8-q_4_SQ003.
COMPUTE q_4_SQ004r=8-q_4_SQ004.
Please note the dot at the end of each line. I did this manually to give you an impression of what I would like to achieve. My data set has different questions and different variable strings, so I would like to make my life a bit easier right now :P
I also tried recording and running a macro as described here (http://stackoverflow.com/questions/2467875/notepad-replace-all-regular-expression-start-of-the-line-and-end-of-the-line), but that is still pretty time-consuming since I have to do each line manually and clean up with extended search at the end.
Wouldn't it be easier to convert each line?
Thanks a bunch in advance :)

Funny, Notepad++ works under Wine, as I just found out ;)
New file, inserted:
q_4_SQ001
q_4_SQ002
q_4_SQ003
q_4_SQ004
Select all (CTRL+A), replace (CTRL+R).
Tick Regular Expr, stick ^(.*)$ in the "find" bit (first textbox), and COMPUTE \1r=8-\1. in the "replace" bit (second textbox). Hit the Find button, and then the Replace Rest button.
Parentheses () around a pattern cause the matched text to be "memorised"; each set of parentheses is available to the replacement pattern via \1, \2, etc.
After the replace, I got:
COMPUTE q_4_SQ001r=8-q_4_SQ001.
COMPUTE q_4_SQ002r=8-q_4_SQ002.
COMPUTE q_4_SQ003r=8-q_4_SQ003.
COMPUTE q_4_SQ004r=8-q_4_SQ004.
Which I assume is what you wanted. Enjoy.
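If the list grows and a script is preferred over the editor, the same backreference replacement can be sketched in Python. The function name to_compute_lines is hypothetical. The pattern uses .+ instead of .* so that blank lines are left alone, which is a small deviation from the Notepad++ pattern above.

```python
import re


def to_compute_lines(text):
    # Wrap each non-empty line in the COMPUTE template; \g<1> is the captured name.
    return re.sub(
        r"^(.+)$",
        r"COMPUTE \g<1>r=8-\g<1>.",
        text,
        flags=re.MULTILINE,
    )


names = "q_4_SQ001\nq_4_SQ002\nq_4_SQ003\nq_4_SQ004"
print(to_compute_lines(names))
```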
