How to create a coded reference from an APA reference in Excel? - regex

I have a database with APA references in a column like this one (let's say it's in A6 for the sake of our example) in Google Sheets (so that the example can be reproduced online)
Smith (2010) Assessing the Impact of Aimhigher Kent and Medway
I would like to create a new code column with just the last name of the first author and the date. For our example, that would be
Smith2010
I tried things like
=REGEXEXTRACT(A6; "\w.(\d\d\d\d)")
but it doesn't work. Could you please help me with this very basic issue?
Thanks in advance,

One option is to use REGEXREPLACE with 2 capturing groups, and use those groups in the replacement.
^[^\S\r\n]*(\w+)[^()\r\n]+\((\d{4})\).*
=REGEXREPLACE(A6; "^[^\S\r\n]*(\w+)[^()\r\n]+\((\d{4})\).*"; "$1$2")
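To see what the two capturing groups pick up, here is a rough Python sketch of the same substitution. It assumes Python's re module rather than the RE2 engine that Sheets uses (this particular pattern should behave the same in both), and the name apa_code is made up for the illustration.

```python
import re

# Same pattern as in the Sheets formula, written as a Python raw string.
# Group 1: first word (the surname). Group 2: the four-digit year in parentheses.
APA_PATTERN = re.compile(r"^[^\S\r\n]*(\w+)[^()\r\n]+\((\d{4})\).*")


def apa_code(reference: str) -> str:
    """Return '<surname><year>' for a reference such as 'Smith (2010) Title'.

    Like REGEXREPLACE, the input is returned unchanged when the pattern
    does not match.
    """
    return APA_PATTERN.sub(r"\1\2", reference)


print(apa_code("Smith (2010) Assessing the Impact of Aimhigher Kent and Medway"))
# prints: Smith2010
```

In Python the replacement groups are written \1 and \2, where Sheets uses $1 and $2.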

Thanks to @The fourth bird, the solution is:
=REGEXREPLACE(A6; "^[^\S\r\n]*?(\w+)[^()\r\n]+\((\d{4})\).*"; "$1$2")
Thanks a lot to everyone who helped!

Related

How to Keep rows of multi-line cells containing a keyword in google sheets

I'm trying to keep lines that contain the word "NOA" in a column A which has many multi-line cells as can be viewed in this Google Spreadsheet.
If "NOA" is present, I would like to keep the line. The input and output should look like the image, which I have "working" with too many helper cells. Can this be combined into a single formula?
Theoretical Approaches:
I have been thinking about three approaches to solve this:
ARRAYFORMULA(REGEXREPLACE - couldn't get it to work
JOIN(FILTER(REGEXMATCH(TRANSPOSE - showing promise as it works in multiple steps
Using the QUERY Function - unfamiliar w/ function but wondering if this function has a fast solution
Practical attempts:
FIRST APPROACH: first I attempted using REGEXREPLACE to remove everything that did not have NOA in it. The regex worked in the demo but didn't work properly in Sheets. I thought this might be a concise way to get the value, perhaps if my regex skills were better?
ARRAYFORMULA(REGEXREPLACE(A1:A7, "^(?:[^N\n]|N(?:[^O\n]|O(?:[^A\n]|$)|$)|$)+",""))
I think the regex became overly complex and didn't work in Google Sheets, or perhaps the formula could be improved. Google's RE2 engine has limitations that make certain things harder to do.
SECOND APPROACH:
Then I came up with an alternative approach, which seems to work in 2 stages (with multiple helper cells), but I would like to do this with one formula.
=TRANSPOSE(split(A2,CHAR(10)))
=TEXTJOIN(CHAR(10),1,FILTER(C2:C7,REGEXMATCH(C2:C7,"NOA")))
Questions:
Can these formulas be combined and applied to the entire Column using an Index or Array?
Or perhaps, the REGEX in my first approach can be modified?
Is there a faster solution using Query?
The shared Google spreadsheet is here.
Thank you in advance for your help.
Here's one way you can do that:
=index(substitute(substitute(transpose(trim(
query(substitute(transpose(if(regexmatch(split(
filter(A2:A,A2:A<>""),char(10)),"NOA"),split(
filter(A2:A,A2:A<>""),char(10)),))," ","❄️")
,,9^9)))," ",char(10)),"❄️"," "))
First, we split the data by the newline (char 10), then we filter out the lines that don't contain NOA and finally we use a "query smush" to join everything back together.
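The split / filter / join steps described above can be sketched outside Sheets as well. This is only an illustration in Python, not a translation of the formula: the function name keep_lines_with and the sample cells are made up, and the plain substring test stands in for REGEXMATCH with the literal pattern "NOA".

```python
def keep_lines_with(cell: str, keyword: str = "NOA") -> str:
    """Keep only the lines of a multi-line cell that contain `keyword`."""
    lines = cell.split("\n")                            # split on newline, like SPLIT(..., CHAR(10))
    kept = [line for line in lines if keyword in line]  # drop lines without the keyword
    return "\n".join(kept)                              # join the survivors back together


# Hypothetical cell contents, one string per cell in column A.
cells = ["NOA 1\nfoo\nbar NOA", "nothing here", "x\nNOA y"]
print([keep_lines_with(cell) for cell in cells])
# prints: ['NOA 1\nbar NOA', '', 'NOA y']
```

A cell with no matching line comes back as an empty string, which corresponds to the blank output cell in the sheet.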

Avoid duplicate code in Excel IF formula code

I want to avoid duplicate code within Excel formulas. Is there a way to reuse a certain code segment?
=IF(A1=1,(A1-B2-C3),(A1-B2-C3)+1)
This would be especially useful when it comes to more complex or longer sections. But: everything must be in ONE formula in ONE cell. Thanks! :-)
EDIT: This is my current code.
=IF(ISNUMBER(SEARCH(".amp",A2)),IFERROR(MID(A2,FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))+1,SEARCH(".html",A2)-FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))-5),""),IFERROR(MID(A2,FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))+1,SEARCH(".html",A2)-FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))-1),""))
It strips the long ID number out of any URL of a specific CMS. So
FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))
is probably the part that occurs more than once and should be replaced with something less duplicate-prone.
EXAMPLE: www.domain.com/path1/path2/this-is-an-article-123-dd-123456789.html -> 1234567890
EXAMPLE: www.domain.com/path1/path2/this-is-an-article-123-dd-1234567890.amp.html -> 1234567890
EXAMPLE: www.domain.com/path1/this-is-an-article-1234567890.html -> 1234567890
In Google Sheets, you could use REGEXEXTRACT to get what you want:
Formula in B1:
=REGEXEXTRACT(A1,"\d{8,}")
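For comparison, the same idea (grab the first run of at least eight digits) looks roughly like this in Python. The helper name extract_id is hypothetical; the URLs are taken from the question.

```python
import re
from typing import Optional


def extract_id(url: str) -> Optional[str]:
    """Return the first run of 8 or more digits in `url`, or None if there is none."""
    match = re.search(r"\d{8,}", url)
    return match.group(0) if match else None


print(extract_id("www.domain.com/path1/path2/this-is-an-article-123-dd-1234567890.amp.html"))
# prints: 1234567890
print(extract_id("www.domain.com/path1/this-is-an-article-1234567890.html"))
# prints: 1234567890
```

The short "123" earlier in the slug is skipped because it has fewer than eight digits. This only works as long as no other part of the URL contains eight or more consecutive digits.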
Place the complex common sub-expression in its own cell and refer to that cell.
EDIT#1:
As an alternative, you can use a Named Formula for the sub-expression.
So here is another way of finding the code in Excel:
Here is the formula in cell B1, which needs to be confirmed by pressing Ctrl+Shift+Enter; then drag it down to apply it to the remaining rows:
{=FILTERXML("<data><a>"&SUBSTITUTE(MID(A1,LARGE(IF(MID(A1,ROW($A$1:INDEX($A:$A,LEN(A1))),1)="-",ROW($A$1:INDEX($A:$A,LEN(A1)))),1)+1,LEN(A1)),".","</a><a>")&"</a></data>","/data/a[1]")}
For the logic behind this formula, you may want to read this article: Extract Words with FILTERXML.
Cheers :)
P.S. It seems that Google Sheets has already outperformed Excel in some areas.

how to build regular expressions

I'm dealing with a Google spreadsheet whose data is partly messy, but in a regular way, so I hope we can figure this out.
I've tried regex builders, but I can't find the right one for Google Sheets, or I misunderstand something.
I would appreciate help with the strings below:
1. {"user":{"Czy faktura?":"Y","Nazwa firmy":"Name of the company ","NIP":"113 234 20 57"}}
2. {"user":{"Czy faktura?":"Y","Nazwa firmy":"The longer name of the company","NIP":"2352225961"}}
3. {"user":{"Czy faktura?":"N","Nazwa firmy":"","NIP":""}}
The point is to extract: (using arrayformula in google sheets)
Y or N
Name of the company
NIP number
Problems:
The name of the company has different lengths, and the NIP number is sometimes with white-spaces.
Do you have any idea how I can properly do this?
I know it's the REGEXEXTRACT formula, of course :)
I just have a problem formulating the regular expression.
=regexreplace(B1, "(^.*Nazwa firmy"":"")(.*)("",""NIP.*$)", "$2")
Well the support was fantastic :)
After all, a simple "Y|N" solves the first problem
I used @ttarchala's solution for the company name as it seems to work for some reason; I don't know why or how :)
"(^.*Nazwa firmy"":"")(.*)("",""NIP.*$)", "$2"
and the NIP is isolated by this one: "NIP\"":\""(.+)\"""),"-|\s","" and then stripped of "-" (minus) signs and whitespace.
cheers
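For anyone wondering why the patterns above work, here is a rough Python sketch of the three extractions. It is not the Sheets formula: the function name parse_row is made up, the quotes are not doubled as they must be inside a Sheets string, and the NIP pattern uses [^"]* instead of .+ so that the empty row also yields a result.

```python
import re
from typing import Tuple


def parse_row(raw: str) -> Tuple[str, str, str]:
    """Return (invoice flag, company name, NIP) from one of the strings above."""
    # First capital Y or N in the string. This relies on no capital Y or N
    # appearing before the flag, which holds for the sample rows.
    flag = re.search(r"Y|N", raw).group(0)

    # Everything between  "Nazwa firmy":"  and  ","NIP  is the company name.
    name = re.search(r'"Nazwa firmy":"(.*)","NIP', raw).group(1).strip()

    # Everything between  "NIP":"  and the next quote, minus dashes and whitespace.
    nip_raw = re.search(r'"NIP":"([^"]*)"', raw).group(1)
    nip = re.sub(r"[-\s]", "", nip_raw)

    return flag, name, nip


row = '{"user":{"Czy faktura?":"Y","Nazwa firmy":"Name of the company ","NIP":"113 234 20 57"}}'
print(parse_row(row))
# prints: ('Y', 'Name of the company', '1132342057')
```

The company-name pattern works because the text on both sides of the name is fixed, so the capturing group in the middle is whatever lies between those two anchors, regardless of its length.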

Emacs regexp - replacing text strings, query replace regexp

It seems simple enough but I can't get it done.
My text file looks like this :
Johnson Cary, 2009, This important article, 109 pages.
Smith Tom, 2003, Much ado about nothing: a study, 89 pages.
I need this :
Johnson Cary%2009%This important article%109 pages.
Any special character unlikely to appear in text will do. The end goal is to end up with a .csv then a .xls file.
I am using
^\([^,]+\)\([,]\)
to find the first occurring comma but when I try to replace with
\1 %
it does not work, nor any kind of close combination of that sort for that matter.
Any help will be dearly welcome!
Thank you much in advance.
Replace this:
^\([^,]*\), \([^,]*\), \([^,]*\), \(.*\)$
with this:
\1%\2%\3%\4
to get the correct result.
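If you want to check the pattern outside Emacs, a roughly equivalent substitution in Python looks like this. Note the syntax differences: Python groups are written ( ) rather than \( \), and this sketch uses [^,\n]* so a field can never run across a line break. The function name to_percent_separated is made up.

```python
import re

# Four fields separated by ", "; the last field takes the rest of the line.
LINE = re.compile(r"^([^,\n]*), ([^,\n]*), ([^,\n]*), (.*)$", re.MULTILINE)


def to_percent_separated(text: str) -> str:
    """Replace the first three ', ' separators on every line with '%'."""
    return LINE.sub(r"\1%\2%\3%\4", text)


sample = (
    "Johnson Cary, 2009, This important article, 109 pages.\n"
    "Smith Tom, 2003, Much ado about nothing: a study, 89 pages."
)
print(to_percent_separated(sample))
# prints:
# Johnson Cary%2009%This important article%109 pages.
# Smith Tom%2003%Much ado about nothing: a study%89 pages.
```

Because the last group is (.*), any further commas inside the final field are left alone.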

Splitting a title into separate parts

I need to split a string of the form
2,9.1,The Godfather (1972), (it's a csv line)
to:
2
9.1
The Godfather
1972
any ideas for a good regular expression?
BTW,
if you know a good regular expressions creator based on examples you provide it'd be great.
I'm a bit new to this..
10x!!
(\d+),(\d+\.\d+),(.*?)(?= \() \((\d{4})\)
^^^^^ ^^^^^^^^^^ ^^^^^^^^^^^^   ^^^^^^^
2     9.1        Title          Year
I wouldn't recommend using regex to split CSV files, as it can't handle comma escaping well. That said, how about using the simplest available solution?
A simple regex like this should solve your problem:
'(.*?),(.*?),(.*?)\((\d+)\)'
A little time with Google gave me this: /,(?!(?:[^",]|[^"],[^"])+")/. Seems to split CSV just fine.
>>> '2,9.1,The Godfather (1972)'.split(/,(?!(?:[^",]|[^"],[^"])+")/)
["2", "9.1", "The Godfather (1972)"]
If you are sure that the format is static, you can use this:
(\d+),(\d+\.\d+),(.*?) \((\d+)\)
But if it can contain more information, use a real CSV parser to read the line and then just split The Godfather (1972) using (.*?) \((\d+)\).
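A minimal sketch of that second suggestion, assuming Python and its standard csv module (the original question does not say which language is in use, and the function name parse_line is made up): the CSV parser splits the line, and the small regex is applied only to the title field.

```python
import csv
import io
import re
from typing import Tuple

# Title, one space, then the year in parentheses.
TITLE_YEAR = re.compile(r"(.*?) \((\d+)\)")


def parse_line(line: str) -> Tuple[str, str, str, str]:
    """Split one CSV line into (rank, rating, title, year)."""
    # Let the CSV parser deal with commas and quoting; keep the first three fields.
    rank, rating, title_field = next(csv.reader(io.StringIO(line)))[:3]
    match = TITLE_YEAR.fullmatch(title_field)
    if match is None:
        raise ValueError(f"unexpected title field: {title_field!r}")
    return rank, rating, match.group(1), match.group(2)


print(parse_line("2,9.1,The Godfather (1972),"))
# prints: ('2', '9.1', 'The Godfather', '1972')
```

The trailing comma in the sample line produces an empty fourth field, which the slice simply ignores.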
CSV has a lot of corner cases; your regexp approach might take you into a world of pain.
For example, if the title has a comma in it, the title would then be double-quoted, which would break all of the regexps given so far.
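To make that corner case concrete, here is a small Python comparison between a naive split on commas and the standard csv module. The sample line is hypothetical (the rank and rating are made up for the illustration), as are the two function names.

```python
import csv
import io
from typing import List


def naive_fields(line: str) -> List[str]:
    """Split on every comma, ignoring CSV quoting."""
    return line.split(",")


def csv_fields(line: str) -> List[str]:
    """Split with a real CSV parser, which honours double quotes."""
    return next(csv.reader(io.StringIO(line)))


# Hypothetical line whose title contains a comma and is therefore quoted.
line = '5,8.9,"The Good, the Bad and the Ugly (1966)"'

print(naive_fields(line))
# prints: ['5', '8.9', '"The Good', ' the Bad and the Ugly (1966)"']
print(csv_fields(line))
# prints: ['5', '8.9', 'The Good, the Bad and the Ugly (1966)']
```

The naive split cuts the title in two and leaves the quote characters in place, while the parser returns the title as a single clean field.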