how to build regular expressions

how to build regular expressions - regex

I'm dealing with some google spreadsheet with data, some of which is in a very confused way, but regular, so i hope we can figure this out.
I've tried reg ex builders but I can't find the right one for google sheets or I misunderstand some stuff.
I would appreciate help with these sentances below:
1. {"user":{"Czy faktura?":"Y","Nazwa firmy":"Name of the company ","NIP":"113 234 20 57"}}
2. {"user":{"Czy faktura?":"Y","Nazwa firmy":"The longer name of the company","NIP":"2352225961"}}
3. {"user":{"Czy faktura?":"N","Nazwa firmy":"","NIP":""}}
The point is to extract: (using arrayformula in google sheets)
Y or N
Name of the company
NIP number
Problems:
The name of the company has different lengths, and the NIP number is sometimes with white-spaces.
Do you guys have any idea how can I properly use it?
I know it's the REGEXEXTRACT formula of course :)
Just have a problem on how to formulate the regular expression..

=regexreplace(B1, "(^.*Nazwa firmy"":"")(.*)("",""NIP.*$)", "$2")

Well the support was fantastic :)
After all, a simple "Y|N" solves the first problem
I used #ttarchala's solution for the company name as it seems to work for some reason - i don't know why or how :)
"(^.Nazwa firmy"":"")(.)("",""NIP.*$)", "$2"
and the NIP is isolated by this one: "NIP\"":\""(.+)\"""),"-|\s","" and later trimmed of off the "-" minus and whitespaces signs.
cheers

Related

RegexExtract Syntax - Extract text after decimal/number from string ex. "UNDER 12.5-120"

I am trying to write a RegexExtract formula to extract the odds of a bet.
My example text is UNDER 12.5-120.
In this example, I would hope to return -120 but I need my equation to be dynamic enough to extract other odds as well.
More examples of this would be +120, +1200, +12000, -1000, etc etc.
The string will always be in this order though - OVER or UNDER then the line of the bet and then the odds of the bet. I have successfully written the regex for the line and the over/under but cant figure out the odds portion.
This is what I have so far:
=REGEXEXTRACT('Form Responses 2'!C2,"[\d.,].*") but this returns 12.5-120 and I need only the -120.

Try this for a range
=INDEX(IFERROR(REGEXEXTRACT(Q2:Q8,"[-|+]\d+")))
If you want to have pure numbers try
=INDEX(IFERROR(REGEXEXTRACT(Q2:Q8,"[-|+]\d+")+0))

This ended up working for me!!!
=REGEXEXTRACT('Form Responses 2'!C2,"[-|+][\d]+")

How to coded reference from an APA a reference in excel?

I have a database with APA references in a column like this one (let's sake it's in A6 for the sake of our example) in Google Sheets (online reproducibility aimed)
Smith (2010) Assessing the Impact of Aimhigher Kent and Medway
I would like to create a new code column with just the last name of the first author and the date. For our example, that would be
Smith2010
I tried things like
=REGEXEXTRACT(A6; "\w.(\d\d\d\d)")
but it doesn't work. Could you help me please in this very low-level issue?
Thanks in advance,

One option is to use REGEXREPLACE with 2 capturing groups, and use those groups in the replacement.
^[^\S\r\n]*(\w+)[^()\r\n]+\((\d{4})\).*
Regex demo
=REGEXREPLACE(A6; "^[^\S\r\n]*(\w+)[^()\r\n]+\((\d{4})\).*"; "$1$2")

Thanks to #The fourth bird the solution is:
> =REGEXREPLACE(A6; "^[^\S\r\n]*?(\w+)[^()\r\n]+\((\d{4})\).*"; "$1$2")
Thanks a lot for the ones who helped !!

Avoid duplicate code in Excel IF formula code

I want to avoid duplicate code within excel formulas. Is there a method to repeat a certain code segment?
=IF(A1=1,(A1-B2-C3),(A1-B2-C3)+1)
This would be especially useful when it comes to more complex or longer sections. But: everything must be in ONE formula in ONE cell. Thanks! :-)
EDIT: This is my current code.
=IF(ISNUMBER(SEARCH(".amp",A2)),IFERROR(MID(A2,FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))+1,SEARCH(".html",A2)-FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))-5),""),IFERROR(MID(A2,FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))+1,SEARCH(".html",A2)-FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-",""))))-1),""))
It strips the long ID number out of any URL of a specific CMS. So
FIND("#",SUBSTITUTE(A2,"-","#",LEN(A2)-LEN(SUBSTITUTE(A2,"-","")))
is probably the part which occurs more than once and should be replaced for a code which does not be that duplicate-prone.
EXAMPLE: www.domain.com/path1/path2/this-is-an-article-123-dd-123456789.html --> 1234567890
EXAMPLE: www.domain.com/path1/path2/this-is-an-article-123-dd-1234567890.amp.html ->
1234567890
EXAMPLE: www.domain.com/path1/this-is-an-article-1234567890.html ->
1234567890

In google sheets, you could use REGEXEXTRACT to get what you want:
Formula in B1:
=REGEXEXTRACT(A1,"\d{8,}")

Place the complex common sub-expression in its own cell and refer to that cell.
EDIT#1:
As an alternative, you can use a Named Formula for the sub-expression:
Named Formula

So here is another way of finding the code in Excel:
Here is the formula in Cell B1 which needs to be confirmed by pressing Ctrl+Shift+Enter, then drag it down to apply across board:
{=FILTERXML("<data><a>"&SUBSTITUTE(MID(A1,LARGE(IF(MID(A1,ROW($A$1:INDEX($A:$A,LEN(A1))),1)="-",ROW($A$1:INDEX($A:$A,LEN(A1)))),1)+1,LEN(A1)),".","</a><a>")&"</a></data>","/data/a[1]")}
For the logic behind this formula you may give a read to this article: Extract Words with FILTERXML.
Cheers :)
Ps. it seems that GoogleSheet has out performed Excel in some area already.

Delimiting columns very specifically

I've got a column (with many thousands of rows) which I'd like to delimit into multiple rows. I have some experience using regular expressions in Excel, and I have some experience using delimiters in excel, but this one is just a tad too hard..
Let me give you three example-lines:
- 23-12-05: For sale for 2000. 2010-09-09: Not found
- 25-11-09: For sale for 3400. Last date found: 2010-07-08
- 18-06-08: For sale for 5500. 21-07-09: Changed from 5500 to 4900. 16-09-09: Jumped from 4900 to 4700. 2010-02-04: Not found
Most other lines follow these structures. How can I create a new column based on just the first symbols before [COLON]; A second column based on the symbols between the first [COLON] and the first [DOT]. How can I continue to the last IF the text LAST DATE is not found? Finally: How can I use regex (or another way) to use the text 'NOT FOUND' to paste the last date into a new column?
Trust me, I have been at this for quite some time now (sigh). Any help is much appreciated!

you can actually use formulas for this.
Assuming the text is in A1,
B1: =LEFT(A1,FIND(":",A1)-1)
C1: =MID(A1,FIND(":",A1)+1,FIND(".",A1,FIND(":",A1))-FIND(":",A1))
D1: =MID(A1,FIND(".",A1,FIND(":",A1))+1,LEN(A1))
E1: =MID(A1,FIND("Not found",A1)-12,10)
(I'm assuming the date format does not change for the E1)

By the way, this also works for me, to get the last date in a cell:
=LOOKUP(9999999999999999,FIND("**-**-**",A1,ROW($1:$1024)))
Only problem here is: I haven't the slightest clue what exactly I am doing here.
For example, I'd like to use the same code to find the FIRST occurence of a date.
Can anyone explain this code to me? Why am I searching for a very high number? What is it in this code that makes that I find the last occurence? What does it mean that the 'starting number' is "row(1:1024)"?
Anybody knows?

I'm going to be teaching a few developers regular expressions - what are some good homework problems? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I'm thinking of presenting questions in the form of "here is your input: [foo], here are the capture groups/results: [bar]" (and maybe writing a small script to test their answers for my results).
What are some good regex questions to ask? I need everything from beginner questions like "validate a 4 digit number" to "extract postal codes from addresses".

A few that I can think off the top of my head:
Phone numbers in any format e.g. 555-5555, 555 55 55 55, (555) 555-555 etc.
Remove all html tags from text.
Match social security number (Finnish one is easy;)
All IP addresses
IP addresses with shorthand netmask (xx.xx.xx.xx/yy)

There's a bunch of examples of various regular expression techniques over at www.regular-expressions.info - everything for simple literal matching to backreferences and lookahead.

To keep things a bit more interesting than the usual email/phone/url stuff, try looking for more original exercises. Avoid boredom.
For example, have a look at the Forsysth-Edwards Notation which is used for describing a particular board position of a chess game.
Have your students validate and extract all the bits of information from a string like this:
rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
Additionaly, have a look at algebraic chess notation, used to describe moves. Extract chess moves out of a piece of text (and make them bold).
1. e4 e5 2. Nf3 Black now defends his pawn 2...Nc6 3. Bb5 Black threatens c4

Validate phone numbers (extract area code + rest of number with grouping) (Assuming US phone number, otherwise generalize for you style)
Play around with validating email address (probably want to tell the students that this is hugely complicated regular expression but for simple ones it is pretty straight forward)

regexplib.com has a good library you can search through for examples.

H0w about extract first name, middle name, last name, personal suffix (Jr., III, etc.) from a format like:
Smith III, John Paul
How about Reg Ex to remove line breaks and tabs from the input

I would start with the common ones:
validate email
validate phone number
separate the parts of a URL

Be cruel. Tell them parse HTML.
RegEx match open tags except XHTML self-contained tags

Are you teaching them theory of finite automata as well?
Here is a good one: parse the addresses of churches correctly from this badly structured format (copy and paste it as text first)
http://www.churchangel.com/WEBNY/newhart.htm

I'm a fan of parsing date strings. Define a few common data formats, as well as time and date-time formats. These are often good exercises because some dates are simple mixes of digits and punctuation. There's a limited degree of freedom in parsing dates.

Just to throw them for a loop, why not reword a question or two to suggest that they write a regular expression to generate data fitting a specific pattern like email addresses, phone numbers, etc.? It's the same thing as validating, but can help them get out of the mindset that regex is just for validation (whereas the data generation tool in visual studio uses regex to randomly generate data).

Rather than teaching examples based from the data set, I would do examples from the perspective of the rule set to get basics across. Give them simple examples to solve that leads them to use ONE of several basic groupings in each solution. Then have a couple of "compound" regex's at the end.
Simple:
s/abc/def/
Spinners and special characters:
s/a\s*b/abc/
Grouping:
s/[abc]/def/
Backreference:
s/ab(c)/def$1/
Anchors:
s/^fred/wilma/
s/$rubble/and betty/
Modifiers:
s/Abcd/def/gi
After this, I would give a few examples illustrating the pitfalls of trying to match html tags or other strings that shouldn't be done with regex's to show the limitations.

Try to think of some tests that don't include ones that can be found with Google.
Asking a email validator should pose no trouble finding..
Try something like a 5 proof test.
Input 5 digit. Sum up each digit must be dividable by five: 12345 = 1+2+3+4+5 = 15 / 5 = 3(.0)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

how to build regular expressions - regex

=regexreplace(B1, "(^.Nazwa firmy"":"")(.)("",""NIP.*$)", "$2")

Related

RegexExtract Syntax - Extract text after decimal/number from string ex. "UNDER 12.5-120"

How to coded reference from an APA a reference in excel?

Avoid duplicate code in Excel IF formula code

Delimiting columns very specifically

I'm going to be teaching a few developers regular expressions - what are some good homework problems? [closed]

Categories

Resources