Parse input from a particular format - regex

Let us say I have the following string: "Algorithms 1" by Robert Sedgewick. This is input from the terminal.
The format of this string will always be:
1. Starts with a double quote
2. Followed by characters (may contain space)
3. Followed by double quote
4. Followed by space
5. Followed by the word "by"
6. Followed by space
7. Followed by characters (may contain space)
Knowing the above format, how do I read this?
I tried using fmt.Scanf() but that would treat a word after each space as a separate value. I looked at regular expressions but I could not make out if there is a function using which I could GET values and not just test for validity.

1) With character search
The input format is so simple, you can simply use character search implemented in strings.IndexRune():
s := `"Algorithms 1" by Robert Sedgewick`
s = s[1:] // Exclude first double quote
x := strings.IndexRune(s, '"') // Find the 2nd double quote
title := s[:x] // Title is between the 2 double quotes
author := s[x+5:] // Which is followed by " by ", exclude that, rest is author
Printing results with:
fmt.Println("Title:", title)
fmt.Println("Author:", author)
Output:
Title: Algorithms 1
Author: Robert Sedgewick
Try it on the Go Playground.
2) With splitting
Another solution would be to use strings.Split():
s := `"Algorithms 1" by Robert Sedgewick`
parts := strings.Split(s, `"`)
title := parts[1] // First part is empty, 2nd is title
author := parts[2][4:] // 3rd is author, but cut off " by "
Output is the same. Try it on the Go Playground.
3) With a "tricky" splitting
If we cut off the first double quote, we may do a splitting by the separator
`" by `
If we do so, we will have exactly the 2 parts: title and author. Since we cut off first double quote, the separator can only be at the end of the title (the title cannot contain double quotes as per your rules):
s := `"Algorithms 1" by Robert Sedgewick`
parts := strings.Split(s[1:], `" by `)
title := parts[0] // First part is exactly the title
author := parts[1] // 2nd part is exactly the author
Try it on the Go Playground.
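The three snippets above index into the string or slice directly, so they panic if the input does not follow the stated format. A minimal sketch of the same "tricky" split with validation, assuming Go 1.20+ for strings.CutPrefix (strings.Cut needs 1.18+); parseCut is a hypothetical helper name, not something from the original answer:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseCut (hypothetical helper) splits `"<title>" by <author>` and returns
// an error instead of panicking when the input is malformed.
func parseCut(s string) (title, author string, err error) {
	rest, ok := strings.CutPrefix(s, `"`) // drop the opening quote
	if !ok {
		return "", "", errors.New("input must start with a double quote")
	}
	// Split at the first `" by `; the title cannot contain a double quote.
	title, author, ok = strings.Cut(rest, `" by `)
	if !ok || author == "" {
		return "", "", errors.New(`missing '" by <author>' part`)
	}
	return title, author, nil
}

func main() {
	title, author, err := parseCut(`"Algorithms 1" by Robert Sedgewick`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Title:", title)
	fmt.Println("Author:", author)
}
```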
4) With regexp
If after all the above solutions you still want to use regexp, here's how you could do it:
Use parentheses to define the submatches you want to get out. You want 2 parts: the title between quotes and the author that follows by. You can use regexp.FindStringSubmatch() to get the matching parts. Note that the first element in the returned slice is the complete match (here, the whole input), so the relevant parts are the subsequent elements:
s := `"Algorithms 1" by Robert Sedgewick`
r := regexp.MustCompile(`"([^"]*)" by (.*)`)
parts := r.FindStringSubmatch(s)
title := parts[1] // First part is always the complete match, 2nd part is the title
author := parts[2] // 3rd part is exactly the author
Try it on the Go Playground.
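FindStringSubmatch() returns nil when the input does not match, so indexing parts[1] on malformed input would panic. A minimal sketch that checks for that, assuming an anchored variant of the pattern (the ^ and $ are my addition) and a hypothetical helper name parseBook:

```go
package main

import (
	"fmt"
	"regexp"
)

// bookRe is anchored with ^ and $ (an assumption, not in the answer above)
// so that leading or trailing junk is rejected rather than ignored.
var bookRe = regexp.MustCompile(`^"([^"]*)" by (.+)$`)

// parseBook (hypothetical helper) returns ok=false instead of panicking
// when the input does not have the expected shape.
func parseBook(s string) (title, author string, ok bool) {
	m := bookRe.FindStringSubmatch(s)
	if m == nil { // no match: m[1] would be out of range
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	inputs := []string{
		`"Algorithms 1" by Robert Sedgewick`,
		`Algorithms 1 by Robert Sedgewick`, // malformed: no quotes
	}
	for _, in := range inputs {
		if title, author, ok := parseBook(in); ok {
			fmt.Printf("Title: %s, Author: %s\n", title, author)
		} else {
			fmt.Printf("no match: %s\n", in)
		}
	}
}
```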

You should use groups (parentheses) to get out the information you want:
"([\w\s]*)"\sby\s([\w\s]+)\.
This returns two groups:
[1-13] Algorithms 1
[18-34] Robert Sedgewick
Now there should be a regex method to get all matches out of a text. The result will contain a match object which then contains the groups.
I think in go it is: FindAllStringSubmatch
(https://github.com/StefanSchroeder/Golang-Regex-Tutorial/blob/master/01-chapter2.markdown)
Test it out here:
https://regex101.com/r/cT2sC5/1
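For reference, a minimal Go sketch of FindAllStringSubmatch on text holding several entries. The pattern here is a simplified variant without the trailing \. (my assumption, so that entries need not end with a period), the second sample line is made up, and allBooks is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// entryRe captures the quoted title and the rest of the line as the author.
var entryRe = regexp.MustCompile(`"([^"]*)" by ([^\n]+)`)

// allBooks (hypothetical helper) returns one {title, author} pair per match.
func allBooks(text string) [][2]string {
	var out [][2]string
	// -1 means "return all matches"; each m is [full match, group 1, group 2].
	for _, m := range entryRe.FindAllStringSubmatch(text, -1) {
		out = append(out, [2]string{m[1], m[2]})
	}
	return out
}

func main() {
	// The second line is invented sample data.
	text := `"Algorithms 1" by Robert Sedgewick
"Some Other Book" by Jane Doe`
	for _, b := range allBooks(text) {
		fmt.Printf("Title: %q, Author: %q\n", b[0], b[1])
	}
}
```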

Related

RegEx to format Wikipedia's infoboxes code [SOLVED]

I am a contributor to Wikipedia and I would like to make a script with AutoHotKey that could format the wikicode of infoboxes and other similar templates.
Infoboxes are templates that display a box on the side of articles and show the values of the parameters entered (they are numerous and they differ in number, length and type of characters used depending on the infobox).
Parameters are always preceded by a pipe (|) and end with an equal sign (=). On rare occasions, multiple parameters can be put on the same line, but I can sort this manually before running the script.
A typical infobox will be like this:
{{Infobox XYZ
| first parameter = foo
| second_parameter =
| 3rd parameter = bar
| 4th = bazzzzz
| 5th =
| etc. =
}}
But sometimes, (lazy) contributors put them like this:
{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}
Which isn't very easy to read and modify.
I would like to know if it is possible to make a regex (or a series of regexes) that would transform the second example into the first.
The lines should start with a space, then a pipe, then another space, then the parameter name, then any number of spaces (to match the length of the other lines), then an equal sign, then another space, and if present, the parameter value.
I tried some things using multiple capturing groups, but I'm going nowhere... (I'm even ashamed to show my attempts as they really don't work).
Would someone have an idea on how to make it work?
Thank you for your time.
The lines should start with a space, then a pipe, then another space, then the parameter name, then a space, then an equal sign, then another space, and if present, the parameter value.
First the selection, it's relatively trivial:
^\s*\|\s*([^=]*?)\s*=(.*)$
Then the replacement, literally your description of what you want (note the space at the beginning):
| $1 = $2
See it in action here.
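The same select-and-replace idea can be sketched outside a regex tester. The Go version below is only an illustration under a few assumptions of mine: [ \t]* replaces \s* so that a multiline match cannot run across a newline, spaces after the equal sign are consumed, and trailing blanks are trimmed for empty values; normalize is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// paramRe: (?m) makes ^ and $ match at line boundaries. Group 1 is the
// parameter name (lazy, stops before "="), group 2 the optional value.
var paramRe = regexp.MustCompile(`(?m)^[ \t]*\|[ \t]*([^=\n]*?)[ \t]*=[ \t]*(.*)$`)

// normalize (hypothetical helper) rewrites each "|name=value" line as
// " | name = value"; lines that are not parameters are left untouched.
func normalize(wikitext string) string {
	return paramRe.ReplaceAllStringFunc(wikitext, func(line string) string {
		m := paramRe.FindStringSubmatch(line) // always matches: line is a match
		return strings.TrimRight(" | "+m[1]+" = "+m[2], " ")
	})
}

func main() {
	src := `{{Infobox XYZ
|first parameter=foo
|second_parameter=
|4th=bazzzzz
}}`
	fmt.Println(normalize(src))
}
```

This does not align the equal signs; it only normalizes the spacing on each line.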
@Blindy:
The best code I have found so far is the following : https://regex101.com/r/GunrUg/1
The problem is it doesn't align the equal signs vertically...
I got an answer on AutoHotKey forums:
^i::
    out := ""
    Send, ^x
    regex := "O)\s*\|\s*(.*?)\s*=\s*(.*)", width := 1
    Loop, Parse, Clipboard, `n, `r
        If RegExMatch(A_LoopField, regex, _)
            width := Max(width, StrLen(_[1]))
    Loop, Parse, Clipboard, `n, `r
        If RegExMatch(A_LoopField, regex, _)
            out .= Format(" | {:-" width "} = {2}", _[1],_[2]) "`n"
        else
            out .= A_LoopField "`n"
    Clipboard := out
    Send, ^v
Return
With this script, pressing Ctrl+i formats the infobox code just right (I guess a simple regex isn't enough to do the job).
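The script works in two passes: first find the longest parameter name, then rewrite every parameter line padded to that width. A minimal Go sketch of the same two-pass idea, for readers who want to run it outside AutoHotkey; alignInfobox is a hypothetical name, and the trimming of trailing blanks is my addition:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// lineRe is applied to one line at a time: name in group 1, value in group 2.
var lineRe = regexp.MustCompile(`^\s*\|\s*(.*?)\s*=\s*(.*)$`)

// alignInfobox (hypothetical helper) pads names so the equal signs line up.
func alignInfobox(src string) string {
	lines := strings.Split(src, "\n")
	width := 1
	for _, l := range lines { // pass 1: longest parameter name, in characters
		if m := lineRe.FindStringSubmatch(l); m != nil {
			if n := utf8.RuneCountInString(m[1]); n > width {
				width = n
			}
		}
	}
	for i, l := range lines { // pass 2: rewrite only the parameter lines
		if m := lineRe.FindStringSubmatch(l); m != nil {
			// %-*s left-aligns the name in a field of `width` characters.
			padded := fmt.Sprintf(" | %-*s = %s", width, m[1], m[2])
			lines[i] = strings.TrimRight(padded, " ")
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	src := `{{Infobox XYZ
|first parameter=foo
|second_parameter=
|4th=bazzzzz
}}`
	fmt.Println(alignInfobox(src))
}
```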

How to extract real name, first name, lastname, and create nickname from name that have acronyms, aristocratic title, academic titles/degrees

I am trying to extract the name, first name and last name, and to create a nickname from the first name and a nickname from the last name, from a list of names that contain acronyms, aristocratic titles, academic titles/degrees, and nicknames inside parentheses or after a slash or backslash, using the REGEX function in LibreOffice Calc. This is the expected result
The rules are:
Expected Name: remove the nickname inside parentheses or after a slash/backslash (if any), job titles, and academic titles and degrees, but keep (R) or (R.), (Ra) or (Ra.), a single letter with or without a dot, e.g. (A.), (A), (M.), and name acronyms with or without a dot, e.g. (Muh.), (Moh), etc.
Firstname: (from Expected Name) extract the first name, including the dot in an acronym, e.g. (R.), (Ra.), (Muh.), or a single letter with or without a dot, e.g. (A.), (A), (M.)
Lastname: (from Expected Name) extract the last name, including any dot
Nickname-fname: from the Input Name, extract the nickname inside parentheses or after a slash/backslash, if present. Otherwise, extract the first name, provided it is not a single character, an acronym with or without a dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". If it is, use the next word in the name, even if that is the last word. If the name consists of only one word, extract it even if it is a single character
Nickname-lname: from the Input Name, extract the nickname inside parentheses or after a slash/backslash, if present. Otherwise, extract the last name, provided it is not a single character or an acronym with or without a dot. If it is, use the previous word in the name, even if that is the first word. If the name consists of only one word, extract it even if it is a single character
I tried the following:
=REGEX($A2,"\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$|,.*","")
to extract the real name. It removes a nickname inside parentheses or after a slash/backslash, with or without spaces, like
( Nita ) in Yunita ( Nita )
( Nita ) in Yunita( Nita )
(Nita) in Yunita (Nita)
(Nita) in Yunita(Nita)
( Nita) in Yunita ( Nita)
( Nita) in Yunita( Nita)
(Nita ) in Yunita (Nita )
(Nita ) in Yunita(Nita )
\Nita in Yunita\Nita
\ Nita in Yunita\ Nita
\Nita in Yunita \Nita
\ Nita in Yunita \ Nita
...
and so on
It also removes academic degrees like Ph.D., but it fails on Ra. Ayu S. Ph.D. (because there is no comma after S.). It also fails on Prof. and Dr., and I want to keep Ra.
=REGEX($A23,"(?:^|(?:[.!?]\s))([\w.]+)")
to extract first name but failed on Moh.Ali (yes, without space after dot).
=REGEX($A23,"\b(?<last>[\w\[.\]\\]+)$")
to extract last name but also failed on Moh.Ali.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<first>\w+)$)")
to create a nickname from a given nickname inside parentheses or after a slash or backslash. If no nickname exists, create one from the first name, provided it is not a single character, an acronym with or without a dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". If the first name does not meet this condition, use the next word in the name, even if it is the last word. If the name consists of only one word, extract it even if it is a single character. The regex failed in most cases.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<last>\w+)$)")
to create a nickname from a given nickname inside parentheses or after a slash or backslash. If no nickname exists, create one from the last name, provided it is not a single character or an acronym with or without a dot. If the last name does not meet this condition, use the previous word in the name, even if it is the first word. If the name consists of only one word, extract it even if it is a single character. It failed in most cases too.
This is the result of the regex:
But I'm new to regex and stuck. Please help me
The parsing rules are rather elaborate, especially for nicknames, so
I suggest using a multi-step process involving 2 helper columns and
the non-advanced (POSIX ERE plus \b) features of the available ICU
regular expressions combined with a few spreadsheet formulas.
If you copy the formulas listed and commented below to columns B
through H in your spreadsheet you should find that
conditional formatting highlights a total of 5 cells, all nicknames:
1 for Jasmine, 2 for Imah, and 2 for Nur.. For the first 3
the expected result contradicts the nickname rules; removing
the dot in Nur. is left as an exercise.
I used the input as it is but in cases like this pre-processing is
the key to avoid complex or special-case regexes, e.g. insert a blank
in names like M.Ali, or prefix a comma to comma-less Ph.D..
Aside: When you post source data as an image rather than plaintext
what's the likelihood of someone OCR'ing the image and editing the
text?
Step 1
Add auxiliary column AuxSplit separating the input string into its
3 main groups -- name, titles, nickname (last 2 optional) -- and
joining them with #. NB: # is used for this reply but it
must be a non-meta character not appearing in any input string.
REGEX(A22;"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$";"$1#$2#$5")
where
$ anchors regex to end of string
$5 is the captured optional nickname (incl. any delimiters and blanks)
$2 the optional suffixed titles (incl. any comma and blanks);
note that the comma-less Ph.D is special-cased
$1 the name (incl. any leading blanks)
Sample content: Jasmine Fianna#, B.A., M.A.#/Jasmine
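To experiment with the Step 1 expression outside the spreadsheet, here is a minimal Go sketch. It assumes the expression behaves the same under Go's regexp package as under ICU for these inputs, adds a ^ anchor (my addition), and assumes the input that produced the sample content above was "Jasmine Fianna, B.A., M.A./Jasmine"; auxSplit is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
)

// splitRe is the Step 1 expression with a leading ^ added.
// Group 1: name, group 2: suffixed titles, group 5: nickname part.
var splitRe = regexp.MustCompile(`^(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$`)

// auxSplit (hypothetical helper) joins the three main groups with "#",
// mirroring the "$1#$2#$5" replacement of the spreadsheet formula.
func auxSplit(input string) string {
	m := splitRe.FindStringSubmatch(input)
	if m == nil { // e.g. input containing a newline
		return input
	}
	return m[1] + "#" + m[2] + "#" + m[5]
}

func main() {
	// Assumed inputs; the original data was only posted as an image.
	for _, in := range []string{
		"Jasmine Fianna, B.A., M.A./Jasmine",
		"Ra. Ayu S. Ph.D.",
	} {
		fmt.Println(auxSplit(in))
	}
}
```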
Step 2
Add auxiliary column AuxNick extracting an explicit nickname (if any)
from 3rd main group (after last #) in AuxSplit.
REGEX(G22;".*#[\\/( ]*([^ )]*)[ )]*$";"$1")
Capturing string after \ or / or between (), trimming off blanks.
Sample content: Agus
Step 3
Extract ExpectedName from 1st group (up to first #) in AuxSplit
stripping off any prefixed titles.
REGEX(G22;"^((Prof|Dr)\.[ ]*)?([^#]*).*";"$3")
Note that these titles are special-cased as they have the same form
as an abbreviated name.
Sample content: Q. Ranita El
Step 4
From ExpectedName extract Firstname,
REGEX(B22;"^(Ra?\.?|[^.]+\.|[^ ]+)")
stripping off R/R./Ra/Ra. and blanks from start of string,
and Lastname
REGEX(B22;"[^ .]*[^ ]$")
extracting the last character sequence after blank/dot;
any trailing blanks were removed in step 1.
Note: with M.Ali as M. Ali simplify Lastname formula to
REGEX(B22;"[^ ]+$")
Step 5
Extract Lastnick from AuxNick, LastName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(D22;"\b([^ ]+\.|[^ .])$"));D22;REGEX(B22;".*?([^ ]+)[ ]+([^ ]*\.|[^.])$";"$1")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(D22;"…";D22 : else use Lastname unless abbrev. or 1-char.
;REGEX(B22;"…";"$1"))) : else use previous word in name, exploiting the
fact that an unmatched replacement returns the entire string
and Firstnick from AuxNick, FirstName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(B22;"^([^ ]+\.|[^ ]\b|I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b)"));C22;REGEX(B22;"^(I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b|.*?)[ .]\b([^ ]+).*";"$4")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(B22;"…";C22 : else use FirstName unless abbrev. / 1-char. / special case
;REGEX(B22;"…";"$4"))) : else use next word in name, exploiting the
fact that an unmatched replacement returns the entire string
TODO: remove trailing dot from Nur.
Recall the (LibreOffice 6.2+) REGEX syntax:
Syntax: REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags|Occurrence ] ] )
Expression: A text representing the regular expression, using ICU
regular expressions (https://unicode-org.github.io/icu/userguide/strings/regexp.html).
If there is no match and Replacement is not given, #N/A is returned.
Replacement: Optional. The replacement text and references to capture
groups. If there is no match, Text is returned unmodified.

REGEX - Automatic text selection and restructuring

I am kinda new to AHK, I've written some scripts. But with my latest script, I'm kind of stuck with REGEX in AHK.
I want to make the report of a structure of texts I make.
To do this I've set up a system:
1. Sentences ending in '.' are the important sentences, output with "-" (variable 'Vimportant'), BUT WITHOUT those containing the words listed for 'Vanecdotes2' or 'Vdelete2' (cf. point 4).
2. Sentences ending in '.*' are the anecdotes (variable 'Vanecdotes1'), where I've manually put a star after the period.
3. Sentences ending in '.!' are irrelevant sentences that need to be deleted (variable 'Vdelete1'), where I've manually put an exclamation mark after the period.
4. An extra option I want to implement is a list of words to detect in a sentence, so that the sentence is automatically added to the variable 'Vanecdotes2' or 'Vdelete2'.
A random example would be this (I have already put ! and * after the sentences (why is not important), and "acquisition" is an example of a Vanecdotes2 word from point 4 above):
Last procedure on 19/8/2019.
Normal structure x1.!
Normal structure x2.!
Aberrant structure x3, needs follow-up within 2 months.
Structure x4 is lower in activity, but still above p25.*
Aberrant structure x4, needs follow-up within 6 weeks.
Normal structure x5.!
Good acquisition of x6.
So the output of the Regex in the variables should be
Last procedure on 19/8/2019. --> Regex '.' = Vimportant
Normal structure x1.! --> regex '.!' --> Vdelete1
Normal structure x2.! --> regex '.!' --> Vdelete1
Aberrant structure x3, needs follow-up within 2 months. --> Regex '.' = Vimportant
Structure x4 is lower in activity, but still above p25.* --> regex '.*' = Vanecdotes1
Aberrant structure x4, needs follow-up within 6 weeks. --> Regex '.' = Vimportant
Normal structure x5.! --> regex '.!' --> Vdelete1
Good acquisition of x6. --> Regex 'sentence with the word acquisition' = Vanecdotes2
And the output should be:
- Last procedure on 19/8/2019.
- Aberrant structure x3, needs follow-up within 2 months.
- Aberrant structure x4, needs follow-up within 6 weeks.
. Structure x4 is lower in activity, but still above p25.
. Good acquisition of x6.
But I have been having a lot of trouble with the regex, especially with the selection of sentences ending in * or !, and also with the exclusion criteria; they just don't work.
Because AHK doesn't have a really good tester, I first tested it in another regex tester and was planning to 'translate' it to AHK code later on, but it just doesn't work. (So I know that in the script below I'm using AHK language with non-AHK regex; I've just put the two together for illustration.)
This is what I have now:
Send ^c
clipwait, 1000
Temp := Clipboard
Regexmatch(Temp, "^.*[.]\n(?!^.*\(Anecdoteword1|Anecdoteword2|deletewordX|deletewordY)\b.*$)", Vimportant)
Regexmatch(Temp, "^.*[.][*]\n")", Vanecdotes1)
Regexmatch(Temp, "^.*[.][!]\n")", Vdelete1)
Regexmatch(Temp, "^.*\b(Anecdoteword1|Anecdoteword2)\b.*$")", Vanecdotes2)
Regexmatch(Temp, "^.*\b(deletewordX|deletewordY)\b.*$")", Vdelete2)
Vanecdotes_tot := Vanecdotes1 . Vanecdotes2
Vdelete_tot := Vdelete1 . Vdelete2
Vanecdotes_ster := "* " . StrReplace(Vanecdotes_tot, "`r`n", "`r`n* ")
Vimportant_stripe := "- " . StrReplace(Vimportant, "`r`n", "`r`n- ")
Vresult := Vimportant_stripe . "`n`n" . Vanecdotes_ster
For "translation to AHK" I tried to make ^.*\*'n from the working (non ahk) regex ^.*[.][*]\n.
There isn't really such a thing as AHK regex. AHK pretty much uses PCRE, apart from the options.
So don't try to turn a linefeed \n into an AHK linefeed `n.
And there seem to be some syntax errors in your regexes. Not quite sure what those extra ") in there are supposed to be. Also, instead of [.][*], the more conventional form is \.\*. Both match a literal dot followed by a literal star; the \ escapes the characters' normal meaning (any character, and match between zero and unlimited times).
[] is to match any character in that group, like if you wanted to match either . or * you'd do [.*].
And seems like you got the idea of using capture groups, but just in case, here's a minimal example about them:
RegexMatch("TestTest1233334Test", "(\d+)", capture)
MsgBox, % capture
And lastly, about your approach to the problem, I'd recommend looping through the input line by line. It'll be much better/easier. Use e.g. Loop, Parse.
Minimal example for it as well:
inp := "
(
this is
a multiline
textblock
we're going
to loop
through it
line by line
)"
Loop, Parse, inp, `n, `r
    MsgBox, % "Line " A_Index ":`n" A_LoopField
Hope this was of help.
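To make the line-by-line suggestion concrete, here is a minimal sketch of the classification logic in Go (not AutoHotkey), so it can be run and tested on its own. The function names, the "* " anecdote bullet (taken from the question's script), the case-insensitive substring matching of keywords, and the sample text are all assumptions of mine:

```go
package main

import (
	"fmt"
	"strings"
)

// hasAny reports whether line contains any of the (lower-case) words.
// This is a plain substring test, not a whole-word match.
func hasAny(line string, words []string) bool {
	lower := strings.ToLower(line)
	for _, w := range words {
		if strings.Contains(lower, w) {
			return true
		}
	}
	return false
}

// report (hypothetical helper) classifies each line by its ending or by
// keyword and builds the final text: important lines, blank line, anecdotes.
func report(text string, anecdoteWords, deleteWords []string) string {
	var important, anecdotes []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line) // also drops a trailing \r
		switch {
		case line == "":
			// skip blank lines
		case strings.HasSuffix(line, ".!") || hasAny(line, deleteWords):
			// irrelevant sentence: dropped
		case strings.HasSuffix(line, ".*"):
			anecdotes = append(anecdotes, "* "+strings.TrimSuffix(line, "*"))
		case hasAny(line, anecdoteWords):
			anecdotes = append(anecdotes, "* "+line)
		case strings.HasSuffix(line, "."):
			important = append(important, "- "+line)
		}
	}
	return strings.Join(important, "\n") + "\n\n" + strings.Join(anecdotes, "\n")
}

func main() {
	text := `Last procedure on 19/8/2019.
Normal structure x1.!
Structure x4 is lower in activity, but still above p25.*
Good acquisition of x6.`
	fmt.Println(report(text, []string{"acquisition"}, nil))
}
```

Checking the suffixes and keywords per line avoids the multi-line anchoring and lookahead problems of matching the whole clipboard with a single regex.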
This is where I am up till now; nothing works (I will try the suggested loop once the regex is working):
^m::
BlockInput, On
MouseGetPos, , ,TempID, control
WinActivate, ahk_id %TempID%
if WinActive("Pt.")
Send ^c
clipwait, 1000
Temp := Clipboard
Regexmatch(Temp, "(^(?:..\n)((?! PAX|PAC|Normaal|Geen).)$)", Vimportant)
Vimportant := Vimportant.1
Regexmatch(Temp, "(^..*\n)", Vanecdotes1_ster)
Regexmatch(Temp, "(^..!\n)" , Vdelete1_uitroep)
Regexmatch(Temp, "(^.\b(PAX|PAC)\b.$)", Vanecdotes2)
Regexmatch(Temp, "(^.\b(Normaal|Geen)\b.$)", Vdelete2)
Vanecdotes1 := StrReplace(Vanecdotes1_ster, ".*", ".")
Vdelete1 := StrReplace(Vdelete1_uitroep, ".!", ".")
Vanecdotes_tot := Vanecdotes1 . Vanecdotes2
Vdelete_tot := Vdelete1 . Vdelete2
Vanecdotes_ster := "* " . StrReplace(Vanecdotes_tot, "`r`n", "`r`n* ")
Vimportant_stripe := "- " . StrReplace(Vimportant, "`r`n", "`r`n- ")
Vresult := Vimportant_stripe . "`n`n" . Vanecdotes_ster
Clipboard := Vresult
Send ^v
return

Very slow RegEx in AHK yet fast in Notepad++

I'd like to find a certain string in a webpage. I decided to use RegEx. (I know my RegExes are quite terrible, however, they work). My two expressions are very fast when used in Notepad++ (probably < 1s) and on Regex101, but they are horribly slow when used in AutoHotKey – about 2-5 minutes. How do I fix this?
sWindowInfo2 = http://www.archiwum.wyborcza.pl/Archiwum/1,0,4583161,20060208LU-DLO,Dzis_bedzie_Piast,.html
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", sWindowInfo2, false ), whr.Send()
whr.ResponseText
sPage := ""
sPage := whr.ResponseText
; get city name (if exists) – the following is very slooooow
if RegExMatch(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+")
{
sCity := RegExReplace(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+", "$1")
;MsgBox, % sCity
city := 1
}
if RegExMatch(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+")
{
sCity := RegExReplace(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+", "$1")
city := 1
}
EDIT:
In the page I provided the match is Lublin. Have a look at: https://regex101.com/r/qJ2pF8/1
You do not need to use RegExReplace to get the captured value. As per reference, you can pass the 3rd var into RegExMatch:
OutputVar
OutputVar is the unquoted name of a variable in which to store a match object, which can be used to retrieve the position, length and value of the overall match and of each captured subpattern, if any are present.
So, use a much simpler pattern:
FoundPos := RegExMatch(sPage, "<metryczka>GW\s(.+)\snr", SubPat) ;
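It will return the position of the match and store "Lublin" in the first captured subpattern: SubPat1 with AutoHotkey v1's default output mode, or SubPat[1] when a match object is used (the O) option in v1, or v2 with &SubPat).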
With this pattern, you avoid heavy backtracking you had with [\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+ as the first [\s\S]+ matched up to the end of the string, and then backtracked to accommodate for the subsequent subpatterns. The longer the string, the slower the operation is.
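The same "search for the short pattern and read the capture group" approach, sketched in Go for illustration. Note that Go's regexp package is RE2-based and runs in linear time, so it would not reproduce the slowdown described above; the sketch only shows the unanchored search. The page fragment is made up, and findCity is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
)

// cityRe is the short pattern from the answer: no leading or trailing
// [\s\S]+, because an unanchored search already scans the whole text.
var cityRe = regexp.MustCompile(`<metryczka>GW\s(.+)\snr`)

// findCity (hypothetical helper) returns the captured city name,
// or "" if the marker is not present in the page.
func findCity(page string) string {
	m := cityRe.FindStringSubmatch(page)
	if m == nil {
		return ""
	}
	return m[1] // m[0] is the whole match, m[1] the first group
}

func main() {
	// Invented fragment shaped like the markup in the question.
	page := "<html>...<metryczka>GW Lublin nr 33, wydanie z dnia 08/02/2006</metryczka>...</html>"
	fmt.Println(findCity(page))
}
```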

Matlab Extracting sub string from cell array

I have a '3 x 1' cell array the contents of which appear like the following:
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'
I want to be able to elegantly extract and create another '3x1' cell array with contents as:
'NEWYORK'
'WASHINGTON'
'BERLIN'
As you can see above, the names come after the last underscore and before the first space or '_ML'. How do I write such code in a concise manner?
Thanks
Edit:
Sorry guys I should have used a better example. I have it corrected now.
You can use lookbehind for _ and lookahead for space:
names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once');
Where A is the cell array containing the strings:
A = {...
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'};
>> names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once')
names =
'NEWYORK'
'WASHINGTON'
'BERLIN'
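For engines without lookaround support, a capture group gives the same result. A minimal sketch in Go (whose regexp package has no lookbehind or lookahead), using the three strings from the question; extractNames is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
)

// nameRe: an underscore, then a run of non-space, non-underscore characters
// (captured), then whitespace. The group replaces the lookbehind/lookahead.
var nameRe = regexp.MustCompile(`_([^\s_]*)\s`)

// extractNames (hypothetical helper) returns the captured name for each
// item, or "" where nothing matches, so positions stay aligned with the input.
func extractNames(items []string) []string {
	names := make([]string, 0, len(items))
	for _, s := range items {
		if m := nameRe.FindStringSubmatch(s); m != nil {
			names = append(names, m[1])
		} else {
			names = append(names, "")
		}
	}
	return names
}

func main() {
	a := []string{
		"ASDF_LE_NEWYORK Fixedafdfgd_ML",
		"Majo_LE_WASHINGTON FixedMonuts_ML",
		"Array_LE_dfgrt_fdhyuj_BERLIN Potato Price",
	}
	for _, n := range extractNames(a) {
		fmt.Println(n)
	}
}
```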
NOTE: The question was changed, so the answer is no longer complete, but hopefully the regexp example is still useful.
Try regexp like this:
names = regexp(fullNamesCell,'_(NAME\d?)\s','tokens');
names = cellfun(@(x)(x{1}),names)
In the pattern _(NAME\d?)\s, the parentheses define a subexpression, which will be returned as a token (a portion of matched text). The \d? specifies zero or one digits, but you could use \d{1} for exactly one digit or \d{1,3} if you expect between 1 and 3 digits. The \s specifies whitespace.
The reorganization of names is a little convoluted, but when you use regexp with a cell input and tokens you get a cell of cells that needs some reformatting for your purposes.