Silver Searcher: how to return filename without path - ag

I am using Silver Searcher to search my Calibre library, which by default uses long directory names and filenames that are somewhat redundant. Example search:
chris#ODYSSEUS:~/db/ebooks/paper-art$ ag --markdown angel
Christophe Boudias (Editor)/Origami Bogota 2014 (Paginas de Origami) (2)/Origami Bogota 2014 (Paginas de Origami) - Christophe Boudias (Editor).md
8:* [16] Angel (???)
9:* [22] Christmas Angel (Uniya Filonova)
Juan Fernando Aguilera (Editor)/Origami Bogota 2013 (Paginas de Origami) (1)/Origami Bogota 2013 (Paginas de Origami) - Juan Fernando Aguilera (Editor).md
29:* [96] Inspired Origami Angel (K. Dianne Stephens)
31:* [100] Angel for Eric Joisel (Kay Kraschewski)
I would like to return just the filename where the whole path is shown in the example. How can I do that?

The -l (lowercase L) flag returns only the names of the files with matches instead of the matched lines.
e.g.
$ ag -l "angel"
You can then pipe the output into sed to remove everything up to and including the final /, which leaves just the filename.
ag -l angel | sed 's=.*/=='
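As a rough illustration of what the sed step does, here is a minimal Python sketch. Python is used only for illustration, and the helper name strip_dirs is hypothetical; posixpath.basename keeps whatever follows the final /.

```python
import posixpath

def strip_dirs(paths):
    """Return only the final path component of each path, like sed 's=.*/=='."""
    return [posixpath.basename(p) for p in paths]

# One of the paths from the example search in the question.
matches = [
    "Christophe Boudias (Editor)/Origami Bogota 2014 (Paginas de Origami) (2)/"
    "Origami Bogota 2014 (Paginas de Origami) - Christophe Boudias (Editor).md",
]

for name in strip_dirs(matches):
    print(name)
```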

Validate Title Case Full Name with Regex

To learn Regex, I have been solving practice problems. This is one of them. I know Regex might not be the best tool for it, and my Regex is a mess, but I liked the challenge.
Problem:
Names need to be Title Case;
There are exceptions for some lowercase words inside the name;
Some names with internal capitals or apostrophes are accepted, e.g.: McDonald, MacDuff, D'Estoile
Names with ' and - are accepted, and sometimes they appear as o'Brien, O'brien, O'Brien, O' Brien or 'Ehu Kali.
No whitespace at the beginning or end of the name;
No more than one space between the names in a full name;
A . is accepted if not alone, e.g.: Dan . Ferdnand (isn't accepted) and Dan G. Ferdnand (is accepted)
Numbers and symbols are not accepted
However, Roman numerals are accepted and aren't Title Case, e.g.: Elizabeth II
Some names can stand alone, e.g.: Akihito (Prince of Japan)
Some special characters common in some countries are accepted, e.g.: Valeh ßlÿsgÿroğlu, Lażżru Role, Alaksiej Taraškievič
Regex
The code is
^(?![ ])(?!.*(?:\d|[ ]{2}|[!$%^&*()_+|~=`\{\}\[\]:";<>?,\/]))(?:(?:e|da|do|das|dos|de|d'|la|las|el|los|l'|al|of|the|el-|al-|di|van|der|op|den|ter|te|ten|ben|ibn)\s*?|(?:[A-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð'][^\s]*\s*?)(?!.*[ ]$))+$
And the Regex101 with a validation list
References
What I have tried so far was based on these:
regular expression for first and last name
Regular Expression to disallow two consecutive white spaces in the middle of a string
A regex to test if all words are title-case
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
Use Regex to Split Numbered List array into Numbered List Multiline
Not working
I wrote this Regex but don't know how to make it reject the cases below, which currently match:
CAPITAL LETTER
AlTeRnAtE LeTtEr
And these don't match but should:
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Question
Is there a way to optimize this Regex (monster)?
And how do I fix the problems listed under Not working?
P.S.: The list of names with examples for validation can be found at the Regex101 link.
Brief
Since you're learning Regex and haven't specified a regex flavour, I've chosen PCRE because it is widely supported.
Code
See this regex in use here
(?(DEFINE)
(?# Definitions )
(?<valid_nameChars>[\p{L}\p{Nl}])
(?<valid_nonNameChars>[^\p{L}\p{Nl}\p{Zs}])
(?<valid_startFirstName>(?![a-z])[\p{L}'])
(?<valid_upperChar>(?![a-z])\p{L})
(?<valid_nameSeparatorsSoft>[\p{Pd}'])
(?<valid_nameSeparatorsHard>\p{Zs})
(?<valid_nameSeparators>(?&valid_nameSeparatorsSoft)|(?&valid_nameSeparatorsHard))
(?# Invalid combinations )
(?<invalid_startChar>^[\p{Zs}a-z])
(?<invalid_endChar>.*[^\p{L}\p{Nl}.\p{C}]$)
(?<invalid_unaccompaniedSymbol>.*(?&valid_nameSeparatorsHard)(?&valid_nonNameChars)(?&valid_nameSeparatorsHard))
(?<invalid_overTwoUpper>(?:(?&valid_nameChars)*\p{Lu}){3})
(?<invalid>(?&invalid_startChar)|(?&invalid_endChar)|(?&invalid_unaccompaniedSymbol)|(?&invalid_overTwoUpper))
(?# Valid combinations )
(?<valid_name>(?:(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*(?&valid_nameChars)+(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*)+\.?)
(?<valid_firstName>(?&valid_startFirstName)(?:\.|(?&valid_name)*))
(?<valid_multipleName>(?&valid_firstName)(?=.*(?&valid_nameSeparators)(?&valid_upperChar))(?:(?&valid_nameSeparatorsHard)(?&valid_name))+)
(?<valid>(?&valid_multipleName)|(?&valid_firstName))
)
^(?!(?&invalid))(?&valid)$
Results
Input
== 1NcOrrect N4M3S ==
CAPITAL LETTER
AlTeRnAtE LeTtEr
Natalia maria
Natalia aria
Natalia orea
Maria dornelas
Samuel eto'
Miguel lasagna
Antony1 de Home Ap*ril
Ap*ril Willians
Antony_ de Home Apr+il
Ant_ony de Home Apr#il
Antony# de Ho#me Apr^il
Maria Silva
Maria silva
maria Silva
Maria Silva
Maria Silva
Maria / Silva
Maria . Silva
John W8
==Correct Names==
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Hind ibn Sheik
Colop-U-Uichikin
Lażżru Role
Alaksiej Taraškievič
Petruso Husoǔski
Sumu-la-El
Valeh ßlÿsgÿroğlu
'Arab al-Rashayida
Tariq al-Hashimi
Nabeeh el-Mady
Tariq Al-Hashimi
Brian O'Conner
Maria da Silva
Maria Silva
Maria G. Silva
Maria McDuffy
Getúlio Dornelles Vargas
Maria das Flores
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King Jr.
Ai Wong
Chao Chang
Alzbeta Bara
Marcos Assunção
Maria da Silva e Silva
Juscelino Kubitschek de Oliveira
Maria da Costa e Silva
Samuel Eto'o
María Antonieta de las Nieves
Eugène
Antòny de Homé April
àntony de Home ùpril
Antony de Home Aprìl
Pierre de l'Estache
Pierre de L'Estoile
Akihito
Nadine Schröder
Anna A. Møller
D. Pedro I
Pope Benedict XVI
Marsibil Ragnarsdóttir
Natanaël Morel
Isaac De la Croix
Jean-Michel Bozonnet
Qutaibah Mu'tazz Abadi
Rushd Jawna' Kassab
Khaldun Abdul-Qahhar Sabbag
'Awad Bashshar Asker
Al B. Zellweger
Gunnleif Snæ-Ulfsson
Käre Toresson
Sorli Ærnmundsson
Arnkel Øystæinsson
Ástríður Dórey
Åsmund Kåresson
Yahatti-Il
Ipqu-Annunitum
Nabu-zar-adan
Eskopas Cañaverri
Botolph of Langchester
Aelfhun the Cantrell
Fraco di Natale
Fraco Di Natale
Iván de Luca
Iván De Luca
Man'nah
Atabala Aüamusalü
Ramiz Ağasəfalu
Dadaş Aghakhanov
Fÿrxad Mübarizlı
Vaclaǔ Šupa
Yakiv Volacič
Flor Van Vaerenbergh
Flor van Vaerenbergh
Edwin van der Sar
Husein Ekmečić
Álvaro Guimarães Alencar
Phone U Yaza Arkar
Seocan MacGhille
X'wat'e Tlekadugovy
Albert-Jan Bootsveld
Maurits-jan Kuipers op den Kollenstaart
Elco ter Hoek
Robbert te Poele
Aad ten Have
'Ehu Kali
Ho'opa'a Loni
Aukanai'i Mahi'ai
Kalman ben Tal El
Żytomir Roszkowski
K'awai
==EXTRA== only if possible, strange ones
Maol-Moire Mac'IlleBhuidh
Tòmas MacIlleChruim
Aindreas MacIllEathain
Eanruig MacGilleBhreac
Peadar MacGilleDhonaghart
Maolmhuire MacGill-Eain
Eanruig MacGilleBhreac
Wim van 't Plasman
Output
Note: Shown below are only the strings that matched from the above Input
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Hind ibn Sheik
Colop-U-Uichikin
Lażżru Role
Alaksiej Taraškievič
Petruso Husoǔski
Sumu-la-El
Valeh ßlÿsgÿroğlu
'Arab al-Rashayida
Tariq al-Hashimi
Nabeeh el-Mady
Tariq Al-Hashimi
Brian O'Conner
Maria da Silva
Maria Silva
Maria G. Silva
Maria McDuffy
Getúlio Dornelles Vargas
Maria das Flores
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King Jr.
Ai Wong
Chao Chang
Alzbeta Bara
Marcos Assunção
Maria da Silva e Silva
Juscelino Kubitschek de Oliveira
Maria da Costa e Silva
Samuel Eto'o
María Antonieta de las Nieves
Eugène
Antòny de Homé April
àntony de Home ùpril
Antony de Home Aprìl
Pierre de l'Estache
Pierre de L'Estoile
Akihito
Nadine Schröder
Anna A. Møller
D. Pedro I
Pope Benedict XVI
Marsibil Ragnarsdóttir
Natanaël Morel
Isaac De la Croix
Jean-Michel Bozonnet
Qutaibah Mu'tazz Abadi
Rushd Jawna' Kassab
Khaldun Abdul-Qahhar Sabbag
'Awad Bashshar Asker
Al B. Zellweger
Gunnleif Snæ-Ulfsson
Käre Toresson
Sorli Ærnmundsson
Arnkel Øystæinsson
Ástríður Dórey
Åsmund Kåresson
Yahatti-Il
Ipqu-Annunitum
Nabu-zar-adan
Eskopas Cañaverri
Botolph of Langchester
Aelfhun the Cantrell
Fraco di Natale
Fraco Di Natale
Iván de Luca
Iván De Luca
Man'nah
Atabala Aüamusalü
Ramiz Ağasəfalu
Dadaş Aghakhanov
Fÿrxad Mübarizlı
Vaclaǔ Šupa
Yakiv Volacič
Flor Van Vaerenbergh
Flor van Vaerenbergh
Edwin van der Sar
Husein Ekmečić
Álvaro Guimarães Alencar
Phone U Yaza Arkar
Seocan MacGhille
X'wat'e Tlekadugovy
Albert-Jan Bootsveld
Maurits-jan Kuipers op den Kollenstaart
Elco ter Hoek
Robbert te Poele
Aad ten Have
'Ehu Kali
Ho'opa'a Loni
Aukanai'i Mahi'ai
Kalman ben Tal El
Żytomir Roszkowski
K'awai
Maol-Moire Mac'IlleBhuidh
Tòmas MacIlleChruim
Aindreas MacIllEathain
Eanruig MacGilleBhreac
Peadar MacGilleDhonaghart
Maolmhuire MacGill-Eain
Eanruig MacGilleBhreac
Wim van 't Plasman
Explanation
I used a DEFINE block to create named definitions; you can look at each definition to see how it works. In general, I use \p{...}, where ... names a Unicode character category (e.g. \p{L} is any letter from any language). Unicode property classes are not available in every regex flavour, but where they are, they make the regex much simpler, which is why I used them.
If you need anything else explained, don't hesitate to ask me and I'll do my best, but regex101 should be able to explain anything you're wondering about regex.
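For flavours that lack \p{...}, the idea behind the Unicode categories can still be illustrated. The sketch below uses Python's standard unicodedata module; the helper names are hypothetical, and it only shows which characters fall into the letter categories. It does not reimplement the regex above.

```python
import unicodedata

def is_letter(ch):
    # Letter categories are Lu, Ll, Lt, Lm and Lo, which is what \p{L} covers.
    return unicodedata.category(ch).startswith("L")

def is_upper_letter(ch):
    # Category Lu is what \p{Lu} covers.
    return unicodedata.category(ch) == "Lu"

# Print the general category of a few characters from the test names.
for ch in "Əəİ1'":
    print(repr(ch), unicodedata.category(ch), is_letter(ch), is_upper_letter(ch))
```

This is why names such as Urxan Əbűlhəsənzadə pass with \p{L} even though Ə is not listed in the hand-written character class from the question.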

Extracting everything after first two words in R

I am trying to extract all the info, using a regular expression in R, after the first number and first word of an entry in a data frame.
For example:
Header =
c("2006 Volvo XC70",
"2012 Ford Econoline Cargo Van E-250 Commercial",
"2012 Nissan Frontier",
"2012 Kia Soul 5dr Wagon Automatic")
I want to write a pattern that will grab XC70, or Econoline Cargo Van E-250 Commercial (everything after the year and make), from each entry in my "header" column so that I can run the function on my data frame and create a new "model" column. I can't figure out a pattern that skips the first string of digits, then a space, then the first string of characters, then another space, and grabs everything that follows.
Any help would be appreciated. Thanks!
Just use sub.
sub("^\\d+\\s+\\w+\\s+", "", df$x)
Example:
x <- "2012 Ford Econoline Cargo Van E-250 Commercial"
sub("^\\d+\\s+\\w+\\s+", "", x)
# [1] "Econoline Cargo Van E-250 Commercial"
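For comparison, the same pattern can be sketched in Python with re.sub. The function name model_of is hypothetical; the pattern is the one from the sub() call above.

```python
import re

# Same pattern as the sub() call: leading digits, whitespace, one word, whitespace.
YEAR_AND_MAKE = re.compile(r"^\d+\s+\w+\s+")

def model_of(header):
    """Drop the leading year and the first word; return the rest unchanged."""
    return YEAR_AND_MAKE.sub("", header, count=1)

headers = [
    "2006 Volvo XC70",
    "2012 Ford Econoline Cargo Van E-250 Commercial",
    "2012 Nissan Frontier",
    "2012 Kia Soul 5dr Wagon Automatic",
]

for h in headers:
    print(model_of(h))
```

Because \w+ matches exactly one word, a two-word make would lose only its first word, and a hyphenated make would leave the string unchanged.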
For this task, I would fetch a basic list using the XML package:
library(XML)
doc <- xmlParse('http://www.fueleconomy.gov/ws/rest/ympg/shared/menu/make')
Now that we fetched the XML data we can create a vector with the car makes:
mk <- xpathSApply(doc, '//value', xmlValue)
Finally, I'll compile the pattern and play around with sprintf and sub:
df$Makes <- sub(sprintf('\\d+ (?:%s) ', paste(mk, collapse='|')), '', df$Header)
Output:
## Header
# 1 2006 Volvo XC70
# 2 2012 Ford Econoline Cargo Van E-250 Commercial
# 3 2012 Nissan Frontier
# 4 2012 Kia Soul 5dr Wagon Automatic
## Makes
# 1 XC70
# 2 Econoline Cargo Van E-250 Commercial
# 3 Frontier
# 4 Soul 5dr Wagon Automatic
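The alternation-building step can be sketched in Python as well. The list of makes below is a small hypothetical stand-in for the list fetched from the web service, and the function name is illustrative; re.escape guards against makes that contain regex metacharacters.

```python
import re

# Hypothetical, hard-coded list standing in for the fetched list of makes.
makes = ["Volvo", "Ford", "Nissan", "Kia", "Land Rover"]

# Build "^\d+ (?:Volvo|Ford|...) " with each make escaped.
pattern = re.compile(r"^\d+ (?:%s) " % "|".join(re.escape(m) for m in makes))

def strip_year_and_make(header):
    """Remove the leading year and a known make; leave other strings untouched."""
    return pattern.sub("", header, count=1)

print(strip_year_and_make("2012 Land Rover LR4"))
print(strip_year_and_make("2006 Volvo XC70"))
```

Unlike the single-word \w+ approach, an explicit list handles multi-word makes, at the cost of having to keep the list complete.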

Regex code for product models and codes

I found a very useful regex for extracting product codes here. This is the expression:
\b((?:[a-z]+\S*\d+|\d\S*[a-z]+)[a-z\d_-]*)\b
It works almost perfectly, but I need to detect and extract only the product codes that are at least 5 characters long.
For example, for the following strings:
5T COFFEE BREW FOR BLACK & DECKER DCM-601B
10T COFFEE BREW FOR BLACK & DECKER DCM-1100B
10T COFFEE BREW FOR BLACK & DECKER DCM-1100W
8T COFFEE BREW FOR BLACK & DECKER CM-1509
Rice Cookers 15T DOMESTIC USE RC5428, ELECTRIC BLACK & DECKER
Rice Cookers 15T RC/5723 DOMESTIC USE, ELECTRIC BLACK & DECKER
Rice Cookers B D REF.RC3203
Hand mixer, S / M, PS62509R
SLOW COOKING POTS, HAMILTON BEACH, HB33136T
OVEN 110V TOSTA SANKEY REF.TO-9
24 PZA METAL TEAPOT S / M CHINA REF: 92479
ELECTRIC RICE COOKER, 1.5 L ROYAL ROA-15SV
ELECTRIC RICE COOKER, 1.8 L ROYAL ROA-18SV
ELECTRIC RICE COOKER, 2.2 L ROYAL ROA-22SV
ELECTRIC RICE COOKER, 2.8 L ROYAL ROA-28SV
Waffle Makers DOMESTIC USE, ELECTRIC BLACK & DECKER G-49TD
2.00 PZA TOAST OVEN, METAL / GLASS ROYAL, CHINA, REF: RTH-28A
20.00 PZA RICE, METAL, BLACK & DECKER, CHINA, REF: RCB550S
I get:
5TDCM-601B
10TDCM-1100B
10TDCM-1100W
8TCM-1509
15TRC5428
15TRC/5723
REF.RC3203
PS62509R
HB33136T
REF.TO-9
92479
ROA-15SV
ROA-18SV
ROA-22SV
ROA-28SV
G-49TD
2.00RTH-28A
20.00RCB550S
Desired outcome:
DCM-601B
DCM-1100B
DCM-1100W
CM-1509
RC5428
RC/5723
REF.RC3203
PS62509R
HB33136T
REF.TO-9
92479
ROA-15SV
ROA-18SV
ROA-22SV
ROA-28SV
G-49TD
RTH-28A
RCB550S
How can I do this?
If we assume that your codes contain 5 or more non-whitespace symbols, and there must be at least 1 digit, the regex for the codes will be:
\b(?!\d+\.\d+)(?=\S*\d)\S{5,}\b
See Demo 1
The (?!\d+\.\d+) disallows float/decimal numbers like 1.2345 or 12.44.
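A small Python sketch for trying the pattern against a few of the sample strings from the question. The function name find_codes is hypothetical; the pattern is used unchanged.

```python
import re

# Not a decimal number, contains at least one digit, 5+ non-whitespace characters.
CODE = re.compile(r"\b(?!\d+\.\d+)(?=\S*\d)\S{5,}\b")

def find_codes(text):
    """Return every substring of text that the pattern accepts as a product code."""
    return CODE.findall(text)

samples = [
    "5T COFFEE BREW FOR BLACK & DECKER DCM-601B",
    "Rice Cookers 15T DOMESTIC USE RC5428, ELECTRIC BLACK & DECKER",
    "ELECTRIC RICE COOKER, 1.5 L ROYAL ROA-15SV",
    "20.00 PZA RICE, METAL, BLACK & DECKER, CHINA, REF: RCB550S",
]

for s in samples:
    print(find_codes(s))
```

The trailing \b is what drops the comma after RC5428: the greedy \S{5,} first takes the comma too, then backtracks to the last position that is a word boundary.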
I'm not quite sure if I understood your question, but you can use a regex like this to get the product codes you want:
((?:\w{2,}\.)?\w{1,}[.\/-]?\d+\w+)(?=\b)
Working demo

Splitting with a Regular Expression and accessing Array of Array

This is an example from a website that I am trying to understand.
People2.txt is as follows.
2323:Doe John California
827:Doe Jane Texas
982982:Neuman Alfred Nebraska
I don't get the output shown below when I run this command:
PS C:\> Get-Content people2.txt | %{$data = [regex]::split($_, '\t|:'); Write-Output "$($data[2]) $($data[1]), $($data[3])"}
John Doe, California
Jane Doe, Texas
Alfred Neuman, Nebraska
I was able to strip the numbers and swap the first and second fields using:
gc C:\appl\ppl.txt | %{$data = [regex]::split($_, ":") ;write-output $data[1] } | Out-File c:\appl\ppll.txt
gc C:\appl\ppll.txt | %{$data = $_.split(" "); Write-Output "$($data[1]) $($data[0]), $($data[2])"}
Please help. I need to find a more efficient way to do this.
Also, I want to understand '\t|:'. Does it mean 'split at the first TAB stop and a :'?
Off the top of my head, using named groups (the file lists the last name first): ^(?<number>\d+):(?<last>\w+)\s+(?<first>\w+)\s(?<location>.*)$
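On the '\t|:' question: the | is alternation, so the pattern splits at every tab or colon, not just the first. The Python sketch below illustrates this and also tries a named-group pattern in Python's (?P<name>...) syntax. The group and function names are hypothetical. It assumes the fields in the file are separated by spaces, as they appear in the question; that would explain why splitting on tabs did not produce the expected output.

```python
import re

# '\t|:' splits at every tab or colon.
print(re.split(r"\t|:", "2323:Doe\tJohn\tCalifornia"))   # tab-separated fields
print(re.split(r"\t|:", "2323:Doe John California"))     # space-separated fields

# Named groups; the file lists the last name before the first name.
LINE = re.compile(
    r"^(?P<number>\d+):(?P<last>\w+)\s+(?P<first>\w+)\s+(?P<location>.*)$"
)

def reformat(line):
    """Return 'First Last, Location', or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    return "{} {}, {}".format(m.group("first"), m.group("last"), m.group("location"))

for line in ["2323:Doe John California", "827:Doe Jane Texas"]:
    print(reformat(line))
```

With space-separated input the split yields only two elements, so indexes 2 and 3 are empty.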

Regex Pattern for String including newline characters

I am looking for a regex pattern that will return a match from %PDF-1.2 to and including %%EOF in the string below.
So far my patterns don't seem to work.
DOCUMENTS ACCEPTED
001//201//0E9136614////ACME 107 PTY LTD//8
**E10 End of validation report**
BDAT 4367 LAST
XSVBOUT
001XSVSEPRXXXOUT_TP.19
ZHDASCRA55 0700 8
ZCO*** TEST DATABASE ***ACME 107 PTY LTD 551824563 APTY LMSH PDF NSW 20111217 PNPC
ZIL 77000030149 Australian Securities and Investments Commission 86768265615 ZUMESOFT SOLUTIONS PTY LTD 61 buxton st north adelaide SA 5006
ZIAProprietary Company 42600 0E9136614 201 TAX INVOICE EXE 0 0E9136614201C PA 20111217 Not Subject to GST - Treasurer's Determination (Exempt Taxes, Fees and Charges)
ZTRENDRA55 5
%PDF-1.2
%????
3495
%%EOF
BDAT 11 LAST
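/(?s)(%PDF-1\.2.+%%EOF)/ should solve your problem. The (?s) modifier makes . match newline characters as well.
If your regex flavor does not support the inline (?s) modifier, you can usually put the s modifier after the closing delimiter instead, as in /(%PDF-1\.2.+%%EOF)/s.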
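As a minimal sketch of the same idea in Python, re.DOTALL plays the role of the s modifier. The sample text is a shortened excerpt of the string in the question, and the function name extract_pdf is hypothetical.

```python
import re

# Shortened excerpt of the input from the question.
text = "ZTRENDRA55 5\n%PDF-1.2\n%????\n3495\n%%EOF\nBDAT 11 LAST\n"

# re.DOTALL lets . match newlines, like the (?s) or /s modifier.
PDF_BLOCK = re.compile(r"%PDF-1\.2.+%%EOF", re.DOTALL)

def extract_pdf(s):
    """Return the text from %PDF-1.2 through %%EOF, or None if absent."""
    m = PDF_BLOCK.search(s)
    return m.group(0) if m else None

print(repr(extract_pdf(text)))
```

Note that .+ is greedy, so if the input contains several %%EOF markers the match runs to the last one; .+? would stop at the first.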