Need a hand with a regex

I would like to extract the whole "[[Fichier: ... ]]" block from the following text:
=== Langues ===
{{Article détaillé|Langues en Afrique du Sud}}
[[Fichier:South Africa dominant language map.svg|thumb|300px| Répartition
des langues officielles dominantes par région :
{{clear}}
{{legend|#80b1d3|[[Zoulou]]}}
{{legend|#8dd3c7|[[Afrikaans]]}}
{{legend|#fb8072|[[Xhosa (langue)|Xhosa]]}}
{{legend|#ffffb3|[[Anglais]]}}
{{legend|#fccde5|[[Tswana|Setswana]]}}
{{legend|#bebada|[[Ndébélés|Ndebele]]}}
{{legend|#fdb462|[[Sotho du Nord]]}}
{{legend|#b3de69|[[Sotho du Sud]]}}
{{legend|#bc80bd|[[Swati]]}}
{{legend|#ccebc5|[[Venda (langue)|Tshivenda]]}}
{{legend|#ffed6f|[[Tsonga (langue)|Xitsonga]]}}
{{legend|#d0d0d0|Pas de langage dominant}}]]
Il n'y a pas de langue maternelle majoritairement dominante en Afrique du Sud. Depuis [[1994]], [[Langues en Afrique du Sud|onze langues officielles]] (anglais, afrikaans, zoulou, xhosa, zwazi, ndebele, sesotho, sepedi, setswana, xitsonga, tshivenda<ref>[http://www.lafriquedusud.com/ethnies.htm lafriquedusud.com]</ref>) sont reconnues par la [[Constitution de l'Afrique du Sud|Constitution sud-africaine]]<ref>{{Ouvrage|langue=fr|auteur1=François- Xavier Fauvelle-Aymar|titre=Histoire
How can I improve the following regex:
\[\[Fichier:.*(.*\[\[.*\]\].*)*.*\]\]
so that it matches all the lines up to the correct closing ]]?

\[\[Fichier:(.*?(\n))+.*\]\]
This matches all the lines between [[ and ]].
A good sandbox for testing it: http://www.regexr.com

Provided you have at most one level of nested [[...]] (as your sample data suggests), the inner part of the pattern can be a sequence of either a string in double brackets (\[\[.*?\]\]) or any character other than a closing bracket ([^]]):
\[\[Fichier:(?:\[\[.*?\]\]|[^]])*\]\]
Demo: https://regex101.com/r/Q7zQQt/1
For an arbitrary number of nesting levels, the answer depends on the regex flavour. You can find more details here: http://www.regular-expressions.info/balancing.html.
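As a rough check outside an online tester, the one-level pattern can be tried in Python, whose re module accepts [^]] as written. The standard re module offers neither recursion nor balancing groups, so the second half of this sketch falls back to a plain depth counter for deeper nesting. The helper name extract_balanced and the shortened sample text are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer: after "[[Fichier:", repeat either a complete
# inner [[...]] link or any single character that is not "]".
FICHIER_RE = re.compile(r"\[\[Fichier:(?:\[\[.*?\]\]|[^]])*\]\]")

# Shortened, hypothetical sample in the shape of the question's text.
sample = (
    "=== Langues ===\n"
    "[[Fichier:map.svg|thumb|300px| Répartition :\n"
    "{{legend|#80b1d3|[[Zoulou]]}}\n"
    "{{legend|#fb8072|[[Xhosa (langue)|Xhosa]]}}\n"
    "{{legend|#d0d0d0|Pas de langage dominant}}]]\n"
    "Depuis [[1994]], onze langues officielles sont reconnues.\n"
)

match = FICHIER_RE.search(sample)
block = match.group(0) if match else ""


def extract_balanced(text, start_marker="[[Fichier:"):
    """Return the balanced [[...]] block that begins at start_marker, or None.

    Hypothetical helper: counts "[[" and "]]" pairs instead of using a regex,
    so it copes with any nesting depth.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "[[":
            depth += 1
            i += 2
        elif pair == "]]":
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None  # unbalanced input


print(block)
print(extract_balanced(sample) == block)
```

On input with a single nesting level the two approaches should agree; the counter is only needed when links can nest inside links.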

Is there a way to find and remove the datetime from multiple rows in Google Sheets?

Hope you're doing well.
Imagine I have the following Sheet:
5:20:58 xxxx: entro con el mismo xxxx
5:21:08 xxxx: xxxx
5:21:58 xxxxx: Perfecto, te pido de 5 a 10 minutos mientras
reviso la configuración de las etiquetas. ¿De acuerdo?
5:22:04 xxxxx: ok
I need to delete the timestamp from all those rows. The result should be:
xxxx: entro con el mismo xxxx
xxxx: xxxx
xxxxx: Perfecto, te pido de 5 a 10 minutos mientras
reviso la configuración de las etiquetas. ¿De acuerdo?
xxxxx: ok
Is there a formula in Google Sheets to do this?
I tried REPLACE and SPLIT, but they do not work for all the rows in the sheet.
(The real sheet has many more rows; I extracted a part of it to give an example.)
EDIT
(following OP's comment)
...there are sometimes that the data not starts with a timestamp. ... How can I adjust the formula to make it work?
Please use the following altered formula
=INDEX(IFERROR(REGEXEXTRACT(B1:B;" (.*)");B1:B))
OR (for an even more robust formula)
=INDEX(IFERROR(REGEXEXTRACT(B1:B;"^\d+:\d+:\d+ (.+)");B1:B))
Original answer
Please use the following formula (adjust range to your needs)
=INDEX(IFERROR(REGEXEXTRACT(B1:B;" (.*)")))
OR (depending on your locale)
=INDEX(IFERROR(REGEXEXTRACT(B1:B," (.*)")))
Functions used:
INDEX
IFERROR
REGEXEXTRACT
Let's say your raw data is in A2:A. Place this in the second cell (e.g., B2) of an otherwise empty column:
=ArrayFormula(IF(A2:A="",,TRIM(REGEXREPLACE(A2:A,"\d+:\d+:\d+",""))))
ADDENDUM:
Version for some international locales (where semicolon is used in place of a comma within formulas):
=ArrayFormula(IF(A2:A="";;TRIM(REGEXREPLACE(A2:A;"\d+:\d+:\d+";""))))
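To check the timestamp pattern outside Sheets, here is a minimal Python sketch of the same idea as the more robust formula: remove a leading h:mm:ss timestamp and leave rows without one untouched. The name strip_timestamp is an assumption for illustration; the rows are taken from the question.

```python
import re

# A leading h:mm:ss timestamp plus the whitespace that follows it.
# Rows that do not start with a timestamp are returned unchanged.
LEADING_TIMESTAMP = re.compile(r"^\d+:\d+:\d+\s+")


def strip_timestamp(row: str) -> str:
    """Remove one leading timestamp from a row, if present."""
    return LEADING_TIMESTAMP.sub("", row, count=1)


rows = [
    "5:20:58 xxxx: entro con el mismo xxxx",
    "5:21:08 xxxx: xxxx",
    "reviso la configuración de las etiquetas. ¿De acuerdo?",
    "5:22:04 xxxxx: ok",
]
cleaned = [strip_timestamp(r) for r in rows]
for r in cleaned:
    print(r)
```

Anchoring the pattern with ^ keeps times that appear later in a message intact, which the unanchored REGEXREPLACE variant would also strip.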

How do I extract surnames with spaces in them as 'one' name/'en bloc'?

Could anyone please advise on a way to extract surnames that have spaces in them, as a single block of names?
I have names in a dataset that look like this
clear
input str40 name
"R. P. de la Espriella Guerrero"
"J. de Carvalho Ponce"
"E. De Freitas Drumond"
"R. de la Fuente and M. E. Medina-Mora"
"C. Van Heyningen and I. D. Watson"
"A. Z. van de Wiel and D. W. de Lange"
end
I only want the first surname (so only the first author and excluding other authors) but I want those names that have spaces to be extracted 'en bloc'. So, ultimately resulting in a dataset as follows, for instance:
clear
input str40 name
"de la Espriella Guerrero"
"de Carvalho Ponce"
"De Freitas Drumond"
"de la Fuente"
"Van Heyningen"
"van de Wiel"
end
I'd be grateful for any help.
Here is code that implements the two rules given in my comment above. It assumes the version of Stata used supports the unicode character string functions.
clear
input str40 name
"R. P. de la Espriella Guerrero"
"J. de Carvalho Ponce"
"E. De Freitas Drumond"
"R. de la Fuente and M. E. Medina-Mora"
"C. Van Heyningen and I. D. Watson"
"A. Z. van de Wiel and D. W. de Lange"
end
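generate surname = name
* Rule 1: keep only the first author (everything before the first " and ")
replace surname = usubstr(surname,1,ustrpos(surname+" and "," and ")-1)
list, clean noobs
* Rule 2: drop the initials (everything up to the last ". ");
* ustrtrim() removes the blank left in front of the surname
replace surname = ustrtrim(usubstr(surname,ustrrpos(surname,". ")+1,.))
list, clean noobs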
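For readers without Stata, the two steps the code performs can be sketched in Python: cut the string at the first " and ", then keep what follows the last ". ". The function name first_surname is hypothetical, and the sketch assumes every author is written as initials followed by a surname, as in the sample data.

```python
def first_surname(name: str) -> str:
    """Return the surname of the first author, keeping multi-word surnames whole."""
    # Step 1: keep only the first author, i.e. the text before the first " and ".
    first_author = name.split(" and ", 1)[0]
    # Step 2: drop the initials, i.e. everything up to the last ". ".
    # rpartition() returns the whole string as its last item when ". " is absent.
    return first_author.rpartition(". ")[2].strip()


names = [
    "R. P. de la Espriella Guerrero",
    "J. de Carvalho Ponce",
    "E. De Freitas Drumond",
    "R. de la Fuente and M. E. Medina-Mora",
    "C. Van Heyningen and I. D. Watson",
    "A. Z. van de Wiel and D. W. de Lange",
]
surnames = [first_surname(n) for n in names]
for s in surnames:
    print(s)
```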

Regular expression in distinct texts

I have two kinds of text that I want to match with one regular expression. In the first kind, shown below, I can capture the name
with this regex:
referente[,;]*\s\S\s(.+)\.\sOnde
O COORDENADOR-GERAL DE GESTÃO DE PESSOAS DO MINISTÉRIO DOS TRANSPORTES, PORTOS E AVIAÇÃO CIVIL, no uso das atribuições que lhe foram subdelegadas pela Portaria/SAAD nº. 202, art. 1°, inciso VII, de 08 de outubro de 2010, publicada no Diário Oficial da União, de 11 de outubro de 2010, resolve:
Retificar a Portaria COGEP-MT nº 3394, de 30 de novembro de 2016, publicada no Diário Oficial da União, Seção 2, página 55, de 13 de dezembro de 2016, referente à MARIA ALIXANDRINA COSTA REIS. Onde se lê "MARIA AUXILIADORA COSTA REIS"; Leia-se "MARIA ALIXANDRINA COSTA REIS.(Processo SEI: 50000.124582/2016-62) BA.
I also need to capture the name in a second kind of text:
O COORDENADOR-GERAL DE GESTÃO DE PESSOAS DO MINISTÉRIO DOS
TRANSPORTES, PORTOS E AVIAÇÃO CIVIL, no uso das atribuições que lhe
foram subdelegadas pela Portaria/SAAD nº. 202, art.1º, inciso VII, de
08 de outubro de 2010, publicada no Diário Oficial da União, de 11 de
outubro de 2010, resolve: Conceder Pensão Temporária, nos termos do
artigo 215 e 217, inciso II, alínea "a" da Lei nº 8.112/1990, à
ELIANE RIBEIRO MENESES, filha inválida do ex-servidor ASTOLFO
MENEZES, matrícula SIAPE nº. 0783182, do Quadro Permanente deste
Ministério, falecido na inatividade em 05 de agosto de 1997, cuja cota
parte equivale a 100% (cem por cento) do valor correspondente à
remuneração decorrente do cargo de Artífice de Mecânica (NI), Classe
"A", Padrão "III", com vigência partir do momento da Publicação da
Portaria de Concessão e efeitos financeiros a partir de 30 de maio de
2015, data do falecimento da viúva. (Processo SEI nº
50000.019342/2016-47) - MG.
I need to capture the bold name too, with the same regex. How can I modify this regex?
I am not sure why you want to match/embolden the trailing punctuation and Onde/Where substring.
I would recommend this pattern, which optionally matches "referente", then "à", then the all-caps words that follow. There are no capture groups; just replace the full match with the emboldened full match.
(I don't use nsregularexpression, so let me know if something is simply not right.)
/(?:referente )?à [A-Z]+(?: [A-Z]+)*/u
The unicode flag accommodates the accented letters (such as à) that will be encountered.
p.s. In your "solution" you are incorporating [,;]* but that doesn't get represented in your sample strings, so I left it out. Reducing the total number of parenthetical groups delivers improved pattern efficiency -- that is why I use just two non-capturing groups.
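The recommended pattern can be tried in Python as a quick check. This is only a sketch: it adds one capture group around the name (a change from the answer's group-free pattern, made here so the name can be read directly), and the two test strings are shortened excerpts of the question's texts. Python 3 str patterns are Unicode-aware, so no /u flag is needed; note that [A-Z] covers unaccented capitals only, so a name containing an accented capital would need a wider character class.

```python
import re

# The answer's pattern with one added capture group around the name.
NAME_RE = re.compile(r"(?:referente )?à ([A-Z]+(?: [A-Z]+)*)")


def find_name(text):
    """Return the first all-caps name that follows "à", or None."""
    m = NAME_RE.search(text)
    return m.group(1) if m else None


# Shortened excerpts of the two texts in the question.
first = ("de 13 de dezembro de 2016, referente à MARIA ALIXANDRINA COSTA REIS. "
         "Onde se lê")
second = ('alínea "a" da Lei nº 8.112/1990, à ELIANE RIBEIRO MENESES, '
          "filha inválida do ex-servidor")

print(find_name(first))
print(find_name(second))
```

Because the words after "à" must be capitals, phrases such as "correspondente à remuneração" are skipped.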
You can use the following regex to match the two bold parts of your examples:
(à\sELIANE\s\w+\s\w+SES)|(referente[,;]*\s\S\s.+\.\sOnde)
Good luck!
My solution is ([,;]*)*\sà\s((\w+\s)+\w+)[\.,]

import.io: some data not imported or mixed in the same column

I'm using import.io's Magic API on this web page :
http://www.legifrance.gouv.fr/affichSarde.do?reprise=true&page=1&idSarde=SARDOBJT000007104398&ordre=null&nature=null&g=ls
Some types of info/fields are perfectly extracted.
But the extractor:
- mixes the NOR number field (example: NOR DEVL1502938A) with a number that represents the number of pages (example: 10) in the same column, probably because both are linked text (the tag is a title="[...]" href="[...]");
- then mixes the bibliographic reference field (example: JO du 04/04/2015 texte : 0080;10 pages 6232/6241) with the NOR number field. This seems strange to me, because the NOR systematically precedes the reference and they are not on the same line in the web page (there is a br/ tag before the bibliographic reference field);
- frequently fails to load the text summary (example: (Application de l'art. R. 411-1 et s. du code de l'environnement - Abrogation de l'arrêté du 15 mai 1986 fixant sur tout ou partie du territoire national des mesures de protection des oiseaux représentés dans le département de la Guyane)) into one column and instead spreads it over several columns. I see this happening when an em tag is inserted after the span class="noir" tag. Example:
Application de l'art. R. 213-49-2 du code de l'environnement -
Abrogation de l'arrêté du 10 août 2011 relatif à la définition du
périmètre de l'Etablissement public du Marais poitevin)
I have tried using the New Extractor, and working my way around the problem through a special Google search results page, https://www.google.fr/search?q=PROTECTION+FAUNE+et+FLORE+SAUVAGES+site:legifrance.gouv.fr+filetype:pdf, to no avail. The Google alternative gives even worse results.
I would welcome any idea:
- on the reason for the second problem;
- on how I can overcome the three problems on the Legifrance page.
Thanks a lot for reading this till the end :-)
PS: please note that I work primarily as a researcher. Although I can follow their logic, I am not familiar with regex or JSON. So if they are needed, could you please either explain the logic behind them or show a large enough portion of the ideal code that I can replicate it effectively?

How to camelcase only a specific part of a string?

I have a CSV file like this:
"","LESCHELLES","","LESCHELLES"
"","SAINTE CROIX DE VERDON","","SAINTE CROIX DE VERDON"
"","SERRE CHEVALIER","","SERRE CHEVALIER"
"","SAINT JUST D'ARDECHE","","SAINT JUST D'ARDECHE"
"","NEUVILLE SUR VANNES","","NEUVILLE SUR VANNES"
"","ESCUEILLENS ET SAINT JUST","","ESCUEILLENS ET SAINT JUST"
"","PAS DES LANCIERS","","PAS DES LANCIERS"
"","PLAN DE CAMPAGNE","","PLAN DE CAMPAGNE"
And I'd like to convert it this way:
"","Leschelles","","LESCHELLES"
"","Sainte Croix De Verdon","","SAINTE CROIX DE VERDON","STE CROIX DE VERDON","93"
"","Serre Chevalier","","SERRE CHEVALIER","SERRE CHEVALIER","93"
"","Saint Just D'Ardeche","","SAINT JUST D'ARDECHE"
"","Neuville Sur Vannes","","NEUVILLE SUR VANNES"
"","Escueillens Et Saint Just","","ESCUEILLENS ET SAINT JUST","ESCUEILLENS ET ST JUST","91"
"","Luc","","LUC"
"","Pas Des Lanciers","","PAS DES LANCIERS","PAS DES LANCIERS","93"
"","Plan De Campagne","","PLAN DE CAMPAGNE","PLAN DE CAMPAGNE","93"
This would be nice. Even better: lowercase all "whole" words like de, d', et, sur and des. This would give:
"","Leschelles","","LESCHELLES"
"","Sainte Croix de Verdon","","SAINTE CROIX DE VERDON","STE CROIX DE VERDON","93"
"","Serre Chevalier","","SERRE CHEVALIER","SERRE CHEVALIER","93"
"","Saint Just d'Ardeche","","SAINT JUST D'ARDECHE"
"","Neuville sur Vannes","","NEUVILLE SUR VANNES"
"","Escueillens et Saint Just","","ESCUEILLENS ET SAINT JUST","ESCUEILLENS ET ST JUST","91"
"","Luc","","LUC"
"","Pas des Lanciers","","PAS DES LANCIERS","PAS DES LANCIERS","93"
"","Plan de Campagne","","PLAN DE CAMPAGNE","PLAN DE CAMPAGNE","93"
Python has title():
Return a titlecased version of the string where words start with an
uppercase character and the remaining characters are lowercase.
The algorithm uses a simple language-independent definition of a word
as groups of consecutive letters. The definition works in many
contexts but it means that apostrophes in contractions and possessives
form word boundaries, which may not be the desired result:
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"
A workaround for apostrophes can be constructed
using regular expressions:
import re

def titlecase(s):
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0)[0].upper() +
                             mo.group(0)[1:].lower(),
                  s)

>>> titlecase("they're bill's friends.")
"They're Bill's Friends."
Update: here's the solution for the French problem:
import re, sys

def titlecase(s):
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0)[0].upper() +
                             mo.group(0)[1:].lower(),
                  s)

def french_parse(s):
    # Linking words (" de ", " sur ", ...) and elided articles (" d'", " l'")
    p = re.compile(
        r"( de la | sur | sous | la | de | les | du | le | au | aux | en | des | et )|(( d'| l')([a-z]+))",
        re.IGNORECASE)
    # Lowercase the linking word; for d'/l', lowercase the article and
    # title-case the word that follows the apostrophe.
    return p.sub(
        lambda mo: mo.group().find("'") > 0
                   and mo.group()[:mo.group().find("'")+1].lower() +
                       titlecase(mo.group()[mo.group().find("'")+1:])
                   or (mo.group(0)[0].upper() + mo.group(0)[1:].lower()),
        s)

for line in sys.stdin:
    # The offset 20 presumably skips leading columns of the asker's full file;
    # adjust it to your own layout.
    s = line[20:len(line)-1]
    p = s.find('"')
    t = s[:p]
    # Just output to show which names have been modified:
    if french_parse(titlecase(t)) != titlecase(t):
        print('"' + french_parse(titlecase(t)) + '"')
Just launch it like this:
python thepythonscript.py < file.csv
Then the output will be:
"Grenand les Sombernon"
"Touville sur Montfort"
"Fontenay en Vexin"
"Durfort Saint Martin de Sossenac"
"Monclar d'Armagnac"
"Ports sur Vienne"
"Saint Barthelemy de Beaurepaire"
"Saint Bernard du Touvet"
"Rosoy le Vieil"
While you may be able to pull this off with some vim regex magic, I think it'll be easier if you solve the problem in your favorite scripting language, and pipe selected text through that from vim using the ! command. Here's an (untested) example in PHP:
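#!/usr/bin/env php
<?php
$specialWords = array('de', 'd\'', 'et', 'du', /* etc. */ );
// Read the piped lines from standard input, one array element per line
foreach (file('php://stdin') as $line) {
    // Lowercase first (the input is all caps), then capitalise each word;
    // the extra delimiters let ucwords() see words after quotes and apostrophes.
    // Note that this touches every field of the line, not just one column.
    $line = ucwords(strtolower($line), " \t\"'");
    foreach ($specialWords as $w) {
        // Put the linking words back in lowercase
        $line = preg_replace("/\\b$w\\b/i", $w, $line);
    }
    echo $line;
}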
Make that script executable and store it somewhere on your PATH; then from vim, select some text and use :'<,'>! yourscript.php to convert (or just :%! yourscript.php for the whole buffer).
The csv.vim ftplugin helps with working in CSV files. Though it does not offer a "substitute in column N" function directly, it may get you near that. At least you can arrange the columns into neat blocks, and then apply a simple regexp or visual blockwise selection to them.
But I second that using a different toolchain that is better suited to manipulating CSV files may be preferable to doing this completely in Vim. It also depends on whether this is a one-off task or something you do frequently.
Here is a one-liner Vim command:
%s/"[^"]*",\zs\("[^"]*"\)/\=substitute(substitute(submatch(0), '\<\(\a\)\(\a*\)\>', '\u\1\L\2', 'g'), '\c\<\(de\|d\|l\|sur\|le\|la\|en\|et\)\>', '\L&', 'g')
I assume here that the first two fields contain no double quotes.
The idea behind this solution is to rely on :h :s\= to execute a series of functions on the second field once it is found. The series of functions is: first change each word to TitleCase, then put all the linking words in lowercase.
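The Python answer earlier in this thread only prints the names it would change. As a complement, here is a sketch that rewrites the second column of each CSV row and leaves the other columns alone, using the standard csv module. The names SMALL_WORDS, french_title and convert are hypothetical, and the word list is an assumption to be extended as needed; like the earlier answer, it never lowercases the first word of a name.

```python
import csv
import io

# Linking words kept in lowercase when they are not the first word (assumed list).
SMALL_WORDS = {"de", "des", "du", "et", "sur", "sous", "la", "le", "les",
               "au", "aux", "en"}


def french_title(name: str) -> str:
    """Title-case a place name, keeping French linking words in lowercase."""
    result = []
    for i, word in enumerate(name.lower().split(" ")):
        if word[:2] in ("d'", "l'") and len(word) > 2:
            # Elided article: lowercase it (unless it starts the name)
            # and capitalise the word after the apostrophe.
            prefix = word[:2] if i > 0 else word[:2].capitalize()
            result.append(prefix + word[2:].capitalize())
        elif i > 0 and word in SMALL_WORDS:
            result.append(word)
        else:
            result.append(word.capitalize())
    return " ".join(result)


def convert(csv_text: str) -> str:
    """Return the CSV text with only the second column title-cased."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) > 1:
            row[1] = french_title(row[1])
        writer.writerow(row)
    return out.getvalue()


sample = (
    '"","SAINTE CROIX DE VERDON","","SAINTE CROIX DE VERDON"\n'
    '"","SAINT JUST D\'ARDECHE","","SAINT JUST D\'ARDECHE"\n'
)
print(convert(sample), end="")
```

Parsing with csv.reader rather than slicing at fixed offsets means the result does not depend on how wide the leading columns are.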