What is the best way to clean a dirty file? - regex

To study sed a little, I took a very dirty XML file. Here it is:
<title><![CDATA[O BR-Linux está em pausa por tempo indeterminado]]></title>
<title><![CDATA[Funçoes ZZ atinge maioridade: versão 18.3]]></title>
<title><![CDATA[CloudFlare 1.1.1.1 e parceria com Firefox DoH]]></title>
<title><![CDATA[Slint, Distro Baseada no Slackware e Acessível]]></title>
<title><![CDATA[Utilização de CPU em sistemas Linux multi-thread]]></title>
<title><![CDATA[Realidade Aumentada com 10 anos de idade e 10 linhas de código.]]></title>
I managed to remove the garbage and keep just the text, but the solution doesn't please me much. I'd like a way to improve it, but I really don't know how. Here is the code:
#!/bin/bash
# Trauvin
URL=http://br-linux.org/feed/
lynx -source "$URL" |
grep '<title><!' | # get tag title
sed 's/<[^!>]*>//g' | # remove tag title
sed 's/<[^<]>*//g' | # remove <!
sed 's/CDATA/''/g' | # remove CDATA
sed 's/[[^[]//g' | # remove the square brackets start
sed 's/[]*]]//g' | # remove the square brackets end
sed 's/>*//g' | # remove > end
head -n 5
I split it into several separate seds to avoid confusion, so I could add a comment on every line.

With xmlstarlet:
URL='http://br-linux.org/feed/'
lynx -source "$URL" | xmlstarlet select --template --value-of '//item/title'
Output:
O BR-Linux está em pausa por tempo indeterminado
Funçoes ZZ atinge maioridade: versão 18.3
CloudFlare 1.1.1.1 e parceria com Firefox DoH
Slint, Distro Baseada no Slackware e Acessível
Utilização de CPU em sistemas Linux multi-thread
Realidade Aumentada com 10 anos de idade e 10 linhas de código.
Nova versão da plataforma livre para o mapeamento de iniciativas em agroecologia
Instalação do WordPress com Vagrant
DatabaseCast 82: Ciência e dados
Aplicando ferramentas open source para se dar bem no jogo Suikoden Tierkreis
Tchelinux 2018: Chamada de palestras para Rio Grande
Palestra on-line - conhecendo o Elastic Stack
Curso gratuito básico de linux - Online e ao-vivo
Aulas Particulares de Programação em Shell Script
Protoboard em quadrinhos: manual apresenta 10 circuitos divertidos e desafiadores que você mesmo pode construir
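If xmlstarlet is not available, the same XML-aware idea can be sketched with Python's standard-library ElementTree. This is a minimal sketch, with a tiny sample feed inlined for illustration; the real feed follows the same rss/channel/item layout:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real feed (same structure, two items).
sample = """<rss><channel>
<item><title><![CDATA[O BR-Linux está em pausa por tempo indeterminado]]></title></item>
<item><title><![CDATA[Instalação do WordPress com Vagrant]]></title></item>
</channel></rss>"""

def feed_titles(xml_text):
    root = ET.fromstring(xml_text)
    # CDATA is transparent to the parser: .text is already the plain title.
    return [t.text for t in root.findall("./channel/item/title")]

for title in feed_titles(sample):
    print(title)
```

No regex at all: the parser resolves the CDATA sections for you, which is the whole point of using an XML-aware tool.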

The best way to work with an XML file is to use XML-aware tools, not regular expressions.
Example using XSLT to extract just the titles:
feed.xslt:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="rss/channel/item">
      <xsl:value-of select="title"/><xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
When applied to the RSS feed:
$ xsltproc feed.xslt <(curl -s https://br-linux.org/feed/)
O BR-Linux está em pausa por tempo indeterminado
Funçoes ZZ atinge maioridade: versão 18.3
CloudFlare 1.1.1.1 e parceria com Firefox DoH
Slint, Distro Baseada no Slackware e Acessível
Utilização de CPU em sistemas Linux multi-thread
Realidade Aumentada com 10 anos de idade e 10 linhas de código.
Nova versão da plataforma livre para o mapeamento de iniciativas em agroecologia
Instalação do WordPress com Vagrant
DatabaseCast 82: Ciência e dados
Aplicando ferramentas open source para se dar bem no jogo Suikoden Tierkreis
Tchelinux 2018: Chamada de palestras para Rio Grande
Palestra on-line - conhecendo o Elastic Stack
Curso gratuito básico de linux - Online e ao-vivo
Aulas Particulares de Programação em Shell Script
Protoboard em quadrinhos: manual apresenta 10 circuitos divertidos e desafiadores que você mesmo pode construir

You could 'unwrap' the content step by step rather than stripping the start and the end separately:
$ lynx -source "$URL" |
sed 's/<title>\(.*\)<\/title>/\1/' | # <title>x</title> -> x
sed 's/<!\[\(.*\)\]>/\1/' | # <![x]> -> x
sed 's/CDATA\[\(.*\)\]/\1/' # CDATA[x] -> x
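The three unwrap steps can also be collapsed into a single substitution with one capture group; a minimal Python sketch, with two sample lines taken from the feed:

```python
import re

lines = [
    "<title><![CDATA[Funçoes ZZ atinge maioridade: versão 18.3]]></title>",
    "<title><![CDATA[Instalação do WordPress com Vagrant]]></title>",
]

# One capture group grabs everything between the CDATA markers,
# discarding the <title> wrapper and the CDATA brackets in one go.
cdata = re.compile(r"<title><!\[CDATA\[(.*)\]\]></title>")
titles = [m.group(1) for m in map(cdata.match, lines) if m]
for t in titles:
    print(t)
```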

Related

Regular expression in distinct texts

I have two texts to process with a regular expression. For the first one, this pattern lets me capture the name.
With this regex:
referente[,;]*\s\S\s(.+)\.\sOnde
O COORDENADOR-GERAL DE GESTÃO DE PESSOAS DO MINISTÉRIO DOS TRANSPORTES, PORTOS E AVIAÇÃO CIVIL, no uso das atribuições que lhe foram subdelegadas pela Portaria/SAAD nº. 202, art. 1°, inciso VII, de 08 de outubro de 2010, publicada no Diário Oficial da União, de 11 de outubro de 2010, resolve:
Retificar a Portaria COGEP-MT nº 3394, de 30 de novembro de 2016, publicada no Diário Oficial da União, Seção 2, página 55, de 13 de dezembro de 2016, referente à MARIA ALIXANDRINA COSTA REIS. Onde se lê "MARIA AUXILIADORA COSTA REIS"; Leia-se "MARIA ALIXANDRINA COSTA REIS.(Processo SEI: 50000.124582/2016-62) BA.
I need to capture the name in this second pattern of text as well.
O COORDENADOR-GERAL DE GESTÃO DE PESSOAS DO MINISTÉRIO DOS
TRANSPORTES, PORTOS E AVIAÇÃO CIVIL, no uso das atribuições que lhe
foram subdelegadas pela Portaria/SAAD nº. 202, art.1º, inciso VII, de
08 de outubro de 2010, publicada no Diário Oficial da União, de 11 de
outubro de 2010, resolve: Conceder Pensão Temporária, nos termos do
artigo 215 e 217, inciso II, alínea "a" da Lei nº 8.112/1990, à
ELIANE RIBEIRO MENESES, filha inválida do ex-servidor ASTOLFO
MENEZES, matrícula SIAPE nº. 0783182, do Quadro Permanente deste
Ministério, falecido na inatividade em 05 de agosto de 1997, cuja cota
parte equivale a 100% (cem por cento) do valor correspondente à
remuneração decorrente do cargo de Artífice de Mecânica (NI), Classe
"A", Padrão "III", com vigência partir do momento da Publicação da
Portaria de Concessão e efeitos financeiros a partir de 30 de maio de
2015, data do falecimento da viúva. (Processo SEI nº
50000.019342/2016-47) - MG.
I need to capture the highlighted name here too, with the same regex. How can I modify the regex?
I am not sure why you want to match/embolden the trailing punctuation and the Onde/Where substring.
I would recommend this pattern, which optionally matches referente, then à, then the all-caps words that follow. There are no capture groups; just replace the full match with the emboldened full match.
(I don't use NSRegularExpression, so let me know if something is simply not right.)
/(?:referente )?à [A-Z]+(?: [A-Z]+)*/u
The unicode flag is to accommodate the accented letters that will be encountered.
P.S. Your "solution" incorporates [,;]*, but that isn't represented in your sample strings, so I left it out. Reducing the total number of parenthetical groups improves pattern efficiency; that is why I use just two non-capturing groups.
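For a quick sanity check outside NSRegularExpression, here is the same pattern tried in Python (Python 3 strings are Unicode by default, so no /u flag is needed; the sample strings are shortened from the question):

```python
import re

# Optionally match "referente ", then "à ", then a run of all-caps words.
pattern = re.compile(r"(?:referente )?à [A-Z]+(?: [A-Z]+)*")

t1 = ('Retificar a Portaria COGEP-MT nº 3394, referente à MARIA '
      'ALIXANDRINA COSTA REIS. Onde se lê "MARIA AUXILIADORA COSTA REIS".')
t2 = ('Conceder Pensão Temporária, nos termos do artigo 215, '
      'à ELIANE RIBEIRO MENESES, filha inválida do ex-servidor.')

print(pattern.search(t1).group())  # referente à MARIA ALIXANDRINA COSTA REIS
print(pattern.search(t2).group())  # à ELIANE RIBEIRO MENESES
```

Note that the trailing period and the comma stop the match, since [A-Z]+ only eats capital letters.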
You can use the following regex to match the two bold parts of your examples:
(à\sELIANE\s\w+\s\w+SES)|(referente[,;]*\s\S\s.+\.\sOnde)
Good luck!
My solution is ([,;]*)*\sà\s((\w+\s)+\w+)[\.,]

Need a hand with a regex

I would like to get the text inside the [[Fichier ... ]] frame in this text:
=== Langues ===
{{Article détaillé|Langues en Afrique du Sud}}
[[Fichier:South Africa dominant language map.svg|thumb|300px| Répartition
des langues officielles dominantes par région :
{{clear}}
{{legend|#80b1d3|[[Zoulou]]}}
{{legend|#8dd3c7|[[Afrikaans]]}}
{{legend|#fb8072|[[Xhosa (langue)|Xhosa]]}}
{{legend|#ffffb3|[[Anglais]]}}
{{legend|#fccde5|[[Tswana|Setswana]]}}
{{legend|#bebada|[[Ndébélés|Ndebele]]}}
{{legend|#fdb462|[[Sotho du Nord]]}}
{{legend|#b3de69|[[Sotho du Sud]]}}
{{legend|#bc80bd|[[Swati]]}}
{{legend|#ccebc5|[[Venda (langue)|Tshivenda]]}}
{{legend|#ffed6f|[[Tsonga (langue)|Xitsonga]]}}
{{legend|#d0d0d0|Pas de langage dominant}}]]
Il n'y a pas de langue maternelle majoritairement dominante en Afrique du Sud. Depuis [[1994]], [[Langues en Afrique du Sud|onze langues officielles]] (anglais, afrikaans, zoulou, xhosa, zwazi, ndebele, sesotho, sepedi, setswana, xitsonga, tshivenda<ref>[http://www.lafriquedusud.com/ethnies.htm lafriquedusud.com]</ref>) sont reconnues par la [[Constitution de l'Afrique du Sud|Constitution sud-africaine]]<ref>{{Ouvrage|langue=fr|auteur1=François- Xavier Fauvelle-Aymar|titre=Histoire
How can I improve the following regex:
\[\[Fichier:.*(.*\[\[.*\]\].*)*.*\]\]
so that it matches all the lines up to the correct closing ]]?
\[\[Fichier:(.*?(\n))+.*\]\]
Match all the lines between [[ and ]].
Here is the best sandbox: http://www.regexr.com
Provided you have at most one level of nested [[...]] (as your sample data suggests), the inner pattern can be a sequence of either a complete double-bracketed string (\[\[.*?\]\]) or any character except a closing bracket ([^]]):
\[\[Fichier:(?:\[\[.*?\]\]|[^]])*\]\]
Demo: https://regex101.com/r/Q7zQQt/1
For arbitrary number of nested levels the answer depends on regex flavour. You may find more details on this here: http://www.regular-expressions.info/balancing.html.
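The one-level pattern can be checked in any PCRE-like engine; here is a minimal Python sketch using a shortened version of the sample text:

```python
import re

# Shortened version of the wikitext from the question.
wikitext = (
    "[[Fichier:South Africa dominant language map.svg|thumb|300px| Répartition\n"
    "des langues officielles dominantes par région :\n"
    "{{legend|#80b1d3|[[Zoulou]]}}\n"
    "{{legend|#ffffb3|[[Anglais]]}}]]\n"
    "Il n'y a pas de langue maternelle majoritairement dominante."
)

# One nesting level: either a complete inner [[...]] or any non-] character
# (which also swallows newlines, so no DOTALL flag is needed here).
frame = re.compile(r"\[\[Fichier:(?:\[\[.*?\]\]|[^]])*\]\]")
m = frame.search(wikitext)
print(m.group())
```

The match stops exactly at the closing ]] of the frame, without running on into the prose that follows.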

Extract a specific pattern from data in R

I have a ".txt" file containing a lot of juridical text, but I only want to extract the dates for further analysis and graphics. Here is an example (sorry, it's in Portuguese):
"AR - 4024-03.2010.5.00.0000(2)" "ACORDAM os Ministros da Egrégia
Subseção II Especializada em Dissídios Individuais do Tribunal
Superior do Trabalho, por unanimidade, não conhecer do recurso
ordinário, por incabível. Brasília, 24 de maio de 2011. Firmado por
assinatura digital (MP 2.200-2/2001) Alberto Luiz Bresciani de Fontan
Pereira Ministro Relator fls. PROCESSO Nº
TST-AR-4024-03.2010.5.00.0000 Firmado por assinatura digital em
26/05/2011 pelo sistema AssineJus da Justiça do Trabalho, conforme MP
2.200-2/2001, que instituiu a Infra-Estrutura de Chaves Públicas Brasileira."
The file has a lot of these, but I want to extract only the highlighted parts and put them in a separate vector. I've been trying match and grep; nothing is working. Perhaps because I'm new to R.
This pattern will match dates of the form that you have highlighted:
"\\d{1,2} de (janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro) de \\d{4}"
See ?regex for details on special characters and quantifiers. You can substitute on the items that match:
your_text <- c("AR - 4024-03.2010.5.00.0000", "ACORDAM os Ministros da Egrégia Subseção II Especializada em Dissídios Individuais do Tribunal Superior do Trabalho, por unanimidade, não conhecer do recurso ordinário, por incabível. Brasília, 24 de maio de 2011. Firmado por assinatura digital (MP 2.200-2/2001) Alberto Luiz Bresciani de Fontan Pereira Ministro Relator fls. PROCESSO Nº TST-AR-4024-03.2010.5.00.0000 Firmado por assinatura digital em 26/05/2011 pelo sistema AssineJus da Justiça do Trabalho, conforme MP 2.200-2/2001, que instituiu a Infra-Estrutura de Chaves Públicas Brasileira.")
sub( "(.+ )(\\d{1,2} de (janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro) de \\d{4})(.+)", "\\2", your_text )
[1] "AR - 4024-03.2010.5.00.0000" "24 de maio de 2011"
To remove the non-date-containing items, you can use grepl to preselect:
> sub( "(.+ )(\\d{1,2} de (janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro) de \\d{4})(.+)", "\\2", your_text[grepl("\\d{1,2} de (janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro) de \\d{4}", your_text )])
[1] "24 de maio de 2011"
If you need to play with the patterns to get the hang of using capture-classes, there are nifty regex testing webpages.
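The same idea carries over to other engines; for instance, a minimal Python sketch of the month-alternation pattern (sample text shortened from the question):

```python
import re

# Alternation of the twelve Portuguese month names.
meses = ("janeiro|fevereiro|março|abril|maio|junho|julho|"
         "agosto|setembro|outubro|novembro|dezembro")
# Non-capturing group so findall() returns the whole date.
date_re = re.compile(r"\d{1,2} de (?:%s) de \d{4}" % meses)

texto = ("... por incabível. Brasília, 24 de maio de 2011. Firmado por "
         "assinatura digital (MP 2.200-2/2001) ...")
print(date_re.findall(texto))  # ['24 de maio de 2011']
```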

import.io: some data not imported or mixed in the same column

I'm using import.io's Magic API on this web page :
http://www.legifrance.gouv.fr/affichSarde.do?reprise=true&page=1&idSarde=SARDOBJT000007104398&ordre=null&nature=null&g=ls
Some types of info/fields are perfectly extracted.
But the extractor:
1. mixes the NOR number field (example: NOR DEVL1502938A) with a number that represents the number of pages (example: 10) in the same column, probably because they are both linked text (the tag is the following: <a title="[...]" href="[...]">);
2. then mixes the bibliographic reference field (example: JO du 04/04/2015 texte : 0080;10 pages 6232/6241) with the NOR number field. This seems strange to me because the NOR systematically precedes the reference and they are not on the same line in the web page (there is a <br/> tag before the bibliographic reference field);
3. frequently fails to load the content of the text summary (example: (Application de l'art. R. 411-1 et s. du code de l'environnement - Abrogation de l'arrêté du 15 mai 1986 fixant sur tout ou partie du territoire national des mesures de protection des oiseaux représentés dans le département de la Guyane)) into one column. Instead it spreads it over various columns. I see this happens when an <em> tag is inserted after the <span class="noir"> tag. Example:
Application de l'art. R. 213-49-2 du code de l'environnement -
Abrogation de l'arrêté du 10 août 2011 relatif à la définition du
périmètre de l'Etablissement public du Marais poitevin)
I have tried using the New Extractor, and working my way around via a special Google search results page https://www.google.fr/search?q=PROTECTION+FAUNE+et+FLORE+SAUVAGES+site:legifrance.gouv.fr+filetype:pdf. To no avail. The Google page alternative gives even worse results.
I would welcome any idea:
on the reason for the second problem,
and on how I can overcome all three problems on the Legifrance page.
Thanks a lot for reading this till the end :-)
PS: please note that I work primarily as a researcher. Although I can understand their logic, I am not familiar with regex or JSON. So if using them is needed, could you please either explain the logic behind it or show a large enough portion of the ideal code that I can replicate it effectively?

How to camelcase only a specific part of a string?

I have a CSV file like that:
"","LESCHELLES","","LESCHELLES"
"","SAINTE CROIX DE VERDON","","SAINTE CROIX DE VERDON"
"","SERRE CHEVALIER","","SERRE CHEVALIER"
"","SAINT JUST D'ARDECHE","","SAINT JUST D'ARDECHE"
"","NEUVILLE SUR VANNES","","NEUVILLE SUR VANNES"
"","ESCUEILLENS ET SAINT JUST","","ESCUEILLENS ET SAINT JUST"
"","PAS DES LANCIERS","","PAS DES LANCIERS"
"","PLAN DE CAMPAGNE","","PLAN DE CAMPAGNE"
And I'd like to convert it this way:
"","Leschelles","","LESCHELLES"
"","Sainte Croix De Verdon","","SAINTE CROIX DE VERDON","STE CROIX DE VERDON","93"
"","Serre Chevalier","","SERRE CHEVALIER","SERRE CHEVALIER","93"
"","Saint Just D'Ardeche","","SAINT JUST D'ARDECHE"
"","Neuville Sur Vannes","","NEUVILLE SUR VANNES"
"","Escueillens Et Saint Just","","ESCUEILLENS ET SAINT JUST","ESCUEILLENS ET ST JUST","91"
"","Luc","","LUC"
"","Pas Des Lanciers","","PAS DES LANCIERS","PAS DES LANCIERS","93"
"","Plan De Campagne","","PLAN DE CAMPAGNE","PLAN DE CAMPAGNE","93"
This would be nice. And better: lowercase all "whole" words like de, d', et, sur and des. This would give:
"","Leschelles","","LESCHELLES"
"","Sainte Croix de Verdon","","SAINTE CROIX DE VERDON","STE CROIX DE VERDON","93"
"","Serre Chevalier","","SERRE CHEVALIER","SERRE CHEVALIER","93"
"","Saint Just d'Ardeche","","SAINT JUST D'ARDECHE"
"","Neuville sur Vannes","","NEUVILLE SUR VANNES"
"","Escueillens et Saint Just","","ESCUEILLENS ET SAINT JUST","ESCUEILLENS ET ST JUST","91"
"","Luc","","LUC"
"","Pas des Lanciers","","PAS DES LANCIERS","PAS DES LANCIERS","93"
"","Plan de Campagne","","PLAN DE CAMPAGNE","PLAN DE CAMPAGNE","93"
Python has title():
Return a titlecased version of the string where words start with an
uppercase character and the remaining characters are lowercase.
The algorithm uses a simple language-independent definition of a word
as groups of consecutive letters. The definition works in many
contexts but it means that apostrophes in contractions and possessives
form word boundaries, which may not be the desired result:
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"
A workaround for apostrophes can be constructed
using regular expressions:
import re
def titlecase(s):
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0)[0].upper() +
                             mo.group(0)[1:].lower(),
                  s)

>>> titlecase("they're bill's friends.")
"They're Bill's Friends."
Update: here's a solution for the French problem:
import re, sys

def titlecase(s):
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0)[0].upper() +
                             mo.group(0)[1:].lower(),
                  s)

def french_parse(s):
    p = re.compile(
        r"( de la | sur | sous | la | de | les | du | le | au | aux | en | des | et )|(( d'| l')([a-z]+))",
        re.IGNORECASE)
    return p.sub(
        lambda mo: mo.group().find("'") > 0
            and mo.group()[:mo.group().find("'")+1].lower() +
                titlecase(mo.group()[mo.group().find("'")+1:])
            or (mo.group(0)[0].upper() + mo.group(0)[1:].lower()),
        s)

for line in sys.stdin:
    s = line[20:len(line)-1]
    p = s.find('"')
    t = s[:p]
    # Just output to show which names have been modified:
    if french_parse(titlecase(t)) != titlecase(t):
        print '"' + french_parse(titlecase(t)) + '"'
Just launch it like this:
python thepythonscript.py < file.csv
Then the output will be:
"Grenand les Sombernon"
"Touville sur Montfort"
"Fontenay en Vexin"
"Durfort Saint Martin de Sossenac"
"Monclar d'Armagnac"
"Ports sur Vienne"
"Saint Barthelemy de Beaurepaire"
"Saint Bernard du Touvet"
"Rosoy le Vieil"
While you may be able to pull this off with some vim regex magic, I think it'll be easier if you solve the problem in your favorite scripting language, and pipe selected text through that from vim using the ! command. Here's an (untested) example in PHP:
#!/usr/bin/env php
<?php
$specialWords = array('de', 'd\'', 'et', 'du', /* etc. */ );
foreach (file('php://stdin') as $line) {
    $line = ucwords($line);
    foreach ($specialWords as $w) {
        $line = preg_replace("/\\b$w\\b/i", $w, $line);
    }
    echo $line;
}
Make that script executable and store it somewhere on your PATH; then from vim, select some text and use :'<,'>! yourscript.php to convert (or just :%! yourscript.php for the whole buffer).
The csv.vim ftplugin helps when working with CSV files. Though it does not offer a "substitute in column N" function directly, it may get you near that. At least you can arrange the columns into neat blocks, and then apply a simple regexp or a visual blockwise selection.
But I second the point that a toolchain better suited to manipulating CSV files may be preferable to doing this entirely in Vim. It also depends on whether it's a one-off task or something you do frequently.
Here is a one-liner Vim command.
%s/"[^"]*",\zs\("[^"]*"\)/\=substitute(substitute(submatch(0), '\<\(\a\)\(\a*\)\>', '\u\1\L\2', 'g'), '\c\<\(de\|d\|l\|sur\|le\|la\|en\|et\)\>', '\L&', 'g')
This expects no double quotes inside the first two fields.
The idea behind this solution is to rely on :h :s\= to execute a series of functions on the second field once it is found: first change each word to TitleCase, then put all the linking words in lowercase.
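The same two-step idea (titlecase every word, then lowercase the linking words again) is easy to prototype outside Vim; a minimal Python sketch, where the word list is illustrative rather than exhaustive:

```python
import re

# Linking words to keep lowercase ("d" and "l" cover d'/l' contractions).
LIANTS = {"de", "d", "l", "sur", "le", "la", "en", "et"}

def to_title(name):
    # Step 1: TitleCase every alphabetic run.
    s = re.sub(r"[A-Za-z]+", lambda m: m.group().capitalize(), name)
    # Step 2: lowercase the linking words again.
    return re.sub(r"\b[A-Za-z]+\b",
                  lambda m: m.group().lower()
                            if m.group().lower() in LIANTS else m.group(),
                  s)

print(to_title("SAINTE CROIX DE VERDON"))  # Sainte Croix de Verdon
print(to_title("SAINT JUST D'ARDECHE"))    # Saint Just d'Ardeche
```

This mirrors the two nested substitute() calls of the Vim one-liner, just in a form that is easier to test and extend.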