detect string between brackets using regex - python-2.7

How can I use a regex in Python to extract only the strings between brackets from a file?
Suppose I have some text like this:
</seg>
<src>Sono stati riportati casi di (sgomberi forzati) e violazioni dei diritti umani da parte della polizia, ma su scala minore rispetto agli anni precedenti.</src>
and I want to detect only (sgomberi forzati)
I use this:
for line in file.readlines():
    m = re.compile('\((.*?)\)', re.DOTALL).findall(line)
    print m
but it does not print what I need: it also prints empty lists, like this:
[]
[u'sgomberi forzati']

Escape your () as follows:
m = re.compile('\((.*?)\)', re.DOTALL).findall(line)
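The empty lists in the output come from lines that contain no parentheses: findall returns [] when nothing matches. A minimal sketch in Python 3 syntax that skips such lines; the function name bracketed_strings and the shortened sample lines are made up for illustration:

```python
import re

# Compile once, outside the loop; \( and \) match literal parentheses.
BRACKETED = re.compile(r'\((.*?)\)')

def bracketed_strings(lines):
    """Collect every parenthesised substring, ignoring lines without any."""
    found = []
    for line in lines:
        # findall returns [] for a line with no match, so extend adds nothing.
        found.extend(BRACKETED.findall(line))
    return found

sample = [
    "</seg>",
    "<src>Sono stati riportati casi di (sgomberi forzati) e violazioni dei diritti umani.</src>",
]
print(bracketed_strings(sample))
```

Collecting the matches into one list, instead of printing per line, keeps the no-match lines out of the result.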

Related

how to find the second number in a string with Regex expression

I want a regular expression that finds the last match.
the result that I want is "103-220898-44"
The expression that I'm using is ([^\d]|^)\d{3}-\d{6}-\d{2}([^\d]|$). This doesn't work because it matches the first result "100-520006-90" and I want the last "103-220898-44"
Example A
Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00
/.*(\d{3}-\d{6}-\d{2})(?:[^\d]|$)/gs
If you add .* to the beginning of your regex, it captures only the last occurrence, because .* is greedy and consumes as much of the input as possible before backtracking. You also need the single-line flag (s) so that .* matches newlines.
Note: I replaced some (...) strings with (?:...) since their aim is grouping, not capturing.
Demo: https://regex101.com/r/fygL1X/2
const regex = new RegExp('.*(\\d{3}-\\d{6}-\\d{2})(?:[^\\d]|$)', 'gs')
const str = `Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
\$ 1.000.00`;
let m;
m = regex.exec(str)
console.log(m[1])
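For comparison, a rough Python sketch of the same idea, plus an alternative that collects all matches and takes the last one. The variable names and the lookaround-based pattern in the second option are my own choices, not part of the answer above:

```python
import re

text = """Transferencia exitosa
Comprobante No. 0000065600
26 May 2022 - 03:32 p.m.
Producto origen
Cuenta de Ahorro
Ahorros
100-520006-90
Producto destino
nameeee nene
Ahorros / ala
103-220898-44
Valor enviado
$ 1.000.00"""

# Option 1: a greedy .* prefix with re.DOTALL (the Python counterpart of the s flag).
greedy = re.search(r'.*(\d{3}-\d{6}-\d{2})(?:\D|$)', text, re.DOTALL)
last_by_greedy = greedy.group(1) if greedy else None

# Option 2: find every account number and keep the final one.
ACCOUNT = re.compile(r'(?<!\d)\d{3}-\d{6}-\d{2}(?!\d)')
matches = ACCOUNT.findall(text)
last_by_list = matches[-1] if matches else None

print(last_by_greedy, last_by_list)
```

The second option avoids the backtracking that a leading .* causes on long inputs, at the cost of building the full list of matches.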

Find words between comma and list of keywords RegEx

I have a large text. I would like to find the address of the owner. My input is something like...
INPUT: (...) seiscientos catorce guión ocho, domiciliado en calle
Santillana número trescientos sesenta y nueve, Valle Lo Campino,
comuna de Quilicura, Región Metropolitana, constituyeron una sociedad
por acciones (...)
keywords_cap = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO', 'domiciliado en', 'domiciliada en',
'Domiciliado en', 'Domiciliada en']
keywords_cap = map(re.escape, keywords_cap)
keywords_cap.sort(key=len, reverse=True)
obj = re.compile(r'\b(?:{})\s*(.*?)\.'.format('|'.join(keywords_cap)))
obj2 = obj.search(mensaje)
if obj2:
    company_name = obj2.group(1)
else:
    company_name = "None"
OUTPUT: calle Santillana número trescientos sesenta y nueve
Something is wrong, because I would like to extract the text between one of the keywords and the next comma (,) or the next period (.).
Instead, the extraction runs from the keyword all the way to the next period (.).
Can someone help me with this foolishness?
The (.*?)\. pattern matches any chars other than line break chars, as few as possible before the leftmost . char. It can be "converted" to ([^.]*), a negated character class pattern that matches 0 or more chars other than . (note that the only difference from the original pattern is that negated character classes also match linebreaks, which is a good feature in this case).
The solution will be to just add , into the character class:
obj = re.compile(r'\b(?:{})\s*([^.,]*)'.format('|'.join(keywords_cap)))
                              ^^^^^^^^
The regex will look like
\b(?:DOMICILIO:|Domicilio:|Domicilio|DOMICILIO|domiciliado en|domiciliada en|Domiciliado en|Domiciliada en)\s*([^.,]*)
See the regex demo.
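A self-contained sketch of the corrected pattern in Python 3 syntax, run against the input from the question. The whitespace normalisation at the end is my addition, since the negated character class also matches the line break inside the address:

```python
import re

keywords = ['DOMICILIO:', 'Domicilio:', 'Domicilio', 'DOMICILIO',
            'domiciliado en', 'domiciliada en', 'Domiciliado en', 'Domiciliada en']

# Longest alternatives first, so 'Domicilio:' is tried before 'Domicilio'.
alternation = '|'.join(sorted(map(re.escape, keywords), key=len, reverse=True))
pattern = re.compile(r'\b(?:{})\s*([^.,]*)'.format(alternation))

mensaje = ("seiscientos catorce guión ocho, domiciliado en calle\n"
           "Santillana número trescientos sesenta y nueve, Valle Lo Campino,\n"
           "comuna de Quilicura, Región Metropolitana, constituyeron una sociedad\n"
           "por acciones")

match = pattern.search(mensaje)
# Collapse the embedded newline into a single space.
address = " ".join(match.group(1).split()) if match else None
print(address)
```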

What's an appropriate regex to split this line using scala?

I'm trying to split this line coming from a CSV file, to obtain the different matching groups from this (sample) line (file has around 750k lines):
919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT
As you can see, there are four main parts in the line: id, free text, sentiment, and option. The content part (La dama de hierro...) can contain many different characters, and I don't know how to build a correct regex to obtain the fields like this: (id, txt, sent, opt).
What I've tried so far:
val fullRegex = """(\d+),(.+?),(N|P|NEU|NONE)(,\W+|;\W+)re?""".r
It works for some lines but fails for others.
Regex is powerful but sometimes it's hard to get right and cover all possible input formats. In this case it might not be needed.
val in = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val inSplit = in.split(",")
val id = inSplit.head // String = 919191911919
val txt = inSplit.tail.init.init.mkString(",") // free form text
val sent = inSplit.init.last // String = P
val opt = inSplit.last // String = AGREEMENT
As Bruno Grieder pointed out in the comments to the question, this can be handled more robustly without using regular expressions.
If this is not a well formatted CSV file (meaning, fields containing commas enclosed in quotation marks, quotation marks in field values escaped etc), an alternative is to realize that the first field does give you the ID and the last two fields do give you the sentiment and the option. Everything else is free text, so the structure of a line is rather simple.
Of course, if the file is indeed well-formatted CSV, use a library built for that purpose.
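As an illustration of the library route, here is a minimal sketch using Python's standard csv module. The two data rows are invented, well-formed examples, not lines from the asker's file:

```python
import csv
import io

# Hypothetical well-formed input: the free-text field is quoted,
# so the commas inside it are not treated as separators.
data = '1,"text with one, two commas",P,AGREEMENT\n2,plain text,N,DISAGREEMENT\n'

rows = list(csv.reader(io.StringIO(data)))
for record_id, text, sentiment, option in rows:
    print(record_id, text, sentiment, option, sep=" | ")
```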
Assuming this is not well-formatted CSV, first split by a comma, put the first and the last two fields in their respective variables, and join the rest of the fields using a comma to recover the text.
I don't know much Scala, so the code is rather primitive. Improvements welcome:
val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
val id :: rest = line.split(",").toList
val text = rest.slice(0, rest.size - 2).mkString(",")
val sentiment = rest(rest.size - 2)
val option = rest.last
for (x <- List(id, text, sentiment, option))
  println(x)
Output:
$ scala test.scala
919191911919
"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn
P
AGREEMENT
This will also work with embedded commas in the text (although there is some extra work involved in splitting and recombining the text field). For example, if line is:
val line = "1,this is some text with one, two, three, and four commas (,),7,8"
This is the output you'll get:
1
this is some text with one, two, three, and four commas (,)
7
8
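The same "first field, last two fields, everything else is text" idea can be sketched in Python without recombining the pieces, by splitting from both ends. The function name split_record is hypothetical, and the sketch assumes every line has at least four comma-separated fields:

```python
def split_record(line):
    """Split into (id, text, sentiment, option); commas inside text are kept."""
    # Peel the last two fields off the right-hand side...
    head, sentiment, option = line.rsplit(",", 2)
    # ...then the id off the left; whatever remains is the free text.
    record_id, text = head.split(",", 1)
    return record_id, text, sentiment, option

for field in split_record("1,this is some text with one, two, three, and four commas (,),7,8"):
    print(field)
```

A line with fewer than four fields raises ValueError at the unpacking step, which may be a useful signal for malformed rows.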
If you are sure that the text is enclosed in double quotes, you can first replace all commas inside the double quotes, then split at the commas, then put the commas back. The drawback of this solution is that you need a Unicode character that is guaranteed not to be present in your file:
object CSVFixer {
  def main(args: Array[String]) {
    split(line) foreach println
  }
  val line = """919191911919,"La dama de hierro descubrió la ternura".(via#annabosch) Margaret Thatcher (86 años); ayer en el parque: http://host.com/gm2EEXqn ,P,AGREEMENT"""
  private val AltSep = '\u0080' // Unicode char that we reasonably expect to not have in the input
  val fieldSeparator = ","
  private[this] def unSep(s: String) = {
    val SepChr = fieldSeparator.charAt(0)
    var inQS = false
    for (c <- s) yield {
      c match {
        case '"' =>
          inQS = !inQS; c
        case SepChr if inQS =>
          AltSep
        case _ => c
      }
    }
  }
  def split(line: String) =
    unSep(line).split(fieldSeparator, -1) // do not discard trailing empty strings
      .map(_.replace(AltSep, fieldSeparator.charAt(0)))
      .map(_.replaceAll("\"", ""))
}

Regex match a spanish word ending with a dot(.) and underscore

This is the regex I'm trying:
([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+)(\.\_)
Here are two examples that it should match against:
EL ROSARIO / ESCUINAPA._ Con poco más de 4 mil pesos...
and
Cuautitlán._ Con poco más de 4 mil pesos...
The expression works for the first example but not for the second, probably because of encoding:
docHtml = urllib.urlopen(link).read()
#using the lxml function html
tree = html.fromstring(docHtml)
newsCity = CSSSelector('#pid p')
try:
    city_paragraph = newsCity(tree)
    city_match = re.search('([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+\._)',city_paragraph[0].text)
Your regular expression appears to be correct. I suspect that the bug is in how you're reading the strings that you're matching against. You want something like:
import codecs
f = codecs.open('spanish.txt', encoding='utf-8')
for line in f:
    print repr(line)
Finally figured it out:
newsCity = CSSSelector('#tamano5 p')
city_paragraph = newsCity(tree)
city_p = city_paragraph[0].text
city_utf=city_p.encode("utf-8")
city_match = re.search('([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+\._)',city_utf)
This gives me the expected result which in this case was to extract the city string using re.search.
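For readers on Python 3, a hedged sketch of the same extraction: there, str is Unicode and \w already matches accented letters, so the explicit list of accented characters and the encode step should not be needed. This is a different technique from the Python 2 fix above, and the function name extract_city is made up:

```python
import re

# In Python 3, \w in a str pattern matches Unicode word characters (á, ñ, Ü, ...).
CITY = re.compile(r'([\w\s/\-]+)\._')

def extract_city(paragraph):
    """Return the text before the first '._' marker, or None if absent."""
    match = CITY.search(paragraph)
    return match.group(1).strip() if match else None

print(extract_city("EL ROSARIO / ESCUINAPA._ Con poco más de 4 mil pesos..."))
print(extract_city("Cuautitlán._ Con poco más de 4 mil pesos..."))
```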

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize every first letter every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
If you want to use pure regular expressions you must use the \u.
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
If it is in bash you must put:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string; just match the first lowercase letter of each word, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces it with its uppercase variant.
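A Python sketch of the same idea, adapted to use a lookbehind instead of consuming the preceding characters, so it is not a literal port of the JavaScript regex. The name capitalize_words is hypothetical:

```python
import re

# A non-whitespace character at the start of the string, or one that directly
# follows whitespace or one of the characters - " ' ( [ {
WORD_START = re.compile(r'''(?:^|(?<=[\s\-"'(\[{]))\S''')

def capitalize_words(text):
    # str.upper() is Unicode-aware, so accented letters are handled too.
    return WORD_START.sub(lambda m: m.group().upper(), text)

print(capitalize_words("this is a test for-stackoverflow"))
```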
My solution using JavaScript:
function capitalize(str) {
  var reg = /\b([a-zÁ-ú]{3,})/g;
  return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
With ES6 JavaScript:
const capitalize = str =>
  str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] matches lowercase ASCII letters plus the accented letters in the range Á to ú.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} requires at least three such letters in a row, so short words such as "de" and "e" are skipped.
\b([a-zÁ-ú]{3,}) anchors the match at a word boundary, so a match can only start at the beginning of a word. The parentheses group the expression.
Having selected the words, I capitalize the first letter of each match w and keep the rest unchanged:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
joining the two
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
Read about the "positive lookbehind" to see how (?<=\s) works.
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted that, when using any of the case-change operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
the end delimiter, which stops a conversion started by \L or \U, should be used:
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\E\2/'
this will make
R.E.A.C De Boeremeakers
from
r.e.a.c de boeremeakers
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
using
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
index = matches(0).Index + matches(0).Length
For Each Match As Match In matches
    Try
        outputText.Append(UCase(Match.Value))
        outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
    Catch ex As Exception
        outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
    End Try
Next
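A rough Python counterpart of the VB.NET code above, for comparison. Python's re module needs fixed-width lookbehinds, so the (?<=\A|[ .]) part is rewritten as an alternation between ^ and a lookbehind; that rewrite and the name capitalize_initials are my assumptions:

```python
import re

# A lowercase letter at the start of the string or after a space or dot,
# and followed by a letter, a dot, or a space.
INITIAL = re.compile(r'(?:^|(?<=[ .]))[a-z](?=[a-z. ])')

def capitalize_initials(text):
    # re.sub rebuilds the string, so no manual index bookkeeping is needed.
    return INITIAL.sub(lambda m: m.group().upper(), text)

print(capitalize_initials("r.e.a.c de boeremeakers"))
```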