Address extraction in Turkish with REGEX - regex

I'm new to regex.
I want to extract the address line from Turkish text,
but in Turkish there is no standard way of writing an address.
For instance, district = mahalle.
A district can be written in any of the following ways:
"Mah." "Mh." "MAH." "MH" "mh." "mah." or "mahalle"
regex = ((.*)((\b[Mm][Aa]?[Hh].?)(.*)))
This regex finds every form of district except the last one.
There are two possible forms:
1. "mah." or "mh."
2. "mahalle"
How can I match both with the same regex?
Note: I don't want to use an | (or) alternation, i.e. .... .... | (.*)mahalle(.*)

Since there aren't many options to begin with, you can use the OR operator to avoid complexity. Take a look at how Stanford NLP does it with US states:
ABSTATE = Ala|Ariz|[A]z|[A]rk|Calif|Colo|Conn|Ct|Dak|[D]el|Fla|Ga|[I]ll|Ind|Kans?|Ky|[L]a|[M]ass|Md|Mich|Minn|[M]iss|Mo|Mont|Neb|Nev|Okla|[O]re|[P]a|Penn|Tenn|[T]ex|Va|Vt|[W]ash|Wisc?|Wyo
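So, taking our example: Mah\.|Mh\.|MAH\.|MH|mh\.|mah\.|mahalle (the dots are escaped so that they match a literal period rather than any character). You can of course simplify this by using the case-insensitive flag so that one branch covers Mah./MAH./mah.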
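If alternation is unwanted, as the question asks, optional parts can cover both forms in one branch. The following is only a sketch: the sample text is made up, and the pattern deliberately ignores inflected forms such as "Mahallesi".

```python
import re

# "m", optional "a", "h", optional "alle", optional dot, with no alternation.
# The lookahead (?!\w) keeps it from matching the start of a longer word
# such as the name "Mahmut".
MAHALLE_RE = re.compile(r"\bma?h(?:alle)?\.?(?!\w)", re.IGNORECASE)

# Hypothetical sample text containing every spelling listed in the question.
sample = "Mah. Mh. MAH. MH mh. mah. mahalle Mahmut"

found = MAHALLE_RE.findall(sample)
print(found)
```

The case-insensitive flag does the work of the [Mm][Aa]?[Hh] character classes in the original attempt.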

Related

Reserved Keyword search, but in reverse. Regex

I'm writing code that looks through a string and picks out the words that are not considered "reserved keywords". I am new to regex, but have spent quite some time learning what kind of structure I need to look for reserved words. So far, I've written something along the lines of this:
\b(import|false|int|etc)\b
I am going to use an array list to feed in all of the reserved words into the string above, but I need it to work opposite of how it works now. I've figured out how to get it to search for the specific words with the code above, but how can I get it to look for the words that are NOT listed above. I've tried incorporating the ^ symbol, but I'm not having any luck there. Any regex veterans out there who see what I'm doing wrong?
There are two obvious possibilities, depending on what (else) you are doing.
Possibility 1: Use a dict or set:
You could just match words and then test for membership in a set or dictionary:
Reserved_words = set('import false true int ...'.split())
word_rx = r'\b\w+\b' # Or whatever rule you like for "words"
for m in re.finditer(...):
    word = m.group(0)
    if word in Reserved_words:
        print("Found reserved word:", word)
    else:
        print("Found unreserved word:", word)
This approach is frequently used in lexers, where it is easier to just write a catch-all "match a word" rule, and then check a matched word against a list of keywords, than it is to write a fairly complex rule for each keyword and a catch-all to deal with the rest.
You can use a dict if you want to associate some kind of payload with the keyword (such as a class handle for instantiating a particular AST node type, etc.).
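A self-contained version of the set-based approach might look like the sketch below. The keyword list and the helper name unreserved_words are placeholders chosen for illustration, not part of the question.

```python
import re

# Hypothetical keyword list; substitute the real one.
RESERVED_WORDS = {"import", "false", "true", "int"}

# Catch-all "word" rule; adjust to whatever counts as a word for you.
WORD_RX = re.compile(r"\b\w+\b")


def unreserved_words(text):
    """Return the words in text that are not reserved keywords, in order."""
    return [
        m.group(0)
        for m in WORD_RX.finditer(text)
        if m.group(0) not in RESERVED_WORDS
    ]


print(unreserved_words("import foo int integral false bar"))
```

Because membership is tested on the whole matched word, "integral" is kept even though it starts with the keyword "int".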
Possibility 2: Use named groups:
Another possibility is that you could use named groups in your regex to capture keyword/nonkeyword values:
word_rx = r'\b(?:(?P<keyword>import|int|true|false|\.\.\.)|(?P<nonkeyword>\w+))\b'
for m in re.finditer(...):
    word = m.group('keyword')
    if word:
        print("Found keyword:", word)
    else:
        word = m.group('nonkeyword')
        print("Found nonkeyword:", word)
This is going to be slower than the previous approach, because of prefixes: "int" matches a keyword, but "integral" starts to match an int, then fails, then backtracks to the other branch, then matches a nonkeyword. :-(
However, if you are strongly tied to a mostly-regex implementation, for example, if you have many other regex-based rules, and you are processing them in a loop, then go for it!
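To see the prefix behaviour described above, here is a small runnable sketch. It uses Python's (?P<name>...) syntax for named groups and wraps both branches in one group so that the word boundaries apply to each; the keyword list and the classify helper are illustrative assumptions.

```python
import re

# Both alternatives sit inside one non-capturing group, so the leading and
# trailing \b apply to whichever branch matches.
KEYWORD_RX = re.compile(
    r"\b(?:(?P<keyword>import|int|true|false)|(?P<nonkeyword>\w+))\b"
)


def classify(text):
    """Return (group_name, word) pairs for every word in text."""
    result = []
    for m in KEYWORD_RX.finditer(text):
        kind = m.lastgroup  # name of the named group that matched
        result.append((kind, m.group(kind)))
    return result


print(classify("int integral"))
```

On "integral" the engine first matches "int" in the keyword branch, fails the trailing \b, and only then falls back to the nonkeyword branch.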

Regex String for Restructuring Author Firstname, Lastname, Title

I want to convert strings in the format
The European Union - A Very Short Introduction - Pinder, John
to
John Pinder - The European Union - A Very Short Introduction
I am having trouble matching on "Pinder" and "John" to reformat in the desired way.
You can use:
^(.*?)(?:-\s+(\w+),\s+(\w+))$
If you may have authors with multiple names (such as 'von Clausewitz, Carl') this won't work. Instead, maybe:
^(.*)(?:-\s+([^,]+?),\s+(\w+))$
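As a sketch of how the second pattern could be used to build the reordered string, here it is in Python. The helper name author_first is hypothetical, and returning the input unchanged when nothing matches is an assumption.

```python
import re

# Same pattern as above: title, then "- Last name, First name" at the end.
AUTHOR_RX = re.compile(r"^(.*)(?:-\s+([^,]+?),\s+(\w+))$")


def author_first(line):
    """Move 'Last, First' from the end of the line to the front as 'First Last'."""
    m = AUTHOR_RX.match(line)
    if m is None:
        return line  # assumption: leave unrecognised lines untouched
    title, last, first = m.groups()
    return f"{first} {last} - {title.strip()}"


print(author_first("The European Union - A Very Short Introduction - Pinder, John"))
print(author_first("On War - von Clausewitz, Carl"))
```

The greedy (.*) keeps every earlier hyphen inside the title, so only the last "- " separates the title from the author.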
There are many ways to approach the problem, all requiring some assumptions not specified in your question. Here is one solution...
^.+-\s+(.+),\s+(.+)$
It works by greedily consuming as many characters as possible up to the last hyphen followed by whitespace, which acts as the delimiter before the first capture group. It then assumes that a comma followed by whitespace separates the last name from the first name, and that the first name runs to the end of the string.
Depending on what you know about the uniformity of the data, this may or may not work for you, but I thought it would be nice to have a solution which does not try to restrict characters in name, but rather the rest of the format.
Use this code:
$code = preg_match_all('/(?:.*?) - (?:.*?) -\s*(.*?),\s*(.*)/', $string, $matches);
This will give you an array and $matches[1] will give you the last name (in this case "Pinder") and $matches[2] will give you the first name ("John"). You can then turn it back into a string if you want to using $lastname = implode('',$matches[1]);.

Regular Expressions for City name

I need a regular Expression for Validating City textBox, the city textbox field accepts only Letters, spaces and dashes(-).
This answer assumes that the letters which @Manaysah refers to also encompass the use of diacritical marks. I've added the single quote ' since many names in Canada and France have it. I've also added the period (dot) since it's required for contracted names.
Building upon @UID's answer I came up with,
^([a-zA-Z\u0080-\u024F]+(?:\. |-| |'))*[a-zA-Z\u0080-\u024F]*$
The list of cities it accepts:
Toronto
St. Catharines
San Francisco
Val-d'Or
Presqu'ile
Niagara on the Lake
Niagara-on-the-Lake
München
toronto
toRonTo
villes du Québec
Provence-Alpes-Côte d'Azur
Île-de-France
Kópavogur
Garðabær
Sauðárkrókur
Þorlákshöfn
And what it rejects:
A----B
------
*******
&&
()
//
\\
I didn't add in the use of brackets and other marks since it didn't fall within the scope of this question.
I've stayed away from \s for whitespace. Tabs and line feeds aren't part of a city name and shouldn't be used in my opinion.
This can be arbitrarily complex, depending on how precise you need the match to be, and the variation you're willing to allow.
Something fairly simple like ^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$ should work.
Warning: this does not match cities like München, etc. To handle those you need to extend the [a-zA-Z] part of the expression and define which characters are allowed for your particular case.
Keep in mind that each separator is exactly one character, so something like San----Francisco, or a name with several consecutive spaces, is rejected. On the other hand, \s also accepts a tab or line feed between words.
It translates to something like:
1 or more letters, followed by a block of: one whitespace character or dash and then 1 or more letters; this last block can occur 0 or more times.
The odd-looking part is the ?: bit. If you're not familiar with regexes it might be confusing, but it simply states that the piece of regex between the parentheses is not a capturing group (I don't want to capture the part it matches to reuse later), so the parentheses are only used to group the expression (and not to capture the match).
"New York" // passes
"San-Francisco" // passes
"San Fran Cisco" // passes (sorry, needed an example with three tokens)
"Chicago" // passes
" Chicago" // doesn't pass, starts with spaces
"San-" // doesn't pass, ends with a dash
Adding my answer in case anybody needs it while searching for a regex for city names, like I did.
Please use this:
^[a-zA-Z\u0080-\u024F\s\/\-\)\(\`\.\"\']+$
Many city names contain dashes, such as Soddy-Daisy, Tennessee, or special characters like ñ in La Cañada Flintridge, California.
Hope this helps!
Here is the one I've found works best.
For regex flavours that support \p{L} (.NET, PHP, Golang):
/^\p{L}+(?:([\ \-\']|(\.\ ))\p{L}+)*$/u
For flavours that do not support \p{L}, replace it with [a-zA-Z\u0080-\u024F].
So for JavaScript or Python regex use:
/^[a-zA-Z\u0080-\u024F]+(?:([\ \-\']|(\.\ ))[a-zA-Z\u0080-\u024F]+)*$/
Whitelisting a bunch of characters is easy, but there are things to watch for in your regex:
Consecutive non-alphabetical characters should not be allowed, i.e. "Los  Angeles" should fail because it has two spaces.
Periods should have a space after them, i.e. "St.Albert" should fail because it's missing the space.
Names cannot start or end with non-alphabetical characters, i.e. "-Chicago-" should fail.
A whitespace character \s is not the same as a literal space (\ ): a tab or line feed character would also pass, so an explicit space character should be used instead.
Note: When building regex rules, I find https://regex101.com/tests is very helpful, as you can easily create unit tests
js: https://regex101.com/r/cgJwc0/1/tests
php: https://regex101.com/r/Yo3GV2/1/tests
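For Python, the character-range variant and the rules listed above can be checked with a short sketch like this one. The names LETTER, CITY_RX and is_city are illustrative, and fullmatch is used in place of the ^ and $ anchors.

```python
import re

# Latin letters plus the accented range used in the answer above.
LETTER = r"[a-zA-Z\u0080-\u024F]"

# One or more letters, then any number of (single separator + letters).
# A separator is a space, a hyphen, an apostrophe, or a period followed by a space.
CITY_RX = re.compile(rf"{LETTER}+(?:(?:[ \-']|\. ){LETTER}+)*")


def is_city(name):
    """Return True if the whole string looks like a city name."""
    return CITY_RX.fullmatch(name) is not None


for candidate in ["Los Angeles", "Los  Angeles", "St. Albert", "St.Albert", "-Chicago-"]:
    print(repr(candidate), is_city(candidate))
```

Using fullmatch also avoids the Python quirk where $ matches just before a trailing newline.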
Here's one that will work with most cities, and has been tested:
^[a-zA-Z\u0080-\u024F]+(?:\. |-| |')*([1-9a-zA-Z\u0080-\u024F]+(?:\. |-| |'))*[a-zA-Z\u0080-\u024F]*$
Python code below, including its test.
import re
import pytest

CITY_RE = re.compile(
    r"^[a-zA-Z\u0080-\u024F]+(?:\. |-| |')*"  # a word
    r"([1-9a-zA-Z\u0080-\u024F]+(?:\. |-| |'))*"
    r"[a-zA-Z\u0080-\u024F]*$"
)

def is_city(value: str) -> bool:
    valid = CITY_RE.match(value) is not None
    return valid

# Tests
@pytest.mark.parametrize(
    "value,expected",
    (
        ("1", False),
        ("Toronto", True),
        ("Saint-Père-en-Retz", True),
        ("Saint Père en Retz", True),
        ("Saint-Père en Retz", True),
        ("Paris 13e Arrondissement", True),
        ("Paris 13e Arrondissement ", True),
        ("Bouc-Étourdi", True),
        ("Arnac-la-Poste", True),
        ("Bourré", True),
        ("Å", True),
        ("San Francisco", True),
    ),
)
def test_is_city(value, expected):
    valid = is_city(value)
    assert valid is expected
^[a-zA-Z\- ]+$
Also this might be useful http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
use this regex:
^[a-zA-Z-\s]+$
After many hours of looking for a city regex matcher I have built this and it meets my needs 100%
(?ix)^[A-Z.-]+(?:\s+[A-Z.-]+)*$
expression for testing city.
Matches
City
St. City
Some Silly-City
City St.
Too Many Words City
It seems that there are many flavors of regex; I built this one for my Java needs and it works great:
^[a-zA-Z.-]+(?:[\s-][\/a-zA-Z.]+)*$
This will help identify some city names like St. Johns, Baie-Sainte-Anne, Grand-Sault/Grand Falls
I like shepley's suggestion, but it has a couple of flaws in it.
If you change shepley's regex to this, it will not accept other special characters:
^([a-zA-Z\u0080-\u024F]{1}[a-zA-Z\u0080-\u024F. \-']*[a-zA-Z\u0080-\u024F.']{1})$
I use that one:
^[a-zA-Z\\u0080-\\u024F.]+((?:[ -.|'])[a-zA-Z\\u0080-\\u024F]+)*$
You can try this:
^\p{L}+(?:[\s\-]\p{L}+)*$
The above regex will:
Restrict leading and trailing spaces, hyphens
Match cities with names like Néewiller-près-lauterbourg
Here are some fun edge-cases:
's Graveland
's Gravendeel
's Gravenpolder
's Gravenzande
's Heer Arendskerke
's Heerenberg
's Heerenhoek
's Hertogenbosch
't Harde
't Veld
't Zand
100 Mile House
6 October City
So, don't forget to add ' and 0-9 as a possible first character of the city name.
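One way to allow for these cases, sketched in Python under a few assumptions: the leading particle is an apostrophe, one lowercase letter and a space, digits are allowed anywhere a letter is, and the names WORD, CITY_RX and is_city_name are made up for the example. Note that a purely numeric string also passes this check.

```python
import re

# Letters (including the accented range used earlier) or digits.
WORD = r"[0-9a-zA-Z\u0080-\u024F]+"

# Optional Dutch-style particle ('s or 't plus a space), then words joined
# by a single space, hyphen, apostrophe, or a period followed by a space.
CITY_RX = re.compile(rf"(?:'[a-z] )?{WORD}(?:(?:[ \-']|\. ){WORD})*")


def is_city_name(name):
    """Return True if the whole string fits the relaxed city-name pattern."""
    return CITY_RX.fullmatch(name) is not None


for candidate in ["'s Graveland", "'t Zand", "100 Mile House", "6 October City", "-Chicago"]:
    print(repr(candidate), is_city_name(candidate))
```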

C++ search text in boolean mode

I basically have two questions.
1. Is there a C++ library that would do full-text boolean search just like in MySQL? E.g.,
Let's say I have:
string text = "this is my phrase keywords test with boolean query.";
string booleanQuery = "\"my phrase\" boolean -test -\"keywords test\" OR ";
booleanQuery += "\"boolean search\" -mysql -sql -java -php";
//where quotes ("") contain phrases, (-) is NOT keyword and OR is logical OR.
If the answer to the first is no, then:
2. Is it possible to search for a phrase in text? E.g.,
string text =//same as previous
string keyword = "\"my phrase\"";
//here what's the best way to search for my phrase in the text?
TR1 has a regex class (derived from Boost::regex). It's not quite like you've used above, but reasonably close. Boost::phoenix and Boost::Spirit also provide similar capabilities, but for a first attempt the Boost/TR1 regex class is probably a better choice.
As to the 2nd point: string class does have a method find, see http://www.cppreference.com/wiki/string/find
Sure there is, try Spirit:
http://boost-spirit.com/home/

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        String test = "filename_author";
        String result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
    }
}
Here is a Python example:
from re import sub
test = "filename_author";
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.
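The same split-and-swap idea can be sketched in Python without a regex. The helper name swap_parts is hypothetical, and splitting only at the first underscore is an assumption about how names with several underscores should be handled.

```python
def swap_parts(name, sep="_"):
    """Swap the text before and after the first separator."""
    head, found, tail = name.partition(sep)
    if not found:
        return name  # no separator: nothing to swap
    return f"{tail}{sep}{head}"


print(swap_parts("filename_author"))
```

With this choice "a_b_c" becomes "b_c_a", which mirrors what the ([^_]*)_(.*) pattern does.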