Elasticsearch regexp doesn't work - regex

I need to write a regexp in Elasticsearch to filter some data.
The field I filter on is the name of person. The data are not always well formatted (sometimes, there is no first name, sometimes, the family name is followed by a period or a comma or 'comma+first name' or 'point+first name'....).
For example, using "bouchard" I get the following matches:
"bouchard", "bouchard, m.", "bouchard, j.", "bouchard j.p.", "bouchard. j.p."
I also need to exclude names that merely begin with the same prefix, like "bouchardat".
I tried many regexps and finally found that an exclusion may yield better results:
"query" : { "regexp" : {
"RECORDEDBY" : "bouchard([^a-z].*)"
}}
This doesn't work because it returns "bouchard, m.", "bouchard, j.", "bouchard j.p." but not "bouchard. j.p." and not "bouchard".
I tried some regexps with + and .* but they didn't work.
( "bouchard([^a-z].*.*)" "bouchard([^a-z]*+.*)")
To make it clear, I want to allow:
bouchard
bouchard, m.
bouchard, j.
bouchard j.p.
bouchard. j.p.
I want to exclude
bouchardat
Any advice is welcome.

In this case, you could use an alternation to reject any [a-z] suffix unless a separator character such as ' ', '.', or ',' follows the word you are looking for:
((bouchard)+?([ .,]+)[ ,.a-zA-Z]*)|(bouchard[^a-zA-Z]?)
The first alternative requires at least one separator ([ .,]+) after the name; the second alternative, after the pipe |, accepts the bare name optionally followed by a single non-letter. Together they match:
bouchard
bouchard, m.
bouchard, j.
bouchard j.p.
bouchard. j.p.
and they reject the following, because neither alternative allows a letter directly after the name:
bouchardat
Regex101
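As a rough check outside Elasticsearch, the pattern above can be tried in Python. This is only a sketch: it assumes that re.fullmatch is a fair stand-in for the whole-value matching of Elasticsearch regexp queries, and Lucene's regex syntax differs from Python's in other respects, so results on a real index may differ (for example, if the field is analyzed into separate tokens). The names PATTERN and matches are made up for illustration.

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = re.compile(r"((bouchard)+?([ .,]+)[ ,.a-zA-Z]*)|(bouchard[^a-zA-Z]?)")

def matches(name: str) -> bool:
    # fullmatch approximates Elasticsearch's whole-value matching (an assumption).
    return PATTERN.fullmatch(name) is not None

wanted = ["bouchard", "bouchard, m.", "bouchard, j.", "bouchard j.p.", "bouchard. j.p."]
unwanted = ["bouchardat"]

for name in wanted + unwanted:
    print(f"{name!r}: {matches(name)}")
```

Every name in wanted should print True and bouchardat should print False.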

Related

Regex- to get part of String

I have the string below and I need to get all the values between Pizzahut: and |.
ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|
I have the expression .scan(/(?<=Pizzahut:)([.*\s\S]+)(?=\|)/), but it fetches
"j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|"
The result should be: j34532jdhgj,3242237,67688873rg
You can use
s='ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|'
p s.scan(/Pizzahut:([^|]+)/).flatten
# => ["j34532jdhgj", "3242237", "67688873rg"]
See this Ruby demo and the Rubular demo.
It does not look likely that Pizzahut appears as part of another word, but if that is possible, use a version with a word boundary: /\bPizzahut:([^|]+)/.
The Pizzahut:([^|]+) matches Pizzahut: and then captures into Group 1 any one or more chars other than a pipe (with ([^|]+)).
Note that String#scan returns only the captures when the pattern contains a capturing group, so you do not need lookarounds.
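For readers not using Ruby, the same idea can be sketched in Python, where re.findall also returns only the captured group when the pattern contains exactly one. The variable names are illustrative only.

```python
import re

s = ("ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|"
     "Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|")

# [^|]+ stops at the next pipe, so each match covers exactly one value,
# unlike the greedy [.*\s\S]+ which runs on to the last pipe in the string.
values = re.findall(r"Pizzahut:([^|]+)", s)

print(values)
print(",".join(values))
```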
I'm not sure why you're jumping to a regex solution here; that input string clearly looks structured to me, and you would probably do better by splitting it on the delimiters to convert it into a more convenient data structure.
Something like this:
input = "ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg"
converted_input = input
  .split('|') #=> ["ABC:2fg45rdvsg", "Pizzahut:j34532jdhgj", ... ]
  .map { |pair| pair.split(':') } #=> [["ABC", "2fg45rdvsg"], ["Pizzahut", "j34532jdhgj"], ... ]
  .group_by(&:first) #=> {"ABC"=>[["ABC", "2fg45rdvsg"]], "Pizzahut"=>[["Pizzahut", "j34532jdhgj"], ... ], "Dominos"=>[["Dominos", "3424232"]], ... }
  .transform_values { |v| v.flat_map(&:last) }
(The above series of transformations is just one possible way; you could probably come up with a dozen similar alternative steps to convert this input into the same hash shown below! For example, by using reduce or even the CSV library.)
Which gives you the final result:
converted_input = {
"ABC" => ["2fg45rdvsg"],
"Pizzahut" => ["j34532jdhgj", "3242237", "67688873rg"],
"Dominos" => ["3424232"],
"Wendys" => ["3462783"]
}
Now that the data is formatted conveniently, obtaining data like your original request becomes trivial:
converted_input["Pizzahut"].join(',') #=> "j34532jdhgj,3242237,67688873rg"
(Although quite likely it would be more suitable to leave it as an Array, not a comma-separated String!!)
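The same split-and-group idea can be sketched in Python, as one more of the many possible variants. It assumes that every non-empty segment has the form key:value; the names raw and grouped are illustrative.

```python
from collections import defaultdict

raw = ("ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|"
       "Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|")

grouped = defaultdict(list)
for pair in raw.split("|"):
    if not pair:
        # A trailing pipe leaves an empty last segment; skip it.
        continue
    key, _, value = pair.partition(":")
    grouped[key].append(value)

print(grouped["Pizzahut"])
```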

Wrong regexp query for elasticsearch

I have some problems with the regexp query for Elasticsearch. In my index there's a text field with comma-separated numeric values (IDs), e.g.
2,140,3,2495
And I have the following query term:
"regexp" : {
"myIds" : {
"value" : "^2495,|,2495,|,2495$|^2495$",
"boost" : 1
}
}
But my result list is empty.
Let me say that I know that regexp queries are kind of slow, but the index already exists and is filled with millions of documents, so unfortunately restructuring it is not an option. So I need a regex solution.
In Elasticsearch regexp queries, patterns are anchored by default, and ^ and $ are treated as literal characters.
What you mean to use is "2495,.*|.*,2495,.*|.*,2495|2495" - 2495, at the start of string, ,2495, in the middle, ,2495 at the end or a whole string equal to 2495.
Or, you may use a simpler
"(.*,)?2495(,.*)?"
That means
(.*,)? - an optional text (not including line breaks) ending with ,
2495 - your value
(,.*)? - an optional text (not including line breaks) starting with ,
Here is an online demo showing how this expression works (not a proof though).
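To see why the optional prefix and suffix are needed, here is a small Python sketch. It assumes re.fullmatch behaves like Elasticsearch's implicit anchoring for this simple pattern; contains_id and the sample values are made up for illustration.

```python
import re

ID_PATTERN = re.compile(r"(.*,)?2495(,.*)?")

def contains_id(field_value: str) -> bool:
    # The whole field must match, mirroring the anchored-by-default behaviour.
    return ID_PATTERN.fullmatch(field_value) is not None

samples = {
    "2,140,3,2495": True,   # value at the end
    "2495": True,           # value is the whole field
    "2495,7": True,         # value at the start
    "1,2495,7": True,       # value in the middle
    "12495": False,         # 2495 is only part of a longer ID
    "24950,3": False,
    "1,24951": False,
}

for value, expected in samples.items():
    print(value, contains_id(value), expected)
```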
OK, I got it to work, but I have now run into another problem. I built the string as follows:
(.*,)?2495(,.*)?|(.*,)?10(,.*)?|(.*,)?898(,.*)?
It works well for a few IDs, but if I have, say, 50 IDs, then ES throws an exception saying that the regexp is too complex to process.
Is there a way to simplify the regexp or restructure the query itself?
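One possible simplification, offered as a sketch rather than a tested Elasticsearch fix: every alternative shares the same prefix and suffix, so the IDs can be factored into a single group, giving (.*,)?(2495|10|898)(,.*)?. Whether that stays under the complexity limit for 50 IDs is an assumption that would need checking against the real index. The Python below builds both forms and compares them on a few sample values; build_id_pattern and build_long_pattern are hypothetical helper names.

```python
import re

def build_id_pattern(ids):
    # Factor the shared prefix and suffix out of the alternation.
    alternatives = "|".join(re.escape(str(i)) for i in ids)
    return rf"(.*,)?({alternatives})(,.*)?"

def build_long_pattern(ids):
    # The original form: prefix and suffix repeated for every ID.
    return "|".join(rf"(.*,)?{re.escape(str(i))}(,.*)?" for i in ids)

ids = [2495, 10, 898]
short = re.compile(build_id_pattern(ids))
long_form = re.compile(build_long_pattern(ids))

samples = ["2,140,3,2495", "10", "5,898,7", "100,8980", "2,140,3", ""]
for value in samples:
    same = bool(short.fullmatch(value)) == bool(long_form.fullmatch(value))
    print(repr(value), bool(short.fullmatch(value)), same)
```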

Regular expression for matching different name formats in Python

I need a regular expression in Python that can match different name formats.
I have four different name formats for the same person, like
R. K. Goyal
Raj K. Goyal
Raj Kumar Goyal
R. Goyal
What regular expression would match all of these names, in a list of thousands, with a single pattern?
PS: My list has thousands of such names, so I need a generic solution that lets me group these names together. In the above example, R and Goyal can be used to write the RE.
Thanks
"R(\.|aj)? (K(\.|umar)? )?Goyal" will match those four cases (along with close variants such as "Raj Goyal" or "R. Kumar Goyal"). You can modify this for other names as well.
Fair warning: I haven't used Python in a while, so I won't be giving you specific function names.
If you're looking for a generic solution that will apply to any possible name, you're going to have to construct it dynamically.
ASSUMING that the first name is always the one that won't be dropped (I know people whose names follow the format "John David Smith" and go by David) you should be able to grab the first letter of the string and call that the first initial.
Next, you need to grab the last name- if you have no Jr's or Sr's or such, you can just take the last word (find the last occurrence of ' ', then take everything after that).
From there, "<firstInitial>* <lastName>" is a good start. If you bother to grab the whole first name as well, you can reduce your false positive matches further with "<firstInitial>(\.|<restOfFirstName>)* <lastName>" as in joon's answer.
If you want to get really fancy, detecting the presence of a middle name could reduce false positives even more.
I may be misunderstanding the problem, but I'm envisioning a solution where you iterate over the list of names and dynamically construct a new regexp for each name, and then store all of these regexps in a dictionary to use later:
import re
names = ['John Kelly Smith', 'Billy Bob Jones', 'Joe James', 'Kim Smith']
regexps = {}
for name in names:
    elements = name.split()
    if len(elements) == 3:
        # First name (initial or full) and middle name are both optional.
        pattern = r'(%s(\.|%s)?)?(\ )?(%s(\.|%s)? )?%s$' % (
            elements[0][0],
            elements[0][1:],
            elements[1][0],
            elements[1][1:],
            elements[2])
    elif len(elements) == 2:
        pattern = r'%s(\.|%s)? %s$' % (
            elements[0][0],
            elements[0][1:],
            elements[1])
    else:
        continue
    regexps[name] = re.compile(pattern)
jksmith_regexp = regexps['John Kelly Smith']
print(bool(jksmith_regexp.match('K. Smith')))
print(bool(jksmith_regexp.match('John Smith')))
print(bool(jksmith_regexp.match('John K. Smith')))
print(bool(jksmith_regexp.match('J. Smith')))
This way you can easily keep track of which regexp will find which name in your text.
And you can also do handy things like this:
if sum(bool(reg.match('K. Smith')) for reg in regexps.values()) > 1:
    print("This string matches multiple names!")
Where you check to see if some of the names in your text are ambiguous.

Regular Expressions for City name

I need a regular Expression for Validating City textBox, the city textbox field accepts only Letters, spaces and dashes(-).
This answer assumes that the letters which @Manaysah refers to also encompass the use of diacritical marks. I've added the single quote ' since many names in Canada and France have it. I've also added the period (dot) since it's required for contracted names.
Building upon @UID's answer I came up with,
^([a-zA-Z\u0080-\u024F]+(?:. |-| |'))*[a-zA-Z\u0080-\u024F]*$
The list of cities it accepts:
Toronto
St. Catharines
San Fransisco
Val-d'Or
Presqu'ile
Niagara on the Lake
Niagara-on-the-Lake
München
toronto
toRonTo
villes du Québec
Provence-Alpes-Côte d'Azur
Île-de-France
Kópavogur
Garðabær
Sauðárkrókur
Þorlákshöfn
And what it rejects:
A----B
------
*******
&&
()
//
\\
I didn't add in the use of brackets and other marks since it didn't fall within the scope of this question.
I've stayed away from \s for whitespace. Tabs and line feeds aren't part of a city name and shouldn't be used in my opinion.
This can be arbitrarily complex, depending on how precise you need the match to be, and the variation you're willing to allow.
Something fairly simple like ^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$ should work.
warning: This does not match cities like München, etc, but here you basically need to work with the [a-zA-Z] part of the expression, and define what characters are allowed for your particular case.
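Keep in mind that [\s-] matches exactly one separator, so something like San----Francisco, or several consecutive spaces, is rejected; use [\s-]+ if you want to allow that. Note also that \s accepts tabs and line breaks, not just spaces.
Translates to something like:
1 or more letters, followed by a block of: a single whitespace character or dash and 1 or more letters; this last block can occur 0 or more times.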
Weird stuff in there: the ?: bit. If you're not familiar with regexes, it might be confusing, but it simply states that the piece of regex between the parentheses is not a capturing group (I don't want to capture the part it matches to reuse later), so the parentheses are only used to group the expression (and not to capture the match).
"New York" // passes
"San-Francisco" // passes
"San Fran Cisco" // passes (sorry, needed an example with three tokens)
"Chicago" // passes
" Chicago" // doesn't pass, starts with spaces
"San-" // doesn't pass, ends with a dash
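The examples above can be checked with a short Python sketch. The names SIMPLE_CITY, examples and results are illustrative only; the pattern is the one given in this answer.

```python
import re

# The simple ASCII-only pattern from this answer.
SIMPLE_CITY = re.compile(r"^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$")

examples = {
    "New York": True,
    "San-Francisco": True,
    "San Fran Cisco": True,
    "Chicago": True,
    " Chicago": False,  # starts with a space
    "San-": False,      # ends with a dash
}

results = {city: SIMPLE_CITY.match(city) is not None for city in examples}
for city, ok in results.items():
    print(f"{city!r}: {ok}")
```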
Adding my answer in case anybody needs it while searching for a regex for city names, like I did.
Please use this:
^[a-zA-Z\u0080-\u024F\s\/\-\)\(\`\.\"\']+$
Many city names contain dashes, such as Soddy-Daisy, Tennessee, or special characters, like the ñ in La Cañada Flintridge, California.
Hope this helps!
Here is the one I've found works best
for PCRE flavours allowing \p{L} (.NET, php, Golang)
/^\p{L}+(?:([\ \-\']|(\.\ ))\p{L}+)*$/u
for regex that does not allow \p{L} replace it with [a-zA-Z\u0080-\u024F]
so for javascript, python regex use
/^[a-zA-Z\u0080-\u024F]+(?:([\ \-\']|(\.\ ))[a-zA-Z\u0080-\u024F]+)*$/
Whitelisting a bunch of characters is easy, but there are things to watch for in your regex:
consecutive non-alphabetical characters should not be allowed, e.g. Los  Angeles should fail because it has two spaces
periods should be followed by a space, e.g. St.Albert should fail because it's missing the space
names cannot start or end with non-alphabetical characters, e.g. -Chicago- should fail
a whitespace character \s is not the same as an escaped space (\ ): with \s, a tab or line feed character could pass, so the space character should be matched explicitly instead
Note: When building regex rules, I find https://regex101.com/tests is very helpful, as you can easily create unit tests
js: https://regex101.com/r/cgJwc0/1/tests
php: https://regex101.com/r/Yo3GV2/1/tests
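The rules listed above can be exercised with a small Python sketch using the [a-zA-Z\u0080-\u024F] variant of the pattern. The test strings are chosen here for illustration and are not taken from the linked regex101 tests; re.fullmatch is used so that a trailing newline cannot slip past $.

```python
import re

# The non-\p{L} variant from this answer, written as a Python raw string.
CITY = re.compile(
    r"^[a-zA-Z\u0080-\u024F]+(?:([\ \-\']|(\.\ ))[a-zA-Z\u0080-\u024F]+)*$"
)

cases = {
    "St. Albert": True,
    "Val-d'Or": True,
    "München": True,
    "Los  Angeles": False,  # two consecutive spaces
    "St.Albert": False,     # period not followed by a space
    "-Chicago-": False,     # starts and ends with a non-letter
    "San\tJose": False,     # tab is not accepted as a separator
}

outcome = {name: CITY.fullmatch(name) is not None for name in cases}
for name, ok in outcome.items():
    print(f"{name!r}: {ok}")
```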
Here's one that will work with most cities, and has been tested:
^[a-zA-Z\u0080-\u024F]+(?:. |-| |')*([1-9a-zA-Z\u0080-\u024F]+(?:. |-| |'))*[a-zA-Z\u0080-\u024F]*$
Python code below, including its test.
import re
import pytest
CITY_RE = re.compile(
    r"^[a-zA-Z\u0080-\u024F]+(?:. |-| |')*"  # a word
    r"([1-9a-zA-Z\u0080-\u024F]+(?:. |-| |'))*"
    r"[a-zA-Z\u0080-\u024F]*$"
)
def is_city(value: str) -> bool:
    valid = CITY_RE.match(value) is not None
    return valid
# Tests
@pytest.mark.parametrize(
    "value,expected",
    (
        ("1", False),
        ("Toronto", True),
        ("Saint-Père-en-Retz", True),
        ("Saint Père en Retz", True),
        ("Saint-Père en Retz", True),
        ("Paris 13e Arrondissement", True),
        ("Paris 13e Arrondissement ", True),
        ("Bouc-Étourdi", True),
        ("Arnac-la-Poste", True),
        ("Bourré", True),
        ("Å", True),
        ("San Francisco", True),
    ),
)
def test_is_city(value, expected):
    assert is_city(value) is expected
^[a-zA-Z\- ]+$
Also this might be useful http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
use this regex:
^[a-zA-Z-\s]+$
After many hours of looking for a city regex matcher I have built this and it meets my needs 100%
(?ix)^[A-Z.-]+(?:\s+[A-Z.-]+)*$
expression for testing city.
Matches
City
St. City
Some Silly-City
City St.
Too Many Words City
It seems that there are many flavors of regex; I built this one for my Java needs and it works great.
^[a-zA-Z.-]+(?:[\s-][\/a-zA-Z.]+)*$
This will help identify some city names like St. Johns, Baie-Sainte-Anne, Grand-Salut/Grand Falls
I like shepley's suggestion, but it has a couple flaws in it.
If you change shpeley's regex to this, it will not accept other special characters:
^([a-zA-Z\u0080-\u024F]{1}[a-zA-Z\u0080-\u024F\. |\-| |']*[a-zA-Z\u0080-\u024F\.']{1})$
I use that one:
^[a-zA-Z\\u0080-\\u024F.]+((?:[ -.|'])[a-zA-Z\\u0080-\\u024F]+)*$
You can try this:
^\p{L}+(?:[\s\-]\p{L}+)*
The above regex will:
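Reject leading spaces and hyphens (and trailing ones too, provided you append $ or use a full-match API, since the pattern as written has no end anchor)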
Match cities with names like Néewiller-près-lauterbourg
Here are some fun edge-cases:
's Graveland
's Gravendeel
's Gravenpolder
's Gravenzande
's Heer Arendskerke
's Heerenberg
's Heerenhoek
's Hertogenbosch
't Harde
't Veld
't Zand
100 Mile House
6 October City
So, don't forget to add ' and 0-9 as a possible first character of the city name.

REGEXP_EXTRACT () every word except ‘,’ in a field

I’d like to select country except ‘,’ from a data field which looks like this
Japan,Singapore,Italy,France
and my code looks like this: REGEXP_EXTRACT(country,'([^,]*)'). It works, but unfortunately only the first country is selected. How can I code it to select them all?
I slightly changed the regex to ([^,]+) so that each country name must contain at least one character. Using * also produces empty matches, so only every other match contains a country name. (Example)
Take a look at the fixed example here.
The /g flag at the end is important: it makes the regex match globally.
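The difference between * and + under global matching can be reproduced in Python as a rough sketch, where re.search plays the role of a single, first-match extraction and re.findall plays the role of the /g flag. This only illustrates the regex behaviour; it is not a claim about how REGEXP_EXTRACT is implemented.

```python
import re

country = "Japan,Singapore,Italy,France"

# A single search returns only the first match, like the result in the question.
first_only = re.search(r"([^,]*)", country).group(1)

# Global matching with * also yields empty strings between the names.
with_star = re.findall(r"[^,]*", country)

# Global matching with + yields only the names.
with_plus = re.findall(r"[^,]+", country)

print(first_only)
print(with_star)
print(with_plus)
```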
If you are looking to extract all the characters except , then this can be achieved with either of the REGEXP_REPLACE calculated fields below:
1) Replace , with (space)
REGEXP_REPLACE(country, ",", " ")
2) Remove ,
REGEXP_REPLACE(country, ",", "")
Google Data Studio Report and a GIF to elaborate: