How are numbers matched in OSM Nominatim Geocoder search? - geocoding

Here is one things that I cannot make sense of how Nominatim handles numbers (at least in Barcelona):
Searching for Passeig de Gràcia and appending numbers 38 or 1, or 20 all returns the same set of results. I don't know where it's matching '38', or '1', or '20' because they are not in the results as fields.
Searching for Calle Mare de Deu dels Desemparats 18 is not returning any result but removing 18 returns 1 result.
Why does the service appear to be lenient to the number search parameters of Passeig de Gràcia while strict with Calle Mare de Deu dels Desemparats?

Related

Pandas exact str matching function?

Does pandas have a built-in string matching function for exact matches and not regex? The code below for tropical_two has a slightly higher count. Documentation tells me it does a regex search.
tropical = reviews['description'].map(lambda x: "tropical" in x).sum()
print(tropical)
tropical_two = reviews['description'].str.count("tropical").sum()
print(tropical_two)
The first way is the answer key from Kaggle but something about it seems less readable and intuitive to me compared to a .str function because when I run this it returns True instead of 2 so I am a little confused about if the answer key method is actually counting all occurrences of "tropical" and not just the first.
def in_str(text):
return "tropical" in text
in_str("tropical is tropical")
First 2 lines of dataframe:
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe #kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss #vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
Notebook here, tropical code in cell #2
https://www.kaggle.com/mikexie0/exercise-summary-functions-and-maps
You may use str.count with word boundary markers to match the exact search term:
tropical_two = reviews['description'].str.count(r'\btropical\b').sum()
print(tropical_two)
There may not be the need for a separate exact API, as str.count can be used for exact matches as well.

Is there a way to find and remove the datetime from multiple rows in Google Sheets?

Hope you're doing well.
Imagine I have the following Sheet:
5:20:58 xxxx: entro con el mismo xxxx
5:21:08 xxxx: xxxx
5:21:58 xxxxx: Perfecto, te pido de 5 a 10 minutos mientras
reviso la configuración de las etiquetas. ¿De acuerdo?
5:22:04 xxxxx: ok
I need to delete the datetime of all those rows. The result
xxxx: entro con el mismo xxxx
xxxx: xxxx
xxxxx: Perfecto, te pido de 5 a 10 minutos mientras
reviso la configuración de las etiquetas. ¿De acuerdo?
xxxxx: ok
Is there a formula in Google Sheets to make this?
I tried with REPLACE, SPLIT but is not applicable to all the rows in the sheet.
(The real sheet has too many rows, I extracted a part from the sheet to give an example)
EDIT
(following OP's comment)
...there are sometimes that the data not starts with a timestamp. ... How can I adjust the formula to make it work?
Please use the following altered formula
=INDEX(IFERROR(REGEXEXTRACT(B1:B;" (.*)");B1:B))
OR (for an even more robust formula)
=INDEX(IFERROR(REGEXEXTRACT(B1:B;"^\d+:\d+:\d+ (.+)");B1:B))
Original answer
Please use the following formula (adjust range to your needs)
=INDEX(IFERROR(REGEXEXTRACT(B1:B;" (.*)")))
OR (depending on your locale)
=INDEX(IFERROR(REGEXEXTRACT(B1:B," (.*)")))
Functions used:
INDEX
IFERROR
REGEXEXTRACT
Let's say your raw data is in A2:A. Place this in the second cell (e.g., B2) of an otherwise empty column:
=ArrayFormula(IF(A2:A="",,TRIM(REGEXREPLACE(A2:A,"\d+:\d+:\d+",""))))
ADDENDUM:
Version for some international locales (where semicolon is used in place of a comma within formulas):
=ArrayFormula(IF(A2:A="";;TRIM(REGEXREPLACE(A2:A;"\d+:\d+:\d+";""))))

extracting a location from a list with regex

I have a list like so
x=['hello#thepowerhouse.group', 'ThePowerHouse\xa0 is a part of the House of ElektroCouture', 'Our Studio is located at Bikini Berlin Terrace Level, 2nd floor Budapester Str. 46 10787 Berlin', '\xa0', 'Office:\xa0+49 30 20837551', '\xa0', '\xa0']
I want to extract the this element Our Studio is located at Bikini Berlin Terrace Level, 2nd floor Budapester Str. 46 10787 Berlin'
Since I am doing this for several sites I want to extra the element with regular expressions so it can work with others. I thought that I could grab the element by saying if the element has lower case and upper case letters, numbers , commas , and sometimes a period. This is what I attempted but it didn't work.
import re
for element in x:
if re.findall("([A-Za-z0-9,])",element)==True:
print("match")
You can split up your rule into several simple regexes and test them in sequence instead of making some monster-expression.
import re
def is_location(text):
"""Returns True if text contains digits, uppercase and lowercase characters."""
patterns = r'[0-9]', r'[a-z]', r'[A-Z]'
return all(re.search(pattern, text) for pattern in patterns)
x = [
'hello#thepowerhouse.group',
'ThePowerHouse\xa0 is a part of the House of ElektroCouture',
'Our Studio is located at Bikini Berlin Terrace Level, 2nd floor Budapester Str. 46 10787 Berlin',
'\xa0', 'Office:\xa0+49 30 20837551', '\xa0', '\xa0'
]
print(next(filter(is_location, x)))

Regular expression and csv | Output more readable

I have a text which contains different news articles about terrorist attacks. Each article starts with an html tag (<p>Advertisement) and I would like to extract from each article a specific information: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text.read.split("<p>")
pattern= ("wounded (\d+)|(\d+) were wounded|(\d+) were injured")
for article in splitted:
result = re.findall(pattern,article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestion in how to make it more readable?
I rewrote your loop like this and merged with csv write since you requested it:
import csv
with open ("wounded.csv","w",newline="") as f:
writer = csv.writer(f, delimiter=",")
for i,article in enumerate(splitted):
result = re.findall(pattern,article)
nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
row=["article_{}".format(i+1),nb_casualties]
writer.writerow(row)
get index of the article using enumerate
sum the number of victims (in case more than 1 group matches) using a generator comprehension to convert to integer and pass it to sum, that only if something matched (ternary expression checks that)
create the row
print it, or optionally write it as row (one row per iteration) of a csv.writer object.

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution, and since I have no clue why the solution only works for the two sample cases lited in the description link, I'd like to ask anyone for comments and hints on how to correct my code. Thank you very much for your time and assistance ahead of time.
import sys
def football(string):
countries = map(str.split, string.split('|'))
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
results = []
for i in range(len(teams)):
results.append([teams[i]+':'])
for j in range(len(countries)):
if teams[i] in countries[j]:
results[i].append(str(j+1))
for i in range(len(results)):
results[i] = results[i][0]+','.join(results[i][1:])
return '; '.join(results) + '; '
if __name__ == '__main__':
lines = [line.rstrip() for line in open(sys.argv[1])]
for line in lines:
print football(line)
After deliberately failing an attempt to checkout the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better codes or great suggestions on improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
for test in test_cases:
if test:
team_supporters = {}
for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
for team in map(int, nation_teams.split()):
team_supporters.setdefault(team, []).append(nation)
print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
The output we build using str.format calls and the default space separator of the print function. If the final semicolon wasn't desired, I'd have used print("; ".join("{}:{}.format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even be necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.