Pandas regex replacement when there is no match - regex

I'm using pandas.Series.str.replace to extract numbers from strings (it's data that has been scraped from #WPWeather) and have got to the point where I've extracted all the fields into a DataFrame like this...
df.head()
Out[48]:
                              temp   pressure  relative_humidity  \
created_at
2019-12-13 10:19:13  5.2\xc2\xbaC,   975.4mb,             91.3%.
2019-12-12 10:19:07    2\xc2\xbaC,   990.3mb,             96.9%.
2019-12-11 10:19:07  4.2\xc2\xbaC,  1000.8mb,             85.7%.
2019-12-10 10:19:00  6.3\xc2\xbaC,  1008.5mb,             94.4%.
2019-12-09 10:18:51  5.4\xc2\xbaC,  1006.7mb,             68.5%.
                     last_24_max_temp  last_24_min_temp      rain  sunshine
created_at
2019-12-13 10:19:13       7\xc2\xbaC,       2\xc2\xbaC,    9.5mm,         0
2019-12-12 10:19:07       6\xc2\xbaC,     1.5\xc2\xbaC,   0.9mm.'       NaN
2019-12-11 10:19:07    11.7\xc2\xbaC,     2.2\xc2\xbaC,  14.1mm.'       NaN
2019-12-10 10:19:00     6.5\xc2\xbaC,     1.9\xc2\xbaC,   1.1mm.'       NaN
2019-12-09 10:18:51     8.5\xc2\xbaC,     5.2\xc2\xbaC,    1.5mm,       1.9
I'm trying to use regexes to extract the numerical values using...
pd.to_numeric(df['temp'].str.replace(r'(^-?\d+(?:\.\d+)?)(.*)', r'\1', regex=True))
...and it works well, but I've hit an instance where one of the temperature fields doesn't have a value and is simply \xc2\xbaC, (no number). As a consequence the pattern does not match, the string is left unchanged, and the conversion to numeric fails with...
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "\xc2\xbaC," at position 120
How do I replace non-matches with something sane such as blank so that when I then call pd.to_numeric() it will convert to NaN?

One idea is to change the pattern passed to replace so that it strips the unit suffix; rows with no value then become empty strings, which to_numeric converts to missing values:
df['temp'] = pd.to_numeric(df['temp'].str.replace(r'\xc2\xbaC,', '', regex=True))
print (df)
                     temp   pressure  relative_humidity
created_at
2019-12-13 10:19:13   5.2   975.4mb,             91.3%.
2019-12-12 10:19:07   2.0   990.3mb,             96.9%.
2019-12-11 10:19:07   4.2  1000.8mb,             85.7%.
2019-12-10 10:19:00   6.3  1008.5mb,             94.4%.
2019-12-09 10:18:51   5.4  1006.7mb,             68.5%.
Alternatively, keep your solution and pass errors='coerce' to to_numeric so that non-numeric values are replaced with missing values:
df['temp'] = (pd.to_numeric(df['temp'].str.replace(r'(^-?\d+(?:\.\d+)?)(.*)',r'\1',regex=True),
errors='coerce'))
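Another option, sketched here on the assumption that only the leading number is wanted, is Series.str.extract, which yields NaN for rows where the pattern does not match. The sample Series below is hypothetical and stands in for df['temp']:

```python
import pandas as pd

# Hypothetical sample: the second reading has no numeric part at all.
temp = pd.Series(["5.2\u00baC,", "\u00baC,", "-3\u00baC,"])

# str.extract returns NaN for rows where the pattern does not match,
# so no replacement string is needed for the non-matching case.
numbers = temp.str.extract(r"^(-?\d+(?:\.\d+)?)", expand=False)

# errors="coerce" is kept as a safety net for any other unparseable value.
temp_numeric = pd.to_numeric(numbers, errors="coerce")
print(temp_numeric.tolist())
```

With expand=False and a single capture group the result is a Series, so it can be assigned straight back to the column.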

Related

Serialize pandas dataframe consists NaN fields before sending as a response

I have a dataframe that has NaN fields in it. I want to send this dataframe as a response. Because it has NaN fields I get this error:
ValueError: Out of range float values are not JSON compliant
I don't want to drop the fields or fill them with a placeholder character; the default response structure is ideal for my application.
Here is my views.py
...
forecast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
forecast['actual_value'] = df['y'] # <- Nan fields are added here
forecast.rename(
    columns={
        'ds': 'date',
        'yhat': 'predictions',
        'yhat_lower': 'lower_bound',
        'yhat_upper': 'higher_bound'
    }, inplace=True
)
context = {
    'detail': forecast
}
return Response(context)
Dataframe,
                          date  predictions  lower_bound  higher_bound  actual_value
 0  2022-07-23 06:31:41.362011     3.832143    -3.256209     10.358063           1.0
 1  2022-07-23 06:31:50.437211     4.169004    -2.903518     10.566005           7.0
 2  2022-07-28 14:20:05.000000    12.085815     5.267806     18.270929          20.0
...
16  2022-08-09 15:07:23.000000   105.655997    99.017424    112.419991           NaN
17  2022-08-10 15:07:23.000000   115.347283   108.526287    122.152684           NaN
Hoping to find a way to send dataframe as a response.
You could try to use the fillna and replace methods to get rid of those NaN values.
Adding something like this should work as None values are JSON compliant:
forecast = forecast.fillna(np.nan).replace([np.nan], [None])
Using replace alone can be enough, but adding fillna prevents errors if you also have NaT values, for example.
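As a minimal sketch of the same idea, the snippet below swaps NaN for None on a small hypothetical frame (not the real forecast data) and checks that the result serializes to strict JSON. It uses the astype(object).where(...) idiom rather than replace; casting to object first is what lets None survive in a column that was float:

```python
import json

import numpy as np
import pandas as pd

# Hypothetical stand-in for the forecast frame from the question.
forecast = pd.DataFrame(
    {"predictions": [3.83, 105.66], "actual_value": [1.0, np.nan]}
)

# Cast to object first so None is stored as-is instead of being
# converted back to NaN by the float dtype.
cleaned = forecast.astype(object).where(forecast.notna(), None)

records = cleaned.to_dict(orient="records")
payload = json.dumps(records, allow_nan=False)  # raises ValueError if a NaN survived
print(payload)
```

Datetime columns such as date are left out here; they would need their own conversion before json.dumps, which the response class in the question may already handle.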

Pandas exact str matching function?

Does pandas have a built-in string matching function for exact matches and not regex? The code below for tropical_two has a slightly higher count. Documentation tells me it does a regex search.
tropical = reviews['description'].map(lambda x: "tropical" in x).sum()
print(tropical)
tropical_two = reviews['description'].str.count("tropical").sum()
print(tropical_two)
The first way is the answer key from Kaggle, but it seems less readable and intuitive to me than a .str function. When I run the snippet below it returns True instead of 2, so I am a little confused about whether the answer key method actually counts all occurrences of "tropical" and not just the first.
def in_str(text):
    return "tropical" in text
in_str("tropical is tropical")
First 2 lines of dataframe:
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe #kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss #vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
Notebook here, tropical code in cell #2
https://www.kaggle.com/mikexie0/exercise-summary-functions-and-maps
You may use str.count with word boundary markers to match the exact search term:
tropical_two = reviews['description'].str.count(r'\btropical\b').sum()
print(tropical_two)
A separate exact-match API may not be needed, since str.count can handle exact matches as well.
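The gap between the two counts in the question can also come from what is being counted: the in operator yields one True per row, while str.count adds up every occurrence within a row. The sketch below uses hypothetical descriptions to show the difference, and uses Series.str.contains with regex=False as a literal (non-regex) substring test:

```python
import pandas as pd

# Hypothetical descriptions; the second one mentions the word twice.
description = pd.Series(
    ["tropical fruit aromas", "tropical nose, tropical finish", "dry and crisp"]
)

# Rows that contain the literal substring (regex=False disables regex parsing).
rows_with_term = description.str.contains("tropical", regex=False).sum()

# Same row count via the plain-Python `in` operator from the answer key.
rows_with_in = description.map(lambda text: "tropical" in text).sum()

# Total occurrences across all rows; repeated mentions are counted each time.
total_occurrences = description.str.count("tropical").sum()

print(rows_with_term, rows_with_in, total_occurrences)
```

Here the first two values agree (rows containing the term) and the third is larger because one row mentions the term twice.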

RegEx to find numbers in a string using PowerShell

Imagine having 500 strings like this with different dates:
The certificate has expired on 02/05/2014 15:43:01 UTC.
Given that this is a string and I'm using PowerShell, I need to treat the date (02/05/2014) as an object so I can use comparison operators (-lt, -gt).
Is RegEx the only way to do this? If so, can anyone help me find the first 6 numbers (which change every time) using RegEx?
>$regexStr = "(?<date>\d{2}\/\d{2}\/\d{4})"
>$testStr = "The certificate has expired on 02/05/2014 15:43:01 UTC."
>$testStr -match $regexStr
# $Matches will contain the regex group called "date"
>$Matches.date
02/05/2014
>$date = Get-Date ($Matches.date)
>$date
Wednesday, February 5, 2014 12:00:00 AM
If you need to parse the date string with another format you can do:
>$dateObj = [datetime]::ParseExact($Matches.date,"dd/MM/yyyy",$null)
>$dateObj.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True DateTime System.ValueType
Hope that helps

Matching diverse dates in Openrefine

I am trying to use the value.match command in OpenRefine 2.6 to split the information present in a column into (at least) 2 columns.
The data are, however, quite messy.
I have sometimes full dates:
May 30, 1949
Sometimes full dates are combined with other dates and attributes:
May 30, 1949, published 1979
May 30, 1949 and 1951, published 1979
May 30, 1949, printed 1980
May 30, 1949, print executed 1988
May 30, 1949, prints executed 1988
published 1940
Sometimes there is a time span:
1905-05 OR 1905-1906
Sometimes only the year
1905
Sometimes year with attributes
August or September 1908
The data don't seem to follow any specific schema or order.
I would like to extract (at least) a start year and an end year, in order to have two columns:
-----------------------
|start_date | end_date|
|1905 | 1906 |
-----------------------
without the rest of the attributes.
I can find the last date using
value.match(/.*(\d{4}).*?/)[0]
and the first one with
value.match(/.*^(\d{4}).*?/)[0]
but I have some trouble with the two formulas.
The latter cannot match anything in the case of:
May 30, 1949 and 1951, published 1979
while in the case of:
Paris, winter 1911-12
the latter formula cannot match anything and the former formula matches 1911.
Does anyone know how I can resolve the problem?
I would need a solution that takes the first date as start_date and the final date as end_date or, better (I don't know if it is possible), the earliest date as start_date and the latest date as end_date.
Moreover, I would be glad to have some clue about how to extract other information, such as:
if published or printed or executed is present in the text -> copy the date to a new column named "execution".
It should be something like creating a new column with
if(value.match("string1|string2|string3" + (\d{4}), "perform the operation", do nothing)
value.match() is a very useful but sometimes tricky function. To extract a pattern from a text, I prefer to use Python/Jython's regular expressions:
import re
pattern = re.compile(r"\d{4}")
return pattern.findall(value)
From there, you can create a string with all the years concatenated:
return ",".join(pattern.findall(value))
Or select only the first:
return pattern.findall(value)[0]
Or the last:
return pattern.findall(value)[-1]
etc.
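For the "earliest date as start_date and latest date as end_date" variant, the same findall result can be reduced with min and max. This is only a sketch: year_range is a hypothetical helper name, and it assumes that every standalone four-digit number in the cell is a year (so an abbreviated end such as the 12 in 1911-12 is ignored, and a "published" year counts as the latest date):

```python
import re

YEAR = re.compile(r"\b\d{4}\b")

def year_range(value):
    """Return (earliest, latest) four-digit year found in value, or (None, None)."""
    years = [int(y) for y in YEAR.findall(value)]
    if not years:
        return None, None
    return min(years), max(years)

print(year_range("May 30, 1949 and 1951, published 1979"))  # (1949, 1979)
print(year_range("Paris, winter 1911-12"))                   # (1911, 1911)
print(year_range("undated"))                                  # (None, None)
```

In an OpenRefine Jython expression the body would end with a return of one element, for example the first for a start_date column and the second for an end_date column.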
Same thing for your sub-question:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
return pattern.findall(value)[0][1]
Or :
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
m = re.search(pattern, value)
return m.group(2) if m else None
Here is a regex that extracts start_date and end_date into named groups.
If there is only one date, it is treated as the start_date:
((?<start_date>\d{4}).*?)?(?<end_date>\d{4}|(?<=-)\d{2})?$

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closer there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, meaning you can expect the other_tags field to look like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
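If the tag that follows "ref" is not always "width", a variant is to cut the remainder at the next double quote instead. The sketch below tries this from Python's sqlite3 module on an in-memory table; the table layout and the sample row are assumptions modelled on the question, not the real Spatialite schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (highway TEXT, other_tags TEXT)")
conn.execute(
    "INSERT INTO lines VALUES (?, ?)",
    ("motorway", '"lanes"=>"2","oneway"=>"yes","ref"=>"A 73","width"=>"7"'),
)

# Take everything after "ref"=>" and cut it at the next double quote,
# so the result does not depend on which tag follows "ref".
query = """
SELECT substr(rest, 1, instr(rest, '"') - 1) AS name
FROM (
    SELECT substr(other_tags, instr(other_tags, '"ref"=>"') + 8) AS rest
    FROM lines
    WHERE other_tags GLOB '*A [0-9]*'
      AND highway = 'motorway'
)
"""
names = [row[0] for row in conn.execute(query)]
print(names)  # ['A 73']
```

The inner SELECT keeps the same + 8 offset as the query above; only the end of the substring is located differently.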
Check with the following conditions (these assume the column holds only the reference itself, such as A 73):
other_tags LIKE 'A%' -- begins with 'A'
abs(substr(other_tags, 3, 2)) <> 0.0 -- the two characters starting at the 3rd character form a number
length(other_tags) = 4 -- length of other_tags is 4
So here is how your query should be:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'