Extract strings based on a pattern in Python and write them into pandas DataFrame columns - regex

I have text data inside a column of my dataset, as shown below
Record Note 1
1 Amount: $43,385.23
Mode: Air
LSP: Panalpina
2 Amount: $1,149.32
Mode: Ocean
LSP: BDP
3 Amount: $1,149.32
LSP: BDP
Mode: Road
4 Amount: U$ 3,234.01
Mode: Air
5 No details
I need to extract each of the details inside the text data and write them into new columns as shown below. How can I do this in Python?
Expected Output
Record Amount Mode LSP
1 $43,385.23 Air Panalpina
2 $1,149.32 Ocean BDP
3 $1,149.32 Road BDP
4 $3,234.01 Air
5
Is this possible? How can this be done?

Write a custom function and then use DataFrame.apply():
def parse_rec(x):
    note = x['Note']
    details = note.split('\n')
    x['Amount'] = None
    x['Mode'] = None
    x['LSP'] = None
    if len(details) > 1:
        for detail in details:
            if 'Amount' in detail:
                x['Amount'] = detail.split(':')[1].strip()
            if 'Mode' in detail:
                x['Mode'] = detail.split(':')[1].strip()
            if 'LSP' in detail:
                x['LSP'] = detail.split(':')[1].strip()
    return x

df = df.apply(parse_rec, axis=1)
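For reference, a minimal self-contained sketch of driving this (the 'Record' and 'Note' column names are assumptions based on the question):

import pandas as pd

# sample frame mirroring the question's notes
df = pd.DataFrame({
    'Record': [1, 2, 5],
    'Note': ['Amount: $43,385.23\nMode: Air\nLSP: Panalpina',
             'Amount: $1,149.32\nMode: Ocean\nLSP: BDP',
             'No details'],
})
df = df.apply(parse_rec, axis=1)
print(df[['Record', 'Amount', 'Mode', 'LSP']])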

import re

Amount = []
Mode = []
LSP = []

def extract_info(txt):
    Amount_lst = re.findall(r"amounts?\s*:\s*(.*)", txt, re.I)
    Mode_lst = re.findall(r"modes?\s*:\s*(.*)", txt, re.I)
    LSP_lst = re.findall(r"LSP\s*:\s*(.*)", txt, re.I)
    Amount.append(Amount_lst[0].strip() if Amount_lst else "No details")
    Mode.append(Mode_lst[0].strip() if Mode_lst else "No details")
    LSP.append(LSP_lst[0].strip() if LSP_lst else "No details")

df["Note"].apply(extract_info)
# assign the accumulated lists (not the function-local *_lst variables)
df["Amount"] = Amount
df["Mode"] = Mode
df["LSP"] = LSP
df = df[["Record", "Amount", "Mode", "LSP"]]
Using regex as in the code above, we can extract each piece of information and write it to a separate column.
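As a vectorized alternative (a sketch, not from the answers above; it assumes the notes sit in df['Note']), pandas' Series.str.extract applies the same regexes without an explicit Python-level loop and leaves NaN where a field is missing:

import re

for col, pat in [('Amount', r'Amount\s*:\s*([^\n]+)'),
                 ('Mode', r'Mode\s*:\s*([^\n]+)'),
                 ('LSP', r'LSP\s*:\s*([^\n]+)')]:
    # one capture group per pattern; take column 0 of the extracted frame
    df[col] = df['Note'].str.extract(pat, flags=re.I)[0].str.strip()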

Related

Google Script: Match RegEx into 2D array

I'm trying to extract information from Gmail into a Google Spreadsheet. The information in the email has a table structure with the following columns: List of Products, QTY Sold, and the Subtotal for each product. These repeat N times.
When accessing the information using message.getPlainBody() I get the following text:
Product
Quantity
Price
Chocolate
1
$8.58
Apples
2
$40.40
Bananas
1
$95.99
Candy
1
$4.99
Subtotal:
$149.96
Progress
First I tried to use a regular expression to identify each row with all his elements:
Product name: any number of characters that don't include ':' (.*)[^:]
QTY Sold: any number \d*
Anything that looks like a Subtotal: [$]\d*\.\d*
Wrapping everything up it looks like this
function ExtractDetail(message){
  var mainbody = message.getPlainBody();
  //RegEx
  var itemListRegex = new RegExp(/(.*)[^:][\r\n]+(\d*)[\r\n]+[$](\d*\.\d*)[\r\n]+/g);
  var itemList = mainbody.match(itemListRegex);
  Logger.log(itemList);
}
And so far it works:
itemList: Chocolate 1 $8.58 ,Apples 2 $40.40 ,Bananas 1 $95.99 ,Candy 1 $4.99
However, I'm getting the following result:
[Chocolate 1 $8.58]
[Apples 2 $40.40]
[Bananas 1 $95.99]
[Candy 1 $4.99]
Instead of:
[Chocolate] [ 1 ] [$8.58]
[Apples] [ 2 ] [$40.40]
[Bananas] [ 1 ] [$95.99]
[Candy] [ 1 ] [$4.99]
Question
My question is: how can I append new rows so that each row corresponds to a match found and each column corresponds to a captured property?
How do I turn the result of each match into an array? Is it possible or should I change my approach?
Update:
Since the result of my current attempt is one large string, I'm trying to find other options. This one popped up:
var array = Array.from(mainbody.matchAll(itemListRegex), m => m[1]);
Source: How do you access the matched groups in a JavaScript regular expression?
I'm still working on it. I still need to find out how to add more columns, and for some reason it starts at 'Apples' (following the examples), leaving 'Chocolate' behind.
Log:
Logger.log('array: ' + array);
If you want to use matchAll like Array.from(mainbody.matchAll(itemListRegex), m => m[1]), how about this modification?
In this case, /(.*[^:])[\r\n]+(\d*)[\r\n]+([$]\d*\.\d*)[\r\n]/g is used as the regex.
Modified script:
const itemListRegex = /(.*[^:])[\r\n]+(\d*)[\r\n]+([$]\d*\.\d*)[\r\n]/g;
var array = Array.from(mainbody.matchAll(itemListRegex), ([,b,c,d]) => [b,Number(c),d]);
Result:
[
["Chocolate",1,"$8.58"],
["Apples",2,"$40.40"],
["Bananas",1,"$95.99"],
["Candy",1,"$4.99"]
]
The result is the same as TheMaster's answer.
Test of script:
const mainbody = `
Product
Quantity
Price
Chocolate
1
$8.58
Apples
2
$40.40
Bananas
1
$95.99
Candy
1
$4.99
Subtotal:
$149.96
`;
const itemListRegex = /(.*[^:])[\r\n]+(\d*)[\r\n]+([$]\d*\.\d*)[\r\n]/g;
var array = Array.from(mainbody.matchAll(itemListRegex), ([,b,c,d]) => [b,Number(c),d]);
console.log(array)
Note:
About "how can I append a new row in a way that each row corresponds to each match found and each column corresponds to each property": does this mean putting the values into the Spreadsheet? If so, can you provide a sample of the result you expect?
References:
matchAll()
Array.from()
Map and split the resulting array by newlines:
const data = `Product
Quantity
Price
Chocolate
1
$8.58
Apples
2
$40.40
Bananas
1
$95.99
Candy
1
$4.99
Subtotal:
$149.96`;
const itemListRegex = /.*[^:][\r\n]+\d*[\r\n]+\$\d*\.\d*(?=[\r\n]+)/g;
const itemList = data.match(itemListRegex);
console.info(itemList.map(e => e.split(/\n/)));//map and split

use a custom function to find all words in a column

Background
The following question is a variation from Unnest grab keywords/nextwords/beforewords function.
1) I have the following word_list
word_list = ['crayons', 'cars', 'camels']
2) And df1
l = ['there are many crayons, in the blue box crayons that are',
'cars! i like a lot of sports cars because they go fast',
'the camels, in the middle east have many camels to ride ']
df1 = pd.DataFrame(l, columns=['Text'])
df1
Text
0 there are many crayons, in the blue box crayons that are
1 cars! i like a lot of sports cars because they go fast
2 the camels, in the middle east have many camels to ride
3) I also have a function find_next_words, which uses word_list to grab words from the Text column in df1
def find_next_words(row, word_list):
    sentence = row[0]
    trigger_words = []
    next_words = []
    for keyword in word_list:
        words = sentence.split()
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                next_words.append(words[index + 1:index + 3])
    return pd.Series([trigger_words, next_words], index = ['TriggerWords','NextWords'])
4) And it's pieced together with the following
df2 = df1.join(df1.apply(lambda x: find_next_words(x, word_list), axis=1))
Output
Text TriggerWords NextWords
0 [crayons] [[that, are]]
1 [cars] [[because, they]]
2 [camels] [[to, ride]]
Problem
5) The output misses the following
'crayons,' from row 0 of the Text column of df1
'cars!' from row 1 of the Text column of df1
'camels,' from row 2 of the Text column of df1
Goal
6) Grab all corresponding words from df1, even if the words in df1 are slight variations (e.g. 'crayons,', 'cars!') of the words in word_list
(For this toy example, I know I could easily fix this problem by just adding these word variations to word_list = ['crayons,', 'crayons', 'cars!', 'cars', 'camels,', 'camels']. But this would be impractical to do with my real word_list, which contains ~20K words.)
Desired Output
Text TriggerWords NextWords
0 [crayons, crayons] [[in, the], [that, are]]
1 [cars, cars] [[i, like], [because, they]]
2 [camels, camels] [[in, the], [to, ride]]
Questions
How do I 1) tweak my word_list (e.g. with regex?) or 2) change my find_next_words function to achieve my desired output?
You can tweak your regex to something like this:
\b(crayons|cars|camels)\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))
Regex Demo
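A minimal sketch of wiring that suggested pattern into a variant of find_next_words (my own wiring; the alternation is built from word_list so it scales beyond the toy example, and re.findall returns one (trigger, word1, word2) tuple per match):

import re
import pandas as pd

word_list = ['crayons', 'cars', 'camels']
l = ['there are many crayons, in the blue box crayons that are',
     'cars! i like a lot of sports cars because they go fast',
     'the camels, in the middle east have many camels to ride ']
df1 = pd.DataFrame(l, columns=['Text'])

# build the suggested pattern from word_list
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, word_list)) + r')\b'
                     r'(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))')

def find_next_words(row):
    matches = pattern.findall(row['Text'])
    return pd.Series([[m[0] for m in matches],
                      [[m[1], m[2]] for m in matches]],
                     index=['TriggerWords', 'NextWords'])

df2 = df1.join(df1.apply(find_next_words, axis=1))
print(df2)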
import nltk
change
words = sentence.split()
to
words = nltk.word_tokenize(sentence)
This tokenizes 'crayons,' into 'crayons' and ',' instead of the single token 'crayons,', which allows find_next_words to correctly identify all the words from word_list in the Text column.
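For illustration, a quick look at what word_tokenize produces (a sketch; it assumes the punkt tokenizer data has been downloaded once):

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer data, if missing
print(nltk.word_tokenize('there are many crayons, in the blue box'))
# ['there', 'are', 'many', 'crayons', ',', 'in', 'the', 'blue', 'box']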

extra commas when using read_csv causing too many "s in data frame

I'm trying to read in a large file (~8Gb) using pandas read_csv. In one of the columns in the data, there is sometimes a list which includes commas but is enclosed by curly brackets, e.g.
"label1","label2","label3","label4","label5"
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}
Therefore, when these particular lines were read in, I was getting the error "Error tokenizing data. C error: Expected 37 fields in line 35, saw 42". I found this solution, which said to add sep=",(?![^{]*})" to the read_csv arguments, and that split the data correctly. However, the data now includes the quotation marks around every entry (this didn't happen before I added the sep argument).
The data looks something like this now:
"label1" "label2" "label3" "label4" "label5"
"{A1}" "2" "" "False" "{ "apple" : false, "pear" : false, "banana" : null}"
meaning I can't use, for example, .describe(), etc on the numerical data because they're still strings.
Does anyone know of a way of reading it in without the quotation marks but still splitting the data where it is?
Very new to Python so apologies if there is an obvious solution.
serialdev found a solution for removing the "s, but the data columns are objects and not what I would expect/want; e.g. the integer values aren't seen as integers.
The data needs to be split at "," explicitly (including the "s). Is there a way of stating that in the read_csv arguments?
Thanks!
To read in the data structure you specified, where the last element has an unknown length:
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}"
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null, "orange": "true"}"
Change the separator to a regular expression using a negative lookahead assertion. This will enable you to split on a ',' only when it is not immediately followed by a space.
df = pd.read_csv('my_file.csv', sep=r'[,](?!\s)', engine='python', thousands='"')
print(df)
0 1 2 3 4
0 "{A1}" 2 NaN "False" "{ "apple" : false, "pear" : false, "banana" :...
1 "{A1}" 2 NaN "False" "{ "apple" : false, "pear" : false, "banana" :...
Specifying the thousands separator as the quote is a bit of a hacky way to parse fields containing a quoted integer into the correct datatype. You can achieve the same result using converters, which can also remove the quotes from the strings should you need it to, and cast "True" or "False" to a boolean.
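A rough sketch of that converters idea, assuming the same hypothetical my_file.csv and that columns 1 and 3 hold the quoted integer and boolean (the column positions are assumptions based on the sample rows above):

import pandas as pd

strip_quotes = lambda s: s.strip('"')

df = pd.read_csv(
    'my_file.csv',                # hypothetical path, as above
    sep=r'[,](?!\s)',             # same separator as above
    engine='python',
    header=None,
    converters={
        0: strip_quotes,
        1: lambda s: int(strip_quotes(s)) if strip_quotes(s) else None,  # quoted int
        2: strip_quotes,
        3: lambda s: strip_quotes(s) == 'True',                          # quoted bool
        4: strip_quotes,
    },
)
print(df.dtypes)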
If you need to remove the "s from a column, use the vectorized function str.strip:
import pandas as pd
mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
{'"first_name"': '"Bob"', '"age"': '"8"'},
{'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
print (df)
"age" "first_name"
0 "7" "Bill"
1 "8" "Bob"
2 "9" "Ben"
df['"first_name"'] = df['"first_name"'].str.strip('"')
print (df)
"age" "first_name"
0 "7" Bill
1 "8" Bob
2 "9" Ben
If you need to apply str.strip() to all columns, use:
df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
df.columns = df.columns.str.strip('"')
print (df)
age first_name
0 7 Bill
1 8 Bob
2 9 Ben
Timings:
mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
{'"first_name"': '"Bob"', '"age"': '"8"'},
{'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
df = pd.concat([df]*3, axis=1)
df.columns = ['"first_name1"','"age1"','"first_name2"','"age2"','"first_name3"','"age3"']
#create sample [300000 rows x 6 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
df1,df2 = df.copy(),df.copy()
def a(df):
    df.columns = df.columns.str.strip('"')
    df['age1'] = df['age1'].str.strip('"')
    df['first_name1'] = df['first_name1'].str.strip('"')
    df['age2'] = df['age2'].str.strip('"')
    df['first_name2'] = df['first_name2'].str.strip('"')
    df['age3'] = df['age3'].str.strip('"')
    df['first_name3'] = df['first_name3'].str.strip('"')
    return df

def b(df):
    #apply str function to all columns in dataframe
    df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
    df.columns = df.columns.str.strip('"')
    return df

def c(df):
    #apply str function to all columns in dataframe
    df = df.applymap(lambda x: x.lstrip('\"').rstrip('\"'))
    df.columns = df.columns.str.strip('"')
    return df
print (a(df))
print (b(df1))
print (c(df2))
In [135]: %timeit (a(df))
1 loop, best of 3: 635 ms per loop
In [136]: %timeit (b(df1))
1 loop, best of 3: 728 ms per loop
In [137]: %timeit (c(df2))
1 loop, best of 3: 1.21 s per loop
Would this work, since you have all the data that you need:
.map(lambda x: x.lstrip('\"').rstrip('\"'))
So simply clean up all the occurrences of " afterwards.
EDIT with example:
mydata = [{'"first_name"' : '"bill', 'age': '"75"'},
{'"first_name"' : '"bob', 'age': '"7"'},
{'"first_name"' : '"ben', 'age': '"77"'}]
IN: df = pd.DataFrame(mydata)
OUT:
"first_name" age
0 "bill "75"
1 "bob "7"
2 "ben "77"
IN: df['"first_name"'] = df['"first_name"'].map(lambda x: x.lstrip('\"').rstrip('\"'))
OUT:
0 bill
1 bob
2 ben
Name: "first_name", dtype: object
Use this sequence after selecting the column; it is not ideal, but it will get the job done:
.map(lambda x: x.lstrip('\"').rstrip('\"'))
You can change the dtypes afterwards using this pattern:
df['col'].apply(lambda x: pd.to_numeric(x, errors='ignore'))
or simply:
df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)
It depends on your file. Did you check whether there are commas inside the cells? If you have something like Banana: Fruit, Tropical, Edible, etc. in the same cell, you're going to get this kind of bug. One basic solution is removing all commas from the file. Or, if you can read it in, you can remove the special characters:
>>>df
Banana
0 Hello, Salut, Salom
1 Bonjour
>>>df['Banana'] = df['Banana'].str.replace(',','')
>>>df
Banana
0 Hello Salut Salom
1 Bonjour

Reading XLSX into MFC Application

I need to read data from an XLSX document into my MFC application.
It has been working fine; however, I now have the requirement of reading in foreign-language characters, and they are getting lost.
Example values which I need to read in are:
Лице и тяло
Мама и бебе
Currently I first convert the XLSX to a CSV using the following script, which I found online:
if WScript.Arguments.Count < 2 Then
    WScript.Echo "Error! Please specify the source path and the destination. Usage: XlsToCsv SourcePath.xls Destination.csv"
    Wscript.Quit
End If
Dim numRows
Dim numCols
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
oExcel.DisplayAlerts = False
Dim oBook
Set oBook = oExcel.Workbooks.Open(Wscript.Arguments.Item(0))
Set oWorksheet = oBook.Worksheets(1)
oWorksheet.Activate
oWorksheet.Cells.Replace ",", ""
Dim celltxt
celltxt = oWorksheet.Cells(1,1).Text
If InStr(1, celltxt, "Date") Then
    For count = 13 to 17
        oWorksheet.Columns(count).NumberFormat = "0"
    Next
Else
    For count = 1 to 4
        oWorksheet.Columns(count).NumberFormat = "0"
    Next
End If
numRows = oWorksheet.UsedRange.Rows.Count
numCols = oWorksheet.UsedRange.Columns.Count
Const adTypeText = 2
Const adSaveCreateOverWrite = 2
Dim BinaryStream
Set BinaryStream = CreateObject("ADODB.Stream")
BinaryStream.Charset = "UTF-8"
BinaryStream.Type = adTypeText
BinaryStream.Open
For r = 1 To numRows
    s = ""
    For c = 1 To numCols
        s = s & oWorksheet.Cells(r, c).Value
        If c < numCols Then
            s = s & ","
        End If
    Next
    BinaryStream.WriteText s, 1
Next
BinaryStream.SaveToFile WScript.Arguments.Item(1), adSaveCreateOverWrite
BinaryStream.Close
oBook.Close False
oExcel.Quit
Once this script runs and I open the resulting CSV before any further processing, the data in the CSV looks fine.
Once this CSV is processed by the application (loaded using CRecordset), the above values look as follows:
¦Ы¦¬TЖ¦¦ ¦¬ TВTП¦¬¦-
¦Ь¦-¦-¦- ¦¬ ¦-¦¦¦-¦¦
What I found is that if I open the csv before processing it and go to "Save As", the "Save as type" value is on "Unicode Text".
If I now overwrite the CSV by saving it as "CSV (Comma delimited)" and then let my app process the file, the resulting values look as follows:
TшЎх ш Є ыю
¦рьр ш схсх
and if I manually saved the csv as "CSV (MS-DOS)" and then let my app process it, the values come out correctly:
Лице и тяло
Мама и бебе
So firstly, I don't understand the differences, and why I need the MS-DOS CSV version to get my app to read the values in correctly; and secondly, I need to be able to process the data correctly without the manual Save As in between.
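One way to see what may be going on, sketched in Python; the code pages are assumptions inferred from the garble above (cp866 is the Cyrillic OEM code page that "CSV (MS-DOS)" writes, cp1251 the Windows ANSI one):

text = 'Лице и тяло'
# UTF-8 bytes decoded as cp866: resembles the first garble above
print(text.encode('utf-8').decode('cp866'))
# cp1251 ("CSV (Comma delimited)") bytes decoded as cp866: resembles the second garble
print(text.encode('cp1251').decode('cp866'))
# cp866 ("CSV (MS-DOS)") bytes decoded as cp866: correct text
print(text.encode('cp866').decode('cp866'))

If that holds, the MFC/CRecordset path is decoding the file with the OEM code page, which would explain why only the MS-DOS CSV imports correctly.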
Any help greatly appreciated!
Thanks a lot!
UPDATE:
Just to test, I changed my code (which runs the script to convert to CSV) to output to a .txt instead of a .csv. Again the output file looks right before my application reads it in, but it still imports into my application wrong.
Here is the output .txt contents:
From Date,To Date,Store ID,Store Name,Retailer,Retail Format,Region,Cluster,Market Attr 1 (Numeric),Market Attr 2 (Numeric),Market Attr 3 (Text),Market Attr 4 (Text),Market Attr 5 (Text),Product ID,EAN (Barcode),UPC (Product Code),Brand,Description,Size & Uom,Size,Uom,Supplier,Shrink Pack,Minimum Display Depth,KVI,Status,Product Height,Product Width,Product Depth,Supergroup A,Supergroup B,Category,Sub Category,Segment,Sub Segment,Product Attr 1 (Numeric),Product Attr 2 (Numeric),Product Attr 3 (Text),Product Attr 4 (Text),Product Attr 5 (Text),Sales (Value),Units (Volume),Sales at Cost,Ranging Indicator,Fact 1 (Numeric),Fact 2 (Numeric),Fact 3 (Numeric),Fact 4 (Text),Fact 5 (Text)
01/01/2014,31/01/2016,130,Maritza,Clicks,Pharmacy,Plovdiv,A Range,1.99,2.99,Attribute 3,Attribute 4,Attribute 5,1347,7612729000518,ПОЛИНЕЙЛ лечебен лак за нокти 80мг/гр 33мл,Coca Cola,ПОЛИНЕЙЛ лечебен лак за нокти 80мг/гр 33мл,330ml,330,ml,Coca Cola,6,2,1,Active,CONVERT,37,36,Лице и тяло,Медицинска козметика,Carbonated Soft Drinks,Diet,On The Go,Cans,1.99,2.99,Attribute 3,Attribute 4,Attribute 5,136.27,12,122.21,1,1.99,2.99,3.99,Yes,No
01/01/2014,31/01/2016,130,Maritza,,Pharmacy,Plovdiv,,,,,,,1349,5903263245933,СУДО крем антисептичен 125гр,,СУДО крем антисептичен 125гр,,,,,,,,,66,6,6,Мама и бебе,Козметика бебе,,,,,,,,,,,,,,,,,,
01/01/2014,31/01/2016,130,Maritza,,Pharmacy,Plovdiv,,,,,,,1381,,СПРИНЦОВКА 1МЛ ТРИСЪСТАВНА инсулинова спринцовка х 1бр,,СПРИНЦОВКА 1МЛ ТРИСЪСТАВНА инсулинова спринцовка х 1бр,,,,,,,,,198,3,1,Мама и бебе,Храни и напитки бебе,,,,,,,,,,,,,,,,,,
01/01/2014,31/01/2016,130,Maritza,,Pharmacy,Plovdiv,,,,,,,1607,8716200646536,ФРИЗОПЕП AC 400гр,,ФРИЗОПЕП AC 400гр,,,,,,,,,123,10,10,Мама и бебе,Храни и напитки бебе,,,,,,,,,,,,,,,,,,

Regex / subString to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a string now. I need to retrieve the values as concisely as possible, so that I get a map of the form Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on the MySQL slave server, and I am building an alerting system on top of it. So from this entire paragraph, held as a String, I need to separate out the queries and save the corresponding time range with each.
1.001303 to 31.856310 is a time range, and the query corresponding to that time range is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala: a Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the repeating cells so I can do my manipulations on them later, but it doesn't seem to be working.
Thanks in advance! I know there are several ways to do this, but all the ones that come to mind are inefficient and tedious. I need to do this in Scala! Maybe I can extract recursively using the substring method?
If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (acc, el) =>
      val (taking, map) = acc // taking contains the range
      taking match {
        case Some(range) if el.trim.nonEmpty => // Some contains the range
          (None, map + (el -> range)) // add to map
        case None =>
          regex.findFirstIn(el) match { // extract the range
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map) // probably an empty line
      }
  }
  map
}
Modified ajozwik's answer to work for SQL commands spanning multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(accumulator,element) =>
val (taking,map) = accumulator
taking match {
case Some(range) if element.trim.nonEmpty=> {
if (element.contains("Queries"))
(None, map)
else
(Some(range),map+(range->(map.getOrElse(range,"")+element)))
}
case None =>
regex.findFirstIn(element) match {
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map)
}
}
println(map)
map
}