How to use regular expression in spark dataframe using scala?

How to use regular expression in spark dataframe using scala? - regex

In my case i have a data frame contaning some biological data which are: protein name, ecnumber (which could be more than one) and protein domains (which could be also more than one domain). The data frame is a one column containing all those data which i would like to split it into three columns, but the problem is that if a line (containing more than one ECnumber) is splitted, the second ECnumber goes to the third column and the domains will be then disappeared.
here is my code:
val df = rdd.toDF()
val mydf = df.withColumn("_tmp", split($"value", ";")).select(
$"_tmp".getItem(0).as("Entry"),
$"_tmp".getItem(1).as("ECnumber"),
$"_tmp".getItem(2).as("Domains")
And here is the result
enter image description here

Based on the provided reference data, I see you can use the following regular expression to retrieve your data into independent columns (by extracting using regular expression):
val dataFrameValueRegex = "(\\w++);(([0-9.-]*+;)++)((\\w++;?)++)".r
For example if data frame value has got the following:
val dataFrameValue = "A6MML6;2.1.-.-;2.1.3.16;IPR037431;IPR037432;IPR037433"
Now using regular expression, you can extract independent values from data frame value:
val dataFrameValueRegex(entry, ecNumbers, _, domains, _) = dataFrameValue
Above: All values will be retrieved in corresponding variables:
1.) entry: Entry string
2.) ecNumbers: Complete string of ecnumbers separated by semicolon's. There would be a semicolon present at the end of the string.
3.) domains: Complete string of domains separated by semicolon's.
Note: If for any reason the data frame value was not as expected you would get a MatchError exception been thrown.
In the below code just printing variables information.
println(s"Data value: Entry = [$entry], ECnumbers = [${ecNumbers.init}], Domains = [$domains]")
val ecNumber = ecNumbers.init.split(";")
ecNumber.foreach(e => println(s"ecNumber = [$e]"))
val domain = domains.split(";")
domain.foreach(d => println(s"Domain = [$d]"))

Related

How to tag named entities to prepare training data for custom named entity recognition with spacy?

I want to train spacy named entity recognizer on my custom dataset. I have prepared a python dictionary having key = entity_type and list of values = entity name, but i'm not getting any way using which I can tag the tokens in proper format.
I have tried normal string matching(find) and regular expression(search, compile) but not getting what I want.
for ex: my sentence and the dict I'm using are(this is the example)
sentence = "Machine learning and data mining often employ the same methods
and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
'DM': ['data mining']}
for k,v in dic.items():
for val in v:
if val in sentence:
print(k, val, sentence.index(val)) #right now I'm just printing
#the key, val and starting index
output:
MLDM machine learning and data mining 0
ML machine learning 0
DM data mining 21
expected output: MLDM 0 32
so I can further prepare training data to train Spacy NER :
[{"content":"machine learning and data mining often employ the same methods
and overlap significantly.","entities":[[0,32,"MLDM"]]}

You may build a regex from all values in your dic to match them as whole words and upon a match grab the key associated with the matched value. I assume the value items are unique in the dictionary, they can contain whitespaces and only contain "word" characters (no special ones like + or ().
import re
sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
'DM': ['data mining']}
def get_key(val):
for k,v in dic.items():
if m.group().lower() in map(str.lower, v):
return k
return ''
# Flatten the lists in values and sort the list by length in descending order
l=sorted([v for x in dic.values() for v in x], key=len, reverse=True)
# Build the alternation based regex with \b to match each item as a whole word
rx=r'\b(?:{})\b'.format("|".join(l))
for m in re.finditer(rx, sentence, re.I): # Search case insensitively
key = get_key(m.group())
if key:
print("{} {}".format(key, m.start()))
See the Python demo

Reading Values from a Bag of Tuples in pig

I have UDF output as :-
Sample records:-
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema for UDF:- name:chararray(1 single column)
Now i want to read this bag of tuples and generate output as :-
Todd,240
Jon,422
The output of the UDF i stored in a temp file and read it back using different schema as:-
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that i am trying to use foreach loop and reference dot notation to find the sum.
X = foreach D generate B.T.name,SUM(B.T.denom);
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1128: Cannot find field T in name:chararray,denom:int Details at
logfile: /home/training/pig_1488648405070.log
Can you please let me know how to find it? I am new to Apache Pig so not sure how it traverse in Bag of Tuples and find sum.

GROUP the dataset on name before performing SUM.
FLATTEN the bag to perform GROUP.
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then, GROUP them on name
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
And apply SUM() over the result
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)

PIG regex extract then filter the unnamed regex tuple

I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to regex_extract_all to turn it into elements in a tuple and sepereated by ",". Then I need to filter out the ones don't contain structure and location.
However, I got an error that can't filter regex type. Any idea?
By the way, the ending goal is to parse out the longest hierarchy like (topic|news|politics|elections|primary)
update the script:
data = load load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into /user/tad/sec_hir

The syntax for filter matches seems incorrect.The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string into two separate variables. Then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.

Using islice
In addition to normal list slicing you can use islice() which is more performant when generating slices of larger lists.
Code would look like this:
from itertools import islice
with open('input.txt') as f:
data = f.readlines()
first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))
first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, you should keep in mind that the data source you read from is long enough. If it is not, it will raise a StopIteration error when using islice() or an IndexError when using normal list slicing.
Using regex
The OP asked for a list-less approach additionally in the comments below.
Since reading data from a file leads to a string and via string-handling to lists later on or directly to a list of read lines I suggested using a regex instead.
I cannot tell anything about performance comparison between list/string handling and regex operations. However, this should do the job:
import re
regex = '(?P<first>.+)(\n)(?P<second>.+)([\n]{2})(?P<rest>.+[\n])'
preg = re.compile(regex)
with open('input.txt') as f:
data = f.read()
match = re.search(regex, data, re.MULTILINE | re.DOTALL)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')

If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2) # This only separates the first to lines
title = lines[0]
description = lines[1]
rest = lines[2]

Dynamic Depending Lists in Separated WorkSheets in VBA (2)

I'm working with 7 dynamic dependent lists, and I thought the best way to automate the process and avoid to arrange anything in a future if I modify the lists was a VBA code.
The VBA code that I started to work on it is posted on: Dynamic Depending Lists in Separated WorkSheets in VBA
That code is just for the 2 first lists.
That's the main table that I have. I want pick lists for the first row only for the yellow columns:
That's the table that I have the lists (they must be dynamic):
The relations between my lists are:
Responsible list and Site list are related with Project list.
The other lists are related with the site list.

Okay. I've got what you are looking for. I solved this issue a few months back in another project. Basically, indirect is no good here because it doesn't work on dynamic named ranges, because they don't produce an actual result, just a formula reference.
First, set up your named ranges on a sheet like so. It's very important that the named ranges be named in the manner I described, as this will feed the code into making your dynamic lists. Also, note, I only wrote out SamplePoints for X1 and T2. If you select other options, the code won't work until you add those named ranges in.
Then assuming input sheet is set up like below:
Place this code in the worksheet change event of your input sheet. What it does is take the value selected in one cell and then appends the appropriate column name to feed that list. So, if Project A is selected and you want to pick a responsible party for project A, it sets the validation in Range("B(whatever row you are on)" to be A_Responsible, thus giving you that list.
Private Sub Worksheet_Change(ByVal Target As Range)
Dim wks As Worksheet
Dim strName As String, strFormula
Dim rng As Range
Set wks = ActiveSheet
With wks
If Target.Row = 1 Then Exit Sub
Select Case Target.Column
Case Is = .Rows(1).Find("Project", lookat:=xlWhole).Column
Set rng = Target.Offset(, 1)
strName = Target.Value
strFormula = "=" & Replace(strName, " ", "_") & "_Responsible"
AddValidation rng, 1, strFormula
'add any more cells that would need validation based on project selection here.
Case Is = .Rows(1).Find("Responsible", lookat:=xlWhole).Column
Set rng = Target.Offset(, 1)
strName = Target.Value
strFormula = "=" & Replace(strName, " ", "_") & "_SamplePoint"
AddValidation rng, 1, strFormula
'add any more cells that would need validation based on responsible selection here.
'Case Is = add any more dependenices here ... and continue with cases for each one
End Select
End With
You will also need this function in a standard module somewhere in your workbook.
Function AddValidation(ByVal rng As Range, ByVal iOperator As Integer, _
ByVal sFormula1 As String, Optional iXlDVType As Integer = 3, _
Optional iAlertStyle As Integer = 1, Optional sFormula2 As String, _
Optional bIgnoreBlank As Boolean = True, Optional bInCellDropDown As Boolean = True, _
Optional sInputTitle As String, Optional sErrorTitle As String, _
Optional sInputMessage As String, Optional sErrorMessage As String, _
Optional bShowInput As Boolean = True, Optional bShowError As Boolean = True)
'==============================================
'Enumaration for ease of use
'XlDVType
'Name Value Description
'xlValidateCustom 7 Data is validated using an arbitrary formula.
'xlValidateDate 4 Date values.
'xlValidateDecimal 2 Numeric values.
'xlValidateInputOnly 0 Validate only when user changes the value.
'xlValidateList 3 Value must be present in a specified list.
'xlValidateTextLength 6 Length of text.
'xlValidateTime 5 Time values.
'xlValidateWholeNumber 1 Whole numeric values.
'AlertStyle
'xlValidAlertInformation 3 Information icon.
'xlValidAlertStop 1 Stop icon.
'xlValidAlertWarning 2 Warning icon.
'Operator
'xlBetween 1 Between. Can be used only if two formulas are provided.
'xlEqual 3 Equal.
'xlGreater 5 Greater than.
'xlGreaterEqual 7 Greater than or equal to.
'xlLess 6 Less than.
'xlLessEqual 8 Less than or equal to.
'xlNotBetween 2 Not between. Can be used only if two formulas are provided.
'xlNotEqual 4 Not equal.
'==============================================
With rng.Validation
.Delete ' delete any existing validation before adding new one
.Add Type:=iXlDVType, AlertStyle:=iAlertStyle, Operator:=iOperator, Formula1:=sFormula1, Formula2:=sFormula2
.IgnoreBlank = bIgnoreBlank
.InCellDropdown = bInCellDropDown
.InputTitle = sInputTitle
.ErrorTitle = sErrorTitle
.InputMessage = sInputMessage
.ErrorMessage = sErrorMessage
.ShowInput = bShowInput
.ShowError = bShowError
End With
End Function

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to use regular expression in spark dataframe using scala? - regex

Related

How to tag named entities to prepare training data for custom named entity recognition with spacy?

Reading Values from a Bag of Tuples in pig

PIG regex extract then filter the unnamed regex tuple

Python .splitlines() to segment text into separate variables

Dynamic Depending Lists in Separated WorkSheets in VBA (2)

Categories

Resources