Why can't I extract text between patterns with varying text length - regex

I have a column in a data frame with characters that need to be split into columns. My code seems to break when the string in the column has a length of 12 but it works fine when the string has a length of 11.
S99.ABCD{T}
S99.ABCD{V}
S99.ABCD{W}
S99.ABCD{Y}
Q100.ABCD{A}
Q100.ABCD{C}
Q100.ABCD{D}
Q100.ABCD{E}
An example of the ideal format is on the left, what I'm getting is on the right:
ID WILD RES MUT | ID WILD RES MUT
ABCD S 99 T | ABCD S 99 T
... | ...
ABCD Q 100 A | .ABC Q 100 {
... | ...
My current solution is the following:
data <- data.frame(ID = substr(mdata$substitution,
gregexpr(pattern = "\\.",
mdata$substitution)[[1]] + 1,
gregexpr(pattern = "\\{",
mdata$substitution)[[1]] - 1),
WILD = substr(mdata$substitution, 0, 1),
RES = gsub("[^0-9]","", mdata$substitution),
MUT = substr(mdata$substitution,
gregexpr(pattern = "\\{",
mdata$substitution)[[1]] + 1,
gregexpr(pattern = "\\}",
mdata$substitution)[[1]] - 1))
I'm not sure why my code isn't working, I thought using gregexpr I would be able to find where the pattern was in the string to find out the position of characters I want to extract but it doesn't work when the length of the string changes.

Using this example you can do what you want
test=c("S99.ABCD{T}",
"S99.ABCD{V}",
"S99.ABCD{W}",
"S99.ABCD{Y}",
"Q100.ABCD{A}",
"Q100.ABCD{C}",
"Q100.ABCD{D}",
"Q100.ABCD{E}")
library(stringr)
test=str_remove(test,pattern = "\\}")
testdf=as.data.frame(str_split(test,pattern = "\\.|\\{",simplify = T))
testdf$V4substring(testdf$V1, 1, 1)
testdf$V5=substring(testdf$V1, 2)

Related

For loop for a list to prevent vertical display

ini_list = "[('G 02', 'UV', '2.73')]"
res = ini_list.strip('[]')
print(res)
('G 02', 'UV', '2.73')
result = res.strip('()')
print(result)
'G 02', 'UV', '2.73'
I have a list: 'G 02', 'UV', '2.73' and I would like to assign variables to this list
so that the outcome is as follows:
Element = G 02
Reason = UV
Time = 2.73
I have numerous lists that contain those parameters that I would like to later use to plot various things and so would like to extract each parameter from the list and associate it with the specific variable.
I tried to do it by:
Results = res
for index, Parameters in enumerate(Results):
element = Parameters[0]
print(element)
in the hopes that i could extract each item from the list to assign it a variable as mentioned above however when i print element the list prints vertically downwards and it also doesnt let me extract individual indexes.
'
G
0
2
'
,
'
U
V
'
,
'
2
.
7
3
'
how do i get it so it assigns variables to each parameter as mentioned above and so it prints as so:
element = G 02
reason = UV
time = 2.73
if the ini_list is a string such as ini_list = "[('G 02', 'UV', '2.73')]" then you need strip and split methods to do your work such as following,
ini_list = "[('G 02', 'UV', '2.73')]"
res = ini_list.strip('[]')
result = res.strip('()')
result1 = result.split(',')
result1=[x.strip(" ") for x in result1]
result1=[x.strip("''") for x in result1]
element = result1[0]
print("element:",element)
reason =result1[1]
print("reason:",reason)
time=result1[2]
print("time:",time)
output:
element: G 02
reason: UV
time: 2.73

Find starting and ending index of each unique charcters in a string in python

I have a string with characters repeated. My Job is to find starting Index and ending index of each unique characters in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
mo = re.search(item,x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end index of the characters are not correct. I understand why it's happening but how can I pass the character to be matched dynamically to the regex search function. For instance if I hardcode the character in the search function it provides the desired output
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
output:
b 2 5
The above function is providing correct result but here I can't pass the characters to be matched dynamically.
It will be really a help if someone can let me know how to achieve this any hint will also do. Thanks in advance
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
# for patterns better use raw strings - and format the letter into it
mo = re.search(fr"{item}+",x) # fr and rf work both :) its a raw formatted literal
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n) # fix upper limit by n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
Without regex1:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
if last_ch == ch:
continue
else:
print(last_ch,start_idx, idx-1)
last_ch = ch
start_idx = idx
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
1RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
end = start + len(item[0])
output += (f"{item[1]} {start} {end}\n")
start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it'll be in the Order of N, you can likely benchmark it though, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
x = "aabbbbccc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
end = start + len(item[0])
output += (f"{item[1]} {start} {end}\n")
start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)

spark scala pattern matching on a dataframe column

I am coming from R background. I could able to implement the pattern search on a Dataframe col in R. But now struggling to do it in spark scala. Any help would be appreciated
problem statement is broken down into details just to describe it appropriately
DF :
Case Freq
135322 265
183201,135322 36
135322,135322 18
135322,121200 11
121200,135322 8
112107,112107 7
183201,135322,135322 4
112107,135322,183201,121200,80000 2
I am looking for a pattern search UDF, which gives me back all the matches of the pattern and then corresponding Freq value from the second col.
example : for pattern 135322 , i would like to find out all the matches in first col Case.It should return corresponding Freq number from Freq col.
Like 265,36,18,11,8,4,2
for pattern 112107,112107 it should return just 7 because there is one matching pattern.
This is how the end result should look
Case Freq results
135322 265 256+36+18+11+8+4+2
183201,135322 36 36+4+2
135322,135322 18 18+4
135322,121200 11 11+2
121200,135322 8 8+2
112107,112107 7 7
183201,135322,135322 4 4
112107,135322,183201,121200,80000 2 2
what i tried so far:
val text= DF.select("case").collect().map(_.getString(0)).mkString("|")
//search function for pattern search
val valsum = udf((txt: String, pattern : String)=> {
txt.split("\\|").count(_.contains(pattern))
} )
//apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum( lit(text),DF("case")))
This one works
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._
object playground extends App {
import org.apache.spark.sql.functions._
val pattern = "135322,121200" // Pattern you want to search for
// udf declaration
val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) =>
{
var result = true
val splitPattern = pattern.split(",")
val splitCaseCol = caseCol.split(",")
var foundAtIndex = -1
for (i <- 0 to splitPattern.length - 1) {
breakable {
for (j <- 0 to splitCaseCol.length - 1) {
if (j > foundAtIndex) {
println(splitCaseCol(j))
if (splitCaseCol(j) == splitPattern(i)) {
result = true
foundAtIndex = j
break
} else result = false
} else result = false
}
}
}
println(caseCol, result)
(result)
}
// registering the udf
val udfFilter = udf(coder)
//reading the input file
val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")
//calling the function and aggregating
df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern","sum").show
}
if input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
if input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+

Removing duplicates from the data

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those filves into one:
all_data = do.call(rbind.fill, list_of_data)
In the new table is a column called "Accession". After combining many of the names (Accession) are repeated. And I would like to remove all of the duplicates.
Another problem is that some of those "names" are ALMOST the same. The difference is that there is name and after become the dot and the number.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both subset and rename the values:
subset(transform(alldata, Ascension = sub("\\..*", "", Ascension)),
!duplicated(Ascension))
Ascension
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
What about
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
".", fixed = T), "[", 1))), ]

Parsing text file in matlab

I have this txt file:
BLOCK_START_DATASET
dlcdata L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Parameterfiles\Bladed4.2\DLC-Files\DLCDataFile.txt
simulationdata L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Parameterfiles\Bladed4.2\DLC-Files\BladedFile.txt
outputfolder Pfadangabe\runs_test
windfolder L:\loads2\WEC\1002_50-2\_calc\50-2_D135_HH95_RB-AB66-0O_GL2005_towerdesign_Bladed_v4-2_revA01\_wind
referenzfile_servesea L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\Referencefiles\Bladed4.2\DLC\dlc1-1_04a1.$PJ
referenzfile_generalsea L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\Referencefiles\Bladed4.2\DLC\dlc6-1_000_a_50a_022.$PJ
externalcontrollerdll L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\external_Controller\DisCon_V3_2_22.dll
externalcontrollerparameter L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\external_Controller\ext_Ctrl_Data_V3_2_22.txt
BLOCK_END_DATASET
% ------------------------------------
BLOCK_START_WAVE
% a6*x^6 + a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
factor_hs 0.008105;0.029055;0.153752
factor_tz -0.029956;1.050777;2.731063
factor_tp -0.118161;1.809956;3.452903
spectrum_gamma 3.3
BLOCK_END_WAVE
% ------------------------------------
BLOCK_START_EXTREMEWAVE
height_hs1 7.9
period_hs1 11.8
height_hs50 10.8
period_hs50 13.8
height_hred1 10.43
period_hred1 9.9
height_hred50 14.26
period_hred50 11.60
height_hmax1 14.8
period_hmax1 9.9
height_hmax50 20.1
period_hmax50 11.60
BLOCK_END_EXTREMEWAVE
% ------------------------------------
BLOCK_START_TIDE
normal 0.85
yr1 1.7
yr50 2.4
BLOCK_END_TIDE
% ------------------------------------
BLOCK_START_CURRENT
velocity_normal 1.09
velocity_yr1 1.09
velocity_yr50 1.38
BLOCK_END_CURRENT
% ------------------------------------
BLOCK_START_EXTREMEWIND
velocity_v1 29.7
velocity_v50 44.8
velocity_vred1 32.67
velocity_vred50 49.28
velocity_ve1 37.9
velocity_ve50 57
velocity_Vref 50
BLOCK_END_EXTREMEWIND
% ------------------------------------
Currently I'm parsing it this way:
clc, clear all, close all
%Find all row headers
fid = fopen('test_struct.txt','r');
row_headers = textscan(fid,'%s %*[^\n]','CommentStyle','%','CollectOutput',1);
row_headers = row_headers{1};
fclose(fid);
%Find all attributes
fid1 = fopen('test_struct.txt','r');
attributes = textscan(fid1,'%*s %s','CommentStyle','%','CollectOutput',1);
attributes = attributes{1};
fclose(fid1);
%Collect row headers and attributes in a single cell
parameters = [row_headers,attributes];
%Find all the blocks
startIdx = find(~cellfun(#isempty, regexp(parameters, 'BLOCK_START_', 'match')));
endIdx = find(~cellfun(#isempty, regexp(parameters, 'BLOCK_END_', 'match')));
assert(all(size(startIdx) == size(endIdx)))
%Extract fields between BLOCK_START_ and BLOCK_END_
extract_fields = #(n)(parameters(startIdx(n)+1:endIdx(n)-1,1));
struct_fields = arrayfun(extract_fields, 1:numel(startIdx), 'UniformOutput', false);
%Extract attributes between BLOCK_START_ and BLOCK_END_
extract_attributes = #(n)(parameters(startIdx(n)+1:endIdx(n)-1,2));
struct_attributes = arrayfun(extract_attributes, 1:numel(startIdx), 'UniformOutput', false);
%Get structure names stored after each BLOCK_START_
structures_name = #(n) strrep(parameters{startIdx(n)},'BLOCK_START_','');
structure_names = genvarname(arrayfun(structures_name,1:numel(startIdx),'UniformOutput',false));
%Generate structures
for i=1:numel(structure_names)
eval([structure_names{i} '=cell2struct(struct_attributes{i},struct_fields{i},1);'])
end
It works, but not as I want. The overall idea is to read the file into one structure (one field per block BLOCK_START / BLOCK_END). Furthermore, I would like the numbers to be read as double and not as char, and delimiters like "whitespace" "," or ";" have to be read as array separator (e.g. 3;4;5 = [3;4;5] and similar).
To clarify better, I will take the block
BLOCK_START_WAVE
% a6*x^6 + a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
factor_hs 0.008105;0.029055;0.153752
factor_tz -0.029956;1.050777;2.731063
factor_tp -0.118161;1.809956;3.452903
spectrum_gamma 3.3
BLOCK_END_WAVE
The structure will be called WAVE with
WAVE.factor_hs = [0.008105;0.029055;0.153752]
WAVE.factor_tz = [-0.029956;1.050777;2.731063]
WAVE.factor_tp = [-0.118161;1.809956;3.452903]
WAVE.spectrum.gamma = 3.3
Any suggestion will be strongly appreciated.
Best regards.
You have answers to this question (which is also yours) as a good starting point! To extract everything into a cell array, you do:
%# Read data from input file
fd = fopen('test_struct.txt', 'rt');
C = textscan(fd, '%s', 'Delimiter', '\r\n', 'CommentStyle', '%');
fclose(fd);
%# Extract indices of start and end lines of each block
start_idx = find(~cellfun(#isempty, regexp(C{1}, 'BLOCK_START', 'match')));
end_idx = find(~cellfun(#isempty, regexp(C{1}, 'BLOCK_END', 'match')));
assert(all(size(start_idx) == size(end_idx)))
%# Extract blocks into a cell array
extract_block = #(n)({C{1}{start_idx(n):end_idx(n) - 1}});
cell_blocks = arrayfun(extract_block, 1:numel(start_idx), 'Uniform', false);
Now, to translate that into corresponding structs, do this:
%# Iterate over each block and convert it into a struct
for i = 1:length(cell_blocks)
%# Extract the block
C = strtrim(cell_blocks{i});
C(cellfun(#(x)isempty(x), C)) = []; %# Ignore empty lines
%# Parse the names and values
params = cellfun(#(s)textscan(s, '%s%s'), {C{2:end}}, 'Uniform', false);
name = strrep(C{1}, 'BLOCK_START_', ''); %# Struct name
fields = cellfun(#(x)x{1}{:}, params, 'Uniform', false);
values = cellfun(#(x)x{2}{:}, params, 'Uniform', false);
%# Create a struct
eval([name, ' = cell2struct({values{idx}}, {fields}, 2)'])
end
Well, I've never used matlab, but you could use the following regex to find a block:
/BLOCK_START_(\w+).*?BLOCK_END_\1/s
Then for each block, find all the attributes:
/^(?!BLOCK_END_)(\w+)\s+((?:-?\d+\.?\d*)(?:;(?:-?\d+\.?\d*))*)/m
Then based on the presence of semi colons in the second sub match you could assign it as either a single or multiple value variable. Not sure how to translate that into matLab, but I hope this helps!