Dataframe transformations produce empty values - regex

I have been trying to load Spark DataFrames from Parquet files in a set of directories, excluding the metadata directory.
The structure of directories looks like this:
dumped_data/
- time=19424145
- time=19424146
- time=19424147
- _spark_metadata
The main goal is to avoid reading data from the _spark_metadata directory. I have created a solution, but it constantly returns empty values for some reason. What could be causing this?
Here is the solution:
val dirNamesRegex: Regex = s"\\_spark\\_metadata*".r

def transformDf: Option[DataFrame] = {
  val filesDf = listPath(new Path(feedPath))(fsConfig)
    .map(_.getName)
    .filter(name => !dirNamesRegex.pattern.matcher(name).matches)
    .flatMap(path => sparkSession.parquet(Some(feedSchema))(path))

  if (!filesDf.isEmpty)
    Some(filesDf.reduce(_ union _))
  else None
}
listPath is a custom method for listing data files in HDFS, and feedSchema is a StructType.
Without the if/else around Some and None, I get this exception:
java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:137)
at scala.collection.immutable.List.reduceLeft(List.scala:84)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)

There are three problems in your code:
It seems you can use a plain equality check instead of regex matching. You know the concrete name of the directory you want to exclude, so just filter it out by name.
As I understand your code, filesDf is something like a Traversable[DataFrame]. If you want to reduce it safely even when the collection is empty, you can use reduceLeftOption instead of reduce.
In your transformDf method you both filter directory names and read data with Spark, which makes it heavy to debug. I would advise splitting the logic into two methods: the first lists the directories and filters them, the second reads the data and unions it into one general DataFrame.
I propose the following code samples:
Case without dividing the logic:
def transformDf: Option[DataFrame] = {
  listPath(new Path(feedPath))(fsConfig)
    .map(_.getName)
    .filter(name => name != "_spark_metadata")
    .flatMap(path => sparkSession.parquet(Some(feedSchema))(path))
    .reduceLeftOption(_ union _)
}
Case with the logic divided into two methods:
def getFilteredPaths: List[String] =
  listPath(new Path(feedPath))(fsConfig)
    .map(_.getName)
    .filter(name => name != "_spark_metadata")

def transformDf: Option[DataFrame] = {
  getFilteredPaths
    .flatMap(path => sparkSession.parquet(Some(feedSchema))(path))
    .reduceLeftOption(_ union _)
}
With the second approach you can write some lightweight unit tests to debug your path extraction, and once you have the correct paths you can easily read the data from the directories and union it.
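For instance, here is a minimal sketch of such a test, assuming the name filter is pulled out into a pure function (PathFilter, isDataDir and the ScalaTest suite are illustrative names, not part of the original code):
import org.scalatest.funsuite.AnyFunSuite

// Illustrative sketch: the name filter as a pure predicate,
// testable without touching HDFS or Spark.
object PathFilter {
  def isDataDir(name: String): Boolean = name != "_spark_metadata"
}

class PathFilterTest extends AnyFunSuite {
  test("data directories are kept") {
    assert(PathFilter.isDataDir("time=19424145"))
  }
  test("the metadata directory is dropped") {
    assert(!PathFilter.isDataDir("_spark_metadata"))
  }
}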

Related

Spark Scala: SQL rlike vs Custom UDF

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using the Spark SQL rlike method as below, and it was able to handle the load as long as the incoming record count stayed below 50K.
PS: The regular expression reference data is a broadcasted dataset.
dataset.join(regexDataset.value, expr("input_column rlike regular_exp_column"))
Then I wrote a custom UDF to transform them using Scala's native regex search, as below.
The val below collects the reference data as an Array of tuples:
val regexPreCalcArray: Array[(Int, Regex)] = {
  regexDataset.value
    .select("col_1", "regex_column")
    .collect
    .map(row => (row.get(0).asInstanceOf[Int], row.get(1).toString.r))
}
Implementation of the regex-matching UDF:
def findMatchingPatterns(regexDSArray: Array[(Int, Regex)]): UserDefinedFunction = {
  udf((input_column: String) => {
    for {
      text <- Option(input_column)
      matches = regexDSArray.filter(regexDSValue => if (regexDSValue._2.findFirstIn(text).isEmpty) false else true)
      if matches.nonEmpty
    } yield matches.map(x => x._1).min
  }, IntegerType)
}
The join is done as below: the UDF returns a unique ID from the reference data (the minimum ID in case of multiple regex matches), which is then joined back against the reference data on that unique ID to retrieve the other columns needed for the result.
dataset.withColumn("min_unique_id", findMatchingPatterns(regexPreCalcArray)($"input_column"))
  .join(regexDataset.value, $"min_unique_id" === $"unique_id", "left")
But this too gets very slow, with skew in execution (one executor task runs for a very long time) once the record count rises above 1M. Spark advises against UDFs since they can degrade performance. Are there any other best practices I should apply here, or is there a better API for Scala regex matching than what I've written? Any suggestions for doing this efficiently would be very helpful.

Converting segments of large .cif files to smaller .pdb files

I'm trying to carve out some binding sites with ligands from cif-files of ribosome crystal structures, and have encountered an annoying problem involving a type error.
TypeError: %c requires int or char
Using the code below,
from Bio.PDB import *
from Bio import PDB

class save_res(Select):
    def accept_residue(self, residue):
        if residue in keep_res_list:
            print(residue)
            return 1
        else:
            return 0

keep_res_list = []
parser = MMCIFParser()
structure = parser.get_structure("1vvj.cif", "./1vvj.cif")
structure = structure[0]
atom_list = Selection.unfold_entities(structure, "A")  # A for atoms
ns = NeighborSearch(atom_list)

for residue in structure.get_residues():
    if residue.get_resname() == "PAR":
        for atom in residue:
            center = atom.get_coord()
            neighbors = ns.search(center, 5.0)
            neighbor_residue_list = Selection.unfold_entities(neighbors, "R")
            for res in neighbor_residue_list:
                if res not in keep_res_list:
                    keep_res_list.append(res)

io = PDBIO()
io.set_structure(structure)
io.save("1vvj_bs.pdb", save_res())
gives me the error:
File "/scratch/software/anaconda3/envs/my-devel-3.6/lib/python3.6/site-packages/Bio/PDB/PDBIO.py", line 112, in _get_atom_line
return _ATOM_FORMAT_STRING % args
TypeError: %c requires int or char
This code works well when I change the PDB ID to 1fyb, which also has the same ligand ID.
I'm thinking the problem stems from the vast number of chains and their IDs in the original file. Am I completely wrong in this assumption, or does anyone know how to fix this? I've been trying to find a way to rename the chain IDs, but haven't found a viable method to do this.
Your help is appreciated.
The chain name format in _ATOM_FORMAT_STRING is %c, while in this case you have a chain named QA.
Chain names in PDB files were traditionally single characters.
But there are only so many letters and digits. For a ribosome it's necessary to use longer names. The PDB format has space for a second letter: the empty column to the left of the one-character chain name. Many programs support it, but not all, and it is not part of the official specification.
So you can either use PDB files with 2-character chains (if the rest of your workflow supports it) or rename chains in the output (your output is only a tiny part of the original structure).
Here is how to do it in gemmi:
import gemmi

structure = gemmi.read_structure('1vvj.cif')
model = structure[0]
ns = gemmi.NeighborSearch(model, structure.cell, 5.0).populate()
for chain in model:
    for residue in chain:
        if residue.name == 'PAR':
            for atom in residue:
                for nb in ns.find_neighbors(atom):
                    nb.to_cra(model).residue.flag = 'y'
sel = gemmi.Selection().set_residue_flags('y')
new_structure = sel.copy_structure_selection(structure)
#new_structure.remove_empty_chains()
#new_structure.shorten_chain_names()
new_structure.write_minimal_pdb('1vvj-par.pdb')
The two commented-out lines rename the chains.
One difference compared with your code is that NeighborSearch in gemmi is symmetry-aware: it also finds nearby atoms from symmetry mates. In BioPython you search only in the asymmetric unit (ASU).
Both are different from the biological assembly; PDB-101 covers it nicely.
If you'd like to search in the ASU only, replace structure.cell with gemmi.UnitCell() above, i.e. don't pass the unit cell information.
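That is, something like the following, keeping everything else the same:
# ASU-only search: pass an empty unit cell so no symmetry mates are considered
ns = gemmi.NeighborSearch(model, gemmi.UnitCell(), 5.0).populate()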
(You can ask such questions on bioinformatics.SE; it should get an answer sooner there.)

Parse CSV efficiently in python

I am writing a CSV parser which has the following structure:
class decode:
    def __init__(self):
        self.fd = open('test.csv')

    def decodeoperation(self):
        for row in self.fd:
            cmd = self.decodecmd(row)
            if cmd == 'A':
                self.decodeAopt()
            elif cmd == 'B':
                self.decodeBopt()

    def decodeAopt(self):
        for row in self.fd:
            # decode further dependencies based on cmd A till
            # a condition occurs on any further row
            return

    def decodeBopt(self):
        for row in self.fd:
            # decode further dependencies based on cmd B till
            # a condition occurs on any further row
            return
The current code works fine for me, but I don't feel good about iterating through the CSV file in all the methods. Could this be done in a better way?
There is nothing inherently wrong with using a common iterator across multiple methods, as long as you can determine in advance which method to dispatch to at any given point in the sequence (which you are doing by decoding the cmd from the row and getting 'A', 'B', etc.). The design has issues if you have to read several items before you could determine which method to call, and might have to back up if you picked the wrong method and needed to try another. In parsing, this is called backtracking. Since you are passing around a file object, backing up is difficult. Note that your separate decoder methods will have to know when to stop before reading the next row that contains a command, so they will need some sort of terminating sentinel row that they can recognize.
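To illustrate the last point, here is a minimal sketch of one such sub-decoder, assuming a hypothetical convention where each command block ends with an 'END' row (this sentinel is not in your data, it is only an example):
def decodeAopt(self):
    for row in self.fd:
        if row.strip() == 'END':   # recognizable sentinel row: stop this decoder
            return
        # decode A-specific fields from this row ...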
Some general comments on your Python and class design:
You have a nice simple if-elif-elif dispatch table that can translate to a Python dict like this:
# put this code in place of your "if cmd == ... elif elif elif..." code
dispatch = {
    # note - no ()'s, we just want to reference the methods, not call them
    'A': self.decodeAopt,
    'B': self.decodeBopt,
    'C': self.decodeCopt,
    # look how easy it is to add more decoders
}

# look up which decoder to use for the current cmd
decoder = dispatch[cmd]
# run it
decoder()

# or do it all in one line
dispatch[cmd]()
Instead of having your __init__ method open a file, let it accept an iterator object. This will make it much easier to write tests for your object, since you'll be able to pass simple Python lists containing CSV rows.
class decode:
    def __init__(self, sequence):
        self.fd = sequence
You might want to rename this var from 'fd' to something like 'seq', since it doesn't have to be a file, but could be any iterable that gives you decodable rows.
If you are doing your own CSV parsing, look at using the built-in csv module. It will do quite a bit of work for you, like parsing quoted strings that could contain commas, and it can give you easy-to-work-with dicts for each row, with headers read from the input file or specified by you. If you have modified __init__ as I suggested, you can use it like:
import csv
# assuming test.csv has a header row
reader = csv.DictReader(open('test.csv'))
# or specify headers if not - I encourage you to give these columns better names
reader.fieldnames = ['cmd', 'val1', 'val2', 'val3']
decoder = decode(reader)
decoder.decodeoperation()
Then you can write in decodeoperation:
cmd = row['cmd']
Note that this would impart a slightly different design to your class, that it would expect to be given a sequence of dicts, rather than a sequence of strings.
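Putting these suggestions together, here is a minimal sketch of what the reworked class could look like (the method bodies are stubs, and the dispatch covers only 'A' and 'B' for brevity):
class decode:
    # Takes any iterable of dict-like rows (e.g. csv.DictReader, or a plain
    # list of dicts in a test) instead of opening a file itself.
    def __init__(self, sequence):
        self.seq = sequence

    def decodeoperation(self):
        dispatch = {'A': self.decodeAopt, 'B': self.decodeBopt}
        for row in self.seq:
            dispatch[row['cmd']]()

    def decodeAopt(self):
        pass  # decode A-specific fields here

    def decodeBopt(self):
        pass  # decode B-specific fields here

# with the DictReader from above:
#     decode(reader).decodeoperation()
# in a test, no file needed:
#     decode([{'cmd': 'A'}, {'cmd': 'B'}]).decodeoperation()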

Lua functions use "self" in source but no metamethod allows to use them

I've been digging into Lua's source code, both the C source from their website and the Lua files from Lua on Windows. I found something odd that I can't find any information about: why they chose to do this.
There are some methods in the string library that allow OOP-style calling, by attaching the method to the string like this:
string.format(s, e1, e2, ...)
s:format(e1, e2, ...)
So I dug into the source code for the table module, and found that functions like table.remove() also allow for the same thing.
Here's the source code from UnorderedArray.lua:
function add(self, value)
  self[#self + 1] = value
end

function remove(self, index)
  local size = #self
  if index == size then
    self[size] = nil
  elseif (index > 0) and (index < size) then
    self[index], self[size] = self[size], nil
  end
end
This indicates that the functions should support the colon syntax. Lo and behold, when I copy table into my new list, the methods carry over. Here's an example using table.insert as a method:
function copy(obj, seen) -- Recursive function to copy a table with tables
  if type(obj) ~= 'table' then return obj end
  if seen and seen[obj] then return seen[obj] end
  local s = seen or {}
  local res = setmetatable({}, getmetatable(obj))
  s[obj] = res
  for k, v in pairs(obj) do res[copy(k, s)] = copy(v, s) end
  return res
end

function count(list) -- Count a list because #table doesn't work on keyindexed tables
  local sum = 0; for i, v in pairs(list) do sum = sum + 1 end; print("Length: " .. sum)
end

function pts(s) print(tostring(s)) end -- Macro function

local list = {1, 2, 3}
pts(list.insert)      --> nil
pts(table["insert"])  --> function: 0xA682A8
pts(list["insert"])   --> nil

list = copy(_G.table)
pts(table["insert"])  --> function: 0xA682A8
pts(list["insert"])   --> function: 0xA682A8
count(list)           --> Length: 9
list:insert(-1, "test")
count(list)           --> Length: 10
Were Lua 5.1 and newer supposed to support table methods the way the string library does, but the developers decided not to implement the metamethod?
EDIT:
I'll explain it a little further so people understand.
Strings have metamethods attached that you can use on strings OOP-style.
s = "test"
s:sub(1,1)
But tables don't, even though the methods in the table library's source code allow for it by taking self as the first argument. So the following code doesn't work:
t = {1,2,3}
t:remove(#t)
The function has self defined as its first argument (UnorderedArray.lua:25: function remove(self, index)).
You can find the metamethods of strings by using:
for i, v in pairs(getmetatable('').__index) do
  print(i, tostring(v))
end
which prints the list of all methods available for strings:
sub function: 0xB4ABC8
upper function: 0xB4AB08
len function: 0xB4A110
gfind function: 0xB4A410
rep function: 0xB4AD88
find function: 0xB4A370
match function: 0xB4AE08
char function: 0xB4A430
dump function: 0xB4A310
gmatch function: 0xB4A410
reverse function: 0xB4AE48
byte function: 0xB4A170
format function: 0xB4A0F0
gsub function: 0xB4A130
lower function: 0xB4AC28
If you attach the table library to a table's metatable, as Oka showed in the example below, you can use the methods that table has in just the same way the string methods work.
The question is: why would the Lua developers attach methods to strings by default but not to tables, even though the table library and its methods allow for it in the source code?
The question was answered: it would allow the developer of a module or program to alter the metatable shared by all tables, with the result that tables would behave differently from vanilla Lua when used in that program. It's different if you implement a class for a data type (say, vectors) and change the metamethods of that specific class and its tables, instead of changing the behavior of all of Lua's standard tables. This also slightly overlaps with operator overloading.
If I'm understanding your question correctly, you're asking why it is not possible to do the following:
local tab = {}
tab:insert('value')
Having tables spawn with a default metatable and __index breaks some assumptions that one would have about tables.
Mainly, empty tables should be empty. If tables were to spawn with an __index metamethod lookup for the insert, sort, etc., methods, it would break the assumption that an empty table should not respond to any members.
This becomes an issue if you're using a table as a cache or memo and you need to check whether the 'insert' or 'sort' strings exist in it or not (think arbitrary user input). You'd need to use rawget to solve a problem that didn't need to be there in the first place.
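For example, a small sketch simulating what such a default lookup would do to a cache table (the metatable is set explicitly here only to demonstrate the problem):
local cache = setmetatable({}, { __index = table })  -- pretend this were the default
print(cache['insert'] ~= nil)          --> true, even though nothing was cached yet
print(rawget(cache, 'insert') ~= nil)  --> false, the check you would be forced to use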
Empty tables should also be orphans, meaning that they should have no relations unless the programmer explicitly gives them relations. Tables are the only complex data structure available in Lua, and they are the foundation for a lot of programs. They need to be free and flexible. Pairing them with the table table as a default metatable creates some inconsistencies. For example, not all tables can make use of the generic sort function: a weird piece of cruft for dictionary-like tables.
Additionally, consider that you're utilizing a library, and that library's author has told you that a certain function returns a densely packed table (i.e., an array), so you figure that you can call :sort(...) on the returned table. What if the library author has changed the metatable of that return table? Now your code no longer works, and any generic functions built on top of a _:sort(...) paradigm can't accept these tables.
Basically put, strings and tables are two very different beasts. Strings are immutable, static, and their contents are predictable. Tables are mutable, transient, and very unpredictable.
It's much, much easier to add this in when you need it, instead of baking it into the language. A very simple function:
local meta = { __index = table }

_G.T = function (tab)
  if tab ~= nil then
    local tab_t = type(tab)
    if tab_t ~= 'table' then
      error(("`table' expected, got: `%s'"):format(tab_t), 0)
    end
  end
  return setmetatable(tab or {}, meta)
end
Now any time you want a table that responds to functions found in the table table, just prefix it with a T.
local foo = T {}
foo:insert('bar')
print(#foo) --> 1

SML comparing files at the bit level

I am attempting to compare files in a directory using SML. Using the TextIO library is fairly easy but I need to compare the files at the bit level. That is, a binary compare. I am using a function similar to this:
fun listDir (s) =
  let
    fun loop (ds) =
      (case OS.FileSys.readDir (ds) of
           "" => [] before OS.FileSys.closeDir (ds)
         | file => file :: loop (ds))
    val ds = OS.FileSys.openDir (s)
  in
    loop (ds) handle e => (OS.FileSys.closeDir (ds); raise (e))
  end
to list all the files in a given directory. But now, I need to look at the bits in each file. Any suggestions?
Take a look at the BinIO structure.
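For example, here is a minimal sketch of a byte-by-byte comparison with BinIO (the function name is illustrative; for speed you would read larger chunks, e.g. with BinIO.inputN, but the idea is the same):
(* Compare two files byte by byte; true iff their contents are identical. *)
fun filesEqual (path1, path2) =
  let
    val in1 = BinIO.openIn path1
    val in2 = BinIO.openIn path2
    fun loop () =
      (case (BinIO.input1 in1, BinIO.input1 in2) of
           (NONE, NONE) => true                  (* both files ended together *)
         | (SOME b1, SOME b2) => b1 = b2 andalso loop ()
         | _ => false)                           (* different lengths *)
  in
    loop () before (BinIO.closeIn in1; BinIO.closeIn in2)
  end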