I am attempting to compare files in a directory using SML. Using the TextIO library is fairly easy but I need to compare the files at the bit level. That is, a binary compare. I am using a function similar to this:
fun listDir (s) = let
    (* in the current Basis Library, readDir returns NONE once the stream is exhausted *)
    fun loop (ds) = (case OS.FileSys.readDir (ds)
          of NONE => [] before OS.FileSys.closeDir (ds)
           | SOME file => file::loop (ds))
    val ds = OS.FileSys.openDir (s)
    in
      loop (ds) handle e => (OS.FileSys.closeDir (ds); raise (e))
    end
to list all the files in a given directory. But now, I need to look at the bits in each file. Any suggestions?
Take a look at the BinIO structure.
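BinIO reads a file as raw bytes (Word8 values) rather than characters, so two files are identical exactly when their byte sequences are equal. As a language-neutral illustration of that idea, here is a sketch in Python rather than SML; the function name files_equal and the chunk size are assumptions made for the example, not part of any library:

```python
import os

def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Return True when both files contain exactly the same bytes."""
    # Files of different sizes can never match, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files are exhausted at the same point
                return True
```

Reading in chunks keeps memory use bounded for large files; an SML version would follow the same shape with BinIO.openIn and BinIO.inputN.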
I have been trying to load Spark DataFrames from the Parquet files in every directory except the metadata directory.
The structure of directories looks like this:
dumped_data/
- time=19424145
- time=19424146
- time=19424147
- _spark_metadata
The main goal is to avoid reading data from the _spark_metadata directory. I have written a solution, but it always returns empty values. What could be the reason?
Here is the solution:
val dirNamesRegex: Regex = s"\\_spark\\_metadata*".r
def transformDf: Option[DataFrame] = {
val filesDf = listPath(new Path(feedPath))(fsConfig)
.map(_.getName)
.filter(name => !dirNamesRegex.pattern.matcher(name).matches)
.flatMap(path => sparkSession.parquet(Some(feedSchema))(path))
if (!filesDf.isEmpty)
Some(filesDf.reduce(_ union _))
else None
}
listPath is a custom method for listing data files in HDFS; feedSchema is a StructType.
Without the if/else that returns Some or None, I get this exception:
java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:137)
at scala.collection.immutable.List.reduceLeft(List.scala:84)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
Your code has three problems:
You can compare names with == or != instead of matching a regex. You know the exact name of the directory to exclude, so just filter by name.
As far as I can tell from your code, filesDf is something like Traversable[DataFrame]. If you want to reduce it safely even when the collection is empty, use reduceLeftOption instead of reduce.
Your transformDf method both filters directory names and reads data with Spark, which makes it heavy to debug. I would advise splitting the logic into two methods: the first lists the directories and filters them, the second reads the data and unions it into one DataFrame.
I propose the following code samples.
Without dividing the logic:
def transformDf: Option[DataFrame] = {
  listPath(new Path(feedPath))(fsConfig)
    .map(_.getName)
    .filter(name => name != "_spark_metadata") // keep everything except the metadata directory
    .flatMap(path => sparkSession.parquet(Some(feedSchema))(path))
    .reduceLeftOption(_ union _) // None when no directories remain
}
With the logic divided:
def getFilteredPaths: List[String] =
  listPath(new Path(feedPath))(fsConfig)
    .map(_.getName)
    .filter(name => name != "_spark_metadata") // keep everything except the metadata directory
def transformDf: Option[DataFrame] = {
  getFilteredPaths
    .flatMap(path => sparkSession.parquet(Some(feedSchema))(path))
    .reduceLeftOption(_ union _) // None when no directories remain
}
With the second approach you can write lightweight unit tests to debug the path extraction, and once the paths are correct you can easily read the data from the directories and union it.
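The two ideas in this answer, filtering by exact name and reducing without failing on an empty collection, do not depend on Spark. A minimal Python sketch of both, with plain lists standing in for DataFrames and the helper names keep_data_dirs and reduce_option invented for the illustration:

```python
from functools import reduce

def keep_data_dirs(names, excluded="_spark_metadata"):
    """Drop the metadata directory and keep every other name."""
    return [name for name in names if name != excluded]

def reduce_option(func, items):
    """Like reduce, but return None for an empty input instead of raising."""
    items = list(items)
    if not items:
        return None
    return reduce(func, items)
```

Because keep_data_dirs is a pure function over names, it can be unit-tested without a cluster, which is the point of splitting the logic.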
I want to retrieve the list of direct files (i.e. no recursive search) of a given directory and a given extension in OCaml.
I tried the following but:
It does not feel like idiomatic OCaml
It does not work (import error)
let list_osc2 =
let list_files = Sys.readdir "tests/osc2/expected/pp" in
List.filter (fun x -> Str.last_chars x 4 = ".osc2") (Array.to_list list_files)
I got the error (I am using OCamlPro):
Required module `Str' is unavailable
Thanks
You can use Filename.extension instead of Str.last_chars:
let list_osc2 =
let list_files = Sys.readdir "tests/osc2/expected/pp" in
List.filter (fun x -> Filename.extension x = ".osc2") (Array.to_list list_files)
and then use the pipe operator to make it a bit more readable:
let list_osc2 =
  Sys.readdir "tests/osc2/expected/pp"
  |> Array.to_list
  (* Filename.extension includes the leading dot *)
  |> List.filter (fun x -> Filename.extension x = ".osc2")
I don't know how you expect this to work in OCamlPro though, as it doesn't have a filesystem as far as I'm aware.
To use the Str module, you need to link with the str library. For example, with ocamlc you need to pass str.cma, and with ocamlopt you need to pass str.cmxa. I don't know how to do that with OCamlPro.
In any case, Str.last_chars is not particularly useful here. It doesn't work if the file name is shorter than the suffix. By the way, your code would never match because ".osc2" is 5 characters, which is never equal to last_chars x 4.
The Filename module from the standard library has functions to extract and check a file's extension. You don't need to do any string manipulation.
I don't know what you consider “ugly as hell”, but apart from the mistake with string manipulation, I don't see any problem with your code. Enumerating the matches and filtering them is perfectly idiomatic.
let list_osc2 =
  let list_files = Sys.readdir "tests/osc2/expected/pp" in
  List.filter (fun name -> Filename.check_suffix name ".osc2") (Array.to_list list_files)
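For comparison, here is the same enumerate-then-filter shape in Python. This is only a sketch, and the function name list_by_extension is made up here. As with Filename.extension, the extension returned by os.path.splitext includes the leading dot:

```python
import os

def list_by_extension(directory, extension):
    """Return the names of the direct child files of `directory` with `extension`."""
    return sorted(
        name
        for name in os.listdir(directory)
        # splitext("a.osc2") gives ("a", ".osc2"): the dot is part of the extension
        if os.path.splitext(name)[1] == extension
        and os.path.isfile(os.path.join(directory, name))
    )
```

Passing "osc2" without the dot matches nothing, which is the same pitfall as comparing Filename.extension against a dotless string.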
I have a case that maps two lists into a single list. I expected the result to be the same in both MLs. Unfortunately, the ReasonML result looks different. Please help and comment on how to fix it.
OCaml
List.map2 (fun x y -> x+y) [1;2;3] [10;20;30];;
[11; 22; 33]
ReasonML
Js.log(List.map2 ( (fun (x,y) => x+y), [1,2,3], [10,20,30]))
[11,[22,[33,0]]]
This is the same result. If you run:
Js.log([11,22,33]);
You'll get:
[11,[22,[33,0]]]
The result is the same, but you're using different methods of printing them. If instead of Js.log you use rtop or sketch.sh, you'll get the output you expect:
- : list(int) = [11, 22, 33]
Js.log prints it differently because it is a BuckleScript binding to console.log, which will print the JavaScript-representation of the value you give to it. And lists don't exist in JavaScript, only arrays do.
The way BuckleScript represents lists is pretty much the same way it is done natively. A list in OCaml and Reason is a "cons-cell", which is essentially a tuple or a 2-element array, where the first item is the value of that cell and the last item is a pointer to the next cell. The list type is essentially defined like this:
type list('a) =
| Node('a, list('a))
| Empty;
With this definition, the list above could have been constructed with:
Node(11, Node(22, Node(33, Empty)))
which is represented in JavaScript like this:
[11,[22,[33,0]]]
 ^   ^   ^  ^
 |   |   |  The Empty list
 |   |   Third value
 |   Second value
 First value
Lists are defined this way because immutability makes this representation very efficient: we can add or remove values at the front without copying all the items of the old list into a new one. To add an item we only need to create one new "cons-cell". Using the JavaScript representation with imagined immutability:
const old = [11,[22,[33,0]]];
const longer = [99, old];
And to remove an item from the front we don't have to create anything. We can just get a reference to and re-use a sub-list, because we know it won't change.
const old = [11,[22,[33,0]]];
const shorter = old[1];
The downside of lists is that adding and removing items to the end is relatively expensive. But in practice, if you structure your code in a functional way, using recursion, the list will be very natural to work with. And very efficient.
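The structural sharing described above can be sketched with nested Python tuples standing in for cons-cells. This is an illustration only, not how BuckleScript is implemented; EMPTY, cons, and to_list are names invented here:

```python
EMPTY = None  # stands in for the empty list

def cons(head, tail):
    """Build one new cell in front of an existing list; the tail is not copied."""
    return (head, tail)

def to_list(cell):
    """Walk the cells and collect the values into a flat Python list."""
    out = []
    while cell is not EMPTY:
        head, cell = cell
        out.append(head)
    return out

old = cons(11, cons(22, cons(33, EMPTY)))
longer = cons(99, old)   # one new cell; `old` is reused as the tail
shorter = old[1]         # dropping the head allocates nothing
```

Here longer[1] is the very same object as old, so prepending costs one cell regardless of how long the list is.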
@Igor Kapkov, thank you for your help. Based on your comment, I found the pipe operator in the link; here is a summary.
let a = List.map2 ( (fun (x,y) => x+y), [1,2,3], [10,20,30] )
let logl = l => l |> Array.of_list |> Js.log;
a |> logl
[11,22,33]
Coming from a C# and Python background, I feel there must be a better way to read a file and populate a classic F# list. But I know that an F# list is immutable, so the alternative seems to be a List<string> object and its Add method.
So far what I have at hand:
open System.IO
open System.Collections.Generic
let ptr = new StreamReader("stop-words.txt")
let lst = new List<string>()
let ProcessLine line =
    match line with
    | null -> false
    | s ->
        lst.Add(s)
        true
while ProcessLine (ptr.ReadLine()) do ()
If I were to write the similar stuff in python I'd do something like:
[x[:-1] for x in open('stop-words.txt')]
Simple solution
System.IO.File.ReadAllLines(filename) |> List.ofArray
Alternatively, you can write a recursive function:
let processline (fname: string) =
    // `use` disposes the reader once all lines have been collected
    use file = new System.IO.StreamReader(fname)
    let rec dowork () =
        match file.ReadLine() with
        | null -> []
        | t -> t :: dowork ()
    dowork ()
If you want to read all lines from a file, you can just use ReadAllLines. The method returns the data as an array, but you can easily turn that into F# list using List.ofArray or process it using the functions in the Seq module:
open System.IO
File.ReadAllLines("stop-words.txt")
Alternatively, if you do not want to read all the contents into memory, you can use File.ReadLines which reads the lines lazily.
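A Python sketch of the same eager versus lazy distinction (the function names are invented for the illustration). The first function loads every line up front; the second is a generator that yields lines one at a time as the caller consumes them:

```python
def read_all_lines(path):
    """Eager: read the whole file and return a list of lines without newlines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()

def read_lines_lazily(path):
    """Lazy: yield one line at a time; the file is read only as far as needed."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

The lazy version suits large files or early exits; the eager version is simpler when the whole list is needed anyway, as with a stop-word list.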
I need to scan through a document and accumulate the output of different functions for each string in the file. The function run on any given line of the file depends on what is in that line.
I could do this very inefficiently by making a complete pass through the file for every list I wanted to collect. Example pseudo-code:
at :: B.ByteString -> Maybe Atom
at line
    | line == ATOM record = do stuff to return Just Atom
    | otherwise = Nothing
ot :: B.ByteString -> Maybe Sheet
ot line
    | line == SHEET record = do other stuff to return Just Sheet
    | otherwise = Nothing
Then, I would map each of these functions over the entire list of lines in the file to get a complete list of Atoms and Sheets:
mapper :: [B.ByteString] -> IO ()
mapper lines = do
    let atoms = mapMaybe at lines
    let sheets = mapMaybe ot lines
    -- Do stuff with my atoms and sheets
However, this is inefficient because I am mapping over the entire list of strings once for every list I am trying to create. Instead, I want to traverse the list of lines only once, identify each line as I go, apply the appropriate function, and store the results in different lists.
My C mentality wants to do this (pseudo code):
mapper' :: [B.ByteString] -> IO ()
mapper' lines = do
    let atoms = []
    let sheets = []
    for line in lines:
        | line == ATOM record = (atoms = atoms ++ at line)
        | line == SHEET record = (sheets = sheets ++ ot line)
    -- Now 'atoms' is a complete list of all the ATOM records
    -- and 'sheets' is a complete list of all the SHEET records
What is the Haskell way of doing this? I simply can't get my functional-programming mindset to come up with a solution.
First of all, I think that the answers others have supplied will work at least 95% of the time. It's always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you're looking for in the list, and in these cases trying to enumerate all possibilities is difficult/time-consuming/error-prone. Or, you're writing multiple variants of the same sort of thing (manually inlining multiple folds into one) and you'd like to capture the abstraction.
Fortunately, there are a few techniques that can help.
The framework solution
(somewhat self-evangelizing)
First, the various "iteratee/enumerator" packages often provide functions to deal with this sort of problem. I'm most familiar with iteratee, which would let you do the following:
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Maybe
-- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
-- if you want to just return them as a list, you can use the built-in
-- stream2list function
-- next, create stream transformers
-- given at :: B.ByteString -> Maybe Atom
-- create a stream transformer from ByteString lines to Atoms
atIter :: Enumeratee [B.ByteString] [Atom] m a
atIter = I.mapChunks (catMaybes . map at)
otIter :: Enumeratee [B.ByteString] [Sheet] m a
otIter = I.mapChunks (catMaybes . map ot)
-- finally, combine multiple processors into one
-- if you have more than one processor, you can use zip3, zip4, etc.
procFile :: Iteratee [B.ByteString] m ([Atom],[Sheet])
procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)
-- and run it on some data
runner :: FilePath -> IO ([Atom],[Sheet])
runner filename = do
    resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
    run resultIter
One benefit this gives you is extra composability. You can create transformers as you like, and just combine them with zip. You can even run the consumers in parallel if you like (although only if you're working in the IO monad, and probably not worth it unless the consumers do a lot of work) by changing to this:
import Data.Iteratee.Parallel
parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)
The result of doing so isn't the same as a single for-loop - this will still perform multiple traversals of the data. However, the traversal pattern has changed. This will load a certain amount of data at once (defaultBufSize bytes) and traverse that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.
Hopefully this will demonstrate the difference:
Data.List.zip:
  x1 x2 x3 .. x_n
                   x1 x2 x3 .. x_n
Data.Iteratee.zip:
  x1 x2      x3 x4      x_n-1 x_n
       x1 x2      x3 x4           x_n-1 x_n
If you're doing enough work that parallelism makes sense this isn't a problem at all. Due to memory locality, the performance is much better than multiple traversals over the entire input as Data.List.zip would make.
The beautiful solution
If a single-traversal solution really does make the most sense, you might be interested in Max Rabkin's Beautiful Folding post, and Conal Elliott's followup work (this too). The essential idea is that you can create data structures to represent folds and zips, and combining these lets you create a new, combined fold/zip function that only needs one traversal. It's maybe a little advanced for a Haskell beginner, but since you're thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.
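The core of the idea can be sketched without any library: represent a fold as a step function plus an initial state, and combine two folds into one whose state is a pair of states. The sketch below is in Python rather than Haskell and is not the API of any of the posts mentioned; Fold, zip_folds, and run_fold are hypothetical names chosen for the illustration.

```python
from functools import reduce
from typing import Any, Callable, NamedTuple

class Fold(NamedTuple):
    step: Callable[[Any, Any], Any]  # (state, item) -> new state
    initial: Any

def zip_folds(left, right):
    """Combine two folds into one that advances both on every item."""
    def step(state, item):
        left_state, right_state = state
        return (left.step(left_state, item), right.step(right_state, item))
    return Fold(step, (left.initial, right.initial))

def run_fold(fold, items):
    """Consume `items` exactly once."""
    return reduce(fold.step, items, fold.initial)

total = Fold(lambda acc, x: acc + x, 0)
count = Fold(lambda acc, _x: acc + 1, 0)
```

For example, run_fold(zip_folds(total, count), iter([1, 2, 3, 4])) gives (10, 4) even though the iterator can only be walked once.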
I show a solution for two types of line, but it is easily extended to five types of line by using a five-tuple instead of a two-tuple.
import Data.Monoid
eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs | isAnAtom bs = ([ {- calculate an Atom -} ], [])
            | isASheet bs = ([], [ {- calculate a Sheet -} ])
            | otherwise = error "eachLine"
allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines bss = mconcat (map eachLine bss)
The magic is done by mconcat from Data.Monoid (included with GHC).
(On a point of style: personally I would define a Line type, a parseLine :: B.ByteString -> Line function and write eachLine bs = case parseLine bs of .... But this is peripheral to your question.)
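The same single-pass shape, sketched in Python for comparison: each line is turned into a pair of small lists, and the pairs are combined component-wise, which is what mconcat does for tuples of lists. The ATOM and SHEET prefixes and the helper names are assumptions for the example, and unlike the Haskell version, unrecognised lines contribute nothing instead of raising an error.

```python
def each_line(line):
    """Classify one line into a pair (atoms, sheets)."""
    if line.startswith("ATOM"):
        return ([line], [])
    if line.startswith("SHEET"):
        return ([], [line])
    return ([], [])  # assumption: ignore anything else

def all_lines(lines):
    """Combine the per-line pairs component-wise in a single pass."""
    atoms, sheets = [], []
    for line in lines:
        new_atoms, new_sheets = each_line(line)
        atoms.extend(new_atoms)
        sheets.extend(new_sheets)
    return atoms, sheets
```

Extending to more record types means widening the tuple, which is where a named record like a dedicated summary type starts to pay off.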
It is a good idea to introduce a new ADT, e.g. "Summary" instead of tuples.
Then, since you want to accumulate the values of Summary, you can make it an instance of Monoid. You classify each of your lines with the help of classifier functions (e.g. isAtom, isSheet, etc.) and concatenate the results using Monoid's mconcat function (as suggested by @dave4420).
Here is the code (it uses String instead of ByteString, but it is quite easy to change):
module Classifier where
import Data.List
import Data.Monoid
data Summary = Summary
  { atoms :: [String]
  , sheets :: [String]
  , digits :: [String]
  } deriving (Show)
-- Since GHC 8.4 a Monoid instance needs a Semigroup instance; mappend defaults to (<>)
instance Semigroup Summary where
  Summary as1 ss1 ds1 <> Summary as2 ss2 ds2 =
    Summary (as1 <> as2)
            (ss1 <> ss2)
            (ds1 <> ds2)
instance Monoid Summary where
  mempty = Summary [] [] []
classify :: [String] -> Summary
classify = mconcat . map classifyLine
classifyLine :: String -> Summary
classifyLine line
  | isAtom line = Summary [line] [] [] -- or "mempty { atoms = [line] }"
  | isSheet line = Summary [] [line] []
  | isDigit line = Summary [] [] [line]
  | otherwise = mempty -- or "error" if you need this
isAtom, isSheet, isDigit :: String -> Bool
isAtom = isPrefixOf "atom"
isSheet = isPrefixOf "sheet"
isDigit = isPrefixOf "digits"
input :: [String]
input = ["atom1", "sheet1", "sheet2", "digits1"]
test :: Summary
test = classify input
If you have only 2 alternatives, using Either might be a good idea. In that case combine your functions, map the list, and use lefts and rights to get the results:
import Data.Either
-- first sample function, returning String
f1 x = show $ x `div` 2
-- second sample function, returning Int
f2 x = 3*x+1
-- combined function returning Either String Int
hotpo x = if even x then Left (f1 x) else Right (f2 x)
xs = map hotpo [1..10]
-- [Right 4,Left "1",Right 10,Left "2",Right 16,Left "3",Right 22,Left "4",Right 28,Left "5"]
lefts xs
-- ["1","2","3","4","5"]
rights xs
-- [4,10,16,22,28]