Read file character by character in SMLNJ - sml

I need to read a text file character by character in SMLNJ and store it in a list. The file consists of one line with numbers, without spaces or any form of separation. My question is how do I get a single character from the file and add it to the list of characters?
Example:
12345678
Result:
val input = [1, 2, 3, 4, 5, 6, 7, 8]

Using the following code you can get a list of characters by reading the contents of the file as a string (TextIO.vector to be accurate). The explode function is used for the conversion to a list of characters.
fun parse file =
  let
    val stream = TextIO.openIn file
    (* read the whole file as one string *)
    val contents = TextIO.inputAll stream
    val () = TextIO.closeIn stream
  in
    (* explode : string -> char list *)
    explode contents
  end

Related

Dictionary: Alphabetize the elements of a list and count their occurrences

Hi so I've been trying to count the elements in the list that I have made, and when I do it
The result should be:
a 2
above 2
across 1
and so on.
Here's what I've got:
word = []
with open('Lateralus.txt', 'r') as my_file:
    for line in my_file:
        temporary_holder = line.split()
        for i in temporary_holder:
            word.append(i)
for i in range(0,len(word)): word[i] = word[i].lower()
word.sort()
for count in word:
    if count in word:
        word[count] = word[count] + 1
    else:
        word[count] = 1
for (word,many) in word.items():
    print('{:20}{:1}'.format(word,many))
@Kimberly, as I understand your code, you want to read a text file of alphabetic characters.
You also want to ignore the case of the characters in the file. Finally, you want to count the occurrences of each unique letter in the text file.
I suggest using a dictionary for this. I have written sample code for this task which
satisfies the following 3 conditions (please comment with inputs and expected outputs if you want a different result, and I will update my code based on that):
Reads text file and creates a single line of text by removing any spaces in between.
It converts upper case letters to lower case letters.
Finally, it creates a dictionary containing unique letters with their frequencies.
» Lateralus.txt
abcdefghijK
ABCDEfgkjHI
IhDcabEfGKJ
mkmkmkmkmoo
pkdpkdpkdAB
A B C D F Q
ab abc ab c
» Code
import json
char_occurences = {}
with open('Lateralus.txt', 'r') as file:
    all_lines_combined = ''.join([line.replace(' ', '').strip().lower() for line in file.readlines()])
print all_lines_combined # abcdefghijkabcdefgkjhiihdcabefgkjmkmkmkmkmoopkdpkdpkdababcdfqababcabc
print len(all_lines_combined) # 69 (7 lines of 11 characters, 8 spaces => 77-8 = 69)
while all_lines_combined:
    ch = all_lines_combined[0]
    char_occurences[ch] = all_lines_combined.count(ch)
    all_lines_combined = all_lines_combined.replace(ch, '')
# Pretty printing char_occurences dictionary containing occurrences of
# alphabetic characters in a text file
print json.dumps(char_occurences, indent=4)
"""
{
"a": 8,
"c": 6,
"b": 8,
"e": 3,
"d": 7,
"g": 3,
"f": 4,
"i": 3,
"h": 3,
"k": 10,
"j": 3,
"m": 5,
"o": 2,
"q": 1,
"p": 3
}
"""
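If the goal is the one shown in the question (each word, alphabetized, with its count) rather than per-letter counts, a minimal Python 3 sketch could use collections.Counter. The function name count_words and the sample sentence are made up for illustration; they are not from the original post.

```python
from collections import Counter


def count_words(text):
    """Return (word, count) pairs, lowercased and sorted alphabetically."""
    # split() with no arguments splits on any run of whitespace
    counts = Counter(word.lower() for word in text.split())
    return sorted(counts.items())


if __name__ == "__main__":
    sample = "Above a spiral across a line above"  # hypothetical input
    for word, many in count_words(sample):
        print('{:20}{}'.format(word, many))
```

For a file, the same function could presumably be fed the result of reading the whole file as one string.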

how to skip multiple header lines using python

I am new to Python. I am trying to write a script that will use the numeric columns from a file which also contains a header. Here is an example of a file:
#File_Version: 4
PROJECTED_COORDINATE_SYSTEM
#File_Version____________-> 4
#Master_Project_______->
#Coordinate_type_________-> 1
#Horizon_name____________->
sb+
#Horizon_attribute_______-> STRUCTURE
474457.83994 6761013.11978
474482.83750 6761012.77069
474507.83506 6761012.42160
474532.83262 6761012.07251
474557.83018 6761011.72342
474582.82774 6761011.37433
474607.82530 6761011.02524
I'd like to skip the header. Here is what I tried. It works, of course, if I know which characters will appear in the header, like "#" and "#". But how can I skip all lines containing any letter?
import re
in_file1 = open(input_file1_short, 'r')
out_file1 = open(output_file1_short, "w")
lines = in_file1.readlines()
x = []
y = []
for line in lines:
    if "#" not in line and "#" not in line:
        strip_line = line.strip()
        # split on spaces, commas, pipes, semicolons, quotes or tabs
        replace_split = re.split(r'[ ,|;"\t]+', strip_line)
        x = (replace_split[0])
        y = (replace_split[1])
        out_file1.write("%s\t%s\n" % (str(x), str(y)))
in_file1.close()
out_file1.close()
Thank you very much!
I think you could use some built-ins like this:
import string
for line in lines:
    if any([letter in line for letter in string.ascii_letters]):
        print "there is an ascii letter somewhere in this line"
This only looks for ASCII letters, however.
You could also do:
import unicodedata
for line in lines:
    if any([unicodedata.category(unicode(letter)).startswith('L') for letter in line]):
        print "there is a unicode letter somewhere in this line"
but only if I understand my Unicode categories correctly...
Even cleaner (using suggestions from other answers; this works for both unicode lines and strings):
for line in lines:
    if any([letter.isalpha() for letter in line]):
        print "there is a letter somewhere in this line"
But, interestingly, if you do:
In [57]: u'\u2161'.isdecimal()
Out[57]: False
In [58]: u'\u2161'.isdigit()
Out[58]: False
In [59]: u'\u2161'.isalpha()
Out[59]: False
The unicode for the roman numeral "Two" is none of those,
but unicodedata.category(u'\u2161') does return 'Nl' indicating a numeric (and u'\u2161'.isnumeric() is True).
This will check the first character in each line and skip all lines that don't start with a digit:
for line in lines:
    if line[0].isdigit():
        # we've got a line starting with a digit
        pass
Use a generator pipeline to filter your input stream.
This takes the lines from your original input, but only passes along a line after checking that there are no letters anywhere in it.
input_stream = (line for line in lines if
                reduce((lambda x, y: (not y.isalpha()) and x), line, True))
for line in input_stream:
    strip_line = ...
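Putting the suggestions above together, a rough Python 3 sketch might look like the following. The function name numeric_rows is hypothetical, and the sketch assumes every data line holds at least two whitespace-separated numbers written without exponents (a value such as 1e5 contains a letter and would be skipped as a header).

```python
def numeric_rows(lines):
    """Yield (x, y) float pairs from lines that contain no letters."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and any line containing a letter (header lines)
        if not stripped or any(ch.isalpha() for ch in stripped):
            continue
        fields = stripped.split()
        yield float(fields[0]), float(fields[1])


if __name__ == "__main__":
    sample = [
        "#File_Version: 4\n",
        "sb+\n",
        "474457.83994 6761013.11978\n",
    ]
    for x, y in numeric_rows(sample):
        print("%s\t%s" % (x, y))
```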

D lang record separator is being lost after string cast

I am opening a .gz file and reading it chunk by chunk for uncompressing it.
The data in the uncompressed file is like :
aRSbRScRSd. There is a record separator (RS, ASCII code 30) between each pair of records (the records in my dummy example are a, b, c and d).
File file = File("mylog.gz", "r");
auto uc = new UnCompress();
foreach (ubyte[] curChunk; file.byChunk(4096*1024))
{
    auto uncompressed = cast(string)uc.uncompress(curChunk);
    writeln(uncompressed);
    auto stringRange = uncompressed.splitLines();
    foreach (string line; stringRange)
    {
        // ... do something with line
    }
}
The result of the code above is:
abcd. Unfortunately, the record separators (ASCII 30) are missing.
By examining the data, I realized that the record separators go missing after I cast ubyte[] to string.
Now I have two questions:
What should I change in the code to keep record separator?
How can I write the code above without for loops? I want to read line by line.
Edit
A more general and understandable code for first question :
ubyte[] temp = [ 65, 30, 66, 30, 67];
writeln(temp);
string tempStr = cast(string) temp;
writeln (tempStr);
Result is : ABC which is not desired.
The character 30 is not a printable character although some editors may show a symbol in its place. It is not being lost, but it doesn't print out.
Also note that casting a ubyte[] to string is usually incorrect because a ubyte[] array is mutable while a string is immutable. It is better to cast a ubyte[] to a char[].
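The point that the separator is still present but invisible is not specific to D. As an illustration only (this is Python, not the asker's code), the sketch below takes the same five bytes from the question's edit, shows that the decoded text still has five characters, and splits it on the separator.

```python
RS = "\x1e"  # ASCII 30, the record separator control character

raw = bytes([65, 30, 66, 30, 67])
text = raw.decode("ascii")

# The separators survive decoding, so the text has 5 characters, not 3.
# repr() escapes control characters, which makes them visible.
visible = repr(text)

# Splitting on the separator recovers the individual records.
records = text.split(RS)

if __name__ == "__main__":
    print(len(text))
    print(visible)
    print(records)
```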

Read fields from text file and store them in a structure

I am trying to read a file that looks as follows:
Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
Channel 2: F7-T7
Channel 3: T7-P7
Channel 4: P7-O1
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
So far I have this code:
fid1 = fopen('chb01-summary.txt')
data = struct('id',{},'stime',{},'etime',{},'seizenum',{},'sseize',{},'eseize',{});
if fid1 == -1
    error('File cannot be opened ')
end
tline = fgetl(fid1);
while ischar(tline)
    i = 1;
    disp(tline);
end
I want to use regexp to find the expressions and so I did:
line1 = '(.*\d{2} (\.edf)'
data{1} = regexp(tline, line1);
tline = fgetl(fid1);
time = '^Time: .*\d{2]}: \d{2} :\d{2}' ;
data{2} = regexp(tline,time);
tline = getl(fid1);
seizure = '^File: .*\d';
data{4} = regexp(tline,seizure);
if data{4}>0
    stime = '^Time: .*\d{5}';
    tline = getl(fid1);
    data{5} = regexp(tline,seizure);
    tline = getl(fid1);
    data{6} = regexp(tline,seizure);
end
I tried using a loop to find the line at which file name starts with:
for (firstline<1) || (firstline>1 )
    firstline = strfind(tline, 'File Name')
    tline = fgetl(fid1);
end
and now I'm stumped.
Suppose that I am at the line at which the information is there, how do I store the information with regexp? I got an empty array for data after running the code once...
Thanks in advance.
I find it the easiest to read the lines into a cell array first using textscan:
%// Read lines as strings
fid = fopen('input.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
and then apply regexp on it to do the rest of the manipulations:
%// Parse field names and values
C = regexp(C{:}, '^\s*([^:]+)\s*:\s*(.+)\s*', 'tokens');
C = [C{:}]; %// Flatten the cell array
C = reshape([C{:}], 2, []); %// Reshape into name-value pairs
Now you have a cell array C of field names and their corresponding (string) values, and all you have to do is plug it into struct in the correct syntax (using a comma-separated list in this case). Note that the field names have spaces in them, so this needs to be taken care of before they can be used (e.g., by replacing them with underscores):
C(1, :) = strrep(C(1, :), ' ', '_'); %// Replace spaces with underscores
data = struct(C{:});
Here's what I get for your input file:
data =
Data_Sampling_Rate: '256 Hz'
Channel_1: 'FP1-F7'
Channel_2: 'F7-T7'
Channel_3: 'T7-P7'
Channel_4: 'P7-O1'
File_Name: 'chb01_03.edf'
File_Start_Time: '13:43:04'
File_End_Time: '14:43:04'
Number_of_Seizures_in_File: '1'
Seizure_Start_Time: '2996 seconds'
Seizure_End_Time: '3036 seconds'
Of course, it is possible to prettify it even more by converting all relevant numbers to numerical values, grouping the 'channel' fields together and such, but I'll leave this to you. Good luck!
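Because the summary repeats the same field names for every file, one record per file may be more convenient than a single flat struct. The sketch below shows that grouping idea in Python rather than MATLAB, purely as an illustration. The function name parse_summary is hypothetical, and the sketch assumes each entry begins with a "File Name" line as in the sample.

```python
def parse_summary(lines):
    """Group 'Name: value' lines into one dict per 'File Name' entry."""
    records = []
    for line in lines:
        # partition splits on the first colon only, so times such as
        # 12:42:57 stay intact in the value
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: a separator row or blank line
        name, value = name.strip(), value.strip()
        if name == "File Name":
            records.append({})  # a new file entry starts here
        if records:  # ignore fields that precede the first file entry
            records[-1][name] = value
    return records


if __name__ == "__main__":
    sample = [
        "File Name: chb01_03.edf",
        "File Start Time: 13:43:04",
        "Number of Seizures in File: 1",
    ]
    print(parse_summary(sample))
```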

Retrieve particular parts of string from a text file and save it in a new file in MATLAB

I am trying to retrieve particular parts of each line in a text file such as the one below, and I would like to save them to a new text file using MATLAB.
Original text file
D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74583
D 1m8eb_ 1m8e B: d.174.1.1 74584 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74584
D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158052
D 3e7ib1 3e7i B:77-496 d.174.1.1 158053 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158053
D 2bhja1 2bhj A:77-497 d.174.1.1 128533 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=128533
So basically, I would like to retrieve the PDB code (e.g. "1m8e"), the chain ID (e.g. "A"), the start value (e.g. "77") and the stop value (e.g. "496"), and I would like all of these values to be written out by an fprintf statement.
Is there some way to tell regexp which index each part starts at, so that I can retrieve those strings based on their position in each line of the text file?
In the end, all I want to have in the fprintf statement is 1m8e, A, 77, 496.
So far I have two fopen calls, one that reads a file and one that writes to a new file, a loop that reads the input line by line, and an fprintf statement:
pdbcode = '';
chainid = '';
start = '';
stop = '';
fin = fopen('dir.cla.scop.txt_1.75.txt', 'r');
fout = fopen('output_scop.txt', 'w');
% TODO: Add error check!
while true
    line = fgetl(fin); % Get the next line from the file
    if ~ischar(line)
        % End of file
        break;
    end
    % Print result into output_cath.txt file
    fprintf(fout, 'INSERT INTO cath_domains (scop_pdbcode, scop_chainid, scopbegin, scopend) VALUES("%s", %s, %s, %s);\n', pdbcode, chainid, start, stop);
end
Thank you.
You should be able to strsplit on whitespace, get the third ("1m8e") and fourth elements ("A:77-496"), then repeat the process on the fourth element using ":" as the split character, and then again on the second of those two arguments using "-" as the split character. That's one approach. For example, you could do:
% split on whitespace; strsplit returns a cell array, and consecutive
% delimiters are collapsed by default, so there are no empty tokens
tokens = strsplit(line);
pdbcode = tokens{3};
% split fourth token from previous split on colon
tokens = strsplit(tokens{4}, ':');
chainid = tokens{1};
% split second token from previous split on dash
tokens = strsplit(tokens{2}, '-');
start = tokens{1};
stop = tokens{2};
If you really wanted to use regular expressions, you could try the following:
pattern = '\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):([0-9]+)-([0-9]+)';
% with 'once', tok is a cell array holding the four captured strings
tok = regexp(line, pattern, 'tokens', 'once');
pdbcode = tok{1};
chainid = tok{2};
start = tok{3};
stop = tok{4};
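Note that some lines in the sample file have a chain with no residue range ("A:" rather than "A:77-496"), which a pattern requiring start-stop will not match. As a sketch of one way to allow for that, here is the same regular-expression idea in Python (shown for illustration; it is not MATLAB code). The name parse_scop_line is hypothetical, and the range is treated as optional.

```python
import re

# Columns: record type, domain id, PDB code, chain[:start-stop], ...
# The residue range is optional because some lines have only "A:".
LINE_RE = re.compile(r"^\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):(?:(\d+)-(\d+))?")


def parse_scop_line(line):
    """Return (pdbcode, chainid, start, stop); start/stop may be None."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # the line does not look like a domain record
    return match.groups()


if __name__ == "__main__":
    line = "D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931"
    print(parse_scop_line(line))
```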