How to read a certain number of characters (as opposed to bytes) in Crystal? - crystal-lang

In Crystal, if I have a string (or a file), how do I read a certain number of characters at a time? Using functions like IO#read, IO#gets, IO#read_string, and IO#read_utf8, one can specify a certain number of bytes to read, but not a certain number of UTF-8 characters (or ones of another encoding).
In Python, for example, one might do this:
from io import StringIO

s = StringIO("abcdefgh")
while True:
    chunk = s.read(4)
    if not chunk: break
Or, in the case of a file, this:
with open("example.txt", 'r') as f:
    while True:
        chunk = f.read(4)
        if not chunk: break
Generally, I'd expect IO::Memory to be the class to use for the string case, but as far as I can tell, its methods don't allow for this. How would one do this in an efficient and idiomatic fashion (for both strings and files – perhaps the answer is different for each) in Crystal?

There is currently no shortcut implementation for this available in Crystal.
You can read individual chars with IO#read_char or consecutive ones with IO#each_char.
So a basic implementation would be:
io = IO::Memory.new("€abcdefgh")
string = String.build(4) do |builder|
  4.times do
    builder << io.read_char
  end
end
puts string
Whether you use a memory IO, a file, or any other IO is irrelevant; the behaviour is the same.

io = IO::Memory.new("€€€abc€€€") # UTF-8 string from memory
or
io = File.open("test.txt", "r") # UTF-8 string from file

iter = io.each_char.each_slice(4) # read max 4 chars at once
iter.each { |slice|               # each chunk of up to 4 chars
  puts slice
  puts slice.join                 # join to a string
}
output:
['€', '€', '€', 'a']
€€€a
['b', 'c', '€', '€']
bc€€
['€']
€

In addition to the answers already given, for strings in Crystal you can read a given number of characters with a range, like this:
my_string = "A foo, a bar."
my_string[0..4] # => "A foo"

This workaround seems to work for me:
io = IO::Memory.new("abcdefghz")
chars_to_read = 2 # number of chars to read

while true
  chunk = io.gets(chars_to_read) # grab the chunk of type String?
  break if chunk.nil?            # stop when there is nothing left to read (nil)
end

Related

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string into two separate variables. Then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing you can use islice() which is more performant when generating slices of larger lists.
Code would look like this:
from itertools import islice

with open('input.txt') as f:
    data = f.readlines()

first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))

first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, you should keep in mind that the data source you read from must be long enough. If it is not, you may get a StopIteration error when using islice() or an IndexError when using normal list indexing.
Using regex
In the comments below, the OP additionally asked for an approach that avoids lists.
Since reading data from a file yields either a string (which is then turned into lists via string handling) or directly a list of read lines, I suggested using a regex instead.
I cannot tell anything about performance comparison between list/string handling and regex operations. However, this should do the job:
import re

regex = r'(?P<first>.+)(\n)(?P<second>.+)([\n]{2})(?P<rest>.+[\n])'
preg = re.compile(regex)

with open('input.txt') as f:
    data = f.read()

match = re.search(regex, data, re.MULTILINE | re.DOTALL)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2)  # this only splits at the first two newlines
title = lines[0]
description = lines[1]
rest = lines[2]

IO for julia reading fortran files

Noob question:
I have the output of a complex matrix written by Fortran; the contents look like this:
(-0.594209719263636,1.463867815703586E-006)
(-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814)
(0.592814294881930,4.069892201461069E-002)
I want to read and use this data in a Julia program.
No, I don't want to change the writing format; I would like to learn how to strip off the "trash" characters like '(' or ','. This may be useful for arbitrary input files.
I have tried the following code:
file = open(pathtofilename, "r")
data_str = readall(file)
data_numbers_str = split(data_str)
data_numbers = split(data_numbers_str, ['('])
However, the manual is not quite self-explanatory [http://docs.julialang.org/en/release-0.2/stdlib/base/?highlight=split].
Here is what I'd do
data = "(-0.594209719263636,1.463867815703586E-006) (-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814) (0.592814294881930,4.069892201461069E-002)"
function pair_to_complex(pair)
    nums = float(split(pair[2:end-1], ","))
    return Complex(nums...)
end
numbers = map(pair_to_complex, split(data, " "))
To explain
The pair[2:end-1] removes the parenthesis
I then split that on the , to get an array with two numbers, still as strings
I convert them to Float64 with float(), obtaining an array of floats
I make a new complex number. The ... splats the array out so it provides the two arguments to Complex - I could have done Complex(nums[1],nums[2])
I then apply this logic using map to every term in the data.
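For comparison, the same parsing idea can be sketched in Python (assuming, as in the sample above, that the pairs are whitespace-separated; this is only an illustrative sketch, not part of the original answer):
def pair_to_complex(pair):
    # strip the surrounding parentheses, then split on the comma
    real_part, imag_part = pair[1:-1].split(",")
    return complex(float(real_part), float(imag_part))

data = "(-0.594209719263636,1.463867815703586E-006) (0.592814294881930,4.069892201461069E-002)"
numbers = [pair_to_complex(p) for p in data.split()]
print(numbers)  # a list of Python complex numbers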

Unpacking data in python in struct library

When I pack data to a fixed length and then unpack it, I am unable to retrieve the data without specifying the actual length of the data.
How do I retrieve only the data, without the \x00 characters, and without calculating the length beforehand?
>>> import struct
>>> with open("forums_file.dat", "w") as file:
...     file.truncate(1024)
>>> country = 'india'
>>> data = struct.pack('20s', country)
>>> print data
india
>>> data
'india\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> print len(data)
20
>>> unpack_data = struct.unpack('5s', country)
>>> unpack_data
('india',)
In the above code snippet I had to specify the length of the data (5s) while unpacking.
Short answer: You can't do it directly.
Longer answer:
The more indirect solution is actually not that bad. When unpacking the string, you use the same length as you used for packing. That returns the string including the NUL chars (0 bytes).
Then you split on the NUL char and take the first item, like so:
result_with_NUL, = struct.unpack('20s', data)
print(repr(result_with_NUL))
result_string = result_with_NUL.split('\x00', 1)[0]
print(repr(result_string))
The , 1 parameter in split() is not strictly necessary, but makes it more efficient, as it splits only on the first occurrence of NUL instead of every single one.
Also note that when packing and unpacking with the goal of reading/writing files or exchanging data between different systems, it's important to explicitly precede your format strings with "<" or ">" (or "=" in certain very special cases), both for packing and unpacking. Otherwise the structures will be aligned and padded in a system-dependent way, which might cause hard-to-find bugs later.
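Putting both points together, a minimal sketch in Python 3 syntax (the values are illustrative only) could look like this:
import struct

packed = struct.pack('<20s', b'india')      # explicit little-endian prefix, no implicit alignment
unpacked, = struct.unpack('<20s', packed)   # b'india' followed by 15 NUL bytes
value = unpacked.split(b'\x00', 1)[0]       # strip everything from the first NUL onwards
print(value)                                # b'india'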

URL-Encoding a Byte String?

I am writing a Bittorrent client. One of the steps involved requires that the program sends an HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
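As a rough Python sketch of the procedure described in this answer (decoding the percent-encoded form back into the hex digest):
def percent_encoded_to_hex(s):
    # walk the string left to right, emitting two lowercase hex digits per byte
    out = []
    i = 0
    while i < len(s):
        if s[i] == '%':
            out.append(s[i + 1:i + 3].lower())    # take the two digits following '%'
            i += 3
        else:
            out.append(format(ord(s[i]), '02x'))  # safe character: emit its ASCII code in hex
            i += 1
    return ''.join(out)

print(percent_encoded_to_hex('%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'))
# d90c3ce39418f0c5d98358e03349262b608cbf52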
Even though I know the original question was about C++, it might sometimes be useful to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse
def hex_str_to_esc_str(hex_str: str, *, encoding: str = 'Windows-1252') -> str:
    # decode the hex string as a Windows-1252 string
    win1252_str = binascii.unhexlify(hex_str).decode(encoding)
    # escape the string and return
    return urllib.parse.quote(win1252_str, encoding=encoding)

def esc_str_to_hex_str(esc_str: str, *, encoding: str = 'Windows-1252') -> str:
    # unescape the escaped string as a Windows-1252 string
    win1252_str = urllib.parse.unquote(esc_str, encoding=encoding)
    # encode the string, hexlify, and return
    return win1252_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

Similar function peek( ); (from C++) in Ruby

Is there a function in Ruby similar to peek() from C++? Is there any alternative way to do this?
I've found a way to do this.
Use the StringScanner:
require 'strscan'
scanner = StringScanner.new(YourStringHere)
puts scanner.peek(1)
You can use the StringScanner to scan files as well:
file = File.open('hello.txt', 'rb')
scanner = StringScanner.new(file.read)
Maybe you can use ungetc. It is not equivalent, but you can obtain the same result.
Enumerator#peek lets you peek at the next value of an Enumerator. IO#bytes and IO#chars will give you an Enumerator over a byte stream or a character stream, respectively. Since you opened with "rb", I'll assume you want bytes.
file = File.open('hello.txt', 'rb') # assume contains text "hello\n"
fstream = file.bytes
fstream.next # => "h"
fstream.peek # => "e"
fstream.next # => "e"
...
Of course, now you're kinda stuck with byte-at-a-time processing on the stream.
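For comparison, the same one-byte lookahead can be sketched in Python with a buffered binary file (the file name and contents are the same assumption as above):
with open('hello.txt', 'rb') as f:   # assume the file contains "hello\n"
    print(f.read(1))                 # b'h'
    print(f.peek(1)[:1])             # b'e' (peek does not advance the file position)
    print(f.read(1))                 # b'e'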