How do I un-gzip a file without saving it?

I am new to Rust and I am trying to port Go code that I had written previously. The Go code downloaded files from S3 and un-gzipped and parsed them directly, without writing to disk.
Currently, the only solution I have found is to save the gzipped files on disk, then un-gzip and parse them.
The ideal pipeline would un-gzip and parse them directly.
How can I accomplish this?
const ENV_CRED_KEY_ID: &str = "KEY_ID";
const ENV_CRED_KEY_SECRET: &str = "KEY_SECRET";
const BUCKET_NAME: &str = "bucketname";
const REGION: &str = "us-east-1";
use anyhow::{anyhow, bail, Context, Result}; // (xp) (thiserror in prod)
use aws_sdk_s3::{config, ByteStream, Client, Credentials, Region};
use std::env;
use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Write};
use std::path::Path;
use tokio_stream::StreamExt;
#[tokio::main]
async fn main() -> Result<()> {
    let client = get_aws_client(REGION)?;

    let keys = list_keys(&client, BUCKET_NAME, "CELLDATA/year=2022/month=06/day=06/").await?;
    println!("List:\n{}", keys.join("\n"));

    let dir = Path::new("input/");
    let key: &str = &keys[0];
    download_file_bytes(&client, BUCKET_NAME, key, dir).await?;
    println!("Downloaded {key} in directory {}", dir.display());

    Ok(())
}
async fn download_file_bytes(client: &Client, bucket_name: &str, key: &str, dir: &Path) -> Result<()> {
    // VALIDATE
    if !dir.is_dir() {
        bail!("Path {} is not a directory", dir.display());
    }

    // create file path and parent dir(s)
    let mut file_path = dir.join(key);
    let parent_dir = file_path
        .parent()
        .ok_or_else(|| anyhow!("Invalid parent dir for {:?}", file_path))?;
    if !parent_dir.exists() {
        create_dir_all(parent_dir)?;
    }
    file_path.set_extension("json");

    // BUILD - aws request
    let req = client.get_object().bucket(bucket_name).key(key);

    // EXECUTE
    let res = req.send().await?;

    // STREAM result to file
    let mut data: ByteStream = res.body;
    let file = File::create(&file_path)?;
    // Non-working attempt at decompressing on the fly:
    // let mut gz = GzDecoder::new(&bytes);
    let mut buf_writer = BufWriter::new(file);
    while let Some(bytes) = data.try_next().await? {
        buf_writer.write_all(&bytes)?;
    }
    buf_writer.flush()?;

    Ok(())
}
fn get_aws_client(region: &str) -> Result<Client> {
    // get the id/secret from env
    let key_id = env::var(ENV_CRED_KEY_ID).context("Missing S3_KEY_ID")?;
    let key_secret = env::var(ENV_CRED_KEY_SECRET).context("Missing S3_KEY_SECRET")?;

    // build the aws cred
    let cred = Credentials::new(key_id, key_secret, None, None, "loaded-from-custom-env");

    // build the aws client
    let region = Region::new(region.to_string());
    let conf_builder = config::Builder::new().region(region).credentials_provider(cred);
    let conf = conf_builder.build();

    // build aws client
    let client = Client::from_conf(conf);
    Ok(client)
}

Your snippet doesn't tell where GzDecoder comes from, but I'll assume it's flate2::read::GzDecoder.
flate2::read::GzDecoder is already built in a way that it can wrap anything that implements std::io::Read:
GzDecoder::new expects an argument that implements Read => deflated data in
GzDecoder itself implements Read => inflated data out
Therefore, you can use it just like a BufReader: wrap your reader and use the wrapped value in its place:
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::BufReader;
use std::io::Cursor;
fn main() {
    let data = [0, 1, 2, 3];
    // Something that implements `std::io::Read`
    let c = Cursor::new(data);

    // A dummy output
    let mut out_file = File::create("/tmp/out").unwrap();

    // Using the raw data would look like this:
    // std::io::copy(&mut c, &mut out_file).unwrap();

    // To inflate on the fly, "pipe" the data through the decoder, i.e. wrap the reader
    let mut stream = GzDecoder::new(c);

    // Consume the `Read`er somehow
    std::io::copy(&mut stream, &mut out_file).unwrap();
}
playground
You don't mention what "and parse them" entails, but the same concept applies: If your parser can read from an impl Read (e.g. it can read from a std::fs::File), then it can also read directly from a GzDecoder.
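Applied to the S3 code above, one way to bridge the async ByteStream to the synchronous GzDecoder is to aggregate the body into memory first. Here is a minimal sketch, assuming flate2 for decompression and the same aws_sdk_s3 client as in the question; parse_json is a hypothetical stand-in for whatever parsing you do:

use std::io::Read;

use anyhow::Result;
use aws_sdk_s3::Client;
use flate2::read::GzDecoder;

/// Hypothetical stand-in for your parser; it only needs an `impl Read`.
fn parse_json(reader: impl Read) -> Result<()> {
    // e.g. serde_json::from_reader(reader)?
    let _ = reader;
    Ok(())
}

async fn download_and_parse(client: &Client, bucket: &str, key: &str) -> Result<()> {
    let res = client.get_object().bucket(bucket).key(key).send().await?;

    // Aggregate the whole body in memory; `&[u8]` implements `Read`,
    // so it can be wrapped in a `GzDecoder` directly - no file involved.
    let bytes = res.body.collect().await?.into_bytes();
    let decoder = GzDecoder::new(&bytes[..]);
    parse_json(decoder)
}

If the files are too large to buffer whole, crates like async-compression combined with tokio_util::io::StreamReader can decompress the ByteStream incrementally, but the in-memory version above is the smallest change from the code in the question.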

Related

How do I get difficulty over time from Kulupu (polkadotjs)?

// Import
import { ApiPromise, WsProvider } from "@polkadot/api";

// Construct
/*
https://rpc.kulupu.network
https://rpc.kulupu.network/ws
https://rpc.kulupu.corepaper.org
https://rpc.kulupu.corepaper.org/ws
*/
(async () => {
  //const wsProvider = new WsProvider('wss://rpc.polkadot.io');
  const wsProvider = new WsProvider("wss://rpc.kulupu.network/ws");
  const api = await ApiPromise.create({ provider: wsProvider });

  // Do something
  const chain = await api.rpc.system.chain();
  console.log(`You are connected to ${chain} !`);
  console.log(await api.query.difficulty.pastDifficultiesAndTimestamps.toJSON());
  console.log(api.genesisHash.toHex());
})();
The storage item pastDifficultiesAndTimestamps only holds the last 60 blocks' worth of data. To get that information, you just need to fix the following:
console.log(await api.query.difficulty.pastDifficultiesAndTimestamps());
If you want to query the difficulty of blocks in general, a loop like this will work:
let best_block = await api.derive.chain.bestNumber();
// Could be 0, but that is a lot of queries...
let first_block = best_block - 100;
for (let block = first_block; block < best_block; block++) {
  let block_hash = await api.rpc.chain.getBlockHash(block);
  let difficulty = await api.query.difficulty.currentDifficulty.at(block_hash);
  console.log(block, difficulty);
}
Note that this requires an archive node which has information about all the blocks. Otherwise, by default, a node only stores ~256 previous blocks before state pruning cleans things up.
If you want to see how to make a query like this, but much more efficiently, look at my blog post here:
https://www.shawntabrizi.com/substrate/porting-web3-js-to-polkadot-js/

What is the idiomatic way to write Rust microservice with shared db connections and caches?

I'm writing my first Rust microservice with hyper. After years of development in C++ and Go, I tend to use a controller for processing requests (like here: https://github.com/raycad/go-microservices/blob/master/src/user-microservice/controllers/user.go) where the controller stores shared data like a db connection pool and different kinds of caches.
I know, with hyper, I can write it this way:
use hyper::{Body, Request, Response};

pub struct Controller {
    // pub cache: Cache,
    // pub db: DbConnectionPool
}

impl Controller {
    pub fn echo(&mut self, req: Request<Body>) -> Result<Response<Body>, hyper::Error> {
        // extensively using db and cache here...
        let mut response = Response::new(Body::empty());
        *response.body_mut() = req.into_body();
        Ok(response)
    }
}
and then use it:
use hyper::{Server, Request, Response, Body, Error};
use hyper::service::{make_service_fn, service_fn};
use std::{convert::Infallible, net::SocketAddr, sync::Arc, sync::Mutex};

async fn route(controller: Arc<Mutex<Controller>>, req: Request<Body>) -> Result<Response<Body>, hyper::Error> {
    let mut c = controller.lock().unwrap();
    c.echo(req)
}

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    let controller = Arc::new(Mutex::new(Controller {}));

    let make_svc = make_service_fn(move |_conn| {
        let controller = Arc::clone(&controller);
        async move {
            Ok::<_, Infallible>(service_fn(move |req| {
                let c = Arc::clone(&controller);
                route(c, req)
            }))
        }
    });

    let server = Server::bind(&addr).serve(make_svc);
    if let Err(e) = server.await {
        eprintln!("server error: {}", e);
    }
}
Since the compiler doesn't let me share a mutable structure between threads, I have to use the Arc<Mutex<T>> idiom. But I'm afraid that the let mut c = controller.lock().unwrap(); part would block the entire controller while processing a single request, i.e. there's no concurrency here.
What is the idiomatic way to address this problem?
&mut always acquires a (compile-time or runtime) exclusive lock on the value. Only acquire a &mut at the exact scope you want locked. If a value owned by the locked value needs separate locking management, wrap it in a Mutex.
Assuming your DbConnectionPool is structured like this:
struct DbConnectionPool {
    conns: HashMap<ConnId, Conn>,
}
We need to &mut the HashMap when we add/remove items on it, but we don't need to &mut the value in Conn. So Arc allows us to separate the mutability boundary from its parent, and Mutex allows us to add its own interior mutability. Moreover, our echo method doesn't want to be &mut, so another layer of interior mutability needs to be added on the HashMap. So we change this to:
struct DbConnectionPool {
    conns: Mutex<HashMap<ConnId, Arc<Mutex<Conn>>>>,
}
Then, when you want to get a connection:
fn get(&self, id: ConnId) -> Arc<Mutex<Conn>> {
    let mut pool = self.db.conns.lock().unwrap(); // ignore error if another thread panicked
    if let Some(conn) = pool.get(&id) {
        Arc::clone(conn)
    } else {
        // here we will utilize the interior mutability of `pool`
        let arc = Arc::new(Mutex::new(new_conn()));
        pool.insert(id, Arc::clone(&arc));
        arc
    }
}
(The ConnId param and the if-exists-else logic are used to simplify the code; you can change the logic.)
On the returned value you can do
self.get(id).lock().unwrap().query(...)
For convenient illustration, I changed the logic so that the user supplies the ID. In reality, you should be able to find a Conn that has not been acquired and return it. Then you can return a RAII guard for Conn, similar to how MutexGuard works, to auto-free the connection when the user stops using it.
Also consider using RwLock instead of Mutex if that might result in a performance boost.
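Putting the pieces together, here is a minimal compiling sketch of the pattern described above. Conn, ConnId, and new_conn are hypothetical stand-ins for a real connection type; the point is that the HashMap lock is held only long enough to clone the Arc, so two requests using different connections never block each other:

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type ConnId = u32;

// Stand-in for a real database connection.
struct Conn;
impl Conn {
    fn query(&mut self, _sql: &str) {}
}
fn new_conn() -> Conn {
    Conn
}

struct DbConnectionPool {
    conns: Mutex<HashMap<ConnId, Arc<Mutex<Conn>>>>,
}

impl DbConnectionPool {
    fn get(&self, id: ConnId) -> Arc<Mutex<Conn>> {
        // The outer lock guards only the map itself...
        let mut pool = self.conns.lock().unwrap();
        if let Some(conn) = pool.get(&id) {
            Arc::clone(conn)
        } else {
            let arc = Arc::new(Mutex::new(new_conn()));
            pool.insert(id, Arc::clone(&arc));
            arc
        }
    }
}

fn main() {
    let pool = DbConnectionPool { conns: Mutex::new(HashMap::new()) };
    // ...while the inner lock guards one connection at a time.
    pool.get(1).lock().unwrap().query("SELECT 1");
}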

akka stream custom graph stage

I have an akka stream from a web-socket like akka stream consume web socket and would like to build a reusable graph stage (inlet: the stream; FlowShape: add an additional field to the JSON specifying the origin, i.e.
{
  ...,
  "origin": "blockchain.info"
}
and an outlet to Kafka).
I face the following 3 problems:
unable to wrap my head around creating a custom Inlet from the web socket flow
unable to integrate Kafka directly into the stream (see the code below)
not sure if the transformer to add the additional field would be required to deserialize the JSON to add the origin
The sample project (flow only) looks like this:
import system.dispatcher
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()

val incoming: Sink[Message, Future[Done]] =
  Flow[Message].mapAsync(4) {
    case message: TextMessage.Strict =>
      println(message.text)
      Future.successful(Done)
    case message: TextMessage.Streamed =>
      message.textStream.runForeach(println)
    case message: BinaryMessage =>
      message.dataStream.runWith(Sink.ignore)
  }.toMat(Sink.last)(Keep.right)

val producerSettings = ProducerSettings(system, new ByteArraySerializer, new StringSerializer)
  .withBootstrapServers("localhost:9092")

val outgoing = Source.single(TextMessage("{\"op\":\"unconfirmed_sub\"}")).concatMat(Source.maybe)(Keep.right)

val webSocketFlow = Http().webSocketClientFlow(WebSocketRequest("wss://ws.blockchain.info/inv"))

val ((completionPromise, upgradeResponse), closed) =
  outgoing
    .viaMat(webSocketFlow)(Keep.both)
    .toMat(incoming)(Keep.both)
    // TODO not working integrating kafka here
    // .map(_.toString)
    // .map { elem =>
    //   println(s"PlainSinkProducer produce: ${elem}")
    //   new ProducerRecord[Array[Byte], String]("topic1", elem)
    // }
    // .runWith(Producer.plainSink(producerSettings))
    .run()

val connected = upgradeResponse.flatMap { upgrade =>
  if (upgrade.response.status == StatusCodes.SwitchingProtocols) {
    Future.successful(Done)
  } else {
    throw new RuntimeException(s"Connection failed: ${upgrade.response.status}")
    system.terminate
  }
}

// kafka that works / writes dummy data
val done1 = Source(1 to 100)
  .map(_.toString)
  .map { elem =>
    println(s"PlainSinkProducer produce: ${elem}")
    new ProducerRecord[Array[Byte], String]("topic1", elem)
  }
  .runWith(Producer.plainSink(producerSettings))
One issue is around the incoming stage, which is modelled as a Sink, where it should be modelled as a Flow to subsequently feed messages into Kafka.
Because incoming text messages can be Streamed, you can use the flatMapMerge combinator as follows to avoid the need to store entire (potentially big) messages in memory:
val incoming: Flow[Message, String, NotUsed] = Flow[Message].mapAsync(4) {
  case msg: BinaryMessage =>
    msg.dataStream.runWith(Sink.ignore)
    Future.successful(None)
  case TextMessage.Strict(msg) =>
    // short messages that arrive in a single frame
    Future.successful(Some(msg))
  case TextMessage.Streamed(src) =>
    src.runFold("")(_ + _).map { msg => Some(msg) }
}.collect {
  case Some(msg) => msg
}
At this point you have something that produces strings, and it can be connected to Kafka:
val addOrigin: Flow[String, String, NotUsed] = ???
val ((completionPromise, upgradeResponse), closed) =
outgoing
.viaMat(webSocketFlow)(Keep.both)
.via(incoming)
.via(addOrigin)
.map { elem =>
println(s"PlainSinkProducer produce: ${elem}")
new ProducerRecord[Array[Byte], String]("topic1", elem)
}
.toMat(Producer.plainSink(producerSettings))(Keep.both)
.run()

How to build multiple concurrent servers with Rust and Tokio?

I'm looking to build multiple concurrent servers on different ports with Rust and Tokio:
let mut core = Core::new().unwrap();
let handle = core.handle();

// I want to bind to multiple ports here if it's possible with simple addresses
let addr = "127.0.0.1:80".parse().unwrap();
let addr2 = "127.0.0.1:443".parse().unwrap();

// Or here if there is a special function on the TcpListener
let sock = TcpListener::bind(&addr, &handle).unwrap();

// Or here if there is a special function on the sock
let server = sock.incoming().for_each(|(client_stream, remote_addr)| {
    // And then retrieve the current port in the callback
    println!("Receive connection on {}!", mysterious_function_to_retrieve_the_port);
    Ok(())
});

core.run(server).unwrap();
Is there an option with Tokio to listen to multiple ports or do I need to create a simple thread for each port and run Core::new() in each?
Thanks to rust-scoped-pool, I have:
let pool = Pool::new(2);
let mut listening_on = ["127.0.0.1:80", "127.0.0.1:443"];

pool.scoped(|scope| {
    for address in &mut listening_on {
        scope.execute(move || {
            let mut core = Core::new().unwrap();
            let handle = core.handle();
            let addr = address.parse().unwrap();
            let sock = TcpListener::bind(&addr, &handle).unwrap();
            let server = sock.incoming().for_each(|(client_stream, remote_addr)| {
                println!("Receive connection on {}!", address);
                Ok(())
            });
            core.run(server).unwrap();
        });
    }
});
rust-scoped-pool is the only solution I have found to execute multiple threads and wait forever after spawning them. I think it works, but I was wondering if a simpler solution existed.
You can run multiple servers from one thread. core.run(server).unwrap(); is just a convenience method and not the only/main way to do things.
Instead of running the single ForEach to completion, spawn each individually and then just keep the thread alive:
let mut core = Core::new().unwrap();
let handle = core.handle();

let addr = "127.0.0.1:80".parse().unwrap();
let addr2 = "127.0.0.1:443".parse().unwrap();

let sock = TcpListener::bind(&addr, &handle).unwrap();
let sock2 = TcpListener::bind(&addr2, &handle).unwrap();

let server = sock.incoming().for_each(move |(client_stream, remote_addr)| {
    println!("Receive connection on {}!", addr);
    Ok(())
});
let server2 = sock2.incoming().for_each(move |(client_stream, remote_addr)| {
    println!("Receive connection on {}!", addr2);
    Ok(())
});

// `spawn` wants futures with `Error = ()`, hence the `map_err`
handle.spawn(server.map_err(|_| ()));
handle.spawn(server2.map_err(|_| ()));

loop {
    core.turn(None);
}
I'd just like to follow up that there seems to be a slightly less manual way to do things than 46bit's answer (at least as of 2019).
let addr1 = "127.0.0.1:80".parse().unwrap();
let addr2 = "127.0.0.1:443".parse().unwrap();

let sock1 = TcpListener::bind(&addr1).unwrap();
let sock2 = TcpListener::bind(&addr2).unwrap();

let server1 = sock1.incoming().for_each(|_| Ok(())).map_err(|_| ());
let server2 = sock2.incoming().for_each(|_| Ok(())).map_err(|_| ());

let mut runtime = tokio::runtime::Runtime::new().unwrap();
runtime.spawn(server1);
runtime.spawn(server2);
runtime.shutdown_on_idle().wait().unwrap();
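For reference, the same multi-listener setup is less manual still on current Tokio with async/await. A minimal sketch, assuming the tokio crate (1.x) with its full feature set; the accept_loop helper and the port numbers are illustrative only:

use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener1 = TcpListener::bind("127.0.0.1:8080").await?;
    let listener2 = TcpListener::bind("127.0.0.1:8443").await?;

    // Spawn one accept loop per listener; both run concurrently
    // on the same runtime.
    let h1 = tokio::spawn(accept_loop(listener1));
    let h2 = tokio::spawn(accept_loop(listener2));
    let _ = tokio::try_join!(h1, h2);
    Ok(())
}

async fn accept_loop(listener: TcpListener) -> std::io::Result<()> {
    loop {
        let (_stream, peer) = listener.accept().await?;
        // The local address tells us which server accepted the connection.
        println!("Receive connection on {} from {}!", listener.local_addr()?, peer);
    }
}

Each listener gets its own spawned task, which is the async/await equivalent of spawning the for_each futures above.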

Communication client-server with OCaml marshalled data

I want to do a client-side js_of_ocaml application with a server in OCaml, with the constraints described below, and I would like to know if the approach below is right or if there is a more efficient one. The server can sometimes send large quantities of data (> 30MB).
In order to make the communication between client and server safer and more efficient, I am sharing a type t in a .mli file like this:
type client_to_server =
  | Say_Hello
  | Do_something_with of int

type server_to_client =
  | Ack
  | Print of string * int
Then, this type is marshalled into a string and sent over the network. I am aware that on the client side, some types are missing (Int64.t).
Also, in an XMLHTTPRequest sent by the client, we want to receive more than one marshalled object from the server, sometimes in streaming mode (i.e. process each marshalled object received, if possible, during the loading state of the request, and not only during the done state).
These constraints force us to use the responseText field of the XMLHTTPRequest with the content-type application/octet-stream.
Moreover, when we get the response back from responseText, an encoding conversion is made because JavaScript strings are UTF-16. But since the marshalled object is binary data, we do what is necessary to retrieve our binary data (by overriding the charset with x-user-defined and by applying a mask to each character of the responseText string).
The server (HTTP server in OCaml) is doing something simple like this:
let process_request req =
  let res = process_response req in
  let s = Marshal.to_string res [] in
  send s
However, on the client side, the actual JavaScript primitive of js_of_ocaml for caml_marshal_data_size needs an MlString. But in streaming mode, we don't want to convert the JavaScript string to an MlString (which would iterate over the full string); we prefer to do the size verification and unmarshalling (and the application of the mask for the encoding problem) only on the bytes read. Therefore, I have written my own marshal primitives in JavaScript.
The client code for processing requests and responses is:
external marshal_total_size : Js.js_string Js.t -> int -> int = "my_marshal_total_size"
external marshal_from_string : Js.js_string Js.t -> int -> 'a = "my_marshal_from_string"

let apply (f : server_to_client -> unit) (str : Js.js_string Js.t) (ofs : int) : int =
  let len = str##length in
  let rec aux pos =
    let tsize =
      try Some (pos + My_primitives.marshal_total_size str pos)
      with Failure _ -> None
    in
    match tsize with
    | Some tsize when tsize <= len ->
      let data = My_primitives.marshal_from_string str pos in
      f data;
      aux tsize
    | _ -> pos
  in
  aux ofs
let reqcallback f req ofs =
  match req##readyState, req##status with
  | XmlHttpRequest.DONE, 200 ->
    ofs := apply f req##responseText !ofs
  | XmlHttpRequest.LOADING, 200 ->
    ignore (apply f req##responseText !ofs)
  | _, 200 -> ()
  | _, i -> process_error i

let send (f : server_to_client -> unit) (order : client_to_server) =
  let order = Marshal.to_string order [] in
  let msg = Js.string (my_encode order) in (* Do some stuff *)
  let req = XmlHttpRequest.create () in
  req##_open(Js.string "POST", Js.string "/kernel", Js._true);
  req##setRequestHeader(Js.string "Content-Type",
                        Js.string "application/octet-stream");
  req##onreadystatechange <- Js.wrap_callback (reqcallback f req (ref 0));
  req##overrideMimeType(Js.string "application/octet-stream; charset=x-user-defined");
  req##send(Js.some msg)
And the primitives are:
//Provides: my_marshal_header_size
var my_marshal_header_size = 20;

//Provides: my_int_of_char
function my_int_of_char(s, i) {
  return (s.charCodeAt(i) & 0xFF); // utf-16 char to 8 binary bits
}

//Provides: my_marshal_input_value_from_string
//Requires: my_int_of_char, caml_int64_float_of_bits, MlStringFromArray
//Requires: caml_int64_of_bytes, caml_marshal_constants, caml_failwith
var my_marshal_input_value_from_string = function () {
  /* Quite the same thing but with a custom Reader which
     will call my_int_of_char for each byte read */
}

//Provides: my_marshal_data_size
//Requires: caml_failwith, my_int_of_char
function my_marshal_data_size(s, ofs) {
  function get32(s, i) {
    return (my_int_of_char(s, i) << 24) | (my_int_of_char(s, i + 1) << 16) |
           (my_int_of_char(s, i + 2) << 8) | (my_int_of_char(s, i + 3));
  }
  if (get32(s, ofs) != (0x8495A6BE | 0))
    caml_failwith("MyMarshal.data_size");
  return (get32(s, ofs + 4));
}

//Provides: my_marshal_total_size
//Requires: my_marshal_data_size, my_marshal_header_size, caml_failwith
function my_marshal_total_size(s, ofs) {
  if (ofs < 0 || ofs > s.length - my_marshal_header_size)
    caml_failwith("Invalid argument");
  else
    return my_marshal_header_size + my_marshal_data_size(s, ofs);
}
Is this the most efficient way to transfer large OCaml values from server to client, or what would time- and space-efficient alternatives be?
Have you tried using EventSource? https://developer.mozilla.org/en-US/docs/Web/API/EventSource
You could stream JSON data instead of marshalled data. Json.unsafe_input should be faster than unmarshalling.
class type eventSource =
  object
    method onmessage :
      (eventSource Js.t, event Js.t -> unit) Js.meth_callback
        Js.writeonly_prop
  end
and event =
  object
    method data : Js.js_string Js.t Js.readonly_prop
    method event : Js.js_string Js.t Js.readonly_prop
  end

let eventSource : (Js.js_string Js.t -> eventSource Js.t) Js.constr =
  Js.Unsafe.global##_EventSource

let send (f : server_to_client -> unit) (order : client_to_server) url_of_order =
  let url = url_of_order order in
  let es = jsnew eventSource(Js.string url) in
  es##onmessage <- Js.wrap_callback (fun e ->
    let d = Json.unsafe_input (e##data) in
    f d);
  ()
On the server side, you then need to rely on Deriving_Json (http://ocsigen.org/js_of_ocaml/2.3/api/Deriving_Json) to serialize your data:
type server_to_client =
  | Ack
  | Print of string * int
  deriving (Json)

let process_request req =
  let res = process_response req in
  let data = Json_server_to_client.to_string res in
  send data
Note 1: Deriving_json serializes OCaml values to JSON using the internal representation of values in js_of_ocaml. Json.unsafe_input is a fast deserializer for Deriving_json that relies on browser-native JSON support.
Note 2: Deriving_json and Json.unsafe_input take care of OCaml string encoding.