Related
My team and I are developing an application that makes use of Flink.
The data will be processed using a computationally-heavy numerical algorithm.
In order to optimize it as much as possible, I would like to write this algorithm in C/C++ rather than in Java.
The question is: is it possible to use C/C++ code within Flink? Perhaps by wrapping it into a Java library?
I've never tested this case in particular. In general, you can always use native code from Java using JNI, the Java Native Interface.
The idea would be to have a Java facade expose your native code and use these methods in your computation graph defined with Flink in Java (or other JVM languages, like Scala). You would have to make both Java and native libraries available on all involved nodes to make this work. If you have an Hadoop cluster you can leverage YARN to ship files along with your job (docs here, see --yarn-ship CLI option).
I would suggest you test this incrementally, with a very small native function exposed. Also, don't underestimate Java's capabilities in terms of performance: with some well thought programming and leveraging JIT and other runtime optimizations, long running processes can enjoy even better performances than similar native code with unmanaged memory.
Keep in mind that this of course resorting to native code will mean restricting the portability of your code to the platforms for which you'll compile your libraries.
Is there any "tiny" VM (for any programming language) where the main data structures visible to the user (lists, arrays, maps, sets, etc.) are immutable as in Clojure or Haskell?
By "tiny" I mean a VM where implementation simplicity, brevity and portability are key points: think Lua or TinyScheme.
I'm not sure how it aligns with your "key points", but you might take a look at Pixie. Pixie implements a VM in RPython. One of its claims is a small footprint of just over 10MB for the compiled VM + standard libs. The language is a lisp based (loosely) on Clojure. It appears to maintain Clojure's policy of immutable by default, and definitely has implementations of Clojure's persistent datatypes.
Owl Lisp
A "purely functional Scheme". The VM is 1600 lines of C.
Owl Lisp is a purely functional dialect of Scheme. It is based on the
applicable subset of R7RS standard, extending it mainly with threads
and data structures necessary for purely functional operation. Owl can
be used on most UNIX-like systems, such as Linux, BSDs and OS X.
Programs are typically compiled via C to standalone binaries, so Owl
isn't needed to run programs written in it.
Owl project originally got started both as an attempt to extend R5RS
Scheme with some necessary features, such as threads and modules, and
as an experiment on how being purely functional influences the runtime
and use of an applicative order purely functional language. While
things have been added to Scheme, Owl tries to keep the core language
as simple as possible.
Implementationwise the goal was to get a small portable system which
could be used to ship programs easily. This is currently accomplished
by using a small register-based virtual machine, which can be extended
with program-specific instructions to reduce interpretive overhead.
https://haltp.org/n/owl
https://github.com/aoh/owl-lisp
ClojureC
Compiler for the Clojure programming language that targets C
as a backend. It is based on ClojureScript ... Before you can run
anything make sure you have GLib 2 and the Boehm-Demers-Weiser garbage
collector installed.
https://github.com/schani/clojurec
TinyClojure
TinyClojure is a project to build a small, easily embeddable version
of Clojure/ClojureScript in portable C++. In many ways it is my
attempt to create a Clojure equivalent of TinyScheme.
...
ClojureC is good, but the build process is complex, and
there are external library dependencies ... The focus with TinyClojure's development is to make it the easiest way
to embed Clojure within any application. Tiny Clojure consists of one
header file, one source file, no external dependencies, and the
extension and embedding interface is as simple as it is possible to
be.
https://github.com/WillDetlor/TinyClojure/
Hope this question isn't going to be too vague. Reading through the COM spec and Don Box's Essential COM book, there is plenty of talk of the "problems that COM solves" - and they all sound important, relevant and current.
So how are the problems that COM addresses dealt with on other systems (linux, unix, OSX, android)? I'm thinking of things like:
binary compatibility across compilers and compiler versions
binary component reuse
compiling an application such that it has run-time dependencies rather than load-time ones (so that it runs even when a dependency is missing)
access to library functionality from languages other than the library's own
reasonably low-overhead remote procedure calls to components loaded in the address space of a different process
etc (I'm sure the list goes on)
I'm basically just trying to understand why for instance on Linux CORBA isn't a thing like COM is a thing on Windows (if that makes any sense). Does maybe software development on Linux subscribe to a different philosophy than the component-based model proposed by COM?
And finally, is COM a C/C++ thing? Several times I've come across comments from people saying COM is made "obsolete" by .NET but without really explaining what they meant by that.
For the remainder of this post, I'm going to use Linux as an example of open-source software. Where I mention "Linux" it's mostly a short/simple way to refer to open source software in general though, not anything specific to Linux.
COM vs. .NET
COM isn't actually restricted to C and C++, and .NET doesn't actually replace COM. However, .NET does provide alternatives to COM for some situations. One common use of COM is to provide controls (ActiveX controls). .NET provides/supports its own protocol for controls that allows somebody to write a control in one .NET language, and use that control from any other .NET language--more or less the same sort of thing that COM provides outside the .NET world.
Likewise, .NET provides Windows Communication Foundation (WCF). WCF implements SOAP (Simple Object Access Protocol)--which may have started out simple, but grew into something a lot less simple at best. In any case, WCF provides many of the same kinds of capabilities as COM does. Although WCF itself is specific to .NET, it implements SOAP, and a SOAP server built using WCF can talk to one implemented without WCF (and vice versa). Since you mention overhead, it's probably worth mentioning that WCF/SOAP tend to add more overhead that COM (I've seen anywhere from nearly equal to about double the overhead, depending on the situation).
Differences in Requirements
For Linux, the first two points tend to have relatively low relevance. Most software is open source, and many users are accustomed to building from source in any case. For such users, binary compatibility/reuse is of little or no consequence (in fact, quite a few users are likely to reject all software that isn't distributed in source code form). Although binaries are commonly distributed (e.g., with apt-get, yum, etc.) they're basically just caching a binary built for a specific system. That is, on Windows you might have a single binary for use on anything from Windows XP up through Windows 10, but if you use apt-get on, say, Ubuntu 18.02, you're installing a binary built specifically for Ubuntu 18.02, not one that tries to be compatible with everything back to Ubuntu 10 (or whatever).
Being able to load and run (with reduced capabilities) when a component is missing is also most often a closed-source problem. Closed source software typically has several versions with varying capabilities to support different prices. It's convenient for the vendor to be able to build one version of the main application, and give varying levels of functionality depending on which other components are supplied/omitted.
That's primarily to support different price levels though. When the software is free, there's only one price and one version: the awesome edition.
Access to library functionality between languages again tends to be based more on source code instead of a binary interface, such as using SWIG to allow use of C or C++ source code from languages like Python and Ruby. Again, COM is basically curing a problem that arises primarily from lack of source code; when using open source software, the problem simply doesn't arise to start with.
Low-overhead RPC to code in other processes again seems to stem primarily from closed source software. When/if you want Microsoft Excel to be able to use some internal "stuff" in, say, Adobe Photoshop, you use COM to let them communicate. That adds run-time overhead and extra complexity, but when one of the pieces of code is owned by Microsoft and the other by Adobe, it's pretty much what you're stuck with.
Source Code Level Sharing
In open source software, however, if project A has some functionality that's useful in project B, what you're likely to see is (at most) a fork of project A to turn that functionality into a library, which is then linked into both the remainder of project A and into Project B, and quite possibly projects C, D, and E as well--all without imposing the overhead of COM, cross-procedure RPC, etc.
Now, don't get me wrong: I'm not trying to act as a spokesperson for open source software, nor to say that closed source is terrible and open source is always dramatically superior. What I am saying is that COM is defined primarily at a binary level, but for open source software, people tend to deal more with source code instead.
Of course SWIG is only one example among several of tools that support cross-language development at a source-code level. While SWIG is widely used, COM is different from it in one rather crucial way: with COM, you define an interface in a single, neutral language, and then generate a set of language bindings (proxies and stubs) that fit that interface. This is rather different from SWIG, where you're matching directly from one source to one target language (e.g., bindings to use a C library from Python).
Binary Communication
There are still cases where it's useful to have at least some capabilities similar to those provided by COM. These have led to open-source systems that resemble COM to a rather greater degree. For example, a number of open-source desktop environments use/implement D-bus. Where COM is mostly an RPC kind of thing, D-bus is mostly an agreed-upon way of sending messages between components.
D-bus does, however, specify things it calls objects. Its objects can have methods, to which you can send signals. Although D-bus itself defines this primarily in terms of a messaging protocol, it's fairly trivial to write proxy objects that make invoking a method on a remote object look pretty much like invoking one on a local object. The big difference is that COM has a "compiler" that can take a specification of the protocol, and automatically generate those proxies for you (and corresponding stubs in the far end to receive the message, and invoke the proper function based on the message it received). That's not part of D-bus itself, but people have written tools to take (for example) an interface specification and automatically generate proxies/stubs from that specification.
As such, although the two aren't exactly identical, there's enough similarity that D-bus can be (and often is) used for many of the same sorts of things as COM.
Systems Similar to DCOM
COM also allows you to build distributed systems using DCOM (Distributed COM). That is, a system where you invoke a method on one machine, but (at least potentially) execute that invoked method on another machine. This adds more overhead, but since (as pointed out above with respect to D-bus) RPC is basically communication with proxies/stubs attached to the ends, it's pretty easy to do the same thing in a distributed fashion. The difference in overhead, however, tends to lead to differences in how systems need to be designed to work well, though, so the practical advantage of using exactly the same system for distributed systems as local systems tends to be fairly minimal.
As such, the open source world provides tools for doing distributed RPC, but doesn't usually work hard at making them look the same as non-distributed systems. CORBA is well known, but generally viewed as large and complex, so (at least in my experience) current use is fairly minimal. Apache Thrift provides some of the same general type of capabilities, but in a rather simpler, lighter-weight fashion. In particular, where CORBA attempts to provide a complete set of tools for distributed computing (complete with everything from authentication to distributed time keeping), Thrift follows the Unix philosophy much more closely, attempting to meet exactly one need: generate proxies and stubs from an interface definition (written in a neutral language). If you want to do those CORBA-like things with Thrift you undoubtedly can, but in a more typical case of building internal infrastructure where the caller and callee trust each other, you can avoid a lot of overhead and just get on with the business at hand. Likewise, google RPC provides roughly the same sorts of capabilities as Thrift.
OS X Specific
Cocoa provides distributed objects that are fairly similar to COM. This is based on Objective-C though, and I believe it's now deprecated.
Apple also offers XPC. XPC is more about inter-process communication than RPC, so I'd consider it more directly comparable to D-bus than to COM. But, much like D-bus, it has a lot of the same basic capabilities as COM, but in different form that places more emphasis on communication, and less on making things look like local function calls (and many now prefer messaging to RPC anyway).
Summary
Open source software has enough different factors in its design that there's less demand for something providing the same mix of capabilities as Microsoft's COM provides on Windows. COM is largely a single tool that tries to meet all needs. In the open-source world, there's less drive to provide that single, all-encompassing solution, and more tendency toward a kit of tools, each doing one thing well, that can be put together into a solution for a specific need.
Being more commercially oriented, Apple OS X probably has what are (at least arguably) closer analogs to COM than most of the more purely open-source world.
A quick answer on the last question: COM is far from being obsolete. Almost everything in the Microsoft world is COM-based, including the .NET engine (the CLR), and including the new Windows 8.x's Windows Runtime.
Here is what Microsoft says about .NET in it latest C++ pages Welcome Back to C++ (Modern C++):
C++ is experiencing a renaissance because power is king again.
Languages like Java and C# are good when programmer productivity is
important, but they show their limitations when power and performance
are paramount. For high efficiency and power, especially on devices
that have limited hardware, nothing beats modern C++.
PS: which is a bit of a shock for a developer who has invested more than 10 years on .NET :-)
In the Linux world, it is more common to develop components that are statically linked, or which run in separate processes and communicate by piping text (maybe JSON or XML) back and forth.
Some of this is due to tradition. UNIX developers have been doing stuff like this long before CORBA or COM existed. It's "the UNIX way".
As Jerry Coffin says in his answer, when you have the source code for everything, binary interfaces are not as important, and in fact just make everything more difficult.
COM was invented back when personal computers were a lot slower than they are today. In those days, loading components into your app's process space and invoking native code was often necessary to achieve reasonable performance. Now, parsing text and running interpreted scripts aren't things to be afraid of.
CORBA never really caught on in the open-source world because the initial implementations were proprietary and expensive, and by the time high-quality free implementations were available, the spec was so complicated that nobody wanted to use it if they weren't required to do so.
To a large extent, the problems solved by COM are simply ignored on Linux.
It is true that binary compatibility is less important when you have the source code available. However, you still have to worry about modularisation and versioning. If two different programs depend on different versions of the same library, you need to somehow support that.
Then there is the case of the same program using different versions of the same library. This is often useful when working on large legacy programs, where upgrading everything can be prohibitively expensive but you would like to use new features anyway. With COM, the old parts of the program can just be left alone, since new library versions can more easily be made backwards compatible.
In addition, having to compile from source instead of binary compatibility is a huge hassle. Especially if you are actually developing software, since binary incompatibility means you have to recompile much more often. If one tiny part changes in a large C++ program, you may have to wait for a 30 minute recompile. If the different pieces are compatible, only the part which changed has to be recompiled.
COM and DCOM in particular have been around in windows for some considerable time now and naturally windows developers have made use of this powerful framework.
We are now in the cross platform age and when porting such applications to other platforms we are faced with challenges which in many cases can be mitigated or eliminated altogether unless the application we are porting is more than just one simple standalone app.
If your dealing with a whole suite of modules running on different machines all communicating using windows specific technologies such as DCE/RPC, DCOM or even windows named pipes then your job just became an order of magnitude harder.
DCE/RPC DCOM and windows named pipes all are very windows specific, non portable and of course subject to windows security access control.
For instance anyone familar with OPC DA (an industrial automation protocol based on DCOM still very much in use but now superceded by OPC UA (which avoids DCOM))) will know that there are no elegant solutions here if the client (or server) needs to be available for Linux!!
Sure there appear to be some technical hurdles here given that the MS code is not in the public domain but projects such as Wine have a partly ok DCE/RPC implementation and MS do publish some of the protocol docs. Try searching and you will probably find little information and few products open source or otherwise to help you.
Perhaps the lack of open source or affordable options here is more due to legal concerns - I wonder!
Some simpler solutions simply involve installing a "gateway service" on the windows machines to allow an alternative means of access to DCOM interfaces on that machine. This is fine if the windows machine does not belong to an unwilling 3rd party which unfortunately is sometimes the case!!! I know we'll just chuck another Windows machine as the gateway in the middle is the usual global warming enhancing solution to that problem.
I would conclude that Linux to Windows DCOM interoperability is certainly not impossible but it does appear to be a topic that few are interested in talking about unless you get your wallet out!
I'm a game's developer and am currently in the processing of writing a cross-platform, multi-threaded engine for our company. Arguably, one of the most powerful tools in a game engine is its scripting system, hence I'm on the hunt for a new scripting language to integrate into our engine (currently using a relatively basic in-house engine).
Key features for the desired scripting system (in order of importance) are:
Performance - MUST be fast to call & update scripts
Cross platform - Needs to be relatively easy to port to multiple platforms (don't mind a bit of work, but should only take a few days to port to each platform)
Offline compilation - Being able to pre-parse the script code offline is almost essential (helps with file sizes and load times)
Ability to integrate well with c++ - Should be able to support OO code within the language, and integrate this functionality with c++
Multi-threaded - not required, but desired. Would be best to be able to run separate instances of it on multiple threads that don't interfere with each other (i.e. no globals within the underlying code that need to be altered while running). Critical Section and Mutex based solutions need not apply.
I've so far had experience integrating/using Lua, Squirrel (OO language, based on Lua) and have written an ActionScript 2 virtual machine.
So, what scripting system do you recommend that fits the above criteria? (And if possible, could you also post or link to any comparisons to other scripting languages that you may have)
Thanks,
Grant
Lua has the advantage of being time-tested by a number of big-name video game developers and a good base of knowledgeable developers thanks to Blizzard-Activision's adoption of it as the primary platform for developing World of Warcraft add-ins.
Lua is a very good match for your needs. I'll take them in the same order.
Lua is one of the fastest scripting languages. It's fast to compile and fast to run.
Lua compiles on any platform with an ANSI C compiler, which afaik includes all gaming platforms.
Lua can be pre-compiled, but as a very dynamic languages most errors are only detectable at runtime. Also precompiled code (as bytecode) is often larger in terms of size than source code.
There are many Lua/C++ binding tools.
It doesn't support multi-threading (you cannot access a single instance of the interpreter from multiple threads), but you can have several instances of the interpreter, one per thread, or even one per game object.
Lua have been used in video-game industry for years. Lightweight and efficient.
That being said, ChaiScript and Falcon are good candidates matching your needs and with higher level language than Lua but with less history and community support.
Lua
Boost Python
SWIG
We've had good luck with Squirrel so far. Lua is so popular it's on its way to becoming a standard.
I recommend you worry more about memory than speed. Most scripting languages are "fast enough" and if they get slow you can always push some of that functionality back down into C++. Many of them burn through lots of memory, though, and on a console memory is an even more scarce resource than CPU time. Unbounded memory consumption will crash you eventually, and if you have to allocate 4MB just for the interpreter, that's like having to throw 30 textures out the window to make room.
Lua, and then LuaJIT for extra blaziness!
just don't expect too much from automatic C++ binding libraries, most are slow and restrictive. better do your own binding for your own objects.
as for concurrency, either LuaLanes, or roll your own. if your C++ program is already multithreaded, just call separate LuaStates from each thread, and use your own C++ shared structures as communications channels if needed.
as you might already know, the most often repeated answer in Lua is 'roll your own', and it's often the best advice! except when it's about bindings to common C/C++ libraries, in that case it's quite probable there's already one.
If you haven't looked at it yet I would suggest you check out Angelscript.
I have successfully used it in a cross platform environment (Windows and Linux with only a recompile) and it is designed to integrate well with C++ (both objects and code).
It is lightweight and supports multi-threading (in the sense that the question was asked), performs well and compiles to byte code which could be done in advance.
Start with Python.
If you can prove that you need more speed, then look at Stackless Python. That's what EVE Online uses for their game.
JavaScript may be a reasonable option, because of the mountains of effort that have gone into optimizing the various implementations for use in web-browsers.
These come to mind:
Lua
Python with boost::python
MzScheme or Guile
Ruby with SWIG
I have heard that C++ offers no native support for multithreading. I assume that multithreaded C++ apps depended on managed code for multithreading; that is, for example, a Visual C++ app used MFC or .NET or something along those lines to provide multithreading capability. I further assume that some or all of those managed-code capabilities are unavailable to unmanaged applications. But I have read about unmanaged multithreaded applications. How is this possible? Which of my assumptions is false?
It is wholly up to the operating system to provide support for multi-threading. On Windows, the necessary functionality is available via the Win32 API. Frameworks such as MFC provide wrappers over the low-level threading functions to simplify things, while of course .NET/CLR has its own managed interface for accessing Win32 multi-threading capabilities.
A good explanation is offerred in this article (Multithreading in C++).
Why Doesn’t C++ Contain Built-In
Support for Multithreading?
C++ does not contain any built-in
support for multithreaded
applications. Instead, it relies
entirely upon the operating system to
provide this feature. Given that both
Java and C# provide built-in support
for multithreading, it is natural to
ask why this isn’t also the case for
C++. The answers are efficiency,
control, and the range of applications
to which C++ is applied. Let’s examine
each.
By not building in support for
multithreading, C++ does not attempt
to define a “one size fits all”
solution. Instead, C++ allows you to
directly utilize the multithreading
features provided by the operating
system. This approach means that your
programs can be multithreaded in the
most efficient means supported by the
execution environment. Because many
multitasking environments offer rich
support for multithreading, being able
to access that support is crucial to
the creation of high-performance,
multithreaded programs.
Multithreading in C++ does not require managed code.
In very much the same way that C++ does not provide native support for displaying graphics or emitting sounds or reading input from a mouse, the operating system that's being used will provide a C++ API for utilizing these features.
It's not a matter of C++ not being able to do it. It simply hasn't been written into the C++ standard yet.
Some of your assumptions are not quite right. The operating system (I'm talking about win32 since you mention .NET) provides support for threading. There are lots of good threading libs. that build ontop of the OS functionality in C++ to make multithreading "easier" :) -- pthreads for example. Here is more at MSDN.
The ISO standard for the programming language C++ neither defines nor prohibits multithreading. An implementation is allowed to provide extensions if it wishes. A program is allowed to use implementation extensions if it wishes, and then the program will only run on systems that provide those extensions.
For comparison, the ISO standard for the programming language C++ neither defines nor prohibits the use of a mouse. A program is allowed to use implementation extensions and then it will only run on systems that provide those extensions. For another comparison, the ISO standard for C++ neither defines nor prohibits UTF-8, so your program can depend on Latin-1 and then your program will only run on systems that provide Latin-1.
Native C++ does not offer "built in" multithreading support simply because it was not intended to, or in fact, needed.
Your misconception is that this is a fault, while it is in fact a strength of the language. By being "oblivious" to multithreading, C++ seamlessly integrates with the MT support offered by the OS your code will compile and run on, thereby offering much more flexibility and efficiency than if it came with it's own "MT baggage" so to speak.
You mention MFC and .NET as examples - be aware that these libraries/wrappers are merely a layer over basic Win32 API's. Using C++ as intended will provide you with efficient code that will run multithrededly on ANY OS, as long as you seperate the logic from the OS-specific MT API calles (i.e thread creation etc), so that porting between OS's is greatly facilitated.
Unlike Java, which defined language constructs and JVM specs, the C++ standard is oblivious to threading (so is C). As far as these languages are concerned, anything thread-related consists of function calls to OS functionality. Libraries compiled for multithreading simply make sense of the same calls, but from a language perspectives they are plain old code.
I think you misunderstand the definition of 'managed' code. 'Managed' code is a Microsoft-specific term meaning code that is uses the .NET framework and thus is subject to the various aspects of .NET. 'Unmanaged' code means code that runs outside that and does not operate through the .NET layer. MFC code is 'unmanaged'; it's merely a spectacularly bad wrapper for the ubiquitous Win32 API (which isn't even the lowest level API available on Windows).
The .NET libraries (including multithreading) are almost all, at some level, interfaces for the more basic system APIs used by traditional, 'unmanaged' applications. There is, generally speaking, no functionality available to 'managed' code that cannot be replicated in 'unmanaged' code with sufficient effort, though the reverse is not true (this is called the abstraction penalty, if you wish to know more). While it may be easier to do in 'managed' code, that's just because somewhere, some 'unmanaged' code is doing it for you, more or less. In the case of a threading API, it is in turn an interface to the operating system kernel, which itself accesses the processor's capabilities to allow a process to run in multiple places concurrently (if using multiple cores; if not, then it's just a pseudo-concurrency).
The C++ standard currently provides no definition of threads (the upcoming C++1x standard does). There are a number different threading libraries available, including those provided by Win32 and MFC, the pthreads library found on POSIX systems, and Boost.Thread, which will use the platform's local threading library.
The next C++ standard (named c++0x) will have support for multithreading.
Will include atomic operations.