Two language problem

Last week I hit the two language problem hard.

I am studying convergence criteria for spectral partitioning, which involves using eigensolvers. One benefit of my line of research is that it doesn’t require a complete rewrite of the solvers or a new factorization method. I had previously done some experiments with the power method, which is both easy to analyze on paper and easy to implement in software, so I had some Python code for the power method lying around. I wanted to extend my results to the Implicitly Restarted Arnoldi Method and was therefore using ARPACK, a standard solver from the 90s. ARPACK is the solver behind eigs in Matlab, Python, Julia, and R (pkg: rArpack). However, I realized that in order to do my error analysis, I needed the *un*converged approximations to the eigenvectors at each step.
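
For reference, here is a minimal sketch of such a power-method experiment (my own illustration, not the original script). The point is that the iteration loop naturally exposes the unconverged eigenvector estimate at every step, which is exactly the quantity the error analysis needs.

```python
import numpy as np

def power_method(A, num_iters=100, tol=1e-10):
    """Dominant eigenpair of A, plus every intermediate (unconverged) estimate."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    history = [v.copy()]              # unconverged approximations at each step
    lam = 0.0
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)
        history.append(v.copy())
        lam_new = v @ (A @ v)         # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam_new, v, history
```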

I thought this was possible by rewriting some of the Python wrapper to ARPACK, so I started there. At some point, I realized my needs would require changes to ARPACK itself. All of a sudden I have to modify the Fortran code, use the ARPACK build system, and distribute native code in my Python package.
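
To make the cliff concrete, here is roughly what the high-level interface looks like (a hedged sketch using SciPy; the test matrix and parameters are just for illustration). `scipy.sparse.linalg.eigs` drives ARPACK's implicitly restarted Arnoldi iteration, but it only returns the final converged eigenpairs; the per-iteration approximations never cross the Fortran/Python boundary.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

# Build a random sparse symmetric test matrix (illustrative only).
A = sp.random(1000, 1000, density=0.01, random_state=0)
A = A + A.T

# eigs wraps ARPACK: we get the 6 largest-magnitude eigenpairs back,
# but none of the intermediate, unconverged approximations.
vals, vecs = eigs(A, k=6, which='LM')
```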

All this setup is to get to the real point. Using a good Fortran or C library from a higher-level language can get you a long way. People build many useful tools based on these libraries, and the packagers do a great job of hiding the complexity of the build process for those libraries. All of this is true for the fantastic Scientific Python Stack. Wrappers and bindings can ease the learning curve of these great tools. The downside is that once you have to change something in a library, you are left with a huge cliff to scale.

Solving the two language problem is perhaps the noblest goal in scientific computing, and multiple solid teams are trying. Numba and Cython take the approach of making the higher-level language more performant using ideas from compilation. Julia takes a clean-slate approach: it started with a simple language that was easy to compile with LLVM yet powerful enough to build the higher-level features in. It is a great compliment to LLVM that it is seen as the only passage from higher-level languages to the performance granted by compilation. The LLVM team has built something great, and it is enabling new tools to bridge the gap between these two styles of programming.
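
As a rough illustration of the first approach, here is what compiling the same kind of kernel with Numba might look like (a sketch under the assumption that Numba is installed; this is not code from the post). The decorator hands the Python function to LLVM, and you get machine code back without leaving Python.

```python
import numpy as np
from numba import njit

@njit
def dominant_eigenvalue(A, num_iters=1000):
    # Plain power iteration; Numba JIT-compiles this loop through LLVM.
    v = np.ones(A.shape[0])
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    return v @ (A @ v)                # Rayleigh quotient of the final iterate
```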

Will these techniques for compiling Python code succeed? Will the subset of Python they force you to write still be Pythonic? After hitting the two language problem, I am even more convinced that Julia is “the right way” to do it, although in the short run Julia is not as accessible as the Scientific Python Stack. Is this two language problem felt in other areas of computing? In databases there is some set of operations that the query language (either SQL or NoSQL) can perform easily, but once you get outside of that set of operations you have to tackle all of the hard problems yourself. Where else does this motif repeat?