FEM Assembly and Solver Benchmarks

FEM Assembly and Solver Benchmarks

Although not very popular to undertake, and sometimes outright controversial, benchmarks and benchmarking is often very useful in order to be able to make fair and informed decisions about software codes, libraries, and choice of programming languages.

With this in mind, several Finite Element Analysis (FEA) simulation codes and their implementations are here compared using the fundamental sparse matrix vector multiplication, finite element matrix assembly, and linear solver operations. The following five different simulation codes are evaluated

  • FEATool Multiphysics - Finite Element Analysis Toolbox entirely written in vectorized and optimized MATLAB m-script code

  • FEM2D - Finite Element library and reference implementation in Fortran 77

  • FEniCS Project - computing platform for solving general partial differential equations (PDE) with Python and C++
  • Julia FEM - basic Finite Element implementation in the Julia language ( Julia is a programming language that aims to solve the one-language problem, that is high-performance while at the same time being interpreted and easy to work with)

  • SFEA (Stencil based Finite Element Analysis) - a high-performance stencil based finite element method (FEM) solver written in Fortran 90 (this solver comes very close to the memory bandwidth limit and therefore gives an indication of the upper performance limit)

All test and timings were performed on a typical workstation system with an Intel Core i7-3820 CPU running Linux in single threaded/serial mode. The Fortran codes were compiled with the Intel Fortran compiler.

Test 1 - Sparse Matrix Vector Product

1/hFEAToolFEM2DJulia FEMSFEA
MATLABFortranJuliaFortran
1280.00200.030
2560.0020.0010.0310
5120.0060.0050.0340.001
10240.0210.0240.050.002
20480.0850.1970.080.009
40960.4110.250.034

For the sparse matrix vector product benchmark test we can see that FEM2D (CSR sparse format), FEATool (MATLAB/triplet sparse format), and Julia (CSC sparse format) perform similarly especially for moderate grid sizes. The stencil based SFEA approach is about a magnitude faster which is to be expected since memory access is near optimal. FEniCS did not support matrix and vector operations on a high level at the time of testing, so no data is included for FEniCS here.

Test 2 - Finite Element Matrix Assembly

1/hFEAToolFEM2DFEniCSJulia FEM
MATLABFortranPython/C++Julia
1280.050.050.050.24
2560.140.130.120.57
5120.430.420.311.7
10241.71.51.16.5
204876526
409624105

For FEM matrix assembly, the vectorized and optimized MATLAB code used by FEATool is actually just as fast as the FEM2D Fortran code (which already is a very good end efficient reference implementation). FEniCS is a little bit faster but this is to be expected since FEM2D and FEATool both use quadrilateral shape functions which are more expensive to assemble than the linear triangular ones used by FEniCS and Julia. The performance of Julia is unfortunately not very good here but this could be due to a non-optimized implementation. The SFEA code is not included since being stencil based FEM assembly costs virtually nothing.

Test 3 - Linear Solver for the Poisson Equation

1/hFEAToolFEM2DFEniCSJulia FEMSFEA
MATLAB
Umfpack
Fortran
GMG
Python/C++
PETSc
Julia
CHOLMOD?
Fortran
GMG
1280.3680.0250.190.180.049
2561.7180.050.790.3220.064
5127.4060.214.650.90.11
102447.6461.134.13.40.16
2048338.146-14.60.49
409663832.9
819220

In the final test the Poisson equation is solved on the unit square with unit source term and zero homogeneous Dirichlet boundary conditions everywhere. The default linear solver of each code is used throughout the tests. FEM2D and SFEA employ a geometric multigrid solver, FEATool (MATLAB) uses Umfpack/SuiteSparse, and FEniCS uses PETSc. From the timings we can see that the Umfpack and PETSc direct sparse solvers performed about the same, with a slight advantage for PETSc (although failed for the 1/h=2048 grid). As the problem sizes increases we can see that the GMG solvers scale significantly better than the direct solvers, with the stencil based SFEA approach being about a magnitude faster (about 700 times faster than Umfpack on the 1/h=2048 grid with 4.2 million unknowns/degrees of freedom). It is not quite clear which solver Julia uses but due to the performance figures which are on par with the FEM2D Geometric MultiGrid (GMG) solver we suspect it detects that the problem can be solved faster with Cholesky factorization and uses a corresponding solver.

Summary

We have looked at how five different finite element implementations perform on three fundamental test cases. For sparse matrix vector operations the codes performed very similar with exception of the stencil based approach which was a magnitude faster.

For matrix assembly it was quite surprising just how good performance one can get from a properly vectorized and optimized MATLAB implementation, showing the same performance as a reference implementation in compiled Fortran.

Regarding the solvers Geometric MultiGrid (GMG) unsurprisingly beats direct solvers when grid sizes increased. Here also the stencil based approach was faster by a magnitude or more.

These few tests have shown that it seems entirely possible to write high-performance FEM simulation codes in almost any programming language. What seems more important for performance is the choice of data structures (sparse versus stencil) and algorithm (direct or iterative solvers). The outlier might be Julia for which it currently is not fully clear how it will perform, but due to being a new language it clearly does show some potential.

A follow up benchmark which tests the whole solution process can be found in the corresponding FEM benchmark comparison post.

var layout={xaxis:{title:'Grid density, 1/h',tickvals:[128,256,512,1024,2048,4096]},yaxis:{title:'CPU Time [s]'}}; var tr1={x:[128,256,512,1024,2048],y:[0.002,0.002,0.006,0.021,0.085],type:'scatter',name:'FEATool (MATLAB)',line:{color:'rgb(255,0,0)',width:2}}; var tr2={x:[128,256,512,1024,2048,4096],y:[0,0.001,0.005,0.024,0.197,0.411],type:'scatter',name:'FEM2D (Fortran)',line:{color:'rgb(0,255,0)',width:2}}; var tr3={x:[128,256,512,1024,2048,4096],y:[0.03,0.031,0.034,0.05,0.08,0.25],type:'scatter',name:'Julia FEM (Julia)',line:{color:'rgb(0,0,255)',width:2}}; var tr4={x:[128,256,512,1024,2048,4096],y:[0,0,0.001,0.002,0.009,0.034],type:'scatter',name:'SFEA (Fortran)',line:{color:'rgb(255,255,0)',width:2}}; Plotly.newPlot('sparse-matrix-vector-product-timings-div', [tr1,tr2,tr3,tr4], layout); var tr1={x:[128,256,512,1024,2048],y:[0.05,0.14,0.43,1.7,7],type:'scatter',name:'FEATool (MATLAB)',line:{color:'rgb(255,0,0)',width:2}}; var tr2={x:[128,256,512,1024,2048,4096],y:[0.05,0.13,0.42,1.5,6,24],type:'scatter',name:'FEM2D (Fortran)',line:{color:'rgb(0,255,0)',width:2}}; var tr3={x:[128,256,512,1024,2048,],y:[0.05,0.12,0.31,1.1,5],type:'scatter',name:'FEniCS (Python/C++)',line:{color:'rgb(255,0,255)',width:2}}; var tr4={x:[128,256,512,1024,2048,4096],y:[0.24,0.57,1.7,6.5,26,105],type:'scatter',name:'Julia FEM (Julia)',line:{color:'rgb(0,0,255)',width:2}}; Plotly.newPlot('finite-element-matrix-assembly-timings-div', [tr1,tr2,tr3,tr4], layout); var tr1={x:[128,256,512,1024,2048],y:[0.368,1.718,7.406,47.646,338.14],type:'scatter',name:'FEATool (Umfpack)',line:{color:'rgb(255,0,0)',width:2}}; var tr2={x:[128,256,512,1024,2048,4096],y:[0.025,0.05,0.21,1.1,6,63],type:'scatter',name:'FEM2D (GMG)',line:{color:'rgb(0,255,0)',width:2}}; var tr3={x:[128,256,512,1024,2048],y:[0.19,0.79,4.65,34.1],type:'scatter',name:'FEniCS (PETSc)',line:{color:'rgb(255,0,255)',width:2}}; var tr4={x:[128,256,512,1024,2048,4096],y:[0.18,0.322,0.9,3.4,14.6,83],type:'scatter',name:'Julia FEM (CHOLMOD?)',line:{color:'rgb(0,0,255)',width:2}}; var tr5={x:[128,256,512,1024,2048,4096,8192],y:[0.049,0.064,0.11,0.16,0.49,2.9,20],type:'scatter',name:'SFEA (GMG)',line:{color:'rgb(255,255,0)',width:2}}; var layout={xaxis:{title:'Grid density, 1/h',tickvals:[128,256,512,1024,2048,4096,8192]},yaxis:{title:'CPU Time [s]'}}; Plotly.newPlot('linear-solver-timings-div', [tr1,tr2,tr3,tr4,tr5], layout);