Modern data processing workloads demand low-latency, memory-efficient computation that traditional interpreted-language libraries struggle to deliver at scale. This paper presents Veloxx, an ultra-high-performance data processing and analytics library implemented in Rust with production-ready bindings for Python (via PyO3) and JavaScript (via WebAssembly). Veloxx introduces a columnar data model built around typed Series enums with validity bitmaps, stored in deterministically-ordered IndexMap structures within DataFrames. The library achieves substantial performance gains through a layered optimization strategy: SIMD-accelerated kernels using AVX2 intrinsics with portable fallbacks, Rayon-based work-stealing parallelism with adaptive threshold switching, custom SIMD-aligned memory pools (64-byte alignment for AVX-512 compatibility), and memory-mapped streaming I/O for CSV and JSON parsing. Our experimental evaluation on synthetic microbenchmarks demonstrates throughput of 1,466.3 million rows/second for group-by operations (25.9× improvement), 538.3 million elements/second for filtering (172× improvement), and 2,489.4 million rows/second for the query engine with SIMD optimization. In direct comparison with Polars, Veloxx achieves 66% faster vector addition and 61% faster filtering operations. Memory consumption is reduced by 38–45% through advanced pooling techniques. These results validate the architectural choices of combining Rust’s zero-cost abstractions with hardware-aware SIMD vectorization for building competitive data processing infrastructure.
Introduction
This paper introduces Veloxx, a high-performance data processing library built in Rust to address the limitations of Python-based tools such as Pandas and NumPy, which struggle with large-scale data because of the Global Interpreter Lock (GIL), inefficient memory usage, and limited parallelism.
Veloxx is designed for ultra-fast analytical workloads and achieves major performance improvements through several key techniques:
SIMD acceleration (AVX2) for vectorized operations
Parallel processing using Rayon with adaptive workload distribution
Custom memory management with aligned memory pools and zero-copy access
Streaming I/O with memory-mapped CSV/JSON parsing
Multi-language support via Python bindings and WebAssembly
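The adaptive switching between sequential and parallel execution listed above can be illustrated with a minimal sketch. Veloxx itself uses Rayon; to stay self-contained, this sketch substitutes standard-library scoped threads for Rayon's work-stealing pool. The names `adaptive_sum` and `PARALLEL_THRESHOLD`, and the threshold value, are assumptions made for illustration and are not taken from the Veloxx source.

```rust
use std::thread;

/// Hypothetical cutoff: inputs shorter than this are summed on the calling thread.
/// The value is illustrative, not a tuned or documented Veloxx setting.
const PARALLEL_THRESHOLD: usize = 100_000;

/// Sums a slice, choosing a sequential or a chunked parallel path by input size.
/// Std scoped threads stand in for Rayon's work-stealing pool here.
fn adaptive_sum(data: &[f64]) -> f64 {
    if data.len() < PARALLEL_THRESHOLD {
        // Small input: thread start-up cost would outweigh any speedup.
        return data.iter().sum();
    }
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk_len = (data.len() + workers - 1) / workers;
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<f64>()))
            .collect();
        // Combine the per-chunk partial sums in chunk order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<f64>()
    })
}
```

Calling `adaptive_sum` on a three-element slice takes the sequential branch, while a slice of 200,000 elements is split into one chunk per available core. In practice the crossover point would be chosen by benchmarking on the target hardware.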
The system uses a columnar data model (Series and DataFrame) for efficient memory access and supports a wide range of operations like filtering, aggregation, joins, and statistical analysis. Its architecture is layered into I/O, processing engine, core API, and language bindings.
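As a rough illustration of the columnar model described above, the following sketch pairs each typed column with a validity bitmap and keeps columns in insertion order. The variant names, field names, and methods are assumptions rather than Veloxx's actual definitions, and a `Vec` of (name, column) pairs stands in for the IndexMap used by the library.

```rust
/// A typed column: each variant pairs its values with a validity bitmap.
/// `validity[i] == false` marks row `i` as null. Variant and field names
/// are assumptions for illustration, not Veloxx's actual definitions.
#[derive(Debug, Clone, PartialEq)]
enum Series {
    I64 { values: Vec<i64>, validity: Vec<bool> },
    F64 { values: Vec<f64>, validity: Vec<bool> },
}

impl Series {
    fn len(&self) -> usize {
        match self {
            Series::I64 { values, .. } => values.len(),
            Series::F64 { values, .. } => values.len(),
        }
    }

    fn validity(&self) -> &[bool] {
        match self {
            Series::I64 { validity, .. } | Series::F64 { validity, .. } => validity,
        }
    }

    /// Number of rows flagged as null.
    fn null_count(&self) -> usize {
        self.validity().iter().filter(|v| !**v).count()
    }

    /// Null-aware sum: rows flagged invalid are skipped.
    fn sum(&self) -> f64 {
        match self {
            Series::I64 { values, validity } => values
                .iter()
                .zip(validity)
                .filter(|(_, ok)| **ok)
                .map(|(v, _)| *v as f64)
                .sum::<f64>(),
            Series::F64 { values, validity } => values
                .iter()
                .zip(validity)
                .filter(|(_, ok)| **ok)
                .map(|(v, _)| *v)
                .sum::<f64>(),
        }
    }
}

/// Columns kept in insertion order. A Vec of (name, Series) pairs stands in
/// for the IndexMap the paper describes.
struct DataFrame {
    columns: Vec<(String, Series)>,
}

impl DataFrame {
    fn new() -> Self {
        Self { columns: Vec::new() }
    }

    /// Appends a column, rejecting one whose length differs from existing columns.
    fn add_column(&mut self, name: &str, series: Series) -> Result<(), String> {
        if let Some((_, first)) = self.columns.first() {
            if first.len() != series.len() {
                return Err(format!(
                    "column '{}' has {} rows, expected {}",
                    name,
                    series.len(),
                    first.len()
                ));
            }
        }
        self.columns.push((name.to_string(), series));
        Ok(())
    }

    fn column(&self, name: &str) -> Option<&Series> {
        self.columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, s)| s)
    }

    fn column_names(&self) -> Vec<&str> {
        self.columns.iter().map(|(n, _)| n.as_str()).collect()
    }
}
```

Matching on the enum variant once per operation, rather than once per value, keeps the inner loops monomorphic, which is one reason a typed-enum design suits vectorized kernels.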
Compared to existing frameworks (Pandas, Polars, Spark, DuckDB), Veloxx stands out by combining SIMD optimization, memory safety, parallelism, and cross-platform support in a single system.
The evaluation reports the following headline results:
Up to 7.8× faster than scalar operations and 2–5× faster than Pandas
High throughput (e.g., about 1.47 billion rows/sec for group-by)
Faster I/O processing (up to 3.9× CSV speedup)
Reduced memory usage (~46%) and more stable latency
Conclusion
This paper presented Veloxx, an ultra-high-performance data processing library implemented in Rust with SIMD-accelerated columnar operations. Our experimental evaluation demonstrates substantial improvements across all measured dimensions: 25.9× faster group-by operations (1,466.3 M rows/sec), 172× faster filtering (538.3 M elements/sec), 66% faster vector addition compared to Polars, and 38–45% memory reduction compared to Pandas.
The key architectural contributions include: (1) a typed Series enum with validity bitmaps enabling type-safe, null-aware columnar operations; (2) a three-tier SIMD acceleration strategy with AVX2 intrinsics, portable wide-crate SIMD, and scalar fallbacks; (3) SIMD-aligned memory pools with RAII management achieving 13.8 M allocations/second; and (4) simultaneous Python and WebAssembly bindings enabling cross-platform deployment.
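The tiered SIMD strategy in contribution (2) can be sketched as runtime dispatch between an AVX2-enabled kernel and a scalar fallback. This is a simplified two-tier illustration: the intermediate portable tier built on the wide crate is omitted to avoid an external dependency, and the function names `add_f32`, `add_avx2`, and `add_scalar` are hypothetical rather than Veloxx's API.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};

/// Scalar fallback: element-wise addition, one value at a time.
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Vector path: adds eight f32 lanes per instruction using 256-bit registers.
/// The float add is an AVX instruction, which AVX2 support implies.
/// Safety: the caller must confirm AVX2 support at runtime and pass
/// three slices of equal length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len();
    let vec_end = n - n % 8;
    let mut i = 0;
    while i < vec_end {
        // SAFETY: i + 8 <= vec_end <= n, so all eight lanes are in bounds.
        // Unaligned loads and stores place no alignment demand on the slices.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        }
        i += 8;
    }
    // The tail (fewer than eight elements) goes through the scalar path.
    add_scalar(&a[vec_end..], &b[vec_end..], &mut out[vec_end..]);
}

/// Dispatches to the widest kernel the running CPU supports.
fn add_f32(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let mut out = vec![0.0f32; a.len()];
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was just detected and all three slices share one length.
            unsafe { add_avx2(a, b, &mut out) };
            return out;
        }
    }
    add_scalar(a, b, &mut out);
    out
}
```

Because the feature check happens at runtime, a single binary can run on CPUs with and without AVX2. On non-x86_64 targets the vector path is compiled out and only the scalar kernel remains.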
Future directions include:
1) Distributed Computing: Extending the parallel processing framework to multi-node clusters using message-passing or shared-memory approaches.
2) GPU Acceleration: Offloading compute-intensive operations (matrix operations for ML, large-scale sorting) to GPU hardware via CUDA or Vulkan compute shaders.
3) Streaming Engine: Adding continuous query processing capabilities for real-time data pipelines.
4) Advanced SQL Compatibility: Implementing a more complete SQL dialect including window functions over grouped partitions, common table expressions, and subquery optimization.
5) Packed Bitsets: Replacing the current Vec-based validity bitmaps with Vec-backed packed bitsets, which store one validity bit per value, to eliminate the bitmap overhead identified in this work.
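The packed-bitset direction in item 5 can be illustrated with a minimal sketch that stores one validity bit per row. It assumes 64-bit words as the storage unit, since the element type is not specified here, and the type and method names are hypothetical.

```rust
/// Validity bitmap that stores one bit per row inside 64-bit words,
/// instead of one byte per row as a `Vec<bool>` would.
/// The type and method names are hypothetical.
struct PackedValidity {
    words: Vec<u64>,
    len: usize,
}

impl PackedValidity {
    /// Creates a bitmap of `len` rows, all marked valid.
    fn all_valid(len: usize) -> Self {
        let mut words = vec![u64::MAX; (len + 63) / 64];
        // Clear the unused high bits of the last word so counts stay exact.
        let used = len % 64;
        if used != 0 {
            if let Some(last) = words.last_mut() {
                *last = (1u64 << used) - 1;
            }
        }
        Self { words, len }
    }

    fn len(&self) -> usize {
        self.len
    }

    fn is_valid(&self, i: usize) -> bool {
        assert!(i < self.len, "row index out of bounds");
        ((self.words[i / 64] >> (i % 64)) & 1) == 1
    }

    /// Marks row `i` as null by clearing its bit.
    fn set_null(&mut self, i: usize) {
        assert!(i < self.len, "row index out of bounds");
        self.words[i / 64] &= !(1u64 << (i % 64));
    }

    /// Counts null rows with one popcount per word.
    fn null_count(&self) -> usize {
        let valid: usize = self.words.iter().map(|w| w.count_ones() as usize).sum();
        self.len - valid
    }

    /// Bytes used by the bit storage itself.
    fn heap_bytes(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}
```

For 100 rows this layout needs two 64-bit words (16 bytes), where a byte-per-row bitmap needs 100 bytes. Null counting also reduces to one population count per word.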