Veloxx: An Ultra-High Performance Data Processing Library with SIMD-Accelerated Columnar Operations

Authors: Kadri Wali Mohammad, Irfan Khan, Javed Choudhary, Rishabh Singh

DOI Link: https://doi.org/10.22214/ijraset.2026.78854

Abstract

Modern data processing workloads demand low-latency, memory-efficient computation that traditional interpreted-language libraries struggle to deliver at scale. This paper presents Veloxx, an ultra-high-performance data processing and analytics library implemented in Rust with production-ready bindings for Python (via PyO3) and JavaScript (via WebAssembly). Veloxx introduces a columnar data model built around typed Series enums with validity bitmaps, stored in deterministically-ordered IndexMap structures within DataFrames. The library achieves substantial performance gains through a layered optimization strategy: SIMD-accelerated kernels using AVX2 intrinsics with portable fallbacks, Rayon-based work-stealing parallelism with adaptive threshold switching, custom SIMD-aligned memory pools (64-byte alignment for AVX-512 compatibility), and memory-mapped streaming I/O for CSV and JSON parsing. Our experimental evaluation on synthetic microbenchmarks demonstrates throughput of 1,466.3 million rows/second for group-by operations (25.9× improvement), 538.3 million elements/second for filtering (172× improvement), and 2,489.4 million rows/second for the query engine with SIMD optimization. In direct comparison with Polars, Veloxx achieves 66% faster vector addition and 61% faster filtering operations. Memory consumption is reduced by 38–45% through advanced pooling techniques. These results validate the architectural choices of combining Rust’s zero-cost abstractions with hardware-aware SIMD vectorization for building competitive data processing infrastructure.

Introduction

This paper introduces Veloxx, a high-performance data processing library built in Rust to address the limitations of Python-based tools such as Pandas and NumPy, which struggle with large-scale data because of the Global Interpreter Lock (GIL), inefficient memory usage, and limited parallelism.

Veloxx is designed for ultra-fast analytical workloads and achieves major performance improvements through several key techniques:

SIMD acceleration (AVX2) for vectorized operations
Parallel processing using Rayon with adaptive workload distribution
Custom memory management with aligned memory pools and zero-copy access
Streaming I/O with memory-mapped CSV/JSON parsing
Multi-language support via Python bindings and WebAssembly
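The SIMD item in the list above can be illustrated with a small runtime-dispatch sketch. This is not Veloxx's actual code: the function names (add_i32, add_avx2, add_scalar) are hypothetical, and only two tiers are shown (AVX2 intrinsics from std::arch plus a scalar fallback) so that the example needs nothing beyond the standard library.

```rust
// Illustrative sketch only; all names here are hypothetical, not Veloxx's API.

/// Scalar fallback that works on every target.
/// wrapping_add matches the wrap-around behaviour of the SIMD instruction.
fn add_scalar(a: &[i32], b: &[i32], out: &mut [i32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x.wrapping_add(*y);
    }
}

/// AVX2 path: adds 8 i32 lanes per iteration, then finishes the tail in scalar code.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[i32], b: &[i32], out: &mut [i32]) {
    use std::arch::x86_64::{
        __m256i, _mm256_add_epi32, _mm256_loadu_si256, _mm256_storeu_si256,
    };
    let chunks = out.len() / 8;
    for i in 0..chunks {
        let off = i * 8;
        // SAFETY: off + 8 <= len for all three slices (lengths checked by the caller),
        // and the unaligned load/store variants place no alignment requirement.
        unsafe {
            let va = _mm256_loadu_si256(a.as_ptr().add(off) as *const __m256i);
            let vb = _mm256_loadu_si256(b.as_ptr().add(off) as *const __m256i);
            let sum = _mm256_add_epi32(va, vb);
            _mm256_storeu_si256(out.as_mut_ptr().add(off) as *mut __m256i, sum);
        }
    }
    let done = chunks * 8;
    add_scalar(&a[done..], &b[done..], &mut out[done..]);
}

/// Public entry point: picks the fastest available kernel at runtime.
pub fn add_i32(a: &[i32], b: &[i32], out: &mut [i32]) {
    assert!(a.len() == b.len() && a.len() == out.len(), "length mismatch");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above,
            // and all three slices have equal length.
            unsafe { add_avx2(a, b, out) };
            return;
        }
    }
    add_scalar(a, b, out);
}
```

Calling add_i32(&a, &b, &mut out) with three equally sized slices gives the same result on any CPU; only the kernel chosen differs. Checking the CPU feature once per call, rather than per element, keeps the dispatch cost negligible for large columns.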

The system uses a columnar data model (Series and DataFrame) for efficient memory access and supports a wide range of operations like filtering, aggregation, joins, and statistical analysis. Its architecture is layered into I/O, processing engine, core API, and language bindings.
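As a rough illustration of the columnar model described above, the following sketch shows a typed Series enum that carries a validity flag per row, and a DataFrame that keeps columns in insertion order. The variant names, method names, and error type are assumptions made for this example and are not taken from the Veloxx source. A plain Vec of (name, column) pairs stands in for the ordered IndexMap mentioned in the abstract so that the example depends only on the standard library.

```rust
// Hypothetical sketch of a typed, null-aware column model; not Veloxx's API.

#[derive(Debug, Clone, PartialEq)]
pub enum Series {
    // Each variant stores the values plus one validity flag per row
    // (true = value present, false = null).
    I32(Vec<i32>, Vec<bool>),
    F64(Vec<f64>, Vec<bool>),
}

impl Series {
    pub fn from_i32(data: &[Option<i32>]) -> Self {
        let values = data.iter().map(|v| v.unwrap_or(0)).collect();
        let validity = data.iter().map(|v| v.is_some()).collect();
        Series::I32(values, validity)
    }

    pub fn from_f64(data: &[Option<f64>]) -> Self {
        let values = data.iter().map(|v| v.unwrap_or(0.0)).collect();
        let validity = data.iter().map(|v| v.is_some()).collect();
        Series::F64(values, validity)
    }

    pub fn len(&self) -> usize {
        match self {
            Series::I32(_, valid) | Series::F64(_, valid) => valid.len(),
        }
    }

    /// Null-aware sum: rows whose validity flag is false are skipped.
    pub fn sum(&self) -> f64 {
        match self {
            Series::I32(vals, valid) => vals
                .iter()
                .zip(valid)
                .filter(|(_, ok)| **ok)
                .map(|(v, _)| *v as f64)
                .sum(),
            Series::F64(vals, valid) => vals
                .iter()
                .zip(valid)
                .filter(|(_, ok)| **ok)
                .map(|(v, _)| *v)
                .sum(),
        }
    }
}

#[derive(Debug, Default)]
pub struct DataFrame {
    // Insertion-ordered columns; a stand-in for an ordered map.
    columns: Vec<(String, Series)>,
}

impl DataFrame {
    pub fn new() -> Self {
        Self::default()
    }

    /// Adds a column, rejecting duplicate names and mismatched lengths.
    pub fn with_column(mut self, name: &str, series: Series) -> Result<Self, String> {
        if self.columns.iter().any(|(n, _)| n.as_str() == name) {
            return Err(format!("duplicate column '{}'", name));
        }
        if let Some((_, first)) = self.columns.first() {
            if first.len() != series.len() {
                return Err(format!(
                    "column '{}' has {} rows, expected {}",
                    name,
                    series.len(),
                    first.len()
                ));
            }
        }
        self.columns.push((name.to_string(), series));
        Ok(self)
    }

    pub fn column(&self, name: &str) -> Option<&Series> {
        self.columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, s)| s)
    }

    pub fn column_names(&self) -> Vec<&str> {
        self.columns.iter().map(|(n, _)| n.as_str()).collect()
    }
}
```

Because each variant fixes its element type, an operation such as sum is dispatched once per column rather than once per value, which is what makes the layout friendly to vectorized kernels.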

Compared to existing frameworks (Pandas, Polars, Spark, DuckDB), Veloxx stands out by combining SIMD optimization, memory safety, parallelism, and cross-platform support in a single system.

Experimental results demonstrate significant performance gains:

Up to 7.8× faster than scalar operations and 2–5× faster than Pandas
Extremely high throughput (e.g., 1.4 billion rows/sec for group-by)
Faster I/O processing (up to 3.9× CSV speedup)
Reduced memory usage (~46%) and more stable latency

Conclusion

This paper presented Veloxx, an ultra-high-performance data processing library implemented in Rust with SIMD-accelerated columnar operations. Our experimental evaluation demonstrates substantial improvements across all measured dimensions: 25.9× faster group-by operations (1,466.3 M rows/sec), 172× faster filtering (538.3 M elements/sec), 66% faster vector addition compared to Polars, and 38–45% memory reduction compared to Pandas.

The key architectural contributions include:

(1) a typed Series enum with validity bitmaps enabling type-safe, null-aware columnar operations;
(2) a three-tier SIMD acceleration strategy with AVX2 intrinsics, portable SIMD via the wide crate, and scalar fallbacks;
(3) SIMD-aligned memory pools with RAII management achieving 13.8 M allocations/second; and
(4) simultaneous Python and WebAssembly bindings enabling cross-platform deployment.

Future directions include:

1) Distributed Computing: Extending the parallel processing framework to multi-node clusters using message-passing or shared-memory approaches.
2) GPU Acceleration: Offloading compute-intensive operations (matrix operations for ML, large-scale sorting) to GPU hardware via CUDA or Vulkan compute shaders.
3) Streaming Engine: Adding continuous query processing capabilities for real-time data pipelines.
4) Advanced SQL Compatibility: Implementing a more complete SQL dialect including window functions over grouped partitions, common table expressions, and subquery optimization.
5) Packed Bitsets: Replacing the current Vec-based validity bitmaps with Vec-based packed bitsets to eliminate the bitmap overhead identified in this work.
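The packed-bitset direction named among the future work items can be sketched as follows. The text does not specify the word type of the packed representation, so the use of u64 words here is an assumption, as are the type and method names. The sketch stores one bit per row instead of one byte per row, which is where the space saving comes from.

```rust
// Hypothetical sketch of a packed validity bitmap; u64 words are an assumption.

#[derive(Debug, Clone, PartialEq)]
pub struct PackedBitmap {
    words: Vec<u64>, // 64 validity flags per word, least significant bit first
    len: usize,      // number of logical rows
}

impl PackedBitmap {
    /// Packs a slice of per-row flags into 64-bit words.
    pub fn from_bools(flags: &[bool]) -> Self {
        let mut words = vec![0u64; (flags.len() + 63) / 64];
        for (i, &valid) in flags.iter().enumerate() {
            if valid {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: flags.len() }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Returns true if row i holds a value (is not null).
    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "row index out of bounds");
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Counts valid rows with one popcount per word.
    /// Padding bits in the last word are always zero, so they never count.
    pub fn count_valid(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// Bytes used by the packed words.
    pub fn heap_bytes(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}
```

For 130 rows, a slice of bool flags occupies 130 bytes, while this layout needs three 64-bit words (24 bytes). Counting non-null rows also becomes a handful of popcount operations instead of a per-row loop.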


Copyright

Copyright © 2026 Kadri Wali Mohammad, Irfan Khan, Javed Choudhary, Rishabh Singh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET78854

Publish Date : 2026-03-26

ISSN : 2321-9653

Publisher Name : IJRASET
