RustSight is an open-source command-line tool for fast profiling and validation of CSV datasets, aimed at data engineers and AI/ML practitioners. It is written in Rust, and the project reports a 6.1x speedup over Pandas on large datasets. RustSight detects missing values, outliers, type mismatches, and zero-variance columns, so data-quality problems surface before AI/ML model training.
Key features include:
CSV dataset profiling with column type detection
Missing value analysis per column
ML readiness validation with outlier detection
Streaming processing for large files
Automatic report generation
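The profiling features above can be illustrated with a minimal single-pass sketch. This is not RustSight's actual implementation: the names ColumnType, ColumnProfile, and profile are hypothetical, and the parser simply splits on commas, so quoted fields are not handled. It assumes that empty cells count as missing, that a column's type widens from integer to float to text, and it uses Welford's running variance to flag zero-variance columns without loading the file into memory.

```rust
use std::io::BufRead;

/// Inferred column type; widens Integer -> Float -> Text as values arrive.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Unknown,
    Integer,
    Float,
    Text,
}

/// Running statistics for one column, updated one cell at a time.
#[derive(Debug, Clone)]
struct ColumnProfile {
    name: String,
    missing: usize,
    kind: ColumnType,
    count: usize, // numeric values seen so far
    mean: f64,    // running mean (Welford)
    m2: f64,      // running sum of squared deviations (Welford)
}

impl ColumnProfile {
    fn new(name: &str) -> Self {
        ColumnProfile {
            name: name.to_string(),
            missing: 0,
            kind: ColumnType::Unknown,
            count: 0,
            mean: 0.0,
            m2: 0.0,
        }
    }

    fn observe(&mut self, cell: &str) {
        let cell = cell.trim();
        if cell.is_empty() {
            self.missing += 1;
            return;
        }
        // Only finite floats count as numeric, so "NaN" and "inf" become text.
        let as_float = cell.parse::<f64>().ok().filter(|x| x.is_finite());
        let seen = if cell.parse::<i64>().is_ok() {
            ColumnType::Integer
        } else if as_float.is_some() {
            ColumnType::Float
        } else {
            ColumnType::Text
        };
        self.kind = match (self.kind, seen) {
            (ColumnType::Unknown, s) => s,
            (ColumnType::Text, _) | (_, ColumnType::Text) => ColumnType::Text,
            (ColumnType::Float, _) | (_, ColumnType::Float) => ColumnType::Float,
            _ => ColumnType::Integer,
        };
        if let Some(x) = as_float {
            self.count += 1;
            let delta = x - self.mean;
            self.mean += delta / self.count as f64;
            self.m2 += delta * (x - self.mean);
        }
    }

    /// Sample variance of the numeric values, if at least two were seen.
    fn variance(&self) -> Option<f64> {
        if self.count > 1 {
            Some(self.m2 / (self.count as f64 - 1.0))
        } else {
            None
        }
    }

    fn is_zero_variance(&self) -> bool {
        self.kind != ColumnType::Text && self.variance() == Some(0.0)
    }
}

/// Streams the input line by line; only per-column state is kept in memory.
fn profile<R: BufRead>(reader: R) -> std::io::Result<Vec<ColumnProfile>> {
    let mut lines = reader.lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()),
    };
    let mut cols: Vec<ColumnProfile> = header
        .split(',')
        .map(|name| ColumnProfile::new(name.trim()))
        .collect();
    for line in lines {
        let line = line?;
        if line.trim().is_empty() {
            continue; // skip blank lines
        }
        let mut cells = line.split(',');
        for col in cols.iter_mut() {
            // Rows shorter than the header count as missing cells.
            col.observe(cells.next().unwrap_or(""));
        }
    }
    Ok(cols)
}

fn main() -> std::io::Result<()> {
    let data = "id,score,label,constant\n1,0.5,a,7\n2,,b,7\n3,1.5,,7\n";
    for p in profile(data.as_bytes())? {
        println!(
            "{}: type={:?}, missing={}, zero_variance={}",
            p.name, p.kind, p.missing, p.is_zero_variance()
        );
    }
    Ok(())
}
```

Because profile accepts any BufRead, the same function could be pointed at a std::io::BufReader wrapped around a file, which is what makes the streaming approach workable for inputs larger than memory.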
RustSight currently supports CSV and TXT/Binary input, with Parquet, JSON, and Arrow support planned. Its performance is benchmarked against established tools such as Pandas to show that it handles large datasets efficiently.
Installation is through Cargo and requires the Rust toolchain. Once installed, users can profile a dataset and validate it for ML readiness from the command line.
Built with