Full Report
One of the things that has often confused me is how little good advice there is for reading large files efficiently when writing code. Typically most people use whatever the canonical file read suggestion for their language is, until they need to read large files and it’s too slow. Then they google “efficiently reading large files in [language]” and are pointed to a buffered reader of some sort, and that’s that.
Analysis Summary
# Main Topic
An analysis, backed by empirical tests in Rust, of strategies for reading large files efficiently. The focus is on how modern SSDs, as opposed to the spinning disks that older advice assumes, change which file reading strategy performs best.
## Key Points
- Canonical methods of reading files (e.g., reading the entire file into a single String) are inefficient for large files: in testing they were over 12x slower than the best-performing method.
- Modern SSDs handle concurrent seeks and parallel reads much better than older spinning disks, challenging historic performance advice.
- Reading files in large blocks (e.g., 8MB) offers a significant speed improvement over reading the entire file at once or line-by-line if the goal is raw data throughput.
- Concurrent reading using multiple threads accessing separate file handles yields the best measured performance, nearly 3x faster than the standard Buffered Reader approach.
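
The block-read idea from the points above can be sketched in a few lines of Rust. This is a minimal illustration, not the code used in the original tests: the function name `read_in_blocks`, the constant `BLOCK_SIZE`, and the choice to return only a byte count are assumptions made for the example.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Block size taken from the "8MB" figure in the summary (assumed to mean 8 MiB).
const BLOCK_SIZE: usize = 8 * 1024 * 1024;

/// Reads the file at `path` in blocks of at most `block_size` bytes,
/// reusing a single buffer, and returns the total number of bytes read.
/// `block_size` should be non-zero; with zero, nothing is read.
fn read_in_blocks(path: &Path, block_size: usize) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; block_size];
    let mut total = 0u64;
    loop {
        // `read` may return fewer bytes than the buffer holds; 0 means EOF.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // A real program would process `&buf[..n]` here.
        total += n as u64;
    }
    Ok(total)
}
```

Calling `read_in_blocks(path, BLOCK_SIZE)` keeps memory use bounded by one block regardless of file size, which is the main practical difference from reading the whole file at once.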
## Strategies Tested
1. Vanilla Read (entire file to String).
2. IO Read (entire file into raw byte buffer).
3. Block Read (8MB chunks).
4. Buffered Reader (modified for block reads).
5. Thread Reader (concurrent reading of blocks via 10 threads).
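
For orientation, strategies 1, 2, and 4 from the list above might look roughly like the following in Rust. These are sketches under stated assumptions, not the code from the original tests: the function names are hypothetical, each function returns only a byte count, and the `BufReader` variant assumes that "modified for block reads" means a reader created with a large capacity and drained one full buffer at a time.

```rust
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, Read};
use std::path::Path;

/// Strategy 1 (sketch): read the whole file into a String.
/// Fails if the file is not valid UTF-8.
fn vanilla_read(path: &Path) -> io::Result<usize> {
    let text = fs::read_to_string(path)?;
    Ok(text.len())
}

/// Strategy 2 (sketch): read the whole file into a raw byte buffer,
/// skipping UTF-8 validation.
fn io_read(path: &Path) -> io::Result<usize> {
    let mut file = File::open(path)?;
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes)?;
    Ok(bytes.len())
}

/// Strategy 4 (sketch): a BufReader with a caller-chosen capacity,
/// consumed one filled buffer at a time instead of line by line.
fn buffered_block_read(path: &Path, capacity: usize) -> io::Result<usize> {
    let file = File::open(path)?;
    let mut reader = BufReader::with_capacity(capacity, file);
    let mut total = 0;
    loop {
        // `fill_buf` returns the currently buffered bytes; empty means EOF.
        let n = reader.fill_buf()?.len();
        if n == 0 {
            break;
        }
        // A real program would process the buffered block before consuming it.
        total += n;
        reader.consume(n);
    }
    Ok(total)
}
```

All three return the same count for the same UTF-8 file; they differ in how much memory they hold at once and in whether the data is validated as text.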
## Affected Systems
- The findings primarily relate to file I/O performance in software implementations (tested using Rust).
- By extension, the findings apply to any system that reads large data files and is bottlenecked on I/O, especially when the data sits on modern SSD storage.
- Testing environment: 2021 MacBook Pro with M1 chip.
## Mitigations
- Avoid reading large files into memory in a single call.
- For maximum throughput, implement multi-threaded file reading where each thread reads a predetermined block offset concurrently.
- Use block reads (e.g., 8MB chunks) rather than standard line-reading patterns if raw speed is the goal.
- A `BufReader` set up for block-sized reads offers substantial gains over the naive whole-file methods.
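
The multi-threaded recommendation above could be sketched as follows. This is an illustrative outline only: the function name `threaded_read` and the way the file is split are assumptions, and for brevity each thread reads its whole range into one buffer, where a fuller version would loop over fixed-size blocks within that range.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::thread;

/// Splits the file into up to `num_threads` contiguous ranges and reads
/// each range on its own thread through its own file handle.
/// Returns the total number of bytes read across all threads.
fn threaded_read(path: &Path, num_threads: u64) -> io::Result<u64> {
    let len = fs::metadata(path)?.len();
    let num_threads = num_threads.max(1);
    // Ceiling division, so the ranges together cover the whole file.
    let chunk = (len + num_threads - 1) / num_threads;

    let mut handles = Vec::new();
    for i in 0..num_threads {
        let start = i * chunk;
        if start >= len {
            break; // more threads requested than there is data
        }
        let size = chunk.min(len - start);
        let path = path.to_path_buf();
        handles.push(thread::spawn(move || -> io::Result<u64> {
            // Each thread opens a separate handle, so seeks do not interfere.
            let mut file = File::open(&path)?;
            file.seek(SeekFrom::Start(start))?;
            let mut buf = vec![0u8; size as usize];
            file.read_exact(&mut buf)?;
            // A real program would process `buf` here.
            Ok(buf.len() as u64)
        }));
    }

    let mut total = 0;
    for handle in handles {
        total += handle.join().expect("reader thread panicked")?;
    }
    Ok(total)
}
```

Opening one handle per thread avoids sharing a single file cursor between threads. The summary reports 10 threads in its tests; the best count on other hardware is something to measure rather than assume.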
## Conclusion
The conventional wisdom for reading large files is outdated, likely based on limitations of spinning disks. For modern hardware, concurrency and block-based I/O are crucial for maximizing read performance. Developers experiencing slow file input should investigate multi-threaded chunked reading strategies.