Mar 19 2004

Are memory mapped files (MMFs) always faster than normal file I/O techniques?

Not necessarily.

Memory mapped files are seductive because they offer the lure of reading and writing data on disk using only a memory pointer. Advance the pointer to a new address, and presto! the data magically appears there. The system takes care of reading the data from disk on demand, using the page protection architecture of the x86 virtual memory hardware. If you refer to an address that has not yet been loaded into RAM, a page fault occurs behind the scenes and the system reads the data into RAM for you. Your program doesn’t notice this activity because your thread is suspended while the page fault is processed.
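To make the "just a pointer" idea concrete, here is a minimal sketch in Python rather than Delphi. It assumes nothing beyond the standard `mmap` module; the helper name `read_via_mmap` and the sample file contents are made up for illustration. There is no read call anywhere in the helper: indexing the mapping is what causes the system to page the data in.

```python
import mmap
import os
import tempfile

def read_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` by indexing a read-only mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are brought in on first touch.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]

# Build a small sample file (hypothetical contents) so the sketch runs as is.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"unit Windows;\r\ninterface\r\n" * 1000)

first_line = read_via_mmap(sample_path, 0, 13)
os.remove(sample_path)
```

How many pages the system keeps resident after that slice is taken is exactly the part you don't get to decide.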

MMFs give you simple access to data on disk without all the source code overhead of file I/O and buffering. It’s simple and it’s fast. So it must be better than the old way of doing things, right? Not necessarily.

Memory mapped files are not always faster than custom data loading algorithms. You have no control over how much of the MMF is kept in memory or for how long. This means that using an MMF may push other things out of RAM, such as code or data pages that you will need back “soon”.

Also, page faults are not free. A page fault can take a lot longer for the system to process than a simple file I/O call. The additional system overhead of using page faults is hidden by the fact that fault processing is performed in the system kernel, outside your program’s own code.

A custom data management routine can be made to take into account data access patterns specific to your data, locality of reference, and cache longevity. A carefully crafted data manager for a specific data set can provide comparable performance to raw MMF but use significantly less physical RAM to do it, thus reducing page swapping and improving overall system performance.

We’ve studied whether MMFs would provide any benefit to the Delphi compiler, for example, above and beyond the simple file buffering scheme that has served us well for reading source files for the past decade. What we found was that reading source code from an MMF was very fast, yes, but the MMF sucked up lots and lots of RAM and pushed more important stuff (like compiler symbols) out to disk.

We used the sequential access hint in hopes of convincing the MMF to retire old pages sooner. The sequential access hint made the MMF load pages before our source pointer touched them, which improved scanning speed (because the program didn’t have to wait for the page to be loaded on demand). However, that did not reduce the memory footprint of the MMF overall, and the large memory footprint decreased overall compile performance.
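For readers who haven't used such a hint, here is a rough sketch of what one looks like. This is not the API we used in the compiler; it is the analogous POSIX hint (`madvise` with `MADV_SEQUENTIAL`) as exposed by Python's `mmap` module, and it is skipped on platforms where it isn't available. The function name and sample data are hypothetical.

```python
import mmap
import os
import tempfile

def count_newlines_with_hint(path):
    """Scan a mapped file front to back after asking for sequential read-ahead."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # The hint is advisory and platform specific, so guard it.
            if hasattr(view, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                view.madvise(mmap.MADV_SEQUENTIAL)
            count = 0
            pos = view.find(b"\n")
            while pos != -1:
                count += 1
                pos = view.find(b"\n", pos + 1)
            return count

# Hypothetical sample data so the sketch runs as is.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"begin end;\n" * 500)

newline_count = count_newlines_with_hint(sample_path)
os.remove(sample_path)
```

The hint only asks for read-ahead behavior. Nothing in it lets you cap how many pages stay resident, which matches what we saw.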

The difference between using an MMF for source scanning and using a simple file buffer technique boils down to this: The MMF consumed a minimum of 64k of RAM, and kept approximately 25% of the file in memory after it had been scanned.

The simple file buffer technique never used more than 4k of RAM regardless of the source file size, and kept 0% of the file in memory after it had been scanned. For a 600k source file like Windows.pas, the MMF would chew up about 100k of RAM, whereas the simple file buffer would use no more than 4-8k of RAM.
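The file buffer side of that comparison can be sketched like this. It is an illustration in Python, not the compiler's actual scanner: the 4096-byte size is taken from the 4k figure above, and `count_lines_buffered` and the sample data are invented for the example. The point is that one buffer is allocated once and reused, so the process-side cost does not grow with the file.

```python
import os
import tempfile

BUFFER_SIZE = 4096  # assumed buffer size, matching the 4k figure in the text

def count_lines_buffered(path, buffer_size=BUFFER_SIZE):
    """Scan a file through one reusable buffer; return (line_count, total_bytes)."""
    buf = bytearray(buffer_size)  # the only buffer this scan ever allocates
    lines = 0
    total = 0
    # buffering=0 gives a raw file object, so reads land directly in `buf`.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:
                break
            lines += buf.count(b"\n", 0, n)
            total += n
    return lines, total

# Hypothetical sample data so the sketch runs as is.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x := 1;\n" * 10000)

line_count, byte_count = count_lines_buffered(sample_path)
os.remove(sample_path)
```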

When compiling a large project, the 96k difference between MMF and file buffer meant that potentially 96k of compiler symbols would be pushed out to disk by the MMF, and have to be loaded again from disk later.

I think part of the reason MMF doesn’t pay off in this example is that the source file is not the most important data in memory for a compiler. The symbol table is the most important data, and tends to be quite a bit larger than the source. (To compile and link a simple VCL app, the compiler must sift through more than 10MB of symbol data.)

In cases where the data in the MMF is the most important data to the application, MMF may provide greater advantages.

Memory Mapped File Advantages:

  • Simple and easy access to data as if it were already in memory.
  • Excellent performance in most situations, compared to generic file I/O.
  • Implicitly asynchronous file I/O, without threading headaches.

Memory Mapped File Disadvantages:

  • Larger memory footprint than traditional file I/O.
  • No control over how much memory is used or how long it stays in RAM.
  • MMFs require a fixed file size. Expanding an MMF is a pain in the neck.
  • Byte for byte mapping from disk to memory makes compression of file data or file format versioning more difficult.
  • MMFs do not support file sharing.
  • Contiguous block of address space required. It’s possible to fragment your process address space to the point that you can’t map a 500MB file into memory in one chunk.
  • Not readily available to typesafe managed .NET code.
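One common way around the contiguous address space problem is to map the file a window at a time instead of in one chunk. The sketch below shows the idea in Python under a few assumptions: `iter_mapped_windows` is a made-up helper, and the window size is rounded up to `mmap.ALLOCATIONGRANULARITY` because mapping offsets must be multiples of it.

```python
import mmap
import os
import tempfile

def iter_mapped_windows(path, window_size):
    """Yield (offset, bytes) pairs, mapping only one window of the file at a time."""
    gran = mmap.ALLOCATIONGRANULARITY
    # Round up so every window starts on a valid mapping offset.
    window = max(gran, (window_size + gran - 1) // gran * gran)
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < file_size:
            length = min(window, file_size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as view:
                chunk = view[:]  # copy out, then release the window
            yield offset, chunk
            offset += length

# Hypothetical sample data: three full windows plus a short tail.
total_size = 3 * mmap.ALLOCATIONGRANULARITY + 123
data = (b"0123456789abcdef" * (total_size // 16 + 1))[:total_size]
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

windows = list(iter_mapped_windows(sample_path, mmap.ALLOCATIONGRANULARITY))
os.remove(sample_path)
```

Of course, once you are managing windows yourself, a good part of the "it's just a pointer" simplicity is gone.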

There are still opportunities to tweak the compiler’s I/O performance. I hope to experiment with I/O completion ports and asynchronous file I/O to preload the next source buffer independently of the scanner, so that the compiler doesn’t have to wait for file reads to complete. But that’s not critical to the product, so it will have to wait for the next rainy-day weekend (or trans-oceanic flight).
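The preloading idea can be sketched without completion ports at all. The version below is a stand-in, not the planned implementation: it uses a single background worker thread from Python's `concurrent.futures` to read the next buffer while the caller is still working on the current one. The function name, chunk size, and sample data are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import tempfile

def read_chunks_prefetched(path, chunk_size=4096):
    """Yield chunks while a worker thread reads the next one in the background."""
    with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(f.read, chunk_size)
        while True:
            chunk = pending.result()  # wait only if the read hasn't finished
            if not chunk:
                break
            # Start the next read before handing this chunk to the scanner.
            pending = pool.submit(f.read, chunk_size)
            yield chunk

# Hypothetical sample data so the sketch runs as is.
data = b"0123456789" * 1000
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

chunks = list(read_chunks_prefetched(sample_path))
os.remove(sample_path)
```

With one worker the reads stay in file order, and memory use is bounded at two buffers: the one being scanned and the one in flight.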