Thomas Tempelmann | Mac OS X: FSRef versus POSIX path performance

Some newcomers to OS X programming seem to believe that everything that was added to OS X from Unix is better than what the old Mac OS APIs offered. And looking at the new Cocoa APIs, even some at Apple seem to think so, regrettably.

Update (Dec 2008): I've since learned that I was wrong on a few things. E.g., there's another FS API at a deeper level which is used by both the POSIX and the Carbon APIs. That one is where the best performance can be had. Furthermore, unless running as root, even the FSRef API functions need to check the parent folders along the path for permissions, although that is apparently still more efficient, probably due to better caching etc. Lastly, Apple plans to introduce a new API in OS X 10.6 which effectively creates an opaque object per file or dir reference, eventually making it the best performing way to identify file system items.

Update (Feb 2012): The new API is based on URL references, and the FSRef based API has been declared deprecated in OS X 10.8. While URLs appear to be no different from POSIX paths, the key difference is that URLs are objects, meaning that the OS can tell how long the URL is in use, and keep it cached based on this info. FSRefs do not allow this, making them especially inefficient for file servers.

I'll try to clarify one misunderstanding about file system performance on OS X in this regard.

Mac OS X has two APIs to access directories on a volume:

Carbon File Manager (FSRef based)
POSIX / BSD (path based)

Fortunately, nothing indicates that Apple plans to remove the FSRef API (Update Feb 2012: Well, its use is now discouraged, see above).

As a general rule: If you need speed, avoid the path based API

Why paths are slow

The path based API uses a string to describe the location of a file or folder, listing all its parent folders up to the root. So, to locate an item's data on the disk, each path component has to be parsed and looked up. The deeper the item (i.e. the more components its path has), the more lookups are needed. Each lookup requires a (possibly cached) read from the directory file on the volume, and a search therein. This can amount to a lot of disk accesses, which cost time. (Note: This is a somewhat unproven statement I made based on how they work. Apparently, there is a lot of caching going on.)
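To make the per-component cost a bit more tangible, here is a minimal C sketch that mimics the lookup sequence from user space, assuming a POSIX system. The helper name count_component_lookups is made up for this illustration. The real resolution happens inside the kernel and is heavily cached, so this only shows how the number of lookups grows with path depth, not how the OS actually implements it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper (not part of any API): walks an absolute path one
 * component at a time and calls lstat() on every prefix, e.g. "/a", "/a/b",
 * "/a/b/c". Returns the number of lookups made, or -1 if a component is
 * missing or the path is too long. */
int count_component_lookups(const char *path)
{
    assert(path != NULL && path[0] == '/');

    char prefix[PATH_MAX];
    size_t len = strlen(path);
    if (len >= sizeof prefix)
        return -1;

    int lookups = 0;
    struct stat st;
    for (size_t i = 1; i <= len; i++) {
        /* A component ends at a '/' or at the end of the string. */
        if (path[i] != '/' && path[i] != '\0')
            continue;
        /* Skip empty components such as the second '/' in "//". */
        if (path[i - 1] == '/')
            continue;

        memcpy(prefix, path, i);
        prefix[i] = '\0';
        if (lstat(prefix, &st) != 0)
            return -1;
        lookups++;
    }
    return lookups;
}
```

Calling count_component_lookups() on a path that is five folders deep performs five lookups, whereas a reference that already identifies the item would not need to repeat them.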

The FSRef access, on the other hand, is something like a cache in itself. It tells the OS where to look for a directory entry, without the need to search through all its parent folders.

(Update (Dec 2008): The fixed size (80 bytes) of an FSRef record causes problems with some volume types, e.g. network volumes, as the OS may then need more memory to store a file reference. It thus needs to maintain another cache for those references, and since there's no notification to the OS once an FSRef is no longer needed by an app, the OS has a hard time optimizing for both memory and speed needs. The new API in 10.6 will deal with this, the references being CF-based objects.)

But there's more

The FSRef based API also provides some highly optimized functions to browse directories or even entire volumes.

The FSGetCatalogInfoBulk function allows you to read all entries of a folder in one go, retrieving practically all information that's available in the directory. On the other hand, the POSIX API only supplies the scandir() function, which retrieves the names and inode numbers of entries; for most other properties, more calls have to be made, which again takes much more time.
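The extra calls on the POSIX side look roughly like the following C sketch. It uses readdir() rather than scandir() for brevity, and fstatat() to fetch a property (here the size) that the directory listing itself does not carry. The helper name scan_dir_with_stat is an assumption made for this example, and the code says nothing about how FSGetCatalogInfoBulk works internally.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: lists the entries of `dir` and sums their sizes.
 * The listing only yields names and inode numbers, so every entry costs
 * one additional fstatat() call. Returns the entry count, or -1 on error. */
long scan_dir_with_stat(const char *dir, long long *total_size)
{
    assert(dir != NULL && total_size != NULL);

    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    long entries = 0;
    *total_size = 0;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        struct stat st;
        /* One extra system call per entry; symlinks are not followed. */
        if (fstatat(dirfd(d), e->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0)
            *total_size += (long long)st.st_size;
        entries++;
    }

    closedir(d);
    return entries;
}
```

For a folder with N entries this makes N calls on top of the listing itself, which is the overhead a bulk call avoids.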

If you care to learn about all items on a volume, you can even use FSCatalogSearch, which is incredibly fast on HFS (and HFS+) volumes. A test on my Mac Pro (2.8 GHz) allowed me to get all properties (name, time stamps, permissions, etc.) of a volume with 4.3 million items in 3.5 minutes. The only thing that it failed to give me was the inode of hard links, for which I had to fall back to path based methods, which increased the overall time to scan the entire volume to about 3 hours on a Time Machine backup volume where about 70% of all items were hard links. (Admittedly, the speed difference would be a lot smaller if I iterated the volume recursively through its folders, where the path-based access would then use the previously cached parent folders, but the overall speed would still be much worse than with FSCatalogSearch. I just wanted to make a point about the advantages of the Carbon based API.)

So, you see, if you have an application that does a lot of directory scanning, the "old" Carbon File Manager API can beat the Unix / POSIX path based APIs by orders of magnitude.

Other advantages of the Carbon File Manager API

There is no path length limit (paths are limited to 1024 bytes on OS X)
Renaming non-root volumes and folders causes no problems with items inside such volumes or folders because the FSRef keeps the reference to the item even if its path changes.
Similarly, Aliases, which are soft links using references similar to FSRefs, keep their reference intact if an item's parent folder or volume name changes.
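The difference between a reference and a path can be shown with plain POSIX calls, too. The C sketch below uses an open file descriptor as a loose analogy for a reference. It is not an FSRef and behaves differently in many respects, and the helper name descriptor_survives_rename and the file names it creates are made up for this example. After a rename, the old path no longer finds the file, while the descriptor still identifies the same item.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates a scratch file in `dir`, renames it, and
 * checks that the old path is gone while the open descriptor still refers
 * to the same file. Returns 1 if so, 0 if not, -1 on a setup error. */
int descriptor_survives_rename(const char *dir)
{
    assert(dir != NULL);

    char old_path[1024], new_path[1024];
    long pid = (long)getpid();
    int n1 = snprintf(old_path, sizeof old_path, "%s/ref-demo-old.%ld", dir, pid);
    int n2 = snprintf(new_path, sizeof new_path, "%s/ref-demo-new.%ld", dir, pid);
    if (n1 < 0 || (size_t)n1 >= sizeof old_path ||
        n2 < 0 || (size_t)n2 >= sizeof new_path)
        return -1;

    int fd = open(old_path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    int result = -1;
    struct stat before, after, by_path;
    if (fstat(fd, &before) == 0 && rename(old_path, new_path) == 0) {
        /* The path based lookup fails: the name no longer exists. */
        int path_gone = (stat(old_path, &by_path) != 0);
        /* The descriptor still reaches the same device and inode. */
        int same_item = (fstat(fd, &after) == 0 &&
                         after.st_dev == before.st_dev &&
                         after.st_ino == before.st_ino);
        result = (path_gone && same_item) ? 1 : 0;
        unlink(new_path);
    } else {
        unlink(old_path);
    }

    close(fd);
    return result;
}
```

On a writable folder, descriptor_survives_rename() is expected to return 1. Unlike an FSRef or an Alias, a descriptor only lives as long as the process keeps it open, so the analogy covers the renaming behaviour only.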

Programming