TIP: 228
Title: Tcl Filesystem Reflection API
Version: $Revision: 1.10 $
Author: Andreas Kupries
Author: Vince Darley
State: Draft
Type: Project
Vote: Pending
Created: 02-Nov-2004
Post-History:
Tcl-Version: 8.7

~ Abstract

This document describes an API which reflects the Filesystem Driver API of the core Virtual Filesystem Layer up into the Tcl level, for the implementation of filesystems in Tcl. It is an independent companion to [219] ('Tcl Channel Reflection API') and [230] ('Tcl Channel Transformation Reflection API'). Just as those TIPs bring the ability to write channel drivers and transformations in Tcl itself into the core, this TIP provides the facilities for the implementation of filesystems in Tcl. This document specifies version ''1'' of the filesystem reflection API.

~ Motivation / Rationale

The purpose of this and the other reflection TIPs is to provide all the facilities required for the creation and usage of wrapped files (= virtual filesystems attached to executables and binary libraries) within the core.

While it is possible to implement all the proposed reflectivity in separate, external packages, doing so means that the core itself cannot make use of wrapping technology and virtual filesystems to encapsulate and attach its own data and library files to itself. That ability is desirable, as it can make the deployment and embedding of the core easier: there are fewer files to deal with and a higher degree of self-containment.

One possible application of a completely self-contained core library would be, for example, the Tcl browser plugin.

While it is also possible to create a special purpose filesystem and channel driver in the core for this type of thing, it is my belief that the general purpose framework specified here is a better solution, as it will also give users of the core the freedom to experiment with their own ideas, instead of constraining them to what we managed to envision.

Another use for reflected filesystems is as a helper for testing the generic filesystem layer of Tcl, by creating filesystems which forcibly return errors, bogus data, and the like.

An implementation of this TIP exists already as a package, '''TclVfs'''. This TIP asks to make that mechanism publicly available to script and package authors, with a bit of cleanup regarding the Tcl level API.

~ Specification of Tcl-Level API

The Tcl level API consists of a single new command, '''filesystem''', and one change to the existing command '''file'''.

The new command is an ensemble command providing five subcommands. These subcommands are '''mount''', '''unmount''', '''info''', '''posixerror''', and '''internalerror'''. (Note that this TIP does not introduce a new C API, but rather exposes an existing C API to Tcl scripts.)

~~ The mount Subcommand

> '''filesystem mount''' ?''-volume''? ''path cmdprefix''

This subcommand creates a new filesystem using the command prefix ''cmdprefix'' as its handler. The API this handler has to provide is specified below, in the section "Command Handler API".

The new filesystem is immediately mounted at ''path''. After completion of the call any access to a subdirectory of ''path'' will be handled by that filesystem, through its handler. The filesystem is represented here by the command prefix, which will be executed whenever an operation on a file or directory within ''path'' has to be performed.

If the option '''-volume''' is specified then the new mount point is also registered with Tcl as a new volume and will from then on appear in the output of the command '''file volumes'''. This is useful (and actually required for reasonable operation) when mounting paths like '''ftp://'''. It should not be used for paths mounted inside the native filesystem.

The new filesystem will be immediately accessible in ''all'' interpreters executed by the current process.

The command returns the empty string as its result.
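As a rough illustration of the two mount forms described above, the following sketch mounts one handler inside the native filesystem and once as a volume. The handler name `demo::handler`, the mount points, and the host name are hypothetical. The '''filesystem''' command is only proposed by this TIP and is absent from released Tcl, so the calls are guarded and skipped where it does not exist.

```tcl
namespace eval demo {}

# Hypothetical handler used only for illustration. It implements just the
# two lifecycle methods; a usable filesystem needs many more.
proc demo::handler {method root args} {
    switch -- $method {
        initialize { return [list 1 {initialize finalize}] }
        finalize   { return }
        default    { error "method \"$method\" not implemented in this sketch" }
    }
}

# 'filesystem' is the command proposed by this TIP; the guard keeps the
# script runnable in an interpreter that lacks it.
if {[llength [info commands ::filesystem]]} {
    # Mount point inside the native filesystem: no -volume.
    filesystem mount /opt/demo ::demo::handler

    # URL-like root: also registered as a volume (assumed host name).
    filesystem mount -volume ftp://ftp.example.org/ ::demo::handler
}
```

A fully-qualified handler name is used here so that the command resolves the same way regardless of the namespace in which the mount is performed.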

Returning a handle or token is not required despite the fact that the handler command can be used in more than one mount operation. The different instances can be clearly distinguished through the ''root'' argument given to each called method. This ''root'' is identical to the ''path'' specified here. In other words, the chosen ''path'' (= mount point) is the handle as well.

We have chosen to use ''early binding'' of the handler command. See the section "Early versus late binding of the handler command" for more detailed explanations.

'''Important note''': The handler command for the filesystem resides in the interpreter performing the mount operation. This interpreter is the '''filesystem interpreter''' mentioned in the section "Interaction with threads and other interpreters".

~~ The unmount Subcommand

> '''filesystem unmount''' ''path''

This method unmounts the reflected filesystem which was mounted at ''path''. An error is thrown if no reflected filesystem was mounted at that location. After the completion of the operation the filesystem which was mounted at that location is no longer visible, and any previous filesystem accessible through this path becomes accessible again.

The command returns the empty string as its result.

~~ The info Subcommand

> '''filesystem info''' ?''path''?

When called without arguments, this method returns a list of all filesystems mounted in all interpreters.

When called with a ''path'', the reflected filesystem responsible for that path is examined and the command prefix used to handle all filesystem operations is returned. An error is thrown if no reflected filesystem is mounted for that path.

There is currently no facility to determine the '''filesystem interpreter''' (nor its thread).

~~ The posixerror Subcommand

> '''filesystem posixerror''' ''error''

This command can be called by a handler command during the execution of a filesystem operation to signal the POSIX error code of a failure.
This also aborts execution immediately, behaving like '''return -code -1'''.

The argument ''error'' is either the integer number of the POSIX error to signal, or its symbolic name, like "EEXIST", "ENOENT", etc.

~~ The internalerror Subcommand

> '''filesystem internalerror''' ''cmdprefix''

This method registers the provided command prefix as the command to call when the core has to report internal errors thrown by a handler command for a reflected filesystem.

If no such command is registered, then internal errors will stay invisible, as the core currently does not provide a way for reporting them through the regular VFS layer.

We have chosen to use ''early binding'' of the handler command. See the section "Early versus late binding of the handler command" for more detailed explanations.

~~ Modifications to the file Command

The existing command '''file''' is modified. Its method '''normalize''' is extended to recognize a new switch, ''-full''. When this switch is specified the method performs a normal expansion of ''path'' first, followed by an expansion of any links in the last element of ''path''. It returns the result of the expansion as its own result.

The new signature of the method is

> '''file normalize''' ?''-full''? ''path''

~ Command Handler API

The Tcl-level handler command for a reflected filesystem has to support the subcommands listed below. Note that the term ''ensemble'' is used to generically describe all commands (prefixes) which are able to process subcommands. This TIP is ''not'' tied to the recently introduced 'namespace ensemble's.

There are three arguments whose meaning does not change across the methods. They are explained now, and left out of the specifications of the various methods.

root: This is always the path the filesystem is mounted at, i.e. the handle of the filesystem. In other words, it is the part of the absolute path we are operating upon which is 'outside' of the control of this filesystem.

relative: This is always the full path to the file or directory the operation has to work on, relative to ''root'' (see above). In other words, it is the part of the absolute path we are operating upon which is 'inside' of the control of the reflected filesystem.

actualpath: This is the exact path which was given to the file command which caused the invocation of the handler command. This path can be absolute or relative. If it is absolute then ''actualpath'' is identical to "root/relative". Otherwise it can be a sub- or super-path of ''relative'', depending on the current working directory.

And finally the list of methods and their detailed specification.

~~ The initialize Method

> ''handler'' '''initialize''' ''root''

This method is called first, and then never again (for the given ''root''). Its responsibility is to initialize all parts of the filesystem at the Tcl level.

The return value of the method has to be a list containing two elements, the version of the reflection API, and a list containing the names of all methods which are supported by this handler.

Any error thrown by the method will prevent the creation of the filesystem and abort the mount operation which caused the call. The thrown error will appear as an error thrown by '''filesystem mount'''.

The current version is ''1''.

~~ The finalize Method

> ''handler'' '''finalize''' ''root''

The method is called when the filesystem is unmounted (see '''filesystem unmount'''), and is the last call a handler can receive for a specific ''root''. This happens just before the destruction of the C level data structures. Still, the command handler must not access the filesystem in any way anymore. It is now its responsibility to clean up any internal resources it allocated to this filesystem.

The return value of the method is ignored. Any error thrown by the method is returned as the error of the '''unmount''' command.
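The two lifecycle methods can be sketched as follows. This is a minimal illustration, not the TclVfs code: the namespace `memfs`, its variables, and the choice to keep per-mount state in a dictionary keyed by ''root'' are all assumptions made for the example. The handler is an ordinary procedure, so it can be exercised directly without the proposed '''filesystem''' command.

```tcl
namespace eval memfs {
    # Per-mount state keyed by root; the mount point doubles as the handle.
    variable state [dict create]
    # Methods this sketch claims to support. A real handler would list
    # every method it implements.
    variable methods {initialize finalize}
}

proc memfs::handler {method root args} {
    variable state
    variable methods
    switch -- $method {
        initialize {
            if {[dict exists $state $root]} {
                # Per the specification, an error here aborts the mount.
                error "already initialized for \"$root\""
            }
            dict set state $root [dict create files {}]
            # Result: API version first, then the supported methods.
            return [list 1 $methods]
        }
        finalize {
            # Last call for this root: release everything held for it.
            dict unset state $root
            return
        }
        default {
            error "unknown method \"$method\""
        }
    }
}
```

Because the state is keyed by ''root'', the same handler command can serve several mounts at once, which is why the mount operation does not need to hand out a separate token.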

~~ The access Method

> ''handler'' '''access''' ''root relative actualpath mode''

This method is called to determine the "access" permissions for the file (''relative'').

It has to either return successfully, or signal a POSIX error (see '''filesystem posixerror'''). The latter means that the permissions asked for via ''mode'' are not compatible with the file. Any result returned by the method is ignored.

Regular errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The argument ''mode'' is a list containing any of the strings '''read''', '''write''', and '''exe''', the permissions the file has to have for the request to succeed.

* '''write''' contained in ''mode'' implies "writable".

* '''read''' contained in ''mode'' implies "readable".

* '''exe''' contained in ''mode'' implies "executable".

~~ The createdirectory Method

> ''handler'' '''createdirectory''' ''root relative actualpath''

This method has to create a directory with the given name (''relative''). The command can assume that ''relative'' does not exist yet, but that the directory containing ''relative'' does. The C level of the reflection takes care of this.

Any result returned by the method is ignored. Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

~~ The deletefile Method

> ''handler'' '''deletefile''' ''root relative actualpath''

This method has to delete the file ''relative''.

Any result returned by the method is ignored. Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

~~ The fileattributes Method

> ''handler'' '''fileattributes''' ''root relative actualpath'' ?''index''? ?''value''?

The command has to return a list containing the names of all acceptable attributes, if neither ''index'' nor ''value'' was specified.

The command has to return the value of the ''index'''th attribute if the ''index'' is specified, but not the ''value''. The attributes are counted in the same order as their names appear in the list returned by a call where neither ''index'' nor ''value'' was specified. The first attribute has the index 0.

The command has to set the value of the ''index'''th attribute to ''value'' if both ''index'' and ''value'' were specified for the call. Any result returned by the method is ignored for this case.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

~~ The matchindirectory Method

> ''handler'' '''matchindirectory''' ''root relative actualpath pattern types perm mac''

This method has to return the list of files or directories in the path ''relative'' which match the glob ''pattern'', are compatible with the specified list of ''types'', have the given ''perm''issions and ''mac'' creator/type data. The specified path is always the name of an existing directory.

'''Note''': As the core VFS layer generates requests for directory-only matches from the filesystems involved when performing any type of recursive globbing, this subcommand absolutely has to handle such (and file-only) requests correctly or bad things (TM) will happen.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

''types'' is a list of strings, interpreted as a set. The strings are the names of the types of files the caller is looking for. Allowed strings are '''files''' and '''dirs'''. The command has to return all files which match '''at least one''' of the types. If ''types'' is empty then all types are valid.

''perm'' is a list of permission strings (i.e.
a set), namely '''read''', '''write''', and '''exe'''. The command has to return all files which have '''at least all''' the given permissions. If ''perm'' is empty then no permissions are required.

''mac'' is a list containing 2 strings, for Macintosh creator and type. If ''mac'' is empty then the data is irrelevant.

~~ The open Method

> ''handler'' '''open''' ''root relative actualpath mode permissions''

This command has to return a list describing the successfully opened file ''relative'', or throw an error describing how the operation failed.

The thrown error will appear as an error thrown by the ''open'' command which caused the invocation of the handler.

The list returned upon success contains at least one and at most two elements. The first element is obligatory and is always the handle of the channel which was created to allow access to the contents of the file.

If the second element is present it will be interpreted as a callback, i.e. a command prefix. This prefix will always be executed as is, i.e. without additional arguments. Any required arguments have to be returned as part of the result of the call to '''open'''. This callback is fully specified in section "The channel close callback".

The argument ''mode'' specifies if the file is opened for read, write, both, appending, etc. Its value is one of the strings '''r''', '''w''', '''a''', '''w+''', or '''a+'''.

The argument ''permissions'' determines the native mode the opened file is created with. This is relevant only if the ''mode'' actually requests the creation of a non-existing file, i.e. is not '''r'''.

'''Note''': it is possible to return a channel implemented through reflection here. See also section "The channel close callback" for more.

~~ The removedirectory Method

> ''handler'' '''removedirectory''' ''root relative actualpath recursive''

This method has to delete the given directory. The argument ''recursive'' is a boolean value.
The method has to signal the POSIX error "EEXIST" if ''recursive'' is '''false''' and the directory is not empty. Otherwise it has to attempt to recursively delete the directory and its contents.

Any result returned by the method is ignored. Regular errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

~~ The stat Method

> ''handler'' '''stat''' ''root relative actualpath''

This method has to return a dictionary containing the stat structure for the file ''relative''.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The following keys and their values have to be provided by the filesystem:

dev: A long integer number, the device number of the path stat was called for. This number is optional and always overwritten by the C level of the filesystem reflection.

ino: A long integer number, the inode number of the path stat was called for.

mode: An integer number, the encoded access mode of the path. It is this mode which is checked by the method '''access'''.

nlink: A long integer number, the number of hard links to the specified path.

uid: A long integer number, the id of the user owning the virtual path.

gid: A long integer number, the id of the user group the virtual path belongs to.

size: A long integer number, the true size of the virtual path, in bytes.

atime: A long integer number, the time of the latest access to the path, in seconds since the epoch. Convertible into a readable date/time by the command '''clock format'''.

mtime: A long integer number, the time of the latest modification of the path, in seconds since the epoch. Convertible into a readable date/time by the command '''clock format'''.

ctime: A long integer number, the time the path was created, in seconds since the epoch.
Convertible into a readable date/time by the command '''clock format'''.

type: A string, either '''directory''', or '''file''', describing the type of the given path.

Notes: The stat data is highly Unix-centric, especially device node, inode, and the various ids for file ownership.

While the latter are not that important, both device and inode number can be crucial to higher-level algorithms. An example would be a directory walker using the device/inode information to keep itself out of infinite loops generated by symbolic links referring to each other. Returning non-unique device/inode information will most likely cause such a walker to skip over paths under the wrong assumption of having seen them already.

To prevent the various reflected filesystems from stomping over each other with regard to device ids, this information will be generated by the common C level of the filesystem reflection. The inode numbers however have to be assigned by the filesystem itself.

It is possible to make a higher-level algorithm depending on device/inode data aware of the problem with virtual filesystems (and this has actually been done, see the Tcllib directory walker). This however is a kludgy solution and should be avoided.

~~ The utime Method

> ''handler'' '''utime''' ''root relative actualpath atime ctime mtime''

This method has to set the access and modification times of the file ''relative''. The access time is set to ''atime'', the creation time to ''ctime'', and the modification time to ''mtime''. The arguments are positive integer numbers, the number of seconds since the epoch.

Any result returned by the method is ignored. Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

~~ The copyfile Method

> ''handler'' '''copyfile''' ''root relative_src actualpath_src relative_dst actualpath_dst''

This method is optional.
It has to create a copy of a file in the filesystem under a different name, in the '''same''' filesystem. This method is not for copying of files between different filesystems and won't be called for such.

Any result returned by the method is ignored. Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

If this method is not supported the core filesystem layer will fall back to a Tcl & channel based method of copying the file. The same fallback will happen if the method is available, but signals the POSIX error "EXDEV".

~~ The copydir Method

> ''handler'' '''copydir''' ''root relative_src actualpath_src relative_dst actualpath_dst''

This method is optional. It has to create a recursive copy of a directory in the filesystem under a different name, in the '''same''' filesystem. This method is not for copying of directories between different filesystems and won't be called for such.

Any result returned by the method is ignored. Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

If this method is not supported the core filesystem layer will fall back to a Tcl based method of copying the directory file by file. The same fallback will happen if the method is available, but signals the POSIX error "EXDEV".

~~ The rename Method

> ''handler'' '''rename''' ''root relative_src actualpath_src relative_dst actualpath_dst''

This method is optional. It has to rename a file in the filesystem, giving it a different name in the '''same''' filesystem. This method is not for the renaming of files between different filesystems and won't be called for such.

Any result returned by the method is ignored. Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

If this method is not supported the core filesystem layer will fall back to a Tcl & channel based method of renaming the file. The same fallback will happen if the method is available, but signals the POSIX error "EXDEV".

~ Interaction with Threads and Other Interpreters

Virtual filesystems in Tcl are process global structures. In other words, they are seen and accessible by all interpreters, and all threads in the current process. For filesystems implemented completely at the C-level this is not that big a problem.

However a filesystem implemented based on the reflection here will always be associated with a Tcl interpreter, the interpreter executing the requested filesystem operations. This cannot be avoided as only the interpreter containing the handler command also has all the state required by it.

The filesystem/interpreter association also implies that any such filesystem is associated with a particular thread, the thread containing that interpreter.

Filesystem requests coming from a different interpreter are handled by executing the driver functionality in the filesystem interpreter instead.

In the case of requests coming from a different thread the C level part of the reflection will post specialized events to the filesystem thread, essentially forwarding the invocations of the driver.

When a thread or interpreter is deleted all filesystems mounted with the '''filesystem mount''' command using this thread/interpreter as their computing base will be automatically unmounted and deleted as well. This pulls the rug out from under the other thread(s) and/or interpreter(s); this however cannot be avoided. Future accesses will either fail because the virtual files are now missing, or will access different files provided by a different filesystem now owning the path.

~ Interaction with Safe Interpreters

The command '''filesystem''' is unsafe and safe interpreters are not allowed to use it.
The reason behind this restriction: The ability of mounting filesystems gives a safe interpreter the ability to inject code into a trusted interpreter. The mechanism is as follows:

* An application using a trusted master interpreter and safe slaves for plugins reads and evaluates a file '''foo''' directly in the trusted interpreter.

* A malicious plugin loaded into one of the safe slaves knows about this file '''foo''', and its actual location. It mounts a virtual filesystem using a driver which is part of its own code, over the directory '''foo''' is in.

* When the trusted interpreter reads '''foo''', it does not go to the native filesystem anymore, but the mounted filesystem. In other words the driver in the slave provides the contents, the code which is executed in the trusted environment. From here on the slave can do anything it wishes in the trusted environment.

* Access to any other file in the directory can be passed through unchanged to the filesystem originally owning the path.

~ The Channel Close Callback

The channel close callback is an optional callback which can be set up by the Tcl layer when a file is opened. This is done in the '''open''' method, by returning a 2-element list. The first element is the channel handle as usual and the second element the command prefix of the callback.

The command prefix is early-bound, i.e. the command will be resolved when the callback is set up. The resolution happens in the current context, and thus can be anywhere in the application. Because of this it is strongly recommended to use a fully-qualified command name in the callback.

The callback is executed in the current context of the operation which caused the channel to close. It is executed just before the channel is closed '''by the generic filesystem layer'''. The callback itself '''must not''' call '''close'''. It will always be executed as is, i.e. without additional arguments. Any required arguments have to be made part of the prefix when it is set up.
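The following sketch shows one way an '''open''' method might build such a 2-element result. It is an illustration under several assumptions: the namespace `memfs`, the procedures `openfile` and `onclose`, and the in-memory `contents` store are hypothetical, the backing channel is a temporary file created with '''file tempfile''' (Tcl 8.6 or later), and POSIX error signalling is omitted. The last lines simulate by hand what the core would do, so the sketch runs without the proposed '''filesystem''' command.

```tcl
namespace eval memfs {
    # Hypothetical in-memory store mapping "root/relative" to file contents.
    variable contents [dict create]
}

# Body of an 'open' method. Returns the channel plus a close callback.
proc memfs::openfile {root relative actualpath mode permissions} {
    variable contents
    set key $root/$relative
    set chan [file tempfile]   ;# read-write channel on a scratch file
    if {$mode ni {w w+} && [dict exists $contents $key]} {
        # Preload existing data for the reading and appending modes.
        puts -nonewline $chan [dict get $contents $key]
        flush $chan
        if {$mode eq "r"} { seek $chan 0 }
    }
    # The callback runs without extra arguments, so everything it needs is
    # baked into the prefix. The name is fully qualified because the
    # prefix is resolved in whatever context happens to be current.
    return [list $chan [list ::memfs::onclose $key $chan]]
}

# Close callback: copies the channel contents back into the store.
# It must not close the channel itself.
proc memfs::onclose {key chan} {
    variable contents
    seek $chan 0
    dict set contents $key [read $chan]
    return
}

# Simulation of the core's side of the protocol (assumed sequence).
lassign [memfs::openfile /mnt/a hello.txt /mnt/a/hello.txt w 0644] chan callback
puts -nonewline $chan "hello"
flush $chan
{*}$callback
close $chan
```

Reading the channel back inside the callback relies on the channel still being usable for '''seek''' and '''read''' at that point, which this section goes on to state.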

The channel is still live enough at the time of the call to allow '''seek''' and '''read''' operations. In addition all available data will have been flushed into it already. This means, for example, that the callback can seek to the beginning of the said channel, read its contents and then store the gathered data elsewhere. In other words, this callback is not only crucial to the cleanup of any resources associated with an opened file, but also for the ability to implement a filesystem which can be written to.

This does assume that the filesystem does not use a reflected channel to access the contents of the virtual file. If a reflected channel is used however, the close callback is not required, as the ''finalize'' method of the channel can be used for the same purpose.

Under normal circumstances the return code and any errors thrown by the callback itself are ignored. In that case errors have to be signaled asynchronously, for example by calling ''bgerror''.

Any result returned by the callback is ignored. Errors thrown by the callback are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

'''Note''' that it is possible that the channel we are working with here is implemented through reflection. The order in which the various callbacks are called during closing is this:

* The channel for the file is closed via ''close'' by the VFS.

* The channel close callback has been set up as a regular close handler, and is called now.

* The close function of the channel driver is called, reflected into the Tcl level and cleans it up.

* The close operation completes.

The important point here is that the channel close callback set up by the filesystem is definitely called before the reflected channel cleans up its Tcl layer, so the assertion above about the channel being live enough to be read and saved from the filesystem Tcl layer holds even if both filesystem and channel are reflected.
It also holds if reflected transformations are involved.

~ Early versus Late Binding of the Handler Command

We have two principal methods for using the handler command. These are called early and late binding.

Early binding means that the command implementation to use is determined at the time of the creation of the channel, i.e. when ''chan create'' is executed, before any methods are called. Afterward it cannot change. The result of the command resolution is stored internally and used until the channel is destroyed. Renaming the handler command has no effect. In other words, the system will automatically call the command under the new name. The destruction of the handler command is intercepted and causes the channel to close as well.

Late binding means that the handler command is stored internally essentially as a string, and this string is mapped to the implementation to use for each and every call to a method of the handler. Renaming the command, or destroying it means that the next call of a handler method will fail, causing the higher level channel command to fail as well. Depending on the method the error message may not be able to explain the reason of that failure.

Another problem with this approach is that the context for the resolution of the command name has to be specified explicitly to avoid problems with relative names. Early binding resolves once, in the context of the ''chan create''. Late binding performs resolution anywhere where channel commands like '''puts''', '''gets''', etc. are called, i.e. in a random context. To prevent problems with different commands of the same name in several namespaces it becomes necessary to force the usage of a specific fixed context for the resolution.

Note that moving a different command into place after renaming the original handler allows the Tcl level to change the implementation dynamically at runtime.
This however is not really an advantage over early binding as the early bound command can be written such that it delegates to the actual implementation, and that can then be changed dynamically as well.

~ Limitations

For now this section documents the existing limitations of the reflection.

The code of the package '''TclVfs''' has only a few limitations.

* One subtlety one has to be aware of is that mixing case-(in)sensitive filesystems and application code may yield unexpected results.

> For example, when a case-sensitive virtual filesystem is mounted into a case-insensitive system (like the standard Windows or MacOS filesystems) and then used with code relying on case-insensitivity, problems will appear when accessing the virtual filesystem.

> Note that application code relying on case-insensitivity will not work under Unix either, i.e. is inherently non-portable, and should be fixed.

* The C APIs for the methods '''link''' and '''lstat''' are currently not exposed to the Tcl level. This may be done in the future to allow virtual filesystems implemented in Tcl to support the reading and writing of links.

> '''Note''' - Exposure of links may require path normalization and native path generation, something the TclVfs implementation does not support. This limitation regarding any type of link, hard or soft, is quite deeply entrenched in the TclVfs code.

* The public C-API filesystem function '''Tcl_FSUtime''' is Unix-centric, its main data argument is a ''struct utimbuf *''. This structure contains only a single value for both ''atime'' and ''ctime''. The method '''utime''' of the handler command was nevertheless defined to take separate values for access and creation times, in case this changes in the future.

* The Tcl core VFS layer was written very near to regular filesystems and has no way to transport regular Tcl error messages through it. This is the reason for the introduction of the internal error callback.
This problem cannot be fixed within the 8.5 line as it requires more extensive changes to the public API. Note that when such changes are done the reflection API has to change as well, as it then allows the direct passing of errors. At that point the C layer of the reflection will have to support both this and the new version of the API.

~ Examples of Filesystems

The filesystems provided by '''TclVfs''' are all examples.

* webdav

* ftp sites

* http sites

* zip archive

* tar archive

* metakit database

* namespace/procedures as filesystem

* widget fs

Some examples can be found on the Tcler's Wiki, see pages referring to http://wiki.tcl.tk/11851

* Encryption

* Compression

* Jails

* Quotas

~ Reference Implementation

The package '''TclVfs''' [http://sourceforge.net/projects/tclvfs/] can serve as the basis for a reference implementation. The final reference implementation will be provided at SourceForge, as an entry in the Tcl Patch Tracker. The exact url will be added here when it becomes available.

~ Comments

Comments on [http://mini.net/tcl/12328] suggest it might be a good idea to modify the 'file attributes' callback to make it more efficient for vfs writers, especially across a network and when vfs's are stacked. Currently one needs to make multiple calls to accomplish anything.

[[ Add comments on the document here ]]

~ Copyright

This document has been placed in the public domain.