Non-uniform file encodings in the Eclipse Platform

Last modified: June 12, 2003

Plan item description: Eclipse 2.1 uses a single global file encoding setting for reading and writing files in the workspace. This is problematic when, for example, Java source files in the workspace use the OS default file encoding while XML files in the workspace use UTF-8. The Platform should support non-uniform file encodings. [Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: User experience] (bug 37933, 5399)

The current situation is as follows:

Requirements

Proposed solution

The encoding for a resource (as returned by IResource.getCharset - see API changes) will be:

  1. the encoding explicitly set by a client/user (with IResource.setCharset - see API changes), if any, or
  2. for a file resource, the encoding discovered by an encoding interpreter associated with the file extension, if one exists and can determine the encoding, or
  3. for a file resource, the file encoding determined by its Byte Order Mark, if it exists, or
  4. the resource parent's encoding (except for the workspace root, whose encoding is equivalent to ResourcesPlugin.getEncoding()).
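
The lookup order above can be sketched as follows. This is a minimal model under stated assumptions, not the Platform implementation: ResourceNode and all of its fields are hypothetical names introduced for illustration, and the interpreter and Byte Order Mark results are assumed to have been computed elsewhere and stored as plain strings.

```java
/** Hypothetical stand-in for a workspace resource; not part of the Eclipse API. */
final class ResourceNode {
    final ResourceNode parent;      // null for the workspace root
    final boolean isFile;
    final String workspaceDefault;  // stands in for ResourcesPlugin.getEncoding(); only used by the root

    String explicitCharset;         // step 1: set by a client/user, or null
    String interpreterCharset;      // step 2: result of an encoding interpreter, or null
    String bomCharset;              // step 3: charset implied by a Byte Order Mark, or null

    ResourceNode(ResourceNode parent, boolean isFile, String workspaceDefault) {
        this.parent = parent;
        this.isFile = isFile;
        this.workspaceDefault = workspaceDefault;
    }

    /** Walks the four steps in order and returns the first charset found. */
    String getCharset() {
        if (explicitCharset != null) {
            return explicitCharset;                 // step 1
        }
        if (isFile && interpreterCharset != null) {
            return interpreterCharset;              // step 2
        }
        if (isFile && bomCharset != null) {
            return bomCharset;                      // step 3
        }
        // step 4: inherit from the parent; the root falls back to the workspace default
        return parent != null ? parent.getCharset() : workspaceDefault;
    }
}
```

In this model a file with no explicit setting, no interpreter result, and no Byte Order Mark simply reports whatever its nearest configured ancestor reports.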

Regarding #2, an extension point would allow file-format-aware encoding interpreters to register with the encoding discovery mechanism for specific file types (extensions), or allow existing encoding interpreters to be associated with additional file extensions. Users would be able to associate more file extensions with the known interpreters through a preference.

When creating character-based streams to read or write the contents of a file resource, all clients should pass along the charset string obtained from IFile.getCharset instead of the one provided by ResourcesPlugin.getEncoding. Examples of such clients are text editors, the compiler, search, and compare.
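
A minimal sketch of what this means for a client that reads text follows. CharsetAwareReading and readAll are hypothetical names; in a real client the stream would presumably come from IFile.getContents() and the charset name from the proposed IFile.getCharset(), neither of which is called here so that the sketch stays self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

/** Hypothetical helper; not part of the Eclipse API. */
final class CharsetAwareReading {
    /**
     * Decodes the whole stream using the given charset name rather than a
     * single global default. Throws UnsupportedEncodingException (an
     * IOException) if the charset name is not supported by the JVM.
     */
    static String readAll(InputStream contents, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(contents, charsetName))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

Decoding the same bytes with the wrong charset name does not fail; it silently yields different characters, which is why the per-file value matters.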

Also, setting the encoding for a resource would generate a resource change event, but only for the directly affected resource. Clients interested in the effect that a change to a directory had on the files inside it will have to work that out themselves.

API changes

Added:

public void IResource.setCharset(String charsetName) throws CoreException

Sets the charset name for this resource. The name may be null, which resets the resource to its default charset. For the workspace root, it sets the workspace's default encoding preference to the charset's canonical name (or to the default encoding, if null was provided).

public String IResource.getCharset() throws CoreException

Returns the name of the charset for this resource. For files, if none has been defined (with setCharset), returns the default charset. To determine the default charset, it tries to guess it by calling the corresponding encoding interpreter (if any) and by inspecting the file contents for a Byte Order Mark (BOM), in the order given under Proposed solution. If neither yields a charset, the parent's charset is returned. For the workspace root, a charset corresponding to the workspace's default encoding preference is returned.
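
The Byte Order Mark inspection mentioned above could look roughly like this. BomSniffer and charsetFromBom are hypothetical names, and the sketch assumes only the UTF-8 and UTF-16 marks need to be recognized; the proposal does not say which marks the Platform would support.

```java
/** Hypothetical helper; not part of the Eclipse API. */
final class BomSniffer {
    /**
     * Returns the charset name implied by a Byte Order Mark at the start of
     * the given bytes, or null if no known mark is present.
     */
    static String charsetFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";         // EF BB BF
        }
        if (head.length >= 2) {
            int first = head[0] & 0xFF;
            int second = head[1] & 0xFF;
            if (first == 0xFE && second == 0xFF) {
                return "UTF-16BE";  // FE FF
            }
            if (first == 0xFF && second == 0xFE) {
                return "UTF-16LE";  // FF FE
            }
        }
        return null;                // no recognized mark
    }
}
```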

public boolean IResource.isDefaultCharset() throws CoreException

Returns true if the currently configured charset was not explicitly set by the user, that is, it has a default value either guessed from the file contents or inherited from the parent.

public static final int IResourceDelta.ENCODING = 0x100000;

public String IResourceDelta.getNewCharset();

public String IResourceDelta.getOldCharset();

For notifying changes in file encodings. Both methods are only valid when getKind()==CHANGED and (getFlags()&ENCODING)!=0.
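
The guard a listener would apply before reading the old and new charsets can be sketched as below. EncodingDeltaSketch and its methods are hypothetical; the ENCODING value is the one proposed above, while KIND_CHANGED is a stand-in for the delta kind whose value is chosen only for this sketch.

```java
/** Hypothetical model of the check a resource change listener would make. */
final class EncodingDeltaSketch {
    static final int KIND_CHANGED = 0x4;    // stand-in for the "changed" delta kind (assumed value)
    static final int ENCODING = 0x100000;   // flag value taken from the proposal

    /** True only when the delta is a change and the encoding flag is set. */
    static boolean isEncodingChange(int kind, int flags) {
        return kind == KIND_CHANGED && (flags & ENCODING) != 0;
    }

    /** Reads the old and new charsets only after the guard has passed. */
    static String describe(int kind, int flags, String oldCharset, String newCharset) {
        if (!isEncodingChange(kind, flags)) {
            return "no encoding change";
        }
        return "encoding changed from " + oldCharset + " to " + newCharset;
    }
}
```

Because ENCODING is a single bit, the mask test still works when other change flags are set on the same delta.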

public interface IEncodingInterpreter {
	/** returns null if the charset cannot be determined. */
	public String interpretCharset(java.io.InputStream input);
}
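
As an illustration of what an implementation might do, here is a rough sketch of an interpreter for XML files. XmlDeclarationInterpreter is a hypothetical class, the interface is repeated only so that the sketch compiles on its own, and the approach (matching the encoding pseudo-attribute in the first 128 bytes) is an assumption rather than the Platform's design. It only handles ASCII-compatible encodings with no leading Byte Order Mark.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Copy of the proposed interface, repeated so the sketch is self-contained. */
interface IEncodingInterpreter {
    /** Returns null if the charset cannot be determined. */
    String interpretCharset(InputStream input);
}

/** Hypothetical interpreter that reads the encoding named in an XML declaration. */
final class XmlDeclarationInterpreter implements IEncodingInterpreter {
    // Matches <?xml ... encoding="NAME" or encoding='NAME' at the very start of the input.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    public String interpretCharset(InputStream input) {
        try {
            byte[] head = new byte[128];
            int total = 0;
            int count;
            // Fill the buffer, since a single read may return fewer bytes than requested.
            while (total < head.length
                    && (count = input.read(head, total, head.length - total)) != -1) {
                total += count;
            }
            // ISO-8859-1 maps every byte to one char, so the ASCII declaration survives intact.
            String prefix = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            Matcher matcher = ENCODING.matcher(prefix);
            return matcher.find() ? matcher.group(1) : null;
        } catch (IOException e) {
            return null; // treat unreadable input as "cannot be determined"
        }
    }
}
```

Returning null for a missing declaration lets the lookup fall through to the next source of the charset.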

Encoding interpreters will be associated with file types through a new core resources extension point. Users can associate additional file extensions via preferences.

The platform itself would provide implementations for XML and other popular (?) file formats.

Deprecated:

public int IFile.getEncoding()
public int IFile.ENCODING_* constants

Encoding settings metadata

The encoding settings metadata will be stored inside the project's content area so it can be easily shared.

Scenarios