Non-uniform file encodings in the Eclipse Platform

Last modified: February 23, 2004

Plan item description: Eclipse 2.1 uses a single global file encoding setting for reading and writing files in the workspace. This is problematic; for example, when Java source files in the workspace use OS default file encoding while XML files in the workspace use UTF-8 file encoding. The Platform should support non-uniform file encodings. [Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: User experience] (bug 37933, 5399)

The pre-M7 situation is as follows:

Requirements 

Proposed solution

In addition to the existing approach of having a single global encoding for a workbench, we propose
  1. an extensible mechanism to determine the encoding of a stream by analyzing its contents or, if available, its file name,
  2. to add a default encoding property to projects. This default encoding is used if no encoding could be determined in the first step.
We do not (yet) propose a settable encoding attribute per file.

The encoding for a stream or an IStorage (as returned by the two getCharset methods; see API changes) will be:
  1. the encoding discovered by a content interpreter associated with the file extension (or file type), if one exists and can determine the encoding, or
  2. the default encoding defined for the enclosing project, if any, or
  3. the global workspace encoding (equivalent to ResourcesPlugin.getEncoding()).
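The lookup order above can be sketched as a simple fallback chain. This is only an illustration: the class CharsetResolution and its resolve method are hypothetical names, not part of the proposed API, and a null argument stands for "could not be determined" or "not set".

```java
// Hypothetical helper, not part of the proposed API. It only illustrates the
// order in which the three sources of an encoding are consulted.
final class CharsetResolution {
    private CharsetResolution() {}

    /**
     * @param detected         encoding found by a content interpreter, or null if none was found
     * @param projectDefault   default encoding of the enclosing project, or null if not set
     * @param workspaceDefault global workspace encoding (assumed to be non-null)
     * @return the first encoding available in the order described above
     */
    static String resolve(String detected, String projectDefault, String workspaceDefault) {
        if (detected != null) {
            return detected;          // 1. content interpreter
        }
        if (projectDefault != null) {
            return projectDefault;    // 2. project default
        }
        return workspaceDefault;      // 3. global workspace encoding
    }
}
```

For example, resolve(null, "ISO-8859-1", "Cp1252") yields the project default, while resolve(null, null, "Cp1252") falls through to the workspace setting.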

Regarding #1, an extension point would allow file-format-aware encoding interpreters to register with the encoding discovery mechanism for specific file types (extensions), or let clients associate existing encoding interpreters with their own file extensions. Users would be able to associate further file extensions with the known interpreters (via a preference).

All clients that create character-based streams to read or write the contents of a file resource should pass along the charset string obtained from one of the getCharset methods instead of the one provided by ResourcesPlugin.getEncoding(). Examples are text editors, the compiler, search, and compare.
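On the client side this amounts to passing the charset name to the reader (or writer) instead of relying on a global default. The following is a minimal sketch using only java.io; the class CharsetAwareReading and its readAll method are hypothetical, and the charsetName argument is assumed to come from one of the proposed getCharset methods.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical client helper, shown for illustration only.
final class CharsetAwareReading {
    private CharsetAwareReading() {}

    /**
     * Reads the whole stream as text using the given charset name.
     * An unknown charset name results in an UnsupportedEncodingException,
     * which is a subclass of IOException.
     */
    static String readAll(InputStream contents, String charsetName) throws IOException {
        try (Reader reader = new BufferedReader(new InputStreamReader(contents, charsetName))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        }
    }
}
```

The same bytes decode to different text depending on the charset passed in, which is why the per-resource value matters.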

API changes

Added:

To make the encoding support available for non-workspace based resources we propose to add the following method to org.eclipse.core.runtime.IPlatform:
public interface IPlatform {
// ...
public String getCharset(InputStream stream, String fileExtension) throws CoreException;
// ...
}

The InputStream seems to be the most widely used and scalable mechanism to get access to any kind of byte content. InputStreams can be easily created for a java.io.File, an IStorage (which subsumes IFile and IFileState; see below), as well as for bytes in memory (ByteArrayInputStream).
The optional file extension argument can be used to quickly reject more expensive ways of inferring the encoding from the contents.

A corresponding implementation (based on IContentInterpreters; see below) lives in org.eclipse.core.runtime.Platform.

For the resource plugin we propose to add a new interface IEncodedStorage that adds the single method getCharset to the existing IStorage interface:
interface IEncodedStorage extends IStorage {
public String getCharset() throws CoreException;
}

Its method getCharset returns the name of the encoding for an IStorage. It would make sense to add this method directly to the IStorage interface, since any InputStream can only be interpreted correctly if the used encoding is known. But because clients are allowed to implement IStorage this would be a breaking API change, so we decided to introduce a separate extension to IStorage.
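Because the method lives in a separate interface, a client that receives an arbitrary IStorage would have to test for IEncodedStorage before asking for the charset. The sketch below uses simplified stand-ins for the Eclipse types (the real IStorage has more methods, and the proposed signatures throw CoreException); StorageCharsets and charsetFor are hypothetical names.

```java
import java.io.InputStream;

// Simplified stand-in for org.eclipse.core.resources.IStorage (assumption:
// reduced to a single method, without CoreException).
interface IStorage {
    InputStream getContents();
}

// Simplified stand-in for the proposed interface.
interface IEncodedStorage extends IStorage {
    String getCharset();
}

// Hypothetical client helper, shown for illustration only.
final class StorageCharsets {
    private StorageCharsets() {}

    /**
     * Returns the storage's own charset if it implements IEncodedStorage and
     * reports one; otherwise returns the caller-supplied fallback.
     */
    static String charsetFor(IStorage storage, String fallback) {
        if (storage instanceof IEncodedStorage) {
            String charset = ((IEncodedStorage) storage).getCharset();
            if (charset != null) {
                return charset;
            }
        }
        // Existing IStorage implementations keep working unchanged.
        return fallback;
    }
}
```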

Two existing interfaces will extend IEncodedStorage: IFile and IFileState. Two concrete classes will provide an implementation: File and FileState.

For both files and file states, the implementation of getCharset first uses IPlatform.getCharset(...) from above to find an encoding based on any registered IContentInterpreters. If no encoding can be determined, File.getCharset() locates the enclosing project of the file and queries its IProjectDescription for a default encoding. For this we need the following two new methods on IProjectDescription:

interface IProjectDescription {
// ...
public String getDefaultCharset();
public void setDefaultCharset(String charset);
// ...
}
If no default encoding has been defined for the project, the workspace's default encoding preference is returned (via the existing API).

Other implementers of IStorage will have to decide whether they should base their implementation on IEncodedStorage.

The implementation of Platform.getCharset will make use of content interpreters that implement the IContentInterpreter interface and that can be associated with file types through a new Core Runtime extension point, "org.eclipse.core.runtime.contentInterpreter". Users can associate additional file extensions via preferences.

The method interpretContent does not return the detected encoding but stores it in a result object of type IContentInfo that is passed in as an argument. This approach makes it possible to collect additional information (like 'type'/'subtype') instead of just the encoding.

interface IContentInterpreter {    
	public void interpretContent(IContentInfo result, InputStream contents);
}
The IContentInfo is:
public interface IContentInfo {
	public void setCharset(String charset);
	public String getCharset();
}
Since we would not allow clients to implement (or extend) IContentInfo, we will be able to extend the API with new setters and getters in the future without breaking API.

The platform itself would provide implementations of IContentInterpreter for XML and other popular file formats.
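As an illustration of what such an interpreter could look like, the following sketch reads the encoding pseudo-attribute from an XML declaration. It is not the platform's implementation. The two interfaces are local copies of the ones proposed above, SimpleContentInfo and XmlDeclarationInterpreter are hypothetical names, and the sketch assumes an ASCII-compatible encoding (it does not handle byte order marks or UTF-16).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local copies of the proposed interfaces, so that the sketch is self-contained.
interface IContentInfo {
    void setCharset(String charset);
    String getCharset();
}

interface IContentInterpreter {
    void interpretContent(IContentInfo result, InputStream contents);
}

// Hypothetical result object; in the proposal only the platform implements IContentInfo.
final class SimpleContentInfo implements IContentInfo {
    private String charset;
    public void setCharset(String charset) { this.charset = charset; }
    public String getCharset() { return charset; }
}

// Hypothetical interpreter that looks only at the XML declaration.
final class XmlDeclarationInterpreter implements IContentInterpreter {
    // Only the first bytes are inspected, so large files stay cheap to analyze.
    private static final int PROLOG_LIMIT = 256;
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    public void interpretContent(IContentInfo result, InputStream contents) {
        try {
            byte[] head = new byte[PROLOG_LIMIT];
            int total = 0;
            int count;
            while (total < head.length
                    && (count = contents.read(head, total, head.length - total)) != -1) {
                total += count;
            }
            // ISO-8859-1 maps every byte to one char, which is enough to find an ASCII declaration.
            String prolog = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            Matcher matcher = ENCODING.matcher(prolog);
            if (matcher.find()) {
                result.setCharset(matcher.group(1));
            }
        } catch (IOException e) {
            // Leave the result untouched; the caller falls back to the next default.
        }
    }
}
```

If no declaration is found, the result's charset stays unset, which lets the caller continue with the project or workspace default.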

Deprecated:

public int IFile.getEncoding()
public int IFile.ENCODING_* constants

public String ResourcesPlugin.getEncoding(): Since all clients of this method will most likely have to adapt their code, I suggest deprecating getEncoding() and introducing a new method getDefaultCharset() that better reflects its real purpose (and brings it more in line with IProjectDescription.getDefaultCharset()).

UI Changes

We need to add new UI for changing the default encoding of a project. A good place for this would be the Property dialog, since the encoding can be considered a property of the project, similar to the read-only property. The property dialog for files would only show the current value of the encoding but would not allow changing it.

We should provide a "Convert Encoding" action that converts the contents of a file (or all files in a hierarchy) to a different encoding. This action would ask the user for two encodings: the first is used when reading all selected files and the second when writing these files back to the workspace.
The action would not change the encoding value returned by getCharset() but it would provide a means to make the encoding of multiple files consistent with the default encoding of the enclosing project.
(An alternative to this UI would be to provide something like a "Save with encoding" action for editors. But this UI seems to be less convenient if the encoding of multiple files needs to be changed).
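The core of such a conversion is a decode with the first encoding followed by an encode with the second. A minimal sketch, operating on bytes in memory and using only java.nio.charset, might look as follows. EncodingConverter is a hypothetical name, and reporting errors instead of silently replacing characters is a design assumption of this sketch, not something the proposal specifies.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, shown for illustration only.
final class EncodingConverter {
    private EncodingConverter() {}

    /**
     * Re-encodes the given file contents from one charset to another.
     * Fails with a CharacterCodingException if the input is not valid in the
     * source charset or contains characters the target charset cannot represent.
     */
    static byte[] convert(byte[] contents, Charset from, Charset to)
            throws CharacterCodingException {
        CharBuffer text = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(contents));
        ByteBuffer encoded = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(text);
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }
}
```

Reporting errors lets the action warn the user before a lossy conversion is written back to the workspace.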

In order to make sharing of files with heterogeneous encodings easier, we'll have to enhance the compare/merge tools to be able to work with heterogeneous encodings:

To facilitate that, we try to automatically determine the encoding for the remote resource.
With these means it becomes possible to compare and merge files regardless of whether the same encodings are used on both sides.

However, if we want to use the same encoding (that is, if we catch up with the remote .project file), we will have to convert our local files to the new encoding. For this we will provide the "Convert Encoding" action in the Compare/Merge tools where required.

Scenarios