This paper introduces a new encoder-decoder architecture that is trained to
reconstruct images by disentangling the salient information of the image and
the values of attributes directly in the latent space. As a result, after
training, our model can generate different realistic versions of an input image
by varying the attribute values. By using continuous attribute values, we can
choose how much a specific attribute is perceivable in the generated image.
This property could allow for applications where users can modify an image
using sliding knobs, like faders on a mixing console, to change the facial
expression of a portrait, or to update the color of some objects. Compared to
the state-of-the-art which mostly relies on training adversarial networks in
pixel space by altering attribute values at train time, our approach results in
much simpler training schemes and nicely scales to multiple attributes. We
present evidence that our model can significantly change the perceived value of
the attributes while preserving the naturalness of images.