A major challenge for object recognition is to perceive image objects correctly despite extraneous variations in the data such as shifts, rotations, and deformations. Vision tasks would be much easier if such task-irrelevant transformation variability were removed from the data. The recent success of deep learning approaches has its roots in the ability to build feature representations that are invariant to variations caused by nuisance factors. The expressiveness of deep networks allows models to disentangle the underlying factors of variation in the data, and the training signals guide the models to learn feature representations that are robust to task-irrelevant variations. However, such variations need to be observed in the training data; otherwise, a conventional deep neural network without a special architectural design will not generalize to them. To address this concern, we study the problem of achieving transformation invariance and equivariance in deep learning.

We show how some existing approaches, such as the stacked statistical model with rotatable features and the spatial transformer network, are imperfect at learning feature representations that are invariant or equivariant to transformations. In search of an alternative, we develop a training mechanism for learning transformation-invariant feature representations, in which the feature maps of canonical images are used as soft targets that guide a deep neural network to produce the same feature representations even when the input images are transformed (a minimal sketch of this idea appears below). As a result, our framework obtains transformation-invariant feature representations and makes it possible to take advantage of unlabeled data that contains an enormous amount of variation.

Additionally, we seek architectural changes to existing deep learning models and propose a framework for training deep neural networks with optimal instantiations. By introducing latent variables that parametrize the transformations of the data for each class, our approach obtains the optimal instantiations while training for the downstream tasks (see the second sketch below). A third direction we explore is the use of 3D CAD models to render 2D images as a data augmentation approach: rendering 2D images from 3D models provides a more compact way of representing an object class, and it allows the models to observe more data variation during training. These methods achieve competitive experimental results, and our analysis shows that they are promising directions for achieving transformation invariance and equivariance in deep learning.
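To make the soft-target mechanism concrete, here is a minimal PyTorch sketch, not the exact implementation from this work: the names `backbone` and `invariance_loss`, the choice of rotation as the transformation, and the MSE objective are illustrative assumptions.

```python
# Minimal sketch of soft-target invariance training (assumed details):
# feature maps computed on a canonical image serve as fixed regression
# targets for the same network applied to a transformed copy.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def invariance_loss(backbone, canonical, angle):
    """Penalize feature drift between a canonical image batch and its rotation."""
    with torch.no_grad():                        # soft targets receive no gradient
        target = backbone(canonical)             # feature maps of canonical images
    transformed = TF.rotate(canonical, angle)    # apply a nuisance transformation
    pred = backbone(transformed)                 # feature maps of transformed images
    return F.mse_loss(pred, target)              # pull the two representations together

# In a full training loop this term would be weighted against the supervised
# loss, e.g. loss = task_loss + lam * invariance_loss(backbone, images, 30.0).
```

Because the soft target requires no class label, the same term applies unchanged to unlabeled images, which is what lets this framework exploit unlabeled data with large variation.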
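The optimal-instantiation idea can likewise be sketched under the simplifying assumption that the latent transformation variable is discretized into a small grid of candidate rotations; the grid `CANDIDATE_ANGLES` and the per-class scoring rule below are assumptions for illustration, not the formulation used in this work.

```python
# Sketch of inferring a latent transformation per image (assumed discretization):
# search a small set of candidate rotations and keep the instantiation whose
# score for the true class is highest, then train on that optimal instantiation.
import torch
import torchvision.transforms.functional as TF

CANDIDATE_ANGLES = [-30.0, -15.0, 0.0, 15.0, 30.0]  # hypothetical grid

def optimal_instantiation(model, images, labels):
    """Return each image transformed by its inferred latent rotation."""
    with torch.no_grad():
        scores = torch.stack([
            model(TF.rotate(images, a)).gather(1, labels[:, None]).squeeze(1)
            for a in CANDIDATE_ANGLES
        ])                                  # shape: (num_angles, batch)
        best = scores.argmax(dim=0)         # inferred latent index per image
    return torch.stack([
        TF.rotate(images[i], CANDIDATE_ANGLES[int(best[i])])
        for i in range(images.shape[0])
    ])
```

The inference step is held outside the gradient computation, so the downstream task loss is backpropagated only through the selected instantiation, mirroring the alternation between inferring latent variables and updating network weights.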