Kernel Transformer Networks for Compact Spherical Convolution


Ideally, 360° imagery could inherit the deep convolutional neural networks (CNNs) already trained with great success on perspective projection images. However, existing methods to transfer CNNs from perspective to spherical images introduce significant computational costs and/or degradations in accuracy. In this work, we present the Kernel Transformer Network (KTN). KTNs efficiently transfer convolution kernels from perspective images to the equirectangular projection of 360° images. Given a source CNN for perspective images as input, the KTN produces a function parameterized by a polar angle and kernel as output. Given a novel 360° image, that function in turn can compute convolutions for arbitrary layers and kernels as would the source CNN on the corresponding tangent plane projections. Distinct from all existing methods, KTNs allow model transfer: the same model can be applied to different source CNNs with the same base architecture. This enables application to multiple recognition tasks without retraining the KTN. Validating our approach with multiple source CNNs and datasets, we show that KTNs improve the state of the art for spherical convolution. KTNs successfully preserve the source CNN’s accuracy, while offering transferability, scalability to typical image resolutions, and, in many cases, a substantially lower memory footprint.
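The core idea above — a function of the polar angle that adapts a source kernel to each row of the equirectangular image — can be sketched as follows. This is a minimal illustration, not the paper's method: the real KTN is a small learned network, whereas the `ktn` stub here merely widens the kernel near the poles to mimic equirectangular stretching, and all names and sizes are made up.

```python
import numpy as np

def ktn(source_kernel, theta):
    """Hypothetical stand-in for the KTN function g(K, theta): returns a
    kernel adapted to the distortion at polar angle theta. A real KTN is
    a learned network conditioned on theta; this stub just widens the
    kernel support to mimic horizontal stretching near the poles."""
    stretch = max(1, int(round(1.0 / max(np.sin(theta), 1e-3))))
    k = np.repeat(source_kernel, stretch, axis=1)  # widen near poles
    return k / stretch  # keep the response magnitude comparable

def spherical_conv(equirect, source_kernel):
    """Convolve an equirectangular feature map row by row, using a
    row-specific adapted kernel (zero padding vertically, wrap-around
    padding horizontally for 360 continuity)."""
    H, W = equirect.shape
    out = np.zeros_like(equirect)
    for i in range(H):
        theta = np.pi * (i + 0.5) / H            # polar angle of row i
        k = ktn(source_kernel, theta)
        kh, kw = k.shape
        padded = np.pad(equirect, ((kh // 2,), (0,)), mode="constant")
        padded = np.pad(padded, ((0, 0), (kw // 2, kw // 2)), mode="wrap")
        for j in range(W):
            out[i, j] = np.sum(k * padded[i:i + kh, j:j + kw])
    return out
```

The point of the interface is that only one `source_kernel` is stored; the per-row kernels are generated on the fly, which is what keeps the model compact.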


Existing Methods

Take an off-the-shelf model trained on perspective images

  1. Apply it repeatedly to multiple perspective projections of the 360° image
    • Self-view grounding given a narrated 360 video (AAAI 18)
    • Making 360 video watchable in 2D: Learning videography for click free viewing (CVPR 17)
    • Pano2Vid: Automatic cinematography for watching 360 videos (ACCV 16)
    • A deep ranking model for spatio-temporal highlight detection from 360 video (AAAI 18)
  2. Apply it once to a single equirectangular projection
    • Deep 360 Pilot: Learning a deep agent for piloting through 360 sports videos (CVPR 17)
    • Semantic-driven generation of hyperlapse from 360 video (TVCG 17)
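The two strategies above can be contrasted in code. Strategy 1 needs a way to cut perspective (tangent-plane) views out of the equirectangular image before running the off-the-shelf CNN on each; strategy 2 runs the CNN once on the equirectangular image directly, ignoring distortion. Below is a minimal gnomonic-projection sampler for strategy 1 — function name, nearest-neighbor lookup, and default field of view are all illustrative, not from any of the cited papers.

```python
import numpy as np

def tangent_view(equirect, yaw, pitch, fov=np.pi / 2, size=64):
    """Sample one perspective (tangent-plane) view from an equirectangular
    image by nearest-neighbor lookup; yaw/pitch give the view direction."""
    H, W = equirect.shape[:2]
    f = (size / 2) / np.tan(fov / 2)               # focal length in pixels
    u, v = np.meshgrid(np.arange(size) - size / 2,
                       np.arange(size) - size / 2)
    # Ray through each tangent-plane pixel (camera looks down +z).
    x, y, z = u, v, np.full_like(u, f, dtype=float)
    # Rotate by pitch (around x), then yaw (around y).
    y2 = y * np.cos(pitch) - z * np.sin(pitch)
    z2 = y * np.sin(pitch) + z * np.cos(pitch)
    x2 = x * np.cos(yaw) + z2 * np.sin(yaw)
    z3 = -x * np.sin(yaw) + z2 * np.cos(yaw)
    # Ray direction -> longitude/latitude -> equirectangular pixel.
    lon = np.arctan2(x2, z3)                       # [-pi, pi]
    lat = np.arctan2(y2, np.sqrt(x2**2 + z3**2))   # [-pi/2, pi/2]
    col = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    row = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return equirect[row, col]
```

Strategy 1 would call `tangent_view` for many (yaw, pitch) pairs and run the source CNN on each crop — accurate but expensive; strategy 2 is a single forward pass on `equirect`, cheap but distorted near the poles.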

Recent work that targets spherical data specifically

  1. Adapts the network architecture for equirectangular projection and trains kernels of variable size to account for distortions
    • Learning spherical convolution for fast features from 360 imagery (NIPS 17)

      Accurate but suffers from significant model bloat

  2. Adapts the kernels on the sphere, resampling the kernels or projecting their tangent plane features
    • SphereNet: Learning spherical representations for detection and classification in omnidirectional images (ECCV 18)
    • Saliency detection in 360 videos (ECCV 18)

      Allows kernel sharing and hence smaller models, but degrades accuracy, especially for deeper networks, due to an implicit interpolation assumption

  3. Defines convolution in the spectral domain
    • Convolutional networks for spherical signals (arXiv 17)
    • Learning so(3) equivariant representations with spherical cnns (ECCV 18)

      Has significant memory overhead and, thus far, limited applicability to real-world data.
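The "model bloat" of approach 1 is easy to quantify: tying one kernel bank to every equirectangular row multiplies the parameter count by the image height, which is exactly the cost KTN avoids by generating row kernels from a single shared source kernel. A back-of-the-envelope sketch (all sizes here are made-up defaults, not figures from any cited paper):

```python
def param_counts(H=256, c_in=64, c_out=64, k=3):
    """Illustrative parameter counts for one conv layer: a single shared
    k x k kernel bank vs. one untied kernel bank per equirectangular row
    (the row-wise spherical-convolution design)."""
    shared = c_out * c_in * k * k   # one kernel bank, reused everywhere
    per_row = H * shared            # separate kernel bank for each row
    return shared, per_row
```

With these defaults the untied design stores 256x the parameters of the shared one, per layer — hence the bloat, even before allowing variable kernel sizes per row.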

All of the above require retraining to handle a new recognition task.