By popular request, this post is also available in journal-article form over here.
1. Introduction
One of the popular directions in Deep Learning (DL) research has been to build larger and more complex deep networks that can perform well on several different learning tasks, commonly known as multitask learning. This work is usually done within specific domains, e.g. multitask models that perform captioning, translation, and text classification tasks. Some work has been done in building multimodal/crossmodal networks that use deep networks with a combination of different neural network primitives (Convolutional Layers, Recurrent Layers, Mixture of Expert layers, etc).
An ultimate ambition of this research direction and many others in DL is to build extremely large heterogeneous networks that can solve hundreds or thousands of tasks, and deploy them in realistic use-cases. Computation, memory, and network bandwidth become extremely costly at this scale. To remedy some of these concerns, there has been prior research introducing sparsity to very large, deep networks through conditional activation.
The other key problem posed by very large, sparse, multitask networks (LSMNs) is managing the complexity they present at scale. Modern state-of-the-art deep networks, especially those that tackle complex multitask settings, are built using tensor-level processing abstractions provided by programming libraries and frameworks. However, implementing, scaling, and ultimately deploying the LSMNs described above at this level of abstraction presents a workload that is almost intractable for a manual engineering process. Using the higher-level abstractions provided by other frameworks designed for rapid implementation of simpler networks is also problematic, as it limits the capacity for building highly custom modules, which are key to building high-parameter, sparse networks that tackle tasks across various modalities.
This set of research notes explores various topics and ideas that may prove relevant to large, sparse, multitask networks and explores the potential for a general approach to building and managing these networks. A framework to automatically build, update, and interpret modular LSMNs is presented in the context of current tooling and theory. These notes are by no means comprehensive, and are meant to serve as a preliminary foray into a much deeper analysis of LSMNs and the ideas presented here.
These notes are structured so as to present a foundation of several ideas that may prove useful to building LSMNs, followed by their synthesis into an overall framework. Each preliminary section is meant to serve as a high level overview of the key ideas of the topic as relevant to LSMNs, rather than a comprehensive survey.
The document is organized into roughly 2 sections. The first covers some high level foundations to several of the key ideas mentioned in the second section, which discusses the motivation and approach to building LSMNs.
2. Multitask and Multimodal Learning
Much of the work done in machine learning has focused on optimizing a single objective or learning task. Multitask learning instead focuses on building models that solve several, usually related, tasks while building a shared representation. The motivations behind multitask learning are severalfold, including obvious parallels to biologically inspired learning, but the main motivation from a machine learning perspective is that introducing more than one learning task to a single model induces a useful form of inductive transfer, leading to models that may generalize better. There has been significant work in multitask learning, spanning a variety of architectures, parameter sharing schemes, auxiliary task introduction, and other methods. NLP tasks have greatly benefited from multitask learning, and it has achieved success in machine translation and speech recognition models as well. For a thorough treatment of multitask learning, several works are useful, including [1] and [2].

Of key interest in the research direction being explored here is the notion of multimodal learning. The objective of this line of thinking can be summarized as building a unified deep architecture that can solve tasks across various domains, be it vision, text, or otherwise; in this case, and in the general literature, domains are referred to as modalities. A key work on this topic is [3], where a single architecture, referred to as the MultiModel, was constructed to perform well on 8 different data corpora, including ImageNet, WSJ speech, COCO image captioning, English-German translation, and English-French translation. This work introduces several interesting ideas that appear in other areas of DL research as well, including the composition of subnetworks, the use of different primitives for different functions, and the introduction of a mixture-of-experts layer for conditional computation.
Accordingly, it is useful to examine the architecture presented in this paper in more detail for the sake of the following discussions. There are other multimodal architectures as well, including [4]. There are 3 key ideas introduced in the architecture outlined in [3] that are critical to the general idea behind LSMNs. The first is the idea of small modality-specific subnetworks that convert input data into a unified representation within the model and vice versa. The second is the idea that different "computational blocks" are useful for different problems. The third is an Encoder-Decoder architecture similar to other work [5], which is outlined in later sections.
2.1 Modality Nets
The authors introduce modality nets, which are essentially subnetworks that transform the various data inputs into a joint latent representation. These networks are named this way because a specific subnetwork is designed for a specific modality (speech, text, images, etc.). The authors specifically design modality nets to be computationally minimal. There is also an autoregressive design constraint enforced on these subnetworks, though that is outside the general purview of this set of notes. Modality nets are very similar to the data-type networks introduced in [5] and the neural module networks introduced in [6]. They can be viewed as a specific case of the general notion of neural module blocks introduced in this set of notes.
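The modality-net interface can be sketched in a few lines: each modality gets its own small input network, and all of them target the same latent width so the shared body of the model sees a uniform representation. This is a minimal illustrative sketch (random linear maps standing in for real learned subnetworks), not the MultiModel's actual implementation; all names and dimensions here are assumptions for the example.

```python
import numpy as np

LATENT_DIM = 16  # illustrative width of the joint latent representation

def make_linear(in_dim, out_dim, seed=0):
    """A toy 'modality net': one random linear map into the joint space.
    Real modality nets would be small learned subnetworks (conv stacks,
    embedding layers, etc.); a linear map suffices to show the interface."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: x @ W

# One input-side net per modality, all targeting the same latent width.
modality_nets = {
    "image": make_linear(64, LATENT_DIM, seed=1),   # e.g. a flattened patch
    "text":  make_linear(32, LATENT_DIM, seed=2),   # e.g. pooled embeddings
    "audio": make_linear(128, LATENT_DIM, seed=3),  # e.g. a spectrogram frame
}

def encode(modality, x):
    return modality_nets[modality](x)

# Heterogeneous inputs, homogeneous outputs: the shared body of the model
# only ever sees LATENT_DIM-wide vectors, whatever the input modality.
for name, dim in [("image", 64), ("text", 32), ("audio", 128)]:
    z = encode(name, np.ones(dim))
    assert z.shape == (LATENT_DIM,)
```

The key design point is that the shared architecture is decoupled from input formats: adding a new modality means adding one small subnetwork, not changing the model body.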
2.2 Computational Blocks
The authors also note that the architecture makes use of various kinds of "computational blocks" or primitives, and that this is critical to achieving good performance on various problems. This makes intuitive sense. Examples of computational blocks are separable convolutions [7], attention mechanisms [8], and the sparsely gated mixture of experts [9] (explored in the next section on sparsity), among others. The idea of fundamental building blocks that correspond to specific functionality is also critical to the notion of neural module blocks introduced in this work.
2.3 Encoder-Decoder Architecture
The other key idea this work provides that will prove useful to us is an architectural paradigm used throughout the DL literature, and in multitask learning in particular: the Encoder-Decoder pair. In [3], the MultiModel uses an Encoder, Mixer, Decoder architecture, similar to those found in sequence-to-sequence models [10]. This architecture is important, as it is a specific case of the general ECD (Encoder-Combiner-Decoder) architectural pattern introduced by [5], something analyzed in significant detail in later sections.
Key Takeaway: Three key ideas are introduced in building neural networks for multimodal tasks: 1. Modality nets are small subnetworks composed together in the general architecture, with the goal of mapping different input modalities to a unified latent representation. 2. Different computational blocks (CNN layers, mixture of experts, etc.) are key to good performance on various tasks. 3. An Encoder-Decoder architecture is introduced, a specific case of a powerful architectural pattern demonstrated by the approach in the Ludwig paper, described in later sections.
3. Sparsity through Conditional Computation
Key to building extremely large, multimodal networks is the idea of sparsity. Here, sparse activation is defined by conditional computation paradigms that have been introduced in various works. Extremely large neural networks are limited in practice by the computational cost of their sheer number of parameters; conditional computation promises a reduction in computation despite increases in capacity. Significant research into conditional computation through gating decision processes has been done [11] [12] [13]. These gating decisions can be discrete or continuous, and may also be stochastic. The Mixture of Experts (MoE) layer used in the MultiModel was originally used in a very large, sparse recurrent language model introduced in [9]. The authors introduce a new layer/module that consists of a number of small feed-forward networks acting as experts, with a trainable gating network that selects a sparse combination of the experts to process each input.
Gating functions can range from softmax gating to a noisy selection of the top k candidates, as described in [9]. Training of the gating network is done jointly with the rest of the architecture through simple backpropagation. This sort of gating is key to building large, sparse networks that handle many multimodal tasks, and can be applied both at the individual-layer level of abstraction as outlined above, and at the whole-architecture level. Some form of sparsity is key to increasing model capacity using a reasonable amount of compute.
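Noisy top-k gating can be sketched concretely: compute gate logits, perturb them with input-dependent noise, keep only the k largest, and softmax over those so all other experts receive exactly zero weight. This is a simplified sketch in the spirit of the sparsely-gated MoE layer of [9], not the paper's exact formulation (it omits load-balancing losses and batching); the weight shapes and function names are assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def noisy_top_k_gate(x, W_g, W_noise, k, rng):
    """Noisy top-k gating (simplified): returns a weight vector over
    experts in which only k entries are nonzero, so only k experts
    need to be evaluated for this input."""
    clean = x @ W_g
    # Input-dependent noise scale via softplus, as in the MoE paper's spirit.
    noise = rng.standard_normal(clean.shape) * np.log1p(np.exp(x @ W_noise))
    logits = clean + noise
    top_k = np.argsort(logits)[-k:]          # indices of the k best experts
    masked = np.full_like(logits, -np.inf)   # everything else is masked out
    masked[top_k] = logits[top_k]
    return softmax(masked)                   # exact zeros off the top-k

rng = np.random.default_rng(0)
n_experts, d = 8, 4
W_g = rng.standard_normal((d, n_experts))
W_noise = rng.standard_normal((d, n_experts))

gates = noisy_top_k_gate(np.ones(d), W_g, W_noise, k=2, rng=rng)
assert np.count_nonzero(gates) == 2          # only 2 of 8 experts activate
assert abs(gates.sum() - 1.0) < 1e-9         # still a valid mixture weight
```

Because the mask produces exact zeros, the non-selected experts never run, which is where the compute savings of conditional computation come from.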
4. Neural Network Composition and Modular Architectures
A large interest in representation learning is understanding how learned representations are structured. Manually designed models often have some degree of compositionality. In [14], compositionality is broadly defined as combining simple parts to represent complex concepts. Various approaches to composing submodules exist, like the modality nets mentioned in [3]. In this section, we'll cover neural module networks [6], a useful line of research that demonstrates compositionality, and relate the Encoder-Decoder architectural paradigm mentioned earlier to the general concept in the context of Ludwig [5], a popular declarative deep learning tool.
5. Automating Neural Network Design
AutoML [15] has been a popular topic in the field over the past few years, and broadly refers to any class of methods that automatically find and fit a machine learning method to a particular problem. In this section, we're specifically going to examine Neural Architecture Search (NAS) [16], a group of strategies for automatically designing deep learning architectures for a given training objective. NAS methods have been applied to multitask and continual learning settings [17].
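The skeleton of NAS is simple even when the methods themselves are not: define a discrete search space of architectural choices, sample candidates, score each one, and keep the best. Below is a deliberately tiny random-search sketch; the search space, the `fitness` stand-in (real NAS would train each candidate and measure validation performance), and all names are assumptions for illustration, not any paper's method.

```python
import random

# A toy search space of (depth, width, block type) choices.
SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [32, 64, 128],
    "block": ["conv", "attention", "moe"],
}

def sample_architecture(rng):
    """Draw one architecture description at random from the space."""
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def fitness(arch):
    """Stand-in objective (purely illustrative). In real NAS this would
    be validation accuracy after training the candidate network."""
    score = arch["width"] / 128
    if arch["block"] == "moe":
        score += 0.5
    return score - 0.01 * arch["depth"]

def random_search(n_trials, seed=0):
    """The simplest NAS baseline: sample n_trials candidates, keep the best."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(n_trials)]
    return max(candidates, key=fitness)

best = random_search(50)
assert set(best) == {"depth", "width", "block"}
assert best["width"] in SEARCH_SPACE["width"]
```

More sophisticated strategies (reinforcement learning [19], evolution, gradient-based relaxations) replace the sampling loop, but the space/score/select structure stays the same, which is what makes NAS reusable for both micro- and macro-level design later in these notes.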
6. Pruning Neural Networks
Pruning neural networks refers to selectively removing non-contributing parameters from a network while maintaining overall performance. Pruning is useful because of computational limitations. The Lottery Ticket pruning method [18] outlines an approach that finds a "lottery ticket" subnetwork within a larger deep neural network. This subnetwork may have as little as a tenth of the parameters, yet it reaches the same performance as the original network, and sometimes generalizes better to held-out examples. For our purposes, pruning provides a useful scheme for treating deep networks as primitives of larger, more complex networks by reducing their structure. This idea is explored further in the approach to building LSMNs outlined later.
7. An approach to building LSMNs
Building dynamically updated learning systems that can perform across thousands of tasks has been one of the great goals of the field, and will likely have enormous impact across many different application verticals. Such systems provide an approach towards complex problems that have historically been out of reach for learning systems. Approaches using end-to-end multitask learning systems have already been applied to notoriously difficult problem domains like autonomous navigation and medical applications.
7.1 Issues and Requirements
However, building and managing LSMNs presents a host of problems, including managing the complexity of these networks at scale and in production. LSMNs are by design heterogeneous, large, and composed of many parts that require constant change and management to learn new tasks and improve on current ones. They involve the design and use of a variety of neural network primitives, and require human-interpretable organization so that engineers can intervene in their design and training. Most popular frameworks and libraries used to build custom deep neural networks offer abstractions at the tensor level, providing useful algebra primitives and automatic differentiation. While this has saved researchers and engineers considerable time and has been partially responsible for the speed with which neural network research has progressed, it does not provide a useful abstraction over the complexity required to implement LSMNs. In light of these requirements, an approach to building large-scale LSMNs would involve 2 key design decisions:
1. A general abstraction paradigm that encapsulates subnetworks of the pattern described in [3] [5] [14] [6] and others.
2. An automated method for both micro level primitive design and macro level primitive composition, preferably while maintaining intervenability.
7.2 Unifying Key Ideas
We can unify the notion of modality networks presented in [3], the MoE layer in [9], and neural module networks [6], along with the higher type-based abstractions and encoder-decoder blocks presented in [5], into a general idea of modular neural network primitives, each associated with an information-processing function. These primitives can range in function from simple tensor algebra primitives, like convolutions, to complex architectures like Residual Networks. This level of abstraction is intentionally loose, as it leaves the internals of modules flexible and allows for organization through functional definitions as opposed to mechanistic ones. Practically, this allows researchers and engineers to recycle blocks associated with large-scale processing when necessary, but also to include custom blocks for new information transforms in novel use cases. Key to this idea is the ability to generate new primitives on the fly, something to be discussed in detail. In organizing through function, the resulting modules maintain a high level of interpretability, key to productive engineering practice even in exceedingly complex networks that involve heterogeneous information processing and conditional computation across a sizeable number of parameters. Organizing through function provides the data-type flexibility provided in [5] while allowing primitive definitions to be more specific and smaller in scope if necessary.
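One way to make "organization through functional definitions" concrete is a registry keyed by the information-processing function a module performs rather than its mechanics, so a small conv stack and a full ResNet are interchangeable entries under the same functional slot. This is a minimal sketch under assumed names (`register`, `REGISTRY`, the module stand-ins), not an API from [5] or any other cited work.

```python
from typing import Callable, Dict

# function name -> { module name -> module factory }
REGISTRY: Dict[str, Dict[str, Callable]] = {}

def register(function: str, name: str):
    """Decorator: file a module factory under its functional role."""
    def wrap(module_factory):
        REGISTRY.setdefault(function, {})[name] = module_factory
        return module_factory
    return wrap

@register("image_encoder", "tiny_conv")
def tiny_conv():
    # Stand-in for a small convolutional stack.
    return lambda x: [v * 0.5 for v in x]

@register("image_encoder", "resnet_stub")
def resnet_stub():
    # Stand-in for a full ResNet serving the same functional role.
    return lambda x: [v + sum(x) for v in x]

# A network description can reference (function, name); internals stay
# swappable, which is what keeps the organization interpretable.
encoder = REGISTRY["image_encoder"]["tiny_conv"]()
assert encoder([2.0, 4.0]) == [1.0, 2.0]
```

Swapping `"tiny_conv"` for `"resnet_stub"` changes the mechanism without changing the network description, which is the practical payoff of functional rather than mechanistic organization.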
While this general descriptor of neural network modules, extending from tensor-level primitives to full-blown state-of-the-art deep learning architectures, is useful, in and of itself it does not increase the efficiency of building LSMNs, only of understanding and interpreting them. To rapidly prototype and build LSMNs as a collection of neural network modules with conditional computation, an automated method is necessary. While Neural Architecture Search, explored earlier, has mostly been used in the design of large, fully activated deep learning architectures, the framework provides a useful paradigm for the dynamic design of both individual layers or cell types and higher-level network topologies [19] [15]. We can differentiate between the two types of search based on their domains: micro-level searches generate new primitive types, while macro-level searches generate topologies composed of conditional routing functions and neural network modules. While not explicitly clarified before, the benefit of micro-level search is that it aids in the discovery of new functional primitives for neural networks when combined with a pruning step like the method outlined in [18]. This is significant, as it allows researchers to dynamically create new primitives, or use predefined ones. Rule-based approaches, like the Combiner from the Ludwig architecture [5], also provide a useful tool for the composition of neural network modules into a larger network architecture.
7.3 A Conceptual Approach to building LSMNs
Now, through the two synthesized ideas presented above, a general approach to the design and management of LSMNs begins to emerge. LSMNs can be composed of neural modules, a generic abstraction that organizes encapsulated functionality at definable granularity.
Building NN Modules: NN modules can be predefined, using well-known architectures or tensor-level operations, ranging from a convolutional or recurrent primitive to a full-blown ResNet. To define the best primitive for a new task, a micro-level neural architecture search can be used, followed by a pruning operation to reduce the network to a minimal parameter size. In practice, this makes it significantly easier to reuse, extend, and maintain good engineering practices around building very large networks.
Topology Generation: To generate the LSMN topology, a macro neural architecture search process can be defined to operate on neural modules and routing functions. Since neural architecture searches are essentially independent of the actual task, search spaces can be user-defined and block architectures can be put together. Pretrained best weights can be swapped in from neural network modules as necessary, reducing overall training cost. Module interfaces can be modified by a separate process, similar to the Combiner introduced in [5]. Since modules are reusable, topology generation can be user-defined instead of automatic, yielding networks similar to most cutting-edge modular architectures. Whole-network training could be done, or pretrained modules could be used out of the box.
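A generated topology can be represented as nothing more than an ordered DAG of named modules and their input dependencies, evaluated in sequence. The sketch below illustrates that representation with trivial stand-in modules; the node names and the `run_topology` helper are assumptions for the example, and a real system would emit such a description from a macro-level search or a user-defined rule set rather than write it by hand.

```python
# A topology is a list of (node_name, (module_fn, input_node_names)) pairs
# forming a DAG; evaluating it in order wires modules into a network.
def run_topology(topology, inputs):
    values = dict(inputs)  # start with the raw input nodes
    for node, (fn, deps) in topology:
        values[node] = fn(*(values[d] for d in deps))
    return values

topology = [
    ("enc_text",  (lambda x: x + 1,    ["text"])),       # stand-in encoder
    ("enc_image", (lambda x: x * 2,    ["image"])),      # stand-in encoder
    ("combine",   (lambda a, b: a + b, ["enc_text", "enc_image"])),
    ("decode",    (lambda z: z - 3,    ["combine"])),    # stand-in decoder
]

out = run_topology(topology, {"text": 4, "image": 5})
assert out["decode"] == (4 + 1) + (5 * 2) - 3  # = 12
```

Because the topology is plain data, it can equally be produced by a search process, generated by rules, or edited by a human, which is what keeps the approach intervenable.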
Limited Architectural Opinions and Human-in-the-Loop Learning: The strength of this approach is that it doesn't force architectural notions on the user other than compositionality. A core concern raised when discussing automated neural network methods is that they search combinatorial spaces of the primitive types fed to them, and that human-defined search spaces provide more of the innovation than the search method itself. The approach outlined here sidesteps this issue by allowing module definition to be generic, sourced through a search or through manual design processes. This means that as new computational primitives are discovered and formalized, they can be immediately used in this approach. These could be layer types, whole networks, or routing functions. While a tool will be proposed for deep learning, this generic composition-plus-search method could conceivably be used on other network types, including spiking neural networks. If a computational primitive can be defined and composed, it can be a part of the network. This could certainly include non-neural-network computation as well, as any process would be entirely encapsulated, and learning could be done in a one-shot or independent fashion if necessary. The level of interpretability and extensibility, and the fact that automation is not a requirement, demonstrate the potential for human-in-the-loop learning with this approach.
The approach outlined above, while framed as being useful to the building of extremely large, sparse, multi-objective networks, also provides a generic framework in which to design deep architectures for a single task. Encapsulation and automation are useful features for any network generation process, and are already used to some degree in the construction of many networks, including ResNets and sequence-to-sequence models.
There are several areas still being developed: scoped-out research ideas, extensions, and also a direction to implement a set of tools that would allow researchers and engineers to build heterogeneous networks using this approach, and eventually build LSMNs.
References
[1] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[2] Y. Zhang and Q. Yang, "A survey on multi-task learning," arXiv preprint arXiv:1707.08114, 2017.
[3] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, "One model to learn them all," arXiv preprint arXiv:1706.05137, 2017.
[4] S. Pramanik, P. Agrawal, and A. Hussain, "OmniNet: A unified architecture for multi-modal multi-task learning," arXiv preprint arXiv:1907.07804, 2019.
[5] P. Molino, Y. Dudin, and S. S. Miryala, "Ludwig: a type-based declarative deep learning toolbox," arXiv preprint arXiv:1909.07930, 2019.
[6] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, "Neural module networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 39–48.
[7] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[8] W. Wang and J. Shen, "Deep visual attention prediction," IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2368–2378, 2017.
[9] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," arXiv preprint arXiv:1701.06538, 2017.
[10] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[11] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[12] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup, "Conditional computation in neural networks for faster models," arXiv preprint arXiv:1511.06297, 2015.
[13] K. Cho and Y. Bengio, "Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning," arXiv preprint arXiv:1406.7362, 2014.
[14] J. Andreas, "Measuring compositionality in representation learning," arXiv preprint arXiv:1902.07181, 2019.
[15] M.-A. Zöller and M. F. Huber, "Survey on automated machine learning," arXiv preprint arXiv:1904.12054, 2019.
[16] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," arXiv preprint arXiv:1808.05377, 2018.
[17] R. Pasunuru and M. Bansal, "Continual and multi-task architecture search," arXiv preprint arXiv:1906.05226, 2019.
[18] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," arXiv preprint arXiv:1803.03635, 2018.
[19] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.