# 9 ICML Papers You Should Read

Did you miss ICML 2017? Here are some papers from this year's conference that we found interesting.

### 1. Failures of Gradient-Based Deep Learning

*Why it’s cool:* We’re used to seeing demonstrations of the success of deep learning on tasks ranging from image captioning to transcription factor binding prediction. But are there tasks for which deep learning (and, more generally, gradient-based methods) is inherently unsuited? Shalev-Shwartz et al. show that there are certain problems (such as predicting the parity of a random vector based on observing its dot product with known vectors) in which the gradient of the loss function is nearly the same regardless of which hypothesis is chosen from the set of hypotheses. Theoretically and empirically, this means that gradient-based methods will not be able to learn the true hypothesis, even with lots of time!
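One way to build intuition for this: under uniformly random binary inputs, two different parity functions are completely uncorrelated, so feedback computed against a wrong hypothesis says almost nothing about the right one. The sketch below checks that orthogonality by brute force. The dimension `d` and the vectors `v1` and `v2` are illustrative choices, not values from the paper.

```python
import itertools
import numpy as np

d = 8  # input dimension (illustrative choice)

# All 2^d binary inputs, one per row.
X = np.array(list(itertools.product([0, 1], repeat=d)))

def parity(v, X):
    """Return (-1)^(<v, x>) for every row x of X: the parity of the bits selected by v."""
    return 1 - 2 * ((X @ v) % 2)

rng = np.random.default_rng(0)
v1 = rng.integers(0, 2, size=d)
v2 = v1.copy()
v2[0] ^= 1  # flip one bit so that v2 differs from v1

# Average product over the uniform distribution on inputs.
corr_same = np.mean(parity(v1, X) * parity(v1, X))
corr_diff = np.mean(parity(v1, X) * parity(v2, X))
print(corr_same, corr_diff)
```

A parity function correlates perfectly with itself and not at all with any other parity function, which is roughly why the loss landscape looks the same from the point of view of every candidate hypothesis.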

### 2. Parseval Networks: Improving Robustness to Adversarial Examples

*Why it’s cool*: Part of the reason deep learning is able to model so many complex patterns between the input and output spaces is the non-linear transformations that are learned by the network. Each non-linear transformation can be thought of as “warping” the input space so that, in the end, different classes of objects are easily separable in the warped space. However, this makes neural networks vulnerable to adversarial examples, whereby input images (or documents, etc.) are perturbed very slightly in the *unwarped* space but result in very different labels predicted by the network. In this paper from Facebook AI Research, researchers make neural networks more robust by limiting the Lipschitz constant (a mathematical property that quantifies warpedness) of the network. Surprisingly, they find that this produces not only more robust networks, but neural networks that perform better even on the original images!
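To make the Lipschitz constant concrete: for a linear layer it equals the largest singular value of the weight matrix, and for a stack of layers with 1-Lipschitz activations (such as ReLU) the product of those values is an upper bound. The sketch below compares unconstrained random weights with exactly orthogonal ones. Using a QR decomposition to obtain orthogonal matrices is an assumption made for illustration and is not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_norm(W):
    """Largest singular value of W: the L2 Lipschitz constant of x -> W @ x."""
    return np.linalg.norm(W, 2)

def relu(z):
    return np.maximum(z, 0.0)

def net(A, B, x):
    """Tiny two-layer network with a ReLU in between."""
    return B @ relu(A @ x)

# Two unconstrained layers versus two layers with orthonormal rows.
W1 = rng.normal(size=(16, 16))
W2 = rng.normal(size=(16, 16))
Q1, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Q2, _ = np.linalg.qr(rng.normal(size=(16, 16)))

# ReLU is 1-Lipschitz, so the product of spectral norms upper-bounds
# the Lipschitz constant of the whole stack.
bound_free = spectral_norm(W1) * spectral_norm(W2)
bound_orth = spectral_norm(Q1) * spectral_norm(Q2)

# Empirical check: the output moves by at most bound * (input change).
x = rng.normal(size=16)
delta = 1e-3 * rng.normal(size=16)
ratio_orth = (
    np.linalg.norm(net(Q1, Q2, x + delta) - net(Q1, Q2, x))
    / np.linalg.norm(delta)
)
print(bound_free, bound_orth, ratio_orth)
```

With orthogonal weights the bound is 1, so a small perturbation of the input cannot be amplified on its way through the network.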

### 3. Learning Sleep Stages from Radio Signals: A Conditional Adversarial Architecture

*Why it’s cool*: This work is an awesome synthesis of machine learning theory and a cutting-edge wireless/medical application. Previously, researchers from Dina Katabi’s lab at MIT developed a wireless device that can use reflected (or “backscattered”) Wi-Fi signals to measure things like the heartbeat of people in a room. Here, they try to extend that to predict the sleep stages of someone based *only* on reflected wireless signals. The challenge is to do this in a way that adapts to different environments and subjects. They do this by building an encoder-predictor-discriminator network. The *encoder* turns raw wireless signals into features, which the *predictor* must use to predict sleep stages, but which the *discriminator* is unable to use to discriminate between different settings and individuals. This way, they ensure that they learn features that are only relevant to predicting sleep stages, instead of domain-specific features.
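The tug-of-war between the three components can be written as a single objective for the encoder: do well on the sleep-stage prediction while keeping the discriminator's loss high. The sketch below is a generic adversarial objective of that shape and does not attempt the conditional part of the paper's architecture. The function names and the trade-off weight `lam` are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true labels."""
    rows = np.arange(len(labels))
    return -np.mean(np.log(probs[rows, labels]))

def encoder_objective(stage_probs, stage_labels, domain_probs, domain_labels, lam=1.0):
    """Loss the encoder minimizes: predict stages well, keep the discriminator confused."""
    predictor_loss = cross_entropy(stage_probs, stage_labels)
    discriminator_loss = cross_entropy(domain_probs, domain_labels)
    return predictor_loss - lam * discriminator_loss

# Made-up predictor outputs for two samples over three sleep stages.
stage_probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
stage_labels = np.array([0, 1])
domain_labels = np.array([0, 1])

# Features that leak the subject/environment versus features that do not.
leaky = np.array([[0.9, 0.1], [0.1, 0.9]])
confused = np.array([[0.5, 0.5], [0.5, 0.5]])

obj_leaky = encoder_objective(stage_probs, stage_labels, leaky, domain_labels)
obj_confused = encoder_objective(stage_probs, stage_labels, confused, domain_labels)
print(obj_leaky, obj_confused)
```

With the same prediction quality, the encoder is rewarded (lower objective) when the discriminator can do no better than chance.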

### 4. When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications

*Why it’s cool*: In practical studies, small sample sizes are common because of the nature of the data or financial constraints, which makes analysis difficult and inaccurate. One common solution is to directly pool data from diverse sources. However, is such pooling guaranteed to help? In this paper, researchers propose a hypothesis test to answer this question. The theoretical analysis is carried out in a simple setting: linear regression on data with no model mismatch. The bias-variance trade-off turns out to be the key. When the sample size is not large enough, multi-site pooling can decrease the variance significantly while increasing the bias only slightly. The hypothesis test is then validated by empirical experiments on Alzheimer’s disease studies.
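The trade-off is easy to reproduce in a toy simulation. The sketch below is not the paper's hypothesis test. It only compares the error of a slope estimated from one small site with the error after pooling in a larger site whose true slope is slightly different. All numbers (slopes, sample sizes, number of trials) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Least-squares slope for a no-intercept model y = beta * x + noise."""
    return (x @ y) / (x @ x)

beta_site1, beta_site2 = 1.0, 1.1   # slightly different true slopes
n1, n2 = 10, 100                    # site 1 is the small study of interest
trials = 2000

err_single, err_pooled = [], []
for _ in range(trials):
    x1, x2 = rng.normal(size=n1), rng.normal(size=n2)
    y1 = beta_site1 * x1 + rng.normal(size=n1)
    y2 = beta_site2 * x2 + rng.normal(size=n2)
    est_single = ols_slope(x1, y1)
    est_pooled = ols_slope(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    # Both estimates are judged against site 1's true slope.
    err_single.append((est_single - beta_site1) ** 2)
    err_pooled.append((est_pooled - beta_site1) ** 2)

mse_single = np.mean(err_single)
mse_pooled = np.mean(err_pooled)
print(mse_single, mse_pooled)
```

Here the pooled estimate is biased toward the other site's slope, but its variance drops so much that its mean squared error is still lower. Making the slopes more different, or site 1 larger, eventually reverses that conclusion, which is the situation a test needs to detect.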

### 5. The Shattered Gradients Problem: If resnets are the answer, then what is the question?

*Why it’s cool*: Vanishing and exploding gradients were once viewed as a critical problem for neural networks. They were later largely addressed by careful initialization and batch normalization. However, recent research has found that resnets and other architectures with skip-connections perform much better than standard feedforward networks, without a good explanation. This work is the first to point out a problem that may be the hidden reason: the shattered gradients of deep neural networks. As a network’s depth grows, gradients in feedforward networks become shattered, increasingly resembling white noise. In contrast, resnets dramatically reduce this shattered gradient problem.

### 6. Asymmetric Tri-training for Unsupervised Domain Adaptation

*Why it’s cool*: This work achieves impressive performance on domain adaptation tasks without explicitly reducing the divergence between the feature distributions of the source and target domains. More specifically, it uses asymmetric tri-training with three networks: two of them are used to label target data, and the third is trained only on the generated pseudo-labeled samples. Comprehensive experiments demonstrate better performance than existing methods such as MMD and DANN. One interesting result is that this model only slightly reduces the domain divergence compared to other methods.
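The pseudo-labeling step can be sketched as a filter over the two labeling networks' predictions on unlabeled target data. The selection rule below (keep a sample when both networks predict the same class and at least one of them is confident) and the `threshold` value are assumptions made for illustration. Check the paper for the exact criterion.

```python
import numpy as np

def select_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """Return (indices, labels) of target samples both labelers agree on confidently."""
    labels_a = probs_a.argmax(axis=1)
    labels_b = probs_b.argmax(axis=1)
    agree = labels_a == labels_b
    # Confidence of the more confident of the two labelers.
    confident = np.maximum(probs_a.max(axis=1), probs_b.max(axis=1)) >= threshold
    keep = np.flatnonzero(agree & confident)
    return keep, labels_a[keep]

# Made-up class probabilities from the two labeling networks on four target samples.
probs_a = np.array([[0.95, 0.05], [0.60, 0.40], [0.20, 0.80], [0.55, 0.45]])
probs_b = np.array([[0.90, 0.10], [0.30, 0.70], [0.05, 0.95], [0.60, 0.40]])

idx, pseudo = select_pseudo_labels(probs_a, probs_b)
print(idx, pseudo)
```

Only the selected samples and their pseudo-labels would then be passed to the third, target-specific network for training.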

### 7. Modular Multitask Reinforcement Learning with Policy Sketches

*Why it’s cool*: In reinforcement learning, can we take advantage of the structure underlying different tasks for generalization? In previous work, the structure is grounded in the environment. This paper proposes using “policy sketches,” which are just sequences of symbols representing subtasks within a task, without being grounded in the environment. With this kind of weak supervision, the paper proposes a decoupled actor-critic approach for policy optimization within a curriculum learning framework, and shows promising empirical results.
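As a data structure, a sketch is nothing more than a list of subtask symbols attached to a task. The example below uses hypothetical task and symbol names and assumes one reusable module per symbol. It shows how sketches reveal which modules several tasks could share.

```python
# Hypothetical tasks and subtask symbols, chosen only for illustration.
sketches = {
    "make_planks": ["get_wood", "use_workbench"],
    "make_sticks": ["get_wood", "use_toolshed"],
    "make_bridge": ["get_iron", "get_wood", "use_factory"],
}

def subpolicy_usage(sketches):
    """Map each subtask symbol to the tasks whose sketch mentions it."""
    usage = {}
    for task, sketch in sketches.items():
        for symbol in sketch:
            usage.setdefault(symbol, []).append(task)
    return usage

usage = subpolicy_usage(sketches)
# Symbols that appear in more than one task are candidates for shared modules.
shared = sorted(s for s, tasks in usage.items() if len(tasks) > 1)
print(shared)
```

Note that the symbols carry no meaning of their own: nothing tells the agent what `get_wood` looks like in the environment, only that the same symbol recurs across tasks.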

### 8. Post-Inference Prior Swapping

*Why it’s cool*: In probabilistic modeling, given an arbitrary prior, posterior inference can be computationally intractable. What if we could use an arbitrary simple prior that makes posterior inference more tractable, and then correct the resulting posterior afterwards? The naive approach of using importance sampling fails when the simple prior and the true prior are very different. This paper proposes prior swapping, a method that works for arbitrary simple priors.
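For reference, the naive importance-sampling correction looks like the sketch below: sample from the posterior under the convenient prior, then reweight each sample by the ratio of the true prior to the convenient one. This is the baseline the paper improves on, not prior swapping itself. The Gaussian model and all numbers are assumptions chosen so that the exact answer is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_posterior(y, prior_sd, noise_sd=1.0):
    """Posterior mean and sd of theta, for prior N(0, prior_sd^2) and y_i ~ N(theta, noise_sd^2)."""
    precision = len(y) / noise_sd**2 + 1.0 / prior_sd**2
    mean = (y.sum() / noise_sd**2) / precision
    return mean, precision ** -0.5

def log_normal_pdf(theta, sd):
    """Log density of N(0, sd^2) evaluated at theta."""
    return -0.5 * (theta / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

y = rng.normal(loc=1.0, scale=1.0, size=20)
false_sd, true_sd = 10.0, 1.0

# Step 1: inference under the convenient "false" prior.
m_false, s_false = gaussian_posterior(y, false_sd)
theta = rng.normal(m_false, s_false, size=100_000)

# Step 2: reweight each sample by true_prior / false_prior.
log_w = log_normal_pdf(theta, true_sd) - log_normal_pdf(theta, false_sd)
w = np.exp(log_w - log_w.max())
w /= w.sum()

is_mean = np.sum(w * theta)
exact_mean, _ = gaussian_posterior(y, true_sd)
ess = 1.0 / np.sum(w ** 2)  # effective sample size
print(is_mean, exact_mean, ess)
```

In this benign case the two priors overlap well and the reweighted estimate matches the exact posterior mean closely. As the priors move apart, a few samples tend to dominate the weights and the effective sample size collapses, which is the failure mode mentioned above.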

### 9. Evaluating Bayesian Models with Posterior Dispersion Indices

*Why it’s cool*: Probabilistic modeling rests on a cycle: designing a model, inferring its hidden structure, and evaluating it. This paper sheds new light on how we should evaluate a model. The canonical approach is to use the predictive accuracy, which is the expectation of the likelihood under the posterior. This paper argues that the predictive accuracy does not tell the full story. It proposes using posterior dispersion indices, each of which is the quotient of the variance and the expectation of the likelihood under the posterior, and presents empirical case studies in which the posterior dispersion indices tell a fuller story of how well a data point is modeled.
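Following the definition as summarized here, a dispersion index can be estimated per data point from posterior samples. The sketch below assumes a unit-variance Gaussian likelihood and uses made-up draws in place of a real posterior. The exact indices used in the paper may be defined differently (for example, on log-likelihoods).

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, sd=1.0):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def likelihood_matrix(x, theta_samples):
    """Likelihood of each data point under each posterior sample, shape (n_data, n_samples)."""
    return normal_pdf(x[:, None], theta_samples[None, :])

def predictive_accuracy(x, theta_samples):
    """Per-datapoint expectation of the likelihood under the posterior."""
    return likelihood_matrix(x, theta_samples).mean(axis=1)

def posterior_dispersion_index(x, theta_samples):
    """Per-datapoint variance of the likelihood divided by its expectation."""
    lik = likelihood_matrix(x, theta_samples)
    return lik.var(axis=1) / lik.mean(axis=1)

# Made-up data and stand-in posterior draws for the mean parameter.
x = np.array([-0.2, 0.1, 0.3, 4.0])
theta_samples = rng.normal(loc=0.05, scale=0.3, size=5000)

pred = predictive_accuracy(x, theta_samples)
pdi = posterior_dispersion_index(x, theta_samples)
print(pred, pdi)
```

The two quantities are computed from the same likelihood matrix, but one summarizes its average and the other how much it fluctuates across the posterior, so they can rank data points differently.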

Abubakar Abid