Hebrew University logoHebrew University logoCSE School logoCSE Building info icon info icon
Ross building, The Edmond J. Safra Campus. Picture: Dror Bar-Natan, Fall 2001

The Learning Club   rss icon

Past talks

Fall 2015/2016 Semester
Extreme nonparametrics in time series
Thu, 19 Nov 2015, 10:15
Speaker: Daniil Ryabko , INRIA
I will talk about some learning problems concerning time series, including clustering, change-point and hypothesis testing. The framework is very general: the time series are not assumed to come from some parametric family, and there are no assumptions of independence or finite memory. The main results come in the form of asymptotically consistent algorithms. I will also touch upon the question of what is possible and what is not possible to do in this general framework. Finally, I will present some research directions which I would like to work during my stay at HUJI.
Optimization, Regularization and Generalization in Multilayer Networks
Sun, 15 Nov 2015, 10:15
Speaker: Nati Srebro , TTI Chicago
What is it that enables learning with multi-layer networks? What causes the network to generalize well? What makes it possible to optimize the error, despite the problem being hard in the worst case? In this talk I will attempt to address these questions and relate between them, highlighting the important role of optimization in deep learning. I will then use the insight to suggest studying novel optimization methods, and will present Path-SGD, a novel optimization approach for multi-layer RELU networks that yields better optimization and better generalization. Joint work with Behnam Neyshabur, Ryota Tomioka and Russ Salakhutdinov.
Understanding Word Embeddings
Thu, 12 Nov 2015, 10:15
Speaker: Omer Levy , Bar Ilan
Neural word embeddings, such as word2vec (Mikolov et al., 2013), have become increasingly popular in both academic and industrial NLP. These methods attempt to capture the semantic meanings of words by processing huge unlabeled corpora with methods inspired by neural networks and the recent onset of Deep Learning. The result is a vectorial representation of every word in a low-dimensional continuous space. These word vectors exhibit interesting arithmetic properties (e.g. king - man + woman = queen) (Mikolov et al., 2013), and seemingly outperform traditional vector-space models of meaning inspired by Harris's Distributional Hypothesis (Baroni et al., 2014). Our work attempts to demystify word embeddings, and understand what makes them so much better than traditional methods at capturing semantic properties. Our main result shows that state-of-the-art word embeddings are actually ""more of the same"". In particular, we show that skip-grams with negative sampling, the latest algorithm in word2vec, is implicitly factorizing a word-context PMI matrix, which has been thoroughly used and studied in the NLP community for the past 20 years. We also identify that the root of word2vec's perceived superiority can be attributed to a collection of hyperparameter settings. While these hyperparameters were thought to be unique to neural-network-inspired embedding methods, we show that they can, in fact, be ported to traditional distributional methods, significantly improving their performance. Among our qualitative results is a method for interpreting these seemingly-opaque word-vectors, and the answer to why king - man + woman = queen. (Based on joint work with Yoav Goldberg and Ido Dagan.) Bio: Omer is a Computer Science PhD student at Bar-Ilan University’s Natural Language Processing lab, working with Prof. Ido Dagan and Dr. Yoav Goldberg. Omer is interested in realizing high-level semantic applications such as question answering and summarization to help people cope with information overload. At the heart of these applications are challenges in textual entailment and semantic similarity, which form the core of Omer's current research. He is also interested in the current advances in representation learning (aka “deep learning”), particularly in the scope of word embeddings, and how they can support semantic applications.
Learning to cluster - a statistical framework for incorporating domain knowledge in clustering
Thu, 05 Nov 2015, 10:15
Speaker: Shai Ben-David , University of Waterloo
Clustering is an area of huge practical relevance but rather meager theoretical foundations. The multitude of clustering algorithms (and their possible parameter settings) and the diversity of the results they may yield, call for incorporation of domain expertise in the process of selecting a clustering algorithm and setting up its parameters. I will outline recent progress made along this direction. In particular,I will describe a novel statistical/machine-learning approach to that challenge; a model selection algorithm that is based on interactions with the clustering user. I will analyze the statistical complexity of the proposed approach. I will also allude to some common misconceptions and potential pitfalls, aiming to stimulate discussions and highlight open questions. The main part of the talk is based on recent work with my student Hassan Zokaei Ashtiani presented at UAI 2015.
Physics-Based Computational Imaging in the Era of Big Data: An X-Ray Perspective
Thu, 29 Oct 2015, 10:15
Speaker: Yan Kaganovsky , Duke
Physics-Based Computational Imaging is the process of reconstructing a physical object represented by a digital image, which is often high-dimensional (e.g., 3D, hyperspectral, movie of 3D images), by computational inversion of measured data that is not directly related to the imaged object. This inverse problem is inherently ill-posed due to the non-direct and incomplete nature of the measurements, model uncertainties, and noise. In my talk I will discuss in detail two examples of computational imaging: (1) X-ray computed tomography (CT), which is used for 3D medical imaging of humans/animals, quality control in industrial manufacturing, and security inspections, to name a few; (2) Coded-aperture x-ray coherent scatter imaging, which is a novel imaging modality proposed by Prof. Brady’s group at Duke University that allows one to identify the molecular structure of materials in security and medical applications. Traditional methods used for data inversion in most actual imaging systems employ one-shot methods, e.g., Fourier-based inversion or filtered back-projection. In recent years, there has been an increased interest in statistical iterative inversion methods based on minimizing some cost function that incorporates additional physical information about the system, the measured signals, and prior knowledge about the imaged object, which makes the inverse problem less ill-posed. Despite their many advantages, iterative methods are computationally much more expensive (require more CPU time and memory) than one-shot methods. This is becoming a big challenge as sensors and detectors are becoming faster and more accurate, leading to more measurements and to the accumulation of big data, which increases storage requirements and processing time. There is also an increasing demand for high resolution images, resulting in very high dimensional solution spaces (another aspect of big data), which also increases reconstruction time, e.g., in 3D x-ray CT there are billions of image voxels to be reconstred; this becomes an even bigger problem for hyperspectral or 4D images. In medical and security applications, time may be critical, so faster inversion methods need to be developed. Another challenge in iterative methods is the determination of model parameters, which are object dependent and therefore unknown a priori. I will present recent developments in iterative inversion algorithms and computational approaches that address the above challenges. I will highlight a non-conventional method called ``Automatic Relevance Determination (ARD)’’, which originated in machine learning, for principled automatic determination of model parameters. I will present an extension of ARD to physically-based models used in CT, called ``Variational ARD’’ [1], which can accurately account for photon shot noise and Beer's law. This method is over an order of magnitude faster than previous ARD methods for big data. [1] Y. Kaganovsky et al., “Alternating Minimization Algorithm with Automatic Relevance Determination for Transmission Tomography under Poisson Noise”, accepted to the SIAM Journal on Imaging Sciences.
Soft-Optimal Decisions under Uncertainty
Thu, 22 Oct 2015, 10:15
Speaker: Roy Fox , HUJI
Under uncertainty, seemingly optimal decisions can actually be suboptimal. This makes the empirical optimum a biased estimator of the true optimum, which is a major obstacle in optimization, model selection, and sequential decision making. Information theory offers methods for reducing this bias, by considering a relative-entropy cost of diverging from a prior stochastic decision. In this work we show that the bias is a monotonic function of the scale of this relative-entropy cost, and find an explicit asymptotic expression for the zero-bias scale. We illustrate this framework in two reinforcement-learning settings, and derive a novel off-policy TD-learning algorithm. We demonstrate in several domains some of the benefits of our algorithm over existing ones, such as its superior convergence rate in noisy environments, and its risk aversion during exploration.

Spring 2013/2014 Semester
On statistical model selection
Thu, 28 May 2015, 10:15
Speaker: Yosi Rinott , HUJI
As background I will discuss model selection notions such as AIC and BIC. Then I will describe an attempt based on AIC to quantify the predictive value of models (joint work with David Azriel).
Learning orthogonal and positive matrices using block coordinate updates
Thu, 21 May 2015, 10:15
Speaker: Gal Chechik , Bar Ilan University
I will cover two of our recent works which share a common theme: using block coordinate updates when optimizing over matrix sets. I will first describe an approach for learning orthogonal matrices, a central component in problems like sparse-PCA or tensor decomposition. Such optimization is hard since simple operations on orthogonal matrices easily break orthogonality, and correcting orthogonality usually costs a large amount of computation. I'll describe a framework for optimizing orthogonal matrices, that is the parallel of coordinate-descent in Euclidean spaces. It is based on Givens-rotations, a fast-to-compute operation that affects a small number of entries in the learned matrix, and preserves orthogonality. Second, I'll describe a block-coordinate approach for optimizing over positive semi definite matrices, a key problem in metric learning. This optimization problem is again hard, and I'll describe a procedure that uses row-column coordinate updates, which keeps the optimization within the positive definite cone. I'll demonstrate the applicability of these two approaches in several applications, including retrieval of similar images and analysis of brain-omics data. Bio: Gal Chechik is an Assoc. Prof. at the Gonda brain research center at Bar-Ilan University. His research focuses on learning in brains and machines, spanning topics from brain development and plasticity to machine perception. Before joining the faculty, he was a senior research scientist at Google research, Mountain View, and a research associate at Stanford University with Daphne Koller. Gal obtained my PhD from the Hebrew University working with Naftali Tishby and Israel Nelken and his M.Sc. with Eytan Ruppin and Issac Meilijson from Tel-Aviv University.
Distributed Machine Learning
Thu, 14 May 2015, 10:15
Speaker: Ohad Shamir , Weizmann Institute
Handling increasingly large datasets is one of the major problems faced by machine learning today. One approach is to distribute the learning task, and split the data among several machines which can run in parallel. Ideally, a distributed learning algorithm on k machines should provably (1) Run k times faster than an algorithm designed for a single machine; (2) Reach the same statistical learning performance with the same amount of training data; And (3) Use minimal communication between the machines, since it is usually much slower than internal processing. In other words, such an algorithm should combine computational efficiency, statistical efficiency, and communication efficiency. In this talk, I'll survey the challenges of designing such algorithms for convex learning problems, and describe some recent advances as well as fundamental limitations. Includes joint works with Yossi Arjevani, Andrew Cotter, Ofer Dekel, Ran Gilad-Bachrach, Nathan Srebro, Karthik Sridharan, Lin Xiao and Tong Zhang.
Common Manifold Learning Using Alternating Diffusion for Multimodal Signal Processing
Thu, 07 May 2015, 10:15
Speaker: Ronen Talmon , Technion
One of the true challenges in signal processing is to distinguish between different sources of variability. In this work, we consider the case of multiple multimodal sensors measuring the same physical phenomenon, such that the properties of the physical phenomenon are manifested as a hidden common source of variability (which we would like to extract), while each sensor has its own sensor-specific effects. We will address the problem from a manifold learning standpoint and show a method based on alternating products of diffusion operators and local kernels, which extracts the common source of variability from multimodal recordings. The generality of the addressed problem sets the stage for the application of the developed method to many real signal processing problems, where different types of devices are typically used to measure the same activity. In particular, we will show application to sleep stage assessment. We demonstrate that through alternating-diffusion, the sleep information hidden inside multimodal respiratory signals can be better captured compared to single-modal methods.
Turning Big Data Signals into Small Core-sets
Thu, 16 Apr 2015, 10:15
Speaker: Dan Feldman , University of Haifa
Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. I will consider the problem of computing optimal segmentation of such signals by k-piecewise linear function, using only one pass over the data by maintaining a coreset for the signal. The coreset enables fast further analysis such as automatic summarization and analysis of such signals. A coreset (core-set) is a compact representation of the data seen so far, which approximates the data well for a specific task -- in our case, segmentation of the stream. I will show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments k, independently of both the dimension d of the signal, and its number n of points. More precisely, we will construct a representation of size $O(klog n /epsilon^2)$ that provides a $(1+\epsilon)$-approximation for the sum of squared distances to any given k-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. The results rely on a novel eduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in Amazon cloud. Based on papers from NIPS'14 and ICRA'15 (to appear).
Learning High Dimensional Copula Bayesian Networks
Thu, 19 Mar 2015, 10:15
Speaker: Yaniv Tenzer , HUJI
Learning expressive real-valued multivariate distributions is of central interest in numerous fields ranging from computational biology to economics to climatology. When the joint distribution of interest is far from multivariate normal, this modeling task can be a great challenge. In statistics, copulas [Joe, 1997, Nelsen,2007] are the central tool for capturing flexible multivariate real-valued distributions. However, copulas are typically only effective in low dimensions and much of the research in the field in fact focuses on the bivariate case. In machine learning, probabilistic graphical models and in particular directed Bayesian networks (BNs) [Pearl, 1988], have become increasingly popular as a flexible and intuitive framework for modeling multivariate densities based on a qualitative graph structure G that encodes the independencies in the domain. Graphical models are geared toward the high dimensional case and numerous algorithms for estimation, model selection and prediction using these models have been developed in recent decades [Koller and Friedman, 2009]. Unfortunately, due to computational considerations, real-valued high-dimensional modeling using this framework is often limited to a structured multivariate Gaussian model. In recent years, several works suggested a fusion between copulas and graphical models [Kirshner, 2007,Elidan, 2010], with the goal of allowing for flexible real-valued modeling that is practical in the high dimensional setting. In this talk I will present the Copula Bayesian Network construction (CBN) and a novel structure learning approach that is based on some new theoretical results regarding copulas. * No background knowledge in graphical models or copulas is assumed
Consistent distribution-free K-sample and independence tests for univariate random variables
Thu, 12 Mar 2015, 10:15
Speaker: Ruth Heller , TAU
A popular approach for testing if two univariate random variables are statistically independent consists of partitioning the sample space into bins, and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While for detecting simple relationships coarse partitions may be best, for detecting complex relationships a great gain in power can be achieved by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove those are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example. Joint work with Yair Heller, Shachar Kaufman, Barak Brill, and Malka Gorfine.
Strongly Adaptive Online Learning
Thu, 05 Mar 2015, 10:15
Speaker: Alon Gonen , HUJI
Strongly adaptive algorithms are algorithms whose performance on every time interval is close to optimal. We present a reduction that can transform standard low-regret algorithms to strongly adaptive. As a consequence, we derive simple, yet efficient, strongly adaptive algorithms for a handful of problems. joint work with: Amit Daniely and Shai Shalev-Shwartz
Efficient Training of Structured SVMs via Soft Constraints
Thu, 25 Dec 2014, 10:15
Speaker: Ofer Meshi , TTI Chicago
Structured output prediction is a powerful framework for jointly predicting interdependent output labels. Learning the parameters of structured predictors is a central task in machine learning applications. However, training the model from data often becomes computationally expensive. Several methods have been proposed to exploit the model structure, or decomposition, in order to obtain efficient training algorithms. In particular, methods based on linear programming relaxation, or dual decomposition, decompose the prediction task into multiple simpler prediction tasks and enforce agreement between overlapping predictions. In this work we observe that relaxing these agreement constraints and replacing them with soft constraints yields a much easier optimization problem. Based on this insight we propose an alternative training objective, analyze its theoretical properties, and derive an algorithm for its optimization. Our method, based on the Frank-Wolfe algorithm, achieves significant speedups over existing state-of-the-art methods without hurting prediction accuracy.
Acquiring Semantic Knowledge using Patterns
Thu, 04 Dec 2014, 10:15
Speaker: Roy Schwartz , HUJI
Building word representations is one of the core tasks in Natural Language Processing. In recent years there have been several successful attempts to tackle this problem, which are based on deep neural networks (aka word embeddings). However, the resulting embeddings are limited in their ability to express general semantic relations, as well as the inability to interpret their elements. In this talk, I will present a model for constructing interpretable word embeddings. Unlike the great majority of previous approaches, our model is based on patterns learned from plain text (e.g., ``X and Y''). I will show that the representation learned using our model (a) outperforms state-of-the-art embeddings (word2vec) on a word similarity task, (b) is able to capture semantic relations that word2vec embeddings are unable to capture and (c) can be combined with word2vec embeddings to create a representation that is better than both representation separately. Based on a joint work with Roi Reichart and Ari Rappoport.
Probabilistic inference by randomly perturbing max-solvers
Thu, 27 Nov 2014, 10:15
Speaker: Tamir Hazan , Haifa University
Predictions in modern inference problems can be increasingly understood in terms of discrete structures such as arrangements of objects in computer vision, parses in natural language processing or molecular structures in computational biology. In a fully probabilistic treatment, all possible alternative assignments are considered thus sampling from traditional structured probabilistic models may be computationally expensive for many machine learning applications. These computational difficulties are circumvented with a variety of optimization techniques that provide max-solvers to predict the most likely structure. In this talk I will present a new approach to relax the exponential complexity of probabilistic reasoning in structured models while relying on efficient predictions under random perturbations. This approach leads to a new statistical inference framework that is based on probability models that measure the stability of the prediction to random changes of the structures scores.
Machine Learning, art and design
Thu, 20 Nov 2014, 10:15
Speaker: Michael Fink ,
Sequential Monte Carlo methods and their use in graphical models
Thu, 13 Nov 2014, 10:15
Speaker: Thomas Schon , Uppsala University
In this talk I will introduce our new methods (to be published at NIPS in December) for inference in general probabilistic graphical models (PGMs). The key is a sequential decomposition of the PGM which provides a sequence of auxiliary distributions defined on a monotonically increasing sequence of probability spaces. By targeting these auxiliary distributions using Sequential Monte Carlo (SMC) methods we are able to approximate the full joint distribution defined by the PGM. I will also introduce the underlying Sequential Monte Carlo methods (e.g. particle filters/smoothers), which are computational methods primarily used to deal with the state inference problem in nonlinear state space models. We are (since a few years back) seeing these methods finding new applications in more and more general model classes, for example probabilistic graphical models. Furthermore, the systematic combination of SMC and MCMC, referred to as particle MCMC provides another powerful family of algorithms. The first algorithms of this type were published in 2010 and since then we have (for very good reasons) witnessed a rapidly growing interest in these algorithms.
* Interpolation Between Online Learning Settings
Thu, 23 Oct 2014, 10:15
Speaker: Yevgeni Seldin , University of Copenhagen, Denmark
Online learning problems can be characterized by several key features, including the amount of feedback (full information, bandit feedback, partial monitoring, etc.); adversariality of the environment ("stochastic" regime, adversarial regime, adaptive adversaries, etc); and structural richness (stateless problems, problems with side information, MDPs, etc.). There is a significant amount of research on isolated settings in this huge space, for example, "prediction with expert advice", "stochastic multiarmed bandits", "adversarial multiarmed bandits", and so on. In most situations algorithms developed for one problem, say, stochastic multiarmed bandits, are either unsuitable or suboptimal for any other problem. We present a number of algorithms that achieve the optimal performance (up to logarithmic factors) in a number of different regimes as well as in intermediate regimes. This includes one algorithm that achieves the optimal performance in both stochastic and adversarial multiarmed bandits. The same algorithm achieves the optimal performance in two intermediate regimes, moderately contaminated stochastic regime and adversarial regime with a gap. We also present two other algorithms that achieve the optimal performance on a continuous range of problems extending from prediction with expert advice to adversarial multiarmed bandits and a bit further into partial monitoring games. The results naturally lead to a number of challenging open questions: (1) how far can we push the "general algorithms" line - is there a limit on the number of different settings that can be solved with one algorithm? (2) is there a price to be paid for "generality"? (the optimal performance is achieved up to logarithmic factors, but are these factors inevitable and if yes, how to derive lower bounds establishing that no single algorithm can achieve the optimal performance in several different regimes without paying a price?) Based on joint work with Aleksandrs Slivkins, Peter Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori.
Exact Matrix Completion.
Thu, 19 Jun 2014, 10:00
Speaker: Naomi Vinokurov , HUJI
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Thu, 19 Jun 2014, 10:40
Speaker: Yoel Sher , HUJI
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a deep feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. I will present the shared views of four research groups (From University of Torronto, Google, Microsoft and IBM) that had recent success in using these neural networks for acoustic modeling in speech recognition.
Generalized Principal Component Analysis
Thu, 12 Jun 2014, 10:00
Speaker: Moran Mizrahi , HUJI
I will present the paper "Generalized Principal Component Analysis" by Rene Vidal, Yi Ma and Shankar Sastry from 2003. The paper develops an extension of PCA to the case where data is drawn from a mixture of linear subspaces. I will describe an algebraic geometric algorithm proposed by the authors for solving this problem, for the case where all the subspaces are of the same dimension.
Tighter Linear Program Relaxations for High Order Graphical Models
Thu, 29 May 2014, 10:00
Speaker: Elad Mezuman , HUJI
Graphical models with High Order Potentials (HOPs) have received considerable interest in recent years. While there are a variety of approaches to inference in these models, nearly all of them amount to solving a linear program (LP) relaxation with unary consistency constraints between the HOP and the individual variables. In many cases, the resulting relaxations are loose, and in these cases the results of inference can be poor. It is thus desirable to look for more accurate ways of performing inference. In this work, we study the LP relaxations that result from enforcing additional consistency constraints between the HOP and the rest of the model. We address theoretical questions about the strength of the resulting relaxations compared to the relaxations that arise in standard approaches, and we develop practical and efficient message passing algorithms for optimizing the LPs. Empirically, we show that the LPs with additional consistency constraints lead to more accurate inference on some challenging problems that include a combination of low order and high order terms. joint work with Daniel Tarlow, Amir Globerson and Yair Weiss
Learning a topic hierarchy using the Chinese restaurant process
Thu, 22 May 2014, 10:00
Speaker: Oren Bernstein , HUJI
We address the problem of learning topic hierarchies from data. The problem is very rich: we need to learn the character of each topic, their number and their structure relative to one another. Starting with a fixed number of clusters, we build up a model which solves this problem in a general way, using the non-parametric distribution known as the Chinese restaurant process.
Thu, 22 May 2014, 10:45
Speaker: Alon Cohen , HUJI
I will present a paper by Agarwal et al. The paper discusses generalized linear models for binary prediction and their extension to multiclass prediction. We will show simple algorithms for learning such models, for when the specific model is known and when it is unknown. These algorithms are essentially iterative least-squares updates and are provided with formal convergence guarantees. On the empirical side, I will show a scalable stagewise variant of their approach, which achieves dramatic computational speedups on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies.
James Stein estimator
Thu, 15 May 2014, 10:00
Speaker: Rotem Hayut , HUJI
In 1961, James and Stein (JS) introduced a new parameter estimation scheme. Their JS estimator has a surprising and counterintuitive property: when three or more unrelated parameters are measured, the estimation error can be reduced by using a combined estimator (even though they are unrelated!). Imagine three measurements, the first is a speed of light test, the second is the euro-league final-four results and the third will be kebab consumption before Independence Day. By combining the measurements of the three above we can reach a better MSE than estimating them separately. In the lecture, I will briefly talk about basic definitions in estimation theory, and compare maximum likelihood and James-Stein estimators in terms of MSE risk. Two intuitions for the estimator will be presented, as well as an example of predicting success in baseball batting average.
Tensor Decomposition Algorithms for Latent Variable Model Estimation
Thu, 08 May 2014, 10:00
Speaker: Uri Shalit , HUJI
Winning The Large Scale Visual Recognition Challenge with Dropout And Deep CNN
Thu, 01 May 2014, 10:00
Speaker: Lior Bar , HUJI
In 2010 a team from the University of Toronto won the LSVRC-2010 (Large Scale Visual Recognition Challenge) with more then 8% better labelling accuracy then the other competitors. They classified 1000 image classes using Deep Convolutional Neural Networks (CNN) and also applied the newly developed 'Dropout' training approach, which controls overfitting by hiding part of the features between the network's layers during the training. In this talk we will describe the CNN architecture used in the challenge (and implementation details) , as well the dropout approach. We will also discuss some recent analysis of dropout and its relation to regularization.
Discrete Chebyshev Classifiers
Thu, 03 Apr 2014, 10:00
Speaker: Elad Eban , HUJI
In large scale learning problems it is often easy to collect simple statistics of the data, but hard or impractical to store all the original data. A key question in this setting is how to construct classifiers based on such partial information. In the talk I will present a framework for discriminative learning given a set of statistics. Specifically, we address the case where all variables are discrete and we have access to various marginals. We show that for certain sets of statistics the problem is tractable, and in the general case can be approximated using MAP LP relaxations. Empirical results show that the method is competitive with other approaches that use the same input.
On Huber’s robust regression and its connection to compressed sensing
Thu, 27 Mar 2014, 10:00
Speaker: Jakob Vovnoboy , HUJI
inear regression is a common practice in the fields of signal processing and machine learning. In some cases the samples obtained for the regression are corrupted by gross outliers. Hence, classic methods may perform poorly or even break down. To address this issue, robust regression methods were developed throughout the years. Huber's robust regression is probably the most known of them and was introduced as early as 1964. In this talk we discuss two approaches to the problem of regression in the presents of outliers. The first one is looking at the outliers as random variables taken from an unknown distribution. The second is looking at them sparse deterministic unknowns. Thus, compressed sensing methods can be used to identify them . We show that both approaches are tightly connected to the Huber's robust regression method and present some conclusions from this connection. based on chapters from "Robust statistics" by Huber and Ronchetti and on "USPACOR: Universal sparsity-controlling outlier rejection" Giannakis et al.
Blind Source Separation by Sparse Decomposition
Thu, 20 Mar 2014, 10:00
Speaker: Michael Zibulevsky , Technion
The blind source separation problem is to extract the underlying source signals from a set of their linear mixtures, where the mixing matrix is unknown. This situation is common, e.g. in acoustics, radio, and medical signal processing. We exploit the property of the sources to have a sparse representation in a corresponding (possibly overcomplete) signal dictionary. Such a dictionary may consist of wavelets, wavelet packets, etc., or be obtained by learning from a given family of signals. Starting from the maximum a posteriori framework, which is applicable to the case of more sources than mixtures, we derive a few other categories of objective functions, which provide faster and more robust computations, when there are an equal number of sources and mixtures. Our experiments with artificial signals and with musical sounds demonstrate significantly better separation than other known techniques.
On the intersection of mathematics and machine learning
Thu, 13 Mar 2014, 10:00
Speaker: Shai Dekel , GE, TAU
In the first part of the talk we will provide an overview of the industrial internet (also known as the “Internet Of Things (IOT)”) and present some of the challenges in this domain such as big data analytics, predictive analytics, large scale asset management and machine vision problems in advanced manufacturing. In the second part of the talk we will show how mathematics (Approximation Theory and Harmonic Analysis) can be combined with Machine Learning to solve some of these challenges. Tools such as the tree-based Random Forest and the Gradient Boosting Machine are popular and powerful machine learning algorithms that are also employed as part of 'Deep Learning' systems. Constructing the right form of wavelet decomposition of these tools allows establishing ordering of their forest decision nodes: from `significant' features to `less significant' to `insignificant' noise. Consequently, simple wavelet techniques can be used to overcome the presence of noise and misclassifications in the training sets and compress large scale neural networks.
Information Trade-offs in Machine Learning
Thu, 06 Mar 2014, 10:00
Speaker: Ohad Shamir , Weizmann Institute of Science
Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits) and more. Currently, we have little understanding how such information constraints fundamentally affect our performance, independent of the learning problem semantics. For example, are there learning problems where *any* algorithm which has small memory footprint (or can use any bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this talk, I'll describe how a single set of results implies positive answers to the above, for a variety of settings.
Minimize Your Regret and Have It
Thu, 27 Feb 2014, 10:00
Speaker: Eyal Gofer , HUJI
Online linear optimization (OLO) models a wide range of sequential decision making problems in a possibly adversarial environment. For this setting, Regularized Follow the Leader algorithms, and in particular Hedge and Online Gradient Descent (OGD), are known to uniformly upper bound regret. This talk will discuss a surprising flip side of these algorithms: They uniformly \emph{guarantee} regret. As will be shown, this property makes them useful, of all things, in proving \emph{lower} bounds on regret. In fact, the message of this talk is that when attempting to prove high regret in the OLO setting, regret minimization algorithms could feature in the solution. I will begin with results by G. and Mansour (2012). That work lower bounded the individual sequence anytime regret of a large family of algorithms in terms of the quadratic variation of the loss sequence, Q_T, and the learning rate. It further showed that any learning rate that guarantees a regret upper bound of O(\sqrt{Q_T}) necessarily implies an \Omega(\sqrt{Q_T}) anytime regret on any sequence with quadratic variation Q_T. These bounds apply to Hedge and OGD as special cases. One application of these bounds is an asymptotically optimal lower bound on the arbitrage-free price of call options. Time permitting, I will discuss new results for OLO with switching costs. For this model, a novel construction based on regret minimization algorithms implies that for normed switching costs, under general conditions, regret cannot be upper bounded by any learner given only an upper bound on $Q_T$, necessitating additional constraints.
Volumetric Spanners: an Efficient Exploration Basis for Learning
Thu, 20 Feb 2014, 10:00
Speaker: Zohar Karnin , Yahoo
Numerous machine learning problems require an {\it exploration basis} - a mechanism to explore the action space. We define a novel geometric notion of exploration basis with low variance called volumetric spanners, and give efficient algorithms to construct such bases. We show how efficient volumetric spanners give rise to the first efficient and optimal-regret algorithm for bandit linear optimization over general convex sets. Previously such results were known only for specific convex sets, or under special conditions such as the existence of an efficient self-concordant barrier for the underlying set.
The sample complexity of agnostic learning when the labeling is deterministic
Wed, 19 Feb 2014, 15:00
Speaker: Shai Ben-David , Waterloo
The case of learning a deterministic classifier with a class to which it does not belong has so far received relatively little research attention. Its relevance grows with the emergence of tools that allow handling data with a huge number of features, so big that it is reasonable to assume that, over the full set of features, the labeling is (almost) fully determined. We investigate this setting and show that it displays a very different behavior than that of the fundamental results of learning for the common PAC setups. First, the sample complexity of learning a binary hypothesis class is not fully determined by the VC-dimension of the class. For any d, we present classes of VC-dimensiond that are learnable from O(d/ε)-many samples and classes that require samples of size Ω(d/ε^2). Furthermore, in this setup is that there are classes for which any proper learner is has suboptimal sample complexity. While the class can be learned with sample complexity O(d/ε), any proper (and therefore, any ERM) algorithm requires Ω(d/ε^2) samples. We also shaw that in such cases, access to unlabeled data suffices to close that gap. A semi-supervised version of ERM has optimal label complexity once it has access to sufficiently many unlabeled examples. We provide combinatorial characterizations of both phenomena. This is joint work with Ruth Urner.

Fall 2013/2014 Semester
Pay attention to the leading constants
Thu, 13 Feb 2014, 10:00
Speaker: Yoram Singer , Google
Abstract: In the talk I review one of the three largest machine learning platforms at Google called Sibyl. Sibyl can handle over 100B examples in 100B dimensions so long as each example is very sparse. The recent version of Sibyl fuses the views of Nesterov's accelerated gradient method and parallel coordinate descent. The result is an algorithm that retains the momentum and convergence properties of the accelerated gradient method while taking into account the curvature of the objective function. The algorithm is fast to convergence and is easy to parallelize in large scale. I conclude with a few examples of problems at Google that the system handles.
What is learnable and how to learn
Thu, 16 Jan 2014, 10:00
Speaker: Amit Daniely , HUJI
The fundamental theorem of statistical learning states that if a problem is at all learnable, then the empirical risk minimization learning rule is an optimal learner, and the amount of data required for learning is characterized by the VC dimension of the class. This theorem fully answers the questions of "what is learnable" and "how to learn". We start by showing that the theorem breaks in multiclass classification problems, and rather surprisingly, an optimal multiclass learner must be improper, namely, it must have the ability to output hypotheses which does not come from the hypothesis class. As a result, the fundamental questions of "what is learnable" and "how to learn" are again widely open. We describe a complete answer to the question of "how to learn" and propose a new notion of dimension, that we conjecture to characterize the answer to the "what is learnable" question. The talk is based on: "Multiclass learnability and the ERM principle" (D., Sabato, Shalev-Shwartz, Ben-David, COLT'11) "Multiclass Learnng Approaches: A theoretical comparison with implication" (D., Sabato, Shalev-Shwartz, NIPS'12) "Multiclass classification: An optimal algorithm, linear classifiers, and the statistical role(!) of improper learning" (D., Shalev-Shwartz, Soon on arXiv)
Unsupervised Ranking and Ensemble Learning
Thu, 09 Jan 2014, 10:00
Speaker: Boaz Nadler , Weizmann Institute of Science
In various decision making problems, one is given the advice or predictions of several classifiers of unknown reliability, over multiple questions or queries. This scenario is different from the standard supervised setting where classifier accuracy can be assessed using available labeled training or validation data, and raises several questions: given only the predictions of several classifiers of unknown accuracies, over a large set of unlabeled test data, is it possible to a) reliably rank them, and b) construct a meta-classifier more accurate than any individual classifier in the ensemble? In this talk we'll show that under standard independence assumptions between classifier errors, this high dimensional data hides a simple low dimensional structure. In particular, we'll present a spectral approach to address the above questions, and derive a novel unsupervised spectral meta-learner (SML). On both simulated and real data, SML typically achieves a higher accuracy than most classifiers in the ensemble and can provide a better starting point for iterative estimation of the maximum likelihood estimator than classical majority voting. Furthermore, SML is robust to the presence of small malicious groups of classifiers designed to veer the ensemble prediction away from the (unknown) ground truth. Joint work with Fabio Parisi, Francesco Strino and Yuval Kluger (Yale).
Robust dynamic decision making
Thu, 02 Jan 2014, 10:00
Speaker: Shie Mannor , Technion
Planning when the model parameters are not fully known is a common problem encountered in operations research, control, and artificial intelligence. I will start with demonstrating why planning with parameter uncertainty is an important issue. I will then describe several approaches: Bayesian uncertainty model over the unknown parameters, a robust approach that takes a worst case view, and a frequentist approach. I will outline the advantages and disadvantages of each approach and discuss its potential to scale-up to large problems. I will finally discuss the challenges that are posed by a higher level of uncertainty, where the model itself rather than its parameters may not be fully known.
On learning parametric-output HMMs
Thu, 26 Dec 2013, 10:00
Speaker: Aryeh Kontorovich , BGU
We present a novel approach for learning an HMM whose outputs are distributed according to a parametric family. This is done by {\em decoupling} the learning task into two steps: first estimating the output parameters, and then estimating the hidden states transition probabilities. The first step is accomplished by fitting a mixture model to the output stationary distribution. Given the parameters of this mixture model, the second step is formulated as the solution of an easily solvable convex quadratic program. We provide an error analysis for the estimated transition probabilities and show they are robust to small perturbations in the estimates of the mixture parameters. Finally, we support our analysis with some encouraging empirical results. Joint work with Boaz Nadler and Roi Weiss, ICML 2013.
Online Learning with k Actions
Thu, 19 Dec 2013, 10:00
Speaker: Ofer Dekel , Microsoft Resarch
“Online learning with k actions” is the simplest and most fundamental online learning setting. Two of its best-known special cases are the “adversarial multi-armed bandit” problem and the “predicting with expert advice” setting. In this talk, we present a rich and complete theoretical characterization of this setting, as we vary the power of the adversary and the amount of feedback given to the learning algorithm. Specifically, we show that the different variants of this setting can be arranged in three tiers of difficulty: easy problems (with sqrt(T) minimax regret), hard problems (with T^(2/3) minimax regret), and hopeless problems (with linear minimax regret). The class of hard online learning problems was previously unknown, but turns out to include quite a few interesting settings. As a corollary of our characterization, we provide the first formal confirmation that learning with bandit feedback can be strictly harder than learning with full information. Joint work with Raman Arora (TTI-C), Nicolo Cesa-Bianchi (Milano), Jian Ding (UChicago), Tomer Koren (Technion), Yuval Peres (MSR), Ohad Shamir (Weizmann) , and Ambuj Tewari (UMICH)
postponed due to the snow
Thu, 12 Dec 2013, 10:15
Speaker: Yair Wiener , Technion
K-means recovers ICA filters when independent components are sparse
Thu, 28 Nov 2013, 10:15
Speaker: Alon Vinnikov , HUJI
From Bandits to Experts: A Tale of Domination and Independence
Thu, 21 Nov 2013, 10:15
Speaker: Yishay Mansour , TAU
We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [NIPS 2011]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. Joint work with Noga Alon, Nicolo Cesa-Bianchi and Claudio Gentile.
Coordinate-descent for learning orthogonal matrices through Givens rotations
Thu, 14 Nov 2013, 10:15
Speaker: Uri Shalit , HUJI
Optimizing over the set of orthogonal matrices is a central component in problems like sparse-PCA or tensor decomposition. Unfortunately, such optimization is hard since simple operations on orthogonal matrices easily break orthogonality, and correcting orthogonality usually costs a large amount of computation. We propose a framework for optimizing orthogonal matrices, that is the parallel of coordinate-descent in Euclidean spaces. It is based on Givens-rotations, a fast-to-compute operation that affects a small number of entries in the learned matrix, and preserves orthogonality. We show two applications of this approach: an algorithm for tensor decomposition that is used in learning mixture models, and an algorithm for sparse-PCA. We study the parameter regime where a Givens rotation approach converges faster and achieves a superior model on a genome-wide brain-wide mRNA expression dataset.
Vanishing Component Analysis
Thu, 07 Nov 2013, 10:00
Speaker: Roi Livni , HUJI
The vanishing ideal of a set of points, S ⊂ Rn, is the set of all polynomials that attain the value of zero on all the points in S. Such ideals can be compactly represented using a small set of polynomials known as generators of the ideal. Here we describe and analyze an efficient procedure that constructs a set of generators of a vanishing ideal. Our procedure is numerically stable, and can be used to find approximately vanishing polynomials. The resulting polynomials capture nonlinear structure in data, and can for example be used within supervised learning. Empirical comparison with kernel methods show that our method constructs more compact classifiers with comparable accuracy.
A Bayesian Probability Calculus for Density Matrices
Thu, 31 Oct 2013, 10:15
Speaker: Manfred Warmuth , UC Santa Cruz
One of the main concepts in quantum physics is a density matrix, which is a symmetric positive definite matrix of trace one. Finite probability distributions can be seen as a special case when the density matrix is restricted to be diagonal. We develop a probability calculus based on these more general distributions that includes definitions of joints, conditionals and formulas that relate these, including analogs of the Theorem of Total Probability and various Bayes rules for the calculation of posterior density matrices. The resulting calculus parallels the familiar ``conventional'' probability calculus and always retains the latter as a special case when all matrices are diagonal. Whereas the conventional Bayesian methods maintain uncertainty about which model has the highest data likelihood, the generalization maintains uncertainty about which unit direction has the largest variance. Surprisingly the bounds also generalize: as in the conventional setting we upper bound the negative log likelihood of the data by the negative log likelihood of the MAP estimator. This is joint work with Dima Kuzmin. No background in Quantum Physics is required for this talk and we will give elaborate visualizations of all needed concepts. Some background in Bayesian analyses will be helpful for motivating the approach. A version of the talk can be found at http://www.cse.ucsc.edu/~manfred/pubs/C76talk.pdf
Crowd-sourcing epidemic detection
Thu, 24 Oct 2013, 10:15
Speaker: Constantine Caramanis , The University of Texas at Austin
The history of infections and epidemics holds famous examples where understanding, containing and ultimately treating an outbreak began with understanding its mode of spread. The key question then, is: which network of interactions is the main cause of the spread? Our current approaches to understand and predict epidemics rely on the scarce, but exact/reliable, expert diagnoses. In this talk we explore a different way forward: use more readily available but also more noisy, incomplete and unreliable data to determine the causative network of an epidemic, thus making an accurate global diagnosis from highly unreliable local data.

Spring 2012/2013 Semester
Method-of-Moments for Learning Diagnosis Networks
Thu, 10 Oct 2013, 10:15
Speaker: Yoni Halpern , NYU
ethod-of-Moments for Learning Diagnosis Networks I will present recent work on learning the parameters and structure of bipartite "noisy-or" bayesian networks used to express the relationships between diseases and symptoms in expert systems for medical diagnosis. Method-of-moments approaches provide efficient alternatives to EM for learning in these networks where exact inference is intractable. Yoni Halpern is a PhD student at New York University focusing on probabilistic graphical models for improving health care.
Method-of-Moments for Learning Diagnosis Networks
Thu, 10 Oct 2013, 10:00
Speaker: Yoni Halpern , NYU
I will present recent work on learning the parameters and structure of bipartite "noisy-or" bayesian networks used to express the relationships between diseases and symptoms in expert systems for medical diagnosis. Method-of-moments approaches provide efficient alternatives to EM for learning in these networks where exact inference is intractable.
On MAP inference by MWSS on perfect graphs
Wed, 31 Jul 2013, 12:00
Speaker: Adrian Weller , Columbia
Finding the most likely (MAP) configuration of a Markov random field (MRF) is NP-hard in general. A promising, recent technique is to reduce the problem to finding a maximum weight stable set (MWSS) on a derived weighted graph, which if perfect, allows inference in polynomial time. In this talk, we’ll review the approach and discuss new results, including a general decomposition theorem for MRFs of any order and number of labels, and a characterization of which binary pairwise MRFs can be efficiently solved with this method. This defines the power of the approach on this class of models and improves our toolbox. Joint work with Tony Jebara If time, I’ll also review the AISTATs work on using discrete methods to approximate the Bethe partition function, including a PTAS for attractive binary pairwise models with max degree O(log |V|).
Toward understanding the urban information
Thu, 04 Jul 2013, 10:00
Speaker: Qiao Wang , Nanjing Institute of Communications Technologies
There are over one hundred big cities in China, each with population more than one million. It brings us both challenge and opportunity to develop information processing technologies for establishing the urban service system, based on understanding the behavior of people in these cities. In this talk, we will illustrate the evolution patterns concerning the business hot-pots, traffic, and/or flu propagation, according to analysis from Chinese social networks and other sources. The research is based on a typical big city, Nanjing, with 8 million people.
Computable Performance Analysis of Sparse Recovery with Applications
Thu, 27 Jun 2013, 10:00
Speaker: Arye Nehorai , Washington University in St. Louis
The last decade has witnessed burgeoning developments in the reconstruction of signals based on exploiting their low-dimensional structures, particularly their sparsity, block-sparsity, and low-rankness. The reconstruction performance of these signals is heavily dependent on the structure of the operating matrix used in sensing. The quality of these matrices in the context of signal recovery is usually quantified by the restricted isometry constant and its variants. However, the restricted isometry constant and its variants are extremely difficult to compute. We present a framework for analytically computing the performance of the recovery of signals with sparsity structures. We define a family of incoherence measures to quantify the goodness of arbitrary sensing matrices. Our primary contribution is the design of efficient algorithms, based on linear programming and second order cone programming, to compute these incoherence measures. As a by-product, we implement efficient algorithms to verify sufficient conditions for exact signal recovery in the noise-free case. The utility of the proposed incoherence measures lies in their relationship to the performance of reconstruction methods. We derive closed-form expressions of bounds on the recovery errors of convex relaxation algorithms in terms of these measures. The second part of the talk applies the developed theory and algorithms to the optimal design of an OFDM radar system with multi-path components.
Building High-level features using large scale unsupervised learning
Thu, 20 Jun 2013, 10:00
Speaker: Yoel Sher , The Hebrew University
Consider the problem of building high level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? In this talk I will present Quoc V Le's and Andrew Ng's paper on building high-level features using large scale unsupervised learning (ICML 2012). In their paper they trained a 9-layer neural network with 1 billion connections on 10 million images from YouTube database. They trained this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, their experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. In this talk I will go over the concepts of unsupervised feature learning, sparse deep autoencoders and other building blocks in this model.
Prediction with Limited Advice and Multiarmed Bandits with Paid Observations
Thu, 06 Jun 2013, 10:00
Speaker: Yevgeny Seldin , Queensland University of Technology and UC Berkeley
We study two basic questions in online learning. The first question is what happens between full-information and limited-feedback games and the second question is the cost of information acquisition in online learning. The questions are addressed by defining two variations of standard online learning games. In the first variation, \emph{prediction with limited advice}, we consider a game of prediction with expert advice, where on each round of the game we query the advice of a subset of $M$ out of $N$ experts. We present an algorithm that achieves $O(\sqrt{(N/M) T \ln N})$ regret on $T$ rounds of this game. The second variation, the \emph{multiarmed bandit with paid observations}, is a variant of the adversarial $N$-armed bandit game, where on round $t$ of the game, we can observe the reward of any number of arms, but each observation has a cost $c$. We present an algorithm that achieves $O((c N \ln N)^{1/3} T^{2/3})$ regret on $T$ rounds of this game. We present lower bounds that show that, apart from the logarithmic factors, these regret bounds cannot be improved.
Information, Complexity and Learning
Thu, 30 May 2013, 09:00
Speaker: - ,
The workshop will be held in honor of Prof. Naftali Tishby, marking his 60th birthday, and celebrating his influential research career
Metro Maps of Information
Thu, 23 May 2013, 10:00
Speaker: Dafna Shahaf , Carnegie Mellon University
When information is abundant, it becomes increasingly difficult to fit nuggets of knowledge into a single coherent picture. Complex stories spaghetti into branches, side stories, and intertwining narratives; search engines, our most popular navigational tools, are limited in their capacity to explore such complex stories. We propose a methodology for creating structured summaries of information, which we call metro maps. Just as cartographic maps have been relied upon for centuries to help us understand our surroundings, metro maps can help us understand the relationships between many pieces of information. We formalize characteristics of good maps and formulate their construction as an optimization problem. We provide efficient, scalable methods with theoretical guarantees for generating maps. User studies over real-world datasets demonstrate that our method is able to produce maps which help users acquire knowledge efficiently.
projection-free online learning
Thu, 09 May 2013, 10:00
Speaker: Elad Hazan , The Technion
The computational bottleneck in applying online learning to massive data sets is usually the projection step. We present efficient on- line learning algorithms that eschew projections in favor of much more efficient linear optimization steps. Our algorithms are based on a new algorithm for general convex optimization based on the Frank-Wolfe technique, which is of independent interest. joint work with Dan Garber, Technion
Adaptive Coding of Actions and Observations
Thu, 02 May 2013, 10:00
Speaker: Pedro Ortega , The Hebrew University
The application of expected utility theory to construct adaptive agents is both computationally intractable and statistically questionable. To overcome these difficulties, agents need the ability to delay the choice of the optimal policy to a later stage when they have learned more about the environment. How should agents do this optimally? An information-theoretic answer to this question is given by the Bayesian control rule - the solution to the adaptive coding problem when there are not only observations but also actions. I will review the central ideas behind the Bayesian control rule.
Winning The Large Scale Visual Recognition Challenge with Dropout And Deep CNN
Wed, 01 May 2013, 10:00
Speaker: Lior Bar , HUJI
In 2010 the team from Toronto won the LSVRC-2010 (Large Scale Visual Recognition Challenge) with more then 8% better labeling then all others. They classified 1000 Image classes using Deep Convolutional Neural Networks (CNN) and also applied the newly developed 'Dropout' that controls overfit by corrupting the training data. We will see the close connection between dropout and regularization, and why (in some cases) it's better. We will also talk about their network architecture, and the Technical optimizations that led them to the first place.
Optimizing the measure of performance in structured learning
Thu, 25 Apr 2013, 10:00
Speaker: Joseph Keshet , TTI
The goal of discriminative learning is to train a system to optimize a certain desired measure of performance. In simple classification we seek a function that assigns a binary label to a single object, and tries to minimize the error rate (correct or incorrect) on unseen data. In structured prediction we are interested in the prediction a structured label, where the input is a complex object. Typically, each structured prediction task has its own measure of performance, or cost function, such as word error rate in speech recognition, the BLEU score in machine translation or the intersection-over-union score in object segmentation. Not only that those cost functions are much more involved than the binary error rate, the structured prediction itself spans an exponentially large label space. In the talk, I will present two algorithms each designed to the minimize a given cost. First, I will present a new theorem stating that a general learning update rule for linear models directly corresponds to the gradient of the desired measure of performance, and will describe its proof technique. I will present empirical results on the task of phoneme-to-speech alignment, where the goal is to minimize the special alignment cost function. Then, I will show a generalization of the theorem to training non-linear models such as HMMs, and will present empirical results on phoneme recognition task which surpass results from HMMs trained with all other training techniques. In the second part of the talk, I will describe a new algorithm which aims to minimize a regularized cost function. The algorithm is derived by directly minimizing a generalization bound for structured prediction, which gives an upper-bound on the expected cost (risk) in terms of the empirical cost. The resulting algorithm is iterative and easy to implement, and as far as we know, the only algorithm that can handle a non-separable cost functions. We will present experimental results on the task of phoneme recognition, and will show that the algorithm achieves the lowest phoneme error rate (normalized edit distance) compared to other discriminative and generative models with the same expressive power.
A Hilbert Space Embedding for Distributions. Paper by Alex Smola, Arthur Gretton, Le Song, Bernhard Schölkopf.
Thu, 18 Apr 2013, 10:00
Speaker: Yoav Wald , The Hebrew University
The authors describe a technique for comparing distributions without the need for density estimation as an intermediate step. Their approach relies on mapping distributions into a reproducing kernel Hilbert space, the paper is an overview of methods proposed over several past works, all based on this approach. In this talk I'll describe the technique and its application to a few tasks including two-sample tests and measures of independence.
Adaptive Metric Dimensionality Reduction
Sun, 07 Apr 2013, 10:00
Speaker: Aryeh Kontorovich , Ben Gurion University
We initiate the study of dimensionality reduction in general metric spaces in the context of supervised learning. Our statistical contribution consists of tight Rademacher bounds for Lipschitz functions in metric spaces that are doubling, or nearly doubling. As a by-product, we obtain a new theoretical explanation for the empirically reported improvements gained by pre-processing Euclidean data by PCA (Principal Components Analysis) prior to constructing a linear classifier. On the algorithmic front, we describe an analogue of PCA for metric spaces, namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Thus, our approach can exploit the dual benefits of low dimensionality: (1) more efficient proximity search algorithms, and (2) more optimistic generalization bounds. Joint work with Lee-Ad Gottlieb and Robert Krauthgamer.
Multiclass Learning Approaches: A Theoretical Comparison with Implications.
Thu, 14 Mar 2013, 10:00
Speaker: Amit Daniely , The Hebrew University
We outline a theoretical analysis and comparison of five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. The analysis is distribution free and "hypothesis class based" (similar to the PAC/VC theory). Every method naturally defines a hypothesis class of classifiers. For each such class, we study both the approximation error (the error of the best hypothesis in the class) and the estimation error (the difference between the error of the returned classifier and the approximation error). Similar to binary classification, the estimation error is evaluated using the Graph Dimension, a generalization of theVC dimension. For the approximation error, we give two kind of result - the first shows that certain classes will always have better approximation error than other classes. The second shows that certain classes are very likely to have very large approximation error (close to 1/2). The analysis yields several conclusions of practical relevance, and reveals some phenomenons that do not happen in binary classification. Joint work with Sivan Sabato and Shai Shalev-Shwartz
D.C. Programming and DCA
Thu, 07 Mar 2013, 10:00
Speaker: Nadav Cohen , The Hebrew University
D.C. programming addresses the problem of minimizing a function f=g-h, g and h being convex. We will discuss the properties of D.C. programs and D.C. functions, including global and local optimality conditions and D.C. duality. The DCA (D.C. algorithm) is based on local optimality conditions and duality, and tackles a D.C. program with a convex analysis approach. It is a generalization of CCCP (concave-convex procedure), although historically it preceded the latter. We will discuss the merits and drawbacks of DCA, and in particular its convergence properties. Finally, we will concentrate on the class of polyhedral D.C. programs, which arise naturally in many cases, and on which DCA exhibits finite convergence.
Exact Lifted Probabilistic Inference
Thu, 28 Feb 2013, 10:00
Speaker: Udi Aspel , Ben Gurion University
Relational graphical models extend the descriptive scope of graphical models by using a first-order language. Compared with "regular" models that use propositional language, relational models are able to compactly represent large scale problems, by capturing repeated patterns and relations between domain entities. E.g., a friend of a smoker is likely to be a smoker as well. A specialized class of inference algorithms called "lifted inference" algorithms, allows inference to be applied directly on probabilistic relational models, and has proven to be vastly superior to the alternative, where a relational model is extracted into a large (and exponentially less efficient) equivalent model of propositional language. The talk will include: i. A brief introduction to Variable Elimination - an exact inference algorithm for propositional models. ii. Introduction to Relational Probabilistic Models. iii. An overview of First Order Variable Elimination - an exact inference algorithm for Relational Probabilistic Models. iv. A brief overview of our contribution. Joint work with Prof. Ronen Brafman.

Fall 2012/2013 Semester
Thu, 07 Feb 2013, 10:00
Speaker: Edo Liberty , Yahoo!
Convergence rate of coordinate descent for MAP
Thu, 24 Jan 2013, 10:00
Speaker: Ofer Meshi , The Hebrew University
Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little was known about the convergence rate of this procedure. We perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. This is a joint work with Tommi Jaakkola and Amir Globerson. Presented at NIPS 2012.
On Measure Transformed Canonical Correlation Analysis
Thu, 17 Jan 2013, 10:00
Speaker: Koby Todros , University of Michigan
In this work linear canonical correlation analysis (LCCA) is generalized by applying a structured transform to the joint probability distribution of the considered pair of random vectors, i.e., a transformation of the joint probability measure defined on their joint observation space. This framework, called measure transformed canonical correlation analysis (MTCCA), applies LCCA to the data after transformation of the joint probability measure. We show that judicious choice of the transform leads to a modified canonical correlation analysis, which, in contrast to LCCA, is capable of detecting non-linear relationships between the considered pair of random vectors. Unlike kernel canonical correlation analysis, where the transformation is applied to the random vectors, in MTCCA the transformation is applied to their joint probability distribution. This results in performance advantages and reduced implementation complexity. The proposed approach is illustrated for graphical model selection in simulated data having non-linear dependencies, and for measuring long-term associations between companies traded in the NASDAQ and NYSE stock markets.
Nonlinear Signal Processing Based on Empirical Intrinsic Geometry
Thu, 10 Jan 2013, 10:00
Speaker: Ronen Talmon , Yale University
In this talk, I will present a method for nonlinear signal processing based on empirical intrinsic geometry (EIG). This method provides a convenient framework for combining geometric and statistical analysis and incorporates concepts from information geometry. Unlike classic information geometry that assumes known probabilistic models, we empirically infer an intrinsic model of distribution estimates, while maintaining similar theoretical guarantees. The key observation is that the probability distributions of signals, rather than specific realizations, uncover relevant geometric information. The proposed modeling exhibits two important properties which demonstrate its advantage compared to common geometric algorithms. We show that our model is noise resilient and invariant under different observation and instrumental modalities. In addition, we show that it can be extended efficiently to newly acquired measurements in a sequential manner. These two properties make the proposed model especially suitable for signal processing. We revisit the Bayesian approach and incorporate statistical dynamics and empirical intrinsic geometric models into a unified nonlinear filtering framework. We then apply the proposed method to nonlinear and non-Gaussian filtering problems. In addition, we show applications to biomedical signal analysis and acoustic signal processing.
Multi-Target Radar Detection with Almost Linear Complexity
Sun, 06 Jan 2013, 14:00
Speaker: Alexander Fish , University of Wisconsin
We would like to know the distances to moving objects and their velocities. The radar system is built to fulfill this task. The radar transmits a waveform S which bounds back from the objects and the echo R is received. In practice we can work in the digital model, namely S and R are sequences of N complex numbers (e.g., N=1023). THE RADAR PROBLEM IS: Design S, and an effective method of extracting, using S and R, the distances and velocities of all targets. In many applications the current sequences S which are used are pseudo-random and the algorithm they support takes O(N²logN) arithmetic operations. In the lecture we will introduce the Heisenberg sequences, and a much faster detection algorithm called the Cross Method. It solves the Radar Problem in O(NlogN+m²) operations for m objects. This is a joint work with S. Gurevich (Math, Madison), A. Sayeed (EE, Madison), K. Scheim (General Motors, Herzeliya), O. Schwartz (EECS, Berkeley).
Causal Reasoning and Learning Systems
Thu, 03 Jan 2013, 10:00
Speaker: Elon Portugaly , Microsoft Research Cambridge / Bing Ads
In complex real world systems, machine learning is used to influence actions, rather than just provide predictions. Those actions in turn influence the environment of the system. The goal of machine learning in these systems is therefore causal rather than correlational. E.g. what would be the survival chance of patient A if we gave them drug B (causal question); what is the survival chance of the patient A knowing that they were given drug B (correlational question). By injecting noise into actions taken by the system, we can collect data that allows us to infer causality, and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work provides a framework (which can be viewed as a generalization of the A/B testing framework) for counterfactual causal inference in complex systems. Parts of this framework were implemented in Microsotf’s bing search advertising system, and I will show data from this implementation.
Distributed Learning and Communication Complexity
Thu, 27 Dec 2012, 10:00
Speaker: Yonatan Kibarski , The Hebrew University
Consider a dataset distributed over several computers around the world each containing several terabytes each. We describe a learning problem that learns a hypothesis over the unified dataset and show general bounds on the amount of data transfer and specific efficient algorithms. Paper by Maria-Florina Balcan, Maria-Florina Balcan, Avrim Blum, Shai Fine, Yishay Mansour.
Robust Subspace Modeling
Thu, 20 Dec 2012, 10:00
Speaker: Gilad Lerman , University Of Minnesota
Consider a dataset of vector-valued observations that consists of a modest number of noisy inliers, which are explained well by a low-dimensional subspace, along with a large number of outliers, which have no linear structure. We describe a convex optimization problem that can reliably fit a low-dimensional model to this type of data. When the inliers are contained in a low-dimensional subspace we provide a rigorous theory that describes when this optimization can recover the subspace exactly. We present an efficient algorithm for solving this optimization problem, whose computational cost is comparable to that of the non-truncated SVD. We also show that the sample complexity of the proposed subspace recovery is of the same order as PCA subspace recovery and we consequently obtain some nontrivial robustness to noise. This presentation is based on three joint works: 1) with Teng Zhang, 2) with Michael McCoy, Joel Tropp and Teng Zhang, and 3) with Matthew Coudron.
Learning patterns in Big data from small data using core-sets
Thu, 13 Dec 2012, 10:00
Speaker: Dan Feldman , MIT
When we need to solve an optimization problem we usually use the best available algorithm/software or try to improve it. In recent years we have started exploring a different approach: instead of improving the algorithm, reduce the input data and run the existing algorithm on the reduced data to obtain the desired output much faster on a streaming input, using a manageable amount of memory, and in parallel (say, using Hadoop, cloud service, or GPUs). A core-set for a given problem is a semantic compression of its input, in the sense that a solution for the problem with the (small) coreset as input yields a provable approximate solution to the problem with the original (Big) data. In this talk I will describe how we applied this magical paradigm to obtain algorithmic achievements with performance guarantees in iDiary: a system that combines sensor networks, computer vision, differential privacy, and text mining. It turns large signals collected from smart-phones or robots sensors into textual descriptions of the trajectories. The system features a user interface similar to Google Search that allows users to type text queries on their activities (e.g., "Where did I have dinner last time I visited Paris?"") and receive textual answers based on their signals.
no talk this wee due to NIPS
Thu, 06 Dec 2012, 10:00
Speaker: -------- ,
What Cannot be Learned with Bethe Approximations
Thu, 22 Nov 2012, 10:00
Speaker: Uri Heinemann , The Hebrew University
Learning the experts for online sequence prediction
Thu, 08 Nov 2012, 10:00
Speaker: Elad Eban , The Hebrew University
Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
Thu, 01 Nov 2012, 10:00
Speaker: Aharon Birnbaum , The Hebrew University
Learnability beyond uniform convergence
Thu, 25 Oct 2012, 10:00
Speaker: Shai Shalev-Swartz , The Hebrew University

Spring 2011/2012 Semester
Thu, 14 Jun 2012, 10:15
Speaker: Aviv Peretz , The Hebrew University
Student Talk
Thu, 07 Jun 2012, 10:15
Speaker: Hillel Taub-Tabib , The Hebrew University
Student Talk
Thu, 31 May 2012, 10:15
Speaker: Michael Fink , Google
Thu, 10 May 2012, 10:15
Speaker: Noam Horev , The Hebrew University
Student Talk
Thu, 03 May 2012, 10:15
Speaker: Miri Cohen , The Hebrew University
Student Talk
Thu, 29 Mar 2012, 10:15
Speaker: Hila Gonen , The Hebrew University
Student Talk
Thu, 15 Mar 2012, 10:15
Speaker: Roy Fox , The Hebrew University

Fall 2011/2012 Semester
Wed, 18 Jan 2012, 15:00
Speaker: Greg Shakhnarovich , Toyota Technological Institute at Chicago
Note the special day
Sun, 15 Jan 2012, 13:00
Speaker: Sivan Sabato , The Hebrew University
Note the special day
Wed, 11 Jan 2012, 15:00
Speaker: Ilya Stuskever , Univ. of Toronto -> Stanford
Thu, 08 Dec 2011, 10:15
Speaker: Rina Dechter , University of California Irvine (UCI)
Joint work with Vibhav Gogate. Rina is visiting HUJI for the year.
Architecture 101 for Machine Learning
Thu, 24 Nov 2011, 10:15
Speaker: Uri Weiser , The Technion
Thu, 17 Nov 2011, 10:15
Speaker: Shalev Ben David , MIT
Second of the week
Thu, 10 Nov 2011, 10:15
Speaker: Amit Daniely , The Hebrew University
Thu, 03 Nov 2011, 10:15
Speaker: Ruth Urner , The University of Waterloo
Spring 2010/2011 Semester
Simplified PAC-Bayesian Margin Bounds
Thu, 26 May 2011, 10:15
Speaker: Alon Gonen , The Hebrew University
Student Talk
Thu, 12 May 2011, 10:15
Speaker: Tali Tishby , The Hebrew University
Thu, 31 Mar 2011, 10:15
Speaker: Amit Beka , The Hebrew University
Fall 2010/2011 Semester
Thu, 16 Dec 2010, 10:15
Speaker: Nir Rosenfeld , HUJI
Student seminar talk
Thu, 02 Dec 2010, 10:15
Speaker: Nathan Srebro , TTI, Chicago
Thu, 11 Nov 2010, 10:15
Speaker: Koby Todros , Ben Gurion University
Spring 2009/2010 Semester
Thu, 17 Jun 2010, 10:15
Speaker: Tal El-Hay , The Hebrew University
Thu, 10 Jun 2010, 10:15
Speaker: Gal Elidan , The Hebrew University
Thu, 27 May 2010, 10:15
Speaker: Adam Nitzan , The Hebrew University
Thu, 06 May 2010, 10:15
Speaker: Inbal Marhaim , The Hebrew University
Thu, 18 Mar 2010, 10:15
Speaker: Roie Kliper , The Hebrew University
Fall 2009/2010 Semester
Tue, 02 Feb 2010, 10:15
Speaker: Eitan Menahem , Ben-Gurion University of the Negev
Thu, 14 Jan 2010, 10:15
Speaker: Tomer Ezra , The Hebrew University
Thu, 07 Jan 2010, 10:15
Speaker: Cobi Cario , The Hebrew University
Thu, 03 Dec 2009, 10:15
Speaker: Daniel Zoran , The Hebrew University
Spring 2008/2009 Semester
Thu, 18 Jun 2009, 10:15
Speaker: Gil Ben-Zvi , The Hebrew University
Thu, 04 Jun 2009, 10:15
Speaker: Ido Ginodi , The Hebrew University
Thu, 26 Mar 2009, 10:15
Speaker: Danny Rosenberg , The Hebrew University
Thu, 12 Mar 2009, 10:15
Speaker: Yevgeny Seldin , The Hebrew University

Fall 2008/2009 Semester
Thu, 05 Feb 2009, 10:15
Speaker: Asaf Pe'er , The Hebrew University
Thu, 15 Jan 2009, 10:15
Speaker: Shie Mannor , The Technion
Spring 2007/2008 Semester
Tue, 01 Jul 2008, 10:30
Speaker: Amit Gruber, Ohad Shamir, Tamir Hazan , The Hebrew University
Note special day and time!
Sun, 29 Jun 2008, 10:00
Speaker: Tal El-Hay , The Hebrew University
Note special day and time!
Sun, 29 Jun 2008, 16:15
Speaker: Yevgeny Seldin, Ohad Shamir , The Hebrew University
Note special day and time!
Thu, 26 Jun 2008, 10:15
Speaker: Oded Margalit , IBM
Thu, 15 May 2008, 10:15
Speaker: Inna Stainvas , Siemens

Fall 2007/2008 Semester
Thu, 01 May 2008, 10:15
Speaker: Liat Ein-Dor , Intel
Thu, 17 Apr 2008, 10:15
Speaker: Yair Wiener , Technion
Sun, 24 Feb 2008, 10:15
Speaker: Nathan Srebro , Toyota Technological Institute at Chicago
Joint learning-vision seminar (in Ross 201).
Wed, 23 Jan 2008, 10:30
Speaker: Shahar Mendelson , ANU, Technion
Joint learning-theory seminar (in Ross 201).
Thu, 03 Jan 2008, 10:15
Speaker: Omer Berkman , Tel-Aviv University
Thu, 22 Nov 2007, 10:15
Speaker: Tal El-Hay , The Hebrew University
Spring 2006/2007 Semester
Learnability beyond uniform convergence
Sun, 12 Oct 0025, 10:00
Speaker: Shai Shalev-Shwartz , HUJI
* indicates a special talk, not on the regular time-slot or place