Exponential Smoothing for Off-Policy Learning
ICML 2023, Oral


Ever grappled with off-policy evaluation (OPE) and learning (OPL)? If so, you know the main pain point: the popular inverse propensity scoring (IPS) estimator suffers from high variance. Here's some exciting news: we've designed a smooth regularization for IPS that introduces some bias in exchange for lower variance. But wait, there's more! We back this up with a scalable PAC-Bayesian bound that does not rely on the widely used assumption of bounded importance weights. The icing on the cake? Our bound also holds for standard IPS, without assuming uniform coverage by the logging policy.
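
To make the bias-variance trade-off concrete, here is a minimal Python sketch. Going by the title, it assumes the regularization raises the logging propensity to a power alpha in [0, 1], so that alpha = 1 recovers standard IPS. That form, the names `smoothed_ips` and `simulate`, the use of rewards rather than costs, and the three-action toy problem are all illustrative assumptions, not details taken from the text above.

```python
import numpy as np


def smoothed_ips(target_probs, logging_probs, rewards, alpha=1.0):
    """Estimate a target policy's value from logged bandit data.

    Assumed form (not stated in the text above): each logged reward is
    weighted by pi(a|x) / pi_0(a|x)**alpha with alpha in [0, 1].
    alpha = 1 gives standard IPS; smaller alpha shrinks large weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target_probs = np.asarray(target_probs, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    weights = target_probs / logging_probs ** alpha
    return float(np.mean(weights * rewards))


def simulate(alpha, n_logs=1000, n_runs=200, seed=0):
    """Return the mean and std of the estimate over repeated synthetic logs."""
    rng = np.random.default_rng(seed)
    # Hypothetical toy problem: the logging policy rarely plays the action
    # that the target policy prefers, so standard IPS weights get large.
    pi_0 = np.array([0.90, 0.09, 0.01])      # logging policy
    pi = np.array([0.10, 0.10, 0.80])        # target policy
    mean_reward = np.array([0.2, 0.5, 0.8])  # Bernoulli reward means
    estimates = []
    for _ in range(n_runs):
        actions = rng.choice(3, size=n_logs, p=pi_0)
        rewards = rng.binomial(1, mean_reward[actions])
        estimates.append(
            smoothed_ips(pi[actions], pi_0[actions], rewards, alpha)
        )
    return float(np.mean(estimates)), float(np.std(estimates))


if __name__ == "__main__":
    for alpha in (1.0, 0.7):
        mean, std = simulate(alpha)
        print(f"alpha={alpha}: mean={mean:.3f}, std={std:.3f}")
```

In this toy setup the true value of the target policy is 0.71. With alpha = 1 the estimates should scatter widely around that value, while alpha = 0.7 should give a visibly tighter spread at the price of a systematic underestimate, which is the trade-off the regularization is meant to control.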