1 Introduction
Explaining model predictions is important for transparency, accountability, and for motivating users to act on data. Good methods for generating explanations are particularly useful in domains like healthcare and finance, where explanations are an ethical and legal requirement (Amann et al., 2020). However, the field of time series explainability for deep neural networks has only recently seen attention, with the discovery that traditional explainability methods underperform on deep learning models applied in the time series domain (Ismail et al., 2020). Recent methods such as Feature Importance in Time (FIT) (Tonekaboni et al., 2020) and Temporal Saliency Rescaling (TSR) (Ismail et al., 2020) have improved performance and defined initial benchmarks, but face challenges in the breadth of their application in real-world scenarios.
In this work we explore time series explainability in the setting where there may be a delay between important feature shifts and a change in the predictive distribution. This type of temporal dependency can be important in real-world settings, where changes in input features may not instantaneously change model predictions. We demonstrate experimentally, on a new synthetic dataset, that the existing state-of-the-art method FIT fails to extend to the delayed label setting. We propose a new approach, WinIT, to address this challenge by quantifying the impact of features on the predictive distribution over multiple instances in a windowed setting. WinIT utilizes a modification of the instance-wise importance score introduced in FIT, which we refer to as Inverse FIT, that performs better in the windowed setting. We evaluate WinIT on real-world clinical data and find that it outperforms FIT by a significant margin. In summary, our main contributions are:


Extending FIT to work with lookback windows that improve performance on datasets where there is a time delay between the observation of important features and a corresponding shift in the label. We show how to evaluate performance on the label delay problem with a new synthetic dataset.

Reformulating the counterfactual explanation method of FIT in a more efficient manner, suitable for use in a windowed setting.

Our results show that combining these methods leads to an improvement in explanation performance on the real-world clinical MIMIC-mortality task.
2 Background
Traditional perturbation-based and model-based methods have shown limited success in the time series domain. Gradients, Integrated Gradients, GradientSHAP, DeepLIFT (Shrikumar et al., 2017), and DeepSHAP (Lundberg and Lee, 2017) all leverage model gradients to generate feature importance, but do not directly consider the temporal nature of the problem. Perturbation-based methods like feature occlusion (Zeiler and Fergus, 2014) and feature ablation (Suresh et al., 2017) are model-agnostic methods which measure how changes to the input features relate to changes in model prediction. RETAIN learns attention scores over the input features (Choi et al., 2016). LIME learns explainable models locally around a prediction, applied at every time step in the time series domain (Ribeiro et al., 2016).
Recent benchmarks (Ismail et al., 2020; Tonekaboni et al., 2020) evaluate these traditional explainability methods on time series problems in both simulated and real-world experiments. By separating the importance calculation along the time and feature input dimensions, Ismail et al. (2020) find that the performance of the existing methods can be improved. In contrast, Tonekaboni et al. (2020) propose a new method, FIT, that measures each observation’s contribution to the predictive distribution shift of the model over time in order to provide better explanations in certain settings. However, FIT is limited to measuring the importance of instantaneous shifts in the predictive distribution. We define an instantaneous shift as one where the important observations from the input change the model prediction immediately; that is, the important observation and the change in prediction occur at the same time step. The assumption that feature shift and prediction shift occur simultaneously does not always hold in practice. In real-world applications there can be a delay between an important feature shift and a change in outcome. Explanation methods for time series predictions need to perform well under such temporal dependencies. For this reason we present a new method, WinIT, which reformulates the FIT algorithm to make it more efficient, while also attributing correct feature importance for non-instantaneous changes in the predictive distribution.
3 Notation
Let $X \in \mathbb{R}^{D \times T}$ be a sample of a multivariate time series with $D$ features and $T$ time steps. We denote $[D]$ to be the set $\{1, \dots, D\}$. We also let $x_t \in \mathbb{R}^{D}$ be the set of all observations at a particular time $t$ and $X_{0:t} = [x_0, \dots, x_t]$. Let $y_t \in \{1, \dots, K\}$ be the label at each time step for a classification task with $K$ classes. Let $S \subseteq [D]$ be a subset of features of interest and $x_{S,t}$ be the observations of that subset at time $t$. We also define $S^c = [D] \setminus S$ as the set complement of the features of interest. For a model, $f_\theta$, that estimates the conditional distribution $p(y \mid X_{0:t})$ at each time step, we aim to provide a feature importance score $I(x_{S,t})$ for each set of observations using the observations up to that time step, $X_{0:t}$.
For feature importance methods that calculate scores over a set of time steps, we let $n \in \{0, \dots, N-1\}$ index the lookback window up to a maximum window size of $N$. Then $x_{S,t-n:t}$ represents the set of observations of the subset of features of interest over a set of time steps of length $n+1$ and $X_{0:t-n-1}$ represents the historical observations for all features before that window. We also refer to the absmax function, which in our implementation finds the entry with the maximum absolute value but returns its signed value rather than its absolute value.
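The absmax behaviour described above can be illustrated with a short Python sketch. The function name and the list-based input are illustrative assumptions, not the authors' implementation.

```python
def absmax(values):
    """Return the entry with the largest magnitude, keeping its sign."""
    values = list(values)
    if not values:
        raise ValueError("absmax() needs at least one value")
    # key=abs ranks entries by magnitude; max() returns the original signed entry.
    return max(values, key=abs)


window_scores = [0.02, -0.75, 0.40]
print(absmax(window_scores))  # -0.75
```

Keeping the sign means a strongly negative contribution is not masked by a weaker positive one.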
4 Methods
In this section we introduce our approach, WinIT. We first review the FIT importance score in Section 4.1. We then present Inverse FIT, a modified version of the importance score, in Section 4.2. In Section 4.3 we present WinIT, which extends Inverse FIT using a windowed approach to computing feature importance for non-instantaneous changes in the predictive distribution.
4.1 FIT
Proposed by Tonekaboni et al. (2020), FIT defines an importance score for a subset of features $S$ at time $t$, given by a set of observations $x_{S,t}$. It measures how well the partial conditional distribution, where only a subset of features are observed at time $t$, $p(y \mid X_{0:t-1}, x_{S,t})$, approximates the full predicted distribution $p(y \mid X_{0:t})$. This is characterized by the KL divergence between these two distributions and is referred to as the “unexplained” predictive distribution shift. It is measured with respect to the total “temporal” shift from time $t-1$ to time $t$, given by the KL divergence between the model prediction at time $t$ and the model prediction at time $t-1$, $\mathrm{KL}\big(p(y \mid X_{0:t}) \,\|\, p(y \mid X_{0:t-1})\big)$. The FIT importance score for a set of observations at time $t$ is then given by the difference between the “temporal” distribution shift and the “unexplained” distribution shift:
$$I(x_{S,t}) = \mathrm{KL}\big(p(y \mid X_{0:t}) \,\|\, p(y \mid X_{0:t-1})\big) - \mathrm{KL}\big(p(y \mid X_{0:t}) \,\|\, p(y \mid X_{0:t-1}, x_{S,t})\big) \tag{1}$$
To compute the partial predictive distribution $p(y \mid X_{0:t-1}, x_{S,t})$, FIT marginalizes over the complement feature set at time $t$, $x_{S^c,t}$, by sampling from the counterfactual distribution $p(x_{S^c,t} \mid X_{0:t-1}, x_{S,t})$ approximated by a generative model $G$.
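As a minimal sketch of the score described above, the following Python snippet computes the difference between the temporal and the unexplained shift from three categorical distributions. The function names, the list representation of distributions, and the example probabilities are assumptions made for illustration. In practice the partial distribution would come from Monte Carlo samples of the generative model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fit_score(p_full, p_prev, p_partial):
    """Temporal shift minus unexplained shift, following the verbal definition."""
    temporal = kl_divergence(p_full, p_prev)        # prediction at t vs. at t-1
    unexplained = kl_divergence(p_full, p_partial)  # prediction at t vs. partial
    return temporal - unexplained


# Hypothetical two-class predictions.
p_prev = [0.9, 0.1]       # prediction before the new observation
p_full = [0.2, 0.8]       # prediction with all features observed at time t
p_partial = [0.25, 0.75]  # prediction with only the subset S observed at time t
print(fit_score(p_full, p_prev, p_partial))
```

If the partial distribution matches the full one, the score equals the whole temporal shift. If it matches the previous prediction, the score is zero.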
4.2 Inverse FIT
The FIT algorithm quantifies the predictive distribution shift explained by an observation by calculating the difference between the unexplained distribution shift and the total temporal distribution shift. An alternative approach is to directly compute the explained distribution shift. We can measure the importance of features $S$ at time $t$ by quantifying how well the partial conditional distribution, $p(y \mid X_{0:t-1}, x_{S^c,t})$, where only the complement set of features, $S^c$, are observed at time $t$, approximates the true predictive distribution $p(y \mid X_{0:t})$.
We call this modification Inverse FIT (IFIT). The new formulation of an instance-wise feature importance score is:
$$I(x_{S,t}) = \mathrm{KL}\big(p(y \mid X_{0:t}) \,\|\, p(y \mid X_{0:t-1}, x_{S^c,t})\big) \tag{2}$$
Similar to FIT, we compute the partial predictive distribution $p(y \mid X_{0:t-1}, x_{S^c,t})$ by using Monte Carlo integration to marginalize over $x_{S,t}$ by sampling from a generator that approximates the distribution $p(x_{S,t} \mid X_{0:t-1})$. This approach is outlined in Algorithm 1.
It is important to note that the Inverse FIT importance score, Equation 2, is not equivalent to the FIT score. In particular, Inverse FIT does not consider the overall shift in the predictive distribution from $t-1$ to $t$. This approach performs well when extended to calculating feature importance over a window of time steps, as shown in Section 5. Furthermore, we find that Inverse FIT achieves similar performance to FIT, but is faster, as seen in Table 1. This is due to the different generator that can be used (per-feature, rather than joint) and is relevant in the case where $|S| = 1$, as evaluated in our experiments. For larger set sizes the runtime may vary.
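The Monte Carlo procedure described above can be sketched in Python as follows. This is not the authors' Algorithm 1. The toy predictor, the toy generator, the function names and the sample count are all assumptions chosen so that the example runs on its own.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def toy_predict(series):
    """Hypothetical classifier: class 1 becomes likely when feature 0 is large."""
    p1 = 1.0 / (1.0 + math.exp(-4.0 * (series[-1][0] - 1.0)))
    return [1.0 - p1, p1]


def toy_generator(history, feature, rng):
    """Hypothetical per-feature generator: last observed value plus noise."""
    return rng.gauss(history[-1][feature], 0.1)


def ifit_score(history, x_t, feature, n_samples=50, seed=0):
    """KL between the full prediction and the prediction with `feature` resampled."""
    rng = random.Random(seed)
    p_full = toy_predict(history + [x_t])
    p_partial = [0.0] * len(p_full)
    for _ in range(n_samples):
        x_cf = list(x_t)
        x_cf[feature] = toy_generator(history, feature, rng)  # counterfactual value
        for k, pk in enumerate(toy_predict(history + [x_cf])):
            p_partial[k] += pk / n_samples  # Monte Carlo average
    return kl_divergence(p_full, p_partial)


history = [[0.0, 0.0], [0.1, 0.2]]
x_t = [3.0, 0.1]  # spike in feature 0 at the current time step
score_spike = ifit_score(history, x_t, feature=0)
score_other = ifit_score(history, x_t, feature=1)
print(score_spike, score_other)
```

In this toy setup the spiking feature receives a large score, while the feature that the predictor ignores receives a score of essentially zero.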
FIT is limited to measuring instantaneous changes in the predictive distribution, because only the most recent time step of input is considered when computing importance for a given prediction. The importance score in Equation 1 is equal to the predictive distribution shift from time step $t-1$ to $t$ explained by the observation $x_{S,t}$. However, for sequential models, the observation $x_{S,t}$ could also influence any of the predictive distributions from time $t$ onwards. This importance is not captured in the FIT algorithm. In the next section we extend the IFIT method to address this limitation.
4.3 WinIT
We formulate an extension of IFIT that uses a window of past observations when attributing importance for a given prediction, and call this WinIT. For a prediction at time $t$, with a window size of $N$, we compute importance scores for the observations $x_{S,t-N+1}, \dots, x_{S,t}$. For a set of observations $x_{S,t-n}$, the sum of the importance scores for all remaining time steps $t-n+1$ to $t$ is subtracted from the total importance score for time steps $t-n$ to $t$, to get the observation score at time $t-n$. When $n = 0$ the importance score for remaining time steps is zero. Because the KL divergences in a sequence of windows cancel out in subsequent scores, this can be rewritten as the difference between the current time step and the following time step, as seen in Equation 3. Here $I(x_{S,t-n}, t)$ represents the importance of the feature subset $S$ at time step $t-n$ that affects the prediction at time step $t$.
Since WinIT generates scores for each time step, but the explainability methods were evaluated based on a single importance score for each observation on individual features (no subsets), the final scores must be aggregated. To generate a single observation score, the absolute maximum value across all of the windows is computed, as shown in Equation 4. This is used in order to capture the most important contribution of each observation in the final importance score. In the benchmark experiments, important contributions tend to be sparse; a different aggregation metric such as an average would not properly capture infrequent but important contributions.
$$I(x_{S,t-n}, t) = \mathrm{KL}\big(p(y \mid X_{0:t}) \,\|\, p(y \mid X_{0:t-n-1}, x_{S^c,t-n:t})\big) - \mathrm{KL}\big(p(y \mid X_{0:t}) \,\|\, p(y \mid X_{0:t-n}, x_{S^c,t-n+1:t})\big) \tag{3}$$
$$I(x_{S,t}) = \operatorname{absmax}_{n \in \{0, \dots, N-1\}} I(x_{S,t}, t+n) \tag{4}$$
This leads to a new formulation of an instance-wise feature importance score, now taken over multiple overlapping windows, described in Algorithm 2.
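The windowed scoring and absmax aggregation described in this section can be sketched as below. This is a hedged illustration rather than the authors' Algorithm 2. The delayed toy predictor, the median-based toy generator, the zero divergence assigned to an empty window, and all function names are assumptions.

```python
import math
import random
import statistics


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def absmax(values):
    return max(values, key=abs)


def toy_predict(series, delay=2):
    """Hypothetical classifier reacting to feature 0 as it was `delay` steps ago."""
    x = series[-1 - delay][0] if len(series) > delay else 0.0
    p1 = 1.0 / (1.0 + math.exp(-4.0 * (x - 1.0)))
    return [1.0 - p1, p1]


def toy_generator(history, feature, rng):
    """Hypothetical counterfactual: noisy median of the feature's history."""
    return rng.gauss(statistics.median(row[feature] for row in history), 0.1)


def window_divergence(series, t, n, feature, n_samples, rng):
    """KL between the prediction at t and the prediction with `feature`
    replaced by generated values on steps t-n..t."""
    p_full = toy_predict(series[: t + 1])
    p_partial = [0.0] * len(p_full)
    for _ in range(n_samples):
        cf = [list(row) for row in series[: t + 1]]
        for s in range(t - n, t + 1):
            cf[s][feature] = toy_generator(cf[:s], feature, rng)
        for k, pk in enumerate(toy_predict(cf)):
            p_partial[k] += pk / n_samples
    return kl_divergence(p_full, p_partial)


def winit_scores(series, feature, window, n_samples=30, seed=0):
    rng = random.Random(seed)
    contributions = [[] for _ in series]
    for t in range(1, len(series)):
        previous = 0.0  # divergence for an empty window (assumed zero)
        for n in range(min(window, t)):
            current = window_divergence(series, t, n, feature, n_samples, rng)
            contributions[t - n].append(current - previous)  # difference of windows
            previous = current
    # Aggregate each observation's contributions to later predictions.
    return [absmax(c) if c else 0.0 for c in contributions]


series = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.2], [3.0, 0.1],
          [0.1, 0.0], [0.0, 0.1], [0.1, 0.2], [0.0, 0.0]]
scores = winit_scores(series, feature=0, window=3)
print([round(s, 3) for s in scores])
```

With a delay of two steps and a window of three, the spike at time step 3 is the only observation whose removal changes a later prediction, so it receives the dominant score.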
5 Experiments
For the following experiments, an RNN-based predictor is trained on the training dataset, and the explainability methods are evaluated on the test dataset. A recurrent latent variable generator (Chung et al., 2015) is trained on the training dataset for the FIT and WinIT models.
To evaluate the explainability methods on experiments with simulated data, ground truth importance scores are defined. An observation is given a ground truth importance score of 1 if it causes the label to change. All other observations have a ground truth importance score of 0. Explainability methods are evaluated against the ground truth using AUROC and AUPRC, where the ranking score is calculated per-sample by ranking all the feature instances by importance and then averaged over the entire dataset. For the real-world clinical data, no ground truth feature importance is available. Instead, the explainability methods are evaluated based on the AUROC drop after the top K=50, or top 5%, of observations with the highest importance scores are removed from the test dataset by carrying forward the previous values.
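The removal step used for the clinical evaluation can be illustrated with the following sketch, which replaces the K highest-scoring observations of one sample by carrying forward the previous value of the same feature. The function name, the nested-list layout (time by feature), and the choice to leave the first time step untouched are assumptions. The AUROC drop would then be measured by re-running the predictor on the masked data.

```python
def remove_top_k(series, scores, k):
    """Carry forward the previous value for the k highest-scoring observations."""
    n_steps, n_features = len(series), len(series[0])
    ranked = sorted(
        ((scores[t][d], t, d) for t in range(n_steps) for d in range(n_features)),
        reverse=True,
    )
    to_remove = {(t, d) for _, t, d in ranked[:k]}
    masked = [list(row) for row in series]
    # Going forward in time lets runs of removed values inherit the last kept one.
    for t in range(1, n_steps):  # step 0 has no previous value and is left as is
        for d in range(n_features):
            if (t, d) in to_remove:
                masked[t][d] = masked[t - 1][d]
    return masked


series = [[1, 10], [2, 20], [9, 30], [4, 40]]
scores = [[0.0, 0.1], [0.1, 0.0], [0.9, 0.2], [0.8, 0.1]]
masked = remove_top_k(series, scores, k=2)
print(masked)  # [[1, 10], [2, 20], [2, 30], [2, 40]]
```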
5.1 Simulated Data
Spike is a benchmark experiment presented in Tonekaboni et al. (2020) which uses a multivariate dataset composed of 3 random NARMA time series with random ‘spikes’, immediate large increases, added to the samples. The label is 0 until a spike occurs in the first feature, at which point it changes to 1 for the rest of the sample. As shown in Table 1, WinIT shows similar performance to FIT on the Spike benchmark, with FIT having the highest AUROC, and the AUPRC of the two methods being the same within one standard deviation.
Spike
Method  AUROC          AUPRC          Time (s)
FIT     0.994 ± 0.002  0.852 ± 0.098  394.78
IFIT    0.954 ± 0.006  0.844 ± 0.081  70.91
WinIT   0.965 ± 0.002  0.905 ± 0.048  449.41

Delayed Spike (d=2)
Method  AUROC          AUPRC          Time (s)
FIT     0.516 ± 0.035  0.002 ± 0.001  340.00
WinIT   0.970 ± 0.006  0.909 ± 0.029  455.03
To demonstrate the temporal dependency effect, we present a simple experiment using simulated data as a modification of the Spike data. Three independent NARMA sequences are generated and linear trends are added to two of the features. Spikes are then added following the same procedure as the Spike data. However, the time step at which the label changes is different. In the Spike data the label changes immediately to 1 after encountering the first spike. In the Delayed Spike data the label changes after $d$ time steps. To measure the accuracy of the explainability methods, we define the ground truth importance score as 1 for the first spike and 0 for all other observations. As shown in Table 1, since FIT only considers the observations from time steps $t-1$ to $t$ as they relate to the prediction at $t$, it is unable to assign importance to the correct observation, which occurs at time step $t-d$. However, since the spike falls in the window $[t-N+1, t]$, WinIT is able to assign importance to the correct observation.
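The labelling rule for the delayed setting can be sketched as follows. The NARMA generation and trend terms are not reproduced here. The function names, the convention that the label flips exactly `delay` steps after the first spike, and the example spike positions are assumptions for illustration.

```python
def delayed_labels(spike_times, length, delay):
    """Label is 0 until `delay` steps after the first spike, then 1."""
    if not spike_times:
        return [0] * length
    change = min(spike_times) + delay
    return [1 if t >= change else 0 for t in range(length)]


def ground_truth_importance(spike_times, length):
    """Only the first spike in the first feature is marked as important."""
    truth = [0] * length
    if spike_times:
        truth[min(spike_times)] = 1
    return truth


spikes = [3, 6]  # hypothetical spike positions in the first feature
print(delayed_labels(spikes, length=8, delay=2))   # [0, 0, 0, 0, 0, 1, 1, 1]
print(ground_truth_importance(spikes, length=8))   # [0, 0, 0, 1, 0, 0, 0, 0]
```

Setting `delay=0` gives the undelayed case under this convention, where the important observation and the label change share a time step.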
We also show performance of the IFIT and WinIT models as an ablation study in Table 1. This reveals some of the trade-offs between runtime and performance for the different methods, with IFIT taking considerably less time than FIT, while the windowed WinIT has the longest runtime. Methods that do not use a lookback window achieve poor results on the Delayed Spike data, reflected in the low AUPRC of both IFIT and FIT.
5.2 Clinical Data
MIMIC-III is a multivariate time series clinical dataset with a number of vital and lab measurements taken over time for around 40,000 patients at the Beth Israel Deaconess Medical Center in Boston, MA (Johnson et al., 2016). MIMIC-III is used in the FIT paper to construct the MIMIC-mortality experiment, which uses 8 vital and 20 lab measurements hourly over a 48-hour period to predict patient mortality. As shown in Table 2, the WinIT method is a significant improvement over FIT and other explainability methods on the MIMIC-mortality experiment. In fact, AUC Drop improves over FIT from 0.038 to 0.094 for the top 5% of features and from 0.138 to 0.188 for the top 50 features.
Adjusting the window size can lead to different performance in all settings. WinIT with different lookback windows of size 1, 5, 10, and 15 shows improving performance in AUC Drop (Top 5%) in the real-world setting of the MIMIC-mortality task, as seen in Figure 2. AUC Drop (Top 50), while outperforming all other methods, does exhibit more variance and does not improve with window size. This may be because only a few features benefit from the additional information related to delays between feature changes and label changes. It may also be that globally important features from the Top 5% display more temporal dependence than other features.
Method    AUC Drop (95pc)  AUC Drop (k=50)
AFO       0.023 ± 0.003    0.068 ± 0.003
FO        0.028 ± 0.006    0.095 ± 0.042
DeepLift  0.045 ± 0.004    0.067 ± 0.038
IG        0.036 ± 0.003    0.056 ± 0.014
RETAIN    0.020 ± 0.014    0.032 ± 0.019
LIME      0.028 ± 0.000    0.032 ± 0.019
GradSHAP  0.036 ± 0.000    0.065 ± 0.062
FIT       0.038 ± 0.005    0.138 ± 0.037
WinIT     0.094 ± 0.003    0.188 ± 0.015
5.3 Saliency Maps
As a sanity check on the explanations provided by WinIT we show saliency maps from instances in the Spike and Delayed Spike datasets in Figure 1 and Figure 3. In the Delayed Spike example it is clear how FIT fails to identify the important observations, instead providing a higher average score to all observations. FIT also suffers from a common problem in time series explanations, where it overweights features that occur in the same time step as an important observation. This can be seen in both figures. WinIT, on the other hand, sees failure cases for the Spike dataset when multiple spikes appear close together.
We also show comparisons of FIT and WinIT saliency maps for an instance from the MIMIC-mortality task in Figure 4. In this case the overweighting of time steps is even more apparent in the FIT explanation due to the larger number of features.
6 Conclusion
In this work we propose WinIT, a method for time series explainability that allows for attributing correct importance to observations for non-instantaneous changes in a time series model’s predictive distribution. WinIT uses a windowed approach to computing the feature importance and is based on a modification of the FIT importance score that performs well in the windowed setting. WinIT is comparable to FIT on the Spike benchmark, and significantly outperforms FIT on the proposed Delayed Spike data, where changes in the model’s predictive distribution are not instantaneous, as well as on the real-world MIMIC-mortality task.
In the future, we hope to evaluate WinIT on the other benchmark experiments, as well as on new simulated and real-world experiments, to better understand where temporal dependencies make the greatest impact. The methods we present can be further optimized through selection and tuning of the generative methods used and their application to different kinds of real-world data.
References

Amann et al. (2020). Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Medical Informatics and Decision Making 20 (310).
Choi et al. (2016). RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29.
Chung et al. (2015). A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28.
Ismail et al. (2020). Benchmarking deep learning interpretability in time series predictions. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6441–6452.
Johnson et al. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035.
Lundberg and Lee (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 4768–4777.
Ribeiro et al. (2016). “Why should I trust you?”: explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Shrikumar et al. (2017). Learning important features through propagating activation differences. CoRR abs/1704.02685.
Suresh et al. (2017). Clinical intervention prediction and understanding using deep networks. CoRR abs/1705.08498.
Tonekaboni et al. (2020). What went wrong and when? Instance-wise feature importance for time-series black-box models. In Advances in Neural Information Processing Systems, Vol. 33, pp. 799–809.
Zeiler and Fergus (2014). Visualizing and understanding convolutional networks. European Conference on Computer Vision.