Modeling Lengthy Behavioral Log Data for Customer Churn Management: A Representation Learning Approach
Marketing Science Institute Working Paper Series 2022
Report No. 22-101

Modeling Lengthy Behavioral Log Data for Customer Churn Management: A Representation Learning Approach
Daehwan Ahn, Dokyun Lee, and Kartik Hosanagar

MSI Working Papers are distributed for the benefit of MSI corporate and academic members and the general public. Reports are not to be reproduced or published in any form or by any means, electronic or mechanical, without written permission.
Modeling Lengthy Behavioral Log Data for Customer Churn Management: A Representation Learning Approach Daehwan Ahn1, Dokyun Lee2, Kartik Hosanagar3 {ahndh1, kartikh3}@wharton.upenn.edu, dokyun@bu.edu2 University of Pennsylvania13, Boston University2 Abstract Despite the success of recent deep learning models in churn prediction, they can only address short sequences of lengths ranging from hundreds to thousands of events. In practice, however, customer behavioral log data has very long sequences that can extend up to millions of events in length, which can only be utilized through manual and onerous feature engineering. This approach requires domain expertise and is exposed to human error. We introduce an automated log processing approach that combines several representational learning (i.e., transforming data algorithmically to maximize signals) frameworks to extract valuable signals from the entirety of lengthy log data. Our model combines a graph-based embedding method with flexible neural networks to focus on sequence length and long-term dependencies, given relatively lower dimensions of sequence-event type, and efficiently extracts useful representations beneficial for churn prediction. The model improved prediction performance up to 55% when compared to existing manual feature engineering approaches developed by a global game company and recent deep learning models. Such improvement could nontrivially increase our collaborating company’s value by increasing the customer lifetime value of loyal customers. Additionally, our approach can reduce up to 99% of human labor and computational costs. The performance lift increases as sequence length increases. Managerial implications and applications to other sequence data are discussed. Keywords: Churn, Customer Log Data, Lengthy Log, Deep Learning, Sequence Embedding, CLV. Marketing Science Institute Working Paper Series
1 Introduction Because the benefits of customer retention are well documented and studied, there have been calls for proactive churn management across a variety of industries (Ascarza et al., 2017). Gallo (2014) has documented that acquiring a new customer is 5–25 times more costly than retaining an existing one. A case study among financial services showed that a 5% increase in retention could raise a company's profits by more than 25% (Reichheld & Detrick, 2003). Considering that CAC (Customer Acquisition Cost) has grown by nearly 50% over the past 5 years (Campbell, 2019), churn prediction and management have become central to the application of data science in business (Ahn et al., 2020). This stream of research has suggested relevant proactive churn management strategies by utilizing various methodological backgrounds, including modeling (Ascarza & Hardie, 2013; Braun & Schweidel, 2011; Lemmens & Gupta, 2020), randomized experiments (Godinho de Matos et al., 2018), and predictive analytics (Lemmens & Croux, 2006). Breakthroughs in machine learning (ML) technology have enabled predictive analytics to deliver superior performance by incorporating a wealth of new, structured and unstructured data (LeCun et al., 2015). Representation learning―a set of methods that permit algorithms to automatically discover various levels of transformed features from large-scale raw data―flexibly allows the use of various input data types (Bengio et al., 2013), such as short logs (Arnaldo et al., 2017), images (Krizhevsky et al., 2012), text (Collobert et al., 2011), and graphs (Hamilton et al., 2017). Furthermore, new techniques for automated and nonlinear feature learning have helped to extract more subtle signals from data in contrast to human-engineered features (Bengio et al., 2013). These advantages have recently produced fruitful research on the use of churn prediction to directly extract useful information from rich yet raw modern datasets. Consequently, there has been significant interest in incorporating large-scale customer log data into churn detection and prediction in business domains (Ahn et al., 2020). A customer log is a sequential record of the communication between a service and the users of that service (Peters, 1993). Since a log is the most common data collection form and contains a wealth of information, such as microsecond-level transactions, it is considered a key source for big data analytics (Chen et al., 2014; Dumais et al., 2014; Oliner et al., 2012). However, in its original form, a log is unusable for analysis; managers need additional data and modeling work to obtain value for the business (Bhadani & Jothimani, 2016). Log processing is tricky due to its complexity―event sequences consist of various numeric, categorical, and unstructured features and have arbitrary lengths for the same time interval. Information loss is also Marketing Science Institute Working Paper Series
inevitable during a traditional statistics-based (i.e., averaging, counting) aggregation process (Marvasti, 2010; McKee & Miljkovic, 2007). Thus, recent studies have begun to utilize powerful representation learning models for log analysis. These studies show that sequential deep learning (DL) models can capture time-varying dynamics from customers’ log data (Ahn et al., 2020; Sarkar & De Bruyn, 2021) in domains such as marketing (Hu et al., 2018), financial services (Vo et al., 2021), healthcare (Kwon et al., 2021), online games (Guitart et al., 2018), etc. Within the marketing context, Sarkar and De Bruyn (2021) showed that DL-based feature learning defeated 267 of 271 hand-crafted models that applied a wide variety of variables and modeling approaches, often by a wide margin. Similarly, Ahn et al. (2020) compared more than 100 papers on churn prediction in various business domains and concluded that DL models offer the best performance due to their powerful feature-learning ability, which captures subtle patterns in a vast amount of log data. However, these studies only worked with shorter sequences of log data, because current approaches were originally designed for addressing high-dimensional but short-length inputs (e.g., natural languages like speech and text) ranging from hundreds to thousands of events in sequence length (e.g., canonical RNNs [P. J. Liu et al., 2018], Transformer [Vaswani et al., 2017]).1 Thus, customer log data, which are commonly available and can extend up to millions of events in sequence length, cannot be directly used; this raises critical managerial challenges. Currently, managers have two options to address lengthy logs: the first is utilizing only short-period sequences through data truncation (Chollet, 2017; Géron, 2018), and the second is the aggregation of long sequences (Jurgovsky et al., 2018; Whitrow et al., 2009). Truncation keeps a designated number of recent behaviors and excludes the rest while aggregation summarizes long sequences through statistical measures, such as average and count. Although the truncation approach can capture time-dynamics in dense logs, it often misses long-term dependencies between customer behaviors and management activities, which is often a key interest in churn literature (Ataman et al., 2010; Jedidi et al., 1999; Mitra & Golder, 2006; Sloot et al., 2006). When using the aggregation approach, managers suffer from the following issues: aggregation requires onerous and lengthy manual feature/data work (Bengio et al., 2013; Heaton, 2016; Z. Liu et al., 2020; Munk et al., 2010; Ng, 2013; Press, 2016); human-engineered features often miss important signals that powerful representation learning can capture (Bengio et al., 2013; Marvasti, 2010; McKee & Miljkovic, 2007); and the performance heavily relies on the analyst’s expertise 1 We discuss in detail why prevalent language models, such as Transformers, are not appropriate for lengthy log data in section 4.1. While there is no theoretical limits to sequential length that these algorithms can take, for computational reasons, most are capped at 512 events (words or tokens). It is infeasible, if not impossible, to process sequences significantly longer than thousands in length while maintaining performance with current computational power. Marketing Science Institute Working Paper Series
and domain knowledge to craft relevant predictors (Sarkar & De Bruyn, 2021). The more the data complexity (i.e., size and type) increases, the more severe these limitations become. To fill this gap, we propose an automated log processing framework that combines an unsupervised sequence embedding approach (i.e., finds lower-dimensional representation while maintaining desirable properties of the data) with flexible neural networks to efficiently extract nonlinear signals from lengthy logs pertinent to customers’ churning patterns. Our approach uses a graph-structured embedding method to map a user’s long journey of service consumption into a highly informative yet low-dimensional space. This sequence embedding can summarize multiple sequences that are millions of events long into shorter- length representations as vector sequences readily feedable to any ML/DL model. To evaluate our framework, we used a large-scale data set from a leading global game company. Our model improved prediction performance (as measured by ROC AUC, PR AUC, and F1-score, defined in section 5.1) by a minimum of 5% and a maximum of 55% when compared to benchmark ML models using manual feature engineering approaches developed by a global game company (representing years of industry know-how) and to recent DL models. Back-of-the-envelope calculations within an online gaming context suggest that these improvements could nontrivially increase the value of our collaborating company by increasing the customer lifetime value (CLV) of their loyal customers (Appendix C). We estimate that the company needs to spend only 1% of their current cost in human labor (for onerous feature engineering modeling) and in computation (section 5.3.3). Further analyses show that improved performance results from using the entire log sequence and the performance gain increases as the sequence length increases. We conclude with managerial implications, generalizability, and other use cases. 2 Related Work Three streams of literature are closely related to our work: (1) customer churn models in management science, (2) log analysis for churn prediction, and (3) sequence feature embedding techniques in machine learning. The majority of research in management science has focused on modeling customer churn behavior based on aggregate-level data (Bachmann et al., 2021) that is rooted in theory (Ascarza et al., 2017; Fader & Hardie, 2009). This stream of work has evaluated the impact of various human-engineered features on churn, including customer heterogeneity and cross-cohort effects in marketing-mix activities (Schweidel et al., 2008), the frequency and amount of direct marketing activities across individuals and over time (Schweidel & Knox, 2013), customers’ complaints and recoveries (Knox & Van Oest, 2014), and Marketing Science Institute Working Paper Series
customers’ service experiences (e.g., the frequency and recency of past purchases [Braun et al., 2015]). Though this stream of research is theoretically elegant, it often lacks precision with individual-level predictions, especially when capturing time-varying contexts of sequential data (Bachmann et al., 2021). Furthermore, as digitization delivers new kinds of data about customers, capturing complex patterns from a vast amount of log data is increasingly essential to business analytics. Our work contributes to this stream by proposing a model that easily incorporates not only time-varying dynamics but also the flexible nonlinearities of rich modern datasets. Recent advances of incorporating log data through the applications of machine learning have seen great success in churn prediction problems (Ahn et al., 2020). Theoretically, using log data itself, rather than the aggregation process typically used in existing churn models, can help avoid critical information losses (Marvasti, 2010; McKee & Miljkovic, 2007). Methodologically, models that use log data operate by directly extracting time-varying churn signals through sequential DL models. However, this body of work has only addressed sequences of shorter length, typically ranging in the hundreds, due to the recurrent architecture of canonical RNNs (e.g., LSTM, GRU), which causes gradient decay over layers (Li et al., 2018). The approach of Transformer models is also infeasible because the computational burdens increase quadratically with the sequence length (Beltagy et al., 2020). Considering that logs can be millions of events long, the application of modern churn approaches is limited, and managers are still relying on inefficient manual processes that have not kept pace with the quantitative and qualitative evolution of big data (Chen et al., 2014). Our work combines two representation learning frameworks to utilize the lengthy log data commonly available in various business fields in their entirety and automatically. Finally, sequence feature embedding provides a distilled representation for the abundantly available sequential business data, such as clickstreams, content consumption histories, weblogs, and social network services. However, the challenge of sequence processing―regarding both sequential DL models and sequence feature embedding models―is effectively capturing long-term dependencies (i.e., the relationship between distant elements in a sequence) while managing computational scalability for both the sequence length and the vocabulary size (i.e., the unique number of elements in sequences) (Ranjan et al., 2016). In other words, the trade-offs between performance, sequence length, and vocabulary size are unavoidable. For example, though Transformer has recently exhibited state-of-the-art performance in various sequential tasks regarding large vocabulary size (e.g., number of words in NLP), its computational and memory requirements scale quadratically with the sequence length. Additionally, long- term Transformers that can address longer sequences with less computational burden (e.g., Longformer [Beltagy et al., 2020], Sparse Transformer [Child et al., 2019]) sacrifice the novel performance of the Marketing Science Institute Working Paper Series
original Transformer due to the approximation process. Similarly, traditional sequence embedding approaches have ignored length and vocabulary size (Needleman & Wunsch, 1970; Smith & Waterman, 1981) and/or long-term dependencies (Farhan et al., 2017; Kuksa et al., 2008). Methodologically, our approach faces the same limitation, but we address this problem by leveraging domain knowledge about customer logs that have limited types of user behaviors (events), and thus low vocabulary size (i.e., small action choices). By doing so, the model can capture long-term dependencies in lengthy logs; it provides computational efficiency by focusing on the length of sequences rather than the dimensionality of sequences (event type). Our work is tailored toward business problems and extends the burgeoning literature in long-sequence feature embedding to management fields. Figure 1 positions our paper within the DL sequence processing literature with respect to customer churn prediction. The emergence of DL approaches has delivered great returns especially when dealing with high-dimensional and unstructured sequential data – namely, natural languages. Prior works explore and utilize the potentials of these abundantly available data (e.g., text and speech) and is the state-of-the- art for a wide range of sequential tasks (LeCun et al., 2015). However, recent advances in sequential DL models have focused on improving models’ performance and efficiency while retaining the ability to handle the high dimensionality of the emerging textual data, albeit at the cost of sacrificing long-term sequence dependence. While the existing managerial research on churn prediction has benefited much from directly adapting such models into analyzing shorter-length sequential log data (Ahn et al., 2020), achieving excellent performance in presence of both the length and high dimensionality remains elusive. To address lengthy behavioral logs in a real-world context, there is a need for a different design philosophy that can handle long-sequence events. Therefore, we suggest a new direction of sequence modeling―which focuses on length and extracting long-length dependencies rather than on handling high dimensionality―to handle lengthy logs in managerial problems like churn analyses. 3 Empirical Setting We modeled the customer churn prediction in the context of online games. Online games are an ideal testing ground to study the capture of minute-changing patterns in lengthy customer log data. The global game market is estimated to be worth $180 billion, which is bigger than North American sports and global movie industries combined (Witkowski, 2020). As competition intensifies, game companies have been increasingly required to perform effective customer churn management by utilizing massive amounts of play log data. Thus, they have actively tested various log analysis methods and built extensive know-how. We have therefore grounded the empirical analysis of our approach in this setting, given the Marketing Science Institute Working Paper Series
importance of churn management in gaming and the existence of alternative approaches against which we can benchmark our framework. Our analysis focuses on loyal customers’ churn behavior. In online games, typical payment distribution follows extreme Pareto, and loyal users (0.19% of users) supply half of the revenue (Cifuentes, 2016). This means a small improvement in churn prevention can greatly impact total revenue, so managing and retaining loyal customers is one of the most important goals of game operations (Lee et al., 2020). Thus, it is common for game companies to conduct sophisticated analyses focused on a small group of profitable, loyal users. Similarly, we excluded non-loyal users for consistency with industry practice and to save computational resources. We acknowledge that non-loyal users may have different behaviors than loyal users, so the insights from the analysis cannot be extrapolated to that segment of users. 3.1 Data We used a global game company’s proprietary data from April 1, 2016 to May 11, 2016 to conduct our analysis. The company is one of the largest game developers/publishers globally with an annual revenue of over $2 billion. The data contains 175 million event logs of 4,000 loyal customers over the period of six weeks. The loyal users were chosen by the company based on cumulative purchases and in-game activities. The logs have a multivariate time series format that consists of second-level timestamps and 16 relevant columns containing different types of events and user behaviors, which are represented by numeric and categorical features. That is, each user generates 16 different kinds of in-game behavioral log sequences. For example, if a user enters a certain area of the game world, a column named “AREA” records the unique ID of that area and a “TIME” column records the timestamp for it. Simultaneously, other columns also record the event status (e.g, the user’s level, how much time they spent in the current session, their in-game social activities and battle status) for the same timestamp. Furthermore, we had access to aggregate cross-sectional data cleaned and engineered by the company (based on the same raw logs). This manually engineered feature set―containing 78 variables―was used to establish baselines by combining it with various ML predictive models. Since the company accumulated considerable know-how and domain knowledge for over 20 years as a front-runner in online games, these baselines effectively demonstrate our model’s improvements and efficacy from a real-world business standpoint. Figure 2 shows the frequency distribution of sequence length for each user. An average user has 43,784 events in six weeks, with a maximum of 582,941 and a minimum of eight. Previous approaches cannot handle these lengthy sequences and directly extract useful signals. Marketing Science Institute Working Paper Series
3.2 Definition of Churn Similarly to prior work (Lee et al., 2018; Tamassia et al., 2016), we define a churner as a user who does not play the game for more than five weeks. We predict user churn in a binary classification manner, and 1,200 users out of 4,000 (30%) are churners in the dataset. To allow for effective churn prevention strategies, we set a three-week time interval between observation and churning windows. By doing so, the company had an opportunity to implement various retention interventions for potential churners, such as a new promotion campaign or content updates. 4 Model Our model assumes that each log has limited event types (i.e., vocabulary size, the unique number of elements in sequences), typically less than a few hundred. In real-world log management, it is essential to separate events into various subfields and columns, depending on their categories and types, in the way typical, relational database systems operate. Since logs are notably long, each column includes only a small number of customer actions for efficient data processing by operations such as groupby and join. Additionally, even if logs have large vocabulary sizes, due to the existence of timestamps, it is easy to separate them into sub-logs that have small vocabulary sizes through a simple filtering process with low computation. Therefore, the business logs are bounded on the size of the vocabulary, unlike natural languages processed by recent sequential DL models. In other words, to address long sequence logs in business, the computational scalability should focus on addressing the sequence length, rather than the vocabulary size. Leveraging this domain knowledge, we propose a sequence embedding approach to handle customers’ lengthy traces of online content consumption that can extend to millions of events in length. To do so, we incorporate a graph-based sequence embedding approach (Ranjan et al., 2016) that is methodologically fitting for capturing long-term dependencies, while linearly and quadratically scaling with sequence length and vocabulary size, respectively. Then, we hierarchically stacked the embedding model on a novel multivariate sequential DL model, named Temporal Convolutional Networks (TCN) (Bai et al., 2018; Lea et al., 2017). This model is built to concurrently capture both local-level features and their temporal patterns of embedded representations from the sequence embedder. Below, we first discuss the theoretical background of our model selection criteria. Next, we overview our log processing framework. We then elaborate on the graph-based sequence feature embedding process. Following that, the neural network architecture and its components are described. Lastly, we provide details of model estimation and optimization. Marketing Science Institute Working Paper Series
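The labeling rule can be sketched as follows. This is one plausible operationalization of the definition above; the window boundaries, variable names, and the `future_logs` and `all_user_ids` objects are assumed for illustration.

```python
import pandas as pd

OBS_END   = pd.Timestamp("2016-05-11")   # end of the six-week observation window
GAP       = pd.Timedelta(weeks=3)        # buffer left for retention interventions
CHURN_WIN = pd.Timedelta(weeks=5)        # "does not play for more than five weeks"

win_start, win_end = OBS_END + GAP, OBS_END + GAP + CHURN_WIN

# future_logs: play events recorded after the observation window (user_id, TIME).
active_users = future_logs.loc[
    (future_logs["TIME"] >= win_start) & (future_logs["TIME"] < win_end), "user_id"
].unique()

# churn = 1 if a user shows no play activity during the five-week churn window.
labels = pd.Series(1, index=all_user_ids, name="churn")
labels[labels.index.isin(active_users)] = 0
```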
4.1 Theoretical Background in Model Selection This section elaborates on why existing state-of-the-art sequential DL models are not suitable for real- world lengthy log analysis, in contrast to our graph-embedding approach.2 We first discuss five model selection criteria to satisfy the unique characteristics of big data business analytics. We then highlight the limitations of existing sequential DL models from the perspectives of methodological and computational costs. Table 1 summarizes the discussion. 4.1.1 Selection criteria to address lengthy logs in large scale business analytics Our approach meets the following five criteria for model selection in lengthy business log analysis. (1) The model has to linearly scale with sequence length without using techniques that cause information loss, such as approximation and compression. Both vanilla Transformer and Transformers for long-term contexts do not meet this standard. (2) The model needs to perform comparably to popular DL models (e.g., LSTM, GRU) in addressing low-dimensional data. (3) The model should handle multivariate time series or logs. Both vanilla Transformer and Transformers for long-term contexts do not support multivariate sequences (Zerveas et al., 2021). (4) The model should be free from the fixed-length input problem―which causes huge waste in computations when setting input sequences with the same length (i.e., max sequence length) through padding techniques (e.g., filling out zeros). Existing sequential DL models (e.g., RNNs and Transformers) can only consume fixed-length inputs, which is a key barrier limiting their application to lengthy sequences (Dai et al., 2019). For example, in our dataset, existing methods require 13 times more computing resources to address irrelevant zeros, which account for 92.5% of the data after the padding process. (5) Fifth, the model should be computable in distributed environments. RNNs and Transformers for long-term contexts also have shortcomings regarding this criterion (Bai et al., 2018; Fang et al., 2021). 4.1.2 Limitations of Existing Methods Existing DL sequential methods for lengthy log processing have limitations due to their focus on the high dimensionality of data rather than the long-term dependency of input sequences. 2 We discuss this in detail in Appendix A “Model selection criteria and limitations of existing models” and Appendix B “Estimated cost of each model and its feasibility for addressing lengthy logs.” Marketing Science Institute Working Paper Series
● RNNs―not only vanilla RNN but also LSTM and GRU―are inherently ill-fitted for capturing long-term dependencies due to the limitation of their recurrent structure. These models can generally handle mid-range sequences up to a thousand events in length due to gradient decay over layers (Li et al., 2018). We also empirically show the same result (see the results of the truncation settings in section 5.2). ● Truncated backpropagation through time (TBPTT) sacrifices long-term dependencies to help RNNs handle longer sequences. Specifically, TBPTT loses the ability to capture long-term dependencies due to its truncation mechanism―which saves computation and memory costs by truncating the backpropagation process after a fixed number of lags (Tallec & Ollivier, 2017). ● Vanilla Transformer cannot handle long-term dependencies due to its quadratic computational scalability with input length (Child et al., 2019), despite its superior performance. For example, the typical input length of Transformer models is set to 512. If we analyze the full-length sequences in our dataset (i.e., 582,941 events in length) through Transformer, it requires 1.3 million (= [582,941 / 512]2) times more computational cost relative to the standard Transformer’s setting. Thus, processing lengthy log data with typical Transformers is infeasible . ● Transformers for long-term contexts―which are modified Transformers that address longer sequences with less computational burden―rely on the approximation of long-term dependencies. It is unavoidable for Transformer models to sacrifice their superior performance to capture longer contexts, if not entirely infeasible to compute due to insurmountable memory and time requirements. When used, the performance loss increases as the computational efficiency increases (Tay et al., 2020). 4.1.3 Computational Costs We estimate the computational requirements of our approach and other traditional, state-of-the-art sequential DL models to address our dataset. This quantitatively shows why the existing methods are inadequate or infeasible for handling extremely long sequences effectively. From a perspective of computational cost, our model is 11 to 6,720 times cheaper than state-of-the-art alternatives (i.e., long- term Transformers) while losing no information from the approximation process. We discuss the cost savings of human labor later in the paper. Whereas the alternatives sacrifice their performance to achieve computational efficiencies when addressing longer sequences (Tay et al., 2020), they still have huge computational requirements―which is a significant hurdle in real-world, large-scale, lengthy log analysis. Some models’ computational requirements linearly increase with input length, but their cost was 11 times more expensive than ours due to the fixed-input length problem. These cost gaps rapidly increase as the size and length of data increase. Marketing Science Institute Working Paper Series
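The back-of-the-envelope arithmetic behind these comparisons can be reproduced in a few lines; the constants come from our dataset, and the calculation is illustrative rather than an exact cost model.

```python
ref_len = 512        # typical Transformer input length
max_len = 582_941    # longest user sequence in the dataset
avg_len = 43_784     # average user sequence length

# Self-attention cost grows quadratically with input length.
print(f"{(max_len / ref_len) ** 2:,.0f}x a standard Transformer budget")  # ~1.3 million

# A linear-cost model that still pads every sequence to max_len wastes compute
# on padding (zeros are ~92.5% of the padded inputs in our data).
print(f"~{max_len / avg_len:.0f}x more positions than actual events")     # ~13
```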
Table 1 summarizes the key features and computing costs of each model regarding the lengthy log processing discussed in this section. The detailed explanations and related implications are described in Appendices A and B. 4.1 Framework Overview Figure 4 plots an overview of our novel approach. We designed our model as a hierarchical structure that stacks a graph-based sequence embedder on TCN, rather than letting a single embedder handle the whole sequence process. Once a sequence embedder compresses lengthy logs into short vector sequences in an unsupervised manner, TCN squeezes out useful signals from them. This hybrid design allows higher performance and flexibility by maximizing the utilization of novel DL techniques in the process. Though the graph-based sequence embedder exhibited better accuracy than LSTM in some benchmark tests (Ranjan et al., 2016), the state-of-the-art sequential models (e.g., transformers, TCN) had superior performance and methodological elegance in many different contexts. We summarize the approach here: (1) Split and Stack: In the first step, we split a full-length input sequence into subsequences. The input sequence is an ordered series of label encoded events (i.e., numbers). For example, suppose a certain column includes five events {login, battle, chat, purchase, logout}, a sequence {login, battle, battle, purchase, chat, logout} can be converted to {0,1,1,3,2,4}. We split each six-week sequence into 1,008 subsequences of one-hour increments. We chose a one-hour increment to set the sequence length after the embedding process at 1,008, so that TCN―its computational burden quadratically scales with sequence length (Krebs et al., 2021)―can handle these data with ease. Finally, we obtained the total of 4,032,000 distinctly occurring subsequences and stacked them for the next embedding process. (2) Sequence Embedding: This step converted the stacked subsequences into meaningful representation (or feature) vectors, allowing similar sequences to be embedded near each other in the lower dimensional space. We did so by integrating a graph-based embedding approach (Ranjan et al., 2016) that quantifies the effects (i.e., associations) of events on each other based on their relative positions in a sequence. Specifically, we set events (i.e., customer behaviors) as nodes and their relationships as links (i.e., associations) in a graph structure. We embedded the sequences into a vector space based on the unique characteristics of their graph structure. (3) Principal Component Analysis (PCA): Since the embedded representation vector contains information about all possible bidirectional event pairs, it is usually sparse and high-dimensional. For efficiency, we compressed the embedded vectors through PCA (the amount of explained variance was set to 95%). Marketing Science Institute Working Paper Series
(4) Iterate Steps One to Three for all 16 columns: Our dataset has a total of 16 columns, so each user has 16 different log sequences describing different types of events and behaviors. We therefore repeated the sequence embedding process for every column and obtained a total of 16 embedded representation vectors. These sixteen vectors were then concatenated to form a unified meta-feature vector.

(5) Unstack: The meta-feature vector has a vertically stacked structure. For example, the first row is the embedded feature vector of user-1 at time-1 (i.e., the first hour of the six-week period). For individual-level analysis, we reshaped the data into 4,000 users × 1,008 (6 weeks × 7 days × 24 hours) time periods. Consequently, each user has a sequence of 1,008 representation vectors, and each vector contains the distilled information of one hour-long subsequence.

(6) Neural Network: The TCN layer abstracts sequential feature representations while considering their long-term contexts. The attention layer helps the TCN layer efficiently handle the data's sparsity and long-term dependencies. Abstracted outputs are passed to a fully connected layer, which distills the information one more time. Since the goal was to predict imbalanced binary labels (churn: 30%, stay: 70%), the learning process minimized a weighted binary cross-entropy loss.

4.2 Graph-based Sequence Feature Embedding

Denote an input sequence in the data set of sequences $\mathcal{S}$, which is composed of event factors in a set $\mathcal{V}$ (i.e., the vocabulary), by $s$. $L_s$ denotes the length of a sequence, and $s_l$ denotes the event at position $l$ in sequence $s$, where $s_l \in \mathcal{V}$ and $l = 1, \ldots, L_s$. The graph-based sequence embedder characterizes a sequence by quantifying the forward-direction effects (i.e., associations)―the effect of the preceding event on the later event―of all paired events. Here, the effect of event $u$ on event $v$, which are at positions $l$ and $m$ ($l < m$) respectively, is defined as

$$\phi_{\kappa}(l, m) = e^{-\kappa (m - l)},$$

where $\kappa > 0$ is a tuning parameter. Because the forward-direction effects can be computed for sequences of arbitrary length, the graph-embedding approach saves a huge amount of computational resources that would otherwise be spent on unnecessary dummy inputs (i.e., zeros), which account for 92.5% of our dataset once zero-padding to the maximum sequence length is applied.

We can store the associations of all paired events in an asymmetric matrix of size $|\mathcal{V}| \times |\mathcal{V}|$. For example, in Figure 5, where there are five events (i.e., A, B, C, D, E), we obtain a total of 25 associations (i.e., A-A, A-B, A-C, ..., E-D, E-E). Here, the matrix holds the associations of all event pairs $(u, v)$, where $u$ is the event preceding $v$.

The association feature vector $\Psi_s$, which is used for the sequence embedding process, is a normalized aggregation of all associations, as follows:

$$\Psi_s(u, v) = \frac{1}{|\Lambda_{uv}(s)|} \sum_{(l, m) \in \Lambda_{uv}(s)} \phi_{\kappa}(l, m), \qquad \Lambda_{uv}(s) = \{(l, m) : s_l = u,\ s_m = v,\ l < m\}.$$

$\Psi_s$, the feature representation of sequence $s$, denotes the embedded position of sequence $s$ in a $|\mathcal{V}|^2$-dimensional feature space. Also, since $\Psi_s(u, v)$ contains the association between the paired events (i.e., the effect of the preceding event $u$ on the later event $v$), we can interpret it as a directed graph with $|\mathcal{V}| \times |\mathcal{V}|$ edges, whose edge weights are the normalized associations of the paired events.
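To make the embedding step concrete, the sketch below computes the normalized association matrix for a single label-encoded subsequence, reusing the toy example from the framework overview ({login, battle, battle, purchase, chat, logout} → [0, 1, 1, 3, 2, 4]). It is a naive quadratic-time illustration with our own function names, not the optimized implementation of Ranjan et al. (2016).

```python
import numpy as np

def association_features(seq, vocab_size, kappa=1.0):
    """Normalized forward-direction associations for one label-encoded subsequence."""
    effect = np.zeros((vocab_size, vocab_size))
    count = np.zeros((vocab_size, vocab_size))
    for l in range(len(seq)):                 # naive O(L^2) loop, for illustration only
        for m in range(l + 1, len(seq)):
            u, v = seq[l], seq[m]
            effect[u, v] += np.exp(-kappa * (m - l))
            count[u, v] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        psi = np.where(count > 0, effect / count, 0.0)
    return psi.flatten()                      # a |V|^2-dimensional feature vector

vec = association_features([0, 1, 1, 3, 2, 4], vocab_size=5)

# Step (3): stack the vectors of all hourly subsequences and compress them with PCA,
# e.g., PCA(n_components=0.95).fit_transform(np.vstack(all_vectors)).
```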
4.3 Neural Network Architecture

The neural network module consists of a TCN, an attention layer, fully connected layers, and a weighted cross-entropy loss layer. We describe the specifics as follows.

4.3.1 Temporal Convolutional Network (TCN) for Sequence Processing

TCN retains the time-series processing ability of RNNs but adds the computational efficiency of convolutional networks (Lea et al., 2017). TCN outperforms canonical RNNs such as LSTM and GRU across various sequential tasks while handling longer input sequences. Compared to RNNs, TCN is faster, requires less memory and computational power, and is better suited to parallel processing (Bai et al., 2018), which delivers critical advantages in addressing a vast amount of customer logs.

TCN consists of stacked residual blocks, which in turn consist of convolutional layers, activation layers, normalization layers, and regularization layers (see Figure 6 and Lea et al. [2016] for more details). For the convolutional layers, a temporal block is constructed by stacking several dilated causal convolutional layers. In detail, for an input sequence $x \in \mathbb{R}^{n}$ and a convolution filter $f: \{0, \ldots, k-1\} \rightarrow \mathbb{R}$ of size $k$, the level-$l$ dilated convolution operation $F$ at time step $s$ of the output sequence is defined as

$$F(s) = (x *_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i},$$

where $d$ is the dilation factor, which can be written as $d = 2^{l}$ to cover an exponentially wide receptive field.
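For intuition, a rough receptive-field calculation for a stack of such dilated convolutions (kernel size $k$, dilation $2^{l}$ at level $l$) is sketched below. The exact coverage depends on how many convolutions each residual block contains, so treat this as a rule of thumb rather than our tuned architecture.

```python
def receptive_field(kernel_size, levels):
    # 1 + (k - 1) * (2^0 + 2^1 + ... + 2^(levels - 1)) for one convolution per level.
    return 1 + (kernel_size - 1) * (2 ** levels - 1)

print([receptive_field(3, L) for L in range(1, 9)])
# [3, 7, 15, 31, 63, 127, 255, 511]: a handful of levels already spans a large part
# of the 1,008-step embedded sequence without any recurrence.
```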
The Rectified Linear Unit (ReLU) is used as an activation function to provide nonlinearity to the output of the convolutional layers (Glorot et al., 2011) and is defined as

$$\mathrm{ReLU}(x) = \max(0, x).$$

One of the obstacles of DL is that the gradients for the weights in one layer are highly correlated with the outputs of the previous layer, resulting in increased training time. Layer normalization is designed to alleviate this "covariate shift" problem by adjusting the mean and variance of the summed inputs within each layer (Ba et al., 2016). Though the theoretical motivation of decreasing covariate shift is controversial in the technical ML literature, the practical advantage of normalization methods, which allow for faster and more efficient training, has proven indispensable to a wide range of DL applications (A. Zhang et al., 2019). The statistics of layer normalization over the hidden units in the same layer are written as

$$\mu^{l} = \frac{1}{H} \sum_{i=1}^{H} a_{i}^{l}, \qquad \sigma^{l} = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_{i}^{l} - \mu^{l}\right)^{2}},$$

where $a_{i}^{l}$ is the summed input to the $i$-th hidden unit in the $l$-th layer and $H$ denotes the number of hidden units in a layer.

Dropout is an essential regularization technique to prevent the overfitting of the neural network. The idea is to randomly drop (hidden and visible) units from the network during training, which prevents units from co-adapting too often. By doing so, dropout improves the generalization of neural networks by allowing the training process to be an efficient stochastic approximation of an exponential ensemble of "thinned" networks (Srivastava et al., 2014).
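A minimal Keras sketch of one temporal block as just described (dilated causal convolutions, each followed by ReLU, layer normalization, and dropout, plus a skip connection) is given below. Filter counts, kernel size, depth, and dropout rate are placeholders; in our pipeline these are selected by the Bayesian optimization of Section 4.4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_block(x, filters, kernel_size, dilation_rate, dropout_rate=0.2):
    """One residual block: two dilated causal convolutions with ReLU, layer norm, dropout."""
    h = x
    for _ in range(2):
        h = layers.Conv1D(filters, kernel_size, padding="causal",
                          dilation_rate=dilation_rate)(h)
        h = layers.Activation("relu")(h)
        h = layers.LayerNormalization()(h)
        h = layers.Dropout(dropout_rate)(h)
    if x.shape[-1] != filters:                    # match channels for the skip connection
        x = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.Add()([x, h])

inputs = tf.keras.Input(shape=(1008, 64))         # 1,008 hourly embedding vectors per user
h = inputs
for level in range(4):                            # dilation d = 2^level
    h = temporal_block(h, filters=64, kernel_size=3, dilation_rate=2 ** level)
```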
4.3.2 Attention with Context

The attention mechanism allows the sequential model (i.e., TCN) to focus more on the relevant parts of the input data by acting like random-access memory across time and input data (Bahdanau et al., 2014). Thus, it improves the training efficiency and performance of the model by giving more direct pathways through the model structure (Raffel & Ellis, 2015). We follow and implement the work of Z. Yang et al. (2016). Specifically,

$$u_{t} = \tanh\!\left(W h_{t} + b\right), \qquad \alpha_{t} = \frac{\exp\!\left(u_{t}^{\top} u_{c}\right)}{\sum_{t} \exp\!\left(u_{t}^{\top} u_{c}\right)}, \qquad v = \sum_{t} \alpha_{t} h_{t},$$

where $h_{t}$ is an annotation obtained by concatenating $\overrightarrow{h_{t}}$ (the forward hidden state) and $\overleftarrow{h_{t}}$ (the backward hidden state) in the sequence of the bidirectional TCN. $u_{t}$ is a hidden representation of $h_{t}$ obtained through a one-layer multilayer perceptron (MLP). $u_{c}$ is an embedded representation-level context vector, randomly initialized and learned during the training process. $\alpha_{t}$ contains the normalized importance of each embedded representation vector and is calculated through a softmax function of $u_{t}$ and $u_{c}$. $v$ is the output vector that summarizes all the information of the embedded representations in a sequence. The attention mechanism not only allows for faster training and better performance but also increases the stability of training by providing more direct pathways through the model structure.

4.3.3 Fully Connected Layers

We built the fully connected layer module by stacking multiple component blocks. Each component block consists of dense and dropout layers, layer normalization, and Exponential Linear Unit (ELU) activation. The dense layer is a linear operation on the input vector. Dropout prevents the model from overfitting, and layer normalization boosts the efficiency of training. ELU (Clevert et al., 2015) adds nonlinearity to the model while lowering susceptibility to the vanishing gradient problem and is defined as

$$\mathrm{ELU}(x) = \begin{cases} x & \text{if } x > 0, \\ \alpha \left(e^{x} - 1\right) & \text{if } x \le 0, \end{cases}$$

where $\alpha$ is a hyperparameter.

4.3.4 Minimizing the Loss Function

We optimized a weighted binary cross-entropy loss function to train on imbalanced binary labels (churn: 30%, stay: 70%). This function gives additional weight to the minority class (i.e., churn users, Y=1) relative to the typical binary cross-entropy loss. The additional weight can be calculated as the ratio between the number of negative examples (i.e., stay users, Y=0) and the number of positive examples (i.e., churn users, Y=1). The loss is calculated as follows:

$$\mathcal{L}(\theta) = -\frac{1}{M} \sum_{m=1}^{M} \left[\, w\, y^{(m)} \log f_{\theta}\!\left(x^{(m)}\right) + \left(1 - y^{(m)}\right) \log\!\left(1 - f_{\theta}\!\left(x^{(m)}\right)\right) \right],$$

where $M$ is the number of training examples, $w$ is the class weight, $y^{(m)}$ is the target label for training example $m$, $x^{(m)}$ is the input for training example $m$, and $f_{\theta}$ is the model with neural network weights $\theta$.
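In Keras, the class weight and the weighted loss can be written as follows. The 0/1 label array and the model object are assumed, and using Keras's built-in `class_weight` argument in `fit` would be an equivalent alternative.

```python
import numpy as np
import tensorflow as tf

y_train = np.asarray(train_labels)                   # assumed 0/1 churn labels
w = (y_train == 0).sum() / (y_train == 1).sum()      # e.g., 2,800 / 1,200 ≈ 2.33

bce = tf.keras.losses.BinaryCrossentropy()

def weighted_bce(y_true, y_pred):
    # Weight of w for churners (y = 1) and 1 for stayers (y = 0).
    sample_weight = 1.0 + (w - 1.0) * tf.cast(y_true, tf.float32)
    return bce(y_true, y_pred, sample_weight=sample_weight)

# model.compile(optimizer="adam", loss=weighted_bce, metrics=[tf.keras.metrics.AUC()])
```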
When managers apply our framework to other business problems, the loss function can be flexibly changed to suit the given task (e.g., regression, multi-label classification).

4.4 Model Estimation and Optimization

4.4.1 Automated Model Generation through Bayesian Optimization

Our model utilizes an AutoML framework to help managers easily extract insightful business-oriented features without manual effort. The use of AutoML not only cuts manual effort but also makes benchmark comparisons more structured and unbiased. Through Bayesian optimization techniques (Snoek et al., 2012), we automate key processes such as finding the best model structure (e.g., the width, depth, and capacity of the TCN and fully connected layers; the choice of activation functions) and fine-tuning hyperparameters (e.g., dropout rate, learning rate). As a result, our approach can operate end-to-end (raw data to result) without human intervention. For the implementation, we use TensorFlow (Abadi et al., 2016) and Keras Tuner (O’Malley et al., 2019).

4.4.2 Optimization Methods

Complex and noisy real-world data make the training process highly unstable and cause it to converge to poor local minima, especially when handling sequential tasks (Pascanu et al., 2013). Thus, scholars and engineers have developed advanced optimization techniques. In this regard, we applied three state-of-the-art optimization methods to improve the effectiveness and efficiency of the training process:

1) Rectified Adam (RAdam): Despite faster and more stable training, popular stochastic optimizers (e.g., Adam and RMSProp) suffer from a variance issue in which problematically large variance in the early stage of training risks convergence to undesirable local optima (L. Liu et al., 2019). RAdam addresses this problem by incorporating a warmup (i.e., an initial training phase with a much smaller learning rate) to reduce the variance.
2) Lookahead optimizer: This iteratively updates two sets of weights, “fast weights” and “slow weights,” and then interpolates them. By doing so, Lookahead improves training stability and reduces the variance of optimization algorithms such as Adam, SGD, and RAdam (M. Zhang et al., 2019). 3) Gradient centralization: This performs Z-score standardization on the gradient vectors, similarly to batch normalization. Thus, it boosts not only the generalization performance of networks but also the stability and efficiency of the training process (Yong et al., 2020). 4.4.3 Model Estimation and Optimization The sequence embedding was performed on an Amazon SageMaker ml.m5.12xlarge instance with 48 multi-thread CPUs and 192GB of RAM. The embedding process took around six hours including the dimensionality reduction with PCA. The TensorFlow deep learning library was used to estimate our neural network model (Abadi et al., 2016). The model estimation took around two hours with 5-fold cross-validation on an Nvidia GTX 1080 Ti GPU server with 64 GB of RAM. Additionally, the Bayesian optimization was conducted on the same server and took about 20 hours to find the best model structure and hyperparameters. Since our model is based on TCN, parallel processing can contribute to faster model training, which is hard for recurrent computations of canonical RNNs. 5 Results In this section, the predictive performance of our automated log processing is benchmarked against existing manual approaches. Additional analyses revealed when and how our model excels in predicting churn in real-world data. We discuss the economic impact of our model on improving game value while saving time and labor costs in comparison to existing methods. 5.1 Experimental Setting We benchmark the quality of representations extracted by our automated log process by comparing its predictive performance against several alternatives based on various ML/DL models―Logistic Regression, Decision Tree, K-Nearest Neighbor, Naive Bayes, Support Vector Machine, Extreme Gradient Boosting (XGBoost), and Deep Neural Network (Multilayer Perceptron)―and the manually engineered feature set developed by the company. This manually engineered feature set represents years of domain knowledge as one of the leading firms in the game industry. Our approach natively utilizes lengthy logs due to its sequence embedding ability, while other options use the aggregate data commonly available in real-world business analytics settings. Marketing Science Institute Working Paper Series
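For reference, one such baseline (XGBoost on the company's 78 manually engineered features) can be sketched as follows. The table name, hyperparameters, and the simple holdout split are illustrative; the reported benchmarks use the cross-validation protocol described next.

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# manual_features: one row per user with the 78 engineered variables plus the churn label.
X = manual_features.drop(columns=["user_id", "churn"])
y = manual_features["churn"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = XGBClassifier(n_estimators=500, max_depth=4, learning_rate=0.05,
                    scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum())
clf.fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```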
We additionally tested two different manual approaches that are commonly used in the existing churn prediction setting―aggregation (i.e., aggregating long period sequences through statistical measures) and truncation (i.e., utilizing only short period sequences by keeping a designated number of recent behaviors and dropping the rest). In the aggregation approach, we converted lengthy logs into shorter sequences through descriptive statistical measures (e.g., sum, count)―that are widely used in big data analytics and churn prediction settings (Jeon et al., 2017; Rothmeier et al., 2021; Zdravevski et al., 2020). To do so, we aggregated daily sequences through six statistical functions (i.e., max, min, count, average, sum, and mode) that are most commonly used for modeling user behaviors in game contexts (Guitart et al., 2018; Jeon et al., 2017; Periáñez et al., 2016; Rothmeier et al., 2021; Sifa et al., 2015). In the truncation approach, we tested various input lengths from 500 to 10,000. Then, we examined the efficacy of the existing truncation method (as an alternative to our approach) in dealing with lengthy logs. For both settings, LSTM and GRU models with Attention Mechanism were utilized. For robust model evaluation, we conducted 5-fold cross-validation based on three splits―training, validation, and testing that were constructed with 60%, 20%, and 20% of users, respectively. The final score is the average performance of each fold on the testing set. We used three evaluation metrics together―1) Area under the ROC Curve (ROC AUC)3 (Fawcett, 2006), 2) Area under the Precision- Recall Curve (PR AUC)4 (Davis & Goadrich, 2006), and 3) F1-score5 (Powers, 2020)―to robustly address imbalanced output labels (churn: 30%, stay: 70%). Additionally, to prevent biases from manual hyperparameter tuning, we applied the same automatic process to DL models. The structure and hyperparameters of DL models were determined by the Bayesian optimization process (Snoek et al., 2012) without any manual intervention. The same early stopping rule was also applied: if the performance did not increase during five epochs, the model stopped the training process and kept the best weights. 5.2 Evaluating the Predictive Quality Table 2 benchmarks the predictive performance of various models. Our approach shows superior performance over other competitive baselines for all ROC AUC, PR AUC, and F1-score evaluation metrics. Specifically, our automated method improved performance by at least 5% compared to baselines. It is worth emphasizing that we did not use any domain knowledge in our analysis (i.e., manually 3 ROC AUC, PR AUC, and F1-score robustly measure the performance of imbalanced classification models. Higher scores (maximum 1) in each metric indicate a better model performance. To measure ROC AUC, we first drew an ROC curve that plotted True Positive Rate (y-axis) vs. False Positive Rate (x-axis) at different classification thresholds from 0 to 1. ROC AUC measured the two-dimensional area underneath the ROC curve. 4 To measure PR AUC, we first drew a Precision-Recall curve that plotted Precision (y-axis) vs. Recall (x-axis) at different classification thresholds from 0 to 1. PR AUC measured the two-dimensional area underneath the Precision-Recall curve. 5 The F1-score can be calculated as (2 × Precision × Recall) ÷ (Precision + Recall). Marketing Science Institute Working Paper Series
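The evaluation loop itself can be sketched as follows. `build_model`, the arrays `X` and `y`, and the 0.5 decision threshold are assumptions, and the inner validation split used for early stopping is omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

scores = {"roc_auc": [], "pr_auc": [], "f1": []}
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    model = build_model()                       # any of the benchmarked ML/DL models
    model.fit(X[train_idx], y[train_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]
    scores["roc_auc"].append(roc_auc_score(y[test_idx], prob))
    scores["pr_auc"].append(average_precision_score(y[test_idx], prob))  # PR AUC proxy
    scores["f1"].append(f1_score(y[test_idx], (prob >= 0.5).astype(int)))

print({metric: np.mean(vals) for metric, vals in scores.items()})
```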
engineering features based on knowledge of the data and logs), but the baseline feature set was developed over many years by a global company with leading expertise and know-how of its own data. Since our analysis focused on high-spending customers, the impact of the improvement is substantial (see details in Appendix C). With regard to the aggregation approach, the combination of automated aggregation features and LSTM/GRU showed better performance than traditional ML models (i.e., Logistic Regression, Decision Tree, K-Nearest Neighbor, Naive Bayes, Support Vector Machine) based on the manually engineered feature set but worse than well-performed ML/DL models (i.e., XGBoost and NN) based on the manually engineered feature set. This implies that, despite its labor-intensive characteristics, manual feature engineering can be a worthwhile approach in comparison to simple rule-based automated aggregation methods. With the truncation approach, regardless of the input length, we obtained a ROC AUC of 0.5, equivalent to the expectation of a random guess (Fawcett, 2006). Because LSTM/GRU can only handle a limited length of sequences, using logs without aggregation fails to provide enough information relevant to churn prediction, even if we increase the input length. If we assume the models can address sequences around a thousand events in length as shown in benchmark tests (Li et al., 2018), and considering that average users have 43,784 events in our dataset, the truncation approach only covers 2.3% of user behaviors (1,000 ÷ 43,784 ≈ 0.023) on average. This implies that aggregation processes, such as our log processing, are essentially needed for the lengthy log data of modern high-transaction environments. 5.3 Experiments on Model Efficacy & Managerial Impact This section first explores what conditions are needed for our approach to excel over existing approaches. Specifically, we investigate the advantage of our model over baselines, depending on the sequence length and the time period length of the input data. We then discuss the potential economic impact of cost savings in model estimation and human labor. 5.3.1 Performance Improvement According to Sequence Length Figure 7 visualizes the performance gap between our model and other baseline models (y-axis) according to sequence length (x-axis). On the x-axis, we grouped a total of 4,000 users in our dataset into quintiles (5 bins) based on their sequence length so that the higher ranking groups have longer sequences. For example, the fifth group has the longest sequences. The y-axis indicates the performance difference between our model and the average of other baseline models (i.e., the averaged performance of the baseline ML/DL models―Logistic Regression, Decision Tree, K-Nearest Neighbor, Naive Bayes, Marketing Science Institute Working Paper Series
Support Vector Machine, XGBoost, and Deep Neural Network―when based on the manually engineered feature set). As the y-value increases, the gap between our model and other baselines also increases, showing that our model works better than the baselines. Figure 7 shows that the relative performance of our model increases more in longer sequence groups in both ROC AUC and PR AUC. The trend lines have positive slopes (i.e., coefficient of linear regression) even when we increased the number of sequence group bins to 10, 20, 50 and 100 (see Table 3). The slopes are also statistically significant except in one case (the p-value is slightly larger than 0.05). These results suggest that our sequence embedding approach delivers more benefits when dealing with longer sequences. Somewhat surprisingly, we observed a slight performance drop with the longest sequence group (i.e., the fifth group in quintiles). One potential explanation that emerged from discussions with domain experts is the existence of anomalous users who have very different behavioral patterns. Successful online games usually have two anomalous groups. The first one consists of cheating users who use prohibited tools (e.g., automatic playing bots) to achieve higher efficiency in the gameplay. The second one consists of gold farmers who hire workers (often from developing countries) and let them play the game to earn cyber money, which can then be exchanged for real money. These anomaly groups generally have much longer behavioral sequences than other users because they use automation (e.g., bots or workers) to play the game longer. Therefore, most anomalous users may belong to the longest sequence group. In real- world settings, accurate anomaly detection is technically challenging, so churn datasets usually include anomalous users. Since ML/DL tends to minimize the loss of the entire population, small numbers of abnormal patterns are hard to capture with global prediction models. Additional studies may be needed to better incorporate anomaly detection into churn prediction for improved predictive quality. 5.3.2 Performance Improvement According to Period Length Longer sequence processing requires more computing costs. In a big data environment, it is important to find a sweet spot (i.e., the optimal length of input sequences) that satisfies both performance and computational efficiency. Thus, we investigated the optimal sequence length based on the period (e.g., day, week) of data. We examined: 1) How much information does the most recent data have? 2) How much additional benefit does the data yield as it goes further back in time? 3) Is there a sweet spot where the marginal performance improvement becomes almost zero? In Figure 8, the input sequence length is based on periods from one day to six weeks (x-axis). For example, the one-day period only used the most recent one-day sequence for the churn prediction. Six Marketing Science Institute Working Paper Series
weeks is the maximum length of the sequence in our dataset. The y-axis indicates the relative performance compared to the best result (i.e., the result of the six-week sequence). Results show that recent data carries a significant amount of information for churn prediction even within a short span of time. For example, with just one day's data, the predictive performance achieved is approximately 80–85% that of using the entire period, depending on the performance metrics. However, we did not observe a sweet spot in our setting. This suggests that, while the marginal improvement decreased, incorporating a longer sequence―by extending the current six-week period to a longer period―could lead to better performance. Managerially speaking, our results suggest that it is worthwhile to incorporate full-length sequences despite their high computational costs. Additionally, it is recommended that the company test sequences of longer periods to improve predictive performance. Since our analysis focused on loyal game customers, even a small improvement could lead to an enormous sales improvement (see Appendix C). 5.3.3 Savings in Human Labor and Time from Automated Feature Learning This section demonstrates how our automated approach can help reduce labor costs in feature learning processes. A few years prior to our study, the game company sponsored a global data science competition via a well-known platform and opened their dataset for feature engineering and churn prediction. Due to the large prize, more than 500 teams (ranging from professional data scientists to graduate students) participated in the contest. During four months of competition, 93 teams submitted their results. The majority of teams consisted of three to five team members working closely together. Notably, top-ranked teams participated in the competition full-time, providing high quality outputs that were representative of real-world efforts. For example, one of the top teams (consisting of five members) reported in an interview that they spent long hours every day for four months to manually make and test more than 2,000 features. This immense effort reflects a common reality for data scientists, so we used the characteristics of the competition task, the same period of four months and an equal number of data scientists, to estimate real-world labor costs in business. We also assumed that the labor cost of workers is equivalent to that of intermediate-level data scientists in the US. (Gil Press, 2019). Top ranked models typically need additional modifications before being applied to real-world systems, otherwise they are often rejected due to quality or computational issues (e.g., Netflix Grand Prize). Often firms continuously iterate with a cycle of hypothesis building, feature engineering, modeling, and testing to gradually improve their analytics systems at a steady pace. Our approach dramatically shortens this lengthy process. Our model took 30 hours to train and achieved better performance than the manual feature sets developed by the global company. However, this does not mean that our model can solve all Marketing Science Institute Working Paper Series