Modeling Lengthy Behavioral Log Data for Customer Churn Management: A Representation Learning Approach

Marketing Science Institute Working Paper Series 2022
Report No. 22-101

Modeling Lengthy Behavioral Log Data for Customer Churn Management: A
Representation Learning Approach

Daehwan Ahn, Dokyun Lee, and Kartik Hosanagar

MSI Working Papers are distributed for the benefit of MSI corporate and academic members and the general
public. Reports are not to be reproduced or published in any form or by any means, electronic or mechanical,
without written permission.
Modeling Lengthy Behavioral Log Data for
 Customer Churn Management:
 A Representation Learning Approach

 Daehwan Ahn (University of Pennsylvania), Dokyun Lee (Boston University), Kartik Hosanagar (University of Pennsylvania)
 {ahndh, kartikh}@wharton.upenn.edu, dokyun@bu.edu

 Abstract

Despite the success of recent deep learning models in churn prediction, they can only address short
sequences of lengths ranging from hundreds to thousands of events. In practice, however, customer
behavioral log data has very long sequences that can extend up to millions of events in length, which can
only be utilized through manual and onerous feature engineering. This approach requires domain
expertise and is exposed to human error. We introduce an automated log processing approach that
combines several representation learning (i.e., transforming data algorithmically to maximize signals)
frameworks to extract valuable signals from the entirety of lengthy log data. Our model combines a
graph-based embedding method with flexible neural networks to focus on sequence length and long-term
dependencies, exploiting the relatively low dimensionality of event types, and efficiently extracts
representations useful for churn prediction. The model improved prediction performance by up to 55%
when compared to existing manual feature engineering approaches developed by a global game company
and recent deep learning models. Such improvement could nontrivially increase our collaborating
company’s value by increasing the customer lifetime value of loyal customers. Additionally, our approach
can reduce human labor and computational costs by up to 99%. The performance lift increases as
sequence length increases. Managerial implications and applications to other sequence data are discussed.

Keywords: Churn, Customer Log Data, Lengthy Log, Deep Learning, Sequence Embedding, CLV.

1 Introduction

Because the benefits of customer retention are well documented and studied, there have been calls for
proactive churn management across a variety of industries (Ascarza et al., 2017). Gallo (2014) has
documented that acquiring a new customer is 5–25 times more costly than retaining an existing one. A
case study among financial services showed that a 5% increase in retention could raise a company's
profits by more than 25% (Reichheld & Detrick, 2003). Considering that CAC (Customer Acquisition
Cost) has grown by nearly 50% over the past 5 years (Campbell, 2019), churn prediction and management
have become central to the application of data science in business (Ahn et al., 2020). This stream of
research has suggested relevant proactive churn management strategies by utilizing various
methodological backgrounds, including modeling (Ascarza & Hardie, 2013; Braun & Schweidel, 2011;
Lemmens & Gupta, 2020), randomized experiments (Godinho de Matos et al., 2018), and predictive
analytics (Lemmens & Croux, 2006).

 Breakthroughs in machine learning (ML) technology have enabled predictive analytics to deliver
superior performance by incorporating a wealth of new, structured and unstructured data (LeCun et al.,
2015). Representation learning―a set of methods that permit algorithms to automatically discover
various levels of transformed features from large-scale raw data―flexibly allows the use of various input
data types (Bengio et al., 2013), such as short logs (Arnaldo et al., 2017), images (Krizhevsky et al.,
2012), text (Collobert et al., 2011), and graphs (Hamilton et al., 2017). Furthermore, new techniques for
automated and nonlinear feature learning have helped to extract more subtle signals from data in contrast
to human-engineered features (Bengio et al., 2013). These advantages have recently produced fruitful
research on churn prediction that directly extracts useful information from rich yet raw modern datasets.

 Consequently, there has been significant interest in incorporating large-scale customer log data into
churn detection and prediction in business domains (Ahn et al., 2020). A customer log is a sequential
record of the communication between a service and the users of that service (Peters, 1993). Since a log is
the most common data collection form and contains a wealth of information, such as microsecond-level
transactions, it is considered a key source for big data analytics (Chen et al., 2014; Dumais et al., 2014;
Oliner et al., 2012). However, in its original form, a log is unusable for analysis; managers need
additional data and modeling work to obtain value for the business (Bhadani & Jothimani, 2016). Log
processing is tricky due to its complexity―event sequences consist of various numeric, categorical, and
unstructured features and have arbitrary lengths for the same time interval. Information loss is also
inevitable during a traditional statistics-based (i.e., averaging, counting) aggregation process (Marvasti,
2010; McKee & Miljkovic, 2007).

 Thus, recent studies have begun to utilize powerful representation learning models for log analysis.
These studies show that sequential deep learning (DL) models can capture time-varying dynamics from
customers’ log data (Ahn et al., 2020; Sarkar & De Bruyn, 2021) in domains such as marketing (Hu et al.,
2018), financial services (Vo et al., 2021), healthcare (Kwon et al., 2021), online games (Guitart et al.,
2018), etc. Within the marketing context, Sarkar and De Bruyn (2021) showed that DL-based feature
learning outperformed 267 of 271 hand-crafted models that applied a wide variety of variables and modeling
approaches, often by a wide margin. Similarly, Ahn et al. (2020) compared more than 100 papers on
churn prediction in various business domains and concluded that DL models offer the best performance
due to their powerful feature-learning ability, which captures subtle patterns in a vast amount of log data.

 However, these studies only worked with shorter sequences of log data, because current approaches
were originally designed for addressing high-dimensional but short-length inputs (e.g., natural languages
like speech and text) ranging from hundreds to thousands of events in sequence length (e.g., canonical
RNNs [P. J. Liu et al., 2018], Transformer [Vaswani et al., 2017]).1 Thus, customer log data, which are
commonly available and can extend up to millions of events in sequence length, cannot be directly used;
this raises critical managerial challenges.

 Currently, managers have two options to address lengthy logs: the first is utilizing only short-period
sequences through data truncation (Chollet, 2017; Géron, 2018), and the second is the aggregation of
long sequences (Jurgovsky et al., 2018; Whitrow et al., 2009). Truncation keeps a designated number of
recent behaviors and excludes the rest while aggregation summarizes long sequences through statistical
measures, such as average and count. Although the truncation approach can capture time-dynamics in
dense logs, it often misses long-term dependencies between customer behaviors and management
activities, which is often a key interest in churn literature (Ataman et al., 2010; Jedidi et al., 1999; Mitra
& Golder, 2006; Sloot et al., 2006). When using the aggregation approach, managers suffer from the
following issues: aggregation requires onerous and lengthy manual feature/data work (Bengio et al., 2013;
Heaton, 2016; Z. Liu et al., 2020; Munk et al., 2010; Ng, 2013; Press, 2016); human-engineered features
often miss important signals that powerful representation learning can capture (Bengio et al., 2013;
Marvasti, 2010; McKee & Miljkovic, 2007); and the performance heavily relies on the analyst’s expertise
and domain knowledge to craft relevant predictors (Sarkar & De Bruyn, 2021). The more the data
complexity (i.e., size and type) increases, the more severe these limitations become.

1 We discuss in detail why prevalent language models, such as Transformers, are not appropriate for lengthy log data in section
4.1. While there are no theoretical limits to the sequence length these algorithms can take, for computational reasons, most are
capped at 512 events (words or tokens). It is infeasible, if not impossible, to process sequences significantly longer than
thousands of events in length while maintaining performance with current computational power.

 To fill this gap, we propose an automated log processing framework that combines an unsupervised
sequence embedding approach (i.e., finds lower-dimensional representation while maintaining desirable
properties of the data) with flexible neural networks to efficiently extract nonlinear signals from lengthy
logs pertinent to customers’ churning patterns. Our approach uses a graph-structured embedding method
to map a user’s long journey of service consumption into a highly informative yet low-dimensional space.
This sequence embedding can summarize multiple sequences that are millions of events long into shorter-
length representations as vector sequences readily feedable to any ML/DL model. To evaluate our
framework, we used a large-scale data set from a leading global game company.

 Our model improved prediction performance (as measured by ROC AUC, PR AUC, and F1-score,
defined in section 5.1) by a minimum of 5% and a maximum of 55% when compared to benchmark ML
models using manual feature engineering approaches developed by a global game company (representing
years of industry know-how) and to recent DL models. Back-of-the-envelope calculations within an
online gaming context suggest that these improvements could nontrivially increase the value of our
collaborating company by increasing the customer lifetime value (CLV) of their loyal customers
(Appendix C). We estimate that the company needs to spend only 1% of their current cost in human labor
(for onerous feature engineering and modeling) and in computation (section 5.3.3). Further analyses show that
improved performance results from using the entire log sequence and the performance gain increases as
the sequence length increases. We conclude with managerial implications, generalizability, and other use
cases.

2 Related Work

Three streams of literature are closely related to our work: (1) customer churn models in management
science, (2) log analysis for churn prediction, and (3) sequence feature embedding techniques in machine
learning.

 The majority of research in management science has focused on modeling customer churn behavior
based on aggregate-level data (Bachmann et al., 2021) that is rooted in theory (Ascarza et al., 2017; Fader
& Hardie, 2009). This stream of work has evaluated the impact of various human-engineered features on
churn, including customer heterogeneity and cross-cohort effects in marketing-mix activities (Schweidel
et al., 2008), the frequency and amount of direct marketing activities across individuals and over time
(Schweidel & Knox, 2013), customers’ complaints and recoveries (Knox & Van Oest, 2014), and
customers’ service experiences (e.g., the frequency and recency of past purchases [Braun et al., 2015]).
Though this stream of research is theoretically elegant, it often lacks precision with individual-level
predictions, especially when capturing time-varying contexts of sequential data (Bachmann et al., 2021).
Furthermore, as digitization delivers new kinds of data about customers, capturing complex patterns from
a vast amount of log data is increasingly essential to business analytics. Our work contributes to this
stream by proposing a model that easily incorporates not only time-varying dynamics but also the flexible
nonlinearities of rich modern datasets.

Recent advances in incorporating log data through machine learning applications have seen
great success in churn prediction problems (Ahn et al., 2020). Theoretically, using log data itself, rather
than the aggregation process typically used in existing churn models, can help avoid critical information
losses (Marvasti, 2010; McKee & Miljkovic, 2007). Methodologically, models that use log data operate
by directly extracting time-varying churn signals through sequential DL models. However, this body of
work has only addressed sequences of shorter length, typically ranging in the hundreds, due to the
recurrent architecture of canonical RNNs (e.g., LSTM, GRU), which causes gradient decay over layers
(Li et al., 2018). The approach of Transformer models is also infeasible because the computational
burdens increase quadratically with the sequence length (Beltagy et al., 2020). Considering that logs can
be millions of events long, the application of modern churn approaches is limited, and managers are still
relying on inefficient manual processes that have not kept pace with the quantitative and qualitative
evolution of big data (Chen et al., 2014). Our work combines two representation learning frameworks to
utilize the lengthy log data commonly available in various business fields in their entirety and
automatically.

 Finally, sequence feature embedding provides a distilled representation for the abundantly available
sequential business data, such as clickstreams, content consumption histories, weblogs, and social
network services. However, the challenge of sequence processing―regarding both sequential DL models
and sequence feature embedding models―is effectively capturing long-term dependencies (i.e., the
relationship between distant elements in a sequence) while managing computational scalability for both
the sequence length and the vocabulary size (i.e., the unique number of elements in sequences) (Ranjan et
al., 2016). In other words, the trade-offs between performance, sequence length, and vocabulary size are
unavoidable. For example, though the Transformer has recently exhibited state-of-the-art performance in
various sequential tasks with large vocabulary sizes (e.g., the number of words in NLP), its
computational and memory requirements scale quadratically with the sequence length. Additionally, long-
term Transformers that can address longer sequences with less computational burden (e.g., Longformer
[Beltagy et al., 2020], Sparse Transformer [Child et al., 2019]) sacrifice the superior performance of the
original Transformer due to the approximation process. Similarly, traditional sequence embedding
approaches have ignored length and vocabulary size (Needleman & Wunsch, 1970; Smith & Waterman,
1981) and/or long-term dependencies (Farhan et al., 2017; Kuksa et al., 2008). Methodologically, our
approach faces the same limitation, but we address this problem by leveraging domain knowledge about
customer logs that have limited types of user behaviors (events), and thus low vocabulary size (i.e., small
action choices). By doing so, the model can capture long-term dependencies in lengthy logs; it provides
computational efficiency by focusing on the length of sequences rather than the dimensionality of
sequences (event type). Our work is tailored toward business problems and extends the burgeoning
literature in long-sequence feature embedding to management fields.

 Figure 1 positions our paper within the DL sequence processing literature with respect to customer
churn prediction. The emergence of DL approaches has delivered great returns especially when dealing
with high-dimensional and unstructured sequential data, namely natural language. Prior work explores
and utilizes the potential of these abundantly available data (e.g., text and speech) and represents the state
of the art for a wide range of sequential tasks (LeCun et al., 2015). However, recent advances in sequential DL
models have focused on improving models’ performance and efficiency while retaining the ability to
handle the high dimensionality of the emerging textual data, albeit at the cost of sacrificing long-term
sequence dependence. While the existing managerial research on churn prediction has benefited much
from directly adapting such models to analyze shorter-length sequential log data (Ahn et al., 2020),
achieving excellent performance in the presence of both length and high dimensionality remains elusive.
To address lengthy behavioral logs in a real-world context, there is a need for a different design
philosophy that can handle long-sequence events. Therefore, we suggest a new direction of sequence
modeling―one that focuses on length and on extracting long-term dependencies rather than on handling high
dimensionality―to handle lengthy logs in managerial problems like churn analysis.

3 Empirical Setting

We model customer churn prediction in the context of online games. Online games are an ideal
testing ground for studying how to capture minute-by-minute changing patterns in lengthy customer log data. The global
game market is estimated to be worth $180 billion, which is bigger than the North American sports and
global movie industries combined (Witkowski, 2020). As competition intensifies, game companies have
been increasingly required to perform effective customer churn management by utilizing massive
amounts of play log data. Thus, they have actively tested various log analysis methods and built extensive
know-how. We have therefore grounded the empirical analysis of our approach in this setting, given the
importance of churn management in gaming and the existence of alternative approaches against which we
can benchmark our framework.

 Our analysis focuses on loyal customers’ churn behavior. In online games, the typical payment
distribution follows an extreme Pareto pattern, and loyal users (0.19% of users) supply half of the revenue
(Cifuentes, 2016). This means a small improvement in churn prevention can greatly impact total revenue,
so managing and retaining loyal customers is one of the most important goals of game operations (Lee et
al., 2020). Thus, it is common for game companies to conduct sophisticated analyses focused on a small
group of profitable, loyal users. Similarly, we excluded non-loyal users for consistency with industry
practice and to save computational resources. We acknowledge that non-loyal users may have different
behaviors than loyal users, so the insights from the analysis cannot be extrapolated to that segment
of users.

3.1 Data
We used a global game company’s proprietary data from April 1, 2016 to May 11, 2016 to conduct our
analysis. The company is one of the largest game developers/publishers globally with an annual revenue
of over $2 billion. The data contains 175 million event logs of 4,000 loyal customers over the period of
six weeks. The loyal users were chosen by the company based on cumulative purchases and in-game
activities. The logs have a multivariate time series format that consists of second-level timestamps and 16
relevant columns containing different types of events and user behaviors, which are represented by
numeric and categorical features. That is, each user generates 16 different kinds of in-game behavioral log
sequences. For example, if a user enters a certain area of the game world, a column named “AREA”
records the unique ID of that area and a “TIME” column records the timestamp for it. Simultaneously,
other columns also record the event status (e.g., the user’s level, how much time they spent in the current
session, their in-game social activities and battle status) for the same timestamp.

 Furthermore, we had access to aggregate cross-sectional data cleaned and engineered by the company
(based on the same raw logs). This manually engineered feature set―containing 78 variables―was used
to establish baselines by combining it with various ML predictive models. Since the company
accumulated considerable know-how and domain knowledge for over 20 years as a front-runner in online
games, these baselines effectively demonstrate our model’s improvements and efficacy from a real-world
business standpoint.

 Figure 2 shows the frequency distribution of sequence length for each user. An average user has
43,784 events in six weeks, with a maximum of 582,941 and a minimum of eight. Previous approaches
cannot handle these lengthy sequences and directly extract useful signals.

3.2 Definition of Churn
Similarly to prior work (Lee et al., 2018; Tamassia et al., 2016), we define a churner as a user who does
not play the game for more than five weeks. We predict user churn in a binary classification manner, and
1,200 users out of 4,000 (30%) are churners in the dataset. To allow for effective churn prevention
strategies, we set a three-week time interval between observation and churning windows. By doing so, the
company had an opportunity to implement various retention interventions for potential churners, such as a
new promotion campaign or content updates.

4 Model

Our model assumes that each log has limited event types (i.e., vocabulary size, the unique number of
elements in sequences), typically less than a few hundred. In real-world log management, it is essential to
separate events into various subfields and columns, depending on their categories and types, in the way
typical, relational database systems operate. Since logs are notably long, each column includes only a
small number of customer actions for efficient data processing by operations such as groupby and join.
Additionally, even if logs have large vocabulary sizes, due to the existence of timestamps, it is easy to
separate them into sub-logs that have small vocabulary sizes through a simple filtering process with low
computation. Therefore, business logs are bounded in vocabulary size, unlike the natural
languages processed by recent sequential DL models. In other words, to address long-sequence logs in
business, the computational scalability should focus on addressing the sequence length, rather than the
vocabulary size.

 Leveraging this domain knowledge, we propose a sequence embedding approach to handle customers’
lengthy traces of online content consumption that can extend to millions of events in length. To do so, we
incorporate a graph-based sequence embedding approach (Ranjan et al., 2016) that is methodologically
fitting for capturing long-term dependencies, while linearly and quadratically scaling with sequence
length and vocabulary size, respectively. Then, we hierarchically stack the embedding model on a
novel multivariate sequential DL model, the Temporal Convolutional Network (TCN) (Bai et al.,
2018; Lea et al., 2017). This model is built to concurrently capture both the local-level features of the
embedded representations produced by the sequence embedder and their temporal patterns.

 Below, we first discuss the theoretical background of our model selection criteria. Next, we overview
our log processing framework. We then elaborate on the graph-based sequence feature embedding
process. Following that, the neural network architecture and its components are described. Lastly, we
provide details of model estimation and optimization.

4.1 Theoretical Background in Model Selection
This section elaborates on why existing state-of-the-art sequential DL models are not suitable for real-
world lengthy log analysis, in contrast to our graph-embedding approach.2 We first discuss five model
selection criteria to satisfy the unique characteristics of big data business analytics. We then highlight the
limitations of existing sequential DL models from the perspectives of methodological and computational
costs. Table 1 summarizes the discussion.

4.1.1 Selection Criteria to Address Lengthy Logs in Large-Scale Business Analytics

Our approach meets the following five criteria for model selection in lengthy business log analysis.

 (1) The model has to scale linearly with sequence length without using techniques that cause
 information loss, such as approximation and compression. Neither vanilla Transformer nor
 Transformers for long-term contexts meets this standard.
 (2) The model needs to perform comparably to popular DL models (e.g., LSTM, GRU) in addressing
 low-dimensional data.
 (3) The model should handle multivariate time series or logs. Neither vanilla Transformer nor
 Transformers for long-term contexts supports multivariate sequences (Zerveas et al., 2021).
 (4) The model should be free from the fixed-length input problem―which wastes substantial
 computation when all input sequences are set to the same length (i.e., the maximum sequence length)
 through padding techniques (e.g., filling with zeros). Existing sequential DL models (e.g., RNNs
 and Transformers) can only consume fixed-length inputs, which is a key barrier limiting their
 application to lengthy sequences (Dai et al., 2019). For example, in our dataset, existing methods
 require 13 times more computing resources to address irrelevant zeros, which account for 92.5%
 of the data after the padding process.
 (5) The model should be computable in distributed environments. RNNs and Transformers for
 long-term contexts also have shortcomings regarding this criterion (Bai et al., 2018; Fang et al.,
 2021).

4.1.2 Limitations of Existing Methods

Existing DL sequential methods for lengthy log processing have limitations due to their focus on the high
dimensionality of data rather than the long-term dependency of input sequences.

2
 We discuss this in detail in Appendix A “Model selection criteria and limitations of existing models” and Appendix B
“Estimated cost of each model and its feasibility for addressing lengthy logs.”

● RNNs―not only vanilla RNN but also LSTM and GRU―are inherently ill-fitted for capturing
 long-term dependencies due to the limitation of their recurrent structure. These models can
 generally handle mid-range sequences up to a thousand events in length due to gradient decay
 over layers (Li et al., 2018). We also empirically show the same result (see the results of the
 truncation settings in section 5.2).
 ● Truncated backpropagation through time (TBPTT) sacrifices long-term dependencies to help
 RNNs handle longer sequences. Specifically, TBPTT loses the ability to capture long-term
 dependencies due to its truncation mechanism―which saves computation and memory costs by
 truncating the backpropagation process after a fixed number of lags (Tallec & Ollivier, 2017).
 ● Vanilla Transformer cannot handle long-term dependencies due to its quadratic computational
 scalability with input length (Child et al., 2019), despite its superior performance. For example,
 the typical input length of Transformer models is set to 512. If we analyzed the full-length
 sequences in our dataset (i.e., 582,941 events in length) with a Transformer, it would require roughly 1.3
 million (= [582,941 / 512]²) times more computational cost relative to the standard Transformer
 setting. Thus, processing lengthy log data with typical Transformers is infeasible.
 ● Transformers for long-term contexts―which are modified Transformers that address longer
 sequences with less computational burden―rely on the approximation of long-term
 dependencies. It is unavoidable for Transformer models to sacrifice their superior performance to
 capture longer contexts, if not entirely infeasible to compute due to insurmountable memory and
 time requirements. When used, the performance loss increases as the computational efficiency
 increases (Tay et al., 2020).

4.1.3 Computational Costs

We estimate the computational requirements of our approach and other traditional, state-of-the-art
sequential DL models to address our dataset. This quantitatively shows why the existing methods are
inadequate or infeasible for handling extremely long sequences effectively. From a perspective of
computational cost, our model is 11 to 6,720 times cheaper than state-of-the-art alternatives (i.e., long-
term Transformers) while losing no information from the approximation process. We discuss the cost
savings of human labor later in the paper. Whereas the alternatives sacrifice their performance to achieve
computational efficiencies when addressing longer sequences (Tay et al., 2020), they still have huge
computational requirements―which is a significant hurdle in real-world, large-scale, lengthy log analysis.
Some models’ computational requirements increase linearly with input length, but their cost is still 11 times
higher than ours due to the fixed-length input problem. These cost gaps rapidly increase as the
size and length of data increase.

Table 1 summarizes the key features and computing costs of each model regarding the lengthy log
processing discussed in this section. The detailed explanations and related implications are described in
Appendices A and B.

4.2 Framework Overview
Figure 4 plots an overview of our novel approach. We designed our model as a hierarchical structure that
stacks a graph-based sequence embedder on TCN, rather than letting a single embedder handle the whole
sequence process. Once a sequence embedder compresses lengthy logs into short vector sequences in an
unsupervised manner, TCN squeezes out useful signals from them. This hybrid design allows higher
performance and flexibility by maximizing the utilization of novel DL techniques in the process. Though
the graph-based sequence embedder exhibited better accuracy than LSTM in some benchmark tests
(Ranjan et al., 2016), state-of-the-art sequential models (e.g., Transformers, TCN) have shown superior
performance and methodological elegance in many different contexts, which motivates stacking the
embedder on TCN rather than using it alone. We summarize the approach here (a minimal code sketch of
steps 1–5 follows the list):

 (1) Split and Stack: In the first step, we split a full-length input sequence into subsequences. The
 input sequence is an ordered series of label-encoded events (i.e., numbers). For example, suppose
 a certain column includes five events {login, battle, chat, purchase, logout}; then a sequence {login,
 battle, battle, purchase, chat, logout} can be converted to {0, 1, 1, 3, 2, 4}. We split each six-week
 sequence into 1,008 subsequences of one-hour increments. We chose a one-hour increment to set
 the sequence length after the embedding process at 1,008, so that TCN―whose computational burden
 scales quadratically with sequence length (Krebs et al., 2021)―can handle these data with ease.
 Finally, we obtained a total of 4,032,000 subsequences (4,000 users × 1,008 hours) and stacked them
 for the next embedding process.

 (2) Sequence Embedding: This step converted the stacked subsequences into meaningful
 representation (or feature) vectors, allowing similar sequences to be embedded near each other in
 the lower dimensional space. We did so by integrating a graph-based embedding approach
 (Ranjan et al., 2016) that quantifies the effects (i.e., associations) of events on each other based
 on their relative positions in a sequence. Specifically, we set events (i.e., customer behaviors) as
 nodes and their relationships as links (i.e., associations) in a graph structure. We embedded the
 sequences into a vector space based on the unique characteristics of their graph structure.

 (3) Principal Component Analysis (PCA): Since the embedded representation vector contains
 information about all possible bidirectional event pairs, it is usually sparse and high-dimensional.
 For efficiency, we compressed the embedded vectors through PCA (the amount of explained
 variance was set to 95%).

(4) Iterate Steps One to Three for all 16 columns: Our dataset has a total of 16 columns. Each user
 has 16 different log sequences about different types of events and behaviors. Thus, we repeated
 the sequence embedding process for all columns and got a total of 16 embedded representation
 vectors. These sixteen vectors were then concatenated to form a unified meta-feature vector.

 (5) Unstack: The meta-feature vector has a vertically stacked structure. For example, the first row is
 an embedded feature vector of user-1 at time-1 (i.e., the first hour of the six-week period). For
 individual-level analysis, we reshaped the data dimension into 4,000 users × 1,008 (6 weeks × 7
 days × 24 hours) time periods. Consequently, each user has a sequence of 1,008 representation
 vectors, and each vector contains distilled information of each hour-long subsequence.

 (6) Neural Network: The TCN layer abstracts sequential feature representations while considering
 their long-term contexts. The attention layer helps the TCN layer efficiently handle the data's
 sparsity and long-term dependencies. Abstracted outputs are provided to a fully connected layer,
 which distills the information one more time. Since the goal was to predict imbalanced binary
 labels (churn: 30%, stay: 70%), the learning process was conducted to minimize weighted binary
 cross-entropy loss.
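
 The following is a minimal sketch of steps 1–5 for a single column, assuming hourly timestamps and label-encoded events. It is our own illustration, not the authors' code: the toy dimensions, array names, and the stand-in `embed_subsequence` function (a simple event-frequency vector used in place of the graph-based embedder detailed in the next section) are all placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy sizes; in the paper: 4,000 users, 1,008 hours (6 weeks x 7 days x 24 hours).
N_USERS, N_HOURS, VOCAB = 100, 1008, 5

def embed_subsequence(events, vocab=VOCAB):
    """Stand-in embedder: a normalized event-frequency vector. The actual model
    uses the graph-based association embedding described in the next section."""
    counts = np.bincount(events, minlength=vocab).astype(float)
    return counts / max(len(events), 1)

# (1) Split and Stack: hourly_events[u][h] holds the label-encoded events of user u in hour h.
hourly_events = [[np.random.randint(0, VOCAB, size=np.random.randint(0, 50))
                  for _ in range(N_HOURS)] for _ in range(N_USERS)]
stacked = np.array([embed_subsequence(sub)            # (2) Sequence Embedding
                    for user in hourly_events for sub in user])

# (3) PCA: keep the components explaining 95% of the variance.
compressed = PCA(n_components=0.95).fit_transform(stacked)

# (4) would repeat steps 1-3 for each of the 16 columns and concatenate the results.
# (5) Unstack: reshape back to (users, hours, feature_dim) before feeding the TCN.
features = compressed.reshape(N_USERS, N_HOURS, -1)
```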

4.3 Graph-based Sequence Feature Embedding

Denote an input sequence in the data set of sequences $\mathcal{S}$, which is composed of event factors in a set
$\mathcal{V}$ (i.e., the vocabulary), by $s$. $L^{(s)}$ denotes the length of a sequence and $s_l$ denotes the event at position $l$ in
sequence $s$, where $s_l \in \mathcal{V}$ and $l = 1, \ldots, L^{(s)}$. The graph-based sequence embedder characterizes a sequence
by quantifying the forward-direction effects (i.e., associations)―the effect of the preceding event on the
later event―of all paired events. Here, the effect of event $u$ on event $v$, which are at positions $l$ and $m$
respectively, is defined as

$$\phi_{\kappa}(l, m) = e^{-\kappa (m - l)},$$

where $\kappa > 0$ is a tuning parameter and $l < m$.

 The forward-direction effects can be applied to sequences of various lengths. Therefore, the
graph-embedding approach saves a huge amount of computational resources that would otherwise be spent on
unnecessary dummy inputs (i.e., zeros), which account for 92.5% of our dataset when zero-padding to the
maximum sequence length.

 We can store the associations of all paired events in an asymmetric matrix of size $|\mathcal{V}| \times |\mathcal{V}|$. For example,
in Figure 5, where there are five events (i.e., A, B, C, D, E), we get a total of 25 associations (i.e., A-A,
A-B, A-C, ..., E-D, E-E). Here, the matrix holds all associations of event pairs $(u, v)$, where $u$ is the event
preceding $v$.

The association feature vector $\Psi(s)$, which is used for the sequence embedding process, is a normalized
aggregation of all associations, as follows:

$$\Psi_{uv}(s) = \frac{\sum_{(l,m) \in \Lambda_{uv}(s)} \phi_{\kappa}(l, m)}{|\Lambda_{uv}(s)|}, \qquad \Lambda_{uv}(s) = \{(l, m) : s_l = u,\ s_m = v,\ l < m\}.$$

 $\Psi(s)$, the feature representation of sequence $s$, denotes the embedded position
of sequence $s$ in a $|\mathcal{V}|^{2}$-dimensional feature space. Also, since $\Psi_{uv}(s)$ contains the association between the
paired events (i.e., the effect of the preceding event $u$ on the later event $v$), we can interpret it as a directed
graph with $|\mathcal{V}| \times |\mathcal{V}|$ edges. Here, the edge weights are the normalized associations of the paired events.
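
 To make the association computation concrete, the sketch below is a naive reference implementation (ours, not the authors' released code) of the forward-direction effects and their normalized aggregation for a single label-encoded subsequence. For readability it loops over all event pairs, which is quadratic in subsequence length; the embedding of Ranjan et al. (2016) computes the same quantities with cumulative scans that scale linearly with sequence length.

```python
import numpy as np

def sequence_association_features(seq, vocab_size, kappa=1.0):
    """Compute the |V| x |V| forward-association matrix for one label-encoded
    subsequence: phi_kappa(l, m) = exp(-kappa * (m - l)) for l < m, averaged
    over all occurrence pairs of each directed event pair (u, v)."""
    effect_sum = np.zeros((vocab_size, vocab_size))
    pair_count = np.zeros((vocab_size, vocab_size))
    for l, u in enumerate(seq):
        for m in range(l + 1, len(seq)):
            v = seq[m]
            effect_sum[u, v] += np.exp(-kappa * (m - l))
            pair_count[u, v] += 1
    psi = np.where(pair_count > 0, effect_sum / np.maximum(pair_count, 1), 0.0)
    return psi.flatten()  # one row of the stacked embedding matrix

# Example with the five-event vocabulary {login:0, battle:1, chat:2, purchase:3, logout:4}.
features = sequence_association_features([0, 1, 1, 3, 2, 4], vocab_size=5, kappa=1.0)
print(features.shape)  # (25,) -- one association per directed event pair
```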

4.4 Neural Network Architecture

The neural network module consists of a TCN, an attention layer, fully connected layers, and a weighted
cross-entropy loss layer. We describe the specifics as follows.

4.4.1 Temporal Convolutional Network (TCN) for Sequence Processing

TCN retains the time series processing ability of RNN, but adds the computational efficiency of
convolutional networks (Lea et al., 2017). TCN outperforms canonical RNNs such as LSTM or GRU
across various sequential tasks while handling longer input sequences. Compared to RNNs, TCN is faster
while requiring less memory and computational power and is more suitable for parallel processing (Bai et
al., 2018), which delivers critical advantages in addressing a vast amount of customer logs.

 TCN consists of stacked residual blocks, which in turn consist of convolutional layers, activation
layers, normalization layers, and regularization layers (see Figure 6 and Lea et al. [2016] for more
details).

 For the layers, a temporal block is constructed by stacking several dilated causal convolutional layers. In detail, for
an input sequence $x \in \mathbb{R}^{n}$, an output element $F(t)$, and a convolution filter $f : \{0, \ldots, k-1\} \rightarrow \mathbb{R}$ with size $k$,
the level-$j$ dilated convolution operation at time $t$ is defined as

$$F(t) = (x *_{d} f)(t) = \sum_{i=0}^{k-1} f(i) \cdot x_{t - d \cdot i},$$

where $d$ is the dilation factor, which can be written as $d = 2^{j}$ at level $j$ to cover an exponentially wide
receptive field.

 The Rectified Linear Unit (ReLU) is used as an activation function to provide nonlinearity to the
output of the convolutional layers (Glorot et al., 2011) and is defined as

$$\mathrm{ReLU}(x) = \max(0, x).$$

 One of the obstacles of DL is that the gradients for the weights in one layer are highly correlated to the
outputs of the previous layer, resulting in increased training time. Layer normalization is designed to
alleviate this “covariate shift” problem by adjusting the mean and variance of the summated inputs within
each layer (Ba et al., 2016). Though the theoretical motivation for decreasing covariate shift is
controversial in technical ML literature, the practical advantage of normalization methods, which allow
for faster and more efficient training, has proven indispensable to a wide range of DL applications (A.
Zhang et al., 2019). The statistics of layer normalization over the hidden units in the same layer are written
as

$$\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_{i}^{l}, \qquad \sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_{i}^{l} - \mu^{l}\right)^{2}}, \qquad \bar{a}_{i}^{l} = \frac{a_{i}^{l} - \mu^{l}}{\sigma^{l}},$$

where $\bar{a}_{i}^{l}$ is the normalized summed input to hidden unit $i$ in layer $l$, $a_{i}^{l}$ is the corresponding summed input, and $H$ denotes the
number of hidden units in the layer.

 Dropout is an essential regularization technique to prevent the overfitting of the neural network. The
idea is to randomly drop (hidden and visible) units from the network during training, which prevents units
from co-adapting too often. By doing so, dropout improves the generalization of neural networks by
allowing the training process to be an efficient stochastic approximation of an exponential ensemble of
“thinned” networks (Srivastava et al., 2014).
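
 For concreteness, the following is a minimal Keras sketch of one TCN residual block combining the components described above (dilated causal convolutions, layer normalization, ReLU, and dropout with a residual connection). It is our illustrative simplification, not the exact architecture found by the Bayesian optimization; the filter count, kernel size, dropout rate, and input width are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_residual_block(x, filters, kernel_size=3, dilation_rate=1, dropout=0.1):
    """One TCN residual block: two dilated causal convolutions, each followed by
    layer normalization, ReLU, and dropout, plus a residual connection."""
    skip = x
    for _ in range(2):
        x = layers.Conv1D(filters, kernel_size, padding="causal",
                          dilation_rate=dilation_rate)(x)
        x = layers.LayerNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.SpatialDropout1D(dropout)(x)
    # Match channel dimensions before adding the residual connection.
    if skip.shape[-1] != filters:
        skip = layers.Conv1D(filters, 1, padding="same")(skip)
    return layers.Add()([x, skip])

# Stacking blocks with exponentially growing dilations (d = 2^j) covers a long
# receptive field over the 1,008 hourly embedded representations.
inputs = tf.keras.Input(shape=(1008, 64))  # 64 is a placeholder embedding width
h = inputs
for j in range(4):
    h = tcn_residual_block(h, filters=64, dilation_rate=2 ** j)
```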

4.4.2 Attention with Context

The attention mechanism allows the sequential model (i.e., TCN) to focus more on the relevant parts of
the input data by acting like random access memory across time and input data (Bahdanau et al., 2014).
Thus, it improves the training efficiency and performance of the model by giving more direct pathways to
the model structure (Raffel & Ellis, 2015). We follow and implement the work of Z. Yang et al. (2016).
Specifically,

$$h_t = [\overrightarrow{h_t};\ \overleftarrow{h_t}], \qquad u_t = \tanh(W h_t + b), \qquad \alpha_t = \frac{\exp(u_t^{\top} u_w)}{\sum_{t} \exp(u_t^{\top} u_w)}, \qquad v = \sum_{t} \alpha_t h_t,$$

where $h_t$ is an annotation obtained by concatenating $\overrightarrow{h_t}$ (the forward hidden state) and $\overleftarrow{h_t}$ (the
backward hidden state) in the sequence of the bidirectional TCN. $u_t$ is a hidden representation of $h_t$
obtained through a one-layer multilayer perceptron (MLP). $u_w$ is an embedded representation-level
context vector, randomly initialized and learned during the training process. $\alpha_t$ contains the normalized
importance of each embedded representation vector and is calculated through a softmax function
of $u_t$ and $u_w$. $v$ is the output vector that summarizes all the information of the embedded representations in
a sequence. The attention mechanism not only allows for faster training and better performance but also
increases the stability of training by providing more direct pathways through the model structure.
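
 As an illustration, a minimal Keras implementation of this attention-with-context mechanism (following Yang et al., 2016) might look as follows; the layer name is our own placeholder, not a built-in Keras class.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionWithContext(layers.Layer):
    """Soft attention over time steps with a learned context vector u_w
    (Yang et al., 2016): u_t = tanh(W h_t + b), alpha_t = softmax(u_t . u_w),
    v = sum_t alpha_t * h_t."""

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(shape=(dim, dim), initializer="glorot_uniform", name="W")
        self.b = self.add_weight(shape=(dim,), initializer="zeros", name="b")
        self.u_w = self.add_weight(shape=(dim,), initializer="glorot_uniform", name="u_w")

    def call(self, h):                                     # h: (batch, time, dim)
        u = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)
        scores = tf.tensordot(u, self.u_w, axes=1)         # (batch, time)
        alpha = tf.nn.softmax(scores, axis=1)              # normalized importance
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)  # (batch, dim)
```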

4.4.3 Fully Connected Layers

We built the fully connected layer module by stacking multiple component blocks. Each component block
consists of dense and dropout layers, layer normalization, and Exponential Linear Units (ELU) activation.
The dense layer is a linear operation on the input vector. Dropout prevents the model from overfitting,
and layer normalization boosts the efficiency of training. ELU (Clevert et al., 2015) adds nonlinearity to
the model while lowering susceptibility to the vanishing gradient problem and is defined as

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ \alpha \left(e^{x} - 1\right), & x \le 0, \end{cases}$$

where $\alpha$ is a hyperparameter.
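
 A minimal Keras sketch of one such fully connected component block (dense, layer normalization, ELU, dropout) could look like this; the width and dropout rate are placeholders rather than the tuned values.

```python
from tensorflow.keras import layers

def dense_block(x, units, dropout=0.2):
    """One fully connected component block: dense -> layer norm -> ELU -> dropout."""
    x = layers.Dense(units)(x)
    x = layers.LayerNormalization()(x)
    x = layers.Activation("elu")(x)
    return layers.Dropout(dropout)(x)
```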

4.4.4 Minimizing the Loss Function

We optimized a weighted binary cross-entropy loss function to train imbalanced binary labels (churn:
30%, stay: 70%). This function gives additional weight to the minority class (i.e., churn users, Y=1)
relative to the typical binary cross-entropy loss function. The additional weight can be calculated as the
ratio between “the number of negative classes” (i.e., stay users, Y=0) and “the number of positive
classes” (i.e., churn users, Y=1). The loss is calculated as follows:

$$\mathcal{L}(\theta) = -\frac{1}{M}\sum_{m=1}^{M}\left[\, w \cdot y^{(m)} \log f_{\theta}\!\left(x^{(m)}\right) + \left(1 - y^{(m)}\right) \log\!\left(1 - f_{\theta}\!\left(x^{(m)}\right)\right) \right],$$

where
 $M$ = number of training examples,
 $w$ = weight on the positive (churn) class,
 $y^{(m)}$ = target label for training example $m$,
 $x^{(m)}$ = input for training example $m$,
 $f_{\theta}$ = model with neural network weights $\theta$.

 When managers apply our framework to other business problems, the loss function can be flexibly
changed to suit the given tasks (e.g., regression, multi-labeled classification).
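
 As a concrete illustration, the class weight and loss described above can be written in TensorFlow as follows; this is a minimal sketch under the paper's stated 30/70 class split, not the authors' training script.

```python
import tensorflow as tf

# Weight for the minority (churn) class: (# negative examples) / (# positive examples).
n_pos, n_neg = 1200, 2800          # churners vs. stayers among the 4,000 users
pos_weight = n_neg / n_pos         # ≈ 2.33

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Weighted binary cross-entropy that up-weights churners (y = 1)."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    loss = -(pos_weight * y_true * tf.math.log(y_pred)
             + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(loss)

# model.compile(optimizer="adam", loss=weighted_bce, metrics=["AUC"])
```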

4.5 Model Estimation and Optimization

4.5.1 Automated Model Generation through Bayesian Optimization

Our model utilizes an AutoML framework to help managers easily extract insightful business-oriented
features without manual effort. The use of AutoML not only cuts manual effort but also makes
benchmark comparisons more structured and unbiased. Through Bayesian optimization techniques
(Snoek et al., 2012), we automate key processes, such as finding the best model structure (e.g., the width,
depth, and capacity of TCN and fully connected layers; the choice of activation functions) and fine-tuning
hyperparameters (e.g., dropout rate, learning rate). As a result, our approach can operate end-to-end (raw
data to result) without human intervention. For the implementation, we use TensorFlow (Abadi et al.,
2016) and Keras Tuner (O’Malley et al., 2019).
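
 To illustrate, a search over depth, width, dropout, and learning rate with Keras Tuner's Bayesian optimization might be set up as below. The search ranges, objective, and the pooling stand-in for the TCN/attention stack are our illustrative placeholders, not the exact configuration used in the paper.

```python
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers

def build_model(hp):
    """Build a candidate network whose structure is chosen by the tuner."""
    inputs = tf.keras.Input(shape=(1008, 64))            # placeholder input width
    x = layers.GlobalAveragePooling1D()(inputs)           # stand-in for the TCN/attention stack
    for _ in range(hp.Int("n_dense_blocks", 1, 3)):
        x = layers.Dense(hp.Int("units", 64, 512, step=64))(x)
        x = layers.LayerNormalization()(x)
        x = layers.Activation(hp.Choice("activation", ["relu", "elu"]))(x)
        x = layers.Dropout(hp.Float("dropout", 0.1, 0.5, step=0.1))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Float("lr", 1e-4, 1e-2, sampling="log")),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

tuner = kt.BayesianOptimization(build_model, objective=kt.Objective("val_auc", "max"),
                                max_trials=50)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=30)
```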

4.5.2 Optimization Methods

Complex and noisy real-world data make the training process highly unstable and cause it to converge to
poor local minima, especially when handling sequential tasks (Pascanu et al., 2013). Thus, scholars and
engineers have implemented advanced optimization techniques. In this regard, we applied three different
state-of-the-art optimization methods to improve the effectiveness and efficiency of the training process (a brief code sketch follows the list):

 1) Rectified Adam (RAdam): Despite faster and more stable training, popular stochastic optimizers
 (e.g., Adam and RMSProp) experience a variance issue in which problematically large variance
 in the early stage of training risks converging into undesirable local optima (L. Liu et al., 2019).
 RAdam addresses this problem by incorporating a warmup (i.e., an initial training with a much
 smaller learning rate) to reduce the variance.

2) Lookahead optimizer: This iteratively updates two sets of weights, “fast weights” and “slow
 weights,” and then interpolates them. By doing so, Lookahead improves training stability and
 reduces the variance of optimization algorithms such as Adam, SGD, and RAdam (M. Zhang et
 al., 2019).

 3) Gradient centralization: This performs Z-score standardization on the gradient vectors, similarly
 to batch normalization. Thus, it boosts not only the generalization performance of networks but
 also the stability and efficiency of the training process (Yong et al., 2020).
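
 For reference, the first two of these optimizers are available in TensorFlow Addons and can be combined as sketched below; the learning rate and Lookahead settings shown are library defaults, not the tuned values. Gradient centralization is not shown here; it can be applied by centering each gradient tensor before the update step.

```python
import tensorflow_addons as tfa

# RAdam stabilizes the early phase of training (variance rectification / warmup),
# and Lookahead wraps it to interpolate between fast and slow weights.
base_opt = tfa.optimizers.RectifiedAdam(learning_rate=1e-3)
optimizer = tfa.optimizers.Lookahead(base_opt, sync_period=6, slow_step_size=0.5)

# model.compile(optimizer=optimizer, loss=weighted_bce, metrics=["AUC"])
```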

4.5.3 Computing Environment and Training Time

The sequence embedding was performed on an Amazon SageMaker ml.m5.12xlarge instance with 48
multi-thread CPUs and 192GB of RAM. The embedding process took around six hours including the
dimensionality reduction with PCA. The TensorFlow deep learning library was used to estimate our
neural network model (Abadi et al., 2016). The model estimation took around two hours with 5-fold
cross-validation on an Nvidia GTX 1080 Ti GPU server with 64 GB of RAM. Additionally, the Bayesian
optimization was conducted on the same server and took about 20 hours to find the best model structure
and hyperparameters. Since our model is based on TCN, parallel processing can contribute to faster
model training, which is hard for recurrent computations of canonical RNNs.

5 Results

In this section, the predictive performance of our automated log processing is benchmarked against
existing manual approaches. Additional analyses reveal when and how our model excels in predicting
churn in real-world data. We discuss the economic impact of our model on improving game value while
saving time and labor costs in comparison to existing methods.

5.1 Experimental Setting

We benchmark the quality of representations extracted by our automated log process by comparing its
predictive performance against several alternatives based on various ML/DL models―Logistic
Regression, Decision Tree, K-Nearest Neighbor, Naive Bayes, Support Vector Machine, Extreme
Gradient Boosting (XGBoost), and Deep Neural Network (Multilayer Perceptron)―and the manually
engineered feature set developed by the company. This manually engineered feature set represents years
of domain knowledge as one of the leading firms in the game industry. Our approach natively utilizes
lengthy logs due to its sequence embedding ability, while other options use the aggregate data commonly
available in real-world business analytics settings.

We additionally tested two different manual approaches that are commonly used in the existing churn
prediction setting―aggregation (i.e., aggregating long period sequences through statistical measures) and
truncation (i.e., utilizing only short period sequences by keeping a designated number of recent behaviors
and dropping the rest). In the aggregation approach, we converted lengthy logs into shorter sequences
through descriptive statistical measures (e.g., sum, count) that are widely used in big data analytics and
churn prediction settings (Jeon et al., 2017; Rothmeier et al., 2021; Zdravevski et al., 2020). To do so, we
aggregated daily sequences through six statistical functions (i.e., max, min, count, average, sum, and
mode) that are most commonly used for modeling user behaviors in game contexts (Guitart et al., 2018;
Jeon et al., 2017; Periáñez et al., 2016; Rothmeier et al., 2021; Sifa et al., 2015). In the truncation
approach, we tested various input lengths from 500 to 10,000. Then, we examined the efficacy of the
existing truncation method (as an alternative to our approach) in dealing with lengthy logs. For both
settings, LSTM and GRU models with Attention Mechanism were utilized.

 For robust model evaluation, we conducted 5-fold cross-validation based on three splits―training,
validation, and testing―that were constructed with 60%, 20%, and 20% of users, respectively. The final
score is the average performance of each fold on the testing set. We used three evaluation metrics
together―1) Area under the ROC Curve (ROC AUC)3 (Fawcett, 2006), 2) Area under the Precision-
Recall Curve (PR AUC)4 (Davis & Goadrich, 2006), and 3) F1-score5 (Powers, 2020)―to robustly
address imbalanced output labels (churn: 30%, stay: 70%). Additionally, to prevent biases from manual
hyperparameter tuning, we applied the same automatic process to DL models. The structure and
hyperparameters of DL models were determined by the Bayesian optimization process (Snoek et al.,
2012) without any manual intervention. The same early stopping rule was also applied: if the performance
did not increase during five epochs, the model stopped the training process and kept the best weights.
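
 To make the evaluation protocol concrete, the snippet below shows how the three metrics can be computed for one test fold with scikit-learn; it is an illustrative sketch of the protocol (function name and threshold are our own), not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def evaluate_fold(y_true, y_prob, threshold=0.5):
    """Score one test fold; y_true and y_prob are NumPy arrays of labels and
    predicted churn probabilities."""
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),
        # Average precision is a standard estimate of the area under the PR curve.
        "pr_auc": average_precision_score(y_true, y_prob),
        "f1": f1_score(y_true, (np.asarray(y_prob) >= threshold).astype(int)),
    }
```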

5.2 Evaluating the Predictive Quality

Table 2 benchmarks the predictive performance of various models. Our approach shows superior
performance over other competitive baselines for all ROC AUC, PR AUC, and F1-score evaluation
metrics. Specifically, our automated method improved performance by at least 5% compared to baselines.
It is worth emphasizing that we did not use any domain knowledge in our analysis (i.e., manually
engineering features based on knowledge of the data and logs), but the baseline feature set was developed
over many years by a global company with leading expertise and know-how of its own data. Since our
analysis focused on high-spending customers, the impact of the improvement is substantial (see details in
Appendix C).

3 ROC AUC, PR AUC, and F1-score robustly measure the performance of imbalanced classification models. Higher scores
(maximum 1) in each metric indicate better model performance. To measure ROC AUC, we first drew an ROC curve that
plotted the True Positive Rate (y-axis) vs. the False Positive Rate (x-axis) at different classification thresholds from 0 to 1.
ROC AUC measures the two-dimensional area underneath the ROC curve.
4 To measure PR AUC, we first drew a Precision-Recall curve that plotted Precision (y-axis) vs. Recall (x-axis) at different
classification thresholds from 0 to 1. PR AUC measures the two-dimensional area underneath the Precision-Recall curve.
5 The F1-score can be calculated as (2 × Precision × Recall) ÷ (Precision + Recall).

 With regard to the aggregation approach, the combination of automated aggregation features and
LSTM/GRU showed better performance than traditional ML models (i.e., Logistic Regression, Decision
Tree, K-Nearest Neighbor, Naive Bayes, Support Vector Machine) based on the manually engineered
feature set, but worse than the best-performing ML/DL models (i.e., XGBoost and NN) based on the manually
engineered feature set. This implies that, despite its labor-intensive characteristics, manual feature
engineering can be a worthwhile approach in comparison to simple rule-based automated aggregation
methods.

 With the truncation approach, regardless of the input length, we obtained a ROC AUC of 0.5,
equivalent to the expectation of a random guess (Fawcett, 2006). Because LSTM/GRU can only handle a
limited length of sequences, using logs without aggregation fails to provide enough information relevant
to churn prediction, even if we increase the input length. If we assume the models can address sequences
around a thousand events in length as shown in benchmark tests (Li et al., 2018), and considering that
average users have 43,784 events in our dataset, the truncation approach only covers 2.3% of user
behaviors (1,000 ÷ 43,784 ≈ 0.023) on average. This implies that aggregation processes, such as our log
processing framework, are essential for the lengthy log data of modern high-transaction environments.

5.3 Experiments on Model Efficacy & Managerial Impact

This section first explores what conditions are needed for our approach to excel over existing approaches.
Specifically, we investigate the advantage of our model over baselines, depending on the sequence length
and the time period length of the input data. We then discuss the potential economic impact of cost
savings in model estimation and human labor.

5.3.1 Performance Improvement According to Sequence Length

Figure 7 visualizes the performance gap between our model and other baseline models (y-axis) according
to sequence length (x-axis). On the x-axis, we grouped a total of 4,000 users in our dataset into quintiles
(5 bins) based on their sequence length so that the higher ranking groups have longer sequences. For
example, the fifth group has the longest sequences. The y-axis indicates the performance difference
between our model and the average of other baseline models (i.e., the averaged performance of the
baseline ML/DL models―Logistic Regression, Decision Tree, K-Nearest Neighbor, Naive Bayes,
Support Vector Machine, XGBoost, and Deep Neural Network―when based on the manually engineered
feature set). As the y-value increases, the gap between our model and other baselines also increases,
showing that our model works better than the baselines.
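
 As an illustration of this analysis step, users can be binned into sequence-length quintiles and the per-bin performance gap computed as follows; the table, column names, and randomly generated scores are hypothetical placeholders, not the paper's data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical per-user table: sequence length, churn label, and the predicted
# churn probabilities of our model and of an averaged baseline.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "seq_length": rng.integers(8, 582_941, size=4000),
    "churn": rng.integers(0, 2, size=4000),
    "p_ours": rng.uniform(size=4000),
    "p_baseline": rng.uniform(size=4000),
})

# Bin users into sequence-length quintiles and compute the ROC AUC gap per bin.
df["length_quintile"] = pd.qcut(df["seq_length"], q=5, labels=range(1, 6))
for q, g in df.groupby("length_quintile", observed=True):
    gap = roc_auc_score(g["churn"], g["p_ours"]) - roc_auc_score(g["churn"], g["p_baseline"])
    print(f"quintile {q}: ROC AUC gap = {gap:+.3f}")
```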

 Figure 7 shows that the relative performance of our model increases more in longer sequence groups
in both ROC AUC and PR AUC. The trend lines have positive slopes (i.e., coefficient of linear
regression) even when we increased the number of sequence group bins to 10, 20, 50 and 100 (see
Table 3). The slopes are also statistically significant except in one case (the p-value is slightly larger than
0.05). These results suggest that our sequence embedding approach delivers more benefits when dealing
with longer sequences.

 Somewhat surprisingly, we observed a slight performance drop with the longest sequence group
(i.e., the fifth group in quintiles). One potential explanation that emerged from discussions with domain
experts is the existence of anomalous users who have very different behavioral patterns. Successful online
games usually have two anomalous groups. The first one consists of cheating users who use prohibited
tools (e.g., automatic playing bots) to achieve higher efficiency in the gameplay. The second one consists
of gold farmers who hire workers (often from developing countries) and let them play the game to earn
cyber money, which can then be exchanged for real money. These anomaly groups generally have much
longer behavioral sequences than other users because they use automation (e.g., bots or workers) to play
the game longer. Therefore, most anomalous users may belong to the longest sequence group. In real-
world settings, accurate anomaly detection is technically challenging, so churn datasets usually include
anomalous users. Since ML/DL tends to minimize the loss of the entire population, small numbers of
abnormal patterns are hard to capture with global prediction models. Additional studies may be needed to
better incorporate anomaly detection into churn prediction for improved predictive quality.

5.3.2 Performance Improvement According to Period Length

Processing longer sequences incurs higher computing costs. In a big data environment, it is important to
find a sweet spot (i.e., the optimal length of input sequences) that satisfies both performance and
computational efficiency. Thus, we investigated the optimal sequence length based on the period (e.g.,
day, week) of data. We examined: 1) How much information does the most recent data have? 2) How
much additional benefit does the data yield as it goes further back in time? 3) Is there a sweet spot where
the marginal performance improvement becomes almost zero?

 In Figure 8, the input sequence length is based on periods from one day to six weeks (x-axis). For
example, the one-day period only used the most recent one-day sequence for the churn prediction. Six
weeks is the maximum length of the sequence in our dataset. The y-axis indicates the relative
performance compared to the best result (i.e., the result of the six-week sequence).

 Results show that recent data carries a significant amount of information for churn prediction even
within a short span of time. For example, with just one day's data, the predictive performance achieved is
approximately 80–85% that of using the entire period, depending on the performance metrics. However,
we did not observe a sweet spot in our setting. This suggests that, while the marginal improvement
decreased, incorporating a longer sequence―by extending the current six-week period to a longer
period―could lead to better performance. Managerially speaking, our results suggest that it is worthwhile
to incorporate full-length sequences despite their high computational costs. Additionally, it is
recommended that the company test sequences of longer periods to improve predictive performance.
Since our analysis focused on loyal game customers, even a small improvement could lead to an
enormous sales improvement (see Appendix C).

5.3.3 Savings in Human Labor and Time from Automated Feature Learning

This section demonstrates how our automated approach can help reduce labor costs in feature learning
processes. A few years prior to our study, the game company sponsored a global data science competition
via a well-known platform and opened their dataset for feature engineering and churn prediction. Due to
the large prize, more than 500 teams (ranging from professional data scientists to graduate students)
participated in the contest. During four months of competition, 93 teams submitted their results. The
majority of teams consisted of three to five team members working closely together. Notably, top-ranked
teams participated in the competition full-time, providing high quality outputs that were representative of
real-world efforts. For example, one of the top teams (consisting of five members) reported in an
interview that they spent long hours every day for four months to manually make and test more than
2,000 features. This immense effort reflects a common reality for data scientists, so we used the
characteristics of the competition task, the same period of four months and an equal number of data
scientists, to estimate real-world labor costs in business. We also assumed that the labor cost of workers is
equivalent to that of intermediate-level data scientists in the US (Gil Press, 2019).

 Top-ranked models typically need additional modifications before being applied to real-world systems;
otherwise, they are often rejected due to quality or computational issues (e.g., the Netflix Grand Prize). Often
firms continuously iterate with a cycle of hypothesis building, feature engineering, modeling, and testing
to gradually improve their analytics systems at a steady pace. Our approach dramatically shortens this
lengthy process. Our model took 30 hours to train and achieved better performance than the manual
feature sets developed by the global company. However, this does not mean that our model can solve all
