Prediction of the future trend of e-commerce in Sweden
Prognostisering av trender inom e-handel i Sverige
Engström, Freja & Nilsson Rojas, Disa

Abstract—In recent years more companies have invested in electronic commerce as a result of more customers using the internet as a tool for shopping. However, the basics of marketing still apply to online stores, and thus companies need to conduct market analyses of customers and the online market to be able to successfully target customers online. In this report, we propose the use of machine learning, a tool that has received a lot of attention and positive affirmation for its ability to tackle a range of problems, to predict future trends of electronic commerce in Sweden. More precisely, we predict the future share of users of electronic commerce in general and for certain demographics. We build three different models: polynomial regression, SVR and ARIMA. The findings from the constructed forecasts were that there are differences between different demographics of customers and between groups within a certain demographic. Furthermore, the results showed that the forecast was more accurate when modelling a certain demographic than the entire population. Companies can thereby possibly use the models to predict the behaviour of certain smaller segments of the market and use that in their marketing to attract these customers.

Abstract (translated from Swedish)—In recent years many companies have invested in electronic commerce, also called e-commerce, as a result of individuals in society using the internet as a tool to a greater extent. The fundamentals of marketing still apply to web-based stores, and companies therefore need to carry out market analyses of potential customers and the internet market in order to launch strong marketing campaigns. In this report we propose the use of machine learning, a tool that has recently received a lot of attention for its ability to handle various data-related problems, to forecast future trends for e-commerce in Sweden. More precisely, the future share of e-commerce users is forecast, both in general and for individual demographics. We implement three different models: polynomial regression, SVR and ARIMA. The results of the constructed forecasts show that there are clear differences between different customer demographics and between groups within a given demographic. Furthermore, the results showed that the forecasts were more accurate when modelling a given demographic than the whole population. Companies can thereby possibly use the models to predict the behaviour of certain smaller segments of the market.

The internet has grown increasingly popular during the last decades, and technology is developing to be more convenient to use. With smartphones, computers, and constant connection to the internet we can search for information, talk to people across the world and buy goods and services whenever and wherever. As the number of internet users increases, so does the number of online shoppers and companies' interest in electronic commerce. However, not all people feel comfortable on the internet, and some are more reluctant to use the services it provides. For companies to understand customers and their behaviour while being able to specialize and personalize marketing, they need an understanding of the future market of electronic commerce.

Index Terms—ARIMA, demographics, electronic commerce, e-commerce, machine learning, market analysis, polynomial regression, segmentation, support vector regression.

I. INTRODUCTION

TODAY'S society is driven towards technological solutions for efficiency, effectiveness, simplicity etc., and the internet provides more and more services for everyday life. Online shopping has therefore grown steadily over the last couple of years as more consumers discover the benefits of online transactions. The consumer is no longer bound by store opening hours, can access the store from anywhere, and is given more choices and information as well as the possibility of easily comparing products, to mention some advantages [1].

Companies are not far behind in discovering the benefits of electronic commerce, also called e-commerce, to reach customers. The trend among companies within the EU is that e-commerce represents a growing part of the annual turnover [2]. While this may be the case for many companies, there are also companies with little to no revenue from e-commerce. However, because of emerging technologies and changing consumer behaviour, the market is changing. Companies need to learn how to take advantage of the opportunities the internet creates and the new ways of interacting with consumers, especially those companies not yet familiar with e-commerce.

In general, companies need to know the consumers and where to find them. Different groups in society will respond to a change from physical stores to online shopping in different ways. Furthermore, marketing strategies differ between physical and online stores [3], and as the online market becomes increasingly competitive when more companies enter the market, knowledge of the consumer becomes critical for success [1]. Thus, companies benefit from predicting the shopping behaviour of the consumer. Information about whether the consumer is using the internet for shopping can be used as the basis for selecting communication channels and can help improve marketing campaigns and make them successful.

Last year's pandemic affected the world in many ways, and one consequence was a shift in consumers' attitude towards online shopping. One effect of lock-downs and restrictions was that groups in society which, prior to the pandemic, were quite unfamiliar with online shopping and
the internet in general, had to get familiar with the internet and start to explore the possible benefits of online stores and other internet services. One example is senior citizens: in Sweden, one out of ten tried online shopping for the first time during 2020 [4]. Furthermore, Sweden experienced an increase in e-commerce, as the annual sales revenue from e-commerce grew by 40% during 2020 [5].

A. Problem

One problem closely connected to commerce and e-commerce is for companies to find suitable communication channels to reach the consumer. The solution to this problem is often market analysis and segmentation of the market.

When creating and designing a product, companies have one or more specific groups of consumers in mind. Thus, it becomes essential to find those consumers, and likewise to know how to reach them. A market analysis can be conducted in various ways, and one approach is to attempt to predict the future shopping behaviour of consumers. By doing so, companies can better understand how and where to launch new products for them to be successful with the consumer. In the case of this report, the behaviour subject to prediction is online shopping. That is, companies need to be able to predict the rate at which different groups of consumers shift to online shopping, which groups have already transitioned to e-commerce and which groups will take more time to transition.

To make this prediction of consumer behaviour, the proposed approach consists of machine learning algorithms and regular regression. The reason for implementing multiple models is that prediction models are difficult to evaluate due to the lack of future data to compare to. By implementing multiple models it is possible to compare the predictions and thereby evaluate the performance of each model.

One of the main challenges in building an accurate model is the limitations of the available data. In the case of this report, the data consist of a short time series, which makes the data sparse.

B. Purpose

The report will investigate mainly two aspects of consumers and e-commerce in Sweden: what general trends can be identified, and what differences between demographic groups can be seen in their approach towards online shopping.

The result of the investigation can then be used as a basis for market analysis. Companies involved in e-commerce or attempting to enter the online market can use the findings to increase their knowledge about which demographic groups are potential consumers. The results could also be used to improve the understanding of the consumer and how to target them.

1) Scientific questions:
• What differences in consumer behaviour are seen when comparing different demographic groups (geographic location, education, age and work sector)?
• What future trends in e-commerce can be identified?
• What effects on online shopping can be seen from the perspective of the 2020 pandemic?
• What differences can be identified between different forecasting models when applied to a limited time series?

2) Hypothesis: The hypothesis is that, based on the given historic data regarding e-commerce, it is possible to make well-founded predictions that can be used as a basis for market analysis.

C. Stakeholders

The expected outcome of the report is multiple forecasts of the future trend of online shopping in Sweden, in general and over different demographic groups.

This is of interest to all companies that conduct business where the consumer is an end-customer, regardless of whether the company is operating from physical or online stores today. Companies need to be where the customers are, and the general trend among companies should be to meet the customers on the internet. Few companies are completely unaffected by the transition towards everyday use of the internet in daily life. Lack of execution or strategies for using the internet to reach customers will likely result in the company losing significant market share. Thus, companies can benefit from the findings of this report to increase their understanding of the consumers, as well as by using the information about possible trends when targeting customers.

D. Sustainability

The report and its findings have the potential to contribute to several of the UN's goals of sustainable development: no. 9 "Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation", no. 11 "Make cities and human settlements inclusive, safe, resilient and sustainable" and no. 12 "Ensure sustainable consumption and production patterns", to mention some. [6]

All the above-mentioned goals can be related to sustainable industry and sustainable consumption, which can be facilitated by e-commerce. Consumers no longer have to partake in unsustainable ways of travel to reach stores; instead, they can go online. Therefore, even if the distribution of bought goods is done via less sustainable ways, the net effect of greenhouse gas emissions is reduced. Also, previous shopping areas, malls, large parking lots and other areas built for stores can be used for better purposes, and the need for deforestation to build new such areas disappears.

II. BACKGROUND

A. Market Analysis

For a company to create successful marketing strategies, understanding the market in terms of the consumer, competitors, distributors and suppliers etc. is essential. To gain this understanding, there are many tools available (for example SWOT, PESTLE, Ansoff-matrix) that help the company analyse the internal and external environment as well as the consumer. When the market analysis has been conducted, the company can build on the discovered
background knowledge to adjust the business idea and formulate a business strategy, where the goal is to appeal to the consumers to make them paying customers. The consumer is often identified by consumer analysis and market segmentation. When the market segment is known, the company can create and launch efficient marketing campaigns specific to the segment. [7]

1) Segmentation: The goal for companies is to offer a product to meet customer demands. However, the problem is that there are too many customers for one company to cater to each of their individual needs. The solution is therefore to cluster customers with similar demands and process the demands of the cluster instead, i.e. segmentation. After a conducted market segmentation, the company can use strategies to analyse the segments in terms of growth, competitors and size to determine which segment or segments to focus on. The above-mentioned factors all influence the profitability of the segment and how to best target consumers.

Furthermore, when the target groups of customers have been identified via segmentation, the company can conduct further consumer analysis to gain an even better understanding of the customer and segment. The choice of communication and platform to target the consumer are key factors for successful marketing. Therefore, the company needs to explore which platform is best suited and how to communicate the message of the product to the consumers. [7]

2) Forecasts: There are also certain difficulties connected to forecasting in general which need to be considered when companies attempt to predict the market or consumer behaviour. First, a suitable period should be chosen: the longer the time span of the forecast, the more uncertain it gets. Second, limitations of data, time or resources also add constraints to the forecast. If any of the mentioned variables are scarce, one should preferably use time series analysis, which models time as the independent variable. Thus, the forecast is limited to showing correlations and trends. However, for the purpose of the report, correlations and trends are just what is needed. Third, complex methods are not necessarily preferred over simpler ones. Simple models are less sensitive to inaccuracies in the data or the model compared to more complex ones. However, a model can also become too simple and overlook important factors or miss correlations. Thus, there is a trade-off between simple and complex models to be considered. [8]

B. Polynomial Regression

Regression is one of the most commonly used tools for forecasts and identifying correlations between variables. The most fundamental approach is simple linear regression, which can model linear predictions and identify the relation between input values, x, and the target values, y, as a straight line. The more general model is nonlinear or polynomial regression, which can model more complex relations. The model allows one to fit an equation of the k:th order to the data:

$$Y = b_0 + b_1 x + b_2 x^2 + \dots + b_k x^k + \varepsilon \tag{1}$$

With $b_0, \dots, b_k$ being the coefficients and $\varepsilon$ the random error, which is assumed to be independent of the input values, x, and normally distributed according to $\varepsilon \sim N(0, \sigma^2)$ [9].

By increasing the order k, the equation can fit the training data better. Finding the optimal degree of the polynomial can be done by Forward Selection, where k is increased until the optimal value is found. The coefficients can also be optimized, which makes it possible to get the equation that fits the data well. The optimization of the degree of the polynomial and the coefficients can be done in multiple ways, but the most straightforward approach is to minimize the error measured as Mean Square Error (MSE). [10] See equation (9) in Appendix B.

C. Support Vector Regression

Support Vector Regression (SVR) uses the idea behind Support Vector Machine (SVM) and applies it to regression problems (see Appendix VIII-B for SVM). The model can also be used for the prediction of unseen data. To solve non-linear regression problems, the technique of mapping the non-linear data with the kernel trick is used, just as for a regular SVM. Given data points $(x_1, y_1), \dots, (x_n, y_n)$, the goal of the model is to find the optimal function to map the input values, x, to the corresponding target, y.

The main difference between SVR and SVM is the loss function used, which has to be changed to solve regression problems. There are a few different loss functions, such as the Laplacian, Huber's, Gaussian and $\varepsilon$-insensitive loss functions. Among these, the $\varepsilon$-insensitive loss function is the most commonly adopted. [11] See equation (10) in Appendix B.

The loss function is equal to 0 only if the data point is within the pre-determined radius of the predicted line. Since the loss function is subject to minimization, the model will seek to avoid any data points outside the radius when finding a solution. Thus, the optimal solution will have the majority of the data points within the radius. [11] See Figure 4 in Appendix A for an illustration of a simple SVR.

The generic function of SVR with the kernel function can be written as $f(x) = w \cdot \varphi(x) + b$. The most common kernels are the Gaussian RBF, $\kappa(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)$, and the polynomial kernel, $\kappa(x, x_i) = (x_i^T x + c)^p$, with $\sigma$ and $p$, respectively, the parameter to tune. The minimization of the regression risk can thus be expressed as:

$$R_{reg}(f) = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) \tag{2}$$

Subject to:

$$\begin{cases} y_i - w \cdot \varphi(x_i) - b \le \varepsilon + \zeta_i \\ w \cdot \varphi(x_i) + b - y_i \le \varepsilon + \zeta_i^* \\ \zeta_i^*, \zeta_i \ge 0, \quad \text{for } i = 1, \dots, n \end{cases} \tag{3}$$

Where $R_{reg}$ is the function to minimize, with the constraints defined above. The variables to tune, which is done by the user, are $\varepsilon$ (the radius or allowed deviation), $C$ (the penalty on slack, i.e. on training error) and $\kappa$ (the kernel function). [12], [13]
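
As a rough illustration of equations (2) and (3), the sketch below computes the ε-insensitive loss, the Gaussian RBF kernel and the regression risk for a given set of predictions. It is only a sketch under stated assumptions: the function names, the sample values and the use of plain Python are hypothetical and not taken from the report. It relies on the fact that, for fixed predictions, the smallest slack values satisfying the constraints in (3) add up to the ε-insensitive loss of each point.

```python
import math


def rbf_kernel(x, x_i, sigma):
    """Gaussian RBF kernel for scalar inputs: exp(-(x - x_i)^2 / (2 * sigma^2))."""
    return math.exp(-((x - x_i) ** 2) / (2.0 * sigma ** 2))


def epsilon_insensitive_loss(y, f_x, epsilon):
    """Zero inside the tube of radius epsilon, linear in the excess outside it."""
    return max(0.0, abs(y - f_x) - epsilon)


def regression_risk(weight_norm_sq, targets, predictions, C, epsilon):
    """0.5 * ||w||^2 + C * (sum of slacks), with slacks at their minimal feasible values."""
    slack_total = sum(
        epsilon_insensitive_loss(y, f_x, epsilon)
        for y, f_x in zip(targets, predictions)
    )
    return 0.5 * weight_norm_sq + C * slack_total


# Hypothetical numbers, chosen only to show the behaviour of the loss.
targets = [1.0, 2.0, 3.0]
predictions = [1.2, 2.0, 4.5]  # the last point lies outside the tube

for y, f_x in zip(targets, predictions):
    print(y, f_x, epsilon_insensitive_loss(y, f_x, epsilon=0.5))

print(regression_risk(2.0, targets, predictions, C=10.0, epsilon=0.5))
```

Only the third point contributes to the risk, because the first two deviate from their targets by less than ε. In scikit-learn, which the report names among its libraries, `sklearn.svm.SVR` exposes the same tuning variables as the parameters `C`, `epsilon` and `kernel`, with `gamma` controlling the width of the RBF kernel.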

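The forward selection described for polynomial regression in Section II-B can be sketched in a similar way. This is a minimal sketch, assuming NumPy is available and assuming a simple hold-out split with a tolerance-based stopping rule; the function name `forward_select_degree`, the tolerance and the synthetic series are hypothetical and not taken from the report. The degree k is raised one step at a time, and the search stops once the validation MSE no longer improves.

```python
import numpy as np


def forward_select_degree(x_train, y_train, x_val, y_val, max_degree, tol=1e-9):
    """Raise the polynomial degree k until the validation MSE stops improving."""
    best_degree, best_mse = None, np.inf
    for k in range(1, max_degree + 1):
        # Least-squares fit of a polynomial of order k to the training points.
        coeffs = np.polyfit(x_train, y_train, deg=k)
        mse = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
        if best_mse - mse <= tol:
            break  # no meaningful improvement: keep the previous degree
        best_degree, best_mse = k, mse
    return best_degree, best_mse


# Synthetic, noise-free quadratic series (hypothetical, not the survey data).
x = np.arange(10, dtype=float)
y = 2.0 + 0.5 * x + 0.3 * x ** 2

# Keep the time order: fit on the first 7 points, validate on the last 3.
degree, mse = forward_select_degree(x[:7], y[:7], x[7:], y[7:], max_degree=4)
print(degree, mse)
```

The MSE is measured on later, held-out points rather than on the training points, since the training MSE can only decrease as k grows and would always favour the highest degree.
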
D. ARIMA

Autoregressive Integrated Moving Average (ARIMA) is a statistical analysis model which uses time-series data to predict future trends. An ARIMA model is, by standard, characterized by three parameters: p, d and q.

p represents the order of the Auto Regressive term or the lag order, also known as the number of lag observations in the model. q is the size of the Moving Average window. The last parameter is d, which is the number of times the observations are differenced. When d = 0, the series is already stationary. [14]

In the Auto Regressive model (AR), the function is, as previously mentioned, only dependent on its lags and the equation looks as follows:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t \tag{4}$$

Where $\alpha$ is the intercept term and $\beta_i$ is the coefficient of lag $i$ that the model estimates.

Similarly, in the Moving Average model (MA), the function is dependent only on the lagged forecast errors.

$$Y_t = \alpha + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_q \varepsilon_{t-q} \tag{5}$$

The error terms are the errors of the auto-regressive models (see above) of the respective lags. For example, the error $\varepsilon_t$ is the error of the following equation:

$$Y_t = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_0 Y_0 + \varepsilon_t \tag{6}$$

So, the ARIMA model combines these two terms and differences the series to make it stationary, i.e. the equation can be written as:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_q \varepsilon_{t-q} \tag{7}$$

To use the model, the hyperparameters need to be determined. The purpose of d is to make the time series stationary. However, there is also a risk of over-differencing, which will affect both the parameters and the outcome. To find the correct value of the parameter it is important to identify the minimum order of differencing that gives a relatively stationary series, where the autocorrelation function (ACF) plot reaches zero quickly and the mean roams around a set value. [14]

To determine the order of the parameter p (AR term) it is important to identify whether the model needs it, and then determine it by using the partial autocorrelation function (PACF) plot. PACF finds the correlation of the residuals with the next lag, unlike ACF, which finds the correlation of the present value with past lags. The value of p is then determined to be the lag value where the PACF plot crosses the upper confidence level for the first time. [15]

The order of q (MA term) is obtained from the ACF plot. The information is found where the plot crosses the upper confidence level and its corresponding lag value. Generally, it is preferable to set the parameters conservatively to avoid over-differencing. To fix potential over- or under-differencing it is possible to add either an additional MA or AR term. [15]

When the ARIMA model is built it can be favourable to optimize it using out-of-time cross-validation. This is done by stepping back a number of time steps, fitting the model on the data before that point, and forecasting the same number of steps forward. These forecasted points are then compared to the actual values for that period. See Figure 5.

As the different parameters are selected for the model, the prediction will look similar to Figure 6. Additional fine-tuning can then be done to match the predicted values to the observed values more precisely.

E. Data set

The data consists of the annual survey done by Internetstiftelsen, Svenskarna och Internet, which is a national survey about the use of the internet in Sweden. The data from the reports are aggregated by the possible responses to each question. One of the questions in the survey concerns online shopping behaviour. That is, the data shows the share of users of e-commerce in Sweden in a specific year.

The survey also provides multiple demographic variables, where the ones chosen for this report are geographic location, education, age and work sector. The data spans from the year 2000 to 2020 and from 2015 to 2020.

F. Prediction of very short Time series

The dilemma of insufficient data is one of the most common problems encountered when attempting to model a forecast or build any model based on historic data. In theory, the number of data points needs to be greater than the number of parameters of the model. However, the number of data points needed is often a lot more than what the theory suggests. [16]

In the best scenario, there is enough data to both train and test the model on unseen data. This is not possible with limited data. One way to increase the available training data is n-fold cross-validation. Because the model's purpose is forecasting, the training data should always come before the test data when put on a timeline, which adds a restriction. One method of cross-validation which can be used and takes the restriction into account is rolling-origin-recalibration evaluation.

1) Rolling-origin-recalibration evaluation: By taking a small subsample of the available data as training data, the next sample or subset can be the training subset's test data. At the next step, the test data is added to the training subset and the following data point or subset is the new test data. This way of dividing the data into training and test data can be repeated throughout the available data. The accuracy of the model will be the average of the accuracy computed at each step [17].

G. Related Work

Multiple studies have been done to examine SVR and the model's capabilities. The majority of studies have shown
positive results regarding forecasting models based on the algorithm, thus the general conclusion is that the model performs satisfactorily. SVR has been used to forecast different time series such as the stock market [11], [18], energy consumption [19] and other time series [20], [21].

As mentioned by U. Thissen et al., SVR has multiple advantages making the model attractive to use, especially the model's generalization capabilities combined with the ability to always find a unique and sparse solution which is also the global solution [20]. The model can also be used for data with multiple features, which is shown in the study regarding energy consumption [19]. Furthermore, the same article uses n-fold validation when training the model, which indicates that handling limited data and hyperparameter selection with n-fold validation is a suitable approach for SVR.

In the 1998 study "The use of ARIMA models for reliability forecasting and analysis", Ho and Xie conclude that the model is very flexible and gives statistically accurate predictions, a view that has since been accepted given how widely the model is used for forecasting today [22]. This is supported in the Babai et al. 2013 article, where they forecasted the demand in a two-stage supply chain. The publication reaches the same conclusion that the model is both efficient and accurate. This study is also more relevant seen from the perspective of this report, as it also investigates future trends [23].

In the 2016 article "Gold Price Forecasting Using ARIMA Model", Guha and Bandyopadhyay use ARIMA for predicting the price of gold. This article is, like the one previously mentioned, more in line with what this report will investigate, which gives insight into both opportunities and limitations of the model. The article explains how the model is applied and its limitations, one of which is that it might be more suitable for predicting values in the near future, something that was not mentioned in earlier reports but is important to take into consideration. [24]

III. METHOD

The study consisted of data pre-processing, implementation and parameter optimization for each model, and lastly prediction and evaluation of the models. The models implemented

B. Implementation

The models are built in Python using standard libraries such as NumPy, Scikit-learn, and Matplotlib.

Each model was optimized by iteratively changing the values of the hyperparameters. For each model, MSE was computed and the optimal model was defined as the one with the lowest average MSE score. When the optimal model was found, the values of the hyperparameters were noted and the model was used to predict the future share of e-commerce users. This was repeated for the different data sets.

The forecasts and computed MSE were compared to each other as a method of evaluation, since a forecast of the future cannot be evaluated against actual values. To identify differences within demographics, multiple predictions were made, one for each demographic group.

Additionally, the accuracy of the models was computed to further evaluate how well the different approaches to the problem fit the historic values and forecasted the future. The accuracy was calculated as:

$$\text{PercentageError} = \frac{\text{Actual} - \text{Forecast}}{\text{Actual}} \times 100 \tag{8}$$

Some of the demographic groups had little data or missing data points. This was dealt with by excluding the groups for which the model did not have sufficient data to make a forecast.

1) Polynomial Regression: By implementing forward selection starting at an equation of order k = 1 (assuming the data to be non-linear), the degree of the polynomial was increased iteratively. The maximal possible order of the polynomial is n − 1, where n is the number of data points. At each step, MSE was computed.

2) Support Vector Regression: The model used the ε-insensitive loss function to compute the cost function and the Gaussian RBF kernel function for mapping data to a higher dimension. The values of the hyperparameters C and ε (the allowed deviation) were optimized by systematically testing different values and computing MSE.

3) ARIMA: Initially, the PACF and ACF plots were used to delimit the possible parameters. This guideline was then used to optimize the implementation of the model by testing
were polynomial regression, SVR and ARIMA.
                                                                      different values of the three hyperparameters. The model was
   The findings, that is the forecasts, of the different models,      then optimized by using out-of-time cross-validation.
were evaluated in terms of being suitable as a part of a market          The accuracy was computed as MSE over the last step.
or consumer analysis.
                                                                      C. Limitations
A. Data Pre-processing                                                   The data set could be considered small since it consisted
                                                                      of 20 data points at the most. To handle the problem with
   The data from Internetstiftelsen was transformed into a            limited data, rolling-origin-recalibration evaluation was used.
format compatible with the models and modelled as a time              The data was divided into smaller subsets with only a few data
series. x(t), t = 1, 2, ..., n with each value of x, year, having     points in each subset. The general forecast had 4 subsets of
a corresponding target value, y, percentage of users.                 5 points, while the different demographics had a test set that
   Further processing of the data was made for the different          consisted of the 2 last data points and the prior data points
representations of the data, where each representation repre-         made up the training set. MSE was computed as the average
sents a certain demographic.                                          MSE.
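   The degree selection and rolling-origin evaluation described in the
method can be illustrated with a minimal sketch. It is not the
implementation used in this report: the function names, the stopping
tolerance `tol`, the expanding one-step-ahead window and the synthetic
data are all assumptions made for illustration, and the polynomial fit is
a plain least-squares solve written with the Python standard library
only. The helper `percentage_error` follows the usual definition of
percentage error relative to the actual value.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python)."""
    m = degree + 1
    # Normal equations for the design matrix A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(m)] for r in range(m)]
    aty = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for r in range(col + 1, m):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, m):
                ata[r][c] -= factor * ata[col][c]
            aty[r] -= factor * aty[col]
    # Back substitution; coeffs[j] multiplies x ** j.
    coeffs = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(ata[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (aty[r] - tail) / ata[r][r]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** j for j, c in enumerate(coeffs))


def rolling_origin_mse(xs, ys, degree, min_train):
    """Average one-step-ahead squared error, refitting at every origin."""
    if min_train <= degree:
        raise ValueError("need more training points than the polynomial degree")
    squared_errors = []
    for origin in range(min_train, len(xs)):
        coeffs = fit_polynomial(xs[:origin], ys[:origin], degree)
        squared_errors.append((ys[origin] - predict(coeffs, xs[origin])) ** 2)
    return sum(squared_errors) / len(squared_errors)


def forward_select_degree(xs, ys, max_degree, min_train, tol=1e-3):
    """Start at degree 1 and raise the degree while the MSE keeps improving."""
    best_degree = 1
    best_mse = rolling_origin_mse(xs, ys, 1, min_train)
    for degree in range(2, max_degree + 1):
        mse = rolling_origin_mse(xs, ys, degree, min_train)
        if best_mse - mse <= tol:  # assumed stopping rule, not from the report
            break
        best_degree, best_mse = degree, mse
    return best_degree, best_mse


def percentage_error(actual, forecast):
    return (actual - forecast) / actual * 100


if __name__ == "__main__":
    # Synthetic, made-up series for illustration (not the survey data).
    years = list(range(1, 13))
    shares = [0.5 * t * t + 10.0 for t in years]
    print(forward_select_degree(years, shares, max_degree=3, min_train=6))
```

   In this sketch the training window grows by one point at each origin,
which is a simplification of the subset scheme described under
Limitations; a fixed-size window or a two-point test set could be
substituted by changing the slicing in `rolling_origin_mse`.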

   The forecast was set to a time horizon of two years ahead, 2021 and
2022. The choice of such a short forecast was based on the high degree of
uncertainty concerning the future. For a field like the internet, which
is characterized by rapid change in technology, one can assume the future
of e-commerce to be more uncertain than that of other, historically more
stable fields. A short time horizon was therefore chosen to limit the
amount of uncertainty.
   Furthermore, the models did not consider the fact that the population
percentage cannot exceed 100%, and thereby gave predictions of values
over 100%. These values could be produced because of the simplicity of
the models, but cannot occur in reality. For cases where the predicted
value was over 100, we disregarded the actual value and instead
acknowledged the trend as strongly increasing and being close to 100%.
All values were however included in the graphic representations.

                         IV. RESULT
   Below we highlight the most interesting findings. As the data concerns
the share of users of e-commerce, all predicted values represent the
share of users for a specific year. For values in tables and graphs over
the forecasts, see Appendix IX.
   The general forecast for all three models showed a declining trend,
see Table I below. The models also forecast the elderly and people living
in the countryside to increase their use of e-commerce over the next few
years.

           General Prediction 2021-2022 (share of users, %)
           Year      Polynomial Regression    SVR    ARIMA
           2021      81                       83     84
           2022      77                       77     79
                          TABLE I
                 MODELS FOR YEAR 2021-2022

   The removal of the data point of 2020 showed little impact on the
forecasts. For the years 2021 and 2022, the change in training data
mostly affected the slope of the forecast. The predicted values for the
coming years are quite similar with or without the last data point for
all forecasts.

A. Polynomial Regression
   The general forecast showed a declining trend over the next years,
where fewer consumers will use online shopping. The forecast decreases by
10 percentage points over the years 2020-2022, measured as the difference
between the actual value of 2020 and the forecast of 2022.
   Furthermore, it is clear how the model is influenced by historical
data and the degree of the polynomial. The general model is of the second
degree, and when visualizing the forecast the curves of a second-degree
polynomial are visible (see Figure 1). Likewise, the majority of the
demographic forecasts are of the first degree, which can be seen in the
straight lines of those forecasts. Some forecasts also gave unrealistic
results with values over 100% during the training phase, which can be
seen as a limitation that comes with the simplicity of the model.

Fig. 1. Forecast over all data points for the regression model. The
second-degree curves are clearly visible.

   Removing the data point of 2020 from the training data made the slope
of the prediction steeper, that is, the trend declined faster. See
Figures 7 and 8. However, the value for 2022 only differed by 1% between
the two forecasts (see Table III in IX-A).
   The majority of the forecasts over the demographics showed a stable or
slightly increasing trend. Only one group, Finans from the work
demographic, showed a strong declining trend with a drop of about 10
percentage points.
   MSE for the model differed highly between the data sets. The
predictions over demographic groups had a lower MSE compared to the
general model. This is probably because the general model had more data
points to measure. The percentage error was also quite varied, see Table
II in IX-A.

B. Support Vector Regression
   The general forecast showed a downward trend which is quite steep.
Over the years 2020-2022 the share of users decreases by 10 percentage
points, measured as the difference between the actual value of 2020 and
the forecast of 2022.
   The forecasts over the different demographics are quite different from
the general trend, since they all have a more or less straight trend for
the years 2021-2022. Some of the forecasts showed a slight increase or
decrease over the period 2020-2021 but then a stable trend (see Figure
2).

Fig. 2. Forecast over the segment based on age. The future trends for all
subgroups are straight or slightly increasing.

   In general, the model does not fit the training data very well, see
Figures 13 and 14. The forecasts for the general model done during the
training phase are not close to the actual

values, which can also be seen in the high MSE of the model and in the
large difference between predicted and true values (see Table V in IX-B).
However, the MSE scores for the demographic groups are much lower and
some values can be considered satisfactory. Similarly, the percentage
error of the model differs quite a lot when comparing the general
forecast to the average demographic forecast.
   Moreover, the removal of the data point of 2020 had little influence
on the forecast and gave results in line with the results from the
regression model. The forecasted values before 2020 were the same even
after removing the data point. Likewise, the predicted values after 2020
showed almost no difference compared to the forecast with all data
points. The value predicted for 2022 was the same and for 2021 there was
only a 1% difference between the predictions.

C. ARIMA
   In general, the ARIMA model predicted a negative trend, as seen in
Figure 18. With the removal of the data point of 2020, the forecast
behaved similarly to the forecast with the data point. However, the
declining trend was not as steep, see Figure 20.
   In contrast, the results from the different demographic groups
differed, as some predictions increased while others decreased. As seen
in Figure 3, the predicted values for people living in the countryside
were lower in the year 2021 compared to 2020. This value is expected to
rise the following year. In the same graph, the prediction for those
located in a city shows the opposite, as the percentage is expected to
first rise in 2021 and then fall the following year.

Fig. 3. Forecast over the segment based on location of living. The
forecast of those living in the countryside is seen increasing in 2022.

   Another demographic group for which the prediction is a percentage
increase is the group in the education demographic that has only finished
Grundskola. The model predicted higher percentages for both 2021 and
2022, which can be seen in Figure 23.
   Most age groups showed a relatively stable forecast with values close
to the ones from previous years. However, 76+ showed a decrease in
percentage for the year 2021, which then rose again in 2022, similarly to
the forecast for those living in the countryside. See Figure 21.
   In general, the model produced forecasts which fit the existing data
points well. However, as seen in Table V in IX-C, some demographics were
less accurate, as the MSE was quite high relative to other values. For
example, some subdivisions in the sector demographic had a much higher
MSE than others in the same group (see the MSE for Annan vs
Skola/Utbildning). Additionally, the percentage error of the general
forecast, as well as the different demographics, showed small errors from
the historic data points. See table IX-C.

                         V. DISCUSSION
   The different models resulted in quite different forecasts, in some
cases even contradicting each other. The varied results illustrate the
uncertainty of prediction models and how the forecast is dependent on the
chosen model.
   The investigation and its results did offer predictions for the future
of e-commerce which can be used as a basis for market analysis. However,
as the three models produced deviating results, it is quite difficult to
do a market analysis based only on the produced results. Instead, it
might be possible for companies to draw some conclusions regarding a
possible behaviour, but the results are not strong enough to base the
market analysis on alone. Furthermore, the models all show a higher
accuracy when investigating a smaller group, which indicates that the
models are better suited for forecasting specific groups’ behaviour.

A. Effects from the pandemic
   Looking at the data set on which the models are built, it is clear
that during the height of the pandemic in 2020 e-commerce dropped by
several percentage points, something that was quite unexpected given the
circumstances. As local restrictions meant that many people refrained
from shopping in physical stores and many malls and stores decreased
their opening hours, it was reasonable to think that consumers might have
turned towards e-commerce. However, as previously mentioned, this was not
the case, which leads us to other possible explanations as to why
e-commerce decreased in 2020.
   One natural explanation is the fact that many people lost their jobs
or were laid off short-term, which meant less income to dispose of.
Another reason is that online purchases of services (tickets, travel
etc.), a product group that previously made up a large portion of total
e-commerce, decreased drastically as events were cancelled and travel was
largely restricted both nationally and globally. Furthermore, the data do
not include the frequency of online shopping or monetary aspects, which
can result in the forecasts being misleading.
   Thereby, the general prediction, which showed a declining trend, can
be deceiving in the sense that e-commerce for certain customer groups and
certain product groups increased. Individuals already familiar with
online shopping increased the number of purchases done via the internet,
while other groups of individuals decreased their overall purchases and
accordingly also decreased their online shopping.

B. Demographic groups
   The different models sometimes resulted in contradicting forecasts for
a certain demographic, which makes it hard

to draw solid conclusions about the future trend for that demographic.
However, the forecasts also seemed to agree on the trend of other
demographics, for example, age.
   We can see that the general finding for the elderly in Sweden is an
increase in e-commerce, since the three groups 56-65, 66-75 and 76+ all
showed increasing trends. Thus, these groups will likely be more present
on the internet in the future. This is probably an effect of the
pandemic, during which individuals had to get more comfortable with the
different tools offered by the internet, due to offices encouraging
working from home and restrictions for senior citizens regarding
socializing with others. Another explanation might be that the younger
generation, consisting of people who have already embraced technology and
internet solutions, will in the near future enter the group of elderly in
Sweden. Thereby, the group's general habits concerning the internet will
change accordingly.
   When dividing the data by degree of education, the majority of the
models show an upwards-sloping trend for the group of Grundskola.
Assuming the majority of people with a low level of education belong to
the part of the workforce with lower income, the jobs they occupy are the
ones employers cut first when money becomes tight for the company. Due to
the slow growth in Sweden in 2019 and the pandemic in 2020, many
low-income jobs disappeared, resulting in the declining trend.
   However, the predictions show this trend being turned around. Based on
the estimates of these trends and history, when the economy recovers the
jobs previously cut will be reinstated. As a result of the decrease in
unemployment, people have more income to dispose of and thereby money to
spend on e-commerce.
   Additionally, when investigating the groups of individuals living in
the countryside, the models agree on an upwards trend for the coming
years, something that can be interesting to investigate further. The
different models based on the work sector contradicted each other and
thereby gave inconclusive forecasts.
   One general conclusion which can be drawn from the gathered
demographic forecasts is that the MSE and percentage error are lower
compared to the general forecasts. This indicates that the models perform
better and thus are more accurate when forecasting the behaviour of a
smaller group.

C. The Models
   The majority of forecasts modelled by ARIMA resulted in oscillating
predictions and a low MSE. This can be seen as a result of using a
complex model to predict a future with a lot of uncertainty in the form
of unpredictable variables. On the other hand, the regression model was
often built on a low degree of the polynomial and therefore resulted in
models with good generalization capabilities. However, the nature of the
low-degree polynomial also made it possible for the forecasts to predict
unrealistic values.
   The optimal model of the three implemented to forecast consumer
behaviour for marketing purposes is hard to determine. However, the
findings from the report can offer some guidelines and features of the
models to be considered when choosing a model.
   First, the complexity of the behaviour subject to prediction decides
what model is suitable. A more complex model will optimize itself to
historical data, so past trends have a large influence on the shape of
the predicted trends. A simple model is more general, but faces the risk
of being too general and thus failing to model important relations.
   Second, the data available has to be considered. The regression model
is bound by the constraint on the degree of the polynomial, that is,
there is a limit on the degree which limits the possible relations the
model can portray. ARIMA has a minimum number of data points needed for
training, which adds a constraint to the data set being used as training
data. SVR, on the other hand, does not have the same constraints
concerning training data, but has many possible hyperparameters to
fine-tune, which can be time-consuming.
   Regardless of the model chosen as a tool for creating marketing
strategies, the results have to be used with caution and rationality, as
all models come with risks of not being truly representative and are
flawed in some way.

D. Further Improvements and Research
   The report investigated the share of users of e-commerce, and one
interesting aspect to research further would be the annual turnover of
e-commerce. The turnover can be a good complement to this report, as the
number of users alone cannot show the whole picture of a market or
segment. By having information about both the share of potential
consumers and the expected turnover, companies can form a well-founded
market analysis.
   For this report, the data used to build the models was aggregated. One
interesting aspect for further investigation would be to have
non-aggregated data, to create models for more specific groups instead of
the more general ones used in this report. As the study was performed on
the market as a whole, with some larger demographic groups, the result
can mostly be used to get a general picture of the market, and the
underlying trends, for the next few years. For a specific company, it
could be more rewarding to look into its specific target groups, if they
are more delimited.
   Generally, the lack of data is often a problem in these reports, as
the results might be lacking when there are limited observations to take
into consideration. As this report did not handle a large data set, it
could be interesting to further research the future trend of e-commerce
using more data points, and thus possibly produce a more accurate
forecast.
   Moreover, while it is possible to conclude that the models, in
general, perform better on a small group of individuals, the measurements
are not weighted by the number of data points. Since the general models
are based on more data, the lower accuracy can simply be the result of
having more data points to measure. Alternatively, the fewer data points
of the demographics may make it easier for the models to learn the
pattern of the data without truly generalizing. Thus, one improvement to
be made, and a subject for research, would be to gather more data from
the demographics and measure the differences in accuracy compared to the
general models.
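   The forecasts above 100% noted for the models in this report suggest
one possible remedy, sketched below under stated assumptions: fit the
trend on the logit of the share instead of the share itself, so that the
back-transformed forecast can approach 100 but never reach it. This is
not a method used in the report. The function names and the example
values are hypothetical, and a straight line in logit space is only one
of several ways to impose such a bound.

```python
import math


def logit(share_percent):
    """Map a share in (0, 100) to the whole real line."""
    p = share_percent / 100.0
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map any real number back to a share strictly between 0 and 100."""
    return 100.0 / (1.0 + math.exp(-z))


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bounded_forecast(years, shares, future_years):
    """Linear trend in logit space, transformed back to percent."""
    # Every observed share must lie strictly between 0 and 100.
    slope, intercept = fit_line(years, [logit(s) for s in shares])
    return [inverse_logit(intercept + slope * y) for y in future_years]


if __name__ == "__main__":
    # Made-up illustrative shares, not values from the survey data.
    years = [2015, 2016, 2017, 2018, 2019, 2020]
    shares = [90.0, 92.0, 94.0, 96.0, 97.0, 98.0]
    print(bounded_forecast(years, shares, [2021, 2022]))
```

   The same transformation could in principle be placed in front of any
of the three models, since it only changes the target variable; whether
it improves accuracy on the actual data would have to be tested.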

   Concerning the models, specifically the regression model, the
forecasts could be unrealistic (predicting a value over 100%). One
improvement that would give more realistic values would be to put a
constraint on the model so that it approaches the value of 100
asymptotically.

                      VI. CONCLUSION
   Based on the results of the models, the general predictions show a
declining future trend of e-commerce with fewer individuals turning to
online stores. While the results show one thing, it is also important to
take into account the increase in revenue e-commerce experienced during
2020. It would thereby not be wise to only look at the results of this
study, since the interpretation of declining e-commerce might not reflect
reality. Instead, other aspects should be incorporated into market
analysis, such as the total sales revenue.
   The report also shows some evidence of how the results can be
misleading and give an inaccurate picture. When investigating different
demographics, the majority of the forecasts show the opposite trend from
the general forecast, and therefore the general model can be seen as too
generalized and poorly representative. Furthermore, the results from the
models show that there are differences within a certain demographic group
in their approach to e-commerce, which have to be considered by a firm
but are missed in the general forecasts. While some groups in society
have reduced their online shopping, others increasingly used the internet
for their purchases. For example, many elderly people made their first
purchase online during 2020. During the same period e-commerce for people
with Grundskola level of education dropped. This shows that different
groups will react differently to the same events and therefore it is
important to investigate the target audience of a company instead of the
general picture.
   Thus, the conclusion is that while the models can show a possible
forecast, more information is needed to conduct a market analysis.
   As for the use of these models to gain knowledge of the market and use
the information for marketing purposes, the models come with different
pros and cons. The choice of model should be based on the available data
and the trade-off between simplicity and complexity. Furthermore, the
models seem to be more accurate in their forecasts of certain
demographics compared to the forecast including the entire population.
The conclusion is thereby that the models implemented in this report give
more accurate forecasts when modelling a certain, smaller target group.
Companies can thereby use the models to predict the behaviour of certain
segments of the market.

                     ACKNOWLEDGMENT
   We are grateful for the opportunity to research and investigate the
area of e-commerce, which we find interesting and of relevance.
   We thank Internetstiftelsen for providing the data from their annual
survey, Svenskarna och internet, and especially our supervisor Cia
Bohlin.
   We would also like to thank our supervisors from KTH for the support
they have given us, as well as our peers for their reviews.

                     AUTHOR CONTRIBUTIONS
   Disa Nilsson Rojas is currently a BSc student of Industrial
Engineering and Management with a specialization in computer engineering
at KTH. The author contributed mainly to the polynomial regression and
SVR models, as well as part I, IV-VI.
   Freja Engström is currently a BSc student in Industrial Engineering
and Management with a specialization in computer engineering at KTH. The
author contributed mainly to the ARIMA model, as well as part I, IV-VI.

                         REFERENCES
 [1] L. Zhou, L. Dai, and D. Zhang, “Online shopping acceptance model-a
     critical survey of consumer factors in online shopping,” Journal of
     Electronic commerce research, vol. 8, no. 1, p. 41, 2007.
 [2] EuroStat. Share of enterprises’ turnover on e-commerce. [Online].
     Available:
 [3] C. Katawetawaraks and C. Wang, “Online shopper behavior: Influences
     of online shopping decision,” Asian journal of business research,
     vol. 1, no. 2, 2011.
 [4] InternetStiftelsen, “Svenskarna och internet 2020,” Svenskarna och
     internet, pp. 44–55, 2020.
 [5] H. R. Postnord, Svensk digitalhandel, “E-barometern helårsrapport
     2020,” E-barometern, p. 5, 2020. [Online]. Available: https://media.
 [6] U. D. of Economic and S. Affairs. The 17 goals. [Online]. Available:
 [7] L. Mossberg and M. Sundström, Marknadsföringsboken.
     Studentlitteratur, 2011.
 [8] A. Feldmann, “Forelasning 4 - prognoser in me1316,” January 2020.
 [9] A. Agarwal. Polynomial regression. [Online]. Available: https://
[10] Abhigyan. An introduction to support vector regression (svr).
     [Online]. Available:
     understanding-polynomial-regression-5ac25b970e18
[11] C.-J. Lu, T.-S. Lee, and C.-C. Chiu, “Financial time series
     forecasting using independent component analysis and support vector
     regression,” Decision Support Systems, vol. 47, no. 2, pp. 115–125,
     2009. [Online]. Available: S0167923609000323
[12] B.-J. Chen, M.-W. Chang et al., “Load forecasting using support
     vector machines: A study on eunite competition 2001,” IEEE
     transactions on power systems, vol. 19, no. 4, pp. 1821–1830, 2004.
[13] M. Awad and R. Khanna, Support Vector Regression. Berkeley, CA:
     Apress, 2015, pp. 67–80. [Online]. Available: 978-1-4302-5990-9 4
[14] S. Prabhakaran. Arima model – complete guide to time series
     forecasting in python. [Online]. Available:
[15] J. Salvi. Significance of acf and pacf plots in time series
     analysis. [Online]. Available:
     significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8
[16] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and
     Practice. OTexts: Melbourne, Australia, 2018, ch. 12.7.
[17] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and
     Practice. OTexts: Melbourne, Australia, 2018, ch. 3.4.
[18] P. Meesad and R. I. Rasel, “Predicting stock market price using
     support vector regression,” in 2013 International Conference on
     Informatics, Electronics and Vision (ICIEV), 2013, pp. 1–6.
[19] Z. Ma, C. Ye, and W. Ma, “Support vector regression for predicting
     building energy consumption in southern china,” Energy Procedia,
     vol. 158, pp. 3433–3438, 2019, innovative Solutions for Energy
     Transitions. [Online]. Available: article/pii/S1876610219309762

[20] U. Thissen, R. van Brakel, A. de Weijer, W. Melssen, and
     L. Buydens, “Using support vector machines for time series
     prediction,” Chemometrics and Intelligent Laboratory Systems, vol. 69,
     no. 1, pp. 35–49, 2003. [Online]. Available: https://www.sciencedirect.
[21] Chun-Hsin Wu, Jan-Ming Ho, and D. T. Lee, “Travel-time prediction
     with support vector regression,” IEEE Transactions on Intelligent Trans-
     portation Systems, vol. 5, no. 4, pp. 276–281, 2004.
[22] S. Ho and M. Xie, “The use of ARIMA models for reliability forecasting
     and analysis,” Computers & Industrial Engineering, vol. 35, no. 1,
     pp. 213–216, 1998. [Online]. Available:
[23] M. Babai, M. Ali, J. Boylan, and A. Syntetos, “Forecasting and
     inventory performance in a two-stage supply chain with arima(0,1,1)
     demand: Theory and empirical analysis,” International Journal of
     Production Economics, vol. 143, no. 2, pp. 463–471, 2013, focusing
     on Inventories: Research and Applications. [Online]. Available:
[24] B. Guha and G. Bandyopadhyay, “Gold price forecasting using arima
     model,” Journal of advance Management Journal, 03 2016.
[25] T. Sharp. An introduction to support vector regression (SVR).
     [Online]. Available:
[26] M. Awad and R. Khanna, Support Vector Machines for Classification.
     Berkeley, CA: Apress, 2015, pp. 39–66. [Online]. Available: https://

                         VII. APPENDIX A

Fig. 4. Example of a simple SVR [25]

Fig. 5. ARIMA model with both historic data and forecast plotted [14]

Fig. 6. Example of forecast vs actual values [14]

                         VIII. APPENDIX B
A. Mean Square Error

\[ E = \frac{1}{n} \sum_{i=1}^{n} (\mathrm{pred}_i - y_i)^2 \tag{9} \]

B. Support Vector Machine
   Support Vector Machine (SVM) is a classification model that
classifies data points by finding the optimal hyperplane, i.e. the
decision boundary, that separates them in an N-dimensional space.
The decision boundary is chosen with respect to two criteria. First,
the model should generalize well to unseen data, so that new data
points are classified as accurately as possible. Second, the
boundary should maximize the margin, i.e. the distance between the
boundary and the nearest data points. While there might be multiple
solutions to a classification problem, there is only one solution
that fulfils both criteria.
   To find this optimal decision boundary, and to be able to solve
problems that appear not to be linearly separable when visualized,
the SVM uses the kernel trick, which implicitly maps the data onto a
higher dimension. The kernel trick is used because data points that
are not separable in a lower dimension can become separable in a
higher dimension. Thereby, the model can always find an optimal
decision boundary in some dimension [26].
   To increase its generalization capabilities, the SVM also has an
additional parameter C, which penalizes the slack variables. The
slack variables allow for some misclassification, which can be
beneficial because the training data does not always fully represent
the actual distribution of the data. This introduces what is called
a soft margin, which is used in cases where some points need to be
misclassified to find an optimal solution, i.e. decision boundary.
The larger C is, the less tolerant the model is of
misclassification; the smaller C is, the more tolerant it is.

   1) Loss function:

\[ L_{\varepsilon}(f(x), q) =
   \begin{cases}
     |f(x) - q| - \varepsilon, & \text{if } |f(x) - q| \geq \varepsilon \\
     0, & \text{otherwise}
   \end{cases} \]

where \varepsilon represents the radius of the tube around the optimal
hyperplane; errors smaller than \varepsilon are not penalized.
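   As a minimal sketch of the ε-insensitive loss function given in this appendix, the following Python snippet evaluates the loss for a single prediction. The function name `eps_insensitive_loss`, its parameter names and the numeric values are assumptions made for illustration only; they are not taken from the models built in this report.

```python
def eps_insensitive_loss(prediction: float, target: float, eps: float) -> float:
    """Return the epsilon-insensitive loss for one prediction/target pair."""
    if eps < 0:
        raise ValueError("eps must be non-negative")
    residual = abs(prediction - target)
    # Errors inside the tube of radius eps cost nothing; outside the tube
    # the cost grows linearly with the distance to the tube.
    return residual - eps if residual >= eps else 0.0


if __name__ == "__main__":
    # Hypothetical values: a target of 85 and three candidate predictions.
    for pred in (84.0, 86.5, 90.0):
        print(pred, eps_insensitive_loss(pred, target=85.0, eps=2.0))
```

   Only the prediction that lies more than ε away from the target contributes to the loss, which is why a wider tube makes the regression less sensitive to small deviations in the training data.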

                    IX. APPENDIX C
A. Polynomial Regression

         Group        Segment                                          MSE        Error %
         General                                                       173.031    80.34
         Age          16-25                                            7.852      7.78
                      26-35                                            10.865     9.00
                      36-45                                            17.125     11.50
                      46-55                                            41.202     18.80
                      56-65                                            77.802     27.15
                      66-75                                            105.813    36.54
                      75+                                              3.667      9.06
         City         Stad (urban)                                     24.222     14.46
                      Landsbygd (rural)                                45.625     21.82
         Education    Grundskola (compulsory school)                   32.951     20.88
                      Gymnasie (upper secondary)                       3.086      5.81
                      Högskola (higher education)                      1.063      2.70
         Sector       Industri (industry)                              1.250      3.40
                      Handel (trade)                                   204.5      46.74
                      Transport                                        10.25      10.13
                      IT och Kommunikation (IT and communication)      14.5       10.64
                      Finans (finance)                                 6.25       8.18
                      Offentlig förvaltning (public administration)    2.5        4.38
                      Skola/Utbildning (school/education)              65         24.54
                      Vård (healthcare)                                0.25       1.14
                                 TABLE II
                          DIFFERENT FORECASTS

            Year    Prediction    W/O 2020    Ground Truth
            2005    54            54          54
            2006    64            64          64
            2007    75            75          75
            2008    99            99          77
            2009    89            89          79
            2010    94            94          81
            2011    99            99          81
            2012    104           104         84
            2013    108           108         85
            2014    83            83          85
            2015    80            80          79
            2016    77            77          90
            2017    73            73          92
            2018    67            67          92
            2019    89            91          90
            2020    60            83          87
            2021    81            80          -
            2022    77            76          -
                             TABLE III

Fig. 7. Forecast with all data points

Fig. 8. Forecast without 2020

   Below are the different demographics modeled.

Fig. 9. Forecast over segments based on age

Fig. 10. Forecast over segments based on location of living

Fig. 11. Forecast over segments based on education

Fig. 12. Forecast over segments based on work sector

B. Support Vector Regression

Fig. 13. Forecast with all data points

Fig. 14. Forecast without 2020

   Below are the different demographics modeled.

Fig. 15. Forecast over segments based on age

Fig. 16. Forecast over segments based on location of living

Fig. 17. Forecast over segments based on education

Fig. 18. Forecast over segments based on work sector

         Group        Segment                                          MSE        Error %
         General                                                       721.046    177.26
         Age          16-25                                            1.124      0.09
                      26-35                                            0.131      0.55
                      36-45                                            0.212      0.11
                      46-55                                            0.393      0.59
                      56-65                                            2.042      1.51
                      66-75                                            1.219      2.48
                      75+                                              15.708     18.21
         City         Stad (urban)                                     1.128      0.34
                      Landsbygd (rural)                                0.538      0.66
         Education    Grundskola (compulsory school)                   9.760      8.80
                      Gymnasie (upper secondary)                       12.346     8.41
                      Högskola (higher education)                      0.491      0.17
         Sector       Industri (industry)                              1.616      3.43
                      Handel (trade)                                   18.184     14.06
                      Transport                                        3.625      5.64
                      IT och Kommunikation (IT and communication)      4.447      6.35
                      Finans (finance)                                 8.191      17.53
                      Offentlig förvaltning (public administration)    12.553     10.89
                      Skola/Utbildning (school/education)              0.630      2.25
                      Vård (healthcare)                                1.119      3.40
                                 TABLE IV
                 MSE AND PERCENTAGE ERROR FOR SVR FORECASTS
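
   The MSE columns in the tables of this appendix follow Eq. (9). As a minimal sketch, the Python snippet below evaluates Eq. (9) on the 2005–2008 rows of the polynomial regression forecast table (prediction vs. ground truth). The function name `mean_square_error` is an assumption, and the four rows are chosen purely for illustration; the snippet does not attempt to reproduce the reported MSE values, since the evaluation split behind them is not stated here.

```python
def mean_square_error(predictions, targets):
    """Mean square error as in Eq. (9): the average of (pred_i - y_i)^2."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one data point is required")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Rows 2005-2008 of the polynomial regression forecast table.
    predicted = [54, 64, 75, 99]
    observed = [54, 64, 75, 77]
    print(mean_square_error(predicted, observed))
```

   Because the errors are squared, the single large deviation in 2008 dominates the result, which illustrates why a few poorly fitted years can inflate the MSE of an otherwise close forecast.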