THE CURIOUS CASE OF POSTS ON STACK OVERFLOW - SHAILJA SHUKLA - DIVA PORTAL

The curious case of posts on Stack Overflow

                                 Shailja Shukla

Subject: (Information Systems)

Corresponds to: (30 hp)

Presented: (VT 2020)

Supervisor: Mudassir Imran Mustafa

Department of Informatics and Media

Contents
Abstract ...................................................................................................................................... 6
Acknowledgements .................................................................................................................... 7
Chapter 1 .................................................................................................................................... 8
1.     Introduction ........................................................................................................................ 8
     1.1.      Background ................................................................................................................ 8
     1.2.      Motivation ................................................................................................................ 10
     1.3       Research Questions .................................................................................................. 11
     1.4       Delimitation: ............................................................................................................ 12
     1.5       Limitation:................................................................................................................ 12
Chapter 2 .................................................................................................................................. 13
2.     Theory ............................................................................................................................... 13
     2.1       Topic Modelling: ..................................................................................................... 13
     2.2       Latent Dirichlet Allocation (LDA): ......................................................................... 14
     2.3       Related Work ........................................................................................................... 15
Chapter 3 .................................................................................................................................. 17
3.     Methodology:.................................................................................................................... 17
     3.1       Data Collection: ....................................................................................................... 18
     3.2       Data Extraction: ....................................................................................................... 18
       3.2.1 Schema: ................................................................................................................. 19
     3.3       Data Pre-processing: ................................................................................................ 20
        3.3.1       Subset corpus data: .............................................................................................. 20
        3.3.2       Remove code snippets: ........................................................................................ 21
       3.3.3       Combine related documents to form a single corpus: .......................................... 22
       3.3.4       Tokenization: ....................................................................................................... 22
       3.3.5       Lowercasing: ........................................................................................................ 23
       3.3.6       Remove punctuations: .......................................................................................... 23
       3.3.7       Text Standardization/Replace Contractions:........................................................ 23
       3.3.8       Remove stop words: ............................................................................................. 24
       3.3.9       Remove URLs:..................................................................................................... 24
       3.3.10          Minimum size words: ...................................................................................... 24
       3.3.11          Remove multiple whitespaces: ........................................................................ 25
       3.3.12          Generate N-Grams: .......................................................................................... 25
       3.3.13          Stemming: ........................................................................................................ 25
       3.3.14          Lemmatisation: ................................................................................................ 26

3.4        Create Dictionary and Term Document Frequency: ................................................ 26
    3.5        Run the LDA model: ................................................................................................ 28
Chapter 4 .................................................................................................................................. 29
4     Analysis: ........................................................................................................................... 29
Chapter 5 .................................................................................................................................. 34
5     Result ................................................................................................................................ 34
    5.1        RQ1- What are the popular discussion topics in Stack Overflow? .......................... 34
      5.1.1        Web as a recurring discussion topic: ................................................................... 36
      5.1.2        UI Development as a recurring discussion topic: ................................................ 37
      5.1.3        Data management as a recurring discussion topic: .............................................. 37
    5.2        RQ2- How does the developer's interest change over time? ................................... 38
    5.3        RQ3- How do the interests in specific technologies change over time?.................. 39
      5.3.1        React vs Angular .................................................................................................. 39
      5.3.2        Python vs JavaScript ............................................................................................ 40
      5.3.3.          Popular discussion topics related to Web technologies ................................... 40
      5.3.4        Relational Databases (RDBMS) .......................................................................... 41
      5.3.5        Android vs iOS .................................................................................................... 42
      5.3.6        Object-Oriented Programming............................................................................. 43
      5.3.7        Machine Learning ................................................................................................ 44
Chapter 6 .................................................................................................................................. 45
6     Validity of research and experiences: ............................................................................... 45
Chapter 7 .................................................................................................................................. 46
7     Conclusion: ....................................................................................................................... 46
Chapter 8 .................................................................................................................................. 47
8     Discussion & Future Work: .............................................................................................. 47
Appendix 1: Tools and technology .......................................................................................... 48
Appendix 2: Popular discussion topics lists among developers: ............................................. 49
Appendix 3: Acronym / Abbreviation Table ........................................................................... 54
References: ............................................................................................................................... 56

Table of Figures:

Figure 1: Venn Diagram of the intersection of the Text Mining and six related fields (Miner
et al., 2012) ................................................................................................................................ 9
Figure 2: Schematic Overview of LDA (Debortoli et al., 2016). ............................................ 14
Figure 3: Methodology Model ................................................................................................. 17
Figure 4: Sample user post before cleaning of code snippet from the text content. ................ 21
Figure 5: Sample user post after cleaning of code snippet from the text content. ................... 21
Figure 6: Title of sample user post .......................................................................................... 22
Figure 7: Body of sample user post ......................................................................................... 22
Figure 8: Combined title and body of sample user post text ................................................... 22
Figure 9: Sample text before pre-processing .......................................................................... 25
Figure 10: Sample text after partial pre-processing ................................................................. 25
Figure 11: Sample text before stemming and lemmatisation................................................... 26
Figure 12: Sample text after stemming and lemmatisation ..................................................... 26
Figure 13: Sample pre-processed text ...................................................................................... 27
Figure 14: Term Document Frequency of sample text, generated from a dictionary .............. 27
Figure 15: Post types count ...................................................................................................... 30
Figure 16: Question Answer Ratio .......................................................................................... 31
Figure 17: Coherence score for different value of K (number of topics) ................................ 32
Figure 18: Intertopic distance map .......................................................................................... 35
Figure 19: Sample bar chart showing top 30 relevant terms for the topic (Topic 18-Function)
.................................................................................................................................................. 36
Figure 20: Top 20 trending tags ............................................................................................... 39
Figure 21: React vs Angular trend ........................................................................................... 39
Figure 22: Python vs JavaScript trend ..................................................................................... 40
Figure 23: Web technology trends ........................................................................................... 41
Figure 24: Relational DBMS trend .......................................................................................... 42
Figure 25: Android vs iOS trend .............................................................................................. 43
Figure 26: Object-Oriented Programming language trend ...................................................... 43
Figure 27: Machine Learning language trend .......................................................................... 44
Figure 28: Topic 1- Machine Learning .................................................................................... 49
Figure 29: Topic 2 - Javascript UI development ..................................................................... 49
Figure 30: Topic 3 - Relational DBMS.................................................................................... 49
Figure 31: Topic 4 - UI development ...................................................................................... 49
Figure 32: Topic 5 - Object-Oriented Programming ............................................................... 50
Figure 33: Topic 6 - Web Design ............................................................................................ 50
Figure 34: Topic 7 - Web Development .................................................................................. 50
Figure 35: Topic 8 - Data warehousing ................................................................................... 50
Figure 36: Topic 9 - Mobile Development .............................................................................. 51
Figure 37: Topic 10 - Text processing ..................................................................................... 51
Figure 38: Topic 11 - Coding style / practice .......................................................................... 51
Figure 39: Topic 12- CLI programming ................................................................................. 51
Figure 40: Topic 13 - Web Service Application ...................................................................... 52
Figure 41: Topic 14- Tabular data ........................................................................................... 52
Figure 42: Topic 15- Security / Authentication ....................................................................... 52
Figure 43: Topic 16 -Version Control Management................................................................ 52
Figure 44: Topic 17- File Operations....................................................................................... 53
Figure 45: Topic 18- Function ................................................................................................. 53

Figure 46: Topic 19- Cloud / Container technologies ............................................................. 53
Figure 47: Topic 20- Server-Side Scripting ............................................................................. 53

List of Tables
Table 1: Stack Overflow Posts schema .................................................................................... 19
Table 2: Post type and Post type Id .......................................................................................... 29
Table 3: Questions with Answers per Year ............................................................................. 30
Table 4: Coherence Scores of generated models with varying number of topics .................... 32
Table 5: Discovered Latent Topics .......................................................................................... 35

Abstract

Stack Overflow, a community website for programming-related questions and answers (Q&A),
serves as a popular platform where users can ask questions and receive responses from other
community members. Over time, user posts on Stack Overflow have become a source of valuable
information for programmers and the programming industry. By understanding the essential
topics of discussion among developers, new insights can be found about developers' changing
trends and needs. This thesis proposes an analysis of user posts on Stack Overflow to find
the topics of user posts. The topics distributed in the text content of user posts are
extracted using the topic modelling technique. Latent Dirichlet Allocation (LDA) is applied
for topic discovery, and the optimal number of topics is determined. The trend of developer
interest is derived by combining the view counts of questions with the discovered topics.
Based on the analysis within the thesis's scope, developers discuss topics ranging from
programming languages, language runtimes, storage, and cloud to networking. Scripting
languages are discussed more than non-scripting languages. The discovered topics include
several recurring categories, i.e., Web Development, Data Management, and UI Development.
According to our findings, Machine Learning is gaining popularity, as are data processing
and analytics solutions. Mobile development is another favoured subject among developers.
The analysis of the research findings infers that one technology's popularity is also
reflected in related technologies' popularity trends.

Acknowledgements

Thanks to my supervisor at Uppsala University, Mudassir Imran Mustafa, for his support,
feedback, and motivation during the thesis. Thanks also to David Johnson for sharing the idea
of analyzing Stack Overflow content during our thesis meeting, and to Ruth Lochan, our course
coordinator, for approving the proposal. Finally, I would like to thank my spouse for his
support. I am glad to present this thesis work; it was a wonderful experience.
Thanks to all for the support.

Shailja Shukla
Uppsala, 2021-04-21

Chapter 1

1. Introduction

   1.1. Background

Programmers all around the world seek to solve the problems at hand. Sometimes tasked with
solving technical issues, and at other times helping others or satisfying their curiosity,
programmers often turn to web-based online "social question and answer developer community
platforms". On such platforms, programmers can interact with other developers and find
answers to their questions. With the spread of technology into novel fields, software
developer communities are also growing day by day. There are many forums where developers can
post questions related to technical issues. Stack Overflow is one such website, where a
developer can post questions and answers and add comments on answers provided by other
developers. Within this thesis's scope, visitors or users of the Stack Overflow website are
referred to as developers.
Stack Overflow is the most popular Q&A website among software developers and serves as a
platform for knowledge sharing and acquisition (Alrashedy et al., 2020). Stack Overflow is a
valuable source of support for developers seeking probable solutions from the web (Rubei et
al., 2020).
Stack Overflow allows its users to ask questions, tag a question with keywords to categorize
it, provide answers, comment on questions or answers, and up- or down-vote a question or
answer. Over time, Stack Overflow has become a community knowledge base for
programming-related subjects, and this knowledge base in its current form remains highly
popular. The Stack Overflow knowledge base can be accessed through a web search or through
the internal search functionality provided by the Stack Overflow website navigation.
Navigating the website through tags is one of the navigation options. User-created tags lead
to tag explosion, which is challenging to manage (Li et al., 2019). Developers find
interesting posts with the help of the tags associated with a post. Since tags are
user-created, they might sometimes be missing from specific posts or be irrelevant. The
content of a post itself is not used to assign tags to posts, which leaves a gap of untapped
opportunity to find interesting insights in the corpus of content posted by users of the
website (Barua, Thomas and Hassan, 2012).

User posts on Stack Overflow are expressed in rich and ambiguous natural language (Debortoli
et al., 2016). One way to analyze natural language is qualitative data analysis using manual
coding; however, the size of text data sets obtained from the Internet makes manual analysis
virtually impossible (Debortoli et al., 2016). "Text mining" and "text analytics" are broad
umbrella terms describing a variety of technologies for analyzing and processing
semi-structured and unstructured text data (Miner et al., 2012). Text mining techniques make
it possible to automatically extract implicit, previously unknown, and potentially practical
knowledge from enormous amounts of unstructured textual data in a scalable and repeatable way
(Debortoli et al., 2016). Automated text mining allows information systems researchers to
overcome the limitations of manual approaches to qualitative data analysis; a study can be
repeated more easily and quickly, and it yields insights that could otherwise not be found
(Debortoli et al., 2016).
Text mining is divided into seven practice areas depending on each area's unique
characteristics (Miner et al., 2012). These text mining divisions are interrelated and often
require skills in more than one area (Miner et al., 2012). A topic model is a probabilistic
generative model used broadly in computer science, with a specific focus on text mining and
information retrieval (Liu et al., 2016). The position and relation of the topic
modelling/information retrieval practice area are illustrated in the following diagram:

    Figure 1: Venn Diagram of the intersection of the Text Mining and six related fields (Miner et al., 2012)

Optimal text mining results require skill sets from both computer science and linguistics,
and an IS researcher might not be equipped with knowledge and skills in all of these fields.
Much technical literature exists on the ideas and methods underlying specific text mining
algorithms, such as topic modelling (Debortoli et al., 2016).

   1.2. Motivation

Stack Overflow is not only a famous programming question-and-answer community platform but
also an information system based on user-generated content and a ranking-based moderation
system. Search is a primary tool on Stack Overflow for finding solutions among already
answered questions and for finding questions to answer or comment on.
Another method to find content is navigating through the tags associated with questions. Tags
are helpful for navigation and for discovering interesting content within the website, but
they have some shortcomings. The wild nature of user-generated tagging makes tags prone to
inconsistencies caused by spelling variations, synonyms, acronyms, and hyponyms, which affect
tag quality; as a result, tags do not entirely represent the underlying topics in the text
content of user posts. These inconsistencies might cause "tag explosion", which means that a
small subset of tags is overused (Joorabchi, English and Mahdi, 2015). "Tag explosion
describes the phenomenon that the number of tags dramatically increases along with continuous
additions of software objects" (Li et al., 2019).
Two factors that affect tag quality are tag synonyms and tag explosion. Tag synonyms arise
when posts are tagged with similar tags, e.g., "javascript" and "java-script", "c#" and
"c-sharp", "ios" and "i-os", ".net" and "dotnet". Tag explosion is caused by the continuous
inclusion of new tags into the system, making it hard to navigate and analyze content topics
manually (Li et al., 2019).
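The effect of tag synonyms can be sketched with a small normalization step. The synonym
mapping and function below are a hypothetical illustration for this discussion, not Stack
Overflow's actual synonym table:

```python
# Hypothetical illustration: normalizing synonymous user-generated tags
# to a canonical form before analysis. The mapping is an assumption for
# demonstration only, not Stack Overflow's real synonym data.
TAG_SYNONYMS = {
    "java-script": "javascript",
    "c-sharp": "c#",
    "i-os": "ios",
    "dotnet": ".net",
}

def normalize_tags(tags):
    """Map each tag to its canonical form and drop duplicates, keeping order."""
    canonical = [TAG_SYNONYMS.get(t.lower(), t.lower()) for t in tags]
    return list(dict.fromkeys(canonical))

print(normalize_tags(["JavaScript", "java-script", "c-sharp"]))
# → ['javascript', 'c#']
```

Without such normalization, the same underlying topic is split across several tag variants,
which is one reason raw tags understate topic popularity.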
Tags cannot be explicitly associated with answer or comment type posts, which leaves much
user-generated content untagged. However, the user-generated tags of each "question" type
post might also be perceived as representative of the "answer" and "comment" type posts
associated with that question. In cases of tag misuse or incorrect tagging, the tags might
not represent the posts' content (Meta Stack Overflow, n.d.).
Thus, it is concluded that the tags associated with posts do not cover an extensive part of
the total user posts. Analysis of this untagged data can give several types of insights and
answer many different questions. This finding makes the use of tags as indexing metadata unsuitable for Stack
Overflow data-based information systems. Topic modelling is an alternative solution for
finding the topics associated with each user post through unsupervised learning. It helps
categorize posts into broad categories. Topic modelling is an information retrieval technique
used to find structure in a collection of documents. These techniques were developed to make
browsing an enormous collection of documents more accessible (Eickhoff and Neuss, 2017).
User-generated content can be analyzed through a topic model to get more insight into the
conversations on the community platform. Similar studies on topic modelling of Stack Overflow
posts have been conducted in the past, first by Barua, Thomas and Hassan in 2012 and again by
Verma, Sardana and Lal in 2019, covering different time periods. Our motivation for
conducting this study is to see how technological trends and discussion topics have changed
over a more recent period.

   1.3 Research Questions

The text content of user posts from the Stack Overflow website is used to find technology
trends over time and the discussion topics among developers. The research questions are
inspired by a similar study on Stack Overflow data conducted by Barua, Thomas and Hassan
(2012).

   1. What are the popular discussion topics in Stack Overflow?
   Knowledge of the popular discussion topics among programmers can be a valuable piece of
   information. It gives vital insights to technology analysts, vendors, authors, educators,
   and companies in general, which helps them make decisions about their work and products.
   The Stack Overflow website could also benefit from this knowledge by exposing the
   discovered LDA topics through its navigation, improving their accessibility. These
   benefits motivated us to find the main discussion topics among developers on Stack
   Overflow.

   2. How do developers' interests change over time?
   By seeing the most active topics, businesses, professionals, book authors, and institutes
   may better assess newfound opportunities and risks, predict trends, and shift their focus
   in a direction better suited to their respective goals. Information systems help by
   recording such experiences in a knowledge base.

   3. How do the interests in specific technologies change over time?
   Observing changes in the popularity of topics helps maintainers of software libraries
   understand growing or waning interest in their released work. For example, if an
   open-source JavaScript-based library, "React" from Facebook, starts gaining more developer
   interest than another popular library from Google called "Angular", then analyzing this
   trend might help the maintainers of the Angular library.

    1.4 Delimitation:

This study is confined to Stack Overflow data and does not depend on any other data. The
study uses an English language dictionary for natural language processing and topic
modelling. It is limited to performing topic modelling using Latent Dirichlet Allocation
(LDA) on data from 01 January 2018 to 01 March 2020 and analyzing the result.

    1.5 Limitation:

The computing resource used in the study is a MacBook Pro with 16 GB RAM and a 2.6 GHz 6-core
Intel Core i7, running macOS, resulting in restricted computational power. This affects the
LDA model and the evaluation of models. The study is based on 20 topics, and it does not
include the discovery of a dominant topic in each document. A mix of custom and default
parameters of algorithms from the open-source Python library Gensim is used to clean up our
dataset. For example, to remove stop words from the text, a custom list of words is used.
This list is not exhaustive, so it does not ensure the removal of all meaningless words. The
applied natural language processing techniques are based on popular best practices, and a
more suitable approach might be possible. For stemming, an English language dictionary is
used; using a custom dictionary more suitable for the subject would be expected to yield
better results.
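As an illustration of the stop-word clean-up described above, the following minimal sketch
uses a small assumed sample list; it is not the custom list actually used in the study:

```python
# Minimal stop-word removal sketch. The list below is an illustrative
# assumption, not the thesis's actual (non-exhaustive) custom list.
CUSTOM_STOP_WORDS = {"the", "a", "an", "is", "to", "and", "i", "how"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in CUSTOM_STOP_WORDS]

tokens = "how to parse a json string".split()
print(remove_stop_words(tokens))
# → ['parse', 'json', 'string']
```

Because any fixed list is incomplete, some meaningless words inevitably survive this step,
which is the limitation noted above.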

Chapter 2

2. Theory

        2.1 Topic Modelling:

Topic modelling is a popular information retrieval method for finding and extracting
essential terms from a collection of many documents with little or no human intervention. It
helps derive the structure of the relationships between documents (Arora et al., 2013).
The principle behind topic modelling is that each document in a collection is a mixture of
topics, where a topic refers to a probability distribution over words (Alghamdi and Alfalqi,
2015).
Topic models are algorithms for discovering the main themes that pervade a large and
otherwise unstructured collection of documents and can organize the collection according to
the discovered themes (Blei, 2012).
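This principle can be illustrated with a toy numeric example. All distributions below are
invented for demonstration and are not drawn from any real corpus:

```python
# Toy illustration: a document is a mixture of topics, and each topic is
# a probability distribution over words. All numbers are invented.
topics = {
    "web":  {"javascript": 0.5, "css": 0.3, "html": 0.2},
    "data": {"sql": 0.6, "table": 0.25, "query": 0.15},
}
doc_topic_mix = {"web": 0.7, "data": 0.3}  # P(topic | document)

def word_probability(word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(doc_topic_mix[t] * dist.get(word, 0.0)
               for t, dist in topics.items())

print(round(word_probability("javascript"), 3))  # 0.7 * 0.5 = 0.35
print(round(word_probability("sql"), 3))         # 0.3 * 0.6 = 0.18
```

A topic model works in the opposite direction: given only the observed words, it infers the
hidden topic distributions and per-document mixtures.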
There are many topic modelling techniques. The first probabilistic topic model was
Probabilistic Latent Semantic Analysis (PLSA), introduced by T. Hofmann in 1999. In 2003,
D. Blei, A. Ng and M. Jordan proposed its Bayesian extension, named Latent Dirichlet
Allocation (LDA) (Kochedykov et al., 2017). Since then, topic modelling has been developed
within graphical models and Bayesian learning (Kochedykov et al., 2017).
The use of "vanilla" LSA or LDA is prevalent in IS research for topic modelling due to the
lack of publicly available implementations of many specialized topic modelling methods
(Eickhoff and Neuss, 2017). LSA extracts the underlying topics from a term-document matrix by
applying singular value decomposition (SVD); because this approach contradicts human
intuition about topics, LDA evolved from LSA and pLSA by imposing Dirichlet-distributed
priors on its word-to-topic and topic-to-document distributions, producing results more in
line with human intuition (Eickhoff and Neuss, 2017).
Apart from its general use in research and the verified results of LDA in empirical topic
modelling studies, LDA has been implemented in many programming languages, such as Python,
Java, and R, and several implementations are publicly available as open-source and free
software (Debortoli et al., 2016). Latent Dirichlet Allocation (LDA) is selected for topic
modelling within the study's scope. LDA is a

                                                                                              13
generative probabilistic model of a corpus (Blei, Ng and Jordan, 2003). The basic idea is that
documents are represented as random mixtures over latent topics, where each topic is
characterized by a distribution over words (Blei, Ng and Jordan, 2003). LDA tries to find the
proper assignment of a topic to every word such that the parameters of the generative model
are maximized (Arun et al., 2010).

        2.2 Latent Dirichlet Allocation (LDA):

LDA uses an imaginary generative process that assumes that authors composed documents by
choosing a discrete distribution of t topics to write about and drawing w words from a discrete
distribution of typical words for each topic (see Figure 2) (Debortoli et al., 2016).

     Figure 2: Schematic Overview of LDA (Debortoli et al., 2016).

The LDA algorithm computationally estimates the hidden topic and word distributions given
the observed per-document word occurrences (Debortoli et al., 2016). LDA can perform this
estimation via sampling approaches (e.g., Gibbs sampling) or optimization approaches (e.g.,
Variational Bayes) (Debortoli et al., 2016).
Based on these features of the LDA algorithm, it was concluded that LDA is the most suitable
topic modelling method to apply to Stack Overflow data to extract topics. In 2010,
Hoffman, Bach and Blei published an optimization of LDA as "Online Learning for Latent Dirichlet
Allocation". Online LDA uses an online variational Bayes (VB) algorithm for Latent Dirichlet
Allocation (LDA) (Hoffman, Bach and Blei, 2010). This development enhances the capability
of the LDA algorithm for large sets of documents; Online LDA can even be applied to
streaming text. To benefit from this performance boost, Online LDA has been used in
the study.

         2.3 Related Work

Many research studies have been conducted with Stack Overflow as a subject. In 2012,
Barua, Thomas and Hassan analyzed the questions and answers posted on
Stack Overflow from June 2008 to September 2010 to find the main discussion topics on Stack
Overflow, changes in developer interests over time and changes in specific technologies. They
used the Latent Dirichlet Allocation topic modelling technique to find topics and found
around 40 different topics. They also used the Cox-Stuart test to analyze changing trends
over time. Among the findings were that mobile application development was on the rise, faster than
web development; Android and iPhone development were far more prevalent than Blackberry
development; the PHP scripting language was becoming extremely popular, much more so than,
say, Perl; Java remained a continuing player within the programming languages and APIs sector,
while the .NET framework was decreasing slightly; Git had surpassed SVN in the VCS popularity
contest; and MySQL was the hottest DBMS of the last few years (Barua, Thomas and Hassan,
2012).
In 2018, Johri and Bansal analyzed the Stack Overflow data for 2014 and 2015 to gain insight into
trends in technologies for different subdisciplines of computer science and programming
languages. They found that Website Design/CSS is the most impactful topic. Data
Analysis/Visualization and Mobile App Development are hot topics, and their popularity is
increasing, while the impact of Object-Oriented Programming and Coding Style/Practice has
decreased over time. On the other hand, topics like Authentication/Security and UI
Development have shown steady trends over time. Furthermore, R and Python have dominated
the Data Analysis/Visualization topic, Oracle and MySQL are the most popular database
platforms, and Python is the most impactful scripting language (Johri and Bansal, 2018).
In 2019, Verma, Sardana and Lal analyzed the questions and answers
posted on Stack Exchange for 2015, 2016 and 2017 to find the critical discussion
topics and how developers' interest, both overall and in specific technologies, changes over
time. They found that the discussion's popular topics were programming skills, object-oriented
design, and design & development in all three successive years. The topics were labelled
meaningfully based on the top words assigned by LDA. The leading technologies discussed, for
which questions were raised, were Java and C# (Verma, Sardana and Lal, 2019).
Our study is related to the topic modelling of developer posts on Stack Overflow.

Chapter 3

3. Methodology:

To find the main discussion topics among developers on the Stack Overflow website, topic
modelling is performed on user posts retrieved from the Stack Overflow dataset. With topic
models, it is possible to retrieve topics from a collection of texts without document metadata.
The methodology is motivated by similar studies conducted in the past by Barua, Thomas and
Hassan (2012) and Verma, Sardana and Lal (2019), as referenced in section 2.3. It does not
replicate the same steps; a few changes were made according to the requirements of the study,
such as customized stop words and the removal of contractions. The online variational Bayes
algorithm for latent Dirichlet allocation (Online LDA) by Hoffman, Bach and Blei (2010) is
used for topic modelling. Topic modelling is performed in the following steps, shown in figure 3:

             Figure 3: Methodology Model

3.1 Data Collection:

The popular programming community Q&A website Stack Overflow is part of the Stack Exchange
network, which comprises 173 Q&A sites for various communities (About - Stack Exchange,
2020). Stack Overflow serves over 120 million visitors every month (About, 2020).
Data for research has been downloaded from archive.org. Archive.org (internet archive), a non-
profit, builds a digital library of Internet sites and other cultural artefacts in digital form
(Internet Archive: About IA, 2020). Stack Exchange provides quarterly data dump of Stack
Exchange network sites at [https://archive.org/details/stackexchange]. Stack Overflow data has
been shared publicly by Stack Exchange under a Creative Commons licence. Eight datasets are
available from Stack Overflow: Badges.7z, Comments.7z, PostHistory.7z, PostsLinks.7z,
Posts.7z, Tags.7z, Users.7z and Votes.7z (Stack Exchange Data Dump: Stack Exchange, Inc.: Free
Download, Borrow, and Streaming: Internet Archive, 2020).

   3.2 Data Extraction:

For topic modelling, the file Posts.xml is selected as the data source of the thesis because it
is the most suitable data dump file: it contains the contents of questions and answers as posts,
and this collection of posts serves as the corpus for topic modelling. The archive has posts until 01
March 2020, since data is updated quarterly and only the most recent data dump is made
available at the given URL [https://archive.org/download/stackexchange/stackoverflow.com-
Posts.7z]. The data source used in the thesis is available at the URL below:
[https://archive.org/details/stackoverflow.com-Posts.7z].
The MD5 checksum of the 7zip archive is e5c0b370d5f9a6905c88fdb5971b145a.
The size of the archive Posts.7z on disc is 14.6 GB. After extracting the archive, the size
of the extracted file Posts.xml is approx. 75 GB. It has a total of 4,79,31,101 posts from 2008 until
01 March 2020.

3.2.1 Schema:

The schema of Posts.xml is as follows (Stack Exchange Data Dump, 2020):

      Posts.xml
               Id
               PostTypeId
                        1: Question
                        2: Answer
               ParentID (only present if PostTypeId is 2)
               AcceptedAnswerId (only present if PostTypeId is 1)
               CreationDate
               Score
               ViewCount
               Body
               OwnerUserId
               LastEditorUserId
               LastEditorDisplayName
               LastEditDate
               LastActivityDate
               Community Owned Date
               ClosedDate
               Title
               Tags
               AnswerCount
               CommentCount
               FavoriteCount

Table 1: Stack Overflow Posts schema

For the scope of the study, data from 01 January 2018 onwards was queried. The MySQL database
supports XML files as input.
The following SQL command is used to import Posts.xml into a MySQL database:

load XML local infile '/Path/To/stackexchange/Posts.xml' into table posts rows identified by
'';
The following SQL query is used to select posts from 01 January 2018 onwards:
select * from posts where CreationDate>='2018-01-01' order by CreationDate Desc;
The query returned 97,47,021 posts, of which 43,49,023 were questions and
53,97,998 were answers, as determined by the value of PostTypeId. The initial dataset is
created from the exported query result as a SQL data dump. The exported SQL dump size is 4.3
GB gzipped. To prepare the data for pre-processing, the data format needs to be one of the file
formats supported by the data pre-processing and data cleaning software. CSV, Avro, JSON and
Parquet are among the most widely used file formats supported by data processing
applications.
Google BigQuery is used to import the SQL data and export it in the Avro file format; the Avro
files were later converted to the Parquet file format, which is supported by the Python data
processing library Pandas. The Online LDA algorithm implementation from the Python library
Gensim is used for LDA topic modelling.

   3.3 Data Pre-processing:

An exported SQL dump is used as the source input to create and export the dataset for processing.
The data needs pre-processing before it is processed in the LDA model in order to get meaningful
results. Data pre-processing consists of a sequence of steps to transform the raw data derived from
data extraction into a clean and tidy dataset before analysis (Malley, Ramazzotti and Wu, 2016),
reducing noise in the dataset. Data cleaning has been performed in the following steps:

       3.3.1    Subset corpus data:

For creating a corpus, data from the "Title" and "Body" columns is retrieved, and the rest of
the columns are discarded. Only posts that are questions have a non-empty value
in the "Title" column, because Stack Overflow only allows questions to have a title. The
"Id" column is also kept for backtracking to the entire dataset, so that processed values can be
mapped to the unprocessed dataset table if needed.
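The column subsetting described above can be sketched in Pandas as follows. This is a minimal illustration, not the study's own code; the sample rows are hypothetical, while the column names follow the Posts schema in table 1.

```python
import pandas as pd

# Hypothetical sample rows mirroring the Posts schema; in the study the
# frame would be loaded from the exported Parquet files instead.
posts = pd.DataFrame({
    "Id": [101, 102],
    "PostTypeId": [1, 2],                      # 1 = question, 2 = answer
    "Title": ["How do I parse XML?", None],    # answers have no title
    "Body": ["<p>Question text</p>", "<p>Answer text</p>"],
    "Score": [5, 3],
})

# Keep only the columns needed for the corpus; Id allows backtracking
# to the unprocessed dataset table.
corpus_df = posts[["Id", "Title", "Body"]]
```

The remaining columns (scores, dates, counters) carry no text and are simply dropped before pre-processing.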

3.3.2     Remove code snippets:

To automate topic modelling, any part of the corpus which is not meaningful needs to be
cleaned up. In Stack Overflow posts, code snippets are a regular part of the post content, but
these snippets add noise for the algorithms used for Natural Language Processing. Code
snippets are enclosed inside "code" and "pre" HTML tags. Code snippets,
including the enclosing "code" and "pre" tags, are removed. The posts corpus still has many other
HTML tags, e.g., paragraph and formatting tags. These tags are removed, but the enclosed text is kept.
A sample user post before (figure 4) and after (figure 5) clean-up of the code snippet from the
text content is shown below.

Figure 4: Sample user post before cleaning of code snippet from the text content.

Figure 5: Sample user post after cleaning of code snippet from the text content.

After performing these steps, the dataset size was reduced to 4.3 GB.
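A minimal sketch of this clean-up step with regular expressions is shown below. It is illustrative only (the study's actual implementation is not reproduced here) and assumes well-formed, non-nested tags; it also does not decode HTML entities.

```python
import re

# Remove <pre>/<code> blocks including their contents, then strip the
# remaining HTML tags while keeping their enclosed text.
CODE_RE = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def clean_html(post_body: str) -> str:
    text = CODE_RE.sub(" ", post_body)   # drop code snippets entirely
    text = TAG_RE.sub(" ", text)         # drop other tags, keep their text
    return " ".join(text.split())        # normalize whitespace

sample = "<p>Why does this fail?</p><pre><code>x = 1/0</code></pre>"
print(clean_html(sample))  # "Why does this fail?"
```

The non-greedy `.*?` ensures each snippet is removed up to its own closing tag rather than to the last closing tag in the post.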

3.3.3     Combine related documents to form a single corpus:

The dataset table is converted into a Pandas data frame for further data processing, and posts are
still divided into "Title" and "Body" columns. Both columns are combined to form a single corpus,
with null or empty values replaced by an empty string.
The title of the sample user post is shown below in figure 6.

Figure 6: Title of sample user post

The body text content of the sample user post is shown below in figure 7.

Figure 7: Body of sample user post

The combined text of the title and body is shown below in figure 8.

Figure 8: Combined title and body of sample user post text
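The combination step above can be sketched in Pandas as follows; the sample rows are hypothetical, and the column name "Text" for the combined field is an assumption for illustration.

```python
import pandas as pd

# Hypothetical frame: the answer (second row) has no title.
df = pd.DataFrame({
    "Title": ["How to centre a div?", None],
    "Body": ["I tried margin auto", "Use flexbox"],
})

# Replace missing values with an empty string, then join the two
# columns into a single text field per post.
df["Text"] = (df["Title"].fillna("") + " " + df["Body"].fillna("")).str.strip()
```

For answers, which have no title, the combined text is simply the body.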

3.3.4     Tokenization:

In NLP studies, the focus is on analysis rather than on the basic units called tokens, i.e., words;
however, without clear segregation of words, it is impossible to carry out analysis on documents
written in natural languages (Webster and Kit, 1992).
The analyzed text is converted into a list of meaningful segments called tokens (Bhargav
Srinivasa-Desikan, 2018). These segments could be words, punctuation, numbers, or other
special characters that are the building blocks of a sentence (Bhargav Srinivasa-Desikan, 2018).
Tokens are units that need not be decomposed in further processing. This process achieves
automatic segmentation by constructing a dictionary and applying strategies for
disambiguation (Webster and Kit, 1992).
The study uses whitespace as a delimiter; this can be difficult in non-English languages. As this
study is scoped to English-language text, delimiting on whitespace is not a problem.
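Whitespace tokenization amounts to splitting on runs of spaces, which Python's built-in `str.split` already does; a minimal sketch:

```python
# Whitespace tokenization: adequate for English text, as noted above.
def tokenize(text: str) -> list:
    return text.split()

tokens = tokenize("atomically move a 64bit value in x86 ASM")
# ['atomically', 'move', 'a', '64bit', 'value', 'in', 'x86', 'ASM']
```

Note that punctuation still clings to adjacent tokens at this point; it is handled in a later step.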

3.3.5   Lowercasing:

Changing the letter case is a part of text pre-processing that cleans tokens of case-related
ambiguity (Kulkarni and Shivananda, 2019).
User post content from the Stack Overflow dataset is natural language text. When using
automated natural language processing tools, case sensitivity can produce overwhelmingly
fragmented results. Changing all the text into lower case reduces this ambiguity:
e.g., "Sass" and "SASS" will be changed into "sass", and "JavaScript" and "Javascript" will be
transformed into "javascript" (Bhargav Srinivasa-Desikan, 2018).

3.3.6   Remove punctuations:

Punctuation does not add meaning or supply additional value to the text being pre-processed for
topic modelling. Removing punctuation from an enormous collection of text also reduces the
size of the text; processing more text requires more computing resources, so a smaller text
collection enables text pre-processing with reduced computing resources (Bhargav Srinivasa-
Desikan, 2018). A popular method to remove punctuation from text documents is to use a
regular expression along with a list of punctuation characters to be removed. Python supplies a
list of punctuation characters as part of the standard library.
Several programming languages have punctuation in their names, e.g., the "+" in "C++" and the "#"
in "C#". Our topic model's results might be skewed by erroneously removing punctuation from such
technology keywords. For example, if all punctuation is removed, the technology words
"C++, C, C#" change to "c, c, c". This transformation might triple the probabilistic
distribution of the token "c" and incorrectly remove "c#" and "c++" from the text and later from
the topic model. Therefore, all punctuation is removed from the text except in tokens from a
non-exhaustive exception list of such programming keywords.
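The exception-list approach described above can be sketched as follows. The whitelist here is a small illustrative subset, not the study's full exception list.

```python
import string

# Non-exhaustive whitelist of technology terms whose punctuation is kept,
# mirroring the exception list described above.
WHITELIST = {"c++", "c#", "f#"}

def strip_punctuation(tokens):
    cleaned = []
    for tok in tokens:
        if tok in WHITELIST:
            cleaned.append(tok)  # keep "c++", "c#", "f#" intact
        else:
            # str.maketrans deletes every character in string.punctuation
            cleaned.append(tok.translate(str.maketrans("", "", string.punctuation)))
    return [t for t in cleaned if t]  # drop tokens emptied by stripping

print(strip_punctuation(["c#", "c++", "hello!", "..."]))
# ['c#', 'c++', 'hello']
```

Checking the whitelist before stripping is what prevents "c#" and "c++" from collapsing into "c".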

3.3.7   Text Standardization/Replace Contractions:

Text standardization converts a raw corpus into a canonical and standard form to ensure that the
textual input is consistent before analysis and processing (Bokka et al., 2019). Contractions are
shortened versions of words or syllables (Sarkar, 2019). These shortened versions of existing
words are created by removing specific letters and sounds, and contractions pose a problem for
NLP and text analytics (Sarkar, 2019).
Stack Overflow is primarily a social community. Users interact in natural languages and often
not in their primary spoken language. There is a high possibility of people using short words
and abbreviations to stand for the same meaning. In many cases, certain words might be
misspelt, and popular slang substitutes may have been used. Abbreviations in the corpus have
been expanded to their canonical forms, e.g., gotta -> got to, brb -> be right back.
Spellings of tokens were not corrected because dictionary-based auto-correction might remove
non-English technical terms, which would misrepresent token sparsity.
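Contraction replacement is typically a lookup table applied token by token; a minimal sketch, with a tiny hypothetical map (a real one would be far larger):

```python
# Small, illustrative contraction/slang map; not the study's actual list.
CONTRACTIONS = {
    "gotta": "got to",
    "brb": "be right back",
    "can't": "cannot",
}

def expand_contractions(tokens):
    expanded = []
    for tok in tokens:
        # Replace a known contraction with its expansion (may be
        # several words), otherwise keep the token unchanged.
        expanded.extend(CONTRACTIONS.get(tok, tok).split())
    return expanded

print(expand_contractions(["i", "gotta", "go", "brb"]))
# ['i', 'got', 'to', 'go', 'be', 'right', 'back']
```

Because an expansion can contain several words, the replacement is itself split so the output remains a flat token list.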

3.3.8     Remove stop words:

As a next step, stop words have been removed. Stop words are common words that appear to
be of little value in helping to select documents matching a user need and are excluded from the
vocabulary entirely (Manning, Raghavan and Schütze, 2008), for example: 'a', 'an', 'the', 'of',
'else'.
By removing the commonly used words, the focus shifts to the essential keywords instead
(Bhargav Srinivasa-Desikan, 2018). For example, in the text "How do I atomically
move a 64bit value in x86 ASM?", the common English words "How", "do", "I", "a" and "in" have
been removed. For stop word removal, the study uses a custom list of stop words that includes
additional tokens, such as overly common technology-related terms, whose removal improves topic
coherence.
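The removal step amounts to a set-membership filter over the tokens; a minimal sketch, where both stop-word lists are small illustrative stand-ins for the study's actual lists:

```python
# Illustrative English stop words plus hypothetical custom additions;
# the study's own lists are larger and domain-tuned.
STOP_WORDS = {"how", "do", "i", "a", "an", "the", "of", "in", "else"}
CUSTOM_STOP_WORDS = {"use", "code", "error"}

def remove_stop_words(tokens):
    stops = STOP_WORDS | CUSTOM_STOP_WORDS
    return [t for t in tokens if t not in stops]

print(remove_stop_words(
    ["how", "do", "i", "atomically", "move", "a",
     "64bit", "value", "in", "x86", "asm"]))
# ['atomically', 'move', '64bit', 'value', 'x86', 'asm']
```

This step runs after lowercasing, so a single lower-case stop list suffices.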

3.3.9     Remove URLs:

URLs and email addresses occur frequently in the text and add noise that lowers the quality of
the pre-processed text. URLs have been removed from the text using a Python regular
expression.

3.3.10 Minimum size words:

Words with fewer than a certain number of letters are often not useful. Any word of fewer than two
letters has been removed. Some whitelisted technology terms, "'c++', 'c#', 'f#', 'r', 'c'", were not
removed.

3.3.11 Remove multiple whitespaces:

A whitespace delimiter is used for tokenization. If there are multiple continuous occurrences of
whitespace, it might create null tokens, so a Python regular expression is used to remove
multiple whitespaces.
A sample text document before (figure 9) and after partial pre-processing (figure 10) is shown below.

Figure 9: Sample text before pre-processing

Figure 10: Sample text after partial pre-processing
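The three clean-up steps in sections 3.3.9 to 3.3.11 (URL removal, short-word filtering with a whitelist, whitespace collapsing) can be combined into one small pass; a minimal sketch under the assumption that splitting on whitespace also collapses repeated spaces:

```python
import re

# Matches http(s) URLs and bare www. links; illustrative, not exhaustive.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Short technology terms kept despite the minimum-length filter.
SHORT_WORD_WHITELIST = {"c++", "c#", "f#", "r", "c"}

def clean_text(text: str) -> str:
    text = URL_RE.sub(" ", text)        # strip URLs
    tokens = text.split()               # splitting also collapses whitespace
    tokens = [t for t in tokens
              if len(t) > 1 or t in SHORT_WORD_WHITELIST]
    return " ".join(tokens)

print(clean_text("see   https://example.com for r vs c# docs"))
# see for r vs c# docs
```

Whitelisted single-letter terms such as "r" and "c" survive the length filter, matching the exception list described above.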

3.3.12 Generate N-Grams:

Every token in Natural Language Processing is considered a feature, and an n-gram is a
contiguous sequence of n features in the text (Bhargav Srinivasa-Desikan, 2018). For a single
feature, the value of n is 1; this representation of features is called a "unigram". Sometimes a
token derives meaning by combining with the previous or next feature: when the tokens "Java"
and "Script" are found together, capturing them derives the feature "JavaScript". This process
is called the generation of n-grams, where n denotes the number of tokens captured to form a new
feature. When two tokens form a feature, it is called a "bigram"; when three tokens form a
feature, it is called a "trigram", and so on (Bhargav Srinivasa-Desikan, 2018). Bigrams and
trigrams were generated for the scope of the study.
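The mechanics of n-gram generation can be sketched with a few lines of plain Python; in practice, frequency-based models such as Gensim's Phrases decide which n-grams to keep, but the underlying windowing looks like this:

```python
def make_ngrams(tokens, n=2):
    # Join each window of n consecutive tokens into a single feature,
    # e.g. ("java", "script") -> "java_script".
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["java", "script", "runs", "in", "browser"]
print(make_ngrams(tokens, 2))
# ['java_script', 'script_runs', 'runs_in', 'in_browser']
```

Only frequent n-grams (such as "java_script") would be retained as features; rare ones are discarded by the frequency threshold.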

3.3.13 Stemming:

It refers to a crude heuristic process that chops off the ends of words in the hope of achieving
this goal correctly most of the time and often includes the removal of derivational affixes
(Manning, Raghavan and Schütze, 2008). Stemming is a process of extracting a root word; for
example, "fish," "fishes," and "fishing" are stemmed into fish (Bhargav Srinivasa-Desikan,
2018).

The Snowball stemming algorithm, implemented in the Python library NLTK and used in this study,
is an improvement over the Porter stemming algorithm. The Snowball stemming algorithm stems words
to their base form. Snowball is a small string-processing language designed for creating stemming
algorithms for use in Information Retrieval (Snowball, n.d.).

3.3.14 Lemmatisation:

It refers to doing things correctly using vocabulary and morphological analysis of words,
usually aiming to remove inflectional endings only and return the base or dictionary form of a
word, known as the lemma (Manning, Raghavan and Schütze, 2008). Like stemming,
lemmatisation is also a process of extracting a root word, but by considering the vocabulary;
for example, "good," "better," or "best" are lemmatised into good (Bokka et al., 2019). While
stemming returns a root word, a lemmatised word must be a valid dictionary word (Bokka et
al., 2019). The WordNet lemmatiser implemented in the Python library NLTK is used in this study.
An example before n-gram generation, stemming, and lemmatisation is shown below in figure 11.

Figure 11: Sample text before stemming and lemmatisation

An example after n-gram generation, stemming, and lemmatisation is shown below in figure 12.

Figure 12: Sample text after stemming and lemmatisation

    3.4 Create Dictionary and Term Document Frequency:

A dictionary is generated from the lemmatized data using the dictionary module of the Python
library Gensim. The dictionary is a list of the unique words found in the collection of documents,
with each word assigned an index value. The word-to-integer-id mapping made by the dictionary is
also referred to as a "word id" (Bhargav Srinivasa-Desikan, 2018). Dictionary-based text
categorization relies on experts assembling lists of words and phrases that are likely to indicate
a chunk of text's membership in a specific category (Debortoli et al., 2016). The dictionary
encapsulates the mapping between normalized words and their integer ids (Řehůřek, 2019).
The dictionary is used to create the Term Document Frequency for the LDA topic model input.
The Term Document Frequency is the corpus represented as word ids and their frequency in each
document.

Each document in our collection of documents is converted into a bag of words using the
doc2bow method of the dictionary created with Gensim. The result is a list of lists, where each
inner list is a document's bag-of-words representation (Bhargav Srinivasa-Desikan, 2018). The
Bag of Words model represents each text document as a numeric vector in which each dimension
is a specific word from the corpus and the value is its frequency in the document, its occurrence
denoted by 1 or 0, or even a weighted value (Sarkar, 2019).
The sample pre-processed text used to create the Term Document Frequency is shown below in
figure 13.

Figure 13: Sample pre-processed text

 Term Document Frequency (figure 14) of sample text (figure 13), generated from a dictionary.

 Figure 14: Term Document Frequency of sample text, generated from a dictionary

3.5    Run the LDA model:

There are several topic modelling algorithms. For this study, the Latent Dirichlet Allocation
(LDA) model has been selected. It is crucial to find an optimal number of latent topics suitable
to generate a Topic model based on a given corpus. The idea behind this is that a small number
of latent topics are enough to effectively represent a large corpus (Arun et al., 2010).
The computational task of the LDA algorithm is to estimate the hidden topic and word
distributions, given the observed per-document word occurrences, an estimation can be done
either via sampling approaches (e.g., Gibbs sampling) or optimization approaches (e.g.,
Variational Bayes) (Debortoli et al., 2016).
The Online LDA model implemented in the Python-based natural language processing library Gensim
is used for topic modelling in this study. The LDA model was run multiple times to tweak the
model parameters; based on the LDA model results, the text pre-processing steps were revised,
e.g., by creating custom stop words and custom punctuation handling. The initial topic model was
run with 50 topics. Afterwards, the model was run with different numbers of topics, and each
model's coherence score was calculated using the Coherence Model from Gensim; the model with the
best coherence score was selected for further analysis.

Tools and libraries used in the study are listed in Appendix-1.

Chapter 4

4 Analysis:

Upon analyzing the dataset, a total of 97,47,021 posts was found. PostTypeId is used to
determine whether a post is a question, an answer or another type of post. The Post Type and
PostTypeId mapping is shown in the table below (SEDE, 2020).

 PostTypeId                                     Post Type
 1                                              Question
 2                                              Answer
 3                                              Orphaned tag wiki
 4                                              Tag wiki excerpt
 5                                              Tag wiki
 6                                              Moderator nomination
 7                                              Wiki placeholder
 8                                              Privilege wiki
Table 2: Post Type and PostTypeId

Out of 97,47,021 posts, there are 43,37,053 (44.5%) "Question" and 53,97,998 (55.4%)
"Answer" type posts. These two types of posts contribute 99.9% of the total posts.
The remaining 11,970 (0.1%) posts are distributed among the other post types.

Figure 15: Post types count

The analysis finds the number of questions asked and answered during each year. In 2018, a total
of 19,07,440 questions were asked, of which 34.7% were answered. Similarly, in 2019, 32% of the
total 20,56,068 questions were answered. Complete data for the year 2020 is not available; the
dataset contains data until 01 March 2020, during which a total of 3,73,545 questions were asked,
with an answer rate of 29.6%.

 Year                                      Number of questions   Questions with answers (%)
 2018                                      1907440               34.7 %
 2019                                      2056068               32.0 %
 2020                                      373545                29.6 %
Table 3: Questions with Answers per Year

Analysis of the count of posts over the period, based on post type, reveals that the count of
answers posted has been continuously higher than the count of questions asked. Moreover, the gap
between question and answer type posts is continuously shrinking; in fact, in February 2020, the
count of questions crossed the count of answers. It is noted that the upward trend in the answer
count turns downwards towards the end of the dataset period, but this downward trend is based on
too short an interval of data to conclude that the upward trend in answers is short-lived. In the
figure below, the question series is turquoise blue and the answer series is navy blue; the legend
in the graph displays the colour for each PostTypeId.

Figure 16: Question Answer Ratio

Since LDA is an unsupervised topic modelling technique, it is unknown before running the
model how many topics the corpus consists of. The number of topics, represented by K,
denotes the refinement of the discovered topics. The optimal value of K is derived through the
'Topic Coherence' score; the higher the coherence, the more optimal the value of K (Verma, Sardana
and Lal, 2019). Larger values of K will produce finer-grained, more complex topics, while smaller
values of K will produce coarser-grained, more general topics (Barua, Thomas and Hassan,
2012).
Multiple models were generated to find the optimal number of topics, with 2, 8, 14, 20, 26, 32,
38 and 44 topics, and their coherence scores were compared. The CoherenceModel from the
Gensim library is used to derive the coherence score of the generated topic models. For the
study's scope of finding an optimal value of K, the number of topics to be discovered is limited
to a maximum of 50. For every generated model, the value of K was increased by six from the value
used in the previous model; the value of K used in the first model was 2.
