Topic modeling and bias analysis in a large-scale Spanish text dataset

Universitat Politècnica de Catalunya (UPC) - BarcelonaTech

       MASTER IN ARTIFICIAL INTELLIGENCE (MAI)


            Facultat d’Informàtica de Barcelona (FIB)

                 Facultat de Matemàtiques (UB)

           Escola Tècnica Superior d'Enginyeria (URV)

Author: Òscar Hernández Saborit
Advisor: Marta Ruiz Costa-jussà
Co-advisor: Quim Moré López

                         April 18, 2021
Abstract

In a world where the volume of generated data keeps doubling, the ability to efficiently
classify and analyze large amounts of data is key to success. In this thesis, we describe the
implementation of a distributed topic modelling pipeline and put it into practice against a
45TB dataset, demonstrating how HPC is essential to deal with such a quantity of data. We
show the different problems that arise when dealing with a huge amount of unsupervised
data and discuss possible solutions. We also work with word embeddings: we train
different GloVe vectors and compare them at several levels, demonstrating how gender
bias can be identified in trained embeddings and showing that gender bias is present in
all the data selected from the previous modelling task.

Contents

1 Introduction, motivation and goals

2 State of the art
  2.1 Topic modelling
  2.2 Word embeddings
      2.2.1 GloVe

3 Methodology

4 Experimental work
  4.1 Supercomputing resources
  4.2 The dataset
      4.2.1 Size
      4.2.2 Format
  4.3 Topic modelling experimentation
      4.3.1 Preprocessing pipeline
      4.3.2 Running LDA
      4.3.3 Topic modelling results and discussion
  4.4 Bias analysis
      4.4.1 Data gathering pipeline
      4.4.2 Bias analysis with GloVe
      4.4.3 Bias analysis results and discussion

5 Conclusions

Bibliography

A Word weights for LDA models
  A.1 Word weights for 10 topics
  A.2 Word weights for 25 topics

B Small datasets created
  B.1 Tested URLs from different categories
  B.2 Male-female-neutral occupations
Topic modeling and bias analysis in a large scale spanish text dataset
List of Figures

 2.1  Document topic modelling example
 2.2  Vector relations from [5]. Man-woman left, comparative-superlative right.
 2.3  GloVe computation example from [5]

 3.1  Block diagram for realized experiments

 4.1  Dataset size distribution across folders
 4.2  Parallel data processing pipeline
 4.3  Steps performed by worker nodes during parallel processing
 4.4  Word cloud generated on the final data collected
 4.5  Gathered data after the first parallel processing pipeline
 4.6  Frequency distribution of the different words in the collected data
 4.7  Word distribution across domains
 4.8  Word cloud generated on the data after filtering
 4.9  Coherence score across different topic numbers (lower is better)
 4.10 Webpage distribution across 10 topics
 4.11 Webpage distribution across 25 topics
 4.12 Confusion matrix for web classification
 4.13 Selected news domains
 4.14 Data cleaning pipeline for the news dataset
 4.15 Bias comparison with regard to model dimensionality
 4.16 Bias representation of the words in the él-ella vector
 4.17 All models' direct gender bias
 4.18 Correlation between elmundo and eldiario word gender biases
 4.19 20minutos indirect gender bias word representation, baloncesto on the left, danza on the right
 4.20 Political terms closer to PP (right-hand side) and to PSOE (left-hand side), 20minutos dataset
List of Tables

 4.1  Topic modelling objective
 4.2  Pipeline implementation differences in collected data
 4.3  Different trained GloVe models
 4.4  Direct gender bias computed on the male-female occupations list of words
Chapter 1

Introduction, motivation and goals

Every day, and consistently over recent years, an increasing amount of information is shared
worldwide on the internet. Comprehensive data about the world and its past, current and
future events and knowledge is stored every day in the form of written text.

In the late 90s, Larry Page and Sergey Brin understood the potential of making sense of
this data, which led them to found one of the most relevant companies in the world,
Google Inc. With web search and content understanding as their main income source,
companies like Google have pushed technology forward, taking advantage of hardware
generational leaps to implement more compute-intensive workloads, materializing in today's
deep learning frameworks like TensorFlow or Torch and allowing for the creation of
state-of-the-art language processing models such as BERT.

NLP, commonly understood as the collection of software developed to allow machines
to understand (and properly encode) human language, has existed as a research field for
longer than these companies. However, the original rule-based NLP systems were starting
to reach their limits, and during the last decade the NLP community has focused on new
neural-based solutions, as can be appreciated in the submissions to NLP contests like
SemEval [1].

Encouraged by this thriving ecosystem, I decided to test my skills and the available
knowledge to develop the experiments reported in this thesis. Having an enormous dataset
available, the objective was to test different techniques to classify an enormous amount of
web-crawled data, trying to understand how companies like Google process, classify and
extract information for later use.

This project has two main goals: 1. Develop a preprocessing and classification approach
for a 45TB dataset of unannotated web-crawled data from Spanish domains. 2. Further
analyse some of this data using state-of-the-art techniques, identifying the underlying bias.

This work contributes to the exploration of new techniques for analyzing massive datasets
on an extremely parallel system, all without the use of any big data framework such as
Spark or Hadoop. Additionally, current bias analysis techniques have been explored and
used to evaluate raw data directly extracted from the classified, but barely processed,
dataset.
Chapter 2

State of the art

2.1     Topic modelling
Topic modelling techniques consist of different procedures that, with the use of a variety
of information, classify a collection of given sentences or documents into groups. The idea
is to group together those documents that share properties, and to do it in an unsupervised
manner.

                     Figure 2.1: Document topic modelling example

There are different techniques and approaches for this task, but nearly all of them use,
at some point, a statistical generative model, which allows topics to be defined
automatically based on word observations, and each document to be later classified based
on the probability of each of its words belonging to a given topic. In the development of
this project, I have opted for the use of LDA (Latent Dirichlet Allocation) [2], a
generative probabilistic approach that uses the Dirichlet distribution to generate the
word-topic affinities.

Recent literature suggests using LDA in combination with word embeddings [3], as the
relations encoded in those help LDA better identify bonds between words. However, in my
experiments these variants have not been used; I have opted for a direct LDA approach.
The main reasons behind this decision are: 1. Data requirements: LDA does not need the
original sentences, word order does not matter and we are only interested in BoW, which
is ideal for simplifying an enormous corpus. This also allows me to preprocess data in a
distributed way. 2. Time saving: due to the quantity of the data, applying heavy
preprocessing would simply take too much time. I am aiming for a simpler approach.
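To make the data-requirements point concrete, the minimal sketch below runs the whole LDA workflow on a toy corpus with gensim (the package used later in section 4.3.2); the three miniature documents are hypothetical stand-ins for the per-domain BoWs.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy corpus: LDA only needs token counts per document, never the original
    # sentences, which is why a distributed BoW reduction is sufficient.
    docs = [["partido", "gol", "liga", "temporada"],
            ["producto", "compra", "precio", "tienda"],
            ["gol", "liga", "jugador", "temporada"]]

    dictionary = Dictionary(docs)                    # token <-> integer id mapping
    corpus = [dictionary.doc2bow(d) for d in docs]   # sparse (id, count) vectors

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)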

2.2     Word embeddings
Word embeddings is the general term used for the representation of words in the NLP
world. The idea behind them is that, instead of using words as simple IDs, which only
work as identifiers and cannot encode any kind of relationship between words, we use
precomputed vector representations of the words.

These vector representations, which can be of different dimensionality, allow us to compute
different relations between words. Via arithmetical operations in the vector space, we are
able to identify words that are similar in meaning (synonyms) or completely opposite
(antonyms), as well as concepts and the words related to them; we are able to process
words as meaningful data, and not just IDs. As stated in Tomas Mikolov et al. [4], this
was the necessary leap towards the creation of more capable and complex NLP models.
Nowadays, word embeddings are used extensively (and are almost indispensable) when
dealing with natural language tasks.

  Figure 2.2: Vector relations from [5]. Man-woman left, comparative-superlative right.
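As a minimal illustration of this arithmetic - with made-up vectors standing in for a trained model, so the snippet runs standalone - the classic analogy test looks like this:

    import numpy as np

    def cosine(u, v):
        # Cosine similarity: close to 1 when two vectors point the same way.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Hypothetical embedding table; real vectors would come from a trained
    # model, random ones are used here only to keep the example self-contained.
    rng = np.random.default_rng(0)
    emb = {w: rng.standard_normal(64) for w in ["rey", "reina", "hombre", "mujer"]}

    # With a well-trained model, rey - hombre + mujer lands near reina.
    candidate = emb["rey"] - emb["hombre"] + emb["mujer"]
    print(cosine(candidate, emb["reina"]))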

Although word embeddings need to be trained beforehand, they are portable. So, once a
model is trained, it can be used in different NLP tasks - as long as they share the
language -, though for some specific tasks it is also recommended to train embeddings on
a domain-specific corpus. There are multiple alternatives for training word embeddings,
all formed by different neural network topologies; the most relevant ones are word2vec [4]
- with the CBOW and Skip-gram models - and GloVe [5], which generally outperforms the
former and has been the one chosen for the reported experiments.

2.2.1    GloVe
GloVe stands for Global Vectors, and the main difference from local models like word2vec
is precisely that: the use of a global co-occurrence matrix in combination with local
co-occurrences. It is built by computing the similarity between triplets of words,
following the expression F(i,j,k) = P_ik / P_jk, where i, j, k are the compared words and
P_ik = X_ik / X_i, with X_ik being the number of times i and k appear together, and X_i
the number of times i appears in the corpus. An example can be appreciated in figure 2.3.

                     Figure 2.3: Glove computation example from [5]

Following this example, we see that when k is more similar to i than to j, we get values
greater than 1 (solid). When the opposite happens, we get values smaller than 1.
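A toy numerical version of this ratio test - with invented co-occurrence counts, mirroring the ice/steam example of [5] - makes the intuition concrete:

    # Invented co-occurrence counts X[i][k]; GloVe builds these from a corpus scan.
    X = {
        "hielo": {"sólido": 8, "gas": 1, "agua": 30, "moda": 1},
        "vapor": {"sólido": 1, "gas": 9, "agua": 32, "moda": 1},
    }

    def P(i, k):
        # P_ik = X_ik / X_i: probability of seeing k in the context of word i.
        return X[i][k] / sum(X[i].values())

    for k in ["sólido", "gas", "agua", "moda"]:
        # Ratio >> 1: k is specific to "hielo"; << 1: specific to "vapor";
        # close to 1: k discriminates neither word (like "agua" or "moda").
        print(k, round(P("hielo", k) / P("vapor", k), 2))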
Chapter 3

Methodology

The development of the project has been structured in 3 different blocks. Starting from
a more general perspective, the first block consists of analyzing and understanding the
dataset. The second and third blocks aim at more specific tasks: the former is to prove
that the initial hypothesis - that web domains can be classified by their content with the
use of BoW - is plausible, and the latter is to evaluate the gender bias present in some
of the collected data, more specifically news media. The schema in figure 3.1 shows the
different steps that have been followed.

                   Figure 3.1: Block diagram for realized experiments

  1. Dataset Exploration: The first part of the experiments consists of analyzing the
     different characteristics of the available dataset. File distribution, availability and
     format are analyzed in order to evaluate the most adequate techniques for achieving
     our objectives. Additionally, random samples of data are collected to evaluate
     possible encoding problems that will need to be corrected in the following steps.

  2. Topic modelling: The second part of the experiments is the densest one. An MPI
     code is built to preprocess the dataset and different preprocessing alternatives are
     evaluated. Issues with the different generated BoWs are discussed and the best one
     is used to feed different LDA models. The UMass score is used to evaluate the best
     number of topics for the LDA classification. Finally, a custom-built small dataset
     is used to validate part of the classification and guess the content behind the
     generated topics. Final classification results are evaluated and possible
     alternatives are discussed.

  3. Bias analysis: The third and last part of the experiments consists of a more
     in-depth analysis of one particular part of the dataset, the news webpages. The
     previously used pipeline is modified, this time to gather the full corpus of
     selected URLs of the dataset. A new preprocessing pipeline is developed to tokenize
     sentences and remove sentence repetition. Different GloVe models are built and
     evaluated. Finally, gender bias is studied by using a male/female/neutral dataset
     of professional occupations, and results are discussed.
Chapter 4

Experimental work

Following the aforementioned schema, the experiments are divided in 3 parts: 1. Dataset
analysis, 2. Topic modelling approaches and finally 3. Bias analysis. But first, we will
introduce the supercomputing resources we have used to carry out the experiments.

4.1     Supercomputing resources
We would not have been able to process such an amount of data without the supercomputing
resources provided by the Barcelona Supercomputing Center (BSC) [7]. First, they provided
enough storage capacity to handle the 45TB dataset - plus all the extra data collected in
the experiments - via a distributed parallel filesystem (GPFS). Secondly, access to 2
machines with different architectures was also granted: Marenostrum4 and CTE-Power9.

       GPFS: the General Parallel File System. With approximately 15PB of storage
       capacity, this shared filesystem allows the computing nodes of all machines in the
       center to share data on disk, which is extremely useful to avoid setting up more
       complex big data frameworks such as Hadoop or Spark.

       Marenostrum4: general purpose cluster (no GPUs). Formed by 3456 nodes, each
       containing 48 CPUs and up to 380GB of RAM. [8]

       CTE-Power9: GPU cluster. Formed by 50 nodes, each containing 160 CPUs and 4
       NVIDIA V100 accelerators (specially designed for AI pipelines), with up to 580GB
       of available RAM. [9]


4.2     The dataset
Extracting useful information from this dataset has been the main objective of the
project, so we should devote some time to analyzing its size, form and content.

It was generated as a result of a massive crawl done by the BNE (Biblioteca Nacional de
España) [10], a crawl with the intention of recording every .es domain website, in order
to have a snapshot of the available information contained in the websites aimed at the
Spanish public. We do not have more details on the crawling implementation, but we will
share our findings after some simple data exploration.

4.2.1    Size
The dataset contains a total of 507689 files, ranging in size from 1MB to several GBs.
All these files are spread across different directories, with no apparent relation
whatsoever. In figure 4.1 we can appreciate this uneven data distribution by observing
the difference in size of the data folders, which contain on average 500GB of data, but
with some folders holding up to 3.7TB.

                   Figure 4.1: Dataset size distribution across folders

The total size of the dataset is around 45TB, making clear that trying to process all
this data is the first challenge we will have to overcome.

4.2.2    Format
With regard to the form of the data, as mentioned before, it is spread across 507689
JSON files. The JSON data contains the following keys:

url: full URL the info has been extracted from
p: text (and also HTML directives) contained in the URL
heads: headers or links of the given webpage (not always present)
keywords: keywords set to identify the web (not always present)

The 2 keys of interest for this work are 1. url and 2. p. They contain the webpage
inspected, as well as the textual content we will use for the analysis. As we will later
showcase, the data contained in “p” comes in a really dirty state, and substantial
preprocessing will be needed to obtain a clean corpus.

Regarding file organization, there is none. Each JSON file may contain data for a single
URL or multiple URLs. Furthermore, the same URL or data can appear in multiple JSON files
at the same time.
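A minimal reading sketch, assuming one JSON object per line (the actual per-file layout is not documented, so this may need adapting; the file name is a placeholder):

    import json

    def iter_records(path):
        # Assumption: one JSON object per line; adapt if files hold a single array.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                rec = json.loads(line)
                # Only "url" and "p" are relied upon; "heads"/"keywords" may be absent.
                yield rec["url"], rec.get("p", "")

    for url, text in iter_records("data/sample.json"):
        print(url, text[:80])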

4.3     Topic modelling experimentation
Our first objective was to extract relevant information from the files, and use that
information to classify each of the domains present in the data. The ideal result after
this step would be a list of all domains present in the dataset, classified into their
corresponding area (news, sports, shop, company, blog...), as illustrated in table 4.1.

The complexity of this task, though, lay in the quantity and quality of the available
data. As shown in section 4.2, the quality is poor, and the quantity is overwhelming.
So, the first step was to create code designed for a large-scale system (specifically
Marenostrum4) that was able to process the available data in a timely and distributed
manner. The literature led me to prepare the data to run LDA (Latent Dirichlet
Allocation) [2], as it would allow me to reduce the initial data by only storing a bag of
words (BoW) for each of the analysed domains, significantly reducing the size of the
initial dataset (45TB).

                      Web domain                     Topic (Class)
                      https://elperiodico.es         News
                      https://mundodeportivo.es      Sports
                      https://decathlon.es           Shop
                      ...                            ...

                          Table 4.1: Topic modelling objective

4.3.1    Preprocessing pipeline

With half a million files, trying to process all the data using a single thread could
have taken months. As presented in section 4.2.1, we are dealing with dense files,
unevenly distributed across the folders that form our dataset. Even running on a full
Marenostrum4 node could take weeks, so it was clear that a multi-node MPI implementation
was needed.

                       Figure 4.2: Parallel data processing pipeline

Figure 4.2 shows the final structure of the implemented code. We can appreciate how the
master orchestrates the file distribution to the workers, which continuously process data
until the end of the pipeline, when all data is dumped to disk. Additionally, we can
observe how the pipeline has been built with a fault-tolerant approach, as it continuously
stores the processed data as checkpoints to avoid losing all progress in case of failure.
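A minimal mpi4py sketch of such a master/worker scheme, under the assumption that this is how the loop can be organized (it is not the project's actual code): process_file and the file list are placeholders, and checkpointing is reduced to a comment.

    from mpi4py import MPI

    def process_file(path):
        return {"path": path}        # stand-in for the cleaning steps of figure 4.3

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    files = [f"data/part_{i}.json" for i in range(1000)]   # placeholder file list

    if rank == 0:
        # Master: a worker asks for work by sending its previous result
        # (None on its very first request); the reply is the next file,
        # or None once every file has been handed out.
        pending, stopped = list(files), 0
        while stopped < size - 1:
            status = MPI.Status()
            result = comm.recv(source=MPI.ANY_SOURCE, status=status)
            if result is not None:
                pass                 # checkpoint `result` to disk here (fault tolerance)
            if pending:
                comm.send(pending.pop(), dest=status.Get_source())
            else:
                comm.send(None, dest=status.Get_source())
                stopped += 1
    else:
        comm.send(None, dest=0)      # first request for work
        while True:
            path = comm.recv(source=0)
            if path is None:
                break
            comm.send(process_file(path), dest=0)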

          Figure 4.3: Steps performed by worker nodes during parallel processing

Focusing now on the per-file preprocessing pipeline, figure 4.3 shows the different steps
performed in order to clean and organize the data (a minimal sketch of these steps follows
below):

  1. Extract the root URL: most of the dataset URLs are of the form
     https://elperiodico.es/news/new0001.html; as we want to classify pure domains, such
     as elperiodico.es, we use regex rules to extract the relevant part of the text.
  2. Clean the text: the data is dirty (it contains several HTML tags and symbols), so
     we use an HTML library to extract clean text.
  3. Filter by language: we will focus on analyzing only Spanish text, so we filter out
     all the data that is not correctly identified as Spanish.
  4. Tokenize the text and remove all stopwords.
  5. Count the number of appearances of each token, and add the counts to the
     corresponding web domain.
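A per-record sketch of these five steps; BeautifulSoup and langdetect are stand-ins for whichever HTML and language-identification libraries the real pipeline used, and the stopword set is truncated for brevity:

    import re
    from collections import Counter
    from urllib.parse import urlparse

    from bs4 import BeautifulSoup       # step 2: HTML stripping (assumed library)
    from langdetect import detect       # step 3: language id (assumed library)

    STOPWORDS = {"de", "la", "que", "el", "en", "y", "a", "los", "se", "no"}

    def clean_record(url, raw_html):
        # Step 1: reduce the full URL to its root domain.
        domain = urlparse(url).netloc.removeprefix("www.")
        # Step 2: drop HTML tags and keep only visible text.
        text = BeautifulSoup(raw_html, "html.parser").get_text(" ")
        # Step 3: keep Spanish text only.
        try:
            if detect(text) != "es":
                return domain, Counter()
        except Exception:               # langdetect raises on empty/odd input
            return domain, Counter()
        # Step 4: tokenize and remove stopwords.
        tokens = [t for t in re.findall(r"[a-záéíóúüñ]+", text.lower())
                  if t not in STOPWORDS]
        # Step 5: per-domain token counts (merged into the domain's BoW upstream).
        return domain, Counter(tokens)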

Three different approaches were considered for carrying out this task.

  1. First implementation: This first pipeline implementation ran as expected. We
     submitted a Marenostrum job requesting 10 nodes (480 tasks); a single master task
     was able to correctly handle the 479 worker tasks preprocessing the data.
     The job took approximately 2 days.

     However, the generated data was not good enough for running LDA. As can be
     appreciated in table 4.2, for the first preprocessing pipeline the vocabulary size
     was more than 189 million words. That exposed some issues: 1- a vocabulary too big
     to be handled in a timely manner (also considering memory restrictions); 2- a huge
     number of words for the Spanish language.

     A deeper inspection of the gathered data also revealed a couple of problems that
     needed to be addressed. 1- The data was not clean enough: the pipeline was still
     letting through several non-Spanish words, which increased the vocabulary size
     without providing useful information for the context of study. 2- There was a
     problem with the dataset: some of the text was wrongly stored, joining the end of
     some sentences with the start of new ones. That resulted in a lot of fake compound
     words being added to the vocabulary.

  2. Second implementation: After acknowledging the flaws of the first try, we decided
     to correct them with slight modifications to the initial pipeline. For this second
     try, we tried to correct the size of the vocabulary by adding 2 extra processes to
     the pipeline:

     (a) Word stemming: With the purpose of generalizing words, we took the approach
         of extracting the root of each word. In a traditional pipeline, we would have
         gone for lemmatization, as it would have provided more accurate results;
         however, none of the lemmatization solutions were fast/good enough to be run
         inside the pipeline.

     (b) Counting only repeated words: Apart from stemming, to solve the fake compound
         words problem mentioned above, we opted for keeping only the words that
         appeared more than 10 times in a given domain. This way we would discard all
         those fake words that appeared randomly in the dataset.

                          1. Raw data   2. Stemming + rep.   3. Dict. filtering
     Dictionary words       189709389              3205200              1763567
     Dictionary size           5300MB                 73MB                 42MB
     Data collected           52000MB               2500MB               5800MB

           Table 4.2: Pipeline implementations differences in collected data

     As can be observed in table 4.2, after these new additions to the pipeline the
     collected data was significantly reduced. The data was now manageable enough to run
     LDA; however, the word repetition filter was far from optimal. The data collected
     from each domain was heavily reduced, as we were only keeping repeated words, and
     we were still dealing with foreign-language terms.

  3. Final implementation: Finally, in order to address all the flaws mentioned earlier,
     a definitive approach was taken. Given the bad quality of the data, we opted for
     building a dictionary of valid Spanish words, and working only with the words that
     appeared in it.

     As we did not find any reliable compiled dictionary covering the Spanish lexicon,
     I decided to build one myself. I developed a crawler to look for words on the
     website of the Real Academia Española [11]. The dictionary contains all the
     gathered RAE words, and it is able to map each Spanish verbal form to its
     infinitive. To cope with plurals, we also added some flexibility to the dictionary,
     allowing for variable letters at the end of each word, with a small risk of
     introducing unknown words (a minimal lookup sketch is shown at the end of this
     subsection).

     In table 4.2, we can appreciate how the dictionary of this new approach was
     smaller, as all the non-Spanish words were avoided. On the other hand, the size of
     the extracted corpus is greater than in the previous approach, as we removed the
     word repetition limitation. In figure 4.4, words are represented proportionally to
     their frequency.

     With that data collected, we were ready to start experimenting with LDA and the
     topic modelling task.

              Figure 4.4: Word cloud generated on the final data collected
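A minimal sketch of how such a dictionary lookup with flexible endings could work; RAE_FORMS is a tiny hypothetical excerpt of the crawled form-to-lemma table:

    # Hypothetical excerpt: every gathered surface form maps to its canonical
    # entry, with verb conjugations pointing to the infinitive.
    RAE_FORMS = {"casa": "casa", "corre": "correr", "corriendo": "correr"}

    def normalize(token, max_trim=2):
        # Try an exact hit first; then retry with up to `max_trim` trailing
        # letters removed, which absorbs plurals (casas -> casa) at a small
        # risk of accepting unknown words, as noted above.
        for cut in range(max_trim + 1):
            stem = token[:-cut] if cut else token
            if stem in RAE_FORMS:
                return RAE_FORMS[stem]
        return None        # not recognized as Spanish: excluded from the BoW

    print(normalize("casas"))       # casa
    print(normalize("corriendo"))   # correr
    print(normalize("zzz"))         # None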

4.3.2    Running LDA
Before running the algorithm, we decided to inspect the main features of the collected
dataset and understand it. Extracting topics from random information collected from web
pages would not be an easy task, and knowing the dataset would help us work better with
the data.

            Figure 4.5: Gathered data after the first parallel processing pipeline

Figure 4.5 represents the collected data: each domain is represented by the words that
appear in it, together with the number of times each word is repeated. If we look at
figure 4.6, the frequency distribution of the collected data, we can appreciate how half
of the words appear fewer than 10 times, a fact that already tells us that most of the
words will not be decisive for topic modelling.

      Figure 4.6: Frequency distribution of the different words in the collected data

Further analyzing the collected dataset, in figure 4.7 we can appreciate the word
distribution across domains. As expected, few words appear in more than 10% of the
domains, and those words are not very informative (such as página, encontrar, nombre,
mostrar, web, poder...).

                      Figure 4.7: Word distribution across domains

After analyzing the collected data, different versions of the LDA algorithm were
submitted, playing with different filters to make the data fit into the Power9 node
memory (580GB), with the objective of obtaining useful results.

We used the LDA implementation from the gensim [12] Python package, which also comes with
a coherence evaluation method that we used to evaluate the quality of the topics. After
several failed attempts, and based on figures 4.7 and 4.6, we ended up filtering our BoW
to contain only words that fulfilled the following conditions: 1. appear in less than 20%
and more than 0.1% of the documents; 2. have at least 100 occurrences in the full corpus.
The resulting corpus after this filtering contained a vocabulary of 32223 words, 2% of
the total 1763567. The remaining words can be appreciated in figure 4.8; compared to the
initial word cloud in 4.4, most common and uninformative words like poder, hacer, web,
si... have been removed from the corpus.
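In gensim terms, this two-condition filter can be sketched as below; `domain_tokens` (one token list per web domain) is an assumed variable, and the thresholds follow the text:

    from gensim.corpora import Dictionary

    # `domain_tokens` is assumed: one token list per web domain (the BoW inputs).
    dictionary = Dictionary(domain_tokens)

    # Condition 1: keep words in less than 20% and more than 0.1% of documents.
    dictionary.filter_extremes(no_below=int(0.001 * dictionary.num_docs),
                               no_above=0.20, keep_n=None)

    # Condition 2: at least 100 occurrences in the full corpus (collection freq).
    rare = [tid for tid, cf in dictionary.cfs.items() if cf < 100]
    dictionary.filter_tokens(bad_ids=rare)
    dictionary.compactify()

    corpus = [dictionary.doc2bow(tokens) for tokens in domain_tokens]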

Additionally, we also decided to focus on the domains with a vocabulary size of at least
50 words, cleaning from the corpus all those domains without enough information to be
relevant for topic modelling. After removing these domains, the total number of domains
was reduced from 4737799 to 1362194 (28%).

               Figure 4.8: Word cloud generated on the data after filtering

4.3.3    Topic modelling results and discussion
Having filtered the data, we had to choose the right number of topics to request from the
algorithm. As we had no idea what to expect, we let the data guide us through this
decision. Using the UMass topic coherence algorithm as score - it is implemented in
gensim, and we found interesting literature about it [13] - we ran LDA requesting from 5
to 55 topics. With lower scores being better for the coherence values, figure 4.9 clearly
shows 25 as the best number.

   Figure 4.9: Coherence score across different topic numbers (lower is better)
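The coherence sweep can be reproduced with gensim's CoherenceModel, reusing the corpus and dictionary from the filtering sketch above; following the evaluation in the text, the lowest u_mass score is taken as best.

    from gensim.models import CoherenceModel, LdaModel

    scores = {}
    for k in range(5, 56, 5):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
        cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                            coherence="u_mass")
        scores[k] = cm.get_coherence()

    best_k = min(scores, key=scores.get)   # lower is treated as better here
    print(best_k, scores[best_k])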

Another interesting way of judging the clustering is by analyzing the most weighted words
in each of the topics. The annex sections A.1 and A.2 show the word weights for the 10-
and 25-topic executions. At a quick glance, the first thing that draws attention is the
low weight of each of the words - something expected, due to the huge diversity of
content web pages may show -. But taking a closer look, we can identify words that hint
at the main content of the webpage. We see together words like "jugar", "club",
"partido", "temporada" that clearly refer to sports-related content, as well as
"producto", "mas", "diseño", "venta", "compra" that clearly point to web portals
specialized in selling things.

Further analyzing the 25- and 10-topic models, we classify each of the 1362194 webpages
to evaluate how they are distributed across the different topics; see figures 4.11 and
4.10. It can be appreciated how they follow a similar pattern - webpages with no clear
topic are assigned to topic 0 -, in which there is a huge majority with no clear topic,
while the other pages are fairly distributed across the remaining topics.

                   Figure 4.10: Webpage distribution across 10 topics

                    Figure 4.11: Webpage distribution across 25 topics

As there is no annotated dataset to test our classification, we decided to build our own
test set, using known webpages whose content we know and which we considered should share
a topic. As we have a special interest in press articles, we decided to check whether the
classification was able to correctly distinguish between sports press and general press;
apart from that, we also added some extra websites that we thought could form their own
clusters, such as forums, housing and technology stores. The URLs used are in Annex B.1;
figure 4.12 summarizes how the URLs were classified.

                    Figure 4.12: Confusion matrix for web classification

It is clear that, at least for news, sports and stores, there is a clear topic that
correctly identifies most of the webpages. For the other 2 domain fields it is not as
clear, but for a deeper analysis, more websites should be added to this test.

The analysis performed proves that it is possible to classify web content massively.
However, in our study a lot of information had to be purged along the way, resulting in a
final topic segmentation that, although it classified many webpages, also left most of
them unclassified (see the size of topic 0 in 4.11). So, it is important to look back at
the full pipeline, analyze the problems of this technique, and consider how we could have
addressed the problem to obtain better results.

  1. Initial hypothesis: The reported work was developed under the idea that all web
     pages could be classified based on their content. However, several analyses
     developed during the experiments proved that, at least for the information present
     in our dataset, there is a high imbalance: some domains, such as "lavanguardia.com",
     contain a huge amount of data on different topics, while others contain under 100
     words, just showing a random internet message about cookies.

     That makes 2 things clear. 1. We cannot expect to generalize web content with the
     internet itself being that unbalanced; some kind of extra filter could have been
     applied to the initial pipeline to reduce the number of webpages to classify, while
     at the same time improving the general quality of the data. 2. A safer initial
     approach would have been to classify content inside a single "news" webpage, as it
     contains several articles of similar form on different topics.

  2. Word filtering: The technique applied for reducing the size of the collected
     vocabulary - word filtering based on Spanish dictionary words - has some flaws that
     unavoidably reduce the quality of the gathered data. By doing that, we lose a lot
     of names that could be present in the collected corpus: people, places and entities
     are completely removed from the corpus, making it more difficult to extract
     commonalities between different web domains, apart from the possible typos or
     language variations that are lost.

     Future work should involve alternatives like subword tokenization, with techniques
     such as Byte Pair Encoding, in which the vocabulary is reduced based on frequency
     while maintaining most of the meaning (a sketch follows after this list).

  3. Alternatives to LDA: For this work, the LDA method was chosen due to its ability
     to work on BoW. However, some alternatives, such as the combination with word
     embeddings, could have been considered. Embeddings would have allowed us to
     aggregate more uncommon words, classifying the data better.

     The objective of this work was to make the processing of those 45TB of data as
     efficient as possible, and reducing the data to BoW was something that could be
     done in a massively parallel way. However, looking at the results, in future
     approaches I would also try some alternatives to enable more complex processing
     algorithms.

     For example: train GloVe on a small portion of the dataset, later generate
     embeddings for the whole data, and work with clustering techniques such as k-means
     for cluster/topic identification.
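As a sketch of the subword-tokenization alternative raised in the word-filtering point above, a BPE vocabulary can be trained with the HuggingFace tokenizers package (the corpus file name is a placeholder):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # vocab_size caps the dictionary: frequent words survive whole, while rare
    # words (names, typos, inflections) decompose into shared subword pieces
    # instead of being dropped, as dictionary filtering does.
    trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
    tokenizer.train(files=["corpus_es.txt"], trainer=trainer)   # placeholder path

    print(tokenizer.encode("malversación").tokens)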

4.4     Bias analysis

After successfully completing the first task of web classification, we decided to further
inspect the content of the webpages classified under a given domain, more specifically the
press/news webpages.

So, we opted for selecting different news webpages classified as press, and working with
the raw content we could extract from them. The objective of this second phase, then, was
to evaluate techniques to extract bias from the corpus used, more specifically gender bias.

4.4.1    Data gathering pipeline

As there were lots of webpages classified as press, in order to decide which news webpages
to gather, I checked for the most popular ones within Spain [14], ending with the ones
represented in figure 4.13.

                           Figure 4.13: Selected news domains

The pipeline used was really similar to the one used in the topic modelling task.
However, this time we just filtered data from the URLs above, directly collecting raw
content from the original BNE dataset, without any tokenization or word filtering.

As we were dealing with raw data, after the pipeline run we had gathered 236GB of data.
To further clean this data, after the first massively parallel pipeline, a second task
is executed to 1. separate the data for each of the targeted domains, one sentence per
row, and 2. avoid sentence repetition within the corpus. These tasks are represented in
figure 4.14.

                Figure 4.14: Data cleaning pipeline for the news dataset
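A sketch of these two steps for one domain (file names are placeholders; the exact-match set makes deduplication simple, though for corpora of this size hashing each sentence would save memory):

    import re

    seen = set()
    with open("20minutos_raw.txt", encoding="utf-8") as src, \
         open("20minutos_clean.txt", "w", encoding="utf-8") as dst:
        for line in src:
            # Naive split on sentence-final punctuation; a real pipeline would
            # use a sentence tokenizer aware of abbreviations ("Sr.", "EE.UU.").
            for sent in re.split(r"(?<=[.!?])\s+", line.strip()):
                sent = sent.strip()
                if sent and sent not in seen:   # one sentence per row, no repeats
                    seen.add(sent)
                    dst.write(sent + "\n")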

4.4.2    Bias analysis with GloVe
We used the collected news media data to train different GloVe models. These models are
later used in different experiments regarding bias analysis. There was no filtering of
the input data, as the idea was to expose the underlying bias present in models trained
on raw data. All GloVe models were trained with different dimension sizes, and the
characteristics of the models finally used in the experiments are detailed in table 4.3.

                              20minutos      europapress    eldiario       elmundo
     Dataset size             2.2G           9.7G           2.0G           1.5G
     Vocabulary               849084         1116404        647082         743632
     Embedding dimensions     64, 128, 256   64, 128, 256   64, 128, 256   64, 128, 256

                         Table 4.3: Different trained glove models

The first experiment consisted of analyzing direct gender bias in words related to
professional occupations. The idea was to compare masculine/feminine equivalent jobs and
evaluate their direct bias. A handcrafted list of equivalent occupations was built,
similarly to Basta et al. [16]; the list is available in Annex B.2 and takes a random
sample of existing occupations (72 samples). Additionally, some non-gendered occupations
were also added to the dataset. In order to compute the bias of each word, we followed
Bolukbasi et al. [15], computing the cosine similarity between each selected word (w) and
the gender direction he-she (g), which in this experiment is defined with the words
[él, hombre, niño] as male representatives and [ella, mujer, niña] as female
representatives.
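A numpy sketch of this computation; emb is an assumed word-to-vector dict loaded from one of the trained GloVe models, the sign of the projection depends on the order of the subtraction, and the same code serves the later baloncesto-danza and PP-PSOE axes by swapping the definition words.

    import numpy as np

    def unit(v):
        return v / np.linalg.norm(v)

    # emb: word -> vector, assumed loaded from one of the trained GloVe models.
    male = np.mean([emb[w] for w in ["él", "hombre", "niño"]], axis=0)
    female = np.mean([emb[w] for w in ["ella", "mujer", "niña"]], axis=0)
    g = unit(male - female)       # gender direction; swap words for other axes

    def projection(word):
        # Cosine similarity between the word and the axis; the farther from 0,
        # the more the axis weighs in the word's representation.
        return float(np.dot(unit(emb[word]), g))

    occupations = ["médico", "médica", "periodista"]    # sample from Annex B.2
    direct_bias = np.mean([abs(projection(w)) for w in occupations])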

The second experiment consists of analyzing indirect gender bias, again using the
occupation words. The idea of this experiment is to replace the él-ella axis with a
different one that does not directly imply gender, but sports: the selected words were
[baloncesto] and [danza]. We want to evaluate whether we see representations similar to
the previous experiment - hence, whether the bias is still apparent using these
apparently non-gendered words.

As we are dealing with press of different political ideologies, a third experiment was
made, in which we project some politics-specific words onto the direction of the words
PP - PSOE, the historically most important political parties in Spain. The objective of
this experiment is to evaluate whether some words are directly related to political
parties, and how the embeddings relate them.

4.4.3    Bias analysis results and discussion

The first experiment for bias analysis was performed on the models presented in table
4.3: Europapress.com, 20minutos.es, eldiario.es and elmundo.es. In order to get an idea
of the direct gender bias present in the whole dataset, feminine, masculine and total
bias were computed for each of the data sources. Direct gender bias was computed by
taking only the words in the desired subset (masculine, feminine or all) and averaging
the bias of all the words in the corresponding group. In order to relate model complexity
with bias, this analysis was performed across all model sizes (64, 128 and 256). The
results of these bias analyses are represented in table 4.4.

         Dim. 64               20minutos   europapress   eldiario   elmundo
         Male occupations        0.19970       0.08391    0.15526   0.19545
         Female occupations      0.22415       0.15424    0.21552   0.19385
         Full list               0.19442       0.11281    0.17709   0.17779

         Dim. 128              20minutos   europapress   eldiario   elmundo
         Male occupations        0.12165       0.07164    0.12853   0.14215
         Female occupations      0.19320       0.09363    0.17519   0.16539
         Full list               0.14552       0.07637    0.14298   0.14044

         Dim. 256              20minutos   europapress   eldiario   elmundo
         Male occupations        0.08130       0.06020    0.08707   0.10099
         Female occupations      0.14827       0.07568    0.13854   0.13202
         Full list               0.10654       0.06263    0.10510   0.10495

  Table 4.4: Direct gender bias computed on the male-female occupations list of words

Table 4.4 clearly shows how the bias is systematically higher for the female occupations.
As we are dealing with gendered occupations, some gender component is expected, but these
numbers show that female occupations tend to be more gender-loaded than male ones. As we
are dealing with random samples collected from the different press media, we cannot make
any safe assumption about one medium being more biased than another.

           Figure 4.15: Bias comparison with regard to model dimensionality

Comparing model dimensionalities, it is also observable how, by increasing the
dimensionality of the model, we consistently obtain a less biased model (see figure
4.15). So, it seems that when increasing the complexity of the model, gender bias is
reduced in favour of other data relations. Finally, it can also be appreciated how the
medium with the biggest dataset is the least biased one; more analysis should be done in
that regard, but it gives us a hint of how increasing the dataset can help reduce biases.
For the next experiments, we will focus on the 256-dimension models, as they are the most
likely to be used in real applications.

Following the first experiment, a direct comparison between occupations was made. The
following figures represent the distances, with the X-axis representing the projection of
each word onto the gender direction. As the values were computed as he-she, positive
values show the words that are closer to the feminine form, and negative values the ones
closer to the masculine. The further a word is from the neutral point (or no-correlation
point, 0), the more biased this word is towards the corresponding gender.

             Figure 4.16: Bias representation of the words in the él-ella vector

Figure 4.16 represents some of the words studied. As can be appreciated in the colors -
yellow for men-related, purple for women-related, and blue for neutral - the majority of
the words are correctly placed on their corresponding gender side, and the neutral words
(non-gender-oriented occupations) lie between both extremes.

The ideal behaviour for a non-biased model would consist of equivalent gender bias for
women- and men-related professions - as some occupations have a semantic female relation,
it is clear that they will be closer to "ella" than to "él", and the same in the opposite
situation -. However, apart from the results shown in 4.4, by looking at figure 4.16 we
can appreciate some interesting facts: 1. The words "física" and "informática" are on
the male side, which clearly shows that these attributes tend to be man-related, even
when describing women's occupations. 2. The word "técnica" is really close to the null
relation with gender, a fact showing that this word is barely related to a profession
and probably encodes a different meaning. 3. Neutral words like "economista",
"oficinista", "oficial" are directly attributed to men, and "gerente", "periodista",
"forense" are attributed to women. 4. In general, female occupations tend to be more
gender-biased than male ones, meaning that gender has more weight in their vector
representation.

                        Figure 4.17: All models direct gender bias

Figure 4.17 shows all the plots side by side. A similar behaviour is observed, with
similar words in scenarios like the ones commented for the previous plot. Looking at all
the plots, it is clear that there is a general bias towards female terms, which are
always further from the neutral point. Figure 4.18 reaffirms this observation by showing
the correlation between the eldiario and elmundo models. It can be appreciated how the
majority of the words stay in their quadrant, and how the blue (non-gendered) words stay
quite centered on the diagonal.

        Figure 4.18: Correlation between elmundo and eldiario word gender biases

The second experiment intends to analyze indirect bias. As explained in the previous
section, we use words that are not directly él-ella, but that are culturally related to
the gender terms - in this case, the sports mentioned in the previous section. The
results of this experiment can be appreciated in figure 4.19, where - compared to the
initial 20minutos plot - there is no such clear line separating gender words. However,
we can still see how female-oriented words are systematically on the "danza" side
(female words are always on the right) and male-oriented words on the "baloncesto" side.

Figure 4.19: 20minutos indirect gender bias word representation, baloncesto on the left,
danza on the right

One last experiment tries to represent political terms in the PP-PSOE projection. As a
curiosity, it locates the political party that was in government, and the one that was
in opposition, at the dates the information was gathered. But as can be appreciated,
there are some badly connoted words like "corrupción", "trama", "malversación"... that,
even when changing the axis words to socialista-popular (which apparently have no
relation with politics), stay attached to the same direction. It is clear that
embeddings are highly sensitive to the data, and the bias (or knowledge) in the data
gets directly translated into the words.

Figure 4.20: Political terms closer to PP (right-hand side) and to PSOE (left-hand side),
20minutos dataset

The developed experiments allow some clear statements regarding the models built on these
datasets. The first experiment shows that gathering data and training models without any
control over content will likely end in a gender-biased model - 4 out of 4 models showed
a higher female bias. Society has been male-centered for decades, and written content is
no exception. It has been shown that the learnt bias in professional words is consistent
across media, with fig. 4.18 clearly showing the correlation between 2 press media of
opposite political ideologies. The second experiment showed that this influence is
present not only in direct comparisons against the gender projection, but also in the
sports field (and probably in many more fields that have not been tested). Finally, in
the last experiment, we have shown that some "good" or "bad" terms can become directly
related to entities, such as the relation of PP to corruption, which can later turn into
"popular" relating to corruption. That makes clear that running named entity recognition
on datasets may help avoid spurious concept associations with words.
Chapter 5

Conclusions

During the development of this work, several decisions have been taken; sometimes these
decisions were right, but others were not. In the following lines I will try to summarize
the different objectives achieved, as well as to suggest in which directions the research
could continue.

With regard to the topic modelling task, as exposed in section 4.2 and the subsequent
experiments and analysis, the quality of the data was really low, and substantial
preprocessing was needed to achieve the proposed objectives. After some failed attempts,
we can say that the objective of a first general data classification has been
accomplished, but only partially. We have shown that we are able to classify the
principal [press, sports, stores] webpages, but there are also many webpages that we have
not been able to classify. Future work: a more complete analysis - with human-supervised
datasets - should be made to determine the precision of the presented method. To achieve
decent results, several decisions were made along the way, such as filtering the words
based on a dictionary, or directly aggregating the data into BoW; these decisions have
led to the loss of a lot of information. In order to fully complete and validate the
method, the other discussed procedures could be implemented and compared with the
proposed model.

From the bias analysis task we have extracted a lot of information. The first point to
highlight is the capacity of GloVe for extracting structure from a massive unprocessed
corpus. Just by separating the data into sentences, we have been able to build 4
competent word embedding models, and we have shown they are able to correctly distinguish
between male and female occupations. In the bias analysis we have also shown how, with
fewer dimensions, the gender bias component tends to be more prominent. We have shown how
4 datasets that differ in size and political ideology all tend to be more biased when
dealing with female occupations, so we could make the assumption that the Spanish press,
like all other media nowadays, is gender-biased - no political/gender-bias correlation
has been found. We have confirmed this by replicating the gender direction with two
neutral words, baloncesto and danza. Finally, we have shown some of the problems of
training embeddings without supervision, as the word "popular" from Partido Popular has
encoded deep relations with terms such as "corrupción". Future work: to complete this
experimentation, further experiments with non-gendered word clustering could be made, as
suggested in the related literature. For additional experimentation, a study of debiasing
techniques could also be applied to the different media collected.

Personally, it has been a good resilience test. Positive results never come at the first
try, and persistence and observation have been key to achieving the objectives initially
proposed in this thesis. During the development of the thesis, I have continuously
expanded my knowledge of the field, realizing along the way that some decisions were not
the most appropriate; but overall, I am proud of the final result.
Bibliography

[1] SemEval, International Workshop on Semantic Evaluation. https://semeval.github.io/

[2] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. Latent Dirichlet Allocation
    (January 2003). University of California, Berkeley, CA

[3] Christopher Moody. Mixing Dirichlet Topic Models and Word Embeddings to Make
    lda2vec. Stitch Fix One Montgomery Tower, Suite 1200 San Francisco, California
    94104, USA https://arxiv.org/pdf/1605.02019.pdf

[4] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word
    Representations in Vector Space https://arxiv.org/pdf/1301.3781.pdf

[5] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors
    for Word Representation. Computer Science Department, Stanford University,
    Stanford, CA 94305

[6] Thushan Ganegedara. Intuitive Guide to Understanding GloVe Embeddings.
    https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010

[7] Barcelona Supercomputing Center. Centro Nacional de Supercomputación. España,
    Cataluña, Barcelona. https://www.bsc.es/

[8] Marenostrum4 cluster (BSC-CNS) User Guide.
    https://www.bsc.es/support/MareNostrum4-ug.pdf

[9] Power9 GPU cluster (BSC-CNS) User Guide.
    https://www.bsc.es/support/POWER_CTE-ug.pdf

[10] Biblioteca Nacional de España http://www.bne.es/es/Inicio/index.html

[11] Real Academia Española https://www.rae.es

[12] gensim, topic modelling for humans https://radimrehurek.com/gensim/


[13] F. Rosner, A. Hinneburg, M. Roder, M. Nettling, A. Both. Evaluating topic coherence
    measures.
    https://www.researchgate.net/publication/261101181_Evaluating_topic_coherence_measures

[14] Kadaza, web pages popularity ranking portal. https://www.kadaza.es/noticias

[15] Bolukbasi T, Chang KW, Zou JY, Saligrama V, Kalai AT (2016) Man is to computer
    programmer as woman is to homemaker? debiasing word embeddings. In: Lee DD,
    Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in Neural Information
    Processing Systems 29, Curran Associates, Inc., pp 4349–4357

[16] Basta, C., Costa-jussà, M.R., Casas, N. Extensive study on the underlying gender
    bias in contextualized word embeddings. Neural Comput Applic 33, 3371-3384 (2021).
    https://doi.org/10.1007/s00521-020-05211-z

[17] Hila Gonen, Yoav Goldberg. Lipstick on a Pig: Debiasing Methods Cover up
    Systematic Gender Biases in Word Embeddings But do not Remove Them.
    https://arxiv.org/abs/1903.03862

[18] Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. “Semantics derived auto-
    matically from language corpora contain human-like biases.” Science 356.6334 (2017):
    183-186. http://opus.bath.ac.uk/55288/
Appendix A

Word weights for LDA models

A.1    Word weights for 10 topics
Topic: 0
Words: 0.002*"producto" + 0.001*"alta" + 0.001*"comprar" + 0.001*"diseño"
+ 0.001*"base" + 0.001*"disponible" + 0.001*"agua" + 0.001*"fácil" + 0.001*"mercado" +
0.001*"alto"
Topic: 1
Words: 0.002*"mayo" + 0.001*"josé" + 0.001*"julio" + 0.001*"diciembre"
+ 0.001*"cultura" + 0.001*"juan" + 0.001*"junio" + 0.001*"abril"
+ 0.001*"entrada" + 0.001*"octubre"
Topic: 2
Words: 0.002*"siguientes" + 0.002*"gestión" + 0.001*"formación" + 0.001*"actividad" +
0.001*"profesionales" + 0.001*"facilitar" + 0.001*"electrónico" + 0.001*"proceso" +
0.001*"sector" + 0.001*"recursos"
Topic: 3
Words: 0.002*"veces" + 0.001*"pues" + 0.001*"compartir" + 0.001*"ayudar" +
0.001*"escribir" + 0.001*"tema" + 0.001*"libro" + 0.001*"algún" + 0.001*"verdad" +
0.001*"blog"
Topic: 4
Words: 0.002*"usuarios" + 0.002*"foro" + 0.001*"registrar" + 0.001*"navegar" +
0.001*"pocos" + 0.001*"privacidad" + 0.001*"administración" + 0.001*"mar" +
0.001*"éxito"+ 0.001*"temas"
Topic: 5
Words: 0.001*"juan" + 0.001*"jugar" + 0.001*"destacar" + 0.001*"cinco" + 0.001*"josé"
0.001*"decidir" + 0.001*"segundo" + 0.001*"próximo" + 0.001*"juego" +
0.001*"primeros"
Topic: 6


Words: 0.002*"ayuntamiento" + 0.001*"mayo" + 0.001*"presidente" + 0.001*"juan" +
0.001*"europa" + 0.001*"abril" + 0.001*"futuro" + 0.001*"viernes" + 0.001*"destacar" +
0.001*"julio"
Topic: 7
Words: 0.001*"mas" + 0.001*"gente" + 0.001*"fotos" + 0.001*"verdad" + 0.001*"edad" +
0.001*"perder" + 0.001*"tarde" + 0.001*"vivir" + 0.001*"marca" + 0.001*"bueno"
Topic: 8
Words: 0.001*"salir" + 0.001*"medios" + 0.001*"partido" + 0.001*"presidente" +
0.001*"vivir" + 0.001*"gente" + 0.001*"perder" + 0.001*"pues" + 0.001*"situación" +
0.001*"cinco"
Topic: 9
Words: 0.002*"reservados" + 0.002*"copyright" + 0.001*"domicilio" +
0.001*"disposición" + 0.001*"venta" + 0.001*"teléfono" + 0.001*"contenidos" + 0.001*"p

A.2    Word weights for 25 topics
Topic: 0
Words: 0.001*"mercado" + 0.001*"presidente" + 0.001*"indicar" + 0.001*"usuario" +
0.001*"medios" + 0.001*"ningún" + 0.001*"alta" + 0.001*"ley" + 0.001*"europa" +
0.001*"actividad"
Topic: 1
Words: 0.002*"situación" + 0.001*"medios" + 0.001*"sociales" + 0.001*"apoyo" +
0.001*"pública" + 0.001*"junio" + 0.001*"embargo" + 0.001*"futuro" +
0.001*"pretender"+ 0.001*"participación"
Topic: 2
Words: 0.001*"pues" + 0.001*"tarde" + 0.001*"espacio" + 0.001*"julio" +
0.001*"situación" + 0.001*"unir" + 0.001*"queda" + 0.001*"único" + 0.001*"interés"
+ 0.001*"terminar"
Topic: 3
Words: 0.001*"usuario" + 0.001*"intentar" + 0.001*"algún" + 0.001*"necesario" +
0.001*"anterior" + 0.001*"segundo" + 0.001*"proceso" + 0.001*"juan" + 0.001*"alta"
+ 0.001*"contenidos"
Topic: 4
Words: 0.002*"pues" + 0.002*"veces" + 0.002*"verdad" + 0.002*"gente" + 0.002*"bueno"
+ 0.002*"salir" + 0.002*"blog" + 0.001*"cierto" + 0.001*"claro" + 0.001*"creo"
Topic: 5
Words: 0.001*"anterior" + 0.001*"vivir" + 0.001*"intentar" + 0.001*"comentarios" +
0.001*"siguientes" + 0.001*"compartir" + 0.001*"recursos" + 0.001*"septiembre" +
0.001*"blog" + 0.001*"segundo"

Topic: 6
Words: 0.002*"usted" + 0.001*"contenido" + 0.001*"sector" + 0.001*"dinero" +
0.001*"problema" + 0.001*"proceso" + 0.001*"clientes" + 0.001*"comprar" +
0.001*"algún"+ 0.001*"mercado"
Topic: 7
Words: 0.002*"producto" + 0.002*"mas" + 0.002*"diseño" + 0.002*"venta" +
0.002*"compra"+ 0.002*"comprar" + 0.001*"tienda" + 0.001*"gratis" +
0.001*"precios" + 0.001*"alta"
Topic: 8
Words: 0.002*"presidente" + 0.002*"partido" + 0.002*"josé" + 0.001*"juan" +
0.001*"pesar" + 0.001*"próximo" + 0.001*"asegurar" + 0.001*"mañana" + 0.001*"salir"
+ 0.001*"europa"
Topic: 9
Words: 0.001*"amigos" + 0.001*"julio" + 0.001*"juan" + 0.001*"viernes" +
0.001*"formación" + 0.001*"problema" + 0.001*"josé" + 0.001*"libre" +
0.001*"contactar" + 0.001*"mañana"
Topic: 10
Words: 0.002*"usuarios" + 0.001*"tecnología" + 0.001*"usuario" + 0.001*"diseño" +
0.001*"software" + 0.001*"modo" + 0.001*"alta" + 0.001*"control" + 0.001*"versión" +
0.001*"funcionar"
Topic: 11
Words: 0.001*"junio" + 0.001*"siguientes" + 0.001*"proceso" + 0.001*"mayores" +
0.001*"contenidos" + 0.001*"control" + 0.001*"marca" + 0.001*"respecto" +
0.001*"fácil" + 0.001*"edad"
Topic: 12
Words: 0.001*"juan" + 0.001*"actividad" + 0.001*"mayores" + 0.001*"terminar" +
0.001*"josé" + 0.001*"teléfono" + 0.001*"pues" + 0.001*"formación" + 0.001*"cultura"
+ 0.001*"diciembre"
Topic: 13
Words: 0.002*"regresar" + 0.002*"error" + 0.002*"bruto" + 0.002*"edad" +
0.002*"intentar" + 0.002*"terminar" + 0.002*"realidad" + 0.002*"perder" +
0.002*"tarde" + 0.002*"éxito"
Topic: 14
Words: 0.002*"juan" + 0.001*"santa" + 0.001*"josé" + 0.001*"enviar" + 0.001*"mas" +
0.001*"inicio" + 0.001*"valencia" + 0.001*"usuarios" + 0.001*"luis" +
0.001*"anterior"
Topic: 15
Words: 0.001*"serie" + 0.001*"mayo" + 0.001*"cinco" + 0.001*"próximo" +
0.001*"mayores" + 0.001*"alta" + 0.001*"profesionales" + 0.001*"informar" +
0.001*"futuro" + 0.001*"unir"