Latent Dirichlet Allocation
Joshua Charles Campbell , ... Eleni Stroulia , in The Art and Science of Analyzing Software Data, 2015
6.2 Applications of LDA in Software Analysis
The LDA method was originally formulated by Blei et al. [1], and it soon became quite popular within the software-engineering community. LDA's popularity comes from the variety of its potential applications.
LDA excels at feature reduction and can be employed as a preprocessing step for other models, such as machine learning algorithms. LDA can also augment the inputs to machine learning and clustering algorithms by producing additional features from documents. One example of this type of LDA usage was described by Wang and Wong [2], who employed it in a recommender system. Similarly, labeled LDA can be used to create vectors of independent features from arbitrary feature sets such as tags.
An important use of LDA is for linking software artifacts. There are many instances of such artifact-linking applications, such as measuring coupling between code modules [3] and matching code modules with natural-language requirement specifications [4] for traceability purposes. Asuncion et al. [5] applied LDA to textual documentation and source-code modules and used the topic-document matrix to indicate traceability between the two. Thomas et al. [6] focused on the use of LDA on yet another traceability problem: linking e-mail messages to source-code modules. Gethers et al. [7] investigated the effectiveness of LDA for traceability-link recovery. They combined information retrieval techniques, including the Jensen-Shannon model, the vector space model, and the relational topic model using LDA. They concluded that each technique had its positives and negatives, yet the integration of the methods tended to produce the best results. Working within the information retrieval domain, Savage et al. [8], Poshyvanyk [9], and McMillan et al. [10] have explored the use of information retrieval techniques such as LSI [11] and LDA to recover software traceability links in source code and other documents. For a general literature survey of traceability techniques (including LDA), the interested reader should refer to De Lucia et al. [12].
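Linking artifacts through a topic-document matrix, as in the work of Asuncion et al. [5], amounts to comparing the topic mixtures of two artifacts. The sketch below is one illustrative reading of that idea, not their implementation; the three-topic vectors and artifact names are invented for demonstration.

```python
# Illustrative sketch: linking artifacts by comparing their LDA topic
# vectors with cosine similarity. The vectors below are made up; in
# practice they come from the topic-document matrix.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Topic proportions for a requirement document and two code modules.
requirement = [0.70, 0.20, 0.10]
module_a    = [0.65, 0.25, 0.10]   # similar topic mix -> likely linked
module_b    = [0.05, 0.15, 0.80]   # different mix -> likely unrelated

links = {name: cosine(requirement, vec)
         for name, vec in [("module_a", module_a), ("module_b", module_b)]}
best = max(links, key=links.get)
print(best)  # module_a scores highest, suggesting a traceability link
```

A threshold on the similarity score (rather than taking the maximum) would yield a candidate link set instead of a single best match.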
Baldi et al. [13] labeled LDA-extracted topics and compared them with aspects in software development. Baldi et al. claim that some topics do map to aspects such as security, logging, and cross-cutting concerns, which was somewhat corroborated by Hindle et al. [14].
Clustering is frequently used to compare and identify (dis)similar documents and code, or to quantify the overlap between two sets of documents. Clustering algorithms can potentially be applied to the topic probability vectors produced by LDA. LDA has been used in a clustering context for issue-report querying and for deduplication. Lukins et al. [15] applied LDA topic analysis to issue reports, leveraging LDA inference to determine whether queries, topics, and issue reports were related to each other. Alipour et al. [16] leveraged LDA topics to add context for deduplicating issue reports, and found that LDA topics added useful contextual information to issue/bug deduplication. Campbell et al. [17] used LDA to examine the coverage of popular project documentation by applying LDA to two collections of documents at once: user questions and project documentation. This was done by clustering and comparing the LDA output data.
Often LDA is used to summarize the contents of large datasets. This is done by manually or automatically labeling the most popular topics produced by unlabeled LDA. Labeled LDA can be used to track specific features over time—for example, to measure the fragmentation of a software ecosystem as in Han et al. [18].
Even though LDA topics are assumed to be implicit and not observable, there is substantial work on assessing the interpretability of those summaries by developers. Labeling software artifacts using LDA was investigated by De Lucia et al. [19]. Using multiple information retrieval approaches such as LDA and LSI, they labeled and summarized source code and compared the results against human-generated labels. Hindle et al. [20] investigated whether developers could interpret and label LDA topics. They reported limited success: only 50% of the topics were successfully labeled by the developers, and nonexperts tended to do poorly at labeling topics on systems they had not dealt with.
Finally, there has been some research on the appropriate choice of LDA hyperparameters and parameters: α, β, and the number of topics K. Grant and Cordy [21] focused on the choice of K, the number of topics. Panichella et al. [22] proposed LDA-GA, a genetic-algorithm approach to searching for appropriate LDA hyperparameters and parameters. LDA-GA needs an evaluation measure, so Panichella et al. used software-engineering-specific tasks that allowed their genetic algorithm to optimize the number of topics for cost-effectiveness.
URL: https://www.sciencedirect.com/science/article/pii/B9780124115194000069
Can the Human Association Norm Evaluate Machine-Made Association Lists?
Michał Korzycki , ... Wiesław Lubaszewski , in Cognitive Approach to Natural Language Processing, 2017
2.3.3 LDA-sourced lists
Latent Dirichlet Allocation is a mechanism used for topic extraction [BLE 03]. It treats each document as a probabilistic mixture of topics, and each topic as a probability distribution over words. These topics are not strongly defined, as they are identified on the basis of the likelihood of co-occurrences of the words contained in them.
In order to obtain ranked lists of words associated with a given word wn, we take the set of topics generated by LDA, and then for each word contained, we take the sum of the weight of each topic multiplied by the weight of given word wn in this topic.
Formally, for N topics, with w_j^i denoting the weight of the word i in the topic j, the ranking weight for the word i with respect to the cue word w_n is computed as follows:

[2.3]   weight(i) = Σ_{j=1}^{N} w_j^n · w_j^i
This representation allows us to create a ranked list of words associated with a given word wn based on their probability of co-occurrence in the documents.
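One reading of this ranking is sketched below: the association score of each candidate word is accumulated over topics as the product of the cue word's weight and the candidate's weight in the same topic. The two toy topics and their weights are invented for illustration.

```python
# Sketch of the ranking in equation [2.3]: the association weight of a
# candidate word i with respect to a cue word w_n is the sum, over all
# topics, of the cue word's weight in the topic times the candidate's
# weight in the same topic. Topic-word weights are invented.
topics = [
    {"bank": 0.4, "money": 0.3, "loan": 0.2},
    {"bank": 0.1, "river": 0.5, "water": 0.3},
]

def association_list(cue, topics):
    scores = {}
    for topic in topics:
        cue_weight = topic.get(cue, 0.0)
        for word, w in topic.items():
            if word != cue:
                scores[word] = scores.get(word, 0.0) + cue_weight * w
    # Ranked list: most strongly associated words first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = association_list("bank", topics)
print(ranked[0][0])  # "money": 0.4*0.3 = 0.12 beats "river": 0.1*0.5 = 0.05
```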
URL: https://www.sciencedirect.com/science/article/pii/B9781785482533500020
Text Mining and Network Analysis of Digital Libraries in R
Eric Nguyen , in Data Mining Applications with R, 2014
4.4.1 The Latent Dirichlet Allocation
LDA is a technique developed by David Blei, Andrew Ng, and Michael Jordan and presented in Blei et al. (2003). LDA is a generative model; in text mining, it provides a way to attach topical content to text documents. Each document is viewed as a mix of multiple distinct topics. An advantage of the LDA technique is that one does not have to know in advance what the topics will look like: by tuning the LDA parameters to fit different dataset shapes, one can explore topic formation and the resulting document clusters.
The mathematics behind LDA is beyond the scope of this work, but one should be aware of the following aspects of the algorithm. The number of topics K is fixed and specified in advance. The corpus contains documents of length n_i. Each word w_{i,j} comes from a vocabulary consisting of V different terms.
The term distribution for each topic k is modeled by

β_k ~ Dirichlet(η),

where Dirichlet(η) denotes the Dirichlet distribution with parameter η. The proportion of the topic distribution for each document i is distributed as

ω_i ~ Dirichlet(α).

Each word w_{i,j} is associated with a topic z_{i,j}, which follows

z_{i,j} ~ Multinomial(ω_i),

where Multinomial(ω_i) denotes the multinomial distribution with one trial (Grün and Hornik, 2011).
In this setup, LDA is a "bag of words" model: the order in which words appear in a document does not affect the model.
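The generative process just described can be sketched in a few lines of standard-library Python. The vocabulary, the number of topics, and the symmetric hyperparameter values are illustrative choices, not values from the chapter.

```python
# A minimal sketch of the LDA generative process: K topics, Dirichlet
# priors on the per-topic term distribution and the per-document topic
# proportions, and one topic z_ij drawn per word w_ij.
import random

random.seed(0)
vocabulary = ["gene", "cell", "market", "price"]  # V = 4 terms
K = 2  # number of topics, fixed in advance

def dirichlet(alpha, dim):
    """Sample a symmetric Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

# Term distribution for each topic: beta_k ~ Dirichlet(eta).
beta = [dirichlet(0.5, len(vocabulary)) for _ in range(K)]

def generate_document(n_words):
    # Topic proportions for this document: omega_i ~ Dirichlet(alpha).
    omega = dirichlet(0.5, K)
    words = []
    for _ in range(n_words):
        z = random.choices(range(K), weights=omega)[0]      # topic z_ij
        w = random.choices(vocabulary, weights=beta[z])[0]  # word w_ij
        words.append(w)
    return words  # a "bag of words": order carries no information

doc = generate_document(10)
print(len(doc))  # 10 words, each drawn from some topic's term distribution
```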
URL: https://www.sciencedirect.com/science/article/pii/B9780124115118000049
Probabilistic methods
Ian H. Witten , ... Christopher J. Pal , in Data Mining (Fourth Edition), 2017
Latent Dirichlet Allocation
pLSA can be extended into a hierarchical Bayesian model with three levels, known as latent Dirichlet allocation. We refer to this as LDAb ("b" for Bayesian) to distinguish it from linear discriminant analysis, which is also commonly referred to as LDA. LDAb was proposed in part to reduce the overfitting observed with pLSA, and it has been extended in many ways. Extensions of LDAb can be used to determine trends over time and to identify "hot" and "cold" topics. Such analyses are particularly interesting today, given the recent explosion of social media and the interest in analyzing it.
Latent Dirichlet allocation is a hierarchical Bayesian model that reformulates pLSA by replacing the document index variables d_i with the random parameter θ_i, a vector of multinomial parameters for the documents. The distribution of θ_i is influenced by a Dirichlet prior with hyperparameter α, which is also a vector. (Appendix A.2 explains Dirichlet distributions and their use as priors for the parameters of the discrete distribution.) Finally, the relationship between the discrete topic variables z_ij and the words w_ij is also given an explicit dependence on a hyperparameter, namely, the matrix β. Fig. 9.11B shows the corresponding graphical model. The probability model for the set of all observed words W is

P(W | α, β) = ∏_i ∫ p(θ_i | α) [ ∏_j Σ_{z_ij} p(z_ij | θ_i) p(w_ij | z_ij, β) ] dθ_i,

which marginalizes out the uncertainty associated with each θ_i and z_ij. θ_i is given by a k-dimensional Dirichlet distribution, which also leads to k-dimensional topic variables z_ij. For a vocabulary of size V, β encodes the probability of each word given each topic, and prior information is therefore captured by the (k × V)-dimensional matrix β.
The marginal log-likelihood of the model can be optimized using an empirical Bayesian method by adjusting the hyperparameters α and β via the variational EM procedure. To perform the E-step of EM, the posterior distribution over the unobserved random quantities is needed. For the model defined by the above equation, with a random θ_i for each document, word observations w_ij, and hidden topic variables z_ij, the posterior distribution is

p(θ, z | W, α, β) = p(θ, z, W | α, β) / P(W | α, β),

which, unfortunately, is intractable. For the M-step it is necessary to update the hyperparameters α and β, which can be done by computing maximum likelihood estimates using the expected sufficient statistics from the E-step. The variational EM procedure amounts to computing and using a separate approximate posterior for each θ_i and each z_ij.
A method called "collapsed Gibbs sampling" turns out to be a particularly effective alternative to variational methods for performing inference in LDAb. Consider first that the model in Fig. 9.11B can be expanded to that shown in Fig. 9.11C, which was originally cast as the smoothed version of LDAb. Then add another Dirichlet prior, with parameters given by η, on the topic parameters β, a formulation that further reduces the effects of overfitting. Standard Gibbs sampling involves iteratively sampling the hidden random variables z_ij, the θ_i's, and the elements of the matrix β. Collapsed Gibbs sampling is obtained by integrating out the θ_i's and β analytically, which deals with these distributions exactly. Consequently, conditioned on the current estimates of α, η, and the observed words of a document corpus, the Gibbs sampler proceeds by simply iteratively updating each z_ij to compute the required approximate posterior. Using either samples or variational approximations, it is then relatively straightforward to obtain estimates for the θ_i's and β.
The overall approach of using a smoothed, collapsed LDAb model to extract topics from a document collection can be summarized as follows. First, define a hierarchical Bayesian model for the joint distribution of documents and words following the structure of Fig. 9.11C. We can think of there being a Bayesian E-step that performs approximate inference using Gibbs sampling to sample from the joint posterior over all topics for all documents in the model, P(z | W, α, η), where the θ_i's and β have been integrated out. This is followed by an M-step that uses these samples to update the estimates of the θ_i's and β, using update equations that are functions of α, η, and the samples. This procedure is performed within a hierarchical Bayesian model, so the updated parameters can be used to create a Bayesian predictive distribution over new words and new topics given the observed words.
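The count-based mechanics of collapsed Gibbs sampling can be illustrated with a deliberately tiny sampler. This is a pedagogical sketch, not the chapter's implementation: the corpus, hyperparameter values, and number of sweeps are invented, and only the topic assignments z_ij are sampled, with point estimates of the topic-word distributions recovered from the counts afterwards.

```python
# Compact collapsed Gibbs sampler for smoothed LDA, written for clarity
# rather than speed. theta and beta are integrated out; only the topic
# assignments z_ij are resampled from their full conditional.
import random

random.seed(1)
docs = [["cash", "bank", "loan"], ["river", "bank", "water"],
        ["loan", "cash", "money"], ["water", "river", "stream"]]
vocab = sorted({w for d in docs for w in d})
V, K = len(vocab), 2
alpha, eta = 0.1, 0.1  # Dirichlet hyperparameters (illustrative values)

# Count tables: document-topic, topic-word, and topic totals.
ndk = [[0] * K for _ in docs]
nkw = [[0] * V for _ in range(K)]
nk = [0] * K
z = []  # current topic assignment for every word position

for i, doc in enumerate(docs):
    z.append([])
    for w in doc:
        t = random.randrange(K)
        z[i].append(t)
        ndk[i][t] += 1
        nkw[t][vocab.index(w)] += 1
        nk[t] += 1

for _ in range(200):  # Gibbs sweeps
    for i, doc in enumerate(docs):
        for j, w in enumerate(doc):
            v, t = vocab.index(w), z[i][j]
            # Remove this word's current assignment from the counts.
            ndk[i][t] -= 1; nkw[t][v] -= 1; nk[t] -= 1
            # Full conditional for z_ij with theta and beta collapsed.
            weights = [(ndk[i][k] + alpha) * (nkw[k][v] + eta) / (nk[k] + V * eta)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[i][j] = t
            ndk[i][t] += 1; nkw[t][v] += 1; nk[t] += 1

# Point estimates of the topic-word distributions from the counts.
beta_hat = [[(nkw[k][v] + eta) / (nk[k] + V * eta) for v in range(V)]
            for k in range(K)]
for k in range(K):
    top = max(range(V), key=lambda v: beta_hat[k][v])
    print("topic", k, "top word:", vocab[top])
```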
Table 9.1 shows the highest probability words from a sampling of topics mined by Griffiths and Steyvers (2004) through applying LDAb to 28,154 abstracts of papers published in the Proceedings of the National Academy of Science from 1991 to 2001 and tagged by authors with subcategory information. Analyzing the distribution over these tags identifies the highest probability user tags for each topic, which are shown at the bottom of Table 9.1. Note that the user tags were not used to create the topics, but we can see how well the extracted topics match with human labels.
Table 9.1. Highest Probability Words and User Tags From a Sample of Topics Extracted From a Collection of Scientific Articles
| Topic 2 | Topic 39 | Topic 102 | Topic 201 | Topic 210 |
|---|---|---|---|---|
| Species | Theory | Tumor | Resistance | Synaptic |
| Global | Time | Cancer | Resistant | Neurons |
| Climate | Space | Tumors | Drug | Postsynaptic |
| CO2 | Given | Human | Drugs | Hippocampal |
| Water | Problem | Cells | Sensitive | Synapses |
| Geophysics, geology, ecology | Physics, math, applied math | Medical sciences | Pharmacology | Neurobiology |
URL: https://www.sciencedirect.com/science/article/pii/B978012804291500009X
Mining Software Logs for Goal-Driven Root Cause Analysis
Hamzeh Zawawy , ... John Mylopoulos , in The Art and Science of Analyzing Software Data, 2015
18.5.1 Latent Semantic Indexing
Some of the well-known document retrieval techniques include LSI [18], PLSI [19], latent Dirichlet allocation [20], and the correlated topic model [21]. In this context, semantic analysis of a corpus of documents consists in building structures that identify concepts from the corpus without any prior semantic understanding of the documents.
LSI is an indexing and retrieval method for identifying relationships among terms and concepts in a collection of unstructured text documents. LSI was introduced by Deerwester et al. [18]. It takes a vector space representation of documents based on term frequencies as a starting point and applies a dimension reduction operation to the corresponding term/document matrix using the singular value decomposition algorithm [22]. The fundamental idea is that documents and terms can be mapped to a reduced representation in the so-called latent semantic space. Similarities among documents and queries can be estimated more effectively in this reduced representation than in the original one. This is because two documents whose terms have been found to frequently co-occur in the collection of all documents will have similar representations in the reduced space, even if the two specific documents have no terms in common. LSI is commonly used in areas such as Web retrieval, document indexing [23], and feature identification [24]. In this context, we consider each log entry as a document. The terms found in the goal and antigoal node annotation expressions become the search keywords. LSI is then used to identify those log entries that are most strongly associated with a particular query denoting a system feature, a precondition, an occurrence, or a postcondition annotation of a goal or antigoal node.
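The LSI pipeline described here (term/document matrix, SVD-based dimension reduction, similarity in the latent space) can be sketched as follows. The three "log entries" are invented, and the choice of two retained dimensions is arbitrary.

```python
# Minimal LSI sketch: build a term-document matrix, take a truncated
# SVD, and compare documents in the reduced "latent semantic" space.
import numpy as np

docs = ["gps status listener", "send text message", "gps location status"]
terms = sorted({t for d in docs for t in d.split()})
# Term-frequency matrix A: one row per term, one column per document.
A = np.array([[d.split().count(t) for d in docs] for t in terms], float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # retained dimensions of the latent semantic space
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the reduced space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 2 share "gps" and "status"; document 1 shares nothing,
# so it should remain the least similar in the reduced space.
print(cos(doc_vecs[0], doc_vecs[2]) > cos(doc_vecs[0], doc_vecs[1]))  # True
```

A query would be folded into the same space by projecting its term vector with U and s before computing similarities.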
URL: https://www.sciencedirect.com/science/article/pii/B9780124115194000185
Moving on
Ian H. Witten , ... Christopher J. Pal , in Data Mining (Fourth Edition), 2017
Natural Language Processing
Natural language processing, a rich field of study with a long history, is an active application area for deep learning. We have seen how latent semantic analysis (LSA) and latent Dirichlet allocation (LDAb) permit exploratory topic analysis of document collections. More recently, it has been observed that neural language modeling techniques can perform significantly better than LSA at preserving relationships among words; moreover, LDAb also has difficulty scaling to truly massive data.
Researchers at Google have created a set of language models called word2vec, based on single-hidden-layer networks trained with vast amounts of data: 783 million words for the initial experiments, and 30 billion for later ones. (Associated software is available online.) One such model produces continuous representations of words by training a neural bag-of-words model to predict words given their context. Because word order in the context window is not captured, this is known as a "continuous bag of words" model. Another model, "skip-gram," gives each word to a log-linear classifier with a linear projection layer (a form of shallow neural network) that predicts nearby words within a certain distance before and after the source word. Here the number of states for the output prediction equals the vocabulary size, and with data at this scale the vocabulary ranges from 10^5 to 10^9 terms, so the output is decomposed into a binary tree known as a "hierarchical softmax," which, for a V-word vocabulary, needs to evaluate only log2(V) rather than V output nodes.
A particularly noteworthy aspect of this work is that the representations learned yield projections for words that allow inferences about their meaning to be performed with vector operations. For example, upon projecting the words Paris, France, Italy, and Rome into the learned representation, one finds that simple vector subtraction and addition yield the relationship Paris − France + Italy ≈ Rome! More precisely, Rome is found to be the closest word when all words are projected into this representation.
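A toy version of this vector arithmetic is sketched below. The 2-dimensional "embeddings" are hand-crafted so that the country-to-capital offset is roughly constant; real word2vec representations are learned from data and have hundreds of dimensions.

```python
# Toy illustration of the analogy Paris - France + Italy ~ Rome.
# The vectors are invented for demonstration, not learned.
from math import sqrt

vectors = {
    "Paris":  [2.0, 3.0],
    "France": [2.1, 1.0],
    "Rome":   [4.0, 3.1],
    "Italy":  [4.2, 1.1],
    "Berlin": [1.0, 3.2],  # distractor word
}

def closest(query, exclude):
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))
    # As with word2vec analogy queries, the input words are excluded.
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], query))

query = [p - f + i for p, f, i in
         zip(vectors["Paris"], vectors["France"], vectors["Italy"])]
print(closest(query, exclude={"Paris", "France", "Italy"}))  # prints "Rome"
```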
Many research and development groups are mining massive quantities of text data in order to learn as much as possible from scratch, replacing features that have previously been hand-engineered by ones that are learned automatically. Large neural networks are being applied to tasks ranging from sentiment classification and translation to dialog and question answering. The deep encoder-decoder architecture discussed at the end of Chapter 10, Deep learning, represents one such example: Google researchers have used it to learn how to translate languages from scratch, based on voluminous data.
URL: https://www.sciencedirect.com/science/article/pii/B9780128042915000131
30th European Symposium on Computer Aided Process Engineering
Gulnara Shavalieva , ... Stavros Papadokonstantakis , in Computer Aided Chemical Engineering, 2020
3 Results
The results of the two topic modelling approaches and various similarity scores are presented in Table 1. As can be seen, the LSI results are more sensitive to the similarity-score threshold value than the LDA results, both for the number of resulting sentences and for the number of potentially relevant sentences.
Table 1. Results for LSI and LDA models.
| Similarity score | Resulting sentences (LSI) | Resulting sentences (LDA) | Potentially relevant sentences (LSI) | Potentially relevant sentences (LDA) |
|---|---|---|---|---|
| 0.3 | 1020 | 478 | 34 | 24 |
| 0.4 | 483 | 400 | 21 | 22 |
| 0.5 | 171 | 344 | 9 | 21 |
The potentially relevant sentences contain rules, correlations, or indications of QSARs or read-across predictions of aquatic toxicity. For example, sentences like "Longer chain alcohols display a very strong structure-activity relationship to acute fish toxicity as a function of hydrophobicity" or "There is a linear relationship between both acute and chronic toxicities and LogKow, suggesting that with the increase of hydrophobicity the aquatic toxicity increases" might be obtained.
Additionally, the main parameters affecting the results of the study have been identified and considered for further improvement. A more general keyword combination used to search for articles (e.g., "aquatic toxicity" instead of "aquatic LC50") results in more relevant sentences. A more specific search ("aquatic LC50") leads to the identification of numerous sentences devoted to various experimental studies limited to certain test conditions, which are difficult to generalize into prior knowledge.
The identification of the potentially relevant topics generated by the models is not straightforward: it relies heavily on human expertise and may have a significant impact on the results. The evaluation of the resulting sentences, in turn, revealed the need to backtrack to the documents containing potentially useful information (e.g., extracting more context from the text surrounding the identified sentences). Thus, the approach can serve as an additional indication of promising articles from which to retrieve more data for the generation of prior knowledge. However, it should be noted that this requires an additional step of human intervention in the overall prior-knowledge extraction procedure.
URL: https://www.sciencedirect.com/science/article/pii/B9780128233771503177
Mining Android Apps for Anomalies
Konstantin Kuznetsov , ... Andreas Zeller , in The Art and Science of Analyzing Software Data, 2015
10.1 Introduction
Detecting whether a mobile application behaves as expected is a prominent problem for users. Whenever they install a new app on their mobile device, they run the risk of it being "malware", i.e., software that acts against their interests. Security researchers have largely focused on detecting malware in Android™ apps, but their techniques typically check new apps against a set of predefined known patterns of malicious behavior. This approach works well for detecting new malware that uses known patterns, but does not protect against new attack patterns. Moreover, in Android it is not easy to define what malicious behavior is, and therefore to identify the key features by which to detect malware. The problem is that any specification of what makes behavior benign or malicious depends very much on the current context.
Typical Android malware, for instance, sends text messages to premium numbers, or collects sensitive information from users, such as the mobile number, current location, and contacts. However, this very same information, and these very same operations, frequently occur in benign applications as well. Sending text messages to premium numbers is, for instance, a legitimate payment method for unlocking new app features; tracking the current location is what a navigation app has to do; and collecting the list of contacts and sending it to an external server is what most free messaging apps like WhatsApp do upon synchronization. The question thus is not whether the behavior of an app matches a specific malicious pattern; it is whether the app behaves as one would expect.
In our previous work we presented CHABADA, a technique to check implemented app behavior against advertised app behavior [1]. We analyzed the natural-language descriptions of 22,500+ Android applications, and we checked whether each description matched the implemented behavior, represented as a set of application programming interfaces (APIs). The key idea of CHABADA is to associate descriptions and API usage to detect anomalies.
Our CHABADA approach includes the five steps illustrated in Figure 10.1:
Figure 10.1. Detecting applications with anomalous behavior. Starting from a collection of "benign" apps (1), CHABADA identifies their description topics (2) to form clusters of related apps (3). For each cluster, CHABADA identifies the APIs used (4), and can then identify outliers that use APIs that are uncommon for that cluster (5). 1
1. CHABADA starts with a collection of 22,500+ supposedly "benign" Android applications downloaded from the Google Play Store.
2. Using Latent Dirichlet Allocation (LDA) on the app descriptions, CHABADA identifies the main topics ("theme," "map," "weather," "download") for each application.
3. CHABADA then clusters applications by related topics. For instance, if there were enough apps whose main description topics are "navigation" and "travel," they would form one cluster.
4. In each cluster, CHABADA identifies the APIs each app statically accesses. It only considers sensitive APIs, which are governed by a user permission. For instance, APIs related to Internet access are controlled by the "INTERNET" permission.
5. Using unsupervised learning, CHABADA identifies outliers within a cluster with respect to API usage. It produces a ranked list of applications for each cluster, where the top apps are most abnormal with respect to their API usage, indicating possible mismatches between description and implementation. Unknown applications would thus first be assigned to the cluster implied by their description and then be classified as normal or abnormal.
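Steps 4 and 5 can be illustrated with a small distance-based sketch: represent each app in a cluster by the set of sensitive APIs it uses, and rank apps by their mean distance to the other cluster members. The app names, API sets, and the use of Jaccard distance are illustrative assumptions, not details of CHABADA.

```python
# Hedged sketch of within-cluster outlier detection: the app whose API
# usage is farthest from the rest gets the highest anomaly score.
apps = {
    "runtracker": {"INTERNET", "GPS"},
    "scoreboard": {"INTERNET"},
    "teamnews":   {"INTERNET", "GPS"},
    "oddball":    {"INTERNET", "SEND_SMS", "READ_CONTACTS"},
}

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Distance-based anomaly score: mean distance to the other members.
scores = {name: sum(jaccard_distance(apis, other)
                    for o, other in apps.items() if o != name) / (len(apps) - 1)
          for name, apis in apps.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # "oddball" uses APIs uncommon in this cluster
```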
By flagging anomalous API usage within each cluster, CHABADA is set up to detect any suspicious app within a set of similar apps and can therefore detect whether an app has any mismatch between advertised and implemented behavior. We show how this works in practice with a real app as an example. Figure 10.2 shows the description of the Official New York Jets team app, 2 available from the Google Play Store. Its description clearly puts it into the "sports" cluster.
Figure 10.2. Official New York Jets app description.
In addition to expected common API calls, the version of the Official New York Jets app we analyzed can check whether GPS location is available, via the API method LocationManager.addGpsStatusListener(), and can send text messages, via the API method SmsManager.sendTextMessage(), which are highly uncommon operations for this kind of application. These API method calls, together with similar others, make Official New York Jets an outlier within the "sports" cluster. By flagging such anomalies, CHABADA can detect false advertising, plain fraud, masquerading, and other questionable behavior. CHABADA can be used as a malware detector as well. By training it on a sample of benign apps, CHABADA can classify new apps as benign or malware, without any previous notion of malicious behavior.
This chapter extends our previous conference paper [1] by presenting several new techniques that lead to significant improvements:
1. We now rank down irrelevant APIs when looking for anomalies. Specifically, we give a lower weight to APIs that are common within a particular cluster (e.g., Internet access, which is frequently used in applications). By giving more importance to less common behavior, CHABADA can highlight outliers more easily.
2. We incorporate an additional technique for anomaly detection. The anomaly detection of CHABADA is now based on a distance-based algorithm, which makes it possible to clearly identify the APIs that make an app anomalous.
3. To use CHABADA as a classifier of malicious applications, we now run anomaly detection as a preliminary step, and we exclude the anomalies from the training set. This removes noise from the training set and consequently improves the abilities of the classifier. CHABADA can now predict 74% of malware as such (previously 56%), and suffers from only 11% false positives (previously 15%).
4. We can now automatically select the optimal parameters for the classifier. This also contributes to the improvement of CHABADA as a malware classifier.
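The first improvement above can be illustrated with an IDF-style weighting sketch: APIs used by every app in a cluster receive weight zero, while rare APIs dominate the anomaly score. The logarithmic weighting and the toy cluster are illustrative assumptions; CHABADA's actual weighting scheme may differ.

```python
# Sketch of down-weighting common APIs within a cluster, in the spirit
# of IDF weighting. The cluster contents are invented for illustration.
import math

cluster_apps = [
    {"INTERNET", "GPS"},
    {"INTERNET"},
    {"INTERNET", "GPS"},
    {"INTERNET", "SEND_SMS"},
]
n = len(cluster_apps)

def api_weight(api):
    """Rarer APIs in the cluster get larger weights."""
    df = sum(api in apis for apis in cluster_apps)  # apps using this API
    return math.log(n / df)

weights = {api: api_weight(api)
           for apis in cluster_apps for api in apis}
# INTERNET appears in every app, so it contributes nothing to anomaly
# scores; SEND_SMS appears once and is weighted most heavily.
print(weights["INTERNET"], weights["SEND_SMS"] > weights["GPS"])
```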
The remainder of this chapter is organized as follows. We first describe how CHABADA clusters applications by description topics in Section 10.2. This book chapter does not improve on our first paper [1] in this respect, but we include a description of this step for the sake of completeness. Section 10.3 describes how, in each cluster, we detect outliers with respect to their API usage. In particular, we describe the new algorithm that CHABADA uses, and we highlight the advantages of the new approach. Section 10.4 evaluates the improvements to CHABADA. After discussing related work (Section 10.5), Section 10.6 closes with conclusions and future work.
URL: https://www.sciencedirect.com/science/article/pii/B9780124115194000100
Big Data Analytics
Venu Govindaraju , ... Srirangaraj Setlur , in Handbook of Statistics, 2015
4.1.2 Prior Work
The trend of scientific ideas over time was studied as early as 1962 by Kuhn (Kuhn, 1962). In Kuhn's model, science is viewed as shifting from one paradigm to another; since researchers' ideas and vocabulary are constrained by their paradigm, successive, less compatible paradigms will have a different vocabulary and framing. Although Kuhn's model was intended to apply only to very large shifts in scientific thought, rather than to the microlevel of trends in research, this insight that vocabulary, and vocabulary shift, is a crucial indicator of ideas and shifts in ideas has been explored by several researchers in the machine learning and information engineering communities (Hall et al., 2008). A related effort is the analysis of cultural change using millions of digitized books (Michel et al., 2011).
Latent Dirichlet allocation (LDA) (Blei et al., 2003), also known as unsupervised topic modeling, was first published in 2003 and is the most basic form of probabilistic topic (or theme) modeling. It assumes that a fixed number of "topics", each a distribution over the words of a fixed vocabulary, underlie the entire document collection, so that LDA provides a method for automatically discovering the topics that the documents collectively contain. Other, more advanced methods of discovering latent hierarchies based on unsupervised learning of densities and nested mixtures include finite-depth trees (Williams, 1999), diffusive branching processes (Neal, 2003), and hierarchical clustering (Heller and Ghahramani, 2005; Teh et al., 2008). Other latent hierarchical Bayesian approaches include semi-supervised learning (Kemp et al., 2003), relational learning (Roy et al., 2006), and multi-task learning (Daumé, 2009). Most recently, evolutionary diffusion processes have been proposed to capture the tree-like, hierarchical structure of natural data (Adams et al., 2010; Meeds et al., 2008; Paisley et al., 2012).
The Dynamic Topic Model (Blei and Lafferty, 2006) is an example of how to model temporal relationships by extending standard LDA: each year's documents are assumed to be generated from a normal-distribution centroid over topics, and the following year's centroid is generated from the preceding year's, in a Markov-chain type of relationship. The Topics over Time Model (Wang and McCallum, 2006) assumes that each document chooses its own time stamp based on a topic-specific beta distribution. These two models, however, impose strong constraints on the time periods. Along these lines, we also implemented a dynamic topic model published in the Journal of Machine Learning Research (Malgireddy et al., 2013). In this model, we learned the relationships between the input observables, also as a Markov-chain type of relationship, and used this model to cluster and classify human activities in large collections of videos. An example of a subtree of documents inferred using 20 topics is presented in Fig. 4 (top), where only nodes with at least 50 documents are shown.
Figure 4. Top: A subtree of documents from NIPS 1-12 (modified from Adams et al., 2010) showing the hierarchy on thematic documents data. Bottom: A sample of our table and caption detector (a current work in progress) on a physics article published in 1968. Image best viewed in color.
URL: https://www.sciencedirect.com/science/article/pii/B9780444634924000010
Cognitive Applications and Their Supporting Architecture for Smart Cities
Haytham Assem , ... Declan O'Sullivan , in Big Data Analytics for Sensor-Network Collected Intelligence, 2017
4.2.1 Related work to discovering regional patterns
In recent years, many approaches have been proposed for identifying patterns using mobility and LBSN data. In Ref. [53], Bicoocchi et al. proposed an approach based on clustering and segmenting GPS traces to infer the places of relevance to the user. In Ref. [54], Eagle et al. applied principal component analysis (PCA) to infer places and mobility patterns on the basis of nearby radio-frequency beacons (e.g., WiFi and GSM towers). The human activities, termed eigenbehaviors, are represented as the top eigenvectors of the PCA. Similarly, the work presented by Sigg et al. in Ref. [55] compared different data mining techniques for extracting patterns from mobility data and found that independent component analysis (ICA) and PCA are the most suitable for identifying humans' daily patterns.
Although the previous unsupervised learning methods showed a great success for detecting patterns in cities based on whole days, there is also a need for detecting such patterns for various time intervals [56 ]. Topic modeling is a useful tool in such a case that has been investigated by several works to extract individual recurrent patterns. It was introduced originally for finding underlying topics of words from a large collection of documents. latent Dirichlet allocation (LDA) is one of the implementations of topic modeling that has been widely used for extracting individual recurrent patterns in cities [ 57].
In Ref. [58], Laura et al. presented a method based on LDA to automatically discover users' routine behavior extracted from a Google Latitude mobility data set. They focused more on extracting the routine behaviors other than relevant places compared to what was proposed in Refs. [53] and [54]. In Ref. [56], authors presented an approach based on LDA for crowd detection using Twitter posts of data in New York that contains a large set of users but in a sparse way. In Ref. [59], Samiul et al. provided foundational tools that can be used to predict user-specific activity patterns. They addressed the main limitation for geo-location data for modeling individual behaviors and presented a topic model that can extract the activity patterns without the socio-demographic details of the individuals. Felix et al. in Ref. [60] combined textual and movement data, and they applied topic models to the combined data on an averaged week activity in which they were able to show how city modalities evolve over time and space.
In the past, neural networks have had the following drawbacks: they require labeled data that is difficult to obtain in most cases; the learning time does not scale well as it is very slow in networks with multiple hidden layers; and it has the tendency to get stuck in a poor local minima [61]. Smolensky introduced the RBM [62] and afterwards Hinton introduced a learning algorithm called Contrastive Divergence for the training RBM [46]. Hinton and Salakhutdinov introduced the pre-training process by stacking a number of RBMs [49] and being able to train each RBM separately using the Contrastive Divergence algorithm. This was found to provide a crude convergence for the parameters that can be used as an initialization for the fine-tuning process. The fine-tuning process is very similar to the learning algorithms that have been used in the feed forward neural networks (FFNNs). By using an optimization model, the parameters converge to reconstruct the input. Hinton and Salakhutdinov validated their method on the popular MNIST data set by demonstrating how it reduces the dimensionality of an input vector from 784 dimensions to 2 dimensions, but still represents the original data well [49].
Hinton and Salakhutdinov introduced the constrained Poisson model (CPM) as a core component of RBM to model word count data for performing a dimensional reduction on document data [63]. Afterwards, the authors replaced the CPM with the replicated Softmax model (RSM) due to the inability of the CPM when defining a proper distribution over word counts [64]. Later, the RSM was introduced to act as the first component in the DBN pertaining process [65]. Hinton and Salakhutdinov also introduced Semantic Hashing to produce binary values [63], which can be applied to measure the similarity between documents using hamming distance.
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780128093931000088
Source: https://www.sciencedirect.com/topics/computer-science/latent-dirichlet-allocation
0 Response to "Lda Latent Dirichlet Allocation Continuous Inputs"
ارسال یک نظر