<div class="container column">

    <h2 class="title">Topic Modeling Data Visualization</h2>
    <div class="space"></div>
    <p class="sub_text">Topic Modeling is a text-mining approach for identifying the topics or subjects present in a dataset. With TDM Studio, Topic Modeling can be applied to both newspaper content and dissertation and thesis content, in support of several different objectives. For example:</p>
    <div class="space"></div>
    <ul class="sub_text list">
        <li>Topic Modeling can help explore the relationship between what is discussed on the front page of the newspaper and the 2009 financial crisis. How do public narratives affect economic recovery? Or how does economic recovery affect reported narratives?</li>
        <div class="space"></div>
        <li>Topic Modeling can be used to analyze recent Computer Science dissertations and theses to determine which machine learning methodologies were trending over the past five years. This can also be valuable from a discovery standpoint, for finding dissertations and theses related to your own research (e.g., for a literature review).</li>
    </ul>
    <div class="space"></div>
    <p class="sub_text">In the example below, we use LDA to analyze a set of 8851 newspaper articles from the New York Times. These are all of the articles the New York Times published in September 2001. How does the news cycle change in response to the September 11 terrorist attacks? How does this differ from one newspaper to another?</p>

    <div class="space"></div>

    <div class="info_tab" (click)="toggleTab(1)">
        <div class="tab_bar">
            <h3 class="tab_title">Topic Modeling (Latent Dirichlet Allocation) and Pre-Processing</h3>
            <div [ngClass]="isOpen(1) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(1)">
            <p class="tab_text">LDA (Latent Dirichlet Allocation) is a generative model that attempts to discover ‘latent’, or hidden, topics within a collection of documents. The only observed variable in the model is the occurrence of words in documents. The number of topics is supplied by the user (in TDM Studio, via the ‘Number of Topics’ dropdown) and affects the resulting topic model.</p>
            <p class="tab_text">For TDM Studio, we use scikit-learn’s implementation of Latent Dirichlet Allocation:
                <br/> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html" target="_blank">https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html</a> The scikit-learn User Guide gives further details on how word and topic distributions are computed: <a href="https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation" target="_blank">https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation</a></p>
            <p class="tab_text">To prepare documents for topic modeling, we rely on scikit-learn’s CountVectorizer:
                <br/> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" target="_blank">https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html</a> For newspaper articles, we use the title, abstract, and full text as input. Because dissertations and theses are often hundreds of pages long, we use only their title and abstract as input.</p>
            <p class="tab_text"><span class="italic">Important Note: </span>We use the same parameters for both the LDA model and the document-term matrix across all projects. Depending on the number of documents in a project and the words those documents contain, different parameters may produce a better, corpus-specific topic model.</p>
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(2)">
        <div class="tab_bar">
            <h3 class="tab_title">Topic Modeling Keywords and Topic Documents</h3>
            <div [ngClass]="isOpen(2) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(2)">
            <p class="tab_text">For each topic, we list the ten words with the highest probability for that topic. These words often, though not always, give an indication of what the topic is about.</p>
            <p class="tab_text">Clicking a topic card displays a list of up to fifty documents related to the selected topic. These are the documents in which the selected topic has a high probability of occurring. Clicking the title of a document opens a new window with the full text of that document.</p>
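            <p class="tab_text">The ranking behind both lists can be sketched as a simple sort by probability. The snippet below is a minimal illustration using hypothetical weights for a single topic; the helper name top_indices and all numbers are invented for the example, and the limits are reduced from ten words and fifty documents to three and two to keep it short.</p>

```python
def top_indices(weights, limit):
    """Return the positions of the `limit` largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:limit]

# Hypothetical word weights for one topic (illustrative numbers only).
vocabulary = ["baseball", "game", "market", "stocks", "team"]
topic_word_weights = [0.40, 0.25, 0.02, 0.03, 0.30]

# Hypothetical probability of this same topic in each of four documents.
titles = ["Doc A", "Doc B", "Doc C", "Doc D"]
topic_prob_per_doc = [0.91, 0.08, 0.75, 0.40]

# Keywords for the topic card: the highest-weighted words.
top_words = [vocabulary[i] for i in top_indices(topic_word_weights, 3)]

# Topic documents: the documents where the topic is most probable.
top_docs = [titles[i] for i in top_indices(topic_prob_per_doc, 2)]

print(top_words)  # ['baseball', 'team', 'game']
print(top_docs)   # ['Doc A', 'Doc C']
```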
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(3)">
        <div class="tab_bar">
            <h3 class="tab_title">Topics Over Time</h3>
            <div [ngClass]="isOpen(3) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(3)">
            <p class="tab_text">Topic modeling can be valuable for tracking trends over time. We present topics over time using an approach inspired by Hall et al. (2008), in which we calculate the probability of a topic occurring within a specific date range. This makes it possible to track emerging and disappearing topics.</p>
            <p class="tab_text">For example, in our September 2001 dataset, we can see that New York Times coverage of baseball decreases noticeably following September 11, 2001. This aligns with what we know from history: Bud Selig, then commissioner of baseball, suspended all baseball games for one week following the terrorist attacks.</p>
            <p class="tab_text">One important caveat: gaps or missing articles in the project dataset will affect the results and may suggest a trend over time where none exists.</p>
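            <p class="tab_text">One common way to estimate the probability of a topic within a date range is to average that topic’s per-document probability over all documents published in the range. The sketch below assumes this averaging approach; the exact calculation used in TDM Studio is not specified here, and the function name, weekly buckets, and probabilities are hypothetical.</p>

```python
from collections import defaultdict

def topic_share_by_period(periods, topic_probs):
    """Average a topic's per-document probability within each period."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for period, prob in zip(periods, topic_probs):
        totals[period] += prob
        counts[period] += 1
    # Return (period, mean probability) pairs in chronological order.
    return [(p, totals[p] / counts[p]) for p in sorted(totals)]

# Hypothetical data: the week each article was published and the
# probability of a 'baseball' topic in that article.
periods = ["2001-09-03", "2001-09-03", "2001-09-10", "2001-09-10"]
baseball_prob = [0.50, 0.30, 0.10, 0.10]

for period, share in topic_share_by_period(periods, baseball_prob):
    print(period, round(share, 2))
```

            <p class="tab_text">Because each period’s value is an average over the documents available for that period, a period with only a few documents can swing sharply, which is one way gaps in a dataset can produce misleading trends.</p>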
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(7)">
        <div class="tab_bar">
            <h3 class="tab_title">Additional Recommended Reading</h3>
            <div [ngClass]="isOpen(7) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(7)">
            <p class="tab_text space">Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. <span class="italic">The Journal of Machine Learning Research</span>, 3, pp.993-1022.</p>
            <p class="tab_text space">Hall, D., Jurafsky, D. and Manning, C.D., 2008, October. Studying the history of ideas using topic models. In <span class="italic">Proceedings of the 2008 conference on empirical methods in natural language processing</span> (pp. 363-371).</p>
            <p class="tab_text space">Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S. and Blei, D.M., 2009, December. Reading tea leaves: How humans interpret topic models. In <span class="italic">Advances in Neural Information Processing Systems</span> (Vol. 22, pp. 288-296).</p>
            <p class="tab_text space">Dieng, A.B., Ruiz, F.J. and Blei, D.M., 2019. The dynamic embedded topic model. <span class="italic">arXiv preprint arXiv:1907.05545</span>.</p>
        </div>
    </div>

</div>
