<div class="container column">

    <h2 class="title">Geographic Data Visualization</h2>
    <p class="sub_text">The Geographic Visualization is valuable for exploring trends over space and over time. You can use Geographic Analysis to tackle questions like: </p>
    <br/>
    <ul class="sub_text list">
        <li>How did the Flint water crisis unfold from 2014 to 2020?</li>
        <li>Which countries are most interested in electric cars and solar energy?</li>
        <li>Where are the highest levels of homelessness in the US? Are these states talking about these challenges in the public sphere?</li>
    </ul>
    <br/>
    <p class="sub_text">The following geographic visualization was created from a set of US newspapers using the search terms “drinking water” AND lead AND “contaminate*”.</p>
    <br/>
    
    <img class="img_center img_size" src="../../../../../assets/images/Documentation/geo-main.png" alt="Geographic Data Visualization" >
    
    <br/>
    <p class="sub_text space">Geographic analysis is a difficult task, and in both the data visualization and the exported results the TDM Studio algorithm will sometimes pick the wrong location, e.g. placing “London” in Ontario instead of England. In a teaching and learning context, these geocoding or NER errors can serve as opportunities to discuss the limitations of algorithmic text mining and the challenges surrounding the task.</p>
    
    <p class="sub_text space">Each cluster or circle on the map represents a count of locations identified in the documents. For example, the cluster over Spain contains 12 occurrences of locations that have been resolved to this area. Because a single article can contain several locations, these 12 locations will likely come from fewer than 12 articles.</p>
    
    <p class="sub_text space">By adjusting the time slider, you can see how the number of locations on the map changes over time. The time slider works like a date filter: all of the points that fall within the selected date range are included on the map.</p>
    
    <p class="sub_text space"><b>Important Note:</b> Creating interactive data visualizations can be computationally intensive. To make your visualization available sooner, it may be necessary to visualize only a sample of your dataset, so the locations presented in the geographic analysis visualization are likely a sample of all locations present in the dataset. The minimum and maximum dates on the time slider are based on this sample of plotted locations and often differ from the project date range shown in the project header. For example, if the article with the earliest publication date in a project dataset does not contain any locations, the project date range will differ from the Geographic Analysis time slider date range.</p>


    <div class="info_tab" (click)="toggleTab(1)">
        <div class="tab_bar">
            <h3 class="tab_title">List of Articles </h3>
            <div [ngClass]="isOpen(1) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(1)">
            <p class="tab_text">When you click on a cluster, a drawer opens presenting the list of articles which contain locations included in the selected cluster. The articles are listed in order from most recent to oldest publication date. </p>
            <p class="tab_text">In the above example, I am interested in learning more about why Flint has more locations identified than Chicago even though Chicago has a far greater population. This is due to the Flint Water Crisis, which is also apparent from the list of articles. </p>
            <p class="tab_text">Clicking an article title in the list opens the full-text view of the article in a new tab.</p>
            <p class="tab_text"><b>Important Note:</b> The list of articles may have fewer articles than the number of locations in the selected cluster. This is because most articles contain more than one location.</p>
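            <p class="tab_text">The difference between the two counts can be illustrated with a small Python sketch. The article identifiers and place names are invented for illustration and are not TDM Studio output, and the sketch assumes that repeated mentions within one article each count as a separate location.</p>

```python
# Hypothetical (article_id, resolved_place) pairs for one cluster.
cluster_locations = [
    ("article-1", "Flint"),
    ("article-1", "Flint"),
    ("article-1", "Detroit"),
    ("article-2", "Flint"),
    ("article-3", "Lansing"),
]


def count_locations(locations):
    # Every identified location counts, including repeats within one article.
    return len(locations)


def count_articles(locations):
    # Each article is listed once, however many locations it contributes.
    return len(set(article_id for article_id, _place in locations))


# Prints: 5 locations from 3 articles
print(count_locations(cluster_locations), "locations from",
      count_articles(cluster_locations), "articles")
```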
            <img class="img_center img_size" src="../../../../../assets/images/Documentation/geo-articles.png" alt="List of Articles " >
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(2)">
        <div class="tab_bar">
            <h3 class="tab_title">Export Data</h3>
            <div [ngClass]="isOpen(2) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(2)">
            <p class="tab_text">You can export the geographic data, along with the article metadata, via “Export Data.” After clicking “Export Data,” select the data format that works best for you (CSV or GeoJSON), and the selected file will begin to download. Depending on the size of the file, this can take a few minutes.</p>
            <img class="img_center img_size" src="../../../../../assets/images/Documentation/geo-export.png" alt="Export Data" >
            <p class="tab_text">You can then use this exported data for further analysis. For example, if I wanted to analyze how income and education relate to water crises, I could export the geographic data from TDM Studio for my project and pair this data with other available datasets.</p>
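            <p class="tab_text">As a starting point for working with an exported file, the following Python sketch reads a GeoJSON file and collects the coordinates of its point features. It assumes that each exported location is a GeoJSON Point feature, which this documentation does not confirm, and the function name is hypothetical. GeoJSON stores coordinates in longitude, latitude order, so the sketch swaps them into the more familiar latitude, longitude order.</p>

```python
import json


def load_point_coordinates(path):
    """Return (latitude, longitude) pairs for every Point feature in a GeoJSON file."""
    with open(path, encoding="utf-8") as handle:
        collection = json.load(handle)

    points = []
    for feature in collection.get("features", []):
        # A feature may legally have a null geometry; treat it as empty.
        geometry = feature.get("geometry") or dict()
        if geometry.get("type") == "Point":
            lon, lat = geometry["coordinates"][:2]
            points.append((lat, lon))
    return points
```

            <p class="tab_text">Calling load_point_coordinates with the path of a downloaded file (for example a placeholder name such as "geo_export.geojson") returns a list of coordinate pairs that can be joined with other datasets.</p>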
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(3)">
        <div class="tab_bar">
            <h3 class="tab_title">Geographic Named Entity Recognition (NER)</h3>
            <div [ngClass]="isOpen(3) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(3)">        
            <p class="tab_text">The location information shown in the data visualization is produced by a two-step process: <b>Geotagging</b> and <b>Geocoding</b>. For Geographic Analysis in TDM Studio, the algorithms and approaches were chosen specifically because they are intelligible and open-source licensed.</p>
            <p class="tab_text">The first step, geotagging, is to identify geographic entities within each newspaper document. This can be challenging because a word such as “Charlotte” can be either a person’s name or the name of a location. For this step, TDM Studio uses spaCy’s NER model and pipeline. spaCy provides an overview here: <a href="https://spacy.io/usage/linguistic-features" target="_blank" (click)="toggleTab(3)">https://spacy.io/usage/linguistic-features</a></p>
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(4)">
        <div class="tab_bar">
            <h3 class="tab_title">Candidate Selection</h3>
            <div [ngClass]="isOpen(4) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(4)">
            <p class="tab_text">Once spaCy has identified location entities within each newspaper article (Title, Abstract, Text), TDM Studio uses GeoNames (<a href="https://www.geonames.org/" target="_blank" (click)="toggleTab(4)">https://www.geonames.org</a>) to create a list of candidate places to which each geographic entity could be linked. In other words, when a newspaper article mentions the geographic entity “London,” is it referring to the “London” in England or the “London” in Canada?
                To select candidates from GeoNames, TDM Studio uses exact, lower-cased token matching. Both the official names and the alternate names from GeoNames are included as candidates.</p>
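            <p class="tab_text">The matching step can be sketched in Python as follows. This is a minimal illustration, not TDM Studio's implementation: the small gazetteer below is made up and does not reflect the actual GeoNames data format, and the function names are hypothetical. Each place is indexed under the lower-cased form of its official name and of every alternate name, and an entity matches only if its lower-cased text is exactly equal to one of those keys.</p>

```python
# Made-up gazetteer rows: (candidate label, official name, alternate names).
GAZETTEER = [
    ("London, England", "London", ["Londres", "Londra"]),
    ("London, Ontario", "London", []),
    ("Paris, France", "Paris", ["Parigi"]),
]


def build_index(gazetteer):
    """Map each lower-cased official or alternate name to its candidate labels."""
    index = dict()
    for label, official_name, alternate_names in gazetteer:
        for name in [official_name] + alternate_names:
            index.setdefault(name.lower(), []).append(label)
    return index


def select_candidates(entity_text, index):
    """Return all candidates whose name equals the entity text after lower-casing."""
    return index.get(entity_text.lower(), [])


index = build_index(GAZETTEER)
print(select_candidates("London", index))   # both Londons are candidates
print(select_candidates("Londres", index))  # alternate name, one candidate
print(select_candidates("Lond", index))     # partial strings do not match
```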
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(5)">
        <div class="tab_bar">
            <h3 class="tab_title">Geocoding</h3>
            <div [ngClass]="isOpen(5) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(5)">
            <p class="tab_text">To pick between candidates, TDM Studio uses a gravity-inspired geocoding algorithm developed by ProQuest. The initial inspiration and pilot work were completed via a collaboration with the University of Michigan.</p>
            <p class="tab_text">The primary intuition behind the gravity geocoder is that newspapers have a geographic center and are more likely to discuss places which are closer to that geographic center. For example, when The Guardian mentions “London”, it is more likely to be referring to London, England than to London, Ontario. On the other hand, when The Globe and Mail refers to “London”, it is more likely to be referring to London, Ontario than to London, England.</p>
            <p class="tab_text">To pick which “London” the article is referring to, TDM Studio uses Newton’s formula for gravity: </p>
            <img class="img_center no-border" src="../../../../../assets/images/Documentation/geo-formula.png" alt="formula" width="150" height="30">
            <p class="tab_text">Here, the population of the candidate is used for mass, and the distance between the publisher location and the candidate is used for distance (r). TDM Studio then chooses the candidate with the greatest gravitational force. As a consequence, a given publication will always resolve a place name to the same candidate: The Guardian, for example, will always pick London, England when “London” occurs in one of its articles. This approach has been benchmarked against internal newspaper datasets as well as external evaluation datasets.</p>
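            <p class="tab_text">As a rough illustration of this ranking step, the following Python sketch scores each candidate as its population divided by the squared distance to the publisher. The gravitational constant and whatever mass is assigned to the publisher are the same for every candidate of a given publication, so they do not change which candidate scores highest and are left out. Several details are assumptions rather than TDM Studio's implementation: the haversine distance, the 1 km minimum distance (which avoids division by zero when the publisher sits in the candidate city), the function names, and the approximate coordinates and populations.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0
MIN_DISTANCE_KM = 1.0  # assumed floor so a zero distance cannot divide by zero

# Approximate, illustrative values: (label, latitude, longitude, population)
LONDON_CANDIDATES = [
    ("London, England", 51.5074, -0.1278, 8900000),
    ("London, Ontario", 42.9849, -81.2453, 400000),
]

# Approximate publisher locations: (latitude, longitude)
THE_GUARDIAN = (51.5074, -0.1278)         # London, England
THE_GLOBE_AND_MAIL = (43.6532, -79.3832)  # Toronto, Ontario


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def gravity_score(population, distance_km):
    """Population over squared distance, with the distance floored."""
    r = max(distance_km, MIN_DISTANCE_KM)
    return population / (r * r)


def pick_candidate(publisher, candidates):
    """Return the label of the candidate with the greatest gravity score."""
    def score(candidate):
        _label, lat, lon, population = candidate
        distance = haversine_km(publisher[0], publisher[1], lat, lon)
        return gravity_score(population, distance)
    return max(candidates, key=score)[0]


print(pick_candidate(THE_GUARDIAN, LONDON_CANDIDATES))        # London, England
print(pick_candidate(THE_GLOBE_AND_MAIL, LONDON_CANDIDATES))  # London, Ontario
```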
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(6)">
        <div class="tab_bar">
            <h3 class="tab_title">Subsampling for Visualization</h3>
            <div [ngClass]="isOpen(6) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(6)">
            <p class="tab_text">Each newspaper article may have multiple locations, and it may also mention the same location multiple times. This can result in a very large number of locations and can create challenges for visualization performance. For the Geographic Visualization, random sampling is used to limit the maximum number of points on the map to 4,000. For example, if the project dataset results in 15,000 total locations, TDM Studio takes a random sample of 4,000 of these locations and plots only those 4,000 on the map.</p>
            <p class="tab_text">All locations (in this example, 15,000) which have been geocoded are included in the exportable CSV / GeoJSON files.</p>
            <p class="tab_text"><b>Important Note:</b> In rare cases, long documents will have hundreds or even thousands of locations. If a document has more than twenty locations, only the first twenty locations are included in the results.</p>
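            <p class="tab_text">The two limits described above can be sketched in Python. This is only an illustration under stated assumptions: the documentation says that a random sample is used but not how it is drawn, so sampling without replacement via Python's random module is an assumption, and the function names are hypothetical.</p>

```python
import random

MAX_MAP_POINTS = 4000
MAX_LOCATIONS_PER_DOCUMENT = 20


def cap_per_document(document_locations):
    """Keep only the first twenty locations found in a single document."""
    return document_locations[:MAX_LOCATIONS_PER_DOCUMENT]


def sample_for_map(all_locations, seed=None):
    """Randomly choose at most 4,000 locations to plot on the map."""
    rng = random.Random(seed)
    sample_size = min(MAX_MAP_POINTS, len(all_locations))
    return rng.sample(all_locations, sample_size)


# Example: 15,000 geocoded locations in total, 4,000 of them plotted.
all_locations = list(range(15000))
plotted = sample_for_map(all_locations, seed=0)
print(len(all_locations), "locations exported,", len(plotted), "plotted")
```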
        </div>
    </div>

    <div class="info_tab" (click)="toggleTab(7)">
        <div class="tab_bar">
            <h3 class="tab_title">Additional Recommended Reading</h3>
            <div [ngClass]="isOpen(7) ? 'up_arrow' : 'down_arrow'"></div>
        </div>
        
        <div class="tab_content column" *ngIf="isOpen(7)">
            <p class="tab_text space">Below are a few selected articles which discuss some of the challenges and solutions to Geotagging and Geocoding. These articles were important to the development of a gravity-inspired geocoding algorithm for TDM Studio and are valuable for further exploration.</p>
            <p class="tab_text space">Buscaldi, D. and Magnini, B., 2010. Grounding toponyms in an Italian local news corpus. In <span class="italic">Proceedings of the 6th workshop on geographic information retrieval</span> (pp. 1-5).</p>   
            <p class="tab_text space">DeLozier, G., Baldridge, J. and London, L., 2015. Gazetteer-independent toponym resolution using geographic word profiles. In <span class="italic">Proceedings of the AAAI Conference on Artificial Intelligence</span> (Vol. 29, No. 1).</p>
            <p class="tab_text space">Gritta, M., Pilehvar, M.T., Limsopatham, N. and Collier, N., 2018. What’s missing in geographical parsing? <span class="italic">Language Resources and Evaluation</span>, 52, pp. 603-623.</p>
        </div>
    </div>

</div>