Chapters

Chapter 1: The Computational Library

This chapter introduces the concept of computational research and thinking in libraries. It covers (i) the basic concept of text mining, (ii) the need for text mining in libraries, (iii) understanding text characteristics, and (iv) identifying different text mining problems, including document classification/text categorization, information retrieval, clustering, and information extraction. It presents use cases of libraries that have applied text mining techniques and enumerates the costs, limitations, and benefits of text mining. The chapter is followed by a case study showing the clustering of documents using two different tools.

Chapter 2: Text Data and Where to Find Them?

This chapter first sheds light on the standard data file types, with their usage, advantages, and disadvantages. In a digital library, data may be useless and considered incomplete without a metadata record. Therefore, the functions, uses, components, and importance of metadata are covered comprehensively, followed by steps to create quality metadata, common metadata standards, different metadata repositories, and common concerns and their solutions. The second part of the chapter focuses on the importance of optical character recognition (OCR) for digitized data, followed by different ways of getting data to start a text mining project: (i) online repositories, (ii) relational databases, (iii) web APIs, and (iv) web/screen scraping. Further, several online repositories, language corpora, and repositories with APIs available for text mining are enumerated. Finally, the chapter covers some essential applications of APIs for librarians and the purposes for which librarians can use them in their day-to-day work.
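As an illustration of the web-API route, the sketch below parses a hypothetical JSON payload of the kind a repository REST endpoint might return. The field names (`records`, `title`, `creator`, `date`) are invented for this example and are not any real API's schema.

```python
import json

# A hypothetical response from a repository web API; in practice this
# string would come from an HTTP request to the API's endpoint.
response = """
{
  "records": [
    {"title": "Digital Libraries", "creator": "A. Author", "date": "2019"},
    {"title": "Text Mining Basics", "creator": "B. Writer", "date": "2021"}
  ]
}
"""

def extract_titles(payload):
    """Pull the title of every record out of the JSON payload."""
    data = json.loads(payload)
    return [rec["title"] for rec in data["records"]]

print(extract_titles(response))  # → ['Digital Libraries', 'Text Mining Basics']
```

The same pattern (request, parse JSON, keep the fields you need) underlies most API-based data collection, whatever the actual schema.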

Chapter 3: Text Pre-Processing

This chapter focuses on the theoretical framework of text data pre-processing. It describes the three levels of text representation: lexical, syntactic, and semantic. It further explains the concept of bag of words, word embedding, term frequency and weighting, named entity extraction, and parsing. The chapter is followed by a case study showing text analysis of Tolkien’s books, a web project developed by Emil Johanson.
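The term frequency and weighting idea can be sketched in a few lines of plain Python. This toy tf-idf (assuming whitespace tokenization and a natural-log idf, both simplifications) shows how a word that occurs in every document, such as "the", receives weight zero.

```python
from collections import Counter
from math import log

def term_freq(doc):
    """Relative term frequency within one document."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def tf_idf(docs):
    """tf-idf weight for every term in every document."""
    n = len(docs)
    tfs = [term_freq(d) for d in docs]
    df = Counter(t for tf in tfs for t in tf)  # document frequency
    return [{t: f * log(n / df[t]) for t, f in tf.items()} for tf in tfs]

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
# "the" occurs in every document, so its idf, log(3/3), is zero,
# while "cat" (in two of three documents) keeps a positive weight.
print(weights[0]["the"])  # → 0.0
```

Real pipelines add the lexical steps the chapter describes first (stop-word removal, stemming or lemmatization) before weighting.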

Chapter 4: Topic Modeling

Topic modeling is usually used to identify hidden themes/concepts in a collection using an algorithm based on word frequency across the documents. It can be applied to any textual data commonly present in libraries to make sense of that data. The Latent Dirichlet Allocation (LDA) algorithm is the best-known topic modeling algorithm; it identifies the words that contribute most strongly to each topic. Additionally, it provides topic proportions that can segregate all the documents under the identified themes/topics. Thus, topic modeling can tag each document with a topic, which can later be used to index and link the defined set of documents, if embedded in a website or database, for better searching and retrieval. A comprehensive conceptual framework for topic modeling, its tools, and ways to visualize topic models is covered in this chapter, along with various use cases. The chapter is followed by a case study using three different tools to demonstrate the application of topic modeling in libraries.
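To illustrate the output side of topic modeling (top words per topic, and tagging each document with its dominant topic), here is a toy sketch. The topic-word weights below are hand-made assumptions for the example; LDA would learn such weights from the corpus.

```python
# Hand-made, illustrative topic-to-word weight table. A fitted LDA
# model would estimate these values; here they are assumptions.
topic_words = {
    "cataloging": {"metadata": 0.4, "record": 0.3, "catalog": 0.3},
    "outreach":   {"event": 0.5, "community": 0.3, "program": 0.2},
}

def top_words(topic, k=2):
    """The k most strongly contributing words for a topic."""
    weights = topic_words[topic]
    return sorted(weights, key=weights.get, reverse=True)[:k]

def dominant_topic(doc):
    """Score each topic by summing the weights of words it shares
    with the document, and return the best-scoring topic."""
    tokens = doc.lower().split()
    scores = {t: sum(w.get(tok, 0) for tok in tokens)
              for t, w in topic_words.items()}
    return max(scores, key=scores.get)

print(top_words("outreach"))                        # → ['event', 'community']
print(dominant_topic("a community event program"))  # → 'outreach'
```

Tagging every document this way is exactly what enables the indexing and linking use the chapter describes.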

Chapter 5: Network Text Analysis

This chapter covers the theoretical framework for network text analysis, including its advantages, disadvantages, and other essential features. It then covers various open-source tools that can be used to build a text network. Information professionals may use network text analysis to answer various research questions and obtain a better visual representation of textual data. Use cases showing the application of network text analysis in libraries are also covered. Lastly, to better demonstrate the application of network text analysis in libraries, two case studies are performed using the bibliometrix and textnets packages in the R language.
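A text network of the simplest kind, a word co-occurrence network, can be sketched in plain Python (packages such as textnets build and visualize far richer networks): nodes are words, and an edge's weight counts how many sentences the two words share. The two sample sentences are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_network(sentences):
    """Build a word network as a weighted edge list: for every pair
    of distinct words in a sentence, increment that pair's weight."""
    edges = defaultdict(int)
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1  # keys are alphabetically ordered pairs
    return edges

sentences = [
    "libraries collect data",
    "libraries analyze data",
]
net = cooccurrence_network(sentences)
print(net[("data", "libraries")])  # → 2
```

An edge list like this is what network tools such as Gephi accept as input for layout and visual analysis.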

Chapter 6: Burst Detection

This chapter provides a theoretical framework for burst detection, including its advantages, disadvantages, and other essential features. It further enumerates various open-source tools that can be used to conduct burst detection and discusses use cases showing how information professionals can apply it in their daily work. The chapter is followed by a case study using two different tools to demonstrate the application of burst detection in libraries.
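As a rough intuition for what burst detection looks for, the sketch below flags periods whose counts far exceed the series mean. This crude threshold rule is an assumption for illustration only; real burst detection tools typically implement Kleinberg's state-machine algorithm, and the yearly counts here are made up.

```python
def detect_bursts(counts, factor=2.0):
    """Flag the indices of periods whose count exceeds `factor`
    times the overall mean. A crude threshold rule; Kleinberg's
    algorithm models bursts with a two-state automaton instead."""
    mean = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if c > factor * mean]

# Yearly counts of a term, e.g. from catalog search logs (made-up data).
yearly = [3, 2, 4, 3, 15, 14, 3, 2]
print(detect_bursts(yearly))  # → [4, 5]
```

The flagged indices mark the burst period, i.e. when interest in the term suddenly spiked.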

Chapter 7: Sentiment Analysis

Sentiment (or opinion) analysis employs natural language processing to extract significant patterns of knowledge from large amounts of textual data. It examines comments, opinions, emotions, beliefs, views, questions, preferences, attitudes, and requests communicated by the writer in a string of text. It extracts the writer's feelings in the form of subjectivity (objective or subjective), polarity (negative, positive, or neutral), and emotion (angry, happy, surprised, sad, jealous, or mixed). This chapter covers the theoretical framework and use cases of sentiment analysis in libraries, followed by a case study showing the application of sentiment analysis in libraries using two different tools.
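A minimal lexicon-based polarity scorer shows the core idea behind the polarity dimension. The two tiny word lists are illustrative assumptions; production lexicons (such as those shipped with common sentiment tools) contain thousands of scored terms and handle negation and intensifiers.

```python
# Tiny illustrative lexicons, not a real tool's word lists.
POSITIVE = {"helpful", "friendly", "great", "excellent"}
NEGATIVE = {"slow", "rude", "broken", "poor"}

def polarity(text):
    """Label text positive/negative/neutral by counting lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("great helpful staff"))      # → 'positive'
print(polarity("slow and broken catalog"))  # → 'negative'
```

Applied to patron comments or survey responses, even this simple rule separates praise from complaints at a coarse level.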

Chapter 8: Predictive Modeling

This chapter covers a comprehensive theoretical framework for predictive modeling (or supervised machine learning). It also covers various biases, challenges, solutions, and use cases of predictive modeling in libraries. A case study showing how library professionals can use predictive modeling to index/tag future textual resources without repeating the full text mining workflow each time is also included.
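The indexing/tagging use case can be sketched with a small multinomial naive Bayes classifier in plain Python: train once on labeled examples, then tag new resources automatically. The training examples are made up, and class priors are omitted since the toy classes are balanced.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """examples: list of (text, label). Returns per-label word counts."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Multinomial naive Bayes with add-one (Laplace) smoothing;
    class priors are omitted because the classes here are balanced."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        score = sum(log((c[w] + 1) / (total + len(vocab)))
                    for w in text.lower().split())
        if score > best_score:
            best, best_score = label, score
    return best

examples = [
    ("metadata record catalog", "cataloging"),
    ("catalog record fields", "cataloging"),
    ("community event program", "outreach"),
    ("summer reading program", "outreach"),
]
model = train(examples)
print(predict(model, "new catalog record"))  # → 'cataloging'
```

Once trained, the model tags any future document in one call, which is precisely the labor-saving point of the case study.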

Chapter 9: Information Visualization

This chapter aims to build a theoretical framework for information visualization with a particular focus on libraries. It exhibits and explains fundamental graphs, advanced graphs, and text and document visualizations in detail. It enumerates various rules of visual design and presents use cases from libraries to explain information visualization concepts and how they can be applied comprehensively in libraries. A case study showing how to build a dashboard in the R language is also included. This chapter is helpful for information professionals who (i) are new to the concept of information visualization, (ii) want to know more about information visualization, or (iii) want to visualize their data.
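As a minimal example of a fundamental graph, here is a dependency-free text bar chart; the loan counts are invented for the illustration, and a real dashboard would of course use a plotting library rather than characters.

```python
def bar_chart(data, width=20):
    """Render a horizontal bar chart as text: one bar per category,
    scaled so the largest value fills the full width."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<10} {bar} {value}")
    return "\n".join(lines)

loans = {"fiction": 120, "science": 60, "history": 30}
print(bar_chart(loans))
```

Even this crude rendering follows a core rule of visual design: bar length encodes magnitude on a common scale, so values can be compared at a glance.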

Chapter 10: Tools and Techniques for Text Mining and Visualizations

This chapter covers 19 popular open-access text mining and visualization tools, including R, Topic-Modeling-Tool, RapidMiner, WEKA, Orange, Voyant Tools, Gephi, Tableau Public, Infogram, and Microsoft Power BI, among others, with their applications, pros, and cons. As many text mining and visualization tools are available, we covered only those open-source tools with a simple GUI, so that information professionals who are new to these tools can learn to use and implement them in their daily work.

Chapter 11: Text Data and Mining Ethics

Before leaping to the critical legal and ethical issues related to text mining, it is vital to comprehend (i) the importance of data management for text mining, (ii) the lifecycle of research data, (iii) a data management plan that addresses the various data security, legal, and ethical constraints, (iv) data citation, and (v) data sharing. This chapter covers all the above-stated concepts in addition to legal and ethical issues related to text mining (such as copyright, licenses, fair use, Creative Commons, and digital rights management), algorithm confounding, and social media research. It further presents text mining licensing conditions from selected prominent publishers and a list of do's and don'ts to help library professionals conduct text mining efficiently.

BONUS –

Curated Datasets: This repository contains additional open-access datasets that can be used to practice or teach text mining. The goal of the repository is to serve as a collection of textual datasets for training and practice in text mining/NLP.

Posted on: January 1, 2023