TextReader: An N-gram-Based Text Analysis Tool

Java, Swing, Hash Tables, N-gram Model, Linked Lists, Standard Deviation Analysis

TextReader is a project designed to help users explore word frequency, predictability, and distribution through N-gram analysis. The application reads text files, counts the words in a hash table, and uses a frequency-based N-gram model to produce visualizations and next-word predictions.

Built with Java, the project has a Swing GUI where users can import text documents, extract word frequencies, and visualize how a hash function distributes words across the table. Words are stored in a hash table that resolves collisions with linked lists: words that hash to the same slot share one list. Several hash algorithms are available (e.g., Division, Multiplication, and Universal Hashing), so users can compare how evenly each one spreads the hashed words and how many collisions it produces.
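A minimal sketch of how the division and multiplication methods could feed a linked-list hash table that counts words. This is not TextReader's actual code: the class name `ChainedWordTable`, its methods, the use of `String.hashCode()` as the integer key, and the constant A = (sqrt(5) - 1) / 2 are all assumptions made for illustration. Universal hashing is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical sketch of a chained hash table that counts words. */
class ChainedWordTable {
    /** One node in a slot's linked list: a word and how often it was seen. */
    static final class Entry {
        final String word;
        int count = 1;
        Entry(String word) { this.word = word; }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();
    private final boolean useMultiplication;

    ChainedWordTable(int size, boolean useMultiplication) {
        for (int i = 0; i < size; i++) buckets.add(new LinkedList<>());
        this.useMultiplication = useMultiplication;
    }

    /** Division method: h(k) = k mod m. floorMod keeps the index non-negative. */
    static int divisionHash(int key, int m) {
        return Math.floorMod(key, m);
    }

    /** Multiplication method: h(k) = floor(m * frac(k * A)), assuming A = (sqrt(5) - 1) / 2. */
    static int multiplicationHash(int key, int m) {
        double a = (Math.sqrt(5) - 1) / 2;
        double product = (key & 0x7fffffffL) * a;      // clear the sign bit first
        double frac = product - Math.floor(product);   // fractional part, in [0, 1)
        return (int) (m * frac);
    }

    int indexFor(String word) {
        int key = word.hashCode();                     // assumption: String.hashCode() as the key
        return useMultiplication
                ? multiplicationHash(key, buckets.size())
                : divisionHash(key, buckets.size());
    }

    /** Adds one occurrence of the word; colliding words share the slot's linked list. */
    void add(String word) {
        LinkedList<Entry> chain = buckets.get(indexFor(word));
        for (Entry e : chain) {
            if (e.word.equals(word)) { e.count++; return; }
        }
        chain.add(new Entry(word));
    }

    int count(String word) {
        for (Entry e : buckets.get(indexFor(word))) {
            if (e.word.equals(word)) return e.count;
        }
        return 0;
    }

    /** Length of each slot's linked list, in slot order. */
    int[] chainLengths() {
        int[] lengths = new int[buckets.size()];
        for (int i = 0; i < lengths.length; i++) lengths[i] = buckets.get(i).size();
        return lengths;
    }

    public static void main(String[] args) {
        ChainedWordTable table = new ChainedWordTable(7, false);
        for (String w : "the quick brown fox jumps over the lazy dog".split(" ")) table.add(w);
        System.out.println(table.count("the")); // prints 2
    }
}
```

Passing `true` as the second constructor argument switches the same table to the multiplication method, so the two sets of list lengths can be compared on the same text.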

The application also uses N-gram models to predict the next word in a sequence, which is useful for language learning. Users can choose a 2-gram, 3-gram, 4-gram, or 5-gram model and see how well the program predicts the next word from the preceding context.

Users can also generate histograms of the hash distribution to see how each algorithm affects hashing performance. The histogram shows how well the selected hash function avoids clustering and spreads words evenly across the hash table; shorter, more uniform linked lists mean fewer comparisons per lookup.
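One way to put a number on "evenly distributed" is the standard deviation of the linked list lengths: the closer it is to zero, the more uniform the table. The sketch below assumes the list lengths are already available as an `int[]` and uses the population standard deviation; the class and method names are hypothetical, and TextReader's own calculation may differ.

```java
/** Hypothetical helper for summarizing linked list lengths in a hash table. */
class ChainStats {
    static double mean(int[] lengths) {
        if (lengths.length == 0) return 0.0;
        double sum = 0;
        for (int n : lengths) sum += n;
        return sum / lengths.length;
    }

    /** Population standard deviation: sqrt of the average squared distance from the mean. */
    static double standardDeviation(int[] lengths) {
        if (lengths.length == 0) return 0.0;
        double mean = mean(lengths);
        double sumSquares = 0;
        for (int n : lengths) {
            double d = n - mean;
            sumSquares += d * d;
        }
        return Math.sqrt(sumSquares / lengths.length);
    }

    /** Histogram data: counts[k] is the number of slots whose list has exactly k entries. */
    static int[] lengthHistogram(int[] lengths) {
        int max = 0;
        for (int n : lengths) max = Math.max(max, n);
        int[] counts = new int[max + 1];
        for (int n : lengths) counts[n]++;   // assumes lengths are non-negative
        return counts;
    }

    public static void main(String[] args) {
        int[] even = {2, 2, 2, 2};        // 8 words spread perfectly
        int[] clustered = {8, 0, 0, 0};   // the same 8 words piled into one slot
        System.out.println(standardDeviation(even));       // 0.0
        System.out.println(standardDeviation(clustered));  // about 3.46
    }
}
```

Both example tables hold the same number of words and have the same mean list length, so the standard deviation is what separates the well-spread table from the clustered one.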

Features:

  • Word Frequency Analysis
  • N-gram Word Prediction
  • Hash Table Linked List Length Visualization
  • Multiple Hashing Algorithms for Testing
  • Standard Deviation Analysis for Linked List Length

To learn more about the code and contribute, visit the GitHub repository.

Why I Built TextReader

TextReader is a project I developed out of personal need, as I'm not a native English speaker. Every time I try to read an English book, I find that unfamiliar words frequently interrupt my reading flow, affecting my experience. To solve this issue, I wanted a tool that could help me identify all the words in the book and their frequencies. By learning these words in advance (especially those that are less common), I can improve my comprehension and make reading more enjoyable.

In the future, I plan to extend TextReader to import PDF files and to show explanations of unfamiliar words directly in the application.

How Does Word Prediction Work?

Example: Suppose you have the following sentence: "The quick brown fox".

If you select a 2-gram model and input "the", the application analyzes all occurrences of "the" in the text and predicts the next word based on frequency. For example, the next word could be "quick" if it occurs most frequently after "the".

Similarly, for a 3-gram model, if you input "the quick", the predicted next word might be "brown" if "brown" follows "the quick" most frequently.

These predictions show users which words commonly follow one another, which supports vocabulary learning and makes common phrase patterns easier to recognize.
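The prediction step described above can be sketched as a frequency table that maps each context of n - 1 words to counts of the words that follow it. This is a simplified illustration rather than TextReader's implementation: the class `NGramSketch`, its tokenizer (lowercase, letters and apostrophes only), and the use of `java.util.HashMap` are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical frequency-based next-word predictor for a fixed n. */
class NGramSketch {
    private final int n;
    // context of (n - 1) words -> (next word -> number of times it followed that context)
    private final Map<String, Map<String, Integer>> followers = new HashMap<>();

    NGramSketch(int n) {
        if (n < 2) throw new IllegalArgumentException("n must be at least 2");
        this.n = n;
    }

    /** Simplified tokenizer (an assumption): lowercase, keep only letters and apostrophes. */
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z']+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Slides a window of n words over the text and counts what follows each context. */
    void train(String text) {
        String[] words = tokenize(text);
        for (int i = 0; i + n <= words.length; i++) {
            String context = String.join(" ", Arrays.copyOfRange(words, i, i + n - 1));
            String next = words[i + n - 1];
            followers.computeIfAbsent(context, k -> new HashMap<>())
                     .merge(next, 1, Integer::sum);
        }
    }

    /** Returns the most frequent follower of the context; ties are broken arbitrarily. */
    Optional<String> predict(String context) {
        Map<String, Integer> counts = followers.get(String.join(" ", tokenize(context)));
        if (counts == null) return Optional.empty();   // context never seen in training
        return counts.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        String text = "The quick brown fox. The quick red fox. The lazy dog. The quick brown bear.";
        NGramSketch bigram = new NGramSketch(2);
        bigram.train(text);
        System.out.println(bigram.predict("the").orElse("(no prediction)"));        // quick
        NGramSketch trigram = new NGramSketch(3);
        trigram.train(text);
        System.out.println(trigram.predict("the quick").orElse("(no prediction)")); // brown
    }
}
```

In the sample text, "quick" follows "the" three times and "lazy" only once, so the 2-gram model predicts "quick"; "brown" follows "the quick" twice and "red" once, so the 3-gram model predicts "brown". A context the model never saw yields no prediction.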