Basic Text Processing

Tokenization

1. Word Tokenization

Word tokenization splits text into individual words, treating whitespace and punctuation as delimiters.

Shell Version:

#include <string>
#include <vector>

namespace TextProcessing {

    class WordTokenizer {
    public:
        WordTokenizer(); // Constructor

        // Tokenize the input text into words
        std::vector<std::string> tokenize(const std::string &text);

    private:
        // Helper functions or attributes can be added here
    };

}

Example:

Input: "Hello, world! This is a test."
Output: ["Hello", "world", "This", "is", "a", "test"]
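One possible way to fill in the skeleton is sketched below. It is a minimal sketch, not a reference implementation, and it assumes ASCII input: any run of letters and digits is a word, and every other byte (spaces, punctuation, symbols) is a delimiter. The constructor is left implicit.

```cpp
#include <cctype>
#include <string>
#include <vector>

namespace TextProcessing {

class WordTokenizer {
public:
    // Collect maximal runs of ASCII letters/digits; everything else separates words.
    std::vector<std::string> tokenize(const std::string &text) {
        std::vector<std::string> tokens;
        std::string current;
        for (unsigned char ch : text) {  // unsigned char keeps std::isalnum well defined
            if (std::isalnum(ch)) {
                current.push_back(static_cast<char>(ch));
            } else if (!current.empty()) {
                tokens.push_back(current);  // a delimiter ends the current word
                current.clear();
            }
        }
        if (!current.empty()) {
            tokens.push_back(current);  // flush the final word
        }
        return tokens;
    }
};

}  // namespace TextProcessing
```

With this rule, `WordTokenizer().tokenize("Hello, world! This is a test.")` yields the six words of the example. Because every punctuation mark is a delimiter, inputs such as "Python's" or "COVID-19" would be split into several tokens, so a stricter specification needs additional rules.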

2. Sentence Tokenization

Sentence tokenization divides a text into individual sentences by identifying sentence boundaries.

Shell Version:

#include <string>
#include <vector>

namespace TextProcessing {

    class SentenceTokenizer {
    public:
        SentenceTokenizer(); // Constructor

        // Tokenize the input text into sentences
        std::vector<std::string> tokenize(const std::string &text);

    private:
        // Helper functions or attributes can be added here
    };

}

Example:

Input: "Hello, world! This is a test. How are you?"
Output: ["Hello, world!", "This is a test.", "How are you?"]
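The boundary detection can be sketched with a simple rule-based approach. This is an illustration under stated assumptions rather than a complete solution: a sentence ends at `.`, `!`, or `?` (plus any closing quote or parenthesis) when followed by whitespace or the end of the text, and a period after one of a few title abbreviations is not a boundary. The abbreviation list and the helper `isAbbreviation` are hypothetical and would need extending for real text.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

namespace TextProcessing {

class SentenceTokenizer {
public:
    std::vector<std::string> tokenize(const std::string &text) {
        std::vector<std::string> sentences;
        const std::size_t n = text.size();
        std::size_t start = 0;  // first character of the sentence being built
        for (std::size_t i = 0; i < n; ++i) {
            const char c = text[i];
            if (c != '.' && c != '!' && c != '?') continue;
            // Absorb closing quotes or parentheses that belong to the sentence.
            std::size_t end = i + 1;
            while (end < n && (text[end] == '\'' || text[end] == '"' || text[end] == ')')) ++end;
            // A boundary must be followed by whitespace or the end of the text,
            // so "99.5%" and "al.," do not end a sentence.
            if (end < n && !std::isspace(static_cast<unsigned char>(text[end]))) continue;
            if (c == '.' && isAbbreviation(text, start, i)) continue;
            sentences.push_back(text.substr(start, end - start));
            while (end < n && std::isspace(static_cast<unsigned char>(text[end]))) ++end;
            start = end;
            i = end - 1;  // the loop increment moves i to the next sentence start
        }
        if (start < n) sentences.push_back(text.substr(start));  // trailing text without terminator
        return sentences;
    }

private:
    // True when the word directly before the period at `dot` is a known title.
    static bool isAbbreviation(const std::string &text, std::size_t start, std::size_t dot) {
        static const std::set<std::string> titles = {"Mr", "Mrs", "Ms", "Dr", "Prof"};
        std::size_t wordStart = dot;
        while (wordStart > start &&
               std::isalpha(static_cast<unsigned char>(text[wordStart - 1]))) {
            --wordStart;
        }
        return titles.count(text.substr(wordStart, dot - wordStart)) > 0;
    }
};

}  // namespace TextProcessing
```

The whitespace requirement keeps decimals and times intact, and the title list handles inputs such as "Dr. John Doe was appointed CEO." Abbreviations that can also end a sentence (for example "etc.") are deliberately left out, since a fixed list cannot disambiguate them.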

3. Subword Tokenization

Subword tokenization breaks down words into smaller units, such as prefixes, suffixes, or even individual characters.

Shell Version:

#include <string>
#include <vector>

namespace TextProcessing {

    class SubwordTokenizer {
    public:
        SubwordTokenizer(); // Constructor

        // Tokenize the input word into subwords
        std::vector<std::string> tokenize(const std::string &word);

    private:
        // Helper functions or attributes can be added here
    };

}

Example:

Input: "unhappiness"
Output: ["un", "happiness"] or ["un", "happy", "ness"] (depending on the tokenization strategy)
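One common strategy is greedy longest-match against a fixed vocabulary of subword units. The sketch below assumes such a vocabulary is supplied by the caller, so the constructor takes a parameter that the skeleton above does not have. It always takes the longest vocabulary entry that prefixes the remaining text and falls back to a single character when nothing matches. It only produces literal substrings, so a morphological split such as "happy" from "unhappiness" is out of its reach.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

namespace TextProcessing {

class SubwordTokenizer {
public:
    // The vocabulary is an assumption of this sketch; the skeleton leaves it open.
    explicit SubwordTokenizer(std::unordered_set<std::string> vocabulary)
        : vocabulary_(std::move(vocabulary)) {}

    std::vector<std::string> tokenize(const std::string &word) {
        std::vector<std::string> pieces;
        std::size_t pos = 0;
        while (pos < word.size()) {
            // Try the longest remaining substring first and shrink until it is
            // in the vocabulary; a single character is always accepted.
            std::size_t len = word.size() - pos;
            while (len > 1 && vocabulary_.count(word.substr(pos, len)) == 0) {
                --len;
            }
            pieces.push_back(word.substr(pos, len));
            pos += len;
        }
        return pieces;
    }

private:
    std::unordered_set<std::string> vocabulary_;
};

}  // namespace TextProcessing
```

With the vocabulary {"un", "happiness", "ness"}, the word "unhappiness" is split into "un" and "happiness"; with an empty vocabulary the result degenerates to one piece per character.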

4. Character Tokenization

Character tokenization involves splitting text into individual characters.

Shell Version:

#include <string>
#include <vector>

namespace TextProcessing {

    class CharacterTokenizer {
    public:
        CharacterTokenizer(); // Constructor

        // Tokenize the input text into characters
        std::vector<char> tokenize(const std::string &text);

    private:
        // Helper functions or attributes can be added here
    };

}

Example:

Input: "Hello"
Output: ['H', 'e', 'l', 'l', 'o']
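For single-byte text the skeleton can be completed in one line, as the following sketch shows. It assumes each `char` is one character, which holds for ASCII but not for multi-byte encodings such as UTF-8, where this function would return individual bytes instead.

```cpp
#include <string>
#include <vector>

namespace TextProcessing {

class CharacterTokenizer {
public:
    // Copy every char of the input, including spaces and punctuation.
    std::vector<char> tokenize(const std::string &text) {
        return std::vector<char>(text.begin(), text.end());
    }
};

}  // namespace TextProcessing
```

Nothing is filtered out, so whitespace and punctuation appear in the result as tokens of their own.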

Test Cases for Tokenization

Test Cases for Word Tokenization

Input: "The quick brown fox, who was very quick, jumped over the lazy dog."
Expected Output: ["The", "quick", "brown", "fox", "who", "was", "very", "quick", "jumped", "over", "the", "lazy", "dog"]
Input: "Mr. John Doe, Ph.D., was honored on Dec. 10th, 2020, at 10:00 AM."
Expected Output: ["Mr", "John", "Doe", "Ph.D.", "was", "honored", "on", "Dec", "10th", "2020", "at", "10:00", "AM"]
Input: "In the year 2021, COVID-19 vaccinations increased rapidly worldwide."
Expected Output: ["In", "the", "year", "2021", "COVID-19", "vaccinations", "increased", "rapidly", "worldwide"]
Input: "E-mail addresses such as john.doe@example.com or info@company.org are commonly used."
Expected Output: ["E-mail", "addresses", "such", "as", "john.doe@example.com", "or", "info@company.org", "are", "commonly", "used"]
Input: "Python's popularity soared in 2020; however, C++ remains essential."
Expected Output: ["Python's", "popularity", "soared", "in", "2020", "however", "C++", "remains", "essential"]
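These cases keep punctuation inside a token ("Ph.D.", "10:00", "john.doe@example.com", "Python's", "C++") while dropping it at token edges. One way to get that behaviour is sketched below. The function name `tokenizeKeepingInnerPunctuation` and the set of stripped characters are hypothetical choices for illustration, not part of the skeleton: the text is split on whitespace, leading and trailing punctuation is stripped from each chunk, and a trailing period is restored when the token already contains a period.

```cpp
#include <sstream>
#include <string>
#include <vector>

namespace TextProcessing {

// Hypothetical helper: whitespace split plus edge-punctuation stripping.
inline std::vector<std::string> tokenizeKeepingInnerPunctuation(const std::string &text) {
    const std::string edgePunctuation = ",.;:!?\"'()";
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string chunk;
    while (stream >> chunk) {  // operator>> splits on whitespace
        const std::size_t first = chunk.find_first_not_of(edgePunctuation);
        if (first == std::string::npos) continue;  // chunk was punctuation only
        const std::size_t last = chunk.find_last_not_of(edgePunctuation);
        std::string token = chunk.substr(first, last - first + 1);
        // Dotted abbreviations such as "Ph.D." keep their final period.
        const bool hasInnerPeriod = token.find('.') != std::string::npos;
        if (hasInnerPeriod && last + 1 < chunk.size() && chunk[last + 1] == '.') {
            token += '.';
        }
        tokens.push_back(token);
    }
    return tokens;
}

}  // namespace TextProcessing
```

The period rule is a heuristic: a sentence that ends in a decimal number or a domain name (for example "3.14." or "example.com.") would also keep its final period, so a production tokenizer would need a more careful rule.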

Test Cases for Sentence Tokenization

Input: "Dr. John Doe was appointed CEO. He took over on January 1st, 2021."
Expected Output: ["Dr. John Doe was appointed CEO.", "He took over on January 1st, 2021."]
Input: "She said, 'Let's meet at 10:00 AM.' He replied, 'Sure, see you then!'"
Expected Output: ["She said, 'Let's meet at 10:00 AM.'", "He replied, 'Sure, see you then!'"]
Input: "The results were published in the journal. Prof. Smith, et al., reported success."
Expected Output: ["The results were published in the journal.", "Prof. Smith, et al., reported success."]
Input: "Hello! Are you there? I'm waiting for your reply."
Expected Output: ["Hello!", "Are you there?", "I'm waiting for your reply."]
Input: "He scored 99.5% on his test, which was remarkable. Consequently, he was awarded."
Expected Output: ["He scored 99.5% on his test, which was remarkable.", "Consequently, he was awarded."]

Test Cases for Subword Tokenization

Input: "unbelievable"
Expected Output: ["un", "believable"] or ["un", "believe", "able"]
Input: "counterintuitive"
Expected Output: ["counter", "intuitive"] or ["counter", "intuit", "ive"]
Input: "restructuring"
Expected Output: ["re", "structuring"] or ["re", "structure", "ing"]
Input: "antidisestablishmentarianism"
Expected Output: ["anti", "dis", "establishment", "arianism"]
Input: "internationalization"
Expected Output: ["inter", "national", "ization"] or ["international", "ization"]

Test Cases for Character Tokenization

Input: "Hello, world!"
Expected Output: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Input: "C++ is great!"
Expected Output: ['C', '+', '+', ' ', 'i', 's', ' ', 'g', 'r', 'e', 'a', 't', '!']
Input: "123-456-7890"
Expected Output: ['1', '2', '3', '-', '4', '5', '6', '-', '7', '8', '9', '0']
Input: "E-mail"
Expected Output: ['E', '-', 'm', 'a', 'i', 'l']
Input: " " (Single space)
Expected Output: [' ']

Text Normalization

1. Named Entity Recognition

Named Entity Recognition is the process of identifying and classifying named entities within a text into predefined categories such as persons, organizations, locations, dates, etc. This is an essential task in information extraction, helping to structure unstructured data.

Shell Version:

#include <string>
#include <vector>

namespace TextProcessing {

    enum class EntityType { PERSON, ORGANIZATION, LOCATION, DATE, MISC };

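    struct Entity {
        std::string text;
        EntityType type;
        int startIndex; // 0-based offset of the entity's first character
        int endIndex;   // offset one past the entity's last character (exclusive)
    };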

    class NamedEntityRecognizer {
    public:
        NamedEntityRecognizer(); // Constructor

        // Identify and classify named entities in the input text
        std::vector<Entity> recognize(const std::string &text);

    private:
        // Helper functions or attributes can be added here
    };

}

Simple Input and Output Example:

Input: "Apple Inc. was founded by Steve Jobs in Cupertino."

Output:

Entity 1: {"Apple Inc.", EntityType::ORGANIZATION, 0, 10}
Entity 2: {"Steve Jobs", EntityType::PERSON, 26, 36}
Entity 3: {"Cupertino", EntityType::LOCATION, 40, 49}
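To show how the offsets in an `Entity` relate to the input string, here is a minimal dictionary-lookup (gazetteer) sketch. It is not a real NER model: the recognizer only finds names that the caller lists in advance, and the constructor parameter `gazetteer` is an assumption that the skeleton above does not include. Offsets are computed with `std::string::find`, using a 0-based start and an exclusive end.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

namespace TextProcessing {

enum class EntityType { PERSON, ORGANIZATION, LOCATION, DATE, MISC };

struct Entity {
    std::string text;
    EntityType type;
    int startIndex;  // 0-based offset of the first character
    int endIndex;    // one past the last character
};

class NamedEntityRecognizer {
public:
    // Hypothetical constructor: a list of known names with their types.
    explicit NamedEntityRecognizer(std::vector<std::pair<std::string, EntityType>> gazetteer)
        : gazetteer_(std::move(gazetteer)) {}

    std::vector<Entity> recognize(const std::string &text) {
        std::vector<Entity> entities;
        for (const auto &entry : gazetteer_) {
            const std::string &name = entry.first;
            if (name.empty()) continue;  // an empty name would match everywhere
            // Record every non-overlapping occurrence of this name.
            for (std::size_t pos = text.find(name); pos != std::string::npos;
                 pos = text.find(name, pos + name.size())) {
                entities.push_back({name, entry.second, static_cast<int>(pos),
                                    static_cast<int>(pos + name.size())});
            }
        }
        // Report entities in order of appearance.
        std::sort(entities.begin(), entities.end(),
                  [](const Entity &a, const Entity &b) { return a.startIndex < b.startIndex; });
        return entities;
    }

private:
    std::vector<std::pair<std::string, EntityType>> gazetteer_;
};

}  // namespace TextProcessing
```

For every returned entity, `text.substr(startIndex, endIndex - startIndex)` reproduces the entity text, which is a convenient check when writing expected outputs by hand. The offsets are byte offsets, so they match character positions only for single-byte (ASCII) input.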

Test Input Data and Expected Outputs

Input: "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."

Expected Output:

{"Barack Obama", EntityType::PERSON, 0, 12}
{"August 4, 1961", EntityType::DATE, 25, 39}
{"Honolulu", EntityType::LOCATION, 44, 52}
{"Hawaii", EntityType::LOCATION, 54, 60}

Input: "Google LLC is headquartered in Mountain View, California."

Expected Output:

{"Google LLC", EntityType::ORGANIZATION, 0, 10}
{"Mountain View", EntityType::LOCATION, 31, 44}
{"California", EntityType::LOCATION, 46, 56}

Input: "The United Nations was established on October 24, 1945."

Expected Output:

{"United Nations", EntityType::ORGANIZATION, 4, 18}
{"October 24, 1945", EntityType::DATE, 38, 54}

Input: "Tesla's headquarters are in Palo Alto."

Expected Output:

{"Tesla", EntityType::ORGANIZATION, 0, 5}
{"Palo Alto", EntityType::LOCATION, 28, 37}

Input: "Albert Einstein was born in Ulm on March 14, 1879."

Expected Output:

{"Albert Einstein", EntityType::PERSON, 0, 15}
{"Ulm", EntityType::LOCATION, 28, 31}
{"March 14, 1879", EntityType::DATE, 35, 49}