Basic Text Processing
Tokenization
1. Word Tokenization
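Word tokenization splits a text into individual words, using whitespace and punctuation as delimiters. Punctuation inside a token, such as the hyphen in "COVID-19" or the dots in "john.doe@example.com", is kept (see the test cases below).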
Shell Version:
#include <string>
#include <vector>

namespace TextProcessing {
    class WordTokenizer {
    public:
        WordTokenizer(); // Constructor
        // Tokenize the input text into words
        std::vector<std::string> tokenize(const std::string &text);
    private:
        // Helper functions or attributes can be added here
    };
}
Example:
Input: "Hello, world! This is a test."
Output: ["Hello", "world", "This", "is", "a", "test"]
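As a rough illustration of the description above, the following sketch fills in `tokenize` by splitting on whitespace and trimming punctuation from both ends of each chunk. The strategy and the `isPunct` helper are assumptions, not something the shell prescribes. It keeps interior punctuation such as the hyphen in "COVID-19", but it would also trim "C++" to "C" and "Ph.D." to "Ph.D", so it does not satisfy every test case listed later.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

namespace TextProcessing {

class WordTokenizer {
public:
    WordTokenizer() = default;

    // Split on whitespace, then trim punctuation from both ends of each chunk.
    std::vector<std::string> tokenize(const std::string &text) {
        std::vector<std::string> tokens;
        std::istringstream stream(text);
        std::string chunk;
        while (stream >> chunk) {
            std::size_t begin = 0;
            std::size_t end = chunk.size();
            while (begin < end && isPunct(chunk[begin])) ++begin;
            while (end > begin && isPunct(chunk[end - 1])) --end;
            // Chunks made only of punctuation produce no token.
            if (begin < end) tokens.push_back(chunk.substr(begin, end - begin));
        }
        return tokens;
    }

private:
    // Hypothetical helper: std::ispunct needs an unsigned char argument.
    static bool isPunct(char c) {
        return std::ispunct(static_cast<unsigned char>(c)) != 0;
    }
};

} // namespace TextProcessing
```

Calling `TextProcessing::WordTokenizer().tokenize(...)` on the example input should yield the six words shown above. Handling abbreviations or tokens like "C++" would need extra rules on top of this.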
2. Sentence Tokenization
Sentence tokenization divides a text into individual sentences by identifying sentence boundaries.
Shell Version:
#include <string>
#include <vector>

namespace TextProcessing {
    class SentenceTokenizer {
    public:
        SentenceTokenizer(); // Constructor
        // Tokenize the input text into sentences
        std::vector<std::string> tokenize(const std::string &text);
    private:
        // Helper functions or attributes can be added here
    };
}
Example:
Input: "Hello, world! This is a test. How are you?"
Output: ["Hello, world!", "This is a test.", "How are you?"]
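One possible way to find sentence boundaries is sketched below. It treats '.', '!' and '?' as terminators when they are followed by whitespace or the end of the input, keeps a closing quote with the sentence, and skips a dot that follows a known abbreviation. The abbreviation list and the `endsWithAbbreviation` helper are assumptions made for this illustration; a real tokenizer would need a much larger list or a trained model.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

namespace TextProcessing {

class SentenceTokenizer {
public:
    SentenceTokenizer() = default;

    std::vector<std::string> tokenize(const std::string &text) const {
        std::vector<std::string> sentences;
        std::size_t start = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            const char c = text[i];
            if (c != '.' && c != '!' && c != '?') continue;
            if (c == '.' && endsWithAbbreviation(text, start, i)) continue;
            // Keep closing quotes that directly follow the terminator.
            std::size_t end = i + 1;
            while (end < text.size() && (text[end] == '\'' || text[end] == '"')) ++end;
            // A boundary requires whitespace or end of input ("99.5" is not one).
            if (end < text.size() && !isSpace(text[end])) continue;
            sentences.push_back(text.substr(start, end - start));
            while (end < text.size() && isSpace(text[end])) ++end;
            start = end;
            i = end - 1; // the loop increment moves i to the next sentence start
        }
        // Any trailing text without a terminator becomes the last sentence.
        if (start < text.size()) sentences.push_back(text.substr(start));
        return sentences;
    }

private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    // True if the word that ends right before the dot is a known abbreviation.
    bool endsWithAbbreviation(const std::string &text, std::size_t start,
                              std::size_t dot) const {
        std::size_t wordStart = dot;
        while (wordStart > start && !isSpace(text[wordStart - 1])) --wordStart;
        return abbreviations_.count(text.substr(wordStart, dot - wordStart)) > 0;
    }

    // Illustrative list only (an assumption of this sketch).
    std::set<std::string> abbreviations_{"Mr", "Mrs", "Ms", "Dr", "Prof"};
};

} // namespace TextProcessing
```

The whitespace requirement is what keeps "99.5%" and "Ph.D.," from being split, while the abbreviation check covers titles such as "Dr." that are followed by a space.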
3. Subword Tokenization
Subword tokenization breaks down words into smaller units, such as prefixes, suffixes, or even individual characters.
Shell Version:
#include <string>
#include <vector>

namespace TextProcessing {
    class SubwordTokenizer {
    public:
        SubwordTokenizer(); // Constructor
        // Tokenize the input word into subwords
        std::vector<std::string> tokenize(const std::string &word);
    private:
        // Helper functions or attributes can be added here
    };
}
Example:
Input: "unhappiness"
Output: ["un", "happiness"] or ["un", "happy", "ness"] (depending on the tokenization strategy)
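Because the result depends on the strategy, the sketch below picks one option for illustration: greedy longest-match segmentation against a caller-supplied vocabulary, falling back to a single character when nothing matches. The vocabulary parameter on the constructor is an assumption of this sketch and differs from the default constructor in the shell. Since it only cuts the word into substrings, it can produce ["un", "happiness"] but not a normalized form such as "happy".

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

namespace TextProcessing {

class SubwordTokenizer {
public:
    // Assumption: the set of known subword units is passed in by the caller.
    explicit SubwordTokenizer(std::unordered_set<std::string> vocabulary)
        : vocabulary_(std::move(vocabulary)) {}

    // At each position, take the longest vocabulary entry that matches.
    std::vector<std::string> tokenize(const std::string &word) const {
        std::vector<std::string> pieces;
        std::size_t pos = 0;
        while (pos < word.size()) {
            std::size_t length = word.size() - pos;
            // Shrink the candidate until it is in the vocabulary; a single
            // character is always accepted as a fallback.
            while (length > 1 && vocabulary_.count(word.substr(pos, length)) == 0) {
                --length;
            }
            pieces.push_back(word.substr(pos, length));
            pos += length;
        }
        return pieces;
    }

private:
    std::unordered_set<std::string> vocabulary_;
};

} // namespace TextProcessing
```

With a vocabulary containing "un" and "happiness", the example input is split into those two pieces. Which of the alternative outputs you get is decided entirely by what the vocabulary contains.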
4. Character Tokenization
Character tokenization splits text into individual characters, including whitespace and punctuation.
Shell Version:
#include <string>
#include <vector>

namespace TextProcessing {
    class CharacterTokenizer {
    public:
        CharacterTokenizer(); // Constructor
        // Tokenize the input text into characters
        std::vector<char> tokenize(const std::string &text);
    private:
        // Helper functions or attributes can be added here
    };
}
Example:
Input: "Hello"
Output: ['H', 'e', 'l', 'l', 'o']
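With the `std::vector<char>` return type from the shell, a minimal sketch can simply copy the string's elements. Note that this works on `char` units, so it matches the description only for single-byte text such as ASCII; a multi-byte UTF-8 character would be split into several elements.

```cpp
#include <string>
#include <vector>

namespace TextProcessing {

class CharacterTokenizer {
public:
    CharacterTokenizer() = default;

    // Copy every char of the input, spaces and punctuation included.
    std::vector<char> tokenize(const std::string &text) const {
        return std::vector<char>(text.begin(), text.end());
    }
};

} // namespace TextProcessing
```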
Test Cases for Tokenization
Test Cases for Word Tokenization
Input:
"The quick brown fox, who was very quick, jumped over the lazy dog."
Expected Output: ["The", "quick", "brown", "fox", "who", "was", "very", "quick", "jumped", "over", "the", "lazy", "dog"]
Input:
"Mr. John Doe, Ph.D., was honored on Dec. 10th, 2020, at 10:00 AM."
Expected Output: ["Mr", "John", "Doe", "Ph.D.", "was", "honored", "on", "Dec", "10th", "2020", "at", "10:00", "AM"]
Input:
"In the year 2021, COVID-19 vaccinations increased rapidly worldwide."
Expected Output: ["In", "the", "year", "2021", "COVID-19", "vaccinations", "increased", "rapidly", "worldwide"]
Input:
"E-mail addresses such as john.doe@example.com or info@company.org are commonly used."
Expected Output: ["E-mail", "addresses", "such", "as", "john.doe@example.com", "or", "info@company.org", "are", "commonly", "used"]
Input:
"Python's popularity soared in 2020; however, C++ remains essential."
Expected Output: ["Python's", "popularity", "soared", "in", "2020", "however", "C++", "remains", "essential"]
Test Cases for Sentence Tokenization
Input:
"Dr. John Doe was appointed CEO. He took over on January 1st, 2021."
Expected Output: ["Dr. John Doe was appointed CEO.", "He took over on January 1st, 2021."]
Input:
"She said, 'Let's meet at 10:00 AM.' He replied, 'Sure, see you then!'"
Expected Output: ["She said, 'Let's meet at 10:00 AM.'", "He replied, 'Sure, see you then!'"]
Input:
"The results were published in the journal. Prof. Smith, et al., reported success."
Expected Output: ["The results were published in the journal.", "Prof. Smith, et al., reported success."]
Input:
"Hello! Are you there? I'm waiting for your reply."
Expected Output: ["Hello!", "Are you there?", "I'm waiting for your reply."]
Input:
"He scored 99.5% on his test, which was remarkable. Consequently, he was awarded."
Expected Output: ["He scored 99.5% on his test, which was remarkable.", "Consequently, he was awarded."]
Test Cases for Subword Tokenization
Input:
"unbelievable"
Expected Output: ["un", "believable"]
or ["un", "believe", "able"]
Input:
"counterintuitive"
Expected Output: ["counter", "intuitive"]
or ["counter", "intuit", "ive"]
Input:
"restructuring"
Expected Output: ["re", "structuring"]
or ["re", "structure", "ing"]
Input:
"antidisestablishmentarianism"
Expected Output: ["anti", "dis", "establishment", "arianism"]
Input:
"internationalization"
Expected Output: ["inter", "national", "ization"]
or ["international", "ization"]
Test Cases for Character Tokenization
Input:
"Hello, world!"
Expected Output: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Input:
"C++ is great!"
Expected Output: ['C', '+', '+', ' ', 'i', 's', ' ', 'g', 'r', 'e', 'a', 't', '!']
Input:
"123-456-7890"
Expected Output: ['1', '2', '3', '-', '4', '5', '6', '-', '7', '8', '9', '0']
Input:
"E-mail"
Expected Output: ['E', '-', 'm', 'a', 'i', 'l']
Input:
" "
(Single space)
Expected Output: [' ']
Text Normalization
1. Named Entity Recognition
Named Entity Recognition (NER) identifies named entities in a text and classifies them into predefined categories such as persons, organizations, locations, and dates. It is a core task in information extraction because it turns unstructured text into structured records.
Shell Version:
#include <string>
#include <vector>

namespace TextProcessing {
    enum class EntityType { PERSON, ORGANIZATION, LOCATION, DATE, MISC };
    struct Entity {
        std::string text;
        EntityType type;
        int startIndex; // 0-based offset of the entity's first character
        int endIndex;   // Offset one past the entity's last character
    };
    class NamedEntityRecognizer {
    public:
        NamedEntityRecognizer(); // Constructor
        // Identify and classify named entities in the input text
        std::vector<Entity> recognize(const std::string &text);
    private:
        // Helper functions or attributes can be added here
    };
}
Simple Input and Output Example:
Input: "Apple Inc. was founded by Steve Jobs in Cupertino."
Output:
Entity 1: {"Apple Inc.", EntityType::ORGANIZATION, 0, 10}
Entity 2: {"Steve Jobs", EntityType::PERSON, 26, 36}
Entity 3: {"Cupertino", EntityType::LOCATION, 40, 49}
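To make the `Entity` fields concrete, here is a minimal sketch of a dictionary-based (gazetteer) recognizer. This is one simple technique chosen for illustration, not necessarily the intended approach. The gazetteer constructor parameter is an assumption of this sketch, and offsets are taken to be 0-based with an exclusive end.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

namespace TextProcessing {

enum class EntityType { PERSON, ORGANIZATION, LOCATION, DATE, MISC };

struct Entity {
    std::string text;
    EntityType type;
    int startIndex; // 0-based, inclusive (assumed convention)
    int endIndex;   // exclusive (assumed convention)
};

class NamedEntityRecognizer {
public:
    // Assumption: known entity strings and their types are supplied up front.
    explicit NamedEntityRecognizer(
        std::vector<std::pair<std::string, EntityType>> gazetteer)
        : gazetteer_(std::move(gazetteer)) {}

    // Report every exact occurrence of a gazetteer entry, ordered by position.
    std::vector<Entity> recognize(const std::string &text) const {
        std::vector<Entity> entities;
        for (const auto &entry : gazetteer_) {
            const std::string &name = entry.first;
            if (name.empty()) continue;
            std::size_t pos = text.find(name);
            while (pos != std::string::npos) {
                entities.push_back({name, entry.second,
                                    static_cast<int>(pos),
                                    static_cast<int>(pos + name.size())});
                pos = text.find(name, pos + name.size());
            }
        }
        std::sort(entities.begin(), entities.end(),
                  [](const Entity &a, const Entity &b) {
                      return a.startIndex < b.startIndex;
                  });
        return entities;
    }

private:
    std::vector<std::pair<std::string, EntityType>> gazetteer_;
};

} // namespace TextProcessing
```

A lookup like this only finds entities it already knows and cannot resolve ambiguity (for example "Apple" the company versus the fruit), and it does not recognize open-ended patterns such as dates. Those cases usually call for pattern rules or a statistical model.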
Test Input Data and Expected Outputs
Input: "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."
Expected Output:
{"Barack Obama", EntityType::PERSON, 0, 12}
{"August 4, 1961", EntityType::DATE, 25, 39}
{"Honolulu", EntityType::LOCATION, 44, 52}
{"Hawaii", EntityType::LOCATION, 54, 60}
Input: "Google LLC is headquartered in Mountain View, California."
Expected Output:
{"Google LLC", EntityType::ORGANIZATION, 0, 10}
{"Mountain View", EntityType::LOCATION, 31, 44}
{"California", EntityType::LOCATION, 46, 56}
Input: "The United Nations was established on October 24, 1945."
Expected Output:
{"United Nations", EntityType::ORGANIZATION, 4, 18}
{"October 24, 1945", EntityType::DATE, 38, 54}
Input: "Tesla's headquarters are in Palo Alto."
Expected Output:
{"Tesla", EntityType::ORGANIZATION, 0, 5}
{"Palo Alto", EntityType::LOCATION, 28, 37}
Input: "Albert Einstein was born in Ulm on March 14, 1879."
Expected Output:
{"Albert Einstein", EntityType::PERSON, 0, 15}
{"Ulm", EntityType::LOCATION, 28, 31}
{"March 14, 1879", EntityType::DATE, 35, 49}