TF-IDF stands for “Term Frequency, Inverse Document Frequency.” It’s a way to score the importance of words (or “terms”) in a document based on how often they appear in that document and how rare they are across a collection of documents.
Intuitively: if a word appears frequently in a document, it’s important, so give the word a high score. But if a word appears in many documents, it’s not a unique identifier, so give the word a low score. Therefore, common words like “the” and “for,” which appear in many documents, will be scaled down, while words that appear frequently in a single document will be scaled up.
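In formula form, this is roughly what scikit-learn computes by default (its smoothed variant; other references define idf slightly differently). Here n is the total number of documents and df(t) is the number of documents containing term t:

tfidf(t, d) = tf(t, d) × idf(t)

idf(t) = ln((1 + n) / (1 + df(t))) + 1

By default, scikit-learn then normalizes each document’s vector to unit (L2) length, so long and short documents are comparable.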
Scikit-learn provides two methods to get to our end result (a tf-idf weight matrix).
One is a two-step process: the CountVectorizer class counts how many times each term shows up in each document, and then the TfidfTransformer class turns those counts into the weight matrix.
The other does both steps with a single TfidfVectorizer class.
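As a minimal sketch of the one-step route (the toy corpus here is my own illustration, not the one from the notebook):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse tf-idf weight matrix
print(tfidf_matrix.shape)  # (3 documents, one column per unique term)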
After setting up our CountVectorizer, we follow the general fit/transform convention of scikit-learn, starting with fitting.
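A sketch of what that looks like with the same toy corpus as above (again an illustration, not the notebook’s exact code):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vectorizer = CountVectorizer()
count_vectorizer.fit(corpus)                 # fit: learn the vocabulary
counts = count_vectorizer.transform(corpus)  # transform: term counts per document

transformer = TfidfTransformer()
tfidf_matrix = transformer.fit_transform(counts)  # tf-idf weight matrix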
You can find the complete notebook for this post on my GitHub at the link below; feel free to ask any questions about this post:
https://github.com/abhibisht89/tfidf_example