python - Calculate tf-idf of strings

- June 15, 2010

I have 2 documents, doc1.txt and doc2.txt. The contents of these 2 documents are:

    #doc1.txt
    good, bad, great

    #doc2.txt
    bad, restaurent, nice place visit

I want to split the corpus on commas, so that each comma-separated string is one term and the final DocumentTermMatrix becomes:

          terms
    docs    good      bad       great     restaurent   nice place visit
    doc1    tf-idf    tf-idf    tf-idf    0            0
    doc2    0         tf-idf    0         tf-idf       tf-idf

I know how to calculate the DocumentTermMatrix of individual words (using http://scikit-learn.org/stable/modules/feature_extraction.html), but I don't know how to calculate the DocumentTermMatrix of strings in Python.

You can pass a function as the analyzer argument of TfidfVectorizer, so that features are extracted in a customized way:

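    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['very good, very bad, you great',
            'very bad, good restaurent, nice place visit']

    # Treat each comma-separated phrase as a single feature
    tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(list(tfidf.get_feature_names_out()))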

The resulting features are:

['good restaurent', 'nice place visit', 'very bad', 'very good', 'you great']
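The question asks for the tf-idf matrix itself, not only the feature names. A minimal sketch of that step follows, assuming example documents chosen so that they yield the features listed above. The helper name `split_on_commas` is made up for illustration, and the column order is read from the fitted `vocabulary_` attribute so that the snippet does not depend on which feature-name method the installed scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed example documents, one comma-separated string per document
docs = ['very good, very bad, you great',
        'very bad, good restaurent, nice place visit']

def split_on_commas(doc):
    # Hypothetical helper: each comma-separated phrase becomes one term
    return [term.strip() for term in doc.split(',') if term.strip()]

vectorizer = TfidfVectorizer(analyzer=split_on_commas)
matrix = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# Column order of the matrix, recovered from the fitted vocabulary
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = matrix.toarray()

for name, row in zip(['doc1', 'doc2'], dense):
    # One row of the document-term matrix, keyed by term
    print(name, dict(zip(terms, row.round(3))))
```

Terms that do not occur in a document get a weight of 0, and with the default settings each row is L2-normalized, so the weights are comparable across documents of different lengths.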

If you cannot afford to load all of the data into memory, here is a workaround:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['doc1.txt', 'doc2.txt']

    def extract(filename):
        # Each "document" is a filename; read it and split on ', '
        with open(filename) as f:
            features = []
            for line in f:
                features += line.strip().split(', ')
            return features

    tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
    # scikit-learn >= 1.0; older releases use get_feature_names()
    print(list(tfidf.get_feature_names_out()))

This loads the documents one at a time, without holding all of them in memory at once.
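To try the file-based approach without preparing files by hand, the sketch below writes the two documents from the question into a temporary directory and then vectorizes them by path. The temporary-directory setup and the extra whitespace handling in `extract` are assumptions added for the demonstration, not part of the answer above.

```python
import os
import tempfile
from sklearn.feature_extraction.text import TfidfVectorizer

def extract(filename):
    # Read one file and return its comma-separated phrases as terms
    features = []
    with open(filename) as f:
        for line in f:
            features += [t.strip() for t in line.split(',') if t.strip()]
    return features

with tempfile.TemporaryDirectory() as tmp:
    # Assumed file contents, taken from the question's two documents
    contents = {'doc1.txt': 'good, bad, great\n',
                'doc2.txt': 'bad, restaurent, nice place visit\n'}
    paths = []
    for name, text in contents.items():
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write(text)
        paths.append(path)

    # The vectorizer only sees paths; extract() opens one file at a time
    vectorizer = TfidfVectorizer(analyzer=extract)
    matrix = vectorizer.fit_transform(paths)

file_terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(file_terms)
print(matrix.toarray().round(3))
```

Only the resulting sparse matrix and the vocabulary stay in memory after fitting; the raw text of each file is discarded as soon as `extract` returns.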