python - Reduce "for loops" for big data and make improvements -
I am trying to make my code (which I wrote) as fast as possible. First, the code follows:
    # lemmas: a list consisting of 20,000 words.
    # that is, lemmas = ['apple', 'dog', ...]
    # new_sents: a list consisting of 12,000 lists, each representing a sentence.
    # that is, new_sents = [['hello', 'i', 'am', 'a', 'boy'], ['hello', 'i', 'am', 'a', 'girl'], ...]

    for x in lemmas:
        for y in lemmas:
            # prevent 0 denominator
            x_count = 0.00001
            y_count = 0.00001
            xy_count = 0

            # dice denominator
            for i in new_sents:
                x_count += i.count(x)
                y_count += i.count(y)
                if x in i and y in i:
                    xy_count += 1

            sim_score = float(xy_count) / (x_count + y_count)
As you can see, there are many iterations: 20,000 * 20,000 * 12,000 is a big number. sim_score is the Dice coefficient of the 2 words. That is, xy_count is the number of sentences in which word x and word y appear together, and x_count and y_count are the total number of times word x and word y appear in new_sents, respectively.
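For a concrete picture, here is a toy example (made-up data, just to illustrate what the counts mean):

    # toy corpus of 2 sentences (hypothetical, only to show the counts)
    new_sents = [['the', 'dog', 'chased', 'the', 'cat'],
                 ['the', 'dog', 'slept']]

    # for x = 'dog', y = 'cat':
    #   x_count  = 0.00001 + 1 + 1 = 2.00001   ('dog' once in each sentence)
    #   y_count  = 0.00001 + 1 + 0 = 1.00001   ('cat' once in total)
    #   xy_count = 1                            (they co-occur in one sentence)
    #   sim_score = 1 / (2.00001 + 1.00001), roughly 0.33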
The code I made is slow. Is there a better way?
Thanks in advance.
You are computing each thing twice. The score is symmetrical in x and y, so you can get a 2-fold speedup by doing this:
    for x, y in itertools.combinations(lemmas, 2):
I am assuming you don't want to compare lemmas[0] with itself; otherwise you can use combinations_with_replacement.
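A minimal sketch of that restructuring (using the variables from the question, with its smoothing constant kept):

    import itertools

    # lemmas and new_sents as defined in the question
    scores = {}
    for x, y in itertools.combinations(lemmas, 2):
        x_count = 0.00001
        y_count = 0.00001
        xy_count = 0
        for sent in new_sents:
            x_count += sent.count(x)
            y_count += sent.count(y)
            if x in sent and y in sent:
                xy_count += 1
        # the score is symmetric, so one computation covers both orderings
        scores[(x, y)] = scores[(y, x)] = float(xy_count) / (x_count + y_count)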
The implementation would also be faster if lemmas were a set.
But you are still computing the same thing several times. You can take each lemma, count its occurrences in new_sents once, and store it.
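A sketch of that idea (my own variant, untested on your data): count every lemma once with collections.Counter, and count co-occurring pairs per sentence, so each pair's score becomes two dictionary lookups instead of a pass over all 12,000 sentences:

    from collections import Counter
    from itertools import combinations

    # lemmas and new_sents as in the question
    lemma_set = set(lemmas)                      # O(1) membership tests

    # total occurrences of every lemma, computed in one pass over the corpus
    word_counts = Counter()
    for sent in new_sents:
        word_counts.update(w for w in sent if w in lemma_set)

    # number of sentences in which each pair of lemmas co-occurs
    pair_counts = Counter()
    for sent in new_sents:
        present = sorted(set(sent) & lemma_set)  # lemmas present in this sentence
        pair_counts.update(combinations(present, 2))

    def sim_score(x, y):
        # same score as the original, with the 0.00001 smoothing kept on each count
        xy = pair_counts[tuple(sorted((x, y)))]
        return float(xy) / (word_counts[x] + word_counts[y] + 0.00002)

The expensive part (scanning the sentences) now happens once rather than once per pair; if you still need every pairwise score, the 20,000 * 20,000 loop reduces to cheap lookups.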