python - Reduce "for loops" for big data and make improvement -


I am trying to make the code I wrote run as fast as possible. First, the code follows:

# lemmas is a list consisting of 20,000 words.
# that is, lemmas = ['apple', 'dog', ... ]
# new_sents is a list consisting of 12,000 lists, each representing a sentence.
# that is, new_sents = [ ['hello', 'i', 'am', 'a', 'boy'], ['hello', 'i', 'am', 'a', 'girl'], ... ]

for x in lemmas:
    for y in lemmas:
        # prevent 0 denominator
        x_count = 0.00001
        y_count = 0.00001
        xy_count = 0

        ## dice denominator
        for i in new_sents:
            x_count += i.count(x)
            y_count += i.count(y)
            if x in i and y in i:
                xy_count += 1

        sim_score = float(xy_count) / (x_count + y_count)

As you can see, there are many iterations: 20,000 * 20,000 * 12,000 is a big number. sim_score is the Dice coefficient of the two words. That is, xy_count is the number of sentences in which both word x and word y appear, and x_count and y_count are the total number of times words x and y occur in new_sents, respectively.
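In other words, for each pair the score is xy_count / (x_count + y_count). A minimal, self-contained sketch of that computation on a toy corpus (the helper dice_score is only my illustration, not part of the question):

    def dice_score(x, y, sents):
        # total occurrences of each word across all sentences
        # (the small epsilon avoids a 0 denominator, as in the original code)
        x_count = 1e-5 + sum(s.count(x) for s in sents)
        y_count = 1e-5 + sum(s.count(y) for s in sents)
        # number of sentences containing both words
        xy_count = sum(1 for s in sents if x in s and y in s)
        return xy_count / (x_count + y_count)

    sents = [['hello', 'i', 'am', 'a', 'boy'], ['hello', 'i', 'am', 'a', 'girl']]
    print(dice_score('hello', 'boy', sents))   # 1 / (2 + 1) ≈ 0.333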

The code I made is too slow. Is there a better way?

Thanks in advance.

You are computing everything twice. The score is symmetrical in x and y, so you can get a 2-fold speedup by doing this:

import itertools
for x, y in itertools.combinations(lemmas, 2):

I am assuming you don't want to compare lemmas[0] with itself; otherwise you can use combinations_with_replacement.
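For reference, this is how the two itertools functions differ on a toy list (just an illustration):

    import itertools

    print(list(itertools.combinations(['a', 'b', 'c'], 2)))
    # [('a', 'b'), ('a', 'c'), ('b', 'c')]  -- no self-pairs

    print(list(itertools.combinations_with_replacement(['a', 'b', 'c'], 2)))
    # [('a', 'a'), ('a', 'b'), ('a', 'c'), ('b', 'b'), ('b', 'c'), ('c', 'c')]  -- includes self-pairs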

The implementation will also be faster if lemmas is a set.

But you are still computing the same thing several times. You can take each lemma, count it in new_sents once, and store it.
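Here is a minimal sketch of how those two suggestions could be combined; total_counts, sent_sets and scores are names I am introducing for illustration, and I keep the small epsilon from the original code:

    from collections import Counter
    import itertools

    # count every lemma in new_sents once, in a single pass over the corpus
    total_counts = Counter()
    for sent in new_sents:
        total_counts.update(sent)

    # turn each sentence into a set for fast membership tests
    sent_sets = [set(sent) for sent in new_sents]

    scores = {}
    for x, y in itertools.combinations(lemmas, 2):
        # number of sentences containing both x and y
        xy_count = sum(1 for s in sent_sets if x in s and y in s)
        x_count = total_counts[x] + 1e-5
        y_count = total_counts[y] + 1e-5
        scores[(x, y)] = xy_count / (x_count + y_count)

The denominators are now computed in one pass over the corpus instead of once per pair; the co-occurrence count still loops over the sentences, but only once per unordered pair.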

