r - SMOTE fails making oversampling -


I have oversampled my dataset using SMOTE, which is included in the DMwR package.

My dataset has 2 classes. The original distribution is 12 vs 62. So, I coded this oversampling:

newdata <- SMOTE(score ~ ., data, k = 3, perc.over = 400, perc.under = 150)

Now, the distribution is 60 vs 72.

However, when I display the 'newdata' dataset to see how SMOTE has done the oversampling, I find that some samples are repeated.

For example, sample number 24 appears as 24.1, 24.2 and 24.3.

Is this correct? It directly affects classification, because the classifier will learn a model from data that will also be present in the test set, which is not legitimate in classification.

Edit: I think I didn't explain my issue correctly:

As you know, SMOTE is an oversampling technique. It creates new samples from the original ones by modifying the values of their features. However, when I display the new data generated by SMOTE, I obtain this:

(These values are the values of the features.) Sample 50: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645 0.008043167

Sample 50.1: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645

Sample 50 belongs to the original dataset. Sample 50.1 is the 'artificial' sample generated by SMOTE. But (and this is the issue) SMOTE has created a repeated sample, instead of creating an artificial one by modifying the values of the features 'a bit'.

I hope you can understand me.

Thanks!

The SMOTE algorithm generates synthetic examples of a given class (the minority class) to handle imbalanced distributions. This strategy for generating new data is then combined with random under-sampling of the majority class. When you use SMOTE in the DMwR package, you need to specify an over-sampling percentage and an under-sampling percentage. These values must be set carefully, because the resulting distribution of the data may remain imbalanced.
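To make "synthetic example" concrete, here is a minimal base R sketch of the interpolation idea behind SMOTE for numeric features. It is a simplified illustration, not DMwR's actual code, and `smote_point` is a hypothetical helper name.

```r
# Simplified SMOTE-style interpolation for numeric features.
# `smote_point` is a hypothetical helper, not part of DMwR.
smote_point <- function(x, neighbour) {
  gap <- runif(1)            # random position along the segment
  x + gap * (neighbour - x)  # new point between x and its neighbour
}

set.seed(42)
x         <- c(1.0, 2.0)  # a minority-class example (made-up values)
neighbour <- c(3.0, 6.0)  # one of its nearest minority neighbours (made-up)
synthetic <- smote_point(x, neighbour)
print(synthetic)
```

A point generated this way lies between the two originals, so it should not be an exact copy of either one. Exact copies in the output therefore have to come from somewhere else.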

In your case, and given the parameters you set, namely the percentages of under- and over-sampling, SMOTE will introduce replicas of the examples of your majority class.

Your initial class distribution is 12 to 62, and after applying SMOTE you end up with 60 to 72. This means that the minority class was oversampled with SMOTE and new synthetic examples of this class were produced.

However, your majority class, which had 62 examples, now contains 72! The under-sampling percentage applied to this class actually increased its number of examples. Since the number of examples to select from the majority class is determined by the number of new examples generated for the minority class, the number of examples sampled from this class was larger than the number that already existed.

Therefore, you had 62 examples and the algorithm tried to randomly select 72! This means that some replicas of the examples of the majority class were introduced.
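The row names like 24.1, 24.2 and 24.3 are what base R produces when the same row of a data frame is selected more than once. The sketch below reproduces that with a made-up majority class of 62 rows. It only assumes that the 72 rows are drawn by random sampling with replacement, and it does not call DMwR at all.

```r
# Made-up majority class with 62 rows (values are illustrative only).
set.seed(1)
majority <- data.frame(x = rnorm(62), score = "maj")

# Drawing 72 rows from 62 forces at least 10 repeats.
idx   <- sample(seq_len(nrow(majority)), 72, replace = TRUE)
drawn <- majority[idx, ]

# Repeated rows get suffixed row names such as "24.1", "24.2".
n_dup     <- sum(duplicated(drawn))
dup_names <- rownames(drawn)[duplicated(drawn)]
print(n_dup)
print(head(dup_names))
```

Running `duplicated()` on your own 'newdata' in the same way should show whether the repeated rows all belong to the majority class.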

So, to explain the over-sampling and under-sampling that you selected:

12 examples from the minority class with 400% of oversampling gives: 12*400/100=48. So, 48 new synthetic examples were added to the minority class (12+48=60, the final number of examples for the minority class).

The number of examples to select from the majority class is: 48*150/100=72. But the majority class only has 62, so replicas are necessarily introduced.
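The same arithmetic can be written as a small base R helper, which makes it easy to try other percentages before calling SMOTE. `smote_counts` is a hypothetical name used only for this sketch. It just reproduces the calculation above and is not part of DMwR.

```r
# Hypothetical helper: reproduces the count arithmetic described above.
smote_counts <- function(n_min, n_maj, perc.over, perc.under) {
  n_synthetic <- n_min * perc.over / 100
  n_maj_drawn <- as.integer(n_synthetic * perc.under / 100)
  list(
    synthetic      = n_synthetic,          # new minority examples
    minority_final = n_min + n_synthetic,  # originals plus synthetic
    majority_drawn = n_maj_drawn,          # rows sampled from the majority
    forced_copies  = max(0, n_maj_drawn - n_maj)  # unavoidable repeats
  )
}

counts <- smote_counts(n_min = 12, n_maj = 62, perc.over = 400, perc.under = 150)
str(counts)

# Largest perc.under that does not ask for more rows than exist.
max_perc_under <- floor(62 / counts$synthetic * 100)
print(max_perc_under)
```

With the numbers from the question, this gives 48 synthetic examples, 60 minority examples in total, 72 majority rows requested, and at least 10 forced copies. A perc.under of 129 or less keeps the request within the 62 available rows. If the rows are drawn with replacement, some repeats can still occur by chance.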

