Leave-One-Out Cross-Validation (LOOCV)

Leave-One-Out Cross-Validation is the extreme case of k-fold cross-validation in which k equals the number of samples N: the dataset is split into N parts, N-1 samples form the training set and the single remaining sample is the test set, and the held-out sample is rotated until every sample has been tested exactly once. Its main purpose is to guard against overfitting and to evaluate the model's generalization ability. Its drawback is a long computation time.
When to use:

Small datasets. With an ordinary train/validation split, data that is already scarce shrinks further once a validation set is carved out; LOOCV makes full use of every sample for training.
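The rotation described above is short enough to sketch directly. The following is a minimal, self-contained illustration in Python/NumPy (a sketch with a made-up toy dataset, not the MATLAB code from this post), using a 1-NN classifier:

```python
import numpy as np

def loocv_error(X, y, classify):
    """LOOCV: each sample is the test set exactly once."""
    n = len(y)
    errors = 0
    for j in range(n):
        mask = np.ones(n, dtype=bool)
        mask[j] = False                      # leave sample j out
        pred = classify(X[mask], y[mask], X[j])
        errors += int(pred != y[j])
    return errors / n                        # LOOCV error rate

def nn1(train_X, train_y, test_x):
    """1-nearest-neighbour by Euclidean distance."""
    d = np.sum((train_X - test_x) ** 2, axis=1)
    return train_y[np.argmin(d)]

# toy data: two well-separated clusters
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loocv_error(X, y, nn1))   # 0.0 — every left-out sample is classified correctly
```

With N samples this trains N models per evaluation, which is exactly the cost the fast variant avoids.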
Fast LOOCV KNN
Because LOOCV splits the data N times and produces N folds, one evaluation round has to train N models, which greatly increases the runtime. To avoid this, we can exploit a property of LOOCV: the pairwise distances between samples (or intermediate values of those distances) never change, so they can be computed once in advance and stored; each LOOCV evaluation then simply looks them up by index. The code below uses feature selection as a demo to validate the fast LOOCV KNN.
FSKNN1 is the plain KNN and FSKNN2 is the fast KNN.
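The caching idea can be sketched as follows (a Python/NumPy illustration under the same leave-one-out setup, not the author's MATLAB code): the per-feature squared differences for every sample pair are computed once, and evaluating any feature subset afterwards reduces to summing a slice of the cached array.

```python
import numpy as np

def precompute_sq_diffs(X):
    """Cache squared differences per sample pair and per feature: shape (n, n, d)."""
    return (X[:, None, :] - X[None, :, :]) ** 2

def loocv_knn_cached(cache, y, selected):
    """LOOCV 1-NN error using only the cached columns of the selected features."""
    dist = cache[:, :, selected].sum(axis=2)  # squared distances on that subset
    np.fill_diagonal(dist, np.inf)            # a sample may not be its own neighbour
    nearest = dist.argmin(axis=1)
    return np.mean(y[nearest] != y)

X = np.array([[0.0, 9.0], [0.1, 1.0], [5.0, 9.1], [5.1, 1.1]])
y = np.array([0, 0, 1, 1])
cache = precompute_sq_diffs(X)                    # computed once
print(loocv_knn_cached(cache, y, np.array([0])))  # feature 0 only -> 0.0
print(loocv_knn_cached(cache, y, np.array([1])))  # feature 1 only -> 1.0
```

Trying a different feature subset costs only a sum over cached values, instead of recomputing all pairwise distances from scratch.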
Main script: main.m
clc; clear;
[train_F,train_L,test_F,test_L] = divide_dlbcl();
dim = size(train_F,2);
individual = rand(1,dim);                  % one random weight per feature
global choice
choice = 0.5;                              % features with weight > choice are selected
global knnIndex
[knnIndex] = preKNN(individual,train_F);   % cache per-pair squared differences once
% time 100 evaluations of each variant
tic
for i = 1:100
    [err1,fs1] = FSKNN1(individual,train_F,train_L);   % plain LOOCV KNN
end
toc
tic
for i = 1:100
    [err2,fs2] = FSKNN2(individual,train_F,train_L);   % fast LOOCV KNN via knnIndex
end
toc
Dataset split: divide_dlbcl.m
function [train_F,train_L,test_F,test_L] = divide_dlbcl()
load DLBCL.mat;            % provides ins (features) and lab (labels)
dataMat = ins;
len = size(dataMat,1);
% min-max normalization to [0,1]
maxV = max(dataMat);
minV = min(dataMat);
range = maxV - minV;
newdataMat = (dataMat - repmat(minV,[len,1])) ./ repmat(range,[len,1]);
% 10-fold indices: folds 1-3 (30%) become the test set, the rest the training set
Indices = crossvalind('Kfold', length(lab), 10);
site = find(Indices==1 | Indices==2 | Indices==3);
test_F = newdataMat(site,:);
test_L = lab(site);
site2 = find(Indices~=1 & Indices~=2 & Indices~=3);
train_F = newdataMat(site2,:);
train_L = lab(site2);
end
Plain KNN

FSKNN1.m
function [err,fs] = FSKNN1(x,train_F,train_L)
global choice
inmodel = x > choice;      % select features whose weight exceeds the threshold
k = 1;
train_f = train_F(:,inmodel);
train_length = size(train_F,1);
flag = true(train_length,1);
err = 0;
for j = 1:train_length
    flag(j) = false;                 % leave sample j out
    CtrainF = train_f(flag,:);
    CtrainL = train_L(flag);
    CtestF = train_f(~flag,:);
    CtestL = train_L(~flag);
    classifyresult = KNN1(CtestF,CtrainF,CtrainL,k);
    if CtestL ~= classifyresult
        err = err + 1;
    end
    flag(j) = true;
end
err = err/train_length;              % LOOCV error rate
fs = sum(inmodel);                   % number of selected features
end
KNN1.m
function relustLabel = KNN1(inx,data,labels,k)
% inx: test sample; data: training samples; labels: training labels; k: typically 1..3
[datarow, ~] = size(data);
diffMat = repmat(inx,[datarow,1]) - data;
distanceMat = sqrt(sum(diffMat.^2,2));   % Euclidean distance to every training sample
[B, IX] = sort(distanceMat,'ascend');
len = min(k,length(B));
relustLabel = mode(labels(IX(1:len)));   % majority vote among the k nearest
end
Fast KNN

preKNN.m
function [knnIndex] = preKNN(x,train_F)
inmodel = x > 0;           % keep all features so any subset can be indexed later
train_f = train_F(:,inmodel);
train_length = size(train_F,1);
flag = true(train_length,1);
knnIndex = cell(train_length,1);
for j = 1:train_length
    flag(j) = false;                 % leave sample j out
    CtrainF = train_f(flag,:);
    CtestF = train_f(~flag,:);
    [datarow, ~] = size(CtrainF);
    diffMat = repmat(CtestF,[datarow,1]) - CtrainF;
    knnIndex{j,1} = diffMat.^2;      % cache per-feature squared differences
    flag(j) = true;
end
end
FSKNN2.m
function [err,fs] = FSKNN2(x,train_F,train_L)
global choice
global knnIndex
inmodel = x > choice;      % select features whose weight exceeds the threshold
k = 1;
train_length = size(train_F,1);
flag = true(train_length,1);
err = 0;
for j = 1:train_length
    flag(j) = false;                 % leave sample j out
    CtrainL = train_L(flag);
    CtestL = train_L(~flag);
    % reuse cached squared differences, restricted to the selected features
    classifyresult = KNN2(CtrainL,k,knnIndex{j}(:,inmodel));
    if CtestL ~= classifyresult
        err = err + 1;
    end
    flag(j) = true;
end
err = err/train_length;              % LOOCV error rate
fs = sum(inmodel);                   % number of selected features
end
KNN2.m
function relustLabel = KNN2(labels,k,diffMat)
% diffMat: cached per-feature squared differences for this left-out sample
distanceMat = sqrt(sum(diffMat,2));
[B, IX] = sort(distanceMat,'ascend');
len = min(k,length(B));
relustLabel = mode(labels(IX(1:len)));   % majority vote among the k nearest
end
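As a sanity check of the scheme, this Python/NumPy sketch (an illustrative re-implementation on random data, not the MATLAB code above) verifies that cached-distance LOOCV yields exactly the same error as the direct computation for a randomly chosen feature subset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 8))                      # 30 samples, 8 features
y = rng.integers(0, 2, size=30)
n = len(y)

cache = (X[:, None, :] - X[None, :, :]) ** 2   # precomputed once, like preKNN
selected = rng.random(8) > 0.5                 # random feature subset, like inmodel

# direct LOOCV 1-NN on the selected features (the FSKNN1 approach)
Xs = X[:, selected]
direct_err = 0
for j in range(n):
    d = np.sum((Xs - Xs[j]) ** 2, axis=1)
    d[j] = np.inf                            # sample j may not be its own neighbour
    direct_err += int(y[np.argmin(d)] != y[j])
direct_err /= n

# cached LOOCV 1-NN (the FSKNN2 approach)
dist = cache[:, :, selected].sum(axis=2)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)
cached_err = np.mean(y[nearest] != y)

print(direct_err == cached_err)              # True: both approaches agree
```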
Results
As the timings show, FSKNN2 with preKNN takes far less time than FSKNN1.