- soft entropy constraint object copyright 2009-2012 Blair Armstrong, Christine Watson, David Plaut This file is part of SOS SOS is free software: you can redistribute it and/or modify it for academic and non-commercial purposes under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For commercial or for-profit uses, please contact the authors (sos@cnbc.cmu.edu). SOS is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
0001 % - soft entropy constraint object 0002 % 0003 % copyright 2009-2012 Blair Armstrong, Christine Watson, David Plaut 0004 % 0005 % This file is part of SOS 0006 % 0007 % SOS is free software: you can redistribute it and/or modify 0008 % it for academic and non-commercial purposes 0009 % under the terms of the GNU General Public License as published by 0010 % the Free Software Foundation, either version 3 of the License, or 0011 % (at your option) any later version. For commercial or for-profit 0012 % uses, please contact the authors (sos@cnbc.cmu.edu). 0013 % 0014 % SOS is distributed in the hope that it will be useful, 0015 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0016 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0017 % GNU General Public License for more details. 0018 0019 % You should have received a copy of the GNU General Public License 0020 % along with SOS (see COPYING.txt). 0021 % If not, see <http://www.gnu.org/licenses/>. 0022 0023 0024 classdef softEntropyConstraint < softConstraint 0025 %% creates and supports soft entropy constraints 0026 % 0027 % Objects of this class measure the cost in terms of minimizing or 0028 % maximizing a measure of entropy, or degree to which items have been 0029 % randomly distributed on a particular dimension. 0030 % 0031 % Additional functionality is inherited from parent softConstraint 0032 % 0033 % The particular measure of entropy used in the class takes its 0034 % inspiration from the standard version of Gibbs entropy: 0035 % 0036 % -k*[Sigma(p)*ln(p)]; 0037 % 0038 % In this formula, k is a positive constant, and p is the probability 0039 % of choosing an item with a particular value on the 0040 % dimension of interest. The sum of p*ln(p) is summed across all 0041 % values of the dimension of interest. 0042 % 0043 % In this standard version of entropy, the values of Entropy increase 0044 % as the 'randomness' of the distribution of items increases. That 0045 % is, the highest values of entropy are obtained when a randomly 0046 % selected item has an equal chance of having any one of the values on 0047 % the dimension of interest. This also corresponds to having a 0048 % uniform distribution of items across values on the dimension of 0049 % interest. In contrast, when items are not randomly distributed in 0050 % this uniform fashion, lower values of entropy are obtained. The 0051 % lowest possible entropy would occur when all items shared the same 0052 % value. 0053 % 0054 % Maximizing entropy thus provides one means of randomly distributing 0055 % items across a range of values, whereas minimizing entropy can serve 0056 % to force units to all take on the same value. 0057 % 0058 % Though the standard Gibbs entropy formula has the desirable 0059 % characteristics outlined above, for present purposes in one main 0060 % respect: the domain of the results of the entropy formula are not 0061 % well constrainted to a particular range. This problem is due to two 0062 % main aspects of the formula. First, if a particular value of a 0063 % dimension literally has no item representing it, its probability 0064 % would be zero, and ln(0) = -Inf. Second, the amount of entropy in 0065 % the standard formula varies as a function of the number of items. 0066 % Neither of these characteristics are desirable in the present 0067 % context where it is useful to have the standard output of a cost 0068 % function consistently have standardized values in the range -1 - 1 0069 % (or at least within an order of magnitude thereof). 0070 % 0071 % This additional desirable characteristic resulted in the development 0072 % of a new 'entropy' formula, as follows: 0073 % 0074 % ent = (-1*Sigma(1:N)[p*log((p+1)')/Sum(N)] -1/(N)*ln((1/N)+1)) / 0075 % (ln(2)/N - 1/N * ln((1/N) +1)); 0076 % 0077 % Don't let the complexity of this formula scare you though, as in 0078 % principle it's quite straightforward. First, to avoid the case 0079 % where values with p=0 generated -Inf, it was necessary to add some 0080 % constant value to the ln() term. A value of 1 was selected as this 0081 % results in only positive numbers being generated from the equation, 0082 % since log(1) == 0 and log(1+positiveN) > 0. Next, this equation is 0083 % divided by N, the number of different possible items, which renders 0084 % the output of the equation relatively insensitive to the number of 0085 % items for which entropy is being calculated. The theoretical 0086 % minimum entropy for this particular dataset is then subtracted away 0087 % from the result of hte previous step (1/N*ln((1/N)+1) corresponding 0088 % to the case where all items are distributed uniformally across 0089 % values) giving the equation a bound a bound of zero when maximum 0090 % entropy is acheived. The result of this entire calculation 0091 % subsequently serves as the dividend in a division which 0092 % standardizes the equation so that the maximum possible entropy that 0093 % could result will always have a value of 1.0, thus bounding the 0094 % formula to the 0-1 domain. Finally, to conserve the property that 0095 % entropy is maximized as randomness increases, the result of the 0096 % previous calculation is multiplied by -1, thus yielding an entropy 0097 % formula that is bounded at [-1, 0], where -1 corresponds to the 0098 % least randomly distributed state, and 0 corresponds to the most 0099 % randomly distributed state. 0100 % 0101 % 0102 % s1 % sample entropy is calculated on 0103 % s1Col % data column entropy is calculated on 0104 % nbin % number of bins to subdivide data into 0105 % pdSpread % type of spread of scores desired. Options currently are 'sample' and 'allItems'. 0106 % ent % current entropy value 0107 % swEnt % swap entropy value 0108 % comparison %handle to comparison calculator (to minimize or max entropy) 0109 % initStats % handle to method that initializes stats 0110 % swStats % handle to method that calculates swap stats 0111 % scores % copy of scores in the column entropy is calucated for. 0112 % swScores % copy of scores if a swap occured 0113 % bins % mid-point of bins data is divided into 0114 % swBins % mid-point of bins data is divided into after a swap 0115 % pd % probability distribution of scores 0116 % swpd % probability distribution of swap socres 0117 % minScore % min score in set 0118 % swMinScore % min score in swap set 0119 % maxScore % max score in set 0120 % swMaxScore % max score in swap set. 0121 % 0122 %PROPERTIES 0123 % s1 % sample entropy is calculated on 0124 % s1Col % data column entropy is calculated on 0125 % nbin % number of bins to subdivide data into 0126 % pdSpread % type of spread of scores desired. Options currently are 'sample' and 'allItems'. 0127 % ent % current entropy value 0128 % swEnt % swap entropy value 0129 % comparison %handle to comparison calculator (to minimize or max entropy) 0130 % initStats % handle to method that initializes stats 0131 % swStats % handle to method that calculates swap stats 0132 % scores % copy of scores in the column entropy is calucated for. 0133 % swScores % copy of scores if a swap occured 0134 % bins % mid-point of bins data is divided into 0135 % swBins % mid-point of bins data is divided into after a swap 0136 % pd % probability distribution of scores 0137 % swpd % probability distribution of swap socres 0138 % minScore % min score in set 0139 % swMinScore % min score in swap set 0140 % maxScore % max score in set 0141 % swMaxScore % max score in swap set. 0142 % 0143 %PROPERTIES (Constant) 0144 % s2 = NaN; % s2 = NaN; for consistency with other methods, there is a swap2 property, but it is just set to a null-like value. 0145 % smallVal = 0.00000001 % small value added to avoid case where a value is exactly equal the min or max range of the distribution 0146 % 0147 %METHODS 0148 % obj = softEntropyConstraint(varargin) % CONSTRUCTOR 0149 % cost = initCost() METHOD 0150 % initEnt() % calculates the initial entropy value for the scores in sample1<sample1Col> 0151 % cost = minEnt(curEnt) % calculates the cost associated with curEnt, when minimal Entropy is desired 0152 % cost = maxEnt(curEnt) % calculates the cost associated with curEnt, when maximal entropy is desired 0153 % swCost = swapCost(targSample,targSampleIndex, feederdf,feederdfIndex) % Calculates the new cost if items from targSample and feederdf were swapped. 0154 % swEntropy(obj,targSample,targSampleIndex, feederdf,feederdfIndex) % Calculates the new (swap) entropy if items from targSample and feederdf were swapped. 0155 % cost = acceptSwap() % alter internal variables to reflect accepted swap 0156 % cost = rejectSwap() % set internal variables to reflect rejection of proposed swap 0157 % plotDistribution() % plots the probability distribution used to calculate entropy 0158 0159 0160 %% Properties 0161 properties 0162 s1 % sample entropy is calculated on 0163 s1Col % data column entropy is calculated on 0164 s1ColName %name of column entropy is being calculated on 0165 nbin % number of bins to subdivide data into 0166 pdSpread % type of spread of scores desired. Options currently are 'sample' and 'allItems'. 0167 ent % current entropy value 0168 swEnt % swap entropy value 0169 comparison %handle to comparison calculator (to minimize or max entropy) 0170 initStats % handle to method that initializes stats 0171 swStats % handle to method that calculates swap stats 0172 scores % copy of scores in the column entropy is calucated for. 0173 swScores % copy of scores if a swap occured 0174 bins % mid-point of bins data is divided into 0175 swBins % mid-point of bins data is divided into after a swap 0176 pd % probability distribution of scores 0177 swpd % probability distribution of swap socres 0178 minScore % min score in set 0179 swMinScore % min score in swap set 0180 maxScore % max score in set 0181 swMaxScore % max score in swap set. 0182 figHandle % handle to figure for plotting entropy 0183 end 0184 0185 %% Properties (Constant) 0186 properties (Constant) 0187 s2 = NaN; % s2 = NaN; for consistency with other methods, there is a swap2 property, but it is just set to a null-like value. 0188 smallVal = 0.00000001 % small value added to avoid case where a value is exactly equal the min or max range of the distribution 0189 end 0190 0191 0192 methods 0193 %% obj = softEntropyConstraint(varargin) CONSTRUCTOR 0194 function obj = softEntropyConstraint(varargin) 0195 % Constructs a softEntropy constraint object 0196 % 0197 % PARAMETERS: 0198 % REQUIRED: 0199 % 'sosObj'/sos object - the SOS object the constraint will be linked to, and which contains the sample the constraint operates on. 0200 % 'constraintType'/'soft' - the type of contraint - must be 'soft' 0201 % 'fnc'/'minEnt'|'maxEnt' the entropy cost to calculate. 0202 % 'sample1'/sample - the first sample 0203 % 's1ColName'/string - name of column in 1st sample 0204 % 'pSpread'/'sample'|'allItems' Should entropy be maximized relative to sample items only, or to the theoretical min and max values in the population? 0205 % 0206 % 0207 % OPTIONAL: 0208 % 'nbin'/integer - number of bins to divide data into for 0209 % purposes of calculating probability distribution. Defaults 0210 % to number of items in the sample. Must be greater than 2 0211 % and <= number of items in the sample 0212 % 'exponent'/numeric - defaults to 2 (quadratic difference) 0213 % 'weight'/numeric - defaults to 1 (equal weighting of all soft costs) 0214 0215 p = inputParser; 0216 0217 p.addParamValue('sosObj','null',@(sosObj)strcmp(class(sosObj),'sos')); 0218 p.addParamValue('constraintType', 'null', ... 0219 @(constraintType)any(strcmp({'soft'},constraintType))); 0220 p.addParamValue('fnc','null', ... 0221 @(fnc)any(strcmp({'minEnt' 'maxEnt'},fnc))); 0222 p.addParamValue('nbin',2,@(nbin)validateattributes(nbin, {'numeric'}, ... 0223 {'scalar', 'integer', 'positive', '>', 0})); 0224 p.addParamValue('pdSpread','null', ... 0225 @(pdSpread)any(strcmp({'sample' 'allItems'},pdSpread))); 0226 p.addParamValue('sample1','null',@(sample1)strcmp(class(sample1),'sample')); 0227 p.addParamValue('s1ColName','',@(s1ColName)ischar(s1ColName)); 0228 p.addParamValue('exponent',2,@(exponent)isnumeric(exponent)); 0229 p.addParamValue('weight',1,@(weight)isnumeric(weight)); 0230 p.addParamValue('name','noname',@(name)ischar(name)); 0231 0232 p.parse(varargin{:}); 0233 0234 % check additional constraints on values submitted to the 0235 % constructor 0236 if(p.Results.sosObj.containsSample(p.Results.sample1) == false) 0237 error('Cannot create soft distance constraint: sos Object does not contain the sample1'); 0238 end 0239 0240 col1 = p.Results.sample1.colName2colNum(p.Results.s1ColName); 0241 if(col1 == -1) 0242 error('Specified column name not found in sample1'); 0243 end 0244 0245 if(strcmp(p.Results.sample1.format{col1},'%f') == 0) 0246 error('Specified column is not of numeric (%f) format, so cannot use as hard bound'); 0247 end 0248 0249 if any(strcmp(p.UsingDefaults,'nbin')) 0250 obj.nbin = p.Results.sample1.n; 0251 else 0252 if p.Results.nbin <= p.Results.sample1.n 0253 obj.nbin = p.Results.nbin; 0254 else 0255 error('Number of bins must be <= number of observations in the sample'); 0256 end 0257 end 0258 0259 if obj.nbin < 2 0260 error ('There must be at least 2 bins in the entropy calculation'); 0261 end 0262 0263 0264 %assign the appropriate handles for the calculation. 0265 obj.sosObj = p.Results.sosObj; 0266 obj.constraintType = p.Results.constraintType; 0267 obj.fnc = p.Results.fnc; 0268 obj.weight = p.Results.weight; 0269 obj.s1 = p.Results.sample1; 0270 obj.s1ColName = p.Results.s1ColName; 0271 obj.s1Col = col1; 0272 obj.exp = p.Results.exponent; 0273 0274 obj.initStats = @obj.initEnt; 0275 obj.swStats = @obj.swEntropy; 0276 0277 if(strcmp(p.Results.fnc,'minEnt')) 0278 obj.comparison = @obj.minEnt; 0279 elseif(strcmp(p.Results.fnc,'maxEnt')) 0280 obj.comparison = @obj.maxEnt; 0281 else 0282 error('Entropy <fnc> must be either "minEnt" or "maxEnt"'); 0283 end 0284 0285 if(any(strcmp({'sample' 'allItems'},p.Results.pdSpread))) 0286 obj.pdSpread = p.Results.pdSpread; 0287 else 0288 error('pdSpread must be either "sample" or "allItems"'); 0289 end 0290 0291 obj.cost = NaN; 0292 obj.swCost = NaN; 0293 0294 % add the name and the label 0295 obj.label = [obj.constraintType,'_',obj.fnc,... 0296 '_pd_',obj.pdSpread,'_nbin',num2str(obj.nbin),'_',... 0297 obj.s1.name,'_',... 0298 obj.s1ColName,'_w',... 0299 num2str(obj.weight),'_e',num2str(obj.exp)]; 0300 if any(strcmp(p.UsingDefaults,'name')) 0301 obj.name = obj.label; 0302 else 0303 obj.name = p.Results.name; 0304 end 0305 0306 0307 verbosePrint('Soft Entropy Constraint has been created', ... 0308 'softEntropyConstraint_Constructor_endObjCreation'); 0309 end % constructor 0310 0311 0312 %% cost = initCost() METHOD 0313 function cost = initCost(obj) 0314 % Calculates, saves, and returns the cost value for the current items in the sample. 0315 0316 %init the stats, then compare them. Return the calculated cost 0317 obj.initStats(); 0318 cost = obj.comparison(obj.ent); 0319 0320 obj.swEnt = NaN; 0321 obj.swScores = NaN; 0322 obj.swpd = NaN; 0323 obj.swMinScore = NaN; 0324 obj.swMaxScore = NaN; 0325 0326 obj.cost = cost; 0327 0328 end %initCost 0329 0330 %% initEnt() METHOD 0331 function initEnt(obj) 0332 % calculates the initial entropy value for the scores in 0333 % sample1<sample1Col> 0334 0335 obj.scores = (obj.s1.zdata{obj.s1Col})'; 0336 0337 0338 % get min and max values for the probability distribution: 0339 %this procedure is different depending on whether we are using 0340 %the sample as the range or allItems 0341 0342 if strcmp(obj.pdSpread,'sample') 0343 minVal = min(obj.scores); 0344 maxVal = max(obj.scores); 0345 elseif strcmp(obj.pdSpread,'allItems') 0346 %check in the population 0347 minVal = min([obj.s1.population.zdata{obj.s1Col}]); 0348 maxVal = max([obj.s1.population.zdata{obj.s1Col}]); 0349 0350 %check in all samples associated with the population (this 0351 %includes the original one 0352 for i=1:length(obj.s1.population.samples) 0353 minVal = min([minVal; ... 0354 obj.s1.population.samples(i).zdata{obj.s1Col}]); 0355 maxVal = max([maxVal; ... 0356 obj.s1.population.samples(i).zdata{obj.s1Col}]); 0357 end 0358 else 0359 error('Unsupported pdSpread in initEnt. pdSpread must be "sample" or "allItems"'); 0360 end 0361 0362 % add a smallVal to avoid boundary cases 0363 minVal = minVal - obj.smallVal; 0364 maxVal = maxVal + obj.smallVal; 0365 0366 %now have min and max values, determine size of each bin: 0367 obj.minScore = minVal; 0368 obj.maxScore = maxVal; 0369 0370 spread = maxVal - minVal; 0371 binSize = spread/obj.nbin; 0372 0373 %smallVal added so that the last score is included (simulating 0374 %<= rather than < 0375 obj.bins = (minVal+0.5*binSize):binSize:(maxVal-0.5*binSize)+obj.smallVal; 0376 0377 % scale values to 0-1 0378 obj.pd = hist(obj.scores,obj.bins)/length(obj.scores); 0379 0380 0381 % calculate entropy 0382 obj.ent = obj.pd*log((obj.pd+1)')/length(obj.pd); 0383 obj.ent = -1* (obj.ent - 1/length(obj.pd)*log(1/length(obj.pd)+1)) / ... 0384 (1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1)); 0385 0386 end % initEnt 0387 0388 %% cost = minEnt(curEnt) METHOD 0389 function cost = minEnt(obj,curEnt) 0390 % calculates the cost associated with curEnt, when minimal entropy is desired 0391 cost = -(abs((curEnt))^obj.exp)*obj.weight; 0392 end 0393 0394 %% cost = maxEnt(curEnt) 0395 function cost = maxEnt(obj,curEnt) 0396 % calculates the cost associated with curEnt, when maximal entropy is desired 0397 cost = (abs((curEnt))^obj.exp)*obj.weight; 0398 end 0399 0400 %% swCost(targSample,targSampleIndex, feederdf,feederdfIndex) METHOD 0401 function swCost = swapCost(obj,targSample,targSampleIndex, feederdf,feederdfIndex) 0402 % Calculates the new cost if items from targSample and feederdf were swapped. 0403 % 0404 %By definition, if this method is called it means that at least 0405 % one of the two swap objects is implicated in this function 0406 % 0407 %PARAMETERS: 0408 % targSample - the target sample (i.e., the object that will call it's swapSample() method if a swap later occurs). 0409 % targSampleIndex - row index of item to swap 0410 % feederdf - dataframe (sample/pop) containin the other item to swap 0411 % feederdfIndex - row index of item to swap. 0412 0413 0414 if (obj.s1 ~= targSample && obj.s1 ~= feederdf) 0415 error('swCost called, but no sample part of this cost function'); 0416 end 0417 0418 % update the stats, then calculate cost of new stats 0419 obj.swStats(targSample,targSampleIndex, feederdf,feederdfIndex); 0420 0421 swCost = obj.comparison(obj.swEnt); 0422 0423 obj.swCost = swCost; 0424 0425 end 0426 0427 %% swEnt(targSample,targSampleIndex,feederdf,feederdfIndex) METHOD 0428 function swEntropy(obj,targSample,targSampleIndex, feederdf,feederdfIndex) 0429 % Calculates the new (swap) entropy if items from targSample and feederdf were swapped. 0430 % 0431 % Inputs are the same as for swapCost() 0432 0433 0434 0435 obj.swScores = obj.scores; 0436 obj.swpd = obj.pd; 0437 obj.swMinScore = obj.minScore; 0438 obj.swMaxScore = obj.maxScore; 0439 obj.swBins = obj.bins; 0440 0441 % re-extract the raw, unnormalized entropy value from the 0442 % current entropy 0443 rawEnt = obj.ent; 0444 rawEnt = rawEnt *(1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1)); 0445 rawEnt = rawEnt*-1; 0446 rawEnt = rawEnt + 1/length(obj.pd)*log(1/length(obj.pd)+1); 0447 0448 spread = obj.swMaxScore - obj.swMinScore; 0449 binSize = spread/obj.nbin; 0450 0451 recalculate = false; 0452 0453 if targSample == obj.s1 0454 0455 obj.swScores(targSampleIndex) = ... 0456 feederdf.zdata{obj.s1Col}(feederdfIndex); 0457 0458 % if the min and max bounds may change and we're only 0459 % calculating the entropy of the sample and letting the 0460 % range of the sample vary dynamically, recalculate entropy 0461 % from scratch 0462 if(strcmp(obj.pdSpread,'sample') && ... 0463 (lt(obj.swScores(targSampleIndex)-obj.smallVal, obj.minScore) || ... 0464 gt(obj.swScores(targSampleIndex)+obj.smallVal, obj.maxScore) || ... 0465 obj.scores(targSampleIndex)-obj.smallVal == obj.minScore || ... 0466 obj.scores(targSampleIndex)+obj.smallVal == obj.maxScore)) 0467 0468 recalculate = true; 0469 else 0470 %we can do a local update 0471 0472 found = false; 0473 0474 % find the old bin: 0475 for i=1:(length(obj.swBins)) 0476 0477 0478 if (obj.scores(targSampleIndex) >= obj.swBins(i) - 0.5*binSize ... 0479 && obj.scores(targSampleIndex) < obj.swBins(i)+0.5*binSize) 0480 0481 %remove the old score 0482 rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0483 obj.swpd(i) = obj.swpd(i) - 1/length(obj.scores); 0484 rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0485 found = true; 0486 break; 0487 end 0488 end 0489 0490 if found == false 0491 error('unable to find the swap bin'); 0492 end 0493 0494 found = false; 0495 0496 for i=1:(length(obj.swBins)) 0497 if obj.swScores(targSampleIndex) >= obj.swBins(i) - 0.5*binSize ... 0498 && obj.swScores(targSampleIndex) < obj.swBins(i) + 0.5*binSize 0499 %add the new score 0500 0501 rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0502 obj.swpd(i) = obj.swpd(i) + 1/length(obj.scores); 0503 rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0504 0505 found = true; 0506 break; 0507 end 0508 end 0509 0510 if found == false 0511 error('unable to find the swap bin'); 0512 end 0513 0514 end 0515 end 0516 0517 if feederdf == obj.s1 0518 0519 obj.swScores(feederdfIndex) = ... 0520 targSample.zdata{obj.s1Col}(targSampleIndex); 0521 0522 % if using the sample's spread and the new score could 0523 % exceed the existing bounds, recalculate from sractch 0524 if(strcmp(obj.pdSpread,'sample') && ... 0525 (lt(obj.swScores(feederdfIndex)-obj.smallVal, obj.minScore) || ... 0526 gt(obj.swScores(feederdfIndex)+obj.smallVal, obj.maxScore) || ... 0527 obj.scores(feederdfIndex)-obj.smallVal == obj.minScore || ... 0528 obj.scores(feederdfIndex)+obj.smallVal == obj.maxScore)) 0529 0530 recalculate = true; 0531 else 0532 %we can do a local update 0533 0534 found = false; 0535 0536 0537 0538 % find the old bin: 0539 for i=1:(length(obj.swBins)) 0540 if obj.scores(feederdfIndex) >= obj.swBins(i) - 0.5*binSize ... 0541 && obj.scores(feederdfIndex) < obj.swBins(i) + 0.5*binSize 0542 %remove the old score 0543 rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0544 obj.swpd(i) = obj.swpd(i) - 1/length(obj.scores); 0545 rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0546 0547 found = true; 0548 break; 0549 end 0550 end 0551 0552 if found == false 0553 error('Unable to find the old bin'); 0554 end 0555 0556 found = false; 0557 for i=1:(length(obj.swBins)) 0558 if obj.swScores(feederdfIndex) >= obj.swBins(i) - 0.5*binSize ... 0559 && obj.swScores(feederdfIndex) < obj.swBins(i) + 0.5*binSize 0560 %add the new score 0561 rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0562 obj.swpd(i) = obj.swpd(i) + 1/length(obj.scores); 0563 rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd); 0564 0565 found = true; 0566 break; 0567 end 0568 end 0569 0570 if found == false 0571 error('Unable to find an appropriate swbin for the data'); 0572 end 0573 0574 end 0575 end 0576 0577 % a local update is possible if neither of the swap scores 0578 % exceeds the current spread. 0579 0580 %recalculate = true; 0581 0582 if recalculate == true 0583 %re-derive all measures 0584 minVal = min(obj.swScores); 0585 maxVal = max(obj.swScores); 0586 0587 minVal = minVal - obj.smallVal; 0588 maxVal = maxVal + obj.smallVal; 0589 0590 %now have min and max values, determine size of each bin: 0591 obj.swMinScore = minVal; 0592 obj.swMaxScore = maxVal; 0593 0594 spread = maxVal - minVal; 0595 binSize = spread/obj.nbin; 0596 % remove the small -1.0e-14 value to avoid rounding errors 0597 % that result in non-overlapping bins. So long as this 0598 % number is < e-16 then this appears to prevent these 0599 % rounding errors. And so long as this number is 0600 % substantially smaller than smallVal it will not interfere 0601 % with the actual bin sizes (currently smallVal is < 0602 % 1.0-e10) 0603 obj.swBins = (minVal + 0.5*binSize):binSize:(maxVal - 0.5*binSize)+obj.smallVal; 0604 0605 for i=1:length(obj.swBins) 0606 obj.swBins(i) = obj.swBins(i) - 1.0e-14; 0607 end 0608 0609 obj.swpd = hist(obj.swScores,obj.swBins)/length(obj.scores); 0610 obj.swEnt = obj.swpd*log((obj.swpd+1)')/length(obj.pd); 0611 obj.swEnt = -1* (obj.swEnt - 1/length(obj.pd)*log(1/length(obj.pd)+1)) / ... 0612 (1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1)); 0613 0614 else 0615 % cannot exceed bounds because min and max were taken from 0616 % the population. Proceed with local update 0617 obj.swEnt = -1* (rawEnt - 1/length(obj.pd)*log(1/length(obj.pd)+1)) / ... 0618 (1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1)); 0619 0620 end 0621 0622 if isnan(obj.swEnt) 0623 error('NaN obtained during swEnt computation (perhaps there is missing data for an item?)'); 0624 end 0625 0626 end %swEntropy() 0627 0628 0629 %% cost = acceptSwap() METHOD 0630 function cost = acceptSwap(obj) 0631 % alter internal variables to reflect accepted swap 0632 0633 0634 if isnan(obj.swEnt) 0635 %do nothing, no need to swap 0636 else 0637 obj.ent = obj.swEnt; 0638 obj.swEnt = NaN; 0639 obj.scores = obj.swScores; 0640 obj.swScores = NaN; 0641 0642 obj.pd = obj.swpd; 0643 obj.swpd = NaN; 0644 0645 obj.minScore = obj.swMinScore; 0646 obj.swMinScore = NaN; 0647 obj.maxScore = obj.swMaxScore; 0648 obj.swMaxScore = NaN; 0649 0650 obj.bins = obj.swBins; 0651 obj.swBins = NaN; 0652 0653 end 0654 0655 cost = acceptSwap@genericConstraint(obj); 0656 end 0657 0658 %% cost = rejectSwap() METHOD 0659 function cost = rejectSwap(obj) 0660 % set internal variables to reflect rejection of proposed swap 0661 obj.swEnt = NaN; 0662 obj.swScores = NaN; 0663 obj.swpd = NaN; 0664 obj.swMinScore = NaN; 0665 obj.swMaxScore = NaN; 0666 0667 obj.swBins = NaN; 0668 0669 cost = rejectSwap@genericConstraint(obj); 0670 end 0671 0672 %% plotDistribution() METHOD 0673 function plotDistribution(obj) 0674 % plots the probability distribution used to calculate entropy 0675 0676 % make the entropy figure the active figure 0677 if isempty(obj.figHandle) 0678 obj.figHandle = figure('Name',['Entropy - ',obj.s1.name, ... 0679 '|',obj.s1ColName],'NumberTitle','off'); 0680 0681 end 0682 0683 figure(obj.figHandle); 0684 0685 if isempty(obj.cost) == false && isnan(obj.cost) == false %the constraints have not been initialized 0686 clf(); 0687 0688 0689 if strcmp(obj.pdSpread,'sample') 0690 minVal = min(obj.s1.data{obj.s1Col}); 0691 maxVal = max(obj.s1.data{obj.s1Col}); 0692 elseif strcmp(obj.pdSpread,'allItems') 0693 %check in the population 0694 minVal = min([obj.s1.population.data{obj.s1Col}]); 0695 maxVal = max([obj.s1.population.data{obj.s1Col}]); 0696 0697 %check in all samples associated with the population (this 0698 %includes the original one 0699 for i=1:length(obj.s1.population.samples) 0700 minVal = min([minVal; ... 0701 obj.s1.population.samples(i).data{obj.s1Col}]); 0702 maxVal = max([maxVal; ... 0703 obj.s1.population.samples(i).data{obj.s1Col}]); 0704 end 0705 else 0706 error('Unsupported pdSpread in initEnt. pdSpread must be "sample" or "allItems"'); 0707 end 0708 0709 % add a smallVal to avoid boundary cases 0710 minVal = minVal - obj.smallVal; 0711 maxVal = maxVal + obj.smallVal; 0712 0713 spread = maxVal - minVal; 0714 binSize = spread/obj.nbin; 0715 0716 %smallVal added so that the last score is included (simulating 0717 %<= rather than < 0718 tmpbins = (minVal+0.5*binSize):binSize:(maxVal-0.5*binSize)+obj.smallVal; 0719 0720 % scale values to 0-1 0721 tmppd = hist(obj.s1.data{obj.s1Col},obj.nbin)/length(obj.s1.data{obj.s1Col}); 0722 0723 bar(tmpbins,tmppd,'hist'); 0724 ylim([0 1]); 0725 title(['Entropy - ',obj.s1.name, ... 0726 '|',obj.s1ColName]); 0727 xlabel([obj.s1ColName, ]);%' - each bar is centered on bin mid-point']); 0728 ylabel('p(item from a particular bin)'); 0729 else 0730 clf(); 0731 ylim([0 1]); 0732 title(['Entropy - ',obj.s1.name, ... 0733 '|',obj.s1ColName]); 0734 xlabel([obj.s1ColName, ]);%' - each bar is centered on bin mid-point']); 0735 ylabel('p(item from a particular bin)'); 0736 text(0.5,0.5,'Cost must be initialized before data is plotted', ... 0737 'HorizontalAlignment','center'); 0738 end 0739 0740 0741 0742 0743 end 0744 0745 end 0746 0747 end 0748