Description of softEntropyConstraint

0001 % - soft entropy constraint object
0002 %
0003 % copyright 2009-2012 Blair Armstrong, Christine Watson, David Plaut
0004 %
0005 %    This file is part of SOS
0006 %
0007 %    SOS is free software: you can redistribute it and/or modify
0008 %    it for academic and non-commercial purposes
0009 %    under the terms of the GNU General Public License as published by
0010 %    the Free Software Foundation, either version 3 of the License, or
0011 %    (at your option) any later version.  For commercial or for-profit
0012 %    uses, please contact the authors (sos@cnbc.cmu.edu).
0013 %
0014 %    SOS is distributed in the hope that it will be useful,
0015 %    but WITHOUT ANY WARRANTY; without even the implied warranty of
0016 %    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0017 %    GNU General Public License for more details.
0018 
0019 %    You should have received a copy of the GNU General Public License
0020 %    along with SOS (see COPYING.txt).
0021 %    If not, see <http://www.gnu.org/licenses/>.
0022 
0023 
0024 classdef softEntropyConstraint < softConstraint
0025     %% creates and supports soft entropy constraints
0026     %
0027     % Objects of this class measure the cost in terms of minimizing or
0028     % maximizing a measure of entropy, or degree to which items have been
0029     % randomly distributed on a particular dimension.
0030     %
0031     % Additional functionality is inherited from parent softConstraint
0032     %
0033     % The particular measure of entropy used in the class takes its
0034     % inspiration from the standard version of Gibbs entropy:
0035     %
0036     %   -k*[Sigma(p)*ln(p)];
0037     %
0038     %  In this formula, k is a positive constant, and p is the probability
0039     %  of choosing an item with a particular value on the
0040     %  dimension of interest.  The sum of p*ln(p) is summed across all
0041     %  values of the dimension of interest.
0042     %
0043     %  In this standard version of entropy, the values of Entropy increase
0044     %  as the 'randomness' of the distribution of items increases.  That
0045     %  is, the highest values of entropy are obtained when a randomly
0046     %  selected item has an equal chance of having any one of the values on
0047     %  the dimension of interest.  This also corresponds to having a
0048     %  uniform distribution of items across values on the dimension of
0049     %  interest.  In contrast, when items are not randomly distributed in
0050     %  this uniform fashion, lower values of entropy are obtained.  The
0051     %  lowest possible entropy would occur when all items shared the same
0052     %  value.
0053     %
0054     %  Maximizing entropy thus provides one means of randomly distributing
0055     %  items across a range of values, whereas minimizing entropy can serve
0056     %  to force units to all take on the same value.
0057     %
0058     %  Though the standard Gibbs entropy formula has the desirable
0059     %  characteristics outlined above, for present purposes in one main
0060     %  respect: the domain of the results of the entropy formula are not
0061     %  well constrainted to a particular range.  This problem is due to two
0062     %  main aspects of the formula.  First, if a particular value of a
0063     %  dimension literally has no item representing it, its probability
0064     %  would be zero, and ln(0) = -Inf.  Second, the amount of entropy in
0065     %  the standard formula varies as a function of the number of items.
0066     %  Neither of these characteristics are desirable in the present
0067     %  context where it is useful to have the standard output of a cost
0068     %  function consistently have standardized values in the range -1 - 1
0069     %   (or at least within an order of magnitude thereof).
0070     %
0071     %  This additional desirable characteristic resulted in the development
0072     %  of a new 'entropy' formula, as follows:
0073     %
0074     %     ent = (-1*Sigma(1:N)[p*log((p+1)')/Sum(N)] -1/(N)*ln((1/N)+1)) /
0075     %           (ln(2)/N - 1/N * ln((1/N) +1));
0076     %
0077     %  Don't let the complexity of this formula scare you though, as in
0078     %  principle it's quite straightforward.  First, to avoid the case
0079     %  where values with p=0 generated -Inf, it was necessary to add some
0080     %  constant value to the ln() term.  A value of 1 was selected as this
0081     %  results in only positive numbers being generated from the equation,
0082     %  since log(1) == 0 and log(1+positiveN) > 0.  Next, this equation is
0083     %  divided by N, the number of different possible items, which renders
0084     %  the output of the equation relatively insensitive to the number of
0085     %  items for which entropy is being calculated.  The theoretical
0086     %  minimum entropy for this particular dataset is then subtracted away
0087     %  from the result of hte previous step (1/N*ln((1/N)+1) corresponding
0088     %  to the case where all items are distributed uniformally across
0089     %  values) giving the equation a bound a bound of zero when maximum
0090     %  entropy is acheived.  The result of this entire calculation
0091     %  subsequently serves as the dividend in a division which
0092     %  standardizes the equation so that the maximum possible entropy that
0093     %  could result will always have a value of 1.0, thus bounding the
0094     %  formula to the 0-1 domain.  Finally, to conserve the property that
0095     %  entropy is maximized as randomness increases, the result of the
0096     %  previous calculation is multiplied by -1, thus yielding an entropy
0097     %  formula that is bounded at [-1, 0], where -1 corresponds to the
0098     %  least randomly distributed state, and 0 corresponds to the most
0099     %  randomly distributed state.
0100     %
0101     %
0102     %     s1 % sample entropy is calculated on
0103     %     s1Col   % data column entropy is calculated on
0104     %     nbin    % number of bins to subdivide data into
0105     %     pdSpread % type of spread of scores desired.  Options currently are 'sample' and 'allItems'.
0106     %     ent % current entropy value
0107     %     swEnt % swap entropy value
0108     %     comparison %handle to comparison calculator (to minimize or max entropy)
0109     %     initStats % handle to method that initializes stats
0110     %     swStats % handle to method that calculates swap stats
0111     %     scores % copy of scores in the column entropy is calucated for.
0112     %     swScores % copy of scores if a swap occured
0113     %     bins % mid-point of bins data is divided into
0114     %     swBins % mid-point of bins data is divided into after a swap
0115     %     pd % probability distribution of scores
0116     %     swpd % probability distribution of swap socres
0117     %     minScore % min score in set
0118     %     swMinScore % min score in swap set
0119     %     maxScore % max score in set
0120     %     swMaxScore % max score in swap set.
0121     %
0122     %PROPERTIES
0123     %     s1 % sample entropy is calculated on
0124     %     s1Col   % data column entropy is calculated on
0125     %     nbin    % number of bins to subdivide data into
0126     %     pdSpread % type of spread of scores desired.  Options currently are 'sample' and 'allItems'.
0127     %     ent % current entropy value
0128     %     swEnt % swap entropy value
0129     %     comparison %handle to comparison calculator (to minimize or max entropy)
0130     %     initStats % handle to method that initializes stats
0131     %     swStats % handle to method that calculates swap stats
0132     %     scores % copy of scores in the column entropy is calucated for.
0133     %     swScores % copy of scores if a swap occured
0134     %     bins % mid-point of bins data is divided into
0135     %     swBins % mid-point of bins data is divided into after a swap
0136     %     pd % probability distribution of scores
0137     %     swpd % probability distribution of swap socres
0138     %     minScore % min score in set
0139     %     swMinScore % min score in swap set
0140     %     maxScore % max score in set
0141     %     swMaxScore % max score in swap set.
0142     %
0143     %PROPERTIES (Constant)
0144     %    s2 = NaN; % s2 = NaN; for consistency with other methods, there is a swap2 property, but it is just set to a null-like value.
0145     %   smallVal = 0.00000001 % small value added to avoid case where a value is exactly equal the min or max range of the distribution
0146     %
0147     %METHODS
0148     %   obj = softEntropyConstraint(varargin) % CONSTRUCTOR
0149     %   cost = initCost() METHOD
0150     %   initEnt() % calculates the initial entropy value for the scores in sample1<sample1Col>
0151     %   cost = minEnt(curEnt) % calculates the cost associated with curEnt, when minimal Entropy is desired
0152     %   cost = maxEnt(curEnt)  % calculates the cost associated with curEnt, when maximal entropy is desired
0153     %   swCost = swapCost(targSample,targSampleIndex, feederdf,feederdfIndex)  % Calculates the new cost if items from targSample and feederdf were swapped.
0154     %   swEntropy(obj,targSample,targSampleIndex, feederdf,feederdfIndex)  % Calculates the new (swap) entropy if items from targSample and feederdf were swapped.
0155     %   cost = acceptSwap()  % alter internal variables to reflect accepted swap
0156     %   cost = rejectSwap() % set internal variables to reflect rejection of proposed swap
0157     %   plotDistribution() % plots the probability distribution used to calculate entropy
0158     
0159 
0160     %% Properties
0161     properties
0162         s1 % sample entropy is calculated on
0163         s1Col   % data column entropy is calculated on
0164         s1ColName %name of column entropy is being calculated on
0165         nbin    % number of bins to subdivide data into
0166         pdSpread % type of spread of scores desired.  Options currently are 'sample' and 'allItems'.
0167         ent % current entropy value
0168         swEnt % swap entropy value
0169         comparison %handle to comparison calculator (to minimize or max entropy)
0170         initStats % handle to method that initializes stats
0171         swStats % handle to method that calculates swap stats
0172         scores % copy of scores in the column entropy is calucated for.
0173         swScores % copy of scores if a swap occured
0174         bins % mid-point of bins data is divided into
0175         swBins % mid-point of bins data is divided into after a swap
0176         pd % probability distribution of scores
0177         swpd % probability distribution of swap socres
0178         minScore % min score in set
0179         swMinScore % min score in swap set
0180         maxScore % max score in set
0181         swMaxScore % max score in swap set.
0182         figHandle % handle to figure for plotting entropy
0183     end
0184     
0185     %% Properties (Constant)
0186     properties (Constant)
0187         s2 = NaN; % s2 = NaN; for consistency with other methods, there is a swap2 property, but it is just set to a null-like value.
0188         smallVal = 0.00000001 % small value added to avoid case where a value is exactly equal the min or max range of the distribution
0189     end
0190         
0191     
0192     methods
0193         %% obj = softEntropyConstraint(varargin) CONSTRUCTOR
0194         function obj = softEntropyConstraint(varargin)
0195             % Constructs a softEntropy constraint object
0196             %
0197             % PARAMETERS:
0198             % REQUIRED:
0199             %   'sosObj'/sos object - the SOS object the constraint will be linked to, and which contains the sample the constraint operates on.
0200             %   'constraintType'/'soft' - the type of contraint - must be 'soft'
0201             %   'fnc'/'minEnt'|'maxEnt' the entropy cost to calculate.
0202             %   'sample1'/sample - the first sample
0203             %   's1ColName'/string - name of column in 1st sample
0204             %   'pSpread'/'sample'|'allItems'  Should entropy be maximized relative to sample items only, or to the theoretical min and max values in the population?
0205             %
0206             %
0207             % OPTIONAL:
0208             %   'nbin'/integer - number of bins to divide data into for
0209             %       purposes of calculating probability distribution.  Defaults
0210             %       to number of items in the sample.  Must be greater than 2
0211             %       and <= number of items in the sample
0212             %   'exponent'/numeric - defaults to 2 (quadratic difference)
0213             %   'weight'/numeric - defaults to 1 (equal weighting of all soft costs)
0214             
0215             p = inputParser;
0216 
0217             p.addParamValue('sosObj','null',@(sosObj)strcmp(class(sosObj),'sos'));
0218             p.addParamValue('constraintType', 'null', ...
0219                 @(constraintType)any(strcmp({'soft'},constraintType)));
0220             p.addParamValue('fnc','null', ...
0221                  @(fnc)any(strcmp({'minEnt' 'maxEnt'},fnc)));
0222             p.addParamValue('nbin',2,@(nbin)validateattributes(nbin, {'numeric'}, ...
0223                 {'scalar', 'integer', 'positive', '>', 0}));
0224             p.addParamValue('pdSpread','null', ...
0225                  @(pdSpread)any(strcmp({'sample' 'allItems'},pdSpread)));
0226             p.addParamValue('sample1','null',@(sample1)strcmp(class(sample1),'sample'));
0227             p.addParamValue('s1ColName','',@(s1ColName)ischar(s1ColName));
0228             p.addParamValue('exponent',2,@(exponent)isnumeric(exponent));
0229             p.addParamValue('weight',1,@(weight)isnumeric(weight));
0230             p.addParamValue('name','noname',@(name)ischar(name));
0231             
0232             p.parse(varargin{:});
0233             
0234             % check additional constraints on values submitted to the
0235             % constructor
0236             if(p.Results.sosObj.containsSample(p.Results.sample1) == false)
0237                 error('Cannot create soft distance constraint: sos Object does not contain the sample1');
0238             end
0239             
0240             col1 = p.Results.sample1.colName2colNum(p.Results.s1ColName);           
0241             if(col1 == -1)
0242                 error('Specified column name not found in sample1');
0243             end         
0244             
0245             if(strcmp(p.Results.sample1.format{col1},'%f') == 0)
0246                 error('Specified column is not of numeric (%f) format, so cannot use as hard bound');
0247             end   
0248             
0249             if any(strcmp(p.UsingDefaults,'nbin'))
0250                 obj.nbin = p.Results.sample1.n;
0251             else
0252                 if p.Results.nbin <= p.Results.sample1.n
0253                     obj.nbin = p.Results.nbin;
0254                 else
0255                     error('Number of bins must be <= number of observations in the sample');
0256                 end
0257             end
0258             
0259             if obj.nbin < 2
0260                 error ('There must be at least 2 bins in the entropy calculation');
0261             end
0262             
0263             
0264             %assign the appropriate handles for the calculation.
0265             obj.sosObj = p.Results.sosObj;
0266             obj.constraintType = p.Results.constraintType;
0267             obj.fnc = p.Results.fnc;           
0268             obj.weight = p.Results.weight;
0269             obj.s1 = p.Results.sample1;     
0270             obj.s1ColName = p.Results.s1ColName;
0271             obj.s1Col = col1;   
0272             obj.exp = p.Results.exponent;      
0273             
0274             obj.initStats = @obj.initEnt;
0275             obj.swStats = @obj.swEntropy;
0276             
0277             if(strcmp(p.Results.fnc,'minEnt'))
0278                 obj.comparison = @obj.minEnt;
0279             elseif(strcmp(p.Results.fnc,'maxEnt'))
0280                obj.comparison = @obj.maxEnt;
0281             else
0282                 error('Entropy <fnc> must be either "minEnt" or "maxEnt"');
0283             end
0284 
0285             if(any(strcmp({'sample' 'allItems'},p.Results.pdSpread)))
0286                 obj.pdSpread = p.Results.pdSpread;
0287             else
0288                 error('pdSpread must be either "sample" or "allItems"');
0289             end
0290              
0291             obj.cost = NaN;
0292             obj.swCost = NaN;
0293             
0294             % add the name and the label
0295             obj.label = [obj.constraintType,'_',obj.fnc,...
0296                     '_pd_',obj.pdSpread,'_nbin',num2str(obj.nbin),'_',...
0297                     obj.s1.name,'_',...
0298                     obj.s1ColName,'_w',...
0299                     num2str(obj.weight),'_e',num2str(obj.exp)];              
0300             if any(strcmp(p.UsingDefaults,'name'))                 
0301                 obj.name = obj.label;
0302             else
0303                  obj.name = p.Results.name;  
0304             end     
0305             
0306             
0307             verbosePrint('Soft Entropy Constraint has been created', ...
0308                     'softEntropyConstraint_Constructor_endObjCreation');
0309         end % constructor
0310         
0311             
0312         %% cost = initCost() METHOD
0313         function cost = initCost(obj)
0314             % Calculates, saves, and returns the cost value for the current items in the sample.
0315           
0316             %init the stats, then compare them.  Return the calculated cost
0317             obj.initStats(); 
0318             cost = obj.comparison(obj.ent);
0319             
0320             obj.swEnt = NaN;
0321             obj.swScores = NaN;
0322             obj.swpd = NaN;
0323             obj.swMinScore = NaN;
0324             obj.swMaxScore = NaN;
0325                        
0326             obj.cost = cost;
0327             
0328         end %initCost
0329         
0330         %% initEnt() METHOD
0331         function initEnt(obj)
0332             % calculates the initial entropy value for the scores in
0333             % sample1<sample1Col>
0334             
0335             obj.scores = (obj.s1.zdata{obj.s1Col})';
0336             
0337             
0338             % get min and max values for the probability distribution:
0339             %this procedure is different depending on whether we are using
0340             %the sample as the range or allItems
0341             
0342             if strcmp(obj.pdSpread,'sample')
0343                 minVal = min(obj.scores);
0344                 maxVal = max(obj.scores);    
0345             elseif strcmp(obj.pdSpread,'allItems')
0346                 %check in the population
0347                 minVal = min([obj.s1.population.zdata{obj.s1Col}]);
0348                 maxVal = max([obj.s1.population.zdata{obj.s1Col}]);
0349 
0350                 %check in all samples associated with the population (this
0351                 %includes the original one
0352                 for i=1:length(obj.s1.population.samples)
0353                     minVal = min([minVal; ...
0354                         obj.s1.population.samples(i).zdata{obj.s1Col}]);
0355                     maxVal = max([maxVal; ...
0356                         obj.s1.population.samples(i).zdata{obj.s1Col}]);                
0357                 end
0358             else
0359                 error('Unsupported pdSpread in initEnt.  pdSpread must be "sample" or "allItems"');
0360             end
0361                 
0362             % add a smallVal to avoid boundary cases
0363             minVal = minVal - obj.smallVal;
0364             maxVal = maxVal + obj.smallVal;
0365             
0366             %now have min and max values, determine size of each bin:
0367             obj.minScore = minVal;
0368             obj.maxScore = maxVal;
0369             
0370             spread  = maxVal - minVal;            
0371             binSize = spread/obj.nbin;
0372             
0373             %smallVal added so that the last score is included (simulating
0374             %<= rather than <
0375             obj.bins = (minVal+0.5*binSize):binSize:(maxVal-0.5*binSize)+obj.smallVal;
0376               
0377             % scale values to 0-1
0378             obj.pd = hist(obj.scores,obj.bins)/length(obj.scores);
0379            
0380             
0381             % calculate entropy
0382             obj.ent = obj.pd*log((obj.pd+1)')/length(obj.pd);     
0383             obj.ent = -1* (obj.ent - 1/length(obj.pd)*log(1/length(obj.pd)+1)) / ...
0384                 (1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1));
0385            
0386         end % initEnt
0387         
0388         %% cost = minEnt(curEnt) METHOD
0389         function cost = minEnt(obj,curEnt)
0390             % calculates the cost associated with curEnt, when minimal entropy is desired
0391             cost = -(abs((curEnt))^obj.exp)*obj.weight;
0392         end
0393         
0394         %% cost = maxEnt(curEnt)
0395         function cost = maxEnt(obj,curEnt)
0396             % calculates the cost associated with curEnt, when maximal entropy is desired
0397             cost = (abs((curEnt))^obj.exp)*obj.weight;
0398         end        
0399 
0400         %%  swCost(targSample,targSampleIndex, feederdf,feederdfIndex) METHOD
0401         function swCost = swapCost(obj,targSample,targSampleIndex, feederdf,feederdfIndex)
0402             % Calculates the new cost if items from targSample and feederdf were swapped.
0403             %
0404             %By definition, if this method is called it means that at least
0405             % one of the two swap objects is implicated in this function
0406             %
0407             %PARAMETERS:
0408             %   targSample - the target sample (i.e., the object that will call it's swapSample() method if a swap later occurs).
0409             %   targSampleIndex - row index of item to swap
0410             %   feederdf - dataframe (sample/pop) containin the other item to swap
0411             %   feederdfIndex - row index of item to swap.
0412  
0413             
0414            if (obj.s1 ~= targSample && obj.s1 ~= feederdf)
0415                error('swCost called, but no sample part of this cost function');
0416            end
0417 
0418            % update the stats, then calculate cost of new stats
0419            obj.swStats(targSample,targSampleIndex, feederdf,feederdfIndex);
0420            
0421            swCost = obj.comparison(obj.swEnt);
0422             
0423            obj.swCost = swCost;
0424               
0425         end
0426 
0427         %% swEnt(targSample,targSampleIndex,feederdf,feederdfIndex) METHOD
0428         function swEntropy(obj,targSample,targSampleIndex, feederdf,feederdfIndex)
0429            % Calculates the new (swap) entropy if items from targSample and feederdf were swapped.
0430            %
0431            % Inputs are the same as for swapCost()
0432  
0433            
0434            
0435            obj.swScores = obj.scores;
0436            obj.swpd = obj.pd;
0437            obj.swMinScore = obj.minScore;
0438            obj.swMaxScore = obj.maxScore;
0439            obj.swBins = obj.bins;
0440            
0441            % re-extract the raw, unnormalized entropy value from the
0442            % current entropy
0443            rawEnt = obj.ent;
0444            rawEnt = rawEnt *(1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1));
0445            rawEnt = rawEnt*-1;
0446            rawEnt = rawEnt + 1/length(obj.pd)*log(1/length(obj.pd)+1);
0447       
0448            spread  = obj.swMaxScore - obj.swMinScore;            
0449            binSize = spread/obj.nbin;
0450                 
0451            recalculate = false;
0452 
0453             if targSample == obj.s1
0454                 
0455                 obj.swScores(targSampleIndex) = ...
0456                     feederdf.zdata{obj.s1Col}(feederdfIndex);
0457                 
0458                 % if the min and max bounds may change and we're only
0459                 % calculating the entropy of the sample and letting the
0460                 % range of the sample vary dynamically, recalculate entropy
0461                 % from scratch
0462                 if(strcmp(obj.pdSpread,'sample') && ...
0463                     (lt(obj.swScores(targSampleIndex)-obj.smallVal, obj.minScore) || ...
0464                      gt(obj.swScores(targSampleIndex)+obj.smallVal, obj.maxScore) || ...
0465                      obj.scores(targSampleIndex)-obj.smallVal == obj.minScore || ...
0466                      obj.scores(targSampleIndex)+obj.smallVal == obj.maxScore))
0467                 
0468                     recalculate = true;
0469                 else
0470                    %we can do a local update
0471                 
0472                    found = false;
0473                    
0474                    % find the old bin:
0475                    for i=1:(length(obj.swBins))
0476                      
0477                        
0478                        if (obj.scores(targSampleIndex) >= obj.swBins(i) - 0.5*binSize ...
0479                                && obj.scores(targSampleIndex) < obj.swBins(i)+0.5*binSize) 
0480                                
0481                            %remove the old score
0482                            rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0483                            obj.swpd(i) = obj.swpd(i) - 1/length(obj.scores);
0484                            rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0485                            found = true; 
0486                            break;
0487                        end
0488                    end
0489                    
0490                    if found == false
0491                        error('unable to find the swap bin');
0492                    end
0493                    
0494                    found = false;
0495                   
0496                     for i=1:(length(obj.swBins))
0497                        if obj.swScores(targSampleIndex) >= obj.swBins(i) - 0.5*binSize ...
0498                                && obj.swScores(targSampleIndex) < obj.swBins(i) + 0.5*binSize
0499                            %add the new score
0500                            
0501                            rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0502                            obj.swpd(i) = obj.swpd(i) + 1/length(obj.scores);
0503                            rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0504                            
0505                             found = true;
0506                            break;
0507                        end
0508                     end  
0509                    
0510                    if found == false
0511                        error('unable to find the swap bin');
0512                    end
0513                    
0514                 end                
0515              end
0516                                
0517              if feederdf == obj.s1
0518                 
0519                 obj.swScores(feederdfIndex) =  ...
0520                     targSample.zdata{obj.s1Col}(targSampleIndex);
0521  
0522                 % if using the sample's spread and the new score could
0523                 % exceed the existing bounds, recalculate from sractch
0524                 if(strcmp(obj.pdSpread,'sample') && ...
0525                         (lt(obj.swScores(feederdfIndex)-obj.smallVal, obj.minScore) || ...
0526                          gt(obj.swScores(feederdfIndex)+obj.smallVal, obj.maxScore) || ...
0527                          obj.scores(feederdfIndex)-obj.smallVal == obj.minScore || ...
0528                          obj.scores(feederdfIndex)+obj.smallVal == obj.maxScore))
0529                 
0530                     recalculate = true;
0531                 else
0532                    %we can do a local update
0533                   
0534                    found = false;
0535                    
0536 
0537                    
0538                    % find the old bin:
0539                    for i=1:(length(obj.swBins))
0540                        if obj.scores(feederdfIndex) >= obj.swBins(i) - 0.5*binSize ...
0541                                && obj.scores(feederdfIndex) < obj.swBins(i) + 0.5*binSize
0542                            %remove the old score
0543                            rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0544                            obj.swpd(i) = obj.swpd(i) - 1/length(obj.scores);
0545                            rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0546                             
0547                            found = true;
0548                            break;
0549                        end
0550                    end
0551                    
0552                    if found == false
0553                        error('Unable to find the old bin');
0554                    end
0555                    
0556                    found = false;
0557                     for i=1:(length(obj.swBins))
0558                        if obj.swScores(feederdfIndex) >= obj.swBins(i) - 0.5*binSize ...
0559                                && obj.swScores(feederdfIndex) < obj.swBins(i) + 0.5*binSize
0560                            %add the new score
0561                            rawEnt = rawEnt - obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0562                            obj.swpd(i) = obj.swpd(i) + 1/length(obj.scores);
0563                            rawEnt = rawEnt + obj.swpd(i)*log(obj.swpd(i)+1)/length(obj.pd);
0564                          
0565                            found = true;
0566                            break;
0567                        end
0568                     end    
0569                     
0570                     if found == false
0571                        error('Unable to find an appropriate swbin for the data');
0572                     end
0573                    
0574                 end                
0575             end
0576             
0577             % a local update is possible if neither of the swap scores
0578             % exceeds the current spread.
0579             
0580             %recalculate = true;
0581             
0582             if recalculate == true
0583                 %re-derive all measures
0584                 minVal = min(obj.swScores);
0585                 maxVal = max(obj.swScores);  
0586                 
0587                 minVal = minVal - obj.smallVal;
0588                 maxVal = maxVal + obj.smallVal;
0589                 
0590                 %now have min and max values, determine size of each bin:
0591                 obj.swMinScore = minVal;
0592                 obj.swMaxScore = maxVal;
0593 
0594                 spread  = maxVal - minVal;            
0595                 binSize = spread/obj.nbin;
0596                 % remove the small -1.0e-14 value to avoid rounding errors
0597                 % that result in non-overlapping bins.  So long as this
0598                 % number is < e-16 then this appears to prevent these
0599                 % rounding errors.  And so long as this number is
0600                 % substantially smaller than smallVal it will not interfere
0601                 % with the actual bin sizes (currently smallVal is <
0602                 % 1.0-e10)
0603                 obj.swBins = (minVal + 0.5*binSize):binSize:(maxVal - 0.5*binSize)+obj.smallVal;
0604 
0605                 for i=1:length(obj.swBins)
0606                     obj.swBins(i) = obj.swBins(i) - 1.0e-14;
0607                 end
0608 
0609                 obj.swpd = hist(obj.swScores,obj.swBins)/length(obj.scores);
0610                 obj.swEnt = obj.swpd*log((obj.swpd+1)')/length(obj.pd);     
0611                 obj.swEnt = -1* (obj.swEnt - 1/length(obj.pd)*log(1/length(obj.pd)+1)) / ...
0612                     (1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1));
0613             
0614             else
0615                 % cannot exceed bounds because min and max were taken from
0616                 % the population.  Proceed with local update
0617                 obj.swEnt = -1* (rawEnt - 1/length(obj.pd)*log(1/length(obj.pd)+1)) / ...
0618                     (1*log(2)/length(obj.pd) - 1/length(obj.pd)*log(1/length(obj.pd)+1));
0619                             
0620             end
0621                
0622             if isnan(obj.swEnt)
0623                 error('NaN obtained during swEnt computation (perhaps there is missing data for an item?)');
0624             end
0625       
0626         end %swEntropy()
0627 
0628         
0629        %% cost = acceptSwap() METHOD
0630         function cost = acceptSwap(obj)
0631             % alter internal variables to reflect accepted swap
0632             
0633             
0634             if isnan(obj.swEnt) 
0635                 %do nothing, no need to swap
0636             else
0637                 obj.ent = obj.swEnt;
0638                 obj.swEnt = NaN;
0639                 obj.scores = obj.swScores;
0640                 obj.swScores = NaN;
0641 
0642                obj.pd = obj.swpd;
0643                obj.swpd = NaN;
0644 
0645                obj.minScore = obj.swMinScore;
0646                obj.swMinScore = NaN;
0647                obj.maxScore = obj.swMaxScore;
0648                obj.swMaxScore = NaN;
0649                
0650                obj.bins = obj.swBins;
0651                obj.swBins = NaN;
0652                       
0653             end
0654             
0655             cost = acceptSwap@genericConstraint(obj);
0656         end
0657 
0658         %% cost = rejectSwap() METHOD
0659         function cost = rejectSwap(obj)
0660             % set internal variables to reflect rejection of proposed swap
0661             obj.swEnt = NaN;
0662             obj.swScores = NaN;
0663             obj.swpd = NaN;
0664             obj.swMinScore = NaN;
0665             obj.swMaxScore = NaN;
0666             
0667             obj.swBins = NaN;
0668                         
0669             cost = rejectSwap@genericConstraint(obj);
0670         end
0671         
0672         %% plotDistribution() METHOD
0673         function plotDistribution(obj)
0674             % plots the probability distribution used to calculate entropy
0675             
0676             % make the entropy figure the active figure
0677             if isempty(obj.figHandle)
0678                 obj.figHandle = figure('Name',['Entropy - ',obj.s1.name, ...
0679                     '|',obj.s1ColName],'NumberTitle','off');
0680                 
0681             end
0682             
0683             figure(obj.figHandle);
0684             
0685             if isempty(obj.cost) == false && isnan(obj.cost) == false %the constraints have not been initialized
0686                 clf();
0687                 
0688 
0689                 if strcmp(obj.pdSpread,'sample')
0690                     minVal = min(obj.s1.data{obj.s1Col});
0691                     maxVal = max(obj.s1.data{obj.s1Col});
0692                 elseif strcmp(obj.pdSpread,'allItems')
0693                     %check in the population
0694                     minVal = min([obj.s1.population.data{obj.s1Col}]);
0695                     maxVal = max([obj.s1.population.data{obj.s1Col}]);
0696 
0697                     %check in all samples associated with the population (this
0698                     %includes the original one
0699                     for i=1:length(obj.s1.population.samples)
0700                         minVal = min([minVal; ...
0701                             obj.s1.population.samples(i).data{obj.s1Col}]);
0702                         maxVal = max([maxVal; ...
0703                             obj.s1.population.samples(i).data{obj.s1Col}]);                
0704                     end
0705                 else
0706                     error('Unsupported pdSpread in initEnt.  pdSpread must be "sample" or "allItems"');
0707                 end
0708 
0709                 % add a smallVal to avoid boundary cases
0710                 minVal = minVal - obj.smallVal;
0711                 maxVal = maxVal + obj.smallVal;
0712 
0713                 spread  = maxVal - minVal;            
0714                 binSize = spread/obj.nbin;
0715 
0716                 %smallVal added so that the last score is included (simulating
0717                 %<= rather than <
0718                 tmpbins = (minVal+0.5*binSize):binSize:(maxVal-0.5*binSize)+obj.smallVal;
0719 
0720                 % scale values to 0-1
0721                 tmppd = hist(obj.s1.data{obj.s1Col},obj.nbin)/length(obj.s1.data{obj.s1Col});
0722                 
0723                 bar(tmpbins,tmppd,'hist');
0724                 ylim([0 1]);
0725                 title(['Entropy - ',obj.s1.name, ...
0726                     '|',obj.s1ColName]);
0727                  xlabel([obj.s1ColName, ]);%' - each bar is centered on bin mid-point']);
0728                 ylabel('p(item from a particular bin)');
0729             else
0730                 clf();
0731                 ylim([0 1]);
0732                 title(['Entropy - ',obj.s1.name, ...
0733                     '|',obj.s1ColName]);
0734                 xlabel([obj.s1ColName, ]);%' - each bar is centered on bin mid-point']);
0735                 ylabel('p(item from a particular bin)');
0736                 text(0.5,0.5,'Cost must be initialized before data is plotted', ...
0737                     'HorizontalAlignment','center');
0738             end
0739             
0740 
0741                       
0742           
0743         end
0744         
0745     end
0746     
0747 end
0748
softEntropyConstraint

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SUBFUNCTIONS

SOURCE CODE