Week 11: Improving UPROB

This week I improved the UPROB module: I implemented UPROB(K=100) functionality and internalized the filtration operation so that it automatically chooses optimal minimum and maximum point counts. I also overcame last week's caching challenge and implemented a working cache for condP().

Monday:

Discussed UPROB's functionality with Roger sir and got some suggestions and feedback. The improvements I will be making to UPROB are:

  • Calculate minPoints and maxPoints for filtering operations internally and abstract them from the user.

  • Implement UPROB(K=100)

  • Cache the RKHS keyed on the filter values rather than on the filtered variables, to avoid re-initializing it. This solves the caching issue I ran into last week.



Tuesday and Wednesday:

Implemented the caching functionality. This will be especially helpful in the condP() calculation for queries that only alter the value of the target variable while keeping the conditionals constant. For example:

The following operations can all be performed with a single RKHS:

P(A = 1 | B = 2, C = 3)
P(A = 3 | B = 2, C = 3)
P(A = 5 | B = 2, C = 3)
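
As a rough sketch of the idea (get_rkhs and build_rkhs are hypothetical stand-ins, not the actual UPROB internals): the cache key is built from the conditional values, so all three queries above reuse a single entry:

    # Hypothetical sketch: cache RKHS objects keyed on the conditional
    # (variable, value) pairs, so queries that differ only in the target
    # value reuse the same filtered RKHS instead of rebuilding it.
    _rkhs_cache = {}

    def get_rkhs(conditionals, build_rkhs):
        # conditionals: e.g. {'B': 2, 'C': 3}. Dicts are unhashable, so
        # freeze them into a sorted tuple to form the cache key.
        key = tuple(sorted(conditionals.items()))
        if key not in _rkhs_cache:
            _rkhs_cache[key] = build_rkhs(conditionals)  # the expensive step
        return _rkhs_cache[key]

    # P(A=1|B=2,C=3), P(A=3|B=2,C=3) and P(A=5|B=2,C=3) all share the
    # entry keyed by (('B', 2), ('C', 3)).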


I also implemented a variant of UPROB(K=100) that does not use an RKHS at all. To explain how it works:


In a simple example with 3 dimensions, A, B and C, when we try to find, say, P(A | B = 1, C = 2), filtering the dataset leaves us with some n data points of the form:

(a1, 1, 2); (a2, 1, 2); (a3, 1, 2); ... ; (an, 1, 2), for some values a1, a2, a3, ..., an.

 

For such a dataset I deduced that the average of all these a1, a2, a3, ..., an values should give us the expected value of A. I implemented this and, sure enough, the R2 values of this method are very similar to ProbSpace's.
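
A minimal sketch of this averaging method in Python (the exact-match filter is a simplified stand-in for ProbSpace.filter(), which widens the match range rather than using a fixed tolerance):

    import numpy as np

    def expected_value_by_averaging(data, target_col, conditionals, tol=1e-9):
        # data: (N, dims) array; conditionals: {column_index: value}.
        # Keep only the rows matching every conditional, then average
        # the surviving target values a1, a2, ..., an.
        mask = np.ones(len(data), dtype=bool)
        for col, val in conditionals.items():
            mask &= np.abs(data[:, col] - val) < tol
        matched = data[mask, target_col]
        return matched.mean() if matched.size else None

    # Toy example with columns (A, B, C): E[A | B=1, C=2]
    data = np.array([[0.4, 1, 2], [0.6, 1, 2], [0.9, 3, 2]])
    print(expected_value_by_averaging(data, 0, {1: 1, 2: 2}))  # -> 0.5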

Here are some results for dims = 2, 3, 4, 5, 6; datSize = 1000; tries = 10; K = 100:

dims   Avg R2 (JP)     Avg R2 (UP)       Avg R2 (PS)
2      0.95948351161   0.9917388944263    0.9618719349801685
3      0.9470697228    0.759582385234     0.8081795917910626
4      0.9078011903    0.687716501589     0.6617432174774956
5      0.8003123156    0.404339122151     0.3788512381207686
6      0.528478955     0.037851936471    -0.0110369733116401

Thursday and Friday:

Implemented UPROB(K=100) again, but this time using a single-dimensional RKHS. Consider the query:

P(A = 1 | B = 2, C = 3)


As in the previous method, I first filter on all of the conditional variables (B, C). I then pass the remaining data points (only the A values that survive filtration) into a one-dimensional RKHS to calculate the expected value of A: I run this query for different values of A, and the mean of these results gives me E(A | B = 2, C = 3).
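
As a rough sketch of this step, assuming a Gaussian kernel and hypothetical defaults (the actual UPROB kernel and bandwidth may differ): build a one-dimensional density from the filtered A values, query it across a grid of candidate A values, and take the density-weighted mean.

    import numpy as np

    def rkhs_expected_value(a_values, sigma=0.2, grid_size=200):
        # a_values: the A values that survive filtering on B and C.
        a = np.asarray(a_values, dtype=float)
        grid = np.linspace(a.min() - 3 * sigma, a.max() + 3 * sigma, grid_size)
        # Density query f(a): average of Gaussian kernels centred on the data.
        dens = np.exp(-0.5 * ((grid[:, None] - a[None, :]) / sigma) ** 2).mean(axis=1)
        dens /= dens.sum()                    # normalise over the grid
        return float((grid * dens).sum())     # discretised E[A | B=2, C=3]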


I was pleasantly surprised that the results of this method are highly accurate:


Results for datSize = 1000, tries = 1:

dims   Avg R2 (JP)   Avg R2 (UP)     Avg R2 (PS)
2      0.842644618   0.9828869871    0.9696438228341947
3      0.717250601   0.73071595754   0.7154192883214503
4      0.664790677   0.72162423052   0.6011758633380084



Next I implemented minPoints and maxPoints calculations:


  • self.minPoints = ceil(self.N**((dims - filter_len)/dims)) * self.rangeFactor

  • self.maxPoints = ceil(self.N**((dims - filter_len)/dims)) / self.rangeFactor


Here the goal was to find the rangeFactor value that makes the filtering operation return a useful number of points; a sketch of the calculation follows.
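
Restated as a small helper for clarity (a sketch that mirrors the formulas above, not the module's actual code):

    from math import ceil

    def filter_bounds(N, dims, filter_len, rangeFactor):
        # Scale the target point count by how many dimensions remain
        # unfiltered; rangeFactor then spreads it into a [min, max] band.
        base = ceil(N ** ((dims - filter_len) / dims))
        minPoints = base * rangeFactor
        maxPoints = base / rangeFactor
        return minPoints, maxPoints

    # As written, a rangeFactor below 1 keeps minPoints < maxPoints,
    # e.g. filter_bounds(1000, 3, 2, 0.5) -> (5.0, 20.0)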

While running tests, I detected a fatal error in the filtering operation, ProbSpace.filter(), that we have been using so far. The way ProbSpace is configured, if it cannot get at least minPoints data points within a certain range of the conditional value, it widens the range and checks again. It tries this for 8 iterations. If it is still unable to find minPoints data points after that, it returns 0 data points (this generally only happens when the dataset is small, N ~= 1000).


This is a problem because we cannot initialize our RKHS with 0 data points, and the error crashes the testing program. After discussion with Roger sir, I will implement a failsafe such that if UPROB(K=100) returns 0 data points, I retry with UPROB(K=80), or whatever value of K corresponds to filtering on all but one conditional variable. In other words, we progressively reduce the number of variables to be filtered until we obtain the required minimum number of data points.


I have implemented this new K-decrement filtering, but its parameters need some fine-tuning: in its current state, the algorithm always drops all the way to UPROB(K=0), which is just JPROB, reducing accuracy.
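
For reference, a hypothetical sketch of the intended decrement loop (filter_fn is a stand-in for ProbSpace.filter(); the real logic lives inside UPROB):

    def decrement_filter(filter_fn, conditionals, all_points, min_points=1):
        # conditionals: e.g. [('B', 2), ('C', 3)]. Drop one conditional
        # at a time (lowering the effective K) until enough points survive.
        conds = list(conditionals)
        while conds:
            points = filter_fn(conds)
            if len(points) >= min_points:   # back off only on failure
                return points, conds
            conds = conds[:-1]              # one fewer filtered variable
        return all_points, []               # K = 0: falls back to JPROB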


Next week plan:

Perfect the K-decrement algorithm so that it only reduces K when the number of data points obtained from filtration is 0.

