Week 10: Implementing UPROB

Week 10 Report

This week, I developed the UPROB module (you can check out the code here). Since this was the main focus throughout the week and I spent all my time developing, bug-fixing, testing, and documenting results, I feel that a day-wise summary would not be the most effective way to present results. Instead, I have opted to explain the functionality, workings, and usage of this module, which will also serve as documentation for future reference. The results and conclusions are listed towards the end of this report.


UPROB Init:


Initializing a UPROB object is very similar to initializing an rkhsMV object, except for the final argument k, which denotes the percentage of conditional variables to be filtered out before applying the JPROB methods to the remaining ones.


R1 and R2 are two separate rkhsMV instances, one for each of the UPROB methods, condE() and condP(). The idea is to cache R1 and R2 so that, if we repeatedly perform operations on the same dataset, we need not initialize a new kernel each time, thus saving time. However, this runs into a problem when dealing with filtering, which I will expand upon in a later section.
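As a rough sketch of this caching idea (a minimal approximation on my part, not the actual implementation; the real constructor takes the same arguments as rkhsMV plus k):

```python
class UPROB:
    """Sketch of the UPROB wrapper: filter some conditional variables,
    then apply the JPROB methods to the rest."""

    def __init__(self, data, k=0):
        self.data = data
        self.k = k      # % of conditional variables to filter out
        # One cached rkhsMV instance per UPROB method (condE / condP),
        # built on first use so that repeated queries against the same
        # dataset reuse the kernel. As explained below, this cache is
        # only valid when no filtering is applied.
        self.R1 = None
        self.R2 = None
```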

Usage:

Suppose the dataset we are dealing with has 5 dimensions. This means there is 1 target variable and 4 conditional variables.

The dataset is of the form ['A', 'B', 'C', 'D', 'E'], and we are calculating E(A|B,C,D,E).


Suppose K = 25

This means that 25% of the conditional variables are to be filtered out.
25% of 4 = 1 variable


The way I have configured the filtration process, variables are filtered from the end. In this example, variable E is filtered out of the dataset and we are left calculating E(A|B,C,D) using JPROB.


Similarly, if K = 75, we will filter out the 3 variables C, D, and E and calculate just E(A|B).

And naturally, if K = 0, no variables are filtered and we simply apply JPROB to calculate E(A|B,C,D,E). I will explain the filtration process itself in the next section.
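To make the bookkeeping concrete, here is a tiny self-contained sketch (plain Python, independent of the actual module) showing which conditional variables survive for the K values above:

```python
from math import floor

def surviving_vars(cond_vars, K):
    """Drop floor(K% of len(cond_vars)) variables, counting from the end."""
    n_filter = floor((K / 100) * len(cond_vars))
    return cond_vars[:len(cond_vars) - n_filter]

cond = ['B', 'C', 'D', 'E']
print(surviving_vars(cond, 25))   # ['B', 'C', 'D']      -> E(A|B,C,D)
print(surviving_vars(cond, 75))   # ['B']                -> E(A|B)
print(surviving_vars(cond, 0))    # ['B', 'C', 'D', 'E'] -> E(A|B,C,D,E)
```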

UPROB methods:

  • condP():
        

rkhsMV.condP() takes just one argument, a vector X = [x1, x2, ..., xn], such that we calculate P(A=x1 | B=x2, C=x3, ...). UPROB.condP() takes 3 additional arguments: K (the same percentage explained in the previous section), and minPoints and maxPoints, which are used during the filtration process.
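As a hedged sketch of the signature and the filter-count computation (the body is elided, and any names beyond X, K, minPoints, and maxPoints are approximations rather than the actual code):

```python
from math import floor

# Hypothetical sketch of UPROB.condP(); only the filter count is shown.
def condP(self, X, K=0, minPoints=None, maxPoints=None):
    """P(A = x1 | B = x2, C = x3, ...) with K% of the conditional
    variables filtered out before applying JPROB."""
    n_cond = len(X) - 1                   # number of conditional variables
    n_filter = floor((K / 100) * n_cond)  # how many to drop from the end
    ...
```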


The sketch above also shows how we calculate the number of variables to filter.


Suppose the number of dimensions = 5 (meaning 4 conditional variables) and K = 30.


30% of 4 = 1.2


In this case we use the floor() function, i.e., floor(1.2) = 1, and hence we filter 1 variable.


This means that providing K = 99 is generally a surefire way to filter all but 1 variable, since floor(0.99 * n) = n - 1 for any n up to 100. For example, 99% of 4 = 3.96, and floor(3.96) = 3.


The problem with filtration:


Earlier, we mentioned that caching our rkhs kernel is a problem when we use filtration. This is because when we filter datapoints, the set of filtered datapoints we obtain differs based on the value of the conditional variable(s) we are filtering on. A simplified example:

Dims = 5, K = 50

Suppose we want to find

P(A=1 | B=2, C=-1, D=5, E=6) and P(A=1 | B=2, C=-1, D=5, E=10)

corresponding to 2 data points, (1,2,-1,5,6) and (1,2,-1,5,10)


The only difference between these 2 data points is the value of variable E. However, we cannot use the same instance of R1 to calculate condP(), because:


We filter 50%, i.e., 2 variables (D and E), from the dataset. When filtering, we obtain the resultant data points on the condition D = 5, E = 6 for the first data point.


However, when we are evaluating the second data point, we need data points filtered on the condition D = 5, E = 10, which is not the same as the previously obtained set of filtered data points.

Hence, unfortunately, caching is not an option, and the rkhs instance needs to be reinitialized for every data point when we are utilizing filtration.
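To illustrate, here is a small self-contained sketch. It assumes that filtering keeps datapoints whose filtered-variable values lie near the query values (the exact rule, governed by minPoints and maxPoints, may differ), but the conclusion holds either way: the two queries yield different filtered subsets.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(-2, 12, size=(1000, 5))   # columns A, B, C, D, E

def filter_rows(data, conds, tol=1.0):
    """Keep rows whose filtered variables lie within tol of the query."""
    mask = np.ones(len(data), dtype=bool)
    for col, val in conds:
        mask &= np.abs(data[:, col] - val) < tol
    return data[mask]

subset1 = filter_rows(data, [(3, 5.0), (4, 6.0)])    # D = 5, E = 6
subset2 = filter_rows(data, [(3, 5.0), (4, 10.0)])   # D = 5, E = 10
# The subsets differ, so a kernel built on subset1 cannot answer the
# second query; the rkhs instance must be rebuilt for every data point.
print(len(subset1), len(subset2))
```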


  • condE():


Similar to condP(), condE() has the same interface as rkhsMV.condE(), with the same 3 additional arguments: K, minPoints, and maxPoints.

This method also runs into the same caching problem when filtration is involved.
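For completeness, a hedged signature sketch mirroring condP() (names are approximations, not the actual code):

```python
# Hypothetical sketch of UPROB.condE(); same interface as condP().
def condE(self, X, K=0, minPoints=None, maxPoints=None):
    """E(A | B, C, ...) at the query point X, with K% of the
    conditional variables filtered out; body elided."""
    ...
```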

Testing and Results:

The testing + comparison scripts I used for condP() and condE() methods can be found here and here respectively.


There are quite a few factors to vary and compare here:

  1. Number of Datapoints

  2. Dimensions (Number of conditional variables)

  3. K (% of variables to filter)

and any combination of the above. This also excludes the minPoints and maxPoints arguments for filtration.
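As a minimal sketch of how the summary lines below are produced (the real scripts score JPROB, UPROB, and ProbSpace predictions; a stand-in predictor is used here so the snippet runs on its own):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, the metric reported below."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
scores = []
for _ in range(10):                                   # "tries" runs
    y = rng.normal(size=1000)
    y_hat = y + rng.normal(scale=0.5, size=1000)      # stand-in predictor
    scores.append(r2(y, y_hat))

scores = np.asarray(scores)
print("Average, Min, Max, Std R2 =",
      scores.mean(), scores.min(), scores.max(), scores.std())
```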


However, here are some observed results upon extensive testing:


  1. UPROB is significantly better than JPROB and ProbSpace at higher numbers of datapoints.

For example, here are the results of varying data sizes for dims = 6 and K = 25. 


dims = 6, datSize = 1000, tries = 10, K = 25
Average R2 (JP, UP, PS): 0.5054770104980694, 0.6848691336132597, -0.05166716263016276
Min R2 (JP, UP, PS): 0.41836323070529136, 0.57773465634426, -0.1918955782499241
Max R2 (JP, UP, PS): 0.5864285627942418, 0.7242482651077647, 0.07638348669314632
Std R2 (JP, UP, PS): 0.057676690148538164, 0.03931599275870492, 0.0858202275171853
Runtimes (JP, UP, PS): 0.004376943, 0.00276687, 0.002192977


dims = 6, datSize = 10000, tries = 10, K = 25
Average R2 (JP, UP, PS): 0.664148835112867, 0.8920561409854251, 0.43136058392161214
Min R2 (JP, UP, PS): 0.613286717497986, 0.8670004503228375, 0.3401795952516222
Max R2 (JP, UP, PS): 0.7258628762475468, 0.9138816225612929, 0.4860661115122077
Std R2 (JP, UP, PS): 0.03068490135189018, 0.01254851628877029, 0.05038726650446831
Runtimes (JP, UP, PS): 0.0049546477, 0.0016485467, 0.001680325


dims = 6, datSize = 100000, tries = 3, K = 25
Average R2 (JP, UP, PS): 0.7519314697060077, 0.932155421318086, 0.627585696346015
Min R2 (JP, UP, PS): 0.7412501371222967, 0.9254657779137831, 0.6058950094714524
Max R2 (JP, UP, PS): 0.7626418277078421, 0.9396762680511742, 0.6552845665001804
Std R2 (JP, UP, PS): 0.008733145228881379, 0.005831105196361989, 0.02060591801100624
Runtimes (JP, UP, PS): 0.0043991443, 0.0014187821, 0.0008068118333333334



  2. Filtering is more useful at higher dimensions.

Here is an example of varying dimensions for datSize = 1000 and K = 50 or 30 (in both cases filtering exactly 1 variable). We observe that UPROB only beats JPROB at higher dimensions, implying that filtering is not useful at lower dimensions.


dims = 3, datSize = 1000, tries = 10, K = 50
Average R2 (JP, UP, PS): 0.9309764059208459, 0.8781039287990706, 0.8070951362451947
Min R2 (JP, UP, PS): 0.8903638351372366, 0.7891909579355112, 0.7509307756310211
Max R2 (JP, UP, PS): 0.9579976790874651, 0.9227986910666764, 0.8832366070711307
Std R2 (JP, UP, PS): 0.01890120379527616, 0.0430679715236071, 0.04067303986064
Runtimes (JP, UP, PS): 0.004062797, 0.002591509, 0.001419142


dims = 4, datSize = 1000, tries = 10, K = 50
Average R2 (JP, UP, PS): 0.9133952355240955, 0.8265118171274175, 0.6508554261904768
Min R2 (JP, UP, PS): 0.892400861180834, 0.7550964433535459, 0.5711322915956666
Max R2 (JP, UP, PS): 0.9377860718857439, 0.9088297932500913, 0.7088037503323505
Std R2 (JP, UP, PS): 0.015784096482029684, 0.04138208318095978, 0.03990677956072
Runtimes (JP, UP, PS): 0.004098137, 0.002465226, 0.001364853


dims = 5, datSize = 1000, tries = 10, K = 30
Average R2 (JP, UP, PS): 0.7922385546506344, 0.6965894523720709, 0.3587224666437617
Min R2 (JP, UP, PS): 0.7594699407677175, 0.6237208480515811, 0.26835230668011
Max R2 (JP, UP, PS): 0.8501688573399228, 0.7670967735306128, 0.49417505067557754
Std R2 (JP, UP, PS): 0.026957593104641134, 0.05276575139588666, 0.05768979287284774
Runtimes (JP, UP, PS): 0.003978654, 0.002426918, 0.001622321


dims = 6, datSize = 1000, tries = 10, K = 30
Average R2 (JP, UP, PS): 0.5022365823220564, 0.7006543106173015, -0.08841926654825051
Min R2 (JP, UP, PS): 0.38796964478843643, 0.660753386032186, -0.2880613302307433
Max R2 (JP, UP, PS): 0.6188868130540273, 0.7516141114462412, 0.0400211793138066
Std R2 (JP, UP, PS): 0.06423793879875953, 0.02651502695409599, 0.096889155907086
Runtimes (JP, UP, PS): 0.004413774, 0.002824172, 0.00216012


  3. Increasing K (filtration) leads to diminishing returns.

Here is an example of increasing the K value for a fixed datSize = 10,000 and dims = 6:



dims = 6, datSize = 10000, tries = 10, K = 20
Average R2 (JP, UP, PS): 0.6517231447317035, 0.8778086406183512, 0.4113404692300726
Min R2 (JP, UP, PS): 0.6056548619196698, 0.850187469024279, 0.258372843011633
Max R2 (JP, UP, PS): 0.705683008754233, 0.9047693438102975, 0.5134132854682382
Std R2 (JP, UP, PS): 0.031149342941849293, 0.020075046625489788, 0.0697866636573
Runtimes (JP, UP, PS): 0.0040133345, 0.0012714557, 0.0013807842


dims = 6, datSize = 10000, tries = 10, K = 40
Average R2 (JP, UP, PS): 0.6691678757584814, 0.6668479371933367, 0.446419879488767
Min R2 (JP, UP, PS): 0.6384621805248569, 0.6290361610115509, 0.3886342594748
Max R2 (JP, UP, PS): 0.7233691641137414, 0.7206500436162973, 0.4820233163693878
Std R2 (JP, UP, PS): 0.024554658037176734, 0.025042985137462854, 0.0298879741692
Runtimes (JP, UP, PS): 0.0042958492, 0.0019057585, 0.0013512523


dims = 6, datSize = 10000, tries = 10, K = 60
Average R2 (JP, UP, PS): 0.6724516559911595, 0.4126739976809491, 0.4200551696329
Min R2 (JP, UP, PS): 0.637267056603291, 0.3365587570037534, 0.3310201150892522
Max R2 (JP, UP, PS): 0.7149151980699677, 0.5100503693703589, 0.52570682393528
Std R2 (JP, UP, PS): 0.028782096306332716, 0.05028824102006239, 0.05149500981440
Runtimes (JP, UP, PS): 0.0041496395, 0.0020041858, 0.0013618287


dims = 6, datSize = 10000, tries = 10, K = 80
Average R2 (JP, UP, PS): 0.65584081344727, 0.29515611927357455, 0.41233948247854
Min R2 (JP, UP, PS): 0.5968616011274676, 0.21379146991240727, 0.3123298810182
Max R2 (JP, UP, PS): 0.7309102801389927, 0.3669113172531612, 0.4662013714724
Std R2 (JP, UP, PS): 0.03941897411416016, 0.05345849023716808, 0.04513484509915
Runtimes (JP, UP, PS): 0.0040625148, 0.0020822128, 0.0013345937


Summary and Future enhancements:

The ideal use case for UPROB (filter + JPROB) is higher-dimensional (dims >= 5) datasets. In such a scenario, we shave off a small number of variables with the help of filtration and apply JPROB to the rest, obtaining the best results compared to regular JPROB and ProbSpace.

So far, filtering does not seem very effective at lower dimensions, as plain JPROB and even ProbSpace already provide good accuracy in such settings. I will be checking whether this changes when I move on to varying the minPoints and maxPoints arguments for filtration.

I will also be looking into implementing filtration of 100% of the conditional variables and returning the single-dimensional RKHS result. However, I have low expectations for this method, as we have already established that heavier filtration produces diminishing returns.



