Week 10: Implementing UPROB
Week 10 Report
This week, I developed the UPROB module (you can check out the code here). Since this was the main focus of the week and I spent all my time developing, bug-fixing, testing, and documenting results, a day-wise summary would not be the most effective way to present it. Instead, I have opted to explain the functionality, workings, and usage of the module, which will also serve as documentation for future reference. The results and conclusions are listed towards the end of this report.
UPROB Init:
Initializing a UPROB object is very similar to initializing an rkhsMV object, except for the final argument k, which denotes the percentage of conditional variables to be filtered out before applying JPROB methods to the remaining ones.
R1 and R2 are two separate rkhsMV instances, one for each of the UPROB methods, condE() and condP(). The idea is to store R1 and R2 so that if we repeatedly perform operations on the same dataset, we do not need to initialize a new kernel each time, saving time. However, this runs into a problem when filtering is involved, which I will expand upon in a later section.
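A minimal sketch of what initialization might look like. The rkhsMV constructor signature (a dataset plus variable names) and the attribute names below are assumptions for illustration; only the trailing k argument is the UPROB-specific addition described above.

```python
# Hypothetical sketch of UPROB initialization; the constructor signature
# and attribute names are assumed, not the module's actual code.
class UPROB:
    def __init__(self, data, varNames, k=0):
        self.data = data          # list of data points (tuples/lists)
        self.varNames = varNames  # e.g. ['A', 'B', 'C', 'D', 'E']
        self.k = k                # % of conditional variables to filter
        # Cache one rkhsMV kernel per method so that repeated queries on
        # the same dataset need not rebuild the kernel (but see the
        # caveat on filtering below).
        self.R1 = None            # kernel reused by condE()
        self.R2 = None            # kernel reused by condP()
```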
Usage:
Suppose the dataset we are dealing with has 5 dimensions. This means there is 1 target variable and 4 conditional variables.
The dataset is of the form ['A', 'B', 'C', 'D', 'E'], and we are calculating E(A|B,C,D,E).
Suppose K = 25
This means that 25% of the conditional variables are to be filtered out.
∴ 25% of 4 = 1 variable.
From the way I have configured the filtration process, variables are filtered from the end, i.e.:
In this example, variable E is filtered out from the dataset and we are left calculating E(A|B,C,D) using JPROB.
Similarly, if K = 75, we will filter out 3 variables (C, D, E) and calculate just E(A|B).
And naturally, if K = 0, no variables are filtered and we simply apply JPROB to calculate E(A|B,C,D,E). I will explain the filtration process itself in the next section.
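As a quick illustration of the filter-from-the-end rule, here is a tiny Python sketch; the helper name is purely hypothetical, and the floor() used for the count is covered in the next section.

```python
from math import floor

def remainingCondVars(condVars, K):
    # Hypothetical helper, not UPROB's actual code: drop
    # floor(K% of the conditional variables) from the end of the list.
    nFilter = floor(K / 100 * len(condVars))
    return condVars[:len(condVars) - nFilter]

cond = ['B', 'C', 'D', 'E']
print(remainingCondVars(cond, 0))   # ['B', 'C', 'D', 'E'] -> E(A|B,C,D,E)
print(remainingCondVars(cond, 25))  # ['B', 'C', 'D']      -> E(A|B,C,D)
print(remainingCondVars(cond, 75))  # ['B']                -> E(A|B)
```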
UPROB methods:
- condP():
rkhsMV.condP() takes just one argument, a vector X = [x1, x2, ..., xn], such that we calculate P(A=x1 | B=x2, C=x3, ...). UPROB.condP() takes 3 additional arguments: K (the same parameter explained in the previous section), plus minPoints and maxPoints, which are used during the filtration process.
The number of variables to filter is calculated from K as follows:
Suppose the number of dimensions = 5 (meaning 4 conditional variables) and K = 30.
30% of 4 = 1.2
In this case, we apply the floor() function, i.e., floor(1.2) = 1, and hence we filter 1 variable.
This means that providing K = 99 is generally a surefire way to filter all but one variable; for example, 99% of 4 = 3.96, and floor(3.96) = 3.
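In plain arithmetic (a sketch, not the module's actual code), the count works out as:

```python
from math import floor

numCondVars = 4  # dims = 5 -> 4 conditional variables
for K in (25, 30, 99):
    nFilter = floor(K / 100 * numCondVars)
    print(f"K={K}: filter {nFilter} variable(s)")
# K=25: filter 1 variable(s)   (25% of 4 = 1.0)
# K=30: filter 1 variable(s)   (floor(1.2) = 1)
# K=99: filter 3 variable(s)   (floor(3.96) = 3, i.e. all but one)
```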
The problem with filtration:
Earlier, we mentioned that caching our rkhs kernel is a problem when we use filtration. This is because when we filter data points, the set of filtered data points we obtain depends on the values of the conditional variables we are filtering on. A simplified example:
Dims = 5, K = 50
Suppose we want to find
P(A=1 | B=2, C=-1, D=5, E=6) and P(A=1 | B=2, C=-1, D=5, E=10)
corresponding to 2 data points, (1,2,-1,5,6) and (1,2,-1,5,10)
The only difference between these two data points is the value of variable E. However, we cannot use the same instance of R1 to calculate condP() for both, because:
We filter 50%, i.e., 2 variables (D and E) from the dataset. For the first data point, filtering yields the set of data points satisfying the condition D=5, E=6.
However, when evaluating the second data point, we need data points filtered on the condition D=5, E=10, which is not the same set as the one obtained previously.
Hence, unfortunately, caching is not an option, and the rkhs instance needs to be reinitialized for every data point whenever filtration is used.
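A toy illustration of the point, using exact matching on the filtered variables (the real filtration presumably matches approximately and honors minPoints/maxPoints, so treat this purely as a sketch):

```python
data = [(1, 2, -1, 5, 6), (0, 3, 0, 5, 6), (1, 2, -1, 5, 10)]
q1 = (1, 2, -1, 5, 6)   # query with D=5, E=6
q2 = (1, 2, -1, 5, 10)  # query with D=5, E=10

# Filtering the last 2 variables (D and E) keeps only the data points
# that agree with the query on those variables:
subset1 = [p for p in data if p[3:] == q1[3:]]
subset2 = [p for p in data if p[3:] == q2[3:]]
print(subset1)  # [(1, 2, -1, 5, 6), (0, 3, 0, 5, 6)]
print(subset2)  # [(1, 2, -1, 5, 10)]

# subset1 != subset2: a kernel built on subset1 cannot answer q2, so the
# rkhs instance must be rebuilt for every query point under filtration.
```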
- condE():
Similar to condP(), condE() has the same interface as rkhsMV.condE(), with 3 additional arguments: K, minPoints and maxPoints.
This method also runs into the same caching problem when filtration is involved.
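Putting the two methods side by side, the call shapes would look roughly like this. The argument names are taken from the text, but the exact signatures and values are assumptions (in particular, condE() taking only the conditional values mirrors rkhsMV, and the minPoints/maxPoints values are placeholders):

```python
# u = UPROB(data, varNames, k=25)        # as sketched earlier
#
# P(A=x1 | B=x2, C=x3, D=x4, E=x5):
# p = u.condP([x1, x2, x3, x4, x5], K=25, minPoints=100, maxPoints=1000)
#
# E(A | B=x2, C=x3, D=x4, E=x5), with condE() assumed to take only the
# conditional values:
# e = u.condE([x2, x3, x4, x5], K=25, minPoints=100, maxPoints=1000)
```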
Testing and Results:
The testing and comparison scripts I used for the condP() and condE() methods can be found here and here respectively.
There are quite a few factors to vary and compare here:
Number of Datapoints
Dimensions (Number of conditional variables)
K (% of variables to filter)
And any combination of the above. This is also excluding the minPoints and maxPoints arguments used during filtration. The R2 metric used for the comparisons is sketched below.
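Since the scripts are linked rather than reproduced here, this is a minimal sketch of the metric only, assuming R2 is the standard coefficient of determination (consistent with the negative ProbSpace values below) computed with NumPy:

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot. Negative values
    # (as in some ProbSpace rows below) mean the estimator does worse
    # than simply predicting the mean of the targets.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```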
However, here are some observed results upon extensive testing:
UPROB is significantly better than JPROB and ProbSpace at higher numbers of data points.
For example, here are the results of varying the data size for dims = 6 and K = 25 (R2 values rounded to 4 decimal places, runtimes to 5):
dims = 6, datSize = 1000, tries = 10, K = 25

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.5055  | 0.6849  | -0.0517   |
| Min R2     | 0.4184  | 0.5777  | -0.1919   |
| Max R2     | 0.5864  | 0.7242  | 0.0764    |
| Std R2     | 0.0577  | 0.0393  | 0.0858    |
| Runtime    | 0.00438 | 0.00277 | 0.00219   |

dims = 6, datSize = 10000, tries = 10, K = 25

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.6641  | 0.8921  | 0.4314    |
| Min R2     | 0.6133  | 0.8670  | 0.3402    |
| Max R2     | 0.7259  | 0.9139  | 0.4861    |
| Std R2     | 0.0307  | 0.0125  | 0.0504    |
| Runtime    | 0.00495 | 0.00165 | 0.00168   |

dims = 6, datSize = 100000, tries = 3, K = 25

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.7519  | 0.9322  | 0.6276    |
| Min R2     | 0.7413  | 0.9255  | 0.6059    |
| Max R2     | 0.7626  | 0.9397  | 0.6553    |
| Std R2     | 0.0087  | 0.0058  | 0.0206    |
| Runtime    | 0.00440 | 0.00142 | 0.00081   |
Filtering is more useful at higher dimensions.
Here is an example of varying the dimensions for datSize = 1000 and K = 50 or 30 (in each case filtering exactly 1 variable). We observe that UPROB only beats JPROB at higher dimensions, implying that filtering is not useful at lower dimensions.
dims = 3, datSize = 1000, tries = 10, K = 50

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.9310  | 0.8781  | 0.8071    |
| Min R2     | 0.8904  | 0.7892  | 0.7509    |
| Max R2     | 0.9580  | 0.9228  | 0.8832    |
| Std R2     | 0.0189  | 0.0431  | 0.0407    |
| Runtime    | 0.00406 | 0.00259 | 0.00142   |

dims = 4, datSize = 1000, tries = 10, K = 50

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.9134  | 0.8265  | 0.6509    |
| Min R2     | 0.8924  | 0.7551  | 0.5711    |
| Max R2     | 0.9378  | 0.9088  | 0.7088    |
| Std R2     | 0.0158  | 0.0414  | 0.0399    |
| Runtime    | 0.00410 | 0.00247 | 0.00136   |

dims = 5, datSize = 1000, tries = 10, K = 30

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.7922  | 0.6966  | 0.3587    |
| Min R2     | 0.7595  | 0.6237  | 0.2684    |
| Max R2     | 0.8502  | 0.7671  | 0.4942    |
| Std R2     | 0.0270  | 0.0528  | 0.0577    |
| Runtime    | 0.00398 | 0.00243 | 0.00162   |

dims = 6, datSize = 1000, tries = 10, K = 30

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.5022  | 0.7007  | -0.0884   |
| Min R2     | 0.3880  | 0.6608  | -0.2881   |
| Max R2     | 0.6189  | 0.7516  | 0.0400    |
| Std R2     | 0.0642  | 0.0265  | 0.0969    |
| Runtime    | 0.00441 | 0.00282 | 0.00216   |
Increasing K (filtration) leads to diminishing returns.
Here is an example of increasing the K value for fixed datSize = 10,000 and dims = 6:
dims = 6, datSize = 10000, tries = 10, K = 20

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.6517  | 0.8778  | 0.4113    |
| Min R2     | 0.6057  | 0.8502  | 0.2584    |
| Max R2     | 0.7057  | 0.9048  | 0.5134    |
| Std R2     | 0.0311  | 0.0201  | 0.0698    |
| Runtime    | 0.00401 | 0.00127 | 0.00138   |

dims = 6, datSize = 10000, tries = 10, K = 40

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.6692  | 0.6668  | 0.4464    |
| Min R2     | 0.6385  | 0.6290  | 0.3886    |
| Max R2     | 0.7234  | 0.7207  | 0.4820    |
| Std R2     | 0.0246  | 0.0250  | 0.0299    |
| Runtime    | 0.00430 | 0.00191 | 0.00135   |

dims = 6, datSize = 10000, tries = 10, K = 60

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.6725  | 0.4127  | 0.4201    |
| Min R2     | 0.6373  | 0.3366  | 0.3310    |
| Max R2     | 0.7149  | 0.5101  | 0.5257    |
| Std R2     | 0.0288  | 0.0503  | 0.0515    |
| Runtime    | 0.00415 | 0.00200 | 0.00136   |

dims = 6, datSize = 10000, tries = 10, K = 80

|            | JPROB   | UPROB   | ProbSpace |
|------------|---------|---------|-----------|
| Average R2 | 0.6558  | 0.2952  | 0.4123    |
| Min R2     | 0.5969  | 0.2138  | 0.3123    |
| Max R2     | 0.7309  | 0.3669  | 0.4662    |
| Std R2     | 0.0394  | 0.0535  | 0.0451    |
| Runtime    | 0.00406 | 0.00208 | 0.00133   |
Conclusion:
Across the factors varied (number of data points, dimensionality, and K), the key takeaways are:
- UPROB is significantly better than JPROB and ProbSpace at higher numbers of data points.
- Filtering is more useful at higher dimensions.
- Increasing K (filtration) leads to diminishing returns.