I assume there should be a choice of K (number of parts) - 3,4,5,6.
Should all the parts be the same size, or can they be different?
Yay, I'm glad you liked the idea, because from my research, this method is used across predictive models in the data science field.
About your question on the implementation ideas:
Yes, there should be a value of "K", which is the number of folds; each fold then covers TotalAmountOfData/K,
So for example, if we want 10 folds, the total amount of data is divided into 10 parts of 10% each out of the total 100% of the data,
ALL the parts should be the same size*.
(I was kinda drunk when writing the first post; I noticed I had some miscalculations there, with 4 folds and 20% which should have been 25%*, so sorry about that hehe...),
Let's make an example with 5 folds this time:
The user sets the K-Folds parameter to "5",
SQ will divide the data into 5 SIMILAR-SIZED folds (parts), (100%/5 = 20% per part):
Run 1 = [IS][IS][IS][IS][OOS]
Run 2 = [OOS][IS][IS][IS][IS]
Run 3 = [IS][OOS][IS][IS][IS]
Run 4 = [IS][IS][OOS][IS][IS]
Run 5 = [IS][IS][IS][OOS][IS]
Each run will also have some kind of score (maybe we can use R-squared?),
And if the average R-squared value over all the runs passes some threshold, we will consider that the strategy passed the K-Fold Cross Validation.
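The proposed pass/fail check can be sketched roughly like this. This is a minimal illustration only: the `score()` callback and the pass threshold are hypothetical placeholders for whatever metric SQ would actually compute (e.g. R-squared of the equity curve), not anything SQ implements today.

```python
def kfold_splits(n_bars, k):
    """Split bar indices 0..n_bars-1 into k similar-sized contiguous folds."""
    fold_size, rest = divmod(n_bars, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < rest else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

def kfold_validate(n_bars, k, score, threshold):
    """Run k times: train on k-1 folds (IS), score the held-out fold (OOS).

    The strategy 'passes' if the average OOS score reaches the threshold.
    `score` is a hypothetical callback standing in for SQ's own metric."""
    folds = kfold_splits(n_bars, k)
    scores = []
    for i, oos in enumerate(folds):
        is_part = [idx for j, f in enumerate(folds) if j != i for idx in f]
        scores.append(score(is_part, oos))
    avg = sum(scores) / len(scores)
    return avg >= threshold, avg

# Toy usage: a dummy score that just returns the OOS fold's share of the data.
passed, avg = kfold_validate(100, 5, lambda is_p, oos: len(oos) / 100, 0.1)
```

With K=5 and 100 bars, each run holds out one 20% fold, exactly matching the [IS]/[OOS] diagram above.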
More info about the benefits and the proper usage of this method can be found by searching for "K-Fold Cross Validation".
.. Because the Builder is somewhat of an optimization engine that tries different parameters to fit our Ranking Filters and then shows us the result as a strategy in our databank,
we can already implement some "forecasting procedure" validation into this so-called "optimization" process from the get-go, so that a strategy carries some predictive, forward-looking bias with it
BEFORE it enters our databank,
Hence, I found some good references:
This one explains why plain K-Fold cannot be used on time series, and describes other methods that are somewhat like walk-forward validations:
https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4
OneStepCross-Validation (looks like what we already have in WFM, but the IS part keeps the same starting period),
+
MultiStepCross-Validation (same as the above, but seems to be more future-predictive?)
https://www.youtube.com/watch?v=oGqsyv49Wvo
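Under my reading of the article, both schemes keep the IS window anchored at the first bar and walk the OOS window forward; they differ only in how many bars are validated per step. A minimal sketch (the function name and parameters are mine, purely for illustration):

```python
def anchored_splits(n_bars, initial_is, horizon):
    """Yield (is_end, oos_range) pairs for anchored time-series CV.

    IS is always bars [0, is_end) -- the start never moves -- and OOS is
    the next `horizon` bars, so no future data ever leaks into IS.
    horizon=1 gives one-step CV; horizon>1 gives multi-step CV."""
    splits = []
    is_end = initial_is
    while is_end + horizon <= n_bars:
        splits.append((is_end, range(is_end, is_end + horizon)))
        is_end += horizon
    return splits

# Toy data of 10 bars, with the first 6 bars as the initial IS window.
one_step   = anchored_splits(10, initial_is=6, horizon=1)  # 4 runs
multi_step = anchored_splits(10, initial_is=6, horizon=2)  # 2 runs
```

The one-step variant produces more (smaller) validation runs; the multi-step variant validates further ahead per run, which is presumably what makes it feel "more future-predictive".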
The K-Fold analysis is for training on IS and verifying on OOS. It is for checking the performance of the training process itself, i.e. an optimization process or machine-learning training process. So what training process are we really trying to analyze here, the builder itself? Or do you want to do a mini optimization within the building process? (If so, it will be way slower than 5x validations; it will be more tests per ITERATION, of course. Also, I think it would be comparable or even better to do a batch build first, then a k-fold analysis optimization task after.)
K-Fold would be awesome as a new task, or implemented into the optimization task, but why complicate the builder itself?
This method is used in data analysis of various things that are NOT time-series based,
Hence I changed the subject of the topic to the following,
OneStepCross-Validation (looks like what we already have in WFM, but the IS part keeps the same starting period), + MultiStepCross-Validation (same as the above, but seems to be more future-predictive?) https://www.youtube.com/watch?v=oGqsyv49Wvo
It would be an awesome feature to have inside SQ if the SQ builder engine had a robustness method within itself,
This would help us a bunch.
The 2 methods above are already kind of implemented in SQX, like the first one: OneStepCross-Validation,
and WF in general is the ultimate method in data analysis for checking whether models are overfitted or not,
If this method were available to us inside the first steps of strategy mining, it would save us a huge amount of work,
I'm not saying that SQ's dev team needs to change the whole damn thing; what I am saying is to give this to us as an option to tick with a "V" and use it,
and if a user doesn't want to use it, he can simply turn it off.
Yes, it's anchored, and we already have it in SQX. In SQX it is "floating/fixed" (select "fixed" to have an anchored IS start time).
I'm not saying that SQ's dev team needs to change the whole damn thing; what I am saying is to give this to us as an option to tick with a "V" and use it,
and if a user doesn't want to use it, he can simply turn it off.
To be clear, is this ticket now only about adding WF into the builder?
Hence I changed the subject of the topic to the following,
Nope, it still says k-fold in the subject, and the main text of this ticket still shows k-fold examples...
Subject changed from K-Fold Cross Validation to Builder's Cross Validation for less overfitting strategies from the get-go ?
To be clear, is this ticket now only about adding WF into the builder?
Yes, I guess so. The builder is an optimization engine already as it is, so why not just add some kind of validation method to make it robust from the get-go? At least as an option.
OK, but it looks like Clonex started working on the leakage issue to try and make k-fold useful:
https://analyticsindiamag.com/can-we-trust-k-fold-cross-validation-for-financial-modelling/
As for putting it into the builder: it seems like you'd just be removing CPU load from one part of your workflow and putting it into another part. It shouldn't necessarily be more efficient to do it at the beginning with the builder than at the end. It's preferable to filter on the quickest tests first so we have fewer strategies to run the longer tests on. Any kind of WF or k-fold analysis is a long test. The results are exactly the same if you do the long test first, except you've done more work.
The builder is an Optimization engine already as it is,
I don't think so, not really. The builder is not constrained to a single strategy like the optimizer; it's actually swapping new blocks in to make completely different strategies. In theory you could use a restricting template to build with an in-situ "analysis" (k-fold or other) to check the TEMPLATE performance itself, though... The building process itself, combined with the template, is what you'd be analyzing in that case. To analyze each strategy with WF or k-fold we need to do a full WF or k-fold on each strategy, and for that, as I pointed out above, it can and probably should be a different task.
https://roadmap.strategyquant.com/tasks/sq4_5699/edit
Another thing that i thought about is this:
For each strategy we find, we will optimize all the parameters of the strategy within ±X% of each parameter, in steps of ±Y%. So, for example, say the generation found a strategy with Bar Close > MA100, TP 200, SL 200. The method takes all 3 values available in this example (MA, TP, SL) and automatically checks whether the surrounding parameter values are robust. This will work separately for each parameter OR at once for all the parameters (by choosing to do so).
MA = 100, TP = 200, SL = 200.
Let's say we optimize all the parameters in this simple example with 10 steps each:
STEPS of optimization = 10
MAXimum optimization range = 25%
So MA 100 will be optimized as: (100 = 100%), (100*0.25 = 25), (25/10 = 2.5 per step).
TP 200 will be optimized as: (200 = 100%), (200*0.25 = 50), (50/10 = 5 per step).
SL 200 will be optimized as: (200 = 100%), (200*0.25 = 50), (50/10 = 5 per step).
If all steps surrounding the parameters pass our criteria, the strategy passes this validation method.
What do you think? Seems to be a simple one..
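The neighborhood sweep described above can be sketched like this. All names here (`neighborhood_values`, `is_robust`, the `evaluate` callback) are mine, purely illustrative; the real criterion would be whatever metric SQ evaluates on each backtest:

```python
def neighborhood_values(value, max_pct=0.25, steps=10):
    """Values around `value`, in `steps` steps out to ±max_pct.

    E.g. for MA=100: step = 100*0.25/10 = 2.5, giving 75.0 .. 125.0,
    matching the arithmetic in the example above."""
    step = value * max_pct / steps
    return [value + i * step for i in range(-steps, steps + 1)]

def is_robust(params, evaluate):
    """Vary each parameter separately (the 'one at a time' variant).

    The strategy passes only if every neighboring value of every
    parameter still satisfies the criterion."""
    for name, value in params.items():
        for v in neighborhood_values(value):
            trial = dict(params, **{name: v})
            if not evaluate(trial):
                return False
    return True

params = {"MA": 100, "TP": 200, "SL": 200}
# Toy criterion: accept anything within ±30% of the original values.
ok = is_robust(
    params,
    lambda p: all(abs(p[k] - params[k]) <= 0.3 * params[k] for k in p),
)
```

The "all parameters at once" variant would instead iterate over the cross-product of the neighborhoods, which grows as (2*steps+1)^N trials, so the separate-parameter mode is much cheaper.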
This is a new, more specific feature suggestion based on what I wrote above this msg, please vote:
https://roadmap.strategyquant.com/tasks/sq4_7901
Made this one: https://roadmap.strategyquant.com/tasks/sq4_7915/edit
Less work, because it's already implemented elsewhere inside SQ..
please vote
Subject changed from Builder's Cross Validation for less overfitting strategies from the get-go ? to DELETE
Description changed: