So other ways of combining larger numbers of microbatch gradients in an effective, consistent manner for performing an update are one area of potential future work. I think your idea is an interesting way to approach it, though there are a bunch of potentially effective ways of doing it.
I think the idea of averaging the k models afterwards, though, is at odds with the core concept of gradient agreement filtering, because you're back to combining two distinct directions of improvement without a guarantee that the combination is better (even though it does seem to be in practice). The core idea is that you philosophically only want to learn the patterns that agree across multiple specific examples, and to build some algorithmic protections to ensure that is happening. Just averaging, while it might work and even yield improvement, doesn't necessarily lead to properly generalized learning.
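To make the "only learn patterns that agree" idea concrete, here is a minimal sketch of an agreement-filtered update between two microbatch gradients. This is a hypothetical illustration, not the exact method from any particular implementation: it assumes agreement is measured by cosine distance, with a made-up function name `gaf_update` and threshold `max_cos_distance`, and it skips the update entirely when the two gradients disagree rather than averaging them anyway.

```python
import numpy as np

def gaf_update(grad_a, grad_b, max_cos_distance=0.03):
    """Return an averaged gradient only when the two microbatch
    gradients agree; return None (no update) when they disagree.

    Hypothetical sketch: agreement is defined as the cosine distance
    between the two gradient vectors being below max_cos_distance.
    """
    # Cosine similarity between the flattened gradient vectors;
    # small epsilon guards against division by zero.
    cos_sim = np.dot(grad_a, grad_b) / (
        np.linalg.norm(grad_a) * np.linalg.norm(grad_b) + 1e-12
    )
    cos_dist = 1.0 - cos_sim
    if cos_dist <= max_cos_distance:
        # The gradients point in (nearly) the same direction:
        # this is a pattern both microbatches agree on, so keep it.
        return (grad_a + grad_b) / 2.0
    # Disagreement: learn nothing from this pair rather than
    # averaging two conflicting directions of improvement.
    return None
```

The contrast with post-hoc model averaging is the `return None` branch: averaging k models always commits to a combined direction, while filtering refuses the update when the constituent gradients conflict.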