Why we shouldn’t be mentioning these catchy names.

    When a roaring crowd of aspiring data-satanists is about to run you over, shouting its new catchy word, be it ‘BERT’ or ‘Transformer’ or anything else for that matter, remember that “this too shall pass” and, please, don’t react… needless to say, if you join the crowd you are lost, perhaps forever. As Feynman used to say, by the accounts of his collaborators: “Don’t tell me what it is called, tell me what it means.”

What do these ‘new’ tricks in the machine processing of language really mean?

    Let’s put our ‘naked eye’ aside and put on our microscope glasses. What are the two things that have been brought to our attention?

1. The so-called ‘Transformer’ is just a ‘transmission channel’ in the sense of Shannon and Fano… or a simple chain of encoders and decoders with one or more intermediate states.

    The idea is relatively simple, and so many Ph.D. degrees have been milked out of it over the years that I won’t dwell on it here. It is worth discussing only in the context of choosing a ‘common’ language for human-machine interaction: one that would be equally meaningful and understandable to the machines and to the humans who will use it as a ‘common denominator’ for mutual ‘understanding’. The channel reading itself is plain enough, as the sketch below makes clear.
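    For the skeptical reader, here is a minimal sketch of that ‘channel’ reading, assuming PyTorch is available; the sizes, layer counts and random tensors are arbitrary illustration, nobody’s recommended configuration. The encoder compresses the source into an intermediate state (PyTorch even calls it ‘memory’), and the decoder reconstructs a target from that state alone. Transmitter, channel state, receiver.

```python
import torch
import torch.nn as nn

# Sketch: a Transformer as encoder -> intermediate state -> decoder.
# All dimensions and sequence lengths are arbitrary illustration.
d_model = 32
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, d_model)  # "transmitted" sequence: 10 steps, batch 1
tgt = torch.rand(7, 1, d_model)   # "received" sequence being generated

memory = model.encoder(src)       # the intermediate state of the channel
out = model.decoder(tgt, memory)  # the decoder reads only that state
print(memory.shape, out.shape)    # torch.Size([10, 1, 32]) torch.Size([7, 1, 32])
```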

2. The so-called ‘BERT’ is a probabilistic verification of multiple counterfactual hypotheses about a phrase.

    Yes. What exactly is this ‘masking’ of a word in a phrase? It is a ‘counterfactual’! A counterfactual in a meta-language, of course, but a counterfactual nonetheless.
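    Here is a minimal sketch of that counterfactual reading, assuming the Hugging Face transformers library and a stock bert-base-uncased checkpoint: strike out one word, and the model ranks the competing hypotheses for what could have stood in its place, each with a probability.

```python
from transformers import pipeline

# Pose the counterfactual: "had this position been unknown,
# which fillers would the rest of the phrase make probable?"
fill = pipeline("fill-mask", model="bert-base-uncased")

for hypothesis in fill("The cat sat on the [MASK]."):
    # Each candidate is one counterfactual hypothesis with its probability.
    print(f"{hypothesis['token_str']:>10}  p={hypothesis['score']:.3f}")
```

    Each printed line is one hypothesis weighed against the evidence of the surrounding words, which is all the ‘probabilistic verification’ above amounts to.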

Later.