What Gestalt law is being used on the main page of YouTube for video thumbnails?

I'm a bit confused about which law applies to the video thumbnails on YouTube's main page. The thumbnails are closer to each other than they are to the 4 lines of text below them, yet I perceive each image and its text as a group rather than seeing all the images together as a group. Shouldn't the elements that are closer together, in this case all the images in a horizontal row, be seen as a group? Also, if proximity isn't the prevailing principle here, what is?
