Wouldn’t their patch embeddings return different results depending on where the patch boundaries fall? They don’t appear to use any overlap redundancy; that makes it significantly less resource-intensive, but surely it correspondingly raises the chance of losing significant signals in the image-to-text translation?
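For concreteness, a minimal sketch of what non-overlapping patchification typically looks like in a ViT-style encoder; the layer and sizes here are illustrative assumptions, not taken from their model:

```python
import torch
import torch.nn as nn

# Illustrative ViT-style patch embedding: a conv whose stride equals its
# kernel size, so patches tile the image with no overlap. All sizes are
# assumptions for the sketch, not values from the model under discussion.
patch_size = 16
embed_dim = 768
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)           # one RGB image
tokens = patchify(img)                      # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768) patch tokens

# A feature straddling a patch boundary (e.g. around pixel column 16) gets
# split across two tokens. With stride < kernel_size the patches would
# overlap and each boundary feature would be seen whole by some token,
# at the cost of more tokens and more compute.
print(tokens.shape)  # torch.Size([1, 196, 768])
```

With `stride == kernel_size` no pixel is seen twice, which is where the compute savings come from, and also why boundary placement can matter.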
Good question, not sure how they account for that. Maybe there’s a higher-level layer responsible for handling the boundaries?