As my friend wrote to me, "Amid fan outrage, the writer, Dan Slott, actually started hitting up message boards to claim that his intent was to imply a pre-wedding kiss... most fans have assumed that much more went on. Who is right?"
The short answer is: what happens is entirely ambiguous. This is exactly the same as the example McCloud shows with the guy with the axe (right). The unseen actions need to be filled in by your mind given the information you are provided. In both cases, you are given preparatory actions, and then shown something different, with only a word balloon to connect to the preparatory action.
The key is that your mind tries to figure out what happened because of the associated word balloon and because the second panel doesn't show the action. It's worth emphasizing that, contrary to McCloud's claims, the "filling in of the information" does not happen between panels, but within that underspecified second panel. In both cases, the authors choose not to show the action.
In the Spider-Man example, the first panel is also slightly underspecified. It hints at Aunt May about to kiss, but the action isn't defined well enough. The "No Stop!" balloon reads as anticipatory, suggesting that something is about to happen. The most direct answer then would be a kiss in the second panel, because it's set up directly in the previous panel (just like the axe sets up chopping in the second panel).
However, there is nothing to prevent other interpretations, since the first panel's "preparatory kiss" wasn't drawn that clearly and the second panel is entirely ambiguous: all you see is a door, and all cues come from the word balloon. The question then becomes how strongly the word balloon "Ahhh!" is associated with just kissing or with something else.
(It is also worth mentioning the implied duration of the action. Kisses can be single, punctual actions, or can be drawn out (as can...ahem... other actions). Showing a door doesn't give any further information about how much time passes in the hidden event; this too is only suggested by the word balloon.)
So... from a structural perspective, there is no real "right answer." It is a (likely intentionally) suggestive and ambiguous depiction, but it nicely plays with properties of the visual language grammar to elicit varying interpretations.
For more information on these types of concerns, see my paper "The limits of time and transitions" (pdf here).




