Are remote preference tests (A vs. B) always flawed because of the learning effect?

We're currently doing UX testing on a low budget. We have prototypes in Figma and run user tests in Maze.
One of the main goals right now is to compare two versions of a feature.

The problem?

[screenshot of the Maze testing tool]

None of the remote testing tools I've tried offer a way to randomize task order. So there will always be a strong learning effect for every participant, because everyone starts with A and then moves on to B.

Solution Ideas

So what can I do?

  1. Add a general "intro task" that lets the participant get to know the user interface & context
    (resulting in an unnecessarily long test)

  2. Create two user tests, handle the randomization myself, and send each person the appropriate link individually
    (resulting in annoying manual work & having to aggregate the results from 2 tests; a small self-hosted redirect could at least automate the random assignment, see the sketch after this list)

  3. Find a magic tool that somehow handles this randomization for me and gives me one single share link
    (still leaves the manual work of aggregating results)
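For ideas 2 and 3, one low-budget option is a tiny landing page you host yourself: it flips a coin and forwards each participant to one of the two Maze tests, so you only ever share one link. This is just a sketch; the two URLs are placeholders you'd replace with your real Maze share links.

```ts
// Minimal sketch: a single landing page that randomly sends each
// participant to test A or test B. Placeholder URLs, not real Maze links.
const MAZE_TEST_A = "https://t.maze.co/your-test-a"; // replace with your link for version A
const MAZE_TEST_B = "https://t.maze.co/your-test-b"; // replace with your link for version B

// Simple 50/50 coin flip. For strict counterbalancing (exactly equal group
// sizes) you'd need to track assignments somewhere, e.g. in a small backend.
const target = Math.random() < 0.5 ? MAZE_TEST_A : MAZE_TEST_B;

// Forward the participant; replace() keeps the landing page out of their history.
window.location.replace(target);
```

The URL of that page becomes the single link you share. Note that a pure coin flip randomizes rather than counterbalances, so with small samples the two groups can end up uneven, and you still have to merge the results from the two Maze tests afterwards.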

Does anyone have experience with this or has solved this problem before?