Introducing LeX-O: The Robot Agent That Challenges You to Tic Tac Toe

Insights from the LeRobot Hackathon, My Experiments, and the Tech that Drives it

By Aadarsh Ramachandran

Ever since I got obsessed with embodied and physical AI back in December, I’ve been closely following the latest work and the big players— Stanford, Berkeley, Physical Intelligence, and, not least, the open-source Hugging Face LeRobot team.

LeRobot is honestly a gem. From training and fine-tuning to evaluation and deployment on the SO-101 robot arm, they’ve built a powerful, accessible codebase, even reimplementing SOTA models. Their blogs, experiments, and Twitter updates have become part of my everyday scroll, giving me quick bites of insight and inspiration.

So, when they announced the LeRobot Worldwide Hackathon, I knew I had to jump in.

This hackathon was a 48-hour sprint happening simultaneously in cities across the world, bringing together AI + robotics enthusiasts to build, experiment, and just have fun. Perfect timing— because I was already in Bangalore, and found that LossFunk was hosting the India chapter.

Now, I didn’t have a robot arm. Or much hands-on robotics experience, honestly. But that didn’t stop me. I found a team online— Dhruv Dange and Ayush Sharan— who did have a setup. We hit it off right away. Fun fact: they’re both incoming CMU grads and some of the most humble, curious minds I’ve met.

June 14. The hackathon began with everyone just walking around, chatting, sharing wild ideas. The energy was electric. When you pack a room with people who are all passionate about building, things get chaotic— but magical. As complete newcomers, we were way too ambitious at the start. Our plan? Building a robot to solve the shape-sorting cube. We thought of using imitation learning (IL) with RL fine-tuning, curriculum learning, and whatnot. In hindsight, it’s hilarious, because just a few hours in, we hit the harsh reality: even basic pick-and-place is hard, especially when you’re bottlenecked by data, time, and compute.


It’s something I still want to try later, but for the hackathon, we pivoted.

We asked ourselves: What’s a minimal, fun challenge that fits our constraints, lets us explore perception and planning, and still feels interactive?

That’s how we came up with LeX-O—a robot arm that plays Tic Tac Toe against a human.

It didn’t require a huge dataset or complex models, and we could decouple the “intelligence” from low-level control, making the problem far more tractable within a 48-hour window.

After a few pizzas, cokes, and a sleepless night, we finally had a somewhat working LeX-O!

Here’s a quick demo (at 4× speed) of LeX-O in action:

Link

In this post, I’ll dive into the tech behind it, our approach, and the key lessons we learned.

Building LeX-O

Tic Tac Toe is a simple yet strategic game. Two players take turns placing their tokens (X or O) on a 3x3 grid, aiming to align three of them in a row— horizontally, vertically, or diagonally. Winning requires foresight and optimal decision-making.

Translating this into a robotic task involves two key capabilities:

  1. Deciding an optimal move based on the current board state after the opponent’s move.
  2. Physically executing a pick-and-place to put the token in the chosen grid cell.

This process repeats until one player wins.

On a physical robot, implementing this means getting the robot to:

  • Perceive the current state of the board,
  • Reason about the optimal move,
  • And physically execute that move.

One obvious option is to use a solver and script the robot’s moves, but our goal was to explore learning-based methods such as imitation learning, which can also generalize and demonstrate robustness. A more interesting direction would be to build a sort of robot brain that sees the board, understands the task, reasons about its next optimal move, and directly generates actions.
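For context on the scripted baseline we decided against: an optimal Tic Tac Toe player fits in a few lines of minimax. A minimal sketch (the 0–8 row-major cell numbering is my own convention):

```python
# Minimal minimax solver for Tic Tac Toe (the scripted baseline we skipped).
# Board: list of 9 cells, each "X", "O", or None; cells numbered 0-8 row-major.
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player, me):
    """Return (score, move) for `player` to move, from `me`'s perspective."""
    w = winner(board)
    if w:
        return (1 if w == me else -1), None
    empty = [i for i, v in enumerate(board) if v is None]
    if not empty:
        return 0, None  # draw
    scores = []
    for i in empty:
        board[i] = player
        s, _ = minimax(board, "O" if player == "X" else "X", me)
        board[i] = None
        scores.append((s, i))
    best = max(scores) if player == me else min(scores)
    return best[0], best[1]

def best_move(board, player):
    return minimax(board, player, player)[1]
```

For example, with O to move on `["X","X",None,"O",None,None,None,None,"O"]`, `best_move` picks cell 2: every other move lets X win immediately.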

With the rise of Vision-Language-Action (VLA) models, this isn’t as far-fetched as it sounds. You can imagine prompting a model with an input like:

“Here’s the board state. It’s your turn to play X. Where would you put it?”

The model would then reason in natural language via chain-of-thought, choosing a cell (e.g., “Bottom-right corner” or “Grid Cell 5”), and output a corresponding sequence of motor actions to achieve the goal. However, doing this properly would require:

  • Fine-tuning a large enough pre-trained model (3–7B parameters),
  • Collecting a large, diverse dataset (diverse board states, reasoning steps, and action sequences),
  • And having significant compute (long hours on at least two A100s).

For a weekend hackathon, that wasn’t feasible.

Our Approach: Hierarchical Reasoning + Simple Control

Instead, we chose to offload the high-level reasoning to a pre-trained LLM and learn an action policy for low-level control.

This follows a System 1–System 2 or hierarchical control paradigm:

  • System 2 (LLM): the slow, deliberate planner that decides where to place the coin, given the board state.
  • System 1 (policy model): the fast, reactive controller that learns to pick and place the coin at that location.

This modularity made the problem tractable— we didn’t need to worry about the “thinking” part. We just had to train a pick-and-place policy.
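In code, the loop looks roughly like this (a sketch with placeholder functions, not our actual APIs: the real planner queries a pre-trained LLM with the board state, and the real controller runs the trained pick-and-place policy):

```python
# Sketch of the hierarchical loop. Function names are placeholders:
# the real planner queries an LLM, the real controller runs the learned policy.
def choose_cell(board):
    """Planner stand-in: here, just the first empty cell."""
    return next(i for i, v in enumerate(board) if v is None)

def execute_pick_place(cell):
    """Controller stand-in: the learned policy would move the arm here."""
    return f"place token at cell {cell}"

def robot_turn(board, token="X"):
    cell = choose_cell(board)     # high-level reasoning
    execute_pick_place(cell)      # low-level control
    new_board = list(board)
    new_board[cell] = token       # update the board-state estimate
    return new_board
```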

The Policy: Action Chunking Transformer (ACT)

For the low-level policy, we chose the Action Chunking Transformer (ACT). It’s a lightweight imitation learning method that works well for robotic tasks, especially when you:

  • Have limited data,
  • Want fast convergence,
  • And care about smooth motion.

ACT extends behavior cloning using action chunking and temporal ensembling to reduce compounding errors and jerky behavior.
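The temporal ensembling step can be sketched in a few lines: at each timestep, the overlapping predictions from past chunks are averaged with exponential weights w_i = exp(−m·i), as in the ACT paper, with the oldest prediction weighted highest (the value of `m` here is illustrative):

```python
import math

# Temporal ensembling as in ACT: at each timestep, several previously
# predicted chunks overlap the current step; average their predictions
# with exponential weights w_i = exp(-m * i).
def ensemble_action(predictions, m=0.1):
    """predictions: actions for the current timestep, ordered oldest chunk
    first, so i = 0 (the oldest prediction) gets the highest weight."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, predictions)) / total
```

With `m = 0` this is a plain average; larger `m` trusts older (more settled) predictions more, which is what smooths out the jerky chunk-boundary behavior.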

How ACT Works

  • Inputs:
    • Current robot joint states,
    • Image observations,
    • (In our extension) task instruction embeddings.
  • Outputs:
    • Predicted action chunks (i.e., trajectories over a few timesteps).

ACT uses a Conditional Variational Autoencoder (CVAE) architecture. During training:

  • It encodes the demonstrated action sequence (plus joint states) into a style latent z,
  • Then decodes the predicted action chunk conditioned on this latent, the joint states, and the image observations.

At test time, we simply set z = 0, discarding the encoder.
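A toy sketch of this latent handling (all dimensions and weights here are made up, the real model uses transformer encoders/decoders rather than single linear maps, and image features are omitted for brevity):

```python
import numpy as np

# Toy sketch of ACT's CVAE latent handling. Dimensions/weights are made up;
# the real model uses transformers, not single linear maps.
rng = np.random.default_rng(0)
ACT_DIM, CHUNK, JOINTS, Z_DIM = 6, 10, 6, 4

W_enc = rng.standard_normal((CHUNK * ACT_DIM + JOINTS, 2 * Z_DIM)) * 0.1
W_dec = rng.standard_normal((Z_DIM + JOINTS, CHUNK * ACT_DIM)) * 0.1

def encode(action_chunk, joints):
    """Training-time encoder: demo action chunk + joints -> Gaussian over z."""
    h = np.concatenate([action_chunk.ravel(), joints]) @ W_enc
    return h[:Z_DIM], h[Z_DIM:]          # mu, log-variance

def decode(z, joints):
    """Decoder: style latent z + current observations -> action chunk."""
    return (np.concatenate([z, joints]) @ W_dec).reshape(CHUNK, ACT_DIM)

def training_forward(action_chunk, joints):
    mu, logvar = encode(action_chunk, joints)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(Z_DIM)  # reparam trick
    return decode(z, joints)

def inference(joints):
    return decode(np.zeros(Z_DIM), joints)   # z = 0: the mean of the prior
```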


Multi-Task Learning via Task Instructions

By default, ACT only learns one task. But Tic Tac Toe requires nine different pick-and-place positions. Fortunately, these are all structurally similar— just different spatial coordinates.

To generalize across these, we:

  • Augmented ACT with task instruction embeddings (e.g., “Place at grid cell 9”).
  • This allowed the model to learn a conditional policy across all 9 grid positions.

It worked surprisingly well! Likely because the tasks are low variance— just different poses for the same action.
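Concretely, the conditioning can be sketched like this (our own extension to ACT; names and dimensions here are illustrative, not the actual code):

```python
import numpy as np

# Sketch of our task-instruction conditioning (illustrative, not actual code).
# Each of the 9 grid cells gets a learned embedding vector that is appended
# to the policy's observation features.
rng = np.random.default_rng(1)
EMB_DIM, OBS_DIM = 16, 32
task_embeddings = rng.standard_normal((9, EMB_DIM))  # one row per grid cell

def condition(obs_features, grid_cell):
    """grid_cell in 0..8; e.g. "Place at grid cell 9" maps to index 8."""
    return np.concatenate([obs_features, task_embeddings[grid_cell]])
```

In training, the embedding table is learned jointly with the policy, so the same network can specialize its output on the instruction alone.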

⚠️ Note: This would likely not work as smoothly for very different tasks (e.g., pouring vs. stacking). Multi-Task ACT handles that case with a mixture of action-expert embeddings, but for our task it would’ve been overkill.

Experiment Specifications

  • Collected a dataset of 10 demonstrations per grid position, randomly configuring the board state each time (to learn robustness to board-state variations), and used black and white carrom coins as the X and O tokens.
  • Used an RTX 5090 GPU.
  • Trained for 160 minutes (25k steps, 40 epochs).
  • Default ACT config parameters as used by LeRobot.
  • Dataset: link, Model: link, Code: link

Notes:

  • 25k steps might’ve been overkill; a policy trained for 15k steps was just as good.
  • We were able to increase the batch size to 64, which reduced the number of training steps required.

Peeking Inside the Model: Interpreting ACT

The robot could now play Tic Tac Toe. But I wanted to understand what it was actually learning.

Ville Kuosmanen has this excellent Physical AI Interpretability Toolkit to visualize attention over the inputs.

We integrated the tool into LeRobot and applied it to the input observations (extending it in our code to support task instructions).

What did we find?

  • High attention to task instructions ✅ (makes sense—it tells the robot where to place the coin).
  • High attention to joint states ✅ (used for motor planning).
  • Almost no attention to vision ❌ (concerning…).

Attention visualization video link

This suggests that the model ignored the image input, instead memorizing trajectories conditioned on task and joint states. It makes sense:

  • Our dataset was small and not very visually diverse,
  • The joint states and task ID alone were enough to succeed.

We didn’t randomize the table layout, background, or coin positions within the cells— only the board state configuration. This made it easy and sufficient to learn from the joint states and task instruction alone, so the model neglected the vision input.

On Attention Map Visualization

One issue: most attention maps for vision were flat (blue). It turns out this was due to the global min-max normalization used by the visualization tool: a few large attention values on other inputs squash the vision attention toward zero.

As a quick hack, I tried scaling the attention values up (100×) to look for any fine-grained pattern. It revealed faint distinctions, but overall the maps were still quite noisy and uninformative.
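The effect is easy to reproduce in a few lines (a sketch, not the toolkit’s actual code; the numbers are made up):

```python
import numpy as np

# Why the vision maps looked flat: global min-max normalization shares one
# scale across all inputs, so a few large attention values (here, on joint
# states) squash the small vision values toward zero.
def normalize(attn, lo, hi):
    return (attn - lo) / (hi - lo + 1e-8)

vision_attn = np.array([0.001, 0.003, 0.002])  # weak attention on vision
joints_attn = np.array([0.80, 0.95])           # strong attention on joints
lo, hi = 0.0, float(joints_attn.max())

flat = normalize(vision_attn, lo, hi)             # ~0 everywhere: renders blue
scaled = normalize(100 * vision_attn, lo, hi)     # the quick 100x hack
per_frame = normalize(vision_attn, vision_attn.min(), vision_attn.max())
```

Normalizing each modality (or frame) separately would surface the faint structure without an arbitrary 100× factor.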

100x scaled attention map video

  • Well, as a first-timer, I’m okay with this. I believe such failures teach you a lot more.
  • For a better analysis, we should probably train on larger and more diverse data.
  • Ideally, the visual attention would start spread out and then converge onto task-relevant regions as training progresses.

Takeaways

For a 48-hour hackathon, building a working Tic Tac Toe robot was a win. But more importantly, we:

  • Built and deployed our first model on a real robot arm, hands-on.
  • Studied ACT in depth and modified it to accept an additional task-instruction input.
  • Gained insights and learned lessons from training, inference, and post-hoc analysis.

For next time, I’d:

  • Collect more diverse data for better generalization and robustness.
  • Try out the shape-sorting idea.
  • Fine-tune and test this on a VLA (perhaps SmolVLA).

But for now—LeX-O plays, thinks (a bit), and picks its spot like a champ.

Update: I’ve bought my own LeRobot SO-101 arm for a project I lead in our student robotics club. Looking forward to more fun stuff.

Onward and upward 🚀
