
Evaluation Metrics

This page describes our metrics. All evaluations in this system are done using gpt-4o-mini for speed, with the agent argument enabled for testing reasoning abilities. The results can be found here

We have a notebook demonstrating our evaluation process. We have uploaded our trial runs of the system to Google Drive for anyone who may be interested in looking at them (details on using the data-fetching function are in the evaluation notebook).

Experiment Configurations

We configured our experiment as follows (a minimal sketch of the trial loop follows the note below):

  • Setting seed = iteration_num (i.e. 1, 2, 3, ...)
  • Running 20 times for each system, with num_box_upperbound set to 3.
  • Neglecting rounds with empty actions.
  • A trial automatically ends after 50 environmental steps.

*Note that in some final rounds of moving boxes, the system does not converge and returns syntactically invalid responses. This is likely due to the spy's misleading information given to the central agent, resulting in non-convergent rounds (within 50 env-steps), whereas later rounds with fewer boxes run very fast (due to the configuration of the system; refer here).
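The sketch below shows how a single trial under this configuration might be driven. The names RPLHEnvironment and run_rplh_step are hypothetical stand-ins for the actual system API, not the project's real module names; the constants mirror the configuration listed above.

```python
# Hedged sketch of the trial loop; RPLHEnvironment and run_rplh_step are
# hypothetical placeholders for the actual RPLH interfaces.
MAX_ENV_STEPS = 50        # a trial ends automatically after 50 environmental steps
NUM_TRIALS = 20           # 20 runs per system
NUM_BOX_UPPERBOUND = 3    # num_box_upperbound

for iteration_num in range(1, NUM_TRIALS + 1):
    # seed = iteration number, so the same seed reproduces the same environment setup
    env = RPLHEnvironment(seed=iteration_num, num_box_upperbound=NUM_BOX_UPPERBOUND)
    for step in range(MAX_ENV_STEPS):
        actions = run_rplh_step(env)   # central agent proposes, local agents respond
        if not actions:                # rounds with empty actions are neglected
            continue
        env.apply(actions)
        if env.all_targets_matched():  # convergence: every box matched to a target
            break
```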

Standard Reasoning Metrics

Evaluating standard computation metrics (a sketch of computing them from trial records follows the list):

  1. Test success rate (does the system converge before the maximum number of steps (50 environmental steps in our evals) is reached).
  2. Convergence speed (number of environmental execution steps needed for convergence, i.e., to solve the given problem).
  3. Average number of boxes moved to other grids and average number of boxes matched to targets.
  4. Human annotation (evaluate whether the system outputs a reasonable/optimal response at each step; manually reason about the optimal number of steps given the environment).
    • Given the same seed (same environment setup), how many boxes are removed at each step and whether each step is efficient (human-annotated).
    • Does the output reasoning (tokens) make sense, and does the action decision make sense?
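As a minimal illustration of metrics 1–3, the snippet below computes them from per-trial records. The dictionary layout (steps_to_converge, boxes_moved, boxes_matched) and the placeholder numbers are illustrative assumptions, not the project's actual log format or results.

```python
# Sketch of computing the standard metrics from per-trial records.
trials = [
    {"steps_to_converge": 12, "boxes_moved": 5, "boxes_matched": 3},
    {"steps_to_converge": None, "boxes_moved": 7, "boxes_matched": 2},  # non-convergent trial
    {"steps_to_converge": 9,  "boxes_moved": 4, "boxes_matched": 3},
]

converged = [t for t in trials if t["steps_to_converge"] is not None]

success_rate = len(converged) / len(trials)                                 # metric 1
avg_convergence_speed = (sum(t["steps_to_converge"] for t in converged)
                         / len(converged))                                  # metric 2
avg_boxes_moved = sum(t["boxes_moved"] for t in trials) / len(trials)       # metric 3
avg_boxes_matched = sum(t["boxes_matched"] for t in trials) / len(trials)   # metric 3

print(f"success rate: {success_rate:.2f}")
print(f"avg convergence steps: {avg_convergence_speed:.1f}")
print(f"avg boxes moved: {avg_boxes_moved:.1f}, avg boxes matched: {avg_boxes_matched:.1f}")
```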

Agent-Social Reasoning Metrics

Evaluating the power of agent-based reasoning. For each round of testing, the spy agent is the same (placed along the diagonal), so even though the environment's box and target distribution is random, the spy agent's location remains constant. We compare RPLH-agent-spy with RPLH-standard-spy and RPLH-standard-no-spy.

  1. Sentence embedding cosine similarity (social understanding), described in more detail here:

    • non-spy vs. spy in spy_model.
    • comparison across different agents in agent_model.
  2. Feature importance:

    • We put the agent/standard RPLH data together and extract features (social reasoning similarity, convergence rate), described in more detail here.
  3. Energy metrics (our unique measures):

    • Using distance metrics to describe the energy that our system would use to complete the task, described in more detail here.
  4. Statistical test:

    • We performed both hypothesis testing and A/B testing to evaluate whether there is a significant difference between our testing environments (spy vs. no-spy); see the sketch after this list.
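The snippet below sketches one way the hypothesis test in item 4 could be run: a two-sample Welch's t-test on steps-to-convergence between the spy and no-spy environments. The step counts are placeholder numbers, not our actual results, and the specific test is an illustrative choice.

```python
# Sketch of a hypothesis test between the spy and no-spy environments.
from scipy import stats

spy_steps = [14, 22, 17, 50, 19, 25, 31, 16, 50, 21]   # placeholder: steps to converge, spy runs
no_spy_steps = [9, 11, 8, 13, 10, 12, 9, 14, 10, 11]   # placeholder: steps to converge, no-spy runs

# Welch's t-test (does not assume equal variances between the two groups)
t_stat, p_value = stats.ttest_ind(spy_steps, no_spy_steps, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between spy and no-spy environments.")
```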

Embeddings

Our conversation sentence-embedding similarity comparison uses a set of pre-defined sentences to compare against how the LM thinks about each agent's attitude. Specifically, we have sentences starting with "seems like" for describing the LM's model of other agents in the agent/spy model, and "I suspect" for the justification model (*this comparison matters a lot, so we try our best to set a neutral standard). A similarity sketch follows the list below.

- Nice_agent: "Seems to be very easy going, trying to find agreement with the central agent and the goal is to resolve conversation."
- Critical_agent: "Seems to be very critical and proposing many changes."
- Agreeing_agent: "Seems to be super cooperative, agreeing to central command."
- Spy_agent: "Seems to be the spy agent, its goal is to prevent matching targets."
- Justification: "I suspect that this agent is the spy agent, thus, I would not listen to this agent."
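As a hedged illustration of this comparison, the sketch below embeds the reference sentences with sentence-transformers and scores an arbitrary LM statement against each of them using cosine similarity. The model name all-MiniLM-L6-v2 and the attitude_similarity helper are illustrative choices for the sketch, not necessarily what our pipeline uses.

```python
# Sketch of the cosine-similarity comparison against the pre-defined attitude sentences.
from sentence_transformers import SentenceTransformer, util

reference_sentences = {
    "Nice_agent": "Seems to be very easy going, trying to find agreement with the "
                  "central agent and the goal is to resolve conversation.",
    "Critical_agent": "Seems to be very critical and proposing many changes.",
    "Agreeing_agent": "Seems to be super cooperative, agreeing to central command.",
    "Spy_agent": "Seems to be the spy agent, its goal is to prevent matching targets.",
    "Justification": "I suspect that this agent is the spy agent, thus, I would not "
                     "listen to this agent.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
ref_embeddings = {name: model.encode(text) for name, text in reference_sentences.items()}

def attitude_similarity(lm_statement: str) -> dict:
    """Cosine similarity between an LM statement about an agent and each reference sentence."""
    emb = model.encode(lm_statement)
    return {name: float(util.cos_sim(emb, ref)) for name, ref in ref_embeddings.items()}

# Example: a "seems like" statement produced by the LM about another agent
print(attitude_similarity("Seems like this agent keeps disagreeing with every proposed plan."))
```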