Evaluation Results
Our Main Findings
The following is our evaluation table for all three systems, in which we include metrics such as:
- Average energy metrics.
- Average convergence speed.
- Average boxes to targets per response.
- Average number of responses.
- Average convergence rate.
- Average embedding similarity of `spy_model` and `justification_model` outputs against the spy sentence.
We conducted experiments varying the number of boxes (of the same color) in the environment, setting `box_upperbound` to 2 and 3. We ran the same tests on a bootstrapped dataset and obtained similar results. Details are shown in the table.
| | rplh-agent-spy-3up | rplh-standard-spy-3up | rplh-standard-nospy-3up | rplh-agent-spy-2up | rplh-standard-spy-2up | rplh-standard-nospy-2up |
|---|---|---|---|---|---|---|
| Average-AUC-Norm1 | 187.714 | 228.818 | 241.808 | 156.536 | 163.571 | 160.292 |
| Average-AUC-Norm2 | 108.593 | 133.279 | 138.934 | 85.747 | 90.352 | 87.453 |
| Average-Slope1 | -0.388 | -0.333 | -0.279 | -0.661 | -0.622 | -0.567 |
| Average-Slope2 | -0.27 | -0.24 | -0.191 | -0.417 | -0.371 | -0.334 |
| Average-Box-To-Targets-Per-Response | 0.406 | 0.271 | 0.278 | 0.365 | 0.271 | 0.278 |
| Average-Responses | 28.4 | 35.8 | 36.95 | 24.105 | 35.8 | 36.95 |
| Average-Convergence-Rate | 0.7 | 0.55 | 0.65 | 0.737 | 0.7 | 0.6 |
| Average-Embedding-Similarity-Spy_Embed_Agent[0.5, 0.5] | 0.563 | nan | nan | 0.592 | nan | nan |
| Average-Embedding-Similarity-Spy_Embed_Agent[2.5, 2.5] | 0.647 | nan | nan | 0.641 | nan | nan |
| Average-Embedding-Similarity-Spy_Embed_Agent[1.5, 1.5] | 0.622 | nan | nan | 0.613 | nan | nan |
| Average-Embedding-Similarity-Justification_Embed | 0.627 | 0.553 | 0.554 | 0.611 | nan | nan |
As we can see, our system with agent reasoning, `rplh-agent-spy`, outperforms both other systems whether the upper bound is 2 or 3. On average, `rplh-agent` achieves:
- The least energy to complete the task.
- The fastest convergence speed.
- The highest convergence rate.
- The highest average box-to-targets-per-response ratio.
- The fewest responses needed.
We can make a few observations:
- In all cases, having fewer boxes in the environment makes the task easier to complete (in terms of convergence rate, convergence speed, and number of responses needed).
- Interestingly, `Average-Box-To-Targets-Per-Response` did not change for either `rplh-standard` system regardless of the presence of a spy, pointing to a potential conclusion that the underlying mechanism may be the same, independent of the number of boxes. With `rplh-agent`, however, more boxes result in a more efficient understanding of the spy agent, and thus a higher target-match rate.
- As expected, with more boxes, the agent has more experience with spies, improving the justification embedding similarity. However, the same conclusion does not transfer to the spy-model embeddings.
- Interestingly, it takes longer for the system to solve the task when no spy is around. We hypothesize that the presence of a spy agent "forces" the HCA to conduct better agent modeling, and hence converge faster on the task.
- However, although the work is done more efficiently, the convergence rate is actually lower, signifying that the spy does disrupt the central agent as well (when the HCA does not explicitly model the spy, i.e., `RPLH-standard`).
Energy Metrics
The energy metric is our unique measure of how the system performs under different settings. It is a distance-based metric: for each trial and for each environmental step, we calculate the current distance between each box and each target. The distance is measured in two forms:
- `Norm1`: how many actual steps are needed to move the box to the target (Manhattan distance); for each color type, we pick the smallest Norm1 distance as the easiest target for the box to reach.
- `Norm2`: how much "distance" (in an abstract sense) the boxes are from each target; the value is arbitrary and used for comparison purposes only, which is why we average within each color.
Then, for each trial, we can see how these two notions of distance change over environmental steps, all the way until convergence (we only take converged trials). While the slope tells us about the convergence speed, we use the area under the curve (`AUC`) as the `Energy`, i.e., the total effort a system needs to converge. Here is a demo of how such a graph looks for our three systems on the `seed=3` trial.
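As a rough illustration, the Norm1 distance and its AUC "Energy" can be computed like this (a minimal sketch with toy box/target coordinates and a toy per-step distance curve; the helper names are ours, not from the RPLH codebase):

```python
def norm1_energy(boxes, targets):
    """Total Manhattan (Norm1) distance, pairing each box with its
    easiest (closest) target of the same color."""
    total = 0
    for bx, by in boxes:
        total += min(abs(bx - tx) + abs(by - ty) for tx, ty in targets)
    return total

def auc_energy(per_step_distance):
    """Area under the distance-vs-step curve (trapezoidal rule),
    used as the 'Energy' a system spends until convergence."""
    d = per_step_distance
    return sum((d[i] + d[i + 1]) / 2 for i in range(len(d) - 1))

# toy trial: total Norm1 distance shrinking to 0 at convergence
steps = [6, 4, 3, 1, 0]
print(auc_energy(steps))  # → 11.0 (the "Energy" / AUC)
slope = (steps[-1] - steps[0]) / (len(steps) - 1)
print(slope)              # → -1.5 (convergence speed)
```

The slope of this curve corresponds to the `Average-Slope` rows in the table above, and the area to the `Average-AUC` rows.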
Sentence Embedding Similarity
For the RPLH system with agent-based reasoning (having `agent_model`, `spy_model`, and `justification`), we use embedding methods to compare what our LM has said against the spy-detection sentence: "Seems to be the spy agent, its goal is to prevent match targets". The following results are from the RPLH-agent system running for 20 trials with `box-num-upperbound` set to 3 in a 3x3 grid environment (the agents shown are spy agents).
- Note that we used a higher `num-box-upperbound` here to allow the central agent to learn more about the spy agent.
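A minimal sketch of this comparison, assuming the sentences have already been encoded into fixed-length vectors (the vectors below are illustrative stand-ins, not real model embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# illustrative vectors standing in for sentence embeddings of the
# spy-detection sentence and the spy_model's output for one agent
spy_sentence_vec = [0.2, 0.7, 0.1]
spy_model_vec = [0.3, 0.6, 0.2]
print(round(cosine_similarity(spy_sentence_vec, spy_model_vec), 3))  # → 0.972
```

Each `Spy_Embed_Agent[x, y]` column below is this similarity averaged over trials for the spy agent at that grid position.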
| | Spy_Embed_Agent[0.5, 0.5] | Spy_Embed_Agent[1.5, 1.5] | Spy_Embed_Agent[2.5, 2.5] | Justification_Embed |
|---|---|---|---|---|
| count | 14 | 14 | 14 | 14 |
| mean | 0.570538 | 0.619103 | 0.64323 | 0.615159 |
| std | 0.108039 | 0.0747719 | 0.109523 | 0.0501143 |
| min | 0.353954 | 0.529466 | 0.353954 | 0.500486 |
| 25% | 0.532694 | 0.567079 | 0.599519 | 0.590435 |
| 50% | 0.587794 | 0.599107 | 0.68312 | 0.61961 |
| 75% | 0.644218 | 0.673218 | 0.725615 | 0.651351 |
| max | 0.711985 | 0.753357 | 0.734128 | 0.698648 |
We can see that the HCA agent is somewhat aware of the spy.
Feature Importance & Hypothesis Testing
We turned the agents' performance metrics (the dataframe above) into features and used a RandomForest regressor to see which features matter most for predicting `number of responses` or `average targets matched per response`. We also used both jointly as a target vector for comparison.
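A sketch of this analysis using scikit-learn's `RandomForestRegressor`. The feature names and the synthetic data below are illustrative stand-ins for the real metrics dataframe, not the actual measurements:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for the performance-metrics dataframe:
# features = [has_spy (0/1), box_upperbound, convergence_rate]
rng = np.random.default_rng(0)
has_spy = rng.integers(0, 2, 200)
box_up = rng.integers(2, 4, 200)
conv = rng.random(200)
X = np.column_stack([has_spy, box_up, conv])
# hypothetical target: number of responses, driven mostly by spy presence
y = 30 + 8 * has_spy + rng.normal(0, 1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["has_spy", "box_upperbound", "convergence_rate"],
                     model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

With data shaped like this, `feature_importances_` (which sums to 1) concentrates on the spy-presence feature, mirroring the observation that the spy dominates the prediction of the number of responses.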
With all the data, we can clearly see how the presence of a spy may affect performance, especially for `number of responses`. We performed follow-up bootstrapped hypothesis testing with `alpha-value=0.1` (divided by 2 due to the two-tailed test) and the difference in means as the test statistic.
- H0: There is no difference in the mean convergence times between the two groups.
- H1: There is a difference in the mean convergence times between the two groups.
The p-value represents the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis (H0) is true. We bootstrapped uniformly at random from the full data distribution and calculated (mean_spy - mean_no_spy) as our test statistic (the bootstrapped differences in means are approximately normally distributed). Our observed statistic falls in the tail of this distribution with `P=0.049`, which is below our `alpha` value; hence, we reject H0. With more data collected, we will get a better understanding of our performance.
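The bootstrap procedure can be sketched as follows (a minimal sketch; the sample data are illustrative, not the measured convergence times):

```python
import numpy as np

def bootstrap_mean_diff_pvalue(spy, no_spy, n_boot=10_000, seed=0):
    """Two-tailed bootstrap test for a difference in mean convergence times.

    Under H0 both groups come from one distribution, so we resample from
    the pooled data and compare the observed (mean_spy - mean_no_spy)
    against the bootstrapped null distribution of that statistic.
    """
    rng = np.random.default_rng(seed)
    spy, no_spy = np.asarray(spy, float), np.asarray(no_spy, float)
    observed = spy.mean() - no_spy.mean()
    pooled = np.concatenate([spy, no_spy])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a = rng.choice(pooled, size=len(spy), replace=True)
        b = rng.choice(pooled, size=len(no_spy), replace=True)
        diffs[i] = a.mean() - b.mean()
    # two-tailed: fraction of null diffs at least as extreme as observed
    return float(np.mean(np.abs(diffs) >= abs(observed)))

# illustrative convergence times (not the paper's measurements)
spy_times = [35, 38, 40, 36, 39, 41, 37, 42]
no_spy_times = [30, 31, 29, 33, 32, 30, 34, 31]
print(bootstrap_mean_diff_pvalue(spy_times, no_spy_times))
```

Resampling both groups from the pooled data enforces the null hypothesis by construction, so the resulting p-value directly answers how surprising the observed mean difference would be if the spy had no effect.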