
Evaluation Results

Our Main Findings

The following is our evaluation table for all three systems, in which we include metrics such as:

  1. Average energy metrics.
  2. Average convergence speed.
  3. Average boxes to targets per response.
  4. Average number of responses.
  5. Average convergence rate.
  6. Average embedding similarity for spy_model and justification_model against the spy sentence.

We conducted experiments manipulating the number of boxes (of the same color) in the environment, with box_upperbound set to 2 and 3. We ran the same tests on a bootstrapped dataset and obtained similar results. Details are given in the table below.

| Metric | rplh-agent-spy-3up | rplh-standard-spy-3up | rplh-standard-nospy-3up | rplh-agent-spy-2up | rplh-standard-spy-2up | rplh-standard-nospy-2up |
| --- | --- | --- | --- | --- | --- | --- |
| Average-AUC-Norm1 | 187.714 | 228.818 | 241.808 | 156.536 | 163.571 | 160.292 |
| Average-AUC-Norm2 | 108.593 | 133.279 | 138.934 | 85.747 | 90.352 | 87.453 |
| Average-Slope1 | -0.388 | -0.333 | -0.279 | -0.661 | -0.622 | -0.567 |
| Average-Slope2 | -0.27 | -0.24 | -0.191 | -0.417 | -0.371 | -0.334 |
| Average-Box-To-Targets-Per-Response | 0.406 | 0.271 | 0.278 | 0.365 | 0.271 | 0.278 |
| Average-Responses | 28.4 | 35.8 | 36.95 | 24.105 | 35.8 | 36.95 |
| Average-Convergence-Rate | 0.7 | 0.55 | 0.65 | 0.737 | 0.7 | 0.6 |
| Average-Embedding-Similarity-Spy_Embed_Agent[0.5, 0.5] | 0.563 | nan | nan | 0.592 | nan | nan |
| Average-Embedding-Similarity-Spy_Embed_Agent[2.5, 2.5] | 0.647 | nan | nan | 0.641 | nan | nan |
| Average-Embedding-Similarity-Spy_Embed_Agent[1.5, 1.5] | 0.622 | nan | nan | 0.613 | nan | nan |
| Average-Embedding-Similarity-Justification_Embed | 0.627 | 0.553 | 0.554 | 0.611 | nan | nan |

As we can see, our system with agent reasoning, rplh-agent-spy, outperforms the other two systems at both upperbound 2 and 3. On average, rplh-agent takes:

  • Least energy to complete the task.
  • Fastest convergence speed.
  • Highest convergence rate.
  • Highest average box-to-targets-per-response ratio.
  • Fewest responses needed.

We can make a few observations:

  1. In all cases, having fewer boxes in the environment makes the task easier to complete (in terms of convergence rate, convergence speed, and number of responses needed).

    • Interestingly, Average-Box-To-Targets-Per-Response did not change for either rplh-standard system regardless of the presence of a spy, suggesting that the underlying mechanism may be the same, independent of the number of boxes. With rplh-agent, however, more boxes lead to a more efficient understanding of the spy agent, resulting in a higher target match rate.
  2. As expected, with more boxes the agent has more experience with spies, improving the justification embedding similarity. However, the same conclusion does not transfer to the spy model embeddings.

    • Interestingly, the system takes longer to solve the task when no spy is around. We hypothesize that the presence of a spy agent "forces" the HCA to conduct better agent modeling, and hence converge faster on solving the task.
    • However, although the work is done more efficiently, the convergence rate is actually lower, signifying that the spy does disrupt the central agent as well (when the HCA does not explicitly model the spy, as in RPLH-standard).

Energy Metrics

The energy metric is our own measure of how the system performs under different settings. It is a distance-based metric: for each trial, and for each environmental step, we calculate the current distance between each box and each target. The distance is measured in two forms:

  • Norm1: How many actual steps are needed to move the box to a target (Manhattan distance); for each color type we pick the smallest Norm1 distance, i.e. the easiest target for the box to reach.
  • Norm2: How much "distance" (an abstract, straight-line notion) the boxes are from each target; the value is arbitrary and used only for comparison, which is why we average over each color.

For each trial, these two notions of distance change over environmental steps, all the way until convergence (we only take converged trials). While the slope tells us about convergence speed, we use the area under the curve (AUC) as the energy, i.e. all the steps needed for a system to converge. Here is a demo of how such a graph looks for our three systems on the seed=3 trial.

Graph 1
Graph 2
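The per-step computation above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper names are made up, positions are assumed to be (row, col) tuples grouped by color, and Norm2 is assumed to be the Euclidean distance.

```python
import numpy as np

def step_energy(boxes, targets):
    """For one environmental step, sum the smallest box-to-target
    distance per box, in both Norm1 (Manhattan) and Norm2 (Euclidean)."""
    norm1 = norm2 = 0.0
    for color, box_list in boxes.items():
        tgt = np.array(targets[color])                # same-color targets only
        for b in box_list:
            d = tgt - np.array(b)
            norm1 += np.abs(d).sum(axis=1).min()      # fewest steps to the easiest target
            norm2 += np.linalg.norm(d, axis=1).min()  # straight-line distance
    return norm1, norm2

def energy_and_slope(energies):
    """AUC (total 'energy' until convergence) and slope (convergence
    speed) of the per-step energy curve for one converged trial."""
    e = np.asarray(energies, dtype=float)
    steps = np.arange(len(e))
    auc = ((e[1:] + e[:-1]) / 2).sum()          # trapezoidal area under the curve
    slope = np.polyfit(steps, e, 1)[0]          # negative slope = converging
    return auc, slope
```

A steeper (more negative) slope and a smaller AUC both indicate a system that pushes boxes toward targets faster, matching how Average-Slope and Average-AUC are read in the table above.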

Sentence Embedding Similarity

For the RPLH system with agent-based reasoning (having agent_model, spy_model, and justification), we use embedding methods to compare what our LM has said with the spy-detection sentence "Seems to be the spy agent, its goal is to prevent match targets." The following results are from the RPLH-agent system running for 20 trials with box-num-upperbound set to 3 in a 3x3 grid environment (shown agents are spy agents).

  • Note that we used a higher num-box-upperbound here to allow the central agent to learn more about the spy agent.

|       | Spy_Embed_Agent[0.5, 0.5] | Spy_Embed_Agent[1.5, 1.5] | Spy_Embed_Agent[2.5, 2.5] | Justification_Embed |
| ----- | --- | --- | --- | --- |
| count | 14 | 14 | 14 | 14 |
| mean  | 0.570538 | 0.619103 | 0.64323 | 0.615159 |
| std   | 0.108039 | 0.0747719 | 0.109523 | 0.0501143 |
| min   | 0.353954 | 0.529466 | 0.353954 | 0.500486 |
| 25%   | 0.532694 | 0.567079 | 0.599519 | 0.590435 |
| 50%   | 0.587794 | 0.599107 | 0.68312 | 0.61961 |
| 75%   | 0.644218 | 0.673218 | 0.725615 | 0.651351 |
| max   | 0.711985 | 0.753357 | 0.734128 | 0.698648 |

We can see that the HCA agent is somewhat aware of the spy.
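The comparison itself reduces to cosine similarity between embedding vectors. Here is a minimal sketch, assuming the embeddings have already been produced by some sentence-embedding model (the encoder itself is not shown, and the function name is illustrative):

```python
import numpy as np

# Reference sentence the model outputs are compared against (from the text above).
SPY_SENTENCE = "Seems to be the spy agent, its goal is to prevent match targets."

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A similarity near 1 would mean the spy_model or justification_model output is semantically close to the reference spy-detection sentence; the ~0.6 averages in the table indicate partial, not complete, awareness.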

Feature Importance & Hypothesis Testing

We turned the agents' performance metrics (the dataframe above) into features and used a RandomForest regressor to see which features matter most for predicting the number of responses or the average targets matched per response. We also used both together as a vector for comparison.

Graph 1
Graph 2
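The analysis step can be sketched as below. The data here are synthetic stand-ins (feature names and the dependence on spy presence are assumptions mirroring the trend in the table, not the actual measurements):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
has_spy = rng.integers(0, 2, n).astype(float)          # 1 if a spy agent is present
box_upperbound = rng.integers(2, 4, n).astype(float)   # 2 or 3 boxes per color
# Synthetic target: number of responses made to depend strongly on spy presence.
n_responses = 25 + 8 * has_spy + 2 * (box_upperbound - 2) + rng.normal(0, 1, n)

X = np.column_stack([has_spy, box_upperbound])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, n_responses)
importances = dict(zip(["has_spy", "box_upperbound"], model.feature_importances_))
```

Reading off `model.feature_importances_` (which sum to 1) is what produces the importance rankings plotted above; with this synthetic data, spy presence dominates.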

With all the data, we can clearly see how the presence of a spy may affect performance, especially the number of responses. We performed a follow-up bootstrapped hypothesis test with alpha=0.1 (divided by 2 due to the two-tailed test) and the difference in means as the test statistic.

  • H0: There is no difference in the mean convergence times between the two groups.
  • H1: There is a difference in the mean convergence times between the two groups.
Graph 1

The p-value represents the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming that the null hypothesis (H0) is true. We resampled uniformly from the full data distribution and calculated (mean_spy - mean_no_spy) as our test statistic (bootstrapped differences in means, approximately normally distributed). Our observed statistic falls in the tail of this distribution with p=0.049, which is below our alpha value; hence, we reject H0. With more data collected, we will get a better understanding of our performance.
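The resampling procedure can be sketched as follows. This is an illustrative implementation of the test described above, with placeholder data rather than the measured convergence times:

```python
import numpy as np

def bootstrap_mean_diff_pvalue(spy, no_spy, n_boot=10_000, seed=0):
    """Two-tailed p-value for H0: equal means. The null distribution is
    built by resampling both groups uniformly from the pooled data and
    recomputing (mean_spy - mean_no_spy) each time."""
    rng = np.random.default_rng(seed)
    spy = np.asarray(spy, dtype=float)
    no_spy = np.asarray(no_spy, dtype=float)
    observed = spy.mean() - no_spy.mean()
    pooled = np.concatenate([spy, no_spy])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a = rng.choice(pooled, size=len(spy), replace=True)
        b = rng.choice(pooled, size=len(no_spy), replace=True)
        diffs[i] = a.mean() - b.mean()
    # Two-tailed: fraction of null diffs at least as extreme as observed.
    return float(np.mean(np.abs(diffs) >= abs(observed)))
```

A p-value below the (halved, two-tailed) alpha threshold, as in the p=0.049 result above, leads to rejecting H0.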