A glimpse of the human brain: To click or not to click?
Christophe Roger, Teva Brotherson, Antoine Vincent, Emilien Seiler
The objective of the Wikispeedia game is to navigate from one Wikipedia article to another using only the links within the articles.
To play the game, you will start at a randomly chosen Wikipedia article. Your goal is to reach a specific target article, also chosen randomly, by clicking on the links within the articles. You are not allowed to use the search function or any other external resources.
The game is timed, and your score is based on how quickly you are able to reach the target article.
In Wikispeedia, players must make strategic decisions about which Wikipedia articles to click on in order to reach the target article as quickly as possible. These decisions involve a balance between speed and thinking, as players must weigh the potential risks and rewards of each choice. As players navigate through the network of Wikipedia articles, they may encounter various factors that influence their decision-making process. For example, they may encounter articles that are more relevant to the target topic, but require more clicks to reach. Alternatively, they may encounter articles that are less relevant, but can be reached more quickly. By analyzing the decisions that players make in Wikispeedia, we can gain insight into the patterns of human behavior and decision-making in a digital environment. This information can lead to a better understanding of how people navigate complex networks and make strategic choices under time pressure. So whether you're a seasoned Wikispeedia player or just starting out, be sure to consider the various factors that may impact your decisions as you race to reach the target article.
Analysis Part: A glimpse of the human brain
Motivation & Objectives
Why use a game such as Wikispeedia as a way of understanding the human brain?
A game is “an activity that one engages in for amusement or fun” [Oxford Languages], and it leads to a state of mind defined as being “eager or willing to do something new or challenging” [Oxford Languages].
From these definitions, playing a game is a deliberate and enjoyable process during which the brain is at its maximal concentration, because the activity is both wanted and challenging. Because it is a deliberate process, it reflects what humans tend to do naturally, and it is therefore a good place to look for intrinsic insights into human behaviour.
For example, these graphs show that, on average, someone tends to replay a game if they lost the previous one. According to scientific research (such as Cowie and Cornelius (2003)), this result is expected because, from a psychological point of view, negative feedback has a bigger impact on the brain than positive feedback. Therefore, in our case, players experience stronger emotions when losing a game than when winning one, which increases their interest in the game and leads them to play again.
This is the type of insight into the true functioning of the human brain that a game analysis can easily provide.
Another reason is simply that, although the game seems straightforward, it is not that easy to go from one page to another, and there is actually a significant difference between optimal and average player paths. Indeed, over more than 50 000 games in which users managed to complete the path, the average player path length is 6.76 clicks (6.4 when ignoring backward steps), whereas the theoretical average shortest path length is 3.2 clicks (the theoretical distance was initially computed using the Floyd-Warshall algorithm). Their respective medians are 6 and 3.
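To make the notion of "theoretical shortest distance" concrete, here is a minimal sketch of the Floyd-Warshall algorithm on a toy link graph. The articles, the links and the function name `floyd_warshall` are made up for illustration and are not taken from the dataset or from our actual code.

```python
from math import inf

def floyd_warshall(links):
    """All-pairs shortest click counts for a directed link graph.

    `links` maps each article to the list of articles it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    # Start with 0 on the diagonal, 1 for a direct link, infinity otherwise.
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, targets in links.items():
        for v in targets:
            if u != v:
                dist[u][v] = 1
    # Allow each article k in turn to be used as an intermediate step.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Toy link graph (hypothetical articles and links).
toy_links = {
    "Cat": ["Mammal", "Egypt"],
    "Mammal": ["Animal"],
    "Egypt": ["Africa", "Nile"],
    "Animal": ["Biology"],
    "Africa": ["Nile"],
    "Nile": [],
    "Biology": [],
}
dist = floyd_warshall(toy_links)
print(dist["Cat"]["Nile"])  # shortest number of clicks from Cat to Nile
```

An infinite distance simply means that the target cannot be reached from the start page by following links.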
In order to determine whether the sub-optimal clicks were actually linked to knowledge and to the human brain, the optimal path was needed: not only its length but also the pages that constitute it. This was extracted from the network built from the connections between all pages, i.e. the links. Every path that was played and finished was compared to the shortest path given by the network, and the differences were noted and analyzed. Incidentally, the lengths of the extracted optimal paths were not exactly equal to those in the 'shortest distance matrix' provided with the dataset. A possible explanation is that some links were ignored although present on the page because their position was not conventional; another is that some linked words pointed to pages that did not carry the same name (e.g. color -> colour), which induced some errors.
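The comparison between a played path and the optimal one can be sketched as follows. This is only an illustration under our own assumptions: the graph is a toy example, the helper names are hypothetical, the played path is assumed to be a plain list of pages with backward steps already resolved, and a click is called sub-optimal when it does not bring the player strictly closer to the target.

```python
from collections import deque

def distances_to(target, links):
    """Clicks needed from every article to `target` (BFS on reversed links)."""
    reverse = {}
    for source, targets in links.items():
        for t in targets:
            reverse.setdefault(t, []).append(source)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        for prev in reverse.get(page, []):
            if prev not in dist:
                dist[prev] = dist[page] + 1
                queue.append(prev)
    return dist

def suboptimal_clicks(path, links):
    """Return the (from_page, clicked_page) pairs that did not reduce the distance to the target."""
    dist = distances_to(path[-1], links)
    inf = float("inf")
    bad = []
    for current, clicked in zip(path, path[1:]):
        # One click can reduce the distance by at most 1, so any click that
        # does not strictly reduce it is not on a shortest path.
        if dist.get(clicked, inf) >= dist.get(current, inf):
            bad.append((current, clicked))
    return bad

# Toy link graph and played path (hypothetical).
toy_links = {
    "Cat": ["Mammal", "Egypt"],
    "Mammal": ["Animal", "Cat"],
    "Egypt": ["Africa", "Nile"],
    "Animal": [],
    "Africa": ["Nile"],
    "Nile": [],
}
played = ["Cat", "Egypt", "Africa", "Nile"]
print(suboptimal_clicks(played, toy_links))  # [('Egypt', 'Africa')]
```

In this toy path, going from Egypt to Africa is the detour, since Egypt already links directly to Nile.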
Every article is classified with two attributes: a general subject and a specific category within it. The comparison was first carried out on the subjects. The results are collected every time there is a sub-optimal click in those paths: the category of the clicked word and those of the existing optimal paths are then extracted and analyzed. Out of the existing 50'000 paths, 20% were either optimal or non-existing paths.
The same type of analysis was carried out on the categories.
By taking the difference between those two graphs, we were able to distinguish the subjects that were "over-/underrated", i.e. the subjects that were not chosen in accordance with how often they are correct. For example, the left graph below shows that even if 'Countries' is a category on which people click a lot (even sub-optimally), they should click on it even more, as it is an optimal choice one time out of three (see graphs above). Conversely, 'Science' is often mistaken for an optimal category although it is not. Once normalized, we notice a remarkable trend: the 'Geography' subject, although it appears often, is left aside by people, who are more attracted by 'Science', which also appears quite often.
And the same applies to the categories.
Not only is there a difference in path length between shortest and played paths, but there is also a bias in the subjects chosen, i.e. in the a priori reasoning and in the understanding of how Wikipedia organizes knowledge. It reveals an approach to common knowledge that is quite interesting and worth investigating.
Therefore, based on those observations, analysing a game is a very good way to analyse natural human behaviour in different situations, and it lends itself well to comparison with the existing literature.
Analysis of the learning
In most games, more experienced players tend to perform better. We could then assume that there is a learning factor that plays a role in the results obtained: the more someone plays a game, the better they get at it. This could be because players slowly understand the mechanics of a game and then develop strategies to perform better. According to Lambert and McCombs [American Psychological Association, 1997]: “The acquisition of complex knowledge and skills demands the investment of considerable learner energy and strategic effort”. In other words, learning is an effortful process, and becoming “skilled” at a specific task requires training. Here, “skilled” means that the player uses fewer parts of their brain and needs less effort for a given task, and can therefore do more things at the same time.
The figure above shows that users' win rate tends to increase the more they play. Moreover, the p-value of the regression is reported as 0.000, which is below a significance level of 0.001. Therefore, we can reject the null hypothesis that the win rate is independent of the chronological attempt number. The results fit the theory and assumptions: as players get used to the game, they perform better on the task “finding a valid path from initial page to target”. It can be assumed that they did learn the mechanics of the game. As defined in Wikispeedia’s description, the goal of the game is to find a valid path between two pages, but above all to find it in the shortest possible time. Therefore, the real point to analyse is the time needed to find the path. From a learning perspective, as defined in the win-rate analysis, we can differentiate a learner from a skilled player by the number of brain regions in activity [Wu, Liu, Zhang, Hallet, Zheng, and Chan (2014)]. As more experienced players use fewer parts of their brain to “find the next page to find target”, they can concentrate on other things, such as time, where to find links, etc. Being more efficient, they should obtain better results regarding the total time spent to win.
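For readers who want to reproduce this kind of test, here is a minimal sketch of a regression of a performance measure on the attempt number. It is not the exact procedure behind our figures: the p-value is obtained with a permutation test (a swapped-in technique that only needs the standard library), and the durations are invented numbers.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value for the null hypothesis 'ys does not depend on xs'."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between attempt number and outcome
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: mean game duration (seconds) at attempts 1 to 12.
attempts = list(range(1, 13))
durations = [182, 175, 178, 169, 171, 160, 163, 155, 158, 149, 151, 144]

print(slope(attempts, durations))                # negative: games get shorter
print(permutation_p_value(attempts, durations))  # small: the trend is unlikely under the null
```

A negative slope together with a small p-value is the pattern described for the duration of a game; for the win rate, the same test would be run with a positive expected slope.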
The graph above has a regression p-value of 0.003, which is below a significance level of 0.005. Therefore, we can reject the null hypothesis that the duration of a game is independent of the chronological attempt number. The negative regression slope indicates that the duration of a game decreases the more users play. This would mean that learning is observed: as people play, they get better and win more rapidly. This finding on Wikispeedia also matches the research, as trained players do better at the game. The results thus show that players get better and better the more they play, both in win rate and in time needed to win. Research argues that by playing, users understand more and more how the game works and can use fewer parts of their brain, as some tasks become automatic. This leads to a question:
How does someone get better at Wikispeedia? What strategy naturally emerges when playing? As seen in the previous part, players get better at Wikispeedia by playing. The goal of this part is to find out how, and how their learning is influenced by external factors and by the intrinsic functioning of the brain. The first question that comes to mind is whether players click differently on links, in other words whether there is a quantifiable change in their clicking behaviour over tries. The analysis is carried out in two parts:
- One that deals with the number of clicks and their speed over attempts, in other words whether it is possible to find a strategy that players could have used to improve their performance by changing their click rate.
- The other about where users click and how that evolves as well. This part focuses on the position of links in the pages and on which links players click.
Speed-clicking Analysis
To carry out this analysis, the following assumptions are made:
- Only games that were won are taken into consideration
- Time per click is the quotient of the total duration and the total number of clicks
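The two assumptions above translate into a very small computation. The sketch below uses hypothetical field names ('won', 'duration', 'clicks') and invented games; it is only meant to show the definition, not our actual data pipeline.

```python
def time_per_click(games):
    """Average seconds per click for each won game.

    `games` is a list of dicts with hypothetical keys:
    'won' (bool), 'duration' (seconds) and 'clicks' (int).
    """
    return [
        g["duration"] / g["clicks"]
        for g in games
        if g["won"] and g["clicks"] > 0  # lost games are ignored
    ]

games = [
    {"won": True, "duration": 120.0, "clicks": 6},
    {"won": False, "duration": 300.0, "clicks": 10},
    {"won": True, "duration": 90.0, "clicks": 5},
]
print(time_per_click(games))  # [20.0, 18.0]
```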
From the graphs, both regression p-values are close to zero, meaning that we can reject the null hypothesis that there is no net change over the number of tries in, respectively, the number of clicks and the time per click. The graph also shows that the number of clicks stays quite constant over the range, meaning that the strategy regarding the number of clicks does not really evolve, only slightly (about 0.4 more clicks per game). The statistical results thus indicate a positive change in the number of clicks per game, but this change is very small. Does that indicate that the strategy used didn’t change? The theory of "Deliberate Practice", developed by Ericsson in 1993, states that a person deliberately engaging in highly structured activities specifically designed to improve performance can be considered an expert in the activity after roughly 10 000 hours. At this point, a person can be assumed to master the subject. In the Wikispeedia case, a person who has played 100 games has played for 4 hours and 20 minutes on average. Therefore, according to this theory, and given the insufficient number of hours spent, these players cannot be seen as masters of the game. This can explain why, in the Wikispeedia case, no significant change of strategy has been observed between the early stages and the 100th game. On the other hand, the average time per click quickly decreases as players make more attempts. This is because the overall time per game improves as users play, as seen in the previous part. Consequently, because the number of clicks does not evolve much over time, the average time per click trivially decreases as well.
Spatial clicking Analysis
In order to see if there is a change in the spatial clicking behaviour of players, we first need to understand how the links are distributed over pages:
We want to determine whether the location of links on a page affects the likelihood of them being clicked on. To do this, we will compare the spatial distribution of clicked links to the overall distribution of all links on the page. We will use a tool called Selenium to extract the coordinates of all links on a page, and create a dataset containing the coordinates of both clicked links and all links. This will allow us to analyze the differences in distribution between the two groups and determine if the spatial location of links plays a role in whether they are clicked on.
Using this dataframe, we can extract the coordinates of all links and the coordinates of the links clicked in the paths given in the dataset. We obtain around 170 000 pairs of coordinates for both clicked links and page links. Let's see what the distributions look like:
There appears to be a pattern in both distributions! Let's analyze how a generic page is built. In the following image, the body is in red, the infobox in green and the title in blue:
The density of clicks and page views decreases as the distance from the top of the page increases. Additionally, there are two areas of high density: the beginning of the article and around the infobox. This suggests that people tend to not scroll down the page to find links and are more likely to click on links that are immediately visible when they arrive on the page. This is supported by the fact that the density of clicks is higher than the density of page views in these areas. There are also distortions in the density of page views at approximately the 50th and 150th pixels from the top of the page, which may be due to the presence of lists of articles in those areas. These distortions are not present in the density of clicks, which could mean that people do not tend to click on lists or that it is difficult to land on these articles. Further statistical analysis may be necessary to confirm these observations.
In order to compare both distributions, we perform Student's t-tests on the lists of y-coordinates and on the distances of the links from the top left corner of the page. Here are the results:
After analyzing the data, we have found that there is a significant difference in the distribution of clicks and page views on Wikipedia articles. The density of clicks is higher at the top and around the infobox, while the density of page views decreases as the distance from the top of the page increases. This suggests that people tend to not scroll down the page to find links and are more likely to click on links that are immediately visible when they arrive on the page. This is supported by statistical tests, including a paired t-test, which showed a significant difference between the y-coordinates of clicked links and page views, and a t-test on the distance of clicks from the top left corner of the screen, which also showed a significant difference. These findings suggest that there is a distinct pattern in the way people interact with Wikipedia articles.
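As an illustration of how two lists of y-coordinates can be compared, here is a minimal sketch of a two-sample t statistic. It is an assumption-laden stand-in and not the exact test reported above: it uses Welch's unequal-variance form, and the p-value relies on a normal approximation, which is only reasonable for large samples such as our roughly 170 000 coordinates. The coordinates below are invented and far too few for the approximation to be accurate.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t_test(a, b):
    """Welch's t statistic and a two-sided p-value (normal approximation)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Hypothetical y-coordinates in pixels (not taken from the dataset).
clicked_y = [100, 150, 200, 250, 300, 180, 220, 160]
all_links_y = [600, 900, 1200, 400, 1500, 800, 1000, 700]

t, p = welch_t_test(clicked_y, all_links_y)
print(t, p)  # t is negative: clicked links sit higher on the page
```

A negative t statistic with a small p-value corresponds to the pattern described above: clicked links are, on average, closer to the top of the page than links in general.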
With that established, we can look at it from the psychological point of view. Looking at the results, it is quite evident that players mostly click on the first links they see. This can be seen either as laziness to check all available links or as an automatism of the brain: “this looks correct to click on, I have limited time, let’s do it”. This psychological process has been named the “Lazy Controller” by Daniel Kahneman [Thinking Fast and Slow, 2016]. As discussed before, thinking and learning are effortful processes, so our brain switches off attention whenever possible, which leads to what we call intuition. Its definition is: “the ability to understand something instinctively, without the need for conscious reasoning.” [Oxford languages]. Therefore, according to the literature, Wikispeedia players would rather click on a link by intuition than use “reasoning” and deliberate thinking. This coincides exactly with the fact that players mostly click on links at a minimal distance from the beginning of the page, i.e. players click on the first “good enough” link that they find.
Now that we know that players have a coordinate-clicking pattern, we need to know whether players change their way of interacting with the Wikipedia page over time. The results are the following:
From the Y coordinate graph, the p-value is 0.336, which is high enough that we cannot reject the null hypothesis. Therefore, we find no evidence that players change their clicking behaviour along the Y coordinate from the first to the 100th game. If we used a significance level of 5%, the p-value would indicate that the average X coordinate clicked varies from the first to the 100th game. However, the observed change is minor (less than 50 pixels), and in all other analyses the significance level used was between 1% and 0.1%. In this case, let's therefore also take a significance level of 1%, under which the null hypothesis is retained for the X coordinate as well.
In conclusion, players do not change the coordinate position of the links they click over the observed data. They have built an initial intuition and have not developed any strategy regarding the coordinates of links since.
Now, let’s go back to the original question: what strategies did players use to get better at the game?
We found that they tend to click on slightly more links at the 100th try than in the early stages of the game, approximately 0.4 clicks more, and that they did not really change their spatial clicking strategy. Is 0.4 clicks per game sufficient to explain the much better results of players after a couple of tries?
Our assumption is that it is not: there should be another phenomenon, beyond the click rate, that helps players get better.
Wikipedia articles are (most of the time) written by experts in the subject; therefore, someone who is searching for “the optimal link” to click on would have an easier time with a subject they really know. One way to learn about new subjects is the news. Indeed, the point of the news is to inform people about subjects they don’t already know. We can then make the following hypothesis:
People know more about subjects that are currently happening in the world and relayed by the press, and would then be better at going through these subjects in Wikispeedia.
As we delve into the temporal analysis of the path taken by our players, we aim to uncover any patterns or trends in the choice of clicked articles that may be influenced by the current events or themes of the moment. Could it be that the excitement of a football world cup or the intensity of an election season affects the categories of articles that players are drawn to? By analyzing the similarities and differences in the categories of clicked articles across different years, we hope to gain insights into the dynamics of player behavior and how it is influenced by the world around us.
The first step in analyzing the temporal aspects of our data is to examine the distribution of the data over time. This will give us a sense of how much data we have for each year and allow us to determine if there are any significant discrepancies or biases in our dataset.
The barplot reveals the distribution of Wikispeedia paths by year. From 2008 to 2014, the data is not evenly distributed, with 2014 and 2008 being underrepresented and 2009 being overrepresented, particularly in July and August with a peak of 7000 paths (looks like less interesting vacations for some people ;)). This uneven distribution must be taken into consideration in any analysis, as it is important to compare similar years in order to avoid biases. After examining the distribution of our data over time, we have determined that there is not enough data for the years 2008 and 2014 to include them in our analysis.
First, to determine whether the use of a certain category is significantly different in a certain year, we performed paired t-tests. Each test is performed between the proportion of clicks in the category in a certain year and the average usage in the other years.
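The two quantities that enter this comparison can be sketched as follows. The function names and the click counts are hypothetical, and the sketch stops at computing the proportions; it does not reproduce the t-test itself.

```python
from collections import Counter, defaultdict

def category_shares(clicks):
    """Share of clicks per category for each year.

    `clicks` is an iterable of (year, category) pairs, one per clicked article.
    """
    counts = defaultdict(Counter)
    for year, category in clicks:
        counts[year][category] += 1
    return {
        year: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for year, counter in counts.items()
    }

def share_vs_other_years(shares, year, category):
    """Share of `category` in `year` and its mean share over the other years."""
    others = [s.get(category, 0.0) for y, s in shares.items() if y != year]
    return shares[year].get(category, 0.0), sum(others) / len(others)

# Hypothetical click log (invented counts).
clicks = (
    [(2009, "USA Presidents")] * 3 + [(2009, "Countries")] * 1
    + [(2010, "USA Presidents")] * 1 + [(2010, "Countries")] * 3
    + [(2011, "Countries")] * 4
)
shares = category_shares(clicks)
print(share_vs_other_years(shares, 2009, "USA Presidents"))  # (0.75, 0.125)
```

A category absent from a given year counts as a share of zero for that year, which is a design choice of this sketch.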
The following table shows the subjects that had significantly different usage in the Wikispeedia game by year. While some subjects like "Design and Technology: Road Transport" in 2012 and "Mineralogy" in 2009 may not be as convincing, the popularity of the "USA Presidents" subject in 2009 could be attributed to the excitement of the 2008 election. However, it's important to remember that the trends in the Wikispeedia game may not always align with real-world events and occurrences, and the "USA Presidents" subject could simply be a commonly covered topic within the game itself.
Using our machine learning expertise, we set out to determine if there was a correlation between the time a path was taken and the categories played in it. We designed a binary random forest classifier that took in the categories of a path as input and attempted to classify whether the path was taken in 2011 or 2013, as these two years had a similar amount of data. Unfortunately, our classifier only achieved an accuracy of 0.54 on the test set, barely above the roughly 0.5 expected by chance for two classes of similar size, indicating that the model was unable to learn from the provided data. This suggests that the categories used in a path carry little information about the year in which the path was taken, i.e. that the path is not heavily influenced by when it was played, or that the effect is minimal.
What we learned in this analysis is that human performance is far from that of a computer, as players tend to take paths twice as long as the optimal ones. From the analysis part, we observed that even if humans are not as efficient as a computer, they learn to become better over time. The strategies players have found after some tries are:
- Click on more links
- Click on specific links
Indeed, as seen previously, they learned to click on 0.4 more links than when they started. From the spatial point of view, they recognize areas with a high density of links and use them better as they progress in the game.
Overall, players get much better at the game time-wise, with better time performance after a couple of plays. But they did not really show any progress in finding the best possible path. This can be explained by the fact that the goal of the game is to find a path in the shortest possible time, rather than to find the best path.
To go further: get better at playing Wikispeedia
Finally, now that we have learned which strategies the players from our dataset used to get better at Wikispeedia, here are some final hints on what you should do to beat the game:
The last topic we address is the 'HUBs', i.e. the pages either linking to the most pages or attracting the most players. The following graphs give a sample of the most important words of the dataset that has been presented. They bring together the players' common knowledge and some common interests, which is quite interesting.
Also, for the players who would like to get better at the game, it gives a sample of the best / worst words, which are sometimes counter-intuitive.
Once again, this gives an overview of the common knowledge and of the strategy used by the players. A lesson we could draw from those graphs might be to be careful when clicking on the United States ;).
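For the curious, the two notions of 'HUB' used above (pages linking to the most pages, and pages attracting the most players) can be sketched in a few lines. The graph, the paths and the function name `top_hubs` are made up for illustration; start and target pages are left out of the visit counts because they are imposed on the player.

```python
from collections import Counter

def top_hubs(links, paths, n=3):
    """Return the n pages with the most outgoing links and the n most visited pages."""
    out_degree = Counter({page: len(targets) for page, targets in links.items()})
    # Only intermediate pages count as visits chosen by the player.
    visits = Counter(page for path in paths for page in path[1:-1])
    return out_degree.most_common(n), visits.most_common(n)

# Toy link graph and played paths (hypothetical).
toy_links = {
    "United States": ["Canada", "Mexico", "English language", "President"],
    "Canada": ["United States", "English language"],
    "Mexico": ["United States"],
    "English language": [],
    "President": ["United States"],
}
toy_paths = [
    ["Mexico", "United States", "Canada"],
    ["President", "United States", "English language"],
    ["Mexico", "United States", "Canada", "English language"],
]
by_links, by_visits = top_hubs(toy_links, toy_paths, n=2)
print(by_links)   # [('United States', 4), ('Canada', 2)]
print(by_visits)  # [('United States', 3), ('Canada', 1)]
```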