Performance throughout the game overview/rating #388
Some discussion of this in #340 - it's a bit strange to suggest adding categories when they already exist though :)
You're right, of course!
On first impression, the 1D layout seems more easily readable than 2D. Some questions to aid iteration:

- Is it possible to create an overall performance score? E.g. if every move lost no points/was the AI top move this gives 100% accuracy, and if every move lost more than 12 points/was the AI worst move this gives 0%. This would provide extra insights: "did I win because I played well, or because my opponent played terribly?", "I lost even though my performance was good, so this loss isn't so bad".
- Is it possible/would it be useful to combine the points lost/AI rank into one overall metric? An overall score for each move would be more concise.

Overall it would be great to give the user control over the level of granularity: whether 2D or 1D, which stats are shown, and whether the moves are graded separately by points lost/AI top move or combined into one metric. I hope these thoughts help; I love this feature already and am excited to use it.
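The linear score suggested above could be sketched as follows. This is only an illustration of the idea, assuming the 0-point and 12-point thresholds from the comment; the function names are hypothetical, not KaTrain's actual API:

```python
def move_score(point_loss: float, max_loss: float = 12.0) -> float:
    """Map a move's point loss to a 0-100 score:
    0 loss -> 100, loss >= max_loss -> 0, linear in between."""
    clamped = min(max(point_loss, 0.0), max_loss)
    return 100.0 * (1.0 - clamped / max_loss)

def game_accuracy(point_losses: list[float]) -> float:
    """Average per-move score over one player's moves."""
    return sum(move_score(pl) for pl in point_losses) / len(point_losses)
```

A player whose moves lost 0, 6, and 12 points would score (100 + 50 + 0) / 3 = 50 under this scheme.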
The Chess.com site really did a nice job with their game report. However, they probably have 100+ developers working for them. Here's a simple version of their accuracy report. It would allow users to customize the category names in the Teaching/Analysis settings. Instead of a separate Game Report, this could also be just another tab, unless you're planning to add additional information in the future. The accuracy stat would be a weighted average of the categories. Ideally, this accuracy information would update as users moved through the game tree, not just at the end of the game.

Note that all chess apps and sites (that I've seen) use strictly board evaluations for computing mistakes, i.e. top move - actual move. There's no reporting done on how much a move improves the prior position, i.e. actual move - prior move. We've discussed this before.

P.S. I don't know how useful the 2D performance table would be. It seems more like a curiosity rather than helpful information. But, I guess it might show how well your intuition (policy) is working versus your calculation (tree search). The accuracy information seems more helpful.
move_complexity = sum(policy over candidates) - sum(policy over candidates with point loss <= 0.5), i.e. what policy % went to bad moves the AI thought were worth considering. The formulas aren't great yet, but I like the layout and fields.
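A minimal sketch of that complexity formula, assuming each candidate move comes with its policy probability and point loss (the data layout and names here are illustrative, not the actual KaTrain structures):

```python
def move_complexity(candidates: list[tuple[float, float]],
                    loss_threshold: float = 0.5) -> float:
    """candidates: list of (policy_prob, point_loss) for the AI's candidate moves.
    Complexity = total candidate policy mass minus the mass on 'good' candidates
    (point loss <= threshold), i.e. the policy % spent on tempting bad moves."""
    total = sum(p for p, _ in candidates)
    good = sum(p for p, loss in candidates if loss <= loss_threshold)
    return total - good
```

For example, candidates [(0.5, 0.0), (0.3, 2.0), (0.1, 0.4)] give a complexity of 0.3: the network put 30% of its policy on a move that loses 2 points.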
^ this looks great! Being able to focus on one stage of the game is an excellent idea.
Sander, I like it! Much improved over my version. :-) I'm working on trying to understand your formulas. What is the cutoff point for the AI candidates, or is this determined by max_visits? Wouldn't higher visits skew the complexity rate upwards (more poor moves searched)? The accuracy formula seems reasonable. I want to research how some of the chess apps do it. I think the colors on the Teaching/Analysis settings should be re-ordered to match this for consistency. Good stuff!
It's using all candidate moves returned by the AI; this is of course influenced by visits, root noise, etc. I am not convinced this is the best approach, but it's the first thing that kind of did something reasonable.
I did some quick research on Go and Chess apps, and the only one that seems to calculate a game accuracy is Chess.com. They call theirs "Computer Aggregated Precision Score" or CAPS. It's a proprietary model that incorporates game mistakes and other "pattern of strength" algorithms. In other words, it is a black box. There's some controversy in the forums about how well it works. Apparently, the statistic can vary widely over games, and it does not give great predictive power into the rank of a player.

As for complexity, I think your idea has merit. I spent some time studying L&D problems earlier trying to understand why some were more complex than others. The number of reasonable-looking branches in the search tree has mostly to do with it. Whether you can tease this information out of differences between policy priors and search results will be interesting.

This feature may take lots of thought and testing. I vote to roll out something simple and get feedback on it. Maybe create a beta version that we can do some testing on.
I generally don't hide things, it's in branch and anyone can test it. Releasing is a lot of work though, and the last time I released for feedback I got zero comments, soooo |
testing another weighting in 3adde54 |
Want to test this a bit more properly. If someone could help collect a nice test set, that would be appreciated: a variety of around 50-100 SGF games from 15k to 7d, 19x19, with at least 200-250 moves played. They should have the BR and WR fields set (as in e.g. OGS).
I'll commit to scraping 50 games from OGS spread across 15k to 7d. Does it matter if they're even or handicap? |
Shouldn't matter. The idea is to see the numbers by player rank more systematically |
Here are 30 OGS games ranging from 9k to 4d.
10 OGS games 1d-4d.zip
Code as in dde545b (weighted by complexity ~ expected point loss if playing candidates with p=policy). Data: https://pastebin.com/k44TYjY9. Accuracy seems OK, a bit weird on 2 outliers. Complexity is a bit all over the place; may just remove it.
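The "expected point loss if playing candidates with p=policy" weighting could be sketched as below. This is a hypothetical reconstruction of the idea, not the code in that commit:

```python
def expected_policy_loss(candidates: list[tuple[float, float]]) -> float:
    """candidates: list of (policy_prob, point_loss).
    Expected point loss if the move were sampled with probability
    proportional to policy, renormalized over the returned candidates."""
    total_p = sum(p for p, _ in candidates)
    if total_p == 0:
        return 0.0
    return sum(p * loss for p, loss in candidates) / total_p
```

So a position where the policy is split 50/50 between a 0-point and a 2-point-loss move has an expected loss of 1.0, making it "harder" than one where nearly all policy mass sits on good moves.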
20 more OGS games |
Best result for now, as of 6a71266.
Keeping this unless there are any bright ideas.
The accuracy stat r^2 is looking pretty good. The complexity stat will need some more thinking.
Agree that 'ai approved' is a bit awkward. If you had category labels for point loss, you could use the label. Not sure you need to limit it to top 5; just the point-loss range seems good enough. Let me know if you need more games for testing.
Will probably just kill the complexity stat.
The idea is that top 1 is very network dependent, and no limit is very visits dependent; this should be less so (as seen by KataGo selfplay games ending up at near 100%).
I think this is a very nice data set. If you can figure out what's up with the 1-2 outliers, that might help though!
In the first outlier game (sunny25 vs lyq), both players missed the killing/saving of a group for many moves (163-202). This resulted in 20+ point loss swings for a significant portion of the game. In the second outlier game (silent1 vs sunny25), both players missed a severe cut for many moves (49-126). Then, they missed a double sente endgame sequence for many moves (131-187).
Also, if someone has or can make a texture that helps make the bars look like bars, that would be nice! (As in a transparency mask-only texture.)
This version of the panel seems a little cluttered. I'm also not sure how useful it is to see the proportion of points lost between the players; I would personally be interested mostly in the amount/% of my moves that fall into each category. If a user is that interested in the proportion between the players, they can compare the numbers.
^^^ Agree. I think Sander has the cleanest layout. You don’t want to make it more complicated than this. The X-axis scale for each item can usually be inferred from the label, which is good. I think showing the bars for all move classes is fine to achieve consistency. Except, I don’t understand the X-axis scale used in Sander’s version (what are the blunder lengths supposed to be?). I would think this should be the % of the time you played that move class within the game.
^ This looks great to me. |
Are you arguing mostly about content or format? If content, then I agree that showing the % of moves in each category is best. The format could either be as a pie chart, a stacked chart (like yours), or a bar chart (like Sander's). (Although, I'm not sure what Sander was trying to show in his mockup :-) As for format, a simple bar chart like Sander's is good enough for me, and it matches the style of the top section. But, I'd be Ok with either. |
I prefer Sander's latest iteration; a microsecond glance - "I got mostly green - yippee!" (Coloring the bars makes the data leap out at you.) In the same vein, I would also color the bars in the Key Statistics section; a nice blue would look good. |
@Dontbtme sure that looks better, but keep in mind the whole thing is a single grid layout of labels, and the line is the bottom of the header cell. Give it a try and you'll see how difficult simple things can be in kivy ;) |
Still, as is, any bar looks big, which is what I meant. Can't you limit the colors in the middle to around >0.5 etc. without changing the grid? Since the colors in the left and right columns only fill them in proportion to the %, why do the colors in the middle column have to fill it entirely? The colors in the middle are what pops out the most in your picture, when we should be focusing on the colors in the players' bars. I would even rather have no colors in the middle column if that's too complicated; that way, the colors for each mistake category would appear clearly and brightly in each player's column.
Closing this as it's soon released, but feel free to continue discussion. |
Hello, many thanks for this amazing trainer! Feature suggestion: I imagine each move being placed into categories (e.g. blunder, mistake, inaccuracy, okay, excellent, best move) based on the percentage change in winrate it causes. The percentage ranges for these categories could be user-definable.
An overall "accuracy" score out of 100 could then be generated for each player, based on the percentage of their moves the engine rates as best. These ideas are inspired by the analysis features of chess.com, which give an overall insight into the players' performance in a game; this would supplement the analysis of each individual move.
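The category binning could be sketched as below. The thresholds here are hypothetical placeholders (and use point loss rather than the winrate change mentioned above); the suggestion is that these bounds would be user-configurable:

```python
# Hypothetical (category, upper point-loss bound) pairs, in ascending order.
# In practice these would come from user-definable settings.
CATEGORIES = [
    ("best move", 0.0),
    ("excellent", 0.5),
    ("okay", 1.5),
    ("inaccuracy", 3.0),
    ("mistake", 6.0),
    ("blunder", float("inf")),
]

def categorize(point_loss: float) -> str:
    """Return the first category whose upper bound covers the move's loss."""
    for name, upper in CATEGORIES:
        if point_loss <= upper:
            return name
    return CATEGORIES[-1][0]  # unreachable with an inf-bounded last category
```

A move losing exactly 0 points would land in "best move", a 4-point loss in "mistake", and anything above 6 points in "blunder".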
Thanks again for your work, I'd love to hear your thoughts.