Use SQLite database and get data from current year #67

Merged
merged 15 commits into from
Feb 21, 2023

nkgilley
Contributor

No description provided.

@kyleskom
Owner

There's a lot going on here. Can you describe the changes in more detail?

@nkgilley
Contributor Author

This PR creates a new dataset which includes training data from the current season. I added a script to get the odds data from previous games this season (src/Process-Data/Get_Odds_Data.py). I also ran Get_Data and Create_Games to get a new DataSet file. I then reran the XG_Boost_Model scripts and saved the best runs in Models/XGBoost_Models which I now reference in src/Predict/XGBoost_Runner.py.

@nkgilley nkgilley marked this pull request as draft January 27, 2023 19:02
@nkgilley nkgilley changed the title from "get games from current year" to "Use SQLite database and get data from current year" Jan 27, 2023
@nkgilley
Contributor Author

nkgilley commented Jan 28, 2023

I'm not able to upload the sqlite database as it is too large for github (101 MB). A workaround would be to enable LFS (https://git-lfs.com/). It needs to be enabled on the parent repo for me to be able to add the file to my fork.

@kyleskom
Owner

Give it a try now

@nkgilley
Contributor Author

@kyleskom Thanks... It looks like I was wrong, though. GitHub doesn't let you upload LFS objects to forks. I keep getting this error:

git push --force
batch response: @nkgilley can not upload new objects to public fork nkgilley/NBA-Machine-Learning-Sports-Betting
error: failed to push some refs to 'github.com:nkgilley/NBA-Machine-Learning-Sports-Betting.git'

I was able to upload it to a non-forked repo:
https://github.com/nkgilley/NBA-ML/blob/2022-23/Data/db.sqlite

It's definitely faster processing using the sqlite db, but this github limitation is frustrating.

@kyleskom
Owner

What exactly did you store in the DB? 101 MB seems very high.

@nkgilley
Contributor Author

Basically everything that's currently in excel files. I could split it into multiple database files to lower the size.

@kyleskom
Owner

First off, this has been one of the biggest things I've wanted to do with this project, so thank you. I was thinking of putting all the team data in one database and the full games with everything in another. What do you think? Also, you have been super helpful with all this and are doing amazing work. If you're interested, we should hop on a call; I'd be interested in taking this project a step further with your help.

@nkgilley nkgilley marked this pull request as ready for review January 28, 2023 23:33
Owner

@kyleskom kyleskom left a comment

If we are moving to a DB, do we need the excel files?

</tbody>
</table>
</td>
{% endif %}
Owner

@kyleskom kyleskom Jan 29, 2023

Can you explain what you are doing here? I think it would be best to split this PR. The major change is the move to a database, and I think we should keep that clean by separating the other changes out.

EDIT: I meant the changes to index.html in general, not just the highlighted lines.

Contributor Author

It's just whitespace changes and a change of a classname that doesn't affect anything.


# season_array = ["2007-08", "2008-09", "2009-10", "2010-11", "2011-12", "2012-13", "2013-14", "2014-15", "2015-16",
# "2016-17", "2017-18", "2018-19", "2019-20", "2020-21", "2021-22", "2022-23"]
season_array = ["2015-16", "2016-17", "2017-18", "2018-19", "2019-20", "2020-21", "2021-22", "2022-23"]
Owner

Can you explain why you started with 2015?

Contributor Author

I did this because I remembered in your notes this comment:

we achieved the highest levels of validation accuracy when the training dataset started from the 2012 − 2013 season

I'll change it to go from 2012-13 instead of 2015-16.
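
For illustration only: rather than hand-editing the `season_array` list, the season labels could be generated from a start year. This is a minimal sketch; `season_labels` is a hypothetical helper, not something in the repository.

```python
def season_labels(first_start_year, last_start_year):
    """Build labels such as "2012-13" for each season start year, inclusive."""
    return [
        f"{year}-{(year + 1) % 100:02d}"
        for year in range(first_start_year, last_start_year + 1)
    ]


# Seasons from 2012-13 through 2022-23, matching the range discussed above.
season_array = season_labels(2012, 2022)
```

Changing the first argument is then the only edit needed to move the training window.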

name = directory2 + '/' + '{}-{}-{}'.format(str(int(x[1])), str(int(x[2])), season1) + '.xlsx'
general_df.to_excel(name)
except:
if month1 == 10 and day1 < 19:
Owner

Some seasons didn't start in month 10 (October). I think the old way may have been easier to run, though I still think this could be better.

Contributor Author

I was getting errors with the original code so I made some changes. Didn't realize that some seasons didn't start in October. I'll take a closer look at this.

Contributor Author

This code still works fine for seasons that don't start with 10 as far as I can tell. Let me know if I'm missing something
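
One way to sidestep the month/day special-casing discussed here is to walk real calendar dates between a per-season start and end. A minimal sketch, assuming a hand-maintained `SEASON_WINDOWS` mapping (a hypothetical name; the two windows shown are illustrative and should be checked against the actual schedule):

```python
from datetime import date, timedelta

# Assumed season boundaries, for illustration; verify before relying on them.
SEASON_WINDOWS = {
    "2019-20": (date(2019, 10, 22), date(2020, 10, 11)),
    "2020-21": (date(2020, 12, 22), date(2021, 7, 20)),
}


def season_dates(season):
    """Yield every calendar date in the season window, both ends inclusive."""
    start, end = SEASON_WINDOWS[season]
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


dates_2020_21 = list(season_dates("2020-21"))
```

Because the loop uses `date` arithmetic, invalid days such as February 30 never occur, and a season that starts in December needs no extra branch.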

Comment on lines +1 to +23
import random
import time
import pandas as pd
import sqlite3

from datetime import datetime
from tqdm import tqdm
from sbrscrape import Scoreboard

year = [2022, 2023]
season = ["2022-23"]

month = [10, 11, 12, 1, 2, 3, 4, 5, 6]
days = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

begin_year_pointer = year[0]
end_year_pointer = year[0]
count = 0
year_count = 0

sportsbook='fanduel'
df_data = []

Owner

Any pros / cons to moving to this for odds? I always grabbed them from here
https://www.sportsbookreviewsonline.com/scoresoddsarchives/nba/nbaoddsarchives.htm

Copy link

Any pros / cons to moving to this for odds? I always grabbed them from here https://www.sportsbookreviewsonline.com/scoresoddsarchives/nba/nbaoddsarchives.htm

I don't know if it is relevant but the website you linked no longer provides odds after January 16th, 2023. If you need odds for the 2022-23 season, using the sbrscrape library is the only option.

Owner

They will update it soon. I think they do it in batches.

Owner

Upon further review you are right. I see their note.

Comment on lines 1 to 22
import os
import sqlite3
import pandas as pd
import sys
from tqdm import tqdm
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from Utils.Dictionaries import team_codes

directory = os.fsdecode('../../Odds-Data/Odds-Data-Clean')
con = sqlite3.connect("../../Data/odds.sqlite")

for file in tqdm(os.listdir(directory)):
    filename = os.fsdecode(file)

    try:
        df = pd.read_excel(f"../../Odds-Data/Odds-Data-Clean/{file}")  # create DataFrame by reading Excel
        df.to_sql(f"odds_{file[:-5]}", con, if_exists="replace")

    except Exception as e:
        print(e)

con.close()
Owner

Can we skip this and just go right from the get_data script to DB?

Contributor Author

Yeah, I don't see any need for this script in the future. I just used it to move the data that was already scraped so I didn't have to re-gather all of it.
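
As a rough idea of what going straight from scraped rows to the database could look like without the Excel step, here is a minimal sketch using only the standard-library `sqlite3` module. The `save_odds` helper, the column names, and the sample row are all assumptions for illustration, not the PR's actual schema.

```python
import sqlite3


def save_odds(con, season, rows):
    """Insert (date, home, away, over/under) tuples into a per-season table."""
    # The table name is quoted because a label like 2022-23 contains a hyphen.
    # `season` must come from trusted code, since it is interpolated into SQL.
    table = f'"odds_{season}"'
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(Date TEXT, Home TEXT, Away TEXT, OU REAL)"
    )
    con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Placeholder row and an in-memory database, purely for demonstration.
con = sqlite3.connect(":memory:")
row_count = save_odds(con, "2022-23", [("2023-01-01", "Team A", "Team B", 220.5)])
con.close()
```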

@nkgilley
Contributor Author

If we are moving to a DB, do we need the excel files?

Nope, I'll remove those.

@nkgilley
Contributor Author

I removed all the excel files. I'm training the model again using seasons 2012-13 to current. I'll add new json files when it's done.

@kyleskom
Owner

Gonna need a little time to review and test all this.

@kyleskom
Owner

Also, not sure if you saw my previous message. I'd be interested in having a Zoom call to discuss this project further, along with some ideas I think you could be a great help with. Let me know if that interests you.

@nkgilley
Contributor Author

Hey, sorry I've got limited availability to work on this. I don't think I'll have time for a zoom call, but please post your thoughts here and if I find time I may be able to help out further.

@kyleskom
Owner

Sorry for the delay; I've been super busy but I'm still interested in this. Will review as soon as I can.

You have been putting in really great work on this, and I have always wanted to take this project to the next level: maybe a web app, or even a paid product. Along with that, the major thing this is missing is player data. Those are the things I was looking to discuss with you.

@kyleskom
Owner

Fixed dropped columns with #85

@nkgilley
Contributor Author

Just pushed some new updates that now take into account the days of rest for each team. I'm not sure that I did this part right, so please double-check. I've got a good grasp on the Python, but the TensorFlow stuff is all new to me.

Those ideas are definitely interesting, ping me here to discuss more: email/gchat: nkgilley@gmail.com

@kyleskom
Owner

Thanks for adding this; it has always been a major want for me. I see one major issue: we had those rest-days columns in when training, but we don't have them when using live data for daily predictions.
Example:
Before this, we would train with data abc and predict using data abc. Now we are training with abcx but still predicting with abc.

Hopefully that made sense. Also, it might be better to separate that out from the move of data to a database. That makes it easier for me to review and test, and lowers the chance of errors when merging. But again, I really appreciate the work.
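
The "train with abcx, predict with abc" concern can be caught mechanically by comparing the two column lists before predicting. A minimal sketch; `check_feature_parity` is a hypothetical helper and the column names are placeholders.

```python
def check_feature_parity(train_columns, predict_columns):
    """Raise ValueError unless both column lists match in content and order."""
    train, predict = list(train_columns), list(predict_columns)
    missing = [c for c in train if c not in predict]
    unexpected = [c for c in predict if c not in train]
    if missing or unexpected:
        raise ValueError(
            f"feature mismatch: missing={missing}, unexpected={unexpected}"
        )
    if train != predict:
        raise ValueError("same features, but in a different order")


# Matching columns pass silently.
check_feature_parity(["a", "b", "c"], ["a", "b", "c"])

# Training saw an extra column "x" that live data lacks, so this raises.
try:
    check_feature_parity(["a", "b", "c", "x"], ["a", "b", "c"])
except ValueError as err:
    parity_error = str(err)
```

The order check is included because models fed positional arrays generally depend on column order as well as column names.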

@nkgilley
Contributor Author

I thought I was providing that data during the daily predictions. See lines 59-60 of main.py:

        stats['Games-Rested-Away'] = away_days_off.days
        stats['Games-Rested-Home'] = home_days_off.days

You may not have seen these changes because GitHub only shows the first 3000 changed files. I could create a new PR that doesn't delete the Excel files, which would be easier to review. We could then delete the Excel files later if we decide to proceed with the SQLite files.
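
The `.days` attribute in the two lines quoted above suggests `away_days_off` and `home_days_off` are `timedelta` values. A minimal sketch of how such a value might be computed from a team's earlier game dates follows; the `days_off` helper and its seven-day default for a team with no prior game are assumptions, not the PR's code.

```python
from datetime import date, timedelta


def days_off(game_day, previous_games, default=timedelta(days=7)):
    """Return the gap between game_day and the team's most recent earlier game."""
    earlier = [d for d in previous_games if d < game_day]
    if not earlier:
        # No earlier game on record (e.g. season opener): use an arbitrary default.
        return default
    return game_day - max(earlier)


rest = days_off(date(2023, 2, 10), [date(2023, 2, 5), date(2023, 2, 8)])
opener_rest = days_off(date(2022, 10, 18), [])
```

Whatever rule is chosen, the same function would need to be used for both the training data and the daily predictions so the two stay consistent.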

@kyleskom
Owner

@nkgilley Yeah, let's do this. First, let's split the SQLite work from the rest-days work. Let's also keep the Excel files for now and remove them afterwards to keep this PR manageable.

@nkgilley nkgilley force-pushed the 2022-23 branch 2 times, most recently from f4b3cb9 to 62c46ed Compare February 11, 2023 20:41
@nkgilley
Contributor Author

Should be good now. I removed the days rest columns for now

@nickmalbsn

I'm new to coding. How can I add this to my fork?

@nkgilley
Contributor Author

I'm new to coding. How can I add this to my fork?

git remote add nkgilley git@github.com:nkgilley/NBA-Machine-Learning-Sports-Betting.git
git fetch nkgilley
git checkout nkgilley/2022-23

@kyleskom
Owner

Really sorry I still haven't gotten to this. I've just been so busy with work, personal stuff, and the small issues that pop up in this repo. I am still trying to get to this.

@kyleskom
Owner

Do you have a recommended tool to view SQLite databases locally? The one I have is terribly slow and just overall horrible.

'Unnamed: 0': 0,
'Date': f"{season1}-{month1:02}{day1:02}",
'Home': game['home_team'],
'Home': game['home_team'],
Owner

Is duplicate 'Home' key here expected?
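
For context on why a duplicate key like this is easy to miss: Python accepts it without any error and silently keeps the last value. The sketch below shows that behavior and a small `ast`-based check; `duplicate_dict_keys` is a hypothetical helper, and the snippet it scans is a made-up stand-in for the lines under review.

```python
import ast


def duplicate_dict_keys(source):
    """Return constant keys that appear more than once in any dict literal."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                # `**mapping` entries have key None; non-constant keys are skipped.
                if isinstance(key, ast.Constant):
                    if key.value in seen:
                        duplicates.append(key.value)
                    seen.add(key.value)
    return duplicates


snippet = "row = {'Date': '2023-01-01', 'Home': 'A', 'Home': 'B'}"
found = duplicate_dict_keys(snippet)

# At runtime the later value wins and no exception is raised.
row = {'Date': '2023-01-01', 'Home': 'A', 'Home': 'B'}
```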

@kyleskom
Owner

Going through part by part right now. In the get-odds-data script, it will fail if there is no odds data for that day; we just need to add a catch for that.
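
A guard for the no-odds case might look like the following minimal sketch. It assumes each game is a dict with `home_team` and `away_team` keys and a `total` mapping keyed by sportsbook name; that shape, and the `rows_for_day` helper itself, are assumptions for illustration rather than a description of the scraper's actual output.

```python
def rows_for_day(games, sportsbook="fanduel"):
    """Build output rows, skipping days or games with no line from the book."""
    rows = []
    for game in games or []:  # treats None (no data for the day) as an empty list
        total = (game.get("total") or {}).get(sportsbook)
        if total is None:
            continue  # this book posted nothing for the game: skip, don't raise
        rows.append(
            {"Home": game["home_team"], "Away": game["away_team"], "OU": total}
        )
    return rows


# Placeholder inputs: a day with no data, and a game missing the book's line.
empty_day = rows_for_day(None)
missing_line = rows_for_day([{"home_team": "A", "away_team": "B", "total": {}}])
```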

@kyleskom
Owner

Tested everything else up to creating the full dataset and everything looks awesome. Just those few small fixes. I think after this let's do a clean up of old data.

You also had the days-rest work, which I'd love to add too.

I'd also still like to reach out about a few other things.

@kyleskom
Owner

One more thing: can you also make the DB dataset available for training the NN models?

@nkgilley
Contributor Author

Do you have a recommended tool to view SQLite databases locally? The one I have is terribly slow and just overall horrible.

I've been using DB Browser for SQLite
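
For a quick look without a GUI, the standard-library `sqlite3` module can list tables and row counts. A minimal sketch; `table_row_counts` is a hypothetical helper, and the in-memory database with a `demo` table stands in for a real file such as the one under `Data/`.

```python
import sqlite3


def table_row_counts(con):
    """Map each table name in the database to its number of rows."""
    names = [
        row[0]
        for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Names are double-quoted so tables with hyphens in them still work.
    return {
        name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Demonstration on a throwaway in-memory database.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "demo" (x INTEGER)')
con.executemany('INSERT INTO "demo" VALUES (?)', [(1,), (2,), (3,)])
counts = table_row_counts(con)
con.close()
```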

@nkgilley
Contributor Author

Fixed a few issues and updated the models. I think it should be good, but it should be tested with real games.

I'll create a new PR for the addition of the days rested column. I need to do some more testing on that.

@kyleskom kyleskom merged commit 0339624 into kyleskom:master Feb 21, 2023
4 participants