can deepFM use sparse data format? #10

Open · sddi opened this issue Dec 26, 2017 · 10 comments
sddi commented Dec 26, 2017

I tried using deepFM.py with the sparse dataset a8a.train, whose format is "label index:value index:value ...".
I see that in S1_4.txt a feature still appears on the line when its value is 0, but in a8a.train zero-valued features are omitted.
When I run python deepFM.py, I get "Input to reshape is a tensor with 5528 values, but the requested shape requires a multiple of 672".
Does the code not support this format?
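
That error is what TensorFlow raises when a flattened tensor cannot be evenly split into fixed-width rows. A minimal sketch that reproduces the same failure mode (the factorization of 672 into field count and embedding size is a guess, not taken from deepFM.py):

```python
import tensorflow as tf

# Hypothetical sizes: 672 could be field_num * embed_dim, e.g. 21 * 32.
# 5528 is not a multiple of 672, so this reshape fails with the same
# "requested shape requires a multiple of 672" error.
flat = tf.random.normal([5528])
rows = tf.reshape(flat, [-1, 672])  # raises tf.errors.InvalidArgumentError
```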

sddi changed the title from "can deepFM using sparse data format?" to "can deepFM use sparse data format?" on Dec 28, 2017
Leavingseason (Owner) commented

Hi sddi,
deepFM reads sparse data as input, but note that each instance must have exactly the same number of features, which is the "field number" in the paper. So if a field is empty or missing, you should append a fake zero-valued feature for it.
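
A minimal preprocessing sketch of that padding rule, assuming one feature per field; the per-field index ranges and the choice of each field's first index as the fake feature are illustrative, not taken from the repo:

```python
# Illustrative field layout: feature indices 0-100 belong to field 0,
# 101-1000 to field 1 (matching the userID/itemID example later in this thread).
FIELD_RANGES = [(0, 100), (101, 1000)]

def pad_fields(libsvm_line):
    """Rewrite a libsvm-style line so every field contributes one index:value pair."""
    parts = libsvm_line.split()
    label, pairs = parts[0], parts[1:]
    present = {}
    for pair in pairs:
        idx, val = pair.split(':')
        idx = int(idx)
        for f, (lo, hi) in enumerate(FIELD_RANGES):
            if lo <= idx <= hi:
                present[f] = (idx, val)  # keeps the last feature seen per field
    out = [label]
    for f, (lo, _) in enumerate(FIELD_RANGES):
        idx, val = present.get(f, (lo, '0'))  # fake zero value for empty fields
        out.append(f'{idx}:{val}')
    return ' '.join(out)

print(pad_fields('1 36:1'))  # -> '1 36:1 101:0'
```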


sddi commented Jan 5, 2018

Hi @Leavingseason, thank you for answering. I have millions of features; if I append a fake zero value for each one, the input file could become very large. Could you update the code to support a libsvm-like input format (index:value, where zero-valued features are omitted from the file)?

Leavingseason (Owner) commented

Hi sddi,
How many fields of features (not the number of features) do you have? Actually we do not require appending a zero value for every feature; instead, for each field, if there is no feature under it, we append one fake zero value. The deepFM model uses field-wise dense embeddings as the input to the deep neural network, so the number of fields cannot be too large.
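
A rough sketch of why the field count bounds the width of the DNN input (all sizes here are illustrative):

```python
import tensorflow as tf

# With exactly one feature index per field, the DNN input is the
# concatenation of the per-field embeddings, so its width is
# field_num * embed_dim and grows linearly with the number of fields.
num_features, field_num, embed_dim = 1001, 2, 8
embeddings = tf.Variable(tf.random.normal([num_features, embed_dim]))

feat_idx = tf.constant([[36, 108]])                       # [batch, field_num]
field_emb = tf.nn.embedding_lookup(embeddings, feat_idx)  # [batch, field_num, embed_dim]
dnn_input = tf.reshape(field_emb, [-1, field_num * embed_dim])  # [batch, 16]
```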


sddi commented Jan 8, 2018

Oh, I see, @Leavingseason!
For instance, I have two fields of features: userID features (indices from 0 to 100) and itemID features (indices from 101 to 1000). For one sample, the line in the input file might be "1 36:1 108:1 123:1 365:1". Is that OK?

Leavingseason (Owner) commented

That's partially right. Currently my code supports at most one feature per field, which follows the original paper's framework, so for the itemID field you can keep only one itemID. I understand your concern; in the real world, multiple features under one field happen a lot. We have a corresponding version of the code that handles this case by leveraging sparse embedding lookup (https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup_sparse), and the input format becomes fieldID:featureID:value. We will consider releasing this version.
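
A hedged sketch of how tf.nn.embedding_lookup_sparse handles multiple features under one field; the batch construction is an assumption about how the fieldID:featureID:value format could be consumed, not the released code:

```python
import tensorflow as tf

num_features, embed_dim = 1001, 8
embeddings = tf.Variable(tf.random.normal([num_features, embed_dim]))

# One SparseTensor per field: rows are instances, columns enumerate the
# features active under that field. Here, field 2 of a single instance
# (e.g. "1 ... 2:108:1 2:123:1") holds two features.
sp_ids = tf.sparse.SparseTensor(indices=[[0, 0], [0, 1]],
                                values=tf.constant([108, 123], tf.int64),
                                dense_shape=[1, 2])
sp_weights = tf.sparse.SparseTensor(indices=[[0, 0], [0, 1]],
                                    values=tf.constant([1.0, 1.0]),
                                    dense_shape=[1, 2])

# Pools (here: sums) the embeddings of all features under the field,
# yielding one dense vector per instance regardless of feature count.
field_emb = tf.nn.embedding_lookup_sparse(embeddings, sp_ids, sp_weights,
                                          combiner='sum')  # [1, embed_dim]
```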


sddi commented Jan 10, 2018

OK, thank you very much! I am looking forward to your new version! :D

waitingc commented

Has the version that supports "multiple features under one field" been released? Thanks.

Leavingseason (Owner) commented

Not yet. All right, since some people are interested in this version, I will release a preview of the code, which is currently very ugly. I will try to find some time within two days (it is so sad that the KDD deadline is near...).

Leavingseason (Owner) commented

Done.


CheungZeeCn commented May 10, 2018

@Leavingseason hello, a question about the format: in fieldID:featureID:value, if fieldID==1 has 3 featureIDs and fieldID==2 has 2 featureIDs, does the encoding of the featureIDs under fieldID==2 need to continue from the featureIDs of fieldID==1? For example:

0 1:1:1 1:2:1 1:3:1 2:1:1 # here the featureIDs under fieldID==2 may be re-encoded (restart from 1)
0 1:1:1 1:2:1 1:3:1 2:4:1 # here the featureIDs under fieldID==2 may not be re-encoded and must continue from field 1's numbering. Thanks!
