In-Database Machine Learning
A proof-of-concept approach on an entity resolution task, with Tensorflow for the machine learning part and everything happening inside a MonetDB instance. This project was part of the course Large Scale Data Management at the MSc in Data Science of AUEB, carried out during the Winter Quarter 2020-21.
In other words, a MonetDB database instance that performs machine learning and de-duplicates products that are essentially the same, in its own right.
The task consists of identifying which product specifications (in short, specs) from multiple e-commerce websites represent the same real-world product.
example product spec
{
"<page title>": "Samsung Smart WB50F Digital Camera White Price in India",
"brand": "Samsung",
"dimension": "101 x 68 x 27.1 mm",
"display": "LCD 3 Inches",
"pixels": "Optical Sensor Resolution (in MegaPixel)\n16.2 MP",
"battery": "Li-Ion"
}
We are provided with a dataset including ~30k specs in JSON format, each spec containing a list of (attribute_name, attribute_value) pairs extracted from a different web page, collected across 24 different web sources.
Each spec is stored as a file, and files are organized into directories, each directory corresponding to a different web source (e.g., www.alibaba.com). All specs refer to cameras and include information about the camera model (e.g. Canon EOS 5D Mark II) and, possibly, accessories (e.g. lens kit, bag, tripod). Accessories do not contribute to product identification: for instance, a Canon EOS 5D Mark II that is sold as a bundle with a bag represents the same core product as a Canon EOS 5D Mark II that is sold alone.
For a more detailed view of the task, the datasets, the evaluation process, see the Other References section below.
- Debian-based distribution (tested against Ubuntu 18.04.5 and Pop!_OS 20.04)
- MonetDB with embedded Python3 support (tested against v11.39.11, v11.41.5)
The official documentation procedure should be followed, in order
for your deb-based distribution to be able to install the MonetDB via the apt
package manager.
To be precise, everything up until the sudo apt update
command is sufficient. From there on, we will diverge a bit in order
to get a MonetDB installation with embedded Python3 support, with the
following commands:
# root permission may be needed here
apt install monetdb-python3
apt install monetdb-client
# create a db farm, preferably at this project's root dir
monetdbd create ecommercedbfarm
monetdbd start ecommercedbfarm
# create a database, with python3 support enabled
monetdb create ecommercedb
monetdb set embedpy3=true ecommercedb
monetdb release ecommercedb
Now, to run a simple health check:
mclient -u monetdb -d ecommercedb # password is the default (<monetdb>)
and run the following:
CREATE OR REPLACE FUNCTION python_healthcheck ()
RETURNS STRING
LANGUAGE python {
import sys
return sys.version
};
SELECT python_healthcheck();
this should display your system-wide Python version.
Make sure that the dataset is in extracted form. For example, you can use:
tar xvzf data/camera_specs.tar.gz -C ./data
All the necessary SQL statements (UDFs etc.) are available in the entity-resolution-with-monetdb/sql/
directory,
with each independent component hosted in a separate .sql
file.
NOTE: Also, for convenience, there is the entity-resolution-with-monetdb/main.sql
available, which contains all the SQL statements together
and can be used for a quicker build.
Simply cast the SQL statements, preferably in the order below, in the mclient
shell:
- sql/cameras_loader.sql
- sql/specs_loader.sql
- sql/cameras_bulk_insert.sql
- sql/cameras_constraints.sql
- sql/specs_bulk_insert.sql
- sql/specs_constraints.sql
- sql/labels_create.sql
- sql/labels_bulk_insert.sql
- sql/labels_constraints.sql
By this point, you should have a first, working database schema.
We would like to block together cameras of the same brand. This will help us restrict the "potential matching" space, as cameras that belong to different blocks should not, in principle, refer to the same camera!
To implement the blocking step, we will need the following:
- sql/text_utils.sql
- sql/brands_create.sql
- sql/cameras_add_brand_id.sql
- sql/brands_bulk_insert.sql
- sql/blocking.sql
After block formation, we would also like to filter out "easy" matches, by extracting-via-heuristics and comparing the models of the cameras. We would thus end up - for each block - with a subset of cameras that match both on brand and model (and basically refer to the same camera).
To implement the filtering step, we will need the following:
- sql/filtering.sql
What we have gained from blocking & filtering is that we now only have to work (and eventually perform pair-wise comparisons) with the camera subsets that remained unmatched within blocks.
Before we proceed, let's make sure that Tensorflow is installed. Running the
following statement in the mclient
:
CREATE OR REPLACE FUNCTION tensorflow_healthcheck ()
RETURNS STRING
LANGUAGE python {
import tensorflow as tf
return tf.version.VERSION
};
SELECT tensorflow_healthcheck();
should display the Tensorflow version. In case the error No module named 'tensorflow'
emerges, simply install the library:
python3 -m pip install tensorflow
and perform again the health check through the mclient
.
Now, onto the "matching" step.
A first crucial observation is that the labelled dataset is transitively closed (i.e., if A matches with B and B matches with C, then A matches with C).
-- # work in progress 🚧
- End-to-End Entity Resolution for Big Data: A Survey
- Evaluation of entity resolution approaches on real-world match problems
- A Survey of Blocking and Filtering Techniques for Entity Resolution
- IDEL: In-Database Entity Linking with Neural Embeddings
- Deep Integration of Machine Learning Into Column Stores
- Deep Learning for Entity Matching: A Design Space Exploration
- Vectorized UDFs in Column-Stores
- devUDF: Increasing UDF development efficiency through IDE
- Don’t Keep My UDFs Hostage - Exporting UDFs For Debugging Purposes