WordE4MDE - Word Embeddings for MDE
In natural language processing, word embeddings are a foundational tool. They transform words into numeric vectors that capture semantic meaning, allowing machine learning models to process language in a way that reflects relationships and context. Common embeddings such as Word2Vec and GloVe are trained on general corpora like news articles or Wikipedia. However, the vocabularies of these models do not cover MDE-specific terms. As a concrete example, the term statechart does not belong to the vocabulary of the GloVe embeddings, so we cannot compute its representation. Along the same lines, the term state does belong to the GloVe vocabulary, but the three most similar terms according to this embedding are federal, states, and government, which does not reflect the intended semantics in the MDE domain.
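To see this in practice, here is a minimal sketch using gensim's downloader with a general-purpose GloVe model. The specific pretrained model (glove-wiki-gigaword-100) is chosen here purely for illustration and is not necessarily the one used in our experiments.

import gensim.downloader as api

# Load a general-purpose GloVe model (an illustrative choice).
glove = api.load('glove-wiki-gigaword-100')

# Check whether an MDE-specific term is part of the general-purpose vocabulary.
print('statechart' in glove.key_to_index)

# Inspect the nearest neighbours of 'state'; with general-purpose embeddings
# these tend to be political terms (federal, states, government).
print(glove.most_similar('state', topn=3))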
To address this gap, we have built WordE4MDE, a collection of word embeddings specifically trained on MDE-focused texts. The goal of the WordE4MDE embeddings is to capture the semantics of modelling-specific terminology more effectively.
The development of the embeddings is described in Word Embeddings for Model-Driven Engineering and Experimenting with modeling-specific word embeddings. This post is intended to explain some technical details about how to install and use them in practice.
Installation
To install the latest version of WordE4MDE with pip, run the following:
pip install worde4mde
Use
The library simplifies loading the different models trained as part of WordE4MDE. In particular, the following models are available:
- sgram-mde: A word2vec model trained on modeling texts. It is one of the smaller models but performs similarly to the others.
- sgram-mde-so: A similar model, but also trained on posts from StackOverflow.
- glove-mde: A GloVe model trained on modeling texts. Also a small model.
- fasttext-mde: A FastText model that solves the out-of-vocabulary problem by including subword information. This model is much larger than the others (~2GB).
- fasttext-mde-so: A similar model, but also trained on posts from StackOverflow.
Loading a model is very simple using the load_embeddings function, which takes care of downloading the model and storing it in the .worde4mde folder in the user's home directory.
import worde4mde
model_id = 'sgram-mde'
model = worde4mde.load_embeddings(embedding_model=model_id)
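Assuming the returned object behaves like a gensim KeyedVectors model (as the examples below suggest), individual vectors and the vocabulary can be queried directly; the term used here is just an illustrative choice.

# Check whether a term belongs to the vocabulary before querying it.
if 'statechart' in model.key_to_index:
    # Retrieve the raw embedding vector for the term.
    vector = model['statechart']
    print(vector.shape)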
Computing word similarities
As a simple example of using the model, let’s build a function to compute the words that are most similar to a given one.
def similar_words_to(model, term, topn=10):
    """Return the top n most similar words using gensim facilities."""
    words = []
    # gensim's most_similar returns (word, score) pairs sorted by similarity.
    similar = model.most_similar(positive=[term], topn=topn)
    for word, score in similar:
        words.append(word)
    return words

similar_words_to(model, 'gmf', topn=7)
# If the model is FastText, it has to be model.wv in order to pass a gensim KeyedVectors object.
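For the FastText variants, as noted in the comment above, the vectors are accessed through the wv attribute. The following is a rough sketch under that assumption (keep in mind the ~2GB download):

# Load the FastText variant (a much larger download, as described above).
ft_model = worde4mde.load_embeddings(embedding_model='fasttext-mde')

# Pass the KeyedVectors stored in .wv so the helper above keeps working;
# subword information lets FastText produce vectors even for unseen terms.
print(similar_words_to(ft_model.wv, 'statechart', topn=5))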
The result of computing the similarities is the following list: eugenia, gef, emf, sirius, eclipse, graphiti, editor, which reflects the semantics of GMF quite well: a tool for building editors, based on EMF, and closely related to technologies such as Eugenia (which is based on GMF) and GEF (on which GMF is based).
Conclusion
The WordE4MDE embeddings were designed not just as a proof of concept, but as a practical tool for advancing Model-Driven Engineering (MDE) tools and machine learning pipelines. We foresee their use in several scenarios, such as semantic search over model repositories (e.g., with a vector database), model comparison based on word similarities, building a RAG pipeline for a modeling project, or fine-tuning classification models. We hope that WordE4MDE is a useful tool for the modeling community.
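As a small illustration of the model-comparison scenario, the following sketch (an illustrative example, not part of the library) compares two element names by averaging the vectors of their in-vocabulary words with gensim's n_similarity, assuming the model loaded earlier:

def element_similarity(model, name_a, name_b):
    """Cosine similarity between the averaged word vectors of two element names."""
    words_a = [w for w in name_a.lower().split() if w in model]
    words_b = [w for w in name_b.lower().split() if w in model]
    if not words_a or not words_b:
        return 0.0  # nothing to compare if no word is in the vocabulary
    return model.n_similarity(words_a, words_b)

print(element_similarity(model, 'state machine', 'statechart diagram'))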