Generation of Synthetic Data

How to augment tabular data without writing a single line of code

In the era of artificial intelligence, the amount of available data is a key factor in model performance. Both traditional machine learning models and neural networks tend to improve as the volume of training data increases. In practice, however, we often face limited datasets. This is where synthetic data generation comes in: a technique that creates fictitious yet realistic records while preserving the patterns and relationships present in the original data.

In addition to expanding the size of a dataset, synthetic data can protect privacy in scenarios where real data cannot be shared, which is especially useful in research or in projects that must comply with data protection regulations. It is also valuable for advanced statistical analysis and for training models in areas where real data is scarce.

For tabular data, several techniques exist, such as generative adversarial networks (GANs) and copulas. GANs use two neural networks competing with each other to produce increasingly realistic data, while copulas are mathematical functions that capture the dependencies between variables and allow new observations to be generated while preserving those correlations.
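For readers curious about what this looks like in practice, here is a minimal sketch of the copula approach using the Python copulas library (the same library the KNIME component relies on, as noted in the setup section below). The file name and columns are placeholders, not part of the component.

```python
import pandas as pd
from copulas.multivariate import GaussianMultivariate

real = pd.read_csv("data.csv")           # hypothetical input file
numeric = real.select_dtypes("number")   # copulas models numeric columns

model = GaussianMultivariate()           # Gaussian copula
model.fit(numeric)                       # learn marginals and dependencies
synthetic = model.sample(len(numeric))   # draw a synthetic sample of equal size

# The correlation structure should be roughly preserved
print(numeric.corr().round(2))
print(synthetic.corr().round(2))
```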

The challenge is that, although there are libraries in Python and R to implement these techniques, not everyone has programming knowledge. The good news is that there are tools that eliminate the need to write code, such as the KNIME platform and its Synthetic Data (Copulas) component.

Environment setup

To work with this component, you need to install KNIME and some specific extensions, as well as configure Python, R, and the corresponding libraries. This process includes creating an Anaconda environment, installing the copulas library and the R package corrplot, and linking these tools within the KNIME preferences.

Workflow and configuration

The workflow involves loading a dataset, preprocessing it, and then configuring the component to generate synthetic data. Among the available options, you can select different types of copulas (Gaussian or Vine) and univariate distributions, define the synthetic sample size, and choose which numerical columns to include in the generation.
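In code terms, these options correspond roughly to the choices below, again sketched with the Python copulas library; the column names, distribution choices, and sample size are illustrative assumptions, not the component's defaults.

```python
import pandas as pd
from copulas.multivariate import GaussianMultivariate, VineCopula
from copulas.univariate import BetaUnivariate, GaussianKDE

real = pd.read_csv("data.csv")                 # hypothetical input file
subset = real[["age", "income", "score"]]      # numeric columns chosen for generation

# Gaussian copula, with the univariate distribution picked per column
gaussian_model = GaussianMultivariate(
    distribution={"age": GaussianKDE, "income": BetaUnivariate}
)
gaussian_model.fit(subset)

# Vine copula as an alternative dependency model
vine_model = VineCopula("regular")
vine_model.fit(subset)

# The synthetic sample size is a free parameter
synthetic = gaussian_model.sample(500)
```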

The choice of parameters depends on the dataset and the objective of the analysis, so it’s best to experiment with different combinations and evaluate the results. You can also enable or disable the display of correlations to optimize runtime.

Data output

The component offers two outputs:

  • Raw synthetic data, generated directly by the model, which may contain values outside the original ranges.
  • Post-processed synthetic data, in which anomalous values are replaced with values closer to the original ranges, ensuring consistency and realism (the general idea is sketched after this list).
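The component's exact post-processing rules are internal to it, but a simple version of the same idea is to clip each synthetic column to the range observed in the real data. A minimal sketch, assuming both tables are pandas DataFrames with matching columns:

```python
import pandas as pd

def clip_to_real_ranges(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Replace out-of-range synthetic values with the nearest real minimum or maximum."""
    return synthetic.clip(lower=real.min(), upper=real.max(), axis="columns")
```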

Along with the tables, the component generates correlation matrices and comparative visualizations of real versus synthetic data, making it easier to evaluate the quality of the result.

Quality assessment

Comparing statistical metrics between real and synthetic data is key. For example, analyzing Pearson and Spearman correlations, or comparing standard deviations, lets you verify that the relationships between variables and their dispersion have been maintained. More advanced measures, such as the Kolmogorov–Smirnov test or the Jensen–Shannon divergence, offer more in-depth analyses.
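Outside KNIME, these checks can be reproduced with a few lines of pandas and SciPy. The file names and the column name below are placeholders for whatever dataset you are working with.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

real = pd.read_csv("real.csv")            # hypothetical file names: tables with
synthetic = pd.read_csv("synthetic.csv")  # the same numeric columns
col = "income"                            # hypothetical column to inspect

# Correlation structure and dispersion
for method in ("pearson", "spearman"):
    print(method, real.corr(method=method), synthetic.corr(method=method), sep="\n")
print(real[col].std(), synthetic[col].std())

# Kolmogorov–Smirnov test: compares the two empirical distributions
stat, p_value = ks_2samp(real[col], synthetic[col])

# Jensen–Shannon distance (square root of the divergence) on shared histogram bins
bins = np.histogram_bin_edges(real[col], bins=20)
p, _ = np.histogram(real[col], bins=bins, density=True)
q, _ = np.histogram(synthetic[col], bins=bins, density=True)
js = jensenshannon(p, q, base=2)
print(f"KS p-value: {p_value:.3f}, JS distance: {js:.3f}")
```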

An important aspect is controlling the number of duplicates in the synthetic data. If the percentage exceeds that of the original data, it may be advisable to reduce the size of the generated sample.
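A simple way to monitor this, assuming the same hypothetical files as above:

```python
import pandas as pd

real = pd.read_csv("real.csv")            # hypothetical file names
synthetic = pd.read_csv("synthetic.csv")

# Share of fully duplicated rows in each table; if the synthetic share is
# clearly higher, consider generating a smaller sample
print(f"real:      {real.duplicated().mean():.1%} duplicate rows")
print(f"synthetic: {synthetic.duplicated().mean():.1%} duplicate rows")
```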

Final recommendations
  • Verify that there are no exact matches between real and synthetic data before using them (a quick check is sketched after this list).
  • Adjust parameters and repeat the process until you achieve the right balance between realism and variability.
  • Use this technique not only to expand data, but also to preserve privacy or facilitate model testing.
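For the first recommendation, one quick check (again with hypothetical file names) is to count synthetic rows that coincide, value for value, with a real row:

```python
import pandas as pd

real = pd.read_csv("real.csv")            # hypothetical file names
synthetic = pd.read_csv("synthetic.csv")

# Inner merge on all shared columns finds synthetic rows identical to a real row
exact_matches = synthetic.merge(real.drop_duplicates(), how="inner")
print(f"{len(exact_matches)} synthetic rows exactly match a real row")
```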

Generating synthetic tabular data without code is no longer a challenge exclusive to programming experts. Tools like KNIME allow more people to experiment with this approach and leverage its benefits in analysis and modelling projects.
