
Building a Machine Learning Data Pipeline with Gridscript: From JSON, CSV, and XLSX to a Machine Learning Model

Data pipelines are the foundation of modern analytics and machine learning workflows. With gridscript.io, you can connect multiple data sources, clean and transform your data, and seamlessly integrate Python and scikit-learn - all inside one visual, reproducible environment. In this tutorial, you'll create a simple yet powerful Gridscript pipeline that predicts whether employees will leave a company based on HR, payroll, and satisfaction data.

Scenario

You have three datasets:

  1. employees.json: HR records with each employee's department, years at the company, and whether they left (left_company)
  2. payroll.csv: each employee's salary and bonus
  3. satisfaction.xlsx: each employee's satisfaction score

Your goal: merge the three sources into one table, clean and transform the data, and train a model that predicts the left_company flag.
Sample Data

employees.json

[
  {"employee_id": 1, "name": "Alice", "department": "Engineering", "years_at_company": 3, "left_company": 0},
  {"employee_id": 2, "name": "Bob", "department": "Sales", "years_at_company": 2, "left_company": 1},
  {"employee_id": 3, "name": "Charlie", "department": "Engineering", "years_at_company": 5, "left_company": 0},
  {"employee_id": 4, "name": "Diana", "department": "HR", "years_at_company": 4, "left_company": 1},
  {"employee_id": 5, "name": "Ethan", "department": "Marketing", "years_at_company": 1, "left_company": 0}
]

payroll.csv

employee_id,salary,bonus
1,85000,5000
2,62000,2000
3,90000,7000
4,55000,1500
5,60000,3000

satisfaction.xlsx

employee_id  satisfaction_score
1            0.8
2            0.4
3            0.9
4            0.3
5            0.6

Step 1: Load and Merge Data

In your Gridscript pipeline, create three “Import” stages, one per file, and name their outputs employees, payroll, and satisfaction so that later code stages can read them from context:

The Gridscript pipeline

Then create a Code stage to merge the data (this example uses JavaScript, but you can use Python instead if you prefer):

const { employees, payroll, satisfaction } = context;

let names = [];
let department = [];
let years_at_company = [];
let left_company = [];
let salary = [];
let bonus = [];
let satisfaction_score = [];

// payroll and satisfaction are row arrays whose first row is the header,
// so row i + 1 holds the values for employee i.
employees.forEach((emp, i) => {
  names.push(emp.name);
  department.push(emp.department);
  years_at_company.push(emp.years_at_company);
  left_company.push(emp.left_company);
  salary.push(payroll[i + 1][1]);                  // column 1: salary
  bonus.push(payroll[i + 1][2]);                   // column 2: bonus
  satisfaction_score.push(satisfaction[i + 1][1]); // column 1: satisfaction_score
});

context.data = {
    "name": names,
    department,
    years_at_company,
    left_company,
    salary,
    bonus,
    satisfaction_score
};
log(context.data)

This is the result that we get from running the previous stage:

{
    "name":["Alice","Bob","Charlie","Diana","Ethan"],
    "department":["Engineering","Sales","Engineering","HR","Marketing"],
    "years_at_company":[3,2,5,4,1],
    "left_company":[0,1,0,1,0],
    "salary":[85000,62000,90000,55000,60000],
    "bonus":[5000,2000,7000,1500,3000],
    "satisfaction_score":[0.8,0.4,0.9,0.3,0.6]
}

JavaScript Code Stage in Pipeline
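
If you prefer Python for the merge stage, a roughly equivalent version is sketched below. It assumes the same context keys as the JavaScript stage, and that payroll and satisfaction arrive as row arrays whose first row is the header (as the i + 1 indexing above suggests):

import pandas as pd

# Rebuild DataFrames from the three Import stages. The [1:] / [0] slicing
# assumes the first row of payroll and satisfaction is the header row.
employees = pd.DataFrame(context["employees"])
payroll = pd.DataFrame(context["payroll"][1:], columns=context["payroll"][0])
satisfaction = pd.DataFrame(context["satisfaction"][1:], columns=context["satisfaction"][0])

# Cast the join key to a common type in case the importers delivered strings
for t in (payroll, satisfaction):
    t["employee_id"] = t["employee_id"].astype(int)

# Join everything on employee_id and hand the result to the next stage
merged = employees.merge(payroll, on="employee_id").merge(satisfaction, on="employee_id")
context["data"] = merged.drop(columns=["employee_id"]).to_dict(orient="list")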

Step 2: Clean and Transform Data

Once the raw data has been merged into a single unified table, the next step is data cleaning and feature engineering. This is where we convert all values into a modeling-friendly format and create new features that may improve model performance.

In Gridscript, we can do this using a Python code stage. The code below performs several important preprocessing steps:

import pandas as pd

# Build a DataFrame from the merged table produced by the previous stage
df = pd.DataFrame(context['data'])

# Express salary in thousands and engineer a simple engagement feature
df["salary_k"] = df["salary"] / 1000
df["loyalty_score"] = df["years_at_company"] * df["satisfaction_score"]

# One-hot encode the department column, dropping one reference category
df = pd.get_dummies(df, columns=["department"], drop_first=True)

# Drop columns that are not useful for modeling
df = df.drop(columns=["name", "salary"])

context['df'] = df

What’s happening here?

  1. Converting raw salary to salary_k: Machine learning models often work better when numerical values are in a similar range. Dividing salary by 1,000 makes the feature easier to learn from and easier to interpret.

  2. Creating a new feature loyalty_score: years_at_company × satisfaction_score gives a simple proxy for engagement or stability. Someone who has been at the company 5 years with a 0.9 satisfaction score is likely much more loyal than someone who has been there 1 year with 0.3 satisfaction. Feature engineering like this often boosts predictive performance.

  3. One-hot encoding the department column: Machine learning algorithms can't directly understand text categories (like "Engineering" or "HR"). pd.get_dummies() converts each department into a binary (0/1) column, as the snippet after this list shows. Using drop_first=True avoids multicollinearity by removing one reference category.

  4. Removing unnecessary columns: name is not useful for prediction and salary is replaced by the cleaner salary_k.
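
To make the one-hot encoding concrete, here is a small standalone pandas snippet (runnable outside Gridscript) showing what pd.get_dummies produces for our five departments:

import pandas as pd

departments = pd.DataFrame(
    {"department": ["Engineering", "Sales", "Engineering", "HR", "Marketing"]}
)
print(pd.get_dummies(departments, columns=["department"], drop_first=True))

# drop_first=True drops the alphabetically first category (Engineering),
# leaving department_HR, department_Marketing, and department_Sales;
# a row of all zeros therefore means "Engineering".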

Result

After this transformation, every column in df is numeric, making it fully ready for machine learning in the next stage.

Clean and Transform Data with Python code stage in Pipeline

Step 3: Prepare Features and Train a Logistic Regression Model

Now that the dataset is clean and numeric, it’s time to train a machine learning model. In this example, we use Logistic Regression, one of the most common algorithms for binary classification problems like predicting employee attrition (left_company).

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd

# Load the cleaned table from the previous stage
df = pd.DataFrame(context["df"])

# Separate features and target
X = df.drop(columns=["left_company"])
y = df["left_company"]

# Full Pipeline: scale numeric features + logistic regression
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression())
])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model.fit(X_train, y_train)

# Predict
preds = model.predict(X_test)

# Results
print("Predictions:", preds)
print("Accuracy:", accuracy_score(y_test, preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, preds))

Explanation of Step 3

  1. Separating features (X) from the target (y): X contains the variables used to make a prediction; y contains 0 or 1, indicating whether the employee left the company. This is standard practice in machine learning.
  2. Using a Pipeline: The pipeline bundles preprocessing and model training into one reproducible sequence.
  3. Splitting the dataset (train/test split): 80% of the data is used to train the model, and 20% is kept unseen to test how well the model generalizes. This split is essential for detecting overfitting.
  4. Evaluating the model: the code prints the predictions, the accuracy (the percentage of correct predictions), and the confusion matrix (how many 0s and 1s were classified correctly or incorrectly).

This gives you a quick understanding of the model’s performance.
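
One caveat on data this small: with five rows, an 80/20 split leaves a single test example, so accuracy can only be 0 or 1. A slightly more stable check is cross-validation; here is a minimal sketch (cv=2 because the minority class has only two examples):

from sklearn.model_selection import cross_val_score

# Accuracy over two stratified folds of the full dataset
scores = cross_val_score(model, X, y, cv=2)
print("CV accuracy per fold:", scores)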

Train a Logistic Regression Model with Python code stage in Pipeline

Step 4: Run It All as a Gridscript Pipeline

Now that each stage works, you can chain them visually inside Gridscript.io to build a fully reproducible end-to-end data pipeline.

A typical Gridscript pipeline for this project looks like this:

  1. Import the JSON, CSV, and XLSX datasets
  2. Merge Data
  3. Clean & Transform Data
  4. Engineer features
  5. Create, Train & Evaluate Model

Why This Example Works

This example demonstrates how Gridscript pipelines:

  1. Connect multiple data sources (JSON, CSV, and XLSX) in one place
  2. Mix JavaScript and Python code stages within a single workflow
  3. Integrate pandas and scikit-learn for cleaning, feature engineering, and modeling
  4. Keep the whole process visual and reproducible end to end

Pipeline Features

Next Steps

Try extending the pipeline by:

  1. Importing a larger, real-world HR dataset
  2. Engineering additional features
  3. Swapping in a different scikit-learn classifier

Try Gridscript pipelines here.