Writing data guides • excelDataGuide

Writing a guide

Every spreadsheet template should be accompanied by a data guide — a YAML file that documents where data are located within the template and how they should be interpreted. This guide serves as the blueprint that tells the excelDataGuide package exactly where to find and how to read your data.

A data guide is a human-editable, computer-readable YAML file. YAML syntax is simple: use colons (:) for key-value pairs, hyphens (-) for lists, and indentation for nesting. This section walks you through constructing a guide step by step.

Step 1: Define the header

Every guide begins with five pieces of metadata that describe the template itself:

guide.version: '1.0'
template.name: myassay
template.min.version: '1.0'
template.max.version: '2.0'
plate.format: 96

guide.version (required): The version of the guide itself (not the template). Use major.minor format, e.g., '1.0' or '2.1'. This allows you to track changes to the guide independently from the template.
template.name (required): A short, descriptive name for the template (e.g., 'myassay', 'fluorescence_assay'). This is used to identify which template a guide belongs to and is checked at runtime if you enable template name validation.
template.min.version (required): The minimum template version for which this guide is compatible. Use major.minor format (e.g., '1.0').
template.max.version (required, can be null): The maximum template version for which this guide is compatible. If there is no upper limit, use ~ (YAML null) to indicate “any newer version is OK”. For example, template.max.version: ~ means the guide works with all template versions from template.min.version onwards.
plate.format (conditionally required): The microplate format used in your experiments. Valid values are 24, 48, 96, or 384 (referring to the number of wells). This is required only if your guide includes any platedata locations (see below). The guide uses this to validate that your platedata ranges have the correct dimensions.

Step 2: Add your first location — template version cell

The .template location is a special, required location that records the template version from a single cell. This allows the package to check at runtime that the data file was created with a compatible template version.

locations:
  - sheet: description
    type: cells
    varname: .template
    translate: false
    variables:
      - name: version
        cell: B2

What this does: - Reads cell B2 from the sheet named description - Stores its value in a variable called version within the .template group - translate: false means no translation is applied (the cell value is used as-is)

This location must be present in every guide. The package uses the version value to check compatibility with template.min.version and template.max.version.

Step 3: Add metadata as key-value pairs

Metadata such as experimenter names, dates, and study IDs are typically stored as key-value pairs in your template. One location entry can index multiple ranges.

  - sheet: description
    type: keyvalue
    varname: metadata
    translate: true
    atomicclass:
      - character
      - character
      - date
      - numeric
    ranges:
      - A5:B8
      - A10:B12

What this does: - Reads two ranges (A5:B8 and A10:B12) from the description sheet - The left column of each range contains keys (variable names); the right column contains values - translate: true means the keys will be translated from long names (in the spreadsheet) to short names (used in your script) using the translations section defined later - atomicclass is a list of data types, one per key-value pair in order. In this example, the first three key-value pairs should be coerced to character, character, and date respectively, then numeric for any additional pairs (coercion cycles through the list if needed). All values are coerced to character by default if atomicclass is omitted.

Combining multiple ranges into one variable: Multiple ranges are combined into a single named list (metadata in this example). This is convenient when metadata is scattered across multiple sheets or ranges but logically belongs together.

Step 4: Add experiment parameters

Parameters that are constant across all experiments (e.g., acceptance criteria, standard concentrations defined in SOPs) are best stored in a dedicated _parameters sheet. They are indexed like metadata:

  - sheet: _parameters
    type: keyvalue
    varname: parameters
    translate: false
    atomicclass: numeric
    ranges:
      - A2:B5
      - A8:B10

Here, all values will be coerced to numeric. Use translate: false if your parameter names are already in the format you want in the script.

Step 5: Add measured plate data

If your template records plate data (e.g., fluorescence measurements from a 96-well plate), use the platedata type:

  - sheet: _data
    type: platedata
    varname: plate
    translate: false
    atomicclass:
      - character
      - numeric
      - numeric
    ranges:
      - A1:M9
      - A11:M19

What this does: - Reads data in plate format (e.g., well positions and measurements) - Each range is a block of data in the same row-column layout as the physical plate - Multiple ranges are stacked vertically into a single data frame - The resulting data frame includes columns row and col (well positions) plus one column per measured variable, each coerced to its corresponding atomicclass

In this example, the first three variables will be character, numeric, numeric respectively. The first range (A1:M9) represents rows A–H of the plate (8 rows) with 12 columns of data.

Step 6: Add tabular data

Tables with column headers and multiple rows of data are indexed using the table type:

  - sheet: results
    type: table
    varname: userresults
    translate: false
    atomicclass: numeric
    ranges:
      - C3:E10

What this does: - Reads a table from range C3:E10 - The first row contains column names (variable names) - All values are coerced to the specified atomicclass - The result is a data frame, not a list

Step 7: Add single-cell values

Sometimes you need to read isolated values from single cells and give them names in the guide (rather than relying on names in the spreadsheet):

  - sheet: analysis
    type: cells
    varname: qc_checks
    translate: false
    atomicclass: numeric
    variables:
      - name: spread_well_1
        cell: G6
      - name: spread_well_2
        cell: G33

What this does: - Reads individual cells and assigns them names specified in the guide - spread_well_1 gets the value from cell G6, spread_well_2 from G33 - All values are coerced to numeric - The result is a named list

This is useful when values are scattered or when the spreadsheet’s cell-naming scheme doesn’t align with your script’s variable-naming scheme.

Step 8: Add translations

If your spreadsheet uses long, user-friendly variable names (e.g., “Date of experiment”) but your script prefers short, code-friendly names (e.g., date), define a translations section:

translations:
  - long: "Date of experiment"
    short: date
  - long: "Experimenter name"
    short: experimenter
  - long: "Study identifier"
    short: studyID
  - long: "Plate identifier"
    short: plateID

When to use translations: - Set translate: true in a location to apply these translations to its variable names - Translations are optional; if you don’t define any, set translate: false in all locations - The package provides long_to_shortnames() and short_to_longnames() functions to convert between formats

Complete example guide

Below is a complete, working guide for a small fluorescence assay template:

guide.version: '1.0'
template.name: simple_fluorescence
template.min.version: '1.0'
template.max.version: ~
plate.format: 96

locations:
  # Required: template version
  - sheet: description
    type: cells
    varname: .template
    translate: false
    variables:
      - name: version
        cell: B2

  # Metadata: user and experiment information
  - sheet: description
    type: keyvalue
    varname: metadata
    translate: true
    atomicclass:
      - character
      - character
      - date
      - numeric
    ranges:
      - A5:B8

  # Parameters: SOP-defined constants
  - sheet: _parameters
    type: keyvalue
    varname: parameters
    translate: false
    atomicclass: numeric
    ranges:
      - A2:B5

  # Plate data: fluorescence measurements
  - sheet: _data
    type: platedata
    varname: fluorescence
    translate: false
    atomicclass: numeric
    ranges:
      - A1:M9
      - A11:M19

  # Quality control: derived metrics from analysis sheet
  - sheet: analysis
    type: cells
    varname: qc
    translate: false
    atomicclass: numeric
    variables:
      - name: plate_uniformity
        cell: D3
      - name: signal_range
        cell: D4

translations:
  - long: "Experimenter name"
    short: experimenter
  - long: "Date of experiment"
    short: date
  - long: "Temperature (°C)"
    short: temperature
  - long: "Sample concentration (µM)"
    short: concentration

Validation with CUE

Before using your guide in your scripts, validate it against the CUE schema to catch syntax and structure errors early:

# From the data-raw folder where the schema is located:
cd data-raw
./validate_and_sign.sh path/to/your/guide.yml

This command will: 1. Check your guide against excelguide_schema.cue 2. Abort with a detailed error message if validation fails 3. Embed a SHA256 hash (cue.verified field) if validation succeeds

The hash acts as proof that the guide was validated; you can verify it later with:

./verify_guide.sh path/to/your/guide.yml

In R, you can also check the hash when loading a guide:

guide <- read_guide("path/to/guide.yml", verify_hash = TRUE)

This will warn if the hash is missing or abort if the hash is invalid.