CSV and TSV: The Universal Data Exchange Format Explained

Why CSV Has Survived 50 Years

CSV is older than the personal computer. The format dates to at least the 1970s on IBM mainframes, predating VisiCalc, Excel, and the entire concept of a spreadsheet as we know it today — see the Wikipedia entry on Comma-separated values for the historical timeline. By the time structured data exchange became a web problem, CSV was already entrenched in every operating system, business application, and database export tool on the planet.

The reason it survived isn't elegance. CSV has no formal specification — the closest thing, RFC 4180, was written in 2005 as documentation of existing practice, not a design document. No types, no schema, no native support for nested data. Just text rows with values separated by commas.

It survived because spreadsheets survived. Excel opens CSV. Google Sheets opens CSV. Every database can export CSV. Every analytics tool can import it. Universal compatibility beats almost everything else in data exchange.

The Quoting Rules: RFC 4180

The basic case is obvious — comma-separated values, one record per line:

name,age,city
Alice,30,New York
Bob,25,London

But values that contain commas, newlines, or double-quote characters need to be quoted. That's where most hand-rolled CSV parsers break. RFC 4180 specifies:

  1. Fields may be enclosed in double quotes. If a field is not quoted, it must not contain commas, newlines, or double quotes.
  2. If a field contains commas, newlines, or double quotes, it must be enclosed in double quotes.
  3. Double quotes within a quoted field are escaped by doubling them: "" represents a literal ".

name,bio,score
Alice,"She said, ""hello world"" to the crowd",95
Bob,"Line one
Line two",87

That second record has a newline inside a quoted field — which means the record spans two physical lines. A naive line-by-line parser will corrupt this. Always use a proper CSV library.

// Node.js — don't hand-roll parsing
import { parse } from 'csv-parse/sync';

const records = parse(csvText, {
  columns: true,      // use first row as header names
  skip_empty_lines: true,
  trim: true
});

# Python — csv.DictReader handles quoted fields and embedded newlines
import csv
import io

reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    print(row['name'], row['age'])

Both Python's csv module and Node's csv-parse handle quoted fields, embedded newlines, and escaped quotes correctly. Any time you're tempted to split on commas yourself, reach for a library instead.
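
The same rule applies when writing. A minimal sketch in Python (the file name and rows are illustrative): csv.writer applies the RFC 4180 quoting rules automatically, so fields containing commas, quotes, or newlines come out correctly escaped.

# Python — csv.writer quotes and escapes fields as needed
import csv

rows = [
    ['name', 'bio', 'score'],
    ['Alice', 'She said, "hello world" to the crowd', 95],
    ['Bob', 'Line one\nLine two', 87],
]

with open('people.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)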

TSV: A Simpler Alternative

TSV (Tab-Separated Values) uses a tab character (\t) as the delimiter instead of a comma — IANA's text/tab-separated-values media type is the closest thing to a formal spec. The appeal: tab characters almost never appear in real data, so you rarely need quoting, and for simple, flat data TSV is often easier to produce and parse than CSV.

name	age	city
Alice	30	New York
Bob	25	London
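
Most CSV libraries read TSV with a one-line change. A minimal sketch in Python, assuming the table above is saved as people.tsv:

# Python — same csv module, just switch the delimiter to a tab
import csv

with open('people.tsv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        print(row['name'], row['age'], row['city'])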

TSV is common in bioinformatics, genomics data, and system-generated exports where the data never goes near Excel. It's also the format MySQL's LOAD DATA INFILE and PostgreSQL's `\copy` prefer by default.

The tradeoff: tabs are invisible in most text editors. You can't tell whether something is tab-separated or space-separated at a glance, and copy-pasting TSV into Excel or a browser form sometimes converts tabs to spaces. For human-edited data, CSV is usually more practical.

Header Row Conventions

RFC 4180 explicitly makes the header row optional and leaves it to the application to decide. In practice, a header row is almost always present and expected. The convention:

  • First row contains field names
  • Names are lowercase with underscores or camelCase depending on your ecosystem
  • Names should be stable — downstream consumers hard-code them

If you're exporting CSV from a database, match the column names to the table column names unless there's a good reason not to. Consumers who write code against your CSV export will be grateful.

Some tools — particularly data science pipelines — also support a "no header" CSV where all rows are data. When sharing files with others, always clarify whether a header row is present.
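
If you receive a headerless file, most libraries let you supply the names yourself. A sketch using Python's DictReader; the field names here are assumptions about the data:

# Python — reading a headerless CSV by supplying column names explicitly
import csv
import io

headerless = 'Alice,30,New York\nBob,25,London\n'
reader = csv.DictReader(io.StringIO(headerless),
                        fieldnames=['name', 'age', 'city'])
for row in reader:
    print(row['name'], row['city'])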

The UTF-8 BOM Problem in Excel

Excel on Windows doesn't correctly recognize UTF-8 encoded CSV files unless they start with a BOM (byte order mark) — the three-byte sequence \xEF\xBB\xBF at the very beginning of the file. Reliably frustrating.

Without the BOM, Excel applies a legacy encoding (often Windows-1252) and any non-ASCII characters — accented letters, Chinese characters, emoji — appear as mojibake.

The fix when generating CSVs for Excel consumption:

// Add UTF-8 BOM so Excel opens it correctly
// '\uFEFF' is the Unicode BOM character (U+FEFF)
const BOM = '\uFEFF';
const csvContent = BOM + rows.join('\n');

# Python — specify utf-8-sig encoding (writes BOM automatically)
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)

The irony is that the BOM is not part of the RFC 4180 spec and can confuse non-Excel parsers. If your audience is a mix of Excel users and programmatic consumers, the safest approach is to offer both: a -excel.csv variant with BOM and a standard -utf8.csv without.
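
When reading files of unknown origin, Python's utf-8-sig codec is a convenient default: it strips a leading BOM if one is present and otherwise behaves like plain UTF-8. A sketch with a hypothetical file name:

# Python — utf-8-sig strips a leading BOM if present, otherwise reads plain UTF-8
import csv

with open('export-excel.csv', newline='', encoding='utf-8-sig') as f:
    for row in csv.reader(f):
        print(row)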

Common CSV Parsing Bugs

Splitting on commas. Already covered, but worth repeating: line.split(',') gives wrong results on any field that contains a comma. Even if your current data doesn't, the next export will.

Ignoring line ending differences. Windows line endings are \r\n, Unix are \n. CSV files from Windows sources often have \r appended to the last field in each row. Most libraries handle this; split('\n') doesn't.
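
A small sketch of the difference, where naive splitting leaves a stray carriage return and the csv module strips it:

# Python — '\r' survives naive splitting but not a real parser
import csv
import io

line = 'Alice,30,New York\r\n'
print(line.rstrip('\n').split(','))          # ['Alice', '30', 'New York\r']
print(next(csv.reader(io.StringIO(line))))   # ['Alice', '30', 'New York']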

Mishandling null vs empty string. CSV has no null type. An empty field (,,) might mean null, an empty string, zero, or missing — that's application-level semantics that CSV doesn't encode. Be explicit in your documentation.

Assuming consistent column count. Some CSV files have rows with different numbers of fields — malformed exports, trailing commas. Validate before processing if the data comes from an untrusted source.
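
A quick pre-flight check catches ragged rows before they hit your pipeline. A sketch, assuming the first row is the header:

# Python — report rows whose field count doesn't match the header
import csv
import io

def find_ragged_rows(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [(line_no, len(row))
            for line_no, row in enumerate(reader, start=2)
            if len(row) != len(header)]

print(find_ragged_rows('a,b,c\n1,2,3\n4,5\n6,7,8,9\n'))  # [(3, 2), (4, 4)]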

CSV vs JSON vs Excel: When to Use Which

Use CSV when: you need maximum interoperability, your data is flat tabular (no nesting), you're moving data between tools that both understand spreadsheets, or the recipient will open the file in Excel or Google Sheets. Data exports, report downloads, bulk import templates — CSV wins here.

Use JSON when: your data is nested or hierarchical, you're building an API, you need types (numbers, booleans, null), or you're working entirely within a JavaScript/TypeScript stack. For web-to-web data exchange, JSON is almost always better.

Use Excel (XLSX) when: formulas, multiple sheets, or rich formatting matter to the recipient. Raw data exchange should still prefer CSV even when the final consumer is Excel, because CSV is universally parseable and XLSX parsing libraries are heavy.

The CSV to JSON and JSON to CSV tools convert between these formats directly in your browser — nothing is uploaded to a server, so they're safe for sensitive data.

For a broader look at data format tradeoffs, see XML vs JSON: Which Format to Use and When. YAML is another common format for configuration data, covered in YAML Explained: Configuration Made Human-Readable.

FAQ

What's the official CSV specification?

There isn't one definitive standard — RFC 4180 (2005) is the closest thing, but it explicitly documents existing practice rather than mandating behavior. Different tools implement different dialects: Excel uses \r\n line endings and Windows-1252 encoding by default; PostgreSQL's COPY uses Unix line endings; many European locales swap the comma for a semicolon because the comma is the decimal separator there. Always document your dialect when sharing CSV.

Should I use Parquet instead of CSV for analytics work?

For purely analytical workloads (large datasets, columnar queries, data warehouses), yes — Parquet compresses 5–10x better than CSV, supports types and nested schemas, and reads orders of magnitude faster for column-subset queries. CSV is still better for human inspection, ad-hoc Excel analysis, and interop with non-data-engineering tools. The pattern is: CSV for exchange and editing, Parquet for storage and analytics.
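
As a rough sketch of that split, assuming pandas with a Parquet engine (pyarrow or fastparquet) installed; file and column names are illustrative:

# Python — keep CSV for exchange, convert to Parquet for repeated analytics
import pandas as pd

df = pd.read_csv('events.csv')        # interchange copy
df.to_parquet('events.parquet')       # typed, compressed, columnar copy

# Later reads can pull only the columns a query needs
subset = pd.read_parquet('events.parquet', columns=['user_id', 'timestamp'])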

Why does Excel mangle my UTF-8 CSV?

Excel on Windows checks for a UTF-8 BOM (byte order mark) at the start of the file — without it, it falls back to Windows-1252 or the user's locale codepage, mangling any non-ASCII characters. Excel on Mac is generally better. Fix: write the file with a UTF-8 BOM (Python: encoding='utf-8-sig'; Node: prepend '\uFEFF'). The BOM confuses some non-Excel CSV parsers, so offer both versions if your audience is mixed.

Is TSV better than CSV for programmatic pipelines?

Often yes. Tabs almost never appear in real data, so you rarely need quoting — and quoting is where CSV parsers go wrong. TSV is the default for many bioinformatics tools, MySQL/PostgreSQL bulk loaders, and command-line utilities (cut defaults to tab, awk to whitespace). The downside is that tabs are invisible in editors and copy-paste workflows often convert them to spaces. For database <-> script pipelines, TSV; for human-shared exchange, CSV.

How do I handle CSV with embedded newlines?

Always with a real CSV parser, never with .split('\n'). Embedded newlines inside quoted fields are valid per RFC 4180. Python's csv.reader, Node's csv-parse, Ruby's CSV, and Go's encoding/csv all handle this correctly. The "tip-off" that a parser is broken: it crashes or splits records when one of your fields has a multi-line description.

What's the right way to detect a delimiter automatically?

Read the first 1–2 KB of the file and count occurrences of each candidate delimiter (comma, tab, semicolon, pipe) per line. Whichever delimiter has the most consistent count across rows is almost certainly the right one. Python's csv.Sniffer and Node's csv-sniff implement this heuristic. For untrusted input, prompt the user to confirm the detection — auto-detection works ~95% of the time but occasionally picks wrong on edge cases.
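
In Python, the standard-library sniffer implements roughly this heuristic. A sketch with a hypothetical mystery.csv:

# Python — csv.Sniffer guesses the dialect from a sample of the file
import csv

with open('mystery.csv', newline='', encoding='utf-8') as f:
    sample = f.read(2048)
    dialect = csv.Sniffer().sniff(sample, delimiters=',\t;|')
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)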

Can CSV handle nested data?

Not natively. Workarounds: flatten the nesting (one row per leaf object), embed JSON-encoded strings inside cells (quoted per RFC 4180, so a cell reads "{""key"":""value""}"), or denormalize across multiple files with foreign keys. None are great. If your data is genuinely hierarchical (orders with line items, surveys with multiple responses per question), CSV is fighting your data shape — JSON or a database export is usually a better choice.
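
A sketch of the JSON-in-a-cell workaround; the column names are illustrative:

# Python — store a nested structure as a JSON string inside one cell
import csv
import io
import json

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['order_id', 'items'])
writer.writerow(['1001', json.dumps([{'sku': 'A1', 'qty': 2}])])

print(out.getvalue())
# order_id,items
# 1001,"[{""sku"": ""A1"", ""qty"": 2}]"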

How do I handle CSV files that exceed memory?

Stream them. Python's csv.reader on a file object reads row-by-row without loading the whole file. Node's csv-parse with the streaming API does the same. PostgreSQL's \copy reads CSV in chunks. For very large CSVs (multi-gigabyte), tools like xsv, csvkit, or q provide grep/sql-like operations directly on streamed CSV without loading it into memory. Avoid loading multi-GB CSV into pandas or a single string buffer.
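
A minimal streaming sketch in Python; the file name and the column being summed are placeholders:

# Python — process a large CSV row by row without loading it into memory
import csv

total = 0.0
with open('big-export.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        total += float(row['amount'])   # hypothetical numeric column

print(total)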