PyLex Analyzer

A small lexer that reads C-style source, tokenizes it with PLY, and prints each lexeme with a stable numeric group for quick inspection.

Snapshot

Field	Details
Type	Command-line lexical analyzer
Context	Academic Prototype
Role	Solo developer
Year	2022
Status	Completed prototype
Main focus	Token rules, C-like keyword handling, grouped token output

Overview

PyLex Analyzer is a Python tool I built to practice the first stage of compilation: turning source text into a stream of typed tokens. It targets C-like syntax (delimiters, operators, identifiers, keywords, literals, and comments) and prints each token in a fixed-width table so I could see what the lexer was producing while testing sample programs.

The project is a learning prototype, not a full compiler. Parsing and code generation are out of scope. I keep it in my portfolio because it shows how I modeled token classes, integrated a mature lexer library, and built a simple CLI around repeatable examples.

The problem

Before building a parser or interpreter, I needed a reliable way to break source code into meaningful pieces and verify that edge cases (keywords vs identifiers, floats vs integers, comments) were handled consistently.

Raw text is hard to debug without a structured token view
Hand-rolling a lexer from scratch is error-prone for operator and string patterns
Course-style C samples needed a repeatable way to run the same rules against different files
I wanted numeric groups so related lexeme types could be summarized at a glance

Who it was for

Me, while learning compiler front-end concepts
Anyone experimenting with lexing C-like snippets in Python
Students or developers who want a minimal, readable token dump from sample .cpp files

My role

I designed and implemented the full flow end to end: token rules, keyword lists, grouping logic, file selection, console input mode, example programs, and formatted output. I chose PLY for the lexer engine and wrote the project-specific rules and helpers on top of it.

What the project does

The tool loads C-like source (from bundled examples or pasted input), runs it through a PLY-backed lexer, and prints each token with a group id, internal type name, and value. It also prints a reference table that maps group numbers to lexeme categories.

Tokenizes delimiters, operators, identifiers, keywords, strings, chars, integers, and reals
Discards or skips comments, preprocessor-style lines, and whitespace
Promotes known words from identifiers to keywords using a fixed keyword set
Maps each PLY token type to a numeric group for summary output
Offers interactive selection among sample programs or direct stdin input

Key features

C-like token rules

Regular expressions and small token functions cover the constructs I cared about for classroom-style C++ samples, including compound operators and both block and line comments.

python

def t_IDENTIFIER(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    # Promote reserved words so they are not reported as plain identifiers.
    if t.value in keyword:
        t.type = 'KEYWORD'
    return t

I kept keyword detection inside the identifier rule so reserved words never show up as generic identifiers in the output.

Numeric token groups

Each lexeme category maps to a stable group number used in the printed stream and summary table.

python

def token_group(tok):
    group = 0
    if tok == tokens[0]:        # DELIMITER
        group = group_number[0]
    elif tok == tokens[1]:      # OPERATOR
        group = group_number[1]
    # ...
    return group

The grouping layer sits above PLY so the console view stays compact even when the underlying token type names are verbose.
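
The same lookup can also be written as a table, which avoids one branch per token type. The sketch below is an alternative under stated assumptions, not the project's code: only DELIMITER and OPERATOR are confirmed by the snippet above, while the remaining type names, their order, and the ids 1 to 8 are guesses for illustration.

```python
# Assumed names and ids; the project's constants.py may use different values.
tokens = ("DELIMITER", "OPERATOR", "IDENTIFIER", "KEYWORD",
          "STRING", "CHAR", "INTEGER", "REAL")
group_number = (1, 2, 3, 4, 5, 6, 7, 8)

# Build the type-name -> group-id table once, at import time.
GROUP_BY_TYPE = dict(zip(tokens, group_number))


def token_group(tok_type):
    """Return the group id for a token type name, or 0 if it is unknown."""
    return GROUP_BY_TYPE.get(tok_type, 0)


print(token_group("OPERATOR"))
```

Keeping 0 as the fallback mirrors the default in the if/elif version, so an unmapped type still shows up in the output instead of raising.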

Example-driven and interactive modes

main.py lists files under test/examples/ and tokenizes the chosen program. console.py accepts pasted source for quick one-off checks without creating a file.
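
A file picker of this kind can be sketched with `pathlib`. The function names `list_examples` and `pick_example` are hypothetical, and the `.cpp` suffix filter and 1-based menu numbering are assumptions rather than a description of helpers.py.

```python
from pathlib import Path


def list_examples(directory="test/examples", suffix=".cpp"):
    """Return the sample files under `directory`, sorted by name."""
    return sorted(p for p in Path(directory).glob("*" + suffix) if p.is_file())


def pick_example(files, choice):
    """Return the file for a 1-based menu choice, or None if out of range."""
    if 1 <= choice <= len(files):
        return files[choice - 1]
    return None


if __name__ == "__main__":
    # Print a numbered menu of whatever samples are present.
    for number, path in enumerate(list_examples(), start=1):
        print("{:>2}) {}".format(number, path.name))
```

Returning None for an invalid choice leaves it to the caller to decide whether to prompt again or exit.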

Readable token table output

The main loop formats token group, type, and value in aligned columns and appends a list of group ids for the whole file, which made regression checks on sample programs straightforward.

Technical approach

The architecture is intentionally flat: tokrules.py defines MyLexer(), which builds and returns a PLY lexer instance; main.py and console.py feed input and print results; constants.py holds keywords, token names, and group numbers; helpers.py handles file picking and grouping.

Scanning is powered by PLY (Python Lex-Yacc) by David M. Beazley (Dabeaz LLC). The project includes the vendored ply/ package (lex.py and yacc.py) under PLY’s license terms. I use ply.lex to build and run the lexer; ply.yacc is present as part of PLY but is not integrated into this prototype’s pipeline yet. PLY brings the classic lex/yacc workflow to Python; my custom work is the token rules, keyword table, grouping scheme, and CLI.

python

import ply.lex as lex  # assumed import of the vendored ply/ package

def MyLexer():
    # ... token rules ...
    return lex.lex()

I wrapped rule definitions in a factory function so PLY builds the lexer from a clean local namespace each time.

python

while True:
    tok = lexer.token()
    if not tok:
        break
    tok_group = token_group(tok.type)
    data_tokens.append(tok_group)
    print('{:>5} | {:<10} | {:<64}'.format(tok_group, tok.type, tok.value))

The consumer loop stays dumb on purpose: all classification complexity lives in tokrules.py and helpers.py.

Error handling for illegal characters prints the offending character and skips one position, which keeps exploratory runs going without crashing on a single bad symbol.
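
The recovery strategy is easy to see in a standalone scanner. The sketch below uses the standard `re` module with a deliberately tiny, hypothetical rule set. It is similar in spirit to calling `t.lexer.skip(1)` inside a PLY `t_error` rule, but it is not the project's code.

```python
import re

# Hypothetical rule set: whitespace, identifiers, integers, a few symbols.
TOKEN = re.compile(r"\s+|[A-Za-z_]\w*|\d+|[-+*/=;(){}]")


def scan(source):
    """Return (lexemes, illegal), skipping one character per error."""
    lexemes, illegal = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # Report the offending character, then advance one position
            # so the rest of the input is still scanned.
            print("Illegal character {!r} at position {}".format(source[pos], pos))
            illegal.append(source[pos])
            pos += 1
            continue
        if not match.group().isspace():
            lexemes.append(match.group())
        pos = match.end()
    return lexemes, illegal


print(scan("x = 4 @ 2;"))
```

Skipping exactly one character keeps the scanner in sync with the input: the next attempt starts right after the bad symbol, so valid tokens that follow it are not lost.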

Design decisions

I optimized for clarity and inspection speed over completeness. The CLI prompts are in Spanish because that matched how I was working with the examples at the time.

Used PLY instead of writing a scanner from scratch, so the effort went into token design and output rather than scanner tables
Collapsed many lexeme kinds into eight groups with numeric ids for easier scanning of long token lists
Bundled small C++ samples (hello world, division, variable sizing) as predictable fixtures
Separated console.py from file-based main.py so pasted snippets did not require temp files
Left parsing (yacc/grammar) out of scope so the prototype stayed focused on lexical analysis only

Challenges and tradeoffs

Operator and delimiter regexes are dense; overlapping patterns required careful ordering in PLY rule definitions
Treating printf and main as keywords is useful for demos, but neither is a reserved word in standard C or C++
The vendored PLY tree adds weight; the tradeoff was zero pip dependency friction and offline use
No automated test suite in the repo; validation was manual via example files and console runs
Without a parser pass, the tool cannot validate syntax, only surface-level tokenization
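
The ordering issue from the first point can be reproduced with a plain `re` alternation, where the first alternative that matches wins. This is a sketch with hypothetical patterns, not the project's rules. PLY applies its own ordering (function rules in definition order, string rules by decreasing regex length), but the underlying concern is the same: longer or more specific patterns have to be tried first.

```python
import re

# REAL must precede INTEGER, and two-character operators must precede
# their one-character prefixes, or "3.14" and "<=" would be split apart.
TOKEN = re.compile(
    r"(?P<REAL>\d+\.\d+)"
    r"|(?P<INTEGER>\d+)"
    r"|(?P<OPERATOR>\+\+|--|==|<=|>=|!=|[-+*/<>=!])"
    r"|(?P<SKIP>\s+)"
)


def classify(source):
    """Return (type, lexeme) pairs for a small numeric/operator subset.

    Characters that match no alternative are silently ignored by finditer.
    """
    return [(m.lastgroup, m.group())
            for m in TOKEN.finditer(source)
            if m.lastgroup != "SKIP"]


print(classify("3.14 <= 42 ++"))
```

Swapping the REAL and INTEGER alternatives is a quick way to watch the failure mode: `3.14` then comes out as an integer, a stray dot, and another integer.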

What I learned

Building PyLex Analyzer made the lexer stage tangible: I saw how regex rules, keyword tables, and library-generated scanners fit together before any grammar work.

Regular expressions need discipline; small mistakes show up immediately in token streams
Separating library lexing from project-specific grouping keeps output policies flexible
Vendoring a well-known tool like PLY is a practical way to learn without reimplementing lex tables
Giving credit to upstream authors (David M. Beazley for PLY) is part of responsible use of embedded libraries

Current status

This project is no longer maintained and should be read as a completed learning prototype from 2022. I keep it in my portfolio because it documents early work on compiler front ends, third-party library integration, and CLI tooling around token inspection.

If I revisited this today

Add a minimal test harness that asserts token sequences for each example file
Wire up ply.yacc with a small grammar only if the goal expands beyond lexing
Align the keyword list with a real C/C++ standard or generate it from a spec
Support reading arbitrary file paths from the command line, not only bundled examples
Emit JSON or LSP-friendly output for tooling instead of only formatted console tables
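
Two of these ideas, asserting token sequences and emitting JSON, could start from something as small as the sketch below. The `tokens_to_json` helper, the sample token stream, and its group ids are all hypothetical; a real harness would take its records from the lexer output for each example file.

```python
import json


def tokens_to_json(records):
    """Serialize (group, type, value) tuples as a JSON array of objects."""
    return json.dumps(
        [{"group": g, "type": t, "value": v} for g, t, v in records],
        indent=2,
    )


# Hypothetical token stream; group ids and type names are assumptions.
sample = [(4, "KEYWORD", "int"), (3, "IDENTIFIER", "x"), (1, "DELIMITER", ";")]

# A regression check compares the produced types with a recorded sequence.
expected_types = ["KEYWORD", "IDENTIFIER", "DELIMITER"]
assert [t for _, t, _ in sample] == expected_types

print(tokens_to_json(sample))
```

Storing one expected sequence per example file would turn the manual console checks into a repeatable test run.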
