A small lexer that reads C-style source, tokenizes it with PLY, and prints each lexeme with a stable numeric group for quick inspection.
| Field | Details |
|---|---|
| Type | Command-line lexical analyzer |
| Context | Academic prototype |
| Role | Solo developer |
| Year | 2022 |
| Status | Completed prototype |
| Main focus | Token rules, C-like keyword handling, grouped token output |
PyLex Analyzer is a Python tool I built to practice the first stage of compilation: turning source text into a stream of typed tokens. It targets C-like syntax (delimiters, operators, identifiers, keywords, literals, and comments) and prints each token in a fixed-width table so I could see what the lexer was producing while testing sample programs.
The project is a learning prototype, not a full compiler. Parsing and code generation are out of scope. I keep it in my portfolio because it shows how I modeled token classes, integrated a mature lexer library, and built a simple CLI around repeatable examples.
Before building a parser or interpreter, I needed a reliable way to break source code into meaningful pieces and verify that edge cases (keywords vs identifiers, floats vs integers, comments) were handled consistently.
I designed and implemented the full flow end to end: token rules, keyword lists, grouping logic, file selection, console input mode, example programs, and formatted output. I chose PLY for the lexer engine and wrote the project-specific rules and helpers on top of it.
The tool loads C-like source (from bundled examples or pasted input), runs it through a PLY-backed lexer, and prints each token with a group id, internal type name, and value. It also prints a reference table that maps group numbers to lexeme categories.
Regular expressions and small token functions cover the constructs I cared about for classroom-style C++ samples, including compound operators and both block and line comments.
```python
def t_IDENTIFIER(t):
    r'[a-zA-Z_]+[a-zA-Z0-9_]*'
    if t.value in keyword:
        t.type = 'KEYWORD'
    return t
```

I kept keyword detection inside the identifier rule so reserved words never show up as generic identifiers in the output.
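To make the ordering concerns concrete, here is a minimal sketch that uses only the standard-library `re` module instead of PLY. It is not the project's rule set: the keyword set, the operator list, and the choice to discard comments are assumptions made for illustration. It shows why comments must be tried before operators, floats before integers, and compound operators before their single-character prefixes.

```python
import re

# Hypothetical keyword set; the project's real list lives in constants.py.
KEYWORDS = {"int", "float", "return", "if", "else", "while"}

# Alternatives are tried left to right, so order matters:
# comments before operators (otherwise '/' would win), floats before
# integers, compound operators before single-character ones.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/|//[^\n]*)
  | (?P<FLOAT>\d+\.\d+)
  | (?P<INTEGER>\d+)
  | (?P<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<OPERATOR>\+\+|--|==|!=|<=|>=|&&|\|\||[-+*/%=<>!])
  | (?P<DELIMITER>[(){}\[\];,])
  | (?P<SKIP>\s+)
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return (type, text) pairs; whitespace and comments are dropped."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("SKIP", "COMMENT"):
            continue
        # Same idea as t_IDENTIFIER: promote reserved words after matching.
        if kind == "IDENTIFIER" and text in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for kind, text in tokenize("int x = 3.14; // pi\nx++;"):
        print(kind, text)
```

Promoting keywords after the identifier match avoids a separate regex per keyword and keeps names such as `integer` from being split into `int` plus `eger`.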
Each lexeme category maps to a stable group number used in the printed stream and summary table.
```python
def token_group(tok):
    group = 0
    if tok == tokens[0]:  # DELIMITER
        group = group_number[0]
    elif tok == tokens[1]:  # OPERATOR
        group = group_number[1]
    # ...
    return group
```

The grouping layer sits above PLY so the console view stays compact even when the underlying token type names are verbose.
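The same mapping can also be expressed as a dictionary lookup, which doubles as the source for the reference table. This is an alternative sketch, not the project's code: the token names, group numbers, and `reference_table` helper below are hypothetical stand-ins for what constants.py and helpers.py actually define.

```python
# Hypothetical names and numbers; the project keeps the real
# token names and group ids in constants.py.
TOKEN_NAMES = ("DELIMITER", "OPERATOR", "IDENTIFIER", "KEYWORD", "INTEGER", "FLOAT")
GROUP_NUMBERS = (1, 2, 3, 4, 5, 6)

# Pair each token type with its group number once, up front.
GROUP_BY_TYPE = dict(zip(TOKEN_NAMES, GROUP_NUMBERS))


def token_group(token_type):
    """Return the group number for a token type, or 0 if it is unknown."""
    return GROUP_BY_TYPE.get(token_type, 0)


def reference_table():
    """Build the lines of a group-number reference table."""
    lines = ['{:>5} | {:<10}'.format('group', 'category')]
    for name, number in GROUP_BY_TYPE.items():
        lines.append('{:>5} | {:<10}'.format(number, name))
    return lines


if __name__ == "__main__":
    print("\n".join(reference_table()))
```

A lookup table keeps the group ids in one place, so adding a lexeme category means adding one entry rather than another `elif` branch.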
main.py lists files under test/examples/ and tokenizes the chosen program. console.py accepts pasted source for quick one-off checks without creating a file.
The main loop formats token group, type, and value in aligned columns and appends a list of group ids for the whole file, which made regression checks on sample programs straightforward.
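As a rough illustration of how a list of group ids supports regression checks, the sketch below formats a row with the same column widths and compares two id lists. The helper names `format_row` and `check_groups` and the sample ids are assumptions for this example, not functions from the project.

```python
def format_row(group, token_type, value):
    """Format one token as an aligned row: group, type, value."""
    return '{:>5} | {:<10} | {:<64}'.format(group, token_type, value)


def check_groups(actual, expected):
    """Return the index of the first mismatch between two id lists, or None."""
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    # One list is a prefix of the other: the mismatch is where it ends.
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


if __name__ == "__main__":
    print(format_row(4, 'KEYWORD', 'int'))
    # Hypothetical ids recorded from a known-good run vs. a new run.
    mismatch = check_groups([4, 3, 2, 5], [4, 3, 2, 6])
    print('first mismatch at token index:', mismatch)
```

Comparing group ids rather than full rows makes the check insensitive to cosmetic changes in column widths.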
The architecture is intentionally flat: tokrules.py defines MyLexer(), which returns a PLY lexer instance; main.py and console.py feed input and print results; constants.py holds keywords, token names, and group numbers; helpers.py handles file picking and grouping.
Scanning is powered by PLY (Python Lex-Yacc) by David M. Beazley (Dabeaz LLC). The project includes the vendored ply/ package (lex.py and yacc.py) under PLY’s license terms. I use ply.lex to build and run the lexer; ply.yacc is present as part of PLY but is not integrated into this prototype’s pipeline yet. PLY brings the classic lex/yacc workflow to Python; my custom work is the token rules, keyword table, grouping scheme, and CLI.
```python
def MyLexer():
    # ... token rules ...
    return lex.lex()
```

I wrapped rule definitions in a factory function so PLY builds the lexer from a clean local namespace each time.
```python
while True:
    tok = lexer.token()
    if not tok:
        break
    tok_group = token_group(tok.type)
    data_tokens.append(tok_group)
    print('{:>5} | {:<10} | {:<64}'.format(tok_group, tok.type, tok.value))
```

The consumer loop stays dumb on purpose: all classification complexity lives in tokrules.py and helpers.py.
Error handling for illegal characters prints the offending character and skips one position, which keeps exploratory runs going without crashing on a single bad symbol.
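In PLY, this kind of recovery is typically written as a `t_error` handler that reports the character and calls `t.lexer.skip(1)`. The sketch below reproduces the same report-and-skip strategy with the standard-library `re` module so it runs without PLY. The tiny rule set and the `scan` function are hypothetical, not the project's code.

```python
import re

# Deliberately tiny, hypothetical rule set for illustration only.
TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<SKIP>\s+)")


def scan(source):
    """Tokenize source, reporting illegal characters instead of stopping."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # Recovery: report the offending character, skip one position.
            errors.append((pos, source[pos]))
            print("Illegal character {!r} at position {}".format(source[pos], pos))
            pos += 1
            continue
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, errors


if __name__ == "__main__":
    print(scan("a @ 12 $b"))
```

Skipping exactly one character is the most conservative recovery: the scanner resumes at the next position, so a single bad symbol cannot hide the valid tokens that follow it.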
I optimized for clarity and inspection speed over completeness. The CLI prompts are in Spanish because that matched how I was working with the examples at the time.
I kept console.py separate from the file-based main.py so pasted snippets did not require temp files.

Treating printf and main as keywords is useful for demos but does not follow a full C++ keyword standard.

Building PyLex Analyzer made the lexer stage tangible: I saw how regex rules, keyword tables, and library-generated scanners fit together before any grammar work.
This project is no longer maintained and should be read as a completed learning prototype from 2022. I keep it in my portfolio because it documents early work on compiler front ends, third-party library integration, and CLI tooling around token inspection.
A possible next step: integrate ply.yacc with a small grammar, only if the goal expands beyond lexing.

A Python lexical analyzer for C-like source that classifies tokens into numbered groups and prints a readable token stream.
Oct 2022 – Nov 2022