Introduction
This post showcases how to code a small language for a build system, with a niche set of features tailored to one specific task (creating a build system, in my case).
It won’t describe the traditional approach to writing a language from scratch. Instead, it shows how to quickly set up a rule set that can take simple text and do something useful with it, without delving deep into the concepts of language creation.
Based on my requirements, the language will support these features:
- A list is defined with the syntax: <VarName>=[<List Elements>]
- A dictionary is defined with the syntax: <VarName> = { <"Key1" : "Value1", …> }
- Conditional compilation using #if, #elif, #else and #endif. This won’t do any arithmetic processing
- Single-line comments starting with ‘//’
Every language starts by defining a basic set of rules it has to follow. I have defined the following rules to abide by:
- A block begins with @: and ends with :@
- Valid Symbols: (, ), [, ], {, }, ", #, if, else, elif, defined, and, or, //
- List is defined by [], dictionary by {}
- Expressions are parsed from left to right
- # regions must begin with if and terminate with endif
- #if and #elif evaluate a boolean expression and skip parsing the enclosed region if it evaluates to false
I’ll be using Python to code this as it fits my requirements. The final result of parsing a file is a Python dictionary whose keys are the variables defined in the file and whose values hold the information my build system needs. The requirements above are sufficient to gather what is needed to generate build files for C++ projects. This is a sample of my “BuildTemplate” file that implements the language with the rules mentioned above:
    @:
    #if defined DEBUG_BUILD
    "CMakeDefines"={
        "CMAKE_BUILD_TYPE" : "Debug"
    }
    #endif
    :@

    //Any comments between blocks are simply ignored. Doesn't have to start with '//'

    @:
    "PreProcessorDefines"=[
    #if defined DEBUG_BUILD
        "DEBUG_BUILD", // makes DEBUG_BUILD accessible as macro in C++
    #endif
        "QUAINT_PLATFORM_${BUILD_PLATFORM}"
    ]
    :@
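For reference, this is roughly the dictionary the sample should produce, going by the rules listed earlier. It is a hand-written illustration, not parser output. The names ExpectedWithDebug and ExpectedWithoutDebug are made up for this sketch, and it assumes ${BUILD_PLATFORM} is left untouched at this stage (the post doesn't say when, or whether, it gets substituted).

```python
# Hypothetical expected result when DEBUG_BUILD is defined:
# both #if regions are parsed.
ExpectedWithDebug = {
    "CMakeDefines": {"CMAKE_BUILD_TYPE": "Debug"},
    "PreProcessorDefines": ["DEBUG_BUILD", "QUAINT_PLATFORM_${BUILD_PLATFORM}"],
}

# Hypothetical expected result when DEBUG_BUILD is not defined:
# the first block contributes nothing and the list loses its first entry.
ExpectedWithoutDebug = {
    "PreProcessorDefines": ["QUAINT_PLATFORM_${BUILD_PLATFORM}"],
}

for Name, Result in (("debug", ExpectedWithDebug), ("release", ExpectedWithoutDebug)):
    print(Name, sorted(Result.keys()))
```

Each block contributes its own keys, and the results of all blocks are merged into one dictionary.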
In the following code snippets, there is a common theme: every function receives the contents of a block as a string, along with the current index into it.
Retrieving Blocks
I first retrieve one block at a time. The only reason for this is that it’s easier to debug when issues come up. I use Python’s re module to make this work, and the code to retrieve a block and start parsing it is quite simple. The following function opens the file at the path provided and retrieves the contents of each block as a string.
After parsing, the contents of a block are converted to a dictionary. We loop over all the blocks, merging each result into the dictionary that is returned.
    import os
    import re

    def ReadTemplateFile(TemplateFilePath):
        if not os.path.isfile(TemplateFilePath):
            print("Encountered something that's not a file when trying to read template file")
            return None

        with open(TemplateFilePath, "r") as stream:
            contents = stream.read()

        # re.finditer never returns None, so collect the matches and test for an empty list.
        # DOTALL lets '.' cross line breaks; the lazy '.*?' stops at the first ':@'.
        Blocks = re.findall(r"@:(.*?):@", contents, re.DOTALL)
        if not Blocks:
            print("No Module params found for this Module. This will skip Parsing Template")

        ParamDictionary = {}
        for BlockContents in Blocks:
            # Start at -1 because GetNextIndex advances before it reads a character
            ResDict = ParseBlock(BlockContents, -1)
            ParamDictionary.update(ResDict)
        return ParamDictionary
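To try the block extraction on its own, without touching the file system, here is a small sketch. ExtractBlocks and BLOCK_PATTERN are hypothetical names introduced for this example, and the lazy DOTALL pattern is one way to capture everything between @: and :@, not necessarily the exact pattern you want in production.

```python
import re

# Hypothetical helper: isolates the regex step from the file handling.
# DOTALL lets '.' match line breaks; '.*?' stops at the first ':@'.
BLOCK_PATTERN = re.compile(r"@:(.*?):@", re.DOTALL)

def ExtractBlocks(Contents: str) -> list[str]:
    """Return the text between each '@:' and ':@' pair, in file order."""
    return BLOCK_PATTERN.findall(Contents)

Sample = '''@:
"A"=[ "x" ]
:@
text between blocks is ignored
@:
"B"={ "k" : "v" }
:@
'''

Blocks = ExtractBlocks(Sample)
print(len(Blocks))          # 2
print(Blocks[0].strip())    # "A"=[ "x" ]
print(Blocks[1].strip())    # "B"={ "k" : "v" }
```

One caveat with any pattern of this kind: a literal :@ inside a string value would end the block early.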
Traversing Block Contents
To start parsing anything, we need to move from one character to the next. Let’s define two functions that will make our lives easier. Their purpose is to skip any unwanted characters and return the index of the next valid character after the current position.
    def GetNextValidCharacterIndex(Param, Index) -> int:
        # Step past the current character, then skip spaces and line breaks
        Index += 1
        while Index < len(Param) and Param[Index] in (' ', '\n', '\r'):
            Index += 1
        return Index

    def GetNextIndex(Param, Index) -> int:
        Index = GetNextValidCharacterIndex(Param, Index)
        if Index >= len(Param):
            return Index
        # '//' starts a single-line comment; a lone '/' is left for the caller to reject
        if Param[Index] == '/' and Index + 1 < len(Param) and Param[Index + 1] == '/':
            # Skip everything up to (but not including) the line break
            while Index < len(Param) and Param[Index] not in ('\n', '\r'):
                Index += 1
            # Recurse so blank lines and back-to-back comments are skipped too
            Index = GetNextIndex(Param, Index)
        return Index
GetNextValidCharacterIndex does exactly what its name implies. It skips the characters that I deem irrelevant while parsing (spaces and line breaks) and returns the index of the next valid character.
GetNextIndex is a little more special. It returns the index of the next character that should be processed by the parser. In the snippet, if it encounters a double slash (//), it treats it as a single-line comment and skips every character up to the next line break. It then retrieves the index of the first valid character after the comment.
We will extend this function in a future post to evaluate the macro (#) statements and conditionally disable parsing of certain portions of the file.
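To see the traversal in action, the sketch below repeats compact versions of the two helpers so it runs on its own, then walks a small string and records every character the parser would land on. VisitedCharacters is a hypothetical helper added for this demonstration, and the bounds checks are my own assumption about how end-of-input should behave.

```python
def GetNextValidCharacterIndex(Param, Index) -> int:
    # Step past the current character, then skip spaces and line breaks
    Index += 1
    while Index < len(Param) and Param[Index] in (' ', '\n', '\r'):
        Index += 1
    return Index

def GetNextIndex(Param, Index) -> int:
    Index = GetNextValidCharacterIndex(Param, Index)
    if Index >= len(Param):
        return Index
    if Param.startswith("//", Index):
        # Skip the comment up to (not including) the line break
        while Index < len(Param) and Param[Index] not in ('\n', '\r'):
            Index += 1
        Index = GetNextIndex(Param, Index)
    return Index

def VisitedCharacters(Param) -> str:
    """Hypothetical helper: collect every character the traversal stops on."""
    Visited = []
    Index = GetNextIndex(Param, -1)  # start at -1 so index 0 is the first candidate
    while Index < len(Param):
        Visited.append(Param[Index])
        Index = GetNextIndex(Param, Index)
    return "".join(Visited)

Block = "  [ 1, // first\n  2 ] // trailing"
print(VisitedCharacters(Block))  # [1,2]
```

Whitespace and both comments disappear, including the one that runs to the end of the input. In the real parser the walk would not visit string contents one character at a time, since the string parser consumes those itself.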
Parsing Block
Now that we can traverse characters in a block, let’s implement a function to parse it.
To do this, I need to identify the type of parameter that starts at the current character. The helper function below does that by looking at a single character.
    from enum import Enum

    class ParamType(Enum):
        EDictionary = 0
        EList = 1
        EMacro = 2
        ENumber = 3
        EString = 4
        EComma = 5
        EInvalid = 6

    def IdentifyParamType(Param : str, Index) -> ParamType:
        if (Param is None) or (len(Param) == 0):
            return ParamType.EInvalid

        Type = ParamType.EInvalid
        assert (Index < len(Param)), "Invalid Index retrieved"
        c = Param[Index]
        if c == '{':
            Type = ParamType.EDictionary
        elif c == '[' or c == '(':
            Type = ParamType.EList
        elif c == '#':
            Type = ParamType.EMacro
        elif '0' <= c <= '9':
            Type = ParamType.ENumber
        elif c == "\"" or c == "\'":
            Type = ParamType.EString
        elif c == ",":
            Type = ParamType.EComma
        else:
            assert False, "Invalid Type Encountered"
        return Type
All the data in our “Blocks” are key-value pairs for now. To parse one, we first read a string (the key), followed by ‘=’ and finally its value (which could be a string, number, list, dictionary or any structure that’s implemented later).
To retrieve a value whose type isn’t known up front, I have another function called “ProcessParam”.
    def ProcessParam(Param, Index) -> tuple[dict | list | str | int | float | None, int]:
        # Dispatch on the first character of the value
        Type = IdentifyParamType(Param, Index)
        Res = None
        if Type == ParamType.EDictionary:
            (Res, Index) = ParseDictionary(Param, Index)
        elif Type == ParamType.EList:
            (Res, Index) = ParseList(Param, Index)
        elif Type == ParamType.ENumber:
            (Res, Index) = ParseNumber(Param, Index)
        elif Type == ParamType.EString:
            (Res, Index) = ParseString(Param, Index)
        else:
            assert False, "Trying to parse invalid type"
        return (Res, Index)

    def ParseBlock(Param, Index) -> dict:
        ParamDictionary = {}
        Index = GetNextIndex(Param, Index)
        while Index < len(Param):
            Type = IdentifyParamType(Param, Index)
            if Type == ParamType.EString:
                # Every entry has the form "Key" = <value>
                (KeyEntry, Index) = ParseString(Param, Index)
                Index = GetNextIndex(Param, Index)
                assert Index < len(Param) and Param[Index] == '=', "Not a valid entry"
                Index = GetNextIndex(Param, Index)
                (ParamDictionary[KeyEntry], Index) = ProcessParam(Param, Index)
            elif Type == ParamType.EComma:
                # Commas between entries carry no information
                pass
            else:
                assert False, "Invalid Character encountered when parsing block"
            Index = GetNextIndex(Param, Index)
        return ParamDictionary
The Parse* functions called inside ProcessParam are placeholders for now; I’ll implement them in a future post. The idea here is to parse and retrieve the correct “value” for our “key”.
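To make the calling convention concrete in the meantime, here is a rough sketch of what two of those placeholders could look like. This is a guess, not the implementation from the follow-up post. It assumes each parser receives the index of the value's first character and returns the index of the last character it consumed, which fits the way ParseBlock calls GetNextIndex right afterwards. Escape sequences and non-integer numbers are deliberately left out.

```python
def ParseString(Param: str, Index: int) -> tuple[str, int]:
    """Assumed contract: Index points at the opening quote; the returned
    index points at the closing quote."""
    Quote = Param[Index]
    End = Param.find(Quote, Index + 1)
    assert End != -1, "Unterminated string"
    return (Param[Index + 1:End], End)

def ParseNumber(Param: str, Index: int) -> tuple[int, int]:
    """Assumed contract: Index points at the first digit; the returned
    index points at the last digit consumed."""
    End = Index
    while End + 1 < len(Param) and '0' <= Param[End + 1] <= '9':
        End += 1
    return (int(Param[Index:End + 1]), End)

Text = '"CMAKE_BUILD_TYPE" = 42'
(Key, KeyEnd) = ParseString(Text, 0)
(Value, ValueEnd) = ParseNumber(Text, 21)
print(Key, KeyEnd)        # CMAKE_BUILD_TYPE 17
print(Value, ValueEnd)    # 42 22
```

Returning the index of the last consumed character, instead of the one after it, keeps every parser compatible with helpers that advance before they read.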
That is it for this post. In the next one, I’ll showcase how to parse strings, lists and dictionaries.
Thank you for reading! Don’t be a stranger 🙂