// parser.h written by Mitchell Foral. mitchell<att>caladbolg.net.
// See COPYING for license information.

#ifndef OHCOUNT_PARSER_H
#define OHCOUNT_PARSER_H

#include "sourcefile.h"

/**
 * @page parser_doc Parser Documentation
 * @author Mitchell Foral
 *
 * @section overview Overview
 *
 * I will assume the reader has a decent knowledge of how Ragel works and the
 * Ragel syntax. If not, please review the Ragel manual found at:
 *   http://research.cs.queensu.ca/~thurston/ragel/
 *
 * All parsers must at least:
 *
 * @li Call a callback function when a line of code is parsed.
 * @li Call a callback function when a line of comment is parsed.
 * @li Call a callback function when a blank line is parsed.
 *
 * Additionally a parser can call the callback function for each position of
 * entities parsed.
 *
 * Take a look at 'c.rl' and even keep it open for reference when reading this
 * document to better understand how parsers work and how to write one.
 *
 * @section writing Writing a Parser
 *
 * First create your parser in 'src/parsers/'. Its name should be the language
 * you are parsing with a '.rl' extension. You will not have to manually compile
 * any parsers, as this is done automatically for you. However, you do need to
 * add your parser to 'hash/parsers.gperf'.
 *
 * Every parser must have the following at the top:
 *
 * @include parser_doc_1
 *
 * And the following at the bottom:
 *
 * @include parser_doc_2
 *
 * (Your parser will go between these two blocks.)
 *
 * The code can be found in the existing 'c.rl' parser. You will need to change:
 * @li OHCOUNT_[lang]_PARSER_H - Replace [lang] with your language name. So if
 *   you are writing a C parser, it would be OHCOUNT_C_PARSER_H.
 * @li [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
 *   the name of your language to parse as defined in languages.h. [lang] is
 *   your language name. For C it would be C_LANG.
 * @li [lang]_entities - Set the variable name to be [lang]_entities (e.g.
 *   c_entities). The value is an array of string entities your language has.
 *   For example, C has comment, string, number, etc. entities. You should
 *   definitely have "space" and "any" entities. "any" entities are typically
 *   used for entity machines (discussed later) and match any character that is
 *   not recognized so the parser does not do something unpredictable.
 * @li enum - Change the value of the enum to correspond with your entities. So
 *   if in your parser you look up [lang]_entities[ENTITY], you will get the
 *   associated entity's string name.
 * @li parse_[lang] - Set the function name to parse_[lang] where again, [lang]
 *   is the name of your language. In the case of C, it is parse_c.
 * @li [lang]_en_[lang]_line - The line counting machine.
 * @li [lang]_en_[lang]_entity - The entity machine.
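 *
 * As a rough illustration of how the entity array and the enum line up, here is
 * a minimal sketch. The entity list is shortened, and the C_ENTITY_COUNT
 * constant and c_entity_name() helper are hypothetical; see 'c.rl' for the real
 * definitions.
 *
```c
#include <stddef.h>

// Shortened, illustrative entity list; the real one is defined in 'c.rl'.
const char *c_entities[] = { "space", "comment", "string", "any" };

// The enum order must mirror the array order so that each constant indexes
// its own name. C_ENTITY_COUNT is a hypothetical sentinel for bounds checks.
enum { C_SPACE = 0, C_COMMENT, C_STRING, C_ANY, C_ENTITY_COUNT };

// Hypothetical helper: returns the string name for an entity constant, or
// NULL when the constant is out of range.
const char *c_entity_name(int entity) {
  if (entity < 0 || entity >= C_ENTITY_COUNT) return NULL;
  return c_entities[entity];
}
```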
 *
 * You may be asking why you have to rename variables and functions. Well if
 * variables have the same name in header files (which is what parsers are), the
 * compiler complains. Also, when you have languages embedded inside each other,
 * any identifiers with the same name can easily be mixed up. It is also
 * important to prefix your Ragel definitions with your language to avoid
 * conflicts with other parsers.
 *
 * Additional variables available to parsers are in the parser_macros.h file.
 * Take a look at it and try to understand what the variables are used for. They
 * will make more sense later on.
 *
 * Now you can define your Ragel parser. Name your machine after your language,
 * "write data", and include 'common.rl', a file with common Ragel definitions,
 * actions, etc. For example:
 *
 * @include parser_doc_3
 *
 * Before you begin to write patterns for each entity in your language, you need
 * to understand how the parser should work.
 *
 * Each parser has two machines: one optimized for counting lines of code,
 * comments, and blanks; the other for identifying entity positions in the
 * buffer.
 *
 * @section line Line Counting Machine
 *
 * This machine should be written as a line-by-line parser for multiple lines.
 * This means you match any combination of entities except a newline up until
 * you do reach a newline. If the line contains only spaces, or nothing at all,
 * it is blank. If the line contains spaces at first, but then a comment, or
 * just simply a comment, the line is a comment. If the line contains anything
 * but a comment after spaces (if there are any), it is a line of code. You
 * will do this using a Ragel scanner. The callback function will be called for
 * each line parsed.
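 *
 * The classification rule above can be sketched in plain C, outside of Ragel.
 * This is only an illustration: classify_line() is a hypothetical helper, and
 * it assumes that "//" is the only comment syntax in the language.
 *
```c
#include <string.h>

// Classifies one line (given without its newline) following the rule
// described above. Assumption: "//" is the only comment syntax.
// Returns "lblank", "lcomment", or "lcode".
const char *classify_line(const char *line) {
  // Skip leading spaces and tabs.
  while (*line == ' ' || *line == '\t') line++;
  // Nothing left: the line held only whitespace, or nothing at all.
  if (*line == '\0') return "lblank";
  // The first visible characters start a comment.
  if (strncmp(line, "//", 2) == 0) return "lcomment";
  // Anything else after the leading spaces is code.
  return "lcode";
}
```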
 *
 * @subsection line_scanner Scanner Parser Structure
 *
 * A scanner parser will look like this:
 *
 * @include parser_doc_4
 *
 * (As usual, replace [lang] with your language name.)
 *
 * Each entity is the pattern for an entity to match, the last one typically
 * being the newline entity. For each match, the entity variable is set to a
 * constant defined in the enum, and the main action is called (you will need to
 * create this action above the scanner).
 *
 * When you detect whether or not a line is code or comment, you should call the
 * appropriate \@code or \@comment action defined in 'common.rl' as soon as
 * possible. It is not necessary to worry about whether or not these actions are
 * called more than once for a given line; the first call to either sets the
 * status of the line permanently. Sometimes you cannot call \@code or \@comment
 * for one reason or another. Do not worry, as this is discussed later.
 *
 * When you reach a newline, you will need to decide whether the current line is
 * a line of code, comment, or blank. This is easy. Simply check if the
 * #line_contains_code or #whole_line_comment variables are set to 1. If neither
 * is, the line is blank. Then call the callback function (not action)
 * with an "lcode", "lcomment", or "lblank" string, and the start and end
 * positions of that line (including the newline). The start position of the
 * line is in the #line_start variable. It should be set at the beginning of
 * every line either through the \@code or \@comment actions, or manually in the
 * main action. Finally the #line_contains_code, #whole_line_comment, and
 * #line_start state variables must be reset. All this should be done within the
 * main action shown below. Note: For most parsers, the std_newline(lang) macro
 * is sufficient and does everything in the main action mentioned above. The
 * lang parameter is the [lang]_LANG string.
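 *
 * The newline steps described above might look roughly like this in plain C.
 * This is a sketch under assumptions, not the std_newline macro itself:
 * LineState, LastLine, on_newline() and remember_line() are hypothetical names,
 * positions are plain integer offsets, #line_start is simply advanced to the
 * next line rather than reset, and the first callback argument is assumed to be
 * the language name.
 *
```c
#include <stddef.h>

// Callback type copied from the ohcount_parse() declaration in this header.
typedef void (*parser_callback)(const char *, const char *, int, int, void *);

// Hypothetical container for the state variables named above.
typedef struct {
  int line_contains_code;
  int whole_line_comment;
  int line_start;  // offset where the current line begins
} LineState;

// Reports the finished line, then resets the state for the next one.
// 'line_end' is the offset just past the newline.
void on_newline(LineState *s, const char *lang, int line_end,
                parser_callback callback, void *userdata) {
  const char *kind = "lblank";
  if (s->line_contains_code) kind = "lcode";
  else if (s->whole_line_comment) kind = "lcomment";
  callback(lang, kind, s->line_start, line_end, userdata);
  s->line_contains_code = 0;
  s->whole_line_comment = 0;
  s->line_start = line_end;
}

// Hypothetical record of the most recent report, used by the callback below.
typedef struct {
  const char *kind;
  int start;
  int end;
} LastLine;

// Example callback: remembers the most recent line report.
void remember_line(const char *lang, const char *kind, int start, int end,
                   void *userdata) {
  LastLine *last = (LastLine *)userdata;
  (void)lang;
  last->kind = kind;
  last->start = start;
  last->end = end;
}
```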
 *
 * @subsection line_action Main Action Structure
 *
 * The main action looks like this:
 *
 * @include parser_doc_5
 *
 * @subsection line_entity_patterns Defining Patterns for Entities
 *
 * Now it is time to write patterns for each entity in your language. That does
 * not seem very hard, except when your entity can cover multiple lines.
 * Comments and strings in particular can do this. To make an accurate line
 * counter, you will need to count the lines covered by multi-line entities.
 * When you detect a newline inside your multi-line entity, you should set the
 * entity variable to be #INTERNAL_NL and call the main action. The main action
 * should have a case for #INTERNAL_NL separate from the newline entity. In it,
 * you will check if the current line is code or comment and call the callback
 * function with the appropriate string ("lcode" or "lcomment") and beginning
 * and end of the line (including the newline). Afterwards, you will reset the
 * #line_contains_code and #whole_line_comment state variables. Then set the
 * #line_start variable to be #p, the current Ragel buffer position. Because
 * #line_contains_code and #whole_line_comment have been reset, any non-newline
 * and non-space character in the multi-line pattern should set
 * #line_contains_code or #whole_line_comment back to 1. Otherwise you would count
 * the line as blank.
 *
 * Note: For most parsers, the std_internal_newline(lang) macro is sufficient
 * and does everything in the main action mentioned above. The lang parameter
 * is the [lang]_LANG string.
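 *
 * To illustrate why the state must be set again after each internal newline,
 * here is a minimal sketch that walks the body of a multi-line comment. It is
 * not the std_internal_newline macro: InnerTally and scan_comment_body() are
 * hypothetical, the buffer is assumed to lie entirely inside one comment, and
 * counters stand in for the callback to keep the example short.
 *
```c
#include <stddef.h>

// Hypothetical tallies; a real parser would invoke the callback instead.
typedef struct {
  int comment_lines;
  int blank_lines;
  size_t line_start;  // offset just past the most recent internal newline
} InnerTally;

// Walks 'buf', which is assumed to lie entirely inside one multi-line
// comment, and tallies every line that ends in a newline.
void scan_comment_body(const char *buf, size_t len, InnerTally *tally) {
  int whole_line_comment = 0;
  size_t p;
  for (p = 0; p < len; p++) {
    if (buf[p] == '\n') {
      // Internal newline: report the line, then reset the state.
      if (whole_line_comment) tally->comment_lines++;
      else tally->blank_lines++;
      whole_line_comment = 0;
      tally->line_start = p + 1;
    } else if (buf[p] != ' ' && buf[p] != '\t') {
      // Any visible character marks the line as a comment again; without
      // this, the line would be counted as blank.
      whole_line_comment = 1;
    }
  }
}
```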
 *
 * For multi-line matches, it is important to call the \@code or \@comment
 * actions (mentioned earlier) before an internal newline is detected so the
 * #line_contains_code and #whole_line_comment variables are properly set. For
 * other entities, you can use the #code macro inside the main action which
 * executes the same code as the Ragel \@code action. Other C macros are
 * #comment and #ls; the latter is typically used for the SPACE entity when
 * defining #line_start.
 *
 * Also for multi-line matches, it may be necessary to use the \@enqueue and
 * \@commit actions. If it is possible that a multi-line entity will not have an
 * ending delimiter (for example a string), use the \@enqueue action as soon as
 * the start delimiter has been detected, and the \@commit action as soon as
 * the end delimiter has been detected. This will eliminate the potential for
 * any counting errors.
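 *
 * One way to picture the enqueue and commit idea is a small queue that holds
 * line reports back until the end delimiter is seen. The sketch below is an
 * assumption about the general technique, not the parser's own code: LineQueue,
 * queue_line(), commit_lines(), discard_lines() and count_report() are
 * hypothetical names, and dropping the held reports when no end delimiter
 * appears is one plausible policy.
 *
```c
// Callback type copied from the ohcount_parse() declaration in this header.
typedef void (*parser_callback)(const char *, const char *, int, int, void *);

#define MAX_PENDING 16  // arbitrary limit chosen for this sketch

// One deferred callback invocation.
typedef struct {
  const char *lang;
  const char *kind;
  int start;
  int end;
} PendingLine;

// Hypothetical fixed-size queue of deferred reports.
typedef struct {
  PendingLine items[MAX_PENDING];
  int count;
} LineQueue;

// Stores a line report instead of delivering it. Returns 0 if the queue is
// full, 1 otherwise.
int queue_line(LineQueue *q, const char *lang, const char *kind,
               int start, int end) {
  if (q->count >= MAX_PENDING) return 0;
  q->items[q->count].lang = lang;
  q->items[q->count].kind = kind;
  q->items[q->count].start = start;
  q->items[q->count].end = end;
  q->count++;
  return 1;
}

// End delimiter found: deliver everything that was held back, in order.
void commit_lines(LineQueue *q, parser_callback callback, void *userdata) {
  int i;
  for (i = 0; i < q->count; i++) {
    callback(q->items[i].lang, q->items[i].kind, q->items[i].start,
             q->items[i].end, userdata);
  }
  q->count = 0;
}

// End delimiter never found: drop the held reports without delivering them.
void discard_lines(LineQueue *q) {
  q->count = 0;
}

// Example callback: counts how many reports were delivered.
void count_report(const char *lang, const char *kind, int start, int end,
                  void *userdata) {
  (void)lang; (void)kind; (void)start; (void)end;
  (*(int *)userdata)++;
}
```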
 *
 * @subsection line_notes Notes
 *
 * You can be a bit sloppy with the line counting machine. For example the only
 * C entities that can contain newlines are strings and comments, so
 * #INTERNAL_NL would only be necessary inside them. Other than those, anything
 * other than spaces is considered code, so do not waste your time defining
 * specific patterns for other entities.
 *
 * @subsection line_embedded Parsers with Embedded Languages
 *
 * Notation: [lang] is the parent language, [elang] is the embedded language.
 *
 * To write a parser with embedded languages (such as HTML with embedded CSS and
 * JavaScript), you should first \#include the parser(s) above your Ragel code.
 * The header file is "[elang]_parser.h".
 *
 * Next, after the inclusion of 'common.rl', add "#EMBED([elang])" on separate
 * lines for each embedded language. The build process looks for these special
 * comments to embed the language for you automatically.
 *
 * In your main action, you need to add another entity #CHECK_BLANK_ENTRY. It
 * should call the #check_blank_entry([lang]_LANG) macro. A blank entry is an
 * entry into an embedded language where the rest of the line, up to the
 * newline, is blank. For example, a CSS entry in HTML is something like:
 *
 * @code
 *   <style type="text/css">
 * @endcode
 *
 * If there is no CSS code after the entry (a blank entry), the line should be
 * counted as HTML code, and the #check_blank_entry macro handles this. But you
 * may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?". This will be
 * discussed in just a bit.
 *
 * The #emb_newline and #emb_internal_newline macros should be used instead of
 * the #std_newline and #std_internal_newline macros.
 *
 * For each embedded language you will have to define an entry and outry. An
 * entry is the pattern that transitions from the parent language into the child
 * language. An outry is the pattern from child to parent. You will need to put
 * your entries in your [lang]_line machine. You will also need to re-create
 * each embedded language's line machine (define as [lang]_[elang]_line; e.g.
 * html_css_line) and put outry patterns in those. Entries typically would be
 * defined as [lang]_[elang]_entry, and outries as [lang]_[elang]_outry.
 *
 * Note: An outry should have a \@check_blank_outry action so the line is not
 * mistakenly counted as a line of embedded language code if it is actually a
 * line of parent code.
 *
 * @subsection line_entry_action Entry Pattern Actions
 *
 * @include parser_doc_6
 *
 * This checks for a blank entry and, if one is found, counts the line as a
 * line of parent language code. Otherwise the macro does nothing. The machine
 * then transitions into the child language.
 *
 * @subsection line_outry_action Outry Pattern Actions
 *
 * @include parser_doc_7
 *
 * This sets the current Ragel parser position to the beginning of the outry so
 * the line is counted as a line of parent language code if no child code is on
 * the same line. The machine then transitions into the parent language.
 *
 * @section entity Entity Identifying Machine
 *
 * This machine does not have to be written as a line-by-line parser. It only
 * has to identify the positions of language entities, such as whitespace,
 * comments, strings, etc. in sequence. As a result, it can be written much
 * faster and more easily, with less thought, than a line counter. Using a
 * scanner is most efficient. The callback function will be called for each
 * entity parsed.
 *
 * The \@ls, \@code, \@comment, \@enqueue, and \@commit actions are completely
 * unnecessary in this machine.
 *
 * @subsection entity_scanner Scanner Structure
 *
 * @include parser_doc_8
 *
 * @subsection entity_action Main Action Structure
 *
 * @include parser_doc_9
 *
 * @subsection entity_embedded Parsers for Embedded Languages
 *
 * TODO:
 *
 * @section tests Including Written Tests for Parsers
 *
 * You should have two kinds of tests for parsers. One is a header file that
 * goes in the 'test/unit/parsers/' directory. The other is a pair of files: an
 * input source file that goes in the 'test/src_dir/' directory and an expected
 * output file that goes in the 'test/expected_dir/' directory.
 *
 * The header file will need to be "#include"ed in 'test/unit/parser_test.h'.
 * Then add the "all_[lang]_tests()" function to the "all_parser_tests()"
 * function.
 *
 * Recompile the tests for the changes to take effect.
 *
 * The other files added to the 'test/{src,expected}_dir/' directories will be
 * automatically detected and run with the test suite.
 */

/**
 * Tries to use an existing Ragel parser for the given language.
 * @param sourcefile A SourceFile created by ohcount_sourcefile_new().
 * @param count An integer flag indicating whether to count lines or parse
 *   entities.
 * @param callback A callback to use for every line or entity in the source
 *   file discovered (depends on count).
 * @param userdata Pointer to userdata used by callback (if any).
 * @return 1 if a Ragel parser is found, 0 otherwise.
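 *
 * As an illustration of the callback side, here is a minimal sketch of a
 * callback that tallies lines by kind. LineTally and tally_line() are
 * hypothetical names, and the sketch assumes the second callback argument
 * carries the "lcode", "lcomment", or "lblank" string described in the parser
 * documentation.
 *
```c
#include <string.h>

// Hypothetical accumulator handed to the callback through 'userdata'.
typedef struct {
  int code;
  int comment;
  int blank;
} LineTally;

// Callback with the signature that ohcount_parse() expects. Assumption: the
// second argument is the line kind ("lcode", "lcomment", or "lblank").
void tally_line(const char *lang, const char *kind, int start, int end,
                void *userdata) {
  LineTally *tally = (LineTally *)userdata;
  (void)lang; (void)start; (void)end;
  if (strcmp(kind, "lcode") == 0) tally->code++;
  else if (strcmp(kind, "lcomment") == 0) tally->comment++;
  else if (strcmp(kind, "lblank") == 0) tally->blank++;
  // Any other string (for example an entity name) is ignored here.
}
```
 *
 * With a SourceFile in hand, the call would then look something like
 * ohcount_parse(sourcefile, 1, tally_line, &tally), assuming a nonzero count
 * selects line counting.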
 */
int ohcount_parse(SourceFile *sourcefile, int count,
                  void (*callback) (const char *, const char *, int, int,
                                    void *),
                  void *userdata);

#endif