\documentclass[a4paper]{article}
%include lhs2TeX.fmt
%include lhs2TeX.sty
%include pescofmt.fmt

\newcommand{\codequote}{}


\title   {  module Regex
            \\
            {\large --- Regular expression matching ``better than Perl'' ---}
         }
\author  {  Sven Moritz Hallberg \texttt{<pesco@@gmx.de>}  }
\date    {  December 6th, 2004  }

%% REVISION HISTORY %%
%
% 0.2: December 6th, 2004  SMH
%   More documentation, code cleanup, export list.
% 0.1: November 30th, 2004  SMH
%   Initial version.


\begin{document}
\maketitle

\begin{abstract}
This document is a literate Haskell module.
It wraps |Text.Regex|. It exposes functions for regex compiling,
matching, and substitution.
The functions are overloaded so strings or compiled regexes
can be passed interchangeably wherever a regular expression
is expected. The substitution operator is a polyvariadic function taking
any combination of replacement strings and submatch references (|Int|s)
as arguments, thus
avoiding the horrors of escape characters.
\end{abstract}


\begin{code}
{-# OPTIONS -fglasgow-exts #-}
{-# OPTIONS -fallow-overlapping-instances #-}
{-# OPTIONS -fallow-undecidable-instances #-}
{-# OPTIONS -fallow-incoherent-instances #-}

module Regex
    (  Regex (match)       -- type class
    ,  Match (..)          -- data type
    ,  Subst               -- type class

    ,  (=~),   (~=)
    ,  ($~),   (~$)
    ,  (//~),  (~//)
    ,  (/~),   (~/)

    ,  CRegex              -- data type
    ,  Rexopt (..)         -- data type
    ,  cregex
    ,  subst
    ,  subst1

    ,  test                -- to be removed
    )
    where
\end{code}
\scriptsize
\begin{code}
import qualified Text.Regex as TR
import Data.Maybe (isJust)
import Data.List (unfoldr)
\end{code}
\normalsize

\pagebreak


% ========================================================================
\section*{Motivation}
When asked the inevitable\footnote{``Does it support regexes?''}
by a Perl programmer, what do we answer?

\begin{quote}
Of course it does, it uses the POSIX regex library,
just import |Text.Regex|, and have a look at
|mkRegex| and |matchRegex|\ldots
\end{quote}
which to the Perl programmer must sound like ``Basically, it works as in C''.
Therefore, instead I'd like to answer
\begin{quote}
Basically, it works just as in Perl.
\end{quote}
accompanied by appropriate mumbling about strong typing and aesthetics.

Of course Haskell neither can nor should absolutely resemble Perl.
I've tried to catch the essence that makes the use of regular expressions
so easy in Perl while still doing so
in what a prototypical Haskell programmer could consider ``the right way''.


% ========================================================================
\section*{Operators}

\begin{description}
\item[|s =~ r|] tests whether string |s| matches the regular expression |r|.
\savecolumns
\begin{code}
(=~)   :: (Regex rho) => String -> rho -> Bool
\end{code}
Notice the type class |Regex|. It alleviates the need to explicitly
``compile'' or ``make'' regexes. You can pass compiled expressions or
plain strings anywhere a |Regex| is expected.

\item[|s $~ r|] applies regex |r| to the string |s|, yielding the list
of all matches.
\restorecolumns
\begin{code}
($~)   :: (Regex rho) => String -> rho -> [Match]
\end{code}
The |Match| data type will be defined shortly. It's a record telling
the substring of |s| that matched, as well as any subexpression matches.

\item[|(s //~ r) p |\ldots] replaces any match of |r| in |s| with
the pattern |p |\ldots.
\restorecolumns
\begin{code}
(//~)  :: (Regex rho, Subst pi) => String -> rho -> pi
\end{code}
Notice the type class |Subst|. This operator takes a variable number of
arguments of possibly different types. The mechanism will become clear
when class |Subst| is defined. The effect, anyway, is that |p |\ldots
in the above can be an arbitrary sequence of |String| or |Int| arguments.
The |Int|s represent submatch references, so for example\footnote{%
\emph{Warning:} There is a bug in GHC versions $\le 6.2.2$ which causes
instance matching to fail on integer literals in the substitution patterns.
As a workaround, write |(id n)| instead:
\begin{code}
test = ("Hello, World!" //~ "W(o)rld") "Hell" (id 1) :: String
\end{code}
},
\restorecolumns
\codequote\begin{code}
test = ("Hello, World!" //~ "W(o)rld") "Hell" 1 :: String
\end{code}\codequote
yields |"Hello, Hello!"|.

\item[|(s /~ r) p |\ldots] is like |//~| but replaces only the first
match.
\restorecolumns
\begin{code}
(/~)   :: (Regex rho, Subst pi) => String -> rho -> pi
\end{code}
\end{description}

In addition to the above, each operator has a ``flipped'' sibling, the
rule being that ``the pattern goes on the same side as the
tilde\footnote{In plain text code, |=~| is written as @=~@ and |~=| as @~=@,
so |=~| is the one taking the pattern on the right.} (@~@)''.
\begin{code}
(~=)   :: (Regex rho) => rho -> String -> Bool
(~$)   :: (Regex rho) => rho -> String -> [Match]
(~//)  :: (Regex rho, Subst pi) => rho -> String -> pi
(~/)   :: (Regex rho, Subst pi) => rho -> String -> pi
\end{code}

All exported operators are non-associative and bind with priority 4. That
makes them bind looser than |++| and |:|, similar to |==|.
\begin{code}
infix 4 =~, ~=, $~, ~$, ~//, //~, ~/, /~
\end{code}


\section*{Non-Operators}

All the operators introduced above are based on the fundamental pattern
matching operation |match|, which is the single method of class |Regex|:
\begin{code}
class Regex rho where
    match :: rho -> String -> Maybe Match
\end{code}

The |Match| data type is a record containing all relevent information
pertaining to one match of the pattern.
\begin{code}
data Match  = Match {
    m_before      :: String,
       -- The substring preceding the match.
    m_match       :: String,
       -- The matching substring itself.
    m_after       :: String,
       -- The rest of the string after the match.
    m_submatches  :: [String]
       -- The list of strings matching the regex's subexpressions.
       -- Does \emph{not} include the entire match itself.
    }
    deriving (Eq, Show, Read)
\end{code}

Substitution is implemented in the non-polyvariadic function |subst|.
For completeness, it also has a sibling, |subst1|, that replaces only
the first match.
\begin{code}
subst   :: (Regex rho) => rho -> [Repl] -> String -> String
subst1  :: (Regex rho) => rho -> [Repl] -> String -> String
\end{code}
Both take the replacement pattern as a list of |Repl|s, each of which
represents either a literal replacement string or a submatch reference:
\begin{code}
data Repl  =  Repl_lit  String
           |  Repl_ref  Int
\end{code}


% ========================================================================
\section*{Compiling Regexes}

Technically, to match a string against a regular expression, the regex
is ``compiled'' into a finite automaton.
Compiled regular expressions are represented by the opaque data type
|CRegex|. The function |cregex| performs the compilation.
\begin{code}
cregex :: [Rexopt] -> String -> CRegex
\end{code}
As you can see above, the process also accepts a list of options that
determine specifics of the matching mechanism. The possible options are:
\begin{code}
data Rexopt =  Nocase | Multiline
               deriving (Eq,Show,Read)
\end{code}
If |Nocase| is given, the matching is case-insensitive. If |Multiline|
is given,
|"^.*$"|  matches individual lines within the given string, instead of the
whole string.

In particular: With |Multiline|, |'^'| matches the start of a
line, |'$'| matches the end of a line, and |'.'| does not match the
new-line character. Without |Multiline|, |'^'| matches the start of the
string, |'$'| matches the end of the string, and |'.'| matches any character.


% ========================================================================
\section*{Implementation}


\subsection*{Compiling}
|CRegex| simply wraps |Regex| from |Text.Regex| and |cregex| calls
the corresponding compilation function.
\begin{code}
newtype CRegex = CRegex TR.Regex

cregex os s = CRegex (TR.mkRegexWithOpts s ml cs)
    where
    ml  = elem Multiline os
    cs  = not (elem Nocase os)
\end{code}


\subsection*{Matching}
Matching a compiled regular expression only consists of unrapping
the |CRegex|, calling the matcher from |Text.Regex|, and wrapping the
result in a |Match|.
\begin{code}
instance Regex CRegex where
    match (CRegex cr) str =
        do
        (b,m,a,s) <- TR.matchRegexAll cr str
        return $ Match  {  m_before      = b
                        ,  m_match       = m
                        ,  m_after       = a
                        ,  m_submatches  = s
                        }
\end{code}
Plain |String|s are treated as regular expressions
by simply compiling them with the default options before matching.
\begin{code}
instance Regex String where
    match = match . cregex []
\end{code}

Now, the match testing operators are trivial to define.
\begin{code}
(~=) r = isJust . match r
\end{code}

I define |=~| in terms of |~=| and not the
other way around so that applying |(r ~=)| to several
strings compiles |r| only once (when |r| is a string). The
same note applies to all other operators.
\begin{code}
(=~)  = flip (~=)
($~)  = flip (~$)
\end{code}

The |~$| operator must find all matches within the given string.
That can be achieved by consecutively applying |match| to the
|m_after| field of the previous match, if any. That's an instance
of |unfoldr|:
\begin{code}
match_all :: (Regex rho) => rho -> String -> [Match]
match_all r = unfoldr step
    where
    step :: String -> Maybe (Match, String)
    step x = do  ma <- match r x
                 return (ma, m_after ma)
\end{code}
This way, however, each match's |m_before| field only extends
to the end of the previous match. The list returned
by |match_all| is only meaningful in its original order.
This plain function will still be useful for the implementation
of substitution below but for the user-visible
operators, the matches should be expanded to span the entire
string.
\begin{code}
(~$) r = expand_matches . match_all r
\end{code}

Let |m| be a match, as retured by |match_all|. If |m| is the first match in
the list, it does not need to be
expanded. It is prefixed with the empty string |""|.
If, on the other hand, |m| has a predecessor |p|, it is prefixed with
|m_before p ++ m_match p|. So the list of prefixes for all matches
is given by:
\begin{code}
exp_prefixes :: [Match] -> [String]
exp_prefixes ms = "" : map (\p -> m_before p ++ m_match p) ms
\end{code}
That list contains one extraneous entry at the end, but that can
be ignored because |expand_matches| is now a simple instance
of |zipWith|\footnote{\textsc{Applause!}}.
\begin{code}
expand_matches :: [Match] -> [Match]
expand_matches ms = zipWith expand ms (exp_prefixes ms)
    where
    expand m s = m { m_before = s ++ m_before m }
\end{code}


\subsection*{Substitution}

\begin{code}
class Subst pi where
    subst' :: String -> [Match] -> [Repl] -> pi

instance Subst String where
    subst' s ms rs = replace s ms (reverse rs)
instance (Subst pi, Tag alpha Repl) => Subst (alpha -> pi) where
    subst' s ms rs = \x -> subst' s ms (tag x : rs)

class Tag alpha beta where
    tag :: alpha -> beta

instance Tag String Repl where
    tag = Repl_lit
instance (Integral alpha) => Tag alpha Repl where
    tag = Repl_ref . fromIntegral


replace :: String -> [Match] -> [Repl] -> String
replace s  []      _   =  s
replace _  (m:ms)  rs  =  (   m_before m
                          ++  concatMap replstr rs
                          ++  replace (m_after m) ms rs
                          )
    where
    replstr r = case r of
        Repl_lit x  ->  x
        Repl_ref 0  ->  m_match m
        Repl_ref i  ->  m_submatches m !! (i-1)


subst   r = \rs s -> replace s (match_all r s) rs
subst1  r = \rs s -> replace s (take 1 (match_all r s)) rs


(~//) r  = \s -> subst' s (match_all r s) []
(~/) r   = \s -> subst' s (take 1 (match_all r s)) []

(//~)  = flip (~//)
(/~)   = flip (~/)
\end{code}

\end{document}

