RegEx match open tags except XHTML self-contained tags

Technical background

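When processing HTML, you sometimes need a regular expression that matches opening tags while skipping XHTML self-contained (self-closing) tags. HTML, however, is a complex language, and regular expressions are inherently limited here: HTML is a context-free grammar (Chomsky Type 2), whereas regular expressions describe regular grammars (Chomsky Type 3), so in theory a regular expression cannot parse HTML fully and correctly. In some specific scenarios, though, regular expressions can still be used on a limited, known set of HTML.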
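
To make the limitation concrete, here is a minimal Python sketch (the `<div>` input strings and the names `LAZY` and `GREEDY` are made up for illustration). A plain pattern cannot count nesting depth, so neither the lazy nor the greedy form pairs tags correctly in every case.

```python
import re

# Hypothetical patterns: pair a <div> with a </div> without counting depth.
LAZY = re.compile(r"<div>.*?</div>", re.DOTALL)
GREEDY = re.compile(r"<div>.*</div>", re.DOTALL)

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div> and <div>b</div>"

# Lazy matching stops at the first </div>, which belongs to the inner element.
lazy_result = LAZY.search(nested).group(0)
print(lazy_result)    # <div>outer <div>inner</div>

# Greedy matching runs to the last </div>, swallowing both siblings.
greedy_result = GREEDY.search(siblings).group(0)
print(greedy_result)  # <div>a</div> and <div>b</div>
```

Both results are wrong for their input, which is why the patterns in the rest of this article only claim to work on a limited, known set of markup.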

Implementation steps

Simple HTML tag matching

The following regular expression can be used to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

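This regular expression has been tested against a large amount of HTML and catches some of the odd tags that appear on real web pages, such as <a name="badgenerator"">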
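
As a quick check, the pattern can be tried in Python. This is only a sketch: the sample markup is invented, and the pattern is written without the string-literal backslashes before the quotes, which does not change what it matches.

```python
import re

# The simple tag pattern: quoted attribute values may contain '>',
# and stray quotes right after a quoted value are tolerated.
TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")

# Hypothetical sample containing a '>' inside a value and a doubled quote.
sample = '<p class="x>y">text</p><a name="badgenerator"">link</a><br/>'

tags = TAG.findall(sample)
for tag in tags:
    print(tag)
```

Note that the pattern matches every kind of tag here, including the closing tags and the self-closing `<br/>`.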

Excluding self-closing tags

To exclude self-closing tags, you can use Kobi's negative lookbehind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>
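
The lookbehind above contains `\s*`, so it is variable-length. Some engines accept that, but others do not; Python's `re` module, for example, requires fixed-width lookbehind. A possible workaround, sketched below, is to match every tag first and filter afterwards. The helper name `open_tags` and the sample markup are made up, and dropping closing tags such as `</a>` is an extra assumption that goes beyond what the pattern above does.

```python
import re

# Same tag pattern as before, without the lookbehind.
TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")

# A tag is treated as self-closing when a slash precedes the final '>'.
SELF_CLOSING_END = re.compile(r"/\s*>$")

def open_tags(html):
    """Return tags that are neither closing tags nor self-closing tags."""
    return [
        tag
        for tag in TAG.findall(html)
        if not tag.startswith("</") and not SELF_CLOSING_END.search(tag)
    ]

result = open_tags('<p><a href="foo">link</a><br /><hr/><div class="x">')
print(result)
```

Filtering in code keeps the pattern portable and makes the exclusion rule easy to adjust.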

Parsing XML with balancing group definitions in .NET

Regular expressions in the .NET Framework support balancing group definitions, which can be used to parse valid XML:

(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
<!-- .*? --> |
<[^>]* /> |
(?<opentag><(?!/)[^>]*[^/]>) |
(?<-opentag></[^>]*[^/]>) |
[^<>]*
)*
(?(opentag)(?!))

Flags used:

  • Singleline
  • IgnorePatternWhitespace (not required if you collapse the regular expression and remove all whitespace)
  • IgnoreCase (not required)
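
Balancing groups are specific to .NET, so the idea behind them is sketched here in plain Python instead: an opening tag pushes one level, a closing tag pops one level, and the input is accepted only if nothing is left open at the end. The names `TOKEN` and `is_balanced` are hypothetical. Like the pattern above, this sketch counts depth only and does not compare tag names.

```python
import re

# One token per alternative, mirroring the branches of the .NET pattern.
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)    # comment: skipped, never counted
  | (?P<selfclose><[^>]*/>)    # self-closing tag: depth unchanged
  | (?P<close></[^>]*>)        # closing tag: pops one level
  | (?P<open><[^>]*>)          # opening tag: pushes one level
  | (?P<text>[^<>]+)           # plain text between tags
    """,
    re.VERBOSE | re.DOTALL,
)

def is_balanced(markup):
    """Return True if every opening tag is closed and nothing closes early."""
    depth = 0
    pos = 0
    while pos < len(markup):
        token = TOKEN.match(markup, pos)
        if token is None:
            return False          # stray '<' or '>' that fits no token
        if token.lastgroup == "open":
            depth += 1
        elif token.lastgroup == "close":
            if depth == 0:
                return False      # closing tag with nothing open
            depth -= 1
        pos = token.end()
    return depth == 0             # same role as (?(opentag)(?!))

print(is_balanced('<ul id="matchMe" type="square"><li>a<br/></li><!-- </ul> --></ul>'))
print(is_balanced("<ul><li>a</ul>"))
```

The final `depth == 0` check plays the part of the conditional at the end of the .NET pattern, which fails the match while the `opentag` stack is non-empty.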

Matching HTML tags in PHP

// Match all tags (including self-closing tags)
$pattern = '/<(\w+)(\s+(\w+)\s*=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

// Exclude self-closing tags
$pattern = '/<(\w+)(\s+(\w+)\s*=\s*(\'|")(.*?)\\4\s*)*\s*>/';

// Support unquoted attributes and attributes without a value
$pattern = '/<(\w+)(\s+(\w+)(\s*=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';

$string = 'Hello, try clicking <a href="#paragraph">here</a>
<br/>and check out.<hr />
<h2>title</h2>
<a name ="paragraph" rel= "I\'m an anchor"></a>
Fine, <span title=\'highlight the "punch"\'>thanks<span>.
<div class = "clear"></div>
<br>';

preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
print_r($matches[0]);
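
For readers who want to try this outside PHP, the second pattern (the one that excludes self-closing tags) can be ported to Python roughly as follows. This is a sketch under the assumption that the two engines behave the same for this pattern; the names `OPEN_TAG` and `SAMPLE` are made up, and the sample text is the same as in the PHP listing.

```python
import re

# Port of the "exclude self-closing tags" pattern; \4 refers back to the
# quote character that opened the attribute value.
OPEN_TAG = re.compile(r"""<(\w+)(\s+(\w+)\s*=\s*('|")(.*?)\4\s*)*\s*>""")

SAMPLE = """Hello, try clicking <a href="#paragraph">here</a>
<br/>and check out.<hr />
<h2>title</h2>
<a name ="paragraph" rel= "I'm an anchor"></a>
Fine, <span title='highlight the "punch"'>thanks<span>.
<div class = "clear"></div>
<br>"""

# finditer is used because findall would return the capture groups
# instead of the whole tag.
tags = [m.group(0) for m in OPEN_TAG.finditer(SAMPLE)]
for tag in tags:
    print(tag)
```

On this sample the port skips `<br/>` and `<hr />` as well as every closing tag, and keeps the unclosed `<span>` and the bare `<br>`.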

Matching HTML tags with a recursive regular expression in PHP

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";
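
Python's standard `re` module has no `(?R)` recursion, so the structure this pattern describes is sketched below as a small recursive function instead. All names here (`OPEN`, `COMMENT`, `TEXT`, `match_element`) are hypothetical, and unlike the regular expression the function does not backtrack, so it should be read as an illustration of the grammar rather than an exact equivalent.

```python
import re

OPEN = re.compile(r"<(\w+)([^>]*?)(\s*/>|>)")   # tag name, attributes, ending
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TEXT = re.compile(r"[^<]+")

def match_element(s, pos=0):
    """Return the end index of the element starting at pos, or None."""
    opening = OPEN.match(s, pos)
    if opening is None:
        return None
    if opening.group(3) != ">":
        return opening.end()                    # self-closing: no content
    closing = re.compile(r"</" + re.escape(opening.group(1)) + r"\s*>")
    pos = opening.end()
    while True:
        done = closing.match(s, pos)
        if done:
            return done.end()                   # found the matching close tag
        for pattern in (TEXT, COMMENT):
            piece = pattern.match(s, pos)
            if piece:
                pos = piece.end()
                break
        else:
            # Neither text nor comment: recurse, as (?R) does in the pattern.
            pos = match_element(s, pos)
            if pos is None:
                return None                     # unbalanced or malformed

doc = "<div><p>hi<br/></p><!-- <p> --></div>tail"
end = match_element(doc)
print(doc[:end])
```

The recursive call corresponds to `(?R)`, and the compiled `closing` pattern corresponds to the `<\/\\1[\s]*>` backreference that forces the closing tag name to equal the opening one.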

Core code

Test code for the balancing-group regular expression in .NET

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.IO.Compression;

class Program
{
    static string FromBase64(string str)
    {
        byte[] byteArray = Convert.FromBase64String(str);

        using (var msIn = new MemoryStream(byteArray))
        using (var msOut = new MemoryStream())
        {
            using (var ds = new DeflateStream(msIn, CompressionMode.Decompress))
            {
                ds.CopyTo(msOut);
            }

            return Encoding.UTF8.GetString(msOut.ToArray());
        }
    }

    static void Main()
    {
        string base64Regex = "7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28" +
                             "995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F" +
                             "86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169" +
                             "OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq" +
                             "i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv" +
                             "p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf" +
                             "LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e" +
                             "Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7" +
                             "O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm" +
                             "rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv" +
                             "z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme" +
                             "nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e" +
                             "vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y" +
                             "gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs" +
                             "mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH" +
                             "W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52" +
                             "MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU" +
                             "1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn" +
                             "xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ" +
                             "GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY" +
                             "12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37" +
                             "R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn" +
                             "3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25" +
                             "D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP" +
                             "mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS" +
                             "mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX" +
                             "X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8" +
                             "DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c" +
                             "etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3" +
                             "zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS" +
                             "ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ" +
                             "j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX" +
                             "/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d" +
                             "mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u" +
                             "v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj" +
                             "4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq" +
                             "GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6" +
                             "mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K" +
                             "MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z" +
                             "0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26" +
                             "7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29" +
                             "7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9" +
                             "r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va" +
                             "j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd" +
                             "w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa" +
                             "2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm" +
                             "AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C" +
                             "j/9vz9+PAo8f+Vq35Jb/n

https://119291.xyz/posts/2025-05-16.regex-match-open-tags-except-xhtml-self-contained-tags/
Author
ww
Published
May 16, 2025