Getting an error when calling regex_replace on a string object extracted through a string stream. Help wanted.

This topic was created 846 days ago; the information in it may have evolved or changed since then.

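The code below contains three regex_replace statements.

With them commented out everything is fine; with them left in, the program gets stuck (and the text file being operated on is emptied). So I'd like to ask for advice here.

(I started teaching myself programming a month ago, in order to maintain an open-source input method project.)

path is the path to a plain-text txt file that the program has permission to write to.

This code is mainly meant to do the following, in order:

1. Convert every tab, full-width space, and other whitespace in the broad sense into a half-width ASCII space. Consecutive whitespace is merged into a single space.

2. Strip the spaces at the beginning and end of each line.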

// FORMAT CONSOLIDATOR. CREDIT: Shiki Suen.
bool LMConsolidator::ConsolidateFormat(const char *path, bool hypy) {
    ifstream zfdFormatConsolidatorIncomingStream(path);
    stringstream zfdLoadedFileStreamToConsolidateBuff; // Set up the string stream.
    ofstream zfdFormatConsolidatorOutput(path); // The file content is rewritten from scratch here, so "ios_base::app" is not needed.
    
    zfdLoadedFileStreamToConsolidateBuff << zfdFormatConsolidatorIncomingStream.rdbuf();
    string zfdBuffer = zfdLoadedFileStreamToConsolidateBuff.str();
    
    // The following statements perform a very complicated regex replacement.
    regex sedWhiteSpace("\\h+"), sedLeadingSpace("^ "), sedTrailingSpace(" $");
    zfdBuffer = regex_replace(zfdBuffer, sedWhiteSpace, " ").c_str();
    zfdBuffer = regex_replace(zfdBuffer, sedLeadingSpace, "").c_str();
    zfdBuffer = regex_replace(zfdBuffer, sedTrailingSpace, "").c_str();
    
    // Convert Hanyu Pinyin (type 2) to Zhuyin.
    if (hypy) {
        // This feature has not been formally introduced yet.
    }
    
    // Finally, write the replacement result to the file.
    zfdFormatConsolidatorOutput << zfdBuffer << std::endl;
    zfdFormatConsolidatorOutput.close();
    if (zfdFormatConsolidatorOutput.fail()) {
        syslog(LOG_CONS, "// REPORT: Failed to write format-consolidated data to the file. Insufficient Privileges?\n");
        syslog(LOG_CONS, "// DATA FILE: %s", path);
        return false;
    }
    zfdFormatConsolidatorIncomingStream.close();
    if (zfdFormatConsolidatorIncomingStream.fail()) {
        syslog(LOG_CONS, "// REPORT: Failed to read lines through the data file for format-consolidation. Insufficient Privileges?\n");
        syslog(LOG_CONS, "// DATA FILE: %s", path);
        return false;
    }
    return true;
} // END: FORMAT CONSOLIDATOR.

25 replies • 2022-01-29 18:13:21 +08:00

ShikiSuen

2022-01-28 14:43:56 +08:00

I forgot to give the list of headers that need to be included:
```
#include <syslog.h>
#include <stdio.h>
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <map>
#include <set>
#include <regex>
```

zzxxisme

2022-01-28 15:05:34 +08:00

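The syntax of std::regex seems a bit unusual. I don't use it often and am not very familiar with it, but changing the regexes to the following seems to work.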

```c++
regex sedWhiteSpace("\\s+"), sedLeadingSpace("^\\s+"), sedTrailingSpace("\\s+$");
```

KuroNekoFan

2022-01-28 15:11:03 +08:00

I think I even follow the OP on Zhihu, haha.

ShikiSuen

2022-01-28 15:22:07 +08:00

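@zzxxisme Thanks. The program now runs normally (it no longer gets stuck), but after building it with Xcode I found that the contents of my txt file get emptied.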

Inn0Vat10n

2022-01-28 15:24:54 +08:00

zfdBuffer = regex_replace(zfdBuffer, sedWhiteSpace, " ").c_str();
regex_replace returns an rvalue, which is destroyed once this line finishes, so the memory zfdBuffer points to is already invalid by the time it is used later.

ShikiSuen

2022-01-28 15:25:12 +08:00

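@KuroNekoFan I am indeed the maintainer of the open-source vChewing (威注音) input method project.
I've recently been working on a new heuristic feature for consolidating the format of user phrase data, and happened to hit a wall on the regex part.
If C++ really can't do it, I'll have to write this part in Swift.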

ShikiSuen

2022-01-28 15:26:28 +08:00

@Inn0Vat10n Thanks. I hadn't realized that could happen.

zzxxisme

2022-01-28 15:36:55 +08:00

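@Inn0Vat10n About that c_str(): it is a bit odd, but not invalid. zfdBuffer is a std::string and regex_replace also returns a std::string; calling c_str() on that std::string yields a const char*, so in the end a const char* is assigned to a std::string, and the characters it points to are copied into the std::string. What I find odd is only that the c_str() is unnecessary.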
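
To make the point about c_str() concrete, here is a minimal sketch; both helper names are made up for illustration and do not come from the thread. The temporary string returned by regex_replace lives until the end of the full assignment expression, and assigning a const char* to a std::string copies the characters, so the round-trip is safe but redundant.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: collapses whitespace runs, keeping the extra
// c_str() round-trip that the original post uses.
std::string collapse_with_c_str(const std::string &text) {
    static const std::regex ws("\\s+");
    std::string result;
    // The characters behind the const char* are copied into `result`
    // before the temporary returned by regex_replace is destroyed.
    result = std::regex_replace(text, ws, " ").c_str();
    return result;
}

// Hypothetical helper: same effect without the detour through a C string.
std::string collapse_direct(const std::string &text) {
    static const std::regex ws("\\s+");
    return std::regex_replace(text, ws, " ");
}
```

One practical difference remains: the c_str() variant stops copying at the first embedded NUL byte, which the direct assignment does not.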

@ShikiSuen This is the example I tested with at the time:
```c++
std::string zfdBuffer = " \t123 456\t\n789 \taaaa ";
std::cout << '@' << zfdBuffer << '@' << std::endl;
std::regex sedWhiteSpace("\\s+"), sedLeadingSpace("^\\s+"), sedTrailingSpace("\\s+$");
zfdBuffer = std::regex_replace(zfdBuffer, sedWhiteSpace, " ");
zfdBuffer = std::regex_replace(zfdBuffer, sedLeadingSpace, "");
zfdBuffer = std::regex_replace(zfdBuffer, sedTrailingSpace, "");
std::cout << std::endl << '@' << zfdBuffer << '@' << std::endl;
```

The result is that spaces, \t, and \n are removed or merged into a single space. So I think the original regex_replace problem should be solved.
```
@ 123 456
789 aaaa @

@123 456 789 aaaa@
```

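> But after building it with Xcode I found that the contents of my txt file get emptied
I suspect that is a different problem. Alternatively, could you print zfdBuffer before the first regex_replace, print it again after the last regex_replace, and compare the two?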
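
One possible explanation for the emptied file, offered as an assumption because nobody in the thread confirms it: in the original function the std::ofstream is constructed on the same path before the input stream's contents are copied into the string stream, and a default-mode std::ofstream truncates the file as soon as it is opened. A minimal sketch of a read-first, write-later ordering (the helper names are hypothetical):

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Reads the whole file into memory. The input stream is closed when the
// function returns, before anything opens the same path for writing.
std::string read_whole_file(const std::string &path) {
    std::ifstream in(path);
    std::stringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Overwrites the file. std::ofstream truncates on open by default, so this
// should only be called once the old contents are safely in memory.
bool write_whole_file(const std::string &path, const std::string &content) {
    std::ofstream out(path);
    out << content;
    out.close();
    return !out.fail();
}
```

With this split, the caller reads, transforms the string, and only then writes, so a failure in the middle leaves the original file untouched.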

ShikiSuen

2022-01-28 16:01:19 +08:00 via iPhone

@zzxxisme I keep \n for a reason: in the input method's user dictionary, each word-reading definition takes up one line.

Inn0Vat10n

2022-01-28 16:01:49 +08:00

@zzxxisme You're right. I had only looked at those few lines and took it for granted that zfdBuffer was a pointer.

zzxxisme

2022-01-28 16:38:24 +08:00

@ShikiSuen In that case, I suggest reading the file zfdFormatConsolidatorIncomingStream line by line, running regex_replace on one line at a time, and writing each result to the new file zfdFormatConsolidatorOutput.

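You might ask: can't I just change the regex so that it doesn't remove \n? You could try changing it to "[^\\S\r\n]+". Here \\S means every non-whitespace character, \\S\r\n means every non-whitespace character plus the line-break characters, and [^\\S\r\n] negates "every non-whitespace character and the line-break characters", which gives "every whitespace character except line breaks". There is a drawback, though: for "a \nb" the result of the replacement is still "a \nb" rather than "a\nb", because that space is neither a leading nor a trailing space of the whole string, so it is collapsed into one space instead of being removed. That is why I suggest reading and processing one line at a time.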
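
The behaviour described above can be tried with a small sketch. The function name is made up for illustration; the pattern is the one quoted in this reply.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Collapses runs of horizontal whitespace (everything \s matches except
// CR and LF) into a single space, leaving line breaks alone.
std::string collapse_horizontal(const std::string &text) {
    static const std::regex horizontal("[^\\S\r\n]+");
    return std::regex_replace(text, horizontal, " ");
}
```

For the input "a \nb" the function returns the string unchanged, which is exactly the leftover space before the line break that the reply warns about.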

ShikiSuen

2022-01-28 17:07:08 +08:00

@zzxxisme I'm wondering whether \h is supported in C++11 / ObjCpp 11.
\h would not include \n.

ShikiSuen

2022-01-28 17:16:38 +08:00

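@zzxxisme I switched to ObjCpp, using NSString and the regex facilities built into Foundation, only to find that the entire contents of the dictionary file were replaced with a single number.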

```cpp
// FORMAT CONSOLIDATOR. CREDIT: Shiki Suen.
bool LMConsolidator::ConsolidateFormat(const char *path, bool hypy) {
stringstream zfdLoadedFileStreamToConsolidateBuff; // Set up the string stream.
ifstream zfdFormatConsolidatorIncomingStream(path);
zfdLoadedFileStreamToConsolidateBuff << zfdFormatConsolidatorIncomingStream.rdbuf();
ofstream zfdFormatConsolidatorOutput(path); // The file content is rewritten from scratch here, so "ios_base::app" is not needed.

// The following statements perform a very complicated regex replacement.
string zfdBufferStringC = zfdLoadedFileStreamToConsolidateBuff.str().c_str();
NSString *zfdBufferString = [NSString stringWithCString:zfdBufferStringC.c_str() encoding:[NSString defaultCStringEncoding]];
zfdBufferString = [zfdBufferString replacingWithPattern:@"[^\\S\\r\\n]+" withTemplate:@" " error:nil]; // Replace consecutive spaces to single spaces.
zfdBufferString = [zfdBufferString replacingWithPattern:@"^\\s" withTemplate:@"" error:nil]; // Initial Spaces in a line.
zfdBufferString = [zfdBufferString replacingWithPattern:@"\\s$" withTemplate:@"" error:nil]; // Trailing Spaces in a line.

// Convert Hanyu Pinyin (type 2) to Zhuyin.
if (hypy) {
// This feature has not been formally introduced yet.
}

// Finally, write the replacement result to the file.
zfdFormatConsolidatorOutput << zfdBufferString << std::endl;
zfdFormatConsolidatorOutput.close();
if (zfdFormatConsolidatorOutput.fail()) {
syslog(LOG_CONS, "// REPORT: Failed to write format-consolidated data to the file. Insufficient Privileges?\n");
syslog(LOG_CONS, "// DATA FILE: %s", path);
return false;
}
zfdFormatConsolidatorIncomingStream.close();
if (zfdFormatConsolidatorIncomingStream.fail()) {
syslog(LOG_CONS, "// REPORT: Failed to read lines through the data file for format-consolidation. Insufficient Privileges?\n");
syslog(LOG_CONS, "// DATA FILE: %s", path);
return false;
}
return true;
} // END: FORMAT CONSOLIDATOR.
```

ShikiSuen

2022-01-28 17:18:42 +08:00

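@zzxxisme As for processing line by line, I did test that this morning, but it didn't improve anything.
I've already overwritten that code. Right now I'm thinking I might as well just write it in Swift.
I didn't expect C++ to be this much trouble.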

zzxxisme

2022-01-28 18:27:49 +08:00

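@ShikiSuen I'm really not familiar with C++'s std::regex. Its syntax should be the one described at https://en.cppreference.com/w/cpp/regex/ecmascript , which doesn't mention \h, so it is probably not supported.

ObjCpp and C++ are two different languages, and I know nothing at all about ObjCpp…

That said, if other languages are an option, using the one you're more comfortable with would be better 0.0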
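
On the \h question: one way to find out how a given toolchain treats a pattern is to construct the regex inside a try block, since std::regex reports syntax it rejects by throwing std::regex_error from the constructor. The sketch below uses a hypothetical helper name. Whether "\\h+" is rejected or quietly read as a literal h may differ between standard library implementations, so the result for that particular pattern is worth checking on the target compiler rather than assumed.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `pattern` is accepted by std::regex's default
// (ECMAScript) grammar, false if construction throws std::regex_error.
bool pattern_compiles(const std::string &pattern) {
    try {
        std::regex compiled(pattern);
        return true;
    } catch (const std::regex_error &) {
        return false;
    }
}
```

If the constructor does throw and nothing catches the exception, the program is terminated, which would fit the original report of the program getting stuck once the regex lines are enabled; that link is a guess, not something established in the thread.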

ShikiSuen

2022-01-28 18:43:30 +08:00

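@zzxxisme ObjCpp is basically a stitched-together hybrid of C++ and Objective-C.
Rename a C++ file's cpp extension to mm and its hpp extension to hh, and it becomes ObjCpp.
However, since Objective-C already supports object-oriented features on its own, ObjCpp is not very well known.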

ShikiSuen

2022-01-28 19:10:27 +08:00

@zzxxisme I rewrote the line-by-line version, and this time it worked.
It seems the version I wrote this morning had some other mistake.
```cpp
// FORMAT CONSOLIDATOR. CREDIT: Shiki Suen.
bool LMConsolidator::ConsolidateFormat(const char *path, bool hypy) {
ifstream zfdFormatConsolidatorIncomingStream(path);
vector<string>vecEntry;
while(!zfdFormatConsolidatorIncomingStream.eof())
{
string zfdBuffer;
getline(zfdFormatConsolidatorIncomingStream,zfdBuffer);
vecEntry.push_back(zfdBuffer);
}
ofstream zfdFormatConsolidatorOutput(path); // The file content is rewritten from scratch here, so "ios_base::app" is not needed.
// Define the regexes up front.
regex sedWhiteSpace("\\s+"), sedLeadingSpace("^\\s+"), sedTrailingSpace("\\s+$");
for(int i=0;i<vecEntry.size();i++)
{
vecEntry[i] = regex_replace(vecEntry[i], sedWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedLeadingSpace, "").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedTrailingSpace, "").c_str();
if (hypy) {
// This feature has not been formally introduced yet.
}
zfdFormatConsolidatorOutput<<vecEntry[i]<<endl;
}
zfdFormatConsolidatorOutput.close();
if (zfdFormatConsolidatorOutput.fail()) {
syslog(LOG_CONS, "// REPORT: Failed to write format-consolidated data to the file. Insufficient Privileges?\n");
syslog(LOG_CONS, "// DATA FILE: %s", path);
return false;
}
zfdFormatConsolidatorIncomingStream.close();
if (zfdFormatConsolidatorIncomingStream.fail()) {
syslog(LOG_CONS, "// REPORT: Failed to read lines through the data file for format-consolidation. Insufficient Privileges?\n");
syslog(LOG_CONS, "// DATA FILE: %s", path);
return false;
}
return true;
} // END: FORMAT CONSOLIDATOR.
```

ShikiSuen

2022-01-28 19:22:21 +08:00

Also, \s does not cover the CJK full-width space, so an extra preprocessing pass is needed. The revised processing module is as follows:
```cpp
regex sedCJKWhiteSpace("\\u3000"), sedWhiteSpace("\\s+"), sedLeadingSpace("^\\s"), sedTrailingSpace("\\s$"); // Define the regexes up front.
for(int i=0;i<vecEntry.size();i++)
{ // Regex processing order: first turn full-width spaces into Western spaces, then merge consecutive whitespace of any kind (including tabs), and finally strip the spaces at the start and end of each line.
vecEntry[i] = regex_replace(vecEntry[i], sedCJKWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedLeadingSpace, "").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedTrailingSpace, "").c_str();
if (hypy) {
// This feature has not been formally introduced yet.
}
zfdFormatConsolidatorOutput<<vecEntry[i]<<endl;
}
```
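
As an alternative to the "\\u3000" regex, here is a sketch that replaces the full-width space with a plain byte search. It assumes the dictionary file is UTF-8 (the thread does not state the encoding), in which U+3000 is the three bytes E3 80 80, and it does not depend on how a byte-oriented std::regex interprets \u3000. The function name is made up for illustration.

```cpp
#include <cassert>
#include <string>

// Replaces every UTF-8 encoded ideographic space (U+3000, bytes E3 80 80)
// with an ASCII space. Assumes `text` is UTF-8.
std::string replace_fullwidth_spaces(std::string text) {
    const std::string fullwidth = "\xE3\x80\x80";
    std::string::size_type pos = 0;
    while ((pos = text.find(fullwidth, pos)) != std::string::npos) {
        text.replace(pos, fullwidth.size(), " ");
        pos += 1; // continue searching after the inserted space
    }
    return text;
}
```

Running this before the "\\s+" pass lets the existing whitespace regex merge the converted spaces with their neighbours.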

ShikiSuen

2022-01-28 19:43:50 +08:00

The code above has another problem: every run doubles the number of blank lines.
The loop has to process only non-empty lines:
```cpp
for(int i=0;i<vecEntry.size();i++)
{
if (vecEntry[i].size() != 0) { // Skip empty lines; appending endl to an empty line just adds another empty line.
// Regex processing order: first turn full-width spaces into Western spaces, then merge consecutive whitespace of any kind (including tabs), and finally strip the spaces at the start and end of each line.
vecEntry[i] = regex_replace(vecEntry[i], sedCJKWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedLeadingSpace, "").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedTrailingSpace, "").c_str();
zfdFormatConsolidatorOutput<<vecEntry[i]<<endl; // The endl is required here, otherwise all lines end up merged into one.
}
}
```

ShikiSuen

2022-01-28 20:24:38 +08:00

Then the blank lines that appear after cleanup also have to be kept out of the file:
```cpp
for(int i=0;i<vecEntry.size();i++)
{
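if (vecEntry[i].size() != 0) { // Skip empty lines; appending endl to an empty line just adds another empty line.
// Regex processing order: first turn full-width spaces into Western spaces, then merge consecutive whitespace of any kind (including tabs), and finally strip the spaces at the start and end of each line.
vecEntry[i] = regex_replace(vecEntry[i], sedCJKWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedWhiteSpace, " ").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedLeadingSpace, "").c_str();
vecEntry[i] = regex_replace(vecEntry[i], sedTrailingSpace, "").c_str();
}
if (vecEntry[i].size() != 0) { // This check has to stand on its own, otherwise lines that become empty after regex processing still end up in the file.
zfdFormatConsolidatorOutput<<vecEntry[i]<<endl; // The endl is required here, otherwise all lines end up merged into one.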
}
}
```

zzxxisme

2022-01-28 23:27:21 +08:00

@ShikiSuen
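> I rewrote the line-by-line version, and this time it worked.
Congratulations!

0.0 A few small suggestions. As far as I can see, you now read the whole file into memory. I don't know how large your files are, but if a file is very large this uses a lot of memory, and there is no inherent need to hold all the lines in memory before processing them. You could instead run regex_replace on the line right after
getline(zfdFormatConsolidatorIncomingStream,zfdBuffer);
has fetched it. Also, at the regex_replace calls there is no need for c_str(); writing
vecEntry[i] = regex_replace(vecEntry[i], sedWhiteSpace, " ")
directly would probably be better.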
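
The two suggestions above can be sketched together as follows. This is an illustration under assumptions, not the project's code: the function name is made up, and it takes generic streams so it can be tried on string streams as well as files. Using the result of getline as the loop condition, instead of testing eof() beforehand, also avoids pushing a phantom empty line after the final line break, which may be related to the growing number of blank lines reported earlier.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <regex>
#include <sstream>
#include <string>

// Reads `in` line by line, normalizes whitespace, and writes every line
// that is still non-empty afterwards to `out`.
void consolidate_lines(std::istream &in, std::ostream &out) {
    static const std::regex whitespace("\\s+");
    static const std::regex leading("^\\s+");
    static const std::regex trailing("\\s+$");
    std::string line;
    // The loop ends as soon as getline fails, so no extra line is processed.
    while (std::getline(in, line)) {
        line = std::regex_replace(line, whitespace, " "); // no c_str() needed
        line = std::regex_replace(line, leading, "");
        line = std::regex_replace(line, trailing, "");
        if (!line.empty()) {
            out << line << '\n';
        }
    }
}
```

When input and output are the same file, the input still has to be read completely (or the output written to a temporary file) before the original path is opened for writing.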

IRuNamu

2022-01-29 13:14:11 +08:00

Hijacking the thread to ask: is the OP maintaining the Rime project?

ShikiSuen

2022-01-29 18:08:23 +08:00

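@IRuNamu Given 佛振's dismissive attitude toward Dachen chord-typing Zhuyin and toward the character readings customary in Taiwanese Mandarin, there is no way I would be maintaining RIME.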

ShikiSuen

2022-01-29 18:11:28 +08:00

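@zzxxisme Here's the situation: the content to be loaded is at most the readings of the full character database (全字库), a bit over a hundred thousand lines (though a user dictionary will hardly ever exceed a thousand lines). Because I later added sorting and de-duplication to this vector (as well as support for non-breaking ASCII spaces), the whole operation still has to be done in memory. This is also meant to reduce the number of repeated reads and writes to the SSD (if I'm not mistaken).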

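@IRuNamu The vChewing project is here; currently there is only a macOS version: https://github.com/ShikiSuen/vChewing-macOS
Support for the full character database currently has some problems, which I'm still tracking down.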
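
The sorting and de-duplication step mentioned in this reply is not shown in the thread. One common way to do it on a vector of lines, given here only as a sketch with a made-up function name and plain byte-wise ordering, is std::sort followed by std::unique:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sorts the entries and removes exact duplicates in place.
void sort_and_deduplicate(std::vector<std::string> &entries) {
    std::sort(entries.begin(), entries.end());
    // std::unique only removes adjacent duplicates, hence the sort first.
    entries.erase(std::unique(entries.begin(), entries.end()), entries.end());
}
```

Because this needs every entry at once, it is consistent with keeping the whole file in memory as described above.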

ShikiSuen

2022-01-29 18:13:21 +08:00

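@zzxxisme I'll make a note of the c_str() issue and consider cleaning it up the next time I need to maintain this file.
vChewing recently switched to pull-request-based progress management, which introduces merge commits and makes it impossible to rebase the earlier history.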