A Simple, Brute-Force Way to Tell Chinese and Japanese Text Apart Without NLP

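Like most other articles on this blog, this one comes from something I found while working on my personal projects. While handling phonetic annotation for my music library, lyrics and other data, I needed a simple way to tell Chinese text from Japanese text, because my library basically contains only Chinese, Japanese and other languages written in Latin letters. Those Latin-script languages sort naturally without much complex processing, but Chinese and Japanese are not that simple, especially as the two languages handle Chinese characters in entirely different ways.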
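To illustrate the kind of check this could involve, here is a minimal sketch of one common heuristic: treat text containing kana as Japanese, and text containing Han characters but no kana as Chinese. This is an assumption made for illustration, not necessarily the method the article describes; the function names and the `"ja"` / `"zh"` / `"other"` labels are hypothetical.

```python
def contains_kana(text):
    """Return True if the text has any hiragana or katakana letter."""
    for ch in text:
        code = ord(ch)
        # Hiragana letters U+3041..U+3096, katakana letters U+30A1..U+30FA.
        # The katakana middle dot (U+30FB) is skipped on purpose, since it
        # also shows up in Chinese text.
        if 0x3041 <= code <= 0x3096 or 0x30A1 <= code <= 0x30FA:
            return True
    return False


def contains_han(text):
    """Return True if the text has any CJK Unified Ideograph (basic block)."""
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)


def guess_language(text):
    """Guess 'ja', 'zh' or 'other' from the scripts present in the text."""
    if contains_kana(text):
        return "ja"
    if contains_han(text):
        # Assumption: Han characters without kana are treated as Chinese.
        # A kanji-only Japanese title will be misclassified here.
        return "zh"
    return "other"
```

The obvious weakness is a Japanese title written only in kanji, which this sketch labels as Chinese; handling that case needs more than a script check.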

Read and Write Tags of Music Files with FFmpeg

In both my previous and recent projects, I have been working with the tags (metadata) of music files. One of the reasons is that I am rather particular about having a nicely organised library with all tag data aligned to the same format. Recently, while looking for a way to read and write tags of (potentially) all music formats (I only have MP3, FLAC, AIFF and M4A in my library, so that’s kinda all for me), I came across FFmpeg, the Swiss Army Knife of media processing.

FFmpeg has always been my go-to solution for processing media programmatically or in batch, and I have recently found a way to write tags into music files with it. Doing so can be a little verbose, as everything has to fit into the command line interface alongside other components.
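As a rough sketch of what this can look like from a script, the Python below builds an `ffmpeg` command that copies the audio stream and sets tags with `-metadata key=value`, and reads tags back through `ffprobe`'s JSON output. The helper names (`build_write_command`, `parse_tags`, `read_tags`) are made up for this example, and it assumes `ffmpeg` and `ffprobe` are available on the `PATH`.

```python
import json
import subprocess


def build_write_command(src, dst, tags):
    """Build an ffmpeg command that copies the streams and sets the given tags."""
    cmd = ["ffmpeg", "-i", src, "-c", "copy"]
    for key, value in tags.items():
        # Each tag is passed as its own "-metadata key=value" pair.
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(dst)
    return cmd


def parse_tags(ffprobe_json):
    """Extract container-level tags from ffprobe's JSON output."""
    return json.loads(ffprobe_json).get("format", {}).get("tags", {})


def read_tags(path):
    """Run ffprobe on a file and return its container-level tags."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tags(result.stdout)
```

Passing the result of `build_write_command("in.flac", "out.flac", {"title": "Song"})` to `subprocess.run` writes a tagged copy next to the original. Some containers may report tags per stream rather than on the container, so `parse_tags` is only a starting point.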

Translate Text in Sphinx Templates and Configurations

A few weeks ago, when I was playing around with the docs of EFB and the Crowdin translation widget, I realized that Sphinx’s default theme, Alabaster, isn’t really doing well in terms of translation. It seems like the author isn’t really confident about it (or simply hasn’t cared for the past 4 years).

As the theme itself is open source, and Sphinx is flexible enough, couldn’t we just translate it ourselves? It turns out that things are not that complicated.

How to Write Integration Tests for a Telegram Bot

This is my 6th article on Telegram, my IM platform of choice. In this article I’m going to describe how I wrote the integration tests for my EFB Telegram Master channel, a Telegram interface for EFB, using a userbot-like strategy.

To get started, you need to have a bot ready to be tested, and a Telegram client app key that is registered with your account. While alternative tools are available, we will be using Telethon and PyTest in this article.

Awesome Command Line Tools

This is a list of awesome command line tools collected by SelfhostedServer. They provide a detailed article on each of the tools as part of their paid membership subscription. The list below is based on SelfhostedServer’s freely available article titles, each paired with a short description taken from the project itself.

If you are interested in reading more about these tools, I’d recommend subscribing to the articles on SelfhostedServer (in Chinese).

Message delivery issues in EFB Telegram Master channel (compared to generic IM services)

Unlike how an IM usually works, EFB Telegram Master channel (ETM) relies heavily on the Telegram Bot platform. This makes it more difficult for ETM to deal with messages that fail to deliver.

This article was first published on the ETM Wiki on 20 April 2019.

Custom sort order in music libraries: macOS and Android

Custom sort order in music libraries is a rather rare need. Most major languages use phonograms in their scripts, where the natural sort order is more or less identical to the order in Unicode (probably after some normalization). On the other hand, languages using logograms (logosyllabic scripts, mainly Chinese characters in our context) do not have their characters arranged in their primary natural (usually phonetic) order in Unicode.
This causes a problem: a list of text sorted in Unicode code point order can look odd and be difficult to look things up in for these languages. Custom sort order in music libraries is thus useful when you have songs in one of these languages, or even a mix of them.
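The difference is easy to see with a small Python sketch. The three-entry library and the `sort_title` field holding a hand-written reading are hypothetical, chosen only to show how code point order and reading order diverge for Japanese titles.

```python
# Hypothetical library entries: each title plus a hand-written reading
# stored in an assumed "sort_title" field.
LIBRARY = [
    {"title": "東京", "sort_title": "とうきょう"},
    {"title": "愛", "sort_title": "あい"},
    {"title": "桜", "sort_title": "さくら"},
]


def by_code_point(entries):
    """Sort titles by plain Unicode code point order."""
    return [e["title"] for e in sorted(entries, key=lambda e: e["title"])]


def by_sort_title(entries):
    """Sort titles by their reading, falling back to the title itself."""
    return [
        e["title"]
        for e in sorted(entries, key=lambda e: e.get("sort_title", e["title"]))
    ]


print(by_code_point(LIBRARY))  # ['愛', '東京', '桜']
print(by_sort_title(LIBRARY))  # ['愛', '桜', '東京']
```

By code point, 東京 (U+6771) lands before 桜 (U+685C), while by reading さくら comes before とうきょう, which is the order a Japanese reader would expect.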

As this article mainly involves concepts familiar to Chinese and Japanese speakers, it is also available in Simplified Chinese (zh-hans) and Japanese (ja).
This article is available in Chinese.
This article is available in Japanese.

Implementing custom sort order in music libraries on macOS and Android

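Custom sort order for song titles, artists and albums is often considered a very rare need. Most major languages use phonograms, whose natural order is usually roughly the same as their order in Unicode (some scripts may need normalization). In languages that use logograms (mainly Chinese characters), however, the natural order (usually by pronunciation) is quite different from the code point order in Unicode. As a result, lists in these languages look odd and are hard to look things up in when sorted by Unicode code point. Custom sort order is therefore a useful feature when a music library contains one or more of these languages.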

Custom sort order for music libraries on macOS and Android

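Custom sort order in music libraries is considered a fairly rare need. Most major languages use phonograms, and their natural sort order is almost the same as their order in Unicode (normalization is needed in some cases). In languages that use logograms (mainly Chinese characters), on the other hand, the order in Unicode differs from the natural order (usually by reading). As a result, a list sorted by Unicode code point appears to be in an odd order and becomes hard to look things up in. Custom sort order for titles, artists, album names and so on is therefore useful when a music library consists of one or more of these languages.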