Skip to main content

mb_strtok implementation in PHP–String tokenizer for Multibyte

This is a simple function to implement some kind of mb_strtok() in PHP. As maybe you all are aware the mb_strtok function does not available for multibyte string (aka Unicode string). So this is my attempt to solve the problem. Anyway, there are bugs where the program halt if the input text is too long (how long? not sure yet). Maybe you could improve to provide better result.

Thank you and happy coding.

string-tokenizer-for-multibyte

The PHP code mb_strtok.php ;

<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
$in=$_GET["txtinput"];
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

$tokens=mb_strtok(" /n/t?\'.", $in);
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
echo ("[$i] -> ".$tokens[$i] ." <br> \n");
}

function mb_strtok($delimiters, $str=NULL)
{
static $pos = 0; // Keep track of the position on the string for each subsequent call.
static $string = "";
static $listtoken=array();
// If a new string is passed, reset the static parameters.
if($str!=NULL)
{
$pos = 0;
$string = $str;
}

// Initialize the token.
$token = "";

while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
{

$char = mb_substr($string, $pos, 1);//fetch one character, pos = char position
$pos++;
//echo ("Char at $pos => $char <br>\n");//trace character at position

if(mb_strpos($delimiters, $char)===FALSE)//if character is not delimeter
{
$token .= $char;//put character in the token node
}
else
{
//if arrive at delimeter, push token to listtoken
array_push($listtoken, $token);
$token="";//clear the token node
}
}
// return the list of tokens
if ($listtoken!="")
{
return $listtoken;
}
else
{
return false;
}
}
?>
</body>
</html>


There is another one, this time the separator (.,;:) will be stored in the list of token (listtoken).


<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
$in=$_GET["txtinput"];
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

$tokens=mb_strtok(" /n/t/f", $in);//delimeter by whitespace only
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
echo ("[$i] -> ".$tokens[$i] ." <br> \n");
}

function mb_strtok($delimiters, $str=NULL)
{
static $pos = 0; // Keep track of the position on the string for each subsequent call.
static $string = "";
static $listtoken=array();
// If a new string is passed, reset the static parameters.
if($str!=NULL)
{
$pos = 0;
$string = $str;
}

// Initialize the token.
$token = "";

while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
{

$char = mb_substr($string, $pos, 1, 'UTF-8');//fetch one character, pos = char position

echo ("Char at $pos => $char <br>\n");//trace character at position


if(mb_strpos($delimiters, $char)===FALSE)//if character is not delimeter
{
if($char=="." || $char==";"||$char==":"||$char==","){
echo "Token detected $token <br>\n";
array_push($listtoken, $char);
//$token="";//clear the token node
}else{
$token .= $char;//put character in the token node
}
}
else
{
//if arrive at delimeter, push token to listtoken
echo "Token detected $token <br>\n";
array_push($listtoken, $token);
$token="";//clear the token node
}
$pos++;
}
return $listtoken;
// return the list of tokens
if ($listtoken!="")
{
return $listtoken;
}
else
{
return false;
}

}
?>
</body>
</html>


mb_strtok implementation in PHP for Arabic unicode strings


Modified using the code by http://www.anastis.gr/mb_strtok-a-php-implementation/

Comments

Popular posts from this blog

Several English proverbs and the Malay pair

Or you could download here for the Malay proverbs app – https://play.google.com/store/apps/details?id=net.kerul.peribahasa English proverbs and the Malay pair Corpus Reference: Amir Muslim, 2009. Peribahasa dan ungkapan Inggeris-Melayu. DBP, Kuala Lumpur http://books.google.com.my/books/about/Peribahasa_dan_ungkapan_Inggeris_Melayu.html?id=bgwwQwAACAAJ CTRL+F to search Proverbs in English Definition in English Similar Malay Proverbs Definition in Malay 1 Where there is a country, there are people. A country must have people. Ada air adalah ikan. Ada negeri adalah rakyatnya. 2 Dry bread at home is better than roast meat home's the best hujan emas di negeri orang,hujan batu di negeri sendiri Betapa baik pun tempat orang, baik lagi tempat sendiri. 3 There's no accounting for tastes We can't assume that every people have a same feel Kepala sama hitam hati lain-lain. Dalam kehidupan ini, setiap insan berbeza cara, kesukaan, perangai, tabia...

Submit your blog address here

Create your own blog and send the address by submitting the comment of this article. Make sure to provide your full name, matrix and URL address of your blog. Refer to the picture below. Manual on developing a blog using blogger.com and AdSense, download here … Download Windows Live Writer (a superb offline blog post editor)

Installing Google AdMob into Android Apps

Previously I wrote on why ads are needed to help maintaining an app. Read the article here http://blog.kerul.net/2011/05/generating-revenue-from-free-mobile.html . ---This is quite an old article. You may find the latest supporting AdMob 6.x in here http://blog.kerul.net/2012/08/example-how-to-install-google-admob-6x.html --- This is quite a long tutorial, there are 3 major steps involved. The experiment is done using Windows 7, Eclipse Helios and AdMob SDK 4.1.0 (which currently is the latest-during time of writing). STEP 1: Get the ads from AdMob.com To display the AdMob ads in your Android mobile apps, you need to register first at the admob.com . After completing the registration, login and Add Site/App. Refer to Figure 1. Figure 1 Choose the desired platform and fill in the details (as in Figure 2). Just put http:// in the Android Package URL if your app is not published in the market yet. And click Continue. Figure 2 Download the AdMob Android SDK, and save the zip fil...

Pemasangan Joomla! 1.7 pada pelayan web komputer anda

Latihan ini akan memasang sistem pengurusan kandungan laman web ke dalam pelayan web yang anda telah pasang sebelum ini . LANGKAH 1: Aktifkan Pelayan Web dan Pangkalan Data Aktifkan XAMPP Control Panel, melalui “ Start->All Programs->ApacheFriends->XAMPP Control Panel ”. Rajah 2.1 Pastikan pelayan web Apache dan pelayan pangkalan data MySQL diaktifkan dengan klik butang START. -> Rajah 2.2

Multimedia Dictionary app for Android – with Source codes

I decided to publish the following source code of a Multimedia Dictionary app for Android. The apps is a dictionary of the following languages, Malay, Arabic and English. All the word entries are available in the locally (sqlite), however we do not have the complete word list. This code is specifically designed to support only Android 2.1, 2.2 and 2.3. This is due to lack of Arabic character support in the listed Android version. And there ‘s a special treatment to the Arabic characters which we found quite complicated. It won’t run beautifully on the Android 3.0 and above. Sorry for the lack of documentation, however if you’re familiar with Java in Android, it actually simple. I’ll do further documentation later. You may somehow revamp the apps, or do what ever you want. I would like to However make only one request, please do not remove the advertisement in the main screen. To test the Arabic input, you need to have the Arabic keyboard – use this http://bit.ly/ask1-1 . Not sure with...