
mb_strtok implementation in PHP: String tokenizer for multibyte strings

This is a simple function that implements a kind of mb_strtok() in PHP. As you may be aware, PHP has no built-in mb_strtok() function, so strtok() has no counterpart for multibyte strings (such as Unicode strings). This is my attempt to solve the problem. There is still a bug: the program halts if the input text is too long (how long? I am not sure yet). Maybe you can improve it to give a better result.

Thank you and happy coding.


The PHP code, mb_strtok.php:

<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
$in=isset($_GET["txtinput"]) ? $_GET["txtinput"] : "";//empty string until the form is submitted
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

$tokens=mb_strtok(" \n\r\t?'.", $in);//escapes use a backslash; a textarea submits line breaks as \r\n
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
echo ("[$i] -> ".htmlspecialchars($tokens[$i], ENT_QUOTES, 'UTF-8') ." <br> \n");//escape user input before printing
}

function mb_strtok($delimiters, $str=NULL)
{
static $pos = 0; // Keep track of the position on the string for each subsequent call.
static $string = "";
static $listtoken=array();
// If a new string is passed, reset the static parameters.
if($str!==NULL)
{
$pos = 0;
$string = $str;
$listtoken = array();//also clear the tokens collected by a previous call
}

// Initialize the token.
$token = "";

while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
{

$char = mb_substr($string, $pos, 1, 'UTF-8');//fetch one character, pos = char position
$pos++;
//echo ("Char at $pos => $char <br>\n");//trace character at position

if(mb_strpos($delimiters, $char, 0, 'UTF-8')===FALSE)//if character is not a delimiter
{
$token .= $char;//put character in the token node
}
else
{
//if arrive at delimiter, push token to listtoken (skip empty tokens, as strtok does)
if ($token!=="")
{
array_push($listtoken, $token);
}
$token="";//clear the token node
}
}
//push the final token, which has no delimiter after it
if ($token!=="")
{
array_push($listtoken, $token);
}
// return the list of tokens (an empty array if there are none)
return $listtoken;
}
?>
</body>
</html>
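For comparison, here is a minimal sketch of the same idea without the HTML form, written as a single pass over the characters. The function name mb_tokenize is hypothetical and not part of the original code. The listing above calls mb_strlen() and mb_substr() once per position, and each call may have to scan the UTF-8 string again, so the work grows quickly with the input length. Whether that is the reason the program halts on long input is only an assumption; I have not confirmed it.

```php
<?php
// Hypothetical helper (not from the original post): tokenizes a UTF-8
// string in one pass and returns an array of non-empty tokens.
function mb_tokenize(string $str, string $delimiters): array
{
    // Split both strings into arrays of single UTF-8 characters, once.
    $chars  = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $delims = preg_split('//u', $delimiters, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $delims === false) {
        return []; // the input was not valid UTF-8
    }
    $isDelim = array_flip($delims); // delimiter character => index, for fast lookup

    $tokens = [];
    $token  = '';
    foreach ($chars as $char) {
        if (isset($isDelim[$char])) {
            // Delimiter reached: store the token collected so far, if any.
            if ($token !== '') {
                $tokens[] = $token;
            }
            $token = '';
        } else {
            $token .= $char;
        }
    }
    // The last token has no delimiter after it.
    if ($token !== '') {
        $tokens[] = $token;
    }
    return $tokens;
}

print_r(mb_tokenize("سلام dunia. Apa khabar?", " \n\r\t?'."));
```

With this sample input the tokens are سلام, dunia, Apa and khabar. The sketch takes the string as a parameter, so it has no static variables to reset between calls.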


Here is another version. This time the separators (. , ; :) are also stored in the list of tokens ($listtoken).


<html>
<head>
<title>String token for MB_STRING</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>String token for MB_STRING</h1>
<h2><a href="http://kerul.net">kerul.net</a></h2>

<form method="GET" ACTION="">
Input text <br>
<textarea name="txtinput" cols=30 rows=10></textarea>
<br>
<input type="submit" >
</form>

<?php
$in=isset($_GET["txtinput"]) ? $_GET["txtinput"] : "";//empty string until the form is submitted
$inputlen=mb_strlen($in, 'UTF-8');
echo ("Input length: $inputlen characters. <br>\n");

$tokens=mb_strtok(" \n\r\t\f", $in);//delimit by whitespace only (escapes use a backslash)
echo ("List of TOKENS<br>\n");
//echo $tokens;
for($i=0; $i<count($tokens); $i++){
echo ("[$i] -> ".htmlspecialchars($tokens[$i], ENT_QUOTES, 'UTF-8') ." <br> \n");//escape user input before printing
}

function mb_strtok($delimiters, $str=NULL)
{
static $pos = 0; // Keep track of the position on the string for each subsequent call.
static $string = "";
static $listtoken=array();
// If a new string is passed, reset the static parameters.
if($str!==NULL)
{
$pos = 0;
$string = $str;
$listtoken = array();//also clear the tokens collected by a previous call
}

// Initialize the token.
$token = "";

while ($pos < mb_strlen($string,'UTF-8'))//loop till end of input string
{

$char = mb_substr($string, $pos, 1, 'UTF-8');//fetch one character, pos = char position

echo ("Char at $pos => $char <br>\n");//trace character at position


if(mb_strpos($delimiters, $char, 0, 'UTF-8')===FALSE)//if character is not a delimiter
{
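if($char=="." || $char==";"||$char==":"||$char==","){
echo "Token detected $token <br>\n";
if($token!==""){
array_push($listtoken, $token);//push the word collected before the separator
}
array_push($listtoken, $char);//keep the separator as its own token
$token="";//clear the token node
}else{
$token .= $char;//put character in the token node
}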
}
else
{
//if arrive at delimiter, push token to listtoken (skip empty tokens, as strtok does)
if ($token!=="")
{
echo "Token detected $token <br>\n";
array_push($listtoken, $token);
}
$token="";//clear the token node
}
$pos++;
}
//push the final token, which has no delimiter after it
if ($token!=="")
{
array_push($listtoken, $token);
}
// return the list of tokens (an empty array if there are none)
return $listtoken;

}
?>
</body>
</html>
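The same separator-keeping behaviour can also be sketched with preg_split(). This is an alternative technique, not the method used in the listing above, and the function name mb_tokenize_keep_punct is hypothetical. It assumes the separators are exactly . , ; : and that tokens are otherwise divided by whitespace.

```php
<?php
// Hypothetical helper (not from the original post): splits a UTF-8 string
// on whitespace and keeps each of . , ; : as a token of its own.
function mb_tokenize_keep_punct(string $str): array
{
    // Whitespace matched by \s+ is discarded. A separator matched by the
    // capturing group ([.,;:]) is returned as its own element because of
    // PREG_SPLIT_DELIM_CAPTURE. PREG_SPLIT_NO_EMPTY drops empty pieces.
    $parts = preg_split(
        '/\s+|([.,;:])/u',
        $str,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

print_r(mb_tokenize_keep_punct("Ada air, adalah ikan."));
```

For this sample sentence the result is Ada, air, the comma, adalah, ikan and the full stop, in that order. The /u modifier makes the pattern treat the input as UTF-8, so multibyte characters are never cut in the middle.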


mb_strtok implementation in PHP for Arabic unicode strings


Modified from the code at http://www.anastis.gr/mb_strtok-a-php-implementation/
