tag:blogger.com,1999:blog-77607530199669912402024-03-13T03:16:46.990-07:00Alecco LoccoAlecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-7760753019966991240.post-19836371979247927302010-12-20T13:25:00.000-08:002010-12-20T13:37:39.729-08:00The Inevitable Revolution<p>A couple of years ago, Donald Knuth said in an interview<sup><a href="#cite-1">[1]</a></sup>:</p><br />
<blockquote><p>Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading.</p></blockquote><br />
<p>Most algorithms were designed with a classic approach of doing one instruction at a time. This is called <em>scalar</em> processing. Algorithms are then implemented on a programming language and finally run compiled or interpreted. There are important issues on these transformations of the code, but that's out of the scope of this post.</p><br />
<p>The number of transistors inside processors has been roughly doubling each year<sup><a href="#cite-2">[2]</a></sup>. Due to the nature of the transistors electricity gets transformed to heat. A lot of heat in a very small space packed with millions of transistors. So CPU manufacturers have reached a hard limit of <em>CPU power heat dissipation</em><sup><a href="#cite-3">[3]</a></sup>. The only way out of this dead end was to increase the number of CPU cores, stopping the Ghz race.</p><br />
<p>Another big issue with current computers is latency. While processors can run multiple instructions per nanosecond (ns), accessing memory takes about 100 ns. To work around this problem manufacturers place memory caches running at speeds much closer to CPU execution times. But this caches are very expensive and only processors aimed at the high-end server market have big configurations.</p><br />
<p>If programs were executed one instruction at a time most of the CPU would be idle waiting for a particular unit to finish its current task. Very often instructions in a given block of code can be run in parallel because they are not dependent (e.g. initializing unrelated variables.) The CPU performs this optimizations by having a <em>deep pipeline</em> of instructions currently executing (resembling assembly lines in factories.) To manage this pipeline the circuitry has to determine the instruction dependencies and sometimes reorder the instructions to improve throughput. Good compilers and VMs optimize compiled programs by ordering the instructions to match the characteristics of different CPU models.</p><br />
<p>When the flow of control of a program reaches a <em>conditional branch</em> (if-then-else) the processor evaluates the condition (e.g. variable a is zero) and make a jump (or not) to the following instruction. This evaluation usually takes long time and disrupts superscalar pipelining. To overcome this the processor has a dedicated unit called <em>branch predictor</em><sup><a href="cite-4">[4]</a></sup> to evaluate conditional branches by remembering what branch was taken before. If the predictor is successful the flow is keeps running fast uninterrupted. But when there's a <em>branch misprediction</em> the wrongly picked instructions in the pipeline must be undone. This causes a pipeline flush often costing 5ns. For algorithms with many conditional branches with a probability close to .5 this can multiply into major bottleneck (e.g. walking binary trees or classic compression.)</p><br />
<p>The premises for making algorithms in the literature are completely out of touch with all these issues. One of the old maxims was to avoid full scans at all costs, but a sequential memory scan nowadays goes at <em>6 to 14 GiB per second</em> and the limits are mostly the data bus bandwidth. Random access traversal in memory often is several orders of magnitude slower due to the latency issues compounded with branch mispredictions. The <em>data structures</em> often used don't scale due to fragmentation. In many cases are sequential in nature so parallel execution is either impossible or requires very slow locking mechanisms for critical areas making the code error-prone and filled with hard to predict bottlenecks.</p><br />
<p>To use larger amounts of main memory the processors switched to big data addresses. Every memory reference in a 64bit address space requires 8 bytes. In very common scenarios where linked structures are widely used for data storage in memory these huge pointers become a significant overhead. For example, each node of a binary tree will require at least 16 bytes and often 8 more to link to parent (as in popular red-black tree implementations.) If the node only stores 32bit integers the overhead is 400% to 600%!</p><br />
<p>In many cases the same code is applied to many elements of a data structure iteratively. With current processors it is possible to run the same instructions to packs of data and it's called <strong>SIMD</strong> (Single Instruction on Multiple Data.) This is widely used for media processing but it is becoming common for general purpose data processing. Most SIMD implementations are evolving to give better support for non-multimedia uses (e.g. Intel's SIMD string instructions.) Some of the most interesting new algorithms of the last decade were redisigns to exploit SIMD processing. Programming directly to one of this SIMD implementations can be quite tricky but it is possible to simplify this with compiler intrinsics or re-targetting compiler tools (like MIT's Cilk.)</p><br />
<p>There are many interesting instructions available on processors performing very useful tasks (e.g. bit scan) completely neglected by programming textbooks. The savings in time and complexity for algorithms can be very significant.</p><br />
<p>I differ with proffessor D. Knuth and think this is a great opportunity and all this limitations actually make these times very interesting for computer scientists. It's time for a <strong>revolution</strong>! Almost everything has to be revised to match the new hardware paradigm. For this there's a need for an army of <em>Knuth</em>s to find the way. We need to shake the CS establishment out of their comfortable leather chairs.</p>[Note: I do not mean Prof. D. Knuth, in fact he is one of the very few CS masters who ties high level theory with low level realistic implementations.]<br />
<br />
<h2>The New Algorithm Design Maxims</h2><br />
<ul><li>Find alternatives to sequential processing</li>
<li>Minimize memory use and avoid bloat (Latency)</li>
<li>Store related data as close as possible (Caching)</li>
<li>Minimize branch mispredictions or remove branches altogether</li>
<li>Favor vector/matrix data structures over linked nodes (Pointers, Caching)</li>
<li>Exploit vector processing if possible (SIMD)</li>
<li>Embrace specialized instructions widely available</li>
</ul><br />
<p>A very common approach to speed up processing is to perform space-time trade-offs. For example changing programmatic code to big table lookups. While this works in the small micro-benchmarks it is a short-sighted trick that usually makes scaling very hard. CPU cores can perform billions of instructions per second if used wisely and the new trend for big data processing is to do the opposite, compute-space trade-off. In particular with fast compression algorithms.</p><br />
Good times.<br />
<br />
[1] <a id="cite-1" href="http://www.informit.com/articles/article.aspx?p=1193856">InformIT: Interview with Donald Knuth (2008)</a><br />
[2] <a id="cite-2" href="http://en.wikipedia.org/wiki/Moore's_law"> Wikipedia: Moore's Law</a><br />
[3] <a id="cite-3" href="http://en.wikipedia.org/wiki/CPU_power_dissipation"> Wikipedia: CPU power dissipation</a><br />
[4] <a id="cite-4" href="http://en.wikipedia.org/wiki/Branch_predictor"> Wikipedia: Branch Predictor</a>Alecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.com0tag:blogger.com,1999:blog-7760753019966991240.post-77727371298201077472009-08-25T10:36:00.000-07:002009-08-26T10:11:30.232-07:00SQLite: A Lesson In Low-Defect Software[EDIT: Minor edits to fix HTML formatting, style, corrected footnote credit, and code indentation the editor kills on every edit.]<br />
<br />
While looking for some old papers and presentations on SQLite (remember my disk died a few months ago without proper backup), even though it seems the old ones are offline or lost, there's a new presentation on their experience producing excellent software.<br />
<br />
<div style="text-align: center;"><a href="http://www.sqlite.org/talks/wroclaw-20090310.pdf"><span style="font-size:130%;">A Lesson In Low-Defect Software</span><br />
-or-<br />
<span style="font-size:130%;">A Journey From A Quick Hack</span><br />
<span style="font-size:130%;">To A High-Reliability Database Engine</span><br />
and how you can use the same techniques to reduce the number of bugs in your own software projects<br />
D. Richard Hipp<br />
2009-03-10</a><br />
</div><br />
It seems this presentation wasn't picked up properly so here's a shortened review with more detail on statement and branch testing with GCC. This is mind-blowing stuff, at least for me. Almost everything in this post comes from the paper. I'm trying to stay as close as possible to the original with only a few things added to make it a bit easier to understand at first sight and allow straight copy-paste to play with. I also skipped the less extraordinary SCM recommendations at the end.<br />
<ul><li>Use comments on your source code to engage the other half of your brain</li>
<ul><li>One side does "math" and the other "language".</li>
</ul><li>Safety != Reliability. Safety: no harm; reliability: no failures.<br />
</li>
<li><span style="font-weight: bold;">Safe Software = Extremely Reliable Software</span>.</li>
<li>What Programming Languages Does The World's Most Reliable Software Use? Avionics: ADA, C; Space Shuttle: HAL/S. Not Python, TCL/TK, .NET, LUA, Java (safe but not reliable.)<br />
</li>
<li><a href="http://en.wikipedia.org/wiki/DO-178B">DO-178B and ED-12B</a> "Software Considerations in Airborne Systems and Equipment Certification": <span style="font-weight: bold;">Development process</span> matters, not the programming language. Captures <span style="font-weight: bold;">best practices</span>.</li>
<li>[Remember from <a href="http://video.google.com/videoplay?docid=-5160435487953918649">An Introduction to SQLite</a>: What makes SQLite great is the <span style="font-weight: bold;">extensive testing</span> (60%).]</li>
<li><span style="font-weight: bold;">Use your whole brain.</span></li>
<li><span style="font-weight: bold;">Full coverage testing.</span></li>
<li><span style="font-weight: bold;">Good configuration management.</span></li>
<li><span>Don't just fix bugs, </span><span style="font-weight: bold;">fix your process</span>.</li>
<li><span style="font-weight: bold;">Comment</span>: each <span style="font-weight: bold;">function</span> or procedure and major code blocks.</li>
<li><span style="font-weight: bold;">Comment</span>: all <span style="font-weight: bold;">variables</span> and <span style="font-weight: bold;">constant</span> declarations.</li>
<li>SQLite has a code to comment ratio of 2:1.</li>
<li>Full coverage testing:</li>
<ul><li><span style="font-weight: bold;">Automated tests</span> that exercise <span style="font-weight: bold;">all features of the program</span>: all entry points, subroutines, all branches and conditions, all cases, all boundary values.</li>
<li>DO-178B places special emphasis on testing.</li>
<li>Statement Coverage: Tests cause <span style="font-weight: bold;">every line of code</span> to run at least once.</li>
<li>Branch Coverage: Tests cause <span style="font-weight: bold;">every machine-language branch operation </span><br />
to evaluate to both TRUE and FALSE at least once.</li>
<li>(<span style="font-size:130%;"><span style="font-weight: bold;">See section below on how to do test coverage with GCC</span></span>.)<br />
</li>
</ul><li>Fly what you test! (Measuring coverage validates your tests, not your product.)</li>
<li><span style="font-weight: bold;">Defensive Programming</span> (<span style="font-weight: bold;">See section below</span>.)<br />
</li>
</ul><h3>Testing in SQLite</h3><ul><li>99% Statement Coverage<br />
</li>
<li>95% Branch Coverage<br />
</li>
<li>Goal: 100% branch coverage by Dec 2009<br />
</li>
<li>Striving for 100% test coverage has been [their] most effective method for finding bugs.</li>
<li>Testing in C/TCL (1M), C (2.3M), SQL logic (5.8M)</li>
<li>Crash testing.</li>
<li>I/O error and out-of-memory testing.</li>
<li><a href="http://en.wikipedia.org/wiki/Fuzz_testing">Fuzz</a> testing.</li>
<li><a href="http://en.wikipedia.org/wiki/Valgrind">Valgrind</a> (memory: usage, profiling, leak tracking.)</li>
<li>Most bugs are found internally – before release.</li>
<li>External bugs are mostly build problems.</li>
<li>[SQLite team] doesn't ship “alpha” or “beta” releases, all SQLite releases are production-ready.<br />
</li>
<li>It is rare for users to find "wrong answers."<br />
</li>
</ul><h3>Test Coverage with GCC</h3><br />
<br />
Consider this snippet:<br />
<pre><span class="lnr"> 1 </span><span class="Type">int</span> exampleFunction(<span class="Type">int</span> a, <span class="Type">int</span> b){
<span class="lnr"> 2 </span>
<span class="lnr"> 3 </span> <span class="Type">int</span> ans = <span class="Constant">0</span>;
<span class="lnr"> 4 </span>
<span class="lnr"> 5 </span> <span class="Statement">if</span> ( a>b && a<<span class="Constant">2</span>*b ){
<span class="lnr"> 6 </span> ans = a;
<span class="lnr"> 7 </span> } <span class="Statement">else</span> {
<span class="lnr"> 8 </span> ans = b;
<span class="lnr"> 9 </span> }
<span class="lnr">10 </span>
<span class="lnr">11 </span> <span class="Statement">return</span> ans;
<span class="lnr">12 </span>}
</pre>There are 6 statements in lines 1, 3, 5, 6, 8, 11. To test this function you could run this test:<br />
<pre>exampleFunction(<span class="Constant">1</span>,<span class="Constant">1</span>);
</pre><br />
That would cover statements in lines 1, 3, 5, 8, 11, but not line 6 (because !(a>b)). The following test addresses this statement:<br />
<pre>exampleFunction(<span class="Constant">3</span>,<span class="Constant">2</span>);
</pre><br />
Now that would get all statements covered (line 6, ans = a.) But what about all possible branches? There are 4 possible branches:<br />
<ol><li>a>b: Taken on test 2.</li>
<li>!(a>b): Taken on test 1.</li>
<li>a<2*b: Taken on test 2.</li>
<li>!(a<2*b): Never taken.</li>
</ol>So to match branch 4 (a>b && !(a<2*b)):<br />
<pre>exampleFunction(<span class="Constant">4</span>,<span class="Constant">2</span>);
</pre>Voilà! All statements and branches tested.<br />
<br />
It seems <span style="font-weight: bold;">gcc</span> has some great features to help. With <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/Debugging-Options.html#index-fprofile_002darcs-492">-fprofile-arcs</a> and <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/Debugging-Options.html#index-ftest_002dcoverage-495">-ftest-coverage</a> it will generate branch coverage data, from gcc <a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/Debugging-Options.html">Debugging Options manual</a>:<br />
<blockquote><a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/Debugging-Options.html#index-fprofile_002darcs-492">-fprofile-arcs</a><br />
Add code so that program flow arcs are instrumented. During execution the program records <span style="font-weight: bold;">how many times each branch and call is executed and how many times it is taken or returns</span>. When the compiled program exits it saves this data to a file called aux-name.gcda for each source file. The data may be used for profile-directed optimizations (-fbranch-probabilities), or for <span style="font-style: italic;">test coverage analysis</span> (-ftest-coverage).<br />
[...]<br />
<a href="http://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/Debugging-Options.html#index-ftest_002dcoverage-495">-ftest-coverage</a><br />
<span style="font-weight: bold;">Produce a notes file that the gcov code-coverage utility can use to show program coverage</span>. Each source file's note file is called aux-name.gcno. Refer to the -fprofile-arcs option above for a description of auxname and instructions on how to generate test coverage data. Coverage data will match the source files more closely, if you do not optimize.<br />
</blockquote>So let's make this test with testme.c with only the first test:<br />
<pre><span class="Type">int</span> exampleFunction(<span class="Type">int</span> a, <span class="Type">int</span> b){
<span class="Type">int</span> ans = <span class="Constant">0</span>;
<span class="Statement">if</span> ( a>b && a<<span class="Constant">2</span>*b ){
ans = a;
} <span class="Statement">else</span> {
ans = b;
}
<span class="Statement">return</span> ans;
}
<span class="Type">int</span> main() {
exampleFunction(<span class="Constant">1</span>,<span class="Constant">1</span>);
<span class="Statement">return</span> <span class="Constant">0</span>;
}</pre>Compiling and running:<br />
<pre>$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \
&& gcov -c testme.c && cat testme.c.gcov
File 'testme.c'
<span class="Constant">Lines executed:88.89% of 9</span>
testme.c:creating 'testme.c.gcov'
-: 0:Source:testme.c
-: 0:Graph:testme.gcno
-: 0:Data:testme.gcda
-: 0:Runs:1
-: 0:Programs:1
1: 1:int exampleFunction(int a, int b){
-: 2:
1: 3: int ans = 0;
-: 4:
1: 5: if ( a>b && a<2*b ){
<span class="Constant">#####: 6: ans = a;</span>
-: 7: } else {
1: 8: ans = b;
-: 9: }
-: 10:
1: 11: return ans;
-: 12:}
-: 13:
1: 14:int main() {
<span class="Type">1: 15: exampleFunction(1,1);</span>
1: 16: return 0;
-: 17:}</pre>The first column on the report shows how many times the source line was evaluated. The second column is the source line number. Note the <span style="font-weight: bold;">#####</span> mark on the missed statement. Now if we add the second test:<br />
<pre>$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \
&& gcov -c testme.c && cat testme.c.gcov
File 'testme.c'
<span class="Type">Lines executed:100.00% of 10</span>
testme.c:creating 'testme.c.gcov'
-: 0:Source:testme.c
-: 0:Graph:testme.gcno
-: 0:Data:testme.gcda
-: 0:Runs:1
-: 0:Programs:1
2: 1:int exampleFunction(int a, int b){
-: 2:
2: 3: int ans = 0;
-: 4:
3: 5: if ( a>b && a<2*b ){
1: 6: ans = a;
-: 7: } else {
1: 8: ans = b;
-: 9: }
-: 10:
2: 11: return ans;
-: 12:}
-: 13:
1: 14:int main() {
1: 15: exampleFunction(1,1);
<span class="Type">1: 16: exampleFunction(3,2);</span>
1: 17: return 0;
-: 18:}</pre>It now shows the statement in line 6 was taken.<br />
<br />
Let's do the branch test with only the first test. Note the <span style="font-weight: bold;">-b</span> flag to gcov is the only command line change:<br />
<br />
<pre>$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \
&& gcov <span class="Type">-b</span> -c testme.c && cat testme.c.gcov
File 'testme.c'
Lines executed:88.89% of 9
<span class="Constant">Branches executed:50.00% of 4</span>
<span class="Constant">Taken at least once:25.00% of 4</span>
Calls executed:100.00% of 1
testme.c:creating 'testme.c.gcov'
-: 0:Source:testme.c
-: 0:Graph:testme.gcno
-: 0:Data:testme.gcda
-: 0:Runs:1
-: 0:Programs:1
function exampleFunction called 1 returned 100% blocks executed 60%
1: 1:int exampleFunction(int a, int b){
-: 2:
1: 3: int ans = 0;
-: 4:
1: 5: if ( a>b && a<2*b ){
branch 0 taken 0 (fallthrough)
branch 1 taken 1
<span class="Constant">branch 2 never executed</span>
<span class="Constant">branch 3 never executed</span>
<span class="Constant">#####: 6: ans = a;</span>
-: 7: } else {
1: 8: ans = b;
-: 9: }
-: 10:
1: 11: return ans;
-: 12:}
-: 13:
function main called 1 returned 100% blocks executed 100%
1: 14:int main() {
1: 15: exampleFunction(1,1);
call 0 returned 1
1: 16: return 0;
-: 17:}</pre>The report says the branches taken at least once are 25%. Further in the report, after line 5: it lists the 4 branches from that line. Branch 0 (a>b) was evaluated but not taken (because a=b.) Branch 1 (!(a>b)) was taken. Since branches 2 and 3 depend on branch 0 those weren't even executed. Now let's run with the second test:<br />
<pre>$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \
&& gcov -b -c testme.c && cat testme.c.gcov
File 'testme.c'
Lines executed:100.00% of 10
Branches executed:100.00% of 4
<span class="Constant">Taken at least once:75.00% of 4</span>
Calls executed:100.00% of 2
testme.c:creating 'testme.c.gcov'
-: 0:Source:testme.c
-: 0:Graph:testme.gcno
-: 0:Data:testme.gcda
-: 0:Runs:1
-: 0:Programs:1
function exampleFunction called 2 returned 100% blocks executed 100%
2: 1:int exampleFunction(int a, int b){
-: 2:
2: 3: int ans = 0;
-: 4:
3: 5: if ( a>b && a<2*b ){
branch 0 taken 1 (fallthrough)
branch 1 taken 1
branch 2 taken 1 (fallthrough)
<span class="Constant">branch 3 taken 0</span>
1: 6: ans = a;
-: 7: } else {
1: 8: ans = b;
-: 9: }
-: 10:
2: 11: return ans;
-: 12:}
-: 13:
function main called 1 returned 100% blocks executed 100%
1: 14:int main() {
1: 15: exampleFunction(1,1);
call 0 returned 1
<span class="Type">1: 16: exampleFunction(3,2);</span>
call 0 returned 1
1: 17: return 0;</pre>Here we see the only remaining branch not taken is 3 (but it was executed). Let's run with the third test:<br />
<br />
<pre>$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \
&& gcov -b -c testme.c && cat testme.c.gcov
File 'testme.c'
Lines executed:100.00% of 11
<span class="Type">Branches executed:100.00% of 4</span>
<span class="Type">Taken at least once:100.00% of 4</span>
Calls executed:100.00% of 3
testme.c:creating 'testme.c.gcov'
-: 0:Source:testme.c
-: 0:Graph:testme.gcno
-: 0:Data:testme.gcda
-: 0:Runs:1
-: 0:Programs:1
function exampleFunction called 3 returned 100% blocks executed 100%
3: 1:int exampleFunction(int a, int b){
-: 2:
3: 3: int ans = 0;
-: 4:
4: 5: if ( a>b && a<2*b ){
branch 0 taken 2 (fallthrough)
branch 1 taken 1
branch 2 taken 1 (fallthrough)
branch 3 taken 1
1: 6: ans = a;
-: 7: } else {
2: 8: ans = b;
-: 9: }
-: 10:
3: 11: return ans;
-: 12:}
-: 13:
function main called 1 returned 100% blocks executed 100%
1: 14:int main() {
1: 15: exampleFunction(1,1);
call 0 returned 1
1: 16: exampleFunction(3,2);
call 0 returned 1
<span class="Type">1: 17: exampleFunction(4,2);</span>
call 0 returned 1
1: 18: return 0;</pre>Great, 100% coverage now!<br />
<br />
<br />
<h3>Defensive Programming</h3><br />
Input variable boundary checking would create many branches never taken on tests. Consider for example checking the value of input nBytes doesn't get over a limit 0x7fffff00:<br />
<pre><span class="Type">void</span> *sqlite3InternalMalloc(<span class="Type">int</span> nBytes){
<span class="Statement">if</span>( nBytes<=<span class="Constant">0</span> || nBytes>=<span class="Constant">0x7fffff00</span> ){
<span class="Statement">return</span> <span class="Constant">0</span>;
}<span class="Statement">else</span>{
<span class="Statement">return</span> sqlite3LowLevelMalloc(nBytes);
}
}</pre>That branch can't be tested. SQLite has some interesting macros <span style="font-weight: bold;">NEVER()</span> and <span style="font-weight: bold;">ALWAYS()</span>:<br />
<pre><span class="PreProc">#if defined(SQLITE_COVERAGE_TEST) </span>
<span class="PreProc"># define ALWAYS(X) </span><span class="Constant">1</span><span class="PreProc"> </span>
<span class="PreProc"># define NEVER(X) </span><span class="Constant">0</span><span class="PreProc"> </span>
<span class="PreProc">#elif defined(SQLITE_DEBUG) </span>
<span class="PreProc"># define ALWAYS(X) ((X)?</span><span class="Constant">1</span><span class="PreProc">:sqlite3Panic()) </span>
<span class="PreProc"># define NEVER(X) ((X)?sqlite3Panic():</span><span class="Constant">0</span><span class="PreProc">) </span>
<span class="PreProc">#else</span>
<span class="PreProc"># define ALWAYS(X) (X) </span><span class="Comment">// PASS THROUGH (What you fly)</span>
<span class="PreProc"># define NEVER(X) (X) </span><span class="Comment">// PASS THROUGH (What you fly)</span>
<span class="PreProc">#endif</span></pre>Applied to the bounds check above:<br />
<pre><span class="Type">void</span> *sqlite3InternalMalloc(<span class="Type">int</span> nBytes){
<span class="Statement">if</span>( nBytes<=<span class="Constant">0</span> || NEVER(nBytes>=<span class="Constant">0x7fffff00</span>) ){
<span class="Statement">return</span> <span class="Constant">0</span>;
}<span class="Statement">else</span>{
<span class="Statement">return</span> sqlite3LowLevelMalloc(nBytes);
}
}</pre><br />
<h3>Footnote</h3><br />
<br />
AFAIK, testing can't get any better than this. Somebody could do a machine-checked proof, like the <strike>Cambridge</strike> <a href="http://ertos.nicta.com.au/research/sel4/">seL4 team (UNSW/Sydney)</a> <a href="http://www.nicta.com.au/news/home_page_content_listing/world-first_research_breakthrough_promises_safety-critical_software_of_unprecedented_reliability">did</a> (also mind-blowing by itself, but probably overkill for now.)Alecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.com9tag:blogger.com,1999:blog-7760753019966991240.post-2293424801991326092009-02-10T20:33:00.000-08:002009-03-14T06:50:48.091-07:00Making a free hostel reservation system - pt 1Leveraging my inclination to postpone the comet server and other things, I got myself into a <span style="font-style: italic;">2 day sprint</span> to make a <span style="font-style: italic;">proof of concept</span><span> of a </span><span style="font-weight: bold;">hostel/b&b reservations system</span> that wouldn't suck and could be implemented for <span style="font-weight: bold;">free</span>. Current services are all closed source <span style="font-weight: bold;">quasi-scams</span> holding the customer's payment as ransom and take the first payment of the first day as commission (~<span style="font-style: italic;">15 bucks per transaction!</span>) while still making the establishments update <span style="font-style: italic;">manually</span> (wtf!) by themselves. That for something that can be coded in a couple of days. And there are at least dozens of these middlemen. <span style="font-style: italic;">Customers</span> lose the most as the process today takes very long with multiple sites and with the risk of using their credit cards with these <span style="font-style: italic;">dodgy</span> intermediaries. These customers usually have to book on shared computers, and they might have to pay for the time to use the computer. Most commercial reservation systems are so bad their code needs your browsing session to jump to second and even third domains. On the other hand, major hostel chains ask over u$d 1,500 <span style="font-style: italic;">per year</span> and demand and around <span style="font-style: italic;">20% discount</span> for their members. Each member has to cough up u$d 15 per year.<br /><br />I know many will tell me about how I should make money out of this, but I don't think in this present system of chaos, abuse, and misinformation anybody can have a decent working business model. I'll be happy to just show how disposable they really are to both hostels and travelers.<br /><br />Some goals for <span style="font-style: italic;">hostel owners</span>:<br /><ul><li>To own the data and be able to pull out as they please, <span style="font-weight: bold;">no lock-in</span>.</li><li>Simple way to have their own data layout to play with, thus avoiding multiple reservation tracking systems.<br /></li><li>Zero cost.</li><li>Minimum possible technical knowledge requirements.</li></ul>Some goals for travelers:<br /><ul><li>Have a fast availability system either global or on the hostel's page.<br /></li><li><span style="font-weight: bold;">No credit card</span> required.<br /></li><li>Easy reservation form request with confirmation by email/phone confirmation (most hostels have this already in place.)</li></ul>Technical:<br /><ul><li><span style="font-weight: bold;">Google Docs</span> (it shouldn't be that hard to switch to alternatives and it's a sort of neutral brand for everybody.) <span style="font-weight: bold;">Spreadsheet</span> to manage and export the reservations, <span style="font-weight: bold;">Forms</span> to receive the reservation requests, <span style="font-weight: bold;">Gadget</span> with the core logic and easy to place inside even static pages hostels have, <span style="font-weight: bold;">Sites</span> for the central page.<br /></li></ul>The idea so far is to give each hostel a Google <span style="font-style: italic;">Spreadsheet</span> (or a template for one and instructions) where they have on one <span style="font-style: italic;">worksheet</span> all the rooms and can fill with the names of the reservations or of the current occupiers. Then a separate <span style="font-style: italic;">worksheet</span> can count and show the result of available beds/rooms for a specific date and share that with the world. The gadget feeds from that data creating a <span style="font-weight: bold;">script tag hack</span> requesting a specific range of cells with <span style="font-weight: bold;">JSON</span> formatting and callback. The customer searches specifying date and amount of people, then if there is availability a small form is displayed. This form is adjusted to match a Google Docs form requesting the basic data (name, email, phone, date of arrival, number of nights.)<br /><br />So it's one spreadsheet with two worksheets and a form, per hostel.<br /><br /><br />On the legal side this site could be set with these:<br /><ul><li>Use only Creative Commons non-commercial share-alike license for the all published data (the <span style="font-style: italic;">shared</span> bit of the spreadsheets, not all the data!)<br /></li><li>Use GPLv3 for the JavaScript code (a bit redundant but just to make it clear, there is still heavy server-side JavaScript on the corporate world, you wouldn't believe what I've seen.)</li></ul>That should be enough to have people feel like joining on the project as developers or users. I hope.<br /><br />Missing but possibly easy to implement features:<br /><ul><li>Internationalization and localization (existing booking systems suck at this, also.)</li><li>A sane way to make a guarantee to replace deposits (e.g. ask for a $1 paypal/amazon donation to a range of charities like EFF, CC, or WaterAid.)<br /></li><li>Hostel media hosting or display (i.e. pictures, videos, descriptions.)</li><li>A fair review system (well, not that easy due to the abundance of trolls.)<br /></li></ul>And here is the working gadget (source <a href="http://sites.google.com/a/alejosanchez.com/main/projects/reservations/reservations.xml">here</a>):<br /><br /><iframe igsrc="http://www.google.com/chart?chc=sites&cht=d&chdp=sites&chl=%5B%5BGoogle+Gadget%27%3D16%27f%5Cbf%5Chv%27a%5C%3D248%270%27%3D247%270%27dim%27%5Cbox1%27b%5CDBD9BB%27fC%5CDBD9BB%27eC%5C15%27sk%27%5C%5B%27%5Dh%27a%5CV%5C%3D12%27f%5Cbf%5C%5DV%5Cta%5C%3D249%27%3D0%27%3D248%27%3D297%27dim%27%5C%3D249%27%3D0%27%3D248%27%3D297%27vdim%27%5Cbox1%27b%5Cva%5CFFFEF0%27fC%5CDBD9BB%27eC%5Csites_gadget%27i%5Chv-0-0%27a%5C%5Do%5CLauto%27f%5C&sig=MAk-W5qpvuMsd4aI6jj7yu6dqOo" src="http://251.gmodules.com/ig/ifr?mid=251&synd=trogedit&url=http%3A%2F%2Fsites.google.com%2Fa%2Falejosanchez.com%2Fmain%2Fprojects%2Freservations%2Freservations.xml%3Fattredirects%3D0&h=300&w=500" style="margin: 5px auto 5px 0pt; display: block; text-align: left;" class="igm" props="igsrc:http://251.gmodules.com/ig/ifr?mid=251&synd=trogedit&url=http%3A%2F%2Fsites.google.com%2Fa%2Falejosanchez.com%2Fmain%2Fprojects%2Freservations%2Freservations.xml%3Fattredirects%3D0&h=300&w=500;height:300;width:500;scrolling:no;align:left;" frameborder="0" height="300" width="500"></iframe><br /><br />It can be placed on any static page with just this iframe:<br /><textarea cols="60" rows="6"><iframe igsrc="http://www.google.com/chart?chc=sites&cht=d&chdp=sites&chl=%5B%5BGoogle+Gadget%27%3D16%27f%5Cbf%5Chv%27a%5C%3D248%270%27%3D247%270%27dim%27%5Cbox1%27b%5CDBD9BB%27fC%5CDBD9BB%27eC%5C15%27sk%27%5C%5B%27%5Dh%27a%5CV%5C%3D12%27f%5Cbf%5C%5DV%5Cta%5C%3D249%27%3D0%27%3D248%27%3D297%27dim%27%5C%3D249%27%3D0%27%3D248%27%3D297%27vdim%27%5Cbox1%27b%5Cva%5CFFFEF0%27fC%5CDBD9BB%27eC%5Csites_gadget%27i%5Chv-0-0%27a%5C%5Do%5CLauto%27f%5C&sig=MAk-W5qpvuMsd4aI6jj7yu6dqOo" src="http://251.gmodules.com/ig/ifr?mid=251&synd=trogedit&url=http%3A%2F%2Fsites.google.com%2Fa%2Falejosanchez.com%2Fmain%2Fprojects%2Freservations%2Freservations.xml%3Fattredirects%3D0&h=300&w=500" style="margin: 5px auto 5px 0pt; display: block; text-align: left;" class="igm" props="igsrc:http://251.gmodules.com/ig/ifr?mid=251&synd=trogedit&url=http%3A%2F%2Fsites.google.com%2Fa%2Falejosanchez.com%2Fmain%2Fprojects%2Freservations%2Freservations.xml%3Fattredirects%3D0&h=300&w=500;height:300;width:500;scrolling:no;align:left;" frameborder="0" height="300" width="500"></iframe></textarea>Alecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.com2tag:blogger.com,1999:blog-7760753019966991240.post-38998503188894239732008-12-29T12:23:00.001-08:002008-12-29T12:56:26.204-08:00And The Dog Ate My HomeworkAfter relocating from London to Buenos Aires my HD broke. Then, the pile of DVDs and CDs got lost including the OSX rescue (finding Mac install DVDs on christmas to borrow was quite a challenge) and my pre-flight backup DVDs were gone there too. Oh, and my hidden TrueCrypt volume got trashed by stupidly adding things to the outer volume (paranoia doesn't pay.)<br /><br />Luckily the outlines of upcoming posts were stored on Blogger, and I remember most of the code's ideas. I'll try to move on from this.<br /><br />Besides that hiccup things are doing very well and even got a suntan.<br /><br />Backup now to a pendrive, to some DVDs, to an external HD, to Gmail, everywhere. Use TrueCrypt, but wisely, get yourself some subtle warning if you do the hidden volume trick! (And at least write-protect the drive.)Alecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.com1tag:blogger.com,1999:blog-7760753019966991240.post-91098431520141436322008-11-09T20:59:00.000-08:002008-11-12T18:50:38.329-08:00The Ephemeral Ports Problem (and solution)<a href="http://www.metabrew.com/">Richard Jones</a> <a href="http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3/">ran into a problem</a> doing load tests with many open sockets. It was quite an interesting thing to investigate. For reasons still unknown to me after <span style="font-style: italic;">many</span> hours of reading kernel code, documentation, and mailing lists, Linux shares the assigned list of <a href="http://en.wikipedia.org/wiki/Ephemeral_port"><span style="font-weight: bold;">ephemeral ports</span></a> across all local <span style="font-weight: bold;">IPs</span> for unconnected sockets. I hope this post will save some time to someone and there is a workaround suggestion for libevent. (By the way RJ, you were right on this problem! Though I differ now on the solution... Sorry, can't help it, it seems :)<br /><br /><h3>Ephemeral Ports</h3>Ephemeral ports are an assigned range of IP ports to be used by <span style="font-style: italic;">unprivileged</span> processes. This range is usually a few thousand ports above 1024. This range can be modified by the administrator by modifying the relevant <a href="http://en.wikipedia.org/wiki/Sysctl"><span style="font-style: italic;">sysctl</span></a> variable. As you probably know, transport protocols in the TCP/IP suite use ports, and when a program wants to initiate communication with this protocols it needs to ask for a local address, and if not done explicitly the OS assigns one automatically. The problem lies on the lookup of available ports.<br /><br />The operating system tracks the port numbers in use and also the ones in use recently (to know how to handle leftover incoming network packets.) In Linux, this is done with a technique called <a href="http://en.wikipedia.org/wiki/Hash_table">hash table</a>. The code in the Linux kernel for TCP/IP networking is quite complicated, lacks documentation or comments, and is hard to track what is defined where. After many days banging my head against the crude code, I finally <span style="font-style: italic;">got it</span>. Random posts on internet and a side comment on a standard draft said the ephemeral range was shared for all local addresses on most operating systems. I wanted to know where, how, and if possible, why. So far, I only got the first two and only hints of the last.<br /><br /><h3>Ephemeral Port Assignment</h3>The pattern to create a TCP socket for client software is to call:<br /><pre><span class="Type">int</span> sock_fd = socket(AF_INET, SOCK_STREAM, <span class="Constant">0</span>);</pre>This will make a socket of TCP/IP famiy, of stream type (connected), of default protocol ("0" in this case, TCP.) After this you can manually assign it to a local address by calling:<br /><pre>bind(sock_fd, local_addr, local_addr_length);</pre>That address should contain both the IP address and the port. If the port specified is 0, the kernel looks up for an available port in the ephemeral range. After this you make the actual connection to the server with:<br /><pre>connect(sock_fd, destination_addr, destination_addr_length);</pre>If the bind step was omitted the kernel's connect code does a similar, but slightly different lookup of available ports. Let's compare both lookups.<br /><br />The bind lookup algorithm resides in <a href="http://lxr.linux.no/linux+v2.6.27.5/net/ipv4/inet_connection_sock.c#L88">net/ipv4/inet_connection_sock.c</a>'s function inet_csk_get_port():<br /><pre><span class="Comment">/*</span><span class="Comment"> Obtain a reference to a local port for the given sock,</span><br /><span class="Comment"> * if snum is zero it means select any available local port.</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br /><span class="Type">int</span> inet_csk_get_port(<span class="Type">struct</span> sock *sk, <span class="Type">unsigned</span> <span class="Type">short</span> snum)<br />{<br /><span class="Comment">/*</span><span class="Comment"> ... </span><span class="Comment">*/</span><br /><span class="Statement">if</span> (!snum) {<br /> <span class="Type">int</span> remaining, rover, low, high;<br /><br /> inet_get_local_port_range(&low, &high);<br /> remaining = (high - low) + <span class="Constant">1</span>;<br /> rover = net_random() % remaining + low;<br /><br /> <span class="Statement">do</span> {<br /> head = &hashinfo->bhash[inet_bhashfn(net, rover,<br /> hashinfo->bhash_size)];<br /> spin_lock(&head->lock);<br /> inet_bind_bucket_for_each(tb, node, &head->chain)<br /> <span class="Statement">if</span> (tb->ib_net == net && tb->port == rover)<br /> <span class="Statement">goto</span> next;<br /> <span class="Statement">break</span>;<br /> <span class="Statement">next</span>:<br /> spin_unlock(&head->lock);<br /> <span class="Statement">if</span> (++rover > high)<br /> rover = low;<br /> } <span class="Statement">while</span> (--remaining > <span class="Constant">0</span>);<br /><br /> <span class="Comment">/*</span><span class="Comment"> Exhausted local port range during search? It is not</span><br /><span class="Comment"> * possible for us to be holding one of the bind hash</span><br /><span class="Comment"> * locks if this test triggers, because if 'remaining'</span><br /><span class="Comment"> * drops to zero, we broke out of the do/while loop at</span><br /><span class="Comment"> * the top level, not from the 'break;' statement.</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br /> ret = <span class="Constant">1</span>;<br /> <span class="Statement">if</span> (remaining <= <span class="Constant">0</span>)<br /> <span class="Statement">goto</span> fail;<br /><br /> <span class="Comment">/*</span><span class="Comment"> OK, here is the one we will use. HEAD is</span><br /><span class="Comment"> * non-NULL and we hold it's mutex.</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br /> snum = rover;<br />} <span class="Statement">else</span> {<br /></pre>When <span style="font-style: italic;">snum</span> is 0, it looks for an available bucket in the hash table, but if there is anything in it (any socket using that port, or recently closed) it keeps looking. If the search hits the end, the function <span style="font-style: italic;">fail</span>s. To note, there is <span style="font-weight: bold;">no use of local IP address in the hash table</span>! The <span style="font-style: italic;">net</span> thing passed isn't forthat. The hash table only cares of port numbers. In contrast, the port lookup on connect in net/ipv4/inet_hashtables.c does:<br /><pre><span class="Type">int</span> __inet_hash_connect(<span class="Type">struct</span> inet_timewait_death_row *death_row,<br /> <span class="Type">struct</span> sock *sk, u32 port_offset,<br /> <span class="Type">int</span> (*check_established)(<span class="Type">struct</span> inet_timewait_death_row *,<br /> <span class="Type">struct</span> sock *, __u16, <span class="Type">struct</span> inet_timewait_sock **),<br /> <span class="Type">void</span> (*hash)(<span class="Type">struct</span> sock *sk))<br />{<br /><span class="Comment">/*</span><span class="Comment"> ... </span><span class="Comment">*/</span><br /><span class="Statement">if</span> (!snum) {<br /> <span class="Type">int</span> i, remaining, low, high, port;<br /> <span class="Type">static</span> u32 hint;<br /> u32 offset = hint + port_offset;<br /> <span class="Type">struct</span> hlist_node *node;<br /> <span class="Type">struct</span> inet_timewait_sock *tw = <span class="Constant">NULL</span>;<br /><br /> inet_get_local_port_range(&low, &high);<br /> remaining = (high - low) + <span class="Constant">1</span>;<br /><br /> local_bh_disable();<br /> <span class="Statement">for</span> (i = <span class="Constant">1</span>; i <= remaining; i++) {<br /> port = low + (i + offset) % remaining;<br /> head = &hinfo->bhash[inet_bhashfn(net, port,<br /> hinfo->bhash_size)];<br /> spin_lock(&head->lock);<br /><br /> <span class="Comment">/*</span><span class="Comment"> Does not bother with rcv_saddr checks,</span><br /><span class="Comment"> * because the established check is already</span><br /><span class="Comment"> * unique enough.</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br /> inet_bind_bucket_for_each(tb, node, &head->chain) {<br /> <span class="Statement">if</span> (tb->ib_net == net && tb->port == port) {<br /> WARN_ON(hlist_empty(&tb->owners));<br /> <span class="Statement">if</span> (tb->fastreuse >= <span class="Constant">0</span>)<br /> <span class="Statement">goto</span> next_port;<br /> <span class="Statement">if</span> (!check_established(death_row, sk,<br /> port, &tw))<br /> <span class="Statement">goto</span> ok;<br /> <span class="Statement">goto</span> next_port;<br /> }<br /> }<br /><br /> tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,<br /> net, head, port);<br /> <span class="Statement">if</span> (!tb) {<br /> spin_unlock(&head->lock);<br /> <span class="Statement">break</span>;<br /> }<br /> tb->fastreuse = -<span class="Constant">1</span>;<br /> <span class="Statement">goto</span> ok;<br /><br /> <span class="Statement">next_port</span>:<br /> spin_unlock(&head->lock);<br /> }<br /> local_bh_enable();<br /><br /> <span class="Statement">return</span> -EADDRNOTAVAIL;<br /><br /><span class="Statement">ok</span>:<br /> <span class="Comment">/*</span><span class="Comment"> ... </span><span class="Comment">*/</span></pre>The algorithm is quite similar but if the hash table bucket for the port is in use, it calls <a href="http://lxr.linux.no/linux+v2.6.27.5/net/ipv4/inet_hashtables.c#L259">check_established()</a> to perform further checks:<br /><pre><br /><span class="Comment">/*</span><span class="Comment"> called with local bh disabled </span><span class="Comment">*/</span><br /><span class="Type">static</span> <span class="Type">int</span> __inet_check_established(<span class="Type">struct</span> inet_timewait_death_row *death_row,<br /> <span class="Type">struct</span> sock *sk, __u16 lport,<br /> <span class="Type">struct</span> inet_timewait_sock **twp)<br />{<br /><span class="Comment">/*</span><span class="Comment"> ... </span><span class="Comment">*/</span><br /><span class="Comment">/*</span><span class="Comment"> Check TIME-WAIT sockets first. </span><span class="Comment">*/</span><br />sk_for_each(sk2, node, &head->twchain) {<br /> tw = inet_twsk(sk2);<br /><br /> <span class="Statement">if</span> (INET_TW_MATCH(sk2, net, hash, acookie,<br /> saddr, daddr, ports, dif)) {<br /> <span class="Statement">if</span> (twsk_unique(sk, sk2, twp))<br /> <span class="Statement">goto</span> unique;<br /> <span class="Statement">else</span><br /> <span class="Statement">goto</span> not_unique;<br /> }<br />}<br />tw = <span class="Constant">NULL</span>;<br /><br /><span class="Comment">/*</span><span class="Comment"> And established part... </span><span class="Comment">*/</span><br />sk_for_each(sk2, node, &head->chain) {<br /> <span class="Statement">if</span> (INET_MATCH(sk2, net, hash, acookie,<br /> saddr, daddr, ports, dif))<br /> <span class="Statement">goto</span> not_unique;<br />}<br /><br /><span class="Statement">unique</span>:<br /><span class="Comment">/*</span><span class="Comment"> Must record num and sport now. Otherwise we will see</span><br /><span class="Comment"> * in hash table socket with a funny identity. </span><span class="Comment">*/</span><br />inet->num = lport;<br />inet->sport = htons(lport);<br />sk->sk_hash = hash;<br />WARN_ON(!sk_unhashed(sk));<br />__sk_add_node(sk, &head->chain);<br />sock_prot_inuse_add(sock_net(sk), sk->sk_prot, <span class="Constant">1</span>);<br />write_unlock(lock);<br /><br /><span class="Statement">if</span> (twp) {<br /> *twp = tw;<br /> NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);<br />} <span class="Statement">else</span> <span class="Statement">if</span> (tw) {<br /> <span class="Comment">/*</span><span class="Comment"> Silly. Should hash-dance instead... </span><span class="Comment">*/</span><br /> inet_twsk_deschedule(tw, death_row);<br /> NET_INC_STATS_BH(net, LINUX_MIB_TIMEWAITRECYCLED);<br /><br /> inet_twsk_put(tw);<br />}<br /><br /><span class="Statement">return</span> <span class="Constant">0</span>;<br /><br /><span class="Statement">not_unique</span>:<br />write_unlock(lock);<br /><span class="Statement">return</span> -EADDRNOTAVAIL;<br />}<br /></pre>This allows to reuse the same local port as long as the 5-tuple (protocol, source address, source port, destination address, destination port) doesn't exist already (the <a href="http://lxr.linux.no/linux+v2.6.27.5/include/net/inet_hashtables.h#L299">INET_MATCH</a> call.)<br /><br /><h3>Catch 22</h3>So there is a <a href="http://en.wikipedia.org/wiki/Catch-22_%28logic%29">dilemma</a> on how to create more client TCP sockets than the number of available ephemeral ports (let's call <span style="font-style: italic;">n_sockets</span> and <span style="font-style: italic;">n_ephemeral</span>.)<br /><ul><li> Increasing <span style="font-style: italic;">n_sockets</span> by having with multiple source IP addresses (the RJ approach) won't work, because it will fail on the lookup of available ephemeral ports (it doesn't care about the source address.)</li><li>If you make just a <span style="font-style: italic;">connect</span> call you get limited to <span style="font-style: italic;">n_ephemeral</span> because the lookup isn't for IP and ephemeral port, it's just a lookup of a port within a local IP (as noted on the comment above.)<br /></li></ul>[Note: there is no way to do an incomplete bind of only the IP address part and leaving the port to be assigned for later.<br /><br />After this situation RJ offered a patch to libevent to do it the way <a href="http://www.hpl.hp.com/research/linux/httperf/wisp98/html/doc005.html#s4.2">httpperf</a> does, binding local address and port. This means the client code has to do the port allocation lookup and if not carefully managed it will be an incredible amount of work on tries to call <span style="font-style: italic;">bind()</span>. In my opinion, this is <span style="font-weight: bold;">hackish and ugly</span>. It's not their fault, they were cornered by poor implementations and poor interfaces. In RJ's case <span style="font-weight: bold;">libevent always calls bind before connect</span> so there isn't even a chance to do it right, as it is.<br /><br />Also I didn't like the idea of having to bother the user to have more local addresses and having to pass that to the client program.<br /><br /><h3>My $.02</h3>As a programmer, one way to allow so many connections to a server from a single host would be to instead <span style="font-weight: bold;">increase the number of ports the server is listening</span>. This is very common and should be trivial to do and scales very well (<span style="font-style: italic;">n_ephemeral</span> times the number of server ports.) The only limitation is if there is a firewall or some other kind of filter but it is quite unlikely. In this particular case this would require a modification of libevent, to prevent it from calling bind() before connect if no local address is specified (for client code.) This is in effect a <span style="font-weight: bold;">four line patch</span> and no change of libevent API (<a href="http://www.mail-archive.com/libevent-users@monkey.org/msg01304.html">RJ's diff</a> adds another API function and is about 16 lines):<br /><pre><span class="Type">--- http.c 2008-09-08 01:11:13.000000000 +0100</span><br /><span class="Type">+++ http.c.new 2008-11-13 02:09:12.000000000 +0000</span><br /><span class="Statement">@@ -1731,7 +1731,10 @@</span><br /> assert(!(evcon->flags & EVHTTP_CON_INCOMING));<br /> evcon->flags |= EVHTTP_CON_OUTGOING;<br /><br /><span class="Special">- evcon->fd = bind_socket(evcon->bind_address, 0 /*port*/, 0 /*reuse*/);</span><br /><span class="Identifier">+ if (evcon->bind_address)</span><br /><span class="Identifier">+ evcon->fd = bind_socket(evcon->bind_address, 0 /*port*/, 0 /*reuse*/);</span><br /><span class="Identifier">+ else</span><br /><span class="Identifier">+ evcon->fd = socket(AF_INET, SOCK_STREAM, 0); /* generic socket */</span><br /> if (evcon->fd == -1) {<br /> event_debug(("%s: failed to bind to \"%s\"",<br /> __func__, evcon->bind_address));</pre>(Yes, I already mailed <a href="http://www.provos.org/">Niels</a> a few days ago about it. But <a href="http://www.mail-archive.com/libevent-users@monkey.org/msg00320.html">as usual</a>, he'll probably have a better way to do it. Hi Niels ;)<br /><br /><h3>Trying to Make Sense of the Kernel Algorithm<br /></h3>Why are ephemeral ports searched this way? Why is bind() so strict? Well, at that point:<br /><ul><li>The kernel only knows it is a TCP socket.</li><li>It doesn't know if it is going to be a client or server (listen) socket.</li><li>And even if it knew it is a client, it wouldn't know yet the destination address and port.<br /></li></ul>Some <a href="http://www.google.com/search?hl=en&safe=off&q=site%3Alkml.org+ephemeral+port+bind&btnG=Search">peripheral comments on the subject</a> on Linux kernel mailing list mention the issues of strange things like double connects (valid in TCP.) I am still not convinced this isn't just an archaic lookup that doesn't consider the local address. This issue is discussed by Fernando Gont (a fellow UTN-er, what a coincidence.) in his <a href="http://www.gont.com.ar/drafts/port-randomization/draft-ietf-tsvwg-port-randomization-02.html#trad_selection">IETF draft</a>. This was made earlier this year (February 2008) I guess for working out the issues with port prediction (like <a href="http://www.doxpara.com/">Dan Kaminsky</a>'s DNS <a href="http://blogs.zdnet.com/security/?p=1460">bug</a>.) Very interesting read.<br /><br /><h3>Extra</h3><br />Here is some code to play with:<br /><pre><span class="Comment">/*</span><br /><span class="Comment"> Copyright (C) 2008 Alejo Sanchez</span><br /><br /><span class="Comment"> This program is free software: you can redistribute it and/or modify</span><br /><span class="Comment"> it under the terms of the GNU Affero General Public License as</span><br /><span class="Comment"> published by the Free Software Foundation, either version 3 of the</span><br /><span class="Comment"> License, or (at your option) any later version.</span><br /><br /><span class="Comment"> This program is distributed in the hope that it will be useful,</span><br /><span class="Comment"> but WITHOUT ANY WARRANTY; without even the implied warranty of</span><br /><span class="Comment"> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the</span><br /><span class="Comment"> GNU Affero General Public License for more details.</span><br /><br /><span class="Comment"> You should have received a copy of the GNU Affero General Public License</span><br /><span class="Comment"> along with this program. If not, see <<a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>>.</span><br /><span class="Comment">*/</span><br /><span class="PreProc">#include </span><span class="Constant"><sys/types.h></span><br /><span class="PreProc">#include </span><span class="Constant"><sys/socket.h></span><br /><span class="PreProc">#include </span><span class="Constant"><sys/resource.h></span><br /><span class="PreProc">#include </span><span class="Constant"><errno.h></span><br /><span class="PreProc">#include </span><span class="Constant"><netdb.h></span><br /><span class="PreProc">#include </span><span class="Constant"><stdio.h></span><br /><span class="PreProc">#include </span><span class="Constant"><string.h></span><br /><span class="PreProc">#include </span><span class="Constant"><stdlib.h></span><br /><span class="PreProc">#include </span><span class="Constant"><unistd.h></span><br /><br /><span class="Type">const</span> <span class="Type">char</span> *addrs[] = { <span class="Constant">"127.0.0.1"</span>, <span class="Constant">"127.0.0.2"</span> };<br /><br /><span class="Type">int</span><br />main (<span class="Type">int</span> argc, <span class="Type">char</span> **argv)<br />{<br /><span class="Type">struct</span> rlimit rl; <span class="Comment">/*</span><span class="Comment"> to bump up system limits for this process </span><span class="Comment">*/</span><br /><span class="Type">int</span> *sockets; <span class="Comment">/*</span><span class="Comment"> ptr to array of sockets </span><span class="Comment">*/</span><br /><span class="Type">int</span> nsockets = <span class="Constant">120000</span>;<br /><span class="Type">int</span> c, i;<br /><br /><span class="Statement">while</span> ((c = getopt(argc, argv, <span class="Constant">"n:"</span>)) != -<span class="Constant">1</span>) {<br /> <span class="Statement">switch</span> (c) {<br /> <span class="Statement">case</span> <span class="Constant">'n'</span>:<br /> nsockets = atoi(optarg);<br /> <span class="Statement">break</span>;<br /> <span class="Statement">default</span>:<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Illegal argument </span><span class="Special">\"</span><span class="Special">%c</span><span class="Special">\"\n</span><span class="Constant">"</span>, c);<br /> exit(<span class="Constant">1</span>);<br /> }<br />}<br /><br />rl.rlim_cur = rl.rlim_max = nsockets + <span class="Constant">10</span>;<br /><span class="Statement">if</span> (setrlimit(RLIMIT_NOFILE, &rl) == -<span class="Constant">1</span>) {<br /> perror(<span class="Constant">"setrlimit"</span>);<br /> exit(<span class="Constant">1</span>);<br />}<br /><br /><span class="Statement">if</span> ((sockets = (<span class="Type">int</span> *) malloc(nsockets * <span class="Constant">2</span> * <span class="Statement">sizeof</span>(<span class="Type">int</span>))) == <span class="Constant">NULL</span>) {<br /> perror(<span class="Constant">"malloc"</span>);<br /> exit(<span class="Constant">1</span>);<br />}<br /><br /><span class="Statement">for</span> (i = <span class="Constant">0</span>; i < nsockets; i++) {<br /><span class="PreProc">#ifdef BIND_ONLY</span><br /> <span class="Type">struct</span> addrinfo *aitop, ai_hints = { .ai_family = AF_INET,<br /> .ai_socktype = SOCK_STREAM, .ai_flags = AI_PASSIVE };<br /> <span class="Type">const</span> <span class="Type">char</span> *addr = addrs[i % (<span class="Statement">sizeof</span>(addrs) / <span class="Statement">sizeof</span> (addrs[<span class="Constant">0</span>]))];<br /> <span class="Type">const</span> <span class="Type">char</span> *portstr = <span class="Constant">"0"</span>;<br /><br /> getaddrinfo(addr, portstr, &ai_hints, &aitop);<br /><br /> sockets[i] = socket(AF_INET, SOCK_STREAM, <span class="Constant">0</span>);<br /><br /> <span class="Statement">if</span> (bind(sockets[i], aitop->ai_addr, aitop->ai_addrlen) == -<span class="Constant">1</span>) {<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Error binding </span><span class="Special">%s</span><span class="Constant">, for </span><span class="Special">%s</span><span class="Constant"> : </span><span class="Special">%d</span><span class="Special">\n</span><span class="Constant">"</span>,<br /> strerror(errno), addr, portstr);<br /> } <span class="Statement">else</span><br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"ok addr: </span><span class="Special">%s</span><span class="Constant">, i: </span><span class="Special">%d</span><span class="Special">\n</span><span class="Constant">"</span>, addr, i);<br /><span class="PreProc">#else</span><br /> <span class="Type">struct</span> addrinfo *aitop, ai_hints = { .ai_family = AF_INET,<br /> .ai_socktype = SOCK_STREAM, .ai_flags = AI_PASSIVE };<br /> <span class="Type">char</span> portstr[<span class="Constant">20</span>];<br /><br /> snprintf(portstr, <span class="Statement">sizeof</span>(portstr), <span class="Constant">"</span><span class="Special">%d</span><span class="Constant">"</span>, <span class="Constant">8080</span> + (i % <span class="Constant">4</span>));<br /> getaddrinfo(<span class="Constant">"127.0.0.1"</span>, portstr, &ai_hints, &aitop); <span class="Comment">/*</span><span class="Comment"> dst </span><span class="Comment">*/</span><br /> sockets[i] = socket(AF_INET, SOCK_STREAM, <span class="Constant">0</span>);<br /><br /> <span class="Statement">if</span> (connect(sockets[i], aitop->ai_addr, aitop->ai_addrlen) == -<span class="Constant">1</span>) {<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Error connecting </span><span class="Special">%s</span><span class="Constant">, for port </span><span class="Special">%d</span><span class="Special">\n</span><span class="Constant">"</span>,<br /> strerror(errno), portstr);<br /> }<br /><span class="PreProc">#endif</span><br /><br /> free(aitop);<br />}<br /><br />printf(<span class="Constant">"</span><span class="Special">%i</span><span class="Constant"> socket pairs created, check memory. Sleeping 10 sec.</span><span class="Special">\n</span><span class="Constant">"</span>, i);<br />sleep(<span class="Constant">10</span>);<br /><br />exit(<span class="Constant">0</span>);<br />}</pre>Alecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.com4tag:blogger.com,1999:blog-7760753019966991240.post-72106568591749705152008-10-23T20:41:00.000-07:002008-11-03T16:48:48.777-08:00A Gazillion-user Comet Server With libevent, Part 0<div style="text-align: center;"><span style="font-size:130%;"><span style="font-weight: bold;">Abstract</span></span><br />After reading an inspiring saga about building a <span style="font-style: italic;">Comet</span> server with <span style="font-style: italic;">Erlang</span> and <a href="http://code.google.com/p/mochiweb/"><span>Mochiweb</span></a>, I <span original="inadverently" haspopup="true" role="menuitem" tabindex="-1" id="3.sc" class="PMpYeb">inadvertently</span> snowballed into making my own full-blown challenger using <span>C</span> and <span style="font-style: italic;">libevent</span>. The results hint an order of magnitude increase of performance compared to state of the art <span style="font-style: italic;">open source</span> servers. <span>Comet</span> is very important for Web 2.0 services, it reduces the amount of requests to the backend by the clients and brings <span>real time updates</span>. This is a description of the many frustrations and achievements doing this project. It will be posted in 4 installments, from 0 to 3.<br /></div><br />Updates: Typos, spell check, thanks Stef/Nico.<br /><h3><span>Introduction</span></h3>A recent post by <span style="font-weight: bold;">Richard Jones</span> (of <a href="http://www.last.fm/user/RJ">last.fm</a> fame) inspired me to start a comet server that scales well reviving old-school skills. In his post <a style="font-weight: bold;" href="http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1/">A Million-user Comet Application With Mochiweb Part 1</a> he presents a (<strike>mockup</strike>) prototype <span style="font-weight: bold;">Erlang</span> HTTP server. The goal of that project is to make a functional <a href="http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1/">Comet</a> server.<br /><blockquote>Gazillion (n.): Informal An indefinitely large number.<br /></blockquote>There is no working prototype on this first introductory instalment, hence the name "Part 0." But plenty of code, don't despair.<br /><br /><h3>The Comet Problem</h3>Since <a href="http://en.wikipedia.org/wiki/Comet_%28programming%29">Comet</a> is a <span>push</span> technology, most possible solutions rely on <span>keeping an HTTP connection open</span> because the server can't connect back to clients. It's a type of subscription model with some hacks on top. Current <a href="http://cometdaily.com/2008/03/14/comet-gazing-maturity/">open source Comet servers</a> can handle 10 to <span>20,000 simultaneous connections</span> on a stock server. Most are written in Java, Python, and Erlang. On the same article the developers of Liberator, a closed source commercial server (C or C-something I guess), claim to be able to sustain up to <span>a million client updates per second</span> for <span>10,000 clients</span>. Their site expands hinting it's running a daemon per core with client-side (Browser/Javascript) doing load balancing. All these figures were reported by those projects own developers. I couldn't find any independent benchmark. But they really sound like a good crowd, so I can take their words for it and you should too.<br /><br />The scalability problem of AJAX, and now Comet, is a major problem for the adoption of web technologies. Imagine the dialog between a Javascript mail application and a server, with the client polling every X seconds:<br /><blockquote>Client: "Is there anything new?"<br />Server: "Not yet..."<br />Client: "Now?"<br />Server: "No..."<br />Client: "Are we there yet"<br />Server: "@#%@$^!" (HTTP 503 Service Unavailable)<br /></blockquote>Comet fixes that but pays the price of open connections. Word on the streets is around 50KB per open connection for Java/Python using careful programming, and don't even think on so many objects to write to the wire. Garbage Collection optimization can become your own private horror story.<br /><br />So after all that introduction, this is my own multipart presentation of a (<strike>mockup</strike>) prototype. It should be an (ANSI/POSIX) <span style="font-weight: bold;">C</span> <span>library and server</span> using the fantastic <a href="http://monkey.org/%7Eprovos/libevent/libevent-benchmark2.jpg">libevent</a> library (hi <a href="http://www.provos.org/">Niels</a>!), and the popular <a href="http://www.libpcre.org/">libpcre</a> regular expression engine library (more on later posts.) The goal is to crash the cool crowd party and show some old-school moves.<br /><br />Among the many observations RJ makes, on his <span>first installment</span> he mentions:<br /><blockquote><b>The resident size of the mochiweb beam process with 10,000 active connections was 450MB - that’s 45KB per connection</b>. CPU utilization on the machine was practically nothing, as expected.</blockquote>(Edit: But on his second post he takes those numbers down to <span style="font-weight: bold;">8KB</span> per user by tuning memory management. That is still about <span style="font-weight: bold;">8 GB</span> for <span style="font-weight: bold;">1M</span> users and <span style="font-style: italic;">without counting system resources</span>!)<br /><br /><span><h3>Scalability: Some Ballpark Math</h3></span><span>To have an idea what to expect we need some ballpark calculations. This crude numbers would affect <span style="font-weight: bold;">any</span> kind of approach because it is the Operating System side. A starting point is finding out what happens with any given program when there are <span style="font-weight: bold;">many sockets connected</span>. With this little program we can see:<br /><pre><span class="Comment">/*</span><br /><span class="Comment"> Copyright (C) 2008 Alejo Sanchez</span><br /><span class="Comment"> (Inspired on bench.c by Niels Provos)</span><br /><br /><span class="Comment"> This program is free software: you can redistribute it and/or modify</span><br /><span class="Comment"> it under the terms of the GNU Affero General Public License as</span><br /><span class="Comment"> published by the Free Software Foundation, either version 3 of the</span><br /><span class="Comment"> License, or (at your option) any later version.</span><br /><br /><span class="Comment"> This program is distributed in the hope that it will be useful,</span><br /><span class="Comment"> but WITHOUT ANY WARRANTY; without even the implied warranty of</span><br /><span class="Comment"> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the</span><br /><span class="Comment"> GNU Affero General Public License for more details.</span><br /><br /><span class="Comment"> You should have received a copy of the GNU Affero General Public License</span><br /><span class="Comment"> along with this program. If not, see <<a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>>.</span><br /><span class="Comment">*/</span><br /><span class="PreProc">#include </span><span class="Constant"><sys/types.h></span><br /><span class="PreProc">#include </span><span class="Constant"><sys/time.h></span><br /><span class="PreProc">#include </span><span class="Constant"><sys/socket.h></span><br /><span class="PreProc">#include </span><span class="Constant"><sys/resource.h></span><br /><span class="PreProc">#include </span><span class="Constant"><stdio.h></span><br /><span class="PreProc">#include </span><span class="Constant"><stdlib.h></span><br /><span class="PreProc">#include </span><span class="Constant"><unistd.h></span><br /><br /><span class="Type">int</span><br />main (<span class="Type">int</span> argc, <span class="Type">char</span> **argv)<br />{<br /> <span class="Type">struct</span> rlimit rl; <span class="Comment">/*</span><span class="Comment"> to bump up system limits for this process </span><span class="Comment">*/</span><br /> <span class="Type">int</span> *pipes; <span class="Comment">/*</span><span class="Comment"> pipe (pairs) memory block </span><span class="Comment">*/</span><br /> <span class="Type">int</span> *cp; <span class="Comment">/*</span><span class="Comment"> traverse pipes </span><span class="Comment">*/</span><br /> <span class="Type">int</span> npipes;<br /> <span class="Type">int</span> i, c;<br /><br /> npipes = <span class="Constant">100000</span>; <span class="Comment">/*</span><span class="Comment"> default </span><span class="Comment">*/</span><br /> <span class="Statement">while</span> ((c = getopt(argc, argv, <span class="Constant">"n:"</span>)) != -<span class="Constant">1</span>) {<br /> <span class="Statement">switch</span> (c) {<br /> <span class="Statement">case</span> <span class="Constant">'n'</span>:<br /> npipes = atoi(optarg);<br /> rl.rlim_cur = rl.rlim_max = npipes * <span class="Constant">2</span> + <span class="Constant">20</span>;<br /> <span class="Statement">break</span>;<br /> <span class="Statement">default</span>:<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Illegal argument </span><span class="Special">\"</span><span class="Special">%c</span><span class="Special">\"\n</span><span class="Constant">"</span>, c);<br /> exit(<span class="Constant">1</span>);<br /> }<br /> }<br /><br /> <span class="Statement">if</span> (setrlimit(RLIMIT_NOFILE, &rl) == -<span class="Constant">1</span>) {<br /> perror(<span class="Constant">"setrlimit"</span>);<br /> exit(<span class="Constant">1</span>);<br /> }<br /><br /> <span class="Statement">if</span> ((pipes = (<span class="Type">int</span> *) malloc(npipes * <span class="Constant">2</span> * <span class="Statement">sizeof</span>(<span class="Type">int</span>))) == <span class="Constant">NULL</span>) {<br /> perror(<span class="Constant">"malloc"</span>);<br /> exit(<span class="Constant">1</span>);<br /> }<br /><br /> <span class="Statement">for</span> (cp = pipes, i = <span class="Constant">0</span>; i < npipes; i++, cp += <span class="Constant">2</span>) {<br /> <span class="Statement">if</span> (socketpair(AF_UNIX, SOCK_STREAM, <span class="Constant">0</span>, cp) == -<span class="Constant">1</span>) {<br /> perror(<span class="Constant">"pipe"</span>);<br /> exit(<span class="Constant">1</span>);<br /> }<br /> }<br /><br /> printf(<span class="Constant">"</span><span class="Special">%i</span><span class="Constant"> socket pairs created, check memory. Sleeping 1 sec.</span><span class="Special">\n</span><span class="Constant">"</span>, i);<br /> sleep(<span class="Constant">1</span>);<br /><br /> exit(<span class="Constant">0</span>);<br />}</pre>A test with <span style="font-weight: bold;">200,000 sockets</span> (note it's 100,000 <span style="font-style: italic;">pairs</span>) showed a <span style="font-weight: bold;">process size</span> of <span style="font-weight: bold;">2MB</span>, so far so good. But the command <span style="font-style: italic;">free</span> showed about <span style="font-weight: bold;">210MB</span> less free memory. It can make you think it is buffers and cache but those numbers didn't move. Repeated tests gave a very similar number and it had correlation with the amount of sockets created. The output of <span style="font-style: italic;">free</span> wasn't useful, same with <span style="font-style: italic;">top</span>. A bit of investigation showed this changes on <span style="font-style: italic;">/proc/meminfo</span>:<br /><pre>MemTotal: 2041864 kB<br /><span class="Special">MemFree: 1007248 kB</span><br />Buffers: 57744 kB<br />Cached: 400772 kB<br />[13 uninteresting lines]<br /><span class="Special">Slab: 257196 kB</span><br />SReclaimable: 136784 kB<br />SUnreclaim: 120412 kB<br /><br />MemTotal: 2041864 kB<br /><span class="Special">MemFree: 1225060 kB</span><br />Buffers: 57744 kB<br />Cached: 400772 kB<br />[13 uninteresting lines]<br /><span class="Special">Slab: 40612 kB</span><br />SReclaimable: 34020 kB<br />SUnreclaim: 6592 kB</pre>The difference is about 217MB, that is around <span style="font-weight: bold;">1KB per connected socket</span>. The Linux kernel takes a large amount of memory for connected sockets, it seems. This memory is initialized and ready for those sockets, but not yet in use. There is a good writeup about the <a href="http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator/">slab allocator</a>.<br /><h3>OS X Crashing, and Linux too</h3>The operating system imposes some limits on the amount of open files. These and other limits can be modified by editing the file <span style="font-style: italic;">/etc/sysctl.conf</span> on both <a href="http://wwwx.cs.unc.edu/%7Esparkst/howto/network_tuning.php">Linux</a> and <a href="http://developer.apple.com/documentation/Darwin/Reference/ManPages/man8/sysctl.8.html">OS X</a>. The most important for our tests is fs.file-max (kern.maxfiles in OS X) as it controls the global maximum of open files (sockets included.) In Linux there is a per user limit to set in <span style="font-style: italic;">/etc/security/limits.conf</span>:<br /><pre><span class="Comment"># /etc/security/limits.conf</span><br /><span class="Comment">#</span><br /><span class="Comment">#Each line describes a limit for a user in the form:</span><br /><span class="Comment">#</span><br /><span class="Comment">#domain type item value</span><br />alecco hard nofile 1001000<br /><br /><span class="Comment"># End of file</span></pre>To reload the configuration run <span style="font-style: italic;">sysctl -p</span> changes. and re-login for limits.conf Then either the program has to increase the soft limits with <span style="font-style: italic;">setrlimit</span>, or running <span style="font-style: italic;">ulimit -n unlimited</span> on the shell before invoking the program.<br /><br />Trying to make a <span style="font-weight: bold;">million connected pipes both OS X and Linux crash-freezed</span>. I'm probably missing something here as RC claims he got his prototype to do it (according to post title.) The maximum my Linux could handle was 400,000 and by those numbers some programs start to get killed.<br /><br />I couldn't find configuration for the behaviour of the slab allocator. There should be a way to prevent it from eating so much memory, IMHO. But still there isn't clear evidence it is related to the crashes. Anyway, this is a fixable <span style="font-weight: bold;">environment limit</span>, the code clearly can scale as it never gets over 25MB. When RC gets to explain a bit more perhaps this will just be a non-issue.<br /><br /></span><span>With libevent and Linux the scalability of the building blocks should be <a href="http://en.wikipedia.org/wiki/Big_O_notation#Orders_of_common_functions">O(log n)</a> as they <a href="http://monkey.org/%7Eprovos/libevent/libevent-benchmark2.jpg">show</a>.</span><span> To get to a more realistic number a test with libevent's <span style="font-weight: bold;">HTTP support</span> was needed. In about an hour I wrote a simple 137 line server. To attack it what better than Apache's <a href="http://httpd.apache.org/docs/2.0/programs/ab.html">ab</a>. Resources now jumped to a maximum of <span style="font-weight: bold;">21MB resident</span> (25MB virtual) for 200,000 working connections, but once again the OS was showing ~<span style="font-weight: bold;">450MB</span> extra memory used (400,000 connected sockets as ab was running local.) But, <span>lo an behold</span>, the thingie was starting to take shape.<br /><ul><li>For 10,000 parallel clients hammering the server could answer at <span style="font-weight: bold;">44,000 requests per second</span> (12,000 for OS X.)</li><li>For 10,000 parallel clients with a <span style="font-style: italic;">reconnect</span> per request it was still high at 18,000 requests per second!<br /></li></ul>Not bad for a <span>notebook</span>, huh? <span style="font-weight: bold;">On a single CPU core</span>! The numbers were practically the same for repeated tests. Furthermore, libevent HTTP code does memory allocation all over the place and behaved much better than I expected (my hopes were for something around 10,000 req./sec.) Here's the code:<br /><pre><br /><span class="Comment">/*</span><br /><span class="Comment"> Comet-c, a high performance Comet server.</span><br /><span class="Comment"> Copyright (C) 2008 Alejo Sanchez</span><br /><br /><span class="Comment"> This program is free software: you can redistribute it and/or modify</span><br /><span class="Comment"> it under the terms of the GNU Affero General Public License as</span><br /><span class="Comment"> published by the Free Software Foundation, either version 3 of the</span><br /><span class="Comment"> License, or (at your option) any later version.</span><br /><br /><span class="Comment"> This program is distributed in the hope that it will be useful,</span><br /><span class="Comment"> but WITHOUT ANY WARRANTY; without even the implied warranty of</span><br /><span class="Comment"> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the</span><br /><span class="Comment"> GNU Affero General Public License for more details.</span><br /><br /><span class="Comment"> You should have received a copy of the GNU Affero General Public License</span><br /><span class="Comment"> along with this program. If not, see <<a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>>.</span><br /><span class="Comment">*/</span><br /><br /><span class="PreProc">#include </span><span class="Constant"><sys/types.h></span><br /><br /><span class="PreProc">#include </span><span class="Constant"><stdio.h></span><br /><span class="PreProc">#include </span><span class="Constant"><stdlib.h></span><br /><span class="PreProc">#include </span><span class="Constant"><unistd.h></span><br /><br /><span class="PreProc">#include </span><span class="Constant"><event.h></span><br /><span class="PreProc">#include </span><span class="Constant"><evhttp.h></span><br /><br /><span class="Type">int</span> debug = <span class="Constant">0</span>;<br /><br /><span class="Type">void</span><br />generic_request_handler(<span class="Type">struct</span> evhttp_request *req, <span class="Type">void</span> *arg)<br />{<br /><span class="Type">struct</span> evbuffer *evb = evbuffer_new();<br /><br /><span class="Comment">/*</span><br /><span class="Comment"> </span><span class="Todo">XXX</span><span class="Comment"> add here code for managing non-subscription requests</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br /><span class="Statement">if</span> (debug)<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Request for </span><span class="Special">%s</span><span class="Constant"> from </span><span class="Special">%s</span><span class="Special">\n</span><span class="Constant">"</span>, req->uri, req->remote_host);<br /><br />evbuffer_add_printf(evb, <span class="Constant">"blah blah"</span>);<br />evhttp_send_reply(req, HTTP_OK, <span class="Constant">"Hello"</span>, evb);<br />evbuffer_free(evb);<br /><span class="Statement">return</span>;<br />}<br /><br /><span class="Comment">/*</span><br /><span class="Comment"> * Bayeux /meta handler</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br /><br /><span class="Type">void</span><br />bayeux_meta_handler_cb(<span class="Type">struct</span> evhttp_request *req, <span class="Type">void</span> *arg)<br />{<br /><span class="Type">struct</span> evbuffer *evb = evbuffer_new();<br /><br /><span class="Statement">if</span> (debug)<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Request for </span><span class="Special">%s</span><span class="Constant"> from </span><span class="Special">%s</span><span class="Special">\n</span><span class="Constant">"</span>, req->uri, req->remote_host);<br /><span class="Comment">/*</span><br /><span class="Comment"> </span><span class="Todo">XXX</span><span class="Comment"> add here code for managing non-subscription requests</span><br /><span class="Comment"> </span><span class="Comment">*/</span><br />evbuffer_add_printf(evb, <span class="Constant">"blah blah"</span>);<br />evhttp_send_reply(req, HTTP_OK, <span class="Constant">"Hello"</span>, evb);<br />evbuffer_free(evb);<br /><span class="Statement">return</span>;<br />}<br /><br /><span class="Type">void</span><br />usage(<span class="Type">const</span> <span class="Type">char</span> *progname)<br />{<br />fprintf(<span class="Constant">stderr</span>,<br /> <span class="Constant">"</span><span class="Special">%s</span><span class="Constant">: [-B] [-d] [-p port] [-l addr]</span><span class="Special">\n</span><span class="Constant">"</span><br /> <span class="Constant">"</span><span class="Special">\t</span><span class="Constant"> -B enable Bayeux support (on)</span><span class="Special">\n</span><span class="Constant">"</span><br /> <span class="Constant">"</span><span class="Special">\t</span><span class="Constant"> -d enable debug (off)</span><span class="Special">\n</span><span class="Constant">"</span><br /> <span class="Constant">"</span><span class="Special">\t</span><span class="Constant"> -l local address to bind comet server on (127.0.0.1)</span><span class="Special">\n</span><span class="Constant">"</span><br /> <span class="Constant">"</span><span class="Special">\t</span><span class="Constant"> -p port port number to create comet server on (8080)</span><span class="Special">\n</span><span class="Constant">"</span><br /> <span class="Constant">"</span><span class="Special">\t</span><span class="Constant"> (C) Alejo Sanchez - AGPL)</span><span class="Special">\n</span><span class="Constant">"</span>,<br /> progname);<br />}<br /><br /><span class="Type">int</span><br />main(<span class="Type">int</span> argc, <span class="Type">char</span> **argv)<br />{<br /><span class="Type">extern</span> <span class="Type">char</span> *optarg;<br /><span class="Type">extern</span> <span class="Type">int</span> optind;<br /><span class="Type">short</span> http_port = <span class="Constant">8080</span>;<br /><span class="Type">char</span> *http_addr = <span class="Constant">"127.0.0.1"</span>;<br /><span class="Type">struct</span> evhttp *http_server = <span class="Constant">NULL</span>;<br /><span class="Type">int</span> c;<br /><span class="Type">int</span> bayeux = <span class="Constant">1</span>;<br /><br /><span class="Statement">while</span> ((c = getopt(argc, argv, <span class="Constant">"Bd:p:l:"</span>)) != -<span class="Constant">1</span>)<br /> <span class="Statement">switch</span>(c) {<br /> <span class="Statement">case</span> <span class="Constant">'B'</span>:<br /> bayeux++;<br /> <span class="Statement">break</span>;<br /> <span class="Statement">case</span> <span class="Constant">'d'</span>:<br /> debug++;<br /> <span class="Statement">break</span>;<br /> <span class="Statement">case</span> <span class="Constant">'p'</span>:<br /> http_port = atoi(optarg);<br /> <span class="Statement">if</span> (http_port == <span class="Constant">0</span>) {<br /> usage(argv[<span class="Constant">0</span>]);<br /> exit(<span class="Constant">1</span>);<br /> }<br /> <span class="Statement">break</span>;<br /> <span class="Statement">case</span> <span class="Constant">'l'</span>:<br /> http_addr = optarg;<br /> <span class="Statement">break</span>;<br /> <span class="Statement">default</span>:<br /> usage(argv[<span class="Constant">0</span>]);<br /> exit(<span class="Constant">1</span>);<br /> }<br />argc -= optind;<br />argv += optind;<br /><br /><span class="Comment">/*</span><span class="Comment"> init libevent </span><span class="Comment">*/</span><br />event_init();<br /><br />http_server = evhttp_start(http_addr, http_port);<br /><span class="Statement">if</span> (http_server == <span class="Constant">NULL</span>) {<br /> fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Error starting comet server on port </span><span class="Special">%d</span><span class="Special">\n</span><span class="Constant">"</span>,<br /> http_port);<br /> exit(<span class="Constant">1</span>);<br />}<br /><br /><span class="Comment">/*</span><span class="Comment"> </span><span class="Todo">XXX</span><span class="Comment"> bayeux /meta handler </span><span class="Comment">*/</span><br /><span class="Statement">if</span> (bayeux)<br /> evhttp_set_cb(http_server, <span class="Constant">"/meta"</span>, bayeux_meta_handler_cb, <span class="Constant">NULL</span>);<br /><br /><span class="Comment">/*</span><span class="Comment"> </span><span class="Todo">XXX</span><span class="Comment"> default handler </span><span class="Comment">*/</span><br />evhttp_set_gencb(http_server, generic_request_handler, <span class="Constant">NULL</span>);<br /><br />fprintf(<span class="Constant">stderr</span>, <span class="Constant">"Comet server started on port </span><span class="Special">%d</span><span class="Special">\n</span><span class="Constant">"</span>, http_port);<br />event_dispatch(); <span class="Comment">/*</span><span class="Comment"> Brooom, brooom </span><span class="Comment">*/</span><br /><br />exit(<span class="Constant">0</span>); <span class="Comment">/*</span><span class="Comment"> UNREACHED ? </span><span class="Comment">*/</span><br />}<br /></pre><br />A Comet version of the server would surely improve on those amounts as the client doesn't need to pull from the server, each request is mostly server-side writes.<br /><br />So that was a nice <span>mockup</span> but it's not a working prototype, yet. A prototype would manage registrations of clients to channels, perhaps using a standard transport protocol, and doing a little bit of this and that.<br /><br /><a style="font-weight: bold;" href="http://cometdaily.com/maturity.html">A report on the state of the art of Comet servers</a> shows the most popular transport is</span><span> <a href="http://svn.xantus.org/shortbus/trunk/bayeux/bayeux.html"><span style="font-weight: bold;">Bayeux</span></a></span><span>. So this prototype can't skip that.<br /><br />So I should just plug in one of the JSON C parsers and it should be OK, right? <span>Wrong again</span>. Just like <a href="http://www.guardian.co.uk/theguardian/2007/feb/03/weekend7.weekend5">Richard Dawkins described</a> this situation:<br /><blockquote>[So, programming was] a classic addiction: prolonged frustration, occasionally rewarded by a briefly glowing fix of achievement. It was that pernicious "just one more push to see what's over the next mountain and then I'll call it a day" syndrome. It was a lonely vice, interfering with sleeping, eating, useful work and healthy human intercourse. I'm glad it's over and I won't start up again. Except ... perhaps one day, just a little ...</blockquote>Let's just say those JSON implementations didn't live up to my expectations. But, <span style="font-weight: bold;">what a time waster</span>...<br /><br />So, if we got this far, let's make a little Bayeux parser! How bad could it be?<br /><br />Coming up: First <span style="font-weight: bold;">prototype</span>, trying to do decent <span style="font-weight: bold;">parsing</span> in C without killing performance (right), more analysis on the original saga by Richard Jones, and I hope, please, some, sleep... But, well, I'm typing and it's just a matter of alt-tab, so perhaps, let's see just a little bit more... Just the one...<br /><br /><span style="font-weight: bold;">Alecco<span style="font-weight: bold;"><span style="font-weight: bold;"><br /><br /></span></span></span><span>PS: Sorry about posting licenses, it is for the lack of warranty part mostly.</span><span style="font-weight: bold;"><span style="font-weight: bold;"><span style="font-weight: bold;"><br /></span></span></span></span>Alecco Loccohttp://www.blogger.com/profile/02688489614138537585noreply@blogger.com13