Extracted from here:
Wednesday, 22 February 2012
Scalability Rules of Thumb
Extracted from here:
Sunday, 25 September 2011
How to protect passwords with Python and Scrypt
First, install py-scrypt, you need mercurial client for this method:
$ hg clone http://bitbucket.org/mhallin/py-scrypt $ cd py-scrypt $ python setup.py build $ sudo python setup.py install ### root user
Alternative method, download the tarball from py-scrypt project page downloads. Then extract and run the build and install steps above. An example showing how to use the module:
import random, scrypt def randstr(length): return ''.join(chr(random.randint(0,255)) for i in range(length)) def hash_password(password, maxtime=0.5, datalength=64): return scrypt.encrypt(randstr(datalength), password, maxtime=maxtime) def verify_password(hashed_password, guessed_password, maxtime=0.5): try: scrypt.decrypt(hashed_password, guessed_password, maxtime) return True except scrypt.error: return False if __name__ == '__main__': user_pw = 'theansweris42' user_salt = randstr(2) # 2+ bytes unique, save it too pw_salt = user_pw + user_salt hashed_pw = hash_password(pw_salt) # To be stored len()==192 print verify_password(hashed_pw, pw_salt) # True print verify_password(hashed_pw, 'guessing'+ pw_salt) # False
The default settings will use about 0.5s to generate each password. Migrating a large password database could take hours. The same random string could be used for all passwords but the speed improvement would be negligible.
The length of the random string could be shorter, making the password entry only 130 bytes long. But perhaps this could affect the security of the model.
More information:
The Scrypt Derivation functionEnough With The Rainbow Tables: What You Need To Know About Secure Password Schemes.
Edit: made the code a bit more clear and added last paragraph.
Monday, 20 December 2010
The Inevitable Revolution
A couple of years ago, Donald Knuth said in an interview[1]:
Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading.
Most algorithms were designed with a classic approach of doing one instruction at a time. This is called scalar processing. Algorithms are then implemented on a programming language and finally run compiled or interpreted. There are important issues on these transformations of the code, but that's out of the scope of this post.
The number of transistors inside processors has been roughly doubling each year[2]. Due to the nature of the transistors electricity gets transformed to heat. A lot of heat in a very small space packed with millions of transistors. So CPU manufacturers have reached a hard limit of CPU power heat dissipation[3]. The only way out of this dead end was to increase the number of CPU cores, stopping the Ghz race.
Another big issue with current computers is latency. While processors can run multiple instructions per nanosecond (ns), accessing memory takes about 100 ns. To work around this problem manufacturers place memory caches running at speeds much closer to CPU execution times. But this caches are very expensive and only processors aimed at the high-end server market have big configurations.
If programs were executed one instruction at a time most of the CPU would be idle waiting for a particular unit to finish its current task. Very often instructions in a given block of code can be run in parallel because they are not dependent (e.g. initializing unrelated variables.) The CPU performs this optimizations by having a deep pipeline of instructions currently executing (resembling assembly lines in factories.) To manage this pipeline the circuitry has to determine the instruction dependencies and sometimes reorder the instructions to improve throughput. Good compilers and VMs optimize compiled programs by ordering the instructions to match the characteristics of different CPU models.
When the flow of control of a program reaches a conditional branch (if-then-else) the processor evaluates the condition (e.g. variable a is zero) and make a jump (or not) to the following instruction. This evaluation usually takes long time and disrupts superscalar pipelining. To overcome this the processor has a dedicated unit called branch predictor[4] to evaluate conditional branches by remembering what branch was taken before. If the predictor is successful the flow is keeps running fast uninterrupted. But when there's a branch misprediction the wrongly picked instructions in the pipeline must be undone. This causes a pipeline flush often costing 5ns. For algorithms with many conditional branches with a probability close to .5 this can multiply into major bottleneck (e.g. walking binary trees or classic compression.)
The premises for making algorithms in the literature are completely out of touch with all these issues. One of the old maxims was to avoid full scans at all costs, but a sequential memory scan nowadays goes at 6 to 14 GiB per second and the limits are mostly the data bus bandwidth. Random access traversal in memory often is several orders of magnitude slower due to the latency issues compounded with branch mispredictions. The data structures often used don't scale due to fragmentation. In many cases are sequential in nature so parallel execution is either impossible or requires very slow locking mechanisms for critical areas making the code error-prone and filled with hard to predict bottlenecks.
To use larger amounts of main memory the processors switched to big data addresses. Every memory reference in a 64bit address space requires 8 bytes. In very common scenarios where linked structures are widely used for data storage in memory these huge pointers become a significant overhead. For example, each node of a binary tree will require at least 16 bytes and often 8 more to link to parent (as in popular red-black tree implementations.) If the node only stores 32bit integers the overhead is 400% to 600%!
In many cases the same code is applied to many elements of a data structure iteratively. With current processors it is possible to run the same instructions to packs of data and it's called SIMD (Single Instruction on Multiple Data.) This is widely used for media processing but it is becoming common for general purpose data processing. Most SIMD implementations are evolving to give better support for non-multimedia uses (e.g. Intel's SIMD string instructions.) Some of the most interesting new algorithms of the last decade were redisigns to exploit SIMD processing. Programming directly to one of this SIMD implementations can be quite tricky but it is possible to simplify this with compiler intrinsics or re-targetting compiler tools (like MIT's Cilk.)
There are many interesting instructions available on processors performing very useful tasks (e.g. bit scan) completely neglected by programming textbooks. The savings in time and complexity for algorithms can be very significant.
I differ with proffessor D. Knuth and think this is a great opportunity and all this limitations actually make these times very interesting for computer scientists. It's time for a revolution! Almost everything has to be revised to match the new hardware paradigm. For this there's a need for an army of Knuths to find the way. We need to shake the CS establishment out of their comfortable leather chairs.
[Note: I do not mean Prof. D. Knuth, in fact he is one of the very few CS masters who ties high level theory with low level realistic implementations.]The New Algorithm Design Maxims
- Find alternatives to sequential processing
- Minimize memory use and avoid bloat (Latency)
- Store related data as close as possible (Caching)
- Minimize branch mispredictions or remove branches altogether
- Favor vector/matrix data structures over linked nodes (Pointers, Caching)
- Exploit vector processing if possible (SIMD)
- Embrace specialized instructions widely available
A very common approach to speed up processing is to perform space-time trade-offs. For example changing programmatic code to big table lookups. While this works in the small micro-benchmarks it is a short-sighted trick that usually makes scaling very hard. CPU cores can perform billions of instructions per second if used wisely and the new trend for big data processing is to do the opposite, compute-space trade-off. In particular with fast compression algorithms.
Good times.
[1] InformIT: Interview with Donald Knuth (2008)
[2] Wikipedia: Moore's Law
[3] Wikipedia: CPU power dissipation
[4] Wikipedia: Branch Predictor
Tuesday, 25 August 2009
SQLite: A Lesson In Low-Defect Software
While looking for some old papers and presentations on SQLite (remember my disk died a few months ago without proper backup), even though it seems the old ones are offline or lost, there's a new presentation on their experience producing excellent software.
It seems this presentation wasn't picked up properly so here's a shortened review with more detail on statement and branch testing with GCC. This is mind-blowing stuff, at least for me. Almost everything in this post comes from the paper. I'm trying to stay as close as possible to the original with only a few things added to make it a bit easier to understand at first sight and allow straight copy-paste to play with. I also skipped the less extraordinary SCM recommendations at the end.
- Use comments on your source code to engage the other half of your brain
- One side does "math" and the other "language".
- Safety != Reliability. Safety: no harm; reliability: no failures.
- Safe Software = Extremely Reliable Software.
- What Programming Languages Does The World's Most Reliable Software Use? Avionics: ADA, C; Space Shuttle: HAL/S. Not Python, TCL/TK, .NET, LUA, Java (safe but not reliable.)
- DO-178B and ED-12B "Software Considerations in Airborne Systems and Equipment Certification": Development process matters, not the programming language. Captures best practices.
- [Remember from An Introduction to SQLite: What makes SQLite great is the extensive testing (60%).]
- Use your whole brain.
- Full coverage testing.
- Good configuration management.
- Don't just fix bugs, fix your process.
- Comment: each function or procedure and major code blocks.
- Comment: all variables and constant declarations.
- SQLite has a code to comment ratio of 2:1.
- Full coverage testing:
- Automated tests that exercise all features of the program: all entry points, subroutines, all branches and conditions, all cases, all boundary values.
- DO-178B places special emphasis on testing.
- Statement Coverage: Tests cause every line of code to run at least once.
- Branch Coverage: Tests cause every machine-language branch operation
to evaluate to both TRUE and FALSE at least once. - (See section below on how to do test coverage with GCC.)
- Fly what you test! (Measuring coverage validates your tests, not your product.)
- Defensive Programming (See section below.)
Testing in SQLite
- 99% Statement Coverage
- 95% Branch Coverage
- Goal: 100% branch coverage by Dec 2009
- Striving for 100% test coverage has been [their] most effective method for finding bugs.
- Testing in C/TCL (1M), C (2.3M), SQL logic (5.8M)
- Crash testing.
- I/O error and out-of-memory testing.
- Fuzz testing.
- Valgrind (memory: usage, profiling, leak tracking.)
- Most bugs are found internally – before release.
- External bugs are mostly build problems.
- [SQLite team] doesn't ship “alpha” or “beta” releases, all SQLite releases are production-ready.
- It is rare for users to find "wrong answers."
Test Coverage with GCC
Consider this snippet:
1 int exampleFunction(int a, int b){ 2 3 int ans = 0; 4 5 if ( a>b && a<2*b ){ 6 ans = a; 7 } else { 8 ans = b; 9 } 10 11 return ans; 12 }There are 6 statements in lines 1, 3, 5, 6, 8, 11. To test this function you could run this test:
exampleFunction(1,1);
That would cover statements in lines 1, 3, 5, 8, 11, but not line 6 (because !(a>b)). The following test addresses this statement:
exampleFunction(3,2);
Now that would get all statements covered (line 6, ans = a.) But what about all possible branches? There are 4 possible branches:
- a>b: Taken on test 2.
- !(a>b): Taken on test 1.
- a<2*b: Taken on test 2.
- !(a<2*b): Never taken.
exampleFunction(4,2);VoilĂ ! All statements and branches tested.
It seems gcc has some great features to help. With -fprofile-arcs and -ftest-coverage it will generate branch coverage data, from gcc Debugging Options manual:
-fprofile-arcsSo let's make this test with testme.c with only the first test:
Add code so that program flow arcs are instrumented. During execution the program records how many times each branch and call is executed and how many times it is taken or returns. When the compiled program exits it saves this data to a file called aux-name.gcda for each source file. The data may be used for profile-directed optimizations (-fbranch-probabilities), or for test coverage analysis (-ftest-coverage).
[...]
-ftest-coverage
Produce a notes file that the gcov code-coverage utility can use to show program coverage. Each source file's note file is called aux-name.gcno. Refer to the -fprofile-arcs option above for a description of auxname and instructions on how to generate test coverage data. Coverage data will match the source files more closely, if you do not optimize.
int exampleFunction(int a, int b){ int ans = 0; if ( a>b && a<2*b ){ ans = a; } else { ans = b; } return ans; } int main() { exampleFunction(1,1); return 0; }Compiling and running:
$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \ && gcov -c testme.c && cat testme.c.gcov File 'testme.c' Lines executed:88.89% of 9 testme.c:creating 'testme.c.gcov' -: 0:Source:testme.c -: 0:Graph:testme.gcno -: 0:Data:testme.gcda -: 0:Runs:1 -: 0:Programs:1 1: 1:int exampleFunction(int a, int b){ -: 2: 1: 3: int ans = 0; -: 4: 1: 5: if ( a>b && a<2*b ){ #####: 6: ans = a; -: 7: } else { 1: 8: ans = b; -: 9: } -: 10: 1: 11: return ans; -: 12:} -: 13: 1: 14:int main() { 1: 15: exampleFunction(1,1); 1: 16: return 0; -: 17:}The first column on the report shows how many times the source line was evaluated. The second column is the source line number. Note the ##### mark on the missed statement. Now if we add the second test:
$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \ && gcov -c testme.c && cat testme.c.gcov File 'testme.c' Lines executed:100.00% of 10 testme.c:creating 'testme.c.gcov' -: 0:Source:testme.c -: 0:Graph:testme.gcno -: 0:Data:testme.gcda -: 0:Runs:1 -: 0:Programs:1 2: 1:int exampleFunction(int a, int b){ -: 2: 2: 3: int ans = 0; -: 4: 3: 5: if ( a>b && a<2*b ){ 1: 6: ans = a; -: 7: } else { 1: 8: ans = b; -: 9: } -: 10: 2: 11: return ans; -: 12:} -: 13: 1: 14:int main() { 1: 15: exampleFunction(1,1); 1: 16: exampleFunction(3,2); 1: 17: return 0; -: 18:}It now shows the statement in line 6 was taken.
Let's do the branch test with only the first test. Note the -b flag to gcov is the only command line change:
$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \ && gcov -b -c testme.c && cat testme.c.gcov File 'testme.c' Lines executed:88.89% of 9 Branches executed:50.00% of 4 Taken at least once:25.00% of 4 Calls executed:100.00% of 1 testme.c:creating 'testme.c.gcov' -: 0:Source:testme.c -: 0:Graph:testme.gcno -: 0:Data:testme.gcda -: 0:Runs:1 -: 0:Programs:1 function exampleFunction called 1 returned 100% blocks executed 60% 1: 1:int exampleFunction(int a, int b){ -: 2: 1: 3: int ans = 0; -: 4: 1: 5: if ( a>b && a<2*b ){ branch 0 taken 0 (fallthrough) branch 1 taken 1 branch 2 never executed branch 3 never executed #####: 6: ans = a; -: 7: } else { 1: 8: ans = b; -: 9: } -: 10: 1: 11: return ans; -: 12:} -: 13: function main called 1 returned 100% blocks executed 100% 1: 14:int main() { 1: 15: exampleFunction(1,1); call 0 returned 1 1: 16: return 0; -: 17:}The report says the branches taken at least once are 25%. Further in the report, after line 5: it lists the 4 branches from that line. Branch 0 (a>b) was evaluated but not taken (because a=b.) Branch 1 (!(a>b)) was taken. Since branches 2 and 3 depend on branch 0 those weren't even executed. Now let's run with the second test:
$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \ && gcov -b -c testme.c && cat testme.c.gcov File 'testme.c' Lines executed:100.00% of 10 Branches executed:100.00% of 4 Taken at least once:75.00% of 4 Calls executed:100.00% of 2 testme.c:creating 'testme.c.gcov' -: 0:Source:testme.c -: 0:Graph:testme.gcno -: 0:Data:testme.gcda -: 0:Runs:1 -: 0:Programs:1 function exampleFunction called 2 returned 100% blocks executed 100% 2: 1:int exampleFunction(int a, int b){ -: 2: 2: 3: int ans = 0; -: 4: 3: 5: if ( a>b && a<2*b ){ branch 0 taken 1 (fallthrough) branch 1 taken 1 branch 2 taken 1 (fallthrough) branch 3 taken 0 1: 6: ans = a; -: 7: } else { 1: 8: ans = b; -: 9: } -: 10: 2: 11: return ans; -: 12:} -: 13: function main called 1 returned 100% blocks executed 100% 1: 14:int main() { 1: 15: exampleFunction(1,1); call 0 returned 1 1: 16: exampleFunction(3,2); call 0 returned 1 1: 17: return 0;Here we see the only remaining branch not taken is 3 (but it was executed). Let's run with the third test:
$ gcc -g -fprofile-arcs -ftest-coverage testme.c -o testme && ./testme \ && gcov -b -c testme.c && cat testme.c.gcov File 'testme.c' Lines executed:100.00% of 11 Branches executed:100.00% of 4 Taken at least once:100.00% of 4 Calls executed:100.00% of 3 testme.c:creating 'testme.c.gcov' -: 0:Source:testme.c -: 0:Graph:testme.gcno -: 0:Data:testme.gcda -: 0:Runs:1 -: 0:Programs:1 function exampleFunction called 3 returned 100% blocks executed 100% 3: 1:int exampleFunction(int a, int b){ -: 2: 3: 3: int ans = 0; -: 4: 4: 5: if ( a>b && a<2*b ){ branch 0 taken 2 (fallthrough) branch 1 taken 1 branch 2 taken 1 (fallthrough) branch 3 taken 1 1: 6: ans = a; -: 7: } else { 2: 8: ans = b; -: 9: } -: 10: 3: 11: return ans; -: 12:} -: 13: function main called 1 returned 100% blocks executed 100% 1: 14:int main() { 1: 15: exampleFunction(1,1); call 0 returned 1 1: 16: exampleFunction(3,2); call 0 returned 1 1: 17: exampleFunction(4,2); call 0 returned 1 1: 18: return 0;Great, 100% coverage now!
Defensive Programming
Input variable boundary checking would create many branches never taken on tests. Consider for example checking the value of input nBytes doesn't get over a limit 0x7fffff00:
void *sqlite3InternalMalloc(int nBytes){ if( nBytes<=0 || nBytes>=0x7fffff00 ){ return 0; }else{ return sqlite3LowLevelMalloc(nBytes); } }That branch can't be tested. SQLite has some interesting macros NEVER() and ALWAYS():
#if defined(SQLITE_COVERAGE_TEST) # define ALWAYS(X) 1 # define NEVER(X) 0 #elif defined(SQLITE_DEBUG) # define ALWAYS(X) ((X)?1:sqlite3Panic()) # define NEVER(X) ((X)?sqlite3Panic():0) #else # define ALWAYS(X) (X) // PASS THROUGH (What you fly) # define NEVER(X) (X) // PASS THROUGH (What you fly) #endifApplied to the bounds check above:
void *sqlite3InternalMalloc(int nBytes){ if( nBytes<=0 || NEVER(nBytes>=0x7fffff00) ){ return 0; }else{ return sqlite3LowLevelMalloc(nBytes); } }
Footnote
AFAIK, testing can't get any better than this. Somebody could do a machine-checked proof, like the
Monday, 8 June 2009
APE, an AJAX push engine
Tuesday, 10 February 2009
Making a free hostel reservation system - pt 1
I know many will tell me about how I should make money out of this, but I don't think in this present system of chaos, abuse, and misinformation anybody can have a decent working business model. I'll be happy to just show how disposable they really are to both hostels and travelers.
Some goals for hostel owners:
- To own the data and be able to pull out as they please, no lock-in.
- Simple way to have their own data layout to play with, thus avoiding multiple reservation tracking systems.
- Zero cost.
- Minimum possible technical knowledge requirements.
- Have a fast availability system either global or on the hostel's page.
- No credit card required.
- Easy reservation form request with confirmation by email/phone confirmation (most hostels have this already in place.)
- Google Docs (it shouldn't be that hard to switch to alternatives and it's a sort of neutral brand for everybody.) Spreadsheet to manage and export the reservations, Forms to receive the reservation requests, Gadget with the core logic and easy to place inside even static pages hostels have, Sites for the central page.
So it's one spreadsheet with two worksheets and a form, per hostel.
On the legal side this site could be set with these:
- Use only Creative Commons non-commercial share-alike license for the all published data (the shared bit of the spreadsheets, not all the data!)
- Use GPLv3 for the JavaScript code (a bit redundant but just to make it clear, there is still heavy server-side JavaScript on the corporate world, you wouldn't believe what I've seen.)
Missing but possibly easy to implement features:
- Internationalization and localization (existing booking systems suck at this, also.)
- A sane way to make a guarantee to replace deposits (e.g. ask for a $1 paypal/amazon donation to a range of charities like EFF, CC, or WaterAid.)
- Hostel media hosting or display (i.e. pictures, videos, descriptions.)
- A fair review system (well, not that easy due to the abundance of trolls.)
It can be placed on any static page with just this iframe:
Monday, 29 December 2008
And The Dog Ate My Homework
Luckily the outlines of upcoming posts were stored on Blogger, and I remember most of the code's ideas. I'll try to move on from this.
Besides that hiccup things are doing very well and even got a suntan.
Backup now to a pendrive, to some DVDs, to an external HD, to Gmail, everywhere. Use TrueCrypt, but wisely, get yourself some subtle warning if you do the hidden volume trick! (And at least write-protect the drive.)
