DynaSoar writes "NASA is soliciting ideas from the public on how best to catalog and digitize the collected notes of Wernher von Braun. 'We're looking for creative ways to get it out to the public,' said project manager Jason Crusan. 'We don't always do the best with putting out large sets of data like this.' The PDF notes are those of rocket scientist Wernher von Braun, the first director of NASA's Marshall Space Flight Center in Huntsville, Alabama, and are typed with copious handwritten notes in the margins. According to the official request for information, NASA needs ideas on what format to use (PDF), how to index the notes, and how to create a useful database. The unique nature and historical value of the data, literally discovered in boxes six months ago, is what motivated NASA to ask the public for ideas."
No. There is no such thing as an open source format. Open source is a term that can only apply to an implementation of a standard, not to the standard itself. Things like xpdf/Poppler are open source implementations of the PDF standard. The term 'open standard' applies to formats but is badly defined. The common definitions of an open format are:
1. Can be licensed under nondiscriminatory conditions (e.g. MPEG formats).
2. Freely available specification, can be implemented by anyone (e.g. PDF).
3. Future versions of the standard controlled by a standards committee (e.g. HTML).
PDF, since its creation, has been an open standard according to definition 2. Some people don't like it because it doesn't meet definition 3 (Adobe are the only ones who can create new versions of the PDF spec).
by Anonymous Coward
on Monday June 29 2009, @09:22AM (#28513719)
Gather round while I sing you of Wernher von Braun A man whose allegiance is ruled by expedience Call him a Nazi, he won't even frown "Ha, Nazi schmazi," says Wernher von Braun
Don't say that he's hypocritical Say rather that he's apolitical "Once the rockets are up, who cares where they come down That's not my department," says Wernher von Braun
Some have harsh words for this man of renown But some think our attitude should be one of gratitude Like the widows and cripples in old London town Who owe their large pensions to Wernher von Braun
You too may be a big hero Once you've learned to count backwards to zero "In German oder English I know how to count down Und I'm learning Chinese," says Wernher von Braun
Wernher von Braun's autobiography was titled "I Aim For The Stars." Mort Sahl suggested a subtitle, to make it "I Aim For The Stars (But Sometimes I Hit London)"
Well, considering they host over 6,000 pdfs [google.com] and the RFI is in PDF with the title of the document being "Microsoft Word - WvB RFI 6-24-09.doc" by Jason Crusan who used Acrobat Distiller 7.0.5(Windows), I think we know what everyone uses at NASA. Fine. I'm not going to bitch about that. Instead I'm going to point out that if you're already dependent on Adobe Acrobat Reader & Microsoft Word being around until the end of time supporting your old doctypes, you might as well release these in PDF from DOC sources too.
But, if I were doing this: Assuming these are all in images, put the images in whatever format you want and make a generic wiki page for each of them. Then let users log in (NASA fans should pour in) and translate the pages to annotated wiki pages with the footnotes (normally references) being all the side notes that were penciled in. They can categorize them by related missions and maybe even tag them... you will need at least one or two people on your staff to administrate. Diagrams and drawings will probably need to be cropped and retained as images. Keep those in a lossless format but distribute whatever saves you bandwidth.
Why don't they release it in the open standard PDF, with annotations for the handwritten notes, which I believe are in the standard? (I might be wrong.)
Thanks NASA for making me feel like my opinion is valued and useful. Kind of like that, oh what was it called? The vote for the name of that satellite thingy?
When really you're just passing the buck because your budget didn't include "digitizing old notes."
NASA (Score:5, Insightful)
Re:NASA (Score:5, Funny)
Re:NASA (Score:4, Funny)
Wow...I didn't know they had that position?!?!
I'm not sure I'd WANT to be fist director....sounds like more of a strange pr0n thing than a NASA office.
Re: (Score:2)
Next week: What to do with this big golden box thing? We tried opening it and some guy's face melted.
Guy 1: It's the Ark of the Covenant!
Guy 2: No, it's a spare reactor core. Same effect.
Re: (Score:3, Funny)
NASA: We already have top men on that.
Slashdot: But wh--
NASA: Top. Men.
(My favorite line. Uttered by the actor who played Porkins, IIRC.)
Re: (Score:2)
I assure you that they have top men working on it right now.
Re: (Score:3, Interesting)
Not sure if I can really blame them.
This past weekend I had a garage sale and, as I was clearing stuff, realized how much junk paperwork I had stashed in the garage. There were books, manuals, class notes, lecture notes (from those I attended and those I gave), meeting notebooks, documentation on long obsolete processes (Token Ring MAU reset procedures, Novell Netware rebuild procedures). I had notebooks of stories, embarrassing journal entries from college ("DH has the most beautiful eyes!!"), and all sort
Format Suggestion (Score:3, Funny)
Re: (Score:2)
Re: (Score:3, Funny)
Might as well get MediaSentry and the RIAA in on the act ...
Contact MIT and their archival department (Score:5, Informative)
They got that million dollar touchless scanner that can digitize the papers with ease, then put them into either Open Source or PDF formats.
Re: (Score:2)
Isn't the PDF format open source?
Re:Contact MIT and their archival department (Score:5, Insightful)
yes it is. but many whiners here will argue against it.
The thing is, don't half-ass the PDF by simply encapsulating images. they need to do a real OCR on it and separate out as images the things that are not typewritten.
then donate the boxes to the Smithsonian.
the MOST IMPORTANT aspect of the documents is that they are easily searched. which means all text must be text and not images. Yes that includes his handwriting.
Re: (Score:2)
I agree, but the second most important aspect is that the images of the original get preserved too. The ideal way to do it is to have the image be displayed, but with the OCR'd text linked t
Re:Contact MIT and their archival department (Score:5, Informative)
Let me fix that for you:
the SECOND MOST IMPORTANT aspect of the documents is that it is easily searched.
The FIRST is of course making a high fidelity digital copy of the original pages, that will serve as the authority on all questions of possible ambiguity in the handwriting, or whether a figure in the margin is a thumbnail sketch or a mere doodle.
A 600 or 1200 dpi .png image of each page in full color would do as the master digital archive. The .png format is an excellent choice since it is open, well understood, and going to be around for a long, long time. Its accuracy is more than adequate for this work. That it supports lossless compression is a bonus: images of pages usually compress very well. Copies of the master digital library should be kept at various institutions and made available on request to anyone.
Then for public and research use, convert each page to HTML 4.01 strict (since it is universally available, will be around for a long, long time, and Google, etc, can do the indexing for us). UTF of course, especially since Wernher used some German and Greek glyphs in his handwriting.
Suggest using OCR to handle conversion of the typed notes, and volunteers or cheap student labor to transcribe the handwritten material (use consensus of several transcribers to assure accuracy). These can be incorporated into the main pages as divs and spans inserted into the correct place in the flow (use classes like "leftmargin" and "rightmargin"). CSS can use absolute positioning to make them marginal accordions (expand from the margin on mouseover), etc.
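The "consensus of several transcribers" idea could be as simple as a majority vote per line. A rough sketch in Python, purely as an illustration: the function name, the min_agree threshold and the sample readings are all made up here, and a real project would probably want something smarter than exact matching after whitespace cleanup.

```python
from collections import Counter

def consensus_line(readings, min_agree=2):
    """Return the reading shared by at least `min_agree` transcribers, else None.

    Readings are compared after collapsing runs of whitespace, so stray
    double spaces do not count as disagreement. (Hypothetical helper.)
    """
    normalized = [" ".join(r.split()) for r in readings]
    if not normalized:
        return None
    text, votes = Counter(normalized).most_common(1)[0]
    return text if votes >= min_agree else None

# Invented sample: three volunteers transcribe the same margin note.
readings = [
    "SNAP-10A reactor, see RTG notes",
    "SNAP-10A  reactor, see RTG notes",   # extra space only
    "SNAP-1OA reactor, see RTG notes",    # letter O typed instead of zero
]
print(consensus_line(readings))
```

Lines that come back as None would go to another transcriber or to an editor rather than being guessed at.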
Treat sketches like the handwriting: put an img of the sketch into a div or span at the right place in the flow, then also add a searchable text description of the sketch in that div.
A simple script can process the final HTML fragment of each page and insert id="unique" attributes on each paragraph, etc, and <a name="unique"> targets where these would be useful.
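For what it's worth, here is roughly what such a script might look like in Python. It is only a sketch under assumptions: the "p<page>-<n>" id scheme and the function name are invented for the example, and a regex pass like this is only safe on the kind of clean, machine-generated fragment described above, not on arbitrary HTML.

```python
import re
from itertools import count

def add_paragraph_ids(fragment, page):
    """Give every <p> tag that lacks an id a unique one like "p12-3".

    `page` is the page number; the id scheme is a hypothetical choice.
    Tags that already carry an id are left untouched.
    """
    counter = count(1)

    def tag(match):
        attrs = match.group(1)
        if re.search(r'\bid\s*=', attrs, re.IGNORECASE):
            return match.group(0)
        return '<p id="p{}-{}"{}>'.format(page, next(counter), attrs)

    # \b keeps the pattern from matching <pre>, <param>, etc.
    return re.sub(r'<p\b([^>]*)>', tag, fragment, flags=re.IGNORECASE)

fragment = '<p>Typed text.</p>\n<p class="leftmargin">Margin note.</p>'
print(add_paragraph_ids(fragment, 12))
```

With ids in place, a concordance or wiki discussion can link straight to page.html#p12-2 instead of to the page as a whole.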
The finished NASA product should be a simple online database using server side scripting to compose and serve out pages on request. It should be built with cooperation from Google and other search platforms so that spiders will have good access to the body of the work without causing excessive bandwidth problems. It should be possible for any researcher to develop his own custom search engine. Ideally, it will support not just the notes, but also concordances, wiki discussions, etc.
I once did a lot of this kind of work in moving sermons and such that were circulated by mimeograph in the 1960s and 1970s to web pages. I digitized the pages with a Minolta Z1 camera on a reverse tripod using indirect lighting, and ran them through OCR with OmniScan (IIRC). The OCR came out in Word 97 format, and I used Perl scripts to convert it to HTML. If the technical quality of the originals is good, this can go pretty fast and is highly accurate, even as a basement project. If the original notes use consistent formatting, which I would expect of Wernher, then scripting with good use of regular expressions can do the bulk of the HTML markup.
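To give a flavor of the regular-expression approach, here is a small Python sketch (not the Perl scripts mentioned above, which are not reproduced here). It assumes the OCR output is plain text with blank lines between paragraphs; the function name and the sample text are invented, and the hyphen re-joining is a heuristic that can wrongly merge genuine hyphenated words.

```python
import html
import re

def ocr_text_to_html(text):
    """Turn plain OCR text into <p> elements, one per blank-line-separated block."""
    if not text.strip():
        return ''
    paragraphs = []
    for block in re.split(r'\n\s*\n', text.strip()):
        # Re-join words hyphenated across a line break (heuristic).
        joined = re.sub(r'-\n(?=[a-z])', '', block)
        # Collapse the remaining line breaks into single spaces.
        joined = re.sub(r'\s*\n\s*', ' ', joined)
        # Escape <, > and & so the text is safe inside HTML.
        paragraphs.append('<p>{}</p>'.format(html.escape(joined)))
    return '\n'.join(paragraphs)

sample = "The en-\ngine test ran\nfor 40 s.\n\nThrust < 1.5 Mlb"
print(ocr_text_to_html(sample))
```

Anything the patterns cannot handle (tables, equations, margin notes) still needs a human pass, which is where the volunteers come in.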
Re: (Score:3, Interesting)
For the right persons, transcribing the handwritten notes and sketches would be very rewarding. Wernher von Braun was a pivotal technologist whose work for the Nazis either posed one of the greatest threats to England during WWII or, through high level monkeywrenching, managed to keep that threat from becoming a reality. He was definitely a very complex character who succeeded in doing a helluva good balancing act on dangerously high political high wires.
So access to his notes in exchange for doing the drudg
Re:Contact MIT and their archival department (Score:4, Informative)
PDF, since its creation, has been an open standard according to definition 2. Some people don't like it because it doesn't meet definition 3 (Adobe are the only ones who can create new versions of the PDF spec).
Obligatory Tom Lehrer.. (Score:5, Funny)
Gather round while I sing you of Wernher von Braun
A man whose allegiance is ruled by expedience
Call him a Nazi, he won't even frown
"Ha, Nazi schmazi," says Wernher von Braun
Don't say that he's hypocritical
Say rather that he's apolitical
"Once the rockets are up, who cares where they come down
That's not my department," says Wernher von Braun
Some have harsh words for this man of renown
But some think our attitude should be one of gratitude
Like the widows and cripples in old London town
Who owe their large pensions to Wernher von Braun
You too may be a big hero
Once you've learned to count backwards to zero
"In German oder English I know how to count down
Und I'm learning Chinese," says Wernher von Braun
Re:Obligatory Tom Lehrer.. (Score:5, Informative)
Re:Obligatory Tom Lehrer.. (Score:5, Funny)
Looks recorded.
Re: (Score:3, Funny)
Didn't know he was the kinky type.. (Score:2)
Nasty..
Re: (Score:2, Funny)
Is that like Fist Post?
Re: (Score:2)
A suggestion (Score:5, Funny)
On the next thing that goes up to space (or even just a suborbital flight), crank down the window at about 20km up and throw the stuff out (or have some automated thingy with an explosive bolt that distributes it into the atmosphere). Now THAT would be a "creative way to get it out to the public".
Then again, maybe that would be TOO creative.
Distributed Proofreaders (Score:2)
Scan it at high resolution, OCR what you can, and load it into Distributed Proofreaders [pgdp.net]. Or if the material is too technical for the layperson, ask for a copy of the web-based software and set up your own private site. Let bored grad students work on it in exchange for some kind of minor credit on the final digitized work. (I believe that the bored grad students phenomenon produces half of the highly-technical articles on Wikipedia.)
Re: (Score:2)
Captchas.
There are projects that use captchas to digitize old texts, NASA could put those parts which don't lend themselves to OCR as captchas on their webpage.
Re: (Score:2, Insightful)
Unfortunately, the notes are full of non-words, like (RTG), SNAP-10A, B70, n.mi
At least, that's what I'm assuming they say, because some of them are rather unreadable. Now, slashdotters may recognise some, but many people won't see the "words"
Re: (Score:2)
There are far more individual numbers/letters/etc. in those notes than equations.
Fist director? (Score:2)
Boy do I not want to work for that particular department.
TIFF FTW (Score:5, Interesting)
Let's go with a format almost anyone can read. As soon as they're all scanned in as high res TIFFs THEN you can begin to OCR them and create hybrid PDFs which CAN be indexed. From there we have a good start with high quality originals and searchable derivatives. Then people can start rolling whatever custom solutions they want to.
Yes, I know that OCR is going to be very crude, especially for anything hand written. But what it will do is get us a very good starting point. I'd like to see a wiki set up with the OCR'd text as the beginning text, a link to the document and then the public can begin to go in and correct the OCR mistakes, and fill in what just flat out couldn't be OCR'd.
Recaptcha! (Score:2)
Sounds like a job for this project. [recaptcha.net]
Best part is, hand written is going to be more difficult to solve for computers...
Re: (Score:2)
I'd never seen that before, great idea.
Use a Wiki to Process Images to Open Format (Score:5, Insightful)
But, if I were doing this: Assuming these are all in images, put the images in whatever format you want and make a generic wiki page for each of them. Then let users log in (NASA fans should pour in) and translate the pages to annotated wiki pages with the footnotes (normally references) being all the side notes that were penciled in. They can categorize them by related missions and maybe even tag them
Once that's done, ideally you'd put it in some XML standards based format (ODF or OOXML, yeah, that's another argument to be had) that you will always be able to read even if you have to build your own viewer/converter. Keep these sources indexed and provide for people the rendered PDF/PS/PNG/whocares and then you could probably build scripts to rebuild all from sources if you want. New technology comes out or people want to view them in HTML 5--no problem, just build a neat little XSLT for them.
As for indexing them, I can tell you one way not to do it. Don't do the thing that curators of classical music did [stason.org]. Man, that's like speaking another language to me. Arrange the notes by mission or date if you can and any natural titles that arise for the favorites, add to it as an alias.
Re: (Score:2)
PDF with annotations (Score:2)
Brilliant! We'll make society do the work! (Score:3, Interesting)
they are allowing the marketplace to decide (Score:2, Insightful)
instead of forcing people to pay taxes on some project of dubious desirability, they are trying to see if the public has any support for their idea, before they thrust headlong into it.
government workers should ask the opinion of the taxpayers more often, we are after all, their bosses. i have a lot of respect for the government employees that remember this, and nothing but contempt for those who want to 'play social engineer and tax waster' without regard for what the public thinks.
Re: (Score:3, Insightful)
Even if NASA did do it itself, "society" would be paying for it anyway...
Actually, this should be better in two important ways: not only could crowd-sourcing accomplish the task much more efficiently than $50-grand-space-pen-NASA could to begin with, but also the cost would be distributed across the entire Internet, rather than being shouldered only by American taxpayers! It's a win-win-win* situation, I'd say.
(* for NASA, and for space geeks, and for taxpayers)
Anonymous Coward (Score:2, Interesting)
You guys clearly do not read enough electronic media. PDF and Djvu are the more widespread and relatively ubiquitous modern electronic book formats. Djvu tends to be vastly superior to PDF in terms of file size though.
Read all about it here:
http://en.wikipedia.org/wiki/Djvu
Discuss.
Zoom! (Score:3, Informative)
We're looking for creative ways to get it out to the public
By rocket mail!
http://en.wikipedia.org/wiki/Rocket_mail [wikipedia.org]
Re: (Score:2)
Recaptcha might be able to help (Score:2)
http://recaptcha.net/ [recaptcha.net]
Wonderful (but really awful) irony (Score:2, Interesting)
Vonce ze rakets go up . . . (Score:2)
Who cares where they come down.
That's not my department, says Wernher von Braun.
hard copy (Score:2)
personally, i'd love a printed hard copy on my book shelf. right there with my Goddard books.
Tobacco Documents Online (Score:4, Interesting)
Re: (Score:2)
Sounds like a job for... Google!
Though I'd be happier if they released it in at least two major formats.
Re: (Score:3, Interesting)
What about Project Gutenberg [gutenberg.org]?
Re: (Score:2)