I dropped in on a Clubhouse discussion tonight about whether or not AI-generated music can be as good and wonderful as people-generated music.
I drifted off into thought about an idea I read somewhere.1 The idea is that apart from the beauty or artistry of a work of art moving us, we are also moved by the very act of its creation — its creation is an inspiration, a kind of gift. In fact it’s a kind of double gift because it’s the gift of the artwork and it’s also the gift of saying “this is possible, to create something beautiful is possible”, which is the gift of inspiration.
I got it immediately but it didn’t really hit me until I went to see The Umbrellas of Cherbourg, which I loved, yes for the colors and the story and I don’t have the words frankly to describe why I loved it, but after I saw it, or maybe sometime during, I thought “how could someone love life this much?” and I mean Jacques Demy, the director, that you would have to have a deep love of life for all its sorrows and joys to create something like this, and that, that sense of love was more moving than the picture itself, or perhaps equally moving, or perhaps it’s completely impossible to distinguish between them.2
Which is why I, my tech brain, the one that’s seen what AI can do, how it can be quite creative with words — I actually do think that AI has a decent chance at making music just as “good” as what great musicians make. I think it’s possible, at least for music.
I don’t think that you can, knowing that it was AI-generated, ever wonder how much the AI loved life because the AI doesn’t love life, it can’t, not in the way we love life.
Context matters in how we experience these things, and sometimes it matters so much that it’s a bit absurd. Modern art seems to me a kind of reductio ad absurdum of the idea that the story behind a work of art or that its context is what matters, to such an extent that most people, lacking the requisite context, have no fucking idea what they’re looking at in a modern art museum and if you’re like me, maybe you’ve found yourself alone in a cavernous white room, surrounded by odd objects when the absurd strikes you like an existential taser.
“Perhaps you’d be more comfortable with something more representational?” asks the security guard, she’s seen that look of terror before. “Yes” I say, “can you take me back to the impressionists?”
Context matters a lot in movies too — I saw The Umbrellas of Cherbourg at the Metrograph in New York on a dark and cold night when I was feeling low and exhausted and then here was this intense bright wave of love, at just the right time and place for my soul.3
My favorite place in Chicago is The Music Box Theatre, not just for their curation, but because there’s really nothing like seeing a movie in a packed grand theater with an organist warming up the audience. It matters that there are still physical places where you can go and see great art and entertainment, because it’s not just that we get to see things on a bigger screen, but it’s that we know that others care, and we can share some of that joy and sorrow with them, a collective inspiration.
It’s not just the thing, it’s the act of making the thing, and then, the act of sharing the thing.
The reverse of this phenomenon is when a film is made cynically. Sometimes I watch a film and have a kind of visceral negative reaction. It’s not just the content or the craft, but something worse, something ugly, like the opposite of a gift — some films seem to despise the audience, to hate people. Watching those films feels ugly, it feels like something has been taken from me. It’s rare, but not rare enough. ↩
I think it also helped that I knew almost nothing about it beforehand, except that it was a musical and French and from the 60s. ↩
Skip to part III if you’re thirsty for music-making. You can come back and read this during your 12-hour render.
Jukebox is a neural net that generates music in a variety of genres and styles of existing bands or musicians. It also generates singing (or something like singing anyway).
Jukebox is pretty amazing because it can sorta almost create original songs by bands like The Beatles or the Gypsy Kings or anyone in the available artist list.
I don’t understand neural nets well enough to understand what it’s doing under the hood, but the results are pretty amazing, despite the rough edges.
This technology has the potential to unleash a wave of artistic (and legal) creativity.
It’s easy to imagine this thing getting really fucking good in a few years and all of a sudden you can create ‘original’ songs by ‘The Beatles’ or ‘Taylor Swift’ for the price of a few hundred GPU hours.
Who has the right to publish or use those songs? Who owns what? What does ownership mean in this context? Does anyone think that our current judicial/legal systems are prepared to even understand these questions, let alone adjudicate them? Who is Spain? Why is Hitler? Where are the Snowdens of yesteryear?
Anyway, you can read more about it on the OpenAI website.
I. What I made with Jukebox
Early on in the lockdown, I made an experimental short film built mostly with stock footage.
I went looking for free music at the usual stock sites and as usual, came back disappointed. So I started looking for ways to generate music with AI, because maybe it would lend the short the kind of artificial or mediated feeling I was looking for.
Here are some of the songs I created, of varying quality:
Or check out OpenAI’s collection of samples. It’s wild how good some of these are.
I don’t have the compute power that they have, but the clips I generated were enough to build a surreal soundscape for the film:
II. Overview and limitations
OpenAI uses a supercomputer to train their models and maybe to generate the songs too, and well, unless you also have a supercomputer or at least a very sweet GPU setup, your creativity will be a bit limited.
When I started playing with Jukebox, I wanted to create 3-minute songs from scratch, which turned out to be more than Google Colab (even with the Pro upgrade) could handle.
If you’re going to do this with Google Colab, then you’ll want to upgrade to the Pro version. It’s $9.99 a month and recommended for anyone who does not enjoy losing their progress when the runtime times out after six hours.
Because it took me about 12 hours to generate each 45-second song, my experimentation was limited, but after a lot of trial and error, I was able to consistently generate 45-second clips of new songs in the style of many musicians in a variety of genres.
Another challenge is the lack of artist-friendly documentation.
AI researchers tend to publish their models and accompanying documentation for a computer-science-researcher-type audience — I’m trying to bridge that gap so that artists and non-technical people can play with the technology.
Hopefully, in the future, we’ll see more user-friendly documentation for these kinds of tools — if musicians had to be electrical engineers and wire their own amplifiers, we probably wouldn’t have rock music.
III. Getting set up
My version of the notebook has been trimmed down to remove some features that I couldn’t get to work and I think it’s an easier way to get started, but feel free to experiment with their version.
If you want to save your edits, save a copy of the notebook to your Google Drive.
If you’re new to all this, Google Colab is an interactive coding notebook that is free to use. It’s built on the open-source software called Jupyter Notebook, which is commonly used to run machine learning experiments. It allows you to run Python code in your browser that is executed on a virtual machine, on some Google server somewhere. Much easier than building your own supercomputer.
You might want to check out Google’s intro to Google Colab or google around for a tutorial.
And as I mentioned above, you’ll want to upgrade to Google Colab Pro. It’s $9.99/month and you can cancel whenever you want. Getting the Pro version means you’ll have access to faster GPUs, more memory, and longer runtimes, which will make your song generating faster and less prone to meltdown.
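Once your Colab runtime is connected, it’s worth confirming you actually got a GPU before kicking off a 12-hour render. A minimal sketch, assuming nothing beyond the standard `nvidia-smi` tool that ships on Colab GPU machines (the helper name is mine):

```python
import shutil
import subprocess

def gpu_name():
    """Return the name of the assigned GPU, or None if there isn't one."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver/tools on this machine
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    names = out.stdout.strip().splitlines()
    return names[0] if names else None

print(gpu_name() or "No GPU assigned — expect a very long render.")
```

On a Pro runtime you’re hoping to see something like a P100 or V100 here; if you get nothing, reconnect and try again before starting a generation job.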
IV. Lyrics generation with GPT-2
Jukebox allows you to write your own lyrics for the songs you generate but I decided to let an AI generate lyrics for me, with a little help from the creative writers of stock footage clip descriptions.
I was going for a sort of surreal dystopian aesthetic so I pulled the descriptions of some random stock footage clips, e.g. “A lady bursting a laugh moves around holding a microphone as bait as a guy creeps”:
I fed those descriptions to aitextgen, a Python library for generating text with GPT-2. You can train aitextgen on the complete works of Shakespeare or your middle school essays or whatever text you want.
What if the result of deepfakes is the opposite of what everyone expects? Like because it will be so easy to fake a video of someone, anyone can deny that any video is real.
But now, given this technology’s prevalence, does it become harder to blackmail someone?
Let’s say the diplomat did have an affair with someone other than his wife. And then a video of it surfaces. Can’t he just say that it’s a deepfake?
What is the threshold for believing something now?
If you can create a deepfake of basically any actor, couldn’t you cast a film this way?
Instead of bringing actors in to read sides in an audition room, you could film a prototype of the scene with a random actor and then try out various different actors in the role, using AI to superimpose their faces and recreate their voices.
Then you could cast the best one.
Of course you could make a whole movie this way.
It’s not legal (or won’t be) but presumably you could get away with deepfake casting more easily than you could get away with making a whole movie this way, as it would never be released to the public.
I don’t know if I like that these things are possible, but they are interesting to think about.
Please enjoy this freewheeling interview in which I use Talk to Transformer to interview GPT-2 on its thoughts on the Golden Age of Television. Later in the interview, we turn to The Wire and David Simon’s other work, the public’s inability to learn, and then we get sucked into some kind of business conference call.
Lightly edited for formatting.
Me: It’s the golden age of television.
GPT-2: You can look at a pilot script and immediately tell who’s on the show. You know if there’s a storyline, if there’s a character, and people can be excited about that. So in a certain sense, people are so excited about this stuff, that this TV industry is so exciting. And there’s this wonderful thing called the Internet, and I don’t know if you’ve ever been exposed to it, but it’s as simple as searching for a show. You don’t have to search for anything. If you want a show, you just go to the Internet. There’s no more guesswork.
I’m familiar with the internet. But I’m wondering, what do you say to people who call it “peak TV”, who say that there are too many shows nowadays, and that this has led to audience splintering, that there’s no shared common culture anymore?
I can’t answer that. I don’t know about the media, and what you’re saying is that it’s just not the case anymore… I could also say to you: “It sounds like more people are coming to social media, the internet, for a lot more important reasons” because it’s easy to go to Youtube and it seems there’s more of them, maybe it’s because the other places are trying to attract people there, that’s why the internet is getting more popular right now. And this will continue to move forward.
All over the world, TV stations have started to do social media promotion to bring in people. But there are some reasons: because now there are big channels with huge budgets for marketing on the internet and a lot of the people on the internet, on these Facebook pages are people who don’t have social media accounts and then they become a part of the network and now there are many channels that have different content.
You can’t compare what you share on WhatsApp and your Facebook accounts, or you can’t compare what you write
But have we had a show as good as say… The Wire or Deadwood in the last few years? Does the golden age refer more to quantity than quality?
Is a show good enough that the person sitting next to you might watch it if you tell him it’s good? Well… Yes, certainly it is.
Show #5: I have no idea What it is : I guess it comes from the fact that my favorite shows to watch are on Netflix so for most people they’d expect to buy the show before checking the box asking specifically whether or not it’s good. A few more years ago this would have been hilarious.
Show #6: I don’t know How to watch it You are right and I have no idea if The Wire came out before The Sopranos, Deadwood, etc.. I’m sure if it came out there’s a good chance that you haven’t been aware that it came out before both.
I mean… How does The Wire compare to any of the other recent shows we’ve mentioned?
I would say The Wire compares quite favorably to the shows that are currently on the air.
The difference is that HBO and Showtime are still getting the same amount of shows, whereas The Wire may not get all the stuff it wants.
The show’s final season was filmed on site in front of the same set of studio cameras it has always used — from which, naturally, the same exact shot is used over and over again. There were no special effects, just the traditional live footage from the same cameras on all four sets. We had a very detailed visual that we shot for years, and never thought that it would have to be filmed in a different way, where there would be different lighting and the actors would stand on separate plates. But it’s also true, The Wire really was shot inside of those exact same spaces.
This was always our goal, and I think it’s what we did on The Wire. We never really got around to it, or to the same goal of doing a full production. But because all of the shows we did with David Simon were made inside those exact same sets, every time a show is made on cable or in the air, it really affects a big part of how a show evolves, whether it’s a network or whether it’s online.
I watched the entire show in its entirety, and I think it had the most fun
You mentioned David Simon and his other shows — were you involved in The Corner and Homicide? I assume you’re not suggesting that Generation Kill was shot on the same sets as The Wire.
I think the two were quite different shows. I didn’t write The Corner because I wanted to do comedy and write. The Corner’s premise was just that one guy is a cop. So I was drawn to that.
I wasn’t interested in producing comedy. I wasn’t interested in making music videos. He asked me to write a script, so I wrote some stuff, and he ended up doing one as The Corner, and the next was Homicide. And it was The Corner where I wrote The Wire.
He brought the idea of the police being corrupt and they’re not held to the same standard, and as a comic you have to think of the public in terms of, well, what is the public in a free society looking for? Then you have to talk about them not as criminals who’re out here, they’re here to be judged.
The reason I brought The Corner and Homicide to HBO, which is a place where I didn’t want to go, was because I wanted to do comedy with a public persona and show people who these characters are. It gives them all their own persona. And at the end of the day, to me, it’s much more interesting
OK, wow, so you see yourself more as a comedian than, I don’t know, an algorithm or a neural net. So I guess we have that in common. I want to follow up on the question that you posed, because I think it’s really interesting. In your opinion, what is the public in a free society looking for?
Do we feel like we’ve learned anything? I mean, the public is a very complex animal.
DAVE: It’s very complicated. I mean, it’s not like the public is going to be able to learn from something as easily as us. But what’s interesting about my response in my interview is that in a book like that, which has a very specific subject, it goes way beyond the simple question of how do we educate people.
The real question is what’s true, what’s true about the human condition if it’s true, and that’s my response.
Who is Dave?
Dave is a self-made entrepreneur who started the blog The Real Deal and co-founded two successful online brands: DDP and DripClub. His products are made from scratch, with the idea of bringing products to users’ wrists, and are developed by his team of engineers, designers, sales, and customer support specialists for his clients.
His mission is to give you better products.
OK. Great. Nice to meet you Dave and congrats on your success. It sounds like things are going well for you. I think we should wrap things up now but before we end this, can you expand on what you said earlier about how the public is not going to be able to learn something as easily as you?
I guess it would be nice if you could elaborate on that further before you come to me tomorrow. Thanks.
David E. Epstein: Yeah and I can say that we are doing all we can, as you can see with the results, all we’ve been doing.
Our first quarter was really nice and we will continue to do what we’re doing for the remainder of the quarter, which is increase our customer contact information and get more customers contacting us.
For instance, we had a huge uptick in the call volume, a phenomenal quarter and we don’t want to do any surprises with it. We’re definitely looking ahead and, I think, I guess, a good sign for the future.
MARK: Just so we are clear. First quarter, that’s pretty good news. Do you see any other results with this strategy, the increased customer contact number coming out in the next quarter or does this all go together with the launch of that mobile app?
DAVID E. PEDERSON: Yeah, I think the company’s got a great product
Thank you for your time.
We appreciate your cooperation
I can’t stop laughing at the random stuff created by AI on AI Weirdness. These lists of AI-generated change.org petitions are wild:
Bad ideas/Lost Causes:
Dogs are not a thing!! Dog Owners are NOT Human beings!!
Help Bring Climate Change to the Philippines!
Taco, Chipotle, and Starbucks: Bring Back Lettuce Fries
Filipinos: We want your help stopping the killing of dolphins in Brazil in 1970’s
Mr.person: I want a fresh puppy in my home
Simple Stats Administration: Make Another proboscis.
Officials at Prince Alfred Hospital: Aurora to Tell The Company To Send A Baby to Mars
Sign Petition for Houston’s New Fireworks to be Offensive
Make a mudchat
Please not punish myself with a $20 fines.
Unicorn: Stop breaking crab products
Rooster Teeth : Have Rooster Teeth Fix Your Responses To Obama
The people of Great Adventure: get lil bl00ty moose loyal to us
The People of Kashmir : Ban of Airbrushed Bamboo Trees By Pune
Barack Obama, Barack Obama, and Barack Obama: STOP PING MY HUSBERS!
Saskatoon Police Service: No more scootty
One Highway, Four Hens, Highway 1
Rhino Amish Culture Association: Cut the horns of the congon sturgeon & treat it better!
Harmonix: Increase the speed limit on Easton Road to 5mph.
Everyone: Put the Bats on YouTube!
Donald Trump: Change the name of the National Anthem to be called the “Fiery Gator”
Taco Bell: Offer hot wings and non-perfumed water for all customers
Do not attack the unions! Keep cowpies!
Anyone: Get a cat to sing on air!
The people of the world: Change the name of the planet to the Planet of the Giants
Dr James Alexander: Make the Power of the Mongoose a Part of the School’s Curriculum
These are funny in the way that those “worst answers to tests” are funny — absurd and completely surprising responses, but in the right form. They have the form of petitions, but they’re insanely playful and creative instances of petitions.
I don’t know if it makes sense to call an algorithm’s output ‘playful,’ but I think ‘creative’ does make sense, if we think of creativity as the combination of disparate things in a coherent way. That’s basically creativity, yeah? At least one form of it.
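If you take that definition seriously, you can sketch a crude version of it with no neural net at all: recombine fragments inside the fixed form of a petition. The fragments below are borrowed from the lists above; the function and template are mine:

```python
import random

# Addressees and demands lifted from the AI-generated petitions above.
ADDRESSEES = [
    "Taco Bell", "Anyone", "The people of the world",
    "Saskatoon Police Service", "Mr.person",
]
DEMANDS = [
    "bring back lettuce fries", "get a cat to sing on air",
    "put the bats on YouTube", "change the name of the planet",
]

def petition(seed=None):
    """Combine a random addressee with a random demand in petition form."""
    rng = random.Random(seed)
    return f"{rng.choice(ADDRESSEES)}: {rng.choice(DEMANDS).capitalize()}!"

print(petition(seed=0))
```

The form stays coherent while the content gets recombined. Of course, a template like this can only shuffle what you hand it; the neural net’s trick is that nobody wrote its fragments down in advance.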
Whatever it is, AI seems to be really good at it. On its own, it might just be a high-powered amusement generator, but when combined with a human writer/editor, it could be a powerful creative tool — as a writer, it’s really hard to get out of your own way and open the mind.
It’s far too easy to get stuck on a track, to limit where your ideas are sourced from (even within your own brain), to just not be creative.
Not to mention, how often do you have access to all of the possible combinations in your mind?
My brain is pretty mysterious to me, nothing like a database. I just have to kind of get into a certain state and hope that good ideas come through, like tuning a radio to a mysterious radio station.
But if there was a scanner to make all of the ideas available… to combine the ideas to generate new ones… well, now, that would be interesting.
There’s still something uncanny about it, but the technology is obviously very close. I mean, maybe the uncanny thing is just that it’s Steve Buscemi in Jennifer Lawrence’s body with her voice.
The lip-syncing videos are close too, but something still feels a little off about them:
We’re entering a world where you can make a live-action film without any traditional “production”. You can have just writing and post-production.
Write the script.
Then build it with AI-generated voices and images.
Production could be simply a matter of recording the samples to build the voices and recording some video to build the video.
You might not even need that — you could pull the voices and imagery from a library.
We could bring back dead actors to play leading roles. Want to reunite Bogey and Bacall? Just deepfake it.
We’re going to need a whole new area of IP law for this. Who owns the rights to their image or voice? Do you have rights for that? Does it expire or go into the public domain? How much would the rights to use, say, Bradley Cooper’s digital avatar in perpetuity cost?
What the fuck:
There’s a company called Dessa that made software called RealTalk that can learn to speak in someone’s voice. Apparently it wasn’t that hard to train.
I don’t listen to Rogan’s podcast regularly so I don’t know if I would be able to realize it was fake, but it’s close enough and it’ll just get better anyway.1
The company isn’t releasing it to the public but it’s only a matter of time before someone else does the same thing.
My first thought was “oh this sounds like a terrifying political propaganda/slander tool!” and my second thought was that this will render useless any voice-based work.
Do you need a news reader on the radio? No, you just need someone to speak to the robot enough to train it. No need to show up to the studio every morning (eventually followed by a robot to write the news report?).
Voiceover artists too, basically unnecessary now. A company could have software trained with a library of thousands (or millions) of voices.
Maybe I’m being naive but I’m skeptical that society will collapse because of technology like this. We have Photoshop and yes there are implications to only seeing women in magazines who have been Photoshopped, but it’s not like people are creating fake images of say a presidential candidate doing cocaine.
Why not? If you were a strategist recommending this, people would probably say “people won’t believe it.”
And I think we’ve developed a kind of norm of skepticism around this kind of thing, like anything scandalous we immediately think “is this fake?” or “what’s the PR angle here?”.
We already have a norm for “don’t believe everything you read” and this does make modern life difficult: it would be nice to know that if a newspaper prints something then it’s 100% true, but that’s not the case and here we are.
I think there’s a dangerous window while that norm is forming — like the 2020 elections could be wild, but then a norm forms where eventually people say “oh you can’t trust video anymore, especially video of politicians.”
It’s still powerful, but I think the power is in making people less trusting of recorded images (or in the near future, recorded audio) that don’t come from pre-vetted sources.
From a creator’s perspective, it’s interesting to think about what could be done without needing to go into a studio to record. Just as music can be composed on a computer now, so can voices.